devel / comp.arch / Re: Load/store with bit offset and mask

Subject  /  Author
* Load/store with bit offset and mask  (Thomas Koenig)
+* Re: Load/store with bit offset and mask  (MitchAlsup)
|+* Re: Load/store with bit offset and mask  (Thomas Koenig)
||`- Re: Load/store with bit offset and mask  (MitchAlsup)
|`* Re: Load/store with bit offset and mask  (Stephen Fuld)
| `* Re: Load/store with bit offset and mask  (MitchAlsup)
|  `* Re: Load/store with bit offset and mask  (Stephen Fuld)
|   +- Re: Load/store with bit offset and mask  (MitchAlsup)
|   `* Re: Load/store with bit offset and mask  (MitchAlsup)
|    +- Re: Load/store with bit offset and mask  (Thomas Koenig)
|    `* Re: Load/store with bit offset and mask  (Stephen Fuld)
|     `- Re: Load/store with bit offset and mask  (MitchAlsup)
`* Re: Everything old is still new again, Load/store with bit offset and mask  (John Levine)
 +- Re: Everything old is still new again, Load/store with bit offset and mask  (MitchAlsup)
 `- Re: Everything old is still new again, Load/store with bit offset and mask  (JimBrakefield)

Load/store with bit offset and mask

<tqeopv$2tve7$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=30531&group=comp.arch#30531

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-2398-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Load/store with bit offset and mask
Date: Fri, 20 Jan 2023 19:07:43 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <tqeopv$2tve7$1@newsreader4.netcologne.de>
Injection-Date: Fri, 20 Jan 2023 19:07:43 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-2398-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2a0a:a540:2398:0:7285:c2ff:fe6c:992d";
logging-data="3079623"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Fri, 20 Jan 2023 19:07 UTC

A question.

These days, every reasonable highly-performing architecture allows
unaligned loads on a byte boundary. This includes being able to
handle crossing cache line boundaries, crossing page boundaries,
page faults and whatever else.

What about load and store instructions which, in addition, allow a
bit offset given by a constant or a register?

To be useful, stores at least would have to have an additional
parameter specifying the size, in bits, to be stored. For
symmetry, and to save masking off, loads should have that
capability, too.

For use cases, obviously compression comes to mind, plus image
processing (for example the formats having RGB in 3*10 bits).

Instruction format would be something like

LDBIT Rt,[Rb + R_bit_offset], R_mask

Comments? Too complicated? Makes the handling too slow/big
because of the three shifts? Has been tried recently and didn't
fly because... ?
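In plain C, such a pair would behave roughly as follows (a sketch only:
it assumes little-endian byte order, a 9-byte window that is always
accessible, and gcc/clang's __uint128_t; the names are merely
illustrative, not proposed mnemonics):

#include <stdint.h>
#include <string.h>

/* Move an n-bit field (1 <= n <= 64) that starts bit_off bits past base,
   leaving the neighbouring bits untouched on the store. */

static uint64_t ldbit(const uint8_t *base, uint64_t bit_off, unsigned n)
{
    __uint128_t w = 0;
    memcpy(&w, base + (bit_off >> 3), 9);   /* 9 bytes cover any 64-bit field at any bit offset */
    uint64_t v = (uint64_t)(w >> (bit_off & 7));
    return (n == 64) ? v : v & ((UINT64_C(1) << n) - 1);
}

static void stbit(uint8_t *base, uint64_t bit_off, unsigned n, uint64_t val)
{
    uint8_t    *p     = base + (bit_off >> 3);
    unsigned    shift = bit_off & 7;
    uint64_t    mask  = (n == 64) ? ~UINT64_C(0) : (UINT64_C(1) << n) - 1;
    __uint128_t w     = 0;

    memcpy(&w, p, 9);                       /* read-modify-write the 9-byte window */
    w &= ~((__uint128_t)mask << shift);
    w |=  (__uint128_t)(val & mask) << shift;
    memcpy(p, &w, 9);
}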

Re: Load/store with bit offset and mask

<cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30534&group=comp.arch#30534

X-Received: by 2002:a05:622a:d:b0:3b0:4ae1:f0d0 with SMTP id x13-20020a05622a000d00b003b04ae1f0d0mr632075qtw.214.1674243958799;
Fri, 20 Jan 2023 11:45:58 -0800 (PST)
X-Received: by 2002:a9d:6499:0:b0:684:e371:b7ea with SMTP id
g25-20020a9d6499000000b00684e371b7eamr634745otl.137.1674243958534; Fri, 20
Jan 2023 11:45:58 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 20 Jan 2023 11:45:58 -0800 (PST)
In-Reply-To: <tqeopv$2tve7$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:702d:9fa4:bdb8:e277;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:702d:9fa4:bdb8:e277
References: <tqeopv$2tve7$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
Subject: Re: Load/store with bit offset and mask
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 20 Jan 2023 19:45:58 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Fri, 20 Jan 2023 19:45 UTC

On Friday, January 20, 2023 at 1:07:46 PM UTC-6, Thomas Koenig wrote:
> A question.
>
> These days, every reasonable highly-performing architecture allows
> unaligned loads on a byte boundary. This includes being able to
> handle crossing cache line boundaries, crossing page boundaries,
> page faults and whatever else.
<
Agreed, aligned-only memory of the 1980 RISC machines was a mistake.
Agreed, misaligned access can cross line and page boundaries.
Agreed, this makes memory references potentially take 2 page faults.
>
> What about load and store instructions which allow, in addition
> a constant or a register?
<
My 66000 has::
<
STD #3.141592326584,[Rbase+Rindex<<scale+Displacement]
>
> To be useful, stores at least would have to have an additional
> parameter specifying the size, in bits, to be stored. For
> symmetry, and to save masking off, loads should have that
> capability, too.
<
You lost me, here.
>
> For use cases, obviously compression comes to mind, plus image
> processing (for example the formats having RGB in 3*10 bits).
>
> Instruction format would be something like
>
> LDBIT Rt,[Rb + R_bit_offset], R_mask
>
> Comments? Too complicated? Makes the handling too slow/big
> because of the three shifts? Has been tried recently and didn't
> fly because... ?
<
Insert and extract save the day:: removing bit alignment from
memory references.
<
LDD Rc,[Rbase+Rindex<<scale+Displacement]
SLA Rb,Rc,<width:offset>
<
INS Rc,Rb,<width:offset>
STD Rc,[Rbase+Rindex<<scale+Displacement]
<
Bit aligned memory references are 2-gate (plus wire delay) longer
than byte aligned memory references.
<
Given double-width extract/insert and a check for container
stepping, one generally gets the bit alignment done at low cost,
for this seldom-needed feature.
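In portable C, the two sequences above (for a field lying entirely within
one aligned 64-bit container) come down to a shift-and-mask around an
ordinary load, and a read-modify-write around an ordinary store; a rough,
zero-extending equivalent with illustrative names, not My 66000 semantics:

#include <stdint.h>

static uint64_t field_mask(unsigned width)
{
    return (width == 64) ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
}

static uint64_t load_field(const uint64_t *p, unsigned offset, unsigned width)
{
    uint64_t c = *p;                              /* LDD */
    return (c >> offset) & field_mask(width);     /* extract */
}

static void store_field(uint64_t *p, unsigned offset, unsigned width, uint64_t v)
{
    uint64_t c = *p;                              /* LDD */
    uint64_t m = field_mask(width) << offset;
    c = (c & ~m) | ((v << offset) & m);           /* INS */
    *p = c;                                       /* STD */
}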

Re: Load/store with bit offset and mask

<tqev52$2u2bg$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=30537&group=comp.arch#30537

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-2398-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Load/store with bit offset and mask
Date: Fri, 20 Jan 2023 20:56:02 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <tqev52$2u2bg$2@newsreader4.netcologne.de>
References: <tqeopv$2tve7$1@newsreader4.netcologne.de>
<cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
Injection-Date: Fri, 20 Jan 2023 20:56:02 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-2398-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2a0a:a540:2398:0:7285:c2ff:fe6c:992d";
logging-data="3082608"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Fri, 20 Jan 2023 20:56 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Friday, January 20, 2023 at 1:07:46 PM UTC-6, Thomas Koenig wrote:
>> A question.
>>
>> These days, every reasonable highly-performing architecture allows
>> unaligned loads on a byte boundary. This includes being able to
>> handle crossing cache line boundaries, crossing page boundaries,
>> page faults and whatever else.
><
> Agreed, aligned only memory of the 1980 RISC machines was a mistake.
> Agreed, misaligned access can cross line and page boundaries.
> Agreed this makes memory references potentially have 2 page faults.
>>
>> What about load and store instructions which allow, in addition
>> a constant or a register?
><
> My 66000 has::
><
> STD #3.141592326584,[Rbase+Rindex<<scale+Dispalcement]
>>
>> To be useful, stores at least would have to have an additional
>> parameter specifying the size, in bits, to be stored. For
>> symmetry, and to save masking off, loads should have that
>> capability, too.
><

> You lost me, here.

What I mean is the ability to load/store 1 to register-size bits
from a base pointer + m bits (where m can be large), without
affecting the other bits in memory.

>>
>> For use cases, obviously compression comes to mind, plus image
>> processing (for example the formats having RGB in 3*10 bits).
>>
>> Instruction format would be something like
>>
>> LDBIT Rt,[Rb + R_bit_offset], R_mask
>>
>> Comments? Too complicated? Makes the handling too slow/big
>> because of the three shifts? Has been tried recently and didn't
>> fly because... ?
><
> Insert and extract save the day:: removing bit alignment from
> memory references.
><
> LDD Rc,[Rbase+Rindex<<scale+Displacement]
> SLA Rb,Rc,<width:offset>

This assumes that the bits to be loaded are contained
in a single 64-bit word, correct?

><
> INS Rc,Rb,<width:offset>
> STD Rc,[Rbase+Rindex<<scale+Displacement]

Same for store. The code for the case where this crosses a boundary
would be somewhat ugly: loading the first register, checking whether
the bits come from one doubleword only, conditionally loading a
second one, and splicing the data together.
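Roughly, that store path comes out as C along these lines (a sketch only;
bit positions counted from element 0 of an array of little-endian
doublewords):

#include <stdint.h>

static void store_bitfield(uint64_t *buf, uint64_t pos, unsigned width, uint64_t val)
{
    uint64_t word = pos >> 6;
    unsigned off  = pos & 63;
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;

    /* bits that land in the first doubleword */
    buf[word] = (buf[word] & ~(mask << off)) | ((val & mask) << off);

    if (off + width > 64)       /* field spills into the next doubleword */
    {
        unsigned spill   = off + width - 64;
        uint64_t hi_mask = (UINT64_C(1) << spill) - 1;
        buf[word + 1] = (buf[word + 1] & ~hi_mask)
                      | ((val & mask) >> (64 - off));
    }
}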

><
> Bit aligned memory references are 2-gate (plus wire delay) longer
> than byte aligned memory references.
><
> Given double width extract/insert and a check for container
> stepping, generally gets the bit alignment done at low cost,
> for this seldom needed feature.

It's probably useful in compression and decompression, or if
somebody is really crazy about saving memory.

Anything else?

Re: Load/store with bit offset and mask

<8c4abbd7-f1bc-4db8-b22b-e4275ac5e1d5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30538&group=comp.arch#30538

X-Received: by 2002:a37:b005:0:b0:706:5055:fa2c with SMTP id z5-20020a37b005000000b007065055fa2cmr563312qke.292.1674248426283;
Fri, 20 Jan 2023 13:00:26 -0800 (PST)
X-Received: by 2002:aca:1202:0:b0:354:9da8:98a9 with SMTP id
2-20020aca1202000000b003549da898a9mr717401ois.9.1674248426004; Fri, 20 Jan
2023 13:00:26 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 20 Jan 2023 13:00:25 -0800 (PST)
In-Reply-To: <tqev52$2u2bg$2@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:702d:9fa4:bdb8:e277;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:702d:9fa4:bdb8:e277
References: <tqeopv$2tve7$1@newsreader4.netcologne.de> <cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
<tqev52$2u2bg$2@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <8c4abbd7-f1bc-4db8-b22b-e4275ac5e1d5n@googlegroups.com>
Subject: Re: Load/store with bit offset and mask
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 20 Jan 2023 21:00:26 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Fri, 20 Jan 2023 21:00 UTC

On Friday, January 20, 2023 at 2:56:05 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Friday, January 20, 2023 at 1:07:46 PM UTC-6, Thomas Koenig wrote:
> >> A question.
> >>
> >> These days, every reasonable highly-performing architecture allows
> >> unaligned loads on a byte boundary. This includes being able to
> >> handle crossing cache line boundaries, crossing page boundaries,
> >> page faults and whatever else.
> ><
> > Agreed, aligned only memory of the 1980 RISC machines was a mistake.
> > Agreed, misaligned access can cross line and page boundaries.
> > Agreed this makes memory references potentially have 2 page faults.
> >>
> >> What about load and store instructions which allow, in addition
> >> a constant or a register?
> ><
> > My 66000 has::
> ><
> > STD #3.141592326584,[Rbase+Rindex<<scale+Dispalcement]
> >>
> >> To be useful, stores at least would have to have an additional
> >> parameter specifying the size, in bits, to be stored. For
> >> symmetry, and to save masking off, loads should have that
> >> capability, too.
> ><
>
> > You lost me, here.
> What I mean is the ability to load/store 1 to register size bits
> from a base pointer + m bits (where m can be large), without
> affecting the other bits in memory.
> >>
> >> For use cases, obviously compression comes to mind, plus image
> >> processing (for example the formats having RGB in 3*10 bits).
> >>
> >> Instruction format would be something like
> >>
> >> LDBIT Rt,[Rb + R_bit_offset], R_mask
> >>
> >> Comments? Too complicated? Makes the handling too slow/big
> >> because of the three shifts? Has been tried recently and didn't
> >> fly because... ?
> ><
> > Insert and extract save the day:: removing bit alignment from
> > memory references.
> ><
> > LDD Rc,[Rbase+Rindex<<scale+Displacement]
> > SLA Rb,Rc,<width:offset>
> This assumes that the bits to be loaded are contained
> in a single 64-bit word, correct?
<
LDD Rc,[Rbase+Rindex<<scale+Displacement]
LDD Rd,[Rbase+Rindex<<scale+Displacement+8]
CARRY Rd,{i}
SLA Rb,Rc,<width:offset>
<
This handles any width from 1..64 at any position 127..0.
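In C terms, the net effect is roughly the following (modelling only the
result, not the CARRY mechanics; __uint128_t stands in for the register
pair):

#include <stdint.h>

/* Extract a width-bit field (1..64) starting at any bit position 0..127
   (pos + width <= 128) of the two loaded doublewords; lo holds bits 0..63,
   hi holds bits 64..127. */

static uint64_t extract128(uint64_t lo, uint64_t hi, unsigned pos, unsigned width)
{
    __uint128_t pair = ((__uint128_t)hi << 64) | lo;
    uint64_t    mask = (width == 64) ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
    return (uint64_t)(pair >> pos) & mask;
}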
> ><
> > INS Rc,Rb,<width:offset>
> > STD Rc,[Rbase+Rindex<<scale+Displacement]
> Same for store. The code if this crosses a boundary would
> be somewhat ugly: Loading the first register, checking
> if the bits come from one doubleword only, conditionally
> loading a second one and splicing the data together.
> ><
> > Bit aligned memory references are 2-gate (plus wire delay) longer
> > than byte aligned memory references.
> ><
> > Given double width extract/insert and a check for container
> > stepping, generally gets the bit alignment done at low cost,
> > for this seldom needed feature.
> It's probably useful in compression and decompression, or if
> somebody is really crazy about saving memory.
>
> Anything else?

Re: Load/store with bit offset and mask

<tqevhd$27op0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30539&group=comp.arch#30539

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Load/store with bit offset and mask
Date: Fri, 20 Jan 2023 13:02:35 -0800
Organization: A noiseless patient Spider
Lines: 65
Message-ID: <tqevhd$27op0$1@dont-email.me>
References: <tqeopv$2tve7$1@newsreader4.netcologne.de>
<cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 20 Jan 2023 21:02:37 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="b019575ed5414548da3daf2e6af069c4";
logging-data="2351904"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+RYOEkTSBw26t9slOP/5kEC5IHpazfxjI="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.6.1
Cancel-Lock: sha1:J6L5FQUzL27YfEUMblWmPlBITCM=
In-Reply-To: <cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Fri, 20 Jan 2023 21:02 UTC

On 1/20/2023 11:45 AM, MitchAlsup wrote:
> On Friday, January 20, 2023 at 1:07:46 PM UTC-6, Thomas Koenig wrote:
>> A question.
>>
>> These days, every reasonable highly-performing architecture allows
>> unaligned loads on a byte boundary. This includes being able to
>> handle crossing cache line boundaries, crossing page boundaries,
>> page faults and whatever else.
> <
> Agreed, aligned only memory of the 1980 RISC machines was a mistake.
> Agreed, misaligned access can cross line and page boundaries.
> Agreed this makes memory references potentially have 2 page faults.
>>
>> What about load and store instructions which allow, in addition
>> a constant or a register?
> <
> My 66000 has::
> <
> STD #3.141592326584,[Rbase+Rindex<<scale+Dispalcement]
>>
>> To be useful, stores at least would have to have an additional
>> parameter specifying the size, in bits, to be stored. For
>> symmetry, and to save masking off, loads should have that
>> capability, too.
> <
> You lost me, here.
>>
>> For use cases, obviously compression comes to mind, plus image
>> processing (for example the formats having RGB in 3*10 bits).
>>
>> Instruction format would be something like
>>
>> LDBIT Rt,[Rb + R_bit_offset], R_mask
>>
>> Comments? Too complicated? Makes the handling too slow/big
>> because of the three shifts? Has been tried recently and didn't
>> fly because... ?
> <
> Insert and extract save the day:: removing bit alignment from
> memory references.
> <
> LDD Rc,[Rbase+Rindex<<scale+Displacement]
> SLA Rb,Rc,<width:offset>
> <
> INS Rc,Rb,<width:offset>
> STD Rc,[Rbase+Rindex<<scale+Displacement]
> <
> Bit aligned memory references are 2-gate (plus wire delay) longer
> than byte aligned memory references.
> <
> Given double width extract/insert and a check for container
> stepping, generally gets the bit alignment done at low cost,
> for this seldom needed feature.

When, some time ago, we discussed this in the context of IIRC picking
apart compressed strings, I thought you were going to implement a "load
bit field" instruction in order to reduce the number of instructions for
the inner loop by IIRC 3-4. Did you change your mind on this?

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Everything old is still new again, Load/store with bit offset and mask

<tqf15n$1kc9$1@gal.iecc.com>

https://www.novabbs.com/devel/article-flat.php?id=30540&group=comp.arch#30540

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!news.cmpublishers.com!adore2!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: joh...@taugh.com (John Levine)
Newsgroups: comp.arch
Subject: Re: Everything old is still new again, Load/store with bit offset and mask
Date: Fri, 20 Jan 2023 21:30:31 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <tqf15n$1kc9$1@gal.iecc.com>
References: <tqeopv$2tve7$1@newsreader4.netcologne.de>
Injection-Date: Fri, 20 Jan 2023 21:30:31 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
logging-data="53641"; mail-complaints-to="abuse@iecc.com"
In-Reply-To: <tqeopv$2tve7$1@newsreader4.netcologne.de>
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
 by: John Levine - Fri, 20 Jan 2023 21:30 UTC

It appears that Thomas Koenig <tkoenig@netcologne.de> said:
>A question.
>
>These days, every reasonable highly-performing architecture allows
>unaligned loads on a byte boundary. This includes being able to
>handle crossing cache line boundaries, crossing page boundaries,
>page faults and whatever else.
>
>What about load and store instructions which allow, in addition
>a constant or a register?
>
>To be useful, stores at least would have to have an additional
>parameter specifying the size, in bits, to be stored. For
>symmetry, and to save masking off, loads should have that
>capability, too.

The Vax had extract, insert, and compare bit field instructions. They
took an address, a size, and an offset. The only alignment restriction
was that if the address was a register, the field had to be entirely
within that register.

I think the IBM STRETCH had bit aligned loads and stores too but I don't
have the book about it handy.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Re: Everything old is still new again, Load/store with bit offset and mask

<b53ded62-627a-454c-bc28-593c3c2ee671n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30541&group=comp.arch#30541

X-Received: by 2002:a05:6214:14f1:b0:534:2c9f:64fe with SMTP id k17-20020a05621414f100b005342c9f64femr818015qvw.111.1674250664811;
Fri, 20 Jan 2023 13:37:44 -0800 (PST)
X-Received: by 2002:a05:6808:394c:b0:363:bef1:30bc with SMTP id
en12-20020a056808394c00b00363bef130bcmr932229oib.113.1674250664471; Fri, 20
Jan 2023 13:37:44 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 20 Jan 2023 13:37:44 -0800 (PST)
In-Reply-To: <tqf15n$1kc9$1@gal.iecc.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:702d:9fa4:bdb8:e277;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:702d:9fa4:bdb8:e277
References: <tqeopv$2tve7$1@newsreader4.netcologne.de> <tqf15n$1kc9$1@gal.iecc.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b53ded62-627a-454c-bc28-593c3c2ee671n@googlegroups.com>
Subject: Re: Everything old is still new again, Load/store with bit offset and mask
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 20 Jan 2023 21:37:44 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Fri, 20 Jan 2023 21:37 UTC

On Friday, January 20, 2023 at 3:30:35 PM UTC-6, John Levine wrote:
> It appears that Thomas Koenig <tko...@netcologne.de> said:
> >A question.
> >
> >These days, every reasonable highly-performing architecture allows
> >unaligned loads on a byte boundary. This includes being able to
> >handle crossing cache line boundaries, crossing page boundaries,
> >page faults and whatever else.
> >
> >What about load and store instructions which allow, in addition
> >a constant or a register?
> >
> >To be useful, stores at least would have to have an additional
> >parameter specifying the size, in bits, to be stored. For
> >symmetry, and to save masking off, loads should have that
> >capability, too.
> The Vax had extract, insert, and compare bit field instructions. They
> took an address, a size, and an offset. The only alignment restriction
> was that if the address was a register, the field had to be entirely
> within that register.
>
> I think the IBM STRETCH had bit aligned loads and stores too but I don't
> have the book about it handy.
<
68020 had memory based bit field stuff.
>
> --
> Regards,
> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly

Re: Load/store with bit offset and mask

<49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30542&group=comp.arch#30542

X-Received: by 2002:a05:620a:2b91:b0:705:af79:c22e with SMTP id dz17-20020a05620a2b9100b00705af79c22emr620571qkb.674.1674250746116;
Fri, 20 Jan 2023 13:39:06 -0800 (PST)
X-Received: by 2002:a05:6808:6397:b0:363:867:4887 with SMTP id
ec23-20020a056808639700b0036308674887mr776987oib.218.1674250745909; Fri, 20
Jan 2023 13:39:05 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 20 Jan 2023 13:39:05 -0800 (PST)
In-Reply-To: <tqevhd$27op0$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:702d:9fa4:bdb8:e277;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:702d:9fa4:bdb8:e277
References: <tqeopv$2tve7$1@newsreader4.netcologne.de> <cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
<tqevhd$27op0$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>
Subject: Re: Load/store with bit offset and mask
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 20 Jan 2023 21:39:06 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Fri, 20 Jan 2023 21:39 UTC

On Friday, January 20, 2023 at 3:02:40 PM UTC-6, Stephen Fuld wrote:
> On 1/20/2023 11:45 AM, MitchAlsup wrote:

> > Insert and extract save the day:: removing bit alignment from
> > memory references.
> > <
> > LDD Rc,[Rbase+Rindex<<scale+Displacement]
> > SLA Rb,Rc,<width:offset>
> > <
> > INS Rc,Rb,<width:offset>
> > STD Rc,[Rbase+Rindex<<scale+Displacement]
> > <
> > Bit aligned memory references are 2-gate (plus wire delay) longer
> > than byte aligned memory references.
> > <
> > Given double width extract/insert and a check for container
> > stepping, generally gets the bit alignment done at low cost,
> > for this seldom needed feature.
<
> When, some time ago, we discussed this in the context of IIRC picking
> apart compressed strings, I thought you were going to implement a "load
> bit field" instruction in order to reduce the number of instructions for
> the inner loop by IIRC 3-4. Did you change your mind on this?
<
I must be getting old, I don't remember that conversation in enough
detail to say. But I don't think I ever contemplated bit aligned memory
references.
>
>
>
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Everything old is still new again, Load/store with bit offset and mask

<eb954dfe-1a0b-431e-b079-aa1a1dfa67c5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30543&group=comp.arch#30543

X-Received: by 2002:a05:622a:5c0f:b0:3b6:3a58:911a with SMTP id gd15-20020a05622a5c0f00b003b63a58911amr579218qtb.350.1674261074394;
Fri, 20 Jan 2023 16:31:14 -0800 (PST)
X-Received: by 2002:a9d:64c4:0:b0:684:b7cf:6795 with SMTP id
n4-20020a9d64c4000000b00684b7cf6795mr907563otl.76.1674261074016; Fri, 20 Jan
2023 16:31:14 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer03.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Fri, 20 Jan 2023 16:31:13 -0800 (PST)
In-Reply-To: <tqf15n$1kc9$1@gal.iecc.com>
Injection-Info: google-groups.googlegroups.com; posting-host=136.50.14.162; posting-account=AoizIQoAAADa7kQDpB0DAj2jwddxXUgl
NNTP-Posting-Host: 136.50.14.162
References: <tqeopv$2tve7$1@newsreader4.netcologne.de> <tqf15n$1kc9$1@gal.iecc.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <eb954dfe-1a0b-431e-b079-aa1a1dfa67c5n@googlegroups.com>
Subject: Re: Everything old is still new again, Load/store with bit offset and mask
From: jim.brak...@ieee.org (JimBrakefield)
Injection-Date: Sat, 21 Jan 2023 00:31:14 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 2685
 by: JimBrakefield - Sat, 21 Jan 2023 00:31 UTC

On Friday, January 20, 2023 at 3:30:35 PM UTC-6, John Levine wrote:
> It appears that Thomas Koenig <tko...@netcologne.de> said:
> >A question.
> >
> >These days, every reasonable highly-performing architecture allows
> >unaligned loads on a byte boundary. This includes being able to
> >handle crossing cache line boundaries, crossing page boundaries,
> >page faults and whatever else.
> >
> >What about load and store instructions which allow, in addition
> >a constant or a register?
> >
> >To be useful, stores at least would have to have an additional
> >parameter specifying the size, in bits, to be stored. For
> >symmetry, and to save masking off, loads should have that
> >capability, too.
> The Vax had extract, insert, and compare bit field instructions. They
> took an address, a size, and an offset. The only alignment restriction
> was that if the address was a register, the field had to be entirely
> within that register.
>
> I think the IBM STRETCH had bit aligned loads and stores too but I don't
> have the book about it handy.
>
> --
> Regards,
> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
> Please consider the environment before reading this e-mail. https://jl.ly

The IBM Stretch had 24-bit bit-addressable memory pointers (2 MB max).
Bytes could be from one to eight bits; this used a 64-bit instruction versus
a 32-bit instruction. It also had branch-on-bit instructions.

For modifying I/O port configurations, ARM bit-banding makes sense?
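On the Cortex-M3/M4, bit-banding aliases each bit of the lower 1 MB of the
SRAM and peripheral regions to its own 32-bit word, so one ordinary word
store sets or clears a single I/O bit. A sketch of the alias-address
computation (peripheral region shown; check the device's memory map before
relying on it):

#include <stdint.h>

#define BITBAND_PERIPH_BASE   0x40000000UL
#define BITBAND_PERIPH_ALIAS  0x42000000UL

static volatile uint32_t *bitband_alias(uintptr_t reg_addr, unsigned bit)
{
    uintptr_t byte_offset = reg_addr - BITBAND_PERIPH_BASE;
    return (volatile uint32_t *)(BITBAND_PERIPH_ALIAS
                                 + byte_offset * 32U
                                 + bit * 4U);
}

/* e.g. *bitband_alias(reg, 3) = 1;  sets bit 3 of the register at 'reg'
   with a single word store */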

Re: Load/store with bit offset and mask

<tqg4b8$2gj3p$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30547&group=comp.arch#30547

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Load/store with bit offset and mask
Date: Fri, 20 Jan 2023 23:30:48 -0800
Organization: A noiseless patient Spider
Lines: 173
Message-ID: <tqg4b8$2gj3p$1@dont-email.me>
References: <tqeopv$2tve7$1@newsreader4.netcologne.de>
<cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
<tqevhd$27op0$1@dont-email.me>
<49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 21 Jan 2023 07:30:49 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="1a260d83b94dfa01bb6227b19b07ec6d";
logging-data="2641017"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+G67L0KxXmPpOME5ydDf1J3F4OMvrIz24="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.6.1
Cancel-Lock: sha1:i5NPGi5D3OdEH4Znkhxu8cKL+jw=
Content-Language: en-US
In-Reply-To: <49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>
 by: Stephen Fuld - Sat, 21 Jan 2023 07:30 UTC

On 1/20/2023 1:39 PM, MitchAlsup wrote:
> On Friday, January 20, 2023 at 3:02:40 PM UTC-6, Stephen Fuld wrote:
>> On 1/20/2023 11:45 AM, MitchAlsup wrote:
>
>>> Insert and extract save the day:: removing bit alignment from
>>> memory references.
>>> <
>>> LDD Rc,[Rbase+Rindex<<scale+Displacement]
>>> SLA Rb,Rc,<width:offset>
>>> <
>>> INS Rc,Rb,<width:offset>
>>> STD Rc,[Rbase+Rindex<<scale+Displacement]
>>> <
>>> Bit aligned memory references are 2-gate (plus wire delay) longer
>>> than byte aligned memory references.
>>> <
>>> Given double width extract/insert and a check for container
>>> stepping, generally gets the bit alignment done at low cost,
>>> for this seldom needed feature.
> <
>> When, some time ago, we discussed this in the context of IIRC picking
>> apart compressed strings, I thought you were going to implement a "load
>> bit field" instruction in order to reduce the number of instructions for
>> the inner loop by IIRC 3-4. Did you change your mind on this?
> <
> I must be getting old, I don't remember that conversation in enough
> detail to say. But I don't think I ever contemplated bit aligned memory
> references.

I found it. Below I have pasted, as a quotation, a post that includes
most of the context, the proposed solution, and your response.

> From: "MitchAlsup" <MitchAlsup@aol.com>
> Subject: Re: RISC-V vs. Aarch64
> Date: Fri, 21 Jan 2022 09:53:16 -0800 (PST)
> Message-ID: <9d652029-a997-4bec-9182-769002a6bd1dn@googlegroups.com>
> Lines: 143
>
> On Friday, January 21, 2022 at 11:05:28 AM UTC-6, Stephen Fuld wrote:
>> On 1/13/2022 9:53 AM, MitchAlsup wrote:
>
>> > I came up with this in 5 minutes::
>> > This assumes the input bit-length selector is an vector of characters and that the
>> > chars contain values from {1..64}
>> > <
>> > void unpack( uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count )
>> > {
>> > uint64_t len,
>> > bit=0,
>> > word=0,
>> > extract,
>> > container1 = packed[0],
>> > container2 = packed[1];
>> >
>> > for( unsigned int i = 0; i < count; i++ )
>> > {
>> > len = size[i];
>> > bit += len;
>> > extract = ( len << 32 ) | ( bit & 0x3F );
>> > if( word != bit >> 6 )
>> > {
>> > container1 = container2;
>> > container2 = packed[++word];
>> > }
>> > unpacked[i] = {container2, container1} >> extract;
>> > }
>> > }
>> > <
>> > This translates into pretty nice My 66000 ISA:
>> > <
>> > ENTRY unpack
>> > unpack:
>> > MOV R5,#0
>> > MOV R6,#0
>> > LDD R7,[R2]
>> > LDD R8,[R2+8]
>> > MOV R9,#0
>> > loop:
>> > LDUB R10,[R1+R9]
>> > ADD R5,R5,R10
>> > AND R11,R5,#63
>> > SL R12,R10,#32
>> > OR R11,R11,R12
>> > SR R12,R6,#6
>> > CMP R11,R6,R12
>> > PEQ R11,{111}
>> > ADD R6,R6,#1
>> > MOV R7,R8
>> > LDD R8,[R2+R6<<3]
>> > CARRY R8,{{I}}
>> > SL R12,R7,R11
>> > STD R12,[R3+R9<<3]
>> > ADD R9,R9,#1
>> > CMP R11,R9,R4
>> > BLT R11,loop
>> > RET
>> > <
>> > Well at least straightforwardly.
>>
>>
>> If Terje is right, and he almost always is, it is worth trying to come
>> up with a better solution for this type of problem. So, as a start, I
>> came up with what follows. This certainly isn’t the final solution. It
>> is intended to start a discussion on better ways to do this. And the
>> usual disclaimer, IANAHG, so this is from a software perspective. But I
>> did try to fit it “in the spirit” of the MY 66000, and it takes
>> advantages of that design’s unique capabilities.
>>
>> The idea is to add one new instruction, which typically would be in the
>> shadow of a preceding Carry meta instruction. I called the new
>> instruction Load Bit Field (LBF).
>>
>> It is a two source, one result instruction, but uses the carry register
>> for an additional source and destination. The syntax is
>>
>> LBF Result register, field length (in bits), buffer starting address
>> (in bytes)
>>
>> The carry register contains the offset, in bits, from the start of the
>> buffer where the desired field starts.
>>
>> The instruction computes the start of the desired field by adding the
>> high order all but three bits of the carry register to get the starting
>> byte number, then uses the low order three bits to get the starting bit
>> number. The instruction extracts the field, starting at the computed
>> bit address with length as given in the register specified in the
>> register, and right justifies that field in the result register. The
>> higher order bits in the result register are set to zero. If the output
>> bit of the Carry instruction is set, the length value is added to the
>> Carry register.
> <
> A bit more on the CISC side than desired (most of the time)--3
> exceptions possible, 2 memory accesses. Also note, my original
> solution can produce signed or unsigned output stream. This is
> going to take 2 cycles in AGEN, and 2 result register writes.
>>
>> In order to speed up this instruction, and given that it will frequently
>> occur in a fairly tight loop, I think (hope) that the hardware can take
>> advantage of the “streaming” buffers otherwise used for VVM operations.
>> Anyway, if one had this instruction, the main loop in the code above
>> could be something like
>>
>>
>> loop:
>> LDUB R10,[R1+R9]
>> CARRY R6,IO
>> LBF R12,R10,R2 ;I am not sure about R2, It should be the start of
>> the packed buffer.
>> STD R12,[R3+R9<<3]
>> ADD R9,R9,#1
>> CMP R11,R9,R4
>> BLT R11,loop
>>
>> For a savings of about 10 instructions in the I cache, but fewer in
>> execution (but still significant) depending upon how often the
>> instructions under the predicate are executed.
>>
> I have to admit, this looks fairly juicy--just have to plow my way
> through and see what comes out.
>>
>> Anyway, Of course, I invite comments, criticisms, etc. One obvious
>> drawback is that this only addresses the "decompression" side. While I
>> briefly considered a "Store Bit Field", I discarded it as it seemed too
>> complex, and presumably would used less frequently, as
>> compression/coding happens less frequently than decompression/decoding.

end of copied old post.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Load/store with bit offset and mask

<bcb2431b-03de-4c05-9bf2-8551ad7a4e9fn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30552&group=comp.arch#30552

X-Received: by 2002:a05:620a:840f:b0:706:926a:9011 with SMTP id pc15-20020a05620a840f00b00706926a9011mr738426qkn.351.1674319787016;
Sat, 21 Jan 2023 08:49:47 -0800 (PST)
X-Received: by 2002:a9d:6551:0:b0:684:c6e2:92fa with SMTP id
q17-20020a9d6551000000b00684c6e292famr888799otl.31.1674319786781; Sat, 21 Jan
2023 08:49:46 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 21 Jan 2023 08:49:46 -0800 (PST)
In-Reply-To: <tqg4b8$2gj3p$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1568:65cc:f0c4:a6e0;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1568:65cc:f0c4:a6e0
References: <tqeopv$2tve7$1@newsreader4.netcologne.de> <cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
<tqevhd$27op0$1@dont-email.me> <49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>
<tqg4b8$2gj3p$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <bcb2431b-03de-4c05-9bf2-8551ad7a4e9fn@googlegroups.com>
Subject: Re: Load/store with bit offset and mask
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 21 Jan 2023 16:49:47 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 8723
 by: MitchAlsup - Sat, 21 Jan 2023 16:49 UTC

On Saturday, January 21, 2023 at 1:30:52 AM UTC-6, Stephen Fuld wrote:
> On 1/20/2023 1:39 PM, MitchAlsup wrote:
> > On Friday, January 20, 2023 at 3:02:40 PM UTC-6, Stephen Fuld wrote:
> >> On 1/20/2023 11:45 AM, MitchAlsup wrote:
> >
> >>> Insert and extract save the day:: removing bit alignment from
> >>> memory references.
> >>> <
> >>> LDD Rc,[Rbase+Rindex<<scale+Displacement]
> >>> SLA Rb,Rc,<width:offset>
> >>> <
> >>> INS Rc,Rb,<width:offset>
> >>> STD Rc,[Rbase+Rindex<<scale+Displacement]
> >>> <
> >>> Bit aligned memory references are 2-gate (plus wire delay) longer
> >>> than byte aligned memory references.
> >>> <
> >>> Given double width extract/insert and a check for container
> >>> stepping, generally gets the bit alignment done at low cost,
> >>> for this seldom needed feature.
> > <
> >> When, some time ago, we discussed this in the context of IIRC picking
> >> apart compressed strings, I thought you were going to implement a "load
> >> bit field" instruction in order to reduce the number of instructions for
> >> the inner loop by IIRC 3-4. Did you change your mind on this?
> > <
> > I must be getting old, I don't remember that conversation in enough
> > detail to say. But I don't think I ever contemplated bit aligned memory
> > references.
> I found it. Below I have pasted as a quotation, a post that includes
> most of the context, the proposed solution, and your response.
>
> > From: "MitchAlsup" <Mitch...@aol.com>
> > Subject: Re: RISC-V vs. Aarch64
> > Date: Fri, 21 Jan 2022 09:53:16 -0800 (PST)
> > Message-ID: <9d652029-a997-4bec...@googlegroups.com>
> > Lines: 143
> >
> > On Friday, January 21, 2022 at 11:05:28 AM UTC-6, Stephen Fuld wrote:
> >> On 1/13/2022 9:53 AM, MitchAlsup wrote:
> >
> >> > I came up with this in 5 minutes::
> >> > This assumes the input bit-length selector is an vector of characters and that the
> >> > chars contain values from {1..64}
> >> > <
> >> > void unpack( uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count )
> >> > {
> >> > uint64_t len,
> >> > bit=0,
> >> > word=0,
> >> > extract,
> >> > container1 = packed[0],
> >> > container2 = packed[1];
> >> >
> >> > for( unsigned int i = 0; i < count; i++ )
> >> > {
> >> > len = size[i];
> >> > bit += len;
> >> > extract = ( len << 32 ) | ( bit & 0x3F );
> >> > if( word != bit >> 6 )
> >> > {
> >> > container1 = container2;
> >> > container2 = packed[++word];
> >> > }
> >> > unpacked[i] = {container2, container1} >> extract;
> >> > }
> >> > }
> >> > <
> >> > This translates into pretty nice My 66000 ISA:
> >> > <
> >> > ENTRY unpack
> >> > unpack:
> >> > MOV R5,#0
> >> > MOV R6,#0
> >> > LDD R7,[R2]
> >> > LDD R8,[R2+8]
> >> > MOV R9,#0
> >> > loop:
> >> > LDUB R10,[R1+R9]
> >> > ADD R5,R5,R10
> >> > AND R11,R5,#63
> >> > SL R12,R10,#32
> >> > OR R11,R11,R12
> >> > SR R12,R6,#6
> >> > CMP R11,R6,R12
> >> > PEQ R11,{111}
> >> > ADD R6,R6,#1
> >> > MOV R7,R8
> >> > LDD R8,[R2+R6<<3]
> >> > CARRY R8,{{I}}
> >> > SL R12,R7,R11
> >> > STD R12,[R3+R9<<3]
> >> > ADD R9,R9,#1
> >> > CMP R11,R9,R4
> >> > BLT R11,loop
> >> > RET
> >> > <
> >> > Well at least straightforwardly.
> >>
> >>
> >> If Terje is right, and he almost always is, it is worth trying to come
> >> up with a better solution for this type of problem. So, as a start, I
> >> came up with what follows. This certainly isn’t the final solution. It
> >> is intended to start a discussion on better ways to do this. And the
> >> usual disclaimer, IANAHG, so this is from a software perspective. But I
> >> did try to fit it “in the spirit” of the MY 66000, and it takes
> >> advantages of that design’s unique capabilities.
> >>
> >> The idea is to add one new instruction, which typically would be in the
> >> shadow of a preceding Carry meta instruction. I called the new
> >> instruction Load Bit Field (LBF).
> >>
> >> It is a two source, one result instruction, but uses the carry register
> >> for an additional source and destination. The syntax is
> >>
> >> LBF Result register, field length (in bits), buffer starting address
> >> (in bytes)
> >>
> >> The carry register contains the offset, in bits, from the start of the
> >> buffer where the desired field starts.
> >>
> >> The instruction computes the start of the desired field by adding the
> >> high order all but three bits of the carry register to get the starting
> >> byte number, then uses the low order three bits to get the starting bit
> >> number. The instruction extracts the field, starting at the computed
> >> bit address with length as given in the register specified in the
> >> register, and right justifies that field in the result register. The
> >> higher order bits in the result register are set to zero. If the output
> >> bit of the Carry instruction is set, the length value is added to the
> >> Carry register.
> > <
> > A bit more on the CISC side than desired (most of the time)--3
> > exceptions possible, 2 memory accesses. Also note, my original
> > solution can produce signed or unsigned output stream. This is
> > going to take 2 cycles in AGEN, and 2 result register writes.
> >>
> >> In order to speed up this instruction, and given that it will frequently
> >> occur in a fairly tight loop, I think (hope) that the hardware can take
> >> advantage of the “streaming” buffers otherwise used for VVM operations.
> >> Anyway, if one had this instruction, the main loop in the code above
> >> could be something like
> >>
> >>
> >> loop:
> >> LDUB R10,[R1+R9]
> >> CARRY R6,IO
> >> LBF R12,R10,R2 ;I am not sure about R2, It should be the start of
> >> the packed buffer.
> >> STD R12,[R3+R9<<3]
> >> ADD R9,R9,#1
> >> CMP R11,R9,R4
> >> BLT R11,loop
> >>
> >> For a savings of about 10 instructions in the I cache, but fewer in
> >> execution (but still significant) depending upon how often the
> >> instructions under the predicate are executed.
> >>
> > I have to admit, this looks fairly juicy--just have to plow my way
> > through and see what comes out.
> >>
> >> Anyway, Of course, I invite comments, criticisms, etc. One obvious
> >> drawback is that this only addresses the "decompression" side. While I
> >> briefly considered a "Store Bit Field", I discarded it as it seemed too
> >> complex, and presumably would used less frequently, as
> >> compression/coding happens less frequently than decompression/decoding..
>
> end of copied old post.
<
Exactly 1 year ago today.
<
In any event, this has not yet made it into my ISA.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: Load/store with bit offset and mask

<62ecf4bf-0ad4-4a5e-8b39-a73b077ba9ffn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30553&group=comp.arch#30553

X-Received: by 2002:ac8:7ca5:0:b0:3b6:321d:5093 with SMTP id z5-20020ac87ca5000000b003b6321d5093mr652273qtv.595.1674337673260;
Sat, 21 Jan 2023 13:47:53 -0800 (PST)
X-Received: by 2002:a05:6870:c4c:b0:13b:6986:2649 with SMTP id
lf12-20020a0568700c4c00b0013b69862649mr1718544oab.261.1674337672861; Sat, 21
Jan 2023 13:47:52 -0800 (PST)
Path: i2pn2.org!i2pn.org!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sat, 21 Jan 2023 13:47:52 -0800 (PST)
In-Reply-To: <tqg4b8$2gj3p$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:1568:65cc:f0c4:a6e0;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:1568:65cc:f0c4:a6e0
References: <tqeopv$2tve7$1@newsreader4.netcologne.de> <cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
<tqevhd$27op0$1@dont-email.me> <49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>
<tqg4b8$2gj3p$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <62ecf4bf-0ad4-4a5e-8b39-a73b077ba9ffn@googlegroups.com>
Subject: Re: Load/store with bit offset and mask
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sat, 21 Jan 2023 21:47:53 +0000
Content-Type: text/plain; charset="UTF-8"
X-Received-Bytes: 3773
 by: MitchAlsup - Sat, 21 Jan 2023 21:47 UTC

On Saturday, January 21, 2023 at 1:30:52 AM UTC-6, Stephen Fuld wrote:
> On 1/20/2023 1:39 PM, MitchAlsup wrote:
> >> On 1/20/2023 11:45 AM, MitchAlsup wrote:

> >> > void unpack( uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count )
> >> > {
> >> > uint64_t len,
> >> > bit=0,
> >> > word=0,
> >> > extract,
> >> > container1 = packed[0],
> >> > container2 = packed[1];
> >> >
> >> > for( unsigned int i = 0; i < count; i++ )
> >> > {
> >> > len = size[i];
> >> > bit += len;
> >> > extract = ( len << 32 ) | ( bit & 0x3F );
> >> > if( word != bit >> 6 )
> >> > {
> >> > container1 = container2;
> >> > container2 = packed[++word];
> >> > }
> >> > unpacked[i] = {container2, container1} >> extract;
> >> > }
> >> > }
<
I spent an hour and came up with a better subroutine::
<
void unpack( uchar_t size[], uint64_t packed[],
             uint64_t unpacked[], uint64_t count )
{   uint64_t len,
             bit  = 0,
             word = 1,
             container1,
             container2 = packed[0];

    for( unsigned int i = 0; i < count; i++ )
    {
        container1 = container2;
        container2 = packed[word++];
        do {
            len = size[i];
            unpacked[i] = ({container2, container1} >> bit)
                          & ~(~0 << len);
            bit += len;
        } while( bit < 64 );
        bit &= 63;
    }
}
<
Which has the ability to be compiled into::
<
unpack:
        MOV     R6,#0
        MOV     R7,#1
        LDD     R9,[R2]
begin_for_loop:
        MOV     R10,#0
for_loop:
        MOV     R8,R9
        LDD     R9,[R4+R7<<3]
begin_do_loop:
        VEC     R12,{}
do_loop:
        LDUB    R5,[R1+R10]
        CARRY   R9,{I}
        SLL     R11,R8,R6
        SRL     R12,#-1,R5
        AND     R11,R11,~R12
        LOOP    LE,R6,R5,#63
end_do_loop:
        AND     R6,R6,#63
        ADD     R10,R10,#1
        CMP     R11,R10,R4
        BLT     R11,for_loop
end_for_loop:
        RET
<
The inner loop executes 5 instructions {LDUB, SRL, SLL, AND, LOOP}
and contains the CARRY instruction-modifier.
<
Apparently creating a mask and using it is just as efficient as building an
extraction variable (so I changed the code to reflect).
<
Also note:: the previous code was off by the length of the first bit field.
<
Latency analysis::
<
The first 3 instructions can issue together (depending on width of machine).
Instructions {4 and 6} can begin execution as soon as the LDUB data arrives.
So the loop latency is LD-latency+2 = 5 cycles.
<
I doubt any small increment in the memory reference instructions would
make this look "lots faster"; it is therefore contraindicated.

Re: Load/store with bit offset and mask

<tqjfu9$30vrp$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=30554&group=comp.arch#30554

Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!newsreader4.netcologne.de!news.netcologne.de!.POSTED.2a0a-a540-2398-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de!not-for-mail
From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: Load/store with bit offset and mask
Date: Sun, 22 Jan 2023 14:07:05 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <tqjfu9$30vrp$1@newsreader4.netcologne.de>
References: <tqeopv$2tve7$1@newsreader4.netcologne.de>
<cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
<tqevhd$27op0$1@dont-email.me>
<49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>
<tqg4b8$2gj3p$1@dont-email.me>
<62ecf4bf-0ad4-4a5e-8b39-a73b077ba9ffn@googlegroups.com>
Injection-Date: Sun, 22 Jan 2023 14:07:05 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2a0a-a540-2398-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2a0a:a540:2398:0:7285:c2ff:fe6c:992d";
logging-data="3178361"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Sun, 22 Jan 2023 14:07 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> I spend an hour and came up with a better subroutine::
><
> void unpack( uchar_t size[], uint64_t packed[],
> uint64_t unpacked[], uint64_t count )
> {
> uint64_t len,
> bit =0,
> word =1,
> container1,
> container2 = packed[0];
>
> for( unsigned int i = 0; i < count; i++ )
> {
> container1 = container2;
> container2 = packed[word++]
> do {
> len = size[i];
> unpacked[i] = ({container2, container1} >> bit)
> & ~(~0 <<len);
> bit += len;
> } while( bit < 64 );
> bit &= 63;
> }
> }

Cleaning this up a little bit, I get

#include <stdint.h>

typedef unsigned char uchar_t;

void
unpack (uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count)
{ uint64_t len, bit = 0, word = 1, container1, container2 = packed[0];

  for (unsigned int i = 0; i < count; i++)
    {
      container1 = container2;
      container2 = packed[word++];
      __uint128_t cont;
      cont = ((__uint128_t)container2 << 64) | container1;
      do
        {
          len = size[i];
          unpacked[i] = (cont >> bit) & ~(~0 << len);
          bit += len;
        }
      while (bit < 64);
      bit &= 63;
    }
}

> Which has the ability to be compiled into::

Nicely put :-)

Right now, it hits https://github.com/bagel99/llvm-my66000/issues/24
(which I just opened today, before reading your post).

Re: Load/store with bit offset and mask

<tqk56o$387to$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=30556&group=comp.arch#30556

Path: i2pn2.org!i2pn.org!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: Load/store with bit offset and mask
Date: Sun, 22 Jan 2023 12:10:00 -0800
Organization: A noiseless patient Spider
Lines: 129
Message-ID: <tqk56o$387to$1@dont-email.me>
References: <tqeopv$2tve7$1@newsreader4.netcologne.de>
<cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
<tqevhd$27op0$1@dont-email.me>
<49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>
<tqg4b8$2gj3p$1@dont-email.me>
<62ecf4bf-0ad4-4a5e-8b39-a73b077ba9ffn@googlegroups.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 22 Jan 2023 20:10:00 -0000 (UTC)
Injection-Info: reader01.eternal-september.org; posting-host="d1e8815841052b9ce7d3da99b8da32b7";
logging-data="3415992"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19iydrHNnOX9y0SU0+7p/oeRnGXAXh2Yww="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101
Thunderbird/102.6.1
Cancel-Lock: sha1:31yzxYStUhJgbB1zsDDjlRqeV/Q=
In-Reply-To: <62ecf4bf-0ad4-4a5e-8b39-a73b077ba9ffn@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Sun, 22 Jan 2023 20:10 UTC

On 1/21/2023 1:47 PM, MitchAlsup wrote:
> On Saturday, January 21, 2023 at 1:30:52 AM UTC-6, Stephen Fuld wrote:
>> On 1/20/2023 1:39 PM, MitchAlsup wrote:
>>>> On 1/20/2023 11:45 AM, MitchAlsup wrote:
>
>>>>> void unpack( uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count )
>>>>> {
>>>>> uint64_t len,
>>>>> bit=0,
>>>>> word=0,
>>>>> extract,
>>>>> container1 = packed[0],
>>>>> container2 = packed[1];
>>>>>
>>>>> for( unsigned int i = 0; i < count; i++ )
>>>>> {
>>>>> len = size[i];
>>>>> bit += len;
>>>>> extract = ( len << 32 ) | ( bit & 0x3F );
>>>>> if( word != bit >> 6 )
>>>>> {
>>>>> container1 = container2;
>>>>> container2 = packed[++word];
>>>>> }
>>>>> unpacked[i] = {container2, container1} >> extract;
>>>>> }
>>>>> }
> <
> I spend an hour and came up with a better subroutine::
> <
> void unpack( uchar_t size[], uint64_t packed[],
> uint64_t unpacked[], uint64_t count )
> {
> uint64_t len,
> bit =0,
> word =1,
> container1,
> container2 = packed[0];
>
> for( unsigned int i = 0; i < count; i++ )
> {
> container1 = container2;
> container2 = packed[word++]
> do {
> len = size[i];
> unpacked[i] = ({container2, container1} >> bit)
> & ~(~0 <<len);
> bit += len;
> } while( bit < 64 );
> bit &= 63;
> }
> }
> <
> Which has the ability to be compiled into::
> <
> unpack:
> MOV R6,#0
> MOV R7,#1
> LDD R9,[R2]
> begin_for_loop:
> MOV R10,#0
> for_loop:
> MOV R8,R9
> LDD R9,[R4+R7<<3]
> begin_do_loop:
> VEC R12,{}
> do_loop:
> LDUB R5,[R1+R10]
> CARRY R9,{I}
> SLL R11,R8,R6
> SRL R12,#-1,R5
> AND R11,R11,~R12
> LOOP LE,R6,R5,#63
> end_do_loop:
> AND R6,R6,#63
> ADD R10,R10,#1
> CMP R11,R10,R4
> BLT R11,for_loop
> end_for_loop:
> RET
> <
> Inner loops execute 5 instructions {LDSB, SRL, SLL, AND, LOOP}
> and contains the CARRY instruction-modifier.
> <
> Apparently creating a mask and using it is just as efficient as building an
> extraction variable (so I changed the code to reflect).
> <
> Also note:: the previous code was off by the length of the first bit field.
> <
> Latency analysis::
> <
> The first 3 instructions can issue together (depending on width of machine).
> Instructions {4 and 6} can begin execution as soon as the LDUB data arrives.
> So the loop latency is LD-latency+2 = 5 cycles.
> <
> I doubt any small increment in the memory reference instructions would
> make this look "lots faster", and is therefore contraindicated.

Nice. It looks like you have reduced the advantage of the load bit
field instruction. Some comments:

1. When I made the suggestion, I was just looking to make minimal
changes to the generated code. I didn't look at the possibility of
using the VEC/LOOP instructions. If these could be used for the loop in
my code, I think it saves another instruction in the loop.

2. Your solution still has the overhead of extra code for when you
cross a 64 bit boundary. You have changed it from a big PRED to an
outer loop. But it is still there. So you have to factor in the extra
instructions in the outer loop times 64 divided by the average bit field
length. Note that that also means you have to load and parse the inner
loop once per 64 bits, whereas my proposed solution handles 64 bit
crossings without any extra instructions, nor does it require the
reload/reparse of the inner loop (assuming you can use VVM for my inner
loop).

3. Your solution requires three instructions per bit field, two shifts
and an AND, whereas these are combined into a single instruction in my
proposed solution.

Of course, whether using the proposed instruction is "lots faster"
depends on your definition of "lots". :-)

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: Load/store with bit offset and mask

<88282efb-af8c-4973-a1f5-db90f9f78204n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=30557&group=comp.arch#30557

X-Received: by 2002:ad4:40ca:0:b0:535:604c:15ff with SMTP id x10-20020ad440ca000000b00535604c15ffmr417210qvp.25.1674419963246;
Sun, 22 Jan 2023 12:39:23 -0800 (PST)
X-Received: by 2002:a05:6808:1aac:b0:368:ca97:3a2a with SMTP id
bm44-20020a0568081aac00b00368ca973a2amr1303942oib.261.1674419963032; Sun, 22
Jan 2023 12:39:23 -0800 (PST)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Sun, 22 Jan 2023 12:39:22 -0800 (PST)
In-Reply-To: <tqk56o$387to$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:291:29f0:6963:9ac5:3b86:338c;
posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 2600:1700:291:29f0:6963:9ac5:3b86:338c
References: <tqeopv$2tve7$1@newsreader4.netcologne.de> <cb3a96df-4256-48fd-9fe5-2cfcb0c24e91n@googlegroups.com>
<tqevhd$27op0$1@dont-email.me> <49b0fa11-eb6e-4af2-8a6c-09a58dd7161an@googlegroups.com>
<tqg4b8$2gj3p$1@dont-email.me> <62ecf4bf-0ad4-4a5e-8b39-a73b077ba9ffn@googlegroups.com>
<tqk56o$387to$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <88282efb-af8c-4973-a1f5-db90f9f78204n@googlegroups.com>
Subject: Re: Load/store with bit offset and mask
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Sun, 22 Jan 2023 20:39:23 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Sun, 22 Jan 2023 20:39 UTC

On Sunday, January 22, 2023 at 2:10:03 PM UTC-6, Stephen Fuld wrote:
> On 1/21/2023 1:47 PM, MitchAlsup wrote:
> > On Saturday, January 21, 2023 at 1:30:52 AM UTC-6, Stephen Fuld wrote:
> >> On 1/20/2023 1:39 PM, MitchAlsup wrote:
> >>>> On 1/20/2023 11:45 AM, MitchAlsup wrote:
> >
> >>>>> void unpack( uchar_t size[], uint64_t packed[], uint64_t unpacked[], uint64_t count )
> >>>>> {
> >>>>> uint64_t len,
> >>>>> bit=0,
> >>>>> word=0,
> >>>>> extract,
> >>>>> container1 = packed[0],
> >>>>> container2 = packed[1];
> >>>>>
> >>>>> for( unsigned int i = 0; i < count; i++ )
> >>>>> {
> >>>>> len = size[i];
> >>>>> bit += len;
> >>>>> extract = ( len << 32 ) | ( bit & 0x3F );
> >>>>> if( word != bit >> 6 )
> >>>>> {
> >>>>> container1 = container2;
> >>>>> container2 = packed[++word];
> >>>>> }
> >>>>> unpacked[i] = {container2, container1} >> extract;
> >>>>> }
> >>>>> }
> > <
> > I spend an hour and came up with a better subroutine::
> > <
> > void unpack( uchar_t size[], uint64_t packed[],
> > uint64_t unpacked[], uint64_t count )
> > {
> > uint64_t len,
> > bit =0,
> > word =1,
> > container1,
> > container2 = packed[0];
> >
> > for( unsigned int i = 0; i < count; i++ )
> > {
> > container1 = container2;
> > container2 = packed[word++]
> > do {
> > len = size[i];
> > unpacked[i] = ({container2, container1} >> bit)
> > & ~(~0 <<len);
> > bit += len;
> > } while( bit < 64 );
> > bit &= 63;
> > }
> > }
> > <
> > Which has the ability to be compiled into::
> > <
> > unpack:
> > MOV R6,#0
> > MOV R7,#1
> > LDD R9,[R2]
> > begin_for_loop:
> > MOV R10,#0
> > for_loop:
> > MOV R8,R9
> > LDD R9,[R4+R7<<3]
> > begin_do_loop:
> > VEC R12,{}
> > do_loop:
> > LDUB R5,[R1+R10]
> > CARRY R9,{I}
> > SLL R11,R8,R6
> > SRL R12,#-1,R5
> > AND R11,R11,~R12
> > LOOP LE,R6,R5,#63
> > end_do_loop:
> > AND R6,R6,#63
> > ADD R10,R10,#1
> > CMP R11,R10,R4
> > BLT R11,for_loop
> > end_for_loop:
> > RET
> > <
> > Inner loops execute 5 instructions {LDSB, SRL, SLL, AND, LOOP}
> > and contains the CARRY instruction-modifier.
> > <
> > Apparently creating a mask and using it is just as efficient as building an
> > extraction variable (so I changed the code to reflect).
> > <
> > Also note:: the previous code was off by the length of the first bit field.
> > <
> > Latency analysis::
> > <
> > The first 3 instructions can issue together (depending on width of machine).
> > Instructions {4 and 6} can begin execution as soon as the LDUB data arrives.
> > So the loop latency is LD-latency+2 = 5 cycles.
> > <
> > I doubt any small increment in the memory reference instructions would
> > make this look "lots faster", and is therefore contraindicated.
> Nice. It looks like you have reduced the advantage of the load bit
> field instruction. Some comments
>
> 1. When I made the suggestion, I was just looking to make minimal
> changes to the generated code. I didn't look at the possibility of
> using the VEC/LOOP instructions. If these could be used loop for the in
> my code, I think it saves another instruction in the loop.
<
I have coded this with and without VEC/LOOP
>
> 2. Your solution still have the overhead of extra code for when you
> cross a 64 bit boundary. You have changed it from a big PRED, to an
> outer loop. But it is still there. So you have to factor in the extra
> instructions in the outer loop times 64 divided by the average bit field
> length. Note that that also means you have to load and parse the inner
> loop once per 64 bits, whereas, my proposed solution handles 64 bit
> crossings without any extra instructions, nor does it require the
> reload/reparse of the inner loop (assuming you can use VVM for my inner
> loop)
<
I spent ½ hour trying to put a loop around extraction from 64 bits;
the loop overhead makes this impractical. I have to keep track of
bit, i, and word all independently. This makes the smallest inner loop
larger than the single loop with the embedded if statement.
<
void unpackS( uchar_t size[], uint64_t packed[],
              uint32_t unpacked[], uint64_t count )
{   uint32_t len,
             bit  = 0,
             word = 1,
             container = packed[0];

    for( unsigned int i = 0; i < count; bit &= 31 )
    {
        container |= packed[word++] << 32;
        do {
            len = size[i];
            unpacked[i] = container & ~(~0 << len);
            container >>= len;
            bit += len;
            i++;
        } while( bit < 32 && i < count );
    }
}
<
I also tried 32-bit containers (above) and 64-bit containers (previous);
the best code I can see is as shown previously.
>
> 3. Your solution requires three instructions per bit field, two shifts
> and an AND, whereas these are combined into a single instruction in my
> proposed solution.
<
I used the ~0<<len shift instead of loading a mask.
My alternative of inserting the len into the shift (making it an extract)
takes more instructions than generating my own mask !?!
>
> Of course, whether the using the proposed instruction is "lots faster"
> depends on your definition of "lots". :-)
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)
