Re: VVM question

<sg29ht$rbs$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=20084&group=comp.arch#20084

From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 10:10:36 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sg29ht$rbs$1@gioia.aioe.org>
 by: Terje Mathisen - Tue, 24 Aug 2021 08:10 UTC

Thomas Koenig wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
>>> MitchAlsup <MitchAlsup@aol.com> schrieb:
>>>> Back to the posed question:
>>>> <
>>>> If the programmer unrolled the loop by hand (like DGEMM without transposes):
>>>> The LDs would need to be coded using offsets from the index register to be
>>>> recognized as dense::
>>>>
>>>> MOV Ri,#0
>>>> VEC R8,{}
>>>> LDD R4,[R2,Ri<<3]
>>>> LDD R5,[R2,Ri<<3+8]
>>>> LDD R6,[R2,Ri<<3+16]
>>>> LDD R7,[R2,Ri<<3+24]
>>>> ...
>>>> LOOP LT,Ri,#4,Rmax
>>>> <
>>>> The above code would be recognized as dense.
>>>> <
>>>> MOV Ri,#0
>>>> ADD R9,R2,#8
>>>> ADD R9,R2,#16
>>>> ADD R10,R2,#24
>>>> VEC R8,{}
>>>> LDD R4,[R2,Ri<<3]
>>>> LDD R5,[R8,Ri<<3+8]
>>>> LDD R6,[R9,Ri<<3+16]
>>>> LDD R7,[R10,Ri<<3+24]
>>>> ...
>>>> LOOP LT,Ri,#4,Rmax
>>>> <
>>>> This loop is harder to recognize as dense--even though the number of words
>>>> in the loop is less.
>>>
>>> Hm... all possible, but less elegant than it could be. All the
>>> manual unrolling and autovectorization and... rears its ugly
>>> head again.
>>>
>>> With all the mechanisms that VVM already offers, a way for the
>>> programmer or a programming language to specify that operations
>>> such as summation can be done in any order would be a very useful
>>> addition.
>>
>> I am probably missing something here.
>
> Or, equivalently, I have been explaining things badly :-)
>
>> To me the main advantage of
>> allowing out of order summations (using summations here as shorthand for
>> other similar type operations), was to allow the hardware to make use of
>> multiple functional units.
>
> Yes.
>
>> That is, a core with two adders could, if
>> allowed, complete the summation in about half the time.
>
> Yes.
>
>> Without that, I
>> don't see any advantage of out of order summations on VVM. If I am
>> wrong, please explain. If I am right, see below.
>
> Seeing below.
>
>>
>>
>>
>>> Suggestion:
>>>
>>> A variant of the VEC instruction, which does not specify a special
>>> register to keep the address in (which can be hardwired if there
>>> is no space in the thread header). This leaves five bits for
>>> "reduction" registers, which specify that operations on that
>>> register can be done in any order in the loop.
>>
>> Doing the operations in a different order isn't the problem.
>
> It's one half of the problem.
>
> The way VVM is currently specified, it's strictly in-order semantics:
> you write down a C loop, and the hardware delivers the results
> exactly in the order you wrote down. This would have to be
> changed.
>
>
>> You need a
>> way to allow/specify the two partial sums to be added together in the
>> end.
>
> That as well.
>
>> I don't see your proposal as doing that.
>
> I thought I had implied it, but it was obviously not clear enough.
>
>
>> And, of course, it is
>> limited to five registers which must be specified in the hardware design.
>
> Five reductions in a loop would be plenty, it is usually one, or more
> rarely two.
>
>>> This would be a perfect match for OpenMP's reduction clause or
>>> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
>>
>> I am not an OpenMP person, and my knowledge of Fortran is old, so could
>> you please give a brief explanation of what these two things do? Thanks.
>
> #pragma omp simd reduction(+:var)
>
> before a loop will tell the compiler that it can go wild
> with the sequence of loops but that "var" will be used
> in a summation reduction.
>
> DO CONCURRENT also runs loop iterations in an unspecified order;
> the REDUCTION clause would then allow, for example,
> summing up all elements.
>
> One problem with C and similar languages is that you have
> to specify an ordering of the loop explicitly, which shapes
> programmer's thinking and also shapes intermediate languages
> for compilers...
>
The eventual solution for all this will be similar to Mitch's FMAC
accumulator, i.e. a form of super-accumulator which allows one or more
elements to be added per cycle, while delaying all inexact/rounding to
the very end.

A carry-save exact accumulator with ~1100 paired bits would only use a
single full adder (2 or 3 gate delays?) to accept a new input, right?

I am not sure what is the best way for such a beast to handle both
additions and subtractions: Do you need to invert/negate the value to be
subtracted?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: VVM question

<sg2fsq$bpd$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=20085&group=comp.arch#20085

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 09:58:50 -0000 (UTC)
Organization: news.netcologne.de
Message-ID: <sg2fsq$bpd$1@newsreader4.netcologne.de>
 by: Thomas Koenig - Tue, 24 Aug 2021 09:58 UTC

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
>>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
[some snippage, hopefully context-preserving]

>>>> With all the mechanisms that VVM already offers, a way for the
>>>> programmer or a programming language to specify that operations
>>>> such as summation can be done in any order would be a very useful
>>>> addition.

[...]

>>>> Suggestion:
>>>>
>>>> A variant of the VEC instruction, which does not specify a special
>>>> register to keep the address in (which can be hardwired if there
>>>> is no space in the thread header). This leaves five bits for
>>>> "reduction" registers, which specify that operations on that
>>>> register can be done in any order in the loop.

[...]

>> The way VVM is currently specified, it's strictly in-order semantics:
>> you write down a C loop, and the hardware delivers the results
>> exactly in the order you wrote down. This would have to be
>> changed.
>>
>>
>>> You need a
>>> way to allow/specify the two partial sums to be added together in the
>>> end.
>>
>> That as well.

[...]

> The eventual solution for all this will be similar to Mitch's FMAC
> accumulator, i.e. a form of super-accumulator which allows one or more
> elements to be added per cycle, while delaying all inexact/rounding to
> the very end.
>
> A carry-save exact accumulator with ~1100 paired bits would only use a
> single full adder (2 or 3 gate delays?) to accept a new input, right?

Depends on the number of functional units you have. If you have
eight, for a high-performance CPU, you would need four layers
of adders to reduce the number of numbers to be added to two.

It might make sense to go to a signed digit implementation for
the 1100-bit adder (even though that doubles the number of bits of
storage needed for the register) and only do the carry propagation
once, upon storing. Adding one binary number to a signed digit
number should be cheap enough to be done in half a cycle, so adding
one number per cycle throughput sounds feasible.
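
To make the carry-save idea concrete, here is a minimal C sketch. It is only an illustration under stated assumptions, not anyone's hardware design: the accumulator is modelled as a 64-bit sum/carry pair instead of ~1100 bits, arithmetic is modulo 2^64, and the names cs_pair, csa3, cs_add, cs_sub, cs_resolve and csa_layers are invented here. Subtraction is done by inverting the operand and injecting the +1 into the low carry bit, which is always zero after the shift; that is one possible answer to the negation question raised in the thread. csa_layers counts how many layers of 3:2 compressors are needed to reduce n operands to two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical redundant accumulator: value = sum + carry (mod 2^64). */
typedef struct {
    uint64_t sum;
    uint64_t carry;
} cs_pair;

/* 3:2 compressor: one full adder per bit, no carry propagation.
   Guarantees a + b + c == sum + carry (mod 2^64). */
cs_pair csa3(uint64_t a, uint64_t b, uint64_t c) {
    cs_pair r;
    r.sum = a ^ b ^ c;
    r.carry = ((a & b) | (a & c) | (b & c)) << 1;
    return r;
}

/* Accept one new addend in a single compressor delay. */
cs_pair cs_add(cs_pair acc, uint64_t x) {
    return csa3(acc.sum, acc.carry, x);
}

/* Subtract via two's complement: -x == ~x + 1. The +1 goes into
   bit 0 of the carry word, which the shift has left at zero. */
cs_pair cs_sub(cs_pair acc, uint64_t x) {
    cs_pair r = csa3(acc.sum, acc.carry, ~x);
    assert((r.carry & 1u) == 0);
    r.carry |= 1u;
    return r;
}

/* The only carry-propagating add, done once at the end. */
uint64_t cs_resolve(cs_pair acc) {
    return acc.sum + acc.carry;
}

/* Layers of 3:2 compressors needed to reduce n operands to 2.
   Each layer turns every group of three operands into two. */
unsigned csa_layers(unsigned n) {
    unsigned layers = 0;
    while (n > 2) {
        n -= n / 3;
        layers++;
    }
    return layers;
}
```

With these definitions csa_layers(8) evaluates to 4, matching the "four layers" figure for eight inputs, and adding 10 then subtracting 3 resolves to 7.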

> I am not sure what is the best way for such a beast to handle both
> additions and subtractions: Do you need to invert/negate the value to be
> subtracted?

Probably easiest.

However, there are also other operations in reduction, multiplication
for example...

Re: VVM question

<sg31gh$fe2$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20089&group=comp.arch#20089

From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 07:59:27 -0700
Organization: A noiseless patient Spider
Message-ID: <sg31gh$fe2$1@dont-email.me>
In-Reply-To: <sg23h5$5dj$1@newsreader4.netcologne.de>
 by: Stephen Fuld - Tue, 24 Aug 2021 14:59 UTC

On 8/23/2021 11:27 PM, Thomas Koenig wrote:
> Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
>>> MitchAlsup <MitchAlsup@aol.com> schrieb:
>>>> Back to the posed question:
>>>> <
>>>> If the programmer unrolled the loop by hand (like DGEMM without transposes):
>>>> The LDs would need to be coded using offsets from the index register to be
>>>> recognized as dense::
>>>>
>>>> MOV Ri,#0
>>>> VEC R8,{}
>>>> LDD R4,[R2,Ri<<3]
>>>> LDD R5,[R2,Ri<<3+8]
>>>> LDD R6,[R2,Ri<<3+16]
>>>> LDD R7,[R2,Ri<<3+24]
>>>> ...
>>>> LOOP LT,Ri,#4,Rmax
>>>> <
>>>> The above code would be recognized as dense.
>>>> <
>>>> MOV Ri,#0
>>>> ADD R9,R2,#8
>>>> ADD R9,R2,#16
>>>> ADD R10,R2,#24
>>>> VEC R8,{}
>>>> LDD R4,[R2,Ri<<3]
>>>> LDD R5,[R8,Ri<<3+8]
>>>> LDD R6,[R9,Ri<<3+16]
>>>> LDD R7,[R10,Ri<<3+24]
>>>> ...
>>>> LOOP LT,Ri,#4,Rmax
>>>> <
>>>> This loop is harder to recognize as dense--even though the number of words
>>>> in the loop is less.
>>>
>>> Hm... all possible, but less elegant than it could be. All the
>>> manual unrolling and autovectorization and... rears its ugly
>>> head again.
>>>
>>> With all the mechanisms that VVM already offers, a way for the
>>> programmer or a programming language to specify that operations
>>> such as summation can be done in any order would be a very useful
>>> addition.
>>
>> I am probably missing something here.
>
> Or, equivalently, I have been explaining things badly :-)
>
>> To me the main advantage of
>> allowing out of order summations (using summations here as shorthand for
>> other similar type operations), was to allow the hardware to make use of
>> multiple functional units.
>
> Yes.
>
>> That is, a core with two adders could, if
>> allowed, complete the summation in about half the time.
>
> Yes.
>
>> Without that, I
>> don't see any advantage of out of order summations on VVM. If I am
>> wrong, please explain. If I am right, see below.
>
> Seeing below.
>
>>
>>
>>
>>> Suggestion:
>>>
>>> A variant of the VEC instruction, which does not specify a special
>>> register to keep the address in (which can be hardwired if there
>>> is no space in the thread header). This leaves five bits for
>>> "reduction" registers, which specify that operations on that
>>> register can be done in any order in the loop.
>>
>> Doing the operations in a different order isn't the problem.
>
> It's one half of the problem.
>
> The way VVM is currently specified, it's strictly in-order semantics:
> you write down a C loop, and the hardware delivers the results
> exactly in the order you wrote down.

Sort of. If you code the two statements
C := A + B
D := E + F
it may be, say because of a cache miss on the load of B, that the second
addition occurs before the first. I think it is more accurate to say
that the hardware makes it appear with correct semantics, even if
internally, it takes some liberties with the ordering, as long as the
result is the same.

> This would have to be
> changed.

If it makes a difference, yes. That is why I keep going back to summing
a vector of unsigned integers, where it doesn't. But if it does, then
you need some syntax like you describe below to tell the hardware it is
OK. Then you need some mechanism in the hardware that the compiler can
generate to tell the hardware whether it matters or not.
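
A small C sketch of why the distinction matters. It assumes IEEE 754 doubles and 32-bit unsigned wraparound; the function names are made up for illustration. Unsigned sums wrap modulo 2^32, so any order gives the same answer, while the double sums differ because rounding happens after every add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned sum, front to back; overflow wraps modulo 2^32. */
uint32_t sum_u32_forward(const uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Same elements, back to front. */
uint32_t sum_u32_reverse(const uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint32_t s = 0;
    for (size_t i = n; i-- > 0;)
        s += v[i];
    return s;
}

/* Floating-point sum, front to back; each add rounds. */
double sum_f64_forward(const double *v, size_t n) {
    assert(v != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Same elements, back to front; the rounding points differ. */
double sum_f64_reverse(const double *v, size_t n) {
    assert(v != NULL || n == 0);
    double s = 0.0;
    for (size_t i = n; i-- > 0;)
        s += v[i];
    return s;
}
```

For the input {0xFFFFFFFF, 2, 5} both unsigned sums are 6. For {1.0, 1e100, -1e100} the forward double sum is 0.0 (the 1.0 is absorbed by 1e100) and the reverse sum is 1.0.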

>> You need a
>> way to allow/specify the two partial sums to be added together in the
>> end.
>
> That as well.
>
>> I don't see your proposal as doing that.
>
> I thought I had implied it, but it was obviously not clear enough.

The problem is, you can't use a single accumulator as the hardware can't
do two/four adds to the same accumulator in the same cycle. If you are
doing say 2/4 partial sums, you need 2/4 places to store them (e.g.
registers), say in the event of an interrupt, and you need to tell the
hardware that, at the end, it needs to add the partial sums. It is this
syntax in the VVM ISA that is not there.
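
In plain C, the partial-sum scheme described above might look like the following sketch. The lane count of four and the name sum_four_lanes are assumptions for illustration; the point is that each lane needs its own storage and that a final combining step has to be expressed somewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Sum v[0..n) using four independent partial sums, then combine them.
   The four lanes have no dependency on each other inside the loop, so
   hardware with several adders could update them in the same cycle. */
double sum_four_lanes(const double *v, size_t n) {
    assert(v != NULL || n == 0);
    double lane[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;

    /* Main loop: one element per lane per iteration. */
    for (; i + 4 <= n; i += 4) {
        lane[0] += v[i];
        lane[1] += v[i + 1];
        lane[2] += v[i + 2];
        lane[3] += v[i + 3];
    }

    /* Tail: leftover elements keep the same lane assignment. */
    for (; i < n; i++)
        lane[i % 4] += v[i];

    /* The final step: add the partial sums together. */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For floating-point data the result can differ in the low bits from a strictly in-order sum, which is why the programmer has to opt in.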

>> And, of course, it is
>> limited to five registers which must be specified in the hardware design.
>
> Five reductions in a loop would be plenty, it is usually one, or more
> rarely two.

Sure. I was thinking more about the fact that the choice of five, say
R1-R5, limits the compiler's register allocation flexibility. Perhaps
this isn't a big deal, but it is something.

>>> This would be a perfect match for OpenMP's reduction clause or
>>> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
>>
>> I am not an OpenMP person, and my knowledge of Fortran is old, so could
>> you please give a brief explanation of what these two things do? Thanks.
>
> #pragma omp simd reduction(+:var)
>
> before a loop will tell the compiler that it can go wild
> with the sequence of loops but that "var" will be used
> in a summation reduction.
>
> DO CONCURRENT also runs loop iterations in an unspecified order;
> the REDUCTION clause would then allow, for example,
> summing up all elements.

That makes excellent sense. Thank you.
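
For reference, a minimal C loop using the pragma quoted above might look like this. The function name and parameters are made up for illustration. A compiler without OpenMP support enabled ignores the pragma and runs the loop in order; with GCC or Clang, the -fopenmp-simd option enables it.

```c
#include <assert.h>
#include <stddef.h>

/* Sum x[0..n). The pragma tells the compiler that "var" is a summation
   reduction, so it may keep several partial sums and combine them at
   the end instead of adding strictly in index order. */
double sum_with_simd_reduction(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double var = 0.0;
    #pragma omp simd reduction(+:var)
    for (size_t i = 0; i < n; i++)
        var += x[i];
    return var;
}
```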

> One problem with C and similar languages is that you have
> to specify an ordering of the loop explicitly,

So you need something like a pragma to perform similarly to the
Concurrent Reduction clause in Fortran.

> which shapes
> programmer's thinking and also shapes intermediate languages
> for compilers...

Are you saying something about C programmers versus say Fortran
programmers? :-)

As for intermediate languages, if they can solve it for Fortran, I guess
they could do the same for C.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM question

<ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20090&group=comp.arch#20090

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 08:25:36 -0700 (PDT)
Message-ID: <ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com>
In-Reply-To: <sg1mru$h6d$1@dont-email.me>
 by: MitchAlsup - Tue, 24 Aug 2021 15:25 UTC

On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> >> Back to the posed question:
> >> <
> >> If the programmer unrolled the loop by hand (like DGEMM without transposes):
> >> The LDs would need to be coded using offsets from the index register to be
> >> recognized as dense::
> >>
> >> MOV Ri,#0
> >> VEC R8,{}
> >> LDD R4,[R2,Ri<<3]
> >> LDD R5,[R2,Ri<<3+8]
> >> LDD R6,[R2,Ri<<3+16]
> >> LDD R7,[R2,Ri<<3+24]
> >> ...
> >> LOOP LT,Ri,#4,Rmax
> >> <
> >> The above code would be recognized as dense.
> >> <
> >> MOV Ri,#0
> >> ADD R9,R2,#8
> >> ADD R9,R2,#16
> >> ADD R10,R2,#24
> >> VEC R8,{}
> >> LDD R4,[R2,Ri<<3]
> >> LDD R5,[R8,Ri<<3+8]
> >> LDD R6,[R9,Ri<<3+16]
> >> LDD R7,[R10,Ri<<3+24]
> >> ...
> >> LOOP LT,Ri,#4,Rmax
> >> <
> >> This loop is harder to recognize as dense--even though the number of words
> >> in the loop is less.
> >
> > Hm... all possible, but less elegant than it could be. All the
> > manual unrolling and autovectorization and... rears its ugly
> > head again.
> >
> > With all the mechanisms that VVM already offers, a way for the
> > programmer or a programming language to specify that operations
> > such as summation can be done in any order would be a very useful
> > addition.
> I am probably missing something here. To me the main advantage of
> allowing out of order summations (using summations here as shorthand for
<
The word you are looking for is "reduce", or "reduction"
I want this series of calculations reduced to a single number.
<
> other similar type operations), was to allow the hardware to make use of
> multiple functional units. That is, a core with two adders could, if
> allowed, complete the summation in about half the time.
<
This comes with numeric "issues". As such, I do not want compilers
creating code to use wide resources WITHOUT some word from the
programmer saying it is "OK this time". (#pragma or such)
<
> Without that, I
> don't see any advantage of out of order summations on VVM. If I am
> wrong, please explain. If I am right, see below.
> > Suggestion:
> >
> > A variant of the VEC instruction, which does not specify a special
> > register to keep the address in (which can be hardwired if there
> > is no space in the thread header). This leaves five bits for
> > "reduction" registers, which specify that operations on that
> > register can be done in any order in the loop.
<
> Doing the operations in a different order isn't the problem. You need a
> way to allow/specify the two partial sums to be added together in the
> end. I don't see your proposal as doing that. And, of course, it is
> limited to five registers which must be specified in the hardware design.
<
Done properly, you want each loop performing Kahan-Babuška summations,
not lossy, clumsy double FP.
<
Kahan-Babuška summation should be known to and embedded in the compiler.
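
As an illustration, here is a minimal C sketch of compensated summation. It uses the Neumaier variant of Kahan-Babuška summation, which is one common formulation and not necessarily the exact one meant above; the function names are made up. It assumes the compiler does not reassociate floating-point operations (so no -ffast-math).

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without pulling in the math library. */
double abs_f64(double x) {
    return x < 0.0 ? -x : x;
}

/* Plain in-order summation, for comparison. */
double sum_naive(const double *v, size_t n) {
    assert(v != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Neumaier's variant of Kahan-Babuska summation: "comp" collects the
   low-order bits that each rounded add throws away. */
double sum_neumaier(const double *v, size_t n) {
    assert(v != NULL || n == 0);
    double sum = 0.0;
    double comp = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = v[i];
        double t = sum + x;
        if (abs_f64(sum) >= abs_f64(x))
            comp += (sum - t) + x;   /* low bits of x were lost */
        else
            comp += (x - t) + sum;   /* low bits of sum were lost */
        sum = t;
    }
    return sum + comp;               /* apply the correction once */
}
```

For the input {1.0, 1e100, 1.0, -1e100} the naive sum returns 0.0, while the compensated sum returns 2.0.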
<
> >
> > This would be a perfect match for OpenMP's reduction clause or
> > for the planned REDUCTION addition to Fortran's DO CONCURRENT.
> I am not an OpenMP person, and my knowledge of Fortran is old, so could
> you please give a brief explanation of what these two things do? Thanks.
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: VVM question

<249ce81a-3671-4e2c-a4d3-96de0caf9a70n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20091&group=comp.arch#20091

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 08:28:54 -0700 (PDT)
Message-ID: <249ce81a-3671-4e2c-a4d3-96de0caf9a70n@googlegroups.com>
In-Reply-To: <sg23h5$5dj$1@newsreader4.netcologne.de>
 by: MitchAlsup - Tue, 24 Aug 2021 15:28 UTC

On Tuesday, August 24, 2021 at 1:27:51 AM UTC-5, Thomas Koenig wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
> > On 8/23/2021 2:29 PM, Thomas Koenig wrote:
> >> MitchAlsup <Mitch...@aol.com> schrieb:
> >>> Back to the posed question:
> >>> <
> >>> If the programmer unrolled the loop by hand (like DGEMM without transposes):
> >>> The LDs would need to be coded using offsets from the index register to be
> >>> recognized as dense::
> >>>
> >>> MOV Ri,#0
> >>> VEC R8,{}
> >>> LDD R4,[R2,Ri<<3]
> >>> LDD R5,[R2,Ri<<3+8]
> >>> LDD R6,[R2,Ri<<3+16]
> >>> LDD R7,[R2,Ri<<3+24]
> >>> ...
> >>> LOOP LT,Ri,#4,Rmax
> >>> <
> >>> The above code would be recognized as dense.
> >>> <
> >>> MOV Ri,#0
> >>> ADD R9,R2,#8
> >>> ADD R9,R2,#16
> >>> ADD R10,R2,#24
> >>> VEC R8,{}
> >>> LDD R4,[R2,Ri<<3]
> >>> LDD R5,[R8,Ri<<3+8]
> >>> LDD R6,[R9,Ri<<3+16]
> >>> LDD R7,[R10,Ri<<3+24]
> >>> ...
> >>> LOOP LT,Ri,#4,Rmax
> >>> <
> >>> This loop is harder to recognize as dense--even though the number of words
> >>> in the loop is less.
> >>
> >> Hm... all possible, but less elegant than it could be. All the
> >> manual unrolling and autovectorization and... rears its ugly
> >> head again.
> >>
> >> With all the mechanisms that VVM already offers, a way for the
> >> programmer or a programming language to specify that operations
> >> such as summation can be done in any order would be a very useful
> >> addition.
> >
> > I am probably missing something here.
> Or, equivalently, I have been explaining things badly :-)
> > To me the main advantage of
> > allowing out of order summations (using summations here as shorthand for
> > other similar type operations), was to allow the hardware to make use of
> > multiple functional units.
> Yes.
> > That is, a core with two adders could, if
> > allowed, complete the summation in about half the time.
> Yes.
<
Who chooses time versus numerical precision? Human, or software.
IEEE 754-1985 took hardware off the list of who could choose.
<
Even in integer code, unrolled summation changes where exceptions
get thrown (if thrown at all).
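
A small C sketch of that effect, with overflow detection standing in for a thrown exception. The function names and the two-lane split are assumptions made for illustration only.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Add with an explicit overflow check; returns false instead of
   overflowing, leaving *out untouched in that case. */
bool checked_add(int a, int b, int *out) {
    assert(out != NULL);
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}

/* Strictly in-order sum. Returns the index of the element whose add
   overflowed, or -1 on success (with the result in *total). */
long sum_in_order(const int *v, size_t n, int *total) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        if (!checked_add(s, v[i], &s))
            return (long)i;
    *total = s;
    return -1;
}

/* "Unrolled" sum: even elements in lane 0, odd elements in lane 1,
   combined at the end. Returns the index where a lane overflowed,
   n if the final combine overflowed, or -1 on success. */
long sum_two_lanes(const int *v, size_t n, int *total) {
    int lane[2] = {0, 0};
    for (size_t i = 0; i < n; i++)
        if (!checked_add(lane[i & 1], v[i], &lane[i & 1]))
            return (long)i;
    if (!checked_add(lane[0], lane[1], total))
        return (long)n;
    return -1;
}
```

For the input {INT_MAX, 1, -1} the in-order sum overflows at index 1, while the two-lane sum never overflows and returns INT_MAX.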
<
> >Without that, I
> > don't see any advantage of out of order summations on VVM. If I am
> > wrong, please explain. If I am right, see below.
> Seeing below.
> >
> >
> >
> >> Suggestion:
> >>
> >> A variant of the VEC instruction, which does not specify a special
> >> register to keep the address in (which can be hardwired if there
> >> is no space in the thread header). This leaves five bits for
> >> "reduction" registers, which specify that operations on that
> >> register can be done in any order in the loop.
> >
> > Doing the operations in a different order isn't the problem.
> It's one half of the problem.
>
> The way VVM is currently specified, it's strictly in-order semantics:
> you write down a C loop, and the hardware delivers the results
> exactly in the order you wrote down. This would have to be
> changed.
> > You need a
> > way to allow/specify the two partial sums to be added together in the
> > end.
> That as well.
> >I don't see your proposal as doing that.
> I thought I had implied it, but it was obviously not clear enough.
> > And, of course, it is
> > limited to five registers which must be specified in the hardware design.
> Five reductions in a loop would be plenty, it is usually one, or more
> rarely two.
> >> This would be a perfect match for OpenMP's reduction clause or
> >> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
> >
> > I am not an OpenMP person, and my knowledge of Fortran is old, so could
> > you please give a brief explanation of what these two things do? Thanks.
> #pragma omp simd reduction(+:var)
>
> before a loop will tell the compiler that it can go wild
> with the sequence of loops but that "var" will be used
> in a summation reduction.
>
> DO CONCURRENT also runs loop iterations in an unspecified order;
> the REDUCTION clause would then allow, for example,
> summing up all elements.
>
> One problem with C and similar languages is that you have
> to specify an ordering of the loop explicitly, which shapes
> programmer's thinking and also shapes intermediate languages
> for compilers...

Re: VVM question

<3ff4fd0f-cdd2-4095-8aa0-fbfa37ec9e49n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20092&group=comp.arch#20092

From: MitchAl...@aol.com (MitchAlsup)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 08:37:38 -0700 (PDT)
Message-ID: <3ff4fd0f-cdd2-4095-8aa0-fbfa37ec9e49n@googlegroups.com>
In-Reply-To: <sg29ht$rbs$1@gioia.aioe.org>
 by: MitchAlsup - Tue, 24 Aug 2021 15:37 UTC

On Tuesday, August 24, 2021 at 3:10:39 AM UTC-5, Terje Mathisen wrote:
> Thomas Koenig wrote:
> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
> >> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
> >>> MitchAlsup <Mitch...@aol.com> schrieb:
> >>>> Back to the posed question:
> >>>> <
> >>>> If the programmer unrolled the loop by hand (like DGEMM without transposes):
> >>>> The LDs would need to be coded using offsets from the index register to be
> >>>> recognized as dense::
> >>>>
> >>>> MOV Ri,#0
> >>>> VEC R8,{}
> >>>> LDD R4,[R2,Ri<<3]
> >>>> LDD R5,[R2,Ri<<3+8]
> >>>> LDD R6,[R2,Ri<<3+16]
> >>>> LDD R7,[R2,Ri<<3+24]
> >>>> ...
> >>>> LOOP LT,Ri,#4,Rmax
> >>>> <
> >>>> The above code would be recognized as dense.
> >>>> <
> >>>> MOV Ri,#0
> >>>> ADD R9,R2,#8
> >>>> ADD R9,R2,#16
> >>>> ADD R10,R2,#24
> >>>> VEC R8,{}
> >>>> LDD R4,[R2,Ri<<3]
> >>>> LDD R5,[R8,Ri<<3+8]
> >>>> LDD R6,[R9,Ri<<3+16]
> >>>> LDD R7,[R10,Ri<<3+24]
> >>>> ...
> >>>> LOOP LT,Ri,#4,Rmax
> >>>> <
> >>>> This loop is harder to recognize as dense--even though the number of words
> >>>> in the loop is less.
> >>>
> >>> Hm... all possible, but less elegant than it could be. All the
> >>> manual unrolling and autovectorization and... rears its ugly
> >>> head again.
> >>>
> >>> With all the mechanisms that VVM already offers, a way for the
> >>> programmer or a programming language to specify that operations
> >>> such as summation can be done in any order would be a very useful
> >>> addition.
> >>
> >> I am probably missing something here.
> >
> > Or, equivalently, I have been explaining things badly :-)
> >
> >> To me the main advantage of
> >> allowing out of order summations (using summations here as shorthand for
> >> other similar type operations), was to allow the hardware to make use of
> >> multiple functional units.
> >
> > Yes.
> >
> >> That is, a core with two adders could, if
> >> allowed, complete the summation in about half the time.
> >
> > Yes.
> >
> >> Without that, I
> >> don't see any advantage of out of order summations on VVM. If I am
> >> wrong, please explain. If I am right, see below.
> >
> > Seeing below.
> >
> >>
> >>
> >>
> >>> Suggestion:
> >>>
> >>> A variant of the VEC instruction, which does not specify a special
> >>> register to keep the address in (which can be hardwired if there
> >>> is no space in the thread header). This leaves five bits for
> >>> "reduction" registers, which specify that operations on that
> >>> register can be done in any order in the loop.
> >>
> >> Doing the operations in a different order isn't the problem.
> >
> > It's one half of the problem.
> >
> > The way VVM is currently specified, it's strictly in-order semantics:
> > you write down a C loop, and the hardware delivers the results
> > exactly in the order you wrote down. This would have to be
> > changed.
> >
> >
> >> You need a
> >> way to allow/specify the two partial sums to be added together in the
> >> end.
> >
> > That as well.
> >
> >> I don't see your proposal as doing that.
> >
> > I thought I had implied it, but it was obviously not clear enough.
> >
> >
> >> And, of course, it is
> >> limited to five registers which must be specified in the hardware design.
> >
> > Five reductions in a loop would be plenty, it is usually one, or more
> > rarely two.
> >
> >>> This would be a perfect match for OpenMP's reduction clause or
> >>> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
> >>
> >> I am not an OpenMP person, and my knowledge of Fortran is old, so could
> >> you please give a brief explanation of what these two things do? Thanks.
> >
> > #pragma omp simd reduction(+:var)
> >
> > before a loop will tell the compiler that it can go wild
> > with the sequence of loops but that "var" will be used
> > in a summation reduction.
> >
> > DO CONCURRENT also runs loops in an unspecified order,
> > the REDUCTION clause would then allow, for example,
> > summing up all elements.
> >
> > One problem with C and similar languages is that you have
> > to specify an ordering of the loop explicitly, which shapes
> > programmer's thinking and also shapes intermediate languages
> > for compilers...
> >
> The eventual solution for all this will be similar to Mitch's FMAC
> accumulator, i.e. a form of super-accumulator which allows one or more
> elements to be added per cycle, while delaying all inexact/rounding to
> the very end.
>
> A carry-save exact accumulator with ~1100 paired bits would only use a
> single full adder (2 or 3 gate delays?) to accept a new input, right?
<
A carry save accumulator uses a 4-input-2-output compressor. With
true complement signals, this is 2 gates of delay (3-input XOR is 1 gate
in TC form). Without TC signals it is 3 gates.
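
As a rough software model of the carry-save idea discussed here (a sketch
only, shrunk to one 64-bit word instead of ~1100 bits; the names csa_acc,
csa_add and csa_resolve are made up for illustration), the accumulator can
be kept as separate sum and carry words. Each new input then costs one full
adder per bit position with no carry propagation, and the carries are
resolved only once at the end. This models the 3-input case from the
question; a 4-2 compressor could be modelled as two such steps back to back,
which is an assumption of this sketch and not a description of the actual
circuit.

```c
#include <stdint.h>

/* Hypothetical 64-bit model of a carry-save accumulator.
 * Invariant: sum + carry == running total (mod 2^64). */
typedef struct {
    uint64_t sum;    /* bitwise sum bits */
    uint64_t carry;  /* pending carries, already shifted into place */
} csa_acc;

/* Add one input with a 3:2 compressor: one full adder per bit,
 * no carry ripples across bit positions. */
void csa_add(csa_acc *acc, uint64_t x)
{
    uint64_t s = acc->sum;
    uint64_t c = acc->carry;

    acc->sum = s ^ c ^ x;                             /* full-adder sum */
    acc->carry = ((s & c) | (s & x) | (c & x)) << 1;  /* majority, shifted */
}

/* The only real carry propagation happens here, once, at the end. */
uint64_t csa_resolve(const csa_acc *acc)
{
    return acc->sum + acc->carry;
}
```

After any sequence of csa_add calls, csa_resolve returns the same value as
ordinary wrap-around addition of the inputs.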
>
> I am not sure what is the best way for such a beast to handle both
> additions and subtractions: Do you need to invert/negate the value to be
> subtracted?
<
The key property is that the value spinning around in the accumulator
keeps the polarity it started with (+ or - but not both) while
the multiplier can produce + or - on a per calculation basis.
But in practice one can work in the negations at the cost of a
single additional gate of delay.
<
> Terje
>
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"

Re: VVM question

<sg3416$11k$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20093&group=comp.arch#20093

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 08:42:29 -0700
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <sg3416$11k$1@dont-email.me>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 24 Aug 2021 15:42:30 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fc59f1cbfecf86977d6725511f354633";
logging-data="1076"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/Esn4PcabSEcrMVVyn9+v+dDJQFlIVnwY="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:wVxG6qHWLtmtUYMjA0h8Qzcu1MY=
In-Reply-To: <ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 24 Aug 2021 15:42 UTC

On 8/24/2021 8:25 AM, MitchAlsup wrote:
> On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:

snip

>>> With all the mechanisms that VVM already offers, a way for the
>>> programmer or a programming language to specify that operations
>>> such as summation can be done in any order would be a very useful
>>> addition.
>> I am probably missing something here. To me the main advantage of
>> allowing out of order summations (using summations here as shorthand for
> <
> The word you are looking for is "reduce", or "reduction"
> I want this series of calculations reduced to a single number.

Yes, I worded that clumsily. I probably should have used the word
"operation" to indicate whatever is needed to perform the reduction.
But using simple summation as an example, makes saying things like
"partial sums" meaningful. Is there a generally accepted better term
for the intermediate results of a reduction?

>> other similar type operations), was to allow the hardware to make use of
>> multiple functional units. That is, a core with two adders could, if
>> allowed, complete the summation in about half the time.
> <
> This comes with numeric "issues". As with such, I do not want compilers
> creating code to use wide resources WITHOUT some word from the
> programmer saying it is "OK this time". (#pragma or such)

Absolutely agreed. But Thomas has said such things are coming, at least
for Fortran. But adding a pragma for C doesn't seem unreasonable, as it
potentially allows a huge performance improvement. But clearly it
shouldn't be the default.

> <
>> Without that, I
>> don't see any advantage of out of order summations on VVM. If I am
>> wrong, please explain. If I am right, see below.
>>> Suggestion:
>>>
>>> A variant of the VEC instruction, which does not specify a special
>>> register to keep the address in (which can be hardwired if there
>>> is no space in the thread header). This leaves five bits for
>>> "reduction" registers, which specify that operations on that
>>> register can be done in any order in the loop.
> <
>> Doing the operations in a different order isn't the problem. You need a
>> way to allow/specify the two partial sums to be added together in the
>> end. I don't see your proposal as doing that. And, of course, it is
>> limited to five registers which must be specified in the hardware design.
> <
> Done properly, you want each loop performing Kahan-Babuška summations
> not lossy clumsy double FP.
> <
> Kahan-Babuška summation should be known to and embedded in the compiler.

While I agree with that, it still doesn't address the issues of needing
multiple intermediate results "places" (to allow multiple partial
reductions (is that the right word?) to proceed in parallel), and how to
combine them at the end. You need some ISA syntax to specify such things.
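
For reference, a minimal sketch of the compensated summation mentioned
above, here in Neumaier's formulation of Kahan-Babuška (the function name
sum_neumaier is an assumption, and this is plain C, not anything
VVM-specific). It carries a second variable holding the rounding error lost
by each addition and adds it back once at the end. Note that it is still a
strictly ordered loop, so it does not by itself answer the question of where
parallel partial results would live.

```c
#include <stddef.h>

/* Compensated summation, Neumaier's variant of Kahan-Babuska.
 * sum  : the running floating-point sum
 * comp : accumulated low-order bits lost by each addition */
double sum_neumaier(const double *a, size_t n)
{
    double sum = 0.0;
    double comp = 0.0;

    for (size_t i = 0; i < n; i++) {
        double x = a[i];
        double t = sum + x;
        double abs_sum = sum < 0.0 ? -sum : sum;
        double abs_x = x < 0.0 ? -x : x;

        if (abs_sum >= abs_x)
            comp += (sum - t) + x;   /* low bits of x were lost */
        else
            comp += (x - t) + sum;   /* low bits of sum were lost */
        sum = t;
    }
    return sum + comp;               /* apply the correction once */
}
```

For the input {1.0, 1e100, 1.0, -1e100} a naive in-order loop returns 0.0,
while this version returns 2.0.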

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM question

<d5dd6304-8eec-421a-ba4d-274142df8de5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20094&group=comp.arch#20094

X-Received: by 2002:a37:aa01:: with SMTP id t1mr27730621qke.369.1629819819419;
Tue, 24 Aug 2021 08:43:39 -0700 (PDT)
X-Received: by 2002:a9d:4e96:: with SMTP id v22mr31160693otk.110.1629819819200;
Tue, 24 Aug 2021 08:43:39 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 24 Aug 2021 08:43:38 -0700 (PDT)
In-Reply-To: <sg2fsq$bpd$1@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<sg23h5$5dj$1@newsreader4.netcologne.de> <sg29ht$rbs$1@gioia.aioe.org> <sg2fsq$bpd$1@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d5dd6304-8eec-421a-ba4d-274142df8de5n@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 24 Aug 2021 15:43:39 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Tue, 24 Aug 2021 15:43 UTC

On Tuesday, August 24, 2021 at 4:58:52 AM UTC-5, Thomas Koenig wrote:
> Terje Mathisen <terje.m...@tmsw.no> schrieb:
> > Thomas Koenig wrote:
> >> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
> >>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
> [some snippage, hopefully context-preserving]
> >>>> With all the mechanisms that VVM already offers, a way for the
> >>>> programmer or a programming language to specify that operations
> >>>> such as summation can be done in any order would be a very useful
> >>>> addition.
> [...]
> >>>> Suggestion:
> >>>>
> >>>> A variant of the VEC instruction, which does not specify a special
> >>>> register to keep the address in (which can be hardwired if there
> >>>> is no space in the thread header). This leaves five bits for
> >>>> "reduction" registers, which specify that operations on that
> >>>> register can be done in any order in the loop.
> [...]
> >> The way VVM is currently specified, it's strictly in-order semantics:
> >> you write down a C loop, and the hardware delivers the results
> >> exactly in the order you wrote down. This would have to be
> >> changed.
> >>
> >>
> >>> You need a
> >>> way to allow/specify the two partial sums to be added together in the
> >>> end.
> >>
> >> That as well.
> [...]
> > The eventual solution for all this will be similar to Mitch's FMAC
> > accumulator, i.e. a form of super-accumulator which allows one or more
> > elements to be added per cycle, while delaying all inexact/rounding to
> > the very end.
> >
> > A carry-save exact accumulator with ~1100 paired bits would only use a
> > single full adder (2 or 3 gate delays?) to accept a new input, right?
<
> Depends on the number of functional units you have. If you have
> eight, for a high-performance CPU, you would need four layers
> of adders to reduce the number of numbers to be added to two.
<
Each layer of 4-2 compression costs 2 gates of delay, so 8-2 costs 4 gates,
16-2 costs 6 gates, and so on.
>
> It might make sense to go to a signed digit implementation for
> the 1100-bit adder (even though that doubles the number of bits of
> storage needed for the register) and only do the carry propagation
> once, upon storing.
<
Carry save is perfectly adequate for an accumulator as carries are
only moving forward log2(#inputs) positions per cycle. It is easy to build
a Carry Save Find First circuit, so you know where in that 1076 bit
accumulator is the most likely highest bit of significance. {You
will only be off by 1 at most}
<
> Adding one binary number to a signed digit
> number should be cheap enough to be done in half a cycle, so adding
> one number per cycle throughput sounds feasible.
<
> > I am not sure what is the best way for such a beast to handle both
> > additions and subtractions: Do you need to invert/negate the value to be
> > subtracted?
<
New value into summation can be + or -
Current running sum cannot (or can at another couple of gate delays.)
<
> Probably easiest.
>
> However, there are also other operations in reduction, multiplication
> for example...

Re: VVM question

<1aec6237-b33c-4cf2-b9a2-faa38d702a3bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20095&group=comp.arch#20095

X-Received: by 2002:a05:6214:1926:: with SMTP id es6mr22238314qvb.3.1629820374101;
Tue, 24 Aug 2021 08:52:54 -0700 (PDT)
X-Received: by 2002:a9d:d35:: with SMTP id 50mr31922918oti.22.1629820373952;
Tue, 24 Aug 2021 08:52:53 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder8.news.weretis.net!proxad.net!feeder1-2.proxad.net!209.85.160.216.MISMATCH!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.arch
Date: Tue, 24 Aug 2021 08:52:53 -0700 (PDT)
In-Reply-To: <sg3416$11k$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com> <sg3416$11k$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <1aec6237-b33c-4cf2-b9a2-faa38d702a3bn@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 24 Aug 2021 15:52:54 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: MitchAlsup - Tue, 24 Aug 2021 15:52 UTC

On Tuesday, August 24, 2021 at 10:42:32 AM UTC-5, Stephen Fuld wrote:
> On 8/24/2021 8:25 AM, MitchAlsup wrote:
> > On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
> >> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
> snip
> >>> With all the mechanisms that VVM already offers, a way for the
> >>> programmer or a programming language to specify that operations
> >>> such as summation can be done in any order would be a very useful
> >>> addition.
> >> I am probably missing something here. To me the main advantage of
> >> allowing out of order summations (using summations here as shorthand for
> > <
> > The word you are looking for is "reduce", or "reduction"
> > I want this series of calculations reduced to a single number.
> Yes, I worded that clumsily. I probably should have used the word
> "operation" to indicate whatever is needed to perform the reduction.
> But using simple summation as an example, makes saying things like
> "partial sums" meaningful. Is there a generally accepted better term
> for the intermediate results of a reduction?
<
As far as I know: accumulator is about as good as it gets.
<
> >> other similar type operations), was to allow the hardware to make use of
> >> multiple functional units. That is, a core with two adders could, if
> >> allowed, complete the summation in about half the time.
> > <
> > This comes with numeric "issues". As with such, I do not want compilers
> > creating code to use wide resources WITHOUT some word from the
> > programmer saying it is "OK this time". (#pragma or such)
<
> Absolutely agreed. But Thomas has said such things are coming, at least
> for Fortran. But adding a pragma for C doesn't seem unreasonable, as it
> potentially allows a huge performance improvement. But clearly it
> shouldn't be the default.
> > <
> >> Without that, I
> >> don't see any advantage of out of order summations on VVM. If I am
> >> wrong, please explain. If I am right, see below.
> >>> Suggestion:
> >>>
> >>> A variant of the VEC instruction, which does not specify a special
> >>> register to keep the address in (which can be hardwired if there
> >>> is no space in the thread header). This leaves five bits for
> >>> "reduction" registers, which specify that operations on that
> >>> register can be done in any order in the loop.
> > <
> >> Doing the operations in a different order isn't the problem. You need a
> >> way to allow/specify the two partial sums to be added together in the
> >> end. I don't see your proposal as doing that. And, of course, it is
> >> limited to five registers which must be specified in the hardware design.
> > <
> > Done properly, you want each loop performing Kahan-Babuška summations
> > not lossy clumsy double FP.
> > <
> > Kahan-Babuška summation should be known to and embedded in the compiler.
<
> While I agree with that, it still doesn't address the issues of needing
> multiple intermediate results "places" (to allow multiple partial
> reductions (is that the right word?) to proceed in parallel), and how to
> combine them at the end. You need some ISA syntax to specify such things.
<
A single "core" can work on a single reduction.
It takes multiple cores to work on multiple reductions simultaneously.
Cores, by themselves, are multiple clock cycles apart, and so are not,
in general, allowed to know anything about what the others are up to.
<
To properly sum multiple reductions one needs a way to ship the
multiple 1076-bit reductions to a common adder/normalizer/rounder.
One HAS to perform a single rounding to get the correct final result
in a Kahan sense (754 has not gone this far, yet).
<
unums (posits) use the word "quire" as the accumulator.
<
> --
> - Stephen Fuld
> (e-mail address disguised to prevent spam)

Re: VVM question

<sg35r2$ucg$1@gioia.aioe.org>

https://www.novabbs.com/devel/article-flat.php?id=20096&group=comp.arch#20096

Path: i2pn2.org!i2pn.org!aioe.org!sa6kcu6mSvh5VOr71AVWvw.user.46.165.242.91.POSTED!not-for-mail
From: terje.ma...@tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 18:13:21 +0200
Organization: Aioe.org NNTP Server
Message-ID: <sg35r2$ucg$1@gioia.aioe.org>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com>
<sg3416$11k$1@dont-email.me>
<1aec6237-b33c-4cf2-b9a2-faa38d702a3bn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: gioia.aioe.org; logging-data="31120"; posting-host="sa6kcu6mSvh5VOr71AVWvw.user.gioia.aioe.org"; mail-complaints-to="abuse@aioe.org";
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101
Firefox/60.0 SeaMonkey/2.53.8.1
X-Notice: Filtered by postfilter v. 0.9.2
 by: Terje Mathisen - Tue, 24 Aug 2021 16:13 UTC

MitchAlsup wrote:
> On Tuesday, August 24, 2021 at 10:42:32 AM UTC-5, Stephen Fuld wrote:
>> On 8/24/2021 8:25 AM, MitchAlsup wrote:
>>> On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
>>>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
>> snip
>>>>> With all the mechanisms that VVM already offers, a way for the
>>>>> programmer or a programming language to specify that operations
>>>>> such as summation can be done in any order would be a very useful
>>>>> addition.
>>>> I am probably missing something here. To me the main advantage of
>>>> allowing out of order summations (using summations here as shorthand for
>>> <
>>> The word you are looking for is "reduce", or "reduction"
>>> I want this series of calculations reduced to a single number.
>> Yes, I worded that clumsily. I probably should have used the word
>> "operation" to indicate whatever is needed to perform the reduction.
>> But using simple summation as an example, makes saying things like
>> "partial sums" meaningful. Is there a generally accepted better term
>> for the intermediate results of a reduction?
> <
> As far as I know: accumulator is about as good as it gets.
> <
>>>> other similar type operations), was to allow the hardware to make use of
>>>> multiple functional units. That is, a core with two adders could, if
>>>> allowed, complete the summation in about half the time.
>>> <
>>> This comes with numeric "issues". As with such, I do not want compilers
>>> creating code to use wide resources WITHOUT some word from the
>>> programmer saying it is "OK this time". (#pragma or such)
> <
>> Absolutely agreed. But Thomas has said such things are coming, at least
>> for Fortran. But adding a pragma for C doesn't seem unreasonable, as it
>> potentially allows a huge performance improvement. But clearly it
>> shouldn't be the default.
>>> <
>>>> Without that, I
>>>> don't see any advantage of out of order summations on VVM. If I am
>>>> wrong, please explain. If I am right, see below.
>>>>> Suggestion:
>>>>>
>>>>> A variant of the VEC instruction, which does not specify a special
>>>>> register to keep the address in (which can be hardwired if there
>>>>> is no space in the thread header). This leaves five bits for
>>>>> "reduction" registers, which specify that operations on that
>>>>> register can be done in any order in the loop.
>>> <
>>>> Doing the operations in a different order isn't the problem. You need a
>>>> way to allow/specify the two partial sums to be added together in the
>>>> end. I don't see your proposal as doing that. And, of course, it is
>>>> limited to five registers which must be specified in the hardware design.
>>> <
>>> Done properly, you want each loop performing Kahan-Babuška summations
>>> not lossy clumsy double FP.
>>> <
>>> Kahan-Babuška summation should be known to and embedded in the compiler.
> <
>> While I agree with that, it still doesn't address the issues of needing
>> multiple intermediate results "places" (to allow multiple partial
>> reductions (is that the right word?) to proceed in parallel), and how to
>> combine them at the end. You need some ISA syntax to specify such things.
> <
> A single "core" can work on a single reduction.
> It takes multiple cores to work on multiple reductions simultaneously.
> Cores, by themselves, are multiple clock cycles apart, and so are not,
> in general, allowed to know anything about what the others are up to.
> <
> To properly sum multiple reductions one needs a way to ship the
> multiple 1076-bit reductions to a common adder/normalizer/rounder.
> One HAS to perform a single rounding to get the correct final result
> in a Kahan sense (754 has not gone this far, yet).

You can round each super-acc result to 109 bits or more, at which point
you are guaranteed to get the same result when you add these two
together and then do a final round to double, as if you actually did the
full 1100-bit merge and then rounded, right?

....

No, it is easy to construct a counter-example. :-(

OTOH, I expect that such a super-acc with redundant storage would have a
way to canonicalize it while storing, and at this point it is easy to
merge multiple accumulators as unsigned integer arrays.

If the hw also reports the position of the most significant bit, then
the merging can be handled more efficiently by starting at that position,
adding all the accumulator results, then iterating over subsequent 64-bit
blocks until the end or until no change is possible at the rounding point.
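
A minimal sketch of merging canonicalized accumulators as unsigned integer
arrays, assuming (hypothetically) that each accumulator is stored as 18
little-endian 64-bit limbs in two's complement; ACC_LIMBS and acc_merge are
illustrative names, not any real interface. It uses the plain low-to-high
carry chain over all limbs, not the start-at-the-most-significant-bit
shortcut described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 18 * 64 = 1152 bits, enough to hold a ~1100-bit
 * accumulator. Limb 0 is the least significant. */
#define ACC_LIMBS 18

/* dst += src over the whole array, propagating carries limb to limb.
 * With two's complement contents this also handles negative values;
 * the carry out of the top limb is dropped. */
void acc_merge(uint64_t dst[ACC_LIMBS], const uint64_t src[ACC_LIMBS])
{
    uint64_t carry = 0;

    for (size_t i = 0; i < ACC_LIMBS; i++) {
        uint64_t s = dst[i] + src[i];
        uint64_t c1 = s < dst[i];     /* overflow from limb + limb */
        uint64_t r = s + carry;
        uint64_t c2 = r < s;          /* overflow from adding carry-in */

        dst[i] = r;
        carry = c1 | c2;              /* both cannot be set at once */
    }
}
```

The final rounding to double would then be done once, on the merged array.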

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Re: VVM question

<sg36fd$kev$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=20097&group=comp.arch#20097

Path: i2pn2.org!i2pn.org!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: sfu...@alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 09:24:11 -0700
Organization: A noiseless patient Spider
Lines: 90
Message-ID: <sg36fd$kev$1@dont-email.me>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com>
<sg3416$11k$1@dont-email.me>
<1aec6237-b33c-4cf2-b9a2-faa38d702a3bn@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 24 Aug 2021 16:24:13 -0000 (UTC)
Injection-Info: reader02.eternal-september.org; posting-host="fc59f1cbfecf86977d6725511f354633";
logging-data="20959"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18JcvlukkJsZgCTg3+flA5km8SjTPcZaoc="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
Thunderbird/78.13.0
Cancel-Lock: sha1:FvYHd1i/37NSA5C/AgN7HZWgq/o=
In-Reply-To: <1aec6237-b33c-4cf2-b9a2-faa38d702a3bn@googlegroups.com>
Content-Language: en-US
 by: Stephen Fuld - Tue, 24 Aug 2021 16:24 UTC

On 8/24/2021 8:52 AM, MitchAlsup wrote:
> On Tuesday, August 24, 2021 at 10:42:32 AM UTC-5, Stephen Fuld wrote:
>> On 8/24/2021 8:25 AM, MitchAlsup wrote:
>>> On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
>>>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
>> snip
>>>>> With all the mechanisms that VVM already offers, a way for the
>>>>> programmer or a programming language to specify that operations
>>>>> such as summation can be done in any order would be a very useful
>>>>> addition.
>>>> I am probably missing something here. To me the main advantage of
>>>> allowing out of order summations (using summations here as shorthand for
>>> <
>>> The word you are looking for is "reduce", or "reduction"
>>> I want this series of calculations reduced to a single number.
>> Yes, I worded that clumsily. I probably should have used the word
>> "operation" to indicate whatever is needed to perform the reduction.
>> But using simple summation as an example, makes saying things like
>> "partial sums" meaningful. Is there a generally accepted better term
>> for the intermediate results of a reduction?
> <
> As far as I know: accumulator is about as good as it gets.
> <
>>>> other similar type operations), was to allow the hardware to make use of
>>>> multiple functional units. That is, a core with two adders could, if
>>>> allowed, complete the summation in about half the time.
>>> <
>>> This comes with numeric "issues". As with such, I do not want compilers
>>> creating code to use wide resources WITHOUT some word from the
>>> programmer saying it is "OK this time". (#pragma or such)
> <
>> Absolutely agreed. But Thomas has said such things are coming, at least
>> for Fortran. But adding a pragma for C doesn't seem unreasonable, as it
>> potentially allows a huge performance improvement. But clearly it
>> shouldn't be the default.
>>> <
>>>> Without that, I
>>>> don't see any advantage of out of order summations on VVM. If I am
>>>> wrong, please explain. If I am right, see below.
>>>>> Suggestion:
>>>>>
>>>>> A variant of the VEC instruction, which does not specify a special
>>>>> register to keep the address in (which can be hardwired if there
>>>>> is no space in the thread header). This leaves five bits for
>>>>> "reduction" registers, which specify that operations on that
>>>>> register can be done in any order in the loop.
>>> <
>>>> Doing the operations in a different order isn't the problem. You need a
>>>> way to allow/specify the two partial sums to be added together in the
>>>> end. I don't see your proposal as doing that. And, of course, it is
>>>> limited to five registers which must be specified in the hardware design.
>>> <
>>> Done properly, you want each loop performing Kahan-Babuška summations
>>> not lossy clumsy double FP.
>>> <
>>> Kahan-Babuška summation should be known to and embedded in the compiler.
> <
>> While I agree with that, it still doesn't address the issues of needing
>> multiple intermediate results "places" (to allow multiple partial
>> reductions (is that the right word?) to proceed in parallel), and how to
>> combine them at the end. You need some ISA syntax to specify such things.
> <
> A single "core" can work on a single reduction.

Sure. Apparently, I am not making myself clear. I want to be able
*optionally* to take advantage of multiple functional units in a single
core to speed up a single reduction. As an example, if I want to sum
the elements of an array, and I have two FUs that can do the addition, I
want each to sum half the elements in parallel, then a final add to
combine the two partial sums.

As was pointed out earlier, you can certainly get this by unrolling the
loop, then doing the final add of the two partial sums outside of the
loop. But this makes the code "specify" the number of lanes (i.e. how
many times the loop is unrolled), and thus makes it hardware-model
specific (i.e. it can't take advantage of a more powerful model with, say,
four appropriate FUs).

One of the beauties of VVM is that, for most loops, you code it for one
unit, but the hardware "automagically" and transparently invokes
multiple FUs to speed things up. I want to be able to extend this
capability (transparently taking advantage of multiple FUs if they are
available), to reductions, *where the programmer/compiler knows it will
be OK to do so*.
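
To make the trade-off concrete, here is a minimal Python sketch (not VVM code; the function names and the `lanes` parameter are made up for illustration). It compares a strictly sequential sum with interleaved partial sums that are combined once after the loop. The lane count is baked into the call, and the floating-point result can change with it, which is why this should stay opt-in.

```python
def sum_sequential(xs):
    """Strict left-to-right summation: one accumulator, one dependency chain."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def sum_lanes(xs, lanes=2):
    """Interleaved partial sums: element i is added to accumulator i % lanes."""
    partial = [0.0] * lanes
    for i, x in enumerate(xs):
        partial[i % lanes] += x
    # Final combine of the partial sums, done once outside the loop.
    total = 0.0
    for p in partial:
        total += p
    return total


data = [1e17, 1.0, -1e17, 1.0]
print(sum_sequential(data))      # 1.0 (the first 1.0 is absorbed by 1e17)
print(sum_lanes(data, lanes=2))  # 2.0
print(sum_lanes(data, lanes=4))  # 1.0
```

With this input the two-lane version happens to give the exact answer and the four-lane version does not, so the answer depends on how many lanes were chosen.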

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Re: VVM question

<sg36jq$t4c$1@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=20098&group=comp.arch#20098

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 16:26:34 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sg36jq$t4c$1@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<sg23h5$5dj$1@newsreader4.netcologne.de> <sg29ht$rbs$1@gioia.aioe.org>
<sg2fsq$bpd$1@newsreader4.netcologne.de>
<d5dd6304-8eec-421a-ba4d-274142df8de5n@googlegroups.com>
Injection-Date: Tue, 24 Aug 2021 16:26:34 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="29836"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 24 Aug 2021 16:26 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Tuesday, August 24, 2021 at 4:58:52 AM UTC-5, Thomas Koenig wrote:

>> Depends on the number of functional units you have. If you have
>> eight, for a high-performance CPU, you would need four layers
>> of adders to reduce the number of numbers to be added to two.
><
> Each layer of 4-2 compression costs 2 gates of delay. 8-2 4 gates
> 16-2 6 gates......

So 4-2 compression is what people are currently using, not
one of the (many) other possibilities, and not the full
adders that I assumed?

I didn't know that, thanks for the info.
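
A rough functional model of that layering, in Python on non-negative integers. It only counts layers and checks the arithmetic. It assumes a 4-2 compressor can be modelled as two 3-2 carry-save stages and says nothing about real gate-level delay. The names `csa`, `compress_4_2` and `reduce_to_two` are hypothetical.

```python
def csa(a, b, c):
    """3-2 carry-save step on whole words: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def compress_4_2(a, b, c, d):
    """4-2 compression modelled (functionally only) as two 3-2 stages."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)


def reduce_to_two(values):
    """Reduce 2, 4, 8, 16, ... operands to two words; return (words, layers)."""
    values = list(values)
    layers = 0
    while len(values) > 2:
        if len(values) % 4:
            raise ValueError("this sketch expects a power-of-two operand count")
        nxt = []
        for i in range(0, len(values), 4):
            nxt.extend(compress_4_2(*values[i:i + 4]))
        values = nxt
        layers += 1
    return values, layers


words, layers = reduce_to_two(range(1, 9))   # 8 operands
print(layers, sum(words))                    # 2 36
```

Under the quoted figure of 2 gates per layer, the layer counts 1, 2 and 3 for 4, 8 and 16 operands line up with the 2, 4 and 6 gates mentioned above.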

>>

>> It might make sense to go to a signed digit implementation for
>> the 1100-bit adder (even though that doubles the number of bits of
>> storage needed for the register) and only do the carry propagation
>> once, upon storing.
><
> Carry save is perfectly adequate for an accumulator as carries are
> only moving forward log2( #inputs ) per cycle. It is easy to build
> a Carry Save Find First circuit, so you know where in that 1076 bit
> accumulator is the most likely highest bit of significance. {You
> will only be off by 1 at most}

Interesting.

Re: VVM question

<sg37ao$t4c$2@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=20099&group=comp.arch#20099

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 16:38:48 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sg37ao$t4c$2@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 24 Aug 2021 16:38:48 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="29836"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 24 Aug 2021 16:38 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:
> On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:

>> other similar type operations), was to allow the hardware to make use of
>> multiple functional units. That is, a core with two adders could, if
>> allowed, complete the summation in about half the time.
><
> This comes with numeric "issues". As with such, I do not want compilers
> creating code to use wide resources WITHOUT some word from the
> programmer saying it is "OK this time". (#pragma or such)

Agreed - this should be specified in the language (and preferably
not with a #pragma, which is sort of an evil hack, although for things
that are standardized, like OpenMP, that is acceptable).

Unfortunately, it takes a big hammer to get code to be vectorized on
SIMD systems, such as ignoring all these niceties, so a big hammer
is what is supplied, with seductively named compiler options such
as -Ofast or -ffast-math.

(There's a double-language pun lurking here. "fast" means "almost"
in German, and one could vary an old saying about experiments to
"Fast programs, fast results, fast richtig" (where the last part
means "almost correct"). Maybe something for Anton's office wall?)

I appreciate what VVM can do here, by keeping the sequential
dependencies alive. There are simply cases where the language
does not require this ordering, as in DO CONCURRENT or #pragma simd,
where these semantics do not apply and where more speed would
help (which is the point you made above).

>> Doing the operations in a different order isn't the problem. You need a
>> way to allow/specify the two partial sums to be added together in the
>> end. I don't see your proposal as doing that. And, of course, it is
>> limited to five registers which must be specified in the hardware design.
><
> Done properly, you want each loop performing Kahan-Babuška summations
> not lossy clumsy double FP.
><
> Kahan-Babuška summation should be known to and embedded in the compiler.

Depends if you need it or not. It is a considerable overhead /
slowdown, and putting it in for all cases also would not be a
good thing.
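
For reference, a minimal Python sketch of the Neumaier form of Kahan-Babuška summation next to a naive loop. The function names are made up for illustration, and this is one textbook formulation rather than what any particular compiler emits. It shows where the extra work per element comes from.

```python
def naive_sum(xs):
    """Plain left-to-right floating-point sum."""
    total = 0.0
    for x in xs:
        total += x
    return total


def kahan_babuska_sum(xs):
    """Neumaier's variant: keep a running compensation for lost low-order bits."""
    total = 0.0
    comp = 0.0
    for x in xs:
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x      # low-order bits of x were lost
        else:
            comp += (x - t) + total      # low-order bits of total were lost
        total = t
    return total + comp


data = [1.0, 1e100, 1.0, -1e100]
print(naive_sum(data))          # 0.0
print(kahan_babuska_sum(data))  # 2.0
```

Each element costs several dependent floating-point operations plus a comparison instead of a single add, which is the overhead in question when the extra accuracy is not needed.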

Re: VVM question

<d0b52ea5-5fba-4629-bc94-d45433c2f417n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20100&group=comp.arch#20100

Newsgroups: comp.arch
Date: Tue, 24 Aug 2021 09:43:50 -0700 (PDT)
In-Reply-To: <sg37ao$t4c$2@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com> <sg37ao$t4c$2@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d0b52ea5-5fba-4629-bc94-d45433c2f417n@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 24 Aug 2021 16:43:50 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Lines: 55
 by: MitchAlsup - Tue, 24 Aug 2021 16:43 UTC

On Tuesday, August 24, 2021 at 11:38:50 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
>
> >> other similar type operations), was to allow the hardware to make use of
> >> multiple functional units. That is, a core with two adders could, if
> >> allowed, complete the summation in about half the time.
> ><
> > This comes with numeric "issues". As with such, I do not want compilers
> > creating code to use wide resources WITHOUT some word from the
> > programmer saying it is "OK this time". (#pragma or such)
> Agreed - this should be specified in the language (and preferably
> not with a #pragma, which is sort of an evil hack, although for things
> that are standardized, like OpenMP, that is acceptable).
>
> Unfortunately, it takes a big hammer to get code to be vectorized on
> SIMD systems, such as ignoring all these niceties, so a big hammer
> is what is supplied, with seductively named compiler options such
> as -Ofast or -ffast-math.
>
> (There's a double-language pun lurking here. "fast" means "almost"
> in German, and one could vary an old saying about experiments to
> "Fast programs, fast results, fast richtig" (where the last part
> means "almost correct"). Maybe something for Anton's office wall?)
>
> I appreciate what VVM can do here, by keeping the sequential
> dependencies alive. There are simply cases where the language
> does not require this ordering, as in DO CONCURRENT or #pragma simd,
> where these semantics do not apply and where more speed would
> help (which is the point you made above).
> >> Doing the operations in a different order isn't the problem. You need a
> >> way to allow/specify the two partial sums to be added together in the
> >> end. I don't see your proposal as doing that. And, of course, it is
> >> limited to five registers which must be specified in the hardware design.
> ><
> > Done properly, you want each loop performing Kahan-Babuška summations
> > not lossy clumsy double FP.
> ><
> > Kahan-Babuška summation should be known to and embedded in the compiler.
> Depends if you need it or not. It is a considerable overhead /
> slowdown, and putting it in for all cases also would not be a
> good thing.
<
On some implementations it may have considerable overhead (6 FP ops,
mostly dependent, instead of 1); on others it costs no more than an FMAC all
by itself. In my 66000 ISA it can be encoded in 2 instructions and performed
in 1 (CARRY is an instruction-modifier and does not necessarily execute)
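
One well-known sequence with exactly six dependent floating-point operations is Knuth's TwoSum, sketched below in Python. Whether this is the six-operation sequence meant above is an assumption; the sketch only shows what "the sum plus the exact rounding error" looks like in software. It does not model the CARRY modifier.

```python
from fractions import Fraction


def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b == s + err exactly."""
    s = a + b        # 1: rounded sum
    bb = s - a       # 2: the part of b that made it into s
    aa = s - bb      # 3: the part of a that made it into s
    eb = b - bb      # 4: what was lost from b
    ea = a - aa      # 5: what was lost from a
    err = ea + eb    # 6: total rounding error
    return s, err


s, err = two_sum(1e17, 1.0)
print(s, err)        # 1e+17 1.0
# Exactness check in rational arithmetic:
print(Fraction(s) + Fraction(err) == Fraction(1e17) + Fraction(1.0))  # True
```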

Re: VVM question

<sg37of$t4c$3@newsreader4.netcologne.de>

https://www.novabbs.com/devel/article-flat.php?id=20101&group=comp.arch#20101

From: tkoe...@netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Subject: Re: VVM question
Date: Tue, 24 Aug 2021 16:46:07 -0000 (UTC)
Organization: news.netcologne.de
Distribution: world
Message-ID: <sg37of$t4c$3@newsreader4.netcologne.de>
References: <sftuaa$but$1@newsreader4.netcologne.de>
<5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
<3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me>
<65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<sg23h5$5dj$1@newsreader4.netcologne.de> <sg29ht$rbs$1@gioia.aioe.org>
<sg2fsq$bpd$1@newsreader4.netcologne.de>
<d5dd6304-8eec-421a-ba4d-274142df8de5n@googlegroups.com>
Injection-Date: Tue, 24 Aug 2021 16:46:07 -0000 (UTC)
Injection-Info: newsreader4.netcologne.de; posting-host="2001-4dd7-dc2a-0-7285-c2ff-fe6c-992d.ipv6dyn.netcologne.de:2001:4dd7:dc2a:0:7285:c2ff:fe6c:992d";
logging-data="29836"; mail-complaints-to="abuse@netcologne.de"
User-Agent: slrn/1.0.3 (Linux)
 by: Thomas Koenig - Tue, 24 Aug 2021 16:46 UTC

MitchAlsup <MitchAlsup@aol.com> schrieb:

> Carry save is perfectly adequate for an accumulator as carries are
> only moving forward log2( #inputs ) per cycle. It is easy to build
> a Carry Save Find First circuit, so you know where in that 1076 bit
> accumulator is the most likely highest bit of significance. {You
> will only be off by 1 at most}

Having second thoughts here...

Assuming I calculate

01..111111111....111000
+ 00..000000000....001000
=========================
10..000000000....000000

with the 1076 bit accumulator, the carry would still have to be
propagated all the way, across more than 1024 bits if I am unlucky.
This is why I suggested signed digits, because carries only propagate
a single position there (if done correctly).

How would a carry-save adder deal with this situation?

Re: VVM question

<c2a9a666-3cbf-43e1-9d91-964cf90892c8n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20102&group=comp.arch#20102

Newsgroups: comp.arch
Date: Tue, 24 Aug 2021 10:14:08 -0700 (PDT)
In-Reply-To: <sg37of$t4c$3@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<sg23h5$5dj$1@newsreader4.netcologne.de> <sg29ht$rbs$1@gioia.aioe.org>
<sg2fsq$bpd$1@newsreader4.netcologne.de> <d5dd6304-8eec-421a-ba4d-274142df8de5n@googlegroups.com>
<sg37of$t4c$3@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c2a9a666-3cbf-43e1-9d91-964cf90892c8n@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Tue, 24 Aug 2021 17:14:09 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Tue, 24 Aug 2021 17:14 UTC

On Tuesday, August 24, 2021 at 11:46:09 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > Carry save is perfectly adequate for an accumulator as carries are
> > only moving forward log2( #inputs ) per cycle. It is easy to build
> > a Carry Save Find First circuit, so you know where in that 1076 bit
> > accumulator is the most likely highest bit of significance. {You
> > will only be off by 1 at most}
> Having second thoughts here...
>
> Assuming I calculate
>
> 01..111111111....111000
> + 00..000000000....001000
> =========================
> 10..000000000....000000
>
> with the 1076 bit accumulator, the carry would still have to be
> propagated all the way, across more than 1024 bits if I am unlucky.
<
But this propagation takes place in the adder not in the accumulator
which remains carry save during all the iterations. The adder is used
only after all of the reductions from all of the lanes arrive and are
compressed into a final carry save result. This adder will be of the
Kogge-Stone variety, where the middle order bits are the last to resolve,
allowing one to begin finding the point of greatest significance
early.
<
> This is why I suggested signed digits, because carries only propagate
> a single position there (if done correctly).
>
> How would a carry-save adder deal with this situation?
>
You are going to get carry save results out of the multiplier
every cycle. In carry save form, you use 2 signals for each bit.
We know fast multipliers perform their work in not just carry
save but also True-Complement (to make the XORs fast),
so I assume we have carry-save and true-complement out
of the multiplier.
<
Anytime one gets a value out of a latch (or flip-flop) one has
access to both true and complement values. The accumulator
"runs" through this flip-flop, so we have both carry save and true
complement data from the 'flop.
<
Thus we have 2-bits from the multiplier and 2 bits from the
accumulator flop and we can add these 2 values in 2 gates
of delay producing carry save output, which we then flop.
<
Since this addition only costs 2 gates of delay, we could take
results from 3 lanes and 1 accumulator in 4 gates, 7 lanes
and 1 accumulator in 6 gates: the reduction can easily take
as many as 16 lanes, per cycle. I doubt anyone is going to
build a machine "that wide".
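
A small Python model of that behaviour, as a sketch only. It works on non-negative integers, adds one plain operand per step rather than a carry-save pair from a multiplier, and the class and method names are invented for illustration. The accumulator stays as two words, and the single carry-propagating add happens in `resolve()`.

```python
def csa(a, b, c):
    """3-2 carry-save step on whole words: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


class CarrySaveAccumulator:
    """Accumulator held in redundant (sum, carry) form."""

    def __init__(self):
        self.sum = 0
        self.carry = 0

    def add(self, x):
        # One compression step; each carry moves at most one bit position.
        self.sum, self.carry = csa(self.sum, self.carry, x)

    def resolve(self):
        # The only place where a carry may ripple across the whole word.
        return self.sum + self.carry


acc = CarrySaveAccumulator()
acc.add(0b0111111111111000)   # 16-bit stand-in for the long run of ones
acc.add(0b0000000000001000)
print(bin(acc.carry))         # 0b10000: the carry moved one position only
print(hex(acc.resolve()))     # 0x8000
```

In the run-of-ones case from the question, the step that adds the low bit leaves a single pending carry bit in the carry word; the long propagation is deferred to the final add.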

Re: VVM question

<d3d896f7-3b78-4e18-a193-9c63692cdff7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20158&group=comp.arch#20158

Newsgroups: comp.arch
Date: Fri, 27 Aug 2021 14:36:26 -0700 (PDT)
In-Reply-To: <sfvckb$bok$2@newsreader4.netcologne.de>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.178.51; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.178.51
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d3d896f7-3b78-4e18-a193-9c63692cdff7n@googlegroups.com>
Subject: Re: VVM question
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Fri, 27 Aug 2021 21:36:27 +0000
Content-Type: text/plain; charset="UTF-8"
 by: luke.l...@gmail.com - Fri, 27 Aug 2021 21:36 UTC

On Monday, August 23, 2021 at 6:44:45 AM UTC+1, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:

> Will it run several iterations in parallel without source code
> modification, or not?

yes, by running the exact same instructions in an in-flight out-of-order multi-issue micro-architecture.

VVM took me 2 years to understand the concept. it was a lightbulb moment but only after i had a brainwave in SVP64. more of a 2x4 cluebat smack to the side of the head, but hey.

to conceptually understand VVM in full, first take either Cray-style Vectors or "SIMD where it is implemented as a horizontal for-loop". doesn't matter which: the important part is:

you have a for-loop on the current instruction, it goes from 0 to LEN-1

thus, instruction 1: you have a for-loop from element 0 to element LEN-1
instruction 2: for-loop from element 0 to element LEN-1

got that bit so far?

Vertical-First, of which VVM is a type, you do this:

START LOOP
instruction 1 element 0
instruction 2 element 0
instruction 3 element 0
LOOP BACK
instruction 1 element 1
instruction 2 element 1
.....
LOOP BACK
instruction 1 element LEN-1
instruction 2 element LEN-1
instruction 3 element LEN-1
END LOOP

that's it.

that's all there is to it. if you understand this difference,
between Horizontal-First and Vertical-First element/instruction
processing order, you understand VVM.
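
the two orders can be written down as a pair of nested loops. this is only a Python sketch of the ordering idea, with made-up function names and instruction labels; it is not VVM or SVP64 code.

```python
def horizontal_first(program, length):
    """Each instruction sweeps all elements before the next instruction starts."""
    trace = []
    for instr in program:
        for element in range(length):
            trace.append((instr, element))
    return trace


def vertical_first(program, length):
    """All instructions run on element 0, then loop back for element 1, etc."""
    trace = []
    for element in range(length):
        for instr in program:
            trace.append((instr, element))
    return trace


program = ["instruction 1", "instruction 2", "instruction 3"]
for step in vertical_first(program, 2):
    print(step)
```

both functions visit the same (instruction, element) pairs; only the nesting of the two loops differs.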

thus you can see:

* a simple single issue may do this dead easy, exactly
as the scalar code is written
* a multi issue version may analyse the loop and shove multiple
overlapping elements into overlapping in-flight buffers.

there is no actual optimisation, it just turns out that
the LOOP instruction declares various things such
as the loop invariant, and which registers can be
considered "Vectorised".

[somebody please do check this:
the limitation as i understand it is that those registers
marked as "Vectorised" *cannot* have data passed in to
them. as best i have been able to tell you *must* use
LD to get data into a Vector element and you likewise
*must* ST that element towards the end of the loop]

if you do this then the HW is happy due to it detecting
that the elements are in fact loop independent, and need
only concern itself about Memory R/W Hazards.

SVP64's Vertical First Mode on the other hand has
*actual* mapping to *actual* registers one to one
with the conceptual element numbering illustrated
above, and consequently you *do not* have to rely
on LD/ST. SVP64 is however a 64bit ISA so not
especially compact.

l.

Re: VVM question

<7cd7b86b-ca8d-4522-be84-355e8a68d5a5n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20159&group=comp.arch#20159

Newsgroups: comp.arch
Date: Fri, 27 Aug 2021 14:43:22 -0700 (PDT)
In-Reply-To: <d3d896f7-3b78-4e18-a193-9c63692cdff7n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com> <sfvckb$bok$2@newsreader4.netcologne.de> <d3d896f7-3b78-4e18-a193-9c63692cdff7n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7cd7b86b-ca8d-4522-be84-355e8a68d5a5n@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 27 Aug 2021 21:43:22 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 84
 by: MitchAlsup - Fri, 27 Aug 2021 21:43 UTC

On Friday, August 27, 2021 at 4:36:28 PM UTC-5, luke.l...@gmail.com wrote:
> On Monday, August 23, 2021 at 6:44:45 AM UTC+1, Thomas Koenig wrote:
> > MitchAlsup <Mitch...@aol.com> schrieb:
> > Will it run several iterations in parallel without source code
> > modification, or not?
> yes, by running the exact same instructions in an in-flight out-of-order multi-issue micro-architecture.
>
> VVM took me 2 years to understand the concept. it was a lightbulb moment but only after i had a brainwave in SVP64. more of a 2x4 cluebat smack to the side of the head, but hey.
>
> to conceptually understand VVM in full, first take either Cray-style Vectors or "SIMD where it is implemented as a horizontal for-loop". doesn't matter which: the important part is:
>
> you have a for-loop on the current instruction, it goes from 0 to LEN-1
>
> thus, instruction 1: you have a for-loop from element 0 to element LEN-1
> instruction 2: for-loop from element 0 to element LEN-1
>
> got that bit so far?
>
> Vertical-First, of which VVM is a type, you do this:
>
> START LOOP
> instruction 1 element 0
> instruction 2 element 0
> instruction 3 element 0
> LOOP BACK
> instruction 1 element 1
> instruction 2 element 1
> ....
> LOOP BACK
> instruction 1 element LEN-1
> instruction 2 element LEN-1
> instruction 3 element LEN-1
> END LOOP
>
> that's it.
>
> that's all there is to it. if you understand this difference,
> between Horizontal-First and Vertical-First element/instruction
> processing order, you understand VVM.
>
> thus you can see:
>
> * a simple single issue may do this dead easy, exactly
> as the scalar code is written
> * a multi issue version may analyse the loop and shove multiple
> overlapping elements into overlapping in-flight buffers.
>
> there is no actual optimisation, it just turns out that
> the LOOP instruction declares various things such
> as the loop invariant, and which registers can be
> considered "Vectorised".
>
> [somebody please do check this:
> the limitation as i understand it is that those registers
> marked as "Vectorised" *cannot* have data passed in to
> them. as best i have been able to tell you *must* use
> LD to get data into a Vector element and you likewise
> *must* ST that element towards the end of the loop]
<
The difference between Scalar, Vector, and Loop Carried is::
a) Scalar registers are read once put into stations and then
.....just used to supply operands to instructions
b) Vector registers capture a result produced in this iteration
c) Loop Carried registers capture a result produced in a
.....previous iteration.
<
The distinction is used in the operand capture portion of
the stations.
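<
As a rough software analogy of that three-way split, the Python sketch below classifies the registers of a toy loop body by looking at read and write order within one iteration. The tuple encoding, the register names and the rule itself are assumptions made for illustration; this is not how the hardware derives the classes.

```python
def classify_registers(body):
    """body: list of (dest, (src, ...)) tuples in program order."""
    written = {dest for dest, _ in body}
    written_so_far = set()
    kinds = {}
    for dest, srcs in body:
        for reg in srcs:
            if reg not in written:
                kinds[reg] = "scalar"        # never written inside the loop
            elif reg not in written_so_far:
                kinds[reg] = "loop-carried"  # read before this iteration writes it
            else:
                kinds.setdefault(reg, "vector")
        written_so_far.add(dest)
        kinds.setdefault(dest, "vector")     # result produced in this iteration
    return kinds


# Toy body: r4 = load(r1 + r3); r5 = r5 + r4; r3 = r3 + r2
body = [("r4", ("r1", "r3")), ("r5", ("r5", "r4")), ("r3", ("r3", "r2"))]
print(classify_registers(body))
```

Here r1 and r2 come out as scalar, r4 as vector, and r3 and r5 as loop-carried.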
>
> if you do this then the HW is happy due to it detecting
> that the elements are in fact loop independent, and need
> only concern itself about Memory R/W Hazards.
<
One of the things I learned when building wide OoO machines
is the stunning nature of how LITTLE OoOness is needed to
get within spitting distance of "all you can get" performance.
>
> SVP64's Vertical First Mode on the other hand has
> *actual* mapping to *actual* registers one to one
> with the conceptual element numbering illustrated
> above, and consequently you *do not* have to rely
> on LD/ST. SVP64 is however a 64bit ISA so not
> especially compact.
>
> l.

Re: VVM question

<b39bc6eb-2d43-45a3-b15f-bf6452e311a6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20160&group=comp.arch#20160

Newsgroups: comp.arch
Date: Fri, 27 Aug 2021 15:05:17 -0700 (PDT)
In-Reply-To: <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.178.54; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.178.54
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <b39bc6eb-2d43-45a3-b15f-bf6452e311a6n@googlegroups.com>
Subject: Re: VVM question
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Fri, 27 Aug 2021 22:05:17 +0000
Content-Type: text/plain; charset="UTF-8"
Lines: 21
 by: luke.l...@gmail.com - Fri, 27 Aug 2021 22:05 UTC

On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:

> The major problem is where does one store the state on an interrupt
> taken inside of the loop. I am letting my subconscious dwell on it right
> now.

the answer is, without an actual register, you can't.

even if it was a Special Purpose register for storing
in-flight data, it's still a register.

if it was an actual GPR/FPR, actually named
in the instruction (and given as a src/dst into
the FMAC as the accumulator) *now* you
have a register to actually store in-flight
data during an interrupt. it can even be
context-switched.

a lot of things in SVP64 look overly complex until precise interrupt
handling is thrown into the pot.

l.

Re: VVM question

<0b4f8910-7d6c-4a8f-b0a1-90fe304076e1n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20161&group=comp.arch#20161

Newsgroups: comp.arch
Date: Fri, 27 Aug 2021 15:15:07 -0700 (PDT)
In-Reply-To: <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.178.54; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.178.54
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <0b4f8910-7d6c-4a8f-b0a1-90fe304076e1n@googlegroups.com>
Subject: Re: VVM question
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Fri, 27 Aug 2021 22:15:08 +0000
Content-Type: text/plain; charset="UTF-8"
 by: luke.l...@gmail.com - Fri, 27 Aug 2021 22:15 UTC

On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:
> My preferred means is to make a way to specify that a function unit is
> performing a reduction, and that it should not deliver its value at the
> end of its calculation, but hold on to it and use it in the next calculation.

in SVP64 we do not have explicit reduction or iteration instructions:
iteration is performed simply by issuing a Vector-Context on top
of a base instruction where OH LOOK! the difference between the
register number src and dest happens to be exactly one, and OH LOOK!
on each time round the multi-issue-capable execution backend
you get spammed with ADD r1, r2 ADD r2, r3 ADD r3, r4 etc etc
which looks an awful lot like iteration / reduction and is inherently
precise exception interruptible.
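
a rough sketch of that idea in Python, purely illustrative and not taken
from the SVP64 spec: the three-operand form (rt, ra, rb), the flat
register file and the names vector_add / demo are all assumptions. it
shows that when rt is ra+1 each element reads the previous element's
result (a running sum), and that stopping part-way leaves nothing to
save except the register file and the element index.

```python
# illustrative model only: a flat register file and a vector-looped
# scalar add. the operand form and every name here are assumptions.

def vector_add(regs, rt, ra, rb, vl, start=0, stop_after=None):
    """Run elements start..vl-1 of 'add rt+i, ra+i, rb+i' in order.

    Returns the index of the next element to execute, so a caller can
    model an interrupt by stopping early and resuming later.
    """
    i = start
    while i < vl:
        if stop_after is not None and i - start >= stop_after:
            break  # "interrupt": all state is in regs plus the index i
        regs[rt + i] = regs[ra + i] + regs[rb + i]
        i += 1
    return i


def demo():
    regs = [0] * 32
    regs[1] = 0                    # accumulator seed in r1
    regs[10:14] = [3, 5, 7, 11]    # data in r10..r13
    # rt = ra + 1, so each element consumes the previous element's result
    step = vector_add(regs, rt=2, ra=1, rb=10, vl=4, stop_after=2)
    saved = list(regs)             # context switch: save the regfile...
    regs = saved                   # ...and remember the element index
    vector_add(regs, rt=2, ra=1, rb=10, vl=4, start=step)
    return step, regs[2:6]
```

calling demo() stops after two elements, restores, and finishes the
remaining two; r2..r5 end up holding the running sums of r10..r13.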

we are however also working out an explicit (fixed, predictable)
parallelisable reduction schedule, it is quite tricky. but, as a fixed
schedule, even non-commutative operations can be thrown at it.

l.

Re: VVM question

<2eccc573-f9ba-4d4f-9dae-0f5107f505a6n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20162&group=comp.arch#20162

Newsgroups: comp.arch
Date: Fri, 27 Aug 2021 15:23:09 -0700 (PDT)
In-Reply-To: <sg36fd$kev$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.178.54; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.178.54
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <sg1mru$h6d$1@dont-email.me>
<ec20f6f4-06a3-4215-b51b-182b9980c7d8n@googlegroups.com> <sg3416$11k$1@dont-email.me>
<1aec6237-b33c-4cf2-b9a2-faa38d702a3bn@googlegroups.com> <sg36fd$kev$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2eccc573-f9ba-4d4f-9dae-0f5107f505a6n@googlegroups.com>
Subject: Re: VVM question
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Fri, 27 Aug 2021 22:23:09 +0000
Content-Type: text/plain; charset="UTF-8"
 by: luke.l...@gmail.com - Fri, 27 Aug 2021 22:23 UTC

On Tuesday, August 24, 2021 at 5:24:16 PM UTC+1, Stephen Fuld wrote:

> As was pointed out earlier, you can certainly get this by unrolling the
> loop, then doing the final add of the two partial sums outside of the
> loop. But this makes the code "specify" the number of lanes (i.e. how
> many times the loop is unrolled by), thus makes it hardware model
> specific (i.e. it can't take advantage of a more powerful model with say
> four appropriate FUs).

we designed a reduction "schedule" which does exactly that and
is not tied to the microarchitectural width.

it is a recursive tree algorithm (butterfly-like) as part of the SVP64
Specification. hardware will be required to implement that
algorithm or one that produces the exact same result.
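
to illustrate what a fixed schedule buys you, here is a minimal Python
sketch of a pairwise tree reduction. it is NOT the actual SVP64
schedule (the name tree_reduce and the pairing order are assumptions
made for illustration); the point is only that the pairing depends on
the vector length and nothing else, so any hardware width that follows
it gets the same answer.

```python
def tree_reduce(values, op):
    """Reduce with a fixed pairwise tree.

    The pairing depends only on len(values), never on how many
    functional units happen to execute it. Hypothetical schedule,
    not the one in the SVP64 specification.
    """
    vec = list(values)   # stands in for the Vector accumulator register
    n = len(vec)
    if n == 0:
        raise ValueError("nothing to reduce")
    step = 1
    while step < n:
        # every pair at this level is independent of the others,
        # so a wide machine may run them all at once
        for i in range(0, n - step, 2 * step):
            # left operand stays on the left, so operand order is kept
            vec[i] = op(vec[i], vec[i + step])
        step *= 2
    return vec[0]
```

with string concatenation (associative but not commutative) the
elements come out in their original order. with subtraction (not
associative) the result differs from a left-to-right loop, but it is
the same on every implementation that follows the schedule, which is
why the schedule has to be part of the ISA specification.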

it is possible, in other words, if you are willing to make it
part of the specification of the ISA. ARM did something similar
i believe.

however i am not sure that, in a Vertical-First ISA such as VVM, the
concept of Horizontal Reduction even makes sense. i have a strong
feeling the two are mutually exclusive.

l.

Re: VVM question

<2e6e65bf-493a-4d8d-8c3e-9d0af174109bn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20164&group=comp.arch#20164

Newsgroups: comp.arch
Date: Fri, 27 Aug 2021 16:55:29 -0700 (PDT)
In-Reply-To: <b39bc6eb-2d43-45a3-b15f-bf6452e311a6n@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=104.59.204.55; posting-account=H_G_JQkAAADS6onOMb-dqvUozKse7mcM
NNTP-Posting-Host: 104.59.204.55
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>
<b39bc6eb-2d43-45a3-b15f-bf6452e311a6n@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <2e6e65bf-493a-4d8d-8c3e-9d0af174109bn@googlegroups.com>
Subject: Re: VVM question
From: MitchAl...@aol.com (MitchAlsup)
Injection-Date: Fri, 27 Aug 2021 23:55:30 +0000
Content-Type: text/plain; charset="UTF-8"
 by: MitchAlsup - Fri, 27 Aug 2021 23:55 UTC

On Friday, August 27, 2021 at 5:05:20 PM UTC-5, luke.l...@gmail.com wrote:
> On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:
>
> > The major problem is where does one store the state on an interrupt
> > taken inside of the loop. I am letting my subconscious dwell on it right
> > now.
<
Reminding others that this state is 1076 bits long.
<
> the answer is, without an actual register, you can't.
>
> even if it was a Special Purpose register for storing
> in-flight data, it's still a register.
>
> if it was an actual GPR/FPR, actually named
> in the instruction (and given as a src/dst into
> the FMAC as the accumulator) *now* you
> have a register to actually store in-flight
> data during an interrupt. it can even be
> context switched.
<
OK, so what register do you have that is big enough to hold all 1076 bits ?
>
> a lot of things in SVP64 look overly complex until precise interrupt
> handling is thrown into the pot.
>
> l.

Re: VVM question

<73488067-c07c-43fb-b3d7-6243aca7f3d7n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20166&group=comp.arch#20166

Newsgroups: comp.arch
Date: Fri, 27 Aug 2021 17:56:45 -0700 (PDT)
In-Reply-To: <2e6e65bf-493a-4d8d-8c3e-9d0af174109bn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.178.51; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.178.51
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>
<b39bc6eb-2d43-45a3-b15f-bf6452e311a6n@googlegroups.com> <2e6e65bf-493a-4d8d-8c3e-9d0af174109bn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <73488067-c07c-43fb-b3d7-6243aca7f3d7n@googlegroups.com>
Subject: Re: VVM question
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Sat, 28 Aug 2021 00:56:45 +0000
Content-Type: text/plain; charset="UTF-8"
 by: luke.l...@gmail.com - Sat, 28 Aug 2021 00:56 UTC

On Saturday, August 28, 2021 at 12:55:31 AM UTC+1, MitchAlsup wrote:

> OK, so what register do you have that is big enough to hold all 1076 bits ?

none, i'm limiting things to 64 bit, so that'd need to be
done across multiple registers (long arithmetic, big
integer math).
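
for scale: 1076 bits needs 17 registers of 64 bits (16 x 64 = 1024 is
not enough, 17 x 64 = 1088 is). a small Python sketch of the
split/rejoin, where the names to_limbs / from_limbs and the
little-endian limb order are assumptions for illustration only:

```python
LIMB_BITS = 64      # register width assumed in the post above
STATE_BITS = 1076   # size of the in-flight state quoted above


def to_limbs(value, total_bits=STATE_BITS, limb_bits=LIMB_BITS):
    """Split a non-negative integer into limb_bits-wide pieces, low first."""
    if value < 0 or value >> total_bits:
        raise ValueError("value does not fit in total_bits")
    count = -(-total_bits // limb_bits)   # ceiling division
    mask = (1 << limb_bits) - 1
    return [(value >> (i * limb_bits)) & mask for i in range(count)]


def from_limbs(limbs, limb_bits=LIMB_BITS):
    """Rebuild the wide integer from its limbs (low limb first)."""
    value = 0
    for i, limb in enumerate(limbs):
        value |= limb << (i * limb_bits)
    return value
```

saving the state on an interrupt would then mean writing those 17
values out, and restoring means reading them back in the same order.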

one option to explore: the iteration/reduction in SVP64, the
parallel schedule, *requires* a Vector src/dest for use as
an accumulator, because it is used to store partial parallel
results as part of the tree reduction. this has an advantage
in that it solves the precise exception problem.

the only reason this is possible is because SVP64 has explicit
not implicit Vector registers, mapped onto the (enlarged)
scalar regfile.

if VVM was similarly extended to have *actual* Vector registers
as opposed to its current design of only mapping elements onto
in-flight Reservation Stations, *then* there would be somewhere
to put large in-flight data (such as 1076 bits).

l

Re: VVM question

<4ArWI.8247$o45.3531@fx46.iad>

https://www.novabbs.com/devel/article-flat.php?id=20173&group=comp.arch#20173

From: ThatWoul...@thevillage.com (EricP)
User-Agent: Thunderbird 2.0.0.24 (Windows/20100228)
MIME-Version: 1.0
Newsgroups: comp.arch
Subject: Re: VVM question
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com> <sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com> <sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de> <sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com> <sg13vd$hom$1@newsreader4.netcologne.de> <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com> <b39bc6eb-2d43-45a3-b15f-bf6452e311a6n@googlegroups.com> <2e6e65bf-493a-4d8d-8c3e-9d0af174109bn@googlegroups.com>
In-Reply-To: <2e6e65bf-493a-4d8d-8c3e-9d0af174109bn@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Lines: 30
Message-ID: <4ArWI.8247$o45.3531@fx46.iad>
X-Complaints-To: abuse@UsenetServer.com
NNTP-Posting-Date: Sat, 28 Aug 2021 14:17:36 UTC
Date: Sat, 28 Aug 2021 10:17:20 -0400
X-Received-Bytes: 2235
 by: EricP - Sat, 28 Aug 2021 14:17 UTC

MitchAlsup wrote:
> On Friday, August 27, 2021 at 5:05:20 PM UTC-5, luke.l...@gmail.com wrote:
>> On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:
>>
>>> The major problem is where does one store the state on an interrupt
>>> taken inside of the loop. I am letting my subconscious dwell on it right
>>> now.
> <
> Reminding others that this state is 1076 bits long.
> <
>> the answer is, without an actual register, you can't.
>>
>> even if it was a Special Purpose register for storing
>> in-flight data, it's still a register.
>>
>> if it was an actual GPR/FPR, actually named
>> in the instruction (and given as a src/dst into
>> the FMAC as the accumulator) *now* you
>> have a register to actually store in-flight
>> data during an interrupt. it can even be
>> context switched.
> <
> OK, so what register do you have that is big enough to hold all 1076 bits ?
>> a lot of things in SVP64 look overly complex until precise interrupt
>> handling is thrown into the pot.
>>
>> l.

Did you decide against the barf-buffer?

Re: VVM question

<a113a20b-818b-4127-af8b-161763a18187n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=20174&group=comp.arch#20174

Newsgroups: comp.arch
Date: Sat, 28 Aug 2021 07:30:28 -0700 (PDT)
In-Reply-To: <4ArWI.8247$o45.3531@fx46.iad>
Injection-Info: google-groups.googlegroups.com; posting-host=92.40.178.51; posting-account=soFpvwoAAADIBXOYOBcm_mixNPAaxW9p
NNTP-Posting-Host: 92.40.178.51
References: <sftuaa$but$1@newsreader4.netcologne.de> <5fd4c976-d72c-46f3-9fb4-584e72b628a2n@googlegroups.com>
<sfvckb$bok$2@newsreader4.netcologne.de> <3ae800da-d7d8-4437-b5bb-ec651b5f5700n@googlegroups.com>
<sg0gr4$pet$1@dont-email.me> <sg0n5u$74p$2@newsreader4.netcologne.de>
<sg0omd$itq$1@dont-email.me> <65bad170-8d27-4ad4-bf8f-69157e6869f2n@googlegroups.com>
<sg13vd$hom$1@newsreader4.netcologne.de> <64fe2d82-96f7-4c94-9b0c-aa05605c3fcen@googlegroups.com>
<b39bc6eb-2d43-45a3-b15f-bf6452e311a6n@googlegroups.com> <2e6e65bf-493a-4d8d-8c3e-9d0af174109bn@googlegroups.com>
<4ArWI.8247$o45.3531@fx46.iad>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <a113a20b-818b-4127-af8b-161763a18187n@googlegroups.com>
Subject: Re: VVM question
From: luke.lei...@gmail.com (luke.l...@gmail.com)
Injection-Date: Sat, 28 Aug 2021 14:30:29 +0000
Content-Type: text/plain; charset="UTF-8"
 by: luke.l...@gmail.com - Sat, 28 Aug 2021 14:30 UTC

On Saturday, August 28, 2021 at 3:17:41 PM UTC+1, EricP wrote:
> Did you decide against the barf-buffer?

https://m.youtube.com/watch?v=KvzrguhmK0o&t=17
