From: cr88...@gmail.com (BGB)
Newsgroups: comp.arch
Subject: Re: Signed division by 2^n
Date: Sat, 15 May 2021 00:25:50 -0500
Message-ID: <s7nm10$lok$1@dont-email.me>
References: <s7dn5p$78r$1@newsreader4.netcologne.de>
 <2021May11.193250@mips.complang.tuwien.ac.at> <s7l775$sq5$1@dont-email.me>
 <s7l7os$75r$1@dont-email.me> <s7m6ri$vta$1@dont-email.me>
 <c4fe5be0-030f-4ad1-8ff0-f89f08d1250en@googlegroups.com>
 <s7mio3$qfs$1@dont-email.me>
 <00a4b04a-ef97-44fd-a3a9-aa777fcc71bbn@googlegroups.com>
 <s7n2va$8el$1@dont-email.me>
 <338af7a4-6369-4e4e-ae33-7c89cc11d2f5n@googlegroups.com>
In-Reply-To: <338af7a4-6369-4e4e-ae33-7c89cc11d2f5n@googlegroups.com>

On 5/14/2021 7:54 PM, MitchAlsup wrote:
> On Friday, May 14, 2021 at 7:00:45 PM UTC-5, BGB wrote:
>> On 5/14/2021 4:29 PM, MitchAlsup wrote:
>>> On Friday, May 14, 2021 at 2:23:49 PM UTC-5, BGB wrote:
>
>>>>> <
>>>>> Possibly, but the multiplier is dealing with 53+53 bit things minimum,
>>>>> if said multiplier also FDIV and SQRT then it is 57+57
>>>>> if said multiplier also Transcendentals then it is 58+58.....
>>> <
>>>> This is assuming one uses a "square" multiplier, rather than a
>>>> "triangular" multiplier.
>>> <
>>> The proper word is parallelogram not square
>> Could also call it rhombus or diamond...
>>>>
>>>> As-is:
>>>> Square Multiplier: 54*54 -> 108
>>>> Triangular Multiplier: 54*54 -> 72
>>> <
>>> I question your definition of triangular::
>>>> Triangular Multiplier: 54×54 -> 54 !?!
>>>>
>> It is built from DSPs, which can generate an output twice as wide as the
>> inputs.
>>
>> They can be:
>> 16*16 -> 32, Signed/Unsigned
>> 17*17 -> 34, Signed/Unsigned
>> 18*18 -> 36, Nominally Signed
>> Can fake Unsigned via extra LUTs.
>>
>> If one builds a triangular multiplier, the bottom parts hang off the
>> bottom, so one gets an additional 16-18 bits of width.
> <
> Are you talking about the more significant triangle of the multiplier
> or the lesser significant triangle of the multiplier ?
> <
> <
> This is a symptom of the library you are using not a general property of
> multipliers.

I suspect it is more related to the DSP48s, which are hard logic blocks
in this FPGA. But, one is sorta stuck using them, whether or not they
are a good fit, because there is no other cost-effective alternative.
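
As a rough illustration (in C, my own sketch rather than the core's
actual Verilog): a 54-bit multiply built out of 18x18->36 DSP-sized
pieces, splitting each input into three 18-bit limbs and summing the
nine partial products at the appropriate shifts:

#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC/Clang extension, holds the 108-bit result */

/* 54*54 -> 108, for a and b below 2^54 */
u128 mul54(uint64_t a, uint64_t b)
{
    uint64_t al[3] = { a & 0x3FFFF, (a >> 18) & 0x3FFFF, a >> 36 };
    uint64_t bl[3] = { b & 0x3FFFF, (b >> 18) & 0x3FFFF, b >> 36 };
    u128 acc = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            acc += (u128)(al[i] * bl[j]) << (18 * (i + j));  /* one DSP each */
    return acc;
}

Each al[i]*bl[j] term maps to one DSP; the shifting and accumulation
is what ends up in the LUTs and carry chains around them.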

>>
>> However, because the low order bits are incomplete, they also tend to
>> be erroneous, and are discarded.
> <
> upper......

An interesting property of multiplication is that the high order digits
still tend to be correct even if one throws out most of the low-order
digits...
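
Continuing the sketch from my earlier post-fragment: a truncated
("triangular") variant just skips the low partial products. The
dropped terms sum to less than 2^56, so everything from around bit 56
upward still agrees with the exact product, give or take a single
carry:

u128 mul54_trunc(uint64_t a, uint64_t b)
{
    uint64_t al[3] = { a & 0x3FFFF, (a >> 18) & 0x3FFFF, a >> 36 };
    uint64_t bl[3] = { b & 0x3FFFF, (b >> 18) & 0x3FFFF, b >> 36 };
    u128 acc = 0;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            if (i + j >= 2)  /* keep only terms that reach the high half */
                acc += (u128)(al[i] * bl[j]) << (18 * (i + j));
    return acc;
}

This drops 3 of the 9 DSPs; with more limbs the savings approach half.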

>>
>> The actual usable portion is roughly the same width as the inputs, but I
>> count these low-order bits as this is what the intermediate adders tend
>> to work with; they are discarded afterwards.
>>
>> So, 54*54->54, 72*72->72, or 80*80->80, ...
>> Would be more what one would see after generating the final output.
>>>> The LongDouble used a wider multiplier:
>>>> 72*72->90 (initial)
>>>> 85*85->90 (likely needed to avoid some issues, *)
>>>>
>>>> *: Algorithms based on iterative convergence get stuck in an infinite
>>>> loop if FADDX and FMULX use different mantissa lengths.
>>>>
>>>> This means that I would need to make them agree on a fixed 80-bit mantissa.
>>>>
>>>> Or: S.E15.F80.P32 (where P=Zero Padding)
>>>>
>>>>
>>>> I did at one point start trying to implement a combined FMAC unit, but
>>>> then realized it was likely to have a fairly high latency.
>>> <
>>> Your implementation medium is harming your ability to pull off your design.
>> Very possibly...
>>
>>
>> I just spent like the past week battling with debugging and trying to
>> get stuff to pass timing reliably.
>>
>> Switched over to trying to get it to pass timing at 75MHz, because if I
>> can get it to pass timing much at all at 75MHz, it will hopefully stop
>> unpredictably failing timing at 50MHz.
>>
>>
>> But, things like passing/failing timing, resource usage, estimated power
>> usage, ... are basically kinda like a roulette wheel which jumps all
>> over the place.
>>
>>
>> Similarly, whether or not it works in simulation is no guarantee it will
>> work on the actual FPGA (simulation starts typically with everything
>> holding zeroes, whereas the FPGA seems to start with pretty much
>> everything initialized to garbage values; requiring a global "reset"
>> strobe signal to try to pull everything into a "known good" state).
>>
>> Also, the sort of "metastability" from clock-domain crossings isn't
>> really modeled at all in simulation, nor the effects of random internal
>> corruptions, or the apparent tendency of the FPGA to start experiencing
>> errors once it warms up (I stuck a RasPi heat-sink on it, probably also
>> need a case with a fan, *), ...
>>
>>
>> *: Basically, once the FPGA gets much over ~45C or so, its reliability
>> seems to get a lot worse (and stuff gets a lot more crash-prone).
> <
> That is getting hot enough that the LUTs lose their programming !

Yeah.

Whatever is going on, it seems this is enough to cause a 10mm die to get
pretty warm absent some sort of active cooling.

But, if the FPGA gets too warm, whatever it is running will tend to
get crash-prone and/or deadlock. The usual solution is to turn it off
and let it cool off.

>>
>> It didn't come with a heat-sink though, as I guess the board designers
>> figured that passive air-cooling from the bare FPGA was sufficient?...
> <
> More likely they wanted the user of the chip to add appropriate amounts
> of cooling.

Dunno, Artix-7 and Spartan-7 are in the power-and-cost-optimized category.

Marketing materials say the FPGA operates in the milliwatt range.

Vivado tends to give an estimate that it uses a little over 1 watt with
this project (though this estimate varies wildly, anywhere from ~0.7W
to 1.3W).

It could be:
milliwatt range, doesn't need a heatsink.
watt range, probably needs a heatsink.

>>
>> Can't mount a fan directly to the heatsink though, as ~ 12mm fans aren't
>> really a thing (smallest I can find are ~ 30mm). Also seemingly not a
>> thing: a 30mm aluminum heatsink that narrows and sticks onto a 10mm die
>> (via thermal adhesive). Also preferable if the fan could run at 3.3v and
>> under 50mA (so it could be powered via a PMOD connector or similar).
> <
> Piece of cake to machine, starting from a 30×30 heat sink. I could knock one
> out in 10 minutes if I had a 30×30 to start with--the machining of the fins and
> the anodizing are the hard parts. Relieving the bottom so it is only 10×10 is
> easy.

Could be.

I suspect work-holding would be the hard part here:
This part would be too small to be held effectively in a vise.

It has been a while since I have machined anything, mostly as the garage
is an endless crap-storm. Though, I could probably do a lot of it using
a file, rather than a milling machine and an endmill.

Or, if one has a 30mm copper heatsink, and a 10mm x 10mm x 3mm spacer or
similar, one could solder the spacer onto the bottom of the heatsink.

The heatsink I have on there right now is basically a 12mm x 12mm x 6mm
aluminum square with fins. Roughly matches the size of the die, but is
of reduced effectiveness without some form of active airflow.

>>
>> ...
>>
>> Current strategy though is mostly turning it off when it starts getting
>> too warm.
>>>>
>>>> An FPU with a separate FADD and FMUL unit could give lower latency for
>>>> FADD and FMUL, and could fake FMAC with only slightly higher latency
>>>> than the combined unit.
>>> <
>>> Maybe,
>>> FADD: 2-cycles is darned hard, 3-cycles is pretty easy.
>>> FMUL: 4-cycles is rather standard for 16-gates/cycle machines.
>>> FMAC: 4-cycles is pretty hard, 5-cycles is a bit better
>>> <
>>> AMD Athlon and Opteron used FADD=4 and FMUL=4 to simplify the
>>> pipelining and to prevent having both units deliver results in the same
>>> cycle.
> <
>> Not sure how gate-delay compares with FPGA logic levels; ATM I am mostly
>> looking at 12 .. 14 (some parts are 10 or 11 logic levels).
>>
>> Looking at traces, they mostly seem to be LUT3/LUT4/LUT5 with the
>> occasional CARRY4 or similar, traveling between pairs of FDRE elements.
>>
>>
>> Internally, the FADD and FMUL units still have a 5-cycle latency, but
>> gain an extra 2 cycles due to an input/output buffering mechanism (also
>> needed for SIMD).
>>
>> Eg:
>> EX1: FPU receives inputs from pipeline;
>> EX2: Inputs fed into FADD or FMUL;
>> ... Work Cycles ...
>> EX3: Get result from FPU.
>>
>> Previously, the FADD and FMUL would receive inputs directly during EX1,
>> but then they needed to deal with pipeline stalls. Adding the extra
>> cycle (with the outer FPU module managing input/output buffering) makes
>> them independent of the stall, which helps timing, but also adds an
>> extra clock-cycle of latency to the operation.
>>
>> The extra buffering cycle also helps with timing, allowing more time for
>> the value to get from the register-forwarding logic to the FPU.
>>
>>
>> So, FADD Stages:
>> C1 Unpack Input Arguments
>> Find difference of exponents
>> Decide which side is 'A' and which is 'B'
>> Right-Shift FracB
>> C2 Optionally Invert FracB
>> Add (FracA+FracB+Cin)
>> C3 CLZ (Renorm 1)
>> C4 Left-Shift (Renorm 2)
>> Try to round
>> C5 Pack Output / Done
>>
>> FMUL Stages:
>> C1 Unpack Input Arguments
>> Set up Exponents
>> Multiply Input Fragments
>> C2 ADD Stuff
>> C3 ADD Stuff
>> C4 Renorm Adjust / Round
>> C5 Pack Output / Done
>>
>> Renorm is easier for FMUL, as it assumes that the value falls in the
>> range of 1.0 .. 4.0, as opposed to FADD where it can be anything.
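
To flesh out the FMUL stage list above a bit, a rough C model of the
same structure (binary32 for brevity rather than the actual wider
datapath; assumes DAZ and round-to-nearest-even, no NaN/Inf handling,
so it is a sketch of the shape of the thing rather than the real unit):

#include <stdint.h>

uint32_t fmul32(uint32_t a, uint32_t b)
{
    /* C1: unpack, set up exponents, multiply the fractions
       (one multiply here; hardware splits it across the DSPs) */
    uint32_t sgn = (a ^ b) & 0x80000000u;
    int ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
    if (ea == 0 || eb == 0)
        return sgn;                            /* DAZ: flush to zero */
    uint64_t fa = (a & 0x7FFFFFu) | 0x800000u; /* 1.f, 24 bits */
    uint64_t fb = (b & 0x7FFFFFu) | 0x800000u;
    uint64_t p = fa * fb;                      /* in [2^46, 2^48) */

    /* C2/C3: in hardware, these cycles sum the partial products */

    /* C4: renorm adjust and round; the product of two values in
       [1.0, 2.0) lands in [1.0, 4.0), so at most a 1-bit shift */
    int e = ea + eb - 127;
    if (p & (1ull << 47)) e++;
    else                  p <<= 1;
    uint64_t frac = p >> 24, rem = p & 0xFFFFFFu;
    if (rem > 0x800000u || (rem == 0x800000u && (frac & 1)))
        frac++;                                /* round to nearest even */
    if (frac >> 24) { frac >>= 1; e++; }       /* rounding carried out */

    /* C5: pack output */
    if (e <= 0)   return sgn;                  /* underflow to zero */
    if (e >= 255) return sgn | 0x7F800000u;    /* overflow to Inf */
    return sgn | ((uint32_t)e << 23) | ((uint32_t)frac & 0x7FFFFFu);
}
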
>>> <
>>> On the other hand, a single FMAC unit can do it all::
>>> FADD: FMAC 1*Rs1+Rs2
>>> FMUL: FMAC Rs1*Rs2+0
>>> <
>>> So if you find yourself in a position where you need FMAC (say to meet
>>> IEEE 754-2008+) you can have the design team build the FMAC unit.
>>> Later on, when building the next and wider machine, you can add an
>>> FADD or FMUL or both based on statistics you have gathered from
>>> generation 1. Given an FMAC, FADD is a degenerate subset which
>>> a GOOD Verilog compiler can autogenerate if you feed it the above
>>> fixed values {FMAC 1*Rs1+Rs2 and FMAC Rs1*Rs2+0}. This REALLY
>>> reduces the designer workload.
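
FWIW, the degenerate cases are easy to state in C terms using the
standard C99 fma() (just an illustration of the identities, not how
the hardware does it):

#include <math.h>

/* Not quite bit-identical in every corner case: e.g. fma(a, b, 0.0)
   gives +0 where a*b alone would give -0. */
double fadd_via_fmac(double a, double b) { return fma(1.0, a, b); } /* 1*a+b */
double fmul_via_fmac(double a, double b) { return fma(a, b, 0.0); } /* a*b+0 */
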
> <
>> My position is that I don't feel full IEEE conformance is a realistic
>> goal for this.
>>
>>
>> From what I can gather, in a loose sense it does seem to provide most
>> of what IEEE-754-1985 asks for, with a few exceptions:
>> Denormal as Zero;
>> FADD/FSUB/FMUL Only;
>> Compare Ops (via ALU);
>> Format Conversion (via FPU or ALU);
>> ...
>>
>>
>> Native FP Formats:
>> Double / Binary64 (Scalar, 2x SIMD)
>> Single / Binary32 (Conv, 2x | 4x SIMD)
>> Half / Binary16 (Conv, 4x SIMD)
>>
>> Long Double Extension (Optional, Cost):
>> Truncated Quad / Binary128
>>
>> RGBF Extension:
>> FP8S / FP8U (Packed Conv Only)
>>>>
>>>>
>>>> There are some operations though which could exist with an FMAC unit
>>>> which would not work correctly with an FMUL+FADD glued together, but I
>>>> am already pushing the limits of what seems viable on the XC7A100T.
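
The sort of thing I mean here is easy to show in C with the standard
fma(): a true fused unit rounds once, whereas faking FMAC as
FMUL-then-FADD rounds twice, so the low bits can differ:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.0 + 0x1p-30;
    /* exact x*x - 1 is 2^-29 + 2^-60 */
    printf("%a\n", x * x - 1.0);      /* 2^-29: the 2^-60 term is lost
                                         when x*x is rounded first    */
    printf("%a\n", fma(x, x, -1.0));  /* 2^-29 + 2^-60: kept, since
                                         fma rounds only at the end   */
    return 0;
}
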
>>> <
>>> Yep, your implementation medium is getting in your way. So are some of
>>> your tools.
>>>
>> Yeah, probably...
>> Verilator is seemingly pretty buggy in some areas.
>>
>> I am using the freeware / feature-limited version of Vivado, not sure
>> what the Commercial / EDA version is like, or what all features they
>> disabled.
>>
>> From what I can gather, Vivado is sorta like:
>> Free: Spartan and Artix FPGAs, some lower-end Zynq and Kintex devices.
>> Per-device vouchers: They enable certain FPGAs with the purchase of the
>> associated dev boards;
>> Commercial: AFAICT, $1k per seat per year?...
>>
>>
>> Say, if I bought one of the Kintex dev-boards, it would
>> apparently come with a voucher to allow Vivado to target it (well,
>> otherwise, it is a lot of money for a board one can't use).
>>
>> But, the Kintex boards generally go for upwards of $1000, and I still
>> don't have a job at the moment, so this is pretty steep...
>>
>>
>> Though, synthesis on a Kintex at a -2 speed grade (for one of the FPGAs
>> supported by Vivado WebPack) implies I can achieve clock speeds of ~ 150
>> to 200 MHz, as opposed to the 50MHz or 75MHz I can get on an Artix.
>>
>> Someone else had apparently once tested it on a Kintex and got it to
>> pass timing at ~ 180MHz.
>>
>>
>>
>> When I tried before using Quartus on a Cyclone V (targeting the same
>> type as in the DE10), I was able to get it up to ~ 110 MHz, but this
>> didn't seem like enough of a speedup to justify me buying a DE10 (more
>> so when the DE10 had less RAM for the FPGA part, and I could only manage
>> to fit a single BJX2 core into the FPGA).
>>
>> These boards have an ARM SoC + FPGA part, there is like 1GB for the ARM
>> SoC, but with a separate 64MB RAM module for the FPGA.
>>
>> In theory, the number of LUTs/ALMs in the DE10 is large enough that it
>> should be more competitive with an Artix or Spartan; not sure what is
>> going on there...
>>
>> But, as noted, I could clock it a little higher than the Spartan or
>> Artix, but not enough to convince me to throw money at buying the actual
>> hardware or figure out how to deal with interacting with an ARM SoC...
>>
>>
>> Zynq is kinda similar, just I would have to figure out how to go about
>> plugging the BJX2 into an AXI Bus, which would be pretty much the only
>> way it could access RAM or similar.
>>
>> Granted, if I wanted to use Vivado's MIG (Memory Interface Generator), I
>> would also need to figure out AXI.
>>
>>
>> Though, I suspect MIG may know how to make the RAM work correctly in its
>> rated speed window (vs my DDR controller which is apparently running the
>> RAM in a sort of low-power standby mode).
>>
>> I did write a controller which could, in theory, run the RAM at 150MHz
>> (within its rated speed), but couldn't figure out how to make it
>> "actually work" on the actual hardware.
>>
>>
>> But, memory bandwidth is hard...
>> And still a pretty big bottleneck, it seems.
>>
>> ...
