From: david.br...@hesbynett.no (David Brown)
Newsgroups: comp.arch
Subject: Re: Safepoints
Date: Fri, 6 Aug 2021 20:18:41 +0200
Organization: A noiseless patient Spider
Lines: 372
Message-ID: <sejue2$j43$1@dont-email.me>
References: <se1s18$sf$1@z-news.wcss.wroc.pl> <se36el$en5$1@dont-email.me>
<se4aae$qma$1@dont-email.me> <sedj2d$b0l$1@dont-email.me>
<seeb8j$f0t$1@dont-email.me> <1JzOI.8$xM2.7@fx22.iad>
<seejpg$ce6$1@dont-email.me> <seeqsh$1ak$1@dont-email.me>
<a0DOI.93$fI7.10@fx33.iad> <sef0rg$9ps$1@dont-email.me>
<5hHOI.1976$lK.1523@fx41.iad> <segqla$4k9$1@dont-email.me>
<r_cPI.1442$yW1.495@fx08.iad>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 6 Aug 2021 18:18:42 -0000 (UTC)
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
Thunderbird/78.11.0
In-Reply-To: <r_cPI.1442$yW1.495@fx08.iad>
Content-Language: en-GB

On 06/08/2021 17:56, EricP wrote:
> David Brown wrote:
>>
>> I am not sure I am explaining things well here.
>
> Your explanations are fine but we have been talking about different things.
>
> You have been mostly referring to atomic memory operations between
> threads or smp cpus, and I explicitly excluded atomic operations.
>
> I was expanding on Mitch's earlier point about turning asynchronous
> events delivered using interrupt semantics into synchronous ones,
> which is what the original poster was asking about.
>

OK, that does make a difference!

> By atomic I mean memory operations between concurrent threads or cpus.
> Non-interruptibility is an orthogonal property, within a single thread or cpu.
> E.g. on x86, ADD [mem],reg is a non-interruptible R-M-W sequence but
> is not atomic. LOCK ADD [mem],reg is both atomic and non-interruptible.

Yes. On the systems I deal with, non-interruptible implies atomic since
there is only one core. (For other memory masters, such as DMA
controllers, you use other techniques to be sure that all accesses are
consistent with intended behaviour.)

(Atomic does not imply non-interruptible - if a sequence can be
restarted, and nothing is able to observe a half-way stage or change it
in the middle, then it is atomic even if it can be interrupted. The
ldrex/strex protected 32-bit increment is atomic but interruptible.)
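
In C, the compiler will generate exactly that kind of restartable
sequence for you. A minimal sketch only, assuming GCC or Clang
targeting ARMv7-M (Cortex-M3/M4/M7); the counter name is illustrative:

    #include <stdint.h>

    static volatile uint32_t counter;

    void increment_counter(void)
    {
        /* On ARMv7-M this builtin becomes an ldrex/add/strex retry loop:
           an interrupt in the middle makes the strex fail, and the
           sequence simply restarts - atomic, yet interruptible. */
        (void)__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
    }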

>
>> The assembly for a 32-bit increment (of a variable in memory) in Thumb-2
>> will be something along the lines of :
>>
>>     ldr r3, [r2]
>>     add r3, r3, #1
>>     str r3, [r2]
>>
>> Three instructions, with one load and one store.  With tightly-coupled
>> memory (or on an M0 to M4 microcontroller, on-board ram), loads and
>> stores are two cycles.  So that is 3 instructions, 5 cycles.
>>
>> Making this atomic on an M0 microcontroller is :
>>
>>     cpsid i
>>
>>     ldr r3, [r2]
>>     add r3, r3, #1
>>     str r3, [r2]
>>
>>     cpsie i
>>
>> Two extra instructions, at two cycles.  (These will generally not be
>> needed inside an interrupt function, unless the same data will be
>> accessed by different interrupt routines with different priorities.)
>>
>> If you want a more flexible sequence that saves and restores interrupt
>> status, and also need it safe for the dual-issue M7, you can use:
>>
>>     mrs r1, primask
>>     cpsid i
>>     dsb
>>
>>     ldr r3, [r2]
>>     add r3, r3, #1
>>     str r3, [r2]
>>
>>     msr primask, r1
>>
>> Four instructions, four cycles overhead for making the sequence atomic.
>
> Yes, most OS programming standards require interrupt state
> be saved and restored so the code is reentrant.

It depends on where you are in the code - sometimes you already know
that interrupts are disabled (this is typically the case inside
interrupt routines in simpler microcontrollers), and very often you know
that interrupts are enabled (for an ARM Cortex-M device, that is pretty
much all the time, except during such specific critical sections).

>
>> The equivalent for this using ldrex/strex is indeed short and fast:
>>
>> loop:
>>     ldrex r3, [r2]
>>     add r3, r3, #1
>>     strex r1, r3, [r2]
>>     cmp r1, #0
>>     bne loop
>>     dsb
>
> I looked at the ARMv7 manual again and I still think,
> _for code that is within a single thread or cpu as per
> the original question about asynchronous signals_
> that the DSB is unnecessary in this particular case.
> A cpu will always see its own instructions in program order.
> As DSB stalls the cpu until all prior memory and branches have
> completed it would impose an extra overhead on the sequence.

I've looked at the manuals too, and I really don't think they are clear
enough on the matter. But ARM's examples generally do use a DSB in many
such cases. In fact, there should perhaps also be one alongside the
instruction that restores the interrupt status. I think it should come
before the "msr primask" (or "cpsie i") that re-enables interrupts, but
in the FreeRTOS kernel code (which has been checked by ARM folk) it
comes after re-enabling interrupts.

It would be a lot easier if the ARM manuals gave more details here!

The primary point is to ensure that the interrupt disabling actually
takes effect before the load happens, so that you eliminate the
possibility of an interrupt occurring after the load. While it is
correct that the effects of instructions are seen in order on the cpu,
that does not necessarily apply in combination with interrupts, which
are, by definition, asynchronous. And in particular, writes can be
delayed and occur later in the pipeline. Also note that the M7 is
dual-issue, which could also complicate matters. (Again, the ARM manuals
do not give helpful details here as far as I can see, but use DSB liberally.)

For another example, on the AVR microcontroller (which does document
these details), interrupt disabling takes effect one instruction later
due to pipelining. Thus on the AVR, following an interrupt disable, you
always make sure you have a "harmless" instruction (a NOP, if you have
nothing better) before you have a memory access instruction.

On a Cortex-M, a DSB is a single cycle instruction in most situations.

It would be nice to be sure here - but I'd rather err on the side of
having an extra DSB than risk problems. I think I might now add one
before re-enabling interrupts too. And I will see if I can set up some
test code to provoke problems.
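
To make the intent concrete, something along these lines is what I have
in mind in C - a minimal sketch only, assuming the CMSIS intrinsics
(__get_PRIMASK, __disable_irq, __DSB, __set_PRIMASK); the counter and
the exact placement of the second DSB are just illustrative:

    #include <stdint.h>
    #include "core_cm7.h"   /* in practice, the device-specific CMSIS header */

    static volatile uint32_t counter;

    static inline void atomic_increment(void)
    {
        uint32_t primask = __get_PRIMASK();  /* save the interrupt state      */
        __disable_irq();                     /* cpsid i                       */
        __DSB();                             /* be sure the disable has taken
                                                effect before the load below  */
        counter += 1;                        /* ldr / add / str               */
        __DSB();                             /* complete the store before ... */
        __set_PRIMASK(primask);              /* ... the interrupt state is
                                                restored                      */
    }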

>
>> The ldrex and strex instructions don't take any longer than their
>> non-locking equivalents.  And this is safe to use in interrupts.  In
>> real-time work, it is vital to track worst-case execution time, not
>> best-case or common-case.  So though the best case might be 3 cycles
>> overhead, an extra round of the loop might be another 9 cycles.  (Single
>> extra rounds will happen occasionally - multiple extra rounds should not
>> be realistically possible.)
>>
>> So it seems to be a viable alternative to disabling interrupts, with
>> approximately the same overhead.  However, it has two major
>> disadvantages to go with its obvious advantage of not blocking
>> interrupts.
>>
>> It will only work for sequences ending in a single write of 32 bits or
>> less, and it will only work for restartable sequences.
>
> Yes, this is a limitation of LL/SC - it cannot do double wide operations.
>
> A few years ago I proposed an enhancement here which
> allows LL/SC to cover multiple locations within a single cache line.
>
>   LL  - retains the lock if load is to the same cache line
>   SCH - store conditional and hold stores and retains lock
>   SCR - store conditional and release stores and releases lock
>   CLL - Clear lock
>
> The extra cost in the cache controller is holding the first line updates
> separate until the release occurs in case it needs to roll back.
> For the modified line it can either use a separate cache line buffer,
> or if L2 is inclusive it can update L1 and either keep or toss that copy.
>

That could be a useful extension, yes. (On Cortex-M devices, the
granularity of the lock is the entire memory space, so it would not need
to be restricted to a cache line. And often you don't have a cache, or
the memory you are accessing is in uncached tightly-coupled memory.)

>> Suppose, instead, that the atomic operation you want is not a simple
>> increment of a 32-bit value, but is storing a 64-bit value.  With the
>> interrupt disable strategy, you still have exactly the same 4
>> instruction, 4 cycle overhead (or two instructions in the simplest
>> version), for both read and write routines.  How do you do this with
>> ldrex/strex ?
>
> Yes, unfortunately, in general you don't.
>
> However there are specific situations where it can be done.
> E.g. reading a 64-bit clock on a 32-bit cpu: since the clock is always
> increasing, one can read high1, read low, read high2, and compare high1 with high2.
>

Indeed. (That is precisely the method I use for 64-bit clocks - a trick
I learned from 6502 code on the BBC Micro mentioned in another thread here.)
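
In C it comes out something like this - a sketch only, assuming the two
halves are kept in volatile 32-bit words that are updated by an
interrupt (the names are illustrative):

    #include <stdint.h>

    extern volatile uint32_t tick_high;   /* upper 32 bits, written in the ISR */
    extern volatile uint32_t tick_low;    /* lower 32 bits, written in the ISR */

    uint64_t read_tick64(void)
    {
        uint32_t high1, low, high2;
        do {
            high1 = tick_high;
            low   = tick_low;
            high2 = tick_high;
        } while (high1 != high2);   /* retry if the high word changed mid-read */
        return ((uint64_t)high1 << 32) | low;
    }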

> Another example is updating a 64-bit PTE on a 32-bit system without using
> spinlocks and without ever leaving the intermediate PTE in an illegal state.
> One does the reads and writes, setting and clearing bits,
> in a particular order.

Yes, there are many specific methods that can be used when you know more
details. If a 64-bit value is only written during an interrupt, and
read from thread code, then the interrupt can just use a volatile write
(generally a "strd" store-double instruction in Thumb-2) and the
application code can read it safely as a volatile (with an "ldrd" load-double
instruction - which gets restarted if it gets interrupted).

And sometimes instead of having a 32-bit value that can be atomically
incremented from one part of the code and atomically decremented from
another, you can hold two 32-bit values which each part "owns" and can
increment independently. You read it by reading both halves and
subtracting.
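
A sketch of that split-counter idea (names illustrative - the point is
that each half has exactly one writer, so no locking is needed):

    #include <stdint.h>

    static volatile uint32_t added;    /* only ever incremented by the producer */
    static volatile uint32_t removed;  /* only ever incremented by the consumer */

    void producer_put(void)   { added   += 1; }
    void consumer_take(void)  { removed += 1; }

    uint32_t items_pending(void)
    {
        /* Unsigned subtraction gives the right count even across wrap-around. */
        return added - removed;
    }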

>
>> You can't use the same setup as earlier.  Suppose task A takes the
>> exclusive monitor lock, writes the first half of the 64-bit item, then
>> there is an interrupt.  Task B wants to read the value - it takes the
>> lock, reads the 64-bit value, and releases it, thinking all is well.
>> When task A resumes, its write to the second half using strex fails, and
>> it restarts the write.  In the meantime, task B is left with a corrupted
>> read that it thinks is valid.  This is typically a very low probability
>> event - you never see it in testing, but it /will/ happen when you have
>> deployed thousands of systems at customers.
>>
>> So you now use ldrex/strex to control access to an independent lock flag
>> (a simple semaphore).  That works for tasks, but the overhead is now
>> much bigger, and there is the possibility of failure - if another task
>> has the semaphore, the current one must cede control to it.  And that
>> means blocking the task, changing dynamic priority levels so that the
>> other task can run, etc. - a full-blown RTOS mutex solution.  These are
>> very powerful and useful locking mechanisms, but a /vast/ overhead
>> compared to the four instructions for interrupt disabling, and the
>> single instruction and 3 cycles needed for the 64-bit load or store.
>
> Right, this is all the atomic stuff which I am avoiding by narrowly
> constraining the problem to e.g. signals within a single thread.
>

Fair enough.

>> And what happens in interrupts?  An interrupt function must not block
>> (though it can trigger a scheduler run when it exits).  If a task has
>> the lock when the interrupt is triggered, it will not be able to do
>> its job.
>
> By design this is illegal in any OS I know of
> and will panic (crash/halt/bugcheck) the system.
>

The stuff I work with doesn't usually have nice panics or bugchecks, as
that could be too much overhead. Besides, when you don't have a screen,
a user, or a network connection, how are you going to inform someone
of the problem? A complete reset is usually the best you can do - but
that's why we have to be really careful to get these things right in the
first place!

And I agree, taking locks like this within an interrupt is never a good
plan, and usually explicitly disallowed.

> Mutexes coordinate threads, spinlocks coordinate cpu kernels.
> Spinlocks coordinate interrupts at the same interrupt priority and must
> never be shared across interrupt priority levels as they _WILL_ deadlock.
> A spinlock initialized for, say, IPL 3 should panic the OS on
> attempts to acquire or release it at any other IPL.

Spinlocks don't work at all on single cpu systems, except between
threads of the same priority with some kind of time-slicing. Even then,
they are rarely a good idea - mutexes are usually preferred.

>
>> In summary, ldrex/strex /can/ be used to implement high-level,
>> high-power, high-overhead locking mechanisms such as an RTOS mutex or
>> queue (though interrupt disabling works there too).  It can be used as
>> an alternative to interrupt locking for a limited (albeit common) subset
>> of atomic operations, with little more cost in run-time but
>> significantly greater source-code complexity (and therefore scope for
>> programmer error).
>
> Yes.
>
>> In general, you want to avoid disabling interrupts for any significant
>> length of time.  But a system that can't cope with them being disabled
>> for the time taken to do a small atomic access, is broken anyway.
>>
>>
>> This is a different world from multi-core processors, or systems where
>> reading from memory might take 200 clock cycles due to cache misses, or
>> where it might trigger page faults.  Then ldrex/strex becomes essential.
>
> I'm thinking that exchange, compare-exchange may be essential too.
> Too many algorithms need non interruptable pointer swap.
>
> Plus 32-bit cpus with 64-bit PTE's need 64-bit LD, ST, Xchg, CmpXchg
> even if restricted to 8 byte alignment.
>

Atomic exchange and CAS are perhaps not essential, but they are definitely a
nice idea - especially double-width CAS.
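
The kind of pointer swap mentioned above can be sketched with the
GCC/Clang __atomic builtins (on ARMv7 these compile to ldrex/strex
loops; the stack itself is purely illustrative):

    #include <stdbool.h>

    struct node { struct node *next; };

    static struct node *stack_top;   /* head of an illustrative lock-free stack */

    void push(struct node *n)
    {
        struct node *old = __atomic_load_n(&stack_top, __ATOMIC_RELAXED);
        do {
            n->next = old;   /* 'old' is refreshed by a failed CAS below */
        } while (!__atomic_compare_exchange_n(&stack_top, &old, n,
                                              true /* weak */,
                                              __ATOMIC_RELEASE,
                                              __ATOMIC_RELAXED));
    }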

> I'm warming to your interrupt-disable-for-N instruction.
> One can avoid any denial of service problems by simply ignoring
> further disable_N requests while current N is still counting down
> and limiting N to, say, less than 8.

That's exactly the idea, yes. But on bigger processors you might also
want additional restrictions to avoid long delays on interrupts, such as
giving a hard fault on the application if there is a page fault inside
the interrupt disable period (it is the programmer's responsibility to
make sure the critical section only accesses non-pageable memory).

But if you wanted efficient synchronisation and atomic primitives on a
big system, I don't think such an interrupt-disable-for-N instruction is
the key place to start. More important, IMHO, would be dedicated fast
memory for holding things like locks with logic to make it directly
accessible from all cores. You would eliminate the complications and
delays of cache snooping between cores. The blocks of this memory would
have special locking semantics built in.

On one dual core PPC microcontroller I used, there was a block of 16
"semaphore" registers. If a register contained 0, you could write to
it. If it contained something else, you could only write a 0 to it. So
when a thread (or cpu, depending on your preference) wanted to take a
lock, it would write its ID to the lock's semaphore register. Then it
would read it back (all with normal reads and writes) - if the write had
succeeded, it had the lock. It released the lock by writing a 0.
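
In C that protocol comes out roughly as below - the register address and
count are hypothetical here; only the write-if-zero / read-back
behaviour is as described above:

    #include <stdint.h>

    #define SEMA_BASE  0xFFF24000u                 /* hypothetical address */
    #define SEMA(n)    (*(volatile uint32_t *)(SEMA_BASE + 4u * (n)))

    /* my_id must be non-zero; the write only sticks if the register was 0. */
    static int sema_try_lock(unsigned n, uint32_t my_id)
    {
        SEMA(n) = my_id;
        return SEMA(n) == my_id;    /* read back: did we get the lock?      */
    }

    static void sema_unlock(unsigned n)
    {
        SEMA(n) = 0;                /* writing 0 is always allowed: release */
    }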

Obviously on a general-purpose cpu you'd need access controls and bigger
blocks, but something along these lines could give efficient locking
and/or lock-free algorithms without needing kernel help unless there was
contention, without needing bus locks, without needing snooping and
without more than minimal costly inter-processor communication.

Even better, perhaps, would be dedicated hardware to assist in
inter-processor and inter-process communication - hardware queues. (Cue
Tom Gardner and a discussion of XMOS here!)

There is lots of fun to be had trying to figure out better wheels.

>
>>>>> LL loads a word as usual sets a FF in the cache controller to
>>>>> indicate the line is "linked" and sets a clock counter to some value.
>>>>> If any other processor reads the linked cache line, the FF is reset.
>>>>> If an interrupt occurs the FF is reset.
>>>>> SC stores the word as usual but checks if the FF is still set
>>>>> and the clock counter is > 0. If both are true the store occurs.
>>>>>
>>>>> The memory coherence detector should be present even on a
>>>>> uniprocessor as
>>>>> a cache line may be invalidated by DMA. But that is just an address
>>>>> XOR.
>>>>>
>>>> Yes.  And all of that is slower than a couple of instructions to
>>>> disable
>>>> interrupts.
>>> I keep wondering if you left the DMB memory barrier instructions in?
>>> If a cpu is just talking between its own interrupts and its own
>>> non-interrupt levels then it doesn't need memory barriers.
>>> A cpu loads and stores are always consistent with itself.
>>>
>>
>> The Cortex-M7 has a dual-issue core, and there are pipelines involved.
>> That means instructions following the interrupt disable could be started
>> or slightly re-organised, and you need to avoid that.  A DSB is cheap in
>> these devices, and recommended by ARM in connection with interrupt
>> disables or ldrex/strex sequences.  (The M0 core is simpler, and does
>> not need it.)
>
> For non-atomic, ie within a single thread between its normal and
> its own signal level, or between a single cpu and its interrupts,
> the data dependencies themselves look to me to be sufficient
> ordering which is why I thought the DSB superfluous.
>

Again, I understand what you are saying - and I'd love to be sure that
you are right.
