
Alexander Graham Bell is alive and well in New York, and still waiting for a dial tone.

On 05/08/2021 03:36, EricP wrote:
> David Brown wrote:
>> On 04/08/2021 22:45, EricP wrote:
>>> David Brown wrote:
>>>> A typical device for me these days would be based on an ARM Cortex M
>>>> single-core microcontroller. The OS's I use are real-time OS's (such
>>>> as FreeRTOS) that are linked directly to a single executable along
>>>> with the program - the systems typically have only a single binary.
>>>> (There might be an independent bootloader, but that stops once the
>>>> main program starts.)
>>>>
>>>> Processors in this class generally do not have atomic swap, CAS,
>>>> FetchAdd, etc., instructions. Most microcontrollers have RISC cpus
>>>> (except some of the small, old-fashioned 8-bit devices), and typically
>>>> the only tool you have other than disabling interrupts is a
>>>> load-link/store-conditional pair (ldrex/strex on ARM).
>>> ARMv5 had SWP Swap Word and SWPB Swap Byte exchange instructions.
>>> I suppose before v5 super mode had to disable interrupts.
>>> User mode would have to make a system call to do an exchange.
>>
>> SWP and SWPB were deprecated for ARMv6, and dropped in the Thumb-2
>> instruction set used by the Cortex-M devices.
>
> I checked the ARMv7 manual before posting that and it is still there.
> It was removed for ARMv7-M which is a subset of ARMv7-R.

Yes, it is still in ARMv7 - but deprecated. And it does not exist
at all in the Thumb-2 instruction set, which is the only available set
in the Cortex-M family. So you are right that it is in the architecture
(though I believe it is completely gone in ARMv8), but it is missing
from the devices I have been talking about.

>
>>>> Use of ldrex/strex can avoid disabling interrupts. But they are big
>>>> and slow. Disabling interrupts around a critical section (such as an
>>>> atomic increment, or accessing a 64-bit variable) is simpler, more
>>>> efficient and more predictable.
>>> ldrex and strex came later with ARMv6.
>>> I am surprised you say that on ARM they are big and slow.
>>> The whole point of a LL/SC mechanism is that it is extremely lean.
>>
>> Compared to disabling interrupts, LL/SC sequences are big (and therefore
>> slow). They take something like 7 or 8 instructions, with a loop (and
>> in RTOS systems, worst-case timings are important even if the loop is
>> usually only run once). Disabling interrupts takes two or three
>> instructions.
>
> A non-interruptable exchange on Alpha would be
>
> loop:
>     LDL_L rx,[mem]
>     STL_C rx,[mem]
>     BEQ    rx,loop
>
> though strictly speaking that should be a forward branch to
> a backward branch so it is predicted as not taken.
>
>> There are two important aspects of LL/SC, compared to older "lock
>> everything" mechanisms such as disabling interrupts (for single core
>> devices) or x86 locked instructions. It splits the locking in two so
>> that you don't have the delays seen in the x86 world, and it is vastly
>> easier to implement in the hardware of a load/store processor than fully
>> locked instructions.
>>
>> But on a single core system, it is bigger and slower than disabling
>> interrupts. And it does not work inside interrupt functions (which is a
>> big issue for microcontroller usage).
>
> Then Arm must have done something seriously wrong with its implementation.
> There is nothing in any Arm manual that I have seen to indicate a
> problem with ldrex/strex and interrupts, or that they are inherently
> slower than normal LD and ST.
>

I am not sure I am explaining things well here.

The assembly for a 32-bit increment (of a variable in memory) in Thumb-2
will be something along the lines of :

ldr r3, [r2]
add r3, r3, #1
str r3, [r2]

Three instructions, with one load and one store. With tightly-coupled
memory (or on an M0 to M4 microcontroller, on-board ram), loads and
stores are two cycles. So that is 3 instructions, 5 cycles.

Making this atomic on an M0 microcontroller is :

cpsid i

ldr r3, [r2]
add r3, r3, #1
str r3, [r2]

cpsie i

Two extra instructions, at two cycles. (These will generally not be
needed inside an interrupt function, unless the same data will be
accessed by different interrupt routines with different priorities.)

If you want a more flexible sequence that saves and restores interrupt
status, and also need it safe for the dual-issue M7, you can use:

mrs r1, primask
cpsid i
dsb

ldr r3, [r2]
add r3, r3, #1
str r3, [r2]

msr primask, r1

Four instructions, four cycles overhead for making the sequence atomic.
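
The save-and-restore pattern above can be sketched in C roughly as
follows. This is only an illustration that runs on a host: fake_primask,
irq_save() and irq_restore() are hypothetical stand-ins for the PRIMASK
register and the mrs / cpsid / dsb / msr instructions. On a real Cortex-M
build one would presumably use the CMSIS intrinsics __get_PRIMASK(),
__disable_irq(), __DSB() and __set_PRIMASK() in their place.

```c
#include <stdint.h>

/* Hypothetical host-side stand-in for PRIMASK: 1 means interrupts masked. */
static uint32_t fake_primask = 0;

/* Models: mrs r1, primask ; cpsid i ; dsb */
static inline uint32_t irq_save(void)
{
    uint32_t old = fake_primask;
    fake_primask = 1;
    return old;
}

/* Models: msr primask, r1 */
static inline void irq_restore(uint32_t old)
{
    fake_primask = old;
}

/* Increment *p with interrupts masked, then put the mask back as it was. */
void atomic_inc32(volatile uint32_t *p)
{
    uint32_t state = irq_save();
    *p = *p + 1;              /* ldr / add / str */
    irq_restore(state);
}
```

Because the previous mask is restored rather than unconditionally
cleared, the routine leaves interrupts disabled if the caller had
already disabled them.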

The equivalent for this using ldrex/strex is indeed short and fast:

loop:
ldrex r3, [r2]
add r3, r3, #1
strex r1, r3, [r2]
cmp r1, #0
bne loop
dsb

The ldrex and strex instructions don't take any longer than their
non-locking equivalents. And this is safe to use in interrupts. In
real-time work, it is vital to track worst-case execution time, not
best-case or common-case. So though the best case might be 3 cycles
overhead, an extra round of the loop might be another 9 cycles. (Single
extra rounds will happen occasionally - multiple extra rounds should not
be realistically possible.)
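
For comparison, the same retry loop can be written in portable C11. This
is a sketch, not taken from any particular RTOS; atomic_inc32_llsc is a
made-up name. On ARMv7-M targets compilers typically lower the weak
compare-exchange to an ldrex/strex pair, and the while loop corresponds
to the "bne loop" above.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically increment *p and return the new value. */
uint32_t atomic_inc32_llsc(_Atomic uint32_t *p)
{
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);

    /* The weak form may fail spuriously, much as strex can; on failure
       'old' is refreshed with the current value and we go round again. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + 1,
               memory_order_seq_cst, memory_order_relaxed)) {
        /* retry */
    }
    return old + 1;
}
```

The worst-case timing concern is visible in the source: the number of
trips round the loop is not fixed at compile time.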

So it seems to be a viable alternative to disabling interrupts, with
approximately the same overhead. However, it has two major
disadvantages to go with its obvious advantage of not blocking interrupts.

It will only work for sequences ending in a single write of 32 bits or
less, and it will only work for restartable sequences.

Suppose, instead, that the atomic operation you want is not a simple
increment of a 32-bit value, but is storing a 64-bit value. With the
interrupt disable strategy, you still have exactly the same 4
instruction, 4 cycle overhead (or two instructions in the simplest
version), for both read and write routines. How do you do this with
ldrex/strex ?

You can't use the same setup as earlier. Suppose task A takes the
exclusive monitor lock, writes the first half of the 64-bit item, then
there is an interrupt. Task B wants to read the value - it takes the
lock, reads the 64-bit value, and releases it, thinking all is well.
When task A resumes, its write to the second half using strex fails, and
it restarts the write. In the meantime, task B is left with a corrupted
read that it thinks is valid. This is typically a very low probability
event - you never see it in testing, but it /will/ happen when you have
deployed thousands of systems at customers.
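
The torn read can be reproduced deterministically on a host with a small
simulation. Everything here is hypothetical illustration: the "interrupt"
is just a function call placed between the two half-writes, and the names
(shared, read_both, task_a_write_interrupted) are invented for the
example.

```c
#include <stdint.h>

/* A 64-bit value stored as two 32-bit words, as on a 32-bit core. */
typedef struct { uint32_t lo, hi; } u64_halves;

/* Starts out holding 0x00000000FFFFFFFF. */
static u64_halves shared = { 0xFFFFFFFFu, 0x00000000u };

/* Task B: reads both halves and believes the result is consistent. */
static uint64_t read_both(void)
{
    return ((uint64_t)shared.hi << 32) | shared.lo;
}

/* Task A: writes v in two halves, with task B running in between.
   Returns what task B saw. */
uint64_t task_a_write_interrupted(uint64_t v)
{
    shared.lo = (uint32_t)v;            /* first half written          */
    uint64_t seen_by_b = read_both();   /* interrupt: task B reads now */
    shared.hi = (uint32_t)(v >> 32);    /* second half written         */
    return seen_by_b;
}
```

Writing 0x0000000100000000 over the initial 0x00000000FFFFFFFF, task B
sees 0: neither the old value nor the new one.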

So you now use ldrex/strex to control access to an independent lock flag
(a simple semaphore). That works for tasks, but the overhead is now
much bigger, and there is the possibility of failure - if another task
has the semaphore, the current one must cede control to it. And that
means blocking the task, changing dynamic priority levels so that the
other task can run, etc. - a full-blown RTOS mutex solution. These are
very powerful and useful locking mechanisms, but a /vast/ overhead
compared to the four instructions for interrupt disabling, and the
single instruction and 3 cycles needed for the 64-bit load or store.

And what happens in interrupts? An interrupt function must not block
(though it can trigger a scheduler run when it exits). If a task has
the lock when the interrupt is triggered, the interrupt function will
not be able to do its job.
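
A sketch of the problem, using a C11 atomic_flag as the independent lock
flag. The names flag_try_lock, flag_unlock and isr_try_update are
hypothetical, and a real RTOS mutex would do far more (blocking, priority
handling). The point is only that code running in interrupt context can
try the lock but has no way to wait for it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static uint64_t shared_value;

/* Returns true if we took the lock, false if someone else holds it. */
static bool flag_try_lock(void)
{
    return !atomic_flag_test_and_set_explicit(&lock_flag,
                                              memory_order_acquire);
}

static void flag_unlock(void)
{
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

/* Called from interrupt context: it must not block, so if a task
   holds the lock all it can do is give up and report failure. */
bool isr_try_update(uint64_t v)
{
    if (!flag_try_lock())
        return false;
    shared_value = v;
    flag_unlock();
    return true;
}
```

If a task holds the flag when the interrupt fires, isr_try_update()
returns false and the update has to be dropped or deferred.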

In summary, ldrex/strex /can/ be used to implement high-level,
high-power, high-overhead locking mechanisms such as an RTOS mutex or
queue (though interrupt disabling works there too). It can be used as
an alternative to interrupt locking for a limited (albeit common) subset
of atomic operations, with little more cost in run-time but
significantly greater source-code complexity (and therefore scope for
programmer error).

In general, you want to avoid disabling interrupts for any significant
length of time. But a system that can't cope with them being disabled
for the time taken to do a small atomic access is broken anyway.

This is a different world from multi-core processors, or systems where
reading from memory might take 200 clock cycles due to cache misses, or
where it might trigger page faults. Then ldrex/strex becomes essential.

>>> LL loads a word as usual, sets a FF in the cache controller to
>>> indicate the line is "linked" and sets a clock counter to some value.
>>> If any other processor reads the linked cache line, the FF is reset.
>>> If an interrupt occurs the FF is reset.
>>> SC stores the word as usual but checks if the FF is still set
>>> and the clock counter is > 0. If both are true the store occurs.
>>>
>>> The memory coherence detector should be present even on a uniprocessor
>>> as a cache line may be invalidated by DMA. But that is just an address
>>> XOR.
>>>
>>
>> Yes. And all of that is slower than a couple of instructions to disable
>> interrupts.
>
> I keep wondering if you left the DMB memory barrier instructions in?
> If a cpu is just talking between its own interrupts and its own
> non-interrupt levels then it doesn't need memory barriers.
> A cpu's loads and stores are always consistent with itself.
>

The Cortex-M7 has a dual-issue core, and there are pipelines involved.
That means instructions following the interrupt disable could be started
or slightly re-organised, and you need to avoid that. A DSB is cheap in
these devices, and recommended by ARM in connection with interrupt
disables or ldrex/strex sequences. (The M0 core is simpler, and does
not need it.)


computers / comp.arch / Re: Safepoints