From: ThatWoul...@thevillage.com (EricP)
Newsgroups: comp.arch
Subject: Re: Safepoints
Date: Fri, 06 Aug 2021 11:56:27 -0400
Message-ID: <r_cPI.1442$yW1.495@fx08.iad>
In-Reply-To: <segqla$4k9$1@dont-email.me>
References: <se1s18$sf$1@z-news.wcss.wroc.pl> <se36el$en5$1@dont-email.me> <se4aae$qma$1@dont-email.me> <sedj2d$b0l$1@dont-email.me> <seeb8j$f0t$1@dont-email.me> <1JzOI.8$xM2.7@fx22.iad> <seejpg$ce6$1@dont-email.me> <seeqsh$1ak$1@dont-email.me> <a0DOI.93$fI7.10@fx33.iad> <sef0rg$9ps$1@dont-email.me> <5hHOI.1976$lK.1523@fx41.iad> <segqla$4k9$1@dont-email.me>

David Brown wrote:
>
> I am not sure I am explaining things well here.

Your explanations are fine but we have been talking about different things.

You have been mostly referring to atomic memory operations between
threads or smp cpus, and I explicitly excluded atomic operations.

I was expanding on Mitch's earlier point about turning asynchronous
events delivered using interrupt semantics into synchronous ones,
which is what the original poster was asking about.

By atomic I mean memory operations between concurrent threads or cpus.
A non-interruptible sequence is orthogonal to that: it applies within a
single thread or cpu. E.g. on x86, ADD [mem],reg is a non-interruptible
R-M-W sequence but is not atomic; LOCK ADD [mem],reg is both atomic and
non-interruptible.
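
To make the distinction concrete, here is a minimal C sketch using GCC-style
inline asm on x86 (an immediate increment rather than a register operand;
the function names are just for illustration):

#include <stdint.h>

/* Single R-M-W instruction: cannot be split by an interrupt on its own
   cpu, but other cpus can still interleave with it. */
static inline void inc_noninterruptible(uint32_t *p)
{
    __asm__ volatile ("addl $1, %0" : "+m"(*p));
}

/* The LOCK prefix makes the same R-M-W atomic with respect to other
   cpus as well as non-interruptible. */
static inline void inc_atomic(uint32_t *p)
{
    __asm__ volatile ("lock addl $1, %0" : "+m"(*p));
}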

> The assembly for a 32-bit increment (of a variable in memory) in Thumb-2
> will be something along the lines of :
>
> ldr r3, [r2]
> add r3, r3, #1
> str r3, [r2]
>
> Three instructions, with one load and one store. With tightly-coupled
> memory (or on an M0 to M4 microcontroller, on-board ram), loads and
> stores are two cycles. So that is 3 instructions, 5 cycles.
>
> Making this atomic on an M0 microcontroller is :
>
> cpsid i
>
> ldr r3, [r2]
> add r3, r3, #1
> str r3, [r2]
>
> cpsie
>
> Two extra instructions, at two cycles. (These will generally not be
> needed inside an interrupt function, unless the same data will be
> accessed by different interrupt routines with different priorities.)
>
> If you want a more flexible sequence that saves and restores interrupt
> status, and also need it safe for the dual-issue M7, you can use:
>
> mrs r1, primask
> cpsid i
> dsb
>
> ldr r3, [r2]
> add r3, r3, #1
> str r3, [r2]
>
> msr primask, r1
>
> Four instructions, four cycles overhead for making the sequence atomic.

Yes, most OS programming standards require interrupt state
be saved and restored so the code is reentrant.

> The equivalent for this using ldrex/strex is indeed short and fast:
>
> loop:
> ldrex r3, [r2]
> add r3, r3, 1
> strex r1, r3, [r2]
> cmp r1, #0
> bne loop
> dsb

I looked at the ARMv7 manual again and I still think,
_for code that is within a single thread or cpu, as per
the original question about asynchronous signals_,
that the DSB is unnecessary in this particular case.
A cpu will always see its own loads and stores in program order.
As DSB stalls the cpu until all prior memory accesses and branches have
completed, it would impose extra overhead on the sequence.

> The ldrex and strex instructions don't take any longer than their
> non-locking equivalents. And this is safe to use in interrupts. In
> real-time work, it is vital to track worst-case execution time, not
> best-case or common-case. So though the best case might be 3 cycles
> overhead, an extra round of the loop might be another 9 cycles. (Single
> extra rounds will happen occasionally - multiple extra rounds should not
> be realistically possible.)
>
> So it seems to be a viable alternative to disabling interrupts, with
> approximately the same overhead. However, it has two major
> disadvantages to go with its obvious advantage of not blocking interrupts.
>
> It will only work for sequences ending in a single write of 32 bits or
> less, and it will only work for restartable sequences.

Yes, this is a limitation of LL/SC - it cannot do double-wide operations.

A few years ago I proposed an enhancement here which
extends LL/SC to cover multiple locations within a single cache line:

LL  - load linked; retains the lock if the load is to the same cache line
SCH - store conditional and hold; stores and retains the lock
SCR - store conditional and release; stores and releases the lock
CLL - clear lock

The extra cost in the cache controller is holding the first line's updates
separate until the release occurs, in case it needs to roll back.
For the modified line it can either use a separate cache line buffer,
or if L2 is inclusive it can update L1 and either keep or toss that copy.
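
As a rough sketch of how the proposal might be used (the __ll/__sch/__scr/__cll
intrinsics are made up for illustration; nothing like them exists in a real
ISA), moving a count between two fields in the same cache line:

#include <stdint.h>

uint32_t __ll(volatile uint32_t *p);              /* LL: load and lock the line      */
int      __sch(volatile uint32_t *p, uint32_t v); /* SCH: conditional store, hold    */
int      __scr(volatile uint32_t *p, uint32_t v); /* SCR: conditional store, release */
void     __cll(void);                             /* CLL: clear the lock (unused here) */

/* Move n from slot[0] to slot[1]; both words sit in the same cache line,
   so the second LL retains the lock taken by the first. */
void transfer(volatile uint32_t *slot, uint32_t n)
{
    for (;;) {
        uint32_t from = __ll(&slot[0]);
        uint32_t to   = __ll(&slot[1]);
        if (!__sch(&slot[0], from - n))   /* store held until the release below */
            continue;                     /* line was stolen: retry             */
        if (__scr(&slot[1], to + n))      /* commit both stores, release lock   */
            return;
    }
}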

> Suppose, instead, that the atomic operation you want is not a simple
> increment of a 32-bit value, but is storing a 64-bit value. With the
> interrupt disable strategy, you still have exactly the same 4
> instruction, 4 cycle overhead (or two instructions in the simplest
> version), for both read and write routines. How do you do this with
> ldrex/strex ?

Yes, unfortunately in general you don't.

However there are specific situations where it can be done.
E.g. reading a 64-bit clock on a 32-bit cpu: since the clock is always
increasing, one can read high1, read low, read high2, and compare high1
with high2, retrying if they differ.
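
In C it looks something like this (CLOCK_HI and CLOCK_LO are hypothetical
memory-mapped registers, named only for the sketch):

#include <stdint.h>

volatile uint32_t CLOCK_HI, CLOCK_LO;   /* hypothetical clock registers */

uint64_t read_clock64(void)
{
    uint32_t hi1, lo, hi2;
    do {
        hi1 = CLOCK_HI;          /* read high word                     */
        lo  = CLOCK_LO;          /* read low word                      */
        hi2 = CLOCK_HI;          /* read high word again               */
    } while (hi1 != hi2);        /* low word wrapped in between: retry */
    return ((uint64_t)hi2 << 32) | lo;
}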

Another example is updating a 64-bit PTE on a 32-bit system without using
spinlocks, while never leaving the intermediate PTE in an illegal state.
One does the reads and writes, setting and clearing bits,
in a particular order.
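
A hedged sketch of one possible ordering, assuming the valid bit lives in
bit 0 of the low word (the exact sequence depends on the page-table format):

#include <stdint.h>

typedef struct { volatile uint32_t lo, hi; } pte64_t;

void pte_replace(pte64_t *pte, uint32_t new_lo, uint32_t new_hi)
{
    pte->lo &= ~1u;      /* 1. clear the valid bit: walker now sees an invalid entry */
    pte->hi  = new_hi;   /* 2. update the high word while the entry is invalid       */
    pte->lo  = new_lo;   /* 3. publish the low word, with the valid bit set, last    */
    /* a TLB invalidate for this address would normally follow */
}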

> You can't use the same setup as earlier. Suppose task A takes the
> exclusive monitor lock, writes the first half of the 64-bit item, then
> there is an interrupt. Task B wants to read the value - it takes the
> lock, reads the 64-bit value, and releases it, thinking all is well.
> When task A resumes, its write to the second half using strex fails, and
> it restarts the write. In the meantime, task B is left with a corrupted
> read that it thinks is valid. This is typically a very low probability
> event - you never see it in testing, but it /will/ happen when you have
> deployed thousands of systems at customers.
>
> So you now use ldrex/strex to control access to an independent lock flag
> (a simple semaphore). That works for tasks, but the overhead is now
> much bigger, and there is the possibility of failure - if another task
> has the semaphore, the current one must cede control to it. And that
> means blocking the task, changing dynamic priority levels so that the
> other task can run, etc. - a full-blown RTOS mutex solution. These are
> very powerful and useful locking mechanisms, but a /vast/ overhead
> compared to the four instructions for interrupt disabling, and the
> single instruction and 3 cycles needed for the 64-bit load or store.

Right, this is all the atomic stuff which I am avoiding by narrowly
constraining the problem to e.g. signals within a single thread.

> And what happens in interrupts? An interrupt function must not block
> (though it can trigger a scheduler run when it exits). If a task has
> the lock when the interrupt is triggered, it will not be able to do its job.

By design this is illegal in any OS I know of
and will panic (crash/halt/bugcheck) the system.

Mutexes coordinate threads; spinlocks coordinate kernel code running on
different cpus.
Spinlocks coordinate interrupts at the same interrupt priority and must
never be shared across interrupt priority levels as they _WILL_ deadlock.
A spinlock initialized for, say, IPL 3 should panic the OS on
attempts to acquire or release it at any other IPL.
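
Something along these lines (current_ipl(), panic() and atomic_xchg() are
hypothetical OS helpers, named only for the sketch):

extern int  current_ipl(void);
extern void panic(const char *why);
extern int  atomic_xchg(volatile int *p, int v);

typedef struct { volatile int held; int ipl; } spinlock_t;

void spin_acquire(spinlock_t *lk)
{
    if (current_ipl() != lk->ipl)          /* wrong interrupt priority level */
        panic("spinlock acquired at wrong IPL");
    while (atomic_xchg(&lk->held, 1))      /* spin until we own it */
        ;
}

void spin_release(spinlock_t *lk)
{
    if (current_ipl() != lk->ipl)
        panic("spinlock released at wrong IPL");
    lk->held = 0;
}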

> In summary, ldrex/strex /can/ be used to implement high-level,
> high-power, high-overhead locking mechanisms such as an RTOS mutex or
> queue (though interrupt disabling works there too). It can be used as
> an alternative to interrupt locking for a limited (albeit common) subset
> of atomic operations, with little more cost in run-time but
> significantly greater source-code complexity (and therefore scope for
> programmer error).

Yes.

> In general, you want to avoid disabling interrupts for any significant
> length of time. But a system that can't cope with them being disabled
> for the time taken to do a small atomic access, is broken anyway.
>
>
> This is a different world from multi-core processors, or systems where
> reading from memory might take 200 clock cycles due to cache misses, or
> where it might trigger page faults. Then ldrex/strex becomes essential.

I'm thinking that exchange and compare-exchange may be essential too.
Too many algorithms need a non-interruptible pointer swap.

Plus 32-bit cpus with 64-bit PTEs need 64-bit LD, ST, Xchg, and CmpXchg,
even if restricted to 8-byte alignment.
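
For example, with the GCC/Clang __atomic builtins a 64-bit compare-exchange
on 32-bit x86 compiles to LOCK CMPXCHG8B; keeping the operand 8-byte aligned
keeps the locked access within one cache line (pte_cas is a made-up helper
name):

#include <stdint.h>

/* cpu should be passed an 8-byte-aligned pte pointer */
int pte_cas(uint64_t *pte, uint64_t expected, uint64_t desired)
{
    return __atomic_compare_exchange_n(pte, &expected, desired,
                                       0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}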

I'm warming to your interrupt-disable-for-N instruction.
One can avoid any denial-of-service problems by simply ignoring
further disable_N requests while the current N is still counting down,
and by limiting N to, say, less than 8.
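
A behavioural model of those semantics might look like this (C-form
pseudocode; DISABLE_N is not a real instruction):

#define N_MAX 8

static unsigned remaining;              /* instructions left with interrupts masked */

void exec_disable_n(unsigned n)         /* models executing DISABLE_N n */
{
    if (remaining == 0 && n < N_MAX)    /* ignore nested/oversized requests: no DoS */
        remaining = n;
}

void on_instruction_retire(void)        /* called once per retired instruction */
{
    if (remaining > 0)
        remaining--;
}

int interrupts_masked(void)
{
    return remaining > 0;
}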

>>>> LL loads a word as usual, sets a FF in the cache controller to
>>>> indicate the line is "linked" and sets a clock counter to some value.
>>>> If any other processor reads the linked cache line, the FF is reset.
>>>> If an interrupt occurs the FF is reset.
>>>> SC stores the word as usual but checks if the FF is still set
>>>> and the clock counter is > 0. If both are true the store occurs.
>>>>
>>>> The memory coherence detector should be present even on a
>>>> uniprocessor as
>>>> a cache line may be invalidated by DMA. But that is just an address XOR.
>>>>
>>> Yes. And all of that is slower than a couple of instructions to disable
>>> interrupts.
>> I keep wondering if you left the DMB memory barrier instructions in?
>> If a cpu is just talking between its own interrupts and its own
>> non-interrupt levels then it doesn't need memory barriers.
>> A cpu's loads and stores are always consistent with itself.
>>
>
> The Cortex-M7 has a dual-issue core, and there are pipelines involved.
> That means instructions following the interrupt disable could be started
> or slightly re-organised, and you need to avoid that. A DSB is cheap in
> these devices, and recommended by ARM in connection with interrupt
> disables or ldrex/strex sequences. (The M0 core is simpler, and does
> not need it.)

For non-atomic use, i.e. within a single thread between its normal and
its own signal level, or between a single cpu and its interrupts,
the data dependencies themselves look to me to be sufficient
ordering, which is why I thought the DSB superfluous.
