> However, the bottom line seems to be that the out-of-order work
> causes us not to immediately queue the next work item in
> wiphy_radio_work_done():

Correct. The logic is a hold-over from when we didn't use priorities. So the
assumption is that what is on the top of the queue is the 'running' task. This
is no longer the case. However, this logic breaks down only in very specific
conditions which you uncovered.

> Specifically, the id that wiphy_radio_work_done() gets called with does not
> match the id of the work on top of the queue (5). So next stays false and we
> don't call wiphy_radio_work_next().
>
> Can anybody explain what the reason for this 'next' variable is? It seems like
> we would always want to start the next work, no?

The reason for it is that we might remove work items that are not running. We
can only run a single work item at a time, so this is distinguishing between a
'running work done' and 'non-running work done'. In the latter case, we
shouldn't start any new work.

> Not that this is a fix to the out-of-order work (start 4, done 5, see above),
> but I could work around iwd getting stuck by refactoring the above function
> according to the attached patch. Basically I just removed the logic of this
> 'next' variable and unconditionally call wiphy_radio_work_next().

I think this is close. But you need to account for the above. We can do it in
one of two ways:
1. When a work item is started, it is artificially assigned INT_MIN priority.
So you can use that as an indicator whether the work item being finished is
running or not.
2. Introduce a specific bool for this purpose.
We also need to update the implementation of wiphy_radio_work_is_running() since
that is also wrong.
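
To make option 2 concrete, here is a rough sketch of what I have in mind. It is
not iwd code: struct work_item, struct work_queue and the work_*() helpers are
hypothetical stand-ins built on a plain linked list, and callbacks are left out.
The point is only that both 'done' and 'is running' look at an explicit flag
instead of assuming the head of the queue is the running item.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not iwd's actual types. */
struct work_item {
	uint32_t id;
	int priority;		/* lower value is served first */
	bool running;		/* option 2: explicit 'running' marker */
	struct work_item *next;
};

struct work_queue {
	struct work_item *head;
	uint32_t last_id;
};

static struct work_item *work_find(struct work_queue *q, uint32_t id)
{
	struct work_item *it;

	for (it = q->head; it; it = it->next)
		if (it->id == id)
			return it;

	return NULL;
}

static bool work_any_running(struct work_queue *q)
{
	struct work_item *it;

	for (it = q->head; it; it = it->next)
		if (it->running)
			return true;

	return false;
}

/* Start whatever is at the head of the queue, if anything. */
static void work_next(struct work_queue *q)
{
	if (q->head)
		q->head->running = true;
}

/* Insert sorted by priority. Returns the new id, or 0 if allocation fails. */
static uint32_t work_insert(struct work_queue *q, int priority)
{
	struct work_item **pp = &q->head;
	struct work_item *item = calloc(1, sizeof(*item));

	if (!item)
		return 0;

	item->id = ++q->last_id;
	item->priority = priority;

	while (*pp && (*pp)->priority <= priority)
		pp = &(*pp)->next;

	item->next = *pp;
	*pp = item;

	/* Only start work if the radio is idle. */
	if (!work_any_running(q))
		work_next(q);

	return item->id;
}

/* The flag, not the queue position, decides whether an item is running. */
static bool work_is_running(struct work_queue *q, uint32_t id)
{
	struct work_item *item = work_find(q, id);

	return item && item->running;
}

static void work_done(struct work_queue *q, uint32_t id)
{
	struct work_item **pp = &q->head;
	struct work_item *item;
	bool was_running;

	while (*pp && (*pp)->id != id)
		pp = &(*pp)->next;

	if (!*pp)
		return;

	item = *pp;
	*pp = item->next;
	was_running = item->running;
	free(item);

	/* Removing a pending item must not start new work. */
	if (was_running)
		work_next(q);
}
```

With this shape, a higher priority item inserted ahead of the running one no
longer confuses work_done(): finishing the running item starts the new head,
while cancelling a pending item leaves the running one alone.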