This will of course take a bit of time for us to completely sort out. You are correct that
there are unlocked calls to the reference count mechanism. This could be a result of some
code restructuring sometime in the past, as I hope that this didn't happen as an
I want to investigate some older code as well as the original design of the reference
If we do need a separate lock for the reference count mechanism, perhaps a spin lock could
be used instead of a mutex, as there is nothing that will block for any length of time.
This may not make any difference on FreeBSD as spinlocks may not be available, I'm not
In the meantime, I have removed the REF_FORCE_DELETE option as per your suggestion; it was
implemented in case it was ever needed, which it has not.
From: Andriy Gapon [mailto:avg@FreeBSD.org]
Sent: Thursday, March 21, 2013 11:08 AM
To: Moore, Robert
Cc: devel(a)acpica.org; Zheng, Lv
Subject: Re: [Devel] ref-counting races?
on 21/11/2012 17:18 Moore, Robert said the following:
>> -----Original Message-----
>> From: Andriy Gapon [mailto:avg@FreeBSD.org]
>> AcpiUtUpdateRefCount handling of REF_DECREMENT && Count < 1 looks
>> Not sure if that ever happens in practice, but since the code exists...
>> The case is being treated as a minor event (judging from the
>> debugging print), but we are already lucky if we see Count == 0 there
>> instead of some garbage or just plain crashing because the memory is
>> Nevertheless we discard even that luck by happily continuing into
>> potentially a repeated call to AcpiUtDeleteInternalObj.
> This code appears in order to essentially catch a case where an
> attempt is made to delete an object twice. Admittedly, it doesn't
> serve much purpose these days and should probably be removed. In
> practice, we never, ever hit the (Count < 1) case that I know of. In
> other words, I don't know of any cases where the acpica code ever
attempts to delete an object twice.
With a help from a skilled user we've been able to catch a likely culprit
of the FreeBSD + ACPICA heisenbugs, which all showed some relation to the
FreeBSD battery driver.
Probably unlike other OSes FreeBSD liberally uses AcpiGetObjectInfo in the
code that handles battery status queries from userland. What's more,
FreeBSD allows such queries to run in parallel and, thus,
AcpiGetObjectInfo may be executed concurrently by more than one thread.
In turn, AcpiGetObjectInfo calls functions like AcpiUtExecute_HID,
AcpiUtExecute_UID, etc. These functions are called without holding any
lock and thus, for example, AcpiUtExecute_HID may run in parallel.
The above mentioned functions all follow the same template:
Status = AcpiUtEvaluateObject (DeviceNode, METHOD_NAME__HID,
ACPI_BTYPE_INTEGER | ACPI_BTYPE_STRING, &ObjDesc); ... do
useful work ...
/* On exit, we must delete the return object */
It can be clearly seen that AcpiUtRemoveReference is called without
protection of any lock. Neither of AcpiUtRemoveReference ->
AcpiUtUpdateObjectReference -> AcpiUtUpdateRefCount use any lock.
So, in the above code path ReferenceCount is decremented without holding
Given that the decrement is performed in "highly" non-atomic fashion, it
is open to races.
E.g. let's assume that we have some like this in ASL:
Name (_HID, EisaId ("PNP0C0E"))
Creation of the _HID node would create a value object with reference count
Let's assume that thread T1 executes AcpiUtExecute_HID.
Then AcpiUtEvaluateObject would increment the count to 2.
Let's assume that thread T2 then concurrently executes AcpiUtExecute_HID.
It is possible that T1 and T2 would then concurrently call
T1 with REF_DECREMENT (via AcpiUtRemoveReference) and T2 with
REF_INCREMENT (via AcpiUtEvaluateObject).
Both would see the same reference count of 2 as a starting value.
If T2 is quicker to write 3 into ReferenceCount, then T1 would overwrite
it with 1.
Then, when T2 calls AcpiUtRemoveReference, it would decrement
ReferenceCount to zero and thus the value object would be freed while the
node still has a pointer to its address.
Subsequent accesses to the _HID node would be accessing freed memory.
This description is very consistent with the problem reports from users.
Taking the most recent example:
The panic was preceded by the following messages:
ACPI Error: No object attached to node 0xfffffe00094a51c0
(20110527/exresnte-138) ACPI Error: Method execution failed
[\_SB_.BAT0._UID] (Node 0xfffffe00094a51c0), AE_AML_NO_OPERAND
The panic itself is consistent with corruption of ACPICA object cache:
All the reports (some of which I mentioned at the start of this thread)
follow the same pattern: a panic in object cache code, frequently preceded
by various error messages related to battery device _UID/_HID/etc (such as
missing object or wrong object type, etc).
I see several possibilities of fixing this problem.
From worst to best:
1. Ensure that AcpiGetObjectInfo/AcpiUtExecute_HID/etc are never executed
concurrently on the same object at the OS level code (primarily in the
2. Ensure that AcpiUtRemoveReference is always invoked under some lock.
Not sure if it should be the Interpreter mutex or the Namespace mutex. I
believe that AcpiUtAddReference are always consistently called under a
lock, but I haven't verified all the code paths. On the other hand, it is
obvious that there are many places where AcpiUtRemoveReference is called
without any protection, but it is hard to conclusively enumerate _all_ of
3. Introduce locking to AcpiUtUpdateRefCount to ensure that ReferenceCount
is never modified concurrently. Essentially this should give an
equivalence of atomicity for ReferenceCount manipulations.
Additionally, it would be a good idea to make ReferenceCount == 0 check
more strict (if not fatal). Also, unused and dangerous REF_FORCE_DELETE
should be removed.
I like #3 the best because it provides the most guarantees while requiring
the least code inspection/verification efforts.
Of course, you know the code much better, so your assessment of the
problem and possible solutions could be very different.
Here is a quickly hacked together prototype patch that seems to have
helped the user:
The panic calls are to demonstrate various dangerous places in the current
They were actually hit in testing before the locking was added.