on 21/11/2012 17:18 Moore, Robert said the following:
> -----Original Message-----
> From: Andriy Gapon [mailto:avg@FreeBSD.org]
[snip]
> AcpiUtUpdateRefCount handling of REF_DECREMENT && Count
< 1 looks
> worrying.
> Not sure if that ever happens in practice, but since the code exists...
> The case is being treated as a minor event (judging from the debugging
> print), but we are already lucky if we see Count == 0 there instead of
> some garbage or just plain crashing because the memory is already freed.
> Nevertheless we discard even that luck by happily continuing into
> potentially a repeated call to AcpiUtDeleteInternalObj.
This code appears in order to essentially catch a case where an attempt is made to
delete an object twice. Admittedly, it doesn't serve much purpose these days and
should probably be removed. In practice, we never, ever hit the (Count < 1) case
that I know of. In other words, I don't know of any cases where the acpica code
ever attempts to delete an object twice.
With a help from a skilled user we've been able to catch a likely culprit of the
FreeBSD + ACPICA heisenbugs, which all showed some relation to the FreeBSD battery
driver.
Probably unlike other OSes FreeBSD liberally uses AcpiGetObjectInfo in the code
that handles battery status queries from userland. What's more, FreeBSD allows
such queries to run in parallel and, thus, AcpiGetObjectInfo may be executed
concurrently by more than one thread. In turn, AcpiGetObjectInfo calls functions
like AcpiUtExecute_HID, AcpiUtExecute_UID, etc. These functions are called
without holding any lock and thus, for example, AcpiUtExecute_HID may run in parallel.
The above mentioned functions all follow the same template:
{
Status = AcpiUtEvaluateObject (DeviceNode, METHOD_NAME__HID,
ACPI_BTYPE_INTEGER | ACPI_BTYPE_STRING, &ObjDesc);
... do useful work ...
Cleanup:
/* On exit, we must delete the return object */
AcpiUtRemoveReference (ObjDesc);
return_ACPI_STATUS (Status);
}
It can be clearly seen that AcpiUtRemoveReference is called without protection of
any lock. Neither of AcpiUtRemoveReference -> AcpiUtUpdateObjectReference ->
AcpiUtUpdateRefCount use any lock.
So, in the above code path ReferenceCount is decremented without holding any lock.
Given that the decrement is performed in "highly" non-atomic fashion, it is
open
to races.
E.g. let's assume that we have some like this in ASL:
Name (_HID, EisaId ("PNP0C0E"))
Creation of the _HID node would create a value object with reference count of 1.
Let's assume that thread T1 executes AcpiUtExecute_HID.
Then AcpiUtEvaluateObject would increment the count to 2.
Let's assume that thread T2 then concurrently executes AcpiUtExecute_HID.
It is possible that T1 and T2 would then concurrently call AcpiUtUpdateRefCount:
T1 with REF_DECREMENT (via AcpiUtRemoveReference) and T2 with REF_INCREMENT (via
AcpiUtEvaluateObject).
Both would see the same reference count of 2 as a starting value.
If T2 is quicker to write 3 into ReferenceCount, then T1 would overwrite it with 1.
Then, when T2 calls AcpiUtRemoveReference, it would decrement ReferenceCount to
zero and thus the value object would be freed while the node still has a pointer
to its address.
Subsequent accesses to the _HID node would be accessing freed memory.
This description is very consistent with the problem reports from users.
Taking the most recent example:
http://thread.gmane.org/gmane.os.freebsd.devel.acpi/7707
The panic was preceded by the following messages:
ACPI Error: No object attached to node 0xfffffe00094a51c0 (20110527/exresnte-138)
ACPI Error: Method execution failed [\_SB_.BAT0._UID] (Node 0xfffffe00094a51c0),
AE_AML_NO_OPERAND (20110527/uteval-113)
The panic itself is consistent with corruption of ACPICA object cache:
http://thread.gmane.org/gmane.os.freebsd.devel.acpi/7707/focus=7730
All the reports (some of which I mentioned at the start of this thread) follow the
same pattern: a panic in object cache code, frequently preceded by various error
messages related to battery device _UID/_HID/etc (such as missing object or wrong
object type, etc).
I see several possibilities of fixing this problem.
From worst to best:
1. Ensure that AcpiGetObjectInfo/AcpiUtExecute_HID/etc are never executed
concurrently on the same object at the OS level code (primarily in the battery
driver).
2. Ensure that AcpiUtRemoveReference is always invoked under some lock. Not sure
if it should be the Interpreter mutex or the Namespace mutex. I believe that
AcpiUtAddReference are always consistently called under a lock, but I haven't
verified all the code paths. On the other hand, it is obvious that there are many
places where AcpiUtRemoveReference is called without any protection, but it is
hard to conclusively enumerate _all_ of them.
3. Introduce locking to AcpiUtUpdateRefCount to ensure that ReferenceCount is
never modified concurrently. Essentially this should give an equivalence of
atomicity for ReferenceCount manipulations.
Additionally, it would be a good idea to make ReferenceCount == 0 check more
strict (if not fatal). Also, unused and dangerous REF_FORCE_DELETE should be removed.
I like #3 the best because it provides the most guarantees while requiring the
least code inspection/verification efforts.
Of course, you know the code much better, so your assessment of the problem and
possible solutions could be very different.
Here is a quickly hacked together prototype patch that seems to have helped the user:
http://people.freebsd.org/~avg/acpi-ref-count-2.diff
The panic calls are to demonstrate various dangerous places in the current code.
They were actually hit in testing before the locking was added.
--
Andriy Gapon