Android: Watchdog is killing system server

Here goes my first fix in Android. The watchdog in the Android framework exists to make sure the framework is restarted after non-recoverable errors such as deadlocks. In other frameworks, I have usually seen the entire phone reboot on such errors. Android is smarter and more efficient here: it restarts only the framework, within seconds [technically, it just respawns a Linux process, system_server].

Monkey is a tool that feeds random key events into the device for stress testing. We saw many runs in which the watchdog killed the system server, causing an Android framework reboot. The root cause appeared to be inside the framework, and it showed up only after random key events had been fed to the framework for many hours. The pattern to reproduce it was not known, yet it happened quite often. Like all bugs, it turned out to be 100% reproducible; it just took time to figure out the pattern.

Monkey has an ActivityController to report ANRs and memory info. This ActivityController is registered with the Activity Manager Service and is invoked when an ANR occurs on the device. Upon invocation, Monkey starts logging stack traces and dumpsys output. Reporting dumpsys holds the lock on Monkey.this and causes a cyclic deadlock when two consecutive ANRs are reported (one right after the other).

The first ANR, caused by either a service timeout or a broadcast timeout, is reported by ActivityManagerService to Monkey's ActivityController via Binder. Meanwhile, the lock on ActivityManagerService is held by serviceTimeout or broadcastTimeout. The ActivityController's appNotResponding() for the first ANR reports procrank, acquires the lock on Monkey.this, sets a few boolean flags such as mRequestAnrTraces and mRequestDumpsysMemInfo, and returns control to ActivityManagerService's service/broadcast timeout. The VM executing the Monkey process then switches to the main Monkey thread, which acquires the lock on Monkey.this and proceeds to report ANR traces.

Meanwhile, a second ANR occurs and the Activity Manager Service invokes the ActivityController's appNotResponding() again (via Binder). appNotResponding() reports procrank and then waits to acquire the lock on Monkey.this, which is held by Monkey's main thread (still busy reporting details for the first ANR). This results in a blocking wait for ActivityManagerService's appNotRespondingLocked().

Meanwhile, Monkey's main thread (holding the lock on Monkey.this) tries to report the meminfo and invokes reportDumpsysMemInfo(), which in turn causes the Android runtime to launch a dumpsys process. The dumpsys process queries the service manager for a reference to the meminfo service and invokes dump() on it. The meminfo service is created by ActivityManagerService's setSystemProcess(). Its dump() method tries to acquire the lock on ActivityManagerService, which is held by ActivityManagerService's service/broadcast timeout (still awaiting the response from the ActivityController's appNotResponding() for the second ANR).

This cyclic deadlock continues for a minute, after which the WatchDog thread of system_server kills system_server because it has not received a response from ActivityManagerService's monitor(). The monitor() of ActivityManagerService also tries to acquire the lock on this, and it is invoked once every minute by android.server.ServerThread.
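To make the watchdog part more concrete, here is a minimal sketch of the idea in plain Java. It is not the Android Watchdog source: the class WatchdogSketch and the methods isResponsive() and checkWhileHeld() are hypothetical names of mine, a ReentrantLock stands in for the lock on ActivityManagerService, and the timeout is a few milliseconds rather than a minute. The only point is that a monitor check which cannot take the service lock in time makes the service look dead.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not Android framework code.
class WatchdogSketch {

    // Stand-in for monitor(): the service counts as responsive only if
    // its lock can be taken within the timeout.
    static boolean isResponsive(ReentrantLock serviceLock, long timeoutMs) {
        try {
            if (serviceLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                serviceLock.unlock();
                return true;
            }
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Runs the check while another thread sits on the service lock,
    // the way the blocked service/broadcast timeout does in the post.
    static boolean checkWhileHeld(long timeoutMs) {
        ReentrantLock serviceLock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread stuck = new Thread(() -> {
            serviceLock.lock();
            try {
                held.countDown();
                release.await(); // stays "stuck" until the check is over
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                serviceLock.unlock();
            }
        });
        stuck.start();
        try {
            held.await(); // make sure the lock is really taken first
            return isResponsive(serviceLock, timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown(); // let the stuck thread finish
        }
    }

    public static void main(String[] args) {
        System.out.println("idle service responsive: "
                + isResponsive(new ReentrantLock(), 50));
        System.out.println("blocked service responsive: "
                + checkWhileHeld(50));
    }
}
```

In this sketch the check simply returns false; in the scenario above, the real watchdog reacts to the missing response by killing system_server.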

DEADLOCK:

ActivityManager --> ActivityController --> Monkey Main --> MemInfo --> ActivityManager
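
The cycle above can be reproduced in miniature with two threads and two locks. This is only a sketch under loud assumptions: DeadlockCycleSketch, holdThenTry() and cycleDetected() are hypothetical names, the Binder calls and the separate dumpsys process are collapsed into two plain threads, and I use tryLock() with a timeout instead of synchronized so that the demo terminates instead of hanging the way the real system did.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the lock cycle, not Android or Monkey code.
class DeadlockCycleSketch {

    // Takes `first`, waits until the peer holds its own lock as well,
    // then tries `second`. Returns true only if `second` was acquired.
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHeld, CountDownLatch bothTried,
                               long timeoutMs) {
        first.lock();
        try {
            bothHeld.countDown();
            bothHeld.await(); // both first locks are now held
            boolean got = second.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (got) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await(); // keep `first` until the peer has tried too
            return got;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    // Returns true when neither side could get the other's lock,
    // which is the cyclic wait from the diagram.
    static boolean cycleDetected(long timeoutMs) {
        ReentrantLock amsLock = new ReentrantLock();    // "ActivityManagerService"
        ReentrantLock monkeyLock = new ReentrantLock(); // "Monkey.this"
        CountDownLatch bothHeld = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicBoolean amsGotMonkey = new AtomicBoolean(false);
        AtomicBoolean monkeyGotAms = new AtomicBoolean(false);

        // Service/broadcast timeout: holds the AMS lock, then needs
        // Monkey.this (through appNotResponding()).
        Thread amsTimeout = new Thread(() -> amsGotMonkey.set(
                holdThenTry(amsLock, monkeyLock, bothHeld, bothTried, timeoutMs)));
        // Monkey main thread: holds Monkey.this, then needs the AMS lock
        // (through dumpsys and the meminfo dump()).
        Thread monkeyMain = new Thread(() -> monkeyGotAms.set(
                holdThenTry(monkeyLock, amsLock, bothHeld, bothTried, timeoutMs)));

        amsTimeout.start();
        monkeyMain.start();
        try {
            amsTimeout.join();
            monkeyMain.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !amsGotMonkey.get() && !monkeyGotAms.get();
    }

    public static void main(String[] args) {
        System.out.println("cyclic wait detected: " + cycleDetected(200));
    }
}
```

Each thread holds one lock and waits for the other, so neither tryLock() can succeed and cycleDetected() should report true. With synchronized blocks, as in the scenario described above, there is no timeout and both sides wait until the watchdog steps in.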

Like always, the challenge is to find the pattern :-)