With all this talk of intermittent failures and folks coming up with ideas on how to solve them, I figured I should dive headfirst into looking at failures. I have been working with a few folks on this, specifically :parkouss and :vaibhav1994. In this experiment (actually the second time I have done it), I take a given intermittent failure bug and retrigger it. If it reproduces, then I go back in history looking for where it became intermittent. This weekend I wrote up some notes as I was trying to define what an intermittent is.
Let’s first outline the parameters for this experiment:
- All bugs marked with keyword ‘intermittent-failure’ qualify
- Bugs must not be resolved or assigned to anybody
- Bugs must have been filed in the last 28 days (we only keep 30 days of builds)
- Bugs must have >=20 tbplrobot comments (arbitrarily picked, ideally we want something that can easily reproduce)
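The criteria above can be sketched as a simple filter. This is only an illustration: the `Bug` record and its field names are hypothetical and are not Bugzilla's schema, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Bug:
    # Hypothetical record; field names are illustrative only.
    bug_id: int
    keywords: tuple
    resolved: bool
    assigned: bool
    filed: date
    robot_comments: int  # number of tbplrobot comments


def qualifies(bug, today, max_age_days=28, min_comments=20):
    """Return True if the bug meets all four criteria of the experiment."""
    return (
        "intermittent-failure" in bug.keywords
        and not bug.resolved
        and not bug.assigned
        and (today - bug.filed) <= timedelta(days=max_age_days)
        and bug.robot_comments >= min_comments
    )


# Made-up sample: filed 10 days ago, 25 robot comments, open and unassigned.
sample = Bug(1, ("intermittent-failure",), False, False, date(2015, 5, 1), 25)
print(qualifies(sample, today=date(2015, 5, 11)))
```

The 28-day and 20-comment thresholds are defaults here so they are easy to vary when repeating the experiment.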
Here is what comes out of this:
- 356 bugs are open, not assigned and have the intermittent-failure keyword
- 25 bugs have >=20 comments meeting our criteria
The next step was to look at each of the 25 bugs and see if it makes sense to do this. In fact, I decided not to take action on 13 of the bugs (remember this is an experiment, so my reasoning for ignoring these 13 could be biased):
- 5 bugs were thunderbird/mozmill only
- 3 bugs looked to be related to android harness issues
- Bug 1157090 hadn’t reproduced in 2 weeks; it was related to the APZ feature, which we turned off.
- 1 bug was on mozilla-beta only
- 2 bugs had patches and 1 had a couple of comments indicating a developer was already looking at it
This leaves us with 12 bugs to investigate. The premise here is simple: find the first occurrence of the intermittent (branch, platform, testjob, revision) and re-trigger it 20 times (picked for simplicity). When the results are in, see if we have reproduced it. In fact, only 5 bugs reproduced the exact error in the bug when re-triggered 20 times on a specific job that showed the error.
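To get a feel for what 20 re-triggers can and cannot show, here is a small back-of-the-envelope sketch. It assumes each run fails independently with the same probability, which is a simplification that real intermittents may not satisfy; the function name is just illustrative.

```python
def chance_of_reproducing(failure_rate, retriggers=20):
    """Probability of seeing at least one failure in `retriggers` runs,
    assuming every run fails independently with probability `failure_rate`."""
    return 1.0 - (1.0 - failure_rate) ** retriggers


for rate in (0.01, 0.05, 0.10, 0.25):
    chance = chance_of_reproducing(rate)
    print(f"failure rate {rate:.2f}: chance of at least one failure = {chance:.2f}")
```

Under that assumption, a failure that occurs in 5% of runs shows up at least once in about 64% of 20-run batches, so a clean batch of 20 does not prove the failure is absent.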
Moving on, I started re-triggering jobs back in the pushlog to see where the failure was introduced. I started off by going back 2/4/6 revisions, but got more aggressive as I didn’t see patterns. Here is a summary of how the 5 bugs turned out:
- Bug 1161915 – Windows XP PGO Reftest. Found the root cause (pgo only, lots of pgo builds were required for this) 23 revisions back.
- Bug 1160780 – OSX 10.6 mochitest-e10s-bc1. Found the root cause 33 revisions back.
- Bug 1161052 – Jetpack test failures. So many failures in the same test file, it isn’t very clear if I am reproducing the failure or finding other ones. :erikvold is working on fixing the test in bug 1163796 and ok’d disabling it if we want to.
- Bug 1161537 – OSX 10.6 Mochitest-other. Bisection didn’t find the root cause, but this is a new test case which was added. In a case like this, the new test could have run 100+ times successfully when it was added, then failed when it merged with other branches a couple of hours later!
- Bug 1155423 – Linux debug reftest-e10s-1. This reproduced 75 revisions in the past, and due to that I looked at what changed in our buildbot-configs and mozharness scripts. We actually turned on this test job (hadn’t been running e10s reftests on debug prior to this) and that caused the problem. This can’t be tracked down by re-triggering jobs into the past.
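The walk back through the pushlog could be automated along these lines. This is a sketch, not the process I actually followed by hand: `reproduces` is a hypothetical callback standing in for "re-trigger the job on this revision and check for the failure", and the search assumes that once the failure appears it keeps reproducing on every later revision.

```python
def find_first_bad(revisions, reproduces):
    """Return the index of the earliest revision where the failure reproduces.

    revisions  -- list ordered oldest to newest; the last one is known bad
    reproduces -- hypothetical callback: True if the failure shows up there
    """
    bad = len(revisions) - 1
    good = None
    step = 2
    # Step backwards with growing strides until a revision looks clean.
    while bad > 0:
        probe = max(bad - step, 0)
        if reproduces(revisions[probe]):
            bad = probe
            step *= 2
        else:
            good = probe
            break
    if good is None:
        # Reproduces all the way back to the oldest revision available.
        return bad
    # Bisect between the last clean revision and the first known bad one.
    while bad - good > 1:
        mid = (good + bad) // 2
        if reproduces(revisions[mid]):
            bad = mid
        else:
            good = mid
    return bad


# Made-up example: 100 revisions, failure introduced at index 67.
print(find_first_bad(list(range(100)), lambda rev: rev >= 67))
```

When the search returns the oldest revision, as it would have for the reftest-e10s case above, the cause probably lives outside the tree (job configuration, harness scripts), and re-triggering alone will not find it. A clean-looking probe can also be a false negative, since an intermittent may simply not fire in a given batch of runs.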
In summary, out of 356 bugs, 2 root causes were found by re-triggering. In terms of time invested in this, I have put in about 6 hours to find the root cause of the 5 bugs.