There are all kinds of great ideas folks have for fixing intermittent issues, and each idea in and of itself is a worthwhile endeavor. I have spent some time over the last couple of months fixing them, filing bugs on them, and discussing them. One question remains: what is the definition of an intermittent?
I don’t plan to lay out a definition; instead I plan to ask some questions and lay out some parameters. According to Orange Factor, there were 4640 failures in the last week (May 3 -> May 10), spread across 514 unique bugs. These are all failures that the sheriffs have done some kind of manual work on to star in Treeherder. I am not sure anybody can paint a pretty enough picture to make it appear we don’t have intermittent failures.
Looking at a few bugs, there are many reasons for intermittent failures:
- core infrastructure (networking, power, large classes of machines (ec2), etc.)
- machine specific (specific machine is failing a lot of jobs)
- CI specific (buildbot issues, twisted issues, puppet issues, etc.)
- CI Harness (usually mozharness)
- Platforms (old platforms/tests we are phasing out, new platforms/tests we are phasing in)
- Test harness (mochitest + libraries, common code for tests, reftest)
- Test Cases (test cases actually causing failures, poorly written, new test cases, etc.)
- Firefox Code (we actually have introduced conditions to cause real failures- just not every time)
- Real regressions (failures which happen every time we run a test)
Those are a lot of reasons, and many of them have nothing to do with poor test cases or bad code in Firefox. Yet many of these failures show up several times a day, and for a developer who wants to fix a bad test, many are not really actionable. Do we need some part of the definition to include the notion of being actionable?
Looking at the history of ‘intermittent-failure’ bugs in Bugzilla, many occur once and never occur again; in fact, this is the case for over half of the bugs filed (we file upwards of 100 new bugs/week). While there are probably reasons for a given test case to fail, if it failed in August 2014 and has never failed since, is that test case intermittent? As a developer, could you really do anything about it, given that reproducing it is virtually impossible?
This is where I start to realize we need to find a way to identify real intermittent bugs/tests and not clutter the statistics with tests that are virtually impossible to reproduce. Thinking back to what is actionable: I have found, while filing bugs for Talos regressions, that the closer a bug is filed to the original patch landing, the better the chance it will get fixed. Adding to that point, we only keep 30 days of builds/test packages around for our CI automation. I really think a definition of an intermittent needs to include some concept of time. Should we ignore intermittent failures which occur only once in 90 days? Maybe ignore ones that don’t reproduce after 1000 iterations? Some could argue for a smaller or larger window of time/iterations.
Lastly, when looking into specific bugs, I often find they are already fixed. Many of the intermittent failures are actually fixed! Do we track how many get fixed? How many have patches or have had debugging already take place? For example, in the last 28 days we have filed 417 intermittents, of which 55 are already resolved; of the remaining 362, only 25 have occurred >=20 times, and of those 25 bugs, 4 already have patches. It appears a lot of work is done to fix the intermittent failures which are actionable. Are the ones which are not being fixed simply not actionable? Or are they in a component where all the developers are busy and heads down?
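The bookkeeping above can be sketched in a few lines of Python. The records and field names here are hypothetical stand-ins for what actually lives in Bugzilla and Orange Factor; this is only meant to show the filtering, not to reproduce the real numbers:

```python
# Hypothetical bug records: (bug_id, times_seen, resolved, has_patch).
# The real data would come from Bugzilla / Orange Factor queries.
bugs = [
    (1, 45, False, True),
    (2, 1, True, False),
    (3, 22, False, False),
    (4, 3, False, False),
]

resolved = [b for b in bugs if b[2]]
remaining = [b for b in bugs if not b[2]]
frequent = [b for b in remaining if b[1] >= 20]   # the ">=20 times" cut
with_patches = [b for b in frequent if b[3]]

print(len(resolved), len(frequent), len(with_patches))
```

With the real data, the same three buckets would yield the 55 / 25 / 4 split mentioned above.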
In a perfect world no failure would ever occur, all tests would be green, and all users would use Firefox. In reality we have to deal with thousands of failures every week, most of which never happen again. This quarter I would like to see many folks get involved in these discussions and help determine:
- what is too infrequent to be intermittent? we can call this noise
- what is the general threshold where something is intermittent?
- what is the general threshold where something is too intermittent and we need to back out a fix or disable a test?
- what is a reasonable timeframe to track these failures such that we can make them actionable?
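To make these questions concrete, here is a minimal sketch of a classifier built from the kinds of thresholds floated above. Every cutoff in it (the 90-day window, the 1000-iteration idea recast as a failure rate, the one-occurrence noise cut) is an assumption for illustration, not an agreed policy:

```python
def classify(failure_count, days_since_last, total_runs,
             noise_max=1, window_days=90, backout_rate=0.05):
    """Classify a failing test using hypothetical thresholds.

    All defaults are placeholders for the values the discussion
    above is meant to settle on.
    """
    if days_since_last > window_days or failure_count <= noise_max:
        return "noise"              # too infrequent or stale to act on
    if failure_count / total_runs >= backout_rate:
        return "too intermittent"   # candidate for backout or disabling
    return "intermittent"           # actionable, worth investigating

print(classify(1, 120, 1000))   # a single stale failure -> "noise"
print(classify(80, 2, 1000))    # 8% failure rate -> "too intermittent"
print(classify(10, 5, 1000))    # 1% failure rate -> "intermittent"
```

The point is not the exact numbers but that, once agreed on, thresholds like these could be applied automatically instead of judged bug by bug.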
Thanks for reading; I look forward to hearing from the many of you who have ideas on this subject. Stay tuned for an upcoming blog post about re-triggering intermittent failures to find the root cause.