
Re-Triggering for a [root] cause – some notes from doing it

With all this talk of intermittent failures and folks coming up with ideas on how to solve them, I figured I should dive headfirst into looking at failures.  I have been working with a few folks on this, specifically :parkouss and :vaibhav1994.  In this experiment (actually the second time doing it), I take a given intermittent-failure bug and re-trigger the job that produced it.  If the failure reproduces, I then go back in history looking for where it became intermittent.  This weekend I wrote up some notes as I was trying to define what an intermittent is.

Let’s outline the parameters for this experiment first (a rough sketch of how the bug query could be automated follows the list):

  • All bugs marked with keyword ‘intermittent-failure’ qualify
  • Bugs must not be resolved or assigned to anybody
  • Bugs must have been filed in the last 28 days (we only keep 30 days of builds)
  • Bugs must have >=20 tbplrobot comments (an arbitrary cutoff; ideally we want something that reproduces easily)
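
As a reference, here is a minimal sketch of how a query like this could be automated against the Bugzilla REST API.  The field names and the robot-comment filter are assumptions for illustration, not the exact query used for this experiment.

    # Rough sketch: find candidate intermittent-failure bugs via the Bugzilla
    # REST API.  Field names and the robot-comment filter are assumptions,
    # not the exact query used here.
    from datetime import datetime, timedelta

    import requests

    BUGZILLA = "https://bugzilla.mozilla.org/rest"
    since = (datetime.utcnow() - timedelta(days=28)).strftime("%Y-%m-%d")

    params = {
        "keywords": "intermittent-failure",
        "resolution": "---",               # not resolved
        "creation_time": since,            # filed in the last 28 days
        "include_fields": "id,summary,assigned_to",
    }
    bugs = requests.get(BUGZILLA + "/bug", params=params).json()["bugs"]

    candidates = []
    for bug in bugs:
        if "nobody" not in bug["assigned_to"]:
            continue                       # skip bugs assigned to somebody
        url = "%s/bug/%d/comment" % (BUGZILLA, bug["id"])
        comments = requests.get(url).json()["bugs"][str(bug["id"])]["comments"]
        # count robot-generated comments; matching on "robot" is an assumption
        robot = [c for c in comments if "robot" in c["creator"].lower()]
        if len(robot) >= 20:
            candidates.append(bug)

    print(len(candidates), "bugs meet the criteria")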

Here is what comes out of this:

  • 356 bugs are open, not assigned and have the intermittent-failure keyword
  • 25 bugs have >=20 comments meeting our criteria

The next step was to look at each of the 25 bugs and see if it made sense to do this.  In fact, I decided not to take action on 13 of the bugs (remember, this is an experiment, so my reasoning for ignoring these 13 could be biased):

  • 5 bugs were thunderbird/mozmill only
  • 3 bugs looked to be related to android harness issues
  • bug 1157090 hadn’t reproduced in 2 weeks; it involved the APZ feature, which we turned off
  • one bug occurred only on mozilla-beta
  • 2 bugs had patches, and 1 had a couple of comments indicating a developer was already looking at it

This leaves us with 12 bugs to investigate.  The premise here is easy: find the first occurrence of the intermittent (branch, platform, test job, revision) and re-trigger it 20 times (a number picked for simplicity).  When the results are in, see if we have reproduced it.  In fact, only 5 bugs reproduced the exact error from the bug when re-triggered 20 times on a specific job that showed the error.
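
The re-trigger-and-confirm step is simple enough to script.  The sketch below leans on hypothetical retrigger_job() and get_job_logs() helpers standing in for whatever actually schedules and fetches the jobs; only the decision logic is the point.

    # Sketch of the re-trigger-and-confirm step.  retrigger_job() and
    # get_job_logs() are hypothetical helpers, not real APIs; the logic is
    # simply "re-trigger 20 times, then scan the logs for the failure line".
    RETRIGGERS = 20

    def reproduces(branch, revision, job_name, failure_line):
        """Re-trigger a job 20 times and report whether the failure shows up."""
        retrigger_job(branch, revision, job_name, count=RETRIGGERS)

        # ... wait for the re-triggered jobs to finish, then scan the logs ...
        hits = sum(1 for log in get_job_logs(branch, revision, job_name)
                   if failure_line in log)
        print("%s %s %s: %d/%d re-triggers hit the failure"
              % (branch, revision, job_name, hits, RETRIGGERS))
        return hits > 0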

Moving on, I started re-triggering jobs further back in the pushlog to see where the failure was introduced.  I started off by going back 2/4/6 revisions, but got more aggressive as I didn’t see patterns (a rough sketch of this walk-back loop follows the list below).  Here is a summary of how the 5 bugs turned out:

  • Bug 1161915 – Windows XP PGO Reftest.  Found the root cause (pgo only, lots of pgo builds were required for this) 23 revisions back.
  • Bug 1160780 – OSX 10.6 mochitest-e10s-bc1.  Found the root cause 33 revisions back.
  • Bug 1161052 – Jetpack test failures.  There are so many failures in the same test file that it isn’t clear whether I am reproducing the failure or finding other ones.  :erikvold is working on fixing the test in bug 1163796 and ok’d disabling it if we want to.
  • Bug 1161537 – OSX 10.6 Mochitest-other.  Bisection didn’t find the root cause, but this is a new test case which was added.  This is a case where the new test case could have been run 100+ times successfully when it was added, and then failed a couple of hours later when it merged with other branches!
  • bug 1155423 – Linux debug reftest-e10s-1.  This reproduced 75 revisions in the past, and because of that I looked at what changed in our buildbot-configs and mozharness scripts.  We had actually turned on this test job (we hadn’t been running e10s reftests on debug prior to this), and that caused the problem.  This can’t be tracked down by re-triggering jobs into the past.
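
For the curious, the walk-back over the pushlog mentioned above looks roughly like the sketch below.  It reuses the hypothetical reproduces() helper from the previous sketch, and the json-pushes parameters shown are assumptions worth double-checking.

    # Sketch of walking back through the pushlog to find where the failure
    # starts.  Reuses the hypothetical reproduces() helper above; the
    # json-pushes parameters are assumptions worth double-checking.
    import requests

    def earlier_revisions(branch, revision, count):
        """Return up to `count` push tip revisions older than `revision`."""
        url = "https://hg.mozilla.org/integration/%s/json-pushes" % branch
        pushes = requests.get(url, params={"tochange": revision}).json()
        push_ids = sorted(pushes, key=int, reverse=True)
        return [pushes[pid]["changesets"][-1] for pid in push_ids[1:count + 1]]

    def find_root_cause(branch, revision, job_name, failure_line):
        step = 2                          # go back 2, then 4, 6, ... revisions
        last_bad = revision
        while True:
            older = earlier_revisions(branch, last_bad, step)
            if not older:
                return last_bad           # ran out of history (or of builds)
            candidate = older[-1]
            if reproduces(branch, candidate, job_name, failure_line):
                last_bad = candidate      # still failing, keep walking back
                step = step + 2 if step < 6 else step * 2   # get more aggressive
            else:
                # the root cause lies between candidate and last_bad; a real
                # run would re-trigger the pushes in between to narrow it down
                return last_bad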

In summary, out of 356 bugs, 2 root causes were found by re-triggering.  In terms of time invested, I have put in about 6 hours to find the root causes of these 5 bugs.



intermittent oranges – missing a real definition

There are all kinds of great ideas folks have for fixing intermittent issues.  In fact, each idea in and of itself is a worthwhile endeavor.  I have spent some time over the last couple of months fixing them, filing bugs on them, and really discussing them.  One question remains: what is the definition of an intermittent?

I don’t plan to lay out a definition; instead I plan to ask some questions and lay out some parameters.  According to Orange Factor, there were 4,640 failures in the last week (May 3 -> May 10), all within 514 unique bugs.  These are all failures that the sheriffs have done some kind of manual work on to star on Treeherder.  I am not sure anybody can find a way to paint a pretty picture to make it appear we don’t have intermittent failures.

Looking at a few bugs, there are many reasons for intermittent failures:

  • core infrastructure (networking, power, large classes of machines (ec2), etc.)
  • machine specific (specific machine is failing a lot of jobs)
  • CI specific (buildbot issues, twisted issues, puppet issues, etc.)
  • CI Harness (usually mozharness)
  • Platforms (old platforms/tests we are phasing out, new platforms/tests we are phasing in)
  • Test harness (mochitest + libraries, common code for tests, reftest)
  • Test Cases (test cases actually causing failures, poorly written, new test cases, etc.)
  • Firefox Code (we actually have introduced conditions to cause real failures- just not every time)
  • Real regressions (failures which happen every time we run a test)

There are a lot of reasons, and many of them have nothing to do with poor test cases or bad code in Firefox.  But many of these failures show up several times a day, and for a developer who wants to fix a bad test, they are not really actionable.  Do we need part of the definition to cover whether something is actionable?

Looking at the history of ‘intermittent-failure’ bugs in Bugzilla, many occur once and never occur again.  In fact this is the case for over half of the bugs filed (we file upwards of 100 new bugs/week).  While there are probably reasons for a given test case to fail, if it failed in August 2014 and has never failed again, is that test case intermittent?  As a developer, could you really do anything about this, given that reproducing it is virtually impossible?

This is where I start to realize we need to find a way to identify real intermittent bugs/tests and not clutter the statistics with tests which are virtually impossible to reproduce.  Thinking back to what is actionable: I have found while filing bugs for Talos regressions that the closer the bug is filed to the original patch landing, the better the chance it will get fixed.  Adding to that point, we only keep 30 days of builds/test packages around for our CI automation.  I really think a definition of an intermittent needs to include some concept of time.  Should we ignore intermittent failures which occur only once in 90 days?  Maybe ignore ones that don’t reproduce after 1000 iterations?  Some could argue that we should look at a smaller or larger window of time/iterations.  (One possible way to encode these thresholds is sketched below.)
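
To make that concrete, here is one way the thresholds under discussion could be encoded.  The cutoffs are the numbers floated in this post, not an agreed-upon definition.

    # One possible encoding of the thresholds discussed above -- the numbers
    # are talking points from this post, not an agreed-upon definition.
    from datetime import datetime, timedelta

    NOISE_WINDOW_DAYS = 90    # a single failure in 90 days -> call it noise?
    FREQUENT_THRESHOLD = 20   # >=20 occurrences -> clearly intermittent?

    def classify(failure_dates, now=None):
        """Bucket a bug's failure history: noise / intermittent / frequent."""
        now = now or datetime.utcnow()
        recent = [d for d in failure_dates
                  if now - d <= timedelta(days=NOISE_WINDOW_DAYS)]
        if len(recent) <= 1:
            return "noise"
        if len(recent) >= FREQUENT_THRESHOLD:
            return "frequent (candidate for backout or disabling?)"
        return "intermittent"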

Lastly, when looking into specific bugs, I find many times they are already fixed.  Many of the intermittent failures are actually fixed!  Do we track how many get fixed?  How many have patches or debugging already taking place?  For example, in the last 28 days we have filed 417 intermittents, of which 55 are already resolved, and of the remaining 362 only 25 have occurred >=20 times.  Of these 25 bugs, 4 already have patches.  It appears a lot of work is done to fix intermittent failures which are actionable.  Are the ones which are not being fixed not actionable?  Are they in a component where all the developers are busy and heads down?

In a perfect world a failure would never occur, all tests would be green, and all users would use Firefox.  In reality we have to deal with thousands of failures every week, most of which never happen again.  This quarter I would like to see many folks get involved in discussions and determine:

  • what is too infrequent to be intermittent? we can call this noise
  • what is the general threshold where something is intermittent?
  • what is the general threshold where something is too intermittent and we need to back out a fix or disable a test?
  • what is a reasonable timeframe to track these failures such that we can make them actionable?

Thanks for reading; I look forward to hearing from many who have ideas on this subject.  Stay tuned for an upcoming blog post about re-triggering intermittent failures to find the root cause.


polishing browser-chrome – coming to a branch near you soon

For the last 2 weeks I have gone headfirst into resolving some issues with our mochitest browser-chrome tests, working with RyanVM and Armen, and with the help of Gavin and many developers who are fixing problems left and right.

There are 3 projects I have been focusing on:

1) Moving our Linux debug browser chrome tests off our old fedora slaves in a datacenter and running them on ec2 slave instances, in bug 987892.

These are live and green on all Firefox 29, 30, and 31 trees!  More work is needed for Firefox 28 and ESR 24, which should be wrapped up this week.  Next week we can stop running all Linux unittests on fedora slaves.

2) Splitting all the developer tools tests out of the browser-chrome suite into their own suite in bug 984930.

browser-chrome tests have been a thorn in the side of the sheriff team for many months.  More and more, the rapidly growing features and tests of developer tools have been causing the entire browser-chrome suite to fail, and in the case of debug builds, to run for hours.  Splitting these tests out gives us a small shield of isolation.  In fact, we have this running well on Cedar, and we are pushing hard to have it rolled out to our production and development branches by the end of this week!

3) Splitting the remaining browser chrome tests into 3 chunks, in bug 819963.

Just like the developer tools, we have been running browser-chrome in 3 chunks on Cedar.  With just 7 tests disabled, we are very green, and consistently so.  (A rough sketch of the chunking idea follows.)
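
Chunking here just means partitioning the list of test directories into N roughly equal groups, with each job running one group.  The sketch below is an illustration of the idea behind bc1/bc2/bc3, not the mochitest harness’s actual chunking code.

    # Illustration of the chunking idea behind bc1/bc2/bc3: split the list of
    # test directories into N roughly equal pieces.  This is not the mochitest
    # harness's actual chunking code.
    def chunkify(tests, total_chunks):
        per_chunk, remainder = divmod(len(tests), total_chunks)
        chunks, start = [], 0
        for i in range(total_chunks):
            size = per_chunk + (1 if i < remainder else 0)
            chunks.append(tests[start:start + size])
            start += size
        return chunks

    # e.g. the tests that would run in the hypothetical bc2 job:
    # bc2_tests = chunkify(all_browser_chrome_dirs, 3)[1]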



While there are a lot of other changes going on under the hood, here is what you will see by next week on your favorite branch of Firefox:

  • ‘dt’ jobs for opt, and ‘dt1’, ‘dt2’, ‘dt3’ jobs for debug
  • ‘bc’ job will turn into ‘bc1’, ‘bc2’, ‘bc3’
  • much faster turnaround times on bc tests (62 minutes is the slowest right now, the rest are averaging ~20 minutes/job)
  • less random orange cluttering up results



mochitests and manifests

Of all the tests that are run on tbpl, mochitests are the last ones to receive manifests.  As of this morning, we have landed all the changes that we can to have all our tests defined in mochitest.ini files, and we have removed the entries in b2g*.json by moving them into the appropriate mochitest.ini files.

Ahal has done a good job of outlining what this means for b2g in his post.  As mentioned there, this work was done by a dedicated community member, :vaibhav1994, as he continues to write patches, investigate failures, and repeat until success.

For those interested in the next steps, we are looking forward to removing our build-time filtering and starting to filter tests at runtime.  This work is being done by billm in bug 938019.  Once that has landed, we can start querying which tests are enabled/disabled per platform and track that over time!  (A small sketch of what such a query could look like follows.)
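
To give a flavor of that querying, here is a small sketch using manifestparser, the library behind mochitest.ini.  The manifest path, the platform values, and the active_tests() keywords are assumptions to verify against the current API.

    # Sketch of querying enabled/disabled tests from a mochitest manifest with
    # manifestparser.  The manifest path, platform values, and active_tests()
    # keywords are illustrative; verify against the current API.
    from manifestparser import TestManifest

    # hypothetical path to a mochitest.ini
    manifest = TestManifest(manifests=["testing/mochitest/tests/mochitest.ini"])

    # evaluate skip-if/run-if conditions against a platform, e.g. Linux opt
    platform = {"os": "linux", "debug": False, "e10s": False}
    active = manifest.active_tests(exists=False, disabled=True, **platform)

    enabled = [t["name"] for t in active if "disabled" not in t]
    disabled = [t["name"] for t in active if "disabled" in t]
    print("%d enabled, %d disabled on this platform"
          % (len(enabled), len(disabled)))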
