Tag Archives: intermittent-failures

Project Stockwell – January 2017

Every month this year I am planning to write a summary of Project Stockwell.

Last year we started this project with a series of meetings and experiments.  We presented in Hawaii (a Mozilla all-hands event) an overview of our work and path forward.

With that said, we will be tracking two items every month:

Week of Jan 02 -> 09, 2017

Orange Factor 13.76
# High Frequency bugs 42

What are these high frequency bugs:

  • linux32 debug timeouts for devtools (bug 1328915)
  • Turning on leak checking (bug 1325148) – note, we did this Dec 29th and whitelisted a lot; much still remains, and many great fixes have taken place
  • some infrastructure issues, other timeouts, and general failures

I am excited for the coming weeks as we reduce the orange factor back down to <7 and get the number of high frequency bugs to <20.

Outside of these tracking stats there are a few active projects we are working on:

  • adding BUG_COMPONENTS to all files in m-c (bug 1328351) – this will allow us to then match up triage contacts for each component so test case ownership has a path to a live person
  • retrigger an existing job with additional debugging arguments (bug 1322433) – easier to get debug information, possibly extend to special runs like ‘rr-chaos’
  • add |mach test-info| support (bug 1324470) – allows us to get historical timing/run/pass data for a given test file
  • add a test-lint job to linux64/mochitest (bug 1323044) – ensure a test runs reliably by itself and in --repeat mode

While these seem small, we are currently actively triaging all bugs that are high frequency (>=30 times/week).  In January triage means letting people know this is high frequency and trying to add more data to the bugs.



Filed under intermittents, testdev

Working towards a productive definition of “intermittent orange”

Intermittent Oranges (tests which fail sometimes and pass other times) are an ever increasing problem with test automation at Mozilla.

While there are many common causes for failures (bad tests, the environment/infrastructure we run on, and bugs in the product), we still do not have a clear definition of what we view as intermittent.  Some common statements I have heard:

  • It’s obvious, if it failed last year, the test is intermittent
  • If it failed 3 years ago, I don’t care, but if it failed 2 months ago, the test is intermittent
  • I fixed the test to not be intermittent, I verified by retriggering the job 20 times on try server

These imply very different definitions of what is intermittent.  A definition will need to:

  • determine if we should take action on a test (programmatically or manually)
  • define policy sheriffs and developers can use to guide work
  • guide developers to know when a new/fixed test is ready for production
  • provide useful data to release and Firefox product management about the quality of a release

Because I wanted a clear definition of what we are working with, I looked over 6 months (2016-04-01 to 2016-10-01) of OrangeFactor data (7330 bugs, 250,000 failures) to find patterns and trends.  I was surprised at how many bugs had <10 instances reported (3310 bugs, 45.2%).  Likewise, I was surprised that such a small number of bugs (1236) accounts for >80% of the failures.  It made sense to look at things daily, weekly, monthly, and every 6 weeks (our typical release cycle).  After much slicing and dicing, I have come up with 4 buckets:

  1. Random Orange: this test has failed, even multiple times in history, but in a given 6 week window we see <10 failures (45.2% of bugs)
  2. Low Frequency Orange: this test might fail up to 4 times in a given day, typically <=1 failure per day.  In a 6 week window we see <60 failures (26.4% of bugs)
  3. Intermittent Orange: fails up to 10 times/day or <120 times in 6 weeks.  (11.5% of bugs)
  4. High Frequency Orange: fails >10 times/day on many days and is often seen in try pushes.  (16.9% of bugs or 1236 bugs)

Alternatively, we could simplify our definitions and use:

  • low priority or not actionable (buckets 1 + 2)
  • high priority or actionable (buckets 3 + 4)
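The four buckets and the simplified actionable split can be sketched in a few lines of Python. This is only an illustration, not OrangeFactor code: the names `Bucket`, `classify`, and `is_actionable` are made up here, the classification uses only the 6 week failure count (the per-day limits are ignored), and anything at or above 120 failures is assumed to be High Frequency.

```python
from enum import Enum


class Bucket(Enum):
    RANDOM = "Random Orange"
    LOW_FREQUENCY = "Low Frequency Orange"
    INTERMITTENT = "Intermittent Orange"
    HIGH_FREQUENCY = "High Frequency Orange"


def classify(failures_in_6_weeks: int) -> Bucket:
    """Map a 6 week failure count onto the four buckets.

    Assumption: only the 6 week totals (<10, <60, <120) are used;
    counts of 120 or more are treated as High Frequency.
    """
    if failures_in_6_weeks < 10:
        return Bucket.RANDOM
    if failures_in_6_weeks < 60:
        return Bucket.LOW_FREQUENCY
    if failures_in_6_weeks < 120:
        return Bucket.INTERMITTENT
    return Bucket.HIGH_FREQUENCY


def is_actionable(bucket: Bucket) -> bool:
    """Simplified view: buckets 3 + 4 are high priority / actionable."""
    return bucket in (Bucket.INTERMITTENT, Bucket.HIGH_FREQUENCY)


if __name__ == "__main__":
    for count in (3, 45, 100, 500):
        bucket = classify(count)
        print(count, bucket.value, "actionable" if is_actionable(bucket) else "not actionable")
```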

Does defining these buckets by the number of failures in a given time window help us with what we are trying to solve with the definition?

  • Determine if we should take action on a test (programmatically or manually):
    • ideally buckets 1/2 can be detected programmatically with autostar and removed from our view.  Possibly rerunning to validate it isn’t a new failure.
    • buckets 3/4 have the best chance of reproducing, we can run in debuggers (like ‘rr’), or triage to the appropriate developer when we have enough information
  • Define policy sheriffs and developers can use to guide work
    • sheriffs can know when to file bugs (either buckets 2 or 3 as a starting point)
    • developers understand the severity based on the bucket.  Ideally we will need a lot of context, but understanding severity is important.
  • Guide developers to know when a new/fixed test is ready for production
    • If we fix a test, we want to ensure it is stable before we make it tier-1.  A developer can do the math based on 300 commits/day and ensure the test passes at that volume.
    • NOTE: SETA and coalescing ensure we don’t run every test for every push, so we more likely see about 100 test runs/day
  • Provide useful data to release and Firefox product management about the quality of a release
    • Release Management can take the OrangeFactor into account
    • new features might be required to have a certain volume of tests <= Random Orange
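To make the "is my fixed test ready" math a bit more concrete, here is a rough Python sketch. It assumes every run is independent with a fixed failure probability, uses the roughly 100 test runs/day figure mentioned above, and treats a 6 week window as 42 days. The function names and the 95% confidence default are illustrative choices, not an agreed policy.

```python
import math


def max_failure_rate(failures_per_window: int,
                     runs_per_day: int = 100,
                     window_days: int = 42) -> float:
    """Per-run failure rate that would produce `failures_per_window`
    failures in a window, assuming ~100 runs/day over 6 weeks."""
    return failures_per_window / (runs_per_day * window_days)


def green_runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Consecutive green runs needed so that a test failing at
    `failure_rate` (or worse) would have failed at least once with
    probability `confidence`.

    Assumption: runs are independent, so P(all n pass) = (1 - rate) ** n.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


if __name__ == "__main__":
    # Upper edge of Random Orange: 10 failures in a 6 week window.
    rate = max_failure_rate(10)
    print(f"failure rate at the Random Orange limit: {rate:.4%}")
    print("green runs needed at 95% confidence:", green_runs_needed(rate))
    # What 20 green retriggers can and cannot rule out.
    print("green runs needed to rule out a 14% failure rate:", green_runs_needed(0.14))
```

Under these assumptions, 20 green retriggers on try only rule out failure rates of roughly 14% or more, while showing that a test sits below the Random Orange limit takes on the order of a thousand green runs.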

One other way to look at this is what gets put in bugs (war on orange bugzilla robot).  There are simple rules:

  • 15+ times/day – post a daily summary (bucket #4)
  • 5+ times/week – post a weekly summary (bucket #3/4 – about 40% of bucket 2 will show up here)
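As a small sketch, the two robot rules could be written as below. The function name and the return format are hypothetical, and it is an assumption that the two rules are checked independently (so a very frequent bug gets both summaries); the real robot may behave differently.

```python
from typing import List


def summaries_to_post(failures_today: int, failures_this_week: int) -> List[str]:
    """Return which summaries the robot would post for one bug.

    Assumption: the daily and weekly rules are evaluated independently.
    """
    posts = []
    if failures_today >= 15:
        posts.append("daily")
    if failures_this_week >= 5:
        posts.append("weekly")
    return posts


if __name__ == "__main__":
    print(summaries_to_post(failures_today=20, failures_this_week=60))
    print(summaries_to_post(failures_today=1, failures_this_week=6))
    print(summaries_to_post(failures_today=0, failures_this_week=2))
```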

Lastly I would like to cover some exceptions and how some might see this as flawed:

  • missing or incorrect data in orange factor (human error)
  • some issues have many bugs but a single root cause, so we could miscategorize a fixable issue

I do not believe adjusting a definition will fix the above issues.  Possibly different tools or methods to run the tests would reduce the concerns there.


Filed under general, testdev, Uncategorized

the orange factor – no need to retrigger this week

Last week I did another round of re-triggering for a root cause and found some root causes!  This week I got an email from orange factor outlining the top 10 failures on the trees (as we do every week).

Unfortunately, as of this morning there is no work for me to do.  Maybe next week I can hunt.

Here is the breakdown of bugs:

  • Bug 1081925 Intermittent browser_popup_blocker.js
    • investigated last week, test is disabled by a sheriff
  • Bug 1118277 Intermittent browser_popup_blocker.js
    • investigated last week, test is disabled by a sheriff
  • Bug 1096302 Intermittent test_collapse.html
    • test is fixed!  already landed
  • Bug 1121145 Intermittent browser_panel_toggle.js
    • too old!  problem got worse on April 24th
  • Bug 1157948 DMError: Non-zero return code for command
    • too old!  most likely a harness/infra issue
  • Bug 1166041 Intermittent LeakSanitizer
    • patch is already on this bug
  • Bug 1165938 Intermittent media-source
    • disabled the test already!
  • Bug 1149955 Intermittent Win8-PGO test_shared_all.py
    • too old!
  • Bug 1160008 Intermittent testVideoDiscovery
    • too old!
  • Bug 1137757 Intermittent Linux debug mochitest-dt1 command timed out
    • harness/infra issue, the test chunk is taking too long; the problem is being addressed with more chunks.

As you can see there isn’t much to do here.  Maybe next week we will have some actions we can take.  Once I have about 10 bugs investigated I will summarize the bugs, related dates, and status, etc.

Filed under testdev

SETA – Search for Extraneous Test Automation

Here at Mozilla we run dozens of builds and hundreds of test jobs for every push to a tree.  As time has gone on, we have gone from a couple hours from push to all tests complete to 4+ hours.  With the exception of a few test jobs, we could complete our test results in

The question becomes, how do we manage to keep up test coverage without growing the number of machines?  Right now we do this with buildbot coalescing (we queue up the jobs and skip the older ones when the load is high).  While this works great, it causes us to skip a bunch of jobs (builds/tests) on random pushes and sometimes we need to go back in and manually schedule jobs to find failures.  In fact, while keeping up with the automated alerts for talos regressions, the coalescing causes problems in over half of the regressions that I investigate!

Knowing that we live with coalescing and have for years, many of us started wondering if we need all of our tests.  Ideally we could select tests that are statistically most significant to the changes being pushed, and if those pass, we could run the rest of the tests if there were available machines.  Getting there is tough, so maybe there is a better way to solve this?  Luckily we can mine metadata from treeherder (and the former tbpl) and determine which failures are intermittent and which have been fixed/caused by a different revision.

A few months ago we started looking into unique failures on the trees.  Not just the failures, but which jobs failed.  Normally when we have a failure detected by the automation, many jobs fail at once (for example: xpcshell tests will fail on all windows platforms, opt + debug).  When you look at the common jobs which fail across all the failures over time, you can determine the minimum number of jobs required to detect all the failures.  Keep in mind that we only need 1 job to represent a given failure.
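Picking a small set of jobs such that every failure is represented by at least one job is a set cover problem. The sketch below uses a simple greedy approximation in Python; it illustrates the idea only and is not the SETA implementation. The input format, the function name `reduced_job_set`, and the job names in the example are made up, and a greedy pass is not guaranteed to find the true minimum.

```python
from typing import Dict, List, Set


def reduced_job_set(failures: Dict[str, Set[str]]) -> List[str]:
    """Greedy set cover.

    `failures` maps a failure id to the set of job names that failed
    for it.  Returns jobs such that every failure has at least one of
    its jobs in the result.
    """
    uncovered = {fid for fid, jobs in failures.items() if jobs}
    chosen: List[str] = []
    while uncovered:
        # Count how many still-uncovered failures each job would detect.
        counts: Dict[str, int] = {}
        for fid in uncovered:
            for job in failures[fid]:
                counts[job] = counts.get(job, 0) + 1
        # Pick the job covering the most failures; ties go to the
        # alphabetically first name so the result is deterministic.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {fid for fid in uncovered if best not in failures[fid]}
    return chosen


if __name__ == "__main__":
    example = {
        "f1": {"xpcshell-win-opt", "xpcshell-win-debug"},
        "f2": {"xpcshell-win-debug", "mochitest-linux-opt"},
        "f3": {"mochitest-linux-opt"},
    }
    print(reduced_job_set(example))
```

Every job that is not in the returned list would count as optional in this scheme.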

As of today, we have data since August 13, 2014 (just shy of 180 days):

  • 1061 failures caught by automation (for desktop builds/tests)
  • 362 jobs are currently run for all desktop builds
  • 285 jobs are optional and not needed to detect all 1061 regressions

To phrase this another way, we could have run 77 jobs per push and caught every regression in the last 6 months.  Let’s back up a bit and look at the regressions found: how many are there and how often do we see them?

Cumulative and per day regressions

This is a lot of regressions, yay for automation.  The problem is that this is historical data, not future data.  Our tests, browser, and features change every day, so on its own this doesn’t seem very useful for predicting the future.  There is a parallel with the stock market, where people invest in companies based on historical data and make decisions based on incoming data (press releases, quarterly earnings).  This is the same concept.  We have dozens of new failures every week, and if we only relied upon the 77 test jobs (which would catch all historical regressions) we would miss new ones.  This is easy to detect, and we have mapped out the changes.  Here it is on a calendar view (bold dates indicate a change was detected, i.e. a new job needed in the reduced set of jobs list):

Bolded dates are when a change is needed due to new failures

This translates to about 1.5 changes per week.  To put this another way, if we were only running the reduced set of 77 jobs, we would have missed one regression December 2nd, and another December 16th, etc., or on average 1.5 regressions would be missed per week.  In a scenario where we only ran the optional jobs once/hour on the integration branches, 1-2 times/week we would see a failure and have to backfill some jobs (as we currently do for coalesced jobs) for the last hour to find the push which caused the failure.
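Backfilling means running the skipped job on earlier pushes until the first failing push is found. The text above does not say in which order jobs are backfilled, so the binary search below is just one possible way to limit the number of extra jobs. It assumes the failure is permanent (not intermittent), that the push just before the list is known good, and that the newest push is known bad. `find_first_bad_push` and `job_passes` are hypothetical names; in practice `job_passes` would trigger the job on a push and wait for the result.

```python
from typing import Callable, Optional, Sequence


def find_first_bad_push(pushes: Sequence[str],
                        job_passes: Callable[[str], bool]) -> Optional[str]:
    """Binary search for the push that introduced a failure.

    `pushes` is ordered oldest to newest; the last entry is known to
    fail.  Assumes that once the failure appears, every later push
    fails too.
    """
    if not pushes:
        return None
    lo, hi = 0, len(pushes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if job_passes(pushes[mid]):
            lo = mid + 1   # failure was introduced after mid
        else:
            hi = mid       # mid already fails; look at mid or earlier
    return pushes[lo]


if __name__ == "__main__":
    last_hour = ["push-a", "push-b", "push-c", "push-d", "push-e"]
    # Pretend the regression landed in push-c.
    print(find_first_bad_push(last_hour, lambda push: push < "push-c"))
```

For an intermittent failure a single green run proves little, so each step would need several runs before trusting it.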

To put this into perspective, here is a similar view to what you would expect to see today on treeherder:

All desktop unittest jobs

For perspective, here is what it would look like assuming we only ran the reduced set of 77 jobs:

Reduced set of jobs view

* Keep in mind this is weighted such that we prefer to run jobs on linux* builds since those run in the cloud.

With all of this information, what do we plan to do with it?  We plan to run the reduced set of jobs by default on all pushes, and use the [285] optional jobs as candidates for coalescing.  Currently we force coalescing for debug unittests.  This was done about 6 months ago because debug tests take considerably longer than opt, so if we could run them on every 2nd or 3rd build, we would save a lot of machine time.  This is only being considered on integration trees that the sheriffs monitor (mozilla-inbound, fx-team).

Some questions that are commonly asked:

  • How do you plan to keep this up to date?
    • We run a cronjob every day and update our master list of jobs, failures, and optional jobs.  This takes about 2 minutes.
  • What are the chances the reduced set of jobs catch >1 failure?  Do we need all 77 jobs?
    • 77 jobs detect 1061 failures (100%)
    • 35 jobs detect 977 failures (92%)
    • 23 jobs detect 940 failures (88.6%)
    • 12 jobs detect 900 failures (84.8%)
  • How can we see the data:
    • SETA website
    • in the near future, summary emails to mozilla.dev.tree-alerts when we detect a *change*
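The coverage numbers in the "Do we need all 77 jobs?" answer above show strong diminishing returns. The short Python sketch below recomputes the percentages from the raw counts and adds the number of extra failures caught per extra job between each step. The figures are the ones quoted above; the function name and table layout are illustrative only.

```python
from typing import List, Tuple

TOTAL_FAILURES = 1061
# (number of jobs, failures detected), taken from the list above.
COVERAGE = [(12, 900), (23, 940), (35, 977), (77, 1061)]


def coverage_table() -> List[Tuple[int, int, float, float]]:
    """Return (jobs, detected, percent detected, extra failures per extra job)."""
    rows = []
    prev_jobs, prev_detected = 0, 0
    for jobs, detected in COVERAGE:
        percent = round(100 * detected / TOTAL_FAILURES, 1)
        gain = round((detected - prev_detected) / (jobs - prev_jobs), 2)
        rows.append((jobs, detected, percent, gain))
        prev_jobs, prev_detected = jobs, detected
    return rows


if __name__ == "__main__":
    for jobs, detected, percent, gain in coverage_table():
        print(f"{jobs:3d} jobs  {detected:4d} failures  {percent:5.1f}%  "
              f"{gain:.2f} extra failures per extra job")
```

The first 12 jobs each account for 75 failures on average, while the last 42 jobs add about 2 failures each.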

Thanks for reading so far!  This project wouldn’t be here if it weren’t for the many hours of work by Vaibhav, who continues to find more ways to contribute to Mozilla.  If anything, this should inspire you to think more about how our scheduling works and what great things we can do if we think outside the box.


Filed under testdev