SETA – Search for Extraneous Test Automation

Here at Mozilla we run dozens of builds and hundreds of test jobs for every push to a tree.  Over time, we have gone from a couple of hours between push and all tests completing to 4+ hours.  With the exception of a few test jobs, we could complete our test results in

The question becomes: how do we keep up test coverage without growing the number of machines?  Right now we do this with buildbot coalescing (we queue up the jobs and skip the older ones when load is high).  While this keeps the queues moving, it causes us to skip a bunch of jobs (builds/tests) on random pushes, and sometimes we need to go back in and manually schedule jobs to find failures.  In fact, while keeping up with the automated alerts for Talos regressions, coalescing causes problems in over half of the regressions that I investigate!

Knowing that we have lived with coalescing for years, many of us started wondering whether we need all of our tests.  Ideally we could select the tests that are statistically most significant to the changes being pushed, and if those pass, run the rest of the tests when machines are available.  Getting there is tough; maybe there is a better way to solve this.  Luckily we can mine metadata from Treeherder (and the former TBPL) to determine which failures are intermittent and which were fixed or caused by a different revision.

A few months ago we started looking into unique failures on the trees.  Not just the failures, but which jobs failed.  Normally when the automation detects a failure, many jobs fail at once (for example: xpcshell tests will fail on all Windows platforms, opt + debug).  When you look at the common jobs which fail across all the failures over time, you can determine the minimum number of jobs required to detect all the failures.  Keep in mind that we only need 1 job to represent a given failure.
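The reduction described above is essentially a greedy set-cover problem.  Here is a minimal sketch of the idea (the function, job names, and failure ids are all made up for illustration; this is not SETA's actual code):

```python
# Greedy set cover: each failure maps to the set of jobs that failed for it;
# keep picking the job that represents the most still-uncovered failures
# until every failure is represented by at least one chosen job.

def reduce_jobs(failures):
    """failures: dict mapping failure id -> set of job names that failed."""
    uncovered = set(failures)
    chosen = []
    while uncovered:
        # Count how many uncovered failures each job would represent.
        counts = {}
        for fid in uncovered:
            for job in failures[fid]:
                counts[job] = counts.get(job, 0) + 1
        best = max(counts, key=counts.get)
        chosen.append(best)
        uncovered = {fid for fid in uncovered if best not in failures[fid]}
    return chosen

failures = {
    "bug1": {"win7-opt-xpcshell", "win8-opt-xpcshell", "win7-debug-xpcshell"},
    "bug2": {"linux64-opt-mochitest-1"},
    "bug3": {"win7-opt-xpcshell", "linux64-opt-mochitest-1"},
}
print(reduce_jobs(failures))  # 2 jobs suffice to represent all 3 failures
```

In the real data this is how 362 jobs can collapse to 77 while still representing all 1061 historical failures.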

As of today, we have data since August 13, 2014 (just shy of 180 days):

  • 1061 failures caught by automation (for desktop builds/tests)
  • 362 jobs are currently run for all desktop builds
  • 285 jobs are optional and not needed to detect all 1061 regressions

To phrase this another way, we could have run 77 jobs per push and caught every regression in the last 6 months.  Let's back up a bit and look at the regressions found: how many are there, and how often do we see them?

Cumulative and per day regressions

This is a lot of regressions; yay for automation.  The problem is that this is historical data, not future data.  Our tests, browser, and features change every day, so this alone doesn’t seem very useful for predicting the future.  There is a parallel to the stock market, where people invest in companies based on historical data and make decisions based on incoming data (press releases, quarterly earnings).  This is the same concept.  We have dozens of new failures every week, and if we relied only upon the 77 test jobs (which would catch all historical regressions) we would miss new ones.  This is easy to detect, and we have mapped out the changes.  Here it is in a calendar view (bold dates indicate a change was detected, i.e. a new job is needed in the reduced set of jobs):

Bolded dates are when a change is needed due to new failures

This translates to about 1.5 changes per week.  To put this another way, if we were only running the 77-job reduced set, we would have missed one regression on December 2nd, another on December 16th, etc.; on average, 1.5 regressions would be missed per week.  In a scenario where we only ran the optional jobs once per hour on the integration branches, 1-2 times a week we would see a failure and have to backfill some jobs (as we currently do for coalesced jobs) for the last hour to find the push which caused the failure.
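The daily update step described above can be sketched as a simple rule: when a new regression arrives, check whether the reduced set already detects it, and if not, grow the set (a "bold date").  This is an assumed, illustrative version of that logic, including the post's stated preference for linux jobs since they run in the cloud:

```python
# Hypothetical update rule for the reduced job set.  When a new failure's
# failing jobs don't intersect the reduced set, we must add one job --
# this is what happens roughly 1.5 times per week in the historical data.

def update_reduced_set(reduced, failing_jobs, prefer=("linux",)):
    """reduced: set of job names run by default.
    failing_jobs: set of job names that failed for the new regression.
    Returns (new_reduced_set, changed)."""
    if reduced & failing_jobs:
        return reduced, False  # already caught; no change needed
    # Prefer a cloud (linux) job when extending the set, as the post notes.
    candidates = sorted(failing_jobs)
    pick = next((j for j in candidates if j.startswith(prefer)), candidates[0])
    return reduced | {pick}, True

reduced = {"linux64-opt-mochitest-1", "win7-opt-xpcshell"}
new_failure = {"osx-10.10-debug-reftest", "linux64-debug-reftest"}
reduced, changed = update_reduced_set(reduced, new_failure)
assert changed and "linux64-debug-reftest" in reduced
```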

To put this into perspective, here is a similar view to what you would expect to see today on treeherder:

All desktop unittest jobs

For perspective, here is what it would look like assuming we only ran the reduced set of 77 jobs:

Reduced set of jobs view

* Keep in mind this is weighted such that we prefer to run jobs on linux* builds, since those run in the cloud.

With all of this information, what do we plan to do with it?  We plan to run the reduced set of jobs by default on all pushes, and use the 285 optional jobs as candidates for coalescing.  Currently we force coalescing for debug unittests; this was done about 6 months ago because debug tests take considerably longer than opt, so if we run them only on every 2nd or 3rd build, we save a lot of machine time.  This is only being considered on the integration trees that the sheriffs monitor (mozilla-inbound, fx-team).

Some questions that are commonly asked:

  • How do you plan to keep this up to date?
    • We run a cronjob every day and update our master list of jobs, failures, and optional jobs.  This takes about 2 minutes.
  • What are the chances the reduced set of jobs catch >1 failure?  Do we need all 77 jobs?
    • 77 jobs detect 1061 failures (100%)
    • 35 jobs detect 977 failures (92%)
    • 23 jobs detect 940 failures (88.6%)
    • 12 jobs detect 900 failures (84.8%)
  • How can we see the data?
    • SETA website
    • in the near future, summary emails when we detect a *change* to the reduced set of jobs
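The diminishing-returns numbers in the list above (77 jobs for 100%, 35 for 92%, and so on) fall out naturally from the greedy ordering: each job added covers fewer new failures than the last.  A toy sketch of how such a coverage curve could be computed (data and job names are made up):

```python
# Cumulative coverage curve: walk the greedily-ordered job list and report
# the fraction of all failures detected after each job is added.

def coverage_curve(ordered_jobs, failures):
    """failures: dict mapping failure id -> set of job names that failed."""
    covered = set()
    curve = []
    for job in ordered_jobs:
        covered |= {fid for fid, jobs in failures.items() if job in jobs}
        curve.append((job, len(covered) / len(failures)))
    return curve

failures = {1: {"a", "b"}, 2: {"a"}, 3: {"c"}, 4: {"b", "c"}}
for job, frac in coverage_curve(["a", "c", "b"], failures):
    print(job, f"{frac:.0%}")  # a 50% / c 100% / b 100%
```

This is the same shape as the real data: the first dozen jobs do most of the work, and the tail exists only to catch the last few unique failures.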

Thanks for reading so far!  This project wouldn’t be here if it weren’t for the many hours of work by Vaibhav; he continues to find more ways to contribute to Mozilla.  If anything, this should inspire you to think more about how our scheduling works and what great things we can do if we think outside the box.


Filed under testdev

8 responses to “SETA – Search for Extraneous Test Automation”

  1. Great post Joel! Thanks for writing this up!

  2. Anders

    Isn’t it a stretch to say that easing load will shorten the end-to-end time (without looking at the critical path), or that it is the easiest way (a push causes ~200 hours of build/test, so there must be quite a bit of scheduling going on that might be optimized)?  If you want to reduce the load by disabling jobs, you should probably account for how long jobs take, not just the count, since it varies quite a bit.
    Couldn’t jobs be optimized?  For browser performance there are graphs and automatic checks to make sure that check-ins don’t regress performance.  Maybe there should be something similar for build/test time?
    It seems there once were some dashboards for this, but they appear broken.
    Some time ago joduinn posted a small series about saving time by just looking through tests and disabling unneeded ones.

    • elvis314

      The goal isn’t to cut our end-to-end time in half, it is to help us scale better by prioritizing which jobs get coalesced. There are always improvements being made in our infrastructure (recently we stopped rebooting slaves between every job, saving a few minutes of time) as well as in our harnesses. I would love for everybody to go through the tests and remove what is useless. Tests are written against historical bugs and a feature set defined in the past; there is no guarantee that they are useful today.

      I do believe there is more work we can do to optimize our jobs and how we run the tests. Obviously running tests in debug mode is the most expensive, since all the jobs take considerably longer (sometimes 3x). Another idea is balancing out the runtime of our jobs: since we are running hundreds of jobs per push, let’s make sure our jobs have similar run times (right now some opt jobs finish in 7 minutes and others in 35 minutes).

      But to do that requires fixing a lot of tests, since running them standalone or with a different set of tests results in failures we don’t normally see.

      I am open to whatever other ideas might be out there for identifying individual tests which are no longer needed or duplicate what another test is really testing. That seems like a ripe area for improvement!

      • I think the idea of running only the tests that are required given the change is great and look forward to seeing how it performs on m-i and fx-team. Other companies limit the number of tests that are run through additional mechanisms like:

        – analyzing the source tree to understand which tests may be impacted by a given change and running only those tests
        – annotating tests as small, medium, large and running each bucket at select times rather than all the time

      • elvis314

        Thanks for the comment Lawrence. There is a lot we can do with this concept going forward. Right now a changed-file-to-test map seems simple; in fact, that might be something to experiment with in trychooser for defaults! This work opens the door to tiered tests, where we can run tier 1 (however that is defined) and, upon success and availability of resources, start on the next tier, etc.

        All feedback is welcome!

  3. Pingback: A-Team Contributions in 2015 | Vaibhav's Blog

  4. Pingback: test_awsy_lite | gbrown_mozilla

  5. Pingback: Bravo! Been accepted by GSoC – mikelingblog
