SETA – Search for Extraneous Test Automation

Here at Mozilla we run dozens of builds and hundreds of test jobs for every push to a tree.  As time has gone on, we have gone from a couple hours from push to all tests complete to 4+ hours.  With the exception of a few test jobs, we could complete our test results in <2 hours if we had idle machines ready to run tests when the builds are finished.  This doesn’t scale well, in fact we are only adding more platforms, more tests, and of course more pushes each month.

The question becomes, how do we manage to keep up test coverage without growing the number of machines?  Right now we do this with buildbot coalescing (we queue up the jobs and skip the older ones when the load is high).  While this works great, it causes us to skip a bunch of jobs (builds/tests) on random pushes and sometimes we need to go back in and manually schedule jobs to find failures.  In fact, while keeping up with the automated alerts for talos regressions, the coalescing causes problems in over half of the regressions that I investigate!

Knowing that we live with coalescing and have for years, many of us started wondering if we need all of our tests.  Ideally we could select tests that are statistically most significant to the changes being pushed, and if those pass, we could run the rest of the tests if there were available machines.  To get there is tough, maybe there is a better way to solve this?  Luckily we can mine meta data from treeherder (and the former tbpl) and determine which failures are intermittent and which have been fixed/caused by a different revision.

A few months ago we started looking into unique failures on the trees.  Not just the failures, but which jobs failed.  Normally when we have a failure detected by the automation, many jobs fail at once (for example: xpcshell tests will fail on all windows platforms, opt + debug).  When you look at the common jobs which fail across all the failures over time, you can determine the minimum number of jobs required to detected all the failures.  Keep in mind that we only need 1 job to represent a given failure.

As of today, we have data since August 13, 2014 (just shy of 180 days):

  • 1061 failures caught by automation (for desktop builds/tests)
  • 362 jobs are currently run for all desktop builds
  • 285 jobs are optional and not needed to detect all 1061 regressions

To phrase this another way, we could have run 77 jobs per push and caught every regression in the last 6 months.  Lets back up  a bit and look at the regressions found- how many are there and how often do we see them:

Cumulative and per day regressions

Cumulative and per day regressions

This is a lot of regressions, yay for automation.  The problem is that this is historical data, not future data.  Our tests, browser, and features change every day, this doesn’t seem very useful for predicting the future.  This is a parallel to the stock market, there people invest in companies based on historical data and make decisions based on incoming data (press releases, quarterly earnings).  This is the same concept.  We have dozens of new failures every week, and if we only relied upon the 77 test jobs (which would catch all historical regressions) we would miss new ones.  This is easy to detect, and we have mapped out the changes.  Here it is on a calendar view (bold dates indicate a change was detected, i.e. a new job needed in the reduced set of jobs list):

Bolded dates are when a change is needed due to new failuresThis translates to about 1.5 changes per week.  To put this another way, if we were only running the 77 reduced set of jobs, we would have missed one regression December 2nd, and another December 16th, etc., or on average 1.5 regressions will be missed per week.  In a scenario where we only ran the optional jobs once/hour on the integration branches, 1-2 times/week we would see a failure and have to backfill some jobs (as we currently do for coalesced jobs) for the last hour to find the push which caused the failure.

To put this into perspective, here is a similar view to what you would expect to see today on treeherder:

All desktop unittest jobsFor perspective, here is what it would look like assuming we only ran the reduced set of 77 jobs:

Reduced set of jobs view* keep in mind this is weighted such that we prefer to run jobs on linux* builds since those run in the cloud.

With all of this information, what do we plan to do with it?  We plan to run the reduced set of jobs by default on all pushes, and use the [285] optional jobs as candidates for coalescing.  Currently we force coalescing for debug unittests.  This was done about 6 months ago because debug tests take considerably longer than opt, so if we could run them on every 2nd or 3rd build, we would save a lot of machine time.  This is only being considered on integration trees that the sheriffs monitor (mozilla-inbound, fx-team).

Some questions that are commonly asked:

  • How do you plan to keep this up to date?
    • We run a cronjob every day and update our master list of jobs, failures, and optional jobs.  This takes about 2 minutes.
  • What are the chances the reduced set of jobs catch >1 failure?  Do we need all 77 jobs?
    • 77 jobs detect 1061 failures (100%)
    • 35 jobs detect 977 failures (92%)
    • 23 jobs detect 940 failures (88.6%)
    • 12 jobs detect 900 failures (84.8%)
  • How can we see the data:
    • SETA website
    • in the near future summary emails when we detect a *change* to

Thanks for reading so far!  This project wouldn’t be here it it wasn’t for the many hours of work by Vaibhav, he continues to find more ways to contribute to Mozilla. If anything this should inspire you to think more about how our scheduling works and what great things we can do if we think out of the box.


Filed under Uncategorized

Tracking Firefox performance as we uplift – the volume of alerts we get

For the last year, I have been focused on ensuring we look at the alerts generated by Talos.  For the last 6 months I have also looked a bit more carefully at the uplifts we do every 6 weeks.  In fact we wouldn’t generate alerts when we uplifted to beta because we didn’t run enough tests to verify a sustained regression in a given time window.

Lets look at data, specifically the volume of alerts:

Trend of improvements/regressions from Firefox 31 to 36 as we uplift to Aurora

Trend of improvements/regressions from Firefox 31 to 36 as we uplift to Aurora

this is a stacked graph, you can interpret it as Firefox 32 had a lot of improvements and Firefox 33 had a lot of regressions.  I think what is more interesting is how many performance regressions are fixed or added when we go from Aurora to Beta.  There is minimal data available for Beta.  This next image will compare alert volume for the same release on Aurora then on Beta:

Side by side stacked bars for the regressions going into Aurora and then going onto Beta.

Side by side stacked bars for the regressions going into Aurora and then going onto Beta.

One way to interpret this above graph is to see that we fixed a lot of regressions on Aurora while Firefox 33 was on there, but for Firefox 34, we introduced a lot of regressions.

The above data is just my interpretation of this, Here are links to a more fine grained view on the data:

As always, if you have questions, concerns, praise, or other great ideas- feel free to chat via this blog or via irc (:jmaher).

Leave a comment

Filed under Uncategorized

Language hacking – changing the choice of words we use

Have you ever talked to someone who continues to use the same word over and over again?  Then you find that many people you chat with end up using the same choice of words quite frequently.  My wife and I see this quite often, usually with the word ‘Amazing’, ‘cool’, and ‘hope’.

  • Are these words necessary in conversation?
  • Do these words we choose lose value due to overuse?
  • Are we communicating effectively?
  • Do others who are jaded by these words associate other meanings or intentions to the words we use?

Lets focus on the word “hope”.  There are many places where hope is appropriate, but I find that most people misuse the word.  For example:

I hope to show up at yoga on Saturday

I heard this sentence and wonder:

  • do you want to show up to yoga on Saturday?
  • are you saying this to make me feel good?
  • are there other things preventing you from committing to yoga on Saturday?

What could be said is:

I am planning to show up at yoga on Saturday


I have a lot of things going on, if all goes well I will show up at yoga on Saturday


I don’t want to hurt your feelings by saying no, so to make you feel good I will be non committal about showing up to yoga on Saturday even though I have no intentions.

There are many ways to replace the word “hope”, and all of them achieve a clearer communication between two people.

Now with that said, what am I hacking?  For the last few months I have been reducing (almost removing) the words ‘awesome’, ‘amazing’, ‘hate’, and ‘hope’ from my vocabulary.

Why am I writing about this?  I might as well be in the open about this and invite others to join me in being deliberate about how we speak.  Once a month I will post a new word, feel free to join me in this effort and see how thinking about what you say and how you say it impacts your communications.

Also please leave comments on this post about specific words that you feel are overused – I could use suggestions of words.

update March 28, 2015 – this post has been translated:

Пост доступен на сайте Язык взлома – изменение выбора используемых слов.


Filed under Uncategorized

5 days in Portland with Mozillians and 10 great things that came from it

I took a lot of notes in Portland last week.  One might not know that based on the fact that I talked so much my voice ran out of steam by the second day.  Either way, in chatting with some co-workers yesterday about what we took away from Portland, I realized that there is a long list of awesomeness.

Let me caveat this by saying that some of these ideas have been talked about in the past, but despite our efforts to work with others and field interesting and useful ideas, there is a big list of great things that came to light while chatting in person:

  • :bgrins mentioned a mozscreenshot tool and the need for getting a screenshot of new features in development on various platforms so UX can review the changes.  Currently it is a method of asking UX to download the build from try or some other location and run it locally to see the changes.
  • :heycam/:jwatt – had a great an interesting talos discussion.  Mostly around how to run it and validate patches/fixes locally and on try server. (check out bug 1109243)
  • :glandium is looking at doing some changes (I recall something with build/pgo) and wanted to know how to compare some Talos numbers to help make the right decision – this can be done with either bug 1109243, or the existing in the Talos repo (we might need some cleanup on this)
  • :bobowen has been working to get csb tests working- after chatting in line to board a plane, it became clear he needs to solve some finer grain test selection problems- many of which the ateam has on a roadmap in Q2/Q3 – I see some tighter collaboration happening here.
  • Thanks to chatting with :lsblakk, I am motivated to expand the talos sheriff team and look for dedicated Mozillians (or soon to become Mozillians) to work with in keeping a lid on the alerts and overall state of performance (based on what we measure).
  • :lightsofapollo had a great conversation with me about TaskCluster and what barriers stood in the way of running Talos on it – this will result is some initial investigation work!
  • :kats was asking me how to generate alerts for  This is very doable via posting data to graph server
  • After a good session on how to handle intermittents (seems like the same people have this conversation every time a bunch of Mozillians get together), I am motivated to push Titanic further to find the root cause of an intermittent via brute force retriggers (ideally on weekends).  In fact :dbaron has done this a few times in the last month and so have the sheriffs.  This is similar to what we do to verify a talos regression, just with some different parameters.
  • The same conversation about intermittents yielded a stronger desire to look at new tests coming into the system and validating stability.  The simple solution is to run the job 100 times, verify that the new test didn’t have issues and then leave it along.  Of course we could get smart and do this for all test_* files that are edited in the tree.  Thanks to :ehsan for spawning this conversation.
  • Discussing the idea of a Talos Sheriff with a few folks, it seems like there are further conversations needs with the existing Sheriff team as well as to chat with :vladan and :avih about what type of policy we should have for existing performance failures which are detected.  I would expect some changes to be made early next year as we have more tests and need more help.  My initial thoughts are specifically with responding to regressions or getting backed out in XX hours.  Yeah that sounds nasty, but there are probably cut and dry parameters we can set and start enforcing.

Those are 10 specific topics which despite everybody knowing how to contact me or the ateam and share great ideas or frustrations, these came out of being in the same place at the same time.

Thinking through this, when I see these folks in a real office while working from there for a few days or a week, it seems as though the conversations are smaller and not as intense.  Usually just small talk whilst waiting for a build to finish.  I believe the idea where we are not expected to focus on our day to day work and instead make plans for the future is the real innovation behind getting these topics surfaced.

1 Comment

Filed under Uncategorized

A case of the weekends?

Case of the Mondays

What was famous 15 years ago as a case of the Mondays has manifested itself in Talos.  In fact, I wonder why I get so many regression alerts on Monday as compared to other days.  It is more to a point of we have less noise in our Talos data on weekends.

Take for example the test case tresize:


* in fact we see this on other platforms as well linux32/linux64/osx10.8/windowsXP

30 days of linux tresize

Many other tests exhibit this.  What is different about weekends?  Is there just less data points?

I do know our volume of tests go down on weekends mostly as a side effect of less patches being landed on our trees.

Here are some ideas I have to debug this more:

  • Run massive retrigger scripts for talos on weekends to validate # of samples is/isnot the problem
  • Reduce the volume of talos on weekdays to validate the overall system load in the datacenter is/isnot the problem
  • compare the load of the machines with all branches and wait times to that of the noise we have in certain tests/platforms
  • Look at platforms like windows 7, windows 8, and osx 10.6 as to why they have more noise on weekends or are more stable.  Finding the delta in platforms would help provide answers

If you have ideas on how to uncover this mystery, please speak up.  I would be happy to have this gone and make any automated alerts more useful!


Filed under Uncategorized

Say hi to Kaustabh Datta Choudhury, a newer Mozillian

A couple months ago I ran into :kaustabh93 online as he had picked up a couple good first bugs.  Since then he has continued to work very hard and submit a lot of great pull requests to Ouija and Alert Manager (here is his github profile).  After working with him for a couple of weeks, I decided it was time to learn more about him, and I would like to share that with Mozilla as a whole:

Tell us about where you live-

I live in a town called Santragachi in West Bengal. The best thing about this place is its ambience. It is not at the heart of the city but the city is easily accessible. That keeps the maddening crowd of the city away and a calm and peaceful environment prevails here.

Tell us about your school-

 I completed my schooling from Don Bosco School, Liluah. After graduating from there, now I am pursuing an undergraduate degree in Computer Science & Engineering from MCKV Institute of Engineering.

Right from when it was introduced to me, I was in love with the subject ‘Computer Science’. And introduction to coding was one of the best things that has happened to me so far.

Tell us about getting involved with Mozilla-

I was looking for some exciting real life projects to work on during my vacation & it was then that the idea of contributing to open source projects had struck me. Now I have been using Firefox for many years now and that gave me an idea of where to start looking. Eventually I found the volunteer tab and thus started my wonderful journey on Mozilla.

Right from when I was starting out, till now, one thing that I liked very much about Mozilla was that help was always at hand when needed. On my first day , I popped a few questions in the IRC channel #introduction & after getting the basic of where to start out, I started working on Ouija under the guidance of ‘dminor’ & ‘jmaher’. After a few bug fixes there, Dan recommended me to have a look at Alert Manager & I have been working on it ever since. And the experience of working for Mozilla has been great.

Tell us what you enjoy doing-

I really love coding. But apart from it I also am an amateur photographer & enjoy playing computer games & reading books.

Where do you see yourself in 5 years?

In 5 years’ time I prefer to see myself as a successful engineer working on innovative projects & solving problems.

If somebody asked you for advice about life, what would you say?

Rather than following the crowd down the well-worn path, it is always better to explore unchartered territories with a few.

:kaustabh93 is back in school as of this week, but look for activity on bugzilla and github from him.  You will find him online once in a while in various channels, I usually find him in #ateam.


Filed under Uncategorized

I invite you to join me in welcoming 3 new tests to Talos

2 months ago, we added session restore tests to Talos, today I am pleased to announce 3 new tests:

  • media_tests – only runs on linux64 and is our first test to measure audio processing.  Much thanks to :jesup, and Suhas Nandaku from Cisco.
  •  tp5o_scroll – imagine if tscrollx and tp5 had a child- not only do we load the page, but we scroll the page.  Big thanks go to :avih for tackling this project.
  •  glterrain – The first webgl benchmark to show up in Talos.  Credit goes to :avih for driving this and delivering it.  There are others coming, this was the easiest to get going.


Stay tuned in the coming weeks as we have 2 others tests in the works:

  • ts_paint_cold
  • mainthreadio detection


Leave a comment

Filed under testdev