Category Archives: testdev

A-Team contribution opportunity – Dashboard Hacker

I am excited to announce a new focused project for contribution – Dashboard Hacker.  Last week we previewed that today we would be announcing two contribution projects.  This is an unpaid program where we are looking for 1-2 contributors who will dedicate 5-10 hours/week for at least 8 weeks.  More time is welcome, but not required.

What is a dashboard hacker?

When a developer is ready to land code, they want to test it, and getting and understanding the results is made a lot easier by good dashboards and tools. For this project, we have a starting point with our performance data view: fix up a series of nice-to-have polish features and then ensure that it is easy to use within a normal developer workflow. Part of the developer workflow is the regular job view, and if time permits there are some fun experiments we would like to implement in the job view.  These bugs, features, and projects are all smaller and self contained, which makes them great projects for someone looking to contribute.

What is required of you to participate?

  • A willingness to learn and ask questions
  • A general knowledge of programming (most of this will be in javascript, django, and angularJS, and some work will be in python)
  • A promise to show up regularly and take ownership of the issues you are working on
  • Good at solving problems and thinking out of the box
  • Comfortable with (or willing to try) working with a variety of people

What we will guarantee from our end:

  • A dedicated mentor for the project whom you will work with regularly throughout the project
  • A single area of work to reduce the need to get up to speed over and over again.
    • This project will cover many tools, but the general problem space will be the same
  • The opportunity to work with many people (different bugs could have a specific mentor) while retaining a single mentor to guide you through the process
  • The ability to be part of the team: you will be welcome in meetings, and we will value your input on solving problems, brainstorming, and figuring out new problems to tackle.

How do you apply?

Get in touch with us either by replying to the post, commenting in the bug, or just contacting us on IRC (I am :jmaher in #ateam on irc.mozilla.org; wlach on IRC will be the primary mentor).  We will point you at a starter bug and introduce you to the bugs and problems to solve.  If you have prior work (links to bugzilla, github, blogs, etc.) that would help us learn more about you, that would be a plus.

How will you select the candidates?

There are no hard criteria here.  One factor will be whether you can meet the expectations outlined above and how well you do at picking up the problem space.  Ultimately it will be up to the mentor (for this project, it will be :wlach).  If you do apply and we already have a candidate picked, or we don’t choose you for other reasons, we do plan to repeat this every few months.

Looking forward to building great things!


A-Team contribution opportunity – DX (Developer Ergonomics)

I am excited to announce a new focused project for contribution – Developer Ergonomics/Experience, otherwise known as DX.  Last week we previewed that today we would be announcing two contribution projects.  This is an unpaid program where we are looking for 1-2 contributors who will dedicate 5-10 hours/week for at least 8 weeks.  More time is welcome, but not required.

What does DX mean?

We chose this project because we continue to experience frustration while fixing bugs and debugging test failures.  Many people suggest great ideas; in this case we have set aside a few of them (look at the dependent bugs: clean up argument parsers, help our tests run in smarter chunks, make it easier to run tests locally or on a server, etc.) which would clean things up and be harder than a good first bug, yet each issue by itself would be too easy for an internship.  Our goal is to clean up our test harnesses and tools and, if time permits, add things to the workflow which make it easier for developers to do their job!

What is required of you to participate?

  • A willingness to learn and ask questions
  • A general knowledge of programming (this will be mostly in python with some javascript as well)
  • A promise to show up regularly and take ownership of the issues you are working on
  • Good at solving problems and thinking out of the box
  • Comfortable with (or willing to try) working with a variety of people

What we will guarantee from our end:

  • A dedicated mentor for the project whom you will work with regularly throughout the project
  • A single area of work to reduce the need to get up to speed over and over again.
    • This project will cover many tools, but the general problem space will be the same
  • The opportunity to work with many people (different bugs could have a specific mentor) while retaining a single mentor to guide you through the process
  • The ability to be part of the team: you will be welcome in meetings, and we will value your input on solving problems, brainstorming, and figuring out new problems to tackle.

How do you apply?

Get in touch with us either by replying to the post, commenting in the bug, or just contacting us on IRC (I am :jmaher in #ateam on irc.mozilla.org).  We will point you at a starter bug and introduce you to the bugs and problems to solve.  If you have prior work (links to bugzilla, github, blogs, etc.) that would help us learn more about you, that would be a plus.

How will you select the candidates?

There are no hard criteria here.  One factor will be whether you can meet the expectations outlined above and how well you do at picking up the problem space.  Ultimately it will be up to the mentor (for this project, it will be me).  If you do apply and we already have a candidate picked, or we don’t choose you for other reasons, we do plan to repeat this every few months.

Looking forward to building great things!


Watching the watcher – Some data on the Talos alerts we generate

What are the performance regressions at Mozilla, who monitors them, and what kind of regressions do we see?  I want to answer these questions with a few peeks at the data.  There are plenty of previous blog posts I have done outlining stats, trends, and the process.  Let's recap briefly what we do, then look at the breakdown of alerts (not necessarily bugs).

When Talos uploads numbers to graph server, they get stored and eventually run through a calculation loop to find regressions and improvements.  As of Jan 1, 2015, we post these to mozilla.dev.tree-alerts as well as email them to the offending patch author (if they can easily be identified).  There are a couple of folks (performance sheriffs) who look at the alerts and triage them.  If necessary, a bug is filed for further investigation.  Reading this brief recap of what happens to our performance numbers probably doesn’t inspire anybody; what is interesting is looking at the actual data we have.
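For the curious, the calculation loop is conceptually a windowed comparison of the numbers before and after each push.  Here is a minimal sketch of that idea; the window sizes and the 5% threshold are made up for illustration and are not the real graph server parameters:

```python
# Minimal sketch of windowed change detection, in the spirit of what the
# graph server does. The window sizes and the 5% threshold are illustrative
# assumptions, not the production values.
from statistics import mean

def detect_changes(values, back_window=12, fore_window=12, threshold=0.05):
    """Scan a series of per-push Talos averages and yield (index, pct_change)
    wherever the forward window differs from the backward window by more than
    `threshold` (a positive change is a regression when lower is better)."""
    for i in range(back_window, len(values) - fore_window + 1):
        before = mean(values[i - back_window:i])
        after = mean(values[i:i + fore_window])
        if before:
            pct = (after - before) / before
            if abs(pct) > threshold:
                yield i, pct

# A 10% jump at index 15 flags the pushes around the change point.
series = [100.0] * 15 + [110.0] * 15
for idx, pct in detect_changes(series):
    print(f"push index {idx}: {pct:+.1%}")
```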

Let's start with some basic facts about alerts in the last 12 months:

  • We have collected 8232 alerts!
  • 4213 of those alerts are regressions (the rest are improvements)
  • 3780 of those above alerts have a manually marked status
    • the rest have been programmatically marked as merged and associated with the original alert
  • 278 bugs have been filed (or 17 alerts/bug)
    • 89 fixed!
    • 61 open!
    • 128 (5 invalid, 8 duplicate, 115 wontfix/worksforme)

As you can see, this is not a casual hobby; it is a real system helping us fix and understand hundreds of performance issues.

We generate alerts on a variety of branches; here is the breakdown of alerts per branch:

number of regression alerts we have received per branch

There are a few things to keep in mind here: mobile/mozilla-central/Firefox are the same branch, and the non-pgo branches only cover linux/windows/android, not osx.

Looking at that graph is sort of uninspiring; most of the alerts land on fx-team and mozilla-inbound first, then show up on the other branches as we merge code.  We run more tests/platforms and land/back out changes more frequently on mozilla-inbound and fx-team, which is why those branches have a larger number of alerts.

Given that we have so many alerts and have manually triaged them, what state do the alerts end up in?

Current state of alerts

The interesting data point here is that 43% of our alerts are duplicates.  A few reasons for this (a sketch of a possible duplicate-matching heuristic follows the list):

  • we see an alert on non-pgo, then on pgo (we usually mark the pgo ones as duplicates)
  • we see an alert on mozilla-inbound, then the same alert shows up on fx-team,b2g-inbound,firefox (due to merging)
    • and then later we see the pgo versions on the merged branches
  • sometimes we retrigger or backfill to find the root cause, which often generates a new alert
  • in a few cases we have landed/backed out/landed a patch and we end up with duplicate sets of alerts
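As a rough illustration of how such duplicates could be matched programmatically, here is a hypothetical heuristic; the field names, the 2% tolerance, and the 7-day merge window are assumptions for the sake of the example, not the actual sheriffing logic:

```python
# Hypothetical duplicate-matching heuristic; the field names, the 2%
# tolerance, and the 7-day merge window are assumptions for illustration,
# not the actual sheriffing logic.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Alert:
    test: str
    platform: str
    branch: str
    pct_change: float
    push_date: datetime

def is_probable_duplicate(a: Alert, b: Alert,
                          tolerance=0.02, window=timedelta(days=7)) -> bool:
    """Two alerts look like duplicates if they are the same test and platform,
    moved in the same direction by a similar amount, and landed within one
    merge window of each other (possibly on different branches)."""
    return (a.test == b.test
            and a.platform == b.platform
            and a.pct_change * b.pct_change > 0              # same direction
            and abs(a.pct_change - b.pct_change) <= tolerance
            and abs(a.push_date - b.push_date) <= window)

inbound = Alert("tp5o", "windows7-32", "mozilla-inbound", 0.06, datetime(2015, 1, 5))
merged = Alert("tp5o", "windows7-32", "fx-team", 0.058, datetime(2015, 1, 7))
print(is_probable_duplicate(inbound, merged))  # True
```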

The last piece of information that I would like to share is the breakdown of alerts per test:

number of alerts per test (some excluded)

There are a few outliers, but we need to keep in mind that active work was being done in certain areas which would explain a lot of alerts for a given test.  There are 35 different test types which wouldn’t look good in an image, so I have excluded retired tests, counters, startup tests, and android tests.

Personally, I am looking forward to the next year as we transition some tools and do some hacking on the reporting, alert generation and overall process.  Thanks for reading!


Re-Triggering for a [root] cause – some notes from doing it

With all this talk of intermittent failures and folks coming up with ideas on how to solve them, I figured I should dive head first into looking at failures.  I have been working with a few folks on this, specifically :parkouss and :vaibhav1994.  This experiment (actually my second time doing it) is where I take a given intermittent-failure bug and retrigger it.  If it reproduces, I then go back in history looking for where it became intermittent.  This weekend I wrote up some notes while trying to define what an intermittent is.

Let's outline the parameters for this experiment first (a sketch of the matching Bugzilla query follows the list):

  • All bugs marked with keyword ‘intermittent-failure’ qualify
  • Bugs must not be resolved or assigned to anybody
  • Bugs must have been filed in the last 28 days (we only keep 30 days of builds)
  • Bugs must have >=20 tbplrobot comments (arbitrarily picked; ideally we want something that reproduces easily)
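As a rough sketch, the selection above maps onto a Bugzilla REST search plus a comment-count filter.  The nobody@mozilla.org placeholder assignee and the tbplbot account match are assumptions here, so treat this as an outline rather than the exact query used:

```python
# Sketch of the candidate selection against the Bugzilla REST API.
# Assumptions (not verified here): unassigned bugs on BMO use the
# nobody@mozilla.org placeholder assignee, and the robot comments come
# from an account whose email contains "tbplbot".
from datetime import datetime, timedelta
import requests

BMO = "https://bugzilla.mozilla.org/rest"
since = (datetime.utcnow() - timedelta(days=28)).strftime("%Y-%m-%d")

bugs = requests.get(f"{BMO}/bug", params={
    "keywords": "intermittent-failure",
    "creation_time": since,                    # filed in the last 28 days
    "include_fields": "id,summary,is_open,assigned_to",
}).json()["bugs"]

candidates = []
for bug in bugs:
    if not bug["is_open"] or bug["assigned_to"] != "nobody@mozilla.org":
        continue                               # skip resolved/assigned bugs
    comments = requests.get(f"{BMO}/bug/{bug['id']}/comment").json()
    comments = comments["bugs"][str(bug["id"])]["comments"]
    robot = sum(1 for c in comments if "tbplbot" in c["creator"])
    if robot >= 20:                            # frequent enough to reproduce, in theory
        candidates.append(bug["id"])

print(len(candidates), "candidate bugs worth re-triggering")
```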

Here is what comes out of this:

  • 356 bugs are open, not assigned and have the intermittent-failure keyword
  • 25 bugs have >=20 comments meeting our criteria

The next step was to look at each of the 25 bugs and see if it made sense to do this.  In fact, I decided not to take action on 13 of the bugs (remember this is an experiment, so my reasoning for ignoring these 13 could be biased):

  • 5 bugs were thunderbird/mozmill only
  • 3 bugs looked to be related to android harness issues
  • bug 1157090 hadn’t reproduced in 2 weeks; it was an APZ feature which we turned off
  • one bug was on mozilla-beta only
  • 2 bugs had patches and 1 had a couple of comments indicating a developer was already looking at it

This leaves us with 12 bugs to investigate.  The premise here is easy: find the first occurrence of the intermittent (branch, platform, testjob, revision) and re-trigger it 20 times (a number picked for simplicity).  When the results are in, see if we have reproduced it.  In fact, only 5 bugs reproduced the exact error in the bug when re-triggered 20 times on a specific job that showed the error.
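Here is a sketch of that re-trigger-and-check step; trigger_job() and fetch_failure_lines() are hypothetical stand-ins for whatever retrigger and log-fetching tooling is available, not real APIs:

```python
# Sketch of the "re-trigger 20 times and check" step. trigger_job() and
# fetch_failure_lines() are hypothetical stand-ins for the actual retrigger
# and log tooling; the signature is the test path plus the error message
# quoted in the intermittent bug.
def reproduces(branch, platform, testjob, revision, signature,
               trigger_job, fetch_failure_lines, times=20):
    """Re-trigger one job `times` times on one revision and report whether
    the exact failure from the bug shows up in any of the new runs
    (in reality you wait for the jobs to finish before checking the logs)."""
    job_ids = [trigger_job(branch, platform, testjob, revision)
               for _ in range(times)]
    hits = sum(1 for job_id in job_ids
               if signature in fetch_failure_lines(job_id))
    print(f"{revision}: {hits}/{times} retriggers reproduced the failure")
    return hits > 0
```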

Moving on, I started re-triggering jobs further back in the pushlog to see where the failure was introduced.  I started off going back 2/4/6 revisions, but got more aggressive as I didn’t see patterns.  Here is a summary of how the 5 bugs turned out:

  • Bug 1161915 – Windows XP PGO Reftest.  Found the root cause (pgo only, lots of pgo builds were required for this) 23 revisions back.
  • Bug 1160780 – OSX 10.6 mochitest-e10s-bc1.  Found the root cause 33 revisions back.
  • Bug 1161052 – Jetpack test failures.  So many failures in the same test file, it isn’t very clear if I am reproducing the failure or finding other ones.  :erikvold is working on fixing the test in bug 1163796 and ok’d disabling it if we want to.
  • Bug 1161537 – OSX 10.6 Mochitest-other.  Bisection didn’t find the root cause, but this is a newly added test case.  This is a case where the new test case could have run 100+ times successfully when it was added, then failed when it merged with other branches a couple of hours later!
  • bug 1155423 – Linux debug reftest-e10s-1.  This reproduced 75 revisions in the past, and due to that I looked at what changed in our buildbot-configs and mozharness scripts.  We actually turned on this test job (hadn’t been running e10s reftests on debug prior to this) and that caused the problem.  This can’t be tracked down by re-triggering jobs into the past.

In summary, out of 356 bugs, 2 root causes were found by re-triggering.  In terms of time invested, I have put in about 6 hours chasing down the root causes of the 5 reproducing bugs.


intermittent oranges- missing a real definition

There are all kinds of great ideas folks have for fixing intermittent issues.  In fact, each idea in and of itself is a worthwhile endeavor.  I have spent some time over the last couple of months fixing them, filing bugs on them, and really discussing them.  One question remains: what is the definition of an intermittent?

I don’t plan to lay out a definition; instead I plan to ask some questions and lay out some parameters.  According to Orange Factor, there are 4640 failures in the last week (May 3 -> May 10), all within 514 unique bugs.  These are all failures that the sheriffs have done some kind of manual work on to star on treeherder.  I am not sure anybody can find a way to paint a pretty picture to make it appear we don’t have intermittent failures.

Looking at a few bugs, there are many reasons for intermittent failures:

  • core infrastructure (networking, power, large classes of machines (ec2), etc.)
  • machine specific (specific machine is failing a lot of jobs)
  • CI specific (buildbot issues, twisted issues, puppet issues, etc.)
  • CI Harness (usually mozharness)
  • Platforms (old platforms/tests we are phasing out, new platforms/tests we are phasing in)
  • Test harness (mochitest + libraries, common code for tests, reftest)
  • Test Cases (test cases actually causing failures, poorly written, new test cases, etc.)
  • Firefox Code (we actually have introduced conditions to cause real failures- just not every time)
  • Real regressions (failures which happen every time we run a test)

There are a lot of reasons, and many of them have nothing to do with poor test cases or bad code in Firefox.  But many of these failures show up many times a day, and for a developer who wants to fix a bad test, a lot of them are not really actionable.  Do we need part of the definition to include something that is actionable?

Looking at the history of ‘intermittent-failure’ bugs in Bugzilla, many occur once and never occur again.  In fact this is the case for over half of the bugs filed (we file upwards of 100 new bugs/week).  While there are probably reasons for a given test case to fail, if it failed in August 2014 and has never failed again, is that test case intermittent?  As a developer could you really do anything about this given the fact that reproducing it is virtually impossible?

This is where I start to realize we need to find a way to identify real intermittent bugs/tests and not clutter the statistics with tests which are virtually impossible to reproduce.  Thinking back to what is actionable: I have found, while filing bugs for Talos regressions, that the closer the bug is filed to the original patch landing, the better the chance it will get fixed.  Adding to that point, we only keep 30 days of builds/test packages around for our CI automation.  I really think a definition of an intermittent needs to include some concept of time.  Should we ignore intermittent failures which occur only once in 90 days?  Maybe ignore ones that don’t reproduce after 1000 iterations?  Some could argue that we look in a smaller or larger window of time/iterations.
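To make those cutoffs concrete, here is one way such a policy could be written down; the 90-day window and the specific counts are placeholders for whatever thresholds the discussion settles on, not a proposal:

```python
# Illustrative classification only; the 90-day window and the cutoffs are
# placeholders for whatever thresholds the discussion settles on.
def classify(occurrences_in_window, window_days=90, noise_max=1, backout_min=30):
    """Bucket a failing test by how often it failed inside the window."""
    if occurrences_in_window <= noise_max:
        return "noise"            # too infrequent to be actionable
    if occurrences_in_window >= backout_min:
        return "backout/disable"  # too frequent to live with
    return "intermittent"         # worth filing, tracking, and fixing

for count in (1, 7, 45):
    print(count, "failures in 90 days ->", classify(count))
```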

Lastly, when looking into specific bugs, I find that many of them are already fixed.  Many of the intermittent failures are actually fixed!  Do we track how many get fixed?  How many have patches or already have debugging taking place?  For example, in the last 28 days we have filed 417 intermittents, of which 55 are already resolved; of the remaining 362, only 25 have occurred >=20 times.  Of those 25 bugs, 4 already have patches.  It appears a lot of work is done to fix the intermittent failures which are actionable.  Are the ones which are not being fixed not actionable?  Are they in a component where all the developers are busy and heads down?

In a perfect world a failure would never occur, all tests would be green, and all users would use Firefox.  In reality we have to deal with thousands of failures every week, most of which never happen again.  This quarter I would like to see many folks get involved in discussions and determine:

  • what is too infrequent to be intermittent? we can call this noise
  • what is the general threshold where something is intermittent?
  • what is the general threshold where we are too intermittent and need to backout a fix or disable a test?
  • what is a reasonable timeframe to track these failures such that we can make them actionable?

Thanks for reading; I look forward to hearing from many who have ideas on this subject.  Stay tuned for an upcoming blog post about re-triggering intermittent failures to find the root cause.


SETA – Search for Extraneous Test Automation

Here at Mozilla we run dozens of builds and hundreds of test jobs for every push to a tree.  As time has gone on, we have gone from a couple of hours from push to all tests complete to 4+ hours.  With the exception of a few test jobs, we could complete our test results in far less time.

The question becomes: how do we keep up test coverage without growing the number of machines?  Right now we do this with buildbot coalescing (we queue up the jobs and skip the older ones when the load is high).  While this works great, it causes us to skip a bunch of jobs (builds/tests) on random pushes, and sometimes we need to go back and manually schedule jobs to find failures.  In fact, while keeping up with the automated alerts for Talos regressions, coalescing causes problems in over half of the regressions that I investigate!
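For readers unfamiliar with coalescing, here is a heavily simplified sketch of the idea; in the real system this folding only happens when the queue is backed up, and buildbot's request-merging is considerably more involved than this:

```python
# Heavily simplified illustration of coalescing; real buildbot merges pending
# build requests when load is high and is considerably more involved.
from collections import deque

def take_next(pending: deque):
    """pending holds (push_id, job_name) requests, oldest first. Returns the
    request that actually runs and the older requests folded into it (those
    pushes never get their own results for this job)."""
    _, job_name = pending[0]
    same_job = [req for req in pending if req[1] == job_name]
    for req in same_job:
        pending.remove(req)
    return same_job[-1], same_job[:-1]   # run the newest push for this job type

queue = deque([(1, "win7 mochitest-1"), (2, "win7 mochitest-1"),
               (3, "win7 mochitest-1"), (3, "linux64 xpcshell")])
print(take_next(queue))
# ((3, 'win7 mochitest-1'), [(1, 'win7 mochitest-1'), (2, 'win7 mochitest-1')])
```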

Knowing that we have lived with coalescing for years, many of us started wondering if we really need all of our tests.  Ideally we could select the tests that are statistically most significant to the changes being pushed, and if those pass, run the rest of the tests when machines are available.  Getting there is tough; maybe there is a better way to solve this?  Luckily we can mine metadata from treeherder (and the former tbpl) and determine which failures are intermittent and which were fixed/caused by a different revision.

A few months ago we started looking into the unique failures on the trees; not just the failures, but which jobs failed.  Normally when the automation detects a failure, many jobs fail at once (for example: xpcshell tests will fail on all windows platforms, opt + debug).  When you look at which jobs fail in common across all the failures over time, you can determine the minimum number of jobs required to detect all the failures.  Keep in mind that we only need 1 job to represent a given failure.
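One way to compute that reduced set is a greedy cover over the failure/job history: repeatedly keep the job that detects the most not-yet-covered failures until every failure is represented.  This is a simplified sketch of the idea (greedy gives an approximation, not a guaranteed minimum), not the actual SETA code:

```python
# Greedy job-reduction sketch in the spirit of SETA: repeatedly keep the job
# that detects the most not-yet-covered failures until every failure is
# represented by at least one kept job. A simplified sketch of the idea, not
# the actual SETA implementation.
def reduce_jobs(failures):
    """failures maps a failure id to the set of job names that failed for it.
    Returns a set of jobs that still detects every failure."""
    uncovered = set(failures)
    kept = set()
    while uncovered:
        tally = {}                  # job -> number of uncovered failures it detects
        for fid in uncovered:
            for job in failures[fid]:
                tally[job] = tally.get(job, 0) + 1
        best = max(tally, key=tally.get)
        kept.add(best)
        uncovered = {fid for fid in uncovered if best not in failures[fid]}
    return kept

failures = {
    "failure-1": {"linux64 xpcshell", "win7 xpcshell", "osx10.8 xpcshell"},
    "failure-2": {"linux64 xpcshell", "linux64 mochitest-1"},
    "failure-3": {"win7 reftest"},
}
print(reduce_jobs(failures))  # e.g. {'linux64 xpcshell', 'win7 reftest'}
```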

As of today, we have data since August 13, 2014 (just shy of 180 days):

  • 1061 failures caught by automation (for desktop builds/tests)
  • 362 jobs are currently run for all desktop builds
  • 285 jobs are optional and not needed to detect all 1061 regressions

To phrase this another way, we could have run 77 jobs per push and caught every regression in the last 6 months.  Let's back up a bit and look at the regressions found: how many are there, and how often do we see them?

Cumulative and per day regressions

This is a lot of regressions; yay for automation.  The problem is that this is historical data, not future data.  Our tests, browser, and features change every day, so this doesn’t seem very useful for predicting the future.  There is a parallel to the stock market, where people invest in companies based on historical data and make decisions based on incoming data (press releases, quarterly earnings).  This is the same concept.  We have dozens of new failures every week, and if we only relied upon the 77 test jobs (which would catch all historical regressions) we would miss new ones.  This is easy to detect, and we have mapped out the changes.  Here it is in a calendar view (bold dates indicate a change was detected, i.e. a new job needed in the reduced set of jobs):

Bolded dates are when a change is needed due to new failures

This translates to about 1.5 changes per week.  To put it another way, if we were only running the reduced set of 77 jobs, we would have missed one regression on December 2nd, another on December 16th, etc.; on average 1.5 regressions would be missed per week.  In a scenario where we only ran the optional jobs once per hour on the integration branches, 1-2 times a week we would see a failure and have to backfill some jobs for the last hour (as we currently do for coalesced jobs) to find the push which caused the failure.

To put this into perspective, here is a similar view to what you would expect to see today on treeherder:

All desktop unittest jobs

For perspective, here is what it would look like assuming we only ran the reduced set of 77 jobs:

Reduced set of jobs view

* keep in mind this is weighted such that we prefer to run jobs on linux* builds since those run in the cloud.

With all of this information, what do we plan to do with it?  We plan to run the reduced set of jobs by default on all pushes and use the 285 optional jobs as candidates for coalescing.  Currently we force coalescing for debug unittests.  This was done about 6 months ago because debug tests take considerably longer than opt tests, so if we could run them on every 2nd or 3rd build we would save a lot of machine time.  This is only being considered on the integration trees that the sheriffs monitor (mozilla-inbound, fx-team).

Some questions that are commonly asked:

  • How do you plan to keep this up to date?
    • We run a cronjob every day and update our master list of jobs, failures, and optional jobs.  This takes about 2 minutes.
  • What are the chances the reduced set of jobs catch >1 failure?  Do we need all 77 jobs?
    • 77 jobs detect 1061 failures (100%)
    • 35 jobs detect 977 failures (92%)
    • 23 jobs detect 940 failures (88.6%)
    • 12 jobs detect 900 failures (84.8%)
  • How can we see the data?
    • SETA website
    • in the near future, summary emails to mozilla.dev.tree-alerts when we detect a *change*

Thanks for reading this far!  This project wouldn’t be here if it wasn’t for the many hours of work by Vaibhav; he continues to find more ways to contribute to Mozilla.  If anything, this should inspire you to think more about how our scheduling works and what great things we can do if we think outside the box.


Tracking Firefox performance as we uplift – the volume of alerts we get

For the last year, I have been focused on ensuring we look at the alerts generated by Talos.  For the last 6 months I have also looked a bit more carefully at the uplifts we do every 6 weeks.  In fact we wouldn’t generate alerts when we uplifted to beta because we didn’t run enough tests to verify a sustained regression in a given time window.

Let's look at the data, specifically the volume of alerts:

Trend of improvements/regressions from Firefox 31 to 36 as we uplift to Aurora

This is a stacked graph; you can interpret it as Firefox 32 having a lot of improvements and Firefox 33 having a lot of regressions.  I think what is more interesting is how many performance regressions are fixed or added when we go from Aurora to Beta.  There is minimal data available for Beta.  This next image compares alert volume for the same release on Aurora and then on Beta:

Side by side stacked bars for the regressions going into Aurora and then going onto Beta.

One way to interpret the graph above is that we fixed a lot of regressions on Aurora while Firefox 33 was there, but for Firefox 34 we introduced a lot of regressions.

The above is just my interpretation of the data; here are links to a more fine-grained view of it:

As always, if you have questions, concerns, praise, or other great ideas- feel free to chat via this blog or via irc (:jmaher).
