Category Archives: testdev

Measuring the noise in Performance tests

Often I hear about our talos results, why are they so noisy?  What is noise in this context- by noise we are referring to a larger stddev in the results we track, here would be an example:


With the large spread of values posted regularly for this series, it is hard to track improvements or regressions unless they are larger or very obvious.

Knowing the definition of noise, there are a few questions that we often need to answer:

  • Developers working on new tests- what is the level of noise, how to reduce it, what is acceptable
  • Over time noise changes- this causes false alerts, often not related to to code changes or easily discovered via infra changes
  • New hardware we are considering- is this hardware going to post reliable data for us.

What I care about is the last point, we are working on replacing the hardware we run performance tests on from old 7 year old machines to new machines!  Typically when running tests on a new configuration, we want to make sure it is reliably producing results.  For our system, we look for all green:


This is really promising- if we could have all our tests this “green”, developers would be happy.  The catch here is these are performance tests, are the results we collect and post to graphs useful?  Another way to ask this is are the results noisy?

To answer this is hard, first we have to know how noisy things are prior to the test.  As mentioned 2 weeks ago, Talos collects 624 metrics that we track for every push.  That would be a lot of graph and calculating.  One method to do this is push to try with a single build and collect many data points for each test.  You can see that in the image showing the all green results.

One method to see the noise, is to look at compare view.  This is the view that we use when comparing one push to another push when we have multiple data points.  This typically highlights the changes that are easy to detect with our t-test for alert generation.  If we look at the above referenced push and compare it to itself, we have:



Here you can see for a11y, linux64 has +- 5.27 stddev.  You can see some metrics are higher and others are lower.  What if we add up all the stddev numbers that exist, what would we have?  In fact if we treat this as a sum of the squares to calculate the variance, we can generate a number, in this case 64.48!  That is the noise for that specific run.

Now if we are bringing up a new hardware platform, we can collect a series of data points on the old hardware and repeat this on the new hardware, now we can compare data between the two:


What is interesting here is we can see side by side the differences in noise as well as the improvements and regressions.  What about the variance?  I wanted to track that and did, but realized I needed to track the variance by platform, as each platform could be different- In bug 1416347, I set out to add a Noise Metric to the compare view.  This is on treeherder staging, probably next week in production.  Here is what you will see:


Here we see that the old hardware has a noise of 30.83 and the new hardware a noise of 64.48.  While there are a lot of small details to iron out, while we work on getting new hardware for linux64, windows7, and windows10, we now have a simpler method for measuring the stability of our data.






Filed under testdev

Keeping an eye on Performance alerts

Over the last 6 months there has been a deep focus on performance in order to release Firefox 57. Hundreds of developers sought out performance improvements and after thousands of small adjustments we see massive improvements.

Last week I introduced Ionut who has come in as a Performance Sheriff.  What do we do on a regular basis when it comes to monitoring performance.  In the past I focused on Talos and how many bugs per release we found, fixed, and closed.  While that is fun and interesting, we have expanded the scope of sheriffing.

Currently we have many frameworks:

  • Talos (old fashioned perf testing, in-tree, per commit, all desktop platforms- startup, benchmarks, pageload)
  • build_metrics (compile time, installer size, sccache hit rate, num_constructors, etc.)
  • AWSY (are we slim yet, now in-tree, per commit, measuring memory during heavy pageload activity)
  • Autophone (android fennec startup + talos tests, running on 4 different phones, per commit)
  • Platform Microbenchmarks (developer written GTEST (cpp code), mostly graphics and stylo specific)

We continue to refine benchmarks and tests on each of these frameworks to ensure we are running on relevant configurations, measuring the right things, and not duplicating data unnecessarily.

Looking at the list of frameworks, we collect 1127 unique data points and alert on them with included bugs for anything sustained and valid.  While the number of unique metrics can change, here are the current number of metrics we track:

Framework Total Metrics
Talos 624
Autophone 19
Build Metrics 172
Platform Microbenchmarks 229

While we generate these metrics for every commit (or every few commits for load reasons), what happens is we detect a regression and generate an alert.  In fact we have a sizable number of alerts in the last 6 weeks:

Framework Total Alerts
Talos 429
Autophone 77
Build Metrics 264
Platform Microbenchmarks 227

Alerts are not really what we file bugs on, instead we have an alert summary when can (and typically) does contain a set of alerts.  Here is the total number of alert summaries (i.e. what a sheriff will look at):

Framework Total Summaries
Talos 172
Autophone 54
Build Metrics 79
Platform Microbenchmarks 136

These alert summaries are then mapped into bugs (or downstream alerts to where the alerts started).  Here is a breakdown of the bugs we have:

Framework Total Bugs
Talos 41
Autophone 3
Build Metrics 17
Platform Microbenchmarks 6

This indicates there are 73 bugs associated with Performance Summaries . What is deceptive here is many of those bugs are ‘improvements’ and not ‘regressions’.  If you figured it out, we do associate improvements with bugs and try to comment in the bugs to let you know of the impact your code has on a [set of] metric[s].

Framework Total Bugs
Talos 23
Autophone 3
Build Metrics 14
Platform Microbenchmarks 3

This is a much smaller number of bugs- now there are a few quirks here-

  • some regressions show up across multiple frameworks (reduces to 43 total)
  • some bugs that are ‘downstream’ are marked against the root cause instead of just being downstream.  Often this happens when we are sheriffing bugs and a downstream alert shows up a couple days later.

Over the last few releases here are the tracking bugs:

Note that Firefox 58 has 28 bugs associated with it, but we have 43 bugs from the above query.  Some of those bugs from the above query are related to Firefox 57, and some are starred against a duplicate bug or a root cause bug instead of the regression bug.

I hope you find this data useful and informative towards understanding what goes on with all the performance data.

Leave a comment

Filed under testdev

Stockwell: flowchart for triage

I gave an update 2 weeks ago on the current state of Stockwell (intermittent failures).  I mentioned additional posts were coming and this is a second post in the series.

First off the tree sheriffs who maintain merges between branches, tree closures, backouts, hot fixes, and a many other actions that keep us releasing code do one important task, and that is star failures to a corresponding bug.


These annotations are saved in Treeherder and Orange Factor.  Inside of Orange Factor, we have a robot that comments on bugs– this has been changing a bit more frequently this year to help meet our new triage needs.

Once we get bugs annotated, now we work on triaging them.  Our primarily tool is Neglected Oranges which gives us a view of all failures that meet our threshold and don’t have a human comment in the last 7 days.  Here is the next stage of the process:


As you can see this is very simple, and it should be simple.  The ideal state is adding more information to the bug which helps make it easier for the person we NI? to prioritize the bug and make a decision:


While there is a lot more we can do, and much more that we have done, this seems to be the most effective use when looking across 1000+ bugs that we have triaged so far this year.

In some cases a bug fails very frequently and there are no development resources to spend fixing the bug- these will sometimes cross our 200 failures in 30 days policy and will get a [stockwell disabled-recommended] whiteboard tag, we monitor this and work to disable bugs on a regular basis:


This isn’t as cut and dry as disable every bug, but we do disable as quickly as possible and push hard on the bugs that are not as trivial to disable.

There are many new people working on Intermittent Triage and having a clear understanding of what they are doing will help you know how a random bug ended up with a ni? to you!

Leave a comment

Filed under intermittents, testdev

Talos tests- summary of recent changes

I have done a poor job of communicating status on our performance tooling, this is something I am trying to rectify this quarter.  Over the last 6 months many new talos tests have come online, along with some differences in scheduling or measurement.

In this post I will highlight many of the test related changes and leave other changes for a future post.

Here is a list of new tests that we run:

* cpstartup – (content process startup: thanks :gabor)
* sessionrestore many windows – (instead of one window and many tabs, thanks :beekill)
* perf-reftest[-singletons] – (thanks bholley, :heycam)
* speedometer – (thanks :jmaher)
* tp6 (amazon, facebook, google, youtube) – (thanks :rwood, :armenzg)

These are also new tests, but slight variations on existing tests:

* tp5o + webextension, ts_paint + webextension (test web extension perf, thanks :kmag)
* tp6 + heavy profile, ts_paint + heavy profile (thanks :rwood, :tarek)

The next tests have  been updated to be more relevant or reliable:

* damp (many subtests added, more upcoming, thanks :ochameau)
* tps – update measurements (thanks :mconley)
* tabpaint – update measurements (thanks :mconley)
* we run all talos tests on coverage builds (thanks :gmierz)

It is probably known to most, but earlier this year we stood up testing on Windows 10 and turned off our talos coverage on Windows 8 (big thanks to Q, for making this happen so fast)

Some changes that might not be so obvious, but worth mentioning:

* Added support for Time to first non blank paint (only tp6)
* Investigated mozAfterPaint on non-empty rects– updated a few tests to measure properly
* Added support for comparing perf measurements between tests (perf-reftests) so we can compare rendering time of A vs B- in this case stylo vs non-stylo
* tp6 requires mitmproxy for record/replay- this allows us to have https and multi host dns resolution which is much more real world than serving pages from http://localhost.
* Added support to wait for idle callback before testing the next page.

Stay tuned for updates on Sheriffing, non Talos tests, and upcoming plans.

1 Comment

Filed under testdev

Project Stockwell (October 2017)

It has been 6 months since the last Stockwell update.  With new priorities for many months and reducing our efforts on Stockwell, it was overlooked by me to send updates.  While we have been spending a reasonable amount of time hacking on Stockwell, it has been a less transparent.

I want to cover where we were a year ago, and where we are today.

1 year ago today I posted on my blog about defining intermittent.  We were just starting to focus on learning about failures.  We collected data, read bugs, interviewed many influential people across Mozilla and came up with a plan which we presented Stockwell at the Hawaii all hands.  Our plan was to do a few things:

  • Triage all failures >=30 instances/week
  • Build tools to make triage easier and collect more data
  • Adjust policy for triaging, disabling, and managing intermittents
  • Make our tests better with linting and test-verification
  • Invest time into auto-classification
  • Define test ownership and triage models that are scalable

While we haven’t focused 100% on intermittent failures in the last 52 weeks, we did about half the time, and have achieved a few things:

  • Triaged all failures >= 30 instances/week (most weeks, never more than 3 weeks off)
  • Many improvements to our tools, including: adjusteing orange factor robot, intermittent-bug-filer, and added |mach test-info|
  • Played with policy on/off, have settled on needinfo “owner” when 30+ failures/week, and disabling if 200 failures in 30 days.
  • Added eslint to our tests, pylint for our tools, and the new TV job is tier-2.
  • added source file -> bugzilla components in-tree to define ownership.
  • 31 bugzilla components triage their own intermittents

While that is a lot of changes, it is incremental yet effective.  We started with an Orange Factor of 24+, and often we see <12 (although last week it is closer to 14).  While doing that we have added many tests, almost doubling our test load and the Orange Factor has remained low.  We still don’t think that is success, we often have 50+ bugs in a state of “needswork”, and it would be more ideal to have <20 in progress at any one time.  We are still ignoring half the problem, all the other failures that do not cross our threshold of 30 failures/week.

Some statistics about bugs over the last 9 months (Since January 1st):

Category # Bugs
Fixed 511
Disabled 262
Infra 62
Needswork 49
Unknown 209
Total 1093

As you can see that is a lot of disabled tests.  Note, we usually only disable a test on a subset of the configurations, not 100% across the board.  Another NOTE: unknown bugs are ones that were failing frequently and for some undocumented reason have reduced in frequency.

One other interesting piece of data is many of the fixed bugs we have tried to associate with a root cause, we have done this for 265 bugs and 90 of them are actual product fixes 🙂  The rest are harness, tooling, infra, or more commonly test case fixes.

I will be doing some followup posts on details of the changes we have made over the year including:

  • Triage process for component owners and others who want to participate
  • Test verification and the future
  • Workflow of an intermittent, from first failure to resolution
  • Future of Orange Factor and Autoclassification
  • Vision for the future in 6 months

Please note that the 511 bugs that were fixed were done by the many great developers we have at Mozilla.  These were often randomized requests in a very busy schedule, so if you are reading this and you fixed an intermittent, thank you!

Leave a comment

Filed under intermittents, testdev

Project Stockwell – January 2016

Every month this year I am planning to write a summary of Project Stockwell.

Last year we started this project with a series of meetings and experiments.  We presented in Hawaii (a Mozilla all-hands event) an overview of our work and path forward.

With that said, we will be tracking two items every month:

Week of Jan 02 -> 09, 2017

Orange Factor 13.76
# High Frequency bugs 42

What are these high frequency bugs:

  • linux32 debug timeouts for devtools (bug 1328915)
  • Turning on leak checking (bug 1325148) – note, we did this Dec 29th and whitelisted a lot, still much exists and many great fixes have taken place
  • some infrastructure issues, other timeouts, and general failures

I am excited for the coming weeks as we reduce the orange factor back down <7 and get the high frequency bugs <20.

Outside of these tracking stats there are a few active projects we are working on:

  • adding BUG_COMPONENTS to all files in m-c (bug 1328351) – this will allow us to then match up triage contacts for each components so test case ownership has a patch to a live person
  • retrigger an existing job with additional debugging arguments (bug 1322433) – easier to get debug information, possibly extend to special runs like ‘rr-chaos’
  • add |mach test-info| support (bug 1324470) – allows us to get historical timing/run/pass data for a given test file
  • add a test-lint job to linux64/mochitest (bug 1323044) – ensure a test runs reliably by itself and in –repeat mode

While these seem small, we are currently actively triaging all bugs that are high frequency (>=30 times/week).  In January triage means letting people know this is high frequency and trying to add more data to the bugs.



Filed under intermittents, testdev

Working towards a productive definition of “intermittent orange”

Intermittent Oranges (tests which fail sometimes and pass other times) are an ever increasing problem with test automation at Mozilla.

While there are many common causes for failures (bad tests, the environment/infrastructure we run on, and bugs in the product)
we still do not have a clear definition of what we view as intermittent.  Some common statements I have heard:

  • It’s obvious, if it failed last year, the test is intermittent
  • If it failed 3 years ago, I don’t care, but if it failed 2 months ago, the test is intermittent
  • I fixed the test to not be intermittent, I verified by retriggering the job 20 times on try server

These are imply much different definitions of what is intermittent, a definition will need to:

  • determine if we should take action on a test (programatically or manually)
  • define policy sheriffs and developers can use to guide work
  • guide developers to know when a new/fixed test is ready for production
  • provide useful data to release and Firefox product management about the quality of a release

Given the fact that I wanted to have a clear definition of what we are working with, I looked over 6 months (2016-04-01 to 2016-10-01) of OrangeFactor data (7330 bugs, 250,000 failures) to find patterns and trends.  I was surprised at how many bugs had <10 instances reported (3310 bugs, 45.1%).  Likewise, I was surprised at how such a small number (1236) of bugs account for >80% of the failures.  It made sense to look at things daily, weekly, monthly, and every 6 weeks (our typical release cycle).  After much slicing and dicing, I have come up with 4 buckets:

  1. Random Orange: this test has failed, even multiple times in history, but in a given 6 week window we see <10 failures (45.2% of bugs)
  2. Low Frequency Orange: this test might fail up to 4 times in a given day, typically <=1 failures for a day. in a 6 week window we see <60 failures (26.4% of bugs)
  3. Intermittent Orange: fails up to 10 times/day or <120 times in 6 weeks.  (11.5% of bugs)
  4. High Frequency Orange: fails >10 times/day many times and are often seen in try pushes.  (16.9% of bugs or 1236 bugs)

Alternatively, we could simplify our definitions and use:

  • low priority or not actionable (buckets 1 + 2)
  • high priority or actionable (buckets 3 + 4)

Does defining these buckets about the number of failures in a given time window help us with what we are trying to solve with the definition?

  • Determine if we should take action on a test (programatically or manually):
    • ideally buckets 1/2 can be detected programatically with autostar and removed from our view.  Possibly rerunning to validate it isn’t a new failure.
    • buckets 3/4 have the best chance of reproducing, we can run in debuggers (like ‘rr’), or triage to the appropriate developer when we have enough information
  • Define policy sheriffs and developers can use to guide work
    • sheriffs can know when to file bugs (either buckets 2 or 3 as a starting point)
    • developers understand the severity based on the bucket.  Ideally we will need a lot of context, but understanding severity is important.
  • Guide developers to know when a new/fixed test is ready for production
    • If we fix a test, we want to ensure it is stable before we make it tier-1.  A developer can use math of 300 commits/day and ensure we pass.
    • NOTE: SETA and coalescing ensures we don’t run every test for every push, so we see more likely 100 test runs/day
  • Provide useful data to release and Firefox product management about the quality of a release
    • Release Management can take the OrangeFactor into account
    • new features might be required to have certain volume of tests <= Random Orange

One other way to look at this is what does gets put in bugs (war on orange bugzilla robot).  There are simple rules:

  • 15+ times/day – post a daily summary (bucket #4)
  • 5+ times/week – post a weekly summary (bucket #3/4 – about 40% of bucket 2 will show up here)

Lastly I would like to cover some exceptions and how some might see this flawed:

  • missing or incorrect data in orange factor (human error)
  • some issues have many bugs, but a single root cause- we could miscategorize a fixable issue

I do not believe adjusting a definition will fix the above issues- possibly different tools or methods to run the tests would reduce the concerns there.


Filed under general, testdev, Uncategorized