Measuring the noise in Performance tests

Often I hear questions about our Talos results: why are they so noisy?  What is noise in this context?  By noise we are referring to a large standard deviation (stddev) in the results we track.  Here is an example:

[image: noise_example]

With the large spread of values posted regularly for this series, it is hard to track improvements or regressions unless they are large or very obvious.

Knowing the definition of noise, there are a few questions that we often need to answer:

  • Developers working on new tests: what is the level of noise, how can it be reduced, and what is acceptable?
  • Noise changes over time: this causes false alerts that are often not related to code changes and are not easily traced to infrastructure changes.
  • New hardware we are considering: will this hardware post reliable data for us?

What I care about here is the last point: we are working on replacing the 7-year-old machines we run performance tests on with new machines!  Typically when running tests on a new configuration, we want to make sure it reliably produces results.  For our system, we look for all green:

[image: all_green]

This is really promising: if all of our tests were this “green”, developers would be happy.  The catch is that these are performance tests, so are the results we collect and post to graphs useful?  Another way to ask this: are the results noisy?

This is hard to answer; first we have to know how noisy things were before the change.  As mentioned two weeks ago, Talos collects 624 metrics that we track for every push.  That would be a lot of graphs and a lot of calculating.  One method is to push to try with a single build and collect many data points for each test.  You can see that in the image above showing the all-green results.

One method to see the noise is to look at Compare view.  This is the view we use to compare one push to another when we have multiple data points; it typically highlights the changes that are easy to detect with the t-test we use for alert generation.  If we compare the above-referenced push to itself, we have:

[image: self_compare]


Here you can see that for a11y, linux64 has a ±5.27 stddev.  Some metrics are higher and others are lower.  What if we added up all of the stddev numbers, what would we have?  If we treat them as a sum of squares to calculate a variance, we can generate a single number, in this case 64.48!  That is the noise for that specific run.
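
As a rough illustration, here is a minimal sketch of that kind of calculation, assuming the combined noise number is the square root of the sum of the squared per-test standard deviations; the exact formula used by the compare view, and the sample values below, are assumptions for illustration only:

```python
import math

def noise_metric(stddevs):
    """Combine per-test standard deviations into one noise number.

    Sum the squared stddevs (i.e. the variances), then take the square
    root so the result is in the same units as the individual stddevs.
    """
    return math.sqrt(sum(s ** 2 for s in stddevs))

# Hypothetical per-test stddev values read off a compare view.
old_hw = [5.27, 4.1, 3.3, 2.8, 1.9]
new_hw = [12.4, 9.8, 7.1, 5.5, 4.2]

print(round(noise_metric(old_hw), 2))  # one noise number for the old hardware run
print(round(noise_metric(new_hw), 2))  # one noise number for the new hardware run
```

The same calculation, run over the same set of tests on two hardware configurations, gives the kind of side-by-side comparison discussed below.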

Now if we are bringing up a new hardware platform, we can collect a series of data points on the old hardware, repeat this on the new hardware, and then compare the data between the two:

[image: hardware_compare]

What is interesting here is that we can see, side by side, the differences in noise as well as the improvements and regressions.  What about the variance?  I wanted to track that and did, but realized I needed to track the variance per platform, as each platform can be different.  In bug 1416347, I set out to add a Noise Metric to the compare view.  This is on Treeherder staging now and will probably reach production next week.  Here is what you will see:

[image: noise_view]

Here we see that the old hardware has a noise of 30.83 and the new hardware a noise of 64.48.  While there are still a lot of small details to iron out as we work on getting new hardware for linux64, windows7, and windows10, we now have a simpler method for measuring the stability of our data.




Filed under testdev

Keeping an eye on Performance alerts

Over the last 6 months there has been a deep focus on performance in order to release Firefox 57. Hundreds of developers sought out performance improvements and after thousands of small adjustments we see massive improvements.

Last week I introduced Ionut, who has come on board as a Performance Sheriff.  So what do we do on a regular basis when it comes to monitoring performance?  In the past I focused on Talos and how many bugs per release we found, fixed, and closed.  While that is fun and interesting, we have expanded the scope of sheriffing.

Currently we have many frameworks:

  • Talos (old-fashioned perf testing, in-tree, per commit, all desktop platforms: startup, benchmarks, pageload)
  • build_metrics (compile time, installer size, sccache hit rate, num_constructors, etc.)
  • AWSY (Are We Slim Yet, now in-tree, per commit, measuring memory during heavy pageload activity)
  • Autophone (Android Fennec startup + Talos tests, running on 4 different phones, per commit)
  • Platform Microbenchmarks (developer-written GTest (C++) benchmarks, mostly graphics- and stylo-specific)

We continue to refine benchmarks and tests on each of these frameworks to ensure we are running on relevant configurations, measuring the right things, and not duplicating data unnecessarily.

Looking at the list of frameworks, we collect 1127 unique data points and alert on them, filing bugs for anything sustained and valid.  While the number of unique metrics can change, here is the current count of metrics we track:

Framework | Total Metrics
Talos | 624
Autophone | 19
Build Metrics | 172
AWSY | 83
Platform Microbenchmarks | 229
Total | 1127

While we generate these metrics for every commit (or every few commits, for load reasons), when we detect a regression we generate an alert.  In fact we have had a sizable number of alerts in the last 6 weeks:

Framework | Total Alerts
Talos | 429
Autophone | 77
Build Metrics | 264
AWSY | 85
Platform Microbenchmarks | 227
Total | 1082

Alerts are not really what we file bugs on; instead we have an alert summary, which can (and typically does) contain a set of alerts.  Here is the total number of alert summaries (i.e. what a sheriff will look at):

Framework | Total Summaries
Talos | 172
Autophone | 54
Build Metrics | 79
AWSY | 29
Platform Microbenchmarks | 136
Total | 470
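
To make the relationship concrete, here is a minimal sketch of the alert → alert summary → bug shape that sheriffs work with; the class and field names are my own illustration, not Perfherder’s actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Alert:
    framework: str        # e.g. "Talos"
    metric: str           # e.g. "tp5o linux64 opt"
    is_regression: bool   # alerts can also be improvements

@dataclass
class AlertSummary:
    push_id: int                      # the push the alerts cluster around
    alerts: List[Alert] = field(default_factory=list)
    bug_number: Optional[int] = None  # filed (or downstream) bug, if any

# One summary is what a sheriff looks at; it usually holds several alerts.
summary = AlertSummary(push_id=12345)
summary.alerts.append(Alert("Talos", "tp5o linux64 opt", is_regression=True))
summary.alerts.append(Alert("Talos", "tp5o windows7-32 opt", is_regression=True))
```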

These alert summaries are then mapped to bugs (or marked as downstream of the summary where the alerts started).  Here is a breakdown of the bugs we have:

Framework | Total Bugs
Talos | 41
Autophone | 3
Build Metrics | 17
AWSY | 6
Platform Microbenchmarks | 6
Total | 73

This indicates there are 73 bugs associated with performance alert summaries.  What is deceptive here is that many of those bugs are ‘improvements’ and not ‘regressions’.  Yes, we do associate improvements with bugs and try to comment in them to let you know of the impact your code has on a [set of] metric[s].  Here is the same breakdown counting regressions only:

Framework | Total Bugs (regressions)
Talos | 23
Autophone | 3
Build Metrics | 14
AWSY | 4
Platform Microbenchmarks | 3
Total | 47

This is a much smaller number of bugs.  There are a few quirks here:

  • some regressions show up across multiple frameworks (which reduces the count to 43 unique bugs)
  • some bugs that are ‘downstream’ are marked against the root cause instead of just being flagged as downstream.  Often this happens when we are sheriffing bugs and a downstream alert shows up a couple of days later.

Over the last few releases here are the tracking bugs:

Note that Firefox 58 has 28 bugs associated with it, but we have 43 bugs from the above query.  Some of those are related to Firefox 57, and some are starred against a duplicate bug or a root-cause bug instead of the regression bug.

I hope you find this data useful and informative towards understanding what goes on with all the performance data.


Filed under testdev

Stockwell: flowchart for triage

I gave an update 2 weeks ago on the current state of Stockwell (intermittent failures).  I mentioned that additional posts were coming, and this is the second post in the series.

First off, the tree sheriffs, who handle merges between branches, tree closures, backouts, hot fixes, and many other actions that keep us releasing code, do one important task for this work: they star failures against a corresponding bug.

[image: Sheriff]

These annotations are saved in Treeherder and Orange Factor.  Inside of Orange Factor, we have a robot that comments on bugs; the robot has been changing a bit more frequently this year to help meet our new triage needs.

Once bugs are annotated, we work on triaging them.  Our primary tool is Neglected Oranges, which gives us a view of all failures that meet our threshold and don’t have a human comment in the last 7 days.  Here is the next stage of the process:

[image: Triage]

As you can see this is very simple, and it should be simple.  The ideal outcome is adding more information to the bug, which makes it easier for the person we needinfo (NI?) to prioritize the bug and make a decision:

[image: Comment]

While there is a lot more we can do, and much more that we have done, this seems to be the most effective use of our time when looking across the 1000+ bugs we have triaged so far this year.

In some cases a bug fails very frequently and there are no development resources to spend fixing it.  These bugs will sometimes cross our 200-failures-in-30-days policy and get a [stockwell disabled-recommended] whiteboard tag; we monitor this and work to disable the affected tests on a regular basis:

[image: Disable]

This isn’t as cut and dried as disabling every failing test; we do disable as quickly as possible, and we push hard on the bugs that are not as trivial to disable.
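
As a rough illustration of the thresholds mentioned above (30+ failures/week to land on the triage radar, 7 days without a human comment to be “neglected”, and 200 failures in 30 days for a disable recommendation), here is a minimal sketch of that decision; the function and data shapes are hypothetical, not the actual triage tooling:

```python
TRIAGE_THRESHOLD_PER_WEEK = 30   # failures/week that put a bug on the triage radar
DISABLE_THRESHOLD_30_DAYS = 200  # failures/30 days that trigger a disable recommendation

def triage_action(failures_last_week, failures_last_30_days, days_since_human_comment):
    """Return a rough next step for an intermittent-failure bug."""
    if failures_last_30_days >= DISABLE_THRESHOLD_30_DAYS:
        return "recommend disabling the test"
    if failures_last_week >= TRIAGE_THRESHOLD_PER_WEEK and days_since_human_comment >= 7:
        return "neglected orange: triage and needinfo the owner"
    return "keep watching"

print(triage_action(failures_last_week=45,
                    failures_last_30_days=120,
                    days_since_human_comment=9))
# -> "neglected orange: triage and needinfo the owner"
```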

There are many new people working on Intermittent Triage and having a clear understanding of what they are doing will help you know how a random bug ended up with a ni? to you!


Filed under intermittents, testdev

A formal introduction to Ionut Goldan – Mozilla’s new Performance Sheriff and Tool hacker

About 8 months ago we started looking for a full-time performance sheriff to help out with our growing number of alerts and the need to keep the Talos toolchain relevant.

We got really lucky and ended up finding Ionut (:igoldan on IRC, #perf).  Over the last 6 months, Ionut has done a fabulous job of learning how to understand Talos alerts, graphs, and scheduling, and of narrowing down root causes.  In fact, not only has he been able to easily handle all of the Talos alerts, he has also picked up alerts from Autophone (Android devices), Build Metrics (build times, installer sizes, etc.), AWSY (memory metrics), and Platform Microbenchmarks (tests run inside of gtest, written by a few developers on the graphics and stylo teams).

While I could probably write a list of Ionut’s accomplishments and some tricky bugs he has sorted out, I figured your time reading this blog is better spent getting to know Ionut, so I did a Q&A with him so we can all learn much more about him.

Tell us about where you live?

I live in Iasi. It is a gorgeous and colorful town, somewhere in the North-East of Romania.  It is full of great places and enchanting sunsets. I love how a casual walk leads me to new, beautiful and peaceful neighborhoods.

I have many things I very much appreciate about this town: the people here, its continuous growth, its historical resonance, the fact that its streets once echoed the steps of the most important cultural figures of our country. It also resembles ancient Rome, as it is also built on 7 hills.

It’s pretty hard not to act like a poet around here.

What inspired you to be a computer programmer?

I wouldn’t say I was inspired to be a programmer.

During my last years in high school, I occasionally consulted with my close ones. Each time we concluded that IT is just the best domain to specialize in: it will improve continuously, there will be jobs available; things that are evident nowadays.

I found much inspiration in this domain after the first year in college, when I noticed the huge advances and how they’re conducted.  I understood we’re living in a whole new era. Digital transformation is now the coined term for what’s going on.

Any interesting projects you have done in the past (school/work/fun)?

I had the great opportunity to work with brilliant teams on a full advertising platform, from almost scratch.

It got almost everything: it was distributed, highly scalable, completely written in Python 3.X, the frontend adopted material design, NoSQL databases in conjunction with SQL ones… It used some really cutting-edge libraries and it was a fantastic feeling.

Now it’s Firefox. The sound name speaks for itself and there are just so many cool things I can do here.

What hobbies do you have?

I like reading a lot. History and software technology are my favourite subjects.
I enjoy cooking, when I have the time. My favourite dish definitely is the Hungarian goulash.

Also, I enjoy listening to classical music.

If you could solve any massive problem, what would you solve?

Greed. Laziness. Selfishness. Pride.

We can resolve all problems we can possibly encounter by leveraging technology.

Keeping non-values like those mentioned above would ruin every possible achievement.

Where do you see yourself in 10 years?

In a peaceful home, being a happy and caring father, spending time and energy with my loved ones. Always trying to be the best example for them.  I envision becoming a top notch professional programmer, leading highly performant teams on sound projects. Always familiar with cutting-edge tech and looking to fit it in our tool set.

Constantly inspiring values among my colleagues.

Do you have any advice or lessons learned for new students studying computer science?

Be passionate about IT technologies. Always be curious and willing to learn about new things. There are tons and tons of very good videos, articles, blogs, newsletters, books, and docs… Seek them out. Make use of them. Follow their guidance and advice.

Continuous learning is something very specific for IT. By persevering, this will become your second nature.

Treat every project as a fantastic opportunity to apply related knowledge you’ve acquired.  You need tons of coding to properly solidify all that theory, to really understand why you need to stick to the Open/Closed principle and all other nitty-gritty little things like that.

I have really enjoyed getting to know Ionut and working with him.  If you see him on IRC please ping him and say hi 🙂



Filed under Community

Talos tests- summary of recent changes

I have done a poor job of communicating the status of our performance tooling; this is something I am trying to rectify this quarter.  Over the last 6 months many new Talos tests have come online, along with some changes in scheduling and measurement.

In this post I will highlight many of the test related changes and leave other changes for a future post.

Here is a list of new tests that we run:

* cpstartup – (content process startup: thanks :gabor)
* sessionrestore many windows – (instead of one window and many tabs, thanks :beekill)
* perf-reftest[-singletons] – (thanks bholley, :heycam)
* speedometer – (thanks :jmaher)
* tp6 (amazon, facebook, google, youtube) – (thanks :rwood, :armenzg)

These are also new tests, but slight variations on existing tests:

* tp5o + webextension, ts_paint + webextension (test web extension perf, thanks :kmag)
* tp6 + heavy profile, ts_paint + heavy profile (thanks :rwood, :tarek)

The following tests have been updated to be more relevant or reliable:

* damp (many subtests added, more upcoming, thanks :ochameau)
* tps – update measurements (thanks :mconley)
* tabpaint – update measurements (thanks :mconley)
* we run all talos tests on coverage builds (thanks :gmierz)

It is probably known to most, but earlier this year we stood up testing on Windows 10 and turned off our Talos coverage on Windows 8 (big thanks to Q for making this happen so fast).

Some changes that might not be so obvious, but worth mentioning:

* Added support for time to first non-blank paint (tp6 only)
* Investigated mozAfterPaint on non-empty rects; updated a few tests to measure properly
* Added support for comparing perf measurements between tests (perf-reftests) so we can compare rendering time of A vs. B, in this case stylo vs. non-stylo
* tp6 requires mitmproxy for record/replay; this allows us to have HTTPS and multi-host DNS resolution, which is much more real-world than serving pages from http://localhost
* Added support for waiting on an idle callback before testing the next page

Stay tuned for updates on Sheriffing, non Talos tests, and upcoming plans.


Filed under testdev

Project Stockwell (October 2017)

It has been 6 months since the last Stockwell update.  With new priorities for many months and reduced effort on Stockwell, I overlooked sending updates.  While we have still been spending a reasonable amount of time hacking on Stockwell, it has been less transparent.

I want to cover where we were a year ago, and where we are today.

One year ago today I posted on my blog about defining “intermittent”.  We were just starting to focus on learning about failures.  We collected data, read bugs, interviewed many influential people across Mozilla, and came up with a plan, which we presented as Stockwell at the Hawaii all-hands.  Our plan was to do a few things:

  • Triage all failures >=30 instances/week
  • Build tools to make triage easier and collect more data
  • Adjust policy for triaging, disabling, and managing intermittents
  • Make our tests better with linting and test-verification
  • Invest time into auto-classification
  • Define test ownership and triage models that are scalable

While we haven’t focused 100% on intermittent failures in the last 52 weeks, we did for about half of that time, and we have achieved a few things:

  • Triaged all failures >= 30 instances/week (most weeks, never more than 3 weeks off)
  • Made many improvements to our tools, including adjusting the Orange Factor robot, adding the intermittent-bug-filer, and adding |mach test-info|
  • Experimented with policy, settling on a needinfo to the “owner” at 30+ failures/week and disabling at 200 failures in 30 days
  • Added eslint to our tests and pylint for our tools; the new test-verification (TV) job is tier-2
  • Added source file -> Bugzilla component mappings in-tree to define ownership
  • 31 Bugzilla components now triage their own intermittents

While that is a lot of changes, the approach has been incremental yet effective.  We started with an Orange Factor of 24+, and now we often see <12 (although last week it was closer to 14).  While doing that we have added many tests, almost doubling our test load, and the Orange Factor has remained low.  We still don’t consider that success: we often have 50+ bugs in a “needswork” state, and it would be more ideal to have <20 in progress at any one time.  We are also still ignoring half of the problem, all of the other failures that do not cross our threshold of 30 failures/week.
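
For context, Orange Factor is roughly the average number of intermittent failures (oranges) per push; here is a minimal sketch of that calculation, assuming that definition and using made-up numbers:

```python
def orange_factor(total_failures, total_pushes):
    """Average number of intermittent failures (oranges) per push."""
    return total_failures / total_pushes

# Hypothetical week of data: 4200 starred failures across 300 pushes.
print(round(orange_factor(4200, 300), 2))  # -> 14.0, in line with last week's number
```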

Some statistics about bugs over the last 9 months (Since January 1st):

Category | # Bugs
Fixed | 511
Disabled | 262
Infra | 62
Needswork | 49
Unknown | 209
Total | 1093

As you can see, that is a lot of disabled tests.  Note that we usually only disable a test on a subset of the configurations, not 100% across the board.  Also note that “unknown” bugs are ones that were failing frequently and, for some undocumented reason, have reduced in frequency.

One other interesting piece of data: we have tried to associate many of the fixed bugs with a root cause.  We have done this for 265 bugs, and 90 of them are actual product fixes 🙂  The rest are harness, tooling, or infra fixes, or, most commonly, test case fixes.

I will be doing some followup posts on details of the changes we have made over the year including:

  • Triage process for component owners and others who want to participate
  • Test verification and the future
  • Workflow of an intermittent, from first failure to resolution
  • Future of Orange Factor and Autoclassification
  • Vision for the future in 6 months

Please note that the 511 fixed bugs were fixed by the many great developers we have at Mozilla.  These were often unplanned requests in a very busy schedule, so if you are reading this and you fixed an intermittent, thank you!


Filed under intermittents, testdev

Project Stockwell (reduce intermittents) – April 2017

I am one week late in posting the update for Project Stockwell.  This wraps up a full quarter of work.  After a lot of concerns raised by developers about a proposed new backout policy, we moved on and didn’t change too much, although we did push a little harder and, I believe, disabled more than we fixed as a result.

Let’s look at some numbers:

Week Starting | 01/02/17 | 02/27/17 | 03/24/17
Orange Factor | 13.76 | 9.06 | 10.08
# P1 bugs | 42 | 32 | 55
OF (P2) | 7.25 | 4.78 | 5.13

As you can see, all of the numbers increased in March, but overall there has been a great decrease so far in 2017.

There have been a lot of failures that have lingered for a while and that are not specific to a single test.  For example:

  • windows 8 talos has a lot of crashes (work is being done in bug 1345735)
  • reftest crashes in bug 1352671.
  • general timeouts in jobs in bug 1204281.
  • and a few other leaks/timeouts/crashes/harness issues unrelated to a specific test
  • infrastructure issues and tier-3 jobs

While these are problematic, we see the overall failure rate going down.  For all of the other bugs, where the test is clearly the problem, we have seen many fixes and responses from so many test owners and developers.  It is rare that we suggest disabling a test and it is not agreed upon, and if there was concern we found a reasonable solution to reduce or fix the failure.

Speaking of which, we have been tracking total, fixed, disabled, etc. bugs with whiteboard tags.  While there was a request not to use “stockwell” in the whiteboard tags and to make them more descriptive, after discussing this with many people we couldn’t come to an agreement on names, what to track, or what we would do with the data, so for now we have kept the tags the same.  Here is some data:

Category | 03/07/17 | 04/11/17
total | 246 | 379
fixed | 106 | 170
disabled | 61 | 91
infrastructure | 11 | 17
unknown | 44 | 60
needswork | 24 | 38
% disabled | 36.53% | 34.87%

What is interesting is that prior to March we had disabled tests in 36.53% of the resolved bugs, but in March, when we were more “aggressive” about disabling tests, the overall percentage went down.  This is a cumulative number for the year, so for the month of March alone we only disabled in 31.91% of the resolved bugs.  Possibly if we had disabled a few more tests, the overall numbers would have continued to go down instead of slightly up.
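
A quick sketch of where those percentages come from, assuming “% disabled” in the table above means disabled / (fixed + disabled):

```python
def pct_disabled(fixed, disabled):
    """Share of resolved intermittent bugs that were resolved by disabling."""
    return 100.0 * disabled / (fixed + disabled)

# Cumulative snapshots from the table above.
print(round(pct_disabled(106, 61), 2))   # 03/07/17 -> 36.53
print(round(pct_disabled(170, 91), 2))   # 04/11/17 -> 34.87

# March alone is the difference between the two snapshots.
print(round(pct_disabled(170 - 106, 91 - 61), 2))  # -> 31.91
```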

A lot of changes took place on the tree in the last month; here is some interesting data on newer jobs:

  • taskcluster Windows 7 tests are tier-2 for almost all Windows VM test suites
  • Autophone is running all media tests which are not crashing or perma-failing
  • disabled external media tests on linux platforms
  • added stylo mochitest and mochitest-chrome
  • fixed stylo reftests to run in e10s mode and on ubuntu 16.04

Upcoming job changes that I am aware of:

  • more stylo tests coming online
  • more linux tests moving to ubuntu 16.04
  • push to green up windows 10 taskcluster vm jobs

Regarding our tests, we are working on tracking new tests added to the tree, which components they belong to, which harness they run in, and overall how many intermittents we have for each component and harness.  Some preliminary work shows that we added 942 mochitest*/xpcshell tests in Q1 (609 were imported webgl tests, so we wrote 333 new tests, 208 of which are browser-chrome).  Given that we disabled 91 tests and added 942, we are not doing so badly!

Looking forward into April and Q2, I do not see immediate changes to policy being needed; maybe in May we can finalize a policy and make it more formal.  With the recent re-org, we are now in the Product Integrity org.  This is a good fit, but dedicating full-time resources to sheriffing and tooling for the sake of Project Stockwell is not in the mission.  Some of the original work will continue as it serves many purposes.  We will be looking to formalize some of our practices and tools to make this a repeatable process, to ensure that progress can still be made towards reducing intermittents (we want <7.0) and to create a sustainable ecosystem for managing these failures and getting fixes in place.



Filed under Uncategorized