Running tests by Bugzilla component instead of test suite

Over the years we have had great dreams of running our tests in many different ways.  There was a dream of ‘hyperchunking’, where we would run everything in hundreds of chunks, finishing all the tests in just a couple of minutes.  This idea is difficult for many reasons, so we shifted to ‘run-by-manifest’.  While we sort of do this now for mochitest, we don’t for web-platform-tests, reftest, or xpcshell.  Both of these models require work on how we schedule and report data, which isn’t too hard to solve, but does require a lot of additional work and supporting 2 models in parallel for some time.

In recent times, there has been an ongoing conversation about ‘run-by-component’.  Let me explain.  We have all files in tree mapped to bugzilla components.  In fact almost all manifests have a clean list of tests that map to the same component.  Why not schedule, run, and report our tests on the same bugzilla component?

I got excited near the end of the Austin work week as I started working on this to see what would happen.

rbc

This is hand crafted to show top level products, and when we expand those products you can see all the components:

rbc_expanded

I just used the first 3 letters of each component until there was a conflict, then I hand edited exceptions.
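The naming rule above can be sketched in a few lines of Python.  This is only an illustration of the idea, not the code from my patch: the function name `make_code_names`, the `overrides` argument, and the sample component names are all made up for this example.

```python
def make_code_names(components, overrides=None):
    """Map each component name to a short code name.

    Hypothetical helper: takes the first three alphanumeric characters of the
    component name (lower-cased).  Names whose code is already taken are not
    assigned; they are returned separately so they can be hand edited.
    """
    overrides = overrides or {}
    codes = {}
    conflicts = []
    # Hand-edited exceptions reserve their code names up front.
    used = set(overrides.values())
    for name in components:
        if name in overrides:
            codes[name] = overrides[name]
            continue
        letters = "".join(ch for ch in name if ch.isalnum()).lower()
        code = letters[:3]
        if code in used:
            conflicts.append(name)
            continue
        used.add(code)
        codes[name] = code
    return codes, conflicts


if __name__ == "__main__":
    # Sample component names, chosen only to show a conflict.
    names = ["Networking", "Networking: HTTP", "DOM", "Layout"]
    print(make_code_names(names))
    print(make_code_names(names, overrides={"Networking: HTTP": "http"}))
```

The first call reports the second networking component as a conflict; the second call resolves it with a hand-edited exception, which is the workflow I followed by hand.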

What is great here is we can easily schedule networking only tests:

rbc_scheduling

and what you would see is:

rbc_networking

^ Keep in mind that in this example I am using the same push and just filtering, but I did test on a smaller scale for a bit with just Core-networking until I got it working.

What would we use this for:

  1. collecting code coverage on components instead of random chunks which will give us the ability to recommend tests to run with more accuracy than we have now
  2. tools like SETA will be more deterministic
  3. developers can filter in treeherder on their specific components and see how green they are, etc.
  4. easier backfilling of intermittents for sheriffs as tests are not moving around between chunks every time we add/remove a test

While I am excited about the 4 reasons above, this is far from being production ready.  There are a few things we would need to solve:

  1. My current patch takes a list of manifests associated with bugzilla components and runs all manifests related to that component- we would need to sanitize all manifests to only have tests related to one component (or solve this differently)
  2. My current patch iterates through all possible test types- this is grossly inefficient, but the best I could do with mozharness- I suspect a slight bit of work and I could have reftest/xpcshell working, likewise web-platform tests.  Ideally we would run all tests from a source checkout and use |./mach test <component>| and it would find what needs to run
  3. What do we do when we need to chunk certain components?  Right now I hack on taskcluster to duplicate a ‘component’ test for each component in a .json file; we also cannot specify platform-specific features and lose a lot of the functionality that we gain with taskcluster.  I assume some simple thought and a feature or two would allow us to retain all the features of taskcluster with the simplicity of component based scheduling
  4. We would need a concrete method for defining the list of components (#2 solves this for the harnesses).  Currently I add raw .json into the taskcluster decision task since it wouldn’t find the file I had checked into the tree when I pushed to try.  In addition, finding the right code names and mappings would ideally be automatic, but might need to be a manual process.
  5. When we run tests in parallel, they will have to be different ‘platforms’ such as linux64-qr, linux64-noe10s.  This is much easier in the land of taskcluster, but a shift from how we currently do things.
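To make the first problem in the list above concrete, here is a rough sketch of grouping manifests by component and flagging the manifests that would need sanitizing.  It assumes we already have (manifest, test, component) triples from the in-tree file to component mapping; the function name, the input shape, and the paths in the example are hypothetical and not taken from my patch.

```python
from collections import defaultdict


def group_manifests_by_component(tests):
    """Group manifests by bugzilla component.

    tests: iterable of (manifest_path, test_path, component) tuples
    (an assumed input format for this sketch).

    Returns (by_component, mixed):
      by_component: component -> sorted list of manifests with tests in it
      mixed: manifest -> sorted list of components, for manifests whose
             tests map to more than one component
    """
    components_per_manifest = defaultdict(set)
    for manifest, _test, component in tests:
        components_per_manifest[manifest].add(component)

    by_component = defaultdict(set)
    mixed = {}
    for manifest, components in components_per_manifest.items():
        for component in components:
            by_component[component].add(manifest)
        if len(components) > 1:
            mixed[manifest] = sorted(components)
    return {c: sorted(m) for c, m in by_component.items()}, mixed


if __name__ == "__main__":
    # Made-up paths and components, for illustration only.
    sample = [
        ("netwerk/test/mochitest.ini", "test_a.html", "Core::Networking"),
        ("dom/tests/mochitest.ini", "test_b.html", "Core::DOM"),
        ("dom/tests/mochitest.ini", "test_c.html", "Core::Networking"),
    ]
    by_component, mixed = group_manifests_by_component(sample)
    print(by_component)
    print(mixed)
```

Any manifest that shows up in `mixed` would run under more than one component today, which is why the manifests need to be sanitized (or the problem solved differently).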

This is something I wanted to bring visibility to.  Many see this as the next stage of how we test at Mozilla, and I am glad for tools like taskcluster, mozharness, and common mozbase libraries (especially manifestparser) which have made this a simple hack.  There is still a lot to learn here.  We see a lot of value in going this direction, but so far we have been looking for value and not for dangers.  What problems do you see with this approach?

Filed under testdev

Measuring the noise in Performance tests

I often hear the question: why are our Talos results so noisy?  What is noise in this context?  By noise we mean a large standard deviation (stddev) in the results we track.  Here is an example:

noise_example

With the large spread of values posted regularly for this series, it is hard to track improvements or regressions unless they are large or very obvious.

Knowing the definition of noise, there are a few questions that we often need to answer:

  • Developers working on new tests- what is the level of noise, how to reduce it, what is acceptable
  • Over time noise changes- this causes false alerts, often not related to code changes or easily discovered via infra changes
  • New hardware we are considering- is this hardware going to post reliable data for us?

What I care about here is the last point: we are working on replacing the hardware we run performance tests on, moving from 7 year old machines to new machines!  Typically when running tests on a new configuration, we want to make sure it is reliably producing results.  For our system, we look for all green:

all_green

This is really promising- if we could have all our tests this “green”, developers would be happy.  The catch here is these are performance tests: are the results we collect and post to graphs useful?  Another way to ask this is: are the results noisy?

Answering this is hard; first we have to know how noisy things are prior to the test.  As mentioned 2 weeks ago, Talos collects 624 metrics that we track for every push.  That would be a lot of graphs and calculations.  One method to do this is to push to try with a single build and collect many data points for each test.  You can see that in the image showing the all green results.

One way to see the noise is to look at the compare view.  This is the view that we use when comparing one push to another push when we have multiple data points.  This typically highlights the changes that are easy to detect with our t-test for alert generation.  If we look at the above referenced push and compare it to itself, we have:

self_compare

Here you can see for a11y, linux64 has a stddev of ±5.27.  You can see some metrics are higher and others are lower.  What if we add up all the stddev numbers that exist, what would we have?  In fact, if we treat this as a sum of the squares to calculate the variance, we can generate a number, in this case 64.48!  That is the noise for that specific run.
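As a rough illustration of that calculation, here is a small Python sketch.  It is one reading of the description above, not the Treeherder implementation: the function names are made up, and taking the square root of the sum of squares at the end is my assumption for getting a single number on the same scale as the individual stddev values.

```python
import math


def sum_of_squares(stddevs):
    """Sum of the squared stddev values (a variance-like total)."""
    return sum(s * s for s in stddevs)


def noise_metric(stddevs):
    """Single noise number for a run.

    Assumption for this sketch: the square root of the sum of squares,
    so that the result is comparable in scale to one stddev value.
    """
    return math.sqrt(sum_of_squares(stddevs))


if __name__ == "__main__":
    # Made-up stddev values for three metrics on one platform.
    print(noise_metric([5.27, 1.2, 0.8]))
```

The useful property is that a single very noisy metric dominates the total, while many quiet metrics add very little.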

Now if we are bringing up a new hardware platform, we can collect a series of data points on the old hardware and repeat this on the new hardware.  Then we can compare data between the two:

hardware_compare

What is interesting here is we can see side by side the differences in noise as well as the improvements and regressions.  What about the variance?  I wanted to track that and did, but realized I needed to track the variance by platform, as each platform could be different.  In bug 1416347, I set out to add a Noise Metric to the compare view.  This is on treeherder staging and will probably be in production next week.  Here is what you will see:

noise_view

Here we see that the old hardware has a noise of 30.83 and the new hardware a noise of 64.48.  There are a lot of small details to iron out, but while we work on getting new hardware for linux64, windows7, and windows10, we now have a simpler method for measuring the stability of our data.

Filed under testdev

Keeping an eye on Performance alerts

Over the last 6 months there has been a deep focus on performance in order to release Firefox 57. Hundreds of developers sought out performance improvements and after thousands of small adjustments we see massive improvements.

Last week I introduced Ionut who has come in as a Performance Sheriff.  What do we do on a regular basis when it comes to monitoring performance?  In the past I focused on Talos and how many bugs per release we found, fixed, and closed.  While that is fun and interesting, we have expanded the scope of sheriffing.

Currently we have many frameworks:

  • Talos (old fashioned perf testing, in-tree, per commit, all desktop platforms- startup, benchmarks, pageload)
  • build_metrics (compile time, installer size, sccache hit rate, num_constructors, etc.)
  • AWSY (are we slim yet, now in-tree, per commit, measuring memory during heavy pageload activity)
  • Autophone (android fennec startup + talos tests, running on 4 different phones, per commit)
  • Platform Microbenchmarks (developer written GTEST (cpp code), mostly graphics and stylo specific)

We continue to refine benchmarks and tests on each of these frameworks to ensure we are running on relevant configurations, measuring the right things, and not duplicating data unnecessarily.

Looking at the list of frameworks, we collect 1127 unique data points and alert on them with included bugs for anything sustained and valid.  While the number of unique metrics can change, here is the current number of metrics we track:

Framework                  Total Metrics
Talos                      624
Autophone                  19
Build Metrics              172
AWSY                       83
Platform Microbenchmarks   229
Total                      1127

While we generate these metrics for every commit (or every few commits for load reasons), what happens is we detect a regression and generate an alert.  In fact we have a sizable number of alerts in the last 6 weeks:

Framework                  Total Alerts
Talos                      429
Autophone                  77
Build Metrics              264
AWSY                       85
Platform Microbenchmarks   227
Total                      1082

Alerts are not really what we file bugs on; instead we have an alert summary which can (and typically does) contain a set of alerts.  Here is the total number of alert summaries (i.e. what a sheriff will look at):

Framework                  Total Summaries
Talos                      172
Autophone                  54
Build Metrics              79
AWSY                       29
Platform Microbenchmarks   136
Total                      470

These alert summaries are then mapped into bugs (or downstream alerts to where the alerts started).  Here is a breakdown of the bugs we have:

Framework                  Total Bugs
Talos                      41
Autophone                  3
Build Metrics              17
AWSY                       6
Platform Microbenchmarks   6
Total                      73

This indicates there are 73 bugs associated with Performance Summaries.  What is deceptive here is many of those bugs are ‘improvements’ and not ‘regressions’.  As you may have figured out, we do associate improvements with bugs and try to comment in the bugs to let you know of the impact your code has on a [set of] metric[s].  Counting only the regressions, here is the breakdown:

Framework                  Total Bugs
Talos                      23
Autophone                  3
Build Metrics              14
AWSY                       4
Platform Microbenchmarks   3
Total                      47

This is a much smaller number of bugs.  There are a few quirks here:

  • some regressions show up across multiple frameworks (reduces to 43 total)
  • some bugs that are ‘downstream’ are marked against the root cause instead of just being downstream.  Often this happens when we are sheriffing bugs and a downstream alert shows up a couple days later.

Over the last few releases here are the tracking bugs:

Note that Firefox 58 has 28 bugs associated with it, but we have 43 bugs from the above query.  Some of those bugs from the above query are related to Firefox 57, and some are starred against a duplicate bug or a root cause bug instead of the regression bug.

I hope you find this data useful and informative towards understanding what goes on with all the performance data.

Filed under testdev

Stockwell: flowchart for triage

I gave an update 2 weeks ago on the current state of Stockwell (intermittent failures).  I mentioned additional posts were coming and this is the second post in the series.

First off, the tree sheriffs, who maintain merges between branches, tree closures, backouts, hot fixes, and many other actions that keep us releasing code, do one important task: they star failures to a corresponding bug.

Sheriff

These annotations are saved in Treeherder and Orange Factor.  Inside of Orange Factor, we have a robot that comments on bugs– this has been changing a bit more frequently this year to help meet our new triage needs.

Once we get bugs annotated, we work on triaging them.  Our primary tool is Neglected Oranges, which gives us a view of all failures that meet our threshold and don’t have a human comment in the last 7 days.  Here is the next stage of the process:

Triage

As you can see this is very simple, and it should be simple.  The ideal state is adding more information to the bug which helps make it easier for the person we NI? to prioritize the bug and make a decision:

Comment

While there is a lot more we can do, and much more that we have done, this seems to be the most effective use when looking across 1000+ bugs that we have triaged so far this year.
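For those curious about what the Neglected Oranges view boils down to, it is essentially a filter like the sketch below.  This is not the Orange Factor code: the function name and the dictionary fields are hypothetical, and the default threshold of 30 failures per week is borrowed from our triage policy rather than from the tool itself.

```python
from datetime import datetime, timedelta


def neglected_bugs(bugs, now, threshold=30, quiet_days=7):
    """Return ids of bugs that meet the failure threshold and have had no
    human comment in the last `quiet_days` days.

    bugs: list of dicts with the (assumed) keys "id", "failures_this_week",
    and "last_human_comment" (a datetime, or None if nobody has commented).
    """
    cutoff = now - timedelta(days=quiet_days)
    return [
        bug["id"]
        for bug in bugs
        if bug["failures_this_week"] >= threshold
        and (bug["last_human_comment"] is None
             or bug["last_human_comment"] < cutoff)
    ]


if __name__ == "__main__":
    # Made-up bug data, for illustration only.
    now = datetime(2017, 11, 1)
    bugs = [
        {"id": 1, "failures_this_week": 40,
         "last_human_comment": datetime(2017, 10, 1)},
        {"id": 2, "failures_this_week": 40,
         "last_human_comment": datetime(2017, 10, 30)},
    ]
    print(neglected_bugs(bugs, now))
```

Robot comments do not count here, which is why the filter only looks at the last human comment.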

In some cases a bug fails very frequently and there are no development resources to spend fixing the bug.  These will sometimes cross our 200 failures in 30 days policy and will get a [stockwell disabled-recommended] whiteboard tag.  We monitor this and work to disable bugs on a regular basis:

Disable

This isn’t as cut and dried as disable every bug, but we do disable as quickly as possible and push hard on the bugs that are not as trivial to disable.

There are many new people working on Intermittent Triage and having a clear understanding of what they are doing will help you know how a random bug ended up with a ni? to you!

Filed under intermittents, testdev

A formal introduction to Ionut Goldan – Mozilla’s new Performance Sheriff and Tool hacker

About 8 months ago we started looking for a full time performance sheriff to help out with our growing number of alerts and needs for keeping the Talos toolchain relevant.

We got really lucky and ended up finding Ionut (:igoldan on irc, #perf).  Over the last 6 months, Ionut has done a fabulous job of learning how to understand Talos alerts, graphs, scheduling, and narrowing down root causes.  In fact, he has not only been able to easily handle all of the Talos alerts, Ionut has picked up alerts from Autophone (Android devices), Build Metrics (build times, installer sizes, etc.), AWSY (memory metrics), and Platform Microbenchmarks (tests run inside of gtest written by a few developers on the graphics and stylo teams).

While I could probably write a list of Ionut’s accomplishments and some tricky bugs he has sorted out, I figured your enjoyment of reading this blog is better spent on getting to know Ionut better, so I did a Q&A with him so we can all learn much more about Ionut.

Tell us about where you live?

I live in Iasi. It is a gorgeous and colorful town, somewhere in the North-East of Romania.  It is full of great places and enchanting sunsets. I love how a casual walk leads me to new, beautiful and peaceful neighborhoods.

I have many things I very much appreciate about this town: the people here, its continuous growth, its historical resonance, the fact that its streets once echoed the steps of the most important cultural figures of our country. It also resembles ancient Rome, as it is also built on 7 hills.

It’s pretty hard not to act like a poet around here.

What inspired you to be a computer programmer?

I wouldn’t say I was inspired to be a programmer.

During my last years in high school, I occasionally consulted with my close ones. Each time we concluded that IT is just the best domain to specialize in: it will improve continuously, there will be jobs available; things that are evident nowadays.

I found much inspiration in this domain after the first year in college, when I noticed the huge advances and how they’re conducted.  I understood we’re living in a whole new era. Digital transformation is now the coined term for what’s going on.

Any interesting projects you have done in the past (school/work/fun)?

I had the great opportunity to work with brilliant teams on a full advertising platform, from almost scratch.

It got almost everything: it was distributed, highly scalable, completely written in Python 3.X, the frontend adopted material design, NoSQL database in conjunction with SQL ones… It used some really cutting-edge libraries and it was a fantastic feeling.

Now it’s Firefox. The sound name speaks for itself and there are just so many cool things I can do here.

What hobbies do you have?

I like reading a lot. History and software technology are my favourite subjects.
I enjoy cooking, when I have the time. My favourite dish definitely is the Hungarian goulash.

Also, I enjoy listening to classical music.

If you could solve any massive problem, what would you solve?

Greed. Laziness. Selfishness. Pride.

We can resolve all problems we can possibly encounter by leveraging technology.

Keeping non-values like those mentioned above would ruin every possible achievement.

Where do you see yourself in 10 years?

In a peaceful home, being a happy and caring father, spending time and energy with my loved ones. Always trying to be the best example for them.  I envision becoming a top notch professional programmer, leading highly performant teams on sound projects. Always familiar with cutting-edge tech and looking to fit it in our tool set.

Constantly inspiring values among my colleagues.

Do you have any advice or lessons learned for new students studying computer science?

Be passionate about IT technologies. Always be curious and willing to learn about new things. There are tons and tons of very good videos, articles, blogs, newsletters, books, docs…Look them out. Make use of them. Follow their guidance and advice.

Continuous learning is something very specific for IT. By persevering, this will become your second nature.

Treat every project as a fantastic opportunity to apply related knowledge you’ve acquired.  You need tons of coding to properly solidify all that theory, to really understand why you need to stick to the Open/Closed principle and all other nitty-gritty little things like that.

I have really enjoyed getting to know Ionut and working with him.  If you see him on IRC please ping him and say hi 🙂

Filed under Community

Talos tests- summary of recent changes

I have done a poor job of communicating status on our performance tooling; this is something I am trying to rectify this quarter.  Over the last 6 months many new talos tests have come online, along with some differences in scheduling or measurement.

In this post I will highlight many of the test related changes and leave other changes for a future post.

Here is a list of new tests that we run:

* cpstartup – (content process startup: thanks :gabor)
* sessionrestore many windows – (instead of one window and many tabs, thanks :beekill)
* perf-reftest[-singletons] – (thanks bholley, :heycam)
* speedometer – (thanks :jmaher)
* tp6 (amazon, facebook, google, youtube) – (thanks :rwood, :armenzg)

These are also new tests, but slight variations on existing tests:

* tp5o + webextension, ts_paint + webextension (test web extension perf, thanks :kmag)
* tp6 + heavy profile, ts_paint + heavy profile (thanks :rwood, :tarek)

The next tests have been updated to be more relevant or reliable:

* damp (many subtests added, more upcoming, thanks :ochameau)
* tps – update measurements (thanks :mconley)
* tabpaint – update measurements (thanks :mconley)
* we run all talos tests on coverage builds (thanks :gmierz)

It is probably known to most, but earlier this year we stood up testing on Windows 10 and turned off our talos coverage on Windows 8 (big thanks to Q for making this happen so fast).

Some changes that might not be so obvious, but worth mentioning:

* Added support for Time to first non blank paint (only tp6)
* Investigated mozAfterPaint on non-empty rects– updated a few tests to measure properly
* Added support for comparing perf measurements between tests (perf-reftests) so we can compare rendering time of A vs B- in this case stylo vs non-stylo
* tp6 requires mitmproxy for record/replay- this allows us to have https and multi host dns resolution which is much more real world than serving pages from http://localhost.
* Added support to wait for idle callback before testing the next page.

Stay tuned for updates on Sheriffing, non Talos tests, and upcoming plans.

Filed under testdev

Project Stockwell (October 2017)

It has been 6 months since the last Stockwell update.  With new priorities for many months and reduced efforts on Stockwell, I overlooked sending updates.  While we have been spending a reasonable amount of time hacking on Stockwell, it has been less transparent.

I want to cover where we were a year ago, and where we are today.

1 year ago today I posted on my blog about defining intermittent.  We were just starting to focus on learning about failures.  We collected data, read bugs, interviewed many influential people across Mozilla, and came up with a plan for Stockwell which we presented at the Hawaii all hands.  Our plan was to do a few things:

  • Triage all failures >=30 instances/week
  • Build tools to make triage easier and collect more data
  • Adjust policy for triaging, disabling, and managing intermittents
  • Make our tests better with linting and test-verification
  • Invest time into auto-classification
  • Define test ownership and triage models that are scalable

While we haven’t focused 100% on intermittent failures in the last 52 weeks, we did about half the time, and have achieved a few things:

  • Triaged all failures >= 30 instances/week (most weeks, never more than 3 weeks off)
  • Many improvements to our tools, including: adjusting orange factor robot, intermittent-bug-filer, and added |mach test-info|
  • Played with policy on/off, have settled on needinfo “owner” when 30+ failures/week, and disabling if 200 failures in 30 days.
  • Added eslint to our tests, pylint for our tools, and the new TV job is tier-2.
  • Added source file -> bugzilla components in-tree to define ownership.
  • 31 bugzilla components triage their own intermittents

While that is a lot of changes, it is incremental yet effective.  We started with an Orange Factor of 24+, and often we see <12 (although last week it was closer to 14).  While doing that we have added many tests, almost doubling our test load, and the Orange Factor has remained low.  We still don’t think that is success: we often have 50+ bugs in a state of “needswork”, and it would be more ideal to have <20 in progress at any one time.  We are still ignoring half the problem: all the other failures that do not cross our threshold of 30 failures/week.

Some statistics about bugs over the last 9 months (since January 1st):

Category    # Bugs
Fixed       511
Disabled    262
Infra       62
Needswork   49
Unknown     209
Total       1093

As you can see that is a lot of disabled tests.  Note, we usually only disable a test on a subset of the configurations, not 100% across the board.  Another NOTE: unknown bugs are ones that were failing frequently and for some undocumented reason have reduced in frequency.

One other interesting piece of data: we have tried to associate many of the fixed bugs with a root cause.  We have done this for 265 bugs, and 90 of them are actual product fixes 🙂  The rest are harness, tooling, infra, or more commonly test case fixes.

I will be doing some followup posts on details of the changes we have made over the year including:

  • Triage process for component owners and others who want to participate
  • Test verification and the future
  • Workflow of an intermittent, from first failure to resolution
  • Future of Orange Factor and Autoclassification
  • Vision for the future in 6 months

Please note that the 511 bugs that were fixed were done by the many great developers we have at Mozilla.  These were often randomized requests in a very busy schedule, so if you are reading this and you fixed an intermittent, thank you!

Filed under intermittents, testdev