Stockwell: flowchart for triage

I gave an update two weeks ago on the current state of Stockwell (intermittent failures).  I mentioned that additional posts were coming, and this is the second post in the series.

First off, the tree sheriffs, who handle merges between branches, tree closures, backouts, hot fixes, and many other actions that keep us releasing code, do one important task for this project: they star each failure against a corresponding bug.

[Flowchart: Sheriff]

These annotations are saved in Treeherder and Orange Factor.  Inside of Orange Factor we have a robot that comments on bugs; this has been changing a bit more frequently this year to help meet our new triage needs.

Once we get bugs annotated, we work on triaging them.  Our primary tool is Neglected Oranges, which gives us a view of all failures that meet our threshold and don’t have a human comment in the last 7 days.  Here is the next stage of the process:

[Flowchart: Triage]
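To make that filter concrete, here is a minimal sketch of the kind of query Neglected Oranges performs.  The record format, field names, and numbers are made up for illustration; the real query runs against Orange Factor/Treeherder data.

    from datetime import datetime, timedelta

    # Hypothetical failure-bug records; the real data lives in Orange Factor / Treeherder.
    bugs = [
        {"id": 111111, "failures_last_week": 45, "last_human_comment": datetime(2017, 1, 2)},
        {"id": 222222, "failures_last_week": 12, "last_human_comment": datetime(2017, 1, 8)},
    ]

    THRESHOLD = 30             # failures/week that put a bug on our radar
    STALE = timedelta(days=7)  # no human comment in the last 7 days

    def neglected(bugs, now):
        """Return bugs over the failure threshold with no recent human comment."""
        return [b for b in bugs
                if b["failures_last_week"] >= THRESHOLD
                and now - b["last_human_comment"] > STALE]

    for bug in neglected(bugs, now=datetime(2017, 1, 10)):
        print("needs triage:", bug["id"])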

As you can see this is very simple, and it should be.  The ideal outcome is adding more information to the bug, which makes it easier for the person we NI? to prioritize the bug and make a decision:

[Flowchart: Comment]

While there is a lot more we can do, and much more that we have done, this seems to be the most effective use of our time when looking across the 1000+ bugs we have triaged so far this year.

In some cases a bug fails very frequently and there are no development resources available to fix it.  These bugs will sometimes cross our policy of 200 failures in 30 days and get a [stockwell disabled-recommended] whiteboard tag; we monitor this and work to disable the affected tests on a regular basis:

[Flowchart: Disable]
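As a rough sketch of that policy check (the thresholds come from the text above; the function and its inputs are hypothetical):

    DISABLE_THRESHOLD = 200  # failures
    WINDOW_DAYS = 30         # rolling window

    def recommend_disable(failures_in_window):
        """True when a bug crosses the 200-failures-in-30-days policy."""
        return failures_in_window >= DISABLE_THRESHOLD

    # A test that failed 240 times in the last 30 days would get the
    # [stockwell disabled-recommended] whiteboard tag and be considered for disabling.
    print(recommend_disable(240))  # True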

This isn’t as cut and dried as disabling every test, but we do disable as quickly as possible and push hard on the bugs that are not trivial to disable.

There are many new people working on intermittent triage, and having a clear understanding of what they are doing will help you know how a random bug ended up with a ni? to you!


Filed under intermittents, testdev

Project Stockwell (October 2017)

It has been 6 months since the last Stockwell update.  With new priorities for many months and reduced effort on Stockwell, I overlooked sending updates.  While we have been spending a reasonable amount of time hacking on Stockwell, it has been less transparent.

I want to cover where we were a year ago, and where we are today.

One year ago today I posted on my blog about defining an intermittent.  We were just starting to focus on learning about failures.  We collected data, read bugs, interviewed many influential people across Mozilla, and came up with a plan, which we presented as Stockwell at the Hawaii all-hands.  Our plan was to do a few things:

  • Triage all failures >=30 instances/week
  • Build tools to make triage easier and collect more data
  • Adjust policy for triaging, disabling, and managing intermittents
  • Make our tests better with linting and test-verification
  • Invest time into auto-classification
  • Define test ownership and triage models that are scalable

While we haven’t focused 100% on intermittent failures in the last 52 weeks, we did for about half of that time, and we have achieved a few things:

  • Triaged all failures >= 30 instances/week (most weeks, never more than 3 weeks off)
  • Many improvements to our tools, including adjusting the Orange Factor robot, adding the intermittent-bug-filer, and adding |mach test-info|
  • Experimented with policy on/off; we have settled on needinfo’ing the “owner” at 30+ failures/week and disabling at 200 failures in 30 days.
  • Added eslint to our tests and pylint for our tools, and the new TV (test-verify) job is tier-2.
  • Added source file -> Bugzilla component mappings in-tree to define ownership.
  • 31 Bugzilla components triage their own intermittents

While that is a lot of changes, it is incremental yet effective.  We started with an Orange Factor of 24+, and now we often see <12 (although last week it was closer to 14).  While doing that we have added many tests, almost doubling our test load, and the Orange Factor has remained low.  We still don’t consider that success: we often have 50+ bugs in a state of “needswork”, and it would be more ideal to have <20 in progress at any one time.  We are also still ignoring half the problem: all the other failures that do not cross our threshold of 30 failures/week.
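For reference, Orange Factor is essentially the number of starred intermittent failures per push over a window of time; a minimal sketch of that arithmetic (the numbers below are illustrative, not real data):

    def orange_factor(starred_failures, pushes):
        """Average intermittent (starred) failures per push over a window."""
        return starred_failures / pushes

    # e.g. 1200 starred failures across 100 pushes -> Orange Factor of 12.0
    print(orange_factor(1200, 100))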

Some statistics about bugs over the last 9 months (Since January 1st):

Category     # Bugs
Fixed           511
Disabled        262
Infra            62
Needswork        49
Unknown         209
Total          1093

As you can see, that is a lot of disabled tests.  Note: we usually only disable a test on a subset of configurations, not 100% across the board.  Another note: unknown bugs are ones that were failing frequently and, for some undocumented reason, have reduced in frequency.

One other interesting piece of data: for many of the fixed bugs we have tried to associate a root cause.  We have done this for 265 bugs, and 90 of them are actual product fixes 🙂  The rest are harness, tooling, infra, or, more commonly, test case fixes.

I will be doing some followup posts on details of the changes we have made over the year including:

  • Triage process for component owners and others who want to participate
  • Test verification and the future
  • Workflow of an intermittent, from first failure to resolution
  • Future of Orange Factor and Autoclassification
  • Vision for the future in 6 months

Please note that the 511 bugs that were fixed were fixed by the many great developers we have at Mozilla.  These were often random requests dropped into a very busy schedule, so if you are reading this and you fixed an intermittent, thank you!


Filed under intermittents, testdev

Project Stockwell – January 2017

Every month this year I am planning to write a summary of Project Stockwell.

Last year we started this project with a series of meetings and experiments.  We presented an overview of our work and the path forward in Hawaii (at a Mozilla all-hands event).

With that said, we will be tracking two items every month:

Week of Jan 02 -> 09, 2017

Orange Factor            13.76
# High frequency bugs    42

What are these high frequency bugs?

  • linux32 debug timeouts for devtools (bug 1328915)
  • Turning on leak checking (bug 1325148) – note: we did this Dec 29th and whitelisted a lot; many leaks still exist, and many great fixes have taken place
  • some infrastructure issues, other timeouts, and general failures

I am excited for the coming weeks as we reduce the Orange Factor back down to <7 and get the number of high frequency bugs to <20.

Outside of these tracking stats there are a few active projects we are working on:

  • adding BUG_COMPONENTS to all files in m-c (bug 1328351) – this will allow us to match up triage contacts for each component so test case ownership has a path to a live person (see the sketch after this list)
  • retrigger an existing job with additional debugging arguments (bug 1322433) – easier to get debug information, possibly extend to special runs like ‘rr-chaos’
  • add |mach test-info| support (bug 1324470) – allows us to get historical timing/run/pass data for a given test file
  • add a test-lint job to linux64/mochitest (bug 1323044) – ensure a test runs reliably by itself and in --repeat mode
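As a sketch of what the in-tree ownership annotation looks like, a moz.build file maps source files to a Bugzilla component roughly like this (the path and product/component below are examples, not a real mapping from m-c):

    # moz.build (illustrative mapping; pick the component that actually owns the files)
    with Files("tests/**"):
        BUG_COMPONENT = ("Testing", "General")

Once |mach test-info| lands (bug 1324470), the historical data for a test could then be pulled with something along the lines of |./mach test-info path/to/test|, though the exact arguments may differ.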

While these seem small, we are currently actively triaging all bugs that are high frequency (>=30 times/week).  In January, triage means letting people know a bug is high frequency and trying to add more data to the bugs.


Filed under intermittents, testdev