Over the last month we had a higher rate of commits, failures, and fixes. One large change is that we turned on stylo-specific tests, and that was a slightly rocky road. Last month we suggested disabling tests after 2 weeks of seeing the failures. We ended up disabling many tests, but fixing many more.
In addition to disabling more tests, we implemented a set of Bugzilla whiteboard entries to track our progress:
* [stockwell fixed] – a fix went in (even if it only partially fixed the problem)
  * in the last 2 months, we have 106
* [stockwell disabled] – we disabled the test in at least one config and no fix has landed
  * in the last 2 months, we have 61
* [stockwell infra] – infra issues, which are usually externally driven
  * in the last 2 months, we have 11
* [stockwell unknown] – the bug became less intermittent for no clear reason
  * in the last 2 months, we have 44
* [stockwell needswork] – bugs in progress
  * in the last 2 months, we have 24
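As a rough illustration of how these entries could be tallied, the sketch below counts `[stockwell ...]` tags in a list of whiteboard strings. The sample strings and the function name `count_stockwell_tags` are made up for this example, and it does not query Bugzilla.

```python
import re
from collections import Counter

# Matches whiteboard entries such as "[stockwell fixed]" and captures the state.
STOCKWELL_TAG = re.compile(r"\[stockwell ([a-z]+)\]")


def count_stockwell_tags(whiteboards):
    """Count each stockwell state across an iterable of whiteboard strings."""
    counts = Counter()
    for whiteboard in whiteboards:
        counts.update(STOCKWELL_TAG.findall(whiteboard))
    return counts


# Hypothetical whiteboard values, not real bug data.
sample = [
    "[stockwell fixed]",
    "[stockwell disabled]",
    "[stockwell fixed][other tag]",
    "[stockwell needswork]",
    "",
]
counts = count_stockwell_tags(sample)
print(counts["fixed"], counts["disabled"], counts["needswork"])
```

Comparing `counts["fixed"]` with `counts["disabled"]` gives the fixed vs disabled balance that we are watching.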
We have also been tracking the orange factor and number of high frequency intermittents:
| Week starting | Jan 02, 2017 | Jan 30, 2017 | Feb 27, 2017 |
|---|---|---|---|
| Orange Factor (OF) | 13.76 | 10.75 | 9.06 |
| # priority intermittents | 42 | 61 | 32 |
| OF excluding priority intermittents | 7.25 | 5.78 | 4.78 |
I added a new row here that tracks the Orange Factor as if none of the high frequency intermittent bugs existed. This is what the long tail looks like, and I am really excited to see that number going down over time. For me, a healthy spot would be OF < 5.0 and the long tail < 3.0.
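To make the new row concrete, here is a small sketch using the numbers from the table above. It computes how much of the Orange Factor the priority intermittents account for (OF minus the long tail) and checks each week against the targets I mentioned. The names `WEEKS`, `OF_TARGET`, `LONG_TAIL_TARGET`, and `summarize` are only for illustration.

```python
# Values copied from the table: (week, OF, OF without priority intermittents).
WEEKS = [
    ("Jan 02, 2017", 13.76, 7.25),
    ("Jan 30, 2017", 10.75, 5.78),
    ("Feb 27, 2017", 9.06, 4.78),
]

# Targets described in the text as a "healthy spot".
OF_TARGET = 5.0
LONG_TAIL_TARGET = 3.0


def summarize(orange_factor, long_tail):
    """Return (OF contributed by priority bugs, whether both targets are met)."""
    priority_share = round(orange_factor - long_tail, 2)
    healthy = orange_factor < OF_TARGET and long_tail < LONG_TAIL_TARGET
    return priority_share, healthy


for week, orange_factor, long_tail in WEEKS:
    share, healthy = summarize(orange_factor, long_tail)
    print(f"{week}: priority bugs contribute {share}, healthy={healthy}")
```

None of the three weeks meets both targets yet, but the long tail drops in every column.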
We also looked at the number of unique bugs and repeat bugs per week. Most bugs have a lifecycle of 2 weeks, and about 2/3 of the high frequency (HF) bugs we see in a given week were also HF the week prior. For example, this past week we had 32 HF bugs and 21 of them were also HF the previous week (11 were also HF 2 weeks prior).
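The repeat rate is simple set arithmetic. A minimal sketch, assuming each week's HF bugs are available as a collection of bug IDs; the IDs below are invented so that the sizes match the 32 and 21 from this past week, and `repeat_fraction` is a hypothetical helper name.

```python
def repeat_fraction(current_week, previous_week):
    """Fraction of this week's HF bugs that were also HF the previous week."""
    current = set(current_week)
    if not current:
        return 0.0
    return len(current & set(previous_week)) / len(current)


# Hypothetical bug IDs: 32 HF bugs this week, 21 of which carry over.
previous = set(range(100, 140))  # IDs 100..139 were HF last week
current = set(range(119, 151))   # IDs 119..150 are HF this week
print(f"{repeat_fraction(current, previous):.0%}")
```

With 21 of 32 bugs repeating, the fraction comes out just under 2/3.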
While it is tempting to assume we should just disable all of these tests, we find that many developers are actively working on these issues, and the data shows that we have many more fixed bugs than disabled bugs. The main motivation for disabling tests is to reduce confusion for developers on try and to reduce the work the sheriffs need to do. Taking this data into account, we are looking to adjust our policy for disabling slightly:
- all high frequency bugs (>= 30 times/week) will be triaged and are expected to be resolved within 2 weeks; otherwise we will start the process of disabling the test that is causing the bug
- if a bug occurs > 75 times/week, it will be triaged and is expected to be resolved within 1 week; otherwise we will start the process of disabling the test that is causing the bug
- if a bug is reduced below high frequency (< 30 times/week), we will be happy to make a note of that and keep an eye on it, but will not look at disabling the test
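The thresholds above can be sketched as a small function. This is only an illustration of the policy, not a tool we run; the names are hypothetical, and I am assuming a strict "more than 75" for the 1-week rule as written in the list.

```python
HIGH_FREQUENCY = 30  # failures/week at which a bug counts as high frequency
VERY_HIGH = 75       # failures/week above which the 1-week window applies


def weeks_to_resolve(failures_per_week):
    """Return the resolution window in weeks, or None if the bug is only watched.

    Assumption: the 1-week rule uses a strict ">" as written in the list;
    change the comparison if the boundary should be inclusive.
    """
    if failures_per_week > VERY_HIGH:
        return 1
    if failures_per_week >= HIGH_FREQUENCY:
        return 2
    return None


for failures in (20, 30, 75, 120):
    print(failures, weeks_to_resolve(failures))
```

A return value of `None` means the bug stays on the watch list and the test is not a candidate for disabling.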
The big change here is that we will be more serious about disabling tests, specifically when a test fails >= 75 times/week. We have had many tests failing at least 50% of the time for weeks; these show up on almost all try pushes that run these tests. Developers should not be seeing failures like these. Since we are tracking fixed vs disabled, if we determine that we are disabling too much, we can revisit this policy next month.
Outside of numbers and policy, our goal is to have a solid policy, process, and toolchain available for self-triaging as the year goes on. We are refining the policy and process via manual triage. The toolchain is the other work we are doing; here are some updates:
- adding BUG_COMPONENTS to all files in m-c (bug 1328351) – slow and steady progress, thanks for the reviews to date! We fell behind while getting SETA completed, but much of the heavy lifting is already done.
- retrigger an existing job with additional debugging arguments (bug 1322433) – the main discussion is done and we are figuring out small details; we have a working prototype with little work remaining. Next steps would be to implement the top 3 or 4 use cases.
- add a test-lint job to linux64/mochitest (bug 1323044) – no progress yet; this got put on the back burner as we worked on SETA and focused on triage, whiteboard tags, and BUG_COMPONENTS. We have landed code for using the ‘when’ clause for test jobs (bug 1342963), which is a small piece of this. Getting this initially working will move up in priority soon, and making this work on all harnesses/platforms will most likely be a Google Summer of Code project.
Are there items we should be working on or looking into? Please join our meetings.