I am one week late in posting the update for Project Stockwell. This wraps up a full quarter of work. After developers raised a lot of concerns about a proposed new backout policy, we moved on and didn’t change too much, although we did push a little harder, and as a result I believe we have disabled more tests than we fixed.
Let’s look at some numbers:
[Table of monthly numbers, including # P1 bugs, not preserved]
As you can see, all of the numbers increased in March, but overall we have seen a great decrease so far in 2017.
A lot of failures that are not specific to a test have lingered for a while. For example:
- windows 8 talos has a lot of crashes (work is being done in bug 1345735)
- reftest crashes in bug 1352671.
- general timeouts in jobs in bug 1204281.
- and a few other leaks/timeouts/crashes/harness issues unrelated to a specific test
- infrastructure issues and tier-3 jobs
While these are problematic, we see the overall failure rate is going down. In all the other bugs where the test is clearly the problem, we have seen many fixes and responses to bugs from so many test owners and developers. It is rare that we suggested disabling a test and it was not agreed upon, and if there was concern we had a reasonable solution to reduce or fix the failure.
Speaking of which, we have been tracking total bugs, fixed, disabled, etc. with whiteboard tags. There was a request to not use “stockwell” in the whiteboard tags and to make them more descriptive, but after discussing this with many people we couldn’t come to agreement on names, on what to track, or on what we would do with the data, so for now the tags have remained the same. Here is some data:
What is interesting is that prior to March we had disabled 36.53% of the fixes, but in March, when we were more “aggressive” about disabling tests, the overall percentage went down. In fact this is a cumulative number for the year, so for the month of March alone we only disabled 31.91% of the fixed tests. Possibly if we had disabled a few more tests the overall numbers would have continued to go down vs slightly up.
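To make the cumulative-versus-monthly point concrete, here is a minimal sketch of how the year-to-date percentage is a weighted average of the earlier rate and the March rate. The two rates come from the paragraph above; the bug counts and the `cumulative_rate` helper are hypothetical placeholders, since the real counts are not reproduced here.

```python
def cumulative_rate(prior_fixed, prior_rate, month_fixed, month_rate):
    """Combine two disable rates, weighting each by the bugs fixed in its period."""
    disabled = prior_fixed * prior_rate + month_fixed * month_rate
    return disabled / (prior_fixed + month_fixed)


PRIOR_RATE = 0.3653  # share of fixes that were disables before March (from the post)
MARCH_RATE = 0.3191  # share of fixes that were disables in March alone (from the post)

# Hypothetical counts of fixed bugs, for illustration only.
PRIOR_FIXED = 200
MARCH_FIXED = 100

rate = cumulative_rate(PRIOR_FIXED, PRIOR_RATE, MARCH_FIXED, MARCH_RATE)
print(f"cumulative disable rate: {rate:.2%}")
```

Whatever the real counts are, the cumulative figure has to land between the two rates, which is why a lower March rate pulls the year-to-date percentage down.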
A lot of changes took place on the tree in the last month. Here is some interesting data on newer jobs:
- taskcluster windows 7 tests are tier-2 for almost all windows VM tests
- autophone is running all media tests which are not crashing or perma failing
- disabled external media tests on linux platforms
- added stylo mochitest and mochitest-chrome
- fixed stylo reftests to run in e10s mode and on ubuntu 16.04
Upcoming job changes that I am aware of:
- more stylo tests coming online
- more linux tests moving to ubuntu 16.04
- push to green up windows 10 taskcluster vm jobs
Regarding our tests, we are working on tracking new tests added to the tree, what components they belong in, what harness they run in, and overall how many intermittents we have for each component and harness. Some preliminary work shows that we added 942 mochitest*/xpcshell tests in Q1 (609 were imported webgl tests, so we wrote 333 new tests, 208 of those are browser-chrome). Given the fact that we disabled 91 tests and added 942, we are not doing so bad!
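The arithmetic behind those counts can be checked with a quick sketch. All figures are the ones quoted above; the variable names are mine, and the ratios are only a rough comparison, since the disabled tests are not necessarily among the ones added this quarter.

```python
# Figures quoted in the post for Q1 2017.
tests_added = 942      # mochitest*/xpcshell tests added
imported_webgl = 609   # of those, imported webgl tests
browser_chrome = 208   # newly written tests that are browser-chrome
tests_disabled = 91    # tests disabled

newly_written = tests_added - imported_webgl
other_new = newly_written - browser_chrome
disabled_per_added = tests_disabled / tests_added
disabled_per_written = tests_disabled / newly_written

print(f"newly written tests: {newly_written}")
print(f"browser-chrome: {browser_chrome}, other harnesses: {other_new}")
print(f"disabled vs. added: {disabled_per_added:.1%}")
print(f"disabled vs. newly written: {disabled_per_written:.1%}")
```

Against all added tests, the disabled count is under a tenth; against only the newly written ones it is a bit over a quarter.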
Looking forward into April and Q2, I do not see any immediate policy changes needed; maybe in May we can finalize a policy and make it more formal. With the recent re-org, we are now in the Product Integrity org. This is a good fit, but dedicating full-time resources to sheriffing and tooling for the sake of Project Stockwell is not in the mission. Some of the original work will continue as it serves many purposes. We will be looking to formalize some of our practices and tools to make this a repeatable process, so that progress can still be made towards reducing intermittents (we want <7.0) and creating a sustainable ecosystem for managing these failures and getting fixes in place.