Project Stockwell (reduce intermittents) – April 2017

I am 1 week late in posting the update for Project Stockwell. This wraps up a full quarter of work. After developers raised a lot of concerns about a proposed new backout policy, we moved on and didn't change too much, although we did push a little harder and I believe we have disabled more than we fixed as a result.

Let's look at some numbers:

Week starting    01/02/17   02/27/17   03/24/17
Orange Factor    13.76      9.06       10.08
# P1 bugs        42         32         55
OF (P2)          7.25       4.78       5.13

As you can see, all numbers increased in March, but overall 2017 still shows a great decrease so far.

There have been a lot of failures that have lingered for a while and are not specific to a single test. For example:

  • windows 8 talos has a lot of crashes (work is being done in bug 1345735)
  • reftest crashes in bug 1352671.
  • general timeouts in jobs in bug 1204281.
  • and a few other leaks/timeouts/crashes/harness issues unrelated to a specific test
  • infrastructure issues and tier-3 jobs

While these are problematic, the overall failure rate is still going down. In all the other bugs, where a specific test is clearly the problem, we have seen many fixes and responses from a great many test owners and developers. It is rare for us to suggest disabling a test and not have it agreed upon, and when there was concern we usually had a reasonable solution to reduce or fix the failure.

Speaking of which, we have been tracking total bugs, fixed, disabled, etc. with whiteboard tags. While there was a request to not use "stockwell" in the whiteboard tags and to make them more descriptive, after discussing this with many people we couldn't come to agreement on names, on what to track, or on what we would do with the data; so for now, the tags remain the same. Here is some data:

                 03/07/17   04/11/17
total            246        379
fixed            106        170
disabled         61         91
infrastructure   11         17
unknown          44         60
needswork        24         38
% disabled       36.53%     34.87%

What is interesting is that prior to March we had disabled 36.53% of the resolved bugs (disabled out of fixed + disabled), but in March, when we were more "aggressive" about disabling tests, the overall percentage actually went down. That is a cumulative number for the year; for the month of March alone we only disabled 31.91% of the resolved bugs. Possibly if we had disabled a few more tests the overall numbers would have continued to go down rather than ticking slightly up.
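
To make the arithmetic explicit, here is a minimal sketch in plain Python (my own illustration, not project tooling) that reproduces the percentages above from the whiteboard-tag counts:

```python
# Reproduce the "% disabled" figures from the whiteboard-tag table above.
# % disabled = disabled / (fixed + disabled); the snapshots are cumulative for 2017.
snapshots = {
    "03/07/17": {"fixed": 106, "disabled": 61},
    "04/11/17": {"fixed": 170, "disabled": 91},
}

def pct_disabled(fixed, disabled):
    return 100.0 * disabled / (fixed + disabled)

for date, s in snapshots.items():
    print(date, round(pct_disabled(s["fixed"], s["disabled"]), 2))
# 03/07/17 36.53
# 04/11/17 34.87

# March alone: the difference between the two cumulative snapshots.
march_fixed = 170 - 106        # 64 bugs fixed in March
march_disabled = 91 - 61       # 30 bugs disabled in March
print(round(pct_disabled(march_fixed, march_disabled), 2))   # 31.91
```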

A lot of changes took place on the tree in the last month; here is some interesting data on newer jobs:

  • taskcluster windows 7 tests are tier-2 for almost all windows VM tests
  • autophone is running all media tests which are not crashing or perma failing
  • disabled external media tests on linux platforms
  • added stylo mochitest and mochitest-chrome
  • fixed stylo reftests to run in e10s mode and on ubuntu 16.04

Upcoming job changes that I am aware of:

  • more stylo tests coming online
  • more linux tests moving to ubuntu 16.04
  • push to green up windows 10 taskcluster vm jobs

Regarding our tests, we are working on tracking new tests added to the tree, what components they belong to, which harness they run in, and overall how many intermittents we have for each component and harness. Some preliminary work shows that we added 942 mochitest*/xpcshell tests in Q1 (609 were imported webgl tests, so we wrote 333 new tests, 208 of which are browser-chrome). Given that we disabled 91 tests and added 942, we are not doing so badly!

Looking forward into April and Q2, I do not see an immediate need for policy changes; maybe in May we can finalize a policy and make it more formal. With the recent re-org, we are now in the Product Integrity org. This is a good fit, but dedicating full-time resources to sheriffing and tooling for the sake of Project Stockwell is not in its mission. Some of the original work will continue, as it serves many purposes. We will be looking to formalize some of our practices and tools to make this a repeatable process, ensuring that progress can still be made towards reducing intermittents (we want an Orange Factor below 7.0) and creating a sustainable ecosystem for managing these failures and getting fixes in place.

 


Project Stockwell (reduce intermittents) – March 2017

Over the last month we had a higher rate of commits, failures, and fixes. One large change is that we turned on stylo-specific tests, and that was a slightly rocky road. Last month we suggested disabling tests after 2 weeks of seeing the failures. We ended up disabling many tests, but fixing many more.

In addition to disabling more tests, we implemented a set of Bugzilla whiteboard tags to track our progress (a small query sketch follows the list):

  • [stockwell fixed] – a fix went in (even if it only partially fixed the problem); 106 bugs in the last 2 months
  • [stockwell disabled] – we disabled the test in at least one config and no fix landed; 61 bugs in the last 2 months
  • [stockwell infra] – infrastructure issues, which are usually externally driven; 11 bugs in the last 2 months
  • [stockwell unknown] – the bug became less intermittent with no clear reason; 44 bugs in the last 2 months
  • [stockwell needswork] – bugs still in progress; 24 bugs in the last 2 months
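
For anyone who wants to pull these counts themselves, here is a rough sketch using the public Bugzilla REST search API; it assumes the standard `whiteboard` substring filter and a generous `limit`, and it is not the tooling we actually use to produce the numbers above.

```python
# Rough sketch: count bugs carrying each [stockwell *] whiteboard tag via the
# public Bugzilla REST API. Illustrative only; not the project's own tooling.
import requests

BUGZILLA = "https://bugzilla.mozilla.org/rest/bug"
TAGS = [
    "stockwell fixed",
    "stockwell disabled",
    "stockwell infra",
    "stockwell unknown",
    "stockwell needswork",
]

def count_tag(tag):
    # 'whiteboard' is a substring match; include_fields keeps the payload small.
    # limit=0 asks for all matches (subject to whatever cap the server enforces).
    params = {"whiteboard": tag, "include_fields": "id", "limit": 0}
    resp = requests.get(BUGZILLA, params=params, timeout=60)
    resp.raise_for_status()
    return len(resp.json().get("bugs", []))

for tag in TAGS:
    print(f"[{tag}]: {count_tag(tag)} bugs")
```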

We have also been tracking the orange factor and number of high frequency intermittents:

Week starting                 Jan 02, 2017   Jan 30, 2017   Feb 27, 2017
Orange Factor (OF)            13.76          10.75          9.06
# priority intermittents      42             61             32
OF – priority intermittents   7.25           5.78           4.78

I added a new row here, tracking the Orange Factor as if all of the high frequency intermittent bugs didn't exist. This is what the long tail looks like, and I am really excited to see that number going down over time. For me, a healthy spot would be an OF < 5.0 and a long tail < 3.0.

We also looked at the number of unique bugs and repeat bugs/week.  Most bugs have a lifecycle of 2 weeks and 2/3 of the bugs we see in a given week were high frequency (HF) the week prior.  For example this past week we had 32 HF bugs and 21 of them were from the previous week (11 were still HF 2 weeks prior).

While it is tempting to assume we should just disable all failing tests, we find that many developers are actively working on these issues, and the data shows we have many more fixed bugs than disabled bugs. The main motivation for disabling tests is to reduce confusion for developers on try and to reduce the work the sheriffs need to do. Taking this data into account, we are adjusting our disabling policy slightly (a tiny sketch of the thresholds follows the list):

  1. all high frequency bugs (>=30 times/week) will be triaged and expected to be resolved in 2 weeks, otherwise we will start the process of disabling the test that is causing the bug
  2. if a bug occurs >75 times/week, it will be triaged but expectations are that it will be resolved in 1 week, otherwise we will start the process of disabling the test that is causing the bug
  3. if a bug is reduced below high frequency (<30 times/week), we will be happy to make a note of that and keep an eye on it, but will not look at disabling the test.
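
Expressed as code, the thresholds above look roughly like this (a hypothetical helper for illustration, not actual sheriff tooling):

```python
# Hypothetical helper illustrating the triage thresholds described above.
def triage_action(failures_per_week):
    if failures_per_week > 75:
        return "triage; expect resolution in 1 week, then start disabling the test"
    if failures_per_week >= 30:
        return "triage; expect resolution in 2 weeks, then start disabling the test"
    return "note the bug and keep an eye on it; no disabling"

print(triage_action(80))   # 1-week expectation
print(triage_action(35))   # 2-week expectation
print(triage_action(12))   # below the high-frequency threshold
```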

The big change here is that we will be more serious about disabling tests, specifically when a bug occurs >= 75 times/week. We have had many tests failing at least 50% of the time for weeks; these show up on almost all try pushes that run them. Developers should not be seeing failures like these. Since we are tracking fixed vs. disabled, if we determine that we are disabling too much, we can revisit this policy next month.

Outside of numbers and policy, our goal is to have a solid policy, process, and toolchain available for self-triaging as the year goes on. We are refining the policy and process via manual triage. The toolchain is the other work we are doing; here are some updates:

  • adding BUG_COMPONENTS to all files in m-c (bug 1328351) – slow and steady progress, thanks for the reviews to date!  We fell behind while getting SETA completed, but much of the heavy lifting is already done (a minimal example of the annotation follows this list)
  • retrigger an existing job with additional debugging arguments (bug 1322433) – the main discussion is done and we are figuring out small details; we have a prototype working with little work remaining.  Next steps would be to implement the top 3 or 4 use cases.
  • add a test-lint job to linux64/mochitest (bug 1323044) – no progress yet; this got put on the back burner as we worked on SETA and focused on triage, whiteboard tags, and BUG_COMPONENTS.  We have landed code for using the ‘when’ clause for test jobs (bug 1342963), which is a small piece of this.  Getting this initially working will move up in priority soon, and making this work on all harnesses/platforms will most likely be a Google Summer of Code project.
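
To make the BUG_COMPONENTS item above more concrete, here is a minimal moz.build sketch (moz.build files use Python syntax) of the kind of annotation being added across the tree; the components shown are placeholders for illustration, not taken from any specific patch.

```python
# moz.build (sketch): declare which Bugzilla component owns the files in this
# directory, so intermittent failures and new tests can be routed automatically.
# The components below are placeholders, not from a specific landed patch.
with Files("**"):
    BUG_COMPONENT = ("Testing", "General")

# A more specific pattern can override the catch-all for a subset of files.
with Files("tests/browser/**"):
    BUG_COMPONENT = ("Firefox", "General")
```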

Are there items we should be working on or looking into?  Please join our meetings.


Project Stockwell – February 2017

I realized my post for last month was titled “Project Stockwell – January 2016” – that is a fun typo to make 🙂

Last month we focused on triaging all bugs that met our criteria of >=30 failures/week.  Every day there are many new bugs to triage and we started with a large list.  In the end we have commented on all the bugs and have a small list every day to revisit or investigate.

One thing we focus on is requesting assistance at most once per week; to that end, we have a "Neglected Oranges" dashboard that we use daily.

What is changing this month: we will be recommending resolution on priority bugs (>=30 failures/week) within 2 weeks' time. Resolution means active debugging, landing changes to the test to reduce, debug, or fix the intermittent, or, when there is no time or no easy fix, disabling the test. If this goes well, we will reduce that window to 7 days in March.

So how are we doing?

Week starting              Jan 02, 2017   Jan 30, 2017
Orange Factor              13.76          10.75
# priority intermittents   42             61

We have fewer overall failures, but the bugs are more spread out.  Some interesting bugs:

In terms of projects underway, here is some status:

  • adding BUG_COMPONENTS to all files in m-c (bug 1328351) – slow and steady progress, thanks for the reviews to date!  We expect the large majority of this to be completed this month.
  • retrigger an existing job with additional debugging arguments (bug 1322433) – main discussion is done, figuring out small details, should see a prototype this month
  • add |mach test-info| support (bug 1324470) – landed today!
  • add a test-lint job to linux64/mochitest (bug 1323044) – no progress yet, I expect some this month.

Are there items we should be working on or looking into?  Please join our meetings.


Working towards a productive definition of “intermittent orange”

Intermittent Oranges (tests which fail sometimes and pass other times) are an ever-increasing problem with test automation at Mozilla.

While there are many common causes for failures (bad tests, the environment/infrastructure we run on, and bugs in the product), we still do not have a clear definition of what we view as intermittent. Some common statements I have heard:

  • It’s obvious, if it failed last year, the test is intermittent
  • If it failed 3 years ago, I don’t care, but if it failed 2 months ago, the test is intermittent
  • I fixed the test to not be intermittent, I verified by retriggering the job 20 times on try server

These imply very different definitions of what is intermittent. A useful definition will need to:

  • determine if we should take action on a test (programmatically or manually)
  • define a policy sheriffs and developers can use to guide work
  • guide developers to know when a new/fixed test is ready for production
  • provide useful data to release and Firefox product management about the quality of a release

Since I wanted a clear definition of what we are working with, I looked over 6 months (2016-04-01 to 2016-10-01) of OrangeFactor data (7330 bugs, 250,000 failures) to find patterns and trends. I was surprised at how many bugs had <10 instances reported (3310 bugs, 45.1%). Likewise, I was surprised that such a small number of bugs (1236) account for >80% of the failures. It made sense to look at things daily, weekly, monthly, and every 6 weeks (our typical release cycle). After much slicing and dicing, I have come up with 4 buckets (sketched in code after the lists below):

  1. Random Orange: this test has failed, even multiple times in history, but in a given 6-week window we see <10 failures (45.2% of bugs)
  2. Low Frequency Orange: this test might fail up to 4 times in a given day, but typically <=1 failure per day; in a 6-week window we see <60 failures (26.4% of bugs)
  3. Intermittent Orange: fails up to 10 times/day, or <120 times in 6 weeks (11.5% of bugs)
  4. High Frequency Orange: fails >10 times/day, often repeatedly, and is frequently seen in try pushes (16.9% of bugs, or 1236 bugs)

Alternatively, we could simplify our definitions and use:

  • low priority or not actionable (buckets 1 + 2)
  • high priority or actionable (buckets 3 + 4)
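
As a quick sketch of how this could be applied mechanically, here is a hypothetical helper keyed only on the 6-week failure count (ignoring the per-day nuances above) that maps a bug to a bucket and to the simplified low/high priority split:

```python
# Hypothetical helper: map failures seen in a 6-week window to a bucket and
# to the simplified low/high priority split described above.
def bucket(failures_per_6_weeks):
    if failures_per_6_weeks < 10:
        return 1, "random orange"
    if failures_per_6_weeks < 60:
        return 2, "low frequency orange"
    if failures_per_6_weeks < 120:
        return 3, "intermittent orange"
    return 4, "high frequency orange"

def priority(failures_per_6_weeks):
    # buckets 1+2 -> low priority / not actionable; buckets 3+4 -> high priority / actionable
    return "high" if bucket(failures_per_6_weeks)[0] >= 3 else "low"

for count in (4, 45, 100, 500):
    print(count, bucket(count), priority(count))
```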

Does defining these buckets around the number of failures in a given time window help with what we are trying to solve with the definition?

  • Determine if we should take action on a test (programmatically or manually):
    • ideally buckets 1/2 can be detected programmatically with autostar and removed from our view, possibly rerunning to validate it isn’t a new failure
    • buckets 3/4 have the best chance of reproducing; we can run them in debuggers (like ‘rr’), or triage to the appropriate developer when we have enough information
  • Define a policy sheriffs and developers can use to guide work
    • sheriffs can know when to file bugs (either bucket 2 or 3 as a starting point)
    • developers understand the severity based on the bucket.  We will need a lot of context, but understanding severity is important.
  • Guide developers to know when a new/fixed test is ready for production
    • If we fix a test, we want to ensure it is stable before we make it tier-1.  A developer can use the math of ~300 commits/day to ensure the test passes at that volume (see the quick calculation after this list).
    • NOTE: SETA and coalescing ensure we don’t run every test for every push, so in practice we see more like 100 test runs/day
  • Provide useful data to release and Firefox product management about the quality of a release
    • Release Management can take the OrangeFactor into account
    • new features might be required to keep their tests at or below the Random Orange level
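
To put a rough number on the "ready for production" point above, here is a back-of-the-envelope calculation (my own illustration, not an official threshold) of why a handful of try retriggers is weak evidence compared to ~100 production runs per day:

```python
# Probability of seeing at least one failure in n runs of a test whose true
# intermittent failure rate is p. Illustration only, not an official threshold.
def p_detect(p, n):
    return 1 - (1 - p) ** n

# A 2% intermittent: 20 try retriggers vs. ~100 production runs in a day.
print(round(p_detect(0.02, 20), 2))    # ~0.33, easy to miss on try
print(round(p_detect(0.02, 100), 2))   # ~0.87, likely to surface within a day
```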

One other way to look at this is what gets posted in bugs by the War on Orange Bugzilla robot. It follows simple rules:

  • 15+ times/day – post a daily summary (bucket #4)
  • 5+ times/week – post a weekly summary (bucket #3/4 – about 40% of bucket 2 will show up here)

Lastly, I would like to cover some exceptions and ways some might see this as flawed:

  • missing or incorrect data in orange factor (human error)
  • some issues have many bugs, but a single root cause- we could miscategorize a fixable issue

I do not believe adjusting the definition will fix the above issues; different tools or methods of running the tests would more likely reduce those concerns.


QoC.2 – Iterations and thoughts

Quite a few weeks ago now, the Second official Quarter of Contribution wrapped up.  We had advertised 4 projects and found awesome contributors for all 4.  While all hackers gave a good effort, sometimes plans change and life gets in the way.  In the end we had 2 projects with very active contributors.

We had two projects with a lot of activity throughout the project:

First off, this 2nd round of QoC wouldn’t have been possible without the Mentors creating projects and mentoring, nor without the great contributors volunteering their time to build great tools and features.

I really like to look at what worked and what didn't; let me try to summarize some thoughts.

What worked well:

  • building up interest in others to propose and mentor projects
  • having the entire community in #ateam serve as an environment of encouragement and learning
  • specifying exact start/end dates
  • advertising on blogs/twitter/newsgroups to find great hackers

What I would like to see changed for QoC.3:

  • Be clear up front on what we expect.  Many contributors waited until the start date before starting work; that doesn't give people a chance to ensure mentors and projects are a good fit for them (especially over a couple of months)
  • Ensure each project has clear guidelines on code expectations.  Linting, Tests, self review before PR, etc.  These are all things which might be tough to define and tough to do at first, but it makes for better programmers and end products!
  • Keep a check every other week on the projects as mentors (just a simple irc chat or email chain)
  • Consider the timing of the project: either run it on demand as mentors want to do it, or continue in batches, but avoid overlap with common mentor time off (work weeks, holidays)
  • Encourage mentors to set weekly meetings and “office hours”

As it stands now, we are pushing on submitting Outreachy and GSoC project proposals; assuming those programs pick up our projects, we will look at QoC.3 more toward September or November.

 


QoC.2 – WPT Results Viewer – wrapping up

Quite a few weeks ago now, the Second official Quarter of Contribution wrapped up.  We had advertised 4 projects and found awesome contributors for all 4.  While all hackers gave a good effort, sometimes plans change and life gets in the way.  In the end we had 2 projects with very active contributors.

In this post, I want to talk about WPT Results Viewer.  You can find the code on GitHub, and you can still find the team on IRC in #ateam.  As this finished up, I reached out to :martianwars to learn what his experience was like; here are his own words:

What interested you in QoC?

So I’d been contributing to Mozilla for sometime fixing random bugs here and there. I was looking for something larger and more interesting. I think that was the major motivation behind QoC, besides Manishearth’s recommendation to work on the Web Platform Test Viewer. I guess I’m really happy that QoC came around the right time!

What challenges did you encounter while working on your project?  How did you solve them?

I guess the major issue while working on wptview was the lack of Javascript knowledge and the lack of help online when it came to Lovefield. But like every project, I doubt I would have enjoyed much had I known everything required right from the start. I’m glad I got jgraham as a mentor, who made sure I worked my way up the learning curve as we made steady progress.

What are some things you learned?

So I definitely learnt some Javascript, code styling, the importance of code reviews, but there was a lot more to this project. I think the most important thing that I learnt was patience. I generally tend to search for StackOverflow answers when it I need to perform a programming task I’m unaware of. With Lovefield being a relatively new project, I was compelled to patiently read and understand the documentation and sample programs. I also learnt a bit on how a large open source community functions, and I feel excited being a part of it!  A bit irrelevant to the question, but I think I’ve made some friends in #ateam 🙂 The IRC is like my second home, and helps me escape life’s never ending stress, to a wonderland of ideas and excitement!

If you were to give advice to students looking at doing a QoC, what would you tell them?

Well the first thing I would advice them is not to be afraid, especially of asking the so called “stupid” questions on the IRC. The second thing would be to make sure they give the project a decent amount of time, not with the aim of completing it or something, but to learn as much as they can 🙂 Showing enthusiasm is the best way to ensure one has a worthwhile QoC 🙂 Lastly, I’ve tried my level best to get a few newcomers into wptview. I think spreading the knowledge one learns is important, and one should try to motivate others to join open source 🙂

If you were to give advice to mentors wanting to mentor a project, what would you tell them?

I think jgraham has set a great example of what an ideal mentor should be like. Like I mentioned earlier, James helped me learn while we made steady progress. I especially appreciate the way he had (has rather) planned this project. Every feature was slowly built upon and in the right order, and he ensured the project continued to progress while I was away. He would give me a sufficient insight into each feature, and leave the technical aspects to me, correcting my fallacies after the first commit. I think this is the right approach. Lastly, a quality every mentor MUST have, is to be awake at 1am on a weekend night reviewing PRs 😉

Personally I have really enjoyed getting to know :martianwars and seeing the great progress he has made.


Introducing a contributor for the Pulse Guardian project

Three weeks ago we announced the new Quarter of Contribution; today I would like to introduce the participants.  Personally, I really enjoy meeting new contributors and learning about them.  It is exciting to see interest in all 4 projects.  Let me introduce who will be working on Pulse Guardian – Core Hacker:

Mike Yao

What interests you in this specific project?

Python, infrastructure

What do you plan to get out of this after 8 weeks?

Continue to contribute to Mozilla

Are there any interesting facts/hobbies that you would like to share so others can enjoy reading about you?

Cooking/food lover, I was chef long time ago. Free software/Open source and Linux changed my mind and career.

 

I do recall one other eager contributor who might join in late once exams are completed; meanwhile, enjoy learning a bit about Mike Yao (who was introduced to Mozilla by Mike Ling, who did our first ever Quarter of Contribution).
