re-triggering for a [root] cause – version 1.57

Last week I wrote some notes about re-triggering jobs to find a root cause.  This week I decided to look at the orange factor email of the top 10 bugs and see how I could help.  Looking at each of the 10 bugs, I had 3 worth investigating and 7 I ignored.

Investigate:

  • Bug 1163911 test_viewport_resize.html – new test which was added 15 revisions back from the first instance in the bug.  The sheriffs had already worked to get this test disabled prior to my results coming in!
  • Bug 1081925 browser_popup_blocker.js – previous test in the directory was modified to work in e10s 4 revisions back from the first instance reported in the bug, causing this to fail
  • Bug 1118277 browser_popup_blocker.js (different symptom, same test pattern and root cause as bug 1081925)

Ignore:

  • Bug 1073442 – Intermittent command timed out; might not be code related and >30 days of history.
  • Bug 1096302test_collapse.html | Test timed out. >30 days of history.
  • Bug 1151786 testOfflinePage. >30 days of history. (and a patch exists).
  • Bug 1145199 browser_referrer_open_link_in_private.js. >30 days of history.
  • Bug 1073761 test_value_storage.html. >30 days of history.
  • Bug 1161537 test_dev_mode_activity.html. resolved (a result from the previous bisection experiment).
  • Bug 1153454 browser_tabfocus.js. >30 days of history.

Looking at the bugs of interest, I jumped right in in retriggering.  This time around I did 20 retriggers for the original changeset, then went back to 30 revisions (every 5th) doing the same thing.  Effectively this was doing 20 retriggers for the 0, 5th, 10th, 15th, 20th, 25th, and 30th revisions in the history list (140 retriggers).

I ran into issues doing this, specifically on Bug 1073761.  The reason why is that for about 7 revisions in history the windows 8 builds failed!  Luckily the builds finished enough to get a binary+tests package so we could run tests, but mozci didn’t understand that the build was available.  That required some manual retriggering.  Actually a few cases on both retriggers were actual build failures which resulted in having to manually pick a different revision to retrigger on.  This was fairly easy to then run my tool again and fill in the 4 missing revisions using slightly different mozci parameters.

This was a bit frustrating as there was a lot of manual digging and retriggering due to build failures.  Luckily 2 of the top 10 bugs are the same root cause and we figured it out.  Including irc chatter and this blog post, I have roughly 3 hours invested into this experiment.

Leave a comment

Filed under Uncategorized

A-Team contribution opportunity – Dashboard Hacker

I am excited to announce a new focused project for contribution – Dashboard Hacker.  Last week we gave a preview that today we would be announcing 2 contribution projects.  This is an unpaid program where we are looking for 1-2 contributors who will dedicate between 5-10 hours/week for at least 8 weeks.  More time is welcome, but not required.

What is a dashboard hacker?

When a developer is ready to land code, they want to test it. Getting the results and understanding the results is made a lot easier by good dashboards and tools. For this project, we have a starting point with our performance data view to fix up a series of nice to have polish features and then ensure that it is easy to use with a normal developer workflow. Part of the developer work flow is the regular job view, If time permits there are some fun experiments we would like to implement in the job view.  These bugs, features, projects are all smaller and self contained which make great projects for someone looking to contribute.

What is required of you to participate?

  • A willingness to learn and ask questions
  • A general knowledge of programming (most of this will be in javascript, django, angularJS, and some work will be in python.
  • A promise to show up regularly and take ownership of the issues you are working on
  • Good at solving problems and thinking out of the box
  • Comfortable with (or willing to try) working with a variety of people

What we will guarantee from our end:

  • A dedicated mentor for the project whom you will work with regularly throughout the project
  • A single area of work to reduce the need to get up to speed over and over again.
    • This project will cover many tools, but the general problem space will be the same
  • The opportunity to work with many people (different bugs could have a specific mentor) while retaining a single mentor to guide you through the process
  • The ability to be part of the team- you will be welcome in meetings, we will value your input on solving problems, brainstorming, and figuring out new problems to tackle.

How do you apply?

Get in touch with us either by replying to the post, commenting in the bug or just contacting us on IRC (I am :jmaher in #ateam on irc.mozilla.org, wlach on IRC will be the primary mentor).  We will point you at a starter bug and introduce you to the bugs and problems to solve.  If you have prior work (links to bugzilla, github, blogs, etc.) that would be useful to learn more about you that would be a plus.

How will you select the candidates?

There is no real criteria here.  One factor will be if you can meet the criteria outlined above and how well you do at picking up the problem space.  Ultimately it will be up to the mentor (for this project, it will be :wlach).  If you do apply and we already have a candidate picked or don’t choose you for other reasons, we do plan to repeat this every few months.

Looking forward to building great things!

Leave a comment

Filed under Community, testdev

A-Team contribution opportunity – DX (Developer Ergonomics)

I am excited to announce a new focused project for contribution – Developer Ergonomics/Experience, otherwise known as DX.  Last week we gave a preview that today we would be announcing 2 contribution projects.  This is an unpaid program where we are looking for 1-2 contributors who will dedicate between 5-10 hours/week for at least 8 weeks.  More time is welcome, but not required.

What does DX mean?

We chose this project as we continue to experience frustration while fixing bugs and debugging test failures.  Many people suggest great ideas, in this case we have set aside a few ideas (look at the dependent bugs to clean up argument parsers, help our tests run in smarter chunks, make it easier to run tests locally or on server, etc.) which would clean up stuff and be harder than a good first bug, yet each issue by itself would be too easy for an internship.  Our goal is to clean up our test harnesses and tools and if time permits, add stuff to the workflow which makes it easier for developers to do their job!

What is required of you to participate?

  • A willingness to learn and ask questions
  • A general knowledge of programming (this will be mostly in python with some javascript as well)
  • A promise to show up regularly and take ownership of the issues you are working on
  • Good at solving problems and thinking out of the box
  • Comfortable with (or willing to try) working with a variety of people

What we will guarantee from our end:

  • A dedicated mentor for the project whom you will work with regularly throughout the project
  • A single area of work to reduce the need to get up to speed over and over again.
    • This project will cover many tools, but the general problem space will be the same
  • The opportunity to work with many people (different bugs could have a specific mentor) while retaining a single mentor to guide you through the process
  • The ability to be part of the team- you will be welcome in meetings, we will value your input on solving problems, brainstorming, and figuring out new problems to tackle.

How do you apply?

Get in touch with us either by replying to the post, commenting in the bug or just contacting us on IRC (I am :jmaher in #ateam on irc.mozilla.org).  We will point you at a starter bug and introduce you to the bugs and problems to solve.  If you have prior work (links to bugzilla, github, blogs, etc.) that would be useful to learn more about you that would be a plus.

How will you select the candidates?

There is no real criteria here.  One factor will be if you can meet the criteria outlined above and how well you do at picking up the problem space.  Ultimately it will be up to the mentor (for this project, it will be me).  If you do apply and we already have a candidate picked or don’t choose you for other reasons, we do plan to repeat this every few months.

Looking forward to building great things!

1 Comment

Filed under Community, testdev

Watching the watcher – Some data on the Talos alerts we generate

What are the performance regressions at Mozilla- who monitors them and what kind of regressions do we see?  I want to answer this question with a few peeks at the data.  There are plenty of previous blog posts I have done outlining stats, trends, and the process.  Lets recap what we do briefly, then look at the breakdown of alerts (not necessarily bugs).

When Talos uploads numbers to graph server they get stored and eventually run through a calculation loop to find regressions and improvements.  As of Jan 1, 2015, we upload these to mozilla.dev.tree-alerts as well as email to the offending patch author (if they can easily be identified).  There are a couple folks (performance sheriffs) who look at the alerts and triage them.  If necessary a bug is filed for further investigation.  Reading this brief recap of what happens to our performance numbers probably doesn’t inspire folks, what is interesting is looking at the actual data we have.

Lets start with some basic facts about alerts in the last 12 months:

  • We have collected 8232 alerts!
  • 4213 of those alerts are regressions (the rest are improvements)
  • 3780 of those above alerts have a manually marked status
    • the rest have been programatically marked as merged and associated with the original
  • 278 bugs have been filed (or 17 alerts/bug)
    • 89 fixed!
    • 61 open!
    • 128 (5 invalid, 8 duplicate, 115 wontfix/worksforme)

As you can see this is not a casual hobby, it is a real system helping out in fixing and understanding hundreds of performance issues.

We generate alerts on a variety of branches, here is the breakdown of branches and alerts/branch;

number of regression alerts we have received per branch

number of regression alerts we have received per branch

There are a few things to keep in mind here, mobile/mozilla-central/Firefox are the same branch, and for non-pgo branches that is only linux/windows/android, not osx. 

Looking at that graph is sort of non inspiring, most of the alerts will land on fx-team and mozilla-inbound, then show up on the other branches as we merge code.  We run more tests/platforms and land/backout stuff more frequently on mozilla-inbound and fx-team, this is why we have a larger number of alerts.

Given the fact we have so many alerts and have manually triaged them, what state the the alerts end up in?

Current state of alerts

Current state of alerts

The interesting data point here is that 43% of our alerts are duplicates.  A few reasons for this:

  • we see an alert on non-pgo, then on pgo (we usually mark the pgo ones as duplicates)
  • we see an alert on mozilla-inbound, then the same alert shows up on fx-team,b2g-inbound,firefox (due to merging)
    • and then later we see the pgo versions on the merged branches
  • sometimes we retrigger or backfill to find the root cause, this generates a new alert many times
  • in a few cases we have landed/backed out/landed a patch and we end up with duplicate sets of alerts

The last piece of information that I would like to share is the break down of alerts per test:

Alerts per test

number of alerts per test (some excluded)

There are a few outliers, but we need to keep in mind that active work was being done in certain areas which would explain a lot of alerts for a given test.  There are 35 different test types which wouldn’t look good in an image, so I have excluded retired tests, counters, startup tests, and android tests.

Personally, I am looking forward to the next year as we transition some tools and do some hacking on the reporting, alert generation and overall process.  Thanks for reading!

1 Comment

Filed under testdev

community hacking – thoughts on what works for the automation and tools team

Community is a word that means a lot of things to different people.  When there is talk of community at an A*Team meeting, some people perk up and others tune out.  Taking a voluntary role in leading many community efforts on the A*Team over the last year, here are some thoughts I have towards accepting contributions, growing community, and making it work within the team.

Contributions:

Historically on the A*Team we would file bugs which are mentored (and discoverable via bugsahoy) and blog/advertise help wanted.  This is always met with great enthusiasm from a lot of contributors.  What does this mean for the mentor?  There are a few common axis here:

  • High-Touch vs Low-Touch
    • High-Touch is where there is a lot of time invested in getting the current problem solved.  Usually a lot of bug comments, email, irc chatter to solve a good first bug.  Sometimes this can take hours!
    • Low-Touch is where a person comes in, a patch randomly appears and there is little to no feedback for the patch.
  • High-Reward vs Low-Reward:
    • High-Reward is where we have contributors that solve larger problems.  A rewarding experience for both the contributor and the mentor
    • Low-Reward is where a contributor is fixing useful things, but they are little nits or polish.  This isn’t as rewarding for the contributor, nor the mentor.
  • Short-Term vs Long-Term:
    • Short-Term – a contributor shows up for a few days, fixes however many bugs they can and disappears.  This is a common workflow for folks who are on break from school or shifting around in different stages of their lives.
    • Long-Term – a contributor who shows up on a regular basis, continues to contribute and just the fact of them being around has a much larger impact on the team.

We need to appreciate all types of contributions and ensure we do our best to encourage folks to participate.  As a mentor if you have a lot of high-touch, low-reward, short-term contributors, it is exhausting and de-moralizing.  No wonder a lot of people don’t want to participate in mentoring folks as they contribute.  It is also unrealistic to expect a bunch of seasoned coders to show up and implement all the great features, then repeat for years on end.

The question remains, how do you find low-touch contributors or identify ones that are high-touch at the start and end up learning fast (some of the best contributors fall into this latter category).

Growing Community:

The simple answer here is file a bunch of bugs.  In fact whenever we do this they get fixed real fast.  This turns into a problem when you have 8 new contributors, 2 mentors, and 10 good first bugs.  Of course it is possible to find more mentors, and it is possible to file more bugs.  In reality this doesn’t work well for most people and projects.

The real question to ask is what kind of growth are you looking for?  To answer this is different for many people.  What we find of value is slowly growing our long-term/low-touch contributors by giving them more responsibility (i.e. ownership) and really depending on them for input on projects.  There is also a need to grow mentors and mentors can be contributors as well!  Lastly it is great to have a larger pool of short-term contributors who have ramped up on a few projects and enjoy pitching in once in a while.

How can we foster a better environment for both mentors and contributors?  Here are a few key areas:

  • Have good documentation.
  • Set expectations up front and make it easy to understand what is expected and what next steps are.
  • Have great mentors (this might be the hardest part),
  • Focus more on what comes after good first bugs,
  • Get to know the people you work with.

Just focusing on the relationships and what comes after the good first bugs will go a long way in retaining new contributors and reducing the time spent helping out.

How we make it work in the A*Team:

The A*Team is not perfect.  We have few mentors and community is not built into the way we work.  Some of this is circumstantial, but a lot of it is within our control.  What do we do and what does and does not work for us.

Once a month we meet to discuss what is going on within the community on our team.  We have tackled topics such as project documentation, bootcamp tools/docs, discoverability, good next bugs, good first projects, and prioritizing our projects for encouraging new contributors.

While that sounds good, it is the work of a few people.  There is a lot of negative history of contributors fixing one bug and taking off.  Much frustration is expressed around helping someone with basic pull requests and patch management, over and over again.  While we can document stuff all day long, the reality is new contributors won’t read the docs and still ask questions.

The good news is in the last year we have seen a much larger impact of contributors to our respective projects.  Many great ideas were introduced, problems were solved, and experiments were conducted- all by the growing pool of contributors who associate themselves with the A*Team!

Recently, we discussed the most desirable attributes of contributors in trying to think about the problem in a different way.  It boiled down to a couple things, willingness to learn, and sticking around at least for the medium term.

Going forward we are working on growing our mentor pool, and focusing on key projects so the high-touch and timely learning curve only happens in areas where we can spread the love between domain experts and folks just getting started.

Keep an eye out for most posts in the coming week(s) outlining some new projects and opportunities to get involved.

Leave a comment

Filed under Community

Re-Triggering for a [root] cause – some notes from doing it

With all this talk of intermittent failures and folks coming up with ideas on how to solve them, I figured I should dive head first into looking at failures.  I have been working with a few folks on this, specifically :parkouss and :vaibhav1994.  This experiment (actually the second time doing so) is where I take a given intermittent failure bug and retrigger it.  If it reproduces, then I go back in history looking for where it becomes intermittent.  This weekend I wrote up some notes as I was trying to define what an intermittent is.

Lets outline the parameters first for this experiment:

  • All bugs marked with keyword ‘intermittent-failure’ qualify
  • Bugs must not be resolved or assigned to anybody
  • Bugs must have been filed in the last 28 days (we only keep 30 days of builds)
  • Bugs must have >=20 tbplrobot comments (arbitrarily picked, ideally we want something that can easily reproduce)

Here are what comes out of this:

  • 356 bugs are open, not assigned and have the intermittent-failure keyword
  • 25 bugs have >=20 comments meeting our criteria

The next step was to look at each of the 25 bugs and see if it makes sense to do this.  In fact 13 of the bugs I decided not to take action on (remember this is an experiment, so my reasoning for ignoring these 13 could be biased):

  • 5 bugs were thunderbird/mozmill only
  • 3 bugs looked to be related to android harness issues
  • bug 1157090 hadn’t reproduced in 2 weeks- was APZ feature which we turned off.
  • one bug was only on mozilla-beta only
  • 2 bugs had patches and 1 had a couple of comments indicating a developer was already looking at it

This leaves us with 12 bugs to investigate.  The premise here is easy, find the first occurrence of the intermittent (branch, platform, testjob, revision) and re-trigger it 20 times (picked for simplicity).  When the results are in, see if we have reproduced it.  In fact, only 5 bugs reproduced the exact error in the bug when re-triggered 20 times on a specific job that showed the error.

Moving on, I started re-triggering jobs back in the pushlog to see where it was introduced.  I started off with going back 2/4/6 revisions, but got more aggressive as I didn’t see patterns.  Here is a summary of what the 5 bugs turned out like:

  • Bug 1161915 – Windows XP PGO Reftest.  Found the root cause (pgo only, lots of pgo builds were required for this) 23 revisions back.
  • Bug 1160780 – OSX 10.6 mochitest-e10s-bc1.  Found the root cause 33 revisions back.
  • Bug 1161052 – Jetpack test failures.  So many failures in the same test file, it isn’t very clear if I am reproducing the failure or finding other ones.  :erikvold is working on fixing the test in bug 1163796 and ok’d disabling it if we want to.
  • Bug 1161537 – OSX 10.6 Mochitest-other.  Bisection didn’t find the root cause, but this is a new test case which was added.  This is a case where when the new test case was added it could have been run 100+ times successfully, then when it merged with other branches a couple hours later it failed!
  • bug 1155423 – Linux debug reftest-e10s-1.  This reproduced 75 revisions in the past, and due to that I looked at what changed in our buildbot-configs and mozharness scripts.  We actually turned on this test job (hadn’t been running e10s reftests on debug prior to this) and that caused the problem.  This can’t be tracked down by re-triggering jobs into the past.

In summary, out of 356 bugs 2 root causes were found by re-triggering.  In terms of time invested into this, I have put about 6 hours of time to fine the root cause of the 5 bugs.

2 Comments

Filed under testdev

intermittent oranges- missing a real definition

There are all kinds of great ideas folks have for fixing intermittent issues.  In fact each idea in and of itself is a worthwhile endeavor.   I have spent some time over the last couple of months fixing them, filing bugs on them, and really discussing them.  One question that remains- what is the definition of an intermittent.

I don’t plan to lay out a definition, instead I plan to ask some questions and lay out some parameters.  According to orange factor, there are 4640 failures in the last week (May 3 -> May 10) all within 514 unique bugs.  These are all failures that the sheriffs have done some kind of manual work on to star on treeherder.  I am not sure anybody can find a way to paint a pretty picture to make it appear we don’t have intermittent failures.

Looking at a few bugs, there are many reasons for intermittent failures:

  • core infrastructure (networking, power, large classes of machines (ec2), etc.)
  • machine specific (specific machine is failing a lot of jobs)
  • CI specific (buildbot issues, twisted issues, puppet issues, etc.)
  • CI Harness (usually mozharness)
  • Platforms (old platforms/tests we are phasing out, new platforms/tests we are phasing in)
  • Test harness (mochitest + libraries, common code for tests, reftest)
  • Test Cases (test cases actually causing failures, poorly written, new test cases, etc.)
  • Firefox Code (we actually have introduced conditions to cause real failures- just not every time)
  • Real regressions (failures which happen every time we run a test)

There are a lot of reasons, many of these have nothing to do with poor test cases or bad code in Firefox.  But many of these are showing up many times a day and as a developer who wants to fix a bad test, many are not really actionable.  Do we need to have some part of a definition to include something that is actionable?

Looking at the history of ‘intermittent-failure’ bugs in Bugzilla, many occur once and never occur again.  In fact this is the case for over half of the bugs filed (we file upwards of 100 new bugs/week).  While there are probably reasons for a given test case to fail, if it failed in August 2014 and has never failed again, is that test case intermittent?  As a developer could you really do anything about this given the fact that reproducing it is virtually impossible?

This is where I start to realize we need to find a way to identify real intermittent bugs/tests and not clutter the statistics with tests which are virtually impossible to reproduce.  Thinking back to what is actionable- I have found that while filing bugs for Talos regressions the closer the bug is filed to the original patch landing, the better the chance it will get fixed.  Adding to that point, we only keep 30 days of builds/test packages around for our CI automation.  I really think a definition of an intermittent needs to have some kind of concept of time.  Should we ignore intermittent failures which occur only once in 90 days?  Maybe ignore ones that don’t reproduce after 1000 iterations?  Some could argue that we look in a smaller or larger window of time/iterations.

Lastly, when looking into specific bugs, I find many times they are already fixed.  Many of the intermittent failures are actually fixed!  Do we track how many get fixed?  How many have patches and have debugging already taking place?  For example in the last 28 days, we have filed 417 intermittents, of which 55 are already resolved and of the remaining 362 only 25 have occurred >=20 times.  Of these 25 bugs, 4 already have patches.  It appears a lot of work is done to fix intermittent failures which are actionable.  Are the ones which are not being fixed not actionable?  Are they in a component where all the developers are busy and heads down?

In a perfect world a failure would never occur, all tests would be green, and all users would use Firefox.  In reality we have to deal with thousands of failures every week, most of which never happen again.  This quarter I would like to see many folks get involved in discussions and determine:

  • what is too infrequent to be intermittent? we can call this noise
  • what is the general threshold where something is intermittent?
  • what is the general threshold where we are too intermittent and need to backout a fix or disable a test?
  • what is a reasonable timeframe to track these failures such that we can make them actionable?

Thanks for reading, I look forward to hearing from many who have ideas on this subject.  Stay tuned for an upcoming blog post about re-trigging intermittent failures to find the root cause.

Leave a comment

Filed under testdev