Category Archives: testdev

Working towards a productive definition of “intermittent orange”

Intermittent Oranges (tests which fail sometimes and pass other times) are an ever-increasing problem with test automation at Mozilla.

While there are many common causes for failures (bad tests, the environment/infrastructure we run on, and bugs in the product), we still do not have a clear definition of what we view as intermittent.  Some common statements I have heard:

  • It’s obvious: if it failed last year, the test is intermittent
  • If it failed 3 years ago, I don’t care, but if it failed 2 months ago, the test is intermittent
  • I fixed the test to not be intermittent; I verified by retriggering the job 20 times on try server

These imply very different definitions of what is intermittent.  A useful definition will need to:

  • determine if we should take action on a test (programmatically or manually)
  • define policy that sheriffs and developers can use to guide work
  • guide developers to know when a new/fixed test is ready for production
  • provide useful data to release and Firefox product management about the quality of a release

Since I wanted a clear definition of what we are working with, I looked over 6 months (2016-04-01 to 2016-10-01) of OrangeFactor data (7330 bugs, 250,000 failures) to find patterns and trends.  I was surprised at how many bugs had <10 instances reported (3310 bugs, 45.1%).  Likewise, I was surprised that such a small number of bugs (1236) accounts for >80% of the failures.  It made sense to look at things daily, weekly, monthly, and every 6 weeks (our typical release cycle).  After much slicing and dicing, I have come up with 4 buckets:

  1. Random Orange: this test has failed, even multiple times in history, but in a given 6 week window we see <10 failures (45.2% of bugs)
  2. Low Frequency Orange: this test might fail up to 4 times in a given day, but typically <=1 failure per day; in a 6 week window we see <60 failures (26.4% of bugs)
  3. Intermittent Orange: fails up to 10 times/day or <120 times in 6 weeks (11.5% of bugs)
  4. High Frequency Orange: fails >10 times/day many times and is often seen in try pushes (16.9% of bugs, or 1236 bugs)
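
To make the buckets concrete, here is a minimal sketch (in Python, with a hypothetical function name) of how the failure counts behind an OrangeFactor bug could be mapped onto them.  The thresholds come from the definitions above; the ordering of the checks and the interpretation of “>10 times/day many times” are my assumptions, since the buckets overlap slightly.

    # Sketch only: classify a bug's failure counts into the buckets above.
    # The function name is hypothetical; thresholds come from the definitions
    # in this post, and the "many times" check for bucket 4 is an assumption.
    def classify_orange(failures_per_day):
        """failures_per_day: daily failure counts over a 6 week (42 day) window."""
        total = sum(failures_per_day)
        max_per_day = max(failures_per_day) if failures_per_day else 0
        days_over_10 = sum(1 for n in failures_per_day if n > 10)

        if days_over_10 >= 2 or total >= 120:
            return "high frequency orange"   # bucket 4: >10/day many times
        if max_per_day > 4 or total >= 60:
            return "intermittent orange"     # bucket 3: up to 10/day, <120 per 6 weeks
        if total >= 10:
            return "low frequency orange"    # bucket 2: <=4/day typical, <60 per 6 weeks
        return "random orange"               # bucket 1: <10 failures in the window

    # Example: a bug that failed a handful of times over 6 weeks
    print(classify_orange([0, 1, 0, 2, 0, 1] + [0] * 36))  # -> random orange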

Alternatively, we could simplify our definitions and use:

  • low priority or not actionable (buckets 1 + 2)
  • high priority or actionable (buckets 3 + 4)

Does defining these buckets by the number of failures in a given time window help us with what we are trying to solve with the definition?

  • Determine if we should take action on a test (programmatically or manually):
    • ideally buckets 1/2 can be detected programmatically with autostar and removed from our view, possibly rerunning to validate it isn’t a new failure
    • buckets 3/4 have the best chance of reproducing; we can run in debuggers (like ‘rr’), or triage to the appropriate developer when we have enough information
  • Define policy that sheriffs and developers can use to guide work
    • sheriffs can know when to file bugs (either bucket 2 or 3 as a starting point)
    • developers understand the severity based on the bucket.  We will still need a lot of context, but understanding severity is important.
  • Guide developers to know when a new/fixed test is ready for production
    • If we fix a test, we want to ensure it is stable before we make it tier-1.  A developer can do the math assuming ~300 commits/day and ensure the test passes across them (see the worked example after this list).
    • NOTE: SETA and coalescing ensure we don’t run every test on every push, so a test more likely sees ~100 runs/day
  • Provide useful data to release and Firefox product management about the quality of a release
    • Release Management can take the OrangeFactor into account
    • new features might be required to have a certain volume of their tests at or below the Random Orange level
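
Here is the worked arithmetic referenced above: assuming ~100 test runs/day (per the SETA/coalescing note) over a 6 week window, the bucket thresholds translate into small tolerated failure rates.  The numbers are illustrative only, not an official bar for tier-1.

    # Illustrative arithmetic: implied failure rates for each bucket threshold,
    # assuming ~100 runs/day (see the SETA/coalescing note above) for 6 weeks.
    runs_per_day = 100
    window_days = 42
    total_runs = runs_per_day * window_days   # ~4200 runs in 6 weeks

    thresholds = [
        ("random orange", 10),         # <10 failures per 6 weeks
        ("low frequency orange", 60),  # <60 failures per 6 weeks
        ("intermittent orange", 120),  # <120 failures per 6 weeks
    ]

    for bucket, max_failures in thresholds:
        rate = 100.0 * max_failures / total_runs
        print("{}: stay under a {:.2f}% failure rate".format(bucket, rate))
    # random orange: stay under a 0.24% failure rate
    # low frequency orange: stay under a 1.43% failure rate
    # intermittent orange: stay under a 2.86% failure rate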

One other way to look at this is what gets put in bugs by the War on Orange bugzilla robot.  There are simple rules:

  • 15+ times/day – post a daily summary (bucket #4)
  • 5+ times/week – post a weekly summary (bucket #3/4 – about 40% of bucket 2 will show up here)
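
Expressed as code, those posting rules might look like the following sketch (a hypothetical helper; this is not the actual War on Orange robot implementation):

    # Sketch of the posting rules above (not the actual robot code).
    def summaries_to_post(failures_today, failures_this_week):
        """Return which summaries, if any, the robot would post for a bug."""
        posts = []
        if failures_today >= 15:       # 15+ times/day -> daily summary (bucket 4)
            posts.append("daily summary")
        if failures_this_week >= 5:    # 5+ times/week -> weekly summary (buckets 3/4)
            posts.append("weekly summary")
        return posts

    print(summaries_to_post(20, 60))  # -> ['daily summary', 'weekly summary']
    print(summaries_to_post(1, 6))    # -> ['weekly summary']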

Lastly, I would like to cover some exceptions and how some might see this as flawed:

  • missing or incorrect data in orange factor (human error)
  • some issues have many bugs but a single root cause; we could miscategorize a fixable issue

I do not believe adjusting the definition will fix the above issues; different tools or methods of running the tests might reduce those concerns.


Filed under general, testdev, Uncategorized

Please welcome the Dashboard Hacker team

A few weeks ago we announced that we would be looking for committed contributors for 8+ weeks on Perfherder.  We have found a few great individuals, all of whom show a lot of potential and make a great team.

They are becoming familiar with the code base and are already making a dent in the initial list of work set aside.  Let me introduce them (alphabetical order by nicks):

akhileshpillai – Akhilesh has jumped right in and started fixing bugs in Perfherder.  He is new to Mozilla and will fit right in.  With how fast he has come up to speed, we are looking forward to what he will be delivering in the coming weeks.  We have a lot of UI workflow as well as backend data refactoring work on our list, all of which he will be a great contributor towards.

mikeling – mikeling has been around Mozilla for roughly two years; recently he started helping with a few projects on the A*Team.  He is very detail oriented, easy to work with, and willing to tackle big things.

theterabyte – Tushar is excited about this program as an opportunity to grow his skills as a Python developer and to experience how software is built outside of a classroom.  Tushar will get a chance to grow his UI skills on Perfherder by making the graphs and compare view more polished and complete, while helping out with an interface for the alerts.

Perfherder will soon become the primary dashboard for all our performance needs.

I am looking forward to the ideas and solutions these new team members bring to the table.  Please join me in welcoming them!


Filed under Community, testdev

Please join me in welcoming the DX Team

A few weeks ago, I posted a call out for people to reach out and commit to participate for 8+ weeks.  There were two projects, and one of them was Developer Experience.  Since then we have had some great interest; there are 5 awesome contributors participating (sorted by IRC nicks).

BYK – I met BYK 3+ years ago on IRC; he is a great person and very ambitious.  As a more senior developer he will be focused primarily on improving interactions with mach.  While there are a lot of little things to make mach better, BYK proposed a system to collect information about how mach is used.

gma_fav – I met gma_fav on IRC when she heard about the program.  She has a lot of energy, seems very detail oriented, asks good questions, and brings fresh ideas to the team!  She is a graduate of the Ascend project and is looking to continue her growth in development and open source.  Her primary focus will be on the interface to try server (think the try chooser page, extension, and taking other experiments further).

kaustabh93 – I met Kaustabh on IRC about a year ago and since then he has been a consistent friend and hacker.  He attends university.  In fact I do owe him credit for large portions of alert manager.  While working on this team, he will be focused on making run-by-dir a reality.  There are two parts: getting the tests to run green, and reducing the overhead of the harness.

sehgalvibhor – I met Vibhor on IRC about 2 weeks ago.  He was excited about the possibility of working on this project and jumped right in.  Like Kaustabh, he is a student who is just finishing up exams this week.  His primary focus this summer will be working in a similar role to Stanley, making our test harnesses behave consistently and be more useful.

stanley – When this program was announced Stanley was the first person to ping me on IRC.  I have found him to be very organized, a pleasure to chat with and he understands code quite well.  Coding and open source are both new things to Stanley and we have the opportunity to give him a great view of it.  Stanley will be focusing on making the commands we have for running tests via mach easier to use and more unified between harnesses.

Personally I am looking forward to seeing the ambition folks have translate into great solutions, learning more about each person, and sharing with Mozilla as a whole the great work they are doing.

Take a few moments to say hi to them online.


Filed under Community, testdev

The orange factor – no need to retrigger this week

Last week I did another round of retriggering to find root causes, and I found some!  This week I got an email from OrangeFactor outlining the top 10 failures on the trees (as it does every week).

Unfortunately, as of this morning there is no work for me to do; maybe next week I can hunt.

Here is the breakdown of bugs:

  • Bug 1081925 Intermittent browser_popup_blocker.js
    • investigated last week, test is disabled by a sheriff
  • Bug 1118277 Intermittent browser_popup_blocker.js
    • investigated last week, test is disabled by a sheriff
  • Bug 1096302 Intermittent test_collapse.html
    • test is fixed!  already landed
  • Bug 1121145 Intermittent browser_panel_toggle.js
    • too old!  problem got worse on April 24th
  • Bug 1157948 DMError: Non-zero return code for command
    • too old!  most likely a harness/infra issue
  • Bug 1166041 Intermittent LeakSanitizer
    • patch is already on this bug
  • Bug 1165938 Intermittent media-source
    • disabled the test already!
  • Bug 1149955 Intermittent Win8-PGO test_shared_all.py
    • too old!
  • Bug 1160008 Intermittent testVideoDiscovery
    • too old!
  • Bug 1137757 Intermittent Linux debug mochitest-dt1 command timed out
    • harness/infra issue; the test chunk is taking too long, and the problem is being addressed with more chunks

As you can see there isn’t much to do here.  Maybe next week we will have some actions we can take.  Once I have about 10 bugs investigated I will summarize the bugs, related dates, and status, etc.


Filed under testdev

A-Team contribution opportunity – Dashboard Hacker

I am excited to announce a new focused project for contribution – Dashboard Hacker.  Last week we gave a preview that today we would be announcing 2 contribution projects.  This is an unpaid program where we are looking for 1-2 contributors who will dedicate between 5-10 hours/week for at least 8 weeks.  More time is welcome, but not required.

What is a dashboard hacker?

When a developer is ready to land code, they want to test it.  Getting and understanding the results is made a lot easier by good dashboards and tools.  For this project, we have a starting point with our performance data view: fixing up a series of nice-to-have polish features and then ensuring that it is easy to use in a normal developer workflow.  Part of the developer workflow is the regular job view; if time permits, there are some fun experiments we would like to implement in the job view.  These bugs, features, and projects are all small and self-contained, which makes them great projects for someone looking to contribute.

What is required of you to participate?

  • A willingness to learn and ask questions
  • A general knowledge of programming (most of this will be in JavaScript, Django, and AngularJS, with some work in Python)
  • A promise to show up regularly and take ownership of the issues you are working on
  • Good at solving problems and thinking out of the box
  • Comfortable with (or willing to try) working with a variety of people

What we will guarantee from our end:

  • A dedicated mentor for the project whom you will work with regularly throughout the project
  • A single area of work to reduce the need to get up to speed over and over again.
    • This project will cover many tools, but the general problem space will be the same
  • The opportunity to work with many people (different bugs could have a specific mentor) while retaining a single mentor to guide you through the process
  • The ability to be part of the team: you will be welcome in meetings, and we will value your input on solving problems, brainstorming, and figuring out new problems to tackle.

How do you apply?

Get in touch with us either by replying to the post, commenting in the bug, or just contacting us on IRC (I am :jmaher in #ateam on irc.mozilla.org; wlach on IRC will be the primary mentor).  We will point you at a starter bug and introduce you to the bugs and problems to solve.  If you have prior work (links to bugzilla, github, blogs, etc.) that would help us learn more about you, that would be a plus.

How will you select the candidates?

There are no strict criteria here.  One factor will be whether you can meet the requirements outlined above and how well you do at picking up the problem space.  Ultimately it will be up to the mentor (for this project, it will be :wlach).  If you do apply and we already have a candidate picked, or we don’t choose you for other reasons, we do plan to repeat this every few months.

Looking forward to building great things!


Filed under Community, testdev

A-Team contribution opportunity – DX (Developer Ergonomics)

I am excited to announce a new focused project for contribution – Developer Ergonomics/Experience, otherwise known as DX.  Last week we gave a preview that today we would be announcing 2 contribution projects.  This is an unpaid program where we are looking for 1-2 contributors who will dedicate between 5-10 hours/week for at least 8 weeks.  More time is welcome, but not required.

What does DX mean?

We chose this project as we continue to experience frustration while fixing bugs and debugging test failures.  Many people suggest great ideas; in this case we have set aside a few of them (look at the dependent bugs: cleaning up argument parsers, helping our tests run in smarter chunks, making it easier to run tests locally or on a server, etc.) which would clean things up and be harder than a good first bug, yet each issue by itself would be too easy for an internship.  Our goal is to clean up our test harnesses and tools and, if time permits, add things to the workflow which make it easier for developers to do their job!

What is required of you to participate?

  • A willingness to learn and ask questions
  • A general knowledge of programming (this will be mostly in Python, with some JavaScript as well)
  • A promise to show up regularly and take ownership of the issues you are working on
  • Good at solving problems and thinking out of the box
  • Comfortable with (or willing to try) working with a variety of people

What we will guarantee from our end:

  • A dedicated mentor for the project whom you will work with regularly throughout the project
  • A single area of work to reduce the need to get up to speed over and over again.
    • This project will cover many tools, but the general problem space will be the same
  • The opportunity to work with many people (different bugs could have a specific mentor) while retaining a single mentor to guide you through the process
  • The ability to be part of the team: you will be welcome in meetings, and we will value your input on solving problems, brainstorming, and figuring out new problems to tackle.

How do you apply?

Get in touch with us either by replying to the post, commenting in the bug, or just contacting us on IRC (I am :jmaher in #ateam on irc.mozilla.org).  We will point you at a starter bug and introduce you to the bugs and problems to solve.  If you have prior work (links to bugzilla, github, blogs, etc.) that would help us learn more about you, that would be a plus.

How will you select the candidates?

There are no strict criteria here.  One factor will be whether you can meet the requirements outlined above and how well you do at picking up the problem space.  Ultimately it will be up to the mentor (for this project, it will be me).  If you do apply and we already have a candidate picked, or we don’t choose you for other reasons, we do plan to repeat this every few months.

Looking forward to building great things!


Filed under Community, testdev

Watching the watcher – Some data on the Talos alerts we generate

What are the performance regressions at Mozilla, who monitors them, and what kind of regressions do we see?  I want to answer these questions with a few peeks at the data.  I have written plenty of previous blog posts outlining stats, trends, and the process.  Let’s briefly recap what we do, then look at the breakdown of alerts (not necessarily bugs).

When Talos uploads numbers to graph server, they get stored and eventually run through a calculation loop to find regressions and improvements.  As of Jan 1, 2015, we post these to mozilla.dev.tree-alerts as well as emailing the offending patch author (if they can easily be identified).  There are a couple of folks (performance sheriffs) who look at the alerts and triage them.  If necessary, a bug is filed for further investigation.  Reading this brief recap of what happens to our performance numbers probably doesn’t inspire folks; what is interesting is looking at the actual data we have.
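
The post does not describe the calculation loop itself; as a rough illustration of the idea, here is a minimal before/after window comparison that could flag a possible regression in a series of performance numbers.  This is an assumption for illustration only, not the actual graph server algorithm or its thresholds.

    # Illustrative only: a simple before/after window comparison for flagging
    # a possible regression. NOT the actual Talos/graph server calculation loop.
    from statistics import mean

    def flag_regression(values, window=12, threshold_pct=2.0, lower_is_better=True):
        """Compare the mean of the last `window` points against the previous
        `window` points; report a regression if the change exceeds the threshold."""
        if len(values) < 2 * window:
            return None  # not enough data points to compare
        before = mean(values[-2 * window:-window])
        after = mean(values[-window:])
        pct_change = (after - before) / before * 100
        if lower_is_better:
            regressed = pct_change > threshold_pct
        else:
            regressed = pct_change < -threshold_pct
        return {"pct_change": pct_change, "regression": regressed}

    # Example: a test where lower is better and the last 12 runs got ~5% slower
    print(flag_regression([100.0] * 12 + [105.0] * 12))  # {'pct_change': 5.0, 'regression': True}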

Let’s start with some basic facts about alerts in the last 12 months:

  • We have collected 8232 alerts!
  • 4213 of those alerts are regressions (the rest are improvements)
  • 3780 of those above alerts have a manually marked status
    • the rest have been programmatically marked as merged and associated with the original alert
  • 278 bugs have been filed (or 17 alerts/bug)
    • 89 fixed!
    • 61 open!
    • 128 (5 invalid, 8 duplicate, 115 wontfix/worksforme)

As you can see, this is not a casual hobby; it is a real system helping us fix and understand hundreds of performance issues.

We generate alerts on a variety of branches; here is the breakdown of alerts per branch:

[Chart: number of regression alerts we have received per branch]

There are a few things to keep in mind here: mobile/mozilla-central/Firefox are the same branch, and the non-PGO numbers cover only Linux/Windows/Android, not OS X.

Looking at that graph is sort of uninspiring; most of the alerts land on fx-team and mozilla-inbound, then show up on the other branches as we merge code.  We run more tests/platforms and land/back out changes more frequently on mozilla-inbound and fx-team, which is why we have a larger number of alerts there.

Given that we have so many alerts and have manually triaged them, what state do the alerts end up in?

[Chart: current state of alerts]

The interesting data point here is that 43% of our alerts are duplicates.  A few reasons for this:

  • we see an alert on non-pgo, then on pgo (we usually mark the pgo ones as duplicates)
  • we see an alert on mozilla-inbound, then the same alert shows up on fx-team, b2g-inbound, and firefox (due to merging)
    • and then later we see the pgo versions on the merged branches
  • sometimes we retrigger or backfill to find the root cause, which often generates a new alert
  • in a few cases we have landed/backed out/landed a patch and we end up with duplicate sets of alerts

The last piece of information that I would like to share is the breakdown of alerts per test:

[Chart: number of alerts per test (some excluded)]

There are a few outliers, but we need to keep in mind that active work was being done in certain areas, which would explain a lot of alerts for a given test.  There are 35 different test types, which wouldn’t all fit well in an image, so I have excluded retired tests, counters, startup tests, and Android tests.

Personally, I am looking forward to the next year as we transition some tools and do some hacking on the reporting, alert generation and overall process.  Thanks for reading!


Filed under testdev