Tag Archives: automation

Working towards a productive definition of “intermittent orange”

Intermittent Oranges (tests which fail sometimes and pass other times) are an ever-increasing problem with test automation at Mozilla.

While there are many common causes for failures (bad tests, the environment/infrastructure we run on, and bugs in the product),
we still do not have a clear definition of what we view as intermittent.  Some common statements I have heard:

  • It’s obvious, if it failed last year, the test is intermittent
  • If it failed 3 years ago, I don’t care, but if it failed 2 months ago, the test is intermittent
  • I fixed the test to not be intermittent, I verified by retriggering the job 20 times on try server

These imply very different definitions of what is intermittent.  A useful definition will need to:

  • determine if we should take action on a test (programmatically or manually)
  • define policy that sheriffs and developers can use to guide work
  • guide developers to know when a new/fixed test is ready for production
  • provide useful data to release and Firefox product management about the quality of a release

Since I wanted a clear definition of what we are working with, I looked over 6 months (2016-04-01 to 2016-10-01) of OrangeFactor data (7330 bugs, 250,000 failures) to find patterns and trends.  I was surprised at how many bugs had <10 instances reported (3310 bugs, 45.1%).  Likewise, I was surprised that such a small number of bugs (1236) accounts for >80% of the failures.  It made sense to look at things daily, weekly, monthly, and every 6 weeks (our typical release cycle).  After much slicing and dicing, I have come up with 4 buckets (a quick classification sketch follows the list):

  1. Random Orange: this test has failed, even multiple times in history, but in a given 6-week window we see <10 failures (45.2% of bugs)
  2. Low Frequency Orange: this test might fail up to 4 times in a given day, typically <=1 failure per day; in a 6-week window we see <60 failures (26.4% of bugs)
  3. Intermittent Orange: fails up to 10 times/day or <120 times in 6 weeks.  (11.5% of bugs)
  4. High Frequency Orange: fails >10 times/day on many days and is often seen in try pushes.  (16.9% of bugs, or 1236 bugs)
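
To make the buckets concrete, here is a minimal sketch of how a bug could be classified from its failure counts.  This is my own illustration (the function name and inputs are not from any existing tool); the thresholds simply mirror the buckets above.

```python
# Hypothetical classifier mirroring the buckets above; the thresholds come
# from this post, the function and argument names are my own.
def classify_orange(failures_per_day):
    """failures_per_day: list of daily failure counts over a 6-week window."""
    total = sum(failures_per_day)
    max_per_day = max(failures_per_day) if failures_per_day else 0

    if total < 10:
        return "random orange"            # bucket 1: <10 failures in 6 weeks
    if max_per_day <= 4 and total < 60:
        return "low frequency orange"     # bucket 2: <=4/day and <60 in 6 weeks
    if max_per_day <= 10 and total < 120:
        return "intermittent orange"      # bucket 3: <=10/day and <120 in 6 weeks
    return "high frequency orange"        # bucket 4: >10/day or >=120 in 6 weeks
```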

Alternatively, we could simplify our definitions and use:

  • low priority or not actionable (buckets 1 + 2)
  • high priority or actionable (buckets 3 + 4)

Does defining these buckets by the number of failures in a given time window help us with what we are trying to solve with the definition?

  • Determine if we should take action on a test (programmatically or manually):
    • ideally buckets 1/2 can be detected programmatically with autostar and removed from our view, possibly rerunning to validate it isn’t a new failure
    • buckets 3/4 have the best chance of reproducing; we can run them in debuggers (like ‘rr’), or triage them to the appropriate developer when we have enough information
  • Define policy that sheriffs and developers can use to guide work
    • sheriffs can know when to file bugs (either bucket 2 or 3 as a starting point)
    • developers understand the severity based on the bucket.  Ideally we would have a lot more context, but understanding severity is important.
  • Guide developers to know when a new/fixed test is ready for production
    • If we fix a test, we want to ensure it is stable before we make it tier-1.  A developer can use the math of 300 commits/day to ensure we pass consistently (see the sketch after this list).
    • NOTE: SETA and coalescing ensure we don’t run every test for every push, so more likely we see ~100 test runs/day
  • Provide useful data to release and Firefox product management about the quality of a release
    • Release Management can take the OrangeFactor into account
    • new features might be required to have a certain volume of tests at or below the Random Orange level
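
As a rough illustration of the stability math above (my own back-of-the-envelope sketch, not an official policy): the “rule of three” says that after N consecutive failure-free runs, the ~95% upper bound on the failure rate is roughly 3/N.  Combined with the ~100 coalesced test runs/day figure, that gives a feel for how long a fixed test needs to stay green:

```python
# Back-of-the-envelope sketch (my own assumption, not an official policy):
# after N failure-free runs, the ~95% upper bound on the failure rate is ~3/N
# ("rule of three"), so to trust a rate below `max_failure_rate` we want
# roughly 3 / max_failure_rate consecutive green runs.
def green_runs_needed(max_failure_rate):
    return int(round(3 / max_failure_rate))

# e.g. to believe a test fails less than once per 300 pushes, we would want
# on the order of 900 green runs; at ~100 coalesced runs/day that is ~9 days.
for rate in (1 / 100, 1 / 300):
    n = green_runs_needed(rate)
    print(f"failure rate < {rate:.4f}: ~{n} green runs (~{n / 100:.0f} days at 100 runs/day)")
```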

One other way to look at this is what gets put into bugs (by the War on Orange Bugzilla robot).  There are simple rules (a sketch follows the list):

  • 15+ times/day – post a daily summary (bucket #4)
  • 5+ times/week – post a weekly summary (bucket #3/4 – about 40% of bucket 2 will show up here)
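
A minimal sketch of those posting rules as I read them (the function name and inputs are my own, not the robot’s actual code):

```python
# Sketch of the War on Orange robot's posting rules as described above;
# this is my own illustration, not the robot's actual implementation.
def summary_to_post(failures_today, failures_this_week):
    if failures_today >= 15:
        return "daily summary"    # roughly bucket 4
    if failures_this_week >= 5:
        return "weekly summary"   # buckets 3/4, plus ~40% of bucket 2
    return None
```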

Lastly, I would like to cover some exceptions and how some might see this as flawed:

  • missing or incorrect data in orange factor (human error)
  • some issues have many bugs but a single root cause, so we could miscategorize a fixable issue

I do not believe adjusting the definition will fix the above issues; different tools or methods of running the tests might reduce those concerns.


the orange factor – no need to retrigger this week

Last week I did another round of re-triggering for a root cause and found some root causes!  This week I got an email from OrangeFactor outlining the top 10 failures on the trees (as we do every week).

Unfortunately, as of this morning there is no work for me to do; maybe next week I can hunt.

Here is the breakdown of bugs:

  • Bug 1081925 Intermittent browser_popup_blocker.js
    • investigated last week, test is disabled by a sheriff
  • Bug 1118277 Intermittent browser_popup_blocker.js
    • investigated last week, test is disabled by a sheriff
  • Bug 1096302 Intermittent test_collapse.html
    • test is fixed!  already landed
  • Bug 1121145 Intermittent browser_panel_toggle.js
    • too old!  problem got worse on April 24th
  • Bug 1157948 DMError: Non-zero return code for command
    • too old!  most likely a harness/infra issue
  • Bug 1166041 Intermittent LeakSanitizer
    • patch is already on this bug
  • Bug 1165938 Intermittent media-source
    • disabled the test already!
  • Bug 1149955 Intermittent Win8-PGO test_shared_all.py
    • too old!
  • Bug 1160008 Intermittent testVideoDiscovery
    • too old!
  • Bug 1137757 Intermittent Linux debug mochitest-dt1 command timed out
    • harness/infra issue; the test chunk is taking too long, and the problem is being addressed with more chunks.

As you can see there isn’t much to do here.  Maybe next week we will have some actions we can take.  Once I have about 10 bugs investigated I will summarize the bugs, related dates, and status, etc.


A-Team contribution opportunity – Dashboard Hacker

I am excited to announce a new focused project for contribution – Dashboard Hacker.  Last week we gave a preview that today we would be announcing 2 contribution projects.  This is an unpaid program where we are looking for 1-2 contributors who will dedicate between 5-10 hours/week for at least 8 weeks.  More time is welcome, but not required.

What is a dashboard hacker?

When a developer is ready to land code, they want to test it. Getting the results and understanding them is made a lot easier by good dashboards and tools. For this project, we have a starting point with our performance data view: fix up a series of nice-to-have polish features and then ensure that it is easy to use within a normal developer workflow. Part of the developer workflow is the regular job view; if time permits, there are some fun experiments we would like to implement in the job view.  These bugs, features, and projects are all smaller and self-contained, which makes them great projects for someone looking to contribute.

What is required of you to participate?

  • A willingness to learn and ask questions
  • A general knowledge of programming (most of this will be in JavaScript, Django, and AngularJS, and some work will be in Python)
  • A promise to show up regularly and take ownership of the issues you are working on
  • Good at solving problems and thinking out of the box
  • Comfortable with (or willing to try) working with a variety of people

What we will guarantee from our end:

  • A dedicated mentor for the project whom you will work with regularly throughout the project
  • A single area of work to reduce the need to get up to speed over and over again.
    • This project will cover many tools, but the general problem space will be the same
  • The opportunity to work with many people (different bugs could have a specific mentor) while retaining a single mentor to guide you through the process
  • The ability to be part of the team: you will be welcome in meetings, and we will value your input on solving problems, brainstorming, and figuring out new problems to tackle.

How do you apply?

Get in touch with us either by replying to the post, commenting in the bug, or just contacting us on IRC (I am :jmaher in #ateam on irc.mozilla.org; :wlach on IRC will be the primary mentor).  We will point you at a starter bug and introduce you to the bugs and problems to solve.  If you have prior work (links to Bugzilla, GitHub, blogs, etc.) that would help us learn more about you, that would be a plus.

How will you select the candidates?

There are no hard criteria here.  One factor will be whether you can meet the requirements outlined above and how well you do at picking up the problem space.  Ultimately it will be up to the mentor (for this project, it will be :wlach).  If you do apply and we have already picked a candidate, or don’t choose you for other reasons, we do plan to repeat this every few months.

Looking forward to building great things!


A-Team contribution opportunity – DX (Developer Ergonomics)

I am excited to announce a new focused project for contribution – Developer Ergonomics/Experience, otherwise known as DX.  Last week we gave a preview that today we would be announcing 2 contribution projects.  This is an unpaid program where we are looking for 1-2 contributors who will dedicate between 5-10 hours/week for at least 8 weeks.  More time is welcome, but not required.

What does DX mean?

We chose this project as we continue to experience frustration while fixing bugs and debugging test failures.  Many people suggest great ideas; in this case we have set aside a few (look at the dependent bugs: clean up argument parsers, help our tests run in smarter chunks, make it easier to run tests locally or on a server, etc.) which would clean things up and are harder than a good first bug, yet each issue by itself would be too easy for an internship.  Our goal is to clean up our test harnesses and tools and, if time permits, add things to the workflow which make it easier for developers to do their jobs!

What is required of you to participate?

  • A willingness to learn and ask questions
  • A general knowledge of programming (this will be mostly in Python, with some JavaScript as well)
  • A promise to show up regularly and take ownership of the issues you are working on
  • Good at solving problems and thinking out of the box
  • Comfortable with (or willing to try) working with a variety of people

What we will guarantee from our end:

  • A dedicated mentor for the project whom you will work with regularly throughout the project
  • A single area of work to reduce the need to get up to speed over and over again.
    • This project will cover many tools, but the general problem space will be the same
  • The opportunity to work with many people (different bugs could have a specific mentor) while retaining a single mentor to guide you through the process
  • The ability to be part of the team: you will be welcome in meetings, and we will value your input on solving problems, brainstorming, and figuring out new problems to tackle.

How do you apply?

Get in touch with us either by replying to the post, commenting in the bug, or just contacting us on IRC (I am :jmaher in #ateam on irc.mozilla.org).  We will point you at a starter bug and introduce you to the bugs and problems to solve.  If you have prior work (links to Bugzilla, GitHub, blogs, etc.) that would help us learn more about you, that would be a plus.

How will you select the candidates?

There are no hard criteria here.  One factor will be whether you can meet the requirements outlined above and how well you do at picking up the problem space.  Ultimately it will be up to the mentor (for this project, it will be me).  If you do apply and we have already picked a candidate, or don’t choose you for other reasons, we do plan to repeat this every few months.

Looking forward to building great things!


Re-Triggering for a [root] cause – some notes from doing it

With all this talk of intermittent failures and folks coming up with ideas on how to solve them, I figured I should dive head first into looking at failures.  I have been working with a few folks on this, specifically :parkouss and :vaibhav1994.  This experiment (actually the second time doing so) is where I take a given intermittent failure bug and retrigger it.  If it reproduces, then I go back in history looking for where it becomes intermittent.  This weekend I wrote up some notes as I was trying to define what an intermittent is.

Let’s outline the parameters for this experiment first:

  • All bugs marked with keyword ‘intermittent-failure’ qualify
  • Bugs must not be resolved or assigned to anybody
  • Bugs must have been filed in the last 28 days (we only keep 30 days of builds)
  • Bugs must have >=20 tbplrobot comments (arbitrarily picked; ideally we want something that can easily reproduce).  A query sketch follows this list.
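
Here is a rough sketch of how one might pull candidate bugs via the Bugzilla REST API.  It is my own illustration of the criteria above, not the exact query used; in particular, the robot account name and the unassigned check are assumptions.

```python
# Sketch: query Bugzilla for candidate intermittent-failure bugs matching the
# criteria above. The robot account and the "nobody@" unassigned convention
# are assumptions, not verified details of the actual workflow.
import datetime
import requests

BUGZILLA = "https://bugzilla.mozilla.org/rest"
ROBOT_ACCOUNT = "tbplbot@gmail.com"   # assumed account behind the tbplrobot comments

def candidate_bugs(days=28, min_robot_comments=20):
    since = (datetime.datetime.utcnow() - datetime.timedelta(days=days)).strftime("%Y-%m-%d")
    params = {
        "keywords": "intermittent-failure",
        "resolution": "---",              # not resolved
        "creation_time": since,           # filed in the last `days` days
        "include_fields": "id,summary,assigned_to",
    }
    for bug in requests.get(f"{BUGZILLA}/bug", params=params).json()["bugs"]:
        if not bug["assigned_to"].startswith("nobody@"):
            continue                      # skip bugs that already have an assignee
        data = requests.get(f"{BUGZILLA}/bug/{bug['id']}/comment").json()
        comments = data["bugs"][str(bug["id"])]["comments"]
        robot_comments = [c for c in comments if c["creator"] == ROBOT_ACCOUNT]
        if len(robot_comments) >= min_robot_comments:
            yield bug
```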

Here is what comes out of this:

  • 356 bugs are open, not assigned and have the intermittent-failure keyword
  • 25 bugs have >=20 comments meeting our criteria

The next step was to look at each of the 25 bugs and see if it makes sense to do this.  In fact, I decided not to take action on 13 of the bugs (remember, this is an experiment, so my reasoning for ignoring these 13 could be biased):

  • 5 bugs were thunderbird/mozmill only
  • 3 bugs looked to be related to android harness issues
  • bug 1157090 hadn’t reproduced in 2 weeks; it was an APZ feature which we turned off
  • one bug was on mozilla-beta only
  • 2 bugs had patches and 1 had a couple of comments indicating a developer was already looking at it

This leaves us with 12 bugs to investigate.  The premise here is easy: find the first occurrence of the intermittent (branch, platform, test job, revision) and re-trigger it 20 times (picked for simplicity).  When the results are in, see if we have reproduced it.  In fact, only 5 bugs reproduced the exact error in the bug when re-triggered 20 times on a specific job that showed the error.

Moving on, I started re-triggering jobs back in the pushlog to see where the failure was introduced.  I started off going back 2/4/6 revisions, but got more aggressive as I didn’t see patterns.  Here is a summary of how the 5 bugs turned out:

  • Bug 1161915 – Windows XP PGO Reftest.  Found the root cause (pgo only, lots of pgo builds were required for this) 23 revisions back.
  • Bug 1160780 – OSX 10.6 mochitest-e10s-bc1.  Found the root cause 33 revisions back.
  • Bug 1161052 – Jetpack test failures.  So many failures in the same test file, it isn’t very clear if I am reproducing the failure or finding other ones.  :erikvold is working on fixing the test in bug 1163796 and ok’d disabling it if we want to.
  • Bug 1161537 – OSX 10.6 Mochitest-other.  Bisection didn’t find the root cause, but this is a new test case which was added.  This is a case where the new test case could have been run 100+ times successfully when it was added, then failed when it merged with other branches a couple of hours later!
  • bug 1155423 – Linux debug reftest-e10s-1.  This reproduced 75 revisions in the past, and due to that I looked at what changed in our buildbot-configs and mozharness scripts.  We actually turned on this test job (hadn’t been running e10s reftests on debug prior to this) and that caused the problem.  This can’t be tracked down by re-triggering jobs into the past.
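
The bisection loop itself is straightforward.  Here is a rough sketch of the process described above; all of the helpers (previous_revision, retrigger_job, failure_reproduced) are hypothetical, since in practice this was done by hand with existing retrigger tooling:

```python
# Rough sketch of re-trigger bisection: walk back through the pushlog,
# retriggering the failing job N times per candidate revision, until the
# failure stops reproducing. Helper functions are hypothetical placeholders.
def find_root_cause(branch, platform, job, bad_revision, retriggers=20, max_back=75):
    last_bad = bad_revision
    steps_back = 0
    while steps_back < max_back:
        # start with 2/4/6 revisions back, then get more aggressive
        steps_back = steps_back + 2 if steps_back < 6 else steps_back * 2
        candidate = previous_revision(branch, bad_revision, steps_back)
        results = [retrigger_job(branch, platform, job, candidate)
                   for _ in range(retriggers)]
        if any(failure_reproduced(r) for r in results):
            last_bad = candidate          # still failing this far back, keep going
        else:
            # no longer reproduces: the culprit landed between `candidate` (good)
            # and `last_bad` (bad); narrow that range with the same approach
            return (candidate, last_bad)
    return (None, last_bad)               # gave up; likely an infra/config change
```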

In summary, out of 356 bugs, 2 root causes were found by re-triggering.  In terms of time invested, I have put in about 6 hours to find the root causes of the 5 bugs.


intermittent oranges – missing a real definition

There are all kinds of great ideas folks have for fixing intermittent issues.  In fact, each idea in and of itself is a worthwhile endeavor.  I have spent some time over the last couple of months fixing them, filing bugs on them, and really discussing them.  One question remains: what is the definition of an intermittent?

I don’t plan to lay out a definition, instead I plan to ask some questions and lay out some parameters.  According to orange factor, there are 4640 failures in the last week (May 3 -> May 10) all within 514 unique bugs.  These are all failures that the sheriffs have done some kind of manual work on to star on treeherder.  I am not sure anybody can find a way to paint a pretty picture to make it appear we don’t have intermittent failures.

Looking at a few bugs, there are many reasons for intermittent failures:

  • core infrastructure (networking, power, large classes of machines (ec2), etc.)
  • machine specific (specific machine is failing a lot of jobs)
  • CI specific (buildbot issues, twisted issues, puppet issues, etc.)
  • CI Harness (usually mozharness)
  • Platforms (old platforms/tests we are phasing out, new platforms/tests we are phasing in)
  • Test harness (mochitest + libraries, common code for tests, reftest)
  • Test Cases (test cases actually causing failures, poorly written, new test cases, etc.)
  • Firefox Code (we actually have introduced conditions to cause real failures- just not every time)
  • Real regressions (failures which happen every time we run a test)

There are a lot of reasons, and many of these have nothing to do with poor test cases or bad code in Firefox.  But many of these show up several times a day, and for a developer who wants to fix a bad test, most are not really actionable.  Do we need part of the definition to include whether something is actionable?

Looking at the history of ‘intermittent-failure’ bugs in Bugzilla, many occur once and never occur again.  In fact, this is the case for over half of the bugs filed (we file upwards of 100 new bugs/week).  While there are probably reasons for a given test case to fail, if it failed in August 2014 and has never failed again, is that test case intermittent?  As a developer, could you really do anything about this, given that reproducing it is virtually impossible?

This is where I start to realize we need to find a way to identify real intermittent bugs/tests and not clutter the statistics with tests which are virtually impossible to reproduce.  Thinking back to what is actionable: I have found, while filing bugs for Talos regressions, that the closer the bug is filed to the original patch landing, the better the chance it will get fixed.  Adding to that point, we only keep 30 days of builds/test packages around for our CI automation.  I really think a definition of an intermittent needs to include some concept of time.  Should we ignore intermittent failures which occur only once in 90 days?  Maybe ignore ones that don’t reproduce after 1000 iterations?  Some could argue that we should look at a smaller or larger window of time/iterations.

Lastly, when looking into specific bugs, I find many times they are already fixed.  Many of the intermittent failures are actually fixed!  Do we track how many get fixed?  How many have patches and have debugging already taking place?  For example in the last 28 days, we have filed 417 intermittents, of which 55 are already resolved and of the remaining 362 only 25 have occurred >=20 times.  Of these 25 bugs, 4 already have patches.  It appears a lot of work is done to fix intermittent failures which are actionable.  Are the ones which are not being fixed not actionable?  Are they in a component where all the developers are busy and heads down?

In a perfect world a failure would never occur, all tests would be green, and all users would use Firefox.  In reality we have to deal with thousands of failures every week, most of which never happen again.  This quarter I would like to see many folks get involved in discussions and determine:

  • what is too infrequent to be intermittent? we can call this noise
  • what is the general threshold where something is intermittent?
  • what is the general threshold where we are too intermittent and need to backout a fix or disable a test?
  • what is a reasonable timeframe to track these failures such that we can make them actionable?

Thanks for reading; I look forward to hearing from many who have ideas on this subject.  Stay tuned for an upcoming blog post about re-triggering intermittent failures to find the root cause.


SETA – Search for Extraneous Test Automation

Here at Mozilla we run dozens of builds and hundreds of test jobs for every push to a tree.  As time has gone on, we have gone from a couple hours from push to all tests complete to 4+ hours.  With the exception of a few test jobs, we could complete our test results in

The question becomes: how do we manage to keep up test coverage without growing the number of machines?  Right now we do this with buildbot coalescing (we queue up the jobs and skip the older ones when the load is high).  While this works great, it causes us to skip a bunch of jobs (builds/tests) on random pushes, and sometimes we need to go back in and manually schedule jobs to find failures.  In fact, while keeping up with the automated alerts for Talos regressions, coalescing causes problems in over half of the regressions that I investigate!

Knowing that we have lived with coalescing for years, many of us started wondering if we need all of our tests.  Ideally we could select the tests that are statistically most significant to the changes being pushed, and if those pass, run the rest of the tests if there were available machines.  Getting there is tough; maybe there is a better way to solve this?  Luckily we can mine metadata from Treeherder (and the former TBPL) and determine which failures are intermittent and which have been fixed/caused by a different revision.

A few months ago we started looking into unique failures on the trees.  Not just the failures, but which jobs failed.  Normally when we have a failure detected by the automation, many jobs fail at once (for example: xpcshell tests will fail on all Windows platforms, opt + debug).  When you look at the common jobs which fail across all the failures over time, you can determine the minimum number of jobs required to detect all the failures.  Keep in mind that we only need 1 job to represent a given failure.
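
This is essentially a set-cover problem: each failure is “covered” by the set of jobs that failed for it, and we want a small set of jobs that covers every failure.  Below is a minimal greedy sketch of the idea; it is my own illustration, not the actual SETA code.

```python
# Greedy set-cover sketch of the SETA idea: keep picking the job that detects
# the most still-uncovered failures until every historical failure is detected
# by at least one chosen job. My own illustration, not the real implementation.
def reduced_job_set(failures):
    """failures: dict mapping a failure id -> set of job names that failed for it."""
    uncovered = set(failures)
    chosen = []
    while uncovered:
        counts = {}
        for fid in uncovered:
            for job in failures[fid]:
                counts[job] = counts.get(job, 0) + 1
        best = max(counts, key=counts.get)
        chosen.append(best)
        uncovered = {fid for fid in uncovered if best not in failures[fid]}
    return chosen

example = {
    "failure-1": {"linux64 opt xpcshell", "win7 opt xpcshell"},
    "failure-2": {"linux64 opt xpcshell"},
    "failure-3": {"osx10.6 debug mochitest-2"},
}
print(reduced_job_set(example))  # ['linux64 opt xpcshell', 'osx10.6 debug mochitest-2']
```

In practice the selection can be weighted (for example, preferring linux* jobs that run in the cloud, as noted below), but the core idea is the same.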

As of today, we have data since August 13, 2014 (just shy of 180 days):

  • 1061 failures caught by automation (for desktop builds/tests)
  • 362 jobs are currently run for all desktop builds
  • 285 jobs are optional and not needed to detect all 1061 regressions

To phrase this another way, we could have run 77 jobs per push and caught every regression in the last 6 months.  Let’s back up a bit and look at the regressions found: how many are there, and how often do we see them?

[Figure: cumulative and per-day regressions]

This is a lot of regressions; yay for automation.  The problem is that this is historical data, not future data.  Our tests, browser, and features change every day, so this doesn’t seem very useful for predicting the future.  This is a parallel to the stock market, where people invest in companies based on historical data and make decisions based on incoming data (press releases, quarterly earnings).  This is the same concept.  We have dozens of new failures every week, and if we only relied upon the 77 test jobs (which would catch all historical regressions) we would miss new ones.  This is easy to detect, and we have mapped out the changes.  Here it is in a calendar view (bold dates indicate a change was detected, i.e. a new job is needed in the reduced set of jobs):

[Figure: calendar view; bolded dates are when a change is needed due to new failures]

This translates to about 1.5 changes per week.  To put this another way, if we were only running the 77-job reduced set, we would have missed one regression on December 2nd, another on December 16th, etc., or on average 1.5 regressions missed per week.  In a scenario where we only ran the optional jobs once/hour on the integration branches, 1-2 times/week we would see a failure and have to backfill some jobs (as we currently do for coalesced jobs) for the last hour to find the push which caused the failure.

To put this into perspective, here is a similar view to what you would expect to see today on treeherder:

[Figure: all desktop unittest jobs]

For perspective, here is what it would look like assuming we only ran the reduced set of 77 jobs:

[Figure: reduced set of jobs view]

* Keep in mind this is weighted such that we prefer to run jobs on linux* builds, since those run in the cloud.

What do we plan to do with all of this information?  We plan to run the reduced set of jobs by default on all pushes, and use the 285 optional jobs as candidates for coalescing.  Currently we force coalescing for debug unittests.  This was done about 6 months ago because debug tests take considerably longer than opt, so if we could run them on every 2nd or 3rd build, we would save a lot of machine time.  This is only being considered on integration trees that the sheriffs monitor (mozilla-inbound, fx-team).

Some questions that are commonly asked:

  • How do you plan to keep this up to date?
    • We run a cronjob every day and update our master list of jobs, failures, and optional jobs.  This takes about 2 minutes.
  • What are the chances the reduced set of jobs catch >1 failure?  Do we need all 77 jobs?
    • 77 jobs detect 1061 failures (100%)
    • 35 jobs detect 977 failures (92%)
    • 23 jobs detect 940 failures (88.6%)
    • 12 jobs detect 900 failures (84.8%)
  • How can we see the data?
    • SETA website
    • in the near future, summary emails to mozilla.dev.tree-alerts when we detect a *change*

Thanks for reading so far!  This project wouldn’t be here if it wasn’t for the many hours of work by Vaibhav; he continues to find more ways to contribute to Mozilla.  If anything, this should inspire you to think more about how our scheduling works and what great things we can do if we think out of the box.


5 days in Portland with Mozillians and 10 great things that came from it

I took a lot of notes in Portland last week.  One might not know that, given that I talked so much my voice ran out of steam by the second day.  Either way, in chatting with some co-workers yesterday about what we took away from Portland, I realized that there is a long list of awesomeness.

Let me caveat this by saying that some of these ideas have been talked about in the past, but despite our efforts to work with others and field interesting and useful ideas, there is a big list of great things that came to light while chatting in person:

  • :bgrins mentioned a mozscreenshot tool and the need for getting a screenshot of new features in development on various platforms so UX can review the changes.  Currently the method is to ask UX to download the build from try (or some other location) and run it locally to see the changes.
  • :heycam/:jwatt – had a great and interesting Talos discussion, mostly around how to run it and validate patches/fixes locally and on try server (check out bug 1109243)
  • :glandium is looking at doing some changes (I recall something with build/pgo) and wanted to know how to compare some Talos numbers to help make the right decision – this can be done with either bug 1109243, or the existing compare.py in the Talos repo (we might need some cleanup on this)
  • :bobowen has been working to get csb tests working.  After chatting in line to board a plane, it became clear he needs to solve some finer-grained test selection problems, many of which the A-Team has on a roadmap for Q2/Q3; I see some tighter collaboration happening here.
  • Thanks to chatting with :lsblakk, I am motivated to expand the talos sheriff team and look for dedicated Mozillians (or soon to become Mozillians) to work with in keeping a lid on the alerts and overall state of performance (based on what we measure).
  • :lightsofapollo had a great conversation with me about TaskCluster and what barriers stood in the way of running Talos on it – this will result in some initial investigation work!
  • :kats was asking me how to generate alerts for areweslimyet.com.  This is very doable via posting data to graph server
  • After a good session on how to handle intermittents (seems like the same people have this conversation every time a bunch of Mozillians get together), I am motivated to push Titanic further to find the root cause of an intermittent via brute force retriggers (ideally on weekends).  In fact :dbaron has done this a few times in the last month and so have the sheriffs.  This is similar to what we do to verify a talos regression, just with some different parameters.
  • The same conversation about intermittents yielded a stronger desire to look at new tests coming into the system and validate their stability.  The simple solution is to run the job 100 times, verify that the new test didn’t have issues, and then leave it alone.  Of course we could get smart and do this for all test_* files that are edited in the tree.  Thanks to :ehsan for spawning this conversation.
  • Discussing the idea of a Talos Sheriff with a few folks, it seems there are further conversations needed with the existing Sheriff team, as well as a chat with :vladan and :avih about what type of policy we should have for existing performance failures which are detected.  I would expect some changes to be made early next year as we have more tests and need more help.  My initial thoughts are specifically about responding to regressions or getting backed out in XX hours.  Yeah, that sounds nasty, but there are probably cut and dry parameters we can set and start enforcing.

Those are 10 specific topics which, despite everybody knowing how to contact me or the A-Team to share great ideas or frustrations, only came out of being in the same place at the same time.

Thinking through this, when I see these folks in a real office while working from there for a few days or a week, the conversations seem smaller and not as intense, usually just small talk whilst waiting for a build to finish.  I believe the expectation that we are not focused on our day-to-day work, and instead make plans for the future, is the real innovation behind getting these topics surfaced.


Thoughts on Auto-Land, Try server, and intermittent oranges outline

This is the first post in a series where I will share some ideas.  These are ideas, not active projects (although these ideas could be implemented within many active projects).

My first idea surrounds the concept of AutoLand.  Mozilla has talked about this for a long time.  In fact, a conversation I had last week got me thinking more about the value of AutoLand vs. blocking on various aspects of it.  There are a handful of issues blocking us from a system where we push to try and, if it goes well, we magically land it on the tip of a tree.  My vested interest comes in the “if it goes well” part.

The argument here has been that we have so many intermittent oranges that until we fix those we cannot determine if a job is green.  A joke for many years has been that it would be easier to win the lottery than to get an all-green run on TBPL.  I have seen a lot of cases where people push to Try and land on Inbound, only to be backed out by a test failure, a test failure that was seen on Try (for the record, I personally have done this once).  I am sure someone could write a book on human behavior, tooling, and analysis of why failures land on integration branches when we have try server.

My current thought is this (a sketch of the check follows the notes below):

  • push to try server with a final patch; run a full set of tests and builds
  • when all the jobs are done [1], analyze the results of the jobs and look for 2 patterns
    • pattern 1: for a given build, at most 1 job fails
    • pattern 2: for a given job [2], at most 1 platform fails
  • if patterns 1 + 2 pass, we put this patch in the queue for landing by the robots

[1] – we can determine the minimal amount of jobs, or verify with more analysis (i.e. 1 mochitest can fail, 1 reftest can fail, 1 other can fail)

[2] – some jobs are run in different chunks: on opt, ‘dt’ runs all browser-chrome/devtools jobs, but this is ‘dt1’, ‘dt2’, ‘dt3’ on debug builds
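
Here is a minimal sketch of what that check could look like.  The data shape (a list of per-job results for the completed try push) is an assumption of mine, not an existing AutoLand API:

```python
# Sketch of the "is this try push green enough to autoland?" check described
# above. `results` is assumed to be a list of (build, job, platform, passed)
# tuples for the completed try push.
from collections import defaultdict

def ok_to_autoland(results):
    failures_per_build = defaultdict(int)
    failed_platforms_per_job = defaultdict(set)

    for build, job, platform, passed in results:
        if not passed:
            failures_per_build[build] += 1
            failed_platforms_per_job[job].add(platform)

    # pattern 1: for a given build, at most 1 job fails
    pattern1 = all(n <= 1 for n in failures_per_build.values())
    # pattern 2: for a given job, at most 1 platform fails
    pattern2 = all(len(p) <= 1 for p in failed_platforms_per_job.values())
    return pattern1 and pattern2
```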

 

This simple approach would give us the confidence that we need to reliably land patches on integration branches and achieve the same if not better results than humans.

As a bonus, we could optimize our machine usage by not building/running all jobs on the integration commit, because we already have a complete set done on try server.

 


Are there any trends in our Talos regression bugs?

Now that we have a better process for taking action on Talos alerts and pushing them to resolution, it is time to take a step back and see if any trends show up in our bugs.

First I want to look at bugs filed/week:

[Figure: Talos regression bugs filed per week]

This is fun to see; now what if we stack this up side by side with the alerts we receive:

[Figure: bugs filed per week alongside alerts received]

We started tracking alerts halfway through this process.  It shows that we file a bug for about 1 out of every 25 alerts.  I had previously stated it was closer to 1/33 alerts (it appears that figure was averaging in the first few weeks).

Let’s see where these bugs are filed; here is a view of the different Bugzilla products:

[Figure: regression bugs by Bugzilla product]

The Testing product is used for bugs where we cannot figure out the exact changeset, so they get filed in Testing::Talos.  As there are almost 30 unique components that bugs are filed in, I took a few minutes to look at the Core product; here is where the bugs live within Core:

[Figure: regression bugs by Core component]

Pardon my bad graphing attempt here, with the components cut off.  Graphics is the clear winner for regressions (with “graphics: layers” being a large part of it).  Of course the JavaScript Engine and DOM would be there (a lot of our tests are sensitive to changes in those areas).  This really shows where our test coverage is, more than where bad code lives.

Now that I know where the bugs are, here is a view of how long the bugs stay open:

[Figure: how long regression bugs stay open]

The fantastic news is that most of our bugs are resolved in <=15 days!  I think this is a metric we can track and get better at; ideally we would close all Talos regression bugs in <30 days.

Looking over all the bugs we have, what is their status?

[Figure: status of regression bugs]

Yay for the blue Pac-Man!  We have a lot of new bugs rather than assigned bugs; we might want to adjust that and assign owners once a bug is confirmed and briefly discussed, but that is still up in the air.

The burning question is: what are all the bugs resolved as?

[Figure: resolutions of regression bugs]

To me this seems healthy; it is a starting point.  Tracking this over time will probably be a useful metric!

 

In summary, many developers have done great work to make improvements and fix problems over the last 6 months that we have been tracking this information.  There are things we can do better, and I want to know:

What information provided today is useful to track regularly?

Is there something you would rather see?

 
