Tag Archives: data

May 12, 2015 · 9:00 am

Re-Triggering for a [root] cause – some notes from doing it

With all this talk of intermittent failures and folks coming up with ideas on how to solve them, I figured I should dive head first into looking at failures. I have been working with a few folks on this, specifically :parkouss and :vaibhav1994. This experiment (actually the second time doing so) is where I take a given intermittent failure bug and retrigger it. If it reproduces, then I go back in history looking for where it becomes intermittent. This weekend I wrote up some notes as I was trying to define what an intermittent is.

Let’s outline the parameters for this experiment first:

All bugs marked with keyword ‘intermittent-failure’ qualify
Bugs must not be resolved or assigned to anybody
Bugs must have been filed in the last 28 days (we only keep 30 days of builds)
Bugs must have >=20 tbplrobot comments (arbitrarily picked, ideally we want something that can easily reproduce)

Here is what comes out of this:

356 bugs are open, not assigned and have the intermittent-failure keyword
25 bugs have >=20 comments meeting our criteria
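
The selection criteria above amount to a simple filter. Here is a minimal sketch of it, assuming each bug is a plain dictionary; the field names and the four sample records are made up for illustration and are not the Bugzilla schema or real bugs:

```python
from datetime import date, timedelta

# Hypothetical bug records; field names are illustrative only.
def qualifies(bug, today, max_age_days=28, min_robot_comments=20):
    """Return True if a bug meets the four criteria listed above."""
    return (
        "intermittent-failure" in bug["keywords"]
        and not bug["resolved"]
        and bug["assignee"] is None
        and (today - bug["filed"]) <= timedelta(days=max_age_days)
        and bug["robot_comments"] >= min_robot_comments
    )

today = date(2015, 5, 12)
bugs = [
    # Meets every criterion.
    {"id": 1, "keywords": ["intermittent-failure"], "resolved": False,
     "assignee": None, "filed": date(2015, 5, 1), "robot_comments": 34},
    # Too few robot comments.
    {"id": 2, "keywords": ["intermittent-failure"], "resolved": False,
     "assignee": None, "filed": date(2015, 5, 1), "robot_comments": 3},
    # Already assigned to somebody.
    {"id": 3, "keywords": ["intermittent-failure"], "resolved": False,
     "assignee": "someone", "filed": date(2015, 5, 1), "robot_comments": 40},
    # Filed more than 28 days ago.
    {"id": 4, "keywords": ["intermittent-failure"], "resolved": False,
     "assignee": None, "filed": date(2015, 3, 1), "robot_comments": 40},
]

candidates = [bug["id"] for bug in bugs if qualifies(bug, today)]
print(candidates)
```

Only the first sample record passes all of the checks.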

The next step was to look at each of the 25 bugs and see if it makes sense to do this. In fact 13 of the bugs I decided not to take action on (remember this is an experiment, so my reasoning for ignoring these 13 could be biased):

5 bugs were thunderbird/mozmill only
3 bugs looked to be related to android harness issues
bug 1157090 hadn’t reproduced in 2 weeks; it was related to the APZ feature, which we turned off
1 bug was on mozilla-beta only
2 bugs had patches and 1 had a couple of comments indicating a developer was already looking at it

This leaves us with 12 bugs to investigate. The premise here is simple: find the first occurrence of the intermittent (branch, platform, testjob, revision) and re-trigger it 20 times (picked for simplicity). When the results are in, see if we have reproduced it. In fact, only 5 of the 12 bugs reproduced the exact error in the bug when re-triggered 20 times on a specific job that showed the error.
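
To get a feel for what 20 re-triggers can and cannot catch, here is a back-of-the-envelope sketch. It assumes every run fails independently with the same probability, which is a simplification and probably not true for real intermittents; the function name is made up for this example:

```python
def chance_of_reproducing(failure_rate, retriggers=20):
    """Probability of seeing at least one failure in `retriggers` runs,
    assuming independent runs with a constant failure rate."""
    return 1.0 - (1.0 - failure_rate) ** retriggers

for rate in (0.01, 0.05, 0.10, 0.25):
    chance = chance_of_reproducing(rate)
    print(f"{rate:.0%} failure rate: {chance:.0%} chance of a failure in 20 runs")
```

Under that assumption, a test that fails 5% of the time has roughly a 64% chance of showing up in 20 runs, so a clean set of 20 re-triggers does not prove the failure is gone.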

Moving on, I started re-triggering jobs back in the pushlog to see where it was introduced. I started off with going back 2/4/6 revisions, but got more aggressive as I didn’t see patterns. Here is a summary of what the 5 bugs turned out like:

Bug 1161915 – Windows XP PGO Reftest. Found the root cause (pgo only, lots of pgo builds were required for this) 23 revisions back.
Bug 1160780 – OSX 10.6 mochitest-e10s-bc1. Found the root cause 33 revisions back.
Bug 1161052 – Jetpack test failures. So many failures in the same test file, it isn’t very clear if I am reproducing the failure or finding other ones. :erikvold is working on fixing the test in bug 1163796 and ok’d disabling it if we want to.
Bug 1161537 – OSX 10.6 Mochitest-other. Bisection didn’t find the root cause, but this is a new test case which was added. This is a case where when the new test case was added it could have been run 100+ times successfully, then when it merged with other branches a couple hours later it failed!
Bug 1155423 – Linux debug reftest-e10s-1. This reproduced 75 revisions in the past, and due to that I looked at what changed in our buildbot-configs and mozharness scripts. We actually turned on this test job (hadn’t been running e10s reftests on debug prior to this) and that caused the problem. This can’t be tracked down by re-triggering jobs into the past.
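
Walking back through the pushlog with bigger and bigger steps could be automated along these lines. This is only a sketch under some assumptions: `pushlog` is a list of revisions ordered oldest first, `reproduces` is a hypothetical callback standing in for "re-trigger the job a number of times and check for the failure", and the step size doubles each time (the manual run above started with 2/4/6 and then got more aggressive). It also assumes that once the failure is introduced it reproduces on every later push, which an intermittent does not guarantee.

```python
def find_first_failing(pushlog, reproduces, start_index, first_step=2):
    """Return the oldest revision that still reproduces the failure,
    or None if it reproduces all the way back to the oldest push."""
    bad = start_index          # newest index known to reproduce
    good = None                # index known NOT to reproduce
    step = first_step

    # Phase 1: jump back with growing steps until a push looks clean.
    while bad > 0:
        probe = max(bad - step, 0)
        if reproduces(pushlog[probe]):
            bad = probe
            step *= 2
        else:
            good = probe
            break

    if good is None:
        return None            # ran out of history (e.g. builds expired)

    # Phase 2: bisect between the clean push and the failing push.
    while bad - good > 1:
        mid = (good + bad) // 2
        if reproduces(pushlog[mid]):
            bad = mid
        else:
            good = mid
    return pushlog[bad]

# Made-up pushlog where the failure was introduced at "rev67".
pushlog = [f"rev{i}" for i in range(100)]
first_bad = find_first_failing(pushlog, lambda rev: int(rev[3:]) >= 67, 99)
print(first_bad)
```

A result of `None` corresponds to the last case above: the failure reproduces as far back as we have builds, so the cause has to be found somewhere other than the pushlog.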

In summary, out of 356 bugs 2 root causes were found by re-triggering. In terms of time invested, I have put in about 6 hours trying to find the root cause of the 5 bugs.

Filed under testdev

Tagged as automation, bugs, data, development

May 10, 2015 · 5:49 pm

intermittent oranges- missing a real definition

There are all kinds of great ideas folks have for fixing intermittent issues. In fact each idea in and of itself is a worthwhile endeavor. I have spent some time over the last couple of months fixing them, filing bugs on them, and really discussing them. One question remains: what is the definition of an intermittent?

I don’t plan to lay out a definition, instead I plan to ask some questions and lay out some parameters. According to orange factor, there are 4640 failures in the last week (May 3 -> May 10) all within 514 unique bugs. These are all failures that the sheriffs have done some kind of manual work on to star on treeherder. I am not sure anybody can find a way to paint a pretty picture to make it appear we don’t have intermittent failures.

Looking at a few bugs, there are many reasons for intermittent failures:

core infrastructure (networking, power, large classes of machines (ec2), etc.)
machine specific (specific machine is failing a lot of jobs)
CI specific (buildbot issues, twisted issues, puppet issues, etc.)
CI Harness (usually mozharness)
Platforms (old platforms/tests we are phasing out, new platforms/tests we are phasing in)
Test harness (mochitest + libraries, common code for tests, reftest)
Test Cases (test cases actually causing failures, poorly written, new test cases, etc.)
Firefox Code (we actually have introduced conditions to cause real failures- just not every time)
Real regressions (failures which happen every time we run a test)

There are a lot of reasons, and many of these have nothing to do with poor test cases or bad code in Firefox. But many of these are showing up many times a day, and for a developer who wants to fix a bad test, many are not really actionable. Do we need some part of the definition to require that a failure is actionable?

Looking at the history of ‘intermittent-failure’ bugs in Bugzilla, many occur once and never occur again. In fact this is the case for over half of the bugs filed (we file upwards of 100 new bugs/week). While there are probably reasons for a given test case to fail, if it failed in August 2014 and has never failed again, is that test case intermittent? As a developer could you really do anything about this given the fact that reproducing it is virtually impossible?

This is where I start to realize we need to find a way to identify real intermittent bugs/tests and not clutter the statistics with tests which are virtually impossible to reproduce. Thinking back to what is actionable- I have found that while filing bugs for Talos regressions the closer the bug is filed to the original patch landing, the better the chance it will get fixed. Adding to that point, we only keep 30 days of builds/test packages around for our CI automation. I really think a definition of an intermittent needs to have some kind of concept of time. Should we ignore intermittent failures which occur only once in 90 days? Maybe ignore ones that don’t reproduce after 1000 iterations? Some could argue that we look in a smaller or larger window of time/iterations.

Lastly, when looking into specific bugs, I find many times they are already fixed. Many of the intermittent failures are actually fixed! Do we track how many get fixed? How many have patches and have debugging already taking place? For example in the last 28 days, we have filed 417 intermittents, of which 55 are already resolved and of the remaining 362 only 25 have occurred >=20 times. Of these 25 bugs, 4 already have patches. It appears a lot of work is done to fix intermittent failures which are actionable. Are the ones which are not being fixed not actionable? Are they in a component where all the developers are busy and heads down?
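
For a sense of proportion, the numbers in the paragraph above work out as follows (a small arithmetic sketch using only the figures quoted there):

```python
filed = 417                       # intermittents filed in the last 28 days
resolved = 55                     # already resolved
remaining = filed - resolved      # 362 still open
frequent = 25                     # of the remaining, occurred >= 20 times
patched = 4                       # of the frequent ones, already have patches

print(f"resolved so far:            {resolved / filed:.1%}")
print(f"open bugs seen >= 20 times: {frequent / remaining:.1%}")
print(f"frequent bugs with patches: {patched / frequent:.1%}")
```

In other words, only a small slice of the open bugs fail often enough to be realistic candidates for reproducing.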

In a perfect world a failure would never occur, all tests would be green, and all users would use Firefox. In reality we have to deal with thousands of failures every week, most of which never happen again. This quarter I would like to see many folks get involved in discussions and determine:

What is too infrequent to be intermittent? We can call this noise
What is the general threshold where something is intermittent?
What is the general threshold where we are too intermittent and need to back out a fix or disable a test?
What is a reasonable timeframe to track these failures such that we can make them actionable?
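
Whatever answers come out of that discussion, they could be written down as a couple of thresholds applied to one tracking window. The sketch below is purely illustrative: the threshold values and labels are placeholders made up for this example (the 1 in 1000 noise floor echoes the 1000 iterations mentioned earlier), not a proposal.

```python
def classify(failures, runs, noise_rate=0.001, disable_rate=0.05):
    """Label a test by its failure rate within one tracking window.

    noise_rate and disable_rate are placeholder thresholds; choosing
    real values is exactly the open question.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    rate = failures / runs
    if rate == 0:
        return "green"
    if rate <= noise_rate:
        return "noise"
    if rate < disable_rate:
        return "intermittent"
    return "back out or disable"

print(classify(1, 1000))    # one failure in 1000 runs
print(classify(20, 1000))   # 2% failure rate
print(classify(100, 1000))  # 10% failure rate
```

The window itself (28 days, 90 days, N iterations) would decide which `failures` and `runs` get counted before this function is called.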

Thanks for reading, I look forward to hearing from many who have ideas on this subject. Stay tuned for an upcoming blog post about re-triggering intermittent failures to find the root cause.

Tracking Firefox performance as we uplift – the volume of alerts we get

For the last year, I have been focused on ensuring we look at the alerts generated by Talos. For the last 6 months I have also looked a bit more carefully at the uplifts we do every 6 weeks. In fact we wouldn’t generate alerts when we uplifted to beta because we didn’t run enough tests to verify a sustained regression in a given time window.

Let’s look at data, specifically the volume of alerts:

Trend of improvements/regressions from Firefox 31 to 36 as we uplift to Aurora

This is a stacked graph; you can interpret it as Firefox 32 had a lot of improvements and Firefox 33 had a lot of regressions. I think what is more interesting is how many performance regressions are fixed or added when we go from Aurora to Beta. There is minimal data available for Beta. This next image will compare alert volume for the same release on Aurora then on Beta:

Side by side stacked bars for the regressions going into Aurora and then going onto Beta.

One way to interpret the graph above is to see that we fixed a lot of regressions on Aurora while Firefox 33 was on there, but for Firefox 34, we introduced a lot of regressions.

The above is just my interpretation of the data. Here are links to a more fine-grained view of it:

As always, if you have questions, concerns, praise, or other great ideas- feel free to chat via this blog or via irc (:jmaher).

A case of the weekends?

Case of the Mondays

What was famous 15 years ago as a case of the Mondays has manifested itself in Talos. In fact, I wonder why I get so many regression alerts on Monday as compared to other days. It comes down to the fact that we have less noise in our Talos data on weekends.

Take for example the test case tresize:

* linux32

* in fact we see this on other platforms as well: linux32, linux64, osx10.8, windowsXP

Many other tests exhibit this. What is different about weekends? Are there just fewer data points?

I do know our volume of tests goes down on weekends, mostly as a side effect of fewer patches being landed on our trees.

Here are some ideas I have to debug this more:

Run massive retrigger scripts for talos on weekends to validate whether the # of samples is or is not the problem
Reduce the volume of talos on weekdays to validate whether the overall system load in the datacenter is or is not the problem
Compare the load of the machines with all branches and wait times to that of the noise we have in certain tests/platforms
Look at platforms like windows 7, windows 8, and osx 10.6 as to why they have more noise on weekends or are more stable. Finding the delta in platforms would help provide answers
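
Before trying any of these, it helps to put a number on "less noise on weekends". One rough way, sketched below, is to split the samples for a test/platform into weekday and weekend groups and compare the coefficient of variation (standard deviation divided by mean) of each. The sample values here are invented, the function name is made up, and this is not how Talos computes its alerts.

```python
from datetime import date
from statistics import mean, pstdev

def noise_by_day_type(samples):
    """samples: iterable of (date, value) pairs for one test/platform.

    Returns the coefficient of variation for weekday and weekend data,
    or None for a group with fewer than two samples. Assumes positive
    values such as timing results.
    """
    groups = {"weekday": [], "weekend": []}
    for day, value in samples:
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
        key = "weekend" if day.weekday() >= 5 else "weekday"
        groups[key].append(value)
    return {
        key: (pstdev(values) / mean(values) if len(values) > 1 else None)
        for key, values in groups.items()
    }

# Invented numbers for one week (May 12 to May 18, 2014).
samples = [
    (date(2014, 5, 12), 100), (date(2014, 5, 13), 110),
    (date(2014, 5, 14), 95), (date(2014, 5, 15), 105),
    (date(2014, 5, 16), 90),
    (date(2014, 5, 17), 100), (date(2014, 5, 18), 101),
]
noise = noise_by_day_type(samples)
print(noise)
```

With far fewer samples on weekends the comparison is itself noisy, which is part of what the retrigger idea would help sort out.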

If you have ideas on how to uncover this mystery, please speak up. I would be happy to have this gone and make any automated alerts more useful!

Filed under testdev

Tagged as data, mozilla, performance, talos

May 19, 2014 · 12:20 pm

Looking for long term trends and patterns in how I work

Early last year (2013), I noticed I would be really productive for a couple of weeks, and then get in a rut for a week here and there. After discussing this perceived trend with Clint, I started tracking it every week (end of work day on Friday). I have been tracking it for a year, and now I have data to examine in more detail:

For the first part of last year (through September 2013), I would go in 6 week cycles which appeared to be about 1 week after the uplifts. Oddly enough I wasn’t doing any specific work for uplifts, but I do recall a lot of odd issues that required debugging for each uplift. Quite possibly the day or two spent handling these issues resulted in me getting backlogged on emails.

Oddly enough when I transitioned from full time mobile automation -> full time performance automation, my cycles became more regular. One exception was a focused project development week early in 2014 which had me doing other tasks and getting behind on a few other projects.

There is no direct correlation in the health data, but I have some theories. I record my general feeling of health (for the most part physical, not emotional). This is purely judgmental and there is no science behind it. 10 is good, 0 is bad, so when there is a dip in health on the graph, I usually see an increase in email volume the next week. No explanation for that, just an observation.
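
One way to check an observation like "a dip in health is followed by more email the next week" is a lagged correlation: pair each week's health score with the following week's email volume and compute the Pearson correlation. The sketch below uses invented numbers, not my tracking data, and the function names are made up for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def lagged_correlation(health, email, lag=1):
    """Correlate health in week i with email volume in week i + lag.

    lag must be at least 1.
    """
    return pearson(health[:-lag], email[lag:])

# Invented weekly values: health on a 0-10 scale, email as a count.
health = [8, 8, 4, 8, 9, 3, 8]
email = [50, 50, 52, 90, 50, 48, 95]
r = lagged_correlation(health, email)
print(f"health vs. next week's email: r = {r:.2f}")
```

A strongly negative value would match the observation; with only a year of weekly points, any such number should be read loosely.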

In summary, I have enjoyed looking back on this data. It was good to see a trend for most of 2013. Maybe next year I will see a different trend or pattern.

Filed under Uncategorized

Tagged as data, personal, trends