This week on Lost in Data, we tackle yet another pile of alerts. This time we have a set of changes which landed together, and we push to try for bisection. In addition, we have an e10s-only failure which happened when we broke talos uploading to perfherder. See how I get one step closer to figuring out the root cause of the regressions.
Today I recorded a session of me investigating talos alerts. It is ~35 minutes, sort of long, but doable in the background whilst grabbing breakfast or lunch!
I am looking forward to more sessions and answering questions that come up.
This week I looked at each bug again and annotated them with some notes as to what I did or why I didn’t do anything. Here are the bugs:
- Bug 1160008 – Intermittent testVideoDiscovery
- too old! (from last week)
- Bug 1073442 – Intermittent command timed out
- too old; infra issue (from week before last)
- Bug 1121145 – Intermittent browser_panel_toggle.js
- too old! problem got worse on April 24th (from last week)
- Bug 1157948 – DMError: Non-zero return code for command
- too old! most likely a harness/infra issue (from last week)
- Bug 1161817 – Intermittent browser_timeline-waterfall-sidebar.js
- fixed already
- Bug 1168747 – Intermittent 336736-1a.html
- resolved (via skip-if in manifest)
- Bug 1149955 – Intermittent Win8-PGO test_shared_all.py
- too old (from last week – someone looking into it now though!)
- Bug 1158887 – Intermittent test_async_transactions.js
- reopened – fix landed recently; re-triggering doesn’t seem helpful :(
- Bug 1090203 – Intermittent style-src-3_2.html
- to do work
- Bug 1081925 – Intermittent browser_popup_blocker.js
- test is disabled now (from last week)
It is nice to find a bug that we can take action on. What is interesting is that the bug has been around for a while, but around May 21 we noticed the rate of failures went up from a couple per day to >5/day. Details:
- started out re-triggering on m-c. We could see a pattern on a specific merge.
- did re-triggers on m-i. The first 20 were inconclusive, so I triggered 20 more for each job – the results were still inconclusive. There is no increasing pattern based on a specific changeset.
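As an aside on why even 40 retriggers can stay inconclusive: if a job fails intermittently with probability p, the chance that n independent retriggers all come back green is (1 - p)^n. A small illustrative sketch (my own math aside, not part of any tooling):

```python
def chance_all_green(p, n):
    """Probability that n independent retriggers of a job that fails
    intermittently with probability p all come back green."""
    return (1 - p) ** n

# A 10% intermittent slips past 20 retriggers about 12% of the time,
# which is why a second round of 20 can be worth triggering:
print(round(chance_all_green(0.10, 20), 3))  # 0.122
print(round(chance_all_green(0.10, 40), 3))  # 0.015
```

So a low-frequency intermittent can easily survive one round of retriggers and still muddy the picture.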
I might try a full experiment soon blindly looking at bugs instead of using orange factor.
Last week I wrote some notes about re-triggering jobs to find a root cause. This week I decided to look at the orange factor email of the top 10 bugs and see how I could help. Looking at each of the 10 bugs, I had 3 worth investigating and 7 I ignored.
- Bug 1163911 – test_viewport_resize.html – new test which was added 15 revisions back from the first instance in the bug. The sheriffs had already worked to get this test disabled prior to my results coming in!
- Bug 1081925 – browser_popup_blocker.js – previous test in the directory was modified to work in e10s 4 revisions back from the first instance reported in the bug, causing this to fail
- Bug 1118277 – browser_popup_blocker.js (different symptom, same test pattern and root cause as bug 1081925)
- Bug 1073442 – Intermittent command timed out; might not be code related and >30 days of history.
- Bug 1096302 – test_collapse.html | Test timed out. >30 days of history.
- Bug 1151786 – testOfflinePage. >30 days of history. (and a patch exists).
- Bug 1145199 – browser_referrer_open_link_in_private.js. >30 days of history.
- Bug 1073761 – test_value_storage.html. >30 days of history.
- Bug 1161537 – test_dev_mode_activity.html. resolved (a result from the previous bisection experiment).
- Bug 1153454 – browser_tabfocus.js. >30 days of history.
Looking at the bugs of interest, I jumped right into retriggering. This time around I did 20 retriggers on the original changeset, then went back 30 revisions (every 5th) doing the same thing. Effectively this meant 20 retriggers each for the 0th, 5th, 10th, 15th, 20th, 25th, and 30th revisions in the history list (140 retriggers).
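The schedule above can be sketched in a few lines of Python (a rough illustration of the process, not the actual tooling; the names are mine):

```python
def retrigger_schedule(history, step=5, depth=30, per_rev=20):
    """history[0] is the revision the failure was first reported on;
    later entries go back in time. Pick every `step`-th revision up to
    `depth` back and schedule `per_rev` retriggers on each."""
    picks = history[0:depth + 1:step]   # revisions 0, 5, 10, ..., 30
    return [(rev, per_rev) for rev in picks]

plan = retrigger_schedule(["rev%d" % i for i in range(31)])
print(len(plan), sum(count for _, count in plan))  # 7 revisions, 140 retriggers
```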
I ran into issues doing this, specifically on Bug 1073761, because for about 7 revisions in the history the Windows 8 builds failed! Luckily the builds got far enough to produce a binary+tests package so we could run tests, but mozci didn’t understand that the build was available, which required some manual retriggering. A few cases in both rounds of retriggers were actual build failures, which meant manually picking a different revision to retrigger on. It was then fairly easy to run my tool again and fill in the 4 missing revisions using slightly different mozci parameters.
This was a bit frustrating, as there was a lot of manual digging and retriggering due to build failures. Luckily 2 of the top 10 bugs share the same root cause, and we figured it out. Including irc chatter and this blog post, I have roughly 3 hours invested in this experiment.
I have been using alert manager for a few months to help me track performance regressions. It is time to take it to the next level and increase productivity with it.
Yesterday I created a wiki page outlining the project. Today I filed a bug of bugs to outline my roadmap.
Basically we have:
* a bunch of little UI polish bugs
* some optimizations
* addition of reporting
* more work flow for easier management and investigations
In the near future we will work on making this work for high resolution alerts (i.e. each page that we load in talos collects numbers, and we need to figure out how to track regressions on those instead of on the highly summarized version of a collection).
Thanks for looking at this; improving tools always allows for higher quality work.
Last week I wrote a post with some thoughts on AutoLand and Try Server. It drew some wonderful comments, and because of that I have continued to think in the same problem space a bit more.
What if we rerun the test case that caused the job to go orange (yes, for a crash, leak, or shutdown timeout we would rerun the entire job), and if it comes back green we deem the failure intermittent and ignore it?
With some of the work being done in bug 1014125, we could achieve this outside of buildbot and the job could repeat itself inside the single job instance yielding a true green.
One thought: if it is a test failing, we might want to ensure that we run it 5 times and it only fails 1 time; otherwise it is too intermittent.
A second thought: we would do this on try by default for autoland, but still show the intermittents on integration branches.
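The first thought above could be sketched as a simple policy (a rough sketch only; the function names, labels, and the reading of the "too intermittent" cutoff are mine):

```python
def classify_failure(run_test, runs=5, max_failures=1):
    """Rerun a failing test `runs` times. Treat the original failure as
    intermittent (ignorable for autoland) only if it reproduces at most
    `max_failures` times; otherwise it fails too often to wave through."""
    failures = sum(1 for _ in range(runs) if not run_test())
    if failures == 0:
        return "green"            # failure did not reproduce at all
    if failures <= max_failures:
        return "intermittent"     # rare enough to ignore
    return "too-intermittent"     # fails too often: needs a human

print(classify_failure(lambda: True))  # "green": all 5 reruns passed
```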
I will eventually get to my thoughts on massive chunking, but for now, let's hear more pros and cons of this idea/topic.
This is the first post in a series where I will post some ideas. These are ideas, not active projects (although these ideas could be implemented alongside many active projects).
My first idea surrounds the concept of AutoLand. Mozilla has talked about this for a long time. In fact, a conversation I had last week got me thinking more about the value of AutoLand vs. blocking on various aspects of it. There are a handful of issues blocking us from a system where we push to try and, if it goes well, magically land the patch on the tip of a tree. My vested interest is in the “if it goes well” part.
The argument here has been that we have so many intermittent oranges that until we fix those we cannot determine if a job is green. A joke for many years has been that it would be easier to win the lottery than to get an all-green run on tbpl. I have seen a lot of cases where people push to Try and land on Inbound, only to be backed out by a test failure – a test failure that was seen on Try (for the record, I personally have done this once). I am sure someone could write a book on the human behavior, tooling, and analysis of why failures land on integration branches when we have a try server.
My current thought is this:
* push to try server with a final patch, run a full set of tests and builds
* when all the jobs are done, we analyze the results of the jobs and look for 2 patterns
* pattern 1: for a given build, at most 1 job fails
* pattern 2: for a given job, at most 1 platform fails
* if patterns 1 + 2 pass, we put this patch in the queue for landing by the robots
 – we can determine the minimal number of jobs, or verify with more analysis (i.e. 1 mochitest can fail, 1 reftest can fail, 1 other can fail)
 – some jobs are run in different chunks: on opt, ‘dt’ runs all browser-chrome/devtools jobs, but this is ‘dt1’, ‘dt2’, ‘dt3’ on debug builds
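The two patterns are easy to state in code. A minimal sketch (the result data shape, the build/job names, and the function name are my assumptions, purely illustrative):

```python
def ok_to_autoland(results):
    """results[build][job] is "green" or "orange".
    Pattern 1: for a given build, at most 1 job fails.
    Pattern 2: for a given job, at most 1 platform (build) fails."""
    jobs = {job for per_build in results.values() for job in per_build}
    pattern1 = all(
        list(per_build.values()).count("orange") <= 1
        for per_build in results.values())
    pattern2 = all(
        sum(1 for per_build in results.values()
            if per_build.get(job) == "orange") <= 1
        for job in jobs)
    return pattern1 and pattern2

runs = {
    "linux64-opt": {"mochitest-1": "green", "reftest": "orange"},
    "win8-debug":  {"mochitest-1": "orange", "reftest": "green"},
}
print(ok_to_autoland(runs))  # True: no build or job fails more than once
```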
This simple approach would give us the confidence we need to reliably land patches on integration branches and achieve the same results as humans, if not better.
As a bonus, we could optimize our machine usage by not building/running all the jobs on the integration commit, because we already have a complete set done on try server.