Category Archives: Uncategorized

re-triggering for a [root] cause – version 1.57

Last week I wrote some notes about re-triggering jobs to find a root cause.  This week I decided to look at the orange factor email of the top 10 bugs and see how I could help.  Looking at each of the 10 bugs, I had 3 worth investigating and 7 I ignored.

Investigate:

  • Bug 1163911 test_viewport_resize.html – new test which was added 15 revisions back from the first instance in the bug.  The sheriffs had already worked to get this test disabled prior to my results coming in!
  • Bug 1081925 browser_popup_blocker.js – previous test in the directory was modified to work in e10s 4 revisions back from the first instance reported in the bug, causing this to fail
  • Bug 1118277 browser_popup_blocker.js (different symptom, same test pattern and root cause as bug 1081925)

Ignore:

  • Bug 1073442 – Intermittent command timed out; might not be code related and >30 days of history.
  • Bug 1096302 test_collapse.html | Test timed out. >30 days of history.
  • Bug 1151786 testOfflinePage. >30 days of history. (and a patch exists).
  • Bug 1145199 browser_referrer_open_link_in_private.js. >30 days of history.
  • Bug 1073761 test_value_storage.html. >30 days of history.
  • Bug 1161537 test_dev_mode_activity.html. resolved (a result from the previous bisection experiment).
  • Bug 1153454 browser_tabfocus.js. >30 days of history.

Looking at the bugs of interest, I jumped right into retriggering.  This time around I did 20 retriggers for the original changeset, then went back 30 revisions, doing the same thing on every 5th one.  Effectively this was doing 20 retriggers for the 0, 5th, 10th, 15th, 20th, 25th, and 30th revisions in the history list (140 retriggers).
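The arithmetic of that schedule can be sketched in a few lines. This is only an illustration: `retrigger_plan` and its parameters are hypothetical names, the revision list is a placeholder, and nothing here calls mozci.

```python
def retrigger_plan(revisions, step=5, max_back=30, retriggers=20):
    """Return (revision, retrigger_count) pairs for every `step`-th revision.

    `revisions` is ordered newest first, so index 0 is the changeset where
    the failure was first reported.
    """
    offsets = range(0, max_back + 1, step)  # 0, 5, 10, ..., 30
    return [(revisions[i], retriggers) for i in offsets if i < len(revisions)]


# Placeholder history: 31 fake revision ids, newest first.
history = [f"rev{n}" for n in range(31)]
plan = retrigger_plan(history)
total = sum(count for _, count in plan)
print(len(plan), total)  # 7 revisions, 140 retriggers
```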

I ran into issues doing this, specifically on Bug 1073761.  The reason is that for about 7 revisions in history the Windows 8 builds failed!  Luckily the builds finished enough to get a binary+tests package so we could run tests, but mozci didn’t understand that the build was available.  That required some manual retriggering.  In a few cases across both sets of retriggers there were actual build failures, which meant I had to manually pick a different revision to retrigger on.  It was then fairly easy to run my tool again and fill in the 4 missing revisions using slightly different mozci parameters.

This was a bit frustrating as there was a lot of manual digging and retriggering due to build failures.  Luckily 2 of the top 10 bugs have the same root cause and we figured it out.  Including IRC chatter and this blog post, I have roughly 3 hours invested in this experiment.

Filed under Uncategorized

Alert Manager has more documentation and a roadmap

I have been using alert manager for a few months to help me track performance regressions. It is time to take it to the next level and increase productivity with it.

Yesterday I created a wiki page outlining the project. Today I filed a bug of bugs to outline my roadmap.

Basically we have:
* a bunch of little UI polish bugs
* some optimizations
* addition of reporting
* more workflow for easier management and investigations

In the near future we will work on making this work for high resolution alerts (i.e. each page that we load in talos collects numbers and we need to figure out how to track regressions on those instead of the highly summarized version of a collection).

Thanks for looking at this; improving tools always allows for higher quality work.

More thoughts on Auto-land and try server

Last week I wrote a post with some thoughts on AutoLand and Try Server.  It received some wonderful comments, and because of that I have continued to think in the same problem space a bit more.

In chatting with Vaibhav1994 (who is really doing an awesome GSoC project this summer for Mozilla), we started brainstorming another way to resolve our intermittent orange problem.

What if we rerun the test case that caused the job to go orange (yes, for a crash, leak, or shutdown timeout we would rerun the entire job), and if it is green we deem the failure intermittent and ignore it?

With some of the work being done in bug 1014125, we could achieve this outside of buildbot and the job could repeat itself inside the single job instance yielding a true green.

One thought- we might want to ensure that if it is a test failing, we run it 5 times and it fails only 1 time; otherwise it is too intermittent.
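A rough sketch of that rule, under one reading of it: the original orange run counts as the first of 5 runs, and any further failure makes the test too intermittent to ignore. The function name, the labels it returns, and the `run_test` callable are all hypothetical; this is not how buildbot or any harness does it today.

```python
from typing import Callable


def classify_failure(run_test: Callable[[], bool],
                     total_runs: int = 5,
                     max_failures: int = 1) -> str:
    """Rerun a failed test and decide whether the failure can be ignored.

    `run_test` returns True when the test passes. The run that turned the
    job orange is counted as the first failure, so only `total_runs - 1`
    reruns happen here.
    """
    failures = 1  # the original orange run
    for _ in range(total_runs - 1):
        if not run_test():
            failures += 1
        if failures > max_failures:
            return "too-intermittent"  # stop early, no point rerunning more
    return "ignore"


# A test that passes on every rerun is treated as an ignorable intermittent.
print(classify_failure(lambda: True))  # ignore
```

Stopping at the first extra failure keeps the cost of a genuinely broken test to a single rerun.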

A second thought- we would do this on try by default for autoland, but still show the intermittents on integration branches.

I will eventually get to my thoughts on massive chunking, but for now, let's hear more pros and cons of this idea/topic.

Thoughts on Auto-Land, Try server, and intermittent oranges outline

This is the first post in a series where I will post some ideas.  These are ideas, not active projects (although these ideas could be implemented with many active projects).

My first idea is surrounding the concept of AutoLand.  Mozilla has talked about this for a long time.  In fact a conversation I had last week got me thinking more of the value of AutoLand vs blocking on various aspects of it.  There are a handful of issues blocking us from a system where we push to try and if it goes well we magically land it on the tip of a tree.  My vested interest comes in the part of “if it goes well”.

The argument here has been that we have so many intermittent oranges that, until we fix those, we cannot determine if a job is green.  A joke for many years has been that it would be easier to win the lottery than to get an all green run on tbpl.  I have seen a lot of cases where people push to Try and land on Inbound only to be backed out by a test failure- a test failure that was seen on Try (for the record I personally have done this once).  I am sure someone could write a book on human behavior, tooling, and analysis of why failures land on integration branches when we have try server.

My current thought is this-

* push to try server with a final patch, run a full set of tests and builds

* when all the jobs are done [1], we analyze the results of the jobs and look for 2 patterns

* pattern 1: for a given build, at most 1 job fails

* pattern 2: for a given job [2], at most 1 platform fails

* if pattern 1 + 2 pass, we put this patch in the queue for landing by the robots

[1] – we can determine the minimal amount of jobs or verify with more analysis (i.e. 1 mochitest can fail, 1 reftest can fail, 1 other can fail)

[2] – some jobs are run in different chunks.  On opt, ‘dt’ runs all browser-chrome/devtools jobs, but this is ‘dt1’, ‘dt2’, ‘dt3’ on debug builds
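The two patterns above can be sketched as a small check over finished job results. Everything here is an assumption made for illustration: the `(build, job, passed)` tuple layout, the build names, treating each build as its own platform, and folding chunks together by stripping trailing digits (so ‘dt1’ and ‘dt’ count as the same job).

```python
import re
from collections import defaultdict


def base_job(job):
    """Fold chunked job names together, e.g. 'dt1' -> 'dt' (hypothetical rule)."""
    return re.sub(r"\d+$", "", job)


def can_autoland(results):
    """Apply pattern 1 and pattern 2 to (build, job, passed) tuples."""
    failed_jobs_per_build = defaultdict(set)
    failed_builds_per_job = defaultdict(set)
    for build, job, passed in results:
        if not passed:
            failed_jobs_per_build[build].add(job)
            failed_builds_per_job[base_job(job)].add(build)
    # Pattern 1: for a given build, at most 1 job fails.
    pattern1 = all(len(jobs) <= 1 for jobs in failed_jobs_per_build.values())
    # Pattern 2: for a given job, at most 1 platform fails.
    pattern2 = all(len(builds) <= 1 for builds in failed_builds_per_job.values())
    return pattern1 and pattern2


try_push = [
    ("linux64-opt", "dt", True),
    ("linux64-debug", "dt1", False),  # one failing job on this build
    ("linux64-debug", "dt2", True),
    ("win8-opt", "reftest", False),   # a different job fails elsewhere
]
print(can_autoland(try_push))  # True
```

If ‘dt3’ also failed on another build, pattern 2 would reject the push, because the same job would then be failing on two platforms.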

 

This simple approach would give us the confidence that we need to reliably land patches on integration branches and achieve the same if not better results than humans.

As a bonus, we could optimize our machine usage by not building/running all jobs on the integration commit because we have a complete set done on try server.

 

Firefox 32 leaves the train station- what does the performance look like

Now that we have an uplift completed and enough subsequent data has been collected for sustained changes to be detected automatically, it is time for the triple fortnightly report of what performance looks like.  For reference there is some data in a blog post about general talos numbers.

Firefox 32 uplift, m-c -> Aurora (tracking bug 1004427):

  • 20 – regressions (3 CART, 3 TART, 3 SVG, 3 TResize, and some one off tests)
    • 18 regressions are on windows, the majority a result of OMTC being turned on (this is more of a rebaselining of tests than actual regressions)
    • 3 bugs are tracking all 20 regressions!
  • 43 – improvements (15 Kraken/V8/Dromaeo, 2 SVG, 7 TScroll, 4 TART/CART, 4 Paint, 9 SessionRestore, and a couple others)
  • the improvements are distributed among Windows, Mac, and Linux

Firefox 31 uplift, m-c -> Aurora (tracking bug 990085):

Firefox 30 uplift, m-c -> Aurora:

  • 26 – regressions (4 TART, 4 SVG, 3 TS, Paint, and many more)
    • 2 remaining bugs not resolved as we are now on Beta (bug 990183, bug 990194)

As you can see, Firefox 32 has a lot of improvements and fewer regressions (of those 20, about half are related to rebaselining numbers).

Let's look at bugs:

  • 36 bugs filed to date for Firefox 32 Talos regressions
  • 16 are resolved (7 as wontfix)
  • 20 are open (this means that 17 of them are only showing up on non-pgo)

 

After reviewing the process of investigating alerts, it makes sense that we continue forward with the same process in 6 week intervals and any changes are made on uplift day and they would apply only to trunk.  Some future changes we are considering:

  • not filing bugs on minimal regressions (ex. <4%)
  • not filing bugs on non-pgo only regressions (since we only build pgo on Aurora, Beta, Release)
  • generating alerts for per test (not per suite) regressions (and only file bugs if a single test is >10%)
  • adjust the graph server alert calculation to not drop the page with the highest value and to report the geometric mean of the pages instead of the average
  • any other great ideas you have on how to be efficient with our time while continuing to identify and document our regressions
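To make the proposed change to the alert calculation concrete, here is a small comparison of the two summaries. The function names and the sample page values are made up for illustration, and this is not the graph server's actual code; it only follows the description above (drop the highest page and average, versus a geometric mean over all pages).

```python
import math


def summarize_drop_max_mean(page_values):
    """Described current behavior: drop the highest page, average the rest."""
    kept = sorted(page_values)[:-1]
    return sum(kept) / len(kept)


def summarize_geometric_mean(page_values):
    """Proposed behavior: geometric mean of every page (values must be > 0)."""
    logs = [math.log(value) for value in page_values]
    return math.exp(sum(logs) / len(logs))


# Hypothetical per-page load times for one run; the slowest page regresses.
before = [100.0, 120.0, 900.0]
after = [100.0, 120.0, 1800.0]

# Dropping the highest page hides the regression entirely.
print(summarize_drop_max_mean(before) == summarize_drop_max_mean(after))
# The geometric mean moves because every page contributes.
print(summarize_geometric_mean(after) > summarize_geometric_mean(before))
```

Both comparisons print `True`: the drop-the-max average is blind to a regression in the slowest page, while the geometric mean reflects it without letting that one page dominate.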

Onward to Firefox 33!

Looking for long term trends and patterns in how I work

Early last year (2013), I noticed I would be really productive for a couple of weeks, and then get in a rut for a week here and there.  After discussing this perceived trend with Clint, I started tracking it every week (end of work day on Friday).  I have been tracking it for a year, and now I have data to examine in more detail:

[Image: graph of the weekly tracking data]

For the first part of last year (through September 2013), I would go in 6 week cycles which appeared to be about 1 week after the uplifts.  Oddly enough I wasn’t doing any specific work for uplifts, but I do recall a lot of odd issues that required debugging for each uplift.  Quite possibly the day or two spent handling these issues resulted in me getting backlogged on emails.

Oddly enough when I transitioned from full time mobile automation -> full time performance automation, my cycles became more regular.  One exception was a focused project development week early in 2014 which had me doing other tasks and getting behind on a few other projects.

There is no direct correlation in the health data, but I have some theories.  I record my general feeling of health (for the most part physical, not emotional).  This is purely a judgment call and there is no science behind it.  10 is good, 0 is bad, and when there is a dip in health on the graph, I usually see an increase in email volume the next week.  No explanation for that, just an observation.

In summary, I have enjoyed looking back on this data.  It was good to see a trend for most of 2013.  Maybe next year I will see a different trend or pattern.

Some thoughts on being a good mentor

I have done a good deal of mentored bugs as well as mentoring new Mozillians (gsoc, interns, employees) on their journey.  I would like to share a few things which I have found that make things easier.  Most of this might seem like common sense, but I find it so easy to overlook little details and forget things.

  1. When filing a bug (or editing an existing one), make sure to include:
    • Link(s) to getting started with the code base (cloning, building, docs, etc.)
    • Clear explanation of what is expected in the bug
    • A general idea of where the problematic code is
    • What testing should look like
    • How to commit a patch
    • A note of how best to communicate and not to worry about asking questions
    • Avoid shorthand and acronyms!
  2. Spend a few minutes via IRC/Email getting to know your new friend, especially timezones and general schedule of availability.
  3. Make it a priority to do quick reviews and answer questions – nothing is more discouraging than having 1 thing to work on and needing to wait for further information.
  4. It is your job to help them be effective – take the time to explain why coding styles and testing are important and how it is done at Mozilla.
  5. Make it clear how their current work plays a role in Mozilla as a whole.  Nobody likes to work on something that is not valued.
  6. Granting access to Try server (or as a contributor to a git repository) really makes a new contributor feel welcome and part of the team; consider doing this sooner rather than later.  With this comes the responsibility of teaching them how to use their privileges responsibly!
  7. Pay attention to details- forming good habits up front goes a long way!

With those things said, just try to put yourself in the shoes of a new Mozillian.  Would you want honest feedback?  Would you want to feel part of the larger community?

Being a good mentor should be rewarding (the majority of the time) and result in great Mozillians who people enjoy working with.

Let's continue to grow Mozilla!
