A case of the weekends?

Case of the Mondays

What was famous 15 years ago as a case of the Mondays has manifested itself in Talos.  In fact, I wonder why I get so many regression alerts on Monday as compared to other days.  It is more to a point of we have less noise in our Talos data on weekends.

Take for example the test case tresize:

linux32,

* in fact we see this on other platforms as well linux32/linux64/osx10.8/windowsXP

30 days of linux tresize

Many other tests exhibit this.  What is different about weekends?  Is there just less data points?

I do know our volume of tests go down on weekends mostly as a side effect of less patches being landed on our trees.

Here are some ideas I have to debug this more:

  • Run massive retrigger scripts for talos on weekends to validate # of samples is/isnot the problem
  • Reduce the volume of talos on weekdays to validate the overall system load in the datacenter is/isnot the problem
  • compare the load of the machines with all branches and wait times to that of the noise we have in certain tests/platforms
  • Look at platforms like windows 7, windows 8, and osx 10.6 as to why they have more noise on weekends or are more stable.  Finding the delta in platforms would help provide answers

If you have ideas on how to uncover this mystery, please speak up.  I would be happy to have this gone and make any automated alerts more useful!

2 Comments

Filed under Uncategorized

Say hi to Kaustabh Datta Choudhury, a newer Mozillian

A couple months ago I ran into :kaustabh93 online as he had picked up a couple good first bugs.  Since then he has continued to work very hard and submit a lot of great pull requests to Ouija and Alert Manager (here is his github profile).  After working with him for a couple of weeks, I decided it was time to learn more about him, and I would like to share that with Mozilla as a whole:

Tell us about where you live-

I live in a town called Santragachi in West Bengal. The best thing about this place is its ambience. It is not at the heart of the city but the city is easily accessible. That keeps the maddening crowd of the city away and a calm and peaceful environment prevails here.

Tell us about your school-

 I completed my schooling from Don Bosco School, Liluah. After graduating from there, now I am pursuing an undergraduate degree in Computer Science & Engineering from MCKV Institute of Engineering.

Right from when it was introduced to me, I was in love with the subject ‘Computer Science’. And introduction to coding was one of the best things that has happened to me so far.

Tell us about getting involved with Mozilla-

I was looking for some exciting real life projects to work on during my vacation & it was then that the idea of contributing to open source projects had struck me. Now I have been using Firefox for many years now and that gave me an idea of where to start looking. Eventually I found the volunteer tab and thus started my wonderful journey on Mozilla.

Right from when I was starting out, till now, one thing that I liked very much about Mozilla was that help was always at hand when needed. On my first day , I popped a few questions in the IRC channel #introduction & after getting the basic of where to start out, I started working on Ouija under the guidance of ‘dminor’ & ‘jmaher’. After a few bug fixes there, Dan recommended me to have a look at Alert Manager & I have been working on it ever since. And the experience of working for Mozilla has been great.

Tell us what you enjoy doing-

I really love coding. But apart from it I also am an amateur photographer & enjoy playing computer games & reading books.

Where do you see yourself in 5 years?

In 5 years’ time I prefer to see myself as a successful engineer working on innovative projects & solving problems.

If somebody asked you for advice about life, what would you say?

Rather than following the crowd down the well-worn path, it is always better to explore unchartered territories with a few.

:kaustabh93 is back in school as of this week, but look for activity on bugzilla and github from him.  You will find him online once in a while in various channels, I usually find him in #ateam.

4 Comments

Filed under Uncategorized

I invite you to join me in welcoming 3 new tests to Talos

2 months ago, we added session restore tests to Talos, today I am pleased to announce 3 new tests:

  • media_tests – only runs on linux64 and is our first test to measure audio processing.  Much thanks to :jesup, and Suhas Nandaku from Cisco.
  •  tp5o_scroll - imagine if tscrollx and tp5 had a child- not only do we load the page, but we scroll the page.  Big thanks go to :avih for tackling this project.
  •  glterrain – The first webgl benchmark to show up in Talos.  Credit goes to :avih for driving this and delivering it.  There are others coming, this was the easiest to get going.

 

Stay tuned in the coming weeks as we have 2 others tests in the works:

  • ts_paint_cold
  • mainthreadio detection

 

Leave a comment

Filed under testdev

Alert Manager has a more documentation and a roadmap

I have been using alert manager for a few months to help me track performance regressions. It is time to take it to the next level and increase productivity with it.

Yesterday I created a wiki page outlining the project. Today I filed a bug of bugs to outline my roadmap.

Basically we have:
* a bunch of little UI polish bugs
* some optimizations
* addition of reporting
* more work flow for easier management and investigations

In the near future we will work on making this work for high resolution alerts (i.e. each page that we load in talos collects numbers and we need to figure out how to track regressions on those instead of the highly summarized version of a collection).

Thanks for looking at this, improving tools always allows for higher quality work.

Leave a comment

Filed under Uncategorized

More thoughts on Auto-land and try server

Last week I wrote a post with some thoughts on AutoLand and Try Server, this had some wonderful comments and because of that I have continued to think in the same problem space a bit more.

In chatting with Vaibhav1994 (who is really doing an awesome GSoC project this summer for Mozilla), we started brainstorming another way to resolve our intermittent orange problem.

What if we rerun the test case that caused the job to go orange (yes in a crash, leak, shutdown timeout we would rerun the entire job) and if it was green then we could deem the failure as intermittent and ignore it!

With some of the work being done in bug 1014125, we could achieve this outside of buildbot and the job could repeat itself inside the single job instance yielding a true green.

One thought- we might want to ensure that if it is a test failing that we run it 5 times and it only fails 1 time, otherwise it is too intermittent.

A second thought- we would do this by try by default for autoland, but still show the intermittents on integration branches.

I will eventually get to my thoughts on massive chunking, but for now, lets hear more pros and cons of this idea/topic.

6 Comments

Filed under Uncategorized

Thoughts on Auto-Land, Try server, and intermittent oranges outline

This is the first post is a series where I will post some ideas.  These are ideas, not active projects (although these ideas could be implemented with many active projects).

My first idea is surrounding the concept of AutoLand.  Mozilla has talked about this for a long time.  In fact a conversation I had last week got me thinking more of the value of AutoLand vs blocking on various aspects of it.  There are a handful of issues blocking us from a system where we push to try and if it goes well we magically land it on the tip of a tree.  My vested interest comes in the part of “if it goes well”.

The argument here has been that we have so many intermittent oranges and until we fix those we cannot determine if a job is green.  A joke for many years has been that it would be easier to win the lottery than to get an all green run on tbpl.  I have seen a lot of cases where people push to Try and land on Inbound to only be backed out by a test failure- a test failure that was seen on Try (for the record I personally have done this once).  I am sure someone could write a book on human behavior, tooling, and analysis of why failures land on integration branches when we have try server.

My current thought is this-

* push to try server with a final patch, run a full set of tests and builds

* when all the jobs are done [1], we analyze the results of the jobs and look for 2 patterns

* pattern 1: for a given build, at most 1 job fails

* pattern 2: for a given job [2], at most 1 platform fails

* if pattern 1 + 2 pass, we put this patch in the queue for landing by the robots

[1] – we can determine the minimal amount of jobs or verify with more analysis (i.e. 1 mochitest can fail, 1 reftest can fail, 1 other can fail)

[2] – some jobs are run in different chunks.  on opt ‘dt’ runs all browser-chrome/devtools jobs, but this is ‘dt1′, ‘dt2′, ‘dt3′ on debug builds

 

This simple approach would give us the confidence that we need to reliably land patches on integration branches and achieve the same if not better results than humans.

For the bonus we could optimize our machine usage by not building/running all jobs on the integration commit because we have a complete set done on try server.

 

9 Comments

Filed under Uncategorized

Firefox 32 leaves the train station- what does the performance look like

Now that we have an uplift completed and enough future data has been collected to ensure sustained changes in data automatically, it is time for the triple fortnightly report of what performance looks like.  For reference there is some data in a blog post about general talos numbers.

Firefox 32 uplift, m-c -> Aurora (tracking bug 1004427):

  • 20 – regressions (3 CART, 3 TART, 3 SVG, 3 TResize, and some one off tests)
    • 18 regressions are on windows, the majority a result of OMTC being turned on (this is more of a rebaselining of tests than actual regressions)
    • 3 bugs are tracking all 20 regressions!
  • 43 – improvements (15 Kraken/V8/Dromaeo, 2 SVG, 7 TScroll, 4 TART/CART, 4 Paint, 9 SessionRestore, and a couple others)
  • the Improvements are distributed amongst Windows, Mac, Linux

Firefox 31 uplift, m-c -> Aurora (tracking bug 990085):

Firefox 30 uplift, m-c -> Aurora:

  • 26 – regressions (4 TART, 4 SVG, 3 TS, Paint, and many more)
    • 2 remaining bugs not resolved as we are now on Beta (bug 990183, bug 990194)

As you can see Firefox32 has a lot of improvements and fewer regressions (of those 20 about half are related to rebasing numbers).

Lets look at bugs:

  • 36 bugs filed to date for Firefox32 Talos regressions
  • 16 are resolved (7 as wontfix)
  • 20 are open (this means that 17 of them are only showing up on non-pgo)

 

After reviewing the process of investigating alerts, it makes sense that we continue forward with the same process in 6 week intervals and any changes are made on uplift day and they would apply only to trunk.  Some future changes we are considering:

  • not filing bugs on minimal regressions (ex. <4%)
  • not filing bugs on non-pgo only regressions (since we only build pgo on Aurora, Beta, Release)
  • generating alerts for per test (not per suite) regressions (and only file bugs if a single test is >10%)
  • adjust the graph server alert calculation to not drop the page with the highest value and to report the geometric mean of the pages instead of the average
  • any other great ideas you have on how to be efficient with our time while continuing to identify and document our regressions

Onward to Firefox 33!

Leave a comment

Filed under Uncategorized