Tag Archives: automation

5 days in Portland with Mozillians and 10 great things that came from it

I took a lot of notes in Portland last week.  You might not guess that, given that I talked so much my voice gave out by the second day.  Either way, while chatting with some co-workers yesterday about what we took away from Portland, I realized there is a long list of awesomeness.

Let me caveat this by saying that some of these ideas have been discussed in the past, but despite our efforts to work with others and field interesting and useful ideas, a big list of great things came to light while chatting in person:

  • :bgrins mentioned a mozscreenshot tool and the need to get screenshots of new features under development on various platforms so UX can review the changes.  Currently the process is to ask UX to download a build from try (or some other location) and run it locally to see the changes.
  • :heycam/:jwatt – we had a great and interesting Talos discussion, mostly around how to run it and validate patches/fixes locally and on try server (check out bug 1109243).
  • :glandium is looking at making some changes (I recall something with build/pgo) and wanted to know how to compare some Talos numbers to help make the right decision – this can be done either with bug 1109243 or with the existing compare.py in the Talos repo (we might need some cleanup on this).
  • :bobowen has been working to get csb tests working.  After chatting while in line to board a plane, it became clear he needs to solve some finer-grained test selection problems, many of which the ateam has on a roadmap for Q2/Q3.  I see some tighter collaboration happening here.
  • Thanks to chatting with :lsblakk, I am motivated to expand the Talos sheriff team and look for dedicated Mozillians (or soon-to-be Mozillians) to help keep a lid on the alerts and the overall state of performance (based on what we measure).
  • :lightsofapollo had a great conversation with me about TaskCluster and what barriers stood in the way of running Talos on it – this will result in some initial investigation work!
  • :kats was asking me how to generate alerts for areweslimyet.com.  This is very doable by posting data to graph server.
  • After a good session on how to handle intermittents (it seems like the same people have this conversation every time a bunch of Mozillians get together), I am motivated to push Titanic further to find the root cause of an intermittent via brute-force retriggers (ideally on weekends).  In fact :dbaron has done this a few times in the last month, and so have the sheriffs.  This is similar to what we do to verify a Talos regression, just with some different parameters.
  • The same conversation about intermittents yielded a stronger desire to look at new tests coming into the system and validate their stability.  The simple solution is to run the job 100 times, verify that the new test didn't have issues, and then leave it alone (a rough sketch of such a check follows this list).  Of course we could get smart and do this for all test_* files that are edited in the tree.  Thanks to :ehsan for spawning this conversation.
  • Discussing the idea of a Talos sheriff with a few folks, it seems further conversations are needed with the existing sheriff team, as well as with :vladan and :avih, about what type of policy we should have for the performance failures we detect.  I would expect some changes to be made early next year as we add more tests and need more help.  My initial thoughts are specifically about responding to regressions or getting backed out within XX hours.  Yeah, that sounds nasty, but there are probably cut-and-dried parameters we can set and start enforcing.
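
To make the "run it 100 times" idea a bit more concrete, here is a minimal sketch of what that stability check could look like.  This is purely illustrative: the retrigger_job() helper, the test name, and the failure threshold are hypothetical stand-ins; in practice the retriggers would happen on try and the results would be read back from the job reports rather than simulated locally.

    import random

    def retrigger_job(test_name):
        """Hypothetical stand-in for retriggering a job and reading back its result."""
        # Pretend the job fails ~2% of the time; a genuinely stable test
        # would pass every retrigger.
        return random.random() > 0.02

    def is_stable(test_name, runs=100, max_failures=0):
        """Run (or retrigger) the job `runs` times and count failures."""
        failures = sum(1 for _ in range(runs) if not retrigger_job(test_name))
        return failures <= max_failures

    # Example: validating a brand new test file before leaving it alone.
    print(is_stable("test_new_feature.html"))  # usually False for a 2%-flaky test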

Those are 10 specific topics that came out of being in the same place at the same time, despite everybody knowing how to contact me or the ateam to share great ideas or frustrations.

Thinking through this, when I see these folks in a real office while working from there for a few days or a week, the conversations seem smaller and less intense, usually just small talk whilst waiting for a build to finish.  I believe the real innovation behind getting these topics surfaced is setting aside time where we are not expected to focus on our day-to-day work and can instead make plans for the future.


1 Comment

Filed under general

Thoughts on Auto-Land, Try server, and intermittent oranges outline

This is the first post in a series where I will post some ideas.  These are ideas, not active projects (although these ideas could be implemented alongside many active projects).

My first idea surrounds the concept of AutoLand.  Mozilla has talked about this for a long time.  In fact, a conversation I had last week got me thinking more about the value of AutoLand vs. blocking on various aspects of it.  There are a handful of issues blocking us from a system where we push to try and, if it goes well, we magically land the patch on the tip of a tree.  My vested interest is in the "if it goes well" part.

The argument here has been that we have so many intermittent oranges that, until we fix those, we cannot determine whether a job is green.  A joke for many years has been that it would be easier to win the lottery than to get an all-green run on tbpl.  I have seen a lot of cases where people push to Try and land on Inbound, only to be backed out by a test failure that was already seen on Try (for the record, I personally have done this once).  I am sure someone could write a book on the human behavior, tooling, and analysis of why failures land on integration branches when we have try server.

My current thought is this (a rough code sketch follows the outline and footnotes below):

* push to try server with a final patch, run a full set of tests and builds

* when all the jobs are done [1], we analyze the results of the jobs and look for 2 patterns

* pattern 1: for a given build, at most 1 job fails

* pattern 2: for a given job [2], at most 1 platform fails

* if pattern 1 + 2 pass, we put this patch in the queue for landing by the robots

[1] – we can determine the minimal set of jobs required, or verify with more analysis (e.g. 1 mochitest can fail, 1 reftest can fail, 1 other can fail)

[2] – some jobs are run in different chunks.  On opt builds, ‘dt’ runs all browser-chrome/devtools jobs, but on debug builds this is split into ‘dt1’, ‘dt2’, ‘dt3’
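
Here is a minimal sketch of that pattern check, just to make the idea concrete.  The data format (a list of (platform, job, passed) tuples) and the function name are assumptions for illustration; a real implementation would read job outcomes from whatever system reports the try push results.

    from collections import defaultdict

    def ready_to_autoland(results):
        """results: iterable of (platform, job_name, passed) tuples for one try push."""
        failures_per_platform = defaultdict(int)  # pattern 1: failed jobs per build/platform
        failures_per_job = defaultdict(int)       # pattern 2: failed platforms per job

        for platform, job, passed in results:
            if not passed:
                failures_per_platform[platform] += 1
                failures_per_job[job] += 1

        # pattern 1: for a given build, at most 1 job fails
        pattern1 = all(count <= 1 for count in failures_per_platform.values())
        # pattern 2: for a given job, at most 1 platform fails
        pattern2 = all(count <= 1 for count in failures_per_job.values())

        return pattern1 and pattern2

    # Example: one mochitest failure on linux64 and one reftest failure on win7
    # would still be considered green enough to queue for auto-landing.
    results = [
        ("linux64", "mochitest-1", True),
        ("linux64", "reftest", False),
        ("win7", "mochitest-1", False),
        ("win7", "reftest", True),
    ]
    print(ready_to_autoland(results))  # True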

 

This simple approach would give us the confidence we need to reliably land patches on integration branches and achieve the same, if not better, results than humans do.

As a bonus, we could optimize our machine usage by not building/running all the jobs on the integration commit, because we already have a complete set of results from try server.

 

9 Comments

Filed under Uncategorized

Are there any trends in our Talos regression bugs?

Now that we have a better process for taking action on Talos alerts and pushing them to resolution, it is time to take a step back and see if any trends show up in our bugs.

First I want to look at bugs filed/week:

[Chart: Talos regression bugs filed per week]

This is fun to see.  Now, what if we stack this up side by side with the alerts we receive:

[Chart: bugs filed per week alongside alerts received]

We started tracking alerts halfway through this process.  The data shows that we file a bug for about 1 out of every 25 alerts.  I had previously stated it was closer to 1 in 33 (it appears that figure was skewed by averaging in the first few weeks).

Let’s see where these bugs are filed; here is a view of the different Bugzilla products:

[Chart: Talos regression bugs by Bugzilla product]

The Testing product is used for bugs where we cannot figure out the exact changeset, so they get filed in Testing::Talos.  As bugs are filed across almost 30 unique components, I took a few minutes to look at the Core product; here is where the bugs live in Core:

[Chart: Talos regression bugs by Core component]

Pardon my bad graphing attempt here with the component names cut off.  Graphics is the clear winner for regressions (with “graphics: layers” being a large part of it).  Of course the JavaScript engine and DOM would be there (a lot of our tests are sensitive to changes in those areas).  This really shows where our test coverage is, more than where bad code lives.

Now that I know where the bugs are, here is a view of how long the bugs stay open:

[Chart: how long Talos regression bugs stay open]

The fantastic news is that most of our bugs are resolved in <=15 days!  I think this is a metric we can track and get better at, ideally closing all Talos regression bugs in <30 days.

Looking over all the bugs we have, what is the status of them?

[Chart: status of all Talos regression bugs]

Yay for the blue pacman!  We have a lot of new bugs rather than assigned bugs; we might adjust that and assign owners once a bug is confirmed and briefly discussed, but that is still up in the air.

The burning question is what are all the bugs resolved as?

[Chart: resolutions of the resolved Talos regression bugs]

To me this seems healthy, and it is a starting point.  Tracking this over time will probably be a useful metric!

 

In summary, many developers have done great work to make improvements and land fixes over the last 6 months that we have been tracking this information.  There are things we can do better, and I want to know:

What information provided today is useful to track regularly?

Is there something you would rather see?

 

3 Comments

Filed under Uncategorized

The lifecycle of a Talos performance regression

[Image: the lifecycle of a Talos performance regression]

The cycle of landing a change to Firefox that affects performance

Leave a comment

May 8, 2014 · 9:38 am

Hey, SessionRestore- you have a Talos test

As of last Friday, bug 936630 landed, so we now have sessionrestore (and sessionrestore_no_auto_restore) as 2 new tests in the Talos suite (posting results under the magic ‘o’ key on tbpl).  In fact, we have already seen these tests show improvements.

Thanks to Yoric for creating these new tests; give him some karma on IRC!  Please refer to the Talos docs if you want more information on these tests.

 

Leave a comment

Filed under Uncategorized

Performance Alerts – by the numbers

If you have ever received an automated mail about a performance regression, and then 10 more, you are probably frustrated by the volume of alerts.  Six months ago I started looking at the alerts and filing bugs, and 10 weeks ago a little tool was written to help out.

What have I seen in 10 weeks:

1926 alerts on mozilla.dev.tree-management for Talos, resulting in 58 bugs filed (or about 1 bug per 33 alerts):

[Chart: Talos alerts per week over the last 10 weeks]

*Keep in mind that many alerts are improvements, and many are duplicated between trees and between PGO/non-PGO builds.

 

Now for some numbers as we uplift.  How are we doing from one release to another?  Are we regressing?  Improving?  These are all questions I would like to answer in the coming weeks.

Firefox 30 uplift, m-c -> Aurora:

  • 26 regressions (4 TART, 4 SVG, 3 TS, Paint, and many more)
    • 2 remaining bugs not resolved as we are now on Beta (bug 990183, bug 990194)

 

Firefox 31 uplift, m-c -> Aurora (tracking bug 990085):

 

Is this useful information?

Are there questions you would rather I answer with this data?

 

3 Comments

Filed under Uncategorized

Vaibhav has a blog- a perspective of a new Mozilla hacker

As I mentioned earlier this year, I have had the pleasure of working with Vaibhav.  Time has passed, he continues to contribute to Mozilla, and he will be participating in Google Summer of Code with Mozilla this year.  I am excited.

He now has a blog; while there is only one post so far, he will be posting roughly weekly with updates on his GSoC project and other fun topics.

Leave a comment

Filed under Uncategorized