Monthly Archives: October 2017

Stockwell: flowchart for triage

I gave an update 2 weeks ago on the current state of Stockwell (intermittent failures).  I mentioned that additional posts were coming, and this is the second post in the series.

First off, the tree sheriffs, who handle merges between branches, tree closures, backouts, hot fixes, and many other actions that keep us releasing code, do one important task: they star failures to a corresponding bug.

[Flowchart: Sheriff]

These annotations are saved in Treeherder and Orange Factor.  Inside of Orange Factor, we have a robot that comments on bugs; this has been changing a bit more frequently this year to help meet our new triage needs.

Once bugs are annotated, we work on triaging them.  Our primary tool is Neglected Oranges, which gives us a view of all failures that meet our threshold and don’t have a human comment in the last 7 days.  Here is the next stage of the process:

[Flowchart: Triage]

As you can see, this is very simple, and it should be simple.  The ideal state is adding more information to the bug, which makes it easier for the person we NI? to prioritize the bug and make a decision:

[Flowchart: Comment]

While there is a lot more we can do, and much more that we have done, this seems to be the most effective use of our time when looking across the 1000+ bugs that we have triaged so far this year.
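The Neglected Oranges view described above boils down to a two-part filter. Here is a minimal sketch of that idea, not the actual tool: the `Bug` record and the `neglected` function are hypothetical names, and the threshold of 30 failures per week is an assumption borrowed from the triage policy mentioned elsewhere on this blog.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed values for illustration; the post only says "our threshold" and "7 days".
FAILURE_THRESHOLD = 30            # hypothetical: failures per week
QUIET_PERIOD = timedelta(days=7)  # no human comment in the last 7 days


@dataclass
class Bug:
    """Hypothetical record; not the real Orange Factor data model."""
    bug_id: int
    failures_this_week: int
    last_human_comment: Optional[datetime]  # None if no human has commented


def neglected(bugs: List[Bug], now: datetime) -> List[Bug]:
    """Return bugs at or over the threshold with no recent human comment."""
    result = []
    for bug in bugs:
        if bug.failures_this_week < FAILURE_THRESHOLD:
            continue  # below the threshold, not part of this view
        commented_recently = (
            bug.last_human_comment is not None
            and now - bug.last_human_comment < QUIET_PERIOD
        )
        if not commented_recently:
            result.append(bug)
    return result
```

Robot comments are deliberately left out of `last_human_comment` in this sketch, since the view only cares about whether a person has looked at the bug.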

In some cases a bug fails very frequently and there are no development resources to spend on fixing it.  These bugs will sometimes cross our policy limit of 200 failures in 30 days and get a [stockwell disabled-recommended] whiteboard tag.  We monitor these and work to disable the affected tests on a regular basis:

[Flowchart: Disable]

This isn’t as cut and dried as disabling every bug, but we do disable as quickly as possible and push hard on the bugs that are not as trivial to disable.
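The 200 failures in 30 days check can be pictured as a count over a trailing window. The sketch below is only an illustration: the function names are hypothetical, it assumes failures arrive as plain dates, and it assumes that reaching exactly 200 counts as crossing the line.

```python
from datetime import date, timedelta
from typing import Iterable, Optional

DISABLE_THRESHOLD = 200       # failures, from the policy described above
WINDOW = timedelta(days=30)   # trailing window length
DISABLE_TAG = "[stockwell disabled-recommended]"


def failures_in_window(failure_dates: Iterable[date], today: date) -> int:
    """Count failures dated within the 30 days ending today (inclusive)."""
    start = today - WINDOW
    return sum(1 for d in failure_dates if start < d <= today)


def whiteboard_tag(failure_dates: Iterable[date], today: date) -> Optional[str]:
    """Return the whiteboard tag once the trailing count reaches the threshold."""
    if failures_in_window(failure_dates, today) >= DISABLE_THRESHOLD:
        return DISABLE_TAG
    return None
```

As the post says, the tag is a recommendation rather than an automatic action; a person still decides how and where to disable.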

There are many new people working on Intermittent Triage and having a clear understanding of what they are doing will help you know how a random bug ended up with a ni? to you!


Filed under intermittents, testdev

A formal introduction to Ionut Goldan – Mozilla’s new Performance Sheriff and Tool hacker

About 8 months ago we started looking for a full time performance sheriff to help out with our growing number of alerts and needs for keeping the Talos toolchain relevant.

We got really lucky and ended up finding Ionut (:igoldan on irc, #perf).  Over the last 6 months, Ionut has done a fabulous job of learning how to understand Talos alerts, graphs, scheduling, and narrowing down root causes.  In fact, he has not only been able to easily handle all of the Talos alerts, but has also picked up alerts from Autophone (Android devices), Build Metrics (build times, installer sizes, etc.), AWSY (memory metrics), and Platform Microbenchmarks (tests run inside of gtest written by a few developers on the graphics and stylo teams).

While I could probably write a list of Ionut’s accomplishments and some tricky bugs he has sorted out, I figured your enjoyment of reading this blog is better spent on getting to know Ionut, so I did a Q&A with him so we can all learn much more about him.

Tell us about where you live?

I live in Iasi. It is a gorgeous and colorful town, somewhere in the North-East of Romania.  It is full of great places and enchanting sunsets. I love how a casual walk leads me to new, beautiful and peaceful neighborhoods.

I have many things I very much appreciate about this town: the people here, its continuous growth, its historical resonance, the fact that its streets once echoed the steps of the most important cultural figures of our country. It also resembles ancient Rome, as it is also built on 7 hills.

It’s pretty hard not to act like a poet around here.

What inspired you to be a computer programmer?

I wouldn’t say I was inspired to be a programmer.

During my last years in high school, I occasionally consulted with my close ones. Each time we concluded that IT is just the best domain to specialize in: it will improve continuously, there will be jobs available; things that are evident nowadays.

I found much inspiration in this domain after the first year in college, when I noticed the huge advances and how they’re conducted.  I understood we’re living in a whole new era. Digital transformation is now the coined term for what’s going on.

Any interesting projects you have done in the past (school/work/fun)?

I had the great opportunity to work with brilliant teams on a full advertising platform, from almost scratch.

It had almost everything: it was distributed, highly scalable, completely written in Python 3.X, the frontend adopted material design, NoSQL database in conjunction with SQL ones… It used some really cutting-edge libraries and it was a fantastic feeling.

Now it’s Firefox. The sound name speaks for itself and there are just so many cool things I can do here.

What hobbies do you have?

I like reading a lot. History and software technology are my favourite subjects.
I enjoy cooking, when I have the time. My favourite dish definitely is the Hungarian goulash.

Also, I enjoy listening to classical music.

If you could solve any massive problem, what would you solve?

Greed. Laziness. Selfishness. Pride.

We can resolve all problems we can possibly encounter by leveraging technology.

Keeping non-values like those mentioned above would ruin every possible achievement.

Where do you see yourself in 10 years?

In a peaceful home, being a happy and caring father, spending time and energy with my loved ones. Always trying to be the best example for them.  I envision becoming a top notch professional programmer, leading highly performant teams on sound projects. Always familiar with cutting-edge tech and looking to fit it in our tool set.

Constantly inspiring values among my colleagues.

Do you have any advice or lessons learned for new students studying computer science?

Be passionate about IT technologies. Always be curious and willing to learn about new things. There are tons and tons of very good videos, articles, blogs, newsletters, books, docs… Seek them out. Make use of them. Follow their guidance and advice.

Continuous learning is something very specific for IT. By persevering, this will become your second nature.

Treat every project as a fantastic opportunity to apply related knowledge you’ve acquired.  You need tons of coding to properly solidify all that theory, to really understand why you need to stick to the Open/Closed principle and all other nitty-gritty little things like that.

I have really enjoyed getting to know Ionut and working with him.  If you see him on IRC please ping him and say hi 🙂

 


Filed under Community

Talos tests- summary of recent changes

I have done a poor job of communicating status on our performance tooling; this is something I am trying to rectify this quarter.  Over the last 6 months many new talos tests have come online, along with some differences in scheduling or measurement.

In this post I will highlight many of the test related changes and leave other changes for a future post.

Here is a list of new tests that we run:

* cpstartup – (content process startup: thanks :gabor)
* sessionrestore many windows – (instead of one window and many tabs, thanks :beekill)
* perf-reftest[-singletons] – (thanks :bholley, :heycam)
* speedometer – (thanks :jmaher)
* tp6 (amazon, facebook, google, youtube) – (thanks :rwood, :armenzg)

These are also new tests, but slight variations on existing tests:

* tp5o + webextension, ts_paint + webextension (test web extension perf, thanks :kmag)
* tp6 + heavy profile, ts_paint + heavy profile (thanks :rwood, :tarek)

The next tests have been updated to be more relevant or reliable:

* damp (many subtests added, more upcoming, thanks :ochameau)
* tps – update measurements (thanks :mconley)
* tabpaint – update measurements (thanks :mconley)
* we run all talos tests on coverage builds (thanks :gmierz)

It is probably known to most, but earlier this year we stood up testing on Windows 10 and turned off our talos coverage on Windows 8 (big thanks to Q for making this happen so fast).

Some changes that might not be so obvious, but worth mentioning:

* Added support for Time to first non blank paint (only tp6)
* Investigated mozAfterPaint on non-empty rects and updated a few tests to measure properly
* Added support for comparing perf measurements between tests (perf-reftests) so we can compare rendering time of A vs B; in this case stylo vs non-stylo
* tp6 requires mitmproxy for record/replay. This allows us to have https and multi host dns resolution, which is much more real world than serving pages from http://localhost.
* Added support to wait for idle callback before testing the next page.

Stay tuned for updates on Sheriffing, non Talos tests, and upcoming plans.


Filed under testdev

Project Stockwell (October 2017)

It has been 6 months since the last Stockwell update.  With new priorities for many months and reduced effort on Stockwell, I overlooked sending updates.  While we have been spending a reasonable amount of time hacking on Stockwell, it has been less transparent.

I want to cover where we were a year ago, and where we are today.

One year ago today I posted on my blog about defining intermittent.  We were just starting to focus on learning about failures.  We collected data, read bugs, interviewed many influential people across Mozilla, and came up with a plan for Stockwell, which we presented at the Hawaii all hands.  Our plan was to do a few things:

  • Triage all failures >=30 instances/week
  • Build tools to make triage easier and collect more data
  • Adjust policy for triaging, disabling, and managing intermittents
  • Make our tests better with linting and test-verification
  • Invest time into auto-classification
  • Define test ownership and triage models that are scalable

While we haven’t focused 100% on intermittent failures in the last 52 weeks, we did about half the time, and have achieved a few things:

  • Triaged all failures >= 30 instances/week (most weeks, never more than 3 weeks off)
  • Many improvements to our tools, including: adjusting the orange factor robot and intermittent-bug-filer, and adding |mach test-info|
  • Played with policy on/off, have settled on needinfo “owner” when 30+ failures/week, and disabling if 200 failures in 30 days.
  • Added eslint to our tests, pylint for our tools, and the new TV job is tier-2.
  • Added source file -> bugzilla components in-tree to define ownership.
  • 31 bugzilla components triage their own intermittents
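The policy we settled on can be written down as a small decision function. This is a sketch under stated assumptions rather than code from our tools: `Action` and `triage_action` are hypothetical names, thresholds are treated as inclusive, and the disable recommendation is assumed to take precedence over the needinfo.

```python
from enum import Enum

NEEDINFO_PER_WEEK = 30      # needinfo the "owner" at 30+ failures/week
DISABLE_PER_30_DAYS = 200   # recommend disabling at 200 failures in 30 days


class Action(Enum):
    NONE = "below threshold"
    NEEDINFO = "needinfo owner"
    DISABLE = "recommend disabling"


def triage_action(failures_last_7_days: int, failures_last_30_days: int) -> Action:
    """Map failure counts to a triage action (precedence order is an assumption)."""
    if failures_last_30_days >= DISABLE_PER_30_DAYS:
        return Action.DISABLE
    if failures_last_7_days >= NEEDINFO_PER_WEEK:
        return Action.NEEDINFO
    return Action.NONE
```

For example, `triage_action(35, 120)` returns `Action.NEEDINFO`, while `triage_action(60, 230)` returns `Action.DISABLE`.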

While that is a lot of changes, they are incremental yet effective.  We started with an Orange Factor of 24+, and now we often see <12 (although last week it was closer to 14).  While doing that we have added many tests, almost doubling our test load, and the Orange Factor has remained low.  We still don’t consider that success: we often have 50+ bugs in a state of “needswork”, and it would be more ideal to have <20 in progress at any one time.  We are also still ignoring half the problem: all the other failures that do not cross our threshold of 30 failures/week.

Some statistics about bugs over the last 9 months (since January 1st):

Category     # Bugs
Fixed           511
Disabled        262
Infra            62
Needswork        49
Unknown         209
Total          1093

As you can see, that is a lot of disabled tests.  Note that we usually only disable a test on a subset of the configurations, not 100% across the board.  Also note that unknown bugs are ones that were failing frequently and, for some undocumented reason, have reduced in frequency.

One other interesting piece of data: we have tried to associate many of the fixed bugs with a root cause.  We have done this for 265 bugs, and 90 of them are actual product fixes 🙂  The rest are harness, tooling, infra, or, more commonly, test case fixes.
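To put those numbers in proportion, here is a quick sketch that turns the bug counts and the root-cause figures above into percentages. The variable names are just for illustration; the numbers are the ones quoted in this post.

```python
# Bug counts since January 1st, as listed in the table above.
counts = {
    "Fixed": 511,
    "Disabled": 262,
    "Infra": 62,
    "Needswork": 49,
    "Unknown": 209,
}

total = sum(counts.values())  # 1093, matching the Total row

# Share of each category as a percentage of all triaged bugs.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Of the fixed bugs with an identified root cause, how many were product fixes?
root_caused = 265
product_fixes = 90
product_share = round(100 * product_fixes / root_caused, 1)

for name, pct in shares.items():
    print(f"{name:<10} {pct:>5}%")
print(f"Product fixes among root-caused bugs: {product_share}%")
```

Roughly speaking, a bit under half of the triaged bugs were fixed, about a quarter were disabled, and about a third of the root-caused fixes were real product fixes.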

I will be doing some followup posts on details of the changes we have made over the year including:

  • Triage process for component owners and others who want to participate
  • Test verification and the future
  • Workflow of an intermittent, from first failure to resolution
  • Future of Orange Factor and Autoclassification
  • Vision for the future in 6 months

Please note that the 511 bugs that were fixed were done by the many great developers we have at Mozilla.  These were often randomized requests in a very busy schedule, so if you are reading this and you fixed an intermittent, thank you!


Filed under intermittents, testdev