A formal introduction to Ionut Goldan – Mozilla’s new Performance Sheriff and Tool hacker

About 8 months ago we started looking for a full-time performance sheriff to help out with our growing number of alerts and the need to keep the Talos toolchain relevant.

We got really lucky and ended up finding Ionut (:igoldan on irc, #perf).  Over the last 6 months, Ionut has done a fabulous job of learning how to understand Talos alerts, graphs, scheduling, and narrowing down root causes.  In fact, not only has he been able to easily handle all of the Talos alerts, Ionut has also picked up alerts from Autophone (Android devices), Build Metrics (build times, installer sizes, etc.), AWSY (memory metrics), and Platform Microbenchmarks (tests run inside of gtest, written by a few developers on the graphics and stylo teams).

While I could probably write a list of Ionut’s accomplishments and some tricky bugs he has sorted out, I figured your time reading this blog is better spent getting to know Ionut, so I did a Q&A with him so we can all learn much more about him.

Tell us about where you live?

I live in Iasi. It is a gorgeous and colorful town, somewhere in the North-East of Romania.  It is full of great places and enchanting sunsets. I love how a casual walk leads me to new, beautiful and peaceful neighborhoods.

I have many things I very much appreciate about this town: the people here, its continuous growth, its historical resonance, the fact that its streets once echoed the steps of the most important cultural figures of our country. It also resembles ancient Rome, as it is likewise built on 7 hills.

It’s pretty hard not to act like a poet around here.

What inspired you to be a computer programmer?

I wouldn’t say I was inspired to be a programmer.

During my last years of high school, I occasionally consulted with those close to me. Each time we concluded that IT is simply the best domain to specialize in: it will keep improving and there will always be jobs available; things that are evident nowadays.

I found much inspiration in this domain after the first year in college, when I noticed the huge advances and how they’re conducted.  I understood we’re living in a whole new era. Digital transformation is now the coined term for what’s going on.

Any interesting projects you have done in the past (school/work/fun)?

I had the great opportunity to work with brilliant teams on a full advertising platform, built almost from scratch.

It had almost everything: it was distributed, highly scalable, completely written in Python 3.x; the frontend adopted material design, and it used NoSQL databases in conjunction with SQL ones… It used some really cutting-edge libraries and it was a fantastic feeling.

Now it’s Firefox. The sound name speaks for itself and there are just so many cool things I can do here.

What hobbies do you have?

I like reading a lot. History and software technology are my favourite subjects.
I enjoy cooking, when I have the time. My favourite dish is definitely Hungarian goulash.

Also, I enjoy listening to classical music.

If you could solve any massive problem, what would you solve?

Greed. Laziness. Selfishness. Pride.

We can resolve all problems we can possibly encounter by leveraging technology.

Keeping non-values like those mentioned above would ruin every possible achievement.

Where do you see yourself in 10 years?

In a peaceful home, being a happy and caring father, spending time and energy with my loved ones, always trying to be the best example for them.  I envision becoming a top-notch professional programmer, leading highly performant teams on sound projects, always familiar with cutting-edge tech and looking to fit it into our tool set.

Constantly inspiring values among my colleagues.

Do you have any advice or lessons learned for new students studying computer science?

Be passionate about IT technologies. Always be curious and willing to learn new things. There are tons and tons of very good videos, articles, blogs, newsletters, books, docs… Seek them out. Make use of them. Follow their guidance and advice.

Continuous learning is something very specific to IT. By persevering, it will become second nature.

Treat every project as a fantastic opportunity to apply related knowledge you’ve acquired.  You need tons of coding to properly solidify all that theory, to really understand why you need to stick to the Open/Closed principle and all other nitty-gritty little things like that.

I have really enjoyed getting to know Ionut and working with him.  If you see him on IRC please ping him and say hi 🙂

 


Filed under Community

Talos tests – summary of recent changes

I have done a poor job of communicating status on our performance tooling; this is something I am trying to rectify this quarter.  Over the last 6 months many new Talos tests have come online, along with some changes in scheduling and measurement.

In this post I will highlight many of the test related changes and leave other changes for a future post.

Here is a list of new tests that we run:

* cpstartup – (content process startup: thanks :gabor)
* sessionrestore many windows – (instead of one window and many tabs, thanks :beekill)
* perf-reftest[-singletons] – (thanks bholley, :heycam)
* speedometer – (thanks :jmaher)
* tp6 (amazon, facebook, google, youtube) – (thanks :rwood, :armenzg)

These are also new tests, but slight variations on existing tests:

* tp5o + webextension, ts_paint + webextension (test web extension perf, thanks :kmag)
* tp6 + heavy profile, ts_paint + heavy profile (thanks :rwood, :tarek)

The following tests have been updated to be more relevant or reliable:

* damp (many subtests added, more upcoming, thanks :ochameau)
* tps – update measurements (thanks :mconley)
* tabpaint – update measurements (thanks :mconley)
* we run all talos tests on coverage builds (thanks :gmierz)

It is probably known to most, but earlier this year we stood up testing on Windows 10 and turned off our Talos coverage on Windows 8 (big thanks to Q for making this happen so fast).

Some changes that might not be so obvious, but worth mentioning:

* Added support for time to first non-blank paint (only tp6)
* Investigated mozAfterPaint on non-empty rects and updated a few tests to measure properly
* Added support for comparing perf measurements between tests (perf-reftests) so we can compare rendering time of A vs B, in this case stylo vs non-stylo (a small sketch of this kind of comparison follows this list)
* tp6 requires mitmproxy for record/replay; this gives us HTTPS and multi-host DNS resolution, which is much more real world than serving pages from http://localhost.
* Added support to wait for an idle callback before testing the next page.
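To illustrate the perf-reftest idea mentioned above, here is a minimal sketch of the kind of comparison involved. This is not the actual Talos code; the function name and numbers are made up for illustration.

```python
# Hypothetical sketch of a perf-reftest style comparison (not the actual Talos code):
# each page is loaded several times, and the test page is compared to its reference.
from statistics import median

def relative_difference(test_replicates_ms, ref_replicates_ms):
    """Return how much slower (positive) or faster (negative) the test page is
    relative to its reference page, based on median replicate times."""
    test_time = median(test_replicates_ms)   # e.g. the stylo variant of a page
    ref_time = median(ref_replicates_ms)     # e.g. the non-stylo reference page
    return (test_time - ref_time) / ref_time

# Example: medians of 103 ms vs 100 ms -> the test page is ~3% slower.
print(relative_difference([101, 103, 105], [99, 100, 102]))
```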

Stay tuned for updates on sheriffing, non-Talos tests, and upcoming plans.


Filed under testdev

Project Stockwell (October 2017)

It has been 6 months since the last Stockwell update.  With new priorities taking precedence for many months and reducing our efforts on Stockwell, I overlooked sending updates.  While we have been spending a reasonable amount of time hacking on Stockwell, it has been less transparent.

I want to cover where we were a year ago, and where we are today.

One year ago today I posted on my blog about defining intermittent.  We were just starting to focus on learning about failures.  We collected data, read bugs, interviewed many influential people across Mozilla, and came up with a plan, which we presented as Project Stockwell at the Hawaii all-hands.  Our plan was to do a few things:

  • Triage all failures >=30 instances/week
  • Build tools to make triage easier and collect more data
  • Adjust policy for triaging, disabling, and managing intermittents
  • Make our tests better with linting and test-verification
  • Invest time into auto-classification
  • Define test ownership and triage models that are scalable

While we haven’t focused 100% on intermittent failures in the last 52 weeks, we did for about half of that time, and have achieved a few things:

  • Triaged all failures >= 30 instances/week (most weeks, never more than 3 weeks off)
  • Made many improvements to our tools, including adjusting the Orange Factor robot and the intermittent-bug-filer, and adding |mach test-info|
  • Played with policy on/off; we have settled on a needinfo to the “owner” at 30+ failures/week, and disabling at 200 failures in 30 days
  • Added eslint to our tests and pylint for our tools, and the new TV (test-verification) job is tier-2
  • Added source file -> Bugzilla component mappings in-tree to define ownership
  • 31 Bugzilla components triage their own intermittents

While that is a lot of changes, it is incremental yet effective.  We started with an Orange Factor of 24+, and now we often see <12 (although last week it was closer to 14).  While doing that we have added many tests, almost doubling our test load, and the Orange Factor has remained low.  We still don’t consider that success: we often have 50+ bugs in a “needswork” state, and it would be more ideal to have <20 in progress at any one time.  We are also still ignoring half the problem, namely all the other failures that do not cross our threshold of 30 failures/week.

Some statistics about bugs over the last 9 months (Since January 1st):

Category # Bugs
Fixed 511
Disabled 262
Infra 62
Needswork 49
Unknown 209
Total 1093

As you can see, that is a lot of disabled tests.  Note that we usually only disable a test on a subset of the configurations, not 100% across the board.  Another note: “unknown” bugs are ones that were failing frequently and, for some undocumented reason, have reduced in frequency.

One other interesting piece of data: for many of the fixed bugs we have tried to identify a root cause.  We have done this for 265 bugs, and 90 of them were actual product fixes 🙂  The rest are harness, tooling, infra, or, more commonly, test case fixes.

I will be doing some followup posts on details of the changes we have made over the year including:

  • Triage process for component owners and others who want to participate
  • Test verification and the future
  • Workflow of an intermittent, from first failure to resolution
  • Future of Orange Factor and Autoclassification
  • Vision for the future in 6 months

Please note that the 511 bugs that were fixed were done by the many great developers we have at Mozilla.  These were often randomized requests in a very busy schedule, so if you are reading this and you fixed an intermittent, thank you!


Filed under intermittents, testdev

Project Stockwell (reduce intermittents) – April 2017

I am 1 week late in posting the update for Project Stockwell.  This wraps up a full quarter of work.  After a lot of concerns raised by developers about a proposed new backout policy, we moved on and didn’t change too much, although we did push a little harder, and I believe we have disabled more than we fixed as a result.

Let’s look at some numbers:

Week Starting            01/02/17   02/27/17   03/24/17
Orange Factor               13.76       9.06      10.08
# P1 bugs                      42         32         55
OF (excluding P1 bugs)       7.25       4.78       5.13

As you can see, we increased on all numbers in March, but overall 2017 has seen a great decrease so far.

There have been a lot of failures which have lingered for a while which are not specific to a test.  For example:

  • windows 8 talos has a lot of crashes (work is being done in bug 1345735)
  • reftest crashes in bug 1352671.
  • general timeouts in jobs in bug 1204281.
  • and a few other leaks/timeouts/crashes/harness issues unrelated to a specific test
  • infrastructure issues and tier-3 jobs

While these are problematic, we see the overall failure rate going down.  In all the other bugs where the test is clearly a problem, we have seen many fixes and responses to bugs from so many test owners and developers.  It is rare that we would suggest disabling a test and it would not be agreed upon, and if there was concern we had a reasonable solution to reduce or fix the failure.

Speaking of which, we have been tracking total bugs, fixed, disabled, etc. with whiteboard tags.  While there was a request not to use “stockwell” in the whiteboard tags and to make them more descriptive, after discussing this with many people we couldn’t come to agreement on names, what to track, or what we would do with the data, so for now we have kept them the same.  Here is some data:

                 03/07/17   04/11/17
total                 246        379
fixed                 106        170
disabled               61         91
infrastructure         11         17
unknown                44         60
needswork              24         38
% disabled         36.53%     34.87%

What is interesting is that prior to March we had disabled 36.53% of the resolved bugs (disabled as a share of fixed plus disabled), but in March, when we were more “aggressive” about disabling tests, the overall percentage went down.  In fact this is a cumulative number for the year; for the month of March alone we only disabled 31.91% of the resolved bugs.  Possibly if we had disabled a few more tests the overall numbers would have continued to go down instead of slightly up.
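Here is a quick worked check of those numbers, a sketch using only the figures from the table above (where “% disabled” means disabled / (fixed + disabled)):

```python
# Cumulative counts from the table above.
march_7 = {"fixed": 106, "disabled": 61}
april_11 = {"fixed": 170, "disabled": 91}

def pct_disabled(fixed, disabled):
    return 100.0 * disabled / (fixed + disabled)

print(pct_disabled(**march_7))    # ~36.53% cumulative as of 03/07/17
print(pct_disabled(**april_11))   # ~34.87% cumulative as of 04/11/17

# March alone: subtract the earlier cumulative counts first.
march_fixed = april_11["fixed"] - march_7["fixed"]           # 64
march_disabled = april_11["disabled"] - march_7["disabled"]  # 30
print(pct_disabled(march_fixed, march_disabled))             # ~31.91%
```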

A lot of changes took place on the tree in the last month, some interesting data on newer jobs:

  • taskcluster windows 7 tests are tier-2 for almost all windows VM tests
  • autophone is running all media tests which are not crashing or perma failing
  • disabled external media tests on linux platforms
  • added stylo mochitest and mochitest-chrome
  • fixed stylo reftests to run in e10s mode and on ubuntu 16.04

Upcoming job changes that I am aware of:

  • more stylo tests coming online
  • more linux tests moving to ubuntu 16.04
  • push to green up windows 10 taskcluster vm jobs

Regarding our tests, we are working on tracking new tests added to the tree, what components they belong to, what harness they run in, and overall how many intermittents we have for each component and harness.  Some preliminary work shows that we added 942 mochitest*/xpcshell tests in Q1 (609 were imported webgl tests, so we wrote 333 new tests, 208 of which are browser-chrome).  Given the fact that we disabled 91 tests and added 942, we are not doing so badly!

Looking forward into April and Q2, I do not see immediate changes to policy needed; maybe in May we can finalize a policy and make it more formal.  With the recent re-org, we are now in the Product Integrity org.  This is a good fit, but dedicating full-time resources to sheriffing and tooling for the sake of Project Stockwell is not in the mission.  Some of the original work will continue as it serves many purposes.  We will be looking to formalize some of our practices and tools to make this a repeatable process, to ensure that progress can still be made towards reducing intermittents (we want <7.0) and creating a sustainable ecosystem for managing these failures and getting fixes in place.

 


Filed under Uncategorized

Project Stockwell (reduce intermittents) – March 2017

Over the last month we had a higher rate of commits, failures, and fixes. One big change is that we turned on stylo-specific tests, and that was a slightly rocky road. Last month we suggested disabling tests after 2 weeks of seeing the failures. We ended up disabling many tests, but fixing many more.

In addition to disabling more tests, we implemented a set of Bugzilla whiteboard entries to track our progress (a small sketch of querying these counts follows the list):
* [stockwell fixed] – a fix went in (even if it only partially fixed the problem); 106 in the last 2 months
* [stockwell disabled] – we disabled the test in at least one config and there is no fix; 61 in the last 2 months
* [stockwell infra] – infra issues, usually externally driven; 11 in the last 2 months
* [stockwell unknown] – the bug became less intermittent with no clear reason; 44 in the last 2 months
* [stockwell needswork] – bugs in progress; 24 in the last 2 months
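As a rough sketch of how counts like these can be pulled, something like the following works against the public Bugzilla REST API. This is not our actual tooling; the endpoint and the whiteboard search parameter are standard Bugzilla, but result limits on bugzilla.mozilla.org may vary, and date-range filtering would need extra parameters.

```python
# Count bugs per [stockwell ...] whiteboard tag via the Bugzilla REST API.
import requests

TAGS = ["fixed", "disabled", "infra", "unknown", "needswork"]

def count_stockwell_bugs(tag):
    resp = requests.get(
        "https://bugzilla.mozilla.org/rest/bug",
        params={
            "whiteboard": f"[stockwell {tag}]",  # substring match on the status whiteboard
            "include_fields": "id",              # we only need to count the bugs
            "limit": 0,                          # ask for all matches
            # a creation_time filter could narrow this to a specific date range
        },
    )
    resp.raise_for_status()
    return len(resp.json()["bugs"])

for tag in TAGS:
    print(tag, count_stockwell_bugs(tag))
```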

We have also been tracking the orange factor and number of high frequency intermittents:

Week starting                 Jan 02, 2017   Jan 30, 2017   Feb 27, 2017
Orange Factor (OF)                   13.76          10.75           9.06
# priority intermittents                42             61             32
OF – priority intermittents           7.25           5.78           4.78

I added a new row here, tracking the Orange Factor as if all of the high frequency intermittent bugs didn’t exist.  This is what the long tail looks like, and I am really excited to see that number going down over time.  For me a healthy spot would be an OF <5.0 and a long tail <3.0.
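For clarity, here is a small sketch of that computation, assuming Orange Factor is roughly the number of intermittent failures per push; the numbers below are invented for illustration.

```python
# Sketch: Orange Factor, optionally excluding the high frequency (priority) bugs.
def orange_factor(failures_per_bug, pushes, excluded_bugs=()):
    """failures_per_bug maps bug id -> number of failures that week."""
    total = sum(count for bug, count in failures_per_bug.items()
                if bug not in excluded_bugs)
    return total / pushes

week = {"bug_a": 400, "bug_b": 150, "long_tail": 830}   # invented numbers
print(orange_factor(week, pushes=100))                                     # overall OF
print(orange_factor(week, pushes=100, excluded_bugs={"bug_a", "bug_b"}))   # long tail only
```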

We also looked at the number of unique bugs and repeat bugs per week.  Most bugs have a lifecycle of 2 weeks, and 2/3 of the bugs we see in a given week were high frequency (HF) the week prior.  For example, this past week we had 32 HF bugs and 21 of them carried over from the previous week (11 were still HF 2 weeks prior).

While it is tempting to assume we should just disable all failing tests, we find that many developers are actively working on these issues, and the data shows that we have many more fixed bugs than disabled bugs.  The main motivation for disabling tests is to reduce the confusion for developers on try and to reduce the work the sheriffs need to do.  Taking this data into account, we are looking to adjust our policy for disabling slightly (a small sketch of the adjusted policy as code follows the list):

  1. all high frequency bugs (>=30 times/week) will be triaged and expected to be resolved in 2 weeks, otherwise we will start the process of disabling the test that is causing the bug
  2. if a bug occurs >75 times/week, it will be triaged but expectations are that it will be resolved in 1 week, otherwise we will start the process of disabling the test that is causing the bug
  3. if a bug drops below high frequency (< 30 times/week), we will be happy to make a note of that and keep an eye on it, but we will not look at disabling the test.
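As promised above, here is a rough sketch of that policy as code. This is a hypothetical helper for illustration only, not our actual triage tooling.

```python
# Hypothetical weekly triage decision for a single intermittent bug,
# following the thresholds described in the list above.
def triage_decision(failures_this_week, weeks_since_triage):
    if failures_this_week >= 75:
        deadline_weeks = 1      # very high frequency: expect resolution in 1 week
    elif failures_this_week >= 30:
        deadline_weeks = 2      # high frequency: expect resolution in 2 weeks
    else:
        return "monitor"        # below threshold: note it and keep an eye on it

    if weeks_since_triage >= deadline_weeks:
        return "start the process of disabling the offending test"
    return f"needinfo the owner; {deadline_weeks - weeks_since_triage} week(s) left"

print(triage_decision(40, 0))   # high frequency, 2 weeks to resolve
print(triage_decision(90, 1))   # >=75/week and past the deadline -> disable
print(triage_decision(10, 3))   # dropped below 30/week -> monitor
```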

The big change here is that we will be more serious about disabling tests, specifically when a bug occurs >= 75 times/week.  We have had many tests failing at least 50% of the time for weeks; these show up on almost all try pushes that run them.  Developers should not be seeing failures like these.  Since we are tracking fixed vs disabled, if we determine that we are disabling too much, we can revisit this policy next month.

Outside of numbers and policy, our goal is to have a solid policy, process, and toolchain available for self-triaging as the year goes on.  We are refining the policy and process via manual triage.  The toolchain is the other piece of work we are doing; here are some updates:

  • adding BUG_COMPONENTS to all files in m-c (bug 1328351) – slow and steady progress, thanks for the reviews to date!  We fell behind while getting SETA completed, but much of the heavy lifting is already done
  • retrigger an existing job with additional debugging arguments (bug 1322433) – the main discussion is done and we are figuring out small details; we have a prototype working with little work remaining.  Next steps would be to implement the top 3 or 4 use cases.
  • add a test-lint job to linux64/mochitest (bug 1323044) – no progress yet; this got put on the backburner as we worked on SETA and focused on triage, whiteboard tags, and BUG_COMPONENTS.  We have landed code for using the ‘when’ clause for test jobs (bug 1342963), which is a small piece of this.  Getting this initially working will move up in priority soon, and making it work on all harnesses/platforms will most likely be a Google Summer of Code project.

Are there items we should be working on or looking into?  Please join our meetings.


Filed under Uncategorized

Project Stockwell – February 2017

I realized my post for last month was titled “Project Stockwell – January 2016” – that is a fun typo to make 🙂

Last month we focused on triaging all bugs that met our criteria of >=30 failures/week.  Every day there are many new bugs to triage, and we started with a large list.  In the end we have commented on all the bugs and have a small list to revisit or investigate every day.

One thing we focus on is requesting assistance at most once per week; to that end we have a “Neglected Oranges” dashboard that we use daily.

What is changing this month: we will be recommending resolution on priority bugs (>=30 failures/week) within 2 weeks.  Resolution means active debugging, landing changes to the test to reduce, debug, or fix the intermittent, or, where there is a lack of time or no easy fix, disabling the test.  If this goes well, we will reduce that window to 7 days in March.

So how are we doing?

Week starting              Jan 02, 2017   Jan 30, 2017
Orange Factor                     13.76          10.75
# priority intermittents             42             61

We have fewer overall failures, but more bugs spread out.  Some interesting bugs:

In terms of projects underway, here is some status:

  • adding BUG_COMPONENTS to all files in m-c (bug 1328351) – slow and steady progress, thanks for the reviews to date!  We expect the large majority of this to be completed this month.
  • retrigger an existing job with additional debugging arguments (bug 1322433) – main discussion is done, figuring out small details, should see a prototype this month
  • add |mach test-info| support (bug 1324470) – landed today!
  • add a test-lint job to linux64/mochitest (bug 1323044) – no progress yet; I expect some this month.

Are there items we should be working on or looking into?  Please join our meetings.


Filed under Uncategorized

Project Stockwell – January 2016

Every month this year I am planning to write a summary of Project Stockwell.

Last year we started this project with a series of meetings and experiments.  We presented in Hawaii (a Mozilla all-hands event) an overview of our work and path forward.

With that said, we will be tracking two items every month:

Week of Jan 02 -> 09, 2017

Orange Factor 13.76
# High Frequency bugs 42

What are these high frequency bugs:

  • linux32 debug timeouts for devtools (bug 1328915)
  • Turning on leak checking (bug 1325148) – note, we did this Dec 29th and whitelisted a lot; much still remains, and many great fixes have taken place
  • some infrastructure issues, other timeouts, and general failures

I am excited for the coming weeks as we reduce the Orange Factor back down below 7 and get the number of high frequency bugs below 20.

Outside of these tracking stats there are a few active projects we are working on:

  • adding BUG_COMPONENTS to all files in m-c (bug 1328351) – this will allow us to match up triage contacts for each component, so test case ownership has a path to a live person
  • retrigger an existing job with additional debugging arguments (bug 1322433) – easier to get debug information, possibly extend to special runs like ‘rr-chaos’
  • add |mach test-info| support (bug 1324470) – allows us to get historical timing/run/pass data for a given test file
  • add a test-lint job to linux64/mochitest (bug 1323044) – ensure a test runs reliably by itself and in --repeat mode

While these seem small, we are currently actively triaging all bugs that are high frequency (>=30 times/week).  In January, triage means letting people know a bug is high frequency and trying to add more data to the bugs.

 


Filed under intermittents, testdev