Tag Archives: mozilla

A formal introduction to Ionut Goldan – Mozilla’s new Performance Sheriff and Tool hacker

About 8 months ago we started looking for a full time performance sheriff to help out with our growing number of alerts and the need to keep the Talos toolchain relevant.

We got really lucky and ended up finding Ionut (:igoldan on irc, #perf).  Over the last 6 months, Ionut has done a fabulous job of learning how to understand Talos alerts, graphs, scheduling, and narrowing down root causes.  In fact, not only has he been able to easily handle all of the Talos alerts, Ionut has also picked up alerts from Autophone (Android devices), Build Metrics (build times, installer sizes, etc.), AWSY (memory metrics), and Platform Microbenchmarks (tests run inside of gtest written by a few developers on the graphics and stylo teams).

While I could probably write a list of Ionut’s accomplishments and some tricky bugs he has sorted out, I figured your time reading this blog is better spent getting to know Ionut, so I did a Q&A with him so we can all learn much more about him.

Tell us about where you live?

I live in Iasi. It is a gorgeous and colorful town, somewhere in the North-East of Romania.  It is full of great places and enchanting sunsets. I love how a casual walk leads me to new, beautiful and peaceful neighborhoods.

There are many things I very much appreciate about this town: the people here, its continuous growth, its historical resonance, the fact that its streets once echoed the steps of the most important cultural figures of our country. It also resembles ancient Rome, as it is also built on 7 hills.

It’s pretty hard not to act like a poet around here.

What inspired you to be a computer programmer?

I wouldn’t say I was inspired to be a programmer.

During my last years of high school, I occasionally talked it over with those close to me. Each time we concluded that IT is just the best domain to specialize in: it will improve continuously and there will always be jobs available; things that are evident nowadays.

I found much inspiration in this domain after my first year in college, when I noticed the huge advances and how they come about.  I understood we’re living in a whole new era. Digital transformation is now the coined term for what’s going on.

Any interesting projects you have done in the past (school/work/fun)?

I had the great opportunity to work with brilliant teams on a full advertising platform, built almost from scratch.

It had almost everything: it was distributed, highly scalable, completely written in Python 3.X, the frontend adopted material design, NoSQL databases worked in conjunction with SQL ones… It used some really cutting-edge libraries and it was a fantastic feeling.

Now it’s Firefox. The sound name speaks for itself and there are just so many cool things I can do here.

What hobbies do you have?

I like reading a lot. History and software technology are my favourite subjects.
I enjoy cooking, when I have the time. My favourite dish is definitely Hungarian goulash.

Also, I enjoy listening to classical music.

If you could solve any massive problem, what would you solve?

Greed. Laziness. Selfishness. Pride.

We can resolve all problems we can possibly encounter by leveraging technology.

Keeping non-values like those mentioned above would ruin every possible achievement.

Where do you see yourself in 10 years?

In a peaceful home, being a happy and caring father, spending time and energy with my loved ones. Always trying to be the best example for them.  I envision becoming a top-notch professional programmer, leading highly performant teams on sound projects. Always familiar with cutting-edge tech and looking to fit it into our tool set.

Constantly inspiring values among my colleagues.

Do you have any advice or lessons learned for new students studying computer science?

Be passionate about IT technologies. Always be curious and willing to learn about new things. There are tons and tons of very good videos, articles, blogs, newsletters, books, docs… Seek them out. Make use of them. Follow their guidance and advice.

Continuous learning is something very specific to IT. Persevere, and it will become second nature.

Treat every project as a fantastic opportunity to apply related knowledge you’ve acquired.  You need tons of coding to properly solidify all that theory, to really understand why you need to stick to the Open/Closed principle and all other nitty-gritty little things like that.

I have really enjoyed getting to know Ionut and working with him.  If you see him on IRC please ping him and say hi 🙂

 


Filed under Community

community hacking – thoughts on what works for the automation and tools team

Community is a word that means a lot of things to different people.  When there is talk of community at an A*Team meeting, some people perk up and others tune out.  Having taken a voluntary role in leading many community efforts on the A*Team over the last year, I have some thoughts on accepting contributions, growing community, and making it work within the team.

Contributions:

Historically on the A*Team we would file bugs which are mentored (and discoverable via bugsahoy) and blog/advertise help wanted.  This is always met with great enthusiasm from a lot of contributors.  What does this mean for the mentor?  There are a few common axes here:

  • High-Touch vs Low-Touch
    • High-Touch is where there is a lot of time invested in getting the current problem solved.  Usually a lot of bug comments, email, irc chatter to solve a good first bug.  Sometimes this can take hours!
    • Low-Touch is where a person comes in, a patch appears, and little to no feedback is needed on the patch.
  • High-Reward vs Low-Reward:
    • High-Reward is where we have contributors that solve larger problems.  A rewarding experience for both the contributor and the mentor
    • Low-Reward is where a contributor is fixing useful things, but they are little nits or polish.  This isn’t as rewarding for the contributor, nor the mentor.
  • Short-Term vs Long-Term:
    • Short-Term – a contributor shows up for a few days, fixes however many bugs they can and disappears.  This is a common workflow for folks who are on break from school or shifting around in different stages of their lives.
    • Long-Term – a contributor who shows up on a regular basis, continues to contribute and just the fact of them being around has a much larger impact on the team.

We need to appreciate all types of contributions and ensure we do our best to encourage folks to participate.  As a mentor, if you have a lot of high-touch, low-reward, short-term contributors, it is exhausting and demoralizing.  No wonder a lot of people don’t want to participate in mentoring folks as they contribute.  It is also unrealistic to expect a bunch of seasoned coders to show up and implement all the great features, then repeat that for years on end.

The question remains: how do you find low-touch contributors, or identify ones that are high-touch at the start and end up learning fast (some of the best contributors fall into this latter category)?

Growing Community:

The simple answer here is to file a bunch of bugs.  In fact, whenever we do this they get fixed really fast.  This turns into a problem when you have 8 new contributors, 2 mentors, and 10 good first bugs.  Of course it is possible to find more mentors, and it is possible to file more bugs.  In reality this doesn’t work well for most people and projects.

The real question to ask is: what kind of growth are you looking for?  The answer is different for many people.  What we find of value is slowly growing our long-term/low-touch contributors by giving them more responsibility (i.e. ownership) and really depending on them for input on projects.  There is also a need to grow mentors, and mentors can be contributors as well!  Lastly, it is great to have a larger pool of short-term contributors who have ramped up on a few projects and enjoy pitching in once in a while.

How can we foster a better environment for both mentors and contributors?  Here are a few key areas:

  • Have good documentation.
  • Set expectations up front and make it easy to understand what is expected and what the next steps are.
  • Have great mentors (this might be the hardest part).
  • Focus more on what comes after good first bugs.
  • Get to know the people you work with.

Just focusing on the relationships and what comes after the good first bugs will go a long way in retaining new contributors and reducing the time spent helping out.

How we make it work in the A*Team:

The A*Team is not perfect.  We have few mentors, and community is not built into the way we work.  Some of this is circumstantial, but a lot of it is within our control.  So what do we do, and what does and does not work for us?

Once a month we meet to discuss what is going on within the community on our team.  We have tackled topics such as project documentation, bootcamp tools/docs, discoverability, good next bugs, good first projects, and prioritizing our projects for encouraging new contributors.

While that sounds good, it is the work of a few people.  There is a lot of negative history of contributors fixing one bug and taking off.  Much frustration is expressed around helping someone with basic pull requests and patch management, over and over again.  While we can document stuff all day long, the reality is new contributors won’t read the docs and still ask questions.

The good news is in the last year we have seen a much larger impact of contributors to our respective projects.  Many great ideas were introduced, problems were solved, and experiments were conducted- all by the growing pool of contributors who associate themselves with the A*Team!

Recently, we discussed the most desirable attributes of contributors, trying to think about the problem in a different way.  It boiled down to a couple of things: willingness to learn, and sticking around at least for the medium term.

Going forward we are working on growing our mentor pool, and focusing on key projects so that the high-touch, time-consuming learning curve only happens in areas where we can spread the love between domain experts and folks just getting started.

Keep an eye out for more posts in the coming week(s) outlining some new projects and opportunities to get involved.


Filed under Community

5 days in Portland with Mozillians and 10 great things that came from it

I took a lot of notes in Portland last week.  One might not know that, given that I talked so much my voice ran out of steam by the second day.  Either way, in chatting with some co-workers yesterday about what we took away from Portland, I realized that there is a long list of awesomeness.

Let me caveat this by saying that some of these ideas have been talked about in the past, but despite our efforts to work with others and field interesting and useful ideas, there is a big list of great things that came to light while chatting in person:

  • :bgrins mentioned a mozscreenshot tool and the need for getting a screenshot of new features in development on various platforms so UX can review the changes.  Currently the method is to ask UX to download the build from try or some other location and run it locally to see the changes.
  • :heycam/:jwatt – had a great and interesting Talos discussion, mostly around how to run it and validate patches/fixes locally and on try server. (check out bug 1109243)
  • :glandium is looking at doing some changes (I recall something with build/pgo) and wanted to know how to compare some Talos numbers to help make the right decision – this can be done with either bug 1109243, or the existing compare.py in the Talos repo (we might need some cleanup on this)
  • :bobowen has been working to get csb tests working – after chatting in line to board a plane, it became clear he needs to solve some finer-grained test selection problems, many of which the ateam has on a roadmap in Q2/Q3 – I see some tighter collaboration happening here.
  • Thanks to chatting with :lsblakk, I am motivated to expand the talos sheriff team and look for dedicated Mozillians (or soon to become Mozillians) to work with in keeping a lid on the alerts and overall state of performance (based on what we measure).
  • :lightsofapollo had a great conversation with me about TaskCluster and what barriers stood in the way of running Talos on it – this will result in some initial investigation work!
  • :kats was asking me how to generate alerts for areweslimyet.com.  This is very doable via posting data to graph server
  • After a good session on how to handle intermittents (seems like the same people have this conversation every time a bunch of Mozillians get together), I am motivated to push Titanic further to find the root cause of an intermittent via brute force retriggers (ideally on weekends).  In fact :dbaron has done this a few times in the last month and so have the sheriffs.  This is similar to what we do to verify a talos regression, just with some different parameters.
  • The same conversation about intermittents yielded a stronger desire to look at new tests coming into the system and validating their stability.  The simple solution is to run the job 100 times, verify that the new test didn’t have issues, and then leave it alone.  Of course we could get smart and do this for all test_* files that are edited in the tree.  Thanks to :ehsan for spawning this conversation.
  • Discussing the idea of a Talos Sheriff with a few folks, it seems like further conversations are needed with the existing Sheriff team, as well as a chat with :vladan and :avih about what type of policy we should have for existing performance failures which are detected.  I would expect some changes to be made early next year as we have more tests and need more help.  My initial thoughts are specifically around responding to regressions or getting backed out in XX hours.  Yeah, that sounds nasty, but there are probably cut and dry parameters we can set and start enforcing.

Those are 10 specific topics which, despite everybody knowing how to contact me or the ateam to share great ideas or frustrations, only came out of being in the same place at the same time.

Thinking through this, when I see these folks in a real office while working from there for a few days or a week, it seems as though the conversations are smaller and not as intense.  Usually just small talk whilst waiting for a build to finish.  I believe the idea that we are not expected to focus on our day to day work and instead make plans for the future is the real innovation behind getting these topics surfaced.


Filed under general

A case of the weekends?


What was famous 15 years ago as a case of the Mondays has manifested itself in Talos.  In fact, I wonder why I get so many regression alerts on Monday as compared to other days.  The point is more that we have less noise in our Talos data on weekends.

Take for example the test case tresize:

[Graph: 30 days of linux32 tresize]

* in fact we see this on other platforms as well: linux32/linux64/osx10.8/windowsXP

Many other tests exhibit this.  What is different about weekends?  Are there just fewer data points?

I do know our volume of tests goes down on weekends, mostly as a side effect of fewer patches being landed on our trees.

Here are some ideas I have to debug this more:

  • Run massive retrigger scripts for Talos on weekends to validate whether the number of samples is the problem
  • Reduce the volume of Talos on weekdays to validate whether the overall system load in the datacenter is the problem
  • Compare the load of the machines across all branches, and the wait times, to the noise we have in certain tests/platforms (see the sketch after this list)
  • Look at platforms like Windows 7, Windows 8, and OSX 10.6 to understand why they have more noise on weekends or are more stable.  Finding the delta between platforms would help provide answers
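
To make the comparison idea concrete, here is a minimal sketch, assuming the data points for one test/platform come back from graph server as (timestamp, value) pairs; the function and data format here are my own, not an existing API:

# Minimal sketch of quantifying the weekend effect: split one test/platform
# series into weekday vs weekend samples and compare counts and spread.  The
# (timestamp, value) input format is an assumption, not graph server's API.
from datetime import datetime, timezone
from statistics import mean, pstdev

def noise_by_day_type(points):
    """points: iterable of (unix_timestamp, value) pairs for one test/platform."""
    weekday, weekend = [], []
    for ts, value in points:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).weekday()  # Mon=0 .. Sun=6
        (weekend if day >= 5 else weekday).append(value)

    def cv(values):
        # Coefficient of variation lets us compare spread even if the mean shifts.
        return pstdev(values) / mean(values) if len(values) > 1 else float("nan")

    return {"weekday": (len(weekday), cv(weekday)),
            "weekend": (len(weekend), cv(weekend))}

If the weekend bucket has a similar sample count but a much smaller coefficient of variation, then the number of data points alone is probably not the explanation.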

If you have ideas on how to uncover this mystery, please speak up.  I would be happy to have this gone and make any automated alerts more useful!


Filed under testdev

Thoughts on Auto-Land, Try server, and intermittent oranges outline

This is the first post in a series where I will post some ideas.  These are ideas, not active projects (although these ideas could be implemented with many active projects).

My first idea is surrounding the concept of AutoLand.  Mozilla has talked about this for a long time.  In fact, a conversation I had last week got me thinking more about the value of AutoLand vs blocking on various aspects of it.  There are a handful of issues blocking us from a system where we push to try and, if it goes well, we magically land it on the tip of a tree.  My vested interest is in the “if it goes well” part.

The argument here has been that we have so many intermittent oranges that until we fix those we cannot determine if a job is green.  A joke for many years has been that it would be easier to win the lottery than to get an all green run on tbpl.  I have seen a lot of cases where people push to Try and land on Inbound, only to be backed out by a test failure, a failure that was seen on Try (for the record, I personally have done this once).  I am sure someone could write a book on the human behavior, tooling, and analysis of why failures land on integration branches when we have try server.

My current thought is this:

  • push to try server with a final patch, run a full set of tests and builds
  • when all the jobs are done [1], we analyze the results of the jobs and look for 2 patterns:
    • pattern 1: for a given build, at most 1 job fails
    • pattern 2: for a given job [2], at most 1 platform fails
  • if patterns 1 + 2 both pass, we put this patch in the queue for landing by the robots (a rough sketch of this check is below)

[1] – we can determine the minimal amount of jobs or verify with more analysis (i.e. 1 mochitest can fail, 1 reftest can fail, 1 other can fail)

[2] – some jobs are run in different chunks.  on opt ‘dt’ runs all browser-chrome/devtools jobs, but this is ‘dt1’, ‘dt2’, ‘dt3’ on debug builds
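
To make the two patterns concrete, here is a rough sketch of the check, assuming the finished try run can be pulled down as simple (build, job, platform, status) records; the data source and names here are hypothetical, not an existing API:

# Rough sketch of the "pattern 1 + pattern 2" check described above.  The
# (build, job, platform, status) records are a hypothetical representation of a
# finished try run, not an existing API.
from collections import defaultdict

def ok_to_autoland(results):
    failed_jobs_per_build = defaultdict(int)     # pattern 1: failing jobs per build
    failed_platforms_per_job = defaultdict(set)  # pattern 2: failing platforms per job

    for build, job, platform, status in results:
        if status != "success":
            failed_jobs_per_build[build] += 1
            failed_platforms_per_job[job].add(platform)

    pattern1 = all(n <= 1 for n in failed_jobs_per_build.values())
    pattern2 = all(len(platforms) <= 1 for platforms in failed_platforms_per_job.values())
    return pattern1 and pattern2

# Example: a single reftest failure on one platform still qualifies for landing.
results = [
    ("linux64 opt", "mochitest-1", "linux64", "success"),
    ("linux64 opt", "reftest-1", "linux64", "testfailed"),
    ("win32 opt", "reftest-1", "win32", "success"),
]
print(ok_to_autoland(results))  # True

Footnote [1] would slot in as a smarter per-suite success check, and footnote [2] as normalizing chunk names (dt1/dt2/dt3 -> dt) before counting.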

 

This simple approach would give us the confidence we need to reliably land patches on integration branches and achieve the same, if not better, results than humans do.

As a bonus, we could optimize our machine usage by not building/running all jobs on the integration commit, because we already have a complete set done on try server.

 


Filed under Uncategorized

Firefox 32 leaves the train station- what does the performance look like

Now that we have an uplift completed, and enough data has been collected since then to automatically confirm which changes are sustained, it is time for the triple fortnightly report of what performance looks like.  For reference, there is some data in a blog post about general talos numbers.

Firefox 32 uplift, m-c -> Aurora (tracking bug 1004427):

  • 20 – regressions (3 CART, 3 TART, 3 SVG, 3 TResize, and some one off tests)
    • 18 regressions are on windows, the majority a result of OMTC being turned on (this is more of a rebaselining of tests than actual regressions)
    • 3 bugs are tracking all 20 regressions!
  • 43 – improvements (15 Kraken/V8/Dromaeo, 2 SVG, 7 TScroll, 4 TART/CART, 4 Paint, 9 SessionRestore, and a couple others)
  • the improvements are distributed amongst Windows, Mac, and Linux

Firefox 31 uplift, m-c -> Aurora (tracking bug 990085):

Firefox 30 uplift, m-c -> Aurora:

  • 26 – regressions (4 TART, 4 SVG, 3 TS, Paint, and many more)
    • 2 remaining bugs not resolved as we are now on Beta (bug 990183, bug 990194)

As you can see, Firefox 32 has a lot of improvements and fewer regressions (of those 20, about half are related to rebasing numbers).

Let’s look at bugs:

  • 36 bugs filed to date for Firefox 32 Talos regressions
  • 16 are resolved (7 as wontfix)
  • 20 are open (this means that 17 of them are only showing up on non-pgo)

 

After reviewing the process of investigating alerts, it makes sense that we continue forward with the same process in 6-week intervals, with any changes being made on uplift day and applying only to trunk.  Some future changes we are considering:

  • not filing bugs on minimal regressions (ex. <4%)
  • not filing bugs on non-pgo only regressions (since we only build pgo on Aurora, Beta, Release)
  • generating alerts for per test (not per suite) regressions (and only file bugs if a single test is >10%)
  • adjust the graph server alert calculation to not drop the page with the highest value and to report the geometric mean of the pages instead of the average (see the sketch below this list)
  • any other great ideas you have on how to be efficient with our time while continuing to identify and document our regressions
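
For that alert-calculation bullet, here is a quick illustration of the difference between the two summaries, assuming a suite result is just a list of per-page values; the current behavior is paraphrased from the bullet above, not taken from the graph server code:

# Illustration of the proposed change to the suite summary.  The current
# behavior (drop the highest page, average the rest) is paraphrased from the
# bullet above, and the page values are made up.
from math import prod

def current_summary(page_values):
    kept = sorted(page_values)[:-1]   # drop the page with the highest value
    return sum(kept) / len(kept)      # arithmetic mean of the rest

def proposed_summary(page_values):
    return prod(page_values) ** (1.0 / len(page_values))  # geometric mean of all pages

pages = [180.0, 195.0, 210.0, 520.0]  # one consistently slow page
print(current_summary(pages))         # 195.0  -> the slow page disappears entirely
print(proposed_summary(pages))        # ~248.8 -> the slow page still counts, without dominating

The geometric mean keeps every page in the number while being less sensitive to a single noisy page than a plain average over all pages would be.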

Onward to Firefox 33!


Filed under Uncategorized

browser-chrome is greener and in many chunks

On Friday we rolled out a big change to split up our browser-chrome tests.  It started out as a great idea to split the devtools tests out into their own suite; then, after testing, we ended up chunking the remaining browser-chrome tests into 3 chunks.

No more 200-minute wait times; in fact, we are probably running too many chunks.  A lot of heavy lifting took place, much of it in releng from Armen and Ben, and much work from Gavin and RyanVM, who pushed hard and proposed great ideas to see this through.

What is next?

There are a few more test cases to fix, and all of these changes still need to land on Aurora.  We have more (lower priority) work we want to do on running the tests differently to help isolate issues where one test affects another test.

In the next few weeks I want to put together a list of projects and bugs that we can work on to make our tests more useful and reliable.  Stay tuned!

 


Filed under testdev

is a phone too hard to use?

Working at Mozilla, I get to see a lot of great things.  One of them is collaborating with my team (as we are almost all remoties), and I have been doing that for almost 6 years.  Sometime around 3 years ago we switched to using Vidyo as a way to communicate in meetings.  This is great: we can see and hear each other.  Unfortunately heartbleed came out and affected Mozilla’s Vidyo servers.  So yesterday and today we have been without Vidyo.

Now I am getting meeting cancellation notices.  Why are we cancelling meetings?  Did meetings not happen 3 years ago?  Mozilla actually creates an operating system for a … phone.  In fact, our old teleconferencing system is still in place.  I thought about this earlier today and wondered why we are cancelling meetings.  Personally, I always put Vidyo in the background during meetings and keep IRC in the foreground.  Am I in the minority?

I am not advocating scrapping Vidyo; instead, I would like to attend meetings, and if we find they cannot be held without Vidyo, we should cancel them (and not reschedule them).

Meetings existed before Vidyo, and Open Source existed before GitHub; we don’t need the latest and greatest things to function in life. Pick up a phone and discuss what needs to be discussed.


Filed under Uncategorized

tracking talos alerts across branches

A year without blogging and I am back.  I figured there was some cool stuff to share; here is one tidbit.

In the last year I have picked up looking at Talos results and filing regression bugs for them.  This has been useful.  What currently happens is that when results are submitted to g.m.o (graph server), we detect a regression and send out an email to the original patch author (if we can determine it) and post to mozilla.dev.tree-management.  I have been using dev.tree-management as a starting point for hunting regressions.  When things are busy it can eat up a couple hours in a day.  Luckily many developers are responsible in taking action when they receive the emails.

Given that at least half of the regressions are not acted upon by the original developer, it is important to read the newsgroup. One of the things which makes it frustrating is that for a single regression we can get multiple alerts (regular builds vs pgo builds and as the patch merges between branches/projects).

To make my life easier, I have taken all the alerts on dev.tree-management and put them in a database (local right now).  The final goal is a web UI that lets me easily annotate these alerts, similar to tbpl for random test failures.  One thing I wanted to do was help identify duplicate alerts.  Today in my attempt I got a clear picture of what the lifecycle of a regression looks like:

mysql> select date,branch,percent,keyrevision from alerts where test='Paint' and platform='WINNT 6.2 x64' order by date ASC;
+---------------------+-------------------------+---------+--------------+
| date                | branch                  | percent | keyrevision  |
+---------------------+-------------------------+---------+--------------+
| 2014-02-14 19:41:38 | Mozilla-Inbound-Non-PGO | 10.1%   | c7802c9d6eec |
| 2014-02-15 01:03:54 | Fx-Team-Non-PGO         | 9.53%   | 7a3adc5aac28 |
| 2014-02-15 21:43:48 | Mozilla-Inbound         | 10.6%   | c7802c9d6eec |
| 2014-02-16 03:46:12 | Firefox-Non-PGO         | 8.88%   | 5d7caa093f4f |
| 2014-02-16 03:46:13 | B2g-Inbound-Non-PGO     | 9.44%   | 071885f79841 |
| 2014-02-16 14:22:38 | Fx-Team                 | 10.4%   | 7a3adc5aac28 |
| 2014-02-17 04:42:57 | B2g-Inbound             | 10.7%   | 071885f79841 |
| 2014-02-18 11:43:54 | Firefox                 | 9.76%   | eac89fb04bb9 |
+---------------------+-------------------------+---------+--------------+
8 rows in set (0.00 sec)

This is really cool to see how 1 change can generate alerts for 4 days.
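
As a hint of where the annotation UI could go, here is a small sketch of collapsing duplicates, assuming each stored alert carries the same fields as the query above; the alert store and field names are mine:

# Sketch of collapsing duplicate alerts into one logical regression.  Assumes
# each row has the fields shown in the query above; the store itself is
# hypothetical (my local database).
from collections import defaultdict

def group_alerts(rows):
    """rows: iterable of dicts with test, platform, keyrevision, branch, percent, date."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["test"], row["platform"], row["keyrevision"])
        grouped[key].append(row)
    return grouped

For the Paint/WINNT data above this folds the PGO and non-PGO alerts that share a revision into one entry; following the same change as it merges across branches (where the key revision differs) would still need an extra lookup.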

Stay tuned for more information on this and other topics!


Filed under Uncategorized

Android automation is becoming more stable ~7% failure rate

At Mozilla we have made unit testing on Android devices as important as desktop testing. Earlier today I was asked how we measure this and what our definition of success is. The obvious answer is no failures except for code that breaks a test, but in reality we allow for some random failures and infrastructure failures. Our current goal is 5%.

So what are these acceptable failures, and what does 5% really mean? Failures can happen when we have tests which fail randomly, usually poorly written tests or tests which were written a long time ago and hacked to work in today’s environment. This doesn’t mean any test that fails is a problem; it could be a previous test that changes a Firefox preference by accident. For Android testing, this currently means the browser failed to launch and load the test webpage properly, or it crashed in the middle of the test. Other failures are the device losing connectivity, our host machine having hiccups, the network going down, sdcard failures, and many other problems. With our current state of testing this mostly falls into the category of losing connectivity to the device. Infrastructure problems are indicated as Red or Purple, and test related problems are Orange.

I took a look at the last 10 runs on mozilla-central (where we build Firefox nightlies from) and built this little graph:

[Chart: Firefox Android Failures (mozilla-central)]

Here you can see that our tests are causing 6.67% of the failures and 12.33% of the time we can expect a failure on Android.

We have another branch called mozilla-inbound (we merge this into mozilla-central regularly) where most of the latest changes get checked in.  I did the same thing here:

[Chart: mozilla-inbound Android Failures]

Here you can see that our tests are causing 7.77% of the failures and 9.89% of the time we can expect a failure on Android.
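
For anyone curious how those two percentages break down, the arithmetic is straightforward, assuming they are the share of jobs that fail because of tests (orange) and the share that fail for any reason (orange plus red/purple); the job counts below are made up to roughly reproduce the mozilla-inbound figures:

# Arithmetic behind the two percentages, assuming "orange / total jobs" and
# "(orange + red/purple) / total jobs".  The counts are made up for illustration,
# chosen to roughly match the mozilla-inbound figures above.
def failure_rates(green, orange, red_or_purple):
    total = green + orange + red_or_purple
    test_rate = orange / total                       # failures caused by tests
    overall_rate = (orange + red_or_purple) / total  # any failure, including infrastructure
    return test_rate, overall_rate

test_rate, overall_rate = failure_rates(green=255, orange=22, red_or_purple=6)
print(f"tests: {test_rate:.2%}, overall: {overall_rate:.2%}")  # tests: 7.77%, overall: 9.89%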

This is only a small sample of the tests, but it should give you a good idea of where we are.


Filed under testdev