
Working towards a productive definition of “intermittent orange”

Intermittent Oranges (tests which fail sometimes and pass other times) are an ever-increasing problem with test automation at Mozilla.

While there are many common causes for failures (bad tests, the environment/infrastructure we run on, and bugs in the product),
we still do not have a clear definition of what we view as intermittent.  Some common statements I have heard:

  • It’s obvious: if it failed last year, the test is intermittent
  • If it failed 3 years ago, I don’t care, but if it failed 2 months ago, the test is intermittent
  • I fixed the test to not be intermittent, I verified by retriggering the job 20 times on try server

These imply very different definitions of what is intermittent.  A useful definition will need to:

  • determine if we should take action on a test (programmatically or manually)
  • define policy that sheriffs and developers can use to guide work
  • guide developers to know when a new/fixed test is ready for production
  • provide useful data to release and Firefox product management about the quality of a release

Since I wanted a clear definition of what we are working with, I looked over 6 months (2016-04-01 to 2016-10-01) of OrangeFactor data (7330 bugs, 250,000 failures) to find patterns and trends.  I was surprised at how many bugs had <10 instances reported (3310 bugs, 45.1%).  Likewise, I was surprised that such a small number of bugs (1236) account for >80% of the failures.  It made sense to look at things daily, weekly, monthly, and every 6 weeks (our typical release cycle).  After much slicing and dicing, I have come up with 4 buckets (a rough classification sketch follows the list):

  1. Random Orange: this test has failed, even multiple times in history, but in a given 6 week window we see <10 failures (45.2% of bugs)
  2. Low Frequency Orange: this test might fail up to 4 times in a given day, but typically <=1 failure per day; in a 6 week window we see <60 failures (26.4% of bugs)
  3. Intermittent Orange: fails up to 10 times/day or <120 times in 6 weeks (11.5% of bugs)
  4. High Frequency Orange: fails >10 times/day on many days and is often seen in try pushes (16.9% of bugs, or 1236 bugs)
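
As a rough illustration, here is a minimal shell sketch of the bucketing above, using only the 6 week counts and ignoring the per-day caveats; the 10/60/120 thresholds come from the definitions, everything else is my own simplification:

#!/bin/bash
# classify_orange.sh - map a bug's failure count in a 6 week window to a bucket
# Usage: ./classify_orange.sh <failures_in_6_weeks>
count=$1

if [ "$count" -lt 10 ]; then
  echo "1: Random Orange"
elif [ "$count" -lt 60 ]; then
  echo "2: Low Frequency Orange"
elif [ "$count" -lt 120 ]; then
  echo "3: Intermittent Orange"
else
  echo "4: High Frequency Orange"
fi

For example, running it with a count of 45 would report bucket 2 (Low Frequency Orange).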

Alternatively, we could simplify our definitions and use:

  • low priority or not actionable (buckets 1 + 2)
  • high priority or actionable (buckets 3 + 4)

Does defining these buckets by the number of failures in a given time window help us with what we are trying to solve with the definition?

  • Determine if we should take action on a test (programmatically or manually):
    • ideally buckets 1/2 can be detected programmatically with autostar and removed from our view, possibly rerunning the job to validate it isn’t a new failure
    • buckets 3/4 have the best chance of reproducing; we can run them in debuggers (like ‘rr’) or triage them to the appropriate developer when we have enough information
  • Define policy that sheriffs and developers can use to guide work
    • sheriffs can know when to file bugs (either bucket 2 or 3 as a starting point)
    • developers can understand the severity based on the bucket.  In reality we will need a lot more context, but understanding severity is important.
  • Guide developers to know when a new/fixed test is ready for production
    • If we fix a test, we want to ensure it is stable before we make it tier-1.  A developer can do the math based on roughly 300 commits/day and confirm the test passes across them (see the sketch after this list).
    • NOTE: SETA and coalescing ensure we don’t run every test for every push, so in practice we see closer to 100 test runs/day
  • Provide useful data to release and Firefox product management about the quality of a release
    • Release Management can take the OrangeFactor into account
    • new features might be required to keep a certain volume of tests at or below the Random Orange level
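
To make the “do the math” bullet above concrete, here is a small back-of-envelope sketch; the 300 commits/day and ~100 runs/day figures are from the notes above, and the comparison against the Random Orange threshold is my own illustration rather than a formal policy:

#!/bin/bash
# orange_math.sh - rough numbers for judging whether a "fixed" test is stable
runs_per_day=100      # after SETA and coalescing skip many of the ~300 pushes/day
window_days=42        # one 6 week release cycle

runs_per_window=$((runs_per_day * window_days))
echo "expected runs in a 6 week window: $runs_per_window"          # 4200
echo "to stay Random Orange: fewer than 10 failures in $runs_per_window runs"

# For contrast: 20 green retriggers on try cannot reliably rule out even a
# ~10% failure rate, which is far weaker evidence than <10 failures in ~4200 runs.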

One other way to look at this is what gets posted in bugs by the War on Orange Bugzilla robot.  It follows simple rules:

  • 15+ times/day – post a daily summary (bucket #4)
  • 5+ times/week – post a weekly summary (bucket #3/4 – about 40% of bucket 2 will show up here)

Lastly, I would like to cover some exceptions and how some might see this as flawed:

  • missing or incorrect data in OrangeFactor (human error)
  • some issues have many bugs but a single root cause, so we could miscategorize a fixable issue

I do not believe adjusting the definition will fix the above issues; different tools or methods for running the tests might reduce those concerns.



QoC.2 – Iterations and thoughts

Quite a few weeks ago now, the second official Quarter of Contribution wrapped up.  We had advertised 4 projects and found awesome contributors for all 4.  While all hackers gave a good effort, sometimes plans change and life gets in the way.  In the end we had 2 projects with very active contributors.

Those two projects had a lot of activity throughout the quarter.

First off, this 2nd round of QoC wouldn’t have been possible without the Mentors creating projects and mentoring, nor without the great contributors volunteering their time to build great tools and features.

I really like to look at what worked and what didn’t; let me try to summarize some thoughts.

What worked well:

  • building up interest in others to propose and mentor projects
  • having the entire community in #ateam serve as an environment of encouragement and learning
  • specifying exact start/end dates
  • advertising on blogs/twitter/newsgroups to find great hackers

What I would like to see changed for QoC.3:

  • Be clear up front about what we expect.  Many contributors waited until the start date before starting work, which doesn’t give people a chance to ensure mentors and projects are a good fit for them (especially over a couple of months)
  • Ensure each project has clear guidelines on code expectations.  Linting, tests, self review before a PR, etc.  These are all things which might be tough to define and tough to do at first, but they make for better programmers and end products!
  • Check in on the projects every other week as mentors (just a simple IRC chat or email chain)
  • Consider the timing of the project: either run it on demand as mentors want to do it, or continue in batches, but avoid overlapping with common mentor time off (work weeks, holidays)
  • Encourage mentors to set weekly meetings and “office hours”

As it stands now, we are pushing on submitting Outreachy and GSoC project proposals; assuming those programs pick up our projects, we will look at QoC.3 sometime around September or November.




QoC.2 – WPT Results Viewer – wrapping up

Quite a few weeks ago now, the second official Quarter of Contribution wrapped up.  We had advertised 4 projects and found awesome contributors for all 4.  While all hackers gave a good effort, sometimes plans change and life gets in the way.  In the end we had 2 projects with very active contributors.

In this post, I want to talk about WPT Results Viewer.  You can find the code on github, and still find the team on irc in #ateam.  As this finished up, I reached out to :martianwars to learn what his experience was like; here are his own words:

What interested you in QoC?

So I’d been contributing to Mozilla for some time, fixing random bugs here and there. I was looking for something larger and more interesting. I think that was the major motivation behind QoC, besides Manishearth’s recommendation to work on the Web Platform Test Viewer. I guess I’m really happy that QoC came around the right time!

What challenges did you encounter while working on your project?  How did you solve them?

I guess the major issue while working on wptview was the lack of Javascript knowledge and the lack of help online when it came to Lovefield. But like every project, I doubt I would have enjoyed much had I known everything required right from the start. I’m glad I got jgraham as a mentor, who made sure I worked my way up the learning curve as we made steady progress.

What are some things you learned?

So I definitely learnt some Javascript, code styling, the importance of code reviews, but there was a lot more to this project. I think the most important thing that I learnt was patience. I generally tend to search for StackOverflow answers when I need to perform a programming task I’m unaware of. With Lovefield being a relatively new project, I was compelled to patiently read and understand the documentation and sample programs. I also learnt a bit on how a large open source community functions, and I feel excited being a part of it!  A bit irrelevant to the question, but I think I’ve made some friends in #ateam :) The IRC is like my second home, and helps me escape life’s never ending stress, to a wonderland of ideas and excitement!

If you were to give advice to students looking at doing a QoC, what would you tell them?

Well, the first thing I would advise them is not to be afraid, especially of asking the so called “stupid” questions on the IRC. The second thing would be to make sure they give the project a decent amount of time, not with the aim of completing it or something, but to learn as much as they can🙂 Showing enthusiasm is the best way to ensure one has a worthwhile QoC🙂 Lastly, I’ve tried my level best to get a few newcomers into wptview. I think spreading the knowledge one learns is important, and one should try to motivate others to join open source🙂

If you were to give advice to mentors wanting to mentor a project, what would you tell them?

I think jgraham has set a great example of what an ideal mentor should be like. Like I mentioned earlier, James helped me learn while we made steady progress. I especially appreciate the way he had (has rather) planned this project. Every feature was slowly built upon and in the right order, and he ensured the project continued to progress while I was away. He would give me a sufficient insight into each feature, and leave the technical aspects to me, correcting my fallacies after the first commit. I think this is the right approach. Lastly, a quality every mentor MUST have, is to be awake at 1am on a weekend night reviewing PRs😉

Personally I have really enjoyed getting to know :martianwars and seeing the great progress he has made.



Introducing a contributor for the Pulse Guardian project

3 weeks ago we announced the new Quarter of Contribution; today I would like to introduce the participants.  Personally I really enjoy meeting new contributors and learning about them.  It is exciting to see interest in all 4 projects.  Let me introduce who will be working on Pulse Guardian – Core Hacker:

Mike Yao

What interests you in this specific project?

Python, infrastructure

What do you plan to get out of this after 8 weeks?

Continue to contribute to Mozilla

Are there any interesting facts/hobbies that you would like to share so others can enjoy reading about you?

Cooking/food lover; I was a chef a long time ago. Free software/open source and Linux changed my mind and career.


I do recall one other eager contributor who might join in late once exams are completed; meanwhile, enjoy learning a bit about Mike Yao (who was introduced to Mozilla by Mike Ling, who took part in our first ever Quarter of Contribution).



Adventures in Task Cluster – running a custom Docker image

I needed to get compiz on the machine (bug 1223123), and I thought it should be on the base image.  So, taking the path of most resistance, I dove deep into the internals of taskcluster/docker and figured this out.  To be fair, :ahal gave me a lot of pointers; in fact, if I had taken better notes this might have been a single pass to success vs. 3 attempts.

First let me explain a bit about how the docker image is defined and how it is referenced/built up.  We define the image to use in-tree.  In this case we are using taskcluster/desktop-test:0.4.4 for the automation jobs.  If you look carefully at the definition in-tree, we have a Dockerfile which outlines which base image we inherit from:

FROM          taskcluster/ubuntu1204-test-upd:

This means there is another image called ubuntu1204-test-upd, which is also defined in-tree and which in turn references a 3rd image, ubuntu1204-test.  These all build upon each other, creating the final image we use for testing.  If you look in each of the directories, there is a REGISTRY and a VERSION file; these are used to determine the image to pull, so in the case of wanting:

docker pull taskcluster/desktop-test:0.4.4

we would effectively be using:

docker pull {REGISTRY}/desktop-test:{VERSION}

For our use case, taskcluster/desktop-test is defined on Docker Hub.  This means that you could create a new version of the ‘desktop-test’ container and use that while pushing to try.  In fact, that is all that is needed.
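
As a minimal sketch of how those two files turn into the image reference, assuming the testing/docker/desktop-test directory described above (this is an illustration, not the exact in-tree tooling):

IMAGE_DIR=testing/docker/desktop-test
REGISTRY=$(cat "$IMAGE_DIR/REGISTRY")    # e.g. taskcluster, or your own Docker Hub account
VERSION=$(cat "$IMAGE_DIR/VERSION")      # e.g. 0.4.4
docker pull "${REGISTRY}/desktop-test:${VERSION}"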

First let’s talk about how to create an image.  I found that I needed to create both a desktop-test and an ubuntu1204-test image on Docker Hub.  Luckily there is a script in-tree which will take a currently running container and make a convenient package ready to upload; some steps would be (a plain-docker sketch follows the list):

  • docker pull taskcluster/desktop-test:0.4.4
  • docker run taskcluster/desktop-test:0.4.4
  • apt-get install compiz; # inside of docker, make modifications
  • # now on the host we prepare the new image (using elvis314 as the docker hub account)
  • echo elvis314 > testing/docker/desktop-test/REGISTRY
  • echo 0.4.5 > testing/docker/desktop-test/VERSION  # NOTE: I incremented the version
  • cd testing/docker
  • build the new desktop-test image from the modified container # go run a 5K
  • docker push elvis314/desktop-test:0.4.5 # go run a 10K
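
If you prefer plain docker commands to the in-tree packaging script, a rough equivalent of the build-and-push steps above is the sketch below; docker commit is not necessarily what the in-tree script does, and the container id lookup, account name, and version are placeholders:

CID=$(docker ps -lq)                               # the container we just modified above
docker commit "$CID" elvis314/desktop-test:0.4.5   # snapshot it as a new image (go run a 5K)
docker push elvis314/desktop-test:0.4.5            # upload it to Docker Hub (go run a 10K)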

Those are the simple steps to update an image; what we want to do next is verify the image has what we need.  While I am not an expert in docker, I do like to keep my containers under control, so I do a |docker ps -a| and then |docker rm <cid>| for any containers that are old.  Now to verify, I do this:

  • docker pull elvis314/desktop-test:0.4.5
  • docker run elvis314/desktop-test:0.4.5
  • compiz # this verifies my change is there, I should get an error about display not found!

I will continue on here assuming things are working.  As you saw earlier, I had modified files in testing/docker/desktop-test; these should be part of a patch to push to try.  In fact, that is all the magic.  To actually use compiz successfully, I needed to add a line to launch |compiz &| after initializing Xvfb.

Now when you push to try with your patch, any tasks that used taskcluster/desktop-test before will use the new image (i.e. elvis314/desktop-test).  In this case I was able to see the test cases that opened dialogs and windows pass on try!



Adventures in Task Cluster – Running tests locally

There is a lot of promise around Taskcluster (the replacement for Buildbot in our CI system at Mozilla) being the best thing since sliced bread.  One of the deliverables on the Engineering Productivity team this quarter is to stand up the Linux debug tests on Taskcluster in parallel with running them normally via Buildbot.  Of course next quarter it would be logical to turn off the Buildbot tests and run them via Taskcluster.

This post will outline some of the things I did to run the tests locally.  What is neat is that we run the taskcluster jobs inside a Docker image (yes this is Linux only), and we can download the exact OS container and configuration that runs the tests.

I started out with a try server push which generated some data and a lot of failed tests.  Sadly I found that the treeherder integration was not really there for results.  We have a fancy popup in treeherder when you click on a job, but for taskcluster jobs, all you need to do is find the link to inspect the task.  When you inspect a task, it takes you to a taskcluster-specific page that has information about the task.  In fact you can watch a test run live (at least from the log output point of view).  In this case, my test job is completed and I want to see the errors in the log, so I can click on the link for live.log and search away.  The other piece of critical information is the ‘Task’ tab at the top of the inspect task page.  Here you can see the details about the docker image used, what binaries and other files were used, and the golden nugget at the bottom of the page, the “Run Locally” script!  You can cut and paste this script into a bash shell and theoretically reproduce the exact same failures!

As you can imagine, this is exactly what I did and it didn’t work!  Luckily there were a lot of folks in the #taskcluster channel to help me get going.  The problem I had was I didn’t have a v4l2loopback device available.  This is interesting because we need this in many of our unittests, and it means that the host operating system running docker needs to provide video/audio devices for the docker container to use.  Now it is time to hack this up a bit; let me start:

First let’s pull down the docker image used (from the “Run Locally” script):

docker pull 'taskcluster/desktop-test:0.4.4'

Next let’s prepare my local host machine by installing and setting up v4l2loopback:

sudo apt-get install v4l2loopback-dkms

sudo modprobe v4l2loopback devices=2
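
Before handing a device to docker, it is worth confirming the loopback nodes actually showed up on the host; this quick check is my own addition to the steps above:

ls -l /dev/video*           # with devices=2, two new /dev/video* nodes should appear
lsmod | grep v4l2loopback   # confirm the kernel module is loaded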

Now we can try to run docker again, this time adding the --device option:

docker run -ti \
  --name "${NAME}" \
  --device=/dev/video1:/dev/video1 \
  -e MOZHARNESS_SCRIPT='mozharness/scripts/' \
  -e MOZHARNESS_CONFIG='mozharness/configs/unittests/ mozharness/configs/' \
  -e GECKO_HEAD_REV='5e76c816870fdfd46701fd22eccb70258dfb3b0c' \
  'taskcluster/desktop-test:0.4.4'

Now when I run the test command, I don’t get v4l2loopback failures!

bash /home/worker/bin/ --no-read-buildbot-config '--installer-url=' '--test-packages-url=' '--download-symbols=ondemand' '--mochitest-suite=browser-chrome-chunked' '--total-chunk=7' '--this-chunk=1'

In fact, I get the same failures as I did when the job originally ran :)  This is great, except for the fact that I don’t have an easy way to run a test by itself, debug, or watch the screen; let me go into a few details on that.

Given a failure in browser/components/search/test/browser_searchbar_keyboard_navigation.js, how do we get more information on that?  Locally I would do:

./mach test browser/components/search/test/browser_searchbar_keyboard_navigation.js

Then I would at least see if anything looks odd in the console, on the screen, etc.  I might look at the test and see where we are failing to give me more clues.  How do I do this in a docker container?  The command above to run the tests calls an entry script, which then calls the test script as the user ‘worker’ (not as user root).  It is important that we use the ‘worker’ user because the pactl program used to find audio devices will fail as root.  That sets up the box for testing, including running pulseaudio, Xvfb, and compiz (after bug 1223123), and bootstraps mozharness.  Finally we call the mozharness script to run the job we care about; in this case it is ‘mochitest-browser-chrome-chunked’, chunk 1.  It is important to follow these details because mozharness downloads all python packages, tools, firefox binaries, other binaries, test harnesses, and tests.  Then it creates a python virtualenv to set up the python environment to run the tests, putting all the files in the proper places and unpacking them.  Now mozharness can call the test harness (python with --browser-chrome and the rest of the arguments).  Given this overview of what happens, it seems as though we should be able to run the mozharness script with <params> --test-path browser/components/search/test
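
To keep that chain of scripts straight, here it is as a commented shell outline; the harness invocation at the end is illustrative of the shape of the command, not the exact arguments used in automation:

# entry script in /home/worker/bin (run first, as root)
#   -> re-runs the inner test script as the 'worker' user
#      (pactl cannot find audio devices when run as root)
#   -> starts pulseaudio, Xvfb, and compiz (after bug 1223123)
#   -> bootstraps mozharness, which downloads the Firefox binary, tools,
#      harnesses, and tests, unpacks them, and builds a python virtualenv
#   -> mozharness finally invokes the mochitest harness, roughly:
#      python runtests.py --browser-chrome <chunking and path arguments>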

Why this doesn’t work is that mozharness has no method for passing in a directory or single test, let alone doing other simple things that |./mach test| allows.  In fact, in order to run this single test, we need to:

  • download Firefox binary, tools, and harnesses
  • unpack them (in all the right places)
  • setup the virtual env and install needed dependencies
  • then run the mochitest harness with the dirty dozen (just too many commands to memorize)

Of course most of this is scripted, so how can we take advantage of our scripts to set things up for us?  What I did was hack the wrapper script locally to not run mozharness and instead echo the command, and likewise hack the mozharness script to echo the test harness call instead of calling it.  Here are the commands I ended up using:

  • bash /home/worker/bin/ --no-read-buildbot-config '--installer-url=' '--test-packages-url=' '--download-symbols=ondemand' '--mochitest-suite=browser-chrome-chunked' '--total-chunk=7' --this-chunk=1
  • #now that it failed, we can do:
  • cd workspace/build
  • . venv/bin/activate
  • cd ../build/tests/mochitest
  • python --app ../../application/firefox/firefox --utility-path ../bin --extra-profile-file ../bin/plugins --certificate-path ../certs --browser-chrome browser/browser/components/search/test/
  • # NOTE: you might not want --browser-chrome or the specific directory, but you can adjust the parameters used

This is how I was able to run a single directory, and then a single test.  Unfortunately that just proved that I could hack around the test case a bit and look at the output.  In docker there is no simple way to view the screen.   To solve this I had to install x11vnc:

apt-get install x11vnc

Assuming the Xvfb server is running, you can then do:

x11vnc &

This allows you to connect with VNC to the docker container!  The problem is you need the IP address, which I can get from the host by doing:

docker ps #find the container id (cid) from the list

docker inspect <cid> | grep IPAddress

For me this gives the container’s IP address, and now from my host I can connect a VNC viewer to it.
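
A minimal sketch of that last step, assuming a VNC client such as vncviewer is installed on the host and x11vnc is already running against Xvfb inside the container:

CID=$(docker ps -q | head -n1)    # or use the name you passed to docker run --name
IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' "$CID")
vncviewer "$IP"                   # x11vnc serves display :0 on port 5900 by default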


This is great as I can now see what is going on with the machine while the test is running!

This is it for now.  I suspect in the future we will make this simpler by doing:

  • allowing mozharness (and scripts) to take a directory instead of some args
  • create a simple bootstrap script that allows for running ./mach style commands and installing tools like x11vnc.
  • figuring out how to run a local objdir in the docker container (I tried mapping the objdir, but had GLIBC issues based on the container being based on Ubuntu 12.04)

Stay tuned for my next post on how to update your own custom TaskCluster image; yes, it is possible if you are patient.



Lost in Data – Episode 3 – digging into alerts from an uplift

Yesterday I recorded a session where I looked at alerts from an uplift.  I did a lot of rambling and didn’t cover a lot of content, but there are a few interesting differences between uplift alerts and normal alerts:

  • a 6 week snapshot
  • we try to match them up historically
  • there should be pre-existing bugs
  • a lot of the time there are odd things
  • we need to account for differences in build/configs between trunk and aurora/beta.

If you want to take a look at this, the link is on


I do plan to do more episodes soon; a few topics of interest:

  • understanding the noisy tests (bi-modal, noise)
  • comparing talos on android to the new ported tests to autophone
  • running talos locally and interpreting the results
  • looking at intermittent failures in general- maybe with a focus on Talos issues.

