smoketest for firefox android on panda boards

Last September the panda boards were deemed ready to run tests.  The next steps were to start integrating them into buildbot and making them 100% automated.  This task turned into a much larger project, and the end result was a smoketest which yielded a cleaner integration point with the automation.

The core of the android automation to date has been on the NVidia Tegra 250 developer kit.  This has been running quite successfully with 3-4% total failure rate (product, test harness, tests, infrastructure, hardware).  Our goal for testing on Android 4.0 was to test on the panda boards which also have a NEON chipset.  Essentially this is just like adding more tegra boards to our automation, and for the most part that was true.

The main problems we faced came about when dealing with installing, rebooting, and overall management of the device.  For our tegras, this is all in a set of python code called sut_tools.  These sut_tools handle all the device management, and with a few modifications we were able to do that for the panda boards.

While the tests and harnesses ran fine on a panda board at my desk, getting them to work smoothly with the sut_tools and the buildbot scripts proved to be quite a challenge.  After about 10 weeks of solid work and many bugs fixed in the android kernel, system libraries, Firefox, and of course our harnesses, we were able to get this going fairly reliably, with a <10% total failure rate when we first turned these tests on in late December.

In order to prove this was working, we developed a smoketest which would run on the production foopies (host to control the panda boards, 12 at a time) and production panda boards.  In fact this ended up being a way to diagnose boards, script changes and help debug overall test failures.  The original smoketest was going to be ‘run some tests on a given panda board for 24 hours’.  The resulting smoketest is a reuse of the exact tools we use in automation for cleanup, verification, installation, and uninstalling the product from the device under test.  We also run a set of production mochitests, so we mimic a real job being pushed with about 98% accuracy.

To run these, it is pretty easy:

While this sounds straightforward, there is a bit more required in order to test a new panda board or, as we normally do, a chassis of new panda boards.  As it stands now, I run an instance of smoketest.py in a different terminal window for every panda I am interested in testing.  Usually this is 6-8 at a time, but this can easily be done for 1 or 12 without concern.

I usually run this in a loop of 100:

  • $ for i in {0..99}; do python smoketest.py; done

Then I grep the logs looking for failure messages or more specifically count how many success messages I have.  If I have >95% success rate across all my logs, this is a good sign that things are ready to roll.
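The counting step could be scripted instead of done by hand with grep.  Here is a minimal sketch, assuming each panda writes to its own `.log` file and that every successful run logs a fixed marker string.  The marker `SMOKETEST-SUCCESS`, the `*.log` naming, and both function names are hypothetical; substitute whatever smoketest.py actually prints.

```python
from pathlib import Path

# Hypothetical marker: replace with the success message smoketest.py really logs.
SUCCESS_MARKER = "SMOKETEST-SUCCESS"


def success_rate(log_text, runs, marker=SUCCESS_MARKER):
    """Return the fraction of runs whose success marker appears in the log."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return log_text.count(marker) / runs


def summarize(log_dir, runs_per_log=100, threshold=0.95):
    """Map each log file name to (success rate, passed threshold?)."""
    results = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        text = path.read_text(errors="replace")
        rate = success_rate(text, runs_per_log)
        results[path.name] = (rate, rate > threshold)
    return results


if __name__ == "__main__":
    for name, (rate, ok) in summarize(".").items():
        print(f"{name}: {rate:.0%} {'ok' if ok else 'NEEDS ATTENTION'}")
```

The 100 runs per log and the 95% threshold mirror the loop and success rate described above; both are parameters so they can be changed per chassis.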

In the future, it would be nice to make smoketest.py have a better reporting and looping system.  There is also the need to get us to 99% success rate running a controlled smoketest.  One thing that would make this easier would be a tool to launch on a given set of machines and report back information and query the log files for easier parsing and status.

 

Filed under testdev

Mozilla A-Team – How to compare talos numbers from try server to trunk

Have you ever been working on a change that you think will affect performance numbers and you were not sure how to verify the impact of your change?

I have had questions on how to do this and recently I needed to do it myself (as I introduced a change to Talos which caused a big old performance regression in everything).

The main use case I needed to do was run a change on Try server and verify that it did in fact fix my performance regression.  Normally I would go to tbpl, and click on each of my tests to see the reported number(s).  For each of those test:number sets, I would look on graph server (hint: you can get to graph server for a given test by clicking on the reported number) and verify that my numbers were inside the expected range for that test/platform/branch based on the history.  If only I was part of a software developers union, I could complain that this boring, time-intensive work was not in my contract.

To simplify my life, I decided to automate this with a python script.  I wrote compare.py which will spit out a text based summary of what I described above.  Here is a sample output:

python compare.py --revision c094aeea5f73 --branch Try --masterbranch Firefox --test tp5n --platform Linux
Linux:
    tp5n: 292.157 -> 400.444; 308.596

A quick explanation:

  • 292.157 is the lowest number reported in the last 7 days for tp5n,linux
  • 400.444 is the highest number reported in the last 7 days for tp5n,linux
  • 308.596 is the value reported from my test on try server for tp5n,linux

While this doesn’t do the previous 30 changesets and the next 5, it gives a pretty good indicator about what to expect.  I can run this on a different time range (to check the 7 days prior to my introduced regression) by adding --skipdays to the command line:

python compare.py --revision c094aeea5f73 --branch Try --masterbranch Firefox --test tp5n --platform Linux --skipdays 6
 Linux:
 **tp5n: 311.975 -> 398.571; 308.596

Here you will see a “**tp5n”, and that indicates that the Try server number is not in the range and should be looked at the old fashioned way.
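The range check behind that output is simple enough to sketch.  This is not the source of compare.py, just an illustration of the logic described above: take the lowest and highest values from the history window, and flag the Try value with “**” if it falls outside them.  The function name and the middle history value are made up for the example.

```python
def format_result(test, history, try_value):
    """Format one summary line; prefix with ** when try_value is out of range."""
    low, high = min(history), max(history)
    flag = "" if low <= try_value <= high else "**"
    return f"{flag}{test}: {low:.3f} -> {high:.3f}; {try_value:.3f}"


if __name__ == "__main__":
    # The endpoints come from the sample output above; 350.0 is invented.
    print(format_result("tp5n", [292.157, 350.0, 400.444], 308.596))
    print(format_result("tp5n", [311.975, 350.0, 398.571], 308.596))
```

With the second window, 308.596 is below the lowest historical value, so the line gets the “**” marker.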

Hope this helps in debugging.

Filed under testdev

Mozilla A-Team – Unraveling the mystery of Talos – part 1 of a googol

Most people at Mozilla have heard of Talos.  If you haven’t, Talos is the performance testing framework that runs for every checkin that occurs at Mozilla.

Over the course of the last year I have had the opportunity to extend, modify, retrofit, and rewrite many parts of the harness and tests that make up Talos.

It seems that once or twice a month I get a question about Talos.  Wouldn’t it be nice if I documented Talos?  When Alice was the main owner of Talos, she had written up some great documentation and as of today I am announcing that it has gone through an update:

https://wiki.mozilla.org/Buildbot/Talos

Stay tuned as there will be more updates to come as we make the documentation more useful.

Filed under general, testdev

What is so special about 267194?

Today I landed the final patch for bug 761125.  This is a huge milestone for Firefox Android automation.  We had been running 23,531 checks for each push by running known good directories of tests, but now we will be running 290,725 checks by running ALL test cases except the 282 known test cases which fail, time out, hang, or crash.  That is 267,194 more checks on every push.

I fully expect a lot of new random failures to crop up; this is just the nature of automated testing.

For the PI geeks out there, 267194 can be found at the 2051st digit of PI :)

Filed under Uncategorized

Android unittests now have LOLcat

One common criticism of the android unittests is that there is no logcat information.  I could go on for a long time about the decisions made from day 1 up to today, but the main thing is we have found adb to be very unreliable, so we connect via telnet to a SUT agent on the tegra.  No usb cable or active adb session for all the automation we do.  This means collecting logcat information is not as easy as issuing an ‘adb logcat > testrun.log’.

Today I landed bug 754873 which collects logcat information and puts it in the log file generated by tinderbox.  This does not capture the entire logcat session, but it does filter out the random noise that shows up in our logcat coming from tegras (dalvikvm and network wifi messages).  We will display up to 500 messages if they exist.  As logcat is a rotating queue of logs, it will be rare that we find a full 500 lines of useful information in the logs.  What this will solve is when we have a crash in Java or some other crash that is not detected by the crashreporter we will be able to see it in our tinderbox logs.
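To make the filtering concrete, here is a minimal sketch of the idea, not the code that landed in bug 754873.  It drops lines that look like noise and keeps at most 500 of what remains.  The substring markers, the choice to keep the most recent lines, and the function name are all assumptions for illustration.

```python
from collections import deque

# Assumed noise markers, matched case-insensitively as substrings.
NOISE_MARKERS = ("dalvikvm", "wifi")


def filter_logcat(lines, noise=NOISE_MARKERS, limit=500):
    """Drop noisy logcat lines and keep at most `limit` of the newest ones."""
    kept = deque(maxlen=limit)  # older entries fall off once the limit is hit
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in noise):
            continue
        kept.append(line)
    return list(kept)


if __name__ == "__main__":
    sample = [
        "D/dalvikvm( 1): GC_CONCURRENT freed",
        "E/GeckoAppShell( 2): crash",
        "I/WifiStateMachine( 3): scan",
    ]
    for line in filter_logcat(sample):
        print(line)
```

Matching on substrings is crude; a real filter would more likely match on the logcat tag, but this shows the shape of it.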

random fact: did you know you can do ‘adb lolcat’ and it will produce the same output as ‘adb logcat’

Filed under Uncategorized

Two failed attempts with technology today, just one of those days

Today I experienced two WTF moments while trying to use computers:

1) BrowserID ended up being a total failure for me

2) Accessing people.mozilla.org is next to impossible when trying to share files across computers

I have heard great things about BrowserID, and today was my first real chance at it.  I had an account on builder.addons.mozilla.org, and this was with my <me>@mozilla.com email address.  It has been a few months since I had been on there and now it uses BrowserID for all access.  Great!!  I had signed up with BrowserID with my <me>@gmail.com address, but that failed to log me in.  So I clicked the ‘add another email address’, and got a verification email in my inbox.  Trying to verify was impossible with some cryptic error messages.  10 minutes later after trying to log in, I finally found my way to #identity and was told to try it again.  It magically worked.  OK, let me log in to my addons account, no luck.  After 15 more minutes of poking around, I found that my @mozilla.com email address worked with BrowserID just fine by testing it on another site, but it still failed on addons.

Here is my take on the problem:

  • BrowserID is supposed to make logging in easier, yet after 30 minutes of debugging I still cannot log in.
  • There are no useful error or help messages on the BrowserID site or on AMO.  How could my mom figure this out?
  • Where in the world is my ‘I forgot my username/password’ link?  Honestly, I could have signed up on AMO with a totally random email address and could have been wasting a lot of time.
  • I found it easier to sign up as a new user with a different BrowserID email than to figure out how to log in with my normal account.

My next problem occurs with accessing people.mozilla.org.  I have been using this for 3.5 years on a regular basis.  I put log files up there for people to read, zip files when I want to share some code or a build, and sometimes I create a webpage to outline data.  I depend on this as a workflow since I know of no other file server at mozilla that I can just scp files up to.  Just this past weekend, some work was done on the server and the permissions got messed up.  This was fixed, then it wasn’t, it was fixed and now it isn’t.  I can detect patterns and that is a pretty easy pattern to detect.  What really gets me is this message when I log in:

Last login: Thu May 17 18:41:20 2012 from zlb1.dmz.scl3.mozilla.com
All files stored on this server are subject to automated scans.
You shouldn’t store sensitive information on this server, and you should
avoid having production services depend on data stored here.
Files in ~/public_html may be seen by anyone on the Internet.
[jmaher@people1.dmz.scl3 ~]$

Who in their right mind would think that files in a folder called ‘public_html’ would not be seen by anyone on the Internet?  I expect tomorrow I will have to sign an NDA to access my people.mozilla.org account.

The big problem here is that I wasted 20 minutes doing a task that I normally do in 2 minutes and delayed getting a perma red test fixed because I couldn’t find a place to upload a fixed talos.zip to.

Enough complaining and ranting and back to work on reftests for android native!

Filed under general, personal, reviews

Reducing the Noise in Talos

Over the last year there has been a lot of research into reducing the noise in our talos performance numbers.  For example, looking at tscroll, we have a fluctuation in the reported numbers of almost 400 (out of 14000).  Jan Larres took a look at this problem in his master’s thesis, and found a variety of factors that did and didn’t contribute to the noise.  We actually have Bug 706912 filed to implement some of his suggestions on how to calculate the posted number.  Last fall, Stephen Lewchuk looked at the raw data that was collected and found some inconsistencies in the way we were aggregating the data.  In short, we have a lot of ground to cover if we want to reduce the noise in our numbers.

Over the last couple of months, we have been working on a project called Signal From Noise.  This is an attempt to fix the way we collect some numbers and redo the way we aggregate numbers for reporting.  We have done a lot of experimenting with the primary focus on tp5.  The way we run tp5 is to load each of the 100 pages once, then repeat 10 times.  For each page, we would drop the highest value and take the median value of the remaining 9 numbers.  This results in an array of 100 data points which get reported to the graph server.  We take those 100 data points and average them out to generate the single number for tp5.  It is easy to imagine that the small samples and median/average combination will produce a lot of noise.
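The aggregation just described can be written down in a few lines.  This is a sketch of the arithmetic only, not the Talos or graph server code; the function names and the two-page sample data are invented for illustration.

```python
from statistics import mean, median


def page_value(samples):
    """Drop the single highest sample, then take the median of the rest."""
    if len(samples) < 2:
        raise ValueError("need at least two samples per page")
    trimmed = sorted(samples)[:-1]  # discard the maximum
    return median(trimmed)


def tp5_number(samples_by_page):
    """Average the per-page medians into one reported number."""
    return mean(page_value(s) for s in samples_by_page.values())


if __name__ == "__main__":
    # Two made-up pages with 10 load times each (real tp5 has 100 pages).
    example = {
        "page_a": [100, 102, 98, 250, 101, 99, 100, 103, 97, 100],
        "page_b": [200, 210, 190, 205, 195, 400, 198, 202, 201, 199],
    }
    print(tp5_number(example))
```

Note how a regression on a single page barely moves the final average, which is the “washed over” effect this project is trying to remove.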

Going forward, we are looking to change from column major to row major and collect 30 samples instead of 10.  This means we focus on one page and load it 30 times, then move to the next page and repeat until all 100 pages have been loaded.  The downside is the runtime, as we move from an average of 17 minutes to an average of 39 minutes for the entire tp5 run.  Collecting 30 samples will give us a much more meaningful number, but we also found that the first 5-10 iterations contain the most noise.  So initially we are looking to throw away the first 10 numbers instead of throwing away the highest number as we originally did.  When looking at the raw numbers (not the aggregated number), here are some graphs to highlight the difference:
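For anyone unsure what the column major versus row major change means in practice, here is a small sketch of the two load orders and the new warm-up filter.  The function names are hypothetical and this is not Talos code; it only illustrates the ordering and the “drop the first 10” rule described above.

```python
def column_major(pages, cycles):
    """Old order: load every page once, then repeat the whole set."""
    return [page for _ in range(cycles) for page in pages]


def row_major(pages, cycles):
    """New order: load one page `cycles` times, then move to the next."""
    return [page for page in pages for _ in range(cycles)]


def drop_warmup(samples, warmup=10):
    """Discard the first `warmup` samples, which tend to be the noisiest."""
    return samples[warmup:]


if __name__ == "__main__":
    pages = ["a", "b"]
    print(column_major(pages, 2))  # a, b, a, b
    print(row_major(pages, 2))     # a, a, b, b
    print(len(drop_warmup(list(range(30)))))  # 20 samples remain per page
```

With 30 samples and the first 10 dropped, each page is left with 20 data points, compared with 9 under the old scheme.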

[Two graphs of the raw numbers; images not available]

 

This is only the first step in many changes needed.  After rolling this out, we need to evaluate the other test suites as well and ensure we are running adequate cycles to get a valid sample size.  We are also working on allowing the database to accept the raw values instead of the single median value per page.  Likewise, we are looking to stop doing an average([median(page)]).  All of this will allow us to find regressions more easily per page instead of having them washed over by the other numbers.

Filed under Uncategorized