Lost in data – episode 2 – bisecting and comparing

This week on Lost in Data, we tackle yet another pile of alerts.  This time we have a set of changes which landed together, so we push to try to bisect them.  In addition we have an e10s-only failure which happened when we broke talos uploading to perfherder.  See how I get one step closer to figuring out the root cause of the regressions.

Filed under Uncategorized

lost in data – episode 1, tackling a bunch of alerts

Today I recorded a session of me investigating talos alerts.  It is ~35 minutes, sort of long, but doable in the background whilst grabbing breakfast or lunch!

I am looking forward to more sessions and answering questions that come up.

Filed under Uncategorized

Please welcome the Dashboard Hacker team

A few weeks ago we announced that we would be looking for committed contributors for 8+ weeks on Perfherder.  We have found a few great individuals, all of whom show a lot of potential and make a great team.

They are becoming familiar with the code base and are already making a dent in the initial list of work set aside.  Let me introduce them (alphabetical order by nicks):

akhileshpillai – Akhilesh has jumped right in and started fixing bugs in Perfherder.  He is new to Mozilla and will fit right in.  With how fast he has come up to speed, we are looking forward to what he will be delivering in the coming weeks.  We have a lot of UI workflow as well as backend data refactoring work on our list, all of which he will be a great contributor towards.

mikeling – mikeling has been around Mozilla for roughly two years, and recently he started helping with a few projects on the A*Team.  He is very detail oriented, easy to work with and is willing to tackle big things.

theterabyte – Tushar is excited about this program as an opportunity to grow his skills as a Python developer and to experience how software is built outside of a classroom.  Tushar will get a chance to grow his UI skills on Perfherder by making the graphs and compare view more polished and complete, while helping out with an interface for the alerts.

Perfherder will soon become the primary dashboard for all our performance needs.

I am looking forward to the ideas and solutions these new team members bring to the table.  Please join me in welcoming them!

Filed under Community, testdev

Please join me in welcoming the DX Team

A few weeks ago, I posted a call out for people to reach out and commit to participate for 8+ weeks.  There were two projects and one of them was Developer Experience.  Since then we have had some great interest, and there are 5 awesome contributors participating (sorted by IRC nicks).

BYK – I met BYK 3+ years ago on IRC.  He is a great person and very ambitious.  As a more senior developer he will be focused primarily on improving interactions with mach.  While there are a lot of little things to make mach better, BYK proposed a system to collect information about how mach is used.

gma_fav – I met gma_fav on IRC when she heard about the program.  She has a lot of energy, seems very detail oriented, asks good questions, and brings fresh ideas to the team!  She is a graduate of the Ascend project and is looking to continue her growth in development and open source.  Her primary focus will be on the interface to try server (think the try chooser page, extension, and taking other experiments further).

kaustabh93 – I met Kaustabh on IRC about a year ago and since then he has been a consistent friend and hacker.  He attends university.  In fact I do owe him credit for large portions of alert manager.  While working on this team, he will be focused on making run-by-dir a reality.  There are two parts: getting the tests to run green, and reducing the overhead of the harness.

sehgalvibhor – I met Vibhor on IRC about 2 weeks ago.  He was excited about the possibility of working on this project and jumped right in.  Like Kaustabh, he is a student who is just finishing up exams this week.  His primary focus this summer will be working in a similar role to Stanley, making our test harnesses behave consistently and be more useful.

stanley – When this program was announced Stanley was the first person to ping me on IRC.  I have found him to be very organized, a pleasure to chat with and he understands code quite well.  Coding and open source are both new things to Stanley and we have the opportunity to give him a great view of it.  Stanley will be focusing on making the commands we have for running tests via mach easier to use and more unified between harnesses.

Personally I am looking forward to seeing the ambition folks have translate into great solutions, learning more about each person, and sharing with Mozilla as a whole the great work they are doing.

Take a few moments to say hi to them online.

Filed under Community, testdev

re-triggering for a [root] cause- version 2.0 – a single bug!

Today the orange factor email came out: the top 10 bugs with stars on them :)  Last week we had no bugs that we could take action on, and the week before we had a few bugs to take action on.

This week I looked at each bug again and annotated them with some notes as to what I did or why I didn’t do anything.  Here are the bugs:

  • Bug 1160008 Intermittent testVideoDiscovery
    • too old!  (from last week)
  • Bug 1073442 Intermittent command timed out
    • too old; infra issue (from week before last)
  • Bug 1121145 Intermittent browser_panel_toggle.js
    • too old!  problem got worse on April 24th (from last week)
  • Bug 1157948 DMError: Non-zero return code for command
    • too old!  most likely a harness/infra issue (from last week)
  • Bug 1161817 Intermittent browser_timeline-waterfall-sidebar.js
    • fixed already
  • Bug 1168747 Intermittent 336736-1a.html
    • resolved (via skip-if in manifest)
  • Bug 1149955 Intermittent Win8-PGO test_shared_all.py
    • too old (from last week – someone looking into it now though!)
  • Bug 1158887 Intermittent test_async_transactions.js
    • reopened – fix landed recently, so re-triggering doesn’t seem helpful :(
  • Bug 1090203 Intermittent style-src-3_2.html
    • to do work
  • Bug 1081925 Intermittent browser_popup_blocker.js
    • test is disabled now (from last week)

It is nice to find a bug that we can take action on.  What is interesting is the bug has been around for a while, but we noticed about May 21 that the rate of failures went up from a couple a day to >5/day.  Details:

  • started out re-triggering on m-c.  We could see a pattern on a specific merge.
  • did re-triggers on m-i.  The first 20 were inconclusive, so I triggered 20 more for each job; the results were still inconclusive.  There is no increasing pattern based on a specific changeset.
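
To make "no increasing pattern" concrete, here is a minimal sketch of how re-trigger results could be scanned for a step up in failure rate.  It is not the tool I used; the function names, the 10% threshold, and all of the numbers are made up for illustration.

```python
def failure_rates(results):
    """Turn (revision, failures, runs) tuples into (revision, rate) pairs."""
    return [(rev, fails / runs) for rev, fails, runs in results]


def first_jump(results, min_increase=0.10):
    """Return the first revision whose failure rate exceeds the previous
    revision's rate by at least `min_increase`, or None if there is no
    such step.  `min_increase` is an arbitrary threshold for this sketch."""
    rates = failure_rates(results)
    for (_, prev), (rev, cur) in zip(rates, rates[1:]):
        if cur - prev >= min_increase:
            return rev
    return None


# Made-up data, oldest revision first: (revision, failures, runs).
# 40 runs per revision mirrors 20 re-triggers plus 20 more.
inconclusive = [("a", 2, 40), ("b", 3, 40), ("c", 2, 40), ("d", 3, 40)]
regression = [("a", 1, 40), ("b", 1, 40), ("c", 9, 40), ("d", 10, 40)]

print(first_jump(inconclusive))  # None
print(first_jump(regression))    # c
```

With a low-frequency intermittent the rates bounce around by a run or two per revision, which is the inconclusive case; a real root cause tends to show up as a clear step at one revision.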

I might try a full experiment soon blindly looking at bugs instead of using orange factor.

Filed under Uncategorized

the orange factor – no need to retrigger this week

Last week I did another round of re-triggering for a root cause and found some root causes!  This week I got an email from orange factor outlining the top 10 failures on the trees (as we do every week).

Unfortunately as of this morning there is no work for me to do.  Maybe next week I can hunt.

Here is the breakdown of bugs:

  • Bug 1081925 Intermittent browser_popup_blocker.js
    • investigated last week, test is disabled by a sheriff
  • Bug 1118277 Intermittent browser_popup_blocker.js
    • investigated last week, test is disabled by a sheriff
  • Bug 1096302 Intermittent test_collapse.html
    • test is fixed!  already landed
  • Bug 1121145 Intermittent browser_panel_toggle.js
    • too old!  problem got worse on April 24th
  • Bug 1157948 DMError: Non-zero return code for command
    • too old!  most likely a harness/infra issue
  • Bug 1166041 Intermittent LeakSanitizer
    • patch is already on this bug
  • Bug 1165938 Intermittent media-source
    • disabled the test already!
  • Bug 1149955 Intermittent Win8-PGO test_shared_all.py
    • too old!
  • Bug 1160008 Intermittent testVideoDiscovery
    • too old!
  • Bug 1137757 Intermittent Linux debug mochitest-dt1 command timed out
    • harness infra, test chunk is taking too long; the problem is being addressed with more chunks.

As you can see there isn’t much to do here.  Maybe next week we will have some actions we can take.  Once I have about 10 bugs investigated I will summarize the bugs, related dates, and status, etc.

Filed under testdev

re-triggering for a [root] cause – version 1.57

Last week I wrote some notes about re-triggering jobs to find a root cause.  This week I decided to look at the orange factor email of the top 10 bugs and see how I could help.  Looking at each of the 10 bugs, I had 3 worth investigating and 7 I ignored.

Investigate:

  • Bug 1163911 test_viewport_resize.html – new test which was added 15 revisions back from the first instance in the bug.  The sheriffs had already worked to get this test disabled prior to my results coming in!
  • Bug 1081925 browser_popup_blocker.js – previous test in the directory was modified to work in e10s 4 revisions back from the first instance reported in the bug, causing this to fail
  • Bug 1118277 browser_popup_blocker.js (different symptom, same test pattern and root cause as bug 1081925)

Ignore:

  • Bug 1073442 – Intermittent command timed out; might not be code related and >30 days of history.
  • Bug 1096302 test_collapse.html | Test timed out. >30 days of history.
  • Bug 1151786 testOfflinePage. >30 days of history. (and a patch exists).
  • Bug 1145199 browser_referrer_open_link_in_private.js. >30 days of history.
  • Bug 1073761 test_value_storage.html. >30 days of history.
  • Bug 1161537 test_dev_mode_activity.html. resolved (a result from the previous bisection experiment).
  • Bug 1153454 browser_tabfocus.js. >30 days of history.

Looking at the bugs of interest, I jumped right into retriggering.  This time around I did 20 retriggers for the original changeset, then went back 30 revisions, doing the same thing on every 5th one.  Effectively this was 20 retriggers each for the 0th, 5th, 10th, 15th, 20th, 25th, and 30th revisions in the history list (140 retriggers).
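
The arithmetic behind that plan can be sketched in a few lines.  This is only an illustration of the sampling scheme; it does not call mozci, and the function name and the placeholder revision names are hypothetical.

```python
def retrigger_plan(history, step=5, depth=30, retriggers=20):
    """Return (revision, retrigger_count) pairs for every `step`-th revision.

    `history` lists revisions newest first, so history[0] is the changeset
    where the failure was first reported.
    """
    picked = history[0:depth + 1:step]
    return [(rev, retriggers) for rev in picked]


# Placeholder revision names; a real run would use changeset hashes.
history = [f"rev{i}" for i in range(40)]
plan = retrigger_plan(history)
total = sum(count for _, count in plan)

print(len(plan), "revisions,", total, "retriggers")  # 7 revisions, 140 retriggers
```

Sampling every 5th revision keeps the job count manageable; once two neighbouring samples disagree, the 4 revisions between them can be filled in with a second pass.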

I ran into issues doing this, specifically on Bug 1073761.  For about 7 revisions in the history the Windows 8 builds failed!  Luckily the builds got far enough to produce a binary+tests package so we could run tests, but mozci didn’t understand that the build was available, which required some manual retriggering.  A few cases on both retriggers were real build failures, which meant manually picking a different revision to retrigger on.  After that it was fairly easy to run my tool again and fill in the 4 missing revisions using slightly different mozci parameters.

This was a bit frustrating as there was a lot of manual digging and retriggering due to build failures.  Luckily 2 of the top 10 bugs are the same root cause and we figured it out.  Including irc chatter and this blog post, I have roughly 3 hours invested into this experiment.

Filed under Uncategorized