More thoughts on Auto-land and try server

Last week I wrote a post with some thoughts on AutoLand and Try Server. It drew some wonderful comments, and because of that I have continued to think in the same problem space a bit more.

In chatting with Vaibhav1994 (who is really doing an awesome GSoC project this summer for Mozilla), we started brainstorming another way to resolve our intermittent orange problem.

What if we rerun the test case that caused the job to go orange (yes, for a crash, leak, or shutdown timeout we would rerun the entire job)? If the rerun was green, we could deem the failure intermittent and ignore it!

With some of the work being done in bug 1014125, we could achieve this outside of buildbot and the job could repeat itself inside the single job instance yielding a true green.

One thought: if it is a test that is failing, we might want to run it 5 times and require that it fails only 1 time; otherwise it is too intermittent.
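To make the "5 runs, 1 failure" idea concrete, here is a rough sketch in Python. It is only an illustration: `classify_failure` and the `run_test` callable are hypothetical names, not part of buildbot or any existing harness, and it assumes the original failure counts as the first of the five runs.

```python
from typing import Callable


def classify_failure(
    run_test: Callable[[], bool],
    total_runs: int = 5,
    max_failures: int = 1,
) -> str:
    """Decide what to do with a test that has just failed once.

    run_test is a hypothetical callable that reruns the single test and
    returns True on pass, False on failure.
    Assumption: the failure that triggered this check counts as run 1.
    """
    failures = 1  # the original failure that turned the job orange
    for _ in range(total_runs - 1):
        if not run_test():
            failures += 1
    # Within the budget: treat as intermittent and ignore.
    # Over the budget: too intermittent to ignore.
    return "intermittent" if failures <= max_failures else "too_intermittent"
```

With these defaults the test is rerun 4 more times, and any further failure pushes it over the limit.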

A second thought: we would do this on try by default for autoland, but still show the intermittents on integration branches.

I will eventually get to my thoughts on massive chunking, but for now, let's hear more pros and cons of this idea/topic.



6 responses to “More thoughts on Auto-land and try server”

  1. I like the idea. It would mean that test suites that are intermittent would cause pain for developers again instead of just the sheriffs like today. Perhaps if a few test suites keep failing the 1-out-of-5 rule, developers will have more incentive to fix these tests so that autoland works better.

    To be fair, I'd rather have autoland ASAP even if we can’t find an optimal solution to the intermittent problem.

  2. elvis314

    Thanks for your comment :BenWa!

    I agree, coming up with a better than OK but not perfect solution to our roadblocks will get us using it and increasing productivity.

  3. I’d rather see intermittent failures show up as orange; we should care about intermittent failures just as we care about permanent ones. I think having an autoland system will provide more people with an incentive to care about intermittent oranges, and we’ll end up with a cultural shift that reduces the amount of intermittent orange.

    • elvis314

      David, this is a great concern and something which I am not sure we can figure out. Many developers are of the mindset that we need to have autoland sooner rather than later. I agree that writing tools to help us ignore test issues (which could be product issues) only makes our system exponentially worse.

      The unfortunate reality is we have >3700 intermittent orange bugs in bugzilla and this number keeps growing. In a given week we see between 700 and 900 of those starred by the sheriffs. How can we shift the momentum and start reducing our orange problem?

      Thinking on this, I see a few options, and one stands out as a middle-of-the-road approach. What if we had a ‘yellow’ (call it a lemon) for what we know as an intermittent problem? Developers can ignore the yellow jobs (as they already ignore orange jobs) and they still have visibility of the overall problem. With that said, a job cannot use AutoLand or other tools which would depend on all green if there are any real failures (current orange).
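A minimal sketch of how a ‘yellow’ state could be computed, in Python. This is an assumption-laden illustration: `JobStatus`, `job_status`, `can_autoland`, and the idea of a set of known intermittent test names are all hypothetical, not existing tooling.

```python
from enum import Enum
from typing import Iterable, Set


class JobStatus(Enum):
    GREEN = "green"    # nothing failed
    YELLOW = "yellow"  # only known intermittents failed
    ORANGE = "orange"  # at least one real (unknown) failure


def job_status(failed_tests: Iterable[str], known_intermittents: Set[str]) -> JobStatus:
    """Colour one job from its failed tests (hypothetical helper)."""
    failed = list(failed_tests)
    if not failed:
        return JobStatus.GREEN
    if all(name in known_intermittents for name in failed):
        return JobStatus.YELLOW
    return JobStatus.ORANGE


def can_autoland(statuses: Iterable[JobStatus]) -> bool:
    """A push may autoland only if no job has a real failure."""
    return all(status is not JobStatus.ORANGE for status in statuses)
```

Yellow jobs stay visible but do not block landing; a single orange job does.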

      One thing we are doing for mochitest-browser-chrome is to switch them to run by each subdirectory instead of as a large chunk. This alone will reduce the intermittent failures by a noticeable amount. Eventually we will get all mochitests running in this fashion.
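As a rough illustration of running by subdirectory, the grouping step might look like the Python below. `group_by_directory` and the example paths are hypothetical; this is not how the mochitest harness is actually implemented.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_directory(test_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group test files by their containing directory.

    Each group can then be run in its own fresh browser instance,
    instead of running every test in one large chunk.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in test_paths:
        groups[posixpath.dirname(path)].append(path)
    return dict(groups)


# Made-up paths, for illustration only.
example = group_by_directory([
    "browser/base/test_one.js",
    "browser/base/test_two.js",
    "browser/devtools/test_three.js",
])
```

The thinking is that state left behind by one directory's tests can no longer leak into another directory's run.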

      Do you have other thoughts on what we can do to reduce oranges or speed up the process of AutoLand?

      • I think your view of the rate of orange is too static — the biggest factor in how much intermittent orange we have is how strict we are about backing out things that cause intermittent orange, and that’s cultural rather than technical. I think autoland would shift the culture enough that our intermittent orange rate would drop substantially within a week or two because people would want their autoland not to be broken by intermittent oranges, so they’d bisect and back out or fix them much more quickly than they do today. (Today, the pain falls mainly on the sheriffs and doesn’t affect developers much, and that makes the rate of intermittent failures high.)

  4. elvis314

    That is a great way to look at this, thanks for explaining in more detail. I agree that changing our backout/landing policies would help.

    What tools would you see as a better method for finding the root cause of who introduced the intermittent problem? My biggest concern is that someone is changing some developer tools code and they get a failure in reftests unrelated to their code. They retrigger the job 10 times and it is green. Are they the root cause of this intermittent? I would like to get more ideas and have conversations around how to get things landed while we have a massive backlog of intermittents.

    Would taking the entire development team offline for 2 days to hack on intermittents be a solution to get rid of most of the problems, so that we could then start adjusting our policies for backouts/landings? There are probably a dozen ways to approach this problem; maybe you have some preferred ideas!
