
Case of the Mondays
What was famous 15 years ago as a case of the Mondays has manifested itself in Talos. I have been wondering why I get so many regression alerts on Mondays compared to other days. It seems to come down to this: we have less noise in our Talos data on weekends.
Take for example the test case tresize:
* linux32
* in fact, we see this on other platforms as well: linux32, linux64, osx10.8, and windowsXP
Many other tests exhibit this. What is different about weekends? Are there just fewer data points?
I do know our volume of tests goes down on weekends, mostly as a side effect of fewer patches being landed on our trees.
Here are some ideas I have to debug this more:
- Run massive retrigger scripts for Talos on weekends to validate whether the number of samples is or is not the problem
- Reduce the volume of Talos on weekdays to validate whether the overall system load in the datacenter is or is not the problem
- Compare the load of the machines across all branches, along with wait times, to the noise we have in certain tests/platforms
- Look at platforms like Windows 7, Windows 8, and OS X 10.6 to see why they have more noise on weekends or are more stable. Finding the delta between platforms would help provide answers
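As a starting point for the first idea, here is a minimal sketch of how one might compare sample counts and noise between weekdays and weekends. It assumes the results for a single test/platform have already been exported as (date, value) pairs; the function name `noise_by_day_type` and the numbers are hypothetical and made up for illustration, not real Talos data or an existing Talos API.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev


def noise_by_day_type(samples):
    """Group (date, value) samples into weekday/weekend buckets.

    Returns {bucket: (count, mean, coefficient_of_variation)}.
    """
    buckets = defaultdict(list)
    for day, value in samples:
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
        key = "weekend" if day.weekday() >= 5 else "weekday"
        buckets[key].append(value)

    summary = {}
    for key, values in buckets.items():
        avg = mean(values)
        # stdev needs at least two samples; report None otherwise
        cv = stdev(values) / avg if len(values) > 1 and avg else None
        summary[key] = (len(values), avg, cv)
    return summary


# Made-up numbers for illustration only, not real Talos results.
samples = [
    (date(2014, 3, 3), 30.1),  # Monday
    (date(2014, 3, 4), 33.0),
    (date(2014, 3, 5), 29.2),
    (date(2014, 3, 6), 34.1),
    (date(2014, 3, 7), 30.6),
    (date(2014, 3, 8), 31.0),  # Saturday
    (date(2014, 3, 9), 31.2),  # Sunday
]

for key, (count, avg, cv) in sorted(noise_by_day_type(samples).items()):
    print(key, count, avg, cv)
```

Using the coefficient of variation (standard deviation divided by mean) rather than the raw standard deviation makes the two buckets comparable even if the mean shifts between them. If the weekend bucket stays quieter after retriggers bring its sample count up to weekday levels, the number of samples is probably not the explanation.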
If you have ideas on how to uncover this mystery, please speak up. I would be happy to see this gone, which would make our automated alerts more useful!
For reference, we previously discussed this here: https://bugzilla.mozilla.org/show_bug.cgi?id=908888#c25
One fun (if difficult to get right) test: reboot all the idle slaves of a platform on a weekend, and then quickly trigger talos on them. One of the things that’s true of weekend runs is that you are far more likely to get a slave which has been up and idle for a few hours than you are on weekdays, where you’re generally getting a slave which has finished a job and rebooted a few minutes before it started your test. If, as we have before, we have post-boot scheduled tasks running, I’d expect weekends to be much less affected by them.