3.5 years ago we implemented and integrated SETA. Today it has the net effect of reducing our load by 60-70%. SETA works on the premise of identifying the specific test jobs that find real regressions and marking them as high priority. While this logic is not perfect, it provides great savings in test resources without adding a large burden to our sheriffs.
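The core idea behind SETA can be sketched as a greedy set-cover problem: keep picking the job that would have caught the most still-uncaught historical regressions until every regression is covered. This is only an illustration; the job names and regression data below are invented, and the real SETA implementation differs in its details.

```python
# Hypothetical sketch of SETA's premise: greedily pick the smallest set of
# jobs that would still have caught every historical regression.
# All job names and regression ids below are made up for illustration.

def pick_high_priority_jobs(regressions):
    """regressions: dict mapping regression id -> set of job names that caught it."""
    uncaught = set(regressions)
    priority_jobs = []
    while uncaught:
        # Count how many still-uncaught regressions each job would catch.
        coverage = {}
        for rid in uncaught:
            for job in regressions[rid]:
                coverage[job] = coverage.get(job, 0) + 1
        best = max(coverage, key=coverage.get)
        priority_jobs.append(best)
        uncaught = {rid for rid in uncaught if best not in regressions[rid]}
    return priority_jobs

regressions = {
    "bug-1": {"mochitest-1-linux", "mochitest-1-win"},
    "bug-2": {"mochitest-1-linux", "xpcshell-osx"},
    "bug-3": {"xpcshell-osx"},
}
print(pick_high_priority_jobs(regressions))
```

In this toy example two jobs are enough to cover all three regressions, so the third job would be marked low priority and scheduled less often.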
There are two things we could improve upon:
- a test job that finds a failure runs dozens, if not hundreds, of tests, even though the job failed because of a single test.
- in jobs that are split into multiple chunks, a test that fails in chunk 1 today may run in chunk X in the future, making job-level priorities less reliable.
I did an experiment in June (I was on PTO and busy migrating a lot of tests in July/August) where I ran some queries on the treeherder database to find the actual test cases that caused the failures instead of only the job names. I came up with a list of 171 tests that we needed to run; these ran in 6 jobs in the tree using 147 minutes of CPU time.
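The reduction step above boils down to deduplicating failure records down to the test files they point at, and dropping failures with no attributable test. A minimal sketch, assuming a made-up record format (real treeherder failure data looks different):

```python
# Hypothetical sketch of reducing treeherder failure data to unique test paths.
# The record format and test paths below are invented for illustration.

failure_lines = [
    {"job": "mochitest-3", "test": "dom/media/test/test_playback.html"},
    {"job": "mochitest-3", "test": "dom/media/test/test_playback.html"},
    {"job": "xpcshell-2",  "test": "netwerk/test/unit/test_http2.js"},
    {"job": "mochitest-3", "test": None},  # e.g. a shutdown leak: no single test
]

# Keep only failures attributable to a specific test file, deduplicated.
unique_tests = sorted({f["test"] for f in failure_lines if f["test"]})
print(unique_tests)
# ['dom/media/test/test_playback.html', 'netwerk/test/unit/test_http2.js']
```

The `None` record is the interesting part: it stands in for the roughly 25% of failures that cannot be pinned to a manifest entry at all.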
This was a fun project, and it gives some insight into what the future could look like. The future I envision is picking high-priority tests via SETA and using code coverage to find additional tests to run. There are a few caveats which make this tough:
- Not all failures we find are related to a single test: we have shutdown leaks, hangs, CI and tooling/harness changes, etc. This experiment only covers tests that we could specify in a manifest file (about 75% of the failures).
- My experiment didn’t load balance across all configs. SETA does a great job of picking the fewest jobs possible; by knowing a failure is Windows-specific, we can run it on Windows and not schedule it on Linux/OSX/Android. My experiment was to see if we could run the tests at all, but right now we have no way to schedule a list of test files and specify which configs to run them on. Of course, we can limit this to run “all these tests” on “this list of configs”, but running 147 minutes of execution on 27 different configs doesn’t save us much; it might take more time than what we currently do.
- It was difficult to get the unique test failures. I had to run a series of queries on the treeherder data, parse the results, and adjust a lot of the SETA aggregation/reduction code before finally getting a list of tests. This would require a few days of work to sort out if we wanted to go this route, and we would need to figure out what to do with the other 25% of failures.
- The only way to run these is the per-test style used for test-verify (and the in-progress per-test code coverage). This changes the way we report tests in the treeherder UI: it is hard to know what we ran and didn’t run, and summarizing failures across bugs could be interesting. We need a better story for running tests and reporting them without caring about chunks and test harnesses (for example, see my running-tests-by-component experiment).
- Assuming this were implemented, the model would need to be tightly integrated into the sheriffing and developer workflows. For developers, if you just want to run xpcshell tests, what does that mean for what you see on your try push? For sheriffs, if there is a new failure, can we backfill it and find which commit caused the problem? Can we easily retrigger the failed test?
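The config-balancing caveat is the one I keep coming back to: 147 minutes of tests run on all 27 configs is 3969 minutes, roughly 66 hours of CPU time per push, which likely erases the savings. What we would want instead is something like a per-config schedule built from historical failure data. No such scheduling mechanism exists today; everything in this sketch, including the test paths and config names, is invented for illustration.

```python
# Hypothetical sketch of per-config test scheduling: map each failing test
# to only the configs where it historically failed, instead of running the
# whole 147-minute list on all 27 configs. All data here is made up.

from collections import defaultdict

# (test path, config) pairs extracted from historical failures -- invented.
historical_failures = [
    ("dom/media/test/test_playback.html", "windows10-64/debug"),
    ("dom/media/test/test_playback.html", "windows10-64/opt"),
    ("netwerk/test/unit/test_http2.js", "linux64/debug"),
]

# Build the per-config schedule: config -> list of tests to run there.
schedule = defaultdict(list)
for test, config in historical_failures:
    schedule[config].append(test)

for config, tests in sorted(schedule.items()):
    print(config, tests)
```

With a structure like this, each config only pays for the tests that have actually failed there, rather than the full list everywhere.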
I realized I did this work and never documented it. I would be excited to see progress made towards running a simplified set of tests, ideally reducing our current load by 75% or more while keeping our quality levels high.