Over the last year there has been a lot of research into reducing the noise in our Talos performance numbers. For example, looking at tscroll, the reported numbers fluctuate by almost 400 (out of 14000). Jan Larres took a look at this problem in his master's thesis and found a variety of factors that did and didn’t contribute to the noise. We have Bug 706912 filed to implement some of his suggestions on how to calculate the posted number. Last fall, Stephen Lewchuk looked at the raw data that was collected and found some inconsistencies in the way we were aggregating the data. In short, we have a lot of ground to cover if we want to reduce the noise in our numbers.
Over the last couple of months, we have been working on a project called Signal From Noise. This is an attempt to fix the way we collect some numbers and redo the way we aggregate numbers for reporting. We have done a lot of experimenting, with the primary focus on tp5. The way we run tp5 today is to load each of the 100 pages once, then repeat the whole set until every page has been loaded 10 times. For each page, we drop the highest value and take the median of the remaining 9 numbers. This results in an array of 100 data points, which gets reported to the graph server. We then average those 100 data points to generate the single number for tp5. It is easy to imagine that the small sample size and the median/average combination will produce a lot of noise.
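The aggregation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual Talos code: the function names `old_page_value` and `old_tp5_score`, the page names, and the sample values are all made up for the example.

```python
from statistics import mean, median


def old_page_value(samples):
    """Drop the single highest sample, then take the median of the rest."""
    if len(samples) < 2:
        raise ValueError("need at least two samples per page")
    trimmed = sorted(samples)[:-1]  # discard only the largest value
    return median(trimmed)


def old_tp5_score(samples_by_page):
    """Average the per-page medians into one reported number."""
    page_values = [old_page_value(s) for s in samples_by_page.values()]
    return mean(page_values)


# Hypothetical load times: two pages, 10 loads each.
samples_by_page = {
    "page_a": [120, 100, 101, 99, 102, 100, 98, 101, 100, 103],
    "page_b": [400, 250, 251, 249, 252, 250, 248, 251, 250, 253],
}

print(old_tp5_score(samples_by_page))
```

Note how the two pages collapse into a single value: a regression on one page can be hidden by the much larger numbers of another.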
Going forward, we are looking to change from column major to row major and collect 30 samples instead of 10. This means we focus on one page and load it 30 times, then move to the next page and repeat until all 100 pages have been loaded. The downside is the runtime, as we move from an average of 17 minutes to an average of 39 minutes for the entire tp5 run. Collecting 30 samples will give us a much more meaningful number, but we also found that the first 5-10 iterations contain the most noise. So initially we are looking to throw away the first 10 numbers, instead of throwing away only the highest number as we originally did. When looking at the raw numbers (not the aggregated number), here are some graphs to highlight the difference:
This is only the first of many changes needed. After rolling this out, we need to evaluate the other test suites as well and ensure we are running enough cycles to get a valid sample size. We are also working on allowing the database to accept the raw values instead of the single median value per page. Likewise, we are looking to stop doing an average([median(page)]). All of this will make it easier to find regressions per page, instead of having them washed out by the other numbers.
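To make the proposed change concrete, here is a rough Python sketch of the two load orders and of discarding the first 10 samples per page. The helper names (`column_major_order`, `row_major_order`, `drop_warmup`, `raw_values_per_page`) are hypothetical and do not come from Talos, and the sample data is invented.

```python
def column_major_order(pages, cycles):
    """Old order: load every page once, then repeat the whole set."""
    return [page for _ in range(cycles) for page in pages]


def row_major_order(pages, cycles):
    """New order: load one page `cycles` times before moving to the next."""
    return [page for page in pages for _ in range(cycles)]


def drop_warmup(samples, warmup=10):
    """Discard the first `warmup` samples, which tend to be the noisiest."""
    return samples[warmup:]


def raw_values_per_page(samples_by_page, warmup=10):
    """Keep the remaining raw samples per page instead of one aggregate."""
    return {page: drop_warmup(s, warmup) for page, s in samples_by_page.items()}


pages = ["page_a", "page_b"]
print(column_major_order(pages, 3))
print(row_major_order(pages, 3))

# Hypothetical run: 30 loads of one page, with a noisy start.
samples_by_page = {"page_a": [200] * 10 + [100] * 20}
kept = raw_values_per_page(samples_by_page)
print(len(kept["page_a"]))
```

Keeping the per-page samples separate, rather than folding them into one average, is what would let a regression on a single page stand out.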