Just found this thread. Looks like I'll be up a while tonight
About a year ago I started reflecting on load development. I took a look around at what a person can buy and do and thought to myself that it should be easy enough to isolate variables to at least get an impression of what matters and what doesn't. In my mind there should not be a decades-long argument over whether to start with seating depth tests, or OCW, or powder ladder tests, or Satterlee tests... Certainly there is a way (within the bounds of real life) to isolate variables and begin to quantify the effects of changing powder charge, changing seating depth, SRP vs. LRP, crown angles, barrel tuners, etc... etc... So little by little I've been working towards it. I'm not going to get into the entirety of the testing I've done so far, but the TLDR of it is that I've arrived at the conclusion (as it relates to this topic) that 20-35 shots of a single variable is sufficient to give the end user a pretty solid idea of what that set of conditions is going to do over the next several hundred (if not thousand) rounds of barrel life.
Why?
Because I have been exposed to probability and statistics. I'm not going to try to teach the subject, so the TLDR of it is that in order to use a standard deviation (SD), you assume that your sample follows a distribution pattern (usually "normal distribution", or a bell curve) AND most importantly that
your sample accurately represents the population. This is huge.
Say for example, a 6.5 Creedmoor has a generally accepted 2500 round barrel life. Then our "population" is that entire 2500 rounds. Each variable we change (powder charge, seating depth, case mfg., powder type, etc.) then has a theoretical population of 2500 rounds. So if we could see into the future, we could set up loads A, B, C, D, E, and F where we change 1 variable per letter. Then we imagine shooting 2500 rounds of A from a perfectly fixed barrel in a 1000yd indoor range and it produces a scatter plot on our 1000yd target. Then we reverse time, regain our barrel life, and try load B, then C, etc... Each produces its own population scatter plot. And if we had a time machine and infinite supplies we could definitively tell which combination/arrangement of components would produce the tightest dispersion in that barrel... but we don't have those things.
So the age-old battle has been trying to find the winning combination with as few rounds as possible, because we don't have time machines to wash round counts off of our barrels. Unfortunately, what has developed out of this search is a set of traditions that skimp on one of the best tools available to us: probability and statistics.
Every MFer out there with a chronograph has shot groups and collected velocities and been tickled pink when a tight spread over 5 shots gets produced. Myself included. When you scroll through that Chrono and it says "SD: 3" you're like "Oh yeah motherfucker.. that's right!". The problem is that most people don't understand the bounds in which SD is meaningful and useful.
I also don't want this to be a textbook on statistics. Suffice it to say, SD is ONLY RELEVANT IF THE SAMPLE REPRESENTS THE POPULATION. So then, how large must the sample be to accurately represent the population? Traditionally we've shot 3 and 5 shot groups. Is that good enough?
Say there's a campus of 2500 students and we nab 3 or 5 of them and question them about an upcoming election. Do we have a good idea of which way that campus is going to swing? Say we find a colony of 2500 ants and we capture 3 or 5 of them to measure for height/weight. Not seeing any other ants, do we have a good hold on how big ants from this colony are? Say we have 2500 rounds of useful barrel life with a given combination of reloading components and we randomly select 3 or 5 of them to test. Do we have a good feeling for how the other 2495-2497 will behave?
I submit "no". No we do not.
Why? Well after thinking about the above, I shoulder-fired several variables at 50 shots per group (5x 10-shotters with cooling between, all with correlated POI/POA) and recorded the impact location and velocity for each shot. The first thing that I tested was a powder ladder test. The end results of the powder ladder test showed a negligible difference in ES/SD between all of the powder charges, and a nearly negligible difference in dispersion across the powder ladder. This caused me to look into the data a little more closely. Do I suck? Did I do something wrong? THERE WERE SUPPOSED TO BE NODES!
I plotted running averages, running SD's, and running MPOI in the horizontal and vertical directions. This basically showed the "total results by round count". It was both fascinating and horrible. Fascinating because regardless of whether the first 5-10 shots showed an SD of 3, or 12, or 18, by the time it got to 50 shots the SD settled at 10-12fps. All of my data (other than the avg. MV, which obviously grew with increased charge weight) more or less converged. Horrible because it seemed to me that every single load development test I had done in the past was a complete and utter waste of time. To settle this in my mind I repeated the median charge test 2 more times and again watched the results converge by the 50 shot mark. This rabbit hole has consumed thousands of rounds now and I have moved to an accuracy fixture to isolate the shooter out of the equation, but that's another subject...
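You can see the same convergence behavior with nothing but simulated velocities. Here's a quick Python sketch (not my actual data; the 2700 fps mean and 11 fps SD are made-up numbers for illustration): draw 50 "shots" from one fixed normal distribution and watch the running SD wander early and settle late.

```python
import random
import statistics

# Simulated muzzle velocities from a single "true" distribution.
# Mean 2700 fps and SD 11 fps are assumptions for illustration only.
random.seed(42)
TRUE_MEAN, TRUE_SD = 2700.0, 11.0
shots = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(50)]

# Running SD after each shot, starting from the 2nd (SD needs 2+ points)
running_sd = [statistics.stdev(shots[:n]) for n in range(2, 51)]

print(f"SD after  5 shots: {running_sd[3]:5.1f} fps")
print(f"SD after 20 shots: {running_sd[18]:5.1f} fps")
print(f"SD after 50 shots: {running_sd[48]:5.1f} fps")
```

Run it with different seeds and the 5-shot SD bounces all over the place while the 50-shot SD keeps landing near the true value, which is exactly the convergence I saw on paper.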
So why my remarks on 20-35 shots? Because for all intents and purposes, by 35 shots the tests were indistinguishable. The 15 following shots to get to 50 were insurance. In my opinion, 35 shots of a variable is enough to very accurately represent the population, and any statistical data derived from such a test is pretty viable for use in probability calculations. I can go more into depth on this subject if people want. Not surprisingly 30-35 is a fairly common 'rule of thumb' for sample size in use of statistics. I don't want to bore anyone. Suffice it to say that every single shot represents a random event (to some extent), and it just so happens that the distribution of shots from a rifle tracks pretty close to a normal distribution relative to the MPOI. So you can plug in empirically derived SD's into an Excel spreadsheet with a random number generator for X and Y coordinates and produce realistic 1000, 2000, 100000 (whatever you want) shot groups, which can then be used with stuff like 4DoF to introduce other variables and have pretty useful hit probability info at the cost of 2 boxes of ammo.
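The spreadsheet trick above takes maybe ten lines in Python too. This is a hedged sketch, not my exact sheet: the 0.18/0.20 MOA horizontal/vertical SDs are placeholder values standing in for whatever your own 35-shot test measured about the MPOI.

```python
import math
import random

# Assumed empirical per-axis SDs (MOA, relative to the MPOI).
# Substitute the values from your own 35-shot test here.
random.seed(1)
SD_X_MOA, SD_Y_MOA = 0.18, 0.20

def synthesize_group(n_shots):
    """Return n_shots of (x, y) impacts in MOA about the mean point of impact."""
    return [(random.gauss(0, SD_X_MOA), random.gauss(0, SD_Y_MOA))
            for _ in range(n_shots)]

# Synthesize a 100,000-shot "group" and summarize it with mean radius
big_group = synthesize_group(100_000)
radii = [math.hypot(x, y) for x, y in big_group]
mr = sum(radii) / len(radii)
print(f"Mean radius over {len(big_group)} synthetic shots: {mr:.3f} MOA")
```

Feed those synthetic impacts into something like 4DoF along with wind/range inputs and you get the hit-probability picture I described, for the cost of the 2 boxes of ammo that produced the SDs.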
Where does 20 come from? 20 is about the minimum mark where things have settled down to what I consider an acceptable margin of error. SD's that will long-term average 11 will show up in 20 shot tests anywhere from 9-13fps, i.e. +/- 2fps. Who cares? If you apply 4 sigma or 6 sigma it's still not enough to matter at PRS/NRL match ranges, even the far targets. It's got enough resolution to tell you one load is better than another, one barrel is better than another, etc... without burning the extra 15-30 rounds to be 100% sure.
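That 9-13 fps window isn't hand-waving, you can simulate it. Sketch below: assume a load whose true long-run SD is 11 fps, run thousands of imaginary 20-shot tests against it, and look at where the middle 90% of the measured SDs land.

```python
import random
import statistics

# True long-run SD of 11 fps is taken from the discussion above;
# everything else here is simulation, not range data.
random.seed(7)
TRUE_SD, N_SHOTS, N_TRIALS = 11.0, 20, 10_000

# Measured SD from each simulated 20-shot test, sorted for percentiles
sample_sds = sorted(
    statistics.stdev([random.gauss(0, TRUE_SD) for _ in range(N_SHOTS)])
    for _ in range(N_TRIALS)
)

lo = sample_sds[int(0.05 * N_TRIALS)]   # 5th percentile
hi = sample_sds[int(0.95 * N_TRIALS)]   # 95th percentile
print(f"90% of 20-shot SDs fall between {lo:.1f} and {hi:.1f} fps")
```

Cut N_SHOTS down to 5 in that sketch and the window roughly doubles, which is the whole problem with 5-shot SDs.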
Why not 15 (or 10, or 7 or 5)??? Too much noise, IMO. I'd have to play with my data a bit to get you numbers for 10's or 15's but because 5's seem to be the "standard" I have played with them a lot and offer this example.
Say I shot a 1/2 MOA 20 shot group. We can expect then that pretty much every round of that load will be sub-MOA for sure, probably better.
If I use that data to generate 5 shot groups, I will get groups that average in size about 0.37-0.42 MOA. BUT!!! I will get individual groups that are anywhere from about 0.15 MOA up to 0.85 MOA. Remember that part about randomly selecting 5 out of 2500? So when you shoot a ladder test with 5x of each load... How do you know if the 0.84 MOA group you just shot was representative of that load? You say "oh shit that wasn't good" because the groups right next to it were 0.2-0.4 MOA and you throw it away, never to be shot/tested again. Reality is that you fell victim to variance and small sample size.
A caveat on that last subject... There is a probability distribution going on so the odds are that most rounds are going to fall within X MOA for a given system. That's true.. You know like 70% of your rounds are going to fall within a pretty damn tight window... But 30% are going to fall outside of that window and it's random. You can prep your brass and weigh to the .00001gr and sort by this and by that.... low probability events still happen and
the only way to know if what happened to you was a low probability event is to increase sample size.
Hopefully that makes sense. It's way past my bed time.
Another rabbit hole real quick:
Group size is a less than optimal way to measure dispersion. Whether you shoot 2 rounds or 2000 rounds, you are only using 2 rounds (the widest pair) to qualify the group. I want to go to sleep now, so maybe this will spark some discussion for tomorrow & the weekend, but mean radius is a much better metric. Alas, it requires a bit more math to figure, but a lot of these apps do it anyway and tell us, and we ignore it or don't understand what it means.
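The "bit more math" is really not much. Quick sketch on a made-up 5-shot group (coordinates invented for illustration): mean radius averages every shot's distance from the group center, so all 5 shots contribute, while extreme spread is set entirely by the worst two.

```python
import math

# Made-up 5-shot group, impacts in MOA (illustration only)
shots = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.3), (0.25, -0.1), (-0.15, -0.2)]

# Group center (mean point of impact)
cx = sum(x for x, _ in shots) / len(shots)
cy = sum(y for _, y in shots) / len(shots)

# Mean radius: average distance of each shot from the center (uses all shots)
mean_radius = sum(math.hypot(x - cx, y - cy) for x, y in shots) / len(shots)

# Extreme spread: widest center-to-center pair (uses only 2 shots)
extreme_spread = max(math.dist(a, b) for a in shots for b in shots)

print(f"mean radius:    {mean_radius:.3f} MOA")
print(f"extreme spread: {extreme_spread:.3f} MOA")
```

Move one flier out another tenth and the extreme spread jumps while the mean radius barely moves, which is exactly why it's the more stable statistic.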
So Frank, I understand I'm up against thousands and thousands of shooters set in their ways, and an industry with traditions that are what they are, but that doesn't mean that there aren't better ways. Not everyone needs a true 1/2 MOA all day long rifle+ammo combo, and not everyone needs to know that it is 1/2 MOA. However, a statistically significant method for comparison is desirable to sort out the white noise BS. The more we truly learn the more we can push the envelope. The more white noise BS we push around the slower we progress.
100% agree with THEIS about isolating what it is you want to test and testing for it appropriately. No two barrels, shooters, lots of bullets, etc. are the same. It is very difficult to say "Brand X model A shoots # mean radius at 200yd and Brand Y model B is 10% better".
Nonetheless, I am a fan of the dot drills, but IMO they should be the same size aiming point for all 20, perhaps with 'ghost rings' for different sizes (1/4, 1/2, 3/4, 1 MOA etc..).