# On Random Sampling and Generalization in Ecology

Virtually every introductory statistics book makes the point that random sampling is a critical assumption underlying all statistical inference. It is assumption #1 of statistical inference, and it carries with it an often-hidden requirement: before you can make your inference, you must clearly define the statistical population you are sampling. Defining the ‘population’ under consideration should perhaps be rule #1, but it is usually left as a vague understanding in many statistical studies. As an exercise, consult a series of papers on ecological field studies and see if you can find a clear statement of what the ‘population’ under consideration is. An excellent example of this kind of analysis is given by Ioannidis (2005a, 2005b).

The problem of random sampling does not arise in theoretical statistics, where all effort is concentrated on mathematical correctness. It arises when statistics are applied to the real world, as illustrated well by the polls we are subjected to on political or social issues, and by the medical studies we hear about daily. The social sciences have considered sampling for polls in much more detail than have biologists. In a historical overview, Lusinchi (2017) provides an interesting and useful analysis of how pollsters have, over the years, bent the science of statistical inference to their methods of polling to provide an unending flow of conclusions about who will be elected, or which coffee is better tasting. By confounding sample size with an approach to Truth and ignoring the problem of random sampling, pollsters have brainwashed the public into believing what should properly be labeled ‘fake news’.
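The point that sample size cannot substitute for random sampling is easy to demonstrate with a small simulation (all numbers here are invented for illustration): if a poll can only reach part of the population, and that reachable part differs from the rest, a bigger sample simply converges more precisely on the wrong answer.

```python
import random

random.seed(42)

# Hypothetical population of 100,000 voters: overall, 50% support candidate A.
# Phone owners (60% of the population) support A at 65%; non-owners at 27.5%.
# Overall support = 0.60 * 0.65 + 0.40 * 0.275 = 0.50.
population = []
for _ in range(100_000):
    has_phone = random.random() < 0.60
    supports_a = random.random() < (0.65 if has_phone else 0.275)
    population.append((has_phone, supports_a))

def phone_poll(n):
    """Poll n people, but only phone owners can be reached."""
    reachable = [person for person in population if person[0]]
    sample = random.sample(reachable, n)
    return sum(supports for _, supports in sample) / n

# Increasing n tightens the estimate around ~0.65, never around the true 0.50.
for n in (100, 1_000, 10_000):
    print(n, round(phone_poll(n), 3))
```

The biased frame, not the sample size, sets the limit of the inference: the poll's standard error shrinks with n, but the estimate stays centred near the phone owners' 65%, not the population's 50%.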

What has all of this got to do with the science of ecology? Much of the data we accumulate is uncertain because we rarely ask to what ‘population’ it applies. If you are concerned about the ecology of sharks, you face the problem that most species of shark have never been studied (Ducatez 2019). If you are interested in fish populations, you may find that the fish you catch with hooks are not a random sample of the fish population (Lennox et al. 2017). If you are studying the trees in a large woodlot, that woodlot may be your universe for statistical purposes. Interest then shifts to how far you can generalize to other woodlots, and over what geographical space, a question too rarely discussed in data papers. In an ideal world we would sample several woodlots randomly selected from a larger set of similar woodlots, so that we could infer processes common to woodlots in general.

A couple of problems confound ecologists at this point. No two woodlots or study sites are identical, so we assume they form a collective of ‘very similar’ woodlots about which we can make an inference. Alternatively, we can simply state that we wish to make inferences about this single woodlot only; it is our total population. At this point your supervisor or boss will say that he or she is not interested in only this one woodlot but in much more general conclusions, and you will be cut from research funding for having too narrow an interest.

The usual solution is to study one ‘woodlot’ and then generalize to all ‘woodlots’ with no further study on your part, leaving it to the next generation to find out whether your generalization is right or wrong. While this way of proceeding may not matter much to people interested in ‘woodlots’, it could matter greatly if your ‘population of interest’ were humans considering a drug for disease treatment. We are further confounded, in this era of climate change, by changing ecosystems: a study of coral reef fish communities carried out in 2000 could give completely different results if repeated in 2040 as the oceans warm.

Back to random sampling. I would propose that random sampling in ecological systems is impossible to achieve in a global sense. Be concerned instead with local processes and sample accordingly. Descriptive ecology must come to the rescue here, so that we know as background information (for example) that trees grow more slowly as they age, that tree growth varies from year to year, that insect attacks vary with summer temperature, and so on, and design our sampling accordingly in consultation with our favourite statistician. There are many very useful statistical techniques and sampling designs that allow an ecologist to achieve random sampling on a local scale, and statisticians are well worth consulting to validate the design of your field studies.

But it is important to remember that your results and conclusions, even when obtained with a perfect statistical design, cannot guarantee that your generalizations are correct in time or in space. Meta-analysis can help to validate generalizations when enough replicated studies are available, but there are problems even with this approach (Siontis and Ioannidis 2018). The continuing discussion of p-values in ecology could benefit much from similar discussions in medicine, where funding is higher and replication is more common (Ioannidis 2019a, 2019b).

All these statistical issues provide a strong argument for why ecological field studies and experiments should never stop: all our studies and conclusions are temporary signposts along a never-ending path.

Ducatez, S. (2019). Which sharks attract research? Analyses of the distribution of research effort in sharks reveal significant non-random knowledge biases. Reviews in Fish Biology and Fisheries 29, 355-367. doi: 10.1007/s11160-019-09556-0.

Ioannidis, J.P.A. (2005a). Contradicted and initially stronger effects in highly cited clinical research. Journal of the American Medical Association 294, 218-228. doi: 10.1001/jama.294.2.218.

Ioannidis, J.P.A. (2005b). Why most published research findings are false. PLOS Medicine 2, e124. doi: 10.1371/journal.pmed.0020124.

Ioannidis, J.P.A. (2019a). What have we (not) learnt from millions of scientific papers with p values? American Statistician 73, 20-25. doi: 10.1080/00031305.2018.1447512.

Ioannidis, J.P.A. (2019b). The importance of predefined rules and prespecified statistical analyses: do not abandon significance. Journal of the American Medical Association 321, 2067-2068. doi: 10.1001/jama.2019.4582.

Lennox, R.J., et al. (2017). What makes fish vulnerable to capture by hooks? A conceptual framework and a review of key determinants. Fish and Fisheries 18, 986-1010. doi: 10.1111/faf.12219.

Lusinchi, D. (2017). The rhetorical use of random sampling: crafting and communicating the public image of polls as a science (1935-1948). Journal of the History of the Behavioral Sciences 53, 113-132. doi: 10.1002/jhbs.21836.

Siontis, K.C. and Ioannidis, J.P.A. (2018). Replication, duplication, and waste in a quarter million systematic reviews and meta-analyses. Circulation Cardiovascular Quality and Outcomes 11, e005212. doi: 10.1161/circoutcomes.118.005212.