Statistical Evidence of Associations
A key objective of environmental data analysis is to provide evidence of associations between different variables.
A question we often ask is, “What is the probability I would see evidence at least this strong if the association did not exist?” This is the question that a p-value answers.
Types of Associations
- If we are investigating a binary cause and a continuous effect, we use tests built on the central limit theorem, such as a t-test.
- If we are investigating a continuous cause and a continuous effect, we use techniques like linear regression.
Binary Cause, Continuous Effect
As an example, our binary cause is whether or not a drug was taken. Our continuous effect is the duration of illness symptoms. Our population is the illness symptom durations of all folks. Our sample is the folks given the drug (there are N folks in the sample). Our sample distribution (more formally, the sampling distribution of the mean) is the distribution of the mean illness symptom duration we would get if we randomly selected N folks from the population over and over. Each of the data points is a single person.
Our statistical test asks how likely it is that we would see a difference at least as large as the one between the mean illness duration in our sample and the population mean if the drug had no effect.
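The "select N folks over and over" idea can be sketched as a small simulation. This is only an illustration: the population below is made up (a skewed gamma distribution with a mean near 7 days), and the names `population`, `N`, and `sample_means` are assumptions introduced for this example, not part of any real data set.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: symptom durations in days for 20,000 folks.
# A gamma distribution is used only so the population is visibly skewed.
population = [rng.gammavariate(4.0, 1.75) for _ in range(20_000)]

N = 25  # number of folks in each sample

# Draw N folks at random, record their mean duration, and repeat many times.
# The collection of means approximates the sample distribution of the mean.
sample_means = [
    statistics.fmean(rng.sample(population, N)) for _ in range(5_000)
]

print("population mean:", statistics.fmean(population))
print("population standard deviation:", statistics.pstdev(population))
print("mean of the sample means:", statistics.fmean(sample_means))
print("standard deviation of the sample means:", statistics.stdev(sample_means))
```

Even though each person's duration comes from a skewed population, the sample means cluster tightly around the population mean.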
The central limit theorem tells us that the mean of the sample distribution is the same as the population mean, that the standard deviation of the sample distribution is the standard deviation of the population divided by the square root of N, and that the sample distribution is approximately normal when N is large.
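As a rough sketch of how these facts turn into a p-value, the function below computes the standard deviation of the sample distribution, measures how far the sample mean sits from the population mean in those units, and looks up the probability under a normal curve. The function name `clt_p_value`, the choice of a two-sided test, and the numbers in the usage lines are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def clt_p_value(sample_mean, pop_mean, pop_sd, n):
    """Two-sided p-value for a sample mean, using the normal approximation."""
    # Standard deviation of the sample distribution: population sd / sqrt(N).
    standard_error = pop_sd / math.sqrt(n)
    # Distance between sample mean and population mean, in standard errors.
    z = (sample_mean - pop_mean) / standard_error
    # Probability of a difference at least this large in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return standard_error, z, p_value


# Hypothetical numbers: the population averages 7.0 days of symptoms
# (sd 2.0 days), and 25 folks given the drug averaged 6.2 days.
se, z, p = clt_p_value(sample_mean=6.2, pop_mean=7.0, pop_sd=2.0, n=25)
print(f"standard error = {se:.2f}, z = {z:.2f}, p-value = {p:.4f}")
```

With these made-up numbers the sample mean is about two standard errors below the population mean, which gives a p-value a little under 0.05. When the population standard deviation is unknown and has to be estimated from the sample, a t-test is the usual substitute.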
Continuous Cause, Continuous Effect
As an example, our continuous cause is the amount of fertilizer applied to a crop. Our continuous effect is the yield of the crop. We expect to see an increase in the crop yield with an increase in the amount of fertilizer.
We create a scatter plot with the amount of fertilizer applied on the x-axis and the yield of the crop on the y-axis. Each of the data points is an area of crop (different crop areas have different fertilizer amounts).
The technique of linear regression finds the linear model that best predicts the data and reports the slope and y-intercept. Linear regression tools also report the probability that you’d see a slope at least as steep as yours if there were no association (p-value) and the fraction of variability in the data explained by the linear model (R-squared).
Linear regression models our data with a linear equation (y = mx + b) plus random scatter around the line that is assumed to follow a normal distribution.
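A minimal sketch of these quantities, using only the Python standard library, is shown below. The fertilizer and yield numbers are invented for illustration, and the helper names `fit_line` and `permutation_p_value` are assumptions. Note one substitution: regression tools usually get the p-value from a t-test on the slope, whereas this sketch uses a permutation test (shuffling the yields to break any association), which answers the same "what if there were no association?" question without needing a t-distribution.

```python
import random


def fit_line(x, y):
    """Least-squares slope, y-intercept, and R-squared for y = m*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: 1 minus (unexplained variability / total variability).
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffled data sets whose slope is at least as steep as ours."""
    rng = random.Random(seed)
    observed = abs(fit_line(x, y)[0])
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # shuffling y breaks any link to x
        if abs(fit_line(x, shuffled)[0]) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical data: fertilizer applied and crop yield for 8 crop areas.
fertilizer = [0, 10, 20, 30, 40, 50, 60, 70]
crop_yield = [2.1, 2.9, 3.2, 4.1, 4.4, 5.3, 5.6, 6.4]

slope, intercept, r_squared = fit_line(fertilizer, crop_yield)
p_value = permutation_p_value(fertilizer, crop_yield)
print(f"slope = {slope:.4f}, intercept = {intercept:.3f}")
print(f"R-squared = {r_squared:.3f}, p-value = {p_value:.4f}")
```

For these made-up numbers the yields rise steadily with fertilizer, so the slope is positive, R-squared is close to 1, and very few shuffles produce a slope as steep as the observed one.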