We often want to determine where we stand in relation to a distribution. If we have an observation, is it typical or unusual for our distribution? What percentage of the data is above or below it?

We answer these questions with areas under a probability distribution. First, let’s introduce the probability distribution.

Probability Distribution

A probability distribution is a model that explains the relative probability of different events happening. The probability distribution is an ideal mathematical model. The real world doesn’t fit it exactly, but the model provides lots of practical power.

When a probability distribution is plotted, the x-axis is the value of a measurement and the y-axis is the relative frequency (the probability density) of that value.

The area under the curve of a (normalized) probability distribution is exactly one.

The probability of a measurement occurring between two values is the area under the curve between those two values.

Probability distributions are idealized models, just as a rectangle or a circle is an idealized shape. A probability distribution is usually written as a mathematical function.

For example, the Gaussian distribution with mean \mu and standard deviation \sigma is given by

f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-(x-\mu)^2/(2\sigma^2)}

Notice our new friend, the number e, showing up.
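
As a quick check on the claim that the total area is one, here is a minimal sketch that evaluates the standard Gaussian density (with 2\sigma^2 in the exponent) and adds up the area numerically. The function names gaussian_pdf and trapezoid_area, the choice of mu = 0 and sigma = 1, and the integration range of -10 to 10 are assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma**2))


def trapezoid_area(f, a, b, n=10_000):
    """Approximate the area under f between a and b using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# The tails beyond 10 standard deviations are negligible, so this range
# stands in for the whole x-axis (an assumption of this sketch).
total_area = trapezoid_area(gaussian_pdf, -10.0, 10.0)
print(round(total_area, 6))
```

The printed area should be very close to one, which is what normalization means.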

Relative Probability

If you can calculate or estimate the area under the curve of a probability distribution, you can predict the probability of an event occurring.

For example, if we want to know the probability of a measurement falling between 1 and 2, we estimate the area under the curve between 1 and 2 and divide by the total area under the curve. For a normalized distribution the total area is one, so the division changes nothing.
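
The calculation above can be sketched in a few lines. This example assumes a standard Gaussian (mean 0, standard deviation 1), which the text does not specify, and uses hypothetical helper names pdf and area. It also compares the estimate against Python's built-in statistics.NormalDist.

```python
import math
from statistics import NormalDist


def pdf(x):
    """Standard Gaussian density (mean 0, standard deviation 1)."""
    return math.exp(-(x**2) / 2.0) / math.sqrt(2.0 * math.pi)


def area(f, a, b, n=2_000):
    """Approximate the area under f between a and b using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# Area between 1 and 2, divided by the (approximate) total area.
p_between = area(pdf, 1.0, 2.0) / area(pdf, -10.0, 10.0)

# Reference value from the standard library for comparison.
exact = NormalDist().cdf(2.0) - NormalDist().cdf(1.0)

print(round(p_between, 4))  # about 0.1359
print(round(exact, 4))
```

Under these assumptions, roughly 13.6% of measurements fall between 1 and 2.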

Percentile

To ask where a value sits within a distribution, we use the percentile. The Xth percentile is the value Y below which X percent of the data falls. On a probability distribution, it is the value Y for which the area under the curve below Y is X percent of the total area.

For example, the median is the same thing as the 50th percentile. That is, it is the value that 50% of the data falls below.

This is useful for understanding the shape of a distribution. You may hear the terms quartiles, quintiles, or deciles, which split the data into four, five, or ten equal-sized groups.

A percentile has the same units as the measurement itself.

For example, the 60th percentile of the CalEnviroScreen asthma measurements is the asthma value that 60% of the measurements in the data set fall below.
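
Here is a minimal sketch of computing a percentile from a list of measurements. The values are invented for illustration and are not CalEnviroScreen data. The function name percentile is hypothetical, and linear interpolation between neighboring sorted values is only one of several common conventions, so other tools may give slightly different answers.

```python
import math
from statistics import median


def percentile(data, p):
    """Value below which roughly p percent of the data falls.

    Uses linear interpolation between the two nearest sorted values.
    """
    if not data:
        raise ValueError("data must not be empty")
    ordered = sorted(data)
    pos = (p / 100.0) * (len(ordered) - 1)  # fractional index into ordered
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


# Made-up measurements, in whatever units the measurement uses.
values = [12.0, 25.0, 31.0, 40.0, 47.0, 52.0, 60.0, 68.0, 75.0, 90.0]

p60 = percentile(values, 60)
p50 = percentile(values, 50)

print(round(p60, 1))  # about 55.2, in the same units as the data
print(p50, median(values))  # the 50th percentile matches the median
```

Six of the ten made-up values lie below the result, which matches the meaning of the 60th percentile.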

Percentile Rank

The percentile rank is the inverse of the percentile. Given a value Y, it tells you the percentage of the data that has a value below Y.

This is useful for understanding if your observation is typical or unusual.

The percentile rank has no units, since it is the fraction of the data below your observation, expressed as a percentage.

PR = \frac{\text{number of values below Y}}{\text{number of all values}} \times 100
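
The percentile rank can be sketched directly from this ratio, scaled by 100 so the result reads as a percentage. The function name percentile_rank and the sample values are made up for illustration. The sketch counts only values strictly below Y, which is one convention; some definitions also count half of any ties.

```python
def percentile_rank(data, y):
    """Percentage of values in data that are strictly below y."""
    if not data:
        raise ValueError("data must not be empty")
    below = sum(1 for value in data if value < y)
    return 100.0 * below / len(data)


# Made-up measurements for illustration only.
values = [12.0, 25.0, 31.0, 40.0, 47.0, 52.0, 60.0, 68.0, 75.0, 90.0]

rank = percentile_rank(values, 55.2)
print(rank)  # 60.0, since six of the ten values are below 55.2
```

The result is unitless: it describes the observation's position in the data, not its size.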

Cumulative Distribution Function

The Cumulative Distribution Function (CDF) adds up the area under the curve to the left of each value. (It is the integral of the probability distribution.) Thus, the CDF makes it very quick to see which percentile your observation falls in.
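
As a sketch of how the CDF ties these ideas together, the example below uses statistics.NormalDist from Python's standard library. It assumes a standard Gaussian (mean 0, standard deviation 1) purely for illustration; real data would need its own fitted or empirical distribution.

```python
from statistics import NormalDist

# Assumed model for this sketch: a standard Gaussian.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# The CDF at a value is the area to its left, so multiplying by 100
# gives the percentile rank of that value.
rank_of_one = 100.0 * standard_normal.cdf(1.0)

# The inverse CDF goes the other way: from a percentile to a value.
median_value = standard_normal.inv_cdf(0.50)

# The probability between two values is a difference of two CDF values.
p_between = standard_normal.cdf(2.0) - standard_normal.cdf(1.0)

print(round(rank_of_one, 2))  # about 84.13
print(round(median_value, 6))
print(round(p_between, 4))  # about 0.1359
```

With the CDF in hand, no explicit area calculation is needed: one lookup gives a percentile rank, and a subtraction gives the probability of falling in a range.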

Data Sources