R has many built-in functions to calculate probabilities for discrete random variables and probability densities for continuous random variables. These are also useful for calculating likelihoods. This section highlights a few of the most commonly used probability distributions. When calculating likelihoods, the log = TRUE option (yielding log-likelihoods instead) is recommended to prevent numerical underflow or overflow.

The basic structure of these commands is detailed in the section below on the binomial distribution. The structure of other commands will be similar, though options may vary.


Binomial distribution

Used for binary data (1 or 0; success or failure; alive or dead; mated or unmated). The number of “successes” in n trials will have a binomial distribution if separate trials are independent and if the probability of success p is the same in every trial.

The “dbinom” command is used to calculate Pr[X], the probability of obtaining X successes in n random trials, where X is an integer between 0 and n (n can be as small as 1 trial). To calculate this probability you will also need to specify p, the probability of success in each trial,

dbinom(X, size = n, prob = p)

The same command calculates L[ p | X ], the likelihood of the parameter value p given the observed number of successes X,

like <- dbinom(X, size = n, prob = p)                 # likelihood of p
loglike <- dbinom(X, size = n, prob = p, log = TRUE)  # log-likelihood of p

The same command can be used to obtain the vector of log-likelihoods corresponding to each of many values for p. For example,

p <- seq(0.01, 0.99, by = 0.01)
loglike <- dbinom(X, size = n, prob = p, log = TRUE)
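
For example, with hypothetical numbers (not from the data above) of X = 7 successes in n = 20 trials, the value of p with the highest log-likelihood can then be picked out with which.max,

X <- 7                                                # hypothetical number of successes
n <- 20                                               # hypothetical number of trials
p <- seq(0.01, 0.99, by = 0.01)
loglike <- dbinom(X, size = n, prob = p, log = TRUE)
p[which.max(loglike)]                                 # value of p with the highest log-likelihood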

Geometric distribution

In a sequence of trials, the number of trials ending in failure before the first success has a geometric distribution. Each trial in the sequence is assumed to be independent. The outcome of each trial is binary, either “success” or “failure”. The probability of success p is the same in every trial. The probability of failure in each trial is 1 - p.

The variable X counts the number of “failures” before the first “success”. In other words, X = 0 if success occurs on the 1st trial. X = 1 means that the first success occurred in the 2nd trial and the 1st trial ended in failure. X = 2 means that success occurred in the 3rd trial, while the 1st and 2nd trials ended in failure. And so on. Pr[X] is calculated as

dgeom(X, prob = p)            # p is the probability of success
dgeom(X, prob = p, log=TRUE)  # log of the probability instead

The same commands calculate the likelihood (and log-likelihood) of the parameter value p given a vector of data X containing observed values for the number of “failures” before the first “success”. The likelihoods corresponding to the independent data points are calculated and then multiplied to obtain their product. The log-likelihoods corresponding to the independent data points are calculated and then summed.

For example, if the vector of data from a series of trials is as follows,

X <- c(0,2,3,0,1,1,0,0,0,2,1,0,0,0)

then the likelihood and log-likelihood of the probability of success in any one trial, p, can be obtained as

like <- prod(dgeom(X, prob = p))                # likelihood of p
loglike <- sum(dgeom(X, prob = p, log = TRUE))  # log-likelihood of p

The above commands give the likelihood of only a single specified value for p. To obtain the likelihoods for a range of possible p values, you can use a short for loop.

loglike <- vector()
p <- seq(0.01, 0.99, by = 0.01)
for(i in 1:length(p)){
  loglike[i] <- sum(dgeom(X, prob = p[i], log = TRUE))
  }
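
As an alternative sketch, sapply computes the same vector without an explicit loop, and which.max picks out the best-supported value of p (using the p and X defined above),

loglike <- sapply(p, function(p.i) sum(dgeom(X, prob = p.i, log = TRUE)))
p[which.max(loglike)]    # value of p with the highest log-likelihood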

Hypergeometric distribution

Used for mark-recapture data. The hypergeometric distribution describes the number of recaptures (marked individuals) X in a random sample of k individuals from a finite population in which m individuals in total are marked and n individuals are unmarked.

The distribution is used to model mark-recapture data in which the goal is to estimate total population size m + n, where n is unknown. X can be any integer between 0 and the smaller of k and m. k can be any integer between 1 and m + n. The model assumes no births, deaths, immigration, or emigration between marking and recapturing. It also assumes that all individuals in the population have the same probability of being captured. The probability of X recaptures (marked individuals) in a random sample of size k is

dhyper(X, m, n, k)
dhyper(X, m, n, k, log = TRUE)
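
For example, here is a minimal sketch using hypothetical numbers (not from the text): m = 50 marked individuals, a sample of k = 40 individuals of which X = 10 are recaptures, and a range of candidate values for the unknown number of unmarked individuals n,

X <- 10                      # hypothetical number of recaptures in the sample
m <- 50                      # hypothetical number of marked individuals in the population
k <- 40                      # hypothetical sample size
n <- 50:500                  # candidate values for the unknown number of unmarked individuals
loglike <- dhyper(X, m, n, k, log = TRUE)
n[which.max(loglike)] + m    # total population size with the highest log-likelihood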

Poisson distribution

Used for count data. The Poisson distribution describes the number of events X in a “block” of space or time. The single parameter lambda is the population mean number of events per block. The assumption is that individual events occur randomly and independently in space or time. Block size is arbitrary but is usually chosen such that the mean number of events per block is neither very small nor very large. The probability of X events occurring in a single block is

dpois(X, lambda)
dpois(X, lambda, log = TRUE)
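
For a vector of counts X, one from each block, the log-likelihood of a candidate value of lambda is the sum of the log probabilities of the individual counts. A minimal sketch, assuming hypothetical count data,

X <- c(0, 2, 1, 3, 0, 1, 2, 1)             # hypothetical counts, one per block
lambda <- seq(0.1, 5, by = 0.1)            # candidate values for lambda
loglike <- sapply(lambda, function(L) sum(dpois(X, lambda = L, log = TRUE)))
lambda[which.max(loglike)]                 # value of lambda with the highest log-likelihood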

Normal distribution

Optimistically, this probability distribution approximates the frequency distribution of a trait in a population (except when it doesn’t!). Because it is a continuous distribution, the height of the normal curve refers to probability density rather than probability as such. Probability is instead represented by area under the normal curve.

The probability density of a normally distributed variable X having mean µ and standard deviation σ (written mu and sigma in the commands below) is

dnorm(X, mean = mu, sd = sigma)
dnorm(X, mean = mu, sd = sigma, log = TRUE)
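
For a vector of measurements X, the log-likelihood of a candidate pair of values for the mean and standard deviation is the sum of the log densities of the individual observations. A minimal sketch, assuming hypothetical data and candidate parameter values,

X <- c(4.1, 5.3, 6.0, 4.8, 5.5)                           # hypothetical measurements
loglike <- sum(dnorm(X, mean = 5, sd = 1, log = TRUE))    # log-likelihood of mean = 5, sd = 1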

Exponential distribution

The exponential distribution is often used to model the waiting time X between events occurring randomly and independently in time (or space). Because it is a continuous distribution, the height of the exponential curve at any X refers to probability density rather than probability. Probability is instead represented by area under the exponential curve.

The main assumption of the exponential distribution is that at any point in time, the probability of an event occurring in the next instant does not depend on how much time has already elapsed since the previous event. The parameter of the exponential distribution is the rate at which events occur – the number of events per unit time. The mean of the exponential distribution is 1/rate.

Waiting time X to the next event has probability density

dexp(X, rate = 1)
dexp(X, rate = 1, log = TRUE)

The default value for the rate is 1, so you must change it to fit your circumstances. Remember: the rate is the inverse of the mean waiting time.
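
To illustrate, using hypothetical waiting times (not from the text), the log-likelihood of a range of candidate rates can be computed and the best-supported rate compared with the inverse of the sample mean,

t <- c(0.4, 1.7, 0.9, 2.3, 0.2, 1.1)       # hypothetical waiting times
rate <- seq(0.1, 3, by = 0.01)             # candidate values for the rate
loglike <- sapply(rate, function(r) sum(dexp(t, rate = r, log = TRUE)))
rate[which.max(loglike)]                   # close to 1/mean(t)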


Use the bbmle package

The bbmle package can quickly find the maximum likelihood estimate for a parameter of a probability distribution fitted to data. You might need to install the package the first time you use it. Then load the package to begin.

library(bbmle)

The first step is to make a function that calculates the negative log likelihood of a parameter fitted to data.


Binomial distribution

For example, if fitting the binomial distribution, the function would look like the following. In this example, p is the unknown parameter to be estimated. X is the observed number of successes in the data and n is the observed number of trials (replace X and n in the command below with the actual numbers from the data).

myNegLLfunc <- function(p){-dbinom(X, size = n, prob = p, log = TRUE)}

Notice the minus sign in front of dbinom. The bbmle package minimizes the negative log-likelihood rather than maximizing the log-likelihood, but the result is the same.

The next step is to fit the model to the data using the mle2() command. The argument start = list(p = 0.5) provides a realistic starting value to begin the search for the maximum likelihood estimate. When fitting a single parameter, the profile log-likelihood is the same as the log-likelihood curve.

z <- mle2(myNegLLfunc, start = list(p = 0.5))

Finally, we apply functions to the model object z to extract useful quantities.

summary(z)                            # Provides the maximum likelihood estimate. Ignore the P-value, as usual.
plot(profile(z))                      # Visualize CI's for the parameter at different alpha levels.
confint(profile(z), method="uniroot") # Calculate the 95% confidence interval.


Exponential distribution

This second example shows how you would include data from a data frame in the function for the negative log likelihood. The exponential distribution is used to model waiting times to events. The parameter to be estimated is the rate. The data on waiting times are assumed to be in a column t of a data frame mydata.

myNegLLfunc <- function(rate){-sum(dexp(mydata$t, rate = rate, log = TRUE))}

The rest of the analysis is identical to that shown above for the binomial distribution.
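
For example, the fitting and extraction steps would look something like this (a sketch assuming the same mydata$t column; the starting value for the rate is arbitrary and should be adjusted to the scale of the data),

z <- mle2(myNegLLfunc, start = list(rate = 1))
summary(z)                              # maximum likelihood estimate of the rate
plot(profile(z))                        # visualize CI's for the rate
confint(profile(z), method = "uniroot") # 95% confidence interval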

 

© 2009-2024 Dolph Schluter