Parts 1 and 2

Open libraries

library(ggplot2) # for graphics
library(MASS) # for maximum likelihood estimation

Read in data vector

To illustrate, we will generate some fake data here:

(commented out since I replaced it with my own data; see below)

# # quick and dirty, a truncated normal distribution to work on the solution set
# 
# z <- rnorm(n=3000,mean=0.2)
# z <- data.frame(1:3000,z)
# names(z) <- list("ID","myVar")
# z <- z[z$myVar>0,] # only include positive values of myVar
# str(z)
# summary(z$myVar)

In the third step of this exercise, you will substitute in your own data for this fake data set. But for now, use the code chunks below to see how you fit different statistical distributions to a vector of observations, and then estimate the maximum likelihood parameters for each distribution.

Loading in my own data

The data I am loading in here is length data (in mm) for zooplankton individuals that we collected at the CEREEP-Ecotron Ile-de-France “PLANAQUA” facility in Saint-Pierre-lès-Nemours, France. The .csv file I am loading in contains length data for zooplankton collected from the epilimnion (surface waters) of Lake 1 at night on August 1st, 2019.

z <- read.table("HW6_M1_01AUG2019_N_EP_022_copy.csv", header=TRUE, sep=",")
z <- data.frame(z$length_mm) # only keep the length_mm column
z <- data.frame(1:223,z)
names(z) <- list("ID","myVar")
str(z)
## 'data.frame':    223 obs. of  2 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ myVar: num  0.614 0.606 0.494 0.482 0.475 ...
summary(z) # summary of min, Q1, median, mean, Q3, and max for myVar
##        ID            myVar       
##  Min.   :  1.0   Min.   :0.1081  
##  1st Qu.: 56.5   1st Qu.:0.2447  
##  Median :112.0   Median :0.3029  
##  Mean   :112.0   Mean   :0.3076  
##  3rd Qu.:167.5   3rd Qu.:0.3664  
##  Max.   :223.0   Max.   :0.6143

Plot histogram of data

Plot a histogram of the data, using a modification of the code from lecture. Here we are switching from qplot to ggplot for more graphics options. We are also rescaling the y axis of the histogram from counts to density, so that the area under the histogram equals 1.0.

p1 <- ggplot(data=z, aes(x=myVar, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) 
print(p1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Add empirical density curve

Now modify the code to add in a kernel density plot of the data. This is an empirical curve that is fitted to the data. It does not assume any particular probability distribution, but it smooths out the shape of the histogram:

p1 <-  p1 +  geom_density(linetype="dotted",size=0.75)
print(p1) # dotted line is the empirical density curve
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Get maximum likelihood parameters for normal

Next, fit a normal distribution to your data and grab the maximum likelihood estimators of the two parameters of the normal, the mean and the variance:

normPars <- fitdistr(z$myVar,"normal")
print(normPars)
##       mean           sd     
##   0.307611659   0.089921338 
##  (0.006021578) (0.004257899)
str(normPars)
## List of 5
##  $ estimate: Named num [1:2] 0.3076 0.0899
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ sd      : Named num [1:2] 0.00602 0.00426
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ vcov    : num [1:2, 1:2] 3.63e-05 0.00 0.00 1.81e-05
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "mean" "sd"
##   .. ..$ : chr [1:2] "mean" "sd"
##  $ n       : int 223
##  $ loglik  : num 221
##  - attr(*, "class")= chr "fitdistr"
normPars$estimate["mean"] # note structure of getting a named attribute
##      mean 
## 0.3076117

Plot normal probability density

Now let’s call the dnorm function inside ggplot’s stat_function to generate the probability density for the normal distribution. Read about stat_function in the help system to see how you can use this to add a smooth function to any ggplot. Note that we first get the maximum likelihood parameters for a normal distribution fitted to these data by calling fitdistr. Then we pass those parameters (meanML and sdML) to stat_function:

meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]

xval <- seq(0,max(z$myVar),len=length(z$myVar))

stat <- stat_function(aes(x = xval, y = ..y..), fun = dnorm, colour="red", n = length(z$myVar), args = list(mean = meanML, sd = sdML))
p1 + stat # red line is normal distribution
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that the best-fitting normal distribution (red curve) does not track these data well. The lengths are bounded below at zero and right-skewed, so the symmetric normal curve cannot match the shape of the histogram.
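That asymmetry can be put into a number with a sample skewness function (a small side-check added here for illustration; `sample_skewness` is a helper defined below, not part of base R):

```r
# Sample skewness: the third standardized moment. It is near zero for a
# symmetric, normal-like sample and positive when the right tail is longer.
sample_skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Sanity checks on synthetic data:
set.seed(1)
sample_skewness(rnorm(10000))        # symmetric: near 0
sample_skewness(rgamma(10000, 2, 1)) # right-skewed: clearly positive
```

Applied to the lengths as `sample_skewness(z$myVar)`, a positive value quantifies the right skew that the symmetric normal cannot capture.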


Plot exponential probability density

Now let’s use the same template and add in the curve for the exponential:

expoPars <- fitdistr(z$myVar,"exponential")
rateML <- expoPars$estimate["rate"]

stat2 <- stat_function(aes(x = xval, y = ..y..), fun = dexp, colour="blue", n = length(z$myVar), args = list(rate=rateML))
p1 + stat + stat2 # blue line is exponential distribution
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Plot uniform probability density

For the uniform, we don’t need to use fitdistr because the maximum likelihood estimators of the two parameters are just the minimum and the maximum of the data:

stat3 <- stat_function(aes(x = xval, y = ..y..), fun = dunif, colour="darkgreen", n = length(z$myVar), args = list(min=min(z$myVar), max=max(z$myVar)))
p1 + stat + stat2 + stat3 # green line is the uniform distribution
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Plot gamma probability density

gammaPars <- fitdistr(z$myVar,"gamma")
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]

stat4 <- stat_function(aes(x = xval, y = ..y..), fun = dgamma, colour="brown", n = length(z$myVar), args = list(shape=shapeML, rate=rateML))
p1 + stat + stat2 + stat3 + stat4 # brown line is the gamma distribution
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Plot beta probability density

This one has to be shown in its own plot because the raw data must be rescaled so they are between 0 and 1, and then they can be compared to the beta.

pSpecial <- ggplot(data=z, aes(x=myVar/(max(myVar + 0.1)), y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) + 
  xlim(c(0,1)) +
  geom_density(size=0.75,linetype="dotted")

betaPars <- fitdistr(x=z$myVar/max(z$myVar + 0.1),start=list(shape1=1,shape2=2),"beta")
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
shape1ML <- betaPars$estimate["shape1"]
shape2ML <- betaPars$estimate["shape2"]

statSpecial <- stat_function(aes(x = xval, y = ..y..), fun = dbeta, colour="orchid", n = length(z$myVar), args = list(shape1=shape1ML,shape2=shape2ML))
pSpecial + statSpecial # purple line is the beta distribution
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (geom_bar).


Part 3

The best-fitting distribution for my zooplankton length data is the gamma distribution (brown line above). It tracks the overall shape of the length data, and its right tail also accommodates the few outliers above 0.6 mm.
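To back this visual judgment with numbers, the maximized log-likelihoods that fitdistr stores can be turned into AIC values (a sketch: it refits on a simulated stand-in sample so the chunk runs on its own; on the real data you would replace x with z$myVar, or reuse normPars, expoPars, and gammaPars from above):

```r
library(MASS)

# AIC = 2k - 2*logLik; lower is better. Each fitdistr object stores its
# maximized log-likelihood in $loglik and its k parameters in $estimate.
set.seed(1)
x <- rgamma(223, shape = 12, rate = 40) # stand-in sample for illustration

fits <- list(
  normal      = fitdistr(x, "normal"),
  exponential = fitdistr(x, "exponential"),
  gamma       = fitdistr(x, "gamma")
)
aic <- sapply(fits, function(f) 2 * length(f$estimate) - 2 * f$loglik)

# The uniform was not fit with fitdistr; its maximized log-likelihood is
# -n * log(max - min), with k = 2 parameters (the min and the max).
aic["uniform"] <- 2 * 2 - 2 * (-length(x) * log(max(x) - min(x)))

sort(aic) # smallest AIC = best-supported distribution
```

Whichever distribution has the smallest AIC is the best supported; if the visual impression above is right, on the real lengths that should single out the gamma.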


Part 4

Getting the maximum likelihood parameters for the gamma distribution:

gammaPars <- fitdistr(z$myVar,"gamma")
## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced

## Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"] 

Simulating a new data set with the same length as my original vector:

z_sim <- rgamma(n = 223, shape = shapeML, rate = rateML)
z_sim <- data.frame(1:223,z_sim)
names(z_sim) <- list("ID","myVar_sim")
str(z_sim)
## 'data.frame':    223 obs. of  2 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ myVar_sim: num  0.329 0.298 0.306 0.246 0.485 ...
summary(z_sim$myVar_sim)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1364  0.2458  0.2956  0.3068  0.3650  0.5674

Plotting the new data set in a histogram with probability density curve:

p1 <- ggplot(data=z_sim, aes(x=myVar_sim, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) + 
  geom_density(linetype="dotted",size=0.75)
print(p1) # dotted line is the empirical density curve
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Fresh histogram of the original data with probability density curve:

p1 <- ggplot(data=z, aes(x=myVar, y=..density..)) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) +
  geom_density(linetype="dotted",size=0.75)
print(p1) # dotted line is the empirical density curve
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How do the two histogram profiles compare? Do you think the model is doing a good job of simulating realistic data that match your original measurements? Why or why not?

The two histogram profiles above look relatively similar. The simulated dataset’s probability density curve matches that of the original dataset: it peaks at a density of about 4 and has a slightly longer tail to the right. My original dataset had two outlier values with lengths over 0.6 mm, and the simulated data captured similar outliers. Overall, I think the model is doing a good job of simulating realistic data that match my original measurements, since the shape of the probability density curve is preserved and the outliers on the right are captured.
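Beyond eyeballing the two histograms, a one-sample Kolmogorov–Smirnov test can compare the data to the fitted gamma directly (a sketch on a simulated stand-in sample; on the real data the call would be `ks.test(z$myVar, "pgamma", shape = shapeML, rate = rateML)`):

```r
# One-sample KS test: is the empirical distribution consistent with the
# gamma CDF? The shape and rate values here are stand-ins for illustration.
set.seed(1)
x <- rgamma(223, shape = 12, rate = 40)
ks.test(x, "pgamma", shape = 12, rate = 40)
```

One caveat: when the gamma parameters are estimated from the same data being tested, the reported p-value is optimistic; a parametric bootstrap would give an honest test. Ties in real measurement data may also trigger a warning.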

