Homework 9

1. Use the code that you worked on in Homework #8 (creating fake data sets), and re-organize it following the principles of structured programming. Do all the work in a single chunk in your R markdown file, just as if you were writing a single R script. Start with all of your annotated functions, preliminary calls, and global variables. The program body should be only a few lines of code that call the appropriate functions and run them in the correct order. Make sure that the output from one function serves as the input to the next. You can either daisy-chain the functions or write separate lines of code to hold elements in temporary variables and pass them along.

# Description -----------------------------------------------------
# Simulations of a rotifer count data set and ANOVA analysis
# 23 March 2022
# AGS

# Initialize -----------------------------------------------------
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

set.seed(100)

# Create and Load Functions -----------------------------------------------------

################################
# FUNCTION: create_dataset
# purpose: create a simulated data frame of rotifer counts for the four lake groups
# input: n (sample size)
# output: data frame
# ------------------------------
create_dataset <- function(n = sample(20:80, 1)) {
  
  # create data frame
  d_frame <- data.frame(group1 = rnorm(n = n, mean = 300, sd = 50), # simulate count data, group 1
                        group2 = rnorm(n = n, mean = 250, sd = 20), # simulate count data, group 2
                        group3 = rnorm(n = n, mean = 455, sd = 65), # simulate count data, group 3
                        group4 = rnorm(n = n, mean = 511, sd = 20)) # simulate count data, group 4
  
  d_frame <- d_frame %>% pivot_longer(cols = group1:group4, # make the data frame longer
                                      names_to = "group",
                                      values_to = "counts")
  return(d_frame)
}

#create_dataset()


################################
# FUNCTION: run_stats
# purpose: run one-way ANOVA analysis and extract p-value
# input: data frame of rotifer counts for the four lake groups, and n
# output: p value from one-way ANOVA
# ------------------------------
run_stats <- function(data = NULL, n = sample(20:80, 1)) {
  if(is.null(data)) {
    d_frame <- data.frame(group = rep(c("group1", "group2", "group3", "group4"), each = n), 
                          counts = runif(4*n, min=200, max=600))
  }

  ano_model <- aov(counts~group, data = d_frame) # run anova
  z <- summary(ano_model) # model output
  ano_stats <- list(f_pval <- unlist(z)[9]) # pull out p value
  
  return(ano_stats)
}

#run_stats()


################################
# FUNCTION: graph_data
# purpose: create a boxplot of the rotifer counts in the four groups
# input: data frame of rotifer counts for the four lake groups, and n
# output: boxplot of rotifer counts for the four lake groups
# ------------------------------
graph_data <- function(data = NULL, n = sample(20:80, 1)) {
  if(is.null(data)) {
    d_frame <- data.frame(group = rep(c("group1", "group2", "group3", "group4"), each = n), 
                          counts = runif(4*n, min=200, max=600))
  }
  
  ano_plot <- ggplot(data = d_frame) + # create boxplot
            aes(x = group, y = counts) +
            geom_boxplot(fill = "lightblue") +
            labs(x = "Group", y = "Counts")
  
  return(ano_plot)
}

#graph_data()


# Global Variables -----------------------------------------------------

n <- 36 # sample size for each group

# Program Body -----------------------------------------------------

d_frame <- create_dataset(n = n) # create data frame

p_val <- run_stats(data = d_frame, n = n) # run statistical tests

boxplot <- graph_data(data = d_frame) # create boxplot

print(boxplot) # print boxplot

print(p_val) # print p-value from ANOVA

## [[1]]
##      Pr(>F)1 
## 5.394671e-59

2. Once your code is up and working, modify your program do something else: record a new summary variable, code a new statistical analysis, or create a different set of random variables or output graph. Do not rewrite any of your existing functions. Instead, copy them, rename them, and then modify them to do new things. Once your new functions are written, add some more lines of program code, calling a mixture of your previous functions and your new functions to get the job done.

The one-way ANOVA determines whether rotifer counts for at least one group are different than rotifer counts in another group, but does not determine which group(s) have statistically different counts.

I modified my run_stats function to run a post-hoc Tukey HSD test to compare all four groups to each other, i.e. test for significance for all possible comparisons of two groups. Now, the run_stats function reports the p-value from the one-way ANOVA, as well as the Tukey HSD results and a figure showing the results of the Tukey HSD test.

An additional change that I made to this iteration of the script is that I added the raw data points to my boxplot of the rotifer counts in the four groups. I made this change in the graph_data function when I was plotting the boxplot in ggplot.

# Description -----------------------------------------------------
# Simulations of a rotifer count data set, ANOVA analysis, and post-hoc tests
# 23 March 2022
# AGS

# Initialize -----------------------------------------------------
library(tidyverse)
set.seed(100)

# Create and Load Functions -----------------------------------------------------

################################
# FUNCTION: create_dataset
# purpose: create a simulated data frame of rotifer counts for the four lake groups
# input: n (sample size)
# output: data frame
# ------------------------------
create_dataset <- function(n = sample(20:80, 1)) {
  
  # create data frame
  d_frame <- data.frame(group1 = rnorm(n = n, mean = 300, sd = 50), # simulate count data, group 1
                        group2 = rnorm(n = n, mean = 250, sd = 20), # simulate count data, group 2
                        group3 = rnorm(n = n, mean = 455, sd = 65), # simulate count data, group 3
                        group4 = rnorm(n = n, mean = 511, sd = 20)) # simulate count data, group 4
  
  d_frame <- d_frame %>% pivot_longer(cols = group1:group4, # make the data frame longer
                                      names_to = "group",
                                      values_to = "counts")
  return(d_frame)
}

#create_dataset()


################################
# FUNCTION: run_stats
# purpose: run one-way ANOVA analysis and post-hoc Tukey HSD test
# input: data frame of rotifer counts for the four lake groups, and n
# output: p value from one-way ANOVA, Tukey HSD results, Tukey figure
# ------------------------------
run_stats <- function(data = NULL, n = sample(20:80, 1)) {
  if(is.null(data)) {
    d_frame <- data.frame(group = rep(c("group1", "group2", "group3", "group4"), each = n), 
                          counts = runif(4*n, min=200, max=600))
  }

  ano_model <- aov(counts~group, data = d_frame) # run anova
  z <- summary(ano_model) # model output
  ano_stats <- list(f_pval <- unlist(z)[9]) # pull out p value
  
  tukey <- TukeyHSD(ano_model) # run Tukey HSD test
  flat_out <- unlist(tukey) # unlist
  
  return(c(ano_stats, tukey, plot(tukey))) # return p val, tukey results, and tukey figure
}

#run_stats()


################################
# FUNCTION: graph_data
# purpose: create a boxplot of the rotifer counts in the four groups
# input: data frame of rotifer counts for the four lake groups, and n
# output: boxplot of rotifer counts for the four lake groups, with raw data points
# ------------------------------
graph_data <- function(data = NULL, n = sample(20:80, 1)) {
  if(is.null(data)) {
    d_frame <- data.frame(group = rep(c("group1", "group2", "group3", "group4"), each = n), 
                          counts = runif(4*n, min=200, max=600))
  }
  
  ano_plot <- ggplot(data = d_frame) + # create boxplot
            aes(x = group, y = counts) +
            geom_boxplot(fill = "lightblue", outlier.shape = NA) +
            geom_jitter(width = 0.2, colour = "darkslateblue", alpha = 0.5) + # add the raw data points
            labs(x = "Group", y = "Counts") 
  
  return(ano_plot)
}

#graph_data()


# Global Variables -----------------------------------------------------

n <- 36 # sample size for each group

# Program Body -----------------------------------------------------

d_frame <- create_dataset(n = n) # create data frame

stats <- run_stats(data = d_frame, n = n) # run statistical tests

boxplot <- graph_data(data = d_frame) # create boxplot

print(boxplot) # print boxplot

print(stats[1]) # print p-value from ANOVA

## [[1]]
##      Pr(>F)1 
## 5.394671e-59

print(stats[2]) # print Tukey HSD results

## $group
##                    diff       lwr       upr        p adj
## group2-group1 -49.89186 -76.33232 -23.45140 1.502549e-05
## group3-group1 141.71532 115.27486 168.15578 1.265654e-14
## group4-group1 208.59913 182.15867 235.03959 1.265654e-14
## group3-group2 191.60719 165.16673 218.04765 1.265654e-14
## group4-group2 258.49099 232.05053 284.93145 1.265654e-14
## group4-group3  66.88380  40.44334  93.32426 5.290958e-09

In the Tukey HSD graph, we can see that the confidence intervals do not cross the line at zero. This shows that all groups are significantly different from each other.

Back to Home Page

Homework 9

Anna Schmidt

3/23/2022