Homework 10

1. Using a for loop, write a function to calculate the number of zeroes in a numeric vector. Before entering the loop, set up a counter variable counter <- 0. Inside the loop, add 1 to counter each time you have a zero in the vector. Finally, use return(counter) for the output.

################################
# FUNCTION: calculate_zeroes
# purpose: calculate the number of zeroes in a numeric vector
# input: vec = numeric vector
# output: counter = number of zeroes in the numeric vector
# ------------------------------

calculate_zeroes <- function(vec = c(0, 0, 5, 1)) {
  counter <- 0 # set up counter variable
  for (i in 1:length(vec)) {
    if (vec[i] == 0) counter <- counter + 1
    } # end of for loop
  return(counter) # output
} # end of function

calculate_zeroes() # test defaults, counter should = 2

## [1] 2

calculate_zeroes(c(0, 10, 25, 0, 0, 0)) # test a different vector, counter should = 4

## [1] 4

2. Use subsetting instead of a loop to rewrite the function as a single line of code.

vec <- c(0, 0, 5, 1)

length(which(vec == 0)) # should = 2

## [1] 2

3. Write a function that takes as input two integers representing the number of rows and columns in a matrix. The output is a matrix of these dimensions in which each element is the product of the row number x the column number.

################################
# FUNCTION: matrix_products
# purpose: create a matrix in which each element is the product of its row# x column#
# input: nrow = integer representing the number of rows in a matrix
#        ncol = integer representing the number of columns in a matrix
# output: m = matrix of dimension(nrow,ncol)
# ------------------------------

matrix_products <- function(nrow = 3,
                            ncol = 4) {
  m <- matrix(runif(nrow*ncol), nrow = nrow) # create matrix with random values
  
  # double for loop to loop over rows and columns
  for (i in 1:nrow(m)) { # loop over rows
    for (j in 1:ncol(m)) { # loop over columns
      m[i,j] <- i*j
      } # end of column loop
    } # end of row loop
  
  return(m)
  
} # end of function

matrix_products() # test defaults

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    2    4    6    8
## [3,]    3    6    9   12

matrix_products(nrow = 5, ncol = 6) # test with different inputs

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    2    3    4    5    6
## [2,]    2    4    6    8   10   12
## [3,]    3    6    9   12   15   18
## [4,]    4    8   12   16   20   24
## [5,]    5   10   15   20   25   30

4. In the next few lectures, you will learn how to do a randomization test on your data. We will complete some of the steps today to practice calling custom functions within a for loop. Use the code from the March 31st lecture (Randomization Tests) to complete the following steps:

a. Simulate a dataset with 3 groups of data, each group drawn from a distribution with a different mean. The final data frame should have 1 column for group and 1 column for the response variable.

set.seed(2000)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Create random numbers for three groups of data
control <- rnorm(n = 10, mean = 10, sd = 10)     # mean = 10
treatmentA <- rnorm(n = 10, mean = 60, sd = 10)  # mean = 60
treatmentB <- rnorm(n = 10, mean = 85, sd = 10)  # mean = 85

# Create data frame
d_frame <- data.frame(control, treatmentA, treatmentB)

# Make the data frame longer
d_frame <- d_frame %>%
  pivot_longer(cols = control:treatmentB,
               names_to = "group",
               values_to = "response_var")
head(d_frame)

## # A tibble: 6 × 2
##   group      response_var
##   <chr>             <dbl>
## 1 control            1.46
## 2 treatmentA        78.9 
## 3 treatmentB        89.6 
## 4 control            6.47
## 5 treatmentA        68.4 
## 6 treatmentB        67.4

b. Write a custom function that 1) reshuffles the response variable, and 2) calculates the mean of each group in the reshuffled data. Store the means in a vector of length 3.

################################
# FUNCTION: randomization_test
# purpose: reshuffle the response variable and calculate the mean of each group in the reshuffled data
# input: df = data frame with "response_var" and "group" columns
# output: means = vector of length 3, containing the means of each group
# ------------------------------

randomization_test <- function(df = NULL) {
  if(is.null(df)) {
    df <- data.frame(group = rep(c("control", "treatmentA", "treatmentB"), each = 10),
                     response_var = runif(30, min = 0, max = 200))
  }
  
  # reshuffle the response variable
  df$response_var <- sample(df$response_var)
  
  # calculate the mean of each group in the reshuffled data, and store the means in a vector of length 3
  means <- tapply(df$response_var, df$group, mean)
  return(means)
  
} # end of function

randomization_test() # test defaults

##    control treatmentA treatmentB 
##   89.62789  100.68889   95.22473

randomization_test(df = d_frame) # test with our data frame

##    control treatmentA treatmentB 
##   50.21755   47.59663   60.93968

c. Use a for loop to repeat the function in b 100 times. Store the results in a data frame that has 1 column indicating the replicate number and 1 column for each new group mean, for a total of 4 columns.

# Create empty data frame with all NAs
results <- data.frame(replicate = rep(NA,100), 
                      means_control = rep(NA,100), 
                      means_treatmentA = rep(NA,100), 
                      means_treatmentB = rep(NA,100))

# For loop to repeat randomization_test() 100 times
for (i in 1:100) {
  means <- randomization_test(df = d_frame) # run randomization_test() function
  
  results$replicate[i] <- i # fill in replicate column
  
  results$means_control[i] <- means["control"] # fill in means_control column
  results$means_treatmentA[i] <- means["treatmentA"] # fill in means_treatmentA column
  results$means_treatmentB[i] <- means["treatmentB"] # fill in means_treatmentB column
  
} # end of for loop

head(results)

##   replicate means_control means_treatmentA means_treatmentB
## 1         1      51.09730         62.48067         45.17589
## 2         2      34.20561         71.30688         53.24137
## 3         3      47.29543         67.91524         43.54318
## 4         4      63.63411         49.08960         46.03015
## 5         5      29.55995         54.71681         74.47710
## 6         6      54.16426         52.23951         52.35009

d. Use qplot() to create a histogram of the means for each reshuffled group. Or, if you want a challenge, use ggplot() to overlay all 3 histograms in the same figure. How do the distributions of reshuffled means compare to the original means?

# First, make the data frame longer
results_long <- results %>%
  pivot_longer(cols = means_control:means_treatmentB,
               names_to = "group",
               values_to = "means")

head(results_long)

## # A tibble: 6 × 3
##   replicate group            means
##       <int> <chr>            <dbl>
## 1         1 means_control     51.1
## 2         1 means_treatmentA  62.5
## 3         1 means_treatmentB  45.2
## 4         2 means_control     34.2
## 5         2 means_treatmentA  71.3
## 6         2 means_treatmentB  53.2

# all on the same plot
ggplot(data = results_long, 
       aes(x = means, fill = group, color = group)) +
  geom_histogram(position = "dodge", alpha = 0.2) +
  theme(legend.position="top")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# faceted
ggplot(data = results_long, 
       aes(x = means)) +
  geom_histogram(fill = "lightblue", col = "white", position = "dodge") +
  theme(legend.position="top") +
  facet_wrap(~group, nrow = 3)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distributions of reshuffled means for the control, treatmentA, and treatmentB groups all look relatively similar, and they are all centered around a mean of 55. It makes sense that these distributions all look similar since the response variables were randomly shuffled in the randomization_test() function. Since the original means were control = 10, treatmentA = 60, and treatmentB = 85, the fact that these distributions are centered around 55 makes sense, since that is centered in the context of the original means.

5. For comparison, calculate in R the standard statistical analysis you would use with these data. How does the p-value compare for the standard test versus the p value you estimated from your randomization test? If the p values seem very different, run the program again with a different starting seed (and/or increase the number of replications in your randomization test). If there are persistent differences in the p value of the standard test versus your randomization, what do you think is responsible for this difference?

Here, I am going to recreate the histogram of simulated means, and add the observed means as vertical lines.

ggplot(data = results_long, 
       aes(x = means, fill = group, color = group)) +
  geom_histogram(position = "dodge", alpha = 0.2) +
  theme(legend.position="top") +
  geom_vline(aes(xintercept = 10), col = "red") +    # add control mean line
  geom_vline(aes(xintercept = 60), col = "green") +  # add treatmentA mean line
  geom_vline(aes(xintercept = 85), col = "blue")     # add treatmentB mean line

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looking at this histogram with the observed means added, we can see that the histogram lies between the control mean and treatmentB mean, and overlaps with the treatmentA mean. Within the histogram itself, the simulated means of the three groups overlap each other quite a lot, which makes sense because the values were shuffled for the randomization. The histograms are not covering the actual observed means for the control and treatmentB groups. This indicates that these observed means would rarely happen by chance if the null hypothesis was true (no statistically significant difference between the groups). Thus, the means for the three groups are likely significantly different from one another.

Back to Home Page

Homework 10

Anna Schmidt

3/30/2022