1. Using a for
loop, write a function to calculate the number of zeroes in a numeric vector. Before entering the loop, set up a counter variable counter <- 0
. Inside the loop, add 1 to counter
each time you have a zero in the vector. Finally, use return(counter)
for the output.
################################
# FUNCTION: calculate_zeroes
# purpose: calculate the number of zeroes in a numeric vector
# input: vec = numeric vector
# output: counter = number of zeroes in the numeric vector
# ------------------------------
calculate_zeroes <- function(vec = c(0, 0, 5, 1)) {
counter <- 0 # set up counter variable
for (i in 1:length(vec)) {
if (vec[i] == 0) counter <- counter + 1
} # end of for loop
return(counter) # output
} # end of function
calculate_zeroes() # test defaults, counter should = 2
## [1] 2
calculate_zeroes(c(0, 10, 25, 0, 0, 0)) # test a different vector, counter should = 4
## [1] 4
2. Use subsetting instead of a loop to rewrite the function as a single line of code.
vec <- c(0, 0, 5, 1)
length(which(vec == 0)) # should = 2
## [1] 2
3. Write a function that takes as input two integers representing the number of rows and columns in a matrix. The output is a matrix of these dimensions in which each element is the product of the row number x the column number.
################################
# FUNCTION: matrix_products
# purpose: create a matrix in which each element is the product of its row# x column#
# input: nrow = integer representing the number of rows in a matrix
# ncol = integer representing the number of columns in a matrix
# output: m = matrix of dimension(nrow,ncol)
# ------------------------------
matrix_products <- function(nrow = 3,
ncol = 4) {
m <- matrix(runif(nrow*ncol), nrow = nrow) # create matrix with random values
# double for loop to loop over rows and columns
for (i in 1:nrow(m)) { # loop over rows
for (j in 1:ncol(m)) { # loop over columns
m[i,j] <- i*j
} # end of column loop
} # end of row loop
return(m)
} # end of function
matrix_products() # test defaults
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 2 4 6 8
## [3,] 3 6 9 12
matrix_products(nrow = 5, ncol = 6) # test with different inputs
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 2 3 4 5 6
## [2,] 2 4 6 8 10 12
## [3,] 3 6 9 12 15 18
## [4,] 4 8 12 16 20 24
## [5,] 5 10 15 20 25 30
4. In the next few lectures, you will learn how to do a randomization test on your data. We will complete some of the steps today to practice calling custom functions within a for loop. Use the code from the March 31st lecture (Randomization Tests) to complete the following steps:
a. Simulate a dataset with 3 groups of data, each group drawn from a distribution with a different mean. The final data frame should have 1 column for group and 1 column for the response variable.
set.seed(2000)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Create random numbers for three groups of data
control <- rnorm(n = 10, mean = 10, sd = 10) # mean = 10
treatmentA <- rnorm(n = 10, mean = 60, sd = 10) # mean = 60
treatmentB <- rnorm(n = 10, mean = 85, sd = 10) # mean = 85
# Create data frame
d_frame <- data.frame(control, treatmentA, treatmentB)
# Make the data frame longer
d_frame <- d_frame %>%
pivot_longer(cols = control:treatmentB,
names_to = "group",
values_to = "response_var")
head(d_frame)
## # A tibble: 6 × 2
## group response_var
## <chr> <dbl>
## 1 control 1.46
## 2 treatmentA 78.9
## 3 treatmentB 89.6
## 4 control 6.47
## 5 treatmentA 68.4
## 6 treatmentB 67.4
b. Write a custom function that 1) reshuffles the response variable, and 2) calculates the mean of each group in the reshuffled data. Store the means in a vector of length 3.
################################
# FUNCTION: randomization_test
# purpose: reshuffle the response variable and calculate the mean of each group in the reshuffled data
# input: df = data frame with "response_var" and "group" columns
# output: means = vector of length 3, containing the means of each group
# ------------------------------
randomization_test <- function(df = NULL) {
if(is.null(df)) {
df <- data.frame(group = rep(c("control", "treatmentA", "treatmentB"), each = 10),
response_var = runif(30, min = 0, max = 200))
}
# reshuffle the response variable
df$response_var <- sample(df$response_var)
# calculate the mean of each group in the reshuffled data, and store the means in a vector of length 3
means <- tapply(df$response_var, df$group, mean)
return(means)
} # end of function
randomization_test() # test defaults
## control treatmentA treatmentB
## 89.62789 100.68889 95.22473
randomization_test(df = d_frame) # test with our data frame
## control treatmentA treatmentB
## 50.21755 47.59663 60.93968
c. Use a for loop to repeat the function in b 100 times. Store the results in a data frame that has 1 column indicating the replicate number and 1 column for each new group mean, for a total of 4 columns.
# Create empty data frame with all NAs
results <- data.frame(replicate = rep(NA,100),
means_control = rep(NA,100),
means_treatmentA = rep(NA,100),
means_treatmentB = rep(NA,100))
# For loop to repeat randomization_test() 100 times
for (i in 1:100) {
means <- randomization_test(df = d_frame) # run randomization_test() function
results$replicate[i] <- i # fill in replicate column
results$means_control[i] <- means["control"] # fill in means_control column
results$means_treatmentA[i] <- means["treatmentA"] # fill in means_treatmentA column
results$means_treatmentB[i] <- means["treatmentB"] # fill in means_treatmentB column
} # end of for loop
head(results)
## replicate means_control means_treatmentA means_treatmentB
## 1 1 51.09730 62.48067 45.17589
## 2 2 34.20561 71.30688 53.24137
## 3 3 47.29543 67.91524 43.54318
## 4 4 63.63411 49.08960 46.03015
## 5 5 29.55995 54.71681 74.47710
## 6 6 54.16426 52.23951 52.35009
d. Use qplot() to create a histogram of the means for each reshuffled group. Or, if you want a challenge, use ggplot() to overlay all 3 histograms in the same figure. How do the distributions of reshuffled means compare to the original means?
# First, make the data frame longer
results_long <- results %>%
pivot_longer(cols = means_control:means_treatmentB,
names_to = "group",
values_to = "means")
head(results_long)
## # A tibble: 6 × 3
## replicate group means
## <int> <chr> <dbl>
## 1 1 means_control 51.1
## 2 1 means_treatmentA 62.5
## 3 1 means_treatmentB 45.2
## 4 2 means_control 34.2
## 5 2 means_treatmentA 71.3
## 6 2 means_treatmentB 53.2
# all on the same plot
ggplot(data = results_long,
aes(x = means, fill = group, color = group)) +
geom_histogram(position = "dodge", alpha = 0.2) +
theme(legend.position="top")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# faceted
ggplot(data = results_long,
aes(x = means)) +
geom_histogram(fill = "lightblue", col = "white", position = "dodge") +
theme(legend.position="top") +
facet_wrap(~group, nrow = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distributions of reshuffled means for the control, treatmentA, and treatmentB groups all look relatively similar, and they are all centered around a mean of 55. It makes sense that these distributions all look similar since the response variables were randomly shuffled in the randomization_test() function. Since the original means were control = 10, treatmentA = 60, and treatmentB = 85, the fact that these distributions are centered around 55 makes sense, since that is centered in the context of the original means.
5. For comparison, calculate in R the standard statistical analysis you would use with these data. How does the p-value compare for the standard test versus the p value you estimated from your randomization test? If the p values seem very different, run the program again with a different starting seed (and/or increase the number of replications in your randomization test). If there are persistent differences in the p value of the standard test versus your randomization, what do you think is responsible for this difference?
Here, I am going to recreate the histogram of simulated means, and add the observed means as vertical lines.
ggplot(data = results_long,
aes(x = means, fill = group, color = group)) +
geom_histogram(position = "dodge", alpha = 0.2) +
theme(legend.position="top") +
geom_vline(aes(xintercept = 10), col = "red") + # add control mean line
geom_vline(aes(xintercept = 60), col = "green") + # add treatmentA mean line
geom_vline(aes(xintercept = 85), col = "blue") # add treatmentB mean line
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at this histogram with the observed means added, we can see that the histogram lies between the control mean and treatmentB mean, and overlaps with the treatmentA mean. Within the histogram itself, the simulated means of the three groups overlap each other quite a lot, which makes sense because the values were shuffled for the randomization. The histograms are not covering the actual observed means for the control and treatmentB groups. This indicates that these observed means would rarely happen by chance if the null hypothesis was true (no statistically significant difference between the groups). Thus, the means for the three groups are likely significantly different from one another.