1. Examine the structure of the iris dataset. How many observations and variables are in the dataset?
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(iris)
## [1] 150 5
The iris dataset contains 150 observations of 5 variables. The variables are all numeric except for Species, which is of type “factor”.
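The same counts can be read off one at a time. The following is a minimal base-R sketch; the names `n_obs`, `n_vars`, and `col_classes` are illustrative and not part of the exercise.

```r
data(iris)

# Number of observations (rows) and variables (columns), taken separately
n_obs  <- nrow(iris)
n_vars <- ncol(iris)

# Class of each column: four numeric measurements and one factor
col_classes <- sapply(iris, class)
print(col_classes)
```

`str()` reports all of this at once; `nrow()`, `ncol()`, and `sapply()` are handy when the values are needed later in a script.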
2. Create a new data frame iris1 that contains only the species virginica and versicolor with sepal lengths greater than 6 cm and sepal widths greater than 2.5 cm. How many observations and variables are in the dataset?
iris1 <- filter(iris, Species %in% c("virginica", "versicolor") &
Sepal.Length > 6 &
Sepal.Width > 2.5)
dim(iris1)
## [1] 56 5
The iris1 dataset contains 56 observations of 5 variables.
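As a side note, `filter()` also accepts several conditions separated by commas and combines them with a logical AND. The sketch below assumes only that dplyr is installed; `iris1_alt` is a hypothetical name used for the comparison.

```r
library(dplyr)
data(iris)

# Conditions joined explicitly with `&`, as in the answer above
iris1 <- filter(iris, Species %in% c("virginica", "versicolor") &
                  Sepal.Length > 6 &
                  Sepal.Width > 2.5)

# The same conditions passed as separate arguments; filter() ANDs them together
iris1_alt <- filter(iris,
                    Species %in% c("virginica", "versicolor"),
                    Sepal.Length > 6,
                    Sepal.Width > 2.5)

# Both approaches should select exactly the same rows
print(isTRUE(all.equal(iris1, iris1_alt)))
```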
3. Now, create an iris2 data frame from iris1 that contains only the columns for Species, Sepal.Length, and Sepal.Width. How many observations and variables are in the dataset?
iris2 <- select(iris1, Species, Sepal.Length, Sepal.Width)
dim(iris2)
## [1] 56 3
The iris2 dataset contains 56 observations of 3 variables.
4. Create an iris3 data frame from iris2 that orders the observations from largest to smallest sepal length. Show the first 6 rows of this dataset.
iris3 <- arrange(iris2, desc(Sepal.Length))
head(iris3)
## Species Sepal.Length Sepal.Width
## 1 virginica 7.9 3.8
## 2 virginica 7.7 3.8
## 3 virginica 7.7 2.6
## 4 virginica 7.7 2.8
## 5 virginica 7.7 3.0
## 6 virginica 7.6 3.0
5. Create an iris4 data frame from iris3 that adds a column with the sepal area (length * width) for each observation. How many observations and variables are in the dataset?
iris4 <- mutate(iris3, Sepal.Area = Sepal.Length*Sepal.Width)
dim(iris4)
## [1] 56 4
The iris4 dataset contains 56 observations of 4 variables.
6. Create iris5 that calculates the average sepal length, the average sepal width, and the sample size of the entire iris4 data frame and print iris5.
iris5 <- summarize(iris4,
Avg.Sepal.Length = mean(Sepal.Length),
Avg.Sepal.Width = mean(Sepal.Width),
Sample.Size = n())
print(iris5)
## Avg.Sepal.Length Avg.Sepal.Width Sample.Size
## 1 6.698214 3.041071 56
7. Finally, create iris6 that calculates the average sepal length, the average sepal width, and the sample size for each species in the iris4 data frame and print iris6.
irisSpecies <- group_by(iris4, Species)
head(irisSpecies)
## # A tibble: 6 × 4
## # Groups: Species [1]
## Species Sepal.Length Sepal.Width Sepal.Area
## <fct> <dbl> <dbl> <dbl>
## 1 virginica 7.9 3.8 30.0
## 2 virginica 7.7 3.8 29.3
## 3 virginica 7.7 2.6 20.0
## 4 virginica 7.7 2.8 21.6
## 5 virginica 7.7 3 23.1
## 6 virginica 7.6 3 22.8
iris6 <- summarize(irisSpecies,
Avg.Sepal.Length = mean(Sepal.Length),
Avg.Sepal.Width = mean(Sepal.Width),
Sample.Size = n())
print(iris6)
## # A tibble: 2 × 4
## Species Avg.Sepal.Length Avg.Sepal.Width Sample.Size
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
8. In these exercises, you have successively modified different versions of the data frame (iris1, iris2, iris3, iris4, iris5, iris6). At each stage, the output data frame from one operation serves as the input for the next. A more efficient way to do this is to use the pipe operator %>%, which comes from the magrittr package and is re-exported by dplyr and tidyr, so it is available once the tidyverse is loaded. See if you can rework all of your previous statements (except for iris5) into an extended piping operation that uses iris as the input and generates irisFinal as the output.
irisFinal <- iris %>%
filter(Species %in% c("virginica", "versicolor")
& Sepal.Length > 6
& Sepal.Width > 2.5) %>%
select(Species, Sepal.Length, Sepal.Width) %>%
arrange(desc(Sepal.Length)) %>%
mutate(Sepal.Area = Sepal.Length*Sepal.Width) %>%
group_by(Species) %>%
summarize(Avg.Sepal.Length = mean(Sepal.Length),
Avg.Sepal.Width = mean(Sepal.Width),
Sample.Size = n())
print(irisFinal)
## # A tibble: 2 × 4
## Species Avg.Sepal.Length Avg.Sepal.Width Sample.Size
## <fct> <dbl> <dbl> <int>
## 1 versicolor 6.48 2.99 17
## 2 virginica 6.79 3.06 39
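Because the final step only computes per-species means and counts, the row order and the Sepal.Area column should not affect the result. The sketch below checks this by comparing the full pipeline with a shortened one; `irisShort` is a hypothetical name, and only dplyr is assumed to be installed.

```r
library(dplyr)
data(iris)

# Full pipeline, following the steps of exercises 2 through 7
irisFinal <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5) %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length)) %>%
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%
  group_by(Species) %>%
  summarize(Avg.Sepal.Length = mean(Sepal.Length),
            Avg.Sepal.Width = mean(Sepal.Width),
            Sample.Size = n())

# Shortened pipeline: the select, arrange, and mutate steps are left out
irisShort <- iris %>%
  filter(Species %in% c("virginica", "versicolor"),
         Sepal.Length > 6,
         Sepal.Width > 2.5) %>%
  group_by(Species) %>%
  summarize(Avg.Sepal.Length = mean(Sepal.Length),
            Avg.Sepal.Width = mean(Sepal.Width),
            Sample.Size = n())

# Compare the two summaries as plain data frames
print(isTRUE(all.equal(as.data.frame(irisFinal), as.data.frame(irisShort))))
```

The exercise asks for every step to be carried through the pipe, so the full version is the intended answer; the shortened one only illustrates which steps influence the summary.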
9. Create a ‘longer’ data frame using the original iris data set with three columns named “Species”, “Measure”, and “Value”. The column “Species” will retain the species names of the data set. The column “Measure” will indicate whether the value corresponds to Sepal.Length, Sepal.Width, Petal.Length, or Petal.Width, and the column “Value” will hold the numerical values of those measurements.
df_long <- iris %>%
pivot_longer(cols = Sepal.Length:Petal.Width,
names_to = "Measure",
values_to = "Value")
head(df_long)
## # A tibble: 6 × 3
## Species Measure Value
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
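A quick sanity check on the reshaped data: each of the 150 original rows contributes one row per measurement, so the long data frame should have 150 * 4 = 600 rows. The sketch below assumes dplyr and tidyr are installed; `measure_counts` is an illustrative name.

```r
library(dplyr)
library(tidyr)
data(iris)

df_long <- iris %>%
  pivot_longer(cols = Sepal.Length:Petal.Width,
               names_to = "Measure",
               values_to = "Value")

# 150 observations x 4 measurements = 600 rows; Species, Measure, Value = 3 columns
print(dim(df_long))

# Each measurement name should appear once per original observation
measure_counts <- count(df_long, Measure)
print(measure_counts)
```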