Data Analysis and Manipulation
Notes:
This page will go over much of the same content as the DataFrames R page, but using tidyverse's dplyr and tidyr packages rather than base R. You may notice that pipes (%>%) are used more often here. Pipes are functionally the same as other elements like summary() or $, but tend to be the predominant syntax for more advanced uses of R, particularly in the tidyverse, as they can help chain multiple operations in the same line of code.
Loading tidyverse modules:
In order to use the tidyverse modules, they first have to be installed. Ensure that the following code is at the top of your coding environment:
#Load tidyverse and required modulees
install.packages("tidyverse")
library(tidyverse)
library(dplyr)
library(tidyr)
Create DataFrame:
Input:
#Create DataFrame
df <- tibble(
id = 1:5,
gender = c("F", "M", "F", "M", "F"),
age = c(68, 54, 49, 28, 36)
)
df
Output:
#A tibble: 5 × 3
id gender age
<int> <chr> <dbl>
1 1 F 68
2 2 M 54
3 3 F 49
4 4 M 28
5 5 F 36
Describe DataFrame:
Input:
#DataFrame size:
list(rows = nrow(df), columns = ncol(df))
#DataFrame column names
colnames(df)
#DataFrame summary
df %>% summary()
Output:
#DataFrame size:
$rows
[1] 5
$columns
[1] 3
#DataFrame column names
[1] "id" "gender" "age"
#DataFrame summary
id gender age
Min. :1 Length:5 Min. :28
1st Qu.:2 Class :character 1st Qu.:36
Median :3 Mode :character Median :49
Mean :3 Mean :47
3rd Qu.:4 3rd Qu.:54
Max. :5 Max. :68
Accessing specific DataFrame subsets:
Input:
# Get "age" column
df %>% select(age)
# Get row 2
df %>% slice(2)
# Get element in row 2, column 3
df %>% slice(2) %>% pull(3)
#Get subset (specific rows and all columns)
df %>% slice(c(1, 3, 5))
#Get subset (all rows and specific columns)
df %>% select(id, age)
#Get subset (all rows meeting specified criteria - numbers)
df %>% filter(age > 50)
#Get subset (all rows meeting specified criteria - strings)
df %>% filter(gender == "F")
#Get subset (all rows meeting specified criteria)
df %>% filter(gender == "F", between(age, 25, 50)
Output:
#Get "age" column
#A tibble: 5 × 1
age
<dbl>
1 68
2 54
3 49
4 28
5 36
#Get row 2
#A tibble: 1 × 3
id gender age
<int> <chr> <dbl>
1 2 M 54
#Get element in row 2, column 3
[1] 54
#Get subset (specific rows and all columns)
# A tibble: 3 × 3
id gender age
<int> <chr> <dbl>
1 1 F 68
2 3 F 49
3 5 F 36
#Get subset (all rows and specific columns)
# A tibble: 5 × 2
id age
<int> <dbl>
1 1 68
2 2 54
3 3 49
4 4 28
5 5 36
#Get subset (all rows meeting specified criteria - numbers)
#A tibble: 2 × 3
id gender age
<int> <chr> <dbl>
1 1 F 68
2 2 M 54
#Get subset (all rows meeting specified criteria - strings)
#A tibble: 3 × 3
id gender age
<int> <chr> <dbl>
1 1 F 68
2 3 F 49
3 5 F 36
#Get subset (all rows meeting specified criteria)
#A tibble: 2 × 3
id gender age
<int> <chr> <dbl>
1 3 F 49
2 5 F 36
Adding Columns:
Input:
#New columns with specified values
df <- df %>%
mutate(
height = c(62, 60, 61, 63, 64),
weight = c(100, 120, 150, 175, 300)
)
df %>% summary()
New column with calculated value
df <- df %>%
mutate(bmi = (weight / (height^2)) * 703)
#Describe DataFrame
df %>% summary()
#Get counts/frequency
df %>% count(gender)
Output:
#New columns with specified values
id gender age height weight
Min. :1 Length:5 Min. :28 Min. :60 Min. :100
1st Qu.:2 Class :character 1st Qu.:36 1st Qu.:61 1st Qu.:120
Median :3 Mode :character Median :49 Median :62 Median :150
Mean :3 Mean :47 Mean :62 Mean :169
3rd Qu.:4 3rd Qu.:54 3rd Qu.:63 3rd Qu.:175
Max. :5 Max. :68 Max. :64 Max. :300
#New column with calculated value
id gender age height weight
Min. :1 Length:5 Min. :28 Min. :60 Min. :100
1st Qu.:2 Class :character 1st Qu.:36 1st Qu.:61 1st Qu.:120
Median :3 Mode :character Median :49 Median :62 Median :150
Mean :3 Mean :47 Mean :62 Mean :169
3rd Qu.:4 3rd Qu.:54 3rd Qu.:63 3rd Qu.:175
Max. :5 Max. :68 Max. :64 Max. :300
bmi
Min. :18.29
1st Qu.:23.43
Median :28.34
Mean :30.51
3rd Qu.:31.00
Max. :51.49
#Get counts/frequency
#A tibble: 2 × 2
gender n
<chr> <int>
1 F 3
2 M 2
Transform DataFrame:
Input:
#Transform DataFrame
#Sort the dataframe by gender then age (reverse for age)
df_sorted <- df %>%
arrange(gender, desc(age))
df_sorted
#Reshape from wide to long format
long_df <- long_df %>%
mutate(value = as.character(value))
long_df
#Reshape from long to wide format based on "id"
wide_df <- long_df %>%
pivot_wider(names_from = variable, values_from = value)
wide_df
Output:
#Sort the dataframe by gender then age (reverse for age)
#A tibble: 5 × 6
id gender age height weight bmi
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 F 68 62 100 18.3
2 3 F 49 61 150 28.3
3 5 F 36 64 300 51.5
4 2 M 54 60 120 23.4
5 4 M 28 63 175 31.0
#Reshape from wide to long format
id variable value
1.gender 1 gender F
2.gender 2 gender M
3.gender 3 gender F
4.gender 4 gender M
5.gender 5 gender F
1.age 1 age 68
2.age 2 age 54
3.age 3 age 49
4.age 4 age 28
5.age 5 age 36
1.weight 1 weight 100
2.weight 2 weight 120
3.weight 3 weight 150
4.weight 4 weight 175
5.weight 5 weight 300
1.height 1 height 62
2.height 2 height 60
3.height 3 height 61
4.height 4 height 63
5.height 5 height 64
1.bmi 1 bmi 18.2882414151925
2.bmi 2 bmi 23.4333333333333
3.bmi 3 bmi 28.3391561408224
4.bmi 4 bmi 30.9964726631393
5.bmi 5 bmi 51.4892578125
#Reshape from long to wide format based on "id"
# A tibble: 5 × 6
id gender age weight height bmi
<int> <chr> <chr> <chr> <chr> <chr>
1 1 F 68 100 62 18.2882414151925
2 2 M 54 120 60 23.4333333333333
3 3 F 49 150 61 28.3391561408224
4 4 M 28 175 63 30.9964726631393
5 5 F 36 300 64 51.4892578125
Traversing DataFrame (for loops):
Input:
#Size of DataFrame
nrows <- nrow(df)
ncols <- ncol(df)
cat("(nrows, ncols) = ", nrows, ncols, "\n")
#Nested loop to traverse DataFrame
for (row in 1:nrows) {
for (col in 1:ncols) {
value <- df[row, col, drop = TRUE]
cat("value for row", row, "and col", col, "is", value, "\n")
}
}
Output:
#Size of DataFrame
(nrows, ncols) = 5 6
#Nested loop to traverse DataFrame
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 62
value for row 1 and col 5 is 100
value for row 1 and col 6 is 18.28824
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 60
value for row 2 and col 5 is 120
value for row 2 and col 6 is 23.43333
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 61
value for row 3 and col 5 is 150
value for row 3 and col 6 is 28.33916
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 63
value for row 4 and col 5 is 175
value for row 4 and col 6 is 30.99647
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 64
value for row 5 and col 5 is 300
value for row 5 and col 6 is 51.48926
Last updated