Data Analysis and Manipulation

Notes:

This page will go over much of the same content as the DataFrames R page, but using tidyverse's dplyr and tidyr packages rather than base R. You may notice that pipes (%>%) are used more often here. Pipes are functionally the same as other elements like summary() or $, but tend to be the predominant syntax for more advanced uses of R, particularly in the tidyverse, as they can help chain multiple operations in the same line of code.

Loading tidyverse modules:

In order to use the tidyverse modules, they first have to be installed. Ensure that the following code is at the top of your coding environment:

#Load tidyverse and required modulees
install.packages("tidyverse")
library(tidyverse)
library(dplyr)
library(tidyr)

Create DataFrame:

Input:

#Create DataFrame
df <- tibble(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
  )
df

Output:

#A tibble: 5 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     2 M         54
3     3 F         49
4     4 M         28
5     5 F         36

Describe DataFrame:

Input:

#DataFrame size:
list(rows = nrow(df), columns = ncol(df))

#DataFrame column names
colnames(df)  

#DataFrame summary
df %>% summary()

Output:

#DataFrame size:
$rows
[1] 5

$columns
[1] 3

#DataFrame column names
[1] "id"     "gender" "age" 

#DataFrame summary
       id       gender               age    
 Min.   :1   Length:5           Min.   :28  
 1st Qu.:2   Class :character   1st Qu.:36  
 Median :3   Mode  :character   Median :49  
 Mean   :3                      Mean   :47  
 3rd Qu.:4                      3rd Qu.:54  
 Max.   :5                      Max.   :68

Accessing specific DataFrame subsets:

Input:

# Get "age" column
df %>% select(age)

# Get row 2
df %>% slice(2)

# Get element in row 2, column 3
df %>% slice(2) %>% pull(3)

#Get subset (specific rows and all columns)
df %>% slice(c(1, 3, 5))

#Get subset (all rows and specific columns)
df %>% select(id, age)

#Get subset (all rows meeting specified criteria - numbers)
df %>% filter(age > 50)

#Get subset (all rows meeting specified criteria - strings)
df %>% filter(gender == "F")

#Get subset (all rows meeting specified criteria)
df %>% filter(gender == "F", between(age, 25, 50)

Output:

#Get "age" column
#A tibble: 5 × 1
    age
  <dbl>
1    68
2    54
3    49
4    28
5    36

#Get row 2
#A tibble: 1 × 3
     id gender   age
  <int> <chr>  <dbl>
1     2 M         54

#Get element in row 2, column 3
[1] 54

#Get subset (specific rows and all columns)
# A tibble: 3 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     3 F         49
3     5 F         36

#Get subset (all rows and specific columns)
# A tibble: 5 × 2
     id   age
  <int> <dbl>
1     1    68
2     2    54
3     3    49
4     4    28
5     5    36

#Get subset (all rows meeting specified criteria - numbers)
#A tibble: 2 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     2 M         54

#Get subset (all rows meeting specified criteria - strings)
#A tibble: 3 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     3 F         49
3     5 F         36

#Get subset (all rows meeting specified criteria)
#A tibble: 2 × 3
     id gender   age
  <int> <chr>  <dbl>
1     3 F         49
2     5 F         36

Adding Columns:

Input:

#New columns with specified values
df <- df %>%
  mutate(
    height = c(62, 60, 61, 63, 64),
    weight = c(100, 120, 150, 175, 300)
  )
df %>% summary()

New column with calculated value
df <- df %>%
  mutate(bmi = (weight / (height^2)) * 703)

#Describe DataFrame
df %>% summary()

#Get counts/frequency
df %>% count(gender)

Output:

#New columns with specified values
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300   

#New column with calculated value
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300  
      bmi       
 Min.   :18.29  
 1st Qu.:23.43  
 Median :28.34  
 Mean   :30.51  
 3rd Qu.:31.00  
 Max.   :51.49  

#Get counts/frequency
#A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 F          3
2 M          2

Transform DataFrame:

Input:

#Transform DataFrame
#Sort the dataframe by gender then age (reverse for age)
df_sorted <- df %>%
  arrange(gender, desc(age))
df_sorted

#Reshape from wide to long format
long_df <- long_df %>%
           mutate(value = as.character(value))
long_df

#Reshape from long to wide format based on "id"
wide_df <- long_df %>%
  pivot_wider(names_from = variable, values_from = value)
wide_df

Output:

#Sort the dataframe by gender then age (reverse for age)
#A tibble: 5 × 6
     id gender   age height weight   bmi
  <int> <chr>  <dbl>  <dbl>  <dbl> <dbl>
1     1 F         68     62    100  18.3
2     3 F         49     61    150  28.3
3     5 F         36     64    300  51.5
4     2 M         54     60    120  23.4
5     4 M         28     63    175  31.0

#Reshape from wide to long format
       id variable            value
1.gender  1   gender                F
2.gender  2   gender                M
3.gender  3   gender                F
4.gender  4   gender                M
5.gender  5   gender                F
1.age     1      age               68
2.age     2      age               54
3.age     3      age               49
4.age     4      age               28
5.age     5      age               36
1.weight  1   weight              100
2.weight  2   weight              120
3.weight  3   weight              150
4.weight  4   weight              175
5.weight  5   weight              300
1.height  1   height               62
2.height  2   height               60
3.height  3   height               61
4.height  4   height               63
5.height  5   height               64
1.bmi     1      bmi 18.2882414151925
2.bmi     2      bmi 23.4333333333333
3.bmi     3      bmi 28.3391561408224
4.bmi     4      bmi 30.9964726631393
5.bmi     5      bmi    51.4892578125

#Reshape from long to wide format based on "id"
# A tibble: 5 × 6
     id gender age   weight height bmi             
  <int> <chr>  <chr> <chr>  <chr>  <chr>           
1     1 F      68    100    62     18.2882414151925
2     2 M      54    120    60     23.4333333333333
3     3 F      49    150    61     28.3391561408224
4     4 M      28    175    63     30.9964726631393
5     5 F      36    300    64     51.4892578125

Traversing DataFrame (for loops):

Input:

#Size of DataFrame
nrows <- nrow(df)
ncols <- ncol(df)

cat("(nrows, ncols) = ", nrows, ncols, "\n")

#Nested loop to traverse DataFrame
for (row in 1:nrows) {
  for (col in 1:ncols) {
    value <- df[row, col, drop = TRUE]
    cat("value for row", row, "and col", col, "is", value, "\n")
  }
}

Output:

#Size of DataFrame
(nrows, ncols) =  5 6 

#Nested loop to traverse DataFrame
value for row 1 and col 1 is 1 
value for row 1 and col 2 is F 
value for row 1 and col 3 is 68 
value for row 1 and col 4 is 62 
value for row 1 and col 5 is 100 
value for row 1 and col 6 is 18.28824 
value for row 2 and col 1 is 2 
value for row 2 and col 2 is M 
value for row 2 and col 3 is 54 
value for row 2 and col 4 is 60 
value for row 2 and col 5 is 120 
value for row 2 and col 6 is 23.43333 
value for row 3 and col 1 is 3 
value for row 3 and col 2 is F 
value for row 3 and col 3 is 49 
value for row 3 and col 4 is 61 
value for row 3 and col 5 is 150 
value for row 3 and col 6 is 28.33916 
value for row 4 and col 1 is 4 
value for row 4 and col 2 is M 
value for row 4 and col 3 is 28 
value for row 4 and col 4 is 63 
value for row 4 and col 5 is 175 
value for row 4 and col 6 is 30.99647 
value for row 5 and col 1 is 5 
value for row 5 and col 2 is F 
value for row 5 and col 3 is 36 
value for row 5 and col 4 is 64 
value for row 5 and col 5 is 300 
value for row 5 and col 6 is 51.48926

PreviousDataFrames

Last updated 8 months ago