# Data Analysis and Manipulation

## Notes:

This page will go over much of the same content as the [DataFrames](https://app.gitbook.com/o/WTF4rLvg01AJV7u3apVQ/s/v4bMBkQ3ZBxkXjskIqBa/~/changes/53/r/dataframes) R page, but using tidyverse's dplyr and tidyr packages rather than base R. You may notice that pipes (%>%) are used more often here. Pipes are functionally the same as other elements like summary() or $, but tend to be the predominant syntax for more advanced uses of R, particularly in the tidyverse, as they can help chain multiple operations in the same line of code.

#### Loading tidyverse modules:

In order to use the tidyverse modules, they first have to be installed. Ensure that the following code is at the top of your coding environment:

```r
#Load tidyverse and required modulees
install.packages("tidyverse")
library(tidyverse)
library(dplyr)
library(tidyr)
```

#### Create DataFrame:

Input:

```r
#Create DataFrame
df <- tibble(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
  )
df
```

Output:

```r
#A tibble: 5 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     2 M         54
3     3 F         49
4     4 M         28
5     5 F         36
```

#### Describe DataFrame:

Input:

```r
#DataFrame size:
list(rows = nrow(df), columns = ncol(df))

#DataFrame column names
colnames(df)  

#DataFrame summary
df %>% summary()
```

Output:

```r
#DataFrame size:
$rows
[1] 5

$columns
[1] 3

#DataFrame column names
[1] "id"     "gender" "age" 

#DataFrame summary
       id       gender               age    
 Min.   :1   Length:5           Min.   :28  
 1st Qu.:2   Class :character   1st Qu.:36  
 Median :3   Mode  :character   Median :49  
 Mean   :3                      Mean   :47  
 3rd Qu.:4                      3rd Qu.:54  
 Max.   :5                      Max.   :68  
```

#### Accessing specific DataFrame subsets:

Input:

```r
# Get "age" column
df %>% select(age)

# Get row 2
df %>% slice(2)

# Get element in row 2, column 3
df %>% slice(2) %>% pull(3)

#Get subset (specific rows and all columns)
df %>% slice(c(1, 3, 5))

#Get subset (all rows and specific columns)
df %>% select(id, age)

#Get subset (all rows meeting specified criteria - numbers)
df %>% filter(age > 50)

#Get subset (all rows meeting specified criteria - strings)
df %>% filter(gender == "F")

#Get subset (all rows meeting specified criteria)
df %>% filter(gender == "F", between(age, 25, 50)
```

Output:

```r
#Get "age" column
#A tibble: 5 × 1
    age
  <dbl>
1    68
2    54
3    49
4    28
5    36

#Get row 2
#A tibble: 1 × 3
     id gender   age
  <int> <chr>  <dbl>
1     2 M         54

#Get element in row 2, column 3
[1] 54

#Get subset (specific rows and all columns)
# A tibble: 3 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     3 F         49
3     5 F         36

#Get subset (all rows and specific columns)
# A tibble: 5 × 2
     id   age
  <int> <dbl>
1     1    68
2     2    54
3     3    49
4     4    28
5     5    36

#Get subset (all rows meeting specified criteria - numbers)
#A tibble: 2 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     2 M         54

#Get subset (all rows meeting specified criteria - strings)
#A tibble: 3 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     3 F         49
3     5 F         36

#Get subset (all rows meeting specified criteria)
#A tibble: 2 × 3
     id gender   age
  <int> <chr>  <dbl>
1     3 F         49
2     5 F         36
```

#### Adding Columns:

Input:

```r
#New columns with specified values
df <- df %>%
  mutate(
    height = c(62, 60, 61, 63, 64),
    weight = c(100, 120, 150, 175, 300)
  )
df %>% summary()

New column with calculated value
df <- df %>%
  mutate(bmi = (weight / (height^2)) * 703)

#Describe DataFrame
df %>% summary()

#Get counts/frequency
df %>% count(gender)
```

Output:

```r
#New columns with specified values
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300   

#New column with calculated value
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300  
      bmi       
 Min.   :18.29  
 1st Qu.:23.43  
 Median :28.34  
 Mean   :30.51  
 3rd Qu.:31.00  
 Max.   :51.49  

#Get counts/frequency
#A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 F          3
2 M          2
```

#### Transform DataFrame:

Input:

```r
#Transform DataFrame
#Sort the dataframe by gender then age (reverse for age)
df_sorted <- df %>%
  arrange(gender, desc(age))
df_sorted

#Reshape from wide to long format
long_df <- long_df %>%
           mutate(value = as.character(value))
long_df

#Reshape from long to wide format based on "id"
wide_df <- long_df %>%
  pivot_wider(names_from = variable, values_from = value)
wide_df
```

Output:

```r
#Sort the dataframe by gender then age (reverse for age)
#A tibble: 5 × 6
     id gender   age height weight   bmi
  <int> <chr>  <dbl>  <dbl>  <dbl> <dbl>
1     1 F         68     62    100  18.3
2     3 F         49     61    150  28.3
3     5 F         36     64    300  51.5
4     2 M         54     60    120  23.4
5     4 M         28     63    175  31.0

#Reshape from wide to long format
       id variable            value
1.gender  1   gender                F
2.gender  2   gender                M
3.gender  3   gender                F
4.gender  4   gender                M
5.gender  5   gender                F
1.age     1      age               68
2.age     2      age               54
3.age     3      age               49
4.age     4      age               28
5.age     5      age               36
1.weight  1   weight              100
2.weight  2   weight              120
3.weight  3   weight              150
4.weight  4   weight              175
5.weight  5   weight              300
1.height  1   height               62
2.height  2   height               60
3.height  3   height               61
4.height  4   height               63
5.height  5   height               64
1.bmi     1      bmi 18.2882414151925
2.bmi     2      bmi 23.4333333333333
3.bmi     3      bmi 28.3391561408224
4.bmi     4      bmi 30.9964726631393
5.bmi     5      bmi    51.4892578125

#Reshape from long to wide format based on "id"
# A tibble: 5 × 6
     id gender age   weight height bmi             
  <int> <chr>  <chr> <chr>  <chr>  <chr>           
1     1 F      68    100    62     18.2882414151925
2     2 M      54    120    60     23.4333333333333
3     3 F      49    150    61     28.3391561408224
4     4 M      28    175    63     30.9964726631393
5     5 F      36    300    64     51.4892578125 
```

#### Traversing DataFrame (for loops):

Input:

```r
#Size of DataFrame
nrows <- nrow(df)
ncols <- ncol(df)

cat("(nrows, ncols) = ", nrows, ncols, "\n")

#Nested loop to traverse DataFrame
for (row in 1:nrows) {
  for (col in 1:ncols) {
    value <- df[row, col, drop = TRUE]
    cat("value for row", row, "and col", col, "is", value, "\n")
  }
}
```

Output:

```r
#Size of DataFrame
(nrows, ncols) =  5 6 

#Nested loop to traverse DataFrame
value for row 1 and col 1 is 1 
value for row 1 and col 2 is F 
value for row 1 and col 3 is 68 
value for row 1 and col 4 is 62 
value for row 1 and col 5 is 100 
value for row 1 and col 6 is 18.28824 
value for row 2 and col 1 is 2 
value for row 2 and col 2 is M 
value for row 2 and col 3 is 54 
value for row 2 and col 4 is 60 
value for row 2 and col 5 is 120 
value for row 2 and col 6 is 23.43333 
value for row 3 and col 1 is 3 
value for row 3 and col 2 is F 
value for row 3 and col 3 is 49 
value for row 3 and col 4 is 61 
value for row 3 and col 5 is 150 
value for row 3 and col 6 is 28.33916 
value for row 4 and col 1 is 4 
value for row 4 and col 2 is M 
value for row 4 and col 3 is 28 
value for row 4 and col 4 is 63 
value for row 4 and col 5 is 175 
value for row 4 and col 6 is 30.99647 
value for row 5 and col 1 is 5 
value for row 5 and col 2 is F 
value for row 5 and col 3 is 36 
value for row 5 and col 4 is 64 
value for row 5 and col 5 is 300 
value for row 5 and col 6 is 51.48926 
```
