# DataFrames

> `data.frame`, [`data.table`](https://rdatatable.gitlab.io/data.table/) and the [dplyr](https://dplyr.tidyverse.org/) package provide a set of tools for working with tabular data in R. Their design and functionality are similar to those of DataFrames.jl (in Julia) and [pandas](https://pandas.pydata.org/) (in Python), making them great general purpose data science tools.

This page provides examples of using data.frame, data.table, and dplyr, demonstrating the syntax and common functions within the tools.

## Example

**Installing data.frame, data.table, and dplyr in R.**

The data.frame package comes preloaded into R, and the dplyr package is part of the tidyverse package (see [Packages](https://docs.bcbi.brown.edu/codiac-for-health/computing/r/packages) section for tidyverse installation instructions). To install data.table, use install.packages('data.table').

This example will take place using data.frame as it is does not require additional packages- see resources at the bottom of this page for additional information on data.table and dplyr.

**Create DataFrame**

```r
#Create DataFrame
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)
```

**Display DataFrame**

Input:

```r
#Display DataFrame
df
```

Output:

```r
 id gender age
1  1      F  68
2  2      M  54
3  3      F  49
4  4      M  28
5  5      F  36
```

Print first two lines of DataFrame

Input:

```r
#Print first two lines of DataFrame
head(df, 2)
```

Output:

```r
  id gender age
1  1      F  68
2  2      M  54
```

Print last two lines of DataFrame

Input:

```r
# Last two lines of DataFrame
tail(df, 2)
```

Output:

```r
  id gender age
4  4      M  28
5  5      F  36 
```

**Describe DataFrame**

DataFrame size:

Input:

```r
#DataFrame size
dim(df)
```

Output:

```r
#First value represents number of rows, second value represents number of columns
[1] 5 3
```

DataFrame column names:

Input:

```r
#DataFrame column names
colnames(df)
```

Output:

```r
[1] "id"     "gender" "age" 
```

DataFrame description:

Input:

```r
#Describe DataFrame
summary(df)
```

Output:

```r
       id       gender               age    
 Min.   :1   Length:5           Min.   :28  
 1st Qu.:2   Class :character   1st Qu.:36  
 Median :3   Mode  :character   Median :49  
 Mean   :3                      Mean   :47  
 3rd Qu.:4                      3rd Qu.:54  
 Max.   :5                      Max.   :68
```

**Accessing DataFrames**

***Get "age" column (different ways to call the column)***

Input:

```r
#Call by column name
df$age
df[["age"]]

#Get column by column number
df[[3]]
```

Output:

```r
#Call by column name
[1] 68 54 49 28 36
[1] 68 54 49 28 36

#Get column by column number
[1] 68 54 49 28 36
```

***Get row***

Input:

```r
#Print row 2
df[2, ]
```

Output:

```r
  id gender age
2  2      M  54
```

***Get element***

Input:

```r
#Get element in row 2, column 3
df[2,3]
```

Output:

```r
54
```

***Get subset (specific rows and all columns)***

Input:

```r
#Print out rows 1, 3, & 5
df[c(1, 3, 5), ]
```

Output:

```r
  id gender age
1  1      F  68
3  3      F  49
5  5      F  36
```

***Get subset (all rows and specific columns)***

Input:

```r
#Print out all rows and only columns 1 (id) and 3 (age)
#Using column names
df[, c("id", "age")]

#Using column numbers
df[, c(1, 3)]
```

Output:

```r
#Using column names:
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36

#Using column numbers
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36
```

***Get subset (all rows meeting specified criteria - numbers)***

Input:

```r
#Print all rows where age is greater than 50
df[df$age > 50, ]
```

Output:

```r
  id gender age
1  1      F  68
2  2      M  54
```

***Get subset (all rows meeting specified criteria - strings)***

Input:

```r
#Print all rows where gender is female ("F")
df[df$gender == "F", ]
```

Output:

```r
  id gender age
1  1      F  68
3  3      F  49
5  5      F  36
```

***Get subset (all rows meeting specified criteria)***

Input:

```r
#Print all rows where gender is female ("F") and age is between 25-50
df[df$gender == "F" & df$age > 25 & df$age < 50, ]
```

Output:

```r
  id gender age
3  3      F  49
5  5      F  36
```

**Add Column**

***New columns with specified values***

Input:

```r
#Add a column for height
df$height <- c(62, 60, 61, 63, 64)

#Add a column for weight
df$weight <- c(100, 120, 150, 175, 300)

#Print DataFrame to see changes
df

#Describe DataFrame to see column names and summary
summary(df)
```

Output:

```r
  id gender age height weight
1  1      F  68     62    100
2  2      M  54     60    120
3  3      F  49     61    150
4  4      M  28     63    175
5  5      F  36     64    300

#Describe dataframe to see column names and summary:
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300 
```

***New column with calculated value***

Input:

```r
# add a column with calculated BMI
df$bmi <- (df$weight / (df$height^2)) * 703

#Print DataFrame to see changes
df

#Describe DataFrame to see column names and summary
summary(df)
```

Output:

<pre class="language-r"><code class="lang-r"><strong>#Updated DataFrame
</strong><strong>    id gender age height weight      bmi
</strong>1  1      F  68     62    100 18.28824
2  2      M  54     60    120 23.43333
3  3      F  49     61    150 28.33916
4  4      M  28     63    175 30.99647
5  5      F  36     64    300 51.48926

Describe dataframe to see new bmi column and summary:
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300  
      bmi       
 Min.   :18.29  
 1st Qu.:23.43  
 Median :28.34  
 Mean   :30.51  
 3rd Qu.:31.00  
 Max.   :51.49  
</code></pre>

***Get counts/frequency***

Input:

```r
#Get counts of males and females in the dataframe
gender_counts <- table(df$gender)
gender_counts
```

Output:

```r
F M 
3 2 
```

**Transform DataFrame**

***sort***

Input:

```r
#Sort the dataframe by gender then age, in reverse order for age (oldest to youngest)
df_sorted <- df[order(df$gender, -df$age), ]
df_sorted
```

Output:

```r
  id gender age height weight      bmi
1  1      F  68     62    100 18.28824
3  3      F  49     61    150 28.33916
5  5      F  36     64    300 51.48926
2  2      M  54     60    120 23.43333
4  4      M  28     63    175 30.99647
```

***stack (reshape from wide to long format)***

Input:

```r
#Reshape from wide to long format (exclude id column)
long_df <- reshape(df, varying = c("gender", "age", "weight", "height", "bmi"), 
                   v.names = "value", 
                   timevar = "variable", 
                   times = c("gender", "age", "weight", "height", "bmi"), 
                   direction = "long")
long_df
```

Output:

```r
         id variable            value
1.gender  1   gender                F
2.gender  2   gender                M
3.gender  3   gender                F
4.gender  4   gender                M
5.gender  5   gender                F
1.age     1      age               68
2.age     2      age               54
3.age     3      age               49
4.age     4      age               28
5.age     5      age               36
1.weight  1   weight              100
2.weight  2   weight              120
3.weight  3   weight              150
4.weight  4   weight              175
5.weight  5   weight              300
1.height  1   height               62
2.height  2   height               60
3.height  3   height               61
4.height  4   height               63
5.height  5   height               64
1.bmi     1      bmi 18.2882414151925
2.bmi     2      bmi 23.4333333333333
3.bmi     3      bmi 28.3391561408224
4.bmi     4      bmi 30.9964726631393
5.bmi     5      bmi    51.4892578125
```

***unstack (reshape from long to wide format)***

Input:

```r
#Unstack dataframe to return to wide format based off "id"
wide_df <- reshape(long_df, idvar = "id", timevar = "variable", direction = "wide")
wide_df
```

Output:

```r
         id value.gender value.age value.weight value.height        value.bmi
1.gender  1            F        68          100           62 18.2882414151925
2.gender  2            M        54          120           60 23.4333333333333
3.gender  3            F        49          150           61 28.3391561408224
4.gender  4            M        28          175           63 30.9964726631393
5.gender  5            F        36          300           64    51.4892578125
```

**Traversing DataFrame (for loops)**

***sort***

Input:

```r
#Size of dataframe = size(df)
#Set number of rows to nrows and number of columns to ncols
nrows <- nrow(df)
ncols <- ncol(df)

cat("(nrows, ncols) = ", nrows, ncols, "\n")

#Use nested for loop to get information from DataFrame by row and column
for (row in 1:nrows) {
  for (col in 1:ncols) {
    cat("value for row", row, "and col", col, "is", df[row, col], "\n")
  }
}
```

Output:

```r
(nrows, ncols) =  5 6 
value for row 1 and col 1 is 1 
value for row 1 and col 2 is F 
value for row 1 and col 3 is 68 
value for row 1 and col 4 is 62 
value for row 1 and col 5 is 100 
value for row 1 and col 6 is 18.28824 
value for row 2 and col 1 is 2 
value for row 2 and col 2 is M 
value for row 2 and col 3 is 54 
value for row 2 and col 4 is 60 
value for row 2 and col 5 is 120 
value for row 2 and col 6 is 23.43333 
value for row 3 and col 1 is 3 
value for row 3 and col 2 is F 
value for row 3 and col 3 is 49 
value for row 3 and col 4 is 61 
value for row 3 and col 5 is 150 
value for row 3 and col 6 is 28.33916 
value for row 4 and col 1 is 4 
value for row 4 and col 2 is M 
value for row 4 and col 3 is 28 
value for row 4 and col 4 is 63 
value for row 4 and col 5 is 175 
value for row 4 and col 6 is 30.99647 
value for row 5 and col 1 is 5 
value for row 5 and col 2 is F 
value for row 5 and col 3 is 36 
value for row 5 and col 4 is 64 
value for row 5 and col 5 is 300 
value for row 5 and col 6 is 51.48926 
```

## Notes:

When performing functions such as sorting or transformation, using a package like data.table or dplyr will typically be easier than using base R (data.table), as those packages include commands designed for DataFrame manipulation. This guide uses base R for the sake of continuity.

## Resources

* R Documentation: [data.table](https://www.rdocumentation.org/packages/data.table/versions/1.16.2)
* Tidyverse: [dplyr](https://dplyr.tidyverse.org)
