# DataFrames

<mark style="color:$primary;">`data.frame`</mark><mark style="color:$primary;">,</mark> [`data.table`](https://rdatatable.gitlab.io/data.table/) <mark style="color:$primary;">and the</mark> [dplyr](https://dplyr.tidyverse.org/) <mark style="color:$primary;">package provide a set of tools for working with tabular data in R. Their design and functionality are similar to those of DataFrames.jl (in Julia) and</mark> [pandas](https://pandas.pydata.org/) <mark style="color:$primary;">(in Python), making them great general purpose data science tools.</mark>

This page provides examples of using data.frame, data.table, and dplyr, demonstrating the syntax and common functions within the tools.

## Example

### **Installing data.frame, data.table, and dplyr in R.**

The data.frame package comes preloaded into R, and the dplyr package is part of the tidyverse package (see [Packages](/codiac-for-health/computing/r/packages.md) section for tidyverse installation instructions). To install data.table, use install.packages('data.table').

This example will take place using data.frame as it is does not require additional packages- see resources at the bottom of this page for additional information on data.table and dplyr.

### **Create DataFrame**

```r
#Create DataFrame
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)
```

### **Display DataFrame**

Input:

```r
#Display DataFrame
df
```

Output:

```r
 id gender age
1  1      F  68
2  2      M  54
3  3      F  49
4  4      M  28
5  5      F  36
```

Print first two lines of DataFrame

Input:

```r
#Print first two lines of DataFrame
head(df, 2)
```

Output:

```r
  id gender age
1  1      F  68
2  2      M  54
```

Print last two lines of DataFrame

Input:

```r
# Last two lines of DataFrame
tail(df, 2)
```

Output:

```r
  id gender age
4  4      M  28
5  5      F  36 
```

### **Describe DataFrame**

DataFrame size:

Input:

```r
#DataFrame size
dim(df)
```

Output:

```r
#First value represents number of rows, second value represents number of columns
[1] 5 3
```

DataFrame column names:

Input:

```r
#DataFrame column names
colnames(df)
```

Output:

```r
[1] "id"     "gender" "age" 
```

DataFrame description:

Input:

```r
#Describe DataFrame
summary(df)
```

Output:

```r
       id       gender               age    
 Min.   :1   Length:5           Min.   :28  
 1st Qu.:2   Class :character   1st Qu.:36  
 Median :3   Mode  :character   Median :49  
 Mean   :3                      Mean   :47  
 3rd Qu.:4                      3rd Qu.:54  
 Max.   :5                      Max.   :68
```

### **Accessing DataFrames**

#### ***Get "age" column (different ways to call the column)***

Input:

```r
#Call by column name
df$age
df[["age"]]

#Get column by column number
df[[3]]
```

Output:

```r
#Call by column name
[1] 68 54 49 28 36
[1] 68 54 49 28 36

#Get column by column number
[1] 68 54 49 28 36
```

#### ***Get row***

Input:

```r
#Print row 2
df[2, ]
```

Output:

```r
  id gender age
2  2      M  54
```

#### ***Get element***

Input:

```r
#Get element in row 2, column 3
df[2,3]
```

Output:

```r
54
```

#### ***Get subset (specific rows and all columns)***

Input:

```r
#Print out rows 1, 3, & 5
df[c(1, 3, 5), ]
```

Output:

```r
  id gender age
1  1      F  68
3  3      F  49
5  5      F  36
```

#### ***Get subset (all rows and specific columns)***

Input:

```r
#Print out all rows and only columns 1 (id) and 3 (age)
#Using column names
df[, c("id", "age")]

#Using column numbers
df[, c(1, 3)]
```

Output:

```r
#Using column names:
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36

#Using column numbers
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36
```

#### ***Get subset (all rows meeting specified criteria - numbers)***

Input:

```r
#Print all rows where age is greater than 50
df[df$age > 50, ]
```

Output:

```r
  id gender age
1  1      F  68
2  2      M  54
```

#### ***Get subset (all rows meeting specified criteria - strings)***

Input:

```r
#Print all rows where gender is female ("F")
df[df$gender == "F", ]
```

Output:

```r
  id gender age
1  1      F  68
3  3      F  49
5  5      F  36
```

#### ***Get subset (all rows meeting specified criteria)***

Input:

```r
#Print all rows where gender is female ("F") and age is between 25-50
df[df$gender == "F" & df$age > 25 & df$age < 50, ]
```

Output:

```r
  id gender age
3  3      F  49
5  5      F  36
```

### **Add Column**

#### ***New columns with specified values***

Input:

```r
#Add a column for height
df$height <- c(62, 60, 61, 63, 64)

#Add a column for weight
df$weight <- c(100, 120, 150, 175, 300)

#Print DataFrame to see changes
df

#Describe DataFrame to see column names and summary
summary(df)
```

Output:

```r
  id gender age height weight
1  1      F  68     62    100
2  2      M  54     60    120
3  3      F  49     61    150
4  4      M  28     63    175
5  5      F  36     64    300

#Describe dataframe to see column names and summary:
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300 
```

#### ***New column with calculated value***

Input:

```r
# add a column with calculated BMI
df$bmi <- (df$weight / (df$height^2)) * 703

#Print DataFrame to see changes
df

#Describe DataFrame to see column names and summary
summary(df)
```

Output:

<pre class="language-r"><code class="lang-r"><strong>#Updated DataFrame
</strong><strong>    id gender age height weight      bmi
</strong>1  1      F  68     62    100 18.28824
2  2      M  54     60    120 23.43333
3  3      F  49     61    150 28.33916
4  4      M  28     63    175 30.99647
5  5      F  36     64    300 51.48926

Describe dataframe to see new bmi column and summary:
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300  
      bmi       
 Min.   :18.29  
 1st Qu.:23.43  
 Median :28.34  
 Mean   :30.51  
 3rd Qu.:31.00  
 Max.   :51.49  
</code></pre>

### ***Get counts/frequency***

Input:

```r
#Get counts of males and females in the dataframe
gender_counts <- table(df$gender)
gender_counts
```

Output:

```r
F M 
3 2 
```

### **Transform DataFrame**

#### ***sort***

Input:

```r
#Sort the dataframe by gender then age, in reverse order for age (oldest to youngest)
df_sorted <- df[order(df$gender, -df$age), ]
df_sorted
```

Output:

```r
  id gender age height weight      bmi
1  1      F  68     62    100 18.28824
3  3      F  49     61    150 28.33916
5  5      F  36     64    300 51.48926
2  2      M  54     60    120 23.43333
4  4      M  28     63    175 30.99647
```

#### ***stack (reshape from wide to long format)***

Input:

```r
#Reshape from wide to long format (exclude id column)
long_df <- reshape(df, varying = c("gender", "age", "weight", "height", "bmi"), 
                   v.names = "value", 
                   timevar = "variable", 
                   times = c("gender", "age", "weight", "height", "bmi"), 
                   direction = "long")
long_df
```

Output:

```r
         id variable            value
1.gender  1   gender                F
2.gender  2   gender                M
3.gender  3   gender                F
4.gender  4   gender                M
5.gender  5   gender                F
1.age     1      age               68
2.age     2      age               54
3.age     3      age               49
4.age     4      age               28
5.age     5      age               36
1.weight  1   weight              100
2.weight  2   weight              120
3.weight  3   weight              150
4.weight  4   weight              175
5.weight  5   weight              300
1.height  1   height               62
2.height  2   height               60
3.height  3   height               61
4.height  4   height               63
5.height  5   height               64
1.bmi     1      bmi 18.2882414151925
2.bmi     2      bmi 23.4333333333333
3.bmi     3      bmi 28.3391561408224
4.bmi     4      bmi 30.9964726631393
5.bmi     5      bmi    51.4892578125
```

#### ***unstack (reshape from long to wide format)***

Input:

```r
#Unstack dataframe to return to wide format based off "id"
wide_df <- reshape(long_df, idvar = "id", timevar = "variable", direction = "wide")
wide_df
```

Output:

```r
         id value.gender value.age value.weight value.height        value.bmi
1.gender  1            F        68          100           62 18.2882414151925
2.gender  2            M        54          120           60 23.4333333333333
3.gender  3            F        49          150           61 28.3391561408224
4.gender  4            M        28          175           63 30.9964726631393
5.gender  5            F        36          300           64    51.4892578125
```

### **Traversing DataFrame (for loops)**

#### ***sort***

Input:

```r
#Size of dataframe = size(df)
#Set number of rows to nrows and number of columns to ncols
nrows <- nrow(df)
ncols <- ncol(df)

cat("(nrows, ncols) = ", nrows, ncols, "\n")

#Use nested for loop to get information from DataFrame by row and column
for (row in 1:nrows) {
  for (col in 1:ncols) {
    cat("value for row", row, "and col", col, "is", df[row, col], "\n")
  }
}
```

Output:

```r
(nrows, ncols) =  5 6 
value for row 1 and col 1 is 1 
value for row 1 and col 2 is F 
value for row 1 and col 3 is 68 
value for row 1 and col 4 is 62 
value for row 1 and col 5 is 100 
value for row 1 and col 6 is 18.28824 
value for row 2 and col 1 is 2 
value for row 2 and col 2 is M 
value for row 2 and col 3 is 54 
value for row 2 and col 4 is 60 
value for row 2 and col 5 is 120 
value for row 2 and col 6 is 23.43333 
value for row 3 and col 1 is 3 
value for row 3 and col 2 is F 
value for row 3 and col 3 is 49 
value for row 3 and col 4 is 61 
value for row 3 and col 5 is 150 
value for row 3 and col 6 is 28.33916 
value for row 4 and col 1 is 4 
value for row 4 and col 2 is M 
value for row 4 and col 3 is 28 
value for row 4 and col 4 is 63 
value for row 4 and col 5 is 175 
value for row 4 and col 6 is 30.99647 
value for row 5 and col 1 is 5 
value for row 5 and col 2 is F 
value for row 5 and col 3 is 36 
value for row 5 and col 4 is 64 
value for row 5 and col 5 is 300 
value for row 5 and col 6 is 51.48926 
```

## Notes:

When performing functions such as sorting or transformation, using a package like data.table or dplyr will typically be easier than using base R (data.table), as those packages include commands designed for DataFrame manipulation. This guide uses base R for the sake of continuity.

## Resources

* R Documentation: [data.table](https://www.rdocumentation.org/packages/data.table/versions/1.16.2)
* Tidyverse: [dplyr](https://dplyr.tidyverse.org)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.bcbi.brown.edu/codiac-for-health/computing/r/dataframes.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
