DataFrames

data.frame, data.table and the dplyr package provide a set of tools for working with tabular data in R. Their design and functionality are similar to those of DataFrames.jl (in Julia) and pandas (in Python), making them great general purpose data science tools.

This page provides examples of using data.frame, data.table, and dplyr, demonstrating the syntax and common functions within the tools.

Example

Installing data.frame, data.table, and dplyr in R.

The data.frame package comes preloaded into R, and the dplyr package is part of the tidyverse package (see Packages section for tidyverse installation instructions). To install data.table, use install.packages('data.table').

This example will take place using data.frame as it is does not require additional packages- see resources at the bottom of this page for additional information on data.table and dplyr.

Create DataFrame

#Create DataFrame
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)

Display DataFrame

Input:

#Display DataFrame
df

Output:

 id gender age
1  1      F  68
2  2      M  54
3  3      F  49
4  4      M  28
5  5      F  36

Print first two lines of DataFrame

Input:

#Print first two lines of DataFrame
head(df, 2)

Output:

  id gender age
1  1      F  68
2  2      M  54

Print last two lines of DataFrame

Input:

# Last two lines of DataFrame
tail(df, 2)

Output:

  id gender age
4  4      M  28
5  5      F  36 

Describe DataFrame

DataFrame size:

Input:

#DataFrame size
dim(df)

Output:

#First value represents number of rows, second value represents number of columns
[1] 5 3

DataFrame column names:

Input:

#DataFrame column names
colnames(df)

Output:

[1] "id"     "gender" "age" 

DataFrame description:

Input:

#Describe DataFrame
summary(df)

Output:

       id       gender               age    
 Min.   :1   Length:5           Min.   :28  
 1st Qu.:2   Class :character   1st Qu.:36  
 Median :3   Mode  :character   Median :49  
 Mean   :3                      Mean   :47  
 3rd Qu.:4                      3rd Qu.:54  
 Max.   :5                      Max.   :68

Accessing DataFrames

Get "age" column (different ways to call the column)

Input:

#Call by column name
df$age
df[["age"]]

#Get column by column number
df[[3]]

Output:

#Call by column name
[1] 68 54 49 28 36
[1] 68 54 49 28 36

#Get column by column number
[1] 68 54 49 28 36

Get row

Input:

#Print row 2
df[2, ]

Output:

  id gender age
2  2      M  54

Get element

Input:

#Get element in row 2, column 3
df[2,3]

Output:

54

Get subset (specific rows and all columns)

Input:

#Print out rows 1, 3, & 5
df[c(1, 3, 5), ]

Output:

  id gender age
1  1      F  68
3  3      F  49
5  5      F  36

Get subset (all rows and specific columns)

Input:

#Print out all rows and only columns 1 (id) and 3 (age)
#Using column names
df[, c("id", "age")]

#Using column numbers
df[, c(1, 3)]

Output:

#Using column names:
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36

#Using column numbers
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36

Get subset (all rows meeting specified criteria - numbers)

Input:

#Print all rows where age is greater than 50
df[df$age > 50, ]

Output:

  id gender age
1  1      F  68
2  2      M  54

Get subset (all rows meeting specified criteria - strings)

Input:

#Print all rows where gender is female ("F")
df[df$gender == "F", ]

Output:

  id gender age
1  1      F  68
3  3      F  49
5  5      F  36

Get subset (all rows meeting specified criteria)

Input:

#Print all rows where gender is female ("F") and age is between 25-50
df[df$gender == "F" & df$age > 25 & df$age < 50, ]

Output:

  id gender age
3  3      F  49
5  5      F  36

Add Column

New columns with specified values

Input:

#Add a column for height
df$height <- c(62, 60, 61, 63, 64)

#Add a column for weight
df$weight <- c(100, 120, 150, 175, 300)

#Print DataFrame to see changes
df

#Describe DataFrame to see column names and summary
summary(df)

Output:

  id gender age height weight
1  1      F  68     62    100
2  2      M  54     60    120
3  3      F  49     61    150
4  4      M  28     63    175
5  5      F  36     64    300

#Describe dataframe to see column names and summary:
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300 

New column with calculated value

Input:

# add a column with calculated BMI
df$bmi <- (df$weight / (df$height^2)) * 703

#Print DataFrame to see changes
df

#Describe DataFrame to see column names and summary
summary(df)

Output:

#Updated DataFrame
    id gender age height weight      bmi
1  1      F  68     62    100 18.28824
2  2      M  54     60    120 23.43333
3  3      F  49     61    150 28.33916
4  4      M  28     63    175 30.99647
5  5      F  36     64    300 51.48926

Describe dataframe to see new bmi column and summary:
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300  
      bmi       
 Min.   :18.29  
 1st Qu.:23.43  
 Median :28.34  
 Mean   :30.51  
 3rd Qu.:31.00  
 Max.   :51.49  

Get counts/frequency

Input:

#Get counts of males and females in the dataframe
gender_counts <- table(df$gender)
gender_counts

Output:

F M 
3 2 

Transform DataFrame

sort

Input:

#Sort the dataframe by gender then age, in reverse order for age (oldest to youngest)
df_sorted <- df[order(df$gender, -df$age), ]
df_sorted

Output:

  id gender age height weight      bmi
1  1      F  68     62    100 18.28824
3  3      F  49     61    150 28.33916
5  5      F  36     64    300 51.48926
2  2      M  54     60    120 23.43333
4  4      M  28     63    175 30.99647

stack (reshape from wide to long format)

Input:

#Reshape from wide to long format (exclude id column)
long_df <- reshape(df, varying = c("gender", "age", "weight", "height", "bmi"), 
                   v.names = "value", 
                   timevar = "variable", 
                   times = c("gender", "age", "weight", "height", "bmi"), 
                   direction = "long")
long_df

Output:

         id variable            value
1.gender  1   gender                F
2.gender  2   gender                M
3.gender  3   gender                F
4.gender  4   gender                M
5.gender  5   gender                F
1.age     1      age               68
2.age     2      age               54
3.age     3      age               49
4.age     4      age               28
5.age     5      age               36
1.weight  1   weight              100
2.weight  2   weight              120
3.weight  3   weight              150
4.weight  4   weight              175
5.weight  5   weight              300
1.height  1   height               62
2.height  2   height               60
3.height  3   height               61
4.height  4   height               63
5.height  5   height               64
1.bmi     1      bmi 18.2882414151925
2.bmi     2      bmi 23.4333333333333
3.bmi     3      bmi 28.3391561408224
4.bmi     4      bmi 30.9964726631393
5.bmi     5      bmi    51.4892578125

unstack (reshape from long to wide format)

Input:

#Unstack dataframe to return to wide format based off "id"
wide_df <- reshape(long_df, idvar = "id", timevar = "variable", direction = "wide")
wide_df

Output:

         id value.gender value.age value.weight value.height        value.bmi
1.gender  1            F        68          100           62 18.2882414151925
2.gender  2            M        54          120           60 23.4333333333333
3.gender  3            F        49          150           61 28.3391561408224
4.gender  4            M        28          175           63 30.9964726631393
5.gender  5            F        36          300           64    51.4892578125

Traversing DataFrame (for loops)

sort

Input:

#Size of dataframe = size(df)
#Set number of rows to nrows and number of columns to ncols
nrows <- nrow(df)
ncols <- ncol(df)

cat("(nrows, ncols) = ", nrows, ncols, "\n")

#Use nested for loop to get information from DataFrame by row and column
for (row in 1:nrows) {
  for (col in 1:ncols) {
    cat("value for row", row, "and col", col, "is", df[row, col], "\n")
  }
}

Output:

(nrows, ncols) =  5 6 
value for row 1 and col 1 is 1 
value for row 1 and col 2 is F 
value for row 1 and col 3 is 68 
value for row 1 and col 4 is 62 
value for row 1 and col 5 is 100 
value for row 1 and col 6 is 18.28824 
value for row 2 and col 1 is 2 
value for row 2 and col 2 is M 
value for row 2 and col 3 is 54 
value for row 2 and col 4 is 60 
value for row 2 and col 5 is 120 
value for row 2 and col 6 is 23.43333 
value for row 3 and col 1 is 3 
value for row 3 and col 2 is F 
value for row 3 and col 3 is 49 
value for row 3 and col 4 is 61 
value for row 3 and col 5 is 150 
value for row 3 and col 6 is 28.33916 
value for row 4 and col 1 is 4 
value for row 4 and col 2 is M 
value for row 4 and col 3 is 28 
value for row 4 and col 4 is 63 
value for row 4 and col 5 is 175 
value for row 4 and col 6 is 30.99647 
value for row 5 and col 1 is 5 
value for row 5 and col 2 is F 
value for row 5 and col 3 is 36 
value for row 5 and col 4 is 64 
value for row 5 and col 5 is 300 
value for row 5 and col 6 is 51.48926 

Notes:

When performing functions such as sorting or transformation, using a package like data.table or dplyr will typically be easier than using base R (data.table), as those packages include commands designed for DataFrame manipulation. This guide uses base R for the sake of continuity.

Resources

Last updated