data.frame, data.table and the dplyr package provide a set of tools for working with tabular data in R. Their design and functionality are similar to those of DataFrames.jl (in Julia) and pandas (in Python), making them great general purpose data science tools.
This page provides examples of using data.frame, data.table, and dplyr, demonstrating the syntax and common functions within the tools.
Example
Installing data.frame, data.table, and dplyr in R.
The data.frame package comes preloaded into R, and the dplyr package is part of the tidyverse package (see Packages section for tidyverse installation instructions). To install data.table, use install.packages('data.table').
This example will take place using data.frame as it is does not require additional packages- see resources at the bottom of this page for additional information on data.table and dplyr.
id gender age
1 1 F 68
2 2 M 54
3 3 F 49
4 4 M 28
5 5 F 36
Print first two lines of DataFrame
Input:
#Print first two lines of DataFrame
head(df, 2)
Output:
id gender age
1 1 F 68
2 2 M 54
Print last two lines of DataFrame
Input:
# Last two lines of DataFrame
tail(df, 2)
Output:
id gender age
4 4 M 28
5 5 F 36
Describe DataFrame
DataFrame size:
Input:
#DataFrame size
dim(df)
Output:
#First value represents number of rows, second value represents number of columns
[1] 5 3
DataFrame column names:
Input:
#DataFrame column names
colnames(df)
Output:
[1] "id" "gender" "age"
DataFrame description:
Input:
#Describe DataFrame
summary(df)
Output:
id gender age
Min. :1 Length:5 Min. :28
1st Qu.:2 Class :character 1st Qu.:36
Median :3 Mode :character Median :49
Mean :3 Mean :47
3rd Qu.:4 3rd Qu.:54
Max. :5 Max. :68
Accessing DataFrames
Get "age" column (different ways to call the column)
Input:
#Call by column name
df$age
df[["age"]]
#Get column by column number
df[[3]]
Output:
#Call by column name
[1] 68 54 49 28 36
[1] 68 54 49 28 36
#Get column by column number
[1] 68 54 49 28 36
Get row
Input:
#Print row 2
df[2, ]
Output:
id gender age
2 2 M 54
Get element
Input:
#Get element in row 2, column 3
df[2,3]
Output:
54
Get subset (specific rows and all columns)
Input:
#Print out rows 1, 3, & 5
df[c(1, 3, 5), ]
Output:
id gender age
1 1 F 68
3 3 F 49
5 5 F 36
Get subset (all rows and specific columns)
Input:
#Print out all rows and only columns 1 (id) and 3 (age)
#Using column names
df[, c("id", "age")]
#Using column numbers
df[, c(1, 3)]
Output:
#Using column names:
id age
1 1 68
2 2 54
3 3 49
4 4 28
5 5 36
#Using column numbers
id age
1 1 68
2 2 54
3 3 49
4 4 28
5 5 36
Get subset (all rows meeting specified criteria - numbers)
Input:
#Print all rows where age is greater than 50
df[df$age > 50, ]
Output:
id gender age
1 1 F 68
2 2 M 54
Get subset (all rows meeting specified criteria - strings)
Input:
#Print all rows where gender is female ("F")
df[df$gender == "F", ]
Output:
id gender age
1 1 F 68
3 3 F 49
5 5 F 36
Get subset (all rows meeting specified criteria)
Input:
#Print all rows where gender is female ("F") and age is between 25-50
df[df$gender == "F" & df$age > 25 & df$age < 50, ]
Output:
id gender age
3 3 F 49
5 5 F 36
Add Column
New columns with specified values
Input:
#Add a column for height
df$height <- c(62, 60, 61, 63, 64)
#Add a column for weight
df$weight <- c(100, 120, 150, 175, 300)
#Print DataFrame to see changes
df
#Describe DataFrame to see column names and summary
summary(df)
Output:
id gender age height weight
1 1 F 68 62 100
2 2 M 54 60 120
3 3 F 49 61 150
4 4 M 28 63 175
5 5 F 36 64 300
#Describe dataframe to see column names and summary:
id gender age height weight
Min. :1 Length:5 Min. :28 Min. :60 Min. :100
1st Qu.:2 Class :character 1st Qu.:36 1st Qu.:61 1st Qu.:120
Median :3 Mode :character Median :49 Median :62 Median :150
Mean :3 Mean :47 Mean :62 Mean :169
3rd Qu.:4 3rd Qu.:54 3rd Qu.:63 3rd Qu.:175
Max. :5 Max. :68 Max. :64 Max. :300
New column with calculated value
Input:
# add a column with calculated BMI
df$bmi <- (df$weight / (df$height^2)) * 703
#Print DataFrame to see changes
df
#Describe DataFrame to see column names and summary
summary(df)
Output:
#Updated DataFrame
id gender age height weight bmi
1 1 F 68 62 100 18.28824
2 2 M 54 60 120 23.43333
3 3 F 49 61 150 28.33916
4 4 M 28 63 175 30.99647
5 5 F 36 64 300 51.48926
Describe dataframe to see new bmi column and summary:
id gender age height weight
Min. :1 Length:5 Min. :28 Min. :60 Min. :100
1st Qu.:2 Class :character 1st Qu.:36 1st Qu.:61 1st Qu.:120
Median :3 Mode :character Median :49 Median :62 Median :150
Mean :3 Mean :47 Mean :62 Mean :169
3rd Qu.:4 3rd Qu.:54 3rd Qu.:63 3rd Qu.:175
Max. :5 Max. :68 Max. :64 Max. :300
bmi
Min. :18.29
1st Qu.:23.43
Median :28.34
Mean :30.51
3rd Qu.:31.00
Max. :51.49
Get counts/frequency
Input:
#Get counts of males and females in the dataframe
gender_counts <- table(df$gender)
gender_counts
Output:
F M
3 2
Transform DataFrame
sort
Input:
#Sort the dataframe by gender then age, in reverse order for age (oldest to youngest)
df_sorted <- df[order(df$gender, -df$age), ]
df_sorted
Output:
id gender age height weight bmi
1 1 F 68 62 100 18.28824
3 3 F 49 61 150 28.33916
5 5 F 36 64 300 51.48926
2 2 M 54 60 120 23.43333
4 4 M 28 63 175 30.99647
stack (reshape from wide to long format)
Input:
#Reshape from wide to long format (exclude id column)
long_df <- reshape(df, varying = c("gender", "age", "weight", "height", "bmi"),
v.names = "value",
timevar = "variable",
times = c("gender", "age", "weight", "height", "bmi"),
direction = "long")
long_df
Output:
id variable value
1.gender 1 gender F
2.gender 2 gender M
3.gender 3 gender F
4.gender 4 gender M
5.gender 5 gender F
1.age 1 age 68
2.age 2 age 54
3.age 3 age 49
4.age 4 age 28
5.age 5 age 36
1.weight 1 weight 100
2.weight 2 weight 120
3.weight 3 weight 150
4.weight 4 weight 175
5.weight 5 weight 300
1.height 1 height 62
2.height 2 height 60
3.height 3 height 61
4.height 4 height 63
5.height 5 height 64
1.bmi 1 bmi 18.2882414151925
2.bmi 2 bmi 23.4333333333333
3.bmi 3 bmi 28.3391561408224
4.bmi 4 bmi 30.9964726631393
5.bmi 5 bmi 51.4892578125
unstack (reshape from long to wide format)
Input:
#Unstack dataframe to return to wide format based off "id"
wide_df <- reshape(long_df, idvar = "id", timevar = "variable", direction = "wide")
wide_df
Output:
id value.gender value.age value.weight value.height value.bmi
1.gender 1 F 68 100 62 18.2882414151925
2.gender 2 M 54 120 60 23.4333333333333
3.gender 3 F 49 150 61 28.3391561408224
4.gender 4 M 28 175 63 30.9964726631393
5.gender 5 F 36 300 64 51.4892578125
Traversing DataFrame (for loops)
sort
Input:
#Size of dataframe = size(df)
#Set number of rows to nrows and number of columns to ncols
nrows <- nrow(df)
ncols <- ncol(df)
cat("(nrows, ncols) = ", nrows, ncols, "\n")
#Use nested for loop to get information from DataFrame by row and column
for (row in 1:nrows) {
for (col in 1:ncols) {
cat("value for row", row, "and col", col, "is", df[row, col], "\n")
}
}
Output:
(nrows, ncols) = 5 6
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 62
value for row 1 and col 5 is 100
value for row 1 and col 6 is 18.28824
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 60
value for row 2 and col 5 is 120
value for row 2 and col 6 is 23.43333
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 61
value for row 3 and col 5 is 150
value for row 3 and col 6 is 28.33916
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 63
value for row 4 and col 5 is 175
value for row 4 and col 6 is 30.99647
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 64
value for row 5 and col 5 is 300
value for row 5 and col 6 is 51.48926
Notes:
When performing functions such as sorting or transformation, using a package like data.table or dplyr will typically be easier than using base R (data.table), as those packages include commands designed for DataFrame manipulation. This guide uses base R for the sake of continuity.