data.frame, data.table and the dplyr package provide a set of tools for working with tabular data in R. Their design and functionality are similar to those of DataFrames.jl (in Julia) and pandas (in Python), making them great general purpose data science tools.
This page provides examples of using data.frame, data.table, and dplyr, demonstrating the syntax and common functions within the tools.
Example
Installing data.frame, data.table, and dplyr in R.
data.frame is built into base R, so nothing needs to be installed to use it. The dplyr package is part of the tidyverse (see the Packages section for tidyverse installation instructions). To install data.table, use install.packages('data.table').
This example uses data.frame because it does not require additional packages. See the resources at the bottom of this page for more information on data.table and dplyr.
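As a rough sketch of the installation step, the snippet below checks which of the two add-on packages are already available before installing anything. The helper name `missing_packages` is hypothetical (it is not part of R), and the actual `install.packages()` call is left commented out so the snippet has no side effects.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Which of the packages discussed on this page still need installing?
to_install <- missing_packages(c("data.table", "dplyr"))
to_install

# Uncomment to install whatever is missing (requires internet access)
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids reinstalling packages on every run of a script.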
Create DataFrame
#Create DataFrame
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)
Display DataFrame
Input:
#Display DataFrame
df
Output:
  id gender age
1  1      F  68
2  2      M  54
3  3      F  49
4  4      M  28
5  5      F  36
Print first two lines of DataFrame
Input:
#Print first two lines of DataFrame
head(df, 2)
Output:
  id gender age
1  1      F  68
2  2      M  54
Print last two lines of DataFrame
Input:
# Last two lines of DataFrame
tail(df, 2)
Output:
  id gender age
4  4      M  28
5  5      F  36
Describe DataFrame
DataFrame size:
Input:
#DataFrame size
dim(df)
Output:
#First value represents number of rows, second value represents number of columns
[1] 5 3
DataFrame column names:
Input:
#DataFrame column names
colnames(df)
Output:
[1] "id"     "gender" "age"
DataFrame description:
Input:
#Describe DataFrame
summary(df)
Output:
       id       gender               age
 Min.   :1   Length:5           Min.   :28
 1st Qu.:2   Class :character   1st Qu.:36
 Median :3   Mode  :character   Median :49
 Mean   :3                      Mean   :47
 3rd Qu.:4                      3rd Qu.:54
 Max.   :5                      Max.   :68
Accessing DataFrames
Get "age" column (different ways to call the column)
Input:
#Call by column name
df$age
df[["age"]]
#Get column by column number
df[[3]]
Output:
#Call by column name
[1] 68 54 49 28 36
[1] 68 54 49 28 36
#Get column by column number
[1] 68 54 49 28 36
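The column accessors above all return a plain vector. As a small illustration of a related point that often trips people up, single brackets (`df["age"]`) keep the result as a one-column data.frame instead. The variable names `age_vector` and `age_frame` are just for this sketch.

```r
# Same example data as above (id, gender, and age only)
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)

age_vector <- df[["age"]]  # double brackets (or df$age): numeric vector
age_frame  <- df["age"]    # single brackets: one-column data.frame

class(age_vector)
class(age_frame)

# Functions that expect a vector, such as mean(), need the vector form
mean(age_vector)
```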
Get row
Input:
#Print row 2
df[2, ]
Output:
  id gender age
2  2      M  54
Get element
Input:
#Get element in row 2, column 3
df[2, 3]
Output:
[1] 54
Get subset (specific rows and all columns)
Input:
#Print out rows 1, 3, & 5
df[c(1, 3, 5), ]
Output:
  id gender age
1  1      F  68
3  3      F  49
5  5      F  36
Get subset (all rows and specific columns)
Input:
#Print out all rows and only columns 1 (id) and 3 (age)
#Using column names
df[, c("id", "age")]
#Using column numbers
df[, c(1, 3)]
Output:
#Using column names:
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36
#Using column numbers
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36
Get subset (all rows meeting specified criteria - numbers)
Input:
#Print all rows where age is greater than 50
df[df$age > 50, ]
Output:
  id gender age
1  1      F  68
2  2      M  54
Get subset (all rows meeting specified criteria - strings)
Input:
#Print all rows where gender is female ("F")
df[df$gender == "F", ]
Output:
  id gender age
1  1      F  68
3  3      F  49
5  5      F  36
Get subset (all rows meeting specified criteria)
Input:
#Print all rows where gender is female ("F") and age is between 25 and 50
df[df$gender == "F" & df$age > 25 & df$age < 50, ]
Output:
  id gender age
3  3      F  49
5  5      F  36
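The bracket-based filters above can also be written with base R's `subset()`, which lets the condition refer to columns by bare name. This is only an alternative spelling of the same filter, sketched here on the same example data. The variable names `women_25_50` and `over_50` are illustrative.

```r
# Same example data as above
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)

# Rows where gender is "F" and age is strictly between 25 and 50
women_25_50 <- subset(df, gender == "F" & age > 25 & age < 50)
women_25_50

# The optional `select` argument keeps only the named columns
over_50 <- subset(df, age > 50, select = c(id, age))
over_50
```

One behavioral difference worth knowing: `subset()` drops rows where the condition evaluates to NA, whereas logical bracket indexing returns rows of NA for them.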
Add Column
New columns with specified values
Input:
#Add a column for height
df$height <- c(62, 60, 61, 63, 64)
#Add a column for weight
df$weight <- c(100, 120, 150, 175, 300)
#Print DataFrame to see changes
df
#Describe DataFrame to see column names and summary
summary(df)
Output:
  id gender age height weight
1  1      F  68     62    100
2  2      M  54     60    120
3  3      F  49     61    150
4  4      M  28     63    175
5  5      F  36     64    300
#Describe dataframe to see column names and summary:
       id       gender               age         height       weight
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120
 Median :3   Mode  :character   Median :49   Median :62   Median :150
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300
New column with calculated value
Input:
# add a column with calculated BMI
df$bmi <- (df$weight / (df$height^2)) * 703
#Print DataFrame to see changes
df
#Describe DataFrame to see column names and summary
summary(df)
Output:
#Updated DataFrame
  id gender age height weight      bmi
1  1      F  68     62    100 18.28824
2  2      M  54     60    120 23.43333
3  3      F  49     61    150 28.33916
4  4      M  28     63    175 30.99647
5  5      F  36     64    300 51.48926
#Describe dataframe to see new bmi column and summary:
       id       gender               age         height       weight
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120
 Median :3   Mode  :character   Median :49   Median :62   Median :150
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300
      bmi
 Min.   :18.29
 1st Qu.:23.43
 Median :28.34
 Mean   :30.51
 3rd Qu.:31.00
 Max.   :51.49
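As a quick check of the arithmetic, the same calculated column can be produced with base R's `transform()`, which evaluates the expression using the column names directly. The page does not state units. The factor 703 is the usual conversion when weight is in pounds and height is in inches, so that is assumed here.

```r
# Height and weight columns from the example above
# (units assumed: inches and pounds, which is what the 703 factor implies)
df <- data.frame(
  id = 1:5,
  height = c(62, 60, 61, 63, 64),
  weight = c(100, 120, 150, 175, 300)
)

# transform() adds the new column without repeating df$ for every variable
df <- transform(df, bmi = weight / height^2 * 703)

# Row 1 by hand: 100 / 62^2 * 703 = 70300 / 3844, roughly 18.29
round(df$bmi, 2)
```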
Get counts/frequency
Input:
#Get counts of males and females in the dataframe
gender_counts <- table(df$gender)
gender_counts
Output:
F M
3 2
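Counts are often followed by proportions. A minimal sketch using base R's `prop.table()` on the same gender values (the vector is recreated here so the snippet runs on its own):

```r
# Gender values from the example DataFrame
gender <- c("F", "M", "F", "M", "F")

# Counts per category, as above
gender_counts <- table(gender)

# Convert the counts to proportions that sum to 1
gender_props <- prop.table(gender_counts)
gender_props

# Individual entries can be looked up by category name
gender_counts["F"]
```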
Transform DataFrame
sort
Input:
#Sort the dataframe by gender then age, in reverse order for age (oldest to youngest)
df_sorted <- df[order(df$gender, -df$age), ]
df_sorted
Output:
  id gender age height weight      bmi
1  1      F  68     62    100 18.28824
3  3      F  49     61    150 28.33916
5  5      F  36     64    300 51.48926
2  2      M  54     60    120 23.43333
4  4      M  28     63    175 30.99647
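The minus sign in `order(df$gender, -df$age)` only works for numeric columns. Negating a character column such as gender raises an error. One way to reverse a character key, sketched below, is to pass a `decreasing` vector with one entry per sort key, which `order()` accepts when `method = "radix"` is used.

```r
# Same example data as above (id, gender, and age only)
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)

# gender in descending order ("M" before "F"), age ascending within each gender
idx <- order(df$gender, df$age,
             decreasing = c(TRUE, FALSE),
             method = "radix")
df_sorted <- df[idx, ]
df_sorted
```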
stack (reshape from wide to long format)
Input:
#Reshape from wide to long format (exclude id column)
long_df <- reshape(df,
                   varying = c("gender", "age", "weight", "height", "bmi"),
                   v.names = "value",
                   timevar = "variable",
                   times = c("gender", "age", "weight", "height", "bmi"),
                   direction = "long")
long_df
#Unstack dataframe to return to wide format based off "id"
wide_df <- reshape(long_df, idvar = "id", timevar = "variable", direction = "wide")
wide_df
Output:
         id value.gender value.age value.weight value.height        value.bmi
1.gender  1            F        68          100           62 18.2882414151925
2.gender  2            M        54          120           60 23.4333333333333
3.gender  3            F        49          150           61 28.3391561408224
4.gender  4            M        28          175           63 30.9964726631393
5.gender  5            F        36          300           64  51.4892578125
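In the output above the numbers appear to have been converted to text: stacking the character column gender into the same value column as the numeric columns forces everything to character. A sketch of one way around this, assuming only the numeric measurements need to be stacked, is to leave gender out of `varying`. The names `measures`, `long_num`, and `wide_num` are illustrative.

```r
# Example data with the numeric measurement columns only
df <- data.frame(
  id = 1:5,
  age = c(68, 54, 49, 28, 36),
  height = c(62, 60, 61, 63, 64),
  weight = c(100, 120, 150, 175, 300)
)

measures <- c("age", "height", "weight")

# Wide to long: one row per (id, variable) pair, value stays numeric
long_num <- reshape(df,
                    varying = measures,
                    v.names = "value",
                    timevar = "variable",
                    times = measures,
                    idvar = "id",
                    direction = "long")
str(long_num)

# Long back to wide: one value.<variable> column per measurement
wide_num <- reshape(long_num, idvar = "id", timevar = "variable", direction = "wide")
wide_num
```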
Traversing DataFrame (for loops)
nested for loop
Input:
#Size of DataFrame: dim(df) returns (rows, columns)
#Set number of rows to nrows and number of columns to ncols
nrows <- nrow(df)
ncols <- ncol(df)
cat("(nrows, ncols) = ", nrows, ncols, "\n")
#Use nested for loop to get information from DataFrame by row and column
for (row in 1:nrows) {
  for (col in 1:ncols) {
    cat("value for row", row, "and col", col, "is", df[row, col], "\n")
  }
}
Output:
(nrows, ncols) =  5 6
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 62
value for row 1 and col 5 is 100
value for row 1 and col 6 is 18.28824
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 60
value for row 2 and col 5 is 120
value for row 2 and col 6 is 23.43333
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 61
value for row 3 and col 5 is 150
value for row 3 and col 6 is 28.33916
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 63
value for row 4 and col 5 is 175
value for row 4 and col 6 is 30.99647
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 64
value for row 5 and col 5 is 300
value for row 5 and col 6 is 51.48926
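Two small variations on the loop above are worth sketching. First, `seq_len(nrow(df))` is a safer loop range than `1:nrow(df)`, because for an empty DataFrame `1:0` counts down through 1 and 0 instead of skipping the loop. Second, per-column work usually does not need an explicit loop at all. Here `vapply()` computes each column's mean. The DataFrame below is a numeric-only subset of the example data.

```r
# Numeric columns from the example data
df <- data.frame(
  id = 1:5,
  age = c(68, 54, 49, 28, 36),
  height = c(62, 60, 61, 63, 64)
)

# seq_len() yields an empty range when there are no rows, so the loop body is skipped
for (row in seq_len(nrow(df))) {
  cat("row", row, "has age", df$age[row], "\n")
}

# Column-wise summary without a loop: one mean per column
col_means <- vapply(df, mean, numeric(1))
col_means
```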
Notes:
When sorting or transforming data, a package such as data.table or dplyr will typically be easier to use than base R (data.frame), as those packages include commands designed for DataFrame manipulation. This guide uses base R for the sake of continuity.
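To give a feel for the difference, here is a hedged side-by-side sketch: the same filter-then-sort task in base R and, if dplyr happens to be installed, in dplyr. The dplyr part is guarded with `requireNamespace()` so the snippet still runs without the package.

```r
# Same example data as above (id, gender, and age only)
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)

# Base R: filter rows, then sort by age (oldest first)
base_result <- df[df$gender == "F" & df$age > 25 & df$age < 50, ]
base_result <- base_result[order(-base_result$age), ]
base_result

# dplyr equivalent, run only when the package is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_result <- dplyr::arrange(
    dplyr::filter(df, gender == "F", age > 25, age < 50),
    dplyr::desc(age)
  )
  print(dplyr_result)
}
```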