DataFrames.jl is a Julia package that provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool. [1]
This page provides examples of using DataFrames.jl, demonstrating the syntax and common functions within the package.
Example
Install and Load DataFrames.jl Package
using Pkg
# Add DataFrames package
Pkg.add("DataFrames")
# Load paackages
using DataFrames
5×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 2 M 54
3 │ 3 F 49
4 │ 4 M 28
5 │ 5 F 36
First two lines of dataframe:
Input:
println(first(df, 2))
Output:
2×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 2 M 54
Last two lines of dataframe:
Input:
println(last(df, 2))
Output:
2×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 4 M 28
2 │ 5 F 36
Describe Dataframe
Dataframe size:
Input:
# dataframe size
println(size(df))
Output:
(5, 3)
Dataframe column names:
Input:
# dataframe column names
println(names(df))
Output:
["id", "gender", "age"]
Dataframe description:
Input:
# describe dataframe
println(describe(df))
Output:
3×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼────────────────────────────────────────────────────────
1 │ id 3.0 1 3.0 5 0 Int64
2 │ gender F M 0 String
3 │ age 47.0 28 49.0 68 0 Int64
Accessing DataFrames
Get "age" column (different ways to call the column)
Input:
# call by column name
println(df[!, :age])
# get column by column number
println(df[!, 3])
# alternate syntax
println(df.age)
DataFrameRow
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
2 │ 2 M 54
Get element
Input:
# get element in row 2, column 3
println(df[2,3])
Output:
54
Get subset (specific rows and all columns)
Input:
# print out rows 1, 3, & 5
println(df[[1,3,5], :])
Output:
3×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 3 F 49
3 │ 5 F 36
Get subset (all rows and specific columns)
Input:
# print out all rows and only columns 1 (id) and 3 (age)
println("Using column names:\n")
println(df[:, [:id, :age]])
println()
println("Using column numbers:\n")
println(df[:, [1,3]])
Get subset (all rows meeting specified criteria - numbers)
Input:
# print out all rows where age is greater than 50
println(df[df.age .> 50, :])
Output:
2×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 2 M 54
Get subset (all rows meeting specified criteria - strings)
Input:
# print out all rows where gender is female ("F")
println(df[df.gender .== "F", :])
Output:
3×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 3 F 49
3 │ 5 F 36
Get subset (all rows meeting specified criteria)
Input:
# print out all rows where gender is female ("F") and age is between 25-50
println(df[(df.gender .== "F") .& (25 .< df.age .< 50), :])
Output:
2×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 3 F 49
2 │ 5 F 36
Add Column
New columns with specified values
Input:
# add a column for weight
df.weight = [100, 120, 150, 175, 300]
# add a column for height
df.height = [62, 60, 61, 63, 64]
println(df)
println()
println("Describe dataframe to see column names and summary:\n")
println(describe(df))
Output:
5×5 DataFrame
Row │ id gender age weight height
│ Int64 String Int64 Int64 Int64
─────┼──────────────────────────────────────
1 │ 1 F 68 100 62
2 │ 2 M 54 120 60
3 │ 3 F 49 150 61
4 │ 4 M 28 175 63
5 │ 5 F 36 300 64
Describe dataframe to see column names and summary:
5×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼────────────────────────────────────────────────────────
1 │ id 3.0 1 3.0 5 0 Int64
2 │ gender F M 0 String
3 │ age 47.0 28 49.0 68 0 Int64
4 │ weight 169.0 100 150.0 300 0 Int64
5 │ height 62.0 60 62.0 64 0 Int64
New column with calculated value
Input:
# add a column with calculated BMI
df.bmi = map((x,y) -> (x/y^2)*703, df.weight, df.height)
println(df)
println()
println("Describe dataframe to see new bmi column and summary:\n")
println(describe(df))
Output:
5×6 DataFrame
Row │ id gender age weight height bmi
│ Int64 String Int64 Int64 Int64 Float64
─────┼───────────────────────────────────────────────
1 │ 1 F 68 100 62 18.2882
2 │ 2 M 54 120 60 23.4333
3 │ 3 F 49 150 61 28.3392
4 │ 4 M 28 175 63 30.9965
5 │ 5 F 36 300 64 51.4893
Describe dataframe to see new bmi column and summary:
6×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼──────────────────────────────────────────────────────────────────
1 │ id 3.0 1 3.0 5 0 Int64
2 │ gender F M 0 String
3 │ age 47.0 28 49.0 68 0 Int64
4 │ weight 169.0 100 150.0 300 0 Int64
5 │ height 62.0 60 62.0 64 0 Int64
6 │ bmi 30.5093 18.2882 28.3392 51.4893 0 Float64
Get counts/frequency
Input:
# get counts of males and females in the dataframe
println(combine(groupby(df, :gender), nrow => :N))
Output:
2×2 DataFrame
Row │ gender N
│ String Int64
─────┼───────────────
1 │ F 3
2 │ M 2
Transform DataFrame
sort
Input:
# sort the dataframe by gender and then age in reverse order for age (oldest to youngest)
println(sort(df, [:gender, :age], rev=(false, true)))
Output:
TypeError: in keyword argument rev, expected Union{Bool, AbstractArray{Bool,1}}, got Tuple{Bool,Bool}
stack (reshape from wide to long format)
Input:
# Reshape from wide to long format (disclude id to see which column and value matches which patient id)
long_df = stack(df, Not(:id))
println(long_df)
#unstack dataframe to get back to wide format based off "id" (unstack(df, :id, :variable, :value))
wide_df = unstack(long_df, :id, :variable, :value)
println(wide_df)
Output:
5×6 DataFrame
Row │ id gender age weight height bmi
│ Int64 Any Any Any Any Any
─────┼─────────────────────────────────────────────
1 │ 1 F 68 100 62 18.2882
2 │ 2 M 54 120 60 23.4333
3 │ 3 F 49 150 61 28.3392
4 │ 4 M 28 175 63 30.9965
5 │ 5 F 36 300 64 51.4893
Traversing DataFrame (for loops)
sort
Input:
# size of dataframe = size(df)
# set number of rows to nrows and number of columns to ncols
println("(nrows, ncols) = $(size(df))")
nrows, ncols = size(df)
# use nested for loop to get information from dataframe by row and column
for row in 1:nrows
for col in 1:ncols
println("value for row $row and col $col is $(df[row,col])")
end
end
Output:
(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125
Exercises
Analyzing Health Datasets with DataFrames in Julia - Forthcoming!