DataFrames.jl is a Julia package that provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of (in Python) and data.frame, and (in R), making it a great general purpose data science tool.
This page provides examples of using DataFrames.jl, demonstrating the syntax and common functions within the package.
Example
Install and Load DataFrames.jl Package
using Pkg
# Add DataFrames package
Pkg.add("DataFrames")
# Load paackages
using DataFrames
5×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 2 M 54
3 │ 3 F 49
4 │ 4 M 28
5 │ 5 F 36
First two lines of dataframe:
Input:
println(first(df, 2))
Output:
2×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 2 M 54
Last two lines of dataframe:
Input:
println(last(df, 2))
Output:
2×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 4 M 28
2 │ 5 F 36
Describe Dataframe
Dataframe size:
Input:
# dataframe size
println(size(df))
Output:
(5, 3)
Dataframe column names:
Input:
# dataframe column names
println(names(df))
Output:
["id", "gender", "age"]
Dataframe description:
Input:
# describe dataframe
println(describe(df))
Output:
3×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼────────────────────────────────────────────────────────
1 │ id 3.0 1 3.0 5 0 Int64
2 │ gender F M 0 String
3 │ age 47.0 28 49.0 68 0 Int64
Accessing DataFrames
Get "age" column (different ways to call the column)
Input:
# call by column name
println(df[!, :age])
# get column by column number
println(df[!, 3])
# alternate syntax
println(df.age)
DataFrameRow
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
2 │ 2 M 54
Get element
Input:
# get element in row 2, column 3
println(df[2,3])
Output:
54
Get subset (specific rows and all columns)
Input:
# print out rows 1, 3, & 5
println(df[[1,3,5], :])
Output:
3×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 3 F 49
3 │ 5 F 36
Get subset (all rows and specific columns)
Input:
# print out all rows and only columns 1 (id) and 3 (age)
println("Using column names:\n")
println(df[:, [:id, :age]])
println()
println("Using column numbers:\n")
println(df[:, [1,3]])
Get subset (all rows meeting specified criteria - numbers)
Input:
# print out all rows where age is greater than 50
println(df[df.age .> 50, :])
Output:
2×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 2 M 54
Get subset (all rows meeting specified criteria - strings)
Input:
# print out all rows where gender is female ("F")
println(df[df.gender .== "F", :])
Output:
3×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 1 F 68
2 │ 3 F 49
3 │ 5 F 36
Get subset (all rows meeting specified criteria)
Input:
# print out all rows where gender is female ("F") and age is between 25-50
println(df[(df.gender .== "F") .& (25 .< df.age .< 50), :])
Output:
2×3 DataFrame
Row │ id gender age
│ Int64 String Int64
─────┼──────────────────────
1 │ 3 F 49
2 │ 5 F 36
Add Column
New columns with specified values
Input:
# add a column for weight
df.weight = [100, 120, 150, 175, 300]
# add a column for height
df.height = [62, 60, 61, 63, 64]
println(df)
println()
println("Describe dataframe to see column names and summary:\n")
println(describe(df))
Output:
5×5 DataFrame
Row │ id gender age weight height
│ Int64 String Int64 Int64 Int64
─────┼──────────────────────────────────────
1 │ 1 F 68 100 62
2 │ 2 M 54 120 60
3 │ 3 F 49 150 61
4 │ 4 M 28 175 63
5 │ 5 F 36 300 64
Describe dataframe to see column names and summary:
5×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼────────────────────────────────────────────────────────
1 │ id 3.0 1 3.0 5 0 Int64
2 │ gender F M 0 String
3 │ age 47.0 28 49.0 68 0 Int64
4 │ weight 169.0 100 150.0 300 0 Int64
5 │ height 62.0 60 62.0 64 0 Int64
New column with calculated value
Input:
# add a column with calculated BMI
df.bmi = map((x,y) -> (x/y^2)*703, df.weight, df.height)
println(df)
println()
println("Describe dataframe to see new bmi column and summary:\n")
println(describe(df))
Output:
5×6 DataFrame
Row │ id gender age weight height bmi
│ Int64 String Int64 Int64 Int64 Float64
─────┼───────────────────────────────────────────────
1 │ 1 F 68 100 62 18.2882
2 │ 2 M 54 120 60 23.4333
3 │ 3 F 49 150 61 28.3392
4 │ 4 M 28 175 63 30.9965
5 │ 5 F 36 300 64 51.4893
Describe dataframe to see new bmi column and summary:
6×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼──────────────────────────────────────────────────────────────────
1 │ id 3.0 1 3.0 5 0 Int64
2 │ gender F M 0 String
3 │ age 47.0 28 49.0 68 0 Int64
4 │ weight 169.0 100 150.0 300 0 Int64
5 │ height 62.0 60 62.0 64 0 Int64
6 │ bmi 30.5093 18.2882 28.3392 51.4893 0 Float64
Get counts/frequency
Input:
# get counts of males and females in the dataframe
println(combine(groupby(df, :gender), nrow => :N))
Output:
2×2 DataFrame
Row │ gender N
│ String Int64
─────┼───────────────
1 │ F 3
2 │ M 2
Transform DataFrame
sort
Input:
# sort the dataframe by gender and then age in reverse order for age (oldest to youngest)
println(sort(df, [:gender, :age], rev=(false, true)))
Output:
TypeError: in keyword argument rev, expected Union{Bool, AbstractArray{Bool,1}}, got Tuple{Bool,Bool}
stack (reshape from wide to long format)
Input:
# Reshape from wide to long format (disclude id to see which column and value matches which patient id)
long_df = stack(df, Not(:id))
println(long_df)
#unstack dataframe to get back to wide format based off "id" (unstack(df, :id, :variable, :value))
wide_df = unstack(long_df, :id, :variable, :value)
println(wide_df)
Output:
5×6 DataFrame
Row │ id gender age weight height bmi
│ Int64 Any Any Any Any Any
─────┼─────────────────────────────────────────────
1 │ 1 F 68 100 62 18.2882
2 │ 2 M 54 120 60 23.4333
3 │ 3 F 49 150 61 28.3392
4 │ 4 M 28 175 63 30.9965
5 │ 5 F 36 300 64 51.4893
Traversing DataFrame (for loops)
sort
Input:
# size of dataframe = size(df)
# set number of rows to nrows and number of columns to ncols
println("(nrows, ncols) = $(size(df))")
nrows, ncols = size(df)
# use nested for loop to get information from dataframe by row and column
for row in 1:nrows
for col in 1:ncols
println("value for row $row and col $col is $(df[row,col])")
end
end
Output:
(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125
Exercises
Analyzing Health Datasets with DataFrames in Julia - Forthcoming!
References
Resources
JuliaData Contributors. (n.d.). DataFrames.jl - JuliaData. Retrieved May 1, 2024, from