DataFrames.jl is a Julia package that provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool. [1]
This page provides examples of using DataFrames.jl, demonstrating the syntax and common functions within the package.
Example
Install and Load DataFrames.jl Package
using Pkg# Add DataFrames packagePkg.add("DataFrames")# Load paackagesusing DataFrames
5×3 DataFrame Row │ id gender age │ Int64 String Int64─────┼──────────────────────1 │ 1 F 682 │ 2 M 543 │ 3 F 494 │ 4 M 285 │ 5 F 36
First two lines of dataframe:
Input:
println(first(df,2))
Output:
2×3 DataFrame Row │ id gender age │ Int64 String Int64─────┼──────────────────────1 │ 1 F 682 │ 2 M 54
Last two lines of dataframe:
Input:
println(last(df,2))
Output:
2×3 DataFrame Row │ id gender age │ Int64 String Int64─────┼──────────────────────1 │ 4 M 282 │ 5 F 36
Describe Dataframe
Dataframe size:
Input:
# dataframe sizeprintln(size(df))
Output:
(5,3)
Dataframe column names:
Input:
# dataframe column namesprintln(names(df))
Output:
["id","gender","age"]
Dataframe description:
Input:
# describe dataframeprintln(describe(df))
Output:
3×7 DataFrame Row │ variable mean min median max nmissing eltype │ Symbol Union… Any Union… Any Int64 DataType─────┼────────────────────────────────────────────────────────1 │ id 3.013.050 Int642 │ gender F M 0 String3 │ age 47.02849.0680 Int64
Accessing DataFrames
Get "age" column (different ways to call the column)
Input:
# call by column nameprintln(df[!, :age])# get column by column numberprintln(df[!,3])# alternate syntaxprintln(df.age)
Output:
[68,54,49,28,36][68,54,49,28,36][68,54,49,28,36]
Get row
Input:
# print row 2println(df[2, :])
Output:
DataFrameRow Row │ id gender age │ Int64 String Int64─────┼──────────────────────2 │ 2 M 54
Get element
Input:
# get element in row 2, column 3println(df[2,3])
Output:
54
Get subset (specific rows and all columns)
Input:
# print out rows 1, 3, & 5println(df[[1,3,5], :])
Output:
3×3 DataFrame Row │ id gender age │ Int64 String Int64─────┼──────────────────────1 │ 1 F 682 │ 3 F 493 │ 5 F 36
Get subset (all rows and specific columns)
Input:
# print out all rows and only columns 1 (id) and 3 (age)println("Using column names:\n")println(df[:, [:id, :age]])println()println("Using column numbers:\n")println(df[:, [1,3]])
Output:
Using column names:5×2 DataFrame Row │ id age │ Int64 Int64─────┼──────────────1 │ 1682 │ 2543 │ 3494 │ 4285 │ 536Using column numbers:5×2 DataFrame Row │ id age │ Int64 Int64─────┼──────────────1 │ 1682 │ 2543 │ 3494 │ 4285 │ 536
Get subset (all rows meeting specified criteria - numbers)
Input:
# print out all rows where age is greater than 50println(df[df.age .>50, :])
Output:
2×3 DataFrame Row │ id gender age │ Int64 String Int64─────┼──────────────────────1 │ 1 F 682 │ 2 M 54
Get subset (all rows meeting specified criteria - strings)
Input:
# print out all rows where gender is female ("F")println(df[df.gender .=="F", :])
Output:
3×3 DataFrame Row │ id gender age │ Int64 String Int64─────┼──────────────────────1 │ 1 F 682 │ 3 F 493 │ 5 F 36
Get subset (all rows meeting specified criteria)
Input:
# print out all rows where gender is female ("F") and age is between 25-50println(df[(df.gender .=="F") .& (25.< df.age .<50), :])
Output:
2×3 DataFrame Row │ id gender age │ Int64 String Int64─────┼──────────────────────1 │ 3 F 492 │ 5 F 36
Add Column
New columns with specified values
Input:
# add a column for weightdf.weight = [100,120,150,175,300]# add a column for heightdf.height = [62,60,61,63,64]println(df)println()println("Describe dataframe to see column names and summary:\n")println(describe(df))
Output:
5×5 DataFrame Row │ id gender age weight height │ Int64 String Int64 Int64 Int64─────┼──────────────────────────────────────1 │ 1 F 68100622 │ 2 M 54120603 │ 3 F 49150614 │ 4 M 28175635 │ 5 F 3630064Describe dataframe to see column names and summary:5×7 DataFrame Row │ variable mean min median max nmissing eltype │ Symbol Union… Any Union… Any Int64 DataType─────┼────────────────────────────────────────────────────────1 │ id 3.013.050 Int642 │ gender F M 0 String3 │ age 47.02849.0680 Int644 │ weight 169.0100150.03000 Int645 │ height 62.06062.0640 Int64
New column with calculated value
Input:
# add a column with calculated BMIdf.bmi =map((x,y) -> (x/y^2)*703, df.weight, df.height)println(df)println()println("Describe dataframe to see new bmi column and summary:\n")println(describe(df))
Output:
5×6 DataFrame Row │ id gender age weight height bmi │ Int64 String Int64 Int64 Int64 Float64─────┼───────────────────────────────────────────────1 │ 1 F 681006218.28822 │ 2 M 541206023.43333 │ 3 F 491506128.33924 │ 4 M 281756330.99655 │ 5 F 363006451.4893Describe dataframe to see new bmi column and summary:6×7 DataFrame Row │ variable mean min median max nmissing eltype │ Symbol Union… Any Union… Any Int64 DataType─────┼──────────────────────────────────────────────────────────────────1 │ id 3.013.050 Int642 │ gender F M 0 String3 │ age 47.02849.0680 Int644 │ weight 169.0100150.03000 Int645 │ height 62.06062.0640 Int646 │ bmi 30.509318.288228.339251.48930 Float64
Get counts/frequency
Input:
# get counts of males and females in the dataframeprintln(combine(groupby(df, :gender), nrow => :N))
Output:
2×2 DataFrame Row │ gender N │ String Int64 ─────┼───────────────1 │ F 32 │ M 2
Transform DataFrame
sort
Input:
# sort the dataframe by gender and then age in reverse order for age (oldest to youngest)println(sort(df, [:gender, :age], rev=(false,true)))
#unstack dataframe to get back to wide format based off "id" (unstack(df, :id, :variable, :value))wide_df =unstack(long_df, :id, :variable, :value)println(wide_df)
Output:
5×6 DataFrame Row │ id gender age weight height bmi │ Int64 Any Any Any Any Any─────┼─────────────────────────────────────────────1 │ 1 F 681006218.28822 │ 2 M 541206023.43333 │ 3 F 491506128.33924 │ 4 M 281756330.99655 │ 5 F 363006451.4893
Traversing DataFrame (for loops)
sort
Input:
# size of dataframe = size(df)# set number of rows to nrows and number of columns to ncolsprintln("(nrows, ncols) = $(size(df))")nrows, ncols =size(df)# use nested for loop to get information from dataframe by row and columnfor row in1:nrowsfor col in1:ncolsprintln("value for row $row and col $col is $(df[row,col])")endend
Output:
(nrows, ncols) = (5,6)value for row 1 and col 1 is 1value for row 1 and col 2 is Fvalue for row 1 and col 3 is 68value for row 1 and col 4 is 100value for row 1 and col 5 is 62value for row 1 and col 6 is 18.28824141519251value for row 2 and col 1 is 2value for row 2 and col 2 is Mvalue for row 2 and col 3 is 54value for row 2 and col 4 is 120value for row 2 and col 5 is 60value for row 2 and col 6 is 23.433333333333334value for row 3 and col 1 is 3value for row 3 and col 2 is Fvalue for row 3 and col 3 is 49value for row 3 and col 4 is 150value for row 3 and col 5 is 61value for row 3 and col 6 is 28.339156140822357value for row 4 and col 1 is 4value for row 4 and col 2 is Mvalue for row 4 and col 3 is 28value for row 4 and col 4 is 175value for row 4 and col 5 is 63value for row 4 and col 6 is 30.99647266313933value for row 5 and col 1 is 5value for row 5 and col 2 is Fvalue for row 5 and col 3 is 36value for row 5 and col 4 is 300value for row 5 and col 5 is 64value for row 5 and col 6 is 51.4892578125
Exercises
Analyzing Health Datasets with DataFrames in Julia - Forthcoming!