DataFrames

DataFrames.jl is a Julia package that provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool. [1]

This page provides examples of using DataFrames.jl, demonstrating the syntax and common functions within the package.

Example

Install and Load DataFrames.jl Package

using Pkg

# Add DataFrames package
Pkg.add("DataFrames")

# Load paackages
using DataFrames

Create Dataframe

# Create dataframe
df = DataFrame(id = 1:5, gender = ["F", "M", "F", "M", "F"], age = [68, 54, 49, 28, 36])

Display Dataframe

Input:

# display dataframe
println(df)

Output:

5×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   11  F          68
   22  M          54
   33  F          49
   44  M          28
   55  F          36

First two lines of dataframe:

Input:

println(first(df, 2))

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   11  F          68
   22  M          54

Last two lines of dataframe:

Input:

println(last(df, 2))

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   14  M          28
   25  F          36

Describe Dataframe

Dataframe size:

Input:

# dataframe size
println(size(df))

Output:

(5, 3)

Dataframe column names:

Input:

# dataframe column names
println(names(df))

Output:

["id", "gender", "age"]

Dataframe description:

Input:

# describe dataframe
println(describe(df))

Output:

3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union  Any  Union  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64

Accessing DataFrames

Get "age" column (different ways to call the column)

Input:

# call by column name
println(df[!, :age])

# get column by column number
println(df[!, 3])

# alternate syntax
println(df.age)

Output:

[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]

Get row

Input:

# print row 2
println(df[2, :])

Output:

DataFrameRow
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   22  M          54

Get element

Input:

# get element in row 2, column 3
println(df[2,3])

Output:

54

Get subset (specific rows and all columns)

Input:

# print out rows 1, 3, & 5
println(df[[1,3,5], :])

Output:

3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   11  F          68
   23  F          49
   35  F          36

Get subset (all rows and specific columns)

Input:

# print out all rows and only columns 1 (id) and 3 (age)
println("Using column names:\n")
println(df[:, [:id, :age]])
println()

println("Using column numbers:\n")
println(df[:, [1,3]])

Output:

Using column names:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   11     68
   22     54
   33     49
   44     28
   55     36

Using column numbers:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   11     68
   22     54
   33     49
   44     28
   55     36

Get subset (all rows meeting specified criteria - numbers)

Input:

# print out all rows where age is greater than 50
println(df[df.age .> 50, :])

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   11  F          68
   22  M          54

Get subset (all rows meeting specified criteria - strings)

Input:

# print out all rows where gender is female ("F")
println(df[df.gender .== "F", :])

Output:

3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   11  F          68
   23  F          49
   35  F          36

Get subset (all rows meeting specified criteria)

Input:

# print out all rows where gender is female ("F") and age is between 25-50
println(df[(df.gender .== "F") .& (25 .< df.age .< 50), :])

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   13  F          49
   25  F          36

Add Column

New columns with specified values

Input:

# add a column for weight
df.weight = [100, 120, 150, 175, 300]

# add a column for height
df.height = [62, 60, 61, 63, 64]

println(df)
println()

println("Describe dataframe to see column names and summary:\n")
println(describe(df))

Output:

5×5 DataFrame
 Row │ id     gender  age    weight  height
     │ Int64  String  Int64  Int64   Int64
─────┼──────────────────────────────────────
   11  F          68     100      62
   22  M          54     120      60
   33  F          49     150      61
   44  M          28     175      63
   55  F          36     300      64

Describe dataframe to see column names and summary:

5×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union  Any  Union  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64
   4 │ weight    169.0   100  150.0   300         0  Int64
   5 │ height    62.0    60   62.0    64          0  Int64

New column with calculated value

Input:

# add a column with calculated BMI
df.bmi = map((x,y) -> (x/y^2)*703, df.weight, df.height)

println(df)
println()

println("Describe dataframe to see new bmi column and summary:\n")
println(describe(df))

Output:

5×6 DataFrame
 Row │ id     gender  age    weight  height  bmi
     │ Int64  String  Int64  Int64   Int64   Float64
─────┼───────────────────────────────────────────────
   11  F          68     100      62  18.2882
   22  M          54     120      60  23.4333
   33  F          49     150      61  28.3392
   44  M          28     175      63  30.9965
   55  F          36     300      64  51.4893

Describe dataframe to see new bmi column and summary:

6×7 DataFrame
 Row │ variable  mean     min      median   max      nmissing  eltype
     │ Symbol    Union   Any      Union   Any      Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ id        3.0      1        3.0      5               0  Int64
   2 │ gender             F                 M               0  String
   3 │ age       47.0     28       49.0     68              0  Int64
   4 │ weight    169.0    100      150.0    300             0  Int64
   5 │ height    62.0     60       62.0     64              0  Int64
   6 │ bmi       30.5093  18.2882  28.3392  51.4893         0  Float64

Get counts/frequency

Input:

# get counts of males and females in the dataframe
println(combine(groupby(df, :gender), nrow => :N))

Output:

2×2 DataFrame
 Row │ gender  N     
     │ String  Int64 
─────┼───────────────
   1 │ F           3
   2 │ M           2

Transform DataFrame

sort

Input:

# sort the dataframe by gender and then age in reverse order for age (oldest to youngest)
println(sort(df, [:gender, :age], rev=(false, true)))

Output:

TypeError: in keyword argument rev, expected Union{Bool, AbstractArray{Bool,1}}, got Tuple{Bool,Bool}

stack (reshape from wide to long format)

Input:

# Reshape from wide to long format (disclude id to see which column and value matches which patient id)
long_df = stack(df, Not(:id))
println(long_df)

Output:

25×3 DataFrame
 Row │ id     variable  value
     │ Int64  String    Any
─────┼──────────────────────────
   11  gender    F
   22  gender    M
   33  gender    F
   44  gender    M
   55  gender    F
   61  age       68
   72  age       54
   83  age       49
   94  age       28
  105  age       36
  111  weight    100
  122  weight    120
  133  weight    150
  144  weight    175
  155  weight    300
  161  height    62
  172  height    60
  183  height    61
  194  height    63
  205  height    64
  211  bmi       18.2882
  222  bmi       23.4333
  233  bmi       28.3392
  244  bmi       30.9965
  255  bmi       51.4893

unstack (reshape from long to wide format)

Input:

#unstack dataframe to get back to wide format based off "id" (unstack(df, :id, :variable, :value))
wide_df = unstack(long_df, :id, :variable, :value)
println(wide_df)

Output:

5×6 DataFrame
 Row │ id     gender  age  weight  height  bmi
     │ Int64  Any     Any  Any     Any     Any
─────┼─────────────────────────────────────────────
   11  F       68   100     62      18.2882
   22  M       54   120     60      23.4333
   33  F       49   150     61      28.3392
   44  M       28   175     63      30.9965
   55  F       36   300     64      51.4893

Traversing DataFrame (for loops)

sort

Input:

# size of dataframe = size(df)
# set number of rows to nrows and number of columns to ncols
println("(nrows, ncols) = $(size(df))")
nrows, ncols = size(df)

# use nested for loop to get information from dataframe by row and column
for row in 1:nrows
  for col in 1:ncols
	println("value for row $row and col $col is $(df[row,col])")
  end
end

Output:

(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125

Exercises

  • Analyzing Health Datasets with DataFrames in Julia - Forthcoming!

References

  1. JuliaData Contributors. (n.d.). DataFrames.jl - JuliaData. Retrieved May 1, 2024, from https://dataframes.juliadata.org/stable/

Resources

Last updated