6.8 DataFrames

Documentation
Theory
Practice

Documentation

Theory

Practice

Install and Load DataFrames.jl Package

using Pkg

# Add DataFrames package
Pkg.add("DataFrames")

# Load paackages
using DataFrames

Create Dataframe

# Create dataframe
df = DataFrame(id = 1:5, gender = ["F", "M", "F", "M", "F"], age = [68, 54, 49, 28, 36])

Display Dataframe

Input:

# display dataframe
println(df)

Output:

5×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54
   3 │     3  F          49
   4 │     4  M          28
   5 │     5  F          36

First two lines of dataframe:

Input:

println(first(df, 2))

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54

Last two lines of dataframe:

Input:

println(last(df, 2))

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     4  M          28
   2 │     5  F          36

Describe Dataframe

Dataframe size:

Input:

# dataframe size
println(size(df))

Output:

(5, 3)

Dataframe column names:

Input:

# dataframe column names
println(names(df))

Output:

["id", "gender", "age"]

Dataframe description:

Input:

# describe dataframe
println(describe(df))

Output:

3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64

Accessing DataFrames

Get "age" column (different ways to call the column)

Input:

# call by column name
println(df[!, :age])

# get column by column number
println(df[!, 3])

# alternate syntax
println(df.age)

Output:

[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]

Get row

Input:

# print row 2
println(df[2, :])

Output:

DataFrameRow
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   2 │     2  M          54

Get element

Input:

# get element in row 2, column 3
println(df[2,3])

Output:

Get subset (specific rows and all columns)

Input:

# print out rows 1, 3, & 5
println(df[[1,3,5], :])

Output:

3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     3  F          49
   3 │     5  F          36

Get subset (all rows and specific columns)

Input:

# print out all rows and only columns 1 (id) and 3 (age)
println("Using column names:\n")
println(df[:, [:id, :age]])
println()

println("Using column numbers:\n")
println(df[:, [1,3]])

Output:

Using column names:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   1 │     1     68
   2 │     2     54
   3 │     3     49
   4 │     4     28
   5 │     5     36

Using column numbers:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   1 │     1     68
   2 │     2     54
   3 │     3     49
   4 │     4     28
   5 │     5     36

Get subset (all rows meeting specified criteria - numbers)

Input:

# print out all rows where age is greater than 50
println(df[df.age .> 50, :])

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54

Get subset (all rows meeting specified criteria - strings)

Input:

# print out all rows where gender is female ("F")
println(df[df.gender .== "F", :])

Output:

3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     3  F          49
   3 │     5  F          36

Get subset (all rows meeting specified criteria)

Input:

# print out all rows where gender is female ("F") and age is between 25-50
println(df[(df.gender .== "F") .& (25 .< df.age .< 50), :])

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     3  F          49
   2 │     5  F          36

Add Column

New columns with specified values

Input:

# add a column for weight
df.weight = [100, 120, 150, 175, 300]

# add a column for height
df.height = [62, 60, 61, 63, 64]

println(df)
println()

println("Describe dataframe to see column names and summary:\n")
println(describe(df))

Output:

5×5 DataFrame
 Row │ id     gender  age    weight  height
     │ Int64  String  Int64  Int64   Int64
─────┼──────────────────────────────────────
   1 │     1  F          68     100      62
   2 │     2  M          54     120      60
   3 │     3  F          49     150      61
   4 │     4  M          28     175      63
   5 │     5  F          36     300      64

Describe dataframe to see column names and summary:

5×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64
   4 │ weight    169.0   100  150.0   300         0  Int64
   5 │ height    62.0    60   62.0    64          0  Int64

New column with calculated value

Input:

# add a column with calculated BMI
df.bmi = map((x,y) -> (x/y^2)*703, df.weight, df.height)

println(df)
println()

println("Describe dataframe to see new bmi column and summary:\n")
println(describe(df))

Output:

5×6 DataFrame
 Row │ id     gender  age    weight  height  bmi
     │ Int64  String  Int64  Int64   Int64   Float64
─────┼───────────────────────────────────────────────
   1 │     1  F          68     100      62  18.2882
   2 │     2  M          54     120      60  23.4333
   3 │     3  F          49     150      61  28.3392
   4 │     4  M          28     175      63  30.9965
   5 │     5  F          36     300      64  51.4893

Describe dataframe to see new bmi column and summary:

6×7 DataFrame
 Row │ variable  mean     min      median   max      nmissing  eltype
     │ Symbol    Union…   Any      Union…   Any      Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ id        3.0      1        3.0      5               0  Int64
   2 │ gender             F                 M               0  String
   3 │ age       47.0     28       49.0     68              0  Int64
   4 │ weight    169.0    100      150.0    300             0  Int64
   5 │ height    62.0     60       62.0     64              0  Int64
   6 │ bmi       30.5093  18.2882  28.3392  51.4893         0  Float64

Get counts/frequency

Input:

# get counts of males and females in the dataframe
println(by(df, :gender, N = :gender => length))

Output:

ArgumentError: by function was removed from DataFrames.jl. Use the `combine(groupby(...), ...)` or `combine(f, groupby(...))` instead.

Transform DataFrame

sort

Input:

# sort the dataframe by gender and then age in reverse order for age (oldest to youngest)
println(sort(df, [:gender, :age], rev=(false, true)))

Output:

TypeError: in keyword argument rev, expected Union{Bool, AbstractArray{Bool,1}}, got Tuple{Bool,Bool}

stack (reshape from wide to long format)

Input:

# Reshape from wide to long format (disclude id to see which column and value matches which patient id)
long_df = stack(df, Not(:id))
println(long_df)

Output:

25×3 DataFrame
 Row │ id     variable  value
     │ Int64  String    Any
─────┼──────────────────────────
   1 │     1  gender    F
   2 │     2  gender    M
   3 │     3  gender    F
   4 │     4  gender    M
   5 │     5  gender    F
   6 │     1  age       68
   7 │     2  age       54
   8 │     3  age       49
   9 │     4  age       28
  10 │     5  age       36
  11 │     1  weight    100
  12 │     2  weight    120
  13 │     3  weight    150
  14 │     4  weight    175
  15 │     5  weight    300
  16 │     1  height    62
  17 │     2  height    60
  18 │     3  height    61
  19 │     4  height    63
  20 │     5  height    64
  21 │     1  bmi       18.2882
  22 │     2  bmi       23.4333
  23 │     3  bmi       28.3392
  24 │     4  bmi       30.9965
  25 │     5  bmi       51.4893

unstack (reshape from long to wide format)

Input:

#unstack dataframe to get back to wide format based off "id" (unstack(df, :id, :variable, :value))
wide_df = unstack(long_df, :id, :variable, :value)
println(wide_df)

Output:

5×6 DataFrame
 Row │ id     gender  age  weight  height  bmi
     │ Int64  Any     Any  Any     Any     Any
─────┼─────────────────────────────────────────────
   1 │     1  F       68   100     62      18.2882
   2 │     2  M       54   120     60      23.4333
   3 │     3  F       49   150     61      28.3392
   4 │     4  M       28   175     63      30.9965
   5 │     5  F       36   300     64      51.4893

Traversing DataFrame (for loops)

sort

Input:

# size of dataframe = size(df)
# set number of rows to nrows and number of columns to ncols
println("(nrows, ncols) = $(size(df))")
nrows, ncols = size(df)

# use nested for loop to get information from dataframe by row and column
for row in 1:nrows
  for col in 1:ncols
	println("value for row $row and col $col is $(df[row,col])")
  end
end

Output:

(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125