# DataFrames

> DataFrames.jl is a Julia package that provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of [pandas](https://pandas.pydata.org/) (in Python) and `data.frame`, [`data.table`](https://rdatatable.gitlab.io/data.table/) and [dplyr](https://dplyr.tidyverse.org/) (in R), making it a great general purpose data science tool. [\[1\]](#references)&#x20;

This page provides examples of using DataFrames.jl, demonstrating the syntax and common functions within the package.

## Example <a href="#practice" id="practice"></a>

**Install and Load DataFrames.jl Package**

```julia
using Pkg

# Add DataFrames package
Pkg.add("DataFrames")

# Load paackages
using DataFrames
```

**Create Dataframe**

```julia
# Create dataframe
df = DataFrame(id = 1:5, gender = ["F", "M", "F", "M", "F"], age = [68, 54, 49, 28, 36])
```

**Display Dataframe**

Input:

```julia
# display dataframe
println(df)
```

Output:

```julia
5×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54
   3 │     3  F          49
   4 │     4  M          28
   5 │     5  F          36
```

First two lines of dataframe:

Input:

```julia
println(first(df, 2))
```

Output:

```julia
2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54
```

Last two lines of dataframe:

Input:

```julia
println(last(df, 2))
```

Output:

```julia
2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     4  M          28
   2 │     5  F          36
```

**Describe Dataframe**

Dataframe size:

Input:

```julia
# dataframe size
println(size(df))
```

Output:

```julia
(5, 3)
```

Dataframe column names:

Input:

```julia
# dataframe column names
println(names(df))
```

Output:

```julia
["id", "gender", "age"]
```

Dataframe description:

Input:

```julia
# describe dataframe
println(describe(df))
```

Output:

```julia
3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64
```

**Accessing DataFrames**

***Get "age" column (different ways to call the column)***

Input:

```julia
# call by column name
println(df[!, :age])

# get column by column number
println(df[!, 3])

# alternate syntax
println(df.age)
```

Output:

```julia
[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]
```

***Get row***

Input:

```julia
# print row 2
println(df[2, :])
```

Output:

```julia
DataFrameRow
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   2 │     2  M          54
```

***Get element***

Input:

```julia
# get element in row 2, column 3
println(df[2,3])
```

Output:

```julia
54
```

***Get subset (specific rows and all columns)***

Input:

```julia
# print out rows 1, 3, & 5
println(df[[1,3,5], :])
```

Output:

```julia
3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     3  F          49
   3 │     5  F          36
```

***Get subset (all rows and specific columns)***

Input:

```julia
# print out all rows and only columns 1 (id) and 3 (age)
println("Using column names:\n")
println(df[:, [:id, :age]])
println()

println("Using column numbers:\n")
println(df[:, [1,3]])
```

Output:

```julia
Using column names:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   1 │     1     68
   2 │     2     54
   3 │     3     49
   4 │     4     28
   5 │     5     36

Using column numbers:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   1 │     1     68
   2 │     2     54
   3 │     3     49
   4 │     4     28
   5 │     5     36
```

***Get subset (all rows meeting specified criteria - numbers)***

Input:

```julia
# print out all rows where age is greater than 50
println(df[df.age .> 50, :])
```

Output:

```julia
2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54
```

***Get subset (all rows meeting specified criteria - strings)***

Input:

```julia
# print out all rows where gender is female ("F")
println(df[df.gender .== "F", :])
```

Output:

```julia
3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     3  F          49
   3 │     5  F          36
```

***Get subset (all rows meeting specified criteria)***

Input:

```julia
# print out all rows where gender is female ("F") and age is between 25-50
println(df[(df.gender .== "F") .& (25 .< df.age .< 50), :])
```

Output:

```julia
2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     3  F          49
   2 │     5  F          36
```

**Add Column**

***New columns with specified values***

Input:

```julia
# add a column for weight
df.weight = [100, 120, 150, 175, 300]

# add a column for height
df.height = [62, 60, 61, 63, 64]

println(df)
println()

println("Describe dataframe to see column names and summary:\n")
println(describe(df))
```

Output:

```julia
5×5 DataFrame
 Row │ id     gender  age    weight  height
     │ Int64  String  Int64  Int64   Int64
─────┼──────────────────────────────────────
   1 │     1  F          68     100      62
   2 │     2  M          54     120      60
   3 │     3  F          49     150      61
   4 │     4  M          28     175      63
   5 │     5  F          36     300      64

Describe dataframe to see column names and summary:

5×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64
   4 │ weight    169.0   100  150.0   300         0  Int64
   5 │ height    62.0    60   62.0    64          0  Int64
```

***New column with calculated value***

Input:

```julia
# add a column with calculated BMI
df.bmi = map((x,y) -> (x/y^2)*703, df.weight, df.height)

println(df)
println()

println("Describe dataframe to see new bmi column and summary:\n")
println(describe(df))
```

Output:

```julia
5×6 DataFrame
 Row │ id     gender  age    weight  height  bmi
     │ Int64  String  Int64  Int64   Int64   Float64
─────┼───────────────────────────────────────────────
   1 │     1  F          68     100      62  18.2882
   2 │     2  M          54     120      60  23.4333
   3 │     3  F          49     150      61  28.3392
   4 │     4  M          28     175      63  30.9965
   5 │     5  F          36     300      64  51.4893

Describe dataframe to see new bmi column and summary:

6×7 DataFrame
 Row │ variable  mean     min      median   max      nmissing  eltype
     │ Symbol    Union…   Any      Union…   Any      Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ id        3.0      1        3.0      5               0  Int64
   2 │ gender             F                 M               0  String
   3 │ age       47.0     28       49.0     68              0  Int64
   4 │ weight    169.0    100      150.0    300             0  Int64
   5 │ height    62.0     60       62.0     64              0  Int64
   6 │ bmi       30.5093  18.2882  28.3392  51.4893         0  Float64
```

***Get counts/frequency***

Input:

```julia
# get counts of males and females in the dataframe
println(combine(groupby(df, :gender), nrow => :N))
```

Output:

```julia
2×2 DataFrame
 Row │ gender  N     
     │ String  Int64 
─────┼───────────────
   1 │ F           3
   2 │ M           2
```

**Transform DataFrame**

***sort***

Input:

```julia
# sort the dataframe by gender and then age in reverse order for age (oldest to youngest)
println(sort(df, [:gender, :age], rev=(false, true)))
```

Output:

```julia
TypeError: in keyword argument rev, expected Union{Bool, AbstractArray{Bool,1}}, got Tuple{Bool,Bool}
```

***stack (reshape from wide to long format)***

Input:

```julia
# Reshape from wide to long format (disclude id to see which column and value matches which patient id)
long_df = stack(df, Not(:id))
println(long_df)
```

Output:

```julia
25×3 DataFrame
 Row │ id     variable  value
     │ Int64  String    Any
─────┼──────────────────────────
   1 │     1  gender    F
   2 │     2  gender    M
   3 │     3  gender    F
   4 │     4  gender    M
   5 │     5  gender    F
   6 │     1  age       68
   7 │     2  age       54
   8 │     3  age       49
   9 │     4  age       28
  10 │     5  age       36
  11 │     1  weight    100
  12 │     2  weight    120
  13 │     3  weight    150
  14 │     4  weight    175
  15 │     5  weight    300
  16 │     1  height    62
  17 │     2  height    60
  18 │     3  height    61
  19 │     4  height    63
  20 │     5  height    64
  21 │     1  bmi       18.2882
  22 │     2  bmi       23.4333
  23 │     3  bmi       28.3392
  24 │     4  bmi       30.9965
  25 │     5  bmi       51.4893
```

***unstack (reshape from long to wide format)***

Input:

```julia
#unstack dataframe to get back to wide format based off "id" (unstack(df, :id, :variable, :value))
wide_df = unstack(long_df, :id, :variable, :value)
println(wide_df)
```

Output:

```julia
5×6 DataFrame
 Row │ id     gender  age  weight  height  bmi
     │ Int64  Any     Any  Any     Any     Any
─────┼─────────────────────────────────────────────
   1 │     1  F       68   100     62      18.2882
   2 │     2  M       54   120     60      23.4333
   3 │     3  F       49   150     61      28.3392
   4 │     4  M       28   175     63      30.9965
   5 │     5  F       36   300     64      51.4893
```

**Traversing DataFrame (for loops)**

***sort***

Input:

```julia
# size of dataframe = size(df)
# set number of rows to nrows and number of columns to ncols
println("(nrows, ncols) = $(size(df))")
nrows, ncols = size(df)

# use nested for loop to get information from dataframe by row and column
for row in 1:nrows
  for col in 1:ncols
	println("value for row $row and col $col is $(df[row,col])")
  end
end
```

Output:

```julia
(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125
```

## Exercises <a href="#documentation" id="documentation"></a>

* Analyzing Health Datasets with DataFrames in Julia - *<mark style="color:yellow;">Forthcoming!</mark>*

## References

1. JuliaData Contributors. (n.d.). DataFrames.jl - JuliaData. Retrieved May 1, 2024, from <https://dataframes.juliadata.org/stable/>

## Resources <a href="#documentation" id="documentation"></a>

* Julia Package: [DataFrames.jl](https://dataframes.juliadata.org/stable/)
* Julia Package: [CSV.jl](https://juliadata.github.io/CSV.jl/stable/)
* Julia Data Science: [DataFrames.jl](https://juliadatascience.io/dataframes)
* Introducing Julia Wikibook: [DataFrames](https://en.wikibooks.org/wiki/Introducing_Julia/DataFrames)
* [Julia DataFrames Cheat Sheets](https://jcharistech.wordpress.com/julia-dataframes-cheat-sheets/)
