# DataFrames

<mark style="color:$primary;">DataFrames.jl is a Julia package that provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of</mark> [pandas](https://pandas.pydata.org/) <mark style="color:$primary;">(in Python) and</mark> <mark style="color:$primary;"></mark><mark style="color:$primary;">`data.frame`</mark><mark style="color:$primary;">,</mark> [`data.table`](https://rdatatable.gitlab.io/data.table/) <mark style="color:$primary;">and</mark> [dplyr](https://dplyr.tidyverse.org/) <mark style="color:$primary;">(in R), making it a great general purpose data science tool.</mark> [\[1\]](#references)&#x20;

This page provides examples of using DataFrames.jl, demonstrating the syntax and common functions within the package.

## Example <a href="#practice" id="practice"></a>

**Install and Load DataFrames.jl Package**

```julia
using Pkg

# Add DataFrames package
Pkg.add("DataFrames")

# Load paackages
using DataFrames
```

**Create Dataframe**

```julia
# Create dataframe
df = DataFrame(id = 1:5, gender = ["F", "M", "F", "M", "F"], age = [68, 54, 49, 28, 36])
```

**Display Dataframe**

Input:

```julia
# display dataframe
println(df)
```

Output:

```julia
5×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54
   3 │     3  F          49
   4 │     4  M          28
   5 │     5  F          36
```

First two lines of dataframe:

Input:

```julia
println(first(df, 2))
```

Output:

```julia
2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54
```

Last two lines of dataframe:

Input:

```julia
println(last(df, 2))
```

Output:

```julia
2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     4  M          28
   2 │     5  F          36
```

**Describe Dataframe**

Dataframe size:

Input:

```julia
# dataframe size
println(size(df))
```

Output:

```julia
(5, 3)
```

Dataframe column names:

Input:

```julia
# dataframe column names
println(names(df))
```

Output:

```julia
["id", "gender", "age"]
```

Dataframe description:

Input:

```julia
# describe dataframe
println(describe(df))
```

Output:

```julia
3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64
```

**Accessing DataFrames**

***Get "age" column (different ways to call the column)***

Input:

```julia
# call by column name
println(df[!, :age])

# get column by column number
println(df[!, 3])

# alternate syntax
println(df.age)
```

Output:

```julia
[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]
```

***Get row***

Input:

```julia
# print row 2
println(df[2, :])
```

Output:

```julia
DataFrameRow
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   2 │     2  M          54
```

***Get element***

Input:

```julia
# get element in row 2, column 3
println(df[2,3])
```

Output:

```julia
54
```

***Get subset (specific rows and all columns)***

Input:

```julia
# print out rows 1, 3, & 5
println(df[[1,3,5], :])
```

Output:

```julia
3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     3  F          49
   3 │     5  F          36
```

***Get subset (all rows and specific columns)***

Input:

```julia
# print out all rows and only columns 1 (id) and 3 (age)
println("Using column names:\n")
println(df[:, [:id, :age]])
println()

println("Using column numbers:\n")
println(df[:, [1,3]])
```

Output:

```julia
Using column names:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   1 │     1     68
   2 │     2     54
   3 │     3     49
   4 │     4     28
   5 │     5     36

Using column numbers:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   1 │     1     68
   2 │     2     54
   3 │     3     49
   4 │     4     28
   5 │     5     36
```

***Get subset (all rows meeting specified criteria - numbers)***

Input:

```julia
# print out all rows where age is greater than 50
println(df[df.age .> 50, :])
```

Output:

```julia
2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54
```

***Get subset (all rows meeting specified criteria - strings)***

Input:

```julia
# print out all rows where gender is female ("F")
println(df[df.gender .== "F", :])
```

Output:

```julia
3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     3  F          49
   3 │     5  F          36
```

***Get subset (all rows meeting specified criteria)***

Input:

```julia
# print out all rows where gender is female ("F") and age is between 25-50
println(df[(df.gender .== "F") .& (25 .< df.age .< 50), :])
```

Output:

```julia
2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     3  F          49
   2 │     5  F          36
```

**Add Column**

***New columns with specified values***

Input:

```julia
# add a column for weight
df.weight = [100, 120, 150, 175, 300]

# add a column for height
df.height = [62, 60, 61, 63, 64]

println(df)
println()

println("Describe dataframe to see column names and summary:\n")
println(describe(df))
```

Output:

```julia
5×5 DataFrame
 Row │ id     gender  age    weight  height
     │ Int64  String  Int64  Int64   Int64
─────┼──────────────────────────────────────
   1 │     1  F          68     100      62
   2 │     2  M          54     120      60
   3 │     3  F          49     150      61
   4 │     4  M          28     175      63
   5 │     5  F          36     300      64

Describe dataframe to see column names and summary:

5×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64
   4 │ weight    169.0   100  150.0   300         0  Int64
   5 │ height    62.0    60   62.0    64          0  Int64
```

***New column with calculated value***

Input:

```julia
# add a column with calculated BMI
df.bmi = map((x,y) -> (x/y^2)*703, df.weight, df.height)

println(df)
println()

println("Describe dataframe to see new bmi column and summary:\n")
println(describe(df))
```

Output:

```julia
5×6 DataFrame
 Row │ id     gender  age    weight  height  bmi
     │ Int64  String  Int64  Int64   Int64   Float64
─────┼───────────────────────────────────────────────
   1 │     1  F          68     100      62  18.2882
   2 │     2  M          54     120      60  23.4333
   3 │     3  F          49     150      61  28.3392
   4 │     4  M          28     175      63  30.9965
   5 │     5  F          36     300      64  51.4893

Describe dataframe to see new bmi column and summary:

6×7 DataFrame
 Row │ variable  mean     min      median   max      nmissing  eltype
     │ Symbol    Union…   Any      Union…   Any      Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ id        3.0      1        3.0      5               0  Int64
   2 │ gender             F                 M               0  String
   3 │ age       47.0     28       49.0     68              0  Int64
   4 │ weight    169.0    100      150.0    300             0  Int64
   5 │ height    62.0     60       62.0     64              0  Int64
   6 │ bmi       30.5093  18.2882  28.3392  51.4893         0  Float64
```

***Get counts/frequency***

Input:

```julia
# get counts of males and females in the dataframe
println(combine(groupby(df, :gender), nrow => :N))
```

Output:

```julia
2×2 DataFrame
 Row │ gender  N     
     │ String  Int64 
─────┼───────────────
   1 │ F           3
   2 │ M           2
```

**Transform DataFrame**

***sort***

Input:

```julia
# sort the dataframe by gender and then age in reverse order for age (oldest to youngest)
println(sort(df, [:gender, :age], rev=(false, true)))
```

Output:

```julia
TypeError: in keyword argument rev, expected Union{Bool, AbstractArray{Bool,1}}, got Tuple{Bool,Bool}
```

***stack (reshape from wide to long format)***

Input:

```julia
# Reshape from wide to long format (disclude id to see which column and value matches which patient id)
long_df = stack(df, Not(:id))
println(long_df)
```

Output:

```julia
25×3 DataFrame
 Row │ id     variable  value
     │ Int64  String    Any
─────┼──────────────────────────
   1 │     1  gender    F
   2 │     2  gender    M
   3 │     3  gender    F
   4 │     4  gender    M
   5 │     5  gender    F
   6 │     1  age       68
   7 │     2  age       54
   8 │     3  age       49
   9 │     4  age       28
  10 │     5  age       36
  11 │     1  weight    100
  12 │     2  weight    120
  13 │     3  weight    150
  14 │     4  weight    175
  15 │     5  weight    300
  16 │     1  height    62
  17 │     2  height    60
  18 │     3  height    61
  19 │     4  height    63
  20 │     5  height    64
  21 │     1  bmi       18.2882
  22 │     2  bmi       23.4333
  23 │     3  bmi       28.3392
  24 │     4  bmi       30.9965
  25 │     5  bmi       51.4893
```

***unstack (reshape from long to wide format)***

Input:

```julia
#unstack dataframe to get back to wide format based off "id" (unstack(df, :id, :variable, :value))
wide_df = unstack(long_df, :id, :variable, :value)
println(wide_df)
```

Output:

```julia
5×6 DataFrame
 Row │ id     gender  age  weight  height  bmi
     │ Int64  Any     Any  Any     Any     Any
─────┼─────────────────────────────────────────────
   1 │     1  F       68   100     62      18.2882
   2 │     2  M       54   120     60      23.4333
   3 │     3  F       49   150     61      28.3392
   4 │     4  M       28   175     63      30.9965
   5 │     5  F       36   300     64      51.4893
```

**Traversing DataFrame (for loops)**

***sort***

Input:

```julia
# size of dataframe = size(df)
# set number of rows to nrows and number of columns to ncols
println("(nrows, ncols) = $(size(df))")
nrows, ncols = size(df)

# use nested for loop to get information from dataframe by row and column
for row in 1:nrows
  for col in 1:ncols
	println("value for row $row and col $col is $(df[row,col])")
  end
end
```

Output:

```julia
(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125
```

## Exercises <a href="#documentation" id="documentation"></a>

* Analyzing Health Datasets with DataFrames in Julia - *<mark style="color:yellow;">Forthcoming!</mark>*

## References

1. JuliaData Contributors. (n.d.). [DataFrames.jl - JuliaData](https://dataframes.juliadata.org/stable/). Retrieved May 1, 2024.&#x20;

## Resources <a href="#documentation" id="documentation"></a>

* Julia Package: [DataFrames.jl](https://dataframes.juliadata.org/stable/)
* Julia Package: [CSV.jl](https://juliadata.github.io/CSV.jl/stable/)
* Julia Data Science: [DataFrames.jl](https://juliadatascience.io/dataframes)
* Introducing Julia Wikibook: [DataFrames](https://en.wikibooks.org/wiki/Introducing_Julia/DataFrames)
* [Julia DataFrames Cheat Sheets](https://jcharistech.wordpress.com/julia-dataframes-cheat-sheets/)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.bcbi.brown.edu/codiac-for-health/computing/julia/dataframes.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
