# Data Frames and Data Manipulation

This page provides examples of using the pandas package in Python, demonstrating the syntax and common functions within the package.

## Example <a href="#practice" id="practice"></a>

**Install and Load Pandas**

```julia
# Load the pandas package
import pandas as pd
```

**Create Dataframe**

```python
# Import pandas
import pandas as pd

# Create data as key-value pairs
data = {'id': [1,2,3,4,5],
        'gender': ["F", "M", "F", "M", "F"],
        'age': [68, 54, 49, 28, 36]}
        
# Put the data into a data frame
df = pd.DataFrame(data)
```

**Display Dataframe**

Input:

```julia
# display dataframe
print(df)
```

Output:

```julia
  id gender  age
0   1      F   68
1   2      M   54
2   3      F   49
3   4      M   28
4   5      F   36
```

First two lines of dataframe:

Input:

```julia
print(df.head(2))
```

Output:

```python
 id gender  age
0   1      F   68
1   2      M   54
```

Last two lines of dataframe:

Input:

```julia
println(df.tail(2))
```

Output:

```python
  id gender  age
3   4      M   28
4   5      F   36
```

**Describe Dataframe**

Dataframe size:

Input:

```python
# dataframe size
print(df.shape)
```

Output:

```python
(5, 3)
```

Dataframe column names:

Input:

```python
# dataframe column names
print(df.columns)
```

Output:

```python
Index(['id', 'gender', 'age'], dtype='object')
```

Dataframe description:

Input:

```python
# describe dataframe
print(df.describe())
```

Output:

```python
 id        age
count  5.000000   5.000000
mean   3.000000  47.000000
std    1.581139  15.620499
min    1.000000  28.000000
25%    2.000000  36.000000
50%    3.000000  49.000000
75%    4.000000  54.000000
max    5.000000  68.000000
```

**Accessing DataFrames**

***Get "age" column (different ways to call the column)***

Input:

```python
# call by column name
print(df['age'])

# get column by column number
println(df.iloc[:, 2])

```

Output:

```julia
0    68
1    54
2    49
3    28
4    36
Name: age, dtype: int64

0    68
1    54
2    49
3    28
4    36
Name: age, dtype: int64
```

***Get row***

Input:

```python
# print row 2
print(df.iloc[1])
```

Output:

```python
id         2
gender     M
age       54
Name: 1, dtype: object
```

***Get element***

Input:

```python
# get element in row 2, column 3
print(df.iloc[1,2])
```

Output:

```python
54
```

***Get subset (specific rows and all columns)***

Input:

```python
# print out rows 1, 3, & 5
print(df.iloc[[0, 2, 4], :])
```

Output:

```python
   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
```

***Get subset (all rows and specific columns)***

Input:

```python
# print out all rows and only columns 1 (id) and 3 (age)
print("Using column names:\n")
print(df[['id', 'age']])
print("")

print("Using column numbers:\n")
print(df.iloc[:, [0, 2]])
```

Output:

```python
Using column names:

   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36


Using column numbers:


   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36

```

***Get subset (all rows meeting specified criteria - numbers)***

Input:

```python
# print out all rows where age is greater than 50
print(df[df['age'] > 50])
```

Output:

```python
   id gender  age
0   1      F   68
1   2      M   54
```

***Get subset (all rows meeting specified criteria - strings)***

Input:

```python
# print out all rows where gender is female ("F")
print(df[df['gender'] == 'F'])
```

Output:

```python
   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
```

***Get subset (all rows meeting specified criteria)***

Input:

```python
# print out all rows where gender is female ("F") and age is between 25-50
print(df[(df['gender'] == 'F') & (df['age'] > 25) & (df['age'] < 50)])
```

Output:

```python
   id gender  age
2   3      F   49
4   5      F   36
```

**Add Column**

***New columns with specified values***

Input:

```python
# add a column for weight
df['weight'] = [100, 120, 150, 175, 300]

# add a column for height
df['height'] = [62, 60, 61, 63, 64]

print(df)
print("")

print("Describe dataframe to see column names and summary:\n")
print(df.describe())
```

Output:

```python
   id gender  age  weight  height
0   1      F   68     100      62
1   2      M   54     120      60
2   3      F   49     150      61
3   4      M   28     175      63
4   5      F   36     300      64

Describe dataframe to see column names and summary:

             id        age      weight     height
count  5.000000   5.000000    5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000
std    1.581139  15.620499   78.612976   1.581139
min    1.000000  28.000000  100.000000  60.000000
25%    2.000000  36.000000  120.000000  61.000000
50%    3.000000  49.000000  150.000000  62.000000
75%    4.000000  54.000000  175.000000  63.000000
max    5.000000  68.000000  300.000000  64.000000
```

***New column with calculated value***

Input:

```python
# Add a column with calculated BMI
df['bmi'] = (df['weight'] / df['height']**2) * 703

# Print the DataFrame
print(df)
print()

# Print summary statistics of the DataFrame
print("Describe dataframe to see new bmi column and summary:\n")
print(df.describe())
```

Output:

```python
   id gender  age  weight  height        bmi
0   1      F   68     100      62  18.288241
1   2      M   54     120      60  23.433333
2   3      F   49     150      61  28.339156
3   4      M   28     175      63  30.996473
4   5      F   36     300      64  51.489258

Describe dataframe to see new bmi column and summary:

             id        age      weight     height        bmi
count  5.000000   5.000000    5.000000   5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000  30.509292
std    1.581139  15.620499   78.612976   1.581139  12.693789
min    1.000000  28.000000  100.000000  60.000000  18.288241
25%    2.000000  36.000000  120.000000  61.000000  23.433333
50%    3.000000  49.000000  150.000000  62.000000  28.339156
75%    4.000000  54.000000  175.000000  63.000000  30.996473
max    5.000000  68.000000  300.000000  64.000000  51.489258
```

***Get counts/frequency***

Input:

````python
# Get counts of males and females in the DataFrame
gender_counts = df['gender'].value_counts().reset_index()
gender_counts.columns = ['gender', 'N']

# Print the result
print(gender_counts)
```
````

Output:

```python
  gender  N
0      F  3
1      M  2
```

**Transform DataFrame**

***sort***

Input:

```python
# Sort the DataFrame by gender and then by age in reverse order for age (oldest to youngest)
sorted_df = df.sort_values(by=['gender', 'age'], ascending=[True, False])

# Print the sorted DataFrame
print(sorted_df)
```

Output:

```python
   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
1   2      M   54
3   4      M   28
```

***stack (reshape from wide to long format)***

Input:

```python
# Reshape from wide to long format (disclude 'id' column)
long_df = pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')

# Print the reshaped DataFrame
print(long_df)
```

Output:

```python
       id variable      value
0    1   gender          F
1    2   gender          M
2    3   gender          F
3    4   gender          M
4    5   gender          F
5    1      age         68
6    2      age         54
7    3      age         49
8    4      age         28
9    5      age         36
10   1   weight        100
11   2   weight        120
12   3   weight        150
13   4   weight        175
14   5   weight        300
15   1   height         62
16   2   height         60
17   3   height         61
18   4   height         63
19   5   height         64
20   1      bmi  18.288241
21   2      bmi  23.433333
22   3      bmi  28.339156
23   4      bmi  30.996473
24   5      bmi  51.489258
```

***unstack (reshape from long to wide format)***

Input:

```python
# Unstack the DataFrame to get back to wide format based on "id"
wide_df = long_df.pivot(index='id', columns='variable', values='value')

# Print the reshaped DataFrame
print(wide_df)
```

Output:

```python
variable age        bmi gender height weight
id                                          
1         68  18.288241      F     62    100
2         54  23.433333      M     60    120
3         49  28.339156      F     61    150
4         28  30.996473      M     63    175
5         36  51.489258      F     64    300
```

**Traversing DataFrame (for loops)**

***sort***

Input:

```python
# Get number of rows and columns
nrows, ncols = df.shape
print(f"(nrows, ncols) = ({nrows}, {ncols})")

# Use nested for loop to get information from the DataFrame by row and column
for row in range(nrows):
    for col in range(ncols):
        print(f"value for row {row+1} and col {col+1} is {df.iloc[row, col]}")
```

Output:

```python
(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125
```

## Exercises <a href="#documentation" id="documentation"></a>

* Analyzing Health Datasets with Pandas in Python- *<mark style="color:yellow;">Forthcoming!</mark>*

## Resources <a href="#documentation" id="documentation"></a>

* Python Pandas: [Pandas Documentation](https://pandas.pydata.org/docs/)
* W3 Schools: [Pandas Tutorial](https://www.w3schools.com/python/pandas/default.asp)
* Geeks for Geeks: [Pandas Introduction](https://www.geeksforgeeks.org/introduction-to-pandas-in-python/)
* [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
