# Data Frames and Data Manipulation

This page provides examples of using the pandas package in Python, demonstrating the syntax and common functions within the package.

## Example <a href="#practice" id="practice"></a>

**Install and Load Pandas**

```julia
# Load the pandas package
import pandas as pd
```

**Create Dataframe**

```python
# Import pandas
import pandas as pd

# Create data as key-value pairs
data = {'id': [1,2,3,4,5],
        'gender': ["F", "M", "F", "M", "F"],
        'age': [68, 54, 49, 28, 36]}
        
# Put the data into a data frame
df = pd.DataFrame(data)
```

**Display Dataframe**

Input:

```julia
# display dataframe
print(df)
```

Output:

```julia
  id gender  age
0   1      F   68
1   2      M   54
2   3      F   49
3   4      M   28
4   5      F   36
```

First two lines of dataframe:

Input:

```julia
print(df.head(2))
```

Output:

```python
 id gender  age
0   1      F   68
1   2      M   54
```

Last two lines of dataframe:

Input:

```julia
println(df.tail(2))
```

Output:

```python
  id gender  age
3   4      M   28
4   5      F   36
```

**Describe Dataframe**

Dataframe size:

Input:

```python
# dataframe size
print(df.shape)
```

Output:

```python
(5, 3)
```

Dataframe column names:

Input:

```python
# dataframe column names
print(df.columns)
```

Output:

```python
Index(['id', 'gender', 'age'], dtype='object')
```

Dataframe description:

Input:

```python
# describe dataframe
print(df.describe())
```

Output:

```python
 id        age
count  5.000000   5.000000
mean   3.000000  47.000000
std    1.581139  15.620499
min    1.000000  28.000000
25%    2.000000  36.000000
50%    3.000000  49.000000
75%    4.000000  54.000000
max    5.000000  68.000000
```

**Accessing DataFrames**

***Get "age" column (different ways to call the column)***

Input:

```python
# call by column name
print(df['age'])

# get column by column number
println(df.iloc[:, 2])

```

Output:

```julia
0    68
1    54
2    49
3    28
4    36
Name: age, dtype: int64

0    68
1    54
2    49
3    28
4    36
Name: age, dtype: int64
```

***Get row***

Input:

```python
# print row 2
print(df.iloc[1])
```

Output:

```python
id         2
gender     M
age       54
Name: 1, dtype: object
```

***Get element***

Input:

```python
# get element in row 2, column 3
print(df.iloc[1,2])
```

Output:

```python
54
```

***Get subset (specific rows and all columns)***

Input:

```python
# print out rows 1, 3, & 5
print(df.iloc[[0, 2, 4], :])
```

Output:

```python
   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
```

***Get subset (all rows and specific columns)***

Input:

```python
# print out all rows and only columns 1 (id) and 3 (age)
print("Using column names:\n")
print(df[['id', 'age']])
print("")

print("Using column numbers:\n")
print(df.iloc[:, [0, 2]])
```

Output:

```python
Using column names:

   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36


Using column numbers:


   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36

```

***Get subset (all rows meeting specified criteria - numbers)***

Input:

```python
# print out all rows where age is greater than 50
print(df[df['age'] > 50])
```

Output:

```python
   id gender  age
0   1      F   68
1   2      M   54
```

***Get subset (all rows meeting specified criteria - strings)***

Input:

```python
# print out all rows where gender is female ("F")
print(df[df['gender'] == 'F'])
```

Output:

```python
   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
```

***Get subset (all rows meeting specified criteria)***

Input:

```python
# print out all rows where gender is female ("F") and age is between 25-50
print(df[(df['gender'] == 'F') & (df['age'] > 25) & (df['age'] < 50)])
```

Output:

```python
   id gender  age
2   3      F   49
4   5      F   36
```

**Add Column**

***New columns with specified values***

Input:

```python
# add a column for weight
df['weight'] = [100, 120, 150, 175, 300]

# add a column for height
df['height'] = [62, 60, 61, 63, 64]

print(df)
print("")

print("Describe dataframe to see column names and summary:\n")
print(df.describe())
```

Output:

```python
   id gender  age  weight  height
0   1      F   68     100      62
1   2      M   54     120      60
2   3      F   49     150      61
3   4      M   28     175      63
4   5      F   36     300      64

Describe dataframe to see column names and summary:

             id        age      weight     height
count  5.000000   5.000000    5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000
std    1.581139  15.620499   78.612976   1.581139
min    1.000000  28.000000  100.000000  60.000000
25%    2.000000  36.000000  120.000000  61.000000
50%    3.000000  49.000000  150.000000  62.000000
75%    4.000000  54.000000  175.000000  63.000000
max    5.000000  68.000000  300.000000  64.000000
```

***New column with calculated value***

Input:

```python
# Add a column with calculated BMI
df['bmi'] = (df['weight'] / df['height']**2) * 703

# Print the DataFrame
print(df)
print()

# Print summary statistics of the DataFrame
print("Describe dataframe to see new bmi column and summary:\n")
print(df.describe())
```

Output:

```python
   id gender  age  weight  height        bmi
0   1      F   68     100      62  18.288241
1   2      M   54     120      60  23.433333
2   3      F   49     150      61  28.339156
3   4      M   28     175      63  30.996473
4   5      F   36     300      64  51.489258

Describe dataframe to see new bmi column and summary:

             id        age      weight     height        bmi
count  5.000000   5.000000    5.000000   5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000  30.509292
std    1.581139  15.620499   78.612976   1.581139  12.693789
min    1.000000  28.000000  100.000000  60.000000  18.288241
25%    2.000000  36.000000  120.000000  61.000000  23.433333
50%    3.000000  49.000000  150.000000  62.000000  28.339156
75%    4.000000  54.000000  175.000000  63.000000  30.996473
max    5.000000  68.000000  300.000000  64.000000  51.489258
```

***Get counts/frequency***

Input:

````python
# Get counts of males and females in the DataFrame
gender_counts = df['gender'].value_counts().reset_index()
gender_counts.columns = ['gender', 'N']

# Print the result
print(gender_counts)
```
````

Output:

```python
  gender  N
0      F  3
1      M  2
```

**Transform DataFrame**

***sort***

Input:

```python
# Sort the DataFrame by gender and then by age in reverse order for age (oldest to youngest)
sorted_df = df.sort_values(by=['gender', 'age'], ascending=[True, False])

# Print the sorted DataFrame
print(sorted_df)
```

Output:

```python
   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
1   2      M   54
3   4      M   28
```

***stack (reshape from wide to long format)***

Input:

```python
# Reshape from wide to long format (disclude 'id' column)
long_df = pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')

# Print the reshaped DataFrame
print(long_df)
```

Output:

```python
       id variable      value
0    1   gender          F
1    2   gender          M
2    3   gender          F
3    4   gender          M
4    5   gender          F
5    1      age         68
6    2      age         54
7    3      age         49
8    4      age         28
9    5      age         36
10   1   weight        100
11   2   weight        120
12   3   weight        150
13   4   weight        175
14   5   weight        300
15   1   height         62
16   2   height         60
17   3   height         61
18   4   height         63
19   5   height         64
20   1      bmi  18.288241
21   2      bmi  23.433333
22   3      bmi  28.339156
23   4      bmi  30.996473
24   5      bmi  51.489258
```

***unstack (reshape from long to wide format)***

Input:

```python
# Unstack the DataFrame to get back to wide format based on "id"
wide_df = long_df.pivot(index='id', columns='variable', values='value')

# Print the reshaped DataFrame
print(wide_df)
```

Output:

```python
variable age        bmi gender height weight
id                                          
1         68  18.288241      F     62    100
2         54  23.433333      M     60    120
3         49  28.339156      F     61    150
4         28  30.996473      M     63    175
5         36  51.489258      F     64    300
```

**Traversing DataFrame (for loops)**

***sort***

Input:

```python
# Get number of rows and columns
nrows, ncols = df.shape
print(f"(nrows, ncols) = ({nrows}, {ncols})")

# Use nested for loop to get information from the DataFrame by row and column
for row in range(nrows):
    for col in range(ncols):
        print(f"value for row {row+1} and col {col+1} is {df.iloc[row, col]}")
```

Output:

```python
(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125
```

## Exercises <a href="#documentation" id="documentation"></a>

* Analyzing Health Datasets with Pandas in Python- *<mark style="color:yellow;">Forthcoming!</mark>*

## Resources <a href="#documentation" id="documentation"></a>

* Python Pandas: [Pandas Documentation](https://pandas.pydata.org/docs/)
* W3 Schools: [Pandas Tutorial](https://www.w3schools.com/python/pandas/default.asp)
* Geeks for Geeks: [Pandas Introduction](https://www.geeksforgeeks.org/introduction-to-pandas-in-python/)
* [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.bcbi.brown.edu/codiac-for-health/computing/python/data-frames-and-data-manipulation.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
