This page provides examples of using the pandas package in Python, demonstrating the syntax and common functions within the package.
Example
Install and Load Pandas
# Load the pandas package
import pandas as pd
Create Dataframe
# Import pandas
import pandas as pd
# Create data as key-value pairs
data = {'id': [1,2,3,4,5],
'gender': ["F", "M", "F", "M", "F"],
'age': [68, 54, 49, 28, 36]}
# Put the data into a data frame
df = pd.DataFrame(data)
Display Dataframe
Input:
# display dataframe
print(df)
Output:
id gender age
0 1 F 68
1 2 M 54
2 3 F 49
3 4 M 28
4 5 F 36
First two lines of dataframe:
Input:
print(df.head(2))
Output:
id gender age
0 1 F 68
1 2 M 54
Last two lines of dataframe:
Input:
println(df.tail(2))
Output:
id gender age
3 4 M 28
4 5 F 36
Describe Dataframe
Dataframe size:
Input:
# dataframe size
print(df.shape)
Output:
(5, 3)
Dataframe column names:
Input:
# dataframe column names
print(df.columns)
Output:
Index(['id', 'gender', 'age'], dtype='object')
Dataframe description:
Input:
# describe dataframe
print(df.describe())
Output:
id age
count 5.000000 5.000000
mean 3.000000 47.000000
std 1.581139 15.620499
min 1.000000 28.000000
25% 2.000000 36.000000
50% 3.000000 49.000000
75% 4.000000 54.000000
max 5.000000 68.000000
Accessing DataFrames
Get "age" column (different ways to call the column)
Input:
# call by column name
print(df['age'])
# get column by column number
println(df.iloc[:, 2])
# print out all rows and only columns 1 (id) and 3 (age)
print("Using column names:\n")
print(df[['id', 'age']])
print("")
print("Using column numbers:\n")
print(df.iloc[:, [0, 2]])
Output:
Using column names:
id age
0 1 68
1 2 54
2 3 49
3 4 28
4 5 36
Using column numbers:
id age
0 1 68
1 2 54
2 3 49
3 4 28
4 5 36
Get subset (all rows meeting specified criteria - numbers)
Input:
# print out all rows where age is greater than 50
print(df[df['age'] > 50])
Output:
id gender age
0 1 F 68
1 2 M 54
Get subset (all rows meeting specified criteria - strings)
Input:
# print out all rows where gender is female ("F")
print(df[df['gender'] == 'F'])
Output:
id gender age
0 1 F 68
2 3 F 49
4 5 F 36
Get subset (all rows meeting specified criteria)
Input:
# print out all rows where gender is female ("F") and age is between 25-50
print(df[(df['gender'] == 'F') & (df['age'] > 25) & (df['age'] < 50)])
Output:
id gender age
2 3 F 49
4 5 F 36
Add Column
New columns with specified values
Input:
# add a column for weight
df['weight'] = [100, 120, 150, 175, 300]
# add a column for height
df['height'] = [62, 60, 61, 63, 64]
print(df)
print("")
print("Describe dataframe to see column names and summary:\n")
print(df.describe())
Output:
id gender age weight height
0 1 F 68 100 62
1 2 M 54 120 60
2 3 F 49 150 61
3 4 M 28 175 63
4 5 F 36 300 64
Describe dataframe to see column names and summary:
id age weight height
count 5.000000 5.000000 5.000000 5.000000
mean 3.000000 47.000000 169.000000 62.000000
std 1.581139 15.620499 78.612976 1.581139
min 1.000000 28.000000 100.000000 60.000000
25% 2.000000 36.000000 120.000000 61.000000
50% 3.000000 49.000000 150.000000 62.000000
75% 4.000000 54.000000 175.000000 63.000000
max 5.000000 68.000000 300.000000 64.000000
New column with calculated value
Input:
# Add a column with calculated BMI
df['bmi'] = (df['weight'] / df['height']**2) * 703
# Print the DataFrame
print(df)
print()
# Print summary statistics of the DataFrame
print("Describe dataframe to see new bmi column and summary:\n")
print(df.describe())
Output:
id gender age weight height bmi
0 1 F 68 100 62 18.288241
1 2 M 54 120 60 23.433333
2 3 F 49 150 61 28.339156
3 4 M 28 175 63 30.996473
4 5 F 36 300 64 51.489258
Describe dataframe to see new bmi column and summary:
id age weight height bmi
count 5.000000 5.000000 5.000000 5.000000 5.000000
mean 3.000000 47.000000 169.000000 62.000000 30.509292
std 1.581139 15.620499 78.612976 1.581139 12.693789
min 1.000000 28.000000 100.000000 60.000000 18.288241
25% 2.000000 36.000000 120.000000 61.000000 23.433333
50% 3.000000 49.000000 150.000000 62.000000 28.339156
75% 4.000000 54.000000 175.000000 63.000000 30.996473
max 5.000000 68.000000 300.000000 64.000000 51.489258
Get counts/frequency
Input:
# Get counts of males and females in the DataFrame
gender_counts = df['gender'].value_counts().reset_index()
gender_counts.columns = ['gender', 'N']
# Print the result
print(gender_counts)
```
Output:
gender N
0 F 3
1 M 2
Transform DataFrame
sort
Input:
# Sort the DataFrame by gender and then by age in reverse order for age (oldest to youngest)
sorted_df = df.sort_values(by=['gender', 'age'], ascending=[True, False])
# Print the sorted DataFrame
print(sorted_df)
Output:
id gender age
0 1 F 68
2 3 F 49
4 5 F 36
1 2 M 54
3 4 M 28
stack (reshape from wide to long format)
Input:
# Reshape from wide to long format (disclude 'id' column)
long_df = pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')
# Print the reshaped DataFrame
print(long_df)
Output:
id variable value
0 1 gender F
1 2 gender M
2 3 gender F
3 4 gender M
4 5 gender F
5 1 age 68
6 2 age 54
7 3 age 49
8 4 age 28
9 5 age 36
10 1 weight 100
11 2 weight 120
12 3 weight 150
13 4 weight 175
14 5 weight 300
15 1 height 62
16 2 height 60
17 3 height 61
18 4 height 63
19 5 height 64
20 1 bmi 18.288241
21 2 bmi 23.433333
22 3 bmi 28.339156
23 4 bmi 30.996473
24 5 bmi 51.489258
unstack (reshape from long to wide format)
Input:
# Unstack the DataFrame to get back to wide format based on "id"
wide_df = long_df.pivot(index='id', columns='variable', values='value')
# Print the reshaped DataFrame
print(wide_df)
Output:
variable age bmi gender height weight
id
1 68 18.288241 F 62 100
2 54 23.433333 M 60 120
3 49 28.339156 F 61 150
4 28 30.996473 M 63 175
5 36 51.489258 F 64 300
Traversing DataFrame (for loops)
sort
Input:
# Get number of rows and columns
nrows, ncols = df.shape
print(f"(nrows, ncols) = ({nrows}, {ncols})")
# Use nested for loop to get information from the DataFrame by row and column
for row in range(nrows):
for col in range(ncols):
print(f"value for row {row+1} and col {col+1} is {df.iloc[row, col]}")
Output:
(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125
Exercises
Analyzing Health Datasets with Pandas in Python- Forthcoming!