This page provides examples of using the pandas package in Python, demonstrating the syntax and common functions within the package.
Example
Install and Load Pandas
# Install pandas first if it is not already available (run in a terminal):
#   pip install pandas
# Load the pandas package
import pandas as pd
Create Dataframe
# Import pandas
import pandas as pd
# Create data as key-value pairs
data = {'id': [1, 2, 3, 4, 5],
        'gender': ["F", "M", "F", "M", "F"],
        'age': [68, 54, 49, 28, 36]}
# Put the data into a data frame
df = pd.DataFrame(data)
Display Dataframe
Input:
# display dataframe
print(df)
Output:
   id gender  age
0   1      F   68
1   2      M   54
2   3      F   49
3   4      M   28
4   5      F   36
First two rows of dataframe:
Input:
print(df.head(2))
Output:
   id gender  age
0   1      F   68
1   2      M   54
Last two rows of dataframe:
Input:
print(df.tail(2))
Output:
   id gender  age
3   4      M   28
4   5      F   36
Describe Dataframe
Dataframe size:
Input:
# dataframe size (rows, columns)
print(df.shape)
Output:
(5, 3)
Dataframe column names:
Input:
# dataframe column names
print(df.columns)
Output:
Index(['id', 'gender', 'age'], dtype='object')
Dataframe description:
Input:
# describe dataframe
print(df.describe())
Output:
             id        age
count  5.000000   5.000000
mean   3.000000  47.000000
std    1.581139  15.620499
min    1.000000  28.000000
25%    2.000000  36.000000
50%    3.000000  49.000000
75%    4.000000  54.000000
max    5.000000  68.000000
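By default, describe() summarizes only the numeric columns, which is why gender does not appear in the output above. The following is a minimal sketch, rebuilding the same example data, of one way to include text columns as well. The variable names summary and gender_summary are illustrative and not part of the original page.

```python
import pandas as pd

# Same example data as in the "Create Dataframe" section
data = {'id': [1, 2, 3, 4, 5],
        'gender': ["F", "M", "F", "M", "F"],
        'age': [68, 54, 49, 28, 36]}
df = pd.DataFrame(data)

# include='all' adds non-numeric columns such as gender to the summary;
# statistics that do not apply to a column are shown as NaN
summary = df.describe(include='all')
print(summary)

# For a text column on its own, describe() reports
# count, unique, top (most frequent value) and freq (its count)
gender_summary = df['gender'].describe()
print(gender_summary)
```

This can be a quick way to check for unexpected categories or missing values in text columns before filtering on them.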
Accessing DataFrames
Get "age" column (different ways to call the column)
Input:
# call by column name
print(df['age'])
# get column by column number (positions start at 0, so 2 is the third column)
print(df.iloc[:, 2])
# print out all rows and only columns 1 (id) and 3 (age)
print("Using column names:\n")
print(df[['id', 'age']])
print("")
print("Using column numbers:\n")
print(df.iloc[:, [0, 2]])
Output:
Using column names:

   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36

Using column numbers:

   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36
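Selecting by name and selecting by position return the same data here. As a hedged sketch, the label-based counterpart of .iloc is .loc, which takes row and column labels instead of integer positions. The names by_label and by_position are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'gender': ["F", "M", "F", "M", "F"],
                   'age': [68, 54, 49, 28, 36]})

# .loc selects by label: all rows (:) and the named columns
by_label = df.loc[:, ['id', 'age']]

# .iloc selects by integer position: all rows and columns 0 and 2
by_position = df.iloc[:, [0, 2]]

# Both selections contain the same columns and values
print(by_label.equals(by_position))
```

Label-based selection tends to be more robust when columns may be reordered, while position-based selection is convenient when names are unknown.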
Get subset (all rows meeting specified criteria - numbers)
Input:
# print out all rows where age is greater than 50
print(df[df['age'] > 50])
Output:
   id gender  age
0   1      F   68
1   2      M   54
Get subset (all rows meeting specified criteria - strings)
Input:
# print out all rows where gender is female ("F")
print(df[df['gender'] == 'F'])
Output:
   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
Get subset (all rows meeting specified criteria)
Input:
# print out all rows where gender is female ("F") and age is between 25-50
print(df[(df['gender'] == 'F') & (df['age'] > 25) & (df['age'] < 50)])
Output:
   id gender  age
2   3      F   49
4   5      F   36
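When several conditions are combined with &, each comparison needs its own parentheses because & binds more tightly than == or >. As an alternative sketch, DataFrame.query() accepts the same filter as a string expression, which some readers find easier to scan. The names mask_result and query_result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'gender': ["F", "M", "F", "M", "F"],
                   'age': [68, 54, 49, 28, 36]})

# Boolean-mask form: each condition is wrapped in parentheses
mask_result = df[(df['gender'] == 'F') & (df['age'] > 25) & (df['age'] < 50)]

# query() form: column names are used directly inside the string
query_result = df.query("gender == 'F' and age > 25 and age < 50")

print(query_result)
```

Both forms select the same rows; the choice between them is mostly a matter of readability.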
Add Column
New columns with specified values
Input:
# add a column for weight
df['weight'] = [100, 120, 150, 175, 300]
# add a column for height
df['height'] = [62, 60, 61, 63, 64]
print(df)
print("")
print("Describe dataframe to see column names and summary:\n")
print(df.describe())
Output:
   id gender  age  weight  height
0   1      F   68     100      62
1   2      M   54     120      60
2   3      F   49     150      61
3   4      M   28     175      63
4   5      F   36     300      64

Describe dataframe to see column names and summary:

             id        age      weight     height
count  5.000000   5.000000    5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000
std    1.581139  15.620499   78.612976   1.581139
min    1.000000  28.000000  100.000000  60.000000
25%    2.000000  36.000000  120.000000  61.000000
50%    3.000000  49.000000  150.000000  62.000000
75%    4.000000  54.000000  175.000000  63.000000
max    5.000000  68.000000  300.000000  64.000000
New column with calculated value
Input:
# Add a column with calculated BMI
df['bmi'] = (df['weight'] / df['height']**2) * 703
# Print the DataFrame
print(df)
print()
# Print summary statistics of the DataFrame
print("Describe dataframe to see new bmi column and summary:\n")
print(df.describe())
Output:
   id gender  age  weight  height        bmi
0   1      F   68     100      62  18.288241
1   2      M   54     120      60  23.433333
2   3      F   49     150      61  28.339156
3   4      M   28     175      63  30.996473
4   5      F   36     300      64  51.489258

Describe dataframe to see new bmi column and summary:

             id        age      weight     height        bmi
count  5.000000   5.000000    5.000000   5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000  30.509292
std    1.581139  15.620499   78.612976   1.581139  12.693789
min    1.000000  28.000000  100.000000  60.000000  18.288241
25%    2.000000  36.000000  120.000000  61.000000  23.433333
50%    3.000000  49.000000  150.000000  62.000000  28.339156
75%    4.000000  54.000000  175.000000  63.000000  30.996473
max    5.000000  68.000000  300.000000  64.000000  51.489258
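The factor 703 is the usual conversion for BMI when weight is in pounds and height is in inches. The page does not state the units, so that is an assumption in the sketch below. It wraps the same formula in a small helper, bmi_imperial (a hypothetical name), and uses assign() to add the column without modifying the original DataFrame.

```python
import pandas as pd

def bmi_imperial(weight_lb, height_in):
    """BMI from weight in pounds and height in inches (units are assumed)."""
    return weight_lb / height_in ** 2 * 703

df = pd.DataFrame({'weight': [100, 120, 150, 175, 300],
                   'height': [62, 60, 61, 63, 64]})

# assign() returns a new DataFrame and leaves df unchanged
df_with_bmi = df.assign(bmi=bmi_imperial(df['weight'], df['height']))

# Round only for display; the stored values keep full precision
print(df_with_bmi.round(2))
```

Because the helper works on whole columns at once, no explicit loop over rows is needed.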
Get counts/frequency
Input:
# Get counts of males and females in the DataFrame
gender_counts = df['gender'].value_counts().reset_index()
gender_counts.columns = ['gender', 'N']
# Print the result
print(gender_counts)
Output:
  gender  N
0      F  3
1      M  2
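Counts are often reported alongside proportions. As a small sketch on the same gender column, value_counts() accepts normalize=True to return relative frequencies instead of counts. The name gender_share is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'gender': ["F", "M", "F", "M", "F"]})

# normalize=True returns the share of each value instead of the count
gender_share = df['gender'].value_counts(normalize=True)
print(gender_share)

# Multiply by 100 for percentages
print((gender_share * 100).round(1))
```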
Transform DataFrame
sort
Input:
# Sort the DataFrame by gender, then by age in descending order (oldest to youngest)
sorted_df = df.sort_values(by=['gender', 'age'], ascending=[True, False])
# Print the sorted DataFrame
print(sorted_df)
Output:
   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
1   2      M   54
3   4      M   28
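Note that the sorted rows keep their original index labels (0, 2, 4, 1, 3). If a fresh 0-based index is wanted after sorting, one option is reset_index(drop=True), sketched below on the same data. The name renumbered is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'gender': ["F", "M", "F", "M", "F"],
                   'age': [68, 54, 49, 28, 36]})

sorted_df = df.sort_values(by=['gender', 'age'], ascending=[True, False])

# drop=True discards the old index instead of keeping it as a new column
renumbered = sorted_df.reset_index(drop=True)
print(renumbered)
```

Keeping the original labels is useful when rows must be matched back to the unsorted data; resetting is convenient when position in the sorted order matters.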
stack (reshape from wide to long format)
Input:
# Reshape from wide to long format ('id' is kept as the identifier column and is not melted)
long_df = pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')
# Print the reshaped DataFrame
print(long_df)
Output:
    id variable      value
0    1   gender          F
1    2   gender          M
2    3   gender          F
3    4   gender          M
4    5   gender          F
5    1      age         68
6    2      age         54
7    3      age         49
8    4      age         28
9    5      age         36
10   1   weight        100
11   2   weight        120
12   3   weight        150
13   4   weight        175
14   5   weight        300
15   1   height         62
16   2   height         60
17   3   height         61
18   4   height         63
19   5   height         64
20   1      bmi  18.288241
21   2      bmi  23.433333
22   3      bmi  28.339156
23   4      bmi  30.996473
24   5      bmi  51.489258
unstack (reshape from long to wide format)
Input:
# Unstack the DataFrame to get back to wide format based on "id"
wide_df = long_df.pivot(index='id', columns='variable', values='value')
# Print the reshaped DataFrame
print(wide_df)
Output:
variable age        bmi gender height weight
id
1         68  18.288241      F     62    100
2         54  23.433333      M     60    120
3         49  28.339156      F     61    150
4         28  30.996473      M     63    175
5         36  51.489258      F     64    300
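The pivoted result is not an exact copy of the starting DataFrame: id has become the index, the columns are in alphabetical order, and because text and numbers shared one value column, the numeric columns generally come back with a generic dtype. The sketch below, on a reduced version of the data, shows one possible way to restore the original layout. The name restored is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'gender': ["F", "M", "F"],
                   'age': [68, 54, 49]})

long_df = pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')
wide_df = long_df.pivot(index='id', columns='variable', values='value')

# Inspect the column types after the round trip
print(wide_df.dtypes)

# Turn 'id' back into a regular column and drop the leftover 'variable' label
restored = wide_df.reset_index()
restored.columns.name = None

# Convert the numeric column back to a numeric dtype
restored['age'] = pd.to_numeric(restored['age'])

# Put the columns back in their original order
restored = restored[['id', 'gender', 'age']]
print(restored)
```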
Traversing DataFrame (for loops)
nested for loop
Input:
# Get number of rows and columns
nrows, ncols = df.shape
print(f"(nrows, ncols) = ({nrows}, {ncols})")
# Use nested for loop to get information from the DataFrame by row and column
for row in range(nrows):
    for col in range(ncols):
        print(f"value for row {row+1} and col {col+1} is {df.iloc[row, col]}")
Output:
(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125
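Index-based nested loops work, but pandas also provides row iterators that avoid manual position bookkeeping. The following is a sketch on a reduced version of the data using iterrows() and itertuples(). The variable name ages is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'gender': ["F", "M", "F"],
                   'age': [68, 54, 49]})

# iterrows() yields (index label, row as a Series) pairs
for index, row in df.iterrows():
    print(f"row {index + 1}: id={row['id']}, gender={row['gender']}, age={row['age']}")

# itertuples() yields one namedtuple per row; fields are accessed by attribute
ages = [row.age for row in df.itertuples(index=False)]
print(ages)

# Many tasks need no loop at all: column operations act on every row at once
print(df['age'].mean())
```

For large DataFrames, column-wise operations are usually much faster than any row-by-row loop, so explicit iteration is best kept for cases where each row needs separate handling.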
Exercises
Analyzing Health Datasets with Pandas in Python - Forthcoming!