Data Frames and Data Manipulation

This page provides examples of using the pandas package in Python, demonstrating the syntax and common functions within the package.

Example

Install and Load Pandas

# Load the pandas package
import pandas as pd

Create Dataframe

# Import pandas
import pandas as pd

# Create data as key-value pairs
data = {'id': [1,2,3,4,5],
        'gender': ["F", "M", "F", "M", "F"],
        'age': [68, 54, 49, 28, 36]}
        
# Put the data into a data frame
df = pd.DataFrame(data)

Display Dataframe

Input:

# display dataframe
print(df)

Output:

  id gender  age
0   1      F   68
1   2      M   54
2   3      F   49
3   4      M   28
4   5      F   36

First two lines of dataframe:

Input:

print(df.head(2))

Output:

 id gender  age
0   1      F   68
1   2      M   54

Last two lines of dataframe:

Input:

println(df.tail(2))

Output:

  id gender  age
3   4      M   28
4   5      F   36

Describe Dataframe

Dataframe size:

Input:

# dataframe size
print(df.shape)

Output:

(5, 3)

Dataframe column names:

Input:

# dataframe column names
print(df.columns)

Output:

Index(['id', 'gender', 'age'], dtype='object')

Dataframe description:

Input:

# describe dataframe
print(df.describe())

Output:

 id        age
count  5.000000   5.000000
mean   3.000000  47.000000
std    1.581139  15.620499
min    1.000000  28.000000
25%    2.000000  36.000000
50%    3.000000  49.000000
75%    4.000000  54.000000
max    5.000000  68.000000

Accessing DataFrames

Get "age" column (different ways to call the column)

Input:

# call by column name
print(df['age'])

# get column by column number
println(df.iloc[:, 2])

Output:

0    68
1    54
2    49
3    28
4    36
Name: age, dtype: int64

0    68
1    54
2    49
3    28
4    36
Name: age, dtype: int64

Get row

Input:

# print row 2
print(df.iloc[1])

Output:

id         2
gender     M
age       54
Name: 1, dtype: object

Get element

Input:

# get element in row 2, column 3
print(df.iloc[1,2])

Output:

54

Get subset (specific rows and all columns)

Input:

# print out rows 1, 3, & 5
print(df.iloc[[0, 2, 4], :])

Output:

   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36

Get subset (all rows and specific columns)

Input:

# print out all rows and only columns 1 (id) and 3 (age)
print("Using column names:\n")
print(df[['id', 'age']])
print("")

print("Using column numbers:\n")
print(df.iloc[:, [0, 2]])

Output:

Using column names:

   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36


Using column numbers:


   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36

Get subset (all rows meeting specified criteria - numbers)

Input:

# print out all rows where age is greater than 50
print(df[df['age'] > 50])

Output:

   id gender  age
0   1      F   68
1   2      M   54

Get subset (all rows meeting specified criteria - strings)

Input:

# print out all rows where gender is female ("F")
print(df[df['gender'] == 'F'])

Output:

   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36

Get subset (all rows meeting specified criteria)

Input:

# print out all rows where gender is female ("F") and age is between 25-50
print(df[(df['gender'] == 'F') & (df['age'] > 25) & (df['age'] < 50)])

Output:

   id gender  age
2   3      F   49
4   5      F   36

Add Column

New columns with specified values

Input:

# add a column for weight
df['weight'] = [100, 120, 150, 175, 300]

# add a column for height
df['height'] = [62, 60, 61, 63, 64]

print(df)
print("")

print("Describe dataframe to see column names and summary:\n")
print(df.describe())

Output:

   id gender  age  weight  height
0   1      F   68     100      62
1   2      M   54     120      60
2   3      F   49     150      61
3   4      M   28     175      63
4   5      F   36     300      64

Describe dataframe to see column names and summary:

             id        age      weight     height
count  5.000000   5.000000    5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000
std    1.581139  15.620499   78.612976   1.581139
min    1.000000  28.000000  100.000000  60.000000
25%    2.000000  36.000000  120.000000  61.000000
50%    3.000000  49.000000  150.000000  62.000000
75%    4.000000  54.000000  175.000000  63.000000
max    5.000000  68.000000  300.000000  64.000000

New column with calculated value

Input:

# Add a column with calculated BMI
df['bmi'] = (df['weight'] / df['height']**2) * 703

# Print the DataFrame
print(df)
print()

# Print summary statistics of the DataFrame
print("Describe dataframe to see new bmi column and summary:\n")
print(df.describe())

Output:

   id gender  age  weight  height        bmi
0   1      F   68     100      62  18.288241
1   2      M   54     120      60  23.433333
2   3      F   49     150      61  28.339156
3   4      M   28     175      63  30.996473
4   5      F   36     300      64  51.489258

Describe dataframe to see new bmi column and summary:

             id        age      weight     height        bmi
count  5.000000   5.000000    5.000000   5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000  30.509292
std    1.581139  15.620499   78.612976   1.581139  12.693789
min    1.000000  28.000000  100.000000  60.000000  18.288241
25%    2.000000  36.000000  120.000000  61.000000  23.433333
50%    3.000000  49.000000  150.000000  62.000000  28.339156
75%    4.000000  54.000000  175.000000  63.000000  30.996473
max    5.000000  68.000000  300.000000  64.000000  51.489258

Get counts/frequency

Input:

# Get counts of males and females in the DataFrame
gender_counts = df['gender'].value_counts().reset_index()
gender_counts.columns = ['gender', 'N']

# Print the result
print(gender_counts)
```

Output:

  gender  N
0      F  3
1      M  2

Transform DataFrame

sort

Input:

# Sort the DataFrame by gender and then by age in reverse order for age (oldest to youngest)
sorted_df = df.sort_values(by=['gender', 'age'], ascending=[True, False])

# Print the sorted DataFrame
print(sorted_df)

Output:

   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
1   2      M   54
3   4      M   28

stack (reshape from wide to long format)

Input:

# Reshape from wide to long format (disclude 'id' column)
long_df = pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')

# Print the reshaped DataFrame
print(long_df)

Output:

       id variable      value
0    1   gender          F
1    2   gender          M
2    3   gender          F
3    4   gender          M
4    5   gender          F
5    1      age         68
6    2      age         54
7    3      age         49
8    4      age         28
9    5      age         36
10   1   weight        100
11   2   weight        120
12   3   weight        150
13   4   weight        175
14   5   weight        300
15   1   height         62
16   2   height         60
17   3   height         61
18   4   height         63
19   5   height         64
20   1      bmi  18.288241
21   2      bmi  23.433333
22   3      bmi  28.339156
23   4      bmi  30.996473
24   5      bmi  51.489258

unstack (reshape from long to wide format)

Input:

# Unstack the DataFrame to get back to wide format based on "id"
wide_df = long_df.pivot(index='id', columns='variable', values='value')

# Print the reshaped DataFrame
print(wide_df)

Output:

variable age        bmi gender height weight
id                                          
1         68  18.288241      F     62    100
2         54  23.433333      M     60    120
3         49  28.339156      F     61    150
4         28  30.996473      M     63    175
5         36  51.489258      F     64    300

Traversing DataFrame (for loops)

sort

Input:

# Get number of rows and columns
nrows, ncols = df.shape
print(f"(nrows, ncols) = ({nrows}, {ncols})")

# Use nested for loop to get information from the DataFrame by row and column
for row in range(nrows):
    for col in range(ncols):
        print(f"value for row {row+1} and col {col+1} is {df.iloc[row, col]}")

Output:

(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125

Exercises

  • Analyzing Health Datasets with Pandas in Python- Forthcoming!

Resources

Last updated