LogoLogo
Computing Skills
Computing Skills
  • Introduction
  • File Directory Structures
  • Text Editors
  • GitHub
  • Unix
  • Julia
    • Installation
    • REPL
    • Basic Syntax
    • Numbers and Math
    • Strings and Characters
    • Regular Expressions
    • Control Flow
    • Collections and Data Structures
    • File Input/Output
    • Packages
    • DataFrames
    • JuliaPlots
    • ScikitLearn.jl
    • JuliaStats
    • Exercises
  • Python
    • Installation
    • REPL
    • Basic Syntax
    • Numbers and Math
    • Strings and Characters
    • Regular Expressions
    • Control Flow
    • Collections and Data Structures
    • File Input/Output
    • Packages
    • Data Frames and Data Manipulation
  • R
    • Installation
    • REPL
    • Basic Syntax
    • Numbers and Math
    • Strings and Characters
    • Regular Expression
    • Control Flow
    • Collections and Data Structures
    • File Input/Output
    • Packages
    • DataFrames
    • Data Analysis and Manipulation
Powered by GitBook
On this page
  • Example
  • Exercises
  • Resources
Export as PDF
  1. Python

Data Frames and Data Manipulation

This page provides examples of using the pandas package in Python, demonstrating the syntax and common functions within the package.

Example

Install and Load Pandas

# Load the pandas package
import pandas as pd

Create Dataframe

# Import pandas
import pandas as pd

# Create data as key-value pairs
data = {'id': [1,2,3,4,5],
        'gender': ["F", "M", "F", "M", "F"],
        'age': [68, 54, 49, 28, 36]}
        
# Put the data into a data frame
df = pd.DataFrame(data)

Display Dataframe

Input:

# display dataframe
print(df)

Output:

  id gender  age
0   1      F   68
1   2      M   54
2   3      F   49
3   4      M   28
4   5      F   36

First two lines of dataframe:

Input:

print(df.head(2))

Output:

 id gender  age
0   1      F   68
1   2      M   54

Last two lines of dataframe:

Input:

println(df.tail(2))

Output:

  id gender  age
3   4      M   28
4   5      F   36

Describe Dataframe

Dataframe size:

Input:

# dataframe size
print(df.shape)

Output:

(5, 3)

Dataframe column names:

Input:

# dataframe column names
print(df.columns)

Output:

Index(['id', 'gender', 'age'], dtype='object')

Dataframe description:

Input:

# describe dataframe
print(df.describe())

Output:

 id        age
count  5.000000   5.000000
mean   3.000000  47.000000
std    1.581139  15.620499
min    1.000000  28.000000
25%    2.000000  36.000000
50%    3.000000  49.000000
75%    4.000000  54.000000
max    5.000000  68.000000

Accessing DataFrames

Get "age" column (different ways to call the column)

Input:

# call by column name
print(df['age'])

# get column by column number
println(df.iloc[:, 2])

Output:

0    68
1    54
2    49
3    28
4    36
Name: age, dtype: int64

0    68
1    54
2    49
3    28
4    36
Name: age, dtype: int64

Get row

Input:

# print row 2
print(df.iloc[1])

Output:

id         2
gender     M
age       54
Name: 1, dtype: object

Get element

Input:

# get element in row 2, column 3
print(df.iloc[1,2])

Output:

54

Get subset (specific rows and all columns)

Input:

# print out rows 1, 3, & 5
print(df.iloc[[0, 2, 4], :])

Output:

   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36

Get subset (all rows and specific columns)

Input:

# print out all rows and only columns 1 (id) and 3 (age)
print("Using column names:\n")
print(df[['id', 'age']])
print("")

print("Using column numbers:\n")
print(df.iloc[:, [0, 2]])

Output:

Using column names:

   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36


Using column numbers:


   id  age
0   1   68
1   2   54
2   3   49
3   4   28
4   5   36

Get subset (all rows meeting specified criteria - numbers)

Input:

# print out all rows where age is greater than 50
print(df[df['age'] > 50])

Output:

   id gender  age
0   1      F   68
1   2      M   54

Get subset (all rows meeting specified criteria - strings)

Input:

# print out all rows where gender is female ("F")
print(df[df['gender'] == 'F'])

Output:

   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36

Get subset (all rows meeting specified criteria)

Input:

# print out all rows where gender is female ("F") and age is between 25-50
print(df[(df['gender'] == 'F') & (df['age'] > 25) & (df['age'] < 50)])

Output:

   id gender  age
2   3      F   49
4   5      F   36

Add Column

New columns with specified values

Input:

# add a column for weight
df['weight'] = [100, 120, 150, 175, 300]

# add a column for height
df['height'] = [62, 60, 61, 63, 64]

print(df)
print("")

print("Describe dataframe to see column names and summary:\n")
print(df.describe())

Output:

   id gender  age  weight  height
0   1      F   68     100      62
1   2      M   54     120      60
2   3      F   49     150      61
3   4      M   28     175      63
4   5      F   36     300      64

Describe dataframe to see column names and summary:

             id        age      weight     height
count  5.000000   5.000000    5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000
std    1.581139  15.620499   78.612976   1.581139
min    1.000000  28.000000  100.000000  60.000000
25%    2.000000  36.000000  120.000000  61.000000
50%    3.000000  49.000000  150.000000  62.000000
75%    4.000000  54.000000  175.000000  63.000000
max    5.000000  68.000000  300.000000  64.000000

New column with calculated value

Input:

# Add a column with calculated BMI
df['bmi'] = (df['weight'] / df['height']**2) * 703

# Print the DataFrame
print(df)
print()

# Print summary statistics of the DataFrame
print("Describe dataframe to see new bmi column and summary:\n")
print(df.describe())

Output:

   id gender  age  weight  height        bmi
0   1      F   68     100      62  18.288241
1   2      M   54     120      60  23.433333
2   3      F   49     150      61  28.339156
3   4      M   28     175      63  30.996473
4   5      F   36     300      64  51.489258

Describe dataframe to see new bmi column and summary:

             id        age      weight     height        bmi
count  5.000000   5.000000    5.000000   5.000000   5.000000
mean   3.000000  47.000000  169.000000  62.000000  30.509292
std    1.581139  15.620499   78.612976   1.581139  12.693789
min    1.000000  28.000000  100.000000  60.000000  18.288241
25%    2.000000  36.000000  120.000000  61.000000  23.433333
50%    3.000000  49.000000  150.000000  62.000000  28.339156
75%    4.000000  54.000000  175.000000  63.000000  30.996473
max    5.000000  68.000000  300.000000  64.000000  51.489258

Get counts/frequency

Input:

# Get counts of males and females in the DataFrame
gender_counts = df['gender'].value_counts().reset_index()
gender_counts.columns = ['gender', 'N']

# Print the result
print(gender_counts)
```

Output:

  gender  N
0      F  3
1      M  2

Transform DataFrame

sort

Input:

# Sort the DataFrame by gender and then by age in reverse order for age (oldest to youngest)
sorted_df = df.sort_values(by=['gender', 'age'], ascending=[True, False])

# Print the sorted DataFrame
print(sorted_df)

Output:

   id gender  age
0   1      F   68
2   3      F   49
4   5      F   36
1   2      M   54
3   4      M   28

stack (reshape from wide to long format)

Input:

# Reshape from wide to long format (disclude 'id' column)
long_df = pd.melt(df, id_vars=['id'], var_name='variable', value_name='value')

# Print the reshaped DataFrame
print(long_df)

Output:

       id variable      value
0    1   gender          F
1    2   gender          M
2    3   gender          F
3    4   gender          M
4    5   gender          F
5    1      age         68
6    2      age         54
7    3      age         49
8    4      age         28
9    5      age         36
10   1   weight        100
11   2   weight        120
12   3   weight        150
13   4   weight        175
14   5   weight        300
15   1   height         62
16   2   height         60
17   3   height         61
18   4   height         63
19   5   height         64
20   1      bmi  18.288241
21   2      bmi  23.433333
22   3      bmi  28.339156
23   4      bmi  30.996473
24   5      bmi  51.489258

unstack (reshape from long to wide format)

Input:

# Unstack the DataFrame to get back to wide format based on "id"
wide_df = long_df.pivot(index='id', columns='variable', values='value')

# Print the reshaped DataFrame
print(wide_df)

Output:

variable age        bmi gender height weight
id                                          
1         68  18.288241      F     62    100
2         54  23.433333      M     60    120
3         49  28.339156      F     61    150
4         28  30.996473      M     63    175
5         36  51.489258      F     64    300

Traversing DataFrame (for loops)

sort

Input:

# Get number of rows and columns
nrows, ncols = df.shape
print(f"(nrows, ncols) = ({nrows}, {ncols})")

# Use nested for loop to get information from the DataFrame by row and column
for row in range(nrows):
    for col in range(ncols):
        print(f"value for row {row+1} and col {col+1} is {df.iloc[row, col]}")

Output:

(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125

Exercises

  • Analyzing Health Datasets with Pandas in Python- Forthcoming!

Resources

PreviousPackagesNextR

Last updated 5 months ago

Python Pandas:

W3 Schools:

Geeks for Geeks:

Pandas Documentation
Pandas Tutorial
Pandas Introduction
Pandas Cheat Sheet