Computing Skills

Introduction

This chapter provides instructions and examples for using computing skills in health data and technology research.

File Directory Structures

All major operating systems organize files into hierarchical directories. Understanding these file directory structures is vital when interacting with data files using Unix commands or a programming language.

This page describes file directory structures generally as well as some of the differences between file directory structures within different operating systems.

Hierarchical Structure

Directories allow users to group files into an organized structure. They are typically visualized like root systems of trees, the highest level of which is called the "root directory". Subdirectories branch down from the root directory, containing files as well as additional subdirectories.

Directories and files are typically described using the path used to reach them through the directory structure, starting with the root directory. In Linux and macOS, the root directory is indicated as "/". In Windows, each drive has its own root directory, indicated by the drive letter followed by "\" (e.g., "C:\"). An additional "/" (or "\" in Windows) is placed between each object in the path.

For example, looking at Figure 1, File_B1a2 could be described with:

/Directory_B/Directory_B1/Directory_B1a/File_B1a2
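
As a minimal sketch of how such a path can be built and taken apart in code, the Julia snippet below uses the Base functions joinpath, basename, dirname, and splitpath. The directory and file names are the illustrative ones from the example above, not real files, and the snippet assumes a Linux or macOS system where "/" is the root directory.

```julia
# Build the example path from its parts (names are illustrative only)
path = joinpath("/", "Directory_B", "Directory_B1", "Directory_B1a", "File_B1a2")
println(path)

# The last component is the file; everything before it is the containing directory
file_name = basename(path)
parent_dir = dirname(path)
println(file_name)
println(parent_dir)

# Split the path into each level of the hierarchy, starting at the root
parts = splitpath(path)
println(parts)
```

Building paths with joinpath rather than by typing separators by hand lets the same code use the correct separator on each operating system.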

GUI

All major operating systems also provide users with a graphical user interface, or GUI (often pronounced "gooey"), which allows interaction with software and files through visual icons. If you are not already familiar with accessing files and directories through the command line, you are likely familiar with using a GUI file system. While not the recommended method for interacting with files while programming, the GUI file system can be a useful tool for visualizing a directory structure.

Figure 2 displays the GUI file system for a computer running macOS. Though the GUI directory structure is visualized horizontally, the "root system" is still clearly visible. Using its complete path, the file "medication_data" is described as:

/Users/<username>/Documents/project_a/data_files/medication_data

Text Editors

Programs are written using text editor applications. These applications allow users to create and edit plain text, which can then be run as programs. Text editors differ in complexity, and some include extra functionality for easier, more efficient programming. Text editors with auto-complete suggest common functions or existing variables as the programmer begins to type, which the programmer can then select without needing to finish typing. Some text editors offer options to run individual lines of code or entire programs while editing files.

Popular Text Editors

Microsoft Visual Studio Code (recommended)

Available for Mac, Windows, and Linux operating systems
Includes support for debugging, syntax highlighting, auto-complete, and additional user-friendly functionality
Download Microsoft Visual Studio Code

Jupyter Notebooks

Web application text editor, no download necessary
Includes options for interactive output (HTML, images, videos, LaTeX, and custom MIME types), support for big data tools, such as Apache Spark, and options for sharing notebooks with others
Run individual lines of code or entire programs at once
Get started with Jupyter Notebooks

VI/VIM

Highly configurable
Included in most Unix-like operating systems (e.g., Linux or macOS), no download necessary
Write files from the Terminal
Interactive VIM Tutorial

Emacs

Highly configurable
Included in most Unix-like operating systems (e.g., Linux or macOS), no download necessary; also available for Windows
Wide range of built-in features for text editing, such as syntax highlighting, automatic indentation, and search and replace
Learn more about Emacs

Pico

Included in most Unix-like operating systems (e.g., Linux or macOS), no download necessary
Most of the editing commands are displayed at the bottom of the editing screen for easy reference
Learn more about Pico

GitHub

GitHub is a code hosting platform that allows developers to create, store, manage, and share their code. It uses Git software, providing the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project. Refer to GitHub's own documentation for additional guidance and tutorials.

Saving and Retrieving Code Changes

Like other cloud platforms (e.g., Google Docs), GitHub allows users to work on projects together. Unlike those platforms, however, GitHub does not automatically save your work; code changes must be saved manually. To save changes, open the Terminal application, navigate to the cloned repository, and run the following commands, replacing "INSERT PROGRESS NOTE" with a brief description of your changes.

git add -A : stages all of your code changes so they are included in the next commit
git commit -m "INSERT PROGRESS NOTE" : records the staged changes in your local repository along with a note that you and your team can reference later. This note should be brief and informative, describing the purpose of your code changes.
git push : uploads your committed changes to the GitHub repository.

If multiple users are pushing code changes to your GitHub repository, make sure to retrieve or "pull" these edits before you begin making code changes. To do so, open the Terminal application, navigate to the cloned repository, and run the following command. If you have already made code changes, you will need to commit them first for the pull to work.

git pull : retrieves the latest code changes from the GitHub repository and merges them into your local copy

Pulling before you make any edits helps keep your team from encountering "merge conflicts", which can be difficult to troubleshoot. To further reduce merge conflicts, communicate with your team: let them know whenever you push new code changes so that everyone is always working on the most up-to-date version of the code.

Resolving Merge Conflicts

Merge conflicts happen when you attempt to merge code branches that have competing commits. They are often caused by users making code changes without pulling first. To resolve a merge conflict, work through the following steps:

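Identify the location of the merge conflict (git status lists the conflicted files, and Git marks the competing changes inside each file with <<<<<<<, =======, and >>>>>>> lines).
Manually edit the conflicted file from a single machine, selecting the changes you want to keep in the final merge and deleting the conflict markers.
Stage and commit the resolved file, then push the changes to GitHub.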

All team members should pull the corrected changes from GitHub before continuing to make code changes.

Julia

Julia is an open source dynamic programming language for high-level, high-performance numerical computing [1]. Julia provides ease and expressiveness (similar to R, MATLAB, and Python), but also supports general programming [2].

Development of Julia began in 2009, and the first version was released in February 2012. The current version of Julia is 1.11 (as of November 2024).

References

Resources

Julia Performance Metrics
ThinkJulia: How to Think Like a Computer Scientist
Julia Data Science
Learn X in Y Minutes: X=Julia
Julia Cheat Sheet
Introducing Julia Wikibook
JuliaHealth

Installation

Instructions for installing Julia on macOS and Windows operating systems can be found in the Julia documentation (see Resources at the end of this section).

Package managers such as Homebrew (macOS and Linux) and Chocolatey (Windows) can be used to facilitate installation.

For most users, it is recommended to download the current stable release from https://julialang.org/downloads/.

Some developers might wish to use a different version, or to switch between versions. For this, the Juliaup version manager can be useful.

Julia is also available for use in Brown's Computing Environments:

Oscar (for high-performance computing)
Stronghold (for secure computing)

Resources

Julia Documentation: Installation

REPL

Julia comes with a full-featured interactive command-line REPL (read-eval-print loop) built into the julia executable. In addition to allowing quick and easy evaluation of Julia statements, it has a searchable history, tab-completion, many helpful keybindings, and dedicated help (?) and shell (;) modes. [1]

This page provides examples of using REPL on the command line.

Julia REPL Example (local)

Type julia in terminal to launch REPL

Julia REPL Help Pages (local)

Type "?" to enter help pages within REPL
Type a function from Julia to read help pages (ex: println)

References

Julia Contributors. (n.d.). REPL - Standard Library - Julia Language. Retrieved May 1, 2024, from https://docs.julialang.org/en/v1/stdlib/REPL/

Basic Syntax

"Hello, World!" Program

This is the typical first program for those new to a general-purpose programming language like Julia. It can be used to test that the installation of Julia is working and to introduce Julia's basic syntax, either by using the REPL environment or by running code written in a text editor at the command line.
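
A minimal sketch of such a program follows. The file name hello.jl is an assumption for illustration; the line can be typed directly at the julia> prompt or saved to a file and run with julia hello.jl at the command line.

```julia
# hello.jl
# Print a greeting followed by a newline
println("Hello, World!")
```

If Julia is installed correctly, the greeting appears in the terminal.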

Input:

Output:

Here are variations of the "Hello, World!" program using variables and different print statements.
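
As an illustrative sketch (the variable names are chosen here for the example), the same greeting can be produced with variables, string interpolation, and both print() and println():

```julia
# Store the pieces of the greeting in variables
greeting = "Hello"
name = "World"

# println adds a newline; $(...) inserts a variable's value into the string
println("$(greeting), $(name)!")

# print does not add a newline, so these two calls share one line
print(greeting, ", ")
print(name, "!\n")

# string() concatenates its arguments into a new string
message = string(greeting, ", ", name, "!")
println(message)
```

Each of the three approaches writes the same line of text.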

Input:

Output:

Variable Assignment

In order to assign variables in Julia, you write the desired name for your variable, an = sign, and what the value of the variable should be.
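
A minimal sketch of variable assignment follows; the variable names and values are made up for illustration.

```julia
# Assign values of different types to variables
patient_id = 1001            # integer
temperature = 98.6           # floating-point number
diagnosis = "hypertension"   # string
is_inpatient = false         # Boolean

# A variable can be reassigned, and several can be assigned at once
temperature = 99.1
systolic, diastolic = 120, 80

println("Patient $(patient_id): $(diagnosis), $(systolic)/$(diastolic)")
```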

Input:

Output:

Comments

We can write comments in our code, which do not run, to describe what certain lines or sections of code do.
- These comments are just for the programmer; they do not appear anywhere in the output and are there only to explain what the code is doing or to provide helpful notes
- To make a comment in Julia, type the “#” symbol followed by your comment
- To write longer comments that span multiple lines, surround them with #= before the start and =# after the end
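
Both comment styles can be sketched as follows; the variable is a placeholder for illustration.

```julia
# This is a single-line comment; Julia ignores it when running the program
heart_rate = 72   # a comment can also follow code on the same line

#=
This is a multi-line comment.
Everything between the opening and closing markers is ignored.
=#
println(heart_rate)
```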

Input:

Output:

Print Statements

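In the REPL or a notebook, Julia automatically displays only the value of the last expression evaluated; when a program is run from a file, nothing is displayed unless it is printed. To print multiple things, use the print() or println() functions. println() ends the line after printing, while print() does not.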
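
A short sketch of the two functions, using made-up values:

```julia
# print writes its arguments without a trailing newline
print("Systolic: ")
print(120)
print("\n")

# println writes its arguments and then ends the line
println("Diastolic: ", 80)

# Both accept several arguments and print them one after another
println("Blood pressure: ", 120, "/", 80)
```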

Input:

Output:

Exercises

Use Julia in Brown Oscar Computing Environment - Forthcoming!
Use Julia in Brown Stronghold Computing Environment - Forthcoming!

Resources

Julia Documentation:
Julia Documentation:
Think Julia:
Think Julia:

Numbers and Math

This page provides syntax for using numbers and mathematical operations in Julia. Each section includes an example to demonstrate the described syntax and operations.

Types of Numbers

Integer (positive and negative whole numbers) - e.g., -3, -2, -1, 0, 1, 2, and 3
- Signed: Int8, Int16, Int32, Int64, and Int128
- Unsigned: UInt8, UInt16, UInt32, UInt64, and UInt128 (non-negative values only)
- Boolean: Bool (false = 0 and true = 1)
Float (real or floating-point numbers) - e.g., -2.14, 0.0, and 3.777
- Float16, Float32, and Float64

Use typeof() function to determine type
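
As a minimal sketch, typeof() can be applied to literals and to values converted explicitly to another type. The comment on the first line assumes a 64-bit system, where integer literals default to Int64.

```julia
# typeof() reports the type of a value
println(typeof(42))              # Int64 on a 64-bit system
println(typeof(0x2a))            # UInt8 (hexadecimal literals are unsigned)
println(typeof(true))            # Bool
println(typeof(3.14))            # Float64
println(typeof(Float32(3.14)))   # Float32

# Convert between types explicitly
small = Int8(100)
as_float = Float64(small)
println(typeof(small), " -> ", typeof(as_float))
```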

Input:

Output:

Arithmetic Operators

Operator

Example

Input:

Output:

Comparison Operators and Functions

Input:

Operator

Example

Output:

Exercises

Create a Health Calculator Using Julia - Forthcoming!

Resources

Julia Documentation:
Julia Documentation:
Julia Documentation:
Julia Documentation:
Think Julia:

Strings and Characters

This page provides syntax for strings and characters in Julia as well as some of their associated functions. Each section includes an example to demonstrate the described syntax or function.

Characters and Strings

Char is a single character (written with single quotes, e.g., 'a')
String is a sequence of zero or more characters (written with double quotes, e.g., "abc"; index values start at 1)

Some functions that can be performed on strings

Action

Function

Use typeof() function to determine type
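
The sketch below illustrates characters, strings, indexing, and a few commonly used Base string functions. The functions chosen here are an assumption about which ones are most useful, not a reproduction of the table above.

```julia
# A Char uses single quotes; a String uses double quotes
initial = 'J'
word = "Julia"
println(typeof(initial))   # Char
println(typeof(word))      # String

# Index values start at 1; `end` refers to the last index
first_char = word[1]
last_char = word[end]
piece = word[1:3]

# A few common string functions
n_chars = length(word)
shout = uppercase(word)
combined = string(word, " ", "language")   # concatenation
has_li = occursin("li", word)              # substring search
println("$(word) has $(n_chars) characters")
```

Indexing a string returns a Char, while a range such as 1:3 returns a String.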

Input:

Output:

Resources

Julia Documentation:
Julia Documentation:
Think Julia:

Regular Expressions

Regular expressions (regex) are powerful tools for pattern matching and text processing. They are represented as a pattern that consists of a special set of characters to search for in a string str.

This page provides syntax for regular expressions in Julia. Each section includes an example to demonstrate the described methods.

Functions

Action                             Function
Check if regex matches a string    occursin(r"pattern", str)
Capture regex matches              match(r"pattern", str)
Specify alternative regex          pattern1|pattern2

Character Class

Character class specifies a list of characters to match ([...] where ... represents the list) or not match ([^...])

Character Class                                           Pattern
Any lowercase vowel                                       [aeiou]
Any digit                                                 [0-9]
Any lowercase letter                                      [a-z]
Any uppercase letter                                      [A-Z]
Any digit, lowercase letter, or uppercase letter          [a-zA-Z0-9]
Anything except a lowercase vowel                         [^aeiou]
Anything except a digit                                   [^0-9]
Anything except a space                                   [^ ]
Any character (except a newline, by default)              .
Any word character (equivalent to [a-zA-Z0-9_])           \w
Any non-word character (equivalent to [^a-zA-Z0-9_])      \W
A digit character (equivalent to [0-9])                   \d
Any non-digit character (equivalent to [^0-9])            \D
Any whitespace character (equivalent to [ \t\r\n\f])      \s
Any non-whitespace character (equivalent to [^ \t\r\n\f]) \S

Anchors

Anchors are special characters that can be used to match a pattern at a specified position

Anchor                 Special Character
Beginning of line      ^
End of line            $
Beginning of string    \A
End of string          \Z

Repetition and Quantifier Characters

Repetition or quantifier characters specify the number of times to match a particular character or set of characters

Repetition                        Character
Zero or more times                *
One or more times                 +
Zero or one time                  ?
Exactly n times                   {n}
n or more times                   {n,}
m or fewer times                  {0,m}
At least n and at most m times    {n,m}

Input:

# regex.jl
number1 = "(555)123-4567"
number2 = "123-45-6789"

# check if matches
if occursin(r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}", number1)
  println("match!")
end

if occursin(r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}", number2)
  println("match!")
else
  println("no match!")
end

# capture matches
# use parentheses to "capture" different parts of a regular
# expression for later use; the first set of parentheses corresponds
# to index 1, the second to index 2, etc.

number_details = match(r"\(([0-9]{3})\)([0-9]{3}-[0-9]{4})", number1)

# match() returns nothing when the pattern is not found
if number_details !== nothing
  area_code = number_details[1]
  phone_number = number_details[2]

  println("area code: $area_code")
  println("phone number: $phone_number")
end

Output:

match!
no match!
area code: 555
phone number: 123-4567
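
Anchors, quantifiers, and alternation from the tables above can be combined in the same way. The patterns below are hypothetical examples written for this sketch (a five-digit ZIP code with an optional four-digit extension, and a date in year-month-day form), not patterns taken from the chapter.

```julia
# ^ and $ anchor the pattern so the whole string must match;
# (-\d{4})? makes the four-digit extension optional
zip_pattern = r"^\d{5}(-\d{4})?$"
println(occursin(zip_pattern, "02912"))
println(occursin(zip_pattern, "02912-1234"))
println(occursin(zip_pattern, "2912"))

# Alternation: match either of two words anywhere in the string
println(occursin(r"cat|dog", "hotdog"))

# Capture groups with match(); it returns nothing when there is no match
m = match(r"^(\d{4})-(\d{2})-(\d{2})$", "2024-05-01")
if m !== nothing
    year, month, day = m.captures
    println("year: $(year), month: $(month), day: $(day)")
end
```

Without the anchors, the ZIP pattern would also match longer strings that merely contain five consecutive digits.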

Resources

Julia Documentation: Manual - Strings (see Regular Expressions)
Think Julia: Chapter 8 - Strings
Regular Expressions 101
Regular Expressions Library
Regular Expressions Cheat Sheet

Control Flow

In computer science, control flow (or flow of control) is the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated. [1]

This page provides syntax for some of the common control flow methods in Julia. Each section includes an example to demonstrate the described methods.

Use Cases and Syntax

Test if a specified expression is true or false
Short-circuit evaluation
- Test if all of the conditions are true x && y
- Test if any of the conditions are true x || y
- Test if a condition is not true !z
Conditional evaluation
- if statement
- if-else
- if-elseif-else
- ?: (ternary operator)

Conditional Statements

Input:

# conditions.jl
# Demonstrates use of if statement

x, y, z = 100, 200, 300
println("x = $x, y = $y, z = $z")

# Test if x equals 100
if x == 100
  println("$x equals 100")
end

# Test if y does not equal z
if !(y == z)
   println("$y does not equal $z")
end

# Test multiple conditions
if x < y < z
  println("$y is less than $z and greater than $x")
end

# Test multiple conditions using "&&"
if x < y && x < z
  println("$x is less than $y and $z")
end

# Test multiple conditions using "||"
if y < x || y < z
  println("$y is less than $x or $z")
end

# if-else statement
if x < 100
  println("$x less than 100")
else
  println("$x is equal to or greater than 100")
end

# Same logic as above but using the ternary operator (?:),
# so called because it takes three operands
println(x < 100 ? "$x less than 100 again" : "$x equal to or greater than 100 again")

# if-elseif-else statement
if y < 100
   println("$y is less than 100")
elseif y < 200
  println("$y is less than 200")
elseif y < 300
  println("$y is less than 300")
else
  println("$y is greater than or equal to 300")
end

Output:

x = 100, y = 200, z = 300
100 equals 100
200 does not equal 300
200 is less than 300 and greater than 100
100 is less than 200 and 300
200 is less than 100 or 300
100 is equal to or greater than 100
100 equal to or greater than 100 again
200 is less than 300

Loops

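Repeat a block of code a specified number of times or until some condition is met.
- while loop: repeats while a condition remains true
- for loop: repeats once for each item in a range or collection
- Use break to terminate a loop early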
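
As a minimal sketch of break, the hypothetical function below (its name, arguments, and the sample readings are assumptions for illustration) stops looping as soon as it finds the first value at or above a threshold.

```julia
# Return the first value at or above a threshold, or nothing if there is none
function first_at_or_above(values, threshold)
    found = nothing
    for v in values
        if v >= threshold
            found = v
            break        # leave the loop as soon as a match is found
        end
    end
    return found
end

temperatures = [98.6, 99.1, 101.3, 100.2]   # made-up readings
println(first_at_or_above(temperatures, 100.4))
println(first_at_or_above(temperatures, 105.0))
```

Because of the break, the last reading is never examined in the first call; the second call runs through every reading and returns nothing.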

Input:

# Demonstrates use of loops                                                                                    

i = 1

# while loop for incrementing i by 1 from 1 to 3
while i <= 3
  println("while: $i")
  global i += 1     # updating operator; equivalent to i = i + 1
end

# for loop
for j = 1:3
  println("for: $j")
end

for j in 1:3
  println("for again: $j")
end

# nested for loop
for j = 1:3
  for k = 1:3
    println("nested for: $j * $k = $(j*k)")
  end
end

Output:

while: 1
while: 2
while: 3
for: 1
for: 2
for: 3
for again: 1
for again: 2
for again: 3
nested for: 1 * 1 = 1
nested for: 1 * 2 = 2
nested for: 1 * 3 = 3
nested for: 2 * 1 = 2
nested for: 2 * 2 = 4
nested for: 2 * 3 = 6
nested for: 3 * 1 = 3
nested for: 3 * 2 = 6
nested for: 3 * 3 = 9

Comparison Operators and Functions

Operator                    Example
Equality                    x == y or isequal(x, y)
Inequality                  x != y or !isequal(x, y)
Less than                   x < y
Less than or equal to       x <= y
Greater than                x > y
Greater than or equal to    x >= y

Input:

# compare.jl                                                                                                 
# Demonstrate comparison operators                                                                               

# Assign values to variables using parallel assignment                                                           
c1, c2, c3, c4 = 25, 50, 75, 50
println("c1 = $(c1), c2 = $(c2), c3 = $(c3), c4 = $(c4)")

# Output results of different comparison operations                                                             
 
# Testing equality                                                                                               
println("  c1 = c3 is $(c1 == c3)")
println("  c2 = c4 is $(isequal(c2, c4))")

# Changing values using abbreviated assignment operators                                                        
c1 *= 3    	# Shorthand for c1 = c1 * 3                                                                       
c4 += 1    	# Shorthand for c4 = c4 + 1                                                                       

println("c1 = $(c1), c2 = $(c2), c3 = $(c3), c4 = $(c4)")
 
# Testing less than and greater than
println("  c1 < c2 is $(c1 < c2)")
println("  c4 <= c2 is $(c4 <= c2)")
println("  c1 > c2 is $(c1 > c2)")
println("  c3 >= c2 is $(c3 >= c2)")

Output:

c1 = 25, c2 = 50, c3 = 75, c4 = 50
  c1 = c3 is false
  c2 = c4 is true
c1 = 75, c2 = 50, c3 = 75, c4 = 51
  c1 < c2 is false
  c4 <= c2 is false
  c1 > c2 is true
  c3 >= c2 is true

References

Wikipedia contributors. (n.d.). Control flow. In Wikipedia. Retrieved May 1, 2024, from https://en.wikipedia.org/wiki/Control_flow

Resources

Julia Documentation: Manual - Control Flow
Think Julia: Chapter 5 - Conditionals and Recursion
Think Julia: Chapter 7 - Iteration

Collections and Data Structures

In computer programming, a collection is a grouping of some variable number of data items (possibly zero) that have some shared significance to the problem being solved and need to be operated upon together in some controlled fashion.

This page provides syntax for different types of collections and data structures in Julia (arrays, sets, dictionaries, etc.). Each section includes an example to demonstrate the described methods.

Arrays

Arrays are ordered collections of elements. In Julia, they are automatically indexed (consecutively numbered) by integers starting with 1.
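
The sketch below walks through the array operations named in the subsections that follow, using Base functions only. The values are made up for illustration.

```julia
# Create an array and access elements (indexing starts at 1)
ages = [68, 54, 49, 28, 36]
first_age = ages[1]
last_age = ages[end]
middle = ages[2:4]

# Add and remove elements at the end
push!(ages, 72)
removed = pop!(ages)

# Sort (returns a new array) and remove duplicates
sorted_ages = sort(ages)
codes = unique(["A", "B", "A", "C"])

# Convert between strings and arrays
parts = split("red,green,blue", ",")
joined = join(parts, " | ")

# Arrays are equal when they hold the same elements in the same order
same = [1, 2, 3] == [1, 2, 3]
println(sorted_ages)
```

Functions whose names end in ! (such as push! and pop!) modify the array in place, while sort and unique return new arrays.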

Creating arrays

Action

Syntax

Creating array from string

Action

Syntax

Accessing elements

Action

Syntax

Adding and removing elements

Action

Syntax

Sort and unique

Action

Syntax

Compare arrays

Action

Syntax

Convert array to string

Action

Syntax

Input:

Output:

Sets

Sets are unordered collections of unique elements.
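
A minimal sketch of creating, modifying, and comparing sets follows; the allergy names are placeholder values.

```julia
# Duplicates are dropped when a set is created
allergies_a = Set(["penicillin", "latex", "latex"])
allergies_b = Set(["latex", "peanuts"])
n_a = length(allergies_a)

# Add an element and test membership
push!(allergies_a, "aspirin")
has_latex = "latex" in allergies_a

# Compare and combine sets
both = intersect(allergies_a, allergies_b)
either = union(allergies_a, allergies_b)
only_a = setdiff(allergies_a, allergies_b)
println(issubset(both, allergies_a))
```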

Creating sets

Action

Syntax

Interacting with sets

Action

Syntax

Comparing sets

Action

Syntax

Input:

Output:

Dictionaries

Dictionaries are unordered collections of key-value pairs where the key serves as the index (“associative collection”). Like the elements of a set, keys are always unique.
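
The following sketch shows common dictionary operations with Base functions. The patient identifiers and ages are invented for the example.

```julia
# Create a dictionary that maps patient identifiers (keys) to ages (values)
ages = Dict("p001" => 68, "p002" => 54, "p003" => 49)

# Look up, add, update, and delete entries by key
age_p001 = ages["p001"]
ages["p004"] = 28
ages["p002"] = 55
delete!(ages, "p003")

# get() supplies a default when a key is missing; haskey() tests for a key
missing_age = get(ages, "p999", -1)
has_p004 = haskey(ages, "p004")

# Dictionaries are unordered, so sort the keys for a predictable order
for key in sort(collect(keys(ages)))
    println("$(key) => $(ages[key])")
end

# Convert to an array of key => value pairs
pairs_list = collect(ages)
```

Looking up a missing key with square brackets raises an error, which is why get() with a default is often the safer choice.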

Creating dictionaries

Action

Syntax

Accessing dictionaries

Action

Syntax

Converting dictionaries

Action

Syntax

Sorting dictionaries

Action

Syntax

Input:

Output:

References

Wikipedia contributors (n.d.). Collection. In Wikipedia. Retrieved May 1, 2024, from

Resources

Julia Documentation:
Think Julia:
Think Julia:
Think Julia:

File Input/Output

Many Julia programs involve the input and output of files. When analyzing a dataset, that dataset file will need to be pulled into your program (input). If you want to see the results of your analysis, your program will need an output.

This section provides the syntax for inputting files (reading) and outputting results (writing) using base Julia (i.e., no packages such as CSV.jl).

UC Irvine Machine Learning Repository: Adult Data Set

Tabulate and report counts for sex from the Adult Data Set file (adult.data).
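
As a sketch of this kind of tabulation in base Julia, the hypothetical function count_column below reads a delimited file line by line and counts the values in one column. To stay self-contained it runs on a tiny made-up two-column file written to a temporary directory rather than on adult.data; the file names, column position, and sample lines are all assumptions for illustration.

```julia
# Count how often each value occurs in one column of a delimited text file
function count_column(path, col; delim=",")
    counts = Dict{String,Int}()
    for line in eachline(path)                # read one line at a time
        fields = strip.(split(line, delim))   # split into fields and trim spaces
        length(fields) < col && continue      # skip blank or short lines
        key = String(fields[col])
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

# Write a tiny made-up sample file (not real records) to a temporary directory
dir = mktempdir()
in_path = joinpath(dir, "sample.data")
open(in_path, "w") do io
    println(io, "39, Male")
    println(io, "50, Female")
    println(io, "38, Male")
end

counts = count_column(in_path, 2)

# Write the counts to an output file, one "value<TAB>count" pair per line
out_path = joinpath(dir, "counts.txt")
open(out_path, "w") do io
    for key in sort(collect(keys(counts)))
        println(io, key, "\t", counts[key])
    end
end
print(read(out_path, String))
```

For a real data file, col would be set to the position of the column of interest, and the do-block form of open ensures each file is closed even if an error occurs.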

Dataset (example lines from adult.data)

Input (process_file.jl)

Output

Terminal

Exercises

Analyze the MIMIC-IV Demo Files Using Julia - Forthcoming!
Analyze the SyntheticRI Demo Files Using Julia - Forthcoming!

Resources

Julia Documentation:
Think Julia:

Packages

In computer programming, a package is a collection of modules or programs that are often published as tools for a range of common use cases, such as text processing and doing math. Programmers can install these packages and take advantage of their functionality within their own code.

This page provides instructions for installing, using, and troubleshooting packages in Julia.

Installing Packages

Start the Julia REPL by typing the following in Terminal or PowerShell (Note: you do not need to type $; it indicates the shell prompt)

$ julia

Enter Pkg mode, the REPL mode for Julia’s built-in package manager, by pressing ] at the julia> prompt

julia> ]

(@v1.4) pkg>

Update the package registry in the Pkg REPL

(@v1.4) pkg> update

Add packages in the Pkg REPL

(@v1.4) pkg> add CSV

(@v1.4) pkg> add DataFrames

Check installation

(@v1.4) pkg> status
            Status `~/.julia/environments/v1.0/Project.toml`
                [336ed68f] CSV v0.4.3
                [a93c6f00] DataFrames v0.17.1
                ...

Return from the Pkg REPL to the Julia REPL by pressing backspace or ^C.

(@v1.4) pkg>

julia>

To see REPL history

$ more ~/.julia/logs/repl_history.jl

Using Packages

julia> using CSV
julia> using DataFrames

julia> exit()

Troubleshooting

If you get an error like: ERROR: SystemError: opening file "C:\\Users\\User\\.julia\\registries\\General\\Registry.toml": No such file or directory
- Delete C:\Users\User\.julia\registries, where User is your computer’s username, and try again
- https://discourse.julialang.org/t/registry-toml-missing/24152

Resources

Julia Pkg
Julia Package Registries
JuliaHealth and BioJulia organizations (focused on Julia packages for health and life sciences)
Julia Package: CSV.jl
Julia Package: DataFrames.jl

DataFrames

DataFrames.jl is a Julia package that provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of pandas (in Python) and data.frame, data.table and dplyr (in R), making it a great general purpose data science tool. [1]

This page provides examples of using DataFrames.jl, demonstrating the syntax and common functions within the package.

Example

Install and Load DataFrames.jl Package

using Pkg

# Add DataFrames package
Pkg.add("DataFrames")

# Load packages
using DataFrames

Create Dataframe

# Create dataframe
df = DataFrame(id = 1:5, gender = ["F", "M", "F", "M", "F"], age = [68, 54, 49, 28, 36])

Display Dataframe

Input:

# display dataframe
println(df)

Output:

5×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54
   3 │     3  F          49
   4 │     4  M          28
   5 │     5  F          36

First two lines of dataframe:

Input:

println(first(df, 2))

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54

Last two lines of dataframe:

Input:

println(last(df, 2))

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     4  M          28
   2 │     5  F          36

Describe Dataframe

Dataframe size:

Input:

# dataframe size
println(size(df))

Output:

(5, 3)

Dataframe column names:

Input:

# dataframe column names
println(names(df))

Output:

["id", "gender", "age"]

Dataframe description:

Input:

# describe dataframe
println(describe(df))

Output:

3×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64

Accessing DataFrames

Get "age" column (different ways to call the column)

Input:

# call by column name
println(df[!, :age])

# get column by column number
println(df[!, 3])

# alternate syntax
println(df.age)

Output:

[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]
[68, 54, 49, 28, 36]

Get row

Input:

# print row 2
println(df[2, :])

Output:

DataFrameRow
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   2 │     2  M          54

Get element

Input:

# get element in row 2, column 3
println(df[2,3])

Output:

54

Get subset (specific rows and all columns)

Input:

# print out rows 1, 3, & 5
println(df[[1,3,5], :])

Output:

3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     3  F          49
   3 │     5  F          36

Get subset (all rows and specific columns)

Input:

# print out all rows and only columns 1 (id) and 3 (age)
println("Using column names:\n")
println(df[:, [:id, :age]])
println()

println("Using column numbers:\n")
println(df[:, [1,3]])

Output:

Using column names:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   1 │     1     68
   2 │     2     54
   3 │     3     49
   4 │     4     28
   5 │     5     36

Using column numbers:

5×2 DataFrame
 Row │ id     age
     │ Int64  Int64
─────┼──────────────
   1 │     1     68
   2 │     2     54
   3 │     3     49
   4 │     4     28
   5 │     5     36

Get subset (all rows meeting specified criteria - numbers)

Input:

# print out all rows where age is greater than 50
println(df[df.age .> 50, :])

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     2  M          54

Get subset (all rows meeting specified criteria - strings)

Input:

# print out all rows where gender is female ("F")
println(df[df.gender .== "F", :])

Output:

3×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     1  F          68
   2 │     3  F          49
   3 │     5  F          36

Get subset (all rows meeting specified criteria)

Input:

# print out all rows where gender is female ("F") and age is between 25-50
println(df[(df.gender .== "F") .& (25 .< df.age .< 50), :])

Output:

2×3 DataFrame
 Row │ id     gender  age
     │ Int64  String  Int64
─────┼──────────────────────
   1 │     3  F          49
   2 │     5  F          36

Add Column

New columns with specified values

Input:

# add a column for weight
df.weight = [100, 120, 150, 175, 300]

# add a column for height
df.height = [62, 60, 61, 63, 64]

println(df)
println()

println("Describe dataframe to see column names and summary:\n")
println(describe(df))

Output:

5×5 DataFrame
 Row │ id     gender  age    weight  height
     │ Int64  String  Int64  Int64   Int64
─────┼──────────────────────────────────────
   1 │     1  F          68     100      62
   2 │     2  M          54     120      60
   3 │     3  F          49     150      61
   4 │     4  M          28     175      63
   5 │     5  F          36     300      64

Describe dataframe to see column names and summary:

5×7 DataFrame
 Row │ variable  mean    min  median  max  nmissing  eltype
     │ Symbol    Union…  Any  Union…  Any  Int64     DataType
─────┼────────────────────────────────────────────────────────
   1 │ id        3.0     1    3.0     5           0  Int64
   2 │ gender            F            M           0  String
   3 │ age       47.0    28   49.0    68          0  Int64
   4 │ weight    169.0   100  150.0   300         0  Int64
   5 │ height    62.0    60   62.0    64          0  Int64

New column with calculated value

Input:

# add a column with calculated BMI
df.bmi = map((x,y) -> (x/y^2)*703, df.weight, df.height)

println(df)
println()

println("Describe dataframe to see new bmi column and summary:\n")
println(describe(df))

Output:

5×6 DataFrame
 Row │ id     gender  age    weight  height  bmi
     │ Int64  String  Int64  Int64   Int64   Float64
─────┼───────────────────────────────────────────────
   1 │     1  F          68     100      62  18.2882
   2 │     2  M          54     120      60  23.4333
   3 │     3  F          49     150      61  28.3392
   4 │     4  M          28     175      63  30.9965
   5 │     5  F          36     300      64  51.4893

Describe dataframe to see new bmi column and summary:

6×7 DataFrame
 Row │ variable  mean     min      median   max      nmissing  eltype
     │ Symbol    Union…   Any      Union…   Any      Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ id        3.0      1        3.0      5               0  Int64
   2 │ gender             F                 M               0  String
   3 │ age       47.0     28       49.0     68              0  Int64
   4 │ weight    169.0    100      150.0    300             0  Int64
   5 │ height    62.0     60       62.0     64              0  Int64
   6 │ bmi       30.5093  18.2882  28.3392  51.4893         0  Float64
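
The map call above applies the imperial-unit BMI formula (weight in pounds divided by height in inches squared, multiplied by 703) to each row. As a sketch in base Julia, the same calculation can be written as a small named function and applied element by element with broadcasting; the function name bmi is an assumption for illustration.

```julia
# BMI from weight in pounds and height in inches (imperial-unit formula)
bmi(weight_lb, height_in) = 703 * weight_lb / height_in^2

weights = [100, 120, 150, 175, 300]
heights = [62, 60, 61, 63, 64]

# The dot applies bmi to each pair of elements (broadcasting)
bmis = bmi.(weights, heights)
println(round.(bmis, digits=4))
```

With a DataFrame, the same function could be broadcast over two columns in place of the anonymous function passed to map.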

Get counts/frequency

Input:

# get counts of males and females in the dataframe
println(combine(groupby(df, :gender), nrow => :N))

Output:

2×2 DataFrame
 Row │ gender  N     
     │ String  Int64 
─────┼───────────────
   1 │ F           3
   2 │ M           2

Transform DataFrame

sort

Input:

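# sort the dataframe by gender and then age in reverse order for age (oldest to youngest)
# rev takes a single Bool or a vector of Bool (one per sort column), not a tuple
println(sort(df, [:gender, :age], rev=[false, true]))

Output:

5×6 DataFrame
 Row │ id     gender  age    weight  height  bmi
     │ Int64  String  Int64  Int64   Int64   Float64
─────┼───────────────────────────────────────────────
   1 │     1  F          68     100      62  18.2882
   2 │     3  F          49     150      61  28.3392
   3 │     5  F          36     300      64  51.4893
   4 │     2  M          54     120      60  23.4333
   5 │     4  M          28     175      63  30.9965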

stack (reshape from wide to long format)

Input:

# Reshape from wide to long format (exclude id to see which column and value matches which patient id)
long_df = stack(df, Not(:id))
println(long_df)

Output:

25×3 DataFrame
 Row │ id     variable  value
     │ Int64  String    Any
─────┼──────────────────────────
   1 │     1  gender    F
   2 │     2  gender    M
   3 │     3  gender    F
   4 │     4  gender    M
   5 │     5  gender    F
   6 │     1  age       68
   7 │     2  age       54
   8 │     3  age       49
   9 │     4  age       28
  10 │     5  age       36
  11 │     1  weight    100
  12 │     2  weight    120
  13 │     3  weight    150
  14 │     4  weight    175
  15 │     5  weight    300
  16 │     1  height    62
  17 │     2  height    60
  18 │     3  height    61
  19 │     4  height    63
  20 │     5  height    64
  21 │     1  bmi       18.2882
  22 │     2  bmi       23.4333
  23 │     3  bmi       28.3392
  24 │     4  bmi       30.9965
  25 │     5  bmi       51.4893

unstack (reshape from long to wide format)

Input:

#unstack dataframe to get back to wide format based off "id" (unstack(df, :id, :variable, :value))
wide_df = unstack(long_df, :id, :variable, :value)
println(wide_df)

Output:

5×6 DataFrame
 Row │ id     gender  age  weight  height  bmi
     │ Int64  Any     Any  Any     Any     Any
─────┼─────────────────────────────────────────────
   1 │     1  F       68   100     62      18.2882
   2 │     2  M       54   120     60      23.4333
   3 │     3  F       49   150     61      28.3392
   4 │     4  M       28   175     63      30.9965
   5 │     5  F       36   300     64      51.4893

Traversing DataFrame (for loops)

nested for loop

Input:

# size of dataframe = size(df)
# set number of rows to nrows and number of columns to ncols
println("(nrows, ncols) = $(size(df))")
nrows, ncols = size(df)

# use nested for loop to get information from dataframe by row and column
for row in 1:nrows
    for col in 1:ncols
        println("value for row $row and col $col is $(df[row,col])")
    end
end

Output:

(nrows, ncols) = (5, 6)
value for row 1 and col 1 is 1
value for row 1 and col 2 is F
value for row 1 and col 3 is 68
value for row 1 and col 4 is 100
value for row 1 and col 5 is 62
value for row 1 and col 6 is 18.28824141519251
value for row 2 and col 1 is 2
value for row 2 and col 2 is M
value for row 2 and col 3 is 54
value for row 2 and col 4 is 120
value for row 2 and col 5 is 60
value for row 2 and col 6 is 23.433333333333334
value for row 3 and col 1 is 3
value for row 3 and col 2 is F
value for row 3 and col 3 is 49
value for row 3 and col 4 is 150
value for row 3 and col 5 is 61
value for row 3 and col 6 is 28.339156140822357
value for row 4 and col 1 is 4
value for row 4 and col 2 is M
value for row 4 and col 3 is 28
value for row 4 and col 4 is 175
value for row 4 and col 5 is 63
value for row 4 and col 6 is 30.99647266313933
value for row 5 and col 1 is 5
value for row 5 and col 2 is F
value for row 5 and col 3 is 36
value for row 5 and col 4 is 300
value for row 5 and col 5 is 64
value for row 5 and col 6 is 51.4892578125

Exercises

Analyzing Health Datasets with DataFrames in Julia - Forthcoming!

References

JuliaData Contributors. (n.d.). DataFrames.jl - JuliaData. Retrieved May 1, 2024, from https://dataframes.juliadata.org/stable/

Resources

Julia Package: DataFrames.jl
Julia Package: CSV.jl
Julia Data Science: DataFrames.jl
Introducing Julia Wikibook: DataFrames
Julia DataFrames Cheat Sheets

JuliaPlots

JuliaPlots is one of the most popular data visualization packages for Julia as it is easy to use and interfaces with many other Julia packages.

Installation & Setup

To begin, import the "Plots" package and initialize it with the following code.

Creating a Plot

Use plot to create a new plot, and plot! to add to an existing plot

To create a first plot of sin(x), we will assign two variables and use the plot function to visualize them.

Output

Adding/Modifying Plot Attributes

There are many attributes you can modify to incorporate additional detail and/or change the style of a plot, such as titles, axis labels, line width, and legends, to name a few. In Plots, modifying an attribute is as easy as calling the name of the attribute followed by an exclamation point (e.g., xlabel!). Below are some examples of attribute addition and modification.

The default for Plots is modifying the current plot. To modify the attribute of a plot other than the current one, include the plot name following the attribute. For example, to change the x-axis label of a plot called "plotname", you would write: xlabel!(plotname, "x")

Output

Saving Plots

To save your plots from the Plots package, there are a few options depending on whether you want the plot to save as a .png or .pdf.

Resources

JuliaPlots documentation:

ScikitLearn.jl

ScikitLearn.jl lets you use many stats packages and machine learning models from Python's scikit-learn library — but directly in Julia! It helps you do things like predictions, classifications, and more using very beginner-friendly tools.

With ScikitLearn.jl, you can:

Train and evaluate machine learning models
Use toy datasets to explore machine learning models

Installation & Setup

First, make sure you have Julia installed. On Oscar, you can simply enter the command module load julia in the terminal. Otherwise, refer to the Julia installation instructions to install the appropriate version of Julia for your computer.

Once Julia is installed, enter the Julia interactive window by entering the command julia.

Once in the interactive window enter the following command to download the appropriate packages:

This command installs Python's ScikitLearn package to your conda environment. Now, open Julia and run one at a time (these might take a while so be patient):

If you are using ScikitLearn for the first time you might need to install it. Julia should automatically give you some installation prompts.

Example 1: Logistic Regression

ScikitLearn has several 'toy' datasets that can be used for experimentation and development. We’ll use a well-known dataset of iris flowers to train a model to predict a flower's type given some quantitative descriptive data. We will start with a basic logistic regression model.

Example 2: Decision Tree

Now let’s try using a decision tree to classify the same flowers.

Note that the 'simpler' logistic regression model actually may outperform the more complex decision tree. In this case that is due to the simplicity of the Iris dataset.

Key Terms to Know

Term

What It Means

Resources

JuliaStats

JuliaStats contains basic statistics functionality, which can be used as the foundation for statistics, machine learning, and data science needs. It is efficient, scalable, and reusable!

Installation & Setup

JuliaStats is not a single package, but rather a suite of packages. Specific packages can be downloaded depending on your needs.

To begin, import the package manager and initialize your desired package with the following code.

import Pkg
Pkg.add(*package name*)

using *package name*

For example, if you wanted to download the StatsBase package, use the following code.

import Pkg
Pkg.add("StatsBase")

using StatsBase

Commonly Used Packages

Package               Use
StatsBase.jl          Basic statistics, weights, sampling, counts, and summary statistics.
Distributions.jl      Probability distributions and related functions (PDF, CDF, sampling, etc.).
StatsModels.jl        Statistical model formulas.
GLM.jl                Generalized linear models (e.g., linear regression, logistic regression).
MixedModels.jl        Linear and generalized linear mixed-effects models.
HypothesisTests.jl    Statistical hypothesis tests (t-tests, chi-squared, ANOVA, etc.).
MultivariateStats.jl  Multivariate analysis (PCA, factor analysis, ICA, etc.).

Please refer to each package's documentation for a list of available functions and their usage.

Example

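# Load the packages used below (install each one first with Pkg.add)
using StatsBase, Distributions, GLM, DataFrames

# Using StatsBase
data = ..
mean_val = mean(data)
var_val = var(data)

# Using Distributions
pdf_val = pdf(Normal(0,1), 1)

# Using GLM
df = DataFrame(..)
model = lm(@formula(y ~ x), df)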

Resources

Exercises

List of exercises found across the different Julia pages.

Use Julia in Brown Oscar Computing Environment - Forthcoming!
Use Julia in Brown Stronghold Computing Environment - Forthcoming!
Create a Health Calculator Using Julia - Forthcoming!
- Create a Pediatric Dosage Calculator Using Julia
- Create a BMI Calculator Using Julia
Analyze Health Datasets Using Unix Commands - Forthcoming!
- Analyze MIMIC-IV Demo Files Using Unix Commands
- Analyze SyntheticRI Demo Files Using Unix
Analyze Health Datasets Using Julia - Forthcoming!
- Analyze MIMIC-IV Demo Files Using Julia
- Analyze SyntheticRI Demo Files Using Julia

Python

Python is one of the many languages used by the data science community to perform data manipulation, statistical modeling, and machine learning. Its design philosophy emphasizes code readability. The Python community is huge and has produced an enormous body of technical support documentation. If you don't know how to do something in Python, chances are that someone else has asked a similar question online and received a comprehensive answer.

Resources

Installation

Instructions for installing Python on macOS and Windows operating systems can be found below.

For most users, it is recommended to download the current stable release from https://www.python.org/downloads/.

Some developers might wish to use a different version, or to switch between versions. For this, a version manager such as pyenv can be useful.

Python is also available for use in Brown's Computing Environments:

Oscar (for high-performance computing)
Stronghold (for secure computing)

macOS 13.X Ventura

The following instructions have been tested on computers running macOS 13 Ventura. To check the macOS version running on your computer, click on the Apple icon in the top left-hand corner of your screen and select "About This Mac." A window will pop up that includes a version number. Confirm you are running at least Version 13.X (where 'X' is any number). These instructions will likely work with earlier versions of macOS as well. If you are running an older version, you can upgrade for free following the instructions provided on Apple's support website.

Download Python

Navigate to https://www.python.org/downloads/ and download the most recent version of Python for macOS.

Install Python

Open the downloaded file (e.g., python-3.12.3-macos11.pkg). A window will pop up with installation instructions. Progress through the prompts until Python has been installed in your Applications folder. Next, double click on the Python folder shortcut in your Applications folder to open it.

Run Python

Open Terminal, type python3, and hit return. Python should open. To quit Python, type quit() and hit return.

Troubleshooting

If you get a Permission denied error, rerun the command prepended with sudo. You will be prompted to enter your computer password.

Windows OS

The following instructions have been tested on computers running Windows 10. Confirm that you are running at least Windows 10. These instructions will likely work with earlier versions of Windows; however, they have not been tested.

Download Python

Navigate to https://www.python.org/downloads/ and download the most recent version of Python for Windows (32-bit or 64-bit depending on the specifications of your device).

Install Python

Open the downloaded file (e.g., python-3.10.10-amd64.exe). A window will pop up with installation instructions. Progress through the prompts until Python has been installed on your device. When prompted with Advanced Options, make sure to check "Add Python to environment variables".

Run Python

Open Command Prompt, type py, and hit enter. Python should open. To quit Python, type quit() and hit enter.

REPL

Python comes with an interactive command-line REPL (read-eval-print loop) built into the python executable. In addition to allowing quick and easy evaluation of Python statements, it offers command history, tab-completion, helpful keybindings, and a built-in help utility (help()).

This page provides examples of using the REPL on the command line.

Python REPL Example (local)

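Type python (python3 on macOS) in the terminal to launch the REPL

Python REPL Help Pages (local)

Type help() to enter the help utility within the REPL
Type the name of a Python function to read its help page (e.g., print)
Press q to leave a help page, then type quit to exit the help utility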

Resources

Real Python: The Python Standard REPL: Try Out Code and Ideas Quickly

Basic Syntax

"Hello, World!" Program

This is the typical first program for those new to a general purpose programming language like Python. It can be used to test that the Installation of Python is working and also introduce Python's basic syntax using the REPL environment or running code written using a Text Editor at the Unix command line.

Input:

# hello.py
# This is a single line comment 
'''
This is a block comment to show 
comments across multiple lines.
'''

print("Hello, World!")

Output:

Hello, World!

Here are variations of the "Hello, World!" program using variables and different print statements.

Input:

# hello2.py

greeting = "Hello, World!"

print(greeting) # print greeting
print(f"Greeting 1: {greeting}") # print greeting as part of a string phrase
print(f"Greeting 2: {greeting}\n") # print with newline (\n) character

Output:

Hello, World!
Greeting 1: Hello, World!
Greeting 2: Hello, World!

Variable Assignment

In order to assign variables in Python, you write the desired name for your variable, an “=” sign, and what the value of the variable should be.

Input:

x = 7
x

Output:

7

Comments

We can write comments in our code, which do not run, to describe what certain lines or sections of code do.
- These comments are just for the programmer; they will not appear anywhere in the output and are there only to explain what the code is doing or to provide helpful notes
- To make a comment in Python, type the “#” symbol followed by your comment
- Sometimes you might want to write longer comments that span multiple lines. To do this, surround the comment with three quotation marks (''' or """) on the line above its start and on the line below its end

Input:

# Assigns variable x to have value 7
x = 7

'''
Now we want to print out what x is. We can do this by simply typing x and 
hitting run. This comment spans multiple lines. These types of comments are 
useful when describing complex functions or algorithms.
'''

x

Output:

7

Print Statements

Without a print statement, a notebook cell displays only the value of its last expression, and a script displays nothing at all. To print multiple things, use the print() function.

Input:

# Assign x, y, and z variables 
x = 7
y = 10
z = 4

z
print(x)
print(y)

Output:

7
10

Indentation

Python is sensitive to indentation. Indentation should only be used inside hierarchical structures, such as a class, function, or loop. Indents in improper locations will cause an error.

Input:

# Assign x and y variables 

x = 7
    y = 10
    
print(x)
print(y)

Output:

IndentationError: unexpected indent
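
For contrast, here is a minimal sketch of indentation used correctly. The variable names smaller and total and the values are illustrative assumptions, not part of the original example.

```python
# Indentation marks which lines belong to a block (here, an if statement
# and a for loop). The values below are illustrative only.
x = 7
y = 10

if x < y:
    # This line is indented, so it runs only when the condition is true
    smaller = x
else:
    smaller = y

total = 0
for value in (x, y):
    # Indented lines are repeated for each item in the loop
    total += value

# Back at the left margin: these lines run once, after the loop ends
print(smaller)
print(total)
```

Four spaces per level is the common convention; what matters to Python is that every line in the same block is indented by the same amount.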

Exercises

Use Python in Brown Oscar Computing Environment - Forthcoming!
Use Python in Brown Stronghold Computing Environment - Forthcoming!

Resources

Real Python: Variables
W3 Schools: Comments

Numbers and Math

This page provides syntax for using numbers and mathematic operations in Python. Each section includes an example to demonstrate the described syntax and operations.

Types of Numbers

Integer (positive and negative whole numbers) - e.g., -3, -2, -1, 0, 1, 2, and 3:
- int - holds signed integers of unlimited length
- long - holds long integers (exists in Python 2.X, removed in Python 3.X)
Float (real or floating point numbers) - e.g., -2.14, 0.0, and 3.777:
- float
Boolean (False behaves as 0 and True behaves as 1):
- bool

Use the type() function to determine a value's type

Input:

# Define two variables x and y
x = 100
y = 3.14

# Print out the variable types for each
print(type(x))
print(type(y))

Output:

<class 'int'>
<class 'float'>
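
The points in the list above can be checked directly. The following is a small sketch with illustrative variable names; it shows that integers are not limited in size, that bool values behave like 0 and 1, and how values convert between types.

```python
# Python 3 integers grow as large as needed (there is no separate long type)
big = 2 ** 100
print(type(big))

# bool is a subclass of int, so True behaves like 1 and False like 0
is_int_subclass = issubclass(bool, int)
true_sum = True + True

# Converting between types
as_float = float(100)   # 100.0
as_int = int(3.99)      # 3 (truncates toward zero, does not round)

print(is_int_subclass, true_sum, as_float, as_int)
```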

Arithmetic Operators

Operator            Example
Addition            x + y
Subtraction         x - y
Multiplication      x * y
Division            x / y
Floor Division      x // y
Power (Exponent)    x ** y
Remainder (Modulo)  x % y

Input:

# Demonstrates different math operations using f-strings

n1 = 7    # First number
n2 = 3    # Second number
 
# Output results of different math operations
print(f"{n1} + {n2} = {(n1 + n2)}")           # Addition
print(f"{n1} - {n2} = {(n1 - n2)}")           # Subtraction 
print(f"{n1} * {n2} = {(n1 * n2)}")           # Multiplication 
print(f"{n1} / {n2} = {(n1 / n2)}")           # Division 
print(f"{n1} // {n2} = {(n1 // n2)}")         # Floor Division
print(f"{n1} ** {n2} = {(n1 ** n2)}")         # Power/Exponent
print(f"{n1} % {n2} = {(n1 % n2)}")           # Modulo/Remainder

Output:

7 + 3 = 10
7 - 3 = 4
7 * 3 = 21
7 / 3 = 2.3333333333333335
7 // 3 = 2
7 ** 3 = 343
7 % 3 = 1
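
Floor division and modulo are two halves of the same calculation. As a brief sketch (variable names are illustrative), the built-in divmod function returns both at once, and the pair always recombines into the original number:

```python
n1 = 7
n2 = 3

# divmod returns the floor-division quotient and the remainder together
quotient, remainder = divmod(n1, n2)

# The two results always recombine into the original number
recombined = quotient * n2 + remainder

print(f"{n1} = {quotient} * {n2} + {remainder}")

# With a negative operand, // rounds down (toward negative infinity)
neg_quotient = -7 // 3    # -3, not -2
neg_remainder = -7 % 3    # 2, so that -3 * 3 + 2 == -7
```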

Comparison Operators and Functions

Operator                  Example
Equality                  x == y
Inequality                x != y
Less than                 x < y
Less than or equal to     x <= y
Greater than              x > y
Greater than or equal to  x >= y

Input:

# compare.py
# Demonstrate comparison operators                                                                               

# Assign values to variables using parallel assignment                                                           
c1, c2, c3, c4 = 25, 50, 75, 50
print(f"c1 = {c1}, c2 = {c2}, c3 = {c3}, c4 = {c4}")

# Output results of different comparison operations

# Testing equality
print(f"c1 = c3 is {(c1 == c3)}")

# Changing values using abbreviated assignment operators
c1 *= 3    # Shorthand for c1 = c1 * 3
c4 += 1    # Shorthand for c4 = c4 + 1

print(f"c1 = {c1}, c2 = {c2}, c3 = {c3}, c4 = {c4}")

# Testing less than and greater than
print(f"  c1 < c2 is {(c1 < c2)}")
print(f"  c4 <= c2 is {(c4 <= c2)}")
print(f"  c1 > c2 is {(c1 > c2)}")
print(f"  c3 >= c2 is {(c3 >= c2)}")

Output:

c1 = 25, c2 = 50, c3 = 75, c4 = 50
c1 = c3 is False
c1 = 75, c2 = 50, c3 = 75, c4 = 51
  c1 < c2 is False
  c4 <= c2 is False
  c1 > c2 is True
  c3 >= c2 is True

Exercises

Create a Health Calculator Using Python - Forthcoming!

Resources

Strings and Characters

This page provides syntax for strings in Python as well as some of their associated functions. Each section includes an example to demonstrate the described syntax or function.

Strings

A string is a sequence of one or more characters (index values start at 0)

Some functions and index methods that can be performed on strings:

Action                                             Function
Get word length                                    len("abc")
Extract the character at index n                   "abc"[n]
Extract the substring from index n through m - 1   "abc"[n:m]
Search for character in word (returns its index)   "abc".index("character")
Search for subword in word                         "ab" in "abc"
Remove whitespace from both ends of a word         "abc ".strip()
Remove last character from word                    "abc"[:-1]
Determine data structure type                      type("abc")

Input:

# strings.py

letter = "b"
word = "good-bye"
subword = "good"

word_length = len(word)
word_first_char = word[0]
word_subword = word[5:8]

print(f"Length of word: {word_length}")
print(f"First letter: {word_first_char}")
print(f"Last three characters: {word_subword}")

print(f"{letter} is in {word}: {(word.index(letter))}")
print(f"{subword} is in {word}: {(subword in word)}")
print(f"remove the last character: {(word[:-1])}")

Output:

Length of word: 8
First letter: g
Last three characters: bye
b is in good-bye: 5
good is in good-bye: True
remove the last character: good-by
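
Two details from the table are easy to misread, so here is a short sketch with illustrative values. strip() trims both ends of a string (rstrip() and lstrip() trim one end each), and index() raises an error when the character is absent, whereas find() returns -1.

```python
padded = "  abc  "

both = padded.strip()     # removes whitespace from both ends
right = padded.rstrip()   # removes whitespace from the end only
left = padded.lstrip()    # removes whitespace from the start only

# repr() shows the quotes, which makes remaining spaces visible
print(repr(both), repr(right), repr(left))

# find() is like index(), but returns -1 instead of raising an error
# when the character is not present
missing = "abc".find("z")
```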

Resources

W3 Schools: Python Strings

Regular Expressions

Regular expressions are powerful tools for pattern matching and text processing. They are represented as a pattern that consists of a special set of characters to search for in a string. Python's built-in re module needs to be imported before use.

This page provides syntax for regular expressions in Python. Each section includes an example to demonstrate the described methods.

Functions

Action                                                    Function
Check if regex matches anywhere in a string               re.search("pattern", string, flags=0)
Match regex at the start of a string and capture groups   re.match("pattern", string, flags=0)
Specify alternative regex                                 pattern1|pattern2

Character Class

A character class specifies a list of characters to match ([...] where ... represents the list) or not match ([^...])

Character Class                                               Pattern
Any lowercase vowel                                           [aeiou]
Any digit                                                     [0-9]
Any lowercase letter                                          [a-z]
Any uppercase letter                                          [A-Z]
Any digit, lowercase letter, or uppercase letter              [a-zA-Z0-9]
Anything except a lowercase vowel                             [^aeiou]
Anything except a digit                                       [^0-9]
Anything except a space                                       [^ ]
Any character                                                 .
Any word character (equivalent to [a-zA-Z0-9_])               \w
Any non-word character (equivalent to [^a-zA-Z0-9_])          \W
A digit character (equivalent to [0-9])                       \d
Any non-digit character (equivalent to [^0-9])                \D
Any whitespace character (equivalent to [ \t\n\r\f\v])        \s
Any non-whitespace character (equivalent to [^ \t\n\r\f\v])   \S

Anchors

Anchors are special characters that can be used to match a pattern at a specified position

Anchor               Special Character
Beginning of line    ^
End of line          $
Beginning of string  \A
End of string        \Z

Repetition and Quantifier Characters

Repetition or quantifier characters specify the number of times to match a particular character or set of characters

Repetition                      Character
Zero or more times              *
One or more times               +
Zero or one time                ?
Exactly n times                 {n}
n or more times                 {n,}
At most m times                 {,m}
At least n and at most m times  {n,m}

Input:

# regex.py
import re

number1 = "(555)123-4567"
number2 = "123-45-6789"

# check if matches
if re.search(r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}", number1):
    print("match!")

if re.search(r"\([0-9]{3}\)[0-9]{3}-[0-9]{4}", number2):
    print("match!")
else:
    print("no match!")

# capture matches
# use parentheses to "capture" different parts of a regular
# expression for later use; the first set of parentheses corresponds
# to group 1, the second to group 2, etc.

number_details = re.match(r"\(([0-9]{3})\)([0-9]{3}-[0-9]{4})", number1)

if number_details is not None:
    area_code = number_details.group(1)
    phone_number = number_details.group(2)

    print(f"area code: {area_code}")
    print(f"phone number: {phone_number}")

Output:

match!
no match!
area code: 555
phone number: 123-4567
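
Anchors and quantifiers can be combined to validate a whole string rather than find a pattern somewhere inside it. The sketch below is illustrative: the ZIP code pattern and the sample strings are hypothetical and chosen only to exercise ^, $, \d, {n}, and ?.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string,
# \d matches a digit, and {n} repeats the previous item exactly n times.
# The trailing ? makes the "-1234" style suffix optional.
zip_pattern = r"^\d{5}(-\d{4})?$"

samples = ["02912", "02912-1234", "2912", "02912-12"]
results = {s: bool(re.search(zip_pattern, s)) for s in samples}

for sample, matched in results.items():
    print(f"{sample}: {matched}")

# findall returns every non-overlapping match of a character class
vowels = re.findall(r"[aeiou]", "regular expression")
```

Without the anchors, the pattern would also accept longer strings that merely contain five digits in a row.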

Resources

Control Flow

In computer science, control flow (or flow of control) is the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated. [1]

This page provides syntax for some of the common control flow methods in Python. Each section includes an example to demonstrate the described methods.

Use Cases and Syntax

Test if a specified expression is true or false
Short-circuit evaluation
- Test if all of the conditions are true x and y
- Test if any of the conditions are true x or y
- Test if a condition is not true not z
Conditional evaluation
- if statement
- if-else
- if-elif-else
- Ternary operator
  - true_value if condition else false_value
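
Short-circuit evaluation means that Python stops evaluating an and/or expression as soon as the result is known. The sketch below uses a hypothetical helper, ratio_above_one, and made-up values to show why this matters in practice.

```python
def ratio_above_one(numerator, denominator):
    # "and" stops at the first false condition, so the division on the
    # right is never attempted when denominator is 0
    return denominator != 0 and numerator / denominator > 1

safe_result = ratio_above_one(10, 0)    # False, with no ZeroDivisionError
normal_result = ratio_above_one(10, 5)  # True

# "or" stops at the first true condition; a common use is a default value
name = ""
display_name = name or "unknown"

# Ternary operator: true_value if condition else false_value
label = "adult" if 30 >= 18 else "minor"

print(safe_result, normal_result, display_name, label)
```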

Conditional Statements

Input:

x, y, z = 100, 200, 300
print(f"x = {x}, y = {y}, z = {z}")

# Test if x equals 100
if x == 100:
    print(f"{x} equals 100")
    
# Test if y does not equal z
if y != z:
    print(f"{y} does not equal {z}")
    
# Test multiple conditions
if x < y < z:
    print(f"{y} is less than {z} and greater than {x}")

# Test multiple conditions using "and"
if x < y and x < z:
    print(f"{x} is less than {y} and {z}")

# Test multiple conditions using "or"
if y < x or y < z:
    print(f"{y} is less than {x} or {z}")

# if-else statement 
if x < 100:
    print(f"{x} less than 100")
else:
    print(f"{x} is equal to or greater than 100")
    
# Same logic as above but using the ternary operator
print(f"{x} less than 100 again" if x < 100 else f"{x} equal to or greater than 100 again")

# if-elif-else statement
if y < 100:
    print(f"{y} is less than 100")
elif y < 200:
    print(f"{y} is less than 200")
elif y < 300:
    print(f"{y} is less than 300")
else:
    print(f"{y} is greater than or equal to 300")

Output:

x = 100, y = 200, z = 300
100 equals 100
200 does not equal 300
200 is less than 300 and greater than 100
100 is less than 200 and 300
200 is less than 100 or 300
100 is equal to or greater than 100
100 equal to or greater than 100 again
200 is less than 300

Loops

Repeat a block of code a specified number of times or until some condition is met
while loop
for loop
Use break to terminate loop

Input:

# Demonstrates use of loops

i = 1

# while loop for incrementing i by 1 from 1 to 3
while i <= 3:
    print(f"while: {i}")
    i +=1

# for loop 
for j in range(1,4):
    print(f"for: {j}")
    
for j in range(1,4):
    print(f"for again: {j}")
    
# nested for loop
for j in range(1,4):
    for k in range(1,4):
        print(f"nested for: {j} * {k} = {j*k}")

Output:

while: 1
while: 2
while: 3
for: 1
for: 2
for: 3
for again: 1
for again: 2
for again: 3
nested for: 1 * 1 = 1
nested for: 1 * 2 = 2
nested for: 1 * 3 = 3
nested for: 2 * 1 = 2
nested for: 2 * 2 = 4
nested for: 2 * 3 = 6
nested for: 3 * 1 = 3
nested for: 3 * 2 = 6
nested for: 3 * 3 = 9
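
The loops above always run to completion. As a minimal sketch of break, the following loop stops as soon as it finds what it is looking for; the readings and the threshold are made-up values for illustration.

```python
# Stop at the first reading above a threshold.
readings = [98, 101, 97, 140, 99]
threshold = 120

first_high = None
checked = 0

for reading in readings:
    checked += 1
    if reading > threshold:
        first_high = reading
        break    # leave the loop immediately; 99 is never examined

print(f"checked {checked} readings, first high value: {first_high}")
```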

Comparison Operators and Functions

Operator                  Example
Equality                  x == y
Inequality                x != y
Less than                 x < y
Less than or equal to     x <= y
Greater than              x > y
Greater than or equal to  x >= y

Input:

# Demonstrate comparison operators

# Assign values to variables using parallel assignment
c1, c2, c3, c4 = 25, 50, 75, 50
print(f"c1 = {c1}, c2 = {c2}, c3 = {c3}, c4 = {c4}")

# Output results of different comparison operations

# Testing equality
print(f" c1 = c3 is {c1 == c3}")
print(f" c2 = c4 is {c2 == c4}")

# Changing values using abbreviated assignment operators 
c1 *= 3     # shorthand for c1 = c1 * 3
c4 += 1     # shorthand for c4 = c4 + 1

print(f"c1 = {c1}, c2 = {c2}, c3 = {c3}, c4 = {c4}")

# Testing less than and greater than 
print(f" c1 < c2 is {c1 < c2}")
print(f" c4 <= c2 is {c4 <= c2}")
print(f" c1 > c2 is {c1 > c2}")
print(f" c3 >= c2 is {c3 >= c2}")

Output:

c1 = 25, c2 = 50, c3 = 75, c4 = 50
 c1 = c3 is False
 c2 = c4 is True
c1 = 75, c2 = 50, c3 = 75, c4 = 51
 c1 < c2 is False
 c4 <= c2 is False
 c1 > c2 is True
 c3 >= c2 is True

Resources

Python Documentation: Control Flow
Python Wiki: For Loops
W3 Schools: Python For Loops
W3 Schools: Python Conditionals and If Statements

File Input/Output

Many Python programs involve the input and output of files. When analyzing a dataset, that dataset file will need to be pulled into your program (input). If you want to see the results of your analysis, your program will need an output.

This section provides the syntax for inputting files (reading) and outputting results (writing) using base Python (i.e., without packages such as Pandas).

UC Irvine Machine Learning Repository: Adult Data Set

Tabulate and report counts for sex from the Adult Data Set.

Dataset (example lines from adult.data)

Input (process_file.py)

Output

Terminal
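
A minimal sketch of this kind of tabulation, using only base Python, might look like the following. The sample rows, the file names, and the helper make_row are hypothetical. The sketch assumes adult.data is comma-separated with 15 fields per row and sex as the 10th field; it writes its own small sample file so that it runs without the real dataset.

```python
import os
import tempfile

def make_row(age, sex):
    # Build a 15-field row; "?" stands in for fields this sketch ignores
    fields = [str(age)] + ["?"] * 8 + [sex] + ["?"] * 5
    return ", ".join(fields)

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "sample.data")
out_path = os.path.join(workdir, "sex_counts.txt")

# Write the sample input file
with open(in_path, "w") as infile:
    for age, sex in [(39, "Male"), (50, "Male"), (38, "Female")]:
        infile.write(make_row(age, sex) + "\n")

# Read the file line by line and tabulate the sex column
counts = {}
with open(in_path) as infile:
    for line in infile:
        line = line.strip()
        if not line:
            continue            # skip blank lines
        fields = [field.strip() for field in line.split(",")]
        sex = fields[9]         # 10th field, index 9
        counts[sex] = counts.get(sex, 0) + 1

# Write the counts to an output file, one "value<TAB>count" pair per line
with open(out_path, "w") as outfile:
    for sex, count in sorted(counts.items()):
        outfile.write(f"{sex}\t{count}\n")

with open(out_path) as result:
    report = result.read()
print(report)
```

Opening files in a with block closes them automatically, even if an error occurs partway through.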

Exercises

Analyze the MIMIC-IV Demo Files Using Python - Forthcoming!
Analyze the SyntheticRI Demo Files Using Python - Forthcoming!

Resources

Tutorials Point:
Data Science Central:

Packages

In computer programming, a package is a collection of modules or programs that are often published as tools for a range of common use cases, such as text processing and doing math. Programmers can install these packages and take advantage of their functionality within their own code.

This page provides instructions for installing, using, and troubleshooting packages in Python.

Installing and Loading Packages

There is a two-step process for using an external package in Python. First, if it is your first time using the package, you must install it. This only needs to be done once for the environment you are working in, even if you are using different documents or files. Then, you must load the package into your specific document. Let's look at an example using the NumPy package.

Installing Packages Syntax

To install a package, we use the pip command as follows:

pip install numpy

Again, note that this only needs to be done once. After you have installed a package, you do not need to install it again; you can simply load it.
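
If you are unsure whether a package is already installed in your current environment, you can check from within Python. This is a small sketch using the standard library's importlib; the helper name is_installed and the deliberately nonexistent package name are illustrative.

```python
import importlib.util

def is_installed(package_name):
    # find_spec returns None when Python cannot locate the top-level package
    return importlib.util.find_spec(package_name) is not None

# json ships with Python, so this is always True
print(is_installed("json"))

# True only if numpy has been installed in the current environment
print(is_installed("numpy"))

# A name that is assumed not to exist, for illustration
print(is_installed("not_a_real_package_xyz"))
```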

Loading Packages

If we want to load an entire package (instead of just certain functions), we can use the import command as follows:

import numpy as np

We import the package and give it a shorthand name (an alias) so that we do not need to type the whole package name every time we want to use a function from that package. To call a function from an imported package, use the shorthand name, followed by a dot, followed by the name of the function. Here is an example:

# Creating an array
array1 = np.array([1, 2, 3, 4, 5])

# Getting the mean of the values in our array
mean = np.mean(array1)

Module-Based Packages

Some packages have many different parts, or modules, and we might not want to use all of these modules at once. Importing all of these modules when we don't need them can be an unnecessary waste of computing power, so instead we can import only the functions we need. Let's look at the scikit-learn package, for example.

Scikit-Learn

We can install this package the same way as above; however, we will not import the whole package at once. Instead, we will only import the functions we need from the modules we need. Here is an example of how we can import the train_test_split() function from the model_selection module of scikit-learn (or sklearn for short):

from sklearn.model_selection import train_test_split
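
The two import styles can be compared side by side without installing anything. The sketch below uses the standard-library modules statistics and math as stand-ins for NumPy and scikit-learn; that substitution is an assumption made so the example runs in any Python environment.

```python
# Import a whole module under a shorthand name
import statistics as st

# Import only the function that is needed from a module
from math import sqrt

values = [1, 2, 3, 4, 5]

mean_value = st.mean(values)   # called through the alias
root = sqrt(16)                # called directly, no prefix needed

print(mean_value, root)
```

Importing a single function keeps the call short, while importing under an alias makes it clear which package each function comes from.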

R

R is one of the many languages used by the data science community to perform data manipulation, statistical modeling and machine learning. R was designed by statisticians for statistical computing.

Resources

Installation

For most users, it is recommended to download the current stable release from https://cloud.r-project.org/.

Some developers might wish to use a different version, or to switch between versions. For this, the rvenv package can be useful.

R is also available for use in Brown's Computing Environments:

Oscar (for high-performance computing)
Stronghold (for secure computing)

macOS

Download and install the latest version of The R Project for Statistical Computing for macOS from https://cloud.r-project.org/.
For an integrated development environment (IDE) / graphical interface, you can also download and install RStudio from the Posit website.

Windows

Download and install the latest version of The R Project for Statistical Computing for Windows from https://cloud.r-project.org/.
For an integrated development environment (IDE) / graphical interface, you can also download and install RStudio from the Posit website.

REPL

R comes with an interactive command-line REPL (read-eval-print loop) built into the R executable. In addition to allowing quick and easy evaluation of R statements, it has command history, tab-completion, and built-in help through ? and help().

This page provides examples of using REPL on the command line.

R REPL Example

On Oscar, type "module load r" in the terminal to load the R module, then on a new line type "R" to launch R
Type q() to quit R and return to the terminal

R REPL Help Pages

Type ?function or help(function) to open a help page within R's REPL
For example, to ask for help with linear models in R, use help(lm) (output shown below)

Resources

REPL Environment Help

Basic Syntax

"Hello, World!" Program

This is the typical first program for those new to a programming language. It can be used to test that the Installation of R is working and also introduce R's basic syntax using the REPL environment or running code written using a Text Editor at the Unix command line.

Inputs:

#This is a single line comment
print("Hello, World!")

Outputs:

[1] "Hello, World!"

Variable Assignment

Operator        Description       Example
<- or = or <<-  Left Assignment   x <- 7, x = 7, x <<- 7
-> or ->>       Right Assignment  7 -> x, 7 ->> x

Vectors (Classes)

Type       Example
Logical    TRUE, FALSE
Numeric    1, 55, 999
Integer    1L, 32L, 0L
Complex    2 + 3i
Character  "great", "23.4"

Print Statements

Unlike some other languages, R does not require print statements to display output at the console, but it does allow them. To print, you can simply write an expression, or wrap the expression in a print() statement.

Vector Assignment and Print Statement examples:

Inputs:

#Assign three colors to the "apple" variable
apple <- c('red','green','yellow')

print(apple)

#Get the class of the vector (with and without print statement)
print(class(apple))
class(apple)

Outputs:

[1] "red"    "green"  "yellow"
[1] "character"
[1] "character"

Comments

We can write comments in our code, which do not run, to describe what certain lines or sections of code do. These comments are just for the programmer: they will not appear anywhere in the output and simply explain what the code is doing or provide helpful notes.

To comment in R, use the "#" symbol and type your comment after it on the same line
R has no syntax for multi-line comments, so each line that is commented out needs a "#" symbol at the beginning

Resources

R Documentation: Vectors and Assignment
R Documentation: Comments

Numbers and Math

Arithmetic Operators

Operator

Description

Inputs:

Outputs:
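
A minimal sketch of R's built-in arithmetic operators, using two example values (a and b) chosen here purely for illustration:

```r
a <- 7
b <- 2

a + b     # addition: 9
a - b     # subtraction: 5
a * b     # multiplication: 14
a / b     # division: 3.5
a ^ b     # exponentiation: 49
a %% b    # modulus (remainder): 1
a %/% b   # integer division: 3
```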

Comparison Operators

Operator

Description
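
A short sketch of R's comparison operators, using illustrative values. Each comparison returns a logical value, and comparisons are vectorized, so they can be applied to a whole vector at once.

```r
age <- 54

age == 54   # equal to: TRUE
age != 54   # not equal to: FALSE
age > 50    # greater than: TRUE
age < 50    # less than: FALSE
age >= 54   # greater than or equal to: TRUE
age <= 28   # less than or equal to: FALSE

# Comparing a vector returns one logical value per element
ages <- c(68, 54, 49, 28, 36)
over_50 <- ages > 50
over_50     # TRUE TRUE FALSE FALSE FALSE
```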

Resources

R Documentation:
R Documentation:

Strings and Characters

String Functions

| Action | Function |
| --- | --- |
| Get string length | nchar(string) |
| Combine two strings | str_c(string1, string2) (from the stringr package) |
| Sort a vector of strings | sort(c(string1, string2, string3)) |

Inputs:

#Load stringr (part of the tidyverse), which provides str_c()
library(stringr)

#String length
nchar("codiac")

#Combine strings
str_c("patient ", c("a", "b", "c"))

#Sort values in a string
x <- c("carrot", "apple", "banana")
sort(x)

Outputs:

#String length
[1] 6

#Combine strings
[1] "patient a" "patient b" "patient c"

#Sort values in a string
[1] "apple"  "banana" "carrot"

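Because str_c() comes from the stringr package, it may help to know the base R counterparts, which need no extra packages. The sketch below is illustrative; the variable names are not from the original guide.

```r
# paste0() joins strings with no separator, much like str_c()
labels <- paste0("patient ", c("a", "b", "c"))
labels                           # "patient a" "patient b" "patient c"

# toupper() converts to upper case; substr() extracts characters by position
upper <- toupper("codiac")       # "CODIAC"
prefix <- substr("codiac", 1, 3) # "cod"
```
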
Resources

R for Data Science: String Functions

Regular Expression

RegEx Functions

Action

Function

Inputs:

Outputs:
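
A hedged sketch of a few base R regular expression functions, applied to made-up clinical notes (the notes vector and the patterns are assumptions for illustration only):

```r
notes <- c("BP 120/80", "no vitals recorded", "BP 135/90")

# grepl(): does each string contain a match?
has_bp <- grepl("[0-9]+/[0-9]+", notes)
has_bp     # TRUE FALSE TRUE

# gsub() replaces every match; sub() would replace only the first
masked <- gsub("[0-9]", "#", notes[1])
masked     # "BP ###/##"

# regmatches() with regexpr() extracts the first match from each matching string
readings <- regmatches(notes, regexpr("[0-9]+/[0-9]+", notes))
readings   # "120/80" "135/90"
```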

Resources

DataCamp:

Control Flow

Use Cases & Syntax

Used to test if a specific case is true or false

Logical operators (&& and || use short-circuit evaluation):

&&: Test if all conditions are true
||: Test if any conditions are true
!: Test if a condition is not true

Conditional evaluation

If statement: run code if this statement is true
- Only used at the beginning of a conditional statement
Else if statement: if previous statements aren't true, try this
- Can be used an unlimited number of times in an if statement
Else statement: catch-all for anything outside of prior statements
- Only used to end a conditional statement
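
The conditional structure described above can be sketched as follows. The function name classify_age and the age cut-offs are hypothetical and chosen only for illustration.

```r
# if / else if / else: the first condition that is TRUE decides the result
classify_age <- function(age) {
  if (age >= 65) {
    "older adult"
  } else if (age >= 18) {
    "adult"
  } else {
    "minor"
  }
}

classify_age(68)   # "older adult"
classify_age(36)   # "adult"
classify_age(12)   # "minor"

# && and || evaluate left to right and stop as soon as the result is known
age <- 49
gender <- "F"
all_true <- age > 25 && age < 50 && gender == "F"   # TRUE
any_true <- age > 65 || gender == "F"               # TRUE
not_true <- !(age > 65)                             # TRUE
```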

Inputs:

Outputs:

Loops

Repeats a block of code a specified number of times or until some condition is met

While loop
For loop
Use break to terminate loop
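
A minimal sketch of the three loop constructs listed above, using an illustrative vector of ages:

```r
ages <- c(68, 54, 49, 28, 36)

# for loop: visit every element of a vector
total <- 0
for (age in ages) {
  total <- total + age
}
total            # 235

# while loop: repeat until the condition is no longer true
count <- 0
while (count < 3) {
  count <- count + 1
}
count            # 3

# break: stop the loop at the first age under 50
first_under_50 <- NA
for (age in ages) {
  if (age < 50) {
    first_under_50 <- age
    break
  }
}
first_under_50   # 49
```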

Inputs:

Outputs:

Comparison Operators

Operator

Description

Input:

Output:

Resources

R Documentation:
R Documentation:

File Input/Output

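When coding in R, you will often need to read in datasets to work with. The most common sources are .csv and .txt files, which can be read with the read.csv() and read_table() functions, respectively. read.csv() is part of base R, while read_table() comes from the readr package (loaded with the tidyverse); the base R equivalent is read.table(). The following demonstrates these functions using a hypothetical "hospital_data" dataset.

To capture R's console output in a file, use the syntax sink("FileName.FileType"), then call sink() with no arguments to stop redirecting output. To save a data frame itself, write.csv() is the more usual choice.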

File Input:

#read_table() requires the readr package
library(readr)

#If the dataset is already in the R working directory
read.csv("hospital_data.csv")
read_table("hospital_data.txt")

#To read a dataset from the Downloads folder (Mac)
read.csv("/Users/username/Downloads/hospital_data.csv")
read_table("/Users/username/Downloads/hospital_data.txt")

#To read a dataset from the desktop (Windows)
read.csv("C:\\Users\\username\\Desktop\\hospital_data.csv")
read_table("C:\\Users\\username\\Desktop\\hospital_data.txt")

#Note that Mac paths use forward slashes and Windows paths use backslashes,
#which must be doubled (\\) inside R strings; R also accepts forward slashes on Windows
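
One way to avoid worrying about slash direction is to build paths with file.path(), which joins its arguments with forward slashes. The sketch below reuses the hypothetical hospital_data.csv file name; the folder is an assumption for illustration.

```r
# file.path() joins path components with "/"
csv_path <- file.path("~", "Downloads", "hospital_data.csv")
csv_path   # "~/Downloads/hospital_data.csv"

# getwd() shows the working directory that relative paths are resolved against
getwd()
```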

File Output:

#To redirect console output to a .txt file:
sink("hospital_data.txt")

#To redirect console output to a .csv file:
sink("hospital_data.csv")

#To stop redirecting and return output to the console:
sink()
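
Because sink() captures console output rather than writing a dataset, a data frame is more commonly saved with write.csv(). The sketch below is a minimal round trip under stated assumptions: the small hospital_data data frame and its columns are made up, and the file is written to a temporary directory so nothing in your working directory is overwritten.

```r
# A small made-up dataset for illustration
hospital_data <- data.frame(
  id = 1:3,
  ward = c("A", "B", "A")
)

# Write to a temporary location, leaving out the row-name column
out_file <- file.path(tempdir(), "hospital_data.csv")
write.csv(hospital_data, out_file, row.names = FALSE)

# Read the file back in to confirm the contents survived
reloaded <- read.csv(out_file)
reloaded
```

Setting row.names = FALSE keeps write.csv() from adding an extra unnamed column of row numbers to the file.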

Resources:

R Documentation: read.csv file input
- More read.csv resources here
R Documentation: read_table file input
R Documentation: File output

Packages

In computer programming, a package is a collection of modules or programs that are often published as tools for a range of common use cases, such as text processing and doing math. Programmers can install these packages and take advantage of their functionality within their own code.

This page includes instructions for installing packages in R and a description of some of R's most frequently used packages.

Installing Packages

To install a package in R, you can either:

Use the install.packages("PackageName") function, which downloads the package from CRAN and installs it on your machine
Or if you are using RStudio, you can use Tools > Install Packages, enter the package name, and click Install

Once you install the package, you have to load it into your session using the library(PackageName) function.

#Installing a package from CRAN
install.packages("tidyverse")

#Once the package is installed, you have to load it
library(tidyverse)
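
Since a package only needs to be installed once, scripts often check whether it is already available before installing. The helper below, install_if_missing, is a hypothetical name used for illustration; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Return (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers a download
install_if_missing("stats")
```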

Helpful Packages

In R, tidyverse is one of the most popular packages, as it contains an assortment of packages used for data science, such as:

ggplot2, used to create graphics and data visualization
dplyr, contains functions used for data manipulation, like mutate() and filter()
tidyr, used for data organization and cleaning
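tibble, a modern version of the data frame with cleaner printing
readxl, can be used to read Excel files in .xls and .xlsx format into R (installed with tidyverse, but loaded separately with library(readxl))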

Resources

R Documentation: Packages
Tidyverse

DataFrames

data.frame, data.table and the dplyr package provide a set of tools for working with tabular data in R. Their design and functionality are similar to those of DataFrames.jl (in Julia) and pandas (in Python), making them great general purpose data science tools.

This page provides examples of using data.frame, data.table, and dplyr, demonstrating the syntax and common functions within the tools.

Example

Installing data.frame, data.table, and dplyr in R.

The data.frame class is built into base R, so it requires no installation, and the dplyr package is part of the tidyverse package (see Packages section for tidyverse installation instructions). To install data.table, use install.packages('data.table').

This example uses data.frame, as it does not require additional packages; see the resources at the bottom of this page for additional information on data.table and dplyr.

Create DataFrame

#Create DataFrame
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)

Display DataFrame

Input:

#Display DataFrame
df

Output:

  id gender age
1  1      F  68
2  2      M  54
3  3      F  49
4  4      M  28
5  5      F  36

Print first two lines of DataFrame

Input:

#Print first two lines of DataFrame
head(df, 2)

Output:

  id gender age
1  1      F  68
2  2      M  54

Print last two lines of DataFrame

Input:

# Last two lines of DataFrame
tail(df, 2)

Output:

  id gender age
4  4      M  28
5  5      F  36

Describe DataFrame

DataFrame size:

Input:

#DataFrame size
dim(df)

Output:

#First value represents number of rows, second value represents number of columns
[1] 5 3

DataFrame column names:

Input:

#DataFrame column names
colnames(df)

Output:

[1] "id"     "gender" "age"

DataFrame description:

Input:

#Describe DataFrame
summary(df)

Output:

       id       gender               age    
 Min.   :1   Length:5           Min.   :28  
 1st Qu.:2   Class :character   1st Qu.:36  
 Median :3   Mode  :character   Median :49  
 Mean   :3                      Mean   :47  
 3rd Qu.:4                      3rd Qu.:54  
 Max.   :5                      Max.   :68

Accessing DataFrames

Get "age" column (different ways to call the column)

Input:

#Call by column name
df$age
df[["age"]]

#Get column by column number
df[[3]]

Output:

#Call by column name
[1] 68 54 49 28 36
[1] 68 54 49 28 36

#Get column by column number
[1] 68 54 49 28 36

Get row

Input:

#Print row 2
df[2, ]

Output:

  id gender age
2  2      M  54

Get element

Input:

#Get element in row 2, column 3
df[2,3]

Output:

[1] 54

Get subset (specific rows and all columns)

Input:

#Print out rows 1, 3, & 5
df[c(1, 3, 5), ]

Output:

  id gender age
1  1      F  68
3  3      F  49
5  5      F  36

Get subset (all rows and specific columns)

Input:

#Print out all rows and only columns 1 (id) and 3 (age)
#Using column names
df[, c("id", "age")]

#Using column numbers
df[, c(1, 3)]

Output:

#Using column names:
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36

#Using column numbers
  id age
1  1  68
2  2  54
3  3  49
4  4  28
5  5  36

Get subset (all rows meeting specified criteria - numbers)

Input:

#Print all rows where age is greater than 50
df[df$age > 50, ]

Output:

  id gender age
1  1      F  68
2  2      M  54

Get subset (all rows meeting specified criteria - strings)

Input:

#Print all rows where gender is female ("F")
df[df$gender == "F", ]

Output:

  id gender age
1  1      F  68
3  3      F  49
5  5      F  36

Get subset (all rows meeting specified criteria)

Input:

#Print all rows where gender is female ("F") and age is between 25-50
df[df$gender == "F" & df$age > 25 & df$age < 50, ]

Output:

  id gender age
3  3      F  49
5  5      F  36
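
The bracket-based filters above can also be written with base R's subset() function, which lets you refer to columns by name without repeating df$. This is a sketch of an alternative, not a replacement for the approach shown; it rebuilds the same example data frame so it can run on its own.

```r
df <- data.frame(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)

# Rows where gender is "F" and age is between 25 and 50
result <- subset(df, gender == "F" & age > 25 & age < 50)
result

# Filter rows and keep only selected columns in one call
over_50 <- subset(df, age > 50, select = c(id, age))
over_50
```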

Add Column

New columns with specified values

Input:

#Add a column for height (in inches)
df$height <- c(62, 60, 61, 63, 64)

#Add a column for weight (in pounds)
df$weight <- c(100, 120, 150, 175, 300)

#Print DataFrame to see changes
df

#Describe DataFrame to see column names and summary
summary(df)

Output:

  id gender age height weight
1  1      F  68     62    100
2  2      M  54     60    120
3  3      F  49     61    150
4  4      M  28     63    175
5  5      F  36     64    300

#Describe dataframe to see column names and summary:
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300

New column with calculated value

Input:

#Add a column with calculated BMI (703 converts from pounds and inches)
df$bmi <- (df$weight / (df$height^2)) * 703

#Print DataFrame to see changes
df

#Describe DataFrame to see column names and summary
summary(df)

Output:

#Updated DataFrame
  id gender age height weight      bmi
1  1      F  68     62    100 18.28824
2  2      M  54     60    120 23.43333
3  3      F  49     61    150 28.33916
4  4      M  28     63    175 30.99647
5  5      F  36     64    300 51.48926

#Describe dataframe to see new bmi column and summary:
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300  
      bmi       
 Min.   :18.29  
 1st Qu.:23.43  
 Median :28.34  
 Mean   :30.51  
 3rd Qu.:31.00  
 Max.   :51.49

Get counts/frequency

Input:

#Get counts of males and females in the dataframe
gender_counts <- table(df$gender)
gender_counts

Output:

F M 
3 2

Transform DataFrame

sort

Input:

#Sort the dataframe by gender then age, in reverse order for age (oldest to youngest)
df_sorted <- df[order(df$gender, -df$age), ]
df_sorted

Output:

  id gender age height weight      bmi
1  1      F  68     62    100 18.28824
3  3      F  49     61    150 28.33916
5  5      F  36     64    300 51.48926
2  2      M  54     60    120 23.43333
4  4      M  28     63    175 30.99647

stack (reshape from wide to long format)

Input:

#Reshape from wide to long format (exclude id column)
long_df <- reshape(df, varying = c("gender", "age", "weight", "height", "bmi"), 
                   v.names = "value", 
                   timevar = "variable", 
                   times = c("gender", "age", "weight", "height", "bmi"), 
                   direction = "long")
long_df

Output:

         id variable            value
1.gender  1   gender                F
2.gender  2   gender                M
3.gender  3   gender                F
4.gender  4   gender                M
5.gender  5   gender                F
1.age     1      age               68
2.age     2      age               54
3.age     3      age               49
4.age     4      age               28
5.age     5      age               36
1.weight  1   weight              100
2.weight  2   weight              120
3.weight  3   weight              150
4.weight  4   weight              175
5.weight  5   weight              300
1.height  1   height               62
2.height  2   height               60
3.height  3   height               61
4.height  4   height               63
5.height  5   height               64
1.bmi     1      bmi 18.2882414151925
2.bmi     2      bmi 23.4333333333333
3.bmi     3      bmi 28.3391561408224
4.bmi     4      bmi 30.9964726631393
5.bmi     5      bmi    51.4892578125

unstack (reshape from long to wide format)

Input:

#Unstack dataframe to return to wide format based off "id"
wide_df <- reshape(long_df, idvar = "id", timevar = "variable", direction = "wide")
wide_df

Output:

         id value.gender value.age value.weight value.height        value.bmi
1.gender  1            F        68          100           62 18.2882414151925
2.gender  2            M        54          120           60 23.4333333333333
3.gender  3            F        49          150           61 28.3391561408224
4.gender  4            M        28          175           63 30.9964726631393
5.gender  5            F        36          300           64    51.4892578125

Traversing DataFrame (for loops)

Input:

#Size of dataframe = dim(df)
#Set number of rows to nrows and number of columns to ncols
nrows <- nrow(df)
ncols <- ncol(df)

cat("(nrows, ncols) = ", nrows, ncols, "\n")

#Use nested for loop to get information from DataFrame by row and column
for (row in 1:nrows) {
  for (col in 1:ncols) {
    cat("value for row", row, "and col", col, "is", df[row, col], "\n")
  }
}

Output:

(nrows, ncols) =  5 6 
value for row 1 and col 1 is 1 
value for row 1 and col 2 is F 
value for row 1 and col 3 is 68 
value for row 1 and col 4 is 62 
value for row 1 and col 5 is 100 
value for row 1 and col 6 is 18.28824 
value for row 2 and col 1 is 2 
value for row 2 and col 2 is M 
value for row 2 and col 3 is 54 
value for row 2 and col 4 is 60 
value for row 2 and col 5 is 120 
value for row 2 and col 6 is 23.43333 
value for row 3 and col 1 is 3 
value for row 3 and col 2 is F 
value for row 3 and col 3 is 49 
value for row 3 and col 4 is 61 
value for row 3 and col 5 is 150 
value for row 3 and col 6 is 28.33916 
value for row 4 and col 1 is 4 
value for row 4 and col 2 is M 
value for row 4 and col 3 is 28 
value for row 4 and col 4 is 63 
value for row 4 and col 5 is 175 
value for row 4 and col 6 is 30.99647 
value for row 5 and col 1 is 5 
value for row 5 and col 2 is F 
value for row 5 and col 3 is 36 
value for row 5 and col 4 is 64 
value for row 5 and col 5 is 300 
value for row 5 and col 6 is 51.48926

Notes:

When performing operations such as sorting or transformation, using a package like data.table or dplyr will typically be easier than using base R (data.frame), as those packages include commands designed for DataFrame manipulation. This guide uses base R for the sake of continuity.

Resources

R Documentation: data.table
Tidyverse: dplyr

Data Analysis and Manipulation

Notes:

This page will go over much of the same content as the DataFrames R page, but using tidyverse's dplyr and tidyr packages rather than base R. You may notice that pipes (%>%) are used more often here. A pipe passes the result on its left as the first argument to the function on its right, so df %>% summary() is equivalent to summary(df). Pipes tend to be the predominant syntax for more advanced uses of R, particularly in the tidyverse, as they let you chain multiple operations into one readable sequence.
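
As a small sketch of what the pipe does, the example below computes the same value twice, once with nested function calls and once with %>%. It assumes the magrittr package is installed (it is installed along with the tidyverse, and dplyr makes %>% available when loaded); the ages vector is illustrative.

```r
library(magrittr)

ages <- c(68, 54, 49, 28, 36)

# Nested calls are read from the inside out
nested <- round(mean(ages), 1)

# Piped calls are read left to right: take ages, then mean(), then round()
piped <- ages %>% mean() %>% round(1)

identical(nested, piped)   # TRUE
```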

Loading tidyverse packages:

In order to use the tidyverse packages, they first have to be installed (once) and then loaded (in every session). Ensure that the following code is at the top of your script:

#Install tidyverse (only needed once)
install.packages("tidyverse")

#Load tidyverse; this also attaches dplyr and tidyr, so the last two lines are optional
library(tidyverse)
library(dplyr)
library(tidyr)

Create DataFrame:

Input:

#Create DataFrame
df <- tibble(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
  )
df

Output:

#A tibble: 5 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     2 M         54
3     3 F         49
4     4 M         28
5     5 F         36

Describe DataFrame:

Input:

#DataFrame size:
list(rows = nrow(df), columns = ncol(df))

#DataFrame column names
colnames(df)  

#DataFrame summary
df %>% summary()

Output:

#DataFrame size:
$rows
[1] 5

$columns
[1] 3

#DataFrame column names
[1] "id"     "gender" "age" 

#DataFrame summary
       id       gender               age    
 Min.   :1   Length:5           Min.   :28  
 1st Qu.:2   Class :character   1st Qu.:36  
 Median :3   Mode  :character   Median :49  
 Mean   :3                      Mean   :47  
 3rd Qu.:4                      3rd Qu.:54  
 Max.   :5                      Max.   :68

Accessing specific DataFrame subsets:

Input:

# Get "age" column
df %>% select(age)

# Get row 2
df %>% slice(2)

# Get element in row 2, column 3
df %>% slice(2) %>% pull(3)

#Get subset (specific rows and all columns)
df %>% slice(c(1, 3, 5))

#Get subset (all rows and specific columns)
df %>% select(id, age)

#Get subset (all rows meeting specified criteria - numbers)
df %>% filter(age > 50)

#Get subset (all rows meeting specified criteria - strings)
df %>% filter(gender == "F")

#Get subset (all rows meeting specified criteria)
df %>% filter(gender == "F", between(age, 25, 50))

Output:

#Get "age" column
#A tibble: 5 × 1
    age
  <dbl>
1    68
2    54
3    49
4    28
5    36

#Get row 2
#A tibble: 1 × 3
     id gender   age
  <int> <chr>  <dbl>
1     2 M         54

#Get element in row 2, column 3
[1] 54

#Get subset (specific rows and all columns)
# A tibble: 3 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     3 F         49
3     5 F         36

#Get subset (all rows and specific columns)
# A tibble: 5 × 2
     id   age
  <int> <dbl>
1     1    68
2     2    54
3     3    49
4     4    28
5     5    36

#Get subset (all rows meeting specified criteria - numbers)
#A tibble: 2 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     2 M         54

#Get subset (all rows meeting specified criteria - strings)
#A tibble: 3 × 3
     id gender   age
  <int> <chr>  <dbl>
1     1 F         68
2     3 F         49
3     5 F         36

#Get subset (all rows meeting specified criteria)
#A tibble: 2 × 3
     id gender   age
  <int> <chr>  <dbl>
1     3 F         49
2     5 F         36

Adding Columns:

Input:

#New columns with specified values
df <- df %>%
  mutate(
    height = c(62, 60, 61, 63, 64),
    weight = c(100, 120, 150, 175, 300)
  )
df %>% summary()

#New column with calculated value
df <- df %>%
  mutate(bmi = (weight / (height^2)) * 703)

#Describe DataFrame
df %>% summary()

#Get counts/frequency
df %>% count(gender)

Output:

#New columns with specified values
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300   

#New column with calculated value
       id       gender               age         height       weight   
 Min.   :1   Length:5           Min.   :28   Min.   :60   Min.   :100  
 1st Qu.:2   Class :character   1st Qu.:36   1st Qu.:61   1st Qu.:120  
 Median :3   Mode  :character   Median :49   Median :62   Median :150  
 Mean   :3                      Mean   :47   Mean   :62   Mean   :169  
 3rd Qu.:4                      3rd Qu.:54   3rd Qu.:63   3rd Qu.:175  
 Max.   :5                      Max.   :68   Max.   :64   Max.   :300  
      bmi       
 Min.   :18.29  
 1st Qu.:23.43  
 Median :28.34  
 Mean   :30.51  
 3rd Qu.:31.00  
 Max.   :51.49  

#Get counts/frequency
#A tibble: 2 × 2
  gender     n
  <chr>  <int>
1 F          3
2 M          2

Transform DataFrame:

Input:

#Transform DataFrame
#Sort the dataframe by gender then age (reverse for age)
df_sorted <- df %>%
  arrange(gender, desc(age))
df_sorted

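#Reshape from wide to long format
#Assumes long_df already exists from the reshape() example on the DataFrames page;
#value is converted to character so that every variable shares one type
long_df <- long_df %>%
           mutate(value = as.character(value))
long_df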

#Reshape from long to wide format based on "id"
wide_df <- long_df %>%
  pivot_wider(names_from = variable, values_from = value)
wide_df

Output:

#Sort the dataframe by gender then age (reverse for age)
#A tibble: 5 × 6
     id gender   age height weight   bmi
  <int> <chr>  <dbl>  <dbl>  <dbl> <dbl>
1     1 F         68     62    100  18.3
2     3 F         49     61    150  28.3
3     5 F         36     64    300  51.5
4     2 M         54     60    120  23.4
5     4 M         28     63    175  31.0

#Reshape from wide to long format
       id variable            value
1.gender  1   gender                F
2.gender  2   gender                M
3.gender  3   gender                F
4.gender  4   gender                M
5.gender  5   gender                F
1.age     1      age               68
2.age     2      age               54
3.age     3      age               49
4.age     4      age               28
5.age     5      age               36
1.weight  1   weight              100
2.weight  2   weight              120
3.weight  3   weight              150
4.weight  4   weight              175
5.weight  5   weight              300
1.height  1   height               62
2.height  2   height               60
3.height  3   height               61
4.height  4   height               63
5.height  5   height               64
1.bmi     1      bmi 18.2882414151925
2.bmi     2      bmi 23.4333333333333
3.bmi     3      bmi 28.3391561408224
4.bmi     4      bmi 30.9964726631393
5.bmi     5      bmi    51.4892578125

#Reshape from long to wide format based on "id"
# A tibble: 5 × 6
     id gender age   weight height bmi             
  <int> <chr>  <chr> <chr>  <chr>  <chr>           
1     1 F      68    100    62     18.2882414151925
2     2 M      54    120    60     23.4333333333333
3     3 F      49    150    61     28.3391561408224
4     4 M      28    175    63     30.9964726631393
5     5 F      36    300    64     51.4892578125
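
The wide-to-long step shown above reuses a long_df that already exists. As a hedged sketch of how the same reshape could be done entirely within the tidyverse, tidyr's pivot_longer() can build the long table directly from the wide one. The example rebuilds a three-column version of df so it can run on its own, and converts the non-id columns to character first because a single value column can hold only one type. The row order of the result differs from the reshape() output (rows are grouped by id rather than by variable).

```r
library(tibble)
library(dplyr)
library(tidyr)

df <- tibble(
  id = 1:5,
  gender = c("F", "M", "F", "M", "F"),
  age = c(68, 54, 49, 28, 36)
)

# Wide to long: one row per id and variable
long_df <- df %>%
  mutate(across(-id, as.character)) %>%
  pivot_longer(cols = -id, names_to = "variable", values_to = "value")
long_df

# Long back to wide: one column per variable again
wide_df <- long_df %>%
  pivot_wider(names_from = variable, values_from = value)
wide_df
```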

Traversing DataFrame (for loops):

Input:

#Size of DataFrame
nrows <- nrow(df)
ncols <- ncol(df)

cat("(nrows, ncols) = ", nrows, ncols, "\n")

#Nested loop to traverse DataFrame
for (row in 1:nrows) {
  for (col in 1:ncols) {
    value <- df[row, col, drop = TRUE]
    cat("value for row", row, "and col", col, "is", value, "\n")
  }
}

Output:

#Size of DataFrame
(nrows, ncols) =  5 6 

#Nested loop to traverse DataFrame
value for row 1 and col 1 is 1 
value for row 1 and col 2 is F 
value for row 1 and col 3 is 68 
value for row 1 and col 4 is 62 
value for row 1 and col 5 is 100 
value for row 1 and col 6 is 18.28824 
value for row 2 and col 1 is 2 
value for row 2 and col 2 is M 
value for row 2 and col 3 is 54 
value for row 2 and col 4 is 60 
value for row 2 and col 5 is 120 
value for row 2 and col 6 is 23.43333 
value for row 3 and col 1 is 3 
value for row 3 and col 2 is F 
value for row 3 and col 3 is 49 
value for row 3 and col 4 is 61 
value for row 3 and col 5 is 150 
value for row 3 and col 6 is 28.33916 
value for row 4 and col 1 is 4 
value for row 4 and col 2 is M 
value for row 4 and col 3 is 28 
value for row 4 and col 4 is 63 
value for row 4 and col 5 is 175 
value for row 4 and col 6 is 30.99647 
value for row 5 and col 1 is 5 
value for row 5 and col 2 is F 
value for row 5 and col 3 is 36 
value for row 5 and col 4 is 64 
value for row 5 and col 5 is 300 
value for row 5 and col 6 is 51.48926