R functions, loops, conditional statements - Week 4 Lecture Notes
17501 • Apr 5, 2025
Functions in R Programming
Basic Function Structure
function_name <- function(parameter1, parameter2) {
# Function body
result <- parameter1 + parameter2 # replace with your own operations
return(result)
}
Example 1: Simple Function
# Calculate area of rectangle
calculate_area <- function(length, width) {
area <- length * width
return(area)
}
# Usage
calculate_area(5, 3) # Returns 15
Example 2: Function with Default Parameters
greet_user <- function(name = "User") {
greeting <- paste("Hello,", name, "!")
return(greeting)
}
greet_user() # Returns "Hello, User !"
greet_user("Maria") # Returns "Hello, Maria !"
Conditional Statements (if, else if, else)
Basic Structure
if (condition) {
# code if condition is TRUE
} else if (another_condition) {
# code if another_condition is TRUE
} else {
# code if all conditions are FALSE
}
Example 1: Simple Grade Calculator
get_grade <- function(score) {
if (score >= 90) {
return("A")
} else if (score >= 80) {
return("B")
} else if (score >= 70) {
return("C")
} else {
return("F")
}
}
get_grade(85) # Returns "B"
Example 2: Number Check
check_number <- function(x) {
if (x > 0) {
return("Positive")
} else if (x < 0) {
return("Negative")
} else {
return("Zero")
}
}
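The functions above test one value at a time, because `if` expects a single TRUE or FALSE. For a whole vector, a vectorized form is one option. A minimal sketch, assuming a hypothetical helper name `label_sign` that is not part of the original notes:

```r
# Vectorized counterpart of check_number(), built on nested ifelse() calls
label_sign <- function(x) {
  ifelse(x > 0, "Positive",
         ifelse(x < 0, "Negative", "Zero"))
}

# One label per input element
labels <- label_sign(c(-2, 0, 7))
print(labels)
```

`ifelse()` evaluates the condition element by element, so no explicit loop is needed.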
Conditions and Logical Operators
Common Logical Operators
- == : Equal to
- != : Not equal to
- > : Greater than
- < : Less than
- >= : Greater than or equal to
- <= : Less than or equal to
- & : AND
- | : OR
- ! : NOT
Example: Complex Conditions
check_eligibility <- function(age, income) {
if (age >= 18 & income >= 30000) {
return("Eligible")
} else if (age >= 21 | income >= 50000) {
return("Conditionally Eligible")
} else {
return("Not Eligible")
}
}
check_eligibility(19, 35000) # Returns "Eligible"
Example: Multiple Conditions
categorize_day <- function(day, temperature) {
if (day %in% c("Saturday", "Sunday") & temperature > 20) {
return("Perfect weekend!")
} else if (day %in% c("Saturday", "Sunday")) {
return("Cold weekend")
} else {
return("Weekday")
}
}
categorize_day("Saturday", 25) # Returns "Perfect weekend!"
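R also has the double forms `&&` and `||`. The single forms compare element by element, while the double forms work on single values and stop as soon as the answer is known, which is why they are often preferred inside `if ()`. A small sketch of the difference (variable names are illustrative):

```r
# Element-wise: one result per pair of elements
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)

# Short-circuit: the right-hand side is never evaluated when the left is FALSE,
# so the stop() call below does not run
short_circuit <- FALSE && stop("this is never reached")

print(elementwise)
print(short_circuit)
```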
Best Practices
- Always use clear and descriptive function names
- Include documentation/comments for complex functions
- Handle edge cases and invalid inputs
- Keep functions focused on a single task
- Use consistent indentation for readability
Example: Good Practice Implementation
calculate_bmi <- function(weight, height) {
# Input validation
if (!is.numeric(weight) | !is.numeric(height)) {
return("Error: Inputs must be numeric")
}
if (weight <= 0 | height <= 0) {
return("Error: Values must be positive")
}
# Calculate BMI
bmi <- weight / (height^2)
# Categorize BMI
if (bmi < 18.5) {
return("Underweight")
} else if (bmi < 25) {
return("Normal")
} else if (bmi < 30) {
return("Overweight")
} else {
return("Obese")
}
}
# Usage
calculate_bmi(70, 1.75) # Returns "Normal" (BMI is about 22.9)
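Returning an error message as a string means the caller cannot tell a failure from a valid result without inspecting the text. One alternative is to signal a real error with `stop()` and handle it with `tryCatch()`. A sketch under that assumption; `bmi_value` is a hypothetical variant, not the function from the notes:

```r
# Hypothetical variant that returns the numeric BMI and signals errors
bmi_value <- function(weight, height) {
  if (!is.numeric(weight) || !is.numeric(height)) {
    stop("Inputs must be numeric")
  }
  if (weight <= 0 || height <= 0) {
    stop("Values must be positive")
  }
  weight / height^2
}

# Valid input returns a number
ok <- bmi_value(70, 1.75)

# Invalid input raises an error; here the handler keeps only the message
failed <- tryCatch(bmi_value(-70, 1.75),
                   error = function(e) conditionMessage(e))

print(round(ok, 1))
print(failed)
```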
...
Data manipulation in R - Week 3 Lecture Notes
17501 • Apr 5, 2025
Package Management in R
1. Understanding R Package Ecosystem
The R package system is hierarchical:
# Base R: Comes with basic installation
mean(1:10) # Base function
# Recommended packages: Nearly standard but need loading
library(MASS) # A recommended package
# Third-party packages: Need installation and loading
install.packages("tidyverse") # Popular meta-package
2. Package Installation Strategies
Basic Installation:
# Single package
install.packages("dplyr")
# Multiple packages
install.packages(c("dplyr", "ggplot2", "tidyr"))
# With specific parameters for troubleshooting
install.packages("dplyr",
dependencies = TRUE, # Install all required packages
type = "binary", # Avoid source compilation
repos = "https://cran.rstudio.com/") # Specify repository
3. Package Loading and Namespace Management
Basic Loading:
# Standard loading
library(dplyr)
# Alternative loading with error handling
if (!require(dplyr)) {
install.packages("dplyr")
library(dplyr)
}
Namespace Conflicts and Resolution:
# Example of conflict
library(dplyr)
library(MASS) # Also has a select() function, which now masks dplyr::select()
# Three ways to handle conflicts:
# 1. Explicit namespace
dplyr::select(mtcars, mpg, cyl)
# 2. Evaluate inside the package namespace
with(asNamespace("dplyr"), select(mtcars, mpg, cyl))
# 3. Import specific functions (requires the import package)
import::from(dplyr, select, filter)
4. Package Version Management
Checking and Updating:
# Check installed packages
installed.packages()
# Check specific package version
packageVersion("dplyr")
# Update packages
update.packages()
# Install specific version (requires devtools)
devtools::install_version("dplyr", version = "1.0.0")
5. Important Caveats and Best Practices
Package Loading Order:
# BAD: Potential conflicts unclear
library(dplyr)
library(plyr)
library(tidyr)
# GOOD: Organized loading with comments
# Core data manipulation
library(dplyr) # Main data manipulation
library(tidyr) # Data reshaping
# Visualization
library(ggplot2) # Plotting
Function Conflicts Resolution:
# Check for conflicts
conflicts()
# Create alias for frequently used conflicting functions
filter_df <- dplyr::filter
select_df <- dplyr::select
# Use conflicted package for explicit conflict resolution
library(conflicted)
conflict_prefer("filter", "dplyr")
6. Project-specific Package Management
Using renv for Project Isolation:
# Initialize project-specific package management
renv::init()
# Install project packages
renv::install("dplyr")
# Snapshot current project state
renv::snapshot()
# Restore project packages
renv::restore()
7. Common Pitfalls and Solutions
Package Loading Errors:
# Problem: Package not found
library(nonexistentpackage) # Error
# Solution: Check and install
if (!require("package")) {
install.packages("package")
library(package)
}
# Problem: Version conflicts
# Solution: Use packageVersion() to check versions
if (packageVersion("dplyr") < "1.0.0") {
install.packages("dplyr")
}
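`require()` attaches the package as a side effect. When only an availability check is wanted, `requireNamespace()` is an option. A minimal sketch; `has_package` is a hypothetical helper name and the second package name is deliberately made up:

```r
# Returns TRUE if the package can be loaded, FALSE otherwise, without attaching it
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

stats_available <- has_package("stats")               # ships with R
missing_available <- has_package("notARealPackage12345") # assumed not installed

print(stats_available)
print(missing_available)
```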
File Path Management
Code Examples:
# Mac/Linux path
"~/Documents/data.csv"
# Windows paths (both valid)
"C:\\Data\\data.csv" # Double backslash
"C:/Data/data.csv" # Forward slash
Important Caveats:
- Path specifications differ between Windows and Mac/Linux
- Working directory management is crucial:
# Check current working directory
getwd()
# Set working directory
setwd("~/Documents/Project")
# List files in current directory
list.files()
Reading Data Files
Code Examples:
# CSV files
# Base R approach
data_base <- read.csv("file.csv", header = TRUE)
# readr approach (faster)
library(readr)
data_readr <- read_csv("file.csv")
# Excel files
library(readxl)
data_excel <- read_excel("file.xlsx", sheet = 1)
# SPSS files
library(haven)
data_spss <- read_sav("file.sav")
Important Caveats:
- Always verify data structure after reading:
str(data)
head(data)
- Excel date handling requires special attention:
# Converting Excel dates
as.Date(43800, origin = "1899-12-30") # "2019-12-01"
Data Manipulation and Subsetting
Basic subsetting in R:
# Create a simple dataset for demonstration
employee_data <- data.frame(
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
age = c(25, NA, 45, 32, NA),
salary = c(50000, 60000, NA, 75000, 80000)
)
# Basic subsetting by condition
young_employees <- employee_data[employee_data$age < 40, ]
When we subset like this, R evaluates each row against the condition and returns a logical vector (TRUE/FALSE). However, this introduces our first important consideration: how R handles missing values (NA) in comparisons.
Let's see what happens with NA values in comparisons:
# Demonstrate NA behavior
ages <- c(25, NA, 45, 32, NA)
ages < 40
# Returns: TRUE, NA, FALSE, TRUE, NA
# This means when we subset:
employee_data[employee_data$age < 40, ]
# We get rows where age < 40 is TRUE *and* an all-NA row for each comparison that returned NA
What is the which() function
This is where the which() function becomes valuable. It returns ==only the indices where the condition is TRUE==, dropping both FALSE and NA results.
# Using which() function
which(ages < 40)
# Returns: 1, 4 (only the indices where the condition is TRUE)
# Therefore:
employee_data[which(employee_data$age < 40), ]
# Only returns rows where age < 40 is definitively TRUE
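The difference is easiest to see by counting rows. A self-contained sketch that rebuilds the relevant columns of the example data (the variable names `logical_subset` and `which_subset` are illustrative):

```r
employee_data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age  = c(25, NA, 45, 32, NA)
)

# Logical indexing keeps the two TRUE rows and adds one all-NA row per NA
logical_subset <- employee_data[employee_data$age < 40, ]

# which() keeps only the rows where the comparison is TRUE
which_subset <- employee_data[which(employee_data$age < 40), ]

print(nrow(logical_subset)) # 4
print(nrow(which_subset))   # 2
```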
The which() function serves several important purposes:
1. NA Handling:
# Without which()
salary_filter <- employee_data[employee_data$salary > 60000, ]
# With which()
salary_filter_clean <- employee_data[which(employee_data$salary > 60000), ]
# The second approach excludes NA values automatically
2. Multiple Conditions:
# Complex conditions become clearer with which()
high_paid_young <- employee_data[which(
employee_data$salary > 60000 &
employee_data$age < 40
), ]
# This clearly shows which rows meet both conditions
3. Finding Specific Positions:
# Find indices of specific values
which(employee_data$name == "Alice") # Returns row number for Alice
# Can be used for multiple matches
# CRUCIAL: Pay attention to a complex and pretty interesting use case!
which(employee_data$salary > mean(employee_data$salary, na.rm = TRUE))
Advanced subsetting techniques:
# Using %in% operator for multiple values
selected_employees <- employee_data[which(
employee_data$name %in% c("Alice", "Bob", "Eve")
), ]
# Combining conditions with NA handling
qualified_employees <- employee_data[which(
employee_data$age >= 30 &
employee_data$salary >= 70000 &
!is.na(employee_data$age) & # Explicitly exclude NAs
!is.na(employee_data$salary)
), ]
Important considerations when using which():
2. Performance Impact:
# For very large datasets, which() might have performance implications
# In such cases, you might want to use data.table or dplyr alternatives:
library(dplyr)
qualified_employees <- employee_data %>%
filter(!is.na(age), !is.na(salary), age >= 30, salary >= 70000)
3. Maintaining Data Integrity:
# which() helps prevent unexpected results
# Bad approach:
mean(employee_data$salary[employee_data$salary > 60000]) # Includes NAs
# Better approach:
mean(employee_data$salary[which(employee_data$salary > 60000)]) # Excludes NAs
4. Logical Vector Operations:
# Understanding the difference
logical_vector <- employee_data$age < 40 # Contains TRUE, FALSE, NA
which_vector <- which(employee_data$age < 40) # Contains only matching indices
# This can be important for calculations
sum(logical_vector) # Might give NA
length(which_vector) # Gives actual count of TRUE values
Best Practices:
5. Always consider NA values in your data:
# Check for NAs before subsetting
sum(is.na(employee_data$age))
sum(is.na(employee_data$salary))
# Document your NA handling strategy
6. Use explicit NA handling when needed:
# Combining which() with explicit NA handling
clean_subset <- employee_data[which(
employee_data$age < 40 &
!is.na(employee_data$age)
), ]
7. Consider using modern alternatives:
# CRUCIAL: For real, take a close look at this library!
# dplyr approach for complex subsetting
library(dplyr)
clean_subset <- employee_data %>%
filter(!is.na(age), age < 40)
Data Manipulation with the dplyr package
Understanding dplyr's Core Philosophy
The dplyr package is built around a set of verb functions that each perform a specific data manipulation task. Let's start with a practical example:
library(dplyr)
# Create a sample dataset for demonstration
sales_data <- data.frame(
date = as.Date('2024-01-01') + 0:29,
region = rep(c("North", "South", "East", "West"), each = 8)[1:30],
sales = round(runif(30, 1000, 5000)),
profit = round(runif(30, 100, 1000))
)
# Basic dplyr operations pipeline
sales_analysis <- sales_data %>%
group_by(region) %>%
summarise(
total_sales = sum(sales),
avg_profit = mean(profit),
transactions = n() # the number of rows after grouping
) %>%
arrange(desc(total_sales))
[!info] ATTENTION: Here n() is ==a counting function== in dplyr that returns ==the number of rows in the current group==. PS: much like COUNT(*) in SQL.
Key Concepts and Crucial Moments to Watch:
1. The Pipeline Operator (%>%):
# Without pipeline (harder to read)
arrange(
summarise(
group_by(sales_data, region),
total_sales = sum(sales)
),
desc(total_sales)
)
# With pipeline (more readable)
sales_data %>%
group_by(region) %>%
summarise(total_sales = sum(sales)) %>%
arrange(desc(total_sales))
[!important] IMPORTANT: The pipeline operator ==passes the result as the first argument to the next function==. Be aware that some functions might need explicit argument naming if not using the first argument.
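When the piped value should not go into the first argument, the magrittr pipe accepts `.` as a placeholder. A short sketch using the built-in mtcars data; `lm()` is chosen only because its first argument is a formula rather than the data:

```r
library(dplyr)

# lm() takes the data frame as `data`, not as its first argument,
# so `.` marks where the piped value should go
fit <- mtcars %>% lm(mpg ~ wt, data = .)

print(class(fit))
```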
2. Grouping Operations:
# Simple grouping
sales_data %>%
group_by(region) %>%
summarise(mean_sales = mean(sales))
# Multiple grouping variables
sales_data %>%
group_by(region, month = format(date, "%m")) %>%
summarise(mean_sales = mean(sales))
# CRUCIAL: Remember to ungroup when needed
# After ungroup(), the next mutate() is computed over the whole dataset,
# NOT within the previous groups
sales_data %>%
group_by(region) %>%
mutate(region_avg = mean(sales)) %>%
ungroup() %>% # Don't forget this!
mutate(overall_avg = mean(sales))
[!info] Comparison between mutate and summarize
While summarize reduces multiple rows into single summary ==rows by the grouped values==, mutate creates new columns while ==preserving the original number of rows==.
PS: mutate just attaches the value (be it an aggregate or any other computed one) as a new column on each row, without reducing the data. summarize, however, collapses data (by the grouped columns) to create summary statistics.
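The row counts make the contrast concrete. A minimal sketch on a made-up four-row data frame (`scores` and its columns are illustrative, not from the lecture data):

```r
library(dplyr)

scores <- data.frame(
  team   = c("A", "A", "B", "B"),
  points = c(10, 20, 30, 50)
)

# summarise(): one row per group
summarised <- scores %>%
  group_by(team) %>%
  summarise(avg = mean(points))

# mutate(): every original row kept, group average attached to each
mutated <- scores %>%
  group_by(team) %>%
  mutate(avg = mean(points)) %>%
  ungroup()

print(nrow(summarised)) # 2
print(nrow(mutated))    # 4
```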
3. Summarizing with NA Values:
# Add some NA values for demonstration
sales_data$sales[c(5, 15)] <- NA
# Bad: NAs will propagate
sales_data %>%
group_by(region) %>%
summarise(mean_sales = mean(sales))
# Good: Handle NAs explicitly
sales_data %>%
group_by(region) %>%
summarise(mean_sales = mean(sales, na.rm = TRUE))
4. Multiple Operations and Order Sensitivity:
# Order matters! Be careful with these operations
# (na.rm = TRUE is needed because sales now contains NAs; without it every comparison is NA and all rows are dropped)
sales_data %>%
group_by(region) %>%
filter(sales > mean(sales, na.rm = TRUE)) %>% # This uses group-wise mean
summarise(high_sales_count = n())
# Different result if we change order
# CRUCIAL: A very interesting use case
sales_data %>%
filter(sales > mean(sales, na.rm = TRUE)) %>% # This uses overall mean
group_by(region) %>%
summarise(high_sales_count = n())
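Because the sample data is random, the effect is easier to verify on fixed numbers. A sketch with a made-up `toy` data frame, where the two orderings give visibly different answers:

```r
library(dplyr)

toy <- data.frame(
  region = c("A", "A", "B", "B"),
  sales  = c(10, 20, 100, 200)
)

# Group first: each row is compared with its own region's mean (15 and 150)
group_first <- toy %>%
  group_by(region) %>%
  filter(sales > mean(sales)) %>%
  summarise(high_sales_count = n())

# Filter first: each row is compared with the overall mean (82.5)
filter_first <- toy %>%
  filter(sales > mean(sales)) %>%
  group_by(region) %>%
  summarise(high_sales_count = n())

print(group_first)  # A: 1, B: 1
print(filter_first) # B: 2 (region A disappears)
```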
5. Common Pitfalls and Solutions:
Handling Grouped Operations:
# Problem: inside group_by(), sum(sales) is the regional total,
# so this "pct_of_total" is really each row's share of its region
sales_data %>%
group_by(region) %>%
mutate(pct_of_total = sales / sum(sales, na.rm = TRUE)) %>%
ungroup() # Always ungroup after grouped operations
# Solution: Be explicit about grouping scope
sales_data %>%
mutate(total_sales = sum(sales, na.rm = TRUE)) %>%
group_by(region) %>%
mutate(pct_of_region = sales / sum(sales, na.rm = TRUE),
pct_of_total = sales / first(total_sales)) %>%
ungroup()
# Example with dplyr's built-in starwars dataset
starwars %>%
group_by(homeworld) %>% # grouping by the homeworld field
filter(homeworld %in% c('Tatooine', 'Naboo') | eye_color == 'blue') %>% # keep rows matching either condition
summarise(population = n()) %>% # counting the characters in each group
arrange(desc(population)) %>% # ordering by population, descending
slice(1:3) %>% # keeping the top 3 homeworlds by population
ungroup() # ungrouping for safety
Joining Tables:
# Create a reference table
region_info <- data.frame(
region = c("North", "South", "East", "West"),
manager = c("Alice", "Bob", "Charlie", "David")
)
# Safe joining with explicit join type
sales_analysis <- sales_data %>%
left_join(region_info, by = "region") # Be explicit about join columns
# Check for unmatched rows
anti_join(sales_data, region_info, by = "region")
6. Advanced Features and Best Practices:
Using across() for Multiple Columns:
# Modern approach for operating on multiple columns
sales_data %>%
group_by(region) %>%
summarise(across(
c(sales, profit),
list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE)
),
.names = "{.col}_{.fn}"
))
Syntax explained
mean = ~mean(., na.rm = TRUE)
Breaking it down:
- ~ is a formula operator in R
- . represents the ==current column== being processed
- mean = names the output
- The whole structure is a lambda/anonymous function
Same formula written in traditional R:
# Traditional function
function(x) mean(x, na.rm = TRUE)
# dplyr shorthand
~mean(., na.rm = TRUE)
Example with multiple calculations:
# Verbose way
sales_data %>%
group_by(region) %>%
summarise(
sales_mean = mean(sales, na.rm = TRUE),
sales_sd = sd(sales, na.rm = TRUE),
profit_mean = mean(profit, na.rm = TRUE),
profit_sd = sd(profit, na.rm = TRUE)
)
# Compact way using across()
sales_data %>%
group_by(region) %>%
summarise(across(
c(sales, profit),
list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE)
)
))
Key Takeaways:
Always be mindful of grouping:
- Use group_by() intentionally
- Remember to ungroup() when finished
Handle missing values explicitly:
- Use na.rm = TRUE when appropriate
- Consider filtering NAs beforehand if they're problematic
Pay attention to operation order:
- Operations are sequential
- Grouping affects subsequent calculations
- Filtering before or after grouping can give different results
Document your pipeline:
- Add comments explaining complex transformations
- Break long pipelines into meaningful chunks
- Consider intermediate assignments for clarity
Data Restructuring
Code Examples:
library(tidyr)
# Wide to Long format
long_data <- wide_data %>%
pivot_longer(
cols = starts_with("SurveyItem"),
names_to = "Question",
values_to = "Response"
)
# Long to Wide format
wide_data <- long_data %>%
pivot_wider(
names_from = Question,
values_from = Response
)
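The snippets above assume a `wide_data` object already exists. A self-contained sketch with a made-up two-respondent survey (the column names `RespondentID`, `SurveyItem1` and `SurveyItem2` are assumptions for illustration):

```r
library(tidyr)

wide_data <- data.frame(
  RespondentID = c(1, 2),
  SurveyItem1  = c(4, 5),
  SurveyItem2  = c(3, 2)
)

# Wide to long: one row per respondent and question
long_data <- pivot_longer(
  wide_data,
  cols      = starts_with("SurveyItem"),
  names_to  = "Question",
  values_to = "Response"
)

# Long back to wide: one row per respondent again
back_to_wide <- pivot_wider(
  long_data,
  names_from  = Question,
  values_from = Response
)

print(dim(long_data))    # 4 rows, 3 columns
print(dim(back_to_wide)) # 2 rows, 3 columns
```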
Key Areas to Watch For:
- Logical Operations:
- Parentheses matter in complex conditions:
# Different results: & binds tighter than |
data[(x < 5 | x > 10) & y == "A", ] # Correct: (x < 5 OR x > 10) AND y is "A"
data[x < 5 | x > 10 & y == "A", ] # Incorrect: read as x < 5 OR (x > 10 AND y is "A")
- Data Type Verification:
# Always check data types after import
str(data)
class(data$column)
# Convert if necessary
data$column <- as.factor(data$column)
- Missing Values:
# Check for missing values
sum(is.na(data))
# Handle missing values explicitly
data %>%
filter(!is.na(value)) %>%
summarise(mean = mean(value))
- Merging Data:
# Ensure key columns are properly identified
merged_data <- merge(
data1, data2,
by.x = "ID1", by.y = "ID2",
all = TRUE # Keep all rows
)
# Always verify merge results
dim(data1) # Original dimensions
dim(data2) # Original dimensions
dim(merged_data) # Should make sense given the merge type
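To make "should make sense given the merge type" concrete for the merge check above, here is a small sketch with made-up tables whose keys only partly overlap (`data1`, `data2` and their columns are illustrative):

```r
data1 <- data.frame(ID1 = 1:3, name = c("A", "B", "C"))
data2 <- data.frame(ID2 = 2:4, score = c(88, 94, 82))

# Full (outer) merge: every key from either table is kept
full <- merge(data1, data2, by.x = "ID1", by.y = "ID2", all = TRUE)

# Default (inner) merge: only keys present in both tables
inner <- merge(data1, data2, by.x = "ID1", by.y = "ID2")

print(nrow(full))  # 4 (IDs 1 to 4)
print(nrow(inner)) # 2 (IDs 2 and 3)

# Unmatched keys show up as NA in the columns from the other table
print(sum(is.na(full$score))) # 1
```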
References
- Introduction to the dplyr package
- Index page of the dplyr package, which also includes the first reference
- The tidyr package, a great tool for data restructuring
R Data Structures and packages - Week 2 Lecture Notes
17501 • Apr 5, 2025
Introduction
R provides several data structures to store data in different formats. These include:
- Vectors
- Factors
- Data Frames
- Matrices
- Lists
- Arrays (not covered in this module)
Homogeneous vs. Heterogeneous Data Structures
| Dimension | Homogeneous | Heterogeneous |
|---|---|---|
| 1D | Atomic Vector | List |
| 2D | Matrix | Data Frame |
| nD | Array | |
- Homogeneous: All elements must be of the same type.
- Heterogeneous: Elements can be of different types.
Vectors
Definition: A basic data structure that stores multiple values of the same type.
Creation: Use the c() function.
vector1 <- c(3, 6, 9)
Length: Use length() to find the number of elements.
length(vector1) # Returns 3
Accessing Elements: Use bracket notation []. Indexing starts at 1.
vector1[1] # First element (3)
vector1[2] # Second element (6)
vector1[c(1,2)] # First and second elements (3, 6)
vector1[-1] # All elements except the first
vector1[-c(1,2)] # All elements except first and second
Edge Cases:
- vector1[0] returns an empty vector
- vector1[4] returns NA for a vector of length 3
- Negative indices remove elements at those positions
- Cannot mix positive and negative indices in the same selection
Negative Indexing: Omits elements.
vector1[-1] # Returns elements except the first
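The indexing edge cases can be checked directly. A short sketch; the mixed-sign case is wrapped in `tryCatch()` so the script keeps running after the error:

```r
vector1 <- c(3, 6, 9)

empty  <- vector1[0] # zero-length numeric vector
beyond <- vector1[4] # NA: there is no fourth element

# Mixing positive and negative indices is an error
mixed <- tryCatch(vector1[c(-1, 2)], error = function(e) "error")

print(length(empty))
print(beyond)
print(mixed)
```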
Modifying Elements:
vector1[2] <- 100 # Changes the second element to 100
Adding Elements:
vector1[4] <- 200 # Adds 200 at position 4
Vector Operations:
- Arithmetic operations are element-wise.
vector1 + 1 # Adds 1 to each element
vector1 * 2 # Multiplies each element by 2
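Mixed Data Types: R implicitly coerces mixed types to the most flexible type present (logical < integer < double < character), so a vector containing any string becomes character.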
vector10 <- c(10, "20", 30) # Converts to character
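A quick way to see the coercion, and to undo it when the strings really are numbers (variable names are illustrative):

```r
mixed <- c(10, "20", 30)
mixed_class <- class(mixed)      # "character"

# Convert back explicitly once the content is known to be numeric
restored <- as.numeric(mixed)    # 10 20 30

# Logical values are promoted to numbers when mixed with numerics
promoted <- c(1, TRUE)           # 1 1

print(mixed_class)
print(restored)
print(promoted)
```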
Using seq() Function: For generating more complex sequences:
# Basic usage with named arguments
seq(from = 0, to = 10, by = 2) # Returns: 0 2 4 6 8 10
# Same call with positional arguments
seq(0, 10, by = 2) # Returns: 0 2 4 6 8 10
# Decimal steps are allowed
seq(0, 5, by = 0.5) # Returns: 0.0 0.5 1.0 1.5 ... 4.0 4.5 5.0
# Watch out for unexpected results with step size
seq(1, 10, by = 5) # Returns: 1 6 only
seq(1, 10, by = 3) # Returns: 1 4 7 10
Important Considerations:
- The sequence always starts at from and proceeds by steps of size by
- It will not exceed the to value, which may result in fewer elements than expected
- Using named arguments (from, to, by) makes code more readable and less error-prone
- The by argument determines how many elements you get, so choose it carefully
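When the number of elements matters more than the step size, `seq()` also accepts `length.out`. A brief sketch contrasting the two ways of specifying a sequence:

```r
# Fixed step: the end point may not be reached
by_step <- seq(1, 10, by = 5)           # 1 6

# Fixed count: R picks the step so both end points are included
by_count <- seq(1, 10, length.out = 4)  # 1 4 7 10

print(by_step)
print(by_count)
```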
Factors
- Definition: Special vectors for storing ==categorical data==.
- Creation: Use factor().
[!info] Categorical Variables in R: Factors are R's way of storing categorical data, like ==enums== in other languages. They store values ==as integers internally== but display them as predefined categories.
Creating Factors:
# Basic factor creation
phoneType <- factor(c("iPhone", "Android", "Android", "iPhone", "Other"))
# Creating with predefined levels (including levels that might appear later)
phoneType <- factor(c("iPhone", "Android"),
  levels = c("iPhone", "Android", "Other", "Windows"))
Understanding Levels:
# Levels are the unique categories allowed in the factor
levels(phoneType) # Shows all possible categories
# Factors are stored as integers internally
as.numeric(phoneType) # Shows the internal integer representation
# With the predefined levels above this shows: 1 2 (where 1=iPhone, 2=Android)
Working with Levels:
# Check current levels
str(phoneType) # Shows factor structure and levels
# Modify level names
levels(phoneType)[levels(phoneType) == "Other"] <- "Unknown"
# Reorder levels (useful for plotting and modeling)
phoneType <- relevel(phoneType, ref = "iPhone") # Make iPhone the reference level
phoneType <- factor(phoneType, levels = c("iPhone", "Android", "Unknown")) # Complete reordering; unlisted levels are dropped
Handling Missing Values:
# NA values are allowed and handled specially
phoneType <- factor(c("iPhone", "Android", NA, "iPhone"))
is.na(phoneType) # Identifies NA values
Common Operations:
# Count occurrences of each level
table(phoneType) # Basic frequency table
summary(phoneType) # Similar to table() but includes NA count
# Convert to/from factors
as.character(phoneType) # Convert factor to character vector
as.factor(c("A", "B")) # Convert character vector to factor
Important Considerations:
- Factors are memory-efficient for repeated categorical values
- They maintain order of categories (unlike character vectors)
- They prevent data entry errors by allowing only predefined values
- Useful for statistical modeling where categorical variables need special handling
# Assigning a value that is not among the levels gives a warning and stores NA (it is not an error)
phoneType[1] <- "Windows Phone"
# To add a new level, redefine the factor with the extended level set (do this before assigning the new value)
phoneType <- factor(phoneType, levels = c(levels(phoneType), "Windows Phone"))
Practical Example:
# Real-world usage in data analysis
satisfaction <- factor(c("High", "Medium", "Low", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE) # Creates an ordered factor
# Useful for plotting
barplot(table(satisfaction)) # Creates bar plot with categories
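Because factors are stored as integer codes, calling `as.numeric()` on a factor whose labels look like numbers returns the codes, not the numbers. A sketch of this common pitfall with a made-up `readings` factor:

```r
readings <- factor(c("10", "5", "10"))

# Internal codes: levels are sorted as text, so "10" is level 1 and "5" is level 2
codes <- as.numeric(readings)                 # 1 2 1

# Going through character recovers the original values
values <- as.numeric(as.character(readings))  # 10 5 10

print(codes)
print(values)
```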
Data Frames
[!info] Definition and Core Concepts: Data frames are 2-dimensional structures ==similar to database tables or Excel sheets==
- ==Each column== can have a different data type (unlike matrices)
- ==Column names== must be unique
- ==All columns== must have the same number of rows
Creating Data Frames:
# Basic creation
employeeData <- data.frame(
  EmployeeID = 101:105,
  FirstName = c("Kim", "Ken", "Bob", "Bill", "Cindy"),
  Age = c(24, 23, 54, NA, 64),
  PayType = factor(c("Hourly", "Salaried", "Hourly", "Hourly", "Salaried")),
  stringsAsFactors = FALSE # Prevents automatic conversion of strings to factors
)
# From existing vectors
ids <- 1:3
names <- c("Alice", "Bob", "Charlie")
scores <- c(85, 92, 78)
df <- data.frame(ID = ids, Name = names, Score = scores)
# From a matrix
mat <- matrix(1:9, nrow = 3)
df_from_matrix <- as.data.frame(mat)
Examining Data Frame Structure:
# Basic information
str(employeeData) # Shows structure
dim(employeeData) # Returns dimensions
nrow(employeeData) # Number of rows
ncol(employeeData) # Number of columns
names(employeeData) # Column names
head(employeeData, n=2) # First 2 rows
tail(employeeData, n=2) # Last 2 rows
Accessing and Subsetting:
# Column access
employeeData$FirstName # Using $
employeeData[["FirstName"]] # Using [[]]
employeeData[, "FirstName"] # Using [,]
employeeData[, c("FirstName", "Age")] # Multiple columns
# Row access
employeeData[1, ] # First row
employeeData[1:3, ] # First three rows
# Both rows and columns
# TODO: Pay more attention here!!! Useful for filtering !!!
employeeData[1:3, c("FirstName", "Age")] # First three rows, two specific columns
employeeData[c(1, 3), c(2, 4)] # First and third rows, second and fourth columns
# Conditional subsetting
employeeData[employeeData$Age > 30, ] # Rows where Age > 30, plus an all-NA row for the missing Age
# TODO: Pay more attention here!!! Useful for filtering !!!
subset(employeeData, Age > 30) # Similar, but subset() drops rows where Age is NA
Modifying Data Frames:
# Adding new columns
employeeData$Department <- c("HR", "IT", "Sales", "IT", "HR")
employeeData[["Salary"]] <- c(50000, 60000, 75000, 65000, 80000)
# Modifying existing columns
employeeData$Age <- employeeData$Age + 1 # Increment all ages
# Adding new rows
new_row <- data.frame(
  EmployeeID = 106, FirstName = "Jamie", Age = 56,
  PayType = "Hourly", Department = "Sales", Salary = 70000
)
employeeData <- rbind(employeeData, new_row)
# Removing rows/columns (done on a copy so the later examples still have every column)
trimmedData <- employeeData[-1, ] # Remove first row
trimmedData$Department <- NULL # Remove Department column
Common Operations:
# Sorting (Very Useful)
employeeData[order(employeeData$Age), ] # Sort by Age
employeeData[order(employeeData$Department, -employeeData$Salary), ] # Multiple sort criteria
# Summarizing
summary(employeeData) # Statistical summary
table(employeeData$Department) # Frequency table of Department
# Aggregating
aggregate(Salary ~ Department, data = employeeData, FUN = mean) # Mean salary by department
# Handling missing values
complete.cases(employeeData) # Identify complete rows
na.omit(employeeData) # Remove rows with any NA
Advanced Features:
# Merging data frames
df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
df2 <- data.frame(ID = 2:4, Score = c(88, 94, 82))
merge(df1, df2, by = "ID") # SQL-like join
# Reshaping data
# Wide to long format (gather() still works but is superseded by pivot_longer())
library(tidyr)
long_data <- gather(employeeData, key = "Variable", value = "Value", -EmployeeID)
# Computing on columns
employeeData$BonusEligible <- employeeData$Salary > 70000
Matrices
- Definition: 2D homogeneous data structure.
- Creation: Use matrix().
matrix1 <- matrix(c(1, 0, -20, 0, 1, -15, 1, -1, 0), nrow = 3, ncol = 3, byrow = TRUE)
- Accessing Elements:
matrix1[2, 3] # Element at row 2, column 3
matrix1[1, ] # Entire first row
matrix1[, 2] # Entire second column
- Matrix Operations:
# mat1 and mat2 must have the same dimensions
mat1 + mat2 # Element-wise addition
mat1 * mat2 # Element-wise multiplication
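Note that `*` multiplies matching positions; the matrix product from linear algebra uses `%*%`. A small sketch with two made-up 2x2 matrices (the identity matrix is used so the results are easy to check by eye):

```r
mat1 <- matrix(1:4, nrow = 2) # columns are (1, 2) and (3, 4)
mat2 <- diag(2)               # 2x2 identity matrix

elementwise <- mat1 * mat2    # keeps only the diagonal of mat1
product     <- mat1 %*% mat2  # matrix product: equals mat1

print(elementwise)
print(product)
```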
Arrays
- Definition: nD homogeneous data structure.
- Creation: Use array().
array1 <- array(c(1:8), dim = c(2, 2, 2))
- Named Arrays:
named_array <- array(c(1:8), dim = c(2, 2, 2),
  dimnames = list(c("r1", "r2"), c("c1", "c2"), c("m1", "m2")))
Lists
- Definition: Heterogeneous data structure that can nest other objects.
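- Creation: Use list().
# demoVec and demoDF are example objects defined here for illustration
demoVec <- c(1, 2, 3)
demoDF <- data.frame(x = 1:2, y = c("a", "b"))
list1 <- list(Element1 = demoVec, Element2 = c("A", "B"), Element3 = 3, Element4 = demoDF)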
- Accessing Elements:
list1$Element1 # Returns the first element
list1[[1]] # Same as above
- Adding Elements:
list1$NewElement <- "New Value" # Adds a new element
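Single brackets behave differently from `$` and `[[ ]]` on a list: they return a smaller list rather than the element itself. A minimal sketch with a made-up two-element list:

```r
list1 <- list(Element1 = c(1, 2, 3), Element2 = c("A", "B"))

as_list  <- list1["Element1"]   # still a list, of length 1
as_value <- list1[["Element1"]] # the numeric vector itself

print(class(as_list))
print(class(as_value))
```

This matters when passing the result to functions such as `mean()`, which expect the vector, not a list wrapping it.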
R Packages
Package Basics:
- Packages expand R's default functionality
- Thousands available via CRAN (Comprehensive R Archive Network)
- Browse packages at:
- By name: cran.r-project.org/web/packages/available_packages_by_name.html
- By task: cran.r-project.org/web/views/
Installation Methods:
# Method 1: Using install.packages() function
install.packages(c("foreign", "readr", "haven"))
# Method 2: For packages with potential installation issues
install.packages(c("dplyr", "car"), dependencies = TRUE, type = "binary", ask = FALSE)
# Method 3: Interactive installation via RStudio
# Tools -> Install Packages...
Important: Package installation only needs to happen once per R installation, not in every session. After upgrading R to a new release, packages usually need to be reinstalled.
Loading Packages:
# Load one package at a time
library(foreign)
library(readr)
library(haven)
# Get package documentation
help(package = "haven")
Package Management:
# Update packages
update.packages() # Via function
# Or: Tools -> Check for Package Updates... (in RStudio)
# Unload a package
detach("package:readr", unload = TRUE)
Handling Package Conflicts:
# When packages have functions with same name:
# Option 1: Use package-specific reference
dplyr::recode(data) # Use dplyr's recode
car::recode(data) # Use car's recode
# Option 2: Control through loading order
library(car) # Load first package
library(dplyr) # Most recently loaded takes precedence
Best Practices:
- Avoid reinstalling packages unnecessarily
- Be aware of package loading order
- Consider using specific package references (::) for functions with same names
- Watch for package conflict messages when loading libraries
- RStudio 1.2+ will offer to install missing packages automatically
- Restart R before updating packages that are currently loaded
File Paths and Working Directories
- Setting Working Directory:
setwd("~/Documents/ResearchProject")
- Getting Working Directory:
getwd()
- Relative Paths:
list.files("Data Files") # Lists files in the "Data Files" subdirectory
list.files("Data Files/More Data")
list.files("..") # Lists files in the parent folder (one level up)
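Rather than typing separators by hand, relative paths can be assembled with `file.path()`, which avoids the backslash issues mentioned for Windows. A short sketch; the folder and file names reuse the examples above and nothing is read from disk:

```r
# Build the path from its parts
path <- file.path("Data Files", "More Data", "ExampleData.csv")

# Split it again into folder and file name
parent    <- dirname(path)
file_name <- basename(path)

print(path)
print(parent)
print(file_name)
```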
Reading Data Files
- CSV Files:
CSV_base_example <- read.csv("Data Files/ExampleData.csv")
# using an external library
library(readr)
CSV_readr_example <- read_csv("Data Files/ExampleData.csv")
- Excel Files:
# using an external library
library(readxl)
XLSX_readxl_example <- read_excel("Data Files/ExampleData.xlsx", sheet = 1)
- SPSS Files:
# with `foreign` library
library(foreign)
SAV_foreign_example <- read.spss("Data Files/ExampleData.sav", to.data.frame = TRUE)
# with `haven` library
library(haven)
SAV_haven_example <- read_sav("Data Files/ExampleData.sav")
Writing Data Files
- CSV Files:
write.csv(CSV_readr_example, file = "Data Files/More Data/WriteExampleData.csv", row.names = FALSE)
# Note: fwrite() from the data.table package is a faster alternative
data.table::fwrite(CSV_readr_example, file = "AltExampleData.csv")
- Excel Files:
writexl::write_xlsx(CSV_readr_example, path = "Data Files/More Data/WriteExampleData.xlsx")
- SPSS Files:
haven::write_sav(CSV_readr_example, path = "Data Files/More Data/WriteExampleData.sav")
...ISDS Preparation
Anonymous • Mar 31, 2025
I have recently started revising the materials from the Introduction to Statistics and Data Science module at WIUT.
Here is what I love about the module:
- It covers a great range of things that are widely used in every domain
Here is the list of R related lecture notes
As there will be some questions from R programming language, I decided to share my lecture notes with you:
...Introduction to R Programming - Week 1 Lecture Notes
admin3 • Mar 23, 2025
Table of Contents
- Software Installation
- RStudio Interface
- Arithmetic Operations
- Mathematical Functions
- Relational Operators
- Data Classes
- Missing Data
- R Objects and Assignment
Software Installation
Required Software
- R: The core programming language
  - Download from CRAN (Comprehensive R Archive Network)
  - Platform-specific versions available for Mac, Windows, Linux
- RStudio: Integrated Development Environment (IDE)
  - Download from Posit website
  - Provides unified interface across operating systems
Important URLs
- R Download: https://cran.r-project.org/
- RStudio Download: https://posit.co/download/rstudio-desktop/
RStudio Interface
Key Components
Source Pane (Top Left)
- Where R scripts are written and edited
- Save files with the .R extension
# Example script content
# Calculate average temperature
temp_celsius <- 25
temp_fahrenheit <- (temp_celsius * 9/5) + 32
Console (Bottom Left)
- Displays executed commands and output
- Direct command entry possible
> 2 + 2
[1] 4
Environment Pane (Top Right)
- Shows active variables and objects
# After running:
temp_celsius # Value: 25
temp_fahrenheit # Value: 77
Arithmetic Operations
Basic Operators with Examples
# Addition
5 + 3 # Output: 8
# Subtraction
10 - 4 # Output: 6
# Multiplication
6 * 7 # Output: 42
# Division
15 / 3 # Output: 5
# Exponents
2 ^ 3 # Output: 8
# Modulo (remainder)
17 %% 5 # Output: 2
# Integer division
17 %/% 5 # Output: 3
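Modulo and integer division are two halves of the same calculation: the quotient and the remainder together rebuild the original number. A small sketch of that relationship, with values and the helper name `is_even` chosen only for illustration:

```r
# Hypothetical values chosen for illustration
x <- 17
y <- 5

quotient  <- x %/% y   # whole number of times y fits into x
remainder <- x %% y    # what is left over

# The two results reconstruct the original value
rebuilt <- quotient * y + remainder
rebuilt == x                      # TRUE

# A common use of modulo: testing whether a number is even
is_even <- function(n) n %% 2 == 0
is_even(c(4, 7, 10))              # TRUE FALSE TRUE
```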
Order of Operations Examples
# Different results based on parentheses
4 + 2 * 3 # Output: 10 (multiplication first)
(4 + 2) * 3 # Output: 18 (addition first)
# Complex calculation
((10 + 5) * 2) / 5 # Output: 6
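Two precedence rules tend to surprise beginners: the exponent operator binds more tightly than a leading minus sign, and chained exponents are evaluated from right to left. The variable names below are only for illustration.

```r
# Exponent is applied before the unary minus
neg_square   <- -2^2      # -4, read as -(2^2)
square_of_neg <- (-2)^2   # 4

# Chained exponents group from the right
right_first <- 2^3^2      # 512, read as 2^(3^2)
left_first  <- (2^3)^2    # 64

# Subtraction groups from the left
left_to_right <- 10 - 4 - 3   # 3, read as (10 - 4) - 3
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.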
Mathematical Functions
Common Functions with Examples
# Square root
sqrt(16) # Output: 4
sqrt(c(9, 16, 25)) # Output: 3 4 5
# Absolute value
abs(-7.5) # Output: 7.5
abs(c(-2, 0, 2)) # Output: 2 0 2
# Logarithms
log10(100) # Output: 2
log(exp(1)) # Output: 1
# Exponential
exp(2) # Output: 7.389056
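Note that log() on its own is the natural logarithm, which is why log(exp(1)) returns 1 above. A short sketch of how the base can be changed; the variable names are made up for this example.

```r
# log() defaults to base e (natural logarithm)
natural_log <- log(100)
round(natural_log, 4)            # 4.6052

# Other bases: dedicated helpers or the `base` argument
base2_helper <- log2(8)          # 3
base2_arg    <- log(8, base = 2) # 3

# exp() and log() undo each other (up to floating-point rounding)
back_again <- exp(log(5))
isTRUE(all.equal(back_again, 5)) # TRUE
```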
Function Help Example
# Getting help for sqrt function
?sqrt
# Returns documentation showing:
# sqrt(x) # where x is a numeric vector
Relational Operators
Examples with Different Data Types
# Numeric comparisons
5 < 10 # Output: TRUE
7 >= 7 # Output: TRUE
3 == 3 # Output: TRUE
4 != 5 # Output: TRUE
# String comparisons
"apple" == "apple" # Output: TRUE
"a" < "b" # Output: TRUE
# Mixed type comparisons
5 == "5" # Output: TRUE (5 is coerced to the character "5" before comparing)
Data Classes
Type Examples and Conversions
# Numeric
x <- 10.5
typeof(x) # Output: "double"
# Integer
y <- 10L
typeof(y) # Output: "integer"
# Character
name <- "John"
typeof(name) # Output: "character"
# Logical
is_valid <- TRUE
typeof(is_valid) # Output: "logical"
# Type conversion examples
as.integer(10.7) # Output: 10
as.character(123) # Output: "123"
as.numeric("456") # Output: 456
as.Date(43800, origin = "1899-12-30") # "2019-12-01"
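Conversions do not always succeed, and the way they fail is worth knowing. The sketch below shows a few cases; `bad_number` is just an illustrative name, and suppressWarnings() is used only to keep the example quiet.

```r
# A conversion that cannot succeed returns NA and raises a warning
bad_number <- suppressWarnings(as.numeric("abc"))
is.na(bad_number)            # TRUE

# as.integer() truncates toward zero; round() rounds to the nearest value
as.integer(-10.7)            # -10
round(10.7)                  # 11

# Only a few spellings are recognised as logical values
as.logical("TRUE")           # TRUE
as.logical("yes")            # NA

# Logical to numeric follows the 1/0 rule
as.numeric(c(TRUE, FALSE))   # 1 0
```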
Testing Types
# Using is.* functions
x <- 10.5
is.numeric(x) # Output: TRUE
is.integer(x) # Output: FALSE
is.character(x) # Output: FALSE
# Multiple checks
y <- "123"
is.numeric(y) # Output: FALSE
is.numeric(as.numeric(y)) # Output: TRUE
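typeof() and class() answer slightly different questions: typeof() reports how R stores the value, while class() reports how R treats it. A brief sketch comparing the two (variable names are illustrative):

```r
x <- 10.5
y <- 10L
today <- Sys.Date()

class(x)        # "numeric"
typeof(x)       # "double"

class(y)        # "integer"
typeof(y)       # "integer"

# A Date is stored as a number but treated as a date
class(today)    # "Date"
typeof(today)   # "double"

# is.numeric() is TRUE for both integers and doubles
is.numeric(y)   # TRUE
```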
Missing Data
Working with NA and NaN
# Creating missing values
x <- c(1, NA, 3, NaN, 5)
# Testing for NA
is.na(x) # Output: FALSE TRUE FALSE TRUE FALSE (is.na() is also TRUE for NaN)
# Calculations with NA
sum(c(1, NA, 3)) # Output: NA
sum(c(1, NA, 3), na.rm = TRUE) # Output: 4
# NA vs NaN
0/0 # Output: NaN
NA + 1 # Output: NA
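To tell the two kinds of missing value apart, is.nan() can be used alongside is.na(). The sketch below reuses the same vector; `clean` and `avg` are names introduced just for this example.

```r
x <- c(1, NA, 3, NaN, 5)

# is.nan() is TRUE only for NaN, not for NA
is.nan(x)                 # FALSE FALSE FALSE  TRUE FALSE

# Keep only the values that are actually present
clean <- x[!is.na(x)]
clean                     # 1 3 5

# na.rm = TRUE drops both NA and NaN before calculating
avg <- mean(x, na.rm = TRUE)
avg                       # 3

# Comparing with NA gives NA, so use is.na() rather than ==
NA == NA                  # NA
```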
Type Conversion and Comparison
Understanding Type Conversion
# Different numeric types
x_int <- 5L # integer
x_num <- 5 # numeric/double
x_int == x_num # Output: TRUE (values are equal)
typeof(x_int) == typeof(x_num) # Output: FALSE (types are different)
# Detailed example
varA <- 3.3 # double/numeric
varB <- "hello there" # character
varC <- FALSE # logical
varD <- 5L # integer
varE <- 5 # double
varF <- varD + varE # double (integer + numeric = numeric)
varG <- 2 * varC # numeric (numeric * logical = numeric)
# Checking types
typeof(varA) # "double"
typeof(varB) # "character"
typeof(varC) # "logical"
typeof(varD) # "integer"
typeof(varE) # "double"
typeof(varF) # "double"
typeof(varG) # "double"
Key Points About Type Conversion
Implicit Conversion
- R automatically converts between integer and numeric types in calculations
- Logical values convert to 1 (TRUE) or 0 (FALSE) in numeric operations
- The "wider" type usually prevails (e.g., integer + numeric = numeric)
Value vs Type Comparison
5L == 5 # TRUE (comparing values)
typeof(5L) == typeof(5) # FALSE (comparing types: "integer" vs "double")
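When both the value and the type need to match, identical() does the two checks in one step. A minimal sketch, with variable names chosen for illustration:

```r
# == compares values after coercion; identical() also compares types
same_value <- 5L == 5                 # TRUE
same_object <- identical(5L, 5)       # FALSE

# After an explicit conversion the two are identical
converted <- as.numeric(5L)
same_after_convert <- identical(converted, 5)   # TRUE
```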
Type Hierarchy
- character > numeric > integer > logical
- When mixing types, R usually converts to the higher type
1L + 2.5 # Result is numeric (2.5 wins)
TRUE + 1L # Result is integer (1L wins)
TRUE + 1.0 # Result is numeric (1.0 wins)
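The same hierarchy applies when values of different types are combined into one vector with c(): every element is converted to the highest type present. A small sketch (the vector names are made up for this example):

```r
int_and_dbl <- c(1L, 2.5)
typeof(int_and_dbl)       # "double"

lgl_and_int <- c(TRUE, 2L)
typeof(lgl_and_int)       # "integer"

num_and_chr <- c(1, "a")
typeof(num_and_chr)       # "character"

lgl_and_chr <- c(TRUE, "a")
lgl_and_chr               # "TRUE" "a"

# Because TRUE counts as 1, sum() counts the TRUE values
sum(c(TRUE, FALSE, TRUE)) # 2
```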
R Objects and Assignment
Variable Assignment Examples
# Basic assignment
age <- 25
name <- "Alice"
# Multiple assignments
height <- weight <- 70
# Complex assignments
bmi <- weight / (height/100)^2
# Listing objects
ls() # Shows all objects in environment
# Removing objects
rm(age) # Removes single object
rm(list = ls()) # Removes all objects
Naming Conventions Examples
# Valid names
valid_name <- 1
validName <- 2
VALID_NAME <- 3
.hidden_name <- 4
# Invalid names (will cause errors)
# 1name <- 5 # Can't start with number
# _name <- 6 # Can't start with underscore
# name-1 <- 7 # Can't use hyphen
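If a non-standard name is unavoidable, for example a column header imported from a spreadsheet, two base R tools may help: backticks allow the name to be used as-is, and make.names() turns it into a valid one. The names below are only for illustration.

```r
# Backticks let R accept a name that would otherwise be invalid
`name-1` <- 7
`name-1` + 1              # 8

# make.names() converts arbitrary strings into valid names
fixed <- make.names(c("1name", "name-1"))
fixed                     # "X1name" "name.1"
```

Sticking to valid names from the start is usually simpler than relying on backticks throughout a script.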
Practice Exercises:
7. Create variables of different types and test their classes
8. Perform arithmetic operations with variables
9. Try working with missing values and understand their behavior
10. Practice naming conventions and object assignments
References
R data type and packages -> the continuation of the R chronicles (Week 2)
A two-hour video tutorial:
...