R functions, loops, conditional statements - Week 4 Lecture Notes

17501 • Apr 5, 2025

Functions in R Programming

Basic Function Structure

function_name <- function(parameter1, parameter2) {
    # Function body
    result <- ...   # operations that produce the result
    return(result)
}

Example 1: Simple Function

# Calculate area of rectangle
calculate_area <- function(length, width) {
    area <- length * width
    return(area)
}

# Usage
calculate_area(5, 3)  # Returns 15

Example 2: Function with Default Parameters

greet_user <- function(name = "User") {
    greeting <- paste("Hello,", name, "!")
    return(greeting)
}

greet_user()           # Returns "Hello, User !"
greet_user("Maria")    # Returns "Hello, Maria !"

Conditional Statements (if, else if, else)

Basic Structure

if (condition) {
    # code if condition is TRUE
} else if (another_condition) {
    # code if another_condition is TRUE
} else {
    # code if all conditions are FALSE
}

Example 1: Simple Grade Calculator

get_grade <- function(score) {
    if (score >= 90) {
        return("A")
    } else if (score >= 80) {
        return("B")
    } else if (score >= 70) {
        return("C")
    } else {
        return("F")
    }
}

get_grade(85)  # Returns "B"

Example 2: Number Check

check_number <- function(x) {
    if (x > 0) {
        return("Positive")
    } else if (x < 0) {
        return("Negative")
    } else {
        return("Zero")
    }
}

Conditions and Logical Operators

Common Logical Operators

  • == : Equal to
  • != : Not equal to
  • > : Greater than
  • < : Less than
  • >= : Greater than or equal to
  • <= : Less than or equal to
  • & : AND
  • | : OR
  • ! : NOT

Example: Complex Conditions

check_eligibility <- function(age, income) {
    if (age >= 18 & income >= 30000) {
        return("Eligible")
    } else if (age >= 21 | income >= 50000) {
        return("Conditionally Eligible")
    } else {
        return("Not Eligible")
    }
}

check_eligibility(19, 35000)  # Returns "Eligible"

Example: Multiple Conditions

categorize_day <- function(day, temperature) {
    if (day %in% c("Saturday", "Sunday") & temperature > 20) {
        return("Perfect weekend!")
    } else if (day %in% c("Saturday", "Sunday")) {
        return("Cold weekend")
    } else {
        return("Weekday")
    }
}

categorize_day("Saturday", 25)  # Returns "Perfect weekend!"

Best Practices

  1. Always use clear and descriptive function names
  2. Include documentation/comments for complex functions
  3. Handle edge cases and invalid inputs
  4. Keep functions focused on a single task
  5. Use consistent indentation for readability

Example: Good Practice Implementation

calculate_bmi <- function(weight, height) {
    # Input validation
    if (!is.numeric(weight) | !is.numeric(height)) {
        return("Error: Inputs must be numeric")
    }
    if (weight <= 0 | height <= 0) {
        return("Error: Values must be positive")
    }
    
    # Calculate BMI
    bmi <- weight / (height^2)
    
    # Categorize BMI
    if (bmi < 18.5) {
        return("Underweight")
    } else if (bmi < 25) {
        return("Normal")
    } else if (bmi < 30) {
        return("Overweight")
    } else {
        return("Obese")
    }
}

# Usage
calculate_bmi(70, 1.75)  # Returns "Normal" (BMI ≈ 22.9)
...
Data manipulation in R - Week 3 Lecture Notes

17501 • Apr 5, 2025

Package Management in R

1. Understanding R Package Ecosystem

The R package system is hierarchical:

# Base R: Comes with basic installation
mean(1:10)  # Base function

# Recommended packages: Nearly standard but need loading
library(MASS)  # A recommended package

# Third-party packages: Need installation and loading
install.packages("tidyverse")  # Popular meta-package

2. Package Installation Strategies

Basic Installation:

# Single package
install.packages("dplyr")

# Multiple packages
install.packages(c("dplyr", "ggplot2", "tidyr"))

# With specific parameters for troubleshooting
install.packages("dplyr",
                dependencies = TRUE,  # Install all required packages
                type = "binary",     # Avoid source compilation
                repos = "https://cran.rstudio.com/")  # Specify repository

3. Package Loading and Namespace Management

Basic Loading:

# Standard loading
library(dplyr)

# Alternative loading with error handling
if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

Namespace Conflicts and Resolution:

# Example of conflict
library(dplyr)
library(MASS)  # MASS also provides a select() function, masking dplyr::select()

# Three ways to handle conflicts:

# 1. Explicit namespace
dplyr::select(mtcars, mpg, cyl)

# 2. Unload the masking package
detach("package:MASS", unload = TRUE)

# 3. Import specific functions
import::from(dplyr, select, filter)

4. Package Version Management

Checking and Updating:

# Check installed packages
installed.packages()

# Check specific package version
packageVersion("dplyr")

# Update packages
update.packages()

# Install specific version (requires devtools)
devtools::install_version("dplyr", version = "1.0.0")

5. Important Caveats and Best Practices

Package Loading Order:

# BAD: Potential conflicts unclear
library(dplyr)
library(plyr)
library(tidyr)

# GOOD: Organized loading with comments
# Core data manipulation
library(dplyr)      # Main data manipulation
library(tidyr)      # Data reshaping
# Visualization
library(ggplot2)    # Plotting

Function Conflicts Resolution:

# Check for conflicts
conflicts()

# Create alias for frequently used conflicting functions
filter_df <- dplyr::filter
select_df <- dplyr::select

# Use conflicted package for explicit conflict resolution
library(conflicted)
conflict_prefer("filter", "dplyr")

6. Project-specific Package Management

Using renv for Project Isolation:

# Initialize project-specific package management
renv::init()

# Install project packages
renv::install("dplyr")

# Snapshot current project state
renv::snapshot()

# Restore project packages
renv::restore()

7. Common Pitfalls and Solutions

Package Loading Errors:

# Problem: Package not found
library(nonexistentpackage)  # Error

# Solution: Check and install
if (!require("package")) {
  install.packages("package")
  library(package)
}

# Problem: Version conflicts
# Solution: Use packageVersion() to check versions
if (packageVersion("dplyr") < "1.0.0") {
  install.packages("dplyr")
}

File Path Management

Code Examples:

# Mac/Linux path
"~/Documents/data.csv"

# Windows paths (both valid)
"C:\\Data\\data.csv"    # Double backslash
"C:/Data/data.csv"      # Forward slash

Important Caveats:

  • Path specifications differ between Windows and Mac/Linux
  • Working directory management is crucial:
# Check current working directory
getwd()

# Set working directory
setwd("~/Documents/Project")

# List files in current directory
list.files()

Reading Data Files

Code Examples:

# CSV files
# Base R approach
data_base <- read.csv("file.csv", header = TRUE)

# readr approach (faster)
library(readr)
data_readr <- read_csv("file.csv")

# Excel files
library(readxl)
data_excel <- read_excel("file.xlsx", sheet = 1)

# SPSS files
library(haven)
data_spss <- read_sav("file.sav")

Important Caveats:

  • Always verify data structure after reading:
str(data)
head(data)
  • Excel date handling requires special attention:
# Converting Excel dates
as.Date(43800, origin = "1899-12-30")

Data Manipulation and Subsetting

Basic subsetting in R:

# Create a simple dataset for demonstration
employee_data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, NA, 45, 32, NA),
  salary = c(50000, 60000, NA, 75000, 80000)
)

# Basic subsetting by condition
young_employees <- employee_data[employee_data$age < 40, ]

When we subset like this, R evaluates each row against the condition and returns a logical vector (TRUE/FALSE). However, this introduces our first important consideration: how R handles missing values (NA) in comparisons.

Let's see what happens with NA values in comparisons:

# Demonstrate NA behavior
ages <- c(25, NA, 45, 32, NA)
ages < 40
# Returns: TRUE, NA, FALSE, TRUE, NA

# This means when we subset:
employee_data[employee_data$age < 40, ]
# We get rows where age < 40 is TRUE *and* rows where the comparison returns NA

What is which() function

This is where the which() function becomes valuable. Let's understand how it works: it returns ==only the indices where the condition is TRUE==.

# Using which() function
which(ages < 40)
# Returns: 1, 4 (only the indices where the condition is TRUE)

# Therefore:
employee_data[which(employee_data$age < 40), ]
# Only returns rows where age < 40 is definitively TRUE

The which() function serves several important purposes:

1. NA Handling:

# Without which()
salary_filter <- employee_data[employee_data$salary > 60000, ]

# With which()
salary_filter_clean <- employee_data[which(employee_data$salary > 60000), ]

# The second approach excludes NA values automatically

2. Multiple Conditions:

# Complex conditions become clearer with which()
high_paid_young <- employee_data[which(
  employee_data$salary > 60000 & 
  employee_data$age < 40
), ]

# This clearly shows which rows meet both conditions

3. Finding Specific Positions:

# Find indices of specific values
which(employee_data$name == "Alice")  # Returns row number for Alice

# Can be used for multiple matches
# CRUCIAL: Pay attention to this more complex and quite interesting use case!
which(employee_data$salary > mean(employee_data$salary, na.rm = TRUE))

Advanced subsetting techniques:

# Using %in% operator for multiple values
selected_employees <- employee_data[which(
  employee_data$name %in% c("Alice", "Bob", "Eve")
), ]

# Combining conditions with NA handling
qualified_employees <- employee_data[which(
  employee_data$age >= 30 &
  employee_data$salary >= 70000 &
  !is.na(employee_data$age) &  # Explicitly exclude NAs
  !is.na(employee_data$salary)
), ]

Important considerations when using which():

1. Performance Impact:

# For very large datasets, which() might have performance implications
# In such cases, you might want to use data.table or dplyr alternatives:
library(dplyr)
qualified_employees <- employee_data %>%
  filter(!is.na(age), !is.na(salary), age >= 30, salary >= 70000)

2. Maintaining Data Integrity:

# which() helps prevent unexpected results
# Bad approach:
mean(employee_data$salary[employee_data$salary > 60000])  # NA rows are included, so the result is NA

# Better approach:
mean(employee_data$salary[which(employee_data$salary > 60000)])  # Excludes NAs

3. Logical Vector Operations:

# Understanding the difference
logical_vector <- employee_data$age < 40           # Contains TRUE, FALSE, NA
which_vector <- which(employee_data$age < 40)      # Contains only matching indices

# This can be important for calculations
sum(logical_vector)    # Might give NA
length(which_vector)   # Gives actual count of TRUE values

Best Practices:

1. Always consider NA values in your data:

# Check for NAs before subsetting
sum(is.na(employee_data$age))
sum(is.na(employee_data$salary))

# Document your NA handling strategy

2. Use explicit NA handling when needed:

# Combining which() with explicit NA handling
clean_subset <- employee_data[which(
  employee_data$age < 40 & 
  !is.na(employee_data$age)
), ]

3. Consider using modern alternatives:

# CRUCIAL: For real, take a close look at this library!
# dplyr approach for complex subsetting
library(dplyr)
clean_subset <- employee_data %>%
  filter(!is.na(age), age < 40)

Data Manipulation with the dplyr package

Understanding dplyr's Core Philosophy

The dplyr package is built around a set of verb functions that each perform a specific data manipulation task. Let's start with a practical example:

library(dplyr)

# Create a sample dataset for demonstration
sales_data <- data.frame(
    date = as.Date('2024-01-01') + 0:29,
    region = rep(c("North", "South", "East", "West"), each = 8)[1:30],
    sales = round(runif(30, 1000, 5000)),
    profit = round(runif(30, 100, 1000))
)

# Basic dplyr operations pipeline
sales_analysis <- sales_data %>%
    group_by(region) %>%
    summarise(
        total_sales = sum(sales),
        avg_profit = mean(profit),
        transactions = n() # the number of rows after grouping
    ) %>%
    arrange(desc(total_sales))

[!info] ATTENTION: Here n() is ==a counting function== in dplyr that returns ==the number of rows in the current group==.

PS: much like COUNT() in SQL.
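
For comparison, dplyr's own count() is shorthand for the same group_by() + summarise(n = n()) pattern; a minimal sketch reusing the sales_data created above:

# count() is equivalent to group_by() + summarise(n = n())
sales_data %>% count(region)

sales_data %>%
    group_by(region) %>%
    summarise(n = n())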

Key Concepts and Crucial Moments to Watch:

1. The Pipeline Operator (%>%):

# Without pipeline (harder to read)
arrange(
    summarise(
        group_by(sales_data, region),
        total_sales = sum(sales)
    ),
    desc(total_sales)
)

# With pipeline (more readable)
sales_data %>%
    group_by(region) %>%
    summarise(total_sales = sum(sales)) %>%
    arrange(desc(total_sales))

[!important] IMPORTANT: The pipeline operator ==passes the result as the first argument to the next function==. If the result needs to go to a different argument, use the `.` placeholder, as sketched below.
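
A minimal sketch of the magrittr dot placeholder, reusing the sales_data from above (the lm() call is just for illustration):

# `.` stands for the piped-in value; here it is routed to lm()'s `data` argument
sales_data %>%
    lm(profit ~ sales, data = .)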

2. Grouping Operations:

# Simple grouping
sales_data %>%
    group_by(region) %>%
    summarise(mean_sales = mean(sales))

# Multiple grouping variables
sales_data %>%
    group_by(region, month = format(date, "%m")) %>%
    summarise(mean_sales = mean(sales))

# CRUCIAL: Remember to ungroup when needed.
# After ungroup(), the next mutate() is computed over the whole dataset,
# NOT within the previously defined groups
sales_data %>%
    group_by(region) %>%
    mutate(region_avg = mean(sales)) %>%
    ungroup() %>%  # Don't forget this!
    mutate(overall_avg = mean(sales))

[!info] Comparison between mutate and summarise: summarise() reduces each group to ==a single summary row==,

while mutate() creates new columns and ==preserves the original number of rows==.

PS: mutate() simply attaches the computed value (aggregate or otherwise) as a new column on every row, without reducing the data. summarise(), in contrast, collapses the data (by the grouping columns) into summary statistics.
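
A minimal side-by-side sketch, reusing the sales_data created above:

sales_data %>%
    group_by(region) %>%
    summarise(avg_sales = mean(sales))    # collapses to one row per region

sales_data %>%
    group_by(region) %>%
    mutate(region_avg = mean(sales)) %>%  # keeps all 30 rows, adds a column
    ungroup()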

3. Summarizing with NA Values:

# Add some NA values for demonstration
sales_data$sales[c(5, 15)] <- NA

# Bad: NAs will propagate
sales_data %>%
    group_by(region) %>%
    summarise(mean_sales = mean(sales))

# Good: Handle NAs explicitly
sales_data %>%
    group_by(region) %>%
    summarise(mean_sales = mean(sales, na.rm = TRUE))

4. Multiple Operations and Order Sensitivity:

# Order matters! Be careful with these operations
sales_data %>%
    group_by(region) %>%
    filter(sales > mean(sales)) %>%  # This uses group-wise mean
    summarise(high_sales_count = n())

# Different result if we change order
# CRUCIAL: A very interesting use case
sales_data %>%
    filter(sales > mean(sales)) %>%  # This uses overall mean
    group_by(region) %>%
    summarise(high_sales_count = n())

5. Common Pitfalls and Solutions:

Handling Grouped Operations:

# Problem: inside group_by(region), sum(sales) is the regional total,
# so pct_of_total below is really the share within each region
sales_data %>%
    group_by(region) %>%
    mutate(pct_of_total = sales / sum(sales)) %>%
    ungroup()  # Always ungroup after grouped operations

# Solution: Be explicit about grouping scope
sales_data %>%
    mutate(total_sales = sum(sales)) %>%
    group_by(region) %>%
    mutate(pct_of_region = sales / sum(sales),
           pct_of_total = sales / first(total_sales)) %>%
    ungroup()

# Worked example with dplyr's built-in starwars dataset
starwars %>%
  group_by(homeworld) %>% # grouping by the homeworld field
  filter(homeworld %in% c('Tatooine', 'Naboo') | eye_color == 'blue') %>% # multiple filter conditions
  summarise(population = n()) %>% # counting the number of characters per group
  arrange(desc(population)) %>% # ordering by population in descending order
  slice(1:3) %>% # keeping the top 3 homeworlds with the largest counts
  ungroup() # ungrouping for safety

Joining Tables:

# Create a reference table
region_info <- data.frame(
    region = c("North", "South", "East", "West"),
    manager = c("Alice", "Bob", "Charlie", "David")
)

# Safe joining with explicit join type
sales_analysis <- sales_data %>%
    left_join(region_info, by = "region")  # Be explicit about join columns

# Check for unmatched rows
anti_join(sales_data, region_info, by = "region")

6. Advanced Features and Best Practices:

Using across() for Multiple Columns:

# Modern approach for operating on multiple columns
sales_data %>%
    group_by(region) %>%
    summarise(across(
        c(sales, profit),
        list(
            mean = ~mean(., na.rm = TRUE),
            sd = ~sd(., na.rm = TRUE)
        ),
        .names = "{.col}_{.fn}"
    ))

Syntax explained:

mean = ~mean(., na.rm = TRUE)

Breaking it down:

  • ~ is a formula operator in R
  • . represents the ==current column== being processed
  • mean = names the output
  • The whole structure is a lambda/anonymous function

Same formula written in traditional R:

# Traditional function
function(x) mean(x, na.rm = TRUE)

# dplyr shorthand
~mean(., na.rm = TRUE)

Example with multiple calculations:

# Verbose way
sales_data %>%
    group_by(region) %>%
    summarise(
        sales_mean = mean(sales, na.rm = TRUE),
        sales_sd = sd(sales, na.rm = TRUE),
        profit_mean = mean(profit, na.rm = TRUE),
        profit_sd = sd(profit, na.rm = TRUE)
    )

# Compact way using across()
sales_data %>%
    group_by(region) %>%
    summarise(across(
        c(sales, profit),
        list(
            mean = ~mean(., na.rm = TRUE),
            sd = ~sd(., na.rm = TRUE)
        )
    ))

Key Takeaways:

  1. Always be mindful of grouping:

    • Use group_by() intentionally
    • Remember to ungroup() when finished
  2. Handle missing values explicitly:

    • Use na.rm = TRUE when appropriate
    • Consider filtering NAs beforehand if they're problematic
  3. Pay attention to operation order:

    • Operations are sequential
    • Grouping affects subsequent calculations
    • Filtering before or after grouping can give different results
  4. Document your pipeline:

    • Add comments explaining complex transformations
    • Break long pipelines into meaningful chunks
    • Consider intermediate assignments for clarity
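
For instance, a long pipeline can be split into named intermediate steps (a sketch reusing sales_data; the object names are illustrative):

# Step 1: aggregate sales per region
regional_summary <- sales_data %>%
    group_by(region) %>%
    summarise(total_sales = sum(sales, na.rm = TRUE))

# Step 2: rank the regions, kept separate so each step is easy to inspect
ranked_regions <- regional_summary %>%
    arrange(desc(total_sales))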

Data Restructuring

Code Examples:

library(tidyr)
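
# Hypothetical input data (assumed for illustration): one row per respondent,
# one SurveyItem* column per question
wide_data <- data.frame(
  RespondentID = 1:3,
  SurveyItem1 = c(4, 5, 3),
  SurveyItem2 = c(2, 4, 5)
)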

# Wide to Long format
long_data <- wide_data %>%
  pivot_longer(
    cols = starts_with("SurveyItem"),
    names_to = "Question",
    values_to = "Response"
  )

# Long to Wide format
wide_data <- long_data %>%
  pivot_wider(
    names_from = Question,
    values_from = Response
  )

Key Areas to Watch For:

  1. Logical Operations:
  • Parentheses matter in complex conditions:
# Different results:
data[(x < 5 | x > 10) & y == "A", ]  # Correct
data[x < 5 | x > 10 & y == "A", ]    # Wrong: evaluates as x < 5 | (x > 10 & y == "A") because & binds tighter than |
  2. Data Type Verification:
# Always check data types after import
str(data)
class(data$column)

# Convert if necessary
data$column <- as.factor(data$column)
  3. Missing Values:
# Check for missing values
sum(is.na(data))

# Handle missing values explicitly
data %>%
  filter(!is.na(column)) %>%
  summarise(mean = mean(value))
  4. Merging Data:
# Ensure key columns are properly identified
merged_data <- merge(
  data1, data2,
  by.x = "ID1", by.y = "ID2",
  all = TRUE  # Keep all rows
)

# Always verify merge results
dim(data1)  # Original dimensions
dim(data2)  # Original dimensions
dim(merged_data)  # Should make sense given the merge type

References

  1. Introduction to the dplyr package
  2. Index page of the dplyr package (which also includes the first reference)
  3. The tidyr package for data restructuring
...
R Data Structures and packages - Week 2 Lecture Notes

17501 • Apr 5, 2025

Introduction

R provides several data structures to store data in different formats. These include:

  • Vectors
  • Factors
  • Data Frames
  • Matrices
  • Lists
  • Arrays (not covered in this module)

Homogeneous vs. Heterogeneous Data Structures

| Dimension | Homogeneous   | Heterogeneous |
| --------- | ------------- | ------------- |
| 1D        | Atomic Vector | List          |
| 2D        | Matrix        | Data Frame    |
| nD        | Array         |               |

  • Homogeneous: All elements must be of the same type.
  • Heterogeneous: Elements can be of different types.
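
A quick way to see these structures in the console (a minimal sketch using base R's class() function):

    class(c(1, 2, 3))                    # "numeric": atomic vector
    class(list(1, "a", TRUE))            # "list"
    class(matrix(1:4, nrow = 2))         # "matrix" "array" (in R >= 4.0)
    class(data.frame(x = 1:2))           # "data.frame"
    class(array(1:8, dim = c(2, 2, 2)))  # "array"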

Vectors

  • Definition: A basic data structure that stores multiple values of the same type.

  • Creation: Use the c() function.

    vector1 <- c(3, 6, 9)
    
  • Length: Use length() to find the number of elements.

    length(vector1)  # Returns 3
    
  • Accessing Elements: Use bracket notation []. Indexing starts at 1.

    vector1[1]      # First element (3)
    vector1[2]      # Second element (6)
    vector1[c(1,2)] # First and second elements (3, 6)
    vector1[-1]     # All elements except the first
    vector1[-c(1,2)] # All elements except first and second
    
  • Edge Cases:

    • vector1[0] returns an empty vector
    • vector1[4] returns NA for a vector of length 3
    • Negative indices remove elements at those positions
    • Cannot mix positive and negative indices in the same selection
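
    A short demonstration of these edge cases (with vector1 <- c(3, 6, 9) as created above):

    vector1[0]         # numeric(0): an empty vector of the same type
    vector1[4]         # NA (index beyond the vector's length)
    vector1[c(1, -1)]  # Error: positive and negative indices cannot be mixed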
  • Negative Indexing: Omits elements.

    vector1[-1]  # Returns elements except the first
    
  • Modifying Elements:

    vector1[2] <- 100  # Changes the second element to 100
    
  • Adding Elements:

    vector1[4] <- 200  # Adds 200 at position 4
    
  • Vector Operations:

    • Arithmetic operations are element-wise.
    vector1 + 1  # Adds 1 to each element
    vector1 * 2  # Multiplies each element by 2
    
  • Mixed Data Types: R implicitly converts mixed types to character.

    vector10 <- c(10, "20", 30)  # Converts to character
    
  • Using seq() Function: For generating more complex sequences:

    # Basic usage with named arguments
    seq(from = 0, to = 10, by = 2)  # Returns: 0 2 4 6 8 10
    
    # Same call with positional arguments
    seq(0, 10, by = 2)              # Returns: 0 2 4 6 8 10
    
    # Decimal steps are allowed
    seq(0, 5, by = 0.5)            # Returns: 0.0 0.5 1.0 1.5 ... 4.0 4.5 5.0
    
    # Watch out for unexpected results with step size
    seq(1, 10, by = 5)             # Returns: 1 6 only
    seq(1, 10, by = 3)             # Returns: 1 4 7 10
    
  • Important Considerations:

    • The sequence always starts at from and proceeds by steps of size by
    • It will not exceed the to value, which may result in fewer elements than expected
    • Using named arguments (from, to, by) makes code more readable and less error-prone
    • The by argument determines how many elements you get, so choose it carefully

Factors

  • Definition: Special vectors for storing ==categorical data==.
  • Creation: Use factor().

[!info] Categorical Variables in R: Factors are R's way of storing categorical data, like ==enums== in other languages. They store values ==as integers internally== but display them as predefined categories.

  • Creating Factors:

    # Basic factor creation
    phoneType <- factor(c("iPhone", "Android", "Android", "iPhone", "Other"))
    
    # Creating with predefined levels (including levels that might appear later)
    phoneType <- factor(c("iPhone", "Android"),
                       levels = c("iPhone", "Android", "Other", "Windows"))
    
  • Understanding Levels:

    # Levels are the unique categories allowed in the factor
    levels(phoneType)  # Shows all possible categories
    
    # Factors are stored as integers internally
    as.numeric(phoneType)  # Shows the internal integer representation
    # e.g. for the first factor above (default levels are alphabetical): 2 1 1 2 3
    # (where 1 = Android, 2 = iPhone, 3 = Other)
    
  • Working with Levels:

    # Check current levels
    str(phoneType)     # Shows factor structure and levels
    
    # Modify level names
    levels(phoneType)[levels(phoneType) == "Other"] <- "Unknown"
    
    # Reorder levels (useful for plotting and modeling)
    phoneType <- relevel(phoneType, ref = "iPhone")  # Make iPhone the reference level
    phoneType <- factor(phoneType, 
                       levels = c("iPhone", "Android", "Other"))  # Complete reordering
    
  • Handling Missing Values:

    # NA values are allowed and handled specially
    phoneType <- factor(c("iPhone", "Android", NA, "iPhone"))
    is.na(phoneType)  # Identifies NA values
    
  • Common Operations:

    # Count occurrences of each level
    table(phoneType)         # Basic frequency table
    summary(phoneType)       # Similar to table() but includes NA count
    
    # Convert to/from factors
    as.character(phoneType)  # Convert factor to character vector
    as.factor(c("A", "B"))  # Convert character vector to factor
    
  • Important Considerations:

    • Factors are memory-efficient for repeated categorical values
    • They maintain order of categories (unlike character vectors)
    • They prevent data entry errors by allowing only predefined values
    • Useful for statistical modeling where categorical variables need special handling
    # Assigning a value that is not in the levels creates NA with a warning
    phoneType[1] <- "Windows Phone"  # NA generated if "Windows Phone" is not in the levels
    
    # To add new levels, you must redefine the factor
    phoneType <- factor(phoneType, 
                       levels = c(levels(phoneType), "Windows Phone"))
    
  • Practical Example:

    # Real-world usage in data analysis
    satisfaction <- factor(c("High", "Medium", "Low", "High"),
                          levels = c("Low", "Medium", "High"),
                          ordered = TRUE)  # Creates an ordered factor
    
    # Useful for plotting
    barplot(table(satisfaction))  # Creates bar plot with categories
    

Data Frames

[!info] Definition and Core Concepts: Data frames are 2-dimensional structures ==similar to database tables or Excel sheets==

  • ==Each column== can have a different data type (unlike matrices)
  • ==Column names== must be unique
  • ==All columns== must have the same number of rows
  • Creating Data Frames:

    # Basic creation
    employeeData <- data.frame(
      EmployeeID = 101:105,
      FirstName = c("Kim", "Ken", "Bob", "Bill", "Cindy"),
      Age = c(24, 23, 54, NA, 64),
      PayType = factor(c("Hourly", "Salaried", "Hourly", "Hourly", "Salaried")),
      stringsAsFactors = FALSE  # Prevents automatic conversion of strings to factors
    )
    
    # From existing vectors
    ids <- 1:3
    names <- c("Alice", "Bob", "Charlie")
    scores <- c(85, 92, 78)
    df <- data.frame(ID = ids, Name = names, Score = scores)
    
    # From a matrix
    mat <- matrix(1:9, nrow = 3)
    df_from_matrix <- as.data.frame(mat)
    
  • Examining Data Frame Structure:

    # Basic information
    str(employeeData)        # Shows structure
    dim(employeeData)        # Returns dimensions
    nrow(employeeData)       # Number of rows
    ncol(employeeData)       # Number of columns
    names(employeeData)      # Column names
    head(employeeData, n=2)  # First 2 rows
    tail(employeeData, n=2)  # Last 2 rows
    
  • Accessing and Subsetting:

    # Column access
    employeeData$FirstName           # Using $
    employeeData[["FirstName"]]      # Using [[]]
    employeeData[, "FirstName"]      # Using [,]
    employeeData[, c("FirstName", "Age")]  # Multiple columns
    
    # Row access
    employeeData[1, ]               # First row
    employeeData[1:3, ]             # First three rows
    
    # Both rows and columns
    # TODO: Pay more attention here!!! Useful for filtering !!!
    employeeData[1:3, c("FirstName", "Age")]  # First three rows, two specific columns
    employeeData[c(1, 3), c(2, 4)]  # rows 1 and 3, columns 2 and 4
    
    # Conditional subsetting
    employeeData[employeeData$Age > 30, ]     # Rows where Age > 30
    
    # TODO: Pay more attention here!!! Useful for filtering !!!
    subset(employeeData, Age > 30)            # Same thing using subset()
    
  • Modifying Data Frames:

    # Adding new columns
    employeeData$Department <- c("HR", "IT", "Sales", "IT", "HR")
    employeeData[["Salary"]] <- c(50000, 60000, 75000, 65000, 80000)
    
    # Modifying existing columns
    employeeData$Age <- employeeData$Age + 1  # Increment all ages
    
    # Adding new rows
    new_row <- data.frame(
      EmployeeID = 106,
      FirstName = "Jamie",
      Age = 56,
      PayType = "Hourly",
      Department = "Sales",
      Salary = 70000
    )
    employeeData <- rbind(employeeData, new_row)
    
    # Removing rows/columns
    employeeData <- employeeData[-1, ]         # Remove first row
    employeeData$Department <- NULL            # Remove Department column
    
  • Common Operations:

    # Sorting (Very Useful)
    employeeData[order(employeeData$Age), ]    # Sort by Age
    employeeData[order(employeeData$Department, -employeeData$Salary), ]  # Multiple sort criteria
    
    # Summarizing
    summary(employeeData)                      # Statistical summary
    table(employeeData$Department)             # Frequency table of Department
    
    # Aggregating
    aggregate(Salary ~ Department, data = employeeData, FUN = mean)  # Mean salary by department
    
    # Handling missing values
    complete.cases(employeeData)               # Identify complete rows
    na.omit(employeeData)                      # Remove rows with any NA
    
  • Advanced Features:

    # Merging data frames
    df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
    df2 <- data.frame(ID = 2:4, Score = c(88, 94, 82))
    merge(df1, df2, by = "ID")                # SQL-like join
    
    # Reshaping data
    # Wide to long format
    library(tidyr)
    long_data <- gather(employeeData, key = "Variable", value = "Value", -EmployeeID)
    
    # Computing on columns
    employeeData$BonusEligible <- employeeData$Salary > 70000
    

Matrices

  • Definition: 2D homogeneous data structure.
  • Creation: Use matrix().
    matrix1 <- matrix(c(1, 0, -20, 0, 1, -15, 1, -1, 0), nrow = 3, ncol = 3, byrow = TRUE)
    
  • Accessing Elements:
    matrix1[2, 3]  # Element at row 2, column 3
    matrix1[1, ]   # Entire first row
    matrix1[, 2]   # Entire second column
    
  • Matrix Operations:
    mat1 + mat2  # Element-wise addition
    mat1 * mat2  # Element-wise multiplication
    

Arrays

  • Definition: nD homogeneous data structure.
  • Creation: Use array().
    array1 <- array(c(1:8), dim = c(2, 2, 2))
    
  • Named Arrays:
    named_array <- array(c(1:8), dim = c(2, 2, 2), dimnames = list(c("r1", "r2"), c("c1", "c2"), c("m1", "m2")))
    

Lists

  • Definition: Heterogeneous data structure that can nest other objects.
  • Creation: Use list().
    list1 <- list(Element1 = demoVec, Element2 = c("A", "B"), Element3 = 3, Element4 = demoDF)
    
  • Accessing Elements:
    list1$Element1  # Returns the first element
    list1[[1]]      # Same as above
    
  • Adding Elements:
    list1$NewElement <- "New Value"  # Adds a new element
    

R Packages

  • Package Basics:

    • Packages expand R's default functionality
    • Thousands available via CRAN (Comprehensive R Archive Network)
    • Browse packages at:
      • By name: cran.r-project.org/web/packages/available_packages_by_name.html
      • By task: cran.r-project.org/web/views/
  • Installation Methods:

    # Method 1: Using install.packages() function
    install.packages(c("foreign", "readr", "haven"))
    
    # Method 2: For packages with potential installation issues
    install.packages(c("dplyr", "car"), 
                    dependencies = TRUE, 
                    type = "binary", 
                    ask = FALSE)
    
    # Method 3: Interactive installation via RStudio
    # Tools -> Install Packages...
    

    Important: Package installation only needs to happen once per R installation; packages typically need to be reinstalled after upgrading to a new R version

  • Loading Packages:

    # Load one package at a time
    library(foreign)
    library(readr)
    library(haven)
    
    # Get package documentation
    help(package = "haven")
    
  • Package Management:

    # Update packages
    update.packages()  # Via function
    # Or: Tools -> Check for Package Updates... (in RStudio)
    
    # Unload a package
    detach("package:readr", unload = TRUE)
    
  • Handling Package Conflicts:

    # When packages have functions with same name:
    # Option 1: Use package-specific reference
    dplyr::recode(data)  # Use dplyr's recode
    car::recode(data)    # Use car's recode
    
    # Option 2: Control through loading order
    library(car)      # Load first package
    library(dplyr)    # Most recently loaded takes precedence
    
  • Best Practices:

    • Avoid reinstalling packages unnecessarily
    • Be aware of package loading order
    • Consider using specific package references (::) for functions with same names
    • Watch for package conflict messages when loading libraries
    • RStudio 1.2+ will offer to install missing packages automatically
    • Restart R before updating packages that are currently loaded

File Paths and Working Directories

  • Setting Working Directory:
    setwd("~/Documents/ResearchProject")
    
  • Getting Working Directory:
    getwd()
  • Relative Paths:
    list.files("Data Files")  # Lists files in the "Data Files" subdirectory
    list.files("Data Files/More Data")
    list.files("..")  # step down a folder
    

Reading Data Files

  • CSV Files:
    CSV_base_example <- read.csv("Data Files/ExampleData.csv")
    
    # using an external library
    library(readr)
    CSV_readr_example <- read_csv("Data Files/ExampleData.csv")
    
  • Excel Files:
    # using an external library
    library(readxl)
    XSLX_readxl_example <- read_excel("Data Files/ExampleData.xlsx", sheet = 1)
    
  • SPSS Files:
    # with `foreign` library
    library(foreign)
    
    SAV_foreign_example <- read.spss("Data Files/ExampleData.sav", to.data.frame = TRUE)
    
    # with `haven` library
    SAV_haven_example <- read_sav("Data Files/ExampleData.sav")
    

Writing Data Files

  • CSV Files:

	write.csv(CSV_readr_example, file = "Data Files/More Data/WriteExampleData.csv", row.names = FALSE)
	
	# TODO: note that data.table::fwrite() is a faster alternative
	data.table::fwrite(CSV_readr_example, file = "AltExampleData.csv")
  • Excel Files:
    # using the `writexl` library
    library(writexl)
    write_xlsx(CSV_readr_example, path = "Data Files/More Data/WriteExampleData.xlsx")
    
  • SPSS Files:
    write_sav(CSV_readr_example, path = "Data Files/More Data/WriteExampleData.sav")
    


...
ISDS Preparation

Anonymous • Mar 31, 2025

I have recently started revising the materials from the Introduction to Statistics and Data Science module at WIUT.

Here is what I love about the module:

  • It covers a great range of topics that are widely used in every domain

Here is the list of R-related lecture notes

As there will be some questions on the R programming language, I decided to share my lecture notes with you:

...
Introduction to R Programming - Week 1 Lecture Notes

admin3 • Mar 23, 2025

Table of Contents

  1. Software Installation
  2. RStudio Interface
  3. Arithmetic Operations
  4. Mathematical Functions
  5. Relational Operators
  6. Data Classes
  7. Missing Data
  8. Type Conversion and Comparison
  9. R Objects and Assignment

Software Installation

Required Software

  • R: The core programming language
    • Download from CRAN (Comprehensive R Archive Network)
    • Platform-specific versions available for Mac, Windows, Linux
  • RStudio: Integrated Development Environment (IDE)
    • Download from Posit website
    • Provides unified interface across operating systems

Important URLs

  • R (CRAN): https://cran.r-project.org
  • RStudio (Posit): https://posit.co/download/rstudio-desktop/

RStudio Interface

Key Components

  1. Source Pane (Top Left)

    • Where R scripts are written and edited
    • Save files with .R extension
    # Example script content
    # Calculate average temperature
    temp_celsius <- 25
    temp_fahrenheit <- (temp_celsius * 9/5) + 32
    
  2. Console (Bottom Left)

    • Displays executed commands and output
    • Direct command entry possible
    > 2 + 2
    [1] 4
    
  3. Environment Pane (Top Right)

    • Shows active variables and objects
    # After running:
    temp_celsius    # Value: 25
    temp_fahrenheit # Value: 77
    

Arithmetic Operations

Basic Operators with Examples

# Addition
5 + 3        # Output: 8
# Subtraction
10 - 4       # Output: 6
# Multiplication
6 * 7        # Output: 42
# Division
15 / 3       # Output: 5
# Exponents
2 ^ 3        # Output: 8
# Modulo (remainder)
17 %% 5      # Output: 2
# Integer division
17 %/% 5     # Output: 3

Order of Operations Examples

# Different results based on parentheses
4 + 2 * 3        # Output: 10 (multiplication first)
(4 + 2) * 3      # Output: 18 (addition first)

# Complex calculation
((10 + 5) * 2) / 5   # Output: 6

Mathematical Functions

Common Functions with Examples

# Square root
sqrt(16)              # Output: 4
sqrt(c(9, 16, 25))   # Output: 3 4 5

# Absolute value
abs(-7.5)            # Output: 7.5
abs(c(-2, 0, 2))     # Output: 2 0 2

# Logarithms
log10(100)           # Output: 2
log(exp(1))          # Output: 1

# Exponential
exp(2)               # Output: 7.389056

Function Help Example

# Getting help for sqrt function
?sqrt
# Returns documentation showing:
# sqrt(x)   # where x is a numeric vector

Relational Operators

Examples with Different Data Types

# Numeric comparisons
5 < 10               # Output: TRUE
7 >= 7               # Output: TRUE
3 == 3               # Output: TRUE
4 != 5               # Output: TRUE

# String comparisons
"apple" == "apple"   # Output: TRUE
"a" < "b"            # Output: TRUE

# Mixed type comparisons
5 == "5"             # Output: TRUE (the number is coerced to character before comparing)

Data Classes

Type Examples and Conversions

# Numeric
x <- 10.5
typeof(x)            # Output: "double"

# Integer
y <- 10L
typeof(y)            # Output: "integer"

# Character
name <- "John"
typeof(name)         # Output: "character"

# Logical
is_valid <- TRUE
typeof(is_valid)     # Output: "logical"

# Type conversion examples
as.integer(10.7)     # Output: 10
as.character(123)    # Output: "123"
as.numeric("456")    # Output: 456
as.Date(43800, origin = "1899-12-30") # "2019-12-01"

Testing Types

# Using is.* functions
x <- 10.5
is.numeric(x)        # Output: TRUE
is.integer(x)        # Output: FALSE
is.character(x)      # Output: FALSE

# Multiple checks
y <- "123"
is.numeric(y)        # Output: FALSE
is.numeric(as.numeric(y))  # Output: TRUE

Missing Data

Working with NA and NaN

# Creating missing values
x <- c(1, NA, 3, NaN, 5)

# Testing for NA
is.na(x)             # Output: FALSE TRUE FALSE TRUE FALSE

# Calculations with NA
sum(c(1, NA, 3))     # Output: NA
sum(c(1, NA, 3), na.rm = TRUE)  # Output: 4

# NA vs NaN
0/0                  # Output: NaN
NA + 1               # Output: NA

Type Conversion and Comparison

Understanding Type Conversion

# Different numeric types
x_int <- 5L          # integer
x_num <- 5           # numeric/double
x_int == x_num       # Output: TRUE (values are equal)
typeof(x_int) == typeof(x_num)  # Output: FALSE (types are different)

# Detailed example
varA <- 3.3          # double/numeric
varB <- "hello there"  # character
varC <- FALSE        # logical
varD <- 5L           # integer
varE <- 5            # double
varF <- varD + varE  # double (integer + numeric = numeric)
varG <- 2 * varC     # numeric (numeric * logical = numeric)

# Checking types
typeof(varA)   # "double"
typeof(varB)   # "character"
typeof(varC)   # "logical"
typeof(varD)   # "integer"
typeof(varE)   # "double"
typeof(varF)   # "double"
typeof(varG)   # "double"

Key Points About Type Conversion

  1. Implicit Conversion

    • R automatically converts between integer and numeric types in calculations
    • Logical values convert to 1 (TRUE) or 0 (FALSE) in numeric operations
    • The "wider" type usually prevails (e.g., integer + numeric = numeric)
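
    A few quick checks (a minimal sketch):

    TRUE + TRUE          # 2: logicals become 1/0 in arithmetic
    typeof(1L + 2L)      # "integer"
    typeof(1L + 2)       # "double" (the wider type wins)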
  2. Value vs Type Comparison

    5L == 5     # TRUE (comparing values)
    typeof(5L) == typeof(5)  # FALSE (comparing types: "integer" vs "double")
    
  3. Type Hierarchy

    • character > numeric > integer > logical
    • When mixing types, R usually converts to the higher type
    1L + 2.5    # Result is numeric (2.5 wins)
    TRUE + 1L   # Result is integer (1L wins)
    TRUE + 1.0  # Result is numeric (1.0 wins)
    

R Objects and Assignment

Variable Assignment Examples

# Basic assignment
age <- 25
name <- "Alice"

# Multiple assignments
height <- weight <- 70

# Complex assignments
bmi <- weight / (height/100)^2

# Listing objects
ls()                 # Shows all objects in environment

# Removing objects
rm(age)              # Removes single object
rm(list = ls())      # Removes all objects

Naming Conventions Examples

# Valid names
valid_name <- 1
validName <- 2
VALID_NAME <- 3
.hidden_name <- 4

# Invalid names (will cause errors)
# 1name <- 5      # Can't start with number
# _name <- 6      # Can't start with underscore
# name-1 <- 7     # Can't use hyphen

Practice Exercises:

  1. Create variables of different types and test their classes
  2. Perform arithmetic operations with variables
  3. Try working with missing values and understand their behavior
  4. Practice naming conventions and object assignments

References

R Data Structures and packages -> the continuation of the R chronicles (Week 2)

A 2-hour-long video tutorial:

R Programming Tutorial

...