R functions, loops, conditional statements - Week 4 Lecture Notes

17501 • Apr 5, 2025

Functions in R Programming

Basic Function Structure

function_name <- function(parameter1, parameter2) {
    # Function body
    result <- ...   # operations that produce the result
    return(result)
}

Example 1: Simple Function

# Calculate area of rectangle
calculate_area <- function(length, width) {
    area <- length * width
    return(area)
}

# Usage
calculate_area(5, 3)  # Returns 15

Example 2: Function with Default Parameters

greet_user <- function(name = "User") {
    greeting <- paste("Hello,", name, "!")
    return(greeting)
}

greet_user()           # Returns "Hello, User !"
greet_user("Maria")    # Returns "Hello, Maria !"

Conditional Statements (if, else if, else)

Basic Structure

if (condition) {
    # code if condition is TRUE
} else if (another_condition) {
    # code if another_condition is TRUE
} else {
    # code if all conditions are FALSE
}

Example 1: Simple Grade Calculator

get_grade <- function(score) {
    if (score >= 90) {
        return("A")
    } else if (score >= 80) {
        return("B")
    } else if (score >= 70) {
        return("C")
    } else {
        return("F")
    }
}

get_grade(85)  # Returns "B"

Example 2: Number Check

check_number <- function(x) {
    if (x > 0) {
        return("Positive")
    } else if (x < 0) {
        return("Negative")
    } else {
        return("Zero")
    }
}

Conditions and Logical Operators

Common Logical Operators

  • == : Equal to
  • != : Not equal to
  • > : Greater than
  • < : Less than
  • >= : Greater than or equal to
  • <= : Less than or equal to
  • & : AND
  • | : OR
  • ! : NOT

Example: Complex Conditions

check_eligibility <- function(age, income) {
    if (age >= 18 & income >= 30000) {
        return("Eligible")
    } else if (age >= 21 | income >= 50000) {
        return("Conditionally Eligible")
    } else {
        return("Not Eligible")
    }
}

check_eligibility(19, 35000)  # Returns "Eligible"

Example: Multiple Conditions

categorize_day <- function(day, temperature) {
    if (day %in% c("Saturday", "Sunday") & temperature > 20) {
        return("Perfect weekend!")
    } else if (day %in% c("Saturday", "Sunday")) {
        return("Cold weekend")
    } else {
        return("Weekday")
    }
}

categorize_day("Saturday", 25)  # Returns "Perfect weekend!"

Best Practices

  1. Always use clear and descriptive function names
  2. Include documentation/comments for complex functions
  3. Handle edge cases and invalid inputs
  4. Keep functions focused on a single task
  5. Use consistent indentation for readability

Example: Good Practice Implementation

calculate_bmi <- function(weight, height) {
    # Input validation
    if (!is.numeric(weight) | !is.numeric(height)) {
        return("Error: Inputs must be numeric")
    }
    if (weight <= 0 | height <= 0) {
        return("Error: Values must be positive")
    }
    
    # Calculate BMI
    bmi <- weight / (height^2)
    
    # Categorize BMI
    if (bmi < 18.5) {
        return("Underweight")
    } else if (bmi < 25) {
        return("Normal")
    } else if (bmi < 30) {
        return("Overweight")
    } else {
        return("Obese")
    }
}

# Usage
calculate_bmi(70, 1.75)  # Returns "Normal" (BMI ≈ 22.9)
...
Data manipulation in R - Week 3 Lecture Notes

17501 • Apr 5, 2025

Package Management in R

1. Understanding R Package Ecosystem

The R package system is hierarchical:

# Base R: Comes with basic installation
mean(1:10)  # Base function

# Recommended packages: Nearly standard but need loading
library(MASS)  # A recommended package

# Third-party packages: Need installation and loading
install.packages("tidyverse")  # Popular meta-package

2. Package Installation Strategies

Basic Installation:

# Single package
install.packages("dplyr")

# Multiple packages
install.packages(c("dplyr", "ggplot2", "tidyr"))

# With specific parameters for troubleshooting
install.packages("dplyr",
                dependencies = TRUE,  # Install all required packages
                type = "binary",     # Avoid source compilation
                repos = "https://cran.rstudio.com/")  # Specify repository

3. Package Loading and Namespace Management

Basic Loading:

# Standard loading
library(dplyr)

# Alternative loading with error handling
if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

Namespace Conflicts and Resolution:

# Example of conflict
library(dplyr)
library(MASS)  # MASS also provides a select() function, masking dplyr::select()

# Three ways to handle conflicts:

# 1. Explicit namespace
dplyr::select(mtcars, mpg, cyl)

# 2. Unload the masking package
detach("package:MASS", unload = TRUE)

# 3. Import specific functions
import::from(dplyr, select, filter)

4. Package Version Management

Checking and Updating:

# Check installed packages
installed.packages()

# Check specific package version
packageVersion("dplyr")

# Update packages
update.packages()

# Install specific version (requires devtools)
devtools::install_version("dplyr", version = "1.0.0")

5. Important Caveats and Best Practices

Package Loading Order:

# BAD: Potential conflicts unclear
library(dplyr)
library(plyr)
library(tidyr)

# GOOD: Organized loading with comments
# Core data manipulation
library(dplyr)      # Main data manipulation
library(tidyr)      # Data reshaping
# Visualization
library(ggplot2)    # Plotting

Function Conflicts Resolution:

# Check for conflicts
conflicts()

# Create alias for frequently used conflicting functions
filter_df <- dplyr::filter
select_df <- dplyr::select

# Use conflicted package for explicit conflict resolution
library(conflicted)
conflict_prefer("filter", "dplyr")

6. Project-specific Package Management

Using renv for Project Isolation:

# Initialize project-specific package management
renv::init()

# Install project packages
renv::install("dplyr")

# Snapshot current project state
renv::snapshot()

# Restore project packages
renv::restore()

7. Common Pitfalls and Solutions

Package Loading Errors:

# Problem: Package not found
library(nonexistentpackage)  # Error

# Solution: Check and install
if (!require("package")) {
  install.packages("package")
  library(package)
}

# Problem: Version conflicts
# Solution: Use packageVersion() to check versions
if (packageVersion("dplyr") < "1.0.0") {
  install.packages("dplyr")
}

File Path Management

Code Examples:

# Mac/Linux path
"~/Documents/data.csv"

# Windows paths (both valid)
"C:\\Data\\data.csv"    # Double backslash
"C:/Data/data.csv"      # Forward slash

Important Caveats:

  • Path specifications differ between Windows and Mac/Linux
  • Working directory management is crucial:
# Check current working directory
getwd()

# Set working directory
setwd("~/Documents/Project")

# List files in current directory
list.files()

Reading Data Files

Code Examples:

# CSV files
# Base R approach
data_base <- read.csv("file.csv", header = TRUE)

# readr approach (faster)
library(readr)
data_readr <- read_csv("file.csv")

# Excel files
library(readxl)
data_excel <- read_excel("file.xlsx", sheet = 1)

# SPSS files
library(haven)
data_spss <- read_sav("file.sav")

Important Caveats:

  • Always verify data structure after reading:
str(data)
head(data)
  • Excel date handling requires special attention:
# Converting Excel dates
as.Date(43800, origin = "1899-12-30")

Data Manipulation and Subsetting

Basic subsetting in R:

# Create a simple dataset for demonstration
employee_data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  age = c(25, NA, 45, 32, NA),
  salary = c(50000, 60000, NA, 75000, 80000)
)

# Basic subsetting by condition
young_employees <- employee_data[employee_data$age < 40, ]

When we subset like this, R evaluates each row against the condition and returns a logical vector (TRUE/FALSE). However, this introduces our first important consideration: how R handles missing values (NA) in comparisons.

Let's see what happens with NA values in comparisons:

# Demonstrate NA behavior
ages <- c(25, NA, 45, 32, NA)
ages < 40
# Returns: TRUE, NA, FALSE, TRUE, NA

# This means when we subset:
employee_data[employee_data$age < 40, ]
# We get rows where age < 40 is TRUE *and* rows where the comparison returns NA

What is which() function

This is where the which() function becomes valuable. Let's understand how it works: it returns ==only the indices where the condition is TRUE==.

# Using which() function
which(ages < 40)
# Returns: 1, 4 (only the indices where the condition is TRUE)

# Therefore:
employee_data[which(employee_data$age < 40), ]
# Only returns rows where age < 40 is definitively TRUE

The which() function serves several important purposes:

1. NA Handling:

# Without which()
salary_filter <- employee_data[employee_data$salary > 60000, ]

# With which()
salary_filter_clean <- employee_data[which(employee_data$salary > 60000), ]

# The second approach excludes NA values automatically

2. Multiple Conditions:

# Complex conditions become clearer with which()
high_paid_young <- employee_data[which(
  employee_data$salary > 60000 & 
  employee_data$age < 40
), ]

# This clearly shows which rows meet both conditions

3. Finding Specific Positions:

# Find indices of specific values
which(employee_data$name == "Alice")  # Returns row number for Alice

# Can be used for multiple matches
# CRUCIAL: Pay attention to this more complex and quite interesting use case!
which(employee_data$salary > mean(employee_data$salary, na.rm = TRUE))

Advanced subsetting techniques:

# Using %in% operator for multiple values
selected_employees <- employee_data[which(
  employee_data$name %in% c("Alice", "Bob", "Eve")
), ]

# Combining conditions with NA handling
qualified_employees <- employee_data[which(
  employee_data$age >= 30 &
  employee_data$salary >= 70000 &
  !is.na(employee_data$age) &  # Explicitly exclude NAs
  !is.na(employee_data$salary)
), ]

Important considerations when using which():

1. Performance Impact:

# For very large datasets, which() might have performance implications
# In such cases, you might want to use data.table or dplyr alternatives:
library(dplyr)
qualified_employees <- employee_data %>%
  filter(!is.na(age), !is.na(salary), age >= 30, salary >= 70000)

2. Maintaining Data Integrity:

# which() helps prevent unexpected results
# Bad approach:
mean(employee_data$salary[employee_data$salary > 60000])  # NA rows are included, so the result is NA

# Better approach:
mean(employee_data$salary[which(employee_data$salary > 60000)])  # Excludes NAs

3. Logical Vector Operations:

# Understanding the difference
logical_vector <- employee_data$age < 40           # Contains TRUE, FALSE, NA
which_vector <- which(employee_data$age < 40)      # Contains only matching indices

# This can be important for calculations
sum(logical_vector)    # Might give NA
length(which_vector)   # Gives actual count of TRUE values

Best Practices:

1. Always consider NA values in your data:

# Check for NAs before subsetting
sum(is.na(employee_data$age))
sum(is.na(employee_data$salary))

# Document your NA handling strategy

2. Use explicit NA handling when needed:

# Combining which() with explicit NA handling
clean_subset <- employee_data[which(
  employee_data$age < 40 & 
  !is.na(employee_data$age)
), ]

3. Consider using modern alternatives:

# CRUCIAL: For real, take a close look at this library!
# dplyr approach for complex subsetting
library(dplyr)
clean_subset <- employee_data %>%
  filter(!is.na(age), age < 40)

Data Manipulation with the dplyr package

Understanding dplyr's Core Philosophy

The dplyr package is built around a set of verb functions that each perform a specific data manipulation task. Let's start with a practical example:

library(dplyr)

# Create a sample dataset for demonstration
sales_data <- data.frame(
    date = as.Date('2024-01-01') + 0:29,
    region = rep(c("North", "South", "East", "West"), each = 8)[1:30],
    sales = round(runif(30, 1000, 5000)),
    profit = round(runif(30, 100, 1000))
)

# Basic dplyr operations pipeline
sales_analysis <- sales_data %>%
    group_by(region) %>%
    summarise(
        total_sales = sum(sales),
        avg_profit = mean(profit),
        transactions = n() # the number of rows after grouping
    ) %>%
    arrange(desc(total_sales))

[!info] ATTENTION: Here n() is ==a counting function== in dplyr that returns ==the number of rows in the current group==.

PS: much like COUNT() in SQL.
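
For comparison, dplyr's own count() is shorthand for the same group_by() + summarise(n = n()) pattern; a minimal sketch reusing the sales_data created above:

# count() is equivalent to group_by() + summarise(n = n())
sales_data %>% count(region)

sales_data %>%
    group_by(region) %>%
    summarise(n = n())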

Key Concepts and Crucial Moments to Watch:

1. The Pipeline Operator (%>%):

# Without pipeline (harder to read)
arrange(
    summarise(
        group_by(sales_data, region),
        total_sales = sum(sales)
    ),
    desc(total_sales)
)

# With pipeline (more readable)
sales_data %>%
    group_by(region) %>%
    summarise(total_sales = sum(sales)) %>%
    arrange(desc(total_sales))

[!important] IMPORTANT: The pipeline operator ==passes the result as the first argument to the next function==. If the result needs to go to a different argument, use the `.` placeholder, as sketched below.
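
A minimal sketch of the magrittr dot placeholder, reusing the sales_data from above (the lm() call is just for illustration):

# `.` stands for the piped-in value; here it is routed to lm()'s `data` argument
sales_data %>%
    lm(profit ~ sales, data = .)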

2. Grouping Operations:

# Simple grouping
sales_data %>%
    group_by(region) %>%
    summarise(mean_sales = mean(sales))

# Multiple grouping variables
sales_data %>%
    group_by(region, month = format(date, "%m")) %>%
    summarise(mean_sales = mean(sales))

# CRUCIAL: Remember to ungroup when needed.
# After ungroup(), the next mutate() is computed over the whole dataset,
# NOT within the previously defined groups
sales_data %>%
    group_by(region) %>%
    mutate(region_avg = mean(sales)) %>%
    ungroup() %>%  # Don't forget this!
    mutate(overall_avg = mean(sales))

[!info] Comparison between mutate and summarise: summarise() reduces each group to ==a single summary row==,

while mutate() creates new columns and ==preserves the original number of rows==.

PS: mutate() simply attaches the computed value (aggregate or otherwise) as a new column on every row, without reducing the data. summarise(), in contrast, collapses the data (by the grouping columns) into summary statistics.
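
A minimal side-by-side sketch, reusing the sales_data created above:

sales_data %>%
    group_by(region) %>%
    summarise(avg_sales = mean(sales))    # collapses to one row per region

sales_data %>%
    group_by(region) %>%
    mutate(region_avg = mean(sales)) %>%  # keeps all 30 rows, adds a column
    ungroup()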

3. Summarizing with NA Values:

# Add some NA values for demonstration
sales_data$sales[c(5, 15)] <- NA

# Bad: NAs will propagate
sales_data %>%
    group_by(region) %>%
    summarise(mean_sales = mean(sales))

# Good: Handle NAs explicitly
sales_data %>%
    group_by(region) %>%
    summarise(mean_sales = mean(sales, na.rm = TRUE))

4. Multiple Operations and Order Sensitivity:

# Order matters! Be careful with these operations
sales_data %>%
    group_by(region) %>%
    filter(sales > mean(sales)) %>%  # This uses group-wise mean
    summarise(high_sales_count = n())

# Different result if we change order
# CRUCIAL: A very interesting use case
sales_data %>%
    filter(sales > mean(sales)) %>%  # This uses overall mean
    group_by(region) %>%
    summarise(high_sales_count = n())

5. Common Pitfalls and Solutions:

Handling Grouped Operations:

# Problem: inside group_by(region), sum(sales) is the regional total,
# so pct_of_total below is really the share within each region
sales_data %>%
    group_by(region) %>%
    mutate(pct_of_total = sales / sum(sales)) %>%
    ungroup()  # Always ungroup after grouped operations

# Solution: Be explicit about grouping scope
sales_data %>%
    mutate(total_sales = sum(sales)) %>%
    group_by(region) %>%
    mutate(pct_of_region = sales / sum(sales),
           pct_of_total = sales / first(total_sales)) %>%
    ungroup()

# Worked example with dplyr's built-in starwars dataset
starwars %>%
  group_by(homeworld) %>% # grouping by the homeworld field
  filter(homeworld %in% c('Tatooine', 'Naboo') | eye_color == 'blue') %>% # multiple filter conditions
  summarise(population = n()) %>% # counting the number of characters per group
  arrange(desc(population)) %>% # ordering by population in descending order
  slice(1:3) %>% # keeping the top 3 homeworlds with the largest counts
  ungroup() # ungrouping for safety

Joining Tables:

# Create a reference table
region_info <- data.frame(
    region = c("North", "South", "East", "West"),
    manager = c("Alice", "Bob", "Charlie", "David")
)

# Safe joining with explicit join type
sales_analysis <- sales_data %>%
    left_join(region_info, by = "region")  # Be explicit about join columns

# Check for unmatched rows
anti_join(sales_data, region_info, by = "region")

6. Advanced Features and Best Practices:

Using across() for Multiple Columns:

# Modern approach for operating on multiple columns
sales_data %>%
    group_by(region) %>%
    summarise(across(
        c(sales, profit),
        list(
            mean = ~mean(., na.rm = TRUE),
            sd = ~sd(., na.rm = TRUE)
        ),
        .names = "{.col}_{.fn}"
    ))

Syntax explained:

mean = ~mean(., na.rm = TRUE)

Breaking it down:

  • ~ is a formula operator in R
  • . represents the ==current column== being processed
  • mean = names the output
  • The whole structure is a lambda/anonymous function

Same formula written in traditional R:

# Traditional function
function(x) mean(x, na.rm = TRUE)

# dplyr shorthand
~mean(., na.rm = TRUE)

Example with multiple calculations:

# Verbose way
sales_data %>%
    group_by(region) %>%
    summarise(
        sales_mean = mean(sales, na.rm = TRUE),
        sales_sd = sd(sales, na.rm = TRUE),
        profit_mean = mean(profit, na.rm = TRUE),
        profit_sd = sd(profit, na.rm = TRUE)
    )

# Compact way using across()
sales_data %>%
    group_by(region) %>%
    summarise(across(
        c(sales, profit),
        list(
            mean = ~mean(., na.rm = TRUE),
            sd = ~sd(., na.rm = TRUE)
        )
    ))

Key Takeaways:

  1. Always be mindful of grouping:

    • Use group_by() intentionally
    • Remember to ungroup() when finished
  2. Handle missing values explicitly:

    • Use na.rm = TRUE when appropriate
    • Consider filtering NAs beforehand if they're problematic
  3. Pay attention to operation order:

    • Operations are sequential
    • Grouping affects subsequent calculations
    • Filtering before or after grouping can give different results
  4. Document your pipeline:

    • Add comments explaining complex transformations
    • Break long pipelines into meaningful chunks
    • Consider intermediate assignments for clarity
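
For instance, a long pipeline can be split into named intermediate steps (a sketch reusing sales_data; the object names are illustrative):

# Step 1: aggregate sales per region
regional_summary <- sales_data %>%
    group_by(region) %>%
    summarise(total_sales = sum(sales, na.rm = TRUE))

# Step 2: rank the regions, kept separate so each step is easy to inspect
ranked_regions <- regional_summary %>%
    arrange(desc(total_sales))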

Data Restructuring

Code Examples:

library(tidyr)
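
# Hypothetical input data (assumed for illustration): one row per respondent,
# one SurveyItem* column per question
wide_data <- data.frame(
  RespondentID = 1:3,
  SurveyItem1 = c(4, 5, 3),
  SurveyItem2 = c(2, 4, 5)
)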

# Wide to Long format
long_data <- wide_data %>%
  pivot_longer(
    cols = starts_with("SurveyItem"),
    names_to = "Question",
    values_to = "Response"
  )

# Long to Wide format
wide_data <- long_data %>%
  pivot_wider(
    names_from = Question,
    values_from = Response
  )

Key Areas to Watch For:

  1. Logical Operations:
  • Parentheses matter in complex conditions:
# Different results:
data[(x < 5 | x > 10) & y == "A", ]  # Correct
data[x < 5 | x > 10 & y == "A", ]    # Wrong: evaluates as x < 5 | (x > 10 & y == "A") because & binds tighter than |
  2. Data Type Verification:
# Always check data types after import
str(data)
class(data$column)

# Convert if necessary
data$column <- as.factor(data$column)
  3. Missing Values:
# Check for missing values
sum(is.na(data))

# Handle missing values explicitly
data %>%
  filter(!is.na(column)) %>%
  summarise(mean = mean(value))
  4. Merging Data:
# Ensure key columns are properly identified
merged_data <- merge(
  data1, data2,
  by.x = "ID1", by.y = "ID2",
  all = TRUE  # Keep all rows
)

# Always verify merge results
dim(data1)  # Original dimensions
dim(data2)  # Original dimensions
dim(merged_data)  # Should make sense given the merge type

References

  1. Introduction to the dplyr package
  2. Index page of the dplyr package (which also includes the first reference)
  3. The tidyr package for data restructuring
...
R Data Structures and packages - Week 2 Lecture Notes

17501 • Apr 5, 2025

Introduction

R provides several data structures to store data in different formats. These include:

  • Vectors
  • Factors
  • Data Frames
  • Matrices
  • Lists
  • Arrays (not covered in this module)

Homogeneous vs. Heterogeneous Data Structures

| Dimension | Homogeneous   | Heterogeneous |
| --------- | ------------- | ------------- |
| 1D        | Atomic Vector | List          |
| 2D        | Matrix        | Data Frame    |
| nD        | Array         |               |

  • Homogeneous: All elements must be of the same type.
  • Heterogeneous: Elements can be of different types.
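
A quick way to see these structures in the console (a minimal sketch using base R's class() function):

    class(c(1, 2, 3))                    # "numeric": atomic vector
    class(list(1, "a", TRUE))            # "list"
    class(matrix(1:4, nrow = 2))         # "matrix" "array" (in R >= 4.0)
    class(data.frame(x = 1:2))           # "data.frame"
    class(array(1:8, dim = c(2, 2, 2)))  # "array"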

Vectors

  • Definition: A basic data structure that stores multiple values of the same type.

  • Creation: Use the c() function.

    vector1 <- c(3, 6, 9)
    
  • Length: Use length() to find the number of elements.

    length(vector1)  # Returns 3
    
  • Accessing Elements: Use bracket notation []. Indexing starts at 1.

    vector1[1]      # First element (3)
    vector1[2]      # Second element (6)
    vector1[c(1,2)] # First and second elements (3, 6)
    vector1[-1]     # All elements except the first
    vector1[-c(1,2)] # All elements except first and second
    
  • Edge Cases:

    • vector1[0] returns an empty vector
    • vector1[4] returns NA for a vector of length 3
    • Negative indices remove elements at those positions
    • Cannot mix positive and negative indices in the same selection
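
    A short demonstration of these edge cases (with vector1 <- c(3, 6, 9) as created above):

    vector1[0]         # numeric(0): an empty vector of the same type
    vector1[4]         # NA (index beyond the vector's length)
    vector1[c(1, -1)]  # Error: positive and negative indices cannot be mixed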
  • Negative Indexing: Omits elements.

    vector1[-1]  # Returns elements except the first
    
  • Modifying Elements:

    vector1[2] <- 100  # Changes the second element to 100
    
  • Adding Elements:

    vector1[4] <- 200  # Adds 200 at position 4
    
  • Vector Operations:

    • Arithmetic operations are element-wise.
    vector1 + 1  # Adds 1 to each element
    vector1 * 2  # Multiplies each element by 2
    
  • Mixed Data Types: R implicitly converts mixed types to character.

    vector10 <- c(10, "20", 30)  # Converts to character
    
  • Using seq() Function: For generating more complex sequences:

    # Basic usage with named arguments
    seq(from = 0, to = 10, by = 2)  # Returns: 0 2 4 6 8 10
    
    # Same call with positional arguments
    seq(0, 10, by = 2)              # Returns: 0 2 4 6 8 10
    
    # Decimal steps are allowed
    seq(0, 5, by = 0.5)            # Returns: 0.0 0.5 1.0 1.5 ... 4.0 4.5 5.0
    
    # Watch out for unexpected results with step size
    seq(1, 10, by = 5)             # Returns: 1 6 only
    seq(1, 10, by = 3)             # Returns: 1 4 7 10
    
  • Important Considerations:

    • The sequence always starts at from and proceeds by steps of size by
    • It will not exceed the to value, which may result in fewer elements than expected
    • Using named arguments (from, to, by) makes code more readable and less error-prone
    • The by argument determines how many elements you get, so choose it carefully

Factors

  • Definition: Special vectors for storing ==categorical data==.
  • Creation: Use factor().

[!info] Categorical Variables in R: Factors are R's way of storing categorical data, like ==enums== in other languages. They store values ==as integers internally== but display them as predefined categories.

  • Creating Factors:

    # Basic factor creation
    phoneType <- factor(c("iPhone", "Android", "Android", "iPhone", "Other"))
    
    # Creating with predefined levels (including levels that might appear later)
    phoneType <- factor(c("iPhone", "Android"),
                       levels = c("iPhone", "Android", "Other", "Windows"))
    
  • Understanding Levels:

    # Levels are the unique categories allowed in the factor
    levels(phoneType)  # Shows all possible categories
    
    # Factors are stored as integers internally
    as.numeric(phoneType)  # Shows the internal integer representation
    # e.g. for the first factor above (default levels are alphabetical): 2 1 1 2 3
    # (where 1 = Android, 2 = iPhone, 3 = Other)
    
  • Working with Levels:

    # Check current levels
    str(phoneType)     # Shows factor structure and levels
    
    # Modify level names
    levels(phoneType)[levels(phoneType) == "Other"] <- "Unknown"
    
    # Reorder levels (useful for plotting and modeling)
    phoneType <- relevel(phoneType, ref = "iPhone")  # Make iPhone the reference level
    phoneType <- factor(phoneType, 
                       levels = c("iPhone", "Android", "Other"))  # Complete reordering
    
  • Handling Missing Values:

    # NA values are allowed and handled specially
    phoneType <- factor(c("iPhone", "Android", NA, "iPhone"))
    is.na(phoneType)  # Identifies NA values
    
  • Common Operations:

    # Count occurrences of each level
    table(phoneType)         # Basic frequency table
    summary(phoneType)       # Similar to table() but includes NA count
    
    # Convert to/from factors
    as.character(phoneType)  # Convert factor to character vector
    as.factor(c("A", "B"))  # Convert character vector to factor
    
  • Important Considerations:

    • Factors are memory-efficient for repeated categorical values
    • They maintain order of categories (unlike character vectors)
    • They prevent data entry errors by allowing only predefined values
    • Useful for statistical modeling where categorical variables need special handling
    # Assigning a value that is not in the levels creates NA with a warning
    phoneType[1] <- "Windows Phone"  # NA generated if "Windows Phone" is not in the levels
    
    # To add new levels, you must redefine the factor
    phoneType <- factor(phoneType, 
                       levels = c(levels(phoneType), "Windows Phone"))
    
  • Practical Example:

    # Real-world usage in data analysis
    satisfaction <- factor(c("High", "Medium", "Low", "High"),
                          levels = c("Low", "Medium", "High"),
                          ordered = TRUE)  # Creates an ordered factor
    
    # Useful for plotting
    barplot(table(satisfaction))  # Creates bar plot with categories
    

Data Frames

[!info] Definition and Core Concepts: Data frames are 2-dimensional structures ==similar to database tables or Excel sheets==

  • ==Each column== can have a different data type (unlike matrices)
  • ==Column names== must be unique
  • ==All columns== must have the same number of rows
  • Creating Data Frames:

    # Basic creation
    employeeData <- data.frame(
      EmployeeID = 101:105,
      FirstName = c("Kim", "Ken", "Bob", "Bill", "Cindy"),
      Age = c(24, 23, 54, NA, 64),
      PayType = factor(c("Hourly", "Salaried", "Hourly", "Hourly", "Salaried")),
      stringsAsFactors = FALSE  # Prevents automatic conversion of strings to factors
    )
    
    # From existing vectors
    ids <- 1:3
    names <- c("Alice", "Bob", "Charlie")
    scores <- c(85, 92, 78)
    df <- data.frame(ID = ids, Name = names, Score = scores)
    
    # From a matrix
    mat <- matrix(1:9, nrow = 3)
    df_from_matrix <- as.data.frame(mat)
    
  • Examining Data Frame Structure:

    # Basic information
    str(employeeData)        # Shows structure
    dim(employeeData)        # Returns dimensions
    nrow(employeeData)       # Number of rows
    ncol(employeeData)       # Number of columns
    names(employeeData)      # Column names
    head(employeeData, n=2)  # First 2 rows
    tail(employeeData, n=2)  # Last 2 rows
    
  • Accessing and Subsetting:

    # Column access
    employeeData$FirstName           # Using $
    employeeData[["FirstName"]]      # Using [[]]
    employeeData[, "FirstName"]      # Using [,]
    employeeData[, c("FirstName", "Age")]  # Multiple columns
    
    # Row access
    employeeData[1, ]               # First row
    employeeData[1:3, ]             # First three rows
    
    # Both rows and columns
    # TODO: Pay more attention here!!! Useful for filtering !!!
    employeeData[1:3, c("FirstName", "Age")]  # First three rows, two specific columns
    employeeData[c(1, 3), c(2, 4)]  # rows 1 and 3, columns 2 and 4
    
    # Conditional subsetting
    employeeData[employeeData$Age > 30, ]     # Rows where Age > 30
    
    # TODO: Pay more attention here!!! Useful for filtering !!!
    subset(employeeData, Age > 30)            # Same thing using subset()
    
  • Modifying Data Frames:

    # Adding new columns
    employeeData$Department <- c("HR", "IT", "Sales", "IT", "HR")
    employeeData[["Salary"]] <- c(50000, 60000, 75000, 65000, 80000)
    
    # Modifying existing columns
    employeeData$Age <- employeeData$Age + 1  # Increment all ages
    
    # Adding new rows
    new_row <- data.frame(
      EmployeeID = 106,
      FirstName = "Jamie",
      Age = 56,
      PayType = "Hourly",
      Department = "Sales",
      Salary = 70000
    )
    employeeData <- rbind(employeeData, new_row)
    
    # Removing rows/columns
    employeeData <- employeeData[-1, ]         # Remove first row
    employeeData$Department <- NULL            # Remove Department column
    
  • Common Operations:

    # Sorting (Very Useful)
    employeeData[order(employeeData$Age), ]    # Sort by Age
    employeeData[order(employeeData$Department, -employeeData$Salary), ]  # Multiple sort criteria
    
    # Summarizing
    summary(employeeData)                      # Statistical summary
    table(employeeData$Department)             # Frequency table of Department
    
    # Aggregating
    aggregate(Salary ~ Department, data = employeeData, FUN = mean)  # Mean salary by department
    
    # Handling missing values
    complete.cases(employeeData)               # Identify complete rows
    na.omit(employeeData)                      # Remove rows with any NA
    
  • Advanced Features:

    # Merging data frames
    df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
    df2 <- data.frame(ID = 2:4, Score = c(88, 94, 82))
    merge(df1, df2, by = "ID")                # SQL-like join
    
    # Reshaping data
    # Wide to long format
    library(tidyr)
    long_data <- gather(employeeData, key = "Variable", value = "Value", -EmployeeID)
    
    # Computing on columns
    employeeData$BonusEligible <- employeeData$Salary > 70000
    

Matrices

  • Definition: 2D homogeneous data structure.
  • Creation: Use matrix().
    matrix1 <- matrix(c(1, 0, -20, 0, 1, -15, 1, -1, 0), nrow = 3, ncol = 3, byrow = TRUE)
    
  • Accessing Elements:
    matrix1[2, 3]  # Element at row 2, column 3
    matrix1[1, ]   # Entire first row
    matrix1[, 2]   # Entire second column
    
  • Matrix Operations:
    mat1 + mat2  # Element-wise addition
    mat1 * mat2  # Element-wise multiplication
    

Arrays

  • Definition: nD homogeneous data structure.
  • Creation: Use array().
    array1 <- array(c(1:8), dim = c(2, 2, 2))
    
  • Named Arrays:
    named_array <- array(c(1:8), dim = c(2, 2, 2), dimnames = list(c("r1", "r2"), c("c1", "c2"), c("m1", "m2")))
    

Lists

  • Definition: Heterogeneous data structure that can nest other objects.
  • Creation: Use list().
    list1 <- list(Element1 = demoVec, Element2 = c("A", "B"), Element3 = 3, Element4 = demoDF)
    
  • Accessing Elements:
    list1$Element1  # Returns the first element
    list1[[1]]      # Same as above
    
  • Adding Elements:
    list1$NewElement <- "New Value"  # Adds a new element
    

R Packages

  • Package Basics:

    • Packages expand R's default functionality
    • Thousands available via CRAN (Comprehensive R Archive Network)
    • Browse packages at:
      • By name: cran.r-project.org/web/packages/available_packages_by_name.html
      • By task: cran.r-project.org/web/views/
  • Installation Methods:

    # Method 1: Using install.packages() function
    install.packages(c("foreign", "readr", "haven"))
    
    # Method 2: For packages with potential installation issues
    install.packages(c("dplyr", "car"), 
                    dependencies = TRUE, 
                    type = "binary", 
                    ask = FALSE)
    
    # Method 3: Interactive installation via RStudio
    # Tools -> Install Packages...
    

    Important: Package installation only needs to happen once per R installation; packages typically need to be reinstalled after upgrading to a new R version

  • Loading Packages:

    # Load one package at a time
    library(foreign)
    library(readr)
    library(haven)
    
    # Get package documentation
    help(package = "haven")
    
  • Package Management:

    # Update packages
    update.packages()  # Via function
    # Or: Tools -> Check for Package Updates... (in RStudio)
    
    # Unload a package
    detach("package:readr", unload = TRUE)
    
  • Handling Package Conflicts:

    # When packages have functions with same name:
    # Option 1: Use package-specific reference
    dplyr::recode(data)  # Use dplyr's recode
    car::recode(data)    # Use car's recode
    
    # Option 2: Control through loading order
    library(car)      # Load first package
    library(dplyr)    # Most recently loaded takes precedence
    
  • Best Practices:

    • Avoid reinstalling packages unnecessarily
    • Be aware of package loading order
    • Consider using specific package references (::) for functions with same names
    • Watch for package conflict messages when loading libraries
    • RStudio 1.2+ will offer to install missing packages automatically
    • Restart R before updating packages that are currently loaded

File Paths and Working Directories

  • Setting Working Directory:
    setwd("~/Documents/ResearchProject")
    
  • Getting Working Directory:
    getwd()
  • Relative Paths:
    list.files("Data Files")  # Lists files in the "Data Files" subdirectory
    list.files("Data Files/More Data")
    list.files("..")  # step down a folder
    

Reading Data Files

  • CSV Files:
    CSV_base_example <- read.csv("Data Files/ExampleData.csv")
    
    # using an external library
    library(readr)
    CSV_readr_example <- read_csv("Data Files/ExampleData.csv")
    
  • Excel Files:
    # using an external library
    library(readxl)
    XSLX_readxl_example <- read_excel("Data Files/ExampleData.xlsx", sheet = 1)
    
  • SPSS Files:
    # with `foreign` library
    library(foreign)
    
    SAV_foreign_example <- read.spss("Data Files/ExampleData.sav", to.data.frame = TRUE)
    
    # with `haven` library
    SAV_haven_example <- read_sav("Data Files/ExampleData.sav")
    

Writing Data Files

  • CSV Files:

	write.csv(CSV_readr_example, file = "Data Files/More Data/WriteExampleData.csv", row.names = FALSE)
	
	# TODO: note that data.table::fwrite() is a faster alternative
	data.table::fwrite(CSV_readr_example, file = "AltExampleData.csv")
  • Excel Files:
    # using the `writexl` library
    library(writexl)
    write_xlsx(CSV_readr_example, path = "Data Files/More Data/WriteExampleData.xlsx")
    
  • SPSS Files:
    write_sav(CSV_readr_example, path = "Data Files/More Data/WriteExampleData.sav")
    


...
ISDS Preparation

Anonymous • Mar 31, 2025

I have recently started revising the materials from the Introduction to Statistics and Data Science module at WIUT.

Here is what I love about the module:

  • It covers a great range of topics that are widely used in every domain

Here is the list of R-related lecture notes

As there will be some questions on the R programming language, I decided to share my lecture notes with you:

...
Introduction to R Programming - Week 1 Lecture Notes

admin3 • Mar 23, 2025

Table of Contents

  1. Software Installation
  2. RStudio Interface
  3. Arithmetic Operations
  4. Mathematical Functions
  5. Relational Operators
  6. Data Classes
  7. Missing Data
  8. Type Conversion and Comparison
  9. R Objects and Assignment

Software Installation

Required Software

  • R: The core programming language
    • Download from CRAN (Comprehensive R Archive Network)
    • Platform-specific versions available for Mac, Windows, Linux
  • RStudio: Integrated Development Environment (IDE)
    • Download from Posit website
    • Provides unified interface across operating systems

Important URLs

  • R (CRAN): https://cran.r-project.org
  • RStudio (Posit): https://posit.co/download/rstudio-desktop/

RStudio Interface

Key Components

  1. Source Pane (Top Left)

    • Where R scripts are written and edited
    • Save files with .R extension
    # Example script content
    # Calculate average temperature
    temp_celsius <- 25
    temp_fahrenheit <- (temp_celsius * 9/5) + 32
    
  2. Console (Bottom Left)

    • Displays executed commands and output
    • Direct command entry possible
    > 2 + 2
    [1] 4
    
  3. Environment Pane (Top Right)

    • Shows active variables and objects
    # After running:
    temp_celsius    # Value: 25
    temp_fahrenheit # Value: 77
    

Arithmetic Operations

Basic Operators with Examples

# Addition
5 + 3        # Output: 8
# Subtraction
10 - 4       # Output: 6
# Multiplication
6 * 7        # Output: 42
# Division
15 / 3       # Output: 5
# Exponents
2 ^ 3        # Output: 8
# Modulo (remainder)
17 %% 5      # Output: 2
# Integer division
17 %/% 5     # Output: 3

Order of Operations Examples

# Different results based on parentheses
4 + 2 * 3        # Output: 10 (multiplication first)
(4 + 2) * 3      # Output: 18 (addition first)

# Complex calculation
((10 + 5) * 2) / 5   # Output: 6

Mathematical Functions

Common Functions with Examples

# Square root
sqrt(16)              # Output: 4
sqrt(c(9, 16, 25))   # Output: 3 4 5

# Absolute value
abs(-7.5)            # Output: 7.5
abs(c(-2, 0, 2))     # Output: 2 0 2

# Logarithms
log10(100)           # Output: 2
log(exp(1))          # Output: 1

# Exponential
exp(2)               # Output: 7.389056

Function Help Example

# Getting help for sqrt function
?sqrt
# Returns documentation showing:
# sqrt(x)   # where x is a numeric vector

Relational Operators

Examples with Different Data Types

# Numeric comparisons
5 < 10               # Output: TRUE
7 >= 7               # Output: TRUE
3 == 3               # Output: TRUE
4 != 5               # Output: TRUE

# String comparisons
"apple" == "apple"   # Output: TRUE
"a" < "b"            # Output: TRUE

# Mixed type comparisons
5 == "5"             # Output: TRUE (the number is coerced to character before comparing)

Data Classes

Type Examples and Conversions

# Numeric
x <- 10.5
typeof(x)            # Output: "double"

# Integer
y <- 10L
typeof(y)            # Output: "integer"

# Character
name <- "John"
typeof(name)         # Output: "character"

# Logical
is_valid <- TRUE
typeof(is_valid)     # Output: "logical"

# Type conversion examples
as.integer(10.7)     # Output: 10
as.character(123)    # Output: "123"
as.numeric("456")    # Output: 456
as.Date(43800, origin = "1899-12-30") # "2019-12-01"

Testing Types

# Using is.* functions
x <- 10.5
is.numeric(x)        # Output: TRUE
is.integer(x)        # Output: FALSE
is.character(x)      # Output: FALSE

# Multiple checks
y <- "123"
is.numeric(y)        # Output: FALSE
is.numeric(as.numeric(y))  # Output: TRUE

Missing Data

Working with NA and NaN

# Creating missing values
x <- c(1, NA, 3, NaN, 5)

# Testing for NA
is.na(x)             # Output: FALSE TRUE FALSE TRUE FALSE

# Calculations with NA
sum(c(1, NA, 3))     # Output: NA
sum(c(1, NA, 3), na.rm = TRUE)  # Output: 4

# NA vs NaN
0/0                  # Output: NaN
NA + 1               # Output: NA

Type Conversion and Comparison

Understanding Type Conversion

# Different numeric types
x_int <- 5L          # integer
x_num <- 5           # numeric/double
x_int == x_num       # Output: TRUE (values are equal)
typeof(x_int) == typeof(x_num)  # Output: FALSE (types are different)

# Detailed example
varA <- 3.3          # double/numeric
varB <- "hello there"  # character
varC <- FALSE        # logical
varD <- 5L           # integer
varE <- 5            # double
varF <- varD + varE  # double (integer + numeric = numeric)
varG <- 2 * varC     # numeric (numeric * logical = numeric)

# Checking types
typeof(varA)   # "double"
typeof(varB)   # "character"
typeof(varC)   # "logical"
typeof(varD)   # "integer"
typeof(varE)   # "double"
typeof(varF)   # "double"
typeof(varG)   # "double"

Key Points About Type Conversion

  1. Implicit Conversion

    • R automatically converts between integer and numeric types in calculations
    • Logical values convert to 1 (TRUE) or 0 (FALSE) in numeric operations
    • The "wider" type usually prevails (e.g., integer + numeric = numeric)
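
    A few quick checks (a minimal sketch):

    TRUE + TRUE          # 2: logicals become 1/0 in arithmetic
    typeof(1L + 2L)      # "integer"
    typeof(1L + 2)       # "double" (the wider type wins)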
  2. Value vs Type Comparison

    5L == 5     # TRUE (comparing values)
    typeof(5L) == typeof(5)  # FALSE (comparing types: "integer" vs "double")
    
  3. Type Hierarchy

    • character > numeric > integer > logical
    • When mixing types, R usually converts to the higher type
    1L + 2.5    # Result is numeric (2.5 wins)
    TRUE + 1L   # Result is integer (1L wins)
    TRUE + 1.0  # Result is numeric (1.0 wins)
    

R Objects and Assignment

Variable Assignment Examples

# Basic assignment
age <- 25
name <- "Alice"

# Multiple assignments
height <- weight <- 70

# Complex assignments
bmi <- weight / (height/100)^2

# Listing objects
ls()                 # Shows all objects in environment

# Removing objects
rm(age)              # Removes single object
rm(list = ls())      # Removes all objects

Naming Conventions Examples

# Valid names
valid_name <- 1
validName <- 2
VALID_NAME <- 3
.hidden_name <- 4

# Invalid names (will cause errors)
# 1name <- 5      # Can't start with number
# _name <- 6      # Can't start with underscore
# name-1 <- 7     # Can't use hyphen

Practice Exercises:

  1. Create variables of different types and test their classes
  2. Perform arithmetic operations with variables
  3. Try working with missing values and understand their behavior
  4. Practice naming conventions and object assignments

References

R Data Structures and packages -> the continuation of the R chronicles (Week 2)

A 2-hour-long video tutorial:

R Programming Tutorial

...