Stories
Discover boundless stories from unique narrators (storytellers 🙃)
Silence before the storm
17501 • Apr 7, 2025
So, as I mentioned, I will have 2 final exams on the 15th and 23rd of April.
And unluckily, I got sick (again). I am hoping to recover ASAP, as I need to lock in and prepare for those exams!
Wish me good luck, people 🙏🏻
R functions, loops, conditional statements - Week 4 Lecture Notes
17501 • Apr 5, 2025
Functions in R Programming
Basic Function Structure
function_name <- function(parameter1, parameter2) {
# Function body
result <- parameter1 + parameter2 # placeholder for the actual operations
return(result)
}
Example 1: Simple Function
# Calculate area of rectangle
calculate_area <- function(length, width) {
area <- length * width
return(area)
}
# Usage
calculate_area(5, 3) # Returns 15
Example 2: Function with Default Parameters
greet_user <- function(name = "User") {
greeting <- paste("Hello,", name, "!")
return(greeting)
}
greet_user() # Returns "Hello, User !"
greet_user("Maria") # Returns "Hello, Maria !"
Conditional Statements (if, else if, else)
Basic Structure
if (condition) {
# code if condition is TRUE
} else if (another_condition) {
# code if another_condition is TRUE
} else {
# code if all conditions are FALSE
}
Example 1: Simple Grade Calculator
get_grade <- function(score) {
if (score >= 90) {
return("A")
} else if (score >= 80) {
return("B")
} else if (score >= 70) {
return("C")
} else {
return("F")
}
}
get_grade(85) # Returns "B"
Example 2: Number Check
check_number <- function(x) {
if (x > 0) {
return("Positive")
} else if (x < 0) {
return("Negative")
} else {
return("Zero")
}
}
Conditions and Logical Operators
Common Logical Operators
- `==`: Equal to
- `!=`: Not equal to
- `>`: Greater than
- `<`: Less than
- `>=`: Greater than or equal to
- `<=`: Less than or equal to
- `&`: AND
- `|`: OR
- `!`: NOT
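One detail that may be worth illustrating: the operators listed above work element-wise on whole vectors. R also has the scalar forms `&&` and `||` (not part of the list above), which are meant for single TRUE/FALSE values such as an `if` condition. A minimal sketch with made-up values:

```r
# Hypothetical values, for illustration only
x <- c(5, 15, 25)

# Element-wise operators return one result per element
in_range <- x > 10 & x < 20   # FALSE TRUE FALSE
not_big  <- !(x > 10)         # TRUE FALSE FALSE

# && and || expect single values, which suits if () conditions
age <- 19
income <- 35000
eligible <- age >= 18 && income >= 30000   # TRUE
```

For the single-value checks in the examples that follow, `&` and `|` give the same answer as `&&` and `||`.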
Example: Complex Conditions
check_eligibility <- function(age, income) {
if (age >= 18 & income >= 30000) {
return("Eligible")
} else if (age >= 21 | income >= 50000) {
return("Conditionally Eligible")
} else {
return("Not Eligible")
}
}
check_eligibility(19, 35000) # Returns "Eligible"
Example: Multiple Conditions
categorize_day <- function(day, temperature) {
if (day %in% c("Saturday", "Sunday") & temperature > 20) {
return("Perfect weekend!")
} else if (day %in% c("Saturday", "Sunday")) {
return("Cold weekend")
} else {
return("Weekday")
}
}
categorize_day("Saturday", 25) # Returns "Perfect weekend!"
Best Practices
- Always use clear and descriptive function names
- Include documentation/comments for complex functions
- Handle edge cases and invalid inputs
- Keep functions focused on a single task
- Use consistent indentation for readability
Example: Good Practice Implementation
calculate_bmi <- function(weight, height) {
# Input validation
if (!is.numeric(weight) | !is.numeric(height)) {
return("Error: Inputs must be numeric")
}
if (weight <= 0 | height <= 0) {
return("Error: Values must be positive")
}
# Calculate BMI
bmi <- weight / (height^2)
# Categorize BMI
if (bmi < 18.5) {
return("Underweight")
} else if (bmi < 25) {
return("Normal")
} else if (bmi < 30) {
return("Overweight")
} else {
return("Obese")
}
}
# Usage
calculate_bmi(70, 1.75) # Returns "Normal" (BMI is about 22.9)
...
Data manipulation in R - Week 3 Lecture Notes
17501 • Apr 5, 2025
Package Management in R
1. Understanding R Package Ecosystem
The R package system is hierarchical:
# Base R: Comes with basic installation
mean(1:10) # Base function
# Recommended packages: Nearly standard but need loading
library(MASS) # A recommended package
# Third-party packages: Need installation and loading
install.packages("tidyverse") # Popular meta-package
2. Package Installation Strategies
Basic Installation:
# Single package
install.packages("dplyr")
# Multiple packages
install.packages(c("dplyr", "ggplot2", "tidyr"))
# With specific parameters for troubleshooting
install.packages("dplyr",
dependencies = TRUE, # Install all required packages
type = "binary", # Avoid source compilation
repos = "https://cran.rstudio.com/") # Specify repository
3. Package Loading and Namespace Management
Basic Loading:
# Standard loading
library(dplyr)
# Alternative loading with error handling
if (!require(dplyr)) {
install.packages("dplyr")
library(dplyr)
}
Namespace Conflicts and Resolution:
# Example of conflict
library(dplyr)
library(MASS) # Also exports select(), which now masks dplyr::select()
# Three ways to handle conflicts:
# 1. Explicit namespace
dplyr::select(mtcars, mpg, cyl)
# 2. Rebind the name to the version you want
select <- dplyr::select
select(mtcars, mpg, cyl)
# 3. Import specific functions (requires the import package)
import::from(dplyr, select, filter)
4. Package Version Management
Checking and Updating:
# Check installed packages
installed.packages()
# Check specific package version
packageVersion("dplyr")
# Update packages
update.packages()
# Install specific version (requires devtools)
devtools::install_version("dplyr", version = "1.0.0")
5. Important Caveats and Best Practices
Package Loading Order:
# BAD: Potential conflicts unclear
library(dplyr)
library(plyr)
library(tidyr)
# GOOD: Organized loading with comments
# Core data manipulation
library(dplyr) # Main data manipulation
library(tidyr) # Data reshaping
# Visualization
library(ggplot2) # Plotting
Function Conflicts Resolution:
# Check for conflicts
conflicts()
# Create alias for frequently used conflicting functions
filter_df <- dplyr::filter
select_df <- dplyr::select
# Use conflicted package for explicit conflict resolution
library(conflicted)
conflict_prefer("filter", "dplyr")
6. Project-specific Package Management
Using `renv` for Project Isolation:
# Initialize project-specific package management
renv::init()
# Install project packages
renv::install("dplyr")
# Snapshot current project state
renv::snapshot()
# Restore project packages
renv::restore()
7. Common Pitfalls and Solutions
Package Loading Errors:
# Problem: Package not found
library(nonexistentpackage) # Error
# Solution: Check and install
if (!require("package")) {
install.packages("package")
library(package)
}
# Problem: Version conflicts
# Solution: Use packageVersion() to check versions
if (packageVersion("dplyr") < "1.0.0") {
install.packages("dplyr")
}
File Path Management
Code Examples:
# Mac/Linux path
"~/Documents/data.csv"
# Windows paths (both valid)
"C:\\Data\\data.csv" # Double backslash
"C:/Data/data.csv" # Forward slash
Important Caveats:
- Path specifications differ between Windows and Mac/Linux
- Working directory management is crucial:
# Check current working directory
getwd()
# Set working directory
setwd("~/Documents/Project")
# List files in current directory
list.files()
Reading Data Files
Code Examples:
# CSV files
# Base R approach
data_base <- read.csv("file.csv", header = TRUE)
# readr approach (faster)
library(readr)
data_readr <- read_csv("file.csv")
# Excel files
library(readxl)
data_excel <- read_excel("file.xlsx", sheet = 1)
# SPSS files
library(haven)
data_spss <- read_sav("file.sav")
Important Caveats:
- Always verify data structure after reading:
str(data)
head(data)
- Excel date handling requires special attention:
# Converting Excel dates
as.Date(43800, origin = "1899-12-30")
Data Manipulation and Subsetting
Basic subsetting in R:
# Create a simple dataset for demonstration
employee_data <- data.frame(
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
age = c(25, NA, 45, 32, NA),
salary = c(50000, 60000, NA, 75000, 80000)
)
# Basic subsetting by condition
young_employees <- employee_data[employee_data$age < 40, ]
When we subset like this, R evaluates each row against the condition and returns a logical vector (TRUE/FALSE). However, this introduces our first important consideration: how R handles missing values (NA) in comparisons.
Let's see what happens with NA values in comparisons:
# Demonstrate NA behavior
ages <- c(25, NA, 45, 32, NA)
ages < 40
# Returns: TRUE, NA, FALSE, TRUE, NA
# This means when we subset:
employee_data[employee_data$age < 40, ]
# We get rows where age < 40 is TRUE *and* rows where the comparison returns NA
What is the `which()` function
This is where the `which()` function becomes valuable. Let's understand how it works: it returns ==only the indices where the condition is TRUE==.
# Using which() function
which(ages < 40)
# Returns: 1, 4 (only the indices where the condition is TRUE)
# Therefore:
employee_data[which(employee_data$age < 40), ]
# Only returns rows where age < 40 is definitively TRUE
The `which()` function serves several important purposes:
1. NA Handling:
# Without which()
salary_filter <- employee_data[employee_data$salary > 60000, ]
# With which()
salary_filter_clean <- employee_data[which(employee_data$salary > 60000), ]
# The second approach excludes NA values automatically
2. Multiple Conditions:
# Complex conditions become clearer with which()
high_paid_young <- employee_data[which(
employee_data$salary > 60000 &
employee_data$age < 40
), ]
# This clearly shows which rows meet both conditions
3. Finding Specific Positions:
# Find indices of specific values
which(employee_data$name == "Alice") # Returns row number for Alice
# Can be used for multiple matches
# CRUCIAL: Pay attention to a complex and pretty interesting use case!
which(employee_data$salary > mean(employee_data$salary, na.rm = TRUE))
Advanced subsetting techniques:
# Using %in% operator for multiple values
selected_employees <- employee_data[which(
employee_data$name %in% c("Alice", "Bob", "Eve")
), ]
# Combining conditions with NA handling
qualified_employees <- employee_data[which(
employee_data$age >= 30 &
employee_data$salary >= 70000 &
!is.na(employee_data$age) & # Explicitly exclude NAs
!is.na(employee_data$salary)
), ]
Important considerations when using `which()`:
2. Performance Impact:
# For very large datasets, which() might have performance implications
# In such cases, you might want to use data.table or dplyr alternatives:
library(dplyr)
qualified_employees <- employee_data %>%
filter(!is.na(age), !is.na(salary), age >= 30, salary >= 70000)
3. Maintaining Data Integrity:
# which() helps prevent unexpected results
# Bad approach:
mean(employee_data$salary[employee_data$salary > 60000]) # Includes NAs
# Better approach:
mean(employee_data$salary[which(employee_data$salary > 60000)]) # Excludes NAs
4. Logical Vector Operations:
# Understanding the difference
logical_vector <- employee_data$age < 40 # Contains TRUE, FALSE, NA
which_vector <- which(employee_data$age < 40) # Contains only matching indices
# This can be important for calculations
sum(logical_vector) # Might give NA
length(which_vector) # Gives actual count of TRUE values
Best Practices:
5. Always consider NA values in your data:
# Check for NAs before subsetting
sum(is.na(employee_data$age))
sum(is.na(employee_data$salary))
# Document your NA handling strategy
6. Use explicit NA handling when needed:
# Combining which() with explicit NA handling
clean_subset <- employee_data[which(
employee_data$age < 40 &
!is.na(employee_data$age)
), ]
7. Consider using modern alternatives:
# CRUCIAL: For real, take a close look at this library!
# dplyr approach for complex subsetting
library(dplyr)
clean_subset <- employee_data %>%
filter(!is.na(age), age < 40)
Data Manipulation with the `dplyr` package
Understanding dplyr's Core Philosophy
The dplyr package is built around a set of verb functions that each perform a specific data manipulation task. Let's start with a practical example:
library(dplyr)
# Create a sample dataset for demonstration
sales_data <- data.frame(
date = as.Date('2024-01-01') + 0:29,
region = rep(c("North", "South", "East", "West"), each = 8)[1:30],
sales = round(runif(30, 1000, 5000)),
profit = round(runif(30, 100, 1000))
)
# Basic dplyr operations pipeline
sales_analysis <- sales_data %>%
group_by(region) %>%
summarise(
total_sales = sum(sales),
avg_profit = mean(profit),
transactions = n() # the number of rows after grouping
) %>%
arrange(desc(total_sales))
[!info] ATTENTION: Here `n()` is ==a counting function== in `dplyr` that returns ==the number of rows in the current group==. PS: much like `COUNT(*)` in SQL.
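To make the note on `n()` concrete, here is a small sketch on a made-up three-row table (the `toy` data frame and its values are assumptions, not part of the lecture data). It also shows `count()`, a dplyr shortcut that should give the same counts:

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  region = c("North", "North", "South"),
  sales  = c(10, 20, 30)
)

# n() counts the rows of each group inside summarise()
by_n <- toy %>%
  group_by(region) %>%
  summarise(transactions = n())

# count() groups and counts in one step
by_count <- toy %>%
  count(region, name = "transactions")
```

Both results have one row per region: 2 transactions for North and 1 for South.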
Key Concepts and Crucial Moments to Watch:
1. The Pipeline Operator (`%>%`):
# Without pipeline (harder to read)
arrange(
summarise(
group_by(sales_data, region),
total_sales = sum(sales)
),
desc(total_sales)
)
# With pipeline (more readable)
sales_data %>%
group_by(region) %>%
summarise(total_sales = sum(sales)) %>%
arrange(desc(total_sales))
[!important] IMPORTANT: The pipeline operator ==passes the result as the first argument to the next function==. Be aware that some functions might need explicit argument naming if not using the first argument.
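A short sketch of that caveat, using the built-in `mtcars` data. `head()` takes the data as its first argument, so the pipe just works. `lm()` takes the data as its second argument, so the sketch uses the `.` placeholder supported by `%>%` to say where the piped value should go:

```r
library(dplyr)

# The piped value becomes the first argument: same as head(mtcars, 3)
first_rows <- mtcars %>% head(3)

# lm() expects a formula first and the data second,
# so the placeholder `.` marks where the piped data frame goes
fit <- mtcars %>% lm(mpg ~ wt, data = .)
```

Without `data = .`, the data frame would be passed as the formula argument and the call would fail.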
2. Grouping Operations:
# Simple grouping
sales_data %>%
group_by(region) %>%
summarise(mean_sales = mean(sales))
# Multiple grouping variables
sales_data %>%
group_by(region, month = format(date, "%m")) %>%
summarise(mean_sales = mean(sales))
# CRUCIAL: Remember to ungroup when needed
# This matters because, after ungroup(), the next mutate() is computed
# over the whole dataset, NOT within the earlier groups
sales_data %>%
group_by(region) %>%
mutate(region_avg = mean(sales)) %>%
ungroup() %>% # Don't forget this!
mutate(overall_avg = mean(sales))
[!info] Comparison between `mutate` and `summarize`
While `summarize` reduces multiple rows into single summary ==rows by the grouped values==, `mutate` creates new columns while ==preserving the original number of rows==.
PS: `mutate` just attaches the value (be it an aggregate or any other computed one) as a new column of the row, without reducing it. `summarize`, however, collapses data (by the grouped columns) to create summary statistics.
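The difference is easiest to see by comparing row counts. A minimal sketch on a made-up three-row table (the `toy` data frame is an assumption for illustration):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  region = c("North", "North", "South"),
  sales  = c(10, 20, 30)
)

# mutate(): the group total is attached to every original row
with_mutate <- toy %>%
  group_by(region) %>%
  mutate(region_total = sum(sales)) %>%
  ungroup()

# summarise(): each group collapses to a single row
with_summarise <- toy %>%
  group_by(region) %>%
  summarise(region_total = sum(sales))

nrow(with_mutate)     # 3 rows, same as the input
nrow(with_summarise)  # 2 rows, one per region
```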
3. Summarizing with NA Values:
# Add some NA values for demonstration
sales_data$sales[c(5, 15)] <- NA
# Bad: NAs will propagate
sales_data %>%
group_by(region) %>%
summarise(mean_sales = mean(sales))
# Good: Handle NAs explicitly
sales_data %>%
group_by(region) %>%
summarise(mean_sales = mean(sales, na.rm = TRUE))
4. Multiple Operations and Order Sensitivity:
# Order matters! Be careful with these operations
sales_data %>%
group_by(region) %>%
filter(sales > mean(sales)) %>% # This uses group-wise mean
summarise(high_sales_count = n())
# Different result if we change order
# CRUCIAL: A very interesting use case
sales_data %>%
filter(sales > mean(sales)) %>% # This uses overall mean
group_by(region) %>%
summarise(high_sales_count = n())
5. Common Pitfalls and Solutions:
Handling Grouped Operations:
# Problem: despite its name, pct_of_total is computed within each region,
# because sum(sales) is evaluated per group
sales_data %>%
  group_by(region) %>%
  mutate(pct_of_total = sales / sum(sales, na.rm = TRUE)) %>%
  ungroup() # Always ungroup after grouped operations
# Solution: Be explicit about grouping scope
# (na.rm = TRUE because sales now contains the NAs added earlier)
sales_data %>%
  mutate(total_sales = sum(sales, na.rm = TRUE)) %>%
  group_by(region) %>%
  mutate(pct_of_region = sales / sum(sales, na.rm = TRUE),
         pct_of_total = sales / first(total_sales)) %>%
  ungroup()
starwars %>%
  group_by(homeworld) %>% # grouping by the homeworld field
  filter(homeworld %in% c('Tatooine', 'Naboo') | eye_color == 'blue') %>% # keep rows matching either condition
  summarise(population = n()) %>% # counting the characters in each group
  arrange(desc(population)) %>% # ordering by population, descending
  slice(1:3) %>% # getting the top 3 homeworlds by population
  ungroup() # ungrouping for safety
Joining Tables:
# Create a reference table
region_info <- data.frame(
region = c("North", "South", "East", "West"),
manager = c("Alice", "Bob", "Charlie", "David")
)
# Safe joining with explicit join type
sales_analysis <- sales_data %>%
left_join(region_info, by = "region") # Be explicit about join columns
# Check for unmatched rows
anti_join(sales_data, region_info, by = "region")
6. Advanced Features and Best Practices:
Using across() for Multiple Columns:
# Modern approach for operating on multiple columns
sales_data %>%
group_by(region) %>%
summarise(across(
c(sales, profit),
list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE)
),
.names = "{.col}_{.fn}"
))
Syntax explained
mean = ~mean(., na.rm = TRUE)
Breaking it down:
- `~` is a formula operator in R
- `.` represents the ==current column== being processed
- `mean =` names the output
- The whole structure is a lambda/anonymous function
Same formula written in traditional R:
# Traditional function
function(x) mean(x, na.rm = TRUE)
# dplyr shorthand
~mean(., na.rm = TRUE)
Example with multiple calculations:
# Verbose way
sales_data %>%
group_by(region) %>%
summarise(
sales_mean = mean(sales, na.rm = TRUE),
sales_sd = sd(sales, na.rm = TRUE),
profit_mean = mean(profit, na.rm = TRUE),
profit_sd = sd(profit, na.rm = TRUE)
)
# Compact way using across()
sales_data %>%
group_by(region) %>%
summarise(across(
c(sales, profit),
list(
mean = ~mean(., na.rm = TRUE),
sd = ~sd(., na.rm = TRUE)
)
))
Key Takeaways:
Always be mindful of grouping:
- Use group_by() intentionally
- Remember to ungroup() when finished
Handle missing values explicitly:
- Use `na.rm = TRUE` when appropriate
- Consider filtering `NA`s beforehand if they're problematic
Pay attention to operation order:
- Operations are sequential
- Grouping affects subsequent calculations
- Filtering before or after grouping can give different results
Document your pipeline:
- Add comments explaining complex transformations
- Break long pipelines into meaningful chunks
- Consider intermediate assignments for clarity
Data Restructuring
Code Examples:
library(tidyr)
# Wide to Long format
long_data <- wide_data %>%
pivot_longer(
cols = starts_with("SurveyItem"),
names_to = "Question",
values_to = "Response"
)
# Long to Wide format
wide_data <- long_data %>%
pivot_wider(
names_from = Question,
values_from = Response
)
Key Areas to Watch For:
- Logical Operations:
- Parentheses matter in complex conditions:
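# & binds tighter than |, so these give different results:
data[(x < 5 | x > 10) & y == "A", ] # Correct: (x < 5 OR x > 10) AND y == "A"
data[x < 5 | x > 10 & y == "A", ] # Incorrect if the above was intended: parsed as x < 5 OR (x > 10 AND y == "A")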
- Data Type Verification:
# Always check data types after import
str(data)
class(data$column)
# Convert if necessary
data$column <- as.factor(data$column)
- Missing Values:
# Check for missing values
sum(is.na(data))
# Handle missing values explicitly
data %>%
  filter(!is.na(value)) %>%
  summarise(mean = mean(value))
- Merging Data:
# Ensure key columns are properly identified
merged_data <- merge(
data1, data2,
by.x = "ID1", by.y = "ID2",
all = TRUE # Keep all rows
)
# Always verify merge results
dim(data1) # Original dimensions
dim(data2) # Original dimensions
dim(merged_data) # Should make sense given the merge type
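The point about parentheses in logical conditions can be checked on a tiny made-up table (the `toy` data frame is an assumption for illustration). Because `&` is evaluated before `|`, dropping the parentheses lets in a row whose `y` is not "A":

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  x = c(2, 12, 7, 3),
  y = c("A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# (x < 5 OR x > 10) AND y == "A": keeps rows 1 and 2
with_parens <- toy[(toy$x < 5 | toy$x > 10) & toy$y == "A", ]

# x < 5 OR (x > 10 AND y == "A"): also keeps row 4 (x = 3, y = "B")
without_parens <- toy[toy$x < 5 | toy$x > 10 & toy$y == "A", ]

nrow(with_parens)     # 2
nrow(without_parens)  # 3
```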
References
- Introduction to the `dplyr` package
- Index page of the `dplyr` package, which also includes the first reference
- The `tidyr` package, great for data restructuring
R Data Structures and packages - Week 2 Lecture Notes
17501 • Apr 5, 2025
Introduction
R provides several data structures to store data in different formats. These include:
- Vectors
- Factors
- Data Frames
- Matrices
- Lists
- Arrays (not covered in this module)
Homogeneous vs. Heterogeneous Data Structures
| Dimension | Homogeneous | Heterogeneous |
|---|---|---|
| 1D | Atomic Vector | List |
| 2D | Matrix | Data Frame |
| nD | Array | |
- Homogeneous: All elements must be of the same type.
- Heterogeneous: Elements can be of different types.
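A quick way to see the distinction in the table is to ask R for the types involved. The sketch below uses made-up values and hypothetical object names:

```r
# An atomic vector holds one type; a list can hold several
numbers <- c(1, 2, 3)
mixed   <- list(1, "two", TRUE)
vector_type <- class(numbers)        # "numeric"
list_types  <- sapply(mixed, class)  # "numeric" "character" "logical"

# A matrix holds one type; a data frame allows a different type per column
grid   <- matrix(1:4, nrow = 2)
people <- data.frame(id = 1:2, label = c("a", "b"), stringsAsFactors = FALSE)
grid_type    <- typeof(grid)          # "integer"
column_types <- sapply(people, class) # id: "integer", label: "character"
```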
Vectors
Definition: A basic data structure that stores multiple values of the same type.
Creation: Use the `c()` function.
vector1 <- c(3, 6, 9)
Length: Use `length()` to find the number of elements.
length(vector1) # Returns 3
Accessing Elements: Use bracket notation `[]`. Indexing starts at 1.
vector1[1] # First element (3)
vector1[2] # Second element (6)
vector1[c(1,2)] # First and second elements (3, 6)
vector1[-1] # All elements except the first
vector1[-c(1,2)] # All elements except first and second
Edge Cases:
- `vector1[0]` returns an empty vector
- `vector1[4]` returns `NA` for a vector of length 3
- Negative indices remove elements at those positions
- Cannot mix positive and negative indices in the same selection
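The edge cases above can be tried directly. In this sketch, `tryCatch()` is used only so that the failing call does not stop the script, and the `"error"` marker is a made-up value:

```r
vector1 <- c(3, 6, 9)

empty  <- vector1[0]    # numeric(0): an empty vector
beyond <- vector1[4]    # NA: position 4 does not exist
kept   <- vector1[-2]   # 3 9: everything except the second element

# Mixing positive and negative indices raises an error
mixing <- tryCatch(vector1[c(-1, 2)], error = function(e) "error")
```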
Negative Indexing: Omits elements.
vector1[-1] # Returns elements except the first
Modifying Elements:
vector1[2] <- 100 # Changes the second element to 100
Adding Elements:
vector1[4] <- 200 # Adds 200 at position 4
Vector Operations:
- Arithmetic operations are element-wise.
vector1 + 1 # Adds 1 to each element
vector1 * 2 # Multiplies each element by 2
Mixed Data Types: R implicitly coerces mixed types to a single common type; if any element is a character string, everything becomes character.
vector10 <- c(10, "20", 30) # Converts to character
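A small sketch of this coercion, checked with `class()`. The logical-to-integer and integer-to-numeric steps are standard R behaviour but go slightly beyond the note above, and the variable names are hypothetical:

```r
# Coercion moves toward the more general type
logical_and_integer <- class(c(TRUE, 2L))        # "integer"
integer_and_double  <- class(c(2L, 3.5))         # "numeric"
number_and_string   <- class(c(10, "20", 30))    # "character"

# Converting back is explicit
vector10 <- c(10, "20", 30)
as_numbers <- as.numeric(vector10)               # 10 20 30
```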
Using seq() Function: For generating more complex sequences:
# Basic usage with named arguments
seq(from = 0, to = 10, by = 2) # Returns: 0 2 4 6 8 10
# Same call with positional arguments
seq(0, 10, by = 2) # Returns: 0 2 4 6 8 10
# Decimal steps are allowed
seq(0, 5, by = 0.5) # Returns: 0.0 0.5 1.0 1.5 ... 4.0 4.5 5.0
# Watch out for unexpected results with step size
seq(1, 10, by = 5) # Returns: 1 6 only
seq(1, 10, by = 3) # Returns: 1 4 7 10
Important Considerations:
- The sequence always starts at `from` and proceeds by steps of size `by`
- It will not exceed the `to` value, which may result in fewer elements than expected
- Using named arguments (`from`, `to`, `by`) makes code more readable and less error-prone
- The `by` argument determines how many elements you get, so choose it carefully
Factors
- Definition: Special vectors for storing ==categorical data==.
- Creation: Use `factor()`.
[!info] Categorical Variables in R: Factors are R's way of storing categorical data, like ==enums== in other languages. They store values ==as integers internally== but display them as predefined categories.
Creating Factors:
# Basic factor creation
phoneType <- factor(c("iPhone", "Android", "Android", "iPhone", "Other"))
# Creating with predefined levels (including levels that might appear later)
phoneType <- factor(c("iPhone", "Android"),
                    levels = c("iPhone", "Android", "Other", "Windows"))
Understanding Levels:
# Levels are the unique categories allowed in the factor
levels(phoneType) # Shows all possible categories
# Factors are stored as integers internally
as.numeric(phoneType) # Shows the internal integer representation
# For the factor with predefined levels above, this shows: 1 2 (1 = iPhone, 2 = Android)
Working with Levels:
# Check current levels
str(phoneType) # Shows factor structure and levels
# Modify level names
levels(phoneType)[levels(phoneType) == "Other"] <- "Unknown"
# Reorder levels (useful for plotting and modeling)
phoneType <- relevel(phoneType, ref = "iPhone") # Make iPhone the reference level
# Complete reordering ("Other" was renamed to "Unknown" above; values whose level is not listed become NA)
phoneType <- factor(phoneType, levels = c("iPhone", "Android", "Unknown"))
Handling Missing Values:
# NA values are allowed and handled specially
phoneType <- factor(c("iPhone", "Android", NA, "iPhone"))
is.na(phoneType) # Identifies NA values
Common Operations:
# Count occurrences of each level
table(phoneType) # Basic frequency table
summary(phoneType) # Similar to table() but includes NA count
# Convert to/from factors
as.character(phoneType) # Convert factor to character vector
as.factor(c("A", "B")) # Convert character vector to factor
Important Considerations:
- Factors are memory-efficient for repeated categorical values
- They maintain order of categories (unlike character vectors)
- They prevent data entry errors by allowing only predefined values
- Useful for statistical modeling where categorical variables need special handling
# Assigning a value that is not among the levels produces NA with a warning (not an error)
phoneType[1] <- "Windows Phone"
# To allow a new value, add the level first, then assign
phoneType <- factor(phoneType, levels = c(levels(phoneType), "Windows Phone"))
phoneType[1] <- "Windows Phone"
Practical Example:
# Real-world usage in data analysis
satisfaction <- factor(c("High", "Medium", "Low", "High"),
                       levels = c("Low", "Medium", "High"),
                       ordered = TRUE) # Creates an ordered factor
# Useful for plotting
barplot(table(satisfaction)) # Creates bar plot with categories
Data Frames
[!info] Definition and Core Concepts: Data frames are 2-dimensional structures ==similar to database tables or Excel sheets==
- ==Each column== can have a different data type (unlike matrices)
- ==Column names== must be unique
- ==All columns== must have the same number of rows
Creating Data Frames:
# Basic creation
employeeData <- data.frame(
  EmployeeID = 101:105,
  FirstName = c("Kim", "Ken", "Bob", "Bill", "Cindy"),
  Age = c(24, 23, 54, NA, 64),
  PayType = factor(c("Hourly", "Salaried", "Hourly", "Hourly", "Salaried")),
  stringsAsFactors = FALSE # Prevents automatic conversion of strings to factors
)
# From existing vectors
ids <- 1:3
names <- c("Alice", "Bob", "Charlie")
scores <- c(85, 92, 78)
df <- data.frame(ID = ids, Name = names, Score = scores)
# From a matrix
mat <- matrix(1:9, nrow = 3)
df_from_matrix <- as.data.frame(mat)
Examining Data Frame Structure:
# Basic information
str(employeeData) # Shows structure
dim(employeeData) # Returns dimensions
nrow(employeeData) # Number of rows
ncol(employeeData) # Number of columns
names(employeeData) # Column names
head(employeeData, n=2) # First 2 rows
tail(employeeData, n=2) # Last 2 rows
Accessing and Subsetting:
# Column access
employeeData$FirstName # Using $
employeeData[["FirstName"]] # Using [[]]
employeeData[, "FirstName"] # Using [,]
employeeData[, c("FirstName", "Age")] # Multiple columns
# Row access
employeeData[1, ] # First row
employeeData[1:3, ] # First three rows
# Both rows and columns
# TODO: Pay more attention here!!! Useful for filtering !!!
employeeData[1:3, c("FirstName", "Age")] # First three rows, two specific columns
employeeData[c(1, 3), c(2, 4)] # First and third rows, second and fourth columns
# Conditional subsetting
employeeData[employeeData$Age > 30, ] # Rows where Age > 30 (a missing Age yields an all-NA row)
# TODO: Pay more attention here!!! Useful for filtering !!!
subset(employeeData, Age > 30) # Similar, but subset() drops rows where the condition is NA
Modifying Data Frames:
# Adding new columns
employeeData$Department <- c("HR", "IT", "Sales", "IT", "HR")
employeeData[["Salary"]] <- c(50000, 60000, 75000, 65000, 80000)
# Modifying existing columns
employeeData$Age <- employeeData$Age + 1 # Increment all ages
# Adding new rows
new_row <- data.frame(
  EmployeeID = 106,
  FirstName = "Jamie",
  Age = 56,
  PayType = "Hourly",
  Department = "Sales",
  Salary = 70000
)
employeeData <- rbind(employeeData, new_row)
# Removing rows/columns (shown on a copy so the examples below can still use Department)
trimmed <- employeeData[-1, ] # Remove first row
trimmed$Department <- NULL # Remove Department column
Common Operations:
# Sorting (Very Useful)
employeeData[order(employeeData$Age), ] # Sort by Age
employeeData[order(employeeData$Department, -employeeData$Salary), ] # Multiple sort criteria
# Summarizing
summary(employeeData) # Statistical summary
table(employeeData$Department) # Frequency table of Department
# Aggregating
aggregate(Salary ~ Department, data = employeeData, FUN = mean) # Mean salary by department
# Handling missing values
complete.cases(employeeData) # Identify complete rows
na.omit(employeeData) # Remove rows with any NA
Advanced Features:
# Merging data frames
df1 <- data.frame(ID = 1:3, Name = c("A", "B", "C"))
df2 <- data.frame(ID = 2:4, Score = c(88, 94, 82))
merge(df1, df2, by = "ID") # SQL-like inner join on ID
# Reshaping data
# Wide to long format (gather() is superseded by pivot_longer())
library(tidyr)
long_data <- gather(employeeData, key = "Variable", value = "Value", -EmployeeID)
# Computing on columns
employeeData$BonusEligible <- employeeData$Salary > 70000
Matrices
- Definition: 2D homogeneous data structure.
- Creation: Use `matrix()`.
matrix1 <- matrix(c(1, 0, -20, 0, 1, -15, 1, -1, 0), nrow = 3, ncol = 3, byrow = TRUE)
- Accessing Elements:
matrix1[2, 3] # Element at row 2, column 3
matrix1[1, ] # Entire first row
matrix1[, 2] # Entire second column
- Matrix Operations:
mat1 + mat2 # Element-wise addition
mat1 * mat2 # Element-wise multiplication
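Since `mat1` and `mat2` are not defined in the notes, here is a self-contained sketch with two made-up 2x2 matrices. It also contrasts the element-wise `*` with the matrix product `%*%`, which the notes do not cover but which is easy to confuse with it:

```r
# Hypothetical matrices, for illustration only (filled column by column)
mat1 <- matrix(1:4, nrow = 2)             # rows: (1, 3) and (2, 4)
mat2 <- matrix(c(2, 0, 0, 2), nrow = 2)   # 2 on the diagonal, 0 elsewhere

elementwise_sum     <- mat1 + mat2    # adds matching positions
elementwise_product <- mat1 * mat2    # multiplies matching positions
matrix_product      <- mat1 %*% mat2  # linear-algebra product

elementwise_product[1, 2]  # 0, because 3 * 0
matrix_product[1, 2]       # 6, because 1 * 0 + 3 * 2
```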
Arrays
- Definition: nD homogeneous data structure.
- Creation: Use `array()`.
array1 <- array(c(1:8), dim = c(2, 2, 2))
- Named Arrays:
named_array <- array(c(1:8), dim = c(2, 2, 2),
                     dimnames = list(c("r1", "r2"), c("c1", "c2"), c("m1", "m2")))
Lists
- Definition: Heterogeneous data structure that can nest other objects.
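- Creation: Use `list()`.
demoVec <- c(1, 2, 3) # placeholder vector so the example runs
demoDF <- data.frame(x = 1:2, y = c("a", "b")) # placeholder data frame so the example runs
list1 <- list(Element1 = demoVec, Element2 = c("A", "B"), Element3 = 3, Element4 = demoDF)
- Accessing Elements:
list1$Element1 # Returns the first element
list1[[1]] # Same as above
- Adding Elements:
list1$NewElement <- "New Value" # Adds a new element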
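One related detail, sketched here with a made-up list: single brackets `[ ]` return a smaller list, while `[[ ]]` and `$` return the element itself. The names below are hypothetical:

```r
# Hypothetical list, for illustration only
example_list <- list(Element1 = c(1, 2, 3), Element2 = c("A", "B"))

as_sublist <- example_list["Element1"]    # still a list, of length 1
as_element <- example_list[["Element1"]]  # the numeric vector itself

class(as_sublist)  # "list"
class(as_element)  # "numeric"
```

This matters in practice: `mean(as_element)` works, whereas `mean(as_sublist)` returns NA with a warning because its argument is a list.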
R Packages
Package Basics:
- Packages expand R's default functionality
- Thousands available via CRAN (Comprehensive R Archive Network)
- Browse packages at:
- By name: cran.r-project.org/web/packages/available_packages_by_name.html
- By task: cran.r-project.org/web/views/
Installation Methods:
# Method 1: Using install.packages() function
install.packages(c("foreign", "readr", "haven"))
# Method 2: For packages with potential installation issues
install.packages(c("dplyr", "car"), dependencies = TRUE, type = "binary", ask = FALSE)
# Method 3: Interactive installation via RStudio
# Tools -> Install Packages...
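Important: Package installation only needs to happen once per R installation, whereas `library()` must be called in every new session. After upgrading to a new R release, packages typically need to be reinstalled.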
Loading Packages:
# Load one package at a time
library(foreign)
library(readr)
library(haven)
# Get package documentation
help(package = "haven")
Package Management:
# Update packages
update.packages() # Via function
# Or: Tools -> Check for Package Updates... (in RStudio)
# Unload a package
detach("package:readr", unload = TRUE)
Handling Package Conflicts:
# When packages have functions with same name:
# Option 1: Use package-specific reference
dplyr::recode(data) # Use dplyr's recode
car::recode(data) # Use car's recode
# Option 2: Control through loading order
library(car) # Load first package
library(dplyr) # Most recently loaded takes precedence
Best Practices:
- Avoid reinstalling packages unnecessarily
- Be aware of package loading order
- Consider using specific package references (::) for functions with same names
- Watch for package conflict messages when loading libraries
- RStudio 1.2+ will offer to install missing packages automatically
- Restart R before updating packages that are currently loaded
File Paths and Working Directories
- Setting Working Directory:
setwd("~/Documents/ResearchProject")
- Getting Working Directory:
getwd()
- Relative Paths:
list.files("Data Files") # Lists files in the "Data Files" subdirectory
list.files("Data Files/More Data")
list.files("..") # ".." is the parent folder (one level up)
Reading Data Files
- CSV Files:
CSV_base_example <- read.csv("Data Files/ExampleData.csv")
# using an external library
library(readr)
CSV_readr_example <- read_csv("Data Files/ExampleData.csv")
- Excel Files:
# using an external library
library(readxl)
XLSX_readxl_example <- read_excel("Data Files/ExampleData.xlsx", sheet = 1)
- SPSS Files:
# with `foreign` library
library(foreign)
SAV_foreign_example <- read.spss("Data Files/ExampleData.sav", to.data.frame = TRUE)
# with `haven` library
library(haven)
SAV_haven_example <- read_sav("Data Files/ExampleData.sav")
Writing Data Files
- CSV Files:
write.csv(CSV_readr_example, file = "Data Files/More Data/WriteExampleData.csv", row.names = FALSE)
# TODO: As mentioned earlier, the `fwrite` function from data.table is a faster alternative
data.table::fwrite(CSV_readr_example, file = "AltExampleData.csv")
- Excel Files:
library(writexl) # provides write_xlsx()
write_xlsx(CSV_readr_example, path = "Data Files/More Data/WriteExampleData.xlsx")
- SPSS Files:
library(haven) # provides write_sav()
write_sav(CSV_readr_example, path = "Data Files/More Data/WriteExampleData.sav")
...
Graphic Design Final Entry
17501 • Apr 4, 2025
Today, I completed and submitted the last entry of my graphic design assignment.
I was supposed to create a Figma UI for the company's website and demonstrate my knowledge of UI/UX laws and principles in the written report.
Honestly, it was a bit harder than I expected. It took me some time to get used to the new tool and its features. But I think I made good use of them and hope to receive a good mark for this 😅.
One of my friends helped me out with the requirements of the assessment. Gotta give her the credit.
Now, with the ISDS and IMOB finals left, I bravely march forward, carrying my sword 🗡️ to eliminate the rest of the exams ...
Wish me luck 🤞🏻 🍀
ISDS Preparation
Anonymous • Mar 31, 2025
I have recently started revising the materials from the Introduction to Statistics and Data Science module at WIUT.
Here is what I love about the module:
- It covers a great range of things that are widely used in every domain
Here is the list of R-related lecture notes
As there will be some questions on the R programming language, I decided to share my lecture notes with you:
FundPro Viva Experience
17501 • Mar 27, 2025
Today, I was called to a viva for the first time ever.
Initially, I got 98 for the coursework, and that raised questions from the examiner: was I using GenAI or just copying others' work?
I did it all by myself, and it took quite some time to complete, you know, as I was aiming to get something near 100.
Thankfully, I managed to pass it, and it was easy. The hard part was getting to this day 😂. I was going nuts, thinking about the worst-case scenario.
Ahh, relief... But for a moment.
Now here comes my finals
Wish me luck 🤞🏻 and pray for me ❤️
Peace 🕊️✌️
Introduction to R Programming - Week 1 Lecture Notes
admin3 • Mar 23, 2025
Table of Contents
- Software Installation
- RStudio Interface
- Arithmetic Operations
- Mathematical Functions
- Relational Operators
- Data Classes
- Missing Data
- R Objects and Assignment
Software Installation
Required Software
- R: The core programming language
- Download from CRAN (Comprehensive R Archive Network)
- Platform-specific versions available for Mac, Windows, Linux
- RStudio: Integrated Development Environment (IDE)
- Download from Posit website
- Provides unified interface across operating systems
Important URLs
- R Download: https://cran.r-project.org/
- RStudio Download: https://posit.co/download/rstudio-desktop/
RStudio Interface
Key Components
Source Pane (Top Left)
- Where R scripts are written and edited
- Save files with the .R extension
# Example script content
# Convert a temperature from Celsius to Fahrenheit
temp_celsius <- 25
temp_fahrenheit <- (temp_celsius * 9/5) + 32
Console (Bottom Left)
- Displays executed commands and output
- Direct command entry possible
> 2 + 2
[1] 4
Environment Pane (Top Right)
- Shows active variables and objects
# After running:
temp_celsius # Value: 25
temp_fahrenheit # Value: 77
Arithmetic Operations
Basic Operators with Examples
# Addition
5 + 3 # Output: 8
# Subtraction
10 - 4 # Output: 6
# Multiplication
6 * 7 # Output: 42
# Division
15 / 3 # Output: 5
# Exponents
2 ^ 3 # Output: 8
# Modulo (remainder)
17 %% 5 # Output: 2
# Integer division
17 %/% 5 # Output: 3
Order of Operations Examples
# Different results based on parentheses
4 + 2 * 3 # Output: 10 (multiplication first)
(4 + 2) * 3 # Output: 18 (addition first)
# Complex calculation
((10 + 5) * 2) / 5 # Output: 6
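The modulo and integer division operators fit together: the quotient times the divisor, plus the remainder, gives back the original number. A small sketch of that relationship follows; the variable names and the `is_even` helper are illustrative only.

```r
a <- 17
b <- 5

quotient <- a %/% b   # how many whole times b fits into a
remainder <- a %% b   # what is left over

# Rebuilding the original value from quotient and remainder
rebuilt <- quotient * b + remainder
print(rebuilt == a)

# A common use of modulo: testing whether a number is even
is_even <- function(n) {
  n %% 2 == 0
}
print(is_even(c(4, 7, 10)))
```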
Mathematical Functions
Common Functions with Examples
# Square root
sqrt(16) # Output: 4
sqrt(c(9, 16, 25)) # Output: 3 4 5
# Absolute value
abs(-7.5) # Output: 7.5
abs(c(-2, 0, 2)) # Output: 2 0 2
# Logarithms
log10(100) # Output: 2
log(exp(1)) # Output: 1
# Exponential
exp(2) # Output: 7.389056
Function Help Example
# Getting help for sqrt function
?sqrt
# Returns documentation showing:
# sqrt(x) # where x is a numeric vector
Relational Operators
Examples with Different Data Types
# Numeric comparisons
5 < 10 # Output: TRUE
7 >= 7 # Output: TRUE
3 == 3 # Output: TRUE
4 != 5 # Output: TRUE
# String comparisons
"apple" == "apple" # Output: TRUE
"a" < "b" # Output: TRUE
# Mixed type comparisons
5 == "5" # Output: TRUE (5 is coerced to the character "5" before comparing)
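Two details of `==` are easy to trip over: it coerces both sides to a common type before comparing, and it compares floating-point numbers exactly. The sketch below contrasts it with `identical()` and `all.equal()`, both base R functions; the variable names are only for illustration.

```r
# == coerces both sides to a common type before comparing
loose_match <- 5 == "5"
# identical() compares type and value with no coercion
strict_match <- identical(5, "5")
print(c(loose_match, strict_match))

# Floating-point arithmetic can make == fail on values that look equal
exact_equal <- (0.1 + 0.2) == 0.3
# all.equal() compares with a small tolerance instead
near_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))
print(c(exact_equal, near_equal))
```

Wrapping `all.equal()` in `isTRUE()` is the usual pattern, because on a mismatch `all.equal()` returns a description of the difference rather than `FALSE`.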
Data Classes
Type Examples and Conversions
# Numeric
x <- 10.5
typeof(x) # Output: "double"
# Integer
y <- 10L
typeof(y) # Output: "integer"
# Character
name <- "John"
typeof(name) # Output: "character"
# Logical
is_valid <- TRUE
typeof(is_valid) # Output: "logical"
# Type conversion examples
as.integer(10.7) # Output: 10
as.character(123) # Output: "123"
as.numeric("456") # Output: 456
as.Date(43800, origin = "1899-12-30") # "2019-12-01"
Testing Types
# Using is.* functions
x <- 10.5
is.numeric(x) # Output: TRUE
is.integer(x) # Output: FALSE
is.character(x) # Output: FALSE
# Multiple checks
y <- "123"
is.numeric(y) # Output: FALSE
is.numeric(as.numeric(y)) # Output: TRUE
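Conversions do not always behave like rounding or succeed at all. As a rough illustration (variable names are hypothetical), `as.integer()` truncates toward zero, and text that is not a number becomes `NA`:

```r
# as.integer() drops the fractional part (truncates toward zero)
truncated_pos <- as.integer(10.7)
truncated_neg <- as.integer(-10.7)
# round() goes to the nearest whole number instead
rounded <- round(10.7)

# Text that is not a number converts to NA (R also emits a warning, silenced here)
bad_conversion <- suppressWarnings(as.numeric("abc"))

print(c(truncated_pos, truncated_neg))
print(rounded)
print(is.na(bad_conversion))
```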
Missing Data
Working with NA and NaN
# Creating missing values
x <- c(1, NA, 3, NaN, 5)
# Testing for NA
is.na(x) # Output: FALSE TRUE FALSE TRUE FALSE (NaN also counts as missing)
# Calculations with NA
sum(c(1, NA, 3)) # Output: NA
sum(c(1, NA, 3), na.rm = TRUE) # Output: 4
# NA vs NaN
0/0 # Output: NaN
NA + 1 # Output: NA
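To tell NA and NaN apart and to count missing values, `is.nan()` and `sum(is.na())` are handy. This is a small sketch on the same kind of vector as above; the result variable names are illustrative.

```r
x <- c(1, NA, 3, NaN, 5)

# is.nan() is TRUE only for NaN, while is.na() is TRUE for both NA and NaN
nan_flags <- is.nan(x)
missing_count <- sum(is.na(x))

# na.rm = TRUE drops both NA and NaN before computing
clean_mean <- mean(x, na.rm = TRUE)

# Comparing with NA gives NA, so use is.na() rather than == NA
na_compare <- NA == NA

print(nan_flags)
print(missing_count)
print(clean_mean)
print(na_compare)
```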
Type Conversion and Comparison
Understanding Type Conversion
# Different numeric types
x_int <- 5L # integer
x_num <- 5 # numeric/double
x_int == x_num # Output: TRUE (values are equal)
typeof(x_int) == typeof(x_num) # Output: FALSE (types are different)
# Detailed example
varA <- 3.3 # double/numeric
varB <- "hello there" # character
varC <- FALSE # logical
varD <- 5L # integer
varE <- 5 # double
varF <- varD + varE # double (integer + numeric = numeric)
varG <- 2 * varC # numeric (numeric * logical = numeric)
# Checking types
typeof(varA) # "double"
typeof(varB) # "character"
typeof(varC) # "logical"
typeof(varD) # "integer"
typeof(varE) # "double"
typeof(varF) # "double"
typeof(varG) # "double"
Key Points About Type Conversion
Implicit Conversion
- R automatically converts between integer and numeric types in calculations
- Logical values convert to 1 (TRUE) or 0 (FALSE) in numeric operations
- The "wider" type usually prevails (e.g., integer + numeric = numeric)
Value vs Type Comparison
5L == 5 # TRUE (comparing values)
typeof(5L) == typeof(5) # FALSE (comparing types: "integer" vs "double")
Type Hierarchy
- character > numeric > integer > logical
- When mixing types, R usually converts to the higher type
1L + 2.5 # Result is numeric (2.5 wins)
TRUE + 1L # Result is integer (1L wins)
TRUE + 1.0 # Result is numeric (1.0 wins)
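The same hierarchy applies when values of different types are combined into one vector with `c()`, since a vector can hold only one type. A brief sketch (the element names are just labels for this example):

```r
# typeof() of each mixed vector shows which type wins
types <- c(
  logical_integer = typeof(c(TRUE, 1L)),
  integer_double = typeof(c(1L, 2.5)),
  double_character = typeof(c(1.5, "a")),
  logical_character = typeof(c(TRUE, "a"))
)
print(types)

# Mixing everything ends up as character, the highest type in the hierarchy
mixed <- c(TRUE, 2L, 3.5, "four")
print(mixed)
```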
R Objects and Assignment
Variable Assignment Examples
# Basic assignment
age <- 25
name <- "Alice"
# Multiple assignments
height <- weight <- 70
# Complex assignments
bmi <- weight / (height/100)^2
# Listing objects
ls() # Shows all objects in environment
# Removing objects
rm(age) # Removes single object
rm(list = ls()) # Removes all objects
Naming Conventions Examples
# Valid names
valid_name <- 1
validName <- 2
VALID_NAME <- 3
.hidden_name <- 4
# Invalid names (will cause errors)
# 1name <- 5 # Can't start with number
# _name <- 6 # Can't start with underscore
# name-1 <- 7 # Can't use hyphen
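One way to check a name without triggering an error is to compare it with the output of base R's `make.names()`, which rewrites invalid names into valid ones. The helper `is_valid_name` below is a hypothetical name for this sketch, not a built-in function.

```r
# A name is syntactically valid if make.names() leaves it unchanged
is_valid_name <- function(x) {
  make.names(x) == x
}

candidates <- c("valid_name", ".hidden_name", "1name", "_name", "name-1")
validity <- is_valid_name(candidates)
names(validity) <- candidates
print(validity)
```

The same check also rejects reserved words such as `if` or `TRUE`, because `make.names()` alters those too.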
Practice Exercises:
7. Create variables of different types and test their classes
8. Perform arithmetic operations with variables
9. Try working with missing values and understand their behavior
10. Practice naming conventions and object assignments
References
R data types and packages -> the continuation of the R chronicles (Week 2)
A 2-hour video tutorial:
Impact of IT on retail industry
Anonymous • Mar 23, 2025
Core Study Context and Background:
- The study investigates the impact of information technology on retail industry, specifically focusing on employee and customer acceptance of smart retail technologies (SRT) in Jordan (549 page)
- Data was collected from 134 retail stores across Jordan's metropolitan cities, with 480 customer responses (558 page)
Key Technology Adoption Findings:
- Technology readiness was found to significantly impact retail performance (β = 0.620), indicating that a 1% increase in technology readiness results in a 62% improvement in retailer performance (561 page)
- Perceived usefulness showed substantial impact (β = 0.576) on retailer performance, suggesting a 57.6% improvement for each 1% increase (561 page)
Smart Retail Technology (SRT) Implementation:
- SRT provides retail customer services through smart device networks and integrated retail infrastructure (551 page)
- The study projects SRT assets to reach $36 billion by 2021 (551 page)
Customer Behavior and Technology Acceptance:
- Store reputation plays a crucial role - consumers see reputable stores as more trustworthy and show favorable attitudes toward SRT (552 page)
- Both employee and customer preparedness need to be considered when implementing new technologies (565 page)
Implementation Challenges:
- Retail stores must ensure smart and user-friendly innovations are introduced to reduce consumer discontent (565 page)
- The study found that intelligent, easy, and realistic technology can reduce consumer discontent and molestation (565 page)
Future Research Recommendations:
- Future studies should investigate customer adoption of specific technologies like smart displays, smart shopping carts, and NFC systems (566 page)
- Research could expand to developing countries like Malaysia, India, and China to increase generalization possibilities (566 page)
Regional Context:
- Jordan's retail sector shows significant potential for growth in modern, organized retail segments (566 page)
- The country's unique geographic location makes it a natural corridor for regional growth and international shopping (566 page)
Study Limitations:
- Respondents could opt out of providing certain details
- Self-administered questionnaires may have impacted efficacy
- Potential sampling errors in randomized sampling locations (566 page)
References
- Source:
- Ref:
- Theeb, K.A., Mansour, A.M., Khaled, A.S.D., Syed, A.A. and Saeed, A.M.M. (2023) ‘The impact of information technology on retail industry: an empirical study’, Int. J. Procurement Management, Vol. 16, No. 4, pp.549–568.
Feedback on groupmate's feedback #1
admin1 • Mar 23, 2025
I really liked the detailed steps you listed, how the technology might require changes in Korzinka’s internal processes, and how to make sure users can adapt to it seamlessly.
However, I would strongly recommend that you support your arguments with reliable sources, either from the similar cases that we identified earlier or from ones that can be found online.
Here is the illustration from your draft:
“ A comprehensive strategy for implementing this is to, first, create a new app specifically for the self checkout system or to update the current Korzinka application… “
Here, you mention developing a new app or updating the existing one, and how it will be used. A similar idea is discussed in this research paper
It would be great if you could find arguments and advantages for this step in other sources, showing that Korzinka can benefit from it.
In short, the general structure is the following (of course, you can also add your own solutions, but the overall recommendation is to support them with external academic sources):
According to (source authors), bla-bla-bla has successfully worked out, showing (result of the implementation), and that's why Korzinka might as well benefit from the integration of bla-bla-bla by doing bla-bla-bla
You can use in-text citations and proper referencing using this book
Additionally, in my response to Question #2, I also mention a few case studies with some statistical results and where (company or place) each case took place, and you may also refer to those justifications.
On top of that, I sent a detailed document earlier on the solution that we chose as a group (the self-checkout). It is basically a summary of a related research paper. You may also want to go over it.
Thank you very much 🙏🏻🙏🏻🙏🏻