Empirical Application Vignette
Mauricio Olivares, Tomasz Olma, Daniel Wilhelm
2025-02-28
Source:vignettes/hdIndep.Rmd
hdIndep.Rmd
Introduction
This vignette illustrates how our stepdown procedure can be applied to real gene expression data to identify transcripts associated with cell cycle phases. We use publicly available HeLa cell cycle data from Hughes et al. (2009).
Setup
Install and load required packages:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")
required_packages <- c("readxl", "compiler", "ggplot2")
invisible(lapply(required_packages, require, character.only = TRUE))
Load hdIndep
package
Data Acquisition
We analyze the GEO dataset GSE11923:
gse <- GEOquery::getGEO("GSE11923", GSEMatrix = TRUE)
expr_data <- exprs(gse[[1]])
You may download data from the HeLa Cell Cycle (from Hughes et al. 2009) See more information about the dataset here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11923
Each column in expr_data
corresponds to a sample (GSM
ID=GEO Sample). Each row corresponds to a probe set measuring expression
in that sample. The values in expr_data
are expression
intensity values for each probe set in each sample.
Running the Stepdown Procedure
Define parameters for the stepdown procedure:
n <- dim(expr_data)[2] # Sample size
p <- dim(expr_data)[1] # Number of variables
q <- block_size(n) # Block size
B <- 1000 # Number of bootstrap replications
alpha <- 0.05 # Significance level
m <- floor(n / (q + 1)) # Number of blocks
type <- "bmb1" # Test statistic used in the multiplier bootstrap
# Set seed for reproducibility
seed <- 5
set.seed(seed)
Reshape the data so that we can run the stepdown procedure automatically.
Run the stepdown procedure using the BMB1 studentized version of the test statistic
stud <- stepdown_RomanoWolf(dat, q, B, alpha, type, seed, steps = TRUE)
length(stud$rejected_total) # Number of rejected hypotheses
length(stud$rejected_total) / dim(expr_data)[1]
stud$counter # Number of steps in the stepdown procedure
Collect the variables that are screened out by the stepdown procedure. These are the probes that are not independent of the cycle phase.
independent_vars <- expr_data[stud_1$rejected_total, ]
Finally, save the independent variables
write.csv(independent_vars, file = "data/independent_vars.csv")
Comparing Results with Hughes et al. (2009)
Load the probes that were identified by Hughes et al. (2009).
Download the pre-processed data from the HeLa Cell Cycle from Hughes et al. (2009), Table S1, Supporting information: https://pmc.ncbi.nlm.nih.gov/articles/PMC2654964/#pgen.1000442.s012
This is an excel file, pgen.1000442.s012.xls
, that
contains the transcripts identified by Hughes et al. (2009) that cycle
with a period length of 24 hrs, i.e., that are independent of the cell
cycle phase.
df_hughes <- readxl::read_excel("data/pgen.1000442.s012.xls", sheet = 1)
hughes2009 <- df_hughes[[1]] # Extract identified probes as a vector
cat("Hughes et al. (2009) identified", length(hughes2009), "different
transcripts in mouse liver that cycle with a period length of 24 hours.\n")
Compare transcripts from Hughes et al. (2009) with ours
raw_oow2025 <- read.csv("data/independent_vars.csv", header = TRUE, stringsAsFactors = FALSE)
Extract only the first column
oow2025 <- raw_oow2025[[1]] # Extract identified probes as a vector
cat("Our stepdown method identified", length(oow2025),
"different transcripts.\n")
Compare the two sets of identified transcripts
Unique Transcripts
Take a random sample of unique transcripts identified by our procedure
cat("Here are 6 randomly selected transcripts identified by our procedure:\n")
set.seed(6)
id_sample <- sample(unique_oow, 6)
What are these transcripts? Extract the platform (GPL) ID
gpl_id <- annotation(gse[[1]])
Check the platform ID and download the annotation file for the platform
print(gpl_id)
gpl <- getGEO(gpl_id)
Extract annotation table and check available column names
gpl_table <- Table(gpl)
colnames(gpl_table)
Select relevant columns
probe_to_gene <- gpl_table[, c("ID", "Gene Symbol", "Gene Title")]
Match with id_sample (your probe set IDs) and add original Probe Set IDs
gene_info <- probe_to_gene[match(id_sample, probe_to_gene$ID), c("Gene Symbol", "Gene Title")]
gene_info$ProbeSetID <- id_sample
Print the subset
print(gene_info)
Replace id_sample with Gene Symbols for plots
id_sample_symbol <- gene_info$Gene.Symbol
Visualization of Selected Transcripts
Convert raw_oow2025 to a matrix
raw_oow2025 <- as.matrix(raw_oow2025)
Find id_sample in raw_oow2025 and return all the columns as a data frame
id_sample_df <- raw_oow2025[raw_oow2025[, 1] %in% id_sample, ]
plot_data <- apply(id_sample_df[, -1], 2, as.numeric)
Visuals
x_vals <- 1:ncol(plot_data)
for (i in 1:nrow(plot_data)) {
# Create a data frame for the current row
df <- data.frame(x = x_vals, y = plot_data[i, ])
# Generate plot
ggplot(df, aes(x = x, y = y)) +
geom_line(linetype = "dashed") + # Dashed line
geom_point(color = "skyblue") + # Blue circles
theme_minimal() + # Minimal theme
labs(x = "Hour", y = "Intensity Level") # Updated axis labels
}