Empirical Application Vignette

Introduction

This vignette illustrates how our stepdown procedure can be applied to real gene expression data to identify transcripts associated with cell cycle phases. We use publicly available HeLa cell cycle data from Hughes et al. (2009).

Setup

Install and load required packages:

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("GEOquery")

required_packages <- c("readxl", "compiler", "ggplot2")
invisible(lapply(required_packages, require, character.only = TRUE))

Load hdIndep package

library(hdIndep)

Data Acquisition

We analyze the GEO dataset GSE11923:

gse <- GEOquery::getGEO("GSE11923", GSEMatrix = TRUE)
expr_data <- exprs(gse[[1]])

You may download data from the HeLa Cell Cycle (from Hughes et al. 2009) See more information about the dataset here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11923

Each column in expr_data corresponds to a sample (GSM ID=GEO Sample). Each row corresponds to a probe set measuring expression in that sample. The values in expr_data are expression intensity values for each probe set in each sample.

Running the Stepdown Procedure

Define parameters for the stepdown procedure:

n <- dim(expr_data)[2] # Sample size
p <- dim(expr_data)[1] # Number of variables
q <- block_size(n) # Block size
B <- 1000 # Number of bootstrap replications
alpha <- 0.05 # Significance level
m <- floor(n / (q + 1)) # Number of blocks
type <- "bmb1" # Test statistic used in the multiplier bootstrap
# Set seed for reproducibility
seed <- 5
set.seed(seed)

Reshape the data so that we can run the stepdown procedure automatically.

D <- t(expr_data)
dat <- list(X = seq(1, n, by = 1), D = D)

Run the stepdown procedure using the BMB1 studentized version of the test statistic

stud <- stepdown_RomanoWolf(dat, q, B, alpha, type, seed, steps = TRUE)
length(stud$rejected_total) # Number of rejected hypotheses
length(stud$rejected_total) /  dim(expr_data)[1]
stud$counter # Number of steps in the stepdown procedure

Collect the variables that are screened out by the stepdown procedure. These are the probes that are not independent of the cycle phase.

independent_vars <- expr_data[stud_1$rejected_total, ]

Finally, save the independent variables

write.csv(independent_vars, file = "data/independent_vars.csv")

Comparing Results with Hughes et al. (2009)

Load the probes that were identified by Hughes et al. (2009).

Download the pre-processed data from the HeLa Cell Cycle from Hughes et al. (2009), Table S1, Supporting information: https://pmc.ncbi.nlm.nih.gov/articles/PMC2654964/#pgen.1000442.s012

This is an excel file, pgen.1000442.s012.xls, that contains the transcripts identified by Hughes et al. (2009) that cycle with a period length of 24 hrs, i.e., that are independent of the cell cycle phase.

df_hughes <- readxl::read_excel("data/pgen.1000442.s012.xls", sheet = 1)
hughes2009 <- df_hughes[[1]] # Extract identified probes as a vector
cat("Hughes et al. (2009) identified", length(hughes2009), "different
transcripts in mouse liver that cycle with a period length of 24 hours.\n")

Compare transcripts from Hughes et al. (2009) with ours

raw_oow2025 <- read.csv("data/independent_vars.csv", header = TRUE, stringsAsFactors = FALSE)

Extract only the first column

oow2025 <- raw_oow2025[[1]]  # Extract identified probes as a vector
cat("Our stepdown method identified", length(oow2025),
    "different transcripts.\n")

Compare the two sets of identified transcripts

common <- intersect(hughes2009, oow2025)
unique_oow <- setdiff(oow2025, hughes2009)
unique_hughes <- setdiff(hughes2009, oow2025)
cat("There are", length(common), "transcripts that are common to both sets.\n")
cat("There are", length(unique_oow), "transcripts unique to our method.\n")

Unique Transcripts

Take a random sample of unique transcripts identified by our procedure

cat("Here are 6 randomly selected transcripts identified by our procedure:\n")
set.seed(6)
id_sample <- sample(unique_oow, 6)

What are these transcripts? Extract the platform (GPL) ID

gpl_id <- annotation(gse[[1]])

Check the platform ID and download the annotation file for the platform

print(gpl_id)
gpl <- getGEO(gpl_id)

Extract annotation table and check available column names

gpl_table <- Table(gpl)
colnames(gpl_table)

Select relevant columns

probe_to_gene <- gpl_table[, c("ID", "Gene Symbol", "Gene Title")]

Match with id_sample (your probe set IDs) and add original Probe Set IDs

gene_info <- probe_to_gene[match(id_sample, probe_to_gene$ID), c("Gene Symbol", "Gene Title")]
gene_info$ProbeSetID <- id_sample

Print the subset

print(gene_info)

Replace id_sample with Gene Symbols for plots

id_sample_symbol <- gene_info$Gene.Symbol

Visualization of Selected Transcripts

Convert raw_oow2025 to a matrix

raw_oow2025 <- as.matrix(raw_oow2025)

Find id_sample in raw_oow2025 and return all the columns as a data frame

id_sample_df <- raw_oow2025[raw_oow2025[, 1] %in% id_sample, ]
plot_data <- apply(id_sample_df[, -1], 2, as.numeric)

Visuals

x_vals <- 1:ncol(plot_data)

for (i in 1:nrow(plot_data)) {
  # Create a data frame for the current row
  df <- data.frame(x = x_vals, y = plot_data[i, ])
  
  # Generate plot
  ggplot(df, aes(x = x, y = y)) +
    geom_line(linetype = "dashed") +  # Dashed line
    geom_point(color = "skyblue") +      # Blue circles
    theme_minimal() +                 # Minimal theme
    labs(x = "Hour", y = "Intensity Level")  # Updated axis labels
}

Mauricio Olivares, Tomasz Olma, Daniel Wilhelm

2025-02-28