
Converts between Gene IDs for a given expression dataset and mapping file.

Usage

gene_mapping(
  exprs_data,
  mapping_df,
  compress_fun = c("mean", "median", "sum", "pc1", "pc1_unscaled", "highest_mean",
    "highest_median"),
  compress_trans = c("none", "log_exp", "exp_log"),
  verbose = TRUE
)

Arguments

exprs_data

continuous value data.frame or matrix with Gene IDs as rownames

mapping_df

mapping data.frame whose first column holds the Gene IDs found in the rownames of exprs_data and whose second column holds the IDs to map to. Many-to-many links between the two columns are allowed.

compress_fun

the compression method to use when multiple gene IDs link to a single new gene ID. "highest_mean" and "highest_median" keep the single row with the highest mean or median, respectively, while "mean", "median", and "sum" aggregate the duplicate rows with the chosen method. "pc1" and "pc1_unscaled" use the first principal component from stats::prcomp(), with its scale. argument set to TRUE or FALSE, respectively.

compress_trans

the transformation applied when compressing the duplicate rows. For example, "log_exp" takes the log of the data, applies the compress_fun method, then transforms back using exp (i.e. with compress_fun = "mean" this gives the geometric mean).

verbose

should messages about the compression process be displayed
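
As a rough illustration of how compress_fun and compress_trans combine, the sketch below reproduces the "log_exp" plus "mean" case by hand in base R. It does not call gene_mapping(); the matrix dup_rows and its probe and sample names are hypothetical.

```r
# Two hypothetical rows that link to the same new gene ID
dup_rows <- rbind(
  probe_a = c(s1 = 1, s2 = 4, s3 = 10),
  probe_b = c(s1 = 4, s2 = 16, s3 = 1000)
)

# compress_trans = "none" with compress_fun = "mean":
# plain arithmetic mean per sample
arith_mean <- colMeans(dup_rows)

# compress_trans = "log_exp" with compress_fun = "mean":
# log the data, average per sample, then transform back with exp()
geo_mean <- exp(colMeans(log(dup_rows)))

# The same quantity written directly as a geometric mean
geo_mean_direct <- apply(dup_rows, 2, function(x) prod(x)^(1 / length(x)))

print(rbind(arith_mean, geo_mean, geo_mean_direct))
```

The geometric mean is less sensitive to a single large row than the arithmetic mean, which is one reason to pick "log_exp" for data on a multiplicative scale.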

Value

a data.frame at the new gene ID level, with compression of duplicate rows as outlined in the compress_fun and compress_trans parameters
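
The difference between aggregating duplicate rows and picking one of them can be sketched in base R as follows. This is only an approximation of the idea under assumed toy data (exprs, new_id, and the GENE_* labels are hypothetical), not the package's own code.

```r
# Hypothetical expression matrix: id1 and id2 both link to GENE_A
exprs <- matrix(
  c(1, 2, 3,
    5, 6, 7,
    10, 20, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("id1", "id2", "id3"), c("s1", "s2", "s3"))
)
new_id <- c(id1 = "GENE_A", id2 = "GENE_A", id3 = "GENE_B")

# Group the old row names by the new ID they link to
groups <- split(rownames(exprs), new_id[rownames(exprs)])

# "mean"-style compression: average the duplicate rows per sample
agg_mean <- t(sapply(groups, function(ids) {
  colMeans(exprs[ids, , drop = FALSE])
}))

# "highest_mean"-style compression: keep the row with the largest mean
highest_mean <- t(sapply(groups, function(ids) {
  sub <- exprs[ids, , drop = FALSE]
  sub[which.max(rowMeans(sub)), ]
}))

print(agg_mean)
print(highest_mean)
```

In both cases the result has one row per new gene ID and one column per sample; only the values for GENE_A differ between the two methods.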

Details

Gene IDs in the exprs_data that do not link to the first column of mapping_df will be excluded from the final output.
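
One way to picture this exclusion is as an inner join between the expression rownames and the first column of mapping_df. The sketch below uses base::merge() on hypothetical toy data; whether gene_mapping() uses merge() internally is an assumption, and the column names old_id and new_id are made up for illustration.

```r
# Hypothetical expression data with one ID that has no mapping
exprs <- data.frame(
  s1 = c(1, 2, 3),
  s2 = c(4, 5, 6),
  row.names = c("id1", "id2", "id_unmapped")
)

# Hypothetical mapping: id2 links to two new IDs (many-to-many),
# and one mapped ID is absent from the expression data
mapping_df <- data.frame(
  old_id = c("id1", "id2", "id2", "id_not_in_data"),
  new_id = c("GENE_A", "GENE_A", "GENE_B", "GENE_C")
)

# Inner join on the old ID: rows without a link on either side drop out
linked <- merge(
  data.frame(old_id = rownames(exprs), exprs),
  mapping_df,
  by = "old_id"
)

print(linked)
```

Here id_unmapped disappears from the result, while id2 appears twice because it links to both GENE_A and GENE_B.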

When compress_fun is "pc1" or "pc1_unscaled", the sign of the first principal component is flipped if it is negatively correlated with the duplicate rows.
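
The sign of a principal component is arbitrary, which is why such a flip is needed. A minimal sketch of the idea follows, using simulated data; the rule used here to decide on a flip (mean correlation below zero) is an assumption, since the exact criterion is not spelled out above.

```r
set.seed(1)

# Three hypothetical duplicate rows sharing one underlying signal
signal <- rnorm(20)
dup_rows <- rbind(
  probe_a = signal + rnorm(20, sd = 0.1),
  probe_b = 2 * signal + rnorm(20, sd = 0.1),
  probe_c = signal + rnorm(20, sd = 0.1)
)

# prcomp() expects samples in rows and variables (here: probes) in columns.
# scale. = TRUE corresponds to "pc1"; scale. = FALSE to "pc1_unscaled".
pca <- stats::prcomp(t(dup_rows), scale. = TRUE)
pc1 <- pca$x[, 1]

# Flip PC1 when it is negatively correlated with the duplicate rows
# (assumed criterion: mean correlation across the rows is below zero)
if (mean(stats::cor(pc1, t(dup_rows))) < 0) {
  pc1 <- -pc1
}

print(stats::cor(pc1, t(dup_rows)))
```

After the flip, higher PC1 scores correspond to higher expression in the original rows, so the compressed value keeps the direction of the data.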

Examples

if (FALSE) { # \dontrun{
geo <- GEOquery::getGEO("GSE14333")
exprs_data <- geo$GSE14333_series_matrix.txt.gz@assayData$exprs

library(hgu133plus2.db)
mapping_df <- as.data.frame(hgu133plus2SYMBOL)

gene_mapping(exprs_data, mapping_df, "highest_mean", "none")

### Using a more complicated case for GSE83834
temp_dir <- tempdir()
geo <- GEOquery::getGEO("GSE83834", destdir = temp_dir)
sup_mat <- GEOquery::getGEOSuppFiles("GSE83834", baseDir = temp_dir)
exprs_data_raw <- as.data.frame(readxl::read_excel(rownames(sup_mat)))
table(table(exprs_data_raw$ID))

exprs_data <- exprs_data_raw[,-1]
rownames(exprs_data) <- sub('\\.[0-9]*$', '', exprs_data_raw$ID)

library('biomaRt')
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <-  rownames(exprs_data)
mapping_df <- getBM(filters= "ensembl_gene_id",
                      attributes= c("ensembl_gene_id","hgnc_symbol"),
                      values = genes, mart= mart)

exprs_data_symbols <- gene_mapping(exprs_data, mapping_df)

} # }