Expression Dataset Conversion Between Two Sets of Gene IDs
Source: R/gene_mapping.R
expression_compression.Rd
Converts an expression dataset from one set of Gene IDs to another using a mapping file.
Arguments
- exprs_data
continuous value data.frame or matrix with Gene IDs as rownames
- mapping_df
mapping file data.frame with the first column containing the Gene IDs found in the rownames of exprs_data and the second column containing the IDs to map to. Many-to-many links are allowed.
- compress_fun
the compression method to use when multiple gene IDs link to a single new gene ID. "highest_mean" and "highest_median" keep the single row with the highest mean or median, respectively, while "mean", "median", and "sum" aggregate the duplicate rows using the chosen method. "pc1" and "pc1_unscaled" use the first principal component from stats::prcomp(), with scale. set to TRUE or FALSE, respectively.
- compress_trans
the transformation applied when compressing the duplicate rows. For example, "log_exp" takes the log of the data, applies the compress_fun method, then transforms back using exp (with compress_fun set to "mean" this gives the geometric mean).
- verbose
should messages about the compression process be displayed
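The difference between picking a row and aggregating rows can be seen on a toy matrix. The following is a minimal base R sketch of the ideas behind "highest_mean", "mean", and the "log_exp" transformation; it is not the package's implementation, and the probe names, gene labels, and object names are made up for illustration.

```r
# Toy data (hypothetical): probes p1 and p2 both map to gene "A", p3 maps to "B"
exprs_toy <- matrix(
  c(1, 2, 3,
    4, 8, 16,
    5, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3"), c("s1", "s2", "s3"))
)
gene <- c(p1 = "A", p2 = "A", p3 = "B")
n_per_gene <- as.vector(table(gene))

# "highest_mean"-style: keep, per gene, the single row with the largest mean
row_means <- rowMeans(exprs_toy)
keep <- vapply(split(names(gene), gene),
               function(ids) ids[which.max(row_means[ids])],
               character(1))
highest <- exprs_toy[keep, , drop = FALSE]
rownames(highest) <- names(keep)

# "mean"-style: average the duplicate rows within each gene
agg_mean <- rowsum(exprs_toy, gene) / n_per_gene

# "log_exp"-style transformation around the mean: log, average, exp,
# which amounts to a geometric mean of the duplicate rows
geo_mean <- exp(rowsum(log(exprs_toy), gene) / n_per_gene)
```

Here the row-picking approach leaves every value in the output equal to an observed measurement, while the aggregating approaches produce new values that blend the duplicate rows.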
Value
a data.frame at the new gene ID level, with duplicate rows compressed as specified by the compress_fun and compress_trans parameters
Details
Gene IDs in exprs_data that do not appear in the first column of mapping_df are excluded from the final output.
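The exclusion of unmapped IDs behaves like an inner join between the mapping file and the expression data. The sketch below uses base R merge() and aggregate() on made-up IDs to show the effect; the object and column names are hypothetical, and the function itself may handle this step differently.

```r
# Toy expression data (hypothetical): id3 has no entry in the mapping
exprs_toy <- data.frame(s1 = c(1, 2, 3), s2 = c(4, 5, 6),
                        row.names = c("id1", "id2", "id3"))

# Toy mapping with many-to-many links: id1 maps to G1 and G2, and G2 is
# reached from both id1 and id2
map_toy <- data.frame(from = c("id1", "id1", "id2"),
                      to   = c("G1", "G2", "G2"))

# Inner join on the old ID: id3 drops out, id1 appears once per target ID
exprs_long <- cbind(from = rownames(exprs_toy), exprs_toy)
linked <- merge(map_toy, exprs_long, by = "from")

# Rows sharing a new ID are then compressed, here by a plain mean
by_gene <- aggregate(cbind(s1, s2) ~ to, data = linked, FUN = mean)
```

In this toy case id3 never reaches the output, and G2 is the average of the id1 and id2 rows.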
When compress_fun is "pc1" or "pc1_unscaled", the first principal component is sign-flipped if it is negatively correlated with the duplicate rows.
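The sign of a principal component is arbitrary, which is why a flip can be needed. The following sketch shows one way such a flip could work with stats::prcomp() on simulated duplicate rows. Comparing the component against the column means of the duplicate rows is an assumption made for illustration; the exact criterion used by the function is not stated here.

```r
set.seed(1)

# Two simulated duplicate rows (hypothetical probes) tracking the same signal
signal <- rnorm(20)
dup_rows <- rbind(p1 = signal + rnorm(20, sd = 0.1),
                  p2 = 2 * signal + rnorm(20, sd = 0.1))

# prcomp() expects observations in rows, so samples go in rows here;
# scale. = TRUE corresponds to "pc1", scale. = FALSE to "pc1_unscaled"
pca <- stats::prcomp(t(dup_rows), scale. = TRUE)
pc1 <- pca$x[, 1]

# Assumed flip rule: reverse the sign if PC1 runs against the average
# of the duplicate rows
if (stats::cor(pc1, colMeans(dup_rows)) < 0) {
  pc1 <- -pc1
}
```

After the flip, higher values of the compressed row correspond to higher values of the original duplicate rows.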
Examples
if (FALSE) { # \dontrun{
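# Download the GEO series and extract its probe-level expression matrix
geo <- GEOquery::getGEO("GSE14333")
exprs_data <- geo$GSE14333_series_matrix.txt.gz@assayData$exprs
# Build a probe ID to gene symbol mapping from the array annotation package
library(hgu133plus2.db)
mapping_df <- as.data.frame(hgu133plus2SYMBOL)
# Keep, for each gene symbol, the probe with the highest mean expression
expression_compression(exprs_data, mapping_df, "highest_mean", "none")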
### Using a more complicated case for GSE83834
temp_dir <- tempdir()
geo <- GEOquery::getGEO("GSE83834", destdir = temp_dir)
# The expression values are in a supplementary Excel file
sup_mat <- GEOquery::getGEOSuppFiles("GSE83834", baseDir = temp_dir)
exprs_data_raw <- as.data.frame(readxl::read_excel(rownames(sup_mat)))
# Check how often each ID occurs
table(table(exprs_data_raw$ID))
exprs_data <- exprs_data_raw[, -1]
# Strip the version suffix from the IDs and use them as rownames
rownames(exprs_data) <- sub('\\.[0-9]*$', '', exprs_data_raw$ID)
# Look up HGNC symbols for the Ensembl gene IDs with biomaRt
library('biomaRt')
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- rownames(exprs_data)
mapping_df <- getBM(filters = "ensembl_gene_id",
                    attributes = c("ensembl_gene_id", "hgnc_symbol"),
                    values = genes, mart = mart)
# Compress to gene symbols using the default compress_fun and compress_trans
exprs_data_symbols <- expression_compression(exprs_data, mapping_df)
} # }