Last updated: 2022-02-01
Checks: 7 0
Knit directory: snRNA_eqtl/
This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20211124)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 7cdba32. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Ignored: data/magma_interaction_eqtl/
Ignored: data/magma_specific_genes/
Ignored: data_sensitive/
Untracked files:
Untracked: analysis/._index.Rmd
Untracked: analysis/additional/
Untracked: analysis/alternatives/
Untracked: analysis/clean_analysis.Rmd
Untracked: analysis/log_run_slurm_out
Untracked: analysis/log_run_slurm_stderr
Untracked: analysis/portfolio/
Untracked: analysis/revision.Rmd
Untracked: analysis/run_slurm.Rscript
Untracked: analysis/run_slurm.sbatch
Untracked: data/dice/
Untracked: data/epigenome_enrichment/
Untracked: data/gencode/
Untracked: data/gnomad_loeuf/
Untracked: data/gtex/
Untracked: data/gwas/
Untracked: data/gwas_epigenome_overlap/
Untracked: data/metabrain/
Untracked: data/umap/
Untracked: output/eqtl/
Untracked: output_almost_final/
Untracked: output_tmp_QTLtools/
Untracked: output_tmp_fastQTL_sample_ambient_removed/
Unstaged changes:
Modified: .gitignore
Modified: analysis/02_run_eQTL.Rmd
Modified: analysis/04_coloc.Rmd
Modified: analysis/_site.yml
Modified: analysis/plot_figures.Rmd
Deleted: output/README.md
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01_process_expression.Rmd
) and HTML (docs/01_process_expression.html
) files. If you’ve configured a remote Git repository (see ?wflow_git_remote
), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | 7cdba32 | Julien Bryois | 2022-02-01 | Build site. |
Rmd | 07b8f71 | Julien Bryois | 2022-02-01 | process |
html | 7f06226 | Julien Bryois | 2022-01-27 | Build site. |
Rmd | 3a086fb | Julien Bryois | 2022-01-27 | Pseudo-bulk across cells expression added |
html | f0a5d45 | Julien Bryois | 2022-01-24 | Build site. |
Rmd | 5b8f23e | Julien Bryois | 2022-01-24 | Process Expression |
html | e65a9c5 | Julien Bryois | 2021-12-14 | Build site. |
Rmd | 4f7015d | Julien Bryois | 2021-12-14 | gene_soup_filter_sample_soup_filter_fastQTL_output2 |
html | 15cfc19 | Julien Bryois | 2021-12-14 | Build site. |
Rmd | 9a80a23 | Julien Bryois | 2021-12-14 | QTLtools format update - bug fix |
html | 7c447bd | Julien Bryois | 2021-12-14 | Build site. |
Rmd | dca9a01 | Julien Bryois | 2021-12-14 | QTLtools format update |
html | 16a97be | Julien Bryois | 2021-12-13 | Build site. |
Rmd | 329b2da | Julien Bryois | 2021-12-13 | process_expression_no_soup_genes_no_soup_ind |
html | f971404 | Julien Bryois | 2021-12-07 | Build site. |
Rmd | 62a3cc3 | Julien Bryois | 2021-12-07 | process expression without soupy genes |
html | 48c3088 | Julien Bryois | 2021-12-07 | Build site. |
Rmd | b05d4fc | Julien Bryois | 2021-12-07 | expression_filter_soupy_genes |
html | f41b1ed | Julien Bryois | 2021-12-07 | Build site. |
Rmd | a3b1cad | Julien Bryois | 2021-12-07 | expression_filter_soupy_genes |
Rmd | fcde74d | Julien Bryois | 2021-12-06 | expression_filter_soupy_genes |
html | 0d629e6 | Julien Bryois | 2021-12-01 | Build site. |
Rmd | db82bd8 | Julien Bryois | 2021-12-01 | Added code to extract expression of marker genes |
html | 7c6cfa8 | Julien Bryois | 2021-11-30 | Build site. |
Rmd | 98eaf47 | Julien Bryois | 2021-11-30 | TMM normalization for eQTL instead of CPM |
html | 67ca055 | Julien Bryois | 2021-11-29 | Build site. |
Rmd | 4b8b916 | Julien Bryois | 2021-11-29 | expression code added |
library(tidyverse)
library(SingleCellExperiment)
library(parallel)
Get individuals with both scRNA-seq and genotype information
individidual_genotype_scRNA <- read_tsv('data_sensitive/genotypes/individuals_with_genotype_scRNA.txt',
col_names = FALSE)
sce_ms <- readRDS('data_sensitive/sce/ms_sce_3.rds')
get annotation
annot <- read_csv('data_sensitive/sce/conos_labelled_2021-05-31.txt')
Only keep cells that were annotated
sce_ms <- sce_ms[,colnames(sce_ms)%in%annot$cell_id]
m <- match(colnames(sce_ms),annot$cell_id)
sce_ms$broad_annot <- annot$type_broad[m]
Some scRNA-seq were swapped and not corrected in the loaded sce object.
EU017 and EU018 –> swap WM189 and WM190 –> swap
Swap them here
sce_ms$library_id[sce_ms$library_id=='EU017'] <- 'tmp'
sce_ms$library_id[sce_ms$library_id=='EU018'] <- 'EU017'
sce_ms$library_id[sce_ms$library_id=='tmp'] <- 'EU018'
colnames(sce_ms) <- gsub('EU017','tmp',colnames(sce_ms))
colnames(sce_ms) <- gsub('EU018','EU017',colnames(sce_ms))
colnames(sce_ms) <- gsub('tmp','EU018',colnames(sce_ms))
sce_ms$library_id[sce_ms$library_id=='WM189'] <- 'tmp'
sce_ms$library_id[sce_ms$library_id=='WM190'] <- 'WM189'
sce_ms$library_id[sce_ms$library_id=='tmp'] <- 'WM190'
colnames(sce_ms) <- gsub('WM189','tmp',colnames(sce_ms))
colnames(sce_ms) <- gsub('WM190','WM189',colnames(sce_ms))
colnames(sce_ms) <- gsub('tmp','WM190',colnames(sce_ms))
Keep only individuals for which we have both scRNA-seq and expression
sce_ms <- sce_ms[,sce_ms$patient_id%in%individidual_genotype_scRNA$X1]
add individual_id column to the sce object
sce_ms$individual_id <- sce_ms$patient_id
Let’s remove the individual which is also in the AD dataset (less cells than the AD dataset)
sce_ms <- sce_ms[,sce_ms$individual_id!='S03-009']
Let’s remove samples that don’t match their genotype
sce_ms <- sce_ms[,!sce_ms$library_id%in%c('EU015','EU004','WM115','WM155','WM185')]
Let’s remove immune cells from the MS dataset (not observed in the AD dataset)
sce_ms <- sce_ms[,!sce_ms$broad_annot=='Immune']
Let’s set the gene name to ensembl (to merge with the AD sce dataset)
rownames(sce_ms) <- rowData(sce_ms)$ensembl
Finally, let’s save info about which sample corresponds to which individual in the filtered data
ms_sample_individual <- colData(sce_ms) %>%
as_tibble() %>%
dplyr::select(sample_id=library_id,individual_id,age,post_mortem_m,sex,tissue=matter) %>%
unique() %>%
write_tsv(.,'data_sensitive/expression/sample_individual_ms.txt',col_names = TRUE)
sce_ad <- readRDS('data_sensitive/sce/sce.annotated8.rds')
Keep only individuals for which we have both scRNA-seq and expression
sce_ad <- sce_ad[,sce_ad$individual_id%in%individidual_genotype_scRNA$X1]
Loading required package: HDF5Array
Loading required package: rhdf5
Some individuals are present in multiple AD studies
(individuals_multiple_studies <- colData(sce_ad) %>% as_tibble() %>%
dplyr::select(individual_id,study,diagnosis) %>%
unique() %>%
add_count(individual_id) %>%
filter(n>1) %>%
arrange(individual_id))
# A tibble: 10 x 4
individual_id study diagnosis n
<chr> <chr> <chr> <int>
1 SM-CJGLH Mathys Ctrl 2
2 SM-CJGLH Zhou Ctrl 2
3 SM-CJIZ1 Mathys Ctrl 2
4 SM-CJIZ1 Zhou Ctrl 2
5 SM-CJJ35 Mathys AD 2
6 SM-CJJ35 Zhou AD 2
7 SM-CTDQW Columbia AD 2
8 SM-CTDQW Mathys AD 2
9 SM-CTEM3 Mathys Ctrl 2
10 SM-CTEM3 Zhou Ctrl 2
count number of cells for these individuals
n_cells_per_id <- colData(sce_ad) %>% as_tibble() %>%
filter(individual_id%in%individuals_multiple_studies$individual_id) %>%
dplyr::count(individual_id,study) %>%
group_by(individual_id) %>%
mutate(keep=ifelse(n==max(n),TRUE,FALSE))
to_remove <- filter(n_cells_per_id,!keep) %>%
mutate(lab=paste0(individual_id,'_',study))
sce_ad$lab <- paste0(sce_ad$individual_id,'_',sce_ad$study)
Remove the sample with the least amount of cells (when two samples from the same individual are observed in two datasets)
sce_ad <- sce_ad[,!sce_ad$lab%in%to_remove$lab]
Get same cell type label as the MS dataset
sce_ad$broad_annot <- case_when(
sce_ad$broad_annot=='Glutamatergic' ~ 'Excitatory neurons',
sce_ad$broad_annot=='GABAergic' ~ 'Inhibitory neurons',
sce_ad$broad_annot=='OPCs' ~ 'OPCs / COPs',
sce_ad$broad_annot=='Endothelial' ~ 'Endothelial cells',
TRUE ~ sce_ad$broad_annot)
Let’s set the gene name to ensembl (to merge with the AD sce dataset)
rownames(sce_ad) <- rowData(sce_ad)$ENSEMBL
Finally, let’s save info about which sample corresponds to which individual in the filtered data
ad_sample_individual <- colData(sce_ad) %>%
as_tibble() %>%
mutate(post_mortem_m=pmi*60) %>%
dplyr::select(sample_id,individual_id,age,post_mortem_m,sex,tissue) %>%
unique() %>%
write_tsv(.,'data_sensitive/expression/sample_individual_ad.txt',col_names = TRUE)
common_genes <- intersect(rownames(sce_ms),rownames(sce_ad))
length(common_genes)
[1] 25938
sce_ms <- sce_ms[common_genes,]
sce_ad <- sce_ad[common_genes,]
Save ensembl - symbol names from ms dataset
ensembl2symbol <- rowData(sce_ms) %>%
as_tibble() %>%
dplyr::select(-type) %>%
write_tsv('data_sensitive/expression/ensembl2symbol_ms.txt')
For all samples, get:
ad_cov <- colData(sce_ad) %>% as_tibble() %>%
dplyr::select(individual_id,study,diagnosis) %>%
unique() %>%
mutate(study=paste0(study,'_AD'))
ms_cov <- colData(sce_ms) %>% as_tibble() %>%
dplyr::select(individual_id,diagnosis=disease_status) %>%
mutate(study='MS') %>%
unique() %>%
mutate(diagnosis=case_when(
diagnosis=='CTR' ~ 'Ctrl',
TRUE ~ 'MS'
)) %>% unique()
cov_all <- rbind(ad_cov,ms_cov)
cov_all_t <- cov_all %>% column_to_rownames('individual_id') %>% t() %>% as.data.frame() %>% rownames_to_column('id')
Load genotype PCs
cov_genotype_pcs <- read_tsv('data_sensitive/genotypes/pca_covariate_fastqtl.txt')
cov_all_t <- cov_all_t[,colnames(cov_genotype_pcs)]
Add covariates from metadata (study, case-control) to genotype PCs.
cov <- rbind(cov_genotype_pcs,cov_all_t) %>% as_tibble()
dir.create('data_sensitive/eqtl',showWarnings = FALSE)
write_tsv(cov,'data_sensitive/eqtl/covariate_pca_meta_fastqtl.txt')
count_per_cluster <- function(sce,i,var){
cnt_matrix <- sce[,sce$broad_annot==cell_type_id$V1[i] & sce[[var]]==cell_type_id$V2[i]]
tot_counts <- Matrix::rowSums(counts(cnt_matrix))
n_expressed <- Matrix::rowSums(counts(cnt_matrix)>0)
n_cells <- ncol(cnt_matrix)
cnt_df <- data.frame(counts=tot_counts,
n_expressed=n_expressed,
n_cells=n_cells) %>%
rownames_to_column('ensembl') %>%
mutate(perc_expressed=round(n_expressed*100/n_cells,digits=2),
cell_type=cell_type_id$V1[i],
individual_id=cell_type_id$V2[i]) %>%
inner_join(.,ensembl2symbol,by='ensembl') %>%
dplyr::select(cell_type,individual_id,ensembl,symbol,counts,n_expressed,n_cells,perc_expressed)
if(var=='sample_id'){
colnames(cnt_df)[2] <- 'sample_id'
}
return(cnt_df)
}
if(!file.exists('data_sensitive/expression/ms_sum_expression.individual_id.rds')){
cell_type_id <- cbind(as.character(sce_ms$broad_annot),sce_ms$individual_id) %>% unique() %>% as.data.frame()
sum_expression_ms <- mclapply(1:nrow(cell_type_id),
count_per_cluster,
sce=sce_ms,
var='individual_id',
mc.cores = 36,
mc.preschedule = FALSE) %>%
bind_rows()
saveRDS(sum_expression_ms,'data_sensitive/expression/ms_sum_expression.individual_id.rds')
rm(sum_expression_ms)
gc()
}
if(!file.exists('data_sensitive/expression/ms_sum_expression.sample_id.rds')){
sce_ms$sample_id <- sce_ms$library_id
cell_type_id <- cbind(as.character(sce_ms$broad_annot),sce_ms$sample_id) %>% unique() %>% as.data.frame()
sum_expression_ms <- mclapply(1:nrow(cell_type_id),
count_per_cluster,
sce=sce_ms,
var='sample_id',
mc.cores = 36,
mc.preschedule = FALSE) %>%
bind_rows()
saveRDS(sum_expression_ms,'data_sensitive/expression/ms_sum_expression.sample_id.rds')
rm(sum_expression_ms)
gc()
}
rm(sce_ms)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 9370436 500.5 17791542 950.2 17791542 950.2
Vcells 163244320 1245.5 8907262452 67957.1 12459900815 95061.5
if(!file.exists('data_sensitive/expression/ad_sum_expression.individual_id.rds')){
cell_type_id <- cbind(as.character(sce_ad$broad_annot),sce_ad$individual_id) %>% unique() %>% as.data.frame()
sum_expression_ad <- mclapply(1:nrow(cell_type_id),
count_per_cluster,
sce=sce_ad,
var='individual_id',
mc.cores = 36,
mc.preschedule = FALSE) %>%
bind_rows()
saveRDS(sum_expression_ad,'data_sensitive/expression/ad_sum_expression.individual_id.rds')
rm(sum_expression_ad)
gc()
}
if(!file.exists('data_sensitive/expression/ad_sum_expression.sample_id.rds')){
cell_type_id <- cbind(as.character(sce_ad$broad_annot),sce_ad$sample_id) %>% unique() %>% as.data.frame()
sum_expression_ad <- mclapply(1:nrow(cell_type_id),
count_per_cluster,
sce=sce_ad,
var='sample_id',
mc.cores = 36,
mc.preschedule = FALSE) %>%
bind_rows()
saveRDS(sum_expression_ad,'data_sensitive/expression/ad_sum_expression.sample_id.rds')
rm(sum_expression_ad)
gc()
}
rm(sce_ad)
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 7516242 401.5 17791542 950.2 17791542 950.2
Vcells 21596850 164.8 7125809962 54365.7 12459900815 95061.5
sum_expression_ms <- readRDS('data_sensitive/expression/ms_sum_expression.individual_id.rds') %>% mutate(dataset='ms')
sum_expression_ad <- readRDS('data_sensitive/expression/ad_sum_expression.individual_id.rds') %>% mutate(dataset='ad')
Let’s merge the MS and AD dataset.
sum_expression <- rbind(sum_expression_ms,sum_expression_ad)
Get counts per million (CPM)
sum_expression <- sum_expression %>%
group_by(cell_type,individual_id) %>%
mutate(counts_scaled_1M=counts*10^6/sum(counts)) %>%
ungroup() %>%
as_tibble()
Save expression of cell type marker genes
markers <- sum_expression %>% filter(symbol%in%c('SLC17A7','GAD2','AQP4','MOG','PDGFRA','C1QA','CLDN5','RGS5'))
saveRDS(markers,'data_sensitive/expression/cell_marker_expression.rds')
sumstats_gene <- sum_expression %>% group_by(cell_type,ensembl) %>%
summarise(counts_per_celltype=sum(counts),
n_expressed_per_celltype=sum(n_expressed),
n_cells_per_celltype=sum(n_cells),
perc_expressed_per_celltype=round(n_expressed_per_celltype*100/n_cells_per_celltype,digits=2),
n_samples_expressed=sum(counts>0),
perc_samples_expressed=round(n_samples_expressed*100/length(counts),digits=2),
mean_cpm=mean(counts_scaled_1M)) %>%
group_by(ensembl) %>%
mutate(prop_max_cpm=mean_cpm/max(mean_cpm)) %>%
ungroup()
We will first compute the proportion of reads that could be attributed to ambient RNA (conservative estimate).
compute_ambient_per_sample <- function(id){
message(id)
#Get only counts for sample 'id'
counts_id <- counts_per_cell_type %>%
filter(sample_id==id)
#Get the cell types for which we have counts in sample 'id'
cell_types_id <- counts_id %>% pull(cell_type) %>% unique()
#Get only ambient RNA for sample 'id'
ambient_id <- ambient %>% filter(sample_id==id) %>%
arrange(ensembl)
#Compute ambient proportion for each gene in each cell type from sample 'id'
ambient_prop <- lapply(1:length(cell_types_id), function(i){
message(cell_types_id[i])
counts_id_cell_type <- filter(counts_id,cell_type==cell_types_id[i]) %>%
arrange(ensembl)
stopifnot(all(counts_id_cell_type$ensembl==ambient_id$ensembl))
max_ambience <- DropletUtils::maximumAmbience(y = counts_id_cell_type$counts,
ambient = ambient_id$counts,
mode='proportion')
out <- tibble(individual_id=id,cell_type=cell_types_id[i],ensembl=counts_id_cell_type$ensembl,soup=max_ambience)
}) %>%
bind_rows()
return(ambient_prop)
}
if(file.exists('data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')){
ambient_prop_per_cell_type <- readRDS('data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')
} else{
#Load pseudo-bulk counts for each sample in the dataset.
sample_id_ms <- readRDS('data_sensitive/expression/ms_sum_expression.sample_id.rds')
sample_id_ad <- readRDS('data_sensitive/expression/ad_sum_expression.sample_id.rds')
counts_per_cell_type <- rbind(sample_id_ms,sample_id_ad) %>%
dplyr::select(cell_type,sample_id,ensembl,counts)
#We also summed counts for all barcodes with less than 100UMI
ambient_ad <- read_tsv('data_sensitive/ambient/ambient.100UMI.ad.txt')
#Swap two samples (genotype of scRNA-seq matches another sample and vice-versa), already done in ad sce object
colnames(ambient_ad)[colnames(ambient_ad)=='DWM-B3-24-Cog4-Path1-M'] <- 'tmp'
colnames(ambient_ad)[colnames(ambient_ad)=='DWM-B3-22-Cog4-Path0-M'] <- 'DWM-B3-24-Cog4-Path1-M'
colnames(ambient_ad)[colnames(ambient_ad)=='tmp'] <- 'DWM-B3-22-Cog4-Path0-M'
#Correct scRNA-seq sample names (scRNA-seq was not matching metadata)
colnames(ambient_ad)[colnames(ambient_ad)=='MFC-B2-10-Cog1-Path1-F'] <- 'MFC-B2-10-Cog1-Path0-M'
colnames(ambient_ad)[colnames(ambient_ad)=='MFC-B2-11-Cog1-Path1-M'] <- 'MFC-B2-11-Cog1-Path1-F'
#Make the ad ambient data long
ambient_ad <- ambient_ad %>%
gather(sample_id,counts,-gene) %>%
mutate(ensembl=gsub('\\..+','',gene)) %>%
dplyr::select(-gene)
#Now get ambient RNA for ms samples
ambient_ms <- read_tsv('data_sensitive/ambient/ambient.100UMI.ms.txt')
#Swap two samples (genotype of scRNA-seq matches another sample and vice-versa)
colnames(ambient_ms)[colnames(ambient_ms)=='EU017'] <- 'tmp'
colnames(ambient_ms)[colnames(ambient_ms)=='EU018'] <- 'EU017'
colnames(ambient_ms)[colnames(ambient_ms)=='tmp'] <- 'EU018'
colnames(ambient_ms)[colnames(ambient_ms)=='WM189'] <- 'tmp'
colnames(ambient_ms)[colnames(ambient_ms)=='WM190'] <- 'WM189'
colnames(ambient_ms)[colnames(ambient_ms)=='tmp'] <- 'WM190'
#Make the ms ambient data long
ambient_ms <- ambient_ms %>%
dplyr::select(-symbol) %>%
gather(sample_id,counts,-ensembl)
#Aggregate ambient counts for ad and MS, only keep genes that are in the pseudo-bulks counts
ambient <- rbind(ambient_ad,ambient_ms) %>%
filter(ensembl%in%unique(counts_per_cell_type$ensembl))
samples <- unique(counts_per_cell_type$sample_id)
samples_ambient <- unique(ambient$sample_id)
stopifnot(all(samples%in%samples_ambient))
#Estimate ambient RNA for each individual
ambient_prop_all <- mclapply(samples,compute_ambient_per_sample,mc.cores = 16) %>%
bind_rows()
saveRDS(ambient_prop_all,'data_sensitive/ambient/ambient.prop.estimate.droplet_utils.all.rds')
ambient_prop_per_cell_type <- ambient_prop_all%>%
group_by(cell_type,ensembl) %>%
summarise(mean_soup=mean(soup,na.rm=TRUE),
median_soup=median(soup,na.rm=TRUE),
sd_soup=sd(soup,na.rm=TRUE),
min_soup=min(soup,na.rm=TRUE),
max_soup=max(soup,na.rm=TRUE),
n_sample=n())
saveRDS(ambient_prop_per_cell_type,'data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')
}
Keep genes:
ambient_keep <- filter(ambient_prop_per_cell_type,mean_soup<0.1)
gene_to_keep_per_cell_type <- sumstats_gene %>%
filter(n_samples_expressed>10,mean_cpm>1) %>%
dplyr::select(cell_type,ensembl) %>%
unique() %>%
inner_join(.,ambient_keep,by=c('cell_type','ensembl'))
sum_expression <- sum_expression %>%
inner_join(.,gene_to_keep_per_cell_type,by=c('cell_type','ensembl'))
Number of genes tested per cell type
(dplyr::count(sum_expression,cell_type,individual_id) %>%
dplyr::select(-individual_id) %>%
unique() %>%
arrange(-n))
# A tibble: 8 x 2
cell_type n
<chr> <int>
1 Excitatory neurons 14595
2 Inhibitory neurons 12943
3 Astrocytes 11436
4 Oligodendrocytes 10816
5 OPCs / COPs 10801
6 Endothelial cells 10778
7 Microglia 8887
8 Pericytes 6940
Remove individuals with less than 10 cells for a given cell type.
(n_cells_ind_cell_type <- sum_expression %>%
dplyr::select(cell_type,individual_id,n_cells) %>%
unique() %>% arrange(n_cells))
# A tibble: 1,514 x 3
cell_type individual_id n_cells
<chr> <chr> <int>
1 Astrocytes SD030/18 1
2 Endothelial cells SD015/18 1
3 Pericytes SD015/18 1
4 Inhibitory neurons S13-096 1
5 Pericytes SM-CJGIM 1
6 Endothelial cells SM-CJGIM 1
7 Endothelial cells SM-CTEGO 1
8 Endothelial cells SM-CJFOL 1
9 Endothelial cells SM-CTECO 1
10 Inhibitory neurons SM-CJFOC 1
# … with 1,504 more rows
keep <- filter(n_cells_ind_cell_type,n_cells>=10)
Number of unique individuals per cell types for eQTL analysis
dplyr::count(keep,cell_type) %>% arrange(-n)
# A tibble: 8 x 2
cell_type n
<chr> <int>
1 Oligodendrocytes 192
2 Astrocytes 189
3 Excitatory neurons 187
4 Microglia 187
5 OPCs / COPs 187
6 Inhibitory neurons 178
7 Pericytes 150
8 Endothelial cells 144
sum_expression <- inner_join(sum_expression,keep,by=c('cell_type','individual_id','n_cells'))
Get TSS coordinate for all genes.
gtf <- rtracklayer::import('data/gencode/Homo_sapiens.GRCh38.96.filtered.gtf') %>%
as.data.frame() %>%
dplyr::filter(type=='gene') %>%
mutate(TSS_start=ifelse(strand=='+',start,end),
TSS_end=ifelse(strand=='+',start+1,end+1)) %>%
mutate(gene=paste0(gene_name,'_',gene_id)) %>%
dplyr::select(seqnames,TSS_start,TSS_end,gene) %>%
filter(seqnames%in%c(1:22))
gtf$seqnames <- droplevels(gtf$seqnames)
Set all samples to the same set of symbol and gene ID (like MS dataset).
gene_id_common <- sum_expression %>%
filter(dataset=='ms') %>%
dplyr::select(symbol,ensembl) %>%
unique() %>%
mutate(gene=paste0(symbol,'_',ensembl)) %>% dplyr::select(-symbol)
sum_expression <- sum_expression %>%
inner_join(.,gene_id_common,by='ensembl') %>%
dplyr::select(cell_type,individual_id,gene,counts,n_expressed,n_cells,perc_expressed,counts_scaled_1M)
saveRDS(sum_expression,'data_sensitive/expression/all_sum_expression.rds')
Perform TMM normalization and get fastQTL format
get_fastqtl_pheno <- function(cell){
#keep all expression for the cell type and patient that have genotype
bed <- sum_expression %>% dplyr::filter(cell_type==cell) %>%
dplyr::select(individual_id,gene,counts) %>%
spread(individual_id,counts) %>%
column_to_rownames('gene') %>%
edgeR::DGEList(counts = .) %>% #TMM normalize the data using edgeR
edgeR::calcNormFactors(.) %>%
edgeR::cpm(.) %>%
as.data.frame() %>%
rownames_to_column('gene')
bed <- bed %>% inner_join(.,gtf,by='gene') %>%
dplyr::select(seqnames,TSS_start,TSS_end,gene,everything()) %>%
arrange(seqnames,TSS_start)
colnames(bed)[c(1,2,3,4)] <- c('#Chr','start','end','ID')
write_tsv(bed,path = paste0('data_sensitive/eqtl/',make.names(cell),'.bed'))
}
cells <- unique(sum_expression$cell_type)
mclapply(cells,get_fastqtl_pheno,mc.cores=8)
cd data_sensitive/eqtl
ml tabix/0.2.6-goolf-1.7.20
for f in *.bed
do
bgzip $f && tabix -p bed $f.gz
done
geno_cov <- read_tsv('data_sensitive/eqtl/covariate_pca_meta_fastqtl.txt')
pheno_files <- list.files('data_sensitive/eqtl/',pattern='*.bed.gz$',full.names = T)
get_covariates <- function(i,folder_name){
out_name <- basename(pheno_files[i]) %>% gsub('\\.bed\\.gz','',.) %>% paste0('.cov.txt')
#load expression data
d <- data.table::fread(pheno_files[i],data.table=FALSE)
#perform pca
pca <- d[-c(1,2,3,4)] %>% t() %>% prcomp(.,scale.=T)
#Create directories with different number of expression pcs
#Add the n first pcs to the covariate file and write the covariate file in fastqtl format
number_of_pcs <- c(10,20,30,40,50,60,70,80,90,100,110,120)
lapply(number_of_pcs,function(i){
top_pcs <- pca$x[,1:i] %>% t()
rownames(top_pcs) <- paste0(rownames(top_pcs),'_exp')
top_pcs <- top_pcs %>% as.data.frame() %>% rownames_to_column('id')
shared_samples <- intersect(colnames(geno_cov),colnames(top_pcs))
cov <- rbind(geno_cov[,shared_samples],top_pcs[,shared_samples])
dir.create(paste0(folder_name,i),showWarnings = FALSE)
write_tsv(cov,paste0(folder_name,i,'/',out_name))
})
}
mclapply(1:length(pheno_files),get_covariates,folder_name='data_sensitive/eqtl/PC',mc.cores=8)
cd data_sensitive/eqtl
for f in PC*
do
cd $f
gzip *
ln -s ../*.bed.gz .
ln -s ../*.bed.gz.tbi .
ln -s ../../genotypes/processed/combined_final.vcf.gz .
ln -s ../../genotypes/processed/combined_final.vcf.gz.tbi .
cd ..
done
Let’s generate pseudo-bulk expression data across all cell types.
First, we load the expression data at pseudo-bulk level for each cell type.
sum_expression_ms <- readRDS('data_sensitive/expression/ms_sum_expression.individual_id.rds') %>% mutate(dataset='ms')
sum_expression_ad <- readRDS('data_sensitive/expression/ad_sum_expression.individual_id.rds') %>% mutate(dataset='ad')
sum_expression <- rbind(sum_expression_ms,sum_expression_ad)
Set all samples to the same set of symbol and gene ID (like MS dataset).
gene_id_common <- sum_expression %>%
filter(dataset=='ms') %>%
dplyr::select(symbol,ensembl) %>%
unique() %>%
mutate(gene=paste0(symbol,'_',ensembl)) %>% dplyr::select(-symbol)
sum_expression <- sum_expression %>%
inner_join(.,gene_id_common,by='ensembl') %>%
dplyr::select(cell_type,individual_id,gene,counts)
Get pseudo-bulk expression for each individual accross all cell types
pb <- sum_expression %>%
group_by(individual_id,gene) %>%
summarise(counts=sum(counts)) %>%
ungroup()
`summarise()` regrouping output by 'individual_id' (override with `.groups` argument)
Only keep genes with a mean CPM > 1
pb <- pb %>% group_by(individual_id) %>%
mutate(cpm=counts*10^6/sum(counts)) %>%
group_by(gene) %>%
mutate(mean_cpm=mean(cpm)) %>%
ungroup() %>%
filter(mean_cpm>1)
Perform TMM normalization and get fastQTL format
dir.create('data_sensitive/eqtl_pb/',showWarnings = FALSE)
bed <- pb %>%
dplyr::select(individual_id,gene,counts) %>%
spread(individual_id,counts) %>%
column_to_rownames('gene') %>%
edgeR::DGEList(counts = .) %>% #TMM normalize the data using edgeR
edgeR::calcNormFactors(.) %>%
edgeR::cpm(.) %>%
as.data.frame() %>%
rownames_to_column('gene')
bed <- bed %>% inner_join(.,gtf,by='gene') %>%
dplyr::select(seqnames,TSS_start,TSS_end,gene,everything()) %>%
arrange(seqnames,TSS_start)
colnames(bed)[c(1,2,3,4)] <- c('#Chr','start','end','ID')
write_tsv(bed,path = 'data_sensitive/eqtl_pb/pb.bed')
cd data_sensitive/eqtl_pb
ml tabix/0.2.6-goolf-1.7.20
for f in *.bed
do
bgzip $f && tabix -p bed $f.gz
done
pheno_files <- list.files('data_sensitive/eqtl_pb/',pattern='*.bed.gz$',full.names = T)
get_covariates <- function(i,folder_name){
out_name <- basename(pheno_files[i]) %>% gsub('\\.bed\\.gz','',.) %>% paste0('.cov.txt')
#load expression data
d <- data.table::fread(pheno_files[i],data.table=FALSE)
#perform pca
pca <- d[-c(1,2,3,4)] %>% t() %>% prcomp(.,scale.=T)
#Create directories with different number of expression pcs
#Add the n first pcs to the covariate file and write the covariate file in fastqtl format
number_of_pcs <- c(10,20,30,40,50,60,70,80,90,100,110,120)
lapply(number_of_pcs,function(i){
top_pcs <- pca$x[,1:i] %>% t()
rownames(top_pcs) <- paste0(rownames(top_pcs),'_exp')
top_pcs <- top_pcs %>% as.data.frame() %>% rownames_to_column('id')
shared_samples <- intersect(colnames(geno_cov),colnames(top_pcs))
cov <- rbind(geno_cov[,shared_samples],top_pcs[,shared_samples])
dir.create(paste0(folder_name,i),showWarnings = FALSE)
write_tsv(cov,paste0(folder_name,i,'/',out_name))
})
}
lapply(1:length(pheno_files),get_covariates,folder_name='data_sensitive/eqtl_pb/PC')
cd data_sensitive/eqtl_pb
for f in PC*
do
cd $f
gzip *
ln -s ../*.bed.gz .
ln -s ../*.bed.gz.tbi .
ln -s ../../genotypes/processed/combined_final.vcf.gz .
ln -s ../../genotypes/processed/combined_final.vcf.gz.tbi .
cd ..
done
sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /pstore/apps/OpenBLAS/0.3.1-GCC-7.3.0-2.30/lib/libopenblasp-r0.3.1.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] HDF5Array_1.16.1 rhdf5_2.32.4
[3] SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.1
[5] DelayedArray_0.14.0 matrixStats_0.56.0
[7] Biobase_2.48.0 GenomicRanges_1.40.0
[9] GenomeInfoDb_1.24.0 IRanges_2.22.2
[11] S4Vectors_0.26.1 BiocGenerics_0.34.0
[13] forcats_0.5.0 stringr_1.4.0
[15] dplyr_1.0.0 purrr_0.3.4
[17] readr_1.3.1 tidyr_1.1.0
[19] tibble_3.0.1 ggplot2_3.3.3
[21] tidyverse_1.3.0 workflowr_1.6.2
loaded via a namespace (and not attached):
[1] ggbeeswarm_0.6.0 colorspace_1.4-1
[3] ellipsis_0.3.1 rprojroot_1.3-2
[5] XVector_0.28.0 BiocNeighbors_1.6.0
[7] fs_1.4.1 rstudioapi_0.11
[9] fansi_0.4.1 lubridate_1.7.9
[11] xml2_1.3.2 R.methodsS3_1.8.0
[13] knitr_1.28 scater_1.16.2
[15] jsonlite_1.6.1 Rsamtools_2.4.0
[17] broom_0.5.6 dbplyr_1.4.4
[19] R.oo_1.23.0 compiler_4.0.1
[21] httr_1.4.1 backports_1.1.7
[23] assertthat_0.2.1 Matrix_1.2-18
[25] limma_3.44.1 cli_2.0.2
[27] later_1.1.0.1 BiocSingular_1.4.0
[29] htmltools_0.5.1.1 tools_4.0.1
[31] rsvd_1.0.3 gtable_0.3.0
[33] glue_1.4.1 GenomeInfoDbData_1.2.3
[35] Rcpp_1.0.6 cellranger_1.1.0
[37] vctrs_0.3.1 Biostrings_2.56.0
[39] nlme_3.1-148 rtracklayer_1.48.0
[41] DelayedMatrixStats_1.10.1 xfun_0.14
[43] rvest_0.3.5 lifecycle_0.2.0
[45] irlba_2.3.3 XML_3.99-0.3
[47] edgeR_3.30.3 zlibbioc_1.34.0
[49] scales_1.1.1 hms_0.5.3
[51] promises_1.1.1 yaml_2.2.1
[53] gridExtra_2.3 stringi_1.4.6
[55] BiocParallel_1.22.0 rlang_0.4.10
[57] pkgconfig_2.0.3 bitops_1.0-6
[59] evaluate_0.14 lattice_0.20-41
[61] Rhdf5lib_1.10.0 GenomicAlignments_1.24.0
[63] tidyselect_1.1.0 magrittr_1.5
[65] R6_2.4.1 generics_0.0.2
[67] DBI_1.1.0 pillar_1.4.4
[69] haven_2.3.1 whisker_0.4
[71] withr_2.2.0 RCurl_1.98-1.2
[73] modelr_0.1.8 crayon_1.3.4
[75] utf8_1.1.4 rmarkdown_2.2
[77] viridis_0.5.1 locfit_1.5-9.4
[79] grid_4.0.1 readxl_1.3.1
[81] data.table_1.12.8 blob_1.2.1
[83] git2r_0.27.1 reprex_0.3.0
[85] digest_0.6.25 httpuv_1.5.4
[87] R.utils_2.9.2 munsell_0.5.0
[89] beeswarm_0.2.3 viridisLite_0.3.0
[91] vipor_0.4.5