Last updated: 2022-02-01

Checks: 7 0

Knit directory: snRNA_eqtl/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20211124)

The command set.seed(20211124) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 7cdba32

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7cdba32. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/magma_interaction_eqtl/
    Ignored:    data/magma_specific_genes/
    Ignored:    data_sensitive/

Untracked files:
    Untracked:  analysis/._index.Rmd
    Untracked:  analysis/additional/
    Untracked:  analysis/alternatives/
    Untracked:  analysis/clean_analysis.Rmd
    Untracked:  analysis/log_run_slurm_out
    Untracked:  analysis/log_run_slurm_stderr
    Untracked:  analysis/portfolio/
    Untracked:  analysis/revision.Rmd
    Untracked:  analysis/run_slurm.Rscript
    Untracked:  analysis/run_slurm.sbatch
    Untracked:  data/dice/
    Untracked:  data/epigenome_enrichment/
    Untracked:  data/gencode/
    Untracked:  data/gnomad_loeuf/
    Untracked:  data/gtex/
    Untracked:  data/gwas/
    Untracked:  data/gwas_epigenome_overlap/
    Untracked:  data/metabrain/
    Untracked:  data/umap/
    Untracked:  output/eqtl/
    Untracked:  output_almost_final/
    Untracked:  output_tmp_QTLtools/
    Untracked:  output_tmp_fastQTL_sample_ambient_removed/

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/02_run_eQTL.Rmd
    Modified:   analysis/04_coloc.Rmd
    Modified:   analysis/_site.yml
    Modified:   analysis/plot_figures.Rmd
    Deleted:    output/README.md

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01_process_expression.Rmd) and HTML (docs/01_process_expression.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
html	7cdba32	Julien Bryois	2022-02-01	Build site.
Rmd	07b8f71	Julien Bryois	2022-02-01	process
html	7f06226	Julien Bryois	2022-01-27	Build site.
Rmd	3a086fb	Julien Bryois	2022-01-27	Pseudo-bulk across cells expression added
html	f0a5d45	Julien Bryois	2022-01-24	Build site.
Rmd	5b8f23e	Julien Bryois	2022-01-24	Process Expression
html	e65a9c5	Julien Bryois	2021-12-14	Build site.
Rmd	4f7015d	Julien Bryois	2021-12-14	gene_soup_filter_sample_soup_filter_fastQTL_output2
html	15cfc19	Julien Bryois	2021-12-14	Build site.
Rmd	9a80a23	Julien Bryois	2021-12-14	QTLtools format update - bug fix
html	7c447bd	Julien Bryois	2021-12-14	Build site.
Rmd	dca9a01	Julien Bryois	2021-12-14	QTLtools format update
html	16a97be	Julien Bryois	2021-12-13	Build site.
Rmd	329b2da	Julien Bryois	2021-12-13	process_expression_no_soup_genes_no_soup_ind
html	f971404	Julien Bryois	2021-12-07	Build site.
Rmd	62a3cc3	Julien Bryois	2021-12-07	process expression without soupy genes
html	48c3088	Julien Bryois	2021-12-07	Build site.
Rmd	b05d4fc	Julien Bryois	2021-12-07	expression_filter_soupy_genes
html	f41b1ed	Julien Bryois	2021-12-07	Build site.
Rmd	a3b1cad	Julien Bryois	2021-12-07	expression_filter_soupy_genes
Rmd	fcde74d	Julien Bryois	2021-12-06	expression_filter_soupy_genes
html	0d629e6	Julien Bryois	2021-12-01	Build site.
Rmd	db82bd8	Julien Bryois	2021-12-01	Added code to extract expression of marker genes
html	7c6cfa8	Julien Bryois	2021-11-30	Build site.
Rmd	98eaf47	Julien Bryois	2021-11-30	TMM normalization for eQTL instead of CPM
html	67ca055	Julien Bryois	2021-11-29	Build site.
Rmd	4b8b916	Julien Bryois	2021-11-29	expression code added

library(tidyverse)
library(SingleCellExperiment)
library(parallel)

Files

Get individuals with both scRNA-seq and genotype information

individidual_genotype_scRNA <- read_tsv('data_sensitive/genotypes/individuals_with_genotype_scRNA.txt',
                                        col_names = FALSE)

MS expression

sce_ms <- readRDS('data_sensitive/sce/ms_sce_3.rds')

get annotation

annot <- read_csv('data_sensitive/sce/conos_labelled_2021-05-31.txt')

Only keep cells that were annotated

sce_ms <- sce_ms[,colnames(sce_ms)%in%annot$cell_id]

m <- match(colnames(sce_ms),annot$cell_id)
sce_ms$broad_annot <- annot$type_broad[m]

Some scRNA-seq were swapped and not corrected in the loaded sce object.

EU017 and EU018 –> swap WM189 and WM190 –> swap

Swap them here

sce_ms$library_id[sce_ms$library_id=='EU017'] <- 'tmp'
sce_ms$library_id[sce_ms$library_id=='EU018'] <- 'EU017'
sce_ms$library_id[sce_ms$library_id=='tmp'] <- 'EU018'

colnames(sce_ms) <- gsub('EU017','tmp',colnames(sce_ms))
colnames(sce_ms) <- gsub('EU018','EU017',colnames(sce_ms))
colnames(sce_ms) <- gsub('tmp','EU018',colnames(sce_ms))

sce_ms$library_id[sce_ms$library_id=='WM189'] <- 'tmp'
sce_ms$library_id[sce_ms$library_id=='WM190'] <- 'WM189'
sce_ms$library_id[sce_ms$library_id=='tmp'] <- 'WM190'

colnames(sce_ms) <- gsub('WM189','tmp',colnames(sce_ms))
colnames(sce_ms) <- gsub('WM190','WM189',colnames(sce_ms))
colnames(sce_ms) <- gsub('tmp','WM190',colnames(sce_ms))

Keep only individuals for which we have both scRNA-seq and expression

sce_ms <- sce_ms[,sce_ms$patient_id%in%individidual_genotype_scRNA$X1]

add individual_id column to the sce object

sce_ms$individual_id <- sce_ms$patient_id

Let’s remove the individual which is also in the AD dataset (less cells than the AD dataset)

sce_ms <- sce_ms[,sce_ms$individual_id!='S03-009']

Let’s remove samples that don’t match their genotype

sce_ms <- sce_ms[,!sce_ms$library_id%in%c('EU015','EU004','WM115','WM155','WM185')]

Let’s remove immune cells from the MS dataset (not observed in the AD dataset)

sce_ms <- sce_ms[,!sce_ms$broad_annot=='Immune']

Let’s set the gene name to ensembl (to merge with the AD sce dataset)

rownames(sce_ms) <- rowData(sce_ms)$ensembl

Finally, let’s save info about which sample corresponds to which individual in the filtered data

ms_sample_individual <- colData(sce_ms) %>% 
  as_tibble() %>% 
  dplyr::select(sample_id=library_id,individual_id,age,post_mortem_m,sex,tissue=matter) %>% 
  unique() %>% 
  write_tsv(.,'data_sensitive/expression/sample_individual_ms.txt',col_names = TRUE)

AD expression

sce_ad <- readRDS('data_sensitive/sce/sce.annotated8.rds')

Keep only individuals for which we have both scRNA-seq and expression

sce_ad <- sce_ad[,sce_ad$individual_id%in%individidual_genotype_scRNA$X1]

Loading required package: HDF5Array

Loading required package: rhdf5

Some individuals are present in multiple AD studies

(individuals_multiple_studies <- colData(sce_ad) %>% as_tibble() %>% 
  dplyr::select(individual_id,study,diagnosis) %>% 
  unique() %>% 
  add_count(individual_id) %>% 
  filter(n>1) %>% 
  arrange(individual_id))

# A tibble: 10 x 4
   individual_id study    diagnosis     n
   <chr>         <chr>    <chr>     <int>
 1 SM-CJGLH      Mathys   Ctrl          2
 2 SM-CJGLH      Zhou     Ctrl          2
 3 SM-CJIZ1      Mathys   Ctrl          2
 4 SM-CJIZ1      Zhou     Ctrl          2
 5 SM-CJJ35      Mathys   AD            2
 6 SM-CJJ35      Zhou     AD            2
 7 SM-CTDQW      Columbia AD            2
 8 SM-CTDQW      Mathys   AD            2
 9 SM-CTEM3      Mathys   Ctrl          2
10 SM-CTEM3      Zhou     Ctrl          2

count number of cells for these individuals

n_cells_per_id <- colData(sce_ad) %>% as_tibble() %>% 
  filter(individual_id%in%individuals_multiple_studies$individual_id) %>% 
  dplyr::count(individual_id,study) %>% 
    group_by(individual_id) %>% 
    mutate(keep=ifelse(n==max(n),TRUE,FALSE))

to_remove <- filter(n_cells_per_id,!keep) %>% 
  mutate(lab=paste0(individual_id,'_',study))

sce_ad$lab <- paste0(sce_ad$individual_id,'_',sce_ad$study)

Remove the sample with the least amount of cells (when two samples from the same individual are observed in two datasets)

sce_ad <- sce_ad[,!sce_ad$lab%in%to_remove$lab]

Get same cell type label as the MS dataset

sce_ad$broad_annot <- case_when(
  sce_ad$broad_annot=='Glutamatergic' ~ 'Excitatory neurons',
  sce_ad$broad_annot=='GABAergic' ~ 'Inhibitory neurons',
  sce_ad$broad_annot=='OPCs' ~ 'OPCs / COPs',
  sce_ad$broad_annot=='Endothelial' ~ 'Endothelial cells',
  TRUE ~ sce_ad$broad_annot)

Let’s set the gene name to ensembl (to merge with the AD sce dataset)

rownames(sce_ad) <- rowData(sce_ad)$ENSEMBL

Finally, let’s save info about which sample corresponds to which individual in the filtered data

ad_sample_individual <- colData(sce_ad) %>% 
  as_tibble() %>% 
  mutate(post_mortem_m=pmi*60) %>% 
  dplyr::select(sample_id,individual_id,age,post_mortem_m,sex,tissue) %>% 
  unique() %>% 
  write_tsv(.,'data_sensitive/expression/sample_individual_ad.txt',col_names = TRUE)

Get common genes

common_genes <-  intersect(rownames(sce_ms),rownames(sce_ad))

length(common_genes)

[1] 25938

sce_ms <- sce_ms[common_genes,]
sce_ad <- sce_ad[common_genes,]

Save ensembl - symbol names from ms dataset

ensembl2symbol <- rowData(sce_ms) %>% 
  as_tibble() %>% 
  dplyr::select(-type) %>% 
  write_tsv('data_sensitive/expression/ensembl2symbol_ms.txt')

Get covariates

For all samples, get:

Original dataset (Roche_AD,Columbia_AD,Mathys_AD,Zhou,MS)
Case-control status

ad_cov <- colData(sce_ad) %>% as_tibble() %>% 
  dplyr::select(individual_id,study,diagnosis) %>% 
  unique() %>% 
  mutate(study=paste0(study,'_AD'))

ms_cov <- colData(sce_ms) %>% as_tibble() %>% 
dplyr::select(individual_id,diagnosis=disease_status) %>% 
  mutate(study='MS') %>% 
  unique() %>% 
  mutate(diagnosis=case_when(
    diagnosis=='CTR' ~ 'Ctrl',
    TRUE ~ 'MS'
  )) %>% unique()

cov_all <- rbind(ad_cov,ms_cov)

cov_all_t <- cov_all %>% column_to_rownames('individual_id') %>% t() %>% as.data.frame() %>% rownames_to_column('id')

Load genotype PCs

cov_genotype_pcs <- read_tsv('data_sensitive/genotypes/pca_covariate_fastqtl.txt')

cov_all_t <- cov_all_t[,colnames(cov_genotype_pcs)]

Add covariates from metadata (study, case-control) to genotype PCs.

cov <- rbind(cov_genotype_pcs,cov_all_t) %>% as_tibble()

dir.create('data_sensitive/eqtl',showWarnings = FALSE)
write_tsv(cov,'data_sensitive/eqtl/covariate_pca_meta_fastqtl.txt')

Get Pseudo-bulk expression function

count_per_cluster <- function(sce,i,var){
  cnt_matrix <- sce[,sce$broad_annot==cell_type_id$V1[i] & sce[[var]]==cell_type_id$V2[i]]
  tot_counts <- Matrix::rowSums(counts(cnt_matrix))
  n_expressed <- Matrix::rowSums(counts(cnt_matrix)>0)
  n_cells <- ncol(cnt_matrix)
  cnt_df <- data.frame(counts=tot_counts,
                       n_expressed=n_expressed,
                       n_cells=n_cells) %>% 
    rownames_to_column('ensembl') %>% 
    mutate(perc_expressed=round(n_expressed*100/n_cells,digits=2),
           cell_type=cell_type_id$V1[i],
           individual_id=cell_type_id$V2[i]) %>% 
    inner_join(.,ensembl2symbol,by='ensembl') %>% 
  dplyr::select(cell_type,individual_id,ensembl,symbol,counts,n_expressed,n_cells,perc_expressed)
  
  if(var=='sample_id'){
    colnames(cnt_df)[2] <- 'sample_id'
  }
  return(cnt_df)
}

MS

if(!file.exists('data_sensitive/expression/ms_sum_expression.individual_id.rds')){
  
  cell_type_id <- cbind(as.character(sce_ms$broad_annot),sce_ms$individual_id) %>% unique() %>% as.data.frame()
  
  sum_expression_ms <- mclapply(1:nrow(cell_type_id),
                                count_per_cluster,
                                sce=sce_ms,
                                var='individual_id',
                                mc.cores = 36,
                                mc.preschedule = FALSE) %>% 
    bind_rows()
  
  saveRDS(sum_expression_ms,'data_sensitive/expression/ms_sum_expression.individual_id.rds')
  rm(sum_expression_ms)
  gc()
}

if(!file.exists('data_sensitive/expression/ms_sum_expression.sample_id.rds')){

  sce_ms$sample_id <- sce_ms$library_id
  cell_type_id <- cbind(as.character(sce_ms$broad_annot),sce_ms$sample_id) %>% unique() %>% as.data.frame()
  
  sum_expression_ms <- mclapply(1:nrow(cell_type_id),
                                count_per_cluster,
                                sce=sce_ms,
                                var='sample_id',
                                mc.cores = 36,
                                mc.preschedule = FALSE) %>% 
    bind_rows()
  
  saveRDS(sum_expression_ms,'data_sensitive/expression/ms_sum_expression.sample_id.rds')
  rm(sum_expression_ms)
  gc()
}

rm(sce_ms)
gc()

            used   (Mb) gc trigger    (Mb)    max used    (Mb)
Ncells   9370436  500.5   17791542   950.2    17791542   950.2
Vcells 163244320 1245.5 8907262452 67957.1 12459900815 95061.5

if(!file.exists('data_sensitive/expression/ad_sum_expression.individual_id.rds')){
  
  cell_type_id <- cbind(as.character(sce_ad$broad_annot),sce_ad$individual_id) %>% unique() %>% as.data.frame()
  
  sum_expression_ad <- mclapply(1:nrow(cell_type_id),
                                count_per_cluster,
                                sce=sce_ad,
                                var='individual_id',
                                mc.cores = 36,
                                mc.preschedule = FALSE) %>% 
    bind_rows()
  
  saveRDS(sum_expression_ad,'data_sensitive/expression/ad_sum_expression.individual_id.rds')
  rm(sum_expression_ad)
  gc()
}

if(!file.exists('data_sensitive/expression/ad_sum_expression.sample_id.rds')){
  
  cell_type_id <- cbind(as.character(sce_ad$broad_annot),sce_ad$sample_id) %>% unique() %>% as.data.frame()
  
  sum_expression_ad <- mclapply(1:nrow(cell_type_id),
                                count_per_cluster,
                                sce=sce_ad,
                                var='sample_id',
                                mc.cores = 36,
                                mc.preschedule = FALSE) %>% 
    bind_rows()
  
  saveRDS(sum_expression_ad,'data_sensitive/expression/ad_sum_expression.sample_id.rds')
  rm(sum_expression_ad)
  gc()
}

rm(sce_ad)
gc()

           used  (Mb) gc trigger    (Mb)    max used    (Mb)
Ncells  7516242 401.5   17791542   950.2    17791542   950.2
Vcells 21596850 164.8 7125809962 54365.7 12459900815 95061.5

Load Pseudo-bulk

sum_expression_ms <- readRDS('data_sensitive/expression/ms_sum_expression.individual_id.rds') %>% mutate(dataset='ms')
sum_expression_ad <- readRDS('data_sensitive/expression/ad_sum_expression.individual_id.rds') %>% mutate(dataset='ad')

Aggregate datasets

Let’s merge the MS and AD dataset.

sum_expression <- rbind(sum_expression_ms,sum_expression_ad)

Get counts per million (CPM)

sum_expression <- sum_expression %>% 
  group_by(cell_type,individual_id) %>% 
  mutate(counts_scaled_1M=counts*10^6/sum(counts)) %>% 
  ungroup() %>% 
  as_tibble()

Save expression of cell type marker genes

markers <- sum_expression %>% filter(symbol%in%c('SLC17A7','GAD2','AQP4','MOG','PDGFRA','C1QA','CLDN5','RGS5'))
saveRDS(markers,'data_sensitive/expression/cell_marker_expression.rds')

Get number of samples expressing genes and average expression

sumstats_gene <- sum_expression %>% group_by(cell_type,ensembl) %>% 
  summarise(counts_per_celltype=sum(counts),
         n_expressed_per_celltype=sum(n_expressed),
         n_cells_per_celltype=sum(n_cells),
         perc_expressed_per_celltype=round(n_expressed_per_celltype*100/n_cells_per_celltype,digits=2),
         n_samples_expressed=sum(counts>0),
         perc_samples_expressed=round(n_samples_expressed*100/length(counts),digits=2),
         mean_cpm=mean(counts_scaled_1M)) %>% 
  group_by(ensembl) %>% 
  mutate(prop_max_cpm=mean_cpm/max(mean_cpm)) %>% 
  ungroup()

Filter genes

Get Soupy genes

We will first compute the proportion of reads that could be attributed to ambient RNA (conservative estimate).

compute_ambient_per_sample <- function(id){
  message(id)
  #Get only counts for sample 'id'
  counts_id <- counts_per_cell_type %>% 
    filter(sample_id==id)
  
  #Get the cell types for which we have counts in sample 'id'
  cell_types_id <- counts_id %>% pull(cell_type) %>% unique()
  
  #Get only ambient RNA for sample 'id'
  ambient_id <- ambient %>% filter(sample_id==id) %>% 
    arrange(ensembl)
  
  #Compute ambient proportion for each gene in each cell type from sample 'id'
  ambient_prop <- lapply(1:length(cell_types_id), function(i){
    message(cell_types_id[i])
    counts_id_cell_type <- filter(counts_id,cell_type==cell_types_id[i]) %>% 
    arrange(ensembl)
    stopifnot(all(counts_id_cell_type$ensembl==ambient_id$ensembl))

    max_ambience <- DropletUtils::maximumAmbience(y = counts_id_cell_type$counts,
                                                  ambient = ambient_id$counts,
                                                  mode='proportion')
    out <- tibble(individual_id=id,cell_type=cell_types_id[i],ensembl=counts_id_cell_type$ensembl,soup=max_ambience)
  }) %>% 
    bind_rows()
  
  return(ambient_prop)
}

if(file.exists('data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')){
  ambient_prop_per_cell_type <- readRDS('data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')
} else{

  #Load pseudo-bulk counts for each sample in the dataset.
  sample_id_ms <- readRDS('data_sensitive/expression/ms_sum_expression.sample_id.rds')
  sample_id_ad <- readRDS('data_sensitive/expression/ad_sum_expression.sample_id.rds')
  
  counts_per_cell_type <- rbind(sample_id_ms,sample_id_ad) %>% 
    dplyr::select(cell_type,sample_id,ensembl,counts)
  
  #We also summed counts for all barcodes with less than 100UMI
  ambient_ad <- read_tsv('data_sensitive/ambient/ambient.100UMI.ad.txt')

  #Swap two samples (genotype of scRNA-seq matches another sample and vice-versa), already done in ad sce object
  colnames(ambient_ad)[colnames(ambient_ad)=='DWM-B3-24-Cog4-Path1-M'] <- 'tmp'
  colnames(ambient_ad)[colnames(ambient_ad)=='DWM-B3-22-Cog4-Path0-M'] <- 'DWM-B3-24-Cog4-Path1-M'
  colnames(ambient_ad)[colnames(ambient_ad)=='tmp'] <- 'DWM-B3-22-Cog4-Path0-M'
  
  #Correct scRNA-seq sample names (scRNA-seq was not matching metadata)
  colnames(ambient_ad)[colnames(ambient_ad)=='MFC-B2-10-Cog1-Path1-F'] <- 'MFC-B2-10-Cog1-Path0-M'
  colnames(ambient_ad)[colnames(ambient_ad)=='MFC-B2-11-Cog1-Path1-M'] <- 'MFC-B2-11-Cog1-Path1-F'

  #Make the ad ambient data long
  ambient_ad <- ambient_ad %>% 
    gather(sample_id,counts,-gene) %>% 
    mutate(ensembl=gsub('\\..+','',gene)) %>% 
    dplyr::select(-gene)
  
  #Now get ambient RNA for ms samples
  ambient_ms <- read_tsv('data_sensitive/ambient/ambient.100UMI.ms.txt')

  #Swap two samples (genotype of scRNA-seq matches another sample and vice-versa)
  colnames(ambient_ms)[colnames(ambient_ms)=='EU017'] <- 'tmp'
  colnames(ambient_ms)[colnames(ambient_ms)=='EU018'] <- 'EU017'
  colnames(ambient_ms)[colnames(ambient_ms)=='tmp'] <- 'EU018'
  
  colnames(ambient_ms)[colnames(ambient_ms)=='WM189'] <- 'tmp'
  colnames(ambient_ms)[colnames(ambient_ms)=='WM190'] <- 'WM189'
  colnames(ambient_ms)[colnames(ambient_ms)=='tmp'] <- 'WM190'
  
  #Make the ms ambient data long
  ambient_ms <- ambient_ms %>% 
    dplyr::select(-symbol) %>% 
    gather(sample_id,counts,-ensembl)
  
  #Aggregate ambient counts for ad and MS, only keep genes that are in the pseudo-bulks counts
  ambient <- rbind(ambient_ad,ambient_ms) %>% 
    filter(ensembl%in%unique(counts_per_cell_type$ensembl))
  
  samples <- unique(counts_per_cell_type$sample_id)
  samples_ambient <- unique(ambient$sample_id)
  stopifnot(all(samples%in%samples_ambient))
  
  #Estimate ambient RNA for each individual
  ambient_prop_all <- mclapply(samples,compute_ambient_per_sample,mc.cores = 16) %>% 
  bind_rows()
  saveRDS(ambient_prop_all,'data_sensitive/ambient/ambient.prop.estimate.droplet_utils.all.rds')

  
  ambient_prop_per_cell_type <- ambient_prop_all%>% 
  group_by(cell_type,ensembl) %>% 
  summarise(mean_soup=mean(soup,na.rm=TRUE),
            median_soup=median(soup,na.rm=TRUE),
            sd_soup=sd(soup,na.rm=TRUE),
            min_soup=min(soup,na.rm=TRUE),
            max_soup=max(soup,na.rm=TRUE),
            n_sample=n())
  
  saveRDS(ambient_prop_per_cell_type,'data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')
}

Filtering

Keep genes:

For which ambient RNA does not explain more than 10% of the read counts.
Expressed in at least 10 individuals
Expressed with a mean CPM of at least 1

ambient_keep <- filter(ambient_prop_per_cell_type,mean_soup<0.1)

gene_to_keep_per_cell_type <- sumstats_gene %>% 
  filter(n_samples_expressed>10,mean_cpm>1) %>% 
  dplyr::select(cell_type,ensembl) %>% 
  unique() %>% 
  inner_join(.,ambient_keep,by=c('cell_type','ensembl'))

sum_expression <- sum_expression %>% 
  inner_join(.,gene_to_keep_per_cell_type,by=c('cell_type','ensembl'))

Number of genes tested per cell type

(dplyr::count(sum_expression,cell_type,individual_id) %>% 
  dplyr::select(-individual_id) %>% 
  unique() %>% 
  arrange(-n))

# A tibble: 8 x 2
  cell_type              n
  <chr>              <int>
1 Excitatory neurons 14595
2 Inhibitory neurons 12943
3 Astrocytes         11436
4 Oligodendrocytes   10816
5 OPCs / COPs        10801
6 Endothelial cells  10778
7 Microglia           8887
8 Pericytes           6940

Filter individuals

Remove individuals with less than 10 cells for a given cell type.

(n_cells_ind_cell_type <- sum_expression %>% 
  dplyr::select(cell_type,individual_id,n_cells) %>% 
  unique() %>% arrange(n_cells))

# A tibble: 1,514 x 3
   cell_type          individual_id n_cells
   <chr>              <chr>           <int>
 1 Astrocytes         SD030/18            1
 2 Endothelial cells  SD015/18            1
 3 Pericytes          SD015/18            1
 4 Inhibitory neurons S13-096             1
 5 Pericytes          SM-CJGIM            1
 6 Endothelial cells  SM-CJGIM            1
 7 Endothelial cells  SM-CTEGO            1
 8 Endothelial cells  SM-CJFOL            1
 9 Endothelial cells  SM-CTECO            1
10 Inhibitory neurons SM-CJFOC            1
# … with 1,504 more rows

keep <- filter(n_cells_ind_cell_type,n_cells>=10)

Number of unique individuals per cell types for eQTL analysis

dplyr::count(keep,cell_type) %>% arrange(-n)

# A tibble: 8 x 2
  cell_type              n
  <chr>              <int>
1 Oligodendrocytes     192
2 Astrocytes           189
3 Excitatory neurons   187
4 Microglia            187
5 OPCs / COPs          187
6 Inhibitory neurons   178
7 Pericytes            150
8 Endothelial cells    144

sum_expression <- inner_join(sum_expression,keep,by=c('cell_type','individual_id','n_cells'))

Get gene coordinates

Get TSS coordinate for all genes.

gtf <- rtracklayer::import('data/gencode/Homo_sapiens.GRCh38.96.filtered.gtf') %>% 
  as.data.frame() %>% 
  dplyr::filter(type=='gene') %>% 
  mutate(TSS_start=ifelse(strand=='+',start,end),
         TSS_end=ifelse(strand=='+',start+1,end+1)) %>% 
  mutate(gene=paste0(gene_name,'_',gene_id)) %>% 
  dplyr::select(seqnames,TSS_start,TSS_end,gene) %>% 
  filter(seqnames%in%c(1:22))
gtf$seqnames <- droplevels(gtf$seqnames)

Set all samples to the same set of symbol and gene ID (like MS dataset).

gene_id_common <- sum_expression %>% 
  filter(dataset=='ms') %>% 
  dplyr::select(symbol,ensembl) %>% 
  unique() %>% 
  mutate(gene=paste0(symbol,'_',ensembl)) %>% dplyr::select(-symbol)

sum_expression <- sum_expression %>% 
  inner_join(.,gene_id_common,by='ensembl') %>% 
  dplyr::select(cell_type,individual_id,gene,counts,n_expressed,n_cells,perc_expressed,counts_scaled_1M)

saveRDS(sum_expression,'data_sensitive/expression/all_sum_expression.rds')

Get fastQTL format

Perform TMM normalization and get fastQTL format

get_fastqtl_pheno <- function(cell){
  #keep all expression for the cell type and patient that have genotype
  bed <- sum_expression %>% dplyr::filter(cell_type==cell) %>% 
    dplyr::select(individual_id,gene,counts) %>% 
    spread(individual_id,counts) %>% 
    column_to_rownames('gene') %>% 
    edgeR::DGEList(counts = .) %>% #TMM normalize the data using edgeR
    edgeR::calcNormFactors(.) %>% 
    edgeR::cpm(.) %>%
    as.data.frame() %>% 
    rownames_to_column('gene')
  
    bed <- bed %>% inner_join(.,gtf,by='gene') %>% 
      dplyr::select(seqnames,TSS_start,TSS_end,gene,everything()) %>% 
      arrange(seqnames,TSS_start)
    
    colnames(bed)[c(1,2,3,4)] <- c('#Chr','start','end','ID')
    
    write_tsv(bed,path = paste0('data_sensitive/eqtl/',make.names(cell),'.bed'))
}

cells <- unique(sum_expression$cell_type)

mclapply(cells,get_fastqtl_pheno,mc.cores=8)

cd data_sensitive/eqtl
ml tabix/0.2.6-goolf-1.7.20

for f in *.bed
do
bgzip $f && tabix -p bed $f.gz
done

Generate expression pca

geno_cov <- read_tsv('data_sensitive/eqtl/covariate_pca_meta_fastqtl.txt')

pheno_files <- list.files('data_sensitive/eqtl/',pattern='*.bed.gz$',full.names = T)

get_covariates <- function(i,folder_name){
  
  out_name <- basename(pheno_files[i]) %>% gsub('\\.bed\\.gz','',.) %>% paste0('.cov.txt')
  
  #load expression data
  d <- data.table::fread(pheno_files[i],data.table=FALSE)
  #perform pca
  pca <- d[-c(1,2,3,4)] %>% t() %>% prcomp(.,scale.=T)
  
  #Create directories with different number of expression pcs
  #Add the n first pcs to the covariate file and write the covariate file in fastqtl format
  number_of_pcs <- c(10,20,30,40,50,60,70,80,90,100,110,120)
  
  lapply(number_of_pcs,function(i){
    
      top_pcs <- pca$x[,1:i] %>% t()
      rownames(top_pcs) <- paste0(rownames(top_pcs),'_exp')
      top_pcs <- top_pcs %>% as.data.frame() %>% rownames_to_column('id')
  
      shared_samples <- intersect(colnames(geno_cov),colnames(top_pcs))
  
      cov <- rbind(geno_cov[,shared_samples],top_pcs[,shared_samples])
      dir.create(paste0(folder_name,i),showWarnings = FALSE)
      write_tsv(cov,paste0(folder_name,i,'/',out_name))
  })
}

mclapply(1:length(pheno_files),get_covariates,folder_name='data_sensitive/eqtl/PC',mc.cores=8)

cd data_sensitive/eqtl
for f in PC*
do
cd $f
gzip *
ln -s ../*.bed.gz .
ln -s ../*.bed.gz.tbi .
ln -s ../../genotypes/processed/combined_final.vcf.gz .
ln -s ../../genotypes/processed/combined_final.vcf.gz.tbi .
cd ..
done

Pseudo-bulk eQTL

Let’s generate pseudo-bulk expression data across all cell types.

First, we load the expression data at pseudo-bulk level for each cell type.

sum_expression_ms <- readRDS('data_sensitive/expression/ms_sum_expression.individual_id.rds') %>% mutate(dataset='ms')
sum_expression_ad <- readRDS('data_sensitive/expression/ad_sum_expression.individual_id.rds') %>% mutate(dataset='ad')

sum_expression <- rbind(sum_expression_ms,sum_expression_ad)

Set all samples to the same set of symbol and gene ID (like MS dataset).

gene_id_common <- sum_expression %>% 
  filter(dataset=='ms') %>% 
  dplyr::select(symbol,ensembl) %>% 
  unique() %>% 
  mutate(gene=paste0(symbol,'_',ensembl)) %>% dplyr::select(-symbol)

sum_expression <- sum_expression %>% 
  inner_join(.,gene_id_common,by='ensembl') %>% 
  dplyr::select(cell_type,individual_id,gene,counts)

Get pseudo-bulk expression for each individual accross all cell types

pb <- sum_expression %>% 
  group_by(individual_id,gene) %>% 
  summarise(counts=sum(counts)) %>% 
  ungroup()

`summarise()` regrouping output by 'individual_id' (override with `.groups` argument)

Only keep genes with a mean CPM > 1

pb <- pb %>% group_by(individual_id) %>% 
  mutate(cpm=counts*10^6/sum(counts)) %>% 
  group_by(gene) %>% 
  mutate(mean_cpm=mean(cpm)) %>% 
  ungroup() %>% 
  filter(mean_cpm>1)

Perform TMM normalization and get fastQTL format

dir.create('data_sensitive/eqtl_pb/',showWarnings = FALSE)

bed <- pb %>%
    dplyr::select(individual_id,gene,counts) %>% 
    spread(individual_id,counts) %>% 
    column_to_rownames('gene') %>% 
    edgeR::DGEList(counts = .) %>% #TMM normalize the data using edgeR
    edgeR::calcNormFactors(.) %>% 
    edgeR::cpm(.) %>%
    as.data.frame() %>% 
    rownames_to_column('gene')

bed <- bed %>% inner_join(.,gtf,by='gene') %>% 
      dplyr::select(seqnames,TSS_start,TSS_end,gene,everything()) %>% 
      arrange(seqnames,TSS_start)
    
colnames(bed)[c(1,2,3,4)] <- c('#Chr','start','end','ID')
    
write_tsv(bed,path = 'data_sensitive/eqtl_pb/pb.bed')

cd data_sensitive/eqtl_pb
ml tabix/0.2.6-goolf-1.7.20

for f in *.bed
do
bgzip $f && tabix -p bed $f.gz
done

Generate expression pca

pheno_files <- list.files('data_sensitive/eqtl_pb/',pattern='*.bed.gz$',full.names = T)

get_covariates <- function(i,folder_name){
  
  out_name <- basename(pheno_files[i]) %>% gsub('\\.bed\\.gz','',.) %>% paste0('.cov.txt')
  
  #load expression data
  d <- data.table::fread(pheno_files[i],data.table=FALSE)
  #perform pca
  pca <- d[-c(1,2,3,4)] %>% t() %>% prcomp(.,scale.=T)
  
  #Create directories with different number of expression pcs
  #Add the n first pcs to the covariate file and write the covariate file in fastqtl format
  number_of_pcs <- c(10,20,30,40,50,60,70,80,90,100,110,120)
  
  lapply(number_of_pcs,function(i){
    
      top_pcs <- pca$x[,1:i] %>% t()
      rownames(top_pcs) <- paste0(rownames(top_pcs),'_exp')
      top_pcs <- top_pcs %>% as.data.frame() %>% rownames_to_column('id')
  
      shared_samples <- intersect(colnames(geno_cov),colnames(top_pcs))
  
      cov <- rbind(geno_cov[,shared_samples],top_pcs[,shared_samples])
      dir.create(paste0(folder_name,i),showWarnings = FALSE)
      write_tsv(cov,paste0(folder_name,i,'/',out_name))
  })
}

lapply(1:length(pheno_files),get_covariates,folder_name='data_sensitive/eqtl_pb/PC')

cd data_sensitive/eqtl_pb
for f in PC*
do
cd $f
gzip *
ln -s ../*.bed.gz .
ln -s ../*.bed.gz.tbi .
ln -s ../../genotypes/processed/combined_final.vcf.gz .
ln -s ../../genotypes/processed/combined_final.vcf.gz.tbi .
cd ..
done

sessionInfo()

R version 4.0.1 (2020-06-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /pstore/apps/OpenBLAS/0.3.1-GCC-7.3.0-2.30/lib/libopenblasp-r0.3.1.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] HDF5Array_1.16.1            rhdf5_2.32.4               
 [3] SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.1
 [5] DelayedArray_0.14.0         matrixStats_0.56.0         
 [7] Biobase_2.48.0              GenomicRanges_1.40.0       
 [9] GenomeInfoDb_1.24.0         IRanges_2.22.2             
[11] S4Vectors_0.26.1            BiocGenerics_0.34.0        
[13] forcats_0.5.0               stringr_1.4.0              
[15] dplyr_1.0.0                 purrr_0.3.4                
[17] readr_1.3.1                 tidyr_1.1.0                
[19] tibble_3.0.1                ggplot2_3.3.3              
[21] tidyverse_1.3.0             workflowr_1.6.2            

loaded via a namespace (and not attached):
 [1] ggbeeswarm_0.6.0          colorspace_1.4-1         
 [3] ellipsis_0.3.1            rprojroot_1.3-2          
 [5] XVector_0.28.0            BiocNeighbors_1.6.0      
 [7] fs_1.4.1                  rstudioapi_0.11          
 [9] fansi_0.4.1               lubridate_1.7.9          
[11] xml2_1.3.2                R.methodsS3_1.8.0        
[13] knitr_1.28                scater_1.16.2            
[15] jsonlite_1.6.1            Rsamtools_2.4.0          
[17] broom_0.5.6               dbplyr_1.4.4             
[19] R.oo_1.23.0               compiler_4.0.1           
[21] httr_1.4.1                backports_1.1.7          
[23] assertthat_0.2.1          Matrix_1.2-18            
[25] limma_3.44.1              cli_2.0.2                
[27] later_1.1.0.1             BiocSingular_1.4.0       
[29] htmltools_0.5.1.1         tools_4.0.1              
[31] rsvd_1.0.3                gtable_0.3.0             
[33] glue_1.4.1                GenomeInfoDbData_1.2.3   
[35] Rcpp_1.0.6                cellranger_1.1.0         
[37] vctrs_0.3.1               Biostrings_2.56.0        
[39] nlme_3.1-148              rtracklayer_1.48.0       
[41] DelayedMatrixStats_1.10.1 xfun_0.14                
[43] rvest_0.3.5               lifecycle_0.2.0          
[45] irlba_2.3.3               XML_3.99-0.3             
[47] edgeR_3.30.3              zlibbioc_1.34.0          
[49] scales_1.1.1              hms_0.5.3                
[51] promises_1.1.1            yaml_2.2.1               
[53] gridExtra_2.3             stringi_1.4.6            
[55] BiocParallel_1.22.0       rlang_0.4.10             
[57] pkgconfig_2.0.3           bitops_1.0-6             
[59] evaluate_0.14             lattice_0.20-41          
[61] Rhdf5lib_1.10.0           GenomicAlignments_1.24.0 
[63] tidyselect_1.1.0          magrittr_1.5             
[65] R6_2.4.1                  generics_0.0.2           
[67] DBI_1.1.0                 pillar_1.4.4             
[69] haven_2.3.1               whisker_0.4              
[71] withr_2.2.0               RCurl_1.98-1.2           
[73] modelr_0.1.8              crayon_1.3.4             
[75] utf8_1.1.4                rmarkdown_2.2            
[77] viridis_0.5.1             locfit_1.5-9.4           
[79] grid_4.0.1                readxl_1.3.1             
[81] data.table_1.12.8         blob_1.2.1               
[83] git2r_0.27.1              reprex_0.3.0             
[85] digest_0.6.25             httpuv_1.5.4             
[87] R.utils_2.9.2             munsell_0.5.0            
[89] beeswarm_0.2.3            viridisLite_0.3.0        
[91] vipor_0.4.5

Process expression

Julien Bryois

February 01, 2022

Files

MS expression

AD expression

Get common genes

Get covariates

Get Pseudo-bulk expression function

MS

AD

Load Pseudo-bulk

Aggregate datasets

Get number of samples expressing genes and average expression

Filter genes

Get Soupy genes

Filtering

Filter individuals

Get gene coordinates

Get fastQTL format

Generate expression pca

Pseudo-bulk eQTL

Generate expression pca