Last updated: 2022-02-01

Checks: 7 0

Knit directory: snRNA_eqtl/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20211124) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7cdba32. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/magma_interaction_eqtl/
    Ignored:    data/magma_specific_genes/
    Ignored:    data_sensitive/

Untracked files:
    Untracked:  analysis/._index.Rmd
    Untracked:  analysis/additional/
    Untracked:  analysis/alternatives/
    Untracked:  analysis/clean_analysis.Rmd
    Untracked:  analysis/log_run_slurm_out
    Untracked:  analysis/log_run_slurm_stderr
    Untracked:  analysis/portfolio/
    Untracked:  analysis/revision.Rmd
    Untracked:  analysis/run_slurm.Rscript
    Untracked:  analysis/run_slurm.sbatch
    Untracked:  data/dice/
    Untracked:  data/epigenome_enrichment/
    Untracked:  data/gencode/
    Untracked:  data/gnomad_loeuf/
    Untracked:  data/gtex/
    Untracked:  data/gwas/
    Untracked:  data/gwas_epigenome_overlap/
    Untracked:  data/metabrain/
    Untracked:  data/umap/
    Untracked:  output/eqtl/
    Untracked:  output_almost_final/
    Untracked:  output_tmp_QTLtools/
    Untracked:  output_tmp_fastQTL_sample_ambient_removed/

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/02_run_eQTL.Rmd
    Modified:   analysis/04_coloc.Rmd
    Modified:   analysis/_site.yml
    Modified:   analysis/plot_figures.Rmd
    Deleted:    output/README.md

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/01_process_expression.Rmd) and HTML (docs/01_process_expression.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
html 7cdba32 Julien Bryois 2022-02-01 Build site.
Rmd 07b8f71 Julien Bryois 2022-02-01 process
html 7f06226 Julien Bryois 2022-01-27 Build site.
Rmd 3a086fb Julien Bryois 2022-01-27 Pseudo-bulk across cells expression added
html f0a5d45 Julien Bryois 2022-01-24 Build site.
Rmd 5b8f23e Julien Bryois 2022-01-24 Process Expression
html e65a9c5 Julien Bryois 2021-12-14 Build site.
Rmd 4f7015d Julien Bryois 2021-12-14 gene_soup_filter_sample_soup_filter_fastQTL_output2
html 15cfc19 Julien Bryois 2021-12-14 Build site.
Rmd 9a80a23 Julien Bryois 2021-12-14 QTLtools format update - bug fix
html 7c447bd Julien Bryois 2021-12-14 Build site.
Rmd dca9a01 Julien Bryois 2021-12-14 QTLtools format update
html 16a97be Julien Bryois 2021-12-13 Build site.
Rmd 329b2da Julien Bryois 2021-12-13 process_expression_no_soup_genes_no_soup_ind
html f971404 Julien Bryois 2021-12-07 Build site.
Rmd 62a3cc3 Julien Bryois 2021-12-07 process expression without soupy genes
html 48c3088 Julien Bryois 2021-12-07 Build site.
Rmd b05d4fc Julien Bryois 2021-12-07 expression_filter_soupy_genes
html f41b1ed Julien Bryois 2021-12-07 Build site.
Rmd a3b1cad Julien Bryois 2021-12-07 expression_filter_soupy_genes
Rmd fcde74d Julien Bryois 2021-12-06 expression_filter_soupy_genes
html 0d629e6 Julien Bryois 2021-12-01 Build site.
Rmd db82bd8 Julien Bryois 2021-12-01 Added code to extract expression of marker genes
html 7c6cfa8 Julien Bryois 2021-11-30 Build site.
Rmd 98eaf47 Julien Bryois 2021-11-30 TMM normalization for eQTL instead of CPM
html 67ca055 Julien Bryois 2021-11-29 Build site.
Rmd 4b8b916 Julien Bryois 2021-11-29 expression code added

library(tidyverse)
library(SingleCellExperiment)
library(parallel)

Files

Get individuals with both scRNA-seq and genotype information

individidual_genotype_scRNA <- read_tsv('data_sensitive/genotypes/individuals_with_genotype_scRNA.txt',
                                        col_names = FALSE)

MS expression

sce_ms <- readRDS('data_sensitive/sce/ms_sce_3.rds')

get annotation

annot <- read_csv('data_sensitive/sce/conos_labelled_2021-05-31.txt')

Only keep cells that were annotated

sce_ms <- sce_ms[,colnames(sce_ms)%in%annot$cell_id]
m <- match(colnames(sce_ms),annot$cell_id)
sce_ms$broad_annot <- annot$type_broad[m]

Some scRNA-seq were swapped and not corrected in the loaded sce object.

EU017 and EU018 –> swap WM189 and WM190 –> swap

Swap them here

sce_ms$library_id[sce_ms$library_id=='EU017'] <- 'tmp'
sce_ms$library_id[sce_ms$library_id=='EU018'] <- 'EU017'
sce_ms$library_id[sce_ms$library_id=='tmp'] <- 'EU018'

colnames(sce_ms) <- gsub('EU017','tmp',colnames(sce_ms))
colnames(sce_ms) <- gsub('EU018','EU017',colnames(sce_ms))
colnames(sce_ms) <- gsub('tmp','EU018',colnames(sce_ms))
sce_ms$library_id[sce_ms$library_id=='WM189'] <- 'tmp'
sce_ms$library_id[sce_ms$library_id=='WM190'] <- 'WM189'
sce_ms$library_id[sce_ms$library_id=='tmp'] <- 'WM190'

colnames(sce_ms) <- gsub('WM189','tmp',colnames(sce_ms))
colnames(sce_ms) <- gsub('WM190','WM189',colnames(sce_ms))
colnames(sce_ms) <- gsub('tmp','WM190',colnames(sce_ms))

Keep only individuals for which we have both scRNA-seq and expression

sce_ms <- sce_ms[,sce_ms$patient_id%in%individidual_genotype_scRNA$X1]

add individual_id column to the sce object

sce_ms$individual_id <- sce_ms$patient_id

Let’s remove the individual which is also in the AD dataset (less cells than the AD dataset)

sce_ms <- sce_ms[,sce_ms$individual_id!='S03-009']

Let’s remove samples that don’t match their genotype

sce_ms <- sce_ms[,!sce_ms$library_id%in%c('EU015','EU004','WM115','WM155','WM185')]

Let’s remove immune cells from the MS dataset (not observed in the AD dataset)

sce_ms <- sce_ms[,!sce_ms$broad_annot=='Immune']

Let’s set the gene name to ensembl (to merge with the AD sce dataset)

rownames(sce_ms) <- rowData(sce_ms)$ensembl

Finally, let’s save info about which sample corresponds to which individual in the filtered data

ms_sample_individual <- colData(sce_ms) %>% 
  as_tibble() %>% 
  dplyr::select(sample_id=library_id,individual_id,age,post_mortem_m,sex,tissue=matter) %>% 
  unique() %>% 
  write_tsv(.,'data_sensitive/expression/sample_individual_ms.txt',col_names = TRUE)

AD expression

sce_ad <- readRDS('data_sensitive/sce/sce.annotated8.rds')

Keep only individuals for which we have both scRNA-seq and expression

sce_ad <- sce_ad[,sce_ad$individual_id%in%individidual_genotype_scRNA$X1]
Loading required package: HDF5Array
Loading required package: rhdf5

Some individuals are present in multiple AD studies

(individuals_multiple_studies <- colData(sce_ad) %>% as_tibble() %>% 
  dplyr::select(individual_id,study,diagnosis) %>% 
  unique() %>% 
  add_count(individual_id) %>% 
  filter(n>1) %>% 
  arrange(individual_id))
# A tibble: 10 x 4
   individual_id study    diagnosis     n
   <chr>         <chr>    <chr>     <int>
 1 SM-CJGLH      Mathys   Ctrl          2
 2 SM-CJGLH      Zhou     Ctrl          2
 3 SM-CJIZ1      Mathys   Ctrl          2
 4 SM-CJIZ1      Zhou     Ctrl          2
 5 SM-CJJ35      Mathys   AD            2
 6 SM-CJJ35      Zhou     AD            2
 7 SM-CTDQW      Columbia AD            2
 8 SM-CTDQW      Mathys   AD            2
 9 SM-CTEM3      Mathys   Ctrl          2
10 SM-CTEM3      Zhou     Ctrl          2

count number of cells for these individuals

n_cells_per_id <- colData(sce_ad) %>% as_tibble() %>% 
  filter(individual_id%in%individuals_multiple_studies$individual_id) %>% 
  dplyr::count(individual_id,study) %>% 
    group_by(individual_id) %>% 
    mutate(keep=ifelse(n==max(n),TRUE,FALSE))
to_remove <- filter(n_cells_per_id,!keep) %>% 
  mutate(lab=paste0(individual_id,'_',study))
sce_ad$lab <- paste0(sce_ad$individual_id,'_',sce_ad$study)

Remove the sample with the least amount of cells (when two samples from the same individual are observed in two datasets)

sce_ad <- sce_ad[,!sce_ad$lab%in%to_remove$lab]

Get same cell type label as the MS dataset

sce_ad$broad_annot <- case_when(
  sce_ad$broad_annot=='Glutamatergic' ~ 'Excitatory neurons',
  sce_ad$broad_annot=='GABAergic' ~ 'Inhibitory neurons',
  sce_ad$broad_annot=='OPCs' ~ 'OPCs / COPs',
  sce_ad$broad_annot=='Endothelial' ~ 'Endothelial cells',
  TRUE ~ sce_ad$broad_annot)

Let’s set the gene name to ensembl (to merge with the AD sce dataset)

rownames(sce_ad) <- rowData(sce_ad)$ENSEMBL

Finally, let’s save info about which sample corresponds to which individual in the filtered data

ad_sample_individual <- colData(sce_ad) %>% 
  as_tibble() %>% 
  mutate(post_mortem_m=pmi*60) %>% 
  dplyr::select(sample_id,individual_id,age,post_mortem_m,sex,tissue) %>% 
  unique() %>% 
  write_tsv(.,'data_sensitive/expression/sample_individual_ad.txt',col_names = TRUE)

Get common genes

common_genes <-  intersect(rownames(sce_ms),rownames(sce_ad))
length(common_genes)
[1] 25938
sce_ms <- sce_ms[common_genes,]
sce_ad <- sce_ad[common_genes,]

Save ensembl - symbol names from ms dataset

ensembl2symbol <- rowData(sce_ms) %>% 
  as_tibble() %>% 
  dplyr::select(-type) %>% 
  write_tsv('data_sensitive/expression/ensembl2symbol_ms.txt')

Get covariates

For all samples, get:

  1. Original dataset (Roche_AD,Columbia_AD,Mathys_AD,Zhou,MS)
  2. Case-control status
ad_cov <- colData(sce_ad) %>% as_tibble() %>% 
  dplyr::select(individual_id,study,diagnosis) %>% 
  unique() %>% 
  mutate(study=paste0(study,'_AD'))
ms_cov <- colData(sce_ms) %>% as_tibble() %>% 
dplyr::select(individual_id,diagnosis=disease_status) %>% 
  mutate(study='MS') %>% 
  unique() %>% 
  mutate(diagnosis=case_when(
    diagnosis=='CTR' ~ 'Ctrl',
    TRUE ~ 'MS'
  )) %>% unique()
cov_all <- rbind(ad_cov,ms_cov)
cov_all_t <- cov_all %>% column_to_rownames('individual_id') %>% t() %>% as.data.frame() %>% rownames_to_column('id')

Load genotype PCs

cov_genotype_pcs <- read_tsv('data_sensitive/genotypes/pca_covariate_fastqtl.txt')
cov_all_t <- cov_all_t[,colnames(cov_genotype_pcs)]

Add covariates from metadata (study, case-control) to genotype PCs.

cov <- rbind(cov_genotype_pcs,cov_all_t) %>% as_tibble()
dir.create('data_sensitive/eqtl',showWarnings = FALSE)
write_tsv(cov,'data_sensitive/eqtl/covariate_pca_meta_fastqtl.txt')

Get Pseudo-bulk expression function

count_per_cluster <- function(sce,i,var){
  cnt_matrix <- sce[,sce$broad_annot==cell_type_id$V1[i] & sce[[var]]==cell_type_id$V2[i]]
  tot_counts <- Matrix::rowSums(counts(cnt_matrix))
  n_expressed <- Matrix::rowSums(counts(cnt_matrix)>0)
  n_cells <- ncol(cnt_matrix)
  cnt_df <- data.frame(counts=tot_counts,
                       n_expressed=n_expressed,
                       n_cells=n_cells) %>% 
    rownames_to_column('ensembl') %>% 
    mutate(perc_expressed=round(n_expressed*100/n_cells,digits=2),
           cell_type=cell_type_id$V1[i],
           individual_id=cell_type_id$V2[i]) %>% 
    inner_join(.,ensembl2symbol,by='ensembl') %>% 
  dplyr::select(cell_type,individual_id,ensembl,symbol,counts,n_expressed,n_cells,perc_expressed)
  
  if(var=='sample_id'){
    colnames(cnt_df)[2] <- 'sample_id'
  }
  return(cnt_df)
}

MS

if(!file.exists('data_sensitive/expression/ms_sum_expression.individual_id.rds')){
  
  cell_type_id <- cbind(as.character(sce_ms$broad_annot),sce_ms$individual_id) %>% unique() %>% as.data.frame()
  
  sum_expression_ms <- mclapply(1:nrow(cell_type_id),
                                count_per_cluster,
                                sce=sce_ms,
                                var='individual_id',
                                mc.cores = 36,
                                mc.preschedule = FALSE) %>% 
    bind_rows()
  
  saveRDS(sum_expression_ms,'data_sensitive/expression/ms_sum_expression.individual_id.rds')
  rm(sum_expression_ms)
  gc()
}
if(!file.exists('data_sensitive/expression/ms_sum_expression.sample_id.rds')){

  sce_ms$sample_id <- sce_ms$library_id
  cell_type_id <- cbind(as.character(sce_ms$broad_annot),sce_ms$sample_id) %>% unique() %>% as.data.frame()
  
  sum_expression_ms <- mclapply(1:nrow(cell_type_id),
                                count_per_cluster,
                                sce=sce_ms,
                                var='sample_id',
                                mc.cores = 36,
                                mc.preschedule = FALSE) %>% 
    bind_rows()
  
  saveRDS(sum_expression_ms,'data_sensitive/expression/ms_sum_expression.sample_id.rds')
  rm(sum_expression_ms)
  gc()
}
rm(sce_ms)
gc()
            used   (Mb) gc trigger    (Mb)    max used    (Mb)
Ncells   9370436  500.5   17791542   950.2    17791542   950.2
Vcells 163244320 1245.5 8907262452 67957.1 12459900815 95061.5

Load Pseudo-bulk

sum_expression_ms <- readRDS('data_sensitive/expression/ms_sum_expression.individual_id.rds') %>% mutate(dataset='ms')
sum_expression_ad <- readRDS('data_sensitive/expression/ad_sum_expression.individual_id.rds') %>% mutate(dataset='ad')

Aggregate datasets

Let’s merge the MS and AD dataset.

sum_expression <- rbind(sum_expression_ms,sum_expression_ad)

Get counts per million (CPM)

sum_expression <- sum_expression %>% 
  group_by(cell_type,individual_id) %>% 
  mutate(counts_scaled_1M=counts*10^6/sum(counts)) %>% 
  ungroup() %>% 
  as_tibble()

Save expression of cell type marker genes

markers <- sum_expression %>% filter(symbol%in%c('SLC17A7','GAD2','AQP4','MOG','PDGFRA','C1QA','CLDN5','RGS5'))
saveRDS(markers,'data_sensitive/expression/cell_marker_expression.rds')

Get number of samples expressing genes and average expression

sumstats_gene <- sum_expression %>% group_by(cell_type,ensembl) %>% 
  summarise(counts_per_celltype=sum(counts),
         n_expressed_per_celltype=sum(n_expressed),
         n_cells_per_celltype=sum(n_cells),
         perc_expressed_per_celltype=round(n_expressed_per_celltype*100/n_cells_per_celltype,digits=2),
         n_samples_expressed=sum(counts>0),
         perc_samples_expressed=round(n_samples_expressed*100/length(counts),digits=2),
         mean_cpm=mean(counts_scaled_1M)) %>% 
  group_by(ensembl) %>% 
  mutate(prop_max_cpm=mean_cpm/max(mean_cpm)) %>% 
  ungroup()

Filter genes

Get Soupy genes

We will first compute the proportion of reads that could be attributed to ambient RNA (conservative estimate).

compute_ambient_per_sample <- function(id){
  message(id)
  #Get only counts for sample 'id'
  counts_id <- counts_per_cell_type %>% 
    filter(sample_id==id)
  
  #Get the cell types for which we have counts in sample 'id'
  cell_types_id <- counts_id %>% pull(cell_type) %>% unique()
  
  #Get only ambient RNA for sample 'id'
  ambient_id <- ambient %>% filter(sample_id==id) %>% 
    arrange(ensembl)
  
  #Compute ambient proportion for each gene in each cell type from sample 'id'
  ambient_prop <- lapply(1:length(cell_types_id), function(i){
    message(cell_types_id[i])
    counts_id_cell_type <- filter(counts_id,cell_type==cell_types_id[i]) %>% 
    arrange(ensembl)
    stopifnot(all(counts_id_cell_type$ensembl==ambient_id$ensembl))

    max_ambience <- DropletUtils::maximumAmbience(y = counts_id_cell_type$counts,
                                                  ambient = ambient_id$counts,
                                                  mode='proportion')
    out <- tibble(individual_id=id,cell_type=cell_types_id[i],ensembl=counts_id_cell_type$ensembl,soup=max_ambience)
  }) %>% 
    bind_rows()
  
  return(ambient_prop)
}
if(file.exists('data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')){
  ambient_prop_per_cell_type <- readRDS('data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')
} else{

  #Load pseudo-bulk counts for each sample in the dataset.
  sample_id_ms <- readRDS('data_sensitive/expression/ms_sum_expression.sample_id.rds')
  sample_id_ad <- readRDS('data_sensitive/expression/ad_sum_expression.sample_id.rds')
  
  counts_per_cell_type <- rbind(sample_id_ms,sample_id_ad) %>% 
    dplyr::select(cell_type,sample_id,ensembl,counts)
  
  #We also summed counts for all barcodes with less than 100UMI
  ambient_ad <- read_tsv('data_sensitive/ambient/ambient.100UMI.ad.txt')

  #Swap two samples (genotype of scRNA-seq matches another sample and vice-versa), already done in ad sce object
  colnames(ambient_ad)[colnames(ambient_ad)=='DWM-B3-24-Cog4-Path1-M'] <- 'tmp'
  colnames(ambient_ad)[colnames(ambient_ad)=='DWM-B3-22-Cog4-Path0-M'] <- 'DWM-B3-24-Cog4-Path1-M'
  colnames(ambient_ad)[colnames(ambient_ad)=='tmp'] <- 'DWM-B3-22-Cog4-Path0-M'
  
  #Correct scRNA-seq sample names (scRNA-seq was not matching metadata)
  colnames(ambient_ad)[colnames(ambient_ad)=='MFC-B2-10-Cog1-Path1-F'] <- 'MFC-B2-10-Cog1-Path0-M'
  colnames(ambient_ad)[colnames(ambient_ad)=='MFC-B2-11-Cog1-Path1-M'] <- 'MFC-B2-11-Cog1-Path1-F'

  #Make the ad ambient data long
  ambient_ad <- ambient_ad %>% 
    gather(sample_id,counts,-gene) %>% 
    mutate(ensembl=gsub('\\..+','',gene)) %>% 
    dplyr::select(-gene)
  
  #Now get ambient RNA for ms samples
  ambient_ms <- read_tsv('data_sensitive/ambient/ambient.100UMI.ms.txt')

  #Swap two samples (genotype of scRNA-seq matches another sample and vice-versa)
  colnames(ambient_ms)[colnames(ambient_ms)=='EU017'] <- 'tmp'
  colnames(ambient_ms)[colnames(ambient_ms)=='EU018'] <- 'EU017'
  colnames(ambient_ms)[colnames(ambient_ms)=='tmp'] <- 'EU018'
  
  colnames(ambient_ms)[colnames(ambient_ms)=='WM189'] <- 'tmp'
  colnames(ambient_ms)[colnames(ambient_ms)=='WM190'] <- 'WM189'
  colnames(ambient_ms)[colnames(ambient_ms)=='tmp'] <- 'WM190'
  
  #Make the ms ambient data long
  ambient_ms <- ambient_ms %>% 
    dplyr::select(-symbol) %>% 
    gather(sample_id,counts,-ensembl)
  
  #Aggregate ambient counts for ad and MS, only keep genes that are in the pseudo-bulks counts
  ambient <- rbind(ambient_ad,ambient_ms) %>% 
    filter(ensembl%in%unique(counts_per_cell_type$ensembl))
  
  samples <- unique(counts_per_cell_type$sample_id)
  samples_ambient <- unique(ambient$sample_id)
  stopifnot(all(samples%in%samples_ambient))
  
  #Estimate ambient RNA for each individual
  ambient_prop_all <- mclapply(samples,compute_ambient_per_sample,mc.cores = 16) %>% 
  bind_rows()
  saveRDS(ambient_prop_all,'data_sensitive/ambient/ambient.prop.estimate.droplet_utils.all.rds')

  
  ambient_prop_per_cell_type <- ambient_prop_all%>% 
  group_by(cell_type,ensembl) %>% 
  summarise(mean_soup=mean(soup,na.rm=TRUE),
            median_soup=median(soup,na.rm=TRUE),
            sd_soup=sd(soup,na.rm=TRUE),
            min_soup=min(soup,na.rm=TRUE),
            max_soup=max(soup,na.rm=TRUE),
            n_sample=n())
  
  saveRDS(ambient_prop_per_cell_type,'data_sensitive/ambient/ambient.prop.estimate.droplet_utils.rds')
}

Filtering

Keep genes:

  1. For which ambient RNA does not explain more than 10% of the read counts.
  2. Expressed in at least 10 individuals
  3. Expressed with a mean CPM of at least 1
ambient_keep <- filter(ambient_prop_per_cell_type,mean_soup<0.1)
gene_to_keep_per_cell_type <- sumstats_gene %>% 
  filter(n_samples_expressed>10,mean_cpm>1) %>% 
  dplyr::select(cell_type,ensembl) %>% 
  unique() %>% 
  inner_join(.,ambient_keep,by=c('cell_type','ensembl'))
sum_expression <- sum_expression %>% 
  inner_join(.,gene_to_keep_per_cell_type,by=c('cell_type','ensembl'))

Number of genes tested per cell type

(dplyr::count(sum_expression,cell_type,individual_id) %>% 
  dplyr::select(-individual_id) %>% 
  unique() %>% 
  arrange(-n))
# A tibble: 8 x 2
  cell_type              n
  <chr>              <int>
1 Excitatory neurons 14595
2 Inhibitory neurons 12943
3 Astrocytes         11436
4 Oligodendrocytes   10816
5 OPCs / COPs        10801
6 Endothelial cells  10778
7 Microglia           8887
8 Pericytes           6940

Filter individuals

Remove individuals with less than 10 cells for a given cell type.

(n_cells_ind_cell_type <- sum_expression %>% 
  dplyr::select(cell_type,individual_id,n_cells) %>% 
  unique() %>% arrange(n_cells))
# A tibble: 1,514 x 3
   cell_type          individual_id n_cells
   <chr>              <chr>           <int>
 1 Astrocytes         SD030/18            1
 2 Endothelial cells  SD015/18            1
 3 Pericytes          SD015/18            1
 4 Inhibitory neurons S13-096             1
 5 Pericytes          SM-CJGIM            1
 6 Endothelial cells  SM-CJGIM            1
 7 Endothelial cells  SM-CTEGO            1
 8 Endothelial cells  SM-CJFOL            1
 9 Endothelial cells  SM-CTECO            1
10 Inhibitory neurons SM-CJFOC            1
# … with 1,504 more rows
keep <- filter(n_cells_ind_cell_type,n_cells>=10)

Number of unique individuals per cell types for eQTL analysis

dplyr::count(keep,cell_type) %>% arrange(-n)
# A tibble: 8 x 2
  cell_type              n
  <chr>              <int>
1 Oligodendrocytes     192
2 Astrocytes           189
3 Excitatory neurons   187
4 Microglia            187
5 OPCs / COPs          187
6 Inhibitory neurons   178
7 Pericytes            150
8 Endothelial cells    144
sum_expression <- inner_join(sum_expression,keep,by=c('cell_type','individual_id','n_cells'))

Get gene coordinates

Get TSS coordinate for all genes.

gtf <- rtracklayer::import('data/gencode/Homo_sapiens.GRCh38.96.filtered.gtf') %>% 
  as.data.frame() %>% 
  dplyr::filter(type=='gene') %>% 
  mutate(TSS_start=ifelse(strand=='+',start,end),
         TSS_end=ifelse(strand=='+',start+1,end+1)) %>% 
  mutate(gene=paste0(gene_name,'_',gene_id)) %>% 
  dplyr::select(seqnames,TSS_start,TSS_end,gene) %>% 
  filter(seqnames%in%c(1:22))
gtf$seqnames <- droplevels(gtf$seqnames)

Set all samples to the same set of symbol and gene ID (like MS dataset).

gene_id_common <- sum_expression %>% 
  filter(dataset=='ms') %>% 
  dplyr::select(symbol,ensembl) %>% 
  unique() %>% 
  mutate(gene=paste0(symbol,'_',ensembl)) %>% dplyr::select(-symbol)
sum_expression <- sum_expression %>% 
  inner_join(.,gene_id_common,by='ensembl') %>% 
  dplyr::select(cell_type,individual_id,gene,counts,n_expressed,n_cells,perc_expressed,counts_scaled_1M)
saveRDS(sum_expression,'data_sensitive/expression/all_sum_expression.rds')

Get fastQTL format

Perform TMM normalization and get fastQTL format

get_fastqtl_pheno <- function(cell){
  #keep all expression for the cell type and patient that have genotype
  bed <- sum_expression %>% dplyr::filter(cell_type==cell) %>% 
    dplyr::select(individual_id,gene,counts) %>% 
    spread(individual_id,counts) %>% 
    column_to_rownames('gene') %>% 
    edgeR::DGEList(counts = .) %>% #TMM normalize the data using edgeR
    edgeR::calcNormFactors(.) %>% 
    edgeR::cpm(.) %>%
    as.data.frame() %>% 
    rownames_to_column('gene')
  
    bed <- bed %>% inner_join(.,gtf,by='gene') %>% 
      dplyr::select(seqnames,TSS_start,TSS_end,gene,everything()) %>% 
      arrange(seqnames,TSS_start)
    
    colnames(bed)[c(1,2,3,4)] <- c('#Chr','start','end','ID')
    
    write_tsv(bed,path = paste0('data_sensitive/eqtl/',make.names(cell),'.bed'))
}    
cells <- unique(sum_expression$cell_type)
mclapply(cells,get_fastqtl_pheno,mc.cores=8)
cd data_sensitive/eqtl
ml tabix/0.2.6-goolf-1.7.20

for f in *.bed
do
bgzip $f && tabix -p bed $f.gz
done

Generate expression pca

geno_cov <- read_tsv('data_sensitive/eqtl/covariate_pca_meta_fastqtl.txt')
pheno_files <- list.files('data_sensitive/eqtl/',pattern='*.bed.gz$',full.names = T)
get_covariates <- function(i,folder_name){
  
  out_name <- basename(pheno_files[i]) %>% gsub('\\.bed\\.gz','',.) %>% paste0('.cov.txt')
  
  #load expression data
  d <- data.table::fread(pheno_files[i],data.table=FALSE)
  #perform pca
  pca <- d[-c(1,2,3,4)] %>% t() %>% prcomp(.,scale.=T)
  
  #Create directories with different number of expression pcs
  #Add the n first pcs to the covariate file and write the covariate file in fastqtl format
  number_of_pcs <- c(10,20,30,40,50,60,70,80,90,100,110,120)
  
  lapply(number_of_pcs,function(i){
    
      top_pcs <- pca$x[,1:i] %>% t()
      rownames(top_pcs) <- paste0(rownames(top_pcs),'_exp')
      top_pcs <- top_pcs %>% as.data.frame() %>% rownames_to_column('id')
  
      shared_samples <- intersect(colnames(geno_cov),colnames(top_pcs))
  
      cov <- rbind(geno_cov[,shared_samples],top_pcs[,shared_samples])
      dir.create(paste0(folder_name,i),showWarnings = FALSE)
      write_tsv(cov,paste0(folder_name,i,'/',out_name))
  })
}
mclapply(1:length(pheno_files),get_covariates,folder_name='data_sensitive/eqtl/PC',mc.cores=8)
cd data_sensitive/eqtl
for f in PC*
do
cd $f
gzip *
ln -s ../*.bed.gz .
ln -s ../*.bed.gz.tbi .
ln -s ../../genotypes/processed/combined_final.vcf.gz .
ln -s ../../genotypes/processed/combined_final.vcf.gz.tbi .
cd ..
done

Pseudo-bulk eQTL

Let’s generate pseudo-bulk expression data across all cell types.

First, we load the expression data at pseudo-bulk level for each cell type.

sum_expression_ms <- readRDS('data_sensitive/expression/ms_sum_expression.individual_id.rds') %>% mutate(dataset='ms')
sum_expression_ad <- readRDS('data_sensitive/expression/ad_sum_expression.individual_id.rds') %>% mutate(dataset='ad')
sum_expression <- rbind(sum_expression_ms,sum_expression_ad)

Set all samples to the same set of symbol and gene ID (like MS dataset).

gene_id_common <- sum_expression %>% 
  filter(dataset=='ms') %>% 
  dplyr::select(symbol,ensembl) %>% 
  unique() %>% 
  mutate(gene=paste0(symbol,'_',ensembl)) %>% dplyr::select(-symbol)
sum_expression <- sum_expression %>% 
  inner_join(.,gene_id_common,by='ensembl') %>% 
  dplyr::select(cell_type,individual_id,gene,counts)

Get pseudo-bulk expression for each individual accross all cell types

pb <- sum_expression %>% 
  group_by(individual_id,gene) %>% 
  summarise(counts=sum(counts)) %>% 
  ungroup()
`summarise()` regrouping output by 'individual_id' (override with `.groups` argument)

Only keep genes with a mean CPM > 1

pb <- pb %>% group_by(individual_id) %>% 
  mutate(cpm=counts*10^6/sum(counts)) %>% 
  group_by(gene) %>% 
  mutate(mean_cpm=mean(cpm)) %>% 
  ungroup() %>% 
  filter(mean_cpm>1)

Perform TMM normalization and get fastQTL format

dir.create('data_sensitive/eqtl_pb/',showWarnings = FALSE)

bed <- pb %>%
    dplyr::select(individual_id,gene,counts) %>% 
    spread(individual_id,counts) %>% 
    column_to_rownames('gene') %>% 
    edgeR::DGEList(counts = .) %>% #TMM normalize the data using edgeR
    edgeR::calcNormFactors(.) %>% 
    edgeR::cpm(.) %>%
    as.data.frame() %>% 
    rownames_to_column('gene')

bed <- bed %>% inner_join(.,gtf,by='gene') %>% 
      dplyr::select(seqnames,TSS_start,TSS_end,gene,everything()) %>% 
      arrange(seqnames,TSS_start)
    
colnames(bed)[c(1,2,3,4)] <- c('#Chr','start','end','ID')
    
write_tsv(bed,path = 'data_sensitive/eqtl_pb/pb.bed')
cd data_sensitive/eqtl_pb
ml tabix/0.2.6-goolf-1.7.20

for f in *.bed
do
bgzip $f && tabix -p bed $f.gz
done

Generate expression pca

pheno_files <- list.files('data_sensitive/eqtl_pb/',pattern='*.bed.gz$',full.names = T)
get_covariates <- function(i,folder_name){
  
  out_name <- basename(pheno_files[i]) %>% gsub('\\.bed\\.gz','',.) %>% paste0('.cov.txt')
  
  #load expression data
  d <- data.table::fread(pheno_files[i],data.table=FALSE)
  #perform pca
  pca <- d[-c(1,2,3,4)] %>% t() %>% prcomp(.,scale.=T)
  
  #Create directories with different number of expression pcs
  #Add the n first pcs to the covariate file and write the covariate file in fastqtl format
  number_of_pcs <- c(10,20,30,40,50,60,70,80,90,100,110,120)
  
  lapply(number_of_pcs,function(i){
    
      top_pcs <- pca$x[,1:i] %>% t()
      rownames(top_pcs) <- paste0(rownames(top_pcs),'_exp')
      top_pcs <- top_pcs %>% as.data.frame() %>% rownames_to_column('id')
  
      shared_samples <- intersect(colnames(geno_cov),colnames(top_pcs))
  
      cov <- rbind(geno_cov[,shared_samples],top_pcs[,shared_samples])
      dir.create(paste0(folder_name,i),showWarnings = FALSE)
      write_tsv(cov,paste0(folder_name,i,'/',out_name))
  })
}
lapply(1:length(pheno_files),get_covariates,folder_name='data_sensitive/eqtl_pb/PC')
cd data_sensitive/eqtl_pb
for f in PC*
do
cd $f
gzip *
ln -s ../*.bed.gz .
ln -s ../*.bed.gz.tbi .
ln -s ../../genotypes/processed/combined_final.vcf.gz .
ln -s ../../genotypes/processed/combined_final.vcf.gz.tbi .
cd ..
done

sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /pstore/apps/OpenBLAS/0.3.1-GCC-7.3.0-2.30/lib/libopenblasp-r0.3.1.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] HDF5Array_1.16.1            rhdf5_2.32.4               
 [3] SingleCellExperiment_1.10.1 SummarizedExperiment_1.18.1
 [5] DelayedArray_0.14.0         matrixStats_0.56.0         
 [7] Biobase_2.48.0              GenomicRanges_1.40.0       
 [9] GenomeInfoDb_1.24.0         IRanges_2.22.2             
[11] S4Vectors_0.26.1            BiocGenerics_0.34.0        
[13] forcats_0.5.0               stringr_1.4.0              
[15] dplyr_1.0.0                 purrr_0.3.4                
[17] readr_1.3.1                 tidyr_1.1.0                
[19] tibble_3.0.1                ggplot2_3.3.3              
[21] tidyverse_1.3.0             workflowr_1.6.2            

loaded via a namespace (and not attached):
 [1] ggbeeswarm_0.6.0          colorspace_1.4-1         
 [3] ellipsis_0.3.1            rprojroot_1.3-2          
 [5] XVector_0.28.0            BiocNeighbors_1.6.0      
 [7] fs_1.4.1                  rstudioapi_0.11          
 [9] fansi_0.4.1               lubridate_1.7.9          
[11] xml2_1.3.2                R.methodsS3_1.8.0        
[13] knitr_1.28                scater_1.16.2            
[15] jsonlite_1.6.1            Rsamtools_2.4.0          
[17] broom_0.5.6               dbplyr_1.4.4             
[19] R.oo_1.23.0               compiler_4.0.1           
[21] httr_1.4.1                backports_1.1.7          
[23] assertthat_0.2.1          Matrix_1.2-18            
[25] limma_3.44.1              cli_2.0.2                
[27] later_1.1.0.1             BiocSingular_1.4.0       
[29] htmltools_0.5.1.1         tools_4.0.1              
[31] rsvd_1.0.3                gtable_0.3.0             
[33] glue_1.4.1                GenomeInfoDbData_1.2.3   
[35] Rcpp_1.0.6                cellranger_1.1.0         
[37] vctrs_0.3.1               Biostrings_2.56.0        
[39] nlme_3.1-148              rtracklayer_1.48.0       
[41] DelayedMatrixStats_1.10.1 xfun_0.14                
[43] rvest_0.3.5               lifecycle_0.2.0          
[45] irlba_2.3.3               XML_3.99-0.3             
[47] edgeR_3.30.3              zlibbioc_1.34.0          
[49] scales_1.1.1              hms_0.5.3                
[51] promises_1.1.1            yaml_2.2.1               
[53] gridExtra_2.3             stringi_1.4.6            
[55] BiocParallel_1.22.0       rlang_0.4.10             
[57] pkgconfig_2.0.3           bitops_1.0-6             
[59] evaluate_0.14             lattice_0.20-41          
[61] Rhdf5lib_1.10.0           GenomicAlignments_1.24.0 
[63] tidyselect_1.1.0          magrittr_1.5             
[65] R6_2.4.1                  generics_0.0.2           
[67] DBI_1.1.0                 pillar_1.4.4             
[69] haven_2.3.1               whisker_0.4              
[71] withr_2.2.0               RCurl_1.98-1.2           
[73] modelr_0.1.8              crayon_1.3.4             
[75] utf8_1.1.4                rmarkdown_2.2            
[77] viridis_0.5.1             locfit_1.5-9.4           
[79] grid_4.0.1                readxl_1.3.1             
[81] data.table_1.12.8         blob_1.2.1               
[83] git2r_0.27.1              reprex_0.3.0             
[85] digest_0.6.25             httpuv_1.5.4             
[87] R.utils_2.9.2             munsell_0.5.0            
[89] beeswarm_0.2.3            viridisLite_0.3.0        
[91] vipor_0.4.5