Last updated: 2022-02-02

Checks: 7 0

Knit directory: snRNA_eqtl/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20211124) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version d7f2b88. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/magma_interaction_eqtl/
    Ignored:    data/magma_specific_genes/
    Ignored:    data_sensitive/
    Ignored:    output/figures/

Untracked files:
    Untracked:  analysis/._index.Rmd
    Untracked:  analysis/additional/
    Untracked:  analysis/alternatives/
    Untracked:  analysis/clean_analysis.Rmd
    Untracked:  analysis/log_run_slurm_out
    Untracked:  analysis/log_run_slurm_stderr
    Untracked:  analysis/portfolio/
    Untracked:  analysis/revision.Rmd
    Untracked:  analysis/run_slurm.Rscript
    Untracked:  analysis/run_slurm.sbatch
    Untracked:  data/dice/
    Untracked:  data/epigenome_enrichment/
    Untracked:  data/gencode/
    Untracked:  data/gnomad_loeuf/
    Untracked:  data/gtex/
    Untracked:  data/gwas/
    Untracked:  data/gwas_epigenome_overlap/
    Untracked:  data/metabrain/
    Untracked:  data/umap/
    Untracked:  output/eqtl/
    Untracked:  output_almost_final/
    Untracked:  output_tmp_QTLtools/
    Untracked:  output_tmp_fastQTL_sample_ambient_removed/

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/04_coloc.Rmd
    Modified:   analysis/_site.yml
    Modified:   analysis/plot_figures.Rmd
    Deleted:    output/README.md

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/02_run_eQTL.Rmd) and HTML (docs/02_run_eQTL.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd d7f2b88 Julien Bryois 2022-02-02 eQTL pipeline
html fa2d582 Julien Bryois 2022-01-28 Build site.
Rmd fe925c3 Julien Bryois 2022-01-28 run eQTL with pseudo-bulk
html c0af206 Julien Bryois 2022-01-25 Build site.
Rmd 2f8e71b Julien Bryois 2022-01-25 run eQTL
html d2896af Julien Bryois 2021-12-01 Build site.
Rmd b1d314e Julien Bryois 2021-12-01 eqtl_code_added, figures added

Use fastQTL for permutation pass and nominal pass. QTL tools for independent pass

library(tidyverse)
library(qvalue)
library(parallel)

Set path for eqtl data

Sys.setenv(eqtl_path='data_sensitive/eqtl')

Permutation pass

cd $eqtl_path
source ~/.bashrc

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b

for f in PC*
do
cd $f
for bed in *.bed.gz
do
cell=`echo $bed | sed 's/.bed.gz//'`
echo $cell
fastQTL --vcf combined_final.vcf.gz --bed $bed --permute 1000 --out $cell.quantile.txt.gz --normal --cov $cell.cov.txt.gz --commands 22 $cell.commands.22.tmp
sed 's/^ /sbatch --wrap="/; s/$/"/' $cell.commands.22.tmp > $cell.commands.22.txt
rm $cell.commands.22.tmp
done
cd ..
done
cd $eqtl_path
source ~/.bashrc

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b

for d in PC*
do
cd $d
rm *quantile.txt.gz
for f in *.commands.22.txt
do
sh $f
done
cd ..
done

Check that all eQTL jobs ran successfully

cd $eqtl_path

for d in PC{10,20,30,40,50,60,70,80,90,100,110,120}
do
cd $d
echo $d
grep 'Running time' slurm-* | wc -l
cd ..
done
PC10
176
PC20
176
PC30
176
PC40
176
PC50
176
PC60
176
PC70
176
PC80
176
PC90
176
PC100
176
PC110
176
PC120
176

All good (176)

fastqtl_out_files <- list.files('data_sensitive/eqtl/',pattern='.gz.[0-9]',full.names = T,recursive = TRUE)
d <- tibble(files=fastqtl_out_files,
            cell_type=basename(files) %>% 
              gsub('.quantile.txt.gz..+','',.) %>% 
              gsub('\\.',' ',.) %>% 
              gsub('OPCs   COPs','OPCs / COPs',.)) %>% 
  mutate(PCs=dirname(files) %>% basename(.)) %>% 
  mutate(file_content=map(files,read.table,header=F)) %>% 
  unnest(file_content) %>% 
  filter(!is.na(V11)) %>% 
  select(-files) %>% 
  setNames(c("cell_type","PCs","pid", "nvar", "shape1", "shape2", "dummy", "sid", "dist", "npval", "slope","ppval", "bpval")) %>% 
  group_by(cell_type,PCs) %>% 
  mutate(adj_p=qvalue(bpval)$qvalues) %>% 
  ungroup() %>% 
  arrange(cell_type,bpval)

Write

Write eQTL results with different number of expression principal components

dir.create('output/eqtl/',showWarnings = FALSE,recursive = TRUE)
write_tsv(d,'output/eqtl/eqtl.allPCs.txt')

Get number of PCs maximising the number of eQTL discoveries

selected_pcs <- d %>% 
  filter(adj_p<0.05) %>% 
  dplyr::count(PCs) %>% filter(n==max(n))

Write eQTL results with selected number of PCs

d %>% 
  filter(PCs==selected_pcs$PCs) %>% 
  dplyr::select(-PCs) %>% 
  write_tsv(.,paste0('output/eqtl/eqtl.',selected_pcs$PCs,'.txt'))

Clean

Removing tmp files

cd $eqtl_path

for d in PC*
do
cd $d
rm *quantile*
cd ..
done

Nominal pass

We now run a nominal pass to get pvalues for all SNPs (necessary for colocalization analysis).

We will use 70 PCs as this maximised the number of eQTL discoveries using permutations.

source ~/.bashrc
cd $eqtl_path

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b


mkdir -p 'PC70_nominal'
cd PC70_nominal
ln -s ../PC70/*.bed.gz .
ln -s ../PC70/*.bed.gz.tbi .
ln -s ../PC70/*.cov.txt.gz .
ln -s ../PC70/combined_final.vcf.gz .
ln -s ../PC70/combined_final.vcf.gz.tbi .

for bed in *.bed.gz
do
cell=`echo $bed | sed 's/.bed.gz//'`
fastQTL --vcf combined_final.vcf.gz --bed $bed --out $cell.quantile.txt.gz --normal --cov $cell.cov.txt.gz --commands 22 $cell.commands.22.tmp
sed 's/^ /sbatch --wrap="/; s/$/"/' $cell.commands.22.tmp > $cell.commands.22.txt
sh $cell.commands.22.txt
rm $cell.commands.22.tmp
done
cd $eqtl_path'/PC70_nominal'
grep 'Running time' slurm-* | wc -l
176

176 files successfully completed. All good!

Now let’s gzip the association files and give them a nice name.

cd $eqtl_path'/PC70_nominal'
rm *quantile.txt.gz
rm *.bed.gz
rm *.bed.gz.tbi  
rm *cov.txt.gz
rm *.commands.22.txt
rm *.vcf.gz
rm *.vcf.gz.tbi

for f in *.quantile.txt.gz.*
do
gzip $f
x=`echo $f | sed 's/:.\+$/.gz/' | sed 's/.quantile.txt.gz//'`
mv $f.gz $x
done

Independent pass

We will use QTL tools to find independent cis-eQTLs.

  1. Get QTL tools files format from fastQTL files.
bed_files_fastqtl <- list.files('data_sensitive/eqtl',pattern='.bed.gz$',full.names = T)
read_fastQTL_write_QTLtools <- function(i){
  output_file_name <- gsub('bed.gz','qtltools.bed',bed_files_fastqtl[i]) %>% 
    gsub('data_sensitive/eqtl/','data_sensitive/eqtl/PC70_indep/',.)
  
  fasqtl <- read_tsv(bed_files_fastqtl[i]) %>% 
    mutate(pid='.',strand='+') %>% 
    dplyr::select(`#Chr`:ID,pid,strand,everything())
 
  dir.create('data_sensitive/eqtl/PC70_indep',recursive = TRUE,showWarnings = FALSE)
  write_tsv(fasqtl,output_file_name) 
}
mclapply(1:length(bed_files_fastqtl),read_fastQTL_write_QTLtools,mc.cores = 8)
mkdir -p $eqtl_path'/PC70_indep'
cd $eqtl_path'/PC70_indep'
ml tabix/0.2.6-goolf-1.7.20

for f in *.qtltools.bed
do
bgzip $f && tabix -p bed $f.gz
done
  1. Perform permutation pass on all phenotypes
cd $eqtl_path'/PC70_indep'
source ~/.bashrc

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b
ml HTSlib/1.9-GCCcore-7.3.0
ml zlib/1.2.11-GCCcore-7.3.0
ml lzma/4.32.7-GCCcore-7.3.0
ml cURL/7.60.0-GCCcore-7.3.0

for bed in *qtltools.bed.gz
do
cell=`echo $bed | sed 's/.qtltools.bed.gz//'`
for i in {0..22}
do
echo 'sbatch --mem=4000MB --wrap="QTLtools cis --vcf combined_final.vcf.gz --bed '$bed' --cov '$cell'.cov.txt.gz --permute 1000 --normal --chunk '$i' 22 --out '$cell'.conditional_permute_'$i'_22"' >> $cell.commands.cond.permutation.22.txt
done
done
cd $eqtl_path'/PC70_indep'
source ~/.bashrc

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b
ml HTSlib/1.9-GCCcore-7.3.0
ml zlib/1.2.11-GCCcore-7.3.0
ml lzma/4.32.7-GCCcore-7.3.0
ml cURL/7.60.0-GCCcore-7.3.0

ln -s ../PC70/combined_final.vcf.gz .
ln -s ../PC70/combined_final.vcf.gz.tbi .
ln -s ../PC70/*.cov.txt.gz  .

for f in *.commands.cond.permutation.22.txt
do
sh $f
done
cd $eqtl_path'/PC70_indep'
grep 'Running time' slurm-* | wc -l

All good!

Gather all output

cd $eqtl_path'/PC70_indep'

for bed in *.qtltools.bed.gz
do
cell=`echo $bed | sed 's/.qtltools.bed.gz//'`
cat $cell.conditional_permute_* >> $cell.conditional.permute.txt
done
rm *.conditional_permute_*
  1. Extract significant phenotypes
cd $eqtl_path'/PC70_indep'
ml  R/4.0.1-foss-2018b
wget https://raw.githubusercontent.com/qtltools/qtltools/master/scripts/qtltools_runFDR_cis.R .
for f in *.conditional.permute.txt
do
cell=`echo $f | sed 's/.conditional.permute.txt//'`
Rscript qtltools_runFDR_cis.R $f 0.05 $cell.permutations_all
done

Read bed files and only retain significant genes.

bed_files <- list.files('data_sensitive/eqtl/PC70_indep',pattern='.qtltools.bed.gz$',full.names = T)
read_fastQTL_write_QTLtools_conditional <- function(i){
  output_file_name <- gsub('qtltools.bed.gz','qtltools.conditional.bed',bed_files[i])
  
  significant_hits <- read_delim(gsub('qtltools.bed.gz','permutations_all.significant.txt',bed_files[i]),delim = ' ',col_names=FALSE)
  
  fasqtl <- read_tsv(bed_files[i]) %>% 
    filter(ID%in%significant_hits$X1)
  
  threshold <- read_delim(gsub('qtltools.bed.gz','permutations_all.thresholds.txt',bed_files[i]),
                                 delim = ' ',col_names=FALSE) %>% 
    filter(X1%in%significant_hits$X1)

  write_tsv(fasqtl,output_file_name) 
  write_delim(threshold,gsub('qtltools.bed.gz','permutations_all.thresholds.txt',bed_files[i]),col_names=FALSE,delim = ' ') 
}
mclapply(1:length(bed_files),read_fastQTL_write_QTLtools_conditional,mc.cores=8)
cd $eqtl_path'/PC70_indep'
ml tabix/0.2.6-goolf-1.7.20

for f in *.qtltools.conditional.bed
do
bgzip $f && tabix -p bed $f.gz
done
  1. Run conditional analysis
cd $eqtl_path'/PC70_indep'
source ~/.bashrc

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b
ml HTSlib/1.9-GCCcore-7.3.0
ml zlib/1.2.11-GCCcore-7.3.0
ml lzma/4.32.7-GCCcore-7.3.0
ml cURL/7.60.0-GCCcore-7.3.0

rm slurm*

for bed in *.qtltools.conditional.bed.gz
do
cell=`echo $bed | sed 's/.qtltools.conditional.bed.gz//'`
echo 'sbatch --mem=8000MB --wrap="QTLtools cis --vcf combined_final.vcf.gz --bed '$bed' --cov '$cell'.cov.txt.gz --mapping '$cell'.permutations_all.thresholds.txt --normal --out '$cell'.conditional.txt"' > $cell.commands.cond.txt
done

for f in *.commands.cond.txt
do
sh $f
done
cd $eqtl_path'/PC70_indep'
grep 'Running time' slurm-* | wc -l
8

8 cell types successfully ran. All good!

Write

Get conditional results

conditional_files <- list.files('data_sensitive/eqtl/PC70_indep',pattern='*conditional.txt',full.names = T)
header <- c('phe_id','phe_chr','phe_from','phe_to','phe_strd','n_var_in_cis','dist_phe_var','var_id', 'var_chr','var_from','var_to','rank','fwd_pval','fwd_r_squared','fwd_slope','fwd_best_hit','fwd_sig', 'bwd_pval','bwd_r_squared','bwd_slope','bwd_best_hit','bwd_sig')
d <- tibble(files=conditional_files,
            cell_type=gsub('.conditional.txt','',basename(conditional_files)) %>% 
              gsub('\\.',' ',.) %>% 
              gsub('OPCs   COPs','OPCs / COPs',.),
            file_content=map(files,read_delim,delim=' ',col_names=FALSE) 
            ) %>% 
  dplyr::select(-files) %>%
  unnest(file_content) %>% 
  setNames(c('cell_type',header)) %>% 
  filter(bwd_best_hit==1)
write_tsv(d,'output/eqtl/eqtl.70PCs.indep.txt')

Fine-mapping

We will use CaVEMan for fine-mapping

  1. Get bed files for significant cis-eQTL
d <- read_tsv('output/eqtl/eqtl.PC70.txt') %>% 
  mutate(cell_type=make.names(cell_type)) %>% 
  filter(adj_p<0.05)
bed_files_fastqtl <- list.files('data_sensitive/eqtl',pattern='.bed.gz$',full.names = T)
read_fastQTL_write_caveman <- function(i){
  output_file_name <- gsub('bed.gz','caveman.bed',bed_files_fastqtl[i]) %>% gsub('eqtl','eqtl/PC70_caveman',.)
  cell_type_file <- basename(bed_files_fastqtl[i]) %>% gsub('.bed.gz','',.)

  genes_to_keep <- filter(d,cell_type==cell_type_file) %>% pull(pid)
  
  fasqtl <- read_tsv(bed_files_fastqtl[i]) %>% 
    filter(ID%in%genes_to_keep)
 
  dir.create('data_sensitive/eqtl/PC70_caveman',showWarnings = FALSE)
  write_tsv(fasqtl,output_file_name) 
}
mclapply(1:length(bed_files_fastqtl),read_fastQTL_write_caveman,mc.cores=8)
  1. Get bed files adjusted for covariates
write_tsv(d[,'sid',drop=FALSE],'data_sensitive/eqtl/PC70_caveman/sig_eQTL_snps.txt',col_names = FALSE)
cd $eqtl_path'/PC70_caveman'
ln -s ../PC70/combined_final.vcf.gz .
ln -s ../PC70/combined_final.vcf.gz.tbi .

zgrep -Fw -f sig_eQTL_snps.txt combined_final.vcf.gz | grep -v '#' | cut -f 1,2,3,4,5 > sig_eQTL_snps_alleles.txt
genotypes_selected <- read_tsv('data_sensitive/eqtl/PC70_caveman/sig_eQTL_snps_alleles.txt',col_names = FALSE) %>% 
  setNames(c('chr','pos','sid','A1','A2'))
cell_types <- unique(d$cell_type)
get_eqtl_cov <- function(i){
  eqtl_cell_type <- filter(d,cell_type==cell_types[i]) %>% 
    dplyr::select(pid,sid) %>% 
    left_join(.,genotypes_selected,by='sid') %>% 
    dplyr::select(pid,chr,pos,A1,A2)
  
  write_tsv(eqtl_cell_type,paste0('data_sensitive/eqtl/PC70_caveman/',make.names(cell_types[i]),'.eqtl.list'),col_names = FALSE)
}
lapply(1:length(cell_types),get_eqtl_cov)

Now get covariates

cov_files <- list.files('data_sensitive/eqtl/PC70',pattern = 'cov.txt.gz',full.names = TRUE) 
read_cov_write <- function(i){
  cov <- read_tsv(cov_files[i])
  out <- gsub('.txt.gz','.caveman.txt',cov_files[i]) %>% gsub('PC70','PC70_caveman',.)
  
  study <- cov[4,][,-1] %>% t() %>% as.data.frame()
  study_model_matrix <-  model.matrix(~0+V1,data=study) %>% as.data.frame() %>% t() %>%
    as.data.frame() %>% rownames_to_column('id') %>% mutate(id=gsub('V1','',id))

  disease <- cov[5,][,-1] %>% t() %>% as.data.frame()
  disease_model_matrix <-  model.matrix(~0+V1,data=disease) %>% as.data.frame() %>% t() %>% 
    as.data.frame() %>% rownames_to_column('id') %>% mutate(id=gsub('V1','',id))

  cov_out <- rbind(cov[c(1:3),],study_model_matrix,disease_model_matrix,cov[c(6:nrow(cov)),])
  cov_out <- cov_out[,-1]
  write_tsv(cov_out,out)
}
lapply(1:length(cov_files),read_cov_write)
cd $eqtl_path'/PC70_caveman'
source ~/.bashrc

ml tabix/0.2.6-goolf-1.7.20

for f in *.caveman.bed
do
cell=`echo $f | sed 's/.caveman.bed//'`
CaVEMaN --single-signal --eqtl "$cell".eqtl.list --bed "$f" --vcf combined_final.vcf.gz \
     --out "$cell".caveman.corrected.expression.bed --cov "$cell".cov.caveman.txt
done
cd $eqtl_path'/PC70_caveman'
source ~/.bashrc
mkdir -p results

ml tabix/0.2.6-goolf-1.7.20

for f in *.caveman.corrected.expression.bed
do
cell=`echo $f | sed 's/.caveman.corrected.expression.bed//'`
n_eqtl=`wc -l $f |cut -d ' ' -f 1`
echo $n_eqtl
n_jobs=$(($n_eqtl/50))
n_jobs=$(($n_jobs+1))
for i in $( eval echo {1..$n_jobs} )
do
sbatch --wrap="CaVEMaN --bed '$f' --vcf combined_final.vcf.gz --genes 50 \
     --job-number '$i' --normal --out results/'$cell'_'$i' --verbose"
done 
done
cd $eqtl_path'/PC70_caveman'/results

awk 'FNR>1||NR==1' Astrocytes_* > Astrocytes.all
awk 'FNR>1||NR==1' Endothelial.cells_* > Endothelial.cells.all
awk 'FNR>1||NR==1' Excitatory.neurons_* > Excitatory.neurons.all
awk 'FNR>1||NR==1' Inhibitory.neurons_* > Inhibitory.neurons.all
awk 'FNR>1||NR==1' Microglia_* > Microglia.all
awk 'FNR>1||NR==1' Oligodendrocytes_* > Oligodendrocytes.all
awk 'FNR>1||NR==1' OPCs...COPs_* > OPCs...COPs.all
awk 'FNR>1||NR==1' Pericytes_* > Pericytes.all
cd $eqtl_path'/PC70_caveman'/results
source ~/.bashrc

ml tabix/0.2.6-goolf-1.7.20

for f in *.all
do
CaVEMaN --best $f --out $f.best --verbose
done
rm *_*
files <- list.files('data_sensitive/eqtl/PC70_caveman/results',pattern='.best',full.names = TRUE)
d <- tibble(files) %>% mutate(file_content=map(files,read_tsv)) %>% 
  unnest(file_content) %>% 
  mutate(cell_type=basename(files) %>% gsub('\\.all.best','',.)) %>% 
  dplyr::select(cell_type,everything()) %>% 
  dplyr::select(-files) %>% 
  mutate(cell_type=gsub('\\.',' ',cell_type)) %>% 
  mutate(cell_type=gsub('OPCs   COPs','OPCs / COPs',cell_type)) %>% 
  arrange(-Probability)
write_tsv(d,'output/eqtl/eqtl.70PCs.caveman.txt')

Permutation pass - Pseudo-bulk All

Sys.setenv(eqtl_path='data_sensitive/eqtl_pb')
cd $eqtl_path
source ~/.bashrc

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b

for f in PC*
do
cd $f
for bed in *.bed.gz
do
cell=`echo $bed | sed 's/.bed.gz//'`
echo $cell
fastQTL --vcf combined_final.vcf.gz --bed $bed --permute 1000 --out $cell.quantile.txt.gz --normal --cov $cell.cov.txt.gz --commands 22 $cell.commands.22.tmp
sed 's/^ /sbatch --wrap="/; s/$/"/' $cell.commands.22.tmp > $cell.commands.22.txt
rm $cell.commands.22.tmp
done
cd ..
done
cd $eqtl_path
source ~/.bashrc

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b

for d in PC*
do
cd $d
rm *quantile.txt.gz
for f in *.commands.22.txt
do
sh $f
done
cd ..
done

Check that all eQTL jobs ran successfully

cd $eqtl_path

for d in PC{10,20,30,40,50,60,70,80,90,100,110,120}
do
cd $d
echo $d
grep 'Running time' slurm-* | wc -l
cd ..
done
PC10
22
PC20
22
PC30
22
PC40
22
PC50
22
PC60
22
PC70
22
PC80
22
PC90
22
PC100
22
PC110
22
PC120
22

All done (22)

fastqtl_out_files <- list.files('data_sensitive/eqtl_pb/',pattern='.gz.[0-9]',full.names = T,recursive = TRUE)
d <- tibble(files=fastqtl_out_files,
            cell_type=basename(files) %>% 
              gsub('.quantile.txt.gz..+','',.)) %>% 
  mutate(PCs=dirname(files) %>% basename(.)) %>% 
  mutate(file_content=map(files,read.table,header=F)) %>% 
  unnest(file_content) %>% 
  filter(!is.na(V11)) %>% 
  dplyr::select(-files) %>% 
  setNames(c("cell_type","PCs","pid", "nvar", "shape1", "shape2", "dummy", "sid", "dist", "npval", "slope","ppval", "bpval")) %>% 
  group_by(cell_type,PCs) %>% 
  mutate(adj_p=qvalue(bpval)$qvalues) %>% 
  ungroup() %>% 
  arrange(cell_type,bpval)

Write

Write eQTL results with different number of expression principal components

dir.create('output/eqtl/',showWarnings = FALSE)
write_tsv(d,'output/eqtl/eqtl.pb.allPCs.txt')

Get number of PCs maximising the number of eQTL discoveries

selected_pcs <- d %>% 
  filter(adj_p<0.05) %>% 
  dplyr::count(PCs) %>% filter(n==max(n))

Write eQTL results with selected number of PCs

d %>% 
  filter(PCs==selected_pcs$PCs) %>% 
  dplyr::select(-PCs) %>% 
  write_tsv(.,paste0('output/eqtl/eqtl.pb.',selected_pcs$PCs,'.txt'))

Clean

Removing tmp files

cd $eqtl_path

for d in PC*
do
cd $d
rm *quantile*
cd ..
done

Nominal pass - Pseudo-bulk All

We now run a nominal pass to get pvalues for all SNPs.

We will use 70 PCs as this maximised the number of eQTL discoveries using permutations.

source ~/.bashrc
cd $eqtl_path

ml foss/2018b
ml Rmath/3.6.1-GCCcore-7.3.0
ml GSL/2.5-GCCcore-7.3.0
ml Eigen/3.3.7
ml Boost/1.67.0-foss-2018b


mkdir -p 'PC70_nominal'
cd PC70_nominal
ln -s ../PC70/*.bed.gz .
ln -s ../PC70/*.bed.gz.tbi .
ln -s ../PC70/*.cov.txt.gz .
ln -s ../PC70/combined_final.vcf.gz .
ln -s ../PC70/combined_final.vcf.gz.tbi .

for bed in *.bed.gz
do
cell=`echo $bed | sed 's/.bed.gz//'`
fastQTL --vcf combined_final.vcf.gz --bed $bed --out $cell.quantile.txt.gz --normal --cov $cell.cov.txt.gz --commands 22 $cell.commands.22.tmp
sed 's/^ /sbatch --wrap="/; s/$/"/' $cell.commands.22.tmp > $cell.commands.22.txt
sh $cell.commands.22.txt
rm $cell.commands.22.tmp
done
cd $eqtl_path'/PC70_nominal'
grep 'Running time' slurm-* | wc -l
22

22 files successfully completed. All good!

Now let’s gzip the association files and give them a nice name.

cd $eqtl_path'/PC70_nominal'
rm *quantile.txt.gz
rm *.bed.gz
rm *.bed.gz.tbi  
rm *cov.txt.gz
rm *.commands.22.txt
rm *.vcf.gz
rm *.vcf.gz.tbi

for f in *.quantile.txt.gz.*
do
gzip $f
x=`echo $f | sed 's/:.\+$/.gz/' | sed 's/.quantile.txt.gz//'`
mv $f.gz $x
done

sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /pstore/apps/OpenBLAS/0.3.1-GCC-7.3.0-2.30/lib/libopenblasp-r0.3.1.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] qvalue_2.20.0   forcats_0.5.0   stringr_1.4.0   dplyr_1.0.0    
 [5] purrr_0.3.4     readr_1.3.1     tidyr_1.1.0     tibble_3.0.1   
 [9] ggplot2_3.3.3   tidyverse_1.3.0 workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.14         reshape2_1.4.4    splines_4.0.1    
 [5] haven_2.3.1       lattice_0.20-41   colorspace_1.4-1  vctrs_0.3.1      
 [9] generics_0.0.2    htmltools_0.5.1.1 yaml_2.2.1        blob_1.2.1       
[13] rlang_0.4.10      later_1.1.0.1     pillar_1.4.4      withr_2.2.0      
[17] glue_1.4.1        DBI_1.1.0         dbplyr_1.4.4      modelr_0.1.8     
[21] readxl_1.3.1      plyr_1.8.6        lifecycle_0.2.0   munsell_0.5.0    
[25] gtable_0.3.0      cellranger_1.1.0  rvest_0.3.5       evaluate_0.14    
[29] knitr_1.28        httpuv_1.5.4      fansi_0.4.1       broom_0.5.6      
[33] Rcpp_1.0.6        promises_1.1.1    backports_1.1.7   scales_1.1.1     
[37] jsonlite_1.6.1    fs_1.4.1          hms_0.5.3         digest_0.6.25    
[41] stringi_1.4.6     grid_4.0.1        rprojroot_1.3-2   cli_2.0.2        
[45] tools_4.0.1       magrittr_1.5      crayon_1.3.4      whisker_0.4      
[49] pkgconfig_2.0.3   ellipsis_0.3.1    xml2_1.3.2        reprex_0.3.0     
[53] lubridate_1.7.9   rstudioapi_0.11   assertthat_0.2.1  rmarkdown_2.2    
[57] httr_1.4.1        R6_2.4.1          nlme_3.1-148      git2r_0.27.1     
[61] compiler_4.0.1