Introduction
- Global variables and helper functions
Validation of ES and NS cell through transcriptomic profiling
The analysis of chromatin conformation capture data
- TCC data - import and normalisation
- In-situ HiC data - import and normalisation
Identification of interactions
Identification of loops
Comparative analysis of Loops - TCC
- ES vs. NS cells
Loops and pluripotency
- ES cells (2i/LIF) versus epi-stem cells (EpiSC) - comparison of loop signal
Loops - in situ Hi-C
- Comparative analysis of loops in ES and NS cells - in situ Hi-C data
  - Controls
The relationship between CTCF binding and differential loops
Loop dynamics and chromatin contact insulation
- Estimation of the insulation effect
  - The composite profile of contacts at induced loop anchors - ES (2i/LIF) versus NS (TCC data)
Chromatin interaction domains
- Insulation of domain boundaries - relationship with loops
- Loop domains and ordinary domains
Loop induction and chromatin features
- GO-term analysis
- Loops - annotation of anchors
Loop- and compartment-domains
- Loop induction and intra-domain promoter-enhancer interactions
Loops and chromatin compartments
- Loop induction compartments and H3K4me1
Mouse gene atlas

Introduction

With this vignette, we walk the reader through the analyses performed in Gain of CTCF-anchored chromatin loops marks the exit from naive pluripotency. We store the intermediate files in .RData format. All the files are avaliable for download from our webpage. We also provie data normalized using Juicer (Durand et al., 2016) in the data directory.

Global variables and helper functions

We load a set of helper functions for the initial steps of the analysis along with the visualization function.

myEScolor = rgb(229, 67, 0, maxColorValue=255)
myNScolor = rgb(0, 130, 229, maxColorValue = 255)

We set the paths.

data_directory = '/g/huber/projects/HiC/Pekowska_et_al/data/'
scripts_directory = '/g/huber/projects/HiC/Pekowska_et_al/scripts/'
dumped_directory = '/g/huber/projects/HiC/Pekowska_et_al/data/dumped/'

source( paste0(scripts_directory,'functions.R' ) )
set.seed(22)

Validation of ES and NS cell through transcriptomic profiling

Below, we analyze the expression of key pluripotency and neuronal markers in our cells. We performed DESeq2 analysis of the transcriptome in ES cells (FBS/LIF) as well as in NS cells.

library("DESeq")

##     Welcome to 'DESeq'. For improved performance, usability and
##     functionality, please consider migrating to 'DESeq2'.

## 
## Attaching package: 'DESeq'

## The following objects are masked from 'package:DESeq2':
## 
##     estimateSizeFactorsForMatrix, getVarianceStabilizedData,
##     varianceStabilizingTransformation

es1=read.delim( paste0(data_directory,"ES1_counts_reverse.txt"), as.is=TRUE, header=FALSE, sep="\t")
es2=read.delim( paste0(data_directory,"ES2_counts_reverse.txt"), as.is=TRUE, header=FALSE, sep="\t")
ns1=read.delim( paste0(data_directory,"NS1_counts_reverse.txt"), as.is=TRUE, header=FALSE, sep="\t")
ns2=read.delim( paste0(data_directory,"NS2_counts_reverse.txt"), as.is=TRUE, header=FALSE, sep="\t")

SCT = cbind(esc1=es1[,2], esc2=es2[,2], nsc1=ns1[,2], nsc2=ns2[,2])
rownames(SCT) = es1[,1]
SCT = SCT[1:(nrow(SCT)-5),]

cond = factor( c( "esc", "esc", "nsc", "nsc" ) )
cds = newCountDataSet( as.matrix(SCT), cond )
cds = estimateSizeFactors( cds )
cds = estimateDispersions( cds )
resO = nbinomTest( cds, "esc", "nsc" )
res = resO[! is.na(resO$pval),]
res = res[res$padj < 0.1,] 

save(res, cds, SCT, file = paste0(data_directory, "RNASeq_ESC_NSC.RData"))



###
load( paste0( data_directory, "AllTSS_NCBIM37_67_all_TSS_TTS.RData") )
gtf = import( paste0(data_directory, 'Mus_musculus.NCBIM37.67.gtf' ) )
seqlevelsStyle(gtf) = 'ucsc'


genes = data.frame(ensembl = elementMetadata(gtf)$gene_id,
                   geneN = elementMetadata(gtf)$gene_name, stringsAsFactors = F)

res$Significant = res$padj < 0.01 

res$Gene = genes[match(res$id,genes$ensembl),'geneN']
pluripotency = c('Klf4', 'Prdm14', 'Zfp42', 'Nanog', 
                 'Tbx3', 'Esrrb', 'Pou5f1','Nodal','Tet1', 'Fgf4', 
                 'Lin28a', 'Phc1', 'Dppa2','Lif' )
neuronal = c('Olig1','Olig2', 'Pax6', 'Bmi1', 'Rxra', 'Wnt5a', 'Nes',
             'Ascl1', 'Hes5', 'Hes6', 'Dll1','Notch1', 'Fzd1', 'Fzd2')

Ep = res[match(pluripotency,res$Gene),]
En = res[match(neuronal,res$Gene),]

rownames(Ep) = Ep$Gene
rownames(En) = En$Gene

Epv = Ep$log2FoldChange
names(Epv)=Ep$Gene

Env = En$log2FoldChange
names(Env)=En$Gene

Epv[!is.finite(Epv)]=-11
Epv = -1*Epv[order(Epv,decreasing=F)]
Env = -1*Env[order(Env,decreasing=F)]

# pdf( '/g/huber/projects/HiC/Pekowska_et_al/expr_check.pdf', width=15, height=9, pointsize=20 )
par( mar=c(7,7,7,1), cex.lab=1.5, cex.axis=1.5 )
barplot( c( Epv, Env), col=c( rep(myEScolor, length(Epv)), 
                              rep( myNScolor, length(Epv))), 
         ylab=expression('ES/NS [log'[2]*']'),
         names=c(names(Epv), names(Env)), ylim=c(-11,11),las=2 )

# dev.off()

The analysis of chromatin conformation capture data

The pre-processing of the TCC and in-situ Hi-C reads was performed using juicer. We obtain the ligation frequency matrices (LFMs) from the .hic files, using function dump in juicer.

We also need to obtain the annotation of bins for the LFMs at the desired resolutions. To this end, we retrieve the lengths of chromosomes in the mouse genome. We retrive this information from the BSgenome.Mmusculus.UCSC.mm9 package.

library(BSgenome.Mmusculus.UCSC.mm9)
genome = BSgenome.Mmusculus.UCSC.mm9
si = seqinfo(genome)
save(si, file=paste0(data_directory,'si.RData'))

Obtain genomic annotation of bins at different resolution.

load(paste0(data_directory,'si.RData'))

ga = binAnno( si, as.list(paste0('chr', c(1:19,'X','Y'))), 5000)
gagr = GRanges(seqnames = Rle(ga$chr),
                 ranges = IRanges(as.numeric(ga$start),
                                  end = as.numeric(ga$end),
                                  names = seq(1, nrow(ga))),
                 strand = Rle(rep("*", nrow(ga))))

ga10 = binAnno( si, as.list(paste0('chr', c(1:19,'X','Y'))), 10000)
gagr10 = GRanges(seqnames = Rle(ga10$chr),
                 ranges = IRanges(as.numeric(ga10$start),
                                  end = as.numeric(ga10$end),
                                  names = seq(1, nrow(ga10))),
                 strand = Rle(rep("*", nrow(ga10))))

ga50 = binAnno( si, as.list(paste0('chr', c(1:19,'X','Y'))), 50000)
gagr50 = GRanges(seqnames = Rle(ga50$chr),
                  ranges = IRanges(as.numeric(ga50$start),
                                   end = as.numeric(ga50$end),
                                   names = seq(1, nrow(ga50))),
                  strand = Rle(rep("*", nrow(ga50))))

TCC data - import and normalisation

Now read the files and assemble the corresponding genome-wide sparse matrices of interactions.

chroms_combs = read.delim(paste0(annotation_dir,'allChroms_Musm.txt'), header=F, as.is=T)

###################################################################### 
es1_lfm_10kb = read.hic_files( paste0(dumped_directory,'ES1_10kb/'), "ES1_30_",".txt", ga10, chroms_combs[,1] )
es1_lfm_10kb = getGWmatrix( list.of.sm = es1_lfm_10kb, ga = ga10)

es2_lfm_10kb = read.hic_files( paste0(dumped_directory,'ES2_10kb/'), "ES2_30_",".txt", ga10, chroms_combs[,1] )
es2_lfm_10kb = getGWmatrix( list.of.sm = es2_lfm_10kb, ga = ga10)

ns1_lfm_10kb = read.hic_files( paste0(dumped_directory,'NS1_10kb/'), "NS1_30_",".txt", ga10, chroms_combs[,1] )
ns1_lfm_10kb = getGWmatrix( list.of.sm = ns1_lfm_10kb, ga = ga10)

ns2_lfm_10kb = read.hic_files( paste0(dumped_directory,'NS2_10kb/'), "NS2_30_",".txt", ga10, chroms_combs[,1] )
ns2_lfm_10kb = getGWmatrix( list.of.sm = ns2_lfm_10kb,ga = ga10)

save(es1_lfm_10kb, es2_lfm_10kb, ns1_lfm_10kb, ns2_lfm_10kb, 
     file=paste0(data_directory, 'TCC_LFMs_10_kb.RData'))

###################################################################### 
ti1_lfm_10kb = read.hic_files( paste0(dumped_directory,'TI1_10kb/'), "TI1_30_",".txt", ga10, chroms_combs[,1] )
ti1_lfm_10kb = getGWmatrix( list.of.sm = ti1_lfm_10kb, ga = ga10)

ti2_lfm_10kb = read.hic_files( paste0(dumped_directory,'TI2_10kb/'), "TI2_30_",".txt", ga10, chroms_combs[,1] )
ti2_lfm_10kb = getGWmatrix( list.of.sm = ti2_lfm_10kb, ga = ga10)

save(ti1_lfm_10kb, ti2_lfm_10kb, 
     file=paste0(data_directory, 'TCC_LFMs_10_kb_2i.RData'))


ep1_lfm_10kb = read.hic_files( paste0(tcc_dumped_dir,'EP1_10kb/'), "EP1_30_",".txt", ga10, chroms_combs[,1] )
ep1_lfm_10kb = getGWmatrix( list.of.sm = ep1_lfm_10kb, ga = ga10)

ep2_lfm_10kb = read.hic_files( paste0(tcc_dumped_dir,'EP2_10kb/'), "EP2_30_",".txt", ga10, chroms_combs[,1] )
ep2_lfm_10kb = getGWmatrix( list.of.sm = ep2_lfm_10kb, ga = ga10)

save(ep1_lfm_10kb, ep2_lfm_10kb, 
     file=paste0(data_directory, 'TCC_LFMs_10_kb_Epi.RData'))

We implemented Iterative Proportional Fitting algorithm (IPF) to normalize the LFMs.

chroms=paste0('chr',c(1:19,'X','Y'))
itn=20

###################################################################### 
load(paste0(data_directory, 'TCC_LFMs_10_kb.RData'))
es1 = IPF(es1_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
es2 = IPF(es2_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
ns1 = IPF(ns1_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
ns2 = IPF(ns2_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
save(es1, es2, ns1, ns2,
    file=paste0(data_directory, 'TCC_IPF_10kb_ES_NS_repl_sep.RData') )

es_lfm_10kb = es1_lfm_10kb + es2_lfm_10kb
ns_lfm_10kb = ns1_lfm_10kb + ns2_lfm_10kb
es = IPF(es_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
ns = IPF(ns_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
save(es, ns,
    file=paste0(data_directory, 'TCC_IPF_10kb_ES_NS_pooled.RData') )


###################################################################### 
ti1 = IPF(ti1_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
ti2 = IPF(ti2_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
save(ti1,ti2,
     file=paste0(data_directory, 'TCC_IPF_10kb_repl_sep_ES_2i.RData') )
ti_lfm_10kb = ti1_lfm_10kb + ti2_lfm_10kb
ti = IPF(ti_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
save(ti, file=paste0(data_directory, 'TCC_IPF_10kb_pooled_ES_2i.RData') )

###################################################################### 
ep1 = IPF(ep1_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
ep2 = IPF(ep2_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
save(ep1,ep2,
     file=paste0(data_directory, 'TCC_IPF_10kbMatrices_repl_sep_EpiSC.RData') )
ep_lfm_10kb = ep1_lfm_10kb + ep2_lfm_10kb
ep = IPF(ep_lfm_10kb, ann=ga10, numberOfIterations=itn, chroms=chroms)
save(ep, file=paste0(data_directory, 'TCC_IPF_10kbMatrices_pooled_EpiSC.RData') )

We export the data to create .hic file to visualize with Juicebox

getMatrixForJucerPre=function( mat, binAnnotation, chrToConsider,disCutoff ) {
  # mat = es[[2]]; binAnnotation=ga10; chrToConsider=paste0('chr',c(1:19,'X','Y')); disCutoff=5000000
  D = disCutoff %/% mean(binAnnotation$end - binAnnotation$start)
  start = min( which(binAnnotation$chr %in% chrToConsider) )
  end = max( which(binAnnotation$chr %in% chrToConsider) )
  # mat = mat[start:end, start:end]
  tp = as.data.frame( summary(mat), stringsAsFactors=FALSE )
  tp = tp[tp$i <= tp$j & tp$x!=0, ]
  tp = tp[tp$j - tp$i < D,]
  tp$i = tp$i + start -1
  tp$j = tp$j + start -1
  
  chroms = paste0('chr',c(1:19,'X','Y'))
  
  ## two issues with juicer require these two sorting steps
  dfmatch = data.frame( V1=chroms, V2=1:21 )
  binAnnotation$chrID = dfmatch[,2][ match( binAnnotation$chr, dfmatch[,1]) ]
  tp = tp[ binAnnotation$chrID[tp$i] <= binAnnotation$chrID[tp$j], ]
  
  tp = do.call("rbind", lapply( split(tp, binAnnotation$chr[tp$i] ), function(chrom){
    do.call("rbind", split(chrom, binAnnotation$chr[chrom$j]))  }) )
  
  print('generating the data frame')
  return( data.frame( strand1 = 0,
                      chr1 = binAnnotation$chr[tp$i],
                      pos1 = (binAnnotation$start[tp$i])-1,
                      fres1 = 0,
                      strand2 = 0,
                      chr2 = binAnnotation$chr[tp$j],
                      pos2 = (binAnnotation$start[tp$j])-1,
                      fres2 = 1,
                      score = tp$x,
                      stringsAsFactors=FALSE ) )

}



esc.tcc.juicer = getMatrixForJucerPre( es[[2]], 
                                       ga10, 
                                       paste0('chr',c(1:19,'X','Y')), 
                                       25000000 )
write.table(esc.tcc.juicer, file=paste0(data_directory,'esc.tcc.juicer.txt'),
            quote=FALSE, row.names=FALSE, col.names=FALSE,
            sep='\t')


nsc.tcc.juicer = getMatrixForJucerPre( ns[[2]], 
                                       ga10, 
                                       paste0('chr',c(1:19,'X','Y')), 
                                       25000000 )
write.table(nsc.tcc.juicer, file=paste0(data_directory,'nsc.tcc.juicer.txt'),
            quote=FALSE, row.names=FALSE, col.names=FALSE,
            sep='\t')

In-situ HiC data - import and normalisation

Read the LFMs.

es_lfm_5kb = read.hic_files( dumped_dir, "ES_IS_5kb/ES_inter30_",".txt", ga, chroms )
ns_lfm_5kb = read.hic_files( dumped_dir, "NS_IS_5kb/NS_inter30_",".txt", ga, chroms )

es1_lfm_5kb = read.hic_files( dumped_dir, "ES_IS_5kb/ES1_inter30_",".txt", ga, chroms )
es2_lfm_5kb = read.hic_files( dumped_dir, "ES_IS_5kb/ES2_inter30_",".txt", ga, chroms )

ns1_lfm_5kb = read.hic_files( dumped_dir, "NS_IS_5kb/NS1_inter30_",".txt", ga, chroms )
ns2_lfm_5kb = read.hic_files( dumped_dir, "NS_IS_5kb/NS2_inter30_",".txt", ga, chroms )

save( es_lfm_5kb, ns_lfm_5kb, es1_lfm_5kb, ns1_lfm_5kb, ns2_lfm_5kb, es2_lfm_5kb, 
      file=paste0( data_directory, 'LFMs_in_situ_5kb.RData' ) )



es_lfm_10kb = read.hic_files_intra( dumped_dir, "ES_IS_10kb/ES_inter30_",".txt", ga10, chroms )
ns_lfm_10kb = read.hic_files_intra( dumped_dir, "NS_IS_10kb/NS_inter30_",".txt", ga10, chroms )

es1_lfm_10kb = read.hic_files_intra( dumped_dir, "ES_IS_10kb/ES1_inter30_",".txt", ga10, chroms )
es2_lfm_10kb = read.hic_files_intra( dumped_dir, "ES_IS_10kb/ES2_inter30_",".txt", ga10, chroms )

ns1_lfm_10kb = read.hic_files_intra( dumped_dir, "NS_IS_10kb/NS1_inter30_",".txt", ga10, chroms )
ns2_lfm_10kb = read.hic_files_intra( dumped_dir, "NS_IS_10kb/NS2_inter30_",".txt", ga10, chroms )

save( es_lfm_10kb, ns_lfm_10kb, es1_lfm_10kb, ns1_lfm_10kb, ns2_lfm_10kb, es2_lfm_10kb, 
      file=paste0( data_directory, 'LFMs_in_situ_10kb.RData' ) )

Normalize the LFMs.

load(paste0( data_directory, 'LFMs_in_situ_5kb.RData' ))

chroms=paste0('chr',c(1:19,'X','Y'))
itn=20


es_ipf_5kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga[ga$chr==chr,]
  m = es_lfm_5kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(es_ipf_5kb_intra) = chroms[1:20]

save( es_ipf_5kb_intra, 
      file= paste0(data_directory,'intrachromosomal_5kb_IPF_ES.RData') )

esins = list( c(LFM = getGWmatrix_dedicated( es_ipf_5kb_intra, ga, 'LFM' ),
                balanced = getGWmatrix_dedicated( es_ipf_5kb_intra, ga, 'balanced' )) )
esins = esins[[1]]

save( esins,
      file=paste0(data_directory,'InSitu_ES_pooled_5kb_my_object.RData' ))



es1_ipf_5kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga[ga$chr==chr,]
  m = es1_lfm_5kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(es1_ipf_5kb_intra) = chroms[1:20]

es2_ipf_5kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga[ga$chr==chr,]
  m = es2_lfm_5kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(es2_ipf_5kb_intra) = chroms[1:20]

save( es1_ipf_5kb_intra, es2_ipf_5kb_intra, 
      file= paste0(data_directory,'intrachromosomal_5kb_IPF_ES_repl_sep.RData') )


esins1 = list( c(LFM = getGWmatrix_dedicated( es1_ipf_5kb_intra, ga, 'LFM' ),
                 balanced = getGWmatrix_dedicated( es1_ipf_5kb_intra, ga, 'balanced' )) )
esins1 = esins1[[1]]
esins2 = list( c(LFM = getGWmatrix_dedicated( es2_ipf_5kb_intra, ga, 'LFM' ),
                 balanced = getGWmatrix_dedicated( es2_ipf_5kb_intra, ga, 'balanced' )) )
esins2 = esins2[[1]]

save( esins1, esins2,
      file=paste0(data_directory,'InSitu_ES_pooled_5kb_my_object_repl_sep.RData' ))


ns_ipf_5kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga[ga$chr==chr,]
  m = ns_lfm_5kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(ns_ipf_5kb_intra) = chroms[1:20]
save( ns_ipf_5kb_intra,
      file= paste0(data_directory,'intrachromosomal_5kb_IPF_NS.RData') )


ns1_ipf_5kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga[ga$chr==chr,]
  m = ns1_lfm_5kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(ns1_ipf_5kb_intra) = chroms[1:20]

ns2_ipf_5kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga[ga$chr==chr,]
  m = ns2_lfm_5kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(ns2_ipf_5kb_intra) = chroms[1:20]

save( ns1_ipf_5kb_intra, ns2_ipf_5kb_intra, 
      file= paste0(data_directory,'intrachromosomal_5kb_IPF_NS_repl_sep.RData') )


nsins1 = list( c(LFM = getGWmatrix_dedicated( ns1_ipf_5kb_intra, ga, 'LFM' ),
                 balanced = getGWmatrix_dedicated( ns1_ipf_5kb_intra, ga, 'balanced' )) )
nsins1 = nsins1[[1]]
nsins2 = list( c(LFM = getGWmatrix_dedicated( ns2_ipf_5kb_intra, ga, 'LFM' ),
                 balanced = getGWmatrix_dedicated( ns2_ipf_5kb_intra, ga, 'balanced' )) )
nsins2 = nsins2[[1]]

save( nsins1, nsins2,
      file=paste0(data_directory,'InSitu_NS_pooled_5kb_my_object_repl_sep.RData' ))


load(paste0(data_directory,'intrachromosomal_5kb_IPF_NS.RData'))
ns_ipf_5kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga[ga$chr==chr,]
  m = ns_lfm_5kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(ns_ipf_5kb_intra) = chroms[1:20]
save( ns_ipf_5kb_intra, 
      file= paste0(data_directory,'InSitu_NS_pooled_5kb_my_object.RData') )


nsins = list( c(LFM = getGWmatrix_dedicated( ns_ipf_5kb_intra, ga, 'LFM' ),
                balanced = getGWmatrix_dedicated( ns_ipf_5kb_intra, ga, 'balanced' )) )
nsins = nsins[[1]]

save( nsins,
      file=paste0(data_directory,'InSitu_NS_pooled_5kb_my_object.RData' ))

es_ipf_10kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga10[ga10$chr==chr,]
  m = es_lfm_10kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(es_ipf_10kb_intra) = chroms[1:20]

es1_ipf_10kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga10[ga10$chr==chr,]
  m = es1_lfm_10kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(es1_ipf_10kb_intra) = chroms[1:20]

es2_ipf_10kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga10[ga10$chr==chr,]
  m = es2_lfm_10kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(es2_ipf_10kb_intra) = chroms[1:20]

save( es1_ipf_10kb_intra, es2_ipf_10kb_intra, 
      file = paste0(data_directory,'intrachromosomal_10kb_IPF_ES_repl_sep.RData') )
save( es_ipf_10kb_intra, file = paste0(data_directory,'intrachromosomal_10kb_IPF_ES_pooled.RData') )



ns_ipf_10kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga10[ga10$chr==chr,]
  m = ns_lfm_10kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(ns_ipf_10kb_intra) = chroms[1:20]

ns1_ipf_10kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga10[ga10$chr==chr,]
  m = ns1_lfm_10kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(ns1_ipf_10kb_intra) = chroms[1:20]

ns2_ipf_10kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga10[ga10$chr==chr,]
  m = ns2_lfm_10kb[[chr]]
  print(paste0('normalizing ', chr))
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names(ns2_ipf_10kb_intra) = chroms[1:20]


save( ns1_ipf_10kb_intra, ns2_ipf_10kb_intra, 
      file= paste0(data_directory,'intrachromosomal_10kb_IPF_NS_repl_sep.RData') )
save( ns_ipf_10kb_intra, file = paste0(data_directory,'intrachromosomal_10kb_IPF_NS_pooled.RData') )

esins = list( c(LFM = getGWmatrix_dedicated( es_ipf_10kb_intra, ga10, 'LFM' ),
                balanced = getGWmatrix_dedicated( es_ipf_10kb_intra, ga10, 'balanced' )) )
esins = esins[[1]]
save( esins, 
      file=paste0(data_directory,'InSitu_ES_pooled_10kb_my_object.RData' ))

nsins = list( c(LFM = getGWmatrix_dedicated( ns_ipf_10kb_intra, ga10, 'LFM' ),
                balanced = getGWmatrix_dedicated( ns_ipf_10kb_intra, ga10, 'balanced' )) )
nsins = nsins[[1]]
save( nsins, 
      file=paste0(data_directory,'InSitu_NS_pooled_10kb_my_object.RData' ))

Export matrices to generate .hic files for visualization with juicebox. We also provide the code how to get the .hic file using juicer. We also provide an option to export only a subset of interactions (conrolled by the parameter disCutoff). This might reveal to be useful, if we wish to simply reduce the size of .hic file.

nsc.juicer = getMatrixForJucerPre( nsins[[2]], 
ga, paste0('chr',c(1:19,'X','Y')), 250000000 )
write.table(nsc.juicer, file=paste0(data_directory,'nsc.juicer.txt'),
                 quote=FALSE, row.names=FALSE, col.names=FALSE, 
                 sep='\t')

esc.juicer = getMatrixForJucerPre( esins[[2]],
                                   ga, paste0('chr',c(1:19,'X','Y')), 250000000 )
write.table(esc.juicer, file=paste0(data_directory,'esc.juicer.txt'),
            quote=FALSE, row.names=FALSE, col.names=FALSE,
            sep='\t')

# cd /g/huber/users/pekowska/Tools/juicer_helix/internet/
# java -Xmx5g -jar juicer_tools_0.7.5.jar pre -n esc.tcc.juicer.txt esc_IPF_tcc_10kb.hic mm9
# java -Xmx5g -jar juicer_tools_0.7.5.jar pre -n nsc.tcc.juicer.txt nsc_IPF_tcc_10kb.hic mm9
# java -Xmx10g -jar juicer_tools_0.7.5.jar pre -n esc.juicer.txt esc_IPF_pooled_5kb.hic mm9
# java -Xmx10g -jar juicer_tools_0.7.5.jar pre -n nsc.juicer.txt nsc_IPF_pooled_5kb.hic mm9

Identification of interactions

We are done with the normalization. We will now define a function to call interactions. Details about the underlying statistics can be found in the STAR methods of our paper. In essence, we will use Wald test to check if an interaction between two bins (resolution of 10kb) is stronger than the expected value. The expected value is computed as the smoothed mean of the interaction strength at the corresponding genomic distance. We first define the functions necessary to perform these analyses.

interaction_calling_nb = function(counts,norm_facs){

    y = DGEList(counts, lib.size = rep(1, ncol(counts)))
    y$offset = log(norm_facs)
    y = estimateDisp(y, min.row.sum = 2, trend.method = "movingave")
    dispersion <- getDispersion(y)
    dd = DESeqDataSetFromMatrix(countData = counts, design = ~ 1,
                                 colData = DataFrame(sample = colnames(counts)))

    normalizationFactors(dd) = norm_facs
    dispersions(dd) = dispersion
    dd = nbinomWaldTest(dd, betaPrior = FALSE)
    return( results(dd, independentFiltering = FALSE, cooksCutoff = FALSE,
                   altHypothesis = "greater") )

}


smoothed_mean = function(vals, distan){
  
  Max =  max(as.numeric(distan))
  myV = split( vals, as.numeric(distan) )
  
  stopifnot( all( diff( as.numeric(names(myV))) == 1 ) )
  
  
  means = lapply( as.list( seq( 2, Max-1 ) ), 
                  function(d){
                    mean( c( mean(  myV[[ d-1 ]], na.rm=T ), 
                             mean(  myV[[ d  ]],  na.rm=T ), 
                             mean(  myV[[ d+1 ]], na.rm=T ) ),na.rm=T ) } )
  ffit = c( mean( c(mean(myV[[ 1 ]]), mean(myV[[2]])) ), 
            unlist(means), 
            mean( c(mean(myV[[ length(myV)-1 ]]), 
                    mean(myV[[ length(myV)]])) ) )
  return( ffit )
  
}


getOE.smatrix=function(normalizedObject,ga,DISTANCE){
  
  res=Matrix(0,nrow=nrow(ga),ncol=nrow(ga),sparse=T)
  NR=0
  pt=data.frame()
  chroms=paste0('chr',c(1:19,'X'))
  
  for(i in 1:20){
    ## get the estimate of the expected value
    thisChromID = which(ga$chr==chroms[i])
    FROM = min(thisChromID)  
    TO   = max(thisChromID)
    tp = as.data.frame(summary(normalizedObject[[2]][FROM:TO,FROM:TO]), stringsAsFactors=F)
    tp = tp[tp$i<tp$j,]
    tp = tp[tp$j-tp$i < DISTANCE,]
    sm = smoothed_mean( tp$x, tp$j-tp$i )
    tp$expected = sm[ as.numeric( tp$j-tp$i ) ]
    LFM = normalizedObject[[1]][FROM:TO,FROM:TO]
    LFM = LFM[cbind(tp$i, tp$j)]
    tp$SF = tp$expected * (LFM/tp$x)
    tp$i=tp$i+NR
    tp$j=tp$j+NR
    pt = rbind(pt,tp)
    NR = NR+sum(ga$chr==chroms[i])
    }
  res[ cbind(pt$i, pt$j) ] = pt$SF
  
  return(res)
}


GetMs = function( coord, distance ){
  
  reg.width = 1+2*distance
  return( cbind( rows = unlist(lapply(as.list(coord[,1]), function(x){ rep(seq(x-distance, x+distance), reg.width)})),
                 cols = unlist(lapply(as.list(coord[,2]), function(x){ rep(seq(x-distance, x+distance), each=reg.width)})),
                 id = rep( seq(1,nrow(coord)), each=reg.width^2) ) )
  
}


isolateAreaAroundPixelsOfChoice = function( coordinates, D, ga ){
  
  PI = coordinates
  nr = nrow(ga)
  
  PI = PI[PI[,2]-PI[,1] > 2*D + 1 & PI[,1]+2*D + 1<nr & PI[,2]+2*D + 1<nr & PI[,1]>2*D+1 & PI[,2]>2*D+1, ]
  PI = PI[ PI[,1]<nr-(1+2*D) &  PI[,2]<nr-(1+2*D), ]
  print( paste('Isolated', nrow(PI), 'puative interactios'))
  Mcoord = GetMs( PI, distance = D ) 
  
  return( data.frame( row=as.numeric(Mcoord[,1]),
                      col=as.numeric(Mcoord[,2]),
                      id=as.numeric(Mcoord[,3]),
                      stringsAsFactors=FALSE ) ) }


getScorePerAreaAroundInteraction = function( mat ){
  
  hmPixelsInPeak = sum( mat[5:7 , 5:7 ]>0 )
  hmPixelsInZone = sum( mat>0 )

  peak = mat[5:7 , 5:7 ]
  mat[5:7 , 5:7 ] = 0
  donut = sum( mat[ 4:8, 4:8 ] )
  mat[ 4:8, 4:8 ] = 0
  return( cbind( peak = sum(peak, na.rm=TRUE),
                 F = sum( mat[ 1:5,7:11 ], na.rm=TRUE),
                 S = sum( mat[ 5:7, 9:11 ], na.rm = TRUE),
                 T = sum( mat[ 7:11, 7:11 ], na.rm = TRUE),
                 Fo = sum( mat[ 9:11, 5:7  ], na.rm = TRUE),
                 Fi = sum( mat[ 7:11, 1:5  ], na.rm = TRUE),
                 Si = sum( mat[ 5:7, 1:3  ], na.rm = TRUE),
                 Se = sum( mat[ 1:5, 1:5  ], na.rm = TRUE),
                 E = sum( mat[ 1:3, 5:7 ], na.rm = TRUE),
                 Donut = donut,
                 hmPixelsInPeak = hmPixelsInPeak,
                 hmPixelsInZone = hmPixelsInZone))
  
}

Now we define the main function that wraps all the functions defined above.

findInteractions = function( normalizedObject1, normalizedObject2,
                             ga,Lfm_threshold, 
                             DISTANCE,
                             area.around){
  #  normalizedObject1=es1; normalizedObject2=es2; Lfm_threshold=1; DISTANCE=200; area.around=1
  lfm.filt = (normalizedObject1[[1]]>Lfm_threshold) * 
    (normalizedObject2[[1]]>Lfm_threshold)
  id = as.data.frame(summary(lfm.filt),stringsAsFactors=F)
  id = id[ id$x>0 & ga$chr[id$i]==ga$chr[id$j] & id$j-id$i<DISTANCE, ]
  print('applied initial filters')
  
  expected1 = getOE.smatrix(normalizedObject1,ga,DISTANCE=DISTANCE)
  expected2 = getOE.smatrix(normalizedObject2,ga,DISTANCE=DISTANCE)
  print( 'computed expected values')
  
  ## pixels to include
  pixelsTI = isolateAreaAroundPixelsOfChoice( id,
                                              D=area.around,
                                              ga )
  # determine the coordinate of the central pixel
  coordinates = pixelsTI[ seq(round(((1+(2*area.around))^2)/2)+1,
                              nrow(pixelsTI), 
                              by=(1+(2*area.around))^2 ), c('row','col','id')]

  ## take the observed and expected values
  ob1 = normalizedObject1[[1]][cbind(pixelsTI$row,pixelsTI$col)]
  ob2 = normalizedObject2[[1]][cbind(pixelsTI$row,pixelsTI$col)]
  ex1 = expected1[cbind(pixelsTI$row,pixelsTI$col)]
  ex2 = expected2[cbind(pixelsTI$row,pixelsTI$col)]
  
  # split by the zone ID
  ob1 = split( ob1, pixelsTI$id )
  ob2 = split( ob2, pixelsTI$id )
  ex1 = split( ex1, pixelsTI$id )
  ex2 = split( ex2, pixelsTI$id )
  stopifnot(  all(diff(as.numeric(names(ob1)))==1) & all(diff(as.numeric(names(ob2)))==1) & all(diff(as.numeric(names(ex1)))==1) & all(diff(as.numeric(names(ex2)))==1) )
  
  # get the count table along with the size factors
  expected = cbind( ex1=unlist(lapply(ex1, sum)),
                    ex2=unlist(lapply(ex2, sum)) )
  observed = cbind( ob1=unlist(lapply(ob1, sum)),
                    ob2=unlist(lapply(ob2, sum)) )
  
  # consider only places where we had SizeFactors for both replicates
  myTests = rowSums(expected>0)==2
  expected = expected[myTests,]
  observed = observed[myTests,]
  coordinates = coordinates[myTests,]
  test = interaction_calling_nb( observed, expected )

  print('testing for significance')
  return( list(o=observed,
               e=expected,
               c=coordinates,
               test=test ) )
}

Next, we identify interactions in ES cells (FBS/LIF and 2i/LIF) and in NS cells.

load(paste0(data_directory, 'TCC_IPF_10kb_ES_NS_repl_sep.RData'))
esi = findInteractions( es1, es2, 
                        ga10, 1,
                        200, 1 )

nsi = findInteractions( ns1, ns2, 
                        ga10, 1,
                        200, 1 )
save( esi, nsi, 
      file=paste0(data_directory,'TCC_ES_NS_interactions.RData'))
@


Identify interactions in ES cells grown in the presence of 2i and LIF.


<<findInteractions_ES_2iLIF, eval=FALSE>>=
load(paste0(data_directory, 'TCC_IPF_10kb_repl_sep_ES_2i.RData'))

tii = findInteractions( ti1, ti2, 
                        ga10,1,
                        200, 1 )
save( tii, 
      file=paste0(data_directory,'TCC_2i_cells_interactions.RData'))

This analysis is relatively time consuming. Therefore, we save these calls.

Identification of loops

We load the interaction files and define the function to identify loops.

load( paste0(data_directory,'TCC_ES_NS_interactions.RData') )  
load( paste0(data_directory,'TCC_2i_cells_interactions.RData') )

Function to call loops. As described in the STAR methods, we define eight neighborhoods that surround the interrogated pixel; we consider the area surrounding the pixel as the interaction area. Loop calling is based on counting the interactions in these neighbourhoods.

callLoops=function(int.result,
                   ga, 
                   D,
                   Pthr,
                   FCthr,
                   peakThr,
                   donoutThr){
  
  ## interactions above the thresholds
  Interactions = int.result[[3]][int.result[[4]]$padj<Pthr
                                 & int.result[[4]]$log2FoldChange>log2(FCthr),]
  
  ## get a binary marix with the interactions
  sm = Matrix(0L, nrow=nrow(ga),ncol=nrow(ga))
  sm[ cbind(Interactions[,1], Interactions[,2]) ] = 1
  
  # identify pixels around the filtered interactions
  Int = isolateAreaAroundPixelsOfChoice( Interactions[,1:2], 
                                         D=D, 
                                         ga )
  coords = Int[ seq(round(((1+(2*D))^2)/2)+1,nrow(Int),by=(1+(2*D))^2 ), c('row','col','id')]
  isInt = sm[cbind(Int[,'row'],Int[,'col'])]
  onInt =  split( isInt, Int$id )                                         
  int.Int = do.call('rbind', lapply( onInt, function(x){
    getScorePerAreaAroundInteraction(matrix(x,nrow=1+(2*D),ncol=1+(2*D)))} ) )

  # identify loop pixels
  ids = int.Int[,1]>peakThr  & rowSums(int.Int[,2:9]<donoutThr)==8
  
  # cluster immediately neighbouring loop pixels
  # into loops
  sm = Matrix(0L, nrow=nrow(ga),ncol=nrow(ga))
  sm[ cbind(coords[ids,1],coords[ids,2]) ] = 1
  
  forClustering = isolateAreaAroundPixelsOfChoice( coords[ids,1:2],D=1,ga)
  forClustering = forClustering[!duplicated(forClustering[,1:2]),]
  
  clustered = annotateInteractionsToZones( forClustering, 1, ga, 300 ) 
  clustered = clustered[ga$chr[clustered$row]==ga$chr[clustered$col],]

  return(list(clustered = clustered ) )
}

Now we identify the loops.

loops.ES = callLoops( int.result=esi,
                      ga=ga10,
                      D=5, 
                      Pthr=0.2,
                      FCthr=1.5,
                      peakThr=1,
                      donoutThr=3 )

loops.NS = callLoops( int.result=nsi,
                      ga=ga10,
                      D=5, 
                      Pthr=0.2,
                      FCthr=1.5,
                      peakThr=1,
                      donoutThr=3 )

save( loops.ES, loops.NS, 
      file=paste0(data_directory,'TCC_ES_NS_loops.RData'))

loops.TI = callLoops( int.result=tii,
                      ga=ga10,
                      D=5, 
                      Pthr=0.2,
                      FCthr=1.5,
                      peakThr=1,
                      donoutThr=3 )

save( loops.TI,
      file=paste0(data_directory,'TCC_2i_loops.RData'))

Check how many identified loops have CTCF at anchors.

load(paste0(data_directory,'TCC_ES_NS_loops.RData'))

esct = import.bed(paste0(data_directory,'ES_FBS_CTCF_peaks.bed'))
nsct = import.bed(paste0(data_directory,'NS_CTCF_peaks.bed'))
seqlevelsStyle(esct) = 'ucsc'
seqlevelsStyle(nsct) = 'ucsc'

getGR=function(coords){
  return(GRanges(seqnames = Rle(coords[,1]),
                 ranges = IRanges(start = as.numeric(coords[,2]),
                                  end = as.numeric(coords[,3])),
                 strand = Rle(rep("*", nrow(coords)))) ) }

esloops = do.call('rbind', lapply(split(loops.ES[[1]],loops.ES[[1]]$x), function(l){
  data.frame(chr1=ga10$chr[l[1,1]],x1=min(ga10$start[l[,1]]),x2=max(ga10$start[l[,1]]),
             chr2=ga10$chr[l[1,1]],x2=min(ga10$start[l[,2]]),x2=max(ga10$start[l[,2]]), stringsAsFactors=FALSE) }))

nsloops = do.call('rbind', lapply(split(loops.NS[[1]],loops.NS[[1]]$x), function(l){
  data.frame(chr1=ga10$chr[l[1,1]],x1=min(ga10$start[l[,1]]),x2=max(ga10$start[l[,1]]),
             chr2=ga10$chr[l[1,1]],x2=min(ga10$start[l[,2]]),x2=max(ga10$start[l[,2]]), stringsAsFactors=FALSE) }))

nrow(esloops)

## [1] 2247

nrow(nsloops)

## [1] 3183

## majority of loop anchors overlaps CTCF peaks
m=rbind( ES=table(rowSums(cbind( countOverlaps(getGR(esloops[,1:3]),esct)>0,
                     countOverlaps(getGR(esloops[,4:6]),esct)>0) ) ),
         NS=table(rowSums(cbind( countOverlaps(getGR(nsloops[,1:3]),nsct)>0,
                     countOverlaps(getGR(nsloops[,4:6]),nsct)>0) ) ) )


par( cex.lab=1.5, cex.axis=1.5, bty='n' )
barplot( m, beside=T, col=c(myEScolor,myNScolor), 
         names=0:2, main='', xlab='N [anchors overlapping CTCF peak(s)]' )
legend(x=1,y=1500,c('ES','NS'), pch=15, 
       col=c(myEScolor,myNScolor), bty='n', cex=1.5 )

We save the TCC loop coordinates.

les = loops.ES$clustered
lns = loops.NS$clustered

loop.coordinates.es = do.call('rbind',lapply(split(les,les$x),function(l){
  data.frame(chr1=ga10$chr[min(l$row)], start1=ga10$start[min(l$row)], end1=ga10$end[max(l$row)],
             chr2=ga10$chr[min(l$col)], start2=ga10$start[min(l$col)], end2=ga10$end[max(l$col)],
             stringsAsFactors = F)
}))

loop.coordinates.ns = do.call('rbind',lapply(split(lns,lns$x),function(l){
  data.frame(chr1=ga10$chr[min(l$row)], start1=ga10$start[min(l$row)], end1=ga10$end[max(l$row)],
             chr2=ga10$chr[min(l$col)], start2=ga10$start[min(l$col)], end2=ga10$end[max(l$col)],
             stringsAsFactors = F)
}))

save( les, lns, 
      loop.coordinates.es,loop.coordinates.ns, 
      file=paste0(data_directory,'Loops_coordinates.RData') )

Comparative analysis of Loops - TCC

In this section we show how we performed the calling of differential loops in TCC data.

ES vs. NS cells

We pool loops in ES and NS cells and cluster the pixels to remove redundancy.

load(paste0(data_directory, 'TCC_IPF_10kb_ES_NS_repl_sep.RData'))
load(paste0(data_directory,'Loops_coordinates.RData'))
loops.input = rbind(les,lns)
loops = loops.input[!duplicated(loops.input[,1:2]),]
loops2d = annotateInteractionsToZones( loops[,1:2], 1, ga10, 300 )
loops2d = loops2d[ga10$chr[loops2d$row] == ga10$chr[loops2d$col], ]
lids = as.matrix( loops2d[,1:2] )

Now retrieve the normalized and un-normalized values for the clustered loops. First we take ligation frequencies in these clustered areas.

tam = data.frame( es1 = es1[[1]][ lids ],
                  es2 = es2[[1]][ lids ],
                  ns1 = ns1[[1]][ lids ],
                  ns2 = ns2[[1]][ lids ],
                  row = loops2d$row,
                  col = loops2d$col,
                  id = loops2d$x,
                  stringsAsFactors=FALSE)

tam = do.call('rbind', lapply(split(tam, tam$id), function(z){
  data.frame( es1=colSums(z)[1],
              es2=colSums(z)[2],
              ns1=colSums(z)[3],
              ns2=colSums(z)[4],
              id=unique(z[,'id']),
              left_start=min(z[,'row']),
              left_end=max(z[,'row']),
              right_start=min(z[,'col']),
              right_end=max(z[,'col']),
              row=round(mean(z[,'row'])),
              col=round(mean(z[,'col'])),
              stringsAsFactors=FALSE )
}) )

Then we take the normalized ligation frequencies in these clustered areas.

mat = data.frame( es1 = es1[[2]][ lids ],
                  es2 = es2[[2]][ lids ],
                  ns1 = ns1[[2]][ lids ],
                  ns2 = ns2[[2]][ lids ],
                  row = loops2d$row,
                  col = loops2d$col,
                  id = loops2d$x,
                  stringsAsFactors=FALSE)

mat = do.call('rbind', lapply(split(mat, mat$id), function(z){
  data.frame( es1=colSums(z)[1],
              es2=colSums(z)[2],
              ns1=colSums(z)[3],
              ns2=colSums(z)[4],
              id=unique(z[,'id']),
              left_start=min(z[,'row']),
              left_end=max(z[,'row']),
              right_start=min(z[,'col']),
              right_end=max(z[,'col']),
              row=round(mean(z[,'row'])),
              col=round(mean(z[,'col'])),
              stringsAsFactors=FALSE )
}) )

save( mat, tam, loops2d,
      file=paste0(data_directory, 'TCC_Loops_merged_ES_NS.RData') )

We perform differential loop analysis.

load(paste0(data_directory, 'TCC_Loops_merged_ES_NS.RData'))
norm_facs = as.matrix(tam[, 1:4] / mat[, 1:4])
norm_facs[is.nan(norm_facs) ] = 1

head(tam[, 1:4]  / norm_facs)

##          es1        es2        ns1        ns2
## 1 0.13944336 0.13892008 0.10131579 0.09569634
## 2 0.02264798 0.05168537 0.07443603 0.09135705
## 3 0.05280570 0.07672645 0.01279394 0.03834995
## 4 0.10674115 0.11276552 0.23052304 0.21343575
## 5 0.07468507 0.02876307 0.05608981 0.03984282
## 6 0.14528805 0.09022108 0.20563792 0.20118898

counts = tam[, 1:4]
cond =  c("E", "E", "N", "N")
y = DGEList(counts, lib.size = rep(1, ncol(counts)), group = cond)
y$offset = log(norm_facs)
y = estimateDisp(y, min.row.sum = 2, trend.method = "movingave")

## Design matrix not provided. Switch to the classic mode.

dispersion = getDispersion(y)
dd = DESeqDataSetFromMatrix(countData = counts, design = ~ cond,
                            colData = DataFrame(sample = colnames(counts),
                            cond = cond))

## converting counts to integer mode

normalizationFactors(dd) = norm_facs
dispersions(dd) = dispersion
dd = nbinomWaldTest(dd, betaPrior = FALSE)


di = as.data.frame( results(dd, 
                            independentFiltering = FALSE, 
                            cooksCutoff = FALSE,
                            lfcThreshold = 0 ) )
di$pvalueC = di$pvalue
di$pvalueC[di$pvalue<10e-10] = 10^-10
di=data.frame(di, mat)

di$sig=di$padj<0.1 & abs(di$log2FoldChange)>0.59

We define ES and NS specific loops.

loops.es = di[di$padj<0.1 & di$log2FoldChange<log2(1/1.5), ]
loops.ns = di[di$padj<0.1 & di$log2FoldChange>log2(1.5), ]
nrow(loops.es)

## [1] 287

nrow(loops.ns)

## [1] 1490

nrow(di)

## [1] 4328

save( loops.es, loops.ns, di,
      file=paste0(data_directory,'TCC_ES_NS_DESeq_loop_comparision'))

Preparation for APA.

load( paste0(data_directory, 'TCC_IPF_10kb_ES_NS_pooled.RData')  )

prepareForAPA = function( sm, D, coordinates, ga ){
  coord = isolateAreaAroundPixelsOfChoice( coordinates, D, ga )
  stopifnot(max(coord)<nrow(ga))
  coord$x = sm[ cbind(coord$row,coord$col)]
  return( split(coord$x,coord$id) )
  
}

es.loops.sig.es = prepareForAPA( es[[2]],D=5,coordinates=loops.es[,c('row','col')],ga10)
es.loops.sig.ns = prepareForAPA( ns[[2]],D=5,coordinates=loops.es[,c('row','col')],ga10)
ns.loops.sig.es = prepareForAPA( es[[2]],D=5,coordinates=loops.ns[,c('row','col')],ga10)
ns.loops.sig.ns = prepareForAPA( ns[[2]],D=5,coordinates=loops.ns[,c('row','col')],ga10)
                                                
save( es.loops.sig.es, 
      es.loops.sig.ns, 
      ns.loops.sig.es, 
      ns.loops.sig.ns,
      file=paste0(data_directory, 'loops_controls_ES_NS.RData'))

APA

load(paste0(data_directory, 'loops_controls_ES_NS.RData'))

APA = function(loops,colors,
               scaleLimit,matrixSize,
               size,plotName){
  
  apa=Reduce('+',lapply(loops,
                         function(l){matrix(l,nrow=matrixSize,ncol=matrixSize)}))
  apa=apa/length(loops)
  image( x=seq(-size, size, length.out=matrixSize ), 
         y=seq(-size, size, length.out=matrixSize ),
         matrix(cut( apa, c(seq(0,scaleLimit, length.out=255),1), labels=FALSE),11,11),
         col=colors,
         breaks = 1:(length(colors)+1),
         xlab='Distance (kb)', ylab='Distance (kb)', main=plotName ) }



par(mfrow=c(2,2),mar=rep(5,4),
    cex.lab=1.5,cex.main=1.7,cex.axis=1.5)

APA( loops=es.loops.sig.es,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.01,matrixSize=11,size=50,plotName='ES (ES)'  )
APA( loops=es.loops.sig.ns,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.01,matrixSize=11,size=50,plotName='ES (NS)'  )
APA( loops=ns.loops.sig.es,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.01,matrixSize=11,size=50,plotName='NS (ES)'  )
APA( loops=ns.loops.sig.ns,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.01,matrixSize=11,size=50,plotName='NS (NS)'  )

Loops and pluripotency

We identify differential loops between ES cells (2i/LIF) and NS cells.

load( paste0(data_directory, 'TCC_IPF_10kb_ES_NS_repl_sep.RData'))
load( paste0(data_directory, 'TCC_IPF_10kb_repl_sep_ES_2i.RData') )
load( paste0(data_directory,'TCC_IPF_10kbMatrices_repl_sep_EpiSC.RData') )

load( paste0(data_directory,'TCC_2i_loops.RData') )
load( paste0(data_directory,'TCC_ES_NS_loops.RData') )


lts = loops.TI$clustered
save( lts, file=paste0(data_directory,'Loops_2iLif_coordinates.RData'))


lns = loops.NS$clustered
loops.input = rbind(lts,lns)
loops = loops.input[!duplicated(loops.input[,1:2]),]
loops2d = annotateInteractionsToZones( loops[,1:2], 1, ga10, 300 )
loops2d = loops2d[ga$chr[loops2d$row] == ga$chr[loops2d$col], ]


tam = data.frame( ti1 = ti1[[1]][ as.matrix( loops2d[,1:2] ) ],
                  ti2 = ti2[[1]][ as.matrix( loops2d[,1:2] ) ],
                  ep1 = ep1[[1]][ as.matrix( loops2d[,1:2] ) ],
                  ep2 = ep2[[1]][ as.matrix( loops2d[,1:2] ) ],
                  ns1 = ns1[[1]][ as.matrix( loops2d[,1:2] ) ],
                  ns2 = ns2[[1]][ as.matrix( loops2d[,1:2] ) ],
                  row = loops2d$row,
                  col = loops2d$col,
                  id = loops2d$x,
                  stringsAsFactors=FALSE)

tam = do.call('rbind', lapply(split(tam, tam$id), function(z){
  data.frame( ti1=colSums(z)[1],
              ti2=colSums(z)[2],
              ep1=colSums(z)[3],
              ep2=colSums(z)[4],
              ns1=colSums(z)[5],
              ns2=colSums(z)[6],
              id=unique(z[,'id']),
              left_start=min(z[,'row']),
              left_end=max(z[,'row']),
              right_start=min(z[,'col']),
              right_end=max(z[,'col']),
              row=round(mean(z[,'row'])),
              col=round(mean(z[,'col'])),
              stringsAsFactors=FALSE )
}) )



mat = data.frame( ti1 = ti1[[2]][ as.matrix( loops2d[,1:2] ) ],
                  ti2 = ti2[[2]][ as.matrix( loops2d[,1:2] ) ],
                  ep1 = ep1[[2]][ as.matrix( loops2d[,1:2] ) ],
                  ep2 = ep2[[2]][ as.matrix( loops2d[,1:2] ) ],
                  ns1 = ns1[[2]][ as.matrix( loops2d[,1:2] ) ],
                  ns2 = ns2[[2]][ as.matrix( loops2d[,1:2] ) ],
                  row = loops2d$row,
                  col = loops2d$col,
                  id = loops2d$x,
                  stringsAsFactors=FALSE)

mat = do.call('rbind', lapply(split(mat, mat$id), function(z){
  data.frame( ti1=colSums(z)[1],
              ti2=colSums(z)[2],
              ep1=colSums(z)[3],
              ep2=colSums(z)[4],
              ns1=colSums(z)[5],
              ns2=colSums(z)[6],
              id=unique(z[,'id']),
              left_start=min(z[,'row']),
              left_end=max(z[,'row']),
              right_start=min(z[,'col']),
              right_end=max(z[,'col']),
              row=round(mean(z[,'row'])),
              col=round(mean(z[,'col'])),
              stringsAsFactors=FALSE )
}) )


save( mat, tam,
      file=paste0(data_directory, 'Loops_merged_2i_NS.RData') )

Differential loop calling.

load(paste0(data_directory, 'Loops_merged_2i_NS.RData'))
norm_facs = as.matrix(tam[, c(1,2,5,6)] / mat[, c(1,2,5,6)])
norm_facs[is.nan(norm_facs) ] <- 1

head(tam[, c(1,2,5,6)]  / norm_facs)

##          ti1        ti2        ns1        ns2
## 1 0.09129244 0.07871238 0.04360697 0.05174893
## 2 0.04093303 0.03953112 0.07443603 0.09135705
## 3 0.15598653 0.16963597 0.04783226 0.08206230
## 4 0.46054854 0.07451381 0.23052304 0.21343575
## 5 0.02245209 0.01998159 0.05608981 0.03984282
## 6 0.08222796 0.07300396 0.20563792 0.20118898

counts = tam[, c(1,2,5,6)]
cond =  c("E", "E", "N", "N")


y = DGEList(counts, lib.size = rep(1, ncol(counts)), group = cond)
y$offset = log(norm_facs)
y = estimateDisp(y, min.row.sum = 2, trend.method = "movingave")

## Design matrix not provided. Switch to the classic mode.

dispersion = getDispersion(y)
dd = DESeqDataSetFromMatrix(countData = counts, design = ~ cond,
                             colData = DataFrame(sample = colnames(counts),
                             cond = cond))

## converting counts to integer mode

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

normalizationFactors(dd) = norm_facs
dispersions(dd) = dispersion
dd = nbinomWaldTest(dd, betaPrior = FALSE)


di = as.data.frame( results(dd, independentFiltering = FALSE, cooksCutoff = FALSE,
              lfcThreshold = 0 ) )
di$pvalueC = di$pvalue
di$pvalueC[di$pvalue<10e-10] = 10^-10
di=data.frame(di, mat)

di$sig=di$padj<0.1 & abs(di$log2FoldChange)>log2(1.5)

loops.ti = di[ di$padj<0.1 & di$log2FoldChange<log2(1/1.5), ]
loops.ns.ti = di[ di$padj<0.1 & di$log2FoldChange>log2(1.5), ]

save(loops.ti, loops.ns.ti, di, 
     file=paste0(data_directory,'loops_comparison_ti_ns.RData'))

Violin plots of the fold change of loop signal for

load(paste0(data_directory, 'Loops_merged_2i_NS.RData'))
load(paste0(data_directory,'loops_comparison_ti_ns.RData'))

mat$ti = rowMeans(mat[,1:2])
mat$ep = rowMeans(mat[,3:4])
mat$ns = rowMeans(mat[,5:6])

sum(sum(di$padj<0.1 & di$log2FoldChange<(log2(1/1.5))))

## [1] 387

sum(di$padj<0.1 & di$log2FoldChange>(log2(1.5)))

## [1] 2286

changing = mat[ di$padj<0.1 & di$log2FoldChange>(log2(1.5)), ]
table( changing$ep>changing$ti )/nrow(changing)

## 
##     FALSE      TRUE 
## 0.2309711 0.7690289

changing[,c('ti','ep','ns')] = log2(changing[,c('ti','ep','ns')])
changing$class = cut( changing$col-changing$row, c(0,40,80,Inf),labels=F)


MAT = data.frame( Signal = c( changing$ti-changing$ep,
                              changing$ns-changing$ep) ,
                  Cell=c(rep('Ti/EpiSC', nrow(changing)),
                         rep('NS/EpiSC', nrow(changing))),
                  Class = c( changing$class, changing$class) ,
                  stringsAsFactors=FALSE)
MAT$Cell = factor( MAT$Cell, levels=c('Ti/EpiSC','NS/EpiSC'))
table( changing$class )

## 
##    1    2    3 
## 1448  563  275

table(MAT$Cell)

## 
## Ti/EpiSC NS/EpiSC 
##     2286     2286

my_facette_names = list(
  '1'=">120 & <400",
  '2'=">=400 & <800",
  '3'=">=800"
)

myLabeller = function(variable,value){
  return(my_facette_names[value])
}

  

p1 = ggplot(MAT, aes(y=Signal,x=Cell,color=Cell ) ) +
  geom_violin(trim=TRUE,size=2) + geom_boxplot( width=0.5, size=1.2 ) +
  scale_colour_manual(values = c("red","black")) +  
  theme_classic() +  facet_grid(. ~ Class, labeller=myLabeller) +
  theme(legend.position="none",
        title = element_text(size=20), 
        line=element_line(size=1),
        axis.title.x = element_text(size=30),
        axis.title.y = element_text(size=30),
        axis.text.x  = element_text(size=30,angle=45,hjust=1),
        axis.text.y  = element_text(size=30),
        strip.text.x = element_text(size=30),
        strip.background = element_rect(colour="white", fill="white")) +
  xlab("") +
  ylab( expression( 'Log'[2]*'(ratio)' )  ) +
  theme(axis.line.y = element_line(color="black", size = 1)) +
  geom_hline(yintercept = 0) 


p1

Prepare the APA plot.

load(paste0(data_directory, 'TCC_IPF_10kb_pooled_ES_2i.RData'))
load(paste0(data_directory, 'TCC_IPF_10kbMatrices_pooled_EpiSC.RData'))
load( paste0(data_directory, 'TCC_IPF_10kb_ES_NS_pooled.RData') )

prepareForAPA = function( sm, D, coordinates, ga ){
  coord = isolateAreaAroundPixelsOfChoice( coordinates, D, ga )
  stopifnot(max(coord)<nrow(ga))
  coord$x = sm[ cbind(coord$row,coord$col)]
  return( split(coord$x,coord$id) )
  
}

ti.loops.sig.ti = prepareForAPA( ti[[2]],D=5,coordinates=loops.ti[,c('row','col')],ga10)
ti.loops.sig.ns = prepareForAPA( ns[[2]],D=5,coordinates=loops.ti[,c('row','col')],ga10)
ns.ti.loops.sig.ti = prepareForAPA( ti[[2]],D=5,coordinates=loops.ns.ti[,c('row','col')],ga10)
ns.ti.loops.sig.ns = prepareForAPA( ns[[2]],D=5,coordinates=loops.ns.ti[,c('row','col')],ga10)
                                                
save( ti.loops.sig.ti, 
      ti.loops.sig.ns, 
      ns.ti.loops.sig.ti, 
      ns.ti.loops.sig.ns,
      file=paste0(data_directory, 'loops_controls_ES_2i_NS.RData'))

Plot the APA for ES cells (2i/LIF) and NS cells.

load(paste0(data_directory, 'loops_controls_ES_2i_NS.RData'))


par(mfrow=c(2,2),mar=rep(5,4),
    cex.lab=1.5,cex.main=1.7,cex.axis=1.5)

APA( loops=ti.loops.sig.ti,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.02,matrixSize=11,size=50,plotName='2i/LIF (2i/LIF)'  )
APA( loops=ti.loops.sig.ns,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.02,matrixSize=11,size=50,plotName='2i/LIF (NS)'  )
APA( loops=ns.ti.loops.sig.ti,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.02,matrixSize=11,size=50,plotName='NS (2i/LIF)'  )
APA( loops=ns.ti.loops.sig.ns,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.02,matrixSize=11,size=50,plotName='NS (NS)'  )

ES cells (2i/LIF) versus epi-stem cells (EpiSC) - comparison of loop signal

Identify loops specific for ES cells (2i/LIF) and EpiSCs. Replicate 1 of the EpiSC was sequenced to a lower extend, we will not consider it for the quantitative comparison.

load(paste0(data_directory, 'Loops_merged_2i_NS.RData'))
norm_facs = as.matrix(tam[, c(1,2,4)] / mat[, c(1,2,4)])
norm_facs[is.nan(norm_facs) ] <- 1

head(tam[, c(1,2,4)]  / norm_facs)

##          ti1        ti2        ep2
## 1 0.09129244 0.07871238 0.07181253
## 2 0.04093303 0.03953112 0.04717113
## 3 0.15598653 0.16963597 0.16136960
## 4 0.46054854 0.07451381 0.11970415
## 5 0.02245209 0.01998159 0.05173047
## 6 0.08222796 0.07300396 0.11177858

counts = tam[, c(1,2,4)]
cond =  c("T", "T", "P")
y = DGEList(counts, lib.size = rep(1, ncol(counts)), group = cond)
y$offset = log(norm_facs)
y = estimateDisp(y, min.row.sum = 2, trend.method = "movingave")

## Design matrix not provided. Switch to the classic mode.

dispersion = getDispersion(y)
dd = DESeqDataSetFromMatrix(countData = counts, design = ~ cond,
                            colData = DataFrame(sample = colnames(counts),
                            cond = cond))

## converting counts to integer mode

## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors

normalizationFactors(dd) = norm_facs
dispersions(dd) = dispersion
dd = nbinomWaldTest(dd, betaPrior = FALSE)

di = as.data.frame( results(dd, independentFiltering = FALSE, cooksCutoff = FALSE,
              lfcThreshold = 0 ) )
di$pvalueC = di$pvalue
di$pvalueC[di$pvalue<10e-10] = 10^-10
di=data.frame(di, mat)

loops.ti.ep = di[ di$padj<0.1 & di$log2FoldChange>log2(1.5), ]
loops.ep.ti = di[ di$padj<0.1 & di$log2FoldChange<log2(1/1.5), ]

nrow(loops.ti.ep) ## 306

## [1] 306

nrow(loops.ep.ti) ## 840

## [1] 840

save(loops.ti.ep, loops.ep.ti, 
     di, 
     file=paste0(data_directory,'loops_comparison_ti_EpiSC.RData'))

Prepare the controls and display the distribution of loop sizes.

load( paste0(data_directory,'loops_comparison_ti_EpiSC.RData') )
d1=table( cut((loops.ti.ep$right_end-1)-(loops.ti.ep$left_start+1), c(0,40,80,Inf)) )
d2=table( cut((loops.ep.ti$right_end-1)-(loops.ep.ti$left_start+1), c(0,40,80,Inf)) )

d1

## 
##   (0,40]  (40,80] (80,Inf] 
##      226       71        9

d2

## 
##   (0,40]  (40,80] (80,Inf] 
##      523      206      111

d2/d1

## 
##    (0,40]   (40,80]  (80,Inf] 
##  2.314159  2.901408 12.333333

DD=rbind(d1,d2)
colnames(DD) = c('>120 & <=400', '>400 & <=800', '>800')


par(mfrow=c(1,1),mar=rep(5,4),
    cex.lab=1.5,cex.axis=1.5)

barplot(DD, beside=T, col=c('orange','gray50'),
        ylab='Number', main='' )

legend( y=500, x=7, pch=15, cex=2, 
        c('2i','EpiSC'),
        col=c('orange','gray50'),
        bty='n' )

ti.loops.sig.ti = prepareForAPA( ti[[2]],D=5,coordinates=loops.ti.ep[,c('row','col')],ga)
ti.loops.sig.ep = prepareForAPA( ep[[2]],D=5,coordinates=loops.ti.ep[,c('row','col')],ga)
ep.loops.sig.ti = prepareForAPA( ti[[2]],D=5,coordinates=loops.ep.ti[,c('row','col')],ga)
ep.loops.sig.ep = prepareForAPA( ep[[2]],D=5,coordinates=loops.ep.ti[,c('row','col')],ga)
                                                
save( ti.loops.sig.ti, 
      ti.loops.sig.ep, 
      ep.loops.sig.ti, 
      ep.loops.sig.ep,
      file=paste0(data_directory, 'Ti_versus_EpiSC_APA.RData'))

Plot the APA

load(paste0(data_directory, 'Ti_versus_EpiSC_APA.RData'))

par(mfrow=c(2,2),mar=rep(5,4),
    cex.lab=1.5,cex.main=1.7,cex.axis=1.5)

APA( loops=ti.loops.sig.ti,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.012,matrixSize=11,size=25,plotName='2i (2i)'  )
APA( loops=ti.loops.sig.ep,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.012,matrixSize=11,size=25,plotName='2i (EpiSC)'  )
APA( loops=ep.loops.sig.ti,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.012,matrixSize=11,size=25,plotName='EpiSC (2i)'  )
APA( loops=ep.loops.sig.ep,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.012,matrixSize=11,size=25,plotName='EpiSC (EpiSC)'  )

Loops - in situ Hi-C

In this section we will consider loops from Juicer and perform the analysis of loop signal in ES and NS cells. We noticed that in both ES and NS cells, there were several loops that were extremely long range (exceed 5Mb). These calls were not associated with contact domains. They rather reflect chromatin compartments. We do not consider these interactions for further analysis.

es_loops_is = read.delim( paste0(data_directory,'ES_loops.txt'),as.is=T )
ns_loops_is = read.delim( paste0(data_directory,'NS_loops.txt'),as.is=T )
es_loops_is = es_loops_is[ es_loops_is[,6]-es_loops_is[,2] < 5005000,] 
ns_loops_is = ns_loops_is[ ns_loops_is[,6]-ns_loops_is[,2] < 5005000,] 

es_loops_is$x1 = es_loops_is$x1 + 1
es_loops_is$y1 = es_loops_is$y1 + 1
ns_loops_is$x1 = ns_loops_is$x1 + 1
ns_loops_is$y1 = ns_loops_is$y1 + 1

Here we analyse the overlap between loop anchors and CTCF binding sites.

esct = import.bed(paste0(data_directory,'ES_FBS_CTCF_peaks.bed'))
nsct = import.bed(paste0(data_directory,'NS_CTCF_peaks.bed'))
seqlevelsStyle(esct) = 'ucsc'
seqlevelsStyle(nsct) = 'ucsc'

getGR=function(coords,offset){
  return(GRanges(seqnames = Rle(coords[,1]),
                 ranges = IRanges(start = as.numeric(coords[,2]-offset),
                                  end = as.numeric(coords[,3]+offset)),
                 strand = Rle(rep("*", nrow(coords)))) ) }

m=rbind( ES=table(rowSums(cbind( countOverlaps(getGR(es_loops_is[,1:3],offset=10000),esct)>0,
                                 countOverlaps(getGR(es_loops_is[,4:6],offset=10000),esct)>0) ) ),
         NS=table(rowSums(cbind( countOverlaps(getGR(ns_loops_is[,1:3],offset=10000),nsct)>0,
                                 countOverlaps(getGR(ns_loops_is[,4:6],offset=10000),nsct)>0) ) ) )


par( cex.lab=1.5, cex.axis=1.5, bty='n' )
barplot( m/1000, beside=T, col=c(myEScolor,myNScolor), 
         ylab='Number (x1000)',
         names=0:2, main='', xlab='N [anchors overlapping CTCF peak(s)]' )
legend(x=1,y=6,c('ES','NS'), pch=15, 
       col=c(myEScolor,myNScolor), bty='n', cex=1.5 )

We recorded several cases where adjacent pixels were called as separate loops. We here cluster these instances. First we define the function to cluster the pixels.

## define functions that allow to 
## represent loops as coordinates of pixels 
## in a matrix of a chosen resolution
## here we will work with the resolution of 5kb

getGR=function(coords,offset){
  return(GRanges(seqnames = Rle(coords[,1]),
                 ranges = IRanges(start = as.numeric(coords[,2]-offset),
                                  end = as.numeric(coords[,3]+offset)),
                 strand = Rle(rep("*", nrow(coords)))) ) }


loops2matrixcoords = function( loops, binAnno, offset ){
  ## loops - loops from juicebox
  ## binAnno - GRanges of bin anno
  ## offset - how much we expand loops
  ## loops = es_loops_is; manual_ann = es_filt; binAnno=gagr; offset=0

  left_gr = getGR( loops[ ,1:3 ], offset )
  right_gr = getGR( loops[ ,4:6 ], offset )
  leftBins = data.frame(findOverlaps(binAnno,left_gr))
  rightBins = data.frame(findOverlaps(binAnno,right_gr))  

  ## juicebox exports column first and then row
  ## filp it here
  
  res = do.call('rbind',lapply(as.list(seq(1,nrow(loops))), function(l){
    left_anno = leftBins[ leftBins[,2]==l, 1]
    right_anno = rightBins[ rightBins[,2]==l, 1]
    return( cbind( row = rep(seq(min(left_anno),max(left_anno)), length(unique(right_anno))),
                   col = rep(seq(min(right_anno),max(right_anno)), each=length(unique(left_anno))),
                   loop = rep(l, length(unique(right_anno)))) )}))
  res = res[!duplicated(res[,1:2]),]
  return(res)
}

We perform the clustering.

## 2D clusters 
es_loops_clust = as.data.frame( loops2matrixcoords( loops=es_loops_is, binAnno=gagr, offset = 0 ) )
ns_loops_clust = as.data.frame( loops2matrixcoords( loops=ns_loops_is, binAnno=gagr, offset = 0 ) )


d = 1

es_loops_centered = do.call('rbind',
  lapply(split(es_loops_clust,es_loops_clust[,'loop']), 
  function(l){
    mids = apply(l,2,function(x){round(mean(x))})
    return( cbind( row = rep(seq(mids[1]-d,mids[1]+d), length(seq(mids[1]-d,mids[1]+d))),
                   col = rep(seq(mids[2]-d,mids[2]+d), each=length(seq(mids[1]-d,mids[1]+d))),
                   loop = rep(mids[3], length( seq(mids[1]-d,mids[1]+d))) )) } ) ) 

ns_loops_centered = do.call('rbind',
  lapply(split(ns_loops_clust,ns_loops_clust[,'loop']), 
  function(l){
    mids = apply(l,2,function(x){round(mean(x))})
    return( cbind( row = rep(seq(mids[1]-d,mids[1]+d), length(seq(mids[1]-d,mids[1]+d))),
                   col = rep(seq(mids[2]-d,mids[2]+d), each=length(seq(mids[1]-d,mids[1]+d))),
                   loop = rep(mids[3], length( seq(mids[1]-d,mids[1]+d))) )) } ) ) 

save( es_loops_centered, ns_loops_centered, 
      file=paste0(data_directory,'In_situ_loops_coordinates.RData'))

loops.input = rbind(es_loops_centered,ns_loops_centered)
loops = loops.input[!duplicated(loops.input[,1:2]),]
loops2d = annotateInteractionsToZones( loops[,1:2], 1, ga, 1000 )
loops2d = loops2d[ga$chr[loops2d$row] == ga$chr[loops2d$col], ]
lids = as.matrix( loops2d[,1:2] )

save( loops2d, lids, 
      file=paste0(data_directory,'loops_ES_NS_clustered_for_comparison.RData') )

Comparative analysis of loops in ES and NS cells - in situ Hi-C data

We perform the quantitative analysis of Hi-C signal. Load the Hi-C data. Note, we will work with the matrices of interactions at the resolution of 5kb.

load(paste0(data_directory,'InSitu_ES_pooled_5kb_my_object_repl_sep.RData'))
load(paste0(data_directory,'InSitu_NS_pooled_5kb_my_object_repl_sep.RData'))
load(paste0(data_directory,'loops_ES_NS_clustered_for_comparison.RData'))

As in the sections above, we firs retrieve the normalized and un-normalized values for the clustered pixels corresponding to unique loops.

tam = data.frame( es1 = esins1[[1]][ lids ],
                  es2 = esins2[[1]][ lids ],
                  ns1 = nsins1[[1]][ lids ],
                  ns2 = nsins2[[1]][ lids ],
                  row = loops2d$row,
                  col = loops2d$col,
                  id = loops2d$x,
                  stringsAsFactors=FALSE)

tam = do.call('rbind', lapply(split(tam, tam$id), function(z){
  data.frame( es1=colSums(z)[1],
              es2=colSums(z)[2],
              ns1=colSums(z)[3],
              ns2=colSums(z)[4],
              id=unique(z[,'id']),
              left_start=min(z[,'row']),
              left_end=max(z[,'row']),
              right_start=min(z[,'col']),
              right_end=max(z[,'col']),
              row=round(mean(z[,'row'])),
              col=round(mean(z[,'col'])),
              stringsAsFactors=FALSE )
}) )

mat = data.frame( es1 = esins1[[2]][ lids ],
                  es2 = esins2[[2]][ lids ],
                  ns1 = nsins1[[2]][ lids ],
                  ns2 = nsins2[[2]][ lids ],
                  row = loops2d$row,
                  col = loops2d$col,
                  id = loops2d$x,
                  stringsAsFactors=FALSE)

mat = do.call('rbind', lapply(split(mat, mat$id), function(z){
  data.frame( es1=colSums(z)[1],
              es2=colSums(z)[2],
              ns1=colSums(z)[3],
              ns2=colSums(z)[4],
              id=unique(z[,'id']),
              left_start=min(z[,'row']),
              left_end=max(z[,'row']),
              right_start=min(z[,'col']),
              right_end=max(z[,'col']),
              row=round(mean(z[,'row'])),
              col=round(mean(z[,'col'])),
              stringsAsFactors=FALSE )
}) )

ids = which(rowSums(tam[,1:4])>0)
tam = tam[ids,]
mat = mat[ids,]
loops2d = loops2d[loops2d$x %in% tam$id, ]

save( mat, tam, loops2d,
      file=paste0(data_directory, 'In_situ_loops_merged_ES_NS_signal.RData') )

We perform differential loop analysis.

load(paste0(data_directory, 'In_situ_loops_merged_ES_NS_signal.RData'))

norm_facs = as.matrix(tam[, 1:4] / mat[, 1:4])
norm_facs[is.nan(norm_facs) ] = 1

head(tam[, 1:4]  / norm_facs)

##          es1        es2        ns1         ns2
## 1 0.04347117 0.04672366 0.01933699 0.023749043
## 2 0.01673290 0.01646888 0.01137967 0.004025863
## 3 0.01951009 0.02844583 0.02869123 0.034941532
## 4 0.03471747 0.03189220 0.03697452 0.030717779
## 5 0.05659973 0.04402490 0.01546021 0.012770461
## 6 0.02801395 0.03360472 0.04939171 0.047647461

counts = tam[, 1:4]
cond =  c("E", "E", "N", "N")
y = DGEList(counts, lib.size = rep(1, ncol(counts)), group = cond)
y$offset = log(norm_facs)
y = estimateDisp(y, min.row.sum = 2, trend.method = "movingave")

## Design matrix not provided. Switch to the classic mode.

dispersion = getDispersion(y)
dd = DESeqDataSetFromMatrix(countData = counts, design = ~ cond,
                            colData = DataFrame(sample = colnames(counts),
                            cond = cond))

## converting counts to integer mode

normalizationFactors(dd) = norm_facs
dispersions(dd) = dispersion
dd = nbinomWaldTest(dd, betaPrior = FALSE)


di = as.data.frame( results(dd, 
                            independentFiltering = FALSE, 
                            cooksCutoff = FALSE,
                            lfcThreshold = 0 ) )
di$pvalueC = di$pvalue
di$pvalueC[di$pvalue<10e-10] = 10^-10
di=data.frame(di, mat)
di$sig=di$padj<0.05 & abs(di$log2FoldChange)>0.59

Save the calls.

loops.es = di[di$padj<0.05 & di$log2FoldChange<log2(1/1.5), ]
loops.ns = di[di$padj<0.05 & di$log2FoldChange>log2(1.5), ]
loops.cm = di[abs(di$log2FoldChange)<log2(1.25), ]

nrow(loops.es)

## [1] 811

nrow(loops.ns)

## [1] 2454

nrow(loops.cm)

## [1] 3917

save( loops.es, loops.ns, loops.cm, di,
      file=paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision'))

Controls

Plot the distributions of differential loops with respect to their genomic span

load(paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision'))

d1=( cut((loops.es$right_end-1)-(loops.es$left_start+1), c(0,40,80,160,Inf)) )
d2=( cut((loops.ns$right_end-1)-(loops.ns$left_start+1), c(0,40,80,160,Inf)) )
d3=( cut((loops.cm$right_end-1)-(loops.cm$left_start+1), c(0,40,80,160,Inf)) )


m=rbind(ES=table(d1), Common=table(d3), NS=table(d2))


par( mar=c(8,8,1,1), cex.lab=1.2, cex.axis=1.2, bty='n') 
barplot(m/1000, beside=T, col=c(myEScolor,'gray',myNScolor),
        names=c("<0.4",">0.4 & <0.8",">0.8 & <1.6",">1.6"),
        ylab='Number (x1000)', ylim=c(0,2.5), xlab='Loop size (Mb)')
legend(x=12, y=2, c('ES','Common','NS'), 
       col=c(myEScolor,'gray',myNScolor), 
       pch=15, bty='n', cex=1.5)

APA - preparations.

load(paste0(data_directory,'InSitu_ES_pooled_5kb_my_object.RData' ))
load(paste0(data_directory,'InSitu_NS_pooled_5kb_my_object.RData' ))

Prepare the average profiles.

prepareForAPA = function( sm, D, coordinates, ga ){
  coord = isolateAreaAroundPixelsOfChoice( coordinates, D, ga )
  stopifnot(max(coord)<nrow(ga))
  coord$x = sm[ cbind(coord$row,coord$col)]
  return( split(coord$x,coord$id) )
  
}

es.loops.sig.es = prepareForAPA( esins[[2]],D=5,coordinates=loops.es[,c('row','col')],ga)
es.loops.sig.ns = prepareForAPA( nsins[[2]],D=5,coordinates=loops.es[,c('row','col')],ga)
ns.loops.sig.es = prepareForAPA( esins[[2]],D=5,coordinates=loops.ns[,c('row','col')],ga)
ns.loops.sig.ns = prepareForAPA( nsins[[2]],D=5,coordinates=loops.ns[,c('row','col')],ga)
                                                
save( es.loops.sig.es, 
      es.loops.sig.ns, 
      ns.loops.sig.es, 
      ns.loops.sig.ns,
      file=paste0(data_directory, 'in_situ_loops_controls_ES_NS.RData'))

The APA plot

load(paste0(data_directory, 'in_situ_loops_controls_ES_NS.RData'))

APA = function(loops,colors,
               scaleLimit,matrixSize,
               size,plotName){
  
  apa=Reduce('+',lapply(loops,
                         function(l){matrix(l,nrow=matrixSize,ncol=matrixSize)}))
  apa=apa/length(loops)
  image( x=seq(-size, size, length.out=matrixSize ), 
         y=seq(-size, size, length.out=matrixSize ),
         matrix(cut( apa, c(seq(0,scaleLimit, length.out=255),1), labels=FALSE),11,11),
         col=colors,
         breaks = 1:(length(colors)+1),
         xlab='Distance (kb)', ylab='Distance (kb)', main=plotName )
}



par(mfrow=c(2,2),mar=rep(5,4),
    cex.lab=1.5,cex.main=1.7,cex.axis=1.5)

APA( loops=es.loops.sig.es,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.0075,matrixSize=11,size=25,plotName='ES (ES)'  )
APA( loops=es.loops.sig.ns,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.0075,matrixSize=11,size=25,plotName='ES (NS)'  )
APA( loops=ns.loops.sig.es,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.0075,matrixSize=11,size=25,plotName='NS (ES)'  )
APA( loops=ns.loops.sig.ns,
     colors=colorRampPalette(c("white", "red"))(256),
     scaleLimit=0.0075,matrixSize=11,size=25,plotName='NS (NS)'  )

We also display the distribution of p values.

par(mfrow=c(1,1),mar=c(6,12,2,1),cex.lab=1.5, 
    cex.axis=1.5, cex.main=1.5, lwd=1, bty='n')
hist( di$pvalue, 
      col=c('steelblue'),
      xlab='P-value',n=200,
      las=2,ylab='',
      main='')

And display the volcano plot.

par(cex.lab=1.5, cex.axis=1.5,mar=c(6,6,1,1))
plot( x=di$log2FoldChange, y=-1*log2(di$pvalue), 
      col=ifelse(di$padj<0.05,'red4', 'gray20'),
      pch=19, cex=0.1, main='', xlim=c(-4,4),ylim=c(0,60),
      xlab=expression('NS/ES [log'[2]*']'),
      ylab=expression('P-val [-log'[2]*']') )

Plot the strength of ES and NS specific loops in ES and NS cells and compare it to the signal of common loops. First we consider the ES specific loops.

nrow(tam)==nrow(di)

## [1] TRUE

my_facette_names = list(
  '1'="<0.2",
  '2'=">0.2 & <0.4",
  '3'=">0.4 & <0.8",
  '4'=">0.8")

myLabeller = function(variable,value){
  return(my_facette_names[value])
}


tp = data.frame( es = log2(rowSums(tam[,1:2])), 
                 ns = log2(rowSums(tam[,3:4])), 
                 distance = cut((di$right_end-1)-(di$left_start+1), c(0,40,80,160,Inf),labels = F),
                 sig = rep(0,nrow(di)) )

tp$sig[ di$padj<0.05 & di$log2FoldChange>log2(1.5) ] = 'Induced'
tp$sig[ di$padj<0.05 & di$log2FoldChange<(log2(1/1.5)) ] = 'Reduced'
tp$sig[ di$padj>0.05 & abs(di$log2FoldChange)<log2(1.25) ] = 'Common'
tp = tp[tp$sig!=0,]

tp$distance = factor(tp$distance, levels=1:4)
tp$sig = factor(tp$sig, levels=c('Reduced','Induced','Common'))

The plots:

ggplot(tp, aes(y=es,x=sig,color=sig ) ) +
  geom_violin(trim=TRUE,size=2) + geom_boxplot( width=0.2, size=1.2 ) +
  scale_colour_manual(values = c("orange2","black","blue2")) +  
  theme_classic() +  facet_grid(. ~ distance, labeller=myLabeller) +
  theme(legend.position="none",
        title = element_text(size=20), 
        line=element_line(size=1),
        axis.title.x = element_text(size=30),
        axis.title.y = element_text(size=30),
        axis.text.x  = element_text(size=30,angle=45,hjust=1),
        axis.text.y  = element_text(size=30),
        strip.text.x = element_text(size=30),
        strip.background = element_rect(colour="white", fill="white")) +
  xlab("") +
  ylab( expression( 'Hi-C signal [log'[2]*']' )  ) +
  theme(axis.line.y = element_line(color="black", size = 1))

Then the NS specific loops.

ggplot(tp, aes(y=ns,x=sig,color=sig ) ) +
  geom_violin(trim=TRUE,size=2) + geom_boxplot( width=0.5, size=1.2 ) +
  scale_colour_manual(values = c("orange2","black","blue2")) +  
  theme_classic() +  facet_grid(. ~ distance, labeller=myLabeller) +
  theme(legend.position="none",
        title = element_text(size=20), 
        line=element_line(size=1),
        axis.title.x = element_text(size=30),
        axis.title.y = element_text(size=30),
        axis.text.x  = element_text(size=30,angle=45,hjust=1),
        axis.text.y  = element_text(size=30),
        strip.text.x = element_text(size=30),
        strip.background = element_rect(colour="white", fill="white")) +
  xlab("") +
  ylab( expression( 'Hi-C signal [log'[2]*']' )  ) +
  theme(axis.line.y = element_line(color="black", size = 1))

How many induced loops are already detectable in ES cells? In other words, can we speak about loop induction?

getGR=function(coords){
  return(GRanges(seqnames = Rle(coords[,1]),
                 ranges = IRanges(start = as.numeric(coords[,2]),
                                  end = as.numeric(coords[,3])),
                 strand = Rle(rep("*", nrow(coords)))) ) }

es_juicer_anchors_left = getGR( es_loops_is[,1:3] ) 
es_juicer_anchors_right = getGR( es_loops_is[,4:6] ) 

getGR=function(coords,ga,offset){
return(GRanges(seqnames = Rle(ga[coords[,1],'chr']),
                 ranges = IRanges(start = as.numeric(ga[coords[,1],'start'])-offset,
                                  end = as.numeric(ga[coords[,2],'end'])+offset),
                 strand = Rle(rep("*", nrow(coords)))) )}

left_induced_anchors = getGR( loops.ns[,c('left_start','left_end')], ga, 5000 )
right_induced_anchors = getGR( loops.ns[,c('right_start','right_end')], ga, 5000 )

d=table( do.call('c', lapply(as.list(seq(1,nrow(loops.ns))), function(l){
  lefts = queryHits(findOverlaps(es_juicer_anchors_left,left_induced_anchors[l]))
  rights = queryHits(findOverlaps(es_juicer_anchors_right,right_induced_anchors[l]))
  any(lefts %in% rights )
})) )


par(mfrow=c(1,1),mar=c(12,6,2,1),cex.lab=1.5, 
    cex.axis=1.5, cex.main=1.5, lwd=6, bty='n')
barplot( d/1000, col=c('blue','red'),
         names=c('Not a loop in ES','Loop in ES'),
         las=2,main='',ylab='Loops (number x 1000)')

General LFC of loops. Here we compare loop fold change and random pixel fold change. Again, we analyze all at the resolution of 5kb. We pick pixels randomly, we perform the normalization of the data and display the fold change of Hi-C signal in ES and NS cells. We sample interactions in such a way that the distribution of these random contacts follows the distribution of loops identified in ES and NS cells.

load( paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision') )

randomPixels = sample( seq(1,nrow(ga)), 10000, replace=T )
randomPixels = data.frame( row = randomPixels, 
                           col = randomPixels+sample( di$col-di$row, 10000, replace=T ),
                           id=seq(1,10000), stringsAsFactors=F )

randomPixels = randomPixels[randomPixels[,1]<nrow(ga) & randomPixels[,2]<nrow(ga),]

randomPixels$chr1 = ga$chr[(randomPixels$row-3)]
randomPixels$chr2 = ga$chr[(randomPixels$col+3)]
randomPixels = randomPixels[randomPixels$chr1==randomPixels$chr2,]

randomPixels = do.call('rbind',lapply(as.list(seq(1,nrow(randomPixels))), function(x){
  cbind( row=rep( seq( randomPixels[x,1]-1,randomPixels[x,1]+1 ), 3),
         col=rep( seq( randomPixels[x,2]-1,randomPixels[x,2]+1 ), each=3),
         id=rep( randomPixels[x,3], 3*3)) }))

summary( randomPixels[seq(5,nrow(randomPixels),by=9),'col'] - randomPixels[seq(5,nrow(randomPixels),by=9),'row'])

summary( di$col-di$row )

tam = data.frame( es1 = esins1[[1]][ cbind(randomPixels[,1],randomPixels[,2]) ],
                  es2 = esins2[[1]][ cbind(randomPixels[,1],randomPixels[,2])  ],
                  ns1 = nsins1[[1]][ cbind(randomPixels[,1],randomPixels[,2])  ],
                  ns2 = nsins2[[1]][ cbind(randomPixels[,1],randomPixels[,2])  ],
                  row = randomPixels[,1],
                  col = randomPixels[,2],
                  id = randomPixels[,3],
                  stringsAsFactors=FALSE)

tam = do.call('rbind', lapply(split(tam, tam$id), function(z){
  data.frame( es1=colSums(z)[1],
              es2=colSums(z)[2],
              ns1=colSums(z)[3],
              ns2=colSums(z)[4],
              id=unique(z[,'id']),
              left_start=min(z[,'row']),
              left_end=max(z[,'row']),
              right_start=min(z[,'col']),
              right_end=max(z[,'col']),
              row=round(mean(z[,'row'])),
              col=round(mean(z[,'col'])),
              stringsAsFactors=FALSE )}) )

mat = data.frame( es1 = esins1[[2]][ cbind(randomPixels[,1],randomPixels[,2]) ],
                  es2 = esins2[[2]][ cbind(randomPixels[,1],randomPixels[,2]) ],
                  ns1 = nsins1[[2]][ cbind(randomPixels[,1],randomPixels[,2]) ],
                  ns2 = nsins2[[2]][ cbind(randomPixels[,1],randomPixels[,2]) ],
                  row = randomPixels[,1],
                  col = randomPixels[,2],
                  id = randomPixels[,3],
                  stringsAsFactors=FALSE)

mat = do.call('rbind', lapply(split(mat, mat$id), function(z){
  data.frame( es1=colSums(z)[1],
              es2=colSums(z)[2],
              ns1=colSums(z)[3],
              ns2=colSums(z)[4],
              id=unique(z[,'id']),
              left_start=min(z[,'row']),
              left_end=max(z[,'row']),
              right_start=min(z[,'col']),
              right_end=max(z[,'col']),
              row=round(mean(z[,'row'])),
              col=round(mean(z[,'col'])),
              stringsAsFactors=FALSE ) }) )

norm_facs = as.matrix(tam[, 1:4] / mat[, 1:4])
norm_facs[is.nan(norm_facs) ] = 1
counts = tam[, 1:4]
cond =  c("E", "E", "N", "N")
y = DGEList(counts, lib.size = rep(1, ncol(counts)), group = cond)
y$offset = log(norm_facs)
y = estimateDisp(y, min.row.sum = 2, trend.method = "movingave")
dispersion = getDispersion(y)
dd = DESeqDataSetFromMatrix(countData = counts, design = ~ cond,
                            colData = DataFrame(sample = colnames(counts),
                            cond = cond))
normalizationFactors(dd) = norm_facs
dispersions(dd) = dispersion
dd = nbinomWaldTest(dd, betaPrior = FALSE)

ri = as.data.frame( results(dd, 
                            independentFiltering = FALSE, 
                            cooksCutoff = FALSE,
                            lfcThreshold = 0 ) )
ri$pvalueC = ri$pvalue
ri$pvalueC[ri$pvalue<10e-10] = 10^-10
ri=data.frame(ri, mat)
save( ri, file=paste0(data_directory,'random_loops.RData'))

Plot density of LFC.

load( paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision') )
load( paste0(data_directory,'random_loops.RData') )

par(mfrow=c(1,1),mar=c(6,6,2,1),cex.lab=1.5, 
    cex.axis=1.5, cex.main=1.5, lwd=6, bty='n')
plot(density(di$log2FoldChange,na.rm=T), col='black',ty='l', main='',
     xlab=expression('NS/ES [log'[2]*']'), xlim=c(-3,3), ylim=c(0,0.9))
lines(density(ri$log2FoldChange,na.rm=T), col='red')
legend('topleft', bty='n', lwd=3, col=c('black','red'), c('Loops','Random') )
abline(v=0,lty=2,col='gray',lwd=0.5)

The relationship between CTCF binding and differential loops

We will first illustrate how CTCF binding varies between conditions and next will focus on the relationship between CTCF, Rad21 and domains and loops

General preparations for the analyses of CTCF and Rad21 ChIP-seq data

We will display a Venn diagram of CTCF peak annotations in ES and NS cells.

nsct = import.bed(paste0(data_directory,'NS_CTCF_peaks.bed'))
esct = import.bed(paste0(data_directory,'ES_FBS_CTCF_peaks.bed'))

nsra = import.bed(paste0(data_directory,'NS_Rad21_peaks.bed'))
esra = import.bed(paste0(data_directory,'ES_FBS_Rad21_peaks.bed'))

Venn diagram of ES and NS CTCF peak overlaps

library(VennDiagram)

## Loading required package: futile.logger

## 
## Attaching package: 'VennDiagram'

## The following object is masked from 'package:EBImage':
## 
##     rotate

overlaps=findOverlaps(esct,nsct)

draw.pairwise.venn(area1=queryLength(overlaps),
                   area2=subjectLength(overlaps),
                   cross.area=length(overlaps),
                   category=c('ES','NS'), col=c('coral3','blue3'),
                   fill=rep('white',2),cat.cex=1.2,lwd=8)

## (polygon[GRID.polygon.737], polygon[GRID.polygon.738], polygon[GRID.polygon.739], polygon[GRID.polygon.740], text[GRID.text.741], text[GRID.text.742], text[GRID.text.743], text[GRID.text.744], text[GRID.text.745])

How does CTCF overlap with Rad21 in ES and NS cells?

library(VennDiagram)

overlaps=findOverlaps(esct,esra)


draw.pairwise.venn(area1=queryLength(overlaps),
                   area2=subjectLength(overlaps),
                   cross.area=length(overlaps),
                   category=c('CTCF','Rad21'), col=c('coral3','blue3'),
                   fill=rep('white',2),cat.cex=1.2,lwd=8)

## (polygon[GRID.polygon.746], polygon[GRID.polygon.747], polygon[GRID.polygon.748], polygon[GRID.polygon.749], text[GRID.text.750], text[GRID.text.751], lines[GRID.lines.752], text[GRID.text.753], text[GRID.text.754], text[GRID.text.755])

library(VennDiagram)

overlaps=findOverlaps(nsct,nsra)

draw.pairwise.venn(area1=queryLength(overlaps),
                   area2=subjectLength(overlaps),
                   cross.area=length(overlaps),
                   category=c('CTCF','Rad21'), col=c('coral3','blue3'),
                   fill=rep('white',2),cat.cex=1.2,lwd=8)

## (polygon[GRID.polygon.756], polygon[GRID.polygon.757], polygon[GRID.polygon.758], polygon[GRID.polygon.759], text[GRID.text.760], text[GRID.text.761], text[GRID.text.762], text[GRID.text.763], text[GRID.text.764])

Quantitative comparison of CTCF and Rad21 binding strength in ES and NS cells

Now, we will focus on the relationship between CTCF and Rad21 binding in differentiation. We observed more CTCF and Rad21 peaks in the ES cell condition. We construct a universe of CTCF binding sites that consists of all the peaks of CTCF in ES and NS cells. First, we will count the reads within CTCF peaks. We define the universe CTCF peaks (all_peaks) that are bound by CTCF in at least one cell type.

ns_only_ctcf_peaks = nsct[ -queryHits(findOverlaps(nsct,esct)) ]
all_peaks = c( esct, ns_only_ctcf_peaks )
seqlevelsStyle(all_peaks) = 'ucsc'

Counting ChIP-seq tags in peak intervals. We previously estimated the length of ChIP fragments, extended the reads obtained from sequencing. We will load these extended reads and count how many of these map to the intervals of all_peaks.

load( paste0(data_directory,'CTCF_reads_extended_xx.RData') )
load( paste0(data_directory,'Rad21_reads_extended_xx.RData') )


all_counts = data.frame( es_ctcf_rep1 = countOverlaps(all_peaks,es_FBS_rep1_ctcf),
                         es_ctcf_rep2 = countOverlaps(all_peaks,es_FBS_rep2_ctcf),
                         ns_ctcf_rep1 = countOverlaps(all_peaks,ns_rep1_ctcf),
                         ns_ctcf_rep2 = countOverlaps(all_peaks,ns_rep2_ctcf), 
                         stringsAsFactors=FALSE )

lla_counts = data.frame( es_Rad21_rep1 = countOverlaps(all_peaks,es_FBS_rep1_Rad21),
                         es_Rad21_rep2 = countOverlaps(all_peaks,es_FBS_rep2_Rad21),
                         ns_Rad21_rep2 = countOverlaps(all_peaks,ns_rep2_Rad21), 
                         stringsAsFactors=FALSE )


save( all_counts, lla_counts, all_peaks, 
      file=paste0(data_directory,'FBS_NS_all_peaks_CTCF_Rad21_counts.RData'))

We next perform DESeq2 analysis of ChIP-seq signal at these peaks.

load(paste0(data_directory,'FBS_NS_all_peaks_CTCF_Rad21_counts.RData'))

cond =  data.frame(condition=c("E", "E", "N", "N"))
rownames(cond) = colnames(all_counts)

counts = all_counts[, 1:4]
dds = DESeqDataSetFromMatrix(countData = counts,
                             colData = cond,
                             design = ~ condition)
dds = DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.

## final dispersion estimates

## fitting model and testing

res = results(dds)
summary(res)

## 
## out of 68195 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 8810, 13% 
## LFC < 0 (down)   : 7125, 10% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 6611, 9.7% 
## (mean count < 21)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

x=as.data.frame(res)

par( cex.lab=1.5, cex.axis=1.5,bty='n', mar=c(5,5,1,1) )
plot( x=log2(x$baseMean),
      y=x$log2FoldChange, xlim=c(4,10), ylim=c(-4,4),
      xaxs = 'i', pch=19,
      cex=0.3, col=ifelse(x$padj<0.1,'red', 'steelblue' ),
      xlab=expression('Mean counts log['[2]*']'), ylab=expression('NS/ES log['[2]*']'),
      main='' )

We perform the same analysis for Rad21. Here, we noticed that one replicate of ChIP in NS cells was of substantially lower quality. We removed it from consideration.

COUNTS = lla_counts
cond =  data.frame(condition=c("E", "E", "N"))
rownames(cond) = colnames(COUNTS)


DDS = DESeqDataSetFromMatrix(countData = COUNTS,
                             colData = cond,
                             design = ~ condition)
DDS = DESeq(DDS)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

RES = results(DDS)
summary(RES)

## 
## out of 68194 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 12578, 18% 
## LFC < 0 (down)   : 4774, 7% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 14544, 21% 
## (mean count < 20)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Also, prepare the average profile of ChIP-seq signal at CTCF peaks.

esct_summits = import.bed( paste0(data_directory, 'ES_FBS_CTCF_summits.bed' ) )
nsct_summits = import.bed( paste0(data_directory, 'NS_CTCF_summits.bed' ) )

find_intervals = function(peaks,D1,D2,BinSize){
  # peaks=esct_summits; D1 = 2500; D2 = 2500; BinSize = 100
  theStarts = start(peaks)
  theStarts = do.call('c', lapply(as.list(theStarts), function(s){
    seq(s-D1,s+D2,by=BinSize)
  }))
  regSize = length(seq(1-D1,1+D2,by=BinSize))
  ends = theStarts+BinSize-1

  res = GRanges(seqnames = Rle(rep(chrom(peaks),each=regSize)),
                ranges = IRanges(as.numeric( theStarts ),
                                 end = as.numeric( ends )),
                strand = Rle(rep("*", length(ends) )),
                ID = rep(seq(1:length(peaks)), each=regSize )) 
 seqlevelsStyle(res) = 'ucsc'
 return(res)
}

esct_summit_intervals = find_intervals( esct_summits, D1=2500, D2=2500, BinSize=10 )
nsct_summit_intervals = find_intervals( nsct_summits, D1=2500, D2=2500, BinSize=10 )

es_ctcf_AP_rep1 = countOverlaps(esct_summit_intervals,es_FBS_rep1_ctcf)
es_ctcf_AP_rep2 = countOverlaps(esct_summit_intervals,es_FBS_rep2_ctcf)
ns_ctcf_AP_rep1 = countOverlaps(nsct_summit_intervals,ns_rep1_ctcf)
ns_ctcf_AP_rep2 = countOverlaps(nsct_summit_intervals,ns_rep2_ctcf)

save( es_ctcf_AP_rep1, es_ctcf_AP_rep2, ns_ctcf_AP_rep1, ns_ctcf_AP_rep2, 
      esct_summit_intervals, nsct_summit_intervals, 
      file=paste0(data_directory,'CTCF_AP.RData'))


es_rad21_AP_rep1 = countOverlaps(esct_summit_intervals,es_FBS_rep1_Rad21)
es_rad21_AP_rep2 = countOverlaps(esct_summit_intervals,es_FBS_rep2_Rad21)
ns_rad21_AP_rep1 = countOverlaps(nsct_summit_intervals,ns_rep1_Rad21)
ns_rad21_AP_rep2 = countOverlaps(nsct_summit_intervals,ns_rep2_Rad21)

save( es_rad21_AP_rep1, es_rad21_AP_rep2, ns_rad21_AP_rep1, ns_rad21_AP_rep2, 
      file=paste0(data_directory,'Rad21_AP.RData'))


es_ctcf_AP = es_ctcf_AP_rep1+es_ctcf_AP_rep2
es_ctcf_AP_mat = matrix( es_ctcf_AP, 
                         nrow=length(esct_summit_intervals), ncol=501, byrow=T )

es_Rad21_AP = es_rad21_AP_rep1+es_rad21_AP_rep2
es_Rad21_AP_mat = matrix( es_Rad21_AP, 
                         nrow=length(esct_summit_intervals), ncol=501, byrow=T )

ns_ctcf_AP = ns_ctcf_AP_rep1+ns_ctcf_AP_rep2
ns_ctcf_AP_mat = matrix( ns_ctcf_AP, 
                         nrow=length(nsct_summit_intervals), ncol=501, byrow=T )

ns_Rad21_AP = ns_rad21_AP_rep1+ns_rad21_AP_rep2
ns_Rad21_AP_mat = matrix( ns_Rad21_AP, 
                          nrow=length(nsct_summit_intervals), ncol=501, byrow=T )

save( es_ctcf_AP_mat, es_Rad21_AP_mat,
      ns_ctcf_AP_mat, ns_Rad21_AP_mat,
      file=paste0( data_directory,'AP_profiles_CTCF_Rad21.RData' ))

Plot the distribution of signal.

load(paste0( data_directory,'AP_profiles_CTCF_Rad21.RData' ))

par(mfrow=c(2,2), cex.lab=1.5, cex.axis=1.5, cex.main=1.5, bty='n' )
plot( x=seq(-2500,2500,length.out=501),
      y=colMeans(es_ctcf_AP_mat), 
      ty='l', lwd=3, col='black',main='ES cells CTCF',
      ylab='Average tag count', xlab='Position (bp)',
      ylim=c(0,120))
plot( x=seq(-2500,2500,length.out=501),
      y=colMeans(es_Rad21_AP_mat), 
      ty='l', lwd=3, col='orange3',main='ES cells Rad21',
      ylab='Average tag count', xlab='Position (bp)',
      ylim=c(0,80))

plot( x=seq(-2500,2500,length.out=501),
      y=colMeans(ns_ctcf_AP_mat), 
      ty='l', lwd=3, col='black',main='NS cells CTCF',
      ylab='Average tag count', xlab='Position (bp)',
      ylim=c(0,160))
plot( x=seq(-2500,2500,length.out=501),
      y=colMeans(ns_Rad21_AP_mat), 
      ty='l', lwd=3, col='orange3',main='NS cells Rad21',
      ylab='Average tag count', xlab='Position (bp)',
      ylim=c(0,100))

Scatterplot of Rad21 and CTCF signal in both conditions.

ctcf_rad_es = c( esct[-queryHits(findOverlaps(esct,esra))], 
                 esct[queryHits(findOverlaps(esct,esra))],
                 esra[queryHits(findOverlaps(esra,esct))] )
seqlevelsStyle(ctcf_rad_es)='ucsc'

ctcf_rad_ns = c( nsct[-queryHits(findOverlaps(nsct,nsra))], 
                 nsct[queryHits(findOverlaps(nsct,nsra))],
                 nsra[queryHits(findOverlaps(nsra,nsct))] )
seqlevelsStyle(ctcf_rad_ns)='ucsc'


univ_es_ctcf_rad = data.frame( es_ctcf_rep1 = countOverlaps(ctcf_rad_es,es_FBS_rep1_ctcf),
                               es_ctcf_rep2 = countOverlaps(ctcf_rad_es,es_FBS_rep2_ctcf),
                               es_rad_rep1 = countOverlaps(ctcf_rad_es,es_FBS_rep1_Rad21),
                               es_rad_rep2 = countOverlaps(ctcf_rad_es,es_FBS_rep2_Rad21),
                               stringsAsFactors=FALSE )


univ_ns_ctcf_rad = data.frame( ns_ctcf_rep1 = countOverlaps(ctcf_rad_ns,ns_rep1_ctcf),
                               ns_ctcf_rep2 = countOverlaps(ctcf_rad_ns,ns_rep2_ctcf),
                               ns_rad_rep1 = countOverlaps(ctcf_rad_ns,ns_rep1_Rad21),
                               ns_rad_rep2 = countOverlaps(ctcf_rad_ns,ns_rep2_Rad21),
                               stringsAsFactors=FALSE )

save( ctcf_rad_es, ctcf_rad_ns, univ_es_ctcf_rad, univ_ns_ctcf_rad, 
      file=paste0(data_directory,'CTCF_Rad21_levels_per_cell_type.RData'))

Produce the scatter plots to illustrate the correlation between the change of CTCF and Rad21 binding strenght at CTCF sites upon differentiation.

load(paste0(data_directory,'CTCF_Rad21_levels_per_cell_type.RData'))

par(mfrow=c(1,2), cex.lab=1.5, cex.axis=1.5, cex.main=1.5, pty='s' )
heatscatter( x=log2((0.5+rowSums(univ_es_ctcf_rad[,1:2]))/sum(colSums(univ_es_ctcf_rad[,1:2]))),
             y=log2((0.5+rowSums(univ_es_ctcf_rad[,3:4]))/sum(colSums(univ_es_ctcf_rad[,3:4]))),
             cex=0.1, colpal='spectral', 
             xlab=expression('CTCF counts [log'[2]*']'),
             ylab=expression('Rad21 counts [log'[2]*']'), 
             main='ES cells')
heatscatter( x=log2((0.5+rowSums(univ_ns_ctcf_rad[,1:2]))/sum(colSums(univ_ns_ctcf_rad[,1:2]))),
             y=log2((0.5+rowSums(univ_ns_ctcf_rad[,3:4]))/sum(colSums(univ_ns_ctcf_rad[,3:4]))),
             cex=0.1, colpal='spectral', 
             xlab=expression('CTCF counts [log'[2]*']'),
             ylab=expression('Rad21 counts [log'[2]*']'), 
             main='NS cells',ylim=c(-25,-10),xlim=c(-25,-10))

Take the anchors and ask how many differences in CTCF binding at these anchors we observe?

nrow(loops.ns)

## [1] 2454

getGR = function(df,offset){

  return(GRanges(seqnames = Rle(df[,1]),
                 ranges = IRanges(start = as.numeric(df[,2])-offset,
                                  end = as.numeric(df[,3])+offset),
                 strand = Rle(rep("*", nrow(df))))) 


}

leftI = getGR( data.frame(ga$chr[loops.ns[,'left_start']],
                          ga$start[loops.ns[,'row']],
                          ga$end[loops.ns[,'row']]),5000 )

righI = getGR( data.frame(ga$chr[loops.ns[,'right_start']],
                          ga$start[loops.ns[,'col']],
                          ga$end[loops.ns[,'col']]),5000 )


############################################################
######### identiy loops for which both anchors #############
######### are defined by one CTCF peak #####################
############################################################

leftI_s = leftI[ which(countOverlaps(leftI,all_peaks)==1 & countOverlaps(righI, all_peaks)==1)]
righI_s = righI[ which(countOverlaps(leftI,all_peaks)==1 & countOverlaps(righI, all_peaks)==1)]

length(leftI_s)

## [1] 479

Analyze the changes in CTCF and Rad21 binding at the anchors of these loops.

## anchors with a single CTCF binding site at each of the two anchors
leftI_s = leftI[ which(countOverlaps(leftI,all_peaks)==1 & countOverlaps(righI, all_peaks)==1)]
righI_s = righI[ which(countOverlaps(leftI,all_peaks)==1 & countOverlaps(righI, all_peaks)==1)]

# which peaks are signifincantly induced and which significnatly reduced
ns2es = rep(0,length(all_peaks))
ns2es[res$log2FoldChange>0 & res$padj<0.1] = 1
ns2es[res$log2FoldChange<0 & res$padj<0.1] = -1

NS2ES = rep(0,length(all_peaks))
NS2ES[RES$log2FoldChange>0 & RES$padj<0.1] = 1
NS2ES[RES$log2FoldChange<0 & RES$padj<0.1] = -1


lfc = cbind( left=ns2es[ subjectHits(findOverlaps(leftI_s,all_peaks)) ],
             righ=ns2es[ subjectHits(findOverlaps(righI_s,all_peaks)) ] )
lf2 = cbind( left=NS2ES[ subjectHits(findOverlaps(leftI_s,all_peaks)) ],
             righ=NS2ES[ subjectHits(findOverlaps(righI_s,all_peaks)) ] )

ctcfm = 100* data.frame(  noDiff = sum( apply(lfc,1,function(x){all(x==0)})),
                          decrease = sum( rowSums(lfc<0)==1 & rowSums(lfc>0)==0),
                          up_one = sum( rowSums(lfc>0)==1 ),
                          up_two = sum( rowSums(lfc>0)==2 ) )/nrow(lfc)


Radfm = 100* data.frame(  noDiff = sum( apply(lf2,1,function(x){all(x==0)})),
                          decrease = sum( rowSums(lf2<0)==1 & rowSums(lf2>0)==0),
                          up_one = sum( rowSums(lf2>0)==1 ),
                          up_two = sum( rowSums(lf2>0)==2 ) )/nrow(lf2)

m = rbind( Radfm, ctcfm )

par(mfrow=c(1,1),mar=c(6,6,2,1),
    cex.lab=1.2, cex.axis=1.2,lwd=2)
barplot(t(m), horiz=T, las=2,
        col=c('gray','wheat','red4','steelblue'), 
        names=c('Rad21', 'CTCF'),
        xlab = '% of loops')

Finally check how many induced loop anchors have at least one gained peak at either or both anchors.

# define gained and lost CTCF and Rad21 peaks
gainedCTCFpeaks = all_peaks[ which(res$log2FoldChange>0 & res$padj<0.1) ]
gainedRad21peaks = all_peaks[ which(RES$log2FoldChange>0 & RES$padj<0.1) ]

lostCTCFpeaks = all_peaks[ which(res$log2FoldChange<0 & res$padj<0.1) ]
lostRad21peaks = all_peaks[ which(RES$log2FoldChange<0 & RES$padj<0.1) ]


## use loops that overlap CTCF peak(s) at both anchors
ids2consider_ind = which( rowSums(cbind(countOverlaps(leftI,all_peaks),countOverlaps(righI,all_peaks) )>0)==2 )

## annotate loop anchors
gainedLoopsGainedCTCF = cbind(left=countOverlaps(leftI,gainedCTCFpeaks),
                              right=countOverlaps(righI,gainedCTCFpeaks))

## take the CTCF positive loops
gainedLoopsGainedCTCF = gainedLoopsGainedCTCF[ids2consider_ind,]

table(rowSums(gainedLoopsGainedCTCF>0))

## 
##   0   1   2 
## 447 824 588

## how many gained loops have both anchors overlapping reduced peaks
gainedLoopsLostCTCF = cbind(left=countOverlaps(leftI,lostCTCFpeaks),
                            right=countOverlaps(righI,lostCTCFpeaks))
gainedLoopsLostCTCF = gainedLoopsLostCTCF[ids2consider_ind,]


ctcfm = 100* data.frame(  noDiff = sum( rowSums(gainedLoopsGainedCTCF)==0 )-sum( gainedLoopsLostCTCF[,1]>0 & gainedLoopsLostCTCF[,2]>0 ),
                          decrease = sum( gainedLoopsLostCTCF[,1]>0 & gainedLoopsLostCTCF[,2]>0 ),
                          up_one = sum( rowSums(gainedLoopsGainedCTCF>0)==1 ),
                          up_two = sum( rowSums(gainedLoopsGainedCTCF>0)==2 ) )/nrow(gainedLoopsGainedCTCF)

## Rad21
gainedLoopsGainedRad21 = cbind(left=countOverlaps(leftI,gainedRad21peaks),
                              right=countOverlaps(righI,gainedRad21peaks))

## take the CTCF positive loops
gainedLoopsGainedRad21 = gainedLoopsGainedRad21[ids2consider_ind,]

## how many gained loops have both anchors overlapping reduced peaks
gainedLoopsLostRad21 = cbind(left=countOverlaps(leftI,lostRad21peaks),
                            right=countOverlaps(righI,lostRad21peaks))
gainedLoopsLostRad21 = gainedLoopsLostRad21[ids2consider_ind,]

Radfm = 100* data.frame(  noDiff = sum( rowSums(gainedLoopsGainedRad21)==0 )-sum( gainedLoopsLostRad21[,1]>0 & gainedLoopsLostRad21[,2]>0 ),
                          decrease = sum( gainedLoopsLostRad21[,1]>0 & gainedLoopsLostRad21[,2]>0 ),
                          up_one = sum( rowSums(gainedLoopsGainedRad21>0)==1 ),
                          up_two = sum( rowSums(gainedLoopsGainedRad21>0)==2 ) )/nrow(gainedLoopsGainedRad21)

m = rbind( Radfm, ctcfm )
m

##     noDiff  decrease   up_one   up_two
## 1 11.08123 0.2151694 36.79398 51.90963
## 2 23.13072 0.9144701 44.32491 31.62991

par(mfrow=c(1,1),mar=c(6,6,2,1),
    cex.lab=1.2, cex.axis=1.2,lwd=2)
barplot(t(m), horiz=T, las=2,
        col=c('gray','wheat','red4','steelblue'), 
        names=c('Rad21', 'CTCF'),
        xlab = '% of loops')

Domain boundaries - the dynamics of CTCF and Rad21 binding

Compare at domain boundaries. Identify genomic intervals spanning boundaries and regions between them.

getGR=function(coords,offset){

  return(GRanges(seqnames = Rle(coords[,1]),
                 ranges = IRanges(start = as.numeric(coords[,2])-offset,
                                  end = as.numeric(coords[,2])+offset),
                 strand = Rle(rep("*", nrow(coords)))) )


}

es_domains = read.delim(paste0(data_directory,'ES_25000_blocks.txt'),head=T, as.is=T)
ns_domains = read.delim(paste0(data_directory,'NS_25000_blocks.txt'),head=T, as.is=T)
es_domains$x1=es_domains$x1+1
ns_domains$x1=ns_domains$x1+1
es_domains$y1=es_domains$y1+1
ns_domains$y1=ns_domains$y1+1

es_boundaries_GR = reduce( c( getGR( es_domains[,1:2], 15000), getGR( es_domains[,c(1,3)], 15000) ) )
ns_boundaries_GR = reduce( c( getGR( ns_domains[,1:2], 15000), getGR( ns_domains[,c(1,3)], 15000) ) )

## find all domain boundaries in ES and NS cells
all_domains = rbind( es_domains, ns_domains )
all_boundaries = c( getGR( all_domains[,1:2], 5000), getGR( all_domains[,c(1,3)], 5000) )
all_boundaries = reduce(all_boundaries)
all_boundaries_extended = reduce(c( getGR( all_domains[,1:2], 50000), getGR( all_domains[,c(1,3)], 50000) ))



overlaps=findOverlaps(es_boundaries_GR,ns_boundaries_GR)

draw.pairwise.venn(area1=queryLength(overlaps),
                   area2=subjectLength(overlaps),
                   cross.area=length(overlaps),
                   category=c('ES','NS'), col=c('coral3','blue3'),
                   fill=rep('white',2),cat.cex=1.2,lwd=8)

## (polygon[GRID.polygon.765], polygon[GRID.polygon.766], polygon[GRID.polygon.767], polygon[GRID.polygon.768], text[GRID.text.769], text[GRID.text.770], text[GRID.text.771], text[GRID.text.772], text[GRID.text.773])

Compare how CTCF and Rad21 signal varies at domain boundaries and inside domains.

## CTCF and Rad21 at boundaries
ctcf_boundaries = res$log2FoldChange[ unique(queryHits(findOverlaps(all_peaks,all_boundaries))) ]
rad21_boundaries = RES$log2FoldChange[ unique(queryHits(findOverlaps(all_peaks,all_boundaries))) ]

## find CTCF and Rad21 inside domains
## remove larger areas around boundaries
ctcf_inside = res$log2FoldChange[ -unique(queryHits(findOverlaps(all_peaks,all_boundaries_extended))) ]
rad21_inside = RES$log2FoldChange[ -unique(queryHits(findOverlaps(all_peaks,all_boundaries_extended))) ]


## we see a redirection of CTCF and Rad21 towards domain boundaries
par(mfrow=c(1,2),mar=c(6,6,2,1),cex.lab=1.5, 
    cex.axis=1.5, cex.main=1.5, lwd=6, bty='n')

plot(density(ctcf_inside,na.rm=T), col='black',ty='l', main='CTCF',
     xlab=expression('NS/ES [log'[2]*']'), xlim=c(-3,3), ylim=c(0,0.7))
lines(density(ctcf_boundaries,na.rm=T), col='red')
abline(v=0, lty=2, col='gray')

plot(density(rad21_inside,na.rm=T), col='black',ty='l', main='Rad21',
     xlab=expression('NS/ES [log'[2]*']'), xlim=c(-3,3), ylim=c(0,0.6))
lines(density(rad21_boundaries,na.rm=T), col='red')
abline(v=0, lty=2, col='gray')

t.test(ctcf_boundaries,ctcf_inside)

## 
##  Welch Two Sample t-test
## 
## data:  ctcf_boundaries and ctcf_inside
## t = 17.111, df = 8661.9, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1455094 0.1831625
## sample estimates:
##   mean of x   mean of y 
##  0.04903899 -0.11529696

t.test(rad21_boundaries,rad21_inside)

## 
##  Welch Two Sample t-test
## 
## data:  rad21_boundaries and rad21_inside
## t = 12.262, df = 9154.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1210593 0.1671306
## sample estimates:
##   mean of x   mean of y 
##  0.10800154 -0.03609341

Domain boundaries and the changes in the level of CTCF binding between ES and NS cells

Here we show that common domain boundaries are bound by CTCF more frequently than the cell type specific boundaries.

es_boundaries_GR = reduce( c( getGR( es_domains[,1:2], 15000), getGR( es_domains[,c(1,3)], 15000) ) )
ns_boundaries_GR = reduce( c( getGR( ns_domains[,1:2], 15000), getGR( ns_domains[,c(1,3)], 15000) ) )

common_boundaries_GR = ns_boundaries_GR[unique(queryHits(findOverlaps(ns_boundaries_GR,es_boundaries_GR))) ]
es_spe_boundaries_GR = es_boundaries_GR[-queryHits(findOverlaps(es_boundaries_GR,ns_boundaries_GR)) ]
ns_spe_boundaries_GR = ns_boundaries_GR[-queryHits(findOverlaps(ns_boundaries_GR,es_boundaries_GR)) ]

seqlevelsStyle(esct) = 'ucsc'
seqlevelsStyle(nsct) = 'ucsc'

par(mfrow=c(1,1),mar=c(6,6,2,1),cex.lab=1.5, 
    cex.axis=1.5, cex.main=1.5, lwd=6, bty='n')
barplot( c(100*sum(countOverlaps(es_spe_boundaries_GR,esct)==0)/length(es_spe_boundaries_GR),
           100*sum(countOverlaps(ns_spe_boundaries_GR,nsct)==0)/length(ns_spe_boundaries_GR),
           100*sum(countOverlaps(common_boundaries_GR,all_peaks)==0)/length(common_boundaries_GR)),
         col=c(myEScolor,myNScolor,'gray'), xlab='Cell type specificity', ylim=c(0,55),
         names=c('ES','NS','Common'), main='Overlap with CTCF', ylab= c('CTCF- boundaries [%]'))

Gained CTCF - orientation vis-a-vis loop domains

Load the genome wide CTCF motif annotation. We obtained it using FIMO.

gainedCTCFpeaks = all_peaks[ which(res$log2FoldChange>0 & res$padj<0.1) ]
seqlevelsStyle(gainedCTCFpeaks) = 'ucsc'

ctcf.motif.orientation = import.bed( paste0(data_directory,'mm9_CTCF_core_motif.bed') )
pval = elementMetadata(ctcf.motif.orientation)[,1]
pval = unlist(strsplit(pval,';'))[seq(3,5*length(pval),by=5)]
pval = as.numeric(unlist(strsplit(pval,'='))[seq(2,2*length(pval),by=2)])

peaks2ori = findOverlaps(gainedCTCFpeaks,ctcf.motif.orientation)
peaks2ori = data.frame(peak = queryHits(peaks2ori),
                       pval = pval[as.numeric(subjectHits(peaks2ori))],
                       ori = strand(ctcf.motif.orientation)[as.numeric(subjectHits(peaks2ori))],
                       stringsAsFactors=F )
ctcf.motif.ori = unlist(lapply(split(peaks2ori,peaks2ori$peak), function(m){
  as.character( m[which.min(m$pval),'ori'] )
}))

pos_peaks = gainedCTCFpeaks[ as.numeric(names(ctcf.motif.ori[ctcf.motif.ori=='+'])) ]
neg_peaks = gainedCTCFpeaks[ as.numeric(names(ctcf.motif.ori[ctcf.motif.ori=='-'])) ]

For each loop domain, we assess the distribution of CTCF binding sites in associated with forward and reverse oriented CTCF motifs.

getGR = function(coords,ga){

  return(GRanges(seqnames = Rle(ga[coords[,1],'chr']),
                 ranges = IRanges(start = as.numeric(ga[coords[,1],'start']),
                                  end = as.numeric(ga[coords[,2],'end'])),
                 strand = Rle(rep("*", nrow(coords)))) )


}

nrow(loops.ns)

induced_loops_intervals = getGR( loops.ns[,c('row','col')], ga )

DOM_neg_up = data.frame()
DOM_pos_up = data.frame()

length(induced_loops_intervals)


for(i in 1:length(induced_loops_intervals))
  {
  
  intervals = round( c( seq( start(induced_loops_intervals)[i],
                             end(induced_loops_intervals)[i]+1,length.out=250 ) ) )
  BS = min(diff(intervals))
  left_extra = seq( start(induced_loops_intervals)[i]-10*BS, start(induced_loops_intervals)[i]-BS, length.out=10 )
  right_extra = seq( end(induced_loops_intervals)[i]+BS, end(induced_loops_intervals)[i]+11*BS, length.out=11 )
  
  starts = c(left_extra,intervals,right_extra)
  ends = c( starts[2:length(starts)]-1 )
  starts = starts[1:(length(starts)-1)]
  
  thisDOM = GRanges(seqnames = chrom(induced_loops_intervals[i]),
                    ranges = IRanges(starts,
                                     ends),
                    strand = Rle(rep("*", length(starts))))
  ## profile bins
  thisVector_pos_up = countOverlaps( thisDOM, pos_peaks )
  thisVector_neg_up = countOverlaps( thisDOM, neg_peaks )

  ## report
  DOM_pos_up = rbind(DOM_pos_up,thisVector_pos_up)
  DOM_neg_up = rbind(DOM_neg_up,thisVector_neg_up)

}

save( DOM_neg_up, DOM_pos_up,  
      file=paste0( data_directory,'induced_loops_ES_NS_CTCF_motif_ori.RData' ) )

We display the distribution of these CTCF peaks.

load( paste0( data_directory,'induced_loops_ES_NS_CTCF_motif_ori.RData' ) )

par(mfrow=c(1,1),mar=c(4,4,1,1),bty='n',lwd=3,
    cex.lab=1.2,cex.axis=1.2)
plot(x=1:ncol(DOM_neg_up),
     y=100*colMeans(DOM_neg_up>0),ty='l',pch=15,cex=0.5,col='red',
     ylim=c(0,15),xaxt = 'n',las=2, ylab='% of loops', xlab='',las=2)
lines(x=1:ncol(DOM_pos_up),
     y=100*colMeans(DOM_pos_up>0),ty='l',cex=0.5,col='blue', lty=1)

axis(1,c(1:11,259:ncol(DOM_pos_up)),c(rep("",10),'start','end',rep("",11)),las=2 )

Loop dynamics and chromatin contact insulation

We identify loops that are stronger in NS cells as compared to ES cells (2i/LIF). Then, we check how the insulation of the anchors of these loops was changing. We will work with the normalized matrices at the resolution of 10 kb.

load( paste0(data_directory,'InSitu_ES_pooled_10kb_my_object.RData' ) )
load( paste0(data_directory,'InSitu_NS_pooled_10kb_my_object.RData' ) )

Define the functions and ranges.

load(paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision'))

fromAnchorsToGR = function( anchors,ga ){
  GRanges(seqnames = Rle(ga$chr[anchors[,1]]),
          ranges = IRanges(as.numeric(ga$start[anchors[,1]]),
                           end = as.numeric(ga$end[anchors[,2]])),
          strand = Rle(rep("*", nrow(anchors))))
}

getMidGR=function(gr){

  return(GRanges(seqnames = Rle( chrom(gr) ),
                 ranges = IRanges(start = rowMeans( cbind(start(gr),end(gr))),
                                  width=1 ),
                 strand = Rle(rep("*", length(gr)))) )
}


## get Ranges
repressed = append( fromAnchorsToGR( loops.es[,c('left_start','left_end')], ga ),
                    fromAnchorsToGR( loops.es[,c('right_start','right_end')], ga ) )

induced = append( fromAnchorsToGR( loops.ns[,c('left_start','left_end')], ga ),
                  fromAnchorsToGR( loops.ns[,c('right_start','right_end')], ga ) )

COM = di[ abs(di$log2FoldChange)<log2(1.25), ]
COM = append( fromAnchorsToGR( COM[,c('left_start','left_end')], ga ),
              fromAnchorsToGR( COM[,c('right_start','right_end')], ga ) )


## find bin at loop anchor - work with the resolution of 10kb
## common
COM.filt = COM[ -queryHits(findOverlaps(COM, c(repressed,induced)))]
COM.filt = reduce(COM.filt)
COM.filt = queryHits( findOverlaps(gagr10, getMidGR( COM.filt ) ) )

## reduced
reduced.filt = repressed[ -queryHits(findOverlaps(repressed,c(induced,COM)))]
reduced.filt = reduce( reduced.filt ) 
reduced.filt = queryHits( findOverlaps(gagr10, getMidGR( reduced.filt )) )

# induced
induced.filt = induced[ - queryHits(findOverlaps(induced,c(repressed,COM))) ]
induced.filt = reduce(induced.filt)
induced.filt = queryHits( findOverlaps(gagr10, getMidGR( induced.filt ) ) )

Estimation of the insulation effect

We define a function - getAreaAround to collect the signal for a bin in a square (Area) where the central pixel is a distance (distance) from the diagonal.

getAreaAround = function( bin, mat, distance, Area){

  ## bin=c(10,11); mat=esins[[2]][1:20,1:20]; distance=5; Area=3
  rows = unlist(lapply(as.list(bin), function(b){
    rep( ( (b-distance) + seq(-Area,Area)) , (1+2*Area))}))
  cols = unlist(lapply(as.list(bin), function(b){
    rep( ( (b+distance) + seq(-Area,Area) ), each=(1+2*Area))}))
  ids = cbind(rows, cols)
  print('subsetting')
  m = mat[ids]
  
  IDS = rep( bin, each = (1+2*Area)^2 )
  matrices = split( m, IDS )
  
  return( unlist(lapply(matrices,function(m){sum(m, na.rm=T)})) )

}

The relationship between loop dynamics and insulation changes.

D=5

reduced_es = data.frame( left = getAreaAround( reduced.filt-10, esins[[2]], D, 2 ),
                         right = getAreaAround( reduced.filt+10, esins[[2]], D, 2 ),
                         between = getAreaAround( reduced.filt, esins[[2]], D, 2 ), 
                         stringsAsFactors=FALSE )

common_es = data.frame( left = getAreaAround( COM.filt-10, esins[[2]], D, 2 ),
                        right = getAreaAround( COM.filt+10, esins[[2]], D, 2 ),
                        between = getAreaAround( COM.filt, esins[[2]], D, 2 ), 
                        stringsAsFactors=FALSE )

induced_es = data.frame( left = getAreaAround( induced.filt-10, esins[[2]], D, 2 ),
                         right = getAreaAround( induced.filt+10, esins[[2]], D, 2 ),
                         between = getAreaAround( induced.filt, esins[[2]], D, 2 ), 
                         stringsAsFactors=FALSE )

reduced_es = log2( rowMeans(reduced_es[,1:2])/reduced_es[,3] )
common_es = log2( rowMeans(common_es[,1:2])/common_es[,3] )
induced_es = log2( rowMeans(induced_es[,1:2])/induced_es[,3] )

save( reduced_es, common_es, induced_es,
      file=paste0(data_directory,'IS_dynamics_ES.RData'))


reduced_ns = data.frame( left = getAreaAround( reduced.filt-10, nsins[[2]], D, 2 ),
                         right = getAreaAround( reduced.filt+10, nsins[[2]], D, 2 ),
                         between = getAreaAround( reduced.filt, nsins[[2]], D, 2 ), 
                         stringsAsFactors=FALSE )

common_ns = data.frame( left = getAreaAround( COM.filt-10, nsins[[2]], D, 2 ),
                        right = getAreaAround( COM.filt+10, nsins[[2]], D, 2 ),
                        between = getAreaAround( COM.filt, nsins[[2]], D, 2 ), 
                        stringsAsFactors=FALSE )

induced_ns = data.frame( left = getAreaAround( induced.filt-10, nsins[[2]], D, 2 ),
                         right = getAreaAround( induced.filt+10, nsins[[2]], D, 2 ),
                         between = getAreaAround( induced.filt, nsins[[2]], D, 2 ), 
                         stringsAsFactors=FALSE )

reduced_ns = log2( rowMeans(reduced_ns[,1:2])/reduced_ns[,3] )
common_ns = log2( rowMeans(common_ns[,1:2])/common_ns[,3] )
induced_ns = log2( rowMeans(induced_ns[,1:2])/induced_ns[,3] )

save( reduced_ns, common_ns, induced_ns,
      file=paste0(data_directory,'IS_dynamics_NS.RData'))

Plot the result.

load( paste0(data_directory,'IS_dynamics_ES.RData') )
load( paste0(data_directory,'IS_dynamics_NS.RData') )

redLine=reduced_ns-reduced_es
blackLine=common_ns-common_es
blueLine=induced_ns-induced_es

redLine = redLine[is.finite(redLine) & ! is.na(redLine)]
blackLine=blackLine[is.finite(blackLine) & !is.na(blackLine)]
blueLine=blueLine[is.finite(blueLine) & ! is.na(blueLine)]

t.test( redLine, blackLine )

## 
##  Welch Two Sample t-test
## 
## data:  redLine and blackLine
## t = -10.997, df = 1177.5, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1836879 -0.1280685
## sample estimates:
##   mean of x   mean of y 
## -0.07196655  0.08391165

t.test( blueLine, blackLine )

## 
##  Welch Two Sample t-test
## 
## data:  blueLine and blackLine
## t = 17.547, df = 4873.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1304976 0.1633252
## sample estimates:
##  mean of x  mean of y 
## 0.23082304 0.08391165

t.test( blueLine, redLine )

## 
##  Welch Two Sample t-test
## 
## data:  blueLine and redLine
## t = 19.981, df = 1501.9, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2730639 0.3325152
## sample estimates:
##   mean of x   mean of y 
##  0.23082304 -0.07196655

par(mfrow=c(1,1),mar=c(5,5,1,1),cex.lab=1.5,cex.axis=1.5,bty='n')
multiecdf(list(blueLine,blackLine,redLine),
          xlim=c(-1,2),col=c(myNScolor,'black',myEScolor), lwd=2,legend=F,
          xlab=expression('Gain in insulation'), 
          main='',ylab='Cumulative fraction' )
legend(x=0.1,y=0.25,c('Induced', 'Common', 'Reduced'),
       col=c(myNScolor,'black',myEScolor), lwd=3, bty='n',cex=1.5 )
abline(v=0,lty=2,col='gray',lwd=2)

The composite profile of contacts at induced loop anchors - ES (2i/LIF) versus NS (TCC data)

Define the function to gather the signal around the anchors of induced loops.

load( paste0(data_directory, 'loops_comparison_ti_ns.RData'))
load( paste0(data_directory, 'TCC_IPF_10kb_pooled_ES_2i.RData') )
load( paste0(data_directory, 'TCC_IPF_10kbMatrices_pooled_EpiSC.RData') )
load( paste0(data_directory, 'TCC_IPF_10kb_ES_NS_pooled.RData') )

GetMs = function( coord, distance ){
  
  reg.width = 1+2*distance
  return( cbind( rows = unlist(lapply(as.list(coord[,1]), function(x){ rep(seq(x-distance, x+distance), reg.width)})),
                 cols = unlist(lapply(as.list(coord[,2]), function(x){ rep(seq(x-distance, x+distance), each=reg.width)})),
                 id = rep( seq(1,nrow(coord)), each=reg.width^2) ) )
  
}

induced_bins = unique(c(loops.ns.ti$row, loops.ns.ti$col))
reduced_bins = unique(c(loops.ti$row, loops.ti$col))
common_bins = unique(c(di$row[abs(di$log2FoldChange)<1.25], 
                       di$col[abs(di$log2FoldChange)<1.25]))
induced_bins = induced_bins[ ! induced_bins %in% unique(c(reduced_bins, common_bins)) ]

theCoordinates = GetMs( cbind( induced_bins,induced_bins), 40 )

ti.ap    = lapply(split(ti[[2]][theCoordinates[,1:2]],theCoordinates[,3]), function(x){Matrix(x,81,81)})
ep.ti.ap = lapply(split(ep[[2]][theCoordinates[,1:2]],theCoordinates[,3]), function(x){Matrix(x,81,81)})
ns.ti.ap = lapply(split(ns[[2]][theCoordinates[,1:2]],theCoordinates[,3]), function(x){Matrix(x,81,81)})

We will show a heat-map to illustrate how loop induction correlates with contact insulation across loop anchors.

ti.AP = Reduce( '+', ti.ap )
ep.ti.AP = Reduce( '+', ep.ti.ap )
ns.ti.AP = Reduce( '+', ns.ti.ap )

Plotting the figures separately.

m2 = log2(ti.AP/ns.ti.AP)

image(m2,col.regions=colorRampPalette(c("black",'white','blue'))(100),
      lwd=0, xlab='', ylab='', at=c(-Inf,seq(-0.5,0.5,length.out=98),Inf),
      cuts=100)

And ES (2i/LIF) versus EpiSCs.

m3 = log2(ti.AP/ep.ti.AP)

image(m3,col.regions=colorRampPalette(c("black",'white','blue'))(100),
      lwd=0, xlab='', ylab='', at=c(-Inf,seq(-0.5,0.5,length.out=98),Inf),
      cuts=100)

Chromatin interaction domains

We will analyse the boundaries of domains and the intra-domain interactions. First we focus on the boundary strength.

Insulation of domain boundaries - relationship with loops

First we stratify domain boundaries into two groups depending on whether they overlap the anchors of loops.

getGR = function(df,offset){

  return(GRanges(seqnames = Rle(df[,1]),
                 ranges = IRanges(start = as.numeric(df[,2])-offset,
                                  end = as.numeric(df[,3])+offset),
                 strand = Rle(rep("*", nrow(df))))) 


}


GR4bound = function( domains, loops, offset ){
  
  # domains1 = ns_domains; domains2 = es_domains; offset=20000
  left_dom1 = getGR( data.frame( chr=domains[,1], start=domains[,2], end=domains[,2], stringsAsFactors=F ),offset)
  right_dom1 = getGR( data.frame( chr=domains[,1], start=domains[,3], end=domains[,3], stringsAsFactors=F ),offset)
  left_loop = getGR( data.frame( chr=loops[,1], start=loops[,2], end=loops[,2], stringsAsFactors=F ),offset)
  right_loop = getGR( data.frame( chr=loops[,1], start=loops[,3], end=loops[,3], stringsAsFactors=F ),offset)
  
  left = left_dom1[ unique( queryHits(findOverlaps(left_dom1,left_loop)))]
  right = right_dom1[ unique( queryHits(findOverlaps(right_dom1,right_loop)))] 
  res = c(left,right)

  res2 = c(left_dom1[ -unique( queryHits(findOverlaps(left_dom1,left_loop)))],
           right = right_dom1[ - unique( queryHits(findOverlaps(right_dom1,right_loop)))] )
  return( list(looping=res,
               not_looping=res2) )
}
seqlevelsStyle(nsct) = 'UCSC'

looping.ctcf.ns = GR4bound( ns_domains, ns_loops_is, 20000 )[[1]]
not.looping.ctcf.ns = GR4bound( ns_domains, ns_loops_is, 20000 )[[2]]

looping.ctcf.ns = looping.ctcf.ns[ unique(queryHits(findOverlaps(looping.ctcf.ns,nsct))) ]
not.looping.ctcf.ns = not.looping.ctcf.ns[ unique( queryHits(findOverlaps(not.looping.ctcf.ns,nsct))) ]

looping.ctcf.ns = reduce(looping.ctcf.ns)
looping.ctcf.ns = queryHits( findOverlaps(gagr10, getMidGR( looping.ctcf.ns ) ) )

not.looping.ctcf.ns = reduce(not.looping.ctcf.ns)
not.looping.ctcf.ns = queryHits( findOverlaps(gagr10, getMidGR( not.looping.ctcf.ns ) ) )


## double check if the object is the correct one
nrow(nsins[[2]])

looping = data.frame( left = getAreaAround( looping.ctcf.ns-10, nsins[[2]], 5, 2 ),
                      right = getAreaAround( looping.ctcf.ns+10, nsins[[2]], 5, 2 ),
                      between = getAreaAround( looping.ctcf.ns, nsins[[2]], 5, 2 ), 
                      stringsAsFactors=FALSE )

not.looping = data.frame( left = getAreaAround( not.looping.ctcf.ns-10, nsins[[2]], 5, 2 ),
                          right = getAreaAround( not.looping.ctcf.ns+10, nsins[[2]], 5, 2 ),
                          between = getAreaAround( not.looping.ctcf.ns, nsins[[2]], 5, 2 ), 
                          stringsAsFactors=FALSE )

looping = log2( rowMeans(looping[,1:2])/looping[,3] )
not.looping = log2( rowMeans(not.looping[,1:2])/not.looping[,3] )

looping = looping[ is.finite(looping) ]
not.looping = not.looping[ is.finite(not.looping) ]

save( looping, not.looping,
      file=paste0(data_directory,'IS_loop_no_loop.RData'))

The plot.

load(paste0(data_directory,'IS_loop_no_loop.RData'))

t.test( looping, not.looping )

## 
##  Welch Two Sample t-test
## 
## data:  looping and not.looping
## t = 23.077, df = 3451.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3799323 0.4504864
## sample estimates:
## mean of x mean of y 
## 0.9432264 0.5280170

par(mfrow=c(1,1),mar=c(5,6,1,6),cex.lab=1.5,cex.axis=1.5,bty='n')
multiecdf(list(looping,
               not.looping),
          xlim=c(-1,3),col=c('red','gray'), lwd=2,legend=F,
          xlab=expression('Insulation score (log'[2]*')'), 
          main='', ylab='Cumulative fraction' )
legend(x=1,y=0.4,c('Loop', 'No loop'),
       col=c('red','gray'), lwd=3, bty='n',cex=1.5 )
abline(v=0,lty=2,col='gray',lwd=2)

Loop domains and ordinary domains

Compare domains between ES and NS cells. Despite differences in the positioning of identified domains, we will show that boundaries are rather preserved.

## domain overlap
compare_domains = function( domains1, domains2, offset, ga ){
  
  getGR = function( b, offset ){
  GRanges(seqnames = Rle(b[,1]),
          ranges = IRanges(as.numeric(b[,2])-offset,
                           end = as.numeric(b[,3])+offset,
                           names = seq(1, nrow(b))),
          strand = Rle(rep("*", nrow(b))))}
  
  # domains1 = ns_domains; domains2 = es_domains; offset=20000
  left_dom1 = getGR( data.frame( chr=domains1[,1], start=domains1[,2], end=domains1[,2], stringsAsFactors=F ),offset)
  right_dom1 = getGR( data.frame( chr=domains1[,1], start=domains1[,3], end=domains1[,3], stringsAsFactors=F ),offset)
  left_dom2 = getGR( data.frame( chr=domains2[,1], start=domains2[,2], end=domains2[,2], stringsAsFactors=F ),offset)
  right_dom2 = getGR( data.frame( chr=domains2[,1], start=domains2[,3], end=domains2[,3], stringsAsFactors=F ),offset)

  left = as.data.frame( findOverlaps( left_dom1, left_dom2 ) )
  right = as.data.frame( findOverlaps( right_dom1, right_dom2 ) )
  
  ## common domains based on the annotation in domains1
  lr = merge.data.frame( left, right, by.x='queryHits',by.y='queryHits')
  common_domains = domains1[ lr[ lr$subjectHits.x==lr$subjectHits.y, 'queryHits'], ]
  
  ## cond1 specific domains
  cond1_specific_domains = domains1[ -lr[ lr$subjectHits.x==lr$subjectHits.y, 'queryHits'], ]

  ## common domains based on the annotation in domains1
  left = as.data.frame( findOverlaps( left_dom2, left_dom1 ) )
  right = as.data.frame( findOverlaps( right_dom2, right_dom1 ) )
  lr = merge.data.frame( left, right, by.x='queryHits',by.y='queryHits')
  ## cond2 specific domains
  cond2_specific_domains = domains2[ -lr[ lr$subjectHits.x==lr$subjectHits.y, 'queryHits'], ]
  
  return( list(common = common_domains,
               cond1s = cond1_specific_domains,
               cond2s = cond2_specific_domains ))
}
es_vs_ns_domains = compare_domains( es_domains, ns_domains, 20000 )
unlist(lapply(es_vs_ns_domains, nrow))

## common cond1s cond2s 
##   1847   2629   2410

## boundary overlap
compare_boundaries = function( domains1, domains2, offset ){
  getGR = function( b, offset ){
  GRanges(seqnames = Rle(b[,1]),
          ranges = IRanges(as.numeric(b[,2])-offset,
                           end = as.numeric(b[,3])+offset,
                           names = seq(1, nrow(b))),
          strand = Rle(rep("*", nrow(b))))}
  # domains1 = ns_domains; domains2 = es_domains; offset=20000
  left_dom1 = getGR( data.frame( chr=domains1[,1], start=domains1[,2], end=domains1[,2], stringsAsFactors=F ),offset)
  right_dom1 = getGR( data.frame( chr=domains1[,1], start=domains1[,3], end=domains1[,3], stringsAsFactors=F ),offset)
  left_dom2 = getGR( data.frame( chr=domains2[,1], start=domains2[,2], end=domains2[,2], stringsAsFactors=F ),offset)
  right_dom2 = getGR( data.frame( chr=domains2[,1], start=domains2[,3], end=domains2[,3], stringsAsFactors=F ),offset)
  
  b1 = c(reduce(left_dom1),reduce(right_dom1))
  b2 = c(reduce(left_dom2),reduce(right_dom2))
  
  return( list(common = b1[ unique( queryHits(findOverlaps(b1,b2))) ],
               cond1s = b1[ -unique( queryHits(findOverlaps(b1,b2))) ],
               cond2s = b2[ -unique( queryHits(findOverlaps(b2,b1))) ]  ) )
}

es_vs_ns_boundaries = compare_boundaries( es_domains, ns_domains, 20000 )
unlist(lapply(es_vs_ns_boundaries, length))

## common cond1s cond2s 
##   5077   2275   1919

Compute DI at domain boundaries.

load( paste0(data_directory,'InSitu_ES_pooled_10kb_my_object.RData') )
load( paste0(data_directory,'InSitu_NS_pooled_10kb_my_object.RData') )
load(paste0(data_directory, 'mm9_mappability_10kb_bins.RData'))

chroms = paste0('chr',as.list(c(1:19,'X','Y')))
names(chroms)=chroms

es_di = computeDI(esins, ga10, offset=200,bins1, mappabilityThr=0.6, chroms )
ns_di = computeDI(nsins, ga10, offset=200,bins1, mappabilityThr=0.6, chroms )

save( es_di, ns_di, file=paste0(data_directory,'DI_10kb_ES_NS_in_situ.RData') )

Plot the DI around cell type specific and common boundaries.

load( paste0(data_directory,'DI_10kb_ES_NS_in_situ.RData') )

getDIprofile = function( DI, gagr, boundaries, windowHalfSize ){
  
  getMidGR=function(gr){

  return(GRanges(seqnames = Rle( chrom(gr) ),
                 ranges = IRanges(start = rowMeans( cbind(start(gr),end(gr)))+1,
                                  width=1 ),
                 strand = Rle(rep("*", length(gr)))) )
  }
  
  # DI=unlist(es_di);gagr=gagr10;boundaries=es_vs_ns_boundaries[[1]];windowHalfSize=10
  theBins = subjectHits(findOverlaps(getMidGR( boundaries ), gagr ))
  stopifnot(all( as.character(chrom(gagr)[theBins])==as.character(chrom(boundaries))))
  
  theBins = unlist(lapply(as.list(seq(1,length(boundaries))),function(b){
    return( seq(theBins[b]-windowHalfSize, theBins[b]+windowHalfSize) ) }))
  
  return( matrix( DI[theBins], 
                  nrow=length(boundaries), 
                  ncol=length(seq(1-windowHalfSize,1+windowHalfSize)),
                  byrow=T ) )
}


di_common_boundaries_es = getDIprofile( unlist(es_di), gagr10, es_vs_ns_boundaries[[1]], 10 )
di_common_boundaries_ns = getDIprofile( unlist(ns_di), gagr10, es_vs_ns_boundaries[[1]], 10 )
di_common_boundaries_ES = di_common_boundaries_es[order(rowSums(abs(di_common_boundaries_ns)),decreasing = F),]
di_common_boundaries_NS = di_common_boundaries_ns[order(rowSums(abs(di_common_boundaries_ns)),decreasing = F),]

di_es_sp_boundaries_es = getDIprofile(  unlist(es_di), gagr10, es_vs_ns_boundaries[[2]], 10 )
di_es_sp_boundaries_ns = getDIprofile(  unlist(ns_di), gagr10, es_vs_ns_boundaries[[2]], 10 )
di_es_sp_boundaries_ES = di_es_sp_boundaries_es[order(rowSums(abs(di_es_sp_boundaries_es)),decreasing = F),]
di_es_sp_boundaries_NS = di_es_sp_boundaries_ns[order(rowSums(abs(di_es_sp_boundaries_es)),decreasing = F),]

di_ns_sp_boundaries_es = getDIprofile(  unlist(es_di), gagr10, es_vs_ns_boundaries[[3]], 10 )
di_ns_sp_boundaries_ns = getDIprofile(  unlist(ns_di), gagr10, es_vs_ns_boundaries[[3]], 10 )
di_ns_sp_boundaries_ES = di_ns_sp_boundaries_es[order(rowSums(abs(di_ns_sp_boundaries_ns)),decreasing = F),]
di_ns_sp_boundaries_NS = di_ns_sp_boundaries_ns[order(rowSums(abs(di_ns_sp_boundaries_ns)),decreasing = F),]


BigM_ES = rbind( di_common_boundaries_ES, di_es_sp_boundaries_ES )
BigM_ES = rbind( BigM_ES, di_ns_sp_boundaries_ES )

BigM_NS = rbind( di_common_boundaries_NS, di_es_sp_boundaries_NS )
BigM_NS = rbind( BigM_NS, di_ns_sp_boundaries_NS )

BigM_ES[ BigM_ES >0.2] = 0.2
BigM_ES[ BigM_ES <(-0.2)] = -0.2

BigM_NS[ BigM_NS >0.2] = 0.2
BigM_NS[ BigM_NS <(-0.2)] = -0.2

Display also the medians of the DI at each genomic bin.

colMedians = function(M){
  apply(M,2,function(x){mean(x,trim=0.01, na.rm=T)})
}


par( mfrow=c(1,3), 
     cex.main=1.5, mar=c(5,5,1,1), lwd=4, bty='n' )

plot( x=seq(-100,+100,length.out=21), 
      y=colMedians(di_common_boundaries_ES), ylim=c(-0.05, 0.05),
      ty='l', col=myEScolor,main='Common',ylab='DI', xlab='Pos (kb)' )
lines(x=seq(-100,+100,length.out=21), 
      y=colMedians(di_common_boundaries_NS),
      col=myNScolor )

plot( x=seq(-100,+100,length.out=21), 
      y=colMedians(di_es_sp_boundaries_ES),ylim=c(-0.05, 0.05),
      ty='l', col=myEScolor,main='ES specific',ylab='DI', xlab='Pos (kb)' )
lines(x=seq(-100,+100,length.out=21), 
      y=colMedians(di_es_sp_boundaries_NS),
      col=myNScolor )

plot( x=seq(-100,+100,length.out=21), 
      y=colMedians(di_ns_sp_boundaries_ES),ylim=c(-0.05, 0.05),
      ty='l', col=myEScolor,main='NS specific',ylab='DI', xlab='Pos (kb)' )
lines(x=seq(-100,+100,length.out=21), 
      y=colMedians(di_ns_sp_boundaries_NS),
      col=myNScolor )

Insulation score for boundaries

di_common_bound4IS = subjectHits(findOverlaps(getMidGR( reduce(es_vs_ns_boundaries[[1]]) ), gagr10 ))
di_es_sp_bound4IS = subjectHits(findOverlaps(getMidGR( reduce(es_vs_ns_boundaries[[2]]) ), gagr10 ))
di_ns_sp_bound4IS = subjectHits(findOverlaps(getMidGR( reduce(es_vs_ns_boundaries[[3]]) ), gagr10 ))

## IS for common boundaries
commonB_ES = data.frame( left = getAreaAround( di_common_bound4IS-10, esins[[2]], 5, 2 ),
                         right = getAreaAround( di_common_bound4IS+10, esins[[2]], 5, 2 ),
                         between = getAreaAround( di_common_bound4IS, esins[[2]], 5, 2 ), 
                         stringsAsFactors=FALSE )
commonB_ES = log2( rowMeans(commonB_ES[,1:2])/commonB_ES[,3] )

commonB_NS = data.frame( left = getAreaAround( di_common_bound4IS-10, nsins[[2]], 5, 2 ),
                         right = getAreaAround( di_common_bound4IS+10, nsins[[2]], 5, 2 ),
                         between = getAreaAround( di_common_bound4IS, nsins[[2]], 5, 2 ), 
                         stringsAsFactors=FALSE )
commonB_NS = log2( rowMeans(commonB_NS[,1:2])/commonB_NS[,3] )


## IS for ES specific boundaries
es_speB_ES = data.frame( left = getAreaAround( di_es_sp_bound4IS-10, esins[[2]], 5, 2 ),
                         right = getAreaAround( di_es_sp_bound4IS+10, esins[[2]], 5, 2 ),
                         between = getAreaAround( di_es_sp_bound4IS, esins[[2]], 5, 2 ), 
                         stringsAsFactors=FALSE )
es_speB_ES = log2( rowMeans(es_speB_ES[,1:2])/es_speB_ES[,3] )

es_speB_NS = data.frame( left = getAreaAround( di_es_sp_bound4IS-10, nsins[[2]], 5, 2 ),
                         right = getAreaAround( di_es_sp_bound4IS+10, nsins[[2]], 5, 2 ),
                         between = getAreaAround( di_es_sp_bound4IS, nsins[[2]], 5, 2 ), 
                         stringsAsFactors=FALSE )
es_speB_NS = log2( rowMeans(es_speB_NS[,1:2])/es_speB_NS[,3] )


## IS for ES specific boundaries
ns_speB_ES = data.frame( left = getAreaAround( di_ns_sp_bound4IS-10, esins[[2]], 5, 2 ),
                         right = getAreaAround( di_ns_sp_bound4IS+10, esins[[2]], 5, 2 ),
                         between = getAreaAround( di_ns_sp_bound4IS, esins[[2]], 5, 2 ), 
                         stringsAsFactors=FALSE )
ns_speB_ES = log2( rowMeans(ns_speB_ES[,1:2])/ns_speB_ES[,3] )

ns_speB_NS = data.frame( left = getAreaAround( di_ns_sp_bound4IS-10, nsins[[2]], 5, 2 ),
                         right = getAreaAround( di_ns_sp_bound4IS+10, nsins[[2]], 5, 2 ),
                         between = getAreaAround( di_ns_sp_bound4IS, nsins[[2]], 5, 2 ), 
                         stringsAsFactors=FALSE )
ns_speB_NS = log2( rowMeans(ns_speB_NS[,1:2])/ns_speB_NS[,3] )

save( commonB_ES, commonB_NS, es_speB_ES, es_speB_NS, ns_speB_ES, ns_speB_NS,
      file=paste0(data_directory,'domain_boundaries_IS.RData') )

Plot the Insulation score.

load( paste0(data_directory,'domain_boundaries_IS.RData') )


par(mfrow=c(1,1),mar=c(5,6,1,6),cex.lab=1.5,cex.axis=1.5,bty='n')

plot(density(commonB_NS-commonB_ES,na.rm=T), col='gray20',bty='n',lwd=5,
     xlab=expression('NS/ES [log'[2]*']'), ylim=c(0,2),xlim=c(-2,2),
     main='')
lines( density(es_speB_NS-es_speB_ES,na.rm=T), col=myEScolor,lwd=5 )
lines( density(ns_speB_NS-ns_speB_ES,na.rm=T), col=myNScolor,lwd=5 )
abline(v=0,lty=2,lwd=3)

Loop induction and chromatin features

The analysis of induced loop intervals. We assess how many induced active enhancers are within the induced loops. Load RNA-seq data and the annotation of genes.

load( paste0( data_directory, "RNASeq_ESC_NSC.RData") )

## gene anno
gtf = import( paste0(data_directory, 'Mus_musculus.NCBIM37.67.gtf' ) )
seqlevelsStyle(gtf) = 'ucsc'

findTSS = function(gtf){
  
  df = data.frame( transcriptID = elementMetadata(gtf)[,'transcript_id'],
                   geneID = elementMetadata(gtf)[,'gene_id'],
                   chr=chrom(gtf),
                   start=start(gtf),
                   end=end(gtf),
                   strand=as.character(strand(gtf)), 
                   exon = as.numeric(elementMetadata(gtf)[,'exon_number']),
                   stringsAsFactors=FALSE)
  dfs = split( df, df$transcriptID )
  
  res = do.call('rbind', lapply(dfs,function(x){
    if( unique(x$strand)=='-' ) {tss = max(x$end)}
    if( unique(x$strand)=='+' ) {tss = min(x$start)}
    fd = data.frame( chr=unique(x$chr),
                     tss = tss,
                     geneN = (unique(x$geneID)),
                     stringsAsFactors=FALSE  ) 
    return(fd) } ) )
  
  res=GRanges(seqnames = Rle( res$chr ),
              ranges = IRanges(start = as.numeric(res$tss) - 100,
                               end = as.numeric(res$tss) + 100,
                               names = res$geneN ),
              strand = Rle(rep("*", nrow(res))) )
  res$geneID = res$geneID
  # res = reduce(res)
  return(res)
}

tsgr = findTSS(gtf)

save( tsgr, file=paste0(data_directory,'AllTSS_NCBIM37_67_all_TSS_TTS.RData') )

Define enhancers and active promoters.

load(paste0(data_directory,'AllTSS_NCBIM37_67_all_TSS_TTS.RData'))
seqlevelsStyle(tsgr) = 'ncbi'

me1_peaks_es = import.bed( paste0(data_directory,'mESC_H3K4me1_peaks.bed') )
me1_peaks_ns = import.bed( paste0(data_directory,'mNSC_H3K4me1_peaks.bed') )

k27_peaks_es = import.bed( paste0(data_directory,'mESC_H3K27ac_peaks.bed') )
k27_peaks_ns = import.bed( paste0(data_directory,'mNSC_H3K27ac_peaks.bed') )

me3_peaks_es = import.bed( paste0(data_directory,'mESC_H3K4me3_peaks.bed') )
me3_peaks_ns = import.bed( paste0(data_directory,'mNSC_H3K4me3_peaks.bed') )


es_me1p.filt = me1_peaks_es[ -unique(queryHits(findOverlaps(me1_peaks_es, me3_peaks_es))) ]
escEa = k27_peaks_es[ unique(queryHits(findOverlaps(k27_peaks_es, es_me1p.filt))) ]
escEm = es_me1p.filt[ -unique(queryHits(findOverlaps(es_me1p.filt, k27_peaks_es))) ]


ns_me1p.filt = me1_peaks_ns[ -unique(queryHits(findOverlaps(me1_peaks_ns, me3_peaks_ns))) ]
nscEa = k27_peaks_ns[ unique(queryHits( findOverlaps(k27_peaks_ns, ns_me1p.filt)))  ]
nscEm = ns_me1p.filt[ -unique(queryHits(findOverlaps(ns_me1p.filt, k27_peaks_ns))) ]


escEm = escEm[ - as.numeric(unique(queryHits(findOverlaps(escEm, tsgr)))) ]
escEa = escEa[ - as.numeric(unique(queryHits(findOverlaps(escEa, tsgr)))) ]
nscEm = nscEm[ - as.numeric(unique(queryHits(findOverlaps(nscEm, tsgr)))) ]
nscEa = nscEa[ - as.numeric(unique(queryHits(findOverlaps(nscEa, tsgr)))) ]
# seqlevelsStyle(tsgr) = 'ucsc'


# induced active enhancers
induced_aE = nscEa[ -unique(queryHits(findOverlaps(nscEa,c(escEa))))] 
reduced_aE = escEa[ -unique(queryHits(findOverlaps(escEa,c(nscEa))))] 
common_aE = nscEa[ unique(queryHits(findOverlaps(nscEa,c(escEa))))] 

# induced genes
load( paste0( data_directory, "RNASeq_ESC_NSC.RData") )
ig = res$id[res$padj<0.01 & res$log2FoldChange>1.5]
rg = res$id[res$padj<0.01 & res$log2FoldChange<(-1.5)]


me3_peaks_es = import.bed( paste0(data_directory,'mESC_H3K4me3_peaks.bed') )
me3_peaks_ns = import.bed( paste0(data_directory,'mNSC_H3K4me3_peaks.bed') )

active_promoters_es = tsgr[ findOverlaps(tsgr,me3_peaks_es)@queryHits ]
active_promoters_ns = tsgr[ findOverlaps(tsgr,me3_peaks_ns)@queryHits ]

inactive_promoters_es = tsgr[ - findOverlaps(tsgr,me3_peaks_es)@queryHits ]
inactive_promoters_ns = tsgr[ - findOverlaps(tsgr,me3_peaks_ns)@queryHits ]

Enhancers inside domains

load( paste0( data_directory, "RNASeq_ESC_NSC.RData") )
load(paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision'))
di$not.sig = abs( di$log2FoldChange ) < log2(1.25)

load( paste0(data_directory,'AllTSS_NCBIM37_67_all_TSS_TTS.RData')  )
seqlevelsStyle(reduced_aE) = 'ucsc'
seqlevelsStyle(induced_aE) = 'ucsc'

reduced_loci = tsgr[ which( names(tsgr) %in% res$id[res$log2FoldChange<log2(1/1.5) ] ) ]
induced_loci = tsgr[ which( names(tsgr) %in% res$id[res$log2FoldChange>log2(1.5) ] ) ]


induced_loops_intervals = GRanges(seqnames = Rle(ga[loops.ns$row,'chr']),
                 ranges = IRanges(start = as.numeric(ga[(loops.ns$left_end)+2,'start']),
                                  end = as.numeric(ga[(loops.ns$right_start)-2,'end'])),
                 strand = Rle(rep("*", nrow(loops.ns))))

reduced_loops_intervals = GRanges(seqnames = Rle(ga[loops.es$row,'chr']),
                 ranges = IRanges(start = as.numeric(ga[(loops.es$left_end)+2,'start']),
                                  end = as.numeric(ga[(loops.es$right_start)-2,'end'])),
                 strand = Rle(rep("*", nrow(loops.es))))


common_loops_intervals = GRanges(seqnames = Rle(ga[di$row[di$not.sig],'chr']),
                 ranges = IRanges(start = as.numeric(ga[ (di$left_end[di$not.sig])+2,'start']),
                                  end = as.numeric(ga[ (di$right_start[di$not.sig])-2,'end'])),
                 strand = Rle(rep("*", sum(di$not.sig))))


par( mfrow=c(1,2), mar=c(5,5,5,1), cex.lab=1.2, cex.axis=1.2 )

barplot( c( sum( countOverlaps( induced_loops_intervals, reduced_aE ) > 0 )/length(induced_loops_intervals),
            sum( countOverlaps( induced_loops_intervals, induced_aE ) > 0 )/length(induced_loops_intervals)),
         col=c(myEScolor,myNScolor), ylim=c(0,1), main='Active enhancers',
         names=c('Off', 'On'), xlab='Enhancer', ylab='Fraction of induced loops') 


barplot( c( sum( countOverlaps(induced_loops_intervals,reduced_loci)>0 )/length(induced_loops_intervals),
            sum( countOverlaps(induced_loops_intervals,induced_loci)>0 )/length(induced_loops_intervals)),
         col=c(myEScolor,myNScolor), ylim=c(0,1), main='Gene expression',
         names=c('down', 'up'), xlab='Expression', ylab='Fraction of induced loops')

sum( countOverlaps(induced_loops_intervals,reduced_loci)>0 )

## [1] 852

sum( countOverlaps(induced_loops_intervals,induced_loci)>0 )

## [1] 1250

fisher.test( matrix(c(sum( countOverlaps( induced_loops_intervals, reduced_aE ) > 0 ),
                      sum( countOverlaps( induced_loops_intervals, induced_aE ) > 0 ),
                      length(reduced_aE),
                      length(induced_aE)),2,2))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  
## p-value = 0.000001463
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7558967 0.8898079
## sample estimates:
## odds ratio 
##  0.8202303

fisher.test( matrix(c(sum( countOverlaps( reduced_loops_intervals, reduced_aE ) > 0 ),
                      sum( countOverlaps( reduced_loops_intervals, induced_aE ) > 0 ),
                      length(reduced_aE),
                      length(induced_aE)),2,2))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  
## p-value = 0.000000000004181
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.487629 2.054682
## sample estimates:
## odds ratio 
##   1.747546

fisher.test( matrix(c(sum( countOverlaps( common_loops_intervals, reduced_aE ) > 0 ),
                      sum( countOverlaps( common_loops_intervals, induced_aE ) > 0 ),
                      length(reduced_aE),
                      length(induced_aE)),2,2))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  
## p-value = 0.1217
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9843754 1.1425727
## sample estimates:
## odds ratio 
##   1.060566

## genes with an induced and reduced loop
## annotation of loop anchors
leftGR = gagr[ loops.ns$row ]
start(leftGR) = start(leftGR)-5000
end(leftGR) = end(leftGR)+5000
seqlevelsStyle(leftGR) = 'ucsc'
rightGR = gagr[ loops.ns$col ]
start(rightGR) = start(rightGR)-5000
end(rightGR) = end(rightGR)+5000
seqlevelsStyle(rightGR) = 'ucsc'

induced.genes.at.loop.anchor = unique( names(tsgr[queryHits(findOverlaps(tsgr, c(leftGR, rightGR)))]) )
write.table( induced.genes.at.loop.anchor, file=paste0(data_directory,'induced.genes.anchors.txt'),
             quote=F,row.names=F,col.names=F)


## 
induced.genes.inside = unique( names(tsgr[queryHits(findOverlaps(tsgr, induced_loops_intervals))]) )
repressed.genes.inside = unique( names(tsgr[queryHits(findOverlaps(tsgr, reduced_loops_intervals))]) )

write.table( induced.genes.inside, file=paste0(data_directory,'induced.genes.inside2.txt'),
             quote=F,row.names=F,col.names=F)

write.table( repressed.genes.inside, file=paste0(data_directory,'repressed.genes.inside2.txt'),
             quote=F,row.names=F,col.names=F)

GO-term analysis

We import and filter the results obtained using DAVID online tool (https://david.ncifcrf.gov/).

dk = read.delim(paste0(data_directory,'induced_inside_loops_DAVID.txt'), header=T, as.is=T)
dk=dk[dk$Fold.Enrichment>1.5 & dk$Benjamini<0.01,]
dk$t = unlist(strsplit(dk$Term,'~'))[seq(2,2*nrow(dk),by=2)] 
dk = dk[order(dk$Benjamini,decreasing=F),]
dkp = dk[c(30,33,45,57,62,76,98,101,144,146,147,155,219),]
dkp = dkp[order(dkp$Benjamini, decreasing=F),]

# pdf( '/g/huber/projects/HiC/Pekowska_et_al/data/DAVID1.pdf', 10, 10 )
par(mfrow=c(1,1),mar=c(5,30,1,1),cex.lab=1.5,cex.axis=1.5)
barplot( -log10(dkp$Benjamini), 
         horiz=T, names=dkp$t,las=2,
         xlab=expression('-Log'[10]*'(P-val)'), col='blue4')

# dev.off()

GO term of genes with promoters overlapping the anchors of induced loops.

dk = read.delim(paste0(data_directory,'induced_at_anchors_DAVID.txt'), header=T, as.is=T)
dk=dk[dk$Fold.Enrichment>1.5 & dk$Benjamini<0.01,]
dk$t = unlist(strsplit(dk$Term,'~'))[seq(2,2*nrow(dk),by=2)] 
dk = dk[order(dk$Benjamini,decreasing=F),]
dkp = dk[c(1,2,3,4,5,11,22,27,36,41,48,64,91),]
dkp = dkp[order(dkp$Benjamini, decreasing=F),]

# pdf( '/g/huber/projects/HiC/Pekowska_et_al/data/DAVID2.pdf', 10, 10 )

par(mfrow=c(1,1),mar=c(5,30,1,1),cex.lab=1.5,cex.axis=1.5)
barplot( -log10(dkp$Benjamini), 
         horiz=T, names=dkp$t,las=2,
         xlab=expression('-Log'[10]*'(P-val)'), col='red')

# dev.off()

Loops - annotation of anchors

First we check if genes promoters of which are in the anchors of loops are expressed higher than genes that do not form loops.

getGenesPerAnchor = function( loops,tsgr,myObject, ga ){
  ## loops = es_loops_is
  if(!myObject){
  leftGR = GRanges(seqnames = Rle(loops[,1]),
                   ranges = IRanges(start = as.numeric(loops[,2]),
                                    end = as.numeric(loops[,3])),
                   strand = Rle(rep("*", nrow(loops))) ) 
  righGR = GRanges(seqnames = Rle(loops[,4]),
                   ranges = IRanges(start = as.numeric(loops[,5]),
                                    end = as.numeric(loops[,6])),
                   strand = Rle(rep("*", nrow(loops))) )
  seqlevelsStyle(tsgr) = seqlevelsStyle(leftGR)}
  
  if(myObject){
  leftGR = GRanges(seqnames = Rle(ga$chr[loops[,'left_start']]),
                   ranges = IRanges(start = as.numeric(ga$start[loops[,'left_start']]),
                                    end = as.numeric(ga$end[loops[,'left_end']])),
                   strand = Rle(rep("*", nrow(loops))) ) 
  righGR = GRanges(seqnames = Rle(ga$chr[loops[,'right_start']]),
                   ranges = IRanges(start = as.numeric(ga$start[loops[,'right_start']]),
                                    end = as.numeric(ga$end[loops[,'right_end']])),
                   strand = Rle(rep("*", nrow(loops))) ) 
  }
  genesForLeftAnchors=names( tsgr[ queryHits(findOverlaps(tsgr,leftGR )) ] )
  genesForRightAnchors=names( tsgr[ queryHits(findOverlaps(tsgr,righGR)) ] )
  return(unique(c(genesForLeftAnchors,genesForRightAnchors) ))
}

loopingGenes.es = getGenesPerAnchor( es_loops_is,tsgr,FALSE,ga ) 
loopingGenes.ns = getGenesPerAnchor( ns_loops_is,tsgr,FALSE,ga )

Functional annotation of genomic features overlapping the anchors of the induced loops

nrow( loops.ns )

## [1] 2454

## annotation of loop anchors
leftGR = gagr[ loops.ns$row ]
start(leftGR) = start(leftGR)-5000
end(leftGR) = end(leftGR)+5000
rightGR = gagr[ loops.ns$col ]
start(rightGR) = start(rightGR)-5000
end(rightGR) = end(rightGR)+5000


seqlevelsStyle(tsgr) = 'ucsc'
seqlevelsStyle(nscEm) = 'ucsc'
seqlevelsStyle(nscEa) = 'ucsc'
seqlevelsStyle(me3_peaks_ns) = 'ucsc'

tss_present = tsgr[unique(queryHits(findOverlaps(tsgr,me3_peaks_ns)))]

annot = data.frame( left_prom_1 = countOverlaps(leftGR, tss_present),
                    left_Penh_2 = countOverlaps(leftGR, nscEm),
                    left_Aenh_3 = countOverlaps(leftGR, nscEa),
                    right_prom_4 = countOverlaps(rightGR, tss_present),
                    right_Penh_5 = countOverlaps(rightGR, nscEm),
                    right_Aenh_6 = countOverlaps(rightGR, nscEa))

## make a binary matrix
bm = matrix(as.numeric(annot>0),nrow=nrow(annot), ncol=6 ) 

m = c(no_overlap = sum(rowSums(bm)==0),
      regElem_none = sum( c( sum( rowSums(bm[,c(1,4)])==1 & rowSums(bm[,c(2,3,5,6)])==0 ),
                             sum( rowSums(bm[,c(2,5)])==1 & rowSums(bm[,c(1,3,4,6)])==0 ),
                             sum( rowSums(bm[,c(3,6)])==1 & rowSums(bm[,c(1,2,4,5)])==0 )) ),
      prom_prom = sum( rowSums(bm[,c(2,3,5,6)])==0 &  bm[,1]==1 & bm[,4]==1 ),
      enhancer_enhancer = sum( c( sum( rowSums(bm[,c(1,3,4,6)])==0 &  bm[,2]==1 & bm[,5]==1 ),
                                  sum( rowSums(bm[,c(1,2,4,5)])==0 &  bm[,3]==1 & bm[,6]==1 )) ),
      poised_PromEnh = sum( c( sum( rowSums(bm[,c(2,3,4,6)])==0 &  bm[,1]==1 & bm[,5]==1 ),
                               sum( rowSums(bm[,c(1,3,5,6)])==0 &  bm[,2]==1 & bm[,4]==1 )) ),
      active_PromEnh = sum( c( sum( rowSums(bm[,c(2,3,4,5)])==0 &  bm[,1]==1 & bm[,6]==1 ),
                               sum( rowSums(bm[,c(1,2,5,6)])==0 &  bm[,3]==1 & bm[,4]==1 )) ) )
m = c(m, ambig=length(leftGR)-sum(m))

Functional annotation of random intervals.

randomRangesLeft = gagr[ sample( seq(1,length(gagr)), length(leftGR)) ]
randomRangesRight = randomRangesLeft
end(randomRangesRight) = end(randomRangesRight) + sample( (10000 * (loops.ns$col-loops.ns$row)), length(leftGR), replace=T) + 15000
start(randomRangesRight) = end(randomRangesRight)-15000

ANNOT = data.frame( left_prom_1 = countOverlaps(randomRangesLeft, tss_present),
                    left_Penh_2 = countOverlaps(randomRangesLeft, nscEm),
                    left_Aenh_3 = countOverlaps(randomRangesLeft, nscEa),
                    right_prom_4 = countOverlaps(randomRangesRight, tss_present),
                    right_Penh_5 = countOverlaps(randomRangesRight, nscEm),
                    right_Aenh_6 = countOverlaps(randomRangesRight, nscEa))
BM = matrix(as.numeric(ANNOT>0),nrow=nrow(ANNOT), ncol=6 ) 

M = c(no_overlap = sum(rowSums(BM)==0),
      regElem_none = sum( c( sum( rowSums(BM[,c(1,4)])==1 & rowSums(BM[,c(2,3,5,6)])==0 ),
                             sum( rowSums(BM[,c(2,5)])==1 & rowSums(BM[,c(1,3,4,6)])==0 ),
                             sum( rowSums(BM[,c(3,6)])==1 & rowSums(BM[,c(1,2,4,5)])==0 )) ),
      prom_prom = sum( rowSums(BM[,c(2,3,5,6)])==0 &  BM[,1]==1 & BM[,4]==1 ),
      enhancer_enhancer = sum( c( sum( rowSums(BM[,c(1,3,4,6)])==0 &  BM[,2]==1 & BM[,5]==1 ),
                                  sum( rowSums(BM[,c(1,2,4,5)])==0 &  BM[,3]==1 & BM[,6]==1 )) ),
      poised_PromEnh = sum( c( sum( rowSums(BM[,c(2,3,4,6)])==0 &  BM[,1]==1 & BM[,5]==1 ),
                               sum( rowSums(BM[,c(1,3,5,6)])==0 &  BM[,2]==1 & BM[,4]==1 )) ),
      active_PromEnh = sum( c( sum( rowSums(BM[,c(2,3,4,5)])==0 &  BM[,1]==1 & BM[,6]==1 ),
                               sum( rowSums(BM[,c(1,2,5,6)])==0 &  BM[,3]==1 & BM[,4]==1 )) ) )
M = c(M, ambig=length(leftGR)-sum(M))

We have the two data frames. We compute the enrichment of features compared to random intervals.

m[order(m, decreasing=T)]/ M[order(m, decreasing=T)]

##        no_overlap      regElem_none             ambig enhancer_enhancer 
##         0.4948747         1.4288225         5.6265060         5.3333333 
##    poised_PromEnh    active_PromEnh         prom_prom 
##         7.3846154        10.2500000         8.0000000

We now generate the figure.

par(mar=c(10,5,2,2))
barplot( m[order(m, decreasing=T)], las=2, 
         col=c("red","gray","blue4","brown","brown","green4","black"))

Find if the direction of loop dynamics correlates with transcriptional dynamics. Are genes at loop anchors rather induced or repressed?

load( paste0( data_directory, "RNASeq_ESC_NSC.RData") )


PEI_gene = c( leftGR[ which(rowSums(bm[,c(2,3,4,5)])==0 &  bm[,1]==1 & bm[,6]==1 ) ],
              rightGR[ which(rowSums(bm[,c(1,2,5,6)])==0 &  bm[,3]==1 & bm[,4]==1 ) ] )
genes_for_PEI = unique( names( tsgr[queryHits(findOverlaps(tsgr,PEI_gene))] ) )


par(mar=c(10,5,2,2))
barplot( c(sum( res[res$id %in% genes_for_PEI,]$log2FoldChange>=log2(1.5)),
           sum( res[res$id %in% genes_for_PEI,]$log2FoldChange<log2(1/1.5) )), 
         names=c('up','down'), col=c("green","black"),
         las=2 )

Differentially expressed genes at loop anchors.

all_genes = unique(names(tsgr[queryHits(findOverlaps(tsgr,c(leftGR,rightGR)))]))

il_treg = cbind(countOverlaps(leftGR,tsgr[ which(names(tsgr) %in% res[res$log2FoldChange>log2(1.5) & res$padj<0.1,]$id) ]),
                countOverlaps(rightGR,tsgr[ which(names(tsgr) %in% res[res$log2FoldChange>log2(1.5) & res$padj<0.1,]$id) ]) )
sum(rowSums(il_treg)>0)

## [1] 552

Loop- and compartment-domains

Define compartment and loop domains and check gene expression, enhancer landscape and cohesin and CTCF binding in ES and NS cells.

is_loop_domain = function(domains,loops, offset){
  getGR = function(coords,offset){
  return(GRanges(seqnames = Rle(coords[,1]),
                 ranges = IRanges(start = as.numeric(coords[,2]-offset),
                                  end = as.numeric(coords[,3])+offset),
                 strand = Rle(rep("*", nrow(coords)))) ) }
  
  # domains = es_domains; loops = es_loops_is; offset=20000
  left_dom = getGR( data.frame( chr=domains[,1], start=domains[,2], end=domains[,2], stringsAsFactors=F ),offset)
  right_dom = getGR( data.frame( chr=domains[,1], start=domains[,3], end=domains[,3], stringsAsFactors=F ),offset)
  left_loop = getGR( data.frame( chr=loops[,1], start=loops[,2], end=loops[,3], stringsAsFactors=F ),offset)
  right_loop = getGR( data.frame( chr=loops[,4], start=loops[,5], end=loops[,6], stringsAsFactors=F ),offset)

  ## domain boundaries to loop anchors
  left = as.data.frame( findOverlaps( left_dom, left_loop ) )
  right = as.data.frame( findOverlaps( right_dom, right_loop ) )
  
  ## loop domains
  lr = merge.data.frame( left, right, by.x='queryHits',by.y='queryHits')
  loop_domains = domains[ unique(lr[ lr$subjectHits.x==lr$subjectHits.y, 'queryHits']), ]
  
  ## domains not enclosed by a loop
  not_loop_domains = domains[ -unique(lr[ lr$subjectHits.x==lr$subjectHits.y, 'queryHits']), ]

  # domains without loops inside
  domainGR = getGR( data.frame( chr=not_loop_domains[,1], 
                                start=not_loop_domains[,2], 
                                end=not_loop_domains[,3], stringsAsFactors=F ),offset)
  domainsWithoutLoops = not_loop_domains[ -unique(queryHits(findOverlaps(domainGR, c(left_loop,right_loop)))), ]
  # domains that are not loops domains but contain loop(s) inside
  domainsWithLoops = not_loop_domains[ unique(queryHits(findOverlaps(domainGR, c(left_loop,right_loop)))), ]
  
  return( list(loop_domains = loop_domains,
               domainsWithoutLoops = domainsWithoutLoops,
               domainsWithLoops = domainsWithLoops ))
}

es_is_loop_domain = is_loop_domain(es_domains, es_loops_is, offset=20000 )
ns_is_loop_domain = is_loop_domain(ns_domains, ns_loops_is, offset=20000 )

save( es_is_loop_domain, ns_is_loop_domain, file=paste0(data_directory,'loop_compartment_domains.RData'))

m=rbind( unlist(lapply(es_is_loop_domain,nrow)), unlist(lapply(ns_is_loop_domain,nrow)) )


barplot( (m), beside=T, 
         col=c(myEScolor,myNScolor), ylab='Number',
         names=c('Loop','Ordinary','With loops') )
legend('topleft',c('ES','NS'),col=c(myEScolor,myNScolor),bty='n',pch=15,cex=1.4)

Loop induction and intra-domain promoter-enhancer interactions

Intra domain contacts for loop and ordinary domains.

load(paste0(data_directory,'InSitu_ES_pooled_5kb_my_object.RData'))
load(paste0(data_directory,'InSitu_NS_pooled_5kb_my_object.RData'))
seqlevelsStyle(gagr) = 'ucsc'
load( paste0(data_directory,'loop_compartment_domains.RData') )

## define functions

getGR=function(df,offset){
  return(GRanges(seqnames = Rle(df[,1]),
                 ranges = IRanges(start = as.numeric(df[,2])-offset,
                                  end = as.numeric(df[,3])+offset),
                 strand = Rle(rep("*", nrow(df))))) }


removeLoopArea = function(gr,s){
  gr = gr[ which(width(gr)>2*s) ]
  start(gr)=start(gr)+s
  end(gr)=end(gr)-s
  return(gr)
}


domainIDS = function( domanno, D, gagr ){
  ## D controls the number of off diagonals to be removed
  res = do.call('rbind', lapply( as.list( seq(1,length(domanno)) ), function(x){
    thisDomID = x
    cordinat = queryHits(findOverlaps(gagr,domanno[x]))
    return( cbind( row = rep(cordinat, length((cordinat))),
                   col = rep(cordinat, each=length((cordinat))),
                   id = rep(x, length(cordinat)^2 ) ) ) 
  } ))
  res = res[res[,2]-res[,1]>D, ]
  return(res)
}


## identify compartment and loop domains

comp_domains_es = getGR(es_is_loop_domain[['domainsWithoutLoops']],0)
loop_domains_es = getGR(es_is_loop_domain[['loop_domains']],0)

comp_domains_ns = getGR(ns_is_loop_domain[['domainsWithoutLoops']],0)
loop_domains_ns = getGR(ns_is_loop_domain[['loop_domains']],0)


comp_domains_es_coords = domainIDS( comp_domains_es,D=4,gagr )
loop_domains_es_coords = domainIDS( loop_domains_es,D=4,gagr )

comp_domains_ns_coords = domainIDS( comp_domains_ns,D=4,gagr )
loop_domains_ns_coords = domainIDS( loop_domains_ns,D=4,gagr )



comp_domains_es_signal = data.frame( did = comp_domains_es_coords[,3],
                                     es = esins[[2]][ comp_domains_es_coords[,1:2] ], stringsAsFactors=FALSE )

loop_domains_es_signal = data.frame( did = loop_domains_es_coords[,3],
                                     es = esins[[2]][ loop_domains_es_coords[,1:2] ], stringsAsFactors=FALSE )


comp_domains_ns_signal = data.frame( did = comp_domains_ns_coords[,3],
                                     ns = nsins[[2]][ comp_domains_ns_coords[,1:2] ], stringsAsFactors=FALSE )

loop_domains_ns_signal = data.frame( did = loop_domains_ns_coords[,3],
                                     ns = esins[[2]][ loop_domains_ns_coords[,1:2] ], stringsAsFactors=FALSE )



save( comp_domains_es_signal, loop_domains_es_signal, loop_domains_ns_signal, comp_domains_ns_signal, 
      file=paste0(data_directory,'ordinary_versus_loop_domains_interactions.RData') )

We see that on average, the interactions within loop domains are stronger than in ordinary domains.

load(paste0(data_directory,'ordinary_versus_loop_domains_interactions.RData'))


t.test( unlist( lapply(split(comp_domains_es_signal$es, comp_domains_es_signal$did),
                       function(x){mean(x,na.rm=T)})),
        unlist( lapply(split(loop_domains_es_signal$es, loop_domains_es_signal$did),
                       function(x){mean(x,na.rm=T)})))

## 
##  Welch Two Sample t-test
## 
## data:  unlist(lapply(split(comp_domains_es_signal$es, comp_domains_es_signal$did),  and unlist(lapply(split(loop_domains_es_signal$es, loop_domains_es_signal$did),     function(x) { and     function(x) {        mean(x, na.rm = T) and         mean(x, na.rm = T)    })) and     }))
## t = -19.992, df = 903.71, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0007894862 -0.0006483345
## sample estimates:
##   mean of x   mean of y 
## 0.002400368 0.003119278

t.test( unlist( lapply(split(loop_domains_ns_signal$ns, loop_domains_ns_signal$did),
                       function(x){mean(x,na.rm=T)})),
        unlist( lapply(split(comp_domains_ns_signal$ns, comp_domains_ns_signal$did),
                       function(x){mean(x,na.rm=T)})))

## 
##  Welch Two Sample t-test
## 
## data:  unlist(lapply(split(loop_domains_ns_signal$ns, loop_domains_ns_signal$did),  and unlist(lapply(split(comp_domains_ns_signal$ns, comp_domains_ns_signal$did),     function(x) { and     function(x) {        mean(x, na.rm = T) and         mean(x, na.rm = T)    })) and     }))
## t = 29.42, df = 1374.8, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0009907188 0.0011322799
## sample estimates:
##   mean of x   mean of y 
## 0.002784972 0.001723472

par(mfrow=c(1,2), mar=c(6,6,3,1), cex.lab=1.5, cex.axis=1.5, cex.main=1.4 )
plot(density(unlist( lapply(split(comp_domains_es_signal$es, comp_domains_es_signal$did),
                       function(x){mean(x,na.rm=T)}))), 
     ty='l', col='black', main='ES', lwd=3,
     xlab='Average interactions',
     bty='n', ylim=c(0,800), xlim=c(0,0.008))
lines( density(unlist( lapply(split(loop_domains_es_signal$es, loop_domains_es_signal$did),
                       function(x){mean(x,na.rm=T)}))), lwd=3, col='red' )

plot(density(unlist( lapply(split(comp_domains_ns_signal$ns, comp_domains_ns_signal$did),
                       function(x){mean(x,na.rm=T)}))), 
     ty='l', col='black', main='NS', lwd=3,
     xlab='Average interactions',
     bty='n', ylim=c(0,800), xlim=c(0,0.008) )
lines( density(unlist( lapply(split(loop_domains_ns_signal$ns, loop_domains_ns_signal$did),
                       function(x){mean(x,na.rm=T)}))), lwd=3, col='red' )
legend( y=700, x=0.004, c('Ordinary','Loop'), 
        col=c('black','red'),lwd=3,bty='n',cex=1.5)

Prepare the objects to test if loop induction is related to the dynamics of promoter-enhancer contacts. First, for each domain find the promoter-enhancer contacts.

load( paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision') )

getGR = function(coords,ga){

  return(GRanges(seqnames = Rle(ga[coords[,1],'chr']),
                 ranges = IRanges(start = as.numeric(ga[coords[,1],'start']),
                                  end = as.numeric(ga[coords[,2],'end'])),
                 strand = Rle(rep("*", nrow(coords)))) )


}


inducedLoops = getGR(loops.ns[,c('left_start','right_end')],ga)
repressedLoops = getGR(loops.es[,c('left_start','right_end')],ga)
COM = di[ abs(di$log2FoldChange)<log2(1.25), ]
commonLoops = getGR(COM[,c('left_start','right_end')],ga)

commonLoopsF = commonLoops[ -queryHits(findOverlaps(commonLoops,c(repressedLoops,inducedLoops))) ]
inducedF = inducedLoops[ -queryHits(findOverlaps(inducedLoops,c(repressedLoops,commonLoops)))]
repressedF = repressedLoops[ -queryHits(findOverlaps(repressedLoops,c(inducedLoops,commonLoops)))]

# remove the 20kb comming from loops
removeLoopArea = function(gr,s){
  gr = gr[ which(width(gr)>2*s) ]
  start(gr)=start(gr)+s
  end(gr)=end(gr)-s
  return(gr)
}

induced_prep = removeLoopArea( inducedF, 20000 )
reduced_prep = removeLoopArea( repressedF, 20000 )


domainIDS = function( domanno, D, gagr ){
  ## D controls the number of off diagonals to be removed
  res = do.call('rbind', lapply( as.list( seq(1,length(domanno)) ), function(x){
    thisDomID = x
    cordinat = queryHits(findOverlaps(gagr,domanno[x]))
    return( cbind( row = rep(cordinat, length((cordinat))),
                   col = rep(cordinat, each=length((cordinat))),
                   id = rep(x, length(cordinat)^2 ) ) ) 
  } ))
  res = res[res[,2]-res[,1]>D, ]
  return(res)
}

induced_domain_coords = domainIDS( induced_prep,D=4,gagr )
reduced_domain_coords = domainIDS( reduced_prep,D=4,gagr )



induced_domain_signal = data.frame( did = induced_domain_coords[,3],
                                    es = esins[[2]][ induced_domain_coords[,1:2] ],
                                    ns = nsins[[2]][ induced_domain_coords[,1:2] ], stringsAsFactors=FALSE )

reduced_domain_signal = data.frame( did = reduced_domain_coords[,3],
                                    es = esins[[2]][ reduced_domain_coords[,1:2] ],
                                    ns = nsins[[2]][ reduced_domain_coords[,1:2] ], stringsAsFactors=FALSE )

induced_domain_signal = cbind(induced_domain_signal,induced_domain_coords[,1:2])
reduced_domain_signal = cbind(reduced_domain_signal,reduced_domain_coords[,1:2])


## consider promoter-enhancer contacts only 
filter_for_PEI = function( domain_coords, gagr, proms, enhancers ){
 
  seqlevelsStyle(proms) = 'ucsc'
  seqlevelsStyle(enhancers) = 'ucsc'
  promoter_bins = subjectHits(findOverlaps(proms, gagr))
  enhancer_bins = subjectHits(findOverlaps(enhancers, gagr))

  pei = domain_coords[,'row'] %in% promoter_bins & domain_coords[,'col'] %in% enhancer_bins
  pie = domain_coords[,'row'] %in% enhancer_bins & domain_coords[,'col'] %in% promoter_bins
  
  return( rbind(domain_coords[pei,], domain_coords[pie,]) )
}


induced_domain_signal_PEI_ratio = do.call('c',lapply(split(induced_domain_signal,induced_domain_signal[,1]), function(x){
  if(nrow(x)>4){
    x = cbind(x, induced_domain_coords[induced_domain_coords[,'id']==unique(x[,'did']), ])
    x = filter_for_PEI( x, gagr, active_promoters_ns, nscEa )
    if(nrow(x)>0) {
      x = x[rowSums(x[,2:3])>0,]
      return( mean( log2( (0.000001+x[,3]) / (0.000001+x[,2]) ), na.rm=TRUE )) } } } ))

reduced_domain_signal_PEI_ratio = do.call('c',lapply(split(reduced_domain_signal,reduced_domain_signal[,1]), function(x){
  if(nrow(x)>4){
    x = cbind(x, reduced_domain_coords[reduced_domain_coords[,'id']==unique(x[,'did']), ])
    x = filter_for_PEI( x, gagr, active_promoters_es, escEa )
    if(nrow(x)>0) {
      x = x[rowSums(x[,2:3])>0,]
      return( mean( log2( (0.000001+x[,3]) / (0.000001+x[,2]) ), na.rm=TRUE )) } } } ))

save( induced_domain_signal_PEI_ratio, reduced_domain_signal_PEI_ratio, 
      file=paste0(data_directory,'PEI_and_loop_induction_intraDomain.RData') )

Plot the change of Hi-C signal at promoter-enhancer pairs in induced and reduced loop domains.

load( paste0(data_directory,'PEI_and_loop_induction_intraDomain.RData') )

par(mfrow=c(1,1), mar=c(8,6,3,6), 
    cex.lab=1.5, cex.axis=1.5, cex.main=1.4, bty='n' )
boxplot( reduced_domain_signal_PEI_ratio, 
         induced_domain_signal_PEI_ratio,
         border=c(myEScolor, myNScolor), 
         main='', lwd=3, las=2,
         ylab=expression('NS/ES [log'[2]*']'),
         names=c('Reduced','Induced'), 
         outline=FALSE, ylim=c(-1.5,1.5) )

Loops and chromatin compartments

We will use the PCA approach to call compartments. We will perform a high resolution analysis (50kb) of the in-situ data and low resolution for ES (2i/LIF) and EpiSC. First, we call compartments in the in-situ data. First, we import the matrices at the resolution of 50kb. We will consider only intrachromosomal matrices

## function to read matrices dumped from .hic files
read.hic_files_intra = function( directory, prefix, suffix, binAnnotation, chroms ){
  
  chrlist = as.list(chroms)
  names(chrlist)= chroms
  
  # prefix="ES1_10kb/ES1_30_"; suffix=".txt"; directory=dumped_dir; binAnnotation=ga
  lapply( chrlist, function(chromosome){
    df = read.delim( paste0(directory, prefix, chromosome, suffix ), as.is=T, header=F )
    binAnnotation = binAnnotation[binAnnotation$chr==chromosome,]
    print(chromosome)
    return( transformToSparseMatrix(df, 
                                    row.annotation = binAnnotation,
                                    col.annotation = binAnnotation ) )
  } )
}

es_lfm_50kb = read.hic_files_intra( data_directory, "res_50kb/ES_inter30_",".txt", ga50, chroms )
ns_lfm_50kb = read.hic_files_intra( data_directory, "res_50kb/NS_inter30_",".txt", ga50, chroms )

## upper.tri is stored
test=as.matrix(ns_lfm_50kb[[19]])
sum(test[lower.tri(test)])
sum(test[upper.tri(test)])

es_ipf_50kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga50[ga50$chr==chr,]
  m = es_lfm_50kb[[chr]]
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names( es_ipf_50kb_intra ) = chroms[1:20]

ns_ipf_50kb_intra = lapply( as.list(chroms[1:20]), function(chr){ 
  this_ga = ga50[ga50$chr==chr,]
  m = ns_lfm_50kb[[chr]]
  return( IPF(m, ann=this_ga, numberOfIterations=itn, chroms=chroms)) } )
names( ns_ipf_50kb_intra ) = chroms[1:20]

save( es_ipf_50kb_intra, 
      ns_ipf_50kb_intra, file= paste0(data_directory,'intrachromosomal_50kb_IPF.RData') )

Now, normalize the data.

load(paste0(data_directory,'intrachromosomal_50kb_IPF.RData'))
es_50kb_ipf_list = lapply( es_ipf_50kb_intra, function(x){x[[2]]} )
ns_50kb_ipf_list = lapply( ns_ipf_50kb_intra, function(x){x[[2]]} )

Call compartments

getGR = function( b, offset ){
  GRanges(seqnames = Rle(b$chr),
          ranges = IRanges(as.numeric(b$start)-offset,
                           end = as.numeric(b$end)+offset,
                           names = seq(1, nrow(b))),
          strand = Rle(rep("*", nrow(b))))
}

mode = function( values, nbins ){
  nm = log2(values) %/% 0.1
  nmb = as.data.frame(plyr::count(nm), stringsAsFactors = FALSE)
  Mode = nmb[order(nmb$freq, decreasing=TRUE),"x"][1] * 0.1  ## mode
  return(Mode)
}
      
OE_on_mode = function(vals, distan){
  
  myV = split( vals, as.numeric(distan) )
  stopifnot( all( diff( as.numeric(names(myV))) > 0 ) )
  distances =  as.numeric(names(myV))
  return( lapply( myV, mode ) )
  
}

COR_on_HIC = function( mat ){
  print('New')
  chr = as.data.frame( summary(mat), stringsAsFactors=F )
  chr = chr[chr$i<chr$j,]
  chr = chr[!is.na(chr$x),]
  myV = split( chr$x, chr$j-chr$i )
  modes = unlist(lapply(myV, mode))
  smoothingSplineMODE = smooth.spline(x=log2( seq(1,length(myV)) ), y=modes, spar=1)
  splinesMODE = predict( smoothingSplineMODE, log2(seq(1,length(myV))) )
  chr$expected = 2^splinesMODE$y[chr$j-chr$i]
  chr$oe=(chr$x/chr$expected)
  chr$oe[is.na(chr$oe)]=1
  M = sparseMatrix( i=chr$i, 
                    j=chr$j, 
                    x=chr$oe,
                    dims=rep(nrow(mat),2) )
  M = M + t(M)
  diag(M) = diag(M)/2
  # M = as.matrix(M)
  # M[ is.na(M) ] = 1
  M = as.matrix(M)
  return( cor(M) ) #corSparse(M) )
}

Get the correlation matrices

cor.e = lapply( es_50kb_ipf_list[1:20], COR_on_HIC )
cor.n = lapply( ns_50kb_ipf_list[1:20], COR_on_HIC )

save(cor.e, cor.e, 
     file=paste0(data_directory, 'cor_matrices_50kb.RData') )

eig.e = lapply( cor.e, function(x){x[is.na(x)]=0; eigen((x)) } )
eig.n = lapply( cor.n, function(x){x[is.na(x)]=0; eigen((x)) } )

save(eig.e, eig.n, 
     file=paste0(data_directory, 'cor_matrices_50kb_PCA_IS.RData') )

Now find the gene expression level of genes in each bin

## gene anno
gtf = import( paste0(data_directory, 'Mus_musculus.NCBIM37.67.gtf' ) )
seqlevelsStyle(gtf) = 'ucsc'

gb = data.frame( chr = chrom(gtf), start = start(gtf), end=end(gtf),
                 gid = elementMetadata(gtf)[,5], stringsAsFactors=F )
gb = gb[gb$chr %in% paste0('chr',c(1:19,'X')), ]
gbs = split(gb, gb$gid)
gbd = do.call('rbind',lapply(gbs, function(x){data.frame(chr=unique(x$chr),
                                                         start = min(x$start),
                                                         end = max(x$end),
                                                         gid = unique(x$gid),
                                                         stringsAsFactors=F)}))

## rna expression
fd = data.frame( e1 = read.delim( paste0(data_directory, 'ESC1_counts_reverse.txt'), as.is=T, header=F)[,2] ,
                 e2 = read.delim( paste0(data_directory, 'ESC2_counts_reverse.txt'), as.is=T, header=F )[,2],
                 n1 = read.delim( paste0(data_directory, 'NSC1_counts_reverse.txt'), as.is=T, header=F )[,2],
                 n2 = read.delim( paste0(data_directory, 'NSC2_counts_reverse.txt'), as.is=T, header=F )[,2],
                 geneID = read.delim( paste0(data_directory, 'NSC2_counts_reverse.txt'), as.is=T, header=F )[,1], 
                 stringsAsFactors=FALSE )

rna_es = rowSums(fd[,1:2])
rna_ns = rowSums(fd[,3:4])

names(rna_es) = fd[,5]
names(rna_ns) = fd[,5]

gbd$expr_es = rna_es[ match(gbd$gid,names(rna_es))]
gbd$expr_ns = rna_ns[ match(gbd$gid,names(rna_ns))]

geneR = getGR( gbd[,1:3],0 )
geneR$rna_es = gbd$expr_es 
geneR$rna_ns = gbd$expr_ns 


##############################
rnaL = elementMetadata(geneR)[,1]
expr4binES = as.data.frame( findOverlaps(gagr50,geneR) )
expr4binES = split( rnaL[as.numeric(expr4binES[,2])],expr4binES[,1] )
expr4binES = unlist(lapply(expr4binES,sum))
x = rep(0,length(gagr50))
x[as.numeric(names(expr4binES))] = expr4binES
expr4binES = x
save( expr4binES, file=paste0(data_directory,'expr4binES.RData'))


rnaL = elementMetadata(geneR)[,2]
expr4binNS = as.data.frame( findOverlaps(gagr50,geneR) )
expr4binNS = split( rnaL[as.numeric(expr4binNS[,2])],expr4binNS[,1] )
expr4binNS = unlist(lapply(expr4binNS,sum))
x = rep(0,length(gagr50))
x[as.numeric(names(expr4binNS))] = expr4binNS
expr4binNS = x
save( expr4binNS, file=paste0(data_directory,'expr4binNS.RData'))

For each chromosome, I checked with eigen vector describes best its plaid pattern. This information will now be read.

load( paste0(data_directory,'expr4binES.RData') )
load( paste0(data_directory,'expr4binNS.RData') )
load( paste0(data_directory, 'cor_matrices_50kb_PCA_IS.RData') )

get_eigen_track = function(eigens, which.eigen, RNA, ga) { 
  # eigens=eig.e; which.eigen=1; RNA=expr4binES; ga=ga50
  do.call('c',lapply(as.list(paste0('chr',c(1:19,'X'))), function(chr){
  ids = which(ga$chr==chr)
  thischrExpr = RNA[ ids ]
  thisEigen = eigens[ chr ][[1]]$vectors[,which.eigen]
  Pos = which( thisEigen > 0 )
  Neg = which( thisEigen < 0 )
  if( mean(thischrExpr[Pos],trim=0.1) < mean(thischrExpr[Neg],trim=0.1) ){
    thisEigen[Pos] =-1*thisEigen[Pos] 
    thisEigen[Neg] =-1*thisEigen[Neg] 
  }
  return( thisEigen )
}))
  
}


## check the annotations
comp_track_es_eigen1 = get_eigen_track( eig.e, 1, expr4binES, ga50 )
comp_track_ns_eigen1 = get_eigen_track( eig.n, 1, expr4binNS, ga50 )
  
comp_track_es_eigen2 = get_eigen_track( eig.e, 2, expr4binES, ga50 )
comp_track_ns_eigen2 = get_eigen_track( eig.n, 2, expr4binNS, ga50 )
  
comp_track_es_eigen3 = get_eigen_track( eig.e, 3, expr4binES, ga50 )
comp_track_ns_eigen3 = get_eigen_track( eig.n, 3, expr4binNS, ga50 )

save( comp_track_es_eigen1, comp_track_es_eigen2, comp_track_es_eigen3,
      comp_track_ns_eigen1, comp_track_ns_eigen2, comp_track_ns_eigen3,
      file=paste0(data_directory,'Compartment_tracks_50kb.RData' ) )


targetDir = data_directory

gagr50$score = c(comp_track_es_eigen1,rep(0,sum(chrom(gagr50)=='chrY')))
end(gagr50) = end(gagr50)-1
seqlengths(gagr50) = seqlengths(si)[paste0('chr',c(1:19,'X','Y'))]
trim(gagr50)

## GRanges object with 53106 ranges and 1 metadata column:
##         seqnames               ranges strand   |     score
##            <Rle>            <IRanges>  <Rle>   | <numeric>
##       1     chr1     [     1,  49999]      *   |         0
##       2     chr1     [ 50001,  99999]      *   |         0
##       3     chr1     [100001, 149999]      *   |         0
##       4     chr1     [150001, 199999]      *   |         0
##       5     chr1     [200001, 249999]      *   |         0
##     ...      ...                  ...    ... ...       ...
##   53102     chrY [15700001, 15749999]      *   |         0
##   53103     chrY [15750001, 15799999]      *   |         0
##   53104     chrY [15800001, 15849999]      *   |         0
##   53105     chrY [15850001, 15899999]      *   |         0
##   53106     chrY [15900001, 15902555]      *   |         0
##   -------
##   seqinfo: 21 sequences from an unspecified genome

export.bw(gagr50,paste0(targetDir,'ES_comp1_50kb.bw') )

gagr50$score = c(comp_track_es_eigen2,rep(0,sum(chrom(gagr50)=='chrY')))
end(gagr50) = end(gagr50)-1
seqlengths(gagr50) = seqlengths(si)[paste0('chr',c(1:19,'X','Y'))]
trim(gagr50)

## GRanges object with 53106 ranges and 1 metadata column:
##         seqnames               ranges strand   |     score
##            <Rle>            <IRanges>  <Rle>   | <numeric>
##       1     chr1     [     1,  49998]      *   |         0
##       2     chr1     [ 50001,  99998]      *   |         0
##       3     chr1     [100001, 149998]      *   |         0
##       4     chr1     [150001, 199998]      *   |         0
##       5     chr1     [200001, 249998]      *   |         0
##     ...      ...                  ...    ... ...       ...
##   53102     chrY [15700001, 15749998]      *   |         0
##   53103     chrY [15750001, 15799998]      *   |         0
##   53104     chrY [15800001, 15849998]      *   |         0
##   53105     chrY [15850001, 15899998]      *   |         0
##   53106     chrY [15900001, 15902555]      *   |         0
##   -------
##   seqinfo: 21 sequences from an unspecified genome

export.bw(gagr50,paste0(targetDir,'ES_comp2_50kb.bw') )


gagr50$score = c(comp_track_es_eigen3,rep(0,sum(chrom(gagr50)=='chrY')))
end(gagr50) = end(gagr50)-1
seqlengths(gagr50) = seqlengths(si)[paste0('chr',c(1:19,'X','Y'))]
trim(gagr50)

## GRanges object with 53106 ranges and 1 metadata column:
##         seqnames               ranges strand   |     score
##            <Rle>            <IRanges>  <Rle>   | <numeric>
##       1     chr1     [     1,  49997]      *   |         0
##       2     chr1     [ 50001,  99997]      *   |         0
##       3     chr1     [100001, 149997]      *   |         0
##       4     chr1     [150001, 199997]      *   |         0
##       5     chr1     [200001, 249997]      *   |         0
##     ...      ...                  ...    ... ...       ...
##   53102     chrY [15700001, 15749997]      *   |         0
##   53103     chrY [15750001, 15799997]      *   |         0
##   53104     chrY [15800001, 15849997]      *   |         0
##   53105     chrY [15850001, 15899997]      *   |         0
##   53106     chrY [15900001, 15902555]      *   |         0
##   -------
##   seqinfo: 21 sequences from an unspecified genome

export.bw(gagr50,paste0(targetDir,'ES_comp3_50kb.bw') )


gagr50$score = c(comp_track_ns_eigen1,rep(0,sum(chrom(gagr50)=='chrY')))
end(gagr50) = end(gagr50)-1
seqlengths(gagr50) = seqlengths(si)[paste0('chr',c(1:19,'X','Y'))]
trim(gagr50)

## GRanges object with 53106 ranges and 1 metadata column:
##         seqnames               ranges strand   |     score
##            <Rle>            <IRanges>  <Rle>   | <numeric>
##       1     chr1     [     1,  49996]      *   |         0
##       2     chr1     [ 50001,  99996]      *   |         0
##       3     chr1     [100001, 149996]      *   |         0
##       4     chr1     [150001, 199996]      *   |         0
##       5     chr1     [200001, 249996]      *   |         0
##     ...      ...                  ...    ... ...       ...
##   53102     chrY [15700001, 15749996]      *   |         0
##   53103     chrY [15750001, 15799996]      *   |         0
##   53104     chrY [15800001, 15849996]      *   |         0
##   53105     chrY [15850001, 15899996]      *   |         0
##   53106     chrY [15900001, 15902555]      *   |         0
##   -------
##   seqinfo: 21 sequences from an unspecified genome

export.bw(gagr50,paste0(targetDir,'NS_comp1_50kb.bw') )


gagr50$score = c(comp_track_ns_eigen2,rep(0,sum(chrom(gagr50)=='chrY')))
end(gagr50) = end(gagr50)-1
seqlengths(gagr50) = seqlengths(si)[paste0('chr',c(1:19,'X','Y'))]
trim(gagr50)

## GRanges object with 53106 ranges and 1 metadata column:
##         seqnames               ranges strand   |     score
##            <Rle>            <IRanges>  <Rle>   | <numeric>
##       1     chr1     [     1,  49995]      *   |         0
##       2     chr1     [ 50001,  99995]      *   |         0
##       3     chr1     [100001, 149995]      *   |         0
##       4     chr1     [150001, 199995]      *   |         0
##       5     chr1     [200001, 249995]      *   |         0
##     ...      ...                  ...    ... ...       ...
##   53102     chrY [15700001, 15749995]      *   |         0
##   53103     chrY [15750001, 15799995]      *   |         0
##   53104     chrY [15800001, 15849995]      *   |         0
##   53105     chrY [15850001, 15899995]      *   |         0
##   53106     chrY [15900001, 15902555]      *   |         0
##   -------
##   seqinfo: 21 sequences from an unspecified genome

export.bw(gagr50,paste0(targetDir,'NS_comp2_50kb.bw') )


gagr50$score = c(comp_track_ns_eigen3,rep(0,sum(chrom(gagr50)=='chrY')))
end(gagr50) = end(gagr50)-1
seqlengths(gagr50) = seqlengths(si)[paste0('chr',c(1:19,'X','Y'))]
trim(gagr50)

## GRanges object with 53106 ranges and 1 metadata column:
##         seqnames               ranges strand   |     score
##            <Rle>            <IRanges>  <Rle>   | <numeric>
##       1     chr1     [     1,  49994]      *   |         0
##       2     chr1     [ 50001,  99994]      *   |         0
##       3     chr1     [100001, 149994]      *   |         0
##       4     chr1     [150001, 199994]      *   |         0
##       5     chr1     [200001, 249994]      *   |         0
##     ...      ...                  ...    ... ...       ...
##   53102     chrY [15700001, 15749994]      *   |         0
##   53103     chrY [15750001, 15799994]      *   |         0
##   53104     chrY [15800001, 15849994]      *   |         0
##   53105     chrY [15850001, 15899994]      *   |         0
##   53106     chrY [15900001, 15902555]      *   |         0
##   -------
##   seqinfo: 21 sequences from an unspecified genome

export.bw(gagr50,paste0(targetDir,'NS_comp3_50kb.bw') )

Loop induction compartments and H3K4me1

load( paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision') )

getGR = function(coords,ga){

  return(GRanges(seqnames = Rle(ga[coords[,1],'chr']),
                 ranges = IRanges(start = as.numeric(ga[coords[,1],'start']),
                                  end = as.numeric(ga[coords[,2],'end'])),
                 strand = Rle(rep("*", nrow(coords)))) )


}

inducedLoops = getGR(loops.ns[,c('left_start','right_end')],ga)
repressedLoops = getGR(loops.es[,c('left_start','right_end')],ga)

load(paste0(data_directory,'loop_compartment_domains.RData'))

getGR = function( b, offset ){
  GRanges(seqnames = Rle(b[,1]),
          ranges = IRanges(as.numeric(b[,2])-offset,
                           end = as.numeric(b[,3])+offset,
                           names = seq(1, nrow(b))),
          strand = Rle(rep("*", nrow(b))))}


loop_domains_es_gr = getGR( es_is_loop_domain$loop_domains[,c(1,2,6)],0 )
loop_domains_ns_gr = getGR( ns_is_loop_domain$loop_domains[,c(1,2,6)],0 )

comp_domains_es_gr = getGR( es_is_loop_domain$domainsWithoutLoops[,c(1,2,6)],0 )
comp_domains_ns_gr = getGR( ns_is_loop_domain$domainsWithoutLoops[,c(1,2,6)],0 )

## compartment domains - remove large domains to be able to compare
comp_domains_es_gr = comp_domains_es_gr[ which(width(comp_domains_es_gr)<1000000)]
comp_domains_ns_gr = comp_domains_ns_gr[ which(width(comp_domains_ns_gr)<1000000)]


## results of HMM
getFA = function(NAME, dir){

    do.call( 'rbind', lapply(as.list(paste0('chr', c(1:19, 'X'))),
            function(x){ read.delim( paste0( dir, NAME, "_", x, "_binary.txt" ),
                          as.is=TRUE, skip=1 ) } ) )

}
prepareGR = function(si, modif, binsize){

    chr=c()
    starts = c()
    id = 0
    for( i in paste0("chr", c(1:19, 'X')) )
        { chrl = as.numeric(as.data.frame( si[ i ] )[1])
          chrom = i
          tpstarts = seq(1, chrl, by=binsize )

          hmb = length(tpstarts) 

          tpchr = rep( chrom, hmb )[ modif[ (id+1):(id+hmb-1) ]!=0 ]
          tpstarts = tpstarts[ modif[ (id+1):(id+hmb-1) ]!=0 ]
          chr = c(chr, tpchr)
          starts = c(starts, tpstarts)
          id = hmb + id - 1

         }

    GR = GRanges(seqnames = Rle(chr), 
                  ranges = IRanges(start = as.numeric(starts), 
                                   width = binsize),
                  strand = Rle(rep("*", length(chr))))
    return(GR)

}

es_hmm = getFA( "ES", dir=paste0(data_directory,'chromHMM/ES/')  )
ns_hmm = getFA( "NS", dir=paste0(data_directory,'chromHMM/NS/') )

## H3K4me1
es_h3k4me1.loop.domains = countOverlaps( loop_domains_es_gr, prepareGR( si, es_hmm[,'H3K4me1'], 200 ) )
ns_h3k4me1.loop.domains = countOverlaps( loop_domains_ns_gr, prepareGR( si, ns_hmm[,'H3K4me1'], 200 ) )
es_h3k4me1.comp.domains = countOverlaps( comp_domains_es_gr, prepareGR( si, es_hmm[,'H3K4me1'], 200 ) )
ns_h3k4me1.comp.domains = countOverlaps( comp_domains_ns_gr, prepareGR( si, ns_hmm[,'H3K4me1'], 200 ) )


## loops are more active than compartment domains
t.test( es_h3k4me1.loop.domains/width(loop_domains_es_gr), 
        es_h3k4me1.comp.domains/width(comp_domains_es_gr) )

## 
##  Welch Two Sample t-test
## 
## data:  es_h3k4me1.loop.domains/width(loop_domains_es_gr) and es_h3k4me1.comp.domains/width(comp_domains_es_gr)
## t = 8.8173, df = 721.73, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0001324551 0.0002083348
## sample estimates:
##    mean of x    mean of y 
## 0.0003298155 0.0001594206

t.test( ns_h3k4me1.loop.domains/width(loop_domains_ns_gr), 
        ns_h3k4me1.comp.domains/width(comp_domains_ns_gr) )

## 
##  Welch Two Sample t-test
## 
## data:  ns_h3k4me1.loop.domains/width(loop_domains_ns_gr) and ns_h3k4me1.comp.domains/width(comp_domains_ns_gr)
## t = 16.933, df = 1217.3, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.00008895132 0.00011226503
## sample estimates:
##     mean of x     mean of y 
## 0.00016789933 0.00006729116

es_h3k4me1.loop.domains = countOverlaps( inducedLoops, prepareGR( si, es_hmm[,'H3K4me1'], 200 ) )
ns_h3k4me1.loop.domains = countOverlaps( inducedLoops, prepareGR( si, ns_hmm[,'H3K4me1'], 200 ) )


MAT = data.frame( Signal = c( es_h3k4me1.comp.domains/width(comp_domains_es_gr), 
                              es_h3k4me1.loop.domains/width(inducedLoops), 
                              ns_h3k4me1.comp.domains/width(comp_domains_ns_gr), 
                              ns_h3k4me1.loop.domains/width(inducedLoops) ) ,
                  Cell = c( rep('No loop (ES)', length(comp_domains_es_gr)),
                            rep('Induced loop (ES)', length(inducedLoops)),
                            rep('No loop (NS)', length(comp_domains_ns_gr)),
                            rep('Induced loop (NS)', length(inducedLoops))) )
MAT$Cell = factor(MAT$Cell,levels=c('No loop (ES)','Induced loop (ES)','No loop (NS)','Induced loop (NS)'))


pIL = ggplot(MAT, aes(y=Signal,x=Cell,color=Cell ) ) +
  geom_violin(trim=TRUE,size=0.75) + geom_boxplot( width=0.1, size=0.2 ) +
  scale_colour_manual(values = c('gray','red3','gray','blue3')) +  
  theme_classic() + 
  theme(legend.position="none",
        title = element_text(size=20), 
        line=element_line(size=0.2),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10),
        axis.text.x  = element_text(size=10,angle=45,hjust=1),
        axis.text.y  = element_text(size=10),
        strip.text.x = element_text(size=10),
        strip.background = element_rect(colour="white", fill="white")) +
  xlab("") +
  ylab( expression( 'Density of H3K4me1+ bins' )  ) +
  theme(axis.line.y = element_line(color="black", size = 1))

Annotation of loop domains to compartments.

load(paste0(data_directory,'Compartment_tracks_50kb.RData' ))

comp_es = c(comp_track_es_eigen2[c(ga50$chr=='chr1')], comp_track_es_eigen1[c(ga50$chr=='chr2')],
            comp_track_es_eigen2[c(ga50$chr=='chr3')], comp_track_es_eigen1[c(ga50$chr=='chr4')],
            comp_track_es_eigen1[c(ga50$chr=='chr5')], comp_track_es_eigen1[c(ga50$chr=='chr6')],
            comp_track_es_eigen1[c(ga50$chr=='chr7')], comp_track_es_eigen1[c(ga50$chr=='chr8')],
            comp_track_es_eigen1[c(ga50$chr=='chr9')], comp_track_es_eigen1[c(ga50$chr=='chr10')],
            comp_track_es_eigen1[c(ga50$chr=='chr11')], comp_track_es_eigen1[c(ga50$chr=='chr12')],
            comp_track_es_eigen1[c(ga50$chr=='chr13')], comp_track_es_eigen2[c(ga50$chr=='chr14')],
            comp_track_es_eigen1[c(ga50$chr=='chr15')], comp_track_es_eigen1[c(ga50$chr=='chr16')],
            comp_track_es_eigen1[c(ga50$chr=='chr17')], comp_track_es_eigen1[c(ga50$chr=='chr18')],
            comp_track_es_eigen1[c(ga50$chr=='chr19')], comp_track_es_eigen3[c(ga50$chr=='chrX')]) 


comp_ns = c(comp_track_ns_eigen1[c(ga50$chr=='chr1')], comp_track_ns_eigen1[c(ga50$chr=='chr2')],
            comp_track_ns_eigen1[c(ga50$chr=='chr3')], comp_track_ns_eigen1[c(ga50$chr=='chr4')],
            comp_track_ns_eigen1[c(ga50$chr=='chr5')], comp_track_ns_eigen1[c(ga50$chr=='chr6')],
            comp_track_ns_eigen1[c(ga50$chr=='chr7')], comp_track_ns_eigen1[c(ga50$chr=='chr8')],
            comp_track_ns_eigen1[c(ga50$chr=='chr9')], comp_track_ns_eigen2[c(ga50$chr=='chr10')],
            comp_track_ns_eigen1[c(ga50$chr=='chr11')], comp_track_ns_eigen1[c(ga50$chr=='chr12')],
            comp_track_ns_eigen2[c(ga50$chr=='chr13')], comp_track_ns_eigen2[c(ga50$chr=='chr14')],
            comp_track_ns_eigen1[c(ga50$chr=='chr15')], comp_track_ns_eigen2[c(ga50$chr=='chr16')],
            comp_track_ns_eigen1[c(ga50$chr=='chr17')], comp_track_ns_eigen2[c(ga50$chr=='chr18')],
            comp_track_ns_eigen2[c(ga50$chr=='chr19')], comp_track_ns_eigen2[c(ga50$chr=='chrX')]) 

save( comp_es, comp_ns, file=paste0(data_directory,'Comp_tracks_ES_NS.RData') )

load( paste0(data_directory,'Comp_tracks_ES_NS.RData') )
gagr50$es = c(comp_es,rep(0,sum(ga50$chr=='chrY')))
gagr50$ns = c(comp_ns,rep(0,sum(ga50$chr=='chrY')))


il_comp = do.call('rbind',lapply(as.list(seq(1,length(inducedLoops))),function(l){
  es_comp = gagr50$es[ queryHits(findOverlaps(gagr50,inducedLoops[l]))]
  ns_comp = gagr50$ns[ queryHits(findOverlaps(gagr50,inducedLoops[l]))]
  data.frame( a_es = sum(es_comp>0), b_es=sum(es_comp<0),
              a_ns = sum(ns_comp>0), b_ns=sum(ns_comp<0), 
              stringsAsFactors=F )
}))


a = il_comp[ il_comp[,3]>il_comp[,4] ,]
sum(a[,1]>a[,2] )

## [1] 1340

sum(a[,1]<a[,2] )

## [1] 360

b = il_comp[ il_comp[,3]<il_comp[,4] ,]
sum(b[,1]<b[,2] )

## [1] 552

m=rbind( c( sum(a[,1]>a[,2] ), sum(a[,1]<a[,2] ) ),
         c( sum(b[,1]>b[,2] ), sum(b[,1]<b[,2] ) ) )
m=m/1000
par(cex.lab=1.5, cex.axis=1.5)
barplot(t(m), col=c('blue','orange'),names=c('A','B'),
        xlab='Comp. in NS',ylab='Number (x1000)')
legend( y=1.5,x=1.5,c('A in ES', 'B in ES'),
        bty='n',col=c('blue','orange'),
        pch=15,cex=1.5)

Mouse gene atlas

This is just to visualize what was done exactly. We used R version to perform these operations. We read all the .CEL files into AffyBatch object. We next perform VSN normalization. We use perfect math probes only.

library(limma)
library(affy)
library(mouse4302.db)
library(biomaRt)
library(vsn)

## affybatch object
load("/g/huber/users/pekowska/Mouse_Gene_Atlas/afb.RData")
afb
## normalize afb
set = expresso(afb, bg.correct = FALSE, normalize.method = "vsn", summary.method = "medianpolish", pmcorrect.method = "pmonly")
save( set, file=paste0(data_directory,'vsnNormalizedSetAllsamples.RData') )

## annotate column names
ex = as.data.frame(exprs(set))
ex = ex[,order(colnames(ex))]
colnames(ex) = unlist(strsplit(colnames(ex), ".CEL.gz"))

## annotate with tissue names
sa = read.delim( file=paste0(annotation_dir,"annotation"), 
                 header = FALSE, as.is = TRUE) 
sa$V3 = unlist(strsplit(sa$V2, "_4MJW"))[ seq(1,2*nrow(sa),by=2) ]
sa$newV1 = c( unlist(strsplit(sa$V1[1:182], " ")), sa$V1[183:185] )

## annotate expression matrix
colnames(ex) = sa[match(colnames(ex),sa$newV1),3]
ex$ps = rownames(ex)


## I do all the gene atlas mapping with bioMart
affyids = rownames(ex)
ensembl = useMart(dataset="mmusculus_gene_ensembl",
biomart="ENSEMBL_MART_ENSEMBL", host="may2012.archive.ensembl.org") ## with this we connect to the appropriae db 
filters = listFilters(ensembl)
attributes = c("affy_mouse430_2", "ensembl_gene_id", "chromosome_name", "start_position", "end_position","strand")
an = getBM(attributes=attributes, filters = "affy_mouse430_2", values = affyids, mart = ensembl)
e = merge(ex, an, by.x = "ps", by.y = "affy_mouse430_2") ### annotate with the gene symbol



## big matix
mGA = do.call('rbind', lapply(split(e, e$ensembl_gene_id), function(geneM){
  if(nrow(geneM)>0){
    whichMaxSD = geneM[ which.max( apply( geneM[,2:151], 1, sd ) ),]
  } } ) )


save( mGA, file=paste0(data_directory,'mGA_repl_sep.RData') )

Assess how many neuronal genes and enhancers are within the induced and reduced loop domains.

## define loop classes:
load( paste0(data_directory,'AllTSS_NCBIM37_67_all_TSS_TTS.RData')  )
seqlevelsStyle(tsgr) = 'ucsc'
seqlevelsStyle(nscEa) = 'ucsc'
seqlevelsStyle(induced_aE)='ucsc'

load( paste0(data_directory,'In_situ_ES_NS_DESeq_loop_comparision') )
load(paste0(data_directory,'mGA_repl_sep.RData'))

all( colnames(mGA[,seq(2,151,by=2)]) == colnames(mGA[,seq(3,151,by=2)]) )

## [1] TRUE

mGAm = do.call('cbind', lapply( as.list(seq(2,151,by=2)), function(COLUMN){
  thisTissue = mGA[ , COLUMN:(COLUMN+1)]
  return( rowMeans(thisTissue) ) }))
colnames( mGAm ) = colnames(mGA[,seq(2,151,by=2)])
mGAm = cbind(mGAm, mGA[c(1,152:156)])
save( mGAm, file=paste0(data_directory,'mGA_repl_merged.RData') )

neurTiss = c("cerebellum", 
             "cerebral_cortex", 
             "cerebral_cortex_prefrontal", 
             "dorsal_striatum", 
             "hippocampus", 
             "hypothalamus", 
             "microglia",
             "nucleus_accumbens", 
             "olfactory_bulb", 
             "pituitary", 
             "spinal_cord")

getGR = function(coords,ga){

  return(GRanges(seqnames = Rle(ga[coords[,1],'chr']),
                 ranges = IRanges(start = as.numeric(ga[coords[,1],'start']),
                                  end = as.numeric(ga[coords[,2],'end'])),
                 strand = Rle(rep("*", nrow(coords)))) )


}

## 
inducedLoops = getGR(loops.ns[,c('left_start','right_end')],ga)
reducedLoops = getGR(loops.es[,c('left_start','right_end')],ga)

## substantial proportion of reduced loops are within induced loop interval
## we will remove these loops from consideration
inducedLoops_filt = inducedLoops[ -unique(queryHits(findOverlaps(inducedLoops,reducedLoops))) ]
reducedLoops_filt = reducedLoops[ -unique(queryHits(findOverlaps(reducedLoops,inducedLoops))) ]

### 
neruTissuesbm = mGAm[ , colnames(mGAm) %in% neurTiss]
neruTissuesbm = do.call('cbind',lapply(as.list(seq(1,ncol(neruTissuesbm))), function(X){
  q = quantile(neruTissuesbm[,X],seq(0,1,by=0.01))[86]
  neruTissuesbm[,X]>q
}))
rownames(neruTissuesbm) = rownames(mGAm)
expressed_in_brain = rownames( neruTissuesbm[ rowSums(neruTissuesbm)>0,] )
length(expressed_in_brain)

## [1] 5867

induced_neuronal = do.call('rbind', lapply( as.list(seq(1,length(inducedLoops_filt))), function(l){
  thisLoopGenes = unique(names(tsgr[ queryHits(findOverlaps(tsgr,inducedLoops_filt[l]))]))
  tp = data.frame( number = length(thisLoopGenes),
                   inGA = sum(thisLoopGenes %in% rownames(mGAm)),
                   upreg = sum(thisLoopGenes %in% ig ),
                   expressed_in_neural = sum( thisLoopGenes[! thisLoopGenes %in% ig ] %in% expressed_in_brain ))
  return(tp)
}))


reduced_neuronal = do.call('rbind', lapply( as.list(seq(1,length(reducedLoops_filt))), function(l){
  thisLoopGenes = unique(names(tsgr[ queryHits(findOverlaps(tsgr,reducedLoops_filt[l]))]))
  tp = data.frame( number = length(thisLoopGenes),
                   inGA = sum(thisLoopGenes %in% rownames(mGAm)),
                   upreg = sum(thisLoopGenes %in% ig ),
                   expressed_in_neural = sum( thisLoopGenes[! thisLoopGenes %in% ig ] %in% expressed_in_brain ))
  return(tp)
}))


m = c( sum( induced_neuronal[,3]==0 & induced_neuronal[,2]>0 & induced_neuronal[, 4]>0 ),
       sum( induced_neuronal[,3]==0 & induced_neuronal[,2]>0 & induced_neuronal[, 4]>0 
            & countOverlaps(inducedLoops_filt,induced_aE)>0 ) )
M = m/sum( induced_neuronal[,3]==0 & induced_neuronal[, 2]>0 )


r = c( sum( reduced_neuronal[,3]==0 & reduced_neuronal[,2]>0 & reduced_neuronal[, 4]>0 ),
       sum( reduced_neuronal[,3]==0 & reduced_neuronal[,2]>0 & reduced_neuronal[, 4]>0 
            & countOverlaps(reducedLoops_filt,induced_aE)>0 ) )
R = r/sum( reduced_neuronal[,3]==0 & reduced_neuronal[, 2]>0 )

fisher.test(matrix( c(sum( induced_neuronal[,3]==0 & induced_neuronal[,2]>0 & induced_neuronal[, 4]>0 ),
                      sum( induced_neuronal[,3]==0 & induced_neuronal[,2]>0 & induced_neuronal[, 4]==0 ),
                      sum( reduced_neuronal[,3]==0 & reduced_neuronal[,2]>0 & reduced_neuronal[, 4]>0 ),
                      sum( reduced_neuronal[,3]==0 & reduced_neuronal[,2]>0 & reduced_neuronal[, 4]==0 )),2,2))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  
## p-value = 0.001347
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.174368 2.032779
## sample estimates:
## odds ratio 
##   1.544839

## p-value = 0.0006677


getGRenh = function(df,offset){
  return(GRanges(seqnames = Rle(df[,1]),
                 ranges = IRanges(start = as.numeric(df[,2])-offset,
                                  end = as.numeric(df[,2])+offset),
                 strand = Rle(rep("*", nrow(df))))) }


## load enhancers from Shen et al
cerebellum_enh = getGRenh( read.delim( paste0(data_directory,'cerebellum.specific.enhancer'),
                             as.is=T, header=F ), 1500 )

cortex_enh = getGRenh( read.delim( paste0(data_directory,'cortex.specific.enhancer'),
                       as.is=T, header=F ), 1500 )

E14.5_brain.specific_enh = getGRenh( read.delim( paste0(data_directory,'E14.5-brain.specific.enhancer'),
                                     as.is=T, header=F ), 1500 )

##
all_brain_enhancers = c( E14.5_brain.specific_enh, cerebellum_enh, cortex_enh )
sum(countOverlaps(inducedLoops_filt,all_brain_enhancers)>0)/length(inducedLoops_filt)

## [1] 0.8231869

sum(countOverlaps(reducedLoops_filt,all_brain_enhancers)>0)/length(reducedLoops_filt)

## [1] 0.5610973

fisher.test(matrix( c(sum( induced_neuronal[,3]==0 
                           & induced_neuronal[,2]>0 
                           & induced_neuronal[, 4]>0 
                           & countOverlaps(inducedLoops_filt,all_brain_enhancers)>0),
                      sum( induced_neuronal[,3]==0 
                           & induced_neuronal[,2]>0 
                           & induced_neuronal[, 4]==0 
                           & countOverlaps(inducedLoops_filt,all_brain_enhancers)==0),
                      sum( reduced_neuronal[,3]==0 
                           & reduced_neuronal[,2]>0 
                           & reduced_neuronal[, 4]>0 
                           & countOverlaps(reducedLoops_filt,all_brain_enhancers)>0),
                      sum( reduced_neuronal[,3]==0 
                           & reduced_neuronal[,2]>0 
                           & reduced_neuronal[, 4]==0 
                           & countOverlaps(reducedLoops_filt,all_brain_enhancers)==0) ),2,2))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  
## p-value = 0.000000000008833
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.693748 6.229032
## sample estimates:
## odds ratio 
##   4.094012

par(mar=c(12,6,1,1), mfrow=c(1,1),
    cex.lab=1.5, cex.axis=1.5 )

M = c( 1.6, 4.1 )
barplot( M, col=c('gray','green4'), ylim=c(0,4),
         names=c('Gene expressed in Nervous system',
                 'Gene expressed in Nervous system + neural enhancer'), 
         las=2, 
         ylab='odds [induced loops/reduced loops]')

## 0.00000000001957

sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS release 6.5 (Final)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] tools     grid      stats4    parallel  stats     graphics  grDevices
##  [8] utils     datasets  methods   base     
## 
## other attached packages:
##  [1] VennDiagram_1.6.17         futile.logger_1.4.1       
##  [3] DESeq_1.22.1               imageHTS_1.20.0           
##  [5] cellHTS2_2.34.1            hwriter_1.3.2             
##  [7] vsn_3.38.0                 splots_1.36.0             
##  [9] RColorBrewer_1.1-2         EBImage_4.12.2            
## [11] pheatmap_1.0.8             LSD_3.0                   
## [13] org.Mm.eg.db_3.2.3         RSQLite_1.0.0             
## [15] DBI_0.4-1                  colorspace_1.2-6          
## [17] rtracklayer_1.30.3         Rsamtools_1.22.0          
## [19] Biostrings_2.38.4          XVector_0.10.0            
## [21] GenomicFeatures_1.22.13    biomaRt_2.26.1            
## [23] Gviz_1.14.7                st_1.2.5                  
## [25] sda_1.3.7                  fdrtool_1.2.15            
## [27] corpcor_1.6.8              entropy_1.2.1             
## [29] smoothmest_0.1-2           MASS_7.3-45               
## [31] genefilter_1.52.1          edgeR_3.12.1              
## [33] limma_3.26.9               DESeq2_1.10.1             
## [35] RcppArmadillo_0.6.600.4.0  Rcpp_0.12.4               
## [37] SummarizedExperiment_1.0.2 GenomicRanges_1.22.4      
## [39] GenomeInfoDb_1.6.3         geneplotter_1.48.0        
## [41] annotate_1.48.0            XML_3.98-1.4              
## [43] AnnotationDbi_1.32.3       IRanges_2.4.8             
## [45] S4Vectors_0.8.11           lattice_0.20-33           
## [47] locfit_1.5-9.1             Biobase_2.30.0            
## [49] BiocGenerics_0.16.1        plyr_1.8.4                
## [51] ggplot2_2.1.0              Matrix_1.2-6              
## [53] rmarkdown_0.9.6           
## 
## loaded via a namespace (and not attached):
##  [1] Category_2.36.0          bitops_1.0-6            
##  [3] matrixStats_0.50.2       affyio_1.40.0           
##  [5] rpart_4.1-10             Hmisc_3.17-3            
##  [7] nnet_7.3-12              gridExtra_2.2.1         
##  [9] preprocessCore_1.32.0    graph_1.48.0            
## [11] formatR_1.4              labeling_0.3            
## [13] scales_0.4.0             DEoptimR_1.0-4          
## [15] mvtnorm_1.0-5            robustbase_0.92-6       
## [17] affy_1.48.0              RBGL_1.46.0             
## [19] stringr_1.0.0            digest_0.6.9            
## [21] tiff_0.1-5               foreign_0.8-66          
## [23] fftwtools_0.9-7          rrcov_1.3-11            
## [25] dichromat_2.0-0          jpeg_0.1-8              
## [27] htmltools_0.3.5          BSgenome_1.38.0         
## [29] BiocInstaller_1.20.3     BiocParallel_1.4.3      
## [31] acepack_1.3-3.3          magrittr_1.5            
## [33] VariantAnnotation_1.16.4 RCurl_1.95-4.8          
## [35] Formula_1.2-1            munsell_0.4.3           
## [37] abind_1.4-3              stringi_1.1.1           
## [39] yaml_2.1.13              zlibbioc_1.16.0         
## [41] splines_3.2.2            knitr_1.13              
## [43] reshape2_1.4.1           futile.options_1.0.0    
## [45] evaluate_0.9             biovizBase_1.18.0       
## [47] latticeExtra_0.6-28      lambda.r_1.1.7          
## [49] prada_1.46.0             png_0.1-7               
## [51] gtable_0.2.0             xtable_1.8-2            
## [53] e1071_1.6-7              class_7.3-14            
## [55] survival_2.39-2          pcaPP_1.9-60            
## [57] GenomicAlignments_1.6.3  cluster_2.0.4           
## [59] GSEABase_1.32.0

Gain of CTCF-anchored chromatin loops marks the exit from naive pluripotency

Aleksandra Pekowska, Bernd Klaus and Wolfgang Huber

07/03/2018