1 Correlation between RNA expression and protein level

1.1 RNAseq versus FACS

1.2 RNAseq versus Proteomics data

NRIP1 protein is not present in the proteomics dataset.

2 Correlation between IGHV and NRIP1/ZAP70 expression

2.1 IGHV and NRIP1

I use log2(RNAseq counts) as the y axis. The counts were normalized by size factors but not variance-stabilized, because the variance stabilizing transformation sometimes does not reflect the actual values of low counts.
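A minimal Python sketch of this normalization, assuming the size factors are median-of-ratios factors (the approach used by DESeq2) and that a pseudocount of 1 is added before taking log2. Both are assumptions; the report does not state them, and the function names are illustrative.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of genes, each a list of raw counts (one per sample)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue  # a zero count gives no finite geometric mean; skip the gene
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]

def log2_normalized(counts, pseudocount=1.0):
    """log2 of size-factor-normalized counts (pseudocount is an assumption)."""
    sf = size_factors(counts)
    return [[math.log2(c / s + pseudocount) for c, s in zip(gene, sf)]
            for gene in counts]
```

With this transformation a sample sequenced twice as deeply gets a size factor twice as large, so its normalized values match those of the shallower sample.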

Estimate a cutpoint to define high and low NRIP1 expression

Instead of choosing the cut-off by eye, I estimated it from the data: the NRIP1 cut-off is the expression value that best separates M-CLL from U-CLL. M-CLL samples with NRIP1 expression above this cut-off, and U-CLL samples with expression below it, are defined as discordant cases.

Note that the estimated cut-off also depends on how the expression values are transformed. Here I use log2-transformed RNAseq counts adjusted for library size.
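One way to estimate such a cut-off can be sketched as follows. This is an illustration only: it assumes the criterion is Youden's J (sensitivity + specificity - 1) evaluated at midpoints between observed values, which may differ from the method actually used. The direction (which IGHV group lies above the cut-off) is taken from the data, and the function names are hypothetical.

```python
def best_cutpoint(values, labels):
    """values: expression per sample; labels: list of "M" or "U" per sample.
    Returns (cutpoint, J, group that lies predominantly above the cutpoint)."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    n_m, n_u = labels.count("M"), labels.count("U")
    best = None
    for c in candidates:
        m_high = sum(1 for v, l in zip(values, labels) if l == "M" and v > c)
        u_high = sum(1 for v, l in zip(values, labels) if l == "U" and v > c)
        # Youden's J for the rule "high expression => U-CLL";
        # a negative value means the opposite direction separates better.
        j = u_high / n_u - m_high / n_m
        if best is None or abs(j) > abs(best[1]):
            best = (c, j)
    cut, j = best
    return cut, abs(j), ("U" if j > 0 else "M")

def discordant(values, labels, cut, high_group):
    """Indices of samples lying on the side of the cut typical of the other group."""
    return [i for i, (v, l) in enumerate(zip(values, labels))
            if (v > cut) != (l == high_group)]
```

Because the candidate thresholds are built from the observed values, the result changes with the transformation applied to the counts, as noted above.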

2.2 IGHV and ZAP70

Estimate a cutpoint to define high and low ZAP70 expression

2.3 Gene expression heatmap

Define groups based on NRIP1 and ZAP70

## [1] "M_lowNRIP1"  "M_highNRIP1" "U_lowNRIP1"  "U_highNRIP1"
## [1] "M_highZAP70" "M_lowZAP70"  "U_highZAP70" "U_lowZAP70"

2.3.1 Can the groups be separated by unsupervised clustering?

2.3.1.1 Heatmap using top 1000 most variable genes

First, annotate the major biological dimensions

Then annotate the NRIP1/ZAP70 groups
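Selecting the most variable genes can be sketched in a few lines of Python. The sketch assumes the ranking is by sample variance of the transformed expression values; the report does not say which variability measure was used, and `top_variable_genes` is an illustrative name.

```python
from statistics import variance

def top_variable_genes(expr, n=1000):
    """expr: dict mapping gene name -> list of expression values per sample.
    Returns the n gene names with the largest sample variance."""
    ranked = sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)
    return ranked[:n]
```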

2.3.1.2 PCA

First, visualize the major biological dimensions

Then visualize the NRIP1/ZAP70-related groups

2.3.1.3 UMAP

Visualize major biological dimensions

2.3.2 Differential expression

Identify genes that are differentially expressed between the high- and low-expression groups within M-CLL and within U-CLL.

2.3.2.1 P-value histogram

Based on the p-value histogram, some genes are differentially regulated in different groups.

2.3.2.2 List of differentially regulated genes in each comparison group (10% FDR)

The Group column indicates the comparison. For example, groupNRIP1_M means “highNRIP1 VS lowNRIP1 in M-CLL”.
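The FDR thresholds used throughout refer to adjusted p-values. A minimal sketch of the Benjamini-Hochberg adjustment is shown here, on the assumption that this is the procedure behind the "p.adj" columns; the adjusted values in the tables of section 5 look consistent with it applied over 36 tested features, but the report does not state the method explicitly.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes with an adjusted value at or below 0.10 would then be reported at 10% FDR.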

2.3.2.3 Overlap of differentially expressed genes

2.3.2.3.1 NRIP1

2.3.2.3.2 ZAP70

2.3.2.4 Enrichment analysis (Hallmark)
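The report does not say which enrichment test was used. One common choice for a list of significant genes is an over-representation test based on the hypergeometric distribution, sketched here in Python as an assumption rather than a description of the actual analysis. The function and parameter names are illustrative.

```python
from math import comb

def enrichment_pvalue(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from a universe of n_universe genes, n_in_set of which are in the gene set."""
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total
```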

2.3.2.5 Enrichment analysis (KEGG)

2.3.3 Clustering using differentially expressed genes

Can the groups be separated better by differentially expressed genes?

2.3.3.1 Heatmap using selected genes (5% FDR)

Not really; the samples still separate largely by IGHV status.

Annotating major biological dimensions

2.3.3.2 PCA

2.3.3.3 UMAP

Visualize major biological dimensions

3 Identify genes correlated with NRIP1 and ZAP70, independent of IGHV status

3.1 Strategy 1: blocking for IGHV in a multivariate model
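The idea of blocking can be illustrated with a small Python sketch. It assumes a linear model of the form gene ~ IGHV + NRIP1 (or ZAP70) with IGHV as a two-level factor; the report does not give the exact model. By the Frisch-Waugh-Lovell theorem, the coefficient of the expression term in that model equals the slope obtained after centering both variables within each IGHV group, which is what the code does. Only the slope, t statistic, and degrees of freedom are returned, since the standard library has no t distribution.

```python
import math
from statistics import fmean

def center_within_groups(values, groups):
    """Subtract each group's mean from its members."""
    means = {g: fmean([v for v, h in zip(values, groups) if h == g])
             for g in set(groups)}
    return [v - means[g] for v, g in zip(values, groups)]

def blocked_slope(gene, target, ighv):
    """Slope of gene on target, adjusted for IGHV group.
    Returns (slope, t_statistic, residual_degrees_of_freedom)."""
    y = center_within_groups(gene, ighv)
    x = center_within_groups(target, ighv)
    sxx = sum(a * a for a in x)
    slope = sum(a * b for a, b in zip(x, y)) / sxx
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    df = len(y) - len(set(ighv)) - 1  # n minus group means minus one slope
    se = math.sqrt(rss / df / sxx)
    t = slope / se if se > 0 else math.copysign(math.inf, slope)
    return slope, t, df
```

A gene that differs between M-CLL and U-CLL but does not track the target within either group gets a slope near zero here, which is the purpose of the blocking.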

3.1.1 NRIP1

3.1.1.1 P value histogram

3.1.1.2 Significantly correlated genes (10% FDR)

3.1.1.3 Plot top 9 correlations, stratified by IGHV

3.1.1.4 Enrichment analysis

Hallmark gene sets

Hallmark gene sets

3.1.2 ZAP70

3.1.2.1 P value histogram

3.1.2.2 Significantly correlated genes (10% FDR)

3.1.2.3 Plot top 9 correlations, stratified by IGHV

3.1.2.4 Enrichment analysis

Hallmark gene sets

Hallmark gene sets

3.2 Strategy 2: Test for M-CLL and U-CLL separately.

3.2.1 NRIP1

3.2.1.1 M-CLL

3.2.1.1.1 P value histogram

3.2.1.1.2 Significantly correlated genes (10% FDR)

3.2.1.1.3 Plot top 9 correlations, stratified by IGHV

3.2.1.1.4 Enrichment analysis

Hallmark gene sets

Hallmark gene sets

3.2.1.2 U-CLL

3.2.1.2.1 P value histogram

In U-CLL, far fewer genes are correlated with NRIP1 than in M-CLL.

3.2.1.2.2 Significantly correlated genes (10% FDR)

3.2.1.2.3 Plot top 9 correlations, stratified by IGHV

3.2.1.2.4 Enrichment analysis

Hallmark gene sets

Hallmark gene sets

3.2.1.3 Compare DE genes in M-CLL and U-CLL

There is little overlap.

3.2.2 ZAP70

3.2.2.1 M-CLL

3.2.2.1.1 P value histogram

3.2.2.1.2 Significantly correlated genes (10% FDR)

3.2.2.1.3 Plot top 9 correlations, stratified by IGHV

3.2.2.1.4 Enrichment analysis

Hallmark gene sets

Hallmark gene sets

3.2.2.2 U-CLL

3.2.2.2.1 P value histogram

In U-CLL, far fewer genes are correlated with ZAP70 than in M-CLL.

3.2.2.2.2 Significantly correlated genes (10% FDR)

No significant DE genes were detected.

3.2.2.2.3 Enrichment analysis

Hallmark gene sets

Hallmark gene sets

3.2.2.3 Compare DE genes in M-CLL and U-CLL

No overlap can be found.

4 LASSO for explaining NRIP1 and ZAP70 expression

4.1 Preprocessing data

RNAseq

Genomic

4.2 Explaining expression using continuous model

4.2.1 NRIP1

Calculate feature importance and optimal number of features

Variance explained

## [1] 0.7537668
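For reference, a "variance explained" figure of this kind is usually the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below assumes that definition; the report does not state whether the value is computed in-sample or by cross-validation.

```python
from statistics import fmean

def r_squared(observed, predicted):
    """Fraction of the variance in `observed` explained by `predicted`."""
    mean_obs = fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```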

Heatmap of selected features

In this analysis, I used a more rigorous feature selection method (stability selection with random lasso). The results may look different from the previous report I sent you. The results of the previous method may not be very stable, i.e. if you run the feature selection several times, you may get different results.
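The stability idea can be sketched as follows: fit a LASSO on many random subsamples and record how often each feature receives a non-zero coefficient. This is plain subsampling-based stability selection with a simple coordinate-descent LASSO, not the random-lasso variant named above, and the penalty `lam`, the subsample fraction, and the number of rounds are arbitrary assumptions for illustration.

```python
import random
from statistics import fmean

def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def center_columns(rows):
    p = len(rows[0])
    means = [fmean([r[j] for r in rows]) for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1.
    X (list of rows) and y are assumed to be centered."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0:
                continue
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def selection_frequency(X, y, lam, n_rounds=50, frac=0.5, seed=0):
    """Fraction of subsamples in which each feature has a non-zero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(2, int(n * frac))
    counts = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), m)
        Xs = center_columns([X[i] for i in idx])
        y_sub = [y[i] for i in idx]
        y_mean = fmean(y_sub)
        beta = lasso_cd(Xs, [v - y_mean for v in y_sub], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_rounds for c in counts]
```

Features selected in most subsamples are the stable ones; a single LASSO fit can pick different members of a correlated gene set from run to run.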

Enrichment of selected features

Note that the enrichment test was run not only on the optimal number of features, but on all features selected with some frequency. Those features can be regarded as related to NRIP1 expression, but they are not necessarily important for predicting it, because many genes are correlated with each other and therefore carry redundant information.

Therefore, in my opinion, penalized regression such as LASSO may not be the best way to “explain” the biology underlying NRIP1 expression, as many informative features will be dropped by the model simply because they correlate with other features. The same holds for ZAP70.

4.2.2 ZAP70

Calculate feature importance and optimal number of features

Variance explained

## [1] 0.8008923

Heatmap of selected features

Enrichment of selected features

4.2.3 Network plot of regulators of NRIP1 and ZAP70

4.3 Explaining expression using continuous model with only genomic data

4.3.1 NRIP1

Calculate feature importance and optimal number of features

Variance explained

## [1] 0.3439689

Heatmap of selected features

Mainly IGHV and methylation cluster. Other features have very low coefficients.

4.3.2 ZAP70

Calculate feature importance and optimal number of features

Variance explained

## [1] 0.4607239

Heatmap of selected features

Mainly IGHV and methylation cluster. Other features have very low coefficients.

4.4 Define four groups to describe NRIP1 and ZAP70 expression (categorical model)

4.4.1 Using median expression to define four groups
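A minimal sketch of the median split, assuming a sample counts as "high" when its expression is strictly above the cohort median. The handling of ties is an assumption, and the group labels follow the naming used later in this section.

```python
from statistics import median

def four_groups(nrip1, zap70):
    """Assign each sample to one of four groups by median split of both genes."""
    med_n, med_z = median(nrip1), median(zap70)
    return [("high" if n > med_n else "low") + "NRIP1"
            + ("high" if z > med_z else "low") + "ZAP70"
            for n, z in zip(nrip1, zap70)]
```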

4.4.2 Select features that can separate those four groups

In this part, I use only gene expression, because IGHV and methylation cluster would always be selected and would downplay the importance of the gene expression features.

Calculate feature importance and optimal number of features

Prediction accuracy

## [1] 0.4653061

Heatmap of selected features

4.4.3 Summarise feature importance for each group

This heatmap is similar to the one above, but the coefficient of each feature for each group is plotted in the bar plot on the left. A higher coefficient suggests that the feature is more important for predicting the corresponding group in the multinomial regression. Groups are indicated by color.

4.4.4 Plot UMAP using genes selected by LASSO

The separation is still not ideal. The highNRIP1highZAP70 and lowNRIP1lowZAP70 groups can be easily separated, but the other two groups do not separate clearly from the rest.

4.5 Associations between ZAP70/NRIP1 group and clinical outcomes

4.6 All patients

TTT

Patients with lowZAP70/lowNRIP1 expression show the best prognosis, while highZAP70/highCD30 patients show the worst.

OS

The trend is similar to that for TTT, but the separation is less clear.
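Group comparisons of TTT and OS like these are typically shown as Kaplan-Meier curves. A minimal sketch of the estimator is given below, assuming right-censored data with an event indicator of 1 (treatment or death observed) or 0 (censored); it would be applied to each expression group separately. The function name is illustrative.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one event."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect all subjects sharing this time point.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            surv *= 1 - n_events / at_risk
            curve.append((t, surv))
        at_risk -= n_leaving
    return curve
```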

4.7 U-CLLs

TTT

OS

4.8 M-CLLs

TTT

OS

4.8.1 Multivariate model

Time to treatment

The highNRIP1/highZAP70 group remains significant after adjusting for the other factors in the model.

Time to treatment

5 Test of correlations between NRIP1/ZAP70 expression and other genomic features

5.1 Mutations and Copy number variations

Prepare table for test

Function for performing test
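The test function itself is not shown. As an illustration of what such a function might compute, the sketch below returns the difference in mean expression (assumed here to be mutated minus wild-type, matching the "meanDiff" column in name only) and a two-sided permutation p-value. The permutation test is a stand-in chosen because it needs only the standard library; the report may well have used a t-test instead.

```python
import random
from statistics import fmean

def mean_diff(expr, mutated):
    """Mean expression in mutated samples minus mean in wild-type samples."""
    mut = [e for e, m in zip(expr, mutated) if m]
    wt = [e for e, m in zip(expr, mutated) if not m]
    return fmean(mut) - fmean(wt)

def permutation_pvalue(expr, mutated, n_perm=2000, seed=1):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean_diff(expr, mutated))
    labels = list(mutated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if abs(mean_diff(expr, labels)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)
```

The resulting p-values, one per genomic feature, would then be adjusted for multiple testing before applying the 10% FDR threshold.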

5.1.1 NRIP1

5.1.1.1 All CLLs

Associations (10% FDR)

## # A tibble: 6 x 4
##   gene     meanDiff  p.value  p.adj
##   <chr>       <dbl>    <dbl>  <dbl>
## 1 NOTCH1     -1.31  0.000573 0.0206
## 2 del11q     -1.02  0.00243  0.0437
## 3 HIST1H1E    2.23  0.00454  0.0545
## 4 DDX3X      -2.14  0.00644  0.0580
## 5 ZMYM3      -2.03  0.00935  0.0615
## 6 TP53       -0.831 0.0103   0.0615

Boxplots of significant associations

5.1.1.2 In M-CLLs only

Associations (10% FDR)

## # A tibble: 1 x 4
##   gene  meanDiff p.value  p.adj
##   <chr>    <dbl>   <dbl>  <dbl>
## 1 ATM      -2.59 0.00170 0.0306

Boxplots of significant associations

5.1.1.3 In U-CLLs only

Associations (10% FDR)

## # A tibble: 0 x 4
## # … with 4 variables: gene <chr>, meanDiff <dbl>, p.value <dbl>, p.adj <dbl>

5.1.2 ZAP70

5.1.2.1 All CLLs

Associations (10% FDR)

## # A tibble: 6 x 4
##   gene   meanDiff   p.value    p.adj
##   <chr>     <dbl>     <dbl>    <dbl>
## 1 TP53       1.22 0.0000485 0.000889
## 2 del11q     1.25 0.0000494 0.000889
## 3 NOTCH1     1.18 0.00108   0.0129  
## 4 MED12      1.56 0.00328   0.0295  
## 5 ZMYM3      1.93 0.00983   0.0707  
## 6 BRAF       1.19 0.0142    0.0852

Boxplots of significant associations

5.1.2.2 In M-CLLs only

Associations (10% FDR)

## # A tibble: 0 x 4
## # … with 4 variables: gene <chr>, meanDiff <dbl>, p.value <dbl>, p.adj <dbl>

5.1.2.3 In U-CLLs only

Associations (10% FDR)

## # A tibble: 1 x 4
##   gene  meanDiff  p.value  p.adj
##   <chr>    <dbl>    <dbl>  <dbl>
## 1 U1       -1.13 0.000847 0.0279

Boxplots of significant associations

5.2 Association with methylation clusters

ANOVA test

## # A tibble: 2 x 2
## # Groups:   gene [2]
##   gene  p.value
##   <chr>   <dbl>
## 1 NRIP1       0
## 2 ZAP70       0
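For reference, the one-way ANOVA F statistic behind such a test can be sketched as below, with one list of expression values per methylation cluster. Converting F to a p-value needs the F distribution, which the Python standard library lacks, so only the statistic and its degrees of freedom are returned. A printed p-value of 0 most likely reflects a value below display or floating-point precision rather than an exact zero.

```python
from statistics import fmean

def anova_f(groups):
    """groups: list of lists of values, one list per group.
    Returns (F, df_between, df_within)."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```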

Boxplot