NRIP1 protein is not present in the proteomics dataset
I am using the log2(RNAseq counts) as the y axis. The counts were normalized by size factors but not using variance stabilizing transformation. Because variance stabilizing transformation sometimes do not reflect the actual value of low counts.
Estimate a cutpoint to define high and low NRIP1 expression Instead of choosing the cut-off by eye, I used a more statistically reasonable way. The basic idea is that, the cut-off of NRIP1 expression is estimate to best separate M-CLL and U-CLL. The M-CLL samples show higher NRIP1 expression than this cut-off or the U-CLL samples show lower NRIP1 expression than this cut-off would be defined as discordance cases.
Note that, the estimation of the cut-off value also depending on how the expression value is transformed. Here I use log2 transformed RNAseq counts with library size adjustment.
Estimate a cutpoint to define high and low ZAP70 expression
Define groups based on NRIP1 and ZAP70
## [1] "M_lowNRIP1" "M_highNRIP1" "U_lowNRIP1" "U_highNRIP1"
## [1] "M_highZAP70" "M_lowZAP70" "U_highZAP70" "U_lowZAP70"
Firstly, annotating major biological dimensions
Annotation NRIP1/ZAP70 groups
Firstly visualize major biological dimensions
Visualized NRIP1/ZAP70 related groups
Visualize major biological dimensions
Identify genes that are differentially expression between high expression and low expression groups in M-CLL and U-CLL
Based on the p-value histogram, some genes are differentially regulated in different groups.
Can the groups be separated better by differentially expressed genes?
Not really, still largely by IGHV.
Annotating major biological dimensions
Visualize major biological dimensions
RNAseq
Genomic
Calculate feature importance and optimal number of features
Variance explaied
## [1] 0.7537668
Heatmap of selected features In this analysis, I used a more rigorous feature selection method (stability selection with random lasso). The results may look different to the previous report I sent you. The results from previous method may not be very stable, i.e if you run the feature selection several times, you may get different results.
Enrichment of selected features Noted that the enrichment test was not run only on the optimal number of feature, but all features selected with some frequency. Those features can be regarded as features related to the expression of NRIP1, but may not be necessarily important in predicting NRIP1 expression. Because many genes are correlated with each other, and therefore they contain redundant information.
Therefore, in my opinion, regression with penalization, such as LASSO, may not be the best way to “explain” the biology underlying NRIP1 expression. As many informative features will be dropped by the model simply because they correlate with some other features. The same also holds for ZAP70.
Calculate feature importance and optimal number of features
Variance explaied
## [1] 0.8008923
Heatmap of selected features
Enrichment of selected features
Calculate feature importance and optimal number of features
Variance explaied
## [1] 0.3439689
Heatmap of selected features Mainly IGHV and Methylation cluster. Other features have very low coefficient.
Calculate feature importance and optimal number of features
Variance explaied
## [1] 0.4607239
Heatmap of selected features Mainly IGHV and Methylation cluster. Other features have very low coefficient.
In this part, I am only using gene expression. As IGHV and methylation cluster will always be selected, and will downplay the importance of other gene expression features
Calculate feature importance and optimal number of features
Prediction accuracy
## [1] 0.4653061
Heatmap of selected features
This heatmap is similar to the above one. But the coefficient of each feature for each group is plotted on the left-side barplot. Higher coefficient suggests this feature is more important for predicting the corresponding group using multinomial regression. Groups are indicated by the colors.
The separation is still not ideal. The highNRIP1highZAP70 and lowNRIP1lowZAP70 group can be easily separated. But other two groups do not show clear separation with others.
TTT Patient with lowZAP70/lowNRIP1 expression show the best prognosis while highZAP70/highCD30 patients show the worst prognosis.
OS The trend is similar as TTT, but the separation is less clear.
TTT
OS
TTT
OS
Time to treatment The highNRIP1/highZAP70 group is still significant when other factors were adjusted in the model.
Time to treatment
Prepare table for test
Function for performing test
Associations (10% FDR)
## # A tibble: 6 x 4
## gene meanDiff p.value p.adj
## <chr> <dbl> <dbl> <dbl>
## 1 NOTCH1 -1.31 0.000573 0.0206
## 2 del11q -1.02 0.00243 0.0437
## 3 HIST1H1E 2.23 0.00454 0.0545
## 4 DDX3X -2.14 0.00644 0.0580
## 5 ZMYM3 -2.03 0.00935 0.0615
## 6 TP53 -0.831 0.0103 0.0615
Boxplots of significant associations
Associations (10% FDR)
## # A tibble: 1 x 4
## gene meanDiff p.value p.adj
## <chr> <dbl> <dbl> <dbl>
## 1 ATM -2.59 0.00170 0.0306
Boxplots of significant associations
Associations (10% FDR)
## # A tibble: 0 x 4
## # … with 4 variables: gene <chr>, meanDiff <dbl>, p.value <dbl>, p.adj <dbl>
Associations (10% FDR)
## # A tibble: 6 x 4
## gene meanDiff p.value p.adj
## <chr> <dbl> <dbl> <dbl>
## 1 TP53 1.22 0.0000485 0.000889
## 2 del11q 1.25 0.0000494 0.000889
## 3 NOTCH1 1.18 0.00108 0.0129
## 4 MED12 1.56 0.00328 0.0295
## 5 ZMYM3 1.93 0.00983 0.0707
## 6 BRAF 1.19 0.0142 0.0852
Boxplots of significant associations
Associations (10% FDR)
## # A tibble: 0 x 4
## # … with 4 variables: gene <chr>, meanDiff <dbl>, p.value <dbl>, p.adj <dbl>
Associations (10% FDR)
## # A tibble: 1 x 4
## gene meanDiff p.value p.adj
## <chr> <dbl> <dbl> <dbl>
## 1 U1 -1.13 0.000847 0.0279
Boxplots of significant associations
ANOVA test
## # A tibble: 2 x 2
## # Groups: gene [2]
## gene p.value
## <chr> <dbl>
## 1 NRIP1 0
## 2 ZAP70 0
Boxplot