Last updated: 2020-06-24
Knit directory: Proteomics/analysis/
This reproducible R Markdown analysis was created with workflowr (version 1.6.0).
The command set.seed(20200227) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Past versions of the R Markdown and HTML files:
File | Version | Author | Date | Message |
---|---|---|---|---|
html | b8e0823 | Junyan Lu | 2020-03-10 | Build site. |
Rmd | c7747b2 | Junyan Lu | 2020-03-10 | update analysses |
html | c7747b2 | Junyan Lu | 2020-03-10 | update analysses |
html | 46534c2 | Junyan Lu | 2020-02-27 | Build site. |
Rmd | 2b8852e | Junyan Lu | 2020-02-27 | wflow_publish(list.files(“./”, pattern = “Rmd”)) |
html | cc8c163 | Junyan Lu | 2020-02-27 | Build site. |
Rmd | 16c2694 | Junyan Lu | 2020-02-27 | wflow_publish(c(“processProteomics_LUMOS.Rmd”, “index.Rmd”, |
Process raw data from LUMOS
rawTab <- read_tsv("../data/LUMOS/20191113_112707_20191016_p3254_o6152_hybrid_lybrary_all_Report_proteinQuant.tsv") %>%
separate(R.FileName, c("a","b","c","d","id","e"), sep = "_") %>%
select(id, PG.ProteinAccessions, PG.Quantity) %>%
mutate(id = paste0("A_1_",id)) %>%
dplyr::rename(ID = PG.ProteinAccessions,
count = PG.Quantity)
#annotate patient ID
patAnno <- readxl::read_xlsx("../data/SampleAnnotation_cleaned.xlsx") %>%
mutate(id = paste0("A_1_",id)) %>%
select(-Institute, -Source, -diagnosis)
#annotate basic genomic feature
genAnno <- patMeta %>% select(Patient.ID, gender, IGHV.status, trisomy12) %>%
mutate(trisomy12 = as.integer(as.character(trisomy12))) %>%
mutate(trisomy12 = factor(trisomy12))
#annotate technical variable
techTab <- readxl::read_xlsx("../data/20191025_Proteom_submitted_samples_final.xlsx") %>%
  select(`Patient ID`, operator, viability, batch, `date of sample processing`,
         `protein conc. in ug`, `freeze-thaw cycles of peptide solution`) %>%
  dplyr::rename(patID = `Patient ID`,
                processDate = `date of sample processing`,
                proteinConc = `protein conc. in ug`,
                freeThawCycle = `freeze-thaw cycles of peptide solution`) %>%
  mutate(batch = ifelse(batch == "test run", "0", batch))
patAnno <- left_join(patAnno, genAnno, by = c(patID = "Patient.ID")) %>%
left_join(techTab, by = "patID")
#update protein concentration using new data
protConc <- readxl::read_xlsx("../data/LUMOS/p3254_o6152_Nanodrop_measurement.xlsx") %>%
  mutate(id = paste0("A_1_", sample)) %>%
  mutate(patID = patAnno[match(id, patAnno$id),]$patID)
patAnno$proteinConc <- protConc[match(patAnno$patID, protConc$patID),]$`Nanodrop measurement`
rawTab <- left_join(rawTab, patAnno, by = "id") %>% dplyr::rename(name=ID)
Generate formatted protein IDs
idMap <- tibble(name = unique(rawTab$name)) %>%
mutate(ID = paste0("prot",seq(nrow(.))))
rawTab <- left_join(rawTab, idMap, by = "name")
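The ID assignment above can be tried out on a toy vector. This is a minimal sketch in base R, assuming made-up accession strings (`toyNames`, `toyMap` and `toyIDs` are hypothetical names, not objects from the analysis): each distinct protein-group name gets a short `protN` label, and `match()` plays the role of the join.

```r
# Toy protein-group names; the values are made up for illustration.
toyNames <- c("P12345", "Q99999;P00001", "P12345", "O11111")

# One row per distinct name, labelled prot1, prot2, ... in order of first appearance.
uniqNames <- unique(toyNames)
toyMap <- data.frame(name = uniqNames,
                     ID = paste0("prot", seq_along(uniqNames)),
                     stringsAsFactors = FALSE)

# Look up the short ID for every original entry (equivalent to a left join on "name").
toyIDs <- toyMap$ID[match(toyNames, toyMap$name)]
print(toyIDs)
```

Repeated names receive the same ID, so the mapping stays one-to-one between names and IDs.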
protCLL <- tidyToSum(rawTab, "ID", "patID", "count",
annoCol = colnames(patAnno)[colnames(patAnno) != "patID"],
annoRow = c("name"))
rowData(protCLL)$ID <- rownames(protCLL)
Dimension of the original data
dim(protCLL)
[1] 4452 50
plot_missval(protCLL)
Version | Author | Date |
---|---|---|
cc8c163 | Junyan Lu | 2020-02-27 |
Some samples have a lower detection rate than others.
proTab <- sumToTiday(protCLL, "uniprotID", "patientID")
patMiss <- group_by(proTab, patientID) %>%
summarise(freqNA = sum(is.na(count))/length(count)) %>%
arrange(desc(freqNA)) %>%
mutate(patientID = factor(patientID, levels = patientID))
ggplot(patMiss, aes(x = patientID, y = freqNA)) + geom_point(size=3) +
geom_segment(aes(x=patientID, xend=patientID, y=0, yend=freqNA)) +
theme(axis.text.x = element_text(angle = 90, vjust =0.5, hjust=1)) + ylab("Frequency")
One sample, P0641, has many more missing values than the other samples. This sample also behaves strangely in downstream analysis.
Remove “P0641”
protCLL <- protCLL[,colnames(protCLL)!="P0641"]
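The per-sample missing-value check can also be sketched without any helper packages. The example below is a toy illustration in base R, assuming a made-up matrix and a hypothetical threshold `toyCut`; in the analysis above, P0641 was picked by inspecting the plot, not by a fixed threshold.

```r
# Toy intensity matrix: rows are proteins, columns are samples (values are made up).
toyMat <- matrix(c(10, NA, 12,
                   11, NA, 13,
                   NA, NA, 14,
                    9,  8, 15),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("prot", 1:4), c("S1", "S2", "S3")))

# Fraction of missing values per sample; the mean of a logical vector is a proportion.
freqNA <- colMeans(is.na(toyMat))

# Hypothetical threshold, used for this toy example only.
toyCut <- 0.5
keep <- freqNA <= toyCut

# drop = FALSE keeps the result a matrix even if a single sample remains.
toyFiltered <- toyMat[, keep, drop = FALSE]
print(sort(freqNA, decreasing = TRUE))
```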
proTab <- sumToTiday(protCLL, "uniprotID", "patientID")
proMiss <- group_by(proTab, uniprotID) %>%
summarise(freqNA = sum(is.na(count))/length(count)) %>%
arrange(desc(freqNA)) %>%
mutate(uniprotID = factor(uniprotID, levels = uniprotID))
head(proMiss)
# A tibble: 6 x 2
uniprotID freqNA
<fct> <dbl>
1 prot4425 0.980
2 prot4430 0.980
3 prot4433 0.980
4 prot4434 0.980
5 prot4435 0.980
6 prot4436 0.980
ggplot(proMiss, aes(x=freqNA)) + geom_histogram() +
xlab("Missing value frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Missing value cut-off versus number of remaining proteins
sumTab <- lapply(seq(0, 1, by = 0.01), function(x) {
  tibble(cut = x, freq = sum(proMiss$freqNA < x) / nrow(proMiss))
}) %>% bind_rows()
ggplot(sumTab, aes(x = cut, y = freq)) + geom_line() +
  xlab("Missing value cut-off") + ylab("Fraction remaining")
Missing value frequency versus median expression
compareTab <- group_by(proTab, uniprotID) %>%
summarise(freqNA = sum(is.na(count))/length(count),
medianExpr = median(log2(count), na.rm=TRUE))
ggplot(compareTab, aes(x=freqNA, y = medianExpr)) + geom_point() + geom_smooth(method = "loess") +
ylab("Median log2 count") + xlab("Frequency of missing values")
Highly expressed proteins tend to have a higher detection rate.
Remove proteins with more than 50% missing values
cut=0.5
protCLL_filt <- protCLL[rowSums(is.na(assay(protCLL)))/ncol(protCLL) <= cut,]
Dimension of the filtered data
dim(protCLL_filt)
[1] 4291 49
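The row filter and the cut-off curve can be reproduced on a small example. This is a sketch in base R on a made-up matrix (all object names here are hypothetical), showing how the retained fraction changes with the cut-off.

```r
# Toy matrix (made-up values): 4 proteins x 4 samples.
toyMat <- matrix(c( 1,  2,  3,  4,
                   NA,  2,  3,  4,
                   NA, NA,  3,  4,
                   NA, NA, NA,  4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("prot", 1:4), paste0("S", 1:4)))

# Per-protein fraction of missing values.
rowMiss <- rowMeans(is.na(toyMat))

# Keep proteins with at most 50% missing values, as in the filter above.
cutoff <- 0.5
toyFilt <- toyMat[rowMiss <= cutoff, , drop = FALSE]

# Fraction of proteins retained at a few candidate cut-offs.
cuts <- c(0, 0.25, 0.5, 0.75, 1)
remaining <- sapply(cuts, function(x) mean(rowMiss <= x))
print(data.frame(cut = cuts, remaining = remaining))
```

Note that the cut-off curve earlier uses a strict `<`, while the filter uses `<=`, so the two can differ for proteins that sit exactly on the threshold.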
protTab <- sumToTiday(protCLL_filt, "id","patientID")
ggplot(protTab, aes(x=patientID, y=count)) + geom_boxplot() + scale_y_log10() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Warning: Removed 5902 rows containing non-finite values (stat_boxplot).
exprMat <- assay(protCLL_filt)
resVsn <- vsnMatrix(exprMat)
protCLL_norm <- protCLL_filt
assay(protCLL_norm) <- predict(resVsn, exprMat)
protTab <- sumToTiday(protCLL_norm, "uniprotID","patientID")
ggplot(protTab, aes(x=patientID, y=count)) + geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Warning: Removed 5902 rows containing non-finite values (stat_boxplot).
vsn::meanSdPlot(resVsn)
Looks OK, although lowly expressed proteins still have higher variance.
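As a rough illustration of what a mean-SD plot checks, the sketch below simulates intensities with multiplicative noise and compares the mean-SD dependence before and after a transformation. It uses base R only, the data are simulated, and plain `asinh()` is a stand-in: it does not reproduce what `vsnMatrix()` fits.

```r
set.seed(1)
# Simulated intensities (not real data): 200 proteins x 10 samples with
# multiplicative noise, so the raw standard deviation grows with the mean.
trueLevel <- exp(runif(200, min = 5, max = 12))
rawMat <- sapply(1:10, function(i) trueLevel * exp(rnorm(200, sd = 0.2)))

# Row-wise mean and standard deviation, the two quantities shown by a mean-SD plot.
meanSd <- function(m) {
  data.frame(mean = rowMeans(m, na.rm = TRUE),
             sd = apply(m, 1, sd, na.rm = TRUE))
}

rawStat <- meanSd(rawMat)
# Plain asinh is only a stand-in here; vsn estimates calibration parameters
# before applying an arsinh-type transformation.
transStat <- meanSd(asinh(rawMat))

# Rank correlation between mean and SD: expected to be strongly positive on the
# raw scale and much weaker after the transformation.
corRaw <- cor(rawStat$mean, rawStat$sd, method = "spearman")
corTrans <- cor(transStat$mean, transStat$sd, method = "spearman")
print(c(raw = corRaw, transformed = corTrans))
```

A flat trend of SD against the rank of the mean is what "variance stabilized" means in practice; a remaining upturn at the low end corresponds to the observation above about lowly expressed proteins.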
To impute missing values, I first need to check whether the data are missing at random.
Missing value pattern after normalization and filtering
plot_missval(protCLL_norm)
Detection rate of proteins with and without missing values
plot_detect(protCLL_norm)
Proteins with missing values have, on average, lower intensities, so the values are not missing at random.
QRILC is a method for imputing left-censored data, i.e. data that are missing not at random.
protCLL_imp <- impute(protCLL_norm, fun = "QRILC")
#add QRILC imputed data
assays(protCLL_norm)[["QRILC"]] <- assay(protCLL_imp)
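To give an idea of what left-censored imputation does, here is a simplified sketch in base R. It is not the QRILC algorithm itself: QRILC estimates the distribution parameters by quantile regression, whereas this toy version simply uses the mean and SD of the observed values. The data are simulated and all names are hypothetical.

```r
set.seed(20200227)
# Simulated log-scale intensities for one sample (not real data).
x <- rnorm(500, mean = 20, sd = 2)
# Censor the lowest values to mimic intensity-dependent missingness.
x[x < 17.5] <- NA

# Estimate location and scale from the observed values only (a crude choice;
# QRILC instead estimates them by quantile regression).
mu <- mean(x, na.rm = TRUE)
sigma <- sd(x, na.rm = TRUE)

# Upper bound for imputed values: the smallest observed intensity.
upper <- min(x, na.rm = TRUE)

# Draw from a normal distribution truncated above at `upper` using the
# inverse-CDF method, so every imputed value lies in the left tail.
nMiss <- sum(is.na(x))
u <- runif(nMiss, min = 0, max = pnorm(upper, mean = mu, sd = sigma))
xImp <- x
xImp[is.na(x)] <- qnorm(u, mean = mu, sd = sigma)
print(summary(xImp[is.na(x)]))
```

The point of the sketch is only that imputed values are placed below the detected range, which is the appropriate assumption when low-abundance proteins are the ones that go undetected.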
Prepare protein id table
idTab <- map_df(unique(rowData(protCLL)$name), ~tibble(ID = ., uniprotID = str_split(.,";")[[1]]))
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl",host="grch37.ensembl.org")
ids <- idTab$uniprotID
anno <- getBM(attributes=c('hgnc_symbol','ensembl_gene_id',
'uniprotswissprot'),
filters = 'uniprotswissprot',
values = ids,
mart = ensembl)
idTab <- left_join(idTab, anno, by = c(uniprotID="uniprotswissprot"))
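The first step above expands each protein group into one row per accession. A minimal base R sketch of the same idea, using made-up accessions (`groups` and `toyIdTab` are hypothetical names):

```r
# Toy protein-group identifiers (made-up accessions separated by ";").
groups <- c("P11111;P22222", "Q33333", "O44444;O55555;O66666")

# Split each group and repeat the group ID once per member accession.
members <- strsplit(groups, ";", fixed = TRUE)
toyIdTab <- data.frame(ID = rep(groups, times = lengths(members)),
                       uniprotID = unlist(members),
                       stringsAsFactors = FALSE)
print(toyIdTab)
```

Each member accession can then be annotated separately while still pointing back to the group it came from.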
Proteins that cannot be mapped
sort(unique(filter(idTab, is.na(ensembl_gene_id))$uniprotID))
[1] "A0A024RBG1" "A0A075B6I0" "A0A075B6K4"
[4] "A0A075B6P5" "A0A0A0MT36" "A0A0A0MTA4"
[7] "A0A0B4J1X5" "A0A0B4J1X8" "A0A0B4J2A2"
[10] "A0A0B4J2D5" "A0A0B4J2D9" "A0A0C4DH29"
[13] "A0A0C4DH31" "A0A0C4DH38" "A0A0C4DH67"
[16] "A0A0C4DH69" "A0A0J9YXX1" "A6NI72"
[19] "A8MWD9" "C4AMC7" "E9PAV3"
[22] "E9PRG8" "L0R6Q1" "O15126"
[25] "O75396" "P01593" "P01594"
[28] "P01597" "P01599" "P01602"
[31] "P01615" "P01619" "P01624"
[34] "P01701" "P01709" "P01718"
[37] "P01742" "P01764" "P01780"
[40] "P01782" "P01850" "P01871"
[43] "P01877" "P01880" "P04430"
[46] "P04432" "P06331" "P0C7X5"
[49] "P0CB33" "P0CF74" "P0DMM9"
[52] "P0DMN0" "P0DMV8" "P0DMV9"
[55] "P0DN76" "P0DOY2" "P0DOY3"
[58] "P0DP02" "P0DP04" "P0DP09"
[61] "P0DP23" "P0DP24" "P0DP25"
[64] "P0DPI2" "P16284" "P21127"
[67] "P58107" "P80748" "Q58FF6"
[70] "Q58FG1" "Q59GN2" "Q6EEV4"
[73] "Q6SZW1" "Q6VEQ5" "Q6ZSR6"
[76] "Q6ZSR9" "Q8IY67" "Q8NHM4"
[79] "Q96M42" "Q9BXJ8" "Q9BYP9"
[82] "Q9H270" "Q9NPI6" "Q9Y2S7"
[85] "Y-FGCZCont00001" "Y-FGCZCont00010" "Y-FGCZCont00024"
[88] "Y-FGCZCont00027" "Y-FGCZCont00030" "Y-FGCZCont00032"
[91] "Y-FGCZCont00033" "Y-FGCZCont00035" "Y-FGCZCont00037"
[94] "Y-FGCZCont00052" "Y-FGCZCont00054" "Y-FGCZCont00056"
[97] "Y-FGCZCont00060" "Y-FGCZCont00064" "Y-FGCZCont00065"
[100] "Y-FGCZCont00066" "Y-FGCZCont00077" "Y-FGCZCont00093"
[103] "Y-FGCZCont00098" "Y-FGCZCont00099" "Y-FGCZCont00100"
[106] "Y-FGCZCont00106" "Y-FGCZCont00110" "Y-FGCZCont00111"
[109] "Y-FGCZCont00113" "Y-FGCZCont00116" "Y-FGCZCont00118"
[112] "Y-FGCZCont00122" "Y-FGCZCont00124" "Y-FGCZCont00127"
[115] "Y-FGCZCont00142" "Y-FGCZCont00147" "Y-FGCZCont00148"
[118] "Y-FGCZCont00150" "Y-FGCZCont00151" "Y-FGCZCont00160"
[121] "Y-FGCZCont00161" "Y-FGCZCont00162" "Y-FGCZCont00165"
[124] "Y-FGCZCont00168" "Y-FGCZCont00175" "Y-FGCZCont00180"
[127] "Y-FGCZCont00188" "Y-FGCZCont00189" "Y-FGCZCont00190"
[130] "Y-FGCZCont00193" "Y-FGCZCont00195" "Y-FGCZCont00212"
[133] "Y-FGCZCont00231" "Y-FGCZCont00237" "Y-FGCZCont00243"
[136] "Y-FGCZCont00248" "Y-FGCZCont00251" "Y-FGCZCont00256"
[139] "Y-FGCZCont00260" "Y-FGCZCont00272" "Y-FGCZCont00276"
[142] "Y-FGCZCont00278" "Y-FGCZCont00282" "Y-FGCZCont00285"
[145] "Y-FGCZCont00286" "Y-FGCZCont00288" "Y-FGCZCont00289"
[148] "Y-FGCZCont00293" "Y-FGCZCont00301" "Y-FGCZCont00307"
[151] "Y-FGCZCont00310" "Y-FGCZCont00311" "Y-FGCZCont00319"
[154] "Y-FGCZCont00328" "Y-FGCZCont00329" "Y-FGCZCont00330"
[157] "Y-FGCZCont00343" "Y-FGCZCont00347" "Y-FGCZCont00348"
[160] "Y-FGCZCont00351" "Y-FGCZCont00355" "Y-FGCZCont00356"
[163] "Y-FGCZCont00359" "Y-FGCZCont00367" "Y-FGCZCont00373"
[166] "Y-FGCZCont00374" "Y-FGCZCont00375" "Y-FGCZCont00376"
[169] "Y-FGCZCont00378" "Y-FGCZCont00380" "Y-FGCZCont00414"
[172] "Y-FGCZCont00417" "Y-FGCZCont00418" "Y-FGCZCont00419"
[175] "Y-FGCZCont00422" "Y-FGCZCont00459" "Y-FGCZCont00461"
[178] "Y-FGCZCont00463" "Y-FGCZCont00474" "Y-FGCZCont00482"
Map those proteins using a UniProt ID to gene symbol list
idTab.miss <- filter(idTab,is.na(ensembl_gene_id))
mapSymbol <- read_tsv("../data/mapSymbol.txt")
Parsed with column specification:
cols(
From = col_character(),
To = col_character()
)
idTab.miss <- mutate(idTab.miss, hgnc_symbol = mapSymbol[match(uniprotID, mapSymbol$From),]$To)
#proteins that can not be mapped
unique(filter(idTab.miss, is.na(hgnc_symbol))$uniprotID)
[1] "Y-FGCZCont00100" "Y-FGCZCont00147" "Y-FGCZCont00243"
[4] "Y-FGCZCont00285" "Y-FGCZCont00056" "Y-FGCZCont00010"
[7] "Y-FGCZCont00001" "Y-FGCZCont00027" "Y-FGCZCont00248"
[10] "Y-FGCZCont00482" "Y-FGCZCont00093" "Y-FGCZCont00160"
[13] "Y-FGCZCont00175" "Y-FGCZCont00142" "Y-FGCZCont00066"
[16] "Y-FGCZCont00165" "Y-FGCZCont00099" "Y-FGCZCont00190"
[19] "Y-FGCZCont00162" "Y-FGCZCont00278" "Y-FGCZCont00116"
[22] "Y-FGCZCont00106" "Y-FGCZCont00122" "Q6ZSR6"
[25] "Q6ZSR9" "Y-FGCZCont00065" "Y-FGCZCont00282"
[28] "Y-FGCZCont00030" "Y-FGCZCont00113" "Y-FGCZCont00118"
[31] "Y-FGCZCont00378" "Y-FGCZCont00124" "Y-FGCZCont00375"
[34] "Y-FGCZCont00168" "Y-FGCZCont00180" "Y-FGCZCont00256"
[37] "Y-FGCZCont00260" "Y-FGCZCont00380" "Y-FGCZCont00272"
[40] "Y-FGCZCont00293" "Y-FGCZCont00301" "Y-FGCZCont00307"
[43] "Y-FGCZCont00311" "Y-FGCZCont00319" "Y-FGCZCont00330"
[46] "Y-FGCZCont00343" "Y-FGCZCont00347" "Y-FGCZCont00348"
[49] "Y-FGCZCont00351" "Y-FGCZCont00356" "Y-FGCZCont00359"
[52] "Y-FGCZCont00373" "Y-FGCZCont00374" "Y-FGCZCont00461"
[55] "Y-FGCZCont00459" "Y-FGCZCont00463" "Y-FGCZCont00474"
[58] "Y-FGCZCont00212" "Y-FGCZCont00231" "Y-FGCZCont00376"
[61] "Y-FGCZCont00189" "Y-FGCZCont00276" "Y-FGCZCont00367"
[64] "Y-FGCZCont00328" "Y-FGCZCont00355" "Y-FGCZCont00111"
[67] "Y-FGCZCont00054" "Y-FGCZCont00310" "Y-FGCZCont00188"
[70] "Y-FGCZCont00288" "Y-FGCZCont00418" "Y-FGCZCont00417"
[73] "Y-FGCZCont00422" "Y-FGCZCont00419" "Y-FGCZCont00151"
[76] "Y-FGCZCont00148" "Y-FGCZCont00033" "Y-FGCZCont00024"
[79] "Y-FGCZCont00035" "Y-FGCZCont00060" "Y-FGCZCont00237"
[82] "Y-FGCZCont00289" "Y-FGCZCont00098" "Y-FGCZCont00032"
[85] "Y-FGCZCont00329" "Y-FGCZCont00193" "Y-FGCZCont00150"
[88] "Y-FGCZCont00195" "Y-FGCZCont00414" "Y-FGCZCont00064"
[91] "Y-FGCZCont00110" "Y-FGCZCont00037" "Y-FGCZCont00251"
[94] "Y-FGCZCont00077" "Y-FGCZCont00127" "Y-FGCZCont00161"
[97] "Y-FGCZCont00286" "Y-FGCZCont00052"
#they can be safely removed
idTab.miss <- filter(idTab.miss, !is.na(hgnc_symbol)) %>%
dplyr::select(-ensembl_gene_id)
Get Ensembl IDs for those proteins using the gene symbol
ids <- idTab.miss$hgnc_symbol
anno <- getBM(attributes=c('hgnc_symbol','ensembl_gene_id'),
filters = 'hgnc_symbol',
values = ids,
mart = ensembl)
idTab.miss <- left_join(idTab.miss, anno, by = "hgnc_symbol")
For genes whose Ensembl ID still cannot be found, use manual annotation
mapSE <- structure(c("ENSG00000180389","ENSG00000282100","ENSG00000180448",
NA,NA,NA,"ENSG00000177144","ENSG00000223614","ENSG00000160221",
"ENSG00000280071","ENSG00000278615","ENSG00000282651",
"ENSG00000275895"),
names=c("ATP5F1EP2","HSP90AB4P","ARHGAP45","IGHV3-43D","GAGE7","IGHV3-30-3","NUDT4B",
"ZNF735","GATD3A","GATD3B","C11orf98","IGHV5-10-1","U2AF1L5"))
idTab.miss <- mutate(idTab.miss, ensembl_gene_id = ifelse(is.na(ensembl_gene_id), mapSE[hgnc_symbol], ensembl_gene_id))
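The manual fallback relies on indexing a named vector by name. The sketch below shows the behaviour on placeholder values in base R; the gene symbols and IDs are invented for illustration and are not real annotations.

```r
# Hypothetical symbol-to-Ensembl lookup; the IDs are placeholders, not real annotations.
manualMap <- c(GENE_A = "ENSG_PLACEHOLDER_1", GENE_B = NA, GENE_C = "ENSG_PLACEHOLDER_3")

toyTab <- data.frame(hgnc_symbol = c("GENE_A", "GENE_B", "GENE_D", "GENE_C"),
                     ensembl_gene_id = c(NA, NA, NA, "ENSG_FROM_BIOMART"),
                     stringsAsFactors = FALSE)

# Indexing a named vector by name returns NA for names it does not contain,
# so unknown symbols (GENE_D) and explicit NA entries (GENE_B) both stay NA.
fallback <- unname(manualMap[toyTab$hgnc_symbol])

# Only fill in rows that are still missing; existing IDs are left untouched.
toyTab$ensembl_gene_id <- ifelse(is.na(toyTab$ensembl_gene_id),
                                 fallback,
                                 toyTab$ensembl_gene_id)
print(toyTab)
```

Rows that remain `NA` after this step are the ones that need to be handled separately later on.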
Update annotations for proteins with missing annotation
idTab <- bind_rows(filter(idTab, !is.na(ensembl_gene_id)), idTab.miss)
Retrieve chromosome information (using ensembl id)
#firstly based on id
ids <- idTab$ensembl_gene_id
anno <- getBM(attributes=c('ensembl_gene_id','chromosome_name'),
filters = 'ensembl_gene_id',
values = ids,
mart = ensembl)
#some proteins can be mapped to several genes; this is normal and mostly happens with histones.
#distinct() removes the extra mappings and keeps only one per protein, which should be fine.
idTab <- mutate(idTab,
                chromosome_name = anno[match(idTab$ensembl_gene_id,anno$ensembl_gene_id),]$chromosome_name) %>%
  filter(!grepl("CHR|HG",chromosome_name)) %>%
  arrange(ensembl_gene_id) %>%
  distinct(uniprotID, .keep_all = TRUE)
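Which mapping survives the de-duplication depends on the sort order. A small base R sketch with invented IDs (all names hypothetical) makes this explicit: after sorting by gene ID, the first row per protein is kept.

```r
# Toy mapping table: protein P11111 maps to two genes (IDs are made up).
toyTab <- data.frame(uniprotID = c("P11111", "P11111", "Q22222"),
                     ensembl_gene_id = c("ENSG_TOY_B", "ENSG_TOY_A", "ENSG_TOY_C"),
                     stringsAsFactors = FALSE)

# Sort by gene ID, then keep the first row for each protein.
sorted <- toyTab[order(toyTab$ensembl_gene_id), ]
dedup <- sorted[!duplicated(sorted$uniprotID), ]
print(dedup)
```

Here P11111 ends up with `ENSG_TOY_A` simply because it sorts first, so the retained gene is an arbitrary but reproducible choice.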
filter(idTab, is.na(chromosome_name))
# A tibble: 7 x 5
ID uniprotID hgnc_symbol ensembl_gene_id chromosome_name
<chr> <chr> <chr> <chr> <chr>
1 P0DN76;Q01081 P0DN76 U2AF1L5 ENSG00000275895 <NA>
2 E9PRG8 E9PRG8 C11orf98 ENSG00000278615 <NA>
3 A0A0B4J2D5;P0DPI2 A0A0B4J2… GATD3B ENSG00000280071 <NA>
4 Q58FF6 Q58FF6 HSP90AB4P ENSG00000282100 <NA>
5 A0A0C4DH38;A0A0J9Y… A0A0J9YX… IGHV5-10-1 ENSG00000282651 <NA>
6 P01782;P0DP04 P0DP04 IGHV3-43D <NA> <NA>
7 P0DP02 P0DP02 IGHV3-30-3 <NA> <NA>
For proteins whose chromosome cannot be annotated, manually add chromosome information
chrMap <- structure(c("15","21","11","14","14","21"),
names = c("Q58FF6","A0A0B4J2D5",
"E9PRG8","A0A0J9YXX1",
"P0DP04","P0DN76"))
idTab <- mutate(idTab, chromosome_name = ifelse(is.na(chromosome_name),
chrMap[uniprotID],chromosome_name))
Annotate
rawTab <- sumToTiday(protCLL, rowID = "rowID", colID = "patID")
rawTab <- left_join(rawTab, idTab, by = c(name = "ID")) %>%
filter(!is.na(uniprotID)) %>% select(-rowID, -ID)
normTab <- sumToTiday(protCLL_norm, rowID = "rowID", colID = "patID")
normTab <- left_join(normTab, idTab, by = c(name = "ID")) %>%
filter(!is.na(uniprotID)) %>% select(-rowID, -ID)
In the LUMOS dataset, one protein identifier can map to several proteins and genes, probably because some peptides cannot be assigned uniquely to a single protein. In this analysis, I simply duplicated the values across the proteins that share an identifier.
Combine table
protCLL_raw <- tidyToSum(rawTab, "uniprotID","patID", values = "count",
annoCol = colnames(patAnno)[colnames(patAnno)!="patID"],
annoRow = c("name","ensembl_gene_id","hgnc_symbol","chromosome_name"))
protCLL <- tidyToSum(normTab, "uniprotID","patID", values = c("count", "QRILC"),
annoCol = colnames(patAnno)[colnames(patAnno)!="patID"],
annoRow = c("name","ensembl_gene_id","hgnc_symbol","chromosome_name"))
Annotate proteins that cannot be uniquely mapped
#for all proteins
dupTab <- rowData(protCLL_raw) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  rownames_to_column("id") %>%
  group_by(name) %>%
  summarise(n = length(id)) %>%
  filter(n > 1)
rowData(protCLL)$uniqueMap <- ! rowData(protCLL)$name %in% dupTab$name
rowData(protCLL_raw)$uniqueMap <- ! rowData(protCLL_raw)$name %in% dupTab$name
#a list of proteins that can not be uniquely mapped
unique(rowData(protCLL[!rowData(protCLL)$uniqueMap,])$hgnc_symbol)
[1] "NUDT4B" "GATD3B" "POM121C" "SNRPGP15" "ZNF732"
[6] "GPR89A" "WASH3P" "NACA" "TIMM23" "MYL12B"
[11] "TOX3" "RBFOX2" "EIF1B" "PHF2" "UGT2B17"
[16] "PIAS1" "PIAS2" "TOX4" "KRT6A" "MT2A"
[21] "MT1E" "HIST1H2AB" "CDK1" "PRSS1" "H2AFZ"
[26] "HIST1H2AK" "ZNF735" "GPR89B" "SULT1A3" "SULT1A4"
[31] "HSPA1A" "HSPA1B" "U2AF1L5" "IGLC2" "IGLC3"
[36] "CALM1" "CALM2" "CALM3" "GATD3A" "SLC2A3"
[41] "MT1G" "SNRPB" "ZNF43" "MYL12A" "CDK2"
[46] "RBMS1" "HIST1H2BB" "EIF1" "KRT6C" "SULT1A1"
[51] "UBE2D1" "UGT2B15" "ATP5E" "ACTB" "ARF3"
[56] "SNRPG" "ACTA2" "RPL39" "GNAS" "SNRPN"
[61] "ACTG1" "ACTG2" "HIST1H3B" "NOMO3" "MT1X"
[66] "ARF1" "H3F3B" "CCZ1B" "CCZ1" "CDK3"
[71] "U2AF1" "ZNF85" "RAB31" "GTF2H2" "ZNF141"
[76] "HIST2H2AC" "HIST2H2BE" "ORMDL2" "RPL39P5" "ALG10"
[81] "ALG10B" "NOMO2" "HIST2H2BF" "TIMM23B" "ATP5EP2"
[86] "HIST2H2AA3" "LAIR1" "LAIR2" "SLC25A25" "GTF2H2C"
[91] "WASH2P" "RBMS3" "HIST2H3D" "H2AFV" "HIST3H2A"
[96] "MOB1B" "ZNF679" "MT1M" "LILRB2" "LILRB1"
[101] "PRSS3P2" "SLC2A14" "HIST1H2BH" "SLC25A36" "POM121"
[106] "FEM1C" "HIST1H2AH" "CNOT6L" "ZNF681" "CLIC6"
[111] "HIST1H2AJ" "SLC25A33" "FEM1A" "H2AFJ" "SLC25A23"
[116] "TRIM34" "PANK2" "MOB1A" "PANK3" "SPINK5"
[121] "RBFOX1" "CLIC5" "NUDT4" "ORMDL1" "RAB22A"
[126] "CNOT6" "PHF8" "UBE2D4"
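The unique-mapping flag can be illustrated on a toy annotation table. This is a base R sketch with made-up accessions (`toyRows` and the other names are hypothetical): a row counts as uniquely mapped when its protein-group name occurs only once.

```r
# Toy row annotation: "name" is the reported protein group, "uniprotID" one member.
toyRows <- data.frame(uniprotID = c("P11111", "P22222", "Q33333", "O44444"),
                      name = c("P11111;P22222", "P11111;P22222", "Q33333", "O44444"),
                      stringsAsFactors = FALSE)

# Count how many rows share each group name.
nPerName <- table(toyRows$name)
dupNames <- names(nPerName)[nPerName > 1]

# A row is uniquely mapped when its group name occurs exactly once.
toyRows$uniqueMap <- !toyRows$name %in% dupNames
print(toyRows)
```

Downstream analyses can then filter on `uniqueMap` when duplicated measurements would bias a result.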
Save object
#for other projects
save(protCLL, protCLL_raw, file = "../../var/proteomic_LUMOS_20200430.RData")
#for this project
save(protCLL, protCLL_raw, file = "../output/proteomic_LUMOS_20200430.RData")
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.15.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0
[3] dplyr_0.8.5 purrr_0.3.3
[5] readr_1.3.1 tidyr_1.0.0
[7] tibble_3.0.0 tidyverse_1.3.0
[9] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
[11] BiocParallel_1.18.0 matrixStats_0.54.0
[13] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0
[15] IRanges_2.18.1 S4Vectors_0.22.0
[17] biomaRt_2.40.0 DEP_1.6.1
[19] jyluMisc_0.1.5 vsn_3.52.0
[21] Biobase_2.44.0 BiocGenerics_0.30.0
[23] pheatmap_1.0.12 cowplot_0.9.4
[25] ggplot2_3.3.0 limma_3.40.2
loaded via a namespace (and not attached):
[1] utf8_1.1.4 shinydashboard_0.7.1 gmm_1.6-2
[4] tidyselect_1.0.0 RSQLite_2.1.1 AnnotationDbi_1.46.0
[7] htmlwidgets_1.3 grid_3.6.0 norm_1.0-9.5
[10] maxstat_0.7-25 munsell_0.5.0 codetools_0.2-16
[13] preprocessCore_1.46.0 DT_0.7 withr_2.1.2
[16] colorspace_1.4-1 knitr_1.23 rstudioapi_0.10
[19] ggsignif_0.5.0 mzID_1.22.0 labeling_0.3
[22] git2r_0.26.1 slam_0.1-45 GenomeInfoDbData_1.2.1
[25] KMsurv_0.1-5 farver_2.0.3 bit64_0.9-7
[28] rprojroot_1.3-2 vctrs_0.2.4 generics_0.0.2
[31] TH.data_1.0-10 xfun_0.8 sets_1.0-18
[34] R6_2.4.0 doParallel_1.0.14 clue_0.3-57
[37] bitops_1.0-6 fgsea_1.10.0 assertthat_0.2.1
[40] promises_1.0.1 scales_1.1.0 multcomp_1.4-10
[43] gtable_0.3.0 affy_1.62.0 sandwich_2.5-1
[46] workflowr_1.6.0 rlang_0.4.5 cmprsk_2.2-8
[49] mzR_2.18.1 GlobalOptions_0.1.0 splines_3.6.0
[52] impute_1.58.0 hexbin_1.27.3 broom_0.5.2
[55] modelr_0.1.5 BiocManager_1.30.4 yaml_2.2.0
[58] abind_1.4-5 backports_1.1.4 httpuv_1.5.1
[61] tools_3.6.0 relations_0.6-8 affyio_1.54.0
[64] ellipsis_0.2.0 gplots_3.0.1.1 RColorBrewer_1.1-2
[67] MSnbase_2.10.1 Rcpp_1.0.1 plyr_1.8.4
[70] progress_1.2.2 visNetwork_2.0.7 zlibbioc_1.30.0
[73] RCurl_1.95-4.12 prettyunits_1.0.2 ggpubr_0.2.1
[76] GetoptLong_0.1.7 zoo_1.8-6 haven_2.2.0
[79] cluster_2.1.0 exactRankTests_0.8-30 fs_1.4.0
[82] magrittr_1.5 data.table_1.12.2 openxlsx_4.1.0.1
[85] circlize_0.4.6 reprex_0.3.0 survminer_0.4.4
[88] pcaMethods_1.76.0 mvtnorm_1.0-11 whisker_0.3-2
[91] ProtGenerics_1.16.0 hms_0.5.2 shinyjs_1.0
[94] mime_0.7 evaluate_0.14 xtable_1.8-4
[97] XML_3.98-1.20 rio_0.5.16 readxl_1.3.1
[100] gridExtra_2.3 shape_1.4.4 compiler_3.6.0
[103] KernSmooth_2.23-15 ncdf4_1.16.1 crayon_1.3.4
[106] htmltools_0.4.0 mgcv_1.8-28 later_0.8.0
[109] lubridate_1.7.4 DBI_1.0.0 dbplyr_1.4.2
[112] ComplexHeatmap_2.0.0 MASS_7.3-51.4 tmvtnorm_1.4-10
[115] Matrix_1.2-17 car_3.0-3 cli_1.1.0
[118] imputeLCMD_2.0 marray_1.62.0 gdata_2.18.0
[121] igraph_1.2.4.1 pkgconfig_2.0.2 km.ci_0.5-2
[124] foreign_0.8-71 piano_2.0.2 xml2_1.2.2
[127] MALDIquant_1.19.3 foreach_1.4.4 XVector_0.24.0
[130] drc_3.0-1 rvest_0.3.5 digest_0.6.19
[133] rmarkdown_1.13 cellranger_1.1.0 fastmatch_1.1-0
[136] survMisc_0.5.5 curl_3.3 shiny_1.3.2
[139] gtools_3.8.1 rjson_0.2.20 lifecycle_0.2.0
[142] nlme_3.1-140 jsonlite_1.6 carData_3.0-2
[145] fansi_0.4.0 pillar_1.4.3 lattice_0.20-38
[148] httr_1.4.1 plotrix_3.7-6 survival_2.44-1.1
[151] glue_1.3.2 zip_2.0.2 png_0.1-7
[154] iterators_1.0.10 bit_1.1-14 stringi_1.4.3
[157] blob_1.1.1 memoise_1.1.0 caTools_1.17.1.2