This document was prepared on Fri Jun 22 16:57:42 2018.
This document represents the analysis steps used to analyze quantitative proteomics data as produced by Proteomics Discoverer (PD) software. The PD software was configured to export individual peptide abundances including:
The raw abundances are used for this analysis, however the other abundances are imported and are readily available throughout the workflow.
The import files are in Excel format, and include these files:
basedir <- path.expand("~/Projects/London/londonBALF");
pepFilesRaw <- list.files(path=file.path(basedir, "PD_peptide_data_raw"),
pattern="^BALF.*.xlsx$",
full.names=TRUE);
names(pepFilesRaw) <- gsub("BALF_|[.]xlsx$", "",
basename(pepFilesRaw));
kable(data.frame(`Peptide input files`=pepFilesRaw)) %>%
kable_styling() %>%
row_spec(0, background="#DDDDDD")
| Peptide.input.files | |
|---|---|
| LM | /Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LM.xlsx |
| LS | /Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LS.xlsx |
| LT | /Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LT.xlsx |
First, we load some custom R functions:
cbioPath <- "/Users/wardjm/Builds/NIEHS_IntegrativeBIx-svn/Bpipe/bpipeScripts/CBioRUtils.R";
CBioRUtils <- new.env(parent=.GlobalEnv);
sys.source(cbioPath, envir=CBioRUtils, keep.source=TRUE);
assign("CBioRUtils", CBioRUtils, envir=.GlobalEnv);
attach(CBioRUtils, warn=FALSE, pos=1);
This step calls a custom function processPDdata() which encapsulates several data curation steps. The result is an R object type SummarizedExperiment which is a common standard format for most omics analysis workflows.
Create a unique row identifier, using the peptide sequence, and the PTM string. This combination of values is sufficient and necessary to produce a unique identifier with these properties:
Derive the PTM string from the PD column “Modifications” by taking the first six character of the type of modification, then concatenating the sites with an underscore. For example:
Identify abundance column types, and create a separate data matrix for each abundance type, including:
Annotate samples by identifiers to describe the experimental factors, for example the sample “F13:.Sample,.LM,.1,.WT,.1” becomes annotated:
Note: there is a curation step introduced here which ensures the samples are consistently annotated. The parameters curateFromSampleID and curateToSampleID are used as a regular expression find and replace.
Assign gene symbols by converting the PD-assigned protein accession numbers from RefSeq protein to Entrez Gene ID. Then convert to unique Entrez gene symbol.
Recognize “Found” flags and create a data matrix representing the various PD flags indicating their flags:
curateFromSampleID <- c("^(F[0-9]+:.Sample,.)(WT|KO),.([0-9]+),.([0-9]+)$",
"(F([123]):.Sample,.LM,.1,.WT,.)n[/]a",
"(LM_)LM_|(LS_)LS_|(LT_)LT_");
curateToSampleID <- c("\\1LM,.\\3,.\\2,.\\4",
"\\1\\2",
"\\1\\2\\3");
allNormPepRawSEL <- lapply(nameVectorN(pepDataRawL), function(iName){
#printDebug("iName:", iName);
iDF <- pepDataRawL[[iName]];
iAllNormSE <- processPDdata(iDF,
iName=iName,
returnType="SummarizedExperiment",
sampleColnames=c("SampleCode","Blank","Line","Animal","Genotype", "Rep"),
abundancesGrep=c(Raw="^abundance:.",
Scaled="Abundances.[(]Scaled[)]",
Normalized="Abundances.[(]Normalized[)]",
Count="Abundances.Count"),
groupColnames=c("Experiment","Line","Genotype"),
labelColnames=c("Experiment","Line","Genotype","Animal","Rep"),
sampleCurateFrom=curateFromSampleID,
sampleCurateTo=curateToSampleID,
orgDb="org.Mm.eg.db",
keepFullTable=TRUE,
verbose=FALSE);
iAllNormSE;
});
kable(sdim(allNormPepRawSEL)) %>%
kable_styling() %>%
row_spec(0, background="#DDDDDD");
| rows | cols | class | |
|---|---|---|---|
| LM | 17087 | 36 | SummarizedExperiment |
| LS | 14717 | 39 | SummarizedExperiment |
| LT | 15856 | 39 | SummarizedExperiment |
This step creates one large data matrix, whose rownames are defined using the unique peptide_PTM identifier described above. No values are modifed during this step, its main purpose is to ensure all experimental data shares a common row (peptide) identifier for any subsequent comparisons.
Note: Accession numbers are retained during this step, however we noticed that accession numbers are not consistently assigned to accession numbers across different PD experiments. The Accession numbers represent the union of all accession numbers assigned to each peptide-PTM entry.
allNormPepRawUnionSE <- unionSEL(allNormPepRawSEL,
probeID="probeID",
geneKeepColnames=c("probeID",
"geneSymbol", "SeqPTM", "EG", "Accession"),
verbose=FALSE);
kable(sdim(allNormPepRawUnionSE)) %>%
kable_styling() %>%
row_spec(0, background="#DDDDDD");
| rows | cols | class | |
|---|---|---|---|
| colData | 114 | 15 | DataFrame |
| assays | 24112 | 114 | ShallowSimpleListAssays |
| NAMES | 24112 | character | |
| elementMetadata | 24112 | 5 | DataFrame |
| metadata | 3 | list |
Change row names to geneSymbol_peptide_PTM, to make it easier to understand the relationship between gene, peptide, and PTM for each peptide.
For each gene, peptide sequences are sorted by size, alphabetically, then numbered.
The “PTM number” is similarly defined by the sorted PTMs observed for each peptide.
For example, the gene Apoe has a peptide “NEVHTMLGQSTEEIR” which was observed with no PTM, and two PTMs. In this case, the peptide is renamed “pep13”, and the two PTMs are named “ptm1” and “ptm2”. For this peptide here are the three new names:
To summarize for one peptide of Apoe:
| Gene | Peptide | PTM | New_Name |
|---|---|---|---|
| Apoe | NEVHTMLGQSTEEIR | Apoe_pep13 | |
| Apoe | NEVHTMLGQSTEEIR | 1xDeamQ9 | Apoe_pep13_ptm1 |
| Apoe | NEVHTMLGQSTEEIR | 1xOxidM6 | Apoe_pep13_ptm2 |
allNormPepRawUnionSE <- addPeptidePTMLabels(allNormPepRawUnionSE);
## Add indication of how many genes are associated with each peptide
rowData(allNormPepRawUnionSE)$GeneCount <- lengths(
strsplit(rowData(allNormPepRawUnionSE)$geneSymbol, ","));
## Filter for peptide rows with counts
iGenesPepRaw <- rownames(removeBlankRowCols(allNormPepRawUnionSE));
For example we can display the actual data for selected genes.
## Show the full table for selected genes
keepGenes <- c("Apoe","Hbb-b2");
iDF <- mixedSortDF(as.data.frame(
subset(rowData(allNormPepRawUnionSE), geneSymbol %in% keepGenes)[,
c("geneSymbol","GenePep","PTMnum","Peptide","PTM","GenePepPTM")]))
colorSubPepNum <- group2colors(unique(iDF[,"GenePep"]));
iDF <- kable_coloring(iDF,
colorSub=colorSubPepNum,
row_color_by="GenePep",
row.names=FALSE,
returnType="kable") %>%
collapse_rows(columns=1:2, valign="middle") %>%
row_spec(0, background="#DDDDDD")
iDF;
| geneSymbol | GenePep | PTMnum | Peptide | PTM | GenePepPTM |
|---|---|---|---|---|---|
| Apoe | Apoe_pep1 | FWDYLR | Apoe_pep1 | ||
| Apoe_pep2 | LGPLVEQGR | Apoe_pep2 | |||
| Apoe_pep3 | MEEQTQQIR | Apoe_pep3 | |||
| Apoe_pep4 | GRLEEVGNQAR | Apoe_pep4 | |||
| Apoe_pep5 | LGADMEDLRNR | Apoe_pep5 | |||
| Apoe_pep6 | LGKEVQAAQAR | Apoe_pep6 | |||
| Apoe_pep7 | SKMEEQTQQIR | Apoe_pep7 | |||
| Apoe_pep8 | GWFEPIVEDMHR | Apoe_pep8 | |||
| Apoe_pep9 | TANLGAGAAQPLR | Apoe_pep9 | |||
| Apoe_pep10 | DRLEEVREHMEEVR | Apoe_pep10 | |||
| ptm1 | DRLEEVREHMEEVR | 1xOxidM10 | Apoe_pep10_ptm1 | ||
| Apoe_pep11 | ELEEQLGPVAEETR | Apoe_pep11 | |||
| Apoe_pep12 | LKGWFEPIVEDMHR | Apoe_pep12 | |||
| Apoe_pep13 | NEVHTMLGQSTEEIR | Apoe_pep13 | |||
| ptm1 | NEVHTMLGQSTEEIR | 1xDeamQ9 | Apoe_pep13_ptm1 | ||
| ptm2 | NEVHTMLGQSTEEIR | 1xOxidM6 | Apoe_pep13_ptm2 | ||
| Apoe_pep14 | TANLGAGAAQPLRDR | Apoe_pep14 | |||
| Apoe_pep15 | AYKKELEEQLGPVAEETR | Apoe_pep15 | |||
| Apoe_pep16 | IQASVATNPIITPVAQENQ | Apoe_pep16 | |||
| ptm1 | IQASVATNPIITPVAQENQ | 1xDeamQ.N | Apoe_pep16_ptm1 | ||
| Hbb-b2 | Hbb-b2_pep1 | VNPDEVGGEALGR | Hbb-b2_pep1 | ||
| ptm1 | VNPDEVGGEALGR | 1xDeamN2 | Hbb-b2_pep1_ptm1 |
SummarizedExperiment documentation
The object allNormPepRawUnionSE object is SummarizedExperiment, which contains a data matrix with Raw, Normalized, Scaled, and Counts data directly obtained from PD software.
assays(allNormPepRawUnionSE)[["Raw"]].rowData(allNormPepRawUnionSE), as a data.frame which contains peptide and gene annotations. Each row describes peptide measurements, including the peptide sequence, PTM string, Entrez gene ID, gene symbol, etc.colData(allNormPepRawUnionSE), as a data.frame. Each row describes one biological sample, including the mouse line (LM, LS, LT), genotype (WT, KO), animal number (1, 2, 3, 4, 5, 6), and technical replicate (1, 2, 3).The allNormPepRawUnionSE object will also store the normalized data matrix, the statistical design, and statistical comparisons.
Now that we have prepared the data matrix, we can describe some basic features of the data, first is the number of rows containing any measurement.
######################################################################
## Filter for peptide rows with counts
peptideSummaryDF <- data.frame(check.names=FALSE,
`Summary Stats`=c(
`Total Peptide Rows`=formatInt(nrow(allNormPepRawUnionSE)),
`Peptide Rows with Data`=formatInt(length(iGenesPepRaw))
));
kable(peptideSummaryDF) %>%
kable_styling() %>%
row_spec(0, background="#DDDDDD");
| Summary Stats | |
|---|---|
| Total Peptide Rows | 24,112 |
| Peptide Rows with Data | 10,481 |
Note that most of the peptide rows where PD assigned abundance values are also rows without PTMs:
## add a column "HasData" to rowData(allNormPepRawUnionSE)
rowData(allNormPepRawUnionSE)[["HasData"]] <- (rownames(allNormPepRawUnionSE) %in%
iGenesPepRaw);
## Calculate a frequency table of "HasData" with "HasPTM"
hasDataPTMtable <- table(
HasPTM=ifelse(!rowData(allNormPepRawUnionSE)$PTM %in% "",
"hasPTM", "noPTM"),
HasData=ifelse(rowData(allNormPepRawUnionSE)$HasData,
"hasData", "noData"));
kable(hasDataPTMtable) %>%
kable_styling() %>%
row_spec(0, background="#DDDDDD")
| hasData | noData | |
|---|---|---|
| hasPTM | 1624 | 6409 |
| noPTM | 8857 | 7222 |
For consistency in downstream figures, we define a full data.frame with experimental factors, with corresponding colors.
#####################################
## Vector of all sample identifiers
iSamplesPepRaw <- rownames(allNormPepRawUnion$targets);
#####################################
## Define colors
## Line and Genotype
colorSubLine <- c(
LM="brown3",
LS="darkorange",
LT="purple4",
KO="royalblue",
WT="red4");
## Sample Groups
colorSubBase <- c(
LM_WT="brown3",
LS_WT="salmon",
LT_WT="coral3",
LM_KO="darkslateblue",
LS_KO="royalblue1",
LT_KO="deepskyblue4");
## Replicate numbers
colorSubRep <- nameVector(
color2gradient("grey40", n=6, gradientWtFactor=2),
1:6);
## color by unique animal number
iSamplesPepRawAni <- (unique(gsub("_[0-9]$", "", iSamplesPepRaw)));
colorSubAni <- nameVector(colorSubBase[gsub("^L[ST]_(L)|_[0-9].*$", "\\1",
iSamplesPepRawAni)],
iSamplesPepRawAni);
colorSubAni2 <- color2gradient(gradientWtFactor=1/2,
colorSubAni[sort(names(colorSubAni))]);
colorSubTechRep <- nameVector(colorSubAni2[gsub("^L[ST]_(L)|_[0-9]$", "\\1",
iSamplesPepRaw)],
iSamplesPepRaw);
## Put the colors together into a named vector
colorSubPepRaw <- c(colorSubLine,
colorSubBase,
colorSubAni2,
colorSubTechRep,
colorSubRep);
########################################################
## Make a data.frame with experiment design
targetsPepRawDF <- allNormPepRawUnion$targets[,c("Experiment","groupName","Line",
"Animal","Genotype","Rep","sampleLabel")];
targetsPepRawDF$Experiment <- makeFactor(targetsPepRawDF$Experiment);
targetsPepRawDF$Genotype <- makeFactor(targetsPepRawDF$Genotype,
provigrepPattern=c("WT","KO","."));
targetsPepRawDF$Line <- makeFactor(targetsPepRawDF$Line);
targetsPepRawDF$Animal <- asFactor(targetsPepRawDF$Animal);
targetsPepRawDF$groupName <- gsubOrdered("LM_(LM)|LS_(LS)|LT_(LT)", "\\1\\2\\3",
pasteByRowOrdered(targetsPepRawDF[,c("Experiment","Line","Genotype")]));
targetsPepRawDF$groupAnimal <- pasteByRowOrdered(targetsPepRawDF[,c("groupName","Animal")]);
## Sort the data.frame in proper factor order
targetsPepRawDF <- mixedSortDF(targetsPepRawDF);
## Define a data.frame of colors using the named color vector
targetsPepRawDFcolors <- df2groupColors(targetsPepRawDF,
colorSub=colorSubPepRaw);
par("mfrow"=c(1,1), "mar"=c(8,4,4,2));
imageByColors(targetsPepRawDFcolors,
cellnote=targetsPepRawDF,
yaxt="n",
cexCellnote=makeCexCellnote(targetsPepRawDF));
Data is normalized by using the log2 abundance values, and median scaling with a method from the limma R package.
## Normalize
allNormPepRawUnionSE <- doNorm(allNormPepRawUnionSE,
iGenes=iGenesPepRaw,
keepZeros=FALSE,
signal="Raw",
methodList=c("median"));
## ## (17:20:28) 21Jun2018: doNorm(): Applying normalizations for signal: 'Raw'
## ## (17:20:28) 21Jun2018: doNorm(): normMethod: FALSE
## Define ordered sample identifiers
iSamplesPepRaw <- rownames(colData(allNormPepRawUnionSE));
sortColnames <- c("Experiment", "Genotype", "Line", "Animal", "Rep");
iSamplesUnionO <- rownames(mixedSortDF(colData(allNormPepRawUnionSE)[iSamplesPepRaw,sortColnames]));
## Extract the data matrix of normalized values
iMatrixPepRawGmed <- assays(allNormPepRawUnionSE)[["median_Raw"]][iGenesPepRaw,iSamplesUnionO];
One potential insight into the sample preparation relates to the observation of lung lavage bleeding, which may be evident by the presence of hemoglobin peptides.
First, pull together hemoglobin genes with peptide abundance data:
## Hemoglobin gene names to Entrez Gene ID
hbEG <- names(vigrep("hemoglobin", as.list(org.Mm.egGENENAME)));
hbSymbols <- mgetMapped(hbEG, org.Mm.egSYMBOL);
hbGeneNames <- mgetMapped(hbEG, org.Mm.egGENENAME);
hbProbeIDs <- intersect(iGenesPepRaw,
provigrep(paste0("(^|,)", hbSymbols, "($|_|,)"), rownames(allNormPepRawUnionSE)));
## Remove single-peptide entries for now
hbGeneTc <- tcount(gsub("_.+", "", hbProbeIDs));
hbGeneTc1 <- names(hbGeneTc[hbGeneTc %in% 1]);
hbProbeIDs <- hbProbeIDs[!gsub("_.+", "", hbProbeIDs) %in% hbGeneTc1];
hbDF <- data.frame(hbEG=hbEG,
hbSymbols=hbSymbols,
hbGeneNames=hbGeneNames)
## Subset for the peptides being visualized
knitr::kable(subset(hbDF,
hbEG %in% rowData(allNormPepRawUnionSE[hbProbeIDs,])[,"EG"])) %>%
kable_styling();
| hbEG | hbSymbols | hbGeneNames | |
|---|---|---|---|
| 15130 | 15130 | Hbb-b2 | hemoglobin, beta adult minor chain |
| 110257 | 110257 | Hba-a2 | hemoglobin alpha, adult chain 2 |
| 216635 | 216635 | Hbq1a | hemoglobin, theta 1A |
| 100503605 | 100503605 | Hbb-bs | hemoglobin, beta adult s chain |
Next, produce a heatmap of abundances, using the median-normalized abundance values, visualizing all technical replicates.
## iMatrix
iMatrixPepRawG <- log2(1+assays(allNormPepRawUnionSE[iGenesPepRaw,iSamplesUnionO])[["Raw"]]);
## color scale visual max
farbeLim <- sqrt(400000000);
## Make a color key to add to the left of the heatmap
iRowMeans <- rowMeans(na.rm=TRUE,
(2^(iMatrixPepRawGmed[iGenesPepRaw,iSamplesUnionO])-1));
iRowMeanNote <- data.frame(mean_abundance=format(
justify="right",
big.mark=",",
scientific=FALSE,
iRowMeans));
iRowMeanColors <- data.frame(mean_abundance=vals2colorLevels(
symmetricZero=TRUE,
farbeLim=farbeLim,
colorRamp="RdGy_r",
rmNA(naValue=,
sqrt(iRowMeans))));
## Ordered sample identifiers
iSamplesUnionO <- rownames(targetsPepRawDF);
iSamplesUnionO1 <- unvigrep("L._LM", iSamplesUnionO);
## Produce a heatmap
## heatmap using hemoglobins only, color scale sqrt
heatmap.3(rmNA(naValue=-10000,
sqrt(2^iMatrixPepRawGmed[hbProbeIDs,iSamplesUnionO1]-1),
),
cexCol=0.7,
col="RdGy_r",
srtCol=90,
Colv=FALSE,
add.expr=box(),
rowGroups=gsub("_.*$", "", nameVector(hbProbeIDs)),
sepcolor="black",
symkey=FALSE,
keyMargins=c(17,7,5,2),
keyLabel="",
keysize=c(0.7,1.2),
tracecol="transparent",
farbeLim=farbeLim,
margins=c(15,15),
densityScale="sqrt",
main="PD Abundances",
srtRowSideColorsNote=0,
lwidRowSideColors=0.7,
RowSideColors=iRowMeanColors[hbProbeIDs,,drop=FALSE],
RowSideColorsNote=iRowMeanNote[hbProbeIDs,,drop=FALSE],
ColSideColors=t(targetsPepRawDFcolors[iSamplesUnionO1,4:8]),
ColSideColorsNote=t(targetsPepRawDF[iSamplesUnionO1,4:8]),
lheiColSideColors=1.3,
cexColSideColorsNote=(makeCexCellnote(targetsPepRawDF[iSamplesUnionO1,8:4],
cex=1.8, expFactor=0.3)),
srtColSideColorsNote=90,
subMain=c("Median normalized",
"Showing only hemoglobin peptides"));
Overall, it would appear that mouse LT_KO_5 might have elevated hemoglobin abundances, however the other mice appear to have relatively consistent variability.
The function CollapseSEbyRow() combines peptide abundances per gene, while also combining the relevant gene annotations into one row per gene.
The groupFuncName="sum" takes the sum of all peptide abundances per gene.
## Exponentiate the median normalized signal
if (max(assays(allNormPepRawUnionSE)$median_Raw, na.rm=TRUE) < 100) {
assays(allNormPepRawUnionSE)$median_Raw <- 2^(assays(allNormPepRawUnionSE)$median_Raw)-1;
}
## Combine peptide rows per gene
allNormPepRawUnionSEgrp <- CollapseSEbyRow(
allNormPepRawUnionSE[iGenesPepRaw,iSamplesUnionO],
signal=c("Raw","median_Raw"),
groupFuncName="sum",
rowDataColnames=c("geneSymbol","EG","Accession","PTMcount","AAcount"),
numShrinkFunc=c(AAcount=mean, PTMcount=mean),
rows=iGenesPepRaw,
rowGroups=nameVector(rowData(allNormPepRawUnionSE[iGenesPepRaw,])
[,c("geneSymbol","GenePepPTM")])
)
## ## (13:06:24) 22Jun2018: CollapseSEbyRow(): Performing these operations by signal:
## groupFuncName rowStatsFunc
## Raw sum NULL
## median_Raw sum NULL
## ## (13:06:24) 22Jun2018: CollapseSEbyRow(): Collapsing signal:Raw
## ## (13:06:26) 22Jun2018: CollapseSEbyRow(): Collapsing signal:median_Raw
## ## (13:06:29) 22Jun2018: CollapseSEbyRow(): Collapsing rowData(SE)
## ## (13:06:30) 22Jun2018: CollapseSEbyRow(): Re-delimiting unique values in column:geneSymbol
## ## (13:06:30) 22Jun2018: CollapseSEbyRow(): Re-delimiting unique values in column:EG
## head(rowData(allNormPepRawUnionSEgrp));
After this operation 24,112 peptide rows were combined into 1,850 gene rows.
To review the variability of technical replicates, we create MA-plots centered by each technical replicate triplicate. The x-axis shows the mean signal for each protein, the y-axis shows the log2 difference from mean per protein, per technical replicate group.
For each panel, the MAD is determined using signal at or above 20 (roughly abundance 1 million), then MAD factors are calculated as the ratio of each sample MAD versus divided by the median MAD. A threshold was set at 2x MAD for outliers, which are shaded in yellow.
There are a number of obvious individual outliers, but there appears to be larger overall MAD factor from technical replicate 2. This observation is most apparent with the “LM” mouse line, mostly because the LM samples had few individual outliers, and very low overall variance. The “LS” mouse line showed higher variance for technical replicate 2, but not above the MAD cutoff since the overall variance was higher than LM.
## MA-plot
par("oma"=c(0,0,4,0));
iLM <- vigrep("^LM_(KO|WT).*[123]$", colnames(allNormPepRawUnionSEgrp));
outliersLM <- c("LM_WT_1_1", "LM_WT_2_2", "LM_WT_4_2",
"LM_WT_6_2", "LM_KO_1_2", "LM_KO_2_2", "LM_KO_3_2",
"LM_KO_4_2", "LM_KO_6_2");
jpLM <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLM])$median_Raw),
ncol=6,
displayMAD=TRUE, outlierMAD=2, outlierRowMin=21,
titleCex=0.9,
titleBoxColor=colorSubPepRaw[iLM],
groupSuffix="",
useMean=TRUE,
colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
controlSamples=setdiff(iLM, outliersLM),
#controlSamples=setdiff(vigrep("_[13]$", iLM), outliersLM),
centerGroups=gsub("_[123]$", "", iLM));
title(main="LM mouse line MA-plot\nprotein data centered per techrep triplet",
outer=TRUE);
outliersLM <- attr(jpLM, "mvaMADoutliers");
## MA-plot
par("oma"=c(0,0,4,0));
iLS <- vigrep("^LS_(KO|WT)", colnames(allNormPepRawUnionSEgrp));
outliersLS <- NULL;
jpLS <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLS])$median_Raw),
ncol=6,
displayMAD=TRUE, outlierMAD=2, outlierRowMin=20,
titleCex=0.9,
titleBoxColor=colorSubPepRaw[iLS],
groupSuffix="",
useMean=TRUE,
colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
controlSamples=vigrep("_[13]$", iLS),
#controlSamples=setdiff(iLS, outliersLS),
#controlSamples=setdiff(vigrep("_[13]$", iLS), outliersLS),
centerGroups=gsub("_[123]$", "", iLS));
title(main="LS mouse line MA-plot\nprotein data centered per techrep triplet",
outer=TRUE);
outliersLS <- attr(jpLS, "mvaMADoutliers");
## MA-plot
par("oma"=c(0,0,4,0));
iLT <- vigrep("^LT_(KO|WT)", colnames(allNormPepRawUnionSEgrp));
outliersLT <- c("LT_WT_6_1", "LT_KO_1_3", "LT_KO_3_3",
"LT_KO_6_3");
jpLT <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLT])$median_Raw),
ncol=6,
displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
titleCex=0.9,
titleBoxColor=colorSubPepRaw[iLT],
groupSuffix="",
useMean=TRUE,
colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
controlSamples=setdiff(iLT, outliersLT),
centerGroups=gsub("_[123]$", "", iLT));
title(main="LT mouse line MA-plot\nprotein data centered per techrep triplet",
outer=TRUE);
outliersLT <- attr(jpLT, "mvaMADoutliers");
Three technical replicates were performed for each animal, after lung lavage fluid was isolate, this fluid was subjected to proteomics analysis three times. These replicates are combined prior to downstream analysis, evaluating each triplicate for outliers, using a MAD factor cutoff of 5. For each triplet, the MAD is calculated, and any point more than 5x the MAD is considered an outlier.
Given the MA-plots of the technical replicates above, individual outlier samples were defined and omitted from this procedure. Notably, any technical triplicate which already had one or more samples removed based upon the MA plot above cannot have any further outliers removed, since two replicates are not sufficient to establish an outlier condition.
TechRepSamples <- setdiff(iSamplesUnionO,
c(outliersLM, outliersLS, outliersLT));
TechRepGroups <- nameVector(
pasteByRow(colData(allNormPepRawUnionSEgrp[,TechRepSamples])[,c("groupName","Animal")]),
TechRepSamples);
## Combine peptide rows per gene
allNormPepRawUnionSEgrp2 <- CollapseSEbyColumn(
allNormPepRawUnionSEgrp[,TechRepSamples],
signal=c("median_Raw"),
colDataColnames=c("Experiment","Line","Animal","Genotype","groupName"),
numShrinkFunc=c(AAcount=mean, PTMcount=mean),
cols=TechRepSamples,
colGroups=TechRepGroups
)
In summary:
We now revisit the MA-plots, using the per-protein data, for each biological sample.
iLM2 <- vigrep("^LM_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
jpLM2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLM2])$median_Raw),
ncol=3,
titleCex=0.9,
titleBoxColor=colorSubPepRaw[iLM2],
displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
groupSuffix="",
useMean=FALSE,
centerGroups=gsub("_[0-9]+$", "", iLM2));
title(main=paste0("LM mouse line MA-plot\n",
"protein data per biological replicate\n",
"centered per KO or WT genotype"),
outer=TRUE);
outliersLM2 <- attr(jpLM2, "mvaMADoutliers");
iLS2 <- vigrep("^LS_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
jpLS2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLS2])$median_Raw),
ncol=3,
titleCex=0.9,
blankPlotPos=6,
titleBoxColor=colorSubPepRaw[iLS2],
displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
groupSuffix="",
useMean=FALSE,
centerGroups=gsub("_[0-9]+$", "", iLS2));
title(main=paste0("LS mouse line MA-plot\n",
"protein data per biological replicate\n",
"centered per KO or WT genotype"),
outer=TRUE);
outliersLS2 <- attr(jpLS2, "mvaMADoutliers");
iLT2 <- vigrep("^LT_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
outliersLT2 <- NULL;
jpLT2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLT2])$median_Raw),
ncol=3,
titleCex=0.9,
titleBoxColor=colorSubPepRaw[iLT2],
displayMAD=TRUE, outlierMAD=1.6, outlierRowMin=20,
colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
controlSamples=setdiff(iLT2, outliersLT2),
groupSuffix="",
useMean=TRUE,
centerGroups=gsub("_[0-9]+$", "", iLT2));
title(main=paste0("LT mouse line MA-plot\n",
"protein data per biological replicate\n",
"centered per KO or WT genotype"),
outer=TRUE);
outliersLT2 <- attr(jpLT2, "mvaMADoutliers");
Three technical replicates were performed for each animal, which will be combined prior to downstream analysis. The method evaluates each triplicate for outliers, using a MAD factor cutoff of 5, meaning that for each triplet, the MAD is calculated, and any point more than 5x the MAD is considered an outlier.
## combine tech rep data
iSamplesUnionOtechRep <- nameVector(gsub("_[0-9]+$", "", iSamplesUnionO),
iSamplesUnionO);
## Produce a matrix of data after outlier removal
iMatrixPepRawGmedTechMAD <- rowGroupMeans(iMatrixPepRawGmed[,iSamplesUnionO],
useMedian=FALSE,
returnType="input",
madFactor=5,
rmOutliers=TRUE,
groups=iSamplesUnionOtechRep[iSamplesUnionO]);
## Summary table of outliers from the raw data, versus outliers
## after MAD outlier filtering
outlierDF <- table2dataFrame(table(
MADoutliers=is.na(iMatrixPepRawGmedTechMAD[,iSamplesUnionO]),
rawOutliers=is.na(iMatrixPepRawGmed[,iSamplesUnionO])));
knitr::kable(outlierDF);
| rawOutliers:FALSE | rawOutliers:TRUE | |
|---|---|---|
| MADoutliers:FALSE | 527408 | 0 |
| MADoutliers:TRUE | 31370 | 636056 |
## Summary stats
numMADoutliers <- outlierDF["MADoutliers:TRUE","rawOutliers:FALSE"];
pctMADoutliers <- round(outlierDF["MADoutliers:TRUE","rawOutliers:FALSE"]/sum(outlierDF[,"rawOutliers:FALSE"])*1000)/10;
Overall, 31,370 individual values were removed, roughly 5.6%, and only for triplicates containing three values, where the third value was 5xMAD away from the other two.
Now, combine technical replicates to produce a new data matrix.
## Combine technical replicates, using MAD factor cutoff of 5 for outliers
iMatrixPepRawGmedTechGrp <- rowGroupMeans(iMatrixPepRawGmed[,iSamplesUnionO],
useMedian=FALSE,
madFactor=5,
rmOutliers=TRUE,
groups=iSamplesUnionOtechRep[iSamplesUnionO]);
iSamplesUnionOtech <- colnames(iMatrixPepRawGmedTechGrp);
iSamplesUnionOtech1 <- unvigrep("_LM", iSamplesUnionOtech);
We plotted a histogram showing the number of samples with measured abundance values per peptide, out of 36 total samples. The distribution was so interesting, we further color-coded the bars by the number of sample groups represented. It shows that in general, mouse lines agree closely with one another – a peptide detected in a mouse wildtype line is almost always detected in the same mouse knockout line.
## Look at number of animals with detected peptides by sample group
detectedGroupsV <- nameVector(gsub("_[0-9].*$", "", iSamplesUnionOtech1),
iSamplesUnionOtech1);
detectedGroupCountL <- apply(iMatrixPepRawGmedTechGrp[,iSamplesUnionOtech1], 1, function(i){
detectedGroupsV[!is.na(i)]
});
detectedGroupCount <- lengths(strsplit(cPasteUnique(detectedGroupCountL), ","));
detectedSampleCount <- lengths(detectedGroupCountL);
gplots::barplot2(table(detectedGroupCount, detectedSampleCount),
las=2,
ylim=c(0,1100),
ylab="Number of peptides",
xlab="Number of samples with non-zero abundance per peptide",
col=rainbowCat(6));
legend("topleft",
cex=0.8,
bg="transparent",
box.col="transparent",
title="groups detected",
inset=c(0.05,0.05),
fill=rainbowCat(6),
legend=1:6);
detectedDF <- table2dataFrame(table(
Groups=detectedGroupCount,
S=detectedSampleCount));
colnames(detectedDF) <- gsub("^.*:", "", colnames(detectedDF));
knitr::kable(detectedDF);
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Groups:1 | 44 | 32 | 30 | 14 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Groups:2 | 0 | 20 | 64 | 68 | 161 | 205 | 269 | 340 | 509 | 608 | 733 | 836 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Groups:3 | 0 | 0 | 1 | 2 | 3 | 7 | 8 | 13 | 8 | 10 | 9 | 16 | 13 | 8 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Groups:4 | 0 | 0 | 0 | 2 | 2 | 0 | 2 | 7 | 9 | 13 | 27 | 36 | 43 | 60 | 75 | 85 | 108 | 143 | 177 | 226 | 260 | 277 | 277 | 182 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Groups:5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 0 | 6 | 9 | 10 | 5 | 5 | 3 | 11 | 9 | 5 | 8 | 7 | 3 | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Groups:6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 3 | 2 | 3 | 11 | 10 | 12 | 27 | 38 | 32 | 42 | 61 | 66 | 108 | 137 | 172 | 206 | 277 | 349 | 528 | 561 | 691 | 997 |
Interesting observations:
Next, the peptide abundances are combined into protein level abundance values, taking care to maintain the gene-specificity.
## Aggregate tech-rep combined abundances from peptide to protein
iRow <- match(rownames(iMatrixPepRawGmedTechGrp),
rowData(allNormPepRawUnionSE)$probeID);
iGenesProtV <- nameVector(rowData(allNormPepRawUnionSE[iRow,])[,c("geneSymbol","probeID")]);
## Take only peptides assigned to one gene
iGenesProtVuse <- unvigrep(",", iGenesProtV);
## Non-unique genes
nonUniqueGenes <- mixedSort(setdiff(
unique(unlist(strsplit(iGenesProtV,","))),
unique(iGenesProtVuse)));
There are 1 unique gene symbols represented, and 1 unique gene symbols which have at least one peptide specific to that gene.
That means there are 0 genes for which there are peptides measuring multiple genes, sometime seen with gene families with extremely high homology. These genes are:
## aggregate peptides to protein taking normal sum of abundance
iMatrixPepRawGmedTechGrpProtSum <- t(rowGroupMeans(
x=t(2^iMatrixPepRawGmedTechGrp[names(iGenesProtVuse),,drop=FALSE]-1),
groups=iGenesProtVuse,
rowStatsFunc=function(x,...){rowSums(x,...,na.rm=TRUE)},
verbose=FALSE,
keepNULLlevels=TRUE,
));
## Revert zero to NA
iMatrixPepRawGmedTechGrpProtSum[iMatrixPepRawGmedTechGrpProtSum == 0] <- NA;
Individual technical replicate outlier samples were defined using protein-level abundance values, in order to provide a robust measure of protein abundances by which to measure sample processing outliers.
However, individual peptides are useful for analysis in part because the PD software includes post-translational modification (PTM) data alongside peptides where PD detects such PTM events.
Therefore it is useful to combine technical replicates at the peptide level. We perform this step using:
TechRepSamples determined using the protein-level abundance data, andiGenesPepRaw peptides having at least one reported PD measurement.## Combine peptide rows per gene
allNormPepRawUnionSE2 <- CollapseSEbyColumn(
allNormPepRawUnionSE[iGenesPepRaw,TechRepSamples],
signal=c("median_Raw"),
colDataColnames=c("Experiment","Line","Animal","Genotype","groupName"),
numShrinkFunc=c(AAcount=mean, PTMcount=mean),
cols=TechRepSamples,
colGroups=TechRepGroups
);
iSamplesUnionO2 <- colnames(allNormPepRawUnionSE2);
The steps below are the same as with technical replicate data.
One interesting observation is that the peptide expression specific to each mouse line appears more definitive in this view, however the majority of those peptides are also extremely low abundance relative to the highest abundance hemoglobin peptides.
## iMatrix
#iMatrixPepRawG2 <- log2(1+assays(allNormPepRawUnionSE2[iGenesPepRaw,iSamplesUnionO2])$Raw);
iMatrixPepRawGmed2 <- log2(1+assays(allNormPepRawUnionSE2[iGenesPepRaw,iSamplesUnionO2])$median_Raw);
## color scale visual max
farbeLim <- sqrt(400000000);
## Make a color key to add to the left of the heatmap
iRowMeans2 <- rowMeans(na.rm=TRUE,
(2^(iMatrixPepRawGmed2[iGenesPepRaw,iSamplesUnionO2])-1));
iRowMeanNote2 <- data.frame(mean_abundance=format(
justify="right",
big.mark=",",
scientific=FALSE,
iRowMeans2));
iRowMeanColors2 <- data.frame(mean_abundance=vals2colorLevels(
symmetricZero=TRUE,
farbeLim=farbeLim,
colorRamp="RdGy_r",
rmNA(naValue=,
sqrt(iRowMeans2))));
## Ordered sample identifiers
iSamplesUnionO21 <- unvigrep("L._LM", iSamplesUnionO2);
## New sample data.frame
targetsPepRawDF2 <- as.data.frame(colData(allNormPepRawUnionSE2[,iSamplesUnionO2]));
targetsPepRawDFcolors2 <- df2groupColors(targetsPepRawDF2,
colorSub=colorSubPepRaw);
## Produce a heatmap
## heatmap using hemoglobins only, color scale sqrt
heatmap.3(rmNA(naValue=-10000,
sqrt(2^iMatrixPepRawGmed2[hbProbeIDs,iSamplesUnionO21]-1),
),
cexCol=1.3,
col="RdGy_r",
srtCol=90,
Colv=FALSE,
add.expr=box(),
rowGroups=gsub("_.*$", "", nameVector(hbProbeIDs)),
sepcolor="black",
symkey=FALSE,
keyMargins=c(17,7,5,2),
keyLabel="",
keysize=c(0.7,1.2),
tracecol="transparent",
farbeLim=farbeLim,
margins=c(15,15),
densityScale="sqrt",
main="PD Peptide Abundances",
srtRowSideColorsNote=0,
lwidRowSideColors=0.7,
RowSideColors=iRowMeanColors2[hbProbeIDs,,drop=FALSE],
RowSideColorsNote=iRowMeanNote2[hbProbeIDs,,drop=FALSE],
ColSideColors=t(targetsPepRawDFcolors2[iSamplesUnionO21,-1]),
ColSideColorsNote=t(targetsPepRawDF2[iSamplesUnionO21,-1]),
lheiColSideColors=1.3,
cellnote=rmNA(naValue="",
signif(digits=3,
round((2^iMatrixPepRawGmed2[hbProbeIDs,iSamplesUnionO21]-1)/100000)/10)),
cexnote=0.65,
cexColSideColorsNote=(makeCexCellnote(targetsPepRawDF2[iSamplesUnionO21,-1],
cex=1.8, expFactor=0.3)),
srtColSideColorsNote=90,
subMain=c("Median normalized",
"Showing only hemoglobin peptides",
"after combining technical replicates.",
"Heatmap cell labels are abundance in millions."));