London BALF Proteomics Analysis

This document was prepared on Fri Jun 22 16:57:42 2018.

Description of Input Data

This document represents the analysis steps used to analyze quantitative proteomics data as produced by Proteomics Discoverer (PD) software. The PD software was configured to export individual peptide abundances including:

peptide counts,
raw abundances,
normalized abundances, and
scaled abundances.

The raw abundances are used for this analysis, however the other abundances are imported and are readily available throughout the workflow.

The import files are in Excel format, and include these files:

basedir <- path.expand("~/Projects/London/londonBALF");
pepFilesRaw <- list.files(path=file.path(basedir, "PD_peptide_data_raw"),
   pattern="^BALF.*.xlsx$",
   full.names=TRUE);
names(pepFilesRaw) <- gsub("BALF_|[.]xlsx$", "",
   basename(pepFilesRaw));
kable(data.frame(`Peptide input files`=pepFilesRaw)) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD")

	Peptide.input.files
LM	/Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LM.xlsx
LS	/Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LS.xlsx
LT	/Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LT.xlsx

First, we load some custom R functions:

cbioPath <- "/Users/wardjm/Builds/NIEHS_IntegrativeBIx-svn/Bpipe/bpipeScripts/CBioRUtils.R";
CBioRUtils <- new.env(parent=.GlobalEnv);
sys.source(cbioPath, envir=CBioRUtils, keep.source=TRUE);
assign("CBioRUtils", CBioRUtils, envir=.GlobalEnv);
attach(CBioRUtils, warn=FALSE, pos=1);

Import PD Excel data into R

Import each file into a data.frame

Convert data.frame to a SummarizedExperiment object

This step calls a custom function processPDdata() which encapsulates several data curation steps. The result is an R object type SummarizedExperiment which is a common standard format for most omics analysis workflows.

Create a unique row identifier, using the peptide sequence, and the PTM string. This combination of values is sufficient and necessary to produce a unique identifier with these properties:
- Uniquely identifies each row with no duplications, for example the same peptide will be present on multiple rows, but each row contains a unique PTM string.
- Exactly matches corresponding rows across multiple PD export files, with no ambiguity.
- Does not require assignment to a particular protein accession number, which is known to be inconsistent due to the logic PD uses to assign peptide sequence to accession numbers.
Derive the PTM string from the PD column “Modifications” by taking the first six character of the type of modification, then concatenating the sites with an underscore. For example:
- “1xDeamidated [Q10]” becomes “1xDeamQ10”
- “2xDeamidated [N11; N]” becomes “2xDeamN11_N”
- “1xDeamidated [N10]; 1xOxidation [M14]” becomes “1xDeamN10_1xOxidM14”
Identify abundance column types, and create a separate data matrix for each abundance type, including:
- raw abundance
- normalized abundance
- scaled abundance
- peptide counts
Annotate samples by identifiers to describe the experimental factors, for example the sample “F13:.Sample,.LM,.1,.WT,.1” becomes annotated:
- Experiment “LM” - indicates the PD quantitation experiment
- Mouse line “LM” - indicates the mouse genotype line
- Animal “1” - indicates the mouse number within a sample group
- Genotype “WT” - indicates wildtype (WT) or gene knockout (KO).

Note: there is a curation step introduced here which ensures the samples are consistently annotated. The parameters curateFromSampleID and curateToSampleID are used as a regular expression find and replace.

Assign gene symbols by converting the PD-assigned protein accession numbers from RefSeq protein to Entrez Gene ID. Then convert to unique Entrez gene symbol.
- In cases where a peptide is annotated with multiple accession numbers, each accession number is converted to Entrez gene.
- Roughly 6% of peptides were assigned to multiple Entrez genes, indicating the peptide sequence was shared between multiple Entrez genes.
- Most commonly, a peptide assigned to multiple accession numbers referred to the same gene.
Recognize “Found” flags and create a data matrix representing the various PD flags indicating their flags:
- High
- Peak Found
- Not Found

curateFromSampleID <- c("^(F[0-9]+:.Sample,.)(WT|KO),.([0-9]+),.([0-9]+)$",
   "(F([123]):.Sample,.LM,.1,.WT,.)n[/]a",
   "(LM_)LM_|(LS_)LS_|(LT_)LT_");
curateToSampleID <- c("\\1LM,.\\3,.\\2,.\\4",
   "\\1\\2",
   "\\1\\2\\3");

allNormPepRawSEL <- lapply(nameVectorN(pepDataRawL), function(iName){
   #printDebug("iName:", iName);
   iDF <- pepDataRawL[[iName]];
   iAllNormSE <- processPDdata(iDF,
      iName=iName,
      returnType="SummarizedExperiment",
      sampleColnames=c("SampleCode","Blank","Line","Animal","Genotype", "Rep"),
      abundancesGrep=c(Raw="^abundance:.",
         Scaled="Abundances.[(]Scaled[)]",
         Normalized="Abundances.[(]Normalized[)]",
         Count="Abundances.Count"),
      groupColnames=c("Experiment","Line","Genotype"),
      labelColnames=c("Experiment","Line","Genotype","Animal","Rep"),
      sampleCurateFrom=curateFromSampleID,
      sampleCurateTo=curateToSampleID,
      orgDb="org.Mm.eg.db",
      keepFullTable=TRUE,
      verbose=FALSE);
   iAllNormSE;
});

kable(sdim(allNormPepRawSEL)) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD");

	rows	cols	class
LM	17087	36	SummarizedExperiment
LS	14717	39	SummarizedExperiment
LT	15856	39	SummarizedExperiment

Merge SummarizedExperiment objects

This step creates one large data matrix, whose rownames are defined using the unique peptide_PTM identifier described above. No values are modifed during this step, its main purpose is to ensure all experimental data shares a common row (peptide) identifier for any subsequent comparisons.

Note: Accession numbers are retained during this step, however we noticed that accession numbers are not consistently assigned to accession numbers across different PD experiments. The Accession numbers represent the union of all accession numbers assigned to each peptide-PTM entry.

allNormPepRawUnionSE <- unionSEL(allNormPepRawSEL,
   probeID="probeID",
   geneKeepColnames=c("probeID",
      "geneSymbol", "SeqPTM", "EG", "Accession"),
   verbose=FALSE);
kable(sdim(allNormPepRawUnionSE)) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD");

	rows	cols	class
colData	114	15	DataFrame
assays	24112	114	ShallowSimpleListAssays
NAMES	24112		character
elementMetadata	24112	5	DataFrame
metadata	3		list

Label peptides geneSymbol_pepNum_ptmNum

Change row names to geneSymbol_peptide_PTM, to make it easier to understand the relationship between gene, peptide, and PTM for each peptide.

For each gene, peptide sequences are sorted by size, alphabetically, then numbered.

In this way, the peptide number helps indicate when a peptide is present in multiple rows.
For example, the gene Apoe has peptide “TANLGAGAAQPLR” and “TANLGAGAAQPLRDR” which visibly look very similar, however they would be renamed “pep9” and “pep14”.

The “PTM number” is similarly defined by the sorted PTMs observed for each peptide.

For example, the gene Apoe has a peptide “NEVHTMLGQSTEEIR” which was observed with no PTM, and two PTMs. In this case, the peptide is renamed “pep13”, and the two PTMs are named “ptm1” and “ptm2”. For this peptide here are the three new names:
1. APOE_pep13
2. APOE_pep13_ptm1
3. APOE_pep13_ptm2

To summarize for one peptide of Apoe:

Gene	Peptide	PTM	New_Name
Apoe	NEVHTMLGQSTEEIR		Apoe_pep13
Apoe	NEVHTMLGQSTEEIR	1xDeamQ9	Apoe_pep13_ptm1
Apoe	NEVHTMLGQSTEEIR	1xOxidM6	Apoe_pep13_ptm2

allNormPepRawUnionSE <- addPeptidePTMLabels(allNormPepRawUnionSE);

## Add indication of how many genes are associated with each peptide
rowData(allNormPepRawUnionSE)$GeneCount <- lengths(
   strsplit(rowData(allNormPepRawUnionSE)$geneSymbol, ","));
## Filter for peptide rows with counts
iGenesPepRaw <- rownames(removeBlankRowCols(allNormPepRawUnionSE));

For example we can display the actual data for selected genes.

## Show the full table for selected genes
keepGenes <- c("Apoe","Hbb-b2");
iDF <- mixedSortDF(as.data.frame(
   subset(rowData(allNormPepRawUnionSE), geneSymbol %in% keepGenes)[,
      c("geneSymbol","GenePep","PTMnum","Peptide","PTM","GenePepPTM")]))
colorSubPepNum <- group2colors(unique(iDF[,"GenePep"]));
iDF <- kable_coloring(iDF,
      colorSub=colorSubPepNum,
      row_color_by="GenePep",
      row.names=FALSE,
      returnType="kable") %>%
   collapse_rows(columns=1:2, valign="middle") %>%
   row_spec(0, background="#DDDDDD")
iDF;

geneSymbol	GenePep	PTMnum	Peptide	PTM	GenePepPTM
Apoe	Apoe_pep1		FWDYLR		Apoe_pep1
	Apoe_pep2		LGPLVEQGR		Apoe_pep2
	Apoe_pep3		MEEQTQQIR		Apoe_pep3
	Apoe_pep4		GRLEEVGNQAR		Apoe_pep4
	Apoe_pep5		LGADMEDLRNR		Apoe_pep5
	Apoe_pep6		LGKEVQAAQAR		Apoe_pep6
	Apoe_pep7		SKMEEQTQQIR		Apoe_pep7
	Apoe_pep8		GWFEPIVEDMHR		Apoe_pep8
	Apoe_pep9		TANLGAGAAQPLR		Apoe_pep9
	Apoe_pep10		DRLEEVREHMEEVR		Apoe_pep10
	Apoe_pep10	ptm1	DRLEEVREHMEEVR	1xOxidM10	Apoe_pep10_ptm1
	Apoe_pep11		ELEEQLGPVAEETR		Apoe_pep11
	Apoe_pep12		LKGWFEPIVEDMHR		Apoe_pep12
	Apoe_pep13		NEVHTMLGQSTEEIR		Apoe_pep13
		ptm1	NEVHTMLGQSTEEIR	1xDeamQ9	Apoe_pep13_ptm1
		ptm2	NEVHTMLGQSTEEIR	1xOxidM6	Apoe_pep13_ptm2
	Apoe_pep14		TANLGAGAAQPLRDR		Apoe_pep14
	Apoe_pep15		AYKKELEEQLGPVAEETR		Apoe_pep15
	Apoe_pep16		IQASVATNPIITPVAQENQ		Apoe_pep16
	Apoe_pep16	ptm1	IQASVATNPIITPVAQENQ	1xDeamQ.N	Apoe_pep16_ptm1
Hbb-b2	Hbb-b2_pep1		VNPDEVGGEALGR		Hbb-b2_pep1
Hbb-b2	Hbb-b2_pep1	ptm1	VNPDEVGGEALGR	1xDeamN2	Hbb-b2_pep1_ptm1

About the SummarizedExperiment object

SummarizedExperiment documentation

The object allNormPepRawUnionSE object is SummarizedExperiment, which contains a data matrix with Raw, Normalized, Scaled, and Counts data directly obtained from PD software.

Raw measurements are accessed assays(allNormPepRawUnionSE)[["Raw"]].
Gene annotations are accessed rowData(allNormPepRawUnionSE), as a data.frame which contains peptide and gene annotations. Each row describes peptide measurements, including the peptide sequence, PTM string, Entrez gene ID, gene symbol, etc.
Sample annotations are accessed colData(allNormPepRawUnionSE), as a data.frame. Each row describes one biological sample, including the mouse line (LM, LS, LT), genotype (WT, KO), animal number (1, 2, 3, 4, 5, 6), and technical replicate (1, 2, 3).

The allNormPepRawUnionSE object will also store the normalized data matrix, the statistical design, and statistical comparisons.

Summary information about peptide abundances

Now that we have prepared the data matrix, we can describe some basic features of the data, first is the number of rows containing any measurement.

######################################################################
## Filter for peptide rows with counts
peptideSummaryDF <- data.frame(check.names=FALSE,
   `Summary Stats`=c(
      `Total Peptide Rows`=formatInt(nrow(allNormPepRawUnionSE)),
      `Peptide Rows with Data`=formatInt(length(iGenesPepRaw))
   ));
kable(peptideSummaryDF) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD");

	Summary Stats
Total Peptide Rows	24,112
Peptide Rows with Data	10,481

Note that most of the peptide rows where PD assigned abundance values are also rows without PTMs:

## add a column "HasData" to rowData(allNormPepRawUnionSE)
rowData(allNormPepRawUnionSE)[["HasData"]] <- (rownames(allNormPepRawUnionSE) %in%
   iGenesPepRaw);

## Calculate a frequency table of "HasData" with "HasPTM"
hasDataPTMtable <- table(
   HasPTM=ifelse(!rowData(allNormPepRawUnionSE)$PTM %in% "",
      "hasPTM", "noPTM"),
   HasData=ifelse(rowData(allNormPepRawUnionSE)$HasData,
      "hasData", "noData"));
kable(hasDataPTMtable) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD")

	hasData	noData
hasPTM	1624	6409
noPTM	8857	7222

Experiment design and color-coding for figures

For consistency in downstream figures, we define a full data.frame with experimental factors, with corresponding colors.

#####################################
## Vector of all sample identifiers
iSamplesPepRaw <- rownames(allNormPepRawUnion$targets);

#####################################
## Define colors
## Line and Genotype
colorSubLine <- c(
   LM="brown3",
   LS="darkorange",
   LT="purple4",
   KO="royalblue",
   WT="red4");
## Sample Groups
colorSubBase <- c(
   LM_WT="brown3",
   LS_WT="salmon",
   LT_WT="coral3",
   LM_KO="darkslateblue",
   LS_KO="royalblue1",
   LT_KO="deepskyblue4");
## Replicate numbers
colorSubRep <- nameVector(
   color2gradient("grey40", n=6, gradientWtFactor=2),
   1:6);
## color by unique animal number
iSamplesPepRawAni <- (unique(gsub("_[0-9]$", "", iSamplesPepRaw)));
colorSubAni <- nameVector(colorSubBase[gsub("^L[ST]_(L)|_[0-9].*$", "\\1",
      iSamplesPepRawAni)],
   iSamplesPepRawAni);
colorSubAni2 <- color2gradient(gradientWtFactor=1/2,
   colorSubAni[sort(names(colorSubAni))]);
colorSubTechRep <- nameVector(colorSubAni2[gsub("^L[ST]_(L)|_[0-9]$", "\\1",
      iSamplesPepRaw)],
   iSamplesPepRaw);

## Put the colors together into a named vector
colorSubPepRaw <- c(colorSubLine,
   colorSubBase,
   colorSubAni2,
   colorSubTechRep,
   colorSubRep);

########################################################
## Make a data.frame with experiment design
targetsPepRawDF <- allNormPepRawUnion$targets[,c("Experiment","groupName","Line",
   "Animal","Genotype","Rep","sampleLabel")];
targetsPepRawDF$Experiment <- makeFactor(targetsPepRawDF$Experiment);
targetsPepRawDF$Genotype <- makeFactor(targetsPepRawDF$Genotype,
   provigrepPattern=c("WT","KO","."));
targetsPepRawDF$Line <- makeFactor(targetsPepRawDF$Line);
targetsPepRawDF$Animal <- asFactor(targetsPepRawDF$Animal);
targetsPepRawDF$groupName <- gsubOrdered("LM_(LM)|LS_(LS)|LT_(LT)", "\\1\\2\\3",
   pasteByRowOrdered(targetsPepRawDF[,c("Experiment","Line","Genotype")]));
targetsPepRawDF$groupAnimal <- pasteByRowOrdered(targetsPepRawDF[,c("groupName","Animal")]);

## Sort the data.frame in proper factor order
targetsPepRawDF <- mixedSortDF(targetsPepRawDF);

## Define a data.frame of colors using the named color vector
targetsPepRawDFcolors <- df2groupColors(targetsPepRawDF,
   colorSub=colorSubPepRaw);
par("mfrow"=c(1,1), "mar"=c(8,4,4,2));
imageByColors(targetsPepRawDFcolors,
   cellnote=targetsPepRawDF,
   yaxt="n",
   cexCellnote=makeCexCellnote(targetsPepRawDF));

Peptide-level data processing

Normalize peptide abundances

Data is normalized by using the log2 abundance values, and median scaling with a method from the limma R package.

## Normalize
allNormPepRawUnionSE <- doNorm(allNormPepRawUnionSE,
   iGenes=iGenesPepRaw,
   keepZeros=FALSE,
   signal="Raw",
   methodList=c("median"));

## ##  (17:20:28) 21Jun2018:  doNorm(): Applying normalizations for signal: 'Raw' 
## ##  (17:20:28) 21Jun2018:  doNorm(): normMethod: FALSE

## Define ordered sample identifiers
iSamplesPepRaw <- rownames(colData(allNormPepRawUnionSE));
sortColnames <- c("Experiment", "Genotype", "Line", "Animal", "Rep");
iSamplesUnionO <- rownames(mixedSortDF(colData(allNormPepRawUnionSE)[iSamplesPepRaw,sortColnames]));

## Extract the data matrix of normalized values
iMatrixPepRawGmed <- assays(allNormPepRawUnionSE)[["median_Raw"]][iGenesPepRaw,iSamplesUnionO];

Review hemoglobin abundances

One potential insight into the sample preparation relates to the observation of lung lavage bleeding, which may be evident by the presence of hemoglobin peptides.

First, pull together hemoglobin genes with peptide abundance data:

## Hemoglobin gene names to Entrez Gene ID
hbEG <- names(vigrep("hemoglobin", as.list(org.Mm.egGENENAME)));
hbSymbols <- mgetMapped(hbEG, org.Mm.egSYMBOL);
hbGeneNames <- mgetMapped(hbEG, org.Mm.egGENENAME);
hbProbeIDs <- intersect(iGenesPepRaw,
   provigrep(paste0("(^|,)", hbSymbols, "($|_|,)"), rownames(allNormPepRawUnionSE)));

## Remove single-peptide entries for now
hbGeneTc <- tcount(gsub("_.+", "", hbProbeIDs));
hbGeneTc1 <- names(hbGeneTc[hbGeneTc %in% 1]);
hbProbeIDs <- hbProbeIDs[!gsub("_.+", "", hbProbeIDs) %in% hbGeneTc1];

hbDF <- data.frame(hbEG=hbEG,
   hbSymbols=hbSymbols,
   hbGeneNames=hbGeneNames)
## Subset for the peptides being visualized
knitr::kable(subset(hbDF,
   hbEG %in% rowData(allNormPepRawUnionSE[hbProbeIDs,])[,"EG"])) %>%
   kable_styling();

	hbEG	hbSymbols	hbGeneNames
15130	15130	Hbb-b2	hemoglobin, beta adult minor chain
110257	110257	Hba-a2	hemoglobin alpha, adult chain 2
216635	216635	Hbq1a	hemoglobin, theta 1A
100503605	100503605	Hbb-bs	hemoglobin, beta adult s chain

Next, produce a heatmap of abundances, using the median-normalized abundance values, visualizing all technical replicates.

## iMatrix
iMatrixPepRawG <- log2(1+assays(allNormPepRawUnionSE[iGenesPepRaw,iSamplesUnionO])[["Raw"]]);
## color scale visual max
farbeLim <- sqrt(400000000);

## Make a color key to add to the left of the heatmap
iRowMeans <- rowMeans(na.rm=TRUE,
   (2^(iMatrixPepRawGmed[iGenesPepRaw,iSamplesUnionO])-1));
iRowMeanNote <- data.frame(mean_abundance=format(
   justify="right",
   big.mark=",",
   scientific=FALSE,
   iRowMeans));
iRowMeanColors <- data.frame(mean_abundance=vals2colorLevels(
   symmetricZero=TRUE,
   farbeLim=farbeLim,
   colorRamp="RdGy_r",
   rmNA(naValue=,
      sqrt(iRowMeans))));

## Ordered sample identifiers
iSamplesUnionO <- rownames(targetsPepRawDF);
iSamplesUnionO1 <- unvigrep("L._LM", iSamplesUnionO);

## Produce a heatmap
## heatmap using hemoglobins only, color scale sqrt
heatmap.3(rmNA(naValue=-10000,
      sqrt(2^iMatrixPepRawGmed[hbProbeIDs,iSamplesUnionO1]-1),
   ),
   cexCol=0.7,
   col="RdGy_r",
   srtCol=90,
   Colv=FALSE,
   add.expr=box(),
   rowGroups=gsub("_.*$", "", nameVector(hbProbeIDs)),
   sepcolor="black",
   symkey=FALSE,
   keyMargins=c(17,7,5,2),
   keyLabel="",
   keysize=c(0.7,1.2),
   tracecol="transparent",
   farbeLim=farbeLim,
   margins=c(15,15),
   densityScale="sqrt",
   main="PD Abundances",
   srtRowSideColorsNote=0,
   lwidRowSideColors=0.7,
   RowSideColors=iRowMeanColors[hbProbeIDs,,drop=FALSE],
   RowSideColorsNote=iRowMeanNote[hbProbeIDs,,drop=FALSE],
   ColSideColors=t(targetsPepRawDFcolors[iSamplesUnionO1,4:8]),
   ColSideColorsNote=t(targetsPepRawDF[iSamplesUnionO1,4:8]),
   lheiColSideColors=1.3,
   cexColSideColorsNote=(makeCexCellnote(targetsPepRawDF[iSamplesUnionO1,8:4],
      cex=1.8, expFactor=0.3)),
   srtColSideColorsNote=90,
   subMain=c("Median normalized",
      "Showing only hemoglobin peptides"));

Overall, it would appear that mouse LT_KO_5 might have elevated hemoglobin abundances, however the other mice appear to have relatively consistent variability.

Combine peptides into proteins

The function CollapseSEbyRow() combines peptide abundances per gene, while also combining the relevant gene annotations into one row per gene.

The groupFuncName="sum" takes the sum of all peptide abundances per gene.

## Exponentiate the median normalized signal
if (max(assays(allNormPepRawUnionSE)$median_Raw, na.rm=TRUE) < 100) {
   assays(allNormPepRawUnionSE)$median_Raw <- 2^(assays(allNormPepRawUnionSE)$median_Raw)-1;
}

## Combine peptide rows per gene
allNormPepRawUnionSEgrp <- CollapseSEbyRow(
   allNormPepRawUnionSE[iGenesPepRaw,iSamplesUnionO],
   signal=c("Raw","median_Raw"),
   groupFuncName="sum",
   rowDataColnames=c("geneSymbol","EG","Accession","PTMcount","AAcount"),
   numShrinkFunc=c(AAcount=mean, PTMcount=mean),
   rows=iGenesPepRaw,
   rowGroups=nameVector(rowData(allNormPepRawUnionSE[iGenesPepRaw,])
      [,c("geneSymbol","GenePepPTM")])
)

## ##  (13:06:24) 22Jun2018:  CollapseSEbyRow(): Performing these operations by signal: 
##            groupFuncName rowStatsFunc
## Raw                  sum         NULL
## median_Raw           sum         NULL
## ##  (13:06:24) 22Jun2018:  CollapseSEbyRow(): Collapsing signal:Raw 
## ##  (13:06:26) 22Jun2018:  CollapseSEbyRow(): Collapsing signal:median_Raw 
## ##  (13:06:29) 22Jun2018:  CollapseSEbyRow(): Collapsing rowData(SE) 
## ##  (13:06:30) 22Jun2018:  CollapseSEbyRow(): Re-delimiting unique values in column:geneSymbol 
## ##  (13:06:30) 22Jun2018:  CollapseSEbyRow(): Re-delimiting unique values in column:EG

## head(rowData(allNormPepRawUnionSEgrp));

After this operation 24,112 peptide rows were combined into 1,850 gene rows.

Protein-level data processing

MA-plots of protein abundance data

To review the variability of technical replicates, we create MA-plots centered by each technical replicate triplicate. The x-axis shows the mean signal for each protein, the y-axis shows the log2 difference from mean per protein, per technical replicate group.

For each panel, the MAD is determined using signal at or above 20 (roughly abundance 1 million), then MAD factors are calculated as the ratio of each sample MAD versus divided by the median MAD. A threshold was set at 2x MAD for outliers, which are shaded in yellow.

There are a number of obvious individual outliers, but there appears to be larger overall MAD factor from technical replicate 2. This observation is most apparent with the “LM” mouse line, mostly because the LM samples had few individual outliers, and very low overall variance. The “LS” mouse line showed higher variance for technical replicate 2, but not above the MAD cutoff since the overall variance was higher than LM.

## MA-plot
   par("oma"=c(0,0,4,0));
   iLM <- vigrep("^LM_(KO|WT).*[123]$", colnames(allNormPepRawUnionSEgrp));
   outliersLM <- c("LM_WT_1_1", "LM_WT_2_2", "LM_WT_4_2",
      "LM_WT_6_2", "LM_KO_1_2", "LM_KO_2_2", "LM_KO_3_2",
      "LM_KO_4_2", "LM_KO_6_2");
   jpLM <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLM])$median_Raw),
      ncol=6,
      displayMAD=TRUE, outlierMAD=2, outlierRowMin=21,
      titleCex=0.9,
      titleBoxColor=colorSubPepRaw[iLM],
      groupSuffix="",
      useMean=TRUE,
      colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
      controlSamples=setdiff(iLM, outliersLM),
      #controlSamples=setdiff(vigrep("_[13]$", iLM), outliersLM),
      centerGroups=gsub("_[123]$", "", iLM));
   title(main="LM mouse line MA-plot\nprotein data centered per techrep triplet",
      outer=TRUE);

   outliersLM <- attr(jpLM, "mvaMADoutliers");

## MA-plot
   par("oma"=c(0,0,4,0));
   iLS <- vigrep("^LS_(KO|WT)", colnames(allNormPepRawUnionSEgrp));
   outliersLS <- NULL;
   jpLS <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLS])$median_Raw),
      ncol=6,
      displayMAD=TRUE, outlierMAD=2, outlierRowMin=20,
      titleCex=0.9,
      titleBoxColor=colorSubPepRaw[iLS],
      groupSuffix="",
      useMean=TRUE,
      colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
      controlSamples=vigrep("_[13]$", iLS),
      #controlSamples=setdiff(iLS, outliersLS),
      #controlSamples=setdiff(vigrep("_[13]$", iLS), outliersLS),
      centerGroups=gsub("_[123]$", "", iLS));
   title(main="LS mouse line MA-plot\nprotein data centered per techrep triplet",
      outer=TRUE);

   outliersLS <- attr(jpLS, "mvaMADoutliers");

## MA-plot
   par("oma"=c(0,0,4,0));
   iLT <- vigrep("^LT_(KO|WT)", colnames(allNormPepRawUnionSEgrp));
   outliersLT <- c("LT_WT_6_1", "LT_KO_1_3", "LT_KO_3_3",
      "LT_KO_6_3");
   jpLT <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLT])$median_Raw),
      ncol=6,
      displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
      titleCex=0.9,
      titleBoxColor=colorSubPepRaw[iLT],
      groupSuffix="",
      useMean=TRUE,
      colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
      controlSamples=setdiff(iLT, outliersLT),
      centerGroups=gsub("_[123]$", "", iLT));
   title(main="LT mouse line MA-plot\nprotein data centered per techrep triplet",
      outer=TRUE);

   outliersLT <- attr(jpLT, "mvaMADoutliers");

Combine protein technical replicates

Three technical replicates were performed for each animal, after lung lavage fluid was isolate, this fluid was subjected to proteomics analysis three times. These replicates are combined prior to downstream analysis, evaluating each triplicate for outliers, using a MAD factor cutoff of 5. For each triplet, the MAD is calculated, and any point more than 5x the MAD is considered an outlier.

Given the MA-plots of the technical replicates above, individual outlier samples were defined and omitted from this procedure. Notably, any technical triplicate which already had one or more samples removed based upon the MA plot above cannot have any further outliers removed, since two replicates are not sufficient to establish an outlier condition.

TechRepSamples <- setdiff(iSamplesUnionO,
   c(outliersLM, outliersLS, outliersLT));
TechRepGroups <- nameVector(
   pasteByRow(colData(allNormPepRawUnionSEgrp[,TechRepSamples])[,c("groupName","Animal")]),
   TechRepSamples);
## Combine peptide rows per gene
allNormPepRawUnionSEgrp2 <- CollapseSEbyColumn(
   allNormPepRawUnionSEgrp[,TechRepSamples],
   signal=c("median_Raw"),
   colDataColnames=c("Experiment","Line","Animal","Genotype","groupName"),
   numShrinkFunc=c(AAcount=mean, PTMcount=mean),
   cols=TechRepSamples,
   colGroups=TechRepGroups
)

In summary:

114 total technical replicate samples, from
38 biological samples
17 outlier samples defined from the MA-plot centered by technical replicate
97 technical replicate samples were used to combine technical replicates
37 biological samples resulted

MA-plots using proteins per biological sample

We now revisit the MA-plots, using the per-protein data, for each biological sample.

iLM2 <- vigrep("^LM_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
jpLM2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLM2])$median_Raw),
   ncol=3,
   titleCex=0.9,
   titleBoxColor=colorSubPepRaw[iLM2],
   displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
   colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
   groupSuffix="",
   useMean=FALSE,
   centerGroups=gsub("_[0-9]+$", "", iLM2));
title(main=paste0("LM mouse line MA-plot\n",
   "protein data per biological replicate\n",
   "centered per KO or WT genotype"),
   outer=TRUE);

outliersLM2 <- attr(jpLM2, "mvaMADoutliers");

iLS2 <- vigrep("^LS_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
jpLS2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLS2])$median_Raw),
   ncol=3,
   titleCex=0.9,
   blankPlotPos=6,
   titleBoxColor=colorSubPepRaw[iLS2],
   displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
   colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
   groupSuffix="",
   useMean=FALSE,
   centerGroups=gsub("_[0-9]+$", "", iLS2));
title(main=paste0("LS mouse line MA-plot\n",
   "protein data per biological replicate\n",
   "centered per KO or WT genotype"),
   outer=TRUE);

outliersLS2 <- attr(jpLS2, "mvaMADoutliers");

iLT2 <- vigrep("^LT_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
outliersLT2 <- NULL;
jpLT2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLT2])$median_Raw),
   ncol=3,
   titleCex=0.9,
   titleBoxColor=colorSubPepRaw[iLT2],
   displayMAD=TRUE, outlierMAD=1.6, outlierRowMin=20,
   colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
   controlSamples=setdiff(iLT2, outliersLT2),
   groupSuffix="",
   useMean=TRUE,
   centerGroups=gsub("_[0-9]+$", "", iLT2));
title(main=paste0("LT mouse line MA-plot\n",
   "protein data per biological replicate\n",
   "centered per KO or WT genotype"),
   outer=TRUE);

outliersLT2 <- attr(jpLT2, "mvaMADoutliers");

Three technical replicates were performed for each animal, which will be combined prior to downstream analysis. The method evaluates each triplicate for outliers, using a MAD factor cutoff of 5, meaning that for each triplet, the MAD is calculated, and any point more than 5x the MAD is considered an outlier.

## combine tech rep data
iSamplesUnionOtechRep <- nameVector(gsub("_[0-9]+$", "", iSamplesUnionO),
   iSamplesUnionO);

## Produce a matrix of data after outlier removal
iMatrixPepRawGmedTechMAD <- rowGroupMeans(iMatrixPepRawGmed[,iSamplesUnionO],
   useMedian=FALSE,
   returnType="input",
   madFactor=5,
   rmOutliers=TRUE,
   groups=iSamplesUnionOtechRep[iSamplesUnionO]);

## Summary table of outliers from the raw data, versus outliers
## after MAD outlier filtering
outlierDF <- table2dataFrame(table(
   MADoutliers=is.na(iMatrixPepRawGmedTechMAD[,iSamplesUnionO]),
   rawOutliers=is.na(iMatrixPepRawGmed[,iSamplesUnionO])));
knitr::kable(outlierDF);

	rawOutliers:FALSE	rawOutliers:TRUE
MADoutliers:FALSE	527408	0
MADoutliers:TRUE	31370	636056

## Summary stats
numMADoutliers <- outlierDF["MADoutliers:TRUE","rawOutliers:FALSE"];
pctMADoutliers <- round(outlierDF["MADoutliers:TRUE","rawOutliers:FALSE"]/sum(outlierDF[,"rawOutliers:FALSE"])*1000)/10;

Overall, 31,370 individual values were removed, roughly 5.6%, and only for triplicates containing three values, where the third value was 5xMAD away from the other two.

Now, combine technical replicates to produce a new data matrix.

## Combine technical replicates, using MAD factor cutoff of 5 for outliers
iMatrixPepRawGmedTechGrp <- rowGroupMeans(iMatrixPepRawGmed[,iSamplesUnionO],
   useMedian=FALSE,
   madFactor=5,
   rmOutliers=TRUE,
   groups=iSamplesUnionOtechRep[iSamplesUnionO]);
iSamplesUnionOtech <- colnames(iMatrixPepRawGmedTechGrp);
iSamplesUnionOtech1 <- unvigrep("_LM", iSamplesUnionOtech);

We plotted a histogram showing the number of samples with measured abundance values per peptide, out of 36 total samples. The distribution was so interesting, we further color-coded the bars by the number of sample groups represented. It shows that in general, mouse lines agree closely with one another – a peptide detected in a mouse wildtype line is almost always detected in the same mouse knockout line.

## Look at number of animals with detected peptides by sample group
detectedGroupsV <- nameVector(gsub("_[0-9].*$", "", iSamplesUnionOtech1),
   iSamplesUnionOtech1);
detectedGroupCountL <- apply(iMatrixPepRawGmedTechGrp[,iSamplesUnionOtech1], 1, function(i){
   detectedGroupsV[!is.na(i)]
});
detectedGroupCount <- lengths(strsplit(cPasteUnique(detectedGroupCountL), ","));
detectedSampleCount <- lengths(detectedGroupCountL);
gplots::barplot2(table(detectedGroupCount, detectedSampleCount),
   las=2,
   ylim=c(0,1100),
   ylab="Number of peptides",
   xlab="Number of samples with non-zero abundance per peptide",
   col=rainbowCat(6));
legend("topleft",
   cex=0.8,
   bg="transparent",
   box.col="transparent",
   title="groups detected",
   inset=c(0.05,0.05),
   fill=rainbowCat(6),
   legend=1:6);

detectedDF <- table2dataFrame(table(
   Groups=detectedGroupCount,
   S=detectedSampleCount));
colnames(detectedDF) <- gsub("^.*:", "", colnames(detectedDF));
knitr::kable(detectedDF);

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36
Groups:1	44	32	30	14	9	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Groups:2	0	20	64	68	161	205	269	340	509	608	733	836	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Groups:3	0	0	1	2	3	7	8	13	8	10	9	16	13	8	3	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Groups:4	0	0	0	2	2	0	2	7	9	13	27	36	43	60	75	85	108	143	177	226	260	277	277	182	0	0	0	0	0	0	0	0	0	0	0	0
Groups:5	0	0	0	0	0	0	0	0	0	0	1	2	3	0	6	9	10	5	5	3	11	9	5	8	7	3	3	2	0	0	0	0	0	0	0	0
Groups:6	0	0	0	0	0	0	0	0	0	0	0	0	1	1	3	2	3	11	10	12	27	38	32	42	61	66	108	137	172	206	277	349	528	561	691	997

Interesting observations:

Of the 890 peptides detected in only 12 samples, 93% of them were from the wildtype and knockout from the same mouse line.
There are roughly as many peptides (3,813) that are specific to one mouse line, as there are peptides (4,335) that are shared across all three mouse lines.

Combine peptides to protein

Next, the peptide abundances are combined into protein level abundance values, taking care to maintain the gene-specificity.

## Aggregate tech-rep combined abundances from peptide to protein
iRow <- match(rownames(iMatrixPepRawGmedTechGrp),
   rowData(allNormPepRawUnionSE)$probeID);
iGenesProtV <- nameVector(rowData(allNormPepRawUnionSE[iRow,])[,c("geneSymbol","probeID")]);
## Take only peptides assigned to one gene
iGenesProtVuse <- unvigrep(",", iGenesProtV);
## Non-unique genes
nonUniqueGenes <- mixedSort(setdiff(
   unique(unlist(strsplit(iGenesProtV,","))),
   unique(iGenesProtVuse)));

There are 1 unique gene symbols represented, and 1 unique gene symbols which have at least one peptide specific to that gene.

That means there are 0 genes for which there are peptides measuring multiple genes, sometime seen with gene families with extremely high homology. These genes are:

## aggregate peptides to protein taking normal sum of abundance
iMatrixPepRawGmedTechGrpProtSum <- t(rowGroupMeans(
   x=t(2^iMatrixPepRawGmedTechGrp[names(iGenesProtVuse),,drop=FALSE]-1),
   groups=iGenesProtVuse,
   rowStatsFunc=function(x,...){rowSums(x,...,na.rm=TRUE)},
   verbose=FALSE,
   keepNULLlevels=TRUE,
));
## Revert zero to NA
iMatrixPepRawGmedTechGrpProtSum[iMatrixPepRawGmedTechGrpProtSum == 0] <- NA;

Peptide-level technical replicates

Combine peptide technical replicates

Individual technical replicate outlier samples were defined using protein-level abundance values, in order to provide a robust measure of protein abundances by which to measure sample processing outliers.

However, individual peptides are useful for analysis in part because the PD software includes post-translational modification (PTM) data alongside peptides where PD detects such PTM events.

Therefore it is useful to combine technical replicates at the peptide level. We perform this step using:

the non-outlier samples TechRepSamples determined using the protein-level abundance data, and
the iGenesPepRaw peptides having at least one reported PD measurement.

## Combine peptide rows per gene
allNormPepRawUnionSE2 <- CollapseSEbyColumn(
   allNormPepRawUnionSE[iGenesPepRaw,TechRepSamples],
   signal=c("median_Raw"),
   colDataColnames=c("Experiment","Line","Animal","Genotype","groupName"),
   numShrinkFunc=c(AAcount=mean, PTMcount=mean),
   cols=TechRepSamples,
   colGroups=TechRepGroups
);
iSamplesUnionO2 <- colnames(allNormPepRawUnionSE2);

Review of hemoglobin by biological sample

The steps below are the same as with technical replicate data.

One interesting observation is that the peptide expression specific to each mouse line appears more definitive in this view, however the majority of those peptides are also extremely low abundance relative to the highest abundance hemoglobin peptides.

## iMatrix
#iMatrixPepRawG2 <- log2(1+assays(allNormPepRawUnionSE2[iGenesPepRaw,iSamplesUnionO2])$Raw);
iMatrixPepRawGmed2 <- log2(1+assays(allNormPepRawUnionSE2[iGenesPepRaw,iSamplesUnionO2])$median_Raw);
## color scale visual max
farbeLim <- sqrt(400000000);

## Make a color key to add to the left of the heatmap
iRowMeans2 <- rowMeans(na.rm=TRUE,
   (2^(iMatrixPepRawGmed2[iGenesPepRaw,iSamplesUnionO2])-1));
iRowMeanNote2 <- data.frame(mean_abundance=format(
   justify="right",
   big.mark=",",
   scientific=FALSE,
   iRowMeans2));
iRowMeanColors2 <- data.frame(mean_abundance=vals2colorLevels(
   symmetricZero=TRUE,
   farbeLim=farbeLim,
   colorRamp="RdGy_r",
   rmNA(naValue=,
      sqrt(iRowMeans2))));

## Ordered sample identifiers
iSamplesUnionO21 <- unvigrep("L._LM", iSamplesUnionO2);

## New sample data.frame
targetsPepRawDF2 <- as.data.frame(colData(allNormPepRawUnionSE2[,iSamplesUnionO2]));
targetsPepRawDFcolors2 <- df2groupColors(targetsPepRawDF2,
   colorSub=colorSubPepRaw);

## Produce a heatmap
## heatmap using hemoglobins only, color scale sqrt
heatmap.3(rmNA(naValue=-10000,
      sqrt(2^iMatrixPepRawGmed2[hbProbeIDs,iSamplesUnionO21]-1),
   ),
   cexCol=1.3,
   col="RdGy_r",
   srtCol=90,
   Colv=FALSE,
   add.expr=box(),
   rowGroups=gsub("_.*$", "", nameVector(hbProbeIDs)),
   sepcolor="black",
   symkey=FALSE,
   keyMargins=c(17,7,5,2),
   keyLabel="",
   keysize=c(0.7,1.2),
   tracecol="transparent",
   farbeLim=farbeLim,
   margins=c(15,15),
   densityScale="sqrt",
   main="PD Peptide Abundances",
   srtRowSideColorsNote=0,
   lwidRowSideColors=0.7,
   RowSideColors=iRowMeanColors2[hbProbeIDs,,drop=FALSE],
   RowSideColorsNote=iRowMeanNote2[hbProbeIDs,,drop=FALSE],
   ColSideColors=t(targetsPepRawDFcolors2[iSamplesUnionO21,-1]),
   ColSideColorsNote=t(targetsPepRawDF2[iSamplesUnionO21,-1]),
   lheiColSideColors=1.3,
   cellnote=rmNA(naValue="",
      signif(digits=3,
         round((2^iMatrixPepRawGmed2[hbProbeIDs,iSamplesUnionO21]-1)/100000)/10)),
   cexnote=0.65,
   cexColSideColorsNote=(makeCexCellnote(targetsPepRawDF2[iSamplesUnionO21,-1],
      cex=1.8, expFactor=0.3)),
   srtColSideColorsNote=90,
   subMain=c("Median normalized",
      "Showing only hemoglobin peptides",
      "after combining technical replicates.",
      "Heatmap cell labels are abundance in millions."));