This document was prepared on Fri Jun 22 16:57:42 2018.

Description of Input Data

This document represents the analysis steps used to analyze quantitative proteomics data as produced by Proteomics Discoverer (PD) software. The PD software was configured to export individual peptide abundances including:

  • peptide counts,
  • raw abundances,
  • normalized abundances, and
  • scaled abundances.

The raw abundances are used for this analysis, however the other abundances are imported and are readily available throughout the workflow.

The import files are in Excel format, and include these files:

basedir <- path.expand("~/Projects/London/londonBALF");
pepFilesRaw <- list.files(path=file.path(basedir, "PD_peptide_data_raw"),
   pattern="^BALF.*.xlsx$",
   full.names=TRUE);
names(pepFilesRaw) <- gsub("BALF_|[.]xlsx$", "",
   basename(pepFilesRaw));
kable(data.frame(`Peptide input files`=pepFilesRaw)) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD")
Peptide.input.files
LM /Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LM.xlsx
LS /Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LS.xlsx
LT /Users/wardjm/Projects/London/londonBALF/PD_peptide_data_raw/BALF_LT.xlsx

First, we load some custom R functions:

cbioPath <- "/Users/wardjm/Builds/NIEHS_IntegrativeBIx-svn/Bpipe/bpipeScripts/CBioRUtils.R";
CBioRUtils <- new.env(parent=.GlobalEnv);
sys.source(cbioPath, envir=CBioRUtils, keep.source=TRUE);
assign("CBioRUtils", CBioRUtils, envir=.GlobalEnv);
attach(CBioRUtils, warn=FALSE, pos=1);

Import PD Excel data into R

Import each file into a data.frame

Convert data.frame to a SummarizedExperiment object

This step calls a custom function processPDdata() which encapsulates several data curation steps. The result is an R object type SummarizedExperiment which is a common standard format for most omics analysis workflows.

  1. Create a unique row identifier, using the peptide sequence, and the PTM string. This combination of values is sufficient and necessary to produce a unique identifier with these properties:

    • Uniquely identifies each row with no duplications, for example the same peptide will be present on multiple rows, but each row contains a unique PTM string.
    • Exactly matches corresponding rows across multiple PD export files, with no ambiguity.
    • Does not require assignment to a particular protein accession number, which is known to be inconsistent due to the logic PD uses to assign peptide sequence to accession numbers.
  2. Derive the PTM string from the PD column “Modifications” by taking the first six character of the type of modification, then concatenating the sites with an underscore. For example:

    • 1xDeamidated [Q10]” becomes “1xDeamQ10
    • 2xDeamidated [N11; N]” becomes “2xDeamN11_N
    • 1xDeamidated [N10]; 1xOxidation [M14]” becomes “1xDeamN10_1xOxidM14
  3. Identify abundance column types, and create a separate data matrix for each abundance type, including:

    • raw abundance
    • normalized abundance
    • scaled abundance
    • peptide counts
  4. Annotate samples by identifiers to describe the experimental factors, for example the sample “F13:.Sample,.LM,.1,.WT,.1” becomes annotated:

    • Experiment “LM” - indicates the PD quantitation experiment
    • Mouse line “LM” - indicates the mouse genotype line
    • Animal “1” - indicates the mouse number within a sample group
    • Genotype “WT” - indicates wildtype (WT) or gene knockout (KO).

Note: there is a curation step introduced here which ensures the samples are consistently annotated. The parameters curateFromSampleID and curateToSampleID are used as a regular expression find and replace.

  1. Assign gene symbols by converting the PD-assigned protein accession numbers from RefSeq protein to Entrez Gene ID. Then convert to unique Entrez gene symbol.

    • In cases where a peptide is annotated with multiple accession numbers, each accession number is converted to Entrez gene.
    • Roughly 6% of peptides were assigned to multiple Entrez genes, indicating the peptide sequence was shared between multiple Entrez genes.
    • Most commonly, a peptide assigned to multiple accession numbers referred to the same gene.
  2. Recognize “Found” flags and create a data matrix representing the various PD flags indicating their flags:

    • High
    • Peak Found
    • Not Found
curateFromSampleID <- c("^(F[0-9]+:.Sample,.)(WT|KO),.([0-9]+),.([0-9]+)$",
   "(F([123]):.Sample,.LM,.1,.WT,.)n[/]a",
   "(LM_)LM_|(LS_)LS_|(LT_)LT_");
curateToSampleID <- c("\\1LM,.\\3,.\\2,.\\4",
   "\\1\\2",
   "\\1\\2\\3");

allNormPepRawSEL <- lapply(nameVectorN(pepDataRawL), function(iName){
   #printDebug("iName:", iName);
   iDF <- pepDataRawL[[iName]];
   iAllNormSE <- processPDdata(iDF,
      iName=iName,
      returnType="SummarizedExperiment",
      sampleColnames=c("SampleCode","Blank","Line","Animal","Genotype", "Rep"),
      abundancesGrep=c(Raw="^abundance:.",
         Scaled="Abundances.[(]Scaled[)]",
         Normalized="Abundances.[(]Normalized[)]",
         Count="Abundances.Count"),
      groupColnames=c("Experiment","Line","Genotype"),
      labelColnames=c("Experiment","Line","Genotype","Animal","Rep"),
      sampleCurateFrom=curateFromSampleID,
      sampleCurateTo=curateToSampleID,
      orgDb="org.Mm.eg.db",
      keepFullTable=TRUE,
      verbose=FALSE);
   iAllNormSE;
});
kable(sdim(allNormPepRawSEL)) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD");
rows cols class
LM 17087 36 SummarizedExperiment
LS 14717 39 SummarizedExperiment
LT 15856 39 SummarizedExperiment

Merge SummarizedExperiment objects

This step creates one large data matrix, whose rownames are defined using the unique peptide_PTM identifier described above. No values are modifed during this step, its main purpose is to ensure all experimental data shares a common row (peptide) identifier for any subsequent comparisons.

Note: Accession numbers are retained during this step, however we noticed that accession numbers are not consistently assigned to accession numbers across different PD experiments. The Accession numbers represent the union of all accession numbers assigned to each peptide-PTM entry.

allNormPepRawUnionSE <- unionSEL(allNormPepRawSEL,
   probeID="probeID",
   geneKeepColnames=c("probeID",
      "geneSymbol", "SeqPTM", "EG", "Accession"),
   verbose=FALSE);
kable(sdim(allNormPepRawUnionSE)) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD");
rows cols class
colData 114 15 DataFrame
assays 24112 114 ShallowSimpleListAssays
NAMES 24112 character
elementMetadata 24112 5 DataFrame
metadata 3 list

Label peptides geneSymbol_pepNum_ptmNum

Change row names to geneSymbol_peptide_PTM, to make it easier to understand the relationship between gene, peptide, and PTM for each peptide.

For each gene, peptide sequences are sorted by size, alphabetically, then numbered.

  • In this way, the peptide number helps indicate when a peptide is present in multiple rows.
  • For example, the gene Apoe has peptide “TANLGAGAAQPLR” and “TANLGAGAAQPLRDR” which visibly look very similar, however they would be renamed “pep9” and “pep14”.

The “PTM number” is similarly defined by the sorted PTMs observed for each peptide.

  • For example, the gene Apoe has a peptide “NEVHTMLGQSTEEIR” which was observed with no PTM, and two PTMs. In this case, the peptide is renamed “pep13”, and the two PTMs are named “ptm1” and “ptm2”. For this peptide here are the three new names:

    1. APOE_pep13
    2. APOE_pep13_ptm1
    3. APOE_pep13_ptm2

To summarize for one peptide of Apoe:

Gene Peptide PTM New_Name
Apoe NEVHTMLGQSTEEIR Apoe_pep13
Apoe NEVHTMLGQSTEEIR 1xDeamQ9 Apoe_pep13_ptm1
Apoe NEVHTMLGQSTEEIR 1xOxidM6 Apoe_pep13_ptm2
allNormPepRawUnionSE <- addPeptidePTMLabels(allNormPepRawUnionSE);

## Add indication of how many genes are associated with each peptide
rowData(allNormPepRawUnionSE)$GeneCount <- lengths(
   strsplit(rowData(allNormPepRawUnionSE)$geneSymbol, ","));
## Filter for peptide rows with counts
iGenesPepRaw <- rownames(removeBlankRowCols(allNormPepRawUnionSE));

For example we can display the actual data for selected genes.

## Show the full table for selected genes
keepGenes <- c("Apoe","Hbb-b2");
iDF <- mixedSortDF(as.data.frame(
   subset(rowData(allNormPepRawUnionSE), geneSymbol %in% keepGenes)[,
      c("geneSymbol","GenePep","PTMnum","Peptide","PTM","GenePepPTM")]))
colorSubPepNum <- group2colors(unique(iDF[,"GenePep"]));
iDF <- kable_coloring(iDF,
      colorSub=colorSubPepNum,
      row_color_by="GenePep",
      row.names=FALSE,
      returnType="kable") %>%
   collapse_rows(columns=1:2, valign="middle") %>%
   row_spec(0, background="#DDDDDD")
iDF;
geneSymbol GenePep PTMnum Peptide PTM GenePepPTM
Apoe Apoe_pep1 FWDYLR Apoe_pep1
Apoe_pep2 LGPLVEQGR Apoe_pep2
Apoe_pep3 MEEQTQQIR Apoe_pep3
Apoe_pep4 GRLEEVGNQAR Apoe_pep4
Apoe_pep5 LGADMEDLRNR Apoe_pep5
Apoe_pep6 LGKEVQAAQAR Apoe_pep6
Apoe_pep7 SKMEEQTQQIR Apoe_pep7
Apoe_pep8 GWFEPIVEDMHR Apoe_pep8
Apoe_pep9 TANLGAGAAQPLR Apoe_pep9
Apoe_pep10 DRLEEVREHMEEVR Apoe_pep10
ptm1 DRLEEVREHMEEVR 1xOxidM10 Apoe_pep10_ptm1
Apoe_pep11 ELEEQLGPVAEETR Apoe_pep11
Apoe_pep12 LKGWFEPIVEDMHR Apoe_pep12
Apoe_pep13 NEVHTMLGQSTEEIR Apoe_pep13
ptm1 NEVHTMLGQSTEEIR 1xDeamQ9 Apoe_pep13_ptm1
ptm2 NEVHTMLGQSTEEIR 1xOxidM6 Apoe_pep13_ptm2
Apoe_pep14 TANLGAGAAQPLRDR Apoe_pep14
Apoe_pep15 AYKKELEEQLGPVAEETR Apoe_pep15
Apoe_pep16 IQASVATNPIITPVAQENQ Apoe_pep16
ptm1 IQASVATNPIITPVAQENQ 1xDeamQ.N Apoe_pep16_ptm1
Hbb-b2 Hbb-b2_pep1 VNPDEVGGEALGR Hbb-b2_pep1
ptm1 VNPDEVGGEALGR 1xDeamN2 Hbb-b2_pep1_ptm1

About the SummarizedExperiment object

SummarizedExperiment documentation

The object allNormPepRawUnionSE object is SummarizedExperiment, which contains a data matrix with Raw, Normalized, Scaled, and Counts data directly obtained from PD software.

  1. Raw measurements are accessed assays(allNormPepRawUnionSE)[["Raw"]].
  2. Gene annotations are accessed rowData(allNormPepRawUnionSE), as a data.frame which contains peptide and gene annotations. Each row describes peptide measurements, including the peptide sequence, PTM string, Entrez gene ID, gene symbol, etc.
  3. Sample annotations are accessed colData(allNormPepRawUnionSE), as a data.frame. Each row describes one biological sample, including the mouse line (LM, LS, LT), genotype (WT, KO), animal number (1, 2, 3, 4, 5, 6), and technical replicate (1, 2, 3).

The allNormPepRawUnionSE object will also store the normalized data matrix, the statistical design, and statistical comparisons.

Summary information about peptide abundances

Now that we have prepared the data matrix, we can describe some basic features of the data, first is the number of rows containing any measurement.

######################################################################
## Filter for peptide rows with counts
peptideSummaryDF <- data.frame(check.names=FALSE,
   `Summary Stats`=c(
      `Total Peptide Rows`=formatInt(nrow(allNormPepRawUnionSE)),
      `Peptide Rows with Data`=formatInt(length(iGenesPepRaw))
   ));
kable(peptideSummaryDF) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD");
Summary Stats
Total Peptide Rows 24,112
Peptide Rows with Data 10,481

Note that most of the peptide rows where PD assigned abundance values are also rows without PTMs:

## add a column "HasData" to rowData(allNormPepRawUnionSE)
rowData(allNormPepRawUnionSE)[["HasData"]] <- (rownames(allNormPepRawUnionSE) %in%
   iGenesPepRaw);

## Calculate a frequency table of "HasData" with "HasPTM"
hasDataPTMtable <- table(
   HasPTM=ifelse(!rowData(allNormPepRawUnionSE)$PTM %in% "",
      "hasPTM", "noPTM"),
   HasData=ifelse(rowData(allNormPepRawUnionSE)$HasData,
      "hasData", "noData"));
kable(hasDataPTMtable) %>%
   kable_styling() %>%
   row_spec(0, background="#DDDDDD")
hasData noData
hasPTM 1624 6409
noPTM 8857 7222

Experiment design and color-coding for figures

For consistency in downstream figures, we define a full data.frame with experimental factors, with corresponding colors.

#####################################
## Vector of all sample identifiers
iSamplesPepRaw <- rownames(allNormPepRawUnion$targets);

#####################################
## Define colors
## Line and Genotype
colorSubLine <- c(
   LM="brown3",
   LS="darkorange",
   LT="purple4",
   KO="royalblue",
   WT="red4");
## Sample Groups
colorSubBase <- c(
   LM_WT="brown3",
   LS_WT="salmon",
   LT_WT="coral3",
   LM_KO="darkslateblue",
   LS_KO="royalblue1",
   LT_KO="deepskyblue4");
## Replicate numbers
colorSubRep <- nameVector(
   color2gradient("grey40", n=6, gradientWtFactor=2),
   1:6);
## color by unique animal number
iSamplesPepRawAni <- (unique(gsub("_[0-9]$", "", iSamplesPepRaw)));
colorSubAni <- nameVector(colorSubBase[gsub("^L[ST]_(L)|_[0-9].*$", "\\1",
      iSamplesPepRawAni)],
   iSamplesPepRawAni);
colorSubAni2 <- color2gradient(gradientWtFactor=1/2,
   colorSubAni[sort(names(colorSubAni))]);
colorSubTechRep <- nameVector(colorSubAni2[gsub("^L[ST]_(L)|_[0-9]$", "\\1",
      iSamplesPepRaw)],
   iSamplesPepRaw);

## Put the colors together into a named vector
colorSubPepRaw <- c(colorSubLine,
   colorSubBase,
   colorSubAni2,
   colorSubTechRep,
   colorSubRep);
########################################################
## Make a data.frame with experiment design
targetsPepRawDF <- allNormPepRawUnion$targets[,c("Experiment","groupName","Line",
   "Animal","Genotype","Rep","sampleLabel")];
targetsPepRawDF$Experiment <- makeFactor(targetsPepRawDF$Experiment);
targetsPepRawDF$Genotype <- makeFactor(targetsPepRawDF$Genotype,
   provigrepPattern=c("WT","KO","."));
targetsPepRawDF$Line <- makeFactor(targetsPepRawDF$Line);
targetsPepRawDF$Animal <- asFactor(targetsPepRawDF$Animal);
targetsPepRawDF$groupName <- gsubOrdered("LM_(LM)|LS_(LS)|LT_(LT)", "\\1\\2\\3",
   pasteByRowOrdered(targetsPepRawDF[,c("Experiment","Line","Genotype")]));
targetsPepRawDF$groupAnimal <- pasteByRowOrdered(targetsPepRawDF[,c("groupName","Animal")]);

## Sort the data.frame in proper factor order
targetsPepRawDF <- mixedSortDF(targetsPepRawDF);

## Define a data.frame of colors using the named color vector
targetsPepRawDFcolors <- df2groupColors(targetsPepRawDF,
   colorSub=colorSubPepRaw);
par("mfrow"=c(1,1), "mar"=c(8,4,4,2));
imageByColors(targetsPepRawDFcolors,
   cellnote=targetsPepRawDF,
   yaxt="n",
   cexCellnote=makeCexCellnote(targetsPepRawDF));

Peptide-level data processing

Normalize peptide abundances

Data is normalized by using the log2 abundance values, and median scaling with a method from the limma R package.

## Normalize
allNormPepRawUnionSE <- doNorm(allNormPepRawUnionSE,
   iGenes=iGenesPepRaw,
   keepZeros=FALSE,
   signal="Raw",
   methodList=c("median"));
## ##  (17:20:28) 21Jun2018:  doNorm(): Applying normalizations for signal: 'Raw' 
## ##  (17:20:28) 21Jun2018:  doNorm(): normMethod: FALSE
## Define ordered sample identifiers
iSamplesPepRaw <- rownames(colData(allNormPepRawUnionSE));
sortColnames <- c("Experiment", "Genotype", "Line", "Animal", "Rep");
iSamplesUnionO <- rownames(mixedSortDF(colData(allNormPepRawUnionSE)[iSamplesPepRaw,sortColnames]));

## Extract the data matrix of normalized values
iMatrixPepRawGmed <- assays(allNormPepRawUnionSE)[["median_Raw"]][iGenesPepRaw,iSamplesUnionO];

Review hemoglobin abundances

One potential insight into the sample preparation relates to the observation of lung lavage bleeding, which may be evident by the presence of hemoglobin peptides.

First, pull together hemoglobin genes with peptide abundance data:

## Hemoglobin gene names to Entrez Gene ID
hbEG <- names(vigrep("hemoglobin", as.list(org.Mm.egGENENAME)));
hbSymbols <- mgetMapped(hbEG, org.Mm.egSYMBOL);
hbGeneNames <- mgetMapped(hbEG, org.Mm.egGENENAME);
hbProbeIDs <- intersect(iGenesPepRaw,
   provigrep(paste0("(^|,)", hbSymbols, "($|_|,)"), rownames(allNormPepRawUnionSE)));

## Remove single-peptide entries for now
hbGeneTc <- tcount(gsub("_.+", "", hbProbeIDs));
hbGeneTc1 <- names(hbGeneTc[hbGeneTc %in% 1]);
hbProbeIDs <- hbProbeIDs[!gsub("_.+", "", hbProbeIDs) %in% hbGeneTc1];

hbDF <- data.frame(hbEG=hbEG,
   hbSymbols=hbSymbols,
   hbGeneNames=hbGeneNames)
## Subset for the peptides being visualized
knitr::kable(subset(hbDF,
   hbEG %in% rowData(allNormPepRawUnionSE[hbProbeIDs,])[,"EG"])) %>%
   kable_styling();
hbEG hbSymbols hbGeneNames
15130 15130 Hbb-b2 hemoglobin, beta adult minor chain
110257 110257 Hba-a2 hemoglobin alpha, adult chain 2
216635 216635 Hbq1a hemoglobin, theta 1A
100503605 100503605 Hbb-bs hemoglobin, beta adult s chain

Next, produce a heatmap of abundances, using the median-normalized abundance values, visualizing all technical replicates.

## iMatrix
iMatrixPepRawG <- log2(1+assays(allNormPepRawUnionSE[iGenesPepRaw,iSamplesUnionO])[["Raw"]]);
## color scale visual max
farbeLim <- sqrt(400000000);

## Make a color key to add to the left of the heatmap
iRowMeans <- rowMeans(na.rm=TRUE,
   (2^(iMatrixPepRawGmed[iGenesPepRaw,iSamplesUnionO])-1));
iRowMeanNote <- data.frame(mean_abundance=format(
   justify="right",
   big.mark=",",
   scientific=FALSE,
   iRowMeans));
iRowMeanColors <- data.frame(mean_abundance=vals2colorLevels(
   symmetricZero=TRUE,
   farbeLim=farbeLim,
   colorRamp="RdGy_r",
   rmNA(naValue=,
      sqrt(iRowMeans))));

## Ordered sample identifiers
iSamplesUnionO <- rownames(targetsPepRawDF);
iSamplesUnionO1 <- unvigrep("L._LM", iSamplesUnionO);

## Produce a heatmap
## heatmap using hemoglobins only, color scale sqrt
heatmap.3(rmNA(naValue=-10000,
      sqrt(2^iMatrixPepRawGmed[hbProbeIDs,iSamplesUnionO1]-1),
   ),
   cexCol=0.7,
   col="RdGy_r",
   srtCol=90,
   Colv=FALSE,
   add.expr=box(),
   rowGroups=gsub("_.*$", "", nameVector(hbProbeIDs)),
   sepcolor="black",
   symkey=FALSE,
   keyMargins=c(17,7,5,2),
   keyLabel="",
   keysize=c(0.7,1.2),
   tracecol="transparent",
   farbeLim=farbeLim,
   margins=c(15,15),
   densityScale="sqrt",
   main="PD Abundances",
   srtRowSideColorsNote=0,
   lwidRowSideColors=0.7,
   RowSideColors=iRowMeanColors[hbProbeIDs,,drop=FALSE],
   RowSideColorsNote=iRowMeanNote[hbProbeIDs,,drop=FALSE],
   ColSideColors=t(targetsPepRawDFcolors[iSamplesUnionO1,4:8]),
   ColSideColorsNote=t(targetsPepRawDF[iSamplesUnionO1,4:8]),
   lheiColSideColors=1.3,
   cexColSideColorsNote=(makeCexCellnote(targetsPepRawDF[iSamplesUnionO1,8:4],
      cex=1.8, expFactor=0.3)),
   srtColSideColorsNote=90,
   subMain=c("Median normalized",
      "Showing only hemoglobin peptides"));

Overall, it would appear that mouse LT_KO_5 might have elevated hemoglobin abundances, however the other mice appear to have relatively consistent variability.

Combine peptides into proteins

The function CollapseSEbyRow() combines peptide abundances per gene, while also combining the relevant gene annotations into one row per gene.

The groupFuncName="sum" takes the sum of all peptide abundances per gene.

## Exponentiate the median normalized signal
if (max(assays(allNormPepRawUnionSE)$median_Raw, na.rm=TRUE) < 100) {
   assays(allNormPepRawUnionSE)$median_Raw <- 2^(assays(allNormPepRawUnionSE)$median_Raw)-1;
}

## Combine peptide rows per gene
allNormPepRawUnionSEgrp <- CollapseSEbyRow(
   allNormPepRawUnionSE[iGenesPepRaw,iSamplesUnionO],
   signal=c("Raw","median_Raw"),
   groupFuncName="sum",
   rowDataColnames=c("geneSymbol","EG","Accession","PTMcount","AAcount"),
   numShrinkFunc=c(AAcount=mean, PTMcount=mean),
   rows=iGenesPepRaw,
   rowGroups=nameVector(rowData(allNormPepRawUnionSE[iGenesPepRaw,])
      [,c("geneSymbol","GenePepPTM")])
)
## ##  (13:06:24) 22Jun2018:  CollapseSEbyRow(): Performing these operations by signal: 
##            groupFuncName rowStatsFunc
## Raw                  sum         NULL
## median_Raw           sum         NULL
## ##  (13:06:24) 22Jun2018:  CollapseSEbyRow(): Collapsing signal:Raw 
## ##  (13:06:26) 22Jun2018:  CollapseSEbyRow(): Collapsing signal:median_Raw 
## ##  (13:06:29) 22Jun2018:  CollapseSEbyRow(): Collapsing rowData(SE) 
## ##  (13:06:30) 22Jun2018:  CollapseSEbyRow(): Re-delimiting unique values in column:geneSymbol 
## ##  (13:06:30) 22Jun2018:  CollapseSEbyRow(): Re-delimiting unique values in column:EG
## head(rowData(allNormPepRawUnionSEgrp));

After this operation 24,112 peptide rows were combined into 1,850 gene rows.

Protein-level data processing

MA-plots of protein abundance data

To review the variability of technical replicates, we create MA-plots centered by each technical replicate triplicate. The x-axis shows the mean signal for each protein, the y-axis shows the log2 difference from mean per protein, per technical replicate group.

For each panel, the MAD is determined using signal at or above 20 (roughly abundance 1 million), then MAD factors are calculated as the ratio of each sample MAD versus divided by the median MAD. A threshold was set at 2x MAD for outliers, which are shaded in yellow.

There are a number of obvious individual outliers, but there appears to be larger overall MAD factor from technical replicate 2. This observation is most apparent with the “LM” mouse line, mostly because the LM samples had few individual outliers, and very low overall variance. The “LS” mouse line showed higher variance for technical replicate 2, but not above the MAD cutoff since the overall variance was higher than LM.

## MA-plot
   par("oma"=c(0,0,4,0));
   iLM <- vigrep("^LM_(KO|WT).*[123]$", colnames(allNormPepRawUnionSEgrp));
   outliersLM <- c("LM_WT_1_1", "LM_WT_2_2", "LM_WT_4_2",
      "LM_WT_6_2", "LM_KO_1_2", "LM_KO_2_2", "LM_KO_3_2",
      "LM_KO_4_2", "LM_KO_6_2");
   jpLM <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLM])$median_Raw),
      ncol=6,
      displayMAD=TRUE, outlierMAD=2, outlierRowMin=21,
      titleCex=0.9,
      titleBoxColor=colorSubPepRaw[iLM],
      groupSuffix="",
      useMean=TRUE,
      colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
      controlSamples=setdiff(iLM, outliersLM),
      #controlSamples=setdiff(vigrep("_[13]$", iLM), outliersLM),
      centerGroups=gsub("_[123]$", "", iLM));
   title(main="LM mouse line MA-plot\nprotein data centered per techrep triplet",
      outer=TRUE);

   outliersLM <- attr(jpLM, "mvaMADoutliers");
## MA-plot
   par("oma"=c(0,0,4,0));
   iLS <- vigrep("^LS_(KO|WT)", colnames(allNormPepRawUnionSEgrp));
   outliersLS <- NULL;
   jpLS <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLS])$median_Raw),
      ncol=6,
      displayMAD=TRUE, outlierMAD=2, outlierRowMin=20,
      titleCex=0.9,
      titleBoxColor=colorSubPepRaw[iLS],
      groupSuffix="",
      useMean=TRUE,
      colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
      controlSamples=vigrep("_[13]$", iLS),
      #controlSamples=setdiff(iLS, outliersLS),
      #controlSamples=setdiff(vigrep("_[13]$", iLS), outliersLS),
      centerGroups=gsub("_[123]$", "", iLS));
   title(main="LS mouse line MA-plot\nprotein data centered per techrep triplet",
      outer=TRUE);

   outliersLS <- attr(jpLS, "mvaMADoutliers");
## MA-plot
   par("oma"=c(0,0,4,0));
   iLT <- vigrep("^LT_(KO|WT)", colnames(allNormPepRawUnionSEgrp));
   outliersLT <- c("LT_WT_6_1", "LT_KO_1_3", "LT_KO_3_3",
      "LT_KO_6_3");
   jpLT <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp[,iLT])$median_Raw),
      ncol=6,
      displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
      titleCex=0.9,
      titleBoxColor=colorSubPepRaw[iLT],
      groupSuffix="",
      useMean=TRUE,
      colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
      controlSamples=setdiff(iLT, outliersLT),
      centerGroups=gsub("_[123]$", "", iLT));
   title(main="LT mouse line MA-plot\nprotein data centered per techrep triplet",
      outer=TRUE);

   outliersLT <- attr(jpLT, "mvaMADoutliers");

Combine protein technical replicates

Three technical replicates were performed for each animal, after lung lavage fluid was isolate, this fluid was subjected to proteomics analysis three times. These replicates are combined prior to downstream analysis, evaluating each triplicate for outliers, using a MAD factor cutoff of 5. For each triplet, the MAD is calculated, and any point more than 5x the MAD is considered an outlier.

Given the MA-plots of the technical replicates above, individual outlier samples were defined and omitted from this procedure. Notably, any technical triplicate which already had one or more samples removed based upon the MA plot above cannot have any further outliers removed, since two replicates are not sufficient to establish an outlier condition.

TechRepSamples <- setdiff(iSamplesUnionO,
   c(outliersLM, outliersLS, outliersLT));
TechRepGroups <- nameVector(
   pasteByRow(colData(allNormPepRawUnionSEgrp[,TechRepSamples])[,c("groupName","Animal")]),
   TechRepSamples);
## Combine peptide rows per gene
allNormPepRawUnionSEgrp2 <- CollapseSEbyColumn(
   allNormPepRawUnionSEgrp[,TechRepSamples],
   signal=c("median_Raw"),
   colDataColnames=c("Experiment","Line","Animal","Genotype","groupName"),
   numShrinkFunc=c(AAcount=mean, PTMcount=mean),
   cols=TechRepSamples,
   colGroups=TechRepGroups
)

In summary:

  • 114 total technical replicate samples, from
  • 38 biological samples
  • 17 outlier samples defined from the MA-plot centered by technical replicate
  • 97 technical replicate samples were used to combine technical replicates
  • 37 biological samples resulted

MA-plots using proteins per biological sample

We now revisit the MA-plots, using the per-protein data, for each biological sample.

iLM2 <- vigrep("^LM_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
jpLM2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLM2])$median_Raw),
   ncol=3,
   titleCex=0.9,
   titleBoxColor=colorSubPepRaw[iLM2],
   displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
   colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
   groupSuffix="",
   useMean=FALSE,
   centerGroups=gsub("_[0-9]+$", "", iLM2));
title(main=paste0("LM mouse line MA-plot\n",
   "protein data per biological replicate\n",
   "centered per KO or WT genotype"),
   outer=TRUE);

outliersLM2 <- attr(jpLM2, "mvaMADoutliers");
iLS2 <- vigrep("^LS_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
jpLS2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLS2])$median_Raw),
   ncol=3,
   titleCex=0.9,
   blankPlotPos=6,
   titleBoxColor=colorSubPepRaw[iLS2],
   displayMAD=TRUE, outlierMAD=1.9, outlierRowMin=20,
   colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
   groupSuffix="",
   useMean=FALSE,
   centerGroups=gsub("_[0-9]+$", "", iLS2));
title(main=paste0("LS mouse line MA-plot\n",
   "protein data per biological replicate\n",
   "centered per KO or WT genotype"),
   outer=TRUE);

outliersLS2 <- attr(jpLS2, "mvaMADoutliers");
iLT2 <- vigrep("^LT_(KO|WT)", colnames(allNormPepRawUnionSEgrp2));
par("oma"=c(0,0,4,0));
outliersLT2 <- NULL;
jpLT2 <- jammaplot(log2(1+assays(allNormPepRawUnionSEgrp2[,iLT2])$median_Raw),
   ncol=3,
   titleCex=0.9,
   titleBoxColor=colorSubPepRaw[iLT2],
   displayMAD=TRUE, outlierMAD=1.6, outlierRowMin=20,
   colrampOutlier=c("palegoldenrod", "lightblue", "blue", "navy", "orange", "orangered2"),
   controlSamples=setdiff(iLT2, outliersLT2),
   groupSuffix="",
   useMean=TRUE,
   centerGroups=gsub("_[0-9]+$", "", iLT2));
title(main=paste0("LT mouse line MA-plot\n",
   "protein data per biological replicate\n",
   "centered per KO or WT genotype"),
   outer=TRUE);

outliersLT2 <- attr(jpLT2, "mvaMADoutliers");

Three technical replicates were performed for each animal, which will be combined prior to downstream analysis. The method evaluates each triplicate for outliers, using a MAD factor cutoff of 5, meaning that for each triplet, the MAD is calculated, and any point more than 5x the MAD is considered an outlier.

## combine tech rep data
iSamplesUnionOtechRep <- nameVector(gsub("_[0-9]+$", "", iSamplesUnionO),
   iSamplesUnionO);

## Produce a matrix of data after outlier removal
iMatrixPepRawGmedTechMAD <- rowGroupMeans(iMatrixPepRawGmed[,iSamplesUnionO],
   useMedian=FALSE,
   returnType="input",
   madFactor=5,
   rmOutliers=TRUE,
   groups=iSamplesUnionOtechRep[iSamplesUnionO]);

## Summary table of outliers from the raw data, versus outliers
## after MAD outlier filtering
outlierDF <- table2dataFrame(table(
   MADoutliers=is.na(iMatrixPepRawGmedTechMAD[,iSamplesUnionO]),
   rawOutliers=is.na(iMatrixPepRawGmed[,iSamplesUnionO])));
knitr::kable(outlierDF);
rawOutliers:FALSE rawOutliers:TRUE
MADoutliers:FALSE 527408 0
MADoutliers:TRUE 31370 636056
## Summary stats
numMADoutliers <- outlierDF["MADoutliers:TRUE","rawOutliers:FALSE"];
pctMADoutliers <- round(outlierDF["MADoutliers:TRUE","rawOutliers:FALSE"]/sum(outlierDF[,"rawOutliers:FALSE"])*1000)/10;

Overall, 31,370 individual values were removed, roughly 5.6%, and only for triplicates containing three values, where the third value was 5xMAD away from the other two.

Now, combine technical replicates to produce a new data matrix.

## Combine technical replicates, using MAD factor cutoff of 5 for outliers
iMatrixPepRawGmedTechGrp <- rowGroupMeans(iMatrixPepRawGmed[,iSamplesUnionO],
   useMedian=FALSE,
   madFactor=5,
   rmOutliers=TRUE,
   groups=iSamplesUnionOtechRep[iSamplesUnionO]);
iSamplesUnionOtech <- colnames(iMatrixPepRawGmedTechGrp);
iSamplesUnionOtech1 <- unvigrep("_LM", iSamplesUnionOtech);

We plotted a histogram showing the number of samples with measured abundance values per peptide, out of 36 total samples. The distribution was so interesting, we further color-coded the bars by the number of sample groups represented. It shows that in general, mouse lines agree closely with one another – a peptide detected in a mouse wildtype line is almost always detected in the same mouse knockout line.

## Look at number of animals with detected peptides by sample group
detectedGroupsV <- nameVector(gsub("_[0-9].*$", "", iSamplesUnionOtech1),
   iSamplesUnionOtech1);
detectedGroupCountL <- apply(iMatrixPepRawGmedTechGrp[,iSamplesUnionOtech1], 1, function(i){
   detectedGroupsV[!is.na(i)]
});
detectedGroupCount <- lengths(strsplit(cPasteUnique(detectedGroupCountL), ","));
detectedSampleCount <- lengths(detectedGroupCountL);
gplots::barplot2(table(detectedGroupCount, detectedSampleCount),
   las=2,
   ylim=c(0,1100),
   ylab="Number of peptides",
   xlab="Number of samples with non-zero abundance per peptide",
   col=rainbowCat(6));
legend("topleft",
   cex=0.8,
   bg="transparent",
   box.col="transparent",
   title="groups detected",
   inset=c(0.05,0.05),
   fill=rainbowCat(6),
   legend=1:6);

detectedDF <- table2dataFrame(table(
   Groups=detectedGroupCount,
   S=detectedSampleCount));
colnames(detectedDF) <- gsub("^.*:", "", colnames(detectedDF));
knitr::kable(detectedDF);
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
Groups:1 44 32 30 14 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Groups:2 0 20 64 68 161 205 269 340 509 608 733 836 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Groups:3 0 0 1 2 3 7 8 13 8 10 9 16 13 8 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Groups:4 0 0 0 2 2 0 2 7 9 13 27 36 43 60 75 85 108 143 177 226 260 277 277 182 0 0 0 0 0 0 0 0 0 0 0 0
Groups:5 0 0 0 0 0 0 0 0 0 0 1 2 3 0 6 9 10 5 5 3 11 9 5 8 7 3 3 2 0 0 0 0 0 0 0 0
Groups:6 0 0 0 0 0 0 0 0 0 0 0 0 1 1 3 2 3 11 10 12 27 38 32 42 61 66 108 137 172 206 277 349 528 561 691 997

Interesting observations:

  • Of the 890 peptides detected in only 12 samples, 93% of them were from the wildtype and knockout from the same mouse line.
  • There are roughly as many peptides (3,813) that are specific to one mouse line, as there are peptides (4,335) that are shared across all three mouse lines.

Combine peptides to protein

Next, the peptide abundances are combined into protein level abundance values, taking care to maintain the gene-specificity.

## Aggregate tech-rep combined abundances from peptide to protein
iRow <- match(rownames(iMatrixPepRawGmedTechGrp),
   rowData(allNormPepRawUnionSE)$probeID);
iGenesProtV <- nameVector(rowData(allNormPepRawUnionSE[iRow,])[,c("geneSymbol","probeID")]);
## Take only peptides assigned to one gene
iGenesProtVuse <- unvigrep(",", iGenesProtV);
## Non-unique genes
nonUniqueGenes <- mixedSort(setdiff(
   unique(unlist(strsplit(iGenesProtV,","))),
   unique(iGenesProtVuse)));

There are 1 unique gene symbols represented, and 1 unique gene symbols which have at least one peptide specific to that gene.

That means there are 0 genes for which there are peptides measuring multiple genes, sometime seen with gene families with extremely high homology. These genes are:

## aggregate peptides to protein taking normal sum of abundance
iMatrixPepRawGmedTechGrpProtSum <- t(rowGroupMeans(
   x=t(2^iMatrixPepRawGmedTechGrp[names(iGenesProtVuse),,drop=FALSE]-1),
   groups=iGenesProtVuse,
   rowStatsFunc=function(x,...){rowSums(x,...,na.rm=TRUE)},
   verbose=FALSE,
   keepNULLlevels=TRUE,
));
## Revert zero to NA
iMatrixPepRawGmedTechGrpProtSum[iMatrixPepRawGmedTechGrpProtSum == 0] <- NA;

Peptide-level technical replicates

Combine peptide technical replicates

Individual technical replicate outlier samples were defined using protein-level abundance values, in order to provide a robust measure of protein abundances by which to measure sample processing outliers.

However, individual peptides are useful for analysis in part because the PD software includes post-translational modification (PTM) data alongside peptides where PD detects such PTM events.

Therefore it is useful to combine technical replicates at the peptide level. We perform this step using:

  • the non-outlier samples TechRepSamples determined using the protein-level abundance data, and
  • the iGenesPepRaw peptides having at least one reported PD measurement.
## Combine peptide rows per gene
allNormPepRawUnionSE2 <- CollapseSEbyColumn(
   allNormPepRawUnionSE[iGenesPepRaw,TechRepSamples],
   signal=c("median_Raw"),
   colDataColnames=c("Experiment","Line","Animal","Genotype","groupName"),
   numShrinkFunc=c(AAcount=mean, PTMcount=mean),
   cols=TechRepSamples,
   colGroups=TechRepGroups
);
iSamplesUnionO2 <- colnames(allNormPepRawUnionSE2);

Review of hemoglobin by biological sample

The steps below are the same as with technical replicate data.

One interesting observation is that the peptide expression specific to each mouse line appears more definitive in this view, however the majority of those peptides are also extremely low abundance relative to the highest abundance hemoglobin peptides.

## iMatrix
#iMatrixPepRawG2 <- log2(1+assays(allNormPepRawUnionSE2[iGenesPepRaw,iSamplesUnionO2])$Raw);
iMatrixPepRawGmed2 <- log2(1+assays(allNormPepRawUnionSE2[iGenesPepRaw,iSamplesUnionO2])$median_Raw);
## color scale visual max
farbeLim <- sqrt(400000000);

## Make a color key to add to the left of the heatmap
iRowMeans2 <- rowMeans(na.rm=TRUE,
   (2^(iMatrixPepRawGmed2[iGenesPepRaw,iSamplesUnionO2])-1));
iRowMeanNote2 <- data.frame(mean_abundance=format(
   justify="right",
   big.mark=",",
   scientific=FALSE,
   iRowMeans2));
iRowMeanColors2 <- data.frame(mean_abundance=vals2colorLevels(
   symmetricZero=TRUE,
   farbeLim=farbeLim,
   colorRamp="RdGy_r",
   rmNA(naValue=,
      sqrt(iRowMeans2))));

## Ordered sample identifiers
iSamplesUnionO21 <- unvigrep("L._LM", iSamplesUnionO2);

## New sample data.frame
targetsPepRawDF2 <- as.data.frame(colData(allNormPepRawUnionSE2[,iSamplesUnionO2]));
targetsPepRawDFcolors2 <- df2groupColors(targetsPepRawDF2,
   colorSub=colorSubPepRaw);

## Produce a heatmap
## heatmap using hemoglobins only, color scale sqrt
heatmap.3(rmNA(naValue=-10000,
      sqrt(2^iMatrixPepRawGmed2[hbProbeIDs,iSamplesUnionO21]-1),
   ),
   cexCol=1.3,
   col="RdGy_r",
   srtCol=90,
   Colv=FALSE,
   add.expr=box(),
   rowGroups=gsub("_.*$", "", nameVector(hbProbeIDs)),
   sepcolor="black",
   symkey=FALSE,
   keyMargins=c(17,7,5,2),
   keyLabel="",
   keysize=c(0.7,1.2),
   tracecol="transparent",
   farbeLim=farbeLim,
   margins=c(15,15),
   densityScale="sqrt",
   main="PD Peptide Abundances",
   srtRowSideColorsNote=0,
   lwidRowSideColors=0.7,
   RowSideColors=iRowMeanColors2[hbProbeIDs,,drop=FALSE],
   RowSideColorsNote=iRowMeanNote2[hbProbeIDs,,drop=FALSE],
   ColSideColors=t(targetsPepRawDFcolors2[iSamplesUnionO21,-1]),
   ColSideColorsNote=t(targetsPepRawDF2[iSamplesUnionO21,-1]),
   lheiColSideColors=1.3,
   cellnote=rmNA(naValue="",
      signif(digits=3,
         round((2^iMatrixPepRawGmed2[hbProbeIDs,iSamplesUnionO21]-1)/100000)/10)),
   cexnote=0.65,
   cexColSideColorsNote=(makeCexCellnote(targetsPepRawDF2[iSamplesUnionO21,-1],
      cex=1.8, expFactor=0.3)),
   srtColSideColorsNote=90,
   subMain=c("Median normalized",
      "Showing only hemoglobin peptides",
      "after combining technical replicates.",
      "Heatmap cell labels are abundance in millions."));