How the "Peak Probability Contrasts" (PPC) method works

PPC is a method for classifying spectra into two or more groups (such as normal and diseased), using labelled training data.


The input data can be either raw spectra, one spectrum per individual, or lists of extracted peaks, one list per individual. If raw spectra are provided, PPC applies its built-in peak finder. This peak finder is crude: it simply looks for local maxima within a window whose intensity exceeds the background by a pre-specified amount. Better results can often be obtained by applying a more sophisticated peak-finding procedure, and then using the extracted peaks as input to PPC.
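The kind of crude peak finder described above can be sketched in a few lines. This is an illustrative sketch only, not PPC's actual code: the function name `find_peaks`, the parameters `half_window`, `min_above_background` and `background`, and the use of a single constant background level are all assumptions made for the example.

```python
def find_peaks(intensities, half_window=2, min_above_background=1.0, background=0.0):
    """Return indices of local maxima that clear the background by a margin.

    A point counts as a peak if it is the largest value in the window of
    +/- half_window points around it and its intensity exceeds the
    (assumed constant) background by at least min_above_background.
    """
    peaks = []
    n = len(intensities)
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = intensities[lo:hi]
        value = intensities[i]
        if value != max(window):
            continue  # not a local maximum within the window
        if value - background < min_above_background:
            continue  # too close to the background level
        if window.index(value) != i - lo:
            continue  # on a flat top, keep only the first point
        peaks.append(i)
    return peaks


spectrum = [0, 1, 5, 1, 0, 0.5, 0.2, 3, 0.1, 0]
peak_indices = find_peaks(spectrum, half_window=2, min_above_background=2.0)
```

Here `peak_indices` holds the positions of the two clear maxima (indices 2 and 7); the small bump at index 5 is ignored because it is neither the window maximum nor far enough above the background.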

Given a list of peaks for each spectrum, PPC does the following:

1. apply hierarchical clustering along the m/z axis, to get a common set of peaks for all spectra.
2. go back to the individual spectra, and record whether each common peak is present, and the intensity of that peak.
3. for each common peak, find an intensity cutpoint that best discriminates normal spectra from diseased spectra.
4. for each common peak, compute the proportion of spectra in each group with peaks above the optimal cutpoint.
5. apply a nearest shrunken centroid classifier to these proportions.
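Steps 1 to 4 can be illustrated with a small sketch. Several simplifications here are assumptions rather than details taken from PPC itself: the clustering is a one-dimensional single-linkage clustering cut at a fixed gap (`max_gap`), a peak is matched to a common peak if it lies within a tolerance `tol`, an absent peak is given intensity 0, and "best discriminates" is taken to mean the largest absolute difference in proportions between the two groups. All function and variable names are hypothetical.

```python
def cluster_positions(positions, max_gap):
    """Step 1 (simplified): 1-D single-linkage clustering of m/z positions.

    Sorted positions are split wherever the gap exceeds max_gap; each
    cluster is summarised by its mean position (the "common peak").
    """
    ordered = sorted(positions)
    clusters = [[ordered[0]]]
    for p in ordered[1:]:
        if p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return [sum(c) / len(c) for c in clusters]


def peak_intensity(peaks, centre, tol):
    """Step 2: intensity of a spectrum's peak near centre, or 0.0 if absent."""
    hits = [inten for mz, inten in peaks if abs(mz - centre) <= tol]
    return max(hits) if hits else 0.0


def best_cutpoint(intens_a, intens_b):
    """Steps 3 and 4: cutpoint with the largest gap in group proportions.

    Returns (cutpoint, proportion_a, proportion_b), where each proportion
    is the fraction of spectra in that group with intensity above the cutpoint.
    """
    best_cut, best_pa, best_pb, best_diff = None, 0.0, 0.0, -1.0
    for cut in sorted(set(intens_a + intens_b)):
        pa = sum(x > cut for x in intens_a) / len(intens_a)
        pb = sum(x > cut for x in intens_b) / len(intens_b)
        if abs(pa - pb) > best_diff:
            best_cut, best_pa, best_pb, best_diff = cut, pa, pb, abs(pa - pb)
    return best_cut, best_pa, best_pb


# Toy peak lists: (m/z, intensity) pairs, one list per individual.
normal = [[(1000.2, 5.0), (2000.1, 1.0)], [(1000.5, 6.0)]]
diseased = [[(999.9, 1.0), (2000.4, 4.0)], [(2000.0, 5.0), (1000.1, 0.5)]]

all_mz = [mz for spectrum in normal + diseased for mz, _ in spectrum]
centres = cluster_positions(all_mz, max_gap=1.0)

results = []
for centre in centres:
    a = [peak_intensity(s, centre, tol=1.0) for s in normal]
    b = [peak_intensity(s, centre, tol=1.0) for s in diseased]
    results.append(best_cutpoint(a, b))
```

In this toy data the two common peaks separate the groups perfectly: the peak near m/z 1000 is above its cutpoint in every normal spectrum and in no diseased spectrum, and the peak near m/z 2000 shows the reverse pattern.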

In this last step, the proportions are shrunk towards each other by a user-specified amount (estimated by cross-validation). If they are shrunk so far that they become equal to one another, that common peak is eliminated from the classifier. Hence PPC has built-in feature selection. The peaks whose proportions differ most between the groups are the best ones for discriminating between the groups.
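The shrinkage idea can be sketched for a single common peak and two groups. This is a deliberately simplified version and an assumption about the details: each proportion is moved towards the midpoint of the two by an amount `delta` using soft thresholding, and the standardisation used by a full nearest shrunken centroid classifier is omitted. The names `soft_threshold`, `shrink_proportions` and `delta` are hypothetical.

```python
def soft_threshold(value, delta):
    """Move value towards zero by delta, stopping at zero."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def shrink_proportions(prop_a, prop_b, delta):
    """Shrink two group proportions towards their midpoint by delta each.

    If both end up equal, the peak no longer distinguishes the groups
    and so drops out of the classifier.
    """
    mid = (prop_a + prop_b) / 2
    return (mid + soft_threshold(prop_a - mid, delta),
            mid + soft_threshold(prop_b - mid, delta))


strong = shrink_proportions(0.9, 0.1, delta=0.2)  # stays informative
weak = shrink_proportions(0.6, 0.4, delta=0.2)    # collapses to the midpoint
peak_is_kept = [s[0] != s[1] for s in (strong, weak)]
```

With `delta=0.2`, the first peak keeps a clear difference between groups while the second collapses to a common value and would be dropped. Larger values of `delta` eliminate more peaks, which is why the amount is tuned by cross-validation.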

PPC classifies new spectra as follows. A feature vector of zeroes and ones is created, with an entry for each common peak: 1 if the new spectrum has that common peak with intensity above the optimal cutpoint for that peak, and 0 otherwise. The feature vector is then compared with the vectors of proportions for the normal and diseased groups (computed in step 4), and the spectrum is assigned to the group whose vector is closest in simple Euclidean distance.
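A minimal sketch of this classification rule follows. The group labels, proportion vectors and cutpoints are made-up illustrative values, and the function names are hypothetical; the new spectrum is assumed to be already summarised as one intensity per common peak, with 0 where the peak is absent.

```python
def binary_features(intensities, cutpoints):
    """1 where the spectrum's intensity at a common peak exceeds its cutpoint."""
    return [1 if x > c else 0 for x, c in zip(intensities, cutpoints)]


def classify(features, group_proportions):
    """Return the group whose proportion vector is nearest to the features.

    Squared Euclidean distance is used; it gives the same nearest group
    as Euclidean distance.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(group_proportions,
               key=lambda label: sq_dist(features, group_proportions[label]))


# Illustrative values for two common peaks.
cutpoints = [1.0, 1.0]
group_proportions = {"normal": [0.9, 0.1], "diseased": [0.2, 0.8]}

new_intensities = [5.0, 0.0]  # first peak strong, second peak absent
features = binary_features(new_intensities, cutpoints)
label = classify(features, group_proportions)
```

The new spectrum has feature vector `[1, 0]`, which lies much closer to the normal group's proportions than to the diseased group's, so it is assigned to the normal group.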

In addition, PPC can handle batches, e.g. samples run on different chip surfaces. It concatenates the spectra from different batches into one long spectrum, and compares spectra from different people on the same batch.