How the "Peak Probability Contrasts" (PPC) method works
PPC is a method for classifying spectra into two or more groups (such as normal and diseased), from labelled training data.
The input data can be either raw spectra, one spectrum per individual, or lists
of extracted peaks, one list per individual. If raw spectra are provided,
PPC applies its built-in peak finder. This peak finder is crude: it simply
looks for local maxima in a window whose intensity exceeds the background by
a pre-specified amount. Better results can often be obtained by applying a
more sophisticated peak-finding procedure, and then using the extracted peaks
as input to PPC.
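As an illustration of the kind of crude peak finder described above, here is a minimal Python sketch. It is not PPC's actual code: the function name, the parameters (half_window, background, min_height) and the use of a single constant background level are all assumptions made for the example.

```python
def find_peaks(intensities, half_window=2, background=0.0, min_height=1.0):
    """Return indices of crude peaks in a list of intensities.

    A point counts as a peak if it is the (first) maximum of the window
    centred on it and exceeds the background by at least min_height.
    All parameter names are hypothetical, chosen for this sketch.
    """
    peaks = []
    n = len(intensities)
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = intensities[lo:hi]
        # Local maximum: the first occurrence of the window maximum is at i,
        # so a flat plateau is reported only once.
        is_local_max = window.index(max(window)) == i - lo
        if is_local_max and intensities[i] - background >= min_height:
            peaks.append(i)
    return peaks


spectrum = [0, 1, 5, 1, 0, 0, 2, 9, 2, 0]
strong_peaks = find_peaks(spectrum, half_window=2, background=0.0, min_height=3.0)
```

A real peak finder would typically also smooth the spectrum and estimate a local baseline, which is one reason an external peak-finding procedure may give better results.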
With a list of peaks for each spectrum, PPC does the following:
1. apply hierarchical clustering along the m/z axis, to get a common
set of peaks for all spectra
2. go back to the individual spectra, and record whether each common
peak is present, and the intensity of that peak.
3. for each common peak, find an intensity cutpoint that best discriminates
normal spectra from diseased spectra.
4. for each common peak, compute the proportion of spectra with peaks
above the optimal cutpoint
5. apply a nearest shrunken centroid classifier to these proportions.
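Steps 1 and 2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not PPC's implementation: it uses single-linkage clustering cut at a fixed m/z gap (in one dimension this amounts to splitting the sorted peak positions wherever the gap exceeds max_gap), and it records an absent peak as intensity 0.0. The function names and the max_gap parameter are hypothetical.

```python
def cluster_peak_positions(positions, max_gap):
    """Single-linkage clustering in one dimension: consecutive sorted
    positions stay in the same cluster while their gap is <= max_gap."""
    clusters = []
    for mz in sorted(positions):
        if clusters and mz - clusters[-1][-1] <= max_gap:
            clusters[-1].append(mz)
        else:
            clusters.append([mz])
    return clusters


def common_peaks(peak_lists, max_gap):
    """Step 1: pool the peaks of all spectra and return the (low, high)
    m/z range of each cluster, i.e. of each common peak."""
    pooled = [mz for peaks in peak_lists for mz, _ in peaks]
    return [(c[0], c[-1]) for c in cluster_peak_positions(pooled, max_gap)]


def peak_features(peaks, ranges):
    """Step 2: for one spectrum, record the intensity of each common peak,
    using 0.0 when the spectrum has no peak in that range."""
    features = []
    for low, high in ranges:
        hits = [inten for mz, inten in peaks if low <= mz <= high]
        features.append(max(hits) if hits else 0.0)
    return features


# Each spectrum is a list of (m/z, intensity) pairs.
peak_lists = [
    [(1000.2, 5.0), (2000.0, 3.0)],
    [(1000.5, 7.0)],
    [(1999.6, 4.0), (3000.0, 1.0)],
]
ranges = common_peaks(peak_lists, max_gap=1.0)
table = [peak_features(peaks, ranges) for peaks in peak_lists]
```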
In this last step (step 5), the proportions are shrunk towards each other by a user-specified amount (estimated by cross-validation). If they are shrunk so far that they become equal to one another, that common peak is eliminated from the classifier. Hence PPC has built-in feature selection. Peaks whose proportions differ the most between the two groups are the best ones for discriminating between the groups.
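The cutpoint search, the proportions and the shrinkage can be illustrated for a single common peak with the Python sketch below. It rests on assumptions that the text does not spell out: the cutpoint is taken to be the observed intensity that maximises the absolute difference in proportions, and the shrinkage is a plain soft-threshold towards the midpoint of the two proportions, without the standardisation a full nearest shrunken centroid classifier would apply. All function names and the delta parameter are hypothetical.

```python
def proportion_above(values, cut):
    """Step 4: proportion of spectra whose intensity exceeds the cutpoint."""
    return sum(v > cut for v in values) / len(values)


def best_cutpoint(normal, diseased):
    """Step 3 (assumed criterion): the observed intensity that maximises
    the absolute difference in proportions between the two groups."""
    best_cut, best_gap = None, -1.0
    for cut in sorted(set(normal) | set(diseased)):
        gap = abs(proportion_above(normal, cut) - proportion_above(diseased, cut))
        if gap > best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


def shrink_pair(p_normal, p_diseased, delta):
    """Step 5 (simplified): move both proportions towards their midpoint
    by delta, stopping once they meet."""
    mid = (p_normal + p_diseased) / 2
    half_diff = (p_normal - p_diseased) / 2
    shrunk = max(abs(half_diff) - delta, 0.0)
    if half_diff < 0:
        shrunk = -shrunk
    return mid + shrunk, mid - shrunk


# Intensities of one common peak in each group (0.0 means the peak is absent).
normal = [0.0, 1.0, 2.0, 0.0]
diseased = [5.0, 6.0, 0.0, 7.0]
cut = best_cutpoint(normal, diseased)
p_n = proportion_above(normal, cut)
p_d = proportion_above(diseased, cut)
shrunk_n, shrunk_d = shrink_pair(p_n, p_d, delta=0.1)
```

When delta is at least half the gap between the two proportions, shrink_pair returns two equal values, which corresponds to the peak dropping out of the classifier.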
PPC classifies a new spectrum as follows. A feature vector of zeroes and
ones is created, with an entry for each common peak: 1 if the new spectrum
has a common peak with intensity above the optimal cutpoint for that
common peak, and 0 otherwise. Then the feature vector is compared to
the vectors of proportions for the normal and diseased groups (computed
in step 4), and the spectrum is assigned to the group whose vector is
closest to it in simple Euclidean distance.
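A minimal Python sketch of this classification rule is given below. The function names, the dictionary layout and the example numbers are hypothetical and chosen only for illustration.

```python
def binary_features(intensities, cutpoints):
    """1 if the common peak's intensity is above its cutpoint, else 0."""
    return [1 if x > c else 0 for x, c in zip(intensities, cutpoints)]


def classify(features, group_proportions):
    """Return the label of the group whose proportion vector is nearest.

    Minimising the squared distance picks the same group as minimising
    the Euclidean distance itself.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(group_proportions,
               key=lambda g: sq_dist(features, group_proportions[g]))


cutpoints = [2.0, 1.0, 3.0]            # one cutpoint per common peak
group_proportions = {
    "normal": [0.1, 0.6, 0.2],
    "diseased": [0.65, 0.3, 0.8],
}
new_spectrum = [5.0, 0.0, 4.0]         # intensity at each common peak
features = binary_features(new_spectrum, cutpoints)
label = classify(features, group_proportions)
```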
In addition, PPC can handle batches, e.g. samples run on different chip
surfaces. It concatenates the spectra from different batches into one
long spectrum, and compares spectra from different people on the same
batch.

