"Simplicity is prerequisite for reliability." --Edsger W. Dijkstra
Greylag uses a novel (we believe) "inside-out" search strategy. Database search is essentially a depth-first search, but there are different ways that this search can be structured.
A straightforward search algorithm might work something like this:
We have not exhaustively investigated other programs, but we suspect that they all use algorithms similar to this.
Greylag's search algorithm looks like this:
There are a number of advantages to this inside-out approach. Most importantly, it allows us to hoist a significant amount of the search out of the performance-critical loop, thus allowing quite a bit of the search to be implemented in a high-level language without loss of performance. This is a great benefit because high-level code is much easier to write, understand, make correct, debug, and modify.
For a typical search, in the traditional algorithm described above, step 2 and all subsequent steps will be executed millions of times, thus requiring a low-level implementation.
In the greylag approach, steps 1 through 5 will be executed a relatively small number of times, perhaps as many as a few thousand times for complex searches, but far fewer for basic searches. This means that they are not particularly time-critical, and can be implemented in high-level code. Steps 6 through 11 must still be implemented in a low-level language, but the code required is smaller and simpler because part of the work has already been done in the high-level portion. In fact, for most searches the vast majority of run time is spent in step 11.
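To make the contrast concrete, here is a schematic Python sketch of the two loop orderings. All of the names here are hypothetical and the real step lists differ in detail; this illustrates only the restructuring, not greylag's actual code:

    def straightforward_search(spectra, loci):
        for spectrum in spectra:                             # outermost loop
            for locus in loci:                               # step 2 onward: runs
                for peptide in cleavage_peptides(locus):     #   millions of times
                    for mods in modification_combinations(peptide):
                        if mass_matches(spectrum, peptide, mods):
                            score(spectrum, peptide, mods)

    def inside_out_search(spectra, loci, max_mods):
        by_mass = index_by_precursor_mass(spectra)           # hoisted: built once
        for mod_count in range(max_mods + 1):                # zero mods first, then one, ...
            for composition in mod_compositions(mod_count):  # high-level code: runs at
                for locus in loci:                           #   most a few thousand times
                    for peptide in cleavage_peptides(locus):
                        candidates = by_mass.lookup(modified_mass(peptide, composition))
                        if not candidates:
                            continue                         # no spectrum can match
                        for assignment in position_assignments(peptide, composition):
                            for spectrum in candidates:      # the performance-critical core
                                score(spectrum, peptide, assignment)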
There are several other minor advantages to the inside-out approach. Because a number of common calculations are being lifted out of the inner loop, it's probably a little more efficient than the straightforward approach.
It also makes possible an incremental approach to modification search. Since zero modifications are searched first, then one, then two, etc., one could potentially search for a fixed amount of time and then stop, knowing that the most likely combinations will have been searched first.
There is one potential disadvantage to the inside-out approach, which is that more total work will be done by the modification assignment performed in step 10, versus what happens in step 4 of the straightforward approach. This seems not to be a problem in practice, probably because the cost of the final scoring in step 11 tends to dwarf the costs of the previous steps.
The core algorithm is based on a step-up procedure at a specified FDR (False Discovery Rate), in the manner of Benjamini and Hochberg. In particular, the result scores are ordered, and the lowest score threshold having the specified FDR (or better) is chosen. The number of false non-decoy discoveries in the chosen set is assumed to equal the number of decoy discoveries, given that the decoy database is the same size as the original database.
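As a rough illustration, the threshold choice can be sketched in Python as follows. The function name and data layout are our own inventions, and tie handling is simplified; this is not greylag-validate's actual code:

    def score_threshold_at_fdr(ids, goal_fdr):
        """ids: (score, is_decoy) pairs for all result ids.  Walk down the
        score-ordered list and return the lowest score threshold whose
        estimated FDR (decoy ids / non-decoy ids accepted) is at or below
        goal_fdr, or None if no threshold qualifies."""
        threshold = None
        decoys = targets = 0
        for score, is_decoy in sorted(ids, key=lambda id_: id_[0], reverse=True):
            if is_decoy:
                decoys += 1
            else:
                targets += 1
            if targets > 0 and float(decoys) / targets <= goal_fdr:
                threshold = score    # accepting everything down to here is OK
        return threshold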
Rather than just using scores, the greylag algorithm attempts to maximize the number of ids by adjusting both the score and delta thresholds. Roughly, the delta threshold is stepped through a range of values; at each delta value, the score threshold at the specified FDR and the resulting number of ids are evaluated. The score/delta threshold pair yielding the most ids is chosen.
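The joint optimization can then be sketched as a simple grid search over delta thresholds, reusing the hypothetical score_threshold_at_fdr above (delta_grid and the data layout are likewise assumptions):

    def choose_score_delta(ids, goal_fdr, delta_grid):
        """ids: (score, delta, is_decoy) triples.  For each candidate delta
        threshold, find the score threshold at the goal FDR among the
        surviving ids, count the validated non-decoy ids, and keep the
        score/delta pair that validates the most."""
        best = (None, None, -1)      # (score threshold, delta threshold, id count)
        for delta_min in delta_grid:
            survivors = [(score, is_decoy)
                         for score, delta, is_decoy in ids if delta >= delta_min]
            threshold = score_threshold_at_fdr(survivors, goal_fdr)
            if threshold is None:
                continue
            count = sum(1 for score, is_decoy in survivors
                        if score >= threshold and not is_decoy)
            if count > best[2]:
                best = (threshold, delta_min, count)
        return best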
Outside of this core, greylag-validate allows some filtering of the input ids, according to several criteria. First, ids can be excluded based on trypticity--having an insufficient number of tryptic termini. Ids can also be excluded if they do not satisfy a "minimum peptides per locus" criterion. Finally, a "maximum Sp rank" criterion can be applied (this supports SEQUEST output, but is not useful for greylag output).
After these filters are applied, some peptide ids are added back even if they do not satisfy the filter criteria or score/delta thresholds, in a method we call saturation. The basic idea is that if we have called an id for peptide PEPTIDE valid, we declare all other ids for that peptide (with possible modifications) valid as well, regardless of score/delta/etc. The rationale for this is that once we have identified a peptide as being present in the sample, it is much more likely that further identifications of that peptide are also valid, rather than just occurring by chance. Although this may introduce some false ids, the presumption is that there will still be a larger number of valid ids at the given FDR. If this is not the case, the saturation step is omitted. (The option --no-saturation will cause saturation to be skipped entirely.)
In addition to the saturation described above, we also perform prefix saturation, which is similar except that we add all identifications for peptides that have a prefix that is a peptide previously identified as valid. (We could also perform suffix saturation, but in testing this seems not to be helpful.)
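Both saturation steps can be sketched as follows, assuming each id carries a plain (unmodified) peptide sequence; the names are ours, not greylag's:

    def saturate(all_ids, valid_ids):
        """Add back ids whose peptide sequence matches (saturation) or starts
        with (prefix saturation) the sequence of an already-valid id."""
        valid = set(valid_ids)
        valid_seqs = set(id_.peptide for id_ in valid_ids)
        saturated = list(valid_ids)
        for id_ in all_ids:
            if id_ in valid:
                continue
            if (id_.peptide in valid_seqs
                or any(id_.peptide.startswith(seq) for seq in valid_seqs)):
                saturated.append(id_)
        return saturated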
Because these filters and saturation are applied after the score/delta thresholds are chosen, the resulting FDR may differ from the one the user chose (usually the resulting FDR will be better). To minimize this difference, the core algorithm is run iteratively with differing goal FDRs, trying to generate a resulting FDR close to what the user specified. (The option --no-adjust-fdr will omit this step.)
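The adjustment loop might look roughly like this; the proportional correction is our own guess at a plausible scheme, not necessarily the one greylag-validate uses:

    def adjust_goal_fdr(validate, target_fdr, iterations=8, tolerance=0.0005):
        """validate(goal) runs the whole procedure (thresholds, filters,
        saturation) at the given goal FDR and returns the resulting FDR."""
        goal = target_fdr
        for _ in range(iterations):
            resulting = validate(goal)
            if abs(resulting - target_fdr) <= tolerance:
                break
            # overshot the target -> aim lower next time, and vice versa
            goal *= target_fdr / max(resulting, 1e-9)
        return goal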
The implementation of greylag involved many technological choices and tradeoffs. Here is a discussion of several of the major choices and their rationales.
Peptide searching is very CPU-intensive, and somewhat memory-intensive as well, so the typical choice is a low-level language like C++ (or C) that can generate very efficient programs. An alternative approach is to write the non-critical portions of the program in a high-level language and to use a low-level language only for the performance critical sections. We chose the latter for the following reasons:
Python has the following salient advantages as the high-level language for greylag:
Currently, the only other alternative with these properties is Ruby, which we might have chosen instead.
The only serious candidates for the low-level language are C++ and C (the latter being arguably now just a simpler subset of the former).
We chose C++ with the hope that using STL data structures and algorithms would help us to reduce the amount of code to be written, which seems to have worked.
However, using the STL also introduces a lot of mental overhead and opportunities for subtle bugs--one needs to understand the implications of singular iterators, for example.
Another problem with STL data structures--and C++ objects--is that they're noticeably slower than C-style data structures, and we found it necessary to drop vectors in favor of C arrays for several structures that are operated on in the most CPU-intensive core of the program.
In hindsight, for greylag C++ seems to be at best a wash versus C. We might switch to C in the future.
(Java was ruled out because it's somewhat slower, which is particularly important because we're only using the low-level language for performance-critical sections to begin with. Also, Java's VM makes it difficult to integrate under other languages, and there are lingering licensing issues.)
There are a number of ways to make C++ (or C) code callable from Python:
Greylag currently uses SWIG. Despite its problems, it keeps the code small and allows for functionality to be easily moved back and forth between Python and C++.
There are some other alternatives, too, like Boost.Python, SIP, Weave, etc. These don't appear to be viable for greylag.
Greylag generates peptide modification combinations using depth first search (DFS), which means that memory usage is minimal (logarithmic). The only real limit is run time, which is necessarily exponential with respect to search depth. Greylag limits search depth by explicitly limiting the total number of modifications (modified residues) per peptide.
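For illustration, a depth-limited DFS over modification position sets might look like this in Python (a hypothetical sketch; greylag's actual enumeration lives in the C++ portion and differs in detail):

    def mod_position_sets(candidates, max_mods):
        """Depth-first enumeration of all ways to modify at most max_mods of
        the candidate (position, mass_delta) sites.  Memory use is
        proportional to the recursion depth, not to the exponential number
        of combinations generated."""
        def dfs(start, chosen):
            yield tuple(chosen)
            if len(chosen) < max_mods:
                for i in range(start, len(candidates)):
                    chosen.append(candidates[i])
                    for combo in dfs(i + 1, chosen):
                        yield combo
                    chosen.pop()
        for combo in dfs(0, []):
            yield combo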
Greylag will not generate modified combinations for which there is no spectrum having a suitable precursor mass. So, for example, in considering a peptide 'AAAAAAAAAXXXXXXXDDDDDDDDD' with (say) mass 1200.00 and possible modifications A+10 and D+9, the modified mass would be 1286.00 for five +10 and four +9 modifications. If there are no actual spectra that have precursor mass close to 1286.00, greylag will not attempt to generate the specific modification combinations (e.g., 'AA*AA*A*A*AAA*XXXXXXXD%DD%D%DDDDD%'), thus saving some running time. Of the alternative programs, only SEQUEST appears to implement a similar optimization.
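The pruning check itself can be sketched as a binary search over the sorted precursor masses; the names and tolerance handling here are assumptions:

    import bisect

    def has_matching_spectrum(precursor_masses, peptide_mass, composition,
                              tolerance=0.01):
        """precursor_masses: sorted list of spectrum precursor masses.
        composition: (mass_delta, count) pairs, e.g. [(10.0, 5), (9.0, 4)].
        True iff some spectrum is within tolerance of the modified peptide
        mass, so generating the specific position combinations is worthwhile."""
        total = peptide_mass + sum(delta * count for delta, count in composition)
        i = bisect.bisect_left(precursor_masses, total - tolerance)
        return i < len(precursor_masses) and precursor_masses[i] <= total + tolerance

For the example above, has_matching_spectrum(masses, 1200.00, [(10.0, 5), (9.0, 4)]) checks for spectra near 1286.00 before any of the position-specific combinations are generated.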
SEQUEST appears also to use DFS and limits search depth in a similar way, except that the depth of each kind (e.g., oxidation) of modification is limited separately. The version of SEQUEST we have ("v.27 (rev.9)") can simultaneously search at most three kinds of modification.
X!Tandem also uses DFS, using a recursion-free "state engine" approach that also uses minimal memory. Rather than explicitly limiting search depth, it places a limit (of 4096) on the number of combinations searched for each peptide. In the simple case, each modifiable position is independently modified or not, giving 2^N combinations for N positions, so X!Tandem will consider at most 12 modifications per peptide (2^12 = 4096). (It's not clear to us how terminal modifications and other advanced modifications such as point mutations figure in this.)
This limit is not a parameter of the program and appears to be undocumented, so users may not realize that larger modification sets are silently ignored. Also, since the limit cannot be adjusted (short of recompiling the program), X!Tandem cannot do shallower--and therefore much faster--searches such as searching for peptides with at most two methionine oxidations. (Balancing this somewhat, X!Tandem does provide a unique "refinement" algorithm, but that is orthogonal to the considerations of this section.)
OMSSA appears to use breadth-first search on a per-peptide basis to generate modification combinations, and uses a combination limit similar to that of X!Tandem. This means that it will use an amount of memory linear in the number of combinations, which will be exponential in search depth. This would limit search to a depth of 25 or so on current hardware (2^25 is already more than 30 million combinations), but it's not clear whether this matters in practice.
MyriMatch uses DFS and a depth limit like that of greylag.
As a final detail, the implementations of X!Tandem, OMSSA, and MyriMatch appear to also limit the maximum number of modifications per peptide to the machine word size: 32 on a 32-bit machine and 64 on a 64-bit machine, presumably because each combination is represented as a bitmask within a single machine word. (The corresponding limit for greylag is at least one billion.)