==================
Greylag User Guide
==================

Overview
--------

The basic workflow of greylag is as follows.  First, a shuffled sequence
database is prepared with ``greylag-shuffle-database``.  This generates a
sequence database with one decoy sequence for each real sequence, which is
required by ``greylag-validate``.

Then, the spectra are searched using ``greylag-rally``, which takes a
configuration file and a set of spectrum files as input, and generates a glw
file as output.

The ``greylag-rally`` program connects to one or more ``greylag-chase``
programs running on the same or different hosts, and these worker programs do
the actual search.  The ``greylag-chase`` programs can be started using any
convienient means.

If the search has been broken into multiple pieces, ``greylag-merge`` can be
used to merge the resulting glw files into a single glw file that represents
the best matches found in all of the spectrum files.

Finally ``greylag-sqt`` is used to generate the corresponding sqt output files
(one for each input spectrum file) from a glw file.

After the search, the program ``greylag-validate`` can be used to determine a
set of "valid" identifications that satisfy a specified false discovery rate
(FDR), if the spectra have been searched against a sequence database that
includes decoys.

The auxilliary program ``greylag-flatten-fasta`` can be used to translate
FASTA databases to and from a one-line-per-locus format that makes it easier
to manipulate FASTA files using standard Unix utilities (e.g., ``grep``).


Simple Example
--------------

Here is a transcript for a simple, non-specific search, performed locally on an
eight-processor machine::

    $ ls
    greylag.conf  greylag.hosts  test.ms2  yeast.fasta
    $ cat greylag.conf
    [greylag]
    databases = yeast_decoy.fasta
    mass_regimes = AVG/MONO
    pervasive_mods = +C2H3ON!@C
    potential_mods = O@M oxidation '@'
    potential_mod_limit = 4
    parent_mz_tolerance = 1.25
    $ greylag-shuffle-database yeast.fasta > yeast_decoy.fasta
    six-mers: 2248190 real 2485103 decoy 315491 both
    $ cat greylag.hosts
    localhost:10078
    localhost:10079
    localhost:10080
    localhost:10081
    localhost:10082
    localhost:10083
    localhost:10084
    localhost:10085
    $ greylag-chase --port-count=8 &
    [3] 21736
    $ greylag-chase --port-count=8 &
    [4] 21737
    $ greylag-chase --port-count=8 &
    [5] 21744
    $ greylag-chase --port-count=8 &
    [6] 21745
    $ greylag-chase --port-count=8 &
    [7] 21746
    $ greylag-chase --port-count=8 &
    [8] 21747
    $ greylag-chase --port-count=8 &
    [9] 21748
    $ greylag-chase --port-count=8 &
    [10] 21755
    $ greylag-rally --hostfile=greylag.hosts greylag.conf test.ms2
    $ greylag-sqt chase_unknown.glw
    $ greylag-validate --kill *.sqt
    52 real ids, 0 decoys (FDR = 0.0000) [p1]


This is just an example.  Different command options and configuration file
options will be appropriate, depending on the instrument and situation.
Adding the ``-v`` flag to ``greylag-rally``, for example, will show progress
during the search.

Use the ``--help`` flag to see documentation for each command.


Host File
---------

The host file read by ``greylag-rally`` consists of a series of lines like

    ``host1234:10078``

which indicates that a ``greylag-chase`` program is (potentially) running on
host ``host1234`` and listening for connections on port ``10078``.  While
``greylag-rally`` is running, it will attempt to connect (and reconnect, if
later disconnected) to each of these ``greylag-chase`` instances, once per
minute.

.. caution:: Currently ``greylag-chase`` will accept all connections with no
             authorization required.  As such it probably should not be run on
             an insecure network.  (This problem will be addressed in a future
             version.)


Configuration File
------------------

The configuration file specifies the search parameters and is read by
``greylag-rally``.  It uses a simple, readable format similar to the format of
Windows .ini files.  The file must begin with the section marker ``[greylag]``.
See the included example configuration file.

Except for ``databases`` and probably ``pervasive_mods``, all parameters have
reasonable defaults and often need not be specified.  Note that if a parameter
is specified more than once in the configuration file, the final occurrence
takes precedence.

Here is a detailed description of each parameter.

``databases``
-------------

This is a whitespace-separated list of filenames.  Each filename may be an
absolute or relative path.  *The named file must be visible at this path to
both the* ``greylag-rally`` *and* ``greylag-chase`` *processes.*

Each file should contain a set of sequences in FASTA format.  Locus names must
be unique across all database files, to avoid confusion.  Sequence characters
are uppercased for uniformity.  Non-amino-acid sequence characters are
interpreted as "breaks"--candidate peptides do not cross these breaks.
Specifically, only the residues

    ``A C D E F G H I K L M N P Q R S T V W Y``

are recognized.

This parameter must be specified--there is no default.


``mass_regimes``
----------------

This specifies the mass regimes to be searched.  A *mass regime* is an
assignment of masses to atoms, which in turn determines the mass of each
residue.  The two most familiar mass regimes are monoisotopic masses, denoted
``MONO``, and average masses, denoted ``AVG``.  The former uses atomic masses
that correspond to the most commonly found element isotopes.  The latter uses
atomic masses that are weighted averages, weighted according to the relative
prevalence of the isotopes for each element (according to US NIST figures).

In addition to these two, mass regimes may also look like ``MONO(N15@90%)``,
which denotes a monoisotopic assignment, except that nitrogen is assigned the
mass of N15 instead of the standard N14.  The ``90%`` conceptually indicates
that 90% of the nitrogen in the sample is N15, but this prevalence information
is not currently used in the MONO case.

Similarly, ``AVG(N15@95.5%)`` would indicate that average masses are to be
used, except that nitrogen is assigned the mass

    avgmass(N) + 0.955 * (mass(N15) - mass(N14))

Mass regimes are specified as pairs, where the first member of the pair is
used for parent (precursor) calculations, and the second member is used for
fragment calculations.  For example, ``AVG/MONO`` will use average masses to
calculate the parent mass and monoisotopic masses to do the (virtual) fragment
mass calculations.  For convenience, if only one mass regime is specified, it
is used as both the parent and fragment regimes.

The mass regimes to be searched are specified as a semicolon-separated list.
For example,

    ``AVG/MONO;AVG/MONO(N15@90%)``

would specify two mass regime pairs to search with.

The default for this parameter is ``MONO``.


``pervasive_mods``
------------------

Pervasive modifications (also known as fixed or static modifications) are
changes in the mass of a residue that are assumed to always be present, for
the purpose of the search.  For example

    ``+C2H3ON!@C carboxyamidomethylation``

is the preferred form of the typical C+57 fixed modification specifier.
Instead of a symbolic formula (``+C2H3ON``), a specific numeric value such as
+57 may be specified.  In this case, the same value is used for all mass
regimes.

The ``!`` indicates that this formula is to be evaluated using the first mass
regime and that that value should be used everywhere.  This is useful when one
has a sample part of which is isotope-enriched, but the modification in
question is caused by a later chemical modification that is *not*
isotope-enriched.

The label (*e.g.*, ``carboxyamidomethylation``) is optional, but must be
unique if specified.  It is used to mark modifications in the output, which
makes them easier for downstream software to track.

Each specifier may specify the modification for only a single residue (``C``
in the above example), or ``[`` or ``]`` (the peptide N- and C-terminii,
respectively).  Each residue (or terminus) may have at most one specifier.

The value of this parameter is a comma-separated list of modification
specifiers.  The default is empty, but usually some value will be needed.


``potential_mods``
------------------

Potential modifications (also known as dynamic modifications) are changes in
the mass of a residue that may or may not be present, for the purpose of the
search.  For example

    ``PO3H@STY phosphorylation '*'``

is a potential modification specifier for the phosphorylation of serine,
threonine, or tyrosine.  As with pervasive specifiers, the modification delta
can be specified numerically, and the modification label can be omitted.

The modification marker (``'*'`` in this case) is optional; if present, it is
used to mark occuring modifications in peptides in the output.

The value of this parameter is a boolean expression composed of conjuctions
and disjunctions of potential modification specifiers.  So, for example,

::

    potential_mods = (PO3H@STY phosphorylation '*';
		      C2H2O@KST acetylation '^';
		      CH2@AKST methylation '#'),
		     O@M oxidation '@'

means that each peptide is searched for

* phosphorylation and oxidation, or
* acetylation and oxidation, or
* methylation and oxidation.

This might find for example, a peptide with two phosphorylations and one
oxidation, or a peptide with just one acetylation.  It will not find a peptide
with a phosphorylation and a methylation.

Syntactically, the ``';'`` is disjunction (read "or") and the ``','`` is
conjuction (read "and").  Parentheses may be used, as above, for grouping, and
can be nested without limit.

The purpose of these expressions is to limit search to likely combinations of
potential modifications, as search for arbitrary combinations may be
prohibitively costly (in terms of running time).

Internally the expressions are reduced to `disjunctive normal form`__ and
redundancy is removed.  Roughly, this means that the search times for two
equivalent potential modification expressions will be the same, even though
they are expressed differently.

__ http://en.wikipedia.org/wiki/Disjunctive_normal_form

The default is empty--no potential modifications.


``potential_mod_limit``
-----------------------

This parameter is non-negative integer that specifies the maximum number of
potential modifications (*i.e.*, modified residues) sought for a peptide.
(Terminal modifications, including PCA modifications, are not counted in this
limit.)

In general, run time is an exponential function of this search depth, so large
values are likely to be prohibitive.

The default is ``2``.


``enable_pca_mods``
-------------------

This parameter determines whether PCA modifications are searched for, and must
be ``yes`` (the default) or ``no``.

This feature is borrowed from the `X!Tandem implementation`__.  Roughly, a PCA
is a special kind of N-terminal modification.

__ http://www.thegpm.org/TANDEM/api/rpmm.html

Enabling this option will typically increase search time by around 10% and
increase the number of valid matches found by 5% or so.


``charge_limit``
----------------

This parameter is a positive integer specifying the largest spectrum precursor
charge.  This should be set to the largest precursor charge expected for the
input spectra, and is used to scale the ``parent_mz_tolerance`` parameter.

The default value is ``3``.


``min_peptide_length``
----------------------

This parameter is a positive integer specifying the minimum peptide length to
search.  Peptides shorter than this will not be searched.

The default value is ``5``.


``cleavage_motif``
------------------

This parameter is a string and follows the `X!Tandem syntax for specifying
cleavage sites`__.  In particular ``[X]|[X]`` indicates non-specific
digestion, and ``[KR]|{P}`` indicates tryptic digestion (cleavage following
lysine or arginine, except when followed by proline).

__ http://www.thegpm.org/TANDEM/api/pcs.html

The default value is ``[X]|[X]``.


``maximum_missed_cleavage_sites``
---------------------------------

This parameter is a non-negative integer specifying the maximum number of
missed cleavage sites in a candidate peptide.  (If non-specific cleavage is
being used, the value of this parameter is not used.)

The default value is ``1000000000``.


``min_parent_spectrum_mass``
----------------------------

This parameter is a non-negative floating point value specifying the minimum
parent mass for spectra, measured in Daltons.  Spectra having a smaller parent
mass are skipped.

The default value is ``0``.


``max_parent_spectrum_mass``
----------------------------

This parameter is a non-negative floating point value specifying the maximum
parent mass for spectra, measured in Daltons.  Spectra having a larger parent
mass are skipped.

The default value is ``10000``.


``TIC_cutoff_proportion``
-------------------------

This parameter is a floating point value between zero and one, inclusive.
It's used to filter spectra before search.  The spectrum peaks are sorted by
intensity, and then all but the N most intense peaks are discarded, where N is
the smallest value for which the summed intensity of the remaining peaks is at
least the specified proportion of the total ion current (TIC).

The default value is ``0.98``.


``parent_mz_tolerance``
-----------------------

This parameter is a non-negative floating point value.  It's used to select
candidate spectra that might match a peptide--candidates within the specified
tolerance are considered possible matches and scored.

The unit of this parameter is m/z--the actual tolerance in Daltons will depend
on the charge of the spectrum in question.  If the value of this parameter is
``1.25``, the mass tolerance will be ``1.25 Da`` if the charge is +1.  If the
charge is +3, the mass tolerance is ``3.75 Da``, meaning that the difference
between the masses of the actual and predicted spectrum precursors must be
less than ``3.75 Da``.

Search run time will be roughly proportional to this value (at least across
the range of interest), so it's convenient to choose the smallest value that
will work.  Sufficiently small values will start excluding true matches, so a
choice that's too small will be counterproductive.  (Excessively large values
may also reduce the number of valid matches found.)

The default value is ``1.25``.


``fragment_mass_tolerance``
---------------------------

This parameter is a non-negative floating point value, measured in Daltons.
It determines how close together a real and a predicted fragment peak have to
be to be counted as overlapping (or matching).

The default value is ``0.5``.


``intensity_class_count``
-------------------------

This parameter and the next control how spectrum intensities are classified.
During classification, spectrum peaks are ordered according to decreasing
intensity.  The most intense peaks are placed in the first class, next most
intense into the second class, and so on.  This classification scheme is drawn
from MyriMatch_'s scoring algorithm--see the paper for more details.

.. _MyriMatch: http://fenchurch.mc.vanderbilt.edu/lab/software.php

This parameter is a positive integer, and determines how many classes the
classifier uses.

The default value is ``3``.


``intensity_class_ratio``
-------------------------

This parameter is a positive floating point value.  It determines the relative
size of the succeeding intensity classification classes.  (See
``intensity_class_count`` above.)

So, for example, if the parameter value is ``2.0``, the second class will be
twice as large as the first, the third twice as large as the second, and so
on.

The default value is ``2.0``.  (This parameter should be greater than one in
normal usage.)


``best_result_count``
---------------------

This parameter is a positive integer and describes the number of candidate
matches output for each spectrum.

The default value is ``5``.


``mass_regime_debug_delta``
---------------------------

This parameter is only for debugging use.  The value is a float and will be
added to the mass of every residue, across all mass regimes.  The purpose is
to support certain potential modification test scenarios.

The default value is ``0``, which gives normal behavior.


Detailed Command Descriptions
-----------------------------


Usage: ``greylag-rally [options] <configuration-file> <ms2-file>...``

[to be completed]