Data preprocessing
In a recent post, I described the components in proteoQ for data normalization and informatic analysis. Most of the time, I get my job done by invoking modules in the program. Occasionally, I examine the code to find out how things are actually done.
More recently, I set out to prepare an example comparing the results of the same data set searched independently against two fasta databases with partially different taxonomies. I thought I could simply put the two pieces of data together under the same file folder, go through the normalization steps and visualize the correlations. However, the results were at first a bit puzzling.
Data are always right (we are not).
An apparently innocuous assumption of “same peptide, same protein” became a challenge in the above scenario. For this reason, I take this opportunity to first outline a few flowcharts in data preprocessing and will go back to the original question of cross-fasta data combination separately.
1 PSMs
In detail, the steps in PSM preprocessing are separated into three smaller pieces, which are gone through in order.
1.1 Unique identifiers
Taking the format of Mascot outputs as an example (Koskinen 2011), let’s have a closer look at the first component in proteoQ normalization, `normPSM`.
PSM outputs may be subject to redundancy.[1] The `normPSM` utility first reduces the information within individual PSM files, following the rule of Occam’s razor. A unique identifier, `uniq_id`, is defined as the combination of values under the columns `pep_query` + `pep_seq` + `pep_var_mod_pos` + `dat_file`.
The first three fields are kept at the level of PSM observations for the following considerations. First, the same MS/MS query may be assigned to multiple peptide sequences, including sequences that differ only in the isomeric leucine/isoleucine residues. Second, even with the same query and sequence, there may be additional ambiguity in the position of variable modification(s). Often it remains unknown which identification to keep and which to tuck away. For this reason, we sacrifice a little specificity and use the combination of `pep_query` + `pep_seq` + `pep_var_mod_pos` to be more inclusive.[2]
The `dat_file` column holds the name(s) of the PSM file(s). It has no effect on identifying redundant PSM entries if there is only one PSM file to process. With multiple files, since all PSM files are combined, there is no guarantee that two different files will not contain entries with the same `pep_query` + `pep_seq` + `pep_var_mod_pos`; thus `dat_file` is also needed.
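As a rough illustration of the identifier described above, here is a minimal base-R sketch on a made-up PSM table. The toy values, the `|` separator and the helper name `id_without_file` are assumptions for illustration only; they are not taken from the proteoQ source.

```r
# Toy PSM table; column names follow the Mascot keys used in this post,
# the values are invented for illustration.
psms <- data.frame(
  pep_query       = c(101, 101, 101, 205),
  pep_seq         = c("LSDEK", "ISDEK", "LSDEK", "LSDEK"),
  pep_var_mod_pos = c("0.00000.0", "0.00000.0", "0.00000.0", "0.00000.0"),
  dat_file        = c("F001", "F001", "F002", "F001"),
  stringsAsFactors = FALSE
)

# Combine the four fields into one identifier per PSM observation.
# The "|" separator is an arbitrary choice for this sketch.
psms$uniq_id <- paste(
  psms$pep_query, psms$pep_seq, psms$pep_var_mod_pos, psms$dat_file,
  sep = "|"
)

# For comparison: the same identifier without dat_file (hypothetical helper).
id_without_file <- paste(
  psms$pep_query, psms$pep_seq, psms$pep_var_mod_pos,
  sep = "|"
)

length(unique(psms$uniq_id))     # number of distinct PSMs with dat_file
length(unique(id_without_file))  # rows 1 and 3 collapse without dat_file
```

In this toy table, rows 1 and 3 share query, sequence and modification positions but come from different files, so only the four-field identifier keeps them apart. Rows 1 and 2 show the leucine/isoleucine case: same query, different `pep_seq`, both retained.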
1.2 Leading versus redundant entries
PSMs are then grouped by `uniq_id` and ordered by decreasing values of `prot_matches`, `pep_isbold` and `prot_mass`. The first entry in each group is considered leading and kept.
From set theory, when two proteins share peptides, their two peptide sets are either equal, one a sub-set of the other, or partially intersecting. The above procedure removes the PSM entries from sub-set proteins (which have a smaller number of `prot_matches`) and from same-set proteins (which have a smaller value of `prot_mass`). Note that this minimalism also means that shared peptides are retained only for the leading entries in protein quantitation. In other words, if shared peptides are used in protein quantitation, they are ascribed (Occam’ed) only to the leading entries (see also section 1.5).
1.3 Merged search
When possible, I typically make a merged search so that there is only one PSM file for a given project. In this way, the same `pep_seq` + `pep_var_mod_pos` under one or multiple `pep_query` values is associated with the same leading protein.[3] It is sometimes difficult to have a single PSM file, for example, when both 10- and 16-plex TMT data need to be processed at the same time. In cases like these, the same `pep_seq` + `pep_var_mod_pos` may still be linked to different protein IDs after `normPSM`,[4] and will need additional handling later in `mergePep`.
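To check whether this situation occurs in a given data set, one could count the distinct protein IDs per peptide after combining the processed PSM tables. The sketch below uses base R on invented data; the file labels `tmt10` and `tmt16` and the accessions are hypothetical.

```r
# Toy leading PSMs from two separately searched files.
psms <- data.frame(
  pep_seq         = c("AAGK", "AAGK", "TTLR", "TTLR"),
  pep_var_mod_pos = c("0.0000.0", "0.0000.0", "0.0000.0", "0.0000.0"),
  dat_file        = c("tmt10", "tmt16", "tmt10", "tmt16"),
  prot_acc        = c("P1", "P2", "P3", "P3"),
  stringsAsFactors = FALSE
)

# Peptide key that ignores the query number and the file of origin.
pep_key <- paste(psms$pep_seq, psms$pep_var_mod_pos, sep = "|")

# Number of distinct protein IDs linked to each peptide key.
n_prots <- vapply(
  split(psms$prot_acc, pep_key),
  function(x) length(unique(x)),
  integer(1)
)

# Peptides whose protein association differs between files.
ambiguous <- names(n_prots)[n_prots > 1]
ambiguous
```

Here `AAGK` is led by `P1` in one file and `P2` in the other, so it is flagged, while `TTLR` maps consistently to `P3`.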
1.4 Extended query data
The steps between `PSM (interim I)` and `PSM (interim II)` are straightforward. Note that the probability of the position(s) of peptide modifications can be incorporated from extended PSM data. Also note that the `filter_` varargs are applied after the updates of `prot_matches_sig` and `prot_sequences_sig`.[5]
1.5 Unique peptides
We may toggle the option `Unique peptides only` on or off when exporting Mascot PSMs. If checked, Mascot would typically nullify the readouts of reporter-ion intensities and ratios for peptides that are considered non-unique.
With this in mind, we may reason from the following flowchart that, with `Unique peptides only` checked, the data rows corresponding to non-unique peptides will be removed by the action of `Remove all-NA entries`.
As a result, a vararg statement of `filter_ = exprs(pep_isunique == 0)` will have no effect. In other words, it will not help us retain non-unique peptides. The same is true of a later `Pep2Prn(use_unique_pep = FALSE)` (see also an earlier post about the utility).
If `Unique peptides only` was left unchecked when exporting Mascot PSMs, the non-unique peptides remain available after the removal of all-NA entries. In this case, we have the option to remove them with either a `filter_` statement or `use_unique_pep = TRUE`, the default in `Pep2Prn`.[6]
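The two export scenarios can be mimicked on a toy table. The following base-R sketch is an illustration only: the intensity column names `I126` and `I127`, the helper functions `make_psms` and `drop_all_na`, and all values are assumptions, and the code stands in for, rather than reproduces, the proteoQ steps.

```r
# Drop rows whose reporter-ion intensities are all NA.
drop_all_na <- function(df, cols) {
  has_value <- rowSums(!is.na(df[, cols, drop = FALSE])) > 0
  df[has_value, , drop = FALSE]
}

# Build a toy export; with unique_only = TRUE the intensities of
# non-unique peptides are blanked, mimicking "Unique peptides only".
make_psms <- function(unique_only) {
  psms <- data.frame(
    pep_seq      = c("AAGK", "TTLR", "SSVK"),
    pep_isunique = c(1, 0, 0),
    I126         = c(1.2e5, 8.0e4, 3.4e4),
    I127         = c(1.0e5, 7.5e4, 2.9e4),
    stringsAsFactors = FALSE
  )
  if (unique_only) {
    psms[psms$pep_isunique == 0, c("I126", "I127")] <- NA
  }
  psms
}

cols <- c("I126", "I127")
checked   <- drop_all_na(make_psms(unique_only = TRUE), cols)
unchecked <- drop_all_na(make_psms(unique_only = FALSE), cols)

# Asking for non-unique peptides after the all-NA removal:
shared_checked   <- checked[checked$pep_isunique == 0, ]      # nothing left
shared_unchecked <- unchecked[unchecked$pep_isunique == 0, ]  # still there

# With the unchecked export, non-unique peptides can still be excluded.
unique_only_rows <- unchecked[unchecked$pep_isunique == 1, ]
```

With the checked export, the filter on `pep_isunique == 0` has nothing to act on because those rows are already gone; with the unchecked export, both choices remain open.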
2 Peptides
The above procedure should assign the same peptide to the same protein within individual PSM files. In this section, we consider a similar issue that arises when putting together multiple processed PSM files into one piece of peptide data. This includes the merging of PSM data across LCMS series, pre-LCMS fractions and TMT experiments, if applicable.
When calling `mergePep`, we may set `use_duppeps = FALSE`, which will toss away double-dipping peptides (one peptide ID linked to multiple protein IDs). Alternatively, a peptide with ambiguous protein associations will be re-assigned among the candidate proteins, simply by the decreasing order of the number of unique peptides, the number of PSMs and the molecular mass.
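Both ways of handling a double-dipping peptide can be sketched in base R. The column names `prot_n_uniqpep` and `prot_n_psm` and all values below are hypothetical, chosen only to mirror the three ordering keys named above; this is not the `mergePep` implementation.

```r
# Toy peptide table: "AAGK" is linked to two proteins after merging.
peps <- data.frame(
  pep_seq_mod    = c("AAGK", "AAGK", "TTLR"),
  prot_acc       = c("P1", "P2", "P3"),
  prot_n_uniqpep = c(2, 7, 4),
  prot_n_psm     = c(10, 30, 12),
  prot_mass      = c(45000, 52000, 38000),
  stringsAsFactors = FALSE
)

# Option A: discard peptides that map to more than one protein.
dup_peps <- unique(peps$pep_seq_mod[duplicated(peps$pep_seq_mod)])
dropped  <- peps[!peps$pep_seq_mod %in% dup_peps, ]

# Option B: re-assign each peptide to the top-ranked candidate protein,
# ranking by decreasing unique peptides, PSM count and molecular mass.
ord <- order(
  peps$pep_seq_mod,
  -peps$prot_n_uniqpep, -peps$prot_n_psm, -peps$prot_mass
)
peps_sorted <- peps[ord, ]
reassigned  <- peps_sorted[!duplicated(peps_sorted$pep_seq_mod), ]

dropped[, c("pep_seq_mod", "prot_acc")]
reassigned[, c("pep_seq_mod", "prot_acc")]
```

Option A loses `AAGK` altogether, whereas option B keeps it and ascribes it to `P2`, the candidate with more unique peptides.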
3 Proteins
The procedure summarizing peptides to proteins is outlined in the flowchart below. As noted previously, the setting of `use_unique_pep` offers to either keep or leave out non-unique peptides when applicable.
Column keys
Column keys mentioned in the post (see also here or the README for a complete list).
Header | Description |
---|---|
prot_mass | Protein mass |
prot_matches | Count of PSMs |
prot_matches_sig | Count of PSMs that have significant scores under a proposed protein |
prot_sequences_sig | Count of distinct sequences that have significant scores under a proposed protein |
prot_cover | Protein sequence coverage by amino acid residues |
pep_query | Ordinal number of query after sorting by Mr |
pep_isbold | If grouping enabled, then a significant PSM. Otherwise, indicates this is the highest scoring protein that contains a match to this query. |
pep_seq | One-letter representation of peptide sequences without variable modifications |
pep_seq_mod | pep_seq with variable modifications |
pep_start | Ordinal position of first peptide residue in protein sequence |
pep_end | Ordinal position of last peptide residue in protein sequence |
pep_var_mod_pos | Variable modifications as a string of digits, e.g. '0.0001000.0'. Non-zero digits identify mods according to key in export header. First and last positions are for terminus mods. |
pep_locprob | The highest probability from the Mascot site analysis of the variable modification sites |
pep_locdiff | pep_locprob – pep_locprob2 |
References
Koskinen, Ville. 2011. “Hierarchical Clustering of Shotgun Proteomics Data.” Mol Cell Proteomics 10 (6): M110.003822. https://doi.org/10.1074/mcp.M110.003822.
[1] My understanding is that there is a good reason for this. For instance, provided a list of peptide hits, biologists may identify a pivotal `sub-set` protein, instead of a `super-set` representative.
[2] Within a `dat_file`, chimeric peptides from the same MSMS spectrum (or event) would typically have different query numbers.
[3] This should be guaranteed within the same `dat_file` as Mascot does this for us.
[4] This would be particularly the case for fasta databases with high redundancy.
[5] The updates are necessary. For instance, an identifier of proteins could go with either `prot_acc` or `gene`. With the latter, the maximum coverage across different protein accessions will be used.
[6] The argument `use_unique_pep` is a little redundant as we can achieve the same via `filter_` varargs. When summarizing peptides to proteins, its primary purpose is to serve as a safeguard by removing non-unique peptides by default. To date, the exclusion of non-unique peptides in protein quantitation seems a prevailing approach in MS-based proteomics.