Data preprocessing

Last updated on Apr 20, 2020

In a recent post, I described the components in proteoQ for data normalization and informatic analysis. Most of the time, I get my job done by invoking modules in the program. Occasionally, I examine the code to find out how things are actually done.

More recently, I set out to prepare an example comparing the results of the same data set searched independently against two fasta databases with partially different taxonomies. I thought I could simply put the two pieces of data together under the same file folder, go through the normalization steps and visualize the correlations. However, the results were at first a bit puzzling.

Data are always right (we are not).

An apparently innocuous assumption of “same peptide, same protein” became a challenge in the above scenario. For this reason, I take this opportunity to first outline a few flowcharts in data preprocessing and will go back to the original question of cross-fasta data combination separately.

1 PSMs

The steps in PSM preprocessing are separated into three smaller pieces, which are gone through in order below.

1.1 Unique identifiers

Taking the format of Mascot outputs as an example (Koskinen 2011), let’s have a closer look at the first component in proteoQ normalization, normPSM.

PSM outputs may be subject to redundancy.¹ The normPSM utility aims to first reduce the information within individual PSM files using the rule of Occam’s razor. A unique identifier, uniq_id, is defined as the combination of values under columns pep_query + pep_seq + pep_var_mod_pos + dat_file.

The first three fields are kept at the level of PSM observations for the following considerations. Firstly, the same MSMS query may be assigned to multiple peptide sequences, including sequences that differ only in the isomeric leucine/isoleucine residues. Secondly, even with the same query and sequence, we might incur additional ambiguity in the position of variable modification(s). Often we do not know which identification to keep and which to tuck away. For this reason, we sacrifice a little specificity by using the combination of pep_query + pep_seq + pep_var_mod_pos to be more inclusive.²

The dat_file holds the name(s) of PSM file(s). It has no effect in identifying redundant PSM entries if there is only one PSM file for processing. In the event of multiple files, all PSM files are combined and there is no guarantee that two different files will not contain entries of the same pep_query + pep_seq + pep_var_mod_pos; thus the dat_file is also needed.
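To make the role of each field concrete, here is a minimal sketch in Python. It is not the code in proteoQ (which is an R package). The rows are made up for illustration, and the helper name `uniq_id` is only borrowed from the column name in the post.

```python
# Illustrative sketch only; this is NOT proteoQ's implementation.
# Column names follow the post; the example rows are invented.

def uniq_id(row):
    """Combine pep_query + pep_seq + pep_var_mod_pos + dat_file into one key."""
    return (row["pep_query"], row["pep_seq"], row["pep_var_mod_pos"], row["dat_file"])

psms = [
    # Same PSM reported under two proteins: identical uniq_id, hence redundant.
    {"pep_query": 101, "pep_seq": "PEPTLDEK", "pep_var_mod_pos": "0.00000000.0",
     "dat_file": "F001", "prot_acc": "P1"},
    {"pep_query": 101, "pep_seq": "PEPTLDEK", "pep_var_mod_pos": "0.00000000.0",
     "dat_file": "F001", "prot_acc": "P2"},
    # Same query, leucine/isoleucine variant: a different uniq_id, so it is kept.
    {"pep_query": 101, "pep_seq": "PEPTIDEK", "pep_var_mod_pos": "0.00000000.0",
     "dat_file": "F001", "prot_acc": "P1"},
    # Same query number, sequence and modification, but from another file.
    {"pep_query": 101, "pep_seq": "PEPTLDEK", "pep_var_mod_pos": "0.00000000.0",
     "dat_file": "F002", "prot_acc": "P1"},
]

ids = [uniq_id(r) for r in psms]
n_distinct = len(set(ids))
print(n_distinct)
```

Without dat_file in the key, the first and the last rows would collapse into one entry even though they come from different searches.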

1.2 Leading versus redundant entries

PSMs are then grouped by uniq_id and ordered by descending values of prot_matches, pep_isbold and prot_mass. The first entry under each group is considered leading and kept.

From set theory, when two proteins share peptides, their two sets of peptides are either equal, in a sub-set/super-set relation, or partially intersecting. The above procedure will remove the PSM entries from sub-set proteins (which are smaller in the number of prot_matches) and from same-set proteins that are smaller in the value of prot_mass. Note that this minimalism also means that shared peptides are only retained for the leading entries in protein quantitation. In other words, if shared peptides are used in protein quantitation, they will only be ascribed (Occam’ed) to the leading entries (see also section 1.5).
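The grouping and ordering can be sketched as follows. Again, this is a simplified Python illustration under my own assumptions (made-up rows, a hypothetical helper `leading_entries`), not the actual code path in normPSM.

```python
# Illustrative sketch only; this is NOT proteoQ's implementation.
from itertools import groupby

def uniq_id(row):
    return (row["pep_query"], row["pep_seq"], row["pep_var_mod_pos"], row["dat_file"])

def leading_entries(rows):
    """Keep the first row per uniq_id after ordering by descending
    prot_matches, pep_isbold and prot_mass."""
    ordered = sorted(
        rows,
        key=lambda r: (uniq_id(r), -r["prot_matches"], -r["pep_isbold"], -r["prot_mass"]),
    )
    return [next(group) for _, group in groupby(ordered, key=uniq_id)]

base = {"pep_var_mod_pos": "0.000.0", "dat_file": "F001", "pep_isbold": 1}
psms = [
    # Query 1: protein B is a sub-set of protein A (fewer prot_matches).
    dict(base, pep_query=1, pep_seq="AAK", prot_acc="B", prot_matches=5, prot_mass=30000),
    dict(base, pep_query=1, pep_seq="AAK", prot_acc="A", prot_matches=12, prot_mass=45000),
    # Query 2: proteins C and D are same-set; the tie is broken by prot_mass.
    dict(base, pep_query=2, pep_seq="CCK", prot_acc="D", prot_matches=7, prot_mass=48000),
    dict(base, pep_query=2, pep_seq="CCK", prot_acc="C", prot_matches=7, prot_mass=50000),
]

leaders = [r["prot_acc"] for r in leading_entries(psms)]
print(leaders)
```

Only one row survives per uniq_id, so the shared peptides AAK and CCK end up with proteins A and C, respectively.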

1.3 Merged search

When possible, I typically make a merged search so that there is only one PSM file for a given project. In this way, the same pep_seq + pep_var_mod_pos under one or multiple pep_query would be associated with the same leading protein.³ It is sometimes difficult to have a single PSM file, for example, with both 10- and 16-plex TMT data for processing at the same time. In cases like these, the same pep_seq + pep_var_mod_pos may still be linked to different protein IDs after normPSM,⁴ and would need additional handling in a later mergePep.
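A quick way to picture the multi-file situation is to look for peptides that carry more than one protein ID once the files are put side by side. The sketch below is in Python with invented rows, and `conflicting_peptides` is a hypothetical helper, not a proteoQ function.

```python
# Illustrative sketch only; this is NOT proteoQ's implementation.
from collections import defaultdict

def conflicting_peptides(rows):
    """Map each (pep_seq, pep_var_mod_pos) to its protein IDs and
    return only the peptides linked to more than one protein."""
    prots = defaultdict(set)
    for r in rows:
        prots[(r["pep_seq"], r["pep_var_mod_pos"])].add(r["prot_acc"])
    return {pep: sorted(accs) for pep, accs in prots.items() if len(accs) > 1}

# Leading entries after processing two separate searches (made-up values).
processed = [
    {"dat_file": "F001", "pep_seq": "AAK", "pep_var_mod_pos": "0.000.0", "prot_acc": "P1"},
    {"dat_file": "F002", "pep_seq": "AAK", "pep_var_mod_pos": "0.000.0", "prot_acc": "P2"},
    {"dat_file": "F001", "pep_seq": "CCK", "pep_var_mod_pos": "0.000.0", "prot_acc": "P3"},
    {"dat_file": "F002", "pep_seq": "CCK", "pep_var_mod_pos": "0.000.0", "prot_acc": "P3"},
]

conflicts = conflicting_peptides(processed)
print(conflicts)
```

Here AAK is led by P1 in one file and by P2 in the other, which is the kind of ambiguity left for the peptide-level merge.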

1.4 Extended query data

The steps between PSM (interim I) and PSM (interim II) are straightforward. Note that the probability in the position(s) of peptide modifications can be incorporated from extended PSM data. Also note that the filter_ varargs are applied after the updates of prot_matches_sig and prot_sequences_sig.⁵

1.5 Unique peptides

We may toggle on or off the option of Unique peptides only when exporting Mascot PSMs. If checked, Mascot would typically nullify the readouts in reporter ion intensities and ratios for peptides that are considered non-unique.

With this in mind, we may reason from the following flowchart that, with Unique peptides only checked, the data rows corresponding to non-unique peptides will be removed by the action of Remove all-NA entries.

As a result, a vararg statement of filter_ = exprs(pep_isunique == 0) will have no effect. In other words, it will not help us retain non-unique peptides. The same is true with a later Pep2Prn(use_unique_pep = FALSE) (see also an earlier post about the utility).

If Unique peptides only was left unchecked when exporting Mascot PSMs, the non-unique peptides will remain available after the removal of all-NA entries. In this case, we have the option to remove them with either a filter_ statement or the default use_unique_pep = TRUE in Pep2Prn.⁶
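The two scenarios can be mimicked with a toy example. This Python sketch makes several assumptions: `mock_export` is a hypothetical stand-in for the Mascot export option, the reporter-ion column names are invented, and None plays the role of NA.

```python
# Illustrative sketch only; neither Mascot nor proteoQ works through these functions.
CHANNELS = ("I126", "I127")  # hypothetical reporter-ion intensity columns

def mock_export(rows, unique_only):
    """Mimic the export: with unique_only, blank the readouts of non-unique peptides."""
    out = []
    for r in rows:
        r = dict(r)  # copy so that the input rows stay untouched
        if unique_only and r["pep_isunique"] == 0:
            for c in CHANNELS:
                r[c] = None
        out.append(r)
    return out

def remove_all_na(rows):
    """Drop rows whose reporter-ion readouts are all missing."""
    return [r for r in rows if any(r[c] is not None for c in CHANNELS)]

def keep_non_unique(rows):
    """Counterpart of a filter on pep_isunique == 0."""
    return [r for r in rows if r["pep_isunique"] == 0]

psms = [
    {"pep_seq": "AAK", "pep_isunique": 1, "I126": 1000.0, "I127": 1200.0},
    {"pep_seq": "CCK", "pep_isunique": 0, "I126": 800.0, "I127": 750.0},
]

checked = remove_all_na(mock_export(psms, unique_only=True))
unchecked = remove_all_na(mock_export(psms, unique_only=False))

print(keep_non_unique(checked))    # nothing left to retain
print(keep_non_unique(unchecked))  # the CCK row is still available
```

With the option checked, the non-unique row is already gone before any filter is evaluated, so asking for pep_isunique == 0 returns nothing.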

2 Peptides

The above procedure would hopefully assign the same peptide to the same protein in individual PSM files. In this section, we consider a similar issue when putting together multiple, processed PSM files into one piece of peptide data. This includes the merge of PSM data across LCMS series, pre-LCMS fractions and TMT experiments, if applicable.

When calling mergePep, we may set use_duppeps = FALSE, which will toss away double-dipping peptides (one peptide ID to multiple protein IDs). Alternatively, a peptide with ambiguous protein associations will be re-assigned among candidate proteins, simply in descending order of the number of unique peptides, the number of PSMs and the values of molecular masses.
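The two behaviours can be sketched as below. This is a Python illustration under my own assumptions: `resolve_peptides` and the keys `n_unique_pep`, `n_psm` and `mass` are hypothetical names, the numbers are invented, and the real mergePep may handle ties and edge cases differently.

```python
# Illustrative sketch only; this is NOT proteoQ's mergePep.

def resolve_peptides(candidates, stats, use_duppeps=True):
    """Assign each peptide to one protein.

    candidates: peptide -> list of candidate protein IDs
    stats: protein ID -> dict with n_unique_pep, n_psm and mass
    """
    resolved = {}
    for pep, prots in candidates.items():
        if len(prots) == 1:
            resolved[pep] = prots[0]
        elif use_duppeps:
            # First candidate in descending order of unique peptides, PSMs and mass.
            resolved[pep] = max(
                prots,
                key=lambda p: (stats[p]["n_unique_pep"], stats[p]["n_psm"], stats[p]["mass"]),
            )
        # Otherwise the ambiguous peptide is dropped.
    return resolved

candidates = {"AAK": ["P1"], "CCK": ["P1", "P2"], "DDK": ["P2", "P3"]}
stats = {
    "P1": {"n_unique_pep": 3, "n_psm": 10, "mass": 50000},
    "P2": {"n_unique_pep": 3, "n_psm": 8, "mass": 60000},
    "P3": {"n_unique_pep": 1, "n_psm": 20, "mass": 70000},
}

reassigned = resolve_peptides(candidates, stats, use_duppeps=True)
dropped = resolve_peptides(candidates, stats, use_duppeps=False)
print(reassigned)
print(dropped)
```

In this toy case, CCK goes to P1 (tie in unique peptides, more PSMs) and DDK goes to P2 (more unique peptides), whereas only AAK survives when ambiguous peptides are tossed away.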

3 Proteins

The procedure summarizing peptides to proteins is outlined in the flowchart below. As noted previously, the setting of use_unique_pep offers to either keep or leave out non-unique peptides when applicable.

Column keys

Column keys mentioned in the post (see also here or the README for a complete list).

Header	Description
prot_mass	Protein mass
prot_matches	Count of PSMs
prot_matches_sig	Count of PSMs that have significant scores under a proposed protein
prot_sequences_sig	Count of distinct sequences that have significant scores under a proposed protein
prot_cover	Protein sequence coverage by amino acid residues
pep_query	Ordinal number of query after sorting by Mr
pep_isbold	If grouping enabled, then a significant PSM. Otherwise, indicates this is the highest scoring protein that contains a match to this query.
pep_seq	One-letter representation of peptide sequences without variable modifications
pep_seq_mod	`pep_seq` with variable modifications
pep_start	Ordinal position of first peptide residue in protein sequence
pep_end	Ordinal position of last peptide residue in protein sequence
pep_var_mod_pos	Variable modifications as a string of digits, e.g. '0.0001000.0'. Non-zero digits identify mods according to key in export header. First and last positions are for terminus mods.
pep_locprob	The highest probability from the Mascot site analysis of the variable modification sites
pep_locdiff	pep_locprob – pep_locprob2

References

Koskinen, Ville. 2011. “Hierarchical Clustering of Shotgun Proteomics Data.” Mol Cell Proteomics 10 (6): M110.003822. https://doi.org/10.1074/mcp.M110.003822.

¹ My understanding is that there is a good reason for this. For instance, provided a list of peptide hits, biologists may identify a pivotal sub-set protein, instead of a super-set representative.
² Within a dat_file, chimeric peptides from the same MSMS spectrum (or event) would typically have different query numbers.
³ This should be guaranteed within the same dat_file as Mascot does this for us.
⁴ This would be particularly the case for fasta databases with high redundancy.
⁵ The updates are necessary. For instance, an identifier of proteins could go with either prot_acc or gene. With the latter, the maximum coverage across different protein accessions will be used.
⁶ The argument use_unique_pep is a little redundant as we can achieve the same via filter_ varargs. When summarizing peptides to proteins, its primary purpose is to serve as a safeguard by removing non-unique peptides by default. To date, the exclusion of non-unique peptides in protein quantitation seems to be the prevailing approach in MS-based proteomics.