Protein Name
Protein names are downloaded using the mapped protein accession. In some cases the protein name will not be available. Possible reasons are:
- No valid mapped protein accession
- Protein accession has been deleted from the original database
- Protein accession is unknown to the original database
Status
In MS-based proteomics experiments, potentially identified proteins are reported using the searched database’s proprietary identifiers. These identifiers are unstable: they can change or even be deleted over time. Deletion happens if, for instance, hypothetical proteins are removed when gene prediction algorithms are updated or new biological evidence becomes available.
In a recent paper we investigated the impact of changing protein identifiers on stored proteomics data over time [1]. We found that in several cases 10-20% of the reported identifiers were no longer valid only a year after the experimental results had been published. To highlight this problem to the user and to keep the reported data usable, PRIDE Inspector can automatically check the status of each reported identification. To do this, we integrated specific components that access the identification’s source database and retrieve the current identifier status. If the identifier was only updated, the new accession is automatically displayed in the protein table and the updated sequence is retrieved. In some cases, a protein’s underlying sequence was altered in the protein database even though its identifier did not change. Therefore, PRIDE Inspector automatically fetches a protein’s current sequence and checks whether the reported peptides still fit this identification.
When using the “Update Protein Details” feature in PRIDE Inspector, the status of the protein according to the original database is downloaded in addition to the protein name and protein sequence. The status is one of the following:
- Active: the protein accession still exists in the original database, and the details remain unchanged.
- Unknown: the protein accession does not exist in the original database.
- Deleted: the protein accession has been removed from the original database.
- Merged: the protein accession has been merged with another protein accession to form a new protein.
- Demerged: the protein accession has been split into two or more proteins.
- Changed: there have been changes to this protein, but the type of change is unknown.
- Error: there is an error associated with this protein.
To summarize, there are three main results for a protein’s status: active, changed, and deleted. For UniProtKB, changed identifiers are subdivided into merged and demerged identifiers. Identifiers are demerged mainly because new identifiers were created for every species in which a protein was identified, as well as for the various genes a protein can come from. Identifiers are merged mainly when, based on new gene prediction algorithms, proteins that were previously believed to be distinct are now considered to come from the same gene.
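The relationship between the detailed statuses and the three main results can be sketched as a simple lookup. This is only an illustration in Python, not PRIDE Inspector's own code: the names `ProteinStatus`, `MAIN_RESULT`, and `main_result` are hypothetical. The text above does not say how Unknown and Error are grouped, so the sketch leaves them unmapped.

```python
from enum import Enum
from typing import Optional


class ProteinStatus(Enum):
    """Detailed statuses as listed above (hypothetical type name)."""
    ACTIVE = "Active"
    UNKNOWN = "Unknown"
    DELETED = "Deleted"
    MERGED = "Merged"
    DEMERGED = "Demerged"
    CHANGED = "Changed"
    ERROR = "Error"


# Merged and Demerged are described as subdivisions of "changed".
# Unknown and Error are deliberately left out: their grouping is not
# stated in the text, so this sketch does not guess one.
MAIN_RESULT = {
    ProteinStatus.ACTIVE: "active",
    ProteinStatus.MERGED: "changed",
    ProteinStatus.DEMERGED: "changed",
    ProteinStatus.CHANGED: "changed",
    ProteinStatus.DELETED: "deleted",
}


def main_result(status: ProteinStatus) -> Optional[str]:
    """Return the main result for a status, or None if it is unmapped."""
    return MAIN_RESULT.get(status)
```

For example, `main_result(ProteinStatus.DEMERGED)` gives `"changed"`, while `main_result(ProteinStatus.ERROR)` gives `None`.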
The International Protein Index (IPI) database was discontinued in September 2011. Therefore, PRIDE Inspector can only report whether a given identifier was still active in the last IPI release; it cannot report changed or deleted identifiers.
Reference
1. Griss, J., Cote, R. G., Gerner, C., Hermjakob, H. & Vizcaíno, J. A. Published and perished? The influence of the searched protein database on the long-term storage of proteomics data. Mol Cell Proteomics 10, M111.008490 (2011).
Protein Sequence
The latest protein sequence from the original database. This might differ from the sequence originally used for the search. In some cases the protein sequence will not be available, for example when a protein accession has been demerged into multiple proteins.
Fit in Sequence
This field indicates whether the peptide sequence can still be found in the latest protein sequence. It has one of the following statuses:
- Fit: both the peptide sequence and the start/stop position fit the protein sequence.
- Fuzzy Fit: the peptide sequence can be found in the protein sequence, but not at the reported start/stop position. This may indicate a change in the original protein sequence.
- No Fit: the peptide sequence cannot be found in the protein sequence.
- Unknown: the protein sequence is not available.
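The four cases above can be illustrated with a short sketch. It is written in Python for brevity and is not PRIDE Inspector's actual implementation: the names `FitStatus` and `fit_in_sequence` are hypothetical, and the start/stop positions are assumed to be 1-based and inclusive.

```python
from enum import Enum
from typing import Optional


class FitStatus(Enum):
    FIT = "Fit"
    FUZZY_FIT = "Fuzzy Fit"
    NO_FIT = "No Fit"
    UNKNOWN = "Unknown"


def fit_in_sequence(peptide: str, start: int, stop: int,
                    protein: Optional[str]) -> FitStatus:
    """Classify a peptide against the latest protein sequence.

    Assumption: start and stop are 1-based, inclusive positions.
    """
    # Unknown: no protein sequence could be retrieved.
    if not protein:
        return FitStatus.UNKNOWN
    # Fit: the peptide sits exactly at the reported start/stop position.
    in_range = 1 <= start <= stop <= len(protein)
    if in_range and protein[start - 1:stop] == peptide:
        return FitStatus.FIT
    # Fuzzy Fit: the peptide occurs somewhere else in the sequence.
    if peptide in protein:
        return FitStatus.FUZZY_FIT
    # No Fit: the peptide does not occur in the sequence at all.
    return FitStatus.NO_FIT
```

For instance, with the made-up sequence `"MKTAYIAKQR"`, the peptide `"TAYI"` reported at positions 3 to 6 is a Fit. The same peptide reported at positions 4 to 7 is a Fuzzy Fit.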