Missense Prediction Tool Catalogue
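Here we review and catalogue the extensive range of available tools for the evaluation and classification of missense variants. Broadly, they can be divided into three types: methods based on sequence and evolutionary conservation, methods based on protein sequence and structure, and methods based on supervised learning.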
Disease-causing non-synonymous single nucleotide polymorphisms (nsSNPs) have been found to occur at evolutionarily conserved positions that have an essential role in the structure and/or function of the encoded protein. These sites are typically characterised by high sequence conservation amongst homologues, and assessing this conservation in multiple sequence alignments (MSAs) can highlight the degree of amino acid divergence that can be tolerated. A missense variant replaces the original amino acid with one that has different physicochemical properties, and the extent of this change can be captured to predict the functional consequence. The following algorithms are based on these principles and combine MSAs, generated through a variety of methods, with scoring functions based on measures of amino acid similarity to produce predictions of variant pathogenicity.
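As a rough illustration of the principle described above, the following Python sketch scores a single alignment column by its Shannon entropy and then weights that score by whether the substitution crosses a coarse physicochemical class boundary. It is not the scoring function of any tool in this catalogue: the function names, the four-class residue grouping and the 0.5 weighting are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Coarse, illustrative residue grouping (an assumption, not taken from any tool).
PHYSCHEM_CLASS = {}
for label, residues in {
    "nonpolar": "GAVLIMPFW",
    "polar": "STCYNQ",
    "positive": "KRH",
    "negative": "DE",
}.items():
    for residue in residues:
        PHYSCHEM_CLASS[residue] = label


def column_conservation(column):
    """Return 1 minus the normalised Shannon entropy of one MSA column.

    Gaps ('-') are ignored. 1.0 means the column is invariant.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1.0 - entropy / math.log2(20)  # 20 standard amino acids


def toy_score(column, ref, alt):
    """Hypothetical score: conservation, halved if ref and alt share a class."""
    same_class = PHYSCHEM_CLASS[ref] == PHYSCHEM_CLASS[alt]
    return column_conservation(column) * (0.5 if same_class else 1.0)


if __name__ == "__main__":
    invariant_column = "LLLLLL"  # made-up column, one residue per homologue
    print(toy_score(invariant_column, "L", "D"))  # class-changing substitution
    print(toy_score(invariant_column, "L", "I"))  # class-preserving substitution
```

Real methods differ mainly in how the alignment is built and in the similarity measure used, so this sketch should be read only as a picture of the general shape of such a score.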
Disease-associated missense variants are found to correlate with conserved positions in alignments of human proteins.
Many of these methods are highly sensitive to the MSA that the user provides, and in many cases varying the evolutionary depth of an alignment can produce different predictions.
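The sensitivity to alignment depth can be pictured with a small Python sketch. Here the same position is called "conserved" or "variable" depending on whether distant homologues are included. The columns, the function names and the 0.9 threshold are invented for illustration and do not come from any specific tool.

```python
def fraction_matching(column, ref):
    """Fraction of non-gap residues in an MSA column that equal the reference."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    return sum(r == ref for r in residues) / len(residues)


def conservation_call(column, ref, threshold=0.9):
    """Hypothetical binary call; the threshold is an arbitrary assumption."""
    return "conserved" if fraction_matching(column, ref) >= threshold else "variable"


if __name__ == "__main__":
    # Made-up columns for one position with reference residue R.
    shallow = "RRRRR"          # close homologues only
    deep = shallow + "RKKQR"   # the same column after adding distant homologues
    print(conservation_call(shallow, "R"))
    print(conservation_call(deep, "R"))
```

With the shallow alignment the position looks invariant, while the deeper alignment reveals tolerated substitutions, so a tool that applies a fixed threshold could give different predictions for the same variant.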
The consequences of an amino acid sequence change will depend on the individual amino acids involved and their degree of similarity. Other structural implications of missense variants include physical disruption of well-packed protein cores, highlighted through packing analysis, and substitutions at sites crucial for molecular function. The following methods either combine information from protein sequence and structure or use protein structural information alone to analyse missense variants.
Considering other aspects of evolutionary constraint may aid missense variant classification, especially when combined with sequence conservation measures.
Some of these algorithms return a great deal of structural information that the user must then interpret to make an informed assessment of pathogenicity. Without a good understanding of these parameters, the information could be confusing or misinterpreted.
PolyPhen (no longer supported)
Protein stability-based methods
Supervised learning algorithms include neural networks (NNs), support vector machines (SVMs), random forests (RFs) and naive Bayes classifiers. NNs and SVMs are trained using two sets of data: variants that are associated with disease and variants with no known disease association. The characteristics of the variants in each set are assessed, typically on features of conservation or protein structure, and the algorithm is trained to ‘learn’ the difference between the two sets. RFs combine predictions from various methods on variants associated with disease. When a query variant is submitted, the same characteristics are assessed to determine which category it best matches, and a prediction is given.
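The train-then-query workflow described above can be sketched with a very small Gaussian naive Bayes classifier written in plain Python. Naive Bayes is used here only because it is compact. The two features (a conservation score and a relative solvent accessibility) and every training value are made up for illustration, and none of the names correspond to a real tool or dataset.

```python
import math
from statistics import mean, pvariance

# Made-up training variants: (conservation score, relative solvent accessibility).
TRAIN = {
    "disease": [(0.95, 0.05), (0.90, 0.10), (0.85, 0.20), (0.92, 0.15)],
    "neutral": [(0.30, 0.70), (0.45, 0.60), (0.20, 0.85), (0.50, 0.55)],
}


def fit(train):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    n_total = sum(len(rows) for rows in train.values())
    model = {}
    for label, rows in train.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        model[label] = {
            "prior": len(rows) / n_total,
            "means": [mean(c) for c in columns],
            # A small constant avoids division by zero for constant features.
            "vars": [pvariance(c) + 1e-6 for c in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, features):
    """Return the class with the highest log posterior for a query variant."""
    scores = {}
    for label, params in model.items():
        scores[label] = math.log(params["prior"]) + sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(features, params["means"], params["vars"])
        )
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = fit(TRAIN)
    print(predict(model, (0.90, 0.10)))  # conserved, buried query position
    print(predict(model, (0.30, 0.80)))  # variable, exposed query position
```

Published classifiers use far larger training sets and many more features, but the overall pattern is the same: learn feature distributions from labelled variants, then assign a query variant to the class it best matches.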
Whether a missense variant results in a pathogenic or neutral mutation can depend on a number of different factors that cannot be captured using basic protein sequence and structural information alone. In theory, methods that combine a large number of predictive parameters have the ability to partition pathogenic and neutral mutations using a greater range of information.
These types of classifiers require large datasets of variants to train them. In many cases, the datasets used have been found to contain ‘pathogenic’ and ‘neutral’ classifications that conflict with classifications from expertly curated datasets. For these methods to be more reliable, large accurate datasets are required.
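One simple way to gauge this problem before training is to compare the training labels with an expertly curated set and report where they disagree. The Python sketch below assumes both sets are available as dictionaries keyed by a variant identifier. The function name, the identifiers and the labels are all hypothetical.

```python
def label_conflicts(training, curated):
    """Return (conflicting variant IDs, conflict rate) over the shared variants."""
    shared = training.keys() & curated.keys()
    conflicts = sorted(v for v in shared if training[v] != curated[v])
    rate = len(conflicts) / len(shared) if shared else 0.0
    return conflicts, rate


if __name__ == "__main__":
    # Hypothetical variant identifiers and labels, for illustration only.
    training = {
        "GENE1:p.Arg12Cys": "pathogenic",
        "GENE1:p.Val45Ile": "neutral",
        "GENE2:p.Gly7Asp": "pathogenic",
        "GENE2:p.Ala99Thr": "neutral",
    }
    curated = {
        "GENE1:p.Arg12Cys": "pathogenic",
        "GENE1:p.Val45Ile": "pathogenic",
        "GENE2:p.Gly7Asp": "neutral",
        "GENE3:p.Leu3Pro": "pathogenic",
    }
    conflicts, rate = label_conflicts(training, curated)
    print(conflicts, round(rate, 2))
```

Variants flagged in this way could be removed or relabelled before training, although which source to trust remains a judgement for the user.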