Comparing simple and complex AI solutions for the classification of E.coli protein localisation sites.
Introduction
As cloud technologies drive down the cost of high performance computing and of extremely large storage repositories, the automated analysis of vast quantities of data using mathematical models is increasingly affordable. Available models vary greatly in complexity, in the number of hyper-parameters governing their behaviour, and in the extent to which training can increase accuracy. This final aspect of analysis models, their capacity to be adjusted automatically, is called ‘machine learning’. However, when models expose many hyper-parameters and are self-adjusting, they are extremely difficult to understand and become ‘black boxes’.
Models are designed to adapt to the patterns and behaviour inherent in the data, and to make automated predictions about new observations. Nonetheless, a significant by-product of using such models is the insight one gains about the source data. During initial iterations of data analysis, it may be beneficial to use simpler models which are easier to follow. Even though such models may not produce optimum solutions, the expertise gained about the underlying data could inform the design of more sophisticated approaches.
In this paper, I contrast two classification models: a simple distance-based approach called the K-Nearest Neighbour (KNN) classifier, and a more complex multi-layered perceptron classification method. The comparison uses publicly available data on the location of E.coli proteins, and attempts to replicate published results based on the same data. The discussion also explores the experiments one could undertake with a view to increasing the models’ accuracy. All tables and graphs were produced using Python3, apart from two clearly referenced tables which originate from published literature; the data set and source code are available on GitHub (https://github.com/gnerzic/ecoli/releases/tag/v1.0).
Data
The UCI Machine Learning Repository hosts a labelled dataset of E.coli protein localisation sites collated in the 1990s by Paul Horton and Kenta Nakai (UCI). This data, alongside a larger set of observations of yeast proteins, forms the backbone of their research into automatically determining the localisation site of proteins from their amino acid sequence. Their research is motivated by the fact that “knowing a protein’s localisation helps elucidate its function, its role in both healthy processes and in the onset of diseases, and its potential use as a drug target” (Chen, 2008). Horton and Nakai published accuracy results from an expert system (Nakai & Kanehisa, 1991) and then more promising predictions from a statistical model (Horton & Nakai, 1996). Subsequently, they compared these findings with the results of a KNN classification approach, which performed significantly better than the statistical model they had used previously (Horton & Nakai, 1997).
The dataset is a labelled collection of 336 proteins, each described by 7 features. The structure of the E.coli source data is given in the table below:
Col. # | Name | Description |
0 | seq | Sequence Name |
1 | mcg | McGeoch’s method for signal sequence recognition |
2 | gvh | von Heijne’s method for signal sequence recognition |
3 | lip | von Heijne’s Signal Peptidase II consensus sequence score (a binary attribute) |
4 | chg | Presence of charge on N-terminus of predicted lipoproteins (a binary attribute) |
5 | aac | score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins |
6 | alm1 | score of the ALOM membrane spanning region prediction program |
7 | alm2 | score of ALOM program after excluding putative cleavable signal regions from the sequence |
8 | class | localisation site of protein |
The class names and their distribution are given in the table below:
Class Code | Class Name | Number of sequences |
cp | cytoplasm | 143 |
im | inner membrane without signal sequence | 77 |
pp | periplasm | 52 |
imU | inner membrane, uncleavable signal sequence | 35 |
om | outer membrane | 20 |
omL | outer membrane lipoprotein | 5 |
imL | inner membrane lipoprotein | 2 |
imS | inner membrane, cleavable signal sequence | 2 |
As the table shows, the classes are not evenly distributed, so any data samples will have to be stratified (that is, sampling should be proportional to the full class distribution).
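Stratification can be illustrated with a minimal sketch that uses only the class counts from the table above. The `stratified_folds` helper is hypothetical and deals each class round-robin into folds; it is not the routine used to produce the results in this paper.

```python
from collections import Counter, defaultdict

# Class counts taken from the class distribution table.
CLASS_COUNTS = {"cp": 143, "im": 77, "pp": 52, "imU": 35,
                "om": 20, "omL": 5, "imL": 2, "imS": 2}


def stratified_folds(labels, n_folds):
    """Deal the indices of each class round-robin into n_folds folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Every class starts at fold 0, so early folds can be slightly larger.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Build a label list with the same class distribution as the dataset.
labels = [name for name, count in CLASS_COUNTS.items() for _ in range(count)]
folds = stratified_folds(labels, n_folds=4)

for number, fold in enumerate(folds):
    print(number, len(fold), dict(Counter(labels[i] for i in fold)))
```

Each fold receives roughly a quarter of every class. Classes with only two members ('imL' and 'imS') cannot appear in all four folds, whichever splitting routine is used.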
K-Nearest Neighbour (KNN) Classifier
Replicating Research Results
The KNN classifier makes predictions by determining the proximity of new observations to existing data points for which labels are known. This is based on the assumption that data points which are close together are likely to belong to the same class. KNN works on distance and proximity, but can be successfully applied to non-geometric parameters (such as the amino acid measurements in the E.coli protein dataset). Nonetheless, the selected features of the data should support meaningful distance measurement for the purposes of classification. For example, classifying books by measuring the distance of the author’s name to other authors on an alphanumeric scale would be a very good predictor of where the book would be located on the shelf of a bookshop, but would be a poor predictor of anything else.
The KNN approach is an example of instance-based learning. The algorithm does not produce a generalised model; instead it stores instances of the training data as reference. The KNN classifier’s behaviour is governed by how many neighbours should be considered when making a prediction (a parameter called ‘k’ which gives the model its name), how distance is measured, and whether a weighting should be applied based on each neighbour’s distance. Furthermore, the prediction of the model will be skewed if specific features have greater numerical values than others, and so data should be normalised and scaled so that all features have equal mean and variance. The dataset stored on the UCI repository appears to be already normalised: the table below, which gives the mean and standard deviation of each feature in the dataset, shows that every mean is close to 0.5, although the standard deviations differ between features:
Feature | mean | std |
mcg | 0.50006 | 0.194634 |
gvh | 0.5 | 0.148157 |
lip | 0.495476 | 0.088495 |
chg | 0.501488 | 0.027277 |
aac | 0.50003 | 0.122376 |
alm1 | 0.500179 | 0.215751 |
alm2 | 0.499732 | 0.209411 |
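Had the features not been pre-scaled, each could be standardised to zero mean and unit variance before distances are measured. The following is a minimal sketch on hypothetical values; the `standardise` helper and the feature names are illustrative assumptions and were not applied to the UCI data.

```python
from statistics import mean, pstdev


def standardise(column):
    """Rescale a list of numbers to zero mean and unit variance."""
    centre = mean(column)
    spread = pstdev(column)
    if spread == 0:
        return [0.0 for _ in column]  # a constant feature carries no information
    return [(value - centre) / spread for value in column]


# Hypothetical raw values for two features on very different scales.
raw = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [100.0, 300.0, 500.0, 700.0],
}
scaled = {name: standardise(values) for name, values in raw.items()}

for name, values in scaled.items():
    print(name, [round(v, 3) for v in values])
```

After scaling, both features span the same range, so neither dominates a distance calculation simply because of its units.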
Horton and Nakai evaluate their prediction model over 4 iterations, each time taking different cuts of the data for training and validation so that each iteration uses a different validation dataset, an approach called cross-validation. They use 4-fold cross-validation to compare a KNN classifier (with K set to 7), the statistical model from (Horton & Nakai, 1996), a decision tree, and a naïve Bayes model. Their results are presented in the table below (extracted from their paper):
The SciKit-Learn package, a widely used Python library, implements many machine learning algorithms and supporting functionality, including KNN classifiers and cross validation functions. Using this package, I was able to perform 4-fold cross validation on a KNN classifier with K set to 7:
Partition | KNN |
0 | 86.21 |
1 | 83.13 |
2 | 87.06 |
3 | 91.36 |
Mean | 86.94 |
Std. Dev | 2.63 |
The two results vary slightly, which could be due to a difference in normalisation between the UCI dataset and the linear normalisation Horton and Nakai applied to their data.
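The kind of experiment described above might be sketched as follows. The data here is a synthetic stand-in generated with NumPy, not the E.coli dataset, and the shuffling and random seeds are assumptions, so the printed figures will not match the table.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 3 classes of 40 points each, 7 features in [0, 1].
rng = np.random.default_rng(0)
centres = rng.uniform(0.2, 0.8, size=(3, 7))
y = np.repeat(np.arange(3), 40)
X = np.clip(centres[y] + rng.normal(scale=0.08, size=(120, 7)), 0.0, 1.0)

# KNN with K = 7, evaluated with stratified 4-fold cross validation.
model = KNeighborsClassifier(n_neighbors=7)
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

print("accuracy per partition:", np.round(100 * scores, 2))
print("mean: %.2f, std. dev: %.2f" % (100 * scores.mean(), 100 * scores.std()))
```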
Validating Model Configuration
Horton and Nakai explain that choosing K equal to 7 was the result of taking the average of nested leave-one-out cross validation on each training segment. This can be visualised by plotting the 4-fold cross validation accuracy obtained for increasing values of K:
This agrees with Horton and Nakai’s finding that 7 is an appropriate value of K in this instance.
Often the choice of hyper-parameter in a machine learning model is a trade-off between bias and variance (VanderPlas, 2017). As hyper-parameters are increased to make the model more complex, it moves away from crude predictions (high bias) and becomes sensitive to the patterns and noise in the data (high variance). A model with high variance will over-fit the training data, which will have a negative effect on accuracy in the general case. However, a KNN classifier does not follow this pattern as K increases, as can be seen from Figure 1: increasing the value of K decreases the accuracy of predictions on both the training and the validation data. Note that a larger K averages over more neighbours, which smooths the model rather than making it more complex.
An interesting insight into a model’s accuracy is its confusion matrix, a table showing how the model has classified the validation data. This highlights which specific classes the model misclassified. Table 6 shows the confusion matrix from a KNN classifier with K equal to 7.
Note: due to stratification, instances of the ‘imL’ and ‘imS’ classes are not present in the validation dataset.
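To make the idea of a confusion matrix concrete, the following minimal sketch builds one from hypothetical true and predicted labels. The labels are invented for illustration and are not taken from the experiment.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(truth, guess)] for guess in classes] for truth in classes]


# Hypothetical labels, for illustration only.
classes = ["cp", "im", "imU"]
actual = ["cp", "cp", "im", "im", "im", "imU", "imU"]
predicted = ["cp", "cp", "im", "imU", "im", "im", "imU"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:>4}", row)

# Correct predictions sit on the diagonal of the matrix.
correct = sum(matrix[i][i] for i in range(len(classes)))
print("accuracy:", correct / len(actual))
```

Off-diagonal cells show which pairs of classes are confused with each other; in this invented example all of the errors fall between 'im' and 'imU'.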
When reviewing the confusion matrix, the authors note the model’s misclassification of localisation sites in the inner membrane without signal sequence (‘im’) and with an uncleavable signal sequence (‘imU’). They consider this to be acceptable since the “presence or absence of an uncleavable signal sequence is somewhat arbitrary” (Horton & Nakai, 1997). As a result they decide to fold class ‘imU’ into class ‘im’, which yields a higher accuracy rate.
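Collapsing the two classes amounts to a relabelling step, sketched below on hypothetical labels. As a simplification, this example relabels existing predictions rather than retraining the classifier on the collapsed labels, so it only illustrates why the accuracy rate rises.

```python
def fold_classes(labels, source="imU", target="im"):
    """Replace every occurrence of `source` with `target`."""
    return [target if label == source else label for label in labels]


def accuracy(actual, predicted):
    """Fraction of predictions that match the true label."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Hypothetical labels, for illustration only.
actual = ["cp", "im", "im", "imU", "imU", "pp"]
predicted = ["cp", "im", "imU", "im", "imU", "pp"]

print("before:", accuracy(actual, predicted))
print("after: ", accuracy(fold_classes(actual), fold_classes(predicted)))
```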
The table below shows 4-fold cross validation results once the two classes have been collapsed.
Partition | KNN |
0 | 92.94 |
1 | 90.48 |
2 | 91.57 |
3 | 94.05 |
Mean | 92.26 |
Std. Dev | 1.21 |
These results are encouraging, though it is somewhat ironic that the research initially focused on complex expert systems and statistical models, before finding that simple KNN classifiers produced better results. The confusion matrix from the KNN classifier led to insights which allowed the researchers to make adjustments informed by domain expertise. Such breakthroughs are extremely valuable by-products of such analysis.
Searching for a Better Configuration
As mentioned at the start of this section, KNN models expose more hyper-parameters than just K, notably the way the distance is computed and whether any distance weighting should be applied. With so few parameters it becomes feasible to employ a brute-force approach to search for the best KNN model. In SciKit-Learn, distance weighting is configured by the ‘weights’ input to the KNN classifier (either ‘uniform’ or ‘distance’), and the distance calculation can be altered by changing the power parameter of the Minkowski distance calculator. So far we have relied on Euclidean distance, which is the square root of the sum of the squared differences of the features; this is a special case of Minkowski distance where the power parameter is 2. Manhattan distance (also known as city-block distance) is the sum of the absolute values of the feature differences, which is a special case of Minkowski distance with power parameter 1. In the general case, Minkowski distance can be written as:
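$$D(X, Y) = \left( \sum_{i=1}^{N} |x_i - y_i|^p \right)^{1/p}$$
Where:
D: distance between two points X and Y
p: Minkowski power parameter
N: number of features
x_i, y_i: values of the i-th feature of X and Y
Note: if p = 2, then this formula becomes a Euclidean distance calculation.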
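The Minkowski distance described above can be sketched in a few lines of Python. The function name `minkowski_distance` and the sample points are illustrative choices only.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(x, y, p=1))  # Manhattan distance: 3 + 4
print(minkowski_distance(x, y, p=2))  # Euclidean distance: sqrt(9 + 16)
```

As p grows, the largest single feature difference increasingly dominates the result.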
I performed a brute-force search of the hyper-parameter space and produced results based on 4-fold cross validation using SciKit-Learn’s GridSearchCV class. The search was performed over the following hyper-parameter ranges:
- K values from 1 to 10
- weighting method either ‘uniform’ (no weighting) or ‘distance’ (distance weighted)
- p value from 1 to 10
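A search over the ranges listed above might be set up along the following lines. This is a sketch on synthetic stand-in data, not the E.coli dataset; the fold shuffling and seeds are assumptions, and the 'algorithm' setting is left at its default.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 3 classes of 40 points each, 7 features in [0, 1].
rng = np.random.default_rng(0)
centres = rng.uniform(0.2, 0.8, size=(3, 7))
y = np.repeat(np.arange(3), 40)
X = np.clip(centres[y] + rng.normal(scale=0.08, size=(120, 7)), 0.0, 1.0)

# The three hyper-parameter ranges: 10 x 2 x 10 = 200 permutations.
param_grid = {
    "n_neighbors": list(range(1, 11)),
    "weights": ["uniform", "distance"],
    "p": list(range(1, 11)),
}
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    cv=StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(100 * search.best_score_, 2))
```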
Initial searches failed to identify a consistent permutation of parameters which performs best. Consequently, the search was executed 100 times (an arbitrary number) to obtain the mean of the best result such a search produces.
Partition | Avg KNN |
0 | 93.44 |
1 | 93.11 |
2 | 93.12 |
3 | 93.42 |
Mean | 93.27 |
Std. Dev | 1.99 |
The list of the 100 best permutations that the grid search function identified can be provided as a supplementary file. The most common permutation (25%) of hyper-parameters for a best result was: {'algorithm': 'brute', 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
The permutation where K equals 7 occurred only once. However, the difference in accuracy does not appear significant, and the researchers’ selection of K = 7 with an unweighted Euclidean distance calculation is close to optimum for a KNN model on this dataset.
The table below compares leave-one-out cross validation for two KNN classifiers with K = 5 and K = 7:
K | Mean |
5 | 93.15 |
7 | 93.75 |
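Leave-one-out cross validation holds out each observation in turn and predicts it from all the others. A minimal pure-Python sketch with an unweighted Euclidean KNN follows; the points and labels are hypothetical, and the helper names are illustrative.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points nearest to `query`."""
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # Ties are resolved in favour of the label encountered first.
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(points, labels, k):
    """Hold out each observation in turn and predict it from the rest."""
    hits = 0
    for held_out in range(len(points)):
        rest_points = points[:held_out] + points[held_out + 1:]
        rest_labels = labels[:held_out] + labels[held_out + 1:]
        prediction = knn_predict(rest_points, rest_labels, points[held_out], k)
        hits += prediction == labels[held_out]
    return hits / len(points)


# Hypothetical two-feature observations forming two well-separated groups.
points = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.15), (0.12, 0.18),
          (0.80, 0.85), (0.85, 0.80), (0.90, 0.90), (0.82, 0.88)]
labels = ["cp", "cp", "cp", "cp", "pp", "pp", "pp", "pp"]

print(leave_one_out_accuracy(points, labels, k=3))
print(leave_one_out_accuracy(points, labels, k=7))
```

In this toy example a K of 7 exceeds the size of either class, so the held-out point is always outvoted by the other class. This suggests why very small classes are hard for KNN to predict when K is large.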
Multi-Layered Perceptron Classifier
A KNN classifier is an instance-based non-generic model, and the experiments in the section above seem to show that the optimum accuracy on the E.coli dataset for such a model is close to 93% (after the class ‘imU’ has been folded into class ‘im’). Though this is a significant improvement on the statistical model described by Horton and Nakai, it raises the question of whether a more complex generic model could perform better. This section investigates multi-layered perceptron classifiers to estimate their potential to achieve greater accuracy.
A neural network is an interconnected collection of logical neurons (also called perceptrons). Generally these neurons are organised in layers, and information flows from all the neurons in one layer to all the neurons in the next. Each connection from neuron to neuron is weighted, and each receiving neuron further applies a bias. These weights and biases affect the data value received by the neuron. The final value is then fed into a function which outputs the value for the next layer. Algorithms have been devised that iteratively tweak each weight and bias to increase the accuracy of the network’s output, or more exactly to decrease its inaccuracy. This progressive adjustment to reduce error is what is referred to as ‘learning’ in the context of these models. However, the algorithm may arrive at a point where tweaking any of the parameters (no matter how minutely) will increase the inaccuracy of the model. Though this could be locally the most accurate model, there is no guarantee that it is the best solution over the entire multi-dimensional parameter space. Yetian Chen uses the E.coli protein dataset to compare a decision tree model to a perceptron and a simple neural network (Chen, 2008). His results are presented below:
It is surprising that Chen’s results differ so significantly from those of Horton and Nakai (Horton & Nakai, 1997): 68% versus 80%.
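The weighted connections, biases and output function described at the start of this section can be illustrated with a small forward pass. The weights and biases below are hypothetical, and the sigmoid is assumed as the output function purely for illustration.

```python
from math import exp


def sigmoid(value):
    """Logistic activation squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-value))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical parameters: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
hidden_biases = [0.0, -1.0]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]

inputs = [1.0, 1.0]
hidden = layer_forward(inputs, hidden_weights, hidden_biases)
output = layer_forward(hidden, output_weights, output_biases)
print(hidden, output)
```

Training consists of repeatedly adjusting the entries of the weight and bias lists so that the output moves closer to the known labels.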
SciKit-Learn provides an implementation of a multi-layered perceptron classifier, which I configured with 5 hidden nodes to replicate Chen’s experiment, using a learning algorithm which SciKit-Learn’s documentation suggests should be used for small datasets (SciKit-Learn API). The results of 4-fold cross validation are shown in the table below for both the original and the collapsed dataset.
Partition | MLP (original dataset) | MLP (collapsed dataset) |
0 | 84.88 | 90.59 |
1 | 85.54 | 88.10 |
2 | 82.14 | 85.88 |
3 | 84.34 | 92.68 |
Mean | 84.23 | 89.31 |
Std. Dev | 1.14 | 2.29 |
Sadly, these results do not match (Chen, 2008), but it is difficult to align the technical configuration of my classifier with his.
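A configuration of the kind described above might look like the following sketch. It assumes the 'lbfgs' solver, which the SciKit-Learn documentation mentions for small datasets; the solver, iteration limit and seeds are assumptions rather than the confirmed settings of the experiment, and the data is a synthetic stand-in.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 3 classes of 40 points each, 7 features in [0, 1].
rng = np.random.default_rng(0)
centres = rng.uniform(0.2, 0.8, size=(3, 7))
y = np.repeat(np.arange(3), 40)
X = np.clip(centres[y] + rng.normal(scale=0.08, size=(120, 7)), 0.0, 1.0)

# A single hidden layer of 5 neurons (assumed solver and iteration limit).
model = MLPClassifier(hidden_layer_sizes=(5,), solver="lbfgs",
                      max_iter=1000, random_state=0)
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")

print("accuracy per partition:", np.round(100 * scores, 2))
print("mean: %.2f, std. dev: %.2f" % (100 * scores.mean(), 100 * scores.std()))
```

Changing `hidden_layer_sizes` to, for example, `(10,)` or `(10, 10)` would vary the number of neurons and layers in the same way.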
To search for an optimum network configuration, the learning curve of the model can be plotted, this time in terms of the number of neurons in the hidden layer.
Unlike with the KNN classifier, accuracy does not degrade as the hyper-parameter increases. In our case the accuracy of a single-layer perceptron seems to plateau in the high 80s once the layer contains 10 or more neurons. Assuming 10 neurons in the first layer is a meaningful milestone, the following graph shows what happens with a two-layered perceptron in which the first layer contains 10 neurons and the second layer contains an increasing number of neurons.
Comparing the two graphs shows there is no gain in accuracy from adding another layer to the neural network, though again 10 or more neurons in this layer seems to have significance.
Let us compare the confusion matrix of a multi-layered perceptron classifier with two layers of 10 neurons with that of a KNN classifier with K equals 7 based on Euclidean distance measurements.
Finally, the table below shows leave-one-out cross validation of a multi-layered perceptron with a single 10-neuron layer and with two 10-neuron layers:
Neuron Configuration | Mean |
(10,) | 91.96 |
(10,10) | 91.37 |
The configuration of neural networks is more akin to an art than a science, and so it is not possible to categorically state that accuracy rates over 93% are beyond the capability of this model. However, the initial findings listed above hint that achieving such results will not be a quick win.
Conclusion
This paper discusses in depth the application of a K-nearest neighbour classifier to the E.coli protein localisation site dataset collated by Horton and Nakai and made available on the UCI Machine Learning repository. After producing promising KNN accuracy results, neural networks are considered to explore whether similar accuracy levels can be achieved and what network configuration would yield the most promising results. However, initial investigations of neural networks do not seem to provide any easy leads to follow, and suggest that more thorough investigation is needed in that area if the KNN accuracy is to be replicated or bettered. An important aspect of analysis is understanding the underlying data and the available models. Focusing first on a simpler model yielded a great deal of domain insight and established a benchmark against which to evaluate more complex solutions.
Bibliography
Yetian Chen. Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks. http://web.cs.iastate.edu/~yetianc/cs572/files/CS572_Project_YETIANCHEN.pdf. 2008.
Paul Horton & Kenta Nakai. Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier. ISMB. 1997.
Paul Horton & Kenta Nakai. A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins. Intelligent Systems in Molecular Biology, 109-115. St. Louis, USA. 1996.
Kenta Nakai & Minoru Kanehisa. Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria. PROTEINS: Structure, Function, and Genetics 11:95-110. 1991.
SciKit-Learn API, sklearn.neural_network.MLPClassifier, available at http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier. Last accessed 29/01/2018.
VanderPlas, J., 2017. Python Data Science Handbook. Sebastopol, USA: O’Reilly.
UCI Machine Learning Repository, Ecoli Data Set, available online at http://archive.ics.uci.edu/ml/datasets/Ecoli. Last accessed 29/01/2018.