Protein site classification using two machine learning techniques.

Comparing simple and complex AI solutions for the classification of E.coli protein localisation sites.


As cloud technologies drive down the cost of high performance computing and of extremely large storage repositories, the automated analysis of vast quantities of data using mathematical models is increasingly affordable. Available models vary greatly in complexity, in the number of hyper-parameters governing their behaviour, and in the extent to which training can increase accuracy. This final aspect of analysis models, their propensity to being automatically adjusted, is called ‘machine learning’. However when models expose many hyper-parameters and are self-adjusting, they are extremely difficult to understand and become ‘black boxes’.

Models are designed to adapt to the patterns and behaviour inherent in the data, and to make automated predictions about new observations. Nonetheless, a significant by-product of using such models is the insight one gains about the source data. During initial iterations of data analysis, it may be beneficial to use simpler models which are more easy to follow. Even though such models may not produce optimum solutions, the expertise gained about the underlying data could inform the design of more sophisticated approaches.

In this paper, I contrast two classification models; a simple distance-based approach called the K-Nearest Neighbour (KNN) classifier, and a more complex multi-layered perceptron classification method. The comparison uses publicly available data on the location of E.coli proteins, and attempts to replicate published results based on the same data. The discussion also explores the experiments one could undertake with a view to increase the models’ accuracy. All tables and graphs were produce using Python3, data set and source code is available on GitHub (, apart two clearly referenced tables which originate from published literature.


The UCI Machine Learning Repository hosts a labelled dataset of E.coli protein localisation sites collated in the 1990’s by Paul Horton and Kenta Nakai (UCI). This data, alongside a larger set of observations of yeast proteins, forms the backbone of their research about automatically determining the localisation site of proteins using their amino acid sequence. Their research is motivated by the fact that “knowing a protein’s localisation helps elucidate its function, its role in both healthy processes and in the onset of diseases, and its potential use as a drug target” (Chen, 2008). Horton and Nakai published accuracy results from an expert system (Nakai & Kanehisa, 1991) and then more promising predictions from a statistical model (Horton & Nakai, 1996). Subsequently, they compared these findings with the results from using a KNN classification approach which performed significantly better than the statistical model they had used previously (Horton & Nakai, 1997).

The dataset is a labelled collection of 336 proteins each described by 7 features. The Structure of E.coli source data is given in the table below :

#  Name Col. Description
0 seq Sequence Name
1 mcg McGeoch’s method for signal sequence recognition
2 gvh von Heijne’s method for signal sequence recognition
3 lip von Heijne’s Signal Peptidase II consensus sequence score (a binary attribute)
4 chg Presence of charge on N-terminus of predicted lipoproteins (a binary attribute)
5 aac score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins
6 alm1 score of the ALOM membrane spanning region prediction program
7 alm2 score of ALOM program after excluding putative cleavable signal regions from the sequence
8 class localisation site of protein

The class names and their distribution are given in below:

Class Code Class Name Number of sequences
cp cytoplasm 143
im inner membrane without signal sequence 77
pp Perisplasm 52
imU inner membrane, uncleavable signal sequence 35
om outer membrane 20
omL outer membrane lipoprotein 5
imL inner membrane lipoprotei 2
imS inner membrane, cleavable signal sequence 2

As the table shows, the classes are not evenly distributed and any data samples will have to be stratified (that is sampling should be proportional to the full class distribution).

K-Nearest Neighbour (KNN) Classifier

Replicating Research Results

The KNN classifier makes predictions by determining the proximity of new observations to existing data points for which labels are known. This is based on the valid assumption that data points which are close together are highly likely to belong to the same class. KNN works on distance and proximity, but can be successfully applied to none geometric parameters (such as the amino acid measurement in the E.coli protein dataset). Nonetheless, selected features of the data should support meaningful distance measurement for the purposes of classification. For example, classifying books by measuring the distance of the author’s name to other authors on an alphanumeric scale would be a very good predictor of where the book would be located on the shelf of a bookshop, but would be a poor predictor of anything else.

The KNN approach is an example of instance based learning. The algorithm does not produce a generalised model, instead it stores instances of the training data as reference. The KNN classifier’s behaviour is governed by how many neighbours should be considered when making a prediction (a parameter called ‘k’ which gives the model its name), how distance is measured, and if a weighting should be applied based on each neighbour’s distance. Furthermore, the prediction of the model will be skewed if specific features have greater numerical values than others, and so data should be normalised and scaled so that all features have equal mean and variance. The dataset stored on the UCI repository appears already normalised, as shown in the table below which the mean and standard deviation of each feature in the dataset:

Feature mean std
mcg 0.50006 0.194634
gvh 0.5 0.148157
lip 0.495476 0.088495
chg 0.501488 0.027277
aac 0.50003 0.122376
alm1 0.500179 0.215751
alm2 0.499732 0.209411

Horton and Nakai evaluate their prediction model over 4 iterations, each time taking different cuts of data for training and validation such that each iteration is based on a different validation dataset, an approach called cross-validation. They use 4-fold cross validation to compare a KNN classifier (with K set to 7), the statistic model from (Horton & Nakai, 1996), a decision tree, and a naïve Bayes model. Their results are presented in Table below (extracted from their paper):

Cross validation results published in (Norton & Nakai, 1997), HN is the probabilistic model discussed in (Horton & Nakai, 1996)

The SciKit-Learn package, which is a commonly distributed Python library, implements many machine learning algorithms and support functionality, including KNN classifiers and cross validation functions. Using this package, I was able to perform a 4-fold validation on a KNN classifier with K set to 7:

Partition KNN
0 86.21
1 83.13
2 87.06
3 91.36
Mean 86.94
Std. Dev 2.63

The two results vary slightly, and this could be due to difference in normalisation between the UCI dataset and the linear normalisation Horton and Nakai applied to their data.

Validating Model Configuration

Horton and Nakai explain that choosing K equals 7 was the result of taking the average of nested leave-one-out cross validation on each training segment. This can be visualised by plotting 4-fold cross validation accuracy results obtained by increasing values of K:

Accuracy of KNN over different values of K

This agrees with Horton and Nakai’s finding that 7 is an appropriate value of K in this instance.

Often the choice of hyper-parameter in a machine learning model is a trade-off between bias and variance (VanderPlas, 2017). As hyper-parameters are increased to make the model more complex, it moves away from crude predictions (high bias) and become sensitive to the patterns and noise in the data (high variance). A model with high variance will over-fit the training data which will have a negative effect on accuracy in the general case. However, this is not true for a KNN classifier as can be seen from Figure 1. Increasing the value of K decreases the accuracy of predictions on both the training and the validation data.

An interesting insight into a model’s accuracy is its confusion matrix, a table showing how the model has classified the validation data. This highlights which specific classes the model misclassified. Table 6 shows the confusion matrix from a KNN classifier with K equal 7.

Confusion Matrix for a CNN Classifier with K = 7

Note: due to stratification, instances of the ‘imL’ and ‘imS’ classes are not present in the validation dataset.

When reviewing the confusion matrix, the authors note the model’s misclassification of localisation sites in the inner membrane without signal sequence (‘im’) and with an uncleavable signal sequence (‘imU’). They consider this to be acceptable since the “presence or absence of an uncleavable signal sequence is somewhat arbitrary” (Horton & Nakai, 1997). As a result they decide to fold class ‘imU’ into class ‘im’, which yields a higher accuracy rate.

The table below shows 4-fold cross validation results once the two classes have been collapsed.

Partition KNN
0 92.94
1 90.48
2 91.57
3 94.05
Mean 92.26
Std. Dev 1.21

These results are encouraging, though it is somewhat ironic that the research initially focused on complex expert systems and statistical models, before finding that simple KNN classifiers produced better results. The confusion matrix from the KNN classifier led to insights which allowed the researchers to make adjustments informed by domain expertise. Such breakthroughs are extremely valuable by-products of such analysis.

Searching for a Better Configuration

As mentioned at the start of this section, KNN models expose more hyper-parameters than just K, notably the way the distance is computed and whether any distance weighting should be applied. With so few parameters it becomes feasible to employ a brute-force approach to search for the best KNN model. In Scikit-Learn, distance weighting is configured by a Boolean input to the KNN classifier and the distance calculation can be altered by changing the power parameter of the Minkowski distance calculator. So far we have relied on Euclidian distance, which is the square root of the sum of the square differences of the features, this is a specialised Minkowski distance calculation where the power parameter is 2. Manhattan distance (also known a city-block distance) is the sum of the absolute value of the feature differences, which is a special case of Minkowski distance with power parameter 1. In the general case, Minkowski distance can be written as:

D: distance between two points X and Y
P: Minkowski power parameter
N: number of features
Note: if p = 2, then this formula becomes a Euclidian distance calculation.

I performed a brute force search of the hyper-parameter space and produced result based on 4-fold cross validation using SciKit-Learn’s GridSearchCV function. The search was performed over the following hyper-parameter ranges:

  • K values from 1 to 10
  • weighting method either ‘uniform’ (no weighting) or ‘distance’ (distance weighted)
  • p value from 1 to 10

Initial searches failed to identify a consistent permutation of parameters which perform best. Consequently, the search was executed 100 times (arbitrarily), to obtain the mean of the best result such a search produces.

Partition Avg KNN
0 93.44
1 93.11
2 93.12
3 93.42
Mean 93.27
Std. Dev 1.99%

The list of the 100 best permutations that the grid search function identified can be provided as a supplementary file. The most common permutation (25%) of hyper-parameters for a best result was: {'algorithm': 'brute', 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}

The permutation where K equal 7 occurred only once.  However, the difference in accuracy does not appear significant, and the researcher’s selection of K =7 for an unweighted Euclidian distance calculation is close to optimum for a KNN model on this dataset.

The table below compares leave-on-out cross validation for two KNN classifier with K = 5 and K = 7:

K Mean
5 93.15
7 93.75

Multi-Layered Perceptron Classifier

A KNN classifier is an instance-based non-generic model, and experiments in the section above seem to show the optimum accuracy on the E.coli dataset for such a model is close to 93% (after the class ‘imU’ has been folded into class ‘im’). Though this is a significant improvement from the statistical model described by Horton and Nakai, it begs the question as to whether a more complex generic model could perform better. This section will investigate multi-layered perceptron classifiers to estimate their potential of performing with greater accuracy.

A Neural network is an interconnected collection of logical neurons (also called perceptrons). Generally these neurons are organised in layers, information flows from all the neurons in one layer to all the neurons in the next. Each connection from neuron to neuron is weighted, and each receiving neuron further applies a bias. These weights and biases effect the data value received by the neuron. The final value is then fed into a function which outputs the value for the next layer. Algorithms have been devised that iteratively tweak each weight and bias to increase the accuracy of the network’s output, or more exactly decrease its inaccuracy. This progressive adjustment to reduce error is what is referred to as ‘learning’ in the context of these models. However, the algorithm may arrive to a point where tweaking any of the parameters (no matter how minutely) will increase the inaccuracy of the model. Though this could be locally the most accurate model, there is no guarantee this is the best solution over the entire multi-dimension parameter space. Yetian Chen uses the E.coli protein dataset to compare a decision tree model to a perceptron and a simple neural network (Chen, 2008). His results are presented below:

Validation results published in (Chen, 2008)

It is surprising that Chen’s results different so significantly from Horton and Nakai (Horton & Nakai, 1997): 68% versus 80%.

SciKit-Learn provides an implementation of a multi-layered perceptron classifier, which I configured to have 5 hidden nodes to replicate Chen’s experiment, using a learning algorithm which Scikit-Learn’s documentation suggest should be used for small datasets (SciKit-Learn API). The results of 4-fold cross validation are shown in below for both the original and collapsed data set.

Partition MLP (original dataset) MLP (collapsed dataset)
0 84.88 90.59
1 85.54 88.10
2 82.14 85.88
3 84.34 92.68
Mean 84.23 89.31
Std. Dev 1.14 2.29

Sadly, these results do not match (Chen 2008) but it is difficult to align the technical configuration of my classifier with his.

To search for an optimum network configuration, the learning curve of the model can be plotted, this time in terms of the number of neurons in the hidden layer.

accuracy of Single-Layer Perceptron Classifier as a function of the number of nodes in the hidden layer

As opposed to the KNN classifier, accuracy does not degrade as the hyper-parameter increases. In our case the accuracy of a single-layer perceptron seems to plateau in the high 80’s when the number of neuron in the layer hit 10 or more perceptrons. Assuming 10 perceptrons in the first layer is a meaningful milestone, the following graph shows what happens with two-layered perceptron with the first layer containing 10 neurons and an increasing number of neurons in the second layer.

accuracy of a Two-Layered Perceptron Classifier as a function of the number of nodes in the second hidden layer

Comparing the two graphs shows there is not gain in accuracy by adding another layer to the neural network, though again 10 or more neurons in this layer seems to have significance.

Let us compare the confusion matrix of a multi-layered perceptron classifier with two layers of 10 neurons with that of a KNN classifier with K equals 7 based on Euclidean distance measurements.

Confusion Matrix for a KNN Classifier with K = 7 on the collapsed E.coli dataset (accuracy 92.6%)
Confusion Matrix for a multi-layered perceptron classifier with two layers with with 10 neurons on the collapsed E.coli dataset (accuracy 83.8%)

Finally, the table below shows leave-one-out cross validation of a multi-layered perceptron with a single 10 neuron layer and two 10 neuron layers

Neuron Configuration Mean
(10,) 91.96
(10,10) 91.37

The configuration of neural networks is more akin to an art than a science, and so it is not possible to categorically state that accuracy rates over 93% are beyond the capability of this model. However, the initial findings listed above hint that achieving such results will not be a quick win.


This paper discusses in depth the application of a K-nearest neighbour classifier on the E.coli protein localisation site dataset collated by Horton and Nakai and made available on the UCI Machine Learning repository. After producing promising KNN accuracy results, neural networks are considered to explore if similar accuracy levels can be achieved and what network configuration would yield the most promising results. However, initial investigations of neural networks do not seem to provide any easy leads to follow, and suggest that more thorough investigation is needed in that area if the KNN accuracy is to be replicated or bettered. An important aspect of analysis is understanding the underlying data and the available models. By first focusing on a simpler model, a great deal of domain insight was achieved as well as establishing a benchmark with which to evaluate more complex solutions.


Yetian Chen. Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks. 2008.

Paul Horton and Kenta Nakai. Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier. ISMB. 1997.

Paul Horton & Kenta Nakai. A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins. Intelligent Systems in Molecular Biology, 109-115. St. Louis, USA 1996.

Kenta Nakai & Minoru Kanehisa. “Expert Sytem for Predicting Protein Localization Sites in Gram-Negative Bacteria”, PROTEINS: Structure, Function, and Genetics 11:95-110, 1991.

SciKit-Learn API, sklearn.neural_network.MLPClassifier, available at Last accessed 29/01/2018.

VanderPlas, J., 2017. Python Data Science Handbook. Sebastopol, USA: O’Reilly. 

UCI Machine Learning Repository, Ecoli Data Set, available online at Last accessed 29/01/2018.

Key Limitations of Knowledge Base Systems (in 200 words of less)

Knowledge Based Systems (KBS) denotes a field of artificial intelligence research for the encoding of expert knowledge in computer logic as repository of “if-then” rules. Though successful instances of such systems are worthy of note (e.g. MYCIN, DENDRAL and PROSPECTOR) KBS have key limitations. Namely, an expert may establish semantic narrative to relate the rules that (s)he applies when addressing a problem, however when compiled into a machine the relationship between the rules becomes confused. When an expert applied rules, they are following an overall strategy for solving a specific class of problem, computer implementations take a more probabilistic approach to selecting which rules to ‘fire’. This leads to a second significant limitation that when choosing what rules to apply, the computer will attempt to exhaustively search the knowledge based. Experts on the other hand are able to focus the applicability of specific rules. Finally, KBS are not able to self acquire any semantic knowledge of the rule base, or even gain the experience needed to know when rules should be broken and why. KBS are expensive to develop, and require continuing maintenance to grow and evolve the way human experts naturally do.

Working things out backwards with Bayes

In this post, I’ll delve into Bayesian Inference which is used to determine the probably of cause and effect. That is; when I observe an event, with what certainty can I deduce the reason for it?

This is based on a probability theorem discovered by the English statistician  Thomas Bayes, though it would be wrong to attribute all the glory to him alone; Bayes’ notes were corrected, compiled and published posthumously by Richard Price (a Fellow of the Royal Society), and it was Pierre-Simon Laplace who understood the wide-ranging reach of (what is now called) Bayesian Logic.

Thomas Bayes

First a little review of the basics; probability is all about measuring the likelihood of events occurring, this can be calculated using the following ratio:

\textbf{probability of event occurring} = \frac{\textbf{number of times event can occur}}{\textbf{number of all possible events}}

In technical notation, the “probability of event occurring” is denoted P(\textbf{event}). So, for example, the probability of rolling a 1 on a 6-sided die. Would be calculated as follows:

P(\textbf{1}) = \frac{\textbf{number of sides on a die with value 1}}{\textbf{total number of sides on a die}} = \frac{\textbf{1}}{\textbf{6}} =  \textbf{0.1666666666666667}

Taking another example, the probability of someone having an Apple iPhone would be expressed as follows:

P(\textbf{iPhone}) = \frac{\textbf{number of people who have an iPhone}}{\textbf{total number of people}}

Similarly, the probability of owning an iMac would be:

P(\textbf{iMac}) = \frac{\textbf{number of people who have an iMac}}{\textbf{total number of people}}

This is all very interesting, but Bayesian logic allows us to take things a step further so that we can calculate the probability of something being true (e.g. that someone has an iPhone) if we already know something else is true (i.e. they already have an iMac).  The probability of something (A) being true given than something else (B) is already true, is denoted as P(\textbf{A} |\textbf{B}), or in our case P(\textbf{iPhone} |\textbf{iMac}).

In probability, the denominator (the thing on the bottom of the fraction) is the entire population. When we were just considering the probability of having an iPhone the denominator was the total number of all people. However, now our denominator will be the probability of the person having an iMac.

The nominator (the stuff on the top of the fraction) is the subpopulation we’re trying to focus on. In your case, this population has an iPhone and, having this iPhone, also has an iMac.

So the Bayesian theorem in full is written as follow:

P(\textbf{iPhone} |\textbf{iMac}) = \frac{P(\textbf{iMac} |\textbf{iPhone}) \textbf * P(\textbf{iPhone})}{P(\textbf{iMac})}

It seems a little backwards that in order to determine the probability that someone who has an iMac will also have an iPhone, you need to know the probability that someone who has an iPhone also has an iMac. However it it necessary since our numerator is really capturing two distinct events (each with its own probability of occurrence): first that he person has an iPhone (i.e. P(\textbf{iPhone}) ), and second that given someone has an iPhone what is the probability they also have an iMac (i.e. P(\textbf{iMac} |\textbf{iPhone}) ).

Unfortunately, there’s quite a lot of different variables in this theorem, which makes it difficult to apply. However this can be simplified (though it will appear that we’ve made it more complicated).

The total probability of having an iMac can be expressed as the probability of having an iMac if there already an iPhone and the probability of having an iPhone, plus the probability of having an iMac and no iPhone and the probability of not having an iPhone.

This can is written as follows:

P(\textbf{iMac}) = P(\textbf{iMac} |\textbf{iPhone}) \textbf * P(\textbf{iPhone}) +  P(\textbf{iMac} |\neg\textbf{iPhone}) \textbf * P(\neg\textbf{iPhone})

This would seem to be hyper pedantic, but note two things:

  1. We can express the probability of having an iMac while completely replacing P(\textbf{iMac})
  2. Also we know that P(\textbf{iPhone})  + P(\neg\textbf{iPhone}) = 1 , so if we know one we can easily calculate the other.

Now we can replace the denominator of Bayes’ theorem as follows:

P(\textbf{iPhone} |\textbf{iMac}) = \frac{P(\textbf{iMac} |\textbf{iPhone}) \textbf * P(\textbf{iPhone})}{P(\textbf{iMac} |\textbf{iPhone}) \textbf * P(\textbf{iPhone}) +  P(\textbf{iMac} |\neg\textbf{iPhone}) \textbf * P(\neg\textbf{iPhone})}

This may look more complicated than the original equation, but now that we’ve significantly reduced the number of variables in the formula and so it should (in theory) be easier to compute.

History of Artificial Intelligence – Raiders of the Lost Arts

Some previous posts provide a quick postcard from the early days of AI and the rise of the first commercial AI applications: Expert Systems. However for all the initial hype around expert systems, their domain of expertise was (by definition) limited, they were expensive to build and maintain, and impossible to formally prove complete or correct. Moreover their most lacking feature, one that threw serious doubt as to whether expert systems could be classified as intelligent machines, was their inability to learn from the problem domain or from experience.

Feeling as if they had all rushed down a blind alley, researchers once again looked to the functioning of the human brain for inspirations and resumed work on replicating its neural structure inside machines. Even though the work of early pioneers had documented the essential concepts of neural networks, they had lacked the powerful computing infrastructure needed to implement the theory. Furthermore, mathematical analysis of neural network models appeared to demonstrate the computing limitation of such structures.

However, with more modern computing technology and the renewed enthusiasm directed at neural networks, fresh breakthroughs seems to happen simultaneously. Important theoretical advances were made in the 1980s such a Adaptive Reasoning Theory (Grossberg), Hopfield Networks (Hopfield), Self-Organising Maps (Kohonen), Reinforced Learning (Barto), and the high influential Back-Propagation Learning Algorithm (Bryson and Ho).  All these resulted in a new breed of neural networks that could be trained and learn for themselves.

Other AI research postulated that since human intelligence emerged from evolutionary forces nascent in the natural world, i.e. Charles Darwin’s theory of evolution and natural selection, then intelligent machines would arise from an synthetic evolutionary environment. This approach to developing AI solution involves simulating a population of objects, allowing for evolutionary relationship to occur (selection, crossover and mutation), adding a healthy amount of entropy and letting generations evolve. The evolution based approach encapsulates three main techniques: genetic algorithms, evolutionary strategies, and genetic programming where the computer doesn’t produce answers but outputs programmes as the solution.

It would be unfair to say that neural networks, with their ability to learn from experience, discover patterns, and operate in the face of incomplete information, superseded expert systems. In fact the two technologies complement each other rather well. As was discussed in a previous post, knowledge elicitation from human expert is time consuming, can be expensive, and may lead to contradictions if multiple experts contribute to the knowledge base. Furthermore, experts may themselves make decision in the face of a great amount of uncertainty and are only able to explain their actions by vague explanations, lacking the preciseness that rules-based systems need.  Neural networks can be used to discover hidden knowledge in the system, manage vagueness in the rule definition, and also correct rules where the entered expertise is contradictory.

It seems that human experts are able to reason and make decisions in the face of uncertainty because the natural language used in the reasoning process supports the expression of concepts with are vague and subjective. And so the theory of fuzzy logic became of primary interest for expert system developers. Fuzzy logic and fuzzy set theory is not a new discovery and had been established by Lotfi Zadeh in 1965. However the concept of fuzzy logic had not been well received by Zadeh’s contemporaries, possibly because the word “fuzzy” was offensive to scientists who wanted to be taken seriously. However by the 1980’s, the idea had travelled east to Japan where it had been successfully implemented in consumer goods (such as air conditioners and washing machines). Hence fuzzy logic had a  proven commercial track record and  significantly reduced the development effort and complexity of expert systems.

Nowadays expert systems use fuzzy rules and neural networks to create more powerful AI solutions. The field has matured and new expert systems development is based on existing theories rather than the expression of new ones. But processing potential had taken an exponential leap forwards with the advent of cloud computing, resulting in new powerful AI frameworks and solutions (e.g. deep learning). Though it may take an infinity of computers to replicate the power of the human mind, such a superlative seems to be within our grasp and AI is now more relevant in society than ever.

This post is based on the first chapter of “Artificial Intelligence -A Guide to Intelligent Systems” (2nd Edition) by Michael Negnevitsky.

History of Artificial Intelligence – Bringing in the Experts

As explained in a previous post, the initial drive in Artificial Intelligence attempted to deliver generic reasoning machines with little or no domain knowledge. Finding a solution was reduced to a “searching problem”: the  program would try different permutations of incremental steps until a path to the answer was found. Even though this worked in practice for small “toy” domains, such an approach would not scale up to a real world situations. As soon as a problem could no longer be solved in polynomial time, but required exponential time to solve, such AI programs proved completely ineffective. As a consequence, in the early 1970s government funding for AI projects was cut.

However, the foundation of knowledge engineering concepts, along with very useful tools and AI programming languages had been inherited from the founding fathers of this new science. The next generation of researchers soon discovered that for generic problems domain only weak solutions would emerge but if the domain was sufficiently restricted, stronger heuristics could be built in to the system which would result in stronger solutions.

DENDRAL was one of the first implementation of such a system. NASA was planning to send an unmanned spacecraft to Mars, and needed a machine to analyse the molecular structure of the Martian soil. The brute force generate-and-test approach of producing all possible solutions given the potential molecular input and then comparing them with the field observations proved NP-Complete (a problem that can only be solve in exponential time, and becomes impossible even for modest sizes of input).

(This picture was taken by the Viking Lander 1 on February 11, 1978)

However, such tasks could be solved by human experts who were able to reduce the problem and make it tractable. As the project team set about encoding expertise to the system, it became apparent that the expert’s knowledge was not limited to the laws of chemistry, but relied on personal heuristics (rules of thumb), and guesswork. In fact, extracting knowledge from the expert became a significant bottleneck in the development of the system. Nonetheless, DENDRAL was a success which ended up being marketed commercially in the USA.

Following on from the success of DENDRAL, the research team at Stanford University set to work on an medical expert system for the diagnosis of infectious blood diseases. The project was called MYCIN, and started to formalise the approach to developing expert systems. Once completed it was equivalent to a human expert (in the narrow field of infectious blood diseases), and outperformed junior doctors. It incorporated some 450 rules which were clearly separated from the reasoning procedure. Using this software design, the team also developed a generic expert system, EMYCIN, which had all the features of the initial software, but none of the rules. New expert applications could be developed by just adding rules to the system.

One of MYCIN’s features was its ability to reason in the face of uncertainty. The approach was taken up in PROSPECTOR, an expert system for mineral exploration. In geological investigation, crucial decisions have to be made in face of uncertainty. To automate such decision making PROSPECTOR implemented Bayes’s rule of evidence to propagate uncertainties throughout the system. The system had over 1,000 rules and also featured a knowledge acquisition system.

Though, I’ve only presented three early examples, expert systems were very popular and successful in the 1980s and 1990s. However, such rule based AI proved to have significant limitation:

  • The restricted problem domain within which expert systems operate limit significantly their usefulness. For example, MYCIN would not perform well in situations where the patient suffered from multiple health conditions simultaneously.
  • Furthermore, the expert system would not be aware of the narrow boundaries limiting it and attempts to solve problems outside its domain would yield unpredictable results.
  • Being rule based, an expert system may have limited capacity to explain its findings, and may find it impossible to derive heuristic links between its rules to gain deeper insight into the problem domain.
  • The completeness, robustness and soundness of an expert system is thus far impossible to formally establish.
  • Finally, many early expert system were not able to learn from experience and had to be provided all the rules applicable to the problem domain by developers. The development effort in building useful expert system is thus prohibitively significant.

This post is based on the first chapter of “Artificial Intelligence -A Guide to Intelligent Systems” (2nd Edition) by Michael Negnevitsky.

History of Artificial Intelligence – Foundations of a Science

Philosophers have been considering how the human mind works, and whether non-humans can have minds, since the dawn of time (sorry about the platitude, but I have to start somewhere). On one side of the argument, some consider that machines can do everything that humans can, while the other side believes that the complex and sophisticated behaviour commonplace in humans (e.g. love, creativity, morality) will never be obtainable by machines.

The objective of the field of Artificial Intelligence is to create machines that can perform tasks which require intelligence when humans perform them. Consequently, the question of whether machines can think for themselves become a vitally important one. Fundamentally, considering this question forces us to define very precisely what is ‘intelligence’ and what is ‘thinking’, and wonder if these concepts only make sense in the context of the human brain.

An early yet significant paper on machine intelligence was proposed by Alan Turing in 1950 entitled “Computing machinery and intelligence”. Turing is the iconic father of AI and computer science, his achievement have shaped the disciple ever since. Turing wisely didn’t attempt to define what ‘intelligence’ is or how it could be embodied within a machine. He instead defined a test to see if a machine could fool a human in believing that it was human too. He also described an abstract machine able to manipulate symbols on a magnetic tape  as it migrates some state to state.

(Statue of Alan Turing, by Stephen Kettle)

The field of AI as we know it today was established by successive generations of groundbreaking scientists. Some of the key first generation fathers are introduced below:

  • Alan Turing
  • Warren McCulloch and Walter Pitts defined a neural network model where each neuron in the network would have binary state. They showed that neural networks were instances of Turing Machines, and hence could become the medium through which AI could be developed.
  • John von Neumann, a friend and colleague of Alan Turing, helped design some of the earliest  stored program machines and formalised computer architecture concepts which are still used to the present day.
  • Claude Shannon demonstrated (through the idea of a computer playing chess) that a brute force approach to AI would never be computationally possible, and that a heuristic approach was needed.
  • John McCarthy was instrumental in organising the 1956 AI workshop at Dartmouth College, sponsored by IBM. During this workshop the field of AI was established as a science. Though attendance to the workshop was sparse, the next 20 years of AI research would be dominated by its attendees and their students. He also designed LISP, one of the oldest computer language.
  • Marvin Minsky worked on a non-logic-based approach to knowledge representation an reasoning.

The early days of AI focused on simulating cognitive processes by defining general methods for solving a wide range for problems. The emphasis was on general-purpose researching and reasoning approaches. Unfortunately, this resulted in weak performance of the resulting AI programmes. However, the great minds attracted to the field of AI helped set the foundations of knowledge representation, learning algorithms, neural computing, and natural language computing.

This post is based on the first chapter of “Artificial Intelligence -A Guide to Intelligent Systems” (2nd Edition) by Michael Negnevitsky.