Professor Max Bramer

Knowledge Discovery and Data Mining

Edited by Max Bramer (University of Portsmouth, UK)

Published by the Institution of Electrical Engineers. Summer 1999. ISBN 0 85296 767 5. 326 pp.

Volume 1 in the IEE Professional Applications of Computing series

A collection of 12 refereed papers on the theory and practice of Knowledge Discovery and Data Mining

Modern computer systems are accumulating data at an almost unimaginable rate and from a very wide variety of sources: from point of sale machines in the high street to machines logging every cheque clearance, bank cash withdrawal and credit card transaction, to Earth observation satellites in space. Three examples will serve to give an indication of the volumes of data involved:

The 1990 US census collected over a million million bytes of data

The Human Genome project will store thousands of bytes for each of several billion genetic bases

NASA Earth observation satellites generate a terabyte (i.e. 10^12 bytes) of data every day

Alongside advances in storage technology which increasingly make it possible to store such vast amounts of data at relatively low cost, whether in commercial data warehouses, scientific research laboratories or elsewhere, has come a growing realisation that such data contains buried within it knowledge that can be critical to a company's growth or decline, knowledge that could lead to important discoveries in science, knowledge that could enable us accurately to predict the weather and natural disasters, knowledge that could enable us to identify the causes of and possible cures for lethal illnesses, knowledge that could literally mean the difference between life and death. Yet the huge volumes involved mean that most of this data is merely stored - never to be examined in more than the most superficial way, if at all. Machine learning technology, some of it very long established, has the potential to solve the problem of the tidal wave of data that is flooding around organisations, governments and individuals.

Knowledge Discovery has been defined as the 'non-trivial extraction of implicit, previously unknown and potentially useful information from data'. The underlying technologies of knowledge discovery include induction of decision rules and decision trees, neural networks, genetic algorithms, instance-based learning and statistics. There is a rapidly growing body of successful applications in a wide range of areas as diverse as:

Medical Diagnosis
Weather Forecasting
Product Design
Electric Load Prediction
Thermal Power Plant Optimisation
Analysis of Organic Compounds
Credit Card Fraud Detection
Predicting Share of Television Audiences
Real Estate Valuation
Toxic Hazard Analysis
Automatic Abstracting
Financial Forecasting

The book comprises six papers on technical issues in the field of Knowledge Discovery and Data Mining followed by six chapters on applications. It grew out of a colloquium on Knowledge Discovery and Data Mining which I organised for Professional Group A4 (Artificial Intelligence) of the Institution of Electrical Engineers (IEE) in London on May 7th and 8th 1998. This was the third in a series of colloquia on this topic which began in 1995. The colloquium was co-sponsored by BCS-SGES (the British Computer Society Specialist Group on Knowledge Based Systems and Applied Artificial Intelligence), AISB (the Society for Artificial Intelligence and Simulation of Behaviour) and AIED (the International Society for AI and Education).

The papers included here have been significantly expanded from those presented at the colloquium and were selected for inclusion following a rigorous refereeing process. The book should be of particular interest to researchers and active practitioners in this increasingly important field. I should like to thank the referees for their valuable contribution and Jonathan Simpson (formerly of the IEE) for his encouragement to publish the proceedings in book form.

Part I: Knowledge Discovery and Data Mining in Theory looks at a variety of technical issues, all of considerable practical importance for the future development of the field.

Estimating Concept Difficulty with Cross-Entropy by Kamal Nazar and Max Bramer presents an approach to anticipating and overcoming some of the problems which can occur in applying a learning algorithm due to unfavourable characteristics of a particular dataset such as feature interaction.
Analysing Outliers by Searching for Plausible Hypotheses by Xiaohui Liu and Gongxian Cheng describes a method for determining whether 'outliers' in data are merely noise or potentially valuable information and presents experimental results on visual function data used for diagnosing two blinding diseases: glaucoma and onchocerciasis.
Attribute-Value Distribution as a Technique for Increasing the Efficiency of Data Mining by David McSherry describes a method for efficient rule discovery, illustrated by generating rules for the domain of contact lens prescription.
Using Background Knowledge with Attribute-Oriented Data Mining by Mary Shapcott, Sally McClean and Bryan Scotney looks at the important question of how background knowledge of a domain can be used to aid the data mining process.
A Development Framework for Temporal Data Mining by Xiaodong Chen and Ilias Petrounias is concerned with datasets which include information about time. The paper presents a framework for discovering temporal patterns and a query language for extracting them from a database.
An Integrated Architecture for OLAP and Data Mining by Zhengxin Chen examines features of data mining specific to the data warehousing environment, where On-Line Analytical Processing (OLAP) takes place. An integrated architecture for OLAP and data mining is proposed.

Part II: Knowledge Discovery and Data Mining in Practice begins with a chapter entitled Empirical Studies of the Knowledge Discovery Approach to Health Information Analysis by Michael Lloyd-Williams which introduces the basic concepts of knowledge discovery, identifying data mining as an information processing activity within a wider knowledge discovery process (although the terms knowledge discovery and data mining are often used interchangeably). The chapter presents empirical studies of the use of a neural network learning technique known as the Kohonen Self-Organising Map in the analysis of health information taken from three sources: the World Health Organisation's 'Health for All' database, the 'Babies at Risk of Intrapartum Asphyxia' database and a series of databases containing infertility information.

The next chapter, Direct Knowledge Discovery and Interpretation from a Multilayer Perceptron Network that Performs Low Back Pain Classification by Marilyn Vaughn et al., discusses the use of a common type of neural network, the Multi-Layer Perceptron (MLP), to classify patients suffering from low back pain, an ailment which an estimated 60% to 80% of the population will experience at least once in their lives. A particular emphasis of this work is on the induction of rules from the training examples.

The two chapters on medical applications are followed by two on meteorology.

Discovering Knowledge from Low-Quality Meteorological Databases by Craig Howard and Vic Rayward-Smith proposes a strategy for dealing with databases containing unreliable or missing data based on experiences derived from experiments with a number of meteorological datasets.

A Meteorological Knowledge Discovery Environment by Alex Buchner describes a knowledge discovery environment which allows experimentation with a variety of textual and graphical geophysical data, incorporating a number of different types of data mining model.

The final two chapters are concerned with the application of knowledge discovery techniques in two other important areas: organic chemistry and the electricity supply industry.

Mining the Organic Compound Jungle: A Functional Programming Approach by Kathryn Burn-Thornton and John Bradshaw describes experiments aimed at enabling researchers in the Pharmaceutical industry to determine common substructures of organic compounds using data mining techniques rather than by the traditional method involving visual inspection of graphical representations.

Data Mining with Neural Networks: An Applied Example in Understanding Electricity Consumption Patterns by Philip Brierley and W. Batty gives further information about neural networks and shows how they can be used to analyse electricity consumption data as an aid to comprehension of the factors influencing demand. Fortran 90 source code for a multi-layer perceptron is also provided as a way of showing that 'implementing a neural network can be a very simple process that does not require sophisticated simulators or super-computers'.