DAMI - Data Mining in Computer and System Sciences
90 hp Computer and Systems Sciences
As data becomes ever more readily available, the need to analyse and make use of large amounts of it is growing rapidly. Data mining deals with techniques for finding interesting and useful patterns in large volumes of data. This course covers basic concepts, techniques and algorithms in data mining, combined with hands-on experimentation.
Knowledge and understanding: After having taken the course, the student is expected to:
- know how to do data mining on large data sets
- have knowledge of the basic concepts in data mining
- be familiar with basic techniques and algorithms used in data mining
Abilities and skills: After having taken the course, the student is expected to be able to:
- formulate a data mining problem
- represent a data set in a form that will be useful for data mining
- evaluate the performance of different machine learning algorithms
Judgements and values: After having taken the course, the student is expected to:
- be able to critically select appropriate tools, representations and algorithms for a given data mining scenario
- be able to critically reflect on the ethical and privacy aspects of a proposed data mining study, such as whether the design or the results of the study may have any negative effect on people, either through their direct involvement in the study or through its results.
Data mining and machine learning; Fielded applications; Machine learning and statistics; Generalization as search; Data mining and ethics
Input: Concepts, instances, attributes
- What’s a concept?
- What’s in an example?
- What’s in an attribute?
- Preparing the input
Output: Knowledge representation
- Decision tables
- Decision trees
- Classification rules
- Association rules
- Rules with exceptions
- Rules involving relations
- Trees for numeric prediction
- Instance-based representation
- Clusters
Algorithms: The basic methods
- Inferring rudimentary rules
- Statistical modeling
- Divide-and-conquer: constructing decision trees
- Covering algorithms: constructing rules
- Mining association rules
- Linear models
- Instance-based learning
- Clustering
- Further reading
Credibility: Evaluating what’s been learned
- Training and testing
- Predicting performance
- Cross-validation
- Other estimates
- Comparing data mining schemes
- Predicting probabilities
- Counting the cost
- Evaluating numeric prediction
- The minimum description length (MDL) principle
- Applying MDL to clustering
Real machine learning schemes
- Decision trees
- Classification rules
- Extending linear models
- Instance-based learning
- Numeric prediction
- Clustering
- Bayesian networks
Transformations: Engineering the input and output
- Attribute selection
- Discretizing numeric attributes
- Some useful transformations
- Automatic data cleansing
- Combining multiple models
- Using unlabeled data
- Further reading
Moving on: Extensions and applications
- Learning from massive datasets
- Incorporating domain knowledge
- Text and Web mining
- Adversarial situations
- Ubiquitous data mining
Lectures: 8 x 2 hours
Assignment: 1
Seminars: 12 hours