# DAMI - Data Mining in Computer and System Sciences

## Level

Advanced level - Second cycle course

## Requirements

90 credits in Computer and Systems Sciences or 90 credits in a DSV bachelor programme with at least 60 credits in Computer and Systems Sciences.

## Short description

As data becomes ever more readily available, the need to analyse and make use of large amounts of it is growing rapidly. Data mining deals with techniques that can find interesting and useful patterns in large volumes of data. This course covers basic concepts, techniques and algorithms in data mining, combined with hands-on experimentation.

## Aim

Knowledge and understanding: After having taken the course, the student is expected to:

- know how to perform data mining on large data sets
- have knowledge of the basic concepts in data mining
- be familiar with basic techniques and algorithms used in data mining

Abilities and skills: After having taken the course, the student is expected to be able to:

- formulate a data mining problem
- represent a data set in a form that will be useful for data mining
- evaluate the performance of different machine learning algorithms

Judgements and values: After having taken the course, the student is expected to:

- be able to critically select appropriate tools, representations and algorithms for a given data mining scenario
- be able to critically reflect on ethical and privacy aspects of a proposed data mining study, such as whether the design or the results of the study may have any negative effect on people, either through their direct involvement in the study or through its results.

## Syllabus

Data mining and machine learning

- Fielded applications
- Machine learning and statistics
- Generalization as search
- Data mining and ethics

Input: Concepts, instances, attributes

- What’s a concept?
- What’s in an example?
- What’s in an attribute?
- Preparing the input

Output: Knowledge representation

- Decision tables
- Decision trees
- Classification rules
- Association rules
- Rules with exceptions
- Rules involving relations
- Trees for numeric prediction
- Instance-based representation
- Clusters

Algorithms: The basic methods

- Inferring rudimentary rules
- Statistical modeling
- Divide-and-conquer: constructing decision trees
- Covering algorithms: constructing rules
- Mining association rules
- Linear models
- Instance-based learning
- Clustering
- Further reading

Credibility: Evaluating what’s been learned

- Training and testing
- Predicting performance
- Cross-validation
- Other estimates
- Comparing data mining schemes
- Predicting probabilities
- Counting the cost
- Evaluating numeric prediction
- The minimum description length (MDL) principle
- Applying MDL to clustering

Real machine learning schemes

- Decision trees
- Classification rules
- Extending linear models
- Instance-based learning
- Numeric prediction
- Clustering
- Bayesian networks

Transformations: Engineering the input and output

- Attribute selection
- Discretizing numeric attributes
- Some useful transformations
- Automatic data cleansing
- Combining multiple models
- Using unlabeled data
- Further reading

Moving on: Extensions and applications

- Learning from massive datasets
- Incorporating domain knowledge
- Text and Web mining
- Adversarial situations
- Ubiquitous data mining

## Outline

- Lectures: 8 x 2 hours
- Assignment: 1
- Seminars: 12 hours