Web Mining in Computer and Systems Sciences

Requirements

60 credits (hp) in Computer and Systems Sciences, including at least:

7.5 credits (hp) in object-oriented programming
7.5 credits (hp) in data warehousing

Aim

The course aims to give insight into text mining techniques applied to Internet-related data, and into what they can be used for. After completing the course, the student should be able to:

Identify and differentiate between application areas for web content mining, web structure mining and web usage mining.
Describe key concepts such as web content mining, web log, hypertext, social network, information synthesis, corpora, morphology and evaluation measures such as precision and recall.
Discuss the use of methods and techniques such as word frequency and co-occurrence statistics, normalization of data, machine learning, clustering, vector space models and lexical semantics.
Explain the architecture and main algorithms commonly used by web mining applications.
Appropriately select between different approaches and techniques of web mining for tasks such as sentiment analysis, targeted marketing, linguistic forensics and customer profiling.
Apply human language technology tools such as tokenizers, stemmers, part-of-speech taggers, noun phrase chunkers and shallow parsers to different types of web content gathered from, for instance, e-commerce sites.
Perform analysis of linguistically processed data using a suitable automatic classifier.
Set requirements for, compare and assess the quality of existing web mining tools.
Analyze and explain which web mining problems are satisfactorily solved, which are being worked on at the research frontier and which still lie beyond the current state of the art.
Independently solve a well-defined practical web mining problem using tools and techniques introduced in the course.
Convey the outcome of their own work on web mining orally and in written form to peers using relevant and appropriate terminology.
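
As an illustration of the evaluation measures named among the learning outcomes above, precision and recall can be computed as in the following minimal sketch. It is not course material: the function name precision_recall and the document identifiers are hypothetical, chosen only for this example, and the sketch assumes a simple set-based retrieval setting.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predicted items.

    precision = correct predictions / all predictions
    recall    = correct predictions / all relevant items
    Both default to 0.0 when their denominator is empty.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: pages returned by a web mining tool versus
# the pages a human annotator judged relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
gold = ["doc2", "doc4", "doc5"]

precision, recall = precision_recall(retrieved, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved pages are relevant, while two of the three relevant pages were found, so precision and recall differ even though the number of correct hits is the same.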

Syllabus

The Internet contains a huge amount of information, which is growing at an ever-increasing pace. People, organizations and corporations all over the world are continuously adding different types of information to the web in various languages. The web therefore contains potentially very interesting and valuable information. This course will investigate various techniques for processing the web in order to extract such information, refine it and make it more structured, thus making it both more valuable and more accessible. These techniques are often referred to as web mining techniques.

The Internet domains that we will study are e-commerce web sites, wikis, virtual communities and blogs. Web mining is generally considered to comprise three main areas: web content mining, web structure mining and web usage mining. Web structure mining is closely related to information search techniques, and web usage mining to opinion mining or sentiment analysis. Web content mining can, for example, be used to find the cheapest airline tickets by monitoring the web-based databases of all airlines and comparing the fares they offer.

Web mining techniques explored in the course include human language technology, machine learning, statistical models, information retrieval and extraction, text mining, text summarization, automatic classification and clustering, wrapper induction, normalization of data, information integration, interface matching, schema matching, sentiment analysis and opinion mining, extraction of comparatives, and forensic linguistics.