Text Analysis and Retrieval

Course Description

Most human knowledge is stored in unstructured, textual format. Due to the vast and rapidly growing amount of text data available, text analysis and retrieval systems have become an indispensable part of modern ICT infrastructure. Such systems address the diverse information needs of users and enable the extraction of information from large volumes of unstructured data. Because of the complexity and ambiguity of natural language, text analysis is a non-trivial task, which relies on natural language processing, computational linguistics, and machine learning. This course provides a systematic overview of both traditional and advanced methods for text analysis and retrieval. The first part of the course deals with document representation and methods for document retrieval, classification, and clustering. The second part deals with information extraction and text mining, with an emphasis on methods based on statistical natural language processing and machine learning.

General Competencies

Familiarity with the basic language processing tasks, document representation models, methods for document retrieval, classification, and clustering, as well as semantic search techniques. Familiarity with basic information extraction methods, text mining, and document visualization techniques. Familiarity with the evaluation of information retrieval systems. Understanding of the theoretical foundations of these methods as well as their limitations, advantages, and disadvantages. Familiarity with the tools and frameworks for language processing, text mining, and document retrieval. The ability to design, implement, and evaluate a simple full-text retrieval and analysis system. Familiarity with the applications, best practices, trends, and challenges in the field of text analysis and retrieval.

Learning Outcomes

  1. Summarize the application areas, trends, and challenges in text analysis and retrieval
  2. Describe the fundamental techniques of text analysis and retrieval
  3. Use linguistic preprocessing tools
  4. Design and implement a text analysis/retrieval system
  5. Apply machine learning algorithms to text analysis tasks
  6. Evaluate a text analysis/retrieval system
  7. Formulate and write a system description paper
  8. Describe, review, analyze, and criticize the main text analysis methods present in scientific papers

Forms of Teaching

Lectures

Two hours of lectures per week for 13 weeks. Lectures include the presentation of the teaching material, discussions, and group work.

Exams

Continuous assessment consisting of a midterm exam, a final exam, one reading assignment, and one project assignment.

Seminars

6-8 reading assignments.

Other Forms of Group and Self Study

One group project assignment.

Other

Additional study at home is required.

Grading Method

                          Continuous Assessment        Exam
Type                      Threshold  Percent of Grade  Threshold  Percent of Grade
Homeworks                 0 %        25 %              0 %        0 %
Seminar/Project           25 %       50 %              0 %        50 %
Mid Term Exam: Written    0 %        25 %              0 %        -
Exam: Written             -          -                 50 %       50 %

Week by Week Schedule

  1. Introduction: motivation and applications, examples of successful systems, literature overview, overview of the existing tools.
  2. Basics of natural language processing.
  3. Basics of information retrieval.
  4. Web search, advanced information retrieval, information retrieval evaluation.
  5. Machine learning for natural language processing.
  6. Text classification, clustering, and latent semantic models.
  7. Word embeddings and neural networks for natural language processing.
  8. Midterm exam.
  9. Information extraction and applications.
  10. Question answering systems.
  11. Semantic textual similarity, summarization, and simplification.
  12. Sentiment analysis.
  13. Authorship analysis.
  14. Extra topic. Summary and suggestions for further study.
  15. Final exam.

Study Programmes

University graduate
Computer Science (profile)
Specialization Course (2nd semester)
Information Processing (profile)
Specialization Course (2nd semester)
Software Engineering and Information Systems (profile)
Specialization Course (2nd semester)

Literature

C. D. Manning, P. Raghavan, H. Schütze (2008), Introduction to Information Retrieval, Cambridge University Press
S. Buettcher, C. L. A. Clarke, G. V. Cormack (2010), Information Retrieval: Implementing and Evaluating Search Engines, The MIT Press
S. M. Weiss, N. Indurkhya, T. Zhang, F. Damerau (2010), Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer
G. Miner, J. Elder IV, T. Hill, R. Nisbet, D. Delen, A. Fast (2012), Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, Academic Press
C. D. Manning, H. Schütze (1999), Foundations of Statistical Natural Language Processing, The MIT Press

Lecturers

General

ID 104399
Summer semester
4 ECTS
L3 English Level
L1 e-Learning

Grading System

89 Excellent
76 Very Good
63 Good
50 Acceptable
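
The grade thresholds above can be read as minimum total percentages. The following is a minimal sketch of that arithmetic for the continuous assessment track, not an official formula. It assumes the weights 25 % (homeworks), 50 % (seminar/project), and 25 % (midterm) as read from the grading table, ignores the per-component thresholds, and uses a hypothetical "Fail" label for totals below 50, which the source does not name.

```python
# Continuous assessment weights as read from the grading table
# (an interpretation of the table, not an official formula).
WEIGHTS = {"homeworks": 0.25, "seminar_project": 0.50, "midterm": 0.25}

# Minimum total percentage for each grade, highest first.
GRADE_THRESHOLDS = [
    (89, "Excellent"),
    (76, "Very Good"),
    (63, "Good"),
    (50, "Acceptable"),
]


def total_percent(scores):
    """Weighted total, where each component score is a percentage in [0, 100]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def grade(total):
    """Return the first grade whose minimum percentage the total meets."""
    for minimum, label in GRADE_THRESHOLDS:
        if total >= minimum:
            return label
    return "Fail"  # assumed label; the source does not name the failing grade


# Hypothetical component scores, for illustration only.
scores = {"homeworks": 80, "seminar_project": 90, "midterm": 70}
total = total_percent(scores)
print(total, grade(total))
```

With these illustrative scores the weighted total is 0.25 * 80 + 0.50 * 90 + 0.25 * 70 = 82.5, which falls in the "Very Good" band.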