Annual Report 2005

Document Management, Information Retrieval and Data Mining (Doremi)

The Doremi research group is active in the areas of document management, information retrieval, data mining and human language technology. The group has developed methods for question-answering systems, information extraction, event detection and tracking, data retrieval from XML documents, and text mining.

The idea of QA systems is that users can ask questions in natural language, and the system finds the answer in a large body of text. Depending on the requirements, the answer is either a text fragment from which the reader can find the answer, or an exact answer, such as a proper noun. During 2005, Doremi participated in the QA track of the Cross-Language Evaluation Forum (CLEF), an evaluation project that aims to offer experiment material and an evaluation environment for QA systems. The group participated with three systems: two monolingual (Finnish, French) and one bilingual (questions in Finnish, corpora in English).

The project Mobile and Multilingual Maintenance Man (4M) incorporates several new research problems and is a joint effort between the University of Helsinki, several project groups at Helsinki University of Technology, and VTT Information Technology (the national technology research centre). The 4M project aims to develop a knowledge support system that communicates in natural language to aid maintenance workers repairing machinery. The Doremi team is in charge of developing methods for extracting knowledge from text documents, e.g. extracting instructions from the user manuals of the machinery. In addition, we are researching information retrieval whose results fit on a small screen and are exact, utilising ontologies and previous discussions.

A new project, Pattern-based Understanding and Learning System (PULS), branches out to develop a system to aid infectious disease specialists. The system collects new announcements of infectious diseases around the world from a mailing list for doctors, extracts the facts in each announcement (place, disease type, number of patients, etc.) and saves them in a database from which the general public can fetch information through the WWW (http://doremi.cs.helsinki.fi/puls/). The project has the special mission of improving the reliability of results by analysing the database as a whole. Knowledge is usually retrieved from only one document at a time, so this kind of retrieval spanning several documents is still very new.
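The idea of checking extracted facts against the database as a whole can be illustrated with a small sketch. It is not the PULS method: it substitutes a plain majority vote across reports, and the record layout, the `consolidate` function and the sample values are all hypothetical.

```python
from collections import Counter, defaultdict


def consolidate(facts):
    """Pick the most frequently reported value for each (outbreak, field).

    `facts` is an iterable of (outbreak_id, field, value) tuples, one per
    value extracted from a single report. This layout is an assumption
    made for illustration only.
    """
    votes = defaultdict(Counter)
    for outbreak_id, field, value in facts:
        votes[(outbreak_id, field)][value] += 1
    # Keep the value with the highest count for each key.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


# Hypothetical facts extracted from three reports about the same outbreak;
# the third report contains two extraction errors.
facts = [
    ("outbreak-1", "disease", "cholera"),
    ("outbreak-1", "cases", "120"),
    ("outbreak-1", "disease", "cholera"),
    ("outbreak-1", "cases", "120"),
    ("outbreak-1", "disease", "colera"),
    ("outbreak-1", "cases", "12"),
]

consolidated = consolidate(facts)
print(consolidated)
```

A value extracted wrongly from one report is outvoted when other reports on the same outbreak agree with each other, which is the intuition behind looking at several documents at once rather than one at a time.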

Other research interests pursued in the group include text mining and information retrieval from XML documents. In this area, Antoine Doucet finished his PhD thesis on finding and utilising multi-word terms.

Contact persons: Professor Helena Ahonen-Myka and researcher Roman Yangarber, PhD.

Homepage: http://www.cs.helsinki.fi/research/doremi/

Project

Mobile and Multilingual Maintenance Man (4M)

Publications

L. Aunimo: A Question Typology and Feature Set for QA. Proceedings of the Workshop for Knowledge and Reasoning for Answering Questions, held in conjunction with IJCAI-05, July 2005, Edinburgh, Great Britain.

A. Doucet: Advanced Document Description, a Sequential Approach. PhD Thesis. Department of Computer Science, Series of Publications A, Report A-2005-2.

A. Doucet and H. Ahonen-Myka: A Method to Calculate Probability and Expected Document Frequency of Discontinued Word Sequences. In Proceedings of ACM SIGIR 2005, ELECTRA Workshop on Methodologies and Evaluation of Lexical Cohesion Techniques in Real-world Applications (Beyond Bag of Words), Salvador, Brazil, August 15-19, 2005.

A. Vallin, B. Magnini, D. Giampiccolo, L. Aunimo, C. Ayache, P. Osenova, A. Penas, M. de Rijke, B. Sacaleanu, D. Santos, and R. Sutcliffe: Overview of the CLEF 2005 Multilingual Question Answering Track. Proceedings of the 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria, September 21-23, 2005.

R. Yangarber and L. Jokipii: Redundancy-based Correction of Automatically Extracted Facts. In Proceedings of the Human Language Technology Conference / Conference on Empirical Methods in Natural Language Processing: HLT/EMNLP-2005, Vancouver, Canada.

International visits

To the Group

Damien Beaudrey, INSA Lyon, France, 21 February – 31 July 2005
