
Document Management (DocMan)

Structured text is quite common: dictionaries, reference manuals, and annual reports are typical examples. In recent years, research on systems for processing structured documents has flourished. The SGML and ODA standards have further increased interest in the area. The Document Management (DocMan) Research Group studies the theory and application of such structured documents.

Structured and Intelligent Documents (SID) is an ongoing project within the DocMan group that studies and develops methods for attaching intelligent features to structured documents. The purpose of these features is to make the manipulation (storage, retrieval, and assembly) of documents easier. The project started in 1995. SID is part of the Electronic Printing and Publishing program launched by the Technology Development Centre of Finland (TEKES). Funding for SID is provided by TEKES and a group of supporting companies.

One of the basic problems in document management is to provide on-demand generation of individualized documents through dynamic document assembly. Document assembly composes new documents from an existing collection of documents. Naturally, document markup and structure contribute to the retrieval and reuse of document fragments.

An intelligent document contains knowledge about itself and its environment. It supports assembly of documents based on inputs given by the user. It is no longer a passive, linear representation of text, but is able to construct itself dynamically. Document assembly is intelligent when it uses application-domain-specific information about the document in addition to the contents and their structure.
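As an illustration only, the idea of assembling a document from user inputs can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the SID prototype: it uses well-formed XML and the standard xml.etree.ElementTree module as a stand-in for SGML, and the element names (manual, section, assembled), the attributes (topic, audience, source), and the assemble function are all hypothetical, invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source collection: two small manuals. The element names and
# the topic/audience attributes are invented for this illustration.
COLLECTION = [
    """<manual id="m1">
         <section topic="installation" audience="novice">
           <title>Installing</title><p>Run the installer.</p>
         </section>
         <section topic="installation" audience="expert">
           <title>Custom install</title><p>Build from source.</p>
         </section>
       </manual>""",
    """<manual id="m2">
         <section topic="backup" audience="novice">
           <title>Backups</title><p>Copy your files weekly.</p>
         </section>
       </manual>""",
]


def assemble(collection, topics, audience):
    """Compose a new document from the sections that match the user's inputs."""
    result = ET.Element("assembled", {"audience": audience})
    for source in collection:
        root = ET.fromstring(source)
        for section in root.iter("section"):
            # Structure (the element name) narrows the candidates; the
            # domain-specific attributes decide which fragments are reused.
            if section.get("topic") in topics and section.get("audience") == audience:
                section.set("source", root.get("id"))  # record where it came from
                section.tail = None  # drop layout whitespace from the source
                result.append(section)
    return result


doc = assemble(COLLECTION, topics={"installation", "backup"}, audience="novice")
print(ET.tostring(doc, encoding="unicode"))
```

In this sketch the topic and audience attributes play the role of the application-domain-specific information: selecting on element names alone would return every section, whereas the attributes let the assembly produce a document tailored to one reader.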

The goals of the SID project include (1) defining the information and the knowledge a structured document must contain so that it can work in an active and intelligent way, (2) developing prototype tools for intelligent assembly, and (3) defining a methodology for incorporating intelligence into document material. As a basis for the project we consider structured documents marked up with SGML. The project combines methods and tools from fields including structured-document management, information retrieval, pattern matching, data mining, distributed systems, and machine learning. When dealing with documents in morphologically rich languages like Finnish, natural language processing is also vital to the success of document assembly.

Other ongoing research within the DocMan group includes automatically creating meaningful fragments of long documents and classifying the roles of structured document elements. Former research projects include the RATI project (1988-91), which built a prototype document manipulation system that provides multiple views of a document, and the sgrep project (1995), which designed and implemented a search tool for structured documents. Some results of the VITAL project (1990-95) are also usable in this context: one of the tools built in the VITAL project was a general-purpose text transformation generator that is also suitable for structured document transformations.

Researchers of the group are Prof. Heikki Mannila, Doc. Pekka Kilpeläinen, Dr. Helena Ahonen, M.Sc. Barbara Heikkinen, M.Sc. Oskari Heinonen, Jani Jaakkola, Dr. Greger Lindén, Jyrki Niemi and Kimmo Paasiala.

Publications: [71, 117, 133, 139, 167-169, 198-202, 247-256].

Home Page: http://www.cs.helsinki.fi/research/rati/