Structured and Intelligent Documents
"Älykkäät ja rakenteiset dokumentit"
© Document Management Research Group
University of Helsinki
Department of Computer Science
Structured and Intelligent Documents (SID) is
a three-year research project, which studies and develops methods
for attaching intelligent features to structured documents.
The purpose of these features is to make the manipulation of documents
easier, i.e., their
- storage,
- retrieval, and
- assembly.
The SID project started on August 1, 1995 and will end on July 31,
1998. SID is a part of the Electronic Printing and Publishing
programme started by the Finnish Technology Development
Centre (TEKES). Funding for SID is provided by TEKES and a
group of supporting companies.
The role and even the concept of a document are undergoing a tremendous
change. A document is no longer a passive linear presentation of text.
Text with a structure is quite common: dictionaries, reference manuals,
annual reports, etc., are typical examples. We create structured
documents by using markup methods, such as SGML or the
HTML standard of the World-Wide Web; however, there is
more to come!
An intelligent document contains knowledge about itself and
its environment. It supports assembly of documents based on inputs
given by the user. An active intelligent document is able to
construct and transform itself dynamically.
One of the basic problems in document management is to provide
on-demand generation of individualised documents through dynamic
document assembly. Document assembly composes new documents
from an existing collection of documents. Naturally, document markup
and structure contribute to the retrieval of the document fragments.
Document assembly is intelligent when it uses
application-domain-specific information about the document in addition
to the contents and the structure. Inherent information is
present in the document or directly computable from the document, e.g.,
keyword or phrase lists. Besides inherent information,
supplementary information is associated with the document.
Supplementary information includes, e.g., references, thesauri and
common-sense knowledge. Supplementary information can be described, for
example, in dependencies or conceptual hierarchies, and can reside in
additional document markup, as well as in separate databases.
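The distinction between inherent and supplementary information can be illustrated with a minimal Python sketch. It is not the project's implementation: the stop-word list, the `THESAURUS` table, and the function names `inherent_keywords` and `expand_term` are all assumptions made for illustration. The keyword list is computed directly from the text (inherent), whereas the thesaurus lives outside the document (supplementary).

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for"}


def inherent_keywords(text, top_n=3):
    """Inherent information: keywords computed directly from the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


# Supplementary information: a hypothetical thesaurus stored outside
# the document, e.g. in a separate database.
THESAURUS = {"markup": {"tagging", "sgml"}, "assembly": {"composition"}}


def expand_term(term):
    """Return the term together with its thesaurus synonyms, if any."""
    return {term} | THESAURUS.get(term, set())


fragment = "SGML markup describes structure. Markup makes structure explicit."
print(inherent_keywords(fragment, top_n=2))
print(sorted(expand_term("markup")))
```

Keeping the thesaurus separate from the text mirrors the point made above: supplementary information can reside in a separate database as well as in additional markup.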
Automated assembly consists of three phases:
- The user expresses his demands.
- Appropriate documents or document fragments are found and returned.
- The returned fragments are merged into a single uniform assembled document.
The result is presented to the user on the screen or on paper. The
World-Wide Web is an important presentation medium, and the Java
language further increases the possibilities for manipulating
documents.
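The three assembly phases listed above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SID tools themselves: the `Fragment` class, the keyword-overlap matching rule, and the plain-text output layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    source: str    # name of the document the fragment comes from
    keywords: set  # keywords describing the fragment
    text: str      # fragment content


def find_fragments(collection, demands):
    """Phase 2: return fragments sharing at least one keyword with the demands."""
    return [f for f in collection if f.keywords & demands]


def merge_fragments(fragments, title):
    """Phase 3: merge the fragments into one document with a uniform layout."""
    parts = [title, "=" * len(title)]
    for f in fragments:
        parts.append(f"{f.text} [{f.source}]")
    return "\n".join(parts)


# A tiny hypothetical document collection.
collection = [
    Fragment("manual-a", {"sgml", "markup"}, "SGML defines markup languages."),
    Fragment("manual-b", {"java"}, "Java applets run in a browser."),
    Fragment("report-c", {"sgml", "dtd"}, "A DTD declares the document structure."),
]

demands = {"sgml"}                             # Phase 1: the user's demands
found = find_fragments(collection, demands)    # Phase 2: retrieval
assembled = merge_fragments(found, "SGML overview")  # Phase 3: merging
print(assembled)
```

Intelligent assembly in the sense described earlier would replace the simple keyword overlap in `find_fragments` with matching that also uses structure and supplementary information.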
The goals of the SID project are
- to define the supplementary information needed for building and
intelligently assembling active and intelligent documents,
- to develop tools for intelligent assembly, and
- to develop a methodology for introducing supplementary information
into a document collection.
In particular, we want the assembled document to be a finished product.
As a basis for the project we consider structured documents marked up
according to the Standard Generalized Markup Language (SGML),
which is an ISO standard for defining document markup languages.
The project combines methods and tools from, e.g.,
structured-document management, information retrieval,
pattern matching, machine learning, data
mining, and distributed systems. When dealing with
documents in morphologically rich languages, like Finnish,
natural language processing is also vital to the success of document
assembly.
(in chronological order)
- Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, Jani Jaakkola,
Pekka Kilpeläinen, Greger Lindén, and Heikki Mannila. Intelligent Assembly
of Structured Documents. Technical Report C-1996-40, University of
Helsinki, Department of Computer Science, June 1996.
- Helena Ahonen. Automatic
generation of SGML content models. In Electronic Publishing
'96, Palo Alto, California, USA, September 1996.
- Helena Ahonen. Disambiguation
of SGML content models. In Proceedings of the PODP'96 Workshop
on the Principles of Document Processing, Palo Alto, California,
USA, September 1996.
- Pekka Kilpeläinen and Derick Wood. SGML and exceptions.
In Proceedings of the PODP'96 Workshop on the Principles of
Document Processing, Palo Alto, California, USA, September 1996.
- Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, Jani Jaakkola,
Pekka Kilpeläinen, Greger Lindén, and Heikki Mannila. Constructing tailored SGML documents. In
J. Saarela, editor, Proceedings of SGML Finland
1996, Espoo, Finland, October 1996. SGML Users Group Finland,
pages 106-116.
- Jani Jaakkola and Pekka Kilpeläinen. Using sgrep for querying
structured text files.
In J. Saarela, editor, Proceedings of SGML Finland
1996, Espoo, Finland, October 1996. SGML Users Group Finland,
pages 56-67. Available as a short
abstract and as Technical Report C-1996-83.
- Helena Ahonen. Generating Grammars for Structured Documents Using
Grammatical Inference Methods. Ph.D. Thesis, University of Helsinki,
Department of Computer Science, Series of Publications A,
Report A-1996-4, November 1996.
- Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and
A. Inkeri Verkamo. Mining
in the Phrasal Frontier. Technical Report C-1997-14, University
of Helsinki, Department of Computer Science, February 1997. A
revised version to appear in Proceedings of PKDD'97 - 1st European
Symposium on Principles of Data Mining and Knowledge Discovery,
Trondheim, Norway, June 1997.
- Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, and Pekka
Kilpeläinen. A System for Assembling Specialized Textbooks from a Pool of
Documents. Technical Report C-1997-22, University of Helsinki,
Department of Computer Science, March 1997.
- Helena Ahonen, Oskari Heinonen, Mika Klemettinen, and
A. Inkeri Verkamo. Applying
Data Mining Techniques in Text Analysis. Technical Report C-1997-23,
University of Helsinki, Department of Computer Science, March 1997.
- Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, and Mika
Klemettinen. Improving
the accessibility of SGML documents - A content-analytical approach.
To appear in SGML Europe '97, Barcelona, Spain, May 1997. GCA.
- Jani Jaakkola, Pekka Kilpeläinen, and Greger Lindén. TranSID:
An SGML transformation language. In J. Paakki, editor, Proceedings of
the Fifth Symposium on Programming Languages and Software Tools,
Jyväskylä, Finland, June 1997, pages 72-83.
Available as Department of Computer Science Report C-1997-36,
University of Helsinki, May 1997.
- Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, and Pekka Kilpeläinen.
Assembling documents from digital libraries.
In A. Hameurlain and A.M. Tjoa, editors, Database and Expert Systems
Applications, Proceedings of the 8th International Conference, DEXA '97,
Springer Lecture Notes in Computer Science 1308,
pages 419-429.
- Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, and Mika Klemettinen.
Discovery of reasonably-sized fragments using inter-paragraph
similarities. Department of Computer Science Report C-1997-67.
University of Helsinki, November 1997.
- Helena Ahonen, Barbara Heikkinen, Oskari Heinonen, Jani Jaakkola,
Pekka Kilpeläinen, and Greger Lindén.
Design and implementation of a document assembly workbench.
In Proceedings of the Seventh International Conference on
Electronic Publishing, Document Manipulation and Typography,
Saint Malo, France, April 1-3, 1998, pages 476-486.
The supporting companies of the SID project are leading Finnish
enterprises involved in electronic printing. The companies
are as follows:
The following researchers take
part in the project:
Department of Computer Science
P.O. Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki
Finland
Phone: +358 9 70 851
Fax: +358 9 7084 4441
You can contact any of us by email. Questions and inquiries
concerning the project are best emailed to the project manager,
Pekka Kilpeläinen, at Pekka.Kilpelainen@cs.Helsinki.FI.
Oskari Heinonen,
SID Project, DocMan
Group, March 25, 1996. Last updated November 19, 1997.