Topic detection and tracking

In topic detection and tracking (TDT), a news stream is monitored for significant news events and their development. Thus, TDT can be understood as a form of information retrieval in which relevance, the "aboutness", is based on news events. The TDT research initiative was launched in 1996 with a pilot study that set out to evaluate the feasibility of existing information retrieval methods for tasks of this kind (Allan et al. 1998). Despite the initial success, it soon became clear that event-based similarity is considerably harder to identify than the similarity used in text classification and information filtering, for example. This is especially true of new event detection, which requires scanning through all the previous data. Allan, Lavrenko and Jin have supported this observation with a probabilistic upper-bound analysis suggesting that traditional methods relying on full-text similarity are unlikely to yield feasible new event detection (Allan et al. 2000).

Our work is primarily concerned with developing new, more expressive document representations and means of comparing them. In Makkonen et al. (2004), for example, the term space is split into four semantic classes: proper names, locations, temporal expressions and general terms. Proper names are names of companies, organizations, people, etc. Locations denote places in the world. Temporal expressions require successful recognition and formalization before they can be employed (see, e.g., Makkonen and Ahonen-Myka, 2003). Finally, the general terms are what is left over: common nouns, verbs and adjectives. Loosely, these classes can be seen to answer the questions who?, where?, when? and what?

A division of this kind enables us to compare two documents one class at a time: names vs. names, locations vs. locations, etc. This way we can compare the terms within a class using whatever measure suits that class: for locations it could be distance in coordinates or in a geographical hierarchy. For temporal expressions, we can measure the amount of overlap on the time axis, for example. If we think that the meaning of a word lies in its relation to other words, we are in a simple sense introducing semantics into the comparison, i.e., within a semantic class each term has a distance to any other term.
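
The class-wise comparison described above can be illustrated with a small Python sketch. It is not the method of Makkonen et al. (2004); it only assumes that each document has already been split into the four classes. The names Doc, jaccard, interval_overlap and class_similarities are hypothetical, set overlap (Jaccard) stands in for whatever measure one would actually choose for names, locations and general terms, and temporal expressions are assumed to be formalized as (start, end) intervals of day numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """A document split into four semantic classes (hypothetical layout)."""
    names: set = field(default_factory=set)       # who?
    locations: set = field(default_factory=set)   # where?
    times: list = field(default_factory=list)     # when? (start, end) day numbers
    terms: set = field(default_factory=set)       # what?


def jaccard(a, b):
    """Set overlap; a stand-in for any class-specific measure."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def interval_overlap(x, y):
    """Shared length of two intervals relative to their combined span."""
    (s1, e1), (s2, e2) = x, y
    span = max(e1, e2) - min(s1, s2)
    if span == 0:
        return 1.0  # both intervals are the same single point
    shared = min(e1, e2) - max(s1, s2)
    return max(shared, 0) / span


def temporal_similarity(xs, ys):
    """Best overlap between any pair of intervals from the two documents."""
    if not xs or not ys:
        return 0.0
    return max(interval_overlap(x, y) for x in xs for y in ys)


def class_similarities(d1, d2):
    """Compare two documents one semantic class at a time."""
    return {
        "names": jaccard(d1.names, d2.names),
        "locations": jaccard(d1.locations, d2.locations),
        "times": temporal_similarity(d1.times, d2.times),
        "terms": jaccard(d1.terms, d2.terms),
    }


# Toy example with made-up documents.
a = Doc({"nokia"}, {"helsinki"}, [(0, 10)], {"merger", "announce"})
b = Doc({"nokia", "ericsson"}, {"helsinki"}, [(5, 15)], {"merger"})
sims = class_similarities(a, b)
print(sims)
```

The result is a vector of four similarities rather than a single score. How the four values are combined into one decision, for example by a weighted sum, is left open here; the point is only that each class can use its own measure, such as a geographical distance in place of jaccard for locations.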

The work is on-going: Juha Makkonen is working on a PhD thesis on these problems.

Links:

TDT corpora at LDC.
A small TDT bibliography.
TDT evaluations at NIST.