Sgrep home page
What is sgrep?
sgrep (structured grep) is a tool for searching and indexing
text, SGML, XML and HTML files and for filtering text streams using
structural criteria.
The data model of sgrep is based on regions, which are nonempty substrings of
text. Regions are typically occurrences of constant strings, SGML tags,
or meaningful text elements that are recognizable through some
delimiting strings or through the built-in SGML, XML and HTML parser. Regions can
be arbitrarily long, arbitrarily overlapping, and arbitrarily nested.
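The region model described above can be illustrated with a small Python sketch. This is not sgrep's implementation or its query language; the names `Region`, `find_regions`, `contains` and `overlaps` are hypothetical and chosen only for illustration. It assumes a region is represented as a half-open character range.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """A nonempty substring of the text, stored as a half-open range."""
    start: int  # index of the first character
    end: int    # index one past the last character


def find_regions(text: str, phrase: str) -> List[Region]:
    """Return every occurrence of a constant string, overlapping ones included."""
    if not phrase:
        raise ValueError("regions are nonempty, so the phrase must be nonempty")
    regions = []
    pos = text.find(phrase)
    while pos != -1:
        regions.append(Region(pos, pos + len(phrase)))
        # Advance by one character only, so overlapping matches are kept.
        pos = text.find(phrase, pos + 1)
    return regions


def contains(outer: Region, inner: Region) -> bool:
    """True if inner lies completely within outer (nesting)."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a: Region, b: Region) -> bool:
    """True if the two regions share at least one character."""
    return a.start < b.end and b.start < a.end
```

For example, searching "aaa" for the phrase "aa" yields two regions that overlap each other, which a line-oriented tool could not report separately.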
Sgrep is a convenient tool for querying almost any kind
of text file with a well-known structure. These include programs,
mail folders, news folders, HTML, SGML, etc.
With relatively simple queries you can display mail messages by their
subject or sender, extract titles, links or any other regions from HTML files,
extract function prototypes from C, or make complex queries to SGML files
based on the DTD of the file.
NEW! Third prerelease of sgrep-2 is out!
Sgrep version 1.92a is out. This version contains the sources,
a Win32 binary and binaries for HP-UX, Linux, OSF1 and Solaris.
See the download page. The Win32 binary also
includes the m4 macro processor.
Version 1.92 also fixes a fatal bug in sgrep-1.91, which caused
it to dump core when searching without the SGML scanner.
Major new features since 1.90a are:
- Nearness operators for both ordered and unordered nearness.
- Support for 16-bit wide query terms (this really means that Sgrep
now supports Unicode)
- Support for UTF-16 and UTF-8 encodings
- The 'parenting' operator is now an order of magnitude faster (in the common
case)
- Sgrep now emits and parses #line directives, which allows for more
accurate error reporting
- An option to query terms from index files
- Many bug fixes
- Introduces some new bugs (hopefully not as many as I fixed).
Major new features in 1.90a since version 1.70 are:
- Query operators supporting direct containment. In the SGML and XML world
this means that you can query children and parents of given elements.
- The sources are available under the GPL license for those interested in
compiling sgrep themselves.
- Sgrep now uses GNU autoconf, so compiling sgrep on Unix-like systems
should be easy.
- Many bug fixes
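To make the idea of direct containment in the list above concrete, here is a rough Python sketch. It is an assumption about what "direct" means (the smallest enclosing region is the parent), not a description of how sgrep computes it; the function name `direct_parent` and the tuple representation of regions are hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) as a half-open character range


def direct_parent(region: Span, candidates: List[Span]) -> Optional[Span]:
    """Return the smallest candidate that encloses region, or None.

    Assumption: the "parent" of an element is the shortest other region
    that contains it completely.
    """
    enclosing = [
        c for c in candidates
        if c != region and c[0] <= region[0] and region[1] <= c[1]
    ]
    return min(enclosing, key=lambda c: c[1] - c[0], default=None)
```

With a document element spanning (0, 100), a section at (10, 50) and a paragraph at (20, 30), the paragraph's direct parent under this assumption is the section, not the document.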
Major new features since version 0.99 are:
- Indexing of both structure and content.
- SGML/XML/HTML scanner.
- Official Win32 binary.
- sgtool has been dumped. It never really worked and even when it
did, it wasn't very useful.
- Should be completely compatible with older versions of sgrep.
See the README file for details.
How is sgrep used?
Sgrep queries are written in sgrep's own query language. The details of the
language are covered on the
sgrep manual page. See also the
report on using sgrep for querying structured text files.
With the query language you can express queries like:
- Give me all lines with text "Hello World"
- Give me all "From" fields in my mail messages
- Give me the senders of all news articles with the word "sgrep" or
"linux" in the subject field
- Give me titles and names of all HTML documents that contain links
to www.cs.helsinki.fi
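As a rough illustration of how the first query in the list above decomposes into regions, the following Python sketch treats each line as a region delimited by newlines and keeps the lines that contain a phrase. It does not use sgrep or its query syntax; the helper names `line_regions` and `lines_containing` are hypothetical.

```python
from typing import List, Tuple


def line_regions(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) ranges for every nonempty line in text."""
    regions = []
    start = 0
    for line in text.split("\n"):
        end = start + len(line)
        if end > start:  # regions are nonempty, so skip empty lines
            regions.append((start, end))
        start = end + 1  # step over the newline delimiter
    return regions


def lines_containing(text: str, phrase: str) -> List[str]:
    """Return the text of every line region that contains phrase."""
    return [text[s:e] for s, e in line_regions(text) if phrase in text[s:e]]
```

The same two-step shape (first define the candidate regions, then filter them by what they contain) applies to the mail and news queries, with header fields or articles taking the place of lines.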
The new features in sgrep-1.90a, including indexing, are currently documented
only in the README file.
The power of the sgrep query language is at its best when making complex
queries on SGML-like tagged documents. See a set of
example queries, including the queries above.
The most recent stable sgrep version is 0.99. See the
announcement of version 0.99.
The most recent alpha version is 1.91a. See the
announcement of version 1.91a.
Sgrep requirements
Sgrep-1.91a works on Win32 systems (Win95, Win98 and Windows NT) as a console
application, and on any decent Unix-like system supporting memory-mapped
files.
Sgrep from the net
Authors
Sgrep was made by
Jani Jaakkola, email: jjaakkol@cs.helsinki.fi
Pekka Kilpeläinen, email: Pekka.Kilpelainen@helsinki.fi
Last modified: Dec 22, 1998
This document is maintained by Jani Jaakkola
at email address jjaakkol@cs.helsinki.fi