Example queries using sgrep

Queries from the sgrep home page

All lines with text "Hello World"

sgrep 'start or "\n" .. (end or "\n") containing "Hello World"'

Same query using sample macros

sgrep 'LINE containing "Hello World"'
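
As a cross-check on what the two queries above should return, the same lines can be listed with plain grep. This is only a sketch: the sample file and its contents are made up for illustration, and grep prints bare lines, whereas the sgrep regions built with `..` also include the delimiting newline characters.

```shell
#!/bin/sh
# Sample input; the temporary file and its contents are assumptions
# made up for this illustration.
sample=$(mktemp)
printf 'first line\nHello World here\nsay Hello World again\nlast line\n' > "$sample"

# -F treats the pattern as a fixed string, much like an sgrep phrase.
matches=$(grep -F 'Hello World' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Only the two lines containing the phrase are printed; the first and last lines of the sample are skipped.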

All From fields from mail messages

sgrep '"\nFrom: " .. "\n" extracting ("\n" in "\nFrom: ")'

Same query using sample macros

sgrep 'MAIL_FROM'

Give me the senders of all news articles with the word "sgrep" or "linux" in the subject field

Query using sample news macros

NEWS_FROM in (NEWS_HEADER containing (NEWS_SUBJ containing 
	("sgrep" or "linux")))

Same query with macros expanded

(("\nFrom: " in ( ( start or (("\n\nFrom ") extracting 
	("\n\n" in "\n\nFrom "))) .. ("\n" in "\n\n")) extracting 
	("\n" in "\nFrom: ") .. ( "\n" or end ))) in 
	(( ( start or (("\n\nFrom ") extracting 
	("\n\n" in "\n\nFrom "))) .. ("\n" in "\n\n")) 
	containing ((("\nSubject: " in ( ( start or (("\n\nFrom ") 
	extracting ("\n\n" in "\n\nFrom "))) .. ("\n" in "\n\n")) 
	extracting ("\n" in "\nSubject: ") .. ( "\n" or end ))) 
	containing ("sgrep" or "linux")))

Now you see that macros are very useful :)

Give me titles and names of all HTML documents that contain links to www.cs.helsinki.fi

Query using sample macros

sgrep -o"%f:%r\n" '(HTML_TITLE in (start .. end containing (HTML_HREF
	containing "www.cs.helsinki.fi")))' *.html

Same query with macros expanded

((( ( "<TITLE>" or ( ("<TITLE " or "<TITLE\t" or "<TITLE\n")  .. ">"))
	.. ( "</TITLE>" ) )) in (start .. end containing ((( 
	(( " " or "\t" or "\n" or "\r") __ ">"  in (inner(("<" not 
	in ("</"  or "<!"  or "<?" )) .. ">" )  extracting 
	(("<" not in ("</"  or "<!"  or "<?" ))  __ (( " "
	or "\t" or "\n" or "\r") or ">" ) in 
	inner(("<" not in ("</"  or "<!" 
	or "<?" ))  .. ">" ) )))  containing "HREF="
        ._ (( " " or "\t" or "\n" or "\r") or ">"))) containing
	"www.cs.helsinki.fi")))

Queries from the sgrep announcement

Locate only TITLE and H1 .. H9 elements from HTML documents

Simple version

sgrep '("<TITLE>" .. "</TITLE>") or ("<H1>" .. "</H1>") or \
	("<H2>" .. "</H2>") or ("<H3>" .. "</H3>") or \
	("<H4>" .. "</H4>") or ("<H5>" .. "</H5>") or \
	("<H6>" .. "</H6>") or ("<H7>" .. "</H7>") or \
	("<H8>" .. "</H8>") or ("<H9>" .. "</H9>")'

Same query using example macros. This query is more exact, since it uses the SGML macros, which can handle tags that contain attributes.

sgrep 'HTML_TITLE or HTML_H1 or HTML_H2 or HTML_H3 or HTML_H4 \
	or HTML_H5 or HTML_H6 or NAMED_ELEMS(H7) or NAMED_ELEMS(H8) \
	or NAMED_ELEMS(H9)'

Previous query with macros expanded

(( ( "<TITLE>" or ( ("<TITLE " or "<TITLE\t" or "<TITLE\n")  .. ">")) ..
( "</TITLE>" ) )) or (( ( "<H1>" or ( ("<H1 " or "<H1\t" or "<H1\n")  ..
">")) .. ( "</H1>" ) )) or (( ( "<H2>" or ( ("<H2 " or "<H2\t" or
"<H2\n")  .. ">")) .. ( "</H2>" ) )) or (( ( "<H3>" or ( ("<H3 " or
"<H3\t" or "<H3\n")  .. ">")) .. ( "</H3>" ) )) or (( ( "<H4>" or (
("<H4 " or "<H4\t" or "<H4\n")  .. ">")) .. ( "</H4>" ) )) or (( (
"<H5>" or ( ("<H5 " or "<H5\t" or "<H5\n")  .. ">")) .. ( "</H5>" ) ))
or (( ( "<H6>" or ( ("<H6 " or "<H6\t" or "<H6\n")  .. ">")) ..
( "</H6>" ) )) or ( ( "<H7>" or ( ("<H7 " or "<H7\t" or "<H7\n")  ..
">")) .. ( "</H7>" ) ) or ( ( "<H8>" or ( ("<H8 " or "<H8\t" or "<H8\n")
 .. ">")) .. ( "</H8>" ) )
or ( ( "<H9>" or ( ("<H9 " or "<H9\t" or "<H9\n")  .. ">")) .. (
"</H9>" ) )
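
Each term in the expansion above follows the same pattern for one element name, so the expanded query can also be generated mechanically. The sketch below is not how sgrep expands its macros; `named_elems` is a hypothetical shell helper that only mimics the shape of the expanded NAMED_ELEMS terms shown above.

```shell
#!/bin/sh
# named_elems NAME: hypothetical helper (not part of sgrep) that prints the
# expanded query for one element, following the pattern of the H7..H9 terms.
named_elems() {
    # \\t and \\n make printf emit a literal backslash followed by t or n,
    # which is how the phrases are written inside an sgrep query.
    printf '( ( "<%s>" or ( ("<%s " or "<%s\\t" or "<%s\\n") .. ">")) .. ( "</%s>" ) )' \
        "$1" "$1" "$1" "$1" "$1"
}

# Join the terms for several elements with "or".
query=""
for name in TITLE H1 H2; do
    term=$(named_elems "$name")
    query="${query:+$query or }$term"
done
printf '%s\n' "$query"
```

The resulting string could then be passed to sgrep as its query argument, for example `sgrep "$query" *.html`, assuming the expanded form is what you want to run.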

Remove all <FONT> tags from an HTML document

sgrep -a -o" " 'NAMED_STAG(FONT) or "</FONT>"'

Same example with macros expanded

sgrep -a -o" " '( "<FONT>" or ( ("<FONT " or "<FONT\t" or \
"<FONT\n")  .. ">")) or "</FONT>"'

A different solution to the same problem

sgrep 'start .. end extracting (NAMED_STAG(FONT) or "</FONT>")'

Find out how many FIG elements there are under SUBPARA elements but not under PARA elements in your SGML file

sgrep -c '"<FIG>" .. "</FIG>" in ("<SUBPARA>".."</SUBPARA>") \
	not in ("<PARA>".."</PARA>")'

Same example using sample macros

sgrep -c 'NAMED_ELEMS(FIG) in NAMED_ELEMS(SUBPARA) not in NAMED_ELEMS(PARA)'

Print out the TITLE elements from a set of HTML documents in which the word 'SGML' is mentioned more than 12 times, or which contain the word 'SGML' inside H1 or H2 elements.

sgrep 'HTML_TITLE in (start .. end containing (\
	join(12,"SGML") or (HTML_H1 or HTML_H2 containing "SGML") ) )' *.html

Find the senders of mail messages, from a set of mail files, that contain the word 'SGML' in the subject line, do not contain 'HTML' in the body of the mail, were sent in the year 1996 and were not sent from the address flame@hot.com

sgrep 'MAIL_FROM in (MAIL_MESS containing \
	(MAIL_SUBJ containing "SGML") \
	not containing (MAIL_BODY containing "HTML") \
	containing (MAIL_DATE containing "1996") \
	not containing (MAIL_FROM containing "flame@hot.com") )'

Shell scripts

A shell script to convert all <, > and & characters to &lt;, &gt; and &amp; entities inside PRE elements

Note that this script bypasses all existing &lt;, &gt; and &amp; entities, so that it can be run multiple times over one HTML document. However, this presents one problem: in an sgrep query where you try to locate all &amp; entities with a phrase like

	sgrep '"&amp;"'

the script does not convert the query to proper HTML, because the "&amp;" phrase looks like a correct entity and is bypassed. Instead you get HTML which, when rendered by a browser, looks like this

	sgrep '"&"'

Yes, this did bite me. Thanks to Axel Boldt for pointing this out.

Here is the manually edited script anyway:

#!/bin/tcsh
sgrep -a -o"&lt;" '"<" in ("<PRE>"__"</PRE>")' | \
        sgrep -a -o"&gt;" '">" in ("<PRE>"__"</PRE>")' | \
        sgrep -a -o"&amp;" '"&" in ("<PRE>"__"</PRE>") \
                not in ("&gt;" or "&lt;" or "&amp;")'
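
The behaviour described above (existing entities are skipped, so the conversion can be repeated safely, but text that already looks like an entity is never escaped) can be reproduced without sgrep. The following is only a sketch with sed, not the author's script: `escape_once` is a hypothetical helper, it assumes the placeholder string `__ENT__` never occurs in the input, and it processes the whole input rather than only PRE elements.

```shell
#!/bin/sh
# escape_once: hypothetical helper, not part of sgrep.
# Assumes the placeholder string __ENT__ never occurs in the input.
escape_once() {
    # 1. protect existing entities, 2-4. escape & < >, 5. restore entities.
    sed -E \
        -e 's/&(lt|gt|amp);/__ENT__\1;/g' \
        -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/__ENT__/\&/g'
}

# Running the conversion twice gives the same result as running it once.
once=$(printf '%s\n' 'if (a < b && c > d)' | escape_once)
twice=$(printf '%s\n' "$once" | escape_once)
printf '%s\n%s\n' "$once" "$twice"

# The pitfall: text that already looks like an entity is left alone.
printf '%s\n' "sgrep '\"&amp;\"'" | escape_once
```

The last command prints its input unchanged, which is exactly why such a query renders in a browser with a bare & instead of the intended entity text.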

Last modified: May 3, 1996

This document is maintained by Jani Jaakkola
at email address Jani.Jaakkola@helsinki.fi