On this page you'll find exact definitions for input and output formats, tools and references to tools, example code and links to papers necessary for completing a coding-oriented tree language course project work.
You may assume that the character encoding of all input files is ISO-8859-1 and do not have to have support for other encodings.
The tree-regular grammar will be specified in the following normalized format (in pseudo-BNF):
rule := non-terminal \s* '->' \s* terminal (\s+ RE)? \s* '\n' non-terminal := \w[\w\d._-]* terminal := \w[\w\d._-]* RE := any regular expression over non-terminals using * ? | , ( )
(\w: any letter, \d any digit, \s whitespace)
An example:
P_book -> book P_frontmatter,P_mainmatter,P_backmatter P_frontmatter -> frontmatter P_mainmatter -> mainmatter (P_chapter,P_chapter*) P_backmatter -> backmatter P_chapter -> chapter (P_xref)* P_xref -> xref
Notes: terminals are not allowed in the regular expression, and you do not have to support the + quantifier. Since the alphabet of terminals is words, sequences are denoted by ,. Each rule ends with an end-of-line \n. Each terminal is represented by a rule with no regular expression. Character content and mixed content are effectively ignored.
The start expression r_0 is not explicitly given. Given grammar rules G_1 ... G_n in the form above, with the first rule G_1 being X -> a r_1 you may assume r_0 = X. An alternative approach is to look at the name of the root of the document and construct an r_0 that contains as alternatives all non-terminals with that label.
Queries are grammars with a number of non-terminals on the right-hand side marked with a #. An example:
P_book -> book P_frontmatter,P_mainmatter,P_backmatter P_frontmatter -> frontmatter P_mainmatter -> mainmatter (#P_chapter,P_chapter*) P_backmatter -> backmatter P_chapter -> chapter (P_xref)* P_xref -> xref
is a query whose result set is the first chapter of the book.
There will be a formal set of test cases with expected outcomes available. For the moment, you can see some example grammars and documents. Note that some of them are not meant to be valid.
A recognizer is a single runnable program (or script) that can be run with a command line PROGRAM name_of_grammar_file name_of_xml_file. It may produce any output on stdout and stderr, but must return a status code such that status 0 means that the input document matches the grammar given, 1 means that it does not match and any other status code means that the program encountered an error.
Building a parser for regular expressions is not hard, but takes some time and is not the focus of this course. Below are the relevant bits of a lexer and parser for the regular-expressions used in the grammar above. They should enable you to implement such a parser in your favorite language. If you are totally unfamiliar with C++ and cannot read the code, drop me a line and I'll write some pseudocode (and for those of you that are, yes, it will leak memory when encountering an error).
token_val get_token(input_if<charT>& buf) { for(;;) { charT c=buf.get(); if (c==0 || isequal(c, '\n')) return token_val(END, ""); if (isspace(c)) continue; if (isequal(c, '(')) return token_val(OPENBRACE, ""); if (isequal(c, ')')) return token_val(CLOSEBRACE, ""); if (isequal(c, '?')) return token_val(QUESTION, ""); if (isequal(c, ',')) return token_val(COMMA, ""); if (isequal(c, '*')) return token_val(STAR, ""); if (isequal(c, '|')) return token_val(BAR, ""); if (isalpha(c)) { typename re_traits<charT>::string name; name.push_back(c); while ( isnamechar(c=buf.get()) ) { name.push_back(c); } buf.put(c); return token_val(NAME, name); } typename re_traits::string tok; tok.push_back(c); return token_val(ERROR, tok); } }
This is a straight-forward, hand-written recursive-descent parser. Probably not the prettiest way to implement it, but simple.
retype* parse() { lookahead=_lexer.get_token(_buf); retype* ret=seq(); if (lookahead.first!=lexertype::END) { error(); } if (ret) return ret; return new empty<charT>(_mgr); } void match(typename lexertype::token t) { if (lookahead.first==t) { lookahead=_lexer.get_token(_buf); } else { error(); } } retype* single() { retype* ret=0; if (lookahead.first==lexertype::OPENBRACE) { match(lexertype::OPENBRACE); ret=seq(); match(lexertype::CLOSEBRACE); } else if (lookahead.first==lexertype::NAME) { ret=new identifier<charT>(_mgr, lookahead.second); match(lexertype::NAME); } if (ret) { if (lookahead.first==lexertype::STAR) { match(lexertype::STAR); ret=new star<charT>(_mgr, ret); } else if (lookahead.first==lexertype::QUESTION) { match(lexertype::QUESTION); ret=new optional<charT>(_mgr, ret); } } if (!ret) { error(); return 0; } else { return ret; } } retype* single_or_choice() { retype* first=single(); choice<charT>* c=0; while (lookahead.first==lexertype::BAR) { if (!c) { c=new choice<charT>(_mgr); c->add_child(first); } match(lexertype::BAR); c->add_child(single()); } if (c) return c; return first; } retype* seq() { if (lookahead.first==lexertype::END) { return 0; } retype* first=single_or_choice(); sequence<charT>* s=0; while (lookahead.first==lexertype::COMMA) { if (!s) { s=new sequence<charT>(_mgr); s->add_child(first); } match(lexertype::COMMA); s->add_child(single_or_choice()); } if (s) return s; return first; }
There are XML parsers for all languages available. If you are not familiar with them, here are some usage examples with libraries and languages available at the department:
I can help you get started with the XML parsing side if you are having problems.
This Simple perl script will convert a DTD to our grammar format. It should be run agains an XML file that uses a DTD, not directly against a DTD.
Last updated by Mika Raento, mikie (at) iki.fi on 2004-10-21