Information extraction from text, Week 6



Some students have told us that they cannot attend the exam on 27.3., so an alternative exam will be arranged. Unfortunately, only one other date is possible: Tuesday 26.3. at 16-20 (Auditorio). If you plan to come to this exam, please tell Helena.


The solutions should be ready for inspection by Friday 22.3.2002 (midnight).

Remember that whenever you are in doubt about what to do, you can ask Reeta or send a message to our newsgroup!


Both exercises are based on the paper Soderland: Learning Information Extraction Rules for Semi-structured and Free Text (Machine Learning, 1999). You can skip the sections on free text, the extensions of the rules for grammatical text, etc. We concentrate on the parts concerning semi-structured and structured text. We discuss chapter 3 in the lecture on 18.3., so you may want to take a look at it, but you probably will not need it (at least not in any detail) for the exercises.

  1. Apply the following WHISK rule to the two sample rental ads below. Lists of the terms in the semantic classes Bdrm and Nghbr are available, as well as a README file explaining the format.

    ID::2
    Pattern:: *( Nghbr ) * ( Digit ) ' ' Bdrm * '$' (Number)
    Output:: Rental {Neighborhood $1} {Bedrooms $2} {Price $3}
    
    

    If a rental ad is covered by the rule, explain how the parts of the rule match the parts of the ad, and give the output the rule produces in this case. If the rule does not cover the ad, explain why.

    @S[
      Green Lake 1 br, parklike, rose grdn, quiet, $495. 
      206/632-5502 <br>
      <i> <font size=-2> (This ad last ran on 08/03/97.) </font> </i> <hr>
    ]@S 159
    @@TAGS Rental {Neighborhood Green Lake} {Bedrooms 1} {Price 495} 
    @@COVERED_BY 
    @@ENDTAGS
    
    @S[
      MAPLE LEAF/ROOSEVELT - <br> 
      Near new large 1 BR, $620-$630. 
      206-524-6446, Pioneer Realty, 206-525-7200.  <br>
      <i> <font size=-2> (This ad is from 08/01/97 to 08/03/97.) </font>
    </i> <hr>
    ]@S 237
    @@TAGS Rental {Neighborhood MAPLE LEAF} {Neighborhood ROOSEVELT} 
    {Bedrooms 1} {Price 620} {Price 630}  @@COVERED_BY 
    @@ENDTAGS
    
    
  2. Read section 5.2 (Discussion) and describe which cases are easy and which are problematic for WHISK rules. You can find more information about the other datasets (seminar announcements, jobs) in the previous sections, and some samples on the page RISE: Repository of Test Domains for Information Extraction. See "Page Models" for the examples.
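
As a reading aid for exercise 1, the rule with ID 2 can be approximated by an ordinary regular expression. The Python sketch below is only an illustration, not the WHISK implementation. The term lists NGHBR and BDRM are short hypothetical stand-ins for the real class lists, the helper names alternation and apply_rule are invented here, case-insensitive matching is an assumption, and the ad text in the usage note is made up. Note also one known difference: a regular expression may backtrack to a later match, whereas the WHISK '*' skips only up to the next occurrence of the following term.

```python
import re

# Hypothetical, abbreviated stand-ins for the semantic classes.
# The real Nghbr and Bdrm lists distributed with the course are longer.
NGHBR = ["Capitol Hill", "Green Lake", "Queen Anne"]
BDRM = ["bedrooms", "bedroom", "bdrm", "brs", "br", "bds", "bd"]


def alternation(terms):
    """Build a regex alternative from a term list, longest terms first."""
    ordered = sorted(terms, key=len, reverse=True)
    return "|".join(re.escape(term) for term in ordered)


# Each piece of the regex mirrors one term of the WHISK pattern
#   *( Nghbr ) * ( Digit ) ' ' Bdrm * '$' (Number)
RULE = re.compile(
    r".*?"                         # *        skip ahead
    rf"({alternation(NGHBR)})"     # (Nghbr)  extracted as $1
    r".*?"                         # *        skip ahead
    r"(\d)"                        # (Digit)  extracted as $2
    r" "                           # ' '      a literal space
    rf"(?:{alternation(BDRM)})\b"  # Bdrm     matched but not extracted
    r".*?"                         # *        skip ahead
    r"\$"                          # '$'      a literal dollar sign
    r"(\d+)",                      # (Number) extracted as $3
    re.IGNORECASE | re.DOTALL,     # assumption: case is ignored
)


def apply_rule(ad_text):
    """Return the filled Output template, or None if the ad is not covered."""
    match = RULE.match(ad_text)
    if match is None:
        return None
    neighborhood, bedrooms, price = match.groups()
    return (
        f"Rental {{Neighborhood {neighborhood}}} "
        f"{{Bedrooms {bedrooms}}} {{Price {price}}}"
    )
```

For the made-up ad "Capitol Hill 1 br twnhme, $675." the call apply_rule(...) returns the string "Rental {Neighborhood Capitol Hill} {Bedrooms 1} {Price 675}", while an ad with no digit followed by a space and a Bdrm term is not covered and gives None. Your written answer should still explain the match term by term, as the exercise asks.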



Helena Ahonen-Myka
Last modified: Fri Mar 15 19:55:25 EET 2002