Thursday 3 May 2012

Text extraction with PARSE

Parse is the mixed blessing of Rebol. It is powerful and easy to use, but it is totally different from the well-known regular expressions of other languages.
If you don't know what parse is, go read:
Now you can follow this post.
Everybody knows that to extract text you can use the words thru, to and copy in a rule:

>> parse "Hello, my name is Carl!" [thru "," copy temp to "!" (print temp)]
my name is Carl
== false
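
The false result does not mean the extraction failed: parse returns true only when the rules consume the input up to its end, and to "!" stops just before the exclamation mark. A minimal sketch, assuming you also want a true result (the word ok? is only an illustrative name), adds to end to the rule:

```rebol
; Same rule as above, plus `to end` so the whole input is consumed.
; `ok?` is a hypothetical name used only for this sketch.
ok?: parse "Hello, my name is Carl!" [thru "," copy temp to "!" to end]

print ok?    ; true, because the rule now reaches the end of the string
print temp   ; the text copied between "," and "!"
```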


If you need all occurrences, you may use some or any:


>> parse "Hello, my name is Carl! Hello, my name is Carl!" [some [thru "," copy temp to "!" (print temp)] ]
my name is Carl
my name is Carl
== false
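
The difference between the two words is that some needs at least one match, while any also accepts zero matches. If you would rather keep the results than print them, a possible sketch is to append each match to a block. The block name names and the second person's name are assumptions made for this example, and trim is used to drop any surrounding spaces:

```rebol
; `names` is a hypothetical accumulator, not part of the original post.
names: copy []

parse "Hello, my name is Carl! Hello, my name is Max!" [
    ; `any` repeats the inner rule zero or more times
    any [thru "," copy temp to "!" (append names trim temp)]
]

probe names    ; ["my name is Carl" "my name is Max"]
```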


The power of parse lies in building rules on top of other rules:

>> letters: charset [#"A" - #"Z" #"a" - #"z"]
== make bitset! #{0000000000000000FEFFFF07FEFFFF0700000000000000000000000000000000}

>> digits: charset "0123456789"
== make bitset! #{
000000000000FF03000000000000000000000000000000000000000000000000
}

>> parse "acd" [some letters]
== true

>> parse "123cfc" [ some [digits | letters ]]
== true
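
Rules can also be stored in words and reused inside bigger rules. The following is only a sketch: name-rule, number-rule and id-rule are made-up names, and the "abc-123" format is an assumed example.

```rebol
letters: charset [#"A" - #"Z" #"a" - #"z"]
digits: charset "0123456789"

; Hypothetical named rules, each built from the previous ones.
name-rule:   [some letters]
number-rule: [some digits]
id-rule:     [name-rule "-" number-rule]   ; letters, a dash, then digits

print parse "abc-123" id-rule   ; true
print parse "abc-" id-rule      ; false, no digits after the dash
```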



If you want to extract text based on complex rules, you can't use to and thru the way you would for an exact string; you have to use some or any together with skip. For example:


>> text: { Codename 007 Sassenrath Carl.
Codename 008 Max Vessi.
Codename 101 Semseddin Moldibi. }


How can you extract the name, if you know that it comes right after a 3-digit number and ends before a full stop?
Here is the solution:

>> parse text [some [3 digits copy temp to "." (print temp) | skip]]
Sassenrath Carl
Max Vessi
Semseddin Moldibi
== true


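Remember: the last part with "| skip" is fundamental. When the first alternative does not match, skip moves one character forward so the loop can try again at the next position; without it, the rule stops at the first character that does not match.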
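
To see what the skip alternative does, here is a small sketch that runs the same extraction with and without it. The words names and name-rule are assumptions introduced for this example; in this sketch the version without skip simply returns false, because the text does not start with a digit and nothing moves the position forward.

```rebol
digits: charset "0123456789"
text: { Codename 007 Sassenrath Carl.
Codename 008 Max Vessi.
Codename 101 Semseddin Moldibi. }

; `names` and `name-rule` are hypothetical names for this sketch.
names: copy []
name-rule: [3 digits copy temp to "." (append names trim temp)]

; With `| skip`: every non-matching character is stepped over.
print parse text [some [name-rule | skip]]   ; true
probe names   ; ["Sassenrath Carl" "Max Vessi" "Semseddin Moldibi"]

; Without `| skip`: the rule fails at the start of the text.
print parse text [some name-rule]            ; false
```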
