544 lines
34 KiB
HTML
544 lines
34 KiB
HTML
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
|
||
<html>
|
||
<head>
|
||
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
||
<meta name="Author" content="Jim Cooper">
|
||
<meta name="GENERATOR" content="Mozilla/4.5 [en] (Win95; I) [Netscape]">
|
||
<title>Parser</title>
|
||
</head>
|
||
<body>
|
||
<b>Here be dragons</b>
|
||
<p>The standard text on writing parsers and compilers is “Compilers. Principles,
|
||
Techniques and Tools” by Aho, Sethi and Ullman. The book is so well known
|
||
it has a nickname: “The Dragon Book”. This is because there is a picture
|
||
of a dragon on the cover. Who says programmers aren’t creative? If you
|
||
ever need to write a parser, never try to work out how to do it yourself.
|
||
The area is well researched and excellent techniques are well known. All
|
||
you need to do is apply them. The difference between a parser and a compiler
|
||
is that a parser processes some text, providing information on its structure
|
||
in a useful form. A compiler will contain a parser, but goes on to use
|
||
its output to generate machine code. Usually compilers also make use of
|
||
a symbol table, to store information about identifiers (like types, scope,
|
||
number of parameters etc) and an error handler, to report errors, and possibly
|
||
to take corrective action so that the remainder of the program can be parsed.
|
||
<p>The phases can be simply described as follows:
|
||
<p>1. <b>Lexical analysis</b>. This takes a stream of characters, and outputs
|
||
a stream of tokens. A token is a non-divisible element of the language,
|
||
like the keywords if or procedure, or the + and ; symbols are in Pascal.
|
||
A sequence of characters that makes up a token is called a lexeme, hence
|
||
the name. This phase is often responsible for removing whitespace (spaces,
|
||
tabs and new lines) and recognising identifiers and keywords. Usually the
|
||
lexical analyser recognises at least integers, instead of using grammar
|
||
rules like the example given earlier. Regular expressions are usually sufficient
|
||
to describe the actions needed, and they are often implemented as finite
|
||
state machines.
|
||
<br>2. <b>Syntax analysis</b>. This phase takes the stream of tokens and
|
||
checks that they obey the grammar rules. The output from this stage is
|
||
a hierarchical structure of tokens that I’m not actually going to describe,
|
||
because it isn’t used in recursive descent parsers. Instead, the same information
|
||
is implicit in the order of functions calls on the call stack. There is
|
||
some flexibility in the division of labour between lexical and syntactic
|
||
analysis. For example, sometimes whitespace is removed by the lexical analyser
|
||
and sometimes by the syntax analyser.
|
||
<br>3. <b>Semantic analysis</b>. This phase ensures that the program makes
|
||
sense. This usually takes the form of type checking; making sure that you
|
||
don’t multiply numbers and strings, for instance.
|
||
<br>4. <b>Code generation</b>. This may result in either machine code,
|
||
or an intermediate form like p-code. This enables the same back-end to
|
||
be used for compilers for different languages. For instance, a Pascal and
|
||
a C++ compiler might both produce p-code, and then use the same code optimisation
|
||
and generation phases.
|
||
<br>5. <b>Code optimisation</b>. Sometimes transformations can be made
|
||
to make the code run more efficiently. Substituting actual values for constants,so
|
||
they don’t need to be looked up at run-time, and taking advantage of instructions
|
||
relevant to a particular microprocessor are simple examples.
|
||
<p>Obviously, parsing an HTML file is going to have slightly different
|
||
phases. A browser would replace code generation with producing an image
|
||
of the page. Fortunately, we can stop at syntax analysis, at which point
|
||
we will have enough information to produce our diagram. Basically, a parser
|
||
consists of stages 1 to 3.
|
||
<p>There are two easy ways to write a parser. One is to use to tools Lex
|
||
and Yacc, which are freely available in several implementations (look for
|
||
LexAndYacc.zip on the Delphi Super Page at <a href="http://sunsite.icm.edu.pl/delphi/">http://sunsite.icm.edu.pl/delphi/</a>
|
||
). Their use is described in the Dragon Book. They are a little sophisticated
|
||
for our purposes here, so we will use the other easy technique, recursive
|
||
descent parsing, which is the easiest way to hand code a parser. Recursive
|
||
descent is an example of a type of parsing scheme known as syntax-directed
|
||
translation, which uses the grammatical rules to organise the code of the
|
||
parser. The first step is always to define the grammar of the language
|
||
we wish to parse. We then convert that grammar directly into code. Any
|
||
changes or corrections are made the same way; change the grammar first,
|
||
then change the code to reflect the new grammar. There is usually also
|
||
a set of semantic rules, which specify what else to do, which might include
|
||
type checking or updating the symbol table. For a larger project, these
|
||
semantic rules should be more formally specified, but I am going to introduce
|
||
them later on, when we get to write the parsing code.
|
||
<p>To begin with, we need to define a few terms. A grammar defines the
|
||
syntax of a language, that is, what that language looks like. It does not
|
||
define the meaning, or semantics, of a language. Grammars are often defined
|
||
using BNF, or Backus-Naur Form. This is itself an example of a context-free
|
||
grammar, which have four components:
|
||
<p>1. A set of tokens, known as terminal symbols.
|
||
<br>2. A set of nonterminals, which are sequences of tokens.
|
||
<br>3. A set of productions, which are the structuring rules of the grammar.
|
||
Each production has a nonterminal on the left side, an arrow, and then
|
||
a sequence of tokens and or nonterminals on the right side.
|
||
<br>4. One of the nonterminals is defined as the start symbol.
|
||
<p>In addition, BNF usually uses boldface strings to denote terminals,
|
||
italics for nonterminals, and normal for tokens (usually numbers and symbols).
|
||
We also usually use the symbol | to separate choices on the right
|
||
side of a production. In Extended BNF, or EBNF, we use the symbols * and
|
||
+ to denote zero or more, and one or more instances of a terminal or nonterminal,
|
||
respectively. (These operations are called Kleene closure and positive
|
||
closure, by the way.) If we want to use either closure on a set of choices,
|
||
the choices are enclosed in square brackets []. So, a set of productions
|
||
that define how to write a number might be:
|
||
<p><tt> number -> sign digitlist</tt>
|
||
<br><tt> sign -> + | - | ?</tt>
|
||
<br><tt> digitlist -> [0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9]+</tt>
|
||
<p>Here the symbol ? denotes the empty string. The rules can be read as:
|
||
<p>“A number has the form of a sign followed by a digitlist, where a sign
|
||
is the + symbol, or the – symbol, or a blank string (ie no symbol), followed
|
||
by a list of at least one character from the integers 0 to 9 inclusive.”
|
||
<p>We will use rules of this form to define HTML. As we aren’t interested
|
||
in all possible elements of an HTML page, we will define only those portions
|
||
of it that we need, but you will be able to extend the example easily enough
|
||
if you wish. Recursive descent parsers have a set of recursive procedures
|
||
to process the token stream, one procedure for each nonterminal in the
|
||
grammar. For example, our simple grammar would have procedures called Number,
|
||
Sign and DigitList, and the lexical analyser would just return single characters
|
||
as tokens. When several rules (or productions) have the same noterminal
|
||
on the left-hand side, they are combined into the one procedure.
|
||
<p>At this point, we could get very boring, with tedious mathematical proofs,
|
||
and talk about start and follow sets, canonical derivations and suchlike
|
||
wonders. Instead, we’ll look very briefly at a couple of key concepts,
|
||
and leave the details to Aho et al. One important notion is that of the
|
||
lookahead symbol. As the name suggests, this is the next token in the input
|
||
stream. It is used in predictive parsing to determine which grammar rule
|
||
to use. So the lookahead symbol will used to determine the next procedure
|
||
to call in our recursive descent parser. This is all a bit vague, so let’s
|
||
consider another simple example grammar, used to define the form of simple
|
||
expressions like 1 + 2, or 5 – 3. This example will also highlight an important
|
||
point about writing grammars. Here are our rules:
|
||
<p><tt>expr -> expr + term</tt>
|
||
<br><tt>expr -> expr – term</tt>
|
||
<br><tt>expr -> term</tt>
|
||
<br><tt>term -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9</tt>
|
||
<p>There is a problem with this set of productions. Suppose we begin at
|
||
the start symbol, in this case expr (unless otherwise stated, the first
|
||
nonterminal is considered to be the start symbol). The first production
|
||
is fired, which looks for an expr, then a + token, then a term (ie a digit).
|
||
When looking for an expr, the first rule is fired, and so on indefinitely.
|
||
This is an example of a left recursive grammar. What we need to do is rewrite
|
||
the productions so that the same language is defined, but where there is
|
||
no left recursion. There is a formal method to use in these situations,
|
||
which is well described in the Dragon Book, but basically we need to ensure
|
||
that no rules have the form:
|
||
<p><tt>A -> Ax</tt>
|
||
<p>The example grammar can be rewritten as:
|
||
<p><tt>expr -> term rest</tt>
|
||
<br><tt>rest -> + expr | - expr | ?</tt>
|
||
<br><tt>term -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9</tt>
|
||
<p>Now we can unambiguously determine which production to apply. To make
|
||
this a bit easier to follow, consider the following pseudocode:
|
||
<p><tt><b><font color="#000099">procedure</font></b> expr;</tt>
|
||
<br><b><tt><font color="#000099">begin</font></tt></b>
|
||
<br><tt> term;</tt>
|
||
<br><tt> rest;</tt>
|
||
<br><tt><b><font color="#000099">end</font></b>;</tt>
|
||
<p><tt><b><font color="#000099">procedure</font></b> rest;</tt>
|
||
<br><b><tt><font color="#000099">begin</font></tt></b>
|
||
<br><tt> <b><font color="#000099">if</font></b> lookahead = ‘+’ <b><font color="#000099">then</font></b></tt>
|
||
<br><tt> match(‘+’);</tt>
|
||
<br><tt> expr;</tt>
|
||
<br><tt> <b><font color="#000099">else if</font></b> lookahead =
|
||
‘-‘ <b><font color="#000099">then</font></b></tt>
|
||
<br><tt> match(‘-‘);</tt>
|
||
<br><tt> expr;</tt>
|
||
<br><tt> <b><font color="#000099">else</font></b></tt>
|
||
<br><tt> Do nothing</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b><font color="#000000">;</font></tt>
|
||
<br><tt><b><font color="#000099">end</font></b><font color="#000000">;</font></tt>
|
||
<p><tt><b><font color="#000099">procedure</font></b> term;</tt>
|
||
<br><b><tt><font color="#000099">begin</font></tt></b>
|
||
<br><tt> <b><font color="#000099">if</font></b> IsDigit(lookahead)
|
||
<b><font color="#000099">then</font></b></tt>
|
||
<br><tt> match(lookahead);</tt>
|
||
<br><tt> <b><font color="#000099">else</font></b></tt>
|
||
<br><tt> Raise error ‘digit expected’</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br><tt><b><font color="#000099">end</font></b>;</tt>
|
||
<p><tt><font color="#000099">procedure</font> match(t : token);</tt>
|
||
<br><b><tt><font color="#000099">begin</font></tt></b>
|
||
<br><tt><b><font color="#000099"> if</font></b> lookahead = t <b><font color="#000099">then</font></b></tt>
|
||
<br><tt> lookahead := GetNextToken;</tt>
|
||
<br><tt> <b><font color="#000099">else</font></b></tt>
|
||
<br><tt> Raise error ‘t expected’</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br><tt><b><font color="#000099">end</font></b>;</tt>
|
||
<br>
|
||
<p>Now it should be starting to become clear why this method of writing
|
||
a parser is so simple. We simply rewrite the productions as code. There
|
||
is an extra procedure match, which is used to simplify the other procedures.
|
||
It gets the next token only if its argument matches the lookahead symbol.
|
||
These procedures could be extended with code to provide the actions required
|
||
by the semantic rules; evaluating the expression, for instance. To see
|
||
how this works, try following this example through. Suppose we want to
|
||
parse the string:
|
||
<p><tt>7 – 2</tt>
|
||
<p>The steps are:
|
||
<p>1. The start symbol is expr, so call that procedure. This calls the
|
||
term procedure.
|
||
<br>2. The term procedure finds the token 7, and returns to expr, which
|
||
calls rest.
|
||
<br>3. The rest procedure finds a – token, and therefore calls expr again
|
||
(note the indirect recursion here).
|
||
<br>4. The second instance of expr calls term, which finds the 2 token.
|
||
<br>5. The second instance of expr calls rest again, which finds the empty
|
||
set.
|
||
<br>6. The second instance of expr returns to the first call of rest, which
|
||
returns to the first call of expr, which then terminates.
|
||
<p>The point of this little digression is that we need to make sure the
|
||
grammar we use for parsing HTML files is not left recursive. Right recursion
|
||
is fine, and will often lead to tail recursion in the code. That is, the
|
||
last statement in a procedure is a recursive call of that same procedure.
|
||
This can always be replaced by a while or repeat loop. Julian Bucknall
|
||
wrote an excellent article on recursion in issue ?? which discusses how,
|
||
and whether, to replace tail recursion.
|
||
<p><b>Tag, you’re it</b>
|
||
<p>That’s probably sufficient background to start writing the HTML parser.
|
||
Again, the first step is to write the grammar. We need to decide how much
|
||
HTML we need to be able to decipher. Because we want to create a diagram
|
||
with links between an HTML page and pages linked to it, along with any
|
||
images used on the original page, we can ignore most tags, and only look
|
||
for the following:
|
||
<p>1. The A tag. The HREF attribute of this tag specifies the URL of a
|
||
hypertext link.
|
||
<br>2. The AREA tag. The HREF attribute of this tag specifies the URL of
|
||
a hypertext link.
|
||
<br>3. The BASE tag. This HREF attribute of this tag will modify any other
|
||
URL on the page.
|
||
<br>4. The FRAME tag. The SRC attribute of this tag specifies the URL of
|
||
a hypertext link.
|
||
<br>5. The IMG tag. The SRC attribute of this tag specifies the URL of
|
||
a hypertext link, usually an image or video clip.
|
||
<br>6. The LINK tag. The HREF attribute of this tag specifies the URL of
|
||
another document that can be used in different ways in the current page.
|
||
For example, if the REL attribute is set to STYLESHEET, the linked document
|
||
is a style sheet. A style sheet is just a text file with instructions on
|
||
how to modify some default characteristics of an HTML page. It is usually
|
||
used to give a consistent look to all pages on a web site.
|
||
<br>7. The TITLE and /TITLE tags. If it exists, we will use the document
|
||
title as the name of the current page on the diagram. It is simple to parse,
|
||
as the text between the start and end tags cannot contain any other HTML.
|
||
<p>Note that we are ignoring the APPLET, BLOCKQUOTE, DEL, DIV, EMBED, HEAD,
|
||
IFRAME, ILAYER, INPUT, INS, LAYER, META, OBJECT, Q, SCRIPT, SPAN, STYLE
|
||
tags, all of which can have an URL as one of the attributes. You may like
|
||
to add these yourself. In addition, we will want to ignore the content
|
||
of comment and META tags, as they can contain all sorts of weird things.
|
||
<p>The general form of an HTML tag is:
|
||
<p><tt><TAGNAME ATTRIBUTE1=value1 ATTRIBUTE2=value2
|
||
…></tt>
|
||
<p>where TAGNAME can contain letters, numbers, hyphens and exclamation
|
||
marks (<!-- --> and <!DOCTYPE> are legal tags). ATTRIBUTE
|
||
tags generally contain only letters and hyphens, but because they can sometimes
|
||
be event names, we will allow numbers as well. We will allow exclamation
|
||
marks just to make the lexical analysis easier. If values contain anything
|
||
other than letters, numbers, hyphens or periods, then they must be enclosed
|
||
in double quotes. Also, in theory, every well-formed HTML document should
|
||
have at least these elements:
|
||
<p><tt><!DOCTYPE></tt>
|
||
<br><tt><HTML></tt>
|
||
<br><tt><HEAD></tt>
|
||
<br><tt><TITLE>The document title goes here</TITLE></tt>
|
||
<br><tt></HEAD></tt>
|
||
<p><tt><BODY></tt>
|
||
<br><tt>The actual document information goes here</tt>
|
||
<br><tt></BODY></tt>
|
||
<br><tt></HTML></tt>
|
||
<p>This is actually somewhat too complicated for our purposes, but it might
|
||
be useful information for more sophisticated applications. You should note
|
||
too, that in practice, many documents do not contain all these elements.
|
||
We shall consider an HTML document to be simply a list of tags and data.
|
||
All of this suggests the following grammar:
|
||
<p><tt>Document -> [Tag | Data] +</tt>
|
||
<br><tt>Tag
|
||
-> < TagName AttributeList > | < / TagName ></tt>
|
||
<br><tt>Data -> [any
|
||
character except <]*</tt>
|
||
<br><tt>TagName -> Identifier</tt>
|
||
<br><tt>AttributeList -> [AttributeName AttributeRest]*</tt>
|
||
<br><tt>AttributeName -> Identifier | QuotedValue</tt>
|
||
<br><tt>AttributeRest -> = Value | ?</tt>
|
||
<br><tt>Value -> QuotedValue
|
||
| PlainValue</tt>
|
||
<br><tt>Identifier -> [A..Z | a..z | 0..9 | - | !]+</tt>
|
||
<br><tt>QuotedValue -> “ [any character]* “</tt>
|
||
<br><tt>PlainValue -> [A..Z | a..z | 0..9 | - | .]+</tt>
|
||
<p>Notice that there is no left recursion in any production. Our semantic
|
||
rules will involve recording information for certain attributes of the
|
||
tags listed above. We will use a similar idea to the symbol table used
|
||
in most compilers. For error handling, we will raise Delphi exceptions,
|
||
as this will automatically release the call stack for us, and allow us
|
||
to catch the exceptions in one place. The lexical analysis will be extremely
|
||
simple, just returning tokens of one character. I have built in the capability
|
||
to easily extend this if you wish. The definition of the parsing classes
|
||
is shown below.
|
||
<p><b><tt><font color="#000099">interface</font></tt></b>
|
||
<p><b><tt><font color="#000099">uses</font></tt></b>
|
||
<br><tt> SysUtils, Classes;</tt>
|
||
<br>
|
||
<p><b><tt><font color="#000099">type</font></tt></b>
|
||
<br><tt> TjimToken = <b><font color="#000099">class</font></b>(TObject)</tt>
|
||
<br><tt> <b><font color="#000099">private</font></b></tt>
|
||
<br><tt> FTokenType : Integer;</tt>
|
||
<br><tt> FAsString : string;</tt>
|
||
<br><tt> <b><font color="#000099">public</font></b></tt>
|
||
<br><tt> <b><font color="#000099">property</font></b>
|
||
TokenType : Integer <b><font color="#000099">read</font></b> FTokenType
|
||
<b><font color="#000099">write</font></b>
|
||
FTokenType;</tt>
|
||
<br><tt> <b><font color="#000099">property</font></b>
|
||
AsString : string <b><font color="#000099">read</font></b> FAsString
|
||
<b><font color="#000099">write</font></b>
|
||
FAsString;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br>
|
||
<p><tt> TjimLexicalAnalyser = <b><font color="#000099">class</font></b>(TObject)</tt>
|
||
<br><tt> <b><font color="#000099">private</font></b></tt>
|
||
<br><tt> FText : string;</tt>
|
||
<br><tt> FPosition : Integer;</tt>
|
||
<p><tt> <b><font color="#000099">procedure</font></b>
|
||
SetText(<b><font color="#000099">const</font></b> Value : string);</tt>
|
||
<br><tt> <b><font color="#000099">public</font></b></tt>
|
||
<br><tt><b><font color="#000099"> constructor</font></b>
|
||
Create;</tt>
|
||
<p><tt> <b><font color="#000099">procedure</font></b>
|
||
GetNextToken(NextToken : TjimToken);</tt>
|
||
<br><tt> <b><font color="#000099">property</font></b>
|
||
Text : string <b><font color="#000099">read</font></b> FText <b><font color="#000099">write</font></b>
|
||
SetText;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br>
|
||
<p><tt> TjimSymbolType = (stTitle,stBase,stLink,stImage);</tt>
|
||
<br>
|
||
<p><tt> TjimSymbol = <b><font color="#000099">class</font></b>(TCollectionItem)</tt>
|
||
<br><tt> <b><font color="#000099">private</font></b></tt>
|
||
<br><tt> FSymbolType : TjimSymbolType;</tt>
|
||
<br><tt> FSymbolValue : string;</tt>
|
||
<br><tt> <b><font color="#000099">public</font></b></tt>
|
||
<br><tt><b><font color="#000099"> procedure</font></b>
|
||
Assign(Source : TPersistent); <b><font color="#000099">override</font></b>;</tt>
|
||
<p><tt> <b><font color="#000099">property</font></b>
|
||
SymbolType : TjimSymbolType <b><font color="#000099">read</font></b>
|
||
FSymbolType <b><font color="#000099">write</font></b> FSymbolType;</tt>
|
||
<br><tt> <b><font color="#000099">property</font></b>
|
||
SymbolValue : string <b><font color="#000099">read</font></b> FSymbolValue
|
||
<b><font color="#000099">write</font></b>
|
||
FSymbolValue;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br>
|
||
<p><tt> TjimSymbolTable = <b><font color="#000099">class</font></b>(TCollection)</tt>
|
||
<br><tt> <b><font color="#000099">private</font></b></tt>
|
||
<br><tt> <b><font color="#000099">function</font></b>
|
||
GetItem(Index : Integer) : TjimSymbol;</tt>
|
||
<br><tt> <b><font color="#000099">procedure</font></b>
|
||
SetItem(Index : Integer;Value : TjimSymbol);</tt>
|
||
<br><b><tt><font color="#000099"> public</font></tt></b>
|
||
<br><tt> <b><font color="#000099">function</font></b>
|
||
Add : TjimSymbol;</tt>
|
||
<br><tt> <b><font color="#000099">function</font></b>
|
||
AddSymbol(SymType : TjimSymbolType;SymValue : string) : TjimSymbol;</tt>
|
||
<p><tt> <b><font color="#000099">property</font></b>
|
||
Items[Index : Integer] : TjimSymbol <b><font color="#000099">read</font></b>
|
||
GetItem <b><font color="#000099">write</font></b> SetItem; <b><font color="#000099">default</font></b>;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br>
|
||
<p><tt> TjimHtmlParser = <b><font color="#000099">class</font></b>(TObject)</tt>
|
||
<br><tt> <b><font color="#000099">private</font></b></tt>
|
||
<br><tt> FLookahead : TjimToken;</tt>
|
||
<br><tt> FLexAnalyser : TjimLexicalAnalyser;</tt>
|
||
<br><tt> FSymbolTable : TjimSymbolTable;</tt>
|
||
<br><tt> FLastTag : string;</tt>
|
||
<p><tt> <b><font color="#000099">procedure</font></b>
|
||
Match(T : Integer);</tt>
|
||
<br><tt> <b><font color="#000099">procedure</font></b>
|
||
ConsumeWhiteSpace;</tt>
|
||
<br><tt> <b><font color="#000099">procedure</font></b>
|
||
Document;</tt>
|
||
<br><tt> <b><font color="#000099">procedure</font></b>
|
||
Tag;</tt>
|
||
<br><tt> <b><font color="#000099">procedure</font></b>
|
||
Data;</tt>
|
||
<br><tt> <b><font color="#000099">procedure</font></b>
|
||
TagName;</tt>
|
||
<br><tt> <b><font color="#000099">procedure</font></b>
|
||
AttributeList;</tt>
|
||
<br><tt> <b><font color="#000099">function</font></b>
|
||
AttributeName : string;</tt>
|
||
<br><tt> <b><font color="#000099">function</font></b>
|
||
Value : string;</tt>
|
||
<br><tt> <b><font color="#000099">function</font></b>
|
||
Identifier : string;</tt>
|
||
<br><tt> <b><font color="#000099">function</font></b>
|
||
QuotedValue : string;</tt>
|
||
<br><tt> <b><font color="#000099">function</font></b>
|
||
PlainValue : string;</tt>
|
||
<br><tt> <b><font color="#000099">public</font></b></tt>
|
||
<br><tt><b><font color="#000099"> constructor</font></b>
|
||
Create;</tt>
|
||
<br><tt> <b><font color="#000099">destructor</font></b>
|
||
Destroy; <b><font color="#000099">override</font></b>;</tt>
|
||
<p><tt> <b><font color="#000099">procedure</font></b>
|
||
Parse(<b><font color="#000099">const</font></b> DocString : string);</tt>
|
||
<p><tt> <b><font color="#000099">property</font></b>
|
||
SymbolTable : TjimSymbolTable <b><font color="#000099">read</font></b>
|
||
FSymbolTable;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br>
|
||
<p><tt> EjimHtmlParserError = <b><font color="#000099">class</font></b>(Exception);</tt>
|
||
<br>
|
||
<p>There are a couple of support classes to represent tokens and symbols.
|
||
Should you wish to write a parser for your own purposes, these should function
|
||
as templates you can build on. Usually, as here, the tokens have a type
|
||
that is the ASCII value of the character, if they are single characters,
|
||
otherwise it is some other predefined integer constant, outside the range
|
||
0 to 255. So, multiple character tokens would each need their own type.
|
||
A common extra token type is a marker for the end of the document. We will
|
||
use the constant ttEndOfDoc, defined to be –1. Obviously, the symbol table
|
||
would need to store different information as well. Often this would include
|
||
the type of the symbol (as in our application) which would then be used
|
||
for type checking (which we won’t be doing). We will not go through the
|
||
code for the lexical analyser as it is extremely simple, and you will easily
|
||
be able to follow the code on the disk. Similarly, the symbol table and
|
||
symbol objects are descended from TCollection and TCollectionItem, respectively,
|
||
and are straightforward implementations as described in the Delphi help
|
||
and Xavier Pacheco’s article in issue 10 (see how many opportunities to
|
||
plug that CD-ROM I’m giving you, Mr Editor, do I get a commission?). Hopefully,
|
||
the code for TjimHtmlParser will be easy to follow as well. A couple of
|
||
examples should be enough to give you the gist.
|
||
<p><tt><b><font color="#000099">procedure</font></b> TjimHtmlParser.Document;</tt>
|
||
<br><tt><b><font color="#000099">begin</font></b> <font color="#009900">{Document}</font></tt>
|
||
<br><tt> <b><font color="#000099">while</font></b> FLookahead.TokenType
|
||
<> ttEndOfDoc <b><font color="#000099">do begin</font></b></tt>
|
||
<br><tt> ConsumeWhiteSpace;</tt>
|
||
<p><tt> <b><font color="#000099">if</font></b> FLookahead.AsString
|
||
= '<' <b><font color="#000099">then begin</font></b></tt>
|
||
<br><tt> Tag;</tt>
|
||
<br><tt> <b><font color="#000099">end else begin</font></b></tt>
|
||
<br><tt> Data;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<p><tt> Match(ttEndOfDoc);</tt>
|
||
<br><tt><b><font color="#000099">end</font></b>; <font color="#009900">{Document}</font></tt>
|
||
<br>
|
||
<p><tt><b><font color="#000099">procedure</font></b> TjimHtmlParser.Tag;</tt>
|
||
<br><tt><b><font color="#000099">begin</font></b> <font color="#009900">{Tag}</font></tt>
|
||
<br><tt> Match(Ord('<'));</tt>
|
||
<br><tt> ConsumeWhiteSpace;</tt>
|
||
<p><tt> <b><font color="#000099">if</font></b> FLookahead.AsString
|
||
= '/' <b><font color="#000099">then begin</font></b></tt>
|
||
<br><tt> // Finding an end tag</tt>
|
||
<br><tt> Match(Ord('/'));</tt>
|
||
<br><tt> FLastTag := '/';</tt>
|
||
<br><tt> ConsumeWhiteSpace;</tt>
|
||
<br><tt> TagName;</tt>
|
||
<br><tt> <b><font color="#000099">end else begin</font></b></tt>
|
||
<br><tt> // Finding a start tag, or a tag that doesn't
|
||
enclose anything</tt>
|
||
<br><tt> FLastTag := '';</tt>
|
||
<br><tt> ConsumeWhiteSpace;</tt>
|
||
<br><tt> TagName;</tt>
|
||
<br><tt> ConsumeWhiteSpace;</tt>
|
||
<br><tt> AttributeList;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<p><tt> Match(Ord('>'));</tt>
|
||
<br><tt><b><font color="#000099">end</font></b>; <font color="#009900">{Tag}</font></tt>
|
||
<p>The procedure Document is the first procedure called when parsing a
|
||
file, as it corresponds to the start symbol of our grammar. As you can
|
||
see, it continues to process the file until the end of document marker
|
||
is reached. We will consume whitespace in the parser rather than the lexical
|
||
analyser, because in our simple example it is easier to manage. The lookahead
|
||
symbol is sufficient for us to be able to determine which procedure to
|
||
call next. If we follow the case where it is ‘<’, the Tag procedure
|
||
will be called. In this procedure, the ‘<’ symbol is first matched,
|
||
so that the next symbol is requested from the lexical analyser. After removing
|
||
any whitespace, the lookahead symbol should be either a tag name, or the
|
||
start of an end tag. To help us implement the semantic rules, we need to
|
||
keep track of the last tag processed, so we will store its value in FLastTag.
|
||
The TagName procedure collects the rest of the name, and if we are in a
|
||
start tag, the attribute list is processed. The tag must end with a ‘>’
|
||
symbol. The other procedure are built in the same fashion. Some of them
|
||
are complicated slightly because they are where our semantic rules are
|
||
implemented. For instance, the TagName procedure steps through comment
|
||
and META tags ignoring their content. The Data procedure collects the title
|
||
of the document if the last tag was the TITLE tag, and having found it,
|
||
adds it to the symbol table. The procedure that does most of the work,
|
||
though, is AttributeList. It is shown below.
|
||
<p><tt><b><font color="#000099">procedure</font></b> TjimHtmlParser.AttributeList;</tt>
|
||
<br><tt> <b><font color="#000099">var</font></b></tt>
|
||
<br><tt> FLastAttribute : string;</tt>
|
||
<br><tt> FLastValue : string;</tt>
|
||
<br><tt><b><font color="#000099">begin</font></b> <font color="#009900">{AttributeList}</font></tt>
|
||
<br><tt> <b><font color="#000099">while</font></b> FLookahead.AsString
|
||
<> '>' <b><font color="#000099">do begin</font></b></tt>
|
||
<br><tt> FLastAttribute := AttributeName;</tt>
|
||
<br><tt> ConsumeWhiteSpace;</tt>
|
||
<p><tt> <b><font color="#000099">if</font></b> FLookahead.AsString
|
||
= '=' <b><font color="#000099">then begin</font></b></tt>
|
||
<br><tt> Match(Ord('='));</tt>
|
||
<br><tt> ConsumeWhiteSpace;</tt>
|
||
<br><tt> FLastValue := Value;</tt>
|
||
<br><tt> ConsumeWhiteSpace;</tt>
|
||
<p><tt> <font color="#009900">// Should only
|
||
get here if FLastAttribute is not an empty string</font></tt>
|
||
<br><tt> <b><font color="#000099">if</font></b>
|
||
(CompareText(FLastTag,'BASE') = 0) <b><font color="#000099">and</font></b></tt>
|
||
<br><tt> (CompareText('HREF',FLastAttribute)
|
||
= 0) <b><font color="#000099">then begin</font></b></tt>
|
||
<br><tt> <font color="#009900">//
|
||
Special case when found the HREF attribute of a BASE tag</font></tt>
|
||
<br><tt> FSymbolTable.AddSymbol(stBase,FLastValue);</tt>
|
||
<br><tt> <b><font color="#000099">end else
|
||
if</font></b> (CompareText(FLastTag,'IMG') = 0) <b><font color="#000099">and</font></b></tt>
|
||
<br><tt> (CompareText('SRC',FLastAttribute)
|
||
= 0) <b><font color="#000099">then begin</font></b></tt>
|
||
<br><tt> <font color="#009900">//
|
||
Found an image</font></tt>
|
||
<br><tt> FSymbolTable.AddSymbol(stImage,FLastValue);</tt>
|
||
<br><tt> <b><font color="#000099">end else
|
||
if</font></b> ((CompareText(FLastTag,'A') = 0) <b><font color="#000099">or</font></b></tt>
|
||
<br><tt>
|
||
(CompareText(FLastTag,'AREA') = 0) <b><font color="#000099">or</font></b></tt>
|
||
<br><tt>
|
||
(CompareText(FLastTag,'LINK') = 0)) <b><font color="#000099">and</font></b></tt>
|
||
<br><tt>
|
||
(CompareText('HREF',FLastAttribute) = 0) <b><font color="#000099">then
|
||
begin</font></b></tt>
|
||
<br><tt> <font color="#009900">//
|
||
Found an ordinary link</font></tt>
|
||
<br><tt> FSymbolTable.AddSymbol(stLink,FLastValue);</tt>
|
||
<br><tt> <b><font color="#000099">end else
|
||
if </font></b>(CompareText(FLastTag,'FRAME') = 0) <b><font color="#000099">and</font></b></tt>
|
||
<br><tt>
|
||
(CompareText('SRC',FLastAttribute) = 0) <b><font color="#000099">then begin</font></b></tt>
|
||
<br><tt> <font color="#009900">//
|
||
Found an ordinary link</font></tt>
|
||
<br><tt> FSymbolTable.AddSymbol(stLink,FLastValue);</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
||
<br><tt><b><font color="#000099">end</font></b>; <font color="#009900">{AttributeList}</font></tt>
|
||
<p>You can see that the syntactic work is done early. Our grammar states
|
||
that an attribute list consists of a list of attributes after the tag name
|
||
and before the ‘>’ symbol. Each attribute has a name, and possibly a value.
|
||
In the case that the attribute does have a value, we need to check if it
|
||
is one of the attributes we need for our diagram. So for our selected tags,
|
||
we check the attribute name, stored in the local variable FLastAttribute,
|
||
and if it is an URL we store it in the symbol table. This is the time that
|
||
we determine the type of the URL, whether it is a BASE tag, an image, or
|
||
another HTML file.
|
||
<p>The remainder of the article is printed in the magazine.
|
||
</body>
|
||
</html>
|