544 lines
34 KiB
HTML
544 lines
34 KiB
HTML
|
|
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
|
|||
|
|
<html>
|
|||
|
|
<head>
|
|||
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|||
|
|
<meta name="Author" content="Jim Cooper">
|
|||
|
|
<meta name="GENERATOR" content="Mozilla/4.5 [en] (Win95; I) [Netscape]">
|
|||
|
|
<title>Parser</title>
|
|||
|
|
</head>
|
|||
|
|
<body>
|
|||
|
|
<b>Here be dragons</b>
|
|||
|
|
<p>The standard text on writing parsers and compilers is <20>Compilers. Principles,
|
|||
|
|
Techniques and Tools<6C> by Aho, Sethi and Ullman. The book is so well known
|
|||
|
|
it has a nickname: <20>The Dragon Book<6F>. This is because there is a picture
|
|||
|
|
of a dragon on the cover. Who says programmers aren<65>t creative? If you
|
|||
|
|
ever need to write a parser, never try to work out how to do it yourself.
|
|||
|
|
The area is well researched and excellent techniques are well known. All
|
|||
|
|
you need to do is apply them. The difference between a parser and a compiler
|
|||
|
|
is that a parser processes some text, providing information on its structure
|
|||
|
|
in a useful form. A compiler will contain a parser, but goes on to use
|
|||
|
|
its output to generate machine code. Usually compilers also make use of
|
|||
|
|
a symbol table, to store information about identifiers (like types, scope,
|
|||
|
|
number of parameters etc) and an error handler, to report errors, and possibly
|
|||
|
|
to take corrective action so that the remainder of the program can be parsed.
|
|||
|
|
<p>The phases can be simply described as follows:
|
|||
|
|
<p>1. <b>Lexical analysis</b>. This takes a stream of characters, and outputs
|
|||
|
|
a stream of tokens. A token is a non-divisible element of the language,
|
|||
|
|
like the keywords if or procedure, or the + and ; symbols are in Pascal.
|
|||
|
|
A sequence of characters that makes up a token is called a lexeme, hence
|
|||
|
|
the name. This phase is often responsible for removing whitespace (spaces,
|
|||
|
|
tabs and new lines) and recognising identifiers and keywords. Usually the
|
|||
|
|
lexical analyser recognises at least integers, instead of using grammar
|
|||
|
|
rules like the example given earlier. Regular expressions are usually sufficient
|
|||
|
|
to describe the actions needed, and they are often implemented as finite
|
|||
|
|
state machines.
|
|||
|
|
<br>2. <b>Syntax analysis</b>. This phase takes the stream of tokens and
|
|||
|
|
checks that they obey the grammar rules. The output from this stage is
|
|||
|
|
a hierarchical structure of tokens that I<>m not actually going to describe,
|
|||
|
|
because it isn<73>t used in recursive descent parsers. Instead, the same information
|
|||
|
|
is implicit in the order of functions calls on the call stack. There is
|
|||
|
|
some flexibility in the division of labour between lexical and syntactic
|
|||
|
|
analysis. For example, sometimes whitespace is removed by the lexical analyser
|
|||
|
|
and sometimes by the syntax analyser.
|
|||
|
|
<br>3. <b>Semantic analysis</b>. This phase ensures that the program makes
|
|||
|
|
sense. This usually takes the form of type checking; making sure that you
|
|||
|
|
don<EFBFBD>t multiply numbers and strings, for instance.
|
|||
|
|
<br>4. <b>Code generation</b>. This may result in either machine code,
|
|||
|
|
or an intermediate form like p-code. This enables the same back-end to
|
|||
|
|
be used for compilers for different languages. For instance, a Pascal and
|
|||
|
|
a C++ compiler might both produce p-code, and then use the same code optimisation
|
|||
|
|
and generation phases.
|
|||
|
|
<br>5. <b>Code optimisation</b>. Sometimes transformations can be made
|
|||
|
|
to make the code run more efficiently. Substituting actual values for constants,so
|
|||
|
|
they don<6F>t need to be looked up at run-time, and taking advantage of instructions
|
|||
|
|
relevant to a particular microprocessor are simple examples.
|
|||
|
|
<p>Obviously, parsing an HTML file is going to have slightly different
|
|||
|
|
phases. A browser would replace code generation with producing an image
|
|||
|
|
of the page. Fortunately, we can stop at syntax analysis, at which point
|
|||
|
|
we will have enough information to produce our diagram. Basically, a parser
|
|||
|
|
consists of stages 1 to 3.
|
|||
|
|
<p>There are two easy ways to write a parser. One is to use to tools Lex
|
|||
|
|
and Yacc, which are freely available in several implementations (look for
|
|||
|
|
LexAndYacc.zip on the Delphi Super Page at <a href="http://sunsite.icm.edu.pl/delphi/">http://sunsite.icm.edu.pl/delphi/</a>
|
|||
|
|
). Their use is described in the Dragon Book. They are a little sophisticated
|
|||
|
|
for our purposes here, so we will use the other easy technique, recursive
|
|||
|
|
descent parsing, which is the easiest way to hand code a parser. Recursive
|
|||
|
|
descent is an example of a type of parsing scheme known as syntax-directed
|
|||
|
|
translation, which uses the grammatical rules to organise the code of the
|
|||
|
|
parser. The first step is always to define the grammar of the language
|
|||
|
|
we wish to parse. We then convert that grammar directly into code. Any
|
|||
|
|
changes or corrections are made the same way; change the grammar first,
|
|||
|
|
then change the code to reflect the new grammar. There is usually also
|
|||
|
|
a set of semantic rules, which specify what else to do, which might include
|
|||
|
|
type checking or updating the symbol table. For a larger project, these
|
|||
|
|
semantic rules should be more formally specified, but I am going to introduce
|
|||
|
|
them later on, when we get to write the parsing code.
|
|||
|
|
<p>To begin with, we need to define a few terms. A grammar defines the
|
|||
|
|
syntax of a language, that is, what that language looks like. It does not
|
|||
|
|
define the meaning, or semantics, of a language. Grammars are often defined
|
|||
|
|
using BNF, or Backus-Naur Form. This is itself an example of a context-free
|
|||
|
|
grammar, which have four components:
|
|||
|
|
<p>1. A set of tokens, known as terminal symbols.
|
|||
|
|
<br>2. A set of nonterminals, which are sequences of tokens.
|
|||
|
|
<br>3. A set of productions, which are the structuring rules of the grammar.
|
|||
|
|
Each production has a nonterminal on the left side, an arrow, and then
|
|||
|
|
a sequence of tokens and or nonterminals on the right side.
|
|||
|
|
<br>4. One of the nonterminals is defined as the start symbol.
|
|||
|
|
<p>In addition, BNF usually uses boldface strings to denote terminals,
|
|||
|
|
italics for nonterminals, and normal for tokens (usually numbers and symbols).
|
|||
|
|
We also usually use the symbol | to separate choices on the right
|
|||
|
|
side of a production. In Extended BNF, or EBNF, we use the symbols * and
|
|||
|
|
+ to denote zero or more, and one or more instances of a terminal or nonterminal,
|
|||
|
|
respectively. (These operations are called Kleene closure and positive
|
|||
|
|
closure, by the way.) If we want to use either closure on a set of choices,
|
|||
|
|
the choices are enclosed in square brackets []. So, a set of productions
|
|||
|
|
that define how to write a number might be:
|
|||
|
|
<p><tt> number -> sign digitlist</tt>
|
|||
|
|
<br><tt> sign -> + | - | ?</tt>
|
|||
|
|
<br><tt> digitlist -> [0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9]+</tt>
|
|||
|
|
<p>Here the symbol ? denotes the empty string. The rules can be read as:
|
|||
|
|
<p><EFBFBD>A number has the form of a sign followed by a digitlist, where a sign
|
|||
|
|
is the + symbol, or the <20> symbol, or a blank string (ie no symbol), followed
|
|||
|
|
by a list of at least one character from the integers 0 to 9 inclusive.<2E>
|
|||
|
|
<p>We will use rules of this form to define HTML. As we aren<65>t interested
|
|||
|
|
in all possible elements of an HTML page, we will define only those portions
|
|||
|
|
of it that we need, but you will be able to extend the example easily enough
|
|||
|
|
if you wish. Recursive descent parsers have a set of recursive procedures
|
|||
|
|
to process the token stream, one procedure for each nonterminal in the
|
|||
|
|
grammar. For example, our simple grammar would have procedures called Number,
|
|||
|
|
Sign and DigitList, and the lexical analyser would just return single characters
|
|||
|
|
as tokens. When several rules (or productions) have the same noterminal
|
|||
|
|
on the left-hand side, they are combined into the one procedure.
|
|||
|
|
<p>At this point, we could get very boring, with tedious mathematical proofs,
|
|||
|
|
and talk about start and follow sets, canonical derivations and suchlike
|
|||
|
|
wonders. Instead, we<77>ll look very briefly at a couple of key concepts,
|
|||
|
|
and leave the details to Aho et al. One important notion is that of the
|
|||
|
|
lookahead symbol. As the name suggests, this is the next token in the input
|
|||
|
|
stream. It is used in predictive parsing to determine which grammar rule
|
|||
|
|
to use. So the lookahead symbol will used to determine the next procedure
|
|||
|
|
to call in our recursive descent parser. This is all a bit vague, so let<65>s
|
|||
|
|
consider another simple example grammar, used to define the form of simple
|
|||
|
|
expressions like 1 + 2, or 5 <20> 3. This example will also highlight an important
|
|||
|
|
point about writing grammars. Here are our rules:
|
|||
|
|
<p><tt>expr -> expr + term</tt>
|
|||
|
|
<br><tt>expr -> expr <20> term</tt>
|
|||
|
|
<br><tt>expr -> term</tt>
|
|||
|
|
<br><tt>term -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9</tt>
|
|||
|
|
<p>There is a problem with this set of productions. Suppose we begin at
|
|||
|
|
the start symbol, in this case expr (unless otherwise stated, the first
|
|||
|
|
nonterminal is considered to be the start symbol). The first production
|
|||
|
|
is fired, which looks for an expr, then a + token, then a term (ie a digit).
|
|||
|
|
When looking for an expr, the first rule is fired, and so on indefinitely.
|
|||
|
|
This is an example of a left recursive grammar. What we need to do is rewrite
|
|||
|
|
the productions so that the same language is defined, but where there is
|
|||
|
|
no left recursion. There is a formal method to use in these situations,
|
|||
|
|
which is well described in the Dragon Book, but basically we need to ensure
|
|||
|
|
that no rules have the form:
|
|||
|
|
<p><tt>A -> Ax</tt>
|
|||
|
|
<p>The example grammar can be rewritten as:
|
|||
|
|
<p><tt>expr -> term rest</tt>
|
|||
|
|
<br><tt>rest -> + expr | - expr | ?</tt>
|
|||
|
|
<br><tt>term -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9</tt>
|
|||
|
|
<p>Now we can unambiguously determine which production to apply. To make
|
|||
|
|
this a bit easier to follow, consider the following pseudocode:
|
|||
|
|
<p><tt><b><font color="#000099">procedure</font></b> expr;</tt>
|
|||
|
|
<br><b><tt><font color="#000099">begin</font></tt></b>
|
|||
|
|
<br><tt> term;</tt>
|
|||
|
|
<br><tt> rest;</tt>
|
|||
|
|
<br><tt><b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<p><tt><b><font color="#000099">procedure</font></b> rest;</tt>
|
|||
|
|
<br><b><tt><font color="#000099">begin</font></tt></b>
|
|||
|
|
<br><tt> <b><font color="#000099">if</font></b> lookahead = <20>+<2B> <b><font color="#000099">then</font></b></tt>
|
|||
|
|
<br><tt> match(<28>+<2B>);</tt>
|
|||
|
|
<br><tt> expr;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">else if</font></b> lookahead =
|
|||
|
|
<EFBFBD>-<2D> <b><font color="#000099">then</font></b></tt>
|
|||
|
|
<br><tt> match(<28>-<2D>);</tt>
|
|||
|
|
<br><tt> expr;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">else</font></b></tt>
|
|||
|
|
<br><tt> Do nothing</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b><font color="#000000">;</font></tt>
|
|||
|
|
<br><tt><b><font color="#000099">end</font></b><font color="#000000">;</font></tt>
|
|||
|
|
<p><tt><b><font color="#000099">procedure</font></b> term;</tt>
|
|||
|
|
<br><b><tt><font color="#000099">begin</font></tt></b>
|
|||
|
|
<br><tt> <b><font color="#000099">if</font></b> IsDigit(lookahead)
|
|||
|
|
<b><font color="#000099">then</font></b></tt>
|
|||
|
|
<br><tt> match(lookahead);</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">else</font></b></tt>
|
|||
|
|
<br><tt> Raise error <20>digit expected<65></tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br><tt><b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<p><tt><font color="#000099">procedure</font> match(t : token);</tt>
|
|||
|
|
<br><b><tt><font color="#000099">begin</font></tt></b>
|
|||
|
|
<br><tt><b><font color="#000099"> if</font></b> lookahead = t <b><font color="#000099">then</font></b></tt>
|
|||
|
|
<br><tt> lookahead := GetNextToken;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">else</font></b></tt>
|
|||
|
|
<br><tt> Raise error <20>t expected<65></tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br><tt><b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br>
|
|||
|
|
<p>Now it should be starting to become clear why this method of writing
|
|||
|
|
a parser is so simple. We simply rewrite the productions as code. There
|
|||
|
|
is an extra procedure match, which is used to simplify the other procedures.
|
|||
|
|
It gets the next token only if its argument matches the lookahead symbol.
|
|||
|
|
These procedures could be extended with code to provide the actions required
|
|||
|
|
by the semantic rules; evaluating the expression, for instance. To see
|
|||
|
|
how this works, try following this example through. Suppose we want to
|
|||
|
|
parse the string:
|
|||
|
|
<p><tt>7 <20> 2</tt>
|
|||
|
|
<p>The steps are:
|
|||
|
|
<p>1. The start symbol is expr, so call that procedure. This calls the
|
|||
|
|
term procedure.
|
|||
|
|
<br>2. The term procedure finds the token 7, and returns to expr, which
|
|||
|
|
calls rest.
|
|||
|
|
<br>3. The rest procedure finds a <20> token, and therefore calls expr again
|
|||
|
|
(note the indirect recursion here).
|
|||
|
|
<br>4. The second instance of expr calls term, which finds the 2 token.
|
|||
|
|
<br>5. The second instance of expr calls rest again, which finds the empty
|
|||
|
|
set.
|
|||
|
|
<br>6. The second instance of expr returns to the first call of rest, which
|
|||
|
|
returns to the first call of expr, which then terminates.
|
|||
|
|
<p>The point of this little digression is that we need to make sure the
|
|||
|
|
grammar we use for parsing HTML files is not left recursive. Right recursion
|
|||
|
|
is fine, and will often lead to tail recursion in the code. That is, the
|
|||
|
|
last statement in a procedure is a recursive call of that same procedure.
|
|||
|
|
This can always be replaced by a while or repeat loop. Julian Bucknall
|
|||
|
|
wrote an excellent article on recursion in issue ?? which discusses how,
|
|||
|
|
and whether, to replace tail recursion.
|
|||
|
|
<p><b>Tag, you<6F>re it</b>
|
|||
|
|
<p>That<EFBFBD>s probably sufficient background to start writing the HTML parser.
|
|||
|
|
Again, the first step is to write the grammar. We need to decide how much
|
|||
|
|
HTML we need to be able to decipher. Because we want to create a diagram
|
|||
|
|
with links between an HTML page and pages linked to it, along with any
|
|||
|
|
images used on the original page, we can ignore most tags, and only look
|
|||
|
|
for the following:
|
|||
|
|
<p>1. The A tag. The HREF attribute of this tag specifies the URL of a
|
|||
|
|
hypertext link.
|
|||
|
|
<br>2. The AREA tag. The HREF attribute of this tag specifies the URL of
|
|||
|
|
a hypertext link.
|
|||
|
|
<br>3. The BASE tag. This HREF attribute of this tag will modify any other
|
|||
|
|
URL on the page.
|
|||
|
|
<br>4. The FRAME tag. The SRC attribute of this tag specifies the URL of
|
|||
|
|
a hypertext link.
|
|||
|
|
<br>5. The IMG tag. The SRC attribute of this tag specifies the URL of
|
|||
|
|
a hypertext link, usually an image or video clip.
|
|||
|
|
<br>6. The LINK tag. The HREF attribute of this tag specifies the URL of
|
|||
|
|
another document that can be used in different ways in the current page.
|
|||
|
|
For example, if the REL attribute is set to STYLESHEET, the linked document
|
|||
|
|
is a style sheet. A style sheet is just a text file with instructions on
|
|||
|
|
how to modify some default characteristics of an HTML page. It is usually
|
|||
|
|
used to give a consistent look to all pages on a web site.
|
|||
|
|
<br>7. The TITLE and /TITLE tags. If it exists, we will use the document
|
|||
|
|
title as the name of the current page on the diagram. It is simple to parse,
|
|||
|
|
as the text between the start and end tags cannot contain any other HTML.
|
|||
|
|
<p>Note that we are ignoring the APPLET, BLOCKQUOTE, DEL, DIV, EMBED, HEAD,
|
|||
|
|
IFRAME, ILAYER, INPUT, INS, LAYER, META, OBJECT, Q, SCRIPT, SPAN, STYLE
|
|||
|
|
tags, all of which can have an URL as one of the attributes. You may like
|
|||
|
|
to add these yourself. In addition, we will want to ignore the content
|
|||
|
|
of comment and META tags, as they can contain all sorts of weird things.
|
|||
|
|
<p>The general form of an HTML tag is:
|
|||
|
|
<p><tt><TAGNAME ATTRIBUTE1=value1 ATTRIBUTE2=value2
|
|||
|
|
<EFBFBD>></tt>
|
|||
|
|
<p>where TAGNAME can contain letters, numbers, hyphens and exclamation
|
|||
|
|
marks (<!-- --> and <!DOCTYPE> are legal tags). ATTRIBUTE
|
|||
|
|
tags generally contain only letters and hyphens, but because they can sometimes
|
|||
|
|
be event names, we will allow numbers as well. We will allow exclamation
|
|||
|
|
marks just to make the lexical analysis easier. If values contain anything
|
|||
|
|
other than letters, numbers, hyphens or periods, then they must be enclosed
|
|||
|
|
in double quotes. Also, in theory, every well-formed HTML document should
|
|||
|
|
have at least these elements:
|
|||
|
|
<p><tt><!DOCTYPE></tt>
|
|||
|
|
<br><tt><HTML></tt>
|
|||
|
|
<br><tt><HEAD></tt>
|
|||
|
|
<br><tt><TITLE>The document title goes here</TITLE></tt>
|
|||
|
|
<br><tt></HEAD></tt>
|
|||
|
|
<p><tt><BODY></tt>
|
|||
|
|
<br><tt>The actual document information goes here</tt>
|
|||
|
|
<br><tt></BODY></tt>
|
|||
|
|
<br><tt></HTML></tt>
|
|||
|
|
<p>This is actually somewhat too complicated for our purposes, but it might
|
|||
|
|
be useful information for more sophisticated applications. You should note
|
|||
|
|
too, that in practice, many documents do not contain all these elements.
|
|||
|
|
We shall consider an HTML document to be simply a list of tags and data.
|
|||
|
|
All of this suggests the following grammar:
|
|||
|
|
<p><tt>Document -> [Tag | Data] +</tt>
|
|||
|
|
<br><tt>Tag
|
|||
|
|
-> < TagName AttributeList > | < / TagName ></tt>
|
|||
|
|
<br><tt>Data -> [any
|
|||
|
|
character except <]*</tt>
|
|||
|
|
<br><tt>TagName -> Identifier</tt>
|
|||
|
|
<br><tt>AttributeList -> [AttributeName AttributeRest]*</tt>
|
|||
|
|
<br><tt>AttributeName -> Identifier | QuotedValue</tt>
|
|||
|
|
<br><tt>AttributeRest -> = Value | ?</tt>
|
|||
|
|
<br><tt>Value -> QuotedValue
|
|||
|
|
| PlainValue</tt>
|
|||
|
|
<br><tt>Identifier -> [A..Z | a..z | 0..9 | - | !]+</tt>
|
|||
|
|
<br><tt>QuotedValue -> <20> [any character]* <20></tt>
|
|||
|
|
<br><tt>PlainValue -> [A..Z | a..z | 0..9 | - | .]+</tt>
|
|||
|
|
<p>Notice that there is no left recursion in any production. Our semantic
|
|||
|
|
rules will involve recording information for certain attributes of the
|
|||
|
|
tags listed above. We will use a similar idea to the symbol table used
|
|||
|
|
in most compilers. For error handling, we will raise Delphi exceptions,
|
|||
|
|
as this will automatically release the call stack for us, and allow us
|
|||
|
|
to catch the exceptions in one place. The lexical analysis will be extremely
|
|||
|
|
simple, just returning tokens of one character. I have built in the capability
|
|||
|
|
to easily extend this if you wish. The definition of the parsing classes
|
|||
|
|
is shown below.
|
|||
|
|
<p><b><tt><font color="#000099">interface</font></tt></b>
|
|||
|
|
<p><b><tt><font color="#000099">uses</font></tt></b>
|
|||
|
|
<br><tt> SysUtils, Classes;</tt>
|
|||
|
|
<br>
|
|||
|
|
<p><b><tt><font color="#000099">type</font></tt></b>
|
|||
|
|
<br><tt> TjimToken = <b><font color="#000099">class</font></b>(TObject)</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">private</font></b></tt>
|
|||
|
|
<br><tt> FTokenType : Integer;</tt>
|
|||
|
|
<br><tt> FAsString : string;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">public</font></b></tt>
|
|||
|
|
<br><tt> <b><font color="#000099">property</font></b>
|
|||
|
|
TokenType : Integer <b><font color="#000099">read</font></b> FTokenType
|
|||
|
|
<b><font color="#000099">write</font></b>
|
|||
|
|
FTokenType;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">property</font></b>
|
|||
|
|
AsString : string <b><font color="#000099">read</font></b> FAsString
|
|||
|
|
<b><font color="#000099">write</font></b>
|
|||
|
|
FAsString;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br>
|
|||
|
|
<p><tt> TjimLexicalAnalyser = <b><font color="#000099">class</font></b>(TObject)</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">private</font></b></tt>
|
|||
|
|
<br><tt> FText : string;</tt>
|
|||
|
|
<br><tt> FPosition : Integer;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
SetText(<b><font color="#000099">const</font></b> Value : string);</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">public</font></b></tt>
|
|||
|
|
<br><tt><b><font color="#000099"> constructor</font></b>
|
|||
|
|
Create;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
GetNextToken(NextToken : TjimToken);</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">property</font></b>
|
|||
|
|
Text : string <b><font color="#000099">read</font></b> FText <b><font color="#000099">write</font></b>
|
|||
|
|
SetText;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br>
|
|||
|
|
<p><tt> TjimSymbolType = (stTitle,stBase,stLink,stImage);</tt>
|
|||
|
|
<br>
|
|||
|
|
<p><tt> TjimSymbol = <b><font color="#000099">class</font></b>(TCollectionItem)</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">private</font></b></tt>
|
|||
|
|
<br><tt> FSymbolType : TjimSymbolType;</tt>
|
|||
|
|
<br><tt> FSymbolValue : string;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">public</font></b></tt>
|
|||
|
|
<br><tt><b><font color="#000099"> procedure</font></b>
|
|||
|
|
Assign(Source : TPersistent); <b><font color="#000099">override</font></b>;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">property</font></b>
|
|||
|
|
SymbolType : TjimSymbolType <b><font color="#000099">read</font></b>
|
|||
|
|
FSymbolType <b><font color="#000099">write</font></b> FSymbolType;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">property</font></b>
|
|||
|
|
SymbolValue : string <b><font color="#000099">read</font></b> FSymbolValue
|
|||
|
|
<b><font color="#000099">write</font></b>
|
|||
|
|
FSymbolValue;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br>
|
|||
|
|
<p><tt> TjimSymbolTable = <b><font color="#000099">class</font></b>(TCollection)</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">private</font></b></tt>
|
|||
|
|
<br><tt> <b><font color="#000099">function</font></b>
|
|||
|
|
GetItem(Index : Integer) : TjimSymbol;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
SetItem(Index : Integer;Value : TjimSymbol);</tt>
|
|||
|
|
<br><b><tt><font color="#000099"> public</font></tt></b>
|
|||
|
|
<br><tt> <b><font color="#000099">function</font></b>
|
|||
|
|
Add : TjimSymbol;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">function</font></b>
|
|||
|
|
AddSymbol(SymType : TjimSymbolType;SymValue : string) : TjimSymbol;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">property</font></b>
|
|||
|
|
Items[Index : Integer] : TjimSymbol <b><font color="#000099">read</font></b>
|
|||
|
|
GetItem <b><font color="#000099">write</font></b> SetItem; <b><font color="#000099">default</font></b>;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br>
|
|||
|
|
<p><tt> TjimHtmlParser = <b><font color="#000099">class</font></b>(TObject)</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">private</font></b></tt>
|
|||
|
|
<br><tt> FLookahead : TjimToken;</tt>
|
|||
|
|
<br><tt> FLexAnalyser : TjimLexicalAnalyser;</tt>
|
|||
|
|
<br><tt> FSymbolTable : TjimSymbolTable;</tt>
|
|||
|
|
<br><tt> FLastTag : string;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
Match(T : Integer);</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
ConsumeWhiteSpace;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
Document;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
Tag;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
Data;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
TagName;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
AttributeList;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">function</font></b>
|
|||
|
|
AttributeName : string;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">function</font></b>
|
|||
|
|
Value : string;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">function</font></b>
|
|||
|
|
Identifier : string;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">function</font></b>
|
|||
|
|
QuotedValue : string;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">function</font></b>
|
|||
|
|
PlainValue : string;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">public</font></b></tt>
|
|||
|
|
<br><tt><b><font color="#000099"> constructor</font></b>
|
|||
|
|
Create;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">destructor</font></b>
|
|||
|
|
Destroy; <b><font color="#000099">override</font></b>;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">procedure</font></b>
|
|||
|
|
Parse(<b><font color="#000099">const</font></b> DocString : string);</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">property</font></b>
|
|||
|
|
SymbolTable : TjimSymbolTable <b><font color="#000099">read</font></b>
|
|||
|
|
FSymbolTable;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br>
|
|||
|
|
<p><tt> EjimHtmlParserError = <b><font color="#000099">class</font></b>(Exception);</tt>
|
|||
|
|
<br>
|
|||
|
|
<p>There are a couple of support classes to represent tokens and symbols.
|
|||
|
|
Should you wish to write a parser for your own purposes, these should function
|
|||
|
|
as templates you can build on. Usually, as here, the tokens have a type
|
|||
|
|
that is the ASCII value of the character, if they are single characters,
|
|||
|
|
otherwise it is some other predefined integer constant, outside the range
|
|||
|
|
0 to 255. So, multiple character tokens would each need their own type.
|
|||
|
|
A common extra token type is a marker for the end of the document. We will
|
|||
|
|
use the constant ttEndOfDoc, defined to be <20>1. Obviously, the symbol table
|
|||
|
|
would need to store different information as well. Often this would include
|
|||
|
|
the type of the symbol (as in our application) which would then be used
|
|||
|
|
for type checking (which we won<6F>t be doing). We will not go through the
|
|||
|
|
code for the lexical analyser as it is extremely simple, and you will easily
|
|||
|
|
be able to follow the code on the disk. Similarly, the symbol table and
|
|||
|
|
symbol objects are descended from TCollection and TCollectionItem, respectively,
|
|||
|
|
and are straightforward implementations as described in the Delphi help
|
|||
|
|
and Xavier Pacheco<63>s article in issue 10 (see how many opportunities to
|
|||
|
|
plug that CD-ROM I<>m giving you, Mr Editor, do I get a commission?). Hopefully,
|
|||
|
|
the code for TjimHtmlParser will be easy to follow as well. A couple of
|
|||
|
|
examples should be enough to give you the gist.
|
|||
|
|
<p><tt><b><font color="#000099">procedure</font></b> TjimHtmlParser.Document;</tt>
|
|||
|
|
<br><tt><b><font color="#000099">begin</font></b> <font color="#009900">{Document}</font></tt>
|
|||
|
|
<br><tt> <b><font color="#000099">while</font></b> FLookahead.TokenType
|
|||
|
|
<> ttEndOfDoc <b><font color="#000099">do begin</font></b></tt>
|
|||
|
|
<br><tt> ConsumeWhiteSpace;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">if</font></b> FLookahead.AsString
|
|||
|
|
= '<' <b><font color="#000099">then begin</font></b></tt>
|
|||
|
|
<br><tt> Tag;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end else begin</font></b></tt>
|
|||
|
|
<br><tt> Data;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<p><tt> Match(ttEndOfDoc);</tt>
|
|||
|
|
<br><tt><b><font color="#000099">end</font></b>; <font color="#009900">{Document}</font></tt>
|
|||
|
|
<br>
|
|||
|
|
<p><tt><b><font color="#000099">procedure</font></b> TjimHtmlParser.Tag;</tt>
|
|||
|
|
<br><tt><b><font color="#000099">begin</font></b> <font color="#009900">{Tag}</font></tt>
|
|||
|
|
<br><tt> Match(Ord('<'));</tt>
|
|||
|
|
<br><tt> ConsumeWhiteSpace;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">if</font></b> FLookahead.AsString
|
|||
|
|
= '/' <b><font color="#000099">then begin</font></b></tt>
|
|||
|
|
<br><tt> // Finding an end tag</tt>
|
|||
|
|
<br><tt> Match(Ord('/'));</tt>
|
|||
|
|
<br><tt> FLastTag := '/';</tt>
|
|||
|
|
<br><tt> ConsumeWhiteSpace;</tt>
|
|||
|
|
<br><tt> TagName;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end else begin</font></b></tt>
|
|||
|
|
<br><tt> // Finding a start tag, or a tag that doesn't
|
|||
|
|
enclose anything</tt>
|
|||
|
|
<br><tt> FLastTag := '';</tt>
|
|||
|
|
<br><tt> ConsumeWhiteSpace;</tt>
|
|||
|
|
<br><tt> TagName;</tt>
|
|||
|
|
<br><tt> ConsumeWhiteSpace;</tt>
|
|||
|
|
<br><tt> AttributeList;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<p><tt> Match(Ord('>'));</tt>
|
|||
|
|
<br><tt><b><font color="#000099">end</font></b>; <font color="#009900">{Tag}</font></tt>
|
|||
|
|
<p>The procedure Document is the first procedure called when parsing a
|
|||
|
|
file, as it corresponds to the start symbol of our grammar. As you can
|
|||
|
|
see, it continues to process the file until the end of document marker
|
|||
|
|
is reached. We will consume whitespace in the parser rather than the lexical
|
|||
|
|
analyser, because in our simple example it is easier to manage. The lookahead
|
|||
|
|
symbol is sufficient for us to be able to determine which procedure to
|
|||
|
|
call next. If we follow the case where it is <20><<EFBFBD>, the Tag procedure
|
|||
|
|
will be called. In this procedure, the <20><<EFBFBD> symbol is first matched,
|
|||
|
|
so that the next symbol is requested from the lexical analyser. After removing
|
|||
|
|
any whitespace, the lookahead symbol should be either a tag name, or the
|
|||
|
|
start of an end tag. To help us implement the semantic rules, we need to
|
|||
|
|
keep track of the last tag processed, so we will store its value in FLastTag.
|
|||
|
|
The TagName procedure collects the rest of the name, and if we are in a
|
|||
|
|
start tag, the attribute list is processed. The tag must end with a <20>><3E>
|
|||
|
|
symbol. The other procedure are built in the same fashion. Some of them
|
|||
|
|
are complicated slightly because they are where our semantic rules are
|
|||
|
|
implemented. For instance, the TagName procedure steps through comment
|
|||
|
|
and META tags ignoring their content. The Data procedure collects the title
|
|||
|
|
of the document if the last tag was the TITLE tag, and having found it,
|
|||
|
|
adds it to the symbol table. The procedure that does most of the work,
|
|||
|
|
though, is AttributeList. It is shown below.
|
|||
|
|
<p><tt><b><font color="#000099">procedure</font></b> TjimHtmlParser.AttributeList;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">var</font></b></tt>
|
|||
|
|
<br><tt> FLastAttribute : string;</tt>
|
|||
|
|
<br><tt> FLastValue : string;</tt>
|
|||
|
|
<br><tt><b><font color="#000099">begin</font></b> <font color="#009900">{AttributeList}</font></tt>
|
|||
|
|
<br><tt> <b><font color="#000099">while</font></b> FLookahead.AsString
|
|||
|
|
<> '>' <b><font color="#000099">do begin</font></b></tt>
|
|||
|
|
<br><tt> FLastAttribute := AttributeName;</tt>
|
|||
|
|
<br><tt> ConsumeWhiteSpace;</tt>
|
|||
|
|
<p><tt> <b><font color="#000099">if</font></b> FLookahead.AsString
|
|||
|
|
= '=' <b><font color="#000099">then begin</font></b></tt>
|
|||
|
|
<br><tt> Match(Ord('='));</tt>
|
|||
|
|
<br><tt> ConsumeWhiteSpace;</tt>
|
|||
|
|
<br><tt> FLastValue := Value;</tt>
|
|||
|
|
<br><tt> ConsumeWhiteSpace;</tt>
|
|||
|
|
<p><tt> <font color="#009900">// Should only
|
|||
|
|
get here if FLastAttribute is not an empty string</font></tt>
|
|||
|
|
<br><tt> <b><font color="#000099">if</font></b>
|
|||
|
|
(CompareText(FLastTag,'BASE') = 0) <b><font color="#000099">and</font></b></tt>
|
|||
|
|
<br><tt> (CompareText('HREF',FLastAttribute)
|
|||
|
|
= 0) <b><font color="#000099">then begin</font></b></tt>
|
|||
|
|
<br><tt> <font color="#009900">//
|
|||
|
|
Special case when found the HREF attribute of a BASE tag</font></tt>
|
|||
|
|
<br><tt> FSymbolTable.AddSymbol(stBase,FLastValue);</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end else
|
|||
|
|
if</font></b> (CompareText(FLastTag,'IMG') = 0) <b><font color="#000099">and</font></b></tt>
|
|||
|
|
<br><tt> (CompareText('SRC',FLastAttribute)
|
|||
|
|
= 0) <b><font color="#000099">then begin</font></b></tt>
|
|||
|
|
<br><tt> <font color="#009900">//
|
|||
|
|
Found an image</font></tt>
|
|||
|
|
<br><tt> FSymbolTable.AddSymbol(stImage,FLastValue);</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end else
|
|||
|
|
if</font></b> ((CompareText(FLastTag,'A') = 0) <b><font color="#000099">or</font></b></tt>
|
|||
|
|
<br><tt>
|
|||
|
|
(CompareText(FLastTag,'AREA') = 0) <b><font color="#000099">or</font></b></tt>
|
|||
|
|
<br><tt>
|
|||
|
|
(CompareText(FLastTag,'LINK') = 0)) <b><font color="#000099">and</font></b></tt>
|
|||
|
|
<br><tt>
|
|||
|
|
(CompareText('HREF',FLastAttribute) = 0) <b><font color="#000099">then
|
|||
|
|
begin</font></b></tt>
|
|||
|
|
<br><tt> <font color="#009900">//
|
|||
|
|
Found an ordinary link</font></tt>
|
|||
|
|
<br><tt> FSymbolTable.AddSymbol(stLink,FLastValue);</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end else
|
|||
|
|
if </font></b>(CompareText(FLastTag,'FRAME') = 0) <b><font color="#000099">and</font></b></tt>
|
|||
|
|
<br><tt>
|
|||
|
|
(CompareText('SRC',FLastAttribute) = 0) <b><font color="#000099">then begin</font></b></tt>
|
|||
|
|
<br><tt> <font color="#009900">//
|
|||
|
|
Found an ordinary link</font></tt>
|
|||
|
|
<br><tt> FSymbolTable.AddSymbol(stLink,FLastValue);</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br><tt> <b><font color="#000099">end</font></b>;</tt>
|
|||
|
|
<br><tt><b><font color="#000099">end</font></b>; <font color="#009900">{AttributeList}</font></tt>
|
|||
|
|
<p>You can see that the syntactic work is done early. Our grammar states
|
|||
|
|
that an attribute list consists of a list of attributes after the tag name
|
|||
|
|
and before the <20>><3E> symbol. Each attribute has a name, and possibly a value.
|
|||
|
|
In the case that the attribute does have a value, we need to check if it
|
|||
|
|
is one of the attributes we need for our diagram. So for our selected tags,
|
|||
|
|
we check the attribute name, stored in the local variable FLastAttribute,
|
|||
|
|
and if it is an URL we store it in the symbol table. This is the time that
|
|||
|
|
we determine the type of the URL, whether it is a BASE tag, an image, or
|
|||
|
|
another HTML file.
|
|||
|
|
<p>The remainder of the article is printed in the magazine.
|
|||
|
|
</body>
|
|||
|
|
</html>
|