Delphi: How to Build an Easy-to-Use Parser/Parsing Framework (Part I)

Articles
Members Online:
-Article/Tip Search
-News Group
Member Area
-Top 10 NEW!!
-Forums Upgraded!!
-Indexes NEW!!
Employment
Contacts
Embarcadero
Embarcadero Community
JEDI
Links
How to Build an Easy-to-Use Parser/Parsing Framework (Part I)
07-Jul-03
Category
Algorithm
Language
Delphi 2.x
Views
192
User Rating
No Votes
# Votes
Replies
Publisher:
DSP, Administrator
Reference URL:
DKB
			Author: Marc Hoffmann

How to create a simple parsing framework to parse any kind of data?

Answer:

Note: 

The full sourcecodes for all components & examples descripted in the following 
article are available as an open-source project under SourceForge: 

  Parser Frameworkhttps://sourceforge.net/projects/parserfw/ 

A second article was released on 29.01.2002 with a more detailed example: 

Building an Easy-to-Use Parser/Parsing Framework (Part II) 

Today, we wonna speak about "how to create a simple parser framework" in Delphi. 
Our goal will be a class solutions which helps up to parse any kind of data and 
store all relevant informations in an easy-to- access object model. 

At the end of this article, we've developed a small utility, which generates a 
simple object model of a .dtd file and output it's xml pendant from a given 
starting node. In other words, we're using the parsing framework to create a 
parser, which is able to parse a .dtd file, extract all neccessary tags, store them 
in the object model and generates the xml output. Note: This utility uses a simply 
dtd- parser model, which don't include all routines to parse all kinds of dtd datas 
- it's up to you to include those features. 

Our claims to the framework and object model are: 

easy to use. 
save/loadable object trees. 
integrated error reporting. 
expandable. 

Okay, now let's start to develope the main parsing engine. Delphi comes with a unit 
called CopyPrsr which includes  the simple stream parser object TCopyParser. Try to 
take a look into that file to understand how it works - it's located under 
$(DELPHI)\Source\Internet\CopyPrsr.pas. Our framework parser is derived from that 
idea, but uses a simple string instead of the stream and includes some additional 
functions: 

The boiler plate: 

1   unit StringParser;
2   
3   interface
4   
5   uses
6     Classes;
7   
8   const
9     { Additional Parser special tokens }
10  
11    toEOL = char(6);
12    toBOF = char(7);
13  
14  type
15    { TSysCharSet }
16  
17    TSysCharSet = set of Char;
18  
19    { TStringParser }
20  
21    TStringParser = class
22    private
23      { Private declarations }
24      FParseString: string;
25      FLineTokens: Integer;
26      FSourceLine: Integer;
27      FSourcePos: Integer;
28      FTokenPos: Integer;
29      FToken: Char;
30      procedure SkipBlanks;
31      function GetParseString: string;
32      function GetSourcePos: Integer;
33      function GetTokenString: string;
34    protected
35      { Protected declarations }
36    public
37      { Public declarations }
38      constructor Create;
39      function LoadFromFile(const FileName: string): Boolean;
40      function LoadFromStream(const Stream: TStream): Boolean;
41      function SkipToEOF: string;
42      function SkipToEOL: string;
43      function SkipToken: Char;
44      function SkipTokenString: string;
45      function SkipToToken(const AToken: Char): string; overload;
46      function SkipToToken(const AToken: TSysCharSet): string; overload;
47      function SkipToTokenString(const ATokenString: string): string;
48      property ParseString: string read GetParseString;
49      property SourceLine: Integer read FSourceLine;
50      property SourcePos: Integer read GetSourcePos;
51      property Token: Char read FToken;
52      property TokenString: string read GetTokenString;
53    end;


As you can see, there are many public helper functions which you can use to parse 
the data. The main functions are LoadFromFile and LoadFromStream, which needs the 
name of the file to be parsed as the only parameter. Both functions loads the 
content of the file and store it to the string FParseString which can be accessed 
through the denominator property: 

LoadFromFile/LoadFromStream: 
54  
55  function TStringParser.LoadFromFile(const FileName: string): Boolean;
56  var
57    Stream: TMemoryStream;
58  begin
59    Result := False;
60    if not FileExists(FileName) then
61      Exit;
62    Stream := TMemoryStream.Create;
63    try
64      Stream.LoadFromFile(FileName);
65      Result := LoadFromStream(Stream);
66    finally
67      Stream.Free;
68    end;
69  end;
70  
71  function TStringParser.LoadFromStream(const Stream: TStream): Boolean;
72  var
73    MemStream: TMemoryStream;
74  begin
75    Result := False;
76    if not (assigned(Stream)) then
77      Exit;
78    MemStream := TMemoryStream.Create;
79    try
80      Stream.Seek(0, soFromBeginning);
81      MemStream.CopyFrom(Stream, Stream.Size);
82      FParseString := StrPas(MemStream.Memory);
83      SetLength(FParseString, MemStream.Size);
84      FParseString := Concat(FParseString, toEOF);
85      FToken := toBOF;
86      Result := True;
87    finally
88      MemStream.Free;
89    end;
90  end;


The main functionality of the parsing engine is the extraction of so- called 
tokens. A token can be a seperator (like CR, LF or EOF) or a symbol, which can be a 
keyword if you plan to parse a programing language. The following functions are 
used to skip blank characters (which are used to seperate symbols and aren't 
relevant) and to extract/skip the next token symbol: 

Token related functions (pullout only): 
91  
92  procedure TStringParser.SkipBlanks;
93  begin
94    while True do
95    begin
96      FToken := FParseString[FTokenPos];
97      case FToken of
98        #10:
99          begin
100           Inc(FSourceLine);
101           FLineTokens := FTokenPos;
102         end;
103       toEOF, #33..#255:
104         Exit;
105     end;
106     Inc(FTokenPos);
107   end;
108 end;
109 
110 function TStringParser.SkipToken: Char;
111 const
112   KeySet = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
113 begin
114   SkipBlanks;
115   FSourcePos := FTokenPos;
116   if FParseString[FTokenPos] = toEOF then
117     FToken := toEOF
118   else if FParseString[FTokenPos] in KeySet then
119   begin
120     while FParseString[FTokenPos] in KeySet do
121       Inc(FTokenPos);
122     FToken := toSymbol;
123   end
124   else
125   begin
126     FToken := FParseString[FTokenPos];
127     Inc(FTokenPos);
128   end;
129   Result := FToken;
130 end;
131 
132 function TStringParser.SkipToToken(const AToken: TSysCharSet): string;
133 begin
134   FSourcePos := FTokenPos;
135   while not ((FParseString[FTokenPos] = toEOF) or (FParseString[FTokenPos] in 
136 AToken))
137     do
138   begin
139     if FParseString[FTokenPos] = #10 then
140     begin
141       Inc(FSourceLine);
142       FLineTokens := FTokenPos;
143     end;
144     Inc(FTokenPos);
145   end;
146   if FParseString[FTokenPos] = toEOF then
147     FToken := toEOF
148   else
149     FToken := FParseString[FTokenPos];
150   Result := GetTokenString;
151   if FToken <> toEOF then
152     SkipToken;
153 end;


The absent functions includes alternativ possibilities to extract or skip the 
tokens, like SkipToTokenString or SkipToEof. Well, the next step is to create the 
object model, which holds all our parsed informations. As I mentioned earlier, the 
goal of this article it to create a simple dtd parser, so we'll create an object 
model to store dtd 
informations. 

A dtd file is used to descripe the syntax rules of a xml file, like: 

DTD example: 

key CDATA #REQUIRED 
value CDATA #REQUIRED 
> 

This example demonstrated the simplest way of a dtd structure. It's not the purpose 
of this article to develope a highly flexible dtd parser which spots all dtd 
grammas, so we only include the weightly ones. Root of each object model is the 
document, which holds all other objects as collections: 

The Root Document: 

154 { TDTDDocument }
155 
156 TDTDDocument = class(TPersistent)
157 private
158   { Private declarations }
159   FEntities: TDTDEntities;
160   FElements: TDTDElements;
161   procedure SetEntities(Value: TDTDEntities);
162   procedure SetElements(Value: TDTDElements);
163 public
164   { Public declarations }
165   constructor Create;
166   destructor Destroy; override;
167   procedure Assign(Source: TPersistent); override;
168 published
169   { Published declarations }
170   property Entities: TDTDEntities read FEntities write SetEntities;
171   property Elements: TDTDElements read FElements write SetElements;
172 end;


As you can see, our document gives us the access of some other types of data: 
Entities and Elements. Entities are very hard to parse, so it's a good lesson for 
you to include that feature. Parsing elements is quite easier, so this type of data 
is better to explain here. Look at the dtd example some rows above this. You can 
see, that a dtd element is descripted as followed: 

Our object model needs some extra fields to store such element informations. If you 
are not familiar with dtd or xml, look at W3CSchools http://www.w3schools.com/- 
it's a good starting point to learn more about that topic. So, take a look at the 
following  object structure: 

TDTDDocument 
  | 
  o--TDTDEntities 
  | 
  o--TDTElements 
    | 
    o--TDTDElementTyp 
    | 
    o--TDTDAttributes 
      | 
      o--TDTDAttributeTyp 
      o--TDTDAttributeStatus 
      o--Default: string 
      o--TDTDEnums 
    o--TDTDChild 
      | 
      o--TDTDTyp 
      o--TDTDChilds 

I've tried to "pack" the dtd grammars into an easy object model as you can see 
above: 

Each document contains a collection of elements. Each element is descripted by an 
elementtype and containes in turn a collection of attributes and childs. Each 
attribute again is descripted by an attributetype and contains a collection of 
enum(erations) and so forth. Followed a code cantle from the element 
implementation, it's not suggestive to show you the whole code here - it's quit 
long and a little bit more complex: 

173 TDTDElement:
174 
175 unit DTD_Document;
176 
177 interface
178 
179 uses
180   Classes;
181 
182 type
183 
184   ...
185 
186   { TDTDElementTyp }
187 
188   TDTDElementTyp =
189     (etAny, etEmpty, etData, etContainer);
190 
191   { TDTDElementStatus }
192 
193   TDTDElementStatus =
194     (esRequired, esRequiredSeq, esOptional, esOptionalSeq);
195 
196   { TDTDItem }
197 
198   TDTDItem = class(TCollectionItem)
199   private
200     { Private declarations }
201     FName: string;
202   public
203     { Public declarations }
204     procedure Assign(Source: TPersistent); override;
205   published
206     { Published declarations }
207     property Name: string read FName write FName;
208   end;
209 
210   { TDTDItems }
211 
212   TDTDItems = class(TCollection)
213   private
214     { Private declarations }
215     function GetItem(Index: Integer): TDTDItem;
216     procedure SetItem(Index: Integer; Value: TDTDItem);
217   public
218     { Public declarations }
219     function Add: TDTDItem;
220     function Find(const Name: string): TDTDItem;
221     property Items[Index: Integer]: TDTDItem read GetItem write SetItem; default;
222   end;
223 
224   ...
225 
226   { TDTDElement }
227 
228   TDTDElement = class(TDTDProperty)
229   private
230     { Private declarations }
231     FTyp: TDTDElementTyp;
232     FAttributes: TDTDAttributes;
233     FChilds: TDTDChilds;
234     procedure SetAttributes(Value: TDTDAttributes);
235     procedure SetChilds(Value: TDTDChilds);
236   public
237     { Public declarations }
238     constructor Create(Collection: TCollection); override;
239     destructor Destroy; override;
240     procedure Assign(Source: TPersistent); override;
241   published
242     { Published declarations }
243     property Typ: TDTDElementTyp read FTyp write FTyp;
244     property Attributes: TDTDAttributes read FAttributes write SetAttributes;
245     property Childs: TDTDChilds read FChilds write SetChilds;
246   end;
247 
248   { TDTDElements }
249 
250   TDTDElements = class(TDTDProperties)
251   private
252     { Private declarations }
253     function GetItem(Index: Integer): TDTDElement;
254     procedure SetItem(Index: Integer; Value: TDTDElement);
255   public
256     { Public declarations }
257     function Add: TDTDElement;
258     function Find(const Name: string): TDTDElement;
259     property Items[Index: Integer]: TDTDElement read GetItem write SetItem; default;
260   end;
261 
262   ...
263 
264 implementation
265 
266 ...
267 
268 { TDTDItem }
269 
270 procedure TDTDItem.Assign(Source: TPersistent);
271 begin
272   if Source is TDTDItem then
273   begin
274     Name := TDTDItem(Source).Name;
275     Exit;
276   end;
277   inherited Assign(Source);
278 end;
279 
280 { TDTDItems }
281 
282 function TDTDItems.Add: TDTDItem;
283 begin
284   Result := TDTDItem(inherited Add);
285 end;
286 
287 function TDTDItems.Find(const Name: string): TDTDItem;
288 var
289   i: Integer;
290 begin
291   Result := nil;
292   for i := 0 to Count - 1 do
293     if CompareStr(Items[i].Name, Name) = 0 then
294     begin
295       Result := Items[i];
296       Break;
297     end;
298 end;
299 
300 function TDTDItems.GetItem(Index: Integer): TDTDItem;
301 begin
302   Result := TDTDItem(inherited GetItem(Index));
303 end;
304 
305 procedure TDTDItems.SetItem(Index: Integer; Value: TDTDItem);
306 begin
307   inherited SetItem(Index, Value);
308 end;
309 
310 ...
311 
312 { TDTDElement }
313 
314 constructor TDTDElement.Create(Collection: TCollection);
315 begin
316   inherited Create(Collection);
317   FAttributes := TDTDAttributes.Create(TDTDAttribute);
318   FChilds := TDTDChilds.Create(TDTDChild);
319 end;
320 
321 destructor TDTDElement.Destroy;
322 begin
323   FAttributes.Free;
324   FChilds.Free;
325   inherited Destroy;
326 end;
327 
328 procedure TDTDElement.Assign(Source: TPersistent);
329 begin
330   if Source is TDTDElement then
331   begin
332     Typ := TDTDElement(Source).Typ;
333     Attributes.Assign(TDTDElement(Source).Attributes);
334     Childs.Assign(TDTDElement(Source).Childs);
335   end;
336   inherited Assign(Source);
337 end;
338 
339 procedure TDTDElement.SetAttributes(Value: TDTDAttributes);
340 begin
341   FAttributes.Assign(Value);
342 end;
343 
344 procedure TDTDElement.SetChilds(Value: TDTDChilds);
345 begin
346   FChilds.Assign(Value);
347 end;
348 
349 { TDTDElements }
350 
351 function TDTDElements.Add: TDTDElement;
352 begin
353   Result := TDTDElement(inherited Add);
354 end;
355 
356 function TDTDElements.Find(const Name: string): TDTDElement;
357 begin
358   Result := TDTDElement(inherited Find(Name));
359 end;
360 
361 function TDTDElements.GetItem(Index: Integer): TDTDElement;
362 begin
363   Result := TDTDElement(inherited GetItem(Index));
364 end;
365 
366 procedure TDTDElements.SetItem(Index: Integer; Value: TDTDElement);
367 begin
368   inherited SetItem(Index, Value);
369 end;


The advantage of this object model is, that you're able to easily add the document 
to a standard component and use Delphi's internal streaming technology to load and 
save the object contents of a parsed file. 

The next step will be the developing of the real dtd parser. Do you remember the 
TStringParser descripted at the top of this article? We'll using this class to 
build up our parser. But, we don't want to create a parser from scratch each time 
we're about to parse a new kind of data - it's not mind of a framework. So first, 
we'll develope a small parser class from which we will inherit our dtd parser. This 
parent class should include the error reporting and accessable fields to some other 
informations: 

The Private Parser class: 

370 unit PrivateParser;
371 
372 interface
373 
374 uses
375   Classes, SysUtils, StringParser;
376 
377 type
378   { TParserError }
379 
380   TParserError = class(TCollectionItem)
381   private
382     { Private declarations }
383     FFileName: string;
384     FLine: Integer;
385     FMessage: string;
386     FPosition: Integer;
387   public
388     { Public declarations }
389     procedure Assign(Source: TPersistent); override;
390   published
391     { Published declarations }
392     property FileName: string read FFileName write FFileName;
393     property Line: Integer read FLine write FLine;
394     property message: string read FMessage write FMessage;
395     property Position: Integer read FPosition write FPosition;
396   end;
397 
398   { TParserErrors }
399 
400   TParserErrors = class(TCollection)
401   private
402     { Private declarations }
403     function GetItem(Index: Integer): TParserError;
404     procedure SetItem(Index: Integer; Value: TParserError);
405   public
406     { Public declarations }
407     function Add: TParserError;
408     property Items[Index: Integer]: TParserError read GetItem write SetItem; 
409 default;
410   end;
411 
412   { TValidationParser }
413 
414   TValidationParser = class
415   private
416     { Private declarations }
417     FErrors: TParserErrors;
418     procedure SetErrors(const Value: TParserErrors);
419   public
420     { Public declarations }
421     constructor Create;
422     destructor Destroy; override;
423     procedure AddError(const AMessage: string; Parser: TStringParser; const 
424 AFileName:
425       string = '');
426     procedure AddErrorFmt(const AMessage: string; Params: array of const; Parser:
427       TStringParser; const AFileName: string = '');
428     property Errors: TParserErrors read FErrors write SetErrors;
429   end;
430 
431 implementation
432 
433 { TParserError }
434 
435 procedure TParserError.Assign(Source: TPersistent);
436 begin
437   if Source is TParserError then
438   begin
439     Line := TParserError(Source).Line;
440     message := TParserError(Source).message;
441     Position := TParserError(Source).Position;
442 
443     Exit;
444   end;
445 
446   inherited Assign(Source);
447 end;
448 
449 { TParserErrors }
450 
451 function TParserErrors.Add: TParserError;
452 begin
453   Result := TParserError(inherited Add);
454 end;
455 
456 function TParserErrors.GetItem(Index: Integer): TParserError;
457 begin
458   Result := TParserError(inherited GetItem(Index));
459 end;
460 
461 procedure TParserErrors.SetItem(Index: Integer; Value: TParserError);
462 begin
463   inherited SetItem(Index, Value);
464 end;
465 
466 { TValidationParser }
467 
468 constructor TValidationParser.Create;
469 begin
470   inherited Create;
471   FErrors := TParserErrors.Create(TParserError);
472 end;
473 
474 destructor TValidationParser.Destroy;
475 begin
476   FErrors.Free;
477   inherited Destroy;
478 end;
479 
480 procedure TValidationParser.SetErrors(const Value: TParserErrors);
481 begin
482   FErrors.Assign(Value);
483 end;
484 
485 procedure TValidationParser.AddErrorFmt(const AMessage: string; Params: array of
486   const; Parser: TStringParser; const AFileName: string = '');
487 begin
488   with FErrors.Add do
489   begin
490     FileName := AFileName;
491     Line := Parser.SourceLine;
492     message := Format(AMessage, Params);
493     Position := Parser.SourcePos;
494   end;
495 end;
496 
497 procedure TValidationParser.AddError(const AMessage: string; Parser: TStringParser;
498   const AFileName: string = '');
499 begin
500   AddErrorFmt(AMessage, [], Parser, AFileName);
501 end;
502 
503 end.


Now we can start developing the real parser by inheriting it from the 
TValidationParser. Again, I don't want to show you the whole sourcecode here, so I 
pick up only the sapid one. Our dtd parser is a so- called two-way parser, i.e. it 
uses the first pass to parse the elements and the second pass to parse the 
attributes. This is useful, because an attibute can refer to an element which is 
placed below it and otherwise we'll get some unneeded errors. The main method of 
our parse is Parse, which  needs the name of the file to be parsed as the first 
parameter, and a pre-initialized TDTDDocument as the second parameter. A sample 
call should looks like: 

Sample Call: 

504 // Create DTDDocument.
505 DTDDocument := TDTDDocument.Create;
506 try
507   // Create DTDParser.
508   DTDParser := TDTDParser.Create;
509   try
510     // Parse File.
511     DTDParser.Parse(FileName, DTDDocument);
512 
513     // Display possible Errors.
514     if DTDParser.Errors.Count > 0 then
515     begin
516       for i := 0 to DTDParser.Errors.Count - 1 do
517         with DTDParser.Errors[i] do
518           WriteLn(Format('Error in Line %d, Pos %d: %s...', [Line, Position,
519             message]));
520       Exit;
521     end;
522 
523     ...
524 
525     // Free DTDParser.
526   finally
527     DTDParser.Free;
528   end;
529 
530   // Free DTDDocument.
531 finally
532   DTDDocument.Free;
533 end;


But now, let's take a look at some sourcecode lines of the parser implementation. 
The first think we had to do is to inherited our parser from the parent class: 
534 
535 Parser implementation (Snippet): 
536 
537 type
538   { EDTDParser }
539 
540   EDTDParser = class(Exception);
541 
542   { TDTDParser }
543 
544   TDTDParser = class(TValidationParser)
545   private
546     { Private declarations }
547     procedure ParseElement(Parser: TStringParser; Document: TDTDDocument; const 
548 Pass:
549       Integer);
550     procedure ParseAttlist(Parser: TStringParser; Document: TDTDDocument);
551     procedure ParseFile(const FileName: string; Document: TDTDDocument; const Pass:
552       Integer = 0);
553   public
554     { Public declarations }
555     procedure Parse(const FileName: string; var Document: TDTDDocument);
556   end;


Afterwards we implement the Parse method which calls the internal method ParseFile 
on her part: 

Method "Parse": 
557 
558 procedure TDTDParser.Parse(const FileName: string; var Document: TDTDDocument);
559 var
560   TmpDocument: TDTDDocument;
561 begin
562   if not assigned(Document) then
563     raise EDTDParser.Create('Document not assigned!');
564   TmpDocument := TDTDDocument.Create;
565   try
566     ParseFile(FileName, TmpDocument);
567     if Errors.Count = 0 then
568       Document.Assign(TmpDocument);
569   finally
570     TmpDocument.Free;
571   end;
572 end;


As you can see, we create a special temporar document to store the parsed objects 
in. I've done this because I don't want to return the document if it is full of 
errors - I assign a exact copy of the objects only, if no errors occured. The 
method ParseFile implements the proper parsing calls to the StringParser and 
creates the real objects. Followed a code snippet of the method body: 

Method "ParseFile": 
573 
574 procedure TDTDParser.ParseFile(const FileName: string;
575   Document: TDTDDocument; const Pass: Integer = 0);
576 var
577   Parser: TStringParser;
578 begin
579   Parser := TStringParser.Create;
580   try
581     if not Parser.LoadFromFile(FileName) then
582     begin
583       AddErrorFmt('File "%s" not found', [FileName], Parser);
584       Exit;
585     end;
586     while True do
587     begin
588       while not (Parser.Token in [toEOF, '<']) do
589         Parser.SkipToken;
590       if Parser.Token = toEOF then
591         Break;
592       Parser.SkipToken;
593       if Parser.Token <> '!' then
594       begin
595         if not (Parser.Token in ['?']) and (Pass = 1) then
596           AddError('InvalidToken', Parser);
597         Continue;
598       end;
599       if Parser.SkipToken <> toSymbol then
600       begin
601         if (Parser.Token <> '-') and (Pass = 1) then
602           AddError('InvalidToken', Parser);
603         Continue;
604       end;
605       if UpperCase(Parser.TokenString) = 'ENTITY' then
606         Continue;
607       if UpperCase(Parser.TokenString) = 'ELEMENT' then
608         ParseElement(Parser, Document, Pass)
609       else if UpperCase(Parser.TokenString) = 'ATTLIST' then
610       begin
611         if Pass = 1 then
612           ParseAttlist(Parser, Document);
613       end
614       else if Pass = 1 then
615         AddErrorFmt('Invalid Symbol "%s"', [Parser.TokenString], Parser);
616     end;
617     if Pass = 0 then
618       ParseFile(FileName, Document, 1);
619   finally
620     Parser.Free;
621   end;
622 end;


This method calls some other functions (ParseElement and ParseAttlist) which parses 
the internal structures of an element or an attribute. Look at the whole sourceode 
to understand. 

What's next?? 

Well, this article has shown you how easy it is to write a customizeable parser 
which can parse any kind of data - it's up to you, how complex it should be. The 
main benefit in using this kind of parsing is, that you don't need to incorporate 
in complex systems like LexParser. 

Continue reading my second article: 

Building an Easy-to-Use Parser/Parsing Framework (Part II)
Vote: How useful do you find this Article/Tip?
Bad											Excellent
	1	2	3	4	5	6	7	8	9	10
Advertisement
Share this page
Advertisement