Author: Marc Hoffmann
How to create a simple parsing framework to parse any kind of data?
Answer:
Note:
The full sourcecodes for all components & examples descripted in the following
article are available as an open-source project under SourceForge:
Parser Frameworkhttps://sourceforge.net/projects/parserfw/
A second article was released on 29.01.2002 with a more detailed example:
Building an Easy-to-Use Parser/Parsing Framework (Part II)
Today, we wonna speak about "how to create a simple parser framework" in Delphi.
Our goal will be a class solutions which helps up to parse any kind of data and
store all relevant informations in an easy-to- access object model.
At the end of this article, we've developed a small utility, which generates a
simple object model of a .dtd file and output it's xml pendant from a given
starting node. In other words, we're using the parsing framework to create a
parser, which is able to parse a .dtd file, extract all neccessary tags, store them
in the object model and generates the xml output. Note: This utility uses a simply
dtd- parser model, which don't include all routines to parse all kinds of dtd datas
- it's up to you to include those features.
Our claims to the framework and object model are:
easy to use.
save/loadable object trees.
integrated error reporting.
expandable.
Okay, now let's start to develope the main parsing engine. Delphi comes with a unit
called CopyPrsr which includes the simple stream parser object TCopyParser. Try to
take a look into that file to understand how it works - it's located under
$(DELPHI)\Source\Internet\CopyPrsr.pas. Our framework parser is derived from that
idea, but uses a simple string instead of the stream and includes some additional
functions:
The boiler plate:
1 unit StringParser;
2
3 interface
4
5 uses
6 Classes;
7
8 const
9 { Additional Parser special tokens }
10
11 toEOL = char(6);
12 toBOF = char(7);
13
14 type
15 { TSysCharSet }
16
17 TSysCharSet = set of Char;
18
19 { TStringParser }
20
21 TStringParser = class
22 private
23 { Private declarations }
24 FParseString: string;
25 FLineTokens: Integer;
26 FSourceLine: Integer;
27 FSourcePos: Integer;
28 FTokenPos: Integer;
29 FToken: Char;
30 procedure SkipBlanks;
31 function GetParseString: string;
32 function GetSourcePos: Integer;
33 function GetTokenString: string;
34 protected
35 { Protected declarations }
36 public
37 { Public declarations }
38 constructor Create;
39 function LoadFromFile(const FileName: string): Boolean;
40 function LoadFromStream(const Stream: TStream): Boolean;
41 function SkipToEOF: string;
42 function SkipToEOL: string;
43 function SkipToken: Char;
44 function SkipTokenString: string;
45 function SkipToToken(const AToken: Char): string; overload;
46 function SkipToToken(const AToken: TSysCharSet): string; overload;
47 function SkipToTokenString(const ATokenString: string): string;
48 property ParseString: string read GetParseString;
49 property SourceLine: Integer read FSourceLine;
50 property SourcePos: Integer read GetSourcePos;
51 property Token: Char read FToken;
52 property TokenString: string read GetTokenString;
53 end;
As you can see, there are many public helper functions which you can use to parse
the data. The main functions are LoadFromFile and LoadFromStream, which needs the
name of the file to be parsed as the only parameter. Both functions loads the
content of the file and store it to the string FParseString which can be accessed
through the denominator property:
LoadFromFile/LoadFromStream:
54
55 function TStringParser.LoadFromFile(const FileName: string): Boolean;
56 var
57 Stream: TMemoryStream;
58 begin
59 Result := False;
60 if not FileExists(FileName) then
61 Exit;
62 Stream := TMemoryStream.Create;
63 try
64 Stream.LoadFromFile(FileName);
65 Result := LoadFromStream(Stream);
66 finally
67 Stream.Free;
68 end;
69 end;
70
71 function TStringParser.LoadFromStream(const Stream: TStream): Boolean;
72 var
73 MemStream: TMemoryStream;
74 begin
75 Result := False;
76 if not (assigned(Stream)) then
77 Exit;
78 MemStream := TMemoryStream.Create;
79 try
80 Stream.Seek(0, soFromBeginning);
81 MemStream.CopyFrom(Stream, Stream.Size);
82 FParseString := StrPas(MemStream.Memory);
83 SetLength(FParseString, MemStream.Size);
84 FParseString := Concat(FParseString, toEOF);
85 FToken := toBOF;
86 Result := True;
87 finally
88 MemStream.Free;
89 end;
90 end;
The main functionality of the parsing engine is the extraction of so- called
tokens. A token can be a seperator (like CR, LF or EOF) or a symbol, which can be a
keyword if you plan to parse a programing language. The following functions are
used to skip blank characters (which are used to seperate symbols and aren't
relevant) and to extract/skip the next token symbol:
Token related functions (pullout only):
91
92 procedure TStringParser.SkipBlanks;
93 begin
94 while True do
95 begin
96 FToken := FParseString[FTokenPos];
97 case FToken of
98 #10:
99 begin
100 Inc(FSourceLine);
101 FLineTokens := FTokenPos;
102 end;
103 toEOF, #33..#255:
104 Exit;
105 end;
106 Inc(FTokenPos);
107 end;
108 end;
109
110 function TStringParser.SkipToken: Char;
111 const
112 KeySet = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
113 begin
114 SkipBlanks;
115 FSourcePos := FTokenPos;
116 if FParseString[FTokenPos] = toEOF then
117 FToken := toEOF
118 else if FParseString[FTokenPos] in KeySet then
119 begin
120 while FParseString[FTokenPos] in KeySet do
121 Inc(FTokenPos);
122 FToken := toSymbol;
123 end
124 else
125 begin
126 FToken := FParseString[FTokenPos];
127 Inc(FTokenPos);
128 end;
129 Result := FToken;
130 end;
131
132 function TStringParser.SkipToToken(const AToken: TSysCharSet): string;
133 begin
134 FSourcePos := FTokenPos;
135 while not ((FParseString[FTokenPos] = toEOF) or (FParseString[FTokenPos] in
136 AToken))
137 do
138 begin
139 if FParseString[FTokenPos] = #10 then
140 begin
141 Inc(FSourceLine);
142 FLineTokens := FTokenPos;
143 end;
144 Inc(FTokenPos);
145 end;
146 if FParseString[FTokenPos] = toEOF then
147 FToken := toEOF
148 else
149 FToken := FParseString[FTokenPos];
150 Result := GetTokenString;
151 if FToken <> toEOF then
152 SkipToken;
153 end;
The absent functions includes alternativ possibilities to extract or skip the
tokens, like SkipToTokenString or SkipToEof. Well, the next step is to create the
object model, which holds all our parsed informations. As I mentioned earlier, the
goal of this article it to create a simple dtd parser, so we'll create an object
model to store dtd
informations.
A dtd file is used to descripe the syntax rules of a xml file, like:
DTD example:
key CDATA #REQUIRED
value CDATA #REQUIRED
>
This example demonstrated the simplest way of a dtd structure. It's not the purpose
of this article to develope a highly flexible dtd parser which spots all dtd
grammas, so we only include the weightly ones. Root of each object model is the
document, which holds all other objects as collections:
The Root Document:
154 { TDTDDocument }
155
156 TDTDDocument = class(TPersistent)
157 private
158 { Private declarations }
159 FEntities: TDTDEntities;
160 FElements: TDTDElements;
161 procedure SetEntities(Value: TDTDEntities);
162 procedure SetElements(Value: TDTDElements);
163 public
164 { Public declarations }
165 constructor Create;
166 destructor Destroy; override;
167 procedure Assign(Source: TPersistent); override;
168 published
169 { Published declarations }
170 property Entities: TDTDEntities read FEntities write SetEntities;
171 property Elements: TDTDElements read FElements write SetElements;
172 end;
As you can see, our document gives us the access of some other types of data:
Entities and Elements. Entities are very hard to parse, so it's a good lesson for
you to include that feature. Parsing elements is quite easier, so this type of data
is better to explain here. Look at the dtd example some rows above this. You can
see, that a dtd element is descripted as followed:
Our object model needs some extra fields to store such element informations. If you
are not familiar with dtd or xml, look at W3CSchools http://www.w3schools.com/-
it's a good starting point to learn more about that topic. So, take a look at the
following object structure:
TDTDDocument
|
o--TDTDEntities
|
o--TDTElements
|
o--TDTDElementTyp
|
o--TDTDAttributes
|
o--TDTDAttributeTyp
o--TDTDAttributeStatus
o--Default: string
o--TDTDEnums
o--TDTDChild
|
o--TDTDTyp
o--TDTDChilds
I've tried to "pack" the dtd grammars into an easy object model as you can see
above:
Each document contains a collection of elements. Each element is descripted by an
elementtype and containes in turn a collection of attributes and childs. Each
attribute again is descripted by an attributetype and contains a collection of
enum(erations) and so forth. Followed a code cantle from the element
implementation, it's not suggestive to show you the whole code here - it's quit
long and a little bit more complex:
173 TDTDElement:
174
175 unit DTD_Document;
176
177 interface
178
179 uses
180 Classes;
181
182 type
183
184 ...
185
186 { TDTDElementTyp }
187
188 TDTDElementTyp =
189 (etAny, etEmpty, etData, etContainer);
190
191 { TDTDElementStatus }
192
193 TDTDElementStatus =
194 (esRequired, esRequiredSeq, esOptional, esOptionalSeq);
195
196 { TDTDItem }
197
198 TDTDItem = class(TCollectionItem)
199 private
200 { Private declarations }
201 FName: string;
202 public
203 { Public declarations }
204 procedure Assign(Source: TPersistent); override;
205 published
206 { Published declarations }
207 property Name: string read FName write FName;
208 end;
209
210 { TDTDItems }
211
212 TDTDItems = class(TCollection)
213 private
214 { Private declarations }
215 function GetItem(Index: Integer): TDTDItem;
216 procedure SetItem(Index: Integer; Value: TDTDItem);
217 public
218 { Public declarations }
219 function Add: TDTDItem;
220 function Find(const Name: string): TDTDItem;
221 property Items[Index: Integer]: TDTDItem read GetItem write SetItem; default;
222 end;
223
224 ...
225
226 { TDTDElement }
227
228 TDTDElement = class(TDTDProperty)
229 private
230 { Private declarations }
231 FTyp: TDTDElementTyp;
232 FAttributes: TDTDAttributes;
233 FChilds: TDTDChilds;
234 procedure SetAttributes(Value: TDTDAttributes);
235 procedure SetChilds(Value: TDTDChilds);
236 public
237 { Public declarations }
238 constructor Create(Collection: TCollection); override;
239 destructor Destroy; override;
240 procedure Assign(Source: TPersistent); override;
241 published
242 { Published declarations }
243 property Typ: TDTDElementTyp read FTyp write FTyp;
244 property Attributes: TDTDAttributes read FAttributes write SetAttributes;
245 property Childs: TDTDChilds read FChilds write SetChilds;
246 end;
247
248 { TDTDElements }
249
250 TDTDElements = class(TDTDProperties)
251 private
252 { Private declarations }
253 function GetItem(Index: Integer): TDTDElement;
254 procedure SetItem(Index: Integer; Value: TDTDElement);
255 public
256 { Public declarations }
257 function Add: TDTDElement;
258 function Find(const Name: string): TDTDElement;
259 property Items[Index: Integer]: TDTDElement read GetItem write SetItem; default;
260 end;
261
262 ...
263
264 implementation
265
266 ...
267
268 { TDTDItem }
269
270 procedure TDTDItem.Assign(Source: TPersistent);
271 begin
272 if Source is TDTDItem then
273 begin
274 Name := TDTDItem(Source).Name;
275 Exit;
276 end;
277 inherited Assign(Source);
278 end;
279
280 { TDTDItems }
281
282 function TDTDItems.Add: TDTDItem;
283 begin
284 Result := TDTDItem(inherited Add);
285 end;
286
287 function TDTDItems.Find(const Name: string): TDTDItem;
288 var
289 i: Integer;
290 begin
291 Result := nil;
292 for i := 0 to Count - 1 do
293 if CompareStr(Items[i].Name, Name) = 0 then
294 begin
295 Result := Items[i];
296 Break;
297 end;
298 end;
299
300 function TDTDItems.GetItem(Index: Integer): TDTDItem;
301 begin
302 Result := TDTDItem(inherited GetItem(Index));
303 end;
304
305 procedure TDTDItems.SetItem(Index: Integer; Value: TDTDItem);
306 begin
307 inherited SetItem(Index, Value);
308 end;
309
310 ...
311
312 { TDTDElement }
313
314 constructor TDTDElement.Create(Collection: TCollection);
315 begin
316 inherited Create(Collection);
317 FAttributes := TDTDAttributes.Create(TDTDAttribute);
318 FChilds := TDTDChilds.Create(TDTDChild);
319 end;
320
321 destructor TDTDElement.Destroy;
322 begin
323 FAttributes.Free;
324 FChilds.Free;
325 inherited Destroy;
326 end;
327
328 procedure TDTDElement.Assign(Source: TPersistent);
329 begin
330 if Source is TDTDElement then
331 begin
332 Typ := TDTDElement(Source).Typ;
333 Attributes.Assign(TDTDElement(Source).Attributes);
334 Childs.Assign(TDTDElement(Source).Childs);
335 end;
336 inherited Assign(Source);
337 end;
338
339 procedure TDTDElement.SetAttributes(Value: TDTDAttributes);
340 begin
341 FAttributes.Assign(Value);
342 end;
343
344 procedure TDTDElement.SetChilds(Value: TDTDChilds);
345 begin
346 FChilds.Assign(Value);
347 end;
348
349 { TDTDElements }
350
351 function TDTDElements.Add: TDTDElement;
352 begin
353 Result := TDTDElement(inherited Add);
354 end;
355
356 function TDTDElements.Find(const Name: string): TDTDElement;
357 begin
358 Result := TDTDElement(inherited Find(Name));
359 end;
360
361 function TDTDElements.GetItem(Index: Integer): TDTDElement;
362 begin
363 Result := TDTDElement(inherited GetItem(Index));
364 end;
365
366 procedure TDTDElements.SetItem(Index: Integer; Value: TDTDElement);
367 begin
368 inherited SetItem(Index, Value);
369 end;
The advantage of this object model is, that you're able to easily add the document
to a standard component and use Delphi's internal streaming technology to load and
save the object contents of a parsed file.
The next step will be the developing of the real dtd parser. Do you remember the
TStringParser descripted at the top of this article? We'll using this class to
build up our parser. But, we don't want to create a parser from scratch each time
we're about to parse a new kind of data - it's not mind of a framework. So first,
we'll develope a small parser class from which we will inherit our dtd parser. This
parent class should include the error reporting and accessable fields to some other
informations:
The Private Parser class:
370 unit PrivateParser;
371
372 interface
373
374 uses
375 Classes, SysUtils, StringParser;
376
377 type
378 { TParserError }
379
380 TParserError = class(TCollectionItem)
381 private
382 { Private declarations }
383 FFileName: string;
384 FLine: Integer;
385 FMessage: string;
386 FPosition: Integer;
387 public
388 { Public declarations }
389 procedure Assign(Source: TPersistent); override;
390 published
391 { Published declarations }
392 property FileName: string read FFileName write FFileName;
393 property Line: Integer read FLine write FLine;
394 property message: string read FMessage write FMessage;
395 property Position: Integer read FPosition write FPosition;
396 end;
397
398 { TParserErrors }
399
400 TParserErrors = class(TCollection)
401 private
402 { Private declarations }
403 function GetItem(Index: Integer): TParserError;
404 procedure SetItem(Index: Integer; Value: TParserError);
405 public
406 { Public declarations }
407 function Add: TParserError;
408 property Items[Index: Integer]: TParserError read GetItem write SetItem;
409 default;
410 end;
411
412 { TValidationParser }
413
414 TValidationParser = class
415 private
416 { Private declarations }
417 FErrors: TParserErrors;
418 procedure SetErrors(const Value: TParserErrors);
419 public
420 { Public declarations }
421 constructor Create;
422 destructor Destroy; override;
423 procedure AddError(const AMessage: string; Parser: TStringParser; const
424 AFileName:
425 string = '');
426 procedure AddErrorFmt(const AMessage: string; Params: array of const; Parser:
427 TStringParser; const AFileName: string = '');
428 property Errors: TParserErrors read FErrors write SetErrors;
429 end;
430
431 implementation
432
433 { TParserError }
434
435 procedure TParserError.Assign(Source: TPersistent);
436 begin
437 if Source is TParserError then
438 begin
439 Line := TParserError(Source).Line;
440 message := TParserError(Source).message;
441 Position := TParserError(Source).Position;
442
443 Exit;
444 end;
445
446 inherited Assign(Source);
447 end;
448
449 { TParserErrors }
450
451 function TParserErrors.Add: TParserError;
452 begin
453 Result := TParserError(inherited Add);
454 end;
455
456 function TParserErrors.GetItem(Index: Integer): TParserError;
457 begin
458 Result := TParserError(inherited GetItem(Index));
459 end;
460
461 procedure TParserErrors.SetItem(Index: Integer; Value: TParserError);
462 begin
463 inherited SetItem(Index, Value);
464 end;
465
466 { TValidationParser }
467
468 constructor TValidationParser.Create;
469 begin
470 inherited Create;
471 FErrors := TParserErrors.Create(TParserError);
472 end;
473
474 destructor TValidationParser.Destroy;
475 begin
476 FErrors.Free;
477 inherited Destroy;
478 end;
479
480 procedure TValidationParser.SetErrors(const Value: TParserErrors);
481 begin
482 FErrors.Assign(Value);
483 end;
484
485 procedure TValidationParser.AddErrorFmt(const AMessage: string; Params: array of
486 const; Parser: TStringParser; const AFileName: string = '');
487 begin
488 with FErrors.Add do
489 begin
490 FileName := AFileName;
491 Line := Parser.SourceLine;
492 message := Format(AMessage, Params);
493 Position := Parser.SourcePos;
494 end;
495 end;
496
497 procedure TValidationParser.AddError(const AMessage: string; Parser: TStringParser;
498 const AFileName: string = '');
499 begin
500 AddErrorFmt(AMessage, [], Parser, AFileName);
501 end;
502
503 end.
Now we can start developing the real parser by inheriting it from the
TValidationParser. Again, I don't want to show you the whole sourcecode here, so I
pick up only the sapid one. Our dtd parser is a so- called two-way parser, i.e. it
uses the first pass to parse the elements and the second pass to parse the
attributes. This is useful, because an attibute can refer to an element which is
placed below it and otherwise we'll get some unneeded errors. The main method of
our parse is Parse, which needs the name of the file to be parsed as the first
parameter, and a pre-initialized TDTDDocument as the second parameter. A sample
call should looks like:
Sample Call:
504 // Create DTDDocument.
505 DTDDocument := TDTDDocument.Create;
506 try
507 // Create DTDParser.
508 DTDParser := TDTDParser.Create;
509 try
510 // Parse File.
511 DTDParser.Parse(FileName, DTDDocument);
512
513 // Display possible Errors.
514 if DTDParser.Errors.Count > 0 then
515 begin
516 for i := 0 to DTDParser.Errors.Count - 1 do
517 with DTDParser.Errors[i] do
518 WriteLn(Format('Error in Line %d, Pos %d: %s...', [Line, Position,
519 message]));
520 Exit;
521 end;
522
523 ...
524
525 // Free DTDParser.
526 finally
527 DTDParser.Free;
528 end;
529
530 // Free DTDDocument.
531 finally
532 DTDDocument.Free;
533 end;
But now, let's take a look at some sourcecode lines of the parser implementation.
The first think we had to do is to inherited our parser from the parent class:
534
535 Parser implementation (Snippet):
536
537 type
538 { EDTDParser }
539
540 EDTDParser = class(Exception);
541
542 { TDTDParser }
543
544 TDTDParser = class(TValidationParser)
545 private
546 { Private declarations }
547 procedure ParseElement(Parser: TStringParser; Document: TDTDDocument; const
548 Pass:
549 Integer);
550 procedure ParseAttlist(Parser: TStringParser; Document: TDTDDocument);
551 procedure ParseFile(const FileName: string; Document: TDTDDocument; const Pass:
552 Integer = 0);
553 public
554 { Public declarations }
555 procedure Parse(const FileName: string; var Document: TDTDDocument);
556 end;
Afterwards we implement the Parse method which calls the internal method ParseFile
on her part:
Method "Parse":
557
558 procedure TDTDParser.Parse(const FileName: string; var Document: TDTDDocument);
559 var
560 TmpDocument: TDTDDocument;
561 begin
562 if not assigned(Document) then
563 raise EDTDParser.Create('Document not assigned!');
564 TmpDocument := TDTDDocument.Create;
565 try
566 ParseFile(FileName, TmpDocument);
567 if Errors.Count = 0 then
568 Document.Assign(TmpDocument);
569 finally
570 TmpDocument.Free;
571 end;
572 end;
As you can see, we create a special temporar document to store the parsed objects
in. I've done this because I don't want to return the document if it is full of
errors - I assign a exact copy of the objects only, if no errors occured. The
method ParseFile implements the proper parsing calls to the StringParser and
creates the real objects. Followed a code snippet of the method body:
Method "ParseFile":
573
574 procedure TDTDParser.ParseFile(const FileName: string;
575 Document: TDTDDocument; const Pass: Integer = 0);
576 var
577 Parser: TStringParser;
578 begin
579 Parser := TStringParser.Create;
580 try
581 if not Parser.LoadFromFile(FileName) then
582 begin
583 AddErrorFmt('File "%s" not found', [FileName], Parser);
584 Exit;
585 end;
586 while True do
587 begin
588 while not (Parser.Token in [toEOF, '<']) do
589 Parser.SkipToken;
590 if Parser.Token = toEOF then
591 Break;
592 Parser.SkipToken;
593 if Parser.Token <> '!' then
594 begin
595 if not (Parser.Token in ['?']) and (Pass = 1) then
596 AddError('InvalidToken', Parser);
597 Continue;
598 end;
599 if Parser.SkipToken <> toSymbol then
600 begin
601 if (Parser.Token <> '-') and (Pass = 1) then
602 AddError('InvalidToken', Parser);
603 Continue;
604 end;
605 if UpperCase(Parser.TokenString) = 'ENTITY' then
606 Continue;
607 if UpperCase(Parser.TokenString) = 'ELEMENT' then
608 ParseElement(Parser, Document, Pass)
609 else if UpperCase(Parser.TokenString) = 'ATTLIST' then
610 begin
611 if Pass = 1 then
612 ParseAttlist(Parser, Document);
613 end
614 else if Pass = 1 then
615 AddErrorFmt('Invalid Symbol "%s"', [Parser.TokenString], Parser);
616 end;
617 if Pass = 0 then
618 ParseFile(FileName, Document, 1);
619 finally
620 Parser.Free;
621 end;
622 end;
This method calls some other functions (ParseElement and ParseAttlist) which parses
the internal structures of an element or an attribute. Look at the whole sourceode
to understand.
What's next??
Well, this article has shown you how easy it is to write a customizeable parser
which can parse any kind of data - it's up to you, how complex it should be. The
main benefit in using this kind of parsing is, that you don't need to incorporate
in complex systems like LexParser.
Continue reading my second article:
Building an Easy-to-Use Parser/Parsing Framework (Part II)
|