Author: Jonas Bilinkevicius
I work a lot with fixed-length ASCII files, and I need to know how many total lines
there are in a file. Sure, I can open up the file in a text editor, but really
large files take forever to load. Is there a better way?
Answer:
As Mr. Miyagi said to Daniel-san in Karate Kid, "Funny you should ask..." Yes,
there is a better way. What I'm going to show you may not be the best way, but it's
reasonably fast, and exceptionally easy to use. It starts out with this premise. If
you know the total number of bytes in the file and know the length of each record,
then if you divide the total bytes by the record length, you should get the number
of records in the file. Sounds reasonable, right? And it's exactly the way we do it.
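To put numbers on that premise, here is a minimal sketch of the arithmetic. CountFixedRecords is a hypothetical helper name, not part of the VCL or of the functions in this tip, and it assumes every line, including the last one, ends in a CR/LF pair.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // lets the sketch also compile under Free Pascal
uses
  SysUtils;

// Hypothetical helper: record count from the file size and the number of
// data bytes per line. Assumes every line, including the last, ends in #13#10.
function CountFixedRecords(FileBytes: Int64; DataLen: Integer): Int64;
var
  RecLen: Int64;
begin
  RecLen := DataLen + 2; // data bytes plus the CR/LF pair
  // A remainder means the file is not made of whole records of this length.
  if (FileBytes mod RecLen) <> 0 then
    raise Exception.Create('File size is not a multiple of the record length');
  Result := FileBytes div RecLen;
end;
```

For example, with 80 data bytes per line each record takes 82 bytes on disk, so an 8,200,000-byte file holds 100,000 records.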
For this example, I used a TFileStream object to open up my text file. I like using
this particular object because it has some convenient methods and properties that I
can use to get the information that I need; in particular, the Size property and
the Read and Seek methods. How do I use them? Let's go through some plain English
to give you an idea:
1. Open up a file stream on a text file.
2. Get its total byte size.
3. Now, serially move through the file, reading each byte into a single-character
   buffer until you reach a return character (#13).
4. As you pass each byte, increment a counter variable that will later serve as the
   length of the record.
5. When you get to the return character, break out of the loop and add 2 to the
   counter (to account for the #13#10 CR/LF pair).
6. Finally, return the result as the file size divided by the record length.
Here's the code that accomplishes the English above:
{======================================================================
 This function will give you the exact record count of a file. It uses
 a TFileStream and goes through it byte by byte until it encounters
 a #13. When it does, it adds 2 to recLen to account for the #13#10
 CR/LF pair, then divides the byte size of the file by the true
 record length.

 Note that this will only work on fixed-length text files.
 ======================================================================}

function GetTextFileRecords(const FileName: string): Integer;
var
  ts: TFileStream;
  fSize,
  recLen: Integer;
  buf: AnsiChar; //exactly one byte, whatever the Delphi version
begin
  Result := 0; //returned when the file contains no #13 at all
  buf := #0;
  recLen := 0;
  //Open up a File Stream
  ts := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    //Get the File Size
    fSize := ts.Size;
    //Move through the file a byte at a time. Read advances the stream
    //position by itself, and the loop also stops at the end of the file.
    while (ts.Read(buf, 1) = 1) and (buf <> #13) do
      Inc(recLen); //count only the bytes in front of the #13
  finally
    ts.Free;
  end;
  if buf = #13 then
  begin
    recLen := recLen + 2; //Need to account for CR/LF pair.
    Result := Round(fSize / recLen);
  end;
end;
As I mentioned above, this may not be the "best" way to do this, but it is a way to
approach this problem. A faster way to do this would have been to open up the file
as a regular file, then read a bunch of bytes into a large buffer, let's say an
Array of Char 4K in size. Perusing through an array is much faster than moving
through a file, but the disadvantage there is that you run the risk of having the
buffer too small. I've seen some fixed-length ASCII files with line sizes up to 8K.
In any case, the method I presented above may not be the most efficient, but it's
safe, and it works. Besides, what's a few milliseconds worth to you? Have at it!
Wait a minute! 10:00 PM
Okay, I couldn't resist. I realized that I could've done better than my example
above. Here's the method I described immediately above:
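function GetTextFileRecords(const FileName: string): Integer;
const
  BlockSize = 8192;
var
  F: file;
  fSize,
  amtXfer,
  crPos: Integer;
  buf: array[0..BlockSize] of AnsiChar; //one extra element for the closing #0
begin
  Result := 0; //returned when the block contains no #13
  AssignFile(F, FileName); //Open up the text file as an untyped file
  Reset(F, 1); //record size of 1, so all counts are in bytes
  try
    fSize := FileSize(F); //Get the file size
    BlockRead(F, buf, BlockSize, amtXfer); //read in up to an 8K block
  finally
    CloseFile(F); //close the file, you're done
  end;
  buf[amtXfer] := #0; //terminate the data so StrPas stops where the read did
  crPos := Pos(#13, StrPas(buf)); //1-based position of the first #13
  if crPos > 0 then
    Result := Round(fSize / (crPos + 1)); //+1 for the #10 of the CR/LF pair
end;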
There are several things different about this function as opposed to the function
above. First of all, it involves a lot less code. This is due to not having to
construct and free an object; I open up an untyped file, get its size, read a big
block, then immediately close it. Notice too that I don't use a loop to find a #13.
Instead, I use the StrPas function to convert the array of char into a string
that's passed to the Pos function, which gives me the position of the return
character; that is, the length of the data plus the #13 itself. Adding one to this
value accounts for the #10 portion of the CR/LF pair.
Because it makes a single block read instead of a Read call for every byte (and
doesn't have to construct an object), this method is a lot faster than the method
above, and amazingly it's not very complicated. Where this type of operation can
get tricky is with the BlockRead function. In order to use BlockRead successfully,
you need to specify a record size when you open the file with Reset. That can be a
bit confusing, so just remember this: for byte-by-byte reads through a file, always
use a record size of 1. Also, notice that I included a variable called amtXfer.
BlockRead fills this with the actual number of bytes read. If you don't supply it,
BlockRead raises an I/O exception whenever it can't read the full block, which will
happen with any file smaller than 8K. You could wrap the call in an exception
handling block, but why bother? Just supply the variable, and you don't have to
worry about the exception.
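A single block read only finds the #13 when the first line fits inside the block. As a hedged sketch of one way around that limit, the hypothetical function below (GetTextFileRecordsAnyLength is a name made up for illustration) keeps calling BlockRead until a #13 shows up, and uses amtXfer to know how much of each block is valid. It scans the bytes with a plain loop instead of StrPas and Pos, and it assumes the same fixed-length, CR/LF-terminated layout as the rest of this tip.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // lets the sketch also compile under Free Pascal

// Hypothetical variant: reads 8K blocks until the first #13 appears, so a
// line longer than one block is still measured. Returns 0 if no #13 is found.
// Note: Reset honours the global FileMode variable, so a read-only file
// needs FileMode set to 0 (read only) before this is called.
function GetTextFileRecordsAnyLength(const FileName: string): Int64;
const
  BlockSize = 8192;
var
  F: file;
  buf: array[0..BlockSize - 1] of Byte;
  amtXfer, i: Integer;
  fSize, dataLen: Int64; // dataLen = bytes seen in front of the first #13
  found: Boolean;
begin
  Result := 0;
  dataLen := 0;
  found := False;
  AssignFile(F, FileName);
  Reset(F, 1); // record size 1: all counts are in bytes
  try
    fSize := FileSize(F);
    repeat
      BlockRead(F, buf, BlockSize, amtXfer);
      // Scan only the bytes that were actually read.
      i := 0;
      while (i < amtXfer) and (buf[i] <> 13) do
        Inc(i);
      found := i < amtXfer;
      Inc(dataLen, i);
    until found or (amtXfer < BlockSize); // a short read means end of file
  finally
    CloseFile(F);
  end;
  if found then
    Result := Round(fSize / (dataLen + 2)); // +2 for the #13#10 pair
end;
```

The trade-off is a little more code in exchange for not having to guess the longest line up front; for files whose lines fit in one block it still does exactly one read.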
Okay, now it's time to close this out... Is this the best way to get the record
count of a fixed-length text file? Admittedly, it's one of the faster ways, short
of using Assembler. But I'm wondering what a purely WinAPI call set would look
like.... If you have any ideas, please make sure to let me know!
Here I Go Again! 11:05 PM
I guess my curiosity got the best of me tonight, because I just wasn't satisfied
doing just the BlockRead method. I knew there had to be another way to do it with
WinAPI calls. So I did just that. Look at the code below:
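function GetTextFileRecordsWinAPI(const FileName: string): Integer;
const
  BlockSize = 8192;
var
  F: THandle;
  amtXfer,
  fSize: DWORD;
  crPos: Integer;
  buf: array[0..BlockSize] of AnsiChar; //one extra element for the closing #0
begin
  Result := 0; //returned if the file can't be opened or has no #13
  //Open up file. FILE_FLAG_NO_BUFFERING is left out on purpose: it requires
  //sector-aligned buffers and read sizes, which a stack array can't promise.
  F := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if F = INVALID_HANDLE_VALUE then
    Exit;
  try
    fSize := GetFileSize(F, nil); //Get the file's size (files under 4 GB)
    amtXfer := 0;
    ReadFile(F, buf, BlockSize, amtXfer, nil); //Read a block from the file
  finally
    CloseHandle(F);
  end;
  buf[amtXfer] := #0; //terminate the data so StrPas stops where the read did
  crPos := Pos(#13, StrPas(buf)); //1-based position of the first #13
  if crPos > 0 then
    Result := Round(fSize / (crPos + 1)); //+1 for the #10 of the CR/LF pair
end;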
This method is almost exactly the same as the one immediately above, but instead
uses WinAPI calls to accomplish the same task.
Now which method is better? I DON'T KNOW! Actually, for simplicity's sake, I prefer
the elegance of the second method - there's just a lot less coding involved. With
the WinAPI method, while it may require one less call to open the file, the CreateFile
function is not the easiest thing to work with - I spent a bit of time Alt-Tabbing
between the code editor and Windows help to get the syntax and constants right.
Granted, it's easier now that I've done it, but it's not a method that I prefer.
So I'll leave it up to you to decide which method you like better.