Parser
v 8.9
By: Craig Williams
Index
1.0 - Overview
The Parser is designed to be/have:
Portable
No Conditional Compilation
Simple To Use
No Obscure Data Types
No Obscure Macros
Documented
No Tabs
Block Formatting/Alignment
Dynamic, Fast, and Safe
A Complete Package in one File
Truly Free - Public Domain
How is this Parser different?
2.0 - Features
2.01 - File Handling
2.02 - Tokens & Token Sets
Token Parameters/Definable Logic
Return
switchto
ignore
Sample Program: New Line Counter
fun
params
Token Sets
Sample Program: Print comments and Strings
2.03 - Text Mode Functions
2.04 - Binary Mode Functions
2.05 - Function Callbacks
2.06 - String Manipulation Functions
2.07 - Options/Configuration
2.08 - Search Methods
Linear
Globbing
RegEx
2.09 - State Management
Manual
Stack
2.10 - C++ Wrapper
3.0 - Using the Parser
3.1 - Compiling the Parser
3.2 - Initialization
3.3 - Setting up the Token Set
3.4 - Parsing the file
3.5 - Deinitialization
4.0 - Functions
4.1 - Public Parser Functions
4.1.01 - AddTokenSep()
4.1.02 - AddTokenSeparator()
4.1.03 - AddTokenSet()
4.1.04 - End()
4.1.05 - ErrorCode()
4.1.06 - GetFilePosition()
4.1.07 - GetFileSize()
4.1.08 - GetParams()
4.1.09 - GetParserState()
4.1.10 - GetTokenSet()
4.1.11 - GrabBinaryFloat()
4.1.12 - GrabBinaryInt()
4.1.13 - GrabByte()
4.1.14 - GrabBytes()
4.1.15 - GrabFloat()
4.1.16 - GrabInt()
4.1.17 - GrabToken()
4.1.18 - LoadFile()
4.1.19 - LoadMemory()
4.1.20 - LoadMemoryLen()
4.1.21 - ParserDeInit()
4.1.22 - ParserDisable()
4.1.23 - ParserEnable()
4.1.24 - ParserInit()
4.1.25 - ParserIsEnabled()
4.1.26 - ParserMemoryUsage()
4.1.27 - PeekByte()
4.1.28 - PeekToken()
4.1.29 - PopParserState()
4.1.30 - PrintErrorCode()
4.1.31 - PushParserState()
4.1.32 - Seek()
4.1.33 - SetFilePosition()
4.1.34 - SetParserState()
4.1.35 - SetTokenSet()
4.1.36 - GenericDiscard()
4.2 - String Manipulation Functions
4.2.01 - RemoveWhiteSpaces()
4.2.02 - ToUpper()
4.2.03 - ToLower()
4.2.04 - Dup()
4.2.05 - DupLen()
4.2.06 - DupRange()
4.2.07 - DupRangeFile()
4.2.08 - Cmp()
4.3 - Private Parser Functions
4.3.01 - BuildRange()
4.3.02 - Compile()
4.3.03 - DisableRegEx()
4.3.04 - DiscardWildcard()
4.3.05 - EnableRegEx()
4.3.06 - EnableWildcard()
4.3.07 - ForwardSearch()
4.3.08 - ForwardSearchReg()
4.3.09 - GrabLeftover()
4.3.10 - GrabNextChunk()
4.3.11 - GTChar*()
4.3.12 - HandleEscapes()
4.3.13 - InvalidRegEx()
4.3.14 - PDupRangeFile()
4.3.15 - PreserveBufferState()
4.3.16 - PreserveTSetHistory()
4.3.17 - PrintCompiled()
4.3.18 - ProcessToken()
4.3.19 - ProcessTokenWild()
4.3.20 - ReadBinary()
4.3.21 - RebuildHash()
4.3.22 - RestoreBufferState()
4.3.23 - RestoreTSetHistory()
4.3.24 - ShiftRight()
4.3.25 - SortTokenSet()
4.3.26 - UpdateThreads()
4.3.27 - WriteThreadBefore()
5.0 - Define List
6.0 - Known Bugs
7.0 - Planned Features
8.0 - Change Log
The Parser was originally designed to meet my own needs. Since its
original creation, the parser has been expanding into something simply
beautiful. A lot of time and thought has been put into the library to
ensure that it behaves as it should.
The library is written entirely by me, and is released to the Public Domain.
One quick note. This entire document was hand written in Crimson Editor.
That, combined with my horrible spelling & grammar, will lead to
various errors in the document.
The parser was designed with several specific concepts/features in mind
Portable:
Portability is one of the main concepts that I spent a lot of time
on. If the platform you are compiling on supports ANSI C with
stdio.h, stdlib.h, string.h, and limits.h support, the library
should compile without any issues.
This library (v 8.9) has even been tested and found to work
correctly on the Nintendo Wii with minimum modifications
(File I/O is different, due to the DVD drive).
In order to ensure portability, the Parser was tested on a
Windows and a Linux machine. Unfortunately, I don't have access
to a Mac development box, so I'm forced to make do. Since the
parser does not require any GUI, I was able to rely on strict
ANSI C (C89). Multiple compilers are used and tested as follows
Windows:
    MS Compiler       cl  *.c /W4
    GNU               gcc *.c -Wall -Wextra -ansi -pedantic
                      gcc *.c -Wall -Wextra -ansi -pedantic -mno-cygwin
                      g++ *.c -Wall -Wextra -ansi -pedantic
    Borland Compiler  bcc -w *.c
    Digital Mars      dmc file.c -A
    Intel Compiler    icl *.c /W3
Linux:
    GNU               gcc *.c -Wall -Wextra -ansi -pedantic
                      g++ *.c -Wall -Wextra -ansi -pedantic
For each compiler, the maximum warning level is turned on.
No warnings or errors are acceptable for a release version.
The only exception to this policy lies with the Microsoft
Compiler. /Wall is the maximum; however, this generates warnings
from the Microsoft headers (none are from Parser.h or Parser.c).
No Conditional Compilation:
Conditional compilation (#ifdef & such) is avoided as much as
possible. Conditional compilation is only used for standard header
guards and nothing else. In other words, you do not have to add any
defines to your build script or similar to get the Parser to compile.
Simple To Use:
To make the library as simple to use as possible, several
conventions are followed. Firstly, all the function names and
such follow my coding standards (included). Other than that,
all of the code is in two files, a .c and an .h. I find it far
simpler to just copy two files into your project, and then
include one file (#include "Parser.h", by default) to get the
library to work with other code. No defines or special
compilation flags are required.
In order to increase the simplicity, a file handler was built into
the project. Personally, I find manual file I/O to be rather ugly
and fairly inefficient. As such, the Parser provides a
comprehensive and fairly robust built in File I/O system, to
automate the process.
No Obscure Data Types:
No #define or typedef types are used, beyond the typedefs that remove
the need for the "struct" keyword. This should make it obvious what
type of data each variable takes.
Note: The function callback type is a typedef; however, this is only
provided to simplify casting. It is explicitly
defined everywhere else.
No Obscure Macros:
Yes, macros can be really helpful; however, they add another level
of obscurity to your code. There are no macros used at the
public level; however, a tiny number of internal macros are used.
Any macro that is used must be understood by its defined name.
Documented:
To improve your understanding of the project, documentation is
added to the project. Every function has a function header and
there is a file header in every file. Comments are added to the
code; however, worthless comments are avoided as much as
possible.
The external documentation provided with the project is not
automatically generated by some program like Doxygen or
similar. While these tools can be nice, they do not tell you
anything that reading the code won't tell you, and are thus
less helpful than a manually authored document. As such, a
comprehensive document written in one of the most commonly
accessed document types is provided.
No Tabs:
Tabs are evil. Rather, tabs mixed with spaces are evil. While tabs
may be set to 4 spaces in one program, they may be set to x
spaces in another program or on another computer. This destroys
the alignment and code flow, so they are all removed.
Unfortunately, this does increase the file size.
Block Formatting/Alignment:
I tend to be an alignment whore. I find it far easier to look at
and read code if it is separated into blocks.
Dynamic, Fast, and Safe:
The library was designed to be as dynamic as possible, without
sacrificing a huge number of cycles. All code is benchmarked, and
weighed against the usefulness of the feature. If the feature
eats up a huge number of cycles while not being very useful, it
will not be implemented.
To ensure that the Parser is safe, a few cycles are spent on error
checking. All allocated memory is checked to ensure that it is
valid (not null), and buffer under/over reads are checked. Valgrind
and my own memory manager are used to check for memory related
issues and memory leaks. No memory leaks are tolerated. The memory
checking mechanism is removed for release.
That being said, it is still possible to crash the program through
the parser. If you pass in a pointer to a bad chunk of memory,
the parser will most probably crash.
A Complete Package in one File:
The parser does not depend on any other libraries, other than the
standard C library. Namely, stdio.h, stdlib.h, string.h, and
limits.h.
Truly Free - Public Domain:
This library is released to the public domain. You are free to do
anything you want with it.
How is this Parser different?
The Parser was not designed to conform to traditional parser
designs. Generally, the differences can be summed up as follows:
Parser vs Traditional Parser:
Modifiable at runtime
Built in dynamic lexical analyzer
Smart built In File I/O
No required external scripting
Easy to iterate with
Built in state system
Traditional parsers are generally made through a parser generator.
This alone can make integration into a larger project and rapid
iteration extremely difficult. As such, the
Parser is built and modifiable at runtime. On top of this, a
C/C++ interface is provided in order to allow for easy integration
into an existing project, rather than writing a script to generate
a .c file, which then may require some additional modifications to
fully integrate said .c file into your project.
Along with this, all traditional parsers I've seen have only been
designed to run as one instance. As such, the Parser is wrapped up
in a fairly easy to manage state system, to allow multiple
Parsers to exist at one time.
With that being said, there are plans to expand the Parser to
incorporate some more traditional parser technologies such as
BNF style scripts and LALR generation, which again, will be
done at runtime.
2.01 - File Handling
A built in file handling system is implemented in the parser. The file
handling system includes a file fragmentation/caching system. When
ParserInit(<file>, <bufsize>) is called, a buffer size is specified.
When the data is read in from the file, the specified buffer size determines
how many bytes to read in from the file.
Since a token separator can fall on the ends of the data read in, the parser
accounts for this fragmentation. For example, take
bufsize - 3
data - "01234567890ABCDEF"
With a bufsize of 3, the parser will fragment the file into
Fragment 1 - "012"
Fragment 2 - "345"
Fragment 3 - "678"
Fragment 4 - "90A"
Fragment 5 - "BCD"
Fragment 6 - "EF"
If a token separator was declared as "234", it would not be detected, since
the entire string would never be in the input buffer.
To handle this, the buffer size is expanded by the length of the longest
token separator - 1. IE, if "234" was the only separator, then the
buffer size would be expanded by 2 (Length("234") - 1).
Original Buffer:
-- -- --
| | | |
-- -- --
Expanded:
-- -- -- -- --
| | | | | |
-- -- -- -- --
Note: A null terminator is attached as well, but it is not represented.
When the data is actually read in, the last (longest sep - 1) bytes of the
previous read are attached to the front of the buffer. This ensures that
all the characters are checked against all the possible token separators.
bufsize - 3
Longest Sep - 3
Actual buf - 5 (plus a null terminator, so it's actually 6 bytes)
data - "01234567890ABCDEF"
Fragment 1 - "012"
Fragment 2 - "12345"
Fragment 3 - "45678"
Fragment 4 - "7890A"
Fragment 5 - "0ABCD"
Fragment 6 - "CDEF"
This does force some redundant checking; however, it is far more important
that the parser correctly locates the tokens.
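The carry-over described above can be sketched in a few lines of C. This is only an illustration and not the Parser's internal code: read_fragment() is a hypothetical helper that works on a string already in memory instead of a file, and the input used in the tests is the 17 character string "01234567890ABCDEF".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Parser.h). Copies fragment number index
 * (0 based) of data into out and returns its length, or 0 once the data is
 * exhausted. Every fragment after the first is prefixed with the last
 * (longest_sep - 1) bytes that came before it. out must have room for
 * bufsize + longest_sep bytes (carry-over + new bytes + null terminator). */
size_t read_fragment(const char *data, size_t bufsize, size_t longest_sep,
                     size_t index, char *out)
{
    size_t len;
    size_t start;
    size_t overlap;
    size_t count;

    assert(data != 0 && out != 0 && bufsize > 0 && longest_sep > 0);

    len     = strlen(data);
    start   = index * bufsize;      /* First new byte of this read   */
    overlap = longest_sep - 1;      /* Bytes carried over            */

    if(start >= len)                /* Nothing left to read          */
    {
        out[0] = '\0';
        return 0;
    }

    if(overlap > start)             /* Early reads have less history */
        overlap = start;

    count = bufsize;
    if(count > len - start)         /* The last read may be short    */
        count = len - start;

    memcpy(out, data + start - overlap, overlap + count);
    out[overlap + count] = '\0';
    return overlap + count;
}
```

With a bufsize of 3 and a longest separator of 3, the second fragment comes out as "12345", so a separator such as "234" is seen in one piece even though it straddles two reads.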
If performance is an issue, a low number of short token separators with a
larger buffer size will greatly increase performance. Larger buffer sizes
decrease the number of reads from the hard drive; however, the memory
footprint of the parser will increase. The smallest possible memory
footprint can be achieved by setting the buffer size to 1; however, it is
far slower. A buffer size of 1024 bytes (1 KB) is recommended for general
purposes.
2.02 - Tokens & Token Sets
Token Separators:
The actual parsing syntax of the parser is defined as "Token Separators".
There are several tokenizers on the market (including one in string.h) that
are fast; however, a single character is not always enough to make parsing
a file simple and easy. As such, full strings are used and scanned for. I
refer to the "delimiters" as "Token Separators", since additional logic
can be attached to them. This library is called a parser, instead of a
tokenizer, due to this additional logic. Not to mention that function
callbacks are supported as well.
The order the tokens are added does matter. A token added before another
token will have a higher priority.
The Token Separators can have the following logic attached to them via
AddTokenSeparator(<token>, <Return>, <switchto>, <ignore>, <fun>, <params>)
Return - Should the token be returned when GrabToken() is called? If this
argument is set to true (1), the token will be returned. If
it's set to false (0), the token will not be returned. This is
extremely useful to filter out unneeded tokens.
For example,
data - "foo  bar" /* Notice the two spaces between foo & bar */
if the Separator is a space, and return is set to 1, namely
AddTokenSeparator(" ", 1, ...);
GrabToken(); /* Returns "foo" */
GrabToken(); /* Returns " " */
GrabToken(); /* Returns " " */
GrabToken(); /* Returns "bar" */
GrabToken(); /* Returns 0 - End of the data was reached */
If return is set to 0, namely
AddTokenSeparator(" ", 0, ...);
GrabToken(); /* Returns "foo" */
/* Both of the spaces are not returned! */
GrabToken(); /* Returns "bar" */
GrabToken(); /* Returns 0 - End of the data was reached */
switchto - Should the active token set be changed, when the sep is found?
Setting this parameter to -1 disables this feature. If it's
set to anything else, the token set will automatically be
changed to the specified token set, if it exists.
Two defines can also be passed to this variable:
Define                    Value   Description
PARSER_TSET_DONT_SWITCH    -1     Don't change the current TSet
PARSER_TSET_LAST           -2     Switch to the last active TSet
See Token Sets
ignore - Should the token separator be ignored? This logic was originally
designed to be used when parsing strings.
For Example:
string - "foo\"bar" /* Start and end quotation marks are part
* of the string */
Token Sep is a quotation mark, namely
AddTokenSeparator("\"", 1, -1, 0, ...);
GrabToken(); /* Returns quotation mark */
GrabToken(); /* Returns "foo\\" */
GrabToken(); /* Returns quotation mark */
GrabToken(); /* Returns "bar" */
GrabToken(); /* Returns quotation mark */
GrabToken(); /* Returns 0 - End of the data was reached */
If you had wanted to preserve the string, you probably didn't
want the parser to pick out the quotation mark from \". To
fix this, add another token sep (\") with ignore set to 1
AddTokenSeparator("\"", 1, -1, 0, 0, 0); /* " */
AddTokenSeparator("\\\"", 0, -1, 1, 0, 0); /* \" */
^- Ignore
GrabToken(); /* Returns quotation mark */
GrabToken(); /* Returns "foo\"bar" */
GrabToken(); /* Returns quotation mark */
GrabToken(); /* Returns 0 - End of the data was reached */
Ignore is also useful when combined with a function callback.
For example, say you wanted to count all the new lines (\n) in
a file. You could set ignore to 1, and then set the function
pointer to a function that increments a counter variable
that counts the number of new lines. When GrabToken() is
called, it'll call the callback function whenever it runs into
a new line. Since ignore is set to 1, it'll continue to do
this until it gets to the end of the file. From there, it'll
return the entire file, but you'll have the total number of
new lines in the file.
Note: You cannot call GrabToken() or similar from a callback with
ignore set to 1 (true).
/*************************************************************
* Full Program - Retrieves the number of new lines from a file
* *
* Note: You must change <any file> in ParserInit() to the *
* name of the file you want to get the number of *
* new lines from. *
*************************************************************/
#include "Parser.h"
#include <stdlib.h> /* free()   */
#include <stdio.h>  /* printf() */

char *NewLineCounter(char *str, int *newlines);

int main(void)
{
    int NewLines = 0;   /* Number of new lines found so far */

    ParserInit(<any file>, 1024);

    /* Ignore is 1, so the callback is called for every \n and the *
     * search carries on to the end of the file                    */
    AddTokenSeparator("\n", 0, -1, 1, (PCBACK)NewLineCounter,
                      &NewLines);

    free(GrabToken());  /* Scan the whole file, and free whatever *
                         * is returned                            */

    printf("New Lines: %d\n", NewLines);

    ParserDeInit();
    return 0;
}

char *NewLineCounter(char *str, int *newlines)
{
    (*newlines)++;
    return str;
}
/*************************************************************
* End of the program *
*************************************************************/
fun - Function to call whenever a token is found. The prototype for
the function is
char *<function name>(char *str, void *params);
The return of the function should be:
0 - Continue the search.
1+ - Any string. By default, the safest thing to do is to
return str; (first parameter of the function). Any
string returned by the callback is assumed to be
owned by the parser. As such, it must be allocated by
malloc(), calloc(), or realloc(). This enables you to
replace any token separator with another string.
It is assumed that you own the str parameter. IE, if you do not
plan to return the variable, and it is not 0/NULL, you should
call free(str);.
In the above program, if you changed the ignore parameter to
a 0 (false), str will be 0/NULL.
params - Parameters to pass to the callback. Check New Line Counter
for an example on how to use this. Once a token separator is
found and returned, you can retrieve this variable by calling
GetParams().
To add a new Token Separator, you can call two different functions
AddTokenSep(); - Basic version that attaches default behavior
AddTokenSeparator(); - Advanced version that allows you to define the logic.
To use AddTokenSep(), you only have to pass in a pointer to a string. A 1+
will be returned if an error occurred and a 0 will be returned if the token
was added.
Default behavior:
Return - 1 - Return the token when GrabToken() is called.
switchto - -1 - Don't change the token set.
ignore - 0 - Don't ignore the token separator.
fun - 0 - Don't call a function.
params - 0 - No params to pass to the callback
AddTokenSeparator() allows you to specify the logic of the token separator.
Token Sets:
To allow the Parser to be more dynamic, multiple Token Sets can be
defined. A "Token Set" is just a set of tokens. Each token set is
completely separated from one another. This allows the parser to switch
the parsing syntax at runtime.
Once the parser is initialized (by calling ParserInit()), the initial token
set will automatically be created, and set as the active token set. The
initial token set has an index of 0.
To create a new token set, simply call
AddTokenSet();
The function will create a new token set, set the new token set as the
active token set, and then return the index of the token set. To properly
handle the return value, you should create a descriptive variable to
store the index.
For Example:
ParserInit(0, 1024); /*Set up parser and create token set 0*/
int tset_comments = AddTokenSet();/* Create tset that handles comments */
int tset_strings = AddTokenSet();/* Create tset that handles strings */
Likewise, you can also get the current token set by calling:
GetTokenSet();
While this is the proper way to handle token set indexes, feel free to just
hard code the value. The initial token set is 0. Each call to
AddTokenSet() will increase the index by 1. IE, tset_comments will be 1,
and tset_strings will be set to 2.
All calls to AddTokenSep(), AddTokenSeparator(), GrabToken(), PeekToken(),
etc will use the active token set.
To Change the active token set, simply call SetTokenSet();
SetTokenSet(tset_comments); /* Set the active token set to handle comments*/
SetTokenSet(tset_strings); /* Set the active token set to handle strings */
SetTokenSet(0); /* Set the active tset to the initial tset */
SetTokenSet() will return a -1 if the token set index you specified is
invalid. Otherwise, SetTokenSet() will return the index you passed in.
To add further automation to the parser, you can specify which token set
the parser will use whenever a token separator is located. To do this,
specify the token set the parser should switch to, as the switchto
parameter.
For example, let's write a program that will print only the C style
comments and strings from Parser.c. First, we initialize the Parser.
That will create the initial token set (0) that will handle all the
switching between token sets, and calling the proper function. Once the
initial token set is created, create two additional token sets. The first
token set will handle all the strings. The second will handle all the
comments. Once we have all the token sets, we switch back to the
initial token set, otherwise we would add the token separators to the
last created token set.
Now that the initial token set is active, and all the additional token
sets have been created, we can start adding the parsing syntax.
Initial Token Set:
AddTokenSeparator("\"", 0, tset_strings, 0, PrintString, 0);
" - Return - 0 - Don't return it
switchto - tset_strings - switch to tset that handles strings
ignore - 0 - Don't ignore the token.
fun - PrintString - Print the string to the screen
params - 0 - Don't pass any parameters
AddTokenSeparator("/*", 0, tset_comments, 0, PrintComment, 0);
/* - Return - 0 - Don't return it
switchto - tset_comments- switch to tset that handles comments
ignore - 0 - Don't ignore the token
fun - PrintComment - Print comment to command prompt
params - 0 - Don't pass any parameters
With the initial token set up, we need to set up the two additional
token sets.
SetTokenSet(tset_strings);
tset_strings:
AddTokenSeparator("\\\\", 0, -1, 1, 0, 0);
\\ - Return - 0 - Don't return it
switchto - -1 - Don't change the current token set
ignore - 1 - Ignore the token, so we don't break up strings
fun - 0 - Don't call any function
params - 0 - Don't pass any parameters
AddTokenSeparator("\\\"", 0, -1, 1, 0, 0);
\" - Return - 0 - Don't return it
switchto - -1 - Don't change the current token set
ignore - 1 - Ignore the token, so we don't break up strings
fun - 0 - Don't call any function
params - 0 - Don't pass any parameters
AddTokenSeparator("\"", 0, 0, 0, 0, 0);
" - Return - 0 - Don't return it
switchto - 0 - Switch to the initial token set
ignore - 0 - Don't ignore it
fun - 0 - Don't call a function
params - 0 - Don't pass any parameters
SetTokenSet(tset_comments);
tset_comments:
AddTokenSeparator("*/", 0, 0, 0, 0, 0);
*/ - Return - 0 - Don't return it
switchto - 0 - Switch to the initial token set
ignore - 0 - Don't ignore it
fun - 0 - Don't call a function
params - 0 - Don't pass any parameters
With all the token sets set up, we need to switch back to the initial
token set before we can start parsing the file.
SetTokenSet(0);
With everything set up, the actual parsing is fairly automatic.
When GrabToken() is called, it'll first scan the file for a separator.
If it finds a /*, it'll switch to tset_comments, and then call the
function PrintComment(). In PrintComment(), a printf() is called
along with another GrabToken(). In PrintComment(), GrabToken() will
return everything up to the */. Once */ is found, the token set
will automatically be switched back to the initial token set, and
then continue to search for another token. The same procedure happens
for strings as well.
/************************************************************************
* Start of Program. Prints out all comments and strings from Parser.c *
************************************************************************/
#include "Parser.h"
#include <stdio.h> /* printf() */
#include <stdlib.h> /* *alloc(), free() */
char *PrintString (char *str, void *);
char *PrintComment(char *str, void *);
int main(void)
{
int tset_strings;
int tset_comments;
char *buffer;
ParserInit("Parser.c", 1024);
tset_strings = AddTokenSet();
tset_comments = AddTokenSet();
SetTokenSet(0);
AddTokenSeparator("\"", 0, tset_strings, 0, PrintString , 0);
AddTokenSeparator("/*", 0, tset_comments, 0, PrintComment, 0);
/* Set up the token set that will handle the strings. *
* Note: We need to ignore \\ as well as \", since it is possible to*
* have something like this \\". If we don't ignore \\, then*
* the parser will continue to search for a " since \" will *
* be picked out of the stream. */
SetTokenSet(tset_strings);
AddTokenSeparator("\\\\", 0, -1, 1, 0, 0);
AddTokenSeparator("\\\"", 0, -1, 1, 0, 0);
AddTokenSeparator("\"", 0, 0, 0, 0, 0);
/* Set up the token set that will handle all comments */
SetTokenSet(tset_comments);
AddTokenSeparator("*/", 0, 0, 0, 0, 0);
SetTokenSet(0);
/* Loop through all the code. GrabToken() returns chunks of code, so *
 * free each one until we get a null buffer/end of file              */
for(buffer = GrabToken(); buffer; buffer = GrabToken())
    free(buffer);
/* Clean up the parser */
ParserDeInit();
return 0;
}
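char *PrintString (char *str, void *a)
{
    char *buffer = GrabToken();     /* Everything up to the closing " */

    (void)a;                        /* Unused */

    if(buffer)                      /* 0 if the end of the data was reached */
        printf("String Found!\n\"%s\"\n\n", buffer);
    free(buffer);

    /* We own str, so free it if it's valid */
    if(str)
        free(str);

    return 0; /* Continue Searching */
}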
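char *PrintComment(char *str, void *a)
{
    char *buffer = GrabToken();     /* Everything up to the closing comment */

    (void)a;                        /* Unused */

    if(buffer)                      /* 0 if the end of the data was reached */
        printf("Comment Found!\n/*%s*/\n\n", buffer);
    free(buffer);

    /* We own str, so free it if it's valid */
    if(str)
        free(str);

    return 0; /* Continue Searching */
}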
/************************************************************************
* End of Program *
************************************************************************/
2.03 - Text Mode Functions
GrabToken() and PeekToken() are the main text file functions. They search
through the file for a Token Separator and return a pointer to a null
terminated array. These functions will work with binary files as well;
however, due to the NULL terminator, using these functions will cause all
tokens that start with a 0 to be ignored.
GrabInt() and GrabFloat() utilize GrabToken(), and then convert the token
into the respective variable type.
NOTE: Unicode is not supported.
Text Mode Compliant Functions:
GrabToken();
PeekToken();
GrabInt ();
GrabFloat();
Seek ();
2.04 - Binary Mode Functions
Seek() is the main driving force behind binary mode. It allows you to
scan the file, until you get to the position you want. GrabBinaryInt(),
GrabBinaryFloat(), GrabByte(), and GrabBytes() allow you to retrieve the
data. You can use Grab/PeekToken; however, binary data might be
interpreted to match a token separator.
Binary Mode Compliant Functions:
Seek ();
GrabBinaryInt ();
GrabBinaryFloat();
GrabBytes ();
GrabByte ();
PeekByte ();
2.05 - Function Callbacks
The parser supports callback functions. The prototype for the function is
char *<function name>(char *str, void *params);
The return of the function should be:
0 - Continue the search.
1+ - Any string. By default, the safest thing to do is to return str;
(first parameter of the function). Any string returned by the
callback is assumed to be owned by the parser. As such, it must be
allocated by malloc(), calloc(), or realloc(). This enables you to
replace any token separator with another string.
It is assumed that you own the str parameter. IE, if you do not plan to
return the variable, and it is not 0/NULL, you should call free(str);.
See the following for examples:
Sample Program: New Line Counter
Sample Program: Print comments and Strings
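The ownership rules above can be condensed into one small callback. This is a sketch under assumptions rather than code from the library: ReplaceWithMarker() and the "<NL>" replacement text are made up for illustration, and only the prototype and the ownership rules are taken from the description above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback. Replaces whatever was matched with the text "<NL>"
 * and counts how often it was called through params (a pointer to an int). */
char *ReplaceWithMarker(char *str, void *params)
{
    static const char marker[] = "<NL>";
    int  *count = (int *)params;
    char *result;

    assert(count != 0);
    (*count)++;

    /* We own str and do not return it, so release it. *
     * free(0) is a no-op, so a null str is fine here. */
    free(str);

    /* The returned string is handed over to the parser, so it must come *
     * from malloc(), calloc(), or realloc(); never a string literal.    */
    result = (char *)malloc(sizeof marker);
    if(result)
        memcpy(result, marker, sizeof marker);

    return result;  /* 0 on allocation failure, which continues the search */
}
```

A callback like this would be registered the same way the sample programs register theirs, by passing it as the fun parameter of AddTokenSeparator() and a pointer to the counter as params.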
2.06 - String Manipulation Functions
RemoveWhiteSpaces() - Removes all white spaces from a string (Space, New Line,
Carriage Return, and Tab)
ToUpper () - Converts all letters in a string to upper case
ToLower () - Converts all letters in a string to lower case
Dup () - Creates a copy of the specified string
DupLen () - Same as above, but takes the length
DupRange () - Creates a copy of a specific part of a string
DupRangeFile() - Creates a copy of a specific part of a file
Cmp () - Compares two strings. If they are equal, 1 is returned. If
not, 0 is returned.
2.07 - Options/Configuration
The Parser contains several runtime definable options. Options can be
enabled, disabled, or checked with the following functions:
ParserEnable () - Enables an option/feature
ParserDisable () - Disables an option/feature
ParserIsEnabled() - Checks to see if all the specified options are enabled
All the above functions can take multiple options at the same time.
For Example,
ParserEnable(PARSER_GLOBBING | PARSER_HASH | PARSER_CASE_INSENSITIVE);
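Combining options with | suggests that each option is a single bit in a mask. The sketch below shows how such a scheme typically works. It is not the Parser's code: the Demo* names and the flag values are placeholders, and the real values are whatever Parser.h defines.

```c
#include <assert.h>

/* Hypothetical flag values. The real ones are defined in Parser.h */
#define DEMO_GLOBBING          0x01u
#define DEMO_HASH              0x02u
#define DEMO_CASE_INSENSITIVE  0x04u

static unsigned demo_options = 0;   /* All options start out disabled */

/* Turns on every bit set in flags */
void DemoEnable (unsigned flags) { demo_options |=  flags; }

/* Turns off every bit set in flags */
void DemoDisable(unsigned flags) { demo_options &= ~flags; }

/* Returns 1 only if every requested flag is set, otherwise 0 */
int DemoIsEnabled(unsigned flags)
{
    assert(flags != 0);
    return (demo_options & flags) == flags;
}
```

Note that the check is "all of", not "any of": after DemoEnable(DEMO_GLOBBING), a call to DemoIsEnabled(DEMO_GLOBBING | DEMO_HASH) returns 0. This matches the description of ParserIsEnabled() above.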
Search Algorithm Options:
PARSER_CASE_INSENSITIVE - All search related functions will perform
case insensitive searches.
PARSER_WILDCARD - Globbing based searches will be enabled.
Namely, the * character will be used to
accept any number of additional characters.
This feature will be disabled if
PARSER_REGEX is enabled.
See Globbing for more info.
PARSER_GLOBBING - Same as PARSER_WILDCARD
PARSER_REGULAR_EXPRESSIONS - Enables RegEx searching syntax.
This feature will take precedence over and
disable the following features:
PARSER_HASH
PARSER_WILDCARD
PARSER_GLOBBING
PARSER_SORT_TOKEN_SEPS
See RegEx for more info.
PARSER_REGEX - Same as PARSER_REGULAR_EXPRESSIONS
Performance Options:
PARSER_HASH - Hashes the first character of each token
separator. This can drastically speed up
the parsing process, in exchange for
additional memory usage.
This feature will be disabled if
PARSER_REGEX is enabled.
This feature implies PARSER_SORT_TOKEN_SEPS
PARSER_CLOSE_FILE - Closes the file after it is read from. By
default, the file is left open due to
performance reasons. Namely, if the file
is closed, the OS will have to seek to
the correct location in the file which
tends to be very expensive.
Closing a file will reduce the total number
of active system file handles, in exchange
for a drastic performance loss.
PARSER_SORT_TOKEN_SEPS - Applies a weak sorting algorithm to all
token separators. This guarantees that
all supersets of token separators will be
checked before any subsets, in exchange for
a slightly longer initialization time.
This feature will be disabled if
PARSER_REGEX is enabled.
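To see why checking supersets first matters, take the separators " and \". If " is tested first, \" can never match, since its last character is always claimed by the shorter separator. One simple way to get a "supersets before subsets" ordering is to sort by length, longest first. The sketch below does that with qsort(). It is an assumption for illustration only; sort_separators() is a made-up name and the Parser's own sorting algorithm may work differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: longer separators sort first */
static int by_length_desc(const void *a, const void *b)
{
    size_t la = strlen(*(const char *const *)a);
    size_t lb = strlen(*(const char *const *)b);

    return (la < lb) - (la > lb);
}

/* Hypothetical helper. Reorders seps so that no separator comes after a   *
 * shorter one. A superset is always longer than the separators it         *
 * contains, so every superset ends up in front of its subsets. The order  *
 * of separators with equal length is left unspecified.                    */
void sort_separators(const char **seps, size_t count)
{
    assert(seps != 0 || count == 0);

    if(count > 1)
        qsort(seps, count, sizeof *seps, by_length_desc);
}
```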
Memory Options:
PARSER_CONST_FILE_NAME - It is assumed that file name passed to
LoadFile() or ParserInit() is const. As
such, the string will not be duplicated,
nor will it be freed when ParserDeInit() is
called. As such, the string must be valid
for the entire life of the parser.
PARSER_CONST_LOAD_MEMORY - It is assumed that memory passed to
LoadMemory() or LoadMemoryLen() is const.
As such, the memory will not be duplicated,
nor will it be freed when ParserDeInit() is
called. As such, the memory must be valid
for the entire life of the parser.
PARSER_CONST_TOKEN_SEPS - It is assumed that token separator passed to
AddTokenSeparator() is const.
As such, the string will not be duplicated,
nor will it be freed when ParserDeInit() is
called. As such, the string must be valid
for the entire life of the parser.
PARSER_OWNS_FILE_NAME - It is assumed that the parser owns the
pointer passed to LoadFile() and
ParserInit(). As such, the pointer
will be freed when ParserDeInit()
is called. The pointer must be allocated
with malloc(), calloc(), or realloc().
PARSER_OWNS_LOAD_MEMORY - It is assumed that the parser owns the
pointer passed to LoadMemory() and
LoadMemoryLen(). As such, the pointer
will be freed when ParserDeInit()
is called. The pointer must be allocated
with malloc(), calloc(), or realloc().
PARSER_OWNS_TOKEN_SEPS - It is assumed that the parser owns the
pointer passed to AddTokenSeparator(). As
such, the pointer will be freed when
ParserDeInit() is called. The pointer must
be allocated with malloc(), calloc(), or
realloc().
2.08 - Search Methods
The following search algorithms directly affect how a token separator is
interpreted by the Parser. The more complex the search algorithm, the slower
it tends to be.
Linear:
This is the default and fastest search algorithm within the Parser. Simply
stated, a linear search is performed on the file/memory. IE, the Parser
will start from the first byte and compare it against all the token
separators. If none of the separators match, the next byte will be
tested, and so on.
In this mode, the token separators will be literally interpreted. IE, there
are no reserved characters, other than the NULL terminator/0.
To ensure that a linear search is performed, call:
ParserDisable(PARSER_GLOBBING | PARSER_REGEX);
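As a rough illustration of the linear search described above, the following self-contained sketch walks a buffer byte by byte and tests each position against a list of literal separators, in the order they were supplied. The name find_separator() and its signature are hypothetical; this is not the Parser's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first position in data
   where any separator matches, or -1 if none does. *which receives the
   index of the matching separator. Separators are compared literally and
   tried in the order given, so earlier entries win at the same position. */
long find_separator(const char *data, const char *const *seps,
                    size_t nseps, size_t *which)
{
    size_t len = strlen(data);
    size_t pos, i;

    assert(which != NULL);

    for (pos = 0; pos < len; pos++) {
        for (i = 0; i < nseps; i++) {
            size_t slen = strlen(seps[i]);
            if (slen > 0 && slen <= len - pos &&
                memcmp(data + pos, seps[i], slen) == 0) {
                *which = i;
                return (long)pos;
            }
        }
    }
    return -1;
}
```

With the separators "23" and "2" supplied in that order, the data "a1234" matches "23" at offset 2, while "a12" falls through to "2".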
Globbing:
To enable Globbing, call:
ParserEnable(PARSER_GLOBBING);
Globbing is similar to the linear search method; however, the asterisk (*)
has a special meaning. Namely, it matches any number of characters of
any kind. IE,
Glob Matching Examples
foo*bar - foobar, foo\nbar, foo___bar, foo o foo bar
*foo* - asdf foo bar, dddfoobar
*f*r* - fr, f r, af r, af rd
Glob Non-Matching Examples
foo*bar - foba, fooba, foob ar, f oobar
*foo* - fo o, asf oo
*f*r* - afh
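The matching rule in the examples above can be sketched with a small whole-string matcher. This is only an illustration of the * semantics (any run of characters, no escape). The name glob_match() is hypothetical, and the sketch ignores how the Parser bounds leading and trailing * with other token separators.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text matches pattern in full, else 0.
   '*' matches any run of characters (including none); every other
   character must match literally. There is no way to escape '*'. */
int glob_match(const char *pattern, const char *text)
{
    const char *star = NULL;   /* last '*' seen in the pattern */
    const char *retry = NULL;  /* text position that '*' currently covers up to */

    assert(pattern != NULL && text != NULL);

    while (*text != '\0') {
        if (*pattern == '*') {
            star = pattern++;          /* remember the wildcard ... */
            retry = text;              /* ... and let it match nothing for now */
        } else if (*pattern == *text) {
            pattern++;
            text++;
        } else if (star != NULL) {
            pattern = star + 1;        /* backtrack: let '*' absorb one more byte */
            text = ++retry;
        } else {
            return 0;
        }
    }
    while (*pattern == '*') {          /* trailing wildcards match the empty rest */
        pattern++;
    }
    return *pattern == '\0';
}
```

The backtracking step is also where the worst-case cost comes from: each literal segment may be retried at many text positions.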
In order to prevent a glob from returning the content of an entire file,
all the other token separators will be taken into account. This will
affect the glob if it has a * at the front or end of the token ("*foo*").
Internal * are not separator delimited ("f*o").
So, if you have this
LoadMemory("ab foo bar ddd");
ParserEnable(PARSER_GLOBBING);
AddTokenSep(" ");
AddTokenSep("*foo*bar");
Then,
GrabToken(); /* Returns "ab" */
GrabToken(); /* Returns " " */
GrabToken(); /* Returns "foo bar" */
GrabToken(); /* Returns " " */
GrabToken(); /* Returns "ddd" */
GrabToken(); /* Returns 0 - End of the data was reached */
Likewise, with the same Token Set,
LoadMemory("ab ddfoo foo baree ddd");
GrabToken(); /* Returns "ab" */
GrabToken(); /* Returns " " */
GrabToken(); /* Returns "ddfoo foo baree" */
GrabToken(); /* Returns " " */
GrabToken(); /* Returns "ddd" */
GrabToken(); /* Returns 0 - End of the data was reached */
Warning: Once a glob has found the first part of a non-wildcard segment of
the token, it will search to the end of the file/memory in order
to locate the end. It will break out as soon as an ending segment
is found. In other words, an O(N^2) search might be performed by
a glob on a file. IE, with a bad set of data, globbing may be
extremely slow.
Warning: There is no way to escape the *. Once enabled, all * will be
interpreted as a wildcard character.
RegEx:
Regular Expressions, or RegEx for short, is a much more powerful version of
globbing.
Warning: The RegEx engine is still in development. It is possible to get
stuck in an infinite loop.
To enable RegEx, call:
ParserEnable(PARSER_REGEX);
RegEx on Wikipedia
Supported Syntax:
Logic:
^ - Start of String
$ - End of String
\b - Word Boundary
\B - Not a Word Boundary
\< - Start of Word
\> - End of Word
| - Or. Ex, "a|b" will match 'a' or 'b'
Ranges:
[aqf] - Will Match a, q, or f. Anything can be added here
[^a] - Anything other than 'a'
[a-z] - Anything from 'a' to 'z'. 'a' and 'z' can be
replaced with any ASCII characters.
Predefined Ranges:
\s - [ \t\r\n\v\f] - White Space
\S - [^\s] - Not White Space
\d - [0-9] - Digit
\D - [^\d] - Not a Digit
\w - [A-Za-z0-9_] - Word
\W - [^\w] - Not a Word
Quantifiers:
All Quantifiers are Greedy. Append a '?' after a
quantifier to make it non-greedy.
? - 0 or 1
* - 0 or more
+ - 1 or more
Other:
(...) - Sub expressions. Ex, "(a|b)+" will match
"aaaa", "aabb", "abab", "bbba", ...
Internally, all RegEx's in the Parser are compiled into byte code.
This is done in order to simplify the implementation of the RegEx engine,
as well as to improve the overall performance. Currently, the Parser uses
a DFA engine, due to its high performance.
Planned Enhancements:
Capture Groups
Backreferences
Warning: While still usable, RegEx's tend to be slower than Linear and
Globbing based searches.
2.09 - State Management
For simplicity, the Parser's state management is handled through a global
pointer. This reduces the total number of variables you have to manage.
If you need more than one Parser state, the following state management
methods are provided to simplify the overall process.
Manual:
When you are manually managing the Parser state, you are responsible for
keeping track of the state's pointer. Once the Parser is initialized
via ParserInit(), you can get the state pointer by
calling:
void *state = GetParserState();
This will give you a direct copy of the state pointer. The current state
will still be active.
To change the active Parser state, call:
SetParserState(state);
To deinitialize the Parser's state while the state is still active, call
ParserDeInit();
The main idea behind this mode is to wrap any code that needs to create a
new parser state with the above function calls.
For Example (assume foo() was called):
void foo(void)
{
    /* Preserve any old parser state */
    void *old_state = GetParserState();
    /* Create a new state */
    SetParserState(0);
    ParserInit(...);
    /* Do stuff with the Parser... */
    bar();
    /* Do stuff with the Parser... */
    /* Cleanup the current state, and restore the old state */
    ParserDeInit();
    SetParserState(old_state);
}
void bar(void)
{
    /* Preserve any old parser state */
    void *old_state = GetParserState();
    /* Create a new state */
    SetParserState(0);
    ParserInit(...);
    /* Do stuff with the Parser... */
    ParserDeInit();
    SetParserState(old_state);
}
So, looking at the above code, when foo is called, we preserve any Parser
state that might be active. Then, we create a new state. We then call
bar() which does the same thing. This guarantees that we will not
do any harmful things to another parser state.
Alternatively, we can wrap the call to bar() with Get/SetParserState();
however, this tends to be a bit more dangerous and can lead to a lot
more code (you'll need to do this for all calls to functions that
expect an empty Parser state).
The main advantage to this method, is that you can prebuild a set of
Parsers, and switch to the correct Parser state when needed.
Stack:
The Stack (LIFO) based state management is very similar to the Manual
state management; however, it is not as flexible. In exchange for this,
stack based management tends to be slightly simpler.
The stack based management revolves around 2 functions:
PushParserState() pushes the current Parser state onto an
internal global stack, and then sets the current state to 0.
PopParserState() DeInitializes the current Parser state, and
restores any old state.
So, the above example would work out to
void foo(void)
{
    /* Preserve any old parser state */
    PushParserState();
    /* Create a new state */
    ParserInit(...);
    /* Do stuff with the Parser... */
    bar();
    /* Do stuff with the Parser... */
    /* Cleanup the current state, and restore the old state */
    PopParserState();
}
void bar(void)
{
    /* Preserve any old parser state */
    PushParserState();
    /* Create a new state */
    ParserInit(...);
    /* Do stuff with the Parser... */
    PopParserState();
}
The main advantage to this method is that we do not have to keep track of
the old state pointer. Plus, this only takes 3 lines of code instead of
the original 5.
In exchange for this, we are limited on the total number of states that the
Parser will keep track of. By default, the Parser will keep track of
10 states. The number of states can be increased by modifying
PARSER_STATE_STACK_SIZE in Parser.c. On top of this, the stack method
only works in a linear fashion. The manual management can switch to any
active state, rather than only the last one.
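To make the stack behaviour concrete, here is a toy model of a fixed-size LIFO of state pointers. All toy_* names are hypothetical, malloc()/free() merely stand in for initializing and deinitializing a state, and the limit of 10 mirrors the documented default. It is a sketch of the idea, not the code in Parser.c.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_STATE_STACK_SIZE 10   /* mirrors the documented default of 10 */

static void *toy_current = NULL;                /* active state (global pointer) */
static void *toy_stack[TOY_STATE_STACK_SIZE];   /* saved states */
static int   toy_depth = 0;                     /* number of saved states */

/* Saves the active state and leaves no state active.
   Returns 0 if the stack is full, 1 otherwise. */
int toy_push_state(void)
{
    assert(toy_depth >= 0 && toy_depth <= TOY_STATE_STACK_SIZE);
    if (toy_depth >= TOY_STATE_STACK_SIZE) {
        return 0;
    }
    toy_stack[toy_depth++] = toy_current;
    toy_current = NULL;
    return 1;
}

/* Releases the active state and restores the most recently saved one.
   Popping an empty stack just leaves no state active. */
void toy_pop_state(void)
{
    free(toy_current);   /* stand-in for deinitializing the state */
    toy_current = (toy_depth > 0) ? toy_stack[--toy_depth] : NULL;
}

/* Stand-in for initializing a new state; replaces any active one. */
void toy_init_state(void)
{
    free(toy_current);
    toy_current = malloc(1);
}

void *toy_get_state(void)
{
    return toy_current;
}
```

Only the most recently pushed state can be restored, which is the linear restriction mentioned above.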
2.10 - C++ Wrapper
Included with the library is a C++ wrapper. This wrapper is largely just
copy and paste; however, there are a few differences:
Differences:
ParserInit() -> Class Constructor
ParserDeInit() -> Class Destructor
AddTokenSep() -> Removed. Done through AddTokenSeparator() defaults.
All Parser*() functions have the "Parser" part removed. IE,
ParserEnable() -> Enable() and similar
Multiple States handled through C++ classes. IE, no more
Get/SetParserState();
Push/PopParserState();
Callbacks are now passed Parser &. IE,
char *<function name>(Parser &p, char *str, void *params);
Copy constructor is implemented.
Important things that did not change
malloc(), calloc(), realloc(), and free() are still used internally. As
such, all pointers returned by the parser should still be freed with
free().
new and delete should only be used to allocate and free the Parser class.
3.1 - Compiling the Parser
To compile the parser, simply add Parser.h and Parser.c to your project.
Add Parser.c to your make file, command line, or whatever. No compile
time defines or special switches are required.
Note: If you are working on a C++ project, you might want to rename Parser.c
to Parser.cpp or use the C++ wrapper.
The Parser has been compiled and tested with
Windows:
MS Compiler cl *.c /W4
GNU gcc *.c -Wall -Wextra -ansi -pedantic
gcc *.c -Wall -Wextra -ansi -pedantic -mno-cygwin
g++ *.c -Wall -Wextra -ansi -pedantic
Borland Compiler bcc -w *.c
Digital Mars dmc file.c -A
Intel Compiler icl *.c /W3
Linux:
GNU gcc *.c -Wall -Wextra -ansi -pedantic
g++ *.c -Wall -Wextra -ansi -pedantic
3.2 - Initialization
Before you can use the Parser, you must first call
ParserInit(<file to parse>, <buffer size>);
After this is done, any options you want should be enabled.
See Options/Configuration.
3.3 - Setting up the Token Set
Once the parser has been initialized, you have to set up the token sets. If
you are parsing a pure binary file, you do not need to add any token
separators.
To add a token separator, call
AddTokenSep(<token>);
AddTokenSeparator(<token>, <return>, <switchto>, <ignore>, <fun>, <params>);
To create a new Token Set, call
AddTokenSet();
To change the current token set, call
SetTokenSet(<token set>);
More on Token Sets
3.4 - Parsing the file
Once the token set(s) have been set up, you begin to parse the file.
GrabToken() is the main parser function. When GrabToken() is called, it'll
retrieve the next token from the file. If a 0 (null) is returned, the end
of the file was reached or an error occurred.
Text Mode Functions
Binary Mode Functions
3.5 - Deinitialization
Once you are done parsing, you should call
ParserDeInit();
The following functions are the public interface for interacting
with the Parser.
4.1.01 - AddTokenSep()
Prototype:
int AddTokenSep(const char *sp);
Description:
Adds a new token separator to the current token set. This function is an
adapter for AddTokenSeparator(), that uses the default settings.
Return - 1 - The token separator will be returned
switchto - -1 - Don't change the token set
ignore - 0 - Don't ignore the token
fun - 0 - Don't call a callback function
params - 0 - Don't pass any parameters
Inputs:
*sp - Pointer to the string to use as a token separator. C style string -
must be null terminated.
Output:
0 - Token has been added to the parser.
1+ - The specified token is not valid, or another error occurred. This value
may be one or more of the following flags:
PARSER_TOKEN_NULL - sp == NULL
PARSER_OUT_OF_MEMORY - Could not allocate enough memory to
add the token to parser. This error
will usually cause the rest of the
parser to error out.
PARSER_REGEX_COMPILE_ERROR - Failed to compile a RegEx to byte code
PARSER_UNBALANCED_PARENS - RegEx has unbalanced ()
PARSER_UNBALANCED_BRACKETS - RegEx has unbalanced []
Notes:
The order that tokens are added matters. A token added before another
token will be checked first. IE, if you passed in
AddTokenSep("23");
AddTokenSep("2");
"23" will be checked for before "2" is checked for.
This is very useful if you have a token that is a superset of another.
To change this behavior, call
ParserEnable(PARSER_SORT_TOKEN_SEPS);
This will cause a weak sorting algorithm to be applied to the token
separators, so that supersets of tokens will always be checked before
subsets.
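The documentation does not spell out the sorting algorithm, so the following is only one way to obtain the stated guarantee: a stable insertion sort that places longer separators first. Because a superset is always longer than its subset, it ends up being checked earlier, and separators of equal length keep the order in which they were added. The name sort_separators() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stable insertion sort, longest separator first.
   Entries of equal length keep their original relative order. */
void sort_separators(const char **seps, size_t n)
{
    size_t i, j;

    assert(seps != NULL || n == 0);

    for (i = 1; i < n; i++) {
        const char *cur = seps[i];
        size_t cur_len = strlen(cur);

        /* Shift strictly shorter entries one slot to the right. */
        for (j = i; j > 0 && strlen(seps[j - 1]) < cur_len; j--) {
            seps[j] = seps[j - 1];
        }
        seps[j] = cur;
    }
}
```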
By default, the token that is passed in will be duplicated. To change this
behavior, call ParserEnable() with one of the
following defines:
PARSER_CONST_TOKEN_SEPS - Token is assumed to be constant, and will not
be duplicated or freed. As such, the token
must always be available, for the entire life
of the parser state.
PARSER_OWNS_TOKEN_SEPS - Token is assumed to be allocated with malloc(),
calloc(), or realloc(). The token will be
freed when ParserDeInit() is called.
4.1.02 - AddTokenSeparator()
Prototype:
int AddTokenSeparator(const char *sp,
int Return,
int switchto,
int ignore,
char *(*fun)(),
void *params);
Description:
Adds a new token separator to the current token set, using the specified
settings.
Inputs:
*sp - Pointer to the string to consider as a separator
Return - Should the separator be returned by Grab/PeekToken()? This is
useful for filtering out specific strings.
switchto - Automatically switch to the specified token set, when the token
is found. This can also be one of the following defines:
PARSER_TSET_DONT_SWITCH - Don't change the current token set
PARSER_TSET_LAST - Switch to the last active token set
ignore - If this Token is found, just keep going. Originally designed to
be used with strings. For example, \" should be ignored;
however, " will be picked out if we don't ignore \"
fun - Callback function to call whenever the token is found.
params - Pointer to pass to fun. Once a token is found, and Grab*() has
returned, GetParams() can be called to return this value.
Output:
0 - Token has been added to the parser.
1+ - The specified token is not valid, or another error occurred. This value
may be one or more of the following flags:
PARSER_TOKEN_NULL - sp == NULL
PARSER_OUT_OF_MEMORY - Could not allocate enough memory to
add the token to parser. This error
will usually cause the rest of the
parser to error out.
PARSER_REGEX_COMPILE_ERROR - Failed to compile a RegEx to byte code
PARSER_UNBALANCED_PARENS - RegEx has unbalanced ()
PARSER_UNBALANCED_BRACKETS - RegEx has unbalanced []
Notes:
The order that tokens are added matters. A token added before another
token will be checked first. IE, if you passed in
AddTokenSep("23");
AddTokenSep("2");
"23" will be checked for before "2" is checked for.
This is very useful if you have a token that is a superset of another.
To change this behavior, call
ParserEnable(PARSER_SORT_TOKEN_SEPS);
This will cause a weak sorting algorithm to be applied to the token
separators, so that supersets of tokens will always be checked before
subsets.
By default, the token that is passed in will be duplicated. To change this
behavior, call ParserEnable() with one of the
following defines:
PARSER_CONST_TOKEN_SEPS - Token is assumed to be constant, and will not
be duplicated or freed. As such, the token
must always be available, for the entire life
of the parser state.
PARSER_OWNS_TOKEN_SEPS - Token is assumed to be allocated with malloc(),
calloc(), or realloc(). The token will be
freed when ParserDeInit() is called.
If ignore is set to 1 and there is a function callback for a token, you will
not be able to call Grab*() or similar from within the callback.
4.1.03 - AddTokenSet()
Prototype:
int AddTokenSet(void);
Description:
Creates a new Token Set, and then sets it as the active one.
Inputs:
N/A
Output:
-1 - Error occurred. Call ErrorCode() to find out what went wrong.
0+ - Index of the new token set. Generally, you should use a variable to
store the return result, and then use that variable when
SetTokenSet() is called.
Notes:
N/A
4.1.04 - End()
Prototype:
int End(void);
Description:
Returns 1 if the end of the file was reached, an error occurred, or if the
parser was not initialized.
Inputs:
N/A
Output:
1 - The end of the file was reached, an error occurred, or the parser was not
initialized.
0 - The parser can still retrieve data from the file.
Notes:
N/A
4.1.05 - ErrorCode()
Prototype:
int ErrorCode(void);
Description:
Returns a 0 if no error has occurred. Otherwise, an error has occurred.
Inputs:
N/A
Output:
Define Value Description
PARSER_NOT_INITIALIZED - -1 - Not Initialized. Call ParserInit() first.
PARSER_NO_ERROR - 0 - No Error
PARSER_COULD_NOT_OPEN_FILE - 1 - Could not open the specified file.
PARSER_OUT_OF_MEMORY - 2 - Could not allocate the required memory
PARSER_END_OF_FILE - 3 - Reached the end of the file
PARSER_MEMORY_NOT_VALID - 4 - Data passed in to LoadMemory() was null
PARSER_GRAB_TOKEN_IGNORE - 5 - GrabToken() or similar was called from a
function callback with ignore set to 1.
This is not supported.
Notes:
Check Parser.h for the defines of the above error codes. You can also call
PrintErrorCode() to print out a human readable error code to the
command prompt. This will be written to stdout via printf().
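If you would rather format the message yourself than rely on PrintErrorCode(), a lookup along these lines is one option. The numeric values are taken from the table above. The EX_* names and error_text() are hypothetical stand-ins, so in real code you would use the PARSER_* defines from Parser.h instead.

```c
#include <assert.h>

/* Values copied from the table above. The EX_ names are hypothetical
   stand-ins for the PARSER_* defines in Parser.h. */
enum {
    EX_NOT_INITIALIZED     = -1,
    EX_NO_ERROR            = 0,
    EX_COULD_NOT_OPEN_FILE = 1,
    EX_OUT_OF_MEMORY       = 2,
    EX_END_OF_FILE         = 3,
    EX_MEMORY_NOT_VALID    = 4,
    EX_GRAB_TOKEN_IGNORE   = 5
};

/* Hypothetical helper: maps an error code to a short description. */
const char *error_text(int code)
{
    switch (code) {
    case EX_NOT_INITIALIZED:     return "Not initialized. Call ParserInit() first";
    case EX_NO_ERROR:            return "No error";
    case EX_COULD_NOT_OPEN_FILE: return "Could not open the specified file";
    case EX_OUT_OF_MEMORY:       return "Could not allocate the required memory";
    case EX_END_OF_FILE:         return "Reached the end of the file";
    case EX_MEMORY_NOT_VALID:    return "Data passed in to LoadMemory() was null";
    case EX_GRAB_TOKEN_IGNORE:   return "Grab call made from a callback with ignore set to 1";
    default:                     return "Unknown error code";
    }
}
```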
4.1.06 - GetFilePosition()
Prototype:
long GetFilePosition(void);
Description:
Returns the absolute position of the parser in the file.
Inputs:
N/A
Output:
Absolute position in the file.
Notes:
The position returned will not be accurate if GetFilePosition() is called
from within a callback that had a token with ignore set to 1.
4.1.07 - GetFileSize()
Prototype:
long GetFileSize(void);
Description:
Returns the size, in bytes, of the currently loaded file or block of memory.
Inputs:
N/A
Output:
Size of the file/block of memory.
Notes:
N/A
4.1.08 - GetParams()
Prototype:
void *GetParams(void);
Description:
Returns the params variable associated with the last found token separator.
This value is the last value specified when you call
AddTokenSeparator().
Inputs:
N/A
Output:
Last params variable associated with the last found token separator.
Notes:
N/A
4.1.09 - GetParserState()
Prototype:
void * GetParserState(void);
Description:
Returns a pointer to the current Parser state.
Inputs:
N/A
Output:
Pointer to the current Parser state.
Notes:
N/A
4.1.10 - GetTokenSet()
Prototype:
int GetTokenSet(void);
Description:
Returns the current token set.
Inputs:
N/A
Output:
-1 - Parser was not initialized.
0+ - Current Token Set
Notes:
N/A
4.1.11 - GrabBinaryFloat()
Prototype:
float GrabBinaryFloat(void);
Description:
Grabs the next four bytes in the file, and converts them to a float.
Inputs:
N/A
Output:
Next four bytes in the file as a float. If there are not four bytes left
in the file, a 0.0f will be returned instead.
Notes:
N/A
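The behaviour described above (read the raw bytes of a float, fall back to 0.0f when too few bytes remain) can be sketched as follows. The name read_binary_float() and the explicit buffer/position arguments are hypothetical. The sketch also assumes the bytes are in host byte order and that sizeof(float) is 4, as on most current platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: reads a float stored in host byte order at *pos
   and advances *pos past it. Returns 0.0f, leaving *pos unchanged, if
   fewer than sizeof(float) bytes remain. */
float read_binary_float(const unsigned char *buf, size_t len, size_t *pos)
{
    float value = 0.0f;

    assert(buf != NULL && pos != NULL);

    if (*pos > len || len - *pos < sizeof value) {
        return 0.0f;
    }
    /* memcpy sidesteps alignment and strict-aliasing problems that a
       pointer cast to float* could cause. */
    memcpy(&value, buf + *pos, sizeof value);
    *pos += sizeof value;
    return value;
}
```

Note that a return value of 0.0f is ambiguous: it can be real data or the short-read fallback, so check the remaining length first if the difference matters.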
4.1.12 - GrabBinaryInt()
Prototype:
int GrabBinaryInt(void);
Description:
Returns the next sizeof(int) bytes in the file as an int.
Inputs:
N/A
Output:
Next sizeof(int) bytes in the file as an int. If fewer than that many bytes
are left in the file, a 0 will be returned.
Notes:
N/A
4.1.13 - GrabByte()
Prototype:
char GrabByte(void);
Description:
Grabs the next character (byte) in the file.
Inputs:
N/A
Output:
Next character (byte) from the file.
Notes:
N/A
4.1.14 - GrabBytes()
Prototype:
char *GrabBytes(int bytes);
Description:
Grabs the requested number of bytes from the file, and then returns them.
Inputs:
bytes - How many bytes to grab from the file.
Output:
0 - Requested number of bytes is invalid or the end of the file was reached
1+ - Pointer to the memory that contains the data from the file.
Notes:
You are responsible for cleaning up the data when you are done with it. IE,
you must call free(<pointer returned by GrabBytes()>).
4.1.15 - GrabFloat()
Prototype:
float GrabFloat(void);
Description:
Grabs the next token in the file, and then attempts to convert it to a float
via the atof() function declared in stdlib.h. All token separators are
taken into account. Function callbacks & such will still be called.
Inputs:
N/A
Output:
Next token converted to a float.
Notes:
N/A
4.1.16 - GrabInt()
Prototype:
int GrabInt(void);
Description:
Grabs the next token in the file, and then attempts to convert it to an int
via the atoi() function declared in stdlib.h. All token separators are
taken into account. Function callbacks & such will still be called.
Inputs:
N/A
Output:
Next token converted to an int.
Notes:
N/A
4.1.17 - GrabToken()
Prototype:
char *GrabToken(void);
Description:
The main function of the Parser. This function will scan the file for
any of the token separators you specified with AddTokenSeparator(), as well
as to apply any specified logic of the token separator.
Inputs:
N/A
Output:
0 - End of the file was reached, or an error occurred
1+ - Character pointer to the next token in the file.
Notes:
You are responsible for the cleanup. IE, calling free().
4.1.18 - LoadFile()
Prototype:
int LoadFile(const char *file);
Description:
Loads a new file into the parser for processing.
Inputs:
*file - C style string that contains the name/path of the file to parse.
Output:
1 - File was loaded and the parser was set up
0 - Error occurred. Most likely due to an incorrect file name.
Notes:
The token sets will not be affected by this function.
All files are read in as binary.
By default, the file name that is passed in will be duplicated. To change this
behavior, call ParserEnable() with one of the
following defines:
PARSER_CONST_FILE_NAME - File name is assumed to be constant, and will not
be duplicated or freed. As such, the file name
must always be available, for the entire life
of the parser state.
PARSER_OWNS_FILE_NAME - File name is assumed to be allocated with
malloc(), calloc(), or realloc(). The file name
will be freed when ParserDeInit() is called.
4.1.19 - LoadMemory()
Prototype:
int LoadMemory(const char *memory);
Description:
Loads the specified chunk of memory into the parser for parsing.
Currently, only C style strings are supported by this function.
This function calls LoadMemoryLen().
Inputs:
*memory - Pointer to the chunk of memory to load into the parser.
Output:
0 - The specified memory is not valid or the parser was not initialized.
1 - The memory was loaded into the parser.
Notes:
By default, the memory that is passed in will be duplicated. To change this
behavior, call ParserEnable() with one of the
following defines:
PARSER_CONST_LOAD_MEMORY - Memory is assumed to be constant, and will not
be duplicated or freed. As such, the memory
must always be available, for the entire life
of the parser state.
PARSER_OWNS_LOAD_MEMORY - Memory is assumed to be allocated with
malloc(), calloc(), or realloc(). The memory
will be freed when ParserDeInit() is called.
4.1.20 - LoadMemoryLen()
Prototype:
int LoadMemoryLen(const char *memory, int len);
Description:
Loads the specified chunk of memory into the parser for parsing.
Binary based memory can be passed in.
Inputs:
*memory - Pointer to the chunk of memory to load into the parser.
len - Size of the memory to load into the Parser
Output:
0 - The specified memory is not valid or the parser was not initialized.
1 - The memory was loaded into the parser.
Notes:
By default, the memory that is passed in will be duplicated. To change this
behavior, call ParserEnable() with one of the
following defines:
PARSER_CONST_LOAD_MEMORY - Memory is assumed to be constant, and will not
be duplicated or freed. As such, the memory
must always be available, for the entire life
of the parser state.
PARSER_OWNS_LOAD_MEMORY - Memory is assumed to be allocated with
malloc(), calloc(), or realloc(). The memory
will be freed when ParserDeInit() is called.
4.1.21 - ParserDeInit()
Prototype:
void ParserDeInit(void);
Description:
Frees all the memory that the parser was using.
Inputs:
N/A
Output:
N/A
Notes:
N/A
4.1.22 - ParserDisable()
Prototype:
void ParserDisable(int flags);
Description:
Disables one or more features/options.
See Options/Configuration for a list of defines.
Inputs:
flags - One or more features/options to disable. Multiple features can be
disabled at the same time by ORing the values together with |. For example,
ParserDisable(PARSER_REGEX | PARSER_CASE_INSENSITIVE);
Output:
N/A
Notes:
By default, none of the listed features/options are enabled.
4.1.23 - ParserEnable()
Prototype:
void ParserEnable(int flags);
Description:
Enables one or more of the specified options.
See Options/Configuration for a list of defines.
Inputs:
flags - One or more features/options to enable. Multiple features can be
enabled at the same time by ORing the values together with |. For example,
ParserEnable(PARSER_REGEX | PARSER_CASE_INSENSITIVE);
Output:
N/A
Notes:
By default, none of the listed features/options are enabled.
4.1.24 - ParserInit()
Prototype:
void ParserInit(const char *file, int bufsize);
Description:
Allocates and initializes all the memory that the parser needs to function.
Once everything has been allocated and initialized, the Parser will load
in the requested number of bytes from the file.
Inputs:
*file - Name/Path of the file to load into the parser. A NULL pointer can
be passed in if you do not wish to load in an initial file.
bufsize - How many bytes to read in from the file at one time. If a 0 is
passed in, bufsize will default to 1024 (1 KB). This value
cannot be changed once it is specified.
Output:
N/A
Notes:
If this function is called more than once, the parser will automatically
call ParserDeInit(), in order to prevent leaking memory.
Multiple Parser states can be managed with:
Push/PopParserState()
Get/SetParserState()
4.1.25 - ParserIsEnabled()
Prototype:
int ParserIsEnabled(int flags);
Description:
Checks to see if the specified options/features are enabled. This function
will return 1 if all the specified options are enabled. If one or more
options are not enabled, a 0 will be returned.
Inputs:
flags - One or more features/options to check. Multiple features
can be checked at the same time by ORing the values together with |. For example,
ParserIsEnabled(PARSER_REGEX | PARSER_CASE_INSENSITIVE);
Output:
0 - One or more of the specified options are not enabled.
1 - All the specified options are enabled.
Notes:
See Options/Configuration for a list of defines.
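The enable/disable/query semantics described for ParserEnable(), ParserDisable(), and ParserIsEnabled() amount to ordinary bit-mask handling. The sketch below models that with hypothetical EX_* flag values and ex_* functions. The real values of the PARSER_* defines are in Parser.h and may differ.

```c
#include <assert.h>

/* Hypothetical flag values, one bit each. The real PARSER_* defines
   live in Parser.h and may use different values. */
#define EX_GLOBBING          0x01
#define EX_REGEX             0x02
#define EX_CASE_INSENSITIVE  0x04

static int ex_flags = 0;   /* nothing is enabled by default */

/* Sets every bit present in flags. */
void ex_enable(int flags)
{
    ex_flags |= flags;
}

/* Clears every bit present in flags. */
void ex_disable(int flags)
{
    ex_flags &= ~flags;
}

/* Returns 1 only if every requested flag is set, otherwise 0. */
int ex_is_enabled(int flags)
{
    return (ex_flags & flags) == flags;
}
```

Comparing the masked value against flags, rather than against zero, is what gives the "all of the specified options" behaviour.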
4.1.26 - ParserMemoryUsage()
Prototype:
int ParserMemoryUsage(void);
Description:
Returns an estimate of the total number of bytes the current parser state is
using. Generally, this number will be very accurate; however, certain
error conditions can skew the results.
Inputs:
N/A
Output:
Number of bytes of the heap the parser is using.
Notes:
Global variable memory is ignored; however, it tends to be very small. By
default, the Parser only uses 12 * sizeof(void *) bytes of global
variable memory.
4.1.27 - PeekByte()
Prototype:
unsigned char PeekByte(int offset);
Description:
Returns the next byte + the specified offset in the loaded file/memory. The
offset can be positive or negative. Requesting a byte before the start
or after the end of the file/memory will result in a 0.
This function does not modify the Parser's current location.
Inputs:
offset - Offset of the byte to get from the next byte in the file/memory.
Output:
Next byte + offset in the stream.
Notes:
N/A
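A bounds-checked peek with the documented out-of-range behaviour might look like this. The name peek_byte() and the explicit buffer, length, and position arguments are hypothetical, since the real PeekByte() works on the Parser's internal state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the byte at pos + offset, or 0 if that
   index lies before the start or at/after the end of the buffer. The
   current position is passed by value and is never modified. */
unsigned char peek_byte(const unsigned char *buf, long len, long pos,
                        long offset)
{
    long target;

    assert(buf != NULL && len >= 0 && pos >= 0 && pos <= len);

    target = pos + offset;
    if (target < 0 || target >= len) {
        return 0;
    }
    return buf[target];
}
```

As with the binary grab functions, a result of 0 can be either a real zero byte or an out-of-range request.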
4.1.28 - PeekToken()
Prototype:
char *PeekToken(void);
Description:
Same behavior as GrabToken(); although, the parser's position in the file
is not updated. Callback functions will still be called.
Inputs:
N/A
Output:
Pointer to the next token in the file.
Notes:
You are responsible for cleaning up the memory when you are done.
4.1.29 - PopParserState()
Prototype:
void PopParserState(void);
Description:
Deinitializes the current Parser state, and restores an old state.
If no old states exist, a new and uninitialized state will be created.
Inputs:
N/A
Output:
N/A
Notes:
It is safe to pop an empty state stack. This will just cause the current
state to be deinitialized.
4.1.30 - PrintErrorCode()
Prototype:
void PrintErrorCode(void);
Description:
Prints out the current status of the parser to stdout via printf().
Format:
<File Name>: <Error Message>
The file name will be the name of the parser file (Parser.c, by default).
The error message will be determined by the Error Code.
Inputs:
N/A
Output:
N/A
Notes:
N/A
4.1.31 - PushParserState()
Prototype:
int PushParserState(void);
Description:
Stores the current Parser state, and sets a new/uninitialized Parser state
as the active state.
Inputs:
N/A
Output:
0 - Could not push the parser state. The hard coded state stack size was
exceeded. See PARSER_STATE_STACK_SIZE in Parser.c for the
total number of states the Parser can keep track of.
1 - Parser State was pushed onto the state stack.
Notes:
See PARSER_STATE_STACK_SIZE in Parser.c to change the state stack size.
4.1.32 - Seek()
Prototype:
int Seek(const char *search);
Description:
Scans the file for the specified token. If the token is found, the position
of the parser will be updated to the character directly after the token.
If the token is not found, nothing in the parser will change.
PARSER_CASE_INSENSITIVE will be taken into account.
Inputs:
*search - C style string to search for in the file.
Output:
0 - The token was not found. The parser was not updated.
1 - The token was found. The parser was updated.
Notes:
Token sets are not factored in.
Globbing and RegEx are not supported by this function.
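The position update rule (move to just past the match, change nothing on a miss) can be sketched over an in-memory C string. The name seek_after() is hypothetical. The sketch is case sensitive, treats an empty search string as "not found" (an assumption), and does not model file buffering or PARSER_CASE_INSENSITIVE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: literal, case-sensitive search starting at *pos.
   On success, *pos is moved to the character directly after the match
   and 1 is returned. On failure, *pos is left untouched and 0 is
   returned. */
int seek_after(const char *data, size_t *pos, const char *search)
{
    const char *hit;

    assert(data != NULL && pos != NULL && search != NULL);

    if (*search == '\0' || *pos > strlen(data)) {
        return 0;
    }
    hit = strstr(data + *pos, search);
    if (hit == NULL) {
        return 0;
    }
    *pos = (size_t)(hit - data) + strlen(search);
    return 1;
}
```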
4.1.33 - SetFilePosition()
Prototype:
int SetFilePosition(long fpos);
Description:
Changes the position in the file that the parser scans for the tokens.
Inputs:
fpos - Where the parser should start parsing the file.
Output:
0 - Error occurred. Call ErrorCode() or PrintErrorCode() for more info.
1 - Parser's position was updated.
Notes:
N/A
4.1.34 - SetParserState()
Prototype:
void SetParserState(void *state);
Description:
Sets the Parser state to the specified Parser state.
Inputs:
*state - State the parser should use.
Output:
N/A
Notes:
No error checking is done here. The current Parser state will be lost. It is
highly recommended you call GetParserState() beforehand
in order to preserve the last parser state.
Setting state to 0, followed by calling ParserInit(), will create a new
Parser state.
4.1.35 - SetTokenSet()
Prototype:
int SetTokenSet(int tokenset);
Description:
Changes the current token set. The following defines can be passed to this
function:
Define Value Description
PARSER_TSET_DONT_SWITCH - -1 - Don't change the current token set
PARSER_TSET_LAST - -2 - Switch to the last active token set
Inputs:
tokenset - Index of the token set to change to.
Output:
-1 - The parser was not initialized or the requested token set was not valid
0+ - Index of the token set switched to.
Notes:
N/A
4.1.36 - GenericDiscard()
Prototype:
char *GenericDiscard(char *str, void *unused);
Description:
Generic Parser callback designed to discard the next token. Namely,
this function simply calls free(GrabToken());
Inputs:
N/A
Output:
0 - Token separator that called the callback was discarded.
Notes:
N/A
The following functions are not implemented in string.h
or operate on different principles.
4.2.01 - RemoveWhiteSpaces()
Prototype:
int RemoveWhiteSpaces(char *sp);
Description:
Removes all spaces, new lines, carriage returns, and tabs from the specified
string.
Inputs:
*sp - string pointer - string to remove the white spaces from.
Output:
-1 - sp was not valid
0+ - New length of the string. The pointer will not be reallocated, so the
original string pointer should be valid.
Notes:
N/A
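An in-place implementation matching the description above could look like the following. The name remove_white_spaces() is hypothetical, and treating a NULL pointer as the "not valid" case is an assumption. Only the four characters named in the description (space, new line, carriage return, tab) are removed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: removes spaces, new lines, carriage returns, and
   tabs from sp in place. Returns the new length, or -1 if sp is NULL.
   The buffer is never reallocated, so the caller's pointer stays valid. */
int remove_white_spaces(char *sp)
{
    char *src;
    char *dst;

    if (sp == NULL) {
        return -1;
    }
    /* dst trails src and receives only the characters that are kept. */
    for (src = dst = sp; *src != '\0'; src++) {
        if (*src != ' ' && *src != '\n' && *src != '\r' && *src != '\t') {
            *dst++ = *src;
        }
    }
    *dst = '\0';
    return (int)(dst - sp);
}
```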
4.2.02 - ToUpper()
Prototype:
char *ToUpper(char *sp);
Description:
Converts a C style string to upper case in place. IE, the string you pass
in will be directly modified.
Inputs:
*sp - pointer to the string to convert to upper case.
Output:
Original pointer that was passed in.
Notes:
N/A
4.2.03 - ToLower()
Prototype:
char *ToLower(char *sp);
Description:
Converts a C style string to lower case in place. IE, the string you pass
in will be directly modified.
Inputs:
*sp - pointer to the string to convert to lower case.
Output:
Original pointer that was passed in.
Notes:
N/A
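A minimal sketch of the in-place case conversion described for ToUpper() and ToLower(). The Sketch* names are hypothetical, and returning NULL for a NULL input is an assumption; the documentation does not say how the real functions treat it.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for ToUpper(); modifies the caller's buffer. */
char *SketchToUpper(char *sp)
{
    char *p;

    if (sp == NULL)
        return NULL;                    /* assumption: NULL in, NULL out */
    for (p = sp; *p != '\0'; ++p)
        *p = (char)toupper((unsigned char)*p);
    return sp;                          /* original pointer, as documented */
}

/* Hypothetical stand-in for ToLower(); same shape, opposite direction. */
char *SketchToLower(char *sp)
{
    char *p;

    if (sp == NULL)
        return NULL;
    for (p = sp; *p != '\0'; ++p)
        *p = (char)tolower((unsigned char)*p);
    return sp;
}
```

Returning the original pointer lets a call be nested inside another expression, for example as an argument to a comparison function.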
4.2.04 - Dup() - Top of Page
Prototype:
char *Dup(const char *sp);
Description:
Creates a copy of the specified string.
Inputs:
*sp - String to make a copy of.
Output:
Pointer to the new chunk of memory.
Notes:
You are responsible for cleaning up the returned pointer by calling free().
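The ownership rule in the note can be illustrated with a small stand-in. SketchDup() is a hypothetical name, and returning NULL on a NULL input or a failed allocation is an assumption rather than documented behaviour.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for Dup(); the caller owns the returned memory. */
char *SketchDup(const char *sp)
{
    size_t len;
    char *copy;

    if (sp == NULL)
        return NULL;                    /* assumption: nothing to copy */

    len = strlen(sp);
    copy = (char *)malloc(len + 1);     /* +1 for the null terminator */
    if (copy == NULL)
        return NULL;

    memcpy(copy, sp, len + 1);          /* copies the terminator too */
    return copy;
}
```

Each successful call should be paired with exactly one free() once the copy is no longer needed.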
4.2.05 - DupLen() - Top of Page
Prototype:
char *DupLen(const char *sp, int len);
Description:
Creates a copy of the specified string. The null terminator is appended
automatically, i.e. you can simply pass strlen(<string>) as len.
Inputs:
*sp - String to make a copy of.
len - Length of the string/position of the null terminator.
Output:
Pointer to the new chunk of memory.
Notes:
You are responsible for cleaning up the returned pointer by calling free().
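A sketch of the length-based copy, assuming len counts characters and excludes the terminator (which matches the strlen() remark above). SketchDupLen() is a hypothetical name and the NULL and negative-length checks are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for DupLen(); copies len characters and terminates. */
char *SketchDupLen(const char *sp, int len)
{
    char *copy;

    if (sp == NULL || len < 0)
        return NULL;                    /* assumption: reject bad input */

    copy = (char *)malloc((size_t)len + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, sp, (size_t)len);      /* caller guarantees len readable bytes */
    copy[len] = '\0';                   /* terminator is attached for the caller */
    return copy;
}
```

Passing a len shorter than the source yields a prefix, which is convenient when the length of a token is already known.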
4.2.06 - DupRange() - Top of Page
Prototype:
char *DupRange(const char *sp, int start, int end);
Description:
Creates a copy of a specific part of a string.
Inputs:
*sp - String to make a partial copy of
start - Index in the string to start copying data from
end - Where to stop/last character to copy
Output:
Pointer to the duplicated chunk of the string.
Notes:
You are responsible for cleaning up the returned pointer by calling free().
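The range copy can be sketched as follows. It assumes end is inclusive, which is one reading of "last character to copy"; SketchDupRange() is a hypothetical name and the argument checks are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for DupRange(); treats end as inclusive. */
char *SketchDupRange(const char *sp, int start, int end)
{
    size_t count;
    char *copy;

    if (sp == NULL || start < 0 || end < start)
        return NULL;                    /* assumption: reject bad ranges */

    count = (size_t)(end - start) + 1;  /* inclusive range length */
    copy = (char *)malloc(count + 1);
    if (copy == NULL)
        return NULL;

    memcpy(copy, sp + start, count);    /* caller guarantees end is in bounds */
    copy[count] = '\0';
    return copy;
}
```

With this reading, SketchDupRange("key=value", 4, 8) yields the five characters at indices 4 through 8.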
4.2.07 - DupRangeFile() - Top of Page
Prototype:
char *DupRangeFile(const char *file, int start, int end);
Description:
Opens up the specified file, and then reads in the data range to a buffer.
Inputs:
*file - Name/path of the file to read
start - Where in the file to start reading in the data
end - Where to stop reading in data
Output:
0 - The file name was not valid, or the memory couldn't be allocated.
1+ - Pointer to the new buffer containing the requested data.
Notes:
You are responsible for cleaning up the returned pointer by calling free().
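A sketch of reading a byte range from a file with the standard C library. It assumes start and end are byte offsets with end inclusive, and it null-terminates the buffer for convenience; neither detail is stated above. SketchDupRangeFile() is a hypothetical name.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for DupRangeFile(); returns NULL (0) on failure. */
char *SketchDupRangeFile(const char *file, int start, int end)
{
    FILE *fp;
    char *buf;
    size_t want;
    size_t got;

    if (file == NULL || start < 0 || end < start)
        return NULL;

    fp = fopen(file, "rb");             /* binary mode keeps offsets exact */
    if (fp == NULL)
        return NULL;

    want = (size_t)(end - start) + 1;   /* assumption: end is inclusive */
    buf = (char *)malloc(want + 1);
    if (buf == NULL || fseek(fp, (long)start, SEEK_SET) != 0) {
        free(buf);                      /* free(NULL) is a no-op */
        fclose(fp);
        return NULL;
    }

    got = fread(buf, 1, want, fp);      /* may be short near end of file */
    buf[got] = '\0';
    fclose(fp);
    return buf;
}
```

If the range runs past the end of the file, this sketch returns whatever bytes were available rather than failing.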
4.2.08 - Cmp() - Top of Page
Prototype:
char Cmp(const char *osp, const char *osp2);
Description:
Compares two strings. Cmp() differs from strcmp() (string.h) in two
ways. First, Cmp() returns 1 if the strings match and 0 if not. Second,
Cmp() is case-insensitive and ignores white space.
Inputs:
*osp - First string to compare
*osp2 - Second string to compare
Output:
0 - Strings don't match
1 - Strings match
Notes:
N/A
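The two differences from strcmp() can be sketched like this. It assumes "white space" means the same four characters that RemoveWhiteSpaces() strips and that a NULL argument counts as a mismatch. SketchCmp() and SkipSpace() are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>

/* Advance past spaces, tabs, new lines, and carriage returns. */
static const char *SkipSpace(const char *p)
{
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
        ++p;
    return p;
}

/* Hypothetical stand-in for Cmp(): 1 = match, 0 = no match. */
char SketchCmp(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return 0;                       /* assumption: NULL never matches */

    for (;;) {
        a = SkipSpace(a);
        b = SkipSpace(b);
        if (*a == '\0' || *b == '\0')
            return (char)(*a == *b);    /* match only if both ended */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
}
```

Note the inverted sense compared with strcmp(): a match is reported as 1, so the result can be used directly in an if statement.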
The following functions are only meant to be called from within
the parser. Making these functions public and calling them
externally will have undefined results.
4.3.01 - BuildRange() - Top of Page
Description:
Converts a RegEx range (e.g. "[1-4abcd]") into a bitfield for easy processing.
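One way to picture the conversion is a 256-bit set with one bit per byte value. The sketch below is an assumption about the general technique, not the Parser's internal layout; it handles only single characters and a-b spans (no negation or escapes), and every name in it is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define RANGE_BYTES 32                  /* 256 bits, one per byte value */

/* Hypothetical sketch: returns 0 on success, -1 if the range is malformed. */
int SketchBuildRange(const char *range, unsigned char bits[RANGE_BYTES])
{
    const unsigned char *p;

    if (range == NULL || bits == NULL || range[0] != '[')
        return -1;

    memset(bits, 0, RANGE_BYTES);
    p = (const unsigned char *)range + 1;

    while (*p != '\0' && *p != ']') {
        unsigned int lo = *p;
        unsigned int hi = *p;
        unsigned int c;

        if (p[1] == '-' && p[2] != '\0' && p[2] != ']') {
            hi = p[2];                  /* span such as 1-4 */
            p += 3;
        } else {
            p += 1;                     /* single character */
        }
        if (hi < lo)
            return -1;
        for (c = lo; c <= hi; ++c)
            bits[c >> 3] |= (unsigned char)(1u << (c & 7));
    }
    return (*p == ']') ? 0 : -1;        /* the closing bracket is required */
}

/* Membership test: a shift and a mask instead of rescanning the range. */
int SketchRangeHas(const unsigned char bits[RANGE_BYTES], unsigned char c)
{
    return (bits[c >> 3] >> (c & 7)) & 1;
}
```

Once the bitfield is built, testing a character costs a constant amount of work regardless of how many characters the range lists.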
4.3.02 - Compile() - Top of Page
Description:
Compiles a RegEx to its respective byte code.
4.3.03 - DisableRegEx() - Top of Page
Description:
Disables RegEx handling for the specified token separator and restores
its original form.
4.3.04 - DisableWildcard() - Top of Page
Description:
Disables glob/wildcard handling for the specified token separator and
restores its original form.
4.3.05 - EnableRegEx() - Top of Page
Description:
Compiles the specified token separator into a RegEx byte code.
4.3.06 - EnableWildcard() - Top of Page
Description:
Converts the specified token separator into an easy-to-consume version of the
glob/wildcard based separator.
4.3.07 - ForwardSearch() - Top of Page
Description:
Searches from the current location in the parser for the end of the
glob/wildcard separator.
4.3.08 - ForwardSearchReg() - Top of Page
Description:
RegEx version of ForwardSearch().
4.3.09 - GrabLeftover() - Top of Page
Description:
Returns any data that was left in the parser. This function is called once
the end of the file is reached, and no more tokens have been found.
4.3.10 - GrabNextChunk() - Top of Page
Description:
This function handles all file input. It will allocate the space for the
buffer, if required, and then read in the next chunk of the file.
4.3.11 - GTChar*() - Top of Page
Description:
These functions are called by GrabToken(). Each function is an optimized
version of the search algorithms. While this group of functions can
be reduced to 3 functions, the performance loss is not worth it.
4.3.12 - HandleEscapes() - Top of Page
Description:
Converts all escape sequences ("\n\r\b...") into an easy to process form.
4.3.13 - InvalidRegEx() - Top of Page
Description:
Performs some basic syntax check on the specified RegEx.
4.3.14 - PDupRangeFile() - Top of Page
Description:
Optimized version of DupRangeFile() that takes the current Parser state
into account.
4.3.15 - PreserveBufferState() - Top of Page
Description:
Preserves the Parser's current location in the file/memory, in order to
enable forward searching or similar.
4.3.16 - PreserveTSetHistory() - Top of Page
Description:
Preserves the current token set history to enable forward searching or
similar.
4.3.17 - PrintCompiled() - Top of Page
Description:
Prints a compiled RegEx's byte code to the command prompt for debugging
purposes.
4.3.18 - ProcessToken() - Top of Page
Description:
Whenever a token is found, this function is called to handle all logic
attached to the token.
4.3.19 - ProcessTokenWild() - Top of Page
Description:
RegEx and Globbing version of ProcessToken().
4.3.20 - ReadBinary() - Top of Page
Description:
Grabs the specified number of bytes from the current Parser state. This
function is called by GrabBinary*() and GrabBytes().
4.3.21 - RebuildHash() - Top of Page
Description:
Builds or rebuilds a hash table out of the first character of each token
separator.
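A first-character table of this kind can be pictured as a 256-slot array indexed by byte value. The sketch below is an assumption about the idea only; the Parser's real table layout is not documented here, and all names are hypothetical.

```c
#include <stddef.h>

#define FIRST_CHAR_SLOTS 256            /* one slot per possible first byte */

/*
 * Hypothetical sketch: table[c] receives the index of the first separator
 * that starts with byte c, or -1 when no separator starts with it.
 */
void SketchRebuildHash(const char *const *seps, int count,
                       int table[FIRST_CHAR_SLOTS])
{
    int i;

    for (i = 0; i < FIRST_CHAR_SLOTS; ++i)
        table[i] = -1;                  /* start from an empty table */

    for (i = 0; i < count; ++i) {
        unsigned char first;

        if (seps[i] == NULL || seps[i][0] == '\0')
            continue;                   /* nothing to index */
        first = (unsigned char)seps[i][0];
        if (table[first] == -1)
            table[first] = i;           /* keep the earliest entry */
    }
}
```

With such a table, a byte from the input that has no entry can be rejected immediately, without comparing it against every separator.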
4.3.22 - RestoreBufferState() - Top of Page
Description:
Restores a state returned from PreserveBufferState().
4.3.23 - RestoreTSetHistory() - Top of Page
Description:
Restores a state returned from PreserveTSetHistory().
4.3.24 - ShiftRight() - Top of Page
Description:
Shifts a RegEx byte code over by the specified number of bytes.
4.3.25 - SortTokenSet() - Top of Page
Description:
Performs a weak sort on all the token separators. All tokens are sorted
based on their first character, followed by the length of each token
separator. This ensures that all supersets of separators will be
checked before any subsets.
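The ordering described above can be sketched with qsort(): group by first character, then place longer separators first so that "<=" is tried before "<". This is an illustration only; the Parser's actual sort is not shown in this document, and the names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Order by first character, then by length with the longest first. */
static int CompareSeps(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    unsigned char fa = (unsigned char)sa[0];
    unsigned char fb = (unsigned char)sb[0];
    size_t la;
    size_t lb;

    if (fa != fb)
        return (fa < fb) ? -1 : 1;

    la = strlen(sa);
    lb = strlen(sb);
    if (la != lb)
        return (la > lb) ? -1 : 1;      /* supersets before subsets */
    return 0;                           /* ties stay unordered: a weak sort */
}

/* Hypothetical stand-in for SortTokenSet() over a plain string array. */
void SketchSortTokenSeps(const char **seps, size_t count)
{
    qsort(seps, count, sizeof *seps, CompareSeps);
}
```

Separators that share both a first character and a length compare as equal, so their relative order is unspecified, which fits the description of a weak sort.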
4.3.26 - UpdateThreads() - Top of Page
Description:
Performs one iteration on a group of RegEx threads.
4.3.27 - WriteThreadBefore() - Top of Page
Description:
Writes a RegEx JUMP or THREAD instruction before the specified byte code
block.
Returned by ErrorCode()
Define Value Description
PARSER_NOT_INITIALIZED - -1 - Not Initialized. Call ParserInit() first.
PARSER_NO_ERROR - 0 - No Error
PARSER_COULD_NOT_OPEN_FILE - 1 - Could not open the specified file.
PARSER_OUT_OF_MEMORY - 2 - Could not allocate the required memory
PARSER_END_OF_FILE - 3 - Reached the end of the file
PARSER_MEMORY_NOT_VALID - 4 - Data passed in to LoadMemory() was null
PARSER_GRAB_TOKEN_IGNORE - 5 - GrabToken() or similar was called from a
function callback with ignore set to 1.
This is not supported.
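The table above maps naturally onto a switch statement. The sketch below repeats the defines with the listed values so that it compiles on its own; remove them when including the Parser's own header. SketchDescribeError() is a hypothetical helper, not part of the Parser.

```c
/* Values copied from the table above. */
#define PARSER_NOT_INITIALIZED     (-1)
#define PARSER_NO_ERROR            0
#define PARSER_COULD_NOT_OPEN_FILE 1
#define PARSER_OUT_OF_MEMORY       2
#define PARSER_END_OF_FILE         3
#define PARSER_MEMORY_NOT_VALID    4
#define PARSER_GRAB_TOKEN_IGNORE   5

/* Hypothetical helper: turns a value from ErrorCode() into a message. */
const char *SketchDescribeError(int code)
{
    switch (code) {
    case PARSER_NOT_INITIALIZED:     return "Not initialized";
    case PARSER_NO_ERROR:            return "No error";
    case PARSER_COULD_NOT_OPEN_FILE: return "Could not open the specified file";
    case PARSER_OUT_OF_MEMORY:       return "Could not allocate the required memory";
    case PARSER_END_OF_FILE:         return "Reached the end of the file";
    case PARSER_MEMORY_NOT_VALID:    return "Data passed in to LoadMemory() was null";
    case PARSER_GRAB_TOKEN_IGNORE:   return "GrabToken() called from an ignore callback";
    default:                         return "Unknown error code";
    }
}
```

A caller would typically pass the result of ErrorCode() straight in, treating PARSER_END_OF_FILE as the normal way for a parsing loop to finish.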
AddTokenSeparator() SwitchTo defines:
Define Value Description
PARSER_TSET_DONT_SWITCH - -1 - Don't change the current TSet
PARSER_TSET_LAST - -2 - Switch to the last active TSet
AddTokenSeparator() return values:
Define Description
PARSER_NO_ERROR - Token has been added.
PARSER_TOKEN_NULL - Token is NULL/0, and thus invalid
PARSER_OUT_OF_MEMORY - Could not allocate enough memory for the token
PARSER_REGEX_COMPILE_ERROR - Could not compile the given RegEx
PARSER_UNBALANCED_PARENS - Unbalanced () in RegEx
PARSER_UNBALANCED_BRACKETS - Unbalanced [] in RegEx
Parser Options - See Options/Configuration for descriptions
PARSER_HASH
PARSER_CASE_INSENSITIVE
PARSER_WILDCARD
PARSER_GLOBBING
PARSER_REGULAR_EXPRESSIONS
PARSER_REGEX
PARSER_CLOSE_FILE
PARSER_SORT_TOKEN_SEPS
PARSER_CONST_FILE_NAME
PARSER_CONST_LOAD_MEMORY
PARSER_CONST_TOKEN_SEPS
PARSER_OWNS_FILE_NAME
PARSER_OWNS_LOAD_MEMORY
PARSER_OWNS_TOKEN_SEPS
Other:
PARSER_CALLBACK - Typedef of the Parser Callback prototype.
PCBACK - Same as PARSER_CALLBACK
RegEx Bugs:
Parser can enter infinite loop
Example: ".*" - Will always succeed without removing any characters from
the stream.
PeekToken() calls callbacks
If a token sep is set to ignore, any related callback may be called multiple
times for the same set of bytes in the file.
RegEx:
Capture Groups
Backreferences
Parsing Methods:
LALR
Ability to modify existing Token Separators
C++ Wrapper:
Use new and delete internally
return std::auto_ptr<char *> (std::string would duplicate the string)
Save Parser State As:
External Text File and Text Stream:
Human Readable and Easy To Edit
External Binary File and Byte Stream:
Fast and Small.
All preprocessing already done.
Issues:
Parser Callbacks & Params.
Function and Var registry?
Pluses:
No recompilation required to change how the parser works
External Script File
Parser v 8.9
Massive Performance Boosts across the board
Runtime configurable options
Case Insensitivity
Hashing
Globbing
RegEx (Work In Progress)
Some Memory Management
Close File After Read
Callback:
Takes char * - matched token sep
returns char * - value GrabToken() should return
Added GetParams();
Added GetFileSize();
General Bug Fixes
Reduced Indirection in Internal Structs
C++ Wrapper:
Implemented Copy Constructor
Smarter Caching
Fixed long standing GrabBinary*() buffer size limitation
GrabChar() -> GrabByte(), due to confusion
Removed DeSmet C support
Parser v 8.0
Function callback is now passed a void *
- AddTokenSeparator() now takes an additional parameter
Restricted calling GrabToken() from the callback of a Token with ignore == 1
This would cause infinite recursion.
Fixed a possible buffer over read
Token order is now preserved correctly
Various Optimizations
Bulk of String Manipulation Functions now use const when possible
Performance Delta:
GNU: ~6% Faster
MS : No performance difference
Note: A few new warnings have been introduced, and need to be fixed
Parser v 7.1
Bug fix relating to recursion caused by Parser callback function calling
GrabToken().
Added Get/SetParserState()
Added GenericDiscard() Parser callback, since it is
a fairly common function.
Parser v 7.0
Moved most of the documentation to this html file
A few bug/broken logic fixes
Parser v 6.0
Dropped C++ build
Added DeSmet C support - strict ANSI C
Several bug/broken logic fixes
Reduced requested frees and allocs by ~66%
Massive Code Cleanup
Removed a lot of redundant code
Improved internal error handler
Added function callbacks
Cleaned up documentation
Parser v 5.0
Began testing on Linux
Several bug/broken logic fixes
Massive performance boost to internal file handler (~60% faster!)
Parser v 4.0
Implemented binary support
Expanded internal File Handler
Load Files Dynamically
Load Memory Dynamically
Improved Internal Error Handler
Parser v 3.0
Implemented internal File Handler
Parser v 2.0
Added a C++ build
Added the bulk of the String Manipulation Functions
Parser v 1.0 - Original build with
Multiple Token Set Support
Token Separators with logic:
Return
switchto
ignore