Parser
v 8.9
By: Craig Williams
Index
1.0 - Overview
    The Parser is designed to be/have:
        Portable
        No Conditional Compilation
        Simple To Use
        No Obscure Data Types
        No Obscure Macros
        Documented
        No Tabs
        Block Formatting/Alignment
        Dynamic, Fast, and Safe
        A Complete Package in one File
        Truly Free - Public Domain
    How is this Parser different?
2.0 - Features
    2.01 - File Handling
    2.02 - Tokens & Token Sets
        Token Parameters/Definable Logic
            Return
            switchto
            ignore
                Sample Program: New Line Counter
            fun
            params
        Token Sets
            Sample Program: Print comments and Strings
    2.03 - Text Mode Functions
    2.04 - Binary Mode Functions
    2.05 - Function Callbacks
    2.06 - String Manipulation Functions
    2.07 - Options/Configuration
    2.08 - Search Methods
        Linear
        Globbing
        RegEx
    2.09 - State Management
        Manual
        Stack
    2.10 - C++ Wrapper
3.0 - Using the Parser
    3.1 - Compiling the Parser
    3.2 - Initialization
    3.3 - Setting up the Token Set
    3.4 - Parsing the file
    3.5 - Deinitialization
4.0 - Functions
    4.1 - Public Parser Functions
        4.1.01 - AddTokenSep()
        4.1.02 - AddTokenSeparator()
        4.1.03 - AddTokenSet()
        4.1.04 - End()
        4.1.05 - ErrorCode()
        4.1.06 - GetFilePosition()
        4.1.07 - GetFileSize()
        4.1.08 - GetParams()
        4.1.09 - GetParserState()
        4.1.10 - GetTokenSet()
        4.1.11 - GrabBinaryFloat()
        4.1.12 - GrabBinaryInt()
        4.1.13 - GrabByte()
        4.1.14 - GrabBytes()
        4.1.15 - GrabFloat()
        4.1.16 - GrabInt()
        4.1.17 - GrabToken()
        4.1.18 - LoadFile()
        4.1.19 - LoadMemory()
        4.1.20 - LoadMemoryLen()
        4.1.21 - ParserDeInit()
        4.1.22 - ParserDisable()
        4.1.23 - ParserEnable()
        4.1.24 - ParserInit()
        4.1.25 - ParserIsEnabled()
        4.1.26 - ParserMemoryUsage()
        4.1.27 - PeekByte()
        4.1.28 - PeekToken()
        4.1.29 - PopParserState()
        4.1.30 - PrintErrorCode()
        4.1.31 - PushParserState()
        4.1.32 - Seek()
        4.1.33 - SetFilePosition()
        4.1.34 - SetParserState()
        4.1.35 - SetTokenSet()
        4.1.36 - GenericDiscard()
    4.2 - String Manipulation Functions
        4.2.01 - RemoveWhiteSpaces()
        4.2.02 - ToUpper()
        4.2.03 - ToLower()
        4.2.04 - Dup()
        4.2.05 - DupLen()
        4.2.06 - DupRange()
        4.2.07 - DupRangeFile()
        4.2.08 - Cmp()
    4.3 - Private Parser Functions
        4.3.01 - BuildRange()
        4.3.02 - Compile()
        4.3.03 - DisableRegEx()
        4.3.04 - DiscardWildcard()
        4.3.05 - EnableRegEx()
        4.3.06 - EnableWildcard()
        4.3.07 - ForwardSearch()
        4.3.08 - ForwardSearchReg()
        4.3.09 - GrabLeftover()
        4.3.10 - GrabNextChunk()
        4.3.11 - GTChar*()
        4.3.12 - HandleEscapes()
        4.3.13 - InvalidRegEx()
        4.3.14 - PDupRangeFile()
        4.3.15 - PreserveBufferState()
        4.3.16 - PreserveTSetHistory()
        4.3.17 - PrintCompiled()
        4.3.18 - ProcessToken()
        4.3.19 - ProcessTokenWild()
        4.3.20 - ReadBinary()
        4.3.21 - RebuildHash()
        4.3.22 - RestoreBufferState()
        4.3.23 - RestoreTSetHistory()
        4.3.24 - ShiftRight()
        4.3.25 - SortTokenSet()
        4.3.26 - UpdateThreads()
        4.3.27 - WriteThreadBefore()
5.0 - Define List
6.0 - Known Bugs
7.0 - Planned Features
8.0 - Change Log
1.0 - Overview
The Parser was originally designed to meet my own needs. Since its original creation, the parser has been expanding into something simply beautiful. A lot of time and thought has been put into the library to ensure that it behaves as it should. The library is written entirely by me and is released to the Public Domain.

One quick note: this entire document was hand written in Crimson Editor. That, combined with my horrible spelling & grammar, will lead to various errors in the document.

The parser was designed with several specific concepts/features in mind.

Portable:
Portability is one of the main concepts that I spent a lot of time on. If the platform you are compiling on supports ANSI C with stdio.h, stdlib.h, string.h, and limits.h, the library should compile without any issues. This library (v 8.9) has even been tested and found to work correctly on the Nintendo Wii with minimal modifications (File I/O is different, due to the DVD drive).

In order to ensure portability, the Parser was tested on a Windows and a Linux machine. Unfortunately, I don't have access to a Mac development box, so I'm forced to make do. Since the parser does not require any GUI, I was able to rely on strict ANSI C (C89). Multiple compilers are used and tested as follows:

Windows:
    MS Compiler
        cl *.c /W4
    GNU
        gcc *.c -Wall -Wextra -ansi -pedantic
        gcc *.c -Wall -Wextra -ansi -pedantic -mno-cygwin
        g++ *.c -Wall -Wextra -ansi -pedantic
    Borland Compiler
        bcc -w *.c
    Digital Mars
        dmc file.c -A
    Intel Compiler
        icl *.c /W3
Linux:
    GNU
        gcc *.c -Wall -Wextra -ansi -pedantic
        g++ *.c -Wall -Wextra -ansi -pedantic

For each compiler, the maximum warning level is turned on. No warnings or errors are acceptable for a release version. The only exception to this policy lies with the Microsoft Compiler. /Wall is the maximum; however, it generates warnings from the Microsoft headers (none are from Parser.h or Parser.c), so /W4 is used instead.
No Conditional Compilation:
Conditional compilation (#ifdef & such) is avoided as much as possible. It is only used for standard header guards and nothing else. In other words, you do not have to add any defines to a build script or similar to get the Parser to compile.

Simple To Use:
To make the library as simple to use as possible, several conventions are followed. Firstly, all the function names and such follow my coding standards (included). Other than that, all of the code is in two files, a .c and an .h. I find it far simpler to just copy two files into your project, and then include one file (#include "Parser.h", by default) to get the library to work with other code. No defines or special compilation flags are required.

In order to increase the simplicity, a file handler was built into the project. Personally, I find manual file I/O to be rather ugly and fairly inefficient. As such, the Parser provides a comprehensive and fairly robust built in File I/O system to automate the process.

No Obscure Data Types:
No #define'd or typedef'd types are used, beyond typedefs that remove the need for the "struct" keyword. This should make it obvious what type of data each variable takes.
Note: The function callback type is a typedef; however, this is only provided to simplify the casting procedure. It is explicitly defined everywhere else.

No Obscure Macros:
Yes, macros can be really helpful; however, they add another level of obscurity to your code. There are no macros used at the public level; however, a tiny number of internal macros are used. Any macro that is used must be understandable from its name alone.

Documented:
To improve your understanding of the project, documentation is added to the project. Every function has a function header and there is a file header in every file. Comments are added to the code; however, worthless comments are avoided as much as possible.
The external documentation provided with the project is not automatically generated by some program like Doxygen or similar. While these tools can be nice, they do not tell you anything that reading the code won't tell you, and are thus less helpful than a manually authored document. As such, a comprehensive document written in one of the most commonly accessed document types is provided.

No Tabs:
Tabs are evil. Rather, tabs mixed with spaces are evil. While tabs may be set to 4 spaces in one program, they may be set to x spaces in another program or on another computer. This destroys the alignment and code flow, so they are all removed. Unfortunately, this does increase the file size.

Block Formatting/Alignment:
I tend to be obsessive about alignment. I find it far easier to look at and read code if it is separated into blocks.

Dynamic, Fast, and Safe:
The library was designed to be as dynamic as possible, without sacrificing a huge number of cycles. All code is benchmarked, and weighed against the usefulness of the feature. If a feature eats up a huge number of cycles while not being very useful, it will not be implemented.

To ensure that the Parser is safe, a few cycles are spent on error checking. All allocated memory is checked to ensure that it is valid (not null), and buffer under/over reads are checked. Valgrind and my own memory manager are used to check for memory related issues and memory leaks. No memory leaks are tolerated. The memory checking mechanism is removed for release.

That being said, it is still possible to crash the program through the parser. If you pass in a pointer to a bad chunk of memory, the parser will most probably crash.

A Complete Package in one File:
The parser does not depend on any other libraries, other than the standard C library. Namely, stdio.h, stdlib.h, string.h, and limits.h.
Truly Free - Public Domain:
This library is released to the public domain. You are free to do anything you want with it.

How is this Parser different?
The Parser was not designed to conform to traditional parser designs. Generally, the differences can be summed up as follows:

Parser vs Traditional Parser:
    Modifiable at runtime
    Built in dynamic lexical analyzer
    Smart built in File I/O
    No required external scripting
    Easy to iterate with
    Built in state system

Traditional parsers are generally made through a parser generator. This alone can make integration into a larger project and rapid iteration extremely difficult. As such, the Parser is built and modifiable at runtime. On top of this, a C/C++ interface is provided in order to allow for easy integration into an existing project, rather than writing a script to generate a .c file, which then may require some additional modifications to fully integrate said .c file into your project.

Along with this, all traditional parsers I've seen have only been designed to run as one instance. As such, the Parser is wrapped up in a fairly easy to manage state system, to allow multiple Parsers to exist at one time.

With that being said, there are plans to expand the Parser to incorporate some more traditional parser technologies such as BNF style scripts and LALR generation, which, again, will be done at runtime.
2.0 - Features
2.01 - File Handling
A built in file handling system is implemented in the parser. The file handling system includes a file fragmentation/caching system. When ParserInit(<file>, <bufsize>) is called, a buffer size is specified. When the data is read in from the file, the specified buffer size determines how many bytes are read in at a time. Since a token separator can fall across the ends of the data read in, the parser accounts for this fragmentation.

For example, take
    bufsize - 3
    data    - "0123456789ABCDEF"

With a bufsize of 3, the parser will fragment the file into
    Fragment 1 - "012"
    Fragment 2 - "345"
    Fragment 3 - "678"
    Fragment 4 - "9AB"
    Fragment 5 - "CDE"
    Fragment 6 - "F"

If a token separator was declared as "234", it would not be detected, since the entire string would never be in the input buffer. To handle this, the buffer size is expanded by the length of the longest token separator - 1. IE, if "234" was the only separator, then the buffer size would be expanded by 2 (Length("234") - 1).

    Original Buffer:
         -- -- --
        |  |  |  |
         -- -- --
    Expanded:
         -- -- -- -- --
        |  |  |  |  |  |
         -- -- -- -- --
    Note: A null terminator is attached as well, but it is not represented.

When the data is actually read in, the last (longest sep - 1) bytes of the previous fragment are attached to the front of the buffer. This ensures that all the characters are checked against all the possible token separators.

    bufsize     - 3
    Longest Sep - 3
    Actual buf  - 5 (plus a null terminator, so it's actually 6 bytes)
    data        - "0123456789ABCDEF"

    Fragment 1 - "012"
    Fragment 2 - "12345"
    Fragment 3 - "45678"
    Fragment 4 - "789AB"
    Fragment 5 - "ABCDE"
    Fragment 6 - "DEF"

This does force some redundant checking; however, it is far more important that the parser correctly locates the tokens. If performance is an issue, a low number of short token separators with a larger buffer size will greatly increase performance.
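The effect of the overlap can be checked with a small stand-alone sketch. This is not the Parser's actual buffering code: CountFragmentsWithSep() and MAX_FRAGMENT are hypothetical names made up for this illustration, and the "file" is just a string in memory. It walks the data in reads of bufsize bytes, optionally carries the tail of the previous fragment to the front of the next one, and counts how many fragments contain the separator.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FRAGMENT 64 /* Hypothetical cap, large enough for this sketch */

/* Hypothetical helper, not part of Parser.h.
 * Walks `data` in reads of `bufsize` bytes, prepending the last `overlap`
 * bytes of the previous fragment to each new one, and returns how many
 * fragments contain `sep`. Returns -1 if a fragment would not fit. */
static int CountFragmentsWithSep(const char *data, size_t bufsize,
                                 size_t overlap, const char *sep)
{
    char   frag[MAX_FRAGMENT];
    size_t len   = strlen(data);
    size_t pos   = 0; /* Next unread byte in data        */
    size_t prev  = 0; /* Length of the previous fragment */
    int    found = 0;

    if (bufsize == 0 || bufsize + overlap + 1 > sizeof frag)
        return -1;

    while (pos < len) {
        size_t keep = prev < overlap ? prev : overlap;
        size_t n    = len - pos < bufsize ? len - pos : bufsize;

        /* Slide the tail of the previous fragment to the front */
        memmove(frag, frag + prev - keep, keep);

        /* Append the newly "read" bytes and null terminate */
        memcpy(frag + keep, data + pos, n);
        frag[keep + n] = '\0';

        if (strstr(frag, sep) != NULL)
            found++;

        pos  += n;
        prev  = keep + n;
    }

    return found;
}
```

With data "0123456789ABCDEF", a bufsize of 3 and the separator "234", an overlap of 0 never sees the separator, while an overlap of 2 (Length("234") - 1) sees it in the second fragment.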
Larger buffer sizes decrease the number of reads from the hard drive; however, the memory footprint of the parser will increase. The smallest possible memory footprint can be achieved by setting the buffer size to 1; however, it is far slower. A buffer size of 1024 bytes (1 KB) is recommended for general purposes.

2.02 - Tokens & Token Sets
Token Separators:
The actual parsing syntax of the parser is defined as "Token Separators". There are several tokenizers on the market (including one in string.h) that are fast; however, a single character is not always enough to make parsing a file simple and easy. As such, full strings are used and scanned for. I refer to the "delimiters" as "Token Separators", since additional logic can be attached to them. This library is called a parser, instead of a tokenizer, due to this additional logic. Not to mention that function callbacks are supported as well.

The order in which the tokens are added does matter. A token added before another token will have a higher priority.

The Token Separators can have the following logic attached to them via
    AddTokenSeparator(<token>, <Return>, <switchto>, <ignore>, <fun>, <params>)

Return - Should the token be returned when GrabToken() is called?
If this argument is set to true (1), the token will be returned. If it's set to false (0), the token will not be returned. This is extremely useful to filter out unneeded tokens. For example,

    data - "foo  bar" /* Notice the two spaces between foo & bar */

If the Separator is a space, and return is set to 1, namely
    AddTokenSeparator(" ", 1, ...);
    GrabToken(); /* Returns "foo" */
    GrabToken(); /* Returns " "   */
    GrabToken(); /* Returns " "   */
    GrabToken(); /* Returns "bar" */
    GrabToken(); /* Returns 0 - End of the data was reached */

If return is set to 0, namely
    AddTokenSeparator(" ", 0, ...);
    GrabToken(); /* Returns "foo" */
    /* Both of the spaces are not returned! */
    GrabToken(); /* Returns "bar" */
    GrabToken(); /* Returns 0 - End of the data was reached */

switchto - Should the active token set be changed when the sep is found?
Setting this parameter to -1 disables this feature. If it's set to anything else, the token set will automatically be changed to the specified token set, if it exists. Two defines can also be passed to this variable:

    Define                  - Value - Description
    PARSER_TSET_DONT_SWITCH - -1    - Don't change the current TSet
    PARSER_TSET_LAST        - -2    - Switch to the last active TSet

See Token Sets

ignore - Should the token separator be ignored?
This logic was originally designed to be used when parsing strings. For Example:

    string - "foo\"bar" /* Start and end quotation marks are part
                         * of the string */

Token Sep is a quotation mark, namely
    AddTokenSeparator("\"", 1, -1, 0, ...);
    GrabToken(); /* Returns quotation mark */
    GrabToken(); /* Returns "foo\\"        */
    GrabToken(); /* Returns quotation mark */
    GrabToken(); /* Returns "bar"          */
    GrabToken(); /* Returns quotation mark */
    GrabToken(); /* Returns 0 - End of the data was reached */

If you had wanted to preserve the string, you probably didn't want the parser to pick out the quotation mark from \". To fix this, add another token sep (\") with ignore set to 1 (the fourth argument):
    AddTokenSeparator("\"",   1, -1, 0, 0, 0); /* "  */
    AddTokenSeparator("\\\"", 0, -1, 1, 0, 0); /* \" */
    GrabToken(); /* Returns quotation mark */
    GrabToken(); /* Returns "foo\"bar"     */
    GrabToken(); /* Returns quotation mark */
    GrabToken(); /* Returns 0 - End of the data was reached */

Ignore is also useful when combined with a function callback. For example, say you wanted to count all the new lines (\n) in a file. You could set ignore to 1, and then set the function pointer to a function that increments a global variable that counts the number of new lines. When GrabToken() is called, it'll call the callback function whenever it runs into a new line.
Since Ignore is set to one, it'll continue to do this until it gets to the end of the file. From there, it'll return the entire file, but you'll have the total number of new lines in the file.
Note: You cannot call GrabToken() or similar from a callback with ignore set to 1 (true).

    /*************************************************************
     * Full Program - Retrieves the number of new lines from a file
     *
     * Note: You must change <any file> in ParserInit() to the
     *       name of the file you want to get the number of
     *       new lines from.
     *************************************************************/
    #include "Parser.h"
    #include <stdlib.h> /* free()   */
    #include <stdio.h>  /* printf() */

    char *NewLineCounter(char *str, int *newlines);

    int main(void)
    {
        int NewLines = 1;

        ParserInit(<any file>, 1024);
        AddTokenSeparator("\n", 0, -1, 1, (PCBACK)NewLineCounter, &NewLines);

        free(GrabToken()); /* Scan the whole file, and free whatever
                            * is returned */

        printf("New Lines: %d\n", NewLines);

        ParserDeInit();
        return 0;
    }

    char *NewLineCounter(char *str, int *newlines)
    {
        (*newlines)++;
        return str;
    }
    /*************************************************************
     * End of the program
     *************************************************************/

fun - Function to call whenever a token is found.
The prototype for the function is
    char *<function name>(char *str, void *params);

The return of the function should be:
    0  - Continue the search.
    1+ - Any string.

By default, the safest thing to do is to return str; (first parameter of the function). Any string returned by the callback is assumed to be owned by the parser. As such, it must be allocated by malloc(), calloc(), or realloc(). This enables you to replace any token separator with another string.

It is assumed that you own the str parameter. IE, if you do not plan to return the variable, and it is not 0/NULL, you should call free(str);. In the above program, if you changed the ignore parameter to a 0 (false), str will be 0/NULL.
params - Parameters to pass to the callback.
Check New Line Counter for an example on how to use this. Once a token separator is found and returned, you can retrieve this variable by calling GetParams().

To add a new Token Separator, you can call two different functions:
    AddTokenSep();       - Basic version that attaches default behavior.
    AddTokenSeparator(); - Advanced version that allows you to define the logic.

To use AddTokenSep(), you only have to pass in a pointer to a string. A 1+ will be returned if an error occurred and a 0 will be returned if the token was added.

Default behavior:
    Return   -  1 - Return the token when GrabToken() is called.
    switchto - -1 - Don't change the token set.
    ignore   -  0 - Don't ignore the token separator.
    fun      -  0 - Don't call a function.
    params   -  0 - No params to pass to the callback.

AddTokenSeparator() allows you to specify the logic of the token separator.

Token Sets:
To allow the Parser to be more dynamic, multiple Token Sets can be defined. A "Token Set" is just a set of tokens. Each token set is completely separate from the others. This allows the parser to switch the parsing syntax at runtime.

Once the parser is initialized (by calling ParserInit()), the initial token set will automatically be created and set as the active token set. The initial token set has an index of 0. To create a new token set, simply call
    AddTokenSet();

The function will create a new token set, set the new token set as the active token set, and then return the index of the token set. To properly handle the return value, you should create a descriptive variable to store the index.
For Example:
    ParserInit(0, 1024);               /* Set up parser and create token set 0 */
    int tset_comments = AddTokenSet(); /* Create tset that handles comments    */
    int tset_strings  = AddTokenSet(); /* Create tset that handles strings     */

Likewise, you can also get the current token set by calling:
    GetTokenSet();

While this is the proper way to handle token set indexes, feel free to just hard code the value. The initial token set is 0. Each call to AddTokenSet() will increase the index by 1. IE, in the example above, tset_comments will be 1, and tset_strings will be 2.

All calls to AddTokenSep(), AddTokenSeparator(), GrabToken(), PeekToken(), etc. will use the active token set. To change the active token set, simply call SetTokenSet():
    SetTokenSet(tset_comments); /* Set the active token set to handle comments */
    SetTokenSet(tset_strings);  /* Set the active token set to handle strings  */
    SetTokenSet(0);             /* Set the active tset to the initial tset     */

SetTokenSet() will return a -1 if the token set index you specified is invalid. Otherwise, SetTokenSet() will return the index you passed in.

To add further automation to the parser, you can specify which token set the parser will use whenever a token separator is located. To do this, specify the token set the parser should switch to as the switchto parameter.

For example, let's write a program that will print only the C style comments and strings from Parser.c.

First, we initialize the Parser. That will create the initial token set (0), which will handle all the switching between token sets and calling the proper function. Once the initial token set is created, create two additional token sets. The first token set will handle all the strings. The second will handle all the comments. Once we have all the token sets, we switch back to the initial token set; otherwise, we would add the token separators to the last created token set.
Now that the initial token set is active, and all the additional token sets have been created, we can start adding the parsing syntax.

Initial Token Set:
    AddTokenSeparator("\"", 0, tset_strings, 0, PrintString, 0);
    Separator: "
        Return   - 0             - Don't return it
        switchto - tset_strings  - Switch to tset that handles strings
        ignore   - 0             - Don't ignore the token
        fun      - PrintString   - Print the string to the screen
        params   - 0             - Don't pass any parameters

    AddTokenSeparator("/*", 0, tset_comments, 0, PrintComment, 0);
    Separator: /*
        Return   - 0             - Don't return it
        switchto - tset_comments - Switch to tset that handles comments
        ignore   - 0             - Don't ignore the token
        fun      - PrintComment  - Print comment to command prompt
        params   - 0             - Don't pass any parameters

With the initial token set set up, we need to set up the two additional token sets.

    SetTokenSet(tset_strings);

tset_strings:
    AddTokenSeparator("\\\\", 0, -1, 1, 0, 0);
    Separator: \\
        Return   - 0  - Don't return it
        switchto - -1 - Don't change the current token set
        ignore   - 1  - Ignore the token, so we don't break up strings
        fun      - 0  - Don't call any function
        params   - 0  - Don't pass any parameters

    AddTokenSeparator("\\\"", 0, -1, 1, 0, 0);
    Separator: \"
        Return   - 0  - Don't return it
        switchto - -1 - Don't change the current token set
        ignore   - 1  - Ignore the token, so we don't break up strings
        fun      - 0  - Don't call any function
        params   - 0  - Don't pass any parameters

    AddTokenSeparator("\"", 0, 0, 0, 0, 0);
    Separator: "
        Return   - 0  - Don't return it
        switchto - 0  - Switch to the initial token set
        ignore   - 0  - Don't ignore it
        fun      - 0  - Don't call a function
        params   - 0  - Don't pass any parameters

    SetTokenSet(tset_comments);

tset_comments:
    AddTokenSeparator("*/", 0, 0, 0, 0, 0);
    Separator: */
        Return   - 0  - Don't return it
        switchto - 0  - Switch to the initial token set
        ignore   - 0  - Don't ignore it
        fun      - 0  - Don't call a function
        params   - 0  - Don't pass any parameters

With all the token sets set up, we need to switch back to the initial token set before we can start parsing the file.
    SetTokenSet(0);

With everything set up, the actual parsing is fairly automatic. When GrabToken() is called, it'll first scan the file for a separator. If it finds a /*, it'll switch to tset_comments, and then call the function PrintComment(). In PrintComment(), a printf() is called along with another GrabToken(). In PrintComment(), GrabToken() will return everything up to the */. Once */ is found, the token set will automatically be switched back to the initial token set, and the parser will continue to search for another token. The same procedure happens for strings as well.

    /************************************************************************
     * Start of Program. Prints out all comments and strings from Parser.c
     ************************************************************************/
    #include "Parser.h"
    #include <stdio.h>  /* printf()         */
    #include <stdlib.h> /* *alloc(), free() */

    char *PrintString (char *str, void *);
    char *PrintComment(char *str, void *);

    int main(void)
    {
        int   tset_strings;
        int   tset_comments;
        char *buffer;

        ParserInit("Parser.c", 1024);

        tset_strings  = AddTokenSet();
        tset_comments = AddTokenSet();

        SetTokenSet(0);
        AddTokenSeparator("\"", 0, tset_strings,  0, PrintString,  0);
        AddTokenSeparator("/*", 0, tset_comments, 0, PrintComment, 0);

        /* Set up the token set that will handle the strings.
         * Note: We need to ignore \\ as well as \", since it is possible to
         *       have something like this \\". If we don't ignore \\, then
         *       the parser will continue to search for a " since \" will
         *       be picked out of the stream. */
        SetTokenSet(tset_strings);
        AddTokenSeparator("\\\\", 0, -1, 1, 0, 0);
        AddTokenSeparator("\\\"", 0, -1, 1, 0, 0);
        AddTokenSeparator("\"",   0,  0, 0, 0, 0);

        /* Set up the token set that will handle all comments */
        SetTokenSet(tset_comments);
        AddTokenSeparator("*/", 0, 0, 0, 0, 0);

        SetTokenSet(0);

        /* Loop through all the code. GrabToken() will return chunks of code,
         * so we need to free each one until we get a null buffer/end of
         * file. Note that there is no semicolon after the for(), so that
         * free() is the loop body. */
        for(buffer = GrabToken(); buffer; buffer = GrabToken())
            free(buffer);

        /* Clean up the parser */
        ParserDeInit();
        return 0;
    }

    char *PrintString(char *str, void *a)
    {
        char *buffer = GrabToken();
        printf("String Found!\n\"%s\"\n\n", buffer);
        free(buffer);

        /* We own str, so free it if it's valid */
        if(str)
            free(str);

        return 0; /* Continue Searching */
    }

    char *PrintComment(char *str, void *a)
    {
        char *buffer = GrabToken();
        printf("Comment Found!\n/*%s*/\n\n", buffer);
        free(buffer);

        /* We own str, so free it if it's valid */
        if(str)
            free(str);

        return 0; /* Continue Searching */
    }
    /************************************************************************
     * End of Program
     ************************************************************************/

2.03 - Text Mode Functions
GrabToken() and PeekToken() are the main text file functions. They search through the file for a Token Separator and return a pointer to a null terminated array. These functions will work with binary files as well; however, due to the NULL terminator, using these functions will cause all tokens that start with a 0 byte to be ignored. GrabInt() and GrabFloat() utilize GrabToken(), then convert the result into the respective variable type.

NOTE: Unicode is not supported.

Text Mode Compliant Functions:
    GrabToken();
    PeekToken();
    GrabInt();
    GrabFloat();
    Seek();

2.04 - Binary Mode Functions
Seek() is the main driving force behind binary mode. It allows you to scan the file until you get to the position you want. GrabBinaryInt(), GrabBinaryFloat(), GrabByte(), and GrabBytes() allow you to retrieve the data. You can use Grab/PeekToken; however, binary data might be interpreted as matching a token separator.
Binary Mode Compliant Functions:
    Seek();
    GrabBinaryInt();
    GrabBinaryFloat();
    GrabBytes();
    GrabByte();
    PeekByte();

2.05 - Function Callbacks
The parser supports callback functions. The prototype for the function is
    char *<function name>(char *str, void *params);

The return of the function should be:
    0  - Continue the search.
    1+ - Any string.

By default, the safest thing to do is to return str; (first parameter of the function). Any string returned by the callback is assumed to be owned by the parser. As such, it must be allocated by malloc(), calloc(), or realloc(). This enables you to replace any token separator with another string.

It is assumed that you own the str parameter. IE, if you do not plan to return the variable, and it is not 0/NULL, you should call free(str);.

See the following for examples:
    Sample Program: New Line Counter
    Sample Program: Print comments and Strings

2.06 - String Manipulation Functions
    RemoveWhiteSpaces() - Removes all white spaces from a string (Space, New Line, Carriage Return, and Tab)
    ToUpper()           - Converts all letters in a string to upper case
    ToLower()           - Converts all letters in a string to lower case
    Dup()               - Creates a copy of the specified string
    DupLen()            - Same as above, but takes the length
    DupRange()          - Creates a copy of a specific part of a string
    DupRangeFile()      - Creates a copy of a specific part of a file
    Cmp()               - Compares two strings. If they are equal, 1 is returned. If not, 0 is returned.

2.07 - Options/Configuration
The Parser contains several runtime definable options. Options can be enabled, disabled, or checked with the following functions:
    ParserEnable()    - Enables an option/feature
    ParserDisable()   - Disables an option/feature
    ParserIsEnabled() - Checks to see if all the specified options are enabled

All the above functions can take multiple options at the same time.
For Example,
    ParserEnable(PARSER_GLOBBING | PARSER_HASH | PARSER_CASE_INSENSITIVE);

Search Algorithm Options:
    PARSER_CASE_INSENSITIVE    - All search related functions will perform case insensitive searches.
    PARSER_WILDCARD            - Globbing based searches will be enabled. Namely, the * character will be used to accept any number of additional characters. This feature will be disabled if PARSER_REGEX is enabled. See Globbing for more info.
    PARSER_GLOBBING            - Same as PARSER_WILDCARD
    PARSER_REGULAR_EXPRESSIONS - Enables RegEx searching syntax. This feature will take precedence over and disable the following features:
                                     PARSER_HASH
                                     PARSER_WILDCARD
                                     PARSER_GLOBBING
                                     PARSER_SORT_TOKEN_SEPS
                                 See RegEx for more info.
    PARSER_REGEX               - Same as PARSER_REGULAR_EXPRESSIONS

Performance Options:
    PARSER_HASH            - Hashes the first character of each token separator. This can drastically speed up the parsing process, in exchange for additional memory usage. This feature will be disabled if PARSER_REGEX is enabled. This feature implies PARSER_SORT_TOKEN_SEPS.
    PARSER_CLOSE_FILE      - Closes the file after it is read from. By default, the file is left open for performance reasons. Namely, if the file is closed, the OS will have to seek to the correct location in the file on the next read, which tends to be very expensive. Closing a file will reduce the total number of active system file handles, in exchange for a drastic performance loss.
    PARSER_SORT_TOKEN_SEPS - Applies a weak sorting algorithm to all token separators. This guarantees that all supersets of token separators will be checked before any subsets, in exchange for a slightly longer initialization time. This feature will be disabled if PARSER_REGEX is enabled.

Memory Options:
    PARSER_CONST_FILE_NAME   - It is assumed that the file name passed to LoadFile() or ParserInit() is const. As such, the string will not be duplicated, nor will it be freed when ParserDeInit() is called. The string must therefore be valid for the entire life of the parser.
    PARSER_CONST_LOAD_MEMORY - It is assumed that the memory passed to LoadMemory() or LoadMemoryLen() is const. As such, the memory will not be duplicated, nor will it be freed when ParserDeInit() is called. The memory must therefore be valid for the entire life of the parser.
    PARSER_CONST_TOKEN_SEPS  - It is assumed that the token separator passed to AddTokenSeparator() is const. As such, the string will not be duplicated, nor will it be freed when ParserDeInit() is called. The string must therefore be valid for the entire life of the parser.
    PARSER_OWNS_FILE_NAME    - It is assumed that the parser owns the pointer passed to LoadFile() and ParserInit(). As such, the pointer will be freed when ParserDeInit() is called. The pointer must be allocated with malloc(), calloc(), or realloc().
    PARSER_OWNS_LOAD_MEMORY  - It is assumed that the parser owns the pointer passed to LoadMemory() and LoadMemoryLen(). As such, the pointer will be freed when ParserDeInit() is called. The pointer must be allocated with malloc(), calloc(), or realloc().
    PARSER_OWNS_TOKEN_SEPS   - It is assumed that the parser owns the pointer passed to AddTokenSeparator(). As such, the pointer will be freed when ParserDeInit() is called. The pointer must be allocated with malloc(), calloc(), or realloc().

2.08 - Search Methods
The following search algorithms directly affect how a token separator is interpreted by the Parser. The more complex the search algorithm, the slower it tends to be.

Linear:
This is the default and fastest search algorithm within the Parser. Simply stated, a linear search is performed on the file/memory. IE, the Parser will start from the first byte and compare it against all the token separators. If none of the separators match, the next byte will be tested, and so on. In this mode, the token separators will be interpreted literally. IE, there are no reserved characters, other than the NULL terminator/0.
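The linear search described above can be sketched in a few lines of plain C. This is only an illustration of the idea, not the Parser's ForwardSearch() implementation: LinearFind() is a hypothetical helper, it works on a null terminated string rather than a buffered file, and it assumes the separators are tested in the order they were added.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Parser.h.
 * Starting at the first byte of `data`, compares each position against
 * every separator in `seps` (in order) and returns a pointer to the first
 * position where one matches, or NULL if none match. If `which` is not
 * NULL, it receives the index of the matching separator.
 * Separators are compared literally: '*' and friends have no meaning. */
static const char *LinearFind(const char *data, const char *const *seps,
                              size_t count, size_t *which)
{
    const char *p;
    size_t      i;

    for (p = data; *p != '\0'; p++) {
        for (i = 0; i < count; i++) {
            size_t n = strlen(seps[i]);

            if (n != 0 && strncmp(p, seps[i], n) == 0) {
                if (which != NULL)
                    *which = i;
                return p;
            }
        }
    }

    return NULL;
}
```

Every position is tested against every separator, which is why a low number of short separators keeps this search fast.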
To ensure that a linear search is performed, call:
    ParserDisable(PARSER_GLOBBING | PARSER_REGEX);

Globbing:
To enable Globbing, call:
    ParserEnable(PARSER_GLOBBING);
Globbing is similar to the linear search method; however, the asterisk (*) has a special meaning. Namely, it matches any number of characters of any kind. IE,

Glob Matching Examples
  foo*bar - foobar, foo\nbar, foo___bar, foo o foo bar
  *foo*   - asdf foo bar, dddfoobar
  *f*r*   - fr, f r, af r, af rd

Glob Non-Matching Examples
  foo*bar - foba, fooba, foob ar, f oobar
  *foo*   - fo o, asf oo
  *f*r*   - afh

In order to prevent a glob from returning the content of an entire file, all the other token separators will be taken into account. This will affect the glob if it has a * at the front or end of the token ("*foo*"). Internal * are not separator delimited ("f*o"). So, if you have this
    LoadMemory("ab foo bar ddd");
    ParserEnable(PARSER_GLOBBING);
    AddTokenSep(" ");
    AddTokenSep("*foo*bar");
Then,
    GrabToken(); /* Returns "ab" */
    GrabToken(); /* Returns " " */
    GrabToken(); /* Returns "foo bar" */
    GrabToken(); /* Returns " " */
    GrabToken(); /* Returns "ddd" */
    GrabToken(); /* Returns 0 - End of the data was reached */
Likewise, with the same Token Set,
    LoadMemory("ab ddfoo foo baree ddd");
    GrabToken(); /* Returns "ab" */
    GrabToken(); /* Returns " " */
    GrabToken(); /* Returns "ddfoo foo baree" */
    GrabToken(); /* Returns " " */
    GrabToken(); /* Returns "ddd" */
    GrabToken(); /* Returns 0 - End of the data was reached */

Warning: Once a glob has found the first non wildcard segment of the token, it will search to the end of the file/memory in order to locate the ending segment. It will break out as soon as an ending segment is found. In other words, an O(N^2) search might be performed by a glob on a file. IE, with a bad set of data, globbing may be extremely slow.

Warning: There is no way to escape the *. Once globbing is enabled, every * will be interpreted as a wildcard character.
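To make the wildcard semantics above concrete, here is a minimal, self-contained C sketch of a matcher in which * stands for any run of characters (including none) and cannot be escaped. It checks a whole string against a pattern, which is enough to reproduce the matching and non-matching examples listed above. It does not model the separator-delimiting behavior and is not the Parser's own glob code. glob_match() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, NOT part of the Parser API.
 * Returns 1 if txt matches pat in full, 0 otherwise. '*' matches any
 * number of characters (zero or more). All other characters are literal.
 */
int glob_match(const char *pat, const char *txt)
{
    const char *star = NULL;   /* most recent '*' seen in the pattern */
    const char *resume = NULL; /* text position to retry from         */

    assert(pat != NULL && txt != NULL);

    while (*txt != '\0') {
        if (*pat == '*') {
            /* Remember the wildcard and first try matching nothing. */
            star = pat++;
            resume = txt;
        } else if (*pat == *txt) {
            pat++;
            txt++;
        } else if (star != NULL) {
            /* Mismatch: let the last '*' absorb one more character. */
            pat = star + 1;
            txt = ++resume;
        } else {
            return 0;
        }
    }

    /* Trailing wildcards may match the empty remainder. */
    while (*pat == '*') {
        pat++;
    }
    return *pat == '\0';
}
```

The retry step after a mismatch is also where the worst-case cost comes from: each failed attempt restarts the comparison one character further along.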
RegEx:
Regular Expressions, or RegEx for short, are a much more powerful version of globbing.

Warning: The RegEx engine is still in development. It is possible to get stuck in an infinite loop.

To enable RegEx, call:
    ParserEnable(PARSER_REGEX);
See also: RegEx on Wikipedia

Supported Syntax:
  Logic:
    ^  - Start of String
    $  - End of String
    \b - Word Boundary
    \B - Not a Word Boundary
    \< - Start of Word
    \> - End of Word
    |  - Or. Ex, "a|b" will match 'a' or 'b'
  Ranges:
    [aqf] - Will match a, q, or f. Anything can be added here
    [^a]  - Anything other than 'a'
    [a-z] - Anything from 'a' to 'z'. 'a' and 'z' can be replaced with any ASCII characters.
  Predefined Ranges:
    \s - [ \t\r\n\v\f] - White Space
    \S - [^\s] - Not White Space
    \d - [0-9] - Digit
    \D - [^\d] - Not a Digit
    \w - [A-Za-z0-9_] - Word
    \W - [^\w] - Not a Word
  Quantifiers:
    All Quantifiers are Greedy. Append a '?' after a quantifier to make it non greedy.
    ? - 0 or 1
    * - 0 or more
    + - 1 or more
  Other:
    (...) - Sub expressions. Ex, "(a|b)+" will match "aaaa", "aabb", "abab", "bbba", ...

Internally, all RegEx's in the Parser are compiled into byte code. This is done in order to simplify the implementation of the RegEx engine, as well as to improve the overall performance. Currently, the Parser uses a DFA engine, due to its high performance.

Planned Enhancements:
  Capture Groups
  Backreferences

Warning: While still usable, RegEx's tend to be slower than Linear and Globbing based searches.

2.09 - State Management
For simplicity, the Parser's state management is handled through a global pointer. This reduces the total number of variables you have to manage. If you need more than one Parser state, the following state management methods are provided to simplify the overall process.

Manual:
When you are manually managing the Parser state, you are responsible for keeping track of the state's pointer.
Once the Parser is initialized via ParserInit(), you can get the state pointer by calling:
    void *state = GetParserState();
This will give you a direct copy of the state pointer. The current state will still be active. To change the active Parser state, call:
    SetParserState(state);
To deinitialize the Parser's state while the state is still active, call
    ParserDeInit();
The main idea behind this mode is to wrap any code that needs to create a new parser state with the above function calls. For example (assume foo() was called):

    void foo(void)
    {
        /* Preserve any old parser state */
        void *old_state = GetParserState();

        /* Create a new state */
        SetParserState(0);
        ParserInit(...);

        /* Do stuff with the Parser... */
        bar();
        /* Do stuff with the Parser... */

        /* Cleanup the current state, and restore the old state */
        ParserDeInit();
        SetParserState(old_state);
    }

    void bar(void)
    {
        /* Preserve any old parser state */
        void *old_state = GetParserState();

        /* Create a new state */
        SetParserState(0);
        ParserInit(...);

        /* Do stuff with the Parser... */

        ParserDeInit();
        SetParserState(old_state);
    }

So, looking at the above code, when foo() is called, we preserve any Parser state that might be active. Then, we create a new state. We then call bar(), which does the same thing. This guarantees that we will not do any harmful things to another parser state. Alternatively, we can wrap the call to bar() with Get/SetParserState(); however, this tends to be a bit more dangerous and can lead to a lot more code (you'll need to do this for all calls to functions that expect an empty Parser state). The main advantage of this method is that you can prebuild a set of Parsers, and switch to the correct Parser state when needed.

Stack:
The Stack (LIFO) based state management is very similar to the Manual state management; however, it is not as flexible. In exchange for this, stack based management tends to be slightly simpler.
The stack based management revolves around 2 functions:
  PushParserState() pushes the current Parser state onto an internal global stack, and then sets the current state to 0.
  PopParserState() deinitializes the current Parser state, and restores any old state.
So, the above example would work out to

    void foo(void)
    {
        /* Preserve any old parser state */
        PushParserState();

        /* Create a new state */
        ParserInit(...);

        /* Do stuff with the Parser... */
        bar();
        /* Do stuff with the Parser... */

        /* Cleanup the current state, and restore the old state */
        PopParserState();
    }

    void bar(void)
    {
        /* Preserve any old parser state */
        PushParserState();

        /* Create a new state */
        ParserInit(...);

        /* Do stuff with the Parser... */

        PopParserState();
    }

The main advantage of this method is that we do not have to keep track of the old state pointer. Plus, this only takes 3 lines of code instead of the original 5. In exchange for this, we are limited in the total number of states that the Parser will keep track of. By default, the Parser will keep track of 10 states. The number of states can be increased by modifying PARSER_STATE_STACK_SIZE in Parser.c. On top of this, the stack method only works in a linear fashion. The manual management can switch to any active state, rather than only the last one.

2.10 - C++ Wrapper
Included with the library is a C++ wrapper. This wrapper is largely just copy and paste; however, there are a few differences:

Differences:
  ParserInit() -> Class Constructor
  ParserDeInit() -> Class Destructor
  AddTokenSep() -> Removed. Done through AddTokenSeparator() defaults.
  All Parser*() functions have the "Parser" part removed. IE, ParserEnable() -> Enable() and similar.
  Multiple States are handled through C++ classes. IE, no more Get/SetParserState() or Push/PopParserState().
  Callbacks are now passed Parser &. IE, char *<function name>(Parser &p, char *str, void *params);
  Copy constructor is implemented.
Important things that did not change:
  malloc(), calloc(), realloc(), and free() are still used internally. As such, all pointers returned by the parser should still be freed with free().
  new and delete should only be used to allocate and free the Parser class itself.
3.0 - Using the Parser
3.1 - Compiling the Parser
To compile the parser, simply add Parser.h and Parser.c to your project. Add Parser.c to your makefile, command line, or whatever build setup you use. No compile time defines or special switches are required.

Note: If you are working on a C++ project, you might want to rename Parser.c to Parser.cpp or use the C++ wrapper.

The Parser has been compiled and tested with:
  Windows:
    MS Compiler
      cl *.c /W4
    GNU
      gcc *.c -Wall -Wextra -ansi -pedantic
      gcc *.c -Wall -Wextra -ansi -pedantic -mno-cygwin
      g++ *.c -Wall -Wextra -ansi -pedantic
    Borland Compiler
      bcc -w *.c
    Digital Mars
      dmc file.c -A
    Intel Compiler
      icl *.c /W3
  Linux:
    GNU
      gcc *.c -Wall -Wextra -ansi -pedantic
      g++ *.c -Wall -Wextra -ansi -pedantic

3.2 - Initialization
Before you can use the Parser, you must first call
    ParserInit(<file to parse>, <buffer size>);
After this is done, any options you want should be enabled. See Options/Configuration.

3.3 - Setting up the Token Set
Once the parser has been initialized, you have to set up the token sets. If you are parsing a pure binary file, you do not need to add any token separators. To add a token separator, call either
    AddTokenSep(<token>);
    AddTokenSeparator(<token>, <return>, <switchto>, <ignore>, <fun>, <params>);
To create a new Token Set, call
    AddTokenSet();
To change the current token set, call
    SetTokenSet(<token set>);
See Token Sets for more info.

3.4 - Parsing the file
Once the token set(s) have been set up, you can begin to parse the file. GrabToken() is the main parser function. When GrabToken() is called, it will retrieve the next token from the file. If a 0 (null) is returned, the end of the file was reached or an error occurred. See also Text Mode Functions and Binary Mode Functions.

3.5 - Deinitialization
Once you are done parsing, you should call
    ParserDeInit();
4.0 - Functions
4.1 - Public Parser Functions
The following functions are the public interface for interacting with the Parser.
4.1.01 - AddTokenSep()
Prototype: int AddTokenSep(const char *sp);
Description: Adds a new token separator to the current token set. This function is an adapter for AddTokenSeparator() that uses the default settings:
  Return   -  1 - The token separator will be returned
  switchto - -1 - Don't change the token set
  ignore   -  0 - Don't ignore the token
  fun      -  0 - Don't call a callback function
  params   -  0 - Don't pass any parameters
Inputs:
  *sp - Pointer to the string to use as a token separator. C style string - must be null terminated.
Output:
  0 - Token has been added to the parser.
  1+ - The specified token is not valid, or another error occurred. This value may be one or more of the following flags:
    PARSER_TOKEN_NULL - sp == NULL
    PARSER_OUT_OF_MEMORY - Could not allocate enough memory to add the token to the parser. This error will usually cause the rest of the parser to error out.
    PARSER_REGEX_COMPILE_ERROR - Failed to compile a RegEx to byte code
    PARSER_UNBALANCED_PARENS - RegEx has unbalanced ()
    PARSER_UNBALANCED_BRACKETS - RegEx has unbalanced []
Notes:
  The order that tokens are added matters. A token passed in before another token will be detected first. IE, if you passed in
      AddTokenSep("23");
      AddTokenSep("2");
  "23" will be checked for before "2" is checked for. This is very useful if you have a token that is a superset of another. To change this behavior, call
      ParserEnable(PARSER_SORT_TOKEN_SEPS);
  This will cause a weak sorting algorithm to be applied to the token separators, so that supersets of tokens will always be checked before subsets.
  By default, the token that is passed in will be duplicated. To change this behavior, call ParserEnable() with one of the following defines:
    PARSER_CONST_TOKEN_SEPS - Token is assumed to be constant, and will not be duplicated or freed. As such, the token must remain available for the entire life of the parser state.
    PARSER_OWNS_TOKEN_SEPS - Token is assumed to be allocated with malloc(), calloc(), or realloc().
    The token will be freed when ParserDeInit() is called.

4.1.02 - AddTokenSeparator()
Prototype: int AddTokenSeparator(const char *sp, int Return, int switchto, int ignore, char *(*fun)(), void *params);
Description: Adds a new token separator to the current token set, using the specified settings.
Inputs:
  *sp - Pointer to the string to consider as a separator
  Return - Should the separator be returned by Grab/PeekToken()? This is useful for filtering out specific strings.
  switchto - Automatically switch to the specified token set when the token is found. This can also be one of the following defines:
    PARSER_TSET_DONT_SWITCH - Don't change the current token set
    PARSER_TSET_LAST - Switch to the last active token set
  ignore - If this token is found, just keep going. Originally designed to be used with strings. For example, \" should be ignored; however, " will be picked out if we don't ignore \"
  fun - Callback function to call whenever the token is found.
  params - Pointer to pass to fun. Once a token is found, and Grab*() has returned, GetParams() can be called to return this value.
Output:
  0 - Token has been added to the parser.
  1+ - The specified token is not valid, or another error occurred. This value may be one or more of the following flags:
    PARSER_TOKEN_NULL - sp == NULL
    PARSER_OUT_OF_MEMORY - Could not allocate enough memory to add the token to the parser. This error will usually cause the rest of the parser to error out.
    PARSER_REGEX_COMPILE_ERROR - Failed to compile a RegEx to byte code
    PARSER_UNBALANCED_PARENS - RegEx has unbalanced ()
    PARSER_UNBALANCED_BRACKETS - RegEx has unbalanced []
Notes:
  The order that tokens are added matters. A token passed in before another token will be detected first. IE, if you passed in
      AddTokenSep("23");
      AddTokenSep("2");
  "23" will be checked for before "2" is checked for. This is very useful if you have a token that is a superset of another.
  To change this behavior, call
      ParserEnable(PARSER_SORT_TOKEN_SEPS);
  This will cause a weak sorting algorithm to be applied to the token separators, so that supersets of tokens will always be checked before subsets.
  By default, the token that is passed in will be duplicated. To change this behavior, call ParserEnable() with one of the following defines:
    PARSER_CONST_TOKEN_SEPS - Token is assumed to be constant, and will not be duplicated or freed. As such, the token must remain available for the entire life of the parser state.
    PARSER_OWNS_TOKEN_SEPS - Token is assumed to be allocated with malloc(), calloc(), or realloc(). The token will be freed when ParserDeInit() is called.
  If ignore is set to 1 and there is a function callback for a token, you will not be able to call Grab*() or similar from within the callback.

4.1.03 - AddTokenSet()
Prototype: int AddTokenSet(void);
Description: Creates a new Token Set, and then sets it as the active one.
Inputs: N/A
Output:
  -1 - Error occurred. Call ErrorCode() to find out what went wrong.
  0+ - Index of the new token set. Generally, you should use a variable to store the return result, and then use that variable when SetTokenSet() is called.
Notes: N/A

4.1.04 - End()
Prototype: int End(void);
Description: Returns 1 if the end of the file was reached, an error occurred, or the parser was not initialized.
Inputs: N/A
Output:
  1 - The end of the file was reached, an error occurred, or the parser was not initialized.
  0 - The parser can still retrieve data from the file.
Notes: N/A

4.1.05 - ErrorCode()
Prototype: int ErrorCode(void);
Description: Returns a 0 if no error has occurred. Otherwise, an error has occurred.
Inputs: N/A
Output:
  Define - Value - Description
  PARSER_NOT_INITIALIZED - -1 - Not Initialized. Call ParserInit() first.
  PARSER_NO_ERROR - 0 - No Error
  PARSER_COULD_NOT_OPEN_FILE - 1 - Could not open the specified file.
  PARSER_OUT_OF_MEMORY - 2 - Could not allocate the required memory
  PARSER_END_OF_FILE - 3 - Reached the end of the file
  PARSER_MEMORY_NOT_VALID - 4 - Data passed in to LoadMemory() was null
  PARSER_GRAB_TOKEN_IGNORE - 5 - GrabToken() or similar was called from a function callback with ignore set to 1. This is not supported.
Notes: Check Parser.h for the defines of the above error codes. You can also call PrintErrorCode() to print out a human readable error message to the command prompt. This will be written to stdout via printf().

4.1.06 - GetFilePosition()
Prototype: long GetFilePosition(void);
Description: Returns the absolute position of the parser in the file.
Inputs: N/A
Output: Absolute position in the file.
Notes: The position returned will not be accurate if GetFilePosition() is called from within the callback of a token that has ignore set to 1.

4.1.07 - GetFileSize()
Prototype: long GetFileSize(void);
Description: Returns the size, in bytes, of the currently loaded file or block of memory.
Inputs: N/A
Output: Size of the file/block of memory.
Notes: N/A

4.1.08 - GetParams()
Prototype: void *GetParams(void);
Description: Returns the params variable associated with the last found token separator. This is the value passed as the last argument of AddTokenSeparator().
Inputs: N/A
Output: The params variable associated with the last found token separator.
Notes: N/A

4.1.09 - GetParserState()
Prototype: void *GetParserState(void);
Description: Returns a pointer to the current Parser state.
Inputs: N/A
Output: Pointer to the current Parser state.
Notes: N/A

4.1.10 - GetTokenSet()
Prototype: int GetTokenSet(void);
Description: Returns the current token set.
Inputs: N/A
Output:
  -1 - Parser was not initialized.
  0+ - Current Token Set
Notes: N/A

4.1.11 - GrabBinaryFloat()
Prototype: float GrabBinaryFloat(void);
Description: Grabs the next four bytes in the file, and converts them to a float.
Inputs: N/A
Output: Next four bytes in the file as a float. If there are not four bytes left in the file, a 0.0f will be returned instead.
Notes: N/A

4.1.12 - GrabBinaryInt()
Prototype: int GrabBinaryInt(void);
Description: Returns the next sizeof(int) bytes in the file as an int.
Inputs: N/A
Output: Next sizeof(int) bytes in the file as an int. If there are fewer than 4 bytes left in the file, a 0 will be returned.
Notes: N/A

4.1.13 - GrabByte()
Prototype: char GrabByte(void);
Description: Grabs the next character (byte) in the file.
Inputs: N/A
Output: Next character (byte) from the file.
Notes: N/A

4.1.14 - GrabBytes()
Prototype: char *GrabBytes(int bytes);
Description: Grabs the requested number of bytes from the file, and then returns them.
Inputs:
  bytes - How many bytes to grab from the file.
Output:
  0 - Requested number of bytes is invalid or the end of the file was reached
  1+ - Pointer to the memory that contains the data from the file.
Notes: You are responsible for cleaning up the data when you are done with it. IE, you must call free(<pointer returned by GrabBytes()>).

4.1.15 - GrabFloat()
Prototype: float GrabFloat(void);
Description: Grabs the next token in the file, and then attempts to convert it to a float via the atof() function declared in stdlib.h. All token separators are taken into account. Function callbacks and such will still be called.
Inputs: N/A
Output: Next token converted to a float.
Notes: N/A

4.1.16 - GrabInt()
Prototype: int GrabInt(void);
Description: Grabs the next token in the file, and then attempts to convert it to an int via the atoi() function declared in stdlib.h. All token separators are taken into account. Function callbacks and such will still be called.
Inputs: N/A
Output: Next token converted to an int.
Notes: N/A

4.1.17 - GrabToken()
Prototype: char *GrabToken(void);
Description: The main function of the Parser. This function will scan the file for any of the token separators you specified with AddTokenSeparator(), and apply any logic specified for the token separator that is found.
Inputs: N/A
Output:
  0 - End of the file was reached, or an error occurred
  1+ - Character pointer to the next token in the file.
Notes: You are responsible for the cleanup. IE, calling free() on the returned pointer.

4.1.18 - LoadFile()
Prototype: int LoadFile(const char *file);
Description: Loads a new file into the parser for processing.
Inputs:
  *file - C style string that contains the name/path of the file to parse.
Output:
  1 - File was loaded and the parser was set up
  0 - Error occurred. Most likely due to an incorrect file name.
Notes:
  The token sets will not be affected by this function.
  All files are read in as binary.
  By default, the file name that is passed in will be duplicated. To change this behavior, call ParserEnable() with one of the following defines:
    PARSER_CONST_FILE_NAME - File name is assumed to be constant, and will not be duplicated or freed. As such, the file name must remain available for the entire life of the parser state.
    PARSER_OWNS_FILE_NAME - File name is assumed to be allocated with malloc(), calloc(), or realloc(). The file name will be freed when ParserDeInit() is called.

4.1.19 - LoadMemory()
Prototype: int LoadMemory(const char *memory);
Description: Loads the specified chunk of memory into the parser for parsing. Currently, only C style strings are supported by this function. This function calls LoadMemoryLen().
Inputs:
  *memory - Pointer to the chunk of memory to load into the parser.
Output:
  0 - The specified memory is not valid or the parser was not initialized.
  1 - The memory was loaded into the parser.
Notes:
  By default, the memory that is passed in will be duplicated.
  To change this behavior, call ParserEnable() with one of the following defines:
    PARSER_CONST_LOAD_MEMORY - Memory is assumed to be constant, and will not be duplicated or freed. As such, the memory must remain available for the entire life of the parser state.
    PARSER_OWNS_LOAD_MEMORY - Memory is assumed to be allocated with malloc(), calloc(), or realloc(). The memory will be freed when ParserDeInit() is called.

4.1.20 - LoadMemoryLen()
Prototype: int LoadMemoryLen(const char *memory, int len);
Description: Loads the specified chunk of memory into the parser for parsing. Binary based memory can be passed in.
Inputs:
  *memory - Pointer to the chunk of memory to load into the parser.
  len - Size of the memory to load into the Parser
Output:
  0 - The specified memory is not valid or the parser was not initialized.
  1 - The memory was loaded into the parser.
Notes:
  By default, the memory that is passed in will be duplicated. To change this behavior, call ParserEnable() with one of the following defines:
    PARSER_CONST_LOAD_MEMORY - Memory is assumed to be constant, and will not be duplicated or freed. As such, the memory must remain available for the entire life of the parser state.
    PARSER_OWNS_LOAD_MEMORY - Memory is assumed to be allocated with malloc(), calloc(), or realloc(). The memory will be freed when ParserDeInit() is called.

4.1.21 - ParserDeInit()
Prototype: void ParserDeInit(void);
Description: Frees all the memory that the parser was using.
Inputs: N/A
Output: N/A
Notes: N/A

4.1.22 - ParserDisable()
Prototype: void ParserDisable(int flags);
Description: Disables one or more features/options. See Options/Configuration for a list of defines.
Inputs:
  flags - One or more features/options to disable. Multiple features can be disabled at the same time by combining the values with |. For example,
      ParserDisable(PARSER_REGEX | PARSER_CASE_INSENSITIVE);
Output: N/A
Notes: By default, none of the listed features/options are enabled.
4.1.23 - ParserEnable()
Prototype: void ParserEnable(int flags);
Description: Enables one or more of the specified options. See Options/Configuration for a list of defines.
Inputs:
  flags - One or more features/options to enable. Multiple features can be enabled at the same time by combining the values with |. For example,
      ParserEnable(PARSER_REGEX | PARSER_CASE_INSENSITIVE);
Output: N/A
Notes: By default, none of the listed features/options are enabled.

4.1.24 - ParserInit()
Prototype: void ParserInit(const char *file, int bufsize);
Description: Allocates and initializes all the memory that the parser needs to function. Once everything has been allocated and initialized, the Parser will load in the requested number of bytes from the file.
Inputs:
  *file - Name/Path of the file to load into the parser. A NULL pointer can be passed in if you do not wish to load in an initial file.
  bufsize - How many bytes to read in from the file at one time. If a 0 is passed in, bufsize will default to 1024 bytes (1 KB). This value cannot be changed once it is specified.
Output: N/A
Notes:
  If this function is called more than once, the parser will automatically call ParserDeInit(), in order to prevent leaking memory.
  Multiple Parser states can be managed with:
    Push/PopParserState()
    Get/SetParserState()

4.1.25 - ParserIsEnabled()
Prototype: int ParserIsEnabled(int flags);
Description: Checks to see if the specified options/features are enabled. This function will return 1 if all the specified options are enabled. If one or more options are not enabled, a 0 will be returned.
Inputs:
  flags - One or more features/options to check. Multiple features can be checked at the same time by combining the values with |. For example,
      ParserIsEnabled(PARSER_REGEX | PARSER_CASE_INSENSITIVE);
Output:
  0 - One or more of the specified options are not enabled.
  1 - All the specified options are enabled.
Notes: See Options/Configuration for a list of defines.
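The enable, disable, and "all of these are enabled" semantics described for ParserEnable(), ParserDisable(), and ParserIsEnabled() correspond to ordinary bit flag operations. The sketch below shows that pattern in isolation. The DEMO_* values and demo_* functions are hypothetical and exist only for this example. The real defines live in Parser.h, and the real functions also apply the feature interactions listed under Options/Configuration (for example, RegEx disabling hashing), which this sketch does not model.

```c
#include <assert.h>

/* Hypothetical flag values, for illustration only. */
enum {
    DEMO_CASE_INSENSITIVE = 1 << 0,
    DEMO_GLOBBING         = 1 << 1,
    DEMO_REGEX            = 1 << 2
};

/* Nothing is enabled by default. */
static int demo_flags = 0;

/* Turn on every bit present in flags. */
void demo_enable(int flags)
{
    demo_flags |= flags;
}

/* Turn off every bit present in flags. */
void demo_disable(int flags)
{
    demo_flags &= ~flags;
}

/* Returns 1 only if ALL of the requested bits are set, otherwise 0. */
int demo_is_enabled(int flags)
{
    assert(flags != 0); /* asking about no flags is likely a caller bug */
    return (demo_flags & flags) == flags;
}
```

Comparing the masked value against flags, rather than testing it for non-zero, is what gives the "all specified options" behavior instead of "any specified option".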
4.1.26 - ParserMemoryUsage()
Prototype: int ParserMemoryUsage(void);
Description: Returns an estimate of the total number of bytes the current parser state is using. Generally, this number will be very accurate; however, certain error conditions can skew the results.
Inputs: N/A
Output: Number of bytes of the heap the parser is using.
Notes: Global variable memory is ignored; however, it tends to be very small. By default, the Parser only uses 12 * sizeof(void *) bytes of global variable memory.

4.1.27 - PeekByte()
Prototype: unsigned char PeekByte(int offset);
Description: Returns the byte at the specified offset from the next byte in the loaded file/memory. The offset can be positive or negative. Requesting a byte before the start or after the end of the file/memory will result in a 0. This function does not modify the Parser's current location.
Inputs:
  offset - Offset, relative to the next byte in the file/memory, of the byte to get.
Output: Next byte + offset in the stream.
Notes: N/A

4.1.28 - PeekToken()
Prototype: char *PeekToken(void);
Description: Same behavior as GrabToken(), except that the parser's position in the file is not updated. Callback functions will still be called.
Inputs: N/A
Output: Pointer to the next token in the file.
Notes: You are responsible for cleaning up the memory when you are done.

4.1.29 - PopParserState()
Prototype: void PopParserState(void);
Description: Deinitializes the current Parser state, and restores an old state. If no old states exist, a new and uninitialized state will be created.
Inputs: N/A
Output: N/A
Notes: It is safe to pop an empty state stack. This will just cause the current state to be deinitialized.

4.1.30 - PrintErrorCode()
Prototype: void PrintErrorCode(void);
Description: Prints out the current status of the parser to stdout via printf().
  Format: <File Name>: <Error Message>
  The file name will be the name of the parser file (Parser.c, by default).
  The error message will be determined by the Error Code.
Inputs: N/A
Output: N/A
Notes: N/A

4.1.31 - PushParserState()
Prototype: int PushParserState(void);
Description: Stores the current Parser state, and sets a new/uninitialized Parser state as the active state.
Inputs: N/A
Output:
  0 - Could not push the parser state. The hard coded state stack size was exceeded. See PARSER_STATE_STACK_SIZE in Parser.c for the total number of states the Parser can keep track of.
  1 - Parser State was pushed onto the state stack.
Notes: See PARSER_STATE_STACK_SIZE in Parser.c to change the state stack size.

4.1.32 - Seek()
Prototype: int Seek(const char *search);
Description: Scans the file for the specified token. If the token is found, the position of the parser will be updated to the character directly after the token. If the token is not found, nothing in the parser will change. PARSER_CASE_INSENSITIVE will be taken into account.
Inputs:
  *search - C style string to search for in the file.
Output:
  0 - The token was not found. The parser was not updated.
  1 - The token was found. The parser was updated.
Notes: Token sets are not factored in. Globbing and RegEx are not supported by this function.

4.1.33 - SetFilePosition()
Prototype: int SetFilePosition(long fpos);
Description: Changes the position in the file from which the parser scans for tokens.
Inputs:
  fpos - Where the parser should start parsing the file.
Output:
  0 - Error occurred. Call ErrorCode() or PrintErrorCode() for more info.
  1 - Parser's position was updated.
Notes: N/A

4.1.34 - SetParserState()
Prototype: void SetParserState(void *state);
Description: Sets the Parser state to the specified Parser state.
Inputs:
  *state - State the parser should use.
Output: N/A
Notes: No error checking is done here. The current Parser state will be lost. It is highly recommended you call GetParserState() beforehand in order to preserve the last parser state.
  Setting the state to 0, followed by calling ParserInit(), will create a new Parser state.

4.1.35 - SetTokenSet()
Prototype: int SetTokenSet(int tokenset);
Description: Changes the current token set. The following defines can be passed to this function:
  Define - Value - Description
  PARSER_TSET_DONT_SWITCH - -1 - Don't change the current token set
  PARSER_TSET_LAST - -2 - Switch to the last active token set
Inputs:
  tokenset - Index of the token set to change to.
Output:
  -1 - The parser was not initialized or the requested token set was not valid
  0+ - Index of the token set switched to.
Notes: N/A

4.1.36 - GenericDiscard()
Prototype: char *GenericDiscard(char *str, void *unused);
Description: Generic Parser callback designed to discard the next token. Namely, this function simply calls free(GrabToken());
Inputs: N/A
Output:
  0 - Token separator that called the callback was discarded.
Notes: N/A
4.2 - String Manipulation Functions
The following functions are either not implemented in string.h or operate on different principles.
4.2.01 - RemoveWhiteSpaces()
Prototype: int RemoveWhiteSpaces(char *sp);
Description: Removes all spaces, new lines, carriage returns, and tabs from the specified string.
Inputs:
  *sp - Pointer to the string to remove the white spaces from.
Output:
  -1 - sp was not valid
  0+ - New length of the string. The pointer will not be reallocated, so the original string pointer remains valid.
Notes: N/A

4.2.02 - ToUpper()
Prototype: char *ToUpper(char *sp);
Description: Converts a C style string to upper case in place. IE, the string you pass in will be directly modified.
Inputs:
  *sp - Pointer to the string to convert to upper case.
Output: Original pointer that was passed in.
Notes: N/A

4.2.03 - ToLower()
Prototype: char *ToLower(char *sp);
Description: Converts a C style string to lower case in place. IE, the string you pass in will be directly modified.
Inputs:
  *sp - Pointer to the string to convert to lower case.
Output: Original pointer that was passed in.
Notes: N/A

4.2.04 - Dup()
Prototype: char *Dup(const char *sp);
Description: Creates a copy of the specified string.
Inputs:
  *sp - String to make a copy of.
Output: Pointer to the new chunk of memory.
Notes: You are responsible for cleaning up the returned pointer by calling free().

4.2.05 - DupLen()
Prototype: char *DupLen(const char *sp, int len);
Description: Creates a copy of the specified string. The NULL terminator is automatically attached. IE, you can just call strlen(<string>) for the param len.
Inputs:
  *sp - String to make a copy of.
  len - Length of the string/position of the null terminator.
Output: Pointer to the new chunk of memory.
Notes: You are responsible for cleaning up the returned pointer by calling free().
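The behavior described for DupLen() (copy len characters, then attach the terminator so the caller can pass strlen() directly) can be sketched as follows. This is a stand-in written for illustration, not the Parser's own code. dup_len_sketch() is a hypothetical name, and returning NULL on allocation failure is an assumption, since the reference above does not say what DupLen() does in that case.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for DupLen(), NOT the library function.
 * Copies the first len characters of sp into freshly allocated memory
 * and appends a null terminator. The caller must free() the result.
 */
char *dup_len_sketch(const char *sp, int len)
{
    char *copy;

    assert(sp != NULL && len >= 0);

    /* One extra byte for the terminator that is added automatically. */
    copy = (char *)malloc((size_t)len + 1);
    if (copy == NULL) {
        return NULL; /* assumed failure behavior */
    }

    memcpy(copy, sp, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```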
4.2.06 - DupRange()
Prototype: char *DupRange(const char *sp, int start, int end);
Description: Creates a copy of a specific part of a string.
Inputs:
    *sp - String to make a partial copy of
    start - Index in the string to start copying data from
    end - Where to stop/last character to copy
Output: Pointer to the duplicated chunk of the string.
Notes: You are responsible for cleaning up the returned pointer by calling free().

4.2.07 - DupRangeFile()
Prototype: char *DupRangeFile(const char *file, int start, int end);
Description: Opens the specified file, and then reads the requested data range into a buffer.
Inputs:
    *file - Name/path of the file to read
    start - Where in the file to start reading in the data
    end - Where to stop reading in data
Output:
    0 - The file name was not valid, or the memory couldn't be allocated.
    1+ - Pointer to the new buffer containing the requested data.
Notes: You are responsible for cleaning up the returned pointer by calling free().

4.2.08 - Cmp()
Prototype: char Cmp(const char *osp, const char *osp2);
Description: Compares two strings. Cmp() differs from strcmp() (string.h) in two ways. First, Cmp() returns 1 if the strings match, and 0 if not. Second, Cmp() is neither case sensitive nor white space sensitive.
Inputs:
    *osp - First string to compare
    *osp2 - Second string to compare
Output:
    0 - Strings don't match
    1 - Strings match
Notes: N/A
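The semantics described for Cmp() and DupRange() can be illustrated with a small stand-alone sketch. It is not the Parser's implementation: the names sketch_cmp(), sketch_dup_range(), and sketch_is_space() are hypothetical, "white space" is assumed to mean the same four characters RemoveWhiteSpaces() strips, and end is assumed to be inclusive, based on the "last character to copy" wording above. Returning 0 on bad input is also an assumption.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: same white space set as RemoveWhiteSpaces(). */
static int sketch_is_space(char c)
{
    return c == ' ' || c == '\n' || c == '\r' || c == '\t';
}

static const char *sketch_skip_spaces(const char *p)
{
    while (sketch_is_space(*p))
        ++p;
    return p;
}

/* Hypothetical stand-in for Cmp(): returns 1 on a match, 0 otherwise,
   ignoring letter case and white space in both strings. */
char sketch_cmp(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return 0;
    for (;;) {
        a = sketch_skip_spaces(a);
        b = sketch_skip_spaces(b);
        if (*a == '\0' || *b == '\0')
            return (char)(*a == *b);      /* match only if both ended */
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
}

/* Hypothetical stand-in for DupRange(): copies sp[start..end], with end
   treated as inclusive (an assumption). The caller must free() the result. */
char *sketch_dup_range(const char *sp, int start, int end)
{
    size_t len;
    char *copy;

    if (sp == NULL || start < 0 || end < start)
        return NULL;
    if ((size_t)end >= strlen(sp))        /* range must lie inside the string */
        return NULL;

    len = (size_t)(end - start) + 1;
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, sp + start, len);
    copy[len] = '\0';
    return copy;
}
```

Under these assumptions, sketch_cmp("Token Set", "tokenset") returns 1, while strcmp() would report a difference, and sketch_dup_range("GrabToken", 4, 8) yields a newly allocated "Token".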
4.3 - Private Parser Functions
The following functions are only meant to be called from within the parser. Making these functions public and calling them externally will have undefined results.
4.3.01 - BuildRange()
Description: Converts a RegEx range (Ex: "[1-4abcd]") into a bitfield for easy processing.

4.3.02 - Compile()
Description: Compiles a RegEx to its respective byte code.

4.3.03 - DisableRegEx()
Description: Disables RegEx handling for the specified token separator and restores its original token separator.

4.3.04 - DisableWildcard()
Description: Disables wildcard handling for the specified token separator and restores its original token separator.

4.3.05 - EnableRegEx()
Description: Compiles the specified token separator into RegEx byte code.

4.3.06 - EnableWildcard()
Description: Converts the specified token separator into an easy to consume version of the glob/wildcard based separator.

4.3.07 - ForwardSearch()
Description: Searches from the current location in the parser for the end of the glob/wildcard separator.

4.3.08 - ForwardSearchReg()
Description: RegEx version of ForwardSearch().

4.3.09 - GrabLeftover()
Description: Returns any data that was left in the parser. This function is called once the end of the file is reached and no more tokens have been found.

4.3.10 - GrabNextChunk()
Description: This function handles all file input. It will allocate the space for the buffer, if required, and then read in the next chunk of the file.

4.3.11 - GTChar*()
Description: These functions are called by GrabToken(). Each function is an optimized version of one of the search algorithms. While this group of functions could be reduced to 3 functions, the performance loss is not worth it.

4.3.12 - HandleEscapes()
Description: Converts all escape sequences ("\n\r\b...") into an easy to process form.

4.3.13 - InvalidRegEx()
Description: Performs some basic syntax checks on the specified RegEx.
4.3.14 - PDupRangeFile()
Description: Optimized version of DupRangeFile() that takes the current Parser state into account.

4.3.15 - PreserveBufferState()
Description: Preserves the Parser's current location in the file/memory, in order to enable forward searching or similar.

4.3.16 - PreserveTSetHistory()
Description: Preserves the current token set history to enable forward searching or similar.

4.3.17 - PrintCompiled()
Description: Prints a compiled RegEx's byte code to the command prompt for debugging purposes.

4.3.18 - ProcessToken()
Description: Whenever a token is found, this function is called to handle all logic attached to the token.

4.3.19 - ProcessTokenWild()
Description: RegEx and Globbing version of ProcessToken().

4.3.20 - ReadBinary()
Description: Grabs the specified number of bytes from the current Parser state. This function is called by GrabBinary*() and GrabBytes().

4.3.21 - RebuildHash()
Description: Builds or rebuilds a hash table out of the first character of each token separator.

4.3.22 - RestoreBufferState()
Description: Restores a state returned from PreserveBufferState().

4.3.23 - RestoreTSetHistory()
Description: Restores a state returned from PreserveTSetHistory().

4.3.24 - ShiftRight()
Description: Shifts a RegEx byte code over by the specified number of bytes.

4.3.25 - SortTokenSet()
Description: Performs a weak sort on all the token separators. All tokens are sorted based on their first character, followed by the length of each token separator. This ensures that all supersets of separators will be checked before any subsets.

4.3.26 - UpdateThreads()
Description: Performs one iteration on a group of RegEx threads.

4.3.27 - WriteThreadBefore()
Description: Writes a RegEx JUMP or THREAD instruction before the specified byte code block.
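Two of the private functions above, BuildRange() and SortTokenSet(), describe techniques that are easy to show in isolation. The following is only a sketch of those ideas and not the Parser's internal code: every sketch_ name is hypothetical, the range syntax handled is limited to literal characters and "x-y" spans (no negation or escapes), the 256-bit table layout is an assumption, and the strcmp() tie-break in the comparator is added here purely to make the order deterministic.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_RANGE_BYTES 32   /* 256 bits, one per possible byte value */

/* Turns a range such as "[1-4abcd]" into a bitfield.
   Returns the number of pattern characters consumed, or -1 if malformed. */
int sketch_build_range(const char *pat, unsigned char bits[SKETCH_RANGE_BYTES])
{
    const unsigned char *p = (const unsigned char *)pat;
    size_t i = 1;

    if (p == NULL || p[0] != '[')
        return -1;
    memset(bits, 0, SKETCH_RANGE_BYTES);

    while (p[i] != '\0' && p[i] != ']') {
        unsigned lo = p[i], hi = lo, c;
        if (p[i + 1] == '-' && p[i + 2] != '\0' && p[i + 2] != ']') {
            hi = p[i + 2];          /* "lo-hi" span */
            i += 3;
        } else {
            i += 1;                 /* single literal character */
        }
        if (hi < lo)
            return -1;
        for (c = lo; c <= hi; ++c)
            bits[c >> 3] |= (unsigned char)(1u << (c & 7u));
    }
    return p[i] == ']' ? (int)(i + 1) : -1;
}

/* Membership test: a single shift and mask per input byte. */
int sketch_range_has(const unsigned char bits[SKETCH_RANGE_BYTES], unsigned char c)
{
    return (bits[c >> 3] >> (c & 7u)) & 1;
}

/* Orders separators by first character, then longest first, so that
   "<<=" is tried before "<=" and "<=" before "<". */
static int sketch_sep_compare(const void *lhs, const void *rhs)
{
    const char *a = *(const char *const *)lhs;
    const char *b = *(const char *const *)rhs;
    unsigned char ca = (unsigned char)a[0], cb = (unsigned char)b[0];
    size_t la = strlen(a), lb = strlen(b);

    if (ca != cb)
        return ca < cb ? -1 : 1;
    if (la != lb)
        return la > lb ? -1 : 1;
    return strcmp(a, b);            /* tie-break added for this sketch */
}

void sketch_sort_token_seps(const char **seps, size_t count)
{
    qsort(seps, count, sizeof *seps, sketch_sep_compare);
}
```

With the separators "<", "=", "<<=", "<=", "==", this ordering produces "<<=", "<=", "<", "==", "=", which is why a first-match scan finds the longest separator before any of its prefixes.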
5.0 - Define List
Returned by ErrorCode():
    Define                        Value   Description
    PARSER_NOT_INITIALIZED        -1      Not Initialized. Call ParserInit() first.
    PARSER_NO_ERROR               0       No Error
    PARSER_COULD_NOT_OPEN_FILE    1       Could not open the specified file.
    PARSER_OUT_OF_MEMORY          2       Could not allocate the required memory
    PARSER_END_OF_FILE            3       Reached the end of the file
    PARSER_MEMORY_NOT_VALID       4       Data passed in to LoadMemory() was null
    PARSER_GRAB_TOKEN_IGNORE      5       GrabToken() or similar was called from a function callback with ignore set to 1. This is not supported.

AddTokenSeparator() SwitchTo defines:
    Define                     Value   Description
    PARSER_TSET_DONT_SWITCH    -1      Don't change the current TSet
    PARSER_TSET_LAST           -2      Switch to the last active TSet

AddTokenSeparator() return values:
    Define                        Description
    PARSER_NO_ERROR               Token has been added.
    PARSER_TOKEN_NULL             Token is NULL/0, and thus invalid
    PARSER_OUT_OF_MEMORY          Could not allocate enough memory for the token
    PARSER_REGEX_COMPILE_ERROR    Could not compile the given RegEx
    PARSER_UNBALANCED_PARENS      Unbalanced () in RegEx
    PARSER_UNBALANCED_BRACKETS    Unbalanced [] in RegEx

Parser Options - See Options/Configuration for descriptions:
    PARSER_HASH
    PARSER_CASE_INSENSITIVE
    PARSER_WILDCARD
    PARSER_GLOBBING
    PARSER_REGULAR_EXPRESSIONS
    PARSER_REGEX
    PARSER_CLOSE_FILE
    PARSER_SORT_TOKEN_SEPS
    PARSER_CONST_FILE_NAME
    PARSER_CONST_LOAD_MEMORY
    PARSER_CONST_TOKEN_SEPS
    PARSER_OWNS_FILE_NAME
    PARSER_OWNS_LOAD_MEMORY
    PARSER_OWNS_TOKEN_SEPS

Other:
    PARSER_CALLBACK - Typedef of the Parser Callback prototype.
    PCBACK - Same as PARSER_CALLBACK
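As an illustration of how the ErrorCode() values listed above might be turned into messages, here is a minimal sketch. The numeric values and message texts are taken from the list; the SKETCH_ constant names and the sketch_error_text() function are hypothetical and exist only so the example compiles without the Parser source. The real PrintErrorCode() may format its output differently.

```c
/* Mirror of the documented ErrorCode() values, under hypothetical names
   so they cannot collide with the Parser's own defines. */
enum {
    SKETCH_NOT_INITIALIZED     = -1,
    SKETCH_NO_ERROR            =  0,
    SKETCH_COULD_NOT_OPEN_FILE =  1,
    SKETCH_OUT_OF_MEMORY       =  2,
    SKETCH_END_OF_FILE         =  3,
    SKETCH_MEMORY_NOT_VALID    =  4,
    SKETCH_GRAB_TOKEN_IGNORE   =  5
};

/* Maps an error code to a short description; unknown codes get a
   fallback string instead of a NULL pointer. */
const char *sketch_error_text(int code)
{
    switch (code) {
    case SKETCH_NOT_INITIALIZED:     return "Not initialized. Call ParserInit() first.";
    case SKETCH_NO_ERROR:            return "No error";
    case SKETCH_COULD_NOT_OPEN_FILE: return "Could not open the specified file";
    case SKETCH_OUT_OF_MEMORY:       return "Could not allocate the required memory";
    case SKETCH_END_OF_FILE:         return "Reached the end of the file";
    case SKETCH_MEMORY_NOT_VALID:    return "Data passed in to LoadMemory() was null";
    case SKETCH_GRAB_TOKEN_IGNORE:   return "GrabToken() called from a callback with ignore set to 1";
    default:                         return "Unknown error code";
    }
}
```

In a real program the argument would presumably be the value returned by ErrorCode(), compared against the PARSER_ defines rather than these stand-ins.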
6.0 - Known Bugs
RegEx Bugs:
    Parser can enter an infinite loop
        Example: ".*" - Will always succeed without removing any characters from the stream.
PeekToken() calls callbacks
    If a token sep is set to ignore, any related callback may be called multiple times for the same set of bytes in the file.
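The infinite loop described above is the classic zero-length match problem: if a separator can succeed without consuming input, a scan loop that only advances by the match length never moves. One common guard, sketched below, is to step forward by one byte whenever a match is empty. This is not the Parser's code or a fix the author has announced; sketch_matcher, sketch_match_a_star(), and sketch_count_matches() are hypothetical names, and the matcher simply imitates a pattern like "a*" that may match nothing.

```c
#include <stddef.h>

/* A matcher reports how many bytes it consumed at the given position.
   Returning 0 means "matched, but consumed nothing". */
typedef size_t (*sketch_matcher)(const char *text);

/* Imitates the RegEx "a*": a run of zero or more 'a' characters. */
size_t sketch_match_a_star(const char *text)
{
    size_t n = 0;
    while (text[n] == 'a')
        ++n;
    return n;
}

/* Counts non-empty matches. The len == 0 branch is the guard: without
   it, pos would never change and the loop would spin forever. */
size_t sketch_count_matches(const char *text, sketch_matcher match)
{
    size_t pos = 0, found = 0;

    while (text[pos] != '\0') {
        size_t len = match(text + pos);
        if (len == 0) {
            ++pos;              /* force progress past an empty match */
            continue;
        }
        ++found;
        pos += len;
    }
    return found;
}
```

For the input "aabaaa" the sketch reports two matches ("aa" and "aaa") and terminates, because the empty match at 'b' is skipped instead of being retried.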
7.0 - Planned Features
RegEx:
    Capture Groups
    Backreferences
Parsing Methods:
    LALR
Ability to modify existing Token Separators
C++ Wrapper:
    Use new and delete internally
    return std::auto_ptr<char *> (std::string would duplicate the string)
Save Parser State As:
    External Text File and Text Stream: Human Readable and Easy To Edit
    External Binary File and Byte Stream: Fast and Small. All preprocessing already done.
    Issues: Parser Callbacks & Params. Function and Var registry?
    Pluses: No recompilation required to change how the parser works
External Script File
8.0 - Change Log
Parser v 8.9
    Massive Performance Boosts across the board
    Runtime configurable options
        Case Insensitivity
        Hashing
        Globbing
        RegEx (Work In Progress)
        Some Memory Management
        Close File After Read
    Callback:
        Takes char * - matched token sep
        Returns char * - value GrabToken() should return
    Added GetParams();
    Added GetFileSize();
    General Bug Fixes
    Reduced Indirection in Internal Structs
    C++ Wrapper: Implemented Copy Constructor
    Smarter Caching
    Fixed long standing GrabBinary*() buffer size limitation
    GrabChar() -> GrabByte(), due to confusion
    Removed DeSmet C support

Parser v 8.0
    Function callback is now passed a void * - AddTokenSeparator() now takes an additional parameter
    Restricted GrabToken() from callback from Token with ignore == 1
        This would cause an infinite recursion loop.
    Fixed a possible buffer over read
    Token order is now preserved correctly
    Various Optimizations
    Bulk of String Manipulation Functions now use const when possible
    Performance Delta:
        GNU: ~6% Faster
        MS : No performance difference
    Note: A few new warnings have been introduced, and need to be fixed

Parser v 7.1
    Bug fix relating to recursion caused by Parser callback function calling GrabToken().
    Added Get/SetParserState()
    Added GenericDiscard() Parser callback, since it is a fairly common function.

Parser v 7.0
    Moved most of the documentation to this html file
    A few bug/broken logic fixes

Parser v 6.0
    Dropped C++ build
    Added DeSmet C support - strict ANSI C
    Several bug/broken logic fixes
    Reduced requested frees and allocs by ~66%
    Massive Code Cleanup
    Removed a lot of redundant code
    Improved internal error handler
    Added function callbacks
    Cleaned up documentation

Parser v 5.0
    Began testing on Linux
    Several bug/broken logic fixes
    Massive performance boost to internal file handler (~60% faster!)

Parser v 4.0
    Implemented binary support
    Expanded internal File Handler
        Load Files Dynamically
        Load Memory Dynamically
    Improved Internal Error Handler

Parser v 3.0
    Implemented internal File Handler

Parser v 2.0
    Added a C++ build
    Added the bulk of the String Manipulation Functions

Parser v 1.0 - Original build with
    Multiple Token Set Support
    Token Separators with logic: Return, switchto, ignore