Parsing using QLexer and QRegex

If you ever need to parse user input or configuration files, QLexer is a QCubed component that may save you some time. QLexer lets you build a simple parser out of regular expressions.

In this example, you will see a multiline QTextBox that accepts user input in the BBCode format - the format that's frequently used on bulletin boards to prevent script injection and arbitrary HTML input. When you press the button underneath the textbox, the BBCode input is parsed and converted into the corresponding HTML, ready to be output on the site. What's really happening here is validation of user input: attempts to inject raw HTML tags like <script> have no effect, because they are escaped rather than passed through as markup. Only the subset of HTML formatting exposed through the BBCode notation is available.

Let's inspect how this is done. This example contains a custom class, BBCodeParser, that abstracts out the parsing logic. In the on-click handler for our button, we simply instantiate an object of that class and pass the user input to it. We then get the HTML result of the BBCode transformation from that object, and display it in the QLabels of our form.

Now, inspecting the BBCodeParser class: we first instantiate a QLexer, and then use the addEntryPattern() and addExitPattern() methods to define the regular expressions that will outline the tokens in the BBCode. Anything outside of the defined pattern list will be passed through htmlentities() and thus will be safe: this is how we're accomplishing the "no arbitrary HTML allowed" requirement.
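The "escape everything outside the pattern list" rule can be illustrated outside of QCubed. What follows is a minimal sketch in Python, not QLexer code: the RECOGNIZED pattern and the escape_unrecognized() helper are hypothetical names made up for this illustration, and html.escape() stands in for PHP's htmlentities() (it converts fewer characters, but it does cover <, >, & and quotes).

```python
import html
import re

# Hypothetical pattern list: only these BBCode tags count as recognized tokens.
RECOGNIZED = re.compile(r"(\[/?(?:b|i|code)\])", re.IGNORECASE)


def escape_unrecognized(text):
    """Split text on recognized tags and HTML-escape everything else."""
    # With one capturing group, re.split() alternates: text, tag, text, tag, ...
    parts = RECOGNIZED.split(text)
    # Odd indexes are recognized tags (kept as-is); even indexes are plain text.
    return [p if i % 2 else html.escape(p) for i, p in enumerate(parts) if p]


pieces = escape_unrecognized("[b]Hi[/b] <script>alert(1)</script>")
print(pieces)
# ['[b]', 'Hi', '[/b]', ' &lt;script&gt;alert(1)&lt;/script&gt;']
```

The <script> tag is not in the pattern list, so it is treated as ordinary text and comes out escaped, while the recognized tags are left for a later rendering step.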

We then call the Tokenize() method on the QLexer object to perform the actual parsing based on the rules (patterns) we defined above. The result is an array of tokens, which we then inspect in the Render() method of BBCodeParser. Each element of the array contains two entries: $objToken['token'] holds the name of the matched token, based on the defined patterns; $objToken['raw'] holds an array of the elements that were matched in the input.
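To make the shape of the token list concrete, here is a rough sketch in Python of a tokenizer that produces entries with a 'token' name and a 'raw' list. It is an illustration only and does not reproduce QLexer's implementation: the RULES table, the tokenize() function, the "text" token name, and the exact contents of 'raw' (full match first, inner content second) are all assumptions.

```python
import re

# Hypothetical rules: (token name, entry pattern, exit pattern).
RULES = [
    ("start_bold", r"\[b\]", r"\[/b\]"),
    ("start_image", r"\[img\]", r"\[/img\]"),
]

# One alternation; each rule captures its inner content in a group named after the token.
TOKEN_RE = re.compile(
    "|".join(f"{entry}(?P<{name}>.*?){exit_}" for name, entry, exit_ in RULES),
    re.DOTALL,
)


def tokenize(text):
    """Return a list of {'token': name, 'raw': [matched pieces]} entries."""
    tokens, pos = [], 0
    for m in TOKEN_RE.finditer(text):
        if m.start() > pos:
            # Anything between two matches is plain text.
            tokens.append({"token": "text", "raw": [text[pos:m.start()]]})
        # m.lastgroup is the name of the rule that matched.
        tokens.append({"token": m.lastgroup, "raw": [m.group(0), m.group(m.lastgroup)]})
        pos = m.end()
    if pos < len(text):
        tokens.append({"token": "text", "raw": [text[pos:]]})
    return tokens


for tok in tokenize("Hi [b]We[/b] x"):
    print(tok)
# {'token': 'text', 'raw': ['Hi ']}
# {'token': 'start_bold', 'raw': ['[b]We[/b]', 'We']}
# {'token': 'text', 'raw': [' x']}
```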

As we loop through the tokens, we inspect the name of each matched token and, based on that, determine how to display the raw matched items. For example, if we are looking at a start_image token - which would match the [img]http://foo.com/a.jpg[/img] input - we want to take the matched contents (http://foo.com/a.jpg) and place them into an image tag (<img src="http://foo.com/a.jpg" />). That's exactly what the renderImage() method of the BBCodeParser class does.
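The dispatch-by-token-name loop might look roughly like the following Python sketch. It assumes tokens shaped as the text describes (a 'token' name plus a 'raw' list), and further assumes that the content between the tags is the last element of 'raw'. The function names, the RENDERERS table, and the "text" fallback are hypothetical and are not the actual BBCodeParser code.

```python
import html


def render_image(raw):
    # Assumption: the last raw element is the URL found between [img] and [/img].
    return f'<img src="{html.escape(raw[-1], quote=True)}" />'


def render_bold(raw):
    return f"<b>{html.escape(raw[-1])}</b>"


# Map each token name to the function that knows how to display it.
RENDERERS = {"start_image": render_image, "start_bold": render_bold}


def render(tokens):
    out = []
    for tok in tokens:
        renderer = RENDERERS.get(tok["token"])
        if renderer is None:
            # Unknown token names are treated as plain text and escaped.
            out.append(html.escape(tok["raw"][0]))
        else:
            out.append(renderer(tok["raw"]))
    return "".join(out)


tokens = [
    {"token": "text", "raw": ["Logo: "]},
    {"token": "start_image", "raw": ["[img]http://foo.com/a.jpg[/img]", "http://foo.com/a.jpg"]},
]
print(render(tokens))
# Logo: <img src="http://foo.com/a.jpg" />
```

Escaping the URL before it goes into the src attribute keeps a stray quote in the input from closing the attribute early.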

Note that nested (recursive) parsing is currently not supported by QLexer: in the example below, inputting [b][i]Hello[/i] world[/b] will not generate the desired result. Adding support for recursive parsing to QLexer is something that the QCubed project is considering for the next release - if you happen to implement it, please do share your code!
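For readers curious about what recursive handling involves, the sketch below contrasts a single flat pass with one that feeds tag contents back through the same renderer. It is written in Python for illustration only: it is not QLexer code, it does not claim to reproduce how QLexer actually behaves on nested input, and TAG_RE and render() are hypothetical names. It also does not handle a tag nested inside itself, such as [b][b]x[/b][/b].

```python
import html
import re

# Hypothetical pattern: matches [b]...[/b] or [i]...[/i] with the same tag on both ends.
TAG_RE = re.compile(r"\[(b|i)\](.*?)\[/\1\]", re.DOTALL)


def render(text, recurse):
    """Render [b] and [i]; if recurse is True, tag contents are parsed again."""
    out, pos = [], 0
    for m in TAG_RE.finditer(text):
        tag, inner = m.group(1), m.group(2)
        out.append(html.escape(text[pos:m.start()]))
        # Flat mode escapes the contents; recursive mode renders them as BBCode.
        body = render(inner, recurse) if recurse else html.escape(inner)
        out.append(f"<{tag}>{body}</{tag}>")
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


source = "[b][i]Hello[/i] world[/b]"
print(render(source, recurse=False))  # <b>[i]Hello[/i] world</b>
print(render(source, recurse=True))   # <b><i>Hello</i> world</b>
```

In the flat pass, the inner [i] tag is just part of the bold token's contents and is shown literally; the recursive pass is what turns it into markup.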

Input your BBCode here and click the button. Supported tags: [b], [i], [code], [url], [img].

Hello world. [b]We[/b] all love [img]http://static.php.net/www.php.net/images/logos/php-med-trans-light.gif[/img] This is a [url=http://www.google.com]link to Google[/url].
