Parsing using QLexer and QRegex
If you ever need to parse user input or configuration files, QLexer
is a QCubed component that may save you some time. QLexer allows you to
create a simple parser using regular expressions.
In this example, you will see a multiline QTextBox
that accepts user input in the BBCode format - the
format that's frequently used on bulletin boards to prevent script
injection and arbitrary HTML input. When you press the button underneath
the textbox, the BBCode input is parsed and converted into the
corresponding HTML, ready to be output on the site. What's
really happening here is validation of user input: attempts to inject
raw HTML tags like <script> have no effect, because they are
escaped rather than passed through. Only a subset of HTML formatting options is
thus accessible through this BBCode notation.
Let's inspect how this is done. This example contains a custom class,
BBCodeParser, that abstracts out the parsing logic. In
the on-click handler for our button, we'll simply instantiate an object
of that class, and pass the user input to it. We'll then get the HTML
result of the BBCode transformation from that object, and display it in the QLabels
of our form.
Now, inspecting the BBCodeParser class: we first
instantiate a QLexer, and then use the addEntryPattern()
and addExitPattern() methods to define the regular
expressions that will outline the tokens in the BBCode. Anything outside
of the defined pattern list will be passed through htmlentities() and thus
will be safe: this is how we're accomplishing the "no arbitrary HTML
allowed" requirement.
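To make the escaping idea concrete, here is a minimal sketch in plain PHP. It does not use QLexer; the helper name escapeOutsideTags() and the tag pattern are assumptions made for illustration, built from the tags this example supports. It splits the input on recognized tags and passes everything else through htmlentities().

```php
<?php
// Hypothetical helper, not part of QLexer or BBCodeParser: escape every
// piece of the input that is not one of the recognized BBCode tags.
function escapeOutsideTags(string $input): string
{
    // Assumed tag list, taken from the tags this example supports.
    $tagPattern = '#(\[/?(?:b|i|code|url|img)\])#i';

    // With PREG_SPLIT_DELIM_CAPTURE and one capture group, the result
    // alternates: plain text at even indexes, matched tags at odd indexes.
    $parts = preg_split($tagPattern, $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $out .= $part; // a recognized tag, left for later rendering
        } else {
            $out .= htmlentities($part, ENT_QUOTES, 'UTF-8'); // everything else is escaped
        }
    }
    return $out;
}

echo escapeOutsideTags('[b]<script>alert(1)</script>[/b]') . PHP_EOL;
```

The injected markup comes out as inert text (&lt;script&gt;...), while the [b] tags survive for the rendering step.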
We then call the Tokenize() method on the QLexer
object to perform the actual parsing based on the rules (patterns) we
defined above. The result is an array of tokens; we'll then inspect
these tokens in the Render() method of BBCodeParser.
Each element of the array is itself an array with two entries: $objToken['token']
contains the name of the matched token, based on the defined patterns; $objToken['raw']
contains an array of the elements that were matched in the input.
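The following is a rough stand-in for the tokenizing step, written with plain preg_match() rather than QLexer, to show the token shape described above. The function name tokenizeBBCode(), the pattern table, and the 'text' token name for unmatched input are all assumptions; QLexer's own internals and pattern syntax may differ.

```php
<?php
// Stand-in tokenizer for illustration only; this is not QLexer's code.
function tokenizeBBCode(string $input): array
{
    // Hypothetical pattern table: token name => regular expression.
    $patterns = [
        'start_image' => '\[img\](.*?)\[/img\]',
        'start_bold'  => '\[b\](.*?)\[/b\]',
    ];

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        // Find the pattern whose next match starts earliest.
        $best = null;
        foreach ($patterns as $name => $regex) {
            $found = preg_match('#' . $regex . '#is', $input, $m, PREG_OFFSET_CAPTURE, $offset);
            if ($found === 1 && ($best === null || $m[0][1] < $best['pos'])) {
                $best = [
                    'name' => $name,
                    'pos'  => $m[0][1],
                    'len'  => strlen($m[0][0]),
                    'raw'  => [$m[1][0]],
                ];
            }
        }

        if ($best === null) {
            // Nothing else matches: the rest of the input is plain text.
            $tokens[] = ['token' => 'text', 'raw' => [substr($input, $offset)]];
            break;
        }
        if ($best['pos'] > $offset) {
            // Plain text between the current offset and the next match.
            $tokens[] = ['token' => 'text', 'raw' => [substr($input, $offset, $best['pos'] - $offset)]];
        }
        $tokens[] = ['token' => $best['name'], 'raw' => $best['raw']];
        $offset = $best['pos'] + $best['len'];
    }
    return $tokens;
}

print_r(tokenizeBBCode('Hi [img]http://foo.com/a.jpg[/img]!'));
```

Each entry in the returned array has a 'token' name and a 'raw' array of matched pieces, which is the shape a rendering loop would consume.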
As we loop through the tokens, we inspect the name of each
matched token and, based on that, determine how to display the raw
matched items. For example, if we are looking at a start_image
token - which would match the [img]http://foo.com/a.jpg[/img]
input - we would want to take the matched contents (http://foo.com/a.jpg)
and place them into an image tag (<img
src="http://foo.com/a.jpg" />). That's exactly what the renderImage()
method of the BBCodeParser class does.
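A minimal sketch of such a rendering loop might look like the following. It is not the actual BBCodeParser code: the free functions, the hand-written $tokens array, the 'text' token name, and the assumption that the URL sits in $objToken['raw'][0] are all illustrative.

```php
<?php
// Hypothetical rendering helper: wrap the matched URL in an image tag.
// Assumes the URL is the first element of the token's 'raw' array.
function renderImage(array $raw): string
{
    $url = htmlspecialchars($raw[0], ENT_QUOTES, 'UTF-8');
    return '<img src="' . $url . '" />';
}

// Loop over the tokens and pick a renderer based on the token name.
function renderTokens(array $tokens): string
{
    $html = '';
    foreach ($tokens as $objToken) {
        switch ($objToken['token']) {
            case 'start_image':
                $html .= renderImage($objToken['raw']);
                break;
            default:
                // Anything without a dedicated renderer is escaped as text.
                $html .= htmlentities(implode('', $objToken['raw']), ENT_QUOTES, 'UTF-8');
        }
    }
    return $html;
}

// Hand-written sample tokens in the 'token' / 'raw' shape.
$tokens = [
    ['token' => 'text', 'raw' => ['Logo: ']],
    ['token' => 'start_image', 'raw' => ['http://foo.com/a.jpg']],
];

echo renderTokens($tokens) . PHP_EOL;
// prints: Logo: <img src="http://foo.com/a.jpg" />
```

Keeping one small render function per token name makes it straightforward to add further tags later without touching the loop.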
Note that nested (recursive) parsing is currently not supported by
QLexer: in the example below, inputting [b][i]Hello[/i]
world[/b] will not generate the desired result. Adding support
for recursive parsing to QLexer is something that the QCubed project is
considering for the next release - if you happen to implement it, please
do share your code!
Input your BBCode here and click the button. Supported tags: [b],
[i], [code], [url], [img].