Regular Expressions (RegEx) for Parsing Text (AutoHotkey Quick Reference Script Part Three)

The RegExReplace() Function Makes It Easy to Extract and Clean Up Text, Plus a Quick-and-Dirty RegEx to Strip All HTML Tags

Last time, we accessed the online AutoHotkey commands using the site's hidden built-in index. Whenever the script downloaded a command page, we identified it by the embedded HTML code <pre class="Syntax">. Not only do the <pre class="Syntax">…</pre> tags identify the command pages, but they also surround the proper syntax for that command. Since this easily located HTML format appears in every command page, it can be used to launch a quick reference pop-up window. We only need to parse the command syntax with the RegExReplace() function, then clean up any extraneous HTML tags.

The AutoHotkeyQuickRef.ahk script includes two RegExReplace() functions:

CmdRef := RegExReplace(RefSource,".+?<pre class=""Syntax"">(.+?)</pre.+","$1")
CmdRef := RegExReplace(CmdRef,"<.+?>") ; quick and dirty HTML removal

The first line of code extracts the command format text, then the second line strips any remaining HTML code, usually link formatting (e.g. <a href="…">…</a>).

Note: Each pair of doubled double-quotes ("") around the word Syntax in the first code line escapes a single literal double-quote mark ("). From the online documentation: “Within an expression, two consecutive quotes enclosed inside a literal string resolve to a single literal quote.”
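To see both steps at work without downloading anything, here is a minimal sketch in AutoHotkey v1.1 syntax. The RefSource string is a made-up, one-line stand-in for a downloaded command page (an assumption for illustration, not the real page source):

```autohotkey
; Hypothetical sample standing in for a downloaded command page.
; In the real script, RefSource holds the full page source.
RefSource := "<html><body><h1>Loop</h1><pre class=""Syntax"">Loop [, <a href=""#count"">Count</a>]</pre><p>Remarks</p></body></html>"

; Step 1: keep only the text between <pre class="Syntax"> and </pre.
CmdRef := RegExReplace(RefSource, ".+?<pre class=""Syntax"">(.+?)</pre.+", "$1")
; CmdRef now contains: Loop [, <a href="#count">Count</a>]

; Step 2: strip the tags left inside the extracted text.
CmdRef := RegExReplace(CmdRef, "<.+?>")

MsgBox, % CmdRef   ; shows: Loop [, Count]
```

Note how each doubled quote inside the pattern string turns into one literal quote by the time the RegEx engine sees it.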

The Universal “Blah, Blah, Blah” RegEx (.+?)

If you open the source code for most Web pages (in Google Chrome, right-click and select “View page source” from the menu), the HTML page opens. For those unfamiliar with HTML, it looks like gobbledygook. Many people avoid working with source code because, to anyone who doesn’t write HTML, the text is a foreign language. They often fear that messing with any type of computerese might break something. Let me reassure you that you won’t accidentally damage someone else’s Web page.

Web pages exist on the remote server and, unless you’re a high-level hacker, the Web limits you to reading the page either through your Web browser or by downloading the source code text. However, once you download the page code, you can manipulate it on your own computer any way you like, without any additional access to the remote Web page itself. Plus, since HTML is pure text, it’s perfectly safe to tear apart on your own computer. That’s what we do here.

Rather than understanding what all the code does, you only need to find the bit of text you want to extract and whatever unique code surrounds it. Then, you can use the “blah, blah, blah” RegEx to remove the extraneous HTML (and any other junk found in the page). The image below shows Ryan’s RegEx Tester parsing the command syntax from an AutoHotkey command page using the “blah, blah, blah” RegEx three times:


The top pane in this RegEx Tester shows the source code from an AutoHotkey Web page. Buried in the code, the <pre class="Syntax"> tag marks the location of the command text. Using this tag as a marker, the “blah, blah, blah” RegEx (.+?) eliminates the extra code in the page, leaving only the desired text (Loop [, Count]) as the result.

I call it the “blah, blah, blah” RegEx because it matches anything in a text string until it encounters the next character in the expression: in the first case, the < from <pre class="Syntax">. Next, the second “blah, blah, blah” RegEx matches the target text between the two tags and stops at the < at the beginning of </pre. (Enclosing that RegEx inside parentheses creates a backreference for saving the text, $1 in the Replacement Text field.) Finally, the third “blah, blah, blah” RegEx, the greedy form (.+), matches all of the remaining text. The final result in the bottom pane displays only the text saved by the backreference $1.

How the “Blah, Blah, Blah” Regex Works

As shown by the AutoHotkey “Regular Expressions (RegEx) - Quick Reference”, the .+ operator combination creates a wildcard expression. The dot (.) matches any single character (by default, any character that is not part of a newline (`r`n) sequence), while the plus sign tells the engine to keep matching one or more of them. However, this is the greedy form of the “blah, blah, blah” RegEx: it continues matching all of the text until it reaches either the last occurrence of the next character in the expression or the end of the text.

For example, the RegEx .+< would travel through the text matching everything until it reaches the very last < in the text. But in our situation, we want to stop at the first <pre class="Syntax"> (non-greedy). We add the question mark at the end of the expression (.+?) to eliminate the greedy condition and stop the match at the first occurrence of <. (In fact, the question mark may not be required for the first “blah, blah, blah” RegEx since <pre class="Syntax"> usually only occurs once in the text, but just in case…) Note that at the end of the RegEx, I only use .+ since I’m no longer looking for more text to save.
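The greedy versus non-greedy difference is easy to check on a short test string. This sketch (AutoHotkey v1.1 syntax) uses the built-in RegExMatch() function, which the script itself does not use, only because it makes the matched text visible. The Sample string and variable names are invented for illustration:

```autohotkey
; Hypothetical test string containing two "<" characters.
Sample := "aaa<b>bbb<i>ccc"

; Greedy: .+ grabs as much as it can, so the match ends at the LAST "<".
RegExMatch(Sample, ".+<", GreedyMatch)    ; GreedyMatch contains: aaa<b>bbb<

; Non-greedy: .+? grabs as little as it can, so the match ends at the FIRST "<".
RegExMatch(Sample, ".+?<", LazyMatch)     ; LazyMatch contains: aaa<

MsgBox, % "Greedy: " GreedyMatch "`nNon-greedy: " LazyMatch
```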

Tip: You’ll find the .*? expression slightly more universal than the .+? expression. The former (*) matches zero or more characters, while the latter (+) matches one or more.
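The difference only shows up when there is nothing between the two markers. A small sketch (AutoHotkey v1.1 syntax) with an invented, empty tag pair as the test string:

```autohotkey
; Hypothetical test string: a tag pair with nothing in between.
Sample := "<b></b>"

; .+? needs at least one character between the tags, so nothing matches
; and the text comes back unchanged.
Plus := RegExReplace(Sample, "<b>(.+?)</b>", "[$1]")   ; <b></b>

; .*? accepts zero characters, so the empty pair matches and is replaced.
Star := RegExReplace(Sample, "<b>(.*?)</b>", "[$1]")   ; []

MsgBox, % "Plus: " Plus "`nStar: " Star
```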

Stripping HTML Tags with the “Blah, Blah, Blah” Regex

The second RegExReplace() function shown above removes any remaining HTML. (I noted that some of the pages embedded links <a href="">…</a> within the command format text.) The “blah, blah, blah” RegEx offers a quick-and-dirty method for excising all of them: <.+?>.

CmdRef := RegExReplace(CmdRef,"<.+?>") ; quick and dirty HTML removal

Every HTML tag starts with a left angle bracket < and ends with a right angle bracket >. Therefore, each match starts at an occurrence of < and consumes everything up to and including the next >. In this situation, the non-greedy ? is absolutely essential for removing only the HTML tags. Otherwise, a single match runs from the first < to the last >, and it’s likely that the entire page will disappear, unless it finds no HTML code at all. (Note that the RegExReplace() function replaces each match with nothing because no replacement text is given.)
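A short sketch (AutoHotkey v1.1 syntax) shows what goes wrong without the question mark. The Html string is a made-up fragment, not taken from a real command page:

```autohotkey
; Hypothetical fragment with a link and a bold tag.
Html := "See <a href=""Loop.htm"">Loop</a> and <b>Count</b>."

; Non-greedy: each tag is matched and removed separately.
Lazy := RegExReplace(Html, "<.+?>")     ; See Loop and Count.

; Greedy: one match runs from the first < to the last >,
; taking the text between the tags with it.
Greedy := RegExReplace(Html, "<.+>")    ; See .

MsgBox, % "Non-greedy: " Lazy "`nGreedy: " Greedy
```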

A Little Cleanup with StringReplace

I found that some of the saved text included the HTML character entity &quot; in the command format. I could have removed all HTML character entities with:

CmdRef := RegExReplace(CmdRef,"&.+?;")

but I opted to use the StringReplace command to convert the codes to double-quote marks:

StringReplace, CmdRef, CmdRef, &quot;, ", all
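The two approaches give different results, as this sketch (AutoHotkey v1.1 syntax) suggests. The CmdRef value is an invented sample, and the Removed and Converted variable names are only for illustration:

```autohotkey
; Hypothetical extracted text containing two &quot; entities.
CmdRef := "MsgBox, &quot;Text&quot;"

; Option 1: delete every entity outright (the quotes are lost).
Removed := RegExReplace(CmdRef, "&.+?;")              ; MsgBox, Text

; Option 2: convert &quot; back into a literal double quote.
StringReplace, Converted, CmdRef, &quot;, ", All      ; MsgBox, "Text"

MsgBox, % "Removed: " Removed "`nConverted: " Converted
```

One caution about the RegEx route: &.+?; matches from any ampersand to the next semicolon, so a stray & in ordinary text could take real content with it. Converting only the known entity avoids that.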

More Quick Reference Features

I have more features planned for the AutoHotkeyQuickRef.ahk script, but I’m not sure which I’ll do next. I guess we’ll find out next time.



