When It Comes to Extracting Data from Text Files, Nothing Works Like Regular Expressions (RegEx)
Last time, “Peeking at Notes Inside Auto-Startup AHK Script Files,” I added a feature for reading notes inside the .ahk files targeted by shortcuts launched from the Windows Startup folder to the AutoStartupControl.ahk script. This gave me a method for reminding myself how the various auto-startup scripts work. In that blog, I discussed how to find the .ahk file containing the notes. This time, I take a look at how to use Regular Expressions (RegEx) to extract the script notes.
Anyone who follows my blog or reads my books knows that I have a fondness for Regular Expressions (RegEx). The average person may not find the RegEx system easy to follow or implement, but once the concept clicks, it makes certain aspects of programming much easier, regardless of the programming language. (That’s why I wrote the book A Beginner’s Guide to Using Regular Expressions in AutoHotkey.) When confronted with extracting text or implementing complex replacements, I immediately gravitate toward RegEx. When implemented in scripts such as IPFind.ahk and SynonymLookup.ahk, these enigmatic expressions have made my AutoHotkey life much easier.
Locating AutoHotkey Script Notes Inside a File
Many scriptwriters place a section of text at the beginning of the .ahk file (often a description of what the app does and how to use it) bounded by the comment delimiters /* and */. When each delimiter appears on its own line, before (/*) and after (*/) a comments section in an AutoHotkey script, AutoHotkey ignores the entire chunk as one long remark. By using these two boundaries as keys for the RegEx to locate script notes, I can extract and display that data in a message box (MsgBox) or GUI pop-up window. Of course, the script file must contain this type of bounded remarks section for the system to work.
If you’re new to Regular Expressions (RegEx), I suggest you review some of the free resources available:
- The main AutoHotkey site’s RegEx Quick Reference page
- Jack’s AutoHotkey Blog Regular Expressions (RegEx) Page
- A Perfect Place to Use an AutoHotkey Regular Expression (RegEx) in Text Replacement
Occasionally, you run into a search-and-replace problem that cries out for an AutoHotkey RegEx (Regular Expression). But is learning how to use Regular Expressions worth your time? You decide! Here’s a real problem and a beginner’s mini-tutorial for solving it with RegEx.
- Deleting Double Words with AutoHotkey Regular Expressions (RegEx)
When too many identical chapter numbers appear in the e-book index, it’s time for another AutoHotkey RegEx. Includes how to use a backreference!
- Quick and Dirty Complex Text Replacement with Ryan’s RegEx Tester
While not for beginning AutoHotkey scriptwriters, this Regular Expression (RegEx) trick executes multiple complex text replacements without even writing an AutoHotkey script.
- Regular Expressions (RegEx) for Parsing Text
The RegExReplace() function makes it easy to extract and cleanup text, plus a quick-and-dirty RegEx to strip all HTML tags.
- Web Data Extraction Script (An Easy AutoHotkey RegEx Trick)
A simple Regular Expression (RegEx) retrieves your daily horoscope by harvesting data from a web page—this 10-line AutoHotkey script demonstrates how to build your own web-based pop-ups.
- Using Regular Expressions to Convert Most Formatted Dates into DateTime Stamps
AutoHotkey offers many techniques for converting the DateTime stamp (yyyymmdd) into formatted dates, but what about going in the other direction? Use RegEx to identify date formats—including British and American!
- Powerful RegEx Text Search Shorthand (~=)
AutoHotkey provides an abbreviated Regular Expression RegExMatch() operator ( ~= ) for quick wildcard text matches.
I used the following code in the RegExMatch() function to extract the first block comment in the AutoHotkey script:
RegExMatch(FileVar, "s)/\*(.*?)\*/" , NotesVar)
The RegEx (the quoted second parameter) contains a set of symbols telling AutoHotkey what to include in a text match.
- The RegEx s)/\*(.*?)\*/ looks for a set of the comments section boundaries as denoted by the /\* and \*/ characters. Since the asterisk holds special properties in RegEx, you must escape the character by preceding each with a backslash ( \ ).
- The (.*?) in the center of the RegEx matches all characters between the two boundaries. The dot ( . ) means any character; the asterisk’s ( * ) special property repeats the prior match (in this case any character) until encountering the next symbol (the final escaped asterisk \*); the question mark ( ? ) tells RegEx to stop on the first occurrence of the next match rather than the last.
- We place RegEx options at the beginning of an expression followed by a close parenthesis, in this case the symbols s). Normally, the dot does not match line breaks, so a match stops at the end of a line terminated by a return `r and/or a new line `n. The s) option lets the dot match those line breaks too, so the match can span multiple lines.
- The set of parentheses in the center of the RegEx creates and saves a sub-pattern from the results. The RegExMatch() function above saves the entire match in the variable NotesVar. The function loads the first sub-pattern, as indicated by the first set of parentheses, into the pseudo-array variable NotesVar1. This allows the exclusion of the comment boundary delimiters in the final display.
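To see the pieces working together, here is a minimal AutoHotkey v1 sketch built around the same RegExMatch() call. The file name Example.ahk and the variable ScriptPath are assumptions made up for this illustration; FileVar, NotesVar, and NotesVar1 follow the names used above.

```autohotkey
#NoEnv
; Assumption: Example.ahk is a hypothetical file sitting beside this script.
ScriptPath := A_ScriptDir "\Example.ahk"

FileRead, FileVar, %ScriptPath%
if ErrorLevel
{
    MsgBox, Could not read %ScriptPath%
    ExitApp
}

; s) lets the dot cross line breaks; (.*?) lazily captures up to the first closing delimiter.
FoundPos := RegExMatch(FileVar, "s)/\*(.*?)\*/", NotesVar)

if (FoundPos = 0)
    MsgBox, No block comment found in %ScriptPath%
else
    MsgBox, % Trim(NotesVar1, " `t`r`n")
ExitApp
```

RegExMatch() returns 0 when nothing matches, so checking FoundPos avoids showing an empty box for scripts without a notes block. The Trim() call is an optional extra that strips the blank space and line breaks left just inside the delimiters.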
Note: I don’t expect everyone to immediately grasp this list of explanations. Regular Expressions take a little time to comprehend. Even now, I must go back to examples and documentation when I write new expressions. But, if you add RegEx to your toolbox, you’ll find it worth the time.
When I put notes into AutoHotkey scripts, I place hard returns at measured intervals to force a word wrap at matching line lengths. This allows the presentation of the text in a consistent manner in text editors. However, when using a MsgBox to display text, the built-in word-wrap forces those hard returns to produce awkward results—line breaks in strange places.
When issuing the command:
MsgBox, 0, %Name_no_ext%, % Location "`r" NotesVar1
the auto-adjust of the MsgBox command forces odd word wraps leaving short broken lines after each hard return.
Using a GUI Window
I can avoid the hassle the MsgBox causes by using a GUI window:
Gui, Add, Text, , %Location%`r%NotesVar1%
Gui, Show, , %Name_no_ext%
Each window expands to accommodate the fixed line lengths created by the hard returns—displaying a format similar to that found in the .ahk file.
However, the GUI window comes with its own problems—especially if you don’t include hard returns for setting line lengths or if you remove the single hard returns (as I demonstrate below). Without special GUI options, the paragraph can run off the GUI window as a single long line.
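One GUI option that addresses the runaway line is a width on the Text control: when a width is given without a height or row count, AutoHotkey word-wraps the text and sets the control height automatically. The sketch below assumes AutoHotkey v1; LongNote and the 400-pixel width are placeholders chosen for the example.

```autohotkey
#NoEnv
; Sketch only: LongNote stands in for notes whose single hard returns were removed.
LongNote := "This paragraph has no hard returns, so without a width option the Text control would try to show it on one very long line."

Gui, Add, Text, w400, %LongNote%    ; w400 wraps the text at 400 pixels wide
Gui, Show, , Wrapped Notes
Return

GuiClose:
ExitApp
```

The trade-off is that a fixed width rewraps notes that were already hard-wrapped, so the choice of width depends on how the notes in the .ahk files were written.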
Plus, when using a GUI, you must destroy the GUI window between each subroutine call:
GuiClose:
Gui, Destroy
Return
Otherwise, closing the window by clicking the x box in the upper right-hand corner merely hides the window. If you call the routine again, it doesn’t update.
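A related sketch, again assuming AutoHotkey v1: destroying the window at the start of the display routine as well means a repeat call rebuilds it with fresh text even when the earlier copy was never closed. The function name ShowNotes, the GUI name Notes, and the F8 hotkey are all hypothetical.

```autohotkey
#NoEnv
F8::ShowNotes("Demo", "First line`nSecond line")    ; hypothetical test hotkey

; Hypothetical helper: rebuilds the named GUI on every call.
ShowNotes(Title, NoteText)
{
    Gui, Notes:Destroy                  ; discard any earlier copy (no effect if none exists)
    Gui, Notes:Add, Text, , %NoteText%  ; Add recreates the GUI automatically
    Gui, Notes:Show, , %Title%
}

; For a GUI named Notes, the close label takes the name as a prefix.
NotesGuiClose:
Gui, Notes:Destroy
Return
```

Naming the GUI keeps it from colliding with any default GUI the host script already uses.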
Having said that, I lean toward using the GUI window. It offers the most faithful representation of the notes in the .ahk file. Plus, I don’t need a special RegEx removing single hard returns.
Since I had previously written a RegExReplace() function to remove those unneeded single returns for display in a MsgBox, I felt it worth a look. You may have other uses for an AutoHotkey routine which removes single returns without touching the double returns between paragraphs—perhaps formatting text for a word processor.
Removing Single Returns with RegEx
The following RegExReplace() function removes single returns without affecting the double returns between paragraphs:
NotesVar1 := RegExReplace(NotesVar1, "(\S ?)`r`n( ?\S)" , "$1 $2")
Generally (not always), pressing the Enter key inserts a carriage return `r and a new line `n (linefeed) into a computer document. Together these two symbols act as a hard return. To remove a single return, the script must match all returns not sitting next to another one.
- The expression (\S ?)`r`n( ?\S) matches the carriage return/new line set (`r`n).
- But it must not sit next to another return or new line character (\S). The backslash with a capital S tells RegEx not to match a space, tab, return, or new line, while the backslash followed by a lower-case s matches any one of those characters.
- I opted to allow a single space before or after the hard return in the form of a blank space character followed by a question mark ( ?). This makes the space optional for those times when someone added the hard return at the end of the line after hitting the space bar.
- The replacement text “$1 $2” consists of the two sub-patterns created by the first and second set of parentheses with a single space between them. This merely swaps the hard return for a space.
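The behavior described in this list can be tried on a small sample. This is only a sketch in AutoHotkey v1; the Sample text and the variable Joined are invented for the illustration.

```autohotkey
#NoEnv
; Sample text is invented for illustration: two paragraphs, each hard-wrapped.
Sample := "First line of a paragraph`r`nsecond line of the same paragraph.`r`n`r`n"
Sample .= "A new paragraph`r`nthat also wraps."

; Same pattern as above: a return counts only when a non-space character
; (optionally with one space) sits on each side of it.
Joined := RegExReplace(Sample, "(\S ?)`r`n( ?\S)", "$1 $2")
MsgBox, %Joined%
```

The single returns inside each paragraph become spaces, while the double return between the paragraphs survives because each of its returns has another return as a neighbor. One detail worth knowing: since the optional space sits inside the parentheses, a line that ends with a space keeps that space and gains the inserted one as well. Moving the optional spaces outside the groups, as in (\S) ?`r`n ?(\S), is one possible variation that would leave a single space.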
As you can see, the RegEx works pretty well.
I ran into some other formatting issues which I’ll address next time.