Regular Expressions (RegEx) for Mining Text in Files (AutoHotkey Startup Control)

When It Comes to Extracting Data from Text Files, Nothing Works Like Regular Expressions (RegEx)

Last time, in “Peeking at Notes Inside Auto-Startup AHK Script Files,” I added a feature to the AutoStartupControl.ahk script for reading notes inside the .ahk files targeted by shortcuts in the Windows Startup folder. This gave me a method for reminding myself how the various auto-startup scripts work. In that blog, I discussed how to find the .ahk file containing the notes. This time, I take a look at how to use Regular Expressions (RegEx) to extract the script notes.

Anyone who follows my blog or reads my books knows that I have a fondness for Regular Expressions (RegEx). The average person may not find the RegEx system easy to follow or implement, but once the concept clicks, it makes certain aspects of programming much easier, regardless of the programming language. (That’s why I wrote the book A Beginner’s Guide to Using Regular Expressions in AutoHotkey.) When confronted with extracting text or implementing complex replacements, I immediately gravitate toward RegEx. When implemented in scripts such as IPFind.ahk and SynonymLookup.ahk, these enigmatic expressions have made my AutoHotkey life much easier.

Locating AutoHotkey Script Notes Inside a File

Many scriptwriters place a section of text at the beginning of the .ahk file (often a description of what the app does and how to use it) bounded by the comment delimiters /* and */. When these delimiters appear on separate lines before (/*) and after (*/) a comment section in an AutoHotkey script, AutoHotkey ignores the entire chunk as one long remark. By using these two boundaries as keys for the RegEx to locate script notes, I can extract and display that data in a message box (MsgBox) or GUI pop-up window. Of course, the script file must contain this type of bounded remarks section for the system to work.


If you’re new to Regular Expressions (RegEx), I suggest you review some of the free resources available:

RegEx Resources


I used the following code in the RegExMatch() function to extract the first block comment in the AutoHotkey script:

RegExMatch(FileVar, "s)/\*(.*?)\*/" , NotesVar)

The RegEx, the quoted string in the second parameter, contains a set of symbols telling AutoHotkey what to include in a text match.

[Image: Regular Expressions in AutoHotkey. Caption: Regular Expressions (RegEx) can be mysterious in any language.]

  1. The RegEx s)/\*(.*?)\*/ looks for a pair of comment-section boundaries, denoted by the /\* and \*/ characters. Since the asterisk holds special properties in RegEx, you must escape each one by preceding it with a backslash ( \* ).
  2. The (.*?) in the center of the RegEx matches all characters between the two boundaries. The dot ( . ) means any character; the asterisk ( * ) repeats the prior match (in this case, any character) until encountering the next symbol (the final escaped asterisk \* ); the question mark ( ? ) makes the repetition lazy, telling RegEx to stop at the first occurrence of the next match rather than the last.
  3. We place RegEx options at the beginning of an expression followed by a close parenthesis, in this case the symbols s). Normally, the dot does not match a line break (`r`n by default), so the match cannot extend past the end of a line. The s) option (DotAll) lets the dot match line breaks as well, so the match can span the whole multi-line comment.
  4. The set of parentheses in the center of the RegEx creates and saves a sub-pattern from the results. The RegExMatch() function above saves the entire match in the variable NotesVar and loads the first sub-pattern, as indicated by the first set of parentheses, into the pseudo-array variable NotesVar1. This allows the exclusion of the comment boundary delimiters in the final display.
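
The four points above can be tried out in a short, self-contained sketch. It assumes AutoHotkey v1.1 (the version the commands in this post use). The sample text in FileVar is made up for illustration and stands in for the contents of a real .ahk file, which you would normally load with FileRead.

```autohotkey
; Hypothetical sample standing in for the text of an .ahk file.
; In a real script you would load it with something like:
;   FileRead, FileVar, %PathToScript%
FileVar := "/*`r`nFirst note line.`r`nSecond note line.`r`n*/`r`nMsgBox, Hello`r`n/* later comment */"

; s) lets the dot match line breaks; the lazy .*? stops at the first closing delimiter,
; so only the first block comment is captured.
FoundPos := RegExMatch(FileVar, "s)/\*(.*?)\*/", NotesVar)

if (FoundPos)
    MsgBox, % "Whole match (with delimiters):`n" NotesVar "`n`nSub-pattern only:`n" NotesVar1
else
    MsgBox, No block comment found.
```

RegExMatch() returns the position of the match (0 if nothing matched), so checking FoundPos guards against script files that contain no bounded remarks section.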

Note: I don’t expect everyone to immediately grasp this list of explanations. Regular Expressions take a little time to comprehend. Even now, I must go back to examples and documentation when I write new expressions. But, if you add RegEx to your toolbox, you’ll find it worth the time.

Formatting Issues

When I put notes into AutoHotkey scripts, I place hard returns at measured intervals to force a word wrap at matching line lengths. This allows the presentation of the text in a consistent manner in text editors. However, when using a MsgBox to display text, the built-in word-wrap forces those hard returns to produce awkward results—line breaks in strange places.

When issuing the command:

MsgBox, 0, %Name_no_ext%, % Location "`r" NotesVar1

the auto-adjust of the MsgBox command forces odd word wraps leaving short broken lines after each hard return.

Using a GUI Window

I can avoid the hassle the MsgBox causes by using a GUI window:

Gui, Add, Text, , %Location%`r%NotesVar1%
Gui, Show,  , %Name_no_ext%

Each window expands to accommodate the fixed line lengths created by the hard returns—displaying a format similar to that found in the .ahk file.

However, the GUI window comes with its own problems—especially if you don’t include hard returns for setting line lengths or if you remove the single hard returns (as I demonstrate below). Without special GUI options, the paragraph can run off the GUI window as a single long line.
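
One such option is a fixed width on the Text control. As a rough sketch (the 400-pixel width, the path, and the sample text are arbitrary placeholders, not values from AutoStartupControl.ahk), giving the control a w option makes AutoHotkey word-wrap long lines inside it and set the height automatically:

```autohotkey
; Placeholder values for illustration only.
Location := "C:\Scripts\Example.ahk"
Name_no_ext := "Example"

; Build one long paragraph with no hard returns.
NotesVar1 := ""
Loop, 12
    NotesVar1 .= "This sentence repeats to form one long line without hard returns. "

; With a width (w400) and no height or row count, the Text control wraps its text.
Gui, Add, Text, w400, %Location%`n%NotesVar1%
Gui, Show, , %Name_no_ext%
Return

GuiClose:
    ExitApp    ; end this demo script when the window is closed
```

The trade-off is that a fixed width also re-wraps notes that already contain hard returns, which brings back some of the short broken lines seen in the MsgBox.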

Plus, when using a GUI, you must destroy the GUI window between each subroutine call:

GuiClose:
	GUI, Destroy
Return

Otherwise, closing the window by clicking the x box in the upper right-hand corner merely hides the window. If you call the routine again, it doesn’t update.
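
Here is a minimal sketch of that pattern, assuming AutoHotkey v1.1 and a made-up hotkey (Ctrl+Alt+N) in place of the real routine. The timestamp text is only there so you can see that a fresh window appears on each call. The extra Gui, Destroy at the top of the routine is my own addition for the case where the hotkey fires while the window is still open.

```autohotkey
; Hypothetical hotkey standing in for the routine that displays the notes.
^!n::
    Gui, Destroy    ; discard any window left over from an earlier call
    Gui, Add, Text, , Notes window built at %A_Hour%:%A_Min%:%A_Sec%
    Gui, Show, , Notes
Return

; Runs when the x box in the upper right-hand corner is clicked.
GuiClose:
    Gui, Destroy
Return
```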

Having said that, I lean toward using the GUI window. It offers the most faithful representation of the notes in the .ahk file. Plus, I don’t need a special RegEx removing single hard returns.

Since I had previously written a RegExReplace() function to remove those unneeded single returns for display in a MsgBox, I felt it worth a look. You may have other uses for an AutoHotkey routine which removes single returns without touching the double returns between paragraphs—perhaps formatting text for a word processor.

Removing Single Returns with RegEx

The following RegExReplace() function removes single returns without affecting the double returns between paragraphs:

NotesVar1 := RegExReplace(NotesVar1, "(\S ?)`r`n( ?\S)" , "$1 $2")

Generally (not always), pressing the Enter key inserts a carriage return `r and a new line `n (linefeed) into a computer document. Together these two symbols act as a hard return. To remove a single return, the script must match all returns not sitting next to another one.

  1. The expression (\S ?)`r`n( ?\S) matches the carriage return/new line set `r`n.
  2. But it must not sit next to another return or new line character ( \S ). The backslash with a capital S matches any character other than a space, tab, return, or new line, while the backslash followed by a lower-case s matches any one of those characters.
  3. I opted to allow a single space before or after the hard return in the form of a blank space character followed by a question mark ( ? ). This makes the space optional for those times when someone added the hard return at the end of the line after hitting the space bar.
  4. The replacement text “$1 $2” consists of the two sub-patterns created by the first and second sets of parentheses with a single space between them. This merely swaps the hard return for a space.
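
To see the replacement at work outside the startup script, here is a small sketch run against a made-up sample: two hard-wrapped paragraphs separated by a double return. The variable names Sample and Result are mine, and it assumes AutoHotkey v1.1.

```autohotkey
; Made-up sample: each line ends with `r`n, and a blank line
; (a double return) separates the two paragraphs.
Sample := "This line wraps`r`nonto a second line.`r`nAnd a third.`r`n`r`nNew paragraph`r`nstarts here."

; Same expression as above: a return counts as "single" only when a
; non-whitespace character (plus an optional space) sits on each side of it.
Result := RegExReplace(Sample, "(\S ?)`r`n( ?\S)", "$1 $2")

; Result now holds:
;   This line wraps onto a second line. And a third.
;   (blank line)
;   New paragraph starts here.
MsgBox, % Result
```

The double return survives because neither of its two `r`n pairs has a non-whitespace character on both sides.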

As you can see, the RegEx works pretty well.

I ran into some other formatting issues which I’ll address next time.

Click the Follow button at the top of the sidebar on the right of this page for e-mail notification of new blogs. (If you’re reading this on a tablet or your phone, then you must scroll all the way to the end of the blog, past any comments, to find the Follow button.)

jack

This post was proofread by Grammarly
(Any other mistakes are all mine.)

(Full disclosure: If you sign up for a free Grammarly account, I get 20¢. I use the spelling/grammar checking service all the time, but, then again, I write a lot more than most people. I recommend Grammarly because it works and it’s free.)

Find my AutoHotkey books at ComputorEdge E-Books!

Find quick-start AutoHotkey classes at “Robotic Desktop Automation with AutoHotkey”!

One thought on “Regular Expressions (RegEx) for Mining Text in Files (AutoHotkey Startup Control)”

  1. I LOVE using regular expressions. I came from a U*nix background before discovering AutoHotkey on Windows. I find using a site like https://regex101.com/ very useful when developing complex expressions since it supports more than one type of RE and gives real time feedback of your expressions.
