Removing Excess Tabs and Spaces with RegEx Greed (AutoHotkey RegEx Tips Part 3)

After Parsing Selected Table Data or One-Line Street Addresses for Unique Paste Operations, We Prevent Blank Paste Items from Appearing in the MsgBox Window by Using RegEx Greed to Remove Any Extra Tab Characters

In the first two parts of this series, I introduced a couple of common Regular Expressions (RegEx) wild cards for finding unknown characters (the needle) in a larger string of text (the haystack). In “Finding US Zip Codes (AutoHotkey RegEx Tips Part 1),” I discussed the \d expression (representing any single numeric digit) which you can also identify with the range [0-9] or the expression set (0|1|2|3|4|5|6|7|8|9).

In “Finding UK Postal Codes (AutoHotkey RegEx Tips Part 2),” I introduced the wild card \w as the RegEx symbol for matching any letter (upper or lower case), any of the ten digits, or the underscore: the equivalent of the range [a-zA-Z0-9_]. In both blogs, I used the \s wild card to locate a space in front of either the US zip code or UK postal code.

This time I use \s in combination with `t to remove extra tab delimiters (e.g. `t`t), which insert blank lines in the MultiPaste.ahk script’s MsgBox command window.

Removing Excess Tabs from Text

In the MultiPaste.ahk script, first discussed in the blog “Brute Force Data-Set Copy-and-Paste (AutoHotkey Clipboard Technique),” each tab character (`t) creates a parsed line in the MsgBox command window. Two tabs in a row, whether from excess tabs already embedded in the text or from unneeded tabs added by the script, squander a paste option by inserting a blank entry. Since the current form of the script limits the MsgBox command window to ten entries, allowing any extra empty items (as shown in the image at right) only wastes options. Removing the extra tabs solves the problem.

In the blog referenced above, we attempted to use the StrReplace() function to remove double tabs by placing the function in a loop:

Loop
{
  ; Collapse every adjacent pair of tabs into one tab
  Clipboard := StrReplace(Clipboard, "`t`t", "`t")
  ; Stop once no double tab remains
  If !InStr(Clipboard, "`t`t")
    Break
}

This loop replaces each found double tab with a single tab. When the InStr() function no longer finds a double tab in the Clipboard, the Break command exits the loop. The loop is needed because a single pass over a run of three or more tabs can still leave a double tab behind. This technique works most of the time.

However, this approach doesn’t work when a space or two appears between the tabs. We could introduce, prior to running the tab-eliminating loop, an additional loop for removing all spaces following a tab character:

Loop
{
  ; Delete one space that directly follows a tab
  Clipboard := StrReplace(Clipboard, "`t ", "`t")
  ; Stop once no tab is followed by a space
  If !InStr(Clipboard, "`t ")
    Break
}
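
To see what the extra loop buys, here is a minimal sketch (AutoHotkey v1.1 syntax assumed, with a made-up Sample string standing in for the Clipboard) that runs the double-tab loop alone and then both loops in sequence:

```autohotkey
; AutoHotkey v1.1 sketch. Sample is made-up test data standing in for the Clipboard.
Sample := "A`t  `tB"                       ; A, tab, two spaces, tab, B

; Double-tab loop alone: no two tabs are adjacent, so nothing changes.
TabsOnly := Sample
Loop
{
  TabsOnly := StrReplace(TabsOnly, "`t`t", "`t")
  If (!InStr(TabsOnly, "`t`t"))
    Break
}

; Space-stripping loop first, then the double-tab loop.
Both := Sample
Loop
{
  Both := StrReplace(Both, "`t ", "`t")
  If (!InStr(Both, "`t "))
    Break
}
Loop
{
  Both := StrReplace(Both, "`t`t", "`t")
  If (!InStr(Both, "`t`t"))
    Break
}

MsgBox, % "Tabs-only loop length: " StrLen(TabsOnly) "`nBoth loops length: " StrLen(Both)
```

The first result should keep all six characters, while the second shrinks to three (A, one tab, B).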

However, this adds a lot of code when a simple RegEx solves the problem. By using a RegEx, we can eliminate both the double tabs and any spaces with a single RegExReplace() function:

Clipboard := RegExReplace(Clipboard,"`t\s*`t","`t")

By eradicating any loops and resolving all double tabs (`t`t) into one tab (including the removal of any intervening spaces), this RegExReplace() function demonstrates the power of RegEx.
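
As a quick check, the following sketch (AutoHotkey v1.1 syntax assumed; Sample is a made-up string, not output from MultiPaste.ahk) feeds the one-line RegEx a double tab, a tab-space-space-tab run, and a triple tab, then makes the surviving tabs visible:

```autohotkey
; AutoHotkey v1.1 sketch. Sample is made-up test data.
Sample := "A`t`tB`t  `tC`t`t`tD"           ; double tab, tab-space-space-tab, triple tab
Result := RegExReplace(Sample, "`t\s*`t", "`t")

; Swap each invisible tab for a visible marker before displaying the result
MsgBox, % StrReplace(Result, "`t", "[tab]")
```

Each of the three runs should come back as a single [tab] between the letters, with no loop required.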

Note: While backslash+t (\t) represents the standard tab in Regular Expressions, you can also use the AutoHotkey escape sequence backtick+t (`t) as discussed in the remarks section of the AutoHotkey RegEx functions online documentation.
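
A small sketch of that equivalence, assuming AutoHotkey v1.1 and a made-up Sample string: AutoHotkey converts `t into a real tab character before the RegEx engine ever sees the pattern, while \t is interpreted by the RegEx engine itself.

```autohotkey
; AutoHotkey v1.1 sketch. Sample is made-up test data.
Sample := "A`t `tB"                                       ; A, tab, space, tab, B
WithBacktick  := RegExReplace(Sample, "`t\s*`t", "`t")    ; literal tabs in the pattern
WithBackslash := RegExReplace(Sample, "\t\s*\t", "`t")    ; RegEx escape for tab

MsgBox, % (WithBacktick == WithBackslash) ? "Same result" : "Different results"
```

As I read the documentation, the two forms stop being interchangeable under the x (extended) option, where a literal tab in the pattern is ignored as whitespace while \t is not.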

The Zero-or-More Wild Card Modifier (*)

In the RegEx `t\s*`t, we introduce the wild card modifier *, which matches the preceding character zero or more times. The \s symbol represents the space character plus other non-visible characters such as the tab `t, the return character `r, or the newline character `n. Because the * modifier is greedy by default, \s* consumes every whitespace character it can and then gives back only enough for the final `t to match, so the match runs from the first tab to the last tab that comes before a non-whitespace character. The function then replaces the entire matched run of tabs and spaces with a single tab character, which makes it much more powerful (and accurate) than the original loop. Note that because \s also matches `r and `n, a run that crosses a line break between two tabs gets collapsed as well.
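
To illustrate the "zero or more" part, this sketch (AutoHotkey v1.1 syntax assumed; the three sample strings are made up) reports how many characters `t\s*`t matches when nothing, one space, or a space-tab-space sequence sits between the outer tabs:

```autohotkey
; AutoHotkey v1.1 sketch. The sample strings are made-up test data.
Report := ""
For Index, Sample in ["X`t`tY", "X`t `tY", "X`t `t `tY"]
{
  ; Found receives the text matched by the whole pattern
  RegExMatch(Sample, "`t\s*`t", Found)
  Report .= "Sample " Index ": matched " StrLen(Found) " character(s)`n"
}
MsgBox, % Report
```

The first sample matches with zero characters between the tabs (two characters total), and the last one stretches across the middle tab to the final tab (five characters).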

Greedy Versus Non-Greedy

By default, the * (match none or more) and the + (match one or more) modifiers are greedy. That means the expression consumes all matching characters until it reaches the last possible matching character. For example, suppose a string of text contains a number of tabs and spaces:

[tab][space][space][tab][tab][space][tab][space][space][tab]

Note: Since you usually can’t see tabs and spaces, I simulate them with [tab] and [space] respectively.

The greedy form of the expression `t\s*`t matches from the first tab through every \s character until reaching the last matching tab found in the search text, which here means the entire string:

[tab][space][space][tab][tab][space][tab][space][space][tab]

Everything between must match \s*—which includes both tabs and spaces. Any non-space or non-tab character (i.e. any visible character including punctuation) terminates the matching. Therefore, when the replacement occurs, it reduces the entire set of characters to a single tab.
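
The sketch below (AutoHotkey v1.1 syntax assumed) rebuilds the simulated string from the [tab]/[space] notation, applies the greedy expression, and counts the replacements. The variable names are my own, chosen for illustration.

```autohotkey
; AutoHotkey v1.1 sketch. Variable names are illustrative only.
Notation := "[tab][space][space][tab][tab][space][tab][space][space][tab]"

; Turn the visible notation into real tab and space characters
Haystack := StrReplace(StrReplace(Notation, "[tab]", "`t"), "[space]", " ")

; Count receives the number of replacements made
Result := RegExReplace(Haystack, "`t\s*`t", "`t", Count)

; Convert back to the visible notation for display
Visible := StrReplace(StrReplace(Result, "`t", "[tab]"), " ", "[space]")
MsgBox, % "Replacements: " Count "`nResult: " Visible
```

A single replacement should swallow all ten characters and leave one [tab].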

Nevertheless, you will encounter times when the greedy form of a wildcard does not work for your application. For example, when stripping HTML tags out of a section of Web page source code, you might try the following:

Clipboard := RegExReplace(Clipboard, "<.*>","")

but this greedy RegEx actually removes all of the text between the first left angle bracket < and the last right angle bracket > on each line of the haystack, often leaving you with nothing. (By default, the dot does not match newline characters, so .* stops at the end of a line.) The expression .* matches everything between the two designated points. If you only want to remove tags bound by a single set of angle brackets (e.g. <p>), then adding the question mark modifier ? to the * or + modifier forces the RegEx to stop matching on the first occurrence of the next character rather than the last, forging a non-greedy expression:

Clipboard := RegExReplace(Clipboard, "<.*?>","")
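
For the HTML case, a side-by-side sketch (AutoHotkey v1.1 syntax assumed; the one-line Html string is made up) shows the difference between the greedy <.*> and the non-greedy <.*?>:

```autohotkey
; AutoHotkey v1.1 sketch. Html is a made-up, single-line test string.
Html := "<p>Hello <b>world</b></p>"

Greedy := RegExReplace(Html, "<.*>", "")    ; runs from the first < to the last >
Lazy   := RegExReplace(Html, "<.*?>", "")   ; stops at the first > after each <

MsgBox, % "Greedy: [" Greedy "]`nLazy: [" Lazy "]"
```

The greedy version should leave nothing between the brackets in the message, while the non-greedy version keeps the text Hello world.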

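A non-greedy expression works for HTML tags, but it causes problems in our tab removal. The expression `t\s*?`t stops at the first closing tab it encounters, thus creating matched pairs for the replacements. In the string below, it first matches [tab][space][space][tab], then [tab][space][tab], and leaves the trailing [space][space][tab] unmatched: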

[tab][space][space][tab][tab][space][tab][space][space][tab]

Rather than reducing the line to a single tab, the non-greedy expression would leave multiple tabs in place.
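
Running the non-greedy expression against the same simulated string bears this out. This is a sketch under the same assumptions as before (AutoHotkey v1.1 syntax, illustrative variable names):

```autohotkey
; AutoHotkey v1.1 sketch. Variable names are illustrative only.
Notation := "[tab][space][space][tab][tab][space][tab][space][space][tab]"

; Turn the visible notation into real tab and space characters
Haystack := StrReplace(StrReplace(Notation, "[tab]", "`t"), "[space]", " ")

; The ? after * makes the match stop at the first closing tab
Result := RegExReplace(Haystack, "`t\s*?`t", "`t", Count)

; Convert back to the visible notation for display
Visible := StrReplace(StrReplace(Result, "`t", "[tab]"), " ", "[space]")
MsgBox, % "Replacements: " Count "`nResult: " Visible
```

Two replacements should occur, leaving [tab][tab][space][space][tab], which still contains a double tab and would still produce a blank entry.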

In tab replacement, RegEx greed helps us, while when removing HTML tags, greed hurts us.

Next time, I look at using Regular Expressions (RegEx) to deal with the special issues caused by date formats.

jack
