Extracting Multiple Dates from Text Using AutoHotkey RegEx

While Not Simple (and a Little Bit “Greedy”), the RegEx for Two-Date Parsing Only Requires One Selection

I received the following query from a reader:

Hi! Is it possible to highlight the entire date range (e.g. 16 March 2021 to 21 May 2021) when the Hotkey is triggered, feed it into the timespan ahk, and share the timespan as result?

Yes, it is! The key to success is a Regular Expression (RegEx) that parses both dates from the selected text in a single pass. Plus, you’ll want to streamline the process by eliminating the GUI and feeding the dates directly into the HowLong() function found in the HowLongYearsMonthsDays.ahk script. Implementing the instant calculation requires three steps:

  1. Writing a RegEx for identifying and capturing the target dates. (Discussed in this blog.)
  2. Using the DateStampConvert.ahk code to format the parsed dates as standard DateTime stamps (YYYYMMDD).
  3. Calculating the timespan by running the HowLong() function with the two dates as parameters.

This approach should provide you with an instant timespan calculation between any two dates matched in a text selection.
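
As a minimal sketch of the final step, the built-in EnvSub command can stand in for HowLong() when a plain day count is enough. This is not the HowLong() function from the referenced script, which reports years, months, and days. The two stamps are hard-coded on the assumption that step 2 has already converted the reader's example dates to YYYYMMDD, and the variable names are only illustrative.

```autohotkey
; Assumed input: the reader's example range, already converted
; to DateTime stamps (16 March 2021 and 21 May 2021).
StartDate := "20210316"
EndDate   := "20210521"

; EnvSub with a TimeUnits parameter treats both values as date-time
; stamps and stores the difference (here in days) back in the variable.
Span := EndDate
EnvSub, Span, %StartDate%, Days

MsgBox, % "Timespan: " Span " days"   ; 66 days for this range
```

For a result broken into years, months, and days, the HowLong() function remains the better fit.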

I have not done all the work, but I have developed a RegEx which locates the first and last date in a text selection:

sx)(\b[[:alpha:]]+.?\s\d\d?,?\s\d?\d?\d\d|\b\d\d?[-\s]?[[:alpha:]]+[-\s]?\d\d\d?\d?|\b\d\d?[-/]\d\d?[-/]\d\d\d?\d?)
.*(\b[[:alpha:]]+.?\s\d\d?,?\s\d?\d?\d\d|\b\d\d?[-\s]?[[:alpha:]]+[-\s]?\d\d\d?\d?|\b\d\d?[-/]\d\d?[-/]\d\d\d?\d?)

Update March 26, 2021: \w in original RegEx changed to [[:alpha:]] to include only alphabetic characters.

While I don’t discuss every aspect of this RegEx here, I cover the important aspects of its construction. (I’ve written numerous blogs and an entire book discussing the basics of AutoHotkey Regular Expressions.)

Date Format Ambiguity

The inconsistency in worldwide date formats forced me to write the slightly complicated RegEx shown above. As usual, Ryan’s RegEx Tester helped me work out the solution by giving me an environment where I could immediately see results when making even the slightest change to the key expression.

I had three objectives:

  1. Locate dates in any of the three (or more) formats identified in the DateStampConvert.ahk script (American, European, and all numeric).
  2. Extract only the first and last date in the selected text by ignoring any spurious dates appearing in-between. Only this approach would ensure the accurate selection of two target dates.
  3. Save those two dates as subpatterns for conversion into the DateTime stamp format.

Create RegEx Options with the Pipe Character ( | )

When you insert the pipe character ( | vertical line) you separate the expression into either/or options. This allows you to simultaneously search for and match different expressions without resorting to multiple searches or loops. In the RegEx found at the top of this blog, the expression shows three different variations:

  1. American date format: \b[[:alpha:]]+.?\s\d\d?,?\s\d?\d?\d\d
  2. European date format: \b\d\d?[-\s]?[[:alpha:]]+[-\s]?\d\d\d?\d?
  3. Numeric format: \b\d\d?[-/]\d\d?[-/]\d\d\d?\d?

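RegEx tries the options from left to right at each position in the text. Once it matches one of the options separated by the pipe, it ignores the others and moves on to the rest of the expression.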
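
To see the either/or behavior in isolation, the short sketch below runs the same three alternatives against one sample date in each format. The variable names (Samples, DateRegEx, Result) are only illustrative, and the sample dates come from the demonstration text later in this blog.

```autohotkey
; One sample date per format.
Samples := ["April 9, 2021", "30 April 2021", "4/23/2021"]

; The three alternatives from the article, joined with the pipe character.
DateRegEx := "\b[[:alpha:]]+.?\s\d\d?,?\s\d?\d?\d\d"
    . "|\b\d\d?[-\s]?[[:alpha:]]+[-\s]?\d\d\d?\d?"
    . "|\b\d\d?[-/]\d\d?[-/]\d\d\d?\d?"

Result := ""
for Index, Sample in Samples
{
    ; RegExMatch() returns the position of the match, or 0 if none is found.
    if RegExMatch(Sample, DateRegEx, Match)
        Result .= Sample " -> " Match "`n"
}
MsgBox, % Result
```

Each sample should come back matched in full by a single call, with no loop over separate expressions.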

While these expressions work for most date formats, I’ve already found another which does not match: the military date format 08-Jul-2021 or, as seen on some documents, 08JUL21. If you find you need to identify these types of date formats, you might find it easier to add another pipe option rather than reworking one of the originals to accommodate the new entry. You may also need to make a change to the DateStampConvert.ahk script to accept the format changes. (March 26, 2021: Problem corrected with recent changes.)

Matching the First and Last Date Format

By default, the RegEx wildcard ( .* ) is greedy: it consumes as many characters as it can, then backs up only as far as needed for the rest of the expression to match. In most cases, I add the question mark ( .*? ) to make it stop at the first possible match rather than the last. In this case, I wanted the last match.

To extract the first two dates in the selection, use .*? between the two sets of optional matches. For the first and last matches, drop the question mark and just use .* between the sets of options. This simple change (removing the question mark ?) solved one of my early conundrums.
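
The difference is easier to see with a cut-down pattern. The sketch below uses a simplified, numeric-only date expression (an assumption for brevity, not the full RegEx) on a made-up line containing three dates.

```autohotkey
Text := "Dates: 1/2/2021, 3/4/2021, and 5/6/2021."
Date := "(\b\d\d?/\d\d?/\d{4})"   ; simplified, numeric dates only

RegExMatch(Text, Date . ".*"  . Date, Greedy)  ; first and last date
RegExMatch(Text, Date . ".*?" . Date, Lazy)    ; first two dates

Result := "Greedy: " Greedy1 " and " Greedy2 "`n"
Result .= "Lazy: " Lazy1 " and " Lazy2
MsgBox, % Result
; Greedy: 1/2/2021 and 5/6/2021
; Lazy: 1/2/2021 and 3/4/2021
```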

Using the Boundary ( \b ) Symbol

In some situations (especially when extracting the second date), I needed to add the boundary symbol ( \b ). This forced the month names to remain intact and the two-digit days to stick together. It fixed a missing-character problem.
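
One way to reproduce the missing-character effect is sketched below with a simplified day-month-year pattern (an assumption for brevity, not the full RegEx). Without the boundary, the greedy .* in front of the second date appears to claim the first digit of the day for itself.

```autohotkey
Text := "From 16 March 2021 to 21 May 2021"
Date := "\d\d?\s[[:alpha:]]+\s\d{4}"   ; simplified day-month-year pattern

; Without \b, the greedy .* keeps one digit of the day for itself.
RegExMatch(Text, "(" . Date . ").*(" . Date . ")", Loose)

; With \b, each date must start at a word boundary.
RegExMatch(Text, "(\b" . Date . ").*(\b" . Date . ")", Tight)

MsgBox, % "Without \b: " Loose2 "`nWith \b: " Tight2
; Without \b: 1 May 2021
; With \b: 21 May 2021
```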

Adding Options to Eliminate the Effects of Returns

In previous Regular Expressions, I’ve used the s) option (“causes a period (.) to match all characters including newlines”) at the beginning of the expression to remove end-of-line characters from consideration. However, I still had a freakish problem with losing matches when I added an extra character on the same line as the formatted date. I’m not sure what caused it, but adding the x) option (“ignores whitespace characters in the pattern”) fixed it. Go figure!
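
The effect of each option can be checked separately. The sketch below assumes text with Windows-style line breaks and again uses a simplified numeric-only date pattern; it illustrates what the two options do rather than diagnosing the problem described above.

```autohotkey
; Two dates separated by Windows-style line breaks (`r`n).
Text := "Start 1/2/2021`r`nsome text`r`nEnd 5/6/2021"
Date := "(\d\d?/\d\d?/\d{4})"   ; simplified, numeric dates only

; Without s), the .* cannot cross the line breaks, so no match is found.
PosPlain := RegExMatch(Text, Date . ".*" . Date, Plain)

; With s), the dot also matches newlines and both dates are captured.
PosDotAll := RegExMatch(Text, "s)" . Date . ".*" . Date, DotAll)

; With x), spaces inside the pattern are ignored, which allows a
; long expression to be spread out for readability.
RegExMatch("4/23/2021", "x) \d\d? / \d\d? / \d{4}", Spaced)

MsgBox, % "Plain: " PosPlain "`ns): " DotAll1 " to " DotAll2 "`nx): " Spaced
```

Because x) also ignores line breaks inside the pattern, it lets a long expression wrap across several lines, as the demonstration script does.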

Extracting the Two Dates with Parentheses

Surrounding portions of the expression with parentheses creates subpatterns while assigning variable names—in this case, the first and last date format. I can access those extractions by using the pseudo-array names OutputVar1 and OutputVar2. I wrote the short demonstration script below to show how the RegEx works (Updated March 26, 2021):

Haystack =
(
I've changed my plans and will arrive on April 9, 2021.
I hope that works for you.

Originally, I would have arrived on 4/23/2021, but that
was too close to 30 April 2021.

I still plan to stay until Wednesday, 15 September 2021 
at 1:35PM.

Looking forward to seeing you.

	—Freddy the Freeloader
)
StartingPos := 1

; The following RegExMatch() function uses line continuation techniques
; to wrap lines of code for display purposes. 

RegExMatch(Haystack
, "sx)(\b[[:alpha:]]+.?\s\d\d?,?\s\d?\d?\d\d
|\b\d\d?[-\s]?[[:alpha:]]+[-\s]?\d\d\d?\d?
|\b\d\d?[-/]\d\d?[-/]\d\d\d?\d?)
.*(\b[[:alpha:]]+.?\s\d\d?,?\s\d?\d?\d\d
|\b\d\d?[-\s]?[[:alpha:]]+[-\s]?\d\d\d?\d?
|\b\d\d?[-/]\d\d?[-/]\d\d\d?\d?)" 
, OutputVar, StartingPos)

MsgBox,,Date Match, % OutputVar1 "`n" OutputVar2

I have not done the integration work required to build an instant timespan calculation tool but—once you add the above Regular Expression technique—all the pieces needed exist in the referenced scripts.

jack
