Adapting Web Scraping Routines to Changing Web Pages (AutoHotkey Tip)

When the Horoscope Web Page I Use for E-mails Altered Its Format, I Quickly Adjusted the Script

Last year, I wrote a script that e-mails a daily horoscope to my wife, “E-mail the Daily Horoscope to Yourself (AutoHotkey Trick).” Every morning she receives on her tablet an e-mail containing her daily horoscope. (I don’t send it to myself because I don’t want to know that much about my future—and I don’t listen to advice.) Recently, she pointed out that the e-mail started coming up blank. I immediately realized that the target Web site had changed its source code. (I’ve experienced the same problem with the SynonymLookup.ahk script.) I knew I could repair the Regular Expression (RegEx) in the broken script fairly quickly by following some basic steps:

  1. Access the source code for the target Web page and locate the key text.
  2. Copy the critical portion of the source code, including any unique HTML tags surrounding the target text, then paste the selection into Ryan’s RegEx Tester.
  3. Adjust the RegEx to include the key unique tags surrounding the text, then extract the paragraph.
  4. In the script, replace the old RegEx found in the RegExMatch() function with the new one from Ryan’s RegEx Tester.
  5. Make any necessary adjustments to the RegEx—primarily escaping double quotation marks.

The new horoscope e-mail script now includes more details and a link to the site.

While making these changes, I implemented a couple of other improvements: making it easier to format the subject text in the e-mail, and adding a link to the main site for those times when the Web page source code changes.

Finding the Web Page Source Code

I opened the target Web page in my Web browser and accessed the source HTML code (right-click and select “View page source” in Google Chrome).

Tip: The source code page included a humongous amount of code making it difficult to find the right piece of code. Scrolling through masses of HTML code quickly gets tedious. As a shortcut, quickly locate the correct section by copying text from the browser display page, then pasting it into the search field (CTRL+F) in the browser’s source code page. The browser jumps directly to the right spot.

After locating the correct piece of text, look for the unique identifiers surrounding the text.

The unique tags include the <div id="content" class="grid-md-c-s2"> tag at the beginning of the section and the </div> tag at the end. The RegEx (.*?) captures everything in between.

While you can make the changes directly to the script, I generally load the segment of source code and new RegEx tags into Ryan’s RegEx Tester to make sure the code works. That saves me from continually saving and running the script during testing.

Regular Expressions in AutoHotkey
Regular Expressions (RegEx) can be mysterious in any language.

The only cautions I offer about using the RegEx Tester involve AutoHotkey peculiarities:

  1. You must escape any double quotation marks ( " ) you used in the tester RegEx when entering them into the RegExMatch() function by preceding each with an additional double quotation mark ( "" ).
  2. Even though you can directly enter a RETURN (newline character) from the keyboard when using the tester, you must use either \n or `n within the RegEx in the RegExMatch() function.
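As a minimal sketch of both cautions, the snippet below runs a RegExMatch() against a short made-up string rather than the real page. The Sample variable and its contents are assumptions for illustration only; the tag names come from the article.

```autohotkey
; Hypothetical sample standing in for the downloaded page source.
; Inside a quoted string, "" yields one literal quote and `n a newline.
Sample := "<div id=""content"" class=""grid-md-c-s2"">`n<p>Today looks promising.</p>`n</div>"

; Caution 1: every literal " in the pattern is doubled.
; Caution 2: each newline in the pattern is written as \n (or `n).
if (RegExMatch(Sample, "<div id=""content"" class=""grid-md-c-s2"">\n(.*?)\n</div>", Today))
    MsgBox, % Today1        ; Today1 holds the first captured subpattern
else
    MsgBox, No match found.
```

The output variable Today receives the whole match, while Today1 receives only the text captured by the parentheses.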

The following RegEx, built on the new unique tags, replaces the old, no-longer-functioning code in the RegExMatch() function:

RegExMatch(Horoscope, "<div id=""content"" class=""grid-md-c-s2"">(.*?)</div>", Today)

The following shows the original RegEx:

RegExMatch(Horoscope, "<p><span class=""date"">(.*?)</p>", Today)

This gets the horoscope e-mail script working again.
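One possible safeguard against future blank e-mails, not part of the original script, is to test the return value of RegExMatch(), which is 0 when the pattern finds nothing. The sketch below uses a made-up Horoscope string and the old pattern to simulate a page whose format has changed.

```autohotkey
; Hypothetical page source whose markup no longer fits the old pattern.
Horoscope := "<div id=""content"" class=""grid-md-c-s2"">New layout text</div>"

; RegExMatch() returns the position of the match, or 0 if there is none.
FoundPos := RegExMatch(Horoscope, "<p><span class=""date"">(.*?)</p>", Today)
if (FoundPos = 0)
    MsgBox, The pattern found nothing. The page format may have changed.
else
    MsgBox, % Today1
```

In the full script, the same test could skip the e-mail or send a warning instead of an empty message.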

Adding More Results to the E-Mail

As I reviewed the Web page, I saw that it offered two more daily horoscope categories: Daily Love and Daily Work. I quickly added two more RegExMatch() functions to extract the associated text:

RegExMatch(Horoscope, "<div id=""content-love"">(.*?)</div>", Love) 
Love1 := RegExReplace(Love1,"<.+?>")
RegExMatch(Horoscope, "<div id=""content-work"">(.*?)</div>", Work) 
Work1 := RegExReplace(Work1,"<.+?>")

Note: The second line of code in each addition uses the RegExReplace() function to remove all remaining HTML tags found in the target text. I could have included the <p> … </p> tags in the RegEx but since this extra code removes them I didn’t take the time. Plus, I would need to include \n for each newline found in the section.
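To see the tag stripping in isolation, here is a small sketch that applies the same RegExReplace() call to a made-up string. The sample text is an assumption for illustration.

```autohotkey
; Hypothetical captured text with leftover markup.
Love1 := "<p>Romance is <b>in the air</b> today.</p>"

; "<.+?>" lazily matches from each < to the nearest >.
; With no replacement parameter, every match is deleted.
Love1 := RegExReplace(Love1, "<.+?>")
MsgBox, % Love1
```

Because the question mark makes the match lazy, each tag is removed separately and the text between tags survives.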

E-Mail Formatting Problems

When adding the new text variables to the e-mail, formatting the body of the text becomes more complicated. By far the easiest method for adding the body text uses a multi-line continuation section (Method #2). See “Quick and Dirty Multi-Line Text Formatting (AutoHotkey Tip).”

sBody = 
(
Horoscope for People Born under Pisces
%Today1%
Daily Love
%Love1%
Daily Work
%Work1%
View More of Today's Horoscope at Astrology.com
https://www.astrology.com/us/horoscope/daily-extended.aspx?sign=pisces
)

Using the traditional equal sign ( = ) assignment method rather than the colon-equal ( := ) expression assignment offers a simpler solution when dealing with multiple variables. The percent-sign-delimited (%var%) variable replacement directly inserts the key text into the appropriate spot in the e-mail.
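For comparison, the following sketch builds the same two-line text both ways. The Today1 value is a placeholder, and sBodyExpr is a hypothetical name used only here.

```autohotkey
Today1 := "Sample horoscope text."   ; placeholder value for illustration

; Traditional assignment: %Today1% is replaced inside the continuation section,
; and the lines are joined with newlines automatically.
sBody =
(
Horoscope for People Born under Pisces
%Today1%
)

; Expression assignment: needs quotes, an explicit `n, and the concatenation dot.
sBodyExpr := "Horoscope for People Born under Pisces`n" . Today1

MsgBox, % (sBody = sBodyExpr) ? "Same text" : "Different text"
```

With six or more lines and three variables, the expression form grows long quickly, which is why the continuation section reads more cleanly for an e-mail body.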

Updated Horoscope Script

See the new script below. To make this script work, you must include valid e-mail addresses and a username in line 16, line 17, and line 47.

For more information about sending e-mail with AutoHotkey, see “How to Send E-mail Directly from an AutoHotkey Script.” If you run into problems, check the comments at the end of the page and reference resources.

Note: This script pulls the e-mail password from the Windows Registry. See “Username and Password Protection in AutoHotkey.”

GetHoroscope := "https://www.astrology.com/us/horoscope/daily-extended.aspx?sign=pisces"
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
whr.Open("GET", GetHoroscope)
whr.Send()
  Sleep 100
Horoscope := whr.ResponseText
RegExMatch(Horoscope, "<div id=""content"" class=""grid-md-c-s2"">(.*?)</div>", Today) 
Today1 := RegExReplace(Today1,"<.+?>")
RegExMatch(Horoscope, "<div id=""content-love"">(.*?)</div>", Love) 
Love1 := RegExReplace(Love1,"<.+?>")
RegExMatch(Horoscope, "<div id=""content-work"">(.*?)</div>", Work) 
Work1 := RegExReplace(Work1,"<.+?>")

RegRead, Password, HKEY_CURRENT_USER\Software\DateStampConvert, ValueRead  ; read the saved password from the Registry

sFrom := "[my e-mail address]@gmail.com"
sTo := "[my wife's email address]@gmail.com"
FormatTime, TimeString,, LongDate
sSubject := "Horoscope — " . TimeString
sBody = 
(
Horoscope for People Born under Pisces
%Today1%
Daily Love
%Love1%
Daily Work
%Work1%
View More of Today's Horoscope at Astrology.com
https://www.astrology.com/us/horoscope/daily-extended.aspx?sign=pisces
)
sAttach := 
mAttrib = From|To|Subject|TextBody
vars2 = sFrom|sTo|sSubject|sBody
StringSplit, Attrib, mAttrib, |

sServer := "smtp.gmail.com" ; specify your SMTP server
nPort := 465 ; 25
bTLS := True ; False
nSend := 2   ; cdoSendUsingPort
nAuth := 1   ; cdoBasic
tOut := 60
url := "http://schemas.microsoft.com/cdo/configuration/"
uSub = sendusing|smtpconnectiontimeout|smtpserver|smtpserverport|smtpusessl|smtpauthenticate|sendusername|sendpassword
vars1 = nSend|tOut|sServer|nPort|bTLS|nAuth|sUsername|sPassword
StringSplit, sub, uSub, |

sUsername := "[my username]"
sPassword := Password    ; hidden in Windows Registry

pmsg :=   ComObjCreate("CDO.Message")
pfld :=   pmsg.Configuration.Fields
Loop, Parse, vars1, |
	pfld.Item[url sub%A_Index%]:= %A_LoopField%
pfld.Update()
Loop, Parse, vars2, |
	pmsg[Attrib%A_Index%]:= %A_LoopField%
if sAttach
	Loop, Parse, sAttach, |, %A_Space%%A_Tab%
		pmsg.AddAttachment[A_LoopField]
pmsg.Send()
pfld:=pmsg:=""
return

jack
