Strip HTML Tags from Text (AutoHotkey Quick Tip)

Use This AutoHotkey Trick to Remove HTML Code from Any Text

Last time in “Alternative Web Page HTML Download Techniques (AutoHotkey Tip),” I mentioned how I updated the GooglePhraseFix.ahk script by aaston86 to get it working again and make it a little more robust. The script uses a Google search page to autocorrect common expressions and people’s names. (It only works if Google thinks you may have made an error.)

For example, if you type “Ralph Nadal” the Spanish tennis player, selecting the name and using the CTRL+ALT+G Hotkey combination changes “Ralph Nadal” to “Rafael Nadal.” It only works for obvious possibilities, but may come in handy for correcting hard to remember spellings (i.e. “Jocavic” turns into “Djokovic”).

I added the phrase “Showing results for ” to the script as a search key in the Google results page. Google includes the phrase when it senses that you may have made a mistake. The original script used the StringReplace command to remove some HTML code and correct any apostrophes ('):

   StringReplace, clipboard, match2, <b><i>,, All
   StringReplace, clipboard, clipboard, </i></b>,, All
   StringReplace, clipboard, clipboard, ',', All

The StringReplace command can work for unchanging HTML tags but you need to add the command for each tag (or set of tags). By using the RegExReplace() function, you can remove all HTML code with one command.

HTML Tag Stripping Regular Expressions (RegEx) Using the RegExReplace() Function

The selected section of the Google page now includes a lot more HTML code than merely italics <i> and bold <b>. Using the following expression removes it all:

   var := RegExReplace(var,"<.+?>")

You don’t need to know anything about AutoHotkey Regular Expressions (RegEx) to use the above RegExReplace() function. The command removes all text found in var bounded by the arrow brackets (< … >).

Suppose you want to copy all the text from a Web page to a file. You could use the URLDownloadToFile command to copy the page source code, then execute the above RegExReplace() function to remove all of the HTML code. Only the plain text remains.

Click the Follow button at the top of the sidebar on the right of this page for e-mail notification of new blogs. (If you’re reading this on a tablet or your phone, then you must scroll all the way to the end of the blog—pass any comments—to find the Follow button.)


This post was proofread by Grammarly
(Any other mistakes are all mine.)

(Full disclosure: If you sign up for a free Grammarly account, I get 20¢. I use the spelling/grammar checking service all the time, but, then again, I write a lot more than most people. I recommend Grammarly because it works and it’s free.)

Find my AutoHotkey books at ComputorEdge E-Books!

Find quick-start AutoHotkey classes at “Robotic Desktop Automation with AutoHotkey“!

Alternative Web Page HTML Download Techniques (AutoHotkey Tip)

When One Method for Downloading HTML Code Breaks, Try the Alternative AutoHotkey Command

After noticing that, although I could quickly get the latitude and longitude for any location with a Google search in a browser, when I attempted to download that page using the GetWebPage() function code taken from the AutoHotkey documentation (shown below in the first script), Google stopped me. The Google server denied the download attempt of the coordinates for San Diego with the following statement:

403. That’s an error.

Your client does not have permission to get URL /search?q=latitude+longitude+san+diego+decimal&rlz=1C1GEWG_enUS953US953 from this server.

Thwarted by Google again (see my “Switched IPFind.ahk to for Reliable AutoHotkey GUI Map Embedding” blog), I wanted to find an alternative source for the same information.

I searched for an unblocked Web page providing the latitude and longitude. I didn’t have to look very far. (The site name appears in the AutoHotkey snippet below.) I wrote the following test code for proof of function:

Continue reading