Friday, September 01, 2006

How to Strip out HTML with an ASP Reg Exp Function

Ever wanted an easy way to strip out all the HTML from a string? There are a variety of reasons you might want to do this. Maybe you offer a mail page feature and want to strip out the HTML to send a text email. Or maybe you're a Black Hat SEO Sith Master and are scraping content from millions of pages to build an AdSense empire!

In any case, here's an easy way, using regular expressions, to strip out all the HTML from a string.

This function accepts a single string (the string whose HTML tags are to be stripped). The regular expression pattern <(.|\n)+?> matches a <, followed by one or more characters (newlines included), up to the nearest >. Because the +? quantifier is lazy, each match stops at the first > it finds, so every tag is matched separately. The Replace method of the regular expression object then replaces every match with an empty string (""). Finally, any remaining < and > signs are replaced with their HTML-encoded forms, &lt; and &gt;.
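To see what the pattern actually picks up, here's a small sketch that lists each match in a sample string. It assumes you run it as a standalone .vbs file under Windows Script Host (cscript), which is why it uses WScript.Echo; inside an ASP page you'd use Response.Write instead. The sample string is made up for illustration.

```vbscript
' Sketch: list every substring the tag pattern matches.
' Assumption: run with cscript; the sample input is hypothetical.
Dim objRegExp, objMatch
Set objRegExp = New RegExp

objRegExp.IgnoreCase = True
objRegExp.Global = True            ' find every match, not just the first
objRegExp.Pattern = "<(.|\n)+?>"

' Echoes <p>, <b>, </b> and </p>, one per line
For Each objMatch In objRegExp.Execute("<p>Hello <b>world</b></p>")
    WScript.Echo objMatch.Value
Next

Set objRegExp = Nothing
```

One thing to watch for: the pattern doesn't know what a real tag is. In plain text like "a < b and c > d", everything from the < through the > counts as a match and gets removed too.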

Something to consider: if you strip out the <BR> tags and you're re-displaying the string, it will all run together. The fix there would be to do a straight replace BEFORE sending the string to the stripHTML Function:
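TheString = Replace(TheString, "<BR>", "{}", 1, -1, 1)
This replaces all the Break tags (in both upper and lowercase) with the placeholder {}. The fourth argument of Replace is the start position, not the compare mode, so to match both cases you need all six arguments: start at 1, replace every occurrence (-1), and compare as text (1, which is vbTextCompare). Then send TheString to the stripHTML Function: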
Function stripHTML(strHTML)
    Dim objRegExp, strOutput
    Set objRegExp = New RegExp

    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    objRegExp.Pattern = "<(.|\n)+?>"

    'Replace all HTML tag matches with the empty string
    strOutput = objRegExp.Replace(strHTML, "")

    'Encode any leftover angle brackets
    strOutput = Replace(strOutput, "<", "&lt;")
    strOutput = Replace(strOutput, ">", "&gt;")

    stripHTML = strOutput 'Return the value of strOutput

    Set objRegExp = Nothing
End Function
Call the function on your string:
TheString = stripHTML(TheString)
Now you have a string with all the HTML stripped out and all the Break tags replaced with {}. Do one last Replace to put the Break tags back in:
TheString = Replace(TheString, "{}", "<BR>")
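Putting the three steps together, the whole round trip might look something like the sketch below. It's written as a standalone .vbs script for cscript (hence WScript.Echo; in an ASP page you'd use Response.Write), the sample input is invented, and the case-insensitive Replace uses the built-in vbTextCompare constant.

```vbscript
' Sketch: protect <BR> tags, strip the rest of the HTML, restore the <BR> tags.
' Assumptions: run with cscript; the sample string is hypothetical.
Function stripHTML(strHTML)
    Dim objRegExp, strOutput
    Set objRegExp = New RegExp
    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    objRegExp.Pattern = "<(.|\n)+?>"
    strOutput = objRegExp.Replace(strHTML, "")   ' drop every tag match
    strOutput = Replace(strOutput, "<", "&lt;")  ' encode leftover brackets
    strOutput = Replace(strOutput, ">", "&gt;")
    stripHTML = strOutput
    Set objRegExp = Nothing
End Function

Dim TheString
TheString = "<p>Line one<br>Line two<BR>5 > 3</p>"

' Step 1: swap <BR> / <br> for the placeholder
' (start at 1, replace all occurrences, case-insensitive)
TheString = Replace(TheString, "<BR>", "{}", 1, -1, vbTextCompare)

' Step 2: strip the remaining tags
TheString = stripHTML(TheString)

' Step 3: put the Break tags back
TheString = Replace(TheString, "{}", "<BR>")

WScript.Echo TheString   ' Line one<BR>Line two<BR>5 &gt; 3
```

Keep in mind the placeholder is just a convention: if your text could already contain {}, pick a marker that can't show up in it. The straight Replace also only catches the literal <BR> form, so variants like <br /> would still be stripped with the rest of the tags.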

That's it!
