Wednesday, August 30, 2006

Scraping Page Content from a Remote Site in ASP

Scraping page content (called consuming when you do it to an RSS feed) is a way of getting content from a site other than your own and then displaying it on your own site (typically as your own content). Since the scraping is performed by the server, not the client, the content appears, to site visitors and search engine spiders alike, to have originated from YOUR site, not the one you scraped from.

Microsoft's suite of XML DOM (Document Object Model) components includes the XMLHTTP object. This object was originally designed to provide client-side access to XML documents on remote servers over HTTP. It exposes a simple API which allows you to send requests (even POSTs) and get the resulting XML, HTML or binary data.

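The code below shows how to return the HTML of a remote URL. Note there are two options for instantiating the object: one uses Microsoft.XMLHTTP and the other MSXML2.ServerXMLHTTP. The second is the newer version of the object and the one intended for server-side use, so try it first; the waitForResponse method used in the code is only available on ServerXMLHTTP. If the newer one doesn't work on your server, fall back to the older one and make the request synchronous (pass False to Open) instead of calling waitForResponse.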

Since this function returns served content, you can use it for any type of page: asp, aspx, php, html, htm... etc. Built as a function, you could go with something like:

'Set strURL to the address of the remote page
strURL = ""
strRemoteContent = GetRemoteContent(strURL)

Function GetRemoteContent(TheURL)
    Dim xmlObj

    'Create an instance of the MS XMLHTTP component.
    'This is the old version; use the new one if you can
    'Set xmlObj = Server.CreateObject("Microsoft.XMLHTTP")

    'New version, better
    Set xmlObj = Server.CreateObject("MSXML2.ServerXMLHTTP")

    'Suppress runtime errors so we can check Err.Number ourselves
    On Error Resume Next

    'Open the connection and send the request
    'Set the optional Async parameter to True
    'Otherwise, the waitForResponse method used will have no effect
    xmlObj.Open "GET", TheURL, True
    xmlObj.Send

    'Wait for up to 3 seconds if we've not gotten the data yet
    If xmlObj.readyState <> 4 Then xmlObj.waitForResponse 3

    'Did an error occur? If so, use a default value for our data
    If Err.Number <> 0 Then
        GetRemoteContent = "There was an error retrieving the remote page"
    Else
        'If we reach here, we know the server responded
        'To accommodate unexpected behaviors, ensure the
        'readyState property equals 4
        'and the Status property, which returns the HTTP response status,
        'equals 200
        If (xmlObj.readyState <> 4) Or (xmlObj.Status <> 200) Then
            'Abort the request
            xmlObj.Abort
            GetRemoteContent = "Problem communicating with remote server..."
        Else
            GetRemoteContent = xmlObj.ResponseText
        End If
    End If

    Set xmlObj = Nothing
End Function

Now the thing you have to remember is that this brings back the entire page, so if there's content on the page you want to use, you should parse it out. At a minimum, you probably only want the content between the opening and closing BODY tags. You could create a function to pull that out and display it:

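Function ParseContent(TheContent)
    'Case-insensitive search for a bare <body> tag (no attributes) and its closing tag
    intStart = InStr(LCase(TheContent), "<body>")
    intEnd = InStr(LCase(TheContent), "</body>")
    If intStart > 0 And intEnd > intStart Then
        'Skip past the 6 characters of "<body>"
        ParseContent = Mid(TheContent, intStart + 6, intEnd - (intStart + 6))
    Else
        'Tags not found: return the content unchanged
        ParseContent = TheContent
    End If
End Function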

You can do more parsing inside the function or extract other content by using the various string functions of VBScript. You could have different functions that pull out different sections of the remote page for display.
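
As a rough sketch of that idea, a single general-purpose helper can pull out whatever sits between two markers. GetBetween and strTitle are hypothetical names chosen for illustration, and the example assumes strRemoteContent already holds the scraped HTML and that the page has a plain title tag:

```vbscript
Function GetBetween(TheContent, StartMarker, EndMarker)
    Dim intStart, intEnd

    'Default to an empty string if either marker is missing
    GetBetween = ""

    'vbTextCompare makes the search case-insensitive
    intStart = InStr(1, TheContent, StartMarker, vbTextCompare)
    If intStart = 0 Then Exit Function

    'Move past the start marker itself
    intStart = intStart + Len(StartMarker)

    'Look for the end marker only after the start marker
    intEnd = InStr(intStart, TheContent, EndMarker, vbTextCompare)
    If intEnd = 0 Then Exit Function

    GetBetween = Mid(TheContent, intStart, intEnd - intStart)
End Function

'Example: pull the page title out of the scraped HTML
strTitle = GetBetween(strRemoteContent, "<title>", "</title>")
```

Like ParseContent, this only matches markers exactly as written, so a tag with attributes needs a different start marker.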

Although page scraping might seem to be more for content stuffing or other less-than-savory ends, there are many legitimate uses as well: grab the newest news headlines or stock prices, do a price check, see if a page has been updated... Combined with an XSLT style sheet, it can be used to pull in RSS feeds too.

Be aware that this only works on Windows servers and that some hosting companies disable the XMLHTTP object. If you're going to build a site that makes heavy use of it, check with your hosting company first to make sure they have it enabled.
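
One way to find out for yourself is to try creating the object and see whether it fails. This is only a sketch: IsXmlHttpAvailable is a hypothetical name, and a host could block outbound requests in other ways that this check would not detect:

```vbscript
Function IsXmlHttpAvailable()
    Dim xmlObj

    'Suppress the runtime error raised if the component is missing or disabled
    On Error Resume Next
    Set xmlObj = Server.CreateObject("MSXML2.ServerXMLHTTP")

    'No error means the object was created successfully
    IsXmlHttpAvailable = (Err.Number = 0)

    Err.Clear
    Set xmlObj = Nothing

    'Restore normal error handling
    On Error GoTo 0
End Function

If IsXmlHttpAvailable() Then
    Response.Write "ServerXMLHTTP is available"
Else
    Response.Write "ServerXMLHTTP could not be created on this server"
End If
```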

1 comment:

Anonymous said...

Thanks for this as it gave me a few pointers, however there are errors in your code.

You set STRURL but pass URL, which doesn't exist, and your LCASE calls are incorrect: you need to close the brackets correctly, and your MID has one bracket too many.