Thursday, September 07, 2006

Scraping Search Engines for Page Content

It's no secret that optimized page content can help you rank better in organic search engine results. While page content is not the deciding factor, it's still important, especially for long tail search terms. No matter what your page content, you're probably not going to rank well for a term like 'dvd player' (or any other exceptionally competitive term). But you might be able to rank on a long tail term such as 'panasonic portable dvd player model 123ABCxyz'. For a phrase like that, optimized page content can help.

But how to get the content? Well, you could write it. Or buy it. Or... scrape it from somewhere else.

If you choose the scraping route, I highly recommend that you also have some original content on the page. Write a capsule review, or your own description, or anything else that likely doesn't exist on some other site (until, of course, someone scrapes it).

The following process will step you through scraping content from a search engine and adding it to your page.

WARNING: Implementing these techniques could very well get you banned from the search engines!

1. Decide on the keyword you want to generate content for. For this example, I'll use the fictitious Panasonic DVD model from above.

2. Decide how many search engines you're going to scrape content from. For this example, I'll use 3: MSN, Yahoo, and Gigablast.

3. Since I'm using 3 search engines, I generate a random number between 1 and 3.
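In classic ASP/VBScript that's just a couple of lines (a minimal sketch; intRandNum is the variable carried through the rest of this example):

Randomize
intRandNum = Int((3 * Rnd) + 1)  ' random integer from 1 to 3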

4. Now pass the number into a page-scraping function that goes out and gets the search results for that phrase from the chosen engine. One way to do it is to pass the number and keyword into the function, URL-encoding the term (e.g. with Server.URLEncode) so the spaces don't break the query string:
strContent = ScrapeContent(intRandNum, Server.URLEncode("panasonic portable dvd player model 123ABCxyz"))
Then, in the function, use a Select Case statement to assign the URL based on the random number:
Function ScrapeContent(TheEngine, TheTerm)
    Dim URL, xmlObj

    ' Pick the engine's results URL based on the random number
    Select Case TheEngine
        Case 1
            URL = "http://search.msn.com/results.aspx?q=" & TheTerm
        Case 2
            URL = "http://search.yahoo.com/search?p=" & TheTerm
        Case 3
            URL = "http://www.gigablast.com/search?q=" & TheTerm
    End Select

    Set xmlObj = Server.CreateObject("MSXML2.ServerXMLHTTP")

    On Error Resume Next

    ' Request the results page asynchronously
    xmlObj.Open "GET", URL, True
    xmlObj.Send

    ' Give the remote server up to 3 seconds to respond
    If xmlObj.readyState <> 4 Then xmlObj.waitForResponse 3

    If Err.Number <> 0 Then
        ScrapeContent = "There was an error retrieving the remote page"
    Else
        If (xmlObj.readyState <> 4) Or (xmlObj.Status <> 200) Then
            xmlObj.Abort
            ScrapeContent = "Problem communicating with remote server..."
        Else
            ScrapeContent = xmlObj.ResponseText
        End If
    End If

    Set xmlObj = Nothing
End Function
5. In the above example, we now have the remote content assigned to a variable named strContent. Parse that string to pull out everything between the body tags. Note that Mid takes a length as its third argument, not an end position, so subtract the start position from the position of the closing tag (this also assumes a bare <body> tag; if the engine puts attributes on it, you'll need a looser match):
strContent = Mid(strContent, InStr(strContent, "<body>") + 6, InStr(strContent, "</body>") - (InStr(strContent, "<body>") + 6))
6. Now that you have the body content, replace all the breaks with a space; otherwise words might run together. Passing vbTextCompare makes the replace case-insensitive, so uppercase <BR> tags get caught too:
strContent = Replace(strContent, "<br>", " ", 1, -1, vbTextCompare)
7. There are some extraneous words with no relevance to your search term that you'll probably want to strip out as well. For example, Yahoo appends additional links to each listing: 'Cached', 'More from this site', 'Save', etc. I recommend writing a custom function to strip out all the non-keyword crap that clutters the search results you just scraped.
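Here's a minimal sketch of such a function; the phrase list is just a starting point, and you'd build out your own list for each engine you scrape:

Function StripEngineCruft(strContent)
    Dim arrCruft, i
    ' Boilerplate phrases from the listings that have nothing to do with the keyword
    arrCruft = Array("Cached", "More from this site", "Save")
    For i = 0 To UBound(arrCruft)
        ' Case-insensitive replace, over the whole string
        strContent = Replace(strContent, arrCruft(i), " ", 1, -1, vbTextCompare)
    Next
    StripEngineCruft = strContent
End Function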

8. Now that you have fairly clean copy relevant to your keyword, the last thing to do is strip out the remaining HTML.
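One quick way to do that in VBScript is with the RegExp object; this is a rough sketch that knocks out anything between angle brackets, which is good enough for scraped copy even if it won't handle every malformed tag:

Function StripHTML(strContent)
    Dim objRegExp
    Set objRegExp = New RegExp
    objRegExp.Pattern = "<[^>]+>"  ' match any tag
    objRegExp.Global = True        ' replace every match, not just the first
    ' Swap each tag for a space so adjacent words don't run together
    StripHTML = objRegExp.Replace(strContent, " ")
    Set objRegExp = Nothing
End Function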

The end result: a big chunk of text directly relevant to your keyword.

There are a few things you can do with it: plop it right on the page (not recommended), hide it in a div, or - my recommendation - hide it in a div and reverse cloak it so that only search engines can see it (see the sketch below). Hopefully your pages will start climbing higher in the organic results.
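For the reverse cloaking, a crude sketch is to sniff the user agent and only write the div when a known spider comes calling (user agents are trivial to fake, so serious implementations check spider IP ranges too):

strUA = LCase(Request.ServerVariables("HTTP_USER_AGENT"))
' Only the spiders ever see the scraped copy
If InStr(strUA, "googlebot") > 0 Or InStr(strUA, "msnbot") > 0 Or InStr(strUA, "slurp") > 0 Then
    Response.Write "<div>" & strContent & "</div>"
End If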

Leverage those favorable results quickly! Sooner or later you'll find your site banned, either because the spiders got smarter or a competitor reported you.

Good luck!
