Scraping the Cache version from Google

#1

Hi,

I have been trying to scrape the cache version of a list of URLs (only 20) using XPath, but no luck so far.

The list follows this pattern: http://webcache.googleusercontent.com/search?q=cache:https://www.dkc.ae/ and the XPath I use to locate the cache version of the page is: /html/body/div[1]/div[1]/span[1]

The full function is: =XPathOnUrl("http://webcache.googleusercontent.com/search?q=cache:https://www.dkc.ae/";"/html/body/div[1]/div[1]/span[1]")
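For a list of URLs, the lookup address can be built by prefixing each page URL with the pattern quoted above. A minimal Python sketch, assuming the prefix is used exactly as it appears in this post; the helper name `build_cache_url` is hypothetical and only for illustration:

```python
# Prefix taken from the pattern quoted in this post.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"


def build_cache_url(page_url: str) -> str:
    """Return the cache lookup URL for a page URL (hypothetical helper)."""
    # Strip stray whitespace that often sneaks in when copying from a sheet.
    return CACHE_PREFIX + page_url.strip()


if __name__ == "__main__":
    pages = ["https://www.dkc.ae/"]
    for page in pages:
        print(build_cache_url(page))
```

The output of this helper is the first argument that would be passed to XPathOnUrl.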

Is there anything I am missing or doing wrong? It just returns blank cells.

Thanks,

#2

Can you tell me where the cache version is available on that page?

#3

If you open this URL in the browser: http://webcache.googleusercontent.com/search?q=cache:https://www.dkc.ae/

I'd like to scrape the following:

I found the XPath for it, which is /html/body/div[1]/div[1]/span[1], but running that XPath returns blank cells.

#4

Aha, didn't see it because the text was hidden behind the banner. This XPath works for me:

=XPathOnUrl("http://webcache.googleusercontent.com/search?q=cache:https://www.dkc.ae/";"//div[1]/span[2]")
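To see what a relative expression like //div[1]/span[2] selects, here is a rough Python sketch using the standard library. The markup is a made-up, simplified stand-in and not the real structure of the cache page; it only illustrates how the expression picks the second span inside a div that is the first div of its parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. This is NOT the real cache page.
SAMPLE = """<html><body>
<div>
  <div>
    <span>banner text</span>
    <span>https://www.example.com/</span>
  </div>
</div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# ElementTree requires a leading "." when searching descendants of an element,
# so "//div[1]/span[2]" is written as ".//div[1]/span[2]" here.
matches = root.findall(".//div[1]/span[2]")
texts = [m.text for m in matches]
print(texts)
```

The search is not tied to a fixed chain of ancestors starting at /html/body, so it can still match when the nesting above the target div differs from what the browser inspector shows.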

However, this service is the same as the one used in the GoogleCacheDate connector, so I'm not sure why you prefer to scrape this manually?

#5

Thanks for your help.

I thought =Connector("GoogleCacheDate.GoogleCacheDate";C1) only works for the date. Can I pull the cached URL through that function too?

Thanks again!

#6

Hmm, isn't the URL always set to the latest URL? Is the purpose to identify redirects?

#7

The purpose is to identify pages Google has indexed that don't match the original URL pattern. Basically, a mismatch between the indexed and the cached URL, but XPath lets me do that, so no worries. :)
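Once the cached URL has been scraped, the mismatch check could look something like the Python sketch below. The normalization rules are assumptions for illustration only (host compared case-insensitively, trailing slash ignored, scheme and query string ignored), and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> tuple:
    """Reduce a URL to (host, path) for comparison. Assumed rules, see above."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    # Treat "/page/" and "/page" as the same path; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    return (host, path)


def is_mismatch(indexed_url: str, cached_url: str) -> bool:
    """True when the cached URL differs from the indexed URL after normalizing."""
    return normalize(indexed_url) != normalize(cached_url)


if __name__ == "__main__":
    print(is_mismatch("https://www.dkc.ae/", "https://www.dkc.ae"))
    print(is_mismatch("https://www.dkc.ae/", "https://www.dkc.ae/en/"))
```

If scheme changes (http vs https) or query strings matter for the audit, they would need to be added to the tuple returned by `normalize`.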