Scraping the Cache version from Google

natanspecialist · March 11, 2020, 8:55am

Hi,

I have been trying to scrape the cached version of a list of URLs (only 20) using XPath, but no luck so far.

The URLs in the list follow this pattern: http://webcache.googleusercontent.com/search?q=cache:https://www.dkc.ae/ and the XPath I use to locate the cache version of the page is: /html/body/div[1]/div[1]/span[1]

The full function is: =XPathOnUrl("http://webcache.googleusercontent.com/search?q=cache:https://www.dkc.ae/";"/html/body/div[1]/div[1]/span[1]")

Is there anything I am missing or doing wrong? It just returns blank cells.

Thanks,

diskborste · March 11, 2020, 9:36am

Can you tell me where the cache version is available on that page?

natanspecialist · March 11, 2020, 10:12am

If you open this URL in the browser: http://webcache.googleusercontent.com/search?q=cache:https://www.dkc.ae/

I'd like to scrape the following:

I found the XPath for it, which is /html/body/div[1]/div[1]/span[1], but running that XPath gives me blank cells.

diskborste · March 11, 2020, 11:08am

Aha, didn't see it because the text was hidden behind the banner. This XPath works for me:

=XPathOnUrl("http://webcache.googleusercontent.com/search?q=cache:https://www.dkc.ae/";"//div[1]/span[2]")

However, this service is the same one used by the GoogleCacheDate connector, so I'm not sure why you prefer to scrape it manually?
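On the two XPath expressions above: a rough Python sketch of how an absolute path and a relative "//" path can select different nodes. The markup below is hypothetical and simplified, not the real cache page, and the thread does not establish why the absolute path returned blanks. ElementTree only supports a subset of XPath, so the leading "/html" is dropped (the root element is already html) and "//" is written as ".//".

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real cache banner is likely structured
# differently; the span texts here are placeholders, not Google's wording.
SAMPLE = """
<html>
  <body>
    <div>
      <div>
        <span>banner label</span>
        <span>snapshot details</span>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Equivalent of /html/body/div[1]/div[1]/span[1], relative to <html>.
# Every step must match exactly, so any extra wrapper element breaks it.
absolute_hit = root.findtext("body/div[1]/div[1]/span[1]")

# Equivalent of //div[1]/span[2]: any <div> that is the first <div> of its
# parent, then the second <span> child of that <div>.
relative_hit = root.findtext(".//div[1]/span[2]")

print(absolute_hit)
print(relative_hit)
```

The relative form does not depend on how many wrapper elements sit above the target, which is one reason it tends to survive small differences between the page seen in a browser and the HTML a scraper receives.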

natanspecialist · March 11, 2020, 12:00pm

Thanks for your help.

I thought =Connector("GoogleCacheDate.GoogleCacheDate";C1) only works for the date. Can I pull the cached URL too with that function?

Thanks again!

diskborste · March 11, 2020, 1:17pm

Hmm, isn't the URL always set to the latest URL? Is the purpose to identify redirects?

natanspecialist · March 11, 2020, 1:18pm

The purpose is to identify pages Google has indexed that don't match the original URL pattern. Basically, a mismatch between the indexed and cached URL, but XPath helps me do that, so no worries.
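For anyone doing the same comparison outside the spreadsheet, here is a minimal Python sketch of the indexed-vs-cached check. The names normalize and is_mismatch are made up for illustration, and the normalization rules (ignore the scheme, lowercase the host, ignore a trailing slash) are assumptions about what counts as "the same URL"; adjust them to your own pattern. The /en/ path in the usage lines is an invented example.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to (host, path, query) for comparison.

    Assumptions: the scheme is ignored, the host is lowercased,
    and a trailing slash on the path is not significant.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return host, path, parts.query


def is_mismatch(indexed_url, cached_url):
    """Return True when the two URLs differ after normalization."""
    return normalize(indexed_url) != normalize(cached_url)


# Example usage with the URL from this thread and a hypothetical cached URL.
print(is_mismatch("https://www.dkc.ae/", "https://www.dkc.ae"))
print(is_mismatch("https://www.dkc.ae/", "https://www.dkc.ae/en/"))
```

The first call treats the two URLs as equal because only the trailing slash differs; the second flags a mismatch because the paths differ.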