Can I add a delay before scrape with XpathOnURL?

Ryan · November 7, 2016, 7:25pm

Hey all, first post and first question (but I did search for my question before posting!).

Let me start out by saying that it seems like the spider tool is likely what I need to use, but I haven't quite figured out how to use it, yet.

I use a number of websites that generate their data based on a variable passed through the URL. Essentially www.websitereportingdomain.com/test?DOMAINBEINGTESTED

I know I can auto-generate these links using a clever concatenate function in excel and pass them to an XpathOnURL formula. The problem is that it takes 1-3 seconds for the site to generate the report I need.
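For anyone generating the links outside Excel, the concatenation step could look roughly like this minimal Python sketch. The base URL is the placeholder from this post, and `build_report_urls` is a hypothetical helper name, not something SeoTools provides.

```python
from urllib.parse import quote

# Placeholder report URL taken from the post; the real site will differ.
BASE_URL = "http://www.websitereportingdomain.com/test?"

def build_report_urls(domains):
    """Return one report URL per non-empty domain, like an Excel CONCATENATE."""
    # quote() percent-encodes any character that is unsafe in a query string.
    return [BASE_URL + quote(d.strip(), safe="") for d in domains if d.strip()]

urls = build_report_urls(["mcdonalds.com", " example.org ", ""])
for url in urls:
    print(url)
```

Each resulting URL can then be pasted into a column and referenced from the scraping formula.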

Since the XPath scrape appears to happen almost immediately after the request (out of the box, at least), it's scraping a container that hasn't been populated yet, so it returns a blank result. In another tool, I can set a delay before the data is downloaded and the XPath expression is applied. That works, but the other tool isn't feasible for efficient and scalable data mining.

My question is whether a delay can be set with XPathOnUrl so that it loads the page, then waits 1-3 seconds before parsing the site into XML and running the XPath scrape.

The added 3 seconds of delay would end up saving me a total of 15 minutes, per report, so I have a lot of energy to get this to work.

Thanks for any help you can offer!

diskborste · November 8, 2016, 8:45am

You can use HTTP Settings to force a delay between requests:

=XPathOnUrl(url,xpath,"",HttpSettings(,,,"5000|5000|Host"),"text")

5000|5000 means a fixed delay of 5 seconds between requests. Changing it to 5000|10000 gives a random delay of between 5 and 10 seconds.
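As a rough model of how such a setting behaves, here is a minimal Python sketch. It only mirrors the description above and is not SeoTools' actual implementation: the function names are hypothetical, the uniform random draw is an assumption, and the third field ("Host") is passed through untouched because its meaning is not explained here.

```python
import random

def parse_delay_setting(setting):
    """Split a 'min|max|scope' string into (min_ms, max_ms, scope)."""
    min_ms, max_ms, scope = setting.split("|")
    return int(min_ms), int(max_ms), scope

def pick_delay_ms(setting, rng=random):
    """Choose the wait before the next request, in milliseconds."""
    min_ms, max_ms, _scope = parse_delay_setting(setting)
    # Equal bounds give a fixed delay; otherwise draw uniformly in the range.
    return rng.randint(min_ms, max_ms)

fixed = pick_delay_ms("5000|5000|Host")
varied = pick_delay_ms("5000|10000|Host")
print(fixed, varied)
```

Note that the wait modelled here sits between one request and the next, not between a page loading and its content being read.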

You can also edit these settings under HTTP Settings.

Ryan · November 8, 2016, 3:04pm

Thanks for the reply!

I've set the delay using the HTTP settings, as you suggested, but it's returning a blank cell. After the delay time has passed, the cell remains blank. I've tried GET, POST and HEAD, to no avail.

This suggests that the delay only affects the time between requests. I need to send the request, but wait a specified amount of time before caching the data and running my xpath commands.

To give a more accurate example:

I need to request this page: (https://developers.google.com/speed/pagespeed/insights/?url=mcdonalds.com)
I need to delay 5-7 seconds while the page loads
Finally, I need to pull the data in the following container:
//*[@id="page-speed-insights"]/div[2]/div/div[2]/div[1]/div[3]/div[3]/h2/span[1]

This returns the mobile User Experience Score in the form "XX / 100".

I'm open to other suggestions about how to grab this. I've never written a connector before, but I understand the basics of programming and don't mind tweaking the file, if someone has the code I need.

Basically, at the end of the day, if there's a way to automate this, it would be SUPER helpful!

Ryan · November 8, 2016, 3:52pm

Actually, I think the Google connector is using v1 of the PageSpeed Insights API, rather than v2. The v1 API doesn't have a specific call for that data.

If I change the Fetch from

<Fetch Url="https://www.googleapis.com/pagespeedonline/v1/runPagespeed?...

to

<Fetch Url="https://www.googleapis.com/pagespeedonline/v2/runPagespeed?...

I'll also need to change

<JsonPath Id="SpeedScore" Expr="score" Converter="Int"/>

to

<JsonPath Id="SpeedScore" Expr="result.score" Converter="Int"/>

Right?

Is there a list of v2 commands, or am I oversimplifying it?

Ryan · November 8, 2016, 5:27pm

Updating the GooglePageSpeed connector to use V2 of the API:

Step 1:
change the Fetch from:

<Fetch Url="https://www.googleapis.com/pagespeedonline/v1/runPagespeed?...

to

<Fetch Url="https://www.googleapis.com/pagespeedonline/v2/runPagespeed?...

Step 2:
Change the score call from:

<JsonPath Id="SpeedScore" Expr="score" Converter="Int"/>

to

<JsonPath Id="Score" Expr="ruleGroups.SPEED.score" Converter="Int"/>

Step 3:
Add support for the mobile user experience score by adding:

<JsonPath Id="Mobile Experience Score" Expr="ruleGroups.USABILITY.score" Converter="Int"/>

Which should look something like this, when you're done:

The new function will ONLY work if you select the "mobile" option. I'm not a programmer, so I'm not sure how to have it only appear when "mobile" is selected.
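To make the two expressions concrete, here is a minimal Python sketch of how dotted paths like these resolve against a response. The sample JSON is invented and heavily trimmed, `json_path` is a hypothetical helper rather than the connector's actual JsonPath engine, and the idea that a desktop run simply omits the USABILITY group is my assumption.

```python
import json

# Hypothetical, heavily trimmed v2-style responses (values are made up).
MOBILE_RESPONSE = json.loads("""
{"ruleGroups": {"SPEED": {"score": 62}, "USABILITY": {"score": 98}}}
""")
DESKTOP_RESPONSE = json.loads("""
{"ruleGroups": {"SPEED": {"score": 75}}}
""")

def json_path(data, expr):
    """Follow a dotted path such as 'ruleGroups.SPEED.score'; None if missing."""
    current = data
    for key in expr.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

speed = json_path(MOBILE_RESPONSE, "ruleGroups.SPEED.score")
usability = json_path(MOBILE_RESPONSE, "ruleGroups.USABILITY.score")
# With no USABILITY group in the response, the path yields nothing.
desktop_usability = json_path(DESKTOP_RESPONSE, "ruleGroups.USABILITY.score")
print(speed, usability, desktop_usability)
```

Under that assumption, the missing group is why the extra field comes back empty unless "mobile" is selected.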

@nielsbosma, Unless I'm missing something that's also connected, this might be a quick update to the included connector.

Ryan · November 8, 2016, 5:28pm

Now, to figure out how to do this for the other sites I need connectors for...

diskborste · November 8, 2016, 6:41pm

Ryan, thank you very much for noticing the updated API version and making suggestions to the connector code!

I don't know of an easy way to disable the Usability Score request when Mobile is not selected. The connector would need to add or remove the variable from the fields list as the user selects and deselects Mobile, before "Insert" is clicked. Simply excluding it from the results would be confusing if the user had selected it from the list. I'll talk to Niels and hear what Master Yoda thinks.

I've updated the connector code with the correct expressions and added the extra Mobile score. Also added support for Locale selection and Third Party Resources exclusion.
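For reference, the kind of request these options map to can be sketched in Python as below. The query parameter names (`url`, `strategy`, `locale`, `filter_third_party_resources`) are how I recall the public v2 API, not something copied from the connector, whose Fetch URL is elided in this thread and may differ. `build_pagespeed_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.googleapis.com/pagespeedonline/v2/runPagespeed"

def build_pagespeed_url(page_url, strategy="mobile", locale="en",
                        filter_third_party=False):
    """Assemble a v2-style request URL (parameter names are assumed)."""
    params = {
        "url": page_url,
        "strategy": strategy,  # "mobile" or "desktop"
        "locale": locale,
        # The API expects a lowercase boolean string.
        "filter_third_party_resources": str(filter_third_party).lower(),
    }
    # urlencode() percent-encodes the values, including the target URL.
    return ENDPOINT + "?" + urlencode(params)

request_url = build_pagespeed_url("http://mcdonalds.com")
print(request_url)
```

An API key, if one is required, would be one more parameter in the same dictionary.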

https://github.com/nielsbosma/SeoTools-for-Excel-Connectors/blob/master/GooglePageSpeed.xml

<?xml version="1.0" encoding="utf-8" ?>
<Suite Title="Google PageSpeed" Id="GooglePageSpeed" Category="SEO" SourceUrl="https://github.com/nielsbosma/SeoTools-for-Excel-Connectors/blob/master/GooglePageSpeed.xml" HelpUrl="http://seotoolsforexcel.com/google-pagespeed/" HelpText="Documentation">

  <Author Name="Victor Sandberg" Url="http://community.seotoolsforexcel.com/users/diskborste/activity" />

  <Resources>
    <Resource Id="Parameters">
			<Parameters>
				<Text Id="Url" Title="URL" Required="true" Debug.DefaultValue="http://www.seotoolsforexcel.com"/>
				<Select Id="Locale" Title="Locale" Required="false" DefaultValue="en">
					<DataSource>
						<Item Id="ar" Title="Arabic"/>
						<Item Id="bg" Title="Bulgarian"/>
						<Item Id="ca" Title="Catalan"/>
						<Item Id="zh-TW" Title="Traditional Chinese (Taiwan)"/>
						<Item Id="zh-CN" Title="Simplified Chinese"/>
						<Item Id="hr" Title="Croatian"/>
						<Item Id="cs" Title="Czech"/>
						<Item Id="da" Title="Danish"/>
						<Item Id="nl" Title="Dutch"/>

This file has been truncated.

diskborste · November 8, 2016, 6:47pm

It would be awesome if you gave Connectors a shot! We are in the process of creating a cloud-based cookbook and a guide for writing connectors.

What Connectors did you have in mind? If you need assistance, send me a PM or add me on Skype (+46709545111) and I'll try my best to help you.