Bug in XPathOnUrl()?

WolfeDen · April 19, 2018, 2:54pm

There seems to be a bug when using XPathOnUrl() to extract attribute values that are not enclosed in quotes. I believe the same bug may exist in other functions such as CsQueryOnUrl(). I'm using SeoTools v8.0.6.

For example, I need to extract the Hreflang codes from https://www.walmart.ca/en, where the actual tags look like this:

<link rel=alternate hreflang=x-default href="/" />
<link rel=alternate hreflang=en-ca href="/en" />
<link rel=alternate hreflang=fr-ca href="/fr"/>

Note that the values of the rel and hreflang attributes are not enclosed in quotes (that is, they are not written as "alternate" and "en-ca").
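As a cross-check outside SeoTools, here is a minimal Python sketch showing that unquoted attribute values like these are valid HTML and can be read by a standard parser. It uses only the standard library's html.parser. The names HreflangCollector, extract_hreflang and SAMPLE are made up for illustration, and the sample markup is just the three tags quoted above, not a live download of the page.

```python
from html.parser import HTMLParser


class HreflangCollector(HTMLParser):
    """Collects the value of every hreflang attribute seen in the markup."""

    def __init__(self):
        super().__init__()
        self.codes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # stripped any surrounding quotes, so quoted and unquoted values
        # arrive in the same form.
        for name, value in attrs:
            if name == "hreflang":
                self.codes.append(value)


# Illustrative input: the three tags from this post, copied as-is.
SAMPLE = """<link rel=alternate hreflang=x-default href="/" />
<link rel=alternate hreflang=en-ca href="/en" />
<link rel=alternate hreflang=fr-ca href="/fr"/>"""


def extract_hreflang(html):
    """Return all hreflang values found in the given HTML string."""
    collector = HreflangCollector()
    collector.feed(html)
    collector.close()
    return collector.codes


if __name__ == "__main__":
    print(extract_hreflang(SAMPLE))
```

Run on the sample, this yields the three codes x-default, en-ca and fr-ca. It only demonstrates that the markup itself is parseable; it says nothing about how XPathOnUrl() parses pages internally.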

The following syntax should work, but it returns a null value:

=Dump( XPathOnUrl( "https://www.walmart.ca/en", "//*[@hreflang]", "hreflang" ) )

I've also tried using the "html" mode as well as blank attributes, but those had the same result.

To verify, I used PhantomJsCloud.XPath as follows, and all 3 Hreflang codes are correctly returned:

=Dump( Connector("PhantomJsCloud.XPath", "https://www.walmart.ca/en", "//*[@hreflang]", "hreflang", TRUE, "us") )

diskborste · April 19, 2018, 7:26pm

How are you inspecting the xml tree? The elements have quotes in chrome:

Also, I can't find these tags in the original HTML code so they are probably added after SeoTools makes the xpath request, which is why it returns correctly with PhantomJS.

WolfeDen · April 25, 2018, 3:43am

Chrome's Inspect feature shows the generated source code, which Chrome formats nicely, automatically adding the missing quotes. To see the source code as sent by the server, right-click the page and select View page source:

diskborste · April 25, 2018, 2:59pm

Thanks, that's good to know. Weird, I looked at the original code and couldn't find the rel=alternate tag. Will report to Niels.

diskborste · April 29, 2018, 7:53pm

I tried the function DownloadString on this site and it looks like Walmart is using some type of scraping detection: