Bug in XPathOnUrl()?


#1

There seems to be a bug when using XPathOnUrl() to extract attribute values that are not enclosed in quotes. I believe the same bug may exist in other functions such as CsQueryOnUrl(). I'm using SeoTools v8.0.6.

For example, I need to extract the Hreflang codes from https://www.walmart.ca/en, where the actual tags look like this:

<link rel=alternate hreflang=x-default href="/" />
<link rel=alternate hreflang=en-ca href="/en" />
<link rel=alternate hreflang=fr-ca href="/fr"/>

Note that the values of the rel and hreflang attributes are not enclosed in quotes: they appear as alternate and en-ca rather than "alternate" and "en-ca".

The following syntax should work, but it returns a null value:

=Dump( XPathOnUrl( "https://www.walmart.ca/en", "//*[@hreflang]", "hreflang" ) )

I've also tried using the "html" mode as well as blank attributes, but those had the same result.

To verify, I used PhantomJsCloud.XPath as follows, and all 3 Hreflang codes are correctly returned:

=Dump( Connector("PhantomJsCloud.XPath", "https://www.walmart.ca/en", "//*[@hreflang]", "hreflang", TRUE, "us") )
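For anyone who wants to check the markup outside of Excel: unquoted attribute values are valid HTML, and a tolerant parser should read them the same way as quoted ones. The following is only a minimal sketch using Python's standard-library html.parser, not what SeoTools does internally. The SAMPLE string is copied from the three tags above; the class and function names are my own.

```python
from html.parser import HTMLParser

# The three tags from the page source, with unquoted rel/hreflang values.
SAMPLE = '''<link rel=alternate hreflang=x-default href="/" />
<link rel=alternate hreflang=en-ca href="/en" />
<link rel=alternate hreflang=fr-ca href="/fr"/>'''


class HreflangCollector(HTMLParser):
    """Collects the value of every hreflang attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.codes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; quotes, if any, are already stripped.
        # Self-closing tags also end up here via the default handle_startendtag.
        for name, value in attrs:
            if name == "hreflang":
                self.codes.append(value)


def extract_hreflang(html):
    """Return all hreflang values found in the given HTML string."""
    collector = HreflangCollector()
    collector.feed(html)
    collector.close()
    return collector.codes


if __name__ == "__main__":
    print(extract_hreflang(SAMPLE))
```

If this returns all three codes for the static snippet, the missing quotes alone are unlikely to be the whole story, and the HTML that the request actually receives is worth a look.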

#2

How are you inspecting the XML tree? The elements have quotes in Chrome:
image

Also, I can't find these tags in the original HTML code, so they are probably added after SeoTools makes the XPath request, which would explain why PhantomJS returns them correctly.


#3

Chrome's Inspect feature shows the generated source code, which Chrome formats nicely, automatically adding the missing quotes. To see the source code as sent by the server, right-click the page and select View page source:

image


#4

Thanks, that's good to know :slight_smile: Weird, I looked at the original code and couldn't find the rel=alternate tag. Will report to Niels.


#5

I tried the function DownloadString on this site and it looks like Walmart is using some type of scraping detection:
image
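A rough way to tell "the tags are missing from the raw response" apart from "the parser cannot read unquoted values" is to fetch the page without a browser and count the hreflang attributes in whatever comes back. This is just a sketch under assumptions: the User-Agent string is a placeholder, the function names are hypothetical, and a site with scraping detection may answer differently from one request to the next.

```python
import re
import urllib.error
import urllib.request

# Matches hreflang values whether or not they are quoted.
HREFLANG_RE = re.compile(r'hreflang\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE)


def find_hreflang(raw_html):
    """Return hreflang values found in the raw, unrendered HTML text."""
    return HREFLANG_RE.findall(raw_html)


def fetch_raw(url, user_agent="Mozilla/5.0 (placeholder)"):
    """Fetch a URL without running any JavaScript; returns (status, body)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Blocked requests often arrive as 4xx/5xx responses with a short body.
        return err.code, err.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    status, body = fetch_raw("https://www.walmart.ca/en")
    codes = find_hreflang(body)
    print(status, len(body), codes)
```

An empty list together with an unusual status code or a very short body would point toward a blocked request rather than a parsing problem.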