How to write an XPath query for text within <script> using PhantomJS

Danny · February 13, 2019, 2:59am

I am trying to scrape some specific content that sits within the <script> section of a page (at the bottom of the page, just before the closing tag). It is my understanding that this can't be done with regular XPath, so I will be using PhantomJS Cloud via the SEOTools for Excel plugin.

Please see sample of code below:

window.__INITIAL_STATE__ = {"questions":{"list":{},"status":{}},"sites":{"list":{"SEOTest":{"joined":"2016-04-17T22:00:31.000Z","threshold":[],"abn":"8724483318952",

I want to be able to scrape the text after the "abn" field, so the XPath would return "8724483318952". Can anybody help me with this?

diskborste · February 13, 2019, 9:38am

I think regex is better for this task. Can you provide the URL so I can test?

Danny · February 13, 2019, 10:14pm

Thanks for replying, diskborste.

Here is a URL I'm trying to scrape:

https://hipages.com.au/connect/mscleansydney

Press Ctrl+F and search for 'threshold' and you'll see the section I want to scrape. Note that it is at the end of the page and is in a JS/JSON-type format. If you can scrape the ABN within this section, that would be amazing.

diskborste · February 14, 2019, 6:46am

Try this formula:
=RegexpFindOnUrl("https://hipages.com.au/connect/mscleansydney";"""abn"":""(\d+)""";1)

No PhantomJS is required. You may need to replace the semicolons with commas, depending on your regional settings.
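For anyone who wants to check the pattern outside Excel, here is a minimal Python sketch of the same regex. It runs against the truncated sample posted earlier in the thread rather than the live page, and the names `sample`, `ABN_PATTERN` and `find_abn` are made up for illustration. The doubled quotes in the formula are only Excel's string escaping; the actual pattern is `"abn":"(\d+)"`.

```python
import re

# Snippet copied from the sample earlier in the thread (truncated as posted).
sample = (
    'window.__INITIAL_STATE__ = {"questions":{"list":{},"status":{}},'
    '"sites":{"list":{"SEOTest":{"joined":"2016-04-17T22:00:31.000Z",'
    '"threshold":[],"abn":"8724483318952",'
)

# Same pattern as the formula, without Excel's doubled-quote escaping.
# The parentheses capture only the digits between the quotes.
ABN_PATTERN = re.compile(r'"abn":"(\d+)"')


def find_abn(text):
    """Return the first run of ABN digits found in text, or None if absent."""
    match = ABN_PATTERN.search(text)
    return match.group(1) if match else None


print(find_abn(sample))
```

The final argument `1` in the formula appears to play the same role as `group(1)` here, i.e. it returns the captured group rather than the whole match; that reading is an assumption based on how the formula behaves in this thread.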

Danny · February 15, 2019, 6:07am

Thank you so much, Victor, that worked great. I really appreciate it.

I was also able to use this to extract other fields, such as 'mobile'. However, when I change it to fields such as "website", "email" or "suburb", I just get a blank result. Am I doing something wrong?

diskborste · February 15, 2019, 7:07am

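The \d+ is regex for one or more digits, so it stops matching at the first letter or other non-numeric character. Fields like "website" or "suburb" contain non-digits, which is why the result is blank. Try changing \d+ to .*? so the capture group becomes (.*?).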
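To show the difference, here is a small Python sketch comparing the two capture groups. The `doc` string is invented sample data (the real page's field values are not shown in this thread), and `extract_digits` and `extract_field` are hypothetical helper names.

```python
import re

# Hypothetical sample data, shaped like the JSON on the page.
doc = '{"abn":"8724483318952","website":"https://example.com","suburb":"Sydney"}'


def extract_digits(text, field):
    """Digits-only capture: matches only if the value is made up of digits."""
    match = re.search(r'"' + re.escape(field) + r'":"(\d+)"', text)
    return match.group(1) if match else None


def extract_field(text, field):
    """Lazy capture: takes everything up to the next closing quote."""
    match = re.search(r'"' + re.escape(field) + r'":"(.*?)"', text)
    return match.group(1) if match else None


for name in ("abn", "website", "suburb"):
    print(name, extract_digits(doc, name), extract_field(doc, name))
```

The `?` after `.*` makes the match lazy, so it stops at the first closing quote instead of running on to the last quote in the page. Applied to the earlier formula, the pattern for the website would presumably become `"""website"":""(.*?)"""`, assuming the page stores that field in the same `"name":"value"` form as the ABN.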