Combined XPathOnURL and Twitter On-Page Scraper

JF13 · March 7, 2017, 11:27pm

Hi everyone, I'm new to the community and SEOTools.

I have a question I'm hoping you can help with.

Full disclosure: I'm not a programmer. I know just enough to be dangerous ... at least to myself.

Here's what I would like to do:

Go to a specific RSS Feed
Scrape the title and URL for each article in the RSS feed
Scrape any Twitter name (e.g. @somename)
Have each result appear in columns A, B and C. So as an example, the first article in the RSS feed would have the URL in A2, the title in B2, and any Twitter username in C2

In SEOTools I have the following:

XPathOnURL
URL: The RSS feed is in here
XPath: //link|//title
Mode: Text
Values Selected

When I try the combination above, all the results end up in column A and look like this:

Link1
Title1
Link2
Title2
Link3
Title3

The next combination I try is:

XPathOnURL
URL: The RSS feed is in here
XPath: //link|//title
Mode: Text
Formula selected and Dump is checked off

When I try the combination above it still doesn't work. I get the same results as before, where column A looks like this:

Link1
Title1
Link2
Title2

The next combination I try is:

XPathOnURL
URL: The RSS feed is in here
XPath: //link|//title
Mode: Text
Formula selected and Dump is checked off
Transpose selected

When I try the combination above it still doesn't work. Row 1 is filled in across the top and looks like:

Title1 Link1 Title2 Link2

Ideally, the output would look like the following:

And this brings me to my second question.

How do I configure the On-Page Scrapers to display the Twitter handle in column C?

A big thank you in advance for my newbie question!

Jeffrey

diskborste · March 9, 2017, 7:13pm

Hi,

The easiest solution is to download the latest beta and use our new RSS Parser (located under Others->Scraping->RSS):

Next, go to the Twitter Accounts On-Page Scraper (located under Others->Scraping->On-Page Scrapers->Twitter Accounts):

Select Formula Mode and make sure Dump and Transpose are chosen. (The purpose of Dump is to extract more than one item per URL, and Transpose makes the items go horizontally, since it's one URL per row.) Then change the arbitrary input to the first cell with the URL and expand the formula:
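To make the Dump and Transpose description above concrete, here is a small Python sketch of how I read it. It is only a model of the layout, not SeoTools code: the function name `layout` and its arguments are made up for illustration.

```python
def layout(items, dump=True, transpose=True):
    """Model the cell grid (a list of rows) produced for ONE URL.

    Hypothetical helper: it mirrors the description in this thread,
    not the actual SeoTools implementation.
    """
    if not dump:
        # Without Dump, only a single item is returned per URL.
        return [[items[0]]] if items else [[""]]
    if transpose:
        # Dump + Transpose: all items in one row, spread across columns.
        return [list(items)]
    # Dump only: all items in one column, spilling down the rows.
    return [[item] for item in items]


handles = ["@alice", "@bob"]
print(layout(handles))                   # one row, two columns
print(layout(handles, transpose=False))  # two rows, one column
```

With one URL per row, the transposed form is the one that keeps each article's handles on that article's row (column C onwards) instead of running down into the next article's row.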

JF13 · March 10, 2017, 2:13pm

Thanks for the suggestion and how-to.

I've downloaded and installed the latest beta version of SEOTOOLS

When I go to Others -> Scraping -> On Page Scrapers I'm not seeing RSS

Here's a screen shot:

When I check the 'About' in SEOTools in Excel I'm seeing 6.1.10

Let me know what I can do to access the RSS Parser

diskborste · March 10, 2017, 6:35pm

Oh, you are right. The RSS Connector will be included in the next release, which will be available in (hopefully) a few days. In the meantime, you can copy the following code and save it as RSS.xml in the connectors folder:

https://github.com/nielsbosma/SeoTools-for-Excel-Connectors/blob/master/RSS.xml

<?xml version="1.0" encoding="utf-8" ?>
<Suite Category="Scraping" Title="RSS" Identifier="RSS" RequireVersion="6.1.0" SourceUrl="https://github.com/nielsbosma/SeoTools-for-Excel-Connectors/blob/master/RSS.xml" HelpUrl="http://seotoolsforexcel.com/rssparser" HelpText="Documentation">

  <Author Name="Niels Bosma" Url="https://se.linkedin.com/in/bosmaniels"/>

  <RestConnector Id="RSS" Title="RSS">
    <Parameters>
       <Text Id="Url" Title="Url" Required="true" Debug.DefaultValue="https://news.ycombinator.com/rss"/>
    </Parameters>
    <Fetch Url="@Model.Url"/>
    <Parse>
      <XPath Expr="/rss/channel/item">
        <XPath Id="link" Title="Link" Expr="link"/>
        <XPath Id="title" Title="Title" Expr="title"/>
        <XPath Id="pubdate" Title="Published" Expr="pubdate" Converter="DateTime"/>
      </XPath>
    </Parse>
  </RestConnector>

</Suite>
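For anyone wondering why this works where `//link|//title` did not: an XPath union returns every matching node in document order, so links and titles come back interleaved in a single list (and the feed's own channel-level title and link are included too). The connector instead selects each `item` first and then reads `link` and `title` relative to it, which gives one row per article. A rough Python equivalent of that idea, using only the standard library, is below. The sample feed and the function name `rss_rows` are invented for illustration, and this is not how SeoTools itself is implemented.

```python
import xml.etree.ElementTree as ET

# Made-up two-item feed, used only to demonstrate the pairing.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <item>
      <title>First article</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second article</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def rss_rows(xml_text):
    """Return one (link, title) tuple per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)  # root element is <rss>
    rows = []
    # Same idea as Expr="/rss/channel/item": pick each item first ...
    for item in root.findall("./channel/item"):
        # ... then read link and title relative to that item.
        link = item.findtext("link", default="")
        title = item.findtext("title", default="")
        rows.append((link, title))
    return rows


for link, title in rss_rows(SAMPLE_FEED):
    print(link, "|", title)
```

Each tuple corresponds to one spreadsheet row, with the URL going to column A and the title to column B.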

JF13 · March 10, 2017, 9:47pm

I downloaded the RSS file. Thank you for the tip.

The issue I'm now having is that Excel will not load:

Any suggestions?

diskborste · March 10, 2017, 10:13pm

The RSS code should only be 21 lines of code. Can you post a print screen of the code?

JF13 · March 10, 2017, 11:56pm

I did a copy and paste this time, and it now works.

I downloaded the rss.xml from https://github.com/nielsbosma/SeoTools-for-Excel-Connectors/find/master

I now have the title and url!

The last piece of the puzzle is how I can scrape the Twitter name (e.g. @myname) that appears in the article.

Your help with this would be appreciated!

diskborste · March 11, 2017, 6:34am

Copy the file OnPageScrapers.xml from the same library (it has some improvements and only returns unique Twitter names). Then do as I described in my original post.

JF13 · March 11, 2017, 6:51am

Thanks again for the reply.

I went to the master and copied and pasted the OnPageScrapers.xml

It works with the exception of not showing any Twitter names.

Let me know what else I can do.

I appreciate your help and patience.

diskborste · March 11, 2017, 11:21am

Can you post a print screen of the code for OnPageScrapers.xml? Also of your request in Excel.

JF13 · March 11, 2017, 6:51pm

I have it working. I re-read one of the original emails that got lost in the bug issues.

It's great!

Now a few fine-tuning questions.

Twitter Names Not Being Scraped
If you look at the URL below, for some reason the scraper missed the Twitter account. It seems none of Inc's URLs return the Twitter name.

http://www.inc.com/jessica-stillman/bill-gates-says-this-is-the-best-non-fiction-book-hes-read-for-ages.html

Excess Information
Some of the results are returning what look like categories. As an example:

'share' is being scraped from: https://www.marketingprofs.com/charts/2017/31686/emails-sent-by-marketers-on-mondays-have-more-mistakes

'inc' is being scraped from www [dot] inc [dot] com/travis-bradberry/13-things-that-will-make-you-much-happier.html

Is there any way to have these kinds of results automatically omitted?

Questions aside, it's great to have this working. It's a huge time saver.

diskborste · March 12, 2017, 9:24am

Thanks, I will take a look at this!

diskborste · March 12, 2017, 4:22pm

I've made the regex case insensitive so "Inc" and "inc" will not appear twice.

Isn't Inc a valid Twitter account?
https://twitter.com/inc

There are some usernames which are tricky to scrape. In your example, the html code is:
http://www.twitter.com/EntryLevelRebel
which messes up the regex. Will take a closer look at this later.

I've updated the xml file:
https://raw.githubusercontent.com/nielsbosma/SeoTools-for-Excel-Connectors/master/OnPageScrapers.xml
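As a rough illustration of the issues discussed above (profile links versus @mentions, case-insensitive de-duplication, and non-account results such as 'share'), here is a Python sketch. It is not the regex used in OnPageScrapers.xml. The pattern, the `RESERVED` list and the function name `twitter_handles` are all assumptions made for this example, and my guess is that 'share' comes from a share-button link of the form twitter.com/share?... rather than from an account.

```python
import re

# Illustrative pattern only (an assumption, not the connector's regex).
# It matches either a twitter.com profile link or an @mention, and
# captures the name (Twitter names are at most 15 word characters).
HANDLE_RE = re.compile(
    r"(?:(?<![\w.-])(?:https?://)?(?:www\.)?twitter\.com/"  # profile link
    r"|(?<![\w@])@)"                                        # or @mention
    r"([A-Za-z0-9_]{1,15})\b",
    re.IGNORECASE,
)

# Path segments that are not accounts. Illustrative and incomplete.
RESERVED = {"share", "intent", "home", "search", "hashtag", "i"}


def twitter_handles(html):
    """Return unique handles in first-seen order, ignoring case."""
    seen = set()
    result = []
    for match in HANDLE_RE.finditer(html):
        name = match.group(1)
        key = name.lower()  # "Inc" and "inc" count as the same account
        if key in RESERVED or key in seen:
            continue
        seen.add(key)
        result.append("@" + name)
    return result


sample = (
    '<a href="http://www.twitter.com/EntryLevelRebel">author</a> '
    "Follow @Inc and @inc. "
    '<a href="https://twitter.com/share?url=abc">Tweet</a> '
    "Contact: me@example.com"
)
print(twitter_handles(sample))
```

The look-behind in front of `@` is there so that email addresses are not picked up as handles. A pattern like this would still match inline CSS at-rules such as `@media`, so a real scraper would need a longer exclusion list or should strip style blocks first.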

JF13 · March 12, 2017, 5:27pm

Wow, nicely done! It is working great!

You are correct, www.twitter.com/inc is a valid account.

Many thanks!!

Jeffrey