Combined XPathOnURL and Twitter On-Page Scraper

Hi everyone, I'm new to the community and SEOTools.

I have a question I'm hoping you can help with.

Full disclosure: I'm not a programmer, but I know enough to be dangerous ... at least to myself :smile:

Here's what I would like to do:

  1. Go to a specific RSS Feed
  2. Scrape the title and URL for each article in the RSS feed
  3. Scrape any Twitter name (e.g. @somename)
  4. Have each result appear in columns A, B and C. So as an example, the first article in the RSS feed would have the URL in A2, the title in B2, and any Twitter username in C2
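For anyone who wants to see the target layout outside of Excel, here is a minimal Python sketch of steps 1 and 2, assuming a plain RSS 2.0 feed. The sample feed, the URLs in it, and the function name `feed_rows` are made up for illustration; this is not how SEOTools does it internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real feed would be downloaded first.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First article</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second article</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def feed_rows(rss_text):
    """Return one (url, title) tuple per <item>, in feed order."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.iter("item"):
        # Look only inside this item so the channel's own title/link are skipped.
        url = (item.findtext("link") or "").strip()
        title = (item.findtext("title") or "").strip()
        rows.append((url, title))
    return rows


# Row 1 is left for headers, so the first article lands in A2/B2.
for row_number, (url, title) in enumerate(feed_rows(SAMPLE_RSS), start=2):
    print(f"A{row_number}={url}  B{row_number}={title}")
```

The Twitter name for column C would come from fetching each article URL separately, which is left out here.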

In SEOTools I have the following settings:

XPathOnURL
URL: The RSS feed is in here
XPath: //link|//title
Mode: Text
Values Selected

When I try the combination above, all the results end up in column A and look like this:

Link1
Title1
Link2
Title2
Link3
Title3
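My understanding is that a union like //link|//title hands back every match as one flat list in document order, which would explain the interleaving. Below is a rough Python sketch of that behaviour and of the per-item alternative. It assumes a simplified feed; the standard library's ElementTree does not support the `|` union, so the flat list is emulated with a tag filter, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up feed with placeholder values.
SAMPLE = """<rss><channel>
<item><link>Link1</link><title>Title1</title></item>
<item><link>Link2</link><title>Title2</title></item>
</channel></rss>"""


def flat_matches(xml_text):
    """Emulate //link|//title: all matching nodes in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if el.tag in ("link", "title")]


def paired_rows(xml_text):
    """Select link and title relative to each <item>, one row per article."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("link"), item.findtext("title"))
            for item in root.iter("item")]


print(flat_matches(SAMPLE))  # a single column, links and titles alternating
print(paired_rows(SAMPLE))   # two columns per article
```

The point of the second function is that the pairing has to happen per item; a single union expression has no notion of rows.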

The next combination I try is:

XPathOnURL
URL: The RSS feed is in here
XPath: //link|//title
Mode: Text
Formula selected and Dump is checked off

When I try the combination above, it still doesn't work. I get the same results as before, where column A looks like:

Link1
Title1
Link2
Title2

The next combination I try is:

XPathOnURL
URL: The RSS feed is in here
XPath: //link|//title
Mode: Text
Formula selected and Dump is checked off
Transpose selected

When I try the combination above, it still doesn't work. Row 1 is filled in across the top and looks like:

Title1 Link1 Title2 Link2

Ideally, the output would look like the following:

And this brings me to my second question.

How do I configure the On-Page Scrapers to display the Twitter handle in column C?

A big thank you in advance for my newbie question!

Jeffrey

Hi,

The easiest solution is to download the latest beta and use our new RSS Parser (located under Others->Scraping->RSS):

Next, go to the Twitter Accounts On-Page Scraper (located under Others->Scraping->On-Page Scrapers->Twitter Accounts).

Select Formula Mode and make sure Dump and Transpose are chosen. (Dump extracts more than one item per URL, and Transpose makes the items go horizontally, since there is one URL per row.) Then change the arbitrary input to the first cell with the URL and expand the formula:

Thanks for the suggestion and how-to.

I've downloaded and installed the latest beta version of SEOTools.

When I go to Others -> Scraping -> On-Page Scrapers, I'm not seeing RSS.

Here's a screen shot:

When I check 'About' in SEOTools in Excel, I'm seeing 6.1.10.

Let me know what I can do to access the RSS Parser.

Oh, you are right. The RSS Connector will be included in the next release, which will (hopefully) be available in a few days. In the meantime, you can copy the following code and save it as RSS.xml in the connectors folder:

I downloaded the RSS file. Thank you for the tip.

The issue I'm now having is that Excel will not load:

Any suggestions?

The RSS code should only be 21 lines of code. Can you post a print screen of the code?

I did a copy and paste this time, and it now works :smile:

I downloaded the rss.xml from https://github.com/nielsbosma/SeoTools-for-Excel-Connectors/find/master

I now have the title and url!

The last piece of the puzzle is how I can scrape the Twitter name (e.g. @myname) that appears in the article.

Your help with this would be appreciated!

Copy the file OnPageScrapers.xml from the same library (it has some improvements and only returns unique Twitter names). Then do as I described in my original post :slight_smile:

Thanks again for the reply.

I went to the master branch and copied and pasted OnPageScrapers.xml.

It works with the exception of not showing any Twitter names.

Let me know what else I can do.

I appreciate your help and patience.

Can you post a print screen of the code for OnPageScrapers.xml, and also of your request in Excel?

I have it working. I re-read one of the original emails that got lost in the bug issues.

It's great!

Now a few fine-tuning questions.

  1. Twitter Names Not Being Scraped
    If you look at the URL below, for some reason it missed the Twitter account. It seems the Twitter name is not being scraped from any of Inc's URLs.

http://www.inc.com/jessica-stillman/bill-gates-says-this-is-the-best-non-fiction-book-hes-read-for-ages.html

  2. Excess Information
    Some of the results are returning what look like categories. As an example:

'share' is being scraped from: https://www.marketingprofs.com/charts/2017/31686/emails-sent-by-marketers-on-mondays-have-more-mistakes

'inc' is being scraped from www [dot] inc [dot] com/travis-bradberry/13-things-that-will-make-you-much-happier.html

Is there any way to have these kinds of results automatically omitted?

Questions aside, it's great to have this working. It's a huge time saver.

Thanks, I will take a look at this!

I've made the regex case insensitive so "Inc" and "inc" will not appear twice.

I've also excluded generic twitter links such as:
share|intent|tweet|share|signup|sessions
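
To illustrate the idea, here is a rough Python sketch of that kind of filtering. It is not the connector's actual code; the function name `clean_handles` and the exact word list are assumptions for illustration.

```python
# Path segments that follow twitter.com/ but are not account names (assumed list).
GENERIC = {"share", "intent", "tweet", "signup", "sessions"}


def clean_handles(names):
    """Drop generic link words and case-insensitive duplicates, keeping order."""
    seen = set()
    result = []
    for name in names:
        key = name.lower()  # compare in lower case so "Inc" and "inc" count once
        if key in GENERIC or key in seen:
            continue
        seen.add(key)
        result.append(name)  # keep the spelling of the first occurrence
    return result


print(clean_handles(["Inc", "inc", "share", "EntryLevelRebel", "intent"]))
```

Note that "inc" itself is not on the exclusion list, so it survives the filter once.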

Isn't Inc a valid Twitter account?
https://twitter.com/inc

There are some usernames which are tricky to scrape. In your example, the html code is:
http://www.twitter.com/EntryLevelRebel
which messes up the regex. Will take a closer look at this later.
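
As a sketch of one possible way to cope with that form, the pattern below treats the `www.` prefix as optional. This is an assumption about what a fix could look like, not the regex used in OnPageScrapers.xml; `HANDLE_RE` and `find_handles` are hypothetical names.

```python
import re

# Optional "www." before twitter.com; handles are up to 15 letters, digits or underscores.
HANDLE_RE = re.compile(
    r"https?://(?:www\.)?twitter\.com/@?([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])",
    re.IGNORECASE,
)


def find_handles(html):
    """Return every handle found in twitter.com links, in page order."""
    return HANDLE_RE.findall(html)


sample = (
    '<a href="http://www.twitter.com/EntryLevelRebel">author</a> '
    '<a href="https://twitter.com/inc">publisher</a>'
)
print(find_handles(sample))
```

Generic links such as share or intent URLs would still match this pattern, so they would need to be filtered out afterwards.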

I've updated the xml file:
https://raw.githubusercontent.com/nielsbosma/SeoTools-for-Excel-Connectors/master/OnPageScrapers.xml

Wow, nicely done! It is working great!

You are correct, www.twitter.com/inc is a valid account.

Many thanks!!

Jeffrey