Extracting outbound links from a site's recently published articles

Tareq · October 7, 2021, 8:53am

Hi Guys,

I am new to SeoTools for Excel and am fascinated by this tool.

I am trying to get a list of a site's recently published articles and then extract the outbound links from those articles, but I don't know how to do that with this tool. I have broken the task into two subtasks:

Subtask 1: Getting the list of recently published article URLs -- I tried the spider, Google search, and Bing search functions to get the URL list, then sorted it by the most recent Google cache date. It's not working properly. First, this gives me random category and tag page URLs. Second, the Google cache lookup reports that the limit has been reached after returning only some of the cache dates. Is there any way to get the publish dates, or a list of recently published articles?

Subtask 2: Extracting external links from a URL -- Suppose I succeed in getting the list. To extract the links I used the XPathOnUrl function, but this gives me a long list. Can I filter it down to only the external links?

I tried to find tutorials for this tool, but all I can find are some videos that are 7 or 8 years old. Can anyone point me to a tutorial (article or video) for this tool? If there are none, I think it would be great if you guys compiled a tutorial covering the different functions in detail.

Best,
Tareq

WolfeDen · October 7, 2021, 3:15pm

Hey, @Tareq.

For your subtask 1, I wouldn't recommend using Google's cache date in your case, because that date will likely change whenever Google recrawls the page. Instead, try to extract the publish date from each page, if it exists. If it's not visible on-page, try looking in the metadata - there's usually some clue within the HTML. Then you can sort using that date to proceed to subtask 2.
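To illustrate the metadata idea outside of Excel, here is a minimal Python sketch that scans a page's HTML for a publish date. The tag names it looks for (`article:published_time`, `datePublished`, `<time datetime>`) are assumptions based on common conventions, not something your site is guaranteed to use, and `find_publish_date` is just a name made up for this example. If your site does use the Open Graph tag, an XPath along the lines of //meta[@property='article:published_time']/@content may get you the same value from within the tool.

```python
from html.parser import HTMLParser

# Meta keys that commonly carry a publish date. This list is an assumption;
# inspect the site's own HTML and adjust it as needed.
DATE_KEYS = ("article:published_time", "datepublished", "date", "pubdate")


class PublishDateParser(HTMLParser):
    """Collects candidate publish dates from <meta> and <time> tags."""

    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # The key may live in property=, name=, or itemprop=.
            key = attrs.get("property") or attrs.get("name") or attrs.get("itemprop") or ""
            if key.lower() in DATE_KEYS and attrs.get("content"):
                self.dates.append(attrs["content"])
        elif tag == "time" and attrs.get("datetime"):
            self.dates.append(attrs["datetime"])


def find_publish_date(html):
    """Return the first candidate date found in the HTML, or None."""
    parser = PublishDateParser()
    parser.feed(html)
    return parser.dates[0] if parser.dates else None
```

The function returns the raw string from the page, so you would still need to parse it into a real date before sorting.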

For your subtask 2, give this XPath a try. Be sure to replace the _____ with your site's homepage URL:

/html/body//a[@href and not(starts-with(@href,'_____')) and not(starts-with(@href,'/')) and not(starts-with(@href,'#')) and not(starts-with(@href,'javascript'))]

This should give you a list of all externally linked URLs. It will also exclude potential false positives such as root-relative URLs (those starting with "/"), in-page anchors, and JavaScript links. Good luck - I hope this helps!
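If you ever want to double-check the results outside of Excel, the same filtering idea can be sketched in Python by comparing hostnames instead of string prefixes. This is only an illustration under my own assumptions: `external_links` and `LinkParser` are made-up names, and the sketch treats any http(s) link whose host differs from the page's host as external. Unlike the prefix-based XPath, it also handles relative links such as page2.html (internal) and protocol-relative links such as //cdn.other.org/y (external), and it drops mailto: links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, page_url):
    """Return absolute http(s) links whose host differs from page_url's host.

    Caveat: hosts are compared exactly, so www.example.com and example.com
    count as different sites. Normalize them first if that matters to you.
    """
    site_host = urlparse(page_url).netloc.lower()
    parser = LinkParser()
    parser.feed(html)
    result = []
    for href in parser.hrefs:
        # Resolve relative and protocol-relative links against the page URL.
        absolute = urljoin(page_url, href.strip())
        parts = urlparse(absolute)
        # Keep only web links; this drops javascript:, mailto:, tel:, etc.
        if parts.scheme in ("http", "https") and parts.netloc.lower() != site_host:
            result.append(absolute)
    return result
```

You would call it as external_links(page_html, "https://example.com/some-article"), passing the article's own URL so relative links resolve correctly.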

-Tim W.

Tareq · October 8, 2021, 6:58am

Hey Tim,

Thank you very much for your help. I am really grateful to you.

Subtask 2 is working perfectly. I am now trying to pull the publish date from the metadata.

Can you please share some insight into how I can get the list of published articles? The spider gives me lots of category and tag pages, which I don't need. I am thinking of crawling with Screaming Frog, exporting the URL list, and then using this plugin on it. If you have a better idea, that would be great!

Also, can you please refer me to any tutorials for learning XPath, regex, or SeoTools for Excel?

Anyway, thank you again for the help.