Spider question


#1

I want to scrape date from a site with the spider.

The site is:

The problem is that I only get a few URLs.

Another problem is that the data is often on external links.

Example:
Category "Auto Motocicletas"

First result links to:


Is there a way to scrape the info?


#2

Can you explain what you want to scrape? Date as in what dates? Are you referring to some kind of date item in the website content?

You can choose "Ignore external links" if you only want URLs from the same domain.


#3

I want to scrape the online shop details.
The problem is that most of them are on external URLs.
Example:
I want to scrape:


First category is "auto y motocicletas":

First entry is 123piecesderechange.ch (on an external swiss URL):

I want:
Shop Name
Contact (address)
CEO (in this case "Thierry Delesall")


#4

Normally you can add regex or XPath requests to the Spider to get the values. If they need to be adjusted depending on the HTML structure of each external URL, then it would be better to extract only the URLs and run the regex/XPath formulas afterwards.
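Outside the Spider, the "extract only the URLs first" step could look roughly like the Python sketch below. It is not how the Spider works internally; it only illustrates the idea with the standard library. The names `LinkCollector` and `external_links` and the example hosts are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, base_url):
    """Return absolute http(s) links whose host differs from base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    result = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc != base_host:
            result.append(absolute)
    return result


# Hypothetical page content and hosts, for illustration only.
sample_html = (
    '<a href="/about">About</a>'
    '<a href="https://shop.example.org/contact">Shop</a>'
)
links = external_links(sample_html, "https://directory.example.com/category")
```

The regex or XPath step can then be run separately on each URL in `links`, with a pattern that suits that particular site.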

In the Spider, you can uncheck "ignore external urls" and then use the Include button and add trustedshops. This will produce URLs from all domains belonging to trustedshops.

I noticed a bug in the Spider which needs to be fixed. Unfortunately it looks like it can't process the external urls (same site, different domain).


#5

I now get this message:


#6

Yes, that is the bug I was referring to.


#7

OK. Did not see your comment below the screenshot.

Is there any chance of scraping this site?


#8

You can do this in smaller steps, for example:

The source URL in cell A1 is the first results page for category auto_y_motocicletas. The external URLs are extracted from this page. Then I extracted some example metadata from each of the external URLs.
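For anyone who prefers a script to spreadsheet formulas, the metadata step might be sketched in Python as follows. Which metadata fields were actually extracted is not stated here, so the page title and meta description are assumptions, and `MetadataParser` and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Captures the <title> text and the meta description, if present."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return the title and description found in an HTML string."""
    parser = MetadataParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "description": parser.description.strip(),
    }


# Made-up HTML standing in for one of the external shop pages.
sample_page = (
    "<html><head><title> Example Shop </title>"
    '<meta name="description" content="Car parts online">'
    "</head><body></body></html>"
)
metadata = extract_metadata(sample_page)
```

Details such as a shop's address or CEO usually sit in page-specific markup, so they would still need a regex or XPath tailored to each site.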

This can be expanded to include all categories and all pages, because the page number is the "seite=1" string, which can be adjusted. Depending on your Excel skills, this can be automated in several ways.
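One way to automate the page numbers outside Excel is a small Python sketch like the one below. It assumes the "seite=1" string appears literally in the results URL; the example URL and the function name `page_urls` are placeholders, not the real site address.

```python
import re


def page_urls(first_page_url, last_page):
    """Build the URLs for pages 1..last_page by rewriting 'seite=<number>'."""
    pattern = re.compile(r"seite=\d+")
    if not pattern.search(first_page_url):
        raise ValueError("URL has no 'seite=<number>' part")
    return [
        pattern.sub(f"seite={page}", first_page_url, count=1)
        for page in range(1, last_page + 1)
    ]


# Placeholder URL; substitute the real first results page of a category.
urls = page_urls("https://directory.example.com/shops?cat=auto&seite=1", 3)
```

The number of pages per category has to be supplied by hand in this sketch; it could also be read from the pagination links on the first results page.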

Hope this helps!