Spider question

Nosh · December 13, 2018, 11:14am

I want to scrape date from a site with the spider.

The site is:

https://www.trustedshops.es/encontrar/?cat=cualquier_categor_a&q=

The problem is that I only get a few URLs.

Another problem is that the data is often on external links.

Example:
Category "Auto Motocicletas"
https://www.trustedshops.es/encontrar/?cat=auto_y_motocicletas

First result links to:
https://www.trustedshops.ch/evaluation/info_XB6EB3D6DB95EBD4C7ABBA5B24E4C437D.html
Is there a way to scrape this information?

diskborste · December 13, 2018, 2:15pm

Can you explain what you want to scrape? Date as in what dates? Are you referring to some kind of date item in the website content?

You can choose "Ignore external links" if you only want URLs from the same domain.

Nosh · December 13, 2018, 5:07pm

I want to scrape the online shop details.
The problem is that most of them are on external urls.
Example:
I want to scrape:
https://www.trustedshops.es
First category is "auto y motocicletas":
https://www.trustedshops.es/encontrar/?cat=auto_y_motocicletas
First entry is 123piecesderechange.ch (on an external swiss URL):
https://www.trustedshops.ch/evaluation/info_XB6EB3D6DB95EBD4C7ABBA5B24E4C437D.html

I want:
Shop Name
Contact (address)
CEO (in this case "Thierry Delesall")

diskborste · December 17, 2018, 1:45pm

Normally you can add regex or XPath requests to the Spider to get the values. If they need to be adjusted depending on the HTML structure of each external URL, it would be better to extract only the URLs first and then run the regex/XPath formulas afterwards.
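
To illustrate the "extract the URLs first, run the formulas afterwards" idea outside the Spider, here is a minimal Python sketch of the second step. The HTML fragment, the field names, and the patterns are all hypothetical; the real shop pages will have different markup, so the patterns would need to be adapted per page type.

```python
import re

# Hypothetical HTML fragment; the real markup on the shop pages may differ.
SAMPLE_HTML = """
<h1 class="shop-name">Example Shop</h1>
<div class="ceo">CEO: Jane Doe</div>
"""

# Hypothetical patterns, one per field. Adjust them to the page structure.
PATTERNS = {
    "shop_name": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "ceo": re.compile(r"CEO:\s*([^<]+)"),
}


def extract_fields(html, patterns=PATTERNS):
    """Return the first match for each pattern, or None if nothing matches."""
    result = {}
    for name, pattern in patterns.items():
        match = pattern.search(html)
        result[name] = match.group(1).strip() if match else None
    return result


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

Keeping the patterns in one dictionary makes it easy to swap in a different set when an external domain uses a different layout.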

In the spider, you can uncheck "ignore external urls" and then use the Include button and add trustedshops. This will produce urls from all domains belonging to trustedshops.
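
The same kind of filter can be sketched in Python. This is only an approximation of the idea, not how the Spider implements it: it assumes the keyword is matched against the host name, and the third URL in the list is a made-up counterexample.

```python
from urllib.parse import urlparse


def is_trustedshops_url(url, keyword="trustedshops"):
    """True if the host name of the URL contains the keyword."""
    host = urlparse(url).hostname or ""
    return keyword in host.lower()


urls = [
    "https://www.trustedshops.es/encontrar/?cat=auto_y_motocicletas",
    "https://www.trustedshops.ch/evaluation/info_XB6EB3D6DB95EBD4C7ABBA5B24E4C437D.html",
    "https://www.example.com/trustedshops-review",  # hypothetical, should be dropped
]

# Keep URLs from every trustedshops domain, whatever the country suffix.
kept = [u for u in urls if is_trustedshops_url(u)]
print(kept)
```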

I noticed a bug in the Spider which needs to be fixed. Unfortunately it looks like it can't process the external urls (same site, different domain).

Nosh · December 17, 2018, 3:43pm

Now I get this message...

diskborste · December 17, 2018, 6:41pm

Yes, that is the bug I was referring to.

Nosh · December 17, 2018, 9:42pm

OK, I did not see your comment below the screenshot.

Is there any chance of scraping this site?

diskborste · December 19, 2018, 9:13am

You can do this in smaller steps, for example:

The source URL in cell A1 is the first results page for category auto_y_motocicletas. The external urls are extracted from this page. Then I extracted some example metadata from each of the external urls.
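
For anyone who prefers a script, the first step (collecting the external URLs from a results page) could look roughly like the Python sketch below. It uses only the standard library and runs on a hypothetical HTML snippet instead of downloading the page; the class and function names are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def external_links(html, base_url):
    """Return links whose host differs from the host of base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlparse(base_url).hostname
    return [u for u in collector.links if urlparse(u).hostname != base_host]


# Hypothetical snippet standing in for the real results page.
PAGE_URL = "https://www.trustedshops.es/encontrar/?cat=auto_y_motocicletas"
SAMPLE_HTML = (
    '<a href="/encontrar/?cat=otra">internal</a>'
    '<a href="https://www.trustedshops.ch/evaluation/'
    'info_XB6EB3D6DB95EBD4C7ABBA5B24E4C437D.html">shop</a>'
)

found = external_links(SAMPLE_HTML, PAGE_URL)
print(found)
```

Each URL in the result can then be fetched and passed to whatever regex or XPath extraction fits that page.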

This can be expanded to include all categories and all pages, because the page number is controlled by the "seite=1" string, which can be adjusted. Depending on your Excel skills, this can be automated in several ways.
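
As a rough sketch of how the page URLs could be generated, here is a small Python example. It assumes that "seite" is passed as an ordinary query parameter next to "cat", and the number of pages per category is a placeholder that you would have to look up yourself.

```python
from urllib.parse import urlencode

BASE = "https://www.trustedshops.es/encontrar/"


def page_urls(category, last_page):
    """Build one results-page URL per page number, from 1 to last_page."""
    return [
        f"{BASE}?{urlencode({'cat': category, 'seite': page})}"
        for page in range(1, last_page + 1)
    ]


# last_page=3 is a placeholder, not the real page count for this category.
for url in page_urls("auto_y_motocicletas", 3):
    print(url)
```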

Hope this helps!