Scrape data from different sites?

Nosh · May 4, 2020, 4:52pm

I know that data can be scraped from pages with the same format.

For example, say I want to scrape company data (I would need the company name, CEO, and VAT ID) from pages like these:
https://www.teamviewer.com/es/aviso-legal/
https://www.metz-mecatech.de/es/aviso-legal.html
https://www.tuv-sud.es/es-es/aviso-legal

Is this somehow possible?

diskborste · May 4, 2020, 5:15pm

Not sure what your question is. If there is a common pattern, then you can scrape based on that pattern. If the pattern differs across sites, then you need a custom solution for each site.

Nosh · May 4, 2020, 5:30pm

Sure. Do words count as a pattern? What I mean by that is: a common company form is GmbH. Is it possible to scrape the words before that?

Or is it possible to scrape the words after a ":" and before the new line?
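Both ideas (words before "GmbH", and text between a ":" and the end of the line) can be written as regular expressions over a page's visible text. Below is a minimal sketch, assuming the page has already been downloaded and reduced to plain text. The sample text, the names `COMPANY_RE` and `FIELD_RE`, and both helper functions are made up for illustration and are not taken from any of the linked pages.

```python
import re

# Hypothetical sample, standing in for the plain text of a legal-notice page.
SAMPLE = """Example Widgets GmbH
Managing Director: Jane Doe
VAT ID: DE123456789
"""

# Words on the same line that come before the company form "GmbH".
COMPANY_RE = re.compile(r"^[ \t]*(.+?)[ \t]+GmbH\b", re.MULTILINE)

# A label, a colon, then everything up to the end of that line.
FIELD_RE = re.compile(r"^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def company_names(text):
    """Return the words preceding 'GmbH' on each line that contains it."""
    return [m.group(1) for m in COMPANY_RE.finditer(text)]


def labelled_fields(text):
    """Return a dict mapping each 'label' to the text after its colon."""
    return {m.group(1): m.group(2) for m in FIELD_RE.finditer(text)}


print(company_names(SAMPLE))
print(labelled_fields(SAMPLE))
```

This only works as long as every page actually writes the data that way. A page that says "S.L." instead of "GmbH", puts the value on the next line, or contains other colons (a URL such as "https://...", for instance) will produce missing or wrong matches.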

diskborste · May 4, 2020, 7:28pm

Are you asking about how to scrape that specific site?

When I said pattern, I meant HTML content that is structured the same way on all the sites, so that the same pattern can be used when making the requests.

Nosh · May 4, 2020, 7:43pm

No. Various sites with "more or less" the same data.
I thought it would be possible to scrape data after and before specific words.

diskborste · May 4, 2020, 8:22pm

I recommend scraping each site separately with a custom XPath or regex. Comparing the first two, they differ in several ways: both the order of the information and the type of information included on the right.
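One way to organize per-site rules is a lookup table keyed by hostname, so that a "bulk" run is just a loop over pages that each use their own patterns. This is only a sketch under assumptions: the hostnames `site-a.example` and `site-b.example`, the patterns, and the functions `extract` and `build_directory` are hypothetical placeholders, not rules derived from the real pages linked above. It also assumes each page has already been fetched and converted to plain text.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules: hostnames and patterns are placeholders.
SITE_RULES = {
    "site-a.example": {
        "company": re.compile(r"^(.+ GmbH)$", re.MULTILINE),
        "vat_id": re.compile(r"VAT ID:\s*(\S+)"),
    },
    "site-b.example": {
        "company": re.compile(r"Company:\s*(.+)"),
        "vat_id": re.compile(r"VAT number\s+(\S+)"),
    },
}


def extract(url, text):
    """Apply the rules registered for the URL's hostname to the page text."""
    host = urlparse(url).hostname
    rules = SITE_RULES.get(host)
    if rules is None:
        raise KeyError(f"no extraction rules defined for {host!r}")
    record = {}
    for field, pattern in rules.items():
        match = pattern.search(text)
        # A field is None when the site's pattern did not match.
        record[field] = match.group(1).strip() if match else None
    return record


def build_directory(pages):
    """pages is an iterable of (url, page_text) pairs."""
    return [extract(url, text) for url, text in pages]
```

Adding a new site then means adding one entry to `SITE_RULES`. The extraction loop stays the same, but someone still has to write and check the patterns for each site by hand.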

Nosh · May 4, 2020, 8:24pm

Ok, thanks. That is what I was expecting, but I still had hope that it would somehow be possible to "bulk" scrape and create a directory.