Web Scraping using TOR. How To? Ideas?


#1

I’ve seen articles online about users scraping sites using Tor: “slower speed but better protection.”

Any ideas on how to scrape, for example with XPathOnUrl, through Tor proxies?

Articles I’ve seen use Python; example article: Anonymous Web Crawling with Tor in Python.
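For context, the Python approach in that kind of article typically just routes HTTP requests through Tor’s local SOCKS proxy. A minimal sketch, assuming a Tor client listening on its default port 9050 and the `requests` library installed with SOCKS support (`pip install requests[socks]`):

```python
import requests

def tor_session():
    """Build a requests Session that routes all traffic through a local Tor client.

    Assumes Tor is listening on 127.0.0.1:9050 (the Tor Browser bundle
    uses 9150 instead). `socks5h` makes DNS resolve through Tor too.
    """
    s = requests.Session()
    s.proxies = {
        "http": "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }
    return s

# Usage (requires a running Tor client):
# html = tor_session().get("https://check.torproject.org/").text
```

This is a sketch of the general technique, not SeoTools functionality.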

Is it possible to adjust the HttpSettings parameters using the default SeoTools functionality to include Tor as a proxy?

If not, this would be an awesome connector. I really want to start creating my own connectors, but I don’t know where to start. I’ve examined the current connector configurations, but they don’t show how the SeoTools functions operate. I’ve even tried to examine the raw DLL files, and it’s just a lot of information. I understand bits and pieces, but maybe someone can point me in the right direction.

My goal is to scrape a list of about 100K unique URLs. I currently use the SeoTools delay functionality, but even with a safe delay it would take me way too long.
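To put “way too long” in numbers: 100,000 sequential requests with even a 1-second delay take at least ~27.8 hours before any transfer time. A delay-throttled loop behaves roughly like this sketch (`fetch` is a hypothetical stand-in for whatever HTTP call is used):

```python
import time

def fetch_with_delay(urls, fetch, delay=1.0):
    """Fetch URLs one at a time, sleeping `delay` seconds between requests.

    Total runtime grows linearly: len(urls) * (delay + request time),
    which is why a safe delay doesn't scale to 100K URLs.
    """
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)
    return results
```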

Last question: is there a way to view SeoTools log files? Like extremely detailed log files that show every event and result?

Thanks in advance!
-Damian


#2

Hi Damian,

Great question! Tor is a bit out of my depth but I’ll ask Niels about the possibilities with HTTP Settings.

It’s great to hear you’re looking at the Connector files. The long-term goal is for users to submit and develop Connectors as open source. I have created most of the Connectors, so if you have any questions feel free to ask me here, via PM, or at victor@seotoolsforexcel.com

About the logs, I would recommend the following two methods:

  1. Enable diagnostics mode (might be unstable)
    http://seotoolsforexcel.com/enable-diagnostics-logging/

  2. Enable debug mode. Great resource when building connectors:
    http://seotoolsforexcel.com/enable-debug-mode/


#3

My goal is to scrape a list of about 100K unique URLs

It depends very much on the nature of these URLs. SeoTools is not the tool for such tasks. I would do it with Screaming Frog running in the cloud (or locally with at least 32 GB of RAM, since it runs inside a Java virtual machine) or, again depending on the URLs’ nature, with PhantomJS (directly, not via the SeoTools connector).
The bottleneck of SeoTools is Excel and its slow handling of HTTP requests. Even with a great proxy list you’ll run into ugly CPU issues.
Note: SeoTools is made primarily for analyzing and manipulating data, and maybe for some custom scraping, but not for this kind of brute-force scraping :wink: It’s a surgical knife, not a jackhammer.
To be honest, I don’t know a better tool than Screaming Frog for such massive scrapes based on regex, CSS path, or XPath. I’ve used it intensively for such tasks for some years, and only RAM is your limit.


#4

I’ve seen Screaming Frog, but SeoTools has so much potential for so many other things. I get your point about Excel being a bottleneck for SeoTools, but that brings me to a question: can SeoTools operate outside of Excel and save results in a dataframe, like Python? SeoTools should definitely create its own spreadsheet application that doesn’t rely on Excel’s processing speed. Once you have the data you need, you can open it in Excel for a more user-friendly experience.

I do have an update, however. I started using Avira VPN Pro for all my HTTP requests in SeoTools for Excel, and I must say I scraped large HTML tables from over 30K unique URLs in only about 15 minutes. So Excel’s HTTP requests work perfectly fine. I also use an Intel Core i7 Skylake (unlocked), and it helps tremendously compared to my office computer. I only have 8 GB of RAM, so I think once I upgrade that, everything should run smoothly.
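As an aside, that kind of HTML-table extraction can also be done outside Excel with nothing but the Python standard library. A rough sketch (not how SeoTools does it internally, just the general idea of turning `<table>` markup into rows):

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of <td>/<th> cells from an HTML table into rows."""
    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows of cell text
        self._row = None      # row currently being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# Usage on a tiny example table:
parser = TableRows()
parser.feed("<table><tr><th>url</th></tr><tr><td>/a</td></tr></table>")
# parser.rows now holds the table as a list of rows
```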

Excel in itself is such a great tool; Excel can literally do anything if configured properly. The name doesn’t do it justice, but SeoTools is much more than just an SEO tool. Not taking anything away from Screaming Frog, though; I’ve definitely used it once or twice, but I find that the price was probably the biggest factor against continued use.


#5

Hey diskborste,
is there a way to add a line to that code to change the path the log saves to? It would be a nice minor tweak.