Rotating Proxy Threads In Python API Scraper - python

I have a scraper that uses multithreaded rotating proxies. The program works great, but I was wondering what I could do to make it run faster. Currently I run it both at home (behind my home router) and on a VPS, and I notice the program runs much faster on the VPS.
I was wondering why that is. Is it down to GPU, bandwidth, the number of concurrent connections allowed by the ISP, or the number of open threads allowed by Proxyrack?
My end goal is to be able to run 10,000+ threads a second.
I have tried purchasing more threads from Proxyrack, but usage just maxes out at around 2,000 of the 5,000. Their support told me my threads close too quickly.
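For reference, a minimal sketch of the kind of setup described above (the proxy endpoints, target URL, and thread count below are placeholders, not the poster's actual configuration):

import itertools
import threading
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder rotating-proxy endpoints and target URL.
PROXIES = ["http://user:pass@proxy1:8080", "http://user:pass@proxy2:8080"]
TARGET = "http://example.com/api"

proxy_cycle = itertools.cycle(PROXIES)
cycle_lock = threading.Lock()

def fetch(url):
    with cycle_lock:                 # rotate to the next proxy, thread-safely
        proxy = next(proxy_cycle)
    resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    return resp.status_code

with ThreadPoolExecutor(max_workers=200) as pool:
    results = list(pool.map(fetch, [TARGET] * 1000))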

Related

What impacts the latency of requests with proxies in a threaded environment (Python)?

I'm doing some web scraping, and once I start running more than 50 threads (one dedicated proxy per thread, with some of the proxies seemingly on the same subnet, although I'm not sure whether that's indicative of anything), the measured response time increases. I simply measure it by recording the time before the request and printing the difference between the time after the request completes and that starting time.
What might cause that?
To make sure it's not a Python performance issue or a software design flaw, I split the threads across multiple programs and ran several instances of the scraper to utilize multiprocessing.
This did not increase the response times.
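For context, the measurement described (record the time before the request, print the elapsed time once it returns) looks roughly like this; the proxies and URL are placeholders:

import threading
import time

import requests

def timed_fetch(url, proxy):
    start = time.monotonic()
    requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
    # elapsed time for this thread's request through its dedicated proxy
    print(proxy, round(time.monotonic() - start, 2), "s")

proxies = ["http://proxy1:8080", "http://proxy2:8080"]   # placeholders
threads = [threading.Thread(target=timed_fetch, args=("http://example.com", p))
           for p in proxies]
for t in threads:
    t.start()
for t in threads:
    t.join()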

How to serve a continuously running python script to multiple users (Social Media Bot)

I hope you are all having an amazing day. I am working on a project using Python. The script's job is to automate actions and tasks on a social media platform via HTTP requests. As of now, one instance of this script accesses one user account. Now, I want to create a website where users can register, enter their credentials for the social media platform, and have an instance of this script run the automation tasks for them. I've thought about creating a new process of this script every time a new user registers, but that doesn't seem efficient. I also thought about using threads, but that doesn't seem reasonable either, especially if there are 10,000 users registering. What is the best way to do this? How can I scale? Thank you guys so much in advance.
What is the nature of the tasks that you're running?
Are the tasks simply jobs that run at a scheduled time of day, or every X minutes? For this, you could have your web application register cronjobs or similar, and each cronjob can spawn an instance of your script, which I assume is short-running, to carry out the automated task one user at a time. If the exact timing of the script doesn't matter, then you could scatter the running of these scripts throughout the day, on separate machines if need be.
The above approach probably won't scale well to 10,000 users, and you will need something more robust, especially if the script is something that needs to run continuously (e.g. you are polling some data from Facebook and need to react to its changes). If it's a lot of communication per user, then you could consider a producer-consumer model, where a bunch of producer scripts (which run continuously) issue work requests into a global queue that a bunch of consumer scripts poll and carry out. You could also load-balance such consumers and producers across multiple machines.
Of course, you would definitely want to squeeze some parallelism out of the extra cores of your machines by carrying out this work in multiple threads or processes. You can do this quite easily in Python using the multiprocessing module.
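As a rough illustration of that producer-consumer idea with the multiprocessing module (the task body, queue contents, and worker count are made up for the example):

from multiprocessing import Process, Queue

def producer(queue, user_ids):
    for uid in user_ids:
        queue.put(uid)            # one work item per user
    queue.put(None)               # sentinel: no more work

def consumer(queue):
    while True:
        uid = queue.get()
        if uid is None:
            queue.put(None)       # pass the sentinel on to the other consumers
            break
        print("running automation task for user", uid)

if __name__ == "__main__":
    q = Queue()
    consumers = [Process(target=consumer, args=(q,)) for _ in range(4)]
    for c in consumers:
        c.start()
    producer(q, range(100))
    for c in consumers:
        c.join()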

Monitor python scraper programs on multiple Amazon EC2 servers with a single web interface written in Django

I have web-scraper command-line scripts written in Python that run on 4-5 Amazon EC2 instances.
What I do is place a copy of these Python scripts on each EC2 server and run them there.
So the next time I change the program, I have to do it for all of the copies.
So you can see the problem of redundancy, management and monitoring.
To reduce the redundancy and for easier management, I want to place the code on a separate server from which it can be executed on the other EC2 servers, and also monitor these Python programs and the logs they create, through a Django web interface hosted on that server.
There are at least two issues you're dealing with:
monitoring of execution of the scraping tasks
deployment of code to multiple servers
and each of them requires a different solution.
In general I would recommend using a task queue for this kind of assignment (I have tried Celery running on Amazon EC2 and was very pleased with it).
One advantage of a task queue is that it abstracts the definition of a task away from the worker that actually performs it. So you send tasks to the queue, and then a variable number of workers (servers running multiple workers) process those tasks by asking for them one at a time. Whenever a worker is idle, it connects to the queue and asks for some work. If it receives a task, it starts processing it, possibly sends the results back, asks for another task, and so on.
This means that the number of workers can change over time, and they will keep processing tasks from the queue until there are none left. A nice use case for this is Amazon's Spot Instances, which can greatly reduce the cost: just send your tasks to the queue, create X spot requests, and watch the servers process your tasks. You don't really need to care about servers going up and down at any moment because the price went above your bid. That's nice, isn't it?
Now, this implicitly takes care of monitoring, because Celery has tools for monitoring the queue and the workers, and it can even be integrated with Django using django-celery.
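As a rough sketch of what such a Celery task might look like (the broker URL, task body, and worker command below are assumptions, not the poster's setup):

# tasks.py
from celery import Celery
import urllib.request

app = Celery("scraper", broker="amqp://guest@localhost//")   # placeholder broker

@app.task
def scrape(url):
    # each worker pulls scrape() calls off the queue and runs them
    return urllib.request.urlopen(url).read()

# Enqueue work from anywhere that can reach the broker:
#     scrape.delay("http://example.com/page")
# Start a worker on each EC2 instance with:
#     celery -A tasks worker --concurrency=8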
When it comes to deploying code to multiple servers, Celery doesn't support that. The reasons behind this are of a different nature; see e.g. this discussion. One of them might be that it's just difficult to implement.
I think it's possible to live without it, but if you really care, I think there's a relatively simple DIY solution: put your code under version control (I recommend Git) and check for updates on a regular basis. If there's an update, run a script that kills your workers, pulls the updates, and starts the workers again so they can process more tasks. Given Celery's ability to handle failure, this should work just fine.
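A bare-bones sketch of that update-and-restart loop, written here in Python around git and a Celery worker process (the repository path, process names, and interval are illustrative):

import subprocess
import time

REPO = "/srv/scraper"              # placeholder checkout path

def update_available():
    subprocess.run(["git", "-C", REPO, "fetch"], check=True)
    local = subprocess.check_output(["git", "-C", REPO, "rev-parse", "HEAD"])
    remote = subprocess.check_output(["git", "-C", REPO, "rev-parse", "@{u}"])
    return local != remote

while True:
    if update_available():
        subprocess.run(["pkill", "-f", "celery"], check=False)    # stop the workers
        subprocess.run(["git", "-C", REPO, "pull"], check=True)   # update the code
        subprocess.Popen(["celery", "-A", "tasks", "worker"], cwd=REPO)
    time.sleep(300)                                               # check every 5 minutes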

Does a multithreaded crawler in Python really speed things up?

I was looking to write a little web crawler in Python. I was starting to investigate writing it as a multithreaded script, with one pool of threads downloading and one pool processing results. Due to the GIL, would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off its socket, then move on to the next thread, let it pick some data off its socket, and so on?
Basically, what I'm asking is: will a multi-threaded crawler in Python really buy me much performance over a single-threaded one?
Thanks!
The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.
On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.
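For example, a common shape is a large thread pool for the network-bound downloads and a much smaller pool for the CPU-bound processing; a sketch with placeholder URLs and pool sizes:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ["http://example.com/%d" % i for i in range(50)]   # placeholders

def download(url):
    # the GIL is released while the socket read blocks
    return urllib.request.urlopen(url, timeout=10).read()

def process(html):
    # CPU-bound work; keep this pool small to limit GIL contention
    return len(html)

with ThreadPoolExecutor(max_workers=20) as downloaders, \
     ThreadPoolExecutor(max_workers=2) as processors:
    pages = downloaders.map(download, URLS)
    sizes = list(processors.map(process, pages))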
Look at how Scrapy works. It can help you a lot. It doesn't use threads, but it can do multiple "simultaneous" downloads, all in the same thread.
If you think about it, you have only a single network card, so parallel processing can't really help by definition.
What Scrapy does is simply not wait around for the response to one request before sending another, all in a single thread.
When it comes to crawling, you might be better off using something event-based such as Twisted, which uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each request.
Asynchronous network operations can easily be, and usually are, single-threaded. Network I/O almost always has higher latency than CPU work because you really have no idea how long a page is going to take to return, and this is where async shines: an asynchronous operation is much lighter-weight than a thread.
Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.
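That example isn't reproduced here, but a rough sketch of a getPage-based fetcher (getPage is Twisted's older one-shot helper, deprecated in recent releases; the URLs below are placeholders) might look like:

from twisted.internet import reactor
from twisted.web.client import getPage

urls = [b"http://example.com/", b"http://example.org/"]
outstanding = [len(urls)]

def on_page(body, url):
    print(url, len(body), "bytes")
    finish()

def on_error(failure, url):
    print(url, failure.getErrorMessage())
    finish()

def finish():
    outstanding[0] -= 1
    if outstanding[0] == 0:
        reactor.stop()              # all requests finished in one thread

for url in urls:
    getPage(url).addCallbacks(on_page, on_error,
                              callbackArgs=(url,), errbackArgs=(url,))

reactor.run()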
Another consideration: if you're scraping a single website and the server places limits on the frequency of requests you can send from your IP address, adding multiple threads may make no difference.
Yes, multithreaded scraping increases the process speed significantly. This is not a case where the GIL is an issue: you are losing a lot of idle CPU and unused bandwidth waiting for each request to finish. If the web page you are scraping is on your local network (a rare scraping case), then the difference between multithreaded and single-threaded scraping can be smaller.
You can try the benchmark yourself, playing with one to "n" threads. I have written a simple multithreaded crawler as part of Discovering Web Resources, and I wrote a related article on Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can select how many threads to use by changing the NWORKERS class variable in FocusedWebCrawler.
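If you just want to see the effect without that code, a quick standalone benchmark along these lines works (the URL list and thread counts are placeholders, and this is not the FocusedWebCrawler mentioned above):

import time
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = ["http://example.com/"] * 30       # placeholder workload

def fetch(url):
    return urllib.request.urlopen(url, timeout=10).read()

for workers in (1, 5, 10, 20):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fetch, URLS))
    print(workers, "threads:", round(time.monotonic() - start, 1), "s")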

Python threads stack_size and segfaults

I have a web crawler script that spawns at most 500 threads; each thread basically requests certain data from a remote server, and each server's reply differs in content and size from the others.
I'm setting the stack_size to 756 KB for the threads:
threading.stack_size(756*1024)
This lets me have the sufficient number of threads required and complete most of the jobs and requests. But since some servers' responses are bigger than others, when a thread gets that kind of response the script dies with SIGSEGV.
With stack_size values above 756 KB it's impossible to have the required number of threads at the same time.
Any suggestions on how I can continue with the given stack_size without crashes?
And how can I get the currently used stack size of any given thread?
Why on earth are you spawning 500 threads? That seems like a terrible idea!
Remove threading completely, use an event loop to do the crawling. Your program will be faster, simpler, and easier to maintain.
Lots of threads waiting on the network won't make your program wait any faster. Instead, collect all open sockets in a list and run a loop where you check whether any of them has data available.
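A stripped-down sketch of that idea with the standard select module (the hosts and the hand-written HTTP request are placeholders, and there is no error handling):

import select
import socket

hosts = ["example.com", "example.org"]        # placeholders
socks = {}
for host in hosts:
    s = socket.create_connection((host, 80))
    s.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
    s.setblocking(False)
    socks[s] = b""

while socks:
    readable, _, _ = select.select(list(socks), [], [], 5.0)
    for s in readable:
        chunk = s.recv(4096)
        if chunk:
            socks[s] += chunk                 # response still arriving
        else:
            print(s.getpeername(), len(socks[s]), "bytes")
            s.close()
            del socks[s]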
I recommend using Twisted - it is an event-driven networking engine. It is very flexible, secure, scalable and very stable (no segfaults).
You could also take a look at Scrapy - it is a web crawling and screen-scraping framework written in Python/Twisted. It is still under heavy development, but maybe you can take some ideas from it.
