I want to scrape a website constantly, once every 3-5 seconds, with
requests.get('http://www.example.com', headers=headers2, timeout=35).json()
But the example website has a rate limit and I want to bypass it. How can I do so? I thought about doing it with proxies, but was hoping there are other ways.
You would have to do some very low-level work, likely using socket and urllib2.
First do your research. How are they limiting your query rate? Is it by IP, or session based (server side cookie) or local cookies? I suggest going to the site manually as your first step of research, and using a web-developer tool to view all headers communicated.
Once you figure this out, create a plan to manipulate it.
Let's say it is session based: you could use multiple threads to run several independent instances of a scraper, each with its own session.
Now, if it is IP based, then you must spoof your IP, which is much more complex.
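For the session-based case, here is a minimal sketch of "one session per thread". It uses only the standard library (urllib.request with a private cookie jar per thread) rather than requests; the function names make_session, scrape_worker and run_scrapers are made up for illustration, and whether separate sessions actually change how a given site counts requests is something you would have to verify yourself.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener with its own private cookie jar (one 'session')."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def scrape_worker(url, results, index, timeout=35):
    """Fetch url once using a session that belongs to this thread only."""
    opener, _jar = make_session()
    try:
        with opener.open(url, timeout=timeout) as resp:
            results[index] = resp.status
    except OSError as exc:  # URLError and socket timeouts derive from OSError
        results[index] = exc


def run_scrapers(url, n_threads=3):
    """Start n_threads workers, each with a separate session, and wait for them."""
    results = [None] * n_threads
    threads = [
        threading.Thread(target=scrape_worker, args=(url, results, i))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each worker gets its own cookie jar, so server-side session cookies are never shared between threads. All threads still leave from the same IP address, so this does nothing against IP-based limits.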
Just buy a lot of proxies, and configure the script to switch to the next proxy after the server's rate-limit time.
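A rough sketch of that rotation, assuming you already know the server's rate-limit window in seconds. ProxyRotator is a hypothetical helper, not part of any library, and the clock parameter exists only so the behaviour can be checked without waiting.

```python
import itertools
import time


class ProxyRotator:
    """Cycle through a proxy list, moving on once a time window has passed."""

    def __init__(self, proxies, window_seconds, clock=time.monotonic):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._cycle = itertools.cycle(proxies)
        self._window = window_seconds
        self._clock = clock
        self._current = next(self._cycle)
        self._since = clock()

    def current(self):
        """Return the proxy to use now, rotating if the window has elapsed."""
        now = self._clock()
        if now - self._since >= self._window:
            self._current = next(self._cycle)
            self._since = now
        return self._current
```

With requests, the returned value could be passed as proxies={"http": p, "https": p} on each call, where p is rotator.current().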
We have a lot of small services (Python based) which all access the same external resource (a REST API). The external resource is a paid service with a given number of tickets per day. Every request, no matter how small or big, decreases the available tickets by one. Even if the request fails due to an error in the request parameters or a timeout, the tickets are reduced. This is usually not a big problem and can be dealt with at the level of each service. The problem starts with the limit on the maximum number of parallel requests. We are allowed a certain number of parallel requests, and if we reach that limit the next requests fail and again decrease our available tickets.
So for me, a solution where each service handles this error and retries after a certain amount of time is not an option. It would be too costly in terms of tickets and also far too inefficient.
My solution would be a special internal service that all other services call. It acts as a kind of proxy or middleman: it receives the requests, puts them in a queue and processes them in such a way that we never exceed the parallel-request limit.
Before I start to code, I would like to know if there is a proper name for such a service and if there are already solutions out there, because I can imagine others have had this problem too. I also think that someone (probably not me) could create such a service completely independently of the actual external API.
Thank you very much and please stackoverflow gods be kind with me.
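The core of such a middleman can be sketched with a bounded worker pool: requests are queued and at most max_parallel of them are in flight at any moment. This is only an in-process illustration, not a complete service; ConcurrencyGate and call_api are hypothetical names, and a real deployment would still need a network-facing front end that the other services talk to.

```python
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyGate:
    """Queue outgoing calls and run at most max_parallel of them at once."""

    def __init__(self, max_parallel):
        # The executor keeps an internal queue; extra submissions wait there
        # until one of the max_parallel worker threads becomes free.
        self._pool = ThreadPoolExecutor(max_workers=max_parallel)

    def submit(self, call_api, *args, **kwargs):
        """Enqueue one call; returns a Future holding the eventual result."""
        return self._pool.submit(call_api, *args, **kwargs)

    def shutdown(self):
        """Wait for queued and running calls to finish, then stop the pool."""
        self._pool.shutdown(wait=True)
```

Callers block on future.result() instead of retrying, so no ticket is spent on a request that would have been rejected for exceeding the parallel limit.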
I scrape a lot but so far I'm using a VPN for my scrapes. I would like to start using proxies but the problem I'm running into, especially with free proxies, is that free proxies are highly unreliable.
How do I tell whether there is an issue with the webpage rather than an issue with the proxy? There are timeouts, connection errors and similar exceptions, but those happen both when a proxy is bad and when the webpage has a problem.
So in other words, how do I know whether I need to rotate a dead proxy compared to when there is a problem with the URL I want to scrape and I should stop trying and skip it?
It's hard to tell the difference between a website that's down and a proxy that's not functional, because you might get the same error in both cases.
My recommendation is to create a proxy checker: a simple tool that iterates over your proxy list, connects through each proxy and accesses a website that you control (think of a simple Express web server with a single endpoint). The proxy checker runs every 30 seconds.
By doing it this way, you are guaranteed that the website is never down (you will not block yourself), so if you get an error, it is definitely a proxy error.
Once you get an error, you remove the proxy from the list (and add it back later when it comes back online).
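One pass of such a checker could look roughly like this in Python, using only the standard library. proxy_works and sweep are illustrative names, and check_url stands for the endpoint you control; nothing here is taken from a specific tool.

```python
import http.client
import urllib.request


def proxy_works(proxy_url, check_url, timeout=10):
    """Return True if check_url answers with HTTP 200 through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(check_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers URLError/HTTPError, timeouts, refused connections
        # and malformed responses coming back from a broken proxy.
        return False


def sweep(proxies, check_url, is_alive=proxy_works):
    """Split proxies into (alive, dead) lists after testing each one once."""
    alive, dead = [], []
    for proxy in proxies:
        (alive if is_alive(proxy, check_url) else dead).append(proxy)
    return alive, dead
```

Running sweep in a loop with time.sleep(30) between passes, and re-testing the dead list on each pass, gives the "remove now, add back later" behaviour described above.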
Basically I'm running a Flask web server that crunches a bunch of data and sends it back to the user. We aren't expecting many users (~60), but I've noticed what could be an issue with concurrency. Right now, if I open a tab and send a request to have some data crunched, it takes about 30s; for our application that's OK.
If I open another tab and send the same request at the same time, Gunicorn will handle it concurrently. This is great if we have two separate users making two separate requests. But what happens if one user opens 4 or 8 tabs and sends the same request? It backs up the server for everyone else. Is there a way I can tell Gunicorn to accept only 1 request at a time from the same IP?
A better solution than the answer by @jon would be to limit access at your web server instead of the application server. It is good practice to keep the responsibilities of the different layers of your application separate. Ideally the application server, Flask, should not hold any configuration for limiting, or anything to do with where requests come from. The responsibility of the web server, in this case nginx, is to route each request to the right destination based on certain parameters. The limiting should be done at this layer.
Now, coming to the limiting, you could do it by using the limit_req_zone directive in the http block of the nginx config:
http {
    limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
    ...
    server {
        ...
        location / {
            limit_req zone=one burst=5;
            proxy_pass ...
        }
    }
}
where $binary_remote_addr is the IP address of the client, and on average no more than 1 request per second is allowed, with bursts not exceeding 5 requests.
Pro-tip: Since subsequent requests from the same IP are held in a queue, there is a good chance of nginx timing out. It is therefore advisable to set a larger proxy_read_timeout and, if the reports take longer, to adjust the Gunicorn timeout as well.
Documentation of limit_req_zone
A blog post by nginx on rate limiting can be found here
This is probably NOT best handled at the Flask level. But if you had to do it there, it turns out someone else has already written a Flask plugin to do just this:
https://flask-limiter.readthedocs.io/en/stable/
If a request takes at least 30s, then set your per-address limit to one request every 30s. This will solve the issue of impatient users obsessively clicking instead of waiting for a very long process to finish.
This isn't exactly what you requested, since longer or shorter requests may still overlap and allow multiple requests at the same time, so it doesn't fully exclude the multiple-tabs behavior you describe. That said, if you are able to tell your users to wait 30 seconds for anything, it sounds like you are in the driver's seat for setting UX expectations. A good wait/progress message will probably help too, if you can build an asynchronous server interaction.
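To make the "one request per address every 30s" rule concrete, here is a plain-Python sketch of the bookkeeping. It is not Flask-Limiter's API; PerAddressLimiter is a hypothetical class, it keeps state in memory only, and with several Gunicorn worker processes each worker would hold its own separate copy.

```python
import threading
import time


class PerAddressLimiter:
    """Accept at most one request per address every min_interval_seconds."""

    def __init__(self, min_interval_seconds=30.0, clock=time.monotonic):
        self._interval = min_interval_seconds
        self._clock = clock
        self._last_accepted = {}  # address -> time of last accepted request
        self._lock = threading.Lock()

    def allow(self, address):
        """Return True and record the time if the address may proceed now."""
        with self._lock:
            now = self._clock()
            last = self._last_accepted.get(address)
            if last is not None and now - last < self._interval:
                return False  # still inside the window; rejections are not recorded
            self._last_accepted[address] = now
            return True
```

In a Flask view the address would typically come from request.remote_addr, and a rejected call would be answered with HTTP 429.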
I was trying to use Python requests and mechanize to gather information from a website. The process requires me to post some information and then get the results back from the website. I automate this with a for loop in Python. However, after ~500 queries, I was told that I am blocked due to a high query rate. Each query takes about 1 second. I have used some online software that queries multiple records without problems. Could anyone help me avoid this issue? Thanks!
No idea how to solve this.
--- I am looping this process (by auto changing case number) and export data to csv....
After some queries, I was told that my IP was blocked.
Optimum randomized delay time between requests.
Randomized real user-agents for each request.
Enabling cookies.
Using a working proxy pool and selecting a random proxy for each request.
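A small sketch of how the randomized parts of that list could be picked per request. Everything concrete here is an assumption for illustration: the user-agent strings and proxy URLs are placeholders, not real values, the 1-3 second delay range is arbitrary, and next_request_plan is a made-up helper.

```python
import random

# Placeholder values only; substitute real user-agent strings and working proxies.
USER_AGENTS = [
    "ExampleBrowser/1.0 (placeholder A)",
    "ExampleBrowser/2.0 (placeholder B)",
    "ExampleBrowser/3.0 (placeholder C)",
]
PROXIES = [
    "http://proxy1.example:8080",
    "http://proxy2.example:8080",
]


def next_request_plan(min_delay=1.0, max_delay=3.0, rng=random):
    """Pick a random delay, user-agent and proxy for the next request."""
    return {
        "delay": rng.uniform(min_delay, max_delay),
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        "proxy": rng.choice(PROXIES),
    }
```

The loop would then sleep for plan["delay"] and send the request with plan["headers"] through plan["proxy"]. For the cookies item, reusing one requests.Session (or one cookie jar) across calls keeps cookies between requests.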
I have a rather unusual task, so I would like to ask the experts for a piece of advice :)
I need to build a small Flask-based website with a built-in video player. Users will have to log in to access videos. The problem is that I need to limit the amount of time each user can spend using the service.
Could someone please recommend a possible way to make it work, or help me find a place to get started?
What I am thinking of: what if I create a variable in the user's profile like "credits_minutes", and find a way to decrease credits_minutes by one every minute?
Sessions are based on requests. From my understanding, what you are trying to do is measure the amount of time actually spent on the site? For that you'll need some kind of keep-alive from the client,
such as web sockets, repeated JavaScript calls or something else that tells you they are still on the site, and base your logic on that.
A simple solution would be to write something with jQuery that polls an endpoint of your choice, where you can do something time based on each poll, such as saving the oldest call and comparing it to each new one that arrives. When X minutes have elapsed, redirect the user.
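The server side of that polling idea, "save the oldest call and compare it to each new one", can be sketched like this. WatchTimeTracker is a hypothetical in-memory helper; a real application would persist the timestamps (for example on the user's profile) and call poll from the endpoint the JavaScript hits.

```python
import time


class WatchTimeTracker:
    """Remember each user's first poll and report when the limit is reached."""

    def __init__(self, limit_minutes, clock=time.time):
        self._limit_seconds = limit_minutes * 60
        self._clock = clock
        self._first_poll = {}  # user_id -> timestamp of the oldest poll

    def poll(self, user_id):
        """Record a keep-alive; return True once the user should be redirected."""
        now = self._clock()
        first = self._first_poll.setdefault(user_id, now)
        return now - first >= self._limit_seconds
```

Note that this measures time elapsed since the first poll, not time actually spent watching. To charge only for active minutes, as in the credits_minutes idea, you would instead subtract from a counter on each poll that arrives on schedule.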
From the Flask-Session documentation: https://pythonhosted.org/Flask-Session/
PERMANENT_SESSION_LIFETIME: the lifetime of a permanent session as datetime.timedelta object. Starting with Flask 0.8 this can also be an integer representing seconds.