The website I was scraping blocked me by showing 406 Not Acceptable in the browser. I may have mistakenly sent too many requests at once from my Python code.
So I put time.sleep(10) in each loop so it doesn't look like a DDoS attack, and that seems to have worked.
My questions are:
How long is a reasonable wait between requests? Sleeping 10 seconds per loop makes my code run too slowly.
How do I fix the 406 Not Acceptable error in my browsers? They still block me unless I change my IP address, but that isn't a permanent solution.
Thank you all for your answers and comments. Good day!
Rate-limit errors depend entirely on which website you choose to scrape or interact with. I could set up a website that only allows you to view it once per day before throwing HTTP errors at your screen. So to answer your first question, there is no definitive answer. You must test for yourself and see what the fastest speed is that you can go without getting blocked.
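For example, a minimal sketch with the requests library (the URLs and starting delay are just placeholders): start with a short delay and only back off when the server starts rejecting you.

    import time
    import requests

    pages = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs
    delay = 1  # seconds; start low and only increase when the server pushes back

    for url in pages:
        while True:
            response = requests.get(url)
            if response.status_code in (406, 429):
                delay *= 2              # blocked: double the wait and retry
                time.sleep(delay)
                continue
            break                       # success: keep the current delay
        time.sleep(delay)               # steady pacing between pages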
However, there is a workaround. If you use proxies, then it's almost impossible to detect and stop the requests from executing, and therefore you will not be hit by any HTTP errors. HOWEVER, JUST BECAUSE YOU CAN, DOESN'T MEAN THAT YOU SHOULD. I am a programmer, not a lawyer, and I'm sure there's a rule somewhere that says spamming a page, even after it tells you to stop, is illegal.
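If you do go the proxy route, routing a request through one with the requests library looks roughly like this (the proxy address and URL are placeholders):

    import requests

    # placeholder proxy address; in practice you'd rotate through a list of these
    proxies = {
        "http": "http://203.0.113.10:8080",
        "https": "http://203.0.113.10:8080",
    }

    response = requests.get("https://example.com/page", proxies=proxies, timeout=10)
    print(response.status_code)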
Your second question isn't exactly related to programming, but I will answer it anyway: try clearing your cookies or refreshing your IP (use a VPN or similar). Other than your IP or cookies, there aren't many more ways a page can fingerprint you in order to block you.
I scrape a lot, but so far I've been using a VPN for my scrapes. I would like to start using proxies, but the problem I'm running into is that free proxies are highly unreliable.
How do I tell whether the issue is with the webpage or with the proxy? There are timeout, connection-error, and similar exceptions, but those happen both when a proxy is bad and when the webpage has a problem.
In other words, how do I know when I need to rotate out a dead proxy versus when there is a problem with the URL I want to scrape and I should stop trying and skip it?
It's hard to tell the difference between a website that's down and a proxy that isn't functional, because you might get the same HTTP error either way.
My recommendation is to create a proxy checker: a simple tool that iterates over your proxy list, connects through each one, and accesses a website that you control (think of a simple Express web server with a single endpoint). Run the proxy checker every 30 seconds.
This way you're guaranteed the website is never down (you won't block yourself), so if you get an error, it's definitely a proxy error.
Once you get an error, remove the proxy from the list (and add it back later when it comes back online).
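A rough sketch of such a checker in Python (the check URL and proxy addresses are placeholders for the endpoint and list you actually use):

    import time
    import requests

    CHECK_URL = "https://my-own-server.example.com/ping"  # an endpoint you control
    proxy_list = ["http://203.0.113.10:8080", "http://203.0.113.11:3128"]  # placeholders

    def check_proxies(proxies):
        alive = []
        for proxy in proxies:
            try:
                r = requests.get(CHECK_URL,
                                 proxies={"http": proxy, "https": proxy},
                                 timeout=5)
                if r.status_code == 200:
                    alive.append(proxy)        # proxy reached your own server fine
            except requests.RequestException:
                pass                           # dead proxy: drop it, re-add if it recovers
        return alive

    while True:
        working = check_proxies(proxy_list)
        print("working proxies:", working)
        time.sleep(30)                         # run the checker every 30 seconds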
I'm reaching out today because I have a problem with Selenium.
My goal is to make a fully automated bot that creates an account with parsed details (mail, pass, birth date...). So far I've managed to almost finish the bot (I just need to access Gmail and get the confirmation code).
My problem is that, despite trying a lot of things, I keep getting: Failed to load resource: the server responded with a status of 429 ()
So I guess Instagram is blocking me.
How could I bypass this?
The answer is in the description of the HTTP error code. You are being blocked because you made too many requests in a short time.
Reduce the rate at which your bot makes requests and see if that helps. As far as I know there's no way to "bypass" this check by the server.
Check if the response header has a Retry-After value to tell you when you can try again.
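A minimal sketch of that check with the requests library (assuming the header carries a number of seconds rather than a date):

    import time
    import requests

    response = requests.get("https://www.instagram.com/")   # placeholder request
    if response.status_code == 429:
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            time.sleep(int(retry_after))   # wait as long as the server asks
        else:
            time.sleep(60)                 # no header: fall back to a fixed pause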
A status code of 429 means that you've hit Instagram's server too many times, and that is why Instagram has blocked your IP.
This is done mainly to prevent DDoS attacks.
The best thing would be to try again after some time (there might be a Retry-After header in the response).
Also, increase the time interval between requests and cap the number of requests made within a given window (say, 1 hour).
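As a rough illustration of that kind of cap (the limit and window size are arbitrary):

    import time

    MAX_REQUESTS = 100          # illustrative cap per window
    WINDOW = 60 * 60            # one hour, as in the example above
    sent = 0
    window_start = time.time()

    def throttle():
        global sent, window_start
        if time.time() - window_start >= WINDOW:
            sent = 0                        # new window: reset the counter
            window_start = time.time()
        if sent >= MAX_REQUESTS:
            time.sleep(WINDOW - (time.time() - window_start))  # wait out the window
            sent = 0
            window_start = time.time()
        sent += 1

    # call throttle() before each request the bot makes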
The Retry-After header is the best practice. However, there's no such response header in this scenario.
My understanding of web programming isn't the best, so I might have some misconceptions here about how it works in general, but hopefully what I'm trying to do is possible.
Recently my friend and I have been challenging each other to break web systems we've set up, and in order to break his next one I need to use the requests module, while doing part of it by myself. I'm perfectly happy with the requests module, but after a while, I want to manually take over that session with the server in my browser. I've tried webbrowser.open, but this effectively loads the page again as if I've never connected before, due to not having any of the cookies from the other session. Is this possible, or do I have a misunderstanding of the situation? Thanks in advance for any help.
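Here's roughly what I mean (the URLs and form fields are made up): the cookies end up inside the requests.Session object, and webbrowser.open starts a completely separate browser session that never sees them.

    import webbrowser
    import requests

    session = requests.Session()
    session.post("https://example.com/login", data={"user": "me", "pass": "secret"})  # hypothetical login
    print(session.cookies.get_dict())   # these cookies live only inside this Python process

    # opens a fresh browser tab with none of the cookies above,
    # so the site treats it as a brand-new visitor
    webbrowser.open("https://example.com/account")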
Because of the huge hassle of finding a good scraping solution for Python, I'm using Dryscrape. I can't seem to get it to consistently work through a proxy, however. Some sites cause it to throw the following:
InvalidResponseError: Error while loading URL
https://apis.google.com/js/plusone.js: Operation on socket is not
supported (error code 99)
I guess it's some kind of proxy protection thingy, but I'm not breaking any TOS or anything. Only some sites do this, but the whole project kind of relies on looking something up on that site daily. Does anyone have a solution?
It's really hard to tell without any code or knowing what you are trying to accomplish. But if you are trying to scrape a lot of pages at once, try throttling back the number of concurrent connections to your proxy. Does it occur on the same page(s) each attempt?
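To illustrate the idea with plain requests rather than Dryscrape (the proxy address and URLs are placeholders), the number of workers is the knob to turn down:

    import concurrent.futures
    import requests

    proxies = {"http": "http://203.0.113.10:8080",
               "https": "http://203.0.113.10:8080"}                  # placeholder proxy
    urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder pages

    def fetch(url):
        return requests.get(url, proxies=proxies, timeout=10).status_code

    # max_workers limits concurrent connections: lower it until the proxy stops erroring out
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        for url, status in zip(urls, pool.map(fetch, urls)):
            print(url, status)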
I'm using the Python GData library to work with Google Calendar. I'm making really simple requests (e.g. creating an event) using OAuth authorization.
Usually this works fine, but sometimes I receive lots of 302 redirects, which leads to a "Maximum redirects count reached" exception.
If I retry the same request, it usually works correctly.
I can't figure out why this is happening; it looks like a random event.
As a workaround I wrote code that retries a request a few times when it hits this error, but maybe there is an explanation for this behavior, or even a way to avoid it?
Answer from Google support forum:
This might happen due to some issues in the Calendar servers and is not an error on your part. The best way to "resolve" this issue is to retry again.
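In that spirit, a generic retry wrapper is usually enough (the wrapped call and the exception handling are placeholders for whatever GData call fails for you):

    import time

    def call_with_retries(make_request, attempts=3, pause=2):
        # make_request is whatever GData call fails intermittently (placeholder)
        for attempt in range(attempts):
            try:
                return make_request()
            except Exception:              # ideally catch the specific redirect error
                if attempt == attempts - 1:
                    raise
                time.sleep(pause)          # brief pause before retrying

    # hypothetical usage:
    # call_with_retries(lambda: calendar_service.InsertEvent(event))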