Because of the huge hassle of finding a good scraping solution for Python, I'm using Dryscrape. I can't seem to get it to work consistently through a proxy, however. Some sites cause it to throw the following:
InvalidResponseError: Error while loading URL
https://apis.google.com/js/plusone.js: Operation on socket is not
supported (error code 99)
I guess it's some kind of proxy protection, but I'm not breaking any TOS or anything. Only some sites do this, but the whole project more or less relies on looking something up on the site daily. Does anyone have a solution?
It's really hard to tell without any code or knowing what you are trying to accomplish. But if you are trying to scrape a lot of pages at once, try throttling back the number of concurrent connections to your proxy. Does it occur on the same page(s) on each attempt?
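If it helps, here is a rough sketch of what that throttling could look like, using a plain threading.Semaphore. The proxy address, URL list, and concurrency limit are placeholders, and requests stands in for whatever client (Dryscrape or otherwise) you actually use:

```python
# Sketch: cap the number of concurrent connections going through one proxy.
# PROXY and the URL list are placeholders; lower MAX_CONCURRENT if the proxy
# keeps refusing connections.
import threading
import requests

PROXY = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}
MAX_CONCURRENT = 2
slots = threading.Semaphore(MAX_CONCURRENT)

def fetch(url: str) -> None:
    with slots:  # at most MAX_CONCURRENT requests hit the proxy at once
        resp = requests.get(url, proxies=PROXY, timeout=15)
        print(url, resp.status_code)

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```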
Related
Background information: this question is about using Python with Selenium. I am opening websites with the Selenium module in Python. Below is what I want the script to do. Is it possible? Please read.
I am looking for a way, in Python, to read the requests made to the server and the responses received from it (if you have ever used the BurpSuite Interceptor, it does the same thing).
Suppose I open the browser and type in "mywebsite.com". I should be able to read the request the browser made to the server and then the response the server sent back to the browser.
I am thinking of doing something like creating a localhost proxy server with Python, so that all requests first pass through the local proxy and then go out to the server (and the same for the responses received).
There could be another, better way of doing this that I do not know about. Please recommend it if you know of one.
I hope I am clear.
While researching this online, I came across https://github.com/manmolecular/py-request-interceptor, but I am not sure how it can help. I haven't found anything else so far.
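For what it's worth, the localhost-proxy idea described above is roughly what mitmproxy's scripting API provides. A minimal sketch, assuming mitmproxy is installed and using its default listening port:

```python
# interceptor.py: a minimal mitmproxy addon sketch.
# Run with: mitmdump -s interceptor.py (listens on 127.0.0.1:8080 by default),
# then point the browser (or Selenium, e.g. Chrome started with
# --proxy-server=127.0.0.1:8080) at that proxy.
from mitmproxy import http

class Interceptor:
    def request(self, flow: http.HTTPFlow) -> None:
        # Called for every request the browser sends through the proxy.
        print("REQUEST:", flow.request.method, flow.request.pretty_url)

    def response(self, flow: http.HTTPFlow) -> None:
        # Called once the server's response has been received.
        print("RESPONSE:", flow.response.status_code, flow.request.pretty_url)

addons = [Interceptor()]
```

If staying entirely inside Selenium is preferable, the selenium-wire package exposes the captured requests directly on the driver object, though I haven't compared the two approaches for this use case.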
I scrape a lot, but so far I've been using a VPN for my scrapes. I would like to start using proxies, but the problem I'm running into is that free proxies in particular are highly unreliable.
How do I tell whether there is an issue with the webpage rather than an issue with the proxy? There are timeouts, connection errors, and other exceptions, but those happen both when a proxy is bad and when the webpage has a problem.
In other words, how do I know when I need to rotate out a dead proxy versus when there is a problem with the URL I want to scrape and I should stop trying and skip it?
It's hard to tell the difference between a website that's down and a proxy that's not working, because you might get the same HTTP error in both cases.
My recommendation is to create a proxy checker: a simple tool that iterates over your proxy list, connects through each one, and accesses a website that you control (think of a simple Express web server with a single endpoint). The proxy checker runs every 30 seconds.
This way you have the guarantee that the website is never down (and you will not block yourself), so if you get an error, it's definitely a proxy error.
Once you get an error, remove the proxy from the list (and add it back later when it comes back online).
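A rough Python version of that checker might look like the following; the health-check URL and proxy list are placeholders, and requests stands in for whatever HTTP client you prefer:

```python
# Proxy-checker sketch: every 30 seconds, try each proxy against an endpoint
# you control. Any failure here is a proxy problem, not a target-site problem.
import time
import requests

HEALTHCHECK_URL = "https://my-healthcheck.example.com/ping"   # placeholder endpoint you control
proxies = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]      # placeholder proxy list

def check_proxy(proxy: str, timeout: float = 5.0) -> bool:
    """Return True if the proxy can reach the endpoint we control."""
    try:
        resp = requests.get(
            HEALTHCHECK_URL,
            proxies={"http": proxy, "https": proxy},
            timeout=timeout,
        )
        return resp.ok
    except requests.RequestException:
        return False

while True:
    alive = [p for p in proxies if check_proxy(p)]
    print(f"{len(alive)}/{len(proxies)} proxies healthy: {alive}")
    time.sleep(30)  # re-check every 30 seconds, as suggested above
```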
The website I scraped blocked me by showing 406 Not Acceptable in the browser. I might have mistakenly sent too many requests at once from my Python code.
So I put time.sleep(10) in each loop so it doesn't look like a DDoS attack, and that seems to have worked.
My questions are:
How long is a reasonable delay between requests? Sleeping 10 seconds in each loop makes my code run too slowly.
How do I fix the 406 Not Acceptable error in my browsers? They still block me unless I change my IP address, and that's not a permanent solution.
Thank you all for your answers and comments. Good day!
Rate-limit errors depend entirely on which website you choose to scrape or interact with. I could set up a website that only allows you to view it once per day before throwing HTTP errors at you. So, to answer your first question: there is no definitive answer. You must test for yourself and find the fastest rate you can go without getting blocked.
However, there is a workaround. If you use proxies, it becomes much harder for the site to detect and block your requests, so you are far less likely to be hit by these HTTP errors. HOWEVER, JUST BECAUSE YOU CAN, DOESN'T MEAN THAT YOU SHOULD. I am a programmer, not a lawyer, and I'm sure there's a rule somewhere that says that spamming a page, even after it tells you to stop, is illegal.
Your second question isn't exactly related to programming, but I will answer it anyway: try clearing your cookies or refreshing your IP (for example with a VPN). Other than your IP and cookies, there aren't many other ways a page can fingerprint you in order to block you.
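As a concrete illustration of "test and see how fast you can go", here is a small sketch using randomized delays and a crude backoff; the URLs, delay bounds, and backoff time are placeholders, not values the answer above prescribes:

```python
# Sketch: jittered delays between requests, plus a simple backoff when the
# server pushes back with 406/429. Tune the numbers per target site.
import random
import time
import requests

urls = ["https://example.com/page/1", "https://example.com/page/2"]  # placeholders

for url in urls:
    resp = requests.get(url, timeout=10)
    if resp.status_code in (406, 429):
        # Blocked or rate limited: back off for a while before continuing.
        time.sleep(60)
        continue
    print(url, resp.status_code)
    time.sleep(random.uniform(2, 5))  # random jitter instead of a fixed 10 s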
My understanding of web programming isn't the best, so I might have some misconceptions here about how it works in general, but hopefully what I'm trying to do is possible.
Recently my friend and I have been challenging each other to break web systems we've set up, and in order to break his next one I need to use the requests module, while doing part of it by myself. I'm perfectly happy with the requests module, but after a while, I want to manually take over that session with the server in my browser. I've tried webbrowser.open, but this effectively loads the page again as if I've never connected before, due to not having any of the cookies from the other session. Is this possible, or do I have a misunderstanding of the situation? Thanks in advance for any help.
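One approach worth noting (not necessarily the only one): webbrowser.open cannot carry cookies over, but a browser driven by Selenium can have the cookies from a requests.Session copied into it. A minimal sketch, with placeholder URLs and a hypothetical login endpoint:

```python
# Sketch: start a session with requests, then hand its cookies to a Selenium
# browser so you can take over manually. BASE_URL and the login call are
# placeholders for whatever your friend's system actually expects.
import requests
from selenium import webdriver

BASE_URL = "https://example.com"  # placeholder target

session = requests.Session()
session.post(f"{BASE_URL}/login", data={"user": "me", "pass": "secret"})  # hypothetical login

driver = webdriver.Firefox()
driver.get(BASE_URL)  # the browser must be on the domain before cookies can be added for it

for cookie in session.cookies:
    driver.add_cookie({
        "name": cookie.name,
        "value": cookie.value,
        "domain": cookie.domain,
        "path": cookie.path,
    })

driver.get(BASE_URL)  # reload: the browser now carries the session's cookies
```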
I'm using the Python GData library to work with Google Calendar. I'm making really simple requests (e.g. creating an event), using OAuth authorization.
Usually this works fine, but sometimes I receive lots of 302 redirects, which leads to a "Maximum redirects count reached" exception.
If I retry the same request, it usually works correctly.
I can't figure out why this is happening; it looks like a random event.
As a workaround I wrote code that retries the request a few times when this error occurs, but maybe there is an explanation for this behavior, or even a way to avoid it?
Answer from Google support forum:
This might happen due to issues on the Calendar servers and is not an error on your part. The best way to "resolve" this issue is simply to retry.
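Since "retry" is the suggested fix, a generic retry helper with exponential backoff is one way to wrap the failing call; the gdata call in the usage comment is a placeholder for whatever request is failing intermittently:

```python
# Generic retry sketch with exponential backoff; attempts and base_delay are
# arbitrary defaults, not values recommended by Google.
import time

def retry(func, attempts=5, base_delay=1.0):
    """Call func(), retrying with exponential backoff on any exception."""
    for i in range(attempts):
        try:
            return func()
        except Exception:
            if i == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** i))

# Usage (placeholder call): retry(lambda: calendar_client.InsertEvent(event))
```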