I fetch information from Google. I know that I will be blocked after a few requests, which is why I tried to go through proxies. For the proxies I use ProxyBroker from this link:
The Link
However, when I use the proxies, Google returns a 503. If I click on the error link, Google shows me my own IP and not the proxy IP.
Here is what I've tried:
import contextlib
import urllib.request as urlrequest

usedProxy = self.getProxy()
if usedProxy is not None:
    proxies = {"http": "http://%s" % usedProxy[0]}
    headers = {'User-agent': 'Mozilla/5.0'}
    proxy_support = urlrequest.ProxyHandler(proxies)
    opener = urlrequest.build_opener(proxy_support, urlrequest.HTTPHandler(debuglevel=1))
    urlrequest.install_opener(opener)
    req = urlrequest.Request(search_url, None, headers)
    with contextlib.closing(urlrequest.urlopen(req)) as url:
        htmltext = url.read()
I tried with both http and https proxies.
Even when the request goes through, I get a 503 with the following message:
send: b'GET http://www.google.co.in/search?q=Test/ HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.google.co.in\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Date header: Server header: Location header: Pragma header: Expires header: Cache-Control header: Content-Type header: Content-Length header: X-XSS-Protection header: X-Frame-Options header: Connection
send: b'GET http://ipv4.google.com/sorry/index?continue=http://www.google.co.in/search%3Fq%3DTest/&q=EgTCDs9XGMbOgNAFIhkA8aeDS0dE8uXKu31DEbfj5mCVdhpUO598MgFy HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: ipv4.google.com\r\nUser-Agent: Mozilla/5.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 503 Service Unavailable\r\n'
If the above error doesn't happen, I eventually get the following error:
[Errno 54] Connection reset by peer
My questions are:
Is the IP shown in the error link always my own IP and never the proxy IP?
Google Error Link
And if it is always the host IP that is shown in Google's error message, and the problem comes from the proxies, how can I bypass the error?
It seems that Google knows I am going through a proxy, because Google uses HTTPS and the HTTPS proxies don't seem to work. So the HTTP proxies get detected, which is why I am blocked directly after 50-60 queries.
My Solution:
I tried all the solutions found on Stack Overflow, such as sleeping for 10 seconds between requests, but they did not work well. Then I found an article describing the same problem, and the solution turned out to be quite easy. First I installed the fake-useragent library for Python, which provides a large list of useful user agents.
I select a user agent randomly from this list for each request. I also restrict the selection to common user agents, because otherwise the page has a different HTML layout that does not fit my read method.
After installing fake-useragent and selecting a user agent at random, I add a sleep of between 15 and 90 seconds, because the article's author tried different timespans and was still blocked at 30 seconds. With these two simple changes my program has been running successfully for 10 hours without trouble. A rough sketch of the approach is below.
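A minimal sketch of what this looks like, assuming fake-useragent is installed and search_url is the query URL being fetched (the fetch helper below is illustrative, not my original code):

import random
import time
import urllib.request as urlrequest

from fake_useragent import UserAgent

ua = UserAgent()  # fake-useragent maintains a list of real-world user agents

def fetch(search_url):
    # Pick a random common user agent for every single request
    headers = {'User-agent': ua.random}
    req = urlrequest.Request(search_url, None, headers)
    with urlrequest.urlopen(req) as resp:
        html = resp.read()
    # Wait a random 15-90 seconds before the next request
    time.sleep(random.uniform(15, 90))
    return html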
I hope this helps you as well, because it cost me quite a bit of time to figure out when Google blocks you. Google still detects the scraping every time, but with this configuration it lets you through.
Have fun, and I wish you all successful crawling!
EDIT:
The program gets through roughly 1000 requests before it gets banned.
Related
Scraping had been working without any problem, but access is suddenly denied.
The error code is as below.
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.investing.com/equities/alibaba
Most of the solutions found on Google suggest setting a User-Agent in the headers to avoid robot detection. This was already applied to the code and the scraping worked well, but it started being rejected one day.
So I tried fake-useragent and random user agents to send a different user-agent with each request, but it was always rejected.
Secondly, to check whether the block was based on IP information, I tried using a VPN to switch my IP, but I was still denied.
The last thing I found was regarding Referer Control. So I installed the Referer Control extension, but I can't find any information on how to use it.
My code is below.
url = "https://www.investing.com/equities/alibaba"
ua = generate_user_agent()
print(ua)
headers = {"User-agent":ua}
res = requests.get(url,headers=headers)
print("Response Code :", res.status_code)
res.raise_for_status()
print("Start")
Any help will be appreciated.
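As a point of reference, what the Referer Control browser extension does can also be approximated directly in requests by sending a Referer header yourself. A hypothetical sketch; the Referer and Accept-Language values below are illustrative assumptions, not taken from the question:

import requests
from user_agent import generate_user_agent

url = "https://www.investing.com/equities/alibaba"
headers = {
    "User-agent": generate_user_agent(),
    # Hypothetical Referer; some sites expect navigation from one of their own pages
    "Referer": "https://www.investing.com/",
    "Accept-Language": "en-US,en;q=0.9",
}
res = requests.get(url, headers=headers)
print("Response Code :", res.status_code)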
I am trying to read an image URL from the internet and download the image onto my machine via Python. I followed the example in this blog post https://www.geeksforgeeks.org/how-to-open-an-image-from-the-url-in-pil/, which uses https://media.geeksforgeeks.org/wp-content/uploads/20210318103632/gfg-300x300.png. However, when I try my own example it just doesn't seem to work; I've tried the HTTP version and it still gives me the 403 error. Does anyone know what the cause could be?
import urllib.request

urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")
Output:
urllib.error.HTTPError: HTTP Error 403: Forbidden
The server at prntscr.com is actively rejecting your request. There are many possible reasons for that. Some sites check the user agent of the caller, so that is worth ruling out. In my case, I used httpie to test whether the site would let me download through a non-browser app. It worked. So I simply made up a User-Agent header to see if the problem was just the missing user-agent.
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyApp/1.0')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve(
    "http://image.prntscr.com/image/ynfpUXgaRmGPwj5YdZJmaw.png",
    "gfg.png")
It worked! Now I don't know what logic the server uses. For instance, I tried a standard Mozilla/5.0 and that did not work. You won't always encounter this issue (most sites are pretty lax in what they allow as long as you are reasonable), but when you do, try playing with the User-Agent. If nothing works, try using the same User-Agent string as your browser.
I had the same problem and it was due to an expired URL. I checked the response text and I was getting "URL signature expired" which is a message you wouldn't normally see unless you checked the response text.
This means some URLs just expire, usually for security purposes. Try to get the URL again and update the URL in your script. If there isn't a new URL for the content you're trying to scrape, then unfortunately you can't scrape for it.
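A quick way to spot such a message is to print the body of the failing response instead of only checking the status code. A small sketch with requests; the URL is a placeholder:

import requests

# Placeholder URL; substitute the signed/expiring URL being fetched
res = requests.get("https://example.com/some-signed-image-url")
if res.status_code != 200:
    # The body often explains the rejection, e.g. "URL signature expired"
    print(res.status_code, res.text)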
While working with Python Requests, I ran into ConnectionRefusedError [WinError 10061], probably because of network settings and limitations in my network, or because my company's network software won't allow it (I think).
But I am interested in what happens when I call requests.get(). Maybe I'm not good at reading the documentation, but I could not find a description of the processes that happen after the call.
For example, why does accessing the URL in the browser work fine, while accessing it with requests fails?
What I'm asking is what happens after calling the get() method: does it start a server at localhost? Configure it? Form the headers? How does it send the request?
Generally, most companies use a proxy server for outgoing requests. Once it is set in the connection settings, the browser reads it and applies it to each request. You can check whether a proxy is enabled by looking at the settings in your browser.
However, when you're making a Python request, you need to set the proxy on the request yourself, like this:
proxyDict = {
    "http": "192.168.100.3:8080",
    "https": "Some/Same proxy for https",
    "ftp": "Some proxy for ftp (Optional)"
}
r = requests.get(url, headers=headers, proxies=proxyDict)
Also, browsers set content types, request headers, and other such parameters. You can open the browser's developer console (for example in Google Chrome), go to the Network tab, see which parameters are sent with the request, and apply the same parameters in your requests.get(). In the case of headers, it would be:
r = requests.get(url, proxies=proxyDict, headers={'Content-type': 'application/json'})
I have an application deployed on a private server.
ip = "100.10.1.1"
I want to read the source code / web content of that page.
When I visit the page in a browser, it shows me "Your connection is not private".
After proceeding with the unsafe connection, it takes me to the page.
Below is the code I am using to read the HTML content, but it is not giving me the correct HTML content even though the response shows 200 OK.
I tried ignoring the SSL certificate with the code below, but that did not help.
import requests

url = "http://100.10.1.1"
requests.packages.urllib3.disable_warnings()
response = requests.get(url, verify=False)
print(response.text)
Can someone share some ideas on how to proceed, or am I doing something wrong here?
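One thing that may be worth double-checking (a guess, not a confirmed fix): the browser warning "Your connection is not private" usually means the page is served over HTTPS with a certificate the browser cannot verify, so the request may need to target https:// rather than http://. A sketch under that assumption:

import requests

requests.packages.urllib3.disable_warnings()

# Assumption: the application is actually served over HTTPS with a self-signed certificate
url = "https://100.10.1.1"
response = requests.get(url, verify=False)
print(response.status_code)
print(response.text)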
I want to know how to get the direct link to an embedded video (the link to the .flv/.mp4 or whatever file) from just the embed link.
For example, http://www.kumby.com/ano-hana-episode-1/ has
<embed src="http://www.4shared.com/embed/571660264/396a46be"></embed>
though the link to the video seems to be
"http://dc436.4shared.com/img/571660264/396a46be/dlink__2Fdownload_2FM2b0O5Rr_3Ftsid_3D20120514-093834-29c48ef9/preview.flv"
How does the browser know where to load the video from? How can I write code that converts the embed link to a direct link?
UPDATE:
Thanks for the quick answer, Quentin.
However, I don't seem to receive a 'Location' header when connecting to "http://www.4shared.com/embed/571660264/396a46be".
import urllib2
r=urllib2.urlopen('http://www.4shared.com/embed/571660264/396a46be')
gives me the following headers:
'content-length', 'via', 'x-cache', 'accept-ranges', 'server', 'x-cache-lookup', 'last-modified', 'connection', 'etag', 'date', 'content-type', 'x-jsl'
from urllib2 import Request
r=Request('http://www.4shared.com/embed/571660264/396a46be')
gives me no headers at all.
The server issues a 302 HTTP status code and a Location header.
$ curl -I http://www.4shared.com/embed/571660264/396a46be
HTTP/1.1 302 Moved Temporarily
Server: Apache-Coyote/1.1
(snip cookies)
Location: http://static.4shared.com/flash/player/5.6/player.swf?file=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&provider=image&image=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&displayclick=link&link=http://www.4shared.com/video/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.html&controlbar=none
Content-Length: 0
Date: Mon, 14 May 2012 10:01:59 GMT
See How do I prevent Python's urllib(2) from following a redirect if you want to get information about the redirect response instead of following the redirect automatically.
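For instance, with requests (rather than urllib2) the redirect can be suppressed so the Location header is visible directly; a minimal sketch:

import requests

# Fetch the embed URL but do not follow the 302 automatically
r = requests.get("http://www.4shared.com/embed/571660264/396a46be", allow_redirects=False)
print(r.status_code)               # expected: 302
print(r.headers.get("Location"))   # the player URL that contains the direct .flv link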