This code was working fine and reading HTML.
Then the site just stopped returning any data on read(), with no error code.
Is that because the web server has detected something unusual?
(Before I figured out that I had to set the User-Agent, I got Error 403: Bad Behavior.)
Does urllib2 have some noticeable signature that raises a flag?
Would switching to another library help?
I'm not doing anything suspicious. I can't see any behavior difference between me using this library to read a page and using lynx browser.
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
search_url = 'http://www.google.com/?search=stackoverflow'
raw = opener.open(search_url)
print raw.headers
print raw.read()
Given your print statement, I assume you are running this on Python 2.x.
I ran the same code on my system and it works, regardless of whether the user agent is set. What David Robinson suggested might have something to do with it.
On another note, I have personally used the following example snippet:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
I have used it many times, across multiple use cases, and have never encountered your error.
Try using mechanize instead of plain urllib2 for crawling search engines; it mimics a browser's behaviour better.
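On the question of whether urllib2 has a noticeable signature: by default it does announce itself in the User-Agent header. The sketch below uses urllib.request, the Python 3 counterpart of urllib2, to show the default value and two ways of overriding it. The example.com URL is a placeholder of mine, not something from the question, and nothing is sent over the network.

```python
from urllib.request import Request, build_opener

# A fresh opener announces itself as "Python-urllib/<version>" unless told otherwise.
default_opener = build_opener()
default_agent = dict(default_opener.addheaders)["User-agent"]
print(default_agent)

# Overriding addheaders, as in the question, replaces that signature.
custom_opener = build_opener()
custom_opener.addheaders = [("User-agent", "Mozilla/5.0")]

# A Request can also carry its own headers; nothing is sent until it is opened.
# The URL is a placeholder chosen for this illustration.
req = Request("http://www.example.com/", headers={"User-Agent": "Mozilla/5.0"})
print(req.header_items())
```

A server that filters on this header would see the Python signature in the first case and the browser-like string in the other two. Other differences from a real browser (missing Accept headers, no cookies) may still be visible to the server.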
I'm having some trouble using urllib to fetch some web content on my Debian server. I use the following code to get the contents of most websites without problems:
import urllib.request as request
url = 'https://www.metal-archives.com/'
req = request.Request(url, headers={'User-Agent': "foobar"})
response = request.urlopen(req)
response.read()
However, if the website is using an older encryption protocol, the urlopen function will throw the following error:
ssl.SSLError: [SSL: VERSION_TOO_LOW] version too low (_ssl.c:748)
I have found a way to work around this problem, which consists of creating an SSL context and passing it as an argument to the urlopen function, so the previous code would have to be modified:
...
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
response = request.urlopen(req, context=context)
...
This works, provided the protocol specified matches the one used by the website I'm trying to access. However, it does not seem like the best solution, since:
If the site owners ever update their cryptography methods, the code will stop working
The code above will only work for this site, and I would have to create special cases for every website I visit in the entire program, since everyone could be using a different version of the protocol. That would lead to pretty messy code
The first solution I posted (the one without the ssl context) oddly seems to work on my ArchLinux machine, even though they both have the same versions of everything
Does anyone know about a generic solution that would work for every TLS version? Am I missing something here?
PS: For completeness, I will add that I'm using Debian 9, python v3.6.2, openssl v1.1.0f and urllib3 v1.22
In the end, I've opted to wrap the method call inside a try-except, so I can use the older SSL version as fallback. The final code is this:
import ssl
import urllib.request as request
from urllib.error import URLError

url = 'https://www.metal-archives.com'
req = request.Request(url, headers={"User-Agent": "foobar"})
try:
    response = request.urlopen(req)
except (ssl.SSLError, URLError):
    # Try to use the older TLSv1 to see if we can fix the problem
    context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
    response = request.urlopen(req, context=context)
I have only tested this code on a dozen websites and it seems to work so far, but I'm not sure it will work every time. Also, this solution seems inefficient, since it needs two http requests, which can be very slow.
Improvements are still welcome :)
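One possible improvement, offered as a sketch rather than a confirmed fix: instead of pinning a single protocol, build one context that lets the client negotiate any TLS version the local OpenSSL build supports. This assumes Python 3.7 or newer (the question mentions 3.6.2, where minimum_version is not available). The helper names make_negotiating_context and fetch are mine. A system-wide OpenSSL policy may still refuse old protocol versions regardless of this setting.

```python
import ssl
from urllib import request


def make_negotiating_context():
    """Build a verifying context that allows the lowest TLS version the local OpenSSL supports."""
    # create_default_context() keeps certificate and hostname verification switched on.
    context = ssl.create_default_context()
    # Requires Python 3.7+. MINIMUM_SUPPORTED defers to whatever the linked OpenSSL allows.
    context.minimum_version = ssl.TLSVersion.MINIMUM_SUPPORTED
    return context


def fetch(url, user_agent="foobar"):
    """Fetch url once, letting client and server agree on the TLS version."""
    req = request.Request(url, headers={"User-Agent": user_agent})
    with request.urlopen(req, context=make_negotiating_context()) as response:
        return response.read()
```

If this works for a given site, it needs only one request and no per-site special cases, because the handshake itself picks the highest version both sides accept.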
I am trying to use requests to pull information from the NPI API but it is taking on average over 20 seconds to pull the information. If I try and access it via my web browser it takes less than a second. I'm rather new to this and any help would be greatly appreciated. Here is my code.
import json
import sys
import requests
url = "https://npiregistry.cms.hhs.gov/api/?number=&enumeration_type=&taxonomy_description=&first_name=&last_name=&organization_name=&address_purpose=&city=&state=&postal_code=10017&country_code=&limit=&skip="
htmlfile=requests.get(url)
data = htmlfile.json()
for i in data["results"]:
    print(i)
This might be due to the response being incorrectly formatted, or due to requests taking longer than necessary to set up the request. To solve these issues, read on:
Server response formatted incorrectly
A possible issue might be that the response parsing is actually the offending line. You can check this by not reading the response you receive from the server. If the code is still slow, this is not your problem, but if this fixed it, the problem might lie with parsing the response.
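To make that check concrete, the two stages can be timed separately. This is a minimal sketch; time_stages is a hypothetical helper of mine that takes the fetch and parse steps as callables, so it does not depend on any particular HTTP library.

```python
import time


def time_stages(fetch, parse):
    """Run fetch() and parse(result) separately and report how long each took."""
    start = time.perf_counter()
    response = fetch()
    fetched = time.perf_counter()
    parsed = parse(response)
    done = time.perf_counter()
    return parsed, fetched - start, done - fetched


# Intended use with requests (not executed here):
#   data, fetch_s, parse_s = time_stages(lambda: requests.get(url), lambda r: r.json())
#   print("fetch: %.2fs, parse: %.2fs" % (fetch_s, parse_s))
```

With the default stream=False, requests.get downloads the whole body before returning, so the first number covers connection setup and transfer, and the second covers decoding and JSON parsing only.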
If some headers are set incorrectly, this can lead to parsing errors which prevent chunked transfer (source).
In other cases, setting the encoding manually might resolve parsing problems (source).
To fix those, try:
r = requests.get(url)
r.raw.chunked = True # Fix issue 1
r.encoding = 'utf-8' # Fix issue 2
print(r.text)
Setting up the request takes long
This is mainly applicable if you're sending multiple requests in a row. To prevent requests having to set up the connection each time, you can utilize a requests.Session. This makes sure the connection to the server stays open and configured and also persists cookies as a nice benefit. Try this (source):
import requests
session = requests.Session()
for _ in range(10):
    session.get(url)
Didn't solve your issue?
If that did not solve your issue, I have collected some other possible solutions here.
I'm using urlopen() to open a website and pull (financial) data from it. Here is my line:
sourceCode = urlopen('xxxxxxxx').read()
After this, I then pull the data I need out. I loop through different pages on the same domain to pull data (stock info). I end the body of the loop with:
time.sleep(1)
as I'm told that keeps the site from blocking me. My program will run for a few minutes, but at some point, it stalls and quits pulling data. I can rerun it and it'll run another arbitrary amount of time and then stall.
Is there something I can do to prevent this?
This worked (for most websites) for me:
If you're using the urllib.request library, you can create a Request and spoof the user agent. This might mean that they stop blocking you.
from urllib.request import Request, urlopen
req = Request(path, headers={'User-Agent': 'Mozilla/5.0'})
data = urlopen(req).read()
Hope this helps
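A stall like the one described can also happen simply because urlopen has no timeout by default, so a connection that goes quiet blocks forever. The sketch below adds a timeout and a few retries. The function name fetch_with_retry, its default values, and the opener parameter (there only so the function can be exercised without a network) are my own assumptions, and retries is expected to be at least 1.

```python
import socket
import time
from urllib.error import URLError
from urllib.request import Request, urlopen


def fetch_with_retry(url, timeout=10, retries=3, pause=1, opener=urlopen):
    """Return the response body, retrying when a request times out or fails."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    last_error = None
    for _ in range(retries):
        try:
            # Without a timeout, urlopen can block forever on a stalled connection.
            with opener(req, timeout=timeout) as response:
                return response.read()
        except (socket.timeout, URLError) as error:
            last_error = error
            time.sleep(pause)  # back off briefly before the next attempt
    raise last_error
```

In the loop from the question, `sourceCode = fetch_with_retry('xxxxxxxx')` would replace the bare urlopen call, and a page that never answers then raises an error instead of hanging the program.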
I am parsing through various links using requests. However, some links are "bad", meaning they simply don't load, which causes my program to hang and eventually crash.
Is there a way to set a time limit for getting a request, and if that time passes (fails to get a request from the url) it will just return some kind of error? Or is there some other way I can prevent bad links from breaking my program?
urllib2 is one option
import urllib2
test_url = "http://www.test.com"
try:
    urllib2.urlopen(test_url)
except Exception:
    # Ignore any failure (bad URL, HTTP error, connection problem) and carry on
    pass
Edit: I have found that I made a mistake. The cause of the error wasn't urllib but nltk, which wasn't able to process a long string that came from this exact page. Sorry for this one.
I do not know why, but my program freezes when it comes across a specific URL, no matter whether I use urllib2.urlopen or requests.
import requests
r = requests.get('SomeURL')
html = r.text
print(html)
Here is its behavior.
1) When I go through a loop of 200 URLs, it freezes each time at exactly the same URL. It stays there for hours if I do not terminate the program.
2) When I try the example code above outside of the loop, it works.
3) If I blacklist just this URL, it goes through the loop without problems.
It doesn't return any kind of error code, and it works fine outside of the loop. A timeout is also set, but it doesn't do anything: the request still hangs indefinitely.
So is there any other way to forcefully stop the HTTP GET request after a certain time, given that the timeout doesn't work? Is there any library other than urllib2 and requests that could do the job and that respects timeout limits?
for i in range(0, mincount):
    r = requests.get(urlist[i])  # the request for urlist[i]; hangs on one particular URL
It always works but freezes only when I request this site. If I made 200 requests to Yahoo, for example, it would work. But when I try to go to this particular URL, I cannot.
#end
Edit: It is a standard for loop and there is not much room for error.
I think it's simply a very slow page; on my system, it takes about 9.7s to load.
If you are trying to run it in a short loop, it would indeed seem to freeze.
You could try something like
import requests

links = [
    'SomeURL',
    'http://www.google.com/'
]
for link in links:
    try:
        html = requests.get(link, timeout=2.).content
        print("Successfully loaded {}".format(link))
    except requests.Timeout:
        print("Timed out loading {}".format(link))
which gave me
Timed out loading SomeURL
Successfully loaded http://www.google.com/
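One caveat on the timeout above: in requests it bounds the connection attempt and each wait for data from the server, not the total time of the call, so a server that keeps trickling bytes can still hold a request open far longer. If a hard overall limit is needed, one option is to run the call in a worker thread and stop waiting after a deadline. This is a sketch under that assumption; run_with_deadline is a hypothetical helper of mine, and the abandoned worker is not killed, it only stops blocking the caller.

```python
import concurrent.futures
import time


def run_with_deadline(func, deadline_seconds, *args, **kwargs):
    """Return func(*args, **kwargs), or raise a TimeoutError after deadline_seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        # Raises concurrent.futures.TimeoutError if the call has not finished in time.
        return future.result(timeout=deadline_seconds)
    finally:
        # Do not wait for a stuck worker; it keeps running in the background.
        executor.shutdown(wait=False)


# Intended use inside the loop (not executed here):
#   try:
#       html = run_with_deadline(requests.get, 10, link, timeout=2.).content
#   except concurrent.futures.TimeoutError:
#       print("Gave up on {}".format(link))
```

The same wrapper could be put around any slow step, including post-processing of the page text, which matters here since the question's edit traces the hang to nltk rather than to the HTTP request.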