I am trying to use requests to pull information from the NPI API, but it is taking over 20 seconds on average to pull the information. If I try to access it via my web browser, it takes less than a second. I'm rather new to this and any help would be greatly appreciated. Here is my code.
import json
import sys
import requests
url = "https://npiregistry.cms.hhs.gov/api/?number=&enumeration_type=&taxonomy_description=&first_name=&last_name=&organization_name=&address_purpose=&city=&state=&postal_code=10017&country_code=&limit=&skip="
htmlfile=requests.get(url)
data = htmlfile.json()
for i in data["results"]:
    print(i)
This might be due to the response being incorrectly formatted, or due to requests taking longer than necessary to set up the request. To solve these issues, read on:
Server response formatted incorrectly
A possible issue is that parsing the response is actually the slow part. You can check this by not reading the response you receive from the server: if the code is still slow, parsing is not your problem, but if that fixes it, the problem likely lies in parsing the response.
If some headers are set incorrectly, this can lead to parsing errors that prevent chunked transfer (source).
In other cases, setting the encoding manually might resolve parsing problems (source).
To fix those, try:
r = requests.get(url)
r.raw.chunked = True # Fix issue 1
r.encoding = 'utf-8' # Fix issue 2
print(r.text)
Setting up the request takes a long time
This is mainly applicable if you're sending multiple requests in a row. To prevent requests having to set up the connection each time, you can utilize a requests.Session. This makes sure the connection to the server stays open and configured and also persists cookies as a nice benefit. Try this (source):
import requests
session = requests.Session()
for _ in range(10):
    session.get(url)
Didn't solve your issue?
If that did not solve your issue, I have collected some other possible solutions here.
Related
I'm doing a POST request using Python and the Confluence REST API in order to update Confluence pages via a script.
I ran into a problem which caused me to receive a 400 error in response to a
requests.put(url, data = jsonData, auth = (username, passwd), headers = {'Content-Type' : 'application/json'})
I spent some time on this before discovering that the reason was that I was not supplying an incremented version when updating the content. I have managed to make my script work, but that is not the point of this question.
During my attempts to make this work, I swapped from requests to an http.client connection. Using this module, I get a lot more information regarding my error:
b'{"statusCode":400,"data":{"authorized":false,"valid":true,"allowedInReadOnlyMode":true,"errors":[],"successful":false},"message":"Must supply an incremented version when updating Content. No version supplied.","reason":"Bad Request"}'
Is there a way for me to get the same feedback information while using requests? I've turned on logging, but this kind of info is never shown.
You're looking for
response.json()
i.e. the .json() method on the Response object that requests returns. It gives you everything the server sent back, parsed into a dictionary.
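For example, here is a minimal sketch (the URL, credentials, and payload below are placeholders, not your real values):
import requests

# Hypothetical Confluence endpoint and payload, for illustration only
url = "https://confluence.example.com/rest/api/content/12345"
jsonData = '{"version": {"number": 2}, "type": "page", "title": "Example"}'

r = requests.put(url, data=jsonData,
                 auth=("username", "passwd"),
                 headers={"Content-Type": "application/json"})

print(r.status_code)  # e.g. 400 on failure
print(r.json())       # the server's error payload as a dictionary
# If the body isn't valid JSON, fall back to the raw text:
# print(r.text)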
I'm having some trouble using urllib to fetch some web content on my Debian server. I use the following code to get the contents of most websites without problems:
import urllib.request as request
url = 'https://www.metal-archives.com/'
req = request.Request(url, headers={'User-Agent': "foobar"})
response = request.urlopen(req)
response.read()
However, if the website is using an older encryption protocol, the urlopen function will throw the following error:
ssl.SSLError: [SSL: VERSION_TOO_LOW] version too low (_ssl.c:748)
I have found a way to work around this problem, which consists of creating an SSL context and passing it as an argument to the urlopen function, so the previous code has to be modified:
...
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
response = request.urlopen(req, context=context)
...
This will work, provided the specified protocol matches the website I'm trying to access. However, this does not seem like the best solution, since:
If the site owners ever update their cryptography methods, the code will stop working
The code above will only work for this site, and I would have to create special cases for every website I visit in the entire program, since everyone could be using a different version of the protocol. That would lead to pretty messy code
The first solution I posted (the one without the ssl context) oddly seems to work on my ArchLinux machine, even though they both have the same versions of everything
Does anyone know about a generic solution that would work for every TLS version? Am I missing something here?
PS: For completeness, I will add that I'm using Debian 9, python v3.6.2, openssl v1.1.0f and urllib3 v1.22
In the end, I've opted to wrap the method call inside a try-except, so I can use the older SSL version as fallback. The final code is this:
import ssl
import urllib.request as request
from urllib.error import URLError

url = 'https://www.metal-archives.com'
req = request.Request(url, headers={"User-Agent": "foobar"})
try:
    response = request.urlopen(req)
except (ssl.SSLError, URLError):
    # Fall back to the older TLSv1 to see if that fixes the problem
    context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
    response = request.urlopen(req, context=context)
I have only tested this code on a dozen websites and it seems to work so far, but I'm not sure it will work every time. Also, this solution seems inefficient, since it needs two HTTP requests, which can be very slow.
Improvements are still welcome :)
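If this fallback is needed in more than one place, one option is to wrap it in a small helper so the try/except isn't repeated everywhere; this is just a sketch of the same approach (the fetch name is purely illustrative), not a different fix:
import ssl
import urllib.request as request
from urllib.error import URLError

def fetch(url, user_agent="foobar"):
    # Open url normally; fall back to TLSv1 only if the default handshake fails.
    req = request.Request(url, headers={"User-Agent": user_agent})
    try:
        return request.urlopen(req)
    except (ssl.SSLError, URLError):
        context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
        return request.urlopen(req, context=context)

data = fetch('https://www.metal-archives.com').read()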
I'm using urlopen() to open a website and pull (financial) data from it. Here is my line:
sourceCode = urlopen('xxxxxxxx').read()
After this, I then pull the data I need out. I loop through different pages on the same domain to pull data (stock info). I end the body of the loop with:
time.sleep(1)
as I'm told that keeps the site from blocking me. My program will run for a few minutes, but at some point, it stalls and quits pulling data. I can rerun it and it'll run another arbitrary amount of time and then stall.
Is there something I can do to prevent this?
This worked (for most websites) for me:
If you're using the urllib.request library, you can create a Request and spoof the user agent. This might mean that they stop blocking you.
from urllib.request import Request, urlopen
req = Request(path, headers={'User-Agent': 'Mozilla/5.0'})
data = urlopen(req).read()
Hope this helps
Edit: I have found that I made a mistake; the cause of the error wasn't urllib but nltk, which wasn't able to process a long string coming from this exact page. Sorry for this one.
I do not know why, but this happens no matter whether I use urllib2.urlopen or requests when I come across a specific URL.
import requests
r = requests.get('SomeURL')
html = r.text
print html
Here is its behavior.
1) When I go through a loop of 200 URLs, it freezes each time at exactly the same URL. It stays there for hours if I do not terminate the program.
2) When I try just the example code above outside of the loop, it works.
3) If I blacklist just this URL, it goes through the loop without problems.
It doesn't actually return any kind of error code, it works fine outside of the loop, and a timeout is set but it doesn't do anything. It still hangs for an indefinite time.
So is there any other way to forcefully stop the HTTP GET request after a certain time, since the timeout doesn't work? Is there any library other than urllib2 and requests that could do the job and that respects timeout limits?
for i in range(0, mincount):
    code(urlist[i])  # call the request for urlist[i]
It always works but freezes only when I request this site. If I had 200 requests to Yahoo, for example, it would work. But when I try to go to this particular URL, I cannot.
Edit: It is a standard for loop and there is not much room for error.
I think it's simply a very slow page; on my system, it takes about 9.7s to load.
If you are trying to run it in a short loop, it would indeed seem to freeze.
You could try something like
import requests

links = [
    'SomeURL',
    'http://www.google.com/'
]

for link in links:
    try:
        html = requests.get(link, timeout=2.).content
        print("Successfully loaded {}".format(link))
    except requests.Timeout:
        print("Timed out loading {}".format(link))
which gave me
Timed out loading SomeURL
Successfully loaded http://www.google.com/
Trying to make a POST request between a Python (WSGI) and a NodeJS + Express application. They are on different servers.
The problem is that when using different IP addresses (i.e. private network vs. public network), a urllib2 request on the public network succeeds, but the same request for the private network fails with a 502 Bad Gateway or URLError [32] Broken pipe.
The urllib2 code I'm using is this:
req = urllib2.Request(url, "{'some':'data'}", {'Content-Type' : 'application/json; charset=utf-8'})
res = urllib2.urlopen(req)
print res.read()
Now, I have also coded the request like this, using requests:
r = requests.post(url, headers = {'Content-Type' : 'application/json; charset=utf-8'}, data = "{'some':'data'}")
print r.text
And get a 200 OK response. This alternate method works for both networks.
I am interested in finding out if there is some additional configuration needed for a urllib2 request that I don't know of, or if I need to look into some network configuration which might be missing (I don't believe this is the case, since the alternate request method works, but I could definitely be wrong).
Any suggestions or pointers with this will be greatly appreciated. Thanks!
The problem here is that, as Austin Phillips pointed out, urllib2.Request's constructor's data parameter:
may be a string specifying additional data to send to the server… data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
By passing it JSON-encoded data instead of urlencoded data, you're confusing it somewhere.
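To make the difference concrete, here is a quick illustration (not part of the fix itself) comparing the two encodings of the same dictionary:
import json
import urllib  # Python 2, matching the urllib2 code in this question

payload = {'some': 'data'}
print urllib.urlencode(payload)  # some=data  -- the form-encoded string Request expects by default
print json.dumps(payload)        # {"some": "data"}  -- the JSON the API expects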
However, Request has a method add_data:
Set the Request data to data. This is ignored by all handlers except HTTP handlers — and there it should be a byte string, and will change the request to be POST rather than GET.
If you use this, you should probably also use add_header rather than passing it in the constructor, although that doesn't seem to be mentioned specifically anywhere in the documentation.
So, this should work:
req = urllib2.Request(url)
req.add_data("{'some':'data'}")
req.add_header('Content-Type', 'application/json; charset=utf-8')
res = urllib2.urlopen(req)
In a comment, you said:
The reason I don't want to just switch over to requests without finding out why I'm seeing this problem is that there may be some deeper underlying issue that this points to that could come back and cause harder-to-detect problems later on.
If you want to find deep underlying issues, you're not going to do that by just looking at your client-side source. The first step to figuring out "Why does X work but Y fails?" with network code is to figure out exactly what bytes X and Y each send. Then you can try to narrow down what the relevant difference is, and then figure out what part of your code is causing Y to send the wrong data in the relevant place.
You can do this by logging things at the service (if you control it), running Wireshark, etc., but the easiest way, for simple cases, is netcat. You'll need to read man nc for your system (and, on Windows, you'll need to get and install netcat before you can run it), because the syntax is different for each version, but it's always something simple like nc -kl 12345.
Then, in your client, change the URL to use localhost:12345 in place of the hostname, and it'll connect up to netcat and send its HTTP request, which will be dumped to the terminal. You can then copy that and use nc HOST 80 and paste it to see how the real server responds, and use that to narrow down where the problem is. Or, if you get stuck, at least you can copy and paste the data to your SO question.
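If netcat isn't handy, a few lines of Python can stand in for it by printing whatever the client sends; this is a rough sketch of that idea (it never sends a response, so the client will hang or time out, which is fine for inspection):
import socket

# Listen on localhost:12345 and dump the raw bytes of one incoming request.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('127.0.0.1', 12345))
srv.listen(1)
conn, addr = srv.accept()
print 'connection from', addr
print repr(conn.recv(65536))  # one recv is enough for a small POST
conn.close()
srv.close()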
One last thing: This is almost certainly not relevant to your problem (because you're sending the exact same data with requests and it's working), but your data is not actually valid JSON, because it uses single quotes instead of double quotes. According to the docs, string is defined as:
string
    ""
    " chars "
(The docs have a nice graphical representation as well.)
In general, except for really simple test cases, you don't want to write JSON by hand. In many cases (including yours), all you have to do is replace the "…" with json.dumps(…), so this isn't a serious hardship. So:
import json

req = urllib2.Request(url)
req.add_data(json.dumps({'some': 'data'}))
req.add_header('Content-Type', 'application/json; charset=utf-8')
res = urllib2.urlopen(req)
So, why is it working? Well, in JavaScript, single-quoted strings are legal, as well as other things like backslash escapes that aren't valid in JSON, and any JS code that uses restricted-eval (or, worse, raw eval) for parsing will accept it. And, because so many people got used to writing bad JSON because of this, many browsers' native JSON parsers and many JSON libraries in other languages have workarounds to allow common errors. But you shouldn't rely on that.
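A quick way to convince yourself (just an illustration): the strict json module accepts the double-quoted form and rejects the single-quoted one.
import json

print json.dumps({'some': 'data'})   # {"some": "data"} -- valid JSON, double quotes
try:
    json.loads("{'some': 'data'}")   # the hand-written, single-quoted version
except ValueError as e:
    print 'rejected:', e             # strict parsers refuse it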