I have the following, very basic code that throws: TypeError: the JSON object must be str, not 'bytes'
import requests
import json
url = 'my url'
user = 'my user'
pwd = 'my password'
myResponse = requests.get(url, auth=(user, pwd))
if(myResponse.ok):
    Data = json.loads(myResponse.content)
I tried applying decode as follows, but it throws the same error: jData = json.loads(myResponse.content).decode('utf-8')
Any suggestions?
json.loads(myResponse.content.decode('utf-8'))
You just put the calls in the wrong order: decode the bytes first, then parse the JSON. An innocent mistake.
(In-depth answer.) As courteously pointed out by wim, in some rare cases a server could opt for UTF-16 or UTF-32. Those cases are less common, since the developers would be consciously choosing to throw away valuable bandwidth. So, if you run into encoding issues, you can change utf-8 to utf-16, utf-32, etc.
There are a couple of solutions for this. You could use requests' built-in .json() method:
myResponse.json()
Or, you could opt for character detection via chardet. chardet is a library based on Mozilla's universal character-encoding detector. Its main function, detect, can identify most common encodings, which you can then use to decode your bytes.
import chardet
json.loads(myResponse.content.decode(chardet.detect(myResponse.content)["encoding"]))
Let requests decode it for you:
data = response.json()
This will check headers (Content-Type) and response encoding, auto-detecting how to decode the content correctly.
Python 3.6+ does this automatically: json.loads() accepts bytes directly (encoded as UTF-8, UTF-16, or UTF-32), so your code shouldn't raise this error on Python 3.6+. See "What's New in Python 3.6".
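A quick illustration of that; the bytes literal here is just a stand-in, no network call needed:
import json

# On Python 3.6+, json.loads() accepts bytes directly,
# as long as they are UTF-8, UTF-16, or UTF-32 encoded
data = json.loads(b'{"key": "value"}')
print(data["key"])  # prints: value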
I would like to use asyncio to get the webpage.
However, when I executed the code below, no page is obtained.
The code is
import aiofiles
import aiohttp
from aiohttp import ClientSession
import asyncio

async def get_webpage(url, session):
    try:
        res = await session.request(method="GET", url=url)
        html = await res.text(encoding='GB18030')
        return 0, html
    except:
        return 1, []

async def main_get_webpage(urls):
    webpage = []
    connector = aiohttp.TCPConnector(limit=60)
    async with ClientSession(connector=connector) as session:
        tasks = [get_webpage(url, session) for url in urls]
        result = await asyncio.gather(*tasks)
        for status, data in result:
            print(status)
            if status == 0:
                webpage.append(data)
    return webpage

if __name__ == '__main__':
    urls = ['https://lcdsj.fang.com/house/3120178164/fangjia.htm', 'https://mingliugaoerfuzhuangyuan0551.fang.com/house/2128242324/fangjia.htm']
    loop = asyncio.ProactorEventLoop()
    asyncio.set_event_loop(loop)
    loop = asyncio.get_event_loop()
    webpage = loop.run_until_complete(main_get_webpage(urls))
I expect two zeros to be printed in main_get_webpage(urls).
However, two ones are printed.
What's wrong with my code?
How to fix the problem?
Thank you very much.
What's wrong with my code?
What's wrong is that you have a try: ... except: that masks the source of the problem. If you remove the except clause, you will find an error message that communicates the underlying issue:
UnicodeDecodeError: 'gb18030' codec can't decode byte 0xb7 in position 47676: illegal multibyte sequence
The web page is not encoded as GB18030. The page declares itself as GB2312 (a precursor to GB18030), but using that as the encoding also fails.
How to fix the problem?
Depending on what you want to do with the web page text, you have several options:
Find an encoding supported by Python that works with the page as given. This is the ideal option, but I wasn't able to find it with a short search. (Using this answer to find out what Chrome thinks the page uses didn't help either, because the response was GBK, which again produces an error at character 47676.)
Decode the page with a more relaxed error handler, such as res.text(encoding='GB18030', errors='replace'). That will give you a good approximation of the text, with the undecipherable bytes rendered as the unicode replacement character. This is a good option if you need to search the page for a substring or analyze it as text, and don't care about a weird character somewhere in it.
Give up the idea of decoding the page as text, and just use res.read() to get the bytes. This option is best if you need to archive or cache the page, or index it. See the sketch after this list.
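A minimal sketch of options 2 and 3, keeping the shape of the question's get_webpage() coroutine:
async def get_webpage(url, session):
    try:
        res = await session.request(method="GET", url=url)
        # Option 2: tolerant decoding; undecodable bytes become the U+FFFD replacement character
        html = await res.text(encoding='GB18030', errors='replace')
        # Option 3 (alternative): keep the raw bytes instead of decoding
        # raw = await res.read()
        return 0, html
    except Exception:
        return 1, []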
I think a better way may be to just use await res.text() instead of await res.text(encoding='GB18030'), because, as https://docs.aiohttp.org/en/stable/client_reference.html?highlight=encoding#aiohttp.ClientResponse.text says:
If encoding is None content encoding is autocalculated using
Content-Type HTTP header and chardet tool if the header is not
provided by server.
I would argue that if aiohttp didn't use the charset in Content-Type to decode the response text, its implementation would be rather problematic. You really don't need to provide the encoding parameter.
I checked the two URLs in your example; the Content-Type is text/html; charset=utf-8 for both, so you can't use GB18030 to decode them.
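So the fix is simply to drop the encoding argument; a sketch of the question's coroutine with that one change:
async def get_webpage(url, session):
    try:
        res = await session.request(method="GET", url=url)
        # Let aiohttp pick the encoding from the Content-Type header
        html = await res.text()
        return 0, html
    except Exception:
        return 1, []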
I am using this statement in Python
jsonreq = json.dumps({'jsonrpc': '2.0', 'id': 'qwer', 'method': 'aria2.pauseAll'})
jsonreq = jsonreq.encode('ascii')
c = urllib.request.urlopen('http://localhost:6800/jsonrpc', jsonreq)
With this I am getting the following warning/error when I run a code quality check:
Audit url open for permitted schemes. Allowing use of "file:" or custom schemes is often unexpected.
Because I stumbled upon this question and the accepted answer did not work for me, I researched this myself:
Why urllib is a security risk
urllib not only opens http:// or https:// URLs, but also ftp:// and file:// ones.
With this, it might be possible to open local files on the executing machine, which can be a security risk if the URL to open can be manipulated by an external user.
How to fix this
You are responsible for validating the URL yourself before opening it with urllib.
E.g.
if url.lower().startswith('http'):
    req = urllib.request.Request(url)
else:
    raise ValueError from None

with urllib.request.urlopen(req) as resp:
    [...]
How to fix this so the linter (e.g. bandit) no longer complains
Bandit, at least, uses a simple blacklist for the function call: as long as you use urllib, the linter will raise the warning, even if you DO validate your input as shown above (or even use hardcoded URLs).
Add a #nosec comment to the line to suppress the warning from bandit, or look up the suppression keyword for your linter/code-checker. It's best practice to also add a comment stating WHY you think this is not worth a warning in your case; see the sketch below.
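A minimal sketch of what that might look like (B310 is the ID of bandit's urllib-urlopen blacklist entry, linked further down; older bandit versions only accept a bare #nosec without the test ID, so check your version's docs):
import urllib.request

def fetch(url):
    # Only http(s) URLs are accepted here, so B310 does not apply
    if not url.lower().startswith(('http://', 'https://')):
        raise ValueError('Only http(s) URLs are allowed')
    with urllib.request.urlopen(url) as resp:  # nosec B310 - scheme validated above
        return resp.read()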
I think this is what you need
import urllib.request
req = urllib.request.Request('http://www.example.com')
with urllib.request.urlopen(req) as response:
the_page = response.read()
For those who couldn't solve it with the above answers:
You could use the requests library instead, which is not blacklisted in bandit.
https://bandit.readthedocs.io/en/latest/blacklists/blacklist_calls.html#b310-urllib-urlopen
import requests
url = 'http://www.example.com'
the_page = requests.get(url)
print(the_page.json()) # if the response is json
print(the_page.text) # if the response is some text
I am checking for url status with this code:
h = httplib2.Http()
hdr = {'User-Agent': 'Mozilla/5.0'}
resp = h.request("http://" + url, headers=hdr)
if int(resp[0]['status']) < 400:
    return 'ok'
else:
    return 'bad'
and getting
Error -3 while decompressing data: incorrect header check
The URL I am checking is:
http://www.sueddeutsche.de/wirtschaft/deutschlands-innovationsangst-wir-neobiedermeier-1.2117528
The exception location is:
Exception Location: C:\Python27\lib\site-packages\httplib2-0.9-py2.7.egg\httplib2\__init__.py in _decompressContent, line 403
try:
    encoding = response.get('content-encoding', None)
    if encoding in ['gzip', 'deflate']:
        if encoding == 'gzip':
            content = gzip.GzipFile(fileobj=StringIO.StringIO(new_content)).read()
        if encoding == 'deflate':
            content = zlib.decompress(content) ##<---- error line
        response['content-length'] = str(len(content))
        # Record the historical presence of the encoding in a way the won't interfere.
        response['-content-encoding'] = response['content-encoding']
        del response['content-encoding']
except IOError:
    content = ""
The HTTP status is 200, which is OK for my case, but I am getting this error.
I actually only need the HTTP status; why is it reading the whole content?
You may have any number of reasons for choosing httplib2, but it's far too easy to get the status code of an HTTP request using the Python module requests. Install it with the following command:
$ pip install requests
See an extremely simple example below.
In [1]: import requests as rq
In [2]: url = "http://www.sueddeutsche.de/wirtschaft/deutschlands-innovationsangst-wir-neobiedermeier-1.2117528"
In [3]: r = rq.get(url)
In [4]: r
Out[4]: <Response [200]>
Unless you have a considerable constraint that needs httplib2 explicitly, this solves your problem.
This may be a bug (or just an uncommon design decision) in httplib2. I don't get this problem with urllib2 or httplib in the 2.x stdlib, or urllib.request or http.client in the 3.x stdlib, or the third-party libraries requests, urllib3, or pycurl.
So, is there a reason you need to use this particular library?
If so:
I actually only need the HTTP status; why is it reading the whole content?
Well, most HTTP libraries are going to read and parse the whole content, or at least the headers, before returning control. That way they can respond to simple requests about the headers or chunked encoding or MIME envelope or whatever without any delay.
Also, many of them automate things like 100 continue, 302 redirect, various kinds of auth, etc., and there's no way they could do that if they didn't read ahead. In particular, according to the description for httplib2, handling these things automatically is one of the main reasons you should use it in the first place.
Also, the first TCP read is nearly always going to include the headers anyway, so why not read them?
This means that if the headers are invalid, you may get an exception immediately. They may still provide a way to get the status code (or the raw headers, or other information) anyway.
As a side note, if you only want the HTTP status, you should probably send a HEAD request rather than a GET. Unless you're writing and testing a server, you can almost always rely on the fact that, as the RFC says, the status and headers should be identical to what you'd get with GET. In fact, that would almost certainly solve things in this case: if there is no body to decompress, the fact that httplib2 has gotten confused into thinking the body is gzipped when it isn't won't matter anyway.
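For example, a HEAD request with httplib2 (a sketch reusing the question's setup) avoids the body entirely:
import httplib2

url = "www.sueddeutsche.de/wirtschaft/deutschlands-innovationsangst-wir-neobiedermeier-1.2117528"
h = httplib2.Http()
hdr = {'User-Agent': 'Mozilla/5.0'}
# HEAD returns the same status and headers as GET, but there is no body to decompress
resp, content = h.request("http://" + url, method="HEAD", headers=hdr)
print('ok' if int(resp['status']) < 400 else 'bad')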
I am trying to generate the digest authorization header for use in a python test case. Because of the way the code base works, it is important that I am able to get the header as a string.
This header looks something like this
Authorization: Digest username="the_user", realm="my_realm", nonce="1389832695:d3c620a9e645420228c5c7da7d228f8c", uri="/some/uri", response="b9770bd8f1cf594dade72fe9abbb2f31"
I think my best bets are to use either urllib2 or the requests library.
With urllib2, I have gotten this far:
au=urllib2.HTTPDigestAuthHandler()
au.add_password("my_realm", "http://example.com/", "the_user", "the_password")
but I can't get the header out of that.
With requests, I have gotten this far:
requests.HTTPDigestAuth("the_user", "the_password")
But when I try to use that in a request, I am getting errors about setting the realm, which I can't figure out how to do.
If you're prepared to contort yourself around it, you can get the requests.auth.HTTPDigestAuth class to give you the right answer by doing something like this:
from requests.auth import HTTPDigestAuth
chal = {'realm': 'my_realm', 'nonce': '1389832695:d3c620a9e645420228c5c7da7d228f8c'}
a = HTTPDigestAuth('the_user', password)
a.chal = chal
print a.build_digest_header('GET', '/some/uri')
If I use 'the_password' as the user's password, that gives me this result:
Digest username="the_user", realm="my_realm", nonce="1389832695:d3c620a9e645420228c5c7da7d228f8c", uri="/some/uri", response="0b34daf411f3d9739538c7e7ee845e92"
When I tried @Lukasa's answer I got an error:
'_thread._local' object has no attribute 'chal'
So I solved it in a slightly dirty way but it works:
from requests.auth import HTTPDigestAuth
chal = {'realm': 'my_realm',
'nonce': '1389832695:d3c620a9e645420228c5c7da7d228f8c'}
a = HTTPDigestAuth("username", "password")
a.init_per_thread_state()
a._thread_local.chal = chal
print(a.build_digest_header('GET', '/some/uri'))
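Once you have the header as a string, attaching it to a test request is straightforward. A sketch, reusing a from the snippet above; the example.com URL is just a placeholder:
import requests

hdr = a.build_digest_header('GET', '/some/uri')
resp = requests.get('http://example.com/some/uri', headers={'Authorization': hdr})
print(resp.status_code)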
I'm having some trouble with twisted.web.client.Agent...
I think the string data in my POST request isn't being formatted properly. I'm trying to do something analogous to this synchronous code:
from urllib import urlencode
import urllib2
page = 'http://example.com/'
id_string = 'this:is,my:id:string'
req = urllib2.Request(page, data=urlencode({'id': id_string})) # urlencode call returns 'id=this%3Ais%2Cmy%3Aid%3Astring'
resp = urllib2.urlopen(req)
Here's how I'm building my Agent request as of right now:
from urllib import urlencode
from StringIO import StringIO
page = 'http://example.com/'
id_string = 'my:id_string'
head = {'User-Agent': ['user agent goes here']}
data = urlencode({'id': id_string})
request = agent.request('POST', page, Headers(head), FileBodyProducer(StringIO(data)))
request.addCallback(foo)
Because of the HTTP response I'm getting (a null JSON string), I'm beginning to suspect that the id is not being properly encoded in the POST request, but I'm not sure what I can do about it. Is using urlencode with the Agent.request call valid? Is there another way I should be encoding these things?
EDIT: Some kind IRC guys have suggested that the problem may stem from the fact that I didn't send the header information that indicates the data is encoded in a url string. I know remarkably little about this kind of stuff... Can anybody point me in the right direction?
As requested, here's my comment in the form of an answer:
HTTP requests with bodies should have the Content-Type header set (to tell the server how to interpret the bytes in the body); in this case, it seems the server is expecting URL-encoded data, like a web browser would send when a form is filled out.
urllib2.Request apparently defaults the content type for you, but the Twisted library needs it to be set manually. In this case, you want a content type of application/x-www-form-urlencoded; see the sketch below.
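A minimal sketch with that header added, using the same Python 2-era Twisted API as the question (print_status is a hypothetical stand-in for the question's foo callback):
from urllib import urlencode
from StringIO import StringIO

from twisted.internet import reactor
from twisted.web.client import Agent, FileBodyProducer
from twisted.web.http_headers import Headers

def print_status(response):
    print response.code

agent = Agent(reactor)
page = 'http://example.com/'
data = urlencode({'id': 'my:id_string'})
head = Headers({
    'User-Agent': ['user agent goes here'],
    # Tell the server how to interpret the bytes in the body
    'Content-Type': ['application/x-www-form-urlencoded'],
})

d = agent.request('POST', page, head, FileBodyProducer(StringIO(data)))
d.addCallback(print_status)
d.addBoth(lambda _: reactor.stop())
reactor.run()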