How to remove Byte Order Mark in python

This question is related to a recent change to the Stack Overflow API that I reported here. In that question, I received a response that seems like it'd work, but in practice I'm unable to make it work.
This is my code
import requests
import json
url="https://api.stackexchange.com/2.2/sites/?filter=%21%2AL1%2AAY-85YllAr2%29&pagesize=1&page=1"
response = requests.get(url)
response.text
This outputs
u'\ufeff{"items":[{"site_state":"normal","api_site_parameter":"stackoverflow","name":"Stack Overflow"}],"has_more":true,"quota_max":300,"quota_remaining":294}'
The leading u'\ufeff means that if I do response.json() I get a ValueError: No JSON object could be decoded
The suggestion I was given was to use decode('utf-8-sig'). However, I can't seem to get this to work either:
Try 1:
response.text.decode('utf-8-sig')
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
Try 2:
json.loads(response.text).decode('utf-8-sig')
ValueError: No JSON object could be decoded
What is the appropriate way to remove the leading u'\ufeff?

response.text is a Unicode object, i.e. it has already been decoded, so you can't decode it again.
What you need to do is tell the response object which encoding it should use:
response = requests.get(url)
response.encoding = "utf-8-sig"
response.text
See the docs for more background info.
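The effect of the utf-8-sig codec can be checked without a network call. The following is a minimal sketch in Python 3; the byte string raw is a made-up stand-in for response.content, not real API output.

```python
import json

# Hypothetical stand-in for response.content: a UTF-8 BOM followed by JSON.
raw = b'\xef\xbb\xbf{"items": [], "has_more": true}'

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode("utf-8")

# "utf-8-sig" strips the BOM while decoding.
without_bom = raw.decode("utf-8-sig")

# The BOM-free text parses as ordinary JSON.
data = json.loads(without_bom)
print(repr(with_bom[0]))
print(data)
```

Setting response.encoding = "utf-8-sig" asks Requests to apply the same codec when it builds response.text.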

Related

Python - decode ('utf-8') issue

I am very new to Python. Please help me fix this issue.
I am trying to get the revenue from the link below:
https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898
I am using the commands below:
import re
import urllib.request
data=urllib.request.urlopen(url).read()
data1=data.decode("utf-8")
Issue:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 10798: invalid start byte
Maybe better with requests:
import requests
url = "https://www.google.co.in/?gfe_r...."
req = requests.get(url)
req.encoding = "utf-8"
data = req.text
Downloading the specific URL given in the question returns HTML. I was able to use BeautifulSoup to scrape the page after using the following Python code to get the data:
import requests
url = "https://www.google.co.in/?gfe_rd=cr&ei=kFFsWYyPEqvM8AeF7Y2IDQ&gws_rd=ssl#q=adp+revenue&stick=H4sIAAAAAAAAAOPgE-LUz9U3MMkozijTUskot9JPzs_JSU0uyczP088vSk_My6xKBHGKrYpSy1LzSlMBIRiSrDMAAAA&spf=1500270991898"
response = requests.get(url)
data = response.content.decode('utf-8', errors="replace")
print(data)
Note that my code example uses Python 3; in Python 2 the syntax for print differs slightly.
0xa0, or U+00A0 in Unicode notation, is the character NO-BREAK SPACE. In UTF-8 it is represented as b'\xc2\xa0'. If you find it as a lone raw byte, your input is probably not UTF-8 encoded but Latin-1 encoded.
A quick look at the linked page shows that it is indeed Latin-1 encoded, although I got a French version...
When you are not sure of the exact encoding, the rule is to decode with the replace error handler:
data1=data.decode("utf-8", errors="replace")
Then all offending bytes are replaced with the REPLACEMENT CHARACTER (U+FFFD), displayed as �. If only a few are found, the page contains some erroneous characters; but if almost all non-ASCII characters are replaced, the encoding is not UTF-8. It is commonly Latin-1 for Western European languages, but your mileage may vary for other languages.
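The three behaviours described above (strict failure, replacement, and decoding as Latin-1) can be compared side by side. This is a small sketch in Python 3; the sample bytes are invented for illustration and are not taken from the linked page.

```python
# Hypothetical sample: text with a NO-BREAK SPACE, encoded as Latin-1.
raw = "ADP\u00a0revenue".encode("latin-1")  # b'ADP\xa0revenue'

# Strict UTF-8 decoding fails on the lone 0xa0 byte.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were written in recovers the text.
as_latin1 = raw.decode("latin-1")

print(strict_error)
print(replaced.count("\ufffd"))
print(as_latin1 == "ADP\u00a0revenue")
```

Counting the U+FFFD characters in the replaced text is one rough way to judge whether the input was mostly valid UTF-8 or was in a different encoding altogether.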

Python 2.7, Requests library, can't get unicode

The documentation for the Requests library says that the text returned by requests.get() is always unicode. But when I check which encoding was used, I see "windows-1251". That's a problem: when I try to print requests.get(url).text, I get an error, because the page content contains Cyrillic characters.
import requests
url = 'https://www.weblancer.net/jobs/'
r = requests.get(url)
print r.encoding
print r.text
I got something like this:
windows-1251
UnicodeEncodeError: 'ascii' codec can't encode characters in position 256-263: ordinal not in range(128)
Is this a problem with Python 2.7, or is there no problem at all?
Please help.
From the docs:
Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded.
When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers.
r.encoding tells you the encoding that was used to convert the byte stream from the server into the Unicode text in r.text.
In your case it is correct: the headers in the response say that the character set is windows-1251.
The error you are seeing happens after that: Python is trying to encode the Unicode text as ASCII in order to print it, and failing.
You can write print r.text.encode(r.encoding), which gives the same result as Padraic's suggestion in the comments, namely r.content.
Note:
r.encoding is assignable: you can set it to whatever you want if Requests guessed wrongly.
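The difference between the decoding step (which succeeded) and the encoding step (which failed) can be reproduced without a server. This is a rough sketch in Python 3; the Cyrillic sample string is an assumption for illustration, not content from the site in the question.

```python
# Hypothetical sample text; any Cyrillic string would do.
text = "Работа"

# The bytes a server would send when it declares charset=windows-1251.
raw = text.encode("windows-1251")

# Decoding with the declared charset recovers the original text.
decoded = raw.decode("windows-1251")

# Encoding that text as ASCII, which is what the failing print attempted,
# raises UnicodeEncodeError.
try:
    decoded.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(decoded == text)
print(ascii_error)
```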

TypeError: decoding Unicode is not supported python

I am using lxml.html to parse an HTML file and get the text from the page. But now I have a string containing the character ', for example Florian's, and because of it I get a traceback when printing the output:
parent_link_id_text = parent_link_id.xpath('./td[@width="400"]/text()')
print (SGS_Mid[0]+";"+"External"+";"+str(link_id_num[0])+";"+parent_link_id_text[0]+";"+parent_link_link[0], file = log_file_1)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-58: ordinal not in range(128)
Then I tried this
print (SGS_Mid[0]+";"+"PublicFreeUrl"+";"+str(link_id_num[0])+";"+unicode(parent_link_id_text[0],"utf-8")+";"+parent_link_link[0], file = log_file_1)
and I get a traceback:
TypeError: decoding Unicode is not supported
How can I solve this by printing the string with the unicode character?
Not sure if this is the solution to your problem, but perhaps it will guide you in the right direction.
Without seeing the code you use to actually get the data, I can only guess at how to solve your issue.
Please see the following code:
import lxml.html as lh
import urllib2
url = 'http://loremipsum.net/about.html'
doc = lh.parse(urllib2.urlopen(url))
value = doc.xpath('//p/strong/text()')[0]
print value
Printed result:
What is 'lorem ipsum'?
By reading the about page on the lorem ipsum site, you can see that the text returned indeed has the ' in it.
I hope this helps you in the right direction.
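One way to read the two tracebacks in the question: the text returned by xpath() is already unicode, so it cannot be decoded again, and the remaining step is to encode it when writing to the log file. The following is a hedged sketch in Python 3, where str(text, "utf-8") fails in the same way as the unicode(...) call in the question. The sample string and the temporary file path are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical stand-in for parent_link_id_text[0]; the right single
# quotation mark (U+2019) is an assumed example of a non-ASCII character.
text = "Florian\u2019s"

# Decoding text that is already decoded raises TypeError.
try:
    str(text, "utf-8")
    decode_error = None
except TypeError as exc:
    decode_error = exc

# Open the log file with an explicit encoding, then print into it.
path = os.path.join(tempfile.mkdtemp(), "log.txt")
with io.open(path, "w", encoding="utf-8") as log_file:
    print("External;" + text, file=log_file)

# Read the file back with the same encoding to confirm what was written.
with io.open(path, "r", encoding="utf-8") as log_file:
    written = log_file.read()

print(decode_error)
print(written)
```

io.open also accepts the encoding argument in Python 2.7, so the same approach should carry over there, provided the values being printed are unicode objects.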

Decoding error with my Python function

I am using the Robot framework to automate some HTTP POST related tests. I wrote a custom Python library that has a function to do an HTTP POST. It looks like this:
import httplib2

# This function will do an HTTP POST and return the JSON response
def Http_Post_using_python(json_dict, url):
    post_data = json_dict.encode('utf-8')
    headers = {}
    headers['Content-Type'] = 'application/json'
    h = httplib2.Http()
    resp, content = h.request(url, 'POST', post_data, headers)
    return resp, content
This works fine as long as I am not using any Unicode characters. When I have Unicode characters in the json_dict variable (for example, 메시지), it fails with this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 164: ordinal not in range(128)
I am running Python 2.7.3 on Windows 7. I saw several related questions, but I have not been able to resolve the issue. I am new to Python and programming, so any help is appreciated.
Thanks.
You're getting this error because json_dict is a str, not a unicode. Without knowing anything else about the application, a simple solution would be:
if isinstance(json_dict, unicode):
    json_dict = json_dict.encode("utf-8")
post_data = json_dict
However, if you're using json.dumps(…) to create the json_dict, then you don't need to encode it – that will be done by json.dumps(…).
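The same "encode only if it is text" idea can be written for Python 3, where str is text and bytes is binary data. This is a minimal sketch; the helper name to_utf8_bytes and the sample payload are made up for illustration.

```python
import json

def to_utf8_bytes(payload):
    """Return payload as UTF-8 bytes, encoding only if it is text."""
    if isinstance(payload, bytes):
        # Already encoded: encoding again would be an error.
        return payload
    return payload.encode("utf-8")

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so the resulting string is pure ASCII and safe to encode.
body = json.dumps({"message": "메시지"})
post_data = to_utf8_bytes(body)

print(post_data)
print(json.loads(post_data.decode("utf-8")))
```

Passing a value that is already bytes through the helper leaves it untouched, which avoids the double-encoding problem described above.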
Use requests:
requests.post(url, data=data, headers=headers)
It will deal with the encodings for you.
You're getting an error because of Python 2's automatic encoding/decoding, which is basically a bug and was fixed in Python 3. In brief, Python 2's str objects are really "bytes", and the right way to handle string data is in a unicode object. Since unicodes were introduced later, Python 2 will automatically try to convert between them and strings when you get them confused. To do so it needs to know an encoding; since you don't specify one, it defaults to ascii which doesn't have the characters needed.
Why is Python automatically trying to decode for you? Because you're calling .encode() on a str object. It's already encoded, so Python first tries to decode it for you, and guesses the ascii encoding.
You should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Try this:
#coding=utf-8
test = "메시지"
test.decode('utf8')
In the line #coding=utf-8 I just set the file encoding to UTF-8 (to be able to write "메시지").
You need to decode the string from UTF-8 into a unicode object. See the decode method documentation.

UnicodeDecodeError when passing GET data in Python/AppEngine

This feels like a really basic question, but I haven't been able to find an answer.
I would like to read data from a URL, for example GET data from a querystring. I am using the webapp framework in Python. I tried the following code, but since I'm a total beginner at Python/appengine, I've certainly done something wrong.
class MainPage(webapp.RequestHandler):
    def get(self):
        self.response.out.write(self.request.get('data'))

application = webapp.WSGIApplication([('/', MainPage), ('/search', Search), ('/next', Next)], debug=False)

def main():
    run_wsgi_app(application)

if __name__ == "__main__":
    main()
When testing in my test environment, the URL http://localhost/?data=test just returns this error message below. Without the querystring, it just displays a blank page as expected.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 40: ordinal not in range(128)
What am I doing wrong and what should I do instead?
You are trying to output (e.g. print) a byte string as if it were ASCII, while it actually contains data in a different charset. This can happen with Latin-1 encoded data, for example. Try converting your input to unicode using
unicoded = unicode(non_unicode_string, source_encoding)
where source_encoding is something like 'cp1252' or 'iso-8859-1', and then sending the result to the output.
Have a look at this HOWTO. For a list of encodings supported by Python, see this
Check out this blog post on how to do unicode right in Python. In a nutshell, you're trying to decode a byte string (implicitly) as ASCII, and it contains a byte that isn't valid in that codec. Your string is probably in UTF-8.
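The implicit ASCII decode and the explicit alternative can be shown on a few bytes. This is a rough sketch in Python 3; the sample bytes are invented, and the choice of cp1252 is an assumption, since the right codec depends on where the bytes actually came from.

```python
# Hypothetical byte string containing 0xd6, the byte named in the traceback.
raw = b"S\xd6K"

# Decoding as ASCII fails, as in the error message from the question.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Assumption: the bytes are cp1252, where 0xd6 is LATIN CAPITAL LETTER O
# WITH DIAERESIS. A different source encoding would need a different codec.
text = raw.decode("cp1252")

print(ascii_error)
print(text)
```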
