Python String Binary Concatenation Error: Unicode Decode Exception

Python String Binary Concatenation Error: Unicode Decode Exception - python

I'm facing an issue while trying to concatenate strings with gzipped content
content = "Some Long Content"
out = StringIO.StringIO()
with gzip.GZipFile(fileobj=out, mode='w') as f:
f.write(content)
gzipped_content = out.getvalue()
part1 = 'Something'
part2 = 'SomethingElse'
complete_content = part1 + part2 + gzipped_content
During Execution, this causes a UnicodeDecodeError
complete_content = part1 + part2 + gzipped_content
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)
I'm unable to figure out why an ascii decode is required for String Concatenation.
Is there a way around to make the concatenation happen?

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 295: character maps to <undefined>

I am trying to run this program from a book. I have created the file called 'alice_2.txt'
def count_words(filename):
"""Count the approximate number of words in a file."""
try:
with open(filename) as f_obj:
contents = f_obj.read()
except FileNotFoundError:
msg = "Sorry, the file " + filename + " does not exist."
print(msg)
else:
# Count approximate number of words in the file.
words = contents.split()
num_words = len(words)
print("The file " + filename + " has about " + str(num_words) +
" words.")
filename = 'alice_2.txt'
count_words(filename)
But I keep getting this error message
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 295: character maps to <undefined>
Can anyone explain what this means, and how to solve it?

You are trying to use an encoding which cannot store the character you have in file.
for example ɛ can't be opened in ascii since it doesn't have valid ascii code.
try to open the file using utf-8.
with open(filename, encoding='utf8') as f_obj:
pass
# DO your stuff

Getting Null Byte using io.StringIO

I have the below code
stream = io.StringIO(csv_file.stream.read().decode('utf-8-sig'), newline=None) // error is here
reader = csv.DictReader(stream)
list_of_entity = []
line_no, prev_len = 1, 0,
for line in reader:
While executing the above code I got the below error.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 252862: invalid start byte
Later to fix this I tried the below.
stream = io.StringIO(csv_file.stream.read().decode('unicode_escape'), newline=None)
reader = csv.DictReader(stream)
list_of_entity = []
line_no, prev_len = 1, 0,
for line in reader:// error is here
when i change decode as unicode_escape it thrown the error "
_csv.Error: line contains NULL byte" at above highlighted comment line.
There is null byte present in csv, I want to ignore or replace it.
can anyone help on this.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9'

I use python to get json data from bing api
accountKeyEnc = base64.b64encode(accountKey + ':' + accountKey)
headers = {'Authorization': 'Basic ' + accountKeyEnc}
req = urllib2.Request(bingUrl, headers = headers)
response = urllib2.urlopen(req)
content = response.read()
data = json.loads(content)
for i in range(0,6):
print data["d"]["results"][i]["Description"]
But I got error
print data["d"]["results"][0]["Description"]
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 11: ordinal not in range(128)

Your problem is that you are reading Unicode from the Bing API and then failing to explicitly convert it to ASCII. There does not exist a good mapping between the two. Prefix all of your const strings with u so that they will be seen as Unicode strings, see if that helps.

Error parsing emails using Python's email module when the encoding is in shift_jis

I am getting an error that says "UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 2-3: illegal multibyte sequence" when I try to use my email parser to decode a shift_jis encoded email and convert it to unicode. The code and email can be found below:
import email.header
import base64
import sys
import email
def getrawemail():
line = ' '
raw_email = ''
while line:
line = sys.stdin.readline()
raw_email += line
return raw_email
def getheader(subject, charsets):
for i in charsets:
if isinstance(i, str):
encoding = i
break
if subject[-2] == "?=":
encoded = subject[5 + len(encoding):len(subject) - 2]
else:
encoded = subject[5 + len(encoding):]
return (encoding, encoded)
def decodeheader((encoding, encoded)):
decoded = base64.b64decode(encoded)
decoded = unicode(decoded, encoding)
return decoded
raw_email = getrawemail()
msg = email.message_from_string(raw_email)
subject = decodeheader(getheader(msg["Subject"], msg.get_charsets()))
print subject
Email: http://pastebin.com/L4jAkm5R
I have read on another Stack Overflow question that this may be related to a difference between how Unicode and shift_jis are encoded (they referenced this Microsoft Knowledge Base article). If anyone knows what in my code could be causing it to not work, or if this is even reasonably fixable, I would very much appreciate finding out how.

Starting with this string:
In [124]: msg['Subject']
Out[124]: '=?ISO-2022-JP?B?GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
=?ISO-2022-JP?B? means the string is ISO-2022-JP encoded, then base64 encoded.
In [125]: msg['Subject'].lstrip('=?ISO-2022-JP?B?')
Out[125]: 'GyRCNS5KfSRLJEgkRiRiQmdAWiRKJCpDTiRpJDskLCQiJGo'
Unfortunately, trying to reverse that process results in an error:
In [126]: base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?'))
TypeError: Incorrect padding
Reading this SO answer lead me to try adding '?=' to the end of the string:
In [130]: print(base64.b64decode(msg['Subject'].lstrip('=?ISO-2022-JP?B?')+'?=').decode('ISO-2022-JP'))
貴方にとても大切なお知らせがあり
According to google translate, this may be translated as "You know there is a very important".
So it appears the subject line has been truncated.

UnicodeDecodeError with Django's request.FILES

I have the following code in the view call..
def view(request):
body = u""
for filename, f in request.FILES.items():
body = body + 'Filename: ' + filename + '\n' + f.read() + '\n'
On some cases I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 7470: ordinal not in range(128)
What am I doing wrong? (I am using Django 1.1.)
Thank you.

Django has some utilities that handle this (smart_unicode, force_unicode, smart_str). Generally you just need smart_unicode.
from django.utils.encoding import smart_unicode
def view(request):
body = u""
for filename, f in request.FILES.items():
body = body + 'Filename: ' + filename + '\n' + smart_unicode(f.read()) + '\n'

you are appending f.read() directly to unicode string, without decoding it, if the data you are reading from file is utf-8 encoded use utf-8, else use whatever encoding it is in.
decode it first and then append to body e.g.
data = f.read().decode("utf-8")
body = body + 'Filename: ' + filename + '\n' + data + '\n'

Anurag's answer is correct. However another problem here is you can't for certain know the encoding of the files that users upload. It may be useful to loop over a tuple of the most common ones till you get the correct one:
encodings = ('windows-xxx', 'iso-yyy', 'utf-8',)
for e in encodings:
try:
data = f.read().decode(e)
break
except UnicodeDecodeError:
pass

If you are not in control of the file encoding for files that can be uploaded , you can guess what encoding a file is in using the Universal Encoding Detector module chardet.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python String Binary Concatenation Error: Unicode Decode Exception - python

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 295: character maps to <undefined>

Getting Null Byte using io.StringIO

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9'

Error parsing emails using Python's email module when the encoding is in shift_jis

UnicodeDecodeError with Django's request.FILES

Categories

Resources