UnicodeDecodeError with Django's request.FILES - python

I have the following code in a view:
def view(request):
    body = u""
    for filename, f in request.FILES.items():
        body = body + 'Filename: ' + filename + '\n' + f.read() + '\n'
In some cases I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 7470: ordinal not in range(128)
What am I doing wrong? (I am using Django 1.1.)
Thank you.

Django has some utilities that handle this (smart_unicode, force_unicode, smart_str). Generally you just need smart_unicode.
from django.utils.encoding import smart_unicode

def view(request):
    body = u""
    for filename, f in request.FILES.items():
        body = body + 'Filename: ' + filename + '\n' + smart_unicode(f.read()) + '\n'

You are appending f.read() directly to a unicode string without decoding it. If the data you are reading from the file is UTF-8 encoded, decode it as UTF-8; otherwise use whatever encoding the file is actually in.
Decode it first and then append it to body, e.g.
data = f.read().decode("utf-8")
body = body + 'Filename: ' + filename + '\n' + data + '\n'

Anurag's answer is correct. However, another problem here is that you can't know for certain the encoding of the files that users upload. It may be useful to loop over a tuple of the most common encodings until one of them succeeds:
encodings = ('windows-xxx', 'iso-yyy', 'utf-8',)
for e in encodings:
    try:
        f.seek(0)  # rewind before each attempt; otherwise every read() after the first returns nothing
        data = f.read().decode(e)
        break
    except UnicodeDecodeError:
        pass

If you are not in control of the encoding of files that can be uploaded, you can guess what encoding a file is in using the Universal Encoding Detector module, chardet.
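For example (a minimal sketch, assuming chardet is installed; chardet.detect() takes the raw bytes and returns a guessed encoding plus a confidence score):
import chardet

raw = f.read()
guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-2', 'confidence': 0.87, ...}
if guess['encoding'] is not None:
    data = raw.decode(guess['encoding'])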

Related

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 295: character maps to <undefined>

I am trying to run this program from a book. I have created a file called 'alice_2.txt':
def count_words(filename):
    """Count the approximate number of words in a file."""
    try:
        with open(filename) as f_obj:
            contents = f_obj.read()
    except FileNotFoundError:
        msg = "Sorry, the file " + filename + " does not exist."
        print(msg)
    else:
        # Count the approximate number of words in the file.
        words = contents.split()
        num_words = len(words)
        print("The file " + filename + " has about " + str(num_words) +
              " words.")

filename = 'alice_2.txt'
count_words(filename)
But I keep getting this error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 295: character maps to <undefined>
Can anyone explain what this means, and how to solve it?
You are trying to read the file with an encoding that cannot represent some character it contains.
For example, ɛ cannot be decoded as ASCII, since it has no valid ASCII code.
Try opening the file as UTF-8:
with open(filename, encoding='utf-8') as f_obj:
    contents = f_obj.read()  # then do your stuff with contents
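If you can't be sure the file really is UTF-8, another standard-library option (not from the original answer) is to pass an errors handler, assuming a few mangled characters are acceptable for an approximate word count:
with open(filename, encoding='utf-8', errors='replace') as f_obj:
    contents = f_obj.read()  # undecodable bytes become U+FFFD instead of raising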

Cannot write German characters scraped from XPath to CSV file

I am trying to write information containing German umlaut characters into a CSV. When I write only the first parameter, "name", it comes out correctly. If I write "name" and "institution" though, I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0308' in position 71: character maps to <undefined>
As you can see in the code below, I tried encoding and decoding the text using different combinations of encodings.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())

# this is the header of the csv
with open('/filepath/result.csv', 'w', encoding='utf-8') as f:
    f.write("name, institution, \n")

l = list(range(1148, 1153))
for i in l:
    url = 'webaddress.com' + str(i)
    driver.get(url)
    name = driver.find_elements_by_xpath('//div[@style="width:600px; display:inline-block;"]')[0].text
    name = '\"' + name + '\"'
    institution = driver.find_elements_by_xpath('//div[@style="width:600px; display:inline-block;"]')[1].text
    institution = '\"' + institution + '\"'
    print(str(i) + ': ' + name, '\n', str(i) + ': ' + institution, '\n')
    print(institution.encode('utf-8'))
    print(institution.encode('utf-8').decode('utf-8'))
    print(institution.encode('utf-8').decode('ISO-8859-15'))
    with open('/filepath/result.csv', 'a', encoding='utf-8') as f:
        f.write(name + ',' + institution + '\n')
driver.close()
The results that show up in the CSV when I set all encodings to UTF-8 look like the one where I encode UTF-8 and decode ISO-8859-15 (latin-9). I got the same error as above when I encoded latin-1 and decoded UTF-8.
Thank you for your help.
You seem to be confused about the purpose of encode. Why would you print(institution.encode('utf-8').decode('utf-8'))? That is simply equivalent to print(institution)!
I'm guessing your traceback comes from one of the prints rather than from the write(). Try taking out the offending one(s), or simply figure out how to print Unicode to your console and then just do that.
You should probably also read Ned Batchelder's Pragmatic Unicode.
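For reference, one way to make a Windows console survive Unicode output is to reconfigure stdout (a sketch, not part of the original answer, assuming Python 3.7+, where sys.stdout.reconfigure() exists):
import sys

# Re-encode stdout as UTF-8; 'replace' stops unprintable characters from raising
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print('Universit\u00e4t G\u00f6ttingen')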
Add this line at the top of your foo.py file:
# -*- coding: UTF-8 -*-
As an alternative you can use the io module as follows:
import io

# this is the header of the csv
with io.open('/filepath/result.csv', 'w', encoding='utf-8') as f:
    f.write("name, institution, \n")
and later:
with io.open('/filepath/result.csv', 'a', encoding='utf-8') as f:
    f.write(name + ',' + institution + '\n')

Python String Binary Concatenation Error: Unicode Decode Exception

I'm facing an issue while trying to concatenate strings with gzipped content:
import gzip
import StringIO

content = "Some Long Content"
out = StringIO.StringIO()
with gzip.GzipFile(fileobj=out, mode='w') as f:
    f.write(content)
gzipped_content = out.getvalue()

part1 = 'Something'
part2 = 'SomethingElse'
complete_content = part1 + part2 + gzipped_content
During execution, this causes a UnicodeDecodeError:
complete_content = part1 + part2 + gzipped_content
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)
I'm unable to figure out why an ASCII decode is required for string concatenation.
Is there a way around to make the concatenation happen?
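A likely explanation (not stated in the thread): this is Python 2, and concatenating a byte string with a unicode string makes Python implicitly decode the bytes with the ASCII codec; gzip output starts with the bytes \x1f\x8b, hence the failure at position 1. Assuming part1 and part2 are really unicode objects in the original code (e.g. via unicode_literals), a sketch of a workaround is to keep every operand a byte string:
part1 = b'Something'        # byte strings, never unicode
part2 = b'SomethingElse'
complete_content = part1 + part2 + gzipped_content  # bytes + bytes: no implicit decode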

Loading a json file in python

I've got multiple files to load as JSON. They are all formatted the same way, but for one of them I can't load it without raising an exception. This is where you can find the file:
File
I wrote the following code:
import json

def from_seed_data_extract_summoners():
    summonerIds = set()
    for i in range(1, 11):
        file_name = 'data/matches%s.json' % i
        print file_name
        with open(file_name) as data_file:
            data = json.load(data_file)
        for match in data['matches']:
            for summoner in match['participantIdentities']:
                summonerIds.add(summoner['player']['summonerId'])
    return summonerIds
The error occurs when I do the following: json.load(data_file). I suppose there is a special character but I can't find it and don't know how to replace it. The error generated is:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xeb in position 6: invalid continuation byte
Do you know how I can get rid of it?
Your JSON is trying to force the data into unicode, not just a simple string. You've got some embedded character (probably a space or something not very noticeable) that cannot be forced into unicode.
How to get string objects instead of Unicode ones from JSON in Python?
That is a great thread about making JSON objects more manageable in Python.
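The recipe that thread converges on looks roughly like this (a Python 2 sketch; byteify is a hypothetical helper name):
def byteify(obj):
    # Recursively turn unicode objects in a decoded JSON structure into UTF-8 str
    if isinstance(obj, dict):
        return {byteify(k): byteify(v) for k, v in obj.iteritems()}
    if isinstance(obj, list):
        return [byteify(x) for x in obj]
    if isinstance(obj, unicode):
        return obj.encode('utf-8')
    return obj

data = byteify(json.load(data_file))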
replace file_name = 'data/matches%s.json' % i with file_name = 'data/matches%i.json' % i
Note that json.load() takes an open file object, not a file name, so you do need the with open(...) wrapper:
with open(file_name) as data_file:
    data = json.load(data_file)
EDIT:
def from_seed_data_extract_summoners():
    summonerIds = set()
    for i in range(1, 11):
        file_name = 'data/matches%i.json' % i
        with open(file_name) as f:
            data = json.load(f, encoding='utf-8')
        for match in data['matches']:
            for summoner in match['participantIdentities']:
                summonerIds.add(summoner['player']['summonerId'])
    return summonerIds
Try:
json.loads(unicode(data_file.read(), errors='ignore'))
or:
json.loads(unidecode.unidecode(unicode(data_file.read(), errors='ignore')))
(For the second, you would need to install unidecode.)
Try:
json.loads(data_file.read(), encoding='utf-8')

Python UnicodeEncodeError with pre-decoded UTF-8

I'm trying to parse through a bunch of logfiles (up to 4GiB) in a tar.gz file. The source files come from RedHat 5.8 Server systems and SunOS 5.10, processing has to be done on WindowsXP.
I iterate through the tar.gz files, read the files, decode the file contents to UTF-8 and parse them with regular expressions before further processing.
When I'm writing out the processed data along with the raw-data that was read from the tar.gz, I get the following error:
Traceback (most recent call last):
File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 375, in <module>
p.analyze_longtails()
File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 196, in analyze_longtails
oFile.write(entries[key]['source'] + '\n')
File "C:\Python\3.2\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 24835-24836: character maps
to <undefined>
Here's the part where I read and parse the logfiles:
def getSalesSoaplogEntries(perfid=None):
    for tfile in parser.salestarfiles:
        path = os.path.join(parser.logpath, tfile)
        if os.path.isfile(path):
            if tarfile.is_tarfile(path):
                tar = tarfile.open(path, 'r:gz')
                for tarMember in tar.getmembers():
                    if 'salescomponent-soap.log' in tarMember.name:
                        tarMemberFile = tar.extractfile(tarMember)
                        content = tarMemberFile.read().decode('UTF-8', 'surrogateescape')
                        for m in parser.soaplogregex.finditer(content):
                            entry = {}
                            entry['time'] = datetime(datetime.now().year,
                                                     int(m.group('month')), int(m.group('day')),
                                                     int(m.group('hour')), int(m.group('minute')),
                                                     int(m.group('second')), int(m.group('millis')) * 1000)
                            entry['perfid'] = m.group('perfid')
                            entry['direction'] = m.group('direction')
                            entry['payload'] = m.group('payload')
                            entry['file'] = tarMember.name
                            entry['source'] = m.group(0)
                            sm = parser.soaplogmethodregex.match(entry['payload'])
                            if sm:
                                entry['method'] = sm.group('method')
                            if entry['time'] >= parser.starttime and entry['time'] <= parser.endtime:
                                if perfid and entry['perfid'] == perfid:
                                    yield entry
                tar.members = []
And here's the part where I write the processed information out along with the raw data (it's an aggregation of all log entries for one specific process):
if len(entries) > 0:
    time = perfentry['time']
    filename = time.isoformat('-').replace(':', '').replace('-', '') + 'longtail_' + perfentry['perfid'] + '.txt'
    oFile = open(os.path.join(parser.logpath, filename), 'w')
    oFile.write(perfentry['source'] + '\n')
    oFile.write('------\n')
    for key in sorted(entries.keys()):
        oFile.write('------\n')
        oFile.write(entries[key]['source'] + '\n')  # <-- here it is failing
What I don't get is this: if it is correct to read the files as UTF-8, why is it not possible to just write them out as UTF-8 again? What am I doing wrong?
Your output file is using the default encoding for your OS, which is not UTF-8. Use codecs.open instead of open and specify encoding='utf-8'.
oFile = codecs.open(os.path.join(parser.logpath,filename), 'w', encoding='utf-8')
See http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
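Note that the traceback shows Python 3.2, where the built-in open() also accepts encoding directly; and since the input was decoded with 'surrogateescape', mirroring that handler on output keeps any smuggled undecodable bytes from raising again (a sketch, not from the original answer):
# errors='surrogateescape' mirrors the decode above, so bytes that were not
# valid UTF-8 in the source logs survive the round trip instead of raising
oFile = open(os.path.join(parser.logpath, filename), 'w',
             encoding='utf-8', errors='surrogateescape')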
