Python: Why am I getting a UnicodeDecodeError?

I have the following code that searches through files using regular expressions and, if any matches are found, copies the file into a different directory.
import os
import gzip
import re
import shutil
def regEx1():
    os.chdir("C:/Users/David/myfiles")
    files = os.listdir(".")
    os.mkdir("C:/Users/David/NewFiles")
    regex_txt = input("Please enter the string your are looking for:")
    for x in (files):
        inputFile = open((x), "r")
        content = inputFile.read()
        inputFile.close()
        regex = re.compile(regex_txt, re.IGNORECASE)
        if re.search(regex, content) is not None:
            shutil.copy(x, "C:/Users/David/NewFiles")
When I run it I get the following error message:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python33\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 367: character maps to <undefined>
Could someone please explain why this message appears?

In Python 3, when you open a file for reading in text mode (r), the contained text is decoded to Unicode.
Since you didn't specify which encoding to use to read the file, the platform default (from locale.getpreferredencoding) is used, and that fails in this case: the traceback shows cp1252, which has no character assigned to byte 0x9d.
You need to either specify an encoding that can decode the file contents, or open the file in binary mode instead (and use b'' bytes patterns for your regular expressions).
See the Python Unicode HOWTO for more information.
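The two options above can be sketched roughly as follows in Python 3. The helper names search_text and search_bytes are made up for illustration, and UTF-8 is only an assumed encoding; the real files may need a different one.

```python
import re

def search_text(path, pattern, encoding="utf-8"):
    """Option 1: decode the file with an explicit encoding, then search the text."""
    with open(path, "r", encoding=encoding) as handle:
        content = handle.read()
    return re.search(pattern, content, re.IGNORECASE) is not None

def search_bytes(path, pattern):
    """Option 2: skip decoding and search the raw bytes."""
    with open(path, "rb") as handle:
        content = handle.read()
    # The pattern must be a bytes object here, e.g. b"needle".
    return re.search(pattern, content, re.IGNORECASE) is not None
```

The binary variant never raises a UnicodeDecodeError because nothing is decoded, but it only matches patterns that can be expressed as bytes.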

I'm not too familiar with Python 3.x, but the following may work:
inputFile = open(x, "r", encoding="utf8")

There's a similar question here:
Python: Traceback codecs.charmap_decode(input,self.errors,decoding_table)[0]
But you might want to try:
open((x), "r", encoding='UTF8')

Thank you very much for this solution. It helped me with another problem. I used:
exec(open("DIP6.py").read())
and I got this error because I have this symbol in a comment in DIP6.py:
# ● en première colonne
It works fine with:
exec(open("DIP6.py", encoding="utf8").read())
It also solves a problem with, for example,
print("été")
in DIP6.py, for which I got:
Ã©tÃ©
in the console.
Thank you :-)

Related

unable to decode this string using python

I have this text.ucs file which I am trying to decode using python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried doing decoding with utf-16, utf-8
content.decode('utf-16')
and getting error
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 32-33: illegal encoding
Please let me know if I am missing anything or my approach is wrong
Edit: Screenshot has been asked

The string is encoded as UTF16-BE (Big Endian), this works:
content.decode("utf-16-be")
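To see why the byte order matters, here is a small Python 3 sketch. The sample text is made up. Big-endian UTF-16 data without a byte order mark (BOM) gives the decoder nothing to detect the byte order from, so the codec has to be named explicitly.

```python
# Encode some sample text as big-endian UTF-16 without a byte order mark.
raw = "héllo".encode("utf-16-be")

# Naming the byte order explicitly recovers the original text.
text = raw.decode("utf-16-be")

# Decoding the same bytes as little-endian happens not to fail for this
# sample, but it pairs the bytes the wrong way round and yields other characters.
swapped = raw.decode("utf-16-le")
```

In Python 3 the same codec name can be passed straight to open(), for example open('text.ucs', encoding='utf-16-be').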

As I understand it, you are using Python 2.x, but the encoding parameter of open() was only added in Python 3.x. I am no expert in Python 2.x, but you can look into io.open, which accepts it. For example, try:
import io
file = io.open('text.ucs', 'r', encoding='utf-8')
content = file.read()
print content
Note that you need to import the io module first.

You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
    text = f.read()

Your string needs to be decoded as UTF-8. Here is what I did to decode it (in Python 2, the encoding argument requires io.open):
import io
f = io.open('text.ucs', 'r', encoding='utf-8')
print f.read()

How do I save this variable to a new text file?

So I'm trying to save some cipher text to a new text file that is named by the user, however when I run the code it displays this message:
Please enter the name you wish the file to be called: cipher
Traceback (most recent call last):
File "C:/Users/User/Documents/file figure.py", line 19, in <module>
f.write(cipher_text_write)
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8e' in position 1: character maps to <undefined>
I managed to figure out that it is the actual message that I want to save that is causing the problem. Any help will be appreciated!
Here's my code:
cipher_text = " «²²µ ³¿ ´§³« ¯¹ µ´«²² "
filename = input("Please enter the name you wish the file to be called: ")
cipher_text_write = str(cipher_text)
cipher_filename = filename + ".txt"
f = open(cipher_filename,"w+")
f.write(cipher_text_write)
f.close()

The important thing to understand here is that the result of an encryption is simply a stream of bits. These bits do not necessarily correspond to a legal character string.
Your cipher text contains characters that the file's default encoding (cp1252 on your system, as the traceback shows) cannot encode. There are many ways to solve this issue, but the easiest would be to open your file in binary mode and write bytes, or instead to encode the bits in something like base64, for example:
>>> import os
>>> import base64
>>> s = os.urandom(10000)
>>> encs = base64.b64encode(s)
>>> s2 = base64.b64decode(encs)
>>> s == s2
True
When you want to read the cipher text back, you need to open the file containing the cipher text in binary mode, or read and decode the base64 representation of the bits, in accordance with the solution you picked when writing to the file.
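Both options can be sketched in Python 3 roughly as follows. The function names are hypothetical, and the sketch assumes the cipher text is available as a bytes object rather than a str.

```python
import base64

def save_binary(path, cipher_bytes):
    """Write the raw cipher bytes; no text encoding is involved."""
    with open(path, "wb") as handle:
        handle.write(cipher_bytes)

def load_binary(path):
    with open(path, "rb") as handle:
        return handle.read()

def save_base64(path, cipher_bytes):
    """Write a base64 representation, which is plain ASCII text."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(base64.b64encode(cipher_bytes).decode("ascii"))

def load_base64(path):
    with open(path, "r", encoding="ascii") as handle:
        return base64.b64decode(handle.read())
```

Whichever pair is used for writing must also be used for reading. The base64 file is larger but can be opened in any text editor.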

Convert a bunch of files from guessed encoding to UTF-8

I have this Python script that attempts to detect the character encoding of a text file (in this case, C# .cs source files, but they could be any text file) and then convert them from that character encoding and into UTF-8 (without BOM).
While chardet detects the encoding well enough and the script runs without errors, characters like © are encoded into $. So I assume there's something wrong with the script and my understanding of encoding in Python 2. Since converting files from UTF-8-SIG to UTF-8 works, I have a feeling that the problem is the decoding (reading) part and not the encoding (writing) part.
Can anyone tell me what I'm doing wrong? If switching to Python 3 is a solution, I'm all for it, I then just need help figuring out how to convert the script from running on version 2.7 to 3.4. Here's the script:
import os
import glob
import fnmatch
import codecs
from chardet.universaldetector import UniversalDetector
# from http://farmdev.com/talks/unicode/
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj
def enforce_unicode():
    detector = UniversalDetector()
    for root, dirnames, filenames in os.walk('.'):
        for filename in fnmatch.filter(filenames, '*.cs'):
            detector.reset()
            filepath = os.path.join(root, filename)
            with open(filepath, 'r') as f:
                for line in f:
                    detector.feed(line)
                    if detector.done: break
            detector.close()
            encoding = detector.result['encoding']
            if encoding and not encoding == 'UTF-8':
                print '%s -> UTF-8 %s' % (encoding.ljust(12), filepath)
                with codecs.open(filepath, 'r', encoding=encoding) as f:
                    content = ''.join(f.readlines())
                content = to_unicode_or_bust(content)
                with codecs.open(filepath, 'w', encoding='utf-8') as f:
                    f.write(content)
enforce_unicode()
I have tried to do content = content.decode(encoding).encode('utf-8') before writing the file, but that fails with the following error:
/usr/local/.../lib/python2.7/encodings/utf_8_sig.py:19: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if input[:3] == codecs.BOM_UTF8:
Traceback (most recent call last):
File "./enforce-unicode.py", line 48, in <module>
enforce_unicode()
File "./enforce-unicode.py", line 43, in enforce_unicode
content = content.decode(encoding).encode('utf-8')
File "/usr/local/.../lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
(output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 87: ordinal not in range(128)
Ideas?

chardet simply got the detected codec wrong; your code is otherwise correct. Character detection is based on statistics, heuristics and plain guesses; it is not a foolproof method.
For example, the Windows 1252 codepage is very close to the Latin-1 codec; files encoded with the one encoding can be decoded without error in the other encoding. Detecting the difference between a control code in the one or a Euro symbol in the other usually takes a human being looking at the result.
I'd record the chardet guess for each file. If a file turns out to be wrongly re-coded, you need to look at what other codecs could be close. All of the 1250-series codepages look a lot alike.
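As a rough illustration of how close these codecs are, the Python 3 sketch below decodes the same bytes with several candidates so a person can compare the results. The helper name decode_candidates and the candidate list are assumptions, not part of chardet.

```python
def decode_candidates(raw, encodings=("cp1252", "latin-1", "cp1250")):
    """Decode raw bytes with each candidate codec for manual comparison."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # Record the failure instead of aborting the comparison.
            results[name] = None
    return results

# Byte 0x80 is the Euro sign in cp1252 but a control code in Latin-1,
# while byte 0xa9 is the copyright sign in both.
sample = b"10 \x80 \xa9"
candidates = decode_candidates(sample)
```

Both cp1252 and latin-1 decode the sample without error, so only a look at the resulting characters reveals which guess was right.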

UnicodeDecode issue -- writing to a SAS program file

I have received a large set of sas files which all need to have their filepaths altered.
The code I've written for that task is as follows:
import glob
import os
import sys
os.chdir(r"C:\path\subdir")
glob.glob('*.sas')
import os
fileLIST=[]
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        fileLIST.append(os.path.join(dirname, filename))
print fileLIST
import re
for fileITEM in set(fileLIST):
    dataFN=r"//path/subdir/{0}".format(fileITEM)
    dataFH=open(dataFN, 'r+')
    for row in dataFH:
        print row
        if re.findall('\.\.\.', str(row)) != []:
            dataSTR=re.sub('\.\.\.', "//newpath/newsubdir", row)
            print >> dataFH, dataSTR.encode('utf-8')
        else:
            print >> dataFH, row.encode('utf-8')
    dataFH.close()
The issues I have are twofold. First, it seems as though my code does not recognize the three sequential periods, even with each one escaped by a backslash. Second, I receive the error "UnicodeDecodeError: 'ascii' codec can't decode byte...".
Is it possible that SAS program files (.sas) are not utf-8? If so, is the fix as simple as knowing what file encoding they use?
The full traceback is as follows:
Traceback (most recent call last):
File "stringsubnew.py", line 26, in <module>
print >> dataFH, row.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 671: ordinal not in range(128)
Thanks in advance

The problem lies with the reading rather than the writing. You have to know which encoding the source file you are reading from uses and decode it appropriately.
Let's say the source file contains data encoded with iso-8859-1.
You can decode each row as you read it using str.decode():
my_row = row.decode('iso-8859-1')
Or you can open the file using codecs to take care of it for you.
import codecs
dataFH = codecs.open(dataFN, 'r+', 'iso-8859-1')
A good talk on this can be found at http://nedbatchelder.com/text/unipain.html
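For comparison, the same idea could look like this in Python 3: decode on read with the known source encoding, do the substitution on text, and encode on write. The function name rewrite_paths is hypothetical, and iso-8859-1 is only the assumed source encoding from the example above.

```python
import re
from pathlib import Path

def rewrite_paths(path, source_encoding="iso-8859-1",
                  new_prefix="//newpath/newsubdir"):
    """Replace every literal '...' in the file and save it as UTF-8."""
    # Decode once, on read, using the encoding the file was written in.
    text = Path(path).read_text(encoding=source_encoding)
    # Each dot is escaped so it matches a literal period.
    updated = re.sub(r"\.\.\.", new_prefix, text)
    # Encode once, on write.
    Path(path).write_text(updated, encoding="utf-8")
    return updated
```

Reading the whole file first and writing it afterwards also avoids reading and writing through the same 'r+' handle at once.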

Need help to figure out a solution to this UnicodeDecodeError

When I use this code (adapted from Stephen Holiday code - thanks, Stephen for your code!):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
"""
USSSALoader.py
"""
import os
import re
#import urllib2
from zipfile import ZipFile
import csv
import pickle
def getNameList():
    namesDict=extractNamesDict()
    maleNames=list()
    femaleNames=list()
    for name in namesDict:
        counts=namesDict[name]
        tuple=(name,counts[0],counts[1])
        if counts[0]>counts[1]:
            maleNames.append(tuple)
        elif counts[1]>counts[0]:
            femaleNames.append(tuple)
    names=(maleNames,femaleNames)
    return names
def extractNamesDict():
    zf=ZipFile('names.zip', 'r')
    filenames=zf.namelist()
    names=dict()
    genderMap={'M':0,'F':1}
    for filename in filenames:
        file=zf.open(filename,'r')
        rows=csv.reader(file, delimiter=',')
        for row in rows:
            name=row[0].upper()
            # name=row[0].upper().encode('utf-8')
            gender=genderMap[row[1]]
            count=int(row[2])
            if not names.has_key(name):
                names[name]=[0,0]
            names[name][gender]=names[name][gender]+count
        file.close()
        print '\tImported %s'%filename
    return names
if __name__ == "__main__":
    getNameList()
I got this error:
iterator = raw_query.Run(**kwargs)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1622, in Run
itr = Iterator(self.GetBatcher(config=config))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1601, in GetBatcher
return self.GetQuery().run(_GetConnection(), query_options)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1490, in GetQuery
filter_predicate=self.GetFilterPredicate(),
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1534, in GetFilterPredicate
property_filters.append(datastore_query.make_filter(name, op, values))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\datastore\datastore_query.py", line 107, in make_filter
properties = datastore_types.ToPropertyPb(name, values)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1745, in ToPropertyPb
pbvalue = pack_prop(name, v, pb.mutable_value())
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1556, in PackString
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: ordinal not in range(128)
This happens when I have names with non-ASCII characters (like "Chávez" or "Barañao"). I tried to fix this problem by doing this:
        for row in rows:
            # name=row[0].upper()
            name=row[0].upper().encode('utf-8')
            gender=genderMap[row[1]]
            count=int(row[2])
But, then, I got this other error:
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 17, in getNameList
namesDict=extractNamesDict()
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 43, in extractNamesDict
name=row[0].upper().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 3: ordinal not in range(128)
I also tried this:
def extractNamesDict():
    zf=ZipFile('names.zip', 'r', encode='utf-8')
    filenames=zf.namelist()
But ZipFile doesn't have such an argument.
So, how can I fix this and avoid the UnicodeDecodeError for non-ASCII names?
I'm using this code with GAE.

It looks like your first traceback is AppEngine-related. Are you building a loader that will populate the datastore? If so, seeing the code that comprises the models and does the put'ing would be helpful. I will probably be corrected by someone, but in order for that piece to work I believe you actually need to decode instead of encode (i.e. when you read the sheet prior to the put, convert the string to unicode by using decode('utf-8') or decode('latin1'), depending on your situation).
As far as your local code, I won't pretend to know the deep internals of Unicode handling, but I've generally used decode() and encode() to handle these types of situations. I believe the correct encoding to use depends on the underlying text (meaning you'd need to know if it were encoded utf-8 or latin-1, etc.). Here is a quick test with your example:
>>> s = 'Chávez'
>>> type(s)
<type 'str'>
>>> u = s.decode('latin1')
>>> type(u)
<type 'unicode'>
>>> e = u.encode('latin1')
>>> print e
Chávez
In this case, I needed to use latin1 to decode the encoded string (I was using the terminal), but in your situation using utf-8 may very well work.
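The question's code is Python 2, but the decode-on-read idea may be easier to see in a Python 3 sketch, where the bytes coming out of the archive are wrapped in a text layer before csv sees them. The function name extract_names is hypothetical, and UTF-8 is an assumption about how the files inside names.zip are encoded.

```python
import csv
import io
import zipfile

def extract_names(zip_path, encoding="utf-8"):
    """Sum the male/female counts per upper-cased name across all members."""
    names = {}
    gender_index = {"M": 0, "F": 1}
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            with archive.open(member) as raw:
                # Decode the member's bytes to text once, at the boundary.
                text = io.TextIOWrapper(raw, encoding=encoding, newline="")
                for row in csv.reader(text):
                    counts = names.setdefault(row[0].upper(), [0, 0])
                    counts[gender_index[row[1]]] += int(row[2])
    return names
```

Because every name is already Unicode text after the wrapper, no implicit ASCII decoding can happen later on.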

Unless I'm missing something, this line in the library:
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
should be:
pbvalue.set_stringvalue(value.decode(filename_encoding).encode('utf-8'))
with the value of filename_encoding passed in from your code if it is not stored in the zip archive somehow (and at least in the early versions of the format, I doubt it is stored). It's yet another occurrence of the classic error of assuming that bytes and "characters" are the same thing.
If you're feeling froggy, dive into the code and fix it, and maybe even contribute a patch. Otherwise, you'll have to write heroic code that checks for U+0080 and above in filenames and performs special handling.

In Python 2.7 (on Linux Mint 17.1), you must use:
hashtags=['transito','tránsito','ñandú','pingüino','fhürer']
for h in hashtags:
    u=h.decode('utf-8')
    print(u.encode('utf-8'))
This prints:
transito
tránsito
ñandú
pingüino
fhürer
