Convert a bunch of files from guessed encoding to UTF-8 - python

I have this Python script that attempts to detect the character encoding of a text file (in this case, C# .cs source files, but they could be any text file) and then convert them from that character encoding and into UTF-8 (without BOM).
While chardet detects the encoding well enough and the script runs without errors, characters like © are encoded into $. So I assume there's something wrong with the script and my understanding of encoding in Python 2. Since converting files from UTF-8-SIG to UTF-8 works, I have a feeling that the problem is the decoding (reading) part and not the encoding (writing) part.
Can anyone tell me what I'm doing wrong? If switching to Python 3 is a solution, I'm all for it, I then just need help figuring out how to convert the script from running on version 2.7 to 3.4. Here's the script:
import os
import glob
import fnmatch
import codecs
from chardet.universaldetector import UniversalDetector
# from http://farmdev.com/talks/unicode/
def to_unicode_or_bust(obj, encoding='utf-8'):
if isinstance(obj, basestring):
if not isinstance(obj, unicode):
obj = unicode(obj, encoding)
return obj
def enforce_unicode():
detector = UniversalDetector()
for root, dirnames, filenames in os.walk('.'):
for filename in fnmatch.filter(filenames, '*.cs'):
detector.reset()
filepath = os.path.join(root, filename)
with open(filepath, 'r') as f:
for line in f:
detector.feed(line)
if detector.done: break
detector.close()
encoding = detector.result['encoding']
if encoding and not encoding == 'UTF-8':
print '%s -> UTF-8 %s' % (encoding.ljust(12), filepath)
with codecs.open(filepath, 'r', encoding=encoding) as f:
content = ''.join(f.readlines())
content = to_unicode_or_bust(content)
with codecs.open(filepath, 'w', encoding='utf-8') as f:
f.write(content)
enforce_unicode()
I have tried to do content = content.decode(encoding).encode('utf-8') before writing the file, but that fails with the following error:
/usr/local/.../lib/python2.7/encodings/utf_8_sig.py:19: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if input[:3] == codecs.BOM_UTF8:
Traceback (most recent call last):
File "./enforce-unicode.py", line 48, in <module>
enforce_unicode()
File "./enforce-unicode.py", line 43, in enforce_unicode
content = content.decode(encoding).encode('utf-8')
File "/usr/local/.../lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
(output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 87: ordinal not in range(128)
Ideas?

chardet simply got the detected codec it wrong, your code is otherwise correct. Character detection is based on statistics, heuristics and plain guesses, it is not a foolproof method.
For example, the Windows 1252 codepage is very close to the Latin-1 codec; files encoded with the one encoding can be decoded without error in the other encoding. Detecting the difference between a control code in the one or a Euro symbol in the other usually takes a human being looking at the result.
I'd record the chardet guesses for each file, if the file turns out to be wrongly re-coded, you need to look at what other codecs could be close. All of the 1250-series codepages look a lot alike.

Related

Python reading a PE file and changing resource section

I am trying to open a Windows PE file and alter some strings in the resource section.
f = open('c:\test\file.exe', 'rb')
file = f.read()
if b'A'*10 in file:
s = file.replace(b'A'*10, newstring)
In the resource section I have a string that is just:
AAAAAAAAAA
And I want to replace that with something else. When I read the file I get:
\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A\x00A
I have tried opening with UTF-16 and decoding as UTF-16 but then I run into a error:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1604-1605: illegal encoding
Everyone I seen who had the same issue fixed by decoding to UTF-16. I am not sure why this doesn't work for me.
If resource inside binary file is encoded to utf-16, you shouldn't change encoding.
try this
f = open('c:\\test\\file.exe', 'rb')
file = f.read()
unicode_str = u'AAAAAAAAAA'
encoded_str = unicode_str.encode('UTF-16')
if encoded_str in file:
s = file.replace(encoded_str, new_utf_string.encode('UTF-16'))
inside binary file everything is encoded, keep in mind

unable to decode this string using python

I have this text.ucs file which I am trying to decode using python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried doing decoding with utf-16, utf-8
content.decode('utf-16')
and getting error
Traceback (most recent call last): File "", line 1, in
File "C:\Python27\lib\encodings\utf_16.py", line 16, in
decode
return codecs.utf_16_decode(input, errors, True) UnicodeDecodeError: 'utf16' codec can't decode bytes in position
32-33: illegal encoding
Please let me know if I am missing anything or my approach is wrong
Edit: Screenshot has been asked
The string is encoded as UTF16-BE (Big Endian), this works:
content.decode("utf-16-be")
oooh, as i understand you using python 2.x.x but encoding parameter was added only in python 3.x.x as I know, i am doesn't master of python 2.x.x but you can search in google about io.open for example try:
file = io.open('text.usc', 'r',encoding='utf-8')
content = file.read()
print content
but chek do you need import io module or not
You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
text = f.read()
your string need to Be Uncoded With The Coding utf-8 you can do What I Did Now for decode your string
f = open('text.usc', 'r',encoding='utf-8')
print f

UnicodeDecode issue -- writing to a SAS program file

I have received a large set of sas files which all need to have their filepaths altered.
The code I've written for that tasks is as follows:
import glob
import os
import sys
os.chdir(r"C:\path\subdir")
glob.glob('*.sas')
import os
fileLIST=[]
for dirname, dirnames, filenames in os.walk('.'):
for filename in filenames:
fileLIST.append(os.path.join(dirname, filename))
print fileLIST
import re
for fileITEM in set(fileLIST):
dataFN=r"//path/subdir/{0}".format(fileITEM)
dataFH=open(dataFN, 'r+')
for row in dataFH:
print row
if re.findall('\.\.\.', str(row)) != []:
dataSTR=re.sub('\.\.\.', "//newpath/newsubdir", row)
print >> dataFH, dataSTR.encode('utf-8')
else:
print >> dataFH, row.encode('utf-8')
dataFH.close()
The issues I have are two fold: First, it seems as though my code does not recognize the three sequential periods, even when separated by a backslash. Second, I receive an error "UnicodeDecodeError: 'ascii' codec can't decode byte...'
Is it possible that SAS program files (.sas) are not utf-8? If so, is the fix as simple as knowing what file encoding they use?
The full traceback is as follows:
Traceback (most recent call last):
File "stringsubnew.py", line 26, in <module>
print >> dataFH, row.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 671: ordinal not in range(128)
Thanks in advance
The problem lies with the reading rather than writing. You have to know what encoding lies within the source file you are reading from and decode it appropriately.
Let's say the source file contains data encoded with iso-8859-1
You can do this when reading using str.decode()
my_row = row.decode('iso-8859-1')
Or you can open the file using codecs to take care of it for you.
import codecs
dataFH = codecs.open(dataFN, 'r+', 'iso-8859-1')
A good talk on this can be found at http://nedbatchelder.com/text/unipain.html

Need help to figure out a solution to this UnicodeDecodeError

When I use this code (adapted from Stephen Holiday code - thanks, Stephen for your code!):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
"""
USSSALoader.py
"""
import os
import re
#import urllib2
from zipfile import ZipFile
import csv
import pickle
def getNameList():
namesDict=extractNamesDict()
maleNames=list()
femaleNames=list()
for name in namesDict:
counts=namesDict[name]
tuple=(name,counts[0],counts[1])
if counts[0]>counts[1]:
maleNames.append(tuple)
elif counts[1]>counts[0]:
femaleNames.append(tuple)
names=(maleNames,femaleNames)
return names
def extractNamesDict():
zf=ZipFile('names.zip', 'r')
filenames=zf.namelist()
names=dict()
genderMap={'M':0,'F':1}
for filename in filenames:
file=zf.open(filename,'r')
rows=csv.reader(file, delimiter=',')
for row in rows:
name=row[0].upper()
# name=row[0].upper().encode('utf-8')
gender=genderMap[row[1]]
count=int(row[2])
if not names.has_key(name):
names[name]=[0,0]
names[name][gender]=names[name][gender]+count
file.close()
print '\tImported %s'%filename
return names
if __name__ == "__main__":
getNameList()
I got this error:
iterator = raw_query.Run(**kwargs)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1622, in Run
itr = Iterator(self.GetBatcher(config=config))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1601, in GetBatcher
return self.GetQuery().run(_GetConnection(), query_options)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1490, in GetQuery
filter_predicate=self.GetFilterPredicate(),
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1534, in GetFilterPredicate
property_filters.append(datastore_query.make_filter(name, op, values))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\datastore\datastore_query.py", line 107, in make_filter
properties = datastore_types.ToPropertyPb(name, values)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1745, in ToPropertyPb
pbvalue = pack_prop(name, v, pb.mutable_value())
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1556, in PackString
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: ordinal not in range(128)
This happens when I have names with non-ASCII caracters (like "Chávez" or "Barañao"). I tried to fix this problem doing this:
for row in rows:
# name=row[0].upper()
name=row[0].upper().encode('utf-8')
gender=genderMap[row[1]]
count=int(row[2])
But, then, I got this other error:
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 17, in getNameList
namesDict=extractNamesDict()
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 43, in extractNamesDict
name=row[0].upper().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 3: ordinal not in range(128)
I also tried this:
def extractNamesDict():
zf=ZipFile('names.zip', 'r', encode='utf-8')
filenames=zf.namelist()
But ZipFile doesn't have such argument.
So, how to fix that avoiding this UnicodeDecodeError for non-ASCII names?
I'm using this code with GAE.
It looks like your first traceback is AppEngine-related. Are you building a loader that will populate the datastore? If so, seeing the code that comprises the models and does the put'ing would be helpful. I will probably be corrected by someone, but in order for that piece to work I believe you actually need to decode instead of encode (i.e. when you read the sheet prior to the put, convert the string to unicode by using decode('utf-8') or decode('latin1'), depending on your situation).
As far as your local code, I won't pretend to know the deep internals of Unicode handling, but I've generally used decode() and encode() to handle these types of situations. I believe the correct encoding to use depends on the underlying text (meaning you'd need to know if it were encoded utf-8 or latin-1, etc.). Here is a quick test with your example:
>>> s = 'Chávez'
>>> type(s)
<type 'str'>
>>> u = s.decode('latin1')
>>> type(u)
<type 'unicode'>
>>> e = u.encode('latin1')
>>> print e
Chávez
In this case, I needed to use latin1 to decode the encoded string (I was using the terminal), but in your situation using utf-8 may very well work.
Unless I'm missing something, this line in the library:
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
should be:
pbvalue.set_stringvalue(value.decode(filename_encoding).encode('utf-8'))
And the value filename_encoding passed in from your code if not stored in the zip archive somehow (and at least in the early versions of the format, I doubt it's stored). It's yet another occurrence of the classic error of assuming that bytes and "characters" are the same thing.
If you're feeling froggy, dive into the code and fix it, and maybe even contribute a patch. Otherwise, you'll have to write heroic code that checks for U+0080 and above in filenames and performs special handling.
In python 2.7 ( and linux Mint 17.1) , you must use:
hashtags=['transito','tránsito','ñandú','pingüino','fhürer']
for h in hashtags:
u=h.decode('utf-8')
print(u.encode('utf-8'))
transito
tránsito
ñandú
pingüino
fhürer

Why does printing to a utf-8 file fail?

So I ran into a problem this afternoon, I was able to solve it, but I don't quite understand why it worked.
this is related to a problem I had the other week: python check if utf-8 string is uppercase
basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree
outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
u'R\xc9SUM\xc9', # RESUME with accents
u'R\xe9sum\xe9', # Resume with accents
u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
print word
if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8
title = etree.SubElement(sect,'title')
title.text = word
else:
item = etree.SubElement(sect,'item')
item.text = word
print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
it fails with the following:
Traceback (most recent call last):
File "./temp.py", line 25, in
print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
File "/usr/lib/python2.7/codecs.py",
line 691, in write
return self.writer.write(data) File "/usr/lib/python2.7/codecs.py",
line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec
can't decode byte 0xd0 in position 66:
ordinal not in range(128)
but if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w') it works perfectly.
So whats happening??
since encoding='utf-8' is specified in etree.tostring() is it encoding the file again?
if I leave codecs.open() and remove encoding='utf-8' the file then becomes an ascii file. Why? becuase etree.tostring() has a default encoding of ascii I persume?
but etree.tostring() is simply being written to stdout, and is then redirect to a file that was created as a utf-8 file??
is print>> not workings as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? what is going on here. It might be trivial, but I am obviously a bit confused and have a desire to figure out why my solution works,
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you using a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree
root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')
words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']
for word in words:
print word
if word.isupper():
title = etree.SubElement(sect,u'title')
title.text = word
else:
item = etree.SubElement(sect,u'item')
item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')
In addition to MRABs answer some lines of code:
import codecs
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
f.write(etree.tostring(root, encoding=unicode))

Categories