UnicodeDecode issue -- writing to a SAS program file - python

I have received a large set of sas files which all need to have their filepaths altered.
The code I've written for that tasks is as follows:
import glob
import os
import sys
os.chdir(r"C:\path\subdir")
glob.glob('*.sas')
import os
fileLIST=[]
for dirname, dirnames, filenames in os.walk('.'):
for filename in filenames:
fileLIST.append(os.path.join(dirname, filename))
print fileLIST
import re
for fileITEM in set(fileLIST):
dataFN=r"//path/subdir/{0}".format(fileITEM)
dataFH=open(dataFN, 'r+')
for row in dataFH:
print row
if re.findall('\.\.\.', str(row)) != []:
dataSTR=re.sub('\.\.\.', "//newpath/newsubdir", row)
print >> dataFH, dataSTR.encode('utf-8')
else:
print >> dataFH, row.encode('utf-8')
dataFH.close()
The issues I have are two fold: First, it seems as though my code does not recognize the three sequential periods, even when separated by a backslash. Second, I receive an error "UnicodeDecodeError: 'ascii' codec can't decode byte...'
Is it possible that SAS program files (.sas) are not utf-8? If so, is the fix as simple as knowing what file encoding they use?
The full traceback is as follows:
Traceback (most recent call last):
File "stringsubnew.py", line 26, in <module>
print >> dataFH, row.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 671: ordinal not in range(128)
Thanks in advance

The problem lies with the reading rather than writing. You have to know what encoding lies within the source file you are reading from and decode it appropriately.
Let's say the source file contains data encoded with iso-8859-1
You can do this when reading using str.decode()
my_row = row.decode('iso-8859-1')
Or you can open the file using codecs to take care of it for you.
import codecs
dataFH = codecs.open(dataFN, 'r+', 'iso-8859-1')
A good talk on this can be found at http://nedbatchelder.com/text/unipain.html

Related

Python parsing XML to CSV Encoding issues

I have a large number of XML files in a folder that I am parsing into a CSV file. My code looks like this:
import xml.etree.ElementTree as ET
import csv
import os
fields = [
('ID', 'FHRSID'),
('businessname', 'BusinessName'),
('businesstype', 'BusinessType'),
('address1', 'AddressLine1'),
('address2', 'AddressLine2'),
('address3', 'AddressLine3'),
('address4', 'AddressLine4'),
('postcode', 'PostCode'),
('longitude', 'Geocode/Longitude'),
('latitude', 'Geocode/Latitude')]
path = '/***/****/****/XML'
for filename in os.listdir(path):
if not filename.endswith('.xml'): continue
fullname = os.path.join(path, filename)
tree = ET.parse(fullname)
with open(r'outputdata.csv', 'wb') as f_businesslist:
csv_businessdata = csv.DictWriter(f_businesslist, fieldnames=[field for field, match in fields])
csv_businessdata.writeheader()
for node in tree.iter('EstablishmentDetail'):
row = {}
for field_name, match in fields:
try:
row[field_name] = node.find(match).text
except AttributeError as e:
row[field_name] = ''
csv_businessdata.writerow(row)
It does what it is supposed to do but then i get an encoding error like this:
Traceback (most recent call last):
File "./XMLtoCsv.py", line 42, in <module>
csv_businessdata.writerow(row)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 152, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 11: ordinal not in range(128)
Can anyone help? I have spent a lot of time reading through some similar problems but nothing seems to help. I am very new to this so I assume it is something stupid I have done or failed to do. Many thanks
You need to encode Unicode explicitly when you are opening the file.
with open(r'outputdata.csv', 'wb', encoding='utf-8) as f_businesslist:
It seems you are using python 2.7. I would also suggest switching to python 3.x.

Read all .txt file in one folder, then copy text (based from line) to another .txt file

I am trying to write Python code to read all .txt files from one directory then copy it (based from line) to another .txt file:
import os
import glob
path = '/Users/Documents/*.txt'
f1 = open(os.path.expanduser('/Users/Documents/test.txt'),'w')
for data in glob.glob(path):
with open(data) as script:
for line in script:
script.readline()
if 'Subject: ' in line:
f1.write(line)
My code was working, but it can only copy some text from the files, the rest gives an error message, like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1658: ordinal not in range(128)
How can I fix this? Anyone?

Convert a bunch of files from guessed encoding to UTF-8

I have this Python script that attempts to detect the character encoding of a text file (in this case, C# .cs source files, but they could be any text file) and then convert them from that character encoding and into UTF-8 (without BOM).
While chardet detects the encoding well enough and the script runs without errors, characters like © are encoded into $. So I assume there's something wrong with the script and my understanding of encoding in Python 2. Since converting files from UTF-8-SIG to UTF-8 works, I have a feeling that the problem is the decoding (reading) part and not the encoding (writing) part.
Can anyone tell me what I'm doing wrong? If switching to Python 3 is a solution, I'm all for it, I then just need help figuring out how to convert the script from running on version 2.7 to 3.4. Here's the script:
import os
import glob
import fnmatch
import codecs
from chardet.universaldetector import UniversalDetector
# from http://farmdev.com/talks/unicode/
def to_unicode_or_bust(obj, encoding='utf-8'):
if isinstance(obj, basestring):
if not isinstance(obj, unicode):
obj = unicode(obj, encoding)
return obj
def enforce_unicode():
detector = UniversalDetector()
for root, dirnames, filenames in os.walk('.'):
for filename in fnmatch.filter(filenames, '*.cs'):
detector.reset()
filepath = os.path.join(root, filename)
with open(filepath, 'r') as f:
for line in f:
detector.feed(line)
if detector.done: break
detector.close()
encoding = detector.result['encoding']
if encoding and not encoding == 'UTF-8':
print '%s -> UTF-8 %s' % (encoding.ljust(12), filepath)
with codecs.open(filepath, 'r', encoding=encoding) as f:
content = ''.join(f.readlines())
content = to_unicode_or_bust(content)
with codecs.open(filepath, 'w', encoding='utf-8') as f:
f.write(content)
enforce_unicode()
I have tried to do content = content.decode(encoding).encode('utf-8') before writing the file, but that fails with the following error:
/usr/local/.../lib/python2.7/encodings/utf_8_sig.py:19: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if input[:3] == codecs.BOM_UTF8:
Traceback (most recent call last):
File "./enforce-unicode.py", line 48, in <module>
enforce_unicode()
File "./enforce-unicode.py", line 43, in enforce_unicode
content = content.decode(encoding).encode('utf-8')
File "/usr/local/.../lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
(output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 87: ordinal not in range(128)
Ideas?
chardet simply got the detected codec it wrong, your code is otherwise correct. Character detection is based on statistics, heuristics and plain guesses, it is not a foolproof method.
For example, the Windows 1252 codepage is very close to the Latin-1 codec; files encoded with the one encoding can be decoded without error in the other encoding. Detecting the difference between a control code in the one or a Euro symbol in the other usually takes a human being looking at the result.
I'd record the chardet guesses for each file, if the file turns out to be wrongly re-coded, you need to look at what other codecs could be close. All of the 1250-series codepages look a lot alike.

Python: Why am I getting a UnicodeDecodeError?

I have the following code that search through files using RE's and if any matches are found it move the file into a different directory.
import os
import gzip
import re
import shutil
def regEx1():
os.chdir("C:/Users/David/myfiles")
files = os.listdir(".")
os.mkdir("C:/Users/David/NewFiles")
regex_txt = input("Please enter the string your are looking for:")
for x in (files):
inputFile = open((x), "r")
content = inputFile.read()
inputFile.close()
regex = re.compile(regex_txt, re.IGNORECASE)
if re.search(regex, content)is not None:
shutil.copy(x, "C:/Users/David/NewFiles")
When I run it i get the following error message:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python33\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 367: character maps to <undefined>
Please could someone explain why this message appears
In python 3, when you open a file for reading in text mode (r) it'll decode the contained text to unicode.
Since you didn't specify what encoding to use to read the file, the platform default (from locale.getpreferredencoding) is being used, and that fails in this case.
You need to either specify an encoding that can decode the file contents, or open the file in binary mode instead (and use b'' bytes patterns for your regular expressions).
See the Python Unicode HOWTO for more information.
I'm not too familiar with python 3x, but the below may work.
inputFile = open((x, encoding="utf8"), "r")
There's a similar question here:
Python: Traceback codecs.charmap_decode(input,self.errors,decoding_table)[0]
But you might want to try:
open((x), "r", encoding='UTF8')
Thank you very much for this solution. It helps me for another subject, I used :
exec (open ("DIP6.py").read ())
and I got this error because I have this symbol in a comment of DIP6.py :
# ● en première colonne
It works fine with :
exec (open ("DIP6.py", encoding="utf8").read ())
It also solves a problem with :
print("été") for example
in DIP6.py
I got :
été
in the console.
Thank you :-) .

Need help to figure out a solution to this UnicodeDecodeError

When I use this code (adapted from Stephen Holiday code - thanks, Stephen for your code!):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# -*- coding: utf-8 -*-
"""
USSSALoader.py
"""
import os
import re
#import urllib2
from zipfile import ZipFile
import csv
import pickle
def getNameList():
namesDict=extractNamesDict()
maleNames=list()
femaleNames=list()
for name in namesDict:
counts=namesDict[name]
tuple=(name,counts[0],counts[1])
if counts[0]>counts[1]:
maleNames.append(tuple)
elif counts[1]>counts[0]:
femaleNames.append(tuple)
names=(maleNames,femaleNames)
return names
def extractNamesDict():
zf=ZipFile('names.zip', 'r')
filenames=zf.namelist()
names=dict()
genderMap={'M':0,'F':1}
for filename in filenames:
file=zf.open(filename,'r')
rows=csv.reader(file, delimiter=',')
for row in rows:
name=row[0].upper()
# name=row[0].upper().encode('utf-8')
gender=genderMap[row[1]]
count=int(row[2])
if not names.has_key(name):
names[name]=[0,0]
names[name][gender]=names[name][gender]+count
file.close()
print '\tImported %s'%filename
return names
if __name__ == "__main__":
getNameList()
I got this error:
iterator = raw_query.Run(**kwargs)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1622, in Run
itr = Iterator(self.GetBatcher(config=config))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1601, in GetBatcher
return self.GetQuery().run(_GetConnection(), query_options)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1490, in GetQuery
filter_predicate=self.GetFilterPredicate(),
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1534, in GetFilterPredicate
property_filters.append(datastore_query.make_filter(name, op, values))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\datastore\datastore_query.py", line 107, in make_filter
properties = datastore_types.ToPropertyPb(name, values)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1745, in ToPropertyPb
pbvalue = pack_prop(name, v, pb.mutable_value())
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1556, in PackString
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: ordinal not in range(128)
This happens when I have names with non-ASCII caracters (like "Chávez" or "Barañao"). I tried to fix this problem doing this:
for row in rows:
# name=row[0].upper()
name=row[0].upper().encode('utf-8')
gender=genderMap[row[1]]
count=int(row[2])
But, then, I got this other error:
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 17, in getNameList
namesDict=extractNamesDict()
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 43, in extractNamesDict
name=row[0].upper().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 3: ordinal not in range(128)
I also tried this:
def extractNamesDict():
zf=ZipFile('names.zip', 'r', encode='utf-8')
filenames=zf.namelist()
But ZipFile doesn't have such argument.
So, how to fix that avoiding this UnicodeDecodeError for non-ASCII names?
I'm using this code with GAE.
It looks like your first traceback is AppEngine-related. Are you building a loader that will populate the datastore? If so, seeing the code that comprises the models and does the put'ing would be helpful. I will probably be corrected by someone, but in order for that piece to work I believe you actually need to decode instead of encode (i.e. when you read the sheet prior to the put, convert the string to unicode by using decode('utf-8') or decode('latin1'), depending on your situation).
As far as your local code, I won't pretend to know the deep internals of Unicode handling, but I've generally used decode() and encode() to handle these types of situations. I believe the correct encoding to use depends on the underlying text (meaning you'd need to know if it were encoded utf-8 or latin-1, etc.). Here is a quick test with your example:
>>> s = 'Chávez'
>>> type(s)
<type 'str'>
>>> u = s.decode('latin1')
>>> type(u)
<type 'unicode'>
>>> e = u.encode('latin1')
>>> print e
Chávez
In this case, I needed to use latin1 to decode the encoded string (I was using the terminal), but in your situation using utf-8 may very well work.
Unless I'm missing something, this line in the library:
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
should be:
pbvalue.set_stringvalue(value.decode(filename_encoding).encode('utf-8'))
And the value filename_encoding passed in from your code if not stored in the zip archive somehow (and at least in the early versions of the format, I doubt it's stored). It's yet another occurrence of the classic error of assuming that bytes and "characters" are the same thing.
If you're feeling froggy, dive into the code and fix it, and maybe even contribute a patch. Otherwise, you'll have to write heroic code that checks for U+0080 and above in filenames and performs special handling.
In python 2.7 ( and linux Mint 17.1) , you must use:
hashtags=['transito','tránsito','ñandú','pingüino','fhürer']
for h in hashtags:
u=h.decode('utf-8')
print(u.encode('utf-8'))
transito
tránsito
ñandú
pingüino
fhürer

Categories