I have a large number of XML files in a folder that I am parsing into a CSV file. My code looks like this:
import xml.etree.ElementTree as ET
import csv
import os
fields = [
    ('ID', 'FHRSID'),
    ('businessname', 'BusinessName'),
    ('businesstype', 'BusinessType'),
    ('address1', 'AddressLine1'),
    ('address2', 'AddressLine2'),
    ('address3', 'AddressLine3'),
    ('address4', 'AddressLine4'),
    ('postcode', 'PostCode'),
    ('longitude', 'Geocode/Longitude'),
    ('latitude', 'Geocode/Latitude')]
path = '/***/****/****/XML'
for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    tree = ET.parse(fullname)
    with open(r'outputdata.csv', 'wb') as f_businesslist:
        csv_businessdata = csv.DictWriter(f_businesslist, fieldnames=[field for field, match in fields])
        csv_businessdata.writeheader()
        for node in tree.iter('EstablishmentDetail'):
            row = {}
            for field_name, match in fields:
                try:
                    row[field_name] = node.find(match).text
                except AttributeError:
                    row[field_name] = ''
            csv_businessdata.writerow(row)
It does what it is supposed to do, but then I get an encoding error like this:
Traceback (most recent call last):
File "./XMLtoCsv.py", line 42, in <module>
csv_businessdata.writerow(row)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 152, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 11: ordinal not in range(128)
Can anyone help? I have spent a lot of time reading through some similar problems but nothing seems to help. I am very new to this so I assume it is something stupid I have done or failed to do. Many thanks
You need to specify an encoding explicitly when you open the file. Note that the built-in open() in Python 2 has no encoding parameter, so this requires Python 3 (where the file should also be opened in text mode, not 'wb'):
with open(r'outputdata.csv', 'w', newline='', encoding='utf-8') as f_businesslist:
It seems you are using Python 2.7; I would suggest switching to Python 3.x.
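For illustration, a minimal Python 3 sketch of the same idea (the file name and the first two field names are taken from the question; the sample row is made up):

```python
import csv

# Python 3: open the CSV in text mode with an explicit encoding.
# newline='' is recommended by the csv module so the writer controls line endings.
with open('outputdata.csv', 'w', newline='', encoding='utf-8') as f_businesslist:
    writer = csv.DictWriter(f_businesslist, fieldnames=['ID', 'businessname'])
    writer.writeheader()
    # Non-ASCII text such as 'é' now round-trips without a UnicodeEncodeError.
    writer.writerow({'ID': '1', 'businessname': 'Café Touché'})
```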
Related
I have this text.ucs file which I am trying to decode using python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried doing decoding with utf-16, utf-8
content.decode('utf-16')
and getting error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 32-33: illegal encoding
Please let me know if I am missing anything or my approach is wrong
Edit: a screenshot was requested.
The string is encoded as UTF16-BE (Big Endian), this works:
content.decode("utf-16-be")
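For example (a small sketch; the bytes below are just the string 'hi' encoded as UTF-16-BE):

```python
# UTF-16-BE stores the high byte of each code unit first and, unlike the
# generic 'utf-16' codec, does not expect a byte-order mark (BOM).
raw = b'\x00h\x00i'
print(raw.decode('utf-16-be'))  # -> hi
```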
Ah, as I understand it you are using Python 2.x, but as far as I know the encoding parameter of open() was only added in Python 3.x. I am no master of Python 2.x, but you can look up io.open; for example, try:
file = io.open('text.ucs', 'r', encoding='utf-8')
content = file.read()
print content
but check whether you need to import the io module first.
You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
text = f.read()
Your file needs to be decoded with the utf-8 codec; you can decode it like this:
f = open('text.ucs', 'r', encoding='utf-8')
print(f.read())
I am generating an XML file using xml.etree.ElementTree in Python and then writing the generated XML to a file. One of the tags of the XML holds the details of all the software installed on the system. The XML output on the console on execution of the script is perfect, but when I try to place the output into a file I encounter the following error:
Traceback (most recent call last):
File "C:\xmltry.py", line 65, in <module>
f.write(prettify(top))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 4305: ordinal not in range(128)
Following is the script:
def prettify(elem):
    """Return a pretty-printed XML string for the Element.
    """
    rough_string = ElementTree.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="  ")

## Here starts populating elements inside xml file
top = Element('my_practice_document')
comment = Comment('details')
top.append(comment)
child = SubElement(top, 'my_information')
childs = SubElement(child, 'my_name')
childs.text = str(options.my_name)

# Following section is for retrieving list of software installed on the system
import wmi
w = wmi.WMI()
for p in w.Win32_Product():
    if (p.Version is not None) and (p.Caption is not None):
        child = SubElement(top, 'sys_info')
        child.text = p.Caption + " version " + p.Version

## Following portion places the xml output into test.xml file
with open("test.xml", 'w') as f:
    f.write(prettify(top))
When the script is executed I get the Unicode error. I searched on the internet and also tried the following:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
But this did not resolve my issue either. I want to get all the data that appears on the console into the file without missing anything. How can I achieve that? Thanks in advance for your assistance.
You need to specify an encoding for your output file; sys.setdefaultencoding doesn't do that for you.
Try
import codecs
with codecs.open("test.xml", 'w', encoding='utf-8') as f:
f.write(prettify(top))
I have received a large set of sas files which all need to have their filepaths altered.
The code I've written for that task is as follows:
import glob
import os
import sys

os.chdir(r"C:\path\subdir")
glob.glob('*.sas')

import os
fileLIST = []
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        fileLIST.append(os.path.join(dirname, filename))
print fileLIST

import re
for fileITEM in set(fileLIST):
    dataFN = r"//path/subdir/{0}".format(fileITEM)
    dataFH = open(dataFN, 'r+')
    for row in dataFH:
        print row
        if re.findall('\.\.\.', str(row)) != []:
            dataSTR = re.sub('\.\.\.', "//newpath/newsubdir", row)
            print >> dataFH, dataSTR.encode('utf-8')
        else:
            print >> dataFH, row.encode('utf-8')
    dataFH.close()
The issues I have are twofold: first, it seems as though my code does not recognize the three sequential periods, even when each is escaped with a backslash. Second, I receive an error "UnicodeDecodeError: 'ascii' codec can't decode byte...".
Is it possible that SAS program files (.sas) are not utf-8? If so, is the fix as simple as knowing what file encoding they use?
The full traceback is as follows:
Traceback (most recent call last):
File "stringsubnew.py", line 26, in <module>
print >> dataFH, row.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 671: ordinal not in range(128)
Thanks in advance
The problem lies with the reading rather than writing. You have to know what encoding lies within the source file you are reading from and decode it appropriately.
Let's say the source file contains data encoded with iso-8859-1
You can do this when reading using str.decode()
my_row = row.decode('iso-8859-1')
Or you can open the file using codecs to take care of it for you.
import codecs
dataFH = codecs.open(dataFN, 'r+', 'iso-8859-1')
A good talk on this can be found at http://nedbatchelder.com/text/unipain.html
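Putting it together, one possible pattern (a sketch only: the file name is made up, iso-8859-1 is an assumed source encoding, and the snippet creates its own sample file so it can run standalone) is to decode on read, substitute on the decoded text, and encode again on write:

```python
import io
import re

src_encoding = 'iso-8859-1'  # assumption: whatever the .sas files actually use

# Setup for the sketch: create a sample file containing the '...' placeholder.
with io.open('program.sas', 'w', encoding=src_encoding) as fh:
    fh.write(u'libname lib "..." ;\n')

# Decode while reading; from here on we work with unicode text.
with io.open('program.sas', 'r', encoding=src_encoding) as fh:
    text = fh.read()

# Do the replacement on decoded text, then encode again while writing.
text = re.sub(r'\.\.\.', '//newpath/newsubdir', text)
with io.open('program.sas', 'w', encoding=src_encoding) as fh:
    fh.write(text)
```

io.open works the same way in both Python 2 and Python 3, so this avoids the implicit ASCII decode entirely.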
I have the following code that searches through files using REs and, if any matches are found, moves the file into a different directory.
import os
import gzip
import re
import shutil
def regEx1():
    os.chdir("C:/Users/David/myfiles")
    files = os.listdir(".")
    os.mkdir("C:/Users/David/NewFiles")
    regex_txt = input("Please enter the string you are looking for:")
    for x in files:
        inputFile = open(x, "r")
        content = inputFile.read()
        inputFile.close()
        regex = re.compile(regex_txt, re.IGNORECASE)
        if re.search(regex, content) is not None:
            shutil.copy(x, "C:/Users/David/NewFiles")
When I run it I get the following error message:
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python33\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 367: character maps to <undefined>
Please could someone explain why this message appears?
In python 3, when you open a file for reading in text mode (r) it'll decode the contained text to unicode.
Since you didn't specify what encoding to use to read the file, the platform default (from locale.getpreferredencoding) is being used, and that fails in this case.
You need to either specify an encoding that can decode the file contents, or open the file in binary mode instead (and use b'' bytes patterns for your regular expressions).
See the Python Unicode HOWTO for more information.
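Both options can be sketched like this (the file name and contents are made up; latin-1 is used purely as an example encoding):

```python
import re

# Setup for the sketch: write a file containing a non-ASCII byte.
with open('sample.txt', 'wb') as f:
    f.write('caf\xe9 ole'.encode('latin-1'))  # 0xE9 = 'é' in latin-1

# Option 1: name the encoding that matches the file's actual bytes.
with open('sample.txt', 'r', encoding='latin-1') as f:
    content = f.read()
print(re.search('caf', content, re.IGNORECASE) is not None)  # -> True

# Option 2: open in binary mode and search with bytes patterns instead.
with open('sample.txt', 'rb') as f:
    raw = f.read()
print(re.search(b'caf', raw) is not None)  # -> True
```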
I'm not too familiar with Python 3.x, but the below may work:
inputFile = open(x, "r", encoding="utf8")
There's a similar question here:
Python: Traceback codecs.charmap_decode(input,self.errors,decoding_table)[0]
But you might want to try:
open((x), "r", encoding='UTF8')
Thank you very much for this solution. It helped me with another problem. I used:
exec (open ("DIP6.py").read ())
and I got this error because I have this symbol in a comment of DIP6.py :
# ● en première colonne
It works fine with :
exec (open ("DIP6.py", encoding="utf8").read ())
It also solves a problem with :
print("été") for example
in DIP6.py
I got :
été
in the console.
Thank you :-) .
When I use this code (adapted from Stephen Holiday's code - thanks, Stephen, for your code!):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
USSSALoader.py
"""
import os
import re
#import urllib2
from zipfile import ZipFile
import csv
import pickle

def getNameList():
    namesDict = extractNamesDict()
    maleNames = list()
    femaleNames = list()
    for name in namesDict:
        counts = namesDict[name]
        tuple = (name, counts[0], counts[1])
        if counts[0] > counts[1]:
            maleNames.append(tuple)
        elif counts[1] > counts[0]:
            femaleNames.append(tuple)
    names = (maleNames, femaleNames)
    return names

def extractNamesDict():
    zf = ZipFile('names.zip', 'r')
    filenames = zf.namelist()
    names = dict()
    genderMap = {'M': 0, 'F': 1}
    for filename in filenames:
        file = zf.open(filename, 'r')
        rows = csv.reader(file, delimiter=',')
        for row in rows:
            name = row[0].upper()
            # name=row[0].upper().encode('utf-8')
            gender = genderMap[row[1]]
            count = int(row[2])
            if not names.has_key(name):
                names[name] = [0, 0]
            names[name][gender] = names[name][gender] + count
        file.close()
        print '\tImported %s' % filename
    return names

if __name__ == "__main__":
    getNameList()
I got this error:
iterator = raw_query.Run(**kwargs)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1622, in Run
itr = Iterator(self.GetBatcher(config=config))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1601, in GetBatcher
return self.GetQuery().run(_GetConnection(), query_options)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1490, in GetQuery
filter_predicate=self.GetFilterPredicate(),
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore.py", line 1534, in GetFilterPredicate
property_filters.append(datastore_query.make_filter(name, op, values))
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\datastore\datastore_query.py", line 107, in make_filter
properties = datastore_types.ToPropertyPb(name, values)
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1745, in ToPropertyPb
pbvalue = pack_prop(name, v, pb.mutable_value())
File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\datastore_types.py", line 1556, in PackString
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: ordinal not in range(128)
This happens when I have names with non-ASCII characters (like "Chávez" or "Barañao"). I tried to fix this problem by doing this:
for row in rows:
    # name=row[0].upper()
    name = row[0].upper().encode('utf-8')
    gender = genderMap[row[1]]
    count = int(row[2])
But, then, I got this other error:
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 17, in getNameList
namesDict=extractNamesDict()
File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\USSSALoader.py", line 43, in extractNamesDict
name=row[0].upper().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 3: ordinal not in range(128)
I also tried this:
def extractNamesDict():
    zf = ZipFile('names.zip', 'r', encode='utf-8')
    filenames = zf.namelist()
But ZipFile doesn't have such argument.
So, how to fix that avoiding this UnicodeDecodeError for non-ASCII names?
I'm using this code with GAE.
It looks like your first traceback is AppEngine-related. Are you building a loader that will populate the datastore? If so, seeing the code that comprises the models and does the put'ing would be helpful. I will probably be corrected by someone, but in order for that piece to work I believe you actually need to decode instead of encode (i.e. when you read the sheet prior to the put, convert the string to unicode by using decode('utf-8') or decode('latin1'), depending on your situation).
As far as your local code, I won't pretend to know the deep internals of Unicode handling, but I've generally used decode() and encode() to handle these types of situations. I believe the correct encoding to use depends on the underlying text (meaning you'd need to know if it were encoded utf-8 or latin-1, etc.). Here is a quick test with your example:
>>> s = 'Chávez'
>>> type(s)
<type 'str'>
>>> u = s.decode('latin1')
>>> type(u)
<type 'unicode'>
>>> e = u.encode('latin1')
>>> print e
Chávez
In this case, I needed to use latin1 to decode the encoded string (I was using the terminal), but in your situation using utf-8 may very well work.
Unless I'm missing something, this line in the library:
pbvalue.set_stringvalue(unicode(value).encode('utf-8'))
should be:
pbvalue.set_stringvalue(value.decode(filename_encoding).encode('utf-8'))
And the value filename_encoding passed in from your code if not stored in the zip archive somehow (and at least in the early versions of the format, I doubt it's stored). It's yet another occurrence of the classic error of assuming that bytes and "characters" are the same thing.
If you're feeling froggy, dive into the code and fix it, and maybe even contribute a patch. Otherwise, you'll have to write heroic code that checks for U+0080 and above in filenames and performs special handling.
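That check could look something like this (a sketch; has_non_ascii is a hypothetical helper, not part of zipfile):

```python
def has_non_ascii(name):
    # Flag filenames containing code points >= U+0080, whose byte encoding
    # inside a zip archive is ambiguous without knowing the creator's encoding.
    return any(ord(ch) >= 0x80 for ch in name)

print(has_non_ascii('names.txt'))   # -> False
print(has_non_ascii('Chávez.txt'))  # -> True
```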
In Python 2.7 (and Linux Mint 17.1), you must use:
hashtags = ['transito', 'tránsito', 'ñandú', 'pingüino', 'fhürer']
for h in hashtags:
    u = h.decode('utf-8')
    print(u.encode('utf-8'))
transito
tránsito
ñandú
pingüino
fhürer