UnicodeEncodeError while writing data to an xml file - python

My aim is to write an XML file with a few tags whose values are in a regional language. I'm using Python to do this, with IDLE (Python GUI) for programming.
While I try to write the words to an XML file it gives the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
For now, I'm not using any xml writer library; instead, I'm opening a file "test.xml" and writing the data into it. This error is encountered by the line:
f.write(data)
If I replace the above write statement with print statement then it prints the data properly on the Python shell.
I'm reading the data from an Excel file which is not in UTF-8, UTF-16, or UTF-32; it's in some other encoding. Reading it as cp1252 works properly.
Any help in getting this data written to an XML file would be highly appreciated.

You should .decode your incoming cp1252 to get Unicode strings, and .encode them in utf-8 (by far the preferred encoding for XML) at the time you write, i.e.
f.write(unicodedata.encode('utf-8'))
where unicodedata is obtained by .decode('cp1252') on the incoming bytestrings.
It's possible to put lipstick on it by using the codecs module of the standard Python library to open the input and output files each with their proper encodings in lieu of plain open, but what I show is the underlying mechanism (and it's often, though not invariably, clearer and more explicit to apply it directly, rather than indirectly via codecs -- a matter of style and taste).
What does matter is the general principle: translate your input strings to Unicode as soon as you can after you obtain them, use Unicode throughout your processing, and translate them back to byte strings as late as you can, just before you output them. This gives you the simplest, most straightforward life!-)
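For example, a minimal sketch of that sandwich (Python 2; the input file name is a hypothetical stand-in for however you actually obtain your cp1252 bytestrings):

raw = open('input.txt', 'rb').read()   # hypothetical cp1252 input
data = raw.decode('cp1252')            # bytestring -> unicode, as early as possible
# ... do all processing on the unicode object ...
out = open('test.xml', 'wb')
out.write(data.encode('utf-8'))        # unicode -> UTF-8 bytes, only at write time
out.close()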

Related

Python not able to read "–" character from text file

Using Python, I am fetching some text data from an API and storing it in a text file after some transformations and then reading this text file from a different process.
There are no problems while reading data from API, but I am getting this error while reading the text file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 907: invalid start byte
The byte read as 0x96 is actually the "–" (en dash) character in the API data, and this error occurs only when the encoding argument is explicitly specified as 'utf-8'. It doesn't occur when an encoding is not explicitly passed to the open function while opening the text file.
My questions:
Why do we get this error only when encoding is specified? I think we should get the same error in the other case as well, since the default encoding is also 'UTF-8'. (Please correct me if I am wrong.)
Is it possible to resolve this issue without changing the way I am reading the text file? (i.e. Can I make any changes to the stage where I am creating this text file from API data?)
Really appreciate you looking into it. Thanks!
In open() the default encoding is platform dependent; you can find out the default for your system by checking what locale.getpreferredencoding() returns. This is from the documentation.
For the 2nd part of your question, since you are not getting an error when you do not specify utf-8 as the encoding, you could just use the output of locale.getpreferredencoding() as the encoding method.
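For example (the result is platform dependent; 'cp1252' is just a common answer on Windows):

import locale
print(locale.getpreferredencoding())   # e.g. 'cp1252' on many Windows systems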
If you are processing the text line by line you could also just substitute the character, since 0x96 (the cp1252 en dash) is considered non-printable:
import re
...
# replace the en dash character with a plain ASCII hyphen (0x2D)
line = re.sub(u'\x96', u'-', line)

Open Outlook .msg like a text file in Python?

I want to treat Outlook .msg file as string and check if a substring exists in it.
I thought importing the win32 library, which is suggested in similar SO threads, would be overkill.
Instead, I tried to just open the file the same way as a .txt file:
file_path = 'O:\\MAP\\177926 Delete comiitted position.msg'
mail = open(file_path)
mail_contents = mail.read()
print(mail_contents)
However, I get
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 870: character maps to <undefined>
Is there any decoding I can specify to make it work?
I have also tried
mail = open(file_path, encoding='utf-8')
which returns
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 0: invalid continuation byte
Unless you're willing to do a lot of work, you really should use a library for this.
First, a .msg file is a binary file, so the contents should not be read in as a string. A string is usually terminated with a null byte, and binary files can have a lot of those inside, which could mean you're not looking at all the data (might depend on the implementation).
Also, the .msg file can have plain ascii and/or unicode in different parts/blocks of the file, so it would be really hard to treat this as one string to search for a substring.
As an alternative you could save the mails as .eml (i.e. the plain text version of an e-mail), but there would still be some problems to overcome in order to search for a specific text:
All data in an e-mail is lower ASCII (1-127), which means special characters have to be encoded to lower-ASCII bytes. There are several different encodings for the headers (for example 'Subject'), body, and attachments.
Body text: can be plain text or HTML (or both). Lines and words can be split because there is a maximum line length. Different encodings can be used, even base64, in which you would never find the text you're looking for.
A lot more would have to be done to properly decode everything, but this should give you an idea of the work you would have to do in order to find the text you're looking for.
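That said, if a crude byte-level search is enough for your use case, here is a minimal sketch (the search term is a hypothetical example; the UTF-16-LE check is there because .msg files store many strings that way):

# Read the .msg as raw bytes and search without decoding -- a rough heuristic,
# not a substitute for a real .msg parser:
with open(file_path, 'rb') as f:
    contents = f.read()

needle = 'committed'   # hypothetical search term
found = (needle.encode('ascii') in contents
         or needle.encode('utf-16-le') in contents)
print(found)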
When you face these types of issues, it is good practice to try Python's Latin-1 encoding.
mail = open(file_path, encoding='Latin-1')
The Windows cp1252 encoding is often confused with Python's actual Latin-1. The latter maps all 256 possible byte values to the first 256 Unicode code points, so decoding can never fail.
See this for more information.
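A quick sketch demonstrating that property (every byte value decodes, and the round trip is lossless):

data = bytes(range(256))                 # every possible byte value
text = data.decode('latin-1')            # never raises
assert text.encode('latin-1') == data    # lossless round trip
print(hex(ord(text[0x90])))              # 0x90, the byte the 'charmap' codec rejected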

python - How to properly encode string in utf8 which is ISO8859-1 from xml

I'm using the following code in Python 2.7 to retrieve an XML file containing some German umlauts like ä, ü, ö, ß:
.....
    def getXML(self, url):
        xmlFile = urllib2.urlopen(self.url)
        xmlResponse = xmlFile.read()
        xmlFile.close()
        return xmlResponse

    def makeDict(self, xmlFile):
        data = xmltodict.parse(xmlFile)
        return data

    def saveJSON(self, dictionary):
        currentDayData = dictionary['speiseplan']['tag'][1]
        # Write the current day as JSON
        f = open('data.json', 'w')
        f.write(json.dumps(currentDayData))
        f.close()
        return True

# Execute
url = "path/to/xml/"
App = GetMensaJSON(url)
xml = GetMensaJSON.getXML(App, url)
dictionary = GetMensaJSON.makeDict(App, xml)
GetMensaJSON.saveJSON(App, dictionary)
The problem is that the XML file claims in its XML declaration that it is utf-8, but it isn't. By trying, I found out that it is iso8859_1.
So I wanted to reconvert from utf-8 to iso8859 and back to utf-8 to resolve the conflicts.
Is there an elegant way to resolve the missing umlauts? In my output, for example, I have \u00c3\u009f instead of ß and \u00c3\u00bc instead of ü.
I found this similar question but I can't get it to work How to parse an XML file with encoding declaration in Python?
Also I should add that I can't influence the way I get the xml.
The XML File Link can be found in the code.
The output from repr(xmlResponse) is
"<?xml version='1.0' encoding='utf-8'?>\n<speiseplan>\n<tag timestamp='1453676400'>\n<item language='de'>\n<category>Stammessen</category>\n<title>Gem\xc3\x83\xc2\xbcsebr\xc3\x83\xc2\xbche mit Backerbsen (25,28,31,33,34), paniertes H\xc3\x83\xc2\xa4hnchenbrustschnitzel (25) mit Paprikasauce (33,34), Pommes frites und Gem\xc3\x83\xc2\xbcse
You are trying to encode already encoded data. urllib2.urlopen() can only return you a bytestring, not unicode, so encoding makes little sense.
What happens instead is that Python is trying to be helpful here; if you insist on encoding bytes, then it'll decode those to unicode data first. And it'll use the default codec for that.
On top of that, XML documents are themselves responsible for declaring what codec should be used to decode them. The default codec is UTF-8; don't manually re-code the data yourself, leave that to an XML parser.
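A minimal sketch of that approach, under the same imports as the question (hand the bytestring straight to the parser and let it apply the declared encoding):

import urllib2
import xmltodict

raw = urllib2.urlopen(url).read()   # bytestring, exactly as served
data = xmltodict.parse(raw)         # the parser handles the encoding declaration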
If you have Mojibake data in your XML document, the best way to fix that is to do so after parsing. I recommend the ftfy package to do this for you.
You could manually 'fix' the encoding by first decoding as UTF-8, then encoding to Latin-1 again:
xmlResponse = xmlFile.read().decode('utf-8').encode('latin-1')
However, this makes the assumption that your data has been badly decoded as Latin-1 to begin with; this is not always a safe assumption. If it was decoded as Windows CP 1252, for example, then the best way to recover your data is to still use ftfy.
You could try using ftfy before parsing as XML, but this relies on the document not having used any non-ASCII elements outside of text and attribute content:
xmlResponse = ftfy.fix_text(
    xmlFile.read().decode('utf-8'),
    fix_entities=False, uncurl_quotes=False,
    fix_latin_ligatures=False).encode('utf-8')

'ascii' codec can't encode error when reading using Python to read JSON

Yet another person unable to find the correct magic incantation to get Python to print UTF-8 characters.
I have a JSON file. The JSON file contains string values. One of those string values contains the character "à". I have a Python program that reads in the JSON file and prints some of the strings in it. Sometimes when the program tries to print the string containing "à" I get the error
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 12: ordinal not in range(128)
This is hard to reproduce. Sometimes a slightly different program is able to print the string "à". A smaller JSON file containing only this string does not exhibit the problem. If I start sprinkling encode('utf-8') and decode('utf-8') around the code it changes what blows up in unpredictable ways. I haven't been able to create a minimal code fragment and input that exhibits this problem.
I load the JSON file like so.
with codecs.open(filename, 'r', 'utf-8') as f:
    j = json.load(f)
I'll pull out the offending string like so.
s = j['key']
Later I do a print that has s as part of it and see the error.
I'm pretty sure the original file is in UTF-8 because in the interactive command line
codecs.open(filename, 'r', 'utf-8').read()
returns a string but
codecs.open(filename, 'r', 'ascii').read()
gives an error about the ascii codec not being able to decode such-and-such a byte. The file size in bytes is identical to the number of characters returned by wc -c, and I don't see anything else that looks like a non-ASCII character, so I suspect the problem lies entirely with this one high-ASCII "à".
I am not making any explicit calls to str() in my code.
I've been through the Python Unicode HOWTO multiple times. I understand that I'm supposed to "sandwich" unicode handling. I think I'm doing this, but obviously there's something I still misunderstand.
Mostly I'm confused because it seems like if I specify 'utf-8' in the codecs.open call, everything should be happening in UTF-8. I don't understand how the ASCII codec still sneaks in.
What am I doing wrong? How do I go about debugging this?
Edit: Used io module in place of codecs. Same result.
Edit: I don't have a minimal example, but at least I have a minimal repro scenario.
I am printing an object derived from the strings in the JSON that is causing the problem. So the following gives an error.
print(myobj)
(Note that I am using from __future__ import print_function though that does not appear to make a difference.)
Putting an encode('utf-8') on the end of my object's __str__ function return value does not fix the bug. However changing the print line to this does.
print("%s" % myobj)
This looks wrong to me. I'd expect these two print calls to be equivalent.
I can make this work by doing the sys.setdefaultencoding hack:
import sys
reload(sys)
sys.setdefaultencoding("UTF-8")
But this is apparently a bad idea that can make Python malfunction in other ways.
What is the correct way to do this? I tried
env PYTHONIOENCODING=UTF-8 ./myscript.py
but that didn't work. (Unsurprisingly, since the issue is the default encoding, not the io encoding.)
When you write directly to a file, or redirect stdout to a file or pipe, the default encoding is ASCII, and you have to encode Unicode strings before writing them. With opened file handles you can set an encoding to have this happen automatically, but with print you must use the encode() method:
print s.encode('utf-8')
It is recommended to use the newer io module in place of codecs because it has an improved implementation and is forward compatible with Py3.x open().
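A sketch of the io-based version of the sandwich (hypothetical file names; io.open decodes on read and encodes on write for you):

import io

with io.open('input.json', 'r', encoding='utf-8') as f:
    text = f.read()                     # unicode in

# ... process text as unicode ...

with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)                       # encoded to UTF-8 automatically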

python noob question about codecs and utf-8

Using Python to pick up some pieces, so definitely a noob question here, but I didn't see a satisfactory answer.
I have a UTF-8 JSON file with some pieces that have graves, acutes, etc. I'm using codecs and have (for example):
str = codecs.open('../../publish_scripts/locations.json', 'r', 'utf-8')
locations = json.load(str)
for location in locations:
    print location['name']
For printing, does anything special need to be done? It's giving me the following:
'ascii' codec can't encode character u'\xe9' in position 5
It looks like the correct utf-8 value for e-acute. I suspect I'm doing something wrong with printing. Would the iteration cause it to lose its utf-8'ness?
PHP and Ruby versions handle the utf-8 piece fine; is there some looseness in those languages that python won't do?
thx
codecs.open() will decode the contents of the file using the codec you supplied (utf-8). You then have a Python unicode object (which behaves similarly to a string object).
Printing a unicode object will cause an implicit (behind-the-scenes) encode using the default codec, which is usually ascii. If ascii cannot encode all of the characters present, it will fail.
To print it, you should first encode it, thus:
for location in locations:
    print location['name'].encode('utf8')
EDIT:
For your info, json.load() actually takes a file-like object (which is what codecs.open() returns). What you have at that point is neither a string nor a unicode object, but an iterable wrapper around the file.
By default json.load() expects the file to be utf8 encoded, so your code snippet can be simplified:
locations = json.load(open('../../publish_scripts/locations.json'))
for location in locations:
    print location['name'].encode('utf8')
You're probably reading the file correctly. The error occurs when you're printing. Python tries to convert the unicode string to ascii, and fails on the character in position 5.
Try this instead:
print location['name'].encode('utf-8')
If your terminal is set to expect output in utf-8 format, this will print correctly.
It's the same as in PHP. UTF8 strings are good to print.
The standard IO streams are broken for non-ASCII character IO in Python 2 under some site.py setups. Basically, you need to call sys.setdefaultencoding('utf8') (or whatever the system locale's encoding is) very early in your script. With the site.py shipped in Ubuntu, you need imp.reload(sys) to make sys.setdefaultencoding available. Alternatively, you can wrap sys.stdout (and stdin and stderr) in unicode-aware readers/writers, which you can get from codecs.getreader / codecs.getwriter.
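A sketch of the wrapping approach (Python 2; this avoids the setdefaultencoding hack entirely):

import sys
import codecs

# Wrap stdout so every unicode string printed gets encoded as UTF-8:
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u'caf\xe9'   # prints 'café' instead of raising UnicodeEncodeError
                   # (assuming a UTF-8 terminal)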
