So I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the solution worked.
This is related to a problem I had the other week: "python check if utf-8 string is uppercase".
Basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8')  # cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')

words = (u'\u041c\u041e\u0421\u041a\u0412\u0410',  # capital of Russia, all uppercase
         u'R\xc9SUM\xc9',                          # RESUME with accents
         u'R\xe9sum\xe9',                          # Resume with accents
         u'R\xe9SUM\xe9',)                         # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper():  # .isupper won't function on utf8
        title = etree.SubElement(sect, 'title')
        title.text = word
    else:
        item = etree.SubElement(sect, 'item')
        item.text = word

print >>outFile, etree.tostring(root, pretty_print=True, xml_declaration=True, encoding='utf-8')
It fails with the following traceback:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile, etree.tostring(root, pretty_print=True, xml_declaration=True, encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
But if I open the file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w'), it works perfectly.
So what's happening?
Since encoding='utf-8' is specified in etree.tostring(), is it encoding the output again?
If I keep codecs.open() and remove encoding='utf-8', the file then becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
But etree.tostring() is simply being printed, and the output is then redirected to a file that was created as a UTF-8 file?
Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and want to figure out why my solution works.
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
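To make the distinction concrete, here is a minimal sketch of both ways out (assuming Python 2 and lxml, as in the question). Note that lxml refuses to emit an XML declaration when serialising to unicode, so the second variant writes the declaration by hand:

import codecs
from lxml import etree

root = etree.Element('root')

# Option 1: let tostring() produce UTF-8 bytes and write them to a plain
# byte-oriented file -- this is why the open() version "just works":
with open('test1.xml', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True,
                           xml_declaration=True, encoding='utf-8'))

# Option 2: keep codecs.open() but feed it Unicode; encoding=unicode
# (the type, not a string) makes tostring() return a unicode string:
with codecs.open('test2.xml', 'w', encoding='utf-8') as f:
    f.write(u'<?xml version="1.0" encoding="utf-8"?>\n')
    f.write(etree.tostring(root, pretty_print=True, encoding=unicode))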
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root, u'sect')

words = [u'МОСКВА', u'RÉSUMÉ', u'Résumé', u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect, u'title')
        title.text = word
    else:
        item = etree.SubElement(sect, u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('text.xml', xml_declaration=True, encoding='utf-8')
In addition to MRAB's answer, some lines of code:
import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    # encoding=unicode (the type object, not a string) makes tostring()
    # return a unicode string, which is what the codecs writer expects
    f.write(etree.tostring(root, encoding=unicode))
Related
I have this text.ucs file which I am trying to decode using Python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried decoding it with utf-16 and utf-8:
content.decode('utf-16')
and got this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 32-33: illegal encoding
Please let me know if I am missing anything or if my approach is wrong.
Edit: a screenshot has been requested.
The string is encoded as UTF-16-BE (big endian); this works:
content.decode("utf-16-be")
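For completeness, a minimal sketch that reads the file in binary mode so the raw bytes reach the decoder untouched:

# 'rb' so the platform's text-mode handling can't mangle the bytes
with open('text.ucs', 'rb') as f:
    content = f.read().decode('utf-16-be')
print content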
As I understand it, you're using Python 2.x. The encoding parameter of the built-in open() was only added in Python 3.x, but you can use io.open instead (note the extra import):

import io

file = io.open('text.ucs', 'r', encoding='utf-8')
content = file.read()
print content
You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
    text = f.read()
Your string needs to be decoded with the utf-8 codec; you can pass the encoding when opening the file (Python 3):

f = open('text.ucs', 'r', encoding='utf-8')
print(f.read())
I'm just trying to load this JSON file (with non-ASCII characters) as a Python dictionary, but I keep getting this error:
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)
JSON file content:
"tooltip": {
    "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",
}
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    for line in f:
        data.append(json.loads(line.encode('utf-8', 'replace')))
You have several problems, as near as I can tell. First is the file encoding. When you open a file without specifying an encoding, it is opened with whatever locale.getpreferredencoding(False) returns. Since that may vary (especially on Windows machines), it's a good idea to explicitly use encoding="utf-8" for most JSON files. Because of your error message, I suspect that the file was opened with an ASCII encoding.
Next, the file is decoded from UTF-8 into Python strings as it is read by the file object. Each line has therefore already been decoded to a str and is ready for json to read. When you do line.encode('utf-8', 'replace'), you encode the line back into a bytes object, which json.loads (that is, "load string") can't handle.
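You can see that mismatch in isolation (Python 3.5 here; since 3.6, json.loads accepts bytes too):

>>> import json
>>> json.loads('{"a": 1}')                    # str works
{'a': 1}
>>> json.loads('{"a": 1}'.encode('utf-8'))    # bytes fail before Python 3.6
Traceback (most recent call last):
  ...
TypeError: the JSON object must be str, not 'bytes'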
Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid json, but it does look like one line of a pretty-printed json file containing a single json object. My guess is that you should read the entire file as 1 json object.
Putting it all together you get:
import json

with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
    data = json.load(f)

From its name, it's possible that this file is encoded as a Windows Portuguese code page. If so, the "cp860" encoding may work better.
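If that guess turns out to be right, only the encoding argument would change (untested, same hypothetical path):

import json

# Hypothetical: only if the file really is in the DOS Portuguese code page.
with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding='cp860') as f:
    data = json.load(f)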
I had the same problem; what worked for me was creating a regular expression and filtering every line from the JSON file (note this simply strips every character outside the listed ASCII set, accented letters included):

import re

REGEXP = r'[^A-Za-z0-9\'\:\.\;\-\?\!]+'
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()
Having a file with content similar to yours I can read the file in one simple shot:
>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
... data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}
You don't need to read each line. You have two options:
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.load(f))
Or, you can load all lines and pass them to the json module:
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.loads(''.join(f.readlines())))
Obviously, the first suggestion is the best.
I want to write the HTML of a website to a file I created. Though I decode to UTF-8, it still raises an error like this. If I use print(data1), the HTML is printed properly. I am using Python 3.5.0.
import re
import urllib.request
city = input("city name")
url = "http://www.weather-forecast.com/locations/"+city+"/forecasts/latest"
data = urllib.request.urlopen(url).read()
data1 = data.decode("utf-8")
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt","w")
f.write(data1)
You've opened a file with the default system encoding:
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt", "w")
You need to specify your encoding explicitly:
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt", "w", encoding='utf8')
See the open() function documentation:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
On your system, the default is a codec that cannot handle your data.
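For reference, you can check what that platform default is on your machine:

import locale
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows setups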
f = open("C:\\Users\\Gopal\\Desktop\\test\\scrape.txt","w",encoding='utf8')
f.write(data1)
This should work; it did for me.
I have this Python script that attempts to detect the character encoding of a text file (in this case, C# .cs source files, but they could be any text file) and then convert them from that character encoding and into UTF-8 (without BOM).
While chardet detects the encoding well enough and the script runs without errors, characters like © are encoded into $. So I assume there's something wrong with the script and my understanding of encoding in Python 2. Since converting files from UTF-8-SIG to UTF-8 works, I have a feeling that the problem is the decoding (reading) part and not the encoding (writing) part.
Can anyone tell me what I'm doing wrong? If switching to Python 3 is a solution, I'm all for it, I then just need help figuring out how to convert the script from running on version 2.7 to 3.4. Here's the script:
import os
import fnmatch
import codecs
from chardet.universaldetector import UniversalDetector

# from http://farmdev.com/talks/unicode/
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

def enforce_unicode():
    detector = UniversalDetector()
    for root, dirnames, filenames in os.walk('.'):
        for filename in fnmatch.filter(filenames, '*.cs'):
            detector.reset()
            filepath = os.path.join(root, filename)
            with open(filepath, 'r') as f:
                for line in f:
                    detector.feed(line)
                    if detector.done:
                        break
            detector.close()
            encoding = detector.result['encoding']
            if encoding and not encoding == 'UTF-8':
                print '%s -> UTF-8 %s' % (encoding.ljust(12), filepath)
                with codecs.open(filepath, 'r', encoding=encoding) as f:
                    content = ''.join(f.readlines())
                content = to_unicode_or_bust(content)
                with codecs.open(filepath, 'w', encoding='utf-8') as f:
                    f.write(content)

enforce_unicode()
I have tried to do content = content.decode(encoding).encode('utf-8') before writing the file, but that fails with the following error:
/usr/local/.../lib/python2.7/encodings/utf_8_sig.py:19: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if input[:3] == codecs.BOM_UTF8:
Traceback (most recent call last):
  File "./enforce-unicode.py", line 48, in <module>
    enforce_unicode()
  File "./enforce-unicode.py", line 43, in enforce_unicode
    content = content.decode(encoding).encode('utf-8')
  File "/usr/local/.../lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 87: ordinal not in range(128)
Ideas?
chardet simply got the detected codec wrong; your code is otherwise correct. Character detection is based on statistics, heuristics and plain guesses; it is not a foolproof method.
For example, the Windows 1252 codepage is very close to the Latin-1 codec; files encoded in the one can be decoded without error in the other. Telling the difference between a control code in one and a Euro symbol in the other usually takes a human being looking at the result.
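For example, the very same byte decodes to a control character in one codec and to the Euro sign in the other:

>>> b'\x80'.decode('latin-1')    # C1 control character U+0080
u'\x80'
>>> b'\x80'.decode('cp1252')     # Euro sign
u'\u20ac'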
I'd record the chardet guesses for each file; if a file turns out to be wrongly re-coded, you can then look at what other codecs would be close. All of the 1250-series codepages look a lot alike.
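A minimal sketch of that record-keeping, using chardet's one-shot detect() helper (the file names are hypothetical):

import chardet

# Log chardet's guess and confidence per file, so a wrongly re-coded
# file can be traced back to the codec that was guessed for it.
for filepath in ('Foo.cs', 'Bar.cs'):
    with open(filepath, 'rb') as f:
        guess = chardet.detect(f.read())
    print '%s: %s (confidence %.2f)' % (
        filepath, guess['encoding'], guess['confidence'])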
I am trying to use StringIO to feed ConfigObj.
I would like to do this in my unit tests, so that I can mock config "files", on the fly, depending on what I want to test in the configuration objects.
I have a whole bunch of things that I am taking care of in the configuration module (I am reading several conf files, aggregating and "formatting" information for the rest of the apps). However, in the tests, I am facing a Unicode error from hell. I think I have pinned down the problem to a minimal functioning example, which I have extracted and over-simplified for the purpose of this question.
I am doing the following:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import configobj
import io

def main():
    """Main stuff"""
    input_config = """
    [Header]
    author = PloucPlouc
    description = Test config

    [Study]
    name_of_study = Testing
    version = 9999
    """
    # Just not to trust my default encoding
    input_config = unicode(input_config, "utf-8")
    test_config_fileio = io.StringIO(input_config)
    print configobj.ConfigObj(infile=test_config_fileio, encoding="UTF8")

if __name__ == "__main__":
    main()
It produces the following traceback:
Traceback (most recent call last):
  File "test_configobj.py", line 101, in <module>
    main()
  File "test_configobj.py", line 98, in main
    print configobj.ConfigObj(infile=test_config_fileio, encoding='UTF8')
  File "/work/irlin168_1/USER/Apps/python272/lib/python2.7/site-packages/configobj-4.7.2-py2.7.egg/configobj.py", line 1242, in __init__
    self._load(infile, configspec)
  File "/work/irlin168_1/USER/Apps/python272/lib/python2.7/site-packages/configobj-4.7.2-py2.7.egg/configobj.py", line 1302, in _load
    infile = self._handle_bom(infile)
  File "/work/irlin168_1/USER/Apps/python272/lib/python2.7/site-packages/configobj-4.7.2-py2.7.egg/configobj.py", line 1442, in _handle_bom
    if not line.startswith(BOM):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
I am using Python-2.7.2 (32 bits) on Linux. My locales for the console and for the editor (Kile) are set to fr_FR.utf8.
I thought I could do this.
From the io.StringIO documentation, I got this:
The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care.
And from ConfigObj documentation, I can do this:
>>> config = ConfigObj('config.ini', encoding='UTF8')
>>> config['name']
u'Michael Foord'
and this:
infile: None
You don't need to specify an infile. If you omit it, an empty ConfigObj will be created. infile can be :
[...]
A StringIO instance or file object, or any object with a read method. The filename attribute of your ConfigObj will be None [5].
'encoding': None
By default ConfigObj does not decode the file/strings you pass it into Unicode [8]. If you want your config file as Unicode (keys and members) you need to provide an encoding to decode the file with. This encoding will also be used to encode the config file when writing.
My question is: why does it produce this error? What else did I not understand about (simple) Unicode handling?
By looking at this answer, I changed:
input_config = unicode(input_config, "utf8")
to (importing the codecs module beforehand):
input_config = unicode(input_config, "utf8").strip(codecs.BOM_UTF8.decode("utf8", "strict"))
in order to get rid of a possibly included byte order mark, but it did not help.
Thanks a lot
NB: I have the same traceback if I use StringIO.StringIO instead of io.StringIO.
This line:
input_config = unicode(input_config, "utf8")
is converting your input to Unicode, but this line:
print configobj.ConfigObj(infile=test_config_fileio, encoding="UTF8")
is declaring the input to be a UTF-8-encoded byte string. The error indicates a Unicode string was passed when a byte string was expected, so commenting out the first line above should resolve the issue. I don't have configobj at the moment so can't test it.
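Untested as well, but a sketch of that fix: keep the config as a byte string and let ConfigObj do the decoding itself. Note that io.StringIO only accepts unicode in Python 2, so the classic StringIO module is used instead (the NB above already observed it produces the same traceback with unicode input):

# -*- coding: utf-8 -*-
import StringIO
import configobj

input_config = """
[Header]
author = PloucPlouc
description = Test config
"""

# Byte string in, with the encoding argument describing how to decode it.
test_config_fileio = StringIO.StringIO(input_config)
print configobj.ConfigObj(infile=test_config_fileio, encoding="UTF8")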