UnicodeDecodeError: 'utf8' codec can't decode bytes - python

I'm parsing an XML file that uses the "iso-8859-15" encoding.
Words like 'Zürich' and 'Aktienrückk' get converted to character references such as "&#228;".
I tried these suggestions:
p = ElementTree.fromstring(u'<p>found "\u62c9\u67cf \u591a\u516c \u56ed"</p>'.encode('utf8'))
>>> p.text
u'found "\u62c9\u67cf \u591a\u516c \u56ed"'
>>> print p.text
but I get errors like UnicodeDecodeError: 'ascii' codec can't decode byte
Even this doesn't help:
content = unicode(mystring.strip(codecs.BOM_UTF8), 'utf-8')
I tried many suggestions from Stack Overflow, but I couldn't get any of them to work.
I need to write the parsed content back to an HTML file with the same characters, such as 'ü'.

Try this:
from xml.etree import ElementTree
p = ElementTree.fromstring(u'<p>found "\u62c9\u67cf \u591a\u516c \u56ed"</p>'.encode('utf8'))
print p.text.encode('utf8')
found "拉柏 多公 园"
For your example:
# -*- coding: utf-8 -*-
from xml.etree import ElementTree
text = 'Aktienrückk'.decode('utf8')
print text.encode('utf8')
Aktienrückk
Don't forget to put # -*- coding: utf-8 -*- at the beginning of the file.
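For the original ISO-8859-15 case, one possible approach is to hand the raw bytes to the parser and encode only when writing the HTML. This is a minimal sketch, assuming Python 3 (io.open also exists in Python 2.6+); the sample document and the output file name are made up for illustration:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
from xml.etree import ElementTree

# Hypothetical ISO-8859-15 input; byte 0xFC is "ü" in that encoding.
xml_bytes = b'<?xml version="1.0" encoding="iso-8859-15"?><city>Z\xfcrich</city>'

# Pass the raw bytes so the parser can honour the declared encoding.
city = ElementTree.fromstring(xml_bytes)
name = city.text  # Unicode text, independent of the input encoding

# Encode only at the output boundary, here as UTF-8.
out_path = os.path.join(tempfile.mkdtemp(), 'out.html')
with io.open(out_path, 'w', encoding='utf-8') as out:
    out.write(u'<html><head><meta charset="utf-8"></head><body><p>')
    out.write(name)
    out.write(u'</p></body></html>\n')

# Read the file back as raw bytes to see what was actually stored.
with io.open(out_path, 'rb') as check:
    written = check.read()
```

The point of the sketch is that the text stays Unicode between parsing and writing, so no character references or implicit ASCII conversions are involved.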

Related

How can I save files with UTF-8 names?

I need to save files with UTF-8 names, but when I do, Django raises this error:
UnicodeEncodeError at /uploaded/document/ 'فیلتر.png'
'ascii' codec can't encode characters in position 55-59: ordinal not in range(128)
even though my FileField is defined like this:
# -*- coding: utf-8 -*-
def get_path(instance, filename):
    return u' '.join((u'document', filename)).encode('utf-8').strip()

class Document(models.Model):
    file_path = models.FileField(verbose_name='File', upload_to=get_path,
                                 storage=FileSystemStorage(base_url=settings.LOCAL_MEDIA_URL))
How can I fix it?
I use the Tastypie API to upload files.
My question was answered here:
https://itekblog.com/ascii-codec-cant-encode-characters-in-position/#The_Code
I had to change the encoding settings for Apache 2 in:
/etc/apache/envvars
export LANG='en_US.UTF-8'
export LC_ALL='en_US.UTF-8'
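To check whether the environment change actually reached the Python process, one option is to print the encodings Python derived from it. This is a small diagnostic sketch, not part of the linked answer; run it inside the same process (for example from a temporary view) rather than from an interactive shell, which may have different environment variables:

```python
import locale
import sys

# Encoding used to turn Unicode file names into bytes for the OS.
fs_encoding = sys.getfilesystemencoding()
# Encoding implied by LANG / LC_ALL for text I/O.
preferred = locale.getpreferredencoding()

print('filesystem encoding: %s' % fs_encoding)
print('preferred encoding:  %s' % preferred)
```

If either value names an ASCII codec, non-ASCII file names are likely to fail in the way shown above.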

Python returns an error while writing data to a file (Python 2.7)

I am parsing an XML file with Python's minidom module. Writing the data to a file raises UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128), although the output prints perfectly on the command line. What is the solution?
My XML file is:
<?xml version="1.0"?>
<Feature>
<Word Root ="ਨੌਕਰ-ਚਾਕਰ">
<info Inflection ="ਨੌਕਰਾਂ-ਚਾਕਰਾਂ">
<posinfo gender ="Masculine" number ="Plural" case ="Oblique" />
</info>
</Word>
</Feature>
My python code is:
import sys
from xml.dom import minidom
file=open("npu.txt","w+")
doc = minidom.parse("NPU.xml")
word = doc.getElementsByTagName("Word")
for each in word:
    # print "root"+each.getAttribute("Root")
    file.write(each.getAttribute("Root")+"\n")
    hh=each.getElementsByTagName("info")
    for each1 in hh:
        # print "inflection"+each1.getAttribute("Inflection")
        file.write(each1.getAttribute("Inflection")+"\t")
        vv=each1.getElementsByTagName("posinfo")
        for each2 in vv:
            # print each2.getAttribute("gender")
            # print each2.getAttribute("number")
            # print each2.getAttribute("case")
            file.write( each2.getAttribute("gender")+",")
            file.write( each2.getAttribute("number")+",")
            file.write(each2.getAttribute("case"))
        file.write("\n")
    file.write("--------\n")
Encode the data when writing:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
file=open("npu.txt","w+")
file.write("ਨੌਕਰ-ਚਾਕਰ")
The problem isn't in the way you parse the XML; it is an encoding problem.
minidom returns Unicode strings, and writing them to a file opened with plain open() makes Python 2 encode them implicitly as ASCII.
ASCII doesn't include the characters you are using, so the write fails.
Try codecs as follows:
import codecs
file = codecs.open("npu.txt", "w+", "utf-8")
file.write("ਨੌਕਰ-ਚਾਕਰ".decode('utf-8'))
file.close()
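EDIT:
If the source file itself contains non-ASCII literals, declare its encoding by adding the special comment
# -*- coding: UTF-8 -*-
at the beginning of the Python source. Without it, Python 2 assumes the source is ASCII (7-bit). The comment only describes the source file; it does not change the codec used for implicit str/unicode conversions.
Note that in Python 2, identifiers are still restricted to ASCII characters.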
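Applied to the minidom loop from the question, the same idea might look like the sketch below. It is only an illustration: it uses io.open (available in Python 2.6+ and Python 3) instead of codecs.open, an inline stand-in for NPU.xml, and a temporary output path, all of which are assumptions rather than part of the answers above.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile
from xml.dom import minidom

# Inline stand-in for NPU.xml, encoded as UTF-8 bytes.
xml_bytes = (u'<Feature><Word Root="ਨੌਕਰ-ਚਾਕਰ">'
             u'<info Inflection="ਨੌਕਰਾਂ-ਚਾਕਰਾਂ"/></Word></Feature>').encode('utf-8')
doc = minidom.parseString(xml_bytes)

out_path = os.path.join(tempfile.mkdtemp(), 'npu.txt')
with io.open(out_path, 'w', encoding='utf-8') as out:
    for word in doc.getElementsByTagName('Word'):
        # getAttribute returns Unicode text; the file object encodes it as UTF-8.
        out.write(word.getAttribute('Root') + u'\n')
        for info in word.getElementsByTagName('info'):
            out.write(info.getAttribute('Inflection') + u'\n')

# Read the file back to confirm the text survived the round trip.
with io.open(out_path, encoding='utf-8') as check:
    lines = check.read().splitlines()
```

Because the file object owns the encoding, the loop never has to call encode() or decode() itself.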

UnicodeDecodeError when importing a JSON file

I want to open a JSON file in Python, but I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 64864: ordinal not in range(128)
My code is quite simple:
# -*- coding: utf-8 -*-
import json
with open('birdw3l2.json') as data_file:
    data = json.load(data_file)
print(data)
Can someone help me? Thanks!
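Try the following code, which decodes the file as UTF-8 while reading it:
import io
import json
# json.load returns dicts and lists, which have no .decode() method,
# so the decoding has to happen when the file is opened.
with io.open('birdw3l2.json', encoding='utf-8') as data_file:
    data = json.load(data_file)
print(data)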
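You should specify the encoding when you load your JSON file, like this:
data = json.load(data_file, encoding='utf-8')
The value must match the actual encoding of your file. Note that this keyword applies to Python 2 only; Python 3 ignores it, and it was removed in Python 3.9, where the encoding is passed to open() instead.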

UnicodeEncodeError -- utf8 and unicode() not working

I have a synopsis as follows:
synopsis = 'Eine Geschichte, wie im normalen Leben... Der als äußerst vorsichtig geltende Risikoanalytiker Ruben verlässt seine Frau,...'
I am trying to write this to a file, but keep running into:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 705: ordinal not in range(128)
Here is what I've tried:
synopsis = unicode(synopsis)
new_file.write('%s' % synopsis)
synopsis = synopsis.encode('utf-8')
new_file.write('%s' % synopsis)
In addition, I have # -*- coding: utf-8 -*- specified at the top of my file.
Why is this occurring and how can I fix it?
How are you opening new_file?
import codecs
new_file = codecs.open('out', mode='w', encoding='utf-8')
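This should allow you to write Unicode strings to the file, which will be encoded as UTF-8.
(Unless otherwise set, sys.getdefaultencoding() is 'ascii' in Python 2. That is the codec used when a Unicode string is implicitly converted to bytes, which is what happens when you write one to a file opened with plain open().)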
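The failing conversion can be reproduced by making it explicit. The sketch below is an illustration only, using a short made-up string instead of the full synopsis; it behaves the same in Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
text = u'äußerst'

# UTF-8 can represent every Unicode character.
utf8_bytes = text.encode('utf-8')

# ASCII cannot represent "ä", so the same call fails.
try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    failed_at = exc.start  # index of the first character that could not be encoded
else:
    failed_at = None
```

The position reported in the original error message (705) is the same kind of index: the offset of the first non-ASCII character in the synopsis.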

Why does printing to a utf-8 file fail?

So I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.
This is related to a problem I had the other week: python check if utf-8 string is uppercase
Basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree
outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
          u'R\xc9SUM\xc9', # RESUME with accents
          u'R\xe9sum\xe9', # Resume with accents
          u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word
print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
it fails with the following:
Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
But if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w'), it works perfectly.
So what's happening?
Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?
If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
But isn't the result of etree.tostring() simply being written to stdout and then redirected to a file that was created as a UTF-8 file?
Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why doesn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to understand why my solution works.
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree
root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')
words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']
for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')
In addition to MRAB's answer, here are some lines of code:
import codecs
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
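The bytes-versus-text distinction behind both answers can also be sketched with the standard library alone. This is a minimal illustration, assuming Python 3 and xml.etree instead of lxml (there, encoding='unicode' plays the role of encoding=unicode); the file names are made up:

```python
import io
import os
import tempfile
from xml.etree import ElementTree as etree

word = u'\u041c\u041e\u0421\u041a\u0412\u0410'
root = etree.Element('root')
etree.SubElement(root, 'title').text = word

as_bytes = etree.tostring(root, encoding='utf-8')   # already encoded bytes
as_text = etree.tostring(root, encoding='unicode')  # still text, not yet encoded

tmp = tempfile.mkdtemp()
# Encoded bytes go to a binary-mode file untouched.
with io.open(os.path.join(tmp, 'bytes.xml'), 'wb') as f:
    f.write(as_bytes)
# Text goes to a text-mode file, which performs the one and only encode.
with io.open(os.path.join(tmp, 'text.xml'), 'w', encoding='utf-8') as f:
    f.write(as_text)
```

The failure in the question comes from mixing the two: already encoded bytes were handed to a file object that expected text and tried to encode them a second time.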
