Unicode: Python / lxml file output not as expected (print vs write) - python

I'm parsing an xml file using the code below:
from lxml import etree
file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010'}
with open(file_name + '.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
        crid = info.get('programId')
        titlex = info.find('.//xmlns:Title', namespaces=nsmap)
        title = titlex.text if titlex is not None else 'Missing'
        synopsis1x = info.find('.//xmlns:Synopsis[1]', namespaces=nsmap)
        synopsis1 = synopsis1x.text if synopsis1x is not None else 'Missing'
        synopsis1 = synopsis1.replace('\r', '').replace('\n', '')
        f.write('{}|{}|{}\n'.format(crid, title, synopsis1))
Let's take an example title of 'Přešité bydlení'. If I print the title while parsing the file, it comes out as expected. When I write it out, however, it displays as 'PÅ™eÅ¡ité bydlení'.
I understand that this is to do with encoding (as I was able to change the print command to use UTF-8, and 'corrupt' the output), but I couldn't get the written output to look the way I wanted. I had a look at the codecs library, but wasn't successful. Having 'encoding = "utf-8"' in the XML Parser line didn't make any difference.
How can I configure the written output to be human readable?

I had all sorts of troubles with this before. But the solution is rather simple. There is a chapter on how to read and write in unicode to a file in the documentation. This Python talk is also very enlightening to understand the issue. Unicode can be a pain. It gets a lot easier if you start using python 3 though.
import codecs
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print(repr(f.readline()[:1]))
f.close()

Your code looks ok, so I reckon your input is duff. Assuming you're viewing your output file with a UTF-8 viewer or shell then I suspect that the encoding in the <?xml doesn't match the actual encoding.
This would explain why printing works but not writing to a file. If your shell/IDE is set to "ISO-8859-2" and your input XML is also "ISO-8859-2" then printing is pushing out the raw encoding.
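As a rough illustration of how this kind of garbling can arise (a sketch, not a diagnosis of the asker's file): each accented letter in the title takes two bytes in UTF-8, and a program that reads those bytes with a single-byte code page shows two characters per letter. The choice of cp1252 as the "wrong" encoding is an assumption made for this example.

```python
original = "Přešité bydlení"

# What ends up in a file opened with encoding='utf-8'.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)  # the bytes repr is plain ASCII, so it is safe to print anywhere

# What a viewer shows if it assumes cp1252 instead of UTF-8 (assumed for the example).
garbled = utf8_bytes.decode("cp1252")
print(garbled == original)  # False: the text now looks corrupted

# The bytes themselves are intact; decoding them as UTF-8 recovers the title.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored == original)  # True
```

If this is what is happening, the file on disk is fine and only the viewer's encoding setting needs to change.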

Related

Python script doesn't write Chinese characters to XML file

I'm making a mod for a game where the majority of the files are XMLs, the text of which is Simplified Chinese. My goal is to replace all of the Simplified Chinese in the files with Traditional, followed by an English translation. I'm using the Cloud Translate API from Google to do that part, and it all works fine. At first I was just doing a find and replace on the Chinese text and then adding English to the end of string, but the issue with that is that I'm getting extra English translations whenever the Chinese text occurs more than once.
In an effort to fix that I read more of the XML documentation for Python, and I started trying to use tree.write, but that's where I'm getting stuck. When I use it, the XML file has the UTF codes for the Chinese characters, rather than the actual characters. If I open the file in a web browser, the characters render correctly, but at this point I'm just unsure if they'll still work with the game if they're not writing into the XML normally.
Here's an example XML I'm working with:
<Texts Type="Story">
  <List>
    <Text Name="TradeAuction">
      <DisplayName>拍卖会</DisplayName>
      <Desc>[NAME]来到了[PLACE],发现此地有个拍卖行。</Desc>
      <Selections.0.Display>参与拍卖</Selections.0.Display>
      <Selections.1.Display>离去</Selections.1.Display>
    </Text>
  </List>
</Texts>
My code which works but sometimes duplicates English translations:
import lxml.etree as ET
from google.cloud import translate_v2 as translate
import pinyin
translator = translate.Client()
tgt = "zh-TW"
tt = "en"
with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'r', encoding="utf-8") as f:
    tree = ET.parse(f)
    root = tree.getroot()
    for elem in root.iter('Text'):
        print(elem.text)
        for child in elem:
            txt = child.text
            ttxt = translator.translate(txt, target_language=tgt)
            etxt = translator.translate(txt, target_language=tt)
            with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'r') as n:
                new = n.read().replace(txt, ttxt['translatedText'] + '(' + etxt['translatedText'] + ')', 1)
            with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'w') as n:
                n.write(new)
The output of that looks like this:
<Texts Type="Story">
  <List>
    <Text Name="TradeAuction">
      <DisplayName>拍賣會(auctions)</DisplayName>
      <Desc>[NAME]來到了[PLACE],發現此地有個拍賣行。([NAME] came to [PLACE] and found an auction house here.)</Desc>
      <Selections.0.Display>參與拍賣(Participate in the auction)</Selections.0.Display>
      <Selections.1.Display>離去(Leave)</Selections.1.Display>
    </Text>
  </List>
</Texts>
And here's my tree.write code:
import lxml.etree as ET
from google.cloud import translate_v2 as translate
import pinyin
translator = translate.Client()
tgt = "zh-TW"
tt = "en"
with open('/home/dave/zh-TW/Settings/MapStories/MapStory_Auction.xml', 'r', encoding="utf-8") as f:
    tree = ET.parse(f)
    root = tree.getroot()
    for elem in root.iter('Text'):
        print(elem.text)
        for child in elem:
            print(child.text)
            txt = child.text
            ttxt = translator.translate(txt, target_language=tgt)
            etxt = translator.translate(txt, target_language=tt)
            child.text = ttxt['translatedText'] + "(" + etxt['translatedText'] + ")"
    tree.write('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml')
And the output from that looks like this:
<Texts Type="Story">
<List>
<Text Name="TradeAuction">
<DisplayName>拍賣會(auctions)</DisplayName>
<Desc>[NAME]來到了[PLACE],發現此地有個拍賣行。([NAME] came to [PLACE] and found an auction house here.)</Desc>
<Selections.0.Display>參與拍賣(Participate in the auction)</Selections.0.Display>
<Selections.1.Display>離去(Leave)</Selections.1.Display>
</Text>
</List>
</Texts>
Any help would be appreciated. I think once I figure this out I should be able to fly through the rest of the translating.
tree.write('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml')
Per the documentation:
write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml", *, short_empty_elements=True)
...
The output is either a string (str) or binary (bytes). This is controlled by the encoding argument. If encoding is "unicode", the output is a string; otherwise, it’s binary. Note that this may conflict with the type of file if it’s an open file object; make sure you do not try to write a string to a binary stream and vice versa.
So we just need to set the encoding parameter appropriately, for example encoding="utf-8". Writing as ASCII (the default) means that non-ASCII characters have to be written as numeric character references (&#25293; etc.). (It still writes to the file without a problem, of course, because the UTF-8 encoding specified for the file is ASCII-transparent.)
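A minimal sketch of the difference, using the standard library's xml.etree.ElementTree so that it runs without extra packages. The element and the temporary output path are made up for the example; lxml's tree.write also accepts an encoding argument, so the same idea should carry over to the asker's code.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing non-ASCII text.
root = ET.Element("DisplayName")
root.text = "拍卖会"

# Default encoding (us-ascii): characters become numeric character references.
ascii_bytes = ET.tostring(root)
print(ascii_bytes)

# Explicit UTF-8: the characters themselves are written, encoded as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "out.xml")
ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
with open(path, "rb") as f:
    written = f.read()
print(root.text.encode("utf-8") in written)  # True
```

Both files are valid XML and parse to the same text; the UTF-8 version is simply readable in a text editor.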

Parse XML in Python with encoding other than utf-8

Any clue on how to parse xml in python that has: encoding='Windows-1255' in it?
At least the lxml.etree parser won't even look at the string when there's an "encoding" tag in the XML header which isn't "utf-8" or "ASCII".
Running the following code fails with:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
from lxml import etree
parser = etree.XMLParser(encoding='utf-8')
def convert_xml_to_utf8(xml_str):
    tree = etree.fromstring(xml_str, parser=parser)
    if tree.docinfo.encoding == 'utf-8':
        # already in correct encoding, abort
        return xml_str
    decoded_str = xml_str.decode(tree.docinfo.encoding)
    utf8_encoded_str = decoded_str.encode('utf-8')
    tree = etree.fromstring(utf8_encoded_str)
    tree.docinfo.encoding = 'utf-8'
    return etree.tostring(tree, pretty_print=True, xml_declaration=True, encoding='UTF-8', standalone="yes")
data = '''<?xml version='1.0' encoding='Windows-1255'?><rss version="2.0"><channel ><title ><![CDATA[ynet - חדשות]]></title></channel></rss>'''
print(convert_xml_to_utf8(data))
data is a unicode str. The error is saying that such a thing which also contains an encoding="..." declaration is not supported, because a str is supposedly already decoded from its encoding and hence it's ambiguous/nonsensical that it would also contain an encoding declaration. It's telling you to use a bytes instead, e.g. data = b'<...>'. Presumably you should be opening a file in binary mode, read the data from there and let etree handle the encoding="...", instead of using string literals in your code, which complicates the encoding situation even further.
It's as simple as:
from xml.etree import ElementTree
# open in binary mode ↓
with open('/tmp/test.xml', 'rb') as f:
    e = ElementTree.fromstring(f.read())
Et voilà, e contains your parsed file with the encoding having been (presumably) correctly interpreted by etree based on the file's internal encoding="..." header.
ElementTree in fact has a shortcut method for this:
e = ElementTree.parse('/tmp/test.xml')
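To make the bytes-versus-str point concrete without needing a file on disk, here is a small sketch with the standard library parser. The sample document is made up, and it is encoded to bytes in the code only to imitate what reading a Windows-1255 file in binary mode would give you.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; its bytes match the encoding it declares.
xml_text = (
    '<?xml version="1.0" encoding="Windows-1255"?>'
    "<rss><channel><title>ynet - חדשות</title></channel></rss>"
)
xml_bytes = xml_text.encode("cp1255")  # stands in for open(path, 'rb').read()

# The parser reads the declaration and decodes the bytes itself.
root = ET.fromstring(xml_bytes)
title = root.findtext("channel/title")

# Once parsed, the tree can be serialized in any encoding, for example UTF-8.
utf8_bytes = ET.tostring(root, encoding="utf-8")
print(title.encode("utf-8") in utf8_bytes)  # True
```

The design point is that decoding is left to the parser: the code never calls decode() on the input itself.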

How to get more info from lxml errors?

Because I'm not able to use an XSL IDE, I've written a super-simple Python script using lxml to transform a given XML file with a given XSL transform, and write the results to a file. As follows (abridged):
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
print(xml_root.tag)
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = transform(xml)
with open(output, 'w') as f:
    f.write(str(newtext))
I'm getting the following error:
"lxml.etree.XSLTApplyError: Failed to evaluate the 'select' expression"
...but I have quite a number of select expressions in my XSLT. After having looked carefully and isolated blocks of code, I'm still at a loss as to which select is failing, or why.
Without trying to debug the code, is there a way to get more information out of lxml, like a line number or quote from the failing expression?
aaaaaand of course as soon as I actually take the time to post the question, I stumble upon the answer.
This might be a duplicate of this question, but I think the added benefit here is the Python side of things.
The linked answer points out that each parser includes an error log that you can access. The only "trick" is catching those errors so that you can look in the log once it's been created.
I did it thusly (perhaps also poorly, but it worked):
import lxml.etree as etree
from lxml.etree import XMLParser
xml_filename = '(some path to an XML file)'
xsl_filename = '(some path to an XSL file)'
output = '(some path to a file)'
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = None
try:
    newtext = transform(xml)
    with open(output, 'w') as f:
        f.write(str(newtext))
except etree.XSLTApplyError:
    # the transform keeps its own log; print each entry with its line number
    for error in transform.error_log:
        print(error.message, error.line)
The messages in this log are more descriptive than those printed to the console, and the "line" element will point you to the line number where the failure occurred.

Can anyone tell me what error msg "line 1182 in parse" means when I'm trying to parse and xml in python

This is the code that results in an error message:
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.parse(data)
The error:
I'm new to python. I did read documentation and a couple of tutorials, but clearly I still have done something wrong. I don't believe it is the xml file itself because it does this to two different xml files.
Consider using ElementTree's fromstring():
import urllib
import xml.etree.ElementTree as ET
url = raw_input('Enter URL:')
# http://feeds.bbci.co.uk/news/rss.xml?edition=int
urlhandle = urllib.urlopen(url)
data = urlhandle.read()
tree = ET.fromstring(data)
print ET.tostring(tree, encoding='utf8', method='xml')
data is a reference to the XML content as a string, but the parse() function expects a filename or file object as its argument. That's why there is an error.
urlhandle is a file object, so tree = ET.parse(urlhandle) should work for you.
The error message indicates that your code is trying to open a file whose name is stored in the variable source.
It's failing to open that file (IOError) because the variable source contains a bunch of XML, not a file name.
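The two answers can be put side by side in a short sketch. It is written for Python 3 and uses an in-memory byte string in place of the downloaded data, so the sample XML and the variable names are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the bytes returned by urlhandle.read(); hypothetical sample data.
data = b"<rss><channel><title>News</title></channel></rss>"

# fromstring() takes the XML content itself and returns the root element.
root_from_string = ET.fromstring(data)

# parse() takes a file name or a file-like object and returns an ElementTree.
tree = ET.parse(io.BytesIO(data))
root_from_file = tree.getroot()

print(root_from_string.tag, root_from_file.tag)
```

Passing the content to parse() makes it treat the whole XML text as a file name, which is where the IOError comes from.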

Python django writing ascii characters in coded format in file

I am using Django to generate the abc.tex file.
I am displaying the data in the browser and writing the same data to a tex file like this:
with open("sample.tex") as f:
    t = Template(f.read())
head = ['name','class']
c = Context({"head":headers, "table": rowlist})
# Render template
output = t.render(c)
with open("mytable.tex", 'w') as out_f:
    out_f.write(output)
Now in the browser I can see the text as speaker-hearer's, but in the file it comes out as speaker-hearer&#39;s
How can I fix that?
As far as I know, the browser decodes this data automatically, but the text within the file will be raw; so you are seeing the data "as it is".
Maybe you can use the HTMLParser library to decode the data generated by Django (output) before writing to the abc.tex file.
For your sample string:
import HTMLParser
h = HTMLParser.HTMLParser()
s = "speaker-hearer&#39;s"
s = h.unescape(s)
So then it would be just a matter of unescaping your output when you write it to a file, and probably handling the parsing exception.
Source (see step #3)
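On Python 3 the same idea is available as html.unescape in the standard library. A minimal sketch, assuming the rendered output contains the apostrophe as the character reference &#39;:

```python
import html

# Assumed form of the escaped text in the rendered template output.
escaped = "speaker-hearer&#39;s"

# Convert character references back into the characters they stand for.
plain = html.unescape(escaped)
print(plain)
```

If the escaping comes from Django's template auto-escaping, wrapping the relevant part of the template in the {% autoescape off %} tag may avoid the round trip altogether, which seems reasonable here since the output is LaTeX rather than HTML.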
