Python script doesn't write Chinese characters to XML file - python

I'm making a mod for a game where the majority of the files are XMLs, the text of which is Simplified Chinese. My goal is to replace all of the Simplified Chinese in the files with Traditional, followed by an English translation. I'm using the Cloud Translate API from Google to do that part, and it all works fine. At first I was just doing a find and replace on the Chinese text and then adding English to the end of string, but the issue with that is that I'm getting extra English translations whenever the Chinese text occurs more than once.
In an effort to fix that I read more of the XML documentation for Python, and I started trying to use tree.write, but that's where I'm getting stuck. When I use it, the XML file has the UTF codes for the Chinese characters, rather than the actual characters. If I open the file in a web browser, the characters render correctly, but at this point I'm just unsure if they'll still work with the game if they're not writing into the XML normally.
Here's an example XML I'm working with:
<Texts Type="Story">
<List>
<Text Name="TradeAuction">
<DisplayName>拍卖会</DisplayName>
<Desc>[NAME]来到了[PLACE],发现此地有个拍卖行。</Desc>
<Selections.0.Display>参与拍卖</Selections.0.Display>
<Selections.1.Display>离去</Selections.1.Display>
</Text>
</List>
</Texts>
My code which works but sometimes duplicates English translations:
import lxml.etree as ET
from google.cloud import translate_v2 as translate
import pinyin
translator = translate.Client()
tgt = "zh-TW"
tt = "en"
with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'r', encoding="utf-8") as f:
tree = ET.parse(f)
root = tree.getroot()
for elem in root.iter('Text'):
print(elem.text)
for child in elem:
txt = child.text
ttxt = translator.translate(txt, target_language=tgt)
etxt = translator.translate(txt, target_language=tt)
with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'r') as n:
new = n.read().replace(txt, ttxt['translatedText'] + '(' + etxt['translatedText'] + ')', 1)
with open('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml', 'w') as n:
n.write(new)
The output of that looks like this:
<Texts Type="Story">
<List>
<Text Name="TradeAuction">
<DisplayName>拍賣會(auctions)</DisplayName>
<Desc>[NAME]來到了[PLACE],發現此地有個拍賣行。([NAME] came to [PLACE] and found an auction house here.)</Desc>
<Selections.0.Display>參與拍賣(Participate in the auction)</Selections.0.Display>
<Selections.1.Display>離去(Leave)</Selections.1.Display>
</Text>
</List>
</Texts>
And here's my tree.write code:
import lxml.etree as ET
from google.cloud import translate_v2 as translate
import pinyin
translator = translate.Client()
tgt = "zh-TW"
tt = "en"
with open('/home/dave/zh-TW/Settings/MapStories/MapStory_Auction.xml', 'r', encoding="utf-8") as f:
tree = ET.parse(f)
root = tree.getroot()
for elem in root.iter('Text'):
print(elem.text)
for child in elem:
print(child.text)
txt = child.text
ttxt = translator.translate(txt, target_language=tgt)
etxt = translator.translate(txt, target_language=tt)
child.text = ttxt['translatedText'] + "(" + etxt['translatedText'] + ")"
tree.write('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml')
And the output from that looks like this:
<Texts Type="Story">
<List>
<Text Name="TradeAuction">
<DisplayName>拍賣會(auctions)</DisplayName>
<Desc>[NAME]來到了[PLACE],發現此地有個拍賣行。([NAME] came to [PLACE] and found an auction house here.)</Desc>
<Selections.0.Display>參與拍賣(Participate in the auction)</Selections.0.Display>
<Selections.1.Display>離去(Leave)</Selections.1.Display>
</Text>
</List>
</Texts>
Any help would be appreciated. I think once I figure this out I should be able to fly through the rest of the translating.

tree.write('/home/dave/zh-TW-final/Settings/MapStories/MapStory_Auction.xml')
Per the documentation:
write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml", *, short_empty_elements=True)
...
The output is either a string (str) or binary (bytes). This is controlled by the encoding argument. If encoding is "unicode", the output is a string; otherwise, it’s binary. Note that this may conflict with the type of file if it’s an open file object; make sure you do not try to write a string to a binary stream and vice versa.
So we just need to set the encoding parameter appropriately. Writing as ASCII means that non-ASCII characters need to be entity-escaped (拍 etc.) (It still writes to the file without a problem, of course, because the UTF-8 encoding specified for the file is ASCII-transparent.)

Related

Python - How to convert an XML file containing XADES signatures to HTML using XSLT?

I want to use python to convert an XML file to HTML having an XSL file. I found this code that works for some cases:
def xml2html(xml_filename: str, xsl_filename: str, save_path: str):
parser = ET.XMLParser(huge_tree=True, encoding='utf-8', recover=True)
dom = ET.parse(xml_filename, parser)
xslt = ET.parse(xsl_filename, parser)
transform = ET.XSLT(xslt)
newdom = transform(dom)
with open(save_path, 'wb') as f:
f.write(ET.tostring(newdom, pretty_print=True))
The code works for files without signatures. Unfortunately for files containing signatures, the resulting html contains almost no content. Files that do not work include, for example, such tags:
<ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
<ds:SignedInfo
<ds:SignatureMethod Algorithm="http://www.w3.org/2001/04/xmldsig-more#rsa-sha256"/>
<ds:Reference
<ds:DigestValue>
<ds:SignatureValue
<ds:KeyInfo><ds:X509Data><ds:X509Certificate
<xades:SigningCertificate>
</xades:IssuerSerial>
</xades:Cert>
</xades:SigningCertificate>
</xades:SignedSignatureProperties>
<xades:SignedDataObjectProperties/>
</xades:SignedProperties>
Some files do not contain any content at all that I would be able to read in xml. They contain only similar tags to the above and encoded strings. Probably the content is encoded.
However, I know that converting them is possible because I found a repository on github that has a link to a page where I was able to convert such files. Here is the link https://github.com/wlodekf/sprawozdania. The code is written in javascript and the XSLTProcessor class is used. I don't know if this class or something else is responsible for decoding the contents of these files. I would like to know if it is possible to do this in python, and if so what step was missing from my solution.

Python: Import text from HTML or text document into Word

I've been looking at some of the documentation, but all of the work I've seen around docx is primarily directed towards working with text already in a word document. What I'd like to know, is is there a simple way to take text either from HTML or a Text document, and import that into a word document, and to do that wholesale? with all of the text in the HTML/Text document? It doesn't seem to like the string, it's too long.
My understanding of the documentation, is that you have to work with text on a paragraph by paragraph basis. The task that I'd like to do is relatively simple - however it's beyond my python skills. I'd like to set up the margins on the word document, and then import the text into the word document so that it adheres to the margins that I previously specified.
Does anyone have any-thoughts? None of the previous posts have been very helpful that I have found.
import textwrap
import requests
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Inches
class DocumentWrapper(textwrap.TextWrapper):
def wrap(self, text):
split_text = text.split('\n\n')
lines = [line for para in split_text for line in textwrap.TextWrapper.wrap(self, para)]
return lines
page = requests.get("http://classics.mit.edu/Aristotle/prior.mb.txt")
soup = BeautifulSoup(page.text,"html.parser")
#we are going to pull in the text wrap extension that we have added.
#The typical width that we want tow
text_wrap_extension = DocumentWrapper(width=82,initial_indent="",fix_sentence_endings=True)
new_string = text_wrap_extension.fill(page.text)
final_document = "Prior_Analytics.txt"
with open(final_document, "w") as f:
f.writelines(new_string)
document = Document(final_document)
### Specified margin specifications
sections = document.sections
for section in sections:
section.top_margin = (Inches(1.00))
section.bottom_margin = (Inches(1.00))
section.right_margin = (Inches(1.00))
section.left_margin = (Inches(1.00))
document.save(final_document)
The error that I get thrown is below:
docx.opc.exceptions.PackageNotFoundError: Package not found at 'Prior_Analytics.txt'
This error simply means there is no .docx file at the location you specified.. So you can modify your code to create the file it it doesnt exist.
final_document = "Prior_Analytics.txt"
with open(final_document, "w+") as f:
f.writelines(new_string)
You are providing a relative path. How do you know what Python's current working directory is? That's where the relative path you give will start from.
A couple lines of code like this will tell you:
import os
print(os.path.realpath('./'))
Note that:
docx is used to open .docx files
I got it.
document = Document()
sections = document.sections
for section in sections:
section.top_margin = Inches(2)
section.bottom_margin = Inches(2)
section.left_margin = Inches(2)
section.right_margin = Inches(2)
document.add_paragraph(###Add your text here. Add Paragraph Accepts text of whatever size.###)
document.save()#name of document goes here, as a string.

How to get more info from lxml errors?

Because I'm not able to use an XSL IDE, I've written a super-simple Python script using lxml to transform a given XML file with a given XSL transform, and write the results to a file. As follows (abridged):
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
print(xml_root.tag)
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = transform(xml)
with open(output, 'w') as f:
f.write(str(newtext))
I'm getting the following error:
"lxml.etree.XSLTApplyError: Failed to evaluate the 'select' expression"
...but I have quite a number of select expressions in my XSLT. After having looked carefully and isolated blocks of code, I'm still at a loss as to which select is failing, or why.
Without trying to debug the code, is there a way to get more information out of lxml, like a line number or quote from the failing expression?
aaaaaand of course as soon as I actually take the time to post the question, I stumble upon the answer.
This might be a duplicate of this question, but I think the added benefit here is the Python side of things.
The linked answer points out that each parser includes an error log that you can access. The only "trick" is catching those errors so that you can look in the log once it's been created.
I did it thusly (perhaps also poorly, but it worked):
import os
import lxml.etree as etree
from lxml.etree import XMLParser
import sys
xml_filename = '(some path to an XML file)'
xsl_filename = '(some path to an XSL file)'
output = '(some path to a file)'
p = XMLParser(huge_tree=True)
xml = etree.parse(xml_filename, parser=p)
xml_root = xml.getroot()
xslt_root = etree.parse(xsl_filename)
transform = etree.XSLT(xslt_root)
newtext = None
try:
newtext = transform(xml)
with open(output, 'w') as f:
f.write(str(newtext))
except:
for error in transform.error_log:
print(error.message, error.line)
The messages in this log are more descriptive than those printed to the console, and the "line" element will point you to the line number where the failure occurred.

Unicode: Python / lxml file output not as expected (print vs write)

I'm parsing an xml file using the code below:
import lxml
file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)
from lxml import etree
parser = lxml.etree.XMLParser()
tree = lxml.etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010'}
with open(file_name+'.log', 'w', encoding='utf-8') as f:
for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
crid = (info.get('programId'))
titlex = (info.find('.//xmlns:Title', namespaces=nsmap))
title = (titlex.text if titlex != None else 'Missing')
synopsis1x = (info.find('.//xmlns:Synopsis[1]', namespaces=nsmap))
synopsis1 = (synopsis1x.text if synopsis1x != None else 'Missing')
synopsis1 = synopsis1.replace('\r','').replace('\n','')
f.write('{}|{}|{}\n'.format(crid, title, synopsis1))
Let take an example title of 'Přešité bydlení'. If I print the title whilst parsing the file, it comes out as expected. When I write it out however, it displays as 'PÅ™eÅ¡ité bydlení'.
I understand that this is do to with encoding (as I was able to change the print command to use UTF-8, and 'corrupt' the output), but I couldn't get the written output to print as I desired. I had a look at the codecs library, but couldn't wasn't successful. Having 'encoding = "utf-8"' in the XML Parser line didn't make any difference.
How can I configure the written output to be human readable?
I had all sorts of troubles with this before. But the solution is rather simple. There is a chapter on how to read and write in unicode to a file in the documentation. This Python talk is also very enlightening to understand the issue. Unicode can be a pain. It gets a lot easier if you start using python 3 though.
import codecs
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()
Your code looks ok, so I reckon your input is duff. Assuming you're viewing your output file with a UTF-8 viewer or shell then I suspect that the encoding in the <?xml doesn't match the actual encoding.
This would explain why printing works but not writing to a file. If your shell/IDE is set to "ISO-8859-2" and your input XML is also "ISO-8859-2" then printing is pushing out the raw encoding.

Text-Replace in docx and save the changed file with python-docx

I'm trying to use the python-docx module to replace a word in a file and save the new file with the caveat that the new file must have exactly the same formatting as the old file, but with the word replaced. How am I supposed to do this?
The docx module has a savedocx that takes 7 inputs:
document
coreprops
appprops
contenttypes
websettings
wordrelationships
output
How do I keep everything in my original file the same except for the replaced word?
this worked for me:
def docx_replace(old_file,new_file,rep):
zin = zipfile.ZipFile (old_file, 'r')
zout = zipfile.ZipFile (new_file, 'w')
for item in zin.infolist():
buffer = zin.read(item.filename)
if (item.filename == 'word/document.xml'):
res = buffer.decode("utf-8")
for r in rep:
res = res.replace(r,rep[r])
buffer = res.encode("utf-8")
zout.writestr(item, buffer)
zout.close()
zin.close()
As it seems to be, Docx for Python is not meant to store a full Docx with images, headers, ... , but only contains the inner content of the document. So there's no simple way to do this.
Howewer, here is how you could do it:
First, have a look at the docx tag wiki:
It explains how the docx file can be unzipped: Here's how a typical file looks like:
+--docProps
| + app.xml
| \ core.xml
+ res.log
+--word //this folder contains most of the files that control the content of the document
| + document.xml //Is the actual content of the document
| + endnotes.xml
| + fontTable.xml
| + footer1.xml //Containst the elements in the footer of the document
| + footnotes.xml
| +--media //This folder contains all images embedded in the word
| | \ image1.jpeg
| + settings.xml
| + styles.xml
| + stylesWithEffects.xml
| +--theme
| | \ theme1.xml
| + webSettings.xml
| \--_rels
| \ document.xml.rels //this document tells word where the images are situated
+ [Content_Types].xml
\--_rels
\ .rels
Docx only gets one part of the document, in the method opendocx
def opendocx(file):
'''Open a docx file, return a document XML tree'''
mydoc = zipfile.ZipFile(file)
xmlcontent = mydoc.read('word/document.xml')
document = etree.fromstring(xmlcontent)
return document
It only gets the document.xml file.
What I recommend you to do is:
get the content of the document with **opendocx*
Replace the document.xml with the advReplace method
Open the docx as a zip, and replace the document.xml content's by the new xml content.
Close and output the zipped file (renaming it to output.docx)
If you have node.js installed, be informed that I have worked on DocxGenJS which is templating engine for docx documents, the library is in active development and will be released soon as a node module.
Are you using the docx module from here?
If yes, then the docx module already exposes methods like replace, advReplace etc which can help you achieve your task. Refer to the source code for more details of the exposed methods.
from docx import Document
file_path = 'C:/tmp.docx'
document = Document(file_path)
def docx_replace(doc_obj, data: dict):
"""example: data=dict(order_id=123), result: {order_id} -> 123"""
for paragraph in doc_obj.paragraphs:
for key, val in data.items():
key_name = '{{{}}}'.format(key)
if key_name in paragraph.text:
paragraph.text = paragraph.text.replace(key_name, str(val))
for table in doc_obj.tables:
for row in table.rows:
for cell in row.cells:
docx_replace(cell, data)
docx_replace(document, dict(order_id=123, year=2018, payer_fio='payer_fio', payer_fio1='payer_fio1'))
document.save(file_path)
The problem with the methods above is that they lose the existing formatting. Please see my answer which performs the replace and retains formatting.
There is also python-docx-template which allows jinja2 style templating within a docx template. Here's a link to the documentation
I've forked a repo of python-docx here, which preserves all of the preexisting data in a docx file, including formatting. Hopefully this is what you're looking for.
In addition to #ramil, you have to escape some characters before placing them as string values into the XML, so this worked for me:
def escape(escapee):
escapee = escapee.replace("&", "&")
escapee = escapee.replace("<", "<")
escapee = escapee.replace(">", ">")
escapee = escapee.replace("\"", """)
escapee = escapee.replace("'", "&apos;")
return escapee
We can use python-docx to keep an image on docx.
docx detect image as a paragraph.
But for this paragraph the text is empty.
So you can use like this.
paragraphs = document.paragraphs for paragraph in paragraphs: if paragraph.text == '': continue

Categories