Convert XML document (describing a markdown file) to markdown file - python

Have been using Ulysses App (a markdown writing app for Mac) for a while and created hundreds of notes. Ulysses App will let you write in markdown, but internally it saves the notes in XML format.
Now I'd like to move my notes out of it, but it does not offer an easy way to do so. (At least in some edge cases where you have embedded images.)
Here is an example note in XML format:
<sheet version="5" app_version="19.2">
<string xml:space="preserve">
<p><tags><tag kind="heading1"># </tag></tags>Ulysses App - Example note</p>
<p></p>
<p><tags><tag kind="heading2">## </tag></tags>This is a header 2</p>
<p></p>
<p>And here is some text under headar 2.</p>
<p></p>
<p><tags><tag kind="heading2">## </tag></tags>Now some more</p>
<p></p>
<p><tags><tag kind="orderedList">1. </tag></tags>Item one</p>
<p><tags><tag kind="orderedList">2. </tag></tags>Item two</p>
<p></p>
<p><tags><tag kind="heading2">## </tag></tags>Finally some code</p>
<p></p>
<p><tags><tag kind="codeblock"></tag></tags><attribute identifier="syntax">python</attribute>for j in range(10):</p>
<p><tags><tag kind="codeblock"></tag></tags> print(j)</p>
<p></p>
<p></p>
<p></p></string>
</sheet>
It will be rendered as:
# Ulysses App - Example note
## This is a header 2
And here is some text under headar 2.
## Now some more
1. Item one
2. Item two
## Finally some code
```python
for j in range(10):
    print(j)
```
While researching how to convert these XML files to markdown files I came across XSLT. I read up on XSLT, but I am overwhelmed because I don't have a clue where to start.
Can someone point me in the right direction? Is an XML transformation via XSLT the right way to do this task? Or is it too heavyweight?
I am fluent in Python. Does it make more sense to parse the XML in Python and try to convert it via Python?
All tips are welcome and appreciated.
Transformation rules
The way I see it, the transformation rules are like so:
A document consists of paragraphs. Even a header is (in) a paragraph.
Each paragraph can have one or more "qualifiers".
So a paragraph can contain only plain text like <p>And here is some text under headar 2.</p> or it can have a qualifier before or after the text like here: <p><tags><tag kind="orderedList">1. </tag></tags>Item one</p>.
In a first attempt I can parse the XML in Python to get the paragraph texts. That would get me the bare bones of my note.
In a second iteration I can focus on parsing the "tags" (or qualifiers).
Will post my results here once done.
Question: Can this "easily" be done in XSLT?
Update: Naive solution in Python
I came up with a solution in Python3 that gets all inner texts of the paragraphs and their children. That pretty closely resembles my markdown file:
#!/usr/bin/env python3
import argparse
import xml.etree.ElementTree as ET
def load_xml(filename):
    tree = ET.parse(filename)
    root = tree.getroot()
    # Traverse second child!
    # First: "markdown", second: "string" (contains the content of the doc)
    content = root[1]
    return content
def main():
    # get filename from command line
    parser = argparse.ArgumentParser()
    parser.add_argument("--file", "-f", type=str, required=True)
    args = parser.parse_args()
    filename = args.file
    # load xml content
    content = load_xml(filename)
    # get all text from all children
    # https://stackoverflow.com/a/34240893/5115219
    para = list(content.itertext())
    # join text list
    text = "".join(para)
    print(text)
if __name__ == "__main__":
    main()
Test run
➜ (venv:ulysses_xml) ulysses_xml ./process_xml.py -f Content.xml
# Ulysses App - Example note
## This is a header 2
And here is some text under headar 2.
## Now some more
1. Item one
2. Item two
## Finally some code
pythonfor j in range(10):
 print(j)
Of course some things are missing. For instance the code block markup at the bottom. And some other fine details. But a first start at least.
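A possible next step, sketched under the assumption that every note follows the structure of the example above: a `string` element holding `p` paragraphs, markers inside `tags/tag`, and the code language in an `attribute` element with `identifier="syntax"`. The names `paragraph_text` and `sheet_to_markdown` are made up for this sketch, and tag kinds other than `codeblock` are simply passed through as text:

```python
import xml.etree.ElementTree as ET

FENCE = "`" * 3  # markdown code fence, built this way to keep the listing readable

def paragraph_text(p):
    """Text of one <p>, skipping metadata stored in <attribute> children."""
    parts = [p.text or ""]
    for child in p:
        if child.tag != "attribute":       # e.g. the syntax name "python"
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")     # text that follows the child element
    return "".join(parts)

def sheet_to_markdown(xml_text):
    """Convert one sheet (as XML text) to markdown, fencing code blocks."""
    root = ET.fromstring(xml_text)
    content = root.find("string")
    out, in_code = [], False
    for p in content.findall("p"):
        tag = p.find("tags/tag")
        is_code = tag is not None and tag.get("kind") == "codeblock"
        if is_code and not in_code:
            # first line of a code block: open the fence, add the language if present
            syntax = p.find("attribute[@identifier='syntax']")
            lang = syntax.text if syntax is not None and syntax.text else ""
            out.append(FENCE + lang)
        elif in_code and not is_code:
            out.append(FENCE)              # first paragraph after the code block
        in_code = is_code
        out.append(paragraph_text(p))
    if in_code:
        out.append(FENCE)                  # document ended inside a code block
    return "\n".join(out)
```

Calling `sheet_to_markdown(open("Content.xml").read())` on the example note should give the same text as the naive script, but with the code lines wrapped in a fenced block and without the stray `python` prefix. Embedded images and other attributes would need their own rules.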

Related

How to extract text from XML based on a tag & then put it back (Python)?

I have a messy XML, with some tag structure like -
<textTag>
<div xmlns="http://www.tei-c.org/ns/1.0"><p> -----some text goes here-----
</p>
</div>
</textTag>
I want to extract -----some text goes here-----, make some changes and put it back into the XML. How should I go about it?
Option 1:
You can use the xml module of python for parsing, updating and saving the xml file. A problem though, would be that the resulting xml file might have order of attributes etc different from the original xml file. So when you make a diff, you might see a lot of differences.
So you might do something like.
from xml.etree import ElementTree as ET
tree = ET.parse('xmlfilename')
root = tree.getroot()
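# <p> lives in the TEI default namespace, so the tag must be namespace-qualified
p_nodes = root.findall('.//{http://www.tei-c.org/ns/1.0}p')
for node in p_nodes:
    # process: replace this with your own change to the text
    node.text = (node.text or '').strip()
# ElementTree has no save(); write() serializes the tree back to a file
tree.write('xmlfilename')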
Option 2:
Use regular expression.
Read the file line by line and look for the pattern you are interested in and make the update and write it back.
The obvious advantage is that the diff between the original and the modified file will show only the update you made.
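import re
pattern = re.compile(r"...") # your pattern
# xmlfile, outputfile and update are placeholders for your own values
with open(xmlfile) as f, open(outputfile, "w") as fout:
    for line in f:
        # sub() returns the modified line; write that, not the original
        fout.write(pattern.sub(update, line))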
You could use lxml (which has much better XPath 1.0 support than ElementTree) to find all text() nodes that contain "-----some text goes here-----", modify the text, and then replace the .text (or .tail) of the parent.
Example...
Python 3.x
from lxml import etree
xml = """
<textTag>
<div xmlns="http://www.tei-c.org/ns/1.0"><p> <br/>-----some text goes here-----
</p>
</div>
</textTag>"""
tree = etree.fromstring(xml)
for text in tree.xpath(".//text()[contains(.,'-----some text goes here-----')]"):
    parent = text.getparent()
    new_text = text.replace("-----some text goes here-----", "---- BAM! ----")
    if text.is_text:
        parent.text = new_text
    elif text.is_tail:
        parent.tail = new_text
etree.dump(tree)
Output (dumped to console)
<textTag>
<div xmlns="http://www.tei-c.org/ns/1.0"><p> <br/>---- BAM! ----
</p>
</div>
</textTag>

Changing root in XML for xml.etree

I have a file that looks like this:
<?xml version="1.0"etc>
<xliff version="1.2" etc>
<file datatype="plaintext" mt="eMT-R2" original="" source-language="en-US" target-language="es">
<header/>
<body>
<trans-unit etc>
<source>blabla</source>
<target>blabla</target>
<note>blabla</note>
</trans-unit>
</body>
</file>
</xliff>
I want to go through the source and target elements. My code only works if I have <body> as a root. Is there a way to skip the first 4 elements at the beginning of the file or just set the root to <body>?
import xml.etree.ElementTree as ET
tree = ET.parse('myfile.xlf')
root = tree.getroot()
for trans in root.findall('trans-unit'):
    source = trans.find('source').text
    target = trans.find('target').text
    lencomp = (len(target) - len(source))/len(source)*100.0
    print(source,">>>", target)
ElementTree's findall takes a quasi-XPath string. It's not a full-featured XPath like the one available with lxml, but it works for what you need:
import xml.etree.ElementTree as ET
tree = ET.parse('myfile.xlf')
for trans in tree.findall('file/body/trans-unit'):
    source = trans.find('source').text
    target = trans.find('target').text
    lencomp = (len(target) - len(source))/len(source)*100.0
    print(source,">>>", target)
Ok, so it turns out the problem is not in the code but in my file. For anyone working with XLIFF files, this may be useful:
The issue is in the "XMLNS" - if you remove at least one letter, the file will be parsed correctly. I'm not sure exactly what the problem is, but changing this definitely solves the problem
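A likely explanation for the xmlns observation: when the root element declares a default namespace, ElementTree stores every tag as `{namespace}name`, so a plain path such as 'file/body/trans-unit' matches nothing. The sketch below assumes the file uses the XLIFF 1.2 namespace `urn:oasis:names:tc:xliff:document:1.2` (the header in the question is elided, so check the actual `xmlns` value) and passes a prefix map to `findall`. The prefix `x` and the `id` attribute are arbitrary choices for this example:

```python
import xml.etree.ElementTree as ET

# Assumed sample; the real file's xmlns value may differ.
XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file datatype="plaintext" original="" source-language="en-US" target-language="es">
    <header/>
    <body>
      <trans-unit id="1">
        <source>blabla</source>
        <target>blabla</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

# Map a prefix of our choosing to the namespace URI declared in the file.
ns = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

root = ET.fromstring(XLIFF)

# Without the namespace the path finds nothing.
unqualified = root.findall("file/body/trans-unit")

# With the prefix map every step of the path is namespace-qualified.
pairs = []
for trans in root.findall("x:file/x:body/x:trans-unit", ns):
    source = trans.find("x:source", ns).text
    target = trans.find("x:target", ns).text
    pairs.append((source, target))

print(len(unqualified), pairs)
```

This keeps the file untouched, which is usually preferable to editing the `xmlns` attribute by hand.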

Remove xmlns information from generated file?

I am using ElementTree to parse an XML file, edit the contents and write to a new XML file. I have this all working apart from one issue. When I generate the file there are a lot of extra lines containing namespace information. Here are some snippets of code:
import xml.etree.ElementTree as ET
ET.register_namespace("", "http://clish.sourceforge.net/XMLSchema")
tree = ET.parse('ethernet.xml')
root = tree.getroot()
commands = root.findall('{http://clish.sourceforge.net/XMLSchema}'
                        'VIEW/{http://clish.sourceforge.net/XMLSchema}COMMAND')
for command in commands:
    all1.append(list(command.iter()))
And a sample of the output file, with the unwanted attribute xmlns="http://clish.sourceforge.net/XMLSchema":
<COMMAND xmlns="http://clish.sourceforge.net/XMLSchema" help="Interface specific description" name="description">
<PARAM help="Description (must be in double-quotes)" name="description" ptype="LINE" />
<CONFIG />
</COMMAND>
Can I remove this with ElementTree, and if so, how? Or will I have to use some regex (I am writing a string to the file)?
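One way this could be approached without regex, as a rough sketch: the `xmlns` attribute shows up because each COMMAND element is still namespace-qualified when it is serialized on its own. Stripping the `{uri}` part from every tag before calling `tostring` avoids that. `strip_namespace` is a hypothetical helper, the sample document (including the `CLISH_MODULE` root name) is invented for the example, and namespaced attributes are not handled:

```python
import xml.etree.ElementTree as ET

NS = "http://clish.sourceforge.net/XMLSchema"

# Invented sample that mimics the structure in the question.
SAMPLE = f"""<CLISH_MODULE xmlns="{NS}">
  <VIEW name="v">
    <COMMAND name="description" help="Interface specific description">
      <CONFIG/>
    </COMMAND>
  </VIEW>
</CLISH_MODULE>"""

def strip_namespace(element):
    """Remove the '{uri}' prefix from the tag of element and all its descendants."""
    for el in element.iter():
        # comments and processing instructions have non-string tags, skip them
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return element

root = ET.fromstring(SAMPLE)
command = root.find(f"{{{NS}}}VIEW/{{{NS}}}COMMAND")

# After stripping, no namespace is in use, so no xmlns attribute is emitted.
snippet = ET.tostring(strip_namespace(command), encoding="unicode")
print(snippet)
```

Note that this modifies the parsed tree in place, so do the `findall` lookups with the qualified names first and strip only right before writing.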

Generating HTML documents in python

In Python, what is the most elegant way to generate HTML documents? I currently append all of the tags manually to a giant string and write that to a file. Is there a more elegant way of doing this?
You can use yattag to do this in an elegant way. FYI I'm the author of the library.
from yattag import Doc
doc, tag, text = Doc().tagtext()
with tag('html'):
    with tag('body'):
        with tag('p', id = 'main'):
            text('some text')
        with tag('a', href='/my-url'):
            text('some link')
result = doc.getvalue()
It reads like html, with the added benefit that you don't have to close tags.
I would suggest using one of the many template languages available for python, for example the one built into Django (you don't have to use the rest of Django to use its templating engine) - a google query should give you plenty of other alternative template implementations.
I find that learning a template library helps in so many ways - whenever you need to generate an e-mail, HTML page, text file or similar, you just write a template, load it with your template library, then let the template code create the finished product.
Here's some simple code to get you started:
#!/usr/bin/env python
from django.template import Template, Context
from django.conf import settings
settings.configure() # We have to do this to use django templates standalone - see
# http://stackoverflow.com/questions/98135/how-do-i-use-django-templates-without-the-rest-of-django
# Our template. Could just as easily be stored in a separate file
template = """
<html>
<head>
<title>Template {{ title }}</title>
</head>
<body>
Body with {{ mystring }}.
</body>
</html>
"""
t = Template(template)
c = Context({"title": "title from code",
             "mystring": "string from code"})
print(t.render(c))
It's even simpler if you have templates on disk - check out the render_to_string function for django 1.7 that can load templates from disk from a predefined list of search paths, fill them with data from a dictionary and render to a string - all in one function call. (removed from django 1.8 on, see Engine.from_string for comparable action)
If you're building HTML documents then I highly suggest using a template system (like jinja2) as others have suggested. If you're in need of some low-level generation of HTML bits (perhaps as an input to one of your templates), then the xml.etree package is a standard Python package and might fit the bill nicely.
import sys
from xml.etree import ElementTree as ET
html = ET.Element('html')
body = ET.Element('body')
html.append(body)
div = ET.Element('div', attrib={'class': 'foo'})
body.append(div)
span = ET.Element('span', attrib={'class': 'bar'})
div.append(span)
span.text = "Hello World"
if sys.version_info < (3, 0, 0):
    # python 2
    ET.ElementTree(html).write(sys.stdout, encoding='utf-8',
                               method='html')
else:
    # python 3
    ET.ElementTree(html).write(sys.stdout, encoding='unicode',
                               method='html')
Prints the following:
<html><body><div class="foo"><span class="bar">Hello World</span></div></body></html>
There is also a nice, modern alternative: airium: https://pypi.org/project/airium/
from airium import Airium
a = Airium()
a('<!DOCTYPE html>')
with a.html(lang="pl"):
    with a.head():
        a.meta(charset="utf-8")
        a.title(_t="Airium example")
    with a.body():
        with a.h3(id="id23409231", klass='main_header'):
            a("Hello World.")
html = str(a) # casting to string extracts the value
print(html)
Prints such a string:
<!DOCTYPE html>
<html lang="pl">
<head>
<meta charset="utf-8" />
<title>Airium example</title>
</head>
<body>
<h3 id="id23409231" class="main_header">
Hello World.
</h3>
</body>
</html>
The greatest advantage of airium is that it also has a reverse translator, which builds Python code out of an HTML string. If you wonder how to implement a given HTML snippet, the translator gives you the answer right away.
Its repository contains tests with example pages translated automatically with airium in tests/documents. A good starting point, in place of a tutorial, is this one: tests/documents/w3_architects_example_original.html.py
I would recommend using xml.dom to do this.
http://docs.python.org/library/xml.dom.html
Read this manual page, it has methods for building up XML (and therefore XHTML). It makes all XML tasks far easier, including adding child nodes, document types, adding attributes, creating texts nodes. This should be able to assist you in the vast majority of things you will do to create HTML.
It is also very useful for analysing and processing existing xml documents.
Here is a tutorial that should help you with applying the syntax:
http://www.postneo.com/projects/pyxml/
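To make the xml.dom suggestion concrete, here is a minimal sketch using `xml.dom.minidom` from the standard library. It shows the operations the answer lists (document type, child nodes, attributes, text nodes). The XHTML doctype identifiers and the element contents are just illustrative choices:

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()

# Document type: qualified name, public id, system id.
doctype = impl.createDocumentType(
    "html",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
)

# Create the document with <html> as its root element.
doc = impl.createDocument(None, "html", doctype)
html = doc.documentElement

# Add child nodes.
body = doc.createElement("body")
html.appendChild(body)

# Add an element with an attribute and a text node.
p = doc.createElement("p")
p.setAttribute("id", "main")
p.appendChild(doc.createTextNode("some text & more"))  # '&' is escaped on output
body.appendChild(p)

markup = doc.toxml()
print(markup)
```

Special characters in text nodes are escaped when the document is serialized, which is the main advantage over gluing strings together by hand.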
I am using the code snippet known as throw_out_your_templates for some of my own projects:
https://github.com/tavisrudd/throw_out_your_templates
https://bitbucket.org/tavisrudd/throw-out-your-templates/src
Unfortunately, there is no pypi package for it and it's not part of any distribution as this is only meant as a proof-of-concept. I was also not able to find somebody who took the code and started maintaining it as an actual project. Nevertheless, I think it is worth a try even if it means that you have to ship your own copy of throw_out_your_templates.py with your code.
Similar to the suggestion to use yattag by John Smith Optional, this module does not require you to learn any templating language and also makes sure that you never forget to close tags or quote special characters. Everything stays written in Python. Here is an example of how to use it:
html(lang='en')[
    head[title['An example'], meta(charset='UTF-8')],
    body(onload='func_with_esc_args(1, "bar")')[
        div['Escaped chars: ', '< ', u'>', '&'],
        script(type='text/javascript')[
            'var lt_not_escaped = (1 < 2);',
            '\nvar escaped_cdata_close = "]]>";',
            '\nvar unescaped_ampersand = "&";'
        ],
        Comment('''
not escaped "< & >"
escaped: "-->"
'''),
        div['some encoded bytes and the equivalent unicode:',
            '你好', unicode('你好', 'utf-8')],
        safe_unicode('<b>My surrounding b tags are not escaped</b>'),
    ]
]
I am attempting to make an easier solution called PyperText, in which you can do stuff like this:
from PyperText.html import Script
from PyperText.htmlButton import Button
#from PyperText.html{WIDGET} import WIDGET; ex from PyperText.htmlEntry import Entry; variations shared in file
myScript=Script("myfile.html")
myButton=Button()
myButton.setText("This is a button")
myScript.addWidget(myButton)
myScript.createAndWrite()
I wrote a simple wrapper for the lxml module (should work fine with xml as well) that makes tags for HTML/XML-esque documents.
Really, I liked the format of the answer by John Smith but I didn't want to install yet another module to accomplish something that seemed so simple.
Example first, then the wrapper.
Example
from Tag import Tag
with Tag('html') as html:
    with Tag('body'):
        with Tag('div'):
            with Tag('span', attrib={'id': 'foo'}) as span:
                span.text = 'Hello, world!'
            with Tag('span', attrib={'id': 'bar'}) as span:
                span.text = 'This was an example!'
html.write('test_html.html')
Output:
<html><body><div><span id="foo">Hello, world!</span><span id="bar">This was an example!</span></div></body></html>
Output after some manual formatting:
<html>
<body>
<div>
<span id="foo">Hello, world!</span>
<span id="bar">This was an example!</span>
</div>
</body>
</html>
Wrapper
from dataclasses import dataclass, field
from lxml import etree
PARENT_TAG = None
@dataclass
class Tag:
    tag: str
    attrib: dict = field(default_factory=dict)
    parent: object = None
    _text: str = None
    @property
    def text(self):
        return self._text
    @text.setter
    def text(self, value):
        self._text = value
        self.element.text = value
    def __post_init__(self):
        self._make_element()
        self._append_to_parent()
    def write(self, filename):
        etree.ElementTree(self.element).write(filename)
    def _make_element(self):
        self.element = etree.Element(self.tag, attrib=self.attrib)
    def _append_to_parent(self):
        if self.parent is not None:
            self.parent.element.append(self.element)
    def __enter__(self):
        # the innermost open Tag becomes the parent of any Tag created inside it
        global PARENT_TAG
        if PARENT_TAG is not None:
            self.parent = PARENT_TAG
            self._append_to_parent()
        PARENT_TAG = self
        return self
    def __exit__(self, typ, value, traceback):
        # restore the enclosing Tag as the current parent
        global PARENT_TAG
        if PARENT_TAG is self:
            PARENT_TAG = self.parent

Best way to generate xml? [duplicate]

I'm creating a web API and need a good way to very quickly generate some well-formatted XML. I cannot find any good way of doing this in Python.
Note: Some libraries look promising but either lack documentation or only output to files.
ElementTree is a good module for reading xml and writing too e.g.
from xml.etree.ElementTree import Element, SubElement, tostring
root = Element('root')
child = SubElement(root, "child")
child.text = "I am a child"
print(tostring(root, encoding='unicode'))  # 'unicode' returns str instead of bytes
Output:
<root><child>I am a child</child></root>
See this tutorial for more details and how to pretty print.
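Since the linked tutorial is not reproduced here, a short sketch of pretty printing with the standard library alone may help. It assumes Python 3.9 or newer, where `xml.etree.ElementTree.indent` is available; on older versions `xml.dom.minidom` can be used instead:

```python
from xml.etree.ElementTree import Element, SubElement, tostring, indent

root = Element('root')
child = SubElement(root, 'child')
child.text = 'I am a child'

# indent() inserts whitespace into the tree in place (Python 3.9+)
indent(root, space='  ')

pretty = tostring(root, encoding='unicode')
print(pretty)
```

`tostring` returns a string here rather than writing to a file, which suits the web API use case from the question.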
Alternatively if your XML is simple, do not underestimate the power of string formatting :)
xmlTemplate = """<root>
<person>
<name>%(name)s</name>
<address>%(address)s</address>
</person>
</root>"""
data = {'name':'anurag', 'address':'Pune, india'}
print(xmlTemplate % data)
Output:
<root>
<person>
<name>anurag</name>
<address>Pune, india</address>
</person>
</root>
You can use string.Template or some template engine too, for complex formatting.
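For the `string.Template` variant, a minimal sketch could look like the following. Escaping the values with `xml.sax.saxutils.escape` is an addition that the answer does not mention, but without it a value containing `&` or `<` would produce malformed XML. The extra text in the address is only there to show the escaping:

```python
from string import Template
from xml.sax.saxutils import escape

xml_template = Template("""<root>
  <person>
    <name>$name</name>
    <address>$address</address>
  </person>
</root>""")

data = {'name': 'anurag', 'address': 'Pune, india & beyond'}

# Escape &, < and > in every value before substituting it into the template.
xml = xml_template.substitute({key: escape(value) for key, value in data.items()})
print(xml)
```

`substitute` raises a `KeyError` if a placeholder has no value, which catches typos in the template early.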
Using lxml:
from lxml import etree
# create XML
root = etree.Element('root')
root.append(etree.Element('child'))
# another child with text
child = etree.Element('child')
child.text = 'some text'
root.append(child)
# pretty string
s = etree.tostring(root, pretty_print=True, encoding='unicode')
print(s)
Output:
<root>
<child/>
<child>some text</child>
</root>
See the tutorial for more information.
I would use the yattag library.
from yattag import Doc
doc, tag, text = Doc().tagtext()
with tag('food'):
    with tag('name'):
        text('French Breakfast')
    with tag('price', currency='USD'):
        text('6.95')
    with tag('ingredients'):
        for ingredient in ('baguettes', 'jam', 'butter', 'croissants'):
            with tag('ingredient'):
                text(ingredient)
print(doc.getvalue())
FYI I'm the author of the library.
Use lxml.builder class, from: http://lxml.de/tutorial.html#the-e-factory
import lxml.builder as lb
from lxml import etree
nstext = "new story"
story = lb.E.Asset(
    lb.E.Attribute(nstext, name="Name", act="set"),
    lb.E.Relation(lb.E.Asset(idref="Scope:767"),
                  name="Scope", act="set")
)
print('story:\n' + etree.tostring(story, pretty_print=True, encoding='unicode'))
Output:
story:
<Asset>
<Attribute name="Name" act="set">new story</Attribute>
<Relation name="Scope" act="set">
<Asset idref="Scope:767"/>
</Relation>
</Asset>
An optional way if you want to use pure Python:
ElementTree is good for most cases, but it cannot emit CDATA sections, and it only gained pretty printing (ET.indent) in Python 3.9.
So, if you need CDATA and pretty printing you can use minidom:
minidom_example.py:
from xml.dom import minidom
doc = minidom.Document()
root = doc.createElement('root')
doc.appendChild(root)
leaf = doc.createElement('leaf')
text = doc.createTextNode('Text element with attributes')
leaf.appendChild(text)
leaf.setAttribute('color', 'white')
root.appendChild(leaf)
leaf_cdata = doc.createElement('leaf_cdata')
cdata = doc.createCDATASection('<em>CData</em> can contain <strong>HTML tags</strong> without encoding')
leaf_cdata.appendChild(cdata)
root.appendChild(leaf_cdata)
branch = doc.createElement('branch')
branch.appendChild(leaf.cloneNode(True))
root.appendChild(branch)
mixed = doc.createElement('mixed')
mixed_leaf = leaf.cloneNode(True)
mixed_leaf.setAttribute('color', 'black')
mixed_leaf.setAttribute('state', 'modified')
mixed.appendChild(mixed_leaf)
mixed_text = doc.createTextNode('Do not use mixed elements if it possible.')
mixed.appendChild(mixed_text)
root.appendChild(mixed)
xml_str = doc.toprettyxml(indent=" ")
with open("minidom_example.xml", "w") as f:
    f.write(xml_str)
minidom_example.xml:
<?xml version="1.0" ?>
<root>
<leaf color="white">Text element with attributes</leaf>
<leaf_cdata>
<![CDATA[<em>CData</em> can contain <strong>HTML tags</strong> without encoding]]> </leaf_cdata>
<branch>
<leaf color="white">Text element with attributes</leaf>
</branch>
<mixed>
<leaf color="black" state="modified">Text element with attributes</leaf>
Do not use mixed elements if it possible.
</mixed>
</root>
I've tried some of the solutions in this thread, and unfortunately, I found some of them to be cumbersome (i.e. requiring excessive effort when doing something non-trivial) and inelegant. Consequently, I thought I'd throw my preferred solution, web2py HTML helper objects, into the mix.
First, install the standalone web2py module:
pip install web2py
Unfortunately, the above installs an extremely antiquated version of web2py, but it'll be good enough for this example. The updated source is here.
Import web2py HTML helper objects documented here.
from gluon.html import *
Now, you can use web2py helpers to generate XML/HTML.
words = ['this', 'is', 'my', 'item', 'list']
# helper function
create_item = lambda idx, word: LI(word, _id = 'item_%s' % idx, _class = 'item')
# create the HTML
items = [create_item(idx, word) for idx,word in enumerate(words)]
ul = UL(items, _id = 'my_item_list', _class = 'item_list')
my_div = DIV(ul, _class = 'container')
>>> my_div
<gluon.html.DIV object at 0x00000000039DEAC8>
>>> my_div.xml()
# I added the line breaks for clarity
<div class="container">
<ul class="item_list" id="my_item_list">
<li class="item" id="item_0">this</li>
<li class="item" id="item_1">is</li>
<li class="item" id="item_2">my</li>
<li class="item" id="item_3">item</li>
<li class="item" id="item_4">list</li>
</ul>
</div>
