regex to parse tables wrapped into xml

Suppose we have a table:
Key|Val|Flag
01 |AAA| Y
02 |BBB| N
...
wrapped into xml this way:
<Data>
<R><F>Key</F><F>Val</F><F>Flag</F></R>
<R><F>01</F><F>AAA</F><F>Y</F></R>
<R><F>02</F><F>BBB</F><F>N</F></R>
...
</Data>
There can be more columns and rows, obviously.
Now I'd like to parse the XML back into a table using a single regex.
I can find all fields with '<F>([\w\d]*)</F>', but I need them to be grouped by rows somehow.
I thought about <R>(<F>([\w\d]*)</F>)*</R>, but the Python implementation finds nothing.
Can someone please help me compose the regex?
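A short sketch of why the single-regex attempt is awkward in Python's re module: a capture group inside a repetition keeps only its last repetition, so the pattern can match a whole row but exposes only its final field. The names row_re and field_re are illustrative, and the sample string is a shortened version of the data above.

```python
import re

xml = "<R><F>01</F><F>AAA</F><F>Y</F></R><R><F>02</F><F>BBB</F><F>N</F></R>"

# A repeated capture group remembers only its last repetition,
# so each row yields just its final field.
single = re.findall(r'<R>(?:<F>(\w*)</F>)*</R>', xml)
print(single)

# Two passes instead: one regex per row, another for the fields inside it.
row_re = re.compile(r'<R>(.*?)</R>')
field_re = re.compile(r'<F>(\w*)</F>')
rows = [field_re.findall(body) for body in row_re.findall(xml)]
print(rows)
```

The standard re module does not expose the intermediate repetitions of a group, which is why grouping by row needs either two passes or a parser.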
UPDATE
Some context of the question.
I'm aware of the many XML parsing libraries, but unfortunately my environment is limited to the standard library. Anyway, thanks to everyone who warned me not to use regexes for XML parsing.
I needed a quick and dirty solution, so I decided to start with regexes and switch to a real parser later.
So far I have the code:
...
row_p = r'<R>(.*?)</R>'
field_p = r'<F>(.*?)</F>'
table = ''
for row in re.finditer(row_p, xml):
    table += '|'.join(re.findall(field_p, row.group(1))) + '\n'
...
It works for small datasets (about 10'000 rows) but fails for tables larger than 500'000 rows.
I may investigate why it fails, but my next step is to switch to a standard XML parser. ElementTree is the first candidate.
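For the large-table case, a minimal sketch using only the standard library could stream the document with xml.etree.ElementTree.iterparse instead of holding one big string. The helper name rows_from_xml and the in-memory sample are assumptions for illustration; in practice the source would be a file path or file object. Whether this resolves the failure described above is untested, since its cause is not stated.

```python
import io
import xml.etree.ElementTree as ET

def rows_from_xml(source):
    """Yield one list of field strings per <R> element, streaming the input."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'R':
            yield [f.text or '' for f in elem.findall('F')]
            elem.clear()  # drop the fields of the row that was just emitted

# Hypothetical sample standing in for the real file.
sample = b"""<Data>
<R><F>Key</F><F>Val</F><F>Flag</F></R>
<R><F>01</F><F>AAA</F><F>Y</F></R>
<R><F>02</F><F>BBB</F><F>N</F></R>
</Data>"""

rows = list(rows_from_xml(io.BytesIO(sample)))
table = '\n'.join('|'.join(row) for row in rows)
print(table)
```

Note that elem.clear() empties each row but the emptied <R> elements still hang off the root, so memory use grows slowly with the row count rather than staying constant.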

Mandatory links:
RegEx match open tags except XHTML self-contained tags and
Can you provide some examples of why it is hard to parse XML and HTML with a regex?
Use an XML parser. lxml is very good and even provides (among other XML-related thingies) XPath - if you got a fetish with oneliners, I'm sure there is an XPath oneliner to extract these elements ;)

import libxml2
txt = '\n<Data>\n <R><F>Key</F><F>Val</F><F>Flag</F></R>\n <R><F>01</F><F>AAA</F><F>Y</F></R>\n <R><F>02</F><F>BBB</F><F>N</F></R>\n</Data>\n'
rows = []
for elem in libxml2.parseDoc(txt):
    if elem.name == 'R':
        # start a new row for every <R> element
        curRow = []
        rows.append(curRow)
    elif elem.name == 'F':
        # add each field to the current row
        curRow.append(elem.get_content())
returns:
rows = [['Key', 'Val', 'Flag'], ['01', 'AAA', 'Y'], ['02', 'BBB', 'N']]
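Since the asker is limited to the standard library, here is a rough equivalent of the loop above written against xml.etree.ElementTree rather than libxml2. It is a sketch of the same idea, not a drop-in replacement for the libxml2 API.

```python
import xml.etree.ElementTree as ET

txt = ('\n<Data>\n <R><F>Key</F><F>Val</F><F>Flag</F></R>\n'
       ' <R><F>01</F><F>AAA</F><F>Y</F></R>\n'
       ' <R><F>02</F><F>BBB</F><F>N</F></R>\n</Data>\n')

rows = []
# iter() walks the tree in document order, so every <F> follows its <R>.
for elem in ET.fromstring(txt).iter():
    if elem.tag == 'R':
        rows.append([])
    elif elem.tag == 'F':
        rows[-1].append(elem.text)

print(rows)
```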

If this question were tagged Perl, I could post a full solution with code, but since this is Python, here is just the idea.
I suggest you load the XML file and read it line by line. For each line, find all the fields within that line. As far as I know, matches in Python are returned as a list. This is only the main idea:
import re
with open('data.xml') as f:  # placeholder file name
    for line in f:
        fields = re.findall(r'<F>(\w*)</F>', line)  # every field on this line
        if fields:
            print('|'.join(fields))
DISCLAIMER: The above code is just a sketch; it assumes each row sits on its own line.
Oh, by the way, if possible, use an XML parser instead.

lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.

Related

Editing a DOCX file

I am working on a little project that should be quite simple. I know it's been done before, but for the life of me I cannot get it to work. I made a docx template in Microsoft Word that contains a header and some text in the body. My goal is to have a program that can change this text. Using python-docx I have been able to write a program that modifies the body text easily. That said, I am trying to learn how to do the same thing using XML parsing, which will also allow the header to be changed. Long story short, XML parsing (I think that's what it is) will give me much more freedom down the road.
I know after the docx is unzipped, the word/document.xml contains the body text.
Here is my code so far.
from lxml import etree as ET
tree = ET.parse('document.xml')
root = tree.getroot()
for i in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
    if i.text == 'Title':
        i.text = 'How to cook'
tree.write('document_output.xml', xml_declaration=True, encoding="UTF-8",
           method="xml", standalone="yes")
This program successfully changes the wanted text to the updated text.
Here is the original document.xml
https://www.dropbox.com/s/ghe1m176rdqtng7/document.xml?dl=0
Here is the output.
https://www.dropbox.com/s/8n9llagozbvb2mz/document_output.xml?dl=0
P.S. When viewing the files on Dropbox, everything starts at line 4 instead of line 1.
If you view them in an XML viewer you can see they are identical. Also, if you use a text difference tool, the only difference is the changed word. And I wouldn't think this would matter but the top line uses single quotes instead of double.
Hope someone can shed some light on why this is still not opening properly in Word.
Thanks for all the help!!

You're having the usual problems with ET.
As a starter, check out these Stack Overflow threads:
Namespace 1
Namespace 2
Namespace 3 with xml declaration
xml declaration
As you can see, you're not the first person with these problems.
What you could do for the namespaces is parse the xml twice:
first time in order to extract the namespaces and
a second time in order to do your actual work.
Besides, some people have already suggested switching from ElementTree to lxml.
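The two-pass idea above can be sketched with the standard library: a first pass with iterparse collects the prefix-to-URI declarations, which are registered so that serialization keeps the original prefixes instead of generated ones such as ns0. The inline document is a made-up, minimal stand-in for word/document.xml, and whether this alone makes Word accept the file is not established here.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for word/document.xml.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b'<w:body><w:p><w:r><w:t>Title</w:t></w:r></w:p></w:body></w:document>'
)

# First pass: collect the namespace declarations (prefix -> URI).
namespaces = dict(
    ns for _, ns in ET.iterparse(io.BytesIO(xml_bytes), events=('start-ns',))
)
for prefix, uri in namespaces.items():
    ET.register_namespace(prefix, uri)  # keep the original prefixes on output

# Second pass: the actual work.
root = ET.fromstring(xml_bytes)
for t in root.iter('{%s}t' % namespaces['w']):
    if t.text == 'Title':
        t.text = 'How to cook'

out = ET.tostring(root, encoding='unicode')
print(out)
```

ElementTree only writes declarations for namespaces that are actually used by element or attribute names, so a real document.xml may still lose declarations that appear only inside attribute values.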

using python to parse xml data

I have a question regarding XML and Python. I want to comb through this XML file and look for certain tags, then within those tags look for data separated by a comma, split it, and put each piece on a new line. I have the logic down; I'm just not familiar enough with Python to know which modules I should be researching. Any pointers on where to start would help.
172.28.18.142,10.0.0.2
thanks

I think when it comes to XML parsing in Python there are a few options: lxml, the standard library's xml package, and BeautifulSoup. Most of my experience is with the first two, and I've found lxml to be far faster than xml. Here's an lxml snippet that finds all elements with a particular tag and stores the comma-separated text of a child tag as a list. You'll probably want to add try/except blocks and tinker with the details, but this should get you started.
from lxml import etree
file_path = r'C:\Desktop\some_file.xml'
tree = etree.parse(file_path)
info_list = []
my_tag_path = tree.xpath('//topTag')
for elem in my_tag_path:
    if elem.find('.//childTag') is not None:
        info_list.append(elem.xpath('.//childTag')[0].text.split(','))
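If lxml is not available, roughly the same thing can be done with the standard library. This is a sketch: the tag names topTag and childTag are the placeholders from the snippet above, and the inline document is invented around the sample value from the question.

```python
import xml.etree.ElementTree as ET

# Made-up document using the placeholder tag names.
sample = """<root>
  <topTag><childTag>172.28.18.142,10.0.0.2</childTag></topTag>
  <topTag><other>ignored</other></topTag>
</root>"""

root = ET.fromstring(sample)
info_list = []
for elem in root.iter('topTag'):
    child = elem.find('.//childTag')
    # Skip elements without the child tag or with empty text.
    if child is not None and child.text:
        info_list.append(child.text.split(','))

for parts in info_list:
    print('\n'.join(parts))  # one address per line
```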

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way, using ElementTree, to search for <Id> and grab whatever info is in that tag? I was trying things like
for Reel in root.findall('Reel'):
... id = Reel.findtext('Id')
... print id
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy way just to match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to the root, children, etc.?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
@Mata that worked perfectly, but when I tried to use it for different values in another XML file I fell flat on my face. For instance, what about this section of a file? I couldn't post the whole thing, unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>

The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This would give you a list of the text of all Id tags that are descendants of Reel, at any depth.
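To make the difference concrete, here is a small sketch run against a shortened copy of the Reel sample from the question (only the Id elements are kept):

```python
import xml.etree.ElementTree as ET

reel = ET.fromstring("""<Reel>
  <Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
  <AssetList>
    <MainPicture><Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id></MainPicture>
    <MainSound><Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id></MainSound>
  </AssetList>
</Reel>""")

# 'Id' alone looks only at direct children of <Reel>.
direct = reel.findtext('Id')

# './/Id' searches all levels below <Reel>.
all_ids = [elem.text for elem in reel.findall('.//Id')]

print(direct)
print(all_ids)
```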
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The XPath subset ElementTree supports is rather limited. If you want more complete support, you should use lxml instead; its XPath support is far more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
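If lxml is not an option, the standard library can also ignore the namespace. A sketch, assuming Python 3.8 or later for the {*} wildcard; the second variant compares local names by hand and should also work on older versions. The document is a shortened copy of the KDM sample with only two KeyId entries.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#">'
    '<AuthenticatedPublic><RequiredExtensions><KeyIdList>'
    '<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>'
    '<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>'
    '</KeyIdList></RequiredExtensions></AuthenticatedPublic>'
    '</DCinemaSecurityMessage>'
)

# Python 3.8+: {*} matches the tag in any namespace (or none).
key_ids = [elem.text for elem in root.findall('.//{*}KeyId')]

# Manual variant: strip the '{namespace}' part of each tag before comparing.
by_local_name = [
    elem.text for elem in root.iter()
    if elem.tag.rsplit('}', 1)[-1] == 'KeyId'
]

print(key_ids)
```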

It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.

Here's what I needed to do. This works for finding whatever I need.
for node in tree.iter():  # getiterator() was removed in Python 3.9
    if 'KeyId' in node.tag:
        # node.tag is only the namespaced tag name; node.text holds the value
        print(node.text)

Targeting specific sub-elements when parsing XML with Python

I'm working on building a simple parser to handle a regular data feed at work. This post, XML to csv(-like) format, has been very helpful. I'm using a for loop like the one in the solution to loop through all of the elements and sub-elements I need to target, but I'm still a bit stuck.
For instance, my xml file is structured like so:
<root>
<product>
<identifier>12</identifier>
<identifier>ab</identifier>
<contributor>Alex</contributor>
<contributor>Steve</contributor>
</product>
</root>
I want to target only the second identifier, and only the first contributor. Any suggestions on how might I do that?
Cheers!

The other answer you pointed to has an example of how to turn all instances of a tag into a list. You could just loop through those and discard the ones you're not interested in.
However, there's a way to do this directly with XPath: the mini-language supports item indexes in brackets:
import xml.etree.ElementTree as etree
document = etree.parse("your.xml")
secondIdentifier = document.find(".//product/identifier[2]")
firstContributor = document.find(".//product/contributor[1]")
# find() returns Element objects; .text gives their content
print(secondIdentifier.text, firstContributor.text)
prints
ab Alex
Note that in XPath, the first index is 1, not 0.
ElementTree's find and findall only support a subset of XPath, described here. Full XPath, described in brief on W3Schools and more fully in the W3C's normative document is available from lxml, a third-party package, but one that is widely available. With lxml, the example would look like this:
import lxml.etree as etree
document = etree.parse("your.xml")
# xpath() returns a list of matches, hence the [0]
secondIdentifier = document.xpath(".//product/identifier[2]")[0]
firstContributor = document.xpath(".//product/contributor[1]")[0]
print(secondIdentifier.text, firstContributor.text)
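For comparison, the first approach mentioned in this answer (collect all instances into a list and pick by position) might look like this with the standard library. It is a sketch against the sample from the question; note that Python lists are 0-based, unlike XPath positions. The results list is only there to gather the output.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<root>
  <product>
    <identifier>12</identifier>
    <identifier>ab</identifier>
    <contributor>Alex</contributor>
    <contributor>Steve</contributor>
  </product>
</root>""")

results = []
for product in root.iter('product'):
    identifiers = product.findall('identifier')
    contributors = product.findall('contributor')
    # Guard against products with fewer entries than expected.
    second_identifier = identifiers[1].text if len(identifiers) > 1 else None
    first_contributor = contributors[0].text if contributors else None
    results.append((second_identifier, first_contributor))

print(results)
```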

Python: Keyword to Links

I am building a blog on Google App Engine. I would like to convert some keywords in my blog posts to links, just like what you see in many WordPress blogs.
Here is one WP plugin which does the same thing: http://wordpress.org/extend/plugins/blog-mechanics-keyword-link-plugin-v01/
A plugin that allows you to define keyword/link pairs. The keywords are automatically linked in each of your posts.
I think this is more than a simple Python replace. What I am dealing with is HTML code, which can be quite complex sometimes.
Take the following code snippet as an example. I want to convert the word example into a link to http://example.com:
Here is an example link:<a href="http://example.com">example.com</a>
With a simple Python replace that swaps example for <a href="http://example.com">example</a>, the output would be:
Here is an <a href="http://example.com">example</a> link:<a href="http://<a href="http://example.com">example</a>.com"><a href="http://example.com">example</a>.com</a>
but I want:
Here is an <a href="http://example.com">example</a> link:<a href="http://example.com">example.com</a>
Is there any Python plugin that is capable of this? Thanks a lot!

This is roughly what you could do using BeautifulSoup:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2
html_body = """
Here is an example link:<a href='http://example.com'>example.com</a>
"""
soup = BeautifulSoup(html_body)
# Wrap existing link text in '|' so it is not treated as a keyword.
for link_tag in soup.findAll('a'):
    link_tag.string = "%s%s%s" % ('|', link_tag.string, '|')
# Replace the keyword in every text node.
for text in soup.findAll(text=True):
    text_formatted = ['<a href="http://example.com">example</a>'
                      if word == 'example' and not (word.startswith('|') and word.endswith('|'))
                      else word for word in text.split()]
    text.replaceWith(' '.join(text_formatted))
# Remove the '|' markers again.
for link_tag in soup.findAll('a'):
    link_tag.string = link_tag.string[1:-1]
print soup
Basically I'm going through all the text nodes of html_body and replacing the word example with the given link, without touching existing link text, which is protected by the '|' characters during the pass.
This is not 100% perfect; for example, it does not work if the word you are trying to replace ends with a period. With some patience you could fix all the edge cases.

This would probably be better suited to client-side code. You could easily modify a word highlighter to get the desired results. By keeping this client-side, you can avoid having to expire page caches when your 'tags' change.
If you really need it to be processed server-side, then you need to look at using re.sub which lets you pass in a function, but unless you are operating on plain-text you will have to first parse the HTML using something like minidom to ensure you are not replacing something in the middle of any elements.
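A server-side sketch of that idea using only the standard library: html.parser walks the markup, and re.sub with a function links keywords only in text that is not already inside an <a> element. The class name KeywordLinker, the helper link_keywords and the links mapping are all hypothetical; the sketch drops comments and doctype declarations and does not try to cover every HTML edge case.

```python
import re
from html import escape
from html.parser import HTMLParser

class KeywordLinker(HTMLParser):
    """Re-emit HTML, linking keywords in text outside existing <a> elements."""

    def __init__(self, links):
        super().__init__(convert_charrefs=False)
        self.links = links
        self.pattern = re.compile(r'\b(%s)\b' % '|'.join(map(re.escape, links)))
        self.out = []
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchor_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if self.anchor_depth:
            self.out.append(data)  # leave existing link text alone
        else:
            self.out.append(self.pattern.sub(self._link, data))

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def _link(self, match):
        word = match.group(1)
        return '<a href="%s">%s</a>' % (escape(self.links[word]), word)

def link_keywords(html, links):
    parser = KeywordLinker(links)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

html_body = "Here is an example link:<a href='http://example.com'>example.com</a>"
result = link_keywords(html_body, {'example': 'http://example.com'})
print(result)
```

Because attribute values never reach handle_data, the keyword inside href is left untouched, which is the failure mode a plain string replace runs into.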
