How can I convert a large XML file (500 MB) with a complex structure into CSV?
Sample XML:
<images>
<image ismain="1" sml="1" med="1" big="0"><id>2</id><title><![CDATA[]]></title><url>www.mysite.com/45656.jpeg</url></image>
<image ismain="1" sml="1" med="0" big="1"><id>2</id><title><![CDATA[]]></title><url>www.mysite.com/354456.jpeg</url></image>
</images>
Code Python :
from xmlutils.xml2csv import xml2csv
converter = xml2csv("/home/mehul/Downloads/instant/static/images.xml", "/home/mehul/Downloads/instant/static/images.csv", encoding="utf-8")
converter.convert(tag="image")
Actual Output:
id,title,url
2,,www.mysite.com/45656.jpeg
2,,www.mysite.com/354456.jpeg
Expected Output:
id,ismain sml med big,title,url
2,,,,,www.mysite.com/45656.jpeg
2,,,,,www.mysite.com/354456.jpeg
As far as I have used xmlutils, it doesn't work well with complex structures, such as XML with nested tags. Furthermore, you want all the attributes too.
I had worked on this in a company project, and basically I had to write my own parsing code.
You can use Python's built-in xml library to parse through the XML, and check for events such as start and end tags, and then extract the data.
In fact, if all your tag names are in lowercase, you can just use Python's HTMLParser. It has predefined handler methods for these events, which you can override. Note, however, that it converts tag names to lowercase, so uppercase names in the original are lost.
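The event-based approach described above can be sketched with the standard library's xml.etree.ElementTree.iterparse, which reports each element as its end tag is read. This is only a minimal sketch, not the xmlutils API: the function name images_to_csv, its parameters, and the column order (attributes first, then child elements) are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def images_to_csv(xml_source, csv_target,
                  row_tag="image",
                  attr_names=("ismain", "sml", "med", "big"),
                  child_names=("id", "title", "url")):
    """Stream row_tag elements from xml_source and write one CSV row each.

    The function name, parameters, and column order are hypothetical.
    """
    writer = csv.writer(csv_target)
    writer.writerow(list(attr_names) + list(child_names))
    # iterparse yields each element once its end tag has been read, so the
    # file is processed incrementally instead of being loaded in one go.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != row_tag:
            continue
        attrs = [elem.get(name, "") for name in attr_names]
        children = [(elem.findtext(name) or "") for name in child_names]
        writer.writerow(attrs + children)
        # Drop the row's children, text and attributes once written.
        elem.clear()


sample = b"""<images>
<image ismain="1" sml="1" med="1" big="0"><id>2</id><title><![CDATA[]]></title><url>www.mysite.com/45656.jpeg</url></image>
<image ismain="1" sml="1" med="0" big="1"><id>2</id><title><![CDATA[]]></title><url>www.mysite.com/354456.jpeg</url></image>
</images>"""

out = io.StringIO()
images_to_csv(io.BytesIO(sample), out)
print(out.getvalue())
```

For a real file, pass the XML path as xml_source and a file opened with open(csv_path, "w", newline="", encoding="utf-8") as csv_target.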
Related
I am a writer of books and I am new to Python. The question I have is strategic. I write my manuscripts into an xml-file with proprietary tags (they have a size around 1 MB and 5000-1000 lines):
manuscript.xml
<h>Title of Chapter</h>
<p>This is a sentence, with one word written in <i>italics</i></p>
I often want to output what I have written so far and I am trying to create a fully automated workflow with Python. Python should convert my XML into two different XML-schemes:
1. HTML for epub (with creating IDs):
<h1 id="title-of-chapter">Title of Chapter</h1>
<p>This is a sentence, with one word written in <i>italics</i></p>
Then save as manuscript.html.
2. ODT:
<text:h text:style-name="HeadlineStyle1" text:outline-level="1">GetByName</text:h>
<text:p text:style-name="ParagraphStyle1">This is a sentence, with one word written in <text:span text:style-name="Italics">italics</text:span></text:p>
Then save as content.xml
I am not sure whether I should really parse the XML (original XML → dict → new XML) at all. Wouldn't it be easier to handle the original file as plain text and to let Python just convert the tags, thus <p> becomes <text:p text:style-name="ParagraphStyle1">?
On the other side, the task above is only the first step. Later on, I would like to make Python create a table of contents by collecting all headlines and write it into the helper file toc.ncx and finally letting Python zip all those files into an epub-container.
There are a lot of tutorials out there about xml → dict, but it is hard to find anything that goes into detail about the second step, dict → xml.
It is easily possible to rename tags in ElementTree:
for oldTag in root.iter('oldtag'):
    oldTag.tag = 'newtag'
XPath cannot do that. It cannot transform XML; it can only select elements from it. (XSLT, by contrast, is designed for exactly this kind of XML-to-XML transformation.)
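Applied to the manuscript from the question, the tag-renaming idea might look like the sketch below. It is one possible approach under stated assumptions: the wrapper element manuscript and the helpers slugify and to_html are hypothetical, introduced only because the fragment in the question has no single root element and no ID scheme is specified.

```python
import re
import xml.etree.ElementTree as ET


def slugify(text):
    """Turn 'Title of Chapter' into 'title-of-chapter' (hypothetical helper)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def to_html(root):
    """Rename <h> to <h1> in place and give each heading an id attribute."""
    # list() takes a snapshot so tags can be renamed safely while looping.
    for heading in list(root.iter("h")):
        heading.tag = "h1"
        heading.set("id", slugify("".join(heading.itertext())))
    return root


# Assumption: the manuscript is wrapped in one root element, <manuscript>.
source = (
    "<manuscript>"
    "<h>Title of Chapter</h>"
    "<p>This is a sentence, with one word written in <i>italics</i></p>"
    "</manuscript>"
)

root = to_html(ET.fromstring(source))
html = ET.tostring(root, encoding="unicode")
print(html)

# The same tree can feed a table of contents: (id, title) per heading.
toc = [(h.get("id"), "".join(h.itertext())) for h in root.iter("h1")]
print(toc)
```

Working on the parsed tree rather than on plain text keeps inline markup such as the i element intact and makes later steps, like collecting headlines for toc.ncx, a simple loop over the same tree.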
Right now I have some code which uses Biopython and NCBI's "Entrez" API to get XML strings from Pubmed Central. I'm trying to parse the XML with ElementTree to just have the text from the page. Although I have BeautifulSoup code that does exactly this when I scrape the lxml data from the site itself, I'm switching to the NCBI API since scrapers are apparently a no-no. But now with the XML from the NCBI API, I'm finding ElementTree extremely unintuitive and could really use some help getting it to work. Of course I've looked at other posts, but most of these deal with namespaces and in my case, I just want to use the XML tags to grab information. Even the ElementTree documentation doesn't go into this (from what I can tell). Can anyone help me figure out the syntax to grab information within certain tags rather than within certain namespaces?
Here's an example. Note: I use Python 3.4
Small snippet of the XML:
<sec sec-type="materials|methods" id="s5">
<title>Materials and Methods</title>
<sec id="s5a">
<title>Overgo design</title>
<p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50–60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p>
<table-wrap id="pone-0114631-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id>
<label>Table 2</label>
<caption>
<title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title>
</caption>
<alternatives>
<graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"/>
<col align="center" span="1"/>
</colgroup>
For my project, I want all of the text in the "p" tags (not just for this snippet of the XML, but for the entire XML string).
Now, I already know that I can make the whole XML string into an ElementTree Object
>>> import xml.etree.ElementTree as ET
>>> tree = ET.ElementTree(ET.fromstring(xml_string))
>>> root = ET.fromstring(xml_string)
Now if I try to get the text using the tag like this:
>>> text = root.find('p')
>>> print("".join(text.itertext()))
or
>>> text = root.get('p').text
I can't extract the text that I want. From what I've read, this is because I'm using the tag "p" as an argument rather than a namespace.
While I feel like it should be quite simple for me to get all the text in "p" tags within an XML file, I'm currently unable to do it. Please let me know what I'm missing and how I can fix this. Thanks!
--- EDIT ---
So now I know that I should be using this code to get everything in the 'p' tags:
>>> text = root.find('.//p')
>>> print("".join(text.itertext()))
Despite the fact that I'm using itertext(), it's only returning content from the first "p" tag and not looking at any other content. Does itertext() only iterate within a tag? The documentation seems to suggest it iterates across all tags as well, so I'm not sure why it's only returning one line instead of all of the text under all of the "p" tags.
---- FINAL EDIT --
I figured out that itertext() only works within one tag and find() only returns the first item. In order to get the entire text that I want, I must use findall():
>>> all_text = root.findall('.//p')
>>> for texts in all_text:
...     print("".join(texts.itertext()))
root.get() is the wrong method, as it will retrieve an attribute of the root tag not a subtag.
root.find() is correct as it will find the first matching subtag (alternatively one can use root.findall() for all matching subtags).
If you want to find not only direct subtags but also indirect subtags (as in your example), the expression passed to root.find/root.findall has to be written in the supported subset of XPath (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support). In your case it is './/p':
text = root.find('.//p')
print("".join(text.itertext()))
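To collect the text of every "p" element rather than only the first, findall() can be combined with itertext(). The sketch below uses a shortened stand-in for the article XML; the sample string is an assumption made to keep the example small.

```python
import xml.etree.ElementTree as ET

# Shortened, made-up stand-in for the article XML from the question.
xml_string = """<sec id="s5">
  <title>Materials and Methods</title>
  <sec id="s5a">
    <title>Overgo design</title>
    <p>Four overgo pairs were designed (<xref rid="t002">Table 2</xref>).</p>
    <p>The overgos were designed using OligoSpawn software.</p>
  </sec>
</sec>"""

root = ET.fromstring(xml_string)

# findall('.//p') returns every <p> at any depth; itertext() then walks the
# text of one <p>, including text inside and after nested tags like <xref>.
paragraphs = ["".join(p.itertext()) for p in root.findall(".//p")]

for paragraph in paragraphs:
    print(paragraph)
```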
I am trying to read a XML file using python [ver - 2.6.7] using ElementTree
There are some tags of the format :
<tag, [attributes]>
....Data....
</tag>
The data in my case is usually some binary data that I read using text attribute.
However there are some cases where data can reference any other tag in the file.
<tag, [attributes]>
....Data....
<ref target='idname'/>
</tag>
What attribute from element tree can be used to parse them ?
Try XPath expressions.
This will tell you whether the tag is present and, if present, returns the node.
I think I would use something like this:
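for iteration in root.iter('tag'):
    if iteration.find('ref') is not None:
        ...
The explicit "is not None" check matters: an element without children, such as <ref target='idname'/>, evaluates as false in a plain "if" test. So basically I would parse those cases separately.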
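A minimal sketch of handling the two cases separately is shown below. The document layout is an assumption: the question does not say how a target is identified, so this example supposes each tag element carries an id attribute that ref's target points to.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the "id" attribute is an assumed convention.
xml_string = """<root>
  <tag id="idname">payload</tag>
  <tag id="other">prefix <ref target="idname"/></tag>
</root>"""

root = ET.fromstring(xml_string)

# Index every <tag> by its id so references can be looked up directly.
by_id = {el.get("id"): el for el in root.iter("tag")}

resolved = {}
for el in root.iter("tag"):
    ref = el.find("ref")
    if ref is not None:
        # The element points at another tag: use the target's data.
        resolved[el.get("id")] = by_id[ref.get("target")].text
    else:
        # Plain case: the data is the element's own text.
        resolved[el.get("id")] = el.text

print(resolved)
```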
My question is regarding how to get information stored in a tag which allows for no closing tag. Here's the relevant xml:
<?xml version="1.0" encoding="UTF-8"?>
<uws:job>
<uws:results>
<uws:result id="2014-03-03T15:42:31:1337" xlink:href="http://www.cosmosim.org/query/index/stream/table/2014-03-03T15%3A42%3A31%3A1337/format/csv" xlink:type="simple"/>
</uws:results>
</uws:job>
I'm looking to extract the xlink:href url here. As you can see the uws:result tag requires no closing tag. Additionally, having the 'uws:' makes it a bit tricky to handle them when working in python. Here's what I've tried so far:
from lxml import etree
root = etree.fromstring(xmlresponse.content)
url = root.find('{*}results').text
Where xmlresponse.content is the xml data to be parsed. What this returns is
'\n '
which indicates that it's only finding the newline character, since what I'm really after is contained within a tag inside the results tag. Any ideas would be greatly appreciated.
You found the parent node, but the URL you want is an attribute of its child element, uws:result, not text. Instead of
url = root.find('{*}results').text
you really want
url = root.find('.//{*}result').get('attribname', 'value_to_return_if_not_present')
or
url = root.find('.//{*}result').attrib['attribname']
(which will throw an exception if not present).
Because of the namespace on the attribute itself, you will probably need to use the {ns}attrib syntax to look it up too.
You can dump out the attrib dictionary and just copy the attribute name out too.
Here, text is only the whitespace between the <uws:results> start tag and its first child element, which is why you got a newline and spaces. Attribute values are never part of text.
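Putting the pieces together, a sketch with the standard library's ElementTree (rather than lxml) could look as follows. The namespace declarations are assumptions: the snippet in the question omits them, so the UWS URI below is a guess, while the xlink URI is the standard one.

```python
import xml.etree.ElementTree as ET

UWS_NS = "http://www.ivoa.net/xml/UWS/v1.0"  # assumed; not shown in the question
XLINK_NS = "http://www.w3.org/1999/xlink"    # standard XLink namespace

xml_response = f"""<?xml version="1.0" encoding="UTF-8"?>
<uws:job xmlns:uws="{UWS_NS}" xmlns:xlink="{XLINK_NS}">
  <uws:results>
    <uws:result id="2014-03-03T15:42:31:1337" xlink:href="http://www.cosmosim.org/query/index/stream/table/2014-03-03T15%3A42%3A31%3A1337/format/csv" xlink:type="simple"/>
  </uws:results>
</uws:job>"""

root = ET.fromstring(xml_response.encode("utf-8"))

# Look up the <uws:result> element itself, not its <uws:results> parent.
result = root.find(".//uws:result", namespaces={"uws": UWS_NS})

# Namespaced attributes are keyed as "{namespace-uri}localname".
url = result.get(f"{{{XLINK_NS}}}href")
print(url)

# Dumping attrib shows the exact keys to copy, as suggested above.
print(result.attrib)
```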
I have a file which contains the names of scientists in the following format:
<scientist_names>
<scientist>abc</scientist>
</scientist_names>
I want to use Python to extract the names of the scientists from the above format. How should I do it?
I would like to use regular expressions but don't know how to use them. Please help.
DO NOT USE REGULAR EXPRESSIONS! (all reasons well explained [here])
Use an xml/html parser, take a look at BeautifulSoup.
This is XML and you should use an XML parser like lxml instead of regular expressions (because XML is not a regular language).
Here is an example:
from lxml import etree
text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""
tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print(scientist.text)
As noted, this appears to be xml. In that case, you should use an xml parser to parse this document; I recommend lxml ( http://lxml.de ).
Given your requirements, you may find it more convenient to use SAX-style parsing, rather than DOM-style, because SAX parsing simply involves registering handlers when the parser encounters a particular tag, as long as the meaning of a tag is not dependent on context, and you have more than one type of tag to process (which may not be the case here).
In case your input document may be incorrectly formed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML
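The SAX-style approach mentioned above can be sketched with the standard library's xml.sax module. The handler class name ScientistHandler is hypothetical; the overridden methods are the standard ContentHandler callbacks.

```python
import xml.sax


class ScientistHandler(xml.sax.ContentHandler):
    """Collect the text of every <scientist> element (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # a list while inside <scientist>, else None

    def startElement(self, name, attrs):
        if name == "scientist":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "scientist":
            self.names.append("".join(self._buffer))
            self._buffer = None


text = b"<scientist_names> <scientist>abc</scientist> </scientist_names>"
handler = ScientistHandler()
xml.sax.parseString(text, handler)
print(handler.names)
```

For a document this small the DOM-style lxml example is simpler; the handler pattern pays off mainly when the file is large or several tag types need their own processing.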
Here is a simple example that should handle the XML tags for you:
# import the function used to make HTTP requests (Python 3):
from urllib.request import urlopen
# import an easy-to-use XML parser called minidom:
from xml.dom.minidom import parseString
# both modules are part of the Python 3 standard library

# download the file if it's not on the same machine, otherwise just read it from a path:
response = urlopen('http://www.somedomain.com/somexmlfile.xml')
# read the response body:
data = response.read()
# close the connection because we don't need it anymore:
response.close()
# parse the XML you downloaded:
dom = parseString(data)
# retrieve the first element that the parser finds with the given tag name,
# in your case <scientist>:
element = dom.getElementsByTagName('scientist')[0]
xmlTag = element.toxml()
# take the text inside the element (<tag>data</tag> ---> data);
# this assumes the element contains text and nothing else:
xmlData = element.firstChild.data
# print out the XML tag and data in this format: <tag>data</tag>
print(xmlTag)
# just print the data
print(xmlData)
If you find anything unclear just let me know