I would like to parse XML in Python, but from a string rather than from a file. Can someone help me do this?
From a file, you could normally do it as
from xml.dom import minidom
xmldoc = minidom.parse('~/diveintopython/common/py/kgp/binary.xml')
For a string, you can change it to
from xml.dom import minidom
xmldoc = minidom.parseString(your_xml_string)
You could use: xml.dom.minidom.parseString(text)
This method creates a StringIO object for the string and passes that on to parse().
You could also use the same technique of using StringIO for any other XML parser that expects a file-like object.
import StringIO
your_favourite_xml_parser.parse(StringIO.StringIO('<xml>...</xml>'))
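For instance, a minimal sketch with the standard-library ElementTree parser, keeping the Python 2 StringIO module used above (on Python 3 it would be io.StringIO); the XML content and names here are made up for illustration:

import StringIO
import xml.etree.ElementTree as ET

xml_string = '<root><item name="a"/><item name="b"/></root>'
# ElementTree.parse() expects a filename or a file-like object,
# so wrapping the string in StringIO works here as well.
tree = ET.parse(StringIO.StringIO(xml_string))
for item in tree.getroot().findall('item'):
    print(item.get('name'))   # prints: a, then b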
You can also use xml.etree.cElementTree.
import xml.etree.cElementTree as ET
aElement = ET.fromstring('<Root id="UUID_1"><Item id="id_Item" /></Root>')
See the Python documentation for the ElementTree API.
Each element has a number of properties associated with it (see the short example after this list):
a tag, which is a string identifying what kind of data the element represents (the element type, in other words);
a number of attributes, stored in a Python dictionary;
a text string;
an optional tail string;
a number of child elements, stored in a Python sequence.
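For example, continuing with the aElement parsed above, those properties can be read like this (the commented values are what this particular snippet yields):

aElement = ET.fromstring('<Root id="UUID_1"><Item id="id_Item" /></Root>')
print(aElement.tag)       # 'Root' - the element type
print(aElement.attrib)    # {'id': 'UUID_1'} - attributes as a dict
print(aElement.text)      # None - no text directly inside <Root>
print(aElement.tail)      # None - no text after the element
for child in aElement:    # child elements behave like a Python sequence
    print(child.tag)      # 'Item'
    print(child.attrib)   # {'id': 'id_Item'}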
You can also use lxml. My startup (http://dealites.com) involves a lot of XML processing every day. I have tried almost every XML library available in Python, and lxml is the best library available for XML processing.
You can also try Beautiful Soup. It is best known for HTML parsing, but it is a good alternative to lxml.
lxml example:
from lxml import etree
parsedfeed = etree.fromstring('your xml here')
Beautiful Soup example:
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup('your xml here')
Python 2.7
I assume I'm missing something incredibly basic having to do with lxml, but I have no idea what it is. By way of background, I have not used lxml much before, but I have used XPath extensively in Selenium and have also done a bit of parsing with BS4.
So, I'm making a call to this API that returns some XML as a string. Easy enough:
from lxml import etree
from io import StringIO
myXML = 'xml here'
tree = etree.parse(StringIO(myXML))
print tree.xpath('/IKnowThisTagExistsInMyXML')
It always returns [] or None. I've tried tree.find() and tree.findall() as well, to no avail.
I'm hoping someone has seen this before and can tell me what's going on.
Using an XPath of /IKnowThisTagExistsInMyXML assumes the tag IKnowThisTagExistsInMyXML is the top-level element of your XML document, which it most likely is not.
Try searching your XML document for this tag anywhere instead:
print tree.xpath('//*/IKnowThisTagExistsInMyXML')
See: XPath Syntax
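As a rough illustration of the difference, using an invented document (the real XML was not shown in the question):

from lxml import etree

# Hypothetical structure: the tag is nested, not the document element.
root = etree.fromstring('<Response><Body><IKnowThisTagExistsInMyXML>hi</IKnowThisTagExistsInMyXML></Body></Response>')

print(root.xpath('/IKnowThisTagExistsInMyXML'))     # [] - only a top-level element with this name would match
print(root.xpath('//*/IKnowThisTagExistsInMyXML'))  # [<Element IKnowThisTagExistsInMyXML at 0x...>]
print(root.xpath('//IKnowThisTagExistsInMyXML'))    # same result; // on its own also searches the whole document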
I have what I believe to be a fairly simple issue.
I've retrieved a file from gdata, this file: https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments
I'm attempting to single out the text between the
<author>HERE</author>
tags so I'll be left with output containing only usernames. Is Python even the best way to go about this, or should I use another language? I've been googling since 8:00am (4 hrs) and I've yet to find anything for such a seemingly easy task.
Best regards,
- Mitch Powell
You have an Atom feed there, so I'd use feedparser to handle that:
import feedparser
result = feedparser.parse('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
for entry in result.entries:
    print entry.author
This prints:
FreebieFM
micromicros
FreebieFM
Sarah Grimstone
FreebieFM
# etc.
Feedparser is an external library, but easily installed. If you have to use only the standard library, you could use the ElementTree API, but to parse the Atom feed you need to include HTML entities in the parser, and you'll have to deal with namespaces (not ElementTree's strong point):
from urllib2 import urlopen
from xml.etree import ElementTree
response = urlopen('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
tree = ElementTree.parse(response)
nsmap = {'a': 'http://www.w3.org/2005/Atom'}
for author in tree.findall('.//a:author/a:name', namespaces=nsmap):
    print author.text
The nsmap dictionary lets ElementTree translate the a: prefix to the correct namespace for those elements.
If I have a string that looks something like...
"<tr><td>123</td><td>234</td>...<td>697</td></tr>"
Basically a table row with n cells.
What's the easiest way in Python to get the value of each cell? That is, I just want the values "123", "234", "697" stored in a list or array, or whatever is easiest.
I've tried to use regular expressions: when I use re.match I am not able to get it to find anything, and if I use re.search I can only get the first cell. But I want to get all the cells. If I can't do this with n cells, how would you do it with a fixed number of cells?
If that markup is part of a larger set of markup, you should prefer a tool with an HTML parser.
One such tool is BeautifulSoup.
Here's one way to find what you need using that tool:
>>> markup = '''"<tr><td>123</td><td>234</td>...<td>697</td></tr>"'''
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(markup)
>>> for i in soup.find_all('td'):
... print(i.text)
Result:
123
234
697
Don't do this with regular expressions. Just use a proper HTML parser, and use something like XPath to get the elements you want.
A lot of people like lxml. For this task, you will probably want to use the BeautifulSoup backend, or use BeautifulSoup directly, because this is presumably not markup from a source known to generate well-formed, valid documents.
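For example, a minimal sketch with lxml's HTML parser and an XPath query, wrapping the fragment from the question in a <table> so the parser does not have to guess the missing context:

from lxml import html

fragment = "<tr><td>123</td><td>234</td><td>697</td></tr>"
# lxml's HTML parser is forgiving; the XPath then collects every cell's text.
row = html.fromstring("<table>%s</table>" % fragment)
print(row.xpath('//td/text()'))   # ['123', '234', '697']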
When using lxml, an element tree gets created. Each element in the element tree holds information about a tag.
from lxml import etree

# Parse the XML string into an Element (the root of the tree)
root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
# Find every <a> element anywhere below the root
elements = root.findall(".//a")
tag = elements[0].tag      # 'a'
attr = elements[0].attrib  # {'x': '123'}
I have a file which contains names of scientists in the following format:
<scientist_names>
<scientist>abc</scientist>
</scientist_names>
I want to use Python to strip out the names of the scientists from the above format. How should I do it?
I would like to use regular expressions, but I don't know how to use them. Please help.
DO NOT USE REGULAR EXPRESSIONS! (all reasons well explained [here])
Use an XML/HTML parser instead; take a look at BeautifulSoup.
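For example, a minimal sketch with BeautifulSoup 4 (installed as the bs4 package); the sample XML is taken from the question:

from bs4 import BeautifulSoup

text = "<scientist_names> <scientist>abc</scientist> </scientist_names>"
soup = BeautifulSoup(text, "html.parser")

# find_all() returns every <scientist> tag; .text gives its character data
for scientist in soup.find_all("scientist"):
    print(scientist.text)   # abc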
This is XML, and you should use an XML parser like lxml instead of regular expressions (because XML is not a regular language).
Here is an example:
from lxml import etree
text = """<scientist_names> <scientist>abc</scientist> </scientist_names>"""
tree = etree.fromstring(text)
for scientist in tree.xpath("//scientist"):
    print scientist.text
As noted, this appears to be xml. In that case, you should use an xml parser to parse this document; I recommend lxml ( http://lxml.de ).
Given your requirements, you may find SAX-style parsing more convenient than DOM-style parsing, because SAX parsing simply involves registering handlers that run when the parser encounters a particular tag. This works well as long as the meaning of a tag does not depend on its context and you have more than one type of tag to process (which may not be the case here).
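For completeness, here is a rough SAX-style sketch using the standard library's xml.sax; the handler class is my own invention, and the tag names come from the snippet in the question:

import xml.sax

class ScientistHandler(xml.sax.ContentHandler):
    # Collects the character data of every <scientist> element.
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_scientist = False
        self.names = []

    def startElement(self, name, attrs):
        if name == "scientist":
            self.in_scientist = True
            self.names.append("")

    def characters(self, content):
        if self.in_scientist:
            self.names[-1] += content

    def endElement(self, name):
        if name == "scientist":
            self.in_scientist = False

handler = ScientistHandler()
xml.sax.parseString(b"<scientist_names><scientist>abc</scientist></scientist_names>", handler)
print(handler.names)   # ['abc']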
In case your input document may be incorrectly formed, you may wish to use Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML
Here is a simple example that should handle the XML tags for you:
#import library to do http requests:
import urllib2
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
#download the file if it's not on the same machine otherwise just use a path:
file = urllib2.urlopen('http://www.somedomain.com/somexmlfile.xml')
#convert to string:
data = file.read()
#close the file because we don't need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName,
#in your case <scientist>:
xmlTag = dom.getElementsByTagName('scientist')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<scientist>','').replace('</scientist>','')
#print out the xml tag and data in this format: <tag>data</tag>
print xmlTag
#just print the data
print xmlData
If you find anything unclear, just let me know.
I am using BeautifulStoneSoup to parse an XML document and change some attributes. I noticed that it automatically converts all XML tags to lowercase. For example, my source file has <DocData> elements, which BeautifulSoup converts to <docdata>.
This appears to be causing problems since the program I am feeding my modified XML document to does not seem to accept the lowercase versions. Is there a way to prevent this behavior in BeautifulSoup?
No, that's not a built-in option. The source is pretty straightforward, though. It looks like you want to change the value of encodedName in Tag.__str__.
Simple answer:
Change the parser from the default html.parser to the xml parser:
soup = BeautifulSoup(yourXmlStr, 'xml')
Detailed explanation:
Refer to my answer in another post.
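A rough illustration of the difference (this assumes BeautifulSoup 4, with lxml installed to provide the 'xml' feature; the sample string is invented):

from bs4 import BeautifulSoup

xml_str = '<DocData Id="1"><Child/></DocData>'

# The default HTML parser lowercases tag names ...
print(BeautifulSoup(xml_str, 'html.parser').find('docdata').name)   # docdata

# ... whereas the xml parser keeps the original case.
print(BeautifulSoup(xml_str, 'xml').find('DocData').name)           # DocData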