Python 2.7
I assume I'm missing something incredibly basic having to do with lxml, but I have no idea what it is. By way of background, I haven't used lxml much before, but I have used XPaths extensively in Selenium and have also done a bit of parsing with BS4.
So, I'm making a call to this API that returns some XML as a string. Easy enough:
from lxml import etree
from io import StringIO
myXML = 'xml here'
tree = etree.parse(StringIO(myXML))
print tree.xpath('/IKnowThisTagExistsInMyXML')
It always returns [] or None. I've tried tree.find() and tree.findall() as well, to no avail.
I'm hoping someone has seen this before and can tell me what's going on.
Using an XPath of /IKnowThisTagExistsInMyXML assumes the tag IKnowThisTagExistsInMyXML is the top-level (root) element of your XML document, which I really doubt it is.
Try searching your whole XML document for the tag instead:
print tree.xpath('//IKnowThisTagExistsInMyXML')
See: XPath Syntax
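For illustration, here is a minimal, self-contained sketch (the document and tag name are made up). One side note: in Python 2, io.StringIO wants unicode, so etree.fromstring is an easier way to parse a plain string:
from lxml import etree

# a made-up document, just to show absolute vs. anywhere paths
myXML = '<root><IKnowThisTagExistsInMyXML>hi</IKnowThisTagExistsInMyXML></root>'
root = etree.fromstring(myXML)  # accepts a plain byte string directly

print root.xpath('/IKnowThisTagExistsInMyXML')   # [] -- the root element is <root>
print root.xpath('//IKnowThisTagExistsInMyXML')  # [<Element IKnowThisTagExistsInMyXML at 0x...>]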
I have a question with regard to XML and Python. I want to comb through an XML file and look for certain tags, and then within those tags find data separated by a comma, split it, and write each value on a new line. I have the logic down; I'm just not familiar enough with Python to know which modules I should be researching. Any help as to where I should start researching would be appreciated. The data looks like this:
172.28.18.142,10.0.0.2
thanks
I think when it comes to XML parsing in Python there are a few options: lxml, the standard-library xml module, and BeautifulSoup. Most of my experience has been with the first two, and I've found lxml to be considerably faster than xml. Here's an lxml snippet that parses all elements of the document with a particular tag and stores the comma-separated text of each tag as a list. You'll probably want to add some try/except blocks and tinker with the details, but this should get you started.
from lxml import etree

file_path = r'C:\Desktop\some_file.xml'
tree = etree.parse(file_path)

info_list = []
for elem in tree.xpath('//topTag'):
    child = elem.find('.//childTag')
    # guard against missing tags and empty text before splitting
    if child is not None and child.text:
        info_list.append(child.text.split(','))
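If the end goal is to write each value on its own line, as described above, a short follow-up might look like this (the output path is hypothetical):
# write each comma-separated value on its own line
with open(r'C:\Desktop\output.txt', 'w') as out:
    for values in info_list:
        for value in values:
            out.write(value + '\n')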
I have what I believe to be a fairly simple issue.
I've retrieved a file from gdata, this file: https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments
I'm attempting to single out the text between the
<author>HERE</author>
tags so I'll be left with output containing only usernames. Is Python even the best way to go about this, or should I use another language? I've been googling since 8:00am (4 hrs) and I've yet to find anything for such a seemingly easy task.
Best regards,
- Mitch Powell
You have an Atom feed there, so I'd use feedparser to handle it:
import feedparser

result = feedparser.parse('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
for entry in result.entries:
    print entry.author
This prints:
FreebieFM
micromicros
FreebieFM
Sarah Grimstone
FreebieFM
# etc.
Feedparser is an external library, but easily installed. If you have to use only the standard library, you could use the ElementTree API, but to parse the Atom feed you need to include HTML entities in the parser, and you'll have to deal with namespaces (not ElementTree's strong point):
from urllib2 import urlopen
from xml.etree import ElementTree

response = urlopen('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
tree = ElementTree.parse(response)

# map a prefix to the Atom namespace so it can be used in the find path
nsmap = {'a': 'http://www.w3.org/2005/Atom'}
for author in tree.findall('.//a:author/a:name', namespaces=nsmap):
    print author.text
The nsmap dictionary lets ElementTree translate the a: prefix to the correct namespace for those elements.
I am trying to query an HTML document, parsed with lxml, using XPath. The document is a straight HTML-only download of the Wikipedia page about Plastic. I parse it with lxml, disabling entity substitution to avoid an error with '®':
from lxml import etree
root = etree.parse("plastic.html", etree.XMLParser(resolve_entities=False))
Then I retrieve the namespace URL:
htmltag = root.iter().next()
nsurl = htmltag.nsmap.values()[0]
Now I would like to run XPath queries on either 'root' or 'htmltag', but I am unable to do so. I have tried different ways, and the following seems to me the most correct form, yet it still yields an error:
root.xpath('//ns:body',namespace={'ns',nsurl})
And this is what I get
XPathResultError: Unknown return type: dict
I am running the commands in an IPython console, but I don't think that is the problem. What am I doing wrong?
This is a simple misspelling: the keyword argument is namespaces, not namespace. Note also that {'ns', nsurl} with a comma is a set literal; the namespace mapping must be a dict, {'ns': nsurl}.
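Putting it together, a minimal corrected sketch using the same file and variables as above:
from lxml import etree

root = etree.parse("plastic.html", etree.XMLParser(resolve_entities=False))
htmltag = root.iter().next()
nsurl = htmltag.nsmap.values()[0]

# 'namespaces' (plural), and a dict with a colon rather than a set with a comma
print root.xpath('//ns:body', namespaces={'ns': nsurl})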
Currently I am having trouble typing this because, according to top, my processor is at 100% and my memory is at 85.7%, all of it taken up by Python.
Why? Because I had it go through a 250 MB file to remove markup. 250 megs, that's it! I've been manipulating these files in Python with so many other modules and tools; BeautifulSoup is the first code to give me any problems with something so small. How does nearly 4 GB of RAM get used to manipulate 250 MB of HTML?
The one-liner that I found (on stackoverflow) and have been using was this:
''.join(BeautifulSoup(corpus).findAll(text=True))
Additionally, this seems to remove everything BUT markup, which is sort of the opposite of what I want to do. I'm sure that BeautifulSoup can do that, too, but the speed issue remains.
Is there anything that will do something similar (remove markup, leave text reliably) and NOT require a Cray to run?
lxml.html is FAR more efficient.
http://lxml.de/lxmlhtml.html
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Looks like this will do what you want.
import lxml.html

t = lxml.html.fromstring("...")  # "..." stands in for your HTML string
print t.text_content()           # all markup stripped, text only
A couple of other similar questions:
python [lxml] - cleaning out html tags
lxml.etree, element.text doesn't return the entire text from an element
Filter out HTML tags and resolve entities in python
UPDATE:
You probably want to clean the HTML to remove all scripts and CSS, and then extract the text using .text_content()
from lxml import html
from lxml.html.clean import clean_html

tree = html.parse('http://www.example.com')
tree = clean_html(tree)               # strip scripts, styles, and other unwanted markup
text = tree.getroot().text_content()  # then extract just the text
(From: Remove all html in python?)
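As I understand it, clean_html also accepts a raw string and returns a cleaned string, which is handy if the HTML is already in memory; a small sketch with made-up markup:
from lxml.html.clean import clean_html

# string in, cleaned string out: the <script> element is dropped
print clean_html('<html><body><script>alert(1)</script><p>hi</p></body></html>')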
Use Cleaner from lxml.html.clean:
>>> import lxml.html
>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(style=True)  # scripts and comments are removed by default; style=True drops styles too
>>> body = lxml.html.fromstring(content).xpath('//body')[0]
>>> print lxml.html.tostring(cleaner.clean_html(body))
I am using BeautifulStoneSoup to parse an XML document and change some attributes. I noticed that it automatically converts all XML tags to lowercase. For example, my source file has <DocData> elements, which BeautifulSoup converts to <docdata>.
This appears to be causing problems since the program I am feeding my modified XML document to does not seem to accept the lowercase versions. Is there a way to prevent this behavior in BeautifulSoup?
No, that's not a built-in option. The source is pretty straightforward, though. It looks like you want to change the value of encodedName in Tag.__str__.
Simple Answer
Change the parser from the default (html.parser) to the xml parser:
code: soup = BeautifulSoup(yourXmlStr, 'xml')
Detailed Explanation
Refer to my answer in another post.
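To see the difference, here's a small sketch with a made-up document (the xml parser requires lxml to be installed):
from bs4 import BeautifulSoup

doc = '<DocData Attr="1"/>'
print BeautifulSoup(doc, 'html.parser')  # <docdata attr="1"></docdata> -- lowercased
print BeautifulSoup(doc, 'xml')          # keeps <DocData Attr="1"/> with case intact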