I am trying to query an HTML document parsed with lxml using XPath. The document is a straight HTML-only download of the Wikipedia page about Plastic. I parse it with lxml, disabling entity substitution to avoid an error with '®':
from lxml import etree
root = etree.parse("plastic.html",etree.XMLParser(resolve_entities=False))
Then I retrieve the namespace URL:
htmltag = root.iter().next()
nsurl = htmltag.nsmap.values()[0]
Now I would like to run XPath queries on either 'root' or 'htmltag', but I am unable to do so. I have tried different ways; the following seems to me the most correct form, but it yields an error anyway.
root.xpath('//ns:body', namespace={'ns': nsurl})
And this is what I get
XPathResultError: Unknown return type: dict
I am running the commands in an IPython console, but I don't think that is the problem. What am I doing wrong?
This is a simple misspelling: you should use namespaces instead of namespace.
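With the keyword corrected, the query from the question becomes:
root.xpath('//ns:body', namespaces={'ns': nsurl})
This should return the matching body element in a list instead of raising the error.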
Python 2.7
I assume I'm missing something incredibly basic having to do with lxml, but I have no idea what it is. By way of background, I have not used lxml much before, but have used XPath extensively in Selenium and have also done a bit of parsing with BS4.
So, I'm making a call to this API that returns some XML as a string. Easy enough:
from lxml import etree
from io import StringIO
myXML = 'xml here'
tree = etree.parse(StringIO(myXML))
print tree.xpath('/IKnowThisTagExistsInMyXML')
It always returns [] or None. I've tried tree.find() and tree.findall() as well, to no avail.
I'm hoping someone has seen this before and can tell me what's going on.
Using an XPath of /IKnowThisTagExistsInMyXML assumes the tag IKnowThisTagExistsInMyXML is at the top level of your XML document, which I really doubt it is.
Try searching your XML document for this tag instead by doing:
print tree.xpath('//*/IKnowThisTagExistsInMyXML')
See: XPath Syntax
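For illustration, a minimal self-contained sketch (the tag names are made up) showing why the absolute path returns nothing while a descendant search works:

from lxml import etree
from io import BytesIO

# Hypothetical XML standing in for the API response
my_xml = b'<Response><Items><Item>one</Item></Items></Response>'
tree = etree.parse(BytesIO(my_xml))

print(tree.xpath('/Item'))    # [] - '/' starts at the document root, which is <Response>
print(tree.xpath('//Item'))   # [<Element Item at 0x...>] - '//' searches all descendants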
I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:
soup = bs4.BeautifulSoup(html, 'lxml')
but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?
According to the Specifying the parser to use documentation page:
The first argument to the BeautifulSoup constructor is a string or an
open filehandle–the markup you want parsed. The second argument is how
you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
In other words, just installing lxml in the same python environment makes it a default parser.
Note, though, that explicitly stating a parser is considered a best-practice approach. There are differences between parsers that can result in subtle errors which would be difficult to debug if you let BeautifulSoup choose the best parser by itself. You also have to remember that you need lxml installed; and if you didn't have it installed, you wouldn't even notice - BeautifulSoup would just pick the next available parser without throwing any errors.
If you still don't want to specify the parser explicitly, at least make a note for your future self or others who will use the code in the project's README/documentation, and list lxml in your project requirements alongside beautifulsoup4.
Besides: "Explicit is better than implicit."
Obviously take a look at the accepted answer first. It is pretty good, and as for this technicality:
but I don't want to have to repeatedly type 'lxml' every time I call
BeautifulSoup. Is there a way I can set which parser to use once at
the beginning of my program?
If I understood your question correctly, I can think of two approaches that will save you some keystrokes: define a wrapper function, or create a partial function.
# V1 - define a wrapper function - most straight-forward.
import bs4
def bs_parse(html):
    return bs4.BeautifulSoup(html, 'lxml')
# ...
html = ...
bs_parse(html)
Or if you feel like showing off ...
import bs4
from functools import partial
bs_parse = partial(bs4.BeautifulSoup, features='lxml')
# ...
html = ...
bs_parse(html)
I'm using pywikibot-core; before that I used another Python MediaWiki API wrapper, Wikipedia.py (which has a .HTML method). I switched to pywikibot-core because I think it has many more features, but I can't find a similar method.
(beware: I'm not very skilled).
I'll post user283120's second answer here, which is more precise than the first one:
Pywikibot core doesn't support any direct (HTML) way to interact with the wiki, so you should use the API.
If you need to, you can do it easily by using urllib2.
This is an example I used to get HTML of a wiki page in commons:
import urllib2
...
url = "https://commons.wikimedia.org/wiki/" + page.title().replace(" ","_")
html = urllib2.urlopen(url).read().decode('utf-8')
"[saveHTML.py] downloads the HTML-pages of articles and images and saves the interesting parts, i.e. the article-text and the footer to a file"
source: https://git.wikimedia.org/blob/pywikibot%2Fcompat.git/HEAD/saveHTML.py
IIRC you want the HTML of entire pages, so you need something that uses api.php?action=parse. In Python I'd often just use wikitools for such a thing; I don't know about PWB or the other requirements you have.
In general you should use pywikibot instead of wikipedia (e.g. instead of "import wikipedia" you should use "import pywikibot"), and if you are looking for methods and classes that were in wikipedia.py, they are now separated and can be found in the pywikibot folder (mainly in page.py and site.py).
If you want to run scripts that you wrote for compat, you can use a script in pywikibot-core named compat2core.py (in the scripts folder), and there is detailed help about the conversion in README-conversion.txt; read it carefully.
The MediaWiki API has a parse action which lets you get the HTML snippet for wiki markup as returned by the MediaWiki markup parser.
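For reference, here is a minimal sketch of calling that parse action directly, assuming the requests library and the English Wikipedia endpoint (the page title is just an example):

import requests

# action=parse returns the page rendered as HTML
params = {
    'action': 'parse',
    'page': 'Plastic',   # example title
    'format': 'json',
}
r = requests.get('https://en.wikipedia.org/w/api.php', params=params)
html = r.json()['parse']['text']['*']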
For the pywikibot library there is already a function implemented which you can use like this:
def getHtml(self, pageTitle):
    '''
    get the HTML code for the given page title

    Args:
        pageTitle(str): the title of the page to retrieve

    Returns:
        str: the rendered HTML code for the page
    '''
    page = self.getPage(pageTitle)
    html = page._get_parsed_page()
    return html
When using the mwclient Python library there is a generic api method; see:
https://github.com/mwclient/mwclient/blob/master/mwclient/client.py
It can be used to retrieve the html code like this:
def getHtml(self, pageTitle):
    '''
    get the HTML code for the given page title

    Args:
        pageTitle(str): the title of the page to retrieve
    '''
    api = self.getSite().api("parse", page=pageTitle)
    if "parse" not in api:
        raise Exception("could not retrieve html for page %s" % pageTitle)
    html = api["parse"]["text"]["*"]
    return html
As shown above, this gives a duck-typed interface which is implemented in the py-3rdparty-mediawiki library, for which I am a committer. This was resolved by closing issue 38 - add html page retrieval.
With Pywikibot you may use http.request() to get the html content:
import pywikibot
from pywikibot.comms import http
site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Elvis Presley')
path = '{}/index.php?title={}'.format(site.scriptpath(), page.title(as_url=True))
r = http.request(site, path)
print(r[94:135])
This should give the html content
'<title>Elvis Presley – Wikipedia</title>\n'
With Pywikibot 6.0, http.request() gives a requests.Response object rather than plain text. In this case you must use the text attribute:
print(r.text[94:135])
to get the same result.
I'm trying to use lxml to read a response from the AWS REST API but not having any luck. I can easily parse the response and print it, but none of the find or xpath functions find anything. For example, take this document fragment:
<DistributionConfig xmlns="http://cloudfront.amazonaws.com/doc/2013-11-11/">
<CallerReference>e6d6909d-f1ed-47f1-83d9-290acf10f324</CallerReference>
<Aliases>
<Quantity>1</Quantity>
<Items>
And this code:
from lxml import etree
root = etree.XML( ... )
node = root.find( 'Quantity' )
node is always None. I've tried a variety of XPaths like //Quantity and .//Quantity, and also the xpath function, but can't find anything.
How do I use this library on this type of document?
Seems you will need to supply the namespace of the element as well:
>>> root.find('.//aws:Quantity', namespaces={'aws': 'http://cloudfront.amazonaws.com/doc/2013-11-11/'})
<Element {http://cloudfront.amazonaws.com/doc/2013-11-11/}Quantity at 0xb6c16aa4>
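The xpath() method the question mentions needs the same namespace mapping; a quick sketch:

>>> root.xpath('//aws:Quantity', namespaces={'aws': 'http://cloudfront.amazonaws.com/doc/2013-11-11/'})
[<Element {http://cloudfront.amazonaws.com/doc/2013-11-11/}Quantity at 0x...>]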
I'm finally upgrading (rewriting ;) ) my first Django app, but I am migrating all the content.
I foolishly gave users a full WYSIWYG editor for certain tasks; the HTML it produces is of course terribly ugly, with more extra tags than content.
Does anyone know of a library or external shell app I could use to clean up the code?
I use tidy sometimes, but as far as I know it doesn't do what I'm asking. I want to simplify all the extra span and other garbage tags. I cleaned the most offending styles with some regex, but it would take a really long time to do anything more using just regex.
Any ideas?
You could also take a look at Bleach, a whitelist-based HTML sanitizer. It uses html5lib to do what Kyle posted, but you'll get a lot more control over which elements and attributes are allowed in the final output.
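A minimal sketch of what that looks like; the allowed tags and attributes below are only examples, so check the Bleach docs for the exact arguments your version accepts:

import bleach

dirty = '<span style="font-family: Verdana"><b>Hello</b> world</span>'
# Only the whitelisted tags survive; strip=True removes disallowed tags
# (keeping their text) instead of escaping them.
clean = bleach.clean(dirty,
                     tags=['p', 'b', 'i', 'a'],
                     attributes={'a': ['href']},
                     strip=True)
# clean == '<b>Hello</b> world'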
Beautiful Soup will probably get you a more complete solution, but you might be able to get some cleanup done more simply with html5lib (if you're OK with html5 rules):
import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer
my_html = "<i>Some html fragment</I>" #intentional 'I'
html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
dom_tree = html_parser.parseFragment(my_html)
walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)
s = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False, quote_attr_values=True)
cleaned_html = s.render(stream)
cleaned_html == '<i>Some html fragment</i>'
You can also sanitize the html by initializing your html_parser like this:
html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"), tokenizer=sanitizer.HTMLSanitizer)
The standard answer is Beautiful Soup.
"Extra span" and "garbage tags" is something you'll need to define very, very carefully so you can remove the tags without removing content.
I would suggest you do two things.
Fix your app so that users don't provide HTML under any circumstances. Django can use RST markup which is much more user-friendly. http://docs.djangoproject.com/en/1.3/ref/templates/builtins/#django-contrib-markup
Write a Beautiful Soup parser and transform the user's content into RST markup. Keep the structural elements (headings, lists, etc.) and lose the formatting to the extent possible.
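As a starting point for the second step, here is a rough bs4 sketch that unwraps purely presentational tags and drops inline styling while keeping the structural elements; the tag and attribute lists are only examples, and the actual HTML-to-RST conversion is left out:

from bs4 import BeautifulSoup

def strip_formatting(html):
    soup = BeautifulSoup(html, 'lxml')
    # Unwrap presentational tags, keeping their text content
    for tag in soup.find_all(['span', 'font']):
        tag.unwrap()
    # Drop styling attributes from the structural tags that remain
    for tag in soup.find_all(True):
        tag.attrs.pop('style', None)
        tag.attrs.pop('class', None)
    return str(soup)

print(strip_formatting('<p class="MsoNormal"><span style="color:red">Hello</span> world</p>'))
# -> <html><body><p>Hello world</p></body></html>  (the lxml parser adds the html/body wrapper)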