I'm trying to get the text from one tag using lxml etree.
<div class="litem__type">
<div>
Robbp
</div>
<div>Estimation</div>
+487 (0)639 14485653
•
<a href="mailto:herbrich@gmail.com">
Email Address
</a>
•
<a class="external" href="http://www.google.com">
Homepage
</a>
</div>
The problem is that I can't locate it reliably, because these snippets vary a lot. Sometimes the first and second div are not there at all. As you can see, the telephone number is not in its own div.
I suppose it would be possible to extract the telephone number using BeautifulSoup's contents, but I'm trying to use the lxml module's xpath.
Do you have any ideas? (The email link is sometimes missing.)
EDIT: The best idea is probably to use a regex, but I don't know how to tell it to extract just the text between two <div></div> elements.
You should avoid using regex to parse XML/HTML wherever possible, because patterns break easily on nested or irregular markup, which an element tree handles for you.
The text after element A's closing tag, but before element B's opening tag, is called element A's tail text. To select this tail text using lxml etree you could do the following:
content = '''
<div class="litem__type">
<div>Robbp</div>
<div>Estimation</div>
+487 (0)639 14485653
<a href="mailto:herbrich@gmail.com">Email Address</a>
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''
from lxml import etree
tree = etree.XML(content)
phone_number = tree.xpath('div[2]')[0].tail.strip()
print(phone_number)
Output
'+487 (0)639 14485653'
The strip() function is used here to remove whitespace on either side of the tail text.
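To make the text/tail distinction concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree elements expose the same .text and .tail attributes. The snippet string is a made-up, shortened variant of the markup in the question.

```python
import xml.etree.ElementTree as ET

# Made-up, shortened variant of the markup from the question.
snippet = "<div><b>Name</b> +487 (0)639 14485653 <a>Homepage</a> end</div>"
root = ET.fromstring(snippet)

bold, link = root[0], root[1]

print(repr(root.text))  # text inside <div> before its first child
print(repr(bold.text))  # text inside <b>
print(repr(bold.tail))  # text after </b>, up to the next tag
print(repr(link.tail))  # text after </a>, up to </div>
```

The phone number belongs to no element's text; it is only reachable as the tail of the element that precedes it.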
You can iterate over the div elements and read the text that follows each one.
from lxml import etree
tree = etree.parse("filename.xml")
items = tree.xpath('//div')
for node in items:
    # you can check here if it is a phone number
    print(node.tail)
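The "check here if it is a phone number" step could look roughly like the sketch below. It also copes with the case where the leading divs are missing, because it inspects the container's own text as well as every child's tail. The regular expression and the find_phone helper are assumptions for illustration, not part of either answer, and the standard library's ElementTree stands in for lxml (the .text/.tail handling is identical).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: a leading "+" followed by digits, spaces and brackets.
PHONE_RE = re.compile(r"\+\d[\d ()]{6,}\d")

def find_phone(container):
    """Return the first phone-like string among the container's direct text nodes."""
    # Direct text nodes are the container's own .text plus each child's .tail.
    chunks = [container.text] + [child.tail for child in container]
    for chunk in chunks:
        match = PHONE_RE.search(chunk or "")
        if match:
            return match.group(0)
    return None

with_divs = ET.fromstring(
    '<div class="litem__type"><div>Robbp</div><div>Estimation</div>'
    ' +487 (0)639 14485653 \u2022 <a href="http://www.google.com">Homepage</a></div>'
)
without_divs = ET.fromstring(
    '<div class="litem__type"> +487 (0)639 14485653 \u2022 '
    '<a href="http://www.google.com">Homepage</a></div>'
)

print(find_phone(with_divs))
print(find_phone(without_divs))
```

A real page will need a pattern tuned to the number formats that actually occur there.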
How do you access a text in an XPath if it doesn't have a node?
The text is in quotation marks and on a separate line inside another node
I'm having trouble choosing the correct element in an XPath
<span>
<a href="www.imagine_a_link_here.org">
"
This is the text I need to access
"
</a>
</span>
I'd normally do this by writing
import requests
from lxml import html,etree
from lxml.html import document_fromstring
page = requests.get('https://www.the_link_im_trying_to_webscrape.org')
tree = html.fromstring(page.content)
the_text_i_need_to_access_xpath = '/span/a/text()'
the_text_i_need_to_access = tree.xpath(the_text_i_need_to_access_xpath)
Unfortunately this is only returning an empty list. Does anyone know how I have to modify the XPath in order to get the string I'm looking for?
How do you access a text in an XPath if it doesn't have a node?
Text in an XML or HTML document will be associated with a node. That's not the problem here. And the " " delimiters are just there to show you surrounding whitespace.
As presented, your XPath should select the text within the a element. Here are some reasons why that may not be happening:
As @MadsHansen mentioned in the comments, the root element of your actual HTML may not be a span as shown. See:
Difference between "//" and "/" in XPath?
The text may not be loaded at the time of your XPath execution because the document hasn't completely loaded or because JavaScript dynamically changes the DOM later. See:
Selenium wait until document is ready
Selenium WebDriver: Wait for complex page with JavaScript to load
fromstring() can use a bit more magic than might be expected:
fromstring(string):
Returns document_fromstring or fragment_fromstring, based on
whether the string looks like a full document, or just a fragment.
Given this, here is an update to your code that will select the targeted text as expected:
import requests
from lxml import html
from lxml.html import document_fromstring
htmlstr = """
<span>
<a href="www.imagine_a_link_here.org">
"
This is the text I need to access
"
</a>
</span>
"""
tree = html.fromstring(htmlstr)
print(html.tostring(tree))
the_text_i_need_to_access_xpath = '//span/a/text()'
the_text_i_need_to_access = tree.xpath(the_text_i_need_to_access_xpath)
print(the_text_i_need_to_access)
Or, if you don't need/want the HTML surprises, this also selects the text:
import lxml.etree as ET
xmlstr = """
<span>
<a href="www.imagine_a_link_here.org">
"
This is the text I need to access
"
</a>
</span>
"""
root = ET.fromstring(xmlstr)
print(root.xpath('/span/a/text()'))
Credit: Thanks to @ThomasWeller for pointing out the additional complications and helping to resolve them.
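The root-element point can be illustrated with a small sketch. It assumes the real page wraps the span in html and body elements, and it uses the standard library's ElementTree, which only supports a subset of XPath (relative paths, with ".//" playing the role of "//"). With lxml the same contrast shows up between '/span/a/text()' and '//span/a/text()'. The last step collapses the whitespace that surrounds the text.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the real page: the <span> sits below <html> and <body>.
page = ET.fromstring(
    "<html><body><span><a href='#'>\n"
    "  This is the text I need to access\n"
    "</a></span></body></html>"
)

# Matching only direct children of the root finds nothing ...
anchored = page.findall("span/a")
# ... while a descendant search finds the link at any depth.
anywhere = page.findall(".//span/a")

# Collapse the surrounding newlines and indentation into single spaces.
texts = [" ".join(a.text.split()) for a in anywhere]

print(anchored)
print(texts)
```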
I've defined CSS selectors within the script to get the text within span elements, and I'm getting them accordingly. However, the way I did it is definitely messy. I just separated the different CSS selectors with a comma to let the script understand I'm after this or that.
If I opt for xpath I could have used 'div//span[.="Featured" or .="Sponsored"]' but in case of css selector I could not find anything similar to serve the same purpose. I know using 'span:contains("Featured"),span:contains("Sponsored")' I can get the text but there is the comma in between as usual.
What is the ideal way to locate the elements (within different ids) using css selectors except for comma?
My try so far with:
from lxml.html import fromstring
html = """
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">
Pizza Hut
</a>
<div id="featured other-dynamic-ids">
<span>Sponsored</span>
</div>
</div>
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/salads-up">
Salads UP
</a>
<div id="other-dynamic-ids border">
<span>Featured</span>
</div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("[id~='featured'] span,[id~='border'] span"):
    print(item.text)
You can do:
.rest-list-information div span
But I think it's a bad idea to consider the comma messy. You won't find many stylesheets that don't have commas.
If you are just looking to get all 'span' text from the HTML then the following should suffice:
root_spans = root.xpath('//span')
for span in root_spans:
    # take the first text node inside each span
    span_text = span.xpath('.//text()')[0]
    print(span_text)
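If only the spans labelled "Featured" or "Sponsored" are wanted, one option is to select all spans and filter by text in Python, which mirrors the XPath 'or' predicate from the question without a comma-separated selector. This is a sketch under assumptions: the markup is a trimmed copy of the question's HTML wrapped in one root element, the third "Closed" entry is invented to show the filter at work, and the standard library's ElementTree is used instead of lxml (root.iter() and .text behave the same in both).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup, wrapped in a single root element
# (an assumption, so that it parses as one XML document).
markup = """
<root>
  <div class="rest-list-information">
    <a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">Pizza Hut</a>
    <div id="featured other-dynamic-ids"><span>Sponsored</span></div>
  </div>
  <div class="rest-list-information">
    <a class="restaurant-header" href="/madison-wi/restaurants/salads-up">Salads UP</a>
    <div id="other-dynamic-ids border"><span>Featured</span></div>
  </div>
  <!-- invented third entry that should be filtered out -->
  <div class="rest-list-information"><div id="x"><span>Closed</span></div></div>
</root>
"""
WANTED = {"Featured", "Sponsored"}

root = ET.fromstring(markup)
labels = [span.text for span in root.iter("span") if (span.text or "").strip() in WANTED]
print(labels)
```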
I scraped a website and I want to find an element based on the text written in it. Let's say below is the sample code of the website:
code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""")
I want some way to get a p element that has as a text value Some Information. How can I select an element like so?
Just use text parameter:
code.find_all("p", text="Some Information")
If you need only the first element, then use find instead of find_all.
You could use text to search all tags matching the string
import bs4
code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")
# calling the soup object is shorthand for find_all(); each match is a text node
for elem in code(text='Some Information'):
    print(elem.parent)
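For comparison, the same lookup can be sketched without BeautifulSoup, using the standard library's ElementTree (a swapped-in tool, not what the answers use). The [.='...'] predicate needs Python 3.7 or later and matches the element's complete text exactly; the second query shows a case-insensitive variant, which also picks up the h1 whose text differs only in capitalisation.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""")

# Exact match on the element's complete text (predicate supported since Python 3.7).
exact = doc.findall(".//p[.='Some Information']")

# Case-insensitive match across every element, done in plain Python.
loose = [el.tag for el in doc.iter()
         if (el.text or "").strip().lower() == "some information"]

print([p.text for p in exact])
print(loose)
```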
I'm making a small Python script for automatic login to a website, but I'm stuck.
I'm looking to print into terminal a small part of the html, located within this tag in the html file on the site:
<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>
But how do I extract and print just the name, John Appleseed?
I'm using Python's Mechanize on a Mac, by the way.
Mechanize is only good for fetching the html. Once you want to extract information from the html, you could use for example BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)
Depending on where the <td> is located in the html (it's unclear from your question), you could use the following code:
html = ...  # this is the html you've fetched
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# use this (gets all <td> elements)
cols = soup.find_all('td')
# or this (gets only <td> elements with class='h3')
cols = soup.find_all('td', attrs={"class": 'h3'})
print(cols[0].decode_contents())  # print content of first <td> element
As you have not provided the full HTML of the page, the only option right now is either using string.find() or regular expressions.
But, the standard way of finding this is using xpath. See this question: How to use Xpath in Python?
You can obtain the xpath for an element using "inspect element" feature of firefox.
For example, to find the XPath for the username on the Stack Overflow site:
Open Firefox, log in to the website, right-click the username (shadyabhi in my case), and select Inspect Element.
Hover over the tag, or right-click it, and choose "Copy XPath".
You can use a parser to extract any information from a document. I suggest using the lxml module.
Here you have an example:
from lxml import etree
from io import StringIO
parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>"""), parser)
>>> tree.xpath("string()").strip()
'John Appleseed'
More information about lxml here
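If installing a third-party parser is not an option, the standard library's html.parser can handle this snippet too, including the unquoted class=h3 attribute. The following is only a sketch: the CellText class is a hypothetical helper written for this example, and it assumes the name always sits in a td with class h3.

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Collect the text of every <td> whose class attribute equals a given value."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.inside = False   # True while we are inside a matching <td>
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.inside = dict(attrs).get("class") == self.css_class
            if self.inside:
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.cells[-1] += data

page = ("<td class=h3 align='right'> John Appleseed</td><td> "
        "<img border=0 src=\"../tbs_v7_0/images/myaccount.gif\" alt=\"My Account\"></td>")

parser = CellText("h3")
parser.feed(page)
parser.close()
names = [cell.strip() for cell in parser.cells]
print(names)
```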
I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
I would like to know what the most sensible way in the library is to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag.
<body>
<h1>A title</h1>
<p>Some text</p>
</body>
InnerHtml is therefore:
<h1>A title</h1>
<p>Some text</p>
I can do it using hacks (converting to string/regexes etc) but I'm assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.
EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:
from lxml import html
from io import StringIO
t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
# body.text covers text that precedes the first child element
print((body.text or '') + ''.join(html.tostring(child, encoding='unicode') for child in body.iterchildren()))
Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:
<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>
Text directly under the root element is ignored. I ended up doing this:
(body.text or '') + \
    ''.join([html.tostring(child, encoding='unicode') for child in body.iterchildren()])
You can get the children of an element using its iterchildren() method (getchildren() is deprecated). Note that iterdescendants() also yields nested elements, which would then be serialized twice:
>>> from lxml import etree
>>> from io import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterchildren():
...     print(etree.tostring(child, encoding='unicode', with_tail=False))
...
<h1>A title</h1>
<p>Some text</p>
This can be shorthanded as follows:
print(''.join([etree.tostring(child, encoding='unicode') for child in root.iterchildren()]))
import lxml.etree as ET
from lxml import html
# t is the parsed document from the example above
body = t.xpath("//body")
for tag in body:
    h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
    p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
    htext = h[0].text_content()
    ptext = p[0].text_content()
You can also use .get('href') to read a single attribute of a tag and .attrib to get all of its attributes.
Here the child indexes are hardcoded, but you can also determine them dynamically.
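A short sketch of the two attribute accessors mentioned above, shown with the standard library's ElementTree (lxml elements offer the same .get() and .attrib). The link markup is borrowed from an earlier question in this thread; the "title" lookup is only there to show the default value.

```python
import xml.etree.ElementTree as ET

link = ET.fromstring('<a class="external" href="http://www.google.com">Homepage</a>')

print(link.get("href"))          # a single attribute; None if it is missing
print(link.get("title", "n/a"))  # optional default for absent attributes
print(dict(link.attrib))         # all attributes as a dictionary
```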
Here is a Python 3 version:
from xml.sax import saxutils
from lxml import html
def inner_html(tree):
    """ Return inner HTML of lxml element """
    return (saxutils.escape(tree.text) if tree.text else '') + \
        ''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])
Note that this includes escaping of the initial text as recommended by andreymal -- this is needed to avoid tag injection if you're working with sanitized HTML!
I find none of the answers satisfying, some are even in Python 2. So I add a one-liner solution that produces innerHTML-like output and works with Python 3:
from lxml import etree, html
# generate some HTML element node
node = html.fromstring("""<container>
Some random text <b>bold <i>italic</i> yeah</b> no yeah
<!-- comment blah blah --> <img src='gaga.png' />
</container>""")
# compute inner HTML of element
innerHTML = "".join([
    str(c) if type(c)==etree._ElementUnicodeResult
    else html.tostring(c, with_tail=False).decode()
    for c in node.xpath("node()")
]).strip()
The result will be:
'Some random text <b>bold <i>italic</i> yeah</b> no yeah\n<!-- comment blah blah --> <img src="gaga.png">'
What it does: The xpath delivers all node children (text, elements, comments). The list comprehension produces a list of the text contents of the text nodes and HTML content of element nodes. Those are then joined into a single string. If you want to get rid of comments, use *|text() instead of node() for xpath.
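The answers above cover reading the inner markup. For the "set" half of the original question, here is a minimal sketch of one possible approach, written against the standard library's ElementTree so it runs without lxml. get_inner and set_inner are hypothetical helper names, the wrapper tag is an implementation detail of this sketch, and the new markup is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def get_inner(elem):
    """Serialize everything between elem's opening and closing tags."""
    # ET.tostring() includes each child's tail text, so nothing is lost.
    return escape(elem.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in elem
    )

def set_inner(elem, markup):
    """Replace elem's content with the parsed markup (must be well-formed XML)."""
    wrapper = ET.fromstring(f"<wrapper>{markup}</wrapper>")  # hypothetical helper tag
    for child in list(elem):
        elem.remove(child)
    elem.text = wrapper.text
    elem.extend(list(wrapper))

body = ET.fromstring("<body>Intro<h1>A title</h1><p>Some text</p></body>")
original = get_inner(body)
print(original)

set_inner(body, "New <b>content</b> here")
print(ET.tostring(body, encoding="unicode"))
```

Because only children and text are replaced, the element's own attributes and tail stay untouched. Real-world HTML that is not well-formed would need an HTML parser such as lxml.html for the parsing step.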