Why is my XPath to select text not working? - python

How do you access text with an XPath if it doesn't have a node?
The text is in quotation marks and on a separate line inside another node.
I'm having trouble selecting the correct element with my XPath:
<span>
<a href="www.imagine_a_link_here.org">
"
This is the text I need to access
"
</a>
</span>
I'd normally do this by writing
import requests
from lxml import html,etree
from lxml.html import document_fromstring
page = requests.get('https://www.the_link_im_trying_to_webscrape.org')
tree = html.fromstring(page.content)
the_text_i_need_to_access_xpath = '/span/a/text()'
the_text_i_need_to_access = tree.xpath(the_text_i_need_to_access_xpath)
Unfortunately this is only returning an empty list. Does anyone know how I have to modify the XPath in order to get the string I'm looking for?

How do you access text with an XPath if it doesn't have a node?
Text in an XML or HTML document will be associated with a node. That's not the problem here. And the " " delimiters are just there to show you the surrounding whitespace.
As presented, your XPath should select the text within the a element. Here are some reasons that may not be happening:
As @MadsHansen mentioned in the comments, the root element of your actual HTML may not be a span as shown. See:
Difference between "//" and "/" in XPath?
The text may not be loaded at the time of your XPath execution, either because the document hasn't completely loaded or because JavaScript dynamically changes the DOM later (see the Selenium sketch at the end of this answer). See:
Selenium wait until document is ready
Selenium WebDriver: Wait for complex page with JavaScript to load
fromstring() can use a bit more magic than might be expected:
fromstring(string):
Returns document_fromstring or fragment_fromstring, based on
whether the string looks like a full document, or just a fragment.
Given this, here is an update to your code that will select the targeted text as expected:
import requests
from lxml import html
from lxml.html import document_fromstring
htmlstr = """
<span>
<a href="www.imagine_a_link_here.org">
"
This is the text I need to access
"
</a>
</span>
"""
tree = html.fromstring(htmlstr)
print(html.tostring(tree))
the_text_i_need_to_access_xpath = '//span/a/text()'
the_text_i_need_to_access = tree.xpath(the_text_i_need_to_access_xpath)
print(the_text_i_need_to_access)
Or, if you don't need/want the HTML surprises, this also selects the text:
import lxml.etree as ET
xmlstr = """
<span>
<a href="www.imagine_a_link_here.org">
"
This is the text I need to access
"
</a>
</span>
"""
root = ET.fromstring(xmlstr)
print(root.xpath('/span/a/text()'))
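If the XPath and parsing are fine but you still get an empty list, the page is most likely rendered by JavaScript (the second reason above), and requests will never see the text. Here is a minimal sketch using Selenium's explicit waits, assuming a Chrome driver is available and reusing the placeholder URL from the question:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes a matching driver is installed
driver.get('https://www.the_link_im_trying_to_webscrape.org')

# Wait (up to 10 seconds) until the <a> inside the <span> is present in
# the DOM, then read its text.
link = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//span/a'))
)
print(link.text.strip())

driver.quit()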
Credit: Thanks to @ThomasWeller for pointing out the additional complications and helping to resolve them.

Related

Find text with find_next_sibling, if it is sometimes hyperlinked and sometimes not

This question is based on a similar question of mine (Search for a Word on website and get the next words in return).
I want to get the text "Herr Max Mustermann" from the website. This text changes from site to site. My plan was to search for the word "Position", which is stable from site to site, and then get the next words (see the solution in the question mentioned above).
Sometimes the text "Herr Max Mustermann" is wrapped in a hyperlink, so that I only get an empty output.
<br>
<strong>Geschäftsführer</strong>
<br>
<a class="blue" data toggle="modal" href="https://www.firmenabc.at/person/mustermann-max_jhgxzd" data- target="#shareholder">
Herr Max Mustermann
<i class="icon etc"
::before
</i>
</a>
<br>
Privatperson
My idea would be to include an if statement:
if the next sibling of soup.find('strong', string='Vorstand') contains an <a> tag:
    ceo = the text from its next sibling
else:
    ceo = soup.find('strong', string='Vorstand').find_next_sibling(string=True).strip()
Any ideas how to code it?
There are several options to deal with that issue; two of them are described here:
Use decompose() to remove all the br elements and then use the approach of @BEK (without decomposing it won't find the a, because the next element is a br); a sketch of this route follows the example below.
Select your elements more specifically, so that you can start directly from the br sibling of the strong.
Example
CSS selectors are used here:
from bs4 import BeautifulSoup
import requests

url = "https://www.firmenabc.at/austrian-airlines-ag_EES"
soup = BeautifulSoup(requests.get(url).text)

for e in soup.select('strong:-soup-contains("Aufsichtsrat") + br'):
    if e.find_next_sibling().name == 'a':
        print(e.find_next_sibling('a').text.strip())
    else:
        print(e.find_next_sibling(text=True).strip())
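For the first option, here is a rough sketch of the decompose() route; it reuses soup from above and takes the 'Vorstand' label from the question's pseudocode, so the label text may need to be adapted to the page:
# Remove the <br> separators first, so that find_next_sibling()
# reaches the <a> (or the plain text) directly.
for br in soup.find_all('br'):
    br.decompose()

label = soup.find('strong', string='Vorstand')  # label text depends on the page
if label is not None:
    next_sibling = label.find_next_sibling()
    if next_sibling is not None and next_sibling.name == 'a':
        ceo = next_sibling.get_text(strip=True)
    else:
        ceo = label.find_next_sibling(string=True).strip()
    print(ceo)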
You can use find_next_sibling() to check whether the next sibling of the strong tag with the text Vorstand is an a tag.
ceo_tag = soup.find('strong', string='Vorstand')
next_sibling = ceo_tag.find_next_sibling()

if next_sibling.name == 'a':
    ceo = next_sibling.text
else:
    ceo = ceo_tag.find_next_sibling(string=True).strip()
You can also use next_element instead of find_next_sibling to get the next element after the 'strong' tag.
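A rough sketch of that variant, again assuming the 'Vorstand' label from the question; note that next_element walks the parse tree in document order (strings included), so whitespace and <br> nodes have to be skipped:
ceo_tag = soup.find('strong', string='Vorstand')

# Skip the label string inside <strong>, then walk forward node by node.
ceo = None
node = ceo_tag.next_element.next_element
while node is not None:
    if getattr(node, 'name', None) == 'a':      # hyperlinked case
        ceo = node.get_text(strip=True)
        break
    if isinstance(node, str) and node.strip():  # plain-text case
        ceo = node.strip()
        break
    node = node.next_element

print(ceo)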

with BeautifulSoup extract text from div in a href in loop

<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6» target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div
Hello to all,
My request is the following: from the piece of code above, I want to create a loop that extracts TEXT if and only if the div class is ELEMENT4 and the svg class is ELEMENT5 (because there are other, different ones).
Thank you for your help,
eddy
You'll need to import urllib2, or some other library that allows you to fetch a URL's HTML structure, and import BeautifulSoup as well. Scrape the URL, store the result in a variable, then reformat the output in any way that serves your needs.
For example:
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("the_url")
content = BeautifulSoup(page.read().decode("utf-8"))  # decode the data (utf-8)
filter = content.find_all("div")  # finds all div elements in the body
Then you could use a regexp to find the actual text inside the element.
Good luck on your assignment!
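To address the condition in the question directly (only take TEXT when the enclosing div has class ELEMENT4 and the svg has class ELEMENT5), here is a rough sketch with bs4 CSS selectors, using the placeholder markup from the question:
from bs4 import BeautifulSoup

html_doc = """
<div class="ELEMENT1">
  <div class="ELEMENT2">
    <div class="ELEMENT3">valeur1</div>
    <div class="ELEMENT4">
      <svg class="ELEMENT5">
        <a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
          <div>TEXT</div>
        </a>
      </svg>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Only descend into div.ELEMENT4 blocks that contain an svg.ELEMENT5,
# then take the text of the inner <div> of the link.
for div in soup.select("div.ELEMENT4 svg.ELEMENT5 a div"):
    print(div.get_text(strip=True))  # TEXT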

Find text using lxml etree

I'm trying to get the text from one tag using lxml etree.
<div class="litem__type">
<div>
Robbp
</div>
<div>Estimation</div>
+487 (0)639 14485653
•
<a href="mailto:herbrich#gmail.com">
Email Address
</a>
•
<a class="external" href="http://www.google.com">
Homepage
</a>
</div>
The problem is that I can't locate it reliably because there are many differences between these kinds of snippets. There are situations where the first and second div are not there at all. As you can see, the telephone number is not in its own div.
I suppose it would be possible to extract the telephone number using BeautifulSoup's contents, but I'm trying to use the lxml module's xpath.
Do you have any ideas? (The email is sometimes missing, too.)
EDIT: The best idea is probably to use regex, but I don't know how to tell it to extract just the text between two <div></div> tags.
You should avoid using regex to parse XML/HTML wherever possible because it is not as efficient as using element trees.
The text after element A's closing tag, but before element B's opening tag, is called element A's tail text. To select this tail text using lxml etree you could do the following:
content = '''
<div class="litem__type">
<div>Robbp</div>
<div>Estimation</div>
+487 (0)639 14485653
<a href="mailto:herbrich@gmail.com">Email Address</a>
<a class="external" href="http://www.google.com">Homepage</a>
</div>'''
from lxml import etree
tree = etree.XML(content)
phone_number = tree.xpath('div[2]')[0].tail.strip()
print(phone_number)
Output
+487 (0)639 14485653
The strip() function is used here to remove whitespace on either side of the tail text.
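Since the question also mentions BeautifulSoup's .contents, here is a rough equivalent sketch (reusing the content string from above); it takes the first non-empty text node among the outer div's children:
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')
outer = soup.find('div', class_='litem__type')

# .contents holds both tags and text nodes; the phone number is the
# first non-whitespace text node among the outer div's children.
for child in outer.contents:
    if isinstance(child, str) and child.strip():
        print(child.strip())
        break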
You can iterate over the div tags and get the text after each one.
from lxml import etree

tree = etree.parse("filename.xml")
items = tree.xpath('//div')
for node in items:
    # you can check here if it is a phone number
    print(node.tail)

How do I replace a HTML element with some new format in Python

What is a good way to replace an HTML tag like:
Old : <div id=pgbrk" ....../>....Page Break....</div>
New : <!--page break -->
div id might have many other values hence regex is not a good idea. I need some LXML kind of thing. Basically, my problem is to replace an HTML tag with a string!
As long as your div has a parent tag, you could do this:
import lxml.html as LH
import lxml.etree as ET
content='<root><div id="pgbrk" ......>....Page Break....</div></root>'
doc=LH.fromstring(content)
# print(LH.tostring(doc))
for div in doc.xpath('//div[#id="pgbrk"]'):
parent=div.getparent()
parent.replace(div,ET.Comment("page break"))
print(LH.tostring(doc))
yields
<root><!--page break--></root>
You can use plain DOM http://docs.python.org/library/xml.dom.minidom.html
1) parse your source
from xml.dom.minidom import parse
datasource = open('c:\\temp\\mydata.xml')
doc= parse(datasource)
2) find your nodes to replace
for node in doc.getElementsByTagName('div'):
    if node.getAttribute('id') == 'pgbrk':
        ...
3) when you have found the targeted nodes, replace them with a new comment node
parent = node.parentNode
parent.replaceChild(doc.createComment("page break"), node)
docs: http://docs.python.org/library/xml.dom.html

Printing certain HTML Python Mechanize

I'm making a small Python script for automatic logon to a website, but I'm stuck.
I'm looking to print to the terminal a small part of the HTML, located within this tag in the HTML file on the site:
<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>
But how do I extract and print just the name, John Appleseed?
I'm using Python's mechanize on a Mac, by the way.
Mechanize is only good for fetching the html. Once you want to extract information from the html, you could use for example BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)
Depending on where the <td> is located in the html (it's unclear from your question), you could use the following code:
html = ...  # this is the html you've fetched
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)

# use this (gets all <td> elements)
cols = soup.findAll('td')
# or this (gets only <td> elements with class='h3')
cols = soup.findAll('td', attrs={"class": 'h3'})

print(cols[0].renderContents())  # print the content of the first <td> element
As you have not provided the full HTML of the page, the only option right now is either using string.find() or regular expressions.
But the standard way of finding this is using XPath. See this question: How to use Xpath in Python?
You can obtain the XPath for an element using the "Inspect Element" feature of Firefox.
For example, if you want to find the XPath for the username on the Stack Overflow site:
Open Firefox, log in to the website, right-click on the username (shadyabhi in my case) and select Inspect Element.
Keep your mouse over the tag, or right-click it and choose "Copy XPath".
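As a sketch of what to do with the copied XPath: the expression below targets the <td class="h3"> from the question rather than a real copied path, and the page content is a hypothetical stand-in for the HTML string mechanize fetched.
from lxml import html

# Hypothetical page content; in practice you would pass the HTML string
# that mechanize fetched (e.g. the result of response.read()).
content = """
<html><body><table><tr>
<td class="h3" align="right"> John Appleseed</td>
<td><img border="0" src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>
</tr></table></body></html>
"""

tree = html.fromstring(content)

# Paste the XPath you copied from the inspector here; this expression
# selects the text of the <td> whose class is "h3".
print(tree.xpath('//td[@class="h3"]/text()')[0].strip())  # John Appleseed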
You can use a parser to extract any information in a document. I suggest you use the lxml module.
Here is an example:
from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>"""), parser)

print(tree.xpath("string()").strip())  # John Appleseed
More information about lxml here
