I'm using XPath to scrape a web page, but I'm having trouble with one part of the code:
<div class="description">
here's the page description
<span> some other text</span>
<span> another tag </span>
</div>
I'm using this code to get the value from the element:
description = tree.xpath('//div[@class="description"]/text()')
I can find the correct div I'm looking for, but I only want to get the text "here's the page description", not the content of the inner span tags.
Does anyone know how I can get only the text in the root node, but not the content of the child nodes?
The expression you are currently using actually matches only the top-level text child nodes. You can simply wrap it in normalize-space() to clean up the extra newlines and spaces:
>>> from lxml.html import fromstring
>>> data = """
... <div class="description">
... here's the page description
... <span> some other text</span>
... <span> another tag </span>
... </div>
... """
>>> root = fromstring(data)
>>> root.xpath('normalize-space(//div[@class="description"]/text())')
"here's the page description"
To get the complete text of a node including the child nodes, use the .text_content() method:
node = tree.xpath('//div[#class="description"]')[0]
print(node.text_content())
I'm having a little difficulty extracting text from a div that has another div inside it (I want the text without the inner div's content). Here it is:
<div style="width:100%">
<div class="status_p">
ACTIVE
</div>
Name
</div>
I want to extract Name without the div that contains ACTIVE. Whenever I print the first div, it always gives me ACTIVEName.
You can use the .children attribute on a bs4 Tag, which gives you all the children of that tag. After getting the children, you can take the last element of the list:
from bs4 import BeautifulSoup
html = """<div style="width:100%">
<div class="status_p">
ACTIVE
</div>
Name
</div>"""
soup = BeautifulSoup(html, "html.parser")
print(list(soup.find("div").children)[-1].strip())
Output:
Name
OR
you can use stripped_strings
print(list(soup.find("div").stripped_strings)[-1])
OR
you can delete the inner div and get only the name.
soup.find("div",class_="status_p").extract()
print(soup.find("div").get_text(strip=True))
I found a solution and used
find("div", class_="status_p").decompose()
With Python Scrapy, I am trying to get contents in a webpage whose nodes look like this:
<div id="title">Title</div>
<ul>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
...
</ul>
I'm a newbie with XPath and couldn't get it working so far. My last try was something like:
contents = response.xpath('[@id="title"]/following-sibling::ul[1]//li//p.text()')
... but it seems I cannot use /following-sibling after [@id="title"].
Any idea?
Try this XPath
contents = response.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()')
It selects both "CONTENT TO EXTRACT" text nodes.
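As a quick sanity check, here is a minimal sketch with a standalone scrapy Selector on the sample markup (the sel variable name is just for illustration):
import scrapy

html = """
<div id="title">Title</div>
<ul>
  <li><span>blahblah</span><div>blahblah</div><p>CONTENT TO EXTRACT</p></li>
  <li><span>blahblah</span><div>blahblah</div><p>CONTENT TO EXTRACT</p></li>
</ul>
"""

sel = scrapy.Selector(text=html)
# Both <p> text nodes are returned, in document order.
print(sel.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()').getall())
# ['CONTENT TO EXTRACT', 'CONTENT TO EXTRACT']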
One XPath would be:
response.xpath('//*[@id="title"]/following-sibling::ul[1]//p/text()').getall()
which gets the text from every <p> tag that is a child or grandchild of the <ul> tag nearest to the node with id="title".
See the XPath syntax reference.
Try this using a CSS selector.
response.css('#title ::text').extract()
I am new to scrapy. I want to crawl some data from the web. I got the html document like below.
dom style1:
<div class="user-info">
<p class="user-name">
something in p tag
</p>
text data I want
</div>
dom style2:
<div class="user-info">
<div>
<p class="user-img">
something in p tag
</p>
something in div tag
</div>
<div>
<p class="user-name">
something in p tag
</p>
text data I want
</div>
</div>
I want to get the text text data I want. Right now I can use a CSS or XPath selector to get it by checking whether it exists, but I would like to know a better way.
For example, I can select p.user-name with CSS first, then get its parent, then get the parent's div/text(). The data I want is always the text() of the div that is the immediate parent of p.user-name, but the question is: how can I get the immediate parent of p.user-name?
With XPath you can traverse the XML tree in every direction (parent, sibling, child, etc.), whereas CSS doesn't support this.
For your case you can get a node's parent with the XPath .. parent notation:
//p[@class='user-name']/../text()
Explanation:
//p[@class='user-name'] - find <p> nodes with class value user-name.
/.. - select the node's parent.
/text() - select the text of the current node.
This XPath should work in both of your described cases.
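A minimal sketch of this on the second markup sample, using a standalone scrapy Selector; the only addition here is stripping the surrounding whitespace:
import scrapy

html = """<div class="user-info">
  <div>
    <p class="user-img">something in p tag</p>
    something in div tag
  </div>
  <div>
    <p class="user-name">something in p tag</p>
    text data I want
  </div>
</div>"""

sel = scrapy.Selector(text=html)
# All text children of the parent div, then drop the whitespace-only ones.
texts = sel.xpath("//p[@class='user-name']/../text()").getall()
print([t.strip() for t in texts if t.strip()])
# ['text data I want']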
What about using the following-sibling axis?
>>> s = scrapy.Selector(text='''<div class="user-info">
... <p class="user-name">
... something in p tag
... </p>
... text data I want
... </div>''')
>>> username = s.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n text data I want\n'
>>>
>>> s2 = scrapy.Selector(text='''<div class="user-info">
... <div>
... <p class="user-img">
... something in p tag
... </p>
... something in div tag
... </div>
... <div>
... <p class="user-name">
... something in p tag
... </p>
... text data I want
... </div>
... </div>''')
>>> username = s2.css('p.user-name')[0]
>>> username.xpath('following-sibling::text()[1]').get()
'\n text data I want\n '
>>>
I am trying to use Python Selenium Firefox Webdriver to grab the h2 content 'My Data Title' from this HTML
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Section Details
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
My Data Title
</h2>
</div>
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Another Section
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
Another Title
</h2>
</div>
Each div has a class of box so I can't easily identify the one I want. Is there a way to tell Selenium to grab the h2 in the box class that comes after the one that has the span called 'Section Details'?
If you want to grab the h2 in the box class that comes after the one that has the span with text Section Details, try the XPath below using preceding:
(//h2[preceding::span[normalize-space(text()) = 'Section Details']])[1]
or using following:
(//span[normalize-space(text()) = 'Section Details']/following::h2)[1]
and for Another Section just change the span text in the XPath:
(//h2[preceding::span[normalize-space(text()) = 'Another Section']])[1]
or
(//span[normalize-space(text()) = 'Another Section']/following::h2)[1]
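A minimal usage sketch with Selenium (the driver setup and page URL are assumptions for illustration; the XPath is the one above):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/boxes")  # hypothetical page containing the markup above

# First <h2> anywhere after the span that reads "Section Details".
title = driver.find_element(
    By.XPATH,
    "(//span[normalize-space(text()) = 'Section Details']/following::h2)[1]",
)
print(title.text)  # My Data Title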
Here is an XPath to select the title following the text "Section Details":
//div[@class='box'][normalize-space(.)='Section Details']/following::h2
yeah, you need to do some complicated xpath searching:
referenceElementList = driver.find_elements_by_xpath("//span")
for eachElement in referenceElementList:
    if eachElement.get_attribute("innerHTML") == 'Section Details':
        elementYouWant = eachElement.find_element_by_xpath("../../../following-sibling::div/h2")
elementYouWant.get_attribute("innerHTML") should give you "My Data Title"
My code reads:
find all span elements, regardless of where they are in the HTML, and store them in a list called referenceElementList;
iterate over all span elements in referenceElementList one by one, looking for a span whose innerHTML attribute is 'Section Details';
if there is a match, we have found the span, and we navigate backwards three levels to locate the enclosing div[@class='box'], then take that div element's next sibling, which is the second div element;
lastly, we locate the h2 element inside that sibling div.
Can you please tell me if my code works? I might have gone wrong somewhere navigating backwards.
One potential difficulty you may encounter: the innerHTML attribute may contain tab, newline, and space characters; in that case, you need a regex to do some filtering first.
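For example, a small sketch of that filtering step (the normalized helper is hypothetical, not part of the code above):
import re

def normalized(text):
    # Collapse tabs, newlines, and runs of spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(normalized("\n\t  Section Details \n") == "Section Details")  # True
You would then compare normalized(eachElement.get_attribute("innerHTML")) against 'Section Details' instead of the raw string.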
I have HTML like this:
<div id="all-stories" class="book">
<ul>
<li title="Book1" >Book1</li>
<li title="Book2" >Book2</li>
</ul>
</div>
I want to get the books and their respective URLs using XPath, but it seems my approach is not working. For simplicity, I tried to extract all the elements under the li tags as follows:
lis = tree.xpath('//div[@id="all-stories"]/div/text()')
import lxml.html as LH
content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1" >Book1</li>
<li title="Book2" >Book2</li>
</ul>
</div>
'''
root = LH.fromstring(content)
for atag in root.xpath('//div[@id="all-stories"]//li/a'):
print(atag.attrib['href'], atag.text_content())
yields
book1_url Book1
book2_url Book2
The XPath //div[@id="all-stories"]/div does not match anything because there is no child div inside the outer div tag.
The XPath //div[@id="all-stories"]/li also would not match, because there is no direct child li tag inside the div tag. However, //div[@id="all-stories"]//li does match the li tags, because // tells XPath to search recursively, as deeply as necessary, to find them.
Now, the content you are looking for is not in the li tag. It is inside the a tag. So instead use the XPath
'//div[@id="all-stories"]//li/a' to reach the a tags.
The value of the href attribute can be accessed with atag.attrib['href'], and the text with atag.text_content().