Extract text from children of next nodes with XPath and Scrapy - python

With Python Scrapy, I am trying to get contents in a webpage whose nodes look like this:
<div id="title">Title</div>
<ul>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
...
</ul>
I'm new to XPath and haven't been able to get it working. My last attempt was something like:
contents = response.xpath('[@id="title"]/following-sibling::ul[1]//li//p.text()')
... but it seems I cannot use /following-sibling after [@id="title"].
Any ideas?

Try this XPath:
contents = response.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()')
It selects both "CONTENT TO EXTRACT" text nodes.

One XPath would be:
response.xpath('//*[@id="title"]/following-sibling::ul[1]//p/text()').getall()
which gets the text of every <p> descendant of the first <ul> sibling that follows the node with id="title".
XPath syntax

Try this using a CSS selector, where + selects the <ul> immediately following the title div:
response.css('#title + ul > li > p::text').extract()
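The XPath answers above can be checked outside a spider with lxml, the library that Scrapy's selectors are built on. This is only a minimal sketch: the markup is modelled on the question, the "first" and "second" texts are placeholders so the two matches can be told apart, and the second <ul> is added here to show that ul[1] stops at the nearest list.

```python
import lxml.html

# Sample markup modelled on the question; "first"/"second" are placeholders
# and the trailing <ul> is an assumption added for illustration.
html = """
<div>
<div id="title">Title</div>
<ul>
<li><span>blahblah</span><div>blahblah</div><p>first</p></li>
<li><span>blahblah</span><div>blahblah</div><p>second</p></li>
</ul>
<ul>
<li><p>unrelated list</p></li>
</ul>
</div>
"""

root = lxml.html.fromstring(html)

# following-sibling::ul[1] keeps only the first <ul> after the title div,
# so the text of the second list is not returned.
contents = root.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()')
print(contents)
```

Inside a Scrapy callback the same expression would be passed to response.xpath(...) followed by .getall().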

Related

scrapy xpath how to use?

I have a question about Scrapy selectors and XPath.
I would like to select the link in the "a" tag inside the last "li" tag of the HTML below. How should I write the XPath query?
The code below works, but I believe there is a simpler way that does everything in the XPath query instead of slicing the result list. I just don't know how to write it.
from scrapy import Selector
sel = Selector(text=html)
print(sel.xpath('(//ul/li)').xpath('a/@href').extract()[-1])
The HTML is:
<ul>
<li>
<a href="/info/page/" rel="follow">
<span class="page-numbers">
35
</span>
</a>
</li>
<li>
<a href="/info/page/" rel="follow">
<span class="next">
next page.
</span>
</a>
</li>
</ul>
I am assuming you specifically want the link to the "next" page. If so, you can locate the a element by checking that its child span has the class "next":
//a[span/@class = "next"]/@href
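If the goal is literally the link in the last li, the position can also be expressed in XPath with last(). Below is a minimal sketch using lxml (Scrapy's Selector accepts the same expressions). The two href values are made up so the links can be told apart, since the question shows the same path for both.

```python
import lxml.html

# Markup modelled on the question; both href values are hypothetical.
html = """
<ul>
<li><a href="/info/page/35/" rel="follow"><span class="page-numbers">35</span></a></li>
<li><a href="/info/page/36/" rel="follow"><span class="next">next page.</span></a></li>
</ul>
"""

root = lxml.html.fromstring(html)

# Select by position: the <a> inside the last <li> matched by //ul/li.
last_link = root.xpath('(//ul/li)[last()]/a/@href')

# Select by meaning: the <a> whose child <span> has class "next".
next_link = root.xpath('//a[span/@class = "next"]/@href')

print(last_link, next_link)
```

Selecting by class is usually the more robust of the two, because it keeps working if the pager gains extra items after the "next" link.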

Get all text in a tag unless it is in another tag

I'm trying to parse some HTML with BeautifulSoup, and I'd like to get all the text (recursively) in a tag, but I want to ignore all text that appears within a small tag. For example, this HTML:
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
should give the text Final definition. Note that this is a minimal example. In the real HTML, there are many other tags involved, so small should be excluded rather than a being included.
The text attribute of the tag is close to what I want, but it would include Fun fact. I could concatenate the text of all children except the small tags, but that would leave out definition. I couldn't find a method like get_text_until (the small tag is always at the end), so what can I do?
You can use find_all to find all the <small> tags, clear them, then use get_text():
>>> soup
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
>>> for el in soup.find_all("small"):
...     el.clear()
...
>>> soup
<li>
<a href="/path">
Final
</a>
definition.
<small></small>
</li>
>>> soup.get_text()
'\n\n\n Final\n \n definition.\n \n\n'
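Alternatively, pass recursive=False so that the search does not descend into child tags:
soup.li.find(string=True, recursive=False)
This returns only the first text node that is a direct child of the li. To join all the direct text nodes of every li, you can do something like
' '.join(s.strip() for li in soup.find_all('li') for s in li.find_all(string=True, recursive=False) if s.strip())
Note that this also skips text inside other child tags such as a, so for the example above it gives "definition." rather than "Final definition.".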
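The clear-then-get_text approach from the first answer can be put together as a short script. This is a sketch on the HTML from the question, using the built-in "html.parser" backend. The whitespace handling through get_text(" ", strip=True) is an addition here, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Empty every <small> tag so its text no longer contributes to get_text().
# Note that this modifies the parsed tree in place.
for el in soup.find_all("small"):
    el.clear()

# separator=" " joins the remaining text nodes; strip=True trims each one
# and drops nodes that contain only whitespace.
text = soup.li.get_text(" ", strip=True)
print(text)
```

If the original tree is still needed afterwards, parse the HTML a second time rather than reusing the modified soup.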

Python Selenium Webdriver - Grab div after specified one

I am trying to use Python Selenium Firefox Webdriver to grab the h2 content 'My Data Title' from this HTML
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Section Details
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
My Data Title
</h2>
</div>
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Another Section
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
Another Title
</h2>
</div>
Each div has a class of box so I can't easily identify the one I want. Is there a way to tell Selenium to grab the h2 in the box class that comes after the one that has the span called 'Section Details'?
If you want to grab the h2 in the box that comes after the one containing the span with the text Section Details, try the XPath below, using preceding:
(//h2[preceding::span[normalize-space(text()) = 'Section Details']])[1]
or using following:
(//span[normalize-space(text()) = 'Section Details']/following::h2)[1]
For Another Section, just change the span text in the XPath:
(//h2[preceding::span[normalize-space(text()) = 'Another Section']])[1]
or
(//span[normalize-space(text()) = 'Another Section']/following::h2)[1]
Here is an XPath to select the title following the text "Section Details":
//div[@class='box'][normalize-space(.)='Section Details']/following::h2
Yes, you need a slightly more involved XPath search:
from selenium.webdriver.common.by import By
# Selenium 4 syntax; older releases spelled this driver.find_elements_by_xpath(...)
referenceElementList = driver.find_elements(By.XPATH, "//span")
for eachElement in referenceElementList:
    if eachElement.get_attribute("innerHTML") == 'Section Details':
        elementYouWant = eachElement.find_element(By.XPATH, "../../../following-sibling::div/h2")
elementYouWant.get_attribute("innerHTML") should give you "My Data Title"
My code reads:
find all span elements, regardless of where they are in the HTML, and store them in a list called referenceElementList;
iterate over the span elements in referenceElementList one by one, looking for a span whose innerHTML is 'Section Details';
if there is a match, we have found the span, so we navigate up three levels to the enclosing div[@class='box'] and move to that div's next sibling, which is the second div element;
lastly, we locate the h2 element inside that sibling div.
Can you please tell me if my code works? I might have gone wrong somewhere navigating backwards.
One potential difficulty: the innerHTML value may contain tabs, newlines and spaces. In that case, strip the whitespace (for example with a regex) before comparing.
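The XPath expressions from the answers above can be tried offline with lxml before handing them to Selenium. This is a minimal sketch on the HTML from the question. The helper name first_title_after is hypothetical, and lxml stands in for the browser here, so it only checks the XPath logic, not the driver calls.

```python
import lxml.html

html = """
<div class="box"><ul class="navigation"><li class="live"><span>
Section Details
</span></li></ul></div>
<div class="box"><h2>
My Data Title
</h2></div>
<div class="box"><ul class="navigation"><li class="live"><span>
Another Section
</span></li></ul></div>
<div class="box"><h2>
Another Title
</h2></div>
"""

root = lxml.html.fromstring(html)


def first_title_after(section):
    # Hypothetical helper. 'section' is interpolated directly into the XPath,
    # so it must not contain a single quote.
    xpath = "(//span[normalize-space(text()) = '%s']/following::h2)[1]" % section
    return root.xpath(xpath)[0].text_content().strip()


print(first_title_after("Section Details"))
print(first_title_after("Another Section"))

# Without the trailing [1], following::h2 matches every later h2 in the document.
all_after = root.xpath(
    "//div[@class='box'][normalize-space(.)='Section Details']/following::h2"
)
titles = [h.text_content().strip() for h in all_after]
print(titles)
```

With Selenium, an expression that matches several h2 elements still works with find_element, which returns the first match in document order.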

get list items inside div tag using xpath

I have a html like this
<div id="all-stories" class="book">
<ul>
<li title="Book1"><a href="book1_url">Book1</a></li>
<li title="Book2"><a href="book2_url">Book2</a></li>
</ul>
</div>
I want to get the books and their respective URLs using XPath, but my approach is not working. For simplicity, I tried to extract all the elements under the "li" tags as follows:
lis = tree.xpath('//div[@id="all-stories"]/div/text()')
import lxml.html as LH
content = '''\
<div id="all-stories" class="book">
<ul>
<li title="Book1"><a href="book1_url">Book1</a></li>
<li title="Book2"><a href="book2_url">Book2</a></li>
</ul>
</div>
'''
root = LH.fromstring(content)
# Each match is an <a> element: read its href attribute and its text.
for atag in root.xpath('//div[@id="all-stories"]//li/a'):
    print(atag.attrib['href'], atag.text_content())
yields
book1_url Book1
book2_url Book2
The XPath //div[@id="all-stories"]/div does not match anything because there is no child div inside the outer div tag.
The XPath //div[@id="all-stories"]/li would not match either, because there is no direct child li tag inside the div tag. However, //div[@id="all-stories"]//li does match the li tags, because // tells XPath to search descendants at any depth.
Now, the content you are looking for is not in the li tag itself. It is inside the a tag. So instead use the XPath
'//div[@id="all-stories"]//li/a' to reach the a tags.
The value of the href attribute can be accessed with atag.attrib['href'], and the text with atag.text_content().

parse nested html lists using lxml in python

I am trying to parse the elements of an html list which looks like this:
<ol>
<li>r1</li>
<li>r2
<ul>
<li>n1</li>
<li>n2</li>
</ul>
</li>
<li>r3
<ul>
<li>d1
<ol>
<li>e1</li>
<li>e2</li>
</ol>
</li>
<li>d2</li>
</ul>
</li>
<li>r4</li>
</ol>
I am fine with parsing this for the most part, but the biggest problem for me is in getting the dom text back. Unfortunately lxml's node.text_content() returns the text form of the complete tree under it. Can I obtain the text content of just that element using lxml, or would I need to use string manipulation or regex for that?
For example, the node with d1 returns "d1e1e2", whereas I want it to return just d1.
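Each node has an attribute called text, which holds only the text that comes before the node's first child. That's what you are looking for.
e.g.:
for node in root.iter("*"):
    print(node.text)
    # print(node.tail)  # text after the closing tag, e.g. in <div> <span> abc </span> def </div>, span.text is " abc " and span.tail is " def "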
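To make the difference between text and text_content() concrete, here is a small sketch on a trimmed version of the list from the question. The variable names are illustrative, and the whitespace is stripped explicitly because the parsed text keeps the line breaks from the markup.

```python
import lxml.html

# Trimmed version of the nested list from the question.
html = """
<ol>
<li>r1</li>
<li>r3
<ul>
<li>d1
<ol>
<li>e1</li>
<li>e2</li>
</ol>
</li>
<li>d2</li>
</ul>
</li>
</ol>
"""

root = lxml.html.fromstring(html)

# Locate the <li> whose own leading text is "d1".
d1 = [li for li in root.iter("li") if (li.text or "").strip() == "d1"][0]

# .text holds only the text before the first child element.
own_text = d1.text.strip()

# text_content() concatenates the text of the whole subtree.
full_text = "".join(d1.text_content().split())

print(own_text, full_text)
```

Text that follows a child element inside the same parent is stored in that child's tail attribute, so a node with text both before and after a nested list needs text plus the tails of its children.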
