Python Scrapy: how to get only immediate children

I have some HTML like this:
<div class="content">
<div class="infobox">
<p> text </p>
<p> more text </p>
</div>
<p> text again </p>
<p> even more text </p>
</div>
I am using the selector '.content p::text'. I thought this would match only the immediate children, so I expected it to extract "text again" and "even more text", but it also returns the text from the paragraphs inside the inner div. How can I prevent this? I only want text from the paragraphs that are immediate children of the div with the class content.

Scrapy supports both XPath selectors and an extended set of CSS selectors. In your case, you're using CSS selectors. The combinator you want is >, the child combinator, which matches only direct parent/child relationships, as in: '.content > p::text'. Scrapy's selectors are described in the section titled "Selectors" in its documentation.
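The difference between the descendant and child relationships can be checked on the question's own markup. The following is a minimal sketch that assumes the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (so it runs without Scrapy installed); the variable names are illustrative only. In ElementTree's path syntax, './/p' plays the role of '.content p' and './p' plays the role of '.content > p'.

```python
import xml.etree.ElementTree as ET

# The markup from the question; it is well-formed, so ElementTree can parse it.
doc = """<div class="content">
<div class="infobox">
<p> text </p>
<p> more text </p>
</div>
<p> text again </p>
<p> even more text </p>
</div>"""

content = ET.fromstring(doc)  # the root element is the div with class "content"

# Descendant search, comparable to '.content p::text': every <p> at any depth.
all_text = [p.text.strip() for p in content.findall(".//p")]

# Child search, comparable to '.content > p::text': only <p> directly under the div.
direct_text = [p.text.strip() for p in content.findall("./p")]

print(all_text)     # ['text', 'more text', 'text again', 'even more text']
print(direct_text)  # ['text again', 'even more text']
```

In a Scrapy callback the corresponding call would be response.css('.content > p::text') followed by getall() or extract().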

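To match only child paragraphs, use the child combinator: div > p. Note that a bare div > p also matches the paragraphs inside the infobox ("text", "more text"), because they are children of a div too.
In your case, scope the selector to the outer div: .content > p. The adjacent sibling combinator, div + p, would match only a p that directly follows a div ("text again" here).
http://www.w3schools.com/cssref/css_selectors.asp
Worth reading.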

Related

How to get div text with Python Selenium?

How can I get the text "950" from the div that has neither an ID nor a class with Python Selenium?
<div class="player-hover-box" style="display: none;">
<div class="ps-price-hover">
<div><img class="price-platform-img-hover"></div>
<div>950</div>
</div>
I don't know how to access this div and its text.
If player-hover-box is a unique class name, you can use the following command:
price = driver.find_element_by_xpath('//div[@class="player-hover-box"]/div/div[2]').text
If the page contains more products with a similar HTML structure, your XPath locator should include some unique relation to another element.
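The path in that locator can be tried offline. This is a rough sketch assuming the standard library's xml.etree.ElementTree in place of a live Selenium driver; the markup is the question's snippet with the missing closing tag added and wrapped in a hypothetical <body> so it parses as XML.

```python
import xml.etree.ElementTree as ET

# The question's markup, closed and wrapped in <body> (an assumption made
# here so that ElementTree can parse it as well-formed XML).
doc = """<body>
<div class="player-hover-box" style="display: none;">
<div class="ps-price-hover">
<div><img class="price-platform-img-hover"/></div>
<div>950</div>
</div>
</div>
</body>"""

body = ET.fromstring(doc)

# Same steps as the Selenium XPath: the div with the given class,
# then its child div, then that div's second child div.
price_div = body.find(".//div[@class='player-hover-box']/div/div[2]")
price = price_div.text

print(price)  # 950
```

One caveat for the real page: the outer div has style="display: none;", and Selenium's .text only reports visible text, so it typically returns an empty string for hidden elements. Reading get_attribute("textContent") on the element is the usual workaround.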

Extract text from children of next nodes with XPath and Scrapy

With Python Scrapy, I am trying to get contents in a webpage whose nodes look like this:
<div id="title">Title</div>
<ul>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
...
</ul>
I'm new to XPath and haven't been able to get it working so far. My last attempt was something like:
contents = response.xpath('[@id="title"]/following-sibling::ul[1]//li//p.text()')
... but it seems I cannot use /following-sibling after [@id="title"].
Any idea?
Try this XPath:
contents = response.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()')
It selects both "CONTENT TO EXTRACT" text nodes.
Another XPath would be:
response.xpath('//*[@id="title"]/following-sibling::ul[1]//p/text()').getall()
which gets the text of every <p> descendant (child, grandchild, and so on) of the first <ul> that follows the node with id="title".
XPath syntax
Try this using a CSS selector:
response.css('#title + ul > li > p::text').extract()
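To see what following-sibling::ul[1]/li/p/text() selects, here is a small sketch that walks the same steps by hand. It assumes the standard library's xml.etree.ElementTree, whose XPath subset has no following-sibling axis, so the sibling step is written as plain Python; the <body> wrapper and the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# The question's markup inside a hypothetical <body> wrapper, without the "..." line.
doc = """<body>
<div id="title">Title</div>
<ul>
<li><span>blahblah</span><div>blahblah</div><p>CONTENT TO EXTRACT</p></li>
<li><span>blahblah</span><div>blahblah</div><p>CONTENT TO EXTRACT</p></li>
</ul>
</body>"""

body = ET.fromstring(doc)
children = list(body)

# //*[@id="title"]: find the title element among its parent's children.
title_index = next(i for i, el in enumerate(children) if el.get("id") == "title")

# following-sibling::ul[1]: the first <ul> after it on the same level.
first_ul = next(el for el in children[title_index + 1:] if el.tag == "ul")

# /li/p/text(): the text of each <p> directly inside a direct <li> child.
contents = [p.text for p in first_ul.findall("./li/p")]

print(contents)  # ['CONTENT TO EXTRACT', 'CONTENT TO EXTRACT']
```

With Scrapy or lxml, which implement full XPath 1.0, the single expression from the answers does all of this in one call.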

Python Selenium Webdriver - Grab div after specified one

I am trying to use Python Selenium Firefox Webdriver to grab the h2 content 'My Data Title' from this HTML
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Section Details
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
My Data Title
</h2>
</div>
<div class="box">
<ul class="navigation">
<li class="live">
<span>
Another Section
</span>
</li>
</ul>
</div>
<div class="box">
<h2>
Another Title
</h2>
</div>
Each div has a class of box so I can't easily identify the one I want. Is there a way to tell Selenium to grab the h2 in the box class that comes after the one that has the span called 'Section Details'?
If you want to grab the h2 in the box that comes after the one containing the span with the text Section Details, try the XPath below, using preceding:
(//h2[preceding::span[normalize-space(text()) = 'Section Details']])[1]
or using following:
(//span[normalize-space(text()) = 'Section Details']/following::h2)[1]
For Another Section, just change the span text in the XPath:
(//h2[preceding::span[normalize-space(text()) = 'Another Section']])[1]
or
(//span[normalize-space(text()) = 'Another Section']/following::h2)[1]
Here is an XPath to select the title following the text "Section Details":
//div[@class='box'][normalize-space(.)='Section Details']/following::h2
Yes, you need some more involved XPath searching:
referenceElementList = driver.find_elements_by_xpath("//span")
for eachElement in referenceElementList:
    # strip() because innerHTML may carry surrounding whitespace
    if eachElement.get_attribute("innerHTML").strip() == 'Section Details':
        elementYouWant = eachElement.find_element_by_xpath("../../../following-sibling::div/h2")
        break
elementYouWant.get_attribute("innerHTML") should give you "My Data Title" (possibly with surrounding whitespace).
My code reads:
find all span elements, regardless of where they are in the HTML, and store them in a list called referenceElementList;
iterate over the span elements in referenceElementList one by one, looking for a span whose innerHTML attribute is 'Section Details';
if there is a match, we have found the span, so we navigate up three levels to the enclosing div[@class='box'] and find that div's next sibling, which is the second div element;
lastly, we locate the h2 element inside that sibling div.
Can you please tell me if my code works? I might have gone wrong somewhere navigating backwards.
One potential difficulty: the innerHTML attribute may contain tab, newline and space characters. In that case, strip or filter the whitespace first (for example with a regex) before comparing.
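The idea shared by these answers (find the span by its text, then take the first h2 that comes later in the document) can be sketched without a browser. The example assumes the standard library's xml.etree.ElementTree, which lacks the following:: axis, so document order is emulated with iter(); it ignores the detail that following:: excludes descendants of the starting node. The helper name title_after and the <body> wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET

# The question's markup, compacted and wrapped in a hypothetical <body>.
doc = """<body>
<div class="box"><ul class="navigation"><li class="live"><span>
Section Details
</span></li></ul></div>
<div class="box"><h2>
My Data Title
</h2></div>
<div class="box"><ul class="navigation"><li class="live"><span>
Another Section
</span></li></ul></div>
<div class="box"><h2>
Another Title
</h2></div>
</body>"""


def title_after(root, section_name):
    """Return the text of the first <h2> after the <span> with the given text."""
    found_span = False
    for el in root.iter():  # iter() walks the tree in document order
        text = (el.text or "").strip()  # strip() mirrors normalize-space()
        if el.tag == "span" and text == section_name:
            found_span = True
        elif found_span and el.tag == "h2":
            return text
    return None  # no matching span, or no <h2> after it


root = ET.fromstring(doc)
print(title_after(root, "Section Details"))  # My Data Title
print(title_after(root, "Another Section"))  # Another Title
```

Stripping the text before comparing handles the tab, newline and space characters mentioned above.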

Python Xpath get the value only from root element

I'm using XPath to scrape a web page, but I'm having trouble with one part of the code:
<div class="description">
here's the page description
<span> some other text</span>
<span> another tag </span>
</div>
I'm using this code to get the value from the element:
description = tree.xpath('//div[@class="description"]/text()')
I can find the div I'm looking for, but I only want the text "here's the page description", not the content of the inner span tags.
Does anyone know how I can get only the text in the root node, without the content of its child nodes?
The expression you are currently using would actually match the top-level text child nodes only. You can just wrap it into normalize-space() to clean up the text from extra newlines and spaces:
>>> from lxml.html import fromstring
>>> data = """
... <div class="description">
... here's the page description
... <span> some other text</span>
... <span> another tag </span>
... </div>
... """
>>> root = fromstring(data)
>>> root.xpath('normalize-space(//div[@class="description"]/text())')
"here's the page description"
To get the complete text of a node including the child nodes, use the .text_content() method:
node = tree.xpath('//div[@class="description"]')[0]
print(node.text_content())
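The same distinction between an element's own leading text and its complete text can be illustrated without lxml. This sketch assumes the standard library's xml.etree.ElementTree, where .text is roughly the counterpart of the first text() node and itertext() is roughly the counterpart of lxml's text_content(); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

doc = """<div class="description">
here's the page description
<span> some other text</span>
<span> another tag </span>
</div>"""

div = ET.fromstring(doc)

# .text holds only the text that appears before the first child element.
own_text = div.text.strip()

# itertext() yields the text of the element and all of its descendants;
# split()/join() collapses the runs of whitespace, like normalize-space().
full_text = " ".join("".join(div.itertext()).split())

print(own_text)   # here's the page description
print(full_text)  # here's the page description some other text another tag
```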

BeautifulSoup Scraping How to

Consider an HTML structure like:
<div class="entry_content">
<p>
<script>some blah blah script here</script>
<fb:like--blah blah></fb:like>
<img/>
</p>
<p align="left">
content to be scraped begins here
</p>
<p>
more content to be scraped in one or many paragraphs from this paragraph onwards
</p>
-- there could be many more <p> here which also need to be included
</div>
The soup
content = soup.html.body.find('div', class_='entry_content')
gives me everything within the outermost div tag, including javascript, facebook code and all html tags.
Now, how do I remove everything before <p align="left">?
I tried something like:
content.split('<p align="left">')[1]
But this does not do the trick.
Have a look at extract or decompose.
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
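The approach of deleting unwanted nodes from the parsed tree, rather than splitting the markup as a string, can be sketched as follows. This example assumes the standard library's xml.etree.ElementTree, with Element.remove() standing in for BeautifulSoup's decompose(); the markup is a simplified, well-formed version of the question's structure, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the question's markup (the fb:like tag
# is left out because it is not valid XML).
doc = """<div class="entry_content">
<p><script>some blah blah script here</script><img/></p>
<p align="left">content to be scraped begins here</p>
<p>more content to be scraped</p>
</div>"""

content = ET.fromstring(doc)

# Remove every child that comes before the first <p align="left">.
# Iterate over a copy (list(...)) because the tree is modified in the loop.
# With BeautifulSoup, the removal step would be a decompose() call on the tag.
for child in list(content):
    if child.tag == "p" and child.get("align") == "left":
        break
    content.remove(child)

remaining = [p.text.strip() for p in content.findall("p")]
print(remaining)
# ['content to be scraped begins here', 'more content to be scraped']
```

Working on the tree avoids the problem with split(): the result of find() is a tag object, not a string.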
