I have HTML that looks like this:
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
I understand how to extract information from the h1 using scrapy:
content.select("//h1[contains(text(),'Text 1')]/text()").extract()
But my goal is to extract the content of <div>Some info</div>.
My problem is that I don't have any specific information about the div. All I know is that it comes directly after <h1>Text 1</h1>. Can I, using selectors, get the NEXT element in the tree, i.e. the element that sits at the same level in the DOM?
Something like:
a = content.select("//h1[contains(text(),'Text 1')]/text()")
a.next("//div/text()").extract()
which would output:
Some info
Try this XPath:
//h1[contains(text(), 'Text 1')]/following-sibling::div[1]/text()
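For example, with a scrapy Selector built from the HTML in the question (a minimal sketch; variable names are illustrative):

from scrapy.selector import Selector

sel = Selector(text='''
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
''')
# Take the first <div> sibling that follows the matching <h1>
info = sel.xpath("//h1[contains(text(), 'Text 1')]/following-sibling::div[1]/text()").get()
print(info)  # Some info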
Use the following-sibling axis. From the XPath specification (https://www.w3.org/TR/2017/REC-xpath-31-20170321/):
the following-sibling axis contains the context node's following siblings, those children of the context node's parent that occur after the context node in document order;
Example:
from scrapy.selector import Selector
text = '''
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
'''
sel = Selector(text=text)
h1s = sel.xpath('//h1/text()')
for counter, h1 in enumerate(h1s, 1):
    # The positional predicate (//h1)[n] pairs each <h1> with
    # the first <div> that follows it in document order
    div = sel.xpath('(//h1)[{}]/following-sibling::div[1]/text()'.format(counter))
    print(h1.get())
    print(div.get())
The output is:
Text 1
Some info
Text 2
...
I am using "scrapy" to scrape a few articles, like these ones: https://fivethirtyeight.com/features/championships-arent-won-on-paper-but-what-if-they-were/
I am using the following code in my spider:
def parse_article(self, response):
    il = ItemLoader(item=Scrapping538Item(), response=response)
    il.add_css('article_text', '.entry-content *::text')
...which works. But I'd like to make this CSS selector a little more sophisticated.
Right now, I am extracting every text passage. But looking at the article, there are tables and visualizations in there that contain text, too. The HTML structure looks like this:
<div class="entry-content single-post-content">
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<section class="viz">
<header class="viz">
<h5 class="title">TITLE-text</h5>
<p class="subtitle">SUB-TITLE-text</p>
</header>
<table class="viz full"">TABLE DATA</table>
</section>
<p>text I want</p>
<p>text I want</p>
</div>
With the code snippet above, I get something like:
text I want
text I want
text I want
TITLE-text <<<< (text I don't want)
SUB-TITLE-text <<<< (text I don't want)
TABLE DATA <<<< (text I don't want)
text I want
text I want
My questions:
How can I modify the add_css() function so that it takes all text except the text from the table?
Would it be easier with the add_xpath function?
In general, what would be the best practice for this (extracting text under conditions)?
Feedback would be much appreciated.
Use > in your CSS expression to limit it to children (direct descendants).
.entry-content > *::text
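A quick way to check this (a minimal sketch; the HTML is trimmed down from the question):

from scrapy.selector import Selector

sel = Selector(text='''
<div class="entry-content single-post-content">
<p>text I want</p>
<section class="viz"><p class="subtitle">SUB-TITLE-text</p></section>
<p>text I want</p>
</div>
''')
# Only text nodes sitting directly inside children of .entry-content are
# returned; SUB-TITLE-text is one level deeper, so it is skipped.
print(sel.css('.entry-content > *::text').getall())
# ['text I want', 'text I want']

Note that this also drops text nested deeper inside the paragraphs themselves (e.g. inside <a> tags), which may or may not be what you want.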
You can get the output that you want with XPath and the ancestor axis:
'//*[contains(@class, "entry-content")]//text()[not(ancestor::*[@class="viz"])]'
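For instance (a minimal sketch using a standalone Selector and a trimmed-down version of the question's HTML):

from scrapy.selector import Selector

sel = Selector(text='''
<div class="entry-content single-post-content">
<p>text I want</p>
<section class="viz"><h5 class="title">TITLE-text</h5></section>
<p>text I want</p>
</div>
''')
# Keep only text nodes that have no ancestor with class="viz"
texts = sel.xpath('//*[contains(@class, "entry-content")]'
                  '//text()[not(ancestor::*[@class="viz"])]').getall()
print([t for t in texts if t.strip()])
# ['text I want', 'text I want']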
Unless I'm missing something crucial, the following XPath should work:
import scrapy
import w3lib.html

raw = response.xpath(
    '//div[contains(@class, "entry-content") '
    'and contains(@class, "single-post-content")]/p'
).extract()
This omits the table content and only yields the text in paragraphs and links as a list. But there's a catch! Since we didn't use /text(), all <p> and <a> tags are still there. Let's remove them:
cleaned = [w3lib.html.remove_tags(block) for block in raw]
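If you want to keep using the ItemLoader from your parse_article method, you could then feed the cleaned list into it with add_value (a sketch, untested against the real page; il is the loader from the question):

il.add_value('article_text', cleaned)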
Hi, I am trying to scrape a subcategory:
subcat = soup.find(class_='bread-block-wrap').find(class_='breadcrumb-keyword-bg').find(class_='breadcrumb-keyword list-responsive-container').find(class_='ui-breadcrumb').find('h1')
and this is the output:
<h1>
Cellphones & Telecommunications
<span class="divider">></span> <span> Mobile Phones</span>
</h1>
So now there are two span tags. The first one is
<span class="divider">></span>
and the second one is
<span> Mobile Phones</span>
I want to scrape the text in the second span tag. Can someone please help?
You can use the find_all() function to get all the span tags in a list and then use the .text attribute to get the text.
subcat.find_all('span')[1].text
Should output
Mobile Phones
Demo
from bs4 import BeautifulSoup
html="""
<h1>
Cellphones & Telecommunications
<span class="divider">></span> <span> Mobile Phones</span>
</h1>
"""
soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')
print(h1.find_all('span')[1].text.strip())
Output
Mobile Phones
You can use the CSS nth-of-type selector:
h1 span:nth-of-type(2)
i.e.
items = soup.select("h1 span:nth-of-type(2)")
Then iterate over the list.
If only one match possible then simply:
item = soup.select_one("h1 span:nth-of-type(2)")
print(item.text.strip())
Another solution would be using CSS selectors, which let you avoid cascading find calls over and over again. In your case, this:
results = soup.select(".bread-block-wrap .breadcrumb-keyword-bg .breadcrumb-keyword.list-responsive-container .ui-breadcrumb h1 span")
is going to return the two span tags in a list. Then, you can simply use the second one.
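For example (assuming the soup object from the question):

print(results[1].text.strip())
# Mobile Phones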
You, of course, have lots of other useful tools to work with when you choose CSS selectors. Just find a CSS selector cheatsheet and enjoy.
I am trying to extract a ref. id from HTML with scrapy:
<div class="col" itemprop="description">
<p>text Ref. <span>220.20.34.20.53.001</span></p>
<p>more text</p>
</div>
The <span> and <p> tags are not always present.
Using an XPath selector:
text = ' '.join(response.xpath('//div[@itemprop="description"]/p/text()').extract()).replace(u'\xa0', u' ')
try:
    ref_id = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text)[0].strip()
In this case it returns only an empty string, as there is HTML inside the <p> tag.
Now trying to extract the text with CSS selector in order to use remove_tags:
>>> ''.join([remove_tags(w).strip() for w in response.css('div[itemprop="description"]::text').extract()])
This returns an empty result, as I somehow cannot grab the item.
How can I extract the ref_id regardless of whether there are <p> tags within the div or not? Some items of the crawl have no <p> tag and no <span>, and for those my first attempt with XPath works.
You don't need to use remove_tags, as you can get the text directly with the selectors:
sel.css('div[itemprop=description] ::text')
That will get all inner text from the div tag with itemprop="description", and later you can extract your information with a regex:
sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+')
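A quick way to try this out (a minimal sketch with a standalone Selector built from the HTML in the question):

from scrapy.selector import Selector

sel = Selector(text='''
<div class="col" itemprop="description">
<p>text Ref. <span>220.20.34.20.53.001</span></p>
<p>more text</p>
</div>
''')
# The space before ::text also matches text inside nested tags such as <span>
print(sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+'))
# 220.20.34.20.53.001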
Try removing ::text from your last expression:
''.join([remove_tags(w).strip() for w in response.css('div[itemprop=description]').extract()])
But if you need to extract only 220.20.34.20.53.001 from your html, why don't you use response.css('div[itemprop=description] p span::text').extract()?
Or even response.css('div[itemprop=description]').re(r'([\.\d]+)').
I scraped a website and I want to find an element based on the text written in it. Let's say the code below is a sample from the website:
import bs4

code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")
I want some way to get a p element whose text value is Some Information. How can I select such an element?
Just use the text parameter (in BeautifulSoup 4.4.0+ the same argument is also available under the name string):
code.find_all("p", text="Some Information")
If you need only the first element, then use find instead of find_all.
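For example, with the soup from the question:

print(code.find_all("p", text="Some Information"))
# [<p>Some Information</p>]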
You could use text to search for all tags matching the string:
import bs4 as bs

code = bs.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")

for elem in code(text='Some Information'):
    print(elem.parent)
I'm trying to capture some text from a webpage (whose URL is passed when running the script), but it's buried in a paragraph tag with no other attributes assigned. I can collect the contents of every paragraph tag, but I want to remove any elements from the tree that contain any of a list of keywords.
I get the following error:
tree.remove(elem)
TypeError: Argument 'element' has incorrect type (expected lxml.etree._Element, got _ElementStringResult)
I understand that what I am getting back when I try to iterate through the tree is the wrong type, but how do I get the element instead?
Sample Code:
#!/usr/bin/python
import sys

import requests
from lxml import html

url = sys.argv[1]
page = requests.get(url)
tree = html.fromstring(page.content)
terms = ['keyword1','keyword2','keyword3','keyword4','keyword5','keyword6','keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        tree.remove(elem)  # this line raises the TypeError above
In your code, elem is an _ElementStringResult, which has the instance method getparent(). Its parent is the Element object of one of the <p> nodes.
That parent in turn has a remove() method, which can be used to remove an element from the tree:
element.getparent().remove(element)
I do not believe there is a more direct way, and I don't have a good answer as to why there isn't a removeself method.
Using the example html:
content = '''
<root>
<p> nothing1 </p>
<p> keyword1 </p>
<p> nothing2 </p>
<p> nothing3 </p>
<p> keyword4 </p>
</root>
'''
You can see this in action in your code with:
from lxml import html

tree = html.fromstring(content)
terms = ['keyword1','keyword2','keyword3','keyword4','keyword5','keyword6','keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        # getparent() on the string result gives the enclosing <p> element
        actual_element = elem.getparent()
        # remove that <p> from its own parent (<root>)
        actual_element.getparent().remove(actual_element)

for child in tree:
    print('<{tag}>{text}</{tag}>'.format(tag=child.tag, text=child.text))
# Output:
# <p> nothing1 </p>
# <p> nothing2 </p>
# <p> nothing3 </p>
From the comments, it seems like this code isn't working for you. If so, you might need to provide more information about the structure of your html.