I have HTML that looks like this:
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
I understand how to extract information from the h1 using scrapy:
content.select("//h1[contains(text(),'Text 1')]/text()").extract()
But my goal is to extract the content of <div>Some info</div>.
My problem is that I don't have any specific information about the div. All I know is that it comes directly after <h1>Text 1</h1>. Can I, using selectors, get the NEXT element in the tree, i.e. the element that sits at the same level in the DOM?
Something like:
a = content.select("//h1[contains(text(),'Text 1')]/text()")
a.next("//div/text()").extract()
which would output:
Some info
Try this XPath:
//h1[contains(text(), 'Text 1')]/following-sibling::div[1]/text()
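For example, with a scrapy Selector built from the HTML in the question (a minimal sketch; variable names are illustrative):

from scrapy.selector import Selector

sel = Selector(text='''
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
''')
# Take the first <div> sibling that follows the matching <h1>
info = sel.xpath("//h1[contains(text(), 'Text 1')]/following-sibling::div[1]/text()").get()
print(info)  # Some info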
Use the following-sibling axis. From the XPath specification (https://www.w3.org/TR/2017/REC-xpath-31-20170321/):
the following-sibling axis contains the context node's following siblings, those children of the context node's parent that occur after the context node in document order;
Example:
from scrapy.selector import Selector
text = '''
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
'''
sel = Selector(text=text)
h1s = sel.xpath('//h1/text()')
for counter, h1 in enumerate(h1s, 1):
    # The positional predicate (//h1)[n] pairs each <h1> with
    # the first <div> that follows it in document order
    div = sel.xpath('(//h1)[{}]/following-sibling::div[1]/text()'.format(counter))
    print(h1.get())
    print(div.get())
The output is:
Text 1
Some info
Text 2
...
I am using "scrapy" to scrape a few articles, like these ones: https://fivethirtyeight.com/features/championships-arent-won-on-paper-but-what-if-they-were/
I am using the following code in my spider:
def parse_article(self, response):
    il = ItemLoader(item=Scrapping538Item(), response=response)
    il.add_css('article_text', '.entry-content *::text')
...which works. But I'd like to make this CSS selector a little more sophisticated.
Right now, I am extracting every text passage. But looking at the article, there are tables and visualizations in there that contain text, too. The HTML structure looks like this:
<div class="entry-content single-post-content">
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<section class="viz">
<header class="viz">
<h5 class="title">TITLE-text</h5>
<p class="subtitle">SUB-TITLE-text</p>
</header>
<table class="viz full"">TABLE DATA</table>
</section>
<p>text I want</p>
<p>text I want</p>
</div>
With the code snippet above, I get something like:
text I want
text I want
text I want
TITLE-text <<<< (text I don't want)
SUB-TITLE-text <<<< (text I don't want)
TABLE DATA <<<< (text I don't want)
text I want
text I want
My questions:
How can I modify the add_css() function so that it takes all text except the text from the table?
Would it be easier with the add_xpath function?
In general, what would be the best practice for this (extracting text under conditions)?
Feedback would be much appreciated.
Use > in your CSS expression to limit it to children (direct descendants).
.entry-content > *::text
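A quick way to check this (a minimal sketch; the HTML is trimmed down from the question):

from scrapy.selector import Selector

sel = Selector(text='''
<div class="entry-content single-post-content">
<p>text I want</p>
<section class="viz"><p class="subtitle">SUB-TITLE-text</p></section>
<p>text I want</p>
</div>
''')
# Only text nodes sitting directly inside children of .entry-content are
# returned; SUB-TITLE-text is one level deeper, so it is skipped.
print(sel.css('.entry-content > *::text').getall())
# ['text I want', 'text I want']

Note that this also drops text nested deeper inside the paragraphs themselves (e.g. inside <a> tags), which may or may not be what you want.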
You can get the output that you want with XPath and the ancestor axis:
'//*[contains(@class, "entry-content")]//text()[not(ancestor::*[@class="viz"])]'
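For instance (a minimal sketch using a standalone Selector and a trimmed-down version of the question's HTML):

from scrapy.selector import Selector

sel = Selector(text='''
<div class="entry-content single-post-content">
<p>text I want</p>
<section class="viz"><h5 class="title">TITLE-text</h5></section>
<p>text I want</p>
</div>
''')
# Keep only text nodes that have no ancestor with class="viz"
texts = sel.xpath('//*[contains(@class, "entry-content")]'
                  '//text()[not(ancestor::*[@class="viz"])]').getall()
print([t for t in texts if t.strip()])
# ['text I want', 'text I want']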
Unless I'm missing something crucial, the following XPath should work:
import scrapy
import w3lib.html

raw = response.xpath(
    '//div[contains(@class, "entry-content") '
    'and contains(@class, "single-post-content")]/p'
).extract()
This omits the table content and only yields the text in paragraphs and links as a list. But there's a catch! Since we didn't use /text(), all <p> and <a> tags are still there. Let's remove them:
cleaned = [w3lib.html.remove_tags(block) for block in raw]
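If you want to keep using the ItemLoader from your parse_article method, you could then feed the cleaned list into it with add_value (a sketch, untested against the real page; il is the loader from the question):

il.add_value('article_text', cleaned)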
Hi, I am trying to scrape a subcategory:
subcat = soup.find(class_='bread-block-wrap').find(class_='breadcrumb-keyword-bg').find(class_='breadcrumb-keyword list-responsive-container').find(class_='ui-breadcrumb').find('h1')
and this is the output:
<h1>
Cellphones & Telecommunications
<span class="divider">></span> <span> Mobile Phones</span>
</h1>
So now there are two span tags. The first one is
<span class="divider">></span>
and the second one is
<span> Mobile Phones</span>
I want to scrape the text in the second span tag. Can someone please help?
You can use the find_all() function to get all the span tags in a list and then use the .text attribute to get the text.
subcat.find_all('span')[1].text
Should output
Mobile Phones
Demo
from bs4 import BeautifulSoup
html="""
<h1>
Cellphones & Telecommunications
<span class="divider">></span> <span> Mobile Phones</span>
</h1>
"""
soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')
print(h1.find_all('span')[1].text.strip())
Output
Mobile Phones
You can use the CSS nth-of-type selector:
h1 span:nth-of-type(2)
i.e.
items = soup.select("h1 span:nth-of-type(2)")
Then iterate over the list.
If only one match possible then simply:
item = soup.select_one("h1 span:nth-of-type(2)")
print(item.text.strip())
Another solution would be using CSS selectors, which let you avoid cascading find calls over and over again. In your case, this:
results = soup.select(".bread-block-wrap .breadcrumb-keyword-bg .breadcrumb-keyword.list-responsive-container .ui-breadcrumb h1 span")
is going to return the two span tags in a list. Then, you can simply use the second one.
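For example (assuming the soup object from the question):

print(results[1].text.strip())
# Mobile Phones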
You, of course, have lots of other useful tools to work with when you choose CSS selectors. Just find a CSS selector cheatsheet and enjoy.
I am trying to extract a ref. id from HTML with scrapy:
<div class="col" itemprop="description">
<p>text Ref. <span>220.20.34.20.53.001</span></p>
<p>more text</p>
</div>
The <span> and <p> tags are not always present.
Using an XPath selector:
text = ' '.join(response.xpath('//div[@itemprop="description"]/p/text()').extract()).replace(u'\xa0', u' ')
try:
    ref_id = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text)[0].strip()
In this case it returns only an empty string, as there is HTML inside the <p> tag.
Now trying to extract the text with CSS selector in order to use remove_tags:
>>> ''.join([remove_tags(w).strip() for w in response.css('div[itemprop="description"]::text').extract()])
This returns an empty result, as I somehow cannot grab the item.
How can I extract the ref_id regardless of whether there are <p> tags within the div or not? Some items of the crawl have no <p> tag and no <span>, and for those my first attempt with XPath works.
You don't need to use remove_tags, as you can get the text directly with the selectors:
sel.css('div[itemprop=description] ::text')
That will get all inner text from the div tag with itemprop="description", and later you can extract your information with a regex:
sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+')
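A quick way to try this out (a minimal sketch with a standalone Selector built from the HTML in the question):

from scrapy.selector import Selector

sel = Selector(text='''
<div class="col" itemprop="description">
<p>text Ref. <span>220.20.34.20.53.001</span></p>
<p>more text</p>
</div>
''')
# The space before ::text also matches text inside nested tags such as <span>
print(sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+'))
# 220.20.34.20.53.001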
Try removing ::text from your last expression:
''.join([remove_tags(w).strip() for w in response.css('div[itemprop=description]').extract()])
But if you need to extract only 220.20.34.20.53.001 from your html, why don't you use response.css('div[itemprop=description] p span::text').extract()?
Or even response.css('div[itemprop=description]').re(r'([\.\d]+)').
I scraped a website and I want to find an element based on the text written in it. Let's say the code below is a sample from the website:
import bs4

code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")
I want some way to get a p element whose text value is Some Information. How can I select such an element?
Just use the text parameter (in BeautifulSoup 4.4.0+ the same argument is also available under the name string):
code.find_all("p", text="Some Information")
If you need only the first element, then use find instead of find_all.
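For example, with the soup from the question:

print(code.find_all("p", text="Some Information"))
# [<p>Some Information</p>]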
You could use text to search for all tags matching the string:
import bs4 as bs

code = bs.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")

for elem in code(text='Some Information'):
    print(elem.parent)
I'm trying to capture some text from a webpage (whose URL is passed when running the script), but it's buried in a paragraph tag with no other attributes assigned. I can collect the contents of every paragraph tag, but I want to remove any elements from the tree that contain any of a list of keywords.
I get the following error:
tree.remove(elem)
TypeError: Argument 'element' has incorrect type (expected lxml.etree._Element, got _ElementStringResult)
I understand that what I am getting back when I try to iterate through the tree is the wrong type, but how do I get the element instead?
Sample Code:
#!/usr/bin/python
import sys

import requests
from lxml import html

url = sys.argv[1]
page = requests.get(url)
tree = html.fromstring(page.content)
terms = ['keyword1','keyword2','keyword3','keyword4','keyword5','keyword6','keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        tree.remove(elem)  # this line raises the TypeError above
In your code, elem is an _ElementStringResult, which has the instance method getparent(). Its parent is the Element object of one of the <p> nodes.
That parent in turn has a remove() method, which can be used to remove an element from the tree:
element.getparent().remove(element)
I do not believe there is a more direct way, and I don't have a good answer as to why there isn't a removeself method.
Using the example html:
content = '''
<root>
<p> nothing1 </p>
<p> keyword1 </p>
<p> nothing2 </p>
<p> nothing3 </p>
<p> keyword4 </p>
</root>
'''
You can see this in action in your code with:
from lxml import html

tree = html.fromstring(content)
terms = ['keyword1','keyword2','keyword3','keyword4','keyword5','keyword6','keyword7']
paragraphs = tree.xpath('//p/text()')
for elem in paragraphs:
    if any(term in elem for term in terms):
        # getparent() on the string result gives the enclosing <p> element
        actual_element = elem.getparent()
        # remove that <p> from its own parent (<root>)
        actual_element.getparent().remove(actual_element)

for child in tree:
    print('<{tag}>{text}</{tag}>'.format(tag=child.tag, text=child.text))
# Output:
# <p> nothing1 </p>
# <p> nothing2 </p>
# <p> nothing3 </p>
From the comments, it seems like this code isn't working for you. If so, you might need to provide more information about the structure of your html.