Using scrapy selector with conditions - python

I am using scrapy to scrape a few articles, like this one: https://fivethirtyeight.com/features/championships-arent-won-on-paper-but-what-if-they-were/
I am using the following code in my spider:
def parse_article(self, response):
    il = ItemLoader(item=Scrapping538Item(), response=response)
    il.add_css('article_text', '.entry-content *::text')
...which works. But I'd like to make this CSS selector a bit more sophisticated.
Right now, I am extracting every text passage. But looking at the article, there are tables and visualizations in there, which include text, too. The HTML structure looks like this:
<div class="entry-content single-post-content">
    <p>text I want</p>
    <p>text I want</p>
    <p>text I want</p>
    <section class="viz">
        <header class="viz">
            <h5 class="title">TITLE-text</h5>
            <p class="subtitle">SUB-TITLE-text</p>
        </header>
        <table class="viz full">TABLE DATA</table>
    </section>
    <p>text I want</p>
    <p>text I want</p>
</div>
With the code snippet above, I get something like:
text I want
text I want
text I want
TITLE-text <<<< (text I don't want)
SUB-TITLE-text <<<< (text I don't want)
TABLE DATA <<<< (text I don't want)
text I want
text I want
My questions:
How can I modify the add_css() function in a way such that it takes all text except text from the table?
Would it be easier with the add_xpath() function?
In general, what would be the best practice for this (extracting text under conditions)?
Feedback would be much appreciated

Use > in your CSS expression to limit it to children (direct descendants).
.entry-content > *::text
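As a quick check, here is a minimal sketch of my own (using the sample HTML from the question) showing what the child combinator changes:

from scrapy.selector import Selector

html = '''
<div class="entry-content single-post-content">
  <p>text I want</p>
  <section class="viz"><table class="viz full">TABLE DATA</table></section>
  <p>text I want</p>
</div>
'''

sel = Selector(text=html)
# '*::text' here matches text nodes sitting directly inside the div's
# children, so the table text (a grandchild of .entry-content) is skipped:
print(sel.css('.entry-content > *::text').getall())
# ['text I want', 'text I want']

Note that this also drops any text nested one level deeper, e.g. inside an <a> or <em> within a paragraph.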

You can get the output you want with XPath and the ancestor axis:
'//*[contains(@class, "entry-content")]//text()[not(ancestor::*[@class="viz"])]'
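For completeness, a sketch of how this could be plugged into the asker's loader (assuming the same ItemLoader setup as in the question):

il = ItemLoader(item=Scrapping538Item(), response=response)
il.add_xpath(
    'article_text',
    '//*[contains(@class, "entry-content")]//text()[not(ancestor::*[@class="viz"])]',
)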

Unless I miss something crucial, the following XPath should work:
import scrapy
import w3lib.html

raw = response.xpath(
    '//div[contains(@class, "entry-content") '
    'and contains(@class, "single-post-content")]/p'
).extract()
This omits the table content and only yields the text in paragraphs and links as a list. But there's a catch! Since we didn't use /text(), all <p> and <a> tags are still there. Let's remove them:
cleaned = [w3lib.html.remove_tags(block) for block in raw]
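If a single string is wanted for the item field, the cleaned blocks can then be joined (my addition, not part of the answer):

article_text = '\n'.join(cleaned)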

Related

Iterate over all elements in HTML and replace content with BeautifulSoup

In my database I am storing HTML coming from a custom CMS's WYSIWYG editor.
The contents are in English and I'd like to use BeautifulSoup to iterate over every single element, translate its contents to German (using another class, Translator), and replace the value of the current element with the translated text.
So far, I have been able to come up with specific selectors for p, a and pre in combination with BeautifulSoup's .findAll function, but it is not clear to me how I can simply go through all elements and replace their content on the fly, instead of having to filter based on a specific type.
A very basic example of HTML produced by the editor covering all different kinds of types:
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p>This is a search engine</p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>
The bs4 documentation points me to a replace_with function, which would be ideal if I could just select each element one after the other, without having to specifically select something.
Pointers would be welcome 😊
Here is a small code sample showing how to use BeautifulSoup to substitute strings. In your case you need a preliminary step: get the mapping between the languages; a dictionary could do the job.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml') # or use any other parser
new_string = 'xxx' # replace each string with the same value
_ = [s.replace_with(new_string) for s in soup.find_all(string=True)]
print(soup.prettify())
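To sketch that preliminary step, here is one way the dictionary-based substitution could look (the translations mapping is hypothetical, standing in for the Translator class):

from bs4 import BeautifulSoup

translations = {'Heading 1': 'Überschrift 1', 'Normal text': 'Normaler Text'}

soup = BeautifulSoup(html, 'lxml')
for s in soup.find_all(string=True):
    translated = translations.get(s.strip())
    if translated:
        s.replace_with(translated)
print(soup.prettify())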
You can basically do this to iterate over every element:
html="""
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p>This is a search engine</p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"lxml")
for x in soup.findAll():
    print(x.text)
    # You can try this as well; I think it will return the result you expect:
    print(x.find(text=True, recursive=False))
Output :
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text
Quote
text after quote
code
text after code
This is a search engine
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text
Quote
text after quote
code
text after code
This is a search engine
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Bold text
Italic text
Italic text
Quote
text after quote
code
text after code
This is a search engine
This is a search engine
And I believe you have a translator function and know how to do the replacement as well.

Merging two HTML strings into one, using Python

I'm trying to understand if there's a relatively simple way to take an HTML string and "insert" it inside a different HTML string. I tried converting the HTML into a simple DIV and putting it in the first HTML, but that didn't work and caused weird failures.
Some more info: I'm creating a report using bokeh and have some figures. My code creates some figures and appends them to a list, which is eventually parsed into an HTML file and saved on my PC. What I want to do is read a different HTML string and append it entirely to my report.
You can do that with BeautifulSoup. See this example:
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup("<html><body><p>my paragraph</p></body></html>", "html.parser")
body = soup.find("body")
new_tag = soup.new_tag("a", href="http://www.example.com")
body.append(new_tag)
another_new_tag = soup.new_tag("p")
another_new_tag.insert(0, NavigableString("bla bla, and more bla"))
body.append(another_new_tag)
print(soup.prettify())
The result is:
<html>
<body>
<p>
my paragraph
</p>
<a href="http://www.example.com">
</a>
<p>
bla bla, and more bla
</p>
</body>
</html>
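For the specific case of merging two HTML strings, a sketch of another option (assuming the fragment is valid HTML): parse both and append a tag from one tree into the other; BeautifulSoup re-parents it.

from bs4 import BeautifulSoup

report = BeautifulSoup("<html><body><p>my report</p></body></html>", "html.parser")
fragment = BeautifulSoup("<div><h2>extra section</h2></div>", "html.parser")
report.body.append(fragment.div)  # moves the fragment's <div> into the report
print(report.prettify())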
So what I was looking for, and what solves my problem, is just using an iframe with the srcdoc attribute:
iframe = '<iframe srcdoc="%s"></iframe>' % raw_html
and then I can push this iframe into the original HTML wherever I want.
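One caveat worth adding (my note, not part of the original answer): the embedded document must be attribute-escaped, or any double quote inside raw_html will terminate the srcdoc attribute early. The standard library handles this:

import html

iframe = '<iframe srcdoc="%s"></iframe>' % html.escape(raw_html, quote=True)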

Select element based on text inside Beautiful Soup

I scraped a website and I want to find an element based on the text written in it. Let's say below is a sample of the website's code:
code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")
I want some way to get a p element that has as a text value Some Information. How can I select an element like so?
Just use the text parameter:
code.find_all("p", text="Some Information")
If you need only the first element, then use find instead of find_all.
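Applied to the snippet in the question, these calls return (illustrative output):

print(code.find_all("p", text="Some Information"))
# [<p>Some Information</p>]
print(code.find("p", text="Some Information"))
# <p>Some Information</p>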
You could use text to search all tags matching the string:
from bs4 import BeautifulSoup
code = BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")
for elem in code(text='Some Information'):
    print(elem.parent)

How to select next node using scrapy

I have HTML that looks like this:
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
I understand how to extract information from the h1 using scrapy:
content.select("//h1[contains(text(),'Text 1')]/text()").extract()
But my goal is to extract content from <div>Some info</div>
My problem is that I don't have any specific information about the div. All I know is that it comes right after <h1>Text 1</h1>. Can I, using selectors, get the NEXT element in the tree, i.e. the element situated on the same level in the DOM tree?
Something like:
a = content.select("//h1[contains(text(),'Text 1')]/text()")
a.next("//div/text()").extract()
Some info
Try this xpath:
//h1[contains(text(), 'Text 1')]/following-sibling::div[1]/text()
Use following-sibling. From https://www.w3.org/TR/2017/REC-xpath-31-20170321/
the following-sibling axis contains the context node's following siblings, those children of the context node's parent that occur after the context node in document order;
Example:
from scrapy.selector import Selector
text = '''
<h1>Text 1</h1>
<div>Some info</div>
<h1>Text 2</h1>
<div>...</div>
'''
sel = Selector(text=text)
h1s = sel.xpath('//h1/text()')
for counter, h1 in enumerate(h1s, 1):
    div = sel.xpath('(//h1)[{}]/following-sibling::div[1]/text()'.format(counter))
    print(h1.get())
    print(div.get())
The output is:
Text 1
Some info
Text 2
...

Extracting text from a noisy string - python

I have some HTML documents and I want to extract a very particular text from them.
Now, this text is always located as:
<div class = "fix">text </div>
Now, sometimes what happens is... there are other opening divs as well... something like:
<div class = "fix"> part of text <div something> other text </div> some more text </div>
Now... I want to extract all the text corresponding to the
<div class = "fix"> </div> markup.
How do I do this?
I would use the BeautifulSoup library. It's kind of built for this; as long as your data is correct HTML, it should find exactly what you're looking for. It has reasonably good documentation and is extremely straightforward, even for beginners. If your file is on the web somewhere where you can't access the direct HTML, grab the HTML with urllib.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")
soup.find("div", {"class": "fix"})
If there is more than one matching item, use find_all instead. This should give you what you're looking for (roughly).
Edit: fixed the example (class is a keyword, so you can't use the usual attr="blah" form; pass a dict instead).
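If the goal is the text alone, including the part after the nested div, one option (a sketch, continuing from the snippet above) is get_text() on the match:

div = soup.find("div", {"class": "fix"})
print(div.get_text())  # roughly: part of text other text some more text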
Here's a really simple solution that uses a non-greedy regex to remove all HTML tags:
import re
s = "<div class = \"fix\"> part of text <div something> other text </div> some more text </div>"
s_text = re.sub(r'<.*?>', '', s)
The values are then:
print(s)
<div class = "fix"> part of text <div something> other text </div> some more text </div>
print(s_text)
part of text other text some more text
