Iterate over all elements in HTML and replace content with BeautifulSoup - python

In my database I am storing HTML coming from a custom CMS's WYSIWYG editor.
The contents are in English and I'd like to use Beautifulsoup to iterate over every single element, translate its contents to German (using another class, Translator) and replace the value of the current element with the translated text.
So far I have come up with specific selectors for p, a and pre in combination with BeautifulSoup's .findAll function. However, after some googling it is still not clear to me how I can simply go through all elements and replace their content on the fly, instead of having to filter on a specific type.
A very basic example of HTML produced by the editor covering all different kinds of types:
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p>This is a search engine</p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>
The bs4 documentation points me to a replace_with function, which would be ideal if I could only select each element after each other, not having to specifically select something.
Pointers would be welcome 😊

Here is a small code sample showing how to use BeautifulSoup to substitute strings. In your case you need a preliminary step: build the mapping between the languages; a dictionary could do the job.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml') # or use any other parser
new_string = 'xxx' # replace each string with the same value
_ = [s.replace_with(new_string) for s in soup.find_all(string=True)]
print(soup.prettify())
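To adapt that snippet to the translation use case, you can replace each text node with whatever your Translator returns. A minimal sketch, where the translate function is a hypothetical stand-in for your Translator class (here it just upper-cases):

```python
from bs4 import BeautifulSoup

def translate(text):
    # hypothetical stand-in for your Translator class
    return text.upper()

html = "<h1>Heading 1</h1><p><strong>Bold text</strong></p>"
soup = BeautifulSoup(html, "html.parser")

# find_all(string=True) yields every NavigableString in the tree,
# so nested tags like <strong> are handled automatically
for node in soup.find_all(string=True):
    if node.strip():  # skip whitespace-only nodes
        node.replace_with(translate(node))

print(soup)  # <h1>HEADING 1</h1><p><strong>BOLD TEXT</strong></p>
```

Because the strings are replaced in place, the surrounding tags and attributes are left untouched, which is exactly what you want for WYSIWYG-produced HTML.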

You can basically do this to iterate over every element:
html="""
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<p>Normal text</p>
<p><strong>Bold text</strong></p>
<p><em>Italic text </em></p>
<p><br></p>
<blockquote>Quote</blockquote>
<p>text after quote</p>
<p><br></p>
<p><br></p>
<pre class="code-syntax" spellcheck="false">code</pre>
<p><br></p>
<p>text after code</p>
<p><br></p>
<p>This is a search engine</p>
<p><br></p>
<p><img src="https://via.placeholder.com/350x150"></p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
for x in soup.findAll():
    print(x.text)
    # You can try this as well
    print(x.find(text=True, recursive=False))
    # I think this will return the result you expect.
Output :
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text
Quote
text after quote
code
text after code
This is a search engine
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Italic text
Quote
text after quote
code
text after code
This is a search engine
Heading 1
Heading 2
Heading 3
Normal text
Bold text
Bold text
Italic text
Italic text
Quote
text after quote
code
text after code
This is a search engine
This is a search engine
And I believe you have a translator function and know how to do the replacement as well.

Related

Putting Links in Parenthesis with BeautifulSoup

BeautifulSoup's get_text() function only returns the textual content of an HTML page. However, I want my program to return the href attribute of an <a> tag in parentheses directly after it returns the actual text.
In other words, using get_text() will just return "17.602" on the following HTML:
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
However, I want my program to return "17.602 (17.602.html#FAR_17_602)". How would I go about doing this?
EDIT: What if you need to print text from other tags, such as:
<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>
In other words, how would you compose a program that would print
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
You can format the output using f-strings.
Access the tag's text using .text, and then access the href attribute.
from bs4 import BeautifulSoup
html = """
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
"""
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find("a")
print(f"{a_tag.text} ({a_tag['href']})")
Output:
17.602 (17.602.html#FAR_17_602)
Edit: You can use .next_sibling and .previous_sibling
print(f"{a_tag.previous_sibling.strip()} {a_tag.text} ({a_tag['href']}) {a_tag.next_sibling.strip()}")
Output:
Sample text. 17.602 (17.602.html#FAR_17_602) Sample closing text.
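If a paragraph can contain several links mixed with text, a more general approach than sibling lookups is to walk the tag's children and format each piece as you go. A sketch using the HTML from the edit above:

```python
from bs4 import BeautifulSoup

html = """<p> Sample text.
<a class="xref fm:ParaNumOnly" href="17.602.html#FAR_17_602">17.602</a>
Sample closing text.
</p>"""
soup = BeautifulSoup(html, "html.parser")

parts = []
for child in soup.p.children:
    if child.name == "a":  # tags have a name; plain strings have name == None
        parts.append(f"{child.text} ({child['href']})")
    else:
        parts.append(child.strip())

print(" ".join(p for p in parts if p))
```

This handles any number of links per paragraph and does not break if a link has no text before or after it.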

Using scrapy selector with conditions

I am using "scrapy" to scrape a few articles, like these ones: https://fivethirtyeight.com/features/championships-arent-won-on-paper-but-what-if-they-were/
I am using the following code in my spider:
def parse_article(self, response):
    il = ItemLoader(item=Scrapping538Item(), response=response)
    il.add_css('article_text', '.entry-content *::text')
...which works. But I'd like to make this CSS selector a little more sophisticated.
Right now, I am extracting every text passage. But looking at the article, there are tables and visualizations in there, which include text, too. The HTML structure looks like this:
<div class="entry-content single-post-content">
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<section class="viz">
<header class="viz">
<h5 class="title">TITLE-text</h5>
<p class="subtitle">SUB-TITLE-text</p>
</header>
<table class="viz full">TABLE DATA</table>
</section>
<p>text I want</p>
<p>text I want</p>
</div>
With the code snippet above, I get something like:
text I want
text I want
text I want
TITLE-text <<<< (text I don't want)
SUB-TITLE-text <<<< (text I don't want)
TABLE DATA <<<< (text I don't want)
text I want
text I want
My questions:
How can I modify the add_css() function so that it takes all text except the text from the table?
Would it be easier with the add_xpath function?
In general, what would be the best practice for this (extracting text under conditions)?
Feedback would be much appreciated.
Use > in your CSS expression to limit it to children (direct descendants).
.entry-content > *::text
You can get output that you want with XPath and ancestor axis:
'//*[contains(@class, "entry-content")]//text()[not(ancestor::*[@class="viz"])]'
Unless I miss something crucial, the following xpath should work:
import scrapy
import w3lib.html

raw = response.xpath(
    '//div[contains(@class, "entry-content") '
    'and contains(@class, "single-post-content")]/p'
).extract()
This omits the table content and only yields the text in paragraphs and links as a list. But there's a catch! Since we didn't use /text(), all <p> and <a> tags are still there. Let's remove them:
cleaned = [w3lib.html.remove_tags(block) for block in raw]
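The ancestor-axis XPath from the earlier answer can also be tested outside of scrapy, since lxml uses the same XPath engine as scrapy's selectors. A small self-contained check (the HTML is a trimmed version of the structure from the question):

```python
from lxml import html as lxml_html

doc = lxml_html.fromstring("""
<div class="entry-content single-post-content">
  <p>text I want</p>
  <section class="viz">
    <h5 class="title">TITLE-text</h5>
  </section>
  <p>more text I want</p>
</div>
""")

# keep only text nodes that have no ancestor carrying a "viz" class
texts = doc.xpath(
    '//div[contains(@class, "entry-content")]'
    '//text()[not(ancestor::*[contains(@class, "viz")])]'
)
print([t.strip() for t in texts if t.strip()])
```

Note the use of contains(@class, ...) rather than @class="viz": the latter only matches an exact attribute value, which would miss elements with multiple classes such as class="viz full".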

Select element based on text inside Beautiful Soup

I scraped a website and want to find an element based on the text written in it. Let's say below is a sample of the page's code:
code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""")
I want some way to get a p element that has as a text value Some Information. How can I select an element like so?
Just use the text parameter:
code.find_all("p", text="Some Information")
If you need only the first element than use find instead of find_all.
You could use text to search all tags matching the string:
from bs4 import BeautifulSoup
code = BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")
for elem in code(text='Some Information'):
    print(elem.parent)
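If the match needs to be more forgiving than an exact string, for example case-insensitive so that the heading's "Some information" is also caught, the text/string argument accepts a compiled regular expression. A sketch:

```python
import re
from bs4 import BeautifulSoup

code = BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")

# a compiled regex matches regardless of case; find_all(string=...) with no
# tag name returns the matching NavigableStrings themselves
matches = code.find_all(string=re.compile("some information", re.I))
print([m.parent.name for m in matches])  # both the <h1> and the <p> match
```

From each matched string, .parent gets you back to the enclosing tag.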

BeautifulSoup: getting text outside of tags: unexpected behaviour

If I have some HTML string:
s="<html><body><div><p>inner text</p></div><p>middle text</p>outside text</body></html>"
and try to get the text:
soup = BeautifulSoup(s, "html.parser")
ps = soup.findAll("p")
for i in ps:
    print(i.text)
It gives:
inner text
middle text
Then I have a web-page when the structure is similar:
<article>
<p>text1</p>
<br>
some outside text1
<p>....</p>
<br>
some outside text2
</article>
</body>
But when I use
soup2 = BeautifulSoup(urllib.request.urlopen("http://www.wired.com/2016/08/review-samsung-galaxy-note-7/"), "html.parser")
ab = soup2.find("article", {"itemprop": "articleBody"})
ps = ab.findAll("p")
It gives me the outside text2 too.
Also, there is some form of JavaScript ad (<div id="wired-tired">); after removing it I can also get the outside text1.
What's going on there? How come I get the second text when searching only for p, and why does the first text also become available after removing wired-tired?
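No answer was recorded for this question, but the first half of it can be demonstrated directly: in well-formed markup, text outside a tag becomes a sibling NavigableString of that tag, so p.text never includes it. If the live page's markup is malformed (an unclosed tag, or tags removed by an ad script), the parser may end up nesting the stray text inside a <p> instead, which is one plausible explanation for the behaviour described above. A sketch of the well-formed case:

```python
from bs4 import BeautifulSoup

s = "<html><body><div><p>inner text</p></div><p>middle text</p>outside text</body></html>"
soup = BeautifulSoup(s, "html.parser")

# "outside text" is a sibling of the second <p>, not part of it,
# so p.text stops at "middle text"
second_p = soup.find_all("p")[1]
print(repr(second_p.text))          # 'middle text'
print(repr(second_p.next_sibling))  # 'outside text'
```

Comparing the output of different parsers ("html.parser", "lxml", "html5lib") on the real page is a quick way to check whether malformed markup is the culprit, since each repairs broken HTML differently.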

Use Beautiful Soup findall to extract text between single quotations

I'm using Beautiful Soup and I want to extract the text within single quotation marks using the findAll method.
content = urllib.urlopen(address).read()
soup = BeautifulSoup(content, from_encoding='utf-8')
soup.prettify()
x = soup.findAll(do not know what to write)
An extract from soup as an example:
<td class="leftCell identityColumn snap" onclick="fundview('Schroder
European Special Situations');" title="Schroder European Special
Situations"> <a class="coreExpandArrow" href="javascript:
void(0);"></a> <span class="sigill"><a class="qtpop"
href="/vips/ska/all/sv/quicktake/redirect?perfid=0P0000XZZ3&flik=Chosen">
<img
src="/vips/Content/corestyles/4pSigillGubbe.gif"/></a></span>
<span class="bluetext" style="white-space: nowrap; overflow:
hidden;">Schroder European Spe..</span>
I would like the result from soup.findAll(do not know what to write) to be Schroder European Special Situations, and the findAll logic should be based on the fact that the target is the text between single quotation marks.
Locate the td element and get the onclick attribute value - the BeautifulSoup's job at this point would be completed. The next step would be to extract the desired text from the attribute value - let's use regular expressions for that. Implementation:
import re
onclick = soup.select_one("td.identityColumn[onclick]")["onclick"]
match = re.search(r"fundview\('(.*?)'\);", onclick)
if match:
    print(match.group(1))
Alternatively, it looks like the span with bluetext class has the desired text inside:
soup.select_one("td.identityColumn span.bluetext").get_text()
Also, make sure you are using version 4 of BeautifulSoup and that your import statement is:
from bs4 import BeautifulSoup
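Putting the pieces together, a self-contained run against the HTML from the question (trimmed to the relevant attributes) might look like this:

```python
import re
from bs4 import BeautifulSoup

html = """<td class="leftCell identityColumn snap"
  onclick="fundview('Schroder European Special Situations');"
  title="Schroder European Special Situations">
  <span class="bluetext">Schroder European Spe..</span>
</td>"""
soup = BeautifulSoup(html, "html.parser")

# pull the onclick attribute, then extract the single-quoted argument
onclick = soup.select_one("td.identityColumn[onclick]")["onclick"]
match = re.search(r"fundview\('(.*?)'\);", onclick)
print(match.group(1) if match else None)
```

The onclick/regex route is preferable to reading the bluetext span here, because the span holds a truncated display string ("Schroder European Spe..") while the attribute carries the full name.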
