I have this piece of code:
txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt)
for ft in soup.findAll('p'):
    print str(ft).upper()
When I run it, I get this:
<P>HI <SPAN>MARK</SPAN>, HOW ARE YOU?, DON'T FORGET MEETING ON <STRONG>SUNDAY</STRONG>, OK?</P>
But I want to get this:
<p>HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, ok?</p>
I just want to change the inner text of the p tag while keeping the formatting of the other tags inside it, and I want to keep the tag names in lowercase.
Thanks!
You can assign the modified text to the string attribute of the tag, p.string. Loop over the contents of the <p> tag and use the regular expression module to check whether each piece is a tag (i.e. contains the symbols < and >), skipping those. Something like:
from bs4 import BeautifulSoup
import re
txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt)
for p in soup.find_all('p'):
    p.string = ''.join(
        [str(t).upper()
         if not re.match(r'<[^>]+>', str(t))
         else str(t)
         for t in p.contents])
print soup.prettify(formatter=None)
I use the formatter option to avoid HTML-escaping of the special symbols. It yields:
<html>
 <body>
  <p>
   HI <span>Mark</span>, HOW ARE YOU?, DON'T FORGET MEETING ON <strong>sunday</strong>, OK?
  </p>
 </body>
</html>
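A more robust variant (a sketch, not part of the answer above) mutates only the text nodes sitting directly inside <p>, so the child tags never have to be re-serialized into a string at all:

```python
from bs4 import BeautifulSoup, NavigableString

txt = """<p>Hi <span>Mark</span>, how are you?, Don't forget meeting on <strong>sunday</strong>, ok?</p>"""
soup = BeautifulSoup(txt, 'html.parser')

for p in soup.find_all('p'):
    # list() takes a snapshot, since replace_with mutates the child list
    for child in list(p.children):
        if isinstance(child, NavigableString):
            child.replace_with(child.upper())

print(soup)
```

Because the nested tags are never flattened to text, <span> and <strong> keep their case and structure automatically.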
The webpage I'm scraping has paragraphs and headings structured this way:
<p>
<strong>
<a href="https://dummy.com" class="">This is a link heading</a>
</strong>
</p>
<p>
Content To Be Pulled
</p>
I wrote the following code to pull the link heading's content:
for anchor in soup.select('#pcl-full-content > p > strong > a'):
    signs.append(anchor.text)
The next part is confusing me, because the text I want to collect next is in the <p> tag after the <p> tag which contains the link. I cannot use .next_sibling here because the paragraph I want is outside of the parent <p> tag.
How do I choose the following paragraph given that the <p> before it contained a link?
One way seems to be to extract the data from a script tag, though you will need to split the text by horoscope:
import requests, re, json

r = requests.get('https://indianexpress.com/article/horoscope/weekly-horoscope-june-6-june-12-gemini-cancer-taurus-and-other-signs-check-astrological-prediction-7346080/',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
print(data['articleBody'])
You could get the horoscopes separately as follows. This dynamically determines which horoscopes are present, and in what order:
import requests, re, json

r = requests.get('https://indianexpress.com/article/horoscope/horoscope-today-april-6-2021-sagittarius-leo-aries-and-other-signs-check-astrological-prediction-7260276/',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"@context.*articleBody.*\})', r.text).group(1))
# print(data['articleBody'])

signs = ['ARIES', 'TAURUS', 'GEMINI', 'CANCER', 'LEO', 'VIRGO', 'LIBRA', 'SCORPIO', 'SAGITTARIUS', 'CAPRICORN', 'AQUARIUS', 'PISCES']
p = re.compile('|'.join(signs))
signs = p.findall(data['articleBody'])

for number, sign in enumerate(signs):
    if number < len(signs) - 1:
        print(re.search(f'({sign}.*?){signs[number + 1]}', data['articleBody']).group(1))
    else:
        print(re.search(f'({sign}.*)', data['articleBody']).group(1))
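Alternatively, staying within BeautifulSoup and addressing the sibling question directly: climb from the <a> to its enclosing <p> with find_parent, then take find_next_sibling('p'). A minimal sketch (the #pcl-full-content wrapper from the question is omitted here):

```python
from bs4 import BeautifulSoup

html = '''
<p><strong><a href="https://dummy.com">This is a link heading</a></strong></p>
<p>
Content To Be Pulled
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

for anchor in soup.select('p > strong > a'):
    # find_parent climbs out of <strong> to the enclosing <p>;
    # find_next_sibling then jumps to the paragraph that follows it.
    following_p = anchor.find_parent('p').find_next_sibling('p')
    print(anchor.get_text(strip=True), '->', following_p.get_text(strip=True))
```

find_next_sibling skips the whitespace text nodes between paragraphs, so it lands on the next <p> element directly.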
I'm very new to Python and BeautifulSoup and trying to up my game. Let's say this is my HTML:
<div class="container">
<h4>Title 1</h4>
Text I want is here
<br /> <!-- random break tags inserted throughout -->
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
<ul><!-- More HTML that I do not want --></ul>
</div> <!-- end container div -->
My expected output is the text between the two H4 tags:
Text I want is here
More text I want here
yet more text I want
But I don't know in advance what this text will say or how much of it there will be. There might be only one line, or there might be several paragraphs. It is not tagged with anything: no p tags, no id, nothing. The only thing I know about it is that it will appear between those two H4 tags.
At the moment, what I'm doing is working backward from the second H4 tag by using .previous_siblings to get everything up to the container div.
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = []
for line in text:
    text_list.append(line)
text_list.reverse()
full_text = ' '.join([str(line) for line in text_list])
text = full_text.strip().replace('<h4>Title 1</h4>', '').replace('<br/>', '')
This gives me the content I want, but it also gives me a lot more that I don't want, plus it gives it to me backwards, which is why I need to use reverse(). Then I end up having to strip out a lot of stuff using replace().
What I don't like about this is that since my end result is a list, I'm finding it hard to clean up the output. I can't use get_text() on a list. In my real-life version of this I have about ten instances of replace() and it's still not getting rid of everything.
Is there a more elegant way for me to get the desired output?
You can filter the previous siblings for NavigableStrings.
For example:
from bs4 import NavigableString
text = soup.find('div', class_ = 'container').find_next('h4', text = 'Title 2')
text = text.previous_siblings
text_list = [t for t in text if type(t) == NavigableString]
text_list will look like:
>>> text_list
[u'\nyet more text I want\n', u'\nMore text I want here\n', u'\n', u'\nText I want is here\n', u'\n']
You can also filter out \n's:
text_list = [t for t in text if type(t) == NavigableString and t != '\n']
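Put together as a runnable sketch against the sample markup from the question:

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<div class="container">
<h4>Title 1</h4>
Text I want is here
<br/><br/>
More text I want here
<br/>
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
</div>'''

soup = BeautifulSoup(html, 'html.parser')
marker = soup.find('div', class_='container').find('h4', string='Title 2')

# Keep only the non-blank text nodes, then restore document order.
wanted = [t.strip() for t in marker.previous_siblings
          if isinstance(t, NavigableString) and t.strip()]
wanted.reverse()
print(wanted)  # ['Text I want is here', 'More text I want here', 'yet more text I want']
```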
Other solution: use .find_next_siblings() with text=True (that finds only NavigableString nodes in the tree). Then on each iteration check whether the previous <h4> is the correct one:
from bs4 import BeautifulSoup
txt = '''<div class="container">
<h4>Title 1</h4>
Text I want is here
<br />
<br />
More text I want here
<br />
yet more text I want
<h4>Title 2</h4>
More text here, but I do not want it
<br />
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
out = []
first_h4 = soup.find('h4')
for t in first_h4.find_next_siblings(text=True):
    if t.find_previous('h4') != first_h4:
        break
    elif t.strip():
        out.append(t.strip())

print(out)
Prints:
['Text I want is here', 'More text I want here', 'yet more text I want']
There are lots of HTML pages which are structured as a sequence of such groups:
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
The addresses of these pages are like https://some.page.org/year/0001, https://some.page.org/year/0002, etc.
How can I extract the keywords separately from each of these pages? I've tried to use BeautifulSoup, but unsuccessfully. I've only written a program that prints the titles of the groups (the text between <b> and </b>):
from bs4 import BeautifulSoup
from urllib2 import urlopen
import re
html_doc = urlopen('https://some.page.org/2018/1234').read()
soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    print 'https://some.page.org' + link.get('href')

for node in soup.findAll('b'):
    print ''.join(node.findAll(text=True))
I can't test this without knowing the actual source code format, but it seems you want the <p> tags' text value:
for node in soup.findAll('p'):
    print(node.text)
    # or: keywords = node.text.split(', ')
    #     print(keywords)
You need to split your string, which in this case is a URL, on /, and then you can choose the chunks you want. For example, if the URL is https://some.page.org/year/0001, use the split function to split the URL on the / sign. That converts it to a list, from which you can pick the parts you need and, if necessary, join them back into a string with the join() method. You can read more about the split method in the Python documentation.
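For example (a minimal sketch of that idea):

```python
url = 'https://some.page.org/year/0001'

parts = url.split('/')
# parts == ['https:', '', 'some.page.org', 'year', '0001']

page_number = parts[-1]        # the last path segment: '0001'
base = '/'.join(parts[:3])     # rejoin the scheme and host: 'https://some.page.org'
print(base, page_number)
```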
There are different ways to HTML parse the desired categories and keywords from this kind of HTML structure, but here is one of the "BeautifulSoup" ways to do it:
find b elements whose text ends with a colon (:)
use .next_sibling to get to the next text node, which contains the keywords
Working example:
from bs4 import BeautifulSoup
data = """
<div>
<p>
<b> Category 1:</b>
"keyword_a, keyword_b"
</p>
<p>
<b> Category 2:</b>
"keyword_c, keyword_d"
</p>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for category in soup('b', text=lambda text: text and text.endswith(":")):
    keywords = category.next_sibling.strip('" \n').split(", ")
    print(category.get_text(strip=True), keywords)
Prints:
Category 1: ['keyword_a', 'keyword_b']
Category 2: ['keyword_c', 'keyword_d']
Assuming for each block
<p>
<b> Keywords/Category:</b>
"keyword_a, keyword_b"
</p>
you want to extract keyword_a and keyword_b for each Keywords/Category. So an example would be:
<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>
Once you have the HTML code, you can do:
from bs4 import BeautifulSoup
html = '''<p>
<b>Mammals</b>
"elephant, rhino"
</p>
<p>
<b>Birds</b>
"hummingbird, ostrich"
</p>'''
soup = BeautifulSoup(html, 'html.parser')
p_elements = soup.find_all('p')
for p_element in p_elements:
    b_element = p_element.find('b')
    b_element.extract()
    category = b_element.text.strip()
    keywords = p_element.text.strip()
    keyword_a, keyword_b = keywords[1:-1].split(', ')
    print('Category:', category)
    print('Keyword A:', keyword_a)
    print('Keyword B:', keyword_b)
Which prints:
Category: Mammals
Keyword A: elephant
Keyword B: rhino
Category: Birds
Keyword A: hummingbird
Keyword B: ostrich
I have an html file which looks like:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is, if all the tags in a 'p' block are 'strong', then combine them into one line, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
Without touching the other block since it contains something else.
Any suggestions? I am using lxml.
UPDATE:
So far I tried:
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p, 'strong')
With this code I was able to strip off the strong tags in the desired part, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs
html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""
soup = bs(html)
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')
s = '<p><strong>' + s + '</strong></p>'
print s  # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print soup
prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
I have managed to solve my own problem.
for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # some conditions specifically for my doc
        children = p.getchildren()
        if len(children) > 1:
            for child in children:
                # if other stuff is present, break
                if child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # If there was no break, we found a p block to fix:
                # get rid of the contents of p, and put a SubElement in
                etree.strip_tags(p, 'strong')
                tmp_text = p.text_content()
                p.clear()
                subtext = etree.SubElement(p, "strong")
                subtext.text = tmp_text
Special thanks to @Scott, who helped me come to this solution. Although I cannot mark his answer correct, I have no less appreciation for his guidance.
Alternatively, you can use more specific xpath to get the targeted p elements directly :
p_target = """
//p[strong]
[not(*[not(self::strong)])]
[not(text()[normalize-space()])]
"""
for p in self.tree.xpath(p_target):
    # the logic inside the loop can also be the same as in your `else` block
    content = p.xpath("normalize-space()")
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content
A brief explanation of the XPath expression used:
//p[strong] : find p elements, anywhere in the XML/HTML document, that have a strong child element...
[not(*[not(self::strong)])] : ...and have no child elements other than strong...
[not(text()[normalize-space()])] : ...and have no non-empty text node children.
normalize-space() : gets all text nodes from the current context element, concatenated, with consecutive whitespace normalized to a single space
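A self-contained sketch of this XPath approach (sample markup condensed from the question):

```python
from lxml import etree

body = etree.fromstring('''<body>
<p><strong>This is </strong><strong>a lin</strong><strong>e which I want to </strong><strong>join.</strong></p>
<p>2.<strong>But do not </strong><strong>touch this</strong><em>Maybe some other tags as well.</em></p>
</body>''')

# p elements whose children are all strong and which have no non-blank text of their own
p_target = """
//p[strong]
[not(*[not(self::strong)])]
[not(text()[normalize-space()])]
"""
for p in body.xpath(p_target):
    content = p.xpath("normalize-space()")  # merged text of all the strong children
    p.clear()
    strong = etree.SubElement(p, "strong")
    strong.text = content

print(etree.tostring(body).decode())
```

Only the first paragraph is rewritten; the second is excluded by the text() predicate because of its leading "2." text node.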
I have been attempting to use BeautifulSoup to retrieve any <li> element that contains any form of the word Ottawa. The problem is that Ottawa is never within a tag of its own, such as <p>. So I want to print only the li elements that contain Ottawa.
The HTML formatting is like this:
<html>
<body>
<blockquote>
<ul><li><b>name</b>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><b>name</b>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><b>name</b>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>
</blockquote>
</body>
</html>
My code is as follows:
from bs4 import BeautifulSoup
import re
import urllib2,sys
url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
re1='.*?'
re2='(Ottawa)'
ottawa = soup.findAll(text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL))
search = soup.findAll('li')
The above code finds Ottawa correctly, and when I use it to find the li elements it does find them, but it gives me every single one on the page.
I understand that they are currently not in conjunction as trying to do search = soup.findAll('li', text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL)) results in []
My end goal is basically to get every <li> element that contains any mention of Ottawa and give me the entire <li> element with the name, description, link, etc.
Use the text attribute to filter the results of the findAll:
elems = [elem for elem in soup.findAll('li') if 'Ottawa' in elem.text]
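A quick sketch against a trimmed version of the sample HTML from the question:

```python
from bs4 import BeautifulSoup

html = '''<ul>
<li><b>name</b> (National: Ottawa, ON) <blockquote>some description</blockquote></li>
<li><b>name</b> (National: Vancouver, BC) <blockquote>some description</blockquote></li>
<li><b>name</b> (Local: Ottawa, ON) <blockquote>some description</blockquote></li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

# Keep whole <li> elements whose combined text mentions Ottawa anywhere.
elems = [li for li in soup.find_all('li') if 'Ottawa' in li.get_text()]

for li in elems:
    print(li.get_text(' ', strip=True))
```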
from bs4 import BeautifulSoup
import re
import urllib2,sys
url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
for item in soup.find_all(text=re.compile(r'\(.+: Ottawa', re.IGNORECASE)):
    link = item.find_previous_sibling(lambda tag: tag.has_attr('href'))
    if link is None:
        continue
    print(u'{} [{}]: {}'.format(link.text,
                                item.strip(),
                                link['href']).encode('utf8'))