BeautifulSoup Tag Removal unexpected result

So I wrote some code to extract only what's within the <p> tags of some HTML code. Here is my code:
soup = BeautifulSoup(my_string, 'html')
no_tags=' '.join(el.string for el in soup.find_all('p', text=True))
It works how I want it to for most of the examples it is run on, but I have noticed that in examples such as
<p>hello, how are you <code>other code</code> my name is joe</p>
it returns nothing. I suppose this is because there are other tags within the <p> tags. So just to be clear, what I would want it to return is
hello, how are you my name is joe
That is, I want everything inside the <p> tags but only the first level in. I would like to ignore everything that is enclosed in other tags within the <p> tags.
Can someone help me out with how to deal with such examples?

You can remove the nested <code> element first and then extract the text that remains inside the <p> tag:
from bs4 import BeautifulSoup
my_string = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(my_string, 'html.parser')
soup.code.extract()  # removes the first <code> element from the tree
text = soup.p.get_text()
print(text)
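
For context, find_all('p', text=True) only matches a <p> whose .string is set, and .string is None as soon as the tag has more than one child, which is why the example paragraph is skipped. As an alternative to removing the nested tags, a minimal sketch that collects only the direct text nodes of each <p> could look like this. The helper name direct_text is hypothetical, and collapsing the whitespace is an assumption about the desired output:

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Join the text nodes that are direct children of `tag`."""
    # recursive=False restricts the search to the tag's own children,
    # so text inside nested tags such as <code> is ignored.
    pieces = tag.find_all(string=True, recursive=False)
    # Collapse the double spaces left where nested tags were skipped.
    return " ".join(" ".join(pieces).split())


my_string = "<p>hello, how are you <code>other code</code> my name is joe</p>"
soup = BeautifulSoup(my_string, "html.parser")
result = " ".join(direct_text(p) for p in soup.find_all("p"))
print(result)
```

Unlike extract(), this leaves the parsed tree untouched, so the nested tags are still available afterwards.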

Related

Get text from inside element without its children

I'm scraping a webpage with several p elements and I wanna get the text inside of them without including their children.
The page is structured like this:
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
When I use
parent.find_all("p", {"class": "default"}).get_text() this is the result I get:
I don't want this text
I want this text
I'm using BeautifulSoup 4 with Python 3
Edit: When I use
parent.find_all("p", {"class": "public item-cost"}, text=True, recursive=False)
It returns an empty list

You can use .find_next_sibling() with text=True parameter:
from bs4 import BeautifulSoup
html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select_one(".default > div").find_next_sibling(text=True))
Prints:
I want this text
Or using .contents:
print(soup.find("p", class_="default").contents[-1])
EDIT: To strip the string:
print(soup.find("p", class_="default").contents[-1].strip())

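You can use XPath, which is a bit more complex but allows much more powerful querying.
Note that BeautifulSoup objects have no xpath() method; XPath is available through lxml. With a tree parsed by lxml, something like this selects the non-empty text nodes that are direct children of the p:
tree.xpath('//p[contains(@class, "default")]/text()[normalize-space()]')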
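
To make the XPath idea concrete, here is a small sketch using lxml. The sample markup is an assumption: it uses a <span> rather than the <div> from the question, because an HTML parser may close the <p> before a <div>, which would change the tree. The variable names are made up for illustration:

```python
from lxml import html

# Assumed sample: <span> instead of <div>, see the note above.
html_doc = """
<p class="default">
<span>I don't want this text</span>
I want this text
</p>
"""

tree = html.document_fromstring(html_doc)

# /text() selects only text nodes that are direct children of the <p>;
# [normalize-space()] drops nodes that contain nothing but whitespace.
nodes = tree.xpath('//p[contains(@class, "default")]/text()[normalize-space()]')
wanted = [node.strip() for node in nodes]
print(wanted)
```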

Get content from certain tags with certain attributes using BS4

I need to get the content from the following tag with these attributes: <span class="h6 m-0">.
An example of the HTML I'll encounter would be <span class="h6 m-0">Hello world</span>, and it obviously needs to return Hello world.
My current code is as follows:
page = BeautifulSoup(text, 'html.parser')
names = [item["class"] for item in page.find_all('span')]
This works fine, and gets me all the spans in the page, but I don't know how to specify that I only want those with the specific class "h6 m-0" and grab the content inside. How will I go about doing this?

page = BeautifulSoup(text, 'html.parser')
names = page.find_all('span' , class_ = 'h6 m-0')
Without knowing your use case I don't know if this will work.

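names = [item.get_text() for item in page.find_all('span', class_="h6 m-0")]
This should work for you: item.get_text() returns the content of each span, whereas item["class"] would only return its list of classes.
If it does not, please be more specific about the problem you are facing.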
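
Note that class_='h6 m-0' compares against the exact attribute string, so it would miss a span whose classes appear in another order or alongside extra classes. If that matters, a CSS selector matches on the individual classes instead. A short sketch, with sample markup made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up sample: the second span has the same classes in another order.
text = (
    '<span class="h6 m-0">Hello world</span>'
    '<span class="m-0 h6 extra">Other order</span>'
    '<span class="h6">No match</span>'
)

page = BeautifulSoup(text, "html.parser")

# "span.h6.m-0" requires both classes, in any order, on a <span>.
names = [span.get_text(strip=True) for span in page.select("span.h6.m-0")]
print(names)
```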

Extract part of text with Beautifulsoup

How can I extract the text after the "br/" tag?
I only want that text and not whatever would be inside the "strong" tag.
<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>
Have tried code such as
text_content = paragraph.get_text(separator='strong/').strip()
But this will also include the text in the "strong" tag.
The "paragraph" variable is a bs4.element.Tag if that was not clear.
Any help appreciated!

If you have the <p> tag, then find the <br> within that and use .next_siblings
import bs4
html = '''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
paragraph = soup.find('p')
text_wanted = ''.join(paragraph.find('br').next_siblings)
print (text_wanted)
Output:
Text I want which also
includes linebreaks.
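
One caveat: ''.join(...) only works while every sibling after the <br> is a plain string; a nested tag such as <em> would raise a TypeError. A hedged variant that tolerates inline tags could look like this (the <em> in the sample is added for illustration):

```python
from bs4 import BeautifulSoup, NavigableString

# Sample from the question, with an <em> added for illustration.
html = '''<p><strong>A title</strong><br/>
Text I want which also
includes <em>inline</em> linebreaks.</p>'''

soup = BeautifulSoup(html, 'html.parser')
br = soup.find('p').find('br')

# Keep strings as they are and flatten any tag to its text content.
parts = [
    sibling if isinstance(sibling, NavigableString) else sibling.get_text()
    for sibling in br.next_siblings
]
text_wanted = ''.join(parts).strip()
print(text_wanted)
```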

Find <br> tag and use next_element
from bs4 import BeautifulSoup
data='''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup=BeautifulSoup(data,'html.parser')
item=soup.find('p').find('br').next_element
print(item)

BeautifulSoup - removing MS Word specific tags?

I have an HTML document that was saved from MS Word and now it has some tags related to MS Word. I don't need to keep any backwards compatibility with it, I just need to extract the contents from that file. The problem is that the Word-specific tags are not removed so easily.
I have this code:
from bs4 import BeautifulSoup, NavigableString
def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            tag.replaceWith(s)
    return soup
It removes the unneeded tags, but some are left even after using this method.
For example look at this:
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">Some text -
some content<o:p></o:p></SPAN></P>
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">some text2 -
647894654<o:p></o:p></SPAN></P>
<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">some text3 -
some content blabla<o:p></o:p></SPAN></P>
This is how it looks inside the HTML document. When I use the method like this:
invalid_tags = ['span']
stripped = strip_tags(html_file, invalid_tags)
print stripped
It prints like this:
<p class="MsoNormal">Some text -
some content<html><body><o:p></o:p></body></html></p>
<p class="MsoNormal">some text2 -
647894654<html><body><o:p></o:p></body></html></p>
<p class="MsoNormal">some text3 -
some content blabla<html><body><o:p></o:p></body></html></p>
As you can see for some reason html and body tags appeared there even though in html it does not exist. If I add invalid_tags = ['span', 'o:p'], it removes <o:p></o:p> tags, but if I add to remove html or body tags, it does not do anything and it is still kept there.
P.S. I can remove html tags there if I directly change where to look for finding tags. For example by adding this line in a method (before findAll is used) soup = soup.body. But still after this, body tags are kept hanging in those specific paragraphs.

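You can try this:
from bs4 import BeautifulSoup
def strip_tags(html, invalid_tags):
    # html.parser does not wrap fragments in <html>/<body> the way lxml does
    soup = BeautifulSoup(html, 'html.parser')
    for t in invalid_tags:
        # unwrap() removes the tag itself but keeps its children in place
        for item in soup.find_all(t):
            item.unwrap()
    return str(soup)
If you parse with lxml instead, you will still need to strip the <html> and <body> tags it adds.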
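
Since Word output can contain several namespaced tags besides <o:p>, listing each one may get tedious. As a sketch only, the tags could instead be matched by a predicate that looks for a colon in the tag name. The helper name unwrap_word_tags and its extra_tags parameter are hypothetical, and the sample is one paragraph from the question:

```python
from bs4 import BeautifulSoup


def unwrap_word_tags(html, extra_tags=("span",)):
    """Unwrap prefixed tags such as <o:p> plus any tag named in extra_tags."""
    soup = BeautifulSoup(html, "html.parser")

    def is_unwanted(tag):
        # Assumption: a colon in the name marks a Word-specific tag.
        return ":" in tag.name or tag.name in extra_tags

    # find_all returns a list up front, so unwrapping while looping is safe.
    for tag in soup.find_all(is_unwanted):
        tag.unwrap()
    return str(soup)


word_html = (
    '<P class="MsoNormal"><SPAN style="mso-bidi-font-weight: bold;">'
    'Some text - some content<o:p></o:p></SPAN></P>'
)
cleaned = unwrap_word_tags(word_html)
print(cleaned)
```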

Search and Replace in HTML with BeautifulSoup

I want to use BeautifulSoup to search and replace </a> with </a><br>. I know how to open with urllib2 and then parse to extract all the <a> tags. What I want to do is search and replace the closing tag with the closing tag plus the break. Any help is much appreciated.
EDIT
I would assume it would be something similar to:
soup.findAll('a').
In the documentation, there is a:
find(text="ahh").replaceWith('Hooray')
So I would assume it would be along the lines of:
soup.findAll(tag = '</a>').replaceWith(tag = '</a><br>')
But that doesn't work, and the Python help() doesn't give much detail.

This will insert a <br> tag after the end of each <a>...</a> element:
from BeautifulSoup import BeautifulSoup, Tag
# ....
soup = BeautifulSoup(data)
for a in soup.findAll('a'):
    a.parent.insert(a.parent.index(a)+1, Tag(soup, 'br'))
You can't use soup.findAll(tag = '</a>') because BeautifulSoup doesn't operate on the end tags separately - they are considered part of the same element.
If you wanted to put the <a> elements inside a <p> element as you ask in a comment, you can use this:
for a in soup.findAll('a'):
    p = Tag(soup, 'p')  # create a P element
    a.replaceWith(p)    # put it where the A element is
    p.insert(0, a)      # put the A element inside the P (between <p> and </p>)
Again, you don't create the <p> and </p> separately because they are part of the same thing.
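
The snippets above use the old BeautifulSoup 3 import and its Tag(soup, 'br') constructor. With bs4, the same two operations could be sketched with new_tag(), insert_after() and wrap(). The sample markup and variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

data = '<body>blah <a href="#">blah</a> blah</body>'

# 1) Insert a <br> right after every <a>...</a> element.
soup = BeautifulSoup(data, "html.parser")
for a in soup.find_all("a"):
    a.insert_after(soup.new_tag("br"))
with_breaks = str(soup)

# 2) Wrap every <a> element in its own <p> element.
soup = BeautifulSoup(data, "html.parser")
for a in soup.find_all("a"):
    a.wrap(soup.new_tag("p"))
wrapped = str(soup)

print(with_breaks)
print(wrapped)
```

A fresh tag is created for each <a> because a tag object can only live in one place in the tree.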

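Suppose you have an element which you know contains <br> tags. One way to replace each <br/> tag with a different string is to serialise the soup, do a plain string replacement, and parse the result again:
from bs4 import BeautifulSoup
# pass an open file; a bare filename string would be parsed as markup
with open("your_html_file.html") as f:
    originalSoup = BeautifulSoup(f, "html.parser")
replaceString = ", "  # replace each <br/> tag with ", "
# Ex. <p>Hello<br/>World</p> to <p>Hello, World</p>
cleanSoup = BeautifulSoup(str(originalSoup).replace("<br/>", replaceString), "html.parser")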

You don't replace an end-tag; in BeautifulSoup you are dealing with a document object model like in a browser, not a string full of HTML. So you couldn't ‘replace’ an end-tag without also replacing the start-tag.
What you want to do is insert a new <br> element immediately after the <a>...</a> element. To do so you'll need to find out the index of the <a> element inside its parent element, and insert the new element just after that index, e.g.:
soup = BeautifulSoup('<body>blah <a>blah</a> blah</body>')
for link in soup.findAll('a'):
    br = Tag(soup, 'br')
    index = link.parent.contents.index(link)
    link.parent.insert(index + 1, br)
# soup now serialises to '<body>blah <a>blah</a><br /> blah</body>'
