I scraped a website and I want to find an element based on the text it contains. Let's say the following is sample code from the website:
code = bs4.BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""")
I want some way to get the p element whose text is Some Information. How can I select an element like this?
Just use the text parameter:
code.find_all("p", text="Some Information")
If you need only the first element, then use find instead of find_all.
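As a rough illustration of the answer above, the sketch below shows an exact match next to two regex-based matches. It assumes Beautiful Soup 4.4.0 or later, where the text argument is also available under the name string; the variable names are only for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# Exact, case-sensitive match against the string inside each <p>.
exact = soup.find_all("p", string="Some Information")

# Substring match: a compiled regex is applied with search().
partial = soup.find_all("p", string=re.compile("Spam"))

# Without a tag name, the matching text nodes themselves are returned.
any_case = soup.find_all(string=re.compile("some information", re.I))

print(exact)
print(partial)
print(any_case)
```

Note that the exact match is case-sensitive, so the h1 ("Some information") is not returned by the first call even if the tag name is dropped.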
You could use text to search all tags matching the string
from bs4 import BeautifulSoup
code = BeautifulSoup("""<div>
<h1>Some information</h1>
<p>Spam</p>
<p>Some Information</p>
<p>More Spam</p>
</div>""", "html.parser")
# Calling the soup object directly is shorthand for code.find_all(...)
for elem in code(text='Some Information'):
    print(elem.parent)
Related
I'm scraping a webpage with several p elements and I wanna get the text inside of them without including their children.
The page is structured like this:
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
When I use
parent.find_all("p", {"class": "default"}).get_text() this is the result I get:
I don't want this text
I want this text
I'm using BeautifulSoup 4 with Python 3
Edit: When I use
parent.find_all("p", {"class": "public item-cost"}, text=True, recursive=False)
It returns an empty list
You can use .find_next_sibling() with text=True parameter:
from bs4 import BeautifulSoup
html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.select_one(".default > div").find_next_sibling(text=True))
Prints:
I want this text
Or using .contents:
print(soup.find("p", class_="default").contents[-1])
EDIT: To strip the string:
print(soup.find("p", class_="default").contents[-1].strip())
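The .contents[-1] approach relies on the wanted text being the last child. A slightly more general sketch, under the assumption that every string sitting directly inside the p is wanted, is to keep only the children that are NavigableString objects:

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<p class="default">
<div>I don't want this text</div>
I want this text
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.find("p", class_="default")

# Keep only strings that are direct children of <p>;
# text nested inside child tags such as <div> is skipped.
own_text = "".join(
    child for child in p.children if isinstance(child, NavigableString)
).strip()
print(own_text)
```

This depends on html.parser keeping the div nested inside the p, as the answer above does; other parsers may restructure such markup.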
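You can use XPath, which is a bit more complex but provides much more powerful querying. BeautifulSoup itself has no xpath() method, so the document has to be parsed with a library that supports XPath, such as lxml.
Something like this should work for you, where tree is the lxml-parsed document and /text() selects only the text nodes directly inside the p:
tree.xpath('//p[contains(@class, "default")]/text()[normalize-space()]')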
How can I extract the text after the "br/" tag?
I only want that text and not whatever is inside the "strong" tag.
<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>
Have tried code such as
text_content = paragraph.get_text(separator='strong/').strip()
But this will also include the text in the "strong" tag.
The "paragraph" variable is a bs4.element.Tag if that was not clear.
Any help appreciated!
If you have the <p> tag, then find the <br> within it and use .next_siblings:
import bs4
html = '''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
paragraph = soup.find('p')
text_wanted = ''.join(paragraph.find('br').next_siblings)
print (text_wanted)
Output:
Text I want which also
includes linebreaks.
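The join over .next_siblings works here because everything after the br is plain text. If the trailing content could also contain tags, a sketch like the following may be safer; the em element in the sample is a hypothetical addition, not part of the original question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical variant of the sample with an inline tag after the <br/>.
html = """<p><strong>A title</strong><br/>
Text I want <em>which</em> also
includes linebreaks.</p>"""
soup = BeautifulSoup(html, "html.parser")
br = soup.find("p").find("br")

parts = []
for sibling in br.next_siblings:
    # Tags contribute their text; plain strings are used as they are.
    parts.append(sibling.get_text() if isinstance(sibling, Tag) else str(sibling))

text_wanted = "".join(parts).strip()
print(text_wanted)
```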
Find <br> tag and use next_element
from bs4 import BeautifulSoup
data='''<p><strong>A title</strong><br/>
Text I want which also
includes linebreaks.</p>'''
soup=BeautifulSoup(data,'html.parser')
item=soup.find('p').find('br').next_element
print(item)
I am looking for a solution using Python and BeautifulSoup to find an element based on the inside text. For example:
<div> <b>Ignore this text</b>Find based on this text </div>
How can I find this div? Thanks for your help!
You can use .find with the text argument and then use findParent to get the parent element.
Ex:
from bs4 import BeautifulSoup
s="""<div> <b>Ignore this text</b>Find based on this text </div>"""
soup = BeautifulSoup(s, 'html.parser')
t = soup.find(text="Find based on this text ")
print(t.findParent())
Output:
<div> <b>Ignore this text</b>Find based on this text </div>
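The exact match above only succeeds if the search string includes the trailing space. As a hedged alternative, a compiled regex can match on a substring instead (this assumes Beautiful Soup 4.4.0 or later for the string argument name):

```python
import re
from bs4 import BeautifulSoup

s = """<div> <b>Ignore this text</b>Find based on this text </div>"""
soup = BeautifulSoup(s, "html.parser")

# The regex is applied with search(), so surrounding whitespace does not matter.
text_node = soup.find(string=re.compile("Find based on this text"))

# .parent is the tag that directly contains the matched text node.
div = text_node.parent
print(div.name)
```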
Try this. It is similar to the example above, but it removes the <b> element first and then reads the remaining text:
from bs4 import BeautifulSoup
html="""
<div> <b>Ignore this text</b>Find based on this text </div>
"""
soup = BeautifulSoup(html, 'lxml')
s = soup.find('div')
for child in s.find_all('b'):
    child.decompose()
print(s.get_text())
Output
Find based on this text
I would like to know how to extract all of the elements under a specific tag.
For example:
<div class="text">
<h2>...</h2>
<p>...</p>
<p>...</p>
<h2>...</h2>
</div>
I would like to get these elements in a list
list = ['<h2>...</h2>',
'<p>...</p>',
'<p>...</p>',
'<h2>...</h2>']
I need this because I want to know under which category (header) the text is written, and then extract the text.
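from bs4 import BeautifulSoup
# html is assumed to hold the markup shown in the question
soup = BeautifulSoup(html, 'html.parser')
l = soup.find('div', {'class':'text'}).findChildren()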
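Since the stated goal is to know which header each piece of text falls under, one possible follow-up is to walk the direct children and group paragraphs by the most recent h2. This is only a sketch: the heading and paragraph texts are placeholders (the question elides them), and sections is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Placeholder content standing in for the "..." in the question.
html = """<div class="text">
<h2>First</h2>
<p>a</p>
<p>b</p>
<h2>Second</h2>
</div>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="text")

sections = {}
current = None
# recursive=False restricts the search to direct child tags of the div.
for child in div.find_all(recursive=False):
    if child.name == "h2":
        current = child.get_text(strip=True)
        sections[current] = []
    elif current is not None:
        sections[current].append(child.get_text(strip=True))

print(sections)
```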
I want to use BeautifulSoup to search and replace </a> with </a><br>. I know how to open the page with urllib2 and then parse it to extract all the <a> tags. What I want to do is search and replace the closing tag with the closing tag plus the break. Any help is much appreciated.
EDIT
I would assume it would be something similar to:
soup.findAll('a').
In the documentation, there is a:
find(text="ahh").replaceWith('Hooray')
So I would assume it would be along the lines of:
soup.findAll(tag = '</a>').replaceWith(tag = '</a><br>')
But that doesn't work, and the Python help() doesn't give much.
This will insert a <br> tag after the end of each <a>...</a> element:
from BeautifulSoup import BeautifulSoup, Tag
# ....
soup = BeautifulSoup(data)
for a in soup.findAll('a'):
    a.parent.insert(a.parent.index(a)+1, Tag(soup, 'br'))
You can't use soup.findAll(tag = '</a>') because BeautifulSoup doesn't operate on the end tags separately - they are considered part of the same element.
If you wanted to put the <a> elements inside a <p> element as you ask in a comment, you can use this:
for a in soup.findAll('a'):
    p = Tag(soup, 'p')  # create a P element
    a.replaceWith(p)    # put it where the A element is
    p.insert(0, a)      # put the A element inside the P (between <p> and </p>)
Again, you don't create the <p> and </p> separately because they are part of the same thing.
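The snippets above use the BeautifulSoup 3 API. For readers on Beautiful Soup 4, a roughly equivalent sketch uses new_tag, insert_after and wrap. The sample markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not from the original question.
data = '<div><a href="#one">one</a><a href="#two">two</a></div>'

# Insert a <br> element right after every <a>...</a> element.
soup = BeautifulSoup(data, "html.parser")
for a in soup.find_all("a"):
    a.insert_after(soup.new_tag("br"))
print(soup)

# Wrap every <a> element in its own <p> element.
wrapped = BeautifulSoup(data, "html.parser")
for a in wrapped.find_all("a"):
    a.wrap(wrapped.new_tag("p"))
print(wrapped)
```

A fresh tag is created on each iteration because a tag object can only live in one place in the tree.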
Suppose you have an element which you know contains <br> tags. One way to replace the <br> tags with a different string is like this:
originalSoup = BeautifulSoup(open("your_html_file.html"))  # pass the file, not its name, or the name itself is parsed as markup
replaceString = ", " # replace each <br/> tag with ", "
# Ex. <p>Hello<br/>World</p> to <p>Hello, World</p>
cleanSoup = BeautifulSoup(str(originalSoup).replace("<br/>", replaceString))
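The string replacement above depends on how the tag happens to be serialised ("<br/>" versus "<br>" or "<br />"). A sketch that avoids this, assuming Beautiful Soup 4, replaces the br elements in the parsed tree instead:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello<br/>World</p>", "html.parser")
replace_string = ", "  # text that takes the place of each <br>

# replace_with swaps each <br> element for a plain string in the tree.
for br in soup.find_all("br"):
    br.replace_with(replace_string)

print(soup)
```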
You don't replace an end-tag; in BeautifulSoup you are dealing with a document object model like in a browser, not a string full of HTML. So you couldn't ‘replace’ an end-tag without also replacing the start-tag.
What you want to do is insert a new <br> element immediately after the <a>...</a> element. To do so, you'll need to find the index of the <a> element inside its parent element and insert the new element just after that index, e.g.:
soup = BeautifulSoup('<body>blah <a>blah</a> blah</body>')
for link in soup.findAll('a'):
    br = Tag(soup, 'br')
    index = link.parent.contents.index(link)
    link.parent.insert(index + 1, br)
# soup now serialises to '<body>blah <a>blah</a><br /> blah</body>'