Python: String filter - python

I need help with Python. I need to extract some text from this HTML:
<li class="hide" style="display: list-item;">
<div class="name">Name</div>
<span class="value">TEST TEST</span>
</li>
I want to get these words: Name, TEST TEST.

Try using the find method in bs4. For example:
from bs4 import BeautifulSoup
html = """<li class="hide" style="display: list-item;">
<div class="name">Name</div>
<span class="value">TEST TEST</span>
</li>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("li", class_="hide").text.strip())
Output:
Name
TEST TEST
After you find the required element use .text to extract the string.
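If the two strings are needed separately rather than as one block of text, a minimal sketch along the same lines could look like this. The variable names item, name and value are only illustrative, and it assumes both child tags are present:

```python
from bs4 import BeautifulSoup

html = """<li class="hide" style="display: list-item;">
<div class="name">Name</div>
<span class="value">TEST TEST</span>
</li>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the list item first, then search inside it for each child tag.
item = soup.find("li", class_="hide")
name = item.find("div", class_="name").get_text(strip=True)
value = item.find("span", class_="value").get_text(strip=True)

print(name, value, sep=", ")
```

On a real page, find can return None when a tag is missing, so it is worth checking for that before calling get_text.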

Related

python beautifulsoup: how to find all before certain stop tag?

I need to find all tags of a certain kind (class "nice") but excluding those after a certain other tag (class "stop").
<div class="nice"></div>
<div class="nice"></div>
<div class="stop">here should be the end of found items</div>
<div class="nice"></div>
<div class="nice"></div>
How do I accomplish this using bs4?
I found this as a similar question but it appears a bit fuzzy.
You can use, for example, .find_previous to filter out unwanted tags:
from bs4 import BeautifulSoup
html_doc = """\
<div class="nice">want 1</div>
<div class="nice">want 2</div>
<div class="stop">here should be the end of found items</div>
<div class="nice">do not want 1</div>
<div class="nice">do not want 2</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
for div in soup.find_all("div", class_="nice"):
    # Stop as soon as a "stop" div appears before the current tag.
    if div.find_previous("div", class_="stop"):
        break
    print(div)
Prints:
<div class="nice">want 1</div>
<div class="nice">want 2</div>
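Another possible approach, sketched here under the assumption that the "nice" divs are siblings of the "stop" div (as in the sample HTML), is to start from the stop tag and walk backwards. The names stop and wanted are illustrative only:

```python
from bs4 import BeautifulSoup

html_doc = """\
<div class="nice">want 1</div>
<div class="nice">want 2</div>
<div class="stop">here should be the end of found items</div>
<div class="nice">do not want 1</div>
<div class="nice">do not want 2</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

stop = soup.find("div", class_="stop")
# find_previous_siblings yields the nearest sibling first,
# so reverse the result to restore document order.
wanted = stop.find_previous_siblings("div", class_="nice")[::-1]

for div in wanted:
    print(div)
```

If the tags are nested at different levels rather than being siblings, this sketch would miss them, and the find_previous loop above is the safer choice.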

Ignore one div class in BeautifulSoup find_all in Python 3

I want to ignore one class when using find_all. I followed the solution in "Select all divs except ones with certain classes in BeautifulSoup".
My divs are a bit different: I want to ignore description-0.
<div class="abc">...</div>
<div class="parent">
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
</div>
<div class="xyz">...</div>
Here is my code:
classToIgnore = ["description-0"]
all = soup.find_all('div', class_=lambda x: x not in classToIgnore)
It matches every div on the page instead of just the "description-n" ones. How can I fix it?
Use regex, like this, for example:
import re
from bs4 import BeautifulSoup
sample_html = """<div class="abc">...</div>
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
<div class="xyz">...</div>"""
classes_regex = (
    BeautifulSoup(sample_html, "lxml")
    .find_all("div", {"class": re.compile(r"description-[1-9]")})
)
print(classes_regex)
Output:
[<div class="description-1"></div>, <div class="description-2"></div>]
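As far as I can tell, the lambda in the question returns True for every class that is not description-0 (and also for a missing class, where it receives None), which is why unrelated divs match. A sketch that keeps the ignore list but also requires the "description-" prefix might look like this. The helper name wanted and the set class_to_ignore are illustrative:

```python
from bs4 import BeautifulSoup

html = """<div class="abc">...</div>
<div class="parent">
<div class="description-0"></div>
<div class="description-1"></div>
<div class="description-2"></div>
</div>
<div class="xyz">...</div>"""

soup = BeautifulSoup(html, "html.parser")
class_to_ignore = {"description-0"}


def wanted(css_class):
    # css_class is None for tags that have no class attribute at all.
    return (
        css_class is not None
        and css_class.startswith("description-")
        and css_class not in class_to_ignore
    )


divs = soup.find_all("div", class_=wanted)
print(divs)
```

Unlike the regex with [1-9], this keeps the list of ignored classes explicit, so more names can be added to class_to_ignore later.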

BeautifulSoup - Getting all the child from tag instead of the first

I am creating a script that collects data from a website. However, I am having trouble collecting only specific information. The HTML that is causing me problems is the following:
<div class="Content">
<article>
<blockquote class="messageText 1234">
I WANT THIS
<br/>
I WANT THIS 2
<br/>
</a>
<br/>
</blockquote>
</article>
</div>
<div class="Content">
<article>
<blockquote class="messageText 1234">
<a class="IDENTIFIER" href="WEBSITE">
</a>
NO WANT THIS
<br/>
<br/>
NO WANT THIS
<br/>
<br/>
NO WANT THIS
<div class="messageTextEndMarker">
</div>
</blockquote>
</article>
</div>
And I am trying to create a process that prints only the part "I WANT THIS". I have the following script:
import requests
from bs4 import BeautifulSoup
url = ''
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.find_all('div', class_='panels'):
    for b in a.find_all('form', class_='section'):
        for c in b.find_all('div', class_='message'):
            for d in c.find_all('div', class_='primaryContent'):
                for d in d.find_all('div', class_='messageContent'):
                    for e in d.content.find_all('blockquote', class_='messageText 1234')[0]:
                        print(e.string)
My idea with the code was to extract only the part from the first blockquote element, however, I am getting all the text from the blockquotes:
I WANT THIS
NO WANT THIS
NO WANT THIS
NO WANT THIS
How can I achieve this?
Why not use select_one to isolate the first block, then stripped_strings to separate out the text strings?
from bs4 import BeautifulSoup as bs
html = ''' your html'''
soup = bs(html, 'lxml')
print([s for s in soup.select_one('.Content .messageText').stripped_strings])
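A runnable version of the suggestion above might look like the sketch below. It assumes a trimmed copy of the question's HTML (the stray closing a tag is left out) and uses html.parser instead of lxml so that no extra parser is needed. The names first_block and strings are illustrative:

```python
from bs4 import BeautifulSoup

html = """<div class="Content">
<article>
<blockquote class="messageText 1234">
I WANT THIS
<br/>
I WANT THIS 2
<br/>
</blockquote>
</article>
</div>
<div class="Content">
<article>
<blockquote class="messageText 1234">
NO WANT THIS
<br/>
NO WANT THIS
</blockquote>
</article>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns only the first element matching the CSS selector.
first_block = soup.select_one(".Content .messageText")
# stripped_strings yields each text fragment with whitespace trimmed.
strings = list(first_block.stripped_strings)
print(strings)
```

Because select_one stops at the first match, the second blockquote is never visited, so the nested loops from the question are not needed.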

CSS: Div and Argument Selectors

I have the following structure,
<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>
How do I access the data-ret attribute of the div with id "done"? I need it for some web scraping.
I tried a couple of approaches but couldn't get any of them to work.
Thanks
Using the Beautiful Soup library:
from bs4 import BeautifulSoup
html = '''<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>'''
soup = BeautifulSoup(html,"lxml")
data_ret = soup.find("div",{'id':'done'})
print(data_ret['data-ret'])
Output:
512,500
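Since the question asks about CSS selectors, the same lookup can also be sketched with select_one. This assumes html.parser rather than lxml, and the names done and data_ret are illustrative:

```python
from bs4 import BeautifulSoup

html = '''<div class="main">
<div id="son" class="well"></div>
<div id="done"
data-ret="512,500"></div>
</div>'''

soup = BeautifulSoup(html, "html.parser")

# Match the div with id "done" that also carries a data-ret attribute.
done = soup.select_one("div#done[data-ret]")
# .get avoids a KeyError if the attribute is absent; guard against no match.
data_ret = done.get("data-ret") if done is not None else None
print(data_ret)
```

The value comes back as a plain string, so splitting it on the comma is a separate step if the two numbers are needed individually.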

Python RegEx with Beautifulsoup 4 not working

I want to find all div tags which have a certain pattern in their class name but my code is not working as desired.
This is the code snippet
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':re.compile(r'common text .*')})
where html_doc is the string with the following html
<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>
But all_findings is coming out as an empty list while it should have found one item.
It's working in the case of exact match
all_findings = soup.findAll('div',attrs={'class':re.compile(r'hide-c')})
I am using bs4.
Instead of using a regular expression, put the classes you are looking for in a list:
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
Example code:
from bs4 import BeautifulSoup
html_doc = """<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
print(all_findings)
This outputs:
[<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>]
To extend @Andy's answer, you can make a list of class names and compiled regular expressions:
soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})
Note that, in this case, you'll get the div elements with one of the specified classes/patterns - in other words, it's common or text or sighting_ followed by five digits.
If you want to have them joined with "and", one option would be to turn off the special treatment for "class" attributes by having the document parsed as "xml" (this parser requires lxml to be installed):
soup = BeautifulSoup(html_doc, 'xml')
all_findings = soup.find_all('div', class_=re.compile(r'common text sighting_\d{5}'))
print(all_findings)
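Another way to get "and" semantics without switching parsers, offered here only as a sketch, is a CSS selector through select. The extra div with a single class is added to the sample purely to show that it is not matched:

```python
from bs4 import BeautifulSoup

html_doc = """<div class="common text sighting_4619012">
<div class="hide-c"></div>
<div class="show-c"></div>
</div>
<div class="common">only one of the classes</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# .common.text requires both classes on the same div;
# [class*="sighting_"] additionally requires that substring in the class attribute.
all_findings = soup.select('div.common.text[class*="sighting_"]')

print(len(all_findings))
print(all_findings[0]["class"])
```

The attribute selector only checks for a substring, so if the exact digit count matters, the regular expression from the answer above is the stricter test.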
