Python BeautifulSoup find_all with regex doesn't match text - python

I have the following HTML code:
<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>
I would like to get the anchor tag that has Shop as text disregarding the spacing before and after. I have tried the following code, but I keep getting an empty array:
import re
from bs4 import BeautifulSoup
html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
prog = re.compile(r'\s*Shop\s*')
print(soup.find_all("a", string=prog))
# Output: []
I also tried retrieving the text using get_text():
text = soup.find_all("a")[0].get_text()
print(repr(text))
# Output: '\n\n\t\t\t\t\t\t\t\tShop \n'
and ran the following code to make sure my regex was right, which seems to be the case:
result = prog.match(text)
print(repr(result.group()))
# Output: '\n\n\t\t\t\t\t\t\t\tShop \n'
I also tried selecting span instead of a but I get the same issue. I'm guessing it's something with find_all, I have read the BeautifulSoup documentation but I still can't find the issue. Any help would be appreciated. Thanks!

The problem here is that the text you are looking for sits inside a tag that has child tags, and when a tag has children, its .string property is None, so the string= filter never matches.
You can pass a lambda to .find instead, and since you are looking for a fixed string, a plain 'Shop' in t.text check is enough; no regex needed:
soup.find(lambda t: t.name == "a" and 'Shop' in t.text)
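A runnable sketch of this approach, using the HTML from the question, which also shows why the string= filter came back empty:

```python
from bs4 import BeautifulSoup

html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, "html.parser")

# .string is None here because <a> has several children (whitespace text,
# the <span>, ...), which is exactly why string=regex never matched:
print(soup.a.string)  # None

# The lambda matches on .text, which concatenates all descendant strings:
link = soup.find(lambda t: t.name == "a" and "Shop" in t.text)
print(link["href"])
```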

The text Shop you are searching for is inside the span tag, which is why matching the a tag against the regex fetches nothing.
You can instead use the regex to find the text node itself, then walk up to its parents:
import re
from bs4 import BeautifulSoup
html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(text=re.compile('Shop')).parent.parent)
If you have BeautifulSoup 4.7.1 or above, you can use the following CSS selector:
from bs4 import BeautifulSoup
html = """<a class="nav-link" href="https://cbd420.ch/fr/tous-les-produits/">
<span class="cbp-tab-title">
Shop <i class="fa fa-angle-down cbp-submenu-aindicator"></i></span>
</a>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('a:contains("Shop")'))


Python BeautifulSoup - ignore child tags and IDs

I have a file with the structure below.
I want to find all the parent tags, i.e. all the tags whose IDs contain numbers only, together with the text inside them. However, at the moment I get a flat structure of all the a tags, parents and children alike.
<A ID=101>
<a id=”A1”>Today is a nice day.
<a id=”A2”>Today is a very nice day.
<a id=”A3”>Today is a very very nice day.
</A>
<A ID=102>
<a id=”A1”>Today is a nice day2.
<a id=”A2”>Today is a very nice day2.
<a id=”A3”>Today is a very very nice day2.
</A>
I want only this, ignoring all child tags and IDs. How can I extract it like this?
<A ID=101>
Today is a nice day.
Today is a very nice day.
Today is a very very nice day.
</A>
<A ID=102>
Today is a nice day2.
Today is a very nice day2.
Today is a very very nice day2.
</A>
The code below could do what you are asking for, provided the child and parent tags have different names and are not just uppercase and lowercase versions of each other:
from bs4 import BeautifulSoup

html = """
<B ID=101>
<a id=”A1”>Today is a nice day.
<a id=”A2”>Today is a very nice day.
<a id=”A3”>Today is a very very nice day.
</B>
<B ID=102>
<a id=”A1”>Today is a nice day2.
<a id=”A2”>Today is a very nice day2.
<a id=”A3”>Today is a very very nice day2.
</B>
"""
invalid_tags = ['a', "html", "body"]
soup = BeautifulSoup(html, "lxml")
for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # replaceWithChildren() in older versions
print(soup)
This is because BeautifulSoup deals with HTML data by default. HTML is case-insensitive, so on parsing all tags are lowercased.
If you need to match tags case-sensitively, you need to parse the document as XML: install lxml (via pip) and tell BeautifulSoup to use that parser in XML mode, e.g.
soup = BeautifulSoup(source, 'xml')
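A minimal illustration of the lowercasing behaviour (the XML mode itself needs lxml installed, so it is only mentioned in a comment here):

```python
from bs4 import BeautifulSoup

# html.parser lowercases every tag name, so <A> and <a> collapse together:
soup = BeautifulSoup('<A ID="101"><a id="A1">hi</a></A>', "html.parser")
print([t.name for t in soup.find_all("a")])  # ['a', 'a'] - both tags, lowercased
print(soup.find_all("A"))                    # [] - uppercase never matches

# With the lxml-backed XML parser, BeautifulSoup(source, "xml"), tag names
# keep their case and find_all("A") would match only <A>.
```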
Your HTML is not correct.
See below for fixed HTML plus example code:
from bs4 import BeautifulSoup
html = '''
<a id="101">
<a id="A1">Today is a nice day.</a>
<a id="A2">Today is a very nice day.</a>
<a id="A3">Today is a very very nice day.</a>
</a>
<a id="102">
<a id="A1">Today is a nice day2.</a>
<a id="A2">Today is a very nice day2.</a>
<a id="A3">Today is a very very nice day2.</a>
</a>
'''
bs = BeautifulSoup(html, "html.parser")
tags = bs.find_all("a", recursive=False)
for tag in tags:
    print('<' + tag.name + ' id="' + tag["id"] + '">')
    print(tag.text)
    print('</' + tag.name + '>')

Python: String filter

I need help in Python. I need to extract the following from this HTML:
<li class="hide" style="display: list-item;">
<div class="name">Name</div>
<span class="value">TEST TEST</span>
</li>
These words: Name, TEST TEST.
Try using the find method in bs4.
Ex:
from bs4 import BeautifulSoup
html = """<li class="hide" style="display: list-item;">
<div class="name">Name</div>
<span class="value">TEST TEST</span>
</li>"""
soup = BeautifulSoup(html, "html.parser")
print( soup.find("li", class_="hide").text.strip() )
Output:
Name
TEST TEST
After you find the required element use .text to extract the string.
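If the two strings are needed separately rather than as one block of text, `.stripped_strings` is handy; a small sketch with the question's HTML:

```python
from bs4 import BeautifulSoup

html = """<li class="hide" style="display: list-item;">
<div class="name">Name</div>
<span class="value">TEST TEST</span>
</li>"""
soup = BeautifulSoup(html, "html.parser")

# .stripped_strings yields each piece of text separately, already trimmed:
parts = list(soup.find("li", class_="hide").stripped_strings)
print(parts)  # ['Name', 'TEST TEST']
```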

Python: scrape a part of source code and save it as html

Here is the case: I need to save a web page's source code as an HTML file. But if you look at the web page, there are lots of sections I don't need; I only want to save the source code of the article itself.
code:
from urllib.request import urlopen
page = urlopen('http://www.abcde.com')
page_content = page.read()
with open('page_content.html', 'wb') as f:
    f.write(page_content)
I can save the whole source code from my code, but how can I just save the only part I want?
Explain:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>
I need to save the tag itself and everything inside it, not just extract the sentences from the tags.
The result I want is to save like this:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<div class="col-md-12 col-xs-12" style="padding-left:10px;">
<h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
</div>
<!--Article Start-->
<section class="page_article_div" id="print">
<article itemprop="text" class="page_article_content">
<p>
<img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
<strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
<li>
Germanic paganism</li>
<li>
Greek mythology</li>
</ol>
<p style="text-align: right;">
【Jane】</p>
<p style="text-align: right;">
Credit : Wiki</p>
</article>
<div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
<br />
<div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
</section>
<!--Article End-->
</div>
My own solution here:
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")
parts = []  # avoid shadowing the built-in name "list"
for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]'):
    parts.append(str(tag))
result = ', '.join(parts)
#print(result)
#print(type(result))
with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(result)
I am a beginner, so I am trying to keep it as simple as possible; this is my answer, and it's working quite well at the moment :)
You can search for the tag by its properties, such as class, tag name, or id, and save it in whatever format you want, as in the example below. (Note: find_elements_by_class_name is a Selenium method, not a BeautifulSoup one; with BeautifulSoup use find_all.)
soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')
tag_for_me = soup.find_all(class_='class_name_of_your_tag')
print(tag_for_me)
tag_for_me will contain your required code.
You can use Beautiful Soup to get any HTML source you need.
import requests
from bs4 import BeautifulSoup
target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")
for elem in soup.find_all(attrs={"class": target_class}):
    if elem.text == target_text:
        print(elem)
Output:
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>
Use BeautifulSoup to find the place in the HTML where you want to insert, and the HTML you want to insert there; use insert() to add the new tag, then overwrite the original file.
from bs4 import BeautifulSoup
import requests
# Use Beautiful Soup to get the place you want to insert at.
# div_tag is the extracted div:
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'itemtype': 'http://schema.org/MedicalWebPage'})
# e.g.
# div_tag = <div itemscope itemtype="http://schema.org/MedicalWebPage"> ... </div>
res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text,'lxml')
insert_data = soup1.find('your div/data to insert')
# this will insert the tag into div_tag; you can then overwrite your
# original page_content.html with it:
div_tag.insert(3, insert_data)
# div_tag now contains your desired output - write it back to the original file.
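The insert() call itself can be sketched in isolation; a minimal runnable example with hypothetical markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="target"><p>first</p></div>', "html.parser")
div = soup.find("div", id="target")

# Build a new tag and insert it at position 1 of div's children,
# i.e. right after the existing <p>:
new_p = soup.new_tag("p")
new_p.string = "inserted"
div.insert(1, new_p)
print(div)  # <div id="target"><p>first</p><p>inserted</p></div>
```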

Having problems understanding BeautifulSoup filtering

Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all class=g's to grabbing just the items of interest in that specific div, but I just get None returns or no prints.
Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has
multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
    site = item.find('cite')
    comment = item.find('span', {'class': 'st'})
    print site
    print comment
I have also attempted stepping into the initial div and finding all;
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
print site
print comment
Test Data
<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>
UPDATE
After Alecxe's solution I took another stab at getting it right but was still not getting anything printed. So I decided to take another look at the soup and it looks different. I was previously looking at the response.text from requests. I can only think that BeautifulSoup modifies the response.text or I somehow just got the sample completely wrong the first time (not sure how). However Below is the new sample based on what I am seeing from a soup print. And below that my attempt to get to the element data I am after.
<li class="g">
<h3 class="r">
context
</h3>
<div class="s">
<div class="kv" style="margin-bottom:2px">
<cite>www.url.com/index.html</cite> #Data I am looking to grab
<div class="_nBb">‎
<div style="display:inline"snipped">
<span class="_O0"></span>
</div>
<div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
<ul>
<li class="_Ykb">
<a class="_Zkb" href="/url?/search">Cached</a>
</li>
</ul>
</div>
</div>
</div>
<span class="st">Details about URI </span> #Data I am looking to grab
Update Attempt
I have tried taking Alecxe's approach to no success so far, am I going down the right road with this?
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
First get the div with class name srg, then find all divs with class name g inside it, and get the text of the site and comment from each. Below is code that works for me:
from bs4 import BeautifulSoup
html = """<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>"""
soup = BeautifulSoup(html, 'html.parser')
labels = soup.find('div', {"class": "srg"})
spans = labels.findAll('div', {"class": 'g'})
sites = []
comments = []
for data in spans:
    site = data.find('cite', {'class': '_Rm'})
    comment = data.find('span', {'class': 'st'})
    if site:  # check that site is not None
        if site.text.strip() not in sites:
            sites.append(site.text.strip())
    if comment:  # check that comment is not None
        if comment.text.strip() not in comments:
            comments.append(comment.text.strip())
print sites
print comments
Output-
[u'http://www.url.com.stuff/here']
[u'http://www.url.com. Some info on url etc etc']
EDIT--
Why your code does not work:
For try one - result = main.find('div', {'class': 'g'}) grabs only the first matching element, and that first element has no div with class name s, so the rest of the code fails.
For try two - you are printing site and comment outside the loop, so they are out of scope for each result. Print inside the for loop instead:
soup = BeautifulSoup(html,'html.parser')
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print site.text  # grab text
    print comment.text
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g" located inside the div element with class="srg" - div.srg div.g cite CSS selector would find us exactly what we are asking about:
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
The updated code for the updated input data:
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
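The switch from find_next_sibling to find_next matters here; a small sketch (hypothetical markup) of why, once the span is no longer a direct sibling of the cite:

```python
from bs4 import BeautifulSoup

html = ('<div class="s"><div class="kv"><cite>www.url.com</cite></div>'
        '<span class="st">Details</span></div>')
soup = BeautifulSoup(html, "html.parser")
cite = soup.find("cite")

# <span class="st"> is a sibling of cite's *parent*, not of <cite> itself,
# so find_next_sibling finds nothing, while find_next keeps walking the
# document forward in parse order:
print(cite.find_next_sibling("span"))       # None
print(cite.find_next("span", class_="st"))  # <span class="st">Details</span>
```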
Also, make sure you are using BeautifulSoup version 4:
pip install --upgrade beautifulsoup4
And the import statement should be:
from bs4 import BeautifulSoup

Python BeautifulSoup findAll by "class" attribute

I want to do the following code, which is what the BS documentation says to do; the only problem is that the word "class" isn't just a word. It can be found inside HTML, but it's also a Python keyword, which causes this code to throw an error.
So how do I do the following?
soup.findAll('ul', class="score")
Your problem seems to be that you expect find_all in the soup to find an exact match for your string. In fact:
When you search for a tag that matches a certain CSS class, you’re
matching against any of its CSS classes:
You can properly search for a class tag as #alKid said. You can also search with the class_ keyword arg.
soup.find_all('ul', class_="score")
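A quick sketch of that documentation note, showing that class_ matches any one of a tag's classes rather than the exact attribute string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul class="score header"><li>1</li></ul>', "html.parser")

# class_="score" matches even though the full attribute is "score header":
print(len(soup.find_all("ul", class_="score")))   # 1
print(len(soup.find_all("ul", class_="header")))  # 1
```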
Here is how to do it:
soup.find_all('ul', {'class':"score"})
If OP is interested in getting the finalScore by going through ul you could solve this with a couple of lines of gazpacho:
from gazpacho import Soup
html = """\
<div>
<ul class="score header" id="400488971-linescoreHeader" style="display: block">
<li>1</li>
<li>2</li>
<li>3</li>
<li>4</li>
<li id="400488971-lshot"> </li>
<li class="finalScore">T</li>
</ul>
<div>
"""
soup = Soup(html)
soup.find("ul", {"class": "score"}).find("li", {"class": "finalScore"}).text
Which would output:
'T'
