I have
<div style="background:url('link_to_img')"></div>
and I need to extract an image link of this div, does anyone know how to do it?
You can do this with regular expressions.
from bs4 import BeautifulSoup
import re
html = '''<div style="background:url('link_to_img')"></div>'''
soup = BeautifulSoup(html,'lxml')
print(re.search(r'\((.*?)\)',soup.find('div')['style']).group(1))
The result is
'link_to_img'
Related
I am a beginner in Python3, I am working on selenium project for a website
the text that i want is under the path ("//div[#class='classname']//span[#class='classname2']).text
but i cannot extract it without a beautifulsoup
for i in postsContainer.extract():
soup = bs(i)
people.append([soup.find("div",{"class":"classname"}).text])
but It doesn't work without the //span part. How can I insert my path in a beautifulsoup?
If someone can help
If there would be some more html to inspect, we would maybe find a better solution, but you can use the css selectors in this case
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span.css-901oao.css-16my406.r-poiln3.r-bcqeeo r-qvutc0').get_text()
or:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span').get_text()
Example
from bs4 import BeautifulSoup
html='''
<div class="classname">
<span class="classname2">text</span>
</div>
'''
soup = BeautifulSoup(html,'html.parser')
soup.select_one('div.classname span.classname2').get_text()
I am parsing some links from an html page and I want to detect all links that match the following pattern:
http://www.example.com/category1/some-content-here/
http://www.example.com/category-12/some-content-here/
It should NOT match links below:
http://www.example.com/category1/
http://www.example.org/category-12/some-content-here/
Thanks!
You can use BeautifulSoup to parse the HTML a tags, and then use regex to filter the original, full result:
from bs4 import BeautifulSoup as soup
import re
sample = """
<div id='test'>
<a href='http://www.example.com/category1/some-content-here/'>Someting</a>
<a href='http://www.example.com/category-12/some-content-here/'>Someting Here</a>
<a href='http://www.example.com/category1/'>Someting1</a>
<a href='http://www.example.org/category-12/some-content-here/'>Sometingelse</a>
</div>
"""
a = [i['href'] for i in soup(sample, 'lxml').find_all('a') if re.findall('http://[\w\.]+\.com/[\w\-]+/[\w\-]+/', i['href'])]
Output:
['http://www.example.com/category1/some-content-here/', 'http://www.example.com/category-12/some-content-here/']
I am looking for a solution using Python and BeautifulSoup to find an element based on the inside text. For example:
<div> <b>Ignore this text</b>Find based on this text </div>
How can I find this div? Thanks for you helps!
You can use .find with the text argument and then use findParent to the parent element.
Ex:
from bs4 import BeautifulSoup
s="""<div> <b>Ignore this text</b>Find based on this text </div>"""
soup = BeautifulSoup(s, 'html.parser')
t = soup.find(text="Find based on this text ")
print(t.findParent())
Output:
<div> <b>Ignore this text</b>Find based on this text </div>
try it , it is like example but it works
from bs4 import BeautifulSoup
html="""
<div> <b>Ignore this text</b>Find based on this text </div>
"""
soup = BeautifulSoup(html, 'lxml')
s = soup.find('div')
for child in s.find_all('b'):
child.decompose()
print(s.get_text())
Output
Find based on this text
I'm trying to learn how to webscrape with beautifulsoup + python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/ but I can't figure out how to isolate the text. The html for what I want is written as below, what I want to output is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
within my code I've done soup.find(text="Cinematography"), and a mixture of different thigns like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print cinematographer
# outputs "Stephen Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use CSS partial text selector:
soup.find('a[href*="cinematography"]').text
I try to scrape links with a specific class with BeautifulSoup and want to exclude images like:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('''<html><body><a class="link"
href="http://test/file.html">right</a><br><a class="link"
href="/test/file.jpg">false</a><br><a class="link"
href="/test/file.img">false</a><br><a class="link"
href="http://test/file.html">right</a><br></html>''')
for a in soup.findAll('a',
attrs={"class":"link", "href":re.compile('.*\.(?!jpg$|img$)[^.]+')}):
print(a.text)
Whats wrong with my regex?