BeautifulSoup, access css value | <div style="background:url('this_one')">

I have
<div style="background:url('link_to_img')"></div>
and I need to extract the image link from this div. Does anyone know how to do it?

You can do this with a regular expression applied to the div's style attribute.
from bs4 import BeautifulSoup
import re
html = '''<div style="background:url('link_to_img')"></div>'''
soup = BeautifulSoup(html,'lxml')
print(re.search(r'\((.*?)\)',soup.find('div')['style']).group(1))
The result, with the quotes from the CSS value included, is
'link_to_img'
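
If the surrounding quotes are not wanted, a slightly stricter pattern can drop them. What follows is a minimal sketch using only the re module. The names URL_IN_CSS and css_urls are made up for illustration, and the pattern assumes simple url(...) values without escaped quotes or nested parentheses.

```python
import re

# url( + optional quote + captured value + the same quote again + )
URL_IN_CSS = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""")

def css_urls(style_value):
    """Return every url(...) target found in a CSS style string."""
    return [m.group(2) for m in URL_IN_CSS.finditer(style_value)]

style = "background:url('link_to_img')"
print(css_urls(style))  # ['link_to_img']
```

The same helper can be fed soup.find('div')['style'] from the answer above.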

Related

beautiful soup find text of path that contains div and span

I am a beginner in Python 3 and I am working on a Selenium project for a website.
The text that I want is under the path ("//div[@class='classname']//span[@class='classname2']").text
but I cannot extract it without BeautifulSoup:
for i in postsContainer.extract():
    soup = bs(i)
    people.append([soup.find("div",{"class":"classname"}).text])
but it doesn't work without the //span part. How can I use my path in BeautifulSoup?
Any help is appreciated.

With more HTML to inspect we could perhaps find a better solution, but you can use CSS selectors in this case:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span.css-901oao.css-16my406.r-poiln3.r-bcqeeo.r-qvutc0').get_text()
or:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span').get_text()
Example
from bs4 import BeautifulSoup
html='''
<div class="classname">
<span class="classname2">text</span>
</div>
'''
soup = BeautifulSoup(html,'html.parser')
print(soup.select_one('div.classname span.classname2').get_text())

Regex help in python to find links

I am parsing some links from an html page and I want to detect all links that match the following pattern:
http://www.example.com/category1/some-content-here/
http://www.example.com/category-12/some-content-here/
It should NOT match links below:
http://www.example.com/category1/
http://www.example.org/category-12/some-content-here/
Thanks!

You can use BeautifulSoup to parse the HTML a tags, and then use regex to filter the original, full result:
from bs4 import BeautifulSoup as soup
import re
sample = """
<div id='test'>
<a href='http://www.example.com/category1/some-content-here/'>Someting</a>
<a href='http://www.example.com/category-12/some-content-here/'>Someting Here</a>
<a href='http://www.example.com/category1/'>Someting1</a>
<a href='http://www.example.org/category-12/some-content-here/'>Sometingelse</a>
</div>
"""
a = [i['href'] for i in soup(sample, 'lxml').find_all('a') if re.findall(r'http://[\w\.]+\.com/[\w\-]+/[\w\-]+/', i['href'])]
Output:
['http://www.example.com/category1/some-content-here/', 'http://www.example.com/category-12/some-content-here/']
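
The pattern above is checked with re.findall, so it accepts any href that merely contains a matching substring, and it allows any .com host. If only whole URLs on www.example.com should pass, a stricter variant could look like the sketch below. Fixing the host is an assumption based on the examples in the question, and the names LINK and links are illustrative.

```python
import re

# Assumption: only http://www.example.com/<category>/<content>/ is wanted.
LINK = re.compile(r"http://www\.example\.com/[\w-]+/[\w-]+/")

links = [
    "http://www.example.com/category1/some-content-here/",
    "http://www.example.com/category-12/some-content-here/",
    "http://www.example.com/category1/",
    "http://www.example.org/category-12/some-content-here/",
]

# fullmatch requires the whole string to match, not just a part of it.
matching = [url for url in links if LINK.fullmatch(url)]
print(matching)
```

In the BeautifulSoup version, the same compiled pattern can replace the re.findall call by testing LINK.fullmatch(i['href']).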

How to find element based on text ignore child tags in beautifulsoup

I am looking for a solution using Python and BeautifulSoup to find an element based on the inside text. For example:
<div> <b>Ignore this text</b>Find based on this text </div>
How can I find this div? Thanks for your help!

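You can use .find with the text argument to locate the text node and then call findParent to get its parent element. The search string has to match the text node exactly, including the trailing space.
Example: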
from bs4 import BeautifulSoup
s="""<div> <b>Ignore this text</b>Find based on this text </div>"""
soup = BeautifulSoup(s, 'html.parser')
t = soup.find(text="Find based on this text ")
print(t.findParent())
Output:
<div> <b>Ignore this text</b>Find based on this text </div>

Try this. It is similar to the example above, but it removes the child tags before reading the text:
from bs4 import BeautifulSoup
html="""
<div> <b>Ignore this text</b>Find based on this text </div>
"""
soup = BeautifulSoup(html, 'lxml')
s = soup.find('div')
for child in s.find_all('b'):
    child.decompose()
print(s.get_text())
Output
Find based on this text
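
The first answer depends on the exact text node, trailing space included, and the second one modifies the tree. Another option is to compare only the text that belongs directly to each div. The sketch below assumes the html.parser backend; own_text is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = "<div> <b>Ignore this text</b>Find based on this text </div>"
soup = BeautifulSoup(html, "html.parser")

def own_text(tag):
    # Join only the strings that are direct children of the tag,
    # so text inside <b> and other child tags is left out.
    return "".join(tag.find_all(string=True, recursive=False)).strip()

# find() accepts a function that is called with each tag in the document.
div = soup.find(lambda tag: tag.name == "div"
                and own_text(tag) == "Find based on this text")
print(div)
```

Because the comparison strips whitespace, the search string does not need the trailing space.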

Can't get item with python beautifulsoup

I'm trying to learn web scraping with BeautifulSoup and Python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/, but I can't figure out how to isolate the text. The HTML for what I want is shown below; the output I want is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
In my code I've tried soup.find(text="Cinematography") and a mixture of different things, like trying to find the item or call get_text from within the a and p tags, but ...

I would use a regex to search the soup object for a link whose href contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"

You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster

Use a CSS attribute substring selector (note that this needs select_one, since find does not accept CSS selectors):
soup.select_one('a[href*="cinematography"]').text
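
The answers above rely on a link whose href contains "cinematography", which does not appear in the snippet quoted in the question. Working only from that snippet, one could instead start at the heading and step to the div that follows it. This is a sketch against the quoted HTML, and the live page may be structured differently; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the heading label by its exact text.
label = soup.find("span", string="Cinematography")
# Step up to the <h3>, then across to the div that follows it.
sluglist = label.find_parent("h3").find_next_sibling("div", class_="text-sluglist")
name = sluglist.get_text(strip=True)
print(name)  # Steven Poster
```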

Python - regex ends with specific image extension

I am trying to scrape links with a specific class using BeautifulSoup and want to exclude links to images, like this:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('''<html><body><a class="link"
href="http://test/file.html">right</a><br><a class="link"
href="/test/file.jpg">false</a><br><a class="link"
href="/test/file.img">false</a><br><a class="link"
href="http://test/file.html">right</a><br></html>''')
for a in soup.findAll('a',
        attrs={"class":"link", "href":re.compile(r'.*\.(?!jpg$|img$)[^.]+')}):
    print(a.text)
What's wrong with my regex?
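
One possible issue, offered as a guess since the question does not show the failing output: the pattern is not anchored to the last dot. When an href contains an earlier dot, for example in a host name, the regex engine can backtrack to that dot, where the lookahead no longer sees jpg or img at the end. The sketch below compares the original pattern with an anchored alternative using plain re.search; the fourth href is a hypothetical case added to show the difference, and not_image is an illustrative name.

```python
import re

original = re.compile(r'.*\.(?!jpg$|img$)[^.]+')
# Assumption: "not an image" means the href does not end in .jpg or .img.
not_image = re.compile(r'^(?!.*\.(?:jpg|img)$)')

hrefs = [
    "http://test/file.html",
    "/test/file.jpg",
    "/test/file.img",
    "http://test.com/file.jpg",  # hypothetical href with a dot in the host
]

# Print each href with the verdict of both patterns side by side.
for href in hrefs:
    print(href, bool(original.search(href)), bool(not_image.search(href)))
```

The anchored pattern can be passed as the href filter in attrs in the same way as the original one.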
