beautiful soup find text of path that contains div and span - python

I am a beginner in Python 3 and am working on a Selenium project for a website.
The text that I want is under the path ("//div[@class='classname']//span[@class='classname2']").text,
but I cannot extract it without BeautifulSoup:
for i in postsContainer.extract():
    soup = bs(i)
    people.append([soup.find("div",{"class":"classname"}).text])
But it doesn't work without the //span part. How can I express my path in BeautifulSoup?
Any help would be appreciated.

With more HTML to inspect we might find a better solution, but in this case you can use CSS selectors:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span.css-901oao.css-16my406.r-poiln3.r-bcqeeo.r-qvutc0').get_text()
or:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span').get_text()
Example
from bs4 import BeautifulSoup
html='''
<div class="classname">
<span class="classname2">text</span>
</div>
'''
soup = BeautifulSoup(html,'html.parser')
soup.select_one('div.classname span.classname2').get_text()
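The same lookup can also be written with chained find() calls, which makes it easier to guard against a missing element. This is only a minimal sketch, assuming the simplified markup from the example above; the helper name span_text is hypothetical.

```python
from bs4 import BeautifulSoup

html = '''
<div class="classname">
<span class="classname2">text</span>
</div>
'''

def span_text(markup):
    """Return the text of span.classname2 inside div.classname, or None."""
    soup = BeautifulSoup(markup, 'html.parser')
    div = soup.find('div', class_='classname')
    if div is None:
        return None
    # find() searches all descendants, much like the XPath // step
    span = div.find('span', class_='classname2')
    return span.get_text(strip=True) if span is not None else None

print(span_text(html))
print(span_text('<p>no match</p>'))
```

Returning None instead of raising AttributeError keeps a loop over many posts from stopping at the first post that lacks the span.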

Related

how to crawl vue.js based website with beautifulsoup?

I've tried to crawl a website using BeautifulSoup and ran into this element:
<p data-v-57d17052 class="text text--gray70 text--subtitle2">Hello</p>
For some reason, it doesn't let me use BeautifulSoup's features:
title = soup.find_all(class_={"text, text--gray70, text--subtitle2"})
I think data-v-57d17052 is causing this difficulty.
Does anyone know how to solve this issue?
I have tried all beautifulsoup's features and it doesn't work at all.
Remove the commas from the class_ value:
from bs4 import BeautifulSoup
html_doc = '''<p data-v-57d17052 class="text text--gray70 text--subtitle2">Hello</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.find(class_={"text text--gray70 text--subtitle2"})
print(title.text)
Prints:
Hello
You can also use a CSS selector:
title = soup.select_one(".text.text--gray70.text--subtitle2")
print(title.text)
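For context, data-v-57d17052 is just a valueless attribute and does not interfere with class matching; the original call fails because the commas become part of the string being matched. Below is a short sketch of two alternatives that do not depend on the exact order of the classes. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = '<p data-v-57d17052 class="text text--gray70 text--subtitle2">Hello</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# The Vue scoping attribute is parsed like any other attribute
tag = soup.find('p')
print(tag.has_attr('data-v-57d17052'))

# A single class name matches any tag that carries that class
by_one_class = soup.find_all(class_='text--gray70')

# A CSS selector requires all listed classes, in any order
by_selector = soup.select('p.text--subtitle2.text')

print([t.get_text() for t in by_one_class])
print([t.get_text() for t in by_selector])
```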

Scraping <span> text</span> with BeautifulSoup and urllib

I want to scrape 2015 from the HTML below:
I use the code below but am only able to scrape "Annee":
soup.find('span', {'class':'optionLabel'}).get_text()
Can someone please help?
I am a new learner.
Simply find the next span, which holds the text you want to scrape:
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()
or use a CSS selector with the adjacent sibling combinator:
soup.select_one('span.optionLabel + span').get_text()
Example
html='''
<span class="optionLabel"><button>Année</button></span> :
<span>2015</span>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()
Output
2015

BeautifulSoup, access css value | <div style="background:url('this_one')">

I have
<div style="background:url('link_to_img')"></div>
and I need to extract an image link of this div, does anyone know how to do it?
You can do this with regular expressions.
from bs4 import BeautifulSoup
import re
html = '''<div style="background:url('link_to_img')"></div>'''
soup = BeautifulSoup(html,'lxml')
print(re.search(r'\((.*?)\)',soup.find('div')['style']).group(1))
The result is
'link_to_img'
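Note that the captured group above still includes the surrounding quotes. If the bare link is needed, the pattern can treat the quotes as optional and keep them outside the group. This is a minimal sketch using only the re module, assuming the style string has already been read from the tag as in the answer; the name URL_RE is illustrative.

```python
import re

style = "background:url('link_to_img')"

# Optional single or double quotes around the URL are matched outside the group
URL_RE = re.compile(r"""url\(\s*['"]?(.*?)['"]?\s*\)""")

match = URL_RE.search(style)
link = match.group(1) if match else None
print(link)
```

The same pattern also handles unquoted values such as url(foo.png), which CSS allows.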

Python - regex ends with specific image extension

I am trying to scrape links with a specific class using BeautifulSoup and want to exclude images, like this:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('''<html><body><a class="link"
href="http://test/file.html">right</a><br><a class="link"
href="/test/file.jpg">false</a><br><a class="link"
href="/test/file.img">false</a><br><a class="link"
href="http://test/file.html">right</a><br></html>''')
for a in soup.findAll('a',
        attrs={"class":"link", "href":re.compile('.*\.(?!jpg$|img$)[^.]+')}):
    print(a.text)
What's wrong with my regex?
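One possible explanation, offered as a sketch rather than a definitive diagnosis: the leading .* lets the pattern match at any earlier dot, so a URL with a dot in its host name still matches even though it ends in .jpg. For the exact hrefs in the question the pattern does reject the two image links. Anchoring the check to the end of the string avoids the problem. The URL http://test.com/file.jpg and the helper name is_wanted are assumptions added for illustration.

```python
import re

# Pattern from the question, written as a raw string
original = re.compile(r'.*\.(?!jpg$|img$)[^.]+')

# Reject any href whose final extension is jpg or img
not_image = re.compile(r'^(?!.*\.(?:jpg|img)$).+$')

def is_wanted(href):
    """True if href does not end in .jpg or .img (hypothetical helper)."""
    return not_image.search(href) is not None

hrefs = [
    'http://test/file.html',
    '/test/file.jpg',
    '/test/file.img',
    'http://test.com/file.jpg',  # dot in the host name
]

for href in hrefs:
    print(href, bool(original.search(href)), is_wanted(href))
```

The compiled not_image pattern can be passed as the href filter in the same way as the pattern in the question.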

Unable to get correct link in BeautifulSoup

I'm trying to parse a bit of HTML and I'd like to extract the link that matches a particular pattern. I'm using the find method with a regular expression but it doesn't get me the correct link. Here's my snippet. Could someone tell me what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
import re
html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a> –
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a> –
</div>
"""
soup = BeautifulSoup(html)
print soup.find('a', href = re.compile(r".*title/tt.*"))['href']
I should be getting the second link but BS always returns the first link. The href of the first link doesn't even match my regex so why does it return it?
Thanks.
find only returns the first <a> tag. You want findAll.
Can't answer your question, but anyway your (originally) posted code has an import typo. Change
import BeautifulSoup
to
from BeautifulSoup import BeautifulSoup
Then, your output (using beautifulsoup version 3.1.0.1) will be:
http://www.imdb.com/title/tt1196141/
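As a quick sanity check independent of BeautifulSoup, the pattern can be run directly against the three href values from the snippet. This is a minimal sketch using only the re module (the list name hrefs is illustrative); it confirms that only the IMDB title link matches the regular expression.

```python
import re

pattern = re.compile(r".*title/tt.*")

# href values copied from the HTML in the question
hrefs = [
    "http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/",
    "http://www.imdb.com/video/imdb/vi2496267289/",
    "http://www.imdb.com/title/tt1196141/",
]

matching = [h for h in hrefs if pattern.search(h)]
print(matching)
```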
