I am a beginner in Python 3 and I am working on a Selenium project for a website.
The text that I want is at the XPath ("//div[@class='classname']//span[@class='classname2']").text,
but I cannot extract it without BeautifulSoup:
for i in postsContainer.extract():
    soup = bs(i)
    people.append([soup.find("div",{"class":"classname"}).text])
But it doesn't work without the //span part. How can I use my path in BeautifulSoup?
Any help would be appreciated.
If there were more HTML to inspect, we could maybe find a better solution, but you can use CSS selectors in this case:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span.css-901oao.css-16my406.r-poiln3.r-bcqeeo.r-qvutc0').get_text()
or:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span').get_text()
Example
from bs4 import BeautifulSoup
html='''
<div class="classname">
<span class="classname2">text</span>
</div>
'''
soup = BeautifulSoup(html,'html.parser')
soup.select_one('div.classname span.classname2').get_text()
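To connect this back to the loop in the question, here is a minimal sketch that collects the span text from several posts. The `div.post` wrapper and the sample markup are assumptions standing in for the real page source (for example Selenium's `driver.page_source`). The space in the selector is the CSS descendant combinator, which plays the role of `//` in the XPath.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page source
html = '''
<div class="post"><div class="classname"><span class="classname2">first</span></div></div>
<div class="post"><div class="classname"><span class="classname2">second</span></div></div>
<div class="post"><div class="classname"></div></div>
'''

soup = BeautifulSoup(html, 'html.parser')
people = []
for post in soup.select('div.post'):
    # select_one returns None when nothing matches, so check before reading text
    span = post.select_one('div.classname span.classname2')
    if span is not None:
        people.append(span.get_text(strip=True))
print(people)
```

The `None` check matters because calling `.get_text()` on a missing element raises an `AttributeError`.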
Related
I've tried to crawl a website using BeautifulSoup and I've come across this:
<p data-v-57d17052 class="text text--gray70 text--subtitle2">Hello</p>
and this, for some reason, doesn't allow me to use BeautifulSoup's features.
title = soup.find_all(class_={"text, text--gray70, text--subtitle2"})
I think data-v-57d17052 is causing this difficulty.
Does anyone know how to solve this issue?
I have tried all beautifulsoup's features and it doesn't work at all.
Remove the commas from the class_ value (classes in the attribute are separated by spaces, not commas):
from bs4 import BeautifulSoup
html_doc = '''<p data-v-57d17052 class="text text--gray70 text--subtitle2">Hello</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.find(class_={"text text--gray70 text--subtitle2"})
print(title.text)
Prints:
Hello
You can also use a CSS selector:
title = soup.select_one(".text.text--gray70.text--subtitle2")
print(title.text)
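One difference between the two approaches is worth a short sketch: passing the whole string to `class_` matches the exact value of the class attribute, so it depends on the order of the classes, while the CSS selector matches any element that carries all three classes. The second `<p>` below is a made-up variant added only to show this.

```python
from bs4 import BeautifulSoup

html_doc = '''
<p class="text text--gray70 text--subtitle2">Hello</p>
<p class="text--subtitle2 text text--gray70">World</p>
'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Exact string match on the class attribute: order-sensitive
exact = [p.text for p in soup.find_all(class_="text text--gray70 text--subtitle2")]

# CSS selector: matches elements having all three classes, in any order
by_css = [p.text for p in soup.select(".text.text--gray70.text--subtitle2")]

print(exact)
print(by_css)
```

If the site may reorder its classes, the selector form is likely the safer choice.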
I want to scrape 2015 from the HTML below:
I use the code below but am only able to scrape "Annee":
soup.find('span', {'class':'optionLabel'}).get_text()
Can someone please help?
I am a new learner.
Simply find the next span, which holds the text you want to scrape:
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()
or use a CSS selector with the adjacent sibling combinator:
soup.select_one('span.optionLabel + span').get_text()
Example
html='''
<span class="optionLabel"><button>Année</button></span> :
<span>2015</span>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
soup.find('span', {'class':'optionLabel'}).find_next('span').get_text()
Output
2015
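If the page has several label/value pairs of this kind, the same idea can be applied in a loop. This is only a sketch: the wrapping `<div>` elements and the second pair ("Marque" / "Example") are invented for illustration, and `find_next_sibling` is used instead of `find_next` so that only a span at the same level is picked up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second pair is made up for the example
html = '''
<div><span class="optionLabel"><button>Année</button></span> :
<span>2015</span></div>
<div><span class="optionLabel"><button>Marque</button></span> :
<span>Example</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

options = {}
for label in soup.select('span.optionLabel'):
    # The next <span> sibling holds the value; text nodes such as " :" are skipped
    value = label.find_next_sibling('span')
    if value is not None:
        options[label.get_text(strip=True)] = value.get_text(strip=True)
print(options)
```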
I have
<div style="background:url('link_to_img')"></div>
and I need to extract the image link from this div. Does anyone know how to do it?
You can do this with regular expressions.
from bs4 import BeautifulSoup
import re
html = '''<div style="background:url('link_to_img')"></div>'''
soup = BeautifulSoup(html,'lxml')
print(re.search(r'\((.*?)\)',soup.find('div')['style']).group(1))
The result is
'link_to_img'
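Note that the result above still includes the quotes from the style attribute. A small variation of the pattern, sketched below, anchors on `url(` and treats the surrounding quotes as optional so that only the link itself is captured. It uses `html.parser` rather than `lxml` only to avoid the extra dependency, and it assumes a single `url(...)` in the style value.

```python
import re
from bs4 import BeautifulSoup

html = '''<div style="background:url('link_to_img')"></div>'''
soup = BeautifulSoup(html, 'html.parser')

style = soup.find('div')['style']
# Optional single or double quote on each side of the captured link
match = re.search(r'''url\(\s*['"]?(.*?)['"]?\s*\)''', style)
link = match.group(1) if match else None
print(link)
```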
I am trying to scrape links with a specific class using BeautifulSoup and want to exclude links to images, like this:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('''<html><body><a class="link"
href="http://test/file.html">right</a><br><a class="link"
href="/test/file.jpg">false</a><br><a class="link"
href="/test/file.img">false</a><br><a class="link"
href="http://test/file.html">right</a><br></html>''')
for a in soup.findAll('a',
        attrs={"class":"link", "href":re.compile('.*\.(?!jpg$|img$)[^.]+')}):
    print(a.text)
What's wrong with my regex?
I'm trying to parse a bit of HTML and I'd like to extract the link that matches a particular pattern. I'm using the find method with a regular expression but it doesn't get me the correct link. Here's my snippet. Could someone tell me what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
import re
html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a> –
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a> –
</div>
"""
soup = BeautifulSoup(html)
print soup.find('a', href = re.compile(r".*title/tt.*"))['href']
I should be getting the second link but BS always returns the first link. The href of the first link doesn't even match my regex so why does it return it?
Thanks.
find only returns the first matching <a> tag. If you want every match, use findAll.
Can't answer your question, but anyway your (originally) posted code has an import typo. Change
import BeautifulSoup
to
from BeautifulSoup import BeautifulSoup
Then, your output (using beautifulsoup version 3.1.0.1) will be:
http://www.imdb.com/title/tt1196141/
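For readers on Python 3, here is a rough equivalent using the `bs4` package (an assumption, since the thread uses BeautifulSoup 3). It illustrates both points made above: `find` returns the first tag whose href matches the pattern, and `find_all` returns every match. The pattern is shortened to `title/tt` because bs4 applies the regex with a search, so the leading and trailing `.*` are not needed.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="entry">
<a target="_blank" href="http://www.rottentomatoes.com/m/diary_of_a_wimpy_kid/">RT</a>
<a target="_blank" href="http://www.imdb.com/video/imdb/vi2496267289/">Trailer</a>
<a target="_blank" href="http://www.imdb.com/title/tt1196141/">IMDB</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find: first <a> whose href matches the pattern
first_href = soup.find("a", href=re.compile(r"title/tt"))["href"]
print(first_href)

# find_all: every <a> whose href matches
imdb_hrefs = [a["href"] for a in soup.find_all("a", href=re.compile(r"imdb\.com"))]
print(imdb_hrefs)
```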