Python - regex ends with specific image extension

I am trying to scrape links with a specific class using BeautifulSoup and want to exclude links to images, like this:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('''<html><body><a class="link"
href="http://test/file.html">right</a><br><a class="link"
href="/test/file.jpg">false</a><br><a class="link"
href="/test/file.img">false</a><br><a class="link"
href="http://test/file.html">right</a><br></html>''')
for a in soup.findAll('a',
        attrs={"class":"link", "href":re.compile('.*\.(?!jpg$|img$)[^.]+')}):
    print(a.text)
What's wrong with my regex?
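One possible way to write the filter, sketched under the assumption that the goal is to reject any href ending in .jpg or .img: anchor a negative lookahead at the start of the string so that it inspects the whole value. The pattern in the question can still match at an earlier dot when the href contains more than one dot (the www.example.com URL below is a made-up illustration). The names NOT_IMAGE and is_wanted are hypothetical.

```python
import re

# Matches (with an empty match at position 0) only when the string
# does NOT end in ".jpg" or ".img".
NOT_IMAGE = re.compile(r'^(?!.*\.(?:jpg|img)$)')

def is_wanted(href):
    """Return True if href does not end with an excluded extension."""
    return NOT_IMAGE.search(href) is not None

hrefs = [
    "http://test/file.html",
    "/test/file.jpg",
    "/test/file.img",
    "http://www.example.com/file.jpg",  # hypothetical multi-dot URL
]
wanted = [h for h in hrefs if is_wanted(h)]
print(wanted)
```

BeautifulSoup tests a compiled pattern against an attribute value with its search() method, so the same compiled pattern can be passed as the href filter in find_all.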

Related

beautiful soup find text of path that contains div and span

I am a beginner in Python 3 and am working on a Selenium project for a website.
The text that I want is under the path ("//div[@class='classname']//span[@class='classname2']").text
but I cannot extract it without BeautifulSoup:
for i in postsContainer.extract():
    soup = bs(i)
    people.append([soup.find("div",{"class":"classname"}).text])
But it doesn't work without the //span part. How can I use my path in BeautifulSoup?
Any help would be appreciated.
With more HTML to inspect there might be a better solution, but in this case you can use CSS selectors:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span.css-901oao.css-16my406.r-poiln3.r-bcqeeo.r-qvutc0').get_text()
or:
soup.select_one('div.css-901oao.r-18jsvk2.r-1qd0xha.r-a023e6.r-16dba41.r-ad9z0x.r-bcqeeo.r-bnwqim.r-qvutc0 > span').get_text()
Example
from bs4 import BeautifulSoup
html='''
<div class="classname">
<span class="classname2">text</span>
</div>
'''
soup = BeautifulSoup(html,'html.parser')
soup.select_one('div.classname span.classname2').get_text()

Python beautifulsoup search issue

I'm having trouble getting BeautifulSoup to find this text. I think it's because the text on the page has extra quotes around it. I was told it's because the class is actually blank. If that's the case, any suggestions on how I can build my search?
Actual text on website: <span class="" data-product-price="">
My code (I've tried several variations): soup.find_all('span',{'class' : '" data-product-price="'})
I've also tried just doing a regular search, but I'm not doing that correctly. Any suggestions or should I use something other than bs?
Edited to include full code:
import bs4
import requests
from bs4 import BeautifulSoup
r=requests.get('https://www.gouletpens.com/products/twsbi-diamond-580-fountain-pen-clear?variant=11884892028971')
soup = bs4.BeautifulSoup(r.text, features="html.parser")
print(soup)
#soup.find_all('span',{'class' : '" data-product-price="'})
#soup.find_all('span',{'class' : 'data-product-price'})[0].text
After looking at the page, you can select the price with a CSS selector:
import requests
from bs4 import BeautifulSoup
url = 'https://www.gouletpens.com/products/twsbi-diamond-580-fountain-pen-clear?variant=11884892028971'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
print(soup.select_one('span[data-product-price]').get_text(strip=True))
Prints:
$50.00
Or, with the bs4 API (pass {'data-product-price':True} to match tags that have this attribute, regardless of its value):
print(soup.find('span', {'data-product-price':True}).get_text(strip=True))
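An offline sketch of the same idea, using a small hand-written HTML snippet (the markup and the price are made up for illustration, not copied from the site): the span has an empty class, so it is identified by the presence of the data-product-price attribute instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the tag quoted in the question.
html = '''
<div>
  <span class="label">Price</span>
  <span class="" data-product-price="">
    $12.34
  </span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# CSS attribute selector: any <span> that has the attribute at all.
by_css = soup.select_one('span[data-product-price]').get_text(strip=True)

# Equivalent find() call: True means "the attribute is present".
by_find = soup.find('span', attrs={'data-product-price': True}).get_text(strip=True)

print(by_css, by_find)
```

Both lookups ignore the class entirely, which is why the empty class="" does not get in the way.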

BeautifulSoup, access css value | <div style="background:url('this_one')">

I have
<div style="background:url('link_to_img')"></div>
and I need to extract the image link from this div. Does anyone know how to do it?
You can do this with regular expressions.
from bs4 import BeautifulSoup
import re
html = '''<div style="background:url('link_to_img')"></div>'''
soup = BeautifulSoup(html,'lxml')
print(re.search(r'\((.*?)\)',soup.find('div')['style']).group(1))
The result is
'link_to_img'
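The pattern above captures everything between the parentheses, quotes included. A slightly stricter sketch, assuming the style value uses the usual url(...) syntax with optional single or double quotes, captures only the link itself. The names URL_RE and extract_bg_url are hypothetical; the function works on the style string, so it can be given soup.find('div')['style'].

```python
import re

# url( optional-space optional-quote <link> same-quote optional-space )
URL_RE = re.compile(r'''url\(\s*(['"]?)(.*?)\1\s*\)''')

def extract_bg_url(style):
    """Return the first url(...) target in a CSS style string, or None."""
    m = URL_RE.search(style)
    return m.group(2) if m else None

print(extract_bg_url("background:url('link_to_img')"))
```

Returning None when there is no url(...) avoids the AttributeError that calling .group(1) on a failed search would raise.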

Regex help in python to find links

I am parsing some links from an html page and I want to detect all links that match the following pattern:
http://www.example.com/category1/some-content-here/
http://www.example.com/category-12/some-content-here/
It should NOT match links below:
http://www.example.com/category1/
http://www.example.org/category-12/some-content-here/
Thanks!
You can use BeautifulSoup to parse the HTML a tags, and then use regex to filter the original, full result:
from bs4 import BeautifulSoup as soup
import re
sample = """
<div id='test'>
<a href='http://www.example.com/category1/some-content-here/'>Someting</a>
<a href='http://www.example.com/category-12/some-content-here/'>Someting Here</a>
<a href='http://www.example.com/category1/'>Someting1</a>
<a href='http://www.example.org/category-12/some-content-here/'>Sometingelse</a>
</div>
"""
a = [i['href'] for i in soup(sample, 'lxml').find_all('a') if re.findall(r'http://[\w\.]+\.com/[\w\-]+/[\w\-]+/', i['href'])]
print(a)
Output:
['http://www.example.com/category1/some-content-here/', 'http://www.example.com/category-12/some-content-here/']
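The pattern above accepts any .com host. If only www.example.com should match, a stricter sketch (assuming exactly two path segments followed by a trailing slash, as in the sample links) escapes the host name and uses re.fullmatch so that the whole URL must fit. The name LINK_RE is hypothetical.

```python
import re

# Exactly: http://www.example.com/<segment>/<segment>/
LINK_RE = re.compile(r'http://www\.example\.com/[\w-]+/[\w-]+/')

links = [
    'http://www.example.com/category1/some-content-here/',
    'http://www.example.com/category-12/some-content-here/',
    'http://www.example.com/category1/',
    'http://www.example.org/category-12/some-content-here/',
]
# fullmatch rejects URLs with missing or extra path segments.
matching = [url for url in links if LINK_RE.fullmatch(url)]
print(matching)
```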

Can't get item with python beautifulsoup

I'm trying to learn how to webscrape with beautifulsoup + python, and I want to grab the name of the cinematographer from https://letterboxd.com/film/donnie-darko/ but I can't figure out how to isolate the text. The html for what I want is written as below, what I want to output is "Steven Poster":
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
Within my code I've done soup.find(text="Cinematography"), and a mixture of different things like trying to find item or get_text from within the a and p tags, but ...
I would use a regex to parse the soup object for a link that contains "cinematography".
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(r.text, 'lxml')
cinematographer = soup(href=re.compile(r'/cinematography/'))[0].text
print(cinematographer)
# outputs "Steven Poster"
You can do the same without using regex as well:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://letterboxd.com/film/donnie-darko/')
soup = BeautifulSoup(res.text,'lxml')
item = soup.select("[href*='cinematography']")[0].text
print(item)
Output:
Steven Poster
Use a CSS attribute substring selector (with select_one, since find does not accept CSS selectors):
soup.select_one('a[href*="cinematography"]').text
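Another option, sketched against the HTML fragment quoted in the question rather than the live page (whose markup may differ): locate the heading text and then step to the div that follows it. This assumes the name sits in the first div.text-sluglist after the "Cinematography" heading.

```python
from bs4 import BeautifulSoup

# Fragment copied from the question; the live page may be structured differently.
html = '''
<h3><span>Cinematography</span></h3>
<div class="text-sluglist">
<p>
Steven Poster
</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Find the <span> whose text is exactly "Cinematography" ...
heading = soup.find('span', string='Cinematography')
# ... then the next matching <div> in document order.
block = heading.find_next('div', class_='text-sluglist')
name = block.get_text(strip=True)
print(name)
```

This does not depend on the href of any link, so it still works if the name is plain text inside the paragraph.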
