Title and Link Not Getting Printed for Google Search - Python

I am trying to scrape the title, description and URL from the Google search page using BeautifulSoup and Python.
from bs4 import BeautifulSoup
from selenium import webdriver  # driver.page_source is used below, so a Selenium driver is required

driver = webdriver.Chrome()  # or any other configured webdriver

query = input("Enter your value: ")
print("Search Term: " + query)  # query = 'Python'

links = []         # initiate empty lists to capture the final results
titles = []
descriptions = []

# Specify the number of pages of Google search results; each page contains 10 links
n_pages = 6
for page in range(1, n_pages):
    url = "http://www.google.com/search?q=" + query + "&start=" + str((page - 1) * 10)
    print("Link : " + url)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    try:
        search = soup.find_all('div', class_="yuRUbf")
        for link in search:
            links.append(link.a['href'])
        description = soup.find_all('div', class_="VwiC3b yXK7lf MUxGbd yDYNvb lyLwlc lEBKkf")
        for d in description:
            descriptions.append(d.span.text)
        title = soup.find_all('div', class_="yuRUbf")
        for t in title:
            titles.append(t.h3.text)
    # Move on to the next iteration if one element is not present
    except:
        continue

print(links)
print(len(links))
print(descriptions)
print(len(descriptions))
print(titles)
print(len(titles))
The descriptions are getting stored in a list; however, the links and titles lists are empty. I inspected the elements and I am using the correct classes, but I am still unable to get the data.
Can someone help me figure out what I am doing wrong?
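A quick diagnostic (not part of the original code) is to count how many elements with those classes are present in the HTML the driver actually received; Google's obfuscated class names change frequently, so zero matches usually means the served markup differs from what you inspected in the browser:

# Diagnostic sketch: check whether the expected (dynamic) classes exist in the received HTML
from bs4 import BeautifulSoup

def report_classes(page_source, class_names):
    soup = BeautifulSoup(page_source, 'html.parser')
    for name in class_names:
        print(name, '->', len(soup.find_all('div', class_=name)), 'match(es)')

# e.g. after driver.get(url):
# report_classes(driver.page_source, ['yuRUbf', 'VwiC3b'])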

Personally, I find working with many separate lists irritating and cumbersome when the content can be stored directly in a structured way. In any case, you can get the information in a more generic way, without selecting the dynamic classes:
for r in soup.select('#search a h3'):
    data.append({
        'title': r.text,
        'url': r.parent['href'],
        'desc': r.parent.parent.nextSibling.span.text if r.parent.parent.nextSibling.span else 'no desc'
    })
Example
from bs4 import BeautifulSoup
import requests

query = input("Enter your value: ")
print("Search Term:" + query)  # query = 'Python'

data = []
n_pages = 6
for page in range(1, n_pages):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
    url = f'http://www.google.com/search?q={query}&start={str((page - 1) * 10)}'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')

    for r in soup.select('#search a h3'):
        data.append({
            'title': r.text,
            'url': r.parent['href'],
            'desc': r.parent.parent.nextSibling.span.text if r.parent.parent.nextSibling.span else 'no desc'
        })

data
Output
[{'title': 'Welcome to Python.org',
'url': 'https://www.python.org/',
'desc': 'The official home of the Python Programming Language.'},
{'title': 'Python (Programmiersprache) - Wikipedia',
'url': 'https://de.wikipedia.org/wiki/Python_(Programmiersprache)',
'desc': 'Python ([ˈpʰaɪθn̩], [ ˈpʰaɪθɑn], auf Deutsch auch [ ˈpʰyːtɔn]) ist eine universelle, üblicherweise interpretierte, höhere Programmiersprache.'},
{'title': 'Pythons - Wikipedia',
'url': 'https://de.wikipedia.org/wiki/Pythons',
'desc': 'Die Pythons (Pythonidae; altgr. Πύθων Pythōn; Einzahl der, allgemeinsprachlich auch die Python) sind eine Familie von Schlangen aus der Überfamilie der\xa0...'},
{'title': 'Das Python-Tutorial — Das Python3.3-Tutorial auf Deutsch',
'url': 'https://py-tutorial-de.readthedocs.io/',
'desc': 'Python ist eine einfach zu lernende, aber mächtige Programmiersprache mit effizienten abstrakten Datenstrukturen und einem einfachen, aber effektiven Ansatz\xa0...'},...]
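Since data is just a list of dicts, it also drops straight into pandas if you prefer a table or a CSV over the raw list (an optional sketch, assuming pandas is installed):

import pandas as pd

# 'data' is the list of {'title', 'url', 'desc'} dicts collected above
df = pd.DataFrame(data)
df.to_csv('google_results.csv', index=False)  # google_results.csv is an arbitrary filename
print(df.head())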

Related

Handling this website which is redirecting to the same url with BeautifulSoup

So I'm scraping this website: https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari
I'm scraping all the reviews for a particular product on this website and also handling the pagination. The URLs for the different pages look like:
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa=1
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa=2
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa=3
The code works as expected, but when accessing page (n/2)+1, the website redirects to the first page of the comments. Suppose, for example, that a product has 12 pages in total. Upon accessing the 7th page directly, whose URL is
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa=7
It redirects to the first page for which the URL is
https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari
When we go to the 7th page using the website's own page links, it opens with no issues; it only redirects when we access the page URL directly.
How can I handle this with BeautifulSoup? Please don't suggest Selenium, as the client needs the task done using BeautifulSoup only.
Here's the Python code:
from bs4 import BeautifulSoup
import requests
import sys
import re

# Redirect all print output to a file
file_path = 'output.txt'
sys.stdout = open(file_path, "w", encoding="utf-8")

# url = sys.argv[1]
url = 'https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari'
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.content, 'html5lib')

divlist = soup.find_all("div", {'itemprop': 'review'})

# Work out the highest page number from the pagination bar
pagecount = soup.find('div', {'class': 'paginationBarHolder'})
pagelist = pagecount.find_all('ul')
pagelistnew = []
for i in pagelist[0]:
    if i.text.isdigit():
        pagelistnew.append(int(i.text))
maxpage = max(pagelistnew)
print(maxpage)

# Reviews on the first page
for i in divlist:
    if i.find('span', {'itemprop': 'description'}):
        print('Review: ' + i.find('span', {'itemprop': 'description'}).text)
        ratinglist = i.find_all('div', {'class': 'star', 'style': 'width:12px;height:12px'})
        print('Rating: ' + str(len(ratinglist)))
        print('Date: ' + i.find('span', {'itemprop': 'datePublished'}).text)
        print()

# Reviews on the remaining pages
urlnew = 'https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari?sayfa='
for i in range(2, maxpage + 1):
    req = requests.get(urlnew + str(i), headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246", 'Access-Control-Allow-Origin': '*'}, allow_redirects=False)
    print('URL: ' + urlnew + str(i))
    print(req.status_code)
    print()
    soup2 = BeautifulSoup(req.content, 'html5lib')
    divlist1 = soup2.find_all("div", {'itemprop': 'review'})
    for l in divlist1:
        if l.find('span', {'itemprop': 'description'}):
            print('Review: ' + l.find('span', {'itemprop': 'description'}).text)
            ratinglist1 = l.find_all('div', {'class': 'star', 'style': 'width:12px;height:12px'})
            print('Rating: ' + str(len(ratinglist1)))
            print('Date: ' + l.find('span', {'itemprop': 'datePublished'}).text)
            print()
To get all reviews you can use their API url. For example:
import json
import requests

# HBV0000130VNO is from the URL in the question
url = "https://user-content-gw-hermes.hepsiburada.com/queryapi/v2/ApprovedUserContents?skuList=HBV0000130VNO&from={}&size=10"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}

from_ = 0
while True:
    data = requests.get(url.format(from_), headers=headers).json()

    # uncomment to see all data:
    # print(json.dumps(data, indent=4))

    for d in data["data"]["approvedUserContent"]["approvedUserContentList"]:
        print(d["customer"]["name"], d["customer"]["surname"])
        print(d["review"]["content"])
        print()

    if data["currentItemCount"] != 10:
        break

    from_ += 10
Prints:
h********** z*********
Mükemmel kalitede bir ürün
A*** T*****
tek kelimeyle EFSANE. Her zaman olduğu gibi Apple kalitesi 1 numara. Benim kullanımımla 2 gün gidiyor şarjı günlük iş maillerimi yanıtlıyor raporlarımı inceliyorum. Akşamda belki bir dizi-film açıp şarjını bitiriyorum. Performansa diyecek hiç bir şey yok. Bu mimari ve altyapı son derece muazzam çalışacak şekilde tasarlanmış zaten.
E*** D*******
Kesinlikle fiyatıyla performansının birbirini fazlasıyla karşıladığı şahane bir iş istasyonu. Malzeme yapısı ve içindeki yeni M1 Yonga seti işlemcisi ile uzun süreli kullanımı ve iş gücünü vaat ediyor. Ayrıca içinden çıkan 61W'lık Adaptörü ile hızlı şarj olması ayrı bir güzelliği. Hepsiburada'ya ve Lojistiği olan HepsiJet'e hızlı ve güvenilir oldukları için çok teşekkür ederim.
O**** B****
Mükemmel bir ürün. Kullanmadan önce bu ekosisteme karşı yargıları olanlar, şimdi macbook'uma tapıyorlar.
A**** A*********
2019 macbook airden geçiş yaptığım için söyleyebilirim ki özellikle Apple'ın m1 çip için optimize ettiği safari süper hızlı çalışıyor chrome kullanmayı bıraktım diyebilirim görüntü ses kalitesi çok daha iyi ve işlemci çok hızlı çok memnunum pil 20 saat gidiyormuş ben genelde güç kaynağına bağlı kullanıyorum
Y**** G*****
Ben satın aldığımda MBA 512 GB ve MBP 256 fiyatları birbirine çok yakındı. O yüzden MBP tercih ettim. Arada oyun oynamak isteyenler için MBA yerine kesinlikle tecih edilmesi gereken ürün. Çünkü bazı oyunlarda throttling problemi yaşanabiliyor (Tomb Raider oyunları gibi)
ş***** a****
Tek kelimeyle Apple kalitesiyle tasarlanmış şahane bir ürün; fakat Türkiye şartlarından dolayı sırtımıza ağır yükler bindiren kur farkından kaynaklanan kallavi fiyatı biraz üzüyor.
...and so on.
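If you would rather keep working with the HTML pages, one thing worth trying (an untested sketch; the assumption is that the redirect is cookie- or session-driven, since paging through the site's own links works) is to reuse a requests.Session that has already loaded the first review page, and to detect the redirect explicitly via the response history:

import requests
from bs4 import BeautifulSoup

base = 'https://www.hepsiburada.com/apple-macbook-pro-m1-cip-8gb-256gb-ssd-macos-13-qhd-tasinabilir-bilgisayar-uzay-grisi-myd82tu-a-p-HBV0000130VNO-yorumlari'
headers = {'User-Agent': 'Mozilla/5.0'}

with requests.Session() as s:
    s.get(base, headers=headers)  # load the first page so the session carries the site's cookies
    for page in range(2, 13):
        r = s.get(f'{base}?sayfa={page}', headers=headers)
        if r.history:  # a non-empty history means the request was redirected
            print(f'page {page}: redirected to {r.url}')
        soup = BeautifulSoup(r.content, 'html5lib')
        print(page, len(soup.find_all('div', {'itemprop': 'review'})), 'reviews found')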

How to scrape the page inside the result card using bs4?

<img class="no-img" data-src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium" alt="Biryani By Kilo" data-gatype="RestaurantImageClick" data-url="/delhi/biryani-by-kilo-connaught-place-central-delhi-40178" data-w-onclick="cardClickHandler" src="https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/h/u/p4059-15500352575c63a9394c209.jpg?tr=tr:n-medium">
page url - https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p=1
This page contains some restaurant cards. While scraping the page in the loop, I want to go inside each restaurant card's URL, which is in the above HTML code in the data-url attribute, and scrape the number of reviews from inside it. I don't know how to do it. My current code for scraping the front page is:
import re
import requests
from bs4 import BeautifulSoup

def extract(page):
    url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}"  # URL of the website
    header = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}  # temporary user agent
    r = requests.get(url, headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def transform(soup):  # function to scrape the page
    divs = soup.find_all('div', class_='restnt-card restaurant')
    for item in divs:
        title = item.find('a').text.strip()  # restaurant name
        loc = item.find('div', class_='restnt-loc ellipsis').text.strip()  # restaurant location
        try:  # some restaurants are unrated; without this we would run into an error while scraping them
            rating = item.find('div', class_="img-wrap").text
            rating = re.sub("[^0-9,.]", "", rating)
        except:
            rating = None
        price = item.find('span', class_="double-line-ellipsis").text.strip()  # price for biryani
        price = re.sub("[^0-9]", "", price)[:-1]
        biry_del = {
            'name': title,
            'location': loc,
            'rating': rating,
            'price': price
        }
        rest_list.append(biry_del)

rest_list = []
for i in range(1, 18):
    print(f'getting page, {i}')
    c = extract(i)
    transform(c)
I hope you understood; please ask in the comments if anything is unclear.
It's not very fast, but it looks like you can get all the details you want, including the review count (not 232!), if you hit this backend API endpoint:
https://www.dineout.co.in/get_rdp_data_main/delhi/69676/restaurant_detail_main
import requests
from bs4 import BeautifulSoup
import pandas as pd

rest_list = []
for page in range(1, 3):
    print(f'getting page, {page}')
    s = requests.Session()
    url = f"https://www.dineout.co.in/delhi-restaurants?search_str=biryani&p={page}"  # URL of the website
    header = {'User-Agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36'}  # temporary user agent
    r = s.get(url, headers=header)
    soup = BeautifulSoup(r.content, 'html.parser')
    divs = soup.find_all('div', class_='restnt-card restaurant')
    for item in divs:
        code = item.find('a')['href'].split('-')[-1]  # restaurant code
        print(f'Getting details for {code}')
        data = s.get(f'https://www.dineout.co.in/get_rdp_data_main/delhi/{code}/restaurant_detail_main').json()
        info = data['header']
        info.pop('share')    # clean up csv
        info.pop('options')
        rest_list.append(info)

df = pd.DataFrame(rest_list)
df.to_csv('dehli_rest.csv', index=False)
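The field names inside that JSON aren't shown above, so if you specifically want the review count, a simple first step (a sketch reusing the same endpoint, with the restaurant code taken from the example URL above) is to dump one response and see which keys data['header'] exposes:

import json
import requests

s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0'

# 69676 is the restaurant code from the endpoint example above
data = s.get('https://www.dineout.co.in/get_rdp_data_main/delhi/69676/restaurant_detail_main').json()

# Inspect the available fields; per the answer, the review count is among them
print(json.dumps(data['header'], indent=2))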

Extract href from lazy-loading page

I am trying to extract the hrefs from the Premier League website; however, I cannot seem to get any of the HTML links except those from the first page:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.premierleague.com/players/')
soup = BeautifulSoup(r.content, 'lxml')
#get the player index
table = soup.find('div', {'class': 'table playerIndex'})
#<a> is where the href is stored
href_names = [link.get('href') for link in table.findAll('a')]
football_string = 'https://www.premierleague.com'
#concatenate to get the full html link
[football_string + str(x) for x in href_names]
This only returns the first page. I have tried using Selenium, but the Premier League website shows an ad every time it loads, which prevents it from working. Any ideas on how to get all the HTML links?
If I understood your question right, the following approach should do it:
import requests

base = 'https://www.premierleague.com/players/{}/'
link = 'https://footballapi.pulselive.com/football/players'

payload = {
    'pageSize': '30',
    'compSeasons': '418',
    'altIds': 'true',
    'page': 0,
    'type': 'player',
    'id': '-1',
    'compSeasonId': '418'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    s.headers['referer'] = 'https://www.premierleague.com/'
    s.headers['origin'] = 'https://www.premierleague.com'
    while True:
        res = s.get(link, params=payload)
        if not res.json()['content']:
            break
        for item in res.json()['content']:
            print(base.format(int(item['id'])))
        payload['page'] += 1
Results are like (truncated):
https://www.premierleague.com/players/19970/
https://www.premierleague.com/players/13279/
https://www.premierleague.com/players/13286/
https://www.premierleague.com/players/10905/
https://www.premierleague.com/players/4852/
https://www.premierleague.com/players/4328/
https://www.premierleague.com/players/90665/
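If you want the links collected in a list (as in the original list comprehension) instead of printed, the same loop can be rearranged slightly; this sketch relies only on the 'id' field already used above:

import requests

base = 'https://www.premierleague.com/players/{}/'
link = 'https://footballapi.pulselive.com/football/players'
payload = {'pageSize': '30', 'compSeasons': '418', 'altIds': 'true',
           'page': 0, 'type': 'player', 'id': '-1', 'compSeasonId': '418'}

player_links = []
with requests.Session() as s:
    s.headers.update({'User-Agent': 'Mozilla/5.0',
                      'referer': 'https://www.premierleague.com/',
                      'origin': 'https://www.premierleague.com'})
    while True:
        content = s.get(link, params=payload).json()['content']
        if not content:
            break
        player_links.extend(base.format(int(item['id'])) for item in content)
        payload['page'] += 1

print(len(player_links))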

Scraping multiple pages on this site - HELP NEEDED

Hi, I would love to be able to scrape multiple pages of this website.
Can someone help me with how I can scrape through all the pages? I am only able to get information from one page.
import re
from requests import get
from bs4 import BeautifulSoup as bs

headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

for i in range(2000):
    Centris = 'https://www.centris.ca/en/commercial-units~for-rent~montreal-ville-marie/26349148?view=Summary'.format(i)
    r = get(Centris, headers=headers)
    soup = bs(r.text, 'html.parser')
    results = soup.find_all('div', attrs={'id': 'divMainResult'})

    data = []
    for result in results:
        titre = result.find('span', attrs={'data-id': 'PageTitle'})
        titre = [str(titre.string).strip() for titre in titre]
        superficie = result.find('div', attrs={'class': 'carac-value'}, string=re.compile('sqft'))
        superficie = [str(superficie.string).strip() for superficie in superficie]
        emplacement = result.find_all('h2', attrs={'class': 'pt-1'})
        emplacement = [str(emplacement.string).strip() for emplacement in emplacement]
        prix = result.find_all('span', attrs={'class': 'text-nowrap'})
        prix = [(prix.text).strip('\w.') for prix in prix]
        description = result.find_all('div', attrs={'itemprop': 'description'})
        description = [str(description.string).strip() for description in description]
        lien = result.find_all('a', attrs={'class': 'dropdown-item js-copy-clipboard'})
To get pagination working you can simulate Ajax requests with requests module:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.centris.ca/Property/GetInscriptions"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}
json_data = {"startPosition": 0}
with requests.session() as s:
# load cookies:
s.get(
"https://www.centris.ca/en/commercial-units~for-rent?uc=0",
headers=headers,
)
for page in range(0, 100, 20): # <-- increase number of pages here
json_data["startPosition"] = page
data = s.post(url, headers=headers, json=json_data).json()
soup = BeautifulSoup(data["d"]["Result"]["html"], "html.parser")
for a in soup.select(".a-more-detail"):
print(a.select_one(".category").get_text(strip=True))
print(a.select_one(".address").get_text(strip=True, separator="\n"))
print("https://www.centris.ca" + a["href"])
print("-" * 80)
Prints:
Commercial unit for rent
6560, Avenue de l'Esplanade, suite 105
Montréal (Rosemont/La Petite-Patrie)
Neighbourhood La Petite-Patrie
https://www.centris.ca/en/commercial-units~for-rent~montreal-rosemont-la-petite-patrie/16168393?view=Summary
--------------------------------------------------------------------------------
Commercial unit for rent
75, Rue Principale
Gatineau (Aylmer)
Neighbourhood Vieux Aylmer, Des Cèdres, Marina
https://www.centris.ca/en/commercial-units~for-rent~gatineau-aylmer/22414903?view=Summary
--------------------------------------------------------------------------------
Commercial building for rent
53, Rue Saint-Pierre, suite D
Saint-Pie
https://www.centris.ca/en/commercial-buildings~for-rent~saint-pie/15771470?view=Summary
--------------------------------------------------------------------------------
...and so on.
Thank you so much! I came up with this and it worked perfectly:
import json
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.centris.ca/Property/GetInscriptions"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
}
json_data = {"startPosition": 0}

with requests.session() as s:
    Centris = []
    # load cookies:
    s.get(
        "https://www.centris.ca/en/commercial-units~for-rent?uc=0",
        headers=headers,
    )
    for page in range(0, 100, 20):  # <-- increase number of pages here
        json_data["startPosition"] = page
        data = s.post(url, headers=headers, json=json_data).json()
        soup = BeautifulSoup(data["d"]["Result"]["html"], "html.parser")
        for a in soup.select(".a-more-detail"):
            titre = a.select_one(".category").get_text(strip=True)
            emplacement = a.select_one(".address").get_text(strip=True, separator="\n")
            lien = "https://www.centris.ca" + a["href"]
            prix = a.select_one(".price").get_text(strip=True)
            Centris.append((titre, emplacement, lien, prix))

df = pd.DataFrame(Centris, columns=['Titre', 'Emplacement', 'Lien', 'Prix'])
df.to_excel('Centris.xlsx', index=False)
print('Data Saved To excel')

Python Web Scraping / Beautiful Soup, with list of keywords at the end of URL

I'm trying to build a web scraper to get the reviews of wines from Vivino.com. I have a large list of wines and want it to search
url = ("https://www.vivino.com/search/wines?q=")
then cycle through the list, scraping the rating text ("4.5 - 203 reviews"), the name of the wine, and the link to its page.
I found about 20 lines of code at https://www.kashifaziz.me/web-scraping-python-beautifulsoup.html/ for building a web scraper and was trying to adapt it:
url = ("https://www.vivino.com/search/wines?q=")
#list having the keywords (made by splitting input with space as its delimiter)
keyword = input().split()
#go through the keywords
for key in keywords :
#everything else is same logic
r = requests.get(url + key)
print("URL :", url+key)
if 'The specified profile could not be found.' in r.text:
print("This is available")
else :
print('\nSorry that one is taken')
Also, where would I include the list of keywords?
I'd love any help with this! I'm trying to get better at Python, but I'm not sure I'm at this level yet, haha.
Thank you.
This script traverses all pages for the selected keyword and collects the title, price, rating, number of reviews, and link for each wine:
import re
import requests
from time import sleep
from bs4 import BeautifulSoup

url = 'https://www.vivino.com/search/wines?q={kw}&start={page}'
prices_url = 'https://www.vivino.com/prices'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}

def get_wines(kw):
    with requests.session() as s:
        page = 1
        while True:
            soup = BeautifulSoup(s.get(url.format(kw=kw, page=page), headers=headers).content, 'html.parser')
            if not soup.select('.default-wine-card'):
                break
            params = {'vintages[]': [wc['data-vintage'] for wc in soup.select('.default-wine-card')]}
            prices_js = s.get(prices_url, params=params, headers={
                'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
                'X-Requested-With': 'XMLHttpRequest',
                'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01'
            }).text
            wine_prices = dict(re.findall(r"\$\('\.vintage-price-id-(\d+)'\)\.find\( '\.wine-price-value' \)\.text\( '(.*?)' \);", prices_js))

            for wine_card in soup.select('.default-wine-card'):
                title = wine_card.select_one('.header-smaller').get_text(strip=True, separator=' ')
                price = wine_prices.get(wine_card['data-vintage'], '-')
                average = wine_card.select_one('.average__number')
                average = average.get_text(strip=True) if average else '-'
                ratings = wine_card.select_one('.text-micro')
                ratings = ratings.get_text(strip=True) if ratings else '-'
                link = 'https://www.vivino.com' + wine_card.a['href']
                yield title, price, average, ratings, link

            sleep(3)
            page += 1

kw = 'angel'
for title, price, average, ratings, link in get_wines(kw):
    print(title)
    print(price)
    print(average + ' / ' + ratings)
    print(link)
    print('-' * 80)
Prints:
Angél ica Zapata Malbec Alta
-
4,4 / 61369 ratings
https://www.vivino.com/wines/1469874
--------------------------------------------------------------------------------
Château d'Esclans Whispering Angel Rosé
16,66
4,1 / 38949 ratings
https://www.vivino.com/wines/1473981
--------------------------------------------------------------------------------
Angél ica Zapata Cabernet Sauvignon Alta
-
4,3 / 27699 ratings
https://www.vivino.com/wines/1471376
--------------------------------------------------------------------------------
... and so on.
EDIT: To search for particular wines only, you can put the keywords in a list and then look up each wine in a loop:
import re
import requests
from time import sleep
from bs4 import BeautifulSoup

url = 'https://www.vivino.com/search/wines?q={kw}&start={page}'
prices_url = 'https://www.vivino.com/prices'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}

def get_wines(kw):
    with requests.session() as s:
        page = 1
        while True:
            soup = BeautifulSoup(s.get(url.format(kw=kw, page=page), headers=headers).content, 'html.parser')
            if not soup.select('.default-wine-card'):
                break
            params = {'vintages[]': [wc['data-vintage'] for wc in soup.select('.default-wine-card')]}
            prices_js = s.get(prices_url, params=params, headers={
                'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0',
                'X-Requested-With': 'XMLHttpRequest',
                'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01'
            }).text
            wine_prices = dict(re.findall(r"\$\('\.vintage-price-id-(\d+)'\)\.find\( '\.wine-price-value' \)\.text\( '(.*?)' \);", prices_js))

            no = 1
            for no, wine_card in enumerate(soup.select('.default-wine-card'), 1):
                title = wine_card.select_one('.header-smaller').get_text(strip=True, separator=' ')
                price = wine_prices.get(wine_card['data-vintage'], '-')
                average = wine_card.select_one('.average__number')
                average = average.get_text(strip=True) if average else '-'
                ratings = wine_card.select_one('.text-micro')
                ratings = ratings.get_text(strip=True) if ratings else '-'
                link = 'https://www.vivino.com' + wine_card.a['href']
                yield title, price, average, ratings, link

            # if no < 20:
            #     break
            # sleep(3)

            page += 1

wines = ['10 SPAN VINEYARDS CABERNET SAUVIGNON CENTRAL COAST',
         '10 SPAN VINEYARDS CHARDONNAY CENTRAL COAST']

for wine in wines:
    for title, price, average, ratings, link in get_wines(wine):
        print(title)
        print(price)
        print(average + ' / ' + ratings)
        print(link)
        print('-' * 80)
Prints:
10 Span Vineyards Cabernet Sauvignon
-
3,7 / 557 ratings
https://www.vivino.com/wines/4535453
--------------------------------------------------------------------------------
10 Span Vineyards Chardonnay
-
3,7 / 150 ratings
https://www.vivino.com/wines/5815131
--------------------------------------------------------------------------------
import requests

# list of keywords (made by splitting the input with a space as its delimiter)
keywords = input().split()

# go through the keywords
for key in keywords:
    url = "https://www.vivino.com/search/wines?q={}".format(key)
    # everything else is the same logic
    r = requests.get(url)
    print("URL :", url)
    if 'The specified profile could not be found.' in r.text:
        print("This is available")
    else:
        print('\nSorry that one is taken')
For the list of keywords, you can use a text file with one keyword per line.
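For example (a minimal sketch, assuming a file named keywords.txt with one keyword per line):

import requests

# keywords.txt is a hypothetical filename; put one keyword per line in it
with open('keywords.txt', encoding='utf-8') as f:
    keywords = [line.strip() for line in f if line.strip()]

for key in keywords:
    url = "https://www.vivino.com/search/wines?q={}".format(key)
    r = requests.get(url)
    print("URL :", url, "->", r.status_code)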
