How can I get the href from a row - Python

I'm building a Telegram bot, and I need to get links from HTML.
I want to take the href for Matches from this website: https://www.hltv.org/matches
My previous code is:
elif message.text == "Matches":
    url_news = "https://www.hltv.org/matches"
    response = requests.get(url_news)
    soup = BeautifulSoup(response.content, "html.parser")
    match_info = []
    match_items = soup.find("div", class_="upcomingMatchesSection")
    print(match_items)
    for item in match_items:
        match_info.append({
            "link": item.find("div", class_="upcomingMatch").text,
            "title": item["href"]
        })
And I don't know how I can get the links from this body. I'd appreciate any help.

What happens?
You try to iterate over match_items, but there is nothing to iterate over, because you only selected the section containing the matches, not the matches themselves.
How to fix?
Select the upcomingMatch elements instead and iterate over them:
match_items = soup.select("div.upcomingMatchesSection div.upcomingMatch")
To get the URL, you have to select an <a>:
item.a["href"]
Example
from bs4 import BeautifulSoup
import requests

url_news = "https://www.hltv.org/matches"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url_news, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
match_info = []
match_items = soup.select("div.upcomingMatchesSection div.upcomingMatch")
for item in match_items:
    match_info.append({
        "title": item.get_text('|', strip=True),
        "link": item.a["href"]
    })
match_info
Output
[{'title': '09:00|bo3|1WIN|K23|Pinnacle Fall Series 2|Odds',
'link': '/matches/2352066/1win-vs-k23-pinnacle-fall-series-2'},
{'title': '09:00|bo3|INDE IRAE|Nemiga|Pinnacle Fall Series 2|Odds',
'link': '/matches/2352067/inde-irae-vs-nemiga-pinnacle-fall-series-2'},
{'title': '10:00|bo3|OPAA|Nexus|Malta Vibes Knockout Series 3|Odds',
'link': '/matches/2352207/opaa-vs-nexus-malta-vibes-knockout-series-3'},
{'title': '11:00|bo3|Checkmate|TBC|Funspark ULTI 2021 Asia Regional Series 3|Odds',
'link': '/matches/2352092/checkmate-vs-tbc-funspark-ulti-2021-asia-regional-series-3'},
{'title': '11:00|bo3|ORDER|Alke|ESEA Premier Season 38 Australia|Odds',
'link': '/matches/2352122/order-vs-alke-esea-premier-season-38-australia'},...]
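The hrefs you get back are relative, so for a Telegram bot you will probably want absolute URLs. A minimal sketch of how you could join them with the site root (the match_info list comes from the example above):

from urllib.parse import urljoin

base = "https://www.hltv.org"
for match in match_info:
    # '/matches/...' -> 'https://www.hltv.org/matches/...'
    match["link"] = urljoin(base, match["link"])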

You can try this out.
All the match information is present inside a <div> with the class name upcomingMatch.
Select all those <div> elements and, from each one, extract the match link, which is inside the <a> tag with the class name match.
Here is the code:
import requests
from bs4 import BeautifulSoup

url_news = "https://www.hltv.org/matches"
headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
response = requests.get(url_news, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
match_items = soup.find_all("div", class_="upcomingMatch")
for match in match_items:
    link = match.find('a', class_='match a-reset')['href']
    print(f'Link: {link}')
Link: /matches/2352235/malta-vibes-knockout-series-3-quarter-final-1-malta-vibes-knockout-series-3
Link: /matches/2352098/pinnacle-fall-series-2-quarter-final-2-pinnacle-fall-series-2
Link: /matches/2352236/malta-vibes-knockout-series-3-quarter-final-2-malta-vibes-knockout-series-3
Link: /matches/2352099/pinnacle-fall-series-2-quarter-final-3-pinnacle-fall-series-2
.
.
.
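Not every upcomingMatch card is guaranteed to contain that <a> (cards for matches with teams still to be confirmed may lack it), so a slightly defensive variant of the loop could look like this (a sketch reusing match_items from above; the missing-link case is an assumption):

for match in match_items:
    a = match.find('a', class_='match a-reset')
    if a and a.get('href'):  # skip cards without a link
        print(f"Link: {a['href']}")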

Related

Scraping/downloading all the product image URLs from the eBay site using R or Python

I'm only able to scrape the URL of one full-resolution image from the eBay site; however, I'm unable to capture the URLs of all the other images.
I'm looking for a script that scrapes or downloads all of the images.
I want to download high-resolution photographs, not thumbnails.
Code:
from lxml import html
import requests
from bs4 import BeautifulSoup
import pandas as pd

main_url = 'https://www.ebay.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}
url = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=laptop&_sacat=0&LH_TitleDesc=0&rt=nc&_odkw=toaster&_osacat=0&LH_PrefLoc=3&LH_All=1&_ipg=240'
r = requests.get(url, headers=headers)
print(r)
soup = BeautifulSoup(r.content, 'html.parser')
product_list = soup.find_all('div', class_='s-item__image')
products_site = []
for item in product_list:
    for link in item.find_all('a', href=True):
        products_site.append(link['href'])
products_site = list(dict.fromkeys(products_site))
products_site = list(filter(None, products_site))
products_site = [x for x in products_site if x.startswith('https://www.ebay.com/itm/')][:2]
print(len('product_site'))
item_list = []
for link in products_site:
    r = requests.get(link, headers=headers)
    print(r)
    soup = BeautifulSoup(r.content, 'html.parser')
    Title = soup.select_one('h1', class_='x-item-title__mainTitle').get_text(strip=True)
    Image_URL = [x['src'] for x in soup.findAll('img', {'id': 'icImg'})]
    Product = {
        "Title": Title,
        "Image_URL": Image_URL
    }
The URL of the images stays the same on eBay.
To get all the images of a product in high resolution, you can simply change the dimension suffix of the different thumbnails to get HQ images.
For example:
https://i.ebayimg.com/images/g/pxcAAOSwis1hwW4V/s-l64.jpg
The trailing s-l64 before .jpg denotes the resolution, here 64px. You can change this to s-l100, s-l300 or s-l500 to increase the resolution; the highest resolution it supports is s-l2000.
So you can just replace the thumbnail's s-l64 with s-l2000 to get HQ images.
Using this trick you don't need to click on the images to zoom in and get HQ images.
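Thumbnails do not always come as s-l64, so a small helper that normalizes any s-l<N> suffix may be handy (a hedged sketch; the assumption is that the size marker always sits right before the file extension):

import re

def to_hq(url):
    # replace any trailing s-l<digits> size marker with the maximum s-l2000
    return re.sub(r's-l\d+(?=\.\w+$)', 's-l2000', url)

print(to_hq('https://i.ebayimg.com/images/g/pxcAAOSwis1hwW4V/s-l64.jpg'))
# https://i.ebayimg.com/images/g/pxcAAOSwis1hwW4V/s-l2000.jpg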
Full working code -
import requests
from bs4 import BeautifulSoup

main_url = 'https://www.ebay.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}
url = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw=laptop&_sacat=0&LH_TitleDesc=0&rt=nc&_odkw=toaster&_osacat=0&LH_PrefLoc=3&LH_All=1&_ipg=240'
r = requests.get(url, headers=headers)
print(r)
soup = BeautifulSoup(r.content, 'html.parser')
product_list = soup.find_all('div', class_='s-item__image')

products_site = []
for item in product_list:
    for link in item.find_all('a', href=True):
        products_site.append(link['href'])

products_site = list(dict.fromkeys(products_site))  # drop duplicates, keep order
products_site = list(filter(None, products_site))   # drop empty hrefs
products_site = [x for x in products_site if x.startswith('https://www.ebay.com/itm/')][:2]
print(len('product_site'))  # note: this prints the length of the string literal (12), not len(products_site)

item_list = []
for link in products_site:
    # print(link)
    r = requests.get(link, headers=headers)
    print(r)
    soup = BeautifulSoup(r.content, 'html.parser')
    # select_one() takes a CSS selector, so the class filter belongs in the selector itself
    Title = soup.select_one('h1.x-item-title__mainTitle, h1').get_text(strip=True)
    # example page - https://www.ebay.com/itm/125058259597?epid=4051542538&hash=item1d1e0d9a8d:g:pxcAAOSwis1hwW4V
    image_urls = [i.get('src').replace('s-l64', 's-l2000')
                  for i in soup.select('ul#vertical-align-items-viewport > li img')]
    if len(image_urls) == 0:
        # example page with no extra images
        # https://www.ebay.com/itm/125287169558?epid=19053326726&hash=item1d2bb27e16:g:sRAAAOSwKV9ia3Ia
        image_urls = set([x['src'] for x in soup.findAll('img', {'id': 'icImg'})])  # remove duplicate images
    product = {
        "Title": Title,
        "Image_URL": image_urls
    }
    print(product)
Output -
<Response [200]>
12
<Response [200]>
{'Title': 'Lenovo Legion 5 Pro 16 165Hz QHD IPS G-Sync Ryzen 7 16GB RAM 1TB SSD RTX 3070', 'Image_URL': ['https://i.ebayimg.com/images/g/pxcAAOSwis1hwW4V/s-l2000.jpg', 'https://i.ebayimg.com/images/g/UWEAAOSwLslhwW4V/s-l2000.jpg', 'https://i.ebayimg.com/images/g/sOIAAOSwANNhwW4V/s-l2000.jpg', 'https://i.ebayimg.com/images/g/SOIAAOSwwORhwW4V/s-l2000.jpg', 'https://i.ebayimg.com/images/g/g7kAAOSwhzNhwW4V/s-l2000.jpg', 'https://i.ebayimg.com/images/g/HjsAAOSw6pxhvXmX/s-l2000.jpg', 'https://i.ebayimg.com/images/g/OSQAAOSwAvVhwW4V/s-l2000.jpg', 'https://i.ebayimg.com/images/g/pHAAAOSwjnJhwW4V/s-l2000.jpg', '//p.ebaystatic.com/aw/pics/cmp/icn/iconImgNA_96x96.gif', '//p.ebaystatic.com/aw/pics/cmp/icn/iconImgNA_96x96.gif']}
<Response [200]>
{'Title': '\ufeff\ufeffLenovo IdeaPad Gaming 3 15.6" 120Hz i5-11300H 8GB RAM 512GB SSD GTX 1650', 'Image_URL': {'https://i.ebayimg.com/images/g/sRAAAOSwKV9ia3Ia/s-l500.jpg'}}
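If you actually want to download the images rather than just collect their URLs, here is a minimal sketch (assuming the image_urls list and headers from the code above; note the placeholder icons are protocol-relative and need an https: prefix):

import os
import requests

os.makedirs('images', exist_ok=True)
for n, img_url in enumerate(image_urls):
    if img_url.startswith('//'):  # protocol-relative placeholder icons
        img_url = 'https:' + img_url
    resp = requests.get(img_url, headers=headers, timeout=15)
    if resp.ok:
        with open(os.path.join('images', f'{n}.jpg'), 'wb') as f:
            f.write(resp.content)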

(Beginner) Python web scraping BeautifulSoup

How do I scrape text from an element with multiple attributes?
<h2 class="_63-j _1rimQ" data-qa="heading">Popular Dishes</h2>
I used this:
category = soup.find(name="h2", attrs={"class":"_63-j _1rimQ","data-qa":"heading"}).getText()
but it returns an error:
AttributeError: 'NoneType' object has no attribute 'getText'
The same error is returned when using this:
category = soup.find(name="h2",class_="_63-j _1rimQ")
from bs4 import BeautifulSoup as bs
html = """<h2 class="_63-j _1rimQ" data-qa="heading">Popular Dishes</h2>"""
soup = bs(html, 'html.parser')
soup.find('h2', class_='_63-j _1rimQ').getText()  # 'Popular Dishes'
This works very well here. Maybe it's the 'html.parser'?
(Tested with BeautifulSoup 4.10.0, Python 3.10.2.)
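One quick way to tell whether it is a parser problem or missing content is to check whether the class ever appears in the raw HTML the server returns (a hedged sketch; the URL and headers are placeholders for your real ones):

import requests

url = 'https://www.example.com/restaurant-page'  # hypothetical target page
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
# False here means the heading is rendered client-side and bs4 alone cannot see it
print('_63-j _1rimQ' in html)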
The content you wish to get from that page is generated dynamically, so BeautifulSoup will not help you grab it. The browser issues a request to an API endpoint. The following is how you can achieve the same using requests:
import requests

link = 'https://cw-api.takeaway.com/api/v29/restaurant'
params = {
    'slug': 'c-pizza-c-kebab'
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-country-code': 'fr',
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link, params=params)
    container = res.json()['menu']['products']
    for key, val in container.items():
        print(val['name'])
Output (truncated):
Kebab veau
Pot de kebabs
Pot de frites
Margherita
Bambino
Reine
Sicilienne
Végétarienne
Calzone soufflée jambon
Calzone soufflée bœuf haché
Pêcheur

List created from BeautifulSoup contains duplicate entries -- need a unique link list

I am trying to create a list containing all unique year links from a website (see below).
When I execute the append function it gives me a huge list containing duplicate entries.
I need to get a list containing only the unique year links.
The website: https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
Code written so far:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
            "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
            "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    print(link.get('href'))
    #year.append(link.get('href'))
#print(year)
The desired result would look like this (but I need this in list format):
https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/83-2022.html
/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/71-2020.html
/apofaseis-gnomodotiseis/itemlist/category/4-2019.html
/apofaseis-gnomodotiseis/itemlist/category/5-2018.html
/apofaseis-gnomodotiseis/itemlist/category/6-2017.html
/apofaseis-gnomodotiseis/itemlist/category/7-2016.html
/apofaseis-gnomodotiseis/itemlist/category/8-2015.html
/apofaseis-gnomodotiseis/itemlist/category/9-2014.html
/apofaseis-gnomodotiseis/itemlist/category/10-2013.html
/apofaseis-gnomodotiseis/itemlist/category/11-2012.html
/apofaseis-gnomodotiseis/itemlist/category/12-2011.html
/apofaseis-gnomodotiseis/itemlist/category/13-2010.html
/apofaseis-gnomodotiseis/itemlist/category/18-2009.html
/apofaseis-gnomodotiseis/itemlist/category/19-2008.html
/apofaseis-gnomodotiseis/itemlist/category/20-2007.html
/apofaseis-gnomodotiseis/itemlist/category/21-2006.html
/apofaseis-gnomodotiseis/itemlist/category/22-2005.html
/apofaseis-gnomodotiseis/itemlist/category/23-2004.html
/apofaseis-gnomodotiseis/itemlist/category/24-2003.html
/apofaseis-gnomodotiseis/itemlist/category/25-2002.html
/apofaseis-gnomodotiseis/itemlist/category/26-2001.html
/apofaseis-gnomodotiseis/itemlist/category/27-2000.html
/apofaseis-gnomodotiseis/itemlist/category/44-1999.html
/apofaseis-gnomodotiseis/itemlist/category/45-1998.html
/apofaseis-gnomodotiseis/itemlist/category/48-1997.html
/apofaseis-gnomodotiseis/itemlist/category/47-1996.html
/apofaseis-gnomodotiseis/itemlist/category/46-1995.html
/apofaseis-gnomodotiseis/itemlist/category/49-1994.html
Edit: I am trying to create a case list for every year in the year list.
Code:
# 1) Create a year list (year = [])
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re

total_cases = []

# URL to scrape
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
            "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
            "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    if link.get('href') not in year:
        year.append(link.get('href'))
print(year)

# 2) Create a case list
case = []
for link in soup.find_all('a', href=lambda href: href and "apofasi" in href):
    if link.get('href') not in case:
        case.append(link.get('href'))
print(case)

# Trying to create a case list for every year in the year list
# A) Get every year link in the year list
for year_link in year:
    headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
                "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
                "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
    page = requests.get(year_link, headers=headers1)
    soup2 = BeautifulSoup(page.content, "html.parser")
    print(year)
    # B) Get every case link for every case in a fixed year
    for case_link in case:
        total_cases.append(case_link)
# Get case link for every case for every year_link (element of year[])
???
EDIT 2:
When I try to run the code you (HedgeHog) so kindly posted, it gives me this error:
--------------------------------------------------------------------------
FeatureNotFound Traceback (most recent call last)
C:\Users\ARISTE~1\AppData\Local\Temp/ipykernel_13944/1621925083.py in <module>
8 "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
9 page = requests.get(URL, headers = headers)
---> 10 soup = BeautifulSoup(page.content,'lxml')
11
12 baseUrl = 'https://www.epant.gr'
~\Documents\conda\envs\conda\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
243 builder_class = builder_registry.lookup(*features)
244 if builder_class is None:
--> 245 raise FeatureNotFound(
246 "Couldn't find a tree builder with the features you "
247 "requested: %s. Do you need to install a parser library?"
FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Any ideas? Thanks!
EDIT
(The FeatureNotFound traceback from your second edit just means the lxml parser is not installed in that environment; install it, e.g. pip install lxml, or pass "html.parser" to BeautifulSoup, as the example below does.)
Based on your question edits I would recommend using a dict instead of all these lists. The following example creates a data dictionary with the years as keys, each holding its own URL and a list of case URLs.
Example
from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
           "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
           "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

baseUrl = 'https://www.epant.gr'
data = {}
for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    data[href.split('-')[-1].split('.')[0]] = {
        'url': f'{baseUrl}{href}'
    }
    data[href.split('-')[-1].split('.')[0]]['cases'] = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]
data
Output
{'2022': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/83-2022.html',
'cases': []},
'2021': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html',
'cases': ['https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1575-apofasi-738-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1624-apofasi-737-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1510-apofasi-735-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1595-apofasi-733-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1600-apofasi-732-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1451-apofasi-730-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1508-apofasi-728-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1584-apofasi-727-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1586-apofasi-726-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1583-apofasi-725-2021.html']},...}
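Since you already import pandas and csv, here is a sketch of how that dict could be flattened for export (the row shape, one case URL per row, is my assumption):

import pandas as pd

# flatten {year: {'url': ..., 'cases': [...]}} into one row per case
rows = [{'year': y, 'case_url': c} for y, d in data.items() for c in d['cases']]
pd.DataFrame(rows).to_csv('cases.csv', index=False)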
How to fix?
Just check whether the link is already in your list of links; if it is not, append it:
if link.get('href') not in year:
    year.append(link.get('href'))
Note
"The desired result would look like this (but I need this in list format)"
This is not a list in the data-structure sense; it is the printed form of each single element of a list.
Alternative
Example
from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
            "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
            "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    if link.get('href') not in year:
        year.append(link.get('href'))
print(year)
Output
['https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/83-2022.html', '/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/71-2020.html', '/apofaseis-gnomodotiseis/itemlist/category/4-2019.html', '/apofaseis-gnomodotiseis/itemlist/category/5-2018.html', '/apofaseis-gnomodotiseis/itemlist/category/6-2017.html', '/apofaseis-gnomodotiseis/itemlist/category/7-2016.html', '/apofaseis-gnomodotiseis/itemlist/category/8-2015.html', '/apofaseis-gnomodotiseis/itemlist/category/9-2014.html', '/apofaseis-gnomodotiseis/itemlist/category/10-2013.html', '/apofaseis-gnomodotiseis/itemlist/category/11-2012.html', '/apofaseis-gnomodotiseis/itemlist/category/12-2011.html', '/apofaseis-gnomodotiseis/itemlist/category/13-2010.html', '/apofaseis-gnomodotiseis/itemlist/category/18-2009.html', '/apofaseis-gnomodotiseis/itemlist/category/19-2008.html', '/apofaseis-gnomodotiseis/itemlist/category/20-2007.html', '/apofaseis-gnomodotiseis/itemlist/category/21-2006.html', '/apofaseis-gnomodotiseis/itemlist/category/22-2005.html', '/apofaseis-gnomodotiseis/itemlist/category/23-2004.html', '/apofaseis-gnomodotiseis/itemlist/category/24-2003.html', '/apofaseis-gnomodotiseis/itemlist/category/25-2002.html', '/apofaseis-gnomodotiseis/itemlist/category/26-2001.html', '/apofaseis-gnomodotiseis/itemlist/category/27-2000.html', '/apofaseis-gnomodotiseis/itemlist/category/44-1999.html', '/apofaseis-gnomodotiseis/itemlist/category/45-1998.html', '/apofaseis-gnomodotiseis/itemlist/category/48-1997.html', '/apofaseis-gnomodotiseis/itemlist/category/47-1996.html', '/apofaseis-gnomodotiseis/itemlist/category/46-1995.html', '/apofaseis-gnomodotiseis/itemlist/category/49-1994.html']
Use a set as the intermediate storage for the hrefs, then convert it to a list later.
from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
            "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
            "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "lxml")
year = set()
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    year.add(link.get('href'))
print(list(year))
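A set does not preserve the order in which the links appear on the page; if that matters, dict.fromkeys deduplicates while keeping first-seen order (an alternative sketch using the same soup):

hrefs = (link.get('href') for link in soup.find_all('a', href=lambda href: href and "category" in href))
year = list(dict.fromkeys(hrefs))  # unique, in first-seen order
print(year)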

How to extract this dictionary using Beautiful Soup

I want to get the last element of the dictionary (pasted below) into a variable; it's inside another dictionary under "offers", and I have no clue how to extract it.
html = s.get(url=url, headers=headers, verify=False, timeout=15)
soup = BeautifulSoup(html.text, 'html.parser')
products = soup.find_all('script', {'type': "application/ld+json"})
{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"} (...)
As mentioned, extract the contents via BeautifulSoup and decode the string with json.loads():
import json
products = '{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"}]}'
products = json.loads(products)
To get the last element (dict) in offers:
products['offers'][-1]
Output:
{'#type': 'Offer',
'availability': 'http://schema.org/OutOfStock',
'price': '489',
'priceCurrency': 'PLN',
'sku': 'NE215O06U-A110002000',
'url': '/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html'}
Example
In your special case you also have to replace the &quot; entities with real quotes first:
from bs4 import BeautifulSoup
import requests, json

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
           "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
           "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
html = requests.get('https://www.zalando.de/new-balance-550-unisex-sneaker-low-whitered-ne215o06u-a11.html', headers=headers)
soup = BeautifulSoup(html.content, 'lxml')
jsonData = json.loads(soup.select_one('script[type="application/ld+json"]').text.replace('&quot;', '"'))
jsonData['offers'][-1]
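If the page carries several ld+json blocks (as the find_all in the question suggests), here is a hedged sketch for parsing them all and keeping only the product entries (the '#type' key is copied from the sample above; most sites use '@type'):

import json

products = []
for tag in soup.find_all('script', {'type': 'application/ld+json'}):
    try:
        entry = json.loads(tag.text.replace('&quot;', '"'))
    except json.JSONDecodeError:
        continue  # skip blocks that are not clean JSON
    if entry.get('#type') == 'Product':  # '#type' as in the sample; usually '@type'
        products.append(entry)

for p in products:
    print(p['offers'][-1])  # last offer of each product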

Unable to scrape website pages with unchanged URL - Python

I'm trying to get the names of all the games on this website: https://slotcatalog.com/en/The-Best-Slots#anchorFltrList. To do so I'm using the following code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "https://slotcatalog.com/en/The-Best-Slots#anchorFltrList"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
data = []
table = soup.find_all('div', attrs={'class': 'providerCard'})
for game in range(0, len(table) - 1):
    print(table[game].find('a')['title'])
and I get what I want.
I would like to replicate the same across all pages available on the website, but given that the URL does not change, I looked at the network (XHR) events happening on the page when clicking on a different page, and I tried to send a request using the following code:
for page_no in range(1, 100):
    data = {
        "blck": "fltrGamesBlk",
        "ajax": "1",
        "lang": "end",
        "p": str(page_no),
        "translit": "The-Best-Slots",
        "tag": "TOP",
        "dt1": "",
        "dt2": "",
        "sorting": "SRANK",
        "cISO": "GB",
        "dt_period": "",
        "rtp_1": "50.00",
        "rtp_2": "100.00",
        "max_exp_1": "2.00",
        "max_exp_2": "250000.00",
        "min_bet_1": "0.01",
        "min_bet_2": "5.00",
        "max_bet_1": "3.00",
        "max_bet_2": "10000.00"
    }
    page = requests.post('https://slotcatalog.com/index.php',
                         data=data,
                         headers={'Host': 'slotcatalog.com',
                                  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'})
    soup = BeautifulSoup(page.content, 'html.parser')
    for row in soup.find_all('div', attrs={'class': 'providerCard'}):
        name = row.find('a')['title']
        print(name)
result : ("KeyError: 'title'") - meaning that its not finding the class "providerCard".
Has the request to the website been done in the wrong way? If so, where should i change the code?
thanks in advance
Alright, so, you had a typo. XD It was the "lang":"end" in the payload; it should have been "lang": "en", among other things.
Anyhow, I've cleaned your code up a bit and it works as expected. You can keep looping for all the games if you want.
import requests
from bs4 import BeautifulSoup

headers = {
    "referer": "https://slotcatalog.com/en/The-Best-Slots",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/50.0.2661.102 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}
payload = {
    "blck": "fltrGamesBlk",
    "ajax": "1",
    "lang": "en",
    "p": 1,
    "translit": "The-Best-Slots",
    "tag": "TOP",
    "dt1": "",
    "dt2": "",
    "sorting": "SRANK",
    "cISO": "EN",
    "dt_period": "",
    "rtp_1": "50.00",
    "rtp_2": "100.00",
    "max_exp_1": "2.00",
    "max_exp_2": "250000.00",
    "min_bet_1": "0.01",
    "min_bet_2": "5.00",
    "max_bet_1": "3.00",
    "max_bet_2": "10000.00"
}
page = requests.post(
    "https://slotcatalog.com/index.php",
    data=payload,
    headers=headers,
)
soup = BeautifulSoup(page.content, "html.parser")
print([i.get("title") for i in soup.find_all("a", {"class": "providerName"})])
Output (for page 1 only):
['Starburst', 'Bonanza', 'Rainbow Riches', 'Book of Dead', "Fishin' Frenzy", 'Wolf Gold', 'Twin Spin', 'Slingo Rainbow Riches', "Gonzo's Quest", "Gonzo's Quest Megaways", 'Eye of Horus (Reel Time Gaming)', 'Age of the Gods God of Storms', 'Lightning Roulette', 'Buffalo Blitz', "Fishin' Frenzy Megaways", 'Fluffy Favourites', 'Blue Wizard', 'Legacy of Dead', '9 Pots of Gold', 'Buffalo Blitz II', 'Cleopatra (IGT)', 'Quantum Roulette', 'Reel King Mega', 'Mega Moolah', '7s Deluxe', "Rainbow Riches Pick'n'Mix", "Shaman's Dream"]
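To actually walk all the pages, the cleaned-up request can be repeated with an increasing "p" until a page comes back without any cards (a sketch reusing payload and headers from above; the stop condition is an assumption):

all_games = []
for page_no in range(1, 200):
    payload["p"] = page_no
    page = requests.post("https://slotcatalog.com/index.php", data=payload, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    titles = [i.get("title") for i in soup.find_all("a", {"class": "providerName"})]
    if not titles:  # ran past the last page
        break
    all_games.extend(titles)
print(len(all_games))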
