How to extract this dictionary using Beautiful Soup - python

I want to get the last element of a dictionary (pasted below) into a variable; it's inside another dictionary, "offers", and I have no clue how to extract it.
html = s.get(url=url, headers=headers, verify=False, timeout=15)
soup = BeautifulSoup(html.text, 'html.parser')
products = soup.find_all('script', {'type': "application/ld+json"})
{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"} (...)

As mentioned, extract the contents via BeautifulSoup and then decode the string with json.loads():
import json
products = '{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"}]}'
products = json.loads(products)
To get the last element (dict) in offers:
products['offers'][-1]
Output:
{'#type': 'Offer',
'availability': 'http://schema.org/OutOfStock',
'price': '489',
'priceCurrency': 'PLN',
'sku': 'NE215O06U-A110002000',
'url': '/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html'}
Example
In your special case the quotes inside the script tag are HTML-escaped, so you also have to replace('&quot;', '"') first:
from bs4 import BeautifulSoup
import requests, json
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
html = requests.get('https://www.zalando.de/new-balance-550-unisex-sneaker-low-whitered-ne215o06u-a11.html', headers=headers)
soup = BeautifulSoup(html.content, 'lxml')
jsonData = json.loads(soup.select_one('script[type="application/ld+json"]').text.replace('&quot;','"'))
jsonData['offers'][-1]
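From there you can pick individual fields of that last offer, e.g. the price (a small usage example based on the keys shown in the output above):
last_offer = jsonData['offers'][-1]
print(last_offer['price'], last_offer['priceCurrency'])   # 489 PLN
print(last_offer['availability'])                         # http://schema.org/OutOfStock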

Related

(Beginner) Python web scraping BeautifulSoup

How do I scrape the text from an element with multiple attributes?
<h2 class="_63-j _1rimQ" data-qa="heading">Popular Dishes</h2>
I used this
category = soup.find(name="h2", attrs={"class":"_63-j _1rimQ","data-qa":"heading"}).getText()
but it returns an error
AttributeError: 'NoneType' object has no attribute 'getText'
Same error is returned when using this
category = soup.find(name="h2",class_="_63-j _1rimQ")
from bs4 import BeautifulSoup as bs
html = """<h2 class="_63-j _1rimQ" data-qa="heading">Popular Dishes</h2>"""
soup = bs(html, 'html.parser')
soup.find('h2', class_ = '_63-j _1rimQ').getText() # 'Popular Dishes'
Works very well here. Maybe the problem is the parser ('html.parser' is used above)?
BeautifulSoup 4.10.0, Python 3.10.2
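If the element really is missing from the HTML you downloaded (for example because the page builds it with JavaScript), find() returns None, and calling getText() on that raises exactly the AttributeError you see. A small defensive sketch using the selectors from the question:
heading = soup.find('h2', class_='_63-j _1rimQ')
if heading is not None:
    print(heading.get_text(strip=True))
else:
    print('Heading not found - it is probably not in the downloaded HTML')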
The content you wish to get from that page is generated dynamically, so BeautifulSoup will not help you grab it. The data comes from a request issued to an API endpoint. The following is how you can achieve the same using requests:
import requests
link = 'https://cw-api.takeaway.com/api/v29/restaurant'
params = {
    'slug': 'c-pizza-c-kebab'
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-country-code': 'fr',
}
with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link, params=params)
    container = res.json()['menu']['products']
    for key, val in container.items():
        print(val['name'])
Output (truncated):
Kebab veau
Pot de kebabs
Pot de frites
Margherita
Bambino
Reine
Sicilienne
Végétarienne
Calzone soufflée jambon
Calzone soufflée bœuf haché
Pêcheur
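If you want the names back as a list (e.g. to reuse them elsewhere in your script) rather than printed, with an explicit HTTP status check added, a minimal variation of the snippet above could look like this (fetch_product_names is just an illustrative helper name):
import requests

link = 'https://cw-api.takeaway.com/api/v29/restaurant'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-country-code': 'fr',
}

def fetch_product_names(slug):
    # Return the product names for one restaurant slug, failing loudly on HTTP errors
    with requests.Session() as s:
        s.headers.update(headers)
        res = s.get(link, params={'slug': slug})
        res.raise_for_status()
        products = res.json()['menu']['products']
        return [val['name'] for val in products.values()]

print(fetch_product_names('c-pizza-c-kebab'))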

Created list from BeautifulSoup contains duplicate entries -- need a unique link list

I am trying to create a list containing all unique year links from a website (see below).
When I execute the append function it gives me a huge list containing duplicate entries.
I need to get a list containing only the unique year links.
The website : https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
Code written so far:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    print(link.get('href'))
    #year.append(link.get('href'))
    #print(year)
The desired result would look like this (but I need this in list format):
https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/83-2022.html
/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/71-2020.html
/apofaseis-gnomodotiseis/itemlist/category/4-2019.html
/apofaseis-gnomodotiseis/itemlist/category/5-2018.html
/apofaseis-gnomodotiseis/itemlist/category/6-2017.html
/apofaseis-gnomodotiseis/itemlist/category/7-2016.html
/apofaseis-gnomodotiseis/itemlist/category/8-2015.html
/apofaseis-gnomodotiseis/itemlist/category/9-2014.html
/apofaseis-gnomodotiseis/itemlist/category/10-2013.html
/apofaseis-gnomodotiseis/itemlist/category/11-2012.html
/apofaseis-gnomodotiseis/itemlist/category/12-2011.html
/apofaseis-gnomodotiseis/itemlist/category/13-2010.html
/apofaseis-gnomodotiseis/itemlist/category/18-2009.html
/apofaseis-gnomodotiseis/itemlist/category/19-2008.html
/apofaseis-gnomodotiseis/itemlist/category/20-2007.html
/apofaseis-gnomodotiseis/itemlist/category/21-2006.html
/apofaseis-gnomodotiseis/itemlist/category/22-2005.html
/apofaseis-gnomodotiseis/itemlist/category/23-2004.html
/apofaseis-gnomodotiseis/itemlist/category/24-2003.html
/apofaseis-gnomodotiseis/itemlist/category/25-2002.html
/apofaseis-gnomodotiseis/itemlist/category/26-2001.html
/apofaseis-gnomodotiseis/itemlist/category/27-2000.html
/apofaseis-gnomodotiseis/itemlist/category/44-1999.html
/apofaseis-gnomodotiseis/itemlist/category/45-1998.html
/apofaseis-gnomodotiseis/itemlist/category/48-1997.html
/apofaseis-gnomodotiseis/itemlist/category/47-1996.html
/apofaseis-gnomodotiseis/itemlist/category/46-1995.html
/apofaseis-gnomodotiseis/itemlist/category/49-1994.html
Edit: I am trying to create a case list for every year in the year list.
Code:
# 1) Created an year list (year = [])
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re
total_cases = []
#Url to scrape
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    if link.get('href') not in year:
        year.append(link.get('href'))
print(year)
# 2) Created a case list
case = []
for link in soup.find_all('a', href=lambda href: href and "apofasi" in href):
    if link.get('href') not in case:
        case.append(link.get('href'))
print(case)
# Trying to create a case list for every year in year list
# A) Get every year link in year list
for year_link in year:
    headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
                "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
                "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
    page = requests.get(year_link, headers=headers1)
    soup2 = BeautifulSoup(page.content, "html.parser")
    print(year)
# B) Get every case link for every case in a fixed year
for case_link in case:
    total_cases.append(case_link)
# Get case link for every case for every year_link (element of year[])
???
EDIT 2:
When I try to run the code you (HedgeHog) so kindly posted, it gives me this error:
--------------------------------------------------------------------------
FeatureNotFound Traceback (most recent call last)
C:\Users\ARISTE~1\AppData\Local\Temp/ipykernel_13944/1621925083.py in <module>
8 "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
9 page = requests.get(URL, headers = headers)
---> 10 soup = BeautifulSoup(page.content,'lxml')
11
12 baseUrl = 'https://www.epant.gr'
~\Documents\conda\envs\conda\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
243 builder_class = builder_registry.lookup(*features)
244 if builder_class is None:
--> 245 raise FeatureNotFound(
246 "Couldn't find a tree builder with the features you "
247 "requested: %s. Do you need to install a parser library?"
FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Any ideas? Thanks!
EDIT
Based on your question edits I would recommend using a dict instead of all these lists - the following example will create a data dictionary with the years as keys, each holding its own url and a list of case urls.
Example
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content,'html.parser')
baseUrl = 'https://www.epant.gr'
data = {}
for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    data[href.split('-')[-1].split('.')[0]] = {
        'url': f'{baseUrl}{href}'
    }
    data[href.split('-')[-1].split('.')[0]]['cases'] = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]
data
Output
{'2022': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/83-2022.html',
'cases': []},
'2021': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html',
'cases': ['https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1575-apofasi-738-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1624-apofasi-737-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1510-apofasi-735-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1595-apofasi-733-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1600-apofasi-732-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1451-apofasi-730-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1508-apofasi-728-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1584-apofasi-727-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1586-apofasi-726-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1583-apofasi-725-2021.html']},...}
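Since you already import pandas and csv, you could also flatten that dictionary into rows later on if you need tabular output; a minimal sketch, assuming data has the shape shown in the output above:
import pandas as pd

# `data` as built in the example above, e.g.:
data = {
    '2021': {
        'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html',
        'cases': ['https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html'],
    }
}
rows = [
    {'year': year, 'year_url': info['url'], 'case_url': case_url}
    for year, info in data.items()
    for case_url in info['cases']
]
pd.DataFrame(rows).to_csv('cases.csv', index=False)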
How to fix?
Just check whether the link is not already in your list of links - if that is True, append it to your list:
if link.get('href') not in year:
    year.append(link.get('href'))
Note
The desired result would look like this (but I need this in list format)
This is not a list in the data-structure sense; it is the printed version of each single element of a list.
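A trivial illustration of that difference (dummy values, not your data):
year = ['/a-2021.html', '/b-2020.html']
for href in year:
    print(href)    # one element per line - this is what your desired output looks like
print(year)        # the list object itself: ['/a-2021.html', '/b-2020.html']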
Alternative
Example
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
page = requests.get(URL, headers = headers1)
soup = BeautifulSoup(page.content,"html.parser")
year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    if link.get('href') not in year:
        year.append(link.get('href'))
print(year)
Output
['https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/83-2022.html', '/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/71-2020.html', '/apofaseis-gnomodotiseis/itemlist/category/4-2019.html', '/apofaseis-gnomodotiseis/itemlist/category/5-2018.html', '/apofaseis-gnomodotiseis/itemlist/category/6-2017.html', '/apofaseis-gnomodotiseis/itemlist/category/7-2016.html', '/apofaseis-gnomodotiseis/itemlist/category/8-2015.html', '/apofaseis-gnomodotiseis/itemlist/category/9-2014.html', '/apofaseis-gnomodotiseis/itemlist/category/10-2013.html', '/apofaseis-gnomodotiseis/itemlist/category/11-2012.html', '/apofaseis-gnomodotiseis/itemlist/category/12-2011.html', '/apofaseis-gnomodotiseis/itemlist/category/13-2010.html', '/apofaseis-gnomodotiseis/itemlist/category/18-2009.html', '/apofaseis-gnomodotiseis/itemlist/category/19-2008.html', '/apofaseis-gnomodotiseis/itemlist/category/20-2007.html', '/apofaseis-gnomodotiseis/itemlist/category/21-2006.html', '/apofaseis-gnomodotiseis/itemlist/category/22-2005.html', '/apofaseis-gnomodotiseis/itemlist/category/23-2004.html', '/apofaseis-gnomodotiseis/itemlist/category/24-2003.html', '/apofaseis-gnomodotiseis/itemlist/category/25-2002.html', '/apofaseis-gnomodotiseis/itemlist/category/26-2001.html', '/apofaseis-gnomodotiseis/itemlist/category/27-2000.html', '/apofaseis-gnomodotiseis/itemlist/category/44-1999.html', '/apofaseis-gnomodotiseis/itemlist/category/45-1998.html', '/apofaseis-gnomodotiseis/itemlist/category/48-1997.html', '/apofaseis-gnomodotiseis/itemlist/category/47-1996.html', '/apofaseis-gnomodotiseis/itemlist/category/46-1995.html', '/apofaseis-gnomodotiseis/itemlist/category/49-1994.html']
Use a set as the intermediate storage for the HREFs, then convert it to a list later.
from bs4 import BeautifulSoup
import requests
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "lxml")
year = set()
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    year.add(link.get('href'))
print(list(year))
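Keep in mind that a set does not preserve the order in which the links appear on the page. If order matters, dict.fromkeys() also removes duplicates while keeping the first-seen order (a small sketch reusing the soup object from above):
hrefs = [link.get('href') for link in soup.find_all('a', href=lambda href: href and "category" in href)]
year = list(dict.fromkeys(hrefs))  # order-preserving de-duplication (Python 3.7+)
print(year)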

How can I get the href from a row

I am making a Telegram bot, and I need to get links from HTML.
I want to get the href for each match from this website: https://www.hltv.org/matches
My previous code is:
elif message.text == "Matches":
    url_news = "https://www.hltv.org/matches"
    response = requests.get(url_news)
    soup = BeautifulSoup(response.content, "html.parser")
    match_info = []
    match_items = soup.find("div", class_="upcomingMatchesSection")
    print(match_items)
    for item in match_items:
        match_info.append({
            "link": item.find("div", class_="upcomingMatch").text,
            "title": item["href"]
        })
And I don't know how I can get the links from this body. Appreciate any help.
What happens?
You try to iterate over match_items, but there is nothing to iterate over, because you only selected the section that contains the matches, not the matches themselves.
How to fix?
Select the upcomingMatches instead and iterate over them:
match_items = soup.select("div.upcomingMatchesSection div.upcomingMatch")
To get the url you have to select an <a>:
item.a["href"]
Example
from bs4 import BeautifulSoup as bs
import requests
url_news = "https://www.hltv.org/matches"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
response = requests.get(url_news, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
match_info = []
match_items = soup.select("div.upcomingMatchesSection div.upcomingMatch")
for item in match_items:
    match_info.append({
        "title": item.get_text('|', strip=True),
        "link": item.a["href"]
    })
match_info
Output
[{'title': '09:00|bo3|1WIN|K23|Pinnacle Fall Series 2|Odds',
'link': '/matches/2352066/1win-vs-k23-pinnacle-fall-series-2'},
{'title': '09:00|bo3|INDE IRAE|Nemiga|Pinnacle Fall Series 2|Odds',
'link': '/matches/2352067/inde-irae-vs-nemiga-pinnacle-fall-series-2'},
{'title': '10:00|bo3|OPAA|Nexus|Malta Vibes Knockout Series 3|Odds',
'link': '/matches/2352207/opaa-vs-nexus-malta-vibes-knockout-series-3'},
{'title': '11:00|bo3|Checkmate|TBC|Funspark ULTI 2021 Asia Regional Series 3|Odds',
'link': '/matches/2352092/checkmate-vs-tbc-funspark-ulti-2021-asia-regional-series-3'},
{'title': '11:00|bo3|ORDER|Alke|ESEA Premier Season 38 Australia|Odds',
'link': '/matches/2352122/order-vs-alke-esea-premier-season-38-australia'},...]
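Note that the collected hrefs are relative ('/matches/...'). If your bot should send clickable links, you can join them with the site root via urllib.parse.urljoin (a small sketch with one of the values from the output above):
from urllib.parse import urljoin

base = "https://www.hltv.org"
link = "/matches/2352066/1win-vs-k23-pinnacle-fall-series-2"  # e.g. a 'link' value from match_info
print(urljoin(base, link))
# https://www.hltv.org/matches/2352066/1win-vs-k23-pinnacle-fall-series-2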
You can try this out.
All the match information is present inside a <div> with the class name upcomingMatch.
Select all those <div>s and, from each one, extract the match link, which is inside the <a> tag with the class name match.
Here is the code:
import requests
from bs4 import BeautifulSoup
url_news = "https://www.hltv.org/matches"
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
response = requests.get(url_news,headers=headers)
soup = BeautifulSoup(response.text, "lxml")
match_items = soup.find_all("div", class_="upcomingMatch")
for match in match_items:
    link = match.find('a', class_='match a-reset')['href']
    print(f'Link: {link}')
Link: /matches/2352235/malta-vibes-knockout-series-3-quarter-final-1-malta-vibes-knockout-series-3
Link: /matches/2352098/pinnacle-fall-series-2-quarter-final-2-pinnacle-fall-series-2
Link: /matches/2352236/malta-vibes-knockout-series-3-quarter-final-2-malta-vibes-knockout-series-3
Link: /matches/2352099/pinnacle-fall-series-2-quarter-final-3-pinnacle-fall-series-2
.
.
.

Scraping data from a website after filling in a form in Python

I have tried to scrape data from http://www.educationboardresults.gov.bd/ with Python and BeautifulSoup.
First, the website requires you to fill in a form. After submitting the form, the website provides the results. I have attached two images here.
Before Submitting Form: https://prnt.sc/w4lo7i
After Submission: https://prnt.sc/w4lqd0
I have tried the following code:
import requests
from bs4 import BeautifulSoup as bs
resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': 2012,
'board': 'chittagong',
'roll': 102275,
'reg': 626948,
'button2': 'Submit',
}
headers ={
'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
'Origin': 'http://www.educationboardresults.gov.bd',
'Referer': 'http://www.educationboardresults.gov.bd/',
'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd'
    r = s.get(url, headers=headers)
    soup = bs(r.content, 'html5lib')
    # Scraping and bypassing the captcha
    alltable = soup.findAll('td')
    captcha = alltable[56].text.split('+')
    for digit in captcha:
        value_one, value_two = int(captcha[0]), int(captcha[1])
    resultdata['value_s'] = value_one + value_two
    r = s.post(url, data=resultdata, headers=headers)
When I print r.content it shows the first page's code. I want to scrape the second page (the results page).
Thanks in Advance
You are making post requests to the wrong url. Moreover, you are supposed to add the two numbers and use the sum as value_s. If you are using bs4 version 4.7.0 or later, the following selector will work for you, as I've used a CSS pseudo-selector. The bottom line is that your issue is solved. Try the following:
import requests
from bs4 import BeautifulSoup
link = 'http://www.educationboardresults.gov.bd/'
result_url = 'http://www.educationboardresults.gov.bd/result.php'
resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': 2012,
'board': 'chittagong',
'roll': 102275,
'reg': 626948,
'button2': 'Submit',
}
def get_number(s, link):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "html5lib")
    num = 0
    captcha_numbers = soup.select_one("tr:has(> td > #value_s) > td + td").text.split("+")
    for i in captcha_numbers:
        num += int(i)
    return num
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    resultdata['value_s'] = get_number(s, link)
    r = s.post(result_url, data=resultdata)
    print(r.text)
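If the post succeeds, r.text holds the results page rather than the form. How you parse it depends on its markup, which I haven't inspected; assuming the marks are rendered in plain HTML table cells, a minimal sketch would be:
from bs4 import BeautifulSoup

# r is the response from the post request above
result_soup = BeautifulSoup(r.text, "html5lib")
for td in result_soup.find_all("td"):
    text = td.get_text(strip=True)
    if text:
        print(text)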
I am also trying.
import requests
from bs4 import BeautifulSoup as bs
resultdata = {
'sr': '3',
'et': '2',
'exam': 'ssc',
'year': "2012",
'board': 'chittagong',
'roll': "102275",
'reg': "626948",
}
headers ={
'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
'Origin': 'http://www.educationboardresults.gov.bd',
'Referer': 'http://www.educationboardresults.gov.bd/',
'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
}
with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd/index.php'
    r = s.get(url, headers=headers)
    soup = bs(r.content, 'lxml')
    # print(soup.prettify())
    # Scraping and bypassing the captcha
    alltable = soup.findAll('td')
    captcha = alltable[56].text.split('+')
    print(captcha)
    value_one, value_two = int(captcha[0]), int(captcha[1])
    print(value_one, value_two)
    resultdata['value_s'] = value_one + value_two
    resultdata['button2'] = 'Submit'
    print(resultdata)
    r = s.post("http://www.educationboardresults.gov.bd/result.php", data=resultdata, headers=headers)
    soup = bs(r.content, 'lxml')
    print(soup.prettify())

unable to scrape website pages with unchanged url - python

I'm trying to get the names of all games on this website: "https://slotcatalog.com/en/The-Best-Slots#anchorFltrList". To do so I'm using the following code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "https://slotcatalog.com/en/The-Best-Slots#anchorFltrList"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
data = []
table = soup.find_all('div', attrs={'class':'providerCard'})
for game in range(0, len(table)-1):
    print(table[game].find('a')['title'])
and I get what I want.
I would like to replicate the same across all pages available on the website, but given that the URL does not change, I looked at the network (XHR) events that fire when clicking on a different page, and I tried to send a request using the following code:
for page_no in range(1, 100):
    data = {
        "blck":"fltrGamesBlk",
        "ajax":"1",
        "lang":"end",
        "p":str(page_no),
        "translit":"The-Best-Slots",
        "tag":"TOP",
        "dt1":"",
        "dt2":"",
        "sorting":"SRANK",
        "cISO":"GB",
        "dt_period":"",
        "rtp_1":"50.00",
        "rtp_2":"100.00",
        "max_exp_1":"2.00",
        "max_exp_2":"250000.00",
        "min_bet_1":"0.01",
        "min_bet_2":"5.00",
        "max_bet_1":"3.00",
        "max_bet_2":"10000.00"
    }
    page = requests.post('https://slotcatalog.com/index.php',
                         data=data,
                         headers={'Host': 'slotcatalog.com',
                                  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'
                                  })
    soup = BeautifulSoup(page.content, 'html.parser')
    for row in soup.find_all('div', attrs={'class':'providerCard'}):
        name = row.find('a')['title']
        print(name)
result : ("KeyError: 'title'") - meaning that its not finding the class "providerCard".
Has the request to the website been done in the wrong way? If so, where should i change the code?
thanks in advance
Alright, so, you had a typo. XD It was this "lang":"end" in the payload, but it should have been "lang": "en", among other things.
Anyhow, I've cleaned your code up a bit and it works as expected. You can keep looping for all the games, if you want.
import requests
from bs4 import BeautifulSoup
headers = {
"referer": "https://slotcatalog.com/en/The-Best-Slots",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/50.0.2661.102 Safari/537.36",
"x-requested-with": "XMLHttpRequest",
}
payload = {
"blck": "fltrGamesBlk",
"ajax": "1",
"lang": "en",
"p": 1,
"translit": "The-Best-Slots",
"tag": "TOP",
"dt1": "",
"dt2": "",
"sorting": "SRANK",
"cISO": "EN",
"dt_period": "",
"rtp_1": "50.00",
"rtp_2": "100.00",
"max_exp_1": "2.00",
"max_exp_2": "250000.00",
"min_bet_1": "0.01",
"min_bet_2": "5.00",
"max_bet_1": "3.00",
"max_bet_2": "10000.00"
}
page = requests.post(
"https://slotcatalog.com/index.php",
data=payload,
headers=headers,
)
soup = BeautifulSoup(page.content, "html.parser")
print([i.get("title") for i in soup.find_all("a", {"class": "providerName"})])
Output (for page 1 only):
['Starburst', 'Bonanza', 'Rainbow Riches', 'Book of Dead', "Fishin' Frenzy", 'Wolf Gold', 'Twin Spin', 'Slingo Rainbow Riches', "Gonzo's Quest", "Gonzo's Quest Megaways", 'Eye of Horus (Reel Time Gaming)', 'Age of the Gods God of Storms', 'Lightning Roulette', 'Buffalo Blitz', "Fishin' Frenzy Megaways", 'Fluffy Favourites', 'Blue Wizard', 'Legacy of Dead', '9 Pots of Gold', 'Buffalo Blitz II', 'Cleopatra (IGT)', 'Quantum Roulette', 'Reel King Mega', 'Mega Moolah', '7s Deluxe', "Rainbow Riches Pick'n'Mix", "Shaman's Dream"]
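To actually walk through all pages, as the question intended, you only have to vary the "p" value in the payload; a sketch reusing the payload and headers defined above (the 100-page cap is arbitrary, and the loop stops as soon as a page returns no more titles):
all_titles = []
for page_no in range(1, 100):
    payload["p"] = page_no
    page = requests.post("https://slotcatalog.com/index.php", data=payload, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
    titles = [i.get("title") for i in soup.find_all("a", {"class": "providerName"})]
    if not titles:  # empty page - no more results
        break
    all_titles.extend(titles)
print(len(all_titles), "games collected")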
