How to scrape text from an element with multiple attributes?
<h2 class="_63-j _1rimQ" data-qa="heading">Popular Dishes</h2>
I used this:
category = soup.find(name="h2", attrs={"class":"_63-j _1rimQ","data-qa":"heading"}).getText()
but it returns an error
AttributeError: 'NoneType' object has no attribute 'getText'
The same error is returned when using this:
category = soup.find(name="h2",class_="_63-j _1rimQ")
from bs4 import BeautifulSoup as bs
html = """<h2 class="_63-j _1rimQ" data-qa="heading">Popular Dishes</h2>"""
soup = bs(html, 'html.parser')
soup.find('h2', class_ = '_63-j _1rimQ').getText() # 'Popular Dishes'
It works fine here, so perhaps the parser is the difference; try 'html.parser'.
BeautifulSoup 4.10.0, Python 3.10.2
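Since find() returns None when nothing matches, that AttributeError usually means the element was not in the HTML that requests actually received. A minimal defensive sketch, reusing the markup above (the html string stands in for a fetched document):

from bs4 import BeautifulSoup as bs

# stands in for the document fetched with requests
html = """<h2 class="_63-j _1rimQ" data-qa="heading">Popular Dishes</h2>"""
soup = bs(html, 'html.parser')

# guard against a missing element instead of calling getText() on None
heading = soup.find('h2', class_='_63-j _1rimQ')
if heading is None:
    print('h2 not found - the page may render it client-side')
else:
    print(heading.getText())  # Popular Dishes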
The content you wish to get from that page is generated dynamically, so BeautifulSoup will not help you grab it. The browser issues the request to an API endpoint instead. The following is how you can achieve the same using requests:
import requests

link = 'https://cw-api.takeaway.com/api/v29/restaurant'
params = {
    'slug': 'c-pizza-c-kebab'
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-country-code': 'fr',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link, params=params)
    container = res.json()['menu']['products']
    for key, val in container.items():
        print(val['name'])
Output (truncated):
Kebab veau
Pot de kebabs
Pot de frites
Margherita
Bambino
Reine
Sicilienne
Végétarienne
Calzone soufflée jambon
Calzone soufflée bœuf haché
Pêcheur
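If you need more than the names, you can inspect what other fields each product entry carries before hard-coding any keys; a small continuation of the loop above (only val['name'] is confirmed, everything else is discovered at runtime):

# Continuing from the snippet above: `container` is the products dict.
for key, val in container.items():
    # print the product name plus the field names this entry actually exposes,
    # so further extraction is based on the real schema rather than guesses
    print(val['name'], '->', sorted(val.keys()))
    break  # one entry is enough to inspect the schema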
For a small project, I tried to get past the Google consent page in order to web-scrape the prices it finds. However, I could not get to the good stuff. So far I have tried the following code, as also proposed by Using python requests with google search.
First try:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

#%% get access
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'cookie': 'CONSENT=YES'
}
cookie = {
    'set-cookie': 'CONSENT=YES+cb.20211212-16-p1.nl+FX+671'
}
url = 'https://www.google.com/search?q=kitesurf+bar&rlz=1C1CAFA_enNL603NL674&sxsrf=AOaemvJF3BPrjeczkCI1e1YotCSJYcKjug:1641909060258&source=lnms&tbm=shop&sa=X&ved=2ahUKEwiBwqjy66n1AhXO-KQKHUKVBSoQ_AUoAXoECAEQAw'

s = requests.Session()
s.get(url, headers=headers)

purl = 'https://consent.google.com/s'
payload = {
    'gl': 'NL',
    'm': 'false',
    'pc': 'srp',
    'continue': 'https://www.google.com/search?q=kitesurf+bar&rlz=1C1CAFA_enNL603NL674&sxsrf=AOaemvJF3BPrjeczkCI1e1YotCSJYcKjug:1641909060258&source=lnms&tbm=shop&sa=X&ved=2ahUKEwiBwqjy66n1AhXO-KQKHUKVBSoQ_AUoAXoECAEQAw',
    'ca': 'r',
    'x': '6',
    'v': 'cb.20211212-16-p1.nl+FX+092',
    't': 'ADw3F8jHa3HqOqq133-wOkXCYf4K_r-AIA:1641909071268',
    'hl': 'nl',
    'src': '1'
}
s.post(purl, params=payload, headers=headers)
page = s.get(url, headers=headers, cookies=cookie)
Second try:
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    url = "https://www.google.com/search?q=fitness+wear"
    headers = {
        "referer": "https://www.google.com/",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    s.post(url, headers=headers)
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)
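For what it's worth, the cookie-based route from the first try can be reduced to pre-setting the consent cookie on the session before the first GET. A minimal sketch, assuming Google still honours a pre-set CONSENT cookie (the value is copied from the first try above and may be outdated):

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'
}
url = 'https://www.google.com/search?q=kitesurf+bar&tbm=shop'

with requests.Session() as s:
    s.headers.update(headers)
    # pre-set the consent cookie so the consent interstitial is skipped
    # (cookie value taken from the question's first try; may be outdated)
    s.cookies.set('CONSENT', 'YES+cb.20211212-16-p1.nl+FX+671', domain='.google.com')
    response = s.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.title)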
I am trying to create a list containing all unique year links from a website (see below).
When I execute the append function, it gives me a huge list containing duplicate entries.
I need a list containing only the unique year links.
The website : https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
Code written so far:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"
}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "html.parser")

year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    print(link.get('href'))
    #year.append(link.get('href'))
#print(year)
The desired result would look like this (but I need this in list format):
https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/83-2022.html
/apofaseis-gnomodotiseis/itemlist/category/78-2021.html
/apofaseis-gnomodotiseis/itemlist/category/71-2020.html
/apofaseis-gnomodotiseis/itemlist/category/4-2019.html
/apofaseis-gnomodotiseis/itemlist/category/5-2018.html
/apofaseis-gnomodotiseis/itemlist/category/6-2017.html
/apofaseis-gnomodotiseis/itemlist/category/7-2016.html
/apofaseis-gnomodotiseis/itemlist/category/8-2015.html
/apofaseis-gnomodotiseis/itemlist/category/9-2014.html
/apofaseis-gnomodotiseis/itemlist/category/10-2013.html
/apofaseis-gnomodotiseis/itemlist/category/11-2012.html
/apofaseis-gnomodotiseis/itemlist/category/12-2011.html
/apofaseis-gnomodotiseis/itemlist/category/13-2010.html
/apofaseis-gnomodotiseis/itemlist/category/18-2009.html
/apofaseis-gnomodotiseis/itemlist/category/19-2008.html
/apofaseis-gnomodotiseis/itemlist/category/20-2007.html
/apofaseis-gnomodotiseis/itemlist/category/21-2006.html
/apofaseis-gnomodotiseis/itemlist/category/22-2005.html
/apofaseis-gnomodotiseis/itemlist/category/23-2004.html
/apofaseis-gnomodotiseis/itemlist/category/24-2003.html
/apofaseis-gnomodotiseis/itemlist/category/25-2002.html
/apofaseis-gnomodotiseis/itemlist/category/26-2001.html
/apofaseis-gnomodotiseis/itemlist/category/27-2000.html
/apofaseis-gnomodotiseis/itemlist/category/44-1999.html
/apofaseis-gnomodotiseis/itemlist/category/45-1998.html
/apofaseis-gnomodotiseis/itemlist/category/48-1997.html
/apofaseis-gnomodotiseis/itemlist/category/47-1996.html
/apofaseis-gnomodotiseis/itemlist/category/46-1995.html
/apofaseis-gnomodotiseis/itemlist/category/49-1994.html
Edit: I am trying to create a case list for every year in the year list:
Code :
# 1) Create a year list (year = [])
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
import re

total_cases = []

# URL to scrape
URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"
}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "html.parser")

year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    if link.get('href') not in year:
        year.append(link.get('href'))
print(year)

# 2) Create a case list
case = []
for link in soup.find_all('a', href=lambda href: href and "apofasi" in href):
    if link.get('href') not in case:
        case.append(link.get('href'))
print(case)

# Trying to create a case list for every year in the year list
# A) Get every year link in the year list
for year_link in year:
    headers1 = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
        "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
        "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"
    }
    page = requests.get(year_link, headers=headers1)
    soup2 = BeautifulSoup(page.content, "html.parser")
    print(year)
    # B) Get every case link for every case in a fixed year
    for case_link in case:
        total_cases.append(case_link)

# Get case link for every case for every year_link (element of year[])
???
EDIT 2:
When I try to run the code you (HedgeHog) so kindly posted, it gives me this error:
--------------------------------------------------------------------------
FeatureNotFound Traceback (most recent call last)
C:\Users\ARISTE~1\AppData\Local\Temp/ipykernel_13944/1621925083.py in <module>
8 "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
9 page = requests.get(URL, headers = headers)
---> 10 soup = BeautifulSoup(page.content,'lxml')
11
12 baseUrl = 'https://www.epant.gr'
~\Documents\conda\envs\conda\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
243 builder_class = builder_registry.lookup(*features)
244 if builder_class is None:
--> 245 raise FeatureNotFound(
246 "Couldn't find a tree builder with the features you "
247 "requested: %s. Do you need to install a parser library?"
FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
Any ideas? Thanks!
EDIT
Based on your question edits, I would recommend using a dict instead of all these lists. The following example will create a data dictionary with the years as keys, each entry holding its own URL and a list of case URLs. (Concerning your EDIT 2: the FeatureNotFound error just means the lxml parser is not installed in your environment; either pip install lxml or use the built-in html.parser, as this example does.)
Example
from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

baseUrl = 'https://www.epant.gr'
data = {}

for href in [x['href'] for x in soup.select('a[href*=category]:has(span)')]:
    page = requests.get(f'{baseUrl}{href}', headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    data[href.split('-')[-1].split('.')[0]] = {
        'url': f'{baseUrl}{href}'
    }
    data[href.split('-')[-1].split('.')[0]]['cases'] = [f'{baseUrl}{x["href"]}' for x in soup.select('h3 a')]

data
Output
{'2022': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/83-2022.html',
'cases': []},
'2021': {'url': 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html',
'cases': ['https://www.epant.gr/apofaseis-gnomodotiseis/item/1578-apofasi-749-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1633-apofasi-743-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1575-apofasi-738-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1624-apofasi-737-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1510-apofasi-735-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1595-apofasi-733-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1600-apofasi-732-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1451-apofasi-730-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1508-apofasi-728-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1584-apofasi-727-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1586-apofasi-726-2021.html',
'https://www.epant.gr/apofaseis-gnomodotiseis/item/1583-apofasi-725-2021.html']},...}
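With that shape, a given year's data can then be addressed directly, for example:

# direct lookups on the nested dict (keys are the year strings shown above)
print(data['2021']['url'])          # category page for 2021
print(len(data['2021']['cases']))   # number of case links found for 2021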
How to fix?
Just check whether the link is already in your list of links; if it is not, append it:
if link.get('href') not in year:
    year.append(link.get('href'))
Note
"The desired result would look like this (but I need this in list format)"
This is not a list in the data-structure sense; it is the printed form of each single element of a list.
Alternative
Example
from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"
}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "html.parser")

year = []
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    if link.get('href') not in year:
        year.append(link.get('href'))
print(year)
Output
['https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/83-2022.html', '/apofaseis-gnomodotiseis/itemlist/category/78-2021.html', '/apofaseis-gnomodotiseis/itemlist/category/71-2020.html', '/apofaseis-gnomodotiseis/itemlist/category/4-2019.html', '/apofaseis-gnomodotiseis/itemlist/category/5-2018.html', '/apofaseis-gnomodotiseis/itemlist/category/6-2017.html', '/apofaseis-gnomodotiseis/itemlist/category/7-2016.html', '/apofaseis-gnomodotiseis/itemlist/category/8-2015.html', '/apofaseis-gnomodotiseis/itemlist/category/9-2014.html', '/apofaseis-gnomodotiseis/itemlist/category/10-2013.html', '/apofaseis-gnomodotiseis/itemlist/category/11-2012.html', '/apofaseis-gnomodotiseis/itemlist/category/12-2011.html', '/apofaseis-gnomodotiseis/itemlist/category/13-2010.html', '/apofaseis-gnomodotiseis/itemlist/category/18-2009.html', '/apofaseis-gnomodotiseis/itemlist/category/19-2008.html', '/apofaseis-gnomodotiseis/itemlist/category/20-2007.html', '/apofaseis-gnomodotiseis/itemlist/category/21-2006.html', '/apofaseis-gnomodotiseis/itemlist/category/22-2005.html', '/apofaseis-gnomodotiseis/itemlist/category/23-2004.html', '/apofaseis-gnomodotiseis/itemlist/category/24-2003.html', '/apofaseis-gnomodotiseis/itemlist/category/25-2002.html', '/apofaseis-gnomodotiseis/itemlist/category/26-2001.html', '/apofaseis-gnomodotiseis/itemlist/category/27-2000.html', '/apofaseis-gnomodotiseis/itemlist/category/44-1999.html', '/apofaseis-gnomodotiseis/itemlist/category/45-1998.html', '/apofaseis-gnomodotiseis/itemlist/category/48-1997.html', '/apofaseis-gnomodotiseis/itemlist/category/47-1996.html', '/apofaseis-gnomodotiseis/itemlist/category/46-1995.html', '/apofaseis-gnomodotiseis/itemlist/category/49-1994.html']
Use a set as the intermediate storage for the hrefs, then convert it to a list later.
from bs4 import BeautifulSoup
import requests

URL = 'https://www.epant.gr/apofaseis-gnomodotiseis/itemlist/category/78-2021.html'
headers1 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
    "X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB"
}
page = requests.get(URL, headers=headers1)
soup = BeautifulSoup(page.content, "lxml")

year = set()
for link in soup.find_all('a', href=lambda href: href and "category" in href):
    year.add(link.get('href'))
print(list(year))
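One caveat: a set does not preserve the order in which the links appear on the page. If order matters, a dict can de-duplicate while keeping insertion order (guaranteed from Python 3.7 on); a minimal sketch reusing the soup object from the snippet above:

# dict keys are unique and, from Python 3.7 on, keep insertion order,
# so this de-duplicates without losing the on-page ordering
hrefs = (link.get('href')
         for link in soup.find_all('a', href=lambda href: href and "category" in href))
year = list(dict.fromkeys(hrefs))
print(year)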
I want to get the last element of a dictionary (pasted below) into a variable; it is nested inside another dictionary under "offers", and I have no clue how to extract it.
html = s.get(url=url, headers=headers, verify=False, timeout=15)
soup = BeautifulSoup(html.text, 'html.parser')
products = soup.find_all('script', {'type': "application/ld+json"})
{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"} (...)
As mentioned, after extracting the contents via BeautifulSoup, decode the string with json.loads():
import json
products = '{"@context":"http://schema.org","@type":"Product","aggregateRating":{"@type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"@type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"@type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"}]}'
products = json.loads(products)
To get the last element (dict) in offers:
products['offers'][-1]
Output:
{'@type': 'Offer',
'availability': 'http://schema.org/OutOfStock',
'price': '489',
'priceCurrency': 'PLN',
'sku': 'NE215O06U-A110002000',
'url': '/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html'}
Example
In your special case you also have to replace('&quot;', '"') first:
from bs4 import BeautifulSoup
import requests, json
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
html = requests.get('https://www.zalando.de/new-balance-550-unisex-sneaker-low-whitered-ne215o06u-a11.html', headers=headers)
soup = BeautifulSoup(html.content, 'lxml')
jsonData = json.loads(soup.select_one('script[type="application/ld+json"]').text.replace('&quot;', '"'))
jsonData['offers'][-1]
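From there it is ordinary dict/list access; for example, to collect a price tuple for every offer (keys exactly as shown in the output above):

# jsonData comes from the snippet above; 'offers' is a list of offer dicts
prices = [(offer['sku'], offer['price'], offer['priceCurrency'])
          for offer in jsonData['offers']]
print(prices[-1])  # e.g. ('NE215O06U-A110002000', '489', 'PLN')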
I have tried to scrape data from http://www.educationboardresults.gov.bd/ with Python and BeautifulSoup.
First, the website requires you to fill in a form; after submitting it, the website shows the results. I have attached two images here.
Before submitting the form: https://prnt.sc/w4lo7i
After submission: https://prnt.sc/w4lqd0
I have tried the following code:
import requests
from bs4 import BeautifulSoup as bs

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
}
headers = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
}

with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd'
    r = s.get(url, headers=headers)
    soup = bs(r.content, 'html5lib')

    # Scraping and bypassing the captcha
    alltable = soup.findAll('td')
    captcha = alltable[56].text.split('+')
    for digit in captcha:
        value_one, value_two = int(captcha[0]), int(captcha[1])
        resultdata['value_s'] = value_one + value_two
    r = s.post(url, data=resultdata, headers=headers)
When printing r.content, it shows the first page's markup; I want to scrape the results page instead.
Thanks in advance.
You are making the POST request to the wrong URL. Moreover, you are supposed to add the two captcha numbers together and send the sum as value_s. If you are using bs4 version 4.7 or later, the following selector will work for you, as I've used a CSS pseudo-selector. Try the following:
import requests
from bs4 import BeautifulSoup

link = 'http://www.educationboardresults.gov.bd/'
result_url = 'http://www.educationboardresults.gov.bd/result.php'

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
}

def get_number(s, link):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "html5lib")
    num = 0
    captcha_numbers = soup.select_one("tr:has(> td > #value_s) > td + td").text.split("+")
    for i in captcha_numbers:
        num += int(i)
    return num

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    resultdata['value_s'] = get_number(s, link)
    r = s.post(result_url, data=resultdata)
    print(r.text)
I was also trying this; here is my version:
import requests
from bs4 import BeautifulSoup as bs

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': "2012",
    'board': 'chittagong',
    'roll': "102275",
    'reg': "626948",
}
headers = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
}

with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd/index.php'
    r = s.get(url, headers=headers)
    soup = bs(r.content, 'lxml')
    # print(soup.prettify())

    # Scraping and bypassing the captcha
    alltable = soup.findAll('td')
    captcha = alltable[56].text.split('+')
    print(captcha)
    value_one, value_two = int(captcha[0]), int(captcha[1])
    print(value_one, value_two)
    resultdata['value_s'] = value_one + value_two
    resultdata['button2'] = 'Submit'
    print(resultdata)

    r = s.post("http://www.educationboardresults.gov.bd/result.php", data=resultdata, headers=headers)
    soup = bs(r.content, 'lxml')
    print(soup.prettify())
I would like to parse the following URL:
Espacenet link
and obtain the URL corresponding to this text:
BATTERY PACK WITH A BUS BAR HAVING NOVEL STRUCTURE
I'm using Python, but I'm not really familiar with JavaScript.
How can I get the job done?
So far I've seen requests_html, and I tried this code:
from requests_html import HTMLSession
from bs4 import BeautifulSoup
publication_number_to_scrape = "EP2814089"
url = "https://worldwide.espacenet.com/searchResults?ST=singleline&locale=fr_EP&submitted=true&DB=&query=ep2814089" + publication_number_to_scrape
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
resp = session.get(url, headers=headers)
print(resp.content)
# Run JavaScript code on webpage
html2 = resp.html.render()
soup = BeautifulSoup(resp.content, 'html.parser')
print(soup)
and in the printed result I've seen this part:
</li>
<li class="bendractive"><a accesskey="b" href="">Liste de résultats</a></li>
<li class="bendr"><a accesskey="c" class="ptn" href="/mydocumentslist?submitted=true&locale=fr_EP" id="menuPnStar">Ma liste de brevets (<span id="menuPnCount"></span>)</a></li>
<li class="bendr"><a accesskey="d" href="/queryHistory?locale=fr_EP">Historique des requêtes</a></li>
<li class="spacer"></li>
<li class="bendl"><a accesskey="e" href="/settings?locale=fr_EP">Paramètres</a></li>
<li class="bendl last">
<a accesskey="f" href="/help?locale=fr_EP&method=handleHelpTopic&topic=index">Aide</a>
</li>
My goal is to obtain the following URL from the results:
Wanted URL
My final goal is to get a list containing the string of each document appearing at that URL; I don't need the URLs of said documents, only the following list:
result = ['EP2814089 (A4)', 'EP2814089 (B1)', ....]
Use Selenium from PyPI (https://pypi.org/project/selenium/) and get the id of the element you're interested in, or its XPath.
In your case:
id=publicationId1 or //a[@id='publicationId1']
or xpath=(.//*[normalize-space(text()) and normalize-space(.)='|'])[5]/following::a[2]
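A minimal Selenium sketch along those lines; the id comes from the note above, while the explicit wait and the Chrome driver setup are assumptions to adjust for your environment:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ('https://worldwide.espacenet.com/searchResults'
       '?ST=singleline&locale=fr_EP&submitted=true&DB=&query=ep2814089')

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
try:
    driver.get(url)
    # wait until the result anchor mentioned above has been rendered by JavaScript
    link = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'publicationId1'))
    )
    print(link.text)                   # title text of the first result
    print(link.get_attribute('href'))  # the URL the question asks for
finally:
    driver.quit()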
I think this will do the job:
import requests
from bs4 import BeautifulSoup
cookies = {
'JSESSIONID': '9ULYIsd9+RmCkgzGPoLdCWMP.espacenet_levelx_prod_1',
'org.springframework.web.servlet.i18n.CookieLocaleResolver.LOCALE': 'fr_EP',
'menuCurrentSearch': '%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
'currentUrl': 'https%3A%2F%2Fworldwide.espacenet.com%2FsearchResults%3FDB%3D%26ST%3Dsingleline%26locale%3Dfr_EP%26query%3Dep2814089',
'PGS': '10',
}
headers = {
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36',
'Sec-Fetch-User': '?1',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-Mode': 'navigate',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'tr,tr-TR;q=0.9',
}
params = (
('DB', ''),
('ST', 'singleline'),
('locale', 'fr_EP'),
('query', 'ep2814089'),
)
response = requests.get('https://worldwide.espacenet.com/searchResults', headers=headers, params=params, cookies=cookies)
soup = BeautifulSoup(response.text, 'html.parser')
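The requested list still has to be extracted from soup. One hedged way, borrowing the publicationId1 id pattern mentioned in the other answer (the selector is hypothetical and should be verified against the page source):

# collect the text of every result anchor whose id starts with 'publicationId'
# (hypothetical pattern based on the other answer; verify in the real markup)
result = [a.get_text(strip=True) for a in soup.select('a[id^=publicationId]')]
print(result)  # expected shape: ['EP2814089 (A4)', 'EP2814089 (B1)', ...]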