Python: Scraping Amazon webpage with bs4, BeautifulSoup

I'm trying to read specific information (name, price, etc.) from an Amazon webpage.
For that I'm using BeautifulSoup and requests, as suggested in most tutorials. My code can load the page and find the element I'm looking for, but it fails to actually get it. I checked the webpage and the element definitely exists.
Here is my code:
#import time
import requests
#import urllib.request
from bs4 import BeautifulSoup
URL = ('https://www.amazon.de/dp/B008JCUXNK/?coliid=I9G2T92PZXG06&colid=3ESRXLK53S0NY&psc=1&ref_=lv_ov_lig_dp_it')
# user agent = browser information (get via google search "my user agent")
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
page = requests.get(URL, headers=headers)  # webpage
soup = BeautifulSoup(page.content, 'html.parser')  # webpage as html
title = soup.find(id="productTitle")
print(title)
title is always None, so calling get_text() on it raises an error.
Can anybody tell me what's wrong?

Found a way to get past the captcha.
The request needs to contain more complete headers.
Example:
import requests

BASE_REQUEST = ('https://www.amazon.de/Philips-Haartrockner-ThermoProtect-Technologie-HP8230/dp/B00BCQIIMS?pf_rd_r=T1T8Z7QTQTGYM8F7KRN5&pf_rd_p=c832d309-197e-4c59-8cad-735a8deab917&pd_rd_r=20c6ed33-d548-47d7-a262-c53afe32df96&pd_rd_w=63hR3&pd_rd_wg=TYwZH&ref_=pd_gw_crs_zg_bs_84230031')
headers = {
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://www.amazon.com/',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

r = requests.get(BASE_REQUEST, headers=headers)
print(r.status_code)
if r.status_code == 200:
    print('success')
For more information on status codes, look up HTTP status codes.
Hope this helps anyone with similar problems.
Cheers!

Your code is correct, but I tried it and checked the value of page.content: it contains a captcha page. It looks like Amazon doesn't want you to scrape their site.
You can read about your case here: https://www.reddit.com/r/learnpython/comments/bf21fn/how_to_prevent_captcha_while_scraping_amazon/.
I also recommend reading Amazon's Terms and Conditions (https://www.amazon.com/gp/help/customer/display.html/ref=hp_551434_conditions) to be sure you can legally scrape it.
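A minimal defensive sketch, building on the original code, might check for the captcha page and for a missing productTitle before calling get_text(). The 'captcha' marker check is a heuristic assumption on my part, not something Amazon documents:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.de/dp/B008JCUXNK/?coliid=I9G2T92PZXG06&colid=3ESRXLK53S0NY&psc=1&ref_=lv_ov_lig_dp_it'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
# Heuristic: Amazon's robot-check page mentions a captcha; the exact wording may change.
if 'captcha' in page.text.lower():
    print('Amazon returned a captcha page - try more complete headers or slow down')
else:
    title = soup.find(id='productTitle')
    if title is None:
        print('productTitle not found in the returned HTML')
    else:
        print(title.get_text(strip=True))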

Related

BeautifulSoup find_all not finding all containers of a class

I'm trying to write a script to scrape some data off a Zillow page (https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/). I'm trying to gather data from every listing, but find_all only returns 9 instances of the class I'm searching for ('list-card-addr'), even though I've checked the HTML of the listings it misses and the class exists there too. Does anyone have an idea why? Here's my simple code:
from bs4 import BeautifulSoup
import requests
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
url="https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/"
response = requests.get(url, headers=req_headers)
data = response.text
soup = BeautifulSoup(data,'html.parser')
address = soup.find_all(class_='list-card-addr')
print(len(address))
The data is stored within an HTML comment. You can regex it out easily: the match is a string defining a JavaScript object, which you can parse with json.
import requests, re, json
r = requests.get('https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22savedSearchEnrollmentId%22%3A%22X1-SSc26hnbsm6l2u1000000000_8e08t%22%2C%22mapBounds%22%3A%7B%22west%22%3A-77.65840518457031%2C%22east%22%3A-77.11870181542969%2C%22south%22%3A38.13250414385234%2C%22north%22%3A38.444339281260426%7D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22mostrecentchange%22%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D',
headers = {'User-Agent':'Mozilla/5.0'})
data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
print(data['cat1'])
print(data['cat1']['searchList'].keys())
Within this are details on pagination and the next url, if applicable, to get all results. You have only asked for page 1 here.
For example, print addresses
for i in data['cat1']['searchResults']['listResults']:
    print(i['address'])
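If you want more than the addresses, the same listResults list can be handed to pandas and written out. This is a minimal sketch, assuming listResults is a list of dicts as parsed above; verify the available fields against the actual payload:
import pandas as pd

results = data['cat1']['searchResults']['listResults']
df = pd.DataFrame(results)            # one row per listing
print(df.columns.tolist())            # inspect which fields are available
df.to_csv('zillow_page1.csv', index=False)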

JSON data webscraping

I am attempting to scrape job titles from here.
Using BeautifulSoup I can scrape job titles from the first page, but not from the remaining pages. Using the developer tools > Network tab I found that the content type is JSON.
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd

s = requests.Session()
headers = {
    'Connection': 'keep-alive',
    'sec-ch-ua': '^\\^',
    'Accept': '*/*',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
    'Content-Type': 'application/json; charset=utf-8',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://jobs.epicor.com/search-jobs',
    'Accept-Language': 'en-US,en;q=0.9',
}
url = 'https://jobs.epicor.com/search-jobs/results?ActiveFacetID=0&CurrentPage=2&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&ShowRadius=False&IsPagination=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=1&SearchType=5&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf='
response = s.get(url, headers=headers).json()
data = json.dumps(response)
# print(data)
d2 = json.loads(data)
for x in d2.keys():
    print(x)
### From the above JSON result, how do I extract "jobtitle"?
The issue is that the JSON result above contains HTML tags. In this case, how do I scrape the job titles from the JSON data?
I would really appreciate any help on this.
I am unfortunately limited to using only requests or another popular Python library.
Thanks in advance.
If the job titles are all that you need from your response:
from bs4 import BeautifulSoup
# your code here
soup = BeautifulSoup(response["results"], "html.parser")
for item in soup.find_all("span", {"class": "jobtitle"}):
    print(item.text)
To navigate over the pages, if you hover your mouse cursor over the Prev or Next buttons there, you will see the url to request data from.
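Since the request URL already exposes CurrentPage and RecordsPerPage parameters, a hedged sketch for walking the pages could look like the following. The page range, the reduced header set, and the assumption that an empty result marks the last page are all mine and should be verified against the live site:
from bs4 import BeautifulSoup
import requests

s = requests.Session()
base_url = 'https://jobs.epicor.com/search-jobs/results'
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
    'Referer': 'https://jobs.epicor.com/search-jobs',
}  # subset of the headers from the question; the full set may be required
for page in range(1, 11):  # assumed upper bound; stop earlier when a page comes back empty
    params = {'CurrentPage': page, 'RecordsPerPage': 15, 'SearchType': 5}
    resp = s.get(base_url, params=params, headers=headers).json()
    soup = BeautifulSoup(resp.get('results', ''), 'html.parser')
    titles = [span.get_text(strip=True) for span in soup.find_all('span', class_='jobtitle')]
    if not titles:
        break
    print(page, titles)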

Failure to access different pages with (python) selenium vs direct access

I am quite puzzled by this code:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
PATH = 'My_path_to/chromedriver'

for i in range(3):
    driver = webdriver.Chrome(PATH)
    url = 'https://www.daraz.pk/laptops/?page=' + str(i + 1)
    print(url)
    driver.get(url)
    time.sleep(5)
    driver.quit()
Even though the link is written correctly in the navigation bar each time, it always shows page 1; yet when I click on the printed links, the display is correct... Is it a JavaScript trick to stop web scraping? Is it possible to overcome it?
You don't need to use Selenium to scrape daraz.pk.
Use requests: the page source has a JSON object that you can parse with json and let pandas export to a CSV file.
The following code does the job for one page.
You may want to unpack the dictionary to show all information at the first level in the CSV file (see the sketch after the code below).
import requests
from bs4 import *
import json
import pandas as pd
pd.DataFrame((json.loads([x for x in BeautifulSoup(requests.get('https://www.daraz.pk/laptops/?page=1').content, 'html.parser').findAll('script') if 'window.pageData' in x.text][0].text.split('window.pageData=')[1]))['mods']['listItems']).to_csv('page1.csv', index=False)
This should make more sense:
import requests
from bs4 import *
import json
import pandas as pd
page = requests.get('https://www.daraz.pk/laptops/?page=1')
soup = BeautifulSoup(page.content, 'html.parser')
json_data = [x for x in soup.findAll('script') if 'window.pageData' in x.text][0].text.split('window.pageData=')[1]
json_object = json.loads(json_data)
listed_items_dict = json_object['mods']['listItems']
dataframe = pd.DataFrame(listed_items_dict)
dataframe.to_csv('page1.csv', index=False)
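To unpack the dictionary so nested fields become their own CSV columns, pandas' json_normalize can flatten the parsed items. A small sketch, assuming listed_items_dict is a list of dicts as produced above:
flat = pd.json_normalize(listed_items_dict)   # nested keys become dotted column names
flat.to_csv('page1_flat.csv', index=False)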
Based on the comment, use this function. It uses requests, sends a cookie with the request, and returns a dictionary of listed items.
def get_page_ajax(n):
    headers = {
        'authority': 'www.daraz.pk',
        'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36',
        'content-type': 'application/json',
        'accept': '*/*',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.daraz.pk/laptops/?page=1',
        'accept-language': 'en-US,en;q=0.9',
        'cookie': 'cna=OZ/WFbQKLVwCAbKZLti/7Hsq; _scid=7861a3f6-8fc7-4051-8f21-7ff77b830e4f; lzd_cid=84c13174-852a-4f95-ed5d-a8dcdd6767bb; t_uid=84c13174-852a-4f95-ed5d-a8dcdd6767bb; lzd_sid=1ba989d0707604107e91e9001996d1fd; _tb_token_=5e88e18541b9f; hng=PK|en-PK|PKR|586; userLanguageML=en-PK; t_fv=1619138165830; _gcl_au=1.1.1640796331.1619138166; _ga=GA1.2.213815672.1619138167; _gid=GA1.2.1272622655.1619138167; __auc=3682a279178fc27bf6267d1df0c; _bl_uid=h1kzbndntF4lIg2hX9vq1m1oLdU0; _sctr=1|1619125200000; xlly_s=1; t_sid=oBnSCDmQiPyrQNhDz6NRhBUUjfRvyUHF; utm_channel=NA; daraz-marketing-tracker=hide; _m_h5_tk=a8d8092d608b9889dec7cefccdc3b351_1619183643275; _m_h5_tk_enc=068447fd0471b8537f47a430f7fe4128; __asc=f4f31b38178fe5456d962e14446; _gat_UA-31709783-1=1; JSESSIONID=7A1B772A77330A06B2D19937BDE5775B; tfstk=ctmcBmfEqqzb16zoRnZjhTB12gpRZYSatcoS4cfIen_xLumPiBQP80ByEScmXN1..; l=eBM3wODljUtubHaFBO5whurza77OUIOf1sPzaNbMiInca1uO6wI2zNCCg036JdtjQtf0uetzd4S4yReM7rzU-xNbmbKe6QuI2ov6-; isg=BMvLGwrYLxDvEE44amWSgStYWm-1YN_ieR8y9T3J-YoyXOi-xTPLM_26NkSy_Dfa',
    }
    params = (
        ('ajax', 'true'),
        ('page', str(n)),
    )
    response = requests.get('https://www.daraz.pk/laptops/', headers=headers, params=params)
    json_object = json.loads(response.content.decode())
    listed_items_dict = json_object['mods']['listItems']
    return listed_items_dict
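A usage sketch, assuming the cookie above is still valid and that listItems behaves as a list of dicts (consistent with the DataFrame call earlier), could stitch several pages into one CSV:
import pandas as pd

all_items = []
for n in range(1, 4):                   # first three pages; adjust as needed
    all_items.extend(get_page_ajax(n))  # each call returns the listed items for page n
pd.DataFrame(all_items).to_csv('laptops_pages_1_to_3.csv', index=False)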
I've tried your code, but I still don't know what the problem is. The pages are displayed correctly, but I noticed there is no difference between the elements shown on the first and second pages, while the third shows different elements. I don't know why, but I visited the URLs manually and got the same result.

Amazon login and access to orders

Problem:
I've read through articles for days and tried to log into my Amazon account with Python, but I'm failing each time. Since every article has a different approach, it is very hard to find the potential error source, especially as a lot of articles are older than 2-3 years.
I think from my current point of view the most straightforward way is to use BeautifulSoup (bs4) and requests. Which parser is best is another discussion; I've seen html.parser, html5lib and lxml, and as most Amazon-login-related articles work with html.parser, that's the one currently in my code, even if I would love to use lxml or html5lib later on.
All kinds of input and feedback help to summarize all the important points and workarounds.
I'm currently trying to reach the login page via 'https://www.amazon.de/gp/css/order-history?ref_=nav_orders_first', as 'https://www.amazon.de/ap/signin' gives me an error, at least in my browser. So I'm going to a page where a user needs to log in (my orders), get forwarded to the login page, and try to log in there. Is it possible to be logged out again when making a new request to another subpage, like switching pages? Also, I found an article using with requests.Session() as s: - is this a better way to request a site compared to not using a with block and Session()? I'm using "de" in the URL, by the way, but you can exchange that with "com" I guess.
Current code:
import bs4
from bs4 import BeautifulSoup
import requests

amazon_orders_url = r'https://www.amazon.de/gp/css/order-history?ref_=nav_orders_first'  # First-time visit login
amazon_login_url = r'https://www.amazon.de/ap/signin'  # Not working by browser access
credentials = {'email': "EMAILADRESS", "password": "PASSWORD"}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
           'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
           'accept-language': 'en,de-DE;q=0.9,de;q=0.8,en-US;q=0.7',
           'referer': 'https://www.amazon.de/ap/signin'}
# print(credentials['email'])  # print email address

with requests.Session() as s:
    s.headers = headers
    site = s.get(amazon_orders_url)  # , headers=headers
    # HTML parsing
    soup = BeautifulSoup(site.content, "html.parser")  # Alternatives: "html5lib", "lxml"
    # Print whole page
    # print(soup)
    # Check if Anmelden/Login exists
    for div in soup.find_all('div', class_='a-box'):
        headline = div.h1
        print(headline)
    signin_data = {s["name"]: s["value"]
                   for s in soup.select("form[name=signIn]")[0].select("input[name]")
                   if s.has_attr("value")}
    # signin_data = {}
    # signin_form = soup.find('form', {'name': 'signIn'})
    # for field in signin_form.find_all('input'):
    #     try:
    #         signin_data[field['name']] = field['value']
    #     except KeyError:
    #         pass
    signin_data[u'email'] = credentials['email']
    signin_data[u'password'] = credentials['password']
    post_response = s.post('https://www.amazon.de/ap/signin', data=signin_data)
    soup = BeautifulSoup(post_response.text, "html.parser")
    warning = soup.find('div', {'id': 'message_warning'})
    # if warning:
    #     print('Failed to login: {0}'.format(warning.text))
    print(soup)
    # print(post_response.content)
I have a similar problem in my project. So far I can obtain some of the header parameters needed to enter the login process.
from pprint import pprint
from bs4 import BeautifulSoup
import os
import requests

cookie = AMAZON_COOKIE = os.getenv("AMAZON_COOKIE", "")

# https://read.amazon.com/notebook
res = requests.get('https://read.amazon.com/notebook',
                   headers={
                       'Accept-Encoding': 'gzip, deflate',
                       'Content-Type': 'application/json; charset=UTF-8',
                       'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:87.0) Gecko/20100101 Firefox/87.0",
                       'Cookie': cookie
                   })
pprint(res.text)

def login():
    signin_page_res = requests.get('https://read.amazon.com/notebook',
                                   headers={
                                       'Accept-Encoding': 'gzip, deflate',
                                       'Content-Type': 'application/json; charset=UTF-8',
                                       'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:87.0) Gecko/20100101 Firefox/87.0",
                                   })
    soup = BeautifulSoup(signin_page_res.text, 'html.parser')
    login_inputs = {}
    for input_tag in soup.find_all('input'):
        name = input_tag.get('name')
        value = input_tag.get('value')
        if not value and name not in ('email', 'password'):
            continue
        login_inputs[name] = value
    login_inputs['email'] = os.getenv("AMAZON_USERNAME", "email")
    login_inputs['password'] = os.getenv("AMAZON_PASSWORD", "pass")
    pprint(login_inputs)
    login_res = requests.post('https://read.amazon.com/notebook',
                              headers={
                                  'Accept-Encoding': 'gzip, deflate',
                                  'Content-Type': 'application/json; charset=UTF-8',
                                  'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:87.0) Gecko/20100101 Firefox/87.0",
                              },
                              data=login_inputs)
    print(login_res.text)
    with open('index.html', 'w', encoding="utf-8") as f:
        f.write(login_res.text)

if __name__ == '__main__':
    login()
Recently I found a project called audible that uses the API to connect to an Amazon account and solves the password-encryption problem (see here).

How to Scrape Amazon using python 3

I am trying to read all the comments for a given product. This is both to learn Python and for a project; to simplify my task I chose a product at random.
The link I want to read is Amazon's, and I used urllib to open it:
amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
After reading the link into the "amazon" variable, displaying amazon gives me the message below:
print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>
So I read online and found I need to use the read method to get the source, but sometimes it gives me a webpage-like result and other times not:
print(amazon.read())
b''
How do I read the page and pass it to Beautiful Soup?
Edit 1
I did use requests.get, and when I checked the text of the retrieved page I found the content below, which does not match the website link:
print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<!--
To discuss automated access to Amazon data please contact api-services-support#amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>
<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b>Go to the Amazon.in home page to continue shopping</b>
</font>
</center>
</body>
</html>
Using your current library, urllib, this is what you could do: use .read() to get the HTML, then pass it into BeautifulSoup as shown below. Keep in mind that Amazon is a heavily anti-scraping website. You may be getting different results because the content is rendered by JavaScript, in which case you might have to use Selenium or Dryscrape. You may also need to pass headers/cookies and extra attributes with your request.
import urllib.request
from bs4 import BeautifulSoup

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = amazon.read()
soup = BeautifulSoup(html, 'html.parser')
EDIT: Turns out you're using requests now. I could get a 200 response using requests by passing in my headers like this:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)  # 200
--- Using Dryscrape
import dryscrape
from bs4 import BeautifulSoup

sess = dryscrape.Session(base_url='http://www.amazon.in')
sess.set_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')
sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = sess.body()
soup = BeautifulSoup(html, 'html.parser')
print(soup)
# Should give you the Amazon HTML now. Keep in mind I haven't tested this code. Please refer to the Dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html
I personally would use the requests library for this and not urllib. Requests has more features
import requests
From there, something like:
resp = requests.get(url)  # you can break up your parameters and pass base_url & params as well if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')
should do the job, as this is a rather simple HTTP request.
Edit:
Based on your error, you are going to have to research the parameters to pass to make your request look correct. In general with requests it will look something like this (obviously with the values you discover - check your browser's debug/developer tools to inspect your network traffic and see what you send to Amazon when using a browser):
url = "https://www.base.url.here"
params = {
    'param1': 'value1'
    .....
}
resp = requests.get(url, params=params)
For web scraping, use the requests and BeautifulSoup modules in Python 3.
Installing BeautifulSoup:
pip install beautifulsoup4
Use appropriate headers while sending the request.
headers = {
    'authority': 'www.amazon.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
Scrap.py
from bs4 import BeautifulSoup
import requests

url = "https://www.amazon.in/s/ref=mega_elec_s23_2_3_1_1?rh=i%3Acomputers%2Cn%3A3011505031&ie=UTF8&bbn=976392031"
headers = {
    'authority': 'www.amazon.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

response = requests.get(f"{url}", headers=headers)

with open("webpg.html", "w", encoding="utf-8") as file:  # saving the html file to disk
    file.write(response.text)

bs = BeautifulSoup(response.text, "html.parser")
print(bs)  # displaying the html; use bs.prettify() to make the document more readable
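From here you would normally pull specific elements out of bs rather than printing the whole document. A minimal sketch, assuming the product titles on the results page sit inside h2 tags (Amazon's markup changes often, so verify the selector in your browser's inspector):
# The h2 selector is an assumption - confirm it against the live page.
for h2 in bs.find_all('h2'):
    text = h2.get_text(strip=True)
    if text:
        print(text)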
