How to scrape Amazon using Python 3

I am trying to read all the comments for a given product, both to learn Python and for a project. To simplify the task I picked a product at random to code against.
The link I want to read is on Amazon, and I used urllib to open it:
amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
After reading the link into the amazon variable, displaying it gives the message below:
print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>
So I read online and found that I need to call read() to get the source, but sometimes it returns something that looks like a web page and other times it doesn't:
print(amazon.read())
b''
How do I read the page and pass it to Beautiful Soup?
Edit 1
I did use requests.get, and when I check the text of the retrieved page I find the content below, which does not match the page at that link.
print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>
<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b>Go to the Amazon.in home page to continue shopping</b>
</font>
</center>
</body>
</html>

Using your current library, urllib, this is what you could do: call .read() to get the HTML, then pass it into BeautifulSoup as shown below. Keep in mind that Amazon is heavily anti-scraping. The inconsistent results are likely because parts of the page are rendered by JavaScript; for that you might have to use Selenium or Dryscrape. You may also need to pass headers, cookies and other attributes with your request.
amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = amazon.read()
soup = BeautifulSoup(html, 'html.parser')
EDIT: It turns out you're using requests now. I could get a 200 response with requests by passing in my headers like this:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1',headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)  # 200
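If Amazon still intermittently sends back the 503 "rush hour" page from your question, a simple retry with a growing delay sometimes gets through. This is only a sketch around the same request, not a reliable way around Amazon's throttling:
import time
import requests

def get_with_retry(url, headers, retries=5, delay=5):
    # Retry the GET a few times, waiting longer after each throttled attempt.
    for attempt in range(retries):
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response
        time.sleep(delay * (attempt + 1))
    return response  # give up and return the last (failing) response

# Same product URL and headers dict as in the snippet above.
response = get_with_retry('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1', headers)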
--- Using Dryscrape
import dryscrape
from bs4 import BeautifulSoup
sess = dryscrape.Session(base_url='http://www.amazon.in')
sess.set_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')
sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = sess.body()
soup = BeautifulSoup(html, 'html.parser')
print(soup)
This should give you the full Amazon HTML now. Keep in mind I haven't tested this code; refer to the Dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html
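If you would rather try Selenium (the other option mentioned above), a minimal, untested sketch looks like this; it assumes Chrome and a matching chromedriver are installed:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is available on your PATH
try:
    driver.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
    html = driver.page_source  # HTML after the browser has run the page's JavaScript
finally:
    driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.title)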

I personally would use the requests library for this rather than urllib; requests has more features.
import requests
From there, something like:
resp = requests.get(url)  # you can break up your parameters and pass base_url & params to this as well if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')
should do the job here, as it is a rather simple HTTP request.
Edit:
Based on your error, you are going to have to research which parameters to pass to make your requests look correct. In general, with requests it will look something like this (obviously with the values you discover; check your browser's debug/developer tools to inspect your network traffic and see what you are sending to Amazon when using a browser):
url = "https://www.base.url.here"
params = {
'param1': 'value1'
.....
}
resp = requests.get(url, params=params)

For web scraping, use the requests and BeautifulSoup modules in Python 3.
Installing BeautifulSoup:
pip install beautifulsoup4
Use appropriate headers when sending the request:
headers = {
'authority': 'www.amazon.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-dest': 'document',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
Scrap.py
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.in/s/ref=mega_elec_s23_2_3_1_1?rh=i%3Acomputers%2Cn%3A3011505031&ie=UTF8&bbn=976392031"
headers = {
'authority': 'www.amazon.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-dest': 'document',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
response = requests.get(url, headers=headers)

# save the html file to disk
with open("webpg.html", "w", encoding="utf-8") as file:
    file.write(response.text)

bs = BeautifulSoup(response.text, "html.parser")
print(bs)  # use bs.prettify() to make the document more readable
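From here you can pull data out with BeautifulSoup selectors. The attribute and tag names below are only guesses at Amazon's current markup (it changes frequently), so inspect the page in your browser and adjust; this is a sketch, not a tested extractor:
# Hypothetical example: print result titles from the search page parsed above.
# The data-component-type attribute is an assumption about Amazon's search-result markup.
for result in bs.find_all("div", attrs={"data-component-type": "s-search-result"}):
    heading = result.find("h2")
    if heading is not None:
        print(heading.get_text(strip=True))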

Related

Scrape a table from a website and store it as a pandas DataFrame

In Python, I want to scrape a table from a website (it's Japanese option trading information) and store it as a pandas DataFrame.
The website is here, and you need to click "Options Quotes" in order to access the page where I want to scrape the table. The final URL is https://svc.qri.jp/jpx/english/nkopm/ but you cannot directly access this page.
Here is my attempt:
pd.read_html("https://svc.qri.jp/jpx/english/nkopm/")
...HTTPError: HTTP Error 400: Bad Request
So I thought I need to add a user agent. Here is my another attempt:
url = "https://svc.qri.jp/jpx/english/nkopm/"
pd.read_html(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text)
...ValueError: No tables found
Another attempt
import urllib
url = 'https://svc.qri.jp/jpx/english/nkopm/'
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
tables = pd.read_html(response.read(), attrs={"class":"price-table"})[0]
...HTTPError: HTTP Error 400: Bad Request
I know how to work with pandas, so it doesn't have to come in as a neat DataFrame in the first place. I just need to get the table into pandas, but I'm not sure why I cannot even read the page. Any help would be appreciated!
By the way, if you click the gray arrows in the middle column, it adds another row, and all of those rows can be opened and closed by clicking those buttons. It would be nice if I could import these rows as well, but it is not really a must.
Reading the documentation of the pandas function read_html, it says:
Read HTML tables into a list of DataFrame objects.
So the function expects structured input in the form of an HTML table. I actually can't access the website you're linking to, but I'm guessing it gives you back an entire web page.
You need to extract the data in a structured format in order for pandas to make sense of it. You need to scrape it. There are a bunch of tools for that; one popular one is BeautifulSoup.
Tl;dr: So what you need to do is download the website with requests, pass it into BeautifulSoup and then use BeautifulSoup to extract the data in a structured format.
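As a rough illustration of that workflow (the URL is a placeholder and the column layout is assumed, so adjust the selectors to the real page; for the page in the question you will also need the headers shown further down):
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical walk-through: build a DataFrame from an HTML table by hand.
response = requests.get("https://example.com/page-with-a-table")  # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find("table", attrs={"class": "price-table"})  # class name taken from the question's attempt
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

df = pd.DataFrame(rows[1:], columns=rows[0])  # treat the first row as the header
print(df.head())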
Updated answer:
It seems the reason the request is returning a 400 is that the website expects some additional headers. I just dumped the request my browser makes into requests and it works!
import requests
headers = {
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'cross-site',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Referer': 'https://www.jpx.co.jp/english/markets/index.html',
'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,it;q=0.6,la;q=0.5',
}
response = requests.get('https://svc.qri.jp/jpx/english/nkopm/', headers=headers)
Based on Ahmad's answer, you're almost there:
All you need to get the table is this:
import requests
import pandas as pd
headers = {
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'cross-site',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Referer': 'https://www.jpx.co.jp/english/markets/index.html',
'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,it;q=0.6,la;q=0.5',
}
response = requests.get('https://svc.qri.jp/jpx/english/nkopm/', headers=headers)
table = pd.read_html(response.text, attrs={"class": "price-table"})[0]
print(table)
This outputs:
CALL ... PUT
Settlement09/18 ... Settlement09/18
0 2 ... 3030
1 Delta Gamma Theta Vega 0.0032 0.0000 -0.... ... Delta Gamma Theta Vega - - - -
2 Delta ... NaN
3 0.0032 ... NaN
4 Delta ... NaN
.. ... ... ...
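One follow-up: because the CALL/PUT header spans two rows, pandas typically returns the columns as a MultiIndex here. If that is awkward to work with, you can flatten it afterwards; a small sketch, assuming table is the DataFrame from above with MultiIndex columns:
# Flatten the two-level column header (e.g. ('CALL', 'Settlement09/18') -> 'CALL Settlement09/18').
table.columns = [' '.join(str(level) for level in col).strip() for col in table.columns]
print(list(table.columns)[:4])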

Python: Scraping Amazon webpage with bs4, BeautifulSoup

I'm trying to read specific information (name, price, etc.) from an Amazon webpage.
For that I'm using BeautifulSoup and requests, as suggested in most tutorials. My code can load the page and look for the item I want, but it fails to actually get it. I checked the webpage; the item definitely exists.
Here is my code:
#import time
import requests
#import urllib.request
from bs4 import BeautifulSoup
URL = ('https://www.amazon.de/dp/B008JCUXNK/?coliid=I9G2T92PZXG06&colid=3ESRXLK53S0NY&psc=1&ref_=lv_ov_lig_dp_it')
# user agent = browser information (get via google search "my user agent")
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
page = requests.get(URL, headers=headers)# webpage
soup = BeautifulSoup(page.content, 'html.parser')# webpage as html
title = soup.find(id="productTitle")
print(title)
title is always None, so calling get_text() on it causes an error.
Can anybody tell me what's wrong?
Found a way to get past the captcha.
The request needs to contain better headers.
Example:
import requests

BASE_REQUEST = ('https://www.amazon.de/Philips-Haartrockner-ThermoProtect-Technologie-HP8230/dp/B00BCQIIMS?pf_rd_r=T1T8Z7QTQTGYM8F7KRN5&pf_rd_p=c832d309-197e-4c59-8cad-735a8deab917&pd_rd_r=20c6ed33-d548-47d7-a262-c53afe32df96&pd_rd_w=63hR3&pd_rd_wg=TYwZH&ref_=pd_gw_crs_zg_bs_84230031')
headers = {
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'referer': 'https://www.amazon.com/',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
r = requests.get(BASE_REQUEST, headers=headers)
print(r.status_code)
if r.status_code == 200:
    print('success')
For information on status codes, just google HTTP status codes.
Hope this helps anyone with similar problems.
Cheers!
Your code is 100% correct, but I tried it and checked the value of page.content: it contains a captcha. It looks like Amazon doesn't want you to scrape their site.
You can read about your case here: https://www.reddit.com/r/learnpython/comments/bf21fn/how_to_prevent_captcha_while_scraping_amazon/.
But I also recommend reading Amazon's Terms and Conditions https://www.amazon.com/gp/help/customer/display.html/ref=hp_551434_conditions to be sure you can legally scrape it.
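If you want your script to notice this situation instead of silently getting None, a crude check of the response body can help. This is only a sketch; the assumption that the bot-check page contains the word "captcha" is mine, not something Amazon documents:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.de/dp/B008JCUXNK/?coliid=I9G2T92PZXG06&colid=3ESRXLK53S0NY&psc=1&ref_=lv_ov_lig_dp_it'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

# Crude heuristic: assume Amazon's bot-check page mentions "captcha" somewhere in its HTML.
if 'captcha' in page.text.lower():
    print("Got Amazon's bot check instead of the product page; try better headers or back off.")
else:
    title = soup.find(id="productTitle")
    print(title.get_text(strip=True) if title else "productTitle not found")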

Extract data from dynamic HTML Table with Python 3

I've been working on a Python 3 script to generate BibTeX entries, and I have ISSNs that I would like to use to look up information about the associated journal.
For instance, I would like to take the ISSN 0897-4756 and find that this is the journal Chemistry of Materials, published by ACS Publications.
I can do this manually using this site, where the info I am looking for is in the table matched by the XPath //table[@id="journal-search-results-table"], or more specifically in the cells of its table body.
I have, however, not been able to automate this successfully using Python 3.x.
I have attempted to access the data using the httplib2, requests, urllib2, and lxml.html packages, with no success thus far.
What I have so far is shown below:
import certifi
import lxml.html
import urllib.request
ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
request = urllib.request.Request(address,None,hdr) #The assembled request
response = urllib.request.urlopen(request)
html = response.read()
tree = lxml.html.fromstring(html)
print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']
# Shows that I am connecting to the table
print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> []
# Should???? hold the data segments that I am looking for?
Exact page being queried by the above
From what I can tell, the table's tbody element, and thus the tr and td elements it contains, are not yet loaded at the time Python interprets the HTML, which is preventing me from reading the data.
How do I make it so that I can read out the Journal Name and Publisher from the specified table above?
As you mentioned in your question, this table is populated dynamically by JavaScript. To get around this you actually have to render the JavaScript, using either:
a web driver like Selenium, which loads the page the same way a user's browser would (and therefore renders the JavaScript), or
requests-html, a relatively new module that can render JavaScript on a web page and has a lot of other useful features for web scraping.
This is one way to solve your problem using requests-html:
from requests_html import HTMLSession
ISSN = "0897-4756"
address = "https://www.journalguide.com/journals/search?type=journal-name&journal-name={}".format(ISSN)
hdr = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'
}
ses = HTMLSession()
response = ses.get(address, headers=hdr)
response.html.render() # render the javascript to load the elements in the table
tree = response.html.lxml # no need to import lxml.html because requests-html can do this for you
print(tree.xpath('//table[@id="journal-search-results-table"]/text()'))
# >> ['\n', '\n']
print(tree.xpath('//table[@id="journal-search-results-table"]//td/text()'))
# >> ['ACS Publications', '1.905', 'No', '\n', '\n', '\n']

Getting timeout errors when downloading CSVs using the requests API

I previously wrote a program to analyze stock info, and to get historical data I used NASDAQ. For example, in the past, if I wanted to pull a year's worth of price quotes for CMG, all I needed to do was make a request to the following link, h_url = https://www.nasdaq.com/api/v1/historical/CMG/stocks/2020-06-30/2019-06-30, to download a CSV of the historical quotes. However, now when I make the request my connection times out and I cannot get any response. If I just enter the URL into a web browser it still downloads the file just fine. Some example code is below:
import os
import requests as rq
from bs4 import BeautifulSoup as bs

h_url = 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2020-06-30/2019-06-30'
page_response = rq.get(h_url, timeout=30)
page = bs(page_response.content, 'html.parser')
dwnld_fl = os.path.join(os.path.dirname(__file__), 'Temp_data', 'hist_CMG_data.txt')
fl = open(dwnld_fl, 'w')
fl.write(page.text)
Can someone please let me know if this works for them, or if there is something I should do differently to get it to work again? This is only an example, not the actual code, so if I accidentally made a simple syntax error you can assume the actual file is correct, since it has worked without issue in the past.
You are missing the headers and making a request to an invalid URL (the file downloaded in a browser is empty).
import requests
from bs4 import BeautifulSoup as bs
h_url= 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2019-06-30/2020-06-30'
headers = {
'authority': 'www.nasdaq.com',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
page_response = requests.get(h_url, timeout=30, allow_redirects=True, headers=headers)
with open("dump.txt", "w") as out:
    out.write(str(page_response.content))
This writes the received data to the file "dump.txt" as a byte string. You do not need BeautifulSoup to parse the response, as it is a text (CSV) file, not HTML.
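Since that endpoint returns CSV data, you could also load the response straight into pandas instead of dumping raw bytes to a file. A small sketch, assuming the body really is well-formed CSV:
import io

import pandas as pd
import requests

h_url = 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2019-06-30/2020-06-30'
page_response = requests.get(h_url, timeout=30, allow_redirects=True, headers=headers)  # headers as defined above

# Parse the CSV body directly into a DataFrame.
quotes = pd.read_csv(io.StringIO(page_response.text))
print(quotes.head())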

How can I get the reputation of file hashes on VirusTotal using the requests and bs4 modules, without using VirusTotal's public API?

My requirement is to check the reputation of multiple file hashes on VirusTotal using Python. I do not want to use VirusTotal's public API since it is capped at 4 requests/minute. I thought of using the requests module and Beautiful Soup to get this done.
Please check the link below:
https://www.virustotal.com/gui/file/f8ee4c00a3a53206d8d37abe5ed9f4bfc210a188cd5b819d3e1f77b34504061e/summary
I need to capture the 54/69 detection count for this file. I have a list of file hashes in an Excel sheet that I can loop over for detection status once I get it working for this one hash.
But I am not able to get the count of engines that detected the file hash as malicious; the CSS selector for the count gives me only an empty list. Please check the code I have written below:
import requests
from bs4 import BeautifulSoup
filehash='F8EE4C00A3A53206D8D37ABE5ED9F4BFC210A188CD5B819D3E1F77B34504061E'
filehash_lower = filehash.lower()
URL = 'https://www.virustotal.com/gui/file/' +filehash+'/detection'
response = requests.get(URL)
print(response)
soup = BeautifulSoup(response.content,'html.parser')
detection_details = soup.select('div.detections')
print(detection_details)
Here is an approach using the site's AJAX calls:
import requests
import json
headers = {
'pragma': 'no-cache',
'x-app-hostname': 'https://www.virustotal.com/gui/',
'dnt': '1',
'x-app-version': '20190611t171116',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,la;q=0.6,mt;q=0.5',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
'accept': 'application/json',
'cache-control': 'no-cache',
'authority': 'www.virustotal.com',
'referer': 'https://www.virustotal.com/',
}
response = requests.get('https://www.virustotal.com/ui/files/f8ee4c00a3a53206d8d37abe5ed9f4bfc210a188cd5b819d3e1f77b34504061e', headers=headers)
data = json.loads(response.content)
malicious = data['data']['attributes']['last_analysis_stats']['malicious']
undetected = data['data']['attributes']['last_analysis_stats']['undetected']
print(malicious, 'malicious out of', malicious + undetected)
output:
54 malicious out of 69
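To handle the whole list of hashes from your Excel sheet, you could wrap this in a small helper and loop over it. A sketch only: it reuses the same undocumented UI endpoint and headers shown above, which VirusTotal may change or throttle at any time:
import time

import requests

def hash_reputation(filehash, headers):
    # Return (malicious, total) for one file hash via the UI endpoint used above.
    url = 'https://www.virustotal.com/ui/files/' + filehash.lower()
    data = requests.get(url, headers=headers).json()
    stats = data['data']['attributes']['last_analysis_stats']
    return stats['malicious'], stats['malicious'] + stats['undetected']

# Replace with the hashes read from your Excel sheet.
hashes = ['F8EE4C00A3A53206D8D37ABE5ED9F4BFC210A188CD5B819D3E1F77B34504061E']

for h in hashes:
    malicious, total = hash_reputation(h, headers)  # headers as defined above
    print(h, '->', malicious, 'malicious out of', total)
    time.sleep(2)  # slow down a little to avoid being blocked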
