I previously wrote a program to analyze stock info, and to get historical data I used NASDAQ. For example, in the past, if I wanted to pull a year's worth of price quotes for CMG, all I needed to do was make a request to the following link to download a CSV of the historical quotes: https://www.nasdaq.com/api/v1/historical/CMG/stocks/2020-06-30/2019-06-30. However, now when I make the request, my connection times out and I cannot get any response. If I just enter the URL into a web browser, it still downloads the file just fine. Some example code is below:
import os
import requests as rq
from bs4 import BeautifulSoup as bs

h_url = 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2020-06-30/2019-06-30'
page_response = rq.get(h_url, timeout=30)
page = bs(page_response.content, 'html.parser')
dwnld_fl = os.path.join(os.path.dirname(__file__), 'Temp_data', 'hist_CMG_data.txt')
with open(dwnld_fl, 'w') as fl:
    fl.write(page.text)
Can someone please let me know if this works for them, or if there is something I should do differently to get it to work again? This is only an example, not the actual code, so if I accidentally made a simple syntax error, you can assume the actual file is correct, since it has worked without issue in the past.
You are missing the headers, and you are making a request to an invalid URL: the from-date and to-date are reversed, so the file downloaded in a browser is empty.
import requests
from bs4 import BeautifulSoup as bs
h_url= 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2019-06-30/2020-06-30'
headers = {
'authority': 'www.nasdaq.com',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
page_response = requests.get(h_url, timeout=30, allow_redirects=True, headers=headers)
with open("dump.txt", "w") as out:
    out.write(str(page_response.content))
This writes the byte string of the received data to the file "dump.txt" (use page_response.text instead if you want it decoded as text). You do not need BeautifulSoup to parse HTML here, as the response is a CSV text file, not HTML.
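Since the download is CSV text, it can be parsed directly with the standard csv module once saved. A minimal sketch, assuming column headers like the ones NASDAQ's export typically uses (the exact header names here are an assumption, not taken from the actual file):

```python
import csv
import io

# Hypothetical sample matching the general shape of the downloaded quotes;
# the real column names may differ.
sample = (
    "Date,Close/Last,Volume,Open,High,Low\n"
    "06/30/2020,$1050.00,100,$1040.00,$1060.00,$1035.00\n"
)

# csv.DictReader maps each row to a dict keyed by the header line.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)
print(rows[0]["Date"])        # 06/30/2020
print(rows[0]["Close/Last"])  # $1050.00
```

pandas.read_csv would work the same way on the saved file if you want a DataFrame instead.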
I have a question. Is it possible to download an image that is currently displayed on a website through requests, but without using its URL?
In my case that does not work, because the image shown on the website is different from the one under the link I am downloading from: the image changes each time the link is requested. I want to download exactly what is on the page so I can rewrite the code.
Previously I used Selenium and its screenshot option for this, but I have already rewritten all the code to use requests, and this is the only piece I am missing.
Does anyone have an idea how to download the photo that is currently on the site? Below is the code with links:
import requests
from requests_html import HTMLSession
headers = {
'Content-Type': 'image/png',
'Host': 'www.oglaszamy24.pl',
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-GPC': '1',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Referer': 'https://www.oglaszamy24.pl/dodaj-ogloszenie2.php?c1=8&c2=40&at=1',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7'
}
session = HTMLSession()
r = session.get('https://www.oglaszamy24.pl/dodaj-ogloszenie2.php?c1=8&c2=40&at=1')
r.html.render(sleep=2,timeout=20)
links = r.html.find("#captcha_img")
result = str(links)
results = result.split("src=")[1].split("'")[1]
resultss = "https://www.oglaszamy24.pl/"+results
with open('image.png', 'wb') as f:
    f.write(requests.get(resultss, headers=headers).content)
I'd rather use PIL (the Python Imaging Library) to take a screenshot of the element's bounding box (coordinates and size). You can get those with libraries like BS4 (BeautifulSoup) or Selenium. Then you'd have a local copy (a screenshot) of exactly what the user sees.
A lot of sites have protection against scrapers, and captcha services usually do not allow their resources to be downloaded, either via requests or otherwise.
But like that NFT joke goes: you don't download a screenshot...
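The bounding-box approach above can be sketched with Pillow. The coordinates here are hard-coded assumptions; in practice they would come from the element's position on the rendered page (e.g. Selenium's element.location and element.size), applied to a full-page screenshot file:

```python
from PIL import Image

# Stand-in for a full-page screenshot; in practice this would be
# an image saved by e.g. Selenium's driver.save_screenshot().
page = Image.new("RGB", (800, 600), "white")

# Assumed element position and size (x, y, width, height) --
# with Selenium these would come from element.location and element.size.
x, y, w, h = 100, 200, 300, 80

# Crop out just the element's region and save it.
element_img = page.crop((x, y, x + w, y + h))
element_img.save("captcha_crop.png")
print(element_img.size)  # (300, 80)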
I would like to scrape the source URLs for the images on an interactive map that holds the locations of various traffic cameras, and export them to JSON or CSV as a list. Because I am trying to gather this data from different websites, I have attempted to use ParseHub and Octoparse, to no avail. I previously attempted to use BS4 and Selenium but wasn't able to extract the div/img tag with the src URL. Any help would be appreciated. Below are examples of two different websites with similar but different methods for housing the images:
https://tripcheck.com/
https://cwwp2.dot.ca.gov/vm/iframemap.htm (Caltrans uses iframes)
The image names (for TripCheck) come from an API call: you need to request the CCTV IDs, and then you can build the URLs.
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0',
'Accept': '*/*',
'Accept-Language': 'en-US,en;q=0.5',
'X-Requested-With': 'XMLHttpRequest',
'DNT': '1',
'Connection': 'keep-alive',
'Referer': 'https://tripcheck.com/',
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
}
params = {
'dt': '1659122796377',
}
response = requests.get('https://tripcheck.com/Scripts/map/data/cctvinventory.js', params=params, headers=headers)
response.json()['features'][0]['attributes']['filename']
output:
'AstoriaUS101MeglerBrNB_pid392.jpg'
Above, you iterate over the attributes array in the JSON response. And then, for the URL:
import time
cam = response.json()['features'][0]['attributes']['filename']
rand = str(time.time()).replace('.','')[:13]
f'https://tripcheck.com/RoadCams/cams/{cam}?rand={rand}'
output:
'https://tripcheck.com/RoadCams/cams/AstoriaUS101MeglerBrNB_pid392.jpg?rand=1659123325440'
Note that the rand parameter appears to be part of a timestamp, as does the dt parameter in the original request. You can use time.time() to generate a timestamp and manipulate it as you need.
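Putting the pieces together, a hedged sketch of exporting every camera URL to CSV; the features list here is a hard-coded stand-in for response.json()['features'] from the inventory request above:

```python
import csv
import time

# Stand-in for response.json()['features']; the real call returns many entries.
features = [
    {'attributes': {'filename': 'AstoriaUS101MeglerBrNB_pid392.jpg'}},
    {'attributes': {'filename': 'AnotherCam_pid123.jpg'}},  # hypothetical second camera
]

# Timestamp-style cache-buster, as the rand parameter appears to be.
rand = str(time.time()).replace('.', '')[:13]

urls = [
    f"https://tripcheck.com/RoadCams/cams/{f['attributes']['filename']}?rand={rand}"
    for f in features
]

# Write one image URL per row.
with open('cameras.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['image_url'])
    writer.writerows([u] for u in urls)

print(urls[0])
```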
In Python, I want to scrape the table on a website (it's Japanese option trading information) and store it as a pandas DataFrame.
The website is here, and you need to click "Options Quotes" in order to access the page where I want to scrape the table. The final URL is https://svc.qri.jp/jpx/english/nkopm/, but you cannot directly access this page.
Here is my attempt:
pd.read_html("https://svc.qri.jp/jpx/english/nkopm/")
...HTTPError: HTTP Error 400: Bad Request
So I thought I needed to add a user agent. Here is another attempt:
url = "https://svc.qri.jp/jpx/english/nkopm/"
pd.read_html(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text)
...ValueError: No tables found
Another attempt
import urllib
url = 'https://svc.qri.jp/jpx/english/nkopm/'
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
tables = pd.read_html(response.read(), attrs={"class":"price-table"})[0]
...HTTPError: HTTP Error 400: Bad Request
I know how to play with pandas, so it doesn't have to be imported into a neat DataFrame in the first place; I just need to get the table into pandas, but I'm not sure why I cannot even read the page. Any help would be appreciated!
By the way, if you click the gray arrows in the middle column, it will add another row, and all of these rows can be opened and closed by clicking those buttons. It would be nice if I could import these rows as well, but it is not really a must.
Reading the documentation for the pandas function read_html, it says:
Read HTML tables into a list of DataFrame objects.
So the function expects structured input in the form of an HTML table. I actually can't access the website you're linking to, but I'm guessing it gives you back an entire website.
You need to extract the data in a structured format for pandas to make sense of it; you need to scrape it. There are a bunch of tools for that, and a popular one is BeautifulSoup.
Tl;dr: download the website with requests, pass it into BeautifulSoup, and then use BeautifulSoup to extract the data in a structured format.
Updated answer:
It seems the reason requests returns a 400 is that the website expects some additional headers. I just dumped the request my browser makes into requests, and it works!
import requests
headers = {
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'cross-site',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Referer': 'https://www.jpx.co.jp/english/markets/index.html',
'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,it;q=0.6,la;q=0.5',
}
response = requests.get('https://svc.qri.jp/jpx/english/nkopm/', headers=headers)
Based on Ahmad's answer, you're almost there. All you need to get the table is this:
import requests
import pandas as pd
headers = {
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Sec-Fetch-Site': 'cross-site',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-User': '?1',
'Sec-Fetch-Dest': 'document',
'Referer': 'https://www.jpx.co.jp/english/markets/index.html',
'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,it;q=0.6,la;q=0.5',
}
response = requests.get('https://svc.qri.jp/jpx/english/nkopm/', headers=headers)
table = pd.read_html(response.text, attrs={"class": "price-table"})[0]
print(table)
This outputs:
CALL ... PUT
Settlement09/18 ... Settlement09/18
0 2 ... 3030
1 Delta Gamma Theta Vega 0.0032 0.0000 -0.... ... Delta Gamma Theta Vega - - - -
2 Delta ... NaN
3 0.0032 ... NaN
4 Delta ... NaN
.. ... ... ...
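The table comes back with a two-level header (CALL/PUT over the field names), which read_html represents as a MultiIndex. A hedged sketch of flattening such columns into single names, using a toy DataFrame as a stand-in for the real table:

```python
import pandas as pd

# Toy frame with a two-level header shaped like the read_html output above.
table = pd.DataFrame(
    [[2, 3030], [4, 2900]],
    columns=pd.MultiIndex.from_tuples(
        [("CALL", "Settlement09/18"), ("PUT", "Settlement09/18")]
    ),
)

# Join the two header levels into flat, single-level column names.
table.columns = ["_".join(col) for col in table.columns]
print(table.columns.tolist())  # ['CALL_Settlement09/18', 'PUT_Settlement09/18']
```

From there, ordinary pandas indexing and cleaning applies.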
I'm trying to read specific information (name, price, etc.) from an Amazon webpage.
For that I'm using BeautifulSoup and requests, as suggested in most tutorials. My code can load the page and find the item I'm looking for, but it fails to actually get it. I checked the webpage; the item definitely exists.
Here is my code:
#import time
import requests
#import urllib.request
from bs4 import BeautifulSoup
URL = ('https://www.amazon.de/dp/B008JCUXNK/?coliid=I9G2T92PZXG06&colid=3ESRXLK53S0NY&psc=1&ref_=lv_ov_lig_dp_it')
# user agent = browser information (get via google search "my user agent")
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
page = requests.get(URL, headers=headers)# webpage
soup = BeautifulSoup(page.content, 'html.parser')# webpage as html
title = soup.find(id="productTitle")
print(title)
title is always None, so calling get_text() causes an error.
Can anybody tell me what's wrong?
Found a way to get past the captcha.
The request needs to contain a better header.
Example:
import requests

BASE_REQUEST = ('https://www.amazon.de/Philips-Haartrockner-ThermoProtect-Technologie-HP8230/dp/B00BCQIIMS?pf_rd_r=T1T8Z7QTQTGYM8F7KRN5&pf_rd_p=c832d309-197e-4c59-8cad-735a8deab917&pd_rd_r=20c6ed33-d548-47d7-a262-c53afe32df96&pd_rd_w=63hR3&pd_rd_wg=TYwZH&ref_=pd_gw_crs_zg_bs_84230031')
headers = {
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'referer': 'https://www.amazon.com/',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
r = requests.get(BASE_REQUEST, headers=headers)
print(r.status_code)
if r.status_code == 200:
    print('success')
For information on status codes, just Google "HTTP status codes".
Hope this helps anyone with similar problems.
Cheers!
Your code is 100% correct, but I tried it and checked the value of page.content: it contains a captcha. It looks like Amazon doesn't want you to scrape their site.
You can read about your case here: https://www.reddit.com/r/learnpython/comments/bf21fn/how_to_prevent_captcha_while_scraping_amazon/.
But I also recommend reading Amazon's Terms and Conditions (https://www.amazon.com/gp/help/customer/display.html/ref=hp_551434_conditions) to be sure you can legally scrape it.
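A quick way to detect whether a response is the captcha interstitial rather than the real product page is to look for a marker string in the HTML before parsing. The markers below are assumptions drawn from Amazon's automated-access notices; inspect your own page.content to pick reliable ones for your case:

```python
def looks_blocked(html: str) -> bool:
    """Heuristic: does this HTML look like a bot-check page instead of content?"""
    markers = (
        "api-services-support@amazon.com",  # appears in Amazon's automated-access notice
        "Type the characters you see in this image",  # typical captcha prompt text
    )
    return any(m in html for m in markers)

# Toy inputs standing in for page.content decoded to text:
blocked = "<p>To discuss automated access to Amazon data please contact api-services-support@amazon.com.</p>"
normal = "<span id='productTitle'>Some Product</span>"

print(looks_blocked(blocked))  # True
print(looks_blocked(normal))   # False
```

Checking this before calling find() makes the "title is always None" failure mode explicit instead of silent.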
I am trying to read all the comments for a given product. This is both to learn Python and for a project; to simplify my task, I chose a product at random to code against.
The link I want to read is Amazon's, and I used urllib to open the link:
amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
After reading the link into the amazon variable, when I display amazon I get the message below:
print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>
So I read online and found I need to use the read method to get the source, but sometimes it gives me a webpage-like result and other times not:
print(amazon.read())
b''
How do I read the page and pass it to Beautiful Soup?
Edit 1
I did use requests.get, and when I checked what is present in the text of the retrieved page, I found the content below, which does not match the website link.
print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<!--
To discuss automated access to Amazon data please contact api-services-support@amazon.com.
For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->
<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>
<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b>Go to the Amazon.in home page to continue shopping</b>
</font>
</center>
</body>
</html>
Using your current library, urllib, this is what you could do: use .read() to get the HTML, then pass it into BeautifulSoup like this. Keep in mind Amazon is a heavily anti-scraping website. The reason you get different results might be that the HTML is wrapped inside JavaScript; for that you might have to use Selenium or Dryscrape. You may also need to pass headers/cookies and extra attributes with your request.
amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = amazon.read()
soup = BeautifulSoup(html, 'html.parser')
EDIT ---- It turns out you're using requests now. I could get a 200 response using requests by passing in my headers like this:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1',headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)  # should print 200
--- Using Dryscrape
import dryscrape
from bs4 import BeautifulSoup
sess = dryscrape.Session(base_url='http://www.amazon.in')
sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
sess.set_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')
html = sess.body()
soup = BeautifulSoup(html, 'html.parser')
print(soup)
# Should give you all the Amazon HTML attributes now. Keep in mind I haven't tested this code.
# Please refer to the dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html
I personally would use the requests library for this, not urllib; requests has more features.
import requests
From there something like:
resp = requests.get(url)  # You can break up your parameters and pass base_url & params to this as well if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')
That should answer the mail for this one, as it is a rather simple HTTP request.
Edit:
Based on your error, you are going to have to research which parameters to pass to make your request look correct. In general, with requests it'll look something like this (obviously with the values you discover; check your browser's debug/developer tools to inspect your network traffic and see what you are sending to Amazon when using a browser):
url = "https://www.base.url.here"
params = {
'param1': 'value1'
.....
}
resp = requests.get(url, params=params)
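One way to check what the final request URL looks like, without sending anything, is to build a prepared request. The base URL and parameter names below are placeholders, not real Amazon parameters:

```python
import requests

url = "https://www.example.com/search"                    # placeholder base URL
params = {"param1": "value1", "param2": "value2"}         # placeholder parameters

# Build the request object and prepare it without sending it over the network.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)  # https://www.example.com/search?param1=value1&param2=value2
```

This makes it easy to compare what your script would send against what the browser's developer tools show.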
For web scraping, use the requests and BeautifulSoup modules in Python 3.
Installing BeautifulSoup:
pip install beautifulsoup4
Use appropriate headers while sending the request.
headers = {
'authority': 'www.amazon.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-dest': 'document',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
Scrap.py
from bs4 import BeautifulSoup
import requests
url = "https://www.amazon.in/s/ref=mega_elec_s23_2_3_1_1?rh=i%3Acomputers%2Cn%3A3011505031&ie=UTF8&bbn=976392031"
headers = {
'authority': 'www.amazon.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'dnt': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-dest': 'document',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}
response = requests.get(url, headers=headers)
with open("webpg.html", "w", encoding="utf-8") as file:  # saving html file to disk
    file.write(response.text)
bs = BeautifulSoup(response.text, "html.parser")
print(bs)  # displaying the html; use bs.prettify() to make the document more readable
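Once the soup is built, you usually select elements rather than printing the whole document. A sketch over a toy HTML snippet; the class names here are illustrative of Amazon-style listing markup, not guaranteed to match the live page:

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for response.text from the listing page.
html = """
<div class="s-result-item"><span class="a-text-normal">Laptop A</span></div>
<div class="s-result-item"><span class="a-text-normal">Laptop B</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pull out just the product title spans.
titles = [span.get_text() for span in soup.select("div.s-result-item span.a-text-normal")]
print(titles)  # ['Laptop A', 'Laptop B']
```

Inspect the saved webpg.html in a browser to find the real selectors for the data you want.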