Scraping image source from an interactive map - python

I would like to scrape the source URLs for the images on an interactive map that holds the locations of various traffic cameras, and export them as a list to JSON or CSV. Because I am trying to gather this data from different websites, I have attempted to use ParseHub and Octoparse, to no avail. I previously attempted to use BS4 and Selenium but wasn't able to extract the div/img tag with the src URL. Any help would be appreciated. Below are examples of two different websites with similar but different methods for housing the images.
https://tripcheck.com/
https://cwwp2.dot.ca.gov/vm/iframemap.htm (Caltrans uses iframes)

The image names (for TripCheck) come from an API call. You would need to request the CCTV IDs, then you can build the URLs.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'X-Requested-With': 'XMLHttpRequest',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Referer': 'https://tripcheck.com/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
}
params = {
    'dt': '1659122796377',
}
response = requests.get('https://tripcheck.com/Scripts/map/data/cctvinventory.js', params=params, headers=headers)
response.json()['features'][0]['attributes']['filename']
output:
'AstoriaUS101MeglerBrNB_pid392.jpg'
Above, you would iterate over the features array in the JSON response, reading each entry's attributes. And then for the URL:
import time
cam = response.json()['features'][0]['attributes']['filename']
rand = str(time.time()).replace('.','')[:13]
f'https://tripcheck.com/RoadCams/cams/{cam}?rand={rand}'
output:
'https://tripcheck.com/RoadCams/cams/AstoriaUS101MeglerBrNB_pid392.jpg?rand=1659123325440'
Note that the rand parameter appears to be a timestamp, as does the 'dt' parameter in the original request. You can use time.time() to generate a timestamp and format it as needed.
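To get from here to the JSON/CSV export the question asks for, a minimal end-to-end sketch could look like the following. It reuses the headers dict from above; the output file names are arbitrary choices.
import csv
import json
import time
import requests

rand = str(time.time()).replace('.', '')[:13]
response = requests.get('https://tripcheck.com/Scripts/map/data/cctvinventory.js',
                        params={'dt': rand}, headers=headers)

# Build one record per camera from the features array.
cameras = []
for feature in response.json()['features']:
    name = feature['attributes']['filename']
    cameras.append({'filename': name,
                    'url': f'https://tripcheck.com/RoadCams/cams/{name}?rand={rand}'})

with open('cameras.json', 'w') as f:
    json.dump(cameras, f, indent=2)

with open('cameras.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['filename', 'url'])
    writer.writeheader()
    writer.writerows(cameras)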


Downloading an image with requests does not work correctly

I have a question: is it possible to download an image that is currently displayed on a website through requests, but without re-requesting its URL?
In my case that does not work, because the image shown on the website differs from the one served by the link I download from; the image changes every time the link is requested. I want to download exactly what is shown on the page so I can read the code from it.
Previously I used Selenium and its screenshot option for this, but I have already rewritten all the code to use requests and this is the only piece missing.
Does anyone have an idea how to download the photo that is currently on the site?
Below is the code with links:
import requests
from requests_html import HTMLSession

headers = {
    'Content-Type': 'image/png',
    'Host': 'www.oglaszamy24.pl',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-GPC': '1',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.oglaszamy24.pl/dodaj-ogloszenie2.php?c1=8&c2=40&at=1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7'
}

session = HTMLSession()
r = session.get('https://www.oglaszamy24.pl/dodaj-ogloszenie2.php?c1=8&c2=40&at=1')
r.html.render(sleep=2, timeout=20)
links = r.html.find("#captcha_img")
result = str(links)
results = result.split("src=")[1].split("'")[1]
resultss = "https://www.oglaszamy24.pl/" + results
with open('image.png', 'wb') as f:
    f.write(requests.get(resultss, headers=headers).content)
I'd rather use PIL (the Python Imaging Library) and take a screenshot of the element's bounding box (coordinates and size), which you can get with libraries like BS4 (BeautifulSoup) or Selenium.
Then you'd have a local copy (a screenshot) of exactly what the user sees.
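A minimal sketch of that idea, assuming Selenium with Chrome and Pillow are installed; the selector #captcha_img is taken from the question above:
import io
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.oglaszamy24.pl/dodaj-ogloszenie2.php?c1=8&c2=40&at=1')

# Bounding box (x, y, width, height) of the captcha element on the page.
element = driver.find_element(By.CSS_SELECTOR, '#captcha_img')
box = element.rect

# Screenshot the viewport, then crop to the element's bounding box.
png = driver.get_screenshot_as_png()
image = Image.open(io.BytesIO(png))
image.crop((box['x'], box['y'],
            box['x'] + box['width'],
            box['y'] + box['height'])).save('captcha.png')
driver.quit()
On scrolled pages or high-DPI displays the coordinates may need adjusting; Selenium 4 can also do the crop for you with element.screenshot('captcha.png').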
A lot of sites have protection against scrapers, and captcha services usually do not allow their resources to be downloaded, either via requests or otherwise.
But like that NFT joke goes: you don't download a screenshot...

JSON data webscraping

I am attempting to scrape job titles from here.
Using BeautifulSoup I can scrape job titles from the first page, but I am not able to scrape job titles from the remaining pages. Using the developer tools > Network tab, I understood that the content type is JSON.
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd

s = requests.Session()
headers = {
    'Connection': 'keep-alive',
    'sec-ch-ua': '^\\^',
    'Accept': '*/*',
    'X-Requested-With': 'XMLHttpRequest',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
    'Content-Type': 'application/json; charset=utf-8',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://jobs.epicor.com/search-jobs',
    'Accept-Language': 'en-US,en;q=0.9',
}
url = 'https://jobs.epicor.com/search-jobs/results?ActiveFacetID=0&CurrentPage=2&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&ShowRadius=False&IsPagination=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=1&SearchType=5&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf='
response = s.get(url, headers=headers).json()
data = json.dumps(response)
# print(data)
d2 = json.loads(data)
for x in d2.keys():
    print(x)
# From the above JSON results, how do I extract "jobtitle"?
The issue is that the JSON data in the result above contains HTML tags. In this case, how do I scrape job titles from the JSON data?
I would really appreciate any help on this.
I am unfortunately limited to using only requests or another popular Python library.
Thanks in advance.
If the job titles are all you need from your response text:
from bs4 import BeautifulSoup

# your code here
soup = BeautifulSoup(response["results"], "html.parser")
for item in soup.findAll("span", {"class": "jobtitle"}):
    print(item.text)
To navigate over the pages, hover your mouse cursor over the Prev or Next buttons there and you will see the URL to request data from.
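A minimal pagination sketch based on the CurrentPage parameter visible in that URL; the page range and the trimmed-down parameter set are assumptions, so adjust them to what the site actually expects:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',
}
titles = []
for page in range(1, 6):  # assumed number of pages
    params = {'CurrentPage': page, 'RecordsPerPage': 15, 'SearchType': 5}
    data = requests.get('https://jobs.epicor.com/search-jobs/results',
                        params=params, headers=headers).json()
    soup = BeautifulSoup(data['results'], 'html.parser')
    titles += [span.text.strip() for span in soup.find_all('span', {'class': 'jobtitle'})]
print(titles)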

Scrape a table from a website and store as pandas

In Python, I want to scrape the table on a website (it shows Japanese option trading information) and store it as a pandas DataFrame.
The website is here, and you need to click "Options Quotes" in order to access the page where I want to scrape the table. The final URL is https://svc.qri.jp/jpx/english/nkopm/ but you cannot directly access this page.
Here is my attempt:
pd.read_html("https://svc.qri.jp/jpx/english/nkopm/")
...HTTPError: HTTP Error 400: Bad Request
So I thought I needed to add a user agent. Here is another attempt:
url = "https://svc.qri.jp/jpx/english/nkopm/"
pd.read_html(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text)
...ValueError: No tables found
Another attempt
import urllib.request

url = 'https://svc.qri.jp/jpx/english/nkopm/'
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
tables = pd.read_html(response.read(), attrs={"class": "price-table"})[0]
...HTTPError: HTTP Error 400: Bad Request
I know how to work with pandas, so it doesn't have to come in as a neat DataFrame in the first place; I just need to get the table into pandas, but I'm not sure why I cannot even read the page. Any help would be appreciated!
By the way, if you click the gray arrows in the middle column, it will add another row, and all of these rows can be opened and closed by clicking those buttons.
It would be nice if I could import these rows as well, but that is not really a must.
Reading the documentation of the pandas function read_html, it says:
Read HTML tables into a list of DataFrame objects.
So the function expects structured input in the form of an HTML table. I actually can't access the website you're linking to, but I'm guessing it will give you back an entire HTML page.
You need to extract the data in a structured format in order for pandas to make sense of it. You need to scrape it. There's a bunch of tools for that, and one popular one is BeautifulSoup.
Tl;dr: So what you need to do is download the website with requests, pass it into BeautifulSoup and then use BeautifulSoup to extract the data in a structured format.
Updated answer:
It seems the reason requests is returning a 400 is that the website expects some additional headers. I just dumped the request my browser makes into requests and it works!
import requests

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.jpx.co.jp/english/markets/index.html',
    'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,it;q=0.6,la;q=0.5',
}
response = requests.get('https://svc.qri.jp/jpx/english/nkopm/', headers=headers)
Based on Ahmad's answer, you're almost there:
All you need to get the table is this:
import requests
import pandas as pd

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.jpx.co.jp/english/markets/index.html',
    'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,it;q=0.6,la;q=0.5',
}
response = requests.get('https://svc.qri.jp/jpx/english/nkopm/', headers=headers)
table = pd.read_html(response.text, attrs={"class": "price-table"})[0]
print(table)
This outputs:
CALL ... PUT
Settlement09/18 ... Settlement09/18
0 2 ... 3030
1 Delta Gamma Theta Vega 0.0032 0.0000 -0.... ... Delta Gamma Theta Vega - - - -
2 Delta ... NaN
3 0.0032 ... NaN
4 Delta ... NaN
.. ... ... ...

Getting timeout errors when downloading CSVs using the requests API

I previously wrote a program to analyze stock info, and to get historical data I used NASDAQ. For example, in the past, if I wanted to pull a year's worth of price quotes for CMG, all I needed to do was make a request to the following link to download a CSV of the historical quotes: https://www.nasdaq.com/api/v1/historical/CMG/stocks/2020-06-30/2019-06-30. However, now when I make the request my connection times out and I cannot get any response. If I just enter the URL into a web browser it still downloads the file just fine. Some example code is below:
import os
import requests as rq
from bs4 import BeautifulSoup as bs

h_url = 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2020-06-30/2019-06-30'
page_response = rq.get(h_url, timeout=30)
page = bs(page_response.content, 'html.parser')
dwnld_fl = os.path.join(os.path.dirname(__file__), 'Temp_data', 'hist_CMG_data.txt')
fl = open(dwnld_fl, 'w')
fl.write(page.text)
fl.close()
Can someone please let me know if this works for them, or if there is something I should do differently to get it to work again? This is only an example, not the actual code, so if I accidentally made a simple syntax error you can assume the actual file is correct, since it has worked without issue in the past.
You are missing the headers and making a request to an invalid URL (the file downloaded in a browser is empty).
import requests

h_url = 'https://www.nasdaq.com/api/v1/historical/CMG/stocks/2019-06-30/2020-06-30'
headers = {
    'authority': 'www.nasdaq.com',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
}
page_response = requests.get(h_url, timeout=30, allow_redirects=True, headers=headers)
with open("dump.txt", "w") as out:
    out.write(str(page_response.content))
This will write a byte string of the data received to the file "dump.txt". You do not need to use BeautifulSoup to parse HTML, as the response is a text file, not HTML.
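If the endpoint does still return CSV, a hedged follow-up to the snippet above is to write page_response.text instead (it is already decoded, so you avoid the b'...' prefix that str(page_response.content) adds) and to load it straight into pandas:
import io
import pandas as pd

# Continues from the previous block: page_response holds the CSV response.
with open('dump.csv', 'w') as out:
    out.write(page_response.text)

df = pd.read_csv(io.StringIO(page_response.text))
print(df.head())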

How can I get the reputation of file hashes on VirusTotal using the requests and bs4 modules, without using VirusTotal's public API?

My requirement is to check the reputation of multiple file hashes on VirusTotal using Python. I do not want to use VirusTotal's public API, since there is a cap of 4 requests/min. I thought of using the requests module and Beautiful Soup to get this done.
Please check the link below:
https://www.virustotal.com/gui/file/f8ee4c00a3a53206d8d37abe5ed9f4bfc210a188cd5b819d3e1f77b34504061e/summary
I need to capture 54/69 for this file. I have a list of file hashes in an Excel file which I can loop over for detection status once I get it working for this one hash.
But I am not able to get the specific count of engines that detected the file hash as malicious; the CSS selector for the count gives me only an empty list. Please help. The code I have written is below:
import requests
from bs4 import BeautifulSoup
filehash='F8EE4C00A3A53206D8D37ABE5ED9F4BFC210A188CD5B819D3E1F77B34504061E'
filehash_lower = filehash.lower()
URL = 'https://www.virustotal.com/gui/file/' +filehash+'/detection'
response = requests.get(URL)
print(response)
soup = BeautifulSoup(response.content,'html.parser')
detection_details = soup.select('div.detections')
print(detection_details)
Here is an approach using the AJAX calls:
import requests
import json

headers = {
    'pragma': 'no-cache',
    'x-app-hostname': 'https://www.virustotal.com/gui/',
    'dnt': '1',
    'x-app-version': '20190611t171116',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,la;q=0.6,mt;q=0.5',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'accept': 'application/json',
    'cache-control': 'no-cache',
    'authority': 'www.virustotal.com',
    'referer': 'https://www.virustotal.com/',
}
response = requests.get('https://www.virustotal.com/ui/files/f8ee4c00a3a53206d8d37abe5ed9f4bfc210a188cd5b819d3e1f77b34504061e', headers=headers)

data = json.loads(response.content)
malicious = data['data']['attributes']['last_analysis_stats']['malicious']
undetected = data['data']['attributes']['last_analysis_stats']['undetected']
print(malicious, 'malicious out of', malicious + undetected)
output:
54 malicious out of 69
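To extend this to the list of hashes mentioned in the question, a sketch along these lines could loop the same UI endpoint; the file name hashes.xlsx and the column name filehash are assumptions, and the headers dict is the one defined above:
import time
import pandas as pd
import requests

hashes = pd.read_excel('hashes.xlsx')['filehash'].str.lower()
for h in hashes:
    resp = requests.get(f'https://www.virustotal.com/ui/files/{h}', headers=headers)
    stats = resp.json()['data']['attributes']['last_analysis_stats']
    print(h, stats['malicious'], 'malicious out of',
          stats['malicious'] + stats['undetected'])
    time.sleep(1)  # pace the requests; this UI endpoint is not an official API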
