Databricks web scraping

Databricks web scraping - python

I wrote some python code to automatically download a csv file from the internet. The code works when it runs on my local computer but not when I run it on DataBricks. The problem is that I don't know how to save it to my DBFS: folder = "/dbfs/mnt/fmi-import/DNB Scenariosets/". The code does execute but the file is nowhere to be found.
import requests
from bs4 import BeautifulSoup
import re
url_scenariosets_dnb = 'https://www.dnb.nl/voor-de-sector/open-boek-toezicht-sectoren/pensioenfondsen/haalbaarheidstoets/uitvoering-en-normen/scenarioset-haalbaarheidstoets-pensioenfondsen/'
folder = "/dbfs/mnt/fmi-import/DNB Scenariosets/"
class dataset_downloader:
def __init__(self,url):
self.url=url
def scrape(self):
reqs = requests.get(self.url, verify=False)
soup = BeautifulSoup(reqs.text, 'html.parser')
self.urls=[]
for link in soup.find_all('a'):
self.urls.append(link.get('href'))
return self.urls
def filter_scenarioset(self):
# Search data based on regular expression in the list
self.scenarioset_links=[]
[self.scenarioset_links.append('https://www.dnb.nl'+val) for val in self.urls
if re.search(r'hbt-scenarioset-10k', val)]
return self.scenarioset_links
def download_file(self, year, quarter):
try:
self.downloadlink=[]
[self.downloadlink.append(val) for val in self.scenarioset_links
if re.search(r'hbt-scenarioset-10k-{}q{}'.format(year,quarter),val)]
filename='hbt-scenarioset-10k-{}q{}.xlsx'.format(year,quarter)
with requests.get(self.downloadlink[0]) as req:
with open(filename, 'wb') as f:
for chunk in req.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
return "/dbfs/mnt/fmi-import/DNB Scenariosets/"+filename
except Exception as e:
print(e)
return None
#%% EXECUTE
download = dataset_downloader(url_scenariosets_dnb)
download.scrape()
download.filter_scenarioset()
download.download_file(2020,2) # select year and quarter
Do you have any suggestion on how you can download a csv file with databricks and save it to a DBFS folder? Thank you in advance!
Vincent

Problem is that in code you have only filename without path to folder. It should be:
with open(folder + filename, 'wb')

Related

Change twitter banner from url

How would I go by changing the twitter banner using an image from url using tweepy library: https://github.com/tweepy/tweepy/blob/v2.3.0/tweepy/api.py#L392
So far I got this and it returns:
def banner(self):
url = 'https://blog.snappa.com/wp-content/uploads/2019/01/Twitter-Header-Size.png'
file = requests.get(url)
self.api.update_profile_banner(filename=file.content)
ValueError: stat: embedded null character in path
It seems like filename requires an image to be downloaded. Anyway to process this without downloading the image and then removing it?

Looking at library's code you can do what you want.
def update_profile_banner(self, filename, *args, **kargs):
f = kargs.pop('file', None)
So what you need to do is supply the filename and the file kwarg:
filename = url.split('/')[-1]
self.api.update_profile_banner(filename, file=file.content)

import tempfile
def banner():
url = 'file_url'
file = requests.get(url)
temp = tempfile.NamedTemporaryFile(suffix=".png")
try:
temp.write(file.content)
self.api.update_profile_banner(filename=temp.name)
finally:
temp.close()

How to download images from the web page?

I have a python script that searches for images on a web page and it's supposed to download them to folder named 'downloaded'. Last 2-3 lines are problematic, I don't know how to write the correct 'with open' code.
The biggest part of the script is fine, lines 42-43 give an error
import os
import requests
from bs4 import BeautifulSoup
downloadDirectory = "downloaded"
baseUrl = "http://pythonscraping.com"
def getAbsoluteURL(baseUrl, source):
if source.startswith("http://www."):
url = "http://"+source[11:]
elif source.startswith("http://"):
url = source
elif source.startswith("www."):
url = source[4:]
url = "http://"+source
else:
url = baseUrl+"/"+source
if baseUrl not in url:
return None
return url
def getDownloadPath(baseUrl, absoluteUrl, downloadDirectory):
path = absoluteUrl.replace("www.", "")
path = path.replace(baseUrl, "")
path = downloadDirectory+path
directory = os.path.dirname(path)
if not os.path.exists(directory):
os.makedirs(directory)
return path
html = requests.get("http://www.pythonscraping.com")
bsObj = BeautifulSoup(html.content, 'html.parser')
downloadList = bsObj.find_all(src=True)
for download in downloadList:
fileUrl = getAbsoluteURL(baseUrl,download["src"])
if fileUrl is not None:
print(fileUrl)
with open(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory), 'wb') as out_file:
out_file.write(fileUrl.content)
It opens downloaded folder on my computer and misc folder within it. And it gives a traceback error.
Traceback:
http://pythonscraping.com/misc/jquery.js?v=1.4.4
Traceback (most recent call last):
File "C:\Python36\kodovi\downloaded.py", line 43, in <module>
with open(fileUrl, getDownloadPath(baseUrl, fileUrl, downloadDirectory), 'wb
') as out_file:
TypeError: an integer is required (got type str)

It seems your downloadList includes some URLs that aren't images. You could instead look for any <img> tags in the HTML:
downloadList = bsObj.find_all('img')
Then use this to download those images:
for download in downloadList:
fileUrl = getAbsoluteURL(baseUrl,download["src"])
r = requests.get(fileUrl, allow_redirects=True)
filename = os.path.join(downloadDirectory, fileUrl.split('/')[-1])
open(filename, 'wb').write(r.content)
EDIT: I've updated the filename = ... line so that it writes the file of the same name to the directory in the string downloadDirectory. By the way, the normal convention for Python variables is not to use camel case.

Flask: api call downloads excel sheet, need to store in folder

The following url downloads an excel spreadsheet
http://www.bocsar.nsw.gov.au/Documents/RCS-Annual/bluemountainslga.xlsx
via flask code I want to call that url, and save the spreadsheet to a folder
So far I have
r = requests.get('http://www.bocsar.nsw.gov.au/Documents/RCS-Annual/bluemountainslga.xlsx')
But need help moving the spread sheet to a downloads folder inside the project. The folder structure is.
App
+static
+templates
main.py
+downloads
|__ move file here

This is a minimal example of something I just got to work;
import requests
import shutil
def download(url):
filename = url.split("/")[-1]
path = "downloads/" + filename
r = requests.get(url, stream=True)
if r.status_code == 200:
with open(path, 'wb') as f:
r.raw.decode_content = True
shutil.copyfileobj(r.raw, f)
else:
r.raise_for_status()
download('http://www.bocsar.nsw.gov.au/Documents/RCS-Annual/bluemountainslga.xlsx')
It reads the raw data from the request object and writes to a file where you want it to.

Downloading csv data from an API

I am attempting to download csv data from an API which I will then edit I am struggling to get the different functions to work together.
i.e. passing the export link through to download the file and then through to opening it.
'''
File name: downloadAWR.py
Author: Harry&Joe
Date created: 3/10/17
Date last modified: 5/10/17
Version: 3.6
'''
import requests
import json
import urllib2
import zipfile
import io
import csv
import os
from urllib2 import urlopen, URLError, HTTPError
geturl() is used to create a download link for the csv data, one link will be created with user input data in this case the name and dates, this will then create a link that we can use to download the data. the link is stored in export_link
def geturl():
#getProjectName
project_name = 'BIMM'
#getApiToken
api_token = "API KEY HERE"
#getStartDate
start_date = '2017-01-01'
#getStopDate
stop_date = '2017-09-01'
url = "https://api.awrcloud.com/get.php?action=export_ranking&project=%s&token=%s&startDate=%s&stopDate=%s" % (project_name,api_token,start_date,stop_date)
export_link = requests.get(url).content
return export_link
dlfile is used to actually use the link a get a file we can manipulate and edit e.g. removing columns and some of the data.
def dlfile(export_link):
# Open the url
try:
f = urlopen(export_link)
print ("downloading " + export_link)
# Open our local file for writing
with open(os.path.basename(export_link), "wb") as local_file:
local_file.write(f.read())
#handle errors
except HTTPError as e:
print ("HTTP Error:", e.code, export_link)
except URLError as e:
print ("URL Error:", e.reason, export_link)
return f
readdata is used to go into the file and open it for us to use.
def readdata():
with zipfile.ZipFile(io.BytesIO(zipdata)) as z:
for f in z.filelist:
csvdata = z.read(f)
#reader = csv.reader(io.StringIO(csvdata.decode()))
def main():
#Do something with the csv data
export_link = (geturl())
data = dlfile(export_link)
csvdata = data.readdata()
if __name__ == '__main__':
main()
Generally I'm finding that the code works independently but struggles when I try to put it all together synchronously.

You need to clean up and call your code appropriately. It seems you copy pasted from different sources and now you have some salad bowl of code that isn't mixing well.
If the task is just to read and open a remote file to do something to it:
import io
import zipfile
import requests
def get_csv_file(project, api_token, start_date, end_date):
url = "https://api.awrcloud.com/get.php"
params = {'action': 'export_ranking',
'project': project,
'token': api_token,
'startDate': start_date,
'stopDate': end_date}
r = requests.get(url, params)
r.raise_for_status()
return zipfile.ZipFile(io.BytesIO(request.get(r.content).content))
def process_csv_file(zip_file):
contents = zip_file.extractall()
# do stuff with the contents
if __name__ == '__main__':
process_zip_file(get_csv_file('BIMM', 'api-key', '2017-01-01', '2017-09-01'))

python - NameError global name 'buf' is not defined

I don't know what this error means. Any advice about the error or the rest of the code is greatly appreciated.
import urllib
import urllib2
import os
import re
from bs4 import BeautifulSoup
def image_scrape():
url = raw_input("Type url for image scrape: ")
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
name = 0
for tag in soup.find_all(re.compile("img")):
path = 'C:\Users\Sorcerer\Downloads'
name += 1
filename = name
file_path = "%s%s" % (path, filename)
downloaded_image = file(file_path, "wb")
downloaded_image.write(buf)
downloaded_image.close()
image_scrape()

You have a line in your code:
downloaded_image.write(buf)
The Python interpreter has not seen this variable buf before in your code. And hence the error.
Thoughts on the rest of your code:
It is advisable to use the os module to do what you are doing with this line:
file_path = "%s%s" % (path, filename)
like this:
import os
path = os.path.normpath('C:\\Users\\Sorcerer\\Downloads')
file_path = os.path.join(path, name)
Looks like you are trying to find all the image links in the page and trying to save it to the file system at the location referenced by file_path. Assuming the link to the image is in the variable tag, this is what you do:
import requests
r = requests.get(tag, stream=True)
if r.status_code == 200:
with open(name, 'wb') as f:
for chunk in r.iter_content():
f.write(chunk)
f.close()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Databricks web scraping - python

Problem is that in code you have only filename without path to folder. It should be: with open(folder + filename, 'wb')

Related

Change twitter banner from url

How to download images from the web page?

Flask: api call downloads excel sheet, need to store in folder

Downloading csv data from an API

python - NameError global name 'buf' is not defined

Categories

Resources