I am attempting to write a program that sends a URL request to a site, which then produces an animation of weather radar. I then scrape that page to get the image URLs (they are stored in the page's Java applet PARAM tags) and download them to a local folder. I do this iteratively over many radar stations and for two radar products. So far I have written the code to send the request, parse the HTML, and list the image URLs. What I can't seem to do is rename and save the images locally. Beyond that, I want to make this as streamlined as possible, which is probably NOT what I've got so far. Any help 1) getting the images to download to a local folder and 2) pointing me to a more pythonic way of doing this would be great.
# import modules
import os
import urllib2
import re
from bs4 import BeautifulSoup
##test variables##
stationName = "KBYX"
prod = ("bref1","vel1") # a tupel of both ref and vel
bkgr = "black"
duration = "1"
home_dir = "/path/to/home/directory/folderForImages"  # root folder containing one subfolder per radar station
##program##
# This program needs to do the following:
# read the folder structure from home directory to get radar names
#left off here
list_of_folders = os.listdir(home_dir)
for each_folder in list_of_folders:
    if each_folder.lower().startswith('k'):  # station folders are named after the radar, e.g. KBYX
        print each_folder
# each folder that starts with a "k" represents a radar station, and within each folder are two other
# folders, bref1 and vel1, the two products. I want the program to read the folders to decide which
# radars to retrieve the data for... so if I decide to add radars, all I have to do is add the folders
# to the directory tree.
# first request will be for prod[0] - base reflectivity
# second request will be for prod[1] - base velocity
# sample path:
# http://weather.rap.ucar.edu/radar/displayRad.php?icao=KMPX&prod=bref1&bkgr=black&duration=1
#base part of the path
base = "http://weather.rap.ucar.edu/radar/displayRad.php?"
#additional parameters
call = base+"icao="+stationName+"&prod="+prod[0]+"&bkgr="+bkgr+"&duration="+duration
#read in the webpage
urlContent = urllib2.urlopen(call).read()
#parse the webpage with BeautifulSoup
soup = BeautifulSoup(urlContent)
#print (soup.prettify()) # if you want to take a look at the parsed structure
# the filenames are nested in a chain of PARAM tags; the innermost one's "value" attribute
# (PARAM name="filename") holds the comma-separated list of filenames
tag = soup.param.param.param.param.param.param.param
files_in = str(tag['value'])
files = files_in.split(',')  # they're in a single element, so split them by comma
directory = home_dir + "/" + stationName + "/" + prod[0] + "/"  # save under the product that was requested (prod[0])
counter = 0
for file in files:  # now we should THEORETICALLY be able to iterate over them to download them... here I just print them
    print file
To save the images locally, something like this should work:
import os

IMAGES_OUTDIR = '/path/to/image/output/directory'

for file_url in files:
    image_content = urllib2.urlopen(file_url).read()
    image_outfile = os.path.join(IMAGES_OUTDIR, os.path.basename(file_url))
    with open(image_outfile, 'wb') as wfh:
        wfh.write(image_content)
If you want to rename them, use the name you want instead of os.path.basename(file_url).
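For instance, here is a minimal sketch of a counter-based renaming scheme (the KBYX_bref1_0.gif pattern and the .gif extension are just assumptions; use whatever naming and extension actually fit the radar frames):

counter = 0
for file_url in files:
    image_content = urllib2.urlopen(file_url).read()
    # hypothetical naming scheme: <station>_<product>_<frame number>.gif
    out_name = "%s_%s_%d.gif" % (stationName, prod[0], counter)
    with open(os.path.join(IMAGES_OUTDIR, out_name), 'wb') as wfh:
        wfh.write(image_content)
    counter += 1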
I use these three methods for downloading from the internet:
from os import path, mkdir
from urllib import urlretrieve

def checkPath(destPath):
    # Add a final slash if missing and create the directory if needed
    if destPath is not None and len(destPath) and destPath[-1] != '/':
        destPath += '/'
    if destPath != '' and not path.exists(destPath):
        mkdir(destPath)
    return destPath

def saveResource(data, fileName, destPath=''):
    '''Saves data to file in binary write mode'''
    destPath = checkPath(destPath)
    with open(destPath + fileName, 'wb') as fOut:
        fOut.write(data)

def downloadResource(url, fileName=None, destPath=''):
    '''Saves the content at url in folder destPath as fileName'''
    # Default filename
    if fileName is None:
        fileName = path.basename(url)
    destPath = checkPath(destPath)
    try:
        urlretrieve(url, destPath + fileName)
    except Exception as inst:
        print 'Error retrieving', url
        print type(inst)   # the exception instance
        print inst.args    # arguments stored in .args
        print inst
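A usage sketch tying these helpers back to the radar question above (the destination layout is just an assumption, and checkPath only creates the final directory level, so the station folder itself must already exist):

# save each radar frame under <home_dir>/<station>/<product>/, keeping the original filename
for file_url in files:
    downloadResource(file_url, destPath=home_dir + '/' + stationName + '/' + prod[0])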
There are plenty of existing examples of downloading images from various sites that you can adapt.
Related
I am working on a project where I need to scrape images off the web. To do this, I write the image links to a file and then download each of them to a folder with requests. At first I used Google as the scrape site, but due to several reasons I have decided that Wikipedia is a much better alternative. However, after my first attempt many of the images couldn't be opened, so I tried again, this time saving the images under names whose endings matched the endings of the links. More images were accessible this way, but many still could not be opened. When I tested downloading the images myself (individually, outside of the function), they downloaded perfectly, and when I then used my function to download them they also kept downloading correctly (i.e. I could access them). I am not sure if it is important, but the image endings that I generally come across are .svg.png and .png. I want to know why this is occurring and what I may be able to do to prevent it. I have left some of my code below. Thank you.
Function:
def download_images(file):
    object = file[0:file.index("IMAGELINKS") - 1]
    folder_name = object + "_images"
    dir = os.path.join("math_obj_images/original_images/", folder_name)
    if not os.path.exists(dir):
        os.mkdir(dir)
    with open("math_obj_image_links/" + file, "r") as f:
        count = 1
        for line in f:
            try:
                if line[len(line) - 1] == "\n":
                    line = line[:len(line) - 1]
                if line[0] != "/":
                    last_chunk = line.split("/")[len(line.split("/")) - 1]
                    endings = last_chunk.split(".")[1:]
                    image_ending = ""
                    for ending in endings:
                        image_ending += "." + ending
                    if image_ending == "":
                        continue
                    with open("math_obj_images/original_images/" + folder_name + "/" + object + str(count) + image_ending, "wb") as f:
                        f.write(requests.get(line).content)
                    file = object + "_IMAGEENDINGS.txt"
                    path = "math_obj_image_endings/" + file
                    with open(path, "a") as f:
                        f.write(image_ending + "\n")
                    count += 1
            except:
                continue
    f.close()
Doing this outside of it worked:
with open("test" + image_ending, "wb") as f:
    f.write(requests.get(line).content)
Example of image link file:
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
https:/static/images/footer/wikimedia-button.png
https:/static/images/footer/poweredby_mediawiki_88x31.png
If all the files are indeed in PNG format and the suffix is always .png, you could try something like this:
import requests
from pathlib import Path
u1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png"
r = requests.get(u1)
Path('u1.png').write_bytes(r.content)
My previous answer works for PNGs only.
For SVG files you need to check whether the file contents start with the string "<svg" and create a file with the .svg suffix.
The code below saves the downloaded files in the "downloads" subdirectory.
import requests
from pathlib import Path

# urls are stored in a file 'urls.txt'.
with open('urls.txt') as f:
    for i, url in enumerate(f.readlines()):
        url = url.strip()  # MUST strip the line-ending char(s)!
        try:
            content = requests.get(url).content
        except:
            print('Cannot download url:', url)
            continue
        # Check if this is an SVG file
        # Note that content is bytes, hence the b in b'<svg'
        if content.startswith(b'<svg'):
            ext = 'svg'
        elif url.endswith('.png'):
            ext = 'png'
        else:
            print('Cannot process contents of url:', url)
            continue
        # the 'downloads' directory must already exist; write_bytes does not create it
        Path('downloads', f'url{i}.{ext}').write_bytes(content)
Contents of the urls.txt file:
(the last url is an svg)
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
So I'm trying out a program that downloads Google Street View images. The addresses are located in a .txt file. Every time I try to run the code, HTTP Error 403: Forbidden comes up. In my actual code I use my Google Developer API key, of course, and the right file paths.
I've tried to set up a user agent, but it just doesn't work. Can anyone help me? What should I do, and how do I implement it in this code?
# import os and urllib modules
# os for file path creation
# urllib for accessing web content
import urllib
import os
import requests

# this is the first part of the streetview url, up to the address; this url will return a 600x600px image
pre = "https://maps.googleapis.com/maps/api/streetview?size=600x600&location="

# this is the second part of the streetview url; the text variable below holds the path to a text file containing one address per line
# the addresses in this text file will complete the URL needed to return a streetview image and provide the filename of each streetview image
text = r"C:\Users\............."

# this is the third part of the url, needed after the address
# this is my API key, please replace the one below with your own (google 'google streetview api key'), thanks!
suf = "&key=abcdertghjhrtrwhgrh"

# this is the directory that will store the streetview images
# this directory will be created if not present
dir = r"C:\Users\.........."

headers = {'User-Agent': 'Chrome/75.0.3770.100 Safari/537.36',
           'From': 'asdasd#asd.com'}

# checks if the dir variable (output path) above exists and creates it if it does not
if not os.path.exists(dir):
    os.makedirs(dir)

# opens the address list text file (from the 'text' variable defined above) in read mode ("r")
with open(text, "r") as text_file:
    # the variable 'lines' below creates a list of each address line in the source 'text' file
    lines = [line.rstrip('\n') for line in text_file]
    print "THE CONTENTS OF THE TEXT FILE:\n" + str(lines)
    # start a loop through the 'lines' list
    for line in lines:
        # string clean-up to get rid of commas in the url and filename
        ln = line.replace(",", "")
        print "CLEANED UP ADDRESS LINE:\n" + ln
        # creates the full, valid url that will return a google streetview image for each address in the address text file
        URL = pre + ln + suf
        response = requests.get(URL, headers=headers)
        print "URL FOR STREETVIEW IMAGE:\n" + URL
        # creates the filename needed to save each address's streetview image locally
        filename = os.path.join(dir, ln + ".jpg")
        print "OUTPUT FILENAME:\n" + filename
        # you can run this up to this line in the python command line to see what each step does
        # final step, fetches and saves the streetview image for each address using the url created in the previous steps
        urllib.urlretrieve(URL, filename)
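One possible sketch of a fix, assuming the 403 comes from the final urllib.urlretrieve call (it issues its own request without the custom headers, unlike the requests.get call above): reuse the response already fetched with requests and write its bytes to disk.

# inside the loop over 'lines', instead of urllib.urlretrieve(URL, filename):
response = requests.get(URL, headers=headers)
if response.status_code == 200:
    with open(filename, 'wb') as img_file:
        img_file.write(response.content)  # save the image bytes returned by the API
else:
    print "Request failed for", URL, "with status", response.status_code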
I created a document site on Plone to which files can be uploaded. I saw that Plone saves them in the filesystem in the form of a blob, and now I need to retrieve them with a Python script so the downloaded PDFs can be run through OCR. Does anyone have any idea how to do it? Thank you.
Not sure how to extract PDFs from BLOB-storage or if it's possible at all, but you can extract them from a running Plone-site (e.g. executing the script via a browser-view):
import os
from Products.CMFCore.utils import getToolByName

def isPdf(search_result):
    """Check mime_type for Plone >= 5.1, otherwise check file-extension."""
    if mimeTypeIsPdf(search_result) or search_result.id.endswith('.pdf'):
        return True
    return False

def mimeTypeIsPdf(search_result):
    """
    Plone-5.1 introduced the mime_type-attribute on files.
    Try to get it; if it doesn't exist, fail silently.
    Return True if mime_type exists and is PDF, otherwise False.
    """
    try:
        mime_type = search_result.mime_type
        if mime_type == 'application/pdf':
            return True
    except:
        pass
    return False

def exportPdfFiles(context, export_path):
    """
    Get all PDF-files of the site and write them to export_path on the filesystem.
    Retain the folder-structure of the site.
    """
    catalog = getToolByName(context, 'portal_catalog')
    search_results = catalog(portal_type='File', Language='all')
    for search_result in search_results:
        # For each PDF-file:
        if isPdf(search_result):
            file_path = export_path + search_result.getPath()
            file_content = search_result.getObject().data
            parent_path = '/'.join(file_path.split('/')[:-1])
            # Create missing directories on the fly:
            if not os.path.exists(parent_path):
                os.makedirs(parent_path)
            # Write the PDF in binary mode:
            with open(file_path, 'wb') as fil:
                fil.write(file_content)
            print 'Wrote ' + file_path
    print 'Finished exporting PDF-files to ' + export_path
The example keeps the folder-structure of the Plone-site in the export-directory. If you want them flat in one directory, a handler for duplicate file-names is needed.
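A minimal sketch of such a duplicate-name handler (uniqueName is a hypothetical helper, not part of the answer above): it appends a counter before the extension until the name is free.

import os

def uniqueName(directory, filename):
    """Return a filename that does not collide with existing files in directory."""
    base, ext = os.path.splitext(filename)
    candidate = filename
    counter = 1
    while os.path.exists(os.path.join(directory, candidate)):
        candidate = '%s_%d%s' % (base, counter, ext)
        counter += 1
    return candidate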
I am trying to create a script that scrapes a webpage and downloads any image files found.
My first function is a wget function that reads the webpage and assigns it to a variable.
My second function uses a regex that searches for src= attributes in a webpage's HTML; below is the function:
def find_image(text):
    '''Find .gif, .jpg and .bmp files'''
    documents = re.findall(r'\ssrc="([^"]+)"', text)
    count = len(documents)
    print "[+] Total number of files found: %s" % count
    return '\n'.join([str(x) for x in documents])
The output from this is something like this:
example.jpg
image.gif
http://www.webpage.com/example/file01.bmp
I am trying to write a third function that downloads these files using urllib.urlretrieve(url, filename), but I am not sure how to go about this, mainly because some of the output is absolute paths whereas others are relative. I am also unsure how to download them all at the same time and save them without having to specify a name and location every time.
Path-agnostic fetching of resources (can handle absolute and relative paths):
from bs4 import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os

def fetch_url(url, out_folder="test/"):
    """Downloads all the images at 'url' to /test/"""
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))
    for image in soup.findAll("img"):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)

fetch_url('http://www.w3schools.com/html/')
I can't write you the complete code, and I'm sure that's not what you want anyway, but here are some hints:
1) Do not parse random HTML pages with a regex; there are quite a few parsers made for exactly that. I suggest BeautifulSoup. You will filter all img elements and get their src values.
2) With the src values at hand, you download your files the way you are already doing. For the relative/absolute problem, use the urlparse module (see the sketch after this list). The idea is to join the src of the image with the URL from which you downloaded the HTML; if the src is already absolute, it will remain that way.
3) As for downloading them all, simply iterate over a list of the webpages you want to download images from and do steps 1 and 2 for each image in each page. When you say "at the same time", you probably mean asynchronously; in that case, I suggest downloading each webpage in its own thread.
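A minimal sketch of point 2, assuming page_url is the page the HTML came from and the src values are the ones returned by find_image (urljoin resolves relative paths and leaves absolute URLs untouched):

from urlparse import urljoin  # Python 2; in Python 3 use urllib.parse

page_url = "http://www.webpage.com/example/index.html"  # hypothetical page the HTML was fetched from
for src in ["image.gif", "http://www.webpage.com/example/file01.bmp"]:
    absolute_url = urljoin(page_url, src)  # relative src values are joined with page_url
    print absolute_url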
I would like to load a list/tuple dynamically from a settings file.
I need to write a crawler that crawls a website, but I want to be made aware of files that were found, rather than pages.
I allow the user to specify such file types in a settings.py file, like this:
# Document Types during crawling
textFiles = ['.doc', '.docx', '.log', '.msg', '.pages', '.rtf', '.txt', '.wpd', '.wps']
dataFiles = ['.csv', '.dat', '.efx', '.gbr', '.key', '.pps', '.ppt', '.pptx', '.sdf', '.tax2010', '.vcf', '.xml']
audioFiles = ['.3g2','.3gp','.asf','.asx','.avi','.flv','.mov','.mp4','.mpg','.rm','.swf','.vob','.wmv']
#What lists would you like to use ?
fileLists = ['textFiles', 'dataFiles', 'audioFiles']
I import my settings file in the crawler.py
I use the BeautifulSoup module to find links in the HTML content and process them as follows:
for item in soup.find_all("a"):
    # we don't want some of them because it is just a link to the current page or the start page
    if item['href'] in dontWantList:
        continue
    # check if the link is a file, based on the fileLists from the settings
    urlpath = urlparse.urlparse(item['href']).path
    ext = os.path.splitext(urlpath)[1]
    file = False
    for list in settings.fileLists:
        if ext in settings.list:
            file = True
            # found file link
            if self.verbose:
                messenger("Found a file of type: %s" % ext, Colors.PURPLE)
            if ext not in fileLinks:
                fileLinks.append(item['href'])
    # Only add the link if it is not a file
    if file is not True:
        links.append(item['href'])
    else:
        # Do not add the file to the other lists
        continue
The following code segment throws an error:
for list in settings.fileLists:
    if ext in settings.list:
clearly because Python looks for an attribute on settings that is literally named list, which does not exist.
Is there any way to tell Python to dynamically look up the lists from the settings file?
I think what you are looking for is this: instead of
if ext in settings.list:
you need
ext_list = getattr(settings, list)
if ext in ext_list:
EDIT:
I agree with jonrsharpe on the list thing, so I renamed it in my code
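Putting it together, a minimal sketch of the corrected inner loop (with the loop variable renamed from list to list_name, in line with the edit above):

for list_name in settings.fileLists:
    ext_list = getattr(settings, list_name)  # looks up e.g. settings.textFiles by its name
    if ext in ext_list:
        file = True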