Having trouble using requests to download images off of wiki - python

I am working on a project where I need to scrape images off of the web. To do this, I write the image links to a file, and then I download each of them to a folder with requests. At first I used Google as the scrape site, but for several reasons I have decided that Wikipedia is a much better alternative. However, after my first attempt, many of the images couldn't be opened, so I tried again, this time saving each image under a name whose ending matched the ending of its link. More images could be opened this way, but many still could not. When I tested downloading the images myself (individually, outside of the function), they downloaded perfectly, and when I used my function to download them afterwards, they kept downloading correctly (i.e. I could open them). I am not sure if it is important, but the image endings that I generally come across are .svg.png and .png. I want to know why this is occurring and what I can do to prevent it. I have left some of my code below. Thank you.
Function:
def download_images(file):
    object = file[0:file.index("IMAGELINKS") - 1]
    folder_name = object + "_images"
    dir = os.path.join("math_obj_images/original_images/", folder_name)
    if not os.path.exists(dir):
        os.mkdir(dir)
    with open("math_obj_image_links/" + file, "r") as f:
        count = 1
        for line in f:
            try:
                if line[len(line) - 1] == "\n":
                    line = line[:len(line) - 1]
                if line[0] != "/":
                    last_chunk = line.split("/")[len(line.split("/")) - 1]
                    endings = last_chunk.split(".")[1:]
                    image_ending = ""
                    for ending in endings:
                        image_ending += "." + ending
                    if image_ending == "":
                        continue
                    with open("math_obj_images/original_images/" + folder_name + "/" + object + str(count) + image_ending, "wb") as f:
                        f.write(requests.get(line).content)
                    file = object + "_IMAGEENDINGS.txt"
                    path = "math_obj_image_endings/" + file
                    with open(path, "a") as f:
                        f.write(image_ending + "\n")
                    count += 1
            except:
                continue
        f.close()
Doing this outside of it worked:
with open("test" + image_ending, "wb") as f:
    f.write(requests.get(line).content)
Example of image link file:
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
https:/static/images/footer/wikimedia-button.png
https:/static/images/footer/poweredby_mediawiki_88x31.png

If all the files are indeed in PNG format and the suffix is always .png, you could try something like this:
import requests
from pathlib import Path
u1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png"
r = requests.get(u1)
Path('u1.png').write_bytes(r.content)

My previous answer works for PNGs only.
For SVG files you need to check whether the file contents start with the string "<svg" and, if so, create a file with the .svg suffix.
The code below saves the downloaded files in the "downloads" subdirectory.
import requests
from pathlib import Path

# Make sure the output directory exists
Path('downloads').mkdir(exist_ok=True)

# urls are stored in a file 'urls.txt'.
with open('urls.txt') as f:
    for i, url in enumerate(f.readlines()):
        url = url.strip()  # MUST strip the line-ending char(s)!
        try:
            content = requests.get(url).content
        except requests.RequestException:
            print('Cannot download url:', url)
            continue
        # Check if this is an SVG file
        # Note that content is bytes hence the b in b'<svg'
        if content.startswith(b'<svg'):
            ext = 'svg'
        elif url.endswith('.png'):
            ext = 'png'
        else:
            print('Cannot process contents of url:', url)
            continue  # skip it; otherwise ext would be undefined or stale
        # Reuse the bytes already downloaded instead of requesting the url again
        Path('downloads', f'url{i}.{ext}').write_bytes(content)
Contents of the urls.txt file:
(the last url is an svg)
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
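One possibility worth ruling out is that a file which will not open does not contain image data at all (for instance, an error page saved under a .png name). As a rough sketch of the check described above, the extension can be chosen from what the server actually returned rather than from the URL. The helper name guess_extension and the type mapping are illustrative assumptions, not part of the answer's code.

```python
def guess_extension(content_type, content):
    """Return a file extension for a downloaded image, or None if unknown.

    content_type: value of the Content-Type response header (may be None)
    content: the raw response body as bytes
    """
    # Hypothetical mapping; extend it for other formats as needed.
    known_types = {
        "image/png": "png",
        "image/svg+xml": "svg",
        "image/jpeg": "jpg",
        "image/gif": "gif",
    }
    if content_type:
        # Drop parameters such as "; charset=utf-8"
        mime = content_type.split(";")[0].strip().lower()
        if mime in known_types:
            return known_types[mime]
    # Fall back to sniffing the first bytes of the body
    if content.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    head = content[:256].lstrip()
    if head.startswith(b"<svg") or (head.startswith(b"<?xml") and b"<svg" in content[:1024]):
        return "svg"
    return None
```

With requests, this could be called as guess_extension(r.headers.get("Content-Type"), r.content) after r = requests.get(url) and r.raise_for_status(), skipping the URL whenever the result is None.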

Related

How do you skip over files with no extension when downloading them?

My code is working correctly to scour a directory of PDFs, download weblinks embedded within those PDFs, and sequentially name them with appropriate file extension.
That being said - I am getting a few random files that download but DON'T have an extension associated with them. In doing quality checks, I have all the attachments that matter - these extra files are truly garbage.
Is there a way to not download them or build in a check in the code so that I don't end up with these phantom files?
#!/usr/bin/env python3
import os
import glob
import pdfx
import wget
import urllib.parse
import requests
## Accessing and Creating Six Digit File Code
pdf_dir = "./"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    ## Identify File Name and Limit to Digits
    filename = os.path.basename(file)
    newname = filename[0:6]
    ## Run PDFX to identify and download links
    pdf = pdfx.PDFx(filename)
    url_list = pdf.get_references_as_dict()
    attachment_counter = (1)
    for x in url_list["url"]:
        if x[0:4] == "http":
            parsed_url = urllib.parse.quote(x)
            extension = os.path.splitext(x)[1]
            r = requests.get(x)
            with open('temporary', 'wb') as f:
                f.write(r.content)
            ##Concatenate File Name Once Downloaded
            os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))
            ##Increase Attachment Count
            attachment_counter += 1
    for x in url_list["pdf"]:
        parsed_url = urllib.parse.quote(x)
        extension = os.path.splitext(x)[1]
        r = requests.get(x)
        with open('temporary', 'wb') as f:
            f.write(r.content)
        ##Concatenate File Name Once Downloaded
        os.rename('./temporary', str(newname) + '_attach' + str(attachment_counter) + str(extension))
        ##Increase Attachment Count
        attachment_counter += 1
It's not clear which part of your code produces these "phantom" files, but anyplace you want to avoid downloading a file which doesn't have an extension, you can make the download conditional. If the component after the last slash doesn't contain a dot, do nothing.
if '.' in x.split('/')[-1]:
    ...  # download(x) etc.
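A slightly more defensive version of that check, as a sketch: it parses the URL first so that a dot in a query string or fragment is not mistaken for a file extension. The function name has_extension and the example.com URLs are made up for illustration.

```python
import os
from urllib.parse import urlparse


def has_extension(url):
    """Return True if the last path component of the URL has a file extension."""
    # urlparse separates the path from any query string or fragment,
    # so "?file=a.pdf" or "#top" does not count as part of the file name.
    path = urlparse(url).path
    last_component = path.rsplit("/", 1)[-1]
    return os.path.splitext(last_component)[1] != ""


print(has_extension("https://example.com/files/report.pdf"))     # True
print(has_extension("https://example.com/download?file=a.pdf"))  # False
```

In the loops from the question, the download and rename steps would then sit under an `if has_extension(x):` guard.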

Python download multiple files within for loop

I have a list of URLs, which direct to filings from the SEC (e.g., https://www.sec.gov/Archives/edgar/data/18651/000119312509042636/d10k.htm)
My goal is to write a for loop that opens the URLs, requests each document and saves it to a folder.
However, I need to be able to identify the documents later. That's why I wanted to use the filing-specific number in "https://www.sec.gov/Archives/edgar/data/18651/000119312509042636/d10k.htm" as the document name.
directory = r"\Desktop\10ks"
for url in url_list:
    response = requests.get(url).content
    path = (directory + str(url)[40:-5] +".txt")
    with open(path, "w") as f:
        f.write(response)
        f.close()
But every time, I get the following error message: FileNotFoundError: [Errno 2] No such file or directory:
I really hope you can help me out!!
Thanks
import requests
import os
url_list = ["https://www.sec.gov/Archives/edgar/data/18651/000119312509042636/d10k.htm"]
#Create the path Desktop/10ks/
directory = os.path.expanduser("~/Desktop") + "\\10ks"
for url in url_list:
    #Get the content as string instead of getting it as bytes
    response = requests.get(url).text
    #Replace slash in filename with underscore
    filename = str(url)[40:-5].replace("/", "_")
    #print filename to check if it is correct
    print(filename)
    path = (directory + "\\" + filename +".txt")
    with open(path, "w") as f:
        f.write(response)
        f.close()
See comments.
I guess backslashes in filenames are not allowed, since
filename = str(url)[40:-5].replace("/", "\\")
gives me
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\user/Desktop\\10ks\\18651\\000119312509042636\\d10.txt'
See also:
https://docs.python.org/3/library/os.path.html#os.path.expanduser
Get request python as a string
https://docs.python.org/3/library/stdtypes.html#str.replace
This works
for url in url_list:
    response = requests.get(url).content.decode('utf-8')
    path = (directory + str(url)[40:-5] +".txt").replace('/', '\\')
    with open(path, "w+") as f:
        f.write(response)
        f.close()
The path you were building looked something like \\Desktop\\10ks18651/000119312509042636/d10.txt. I assume you are on Windows, given the backslashes; either way, you just need to replace the forward slashes coming from the URL with backslashes.
Another thing: write expects a string, so you need to decode the response, which arrives as bytes, into a string.
I hope this helps you!
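As an alternative to slicing by character position, the file name can be derived from the parsed URL. The sketch below is only one way to do it: filename_from_url, its prefix default and the "10ks" folder are assumptions for illustration. Note that the slice [40:-5] in the snippets above also cuts off the "k" of "d10k", which is why the error message shows d10.txt.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url, prefix="/Archives/edgar/data/"):
    """Build a flat .txt file name from the filing-specific part of the URL."""
    path = urlparse(url).path
    if path.startswith(prefix):
        path = path[len(prefix):]
    # Drop the original suffix (".htm") and join the remaining parts with "_"
    stem = PurePosixPath(path.lstrip("/")).with_suffix("")
    return "_".join(stem.parts) + ".txt"


url = "https://www.sec.gov/Archives/edgar/data/18651/000119312509042636/d10k.htm"
directory = Path("10ks")  # assumed target folder; adjust to your own location
target = directory / filename_from_url(url)
print(target.name)  # 18651_000119312509042636_d10k.txt
```

Because the name no longer contains slashes, only the target folder itself has to exist; directory.mkdir(parents=True, exist_ok=True) before writing takes care of that.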

How to batch read and then write a list of weblink .JSON files to specified locations on C drive in Python v2.7

I have a long list of .json files that I need to download to my computer. I want to download them as .json files (so no parsing or anything like that at this point).
I have some code that works for small files, but it is pretty buggy. Also it doesn't handle multiple links well.
Appreciate any advice to fix up this code:
import os
filename = 'test.json'
path = "C:/Users//Master"
fullpath = os.path.join(path, filename)
import urllib2
url = 'https://www.premierlife.com/secure/index.json'
response = urllib2.urlopen(url)
webContent = response.read()
f = open(fullpath, 'w')
f.write(webContent)
f.close
It's creating a blank file because the f.close at the end should be f.close().
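To illustrate the difference between f.close and f.close(), here is a small sketch in Python 3 (the question uses Python 2.7, where the same rule applies). The temporary path is made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.json")  # throwaway location

f = open(path, "w")
f.write('{"ok": true}')
f.close              # no parentheses: refers to the method, does not call it
still_open = not f.closed
f.close()            # actually closes (and flushes) the file

# A with-block closes the file automatically, even if an exception is raised
with open(path, "w") as g:
    g.write('{"ok": true}')
closed_after_with = g.closed

print(still_open, closed_after_with)  # True True
```

Using with open(...) avoids this class of mistake, since there is no explicit close call to forget.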
I took your code and made a little function and then called it on a little loop to go through a .txt file with the list of urls called "list_of_urls.txt" having 1 url per line (you can change the delimiter in the split function if you want to format it differently).
def save_json(url):
    import os
    filename = url.replace('/','').replace(':','')
    # this replaces / and : in urls
    path = "C:/Users/Master"
    fullpath = os.path.join(path, filename)
    import urllib2
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open(fullpath, 'w')
    f.write(webContent)
    f.close()
And then the loop:
f = open('list_of_urls.txt')
p = f.read()
url_list = p.split('\n') #here's where \n is the line break delimiter that can be changed
for url in url_list:
    save_json(url)
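One detail about the loop above: if list_of_urls.txt ends with a newline, p.split('\n') produces a trailing empty string, and save_json would be called with an empty URL. A minimal sketch of a parser that skips blank lines (parse_url_list and the example.com URLs are hypothetical; written for Python 3):

```python
def parse_url_list(text):
    """Return the non-empty lines of text, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


sample = "https://example.com/a.json\n\nhttps://example.com/b.json\n"
urls = parse_url_list(sample)
print(urls)
```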

How to extract images from a PDF in pure Python?

I'm developing a service in which I now need to extract images from a PDF file. From a Linux command line I can extract images using the Poppler library like this:
pdfimages my_file.pdf /tmp/image
Since I'm using the Python Flask framework and I want to run my service on Heroku I want to extract the images using pure Python (or any library that can run on Heroku in a Flask system).
So does anybody know how I can extract images from pdf in pure Python? I prefer open source solutions, but I'm willing to pay for it if needed (as long as it works under my own control on Heroku).
import minecart
import os
from NumberOfPages import getPageNumber
def extractImages(filename):
    # making new directory if it doesn't exist
    new_dir_name = filename[:-4]
    if not os.path.exists(new_dir_name):
        os.makedirs(new_dir_name + '/images')
        os.makedirs(new_dir_name + '/text')
    # open the target file
    pdf_file = open(filename, 'rb')
    # parse the document through the minecart.Document function
    doc = minecart.Document(pdf_file)
    # getting the number of pages in the pdf file.
    num_pages = getPageNumber(filename)
    # getting the list of all the pages
    page = doc.get_page(num_pages)
    count = 0
    for page in doc.iter_pages():
        for i in range(len(page.images)):
            try:
                im = page.images[i].as_pil()  # requires pillow
                name = new_dir_name + '/images/image_' + str(count) + '.jpg'
                count = count + 1
                im.save(name)
            except:
                print('Error encountered at %s' % filename)
    doc_name = new_dir_name + '/images/info.txt'
    with open(doc_name, 'a') as x:
        print(x.write('Number of images in document: {}'.format(count)))

most pythonic way to retrieve images from within a module (in HTML)

I am attempting to write a program that sends a url request to a site which then produces an animation of weather radar. I then scrape that page to get the image urls (they're stored in a Java module) and download them to a local folder. I do this iteratively over many radar stations and for two radar products. So far I have written the code to send the request, parse the html, and list the image urls. What I can't seem to do is rename and save the images locally. Beyond that, I want to make this as streamlined as possible- which is probably NOT what I've got so far. Any help 1) getting the images to download to a local folder and 2) pointing me to a more pythonic way of doing this would be great.
# import modules
import urllib2
import re
from bs4 import BeautifulSoup
##test variables##
stationName = "KBYX"
prod = ("bref1","vel1") # a tuple of both ref and vel
bkgr = "black"
duration = "1"
#home_dir = "/path/to/home/directory/folderForImages"
##program##
# This program needs to do the following:
# read the folder structure from home directory to get radar names
#left off here
list_of_folders = os.listdir(home_dir)
for each_folder in list_of_folders:
    if each_folder.startswith('k'):
        print each_folder
# here each folder that starts with a "k" represents a radar station, and within each folder are two other folders bref1 and vel1, the two products. I want the program to read the folders to decide which radar to retrieve the data for... so if I decide to add radars, all I have to do is add the folders to the directory tree.
# first request will be for prod[0] - base reflectivity
# second request will be for prod[1] - base velocity
# sample path:
# http://weather.rap.ucar.edu/radar/displayRad.php?icao=KMPX&prod=bref1&bkgr=black&duration=1
#base part of the path
base = "http://weather.rap.ucar.edu/radar/displayRad.php?"
#additional parameters
call = base+"icao="+stationName+"&prod="+prod[0]+"&bkgr="+bkgr+"&duration="+duration
#read in the webpage
urlContent = urllib2.urlopen(call).read()
webpage=urllib2.urlopen(call)
#parse the webpage with BeautifulSoup
soup = BeautifulSoup(urlContent)
#print (soup.prettify()) # if you want to take a look at the parsed structure
tag = soup.param.param.param.param.param.param.param #find the tag that holds all the filenames (which are nested in the PARAM tag, and
# located in the "value" parameter for PARAM name="filename")
files_in=str(tag['value'])
files = files_in.split(',') # they're in a single element, so split them by comma
directory = home_dir+"/"+stationName+"/"+prod[1]+"/"
counter = 0
for file in files: # now we should THEORETICALLY be able to iterate over them to download them... here I just print them
    print file
To save images locally, something like
import os
IMAGES_OUTDIR = '/path/to/image/output/directory'
for file_url in files:
    image_content = urllib2.urlopen(file_url).read()
    image_outfile = os.path.join(IMAGES_OUTDIR, os.path.basename(file_url))
    with open(image_outfile, 'wb') as wfh:
        wfh.write(image_content)
If you want to rename them, use the name you want instead of os.path.basename(file_url).
I use these three methods for downloading from the internet:
from os import path, mkdir
from urllib import urlretrieve
def checkPath(destPath):
    # Add final slash if missing
    if destPath != None and len(destPath) and destPath[-1] != '/':
        destPath += '/'
    if destPath != '' and not path.exists(destPath):
        mkdir(destPath)
    return destPath

def saveResource(data, fileName, destPath=''):
    '''Saves data to file in binary write mode'''
    destPath = checkPath(destPath)
    with open(destPath + fileName, 'wb') as fOut:
        fOut.write(data)

def downloadResource(url, fileName=None, destPath=''):
    '''Saves the content at url in folder destPath as fileName'''
    # Default filename
    if fileName == None:
        fileName = path.basename(url)
    destPath = checkPath(destPath)
    try:
        urlretrieve(url, destPath + fileName)
    except Exception as inst:
        print 'Error retrieving', url
        print type(inst) # the exception instance
        print inst.args # arguments stored in .args
        print inst
There are a bunch of examples here to download images from various sites
