Python is known to be an easy and powerful language. I have a List, literally, of URL images,
>>> for i in images: print i
http://upload.wikimedia.org/wikipedia/commons/8/86/Influenza_virus_research.jpg
http://upload.wikimedia.org/wikipedia/commons/f/f8/Wiktionary-logo-en.svg
http://upload.wikimedia.org/wikipedia/en/e/e7/Cscr-featured.svg
http://upload.wikimedia.org/wikipedia/commons/f/fa/Wikiquote-logo.svg
http://upload.wikimedia.org/wikipedia/commons/4/4c/Wikisource-logo.svg
http://upload.wikimedia.org/wikipedia/commons/1/1b/Wikiversity-logo-en.svg
http://upload.wikimedia.org/wikipedia/commons/1/1b/Wikiversity-logo-en.svg
I wonder if there's some library (or snippet of code) in python to easily display a list of URL images in a browser, or maybe save them in a folder.
import urllib.request
urllib.request.urlretrieve("http://8020.photos.jpgmag.com/3670771_314453_2ee7120da5_m.jpg", "my.jpg")
The second argument, "my.jpg", is the path where the file is saved. It can also be a full path such as "/home/user/pics/my.jpg".
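For the "save them in a folder" part of the question, the same call can be wrapped in a loop. This is only a minimal sketch for Python 3: save_images is a hypothetical helper name, and it assumes the last path component of each URL is usable as a file name (two URLs ending in the same name would overwrite each other).

```python
import os
import urllib.request
from urllib.parse import urlparse


def save_images(urls, folder):
    """Download every URL in `urls` into `folder` and return the saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for url in dict.fromkeys(urls):  # drop repeated URLs, keep the original order
        # Assumption: the last path component of the URL is a usable file name.
        name = os.path.basename(urlparse(url).path)
        path = os.path.join(folder, name)
        urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved
```

Calling save_images(images, "pics") with the list from the question would then write one file per distinct URL into a pics folder.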
When I try to save a pyplot figure as a jpg, I keep getting a directory error saying that the given file name is not a directory. I am working in Colab. I have a numpy array called z_img and have opened a zip file.
import matplotlib.pyplot as plt
from zipfile import ZipFile
zipObj = ZipFile('slices.zip', 'w') # opening zip file
plt.imshow(z_img, cmap='binary')
The plotting works fine. I did a test of saving the image into Colab's regular memory like so:
plt.savefig(str(ii)+'um_slice.jpg')
And this works perfectly, except I am intending to use this code in a for loop. ii is an index to differentiate between each image, and several hundred images would be created so I want them going in the zipfile. Now when I try adding the path to the zipfile:
plt.savefig('/content/slices.zip/'+str(ii)+'um_slice.jpg')
I get: NotADirectoryError: [Errno 20] Not a directory: '/content/slices.zip/150500um_slice.jpg'
I assume it's because the {}.jpg string is a filename, and not a directory per se. But I am quite new to Python, and don't know how to get the plot into the zip file. That's all I want. Would love any advice!
First off, for anything that's not photographic content (i.e. nice and soft), JPEG is the wrong format. You'll have a better time using a different file format. PNG is nice for pixels, SVG for vector graphics (in case you embed this in a website later!), PDF for vector, too.
The error message is quite on point: you cannot just save to a zip file as if it was a directory.
Multiple ways around:
use the tempfile module's mkdtemp to make a temporary directory, save into that, and zip the result
save not into a filename, but into a buffer (BytesIO I guess) and append that to the compressed stream (I'm not too familiar with ZipFile)
use PDF as output and simply generate a multipage PDF; it's not hard, and probably much nicer in the long term. You can still convert that vector graphic result to PNG (or any other pixel format) as desired, but for the time being, it's space efficient, arbitrarily scalable and keeps all your pages in one place. It's easy to import selected pages into LaTeX (as a matter of fact, \includegraphics does it directly) or into websites (pdf.js).
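The buffer option from the list above could look roughly like this. It is a sketch, not a tested matplotlib recipe: add_rendered_file is a hypothetical helper name, and render stands for any callable that writes to a binary file-like object.

```python
import io
import zipfile


def add_rendered_file(zip_obj, arcname, render):
    """Render into an in-memory buffer, then store the bytes in the archive.

    `render` is any callable that writes to a binary file-like object.
    """
    buf = io.BytesIO()
    render(buf)
    # writestr stores the collected bytes under the given name inside the zip.
    zip_obj.writestr(arcname, buf.getvalue())
```

Inside the loop from the question this would be called as add_rendered_file(zipObj, str(ii) + 'um_slice.png', lambda buf: plt.savefig(buf, format='png')), so nothing has to be written to disk first.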
From the docs, matplotlib.pyplot.savefig accepts a binary file-like object. ZipFile.open creates binary file-like objects. These two have to get together!
with zipObj.open(str(ii)+'um_slice.jpg', 'w') as fp:
    # A file object has no extension to infer the format from, so name it explicitly.
    plt.savefig(fp, format='jpg')
I'm trying to save images from the Spotify API
I get album art in the form of a link:
https://i.scdn.co/image/ab67616d00004851c96f7c7b077c224975b4c5ce
I think it's a jpg file.
I run into errors in trying to display or save this in python.
I'm not even sure how I'm meant to format something like:
Do I need str around the link?
str(https://i.scdn.co/image/ab67616d00004851c96f7c7b077c224975b4c5ce)
Or should I create a new variable e.g.
image_path = 'https://i.scdn.co/image/ab67616d00004851c96f7c7b077c224975b4c5ce'
And then:
im1 = im1.save(image_path)
Your second suggestion should work with an addition of actually downloading the image using urllib.request:
import urllib.request
image_path = 'https://i.scdn.co/image/ab67616d00004851c96f7c7b077c224975b4c5ce'
urllib.request.urlretrieve(image_path, "image.jpg")
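Since the link has no file extension, the "I think it's a jpg" guess can be checked from the first bytes of the download. The sketch below only knows three common signatures, and guess_image_extension is a hypothetical helper name; it makes no claim about what the Spotify servers actually return.

```python
def guess_image_extension(data):
    """Guess a file extension from the leading bytes of image data."""
    if data.startswith(b"\xff\xd8\xff"):
        return ".jpg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return ".gif"
    return None  # unknown or unsupported format
```

The bytes can be read with urllib.request.urlopen(image_path).read() and then written to "image" plus the returned extension.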
I want to webscrape a site and save some, but not all, of its images to my computer. I want to save about 5,600 images, so doing it manually would be difficult. All of the image URLs start with
https://assets.pokemon.com/assets/cms2/img/cards/
and then some other stuff that is specific to the image. How can I download only images that meet that criteria?
Also (sorry, this is kind of 2 questions in 1, but it's related) how can I save each image's alt text as the file name?
Thanks!
Also sorry if this is a dumb question, if you can't tell by the fact that I'm scraping pokemon.com, I'm not exactly a professional.
Here's what I ended up doing:
import requests
import urllib.request
contents = requests.get(url) # Get request to site
data = contents.text # Get HTML file as text
x = data.split("\"") # Splits it into an array using double quotes as separators (Because all of the image urls were in quotes)
for a in range(len(x)): # Runs this code for every member of the array
    if 'https://assets.pokemon.com/assets/cms2/img/cards' in x[a]: # Checks for that URL snippet. (That's not the full URL, each full URL just started with that)
        link = x[a] # If it is, store that member of the array separately to be extracted
        name = x[a+2] # Alt text was always 2 members of the array later, not sure if this is true for all sites.
        path = "/Users/myName/Desktop/Poke/" + name + ".png" # This is where I wanted to store the files
        urllib.request.urlretrieve(link, path) # Retrieved the file from the link, and saved it to the path
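Splitting on quotes depends on the exact attribute order in the page. A somewhat sturdier variant, sketched here with the standard library's html.parser, reads the src and alt attributes of each img tag directly. It assumes the images appear as plain img tags in the downloaded HTML (not added later by JavaScript); ImageCollector and safe_filename are hypothetical names.

```python
import re
from html.parser import HTMLParser

PREFIX = "https://assets.pokemon.com/assets/cms2/img/cards/"


class ImageCollector(HTMLParser):
    """Collect (src, alt) pairs for <img> tags whose src starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        if src.startswith(self.prefix):
            self.images.append((src, attributes.get("alt") or ""))


def safe_filename(text):
    """Replace characters that are awkward in file names with underscores."""
    return re.sub(r"[^\w\- ]", "_", text).strip() or "unnamed"
```

After collector = ImageCollector(PREFIX) and collector.feed(data), each pair in collector.images can be passed to urllib.request.urlretrieve, using safe_filename(alt) + ".png" as the file name.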
I've searched the documentation for python-docx and other packages, as well as stack-overflow, but could not find how to remove all images from docx files with python.
My exact use-case: I need to convert hundreds of word documents to "draft" format to be viewed by clients. Those drafts should be identical to the original documents but all the images must be deleted / redacted from them.
Sorry for not including an example of things I tried, what I have tried is hours of research that didn't give any info. I found this question on how to extract images from word files, but that doesn't delete them from the actual document: Extract pictures from Word and Excel with Python
From there and other sources I've found out that docx files could be read as simple zip files, I don't know if that means that it's possible to "re-zip" without the images without affecting the integrity of the docx file (edit: simply deleting the images works, but prevents python-docx from continuing to work with this file because of missing references to images), but thought this might be a path to a solution.
Any ideas?
If your goal is to redact images, maybe this code I used for a similar use case could be useful:
import sys
import zipfile
from PIL import Image, ImageFilter
import io

blur = ImageFilter.GaussianBlur(40)

def redact_images(filename):
    outfile = filename.replace(".docx", "_redacted.docx")
    with zipfile.ZipFile(filename) as inzip:
        with zipfile.ZipFile(outfile, "w") as outzip:
            for info in inzip.infolist():
                name = info.filename
                print(info)
                content = inzip.read(info)
                if name.endswith((".png", ".jpeg", ".gif")):
                    fmt = name.split(".")[-1]
                    img = Image.open(io.BytesIO(content))
                    img = img.convert().filter(blur)
                    outb = io.BytesIO()
                    img.save(outb, fmt)
                    content = outb.getvalue()
                    info.file_size = len(content)
                    info.CRC = zipfile.crc32(content)
                outzip.writestr(info, content)
Here I used PIL to blur the images in some files, but any other suitable operation could be used instead of the blur filter. This worked quite nicely for my use case.
I don't think it's currently implemented in python-docx.
Pictures in the Word Object Model are defined as either floating shapes or inline shapes. The docx documentation states that it only supports inline shapes.
The Word Object Model for Inline Shapes supports a Delete() method, which should be accessible. However, it is not listed in the examples of InlineShapes and there is also a similar method for paragraphs. For paragraphs, there is an open feature request to add this functionality - which dates back to 2014! If it's not added to paragraphs it won't be available for InlineShapes as they are implemented as discrete paragraphs.
You could do this with win32com if you have a machine with Word and Python installed.
This would allow you to call the Word Object Model directly, giving you access to the Delete() method. In fact you could probably cheat - rather than scrolling through the document to get each image, you can call Find and Replace to clear the image. This SO question talks about win32com find and replace:
import win32com.client
from os import getcwd, listdir

docs = [i for i in listdir('.') if i[-3:]=='doc' or i[-4:]=='docx'] #All Word file
FromTo = {"First Name":"John",
          "Last Name":"Smith"} #You can insert as many as you want
word = win32com.client.DispatchEx("Word.Application")
word.Visible = True #Keep comment after tests
word.DisplayAlerts = False
for doc in docs:
    word.Documents.Open('{}\\{}'.format(getcwd(), doc))
    for From in FromTo.keys():
        word.Selection.Find.Text = From
        word.Selection.Find.Replacement.Text = FromTo[From]
        word.Selection.Find.Execute(Replace=2, Forward=True) #You made the mistake here=> Replace must be 2
    name = doc.rsplit('.',1)[0]
    ext = doc.rsplit('.',1)[1]
    word.ActiveDocument.SaveAs('{}\\{}_2.{}'.format(getcwd(), name, ext))
word.Quit() # releases Word object from memory
In this case since we want images, we would need to use the short-code ^g as the find.Text and blank as the replacement.
find = word.Selection.Find
find.Text = "^g"
find.Replacement.Text = ""
find.Execute(Replace=2, Forward=True) # 2 = wdReplaceAll; 1 would replace only the first match
I don't know about this library, but looking through the documentation I found this section about images. It mentions that it is currently not possible to insert images other than inline. If that is what you currently have in your documents, I assume you can also retrieve these by looking in the Document object and then remove them?
The Document is explained here.
Although not a duplicate, you might also want to look at this question's answer where user "scanny" explains how he finds images using the library.
I am trying to download a file from the web using python requests and then pass this file to an HTML file field via python-selenium webdriver's send_keys. My current code for this is as follows.
image = requests.get('https://theartgalleryumd.files.wordpress.com/2011/10/dsc0017.jpg')
i = Image.open(StringIO(image.content))
image_name = "{0}_file_name.jpg".format(unicode(time.time()),)
image = i.save(os.getcwd()+"/{0}".format(image_name))
driver.find_element_by_id('image').send_keys(os.getcwd()+"/{0}".format(image_name))
This code works, but I think there must be a better way to do it. Is it possible to assign the file / image to a file field in python selenium without saving / creating it on the hard disk?
You are doing it correctly. The input element with type="file" requires a local path to the file to be passed in. This means that you have to download the file first and have it saved on a disk.
I'd use urlretrieve instead of a series of PIL commands to download the image and save it:
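import urllib.request
# With no filename given, the image goes to a temporary file; keep the returned path for send_keys.
local_path, headers = urllib.request.urlretrieve('https://theartgalleryumd.files.wordpress.com/2011/10/dsc0017.jpg')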
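Because the file field needs a real path, one way to keep the clutter down is to download into a temporary file and delete it after the upload. This is a rough Python 3 sketch; download_to_temp is a hypothetical helper name, and the ".jpg" suffix is an assumption about the image type.

```python
import os
import tempfile
import urllib.request


def download_to_temp(url, suffix=".jpg"):
    """Download `url` into a new temporary file and return its absolute path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # urlretrieve opens the path itself, so release our handle
    urllib.request.urlretrieve(url, path)
    return path
```

The returned path can be passed to send_keys as in the question, followed by os.remove(path) once the form has been submitted.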