I'm not sure that this is possible, but I'm trying to generate a number of thumbnails from PDFs in an automated way and then store them in Elasticsearch. Basically, I would like to convert each PDF to a series of JPGs (or PNGs, or anything similar) and then index them as binaries. Currently I'm producing these JPGs like this:
import subprocess
params = ['convert', 'pdf_file', 'thumb.jpg']
subprocess.check_call(params)
which works well, but it just writes the jpgs out to the filesystem. I would like to have these files as strings without writing them out to the local file system at all. I've tried using the stdout methods of subprocess, but I'm fairly new to using subprocesses, so I wasn't able to figure this one out.
I'm using ImageMagick for this conversion, but I'm open to switching to any other tool as long as I can achieve this goal.
Any ideas?
You can have it send the data to stdout instead...
import subprocess
params = ['convert', 'pdf_file', 'jpg:-']
image_data = subprocess.check_output(params)
You can use ImageMagick's Python API, for example something like:
import PythonMagick
img = PythonMagick.Image("file.pdf")
img.depth = 8
img.magick = "RGB"
data = img.data
Or use Wand:
from wand.image import Image
with Image(filename='file.pdf') as img:
    data = img.make_blob('png')
I would like to have these files as strings without writing them out to the local file system at all.
The way to do this is to tell the command to write its data to stdout instead of a file, then just read it from proc.stdout.
Not every command has a way to tell it to do this, but in many cases, just passing - as the output filename will do it, and that's true for ImageMagick's convert. Of course you'll also need to give it a format, because it can no longer guess it from the extension of thumb.jpg. The easiest way to do this in convert is to prefix the type to the - pseudo-filename, as in jpg:-. (Don't try that with anything other than ImageMagick.)
So:
import subprocess
params = ['convert', 'pdf_file', 'jpg:-']
converted = subprocess.check_output(params)
However, this is going to get you one giant string. If you were trying to get a bunch of separate images, you'll need to split the one giant string into separate images, which will presumably require some knowledge of the JPEG/JFIF format.
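One way to avoid splitting the stream yourself is to ask convert for one page at a time. This is only a minimal sketch, assuming Python 3 and ImageMagick's `[N]` page-index suffix on the input filename. The helper names `page_command` and `render_page` are made up for illustration, and you would have to know (or discover) the page count yourself:

```python
import subprocess


def page_command(pdf_path, page, fmt="jpg"):
    """Build a convert command that renders one zero-based page to stdout."""
    # "file.pdf[0]" selects the first page; "jpg:-" means "JPEG to stdout".
    return ["convert", "{}[{}]".format(pdf_path, page), "{}:-".format(fmt)]


def render_page(pdf_path, page, fmt="jpg"):
    """Return the encoded image for a single page as bytes."""
    return subprocess.check_output(page_command(pdf_path, page, fmt))
```

Something like `thumbs = [render_page('pdf_file', n) for n in range(page_count)]` would then give you one bytes object per page, at the cost of starting one convert process per page.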
Related
Wow, it was hard to encapsulate my issue here into a succinct headline. I hope I managed.
I've got a simple thumbnail feature that is causing me issues when I try to retrieve a URL from Amazon S3, then convert it using ImageMagick. I would normally use PIL to read in an image file and convert it, but PIL doesn't read in PDF formats, so I'm resorting to convert through a subprocess call.
Here's some code from a django views.py. The idea here is that I get a file url from S3, open it with convert, process it into a PNG, send it to stdout, and then use the outputted buffer to load up a StringIO object, which then gets passed back to default_storages to save the thumbnail file back to S3. Quite a faff for such a simple job, but there you go.
Please note: I cannot reliably save a file to disk using convert on my production set-up with Heroku, otherwise, I'd be doing that already.
def _large_thumbnail_s3(p):
    # get the URL from S3, trimming off the expiry info etc. So far so good.
    url = default_storage.url(p+'.pdf').split('?')
    url = url[0]
    # this opens the PDF file fine, and proceeds to convert and send
    # the new PNG to the buffer via standard output.
    from subprocess import call
    call("convert -thumbnail '400x600>' -density 96 -quality 85 "
         +url
         +" png:-", shell=True)
    from StringIO import StringIO
    # here's where the problem starts. I'm clearly not reading
    # in the stdout correctly, as I get a IOError: File not open for reading
    # from this next line of code:
    completeStdin = sys.stdout.read()
    im = StringIO(completeStdin)
    # now take the StringIO PNG object and save it back to S3 (this
    # should not be an issue).
    im = default_storage.open(p+'_lg.png', 'w+b')
    im.close()
Can anyone tell me a) where I might be going wrong with regard to sending the output back to the thumbnail function, and b) whether there are any more robust alternatives to what seems a pretty hacky way of doing this?
TIA
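You need to use subprocess.check_output, not subprocess.call. call only returns the exit code and lets the child write straight to your process's stdout, whereas check_output captures the child's stdout and returns it as a single string:
from subprocess import check_output
from StringIO import StringIO
out = check_output("convert -thumbnail '400x600>' -density 96 -quality 85 "
                   +url
                   +" png:-", shell=True)
buffer = StringIO(out)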
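If you want to avoid shell=True (the URL is pasted into a shell string above), the same call can be written with an argument list. This is a rough sketch for Python 3, where the output is bytes and io.BytesIO takes the place of StringIO. The names `thumbnail_command` and `thumbnail_buffer`, and the `run` parameter, are hypothetical; `run` is only there so the function can be tried without ImageMagick installed:

```python
import io
import subprocess


def thumbnail_command(url):
    """Argument list equivalent to the shell string used in the question."""
    # With a list there is no shell, so '400x600>' needs no quoting.
    return ["convert", "-thumbnail", "400x600>", "-density", "96",
            "-quality", "85", url, "png:-"]


def thumbnail_buffer(url, run=subprocess.check_output):
    """Run convert and wrap the PNG bytes in an in-memory file object."""
    png_bytes = run(thumbnail_command(url))
    return io.BytesIO(png_bytes)
```

The returned buffer can be read like a file, so its contents can be written to whatever default_storage.open(p+'_lg.png', 'w+b') gives you.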
I have written a piece of code which runs SExtractor from Python. However, I only know how to do this for one file, and I need to loop it over 62 files. I'm not sure how I would go about doing this. I have attached my code below:
#!/usr/bin/env python
# build a catalog using sextractor on des image here
import os
import sys
sys.path.append('/home/fitsfiles') #not sure if this does anything/is correct
def sex(image, output, sexdir='/home/sextractor-2.5.0', check_img=None,config=None, l=None) :
    '''Construct a sextractor command and run it.'''
    #creates a sextractor line e.g sex img.fits -catalog_name -checkimage_name
    q="/home/fitsfiles/"+ "01" +".fits"
    com = [ "sex ", q, " -CATALOG_NAME " + output]
    s0=''
    com = s0.join(com)
    res = os.system(com)
    return res
img_name=sys.argv[0]
output=img_name[0:1]+'_star_catalog.fits'
t=sex(img_name,output)
print '----done !---'
This code runs the following command in my terminal: sex /home/fitsfiles/01.fits -CATALOG_NAME g_star_catalog.fits
which successfully produces a star catalogue as I want.
However, I want my code to do this for 62 FITS files and to change the name of star_catalog.fits depending on which FITS file is being used. Any help would be appreciated.
There are many ways you could approach this. Let's assume you want to run your script as something like
python extract_stars.py /home/fitsfiles/*.fits
Then, you could try something like this:
for arg in sys.argv[1:]:
    # basename drops the directory, splitext drops only the '.fits' extension
    filename = os.path.splitext(os.path.basename(arg))[0]
    t = sex(arg, filename +'_star_catalog.fits')
    # Whatever else
This assumes that you remove the line in sex that reformats the input filename. Also, you do not need to append the fits directory to your path.
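Putting that together, a minimal sketch of the whole loop might look like the following. It assumes Python 3 and that the sex executable is on your PATH. The function names `catalog_name`, `sex_command` and `run_all` are invented for this example. os.path.splitext is used rather than str.strip, because strip removes a set of characters from both ends, not a suffix:

```python
import os
import subprocess


def catalog_name(fits_path):
    """Map '/home/fitsfiles/01.fits' to '01_star_catalog.fits'."""
    base = os.path.basename(fits_path)
    stem, _ext = os.path.splitext(base)  # removes only the final extension
    return stem + "_star_catalog.fits"


def sex_command(fits_path):
    """Build the argument list for one SExtractor run."""
    return ["sex", fits_path, "-CATALOG_NAME", catalog_name(fits_path)]


def run_all(fits_paths):
    """Run SExtractor once per file; raises if any run fails."""
    for path in fits_paths:
        subprocess.check_call(sex_command(path))
```

Calling `run_all(sys.argv[1:])` at the bottom of the script would then process every file the shell expands from /home/fitsfiles/*.fits.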
The alternative approach is, if you do not plan to do anything else in python, you could write a bash script which would really simplify the task.
And, as a side note, if you had asked this question more generally (i.e., I wish to apply a function I wrote to a number of input files) and without reference to a rather uncommonly used application, you would likely have received an answer much more quickly.
The community has now developed some python wrappers which allow you to run sextractor as if it was a python command. These are: pysex, sewpy and astromatic_wrapper.
The good thing about sextractor wrappers is that they allow you to write much cleaner code without the need to define extra functions, invoke OS commands, or keep the configuration files and output files on your machine. Moreover, the output can be an astropy table, a pandas dataframe or a numpy array.
For your specific case, you could use pysex and do:
import pysex
import glob
filelist = glob.glob('/directory/*.fits')
for fitsfile in filelist:
    cat = pysex.run(fitsfile, params=['X_IMAGE', 'Y_IMAGE', 'FLUX_APER'],
                    conf_args={'PHOT_APERTURES':5})
    print cat['FLUX_APER']
I would like to embed a single long string (several thousand characters) in the header of an image, and retrieve it later when reading the image, both using Python. I would like to be able to do this with PNG, TIFF, and JPEG formats. What is the easiest way to do this? (in particular I'm looking for a method with the easiest and fewest dependencies to install).
In my opinion, the easiest way with the fewest dependencies is to just use exiftool:
import subprocess as sub
def write_data(filename, data):
    cmd = ('exiftool', '-Comment=%s' % data, filename)
    sub.check_call(cmd)
def get_data(filename):
    cmd = ('exiftool', '-Comment', filename)
    return sub.check_output(cmd).split(':', 1)[-1].strip()
write_data('IMG_0001.jpg', 'a'*2048)
assert get_data('IMG_0001.jpg') == 'a'*2048
There are a few considerations that need to be taken into account depending on the type of data that you will be writing. Have a look at pyexiv2 and gexiv2 if you don't like using exiftool directly.
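One such consideration is that on Python 3 check_output returns bytes, so the `.split(':', 1)` above would fail with a str separator. Here is a small sketch of the parsing step for Python 3, assuming exiftool prints the tag as a single "Tag Name : value" line encoded as UTF-8. The names `parse_tag_value` and `get_comment` are illustrative only:

```python
import subprocess


def parse_tag_value(raw_output):
    """Extract the value from one 'Tag Name   : value' line of exiftool output."""
    text = raw_output.decode("utf-8")  # check_output returns bytes on Python 3
    _label, _sep, value = text.partition(":")  # split at the first colon only
    return value.strip()


def get_comment(filename):
    """Read the Comment tag, as get_data does above."""
    raw = subprocess.check_output(["exiftool", "-Comment", filename])
    return parse_tag_value(raw)
```

Note that, like the original, this strips leading and trailing whitespace from the stored string, so it is not suitable if that whitespace matters to you.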
Is there a way to capture a single frame of a video file in Python?
It could also be done from the command line. I'm using HandBrakeCLI to convert the videos,
but I need some screenshots of them too.
Thank you.
You should first check out PyFFmpeg.
"PyFFmpeg is a wrapper around FFmpeg's libavcodec, libavformat and libavutil libraries whose main purpose is to provide access to individual frames of video files of various formats (including MPEG and DIVX encoded videos). It also provides access to audio data."
It is also possible with plain ffmpeg, called through subprocess. A simple search will give you the command required to extract a frame from a video file; running that command via subprocess should do it.
>>> import subprocess
>>> import shlex # to split the command that follows
>>> command = 'ffmpeg -i sample.avi' # your command goes here
>>> subprocess.call(shlex.split(command))
A similar procedure applies to HandBrakeCLI or whatever else you might use. Just call the appropriate command.
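For the frame grab itself, here is a hedged sketch of what such a command could look like, assuming a reasonably recent ffmpeg on your PATH. It uses the standard -ss (seek) and -frames:v (number of video frames to write) options. The function names and the file names in the usage line are placeholders:

```python
import subprocess


def frame_command(video_path, timestamp, image_path):
    """Build an ffmpeg command that saves one frame at `timestamp`."""
    return [
        "ffmpeg",
        "-ss", timestamp,      # seek to this position, e.g. "00:00:05"
        "-i", video_path,      # input video
        "-frames:v", "1",      # stop after writing one video frame
        image_path,            # output format is inferred from the extension
    ]


def grab_frame(video_path, timestamp, image_path):
    """Run ffmpeg; raises CalledProcessError if it exits with an error."""
    subprocess.check_call(frame_command(video_path, timestamp, image_path))
```

For example, `grab_frame('sample.avi', '00:00:05', 'frame.png')` should write the frame found five seconds in to frame.png. Passing a list avoids the need for shlex.split.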
I ran into the following problem: CODE A works right now. I am saving a PNG file called chart.png locally, and then I am loading it into the proprietary function (whose source I do not have access to).
However, in CODE B, I am trying to use cStringIO.StringIO() so that I do not have to write the file "chart.png" to disk. But I cannot find a way to pass it to the proprietary function, because it is expecting a real filename like "chart.png" (it looks like it even uses the split function to identify the extension).
CODE A (code running right now):
file = "chart.png"
pylab.savefig(file, format='png')
a = proprietaryfunction.add(file)
CODE B (what I am trying to do - and does not work):
file = cStringIO.StringIO()
pylab.savefig(file, format='png')
a = proprietaryfunction.add(file)
How can I make the use of cStringIO.StringIO() transparent to the proprietary function? Is there any way that I can emulate a virtual file system in memory for this?
Probably not, but there's always tempfile if you need a "clean" workaround...
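A minimal sketch of that tempfile workaround, assuming Python 3: the image is rendered into memory first, then written to a short-lived file whose name ends in .png so the extension check passes. `call_with_temp_png` is a made-up helper name, and `func` stands in for something like proprietaryfunction.add:

```python
import os
import tempfile


def call_with_temp_png(png_bytes, func):
    """Write png_bytes to a temporary .png file, pass its path to func, clean up."""
    # delete=False so the file can be reopened by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=".png", delete=False)
    try:
        tmp.write(png_bytes)
        tmp.close()  # flush to disk before func opens the path
        return func(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        os.remove(tmp.name)
```

With the figure saved into an in-memory buffer, `call_with_temp_png(buf.getvalue(), proprietaryfunction.add)` would hand the function a real path. This still touches the disk briefly, but the file is removed as soon as the call returns or raises.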