Embed and retrieve a long string from image metadata in Python

I would like to embed a single long string (several thousand characters) in the header of an image and retrieve it later when reading the image, both using Python. I would like to be able to do this with PNG, TIFF, and JPEG formats. What is the easiest way to do this? (In particular, I'm looking for the method with the fewest and easiest-to-install dependencies.)

In my opinion, the easiest way with the fewest dependencies is to just use exiftool:
import subprocess as sub

def write_data(filename, data):
    # Store the string in the image's Comment tag, in place.
    cmd = ('exiftool', '-Comment=%s' % data, filename)
    sub.check_call(cmd)

def get_data(filename):
    # exiftool prints "Comment : <data>", so split on the first ':' and strip.
    cmd = ('exiftool', '-Comment', filename)
    return sub.check_output(cmd).decode().split(':', 1)[-1].strip()

write_data('IMG_0001.jpg', 'a' * 2048)
assert get_data('IMG_0001.jpg') == 'a' * 2048
There are a few considerations that need to be taken into account depending on the type of data that you will be writing. Have a look at pyexiv2 and gexiv2 if you don't like using exiftool directly.
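If you end up preferring pyexiv2, a rough sketch using the classic pyexiv2 0.3 API might look like the following (untested here, and the tag key is an assumption: Exif.Photo.UserComment is fine for JPEG/TIFF, while PNG would need an XMP or text field instead):
import pyexiv2

def write_data(filename, data):
    # Read the existing metadata, set a comment tag, and write it back in place.
    metadata = pyexiv2.ImageMetadata(filename)
    metadata.read()
    metadata['Exif.Photo.UserComment'] = data
    metadata.write()

def get_data(filename):
    metadata = pyexiv2.ImageMetadata(filename)
    metadata.read()
    return metadata['Exif.Photo.UserComment'].value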

Related

Prefer BytesIO or bytes for internal interface in Python?

I'm trying to decide on the best internal interface to use in my code, specifically around how to handle file contents. Really, the file contents are just binary data, so bytes is sufficient to represent them.
I'm storing files in different remote locations, so I have a couple of different classes for reading and writing. I'm trying to figure out the best interface to use for my functions. Originally I was using file paths, but that was suboptimal because it meant that disk was always used (which meant lots of clumsy tempfiles).
There are several areas of the code that have the same requirement, and would directly use whatever was returned from this interface. As a result whatever abstraction I choose will touch a fair bit of code.
What are the various tradeoffs to using BytesIO vs bytes?
def put_file(location, contents_as_bytes):
def put_file(location, contents_as_fp):
def get_file_contents(location):
def get_file_contents(location, fp):
Playing around, I've found that using the file-like interfaces (BytesIO, etc.) requires a bit of administration overhead in terms of seek(0) etc. That raises questions like:
is it better to seek before you start, or after you've finished?
do you seek to the start or just operate from the position the file is in?
should you tell() to maintain the position?
looking at something like shutil.copyfileobj, it doesn't do any seeking (see the sketch after this list)
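To illustrate the seek issue with a tiny example of my own (not part of the original question): forgetting to rewind before copyfileobj silently copies nothing.
import io
import shutil

src = io.BytesIO(b'payload')
src.read()                    # consume the stream; the pointer is now at the end
dst = io.BytesIO()
shutil.copyfileobj(src, dst)  # copies nothing, since copyfileobj never seeks
assert dst.getvalue() == b''

src.seek(0)                   # rewind first...
shutil.copyfileobj(src, dst)  # ...and now the data comes through
assert dst.getvalue() == b'payload'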
One advantage I've found with using file-like interfaces instead is that it allows for passing in the fp to write into when you're retrieving data. Which seems to give a good deal of flexibility.
def get_file_contents(location, write_into=None):
    if not write_into:
        write_into = io.BytesIO()
    # get the contents and put it into write_into
    return write_into
get_file_contents('blah', file_on_disk)
get_file_contents('blah', gzip_file)
get_file_contents('blah', temp_file)
get_file_contents('blah', bytes_io)
new_bytes_io = get_file_contents('blah')
# etc
Is there a good reason to prefer BytesIO over just using fixed bytes when designing an interface in python?
The benefit of io.BytesIO objects is that they implement a common interface (commonly known as a 'file-like' object). BytesIO objects have an internal pointer (whose position is returned by tell()), and for every call to read(n) the pointer advances n bytes. For example:
import io
buf = io.BytesIO(b'Hello world!')
buf.read(1) # Returns b'H'
buf.tell() # Returns 1
buf.read(1) # Returns b'e'
buf.tell() # Returns 2
# Set the pointer back to the start.
buf.seek(0)
buf.read(1) # Returns b'H' again, like the first call.
In your use case, neither the bytes object nor the io.BytesIO object may be the best solution. They will read the complete contents of your files into memory.
Instead, you could look at tempfile.TemporaryFile (https://docs.python.org/3/library/tempfile.html).
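A minimal sketch of that idea, reusing the write_into-style interface from the question (the actual fetch is faked with a placeholder write, and the helper name is just illustrative):
import tempfile

def get_file_contents(location, write_into=None):
    if write_into is None:
        # Spool to a temporary file on disk instead of holding everything in memory.
        write_into = tempfile.TemporaryFile()
    write_into.write(b'placeholder bytes standing in for the real fetch')
    write_into.seek(0)
    return write_into

fp = get_file_contents('blah')
first_chunk = fp.read(4096)
fp.close()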

Embedding binary data in a script efficiently

I have seen some installation files (huge ones, install.sh for Matlab or Mathematica, for example) for Unix-like systems; they must embed quite a lot of binary data, such as icons, sounds, and graphics, into the script. I am wondering how that can be done, since this could be useful for simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string: x = b'\x23\xa3\xef' ..., terribly inefficient, takes half a MB for a 100KB wav file.
base64, better than option 1, but it still enlarges the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?
You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say your data consist of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2, base64
>>> base64.b64encode(bz2.compress(b'\x00' * 100 + b'\x01' * 200)).decode()
'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2, base64

data = 'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'wb') as fdesc:
    fdesc.write(bz2.decompress(base64.b64decode(data)))
Here's a quick and dirty way. Create the following script called myInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> myInstaller
chmod +x myInstaller
When you run the script, it will copy the binary portion to a new file specified in the path of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc) after the dd command. Just adjust the number in "skip" to reflect the total length of the script before the binary data starts.
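If you would rather stay in Python, the same trick works by having the script read itself and split on a marker line. This is only a sketch under the assumption that the payload is appended after a unique marker; the marker text and output filename are made up:
import sys

MARKER = b'\n### PAYLOAD BELOW ###\n'  # hypothetical marker written just before the data

with open(sys.argv[0], 'rb') as script:
    contents = script.read()

# Everything after the last occurrence of the marker is the embedded payload.
payload = contents.rsplit(MARKER, 1)[1]

with open('payload.bin', 'wb') as out:
    out.write(payload)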

Capture jpgs produced in subprocess in main script

I'm not sure that this is possible, but I'm trying to generate a number of thumbnails from pdfs in an automated way and then store them within elasticsearch. Basically I would like to convert the pdf to a series of jpgs (or pngs, or anything similar) and then index them as binaries. Currently I'm producing these jpgs like this:
import subprocess
params = ['convert', 'pdf_file', 'thumb.jpg']
subprocess.check_call(params)
which works well, but it just writes the jpgs out to the filesystem. I would like to have these files as strings without writing them out to the local file system at all. I've tried using the stdout methods of subprocess, but I'm fairly new to using subprocesses, so I wasn't able to figure this one out.
I'm using imagemagick for this conversion, but I am open to switching to any other tool so long as I can achieve this goal.
Any ideas?
You can have it send the data to stdout instead...
import subprocess
params = ['convert', 'pdf_file', 'jpg:-']
image_data = subprocess.check_output(params)
You can use ImageMagick's Python API, for example something like:
import PythonMagick
img = PythonMagick.Image("file.pdf")
img.depth = 8
img.magick = "RGB"
data = img.data
or use wand:
from wand.image import Image

with Image(filename='file.pdf') as img:
    data = img.make_blob('png')
I would like to have these files as strings without writing them out to the local file system at all.
The way to do this is to tell the command to write its data to stdout instead of a file, then just read it from proc.stdout.
Not every command has a way to tell it to do this, but in many cases, just passing - as the output filename will do it, and that's true for ImageMagick's convert. Of course you'll also need to give it a format, because it can no longer guess it from the extension of thumb.jpg. The easiest way to do this in convert is to prefix the type to the - pseudo-filename. (Don't try that with anything other than ImageMagick.)
So:
import subprocess
params = ['convert', 'pdf_file', 'jpg:-']
converted = subprocess.check_output(params)
However, this is going to get you one giant string. If you were trying to get a bunch of separate images, you'll need to split the one giant string into separate images, which will presumably require some knowledge of the JPEG/JFIF format.
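If you switch to wand (mentioned above), you can sidestep the splitting problem entirely, because it exposes the PDF pages as a sequence. A sketch, with an arbitrary choice of resolution and output format:
from wand.image import Image

jpeg_pages = []
with Image(filename='pdf_file', resolution=150) as pdf:
    for page in pdf.sequence:
        with Image(image=page) as single:
            single.format = 'jpeg'
            jpeg_pages.append(single.make_blob())
# jpeg_pages is now a list of JPEG byte strings, one per PDF page.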

Determine file-/content-/mime-type from file-like?

Is it possible to determine the type of a file-like object in Python?
For instance, if I were to read the contents of a file into a StringIO container and store it in a database, could I later work out the original file-/content-/mime-type from the data? E.g., are there any common headers I could search for?
If not, are there any ways to determine "common" files (images, office docs, etc)?
You could try the filemagic module:
import magic

with magic.Magic() as m:
    m.id_filename('setup.py')
    # => 'Python script, ASCII text executable'

    b = open("image.jpg", "rb").read()
    m.id_buffer(b)
    # => 'JPEG image data, JFIF standard 1.01'
Yes, you can evaluate the hex signature (the magic bytes at the very start of the data) yourself.
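For example, a hand-rolled check for a handful of common formats might look like this (the signature list is just a small illustrative subset, not exhaustive):
import io

# Leading byte signatures for a few common formats (illustrative subset).
SIGNATURES = [
    (b'\xff\xd8\xff', 'image/jpeg'),
    (b'\x89PNG\r\n\x1a\n', 'image/png'),
    (b'GIF87a', 'image/gif'),
    (b'GIF89a', 'image/gif'),
    (b'%PDF-', 'application/pdf'),
    (b'PK\x03\x04', 'application/zip'),  # also the container for docx/xlsx/odt
]

def guess_mime(fileobj):
    # Peek at the first few bytes without disturbing the caller's position.
    pos = fileobj.tell()
    head = fileobj.read(16)
    fileobj.seek(pos)
    for signature, mime in SIGNATURES:
        if head.startswith(signature):
            return mime
    return None

assert guess_mime(io.BytesIO(b'%PDF-1.4 ...')) == 'application/pdf'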

pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:
http://www.4shared.com/document/kmJF67E4/forms.html
If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?
from pyPdf import PdfFileReader
reader = PdfFileReader(open("forms.pdf", "rb"))
for page in reader.pages:
    print page.extractText()
Note that extractText() still has problems extracting the text properly. From the documentation for extractText():
This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Since it is the text you want, you can use the Linux command pdftotext.
To invoke that using Python, you can do this:
>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
The text is extracted from forms.pdf and saved to output.
This works in the case of your PDF file and extracts the text you want.
This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and sometimes even when there are no non-ASCII characters. When pyPdf encounters strings that are not in a standard Unicode encoding, it just sees a bunch of byte codes; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time consuming, but I hope to send a patch to the maintainer some time around mid-2011.
You could also try the pdfminer library (also in Python) and see if it's better at extracting the text. For splitting, however, you will have to stick with pyPdf, as pdfminer doesn't support that.
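For what it's worth, with the modern pdfminer.six fork text extraction is a one-liner; the original pdfminer of that era had a much more involved API, so treat this as a sketch:
from pdfminer.high_level import extract_text

text = extract_text('forms.pdf')
print(text[:500])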
I find it sometimes useful to convert the PDF to PostScript (try both pdf2ps and pdftops, their output can differ) and then back to PDF (ps2pdf). Then try your original script again.
I had a similar problem with some PDFs; on Windows, this works excellently for me:
1. Download the Xpdf tools for Windows.
2. Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64.
3. Use subprocess to run the command from the console:
import subprocess

try:
    # filePath holds the path to the PDF; '-' sends the extracted text to stdout.
    extInfo = subprocess.check_output(
        'pdftotext.exe ' + filePath + ' -',
        shell=True, stderr=subprocess.STDOUT).strip()
except Exception as e:
    print(e)
I'm starting to think I should adopt a messy two-part solution. There are two sections to the PDF: pp. 1-82, which have text page labels (pdftotext can extract them), and pp. 83 to the end, which have no page labels but which pyPdf can extract, since it explicitly knows about pages.
I think I need to combine the two. Clunky, but I don't see any way round it. Sadly, I'm having to do this on a Windows machine.
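A rough sketch of that clunky combination, assuming pdftotext is on the PATH and that the split really is at page 82 (both taken from the description above):
import subprocess
from pyPdf import PdfFileReader

# Pages 1-82: let pdftotext handle them (-f/-l select the page range).
subprocess.check_call(['pdftotext', '-f', '1', '-l', '82', 'forms.pdf', 'first_part.txt'])
with open('first_part.txt') as fp:
    first_part = fp.read()

# Pages 83 to the end: pyPdf can extract these directly.
reader = PdfFileReader(open('forms.pdf', 'rb'))
last_part = ''.join(reader.getPage(i).extractText()
                    for i in range(82, reader.getNumPages()))

full_text = first_part + last_part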
