How can I access images from powerpoint (python-pptx) - python

I'm having a hard time trying to access/save images using the python-pptx library. So, if the image is of shape type PICTURE (that's shape.shape_type == MSO_SHAPE_TYPE.PICTURE) I can access/save the image easily using the 'blob' attribute. Here is the code:
import argparse
import os
from PIL import Image
import pptx
from pptx.enum.shapes import MSO_SHAPE_TYPE
from pptx import Presentation
from mdutils.mdutils import MdUtils
from mdutils import Html
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('ppt_name', type=str, help='add the name of the PowerPoint file (NOTE: the folder must be in the same directory as the program file)')
    args = parser.parse_args()
    pptx_name = args.ppt_name
    pptx_name_formatted = pptx_name.split('.')[0]
    prs = Presentation(pptx_name)
    path = '{}_converted'.format(pptx_name_formatted)
    if not os.path.exists(path):
        os.mkdir(path)
    images_folder = '{}_images'.format(pptx_name_formatted)
    images_path = os.path.join(path, images_folder)
    if not os.path.exists(images_path):
        os.mkdir(images_path)
    ppt_dict = {} #Keys: slide numbers, values: slide content
    texts = []
    slide_count = 0
    picture_count = 0
    for slide in prs.slides:
        texts = []
        slide_count += 1
        for shape in slide.shapes:
            if shape.has_text_frame:
                if '\n' in shape.text:
                    splitted = shape.text.split('\n')
                    for word in splitted:
                        if word != '':
                            texts.append(word)
                elif shape.text == '':
                    continue
                else:
                    texts.append(shape.text)
            elif shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                with open('{}/image{}_slide{}.png'.format(images_path, picture_count, slide_count), 'wb') as f:
                    f.write(shape.image.blob)
                picture_count += 1
        ppt_dict[slide_count] = texts
    ppt_content = ''
    for k,v in ppt_dict.items():
        ppt_content = ppt_content + ' - Slide number {}\n'.format(k)
        for a in v:
            ppt_content = ppt_content + '\t - {}\n'.format(a)
    mdFile = MdUtils(file_name='{}/{}'.format(path,path)) #second argument isn't path, it just shares the path name.
    mdFile.write(ppt_content)
    mdFile.create_md_file()
if __name__ == "__main__":
    main()
The problem arises when the picture is of shape type 'auto shape'. I tried a lot of approaches, but to no avail. When I run the following code for a shape that I know is a picture:
if shape.shape_type == MSO_SHAPE_TYPE.AUTO_SHAPE:
    print(shape.auto_shape_type)
    print(shape.fill.type)
# in the real code this is indented further because it's inside a for loop
It outputs RECTANGLE for shape.auto_shape_type
and PICTURE for shape.fill.type
What I want now is to save the picture (maybe by writing out the binary bytestream of the image). Can someone help?

The "link" to the image (part, having the blob) is in the fill definition. Using that you can get to the image.
Print out the XML around the fill definition with shape.fill._xPr.xml. That shows you what you need to navigate to. You are looking for a relationship id: a string like "rId9", where the number will vary, probably on an element near something like "blipFill". The image is used as the "fill" of the shape, which is why the reference lives there.
Then get the slide part with something like slide._part and use its .related_parts "dict" to look up the image part by that relationship id (the string like "rId9").
image_part = slide._part.related_parts["rId9"]
The ImagePart implementation is here:
https://github.com/scanny/python-pptx/blob/master/pptx/parts/image.py#L21
and it gives access to the image and a lot of details about it as well.
You'll have to retrieve the "rId9"-like string using lxml calls, something roughly like:
rIds = shape.fill._xPr.xpath(".//@r:embed")
rId = rIds[0]
You'll need to do a little research on XPath to work out the right expression, based on the XML you printed in the earlier step. There's a lot out there on XPath, including here on SO; this is one resource to get started: http://www.rpbourret.com/xml/XPathIn5.htm
If you can't work it out, post the XML you printed out and we can get you to the next step.
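To make the XPath step more concrete, here is a minimal sketch that uses only the standard library on a hand-written XML fragment. The fragment is an assumption about what the printed XML might look like (a blipFill containing a blip with an embedded relationship id); the real output of shape.fill._xPr.xml may differ, and the helper name find_embed_rids is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by DrawingML and by OOXML relationships.
NS_A = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS_R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hypothetical sample standing in for the XML printed from the shape.
SAMPLE_XML = f"""
<p:spPr xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main"
        xmlns:a="{NS_A}" xmlns:r="{NS_R}">
  <a:blipFill>
    <a:blip r:embed="rId9"/>
    <a:stretch><a:fillRect/></a:stretch>
  </a:blipFill>
</p:spPr>
"""

def find_embed_rids(xml_text):
    """Return the r:embed value of every a:blip element in the fragment."""
    root = ET.fromstring(xml_text)
    rids = []
    for blip in root.iter(f"{{{NS_A}}}blip"):
        rid = blip.get(f"{{{NS_R}}}embed")  # namespaced attribute lookup
        if rid is not None:
            rids.append(rid)
    return rids

print(find_embed_rids(SAMPLE_XML))
```

The string this returns is the key you would then look up in slide._part.related_parts, as described above.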

Here is my approach, thanks to scanny.
for slide in prs.slides:
    slide_count += 1
    slide_parts = list(slide._part.related_parts.keys())
    for part in slide_parts:
        image_part = slide._part.related_parts[part]
        if type(image_part) == pptx.parts.image.ImagePart or pptx.opc.package.Part:
            file_startswith = image_part.blob[0:1]
            if file_startswith == b'\x89' or file_startswith == b'\xff' or file_startswith == b'\x47':
                with open('{}/image{}_slide{}.png'.format(images_path, picture_count, slide_count), 'wb') as f:
                    f.write(image_part.blob)
                picture_count += 1
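The inner if condition, which checks for PNG, JPEG or GIF, is there because a pptx.opc.package.Part isn't always an image.
Actually, since I'm checking the beginning of image_part.blob, I don't think I need the outer check, if type(image_part) == pptx.parts.image.ImagePart or pptx.opc.package.Part:. As written, that condition is always true anyway, because the right-hand side of the or is a class object, which is truthy, so the first-byte check is what really does the filtering.
But as long as it's working...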
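A single leading byte is a fairly loose test, and the loop above saves every match with a .png extension regardless of its real format. As a rough sketch (not part of python-pptx; the helper name guess_image_extension is made up here), the blob can be matched against the full PNG, JPEG and GIF signatures using only the standard library:

```python
def guess_image_extension(blob):
    """Return 'png', 'jpg' or 'gif' based on the leading bytes, else None."""
    if blob.startswith(b"\x89PNG\r\n\x1a\n"):      # 8-byte PNG signature
        return "png"
    if blob.startswith(b"\xff\xd8\xff"):           # JPEG start-of-image marker
        return "jpg"
    if blob.startswith((b"GIF87a", b"GIF89a")):    # the two GIF versions
        return "gif"
    return None

print(guess_image_extension(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(guess_image_extension(b"<?xml version='1.0'?>"))
```

In the loop above, the returned extension could replace the hard-coded .png in the output filename, and a None result could be used to skip parts that are not images.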

Related

Highlight text content in pdf files using python and save a screenshot

I have a list of pdf files and I need to highlight specific text on each page of these files and save a snapshot for each of the text instances.
So far I am able to highlight the text and save the entire page of a PDF file as a snapshot. But I want to find the position of the highlighted text and take a zoomed-in snapshot, which will be more detailed than the full-page snapshot.
I'm pretty sure there must be a solution to this problem. I am new to Python and hence I am not able to find it. I would be really grateful if someone can help me out with this.
I have tried using the PyPDF2 and PyMuPDF libraries but I couldn't figure out the solution. I also tried highlighting by providing coordinates, which works, but I couldn't find a way to get these coordinates as output.
[Image: sample snapshot from the code]
#import PyPDF2
import os
import fitz
from wand.image import Image
import csv
#import re
#from pdf2image import convert_from_path
check = r'C:\Users\Pradyumna.M\Desktop\Pradyumna\Automation\Intel Bytes\Create Source Docs\Sample Check 8 Apr 2019'
dir1 = check + '\\Source Docs\\'
dir2 = check + '\\Output\\'
dir = [dir1, dir2]
for x in dir:
    try:
        os.mkdir(x)
    except FileExistsError:
        print("Directory ", x, " already exists")
### READ PDF FILE
with open('upload1.csv', newline='') as myfile:
    reader = csv.reader(myfile)
    for row in reader:
        rowarray = '; '.join(row)
        src = rowarray.split("; ")
        file = check + '\\' + src[4] + '.pdf'
        print(file)
        #pdfFileObj = open(file,'rb')
        #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        #print("Total number of pages: " + str(pdfReader.numPages))
        doc = fitz.open(file)
        print(src[5])
        for i in range(int(src[5])-1, int(src[5])):
            i = int(i)
            page = doc[i]
            print("Processing page: " + str(i))
            text = src[3]
            #SEARCH TEXT
            print("Searching: " + text)
            text_instances = page.searchFor(text)
            for inst in text_instances:
                highlight = page.addHighlightAnnot(inst)
        file1 = check + '\\Output\\' + src[4] + '_output.pdf'
        print(file1)
        doc.save(file1, garbage=4, deflate=True, clean=True)
        ### Screenshot
        with(Image(filename=file1, resolution=150)) as source:
            images = source.sequence
            newfilename = check + "\\Source Docs\\" + src[0] + '.jpeg'
            Image(images[i]).save(filename=newfilename)
        print("Screenshot of " + src[0] + " saved")
"couldn't find a way to get these coordinates as output"
You can get the coordinates out like this:
for inst in text_instances:
    print(inst)
Each inst is a fitz.Rect object, which holds the top-left and bottom-right coordinates of the piece of text that was found. All the information is available in the docs.
I managed to highlight points and also save a cropped region using the following snippet of code. I am using python 3.7.1 and my output for fitz.version is ('1.14.13', '1.14.0', '20190407064320').
import fitz
doc = fitz.open("foo.pdf")
inst_counter = 0
for pi in range(doc.pageCount):
    page = doc[pi]
    text = "hello"
    text_instances = page.searchFor(text)
    five_percent_height = (page.rect.br.y - page.rect.tl.y)*0.05
    for inst in text_instances:
        inst_counter += 1
        highlight = page.addHighlightAnnot(inst)
        # define a suitable cropping box which spans the whole page
        # and adds padding around the highlighted text
        tl_pt = fitz.Point(page.rect.tl.x, max(page.rect.tl.y, inst.tl.y - five_percent_height))
        br_pt = fitz.Point(page.rect.br.x, min(page.rect.br.y, inst.br.y + five_percent_height))
        hl_clip = fitz.Rect(tl_pt, br_pt)
        zoom_mat = fitz.Matrix(2, 2)
        pix = page.getPixmap(matrix=zoom_mat, clip = hl_clip)
        pix.writePNG(f"pg{pi}-hl{inst_counter}.png")
doc.close()
I tested this on a sample PDF that I peppered with "hello":
Some of the outputs from the script:
I composed the solution out of the following pages of the documentation:
Tutorial page to get introduced into the library
page.searchFor to figure out the return type of the searchFor method
fitz.Rect to understand what the returned objects from page.searchFor are
Collection of Recipes page (called faq in the URL) to figure out how to crop and save part of a pdf page
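The cropping-box arithmetic in the snippet above can be looked at on its own, without PyMuPDF. The following is a small sketch under the assumption that rectangles are plain (x0, y0, x1, y1) tuples with y growing downwards; the function name padded_clip and the sample numbers are made up for illustration.

```python
def padded_clip(page_rect, hit_rect, pad_fraction=0.05):
    """Full-width band around hit_rect, padded vertically and clamped to the page.

    Rectangles are (x0, y0, x1, y1) tuples with y growing downwards.
    """
    px0, py0, px1, py1 = page_rect
    _, hy0, _, hy1 = hit_rect
    pad = (py1 - py0) * pad_fraction      # e.g. five percent of the page height
    top = max(py0, hy0 - pad)             # never above the top of the page
    bottom = min(py1, hy1 + pad)          # never below the bottom of the page
    return (px0, top, px1, bottom)

# A hit near the top of a 600 x 800 page: the upper padding gets clamped.
print(padded_clip((0, 0, 600, 800), (100, 10, 200, 25)))
# A hit in the middle of the page: padding is applied on both sides.
print(padded_clip((0, 0, 600, 800), (100, 400, 200, 420)))
```

The band deliberately keeps the full page width, as in the answer, so the surrounding line of text stays readable in the snapshot.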

How do I convert multiple pictures, in a folder with ITK color map?

Every time I run my program in the terminal it prints out:
thumb0496.jpg is not converted
{} is not converted
Whatever I do, it never works. I am new to Python and have installed it via Anaconda along with OpenCV, pip and ITK. I have only been doing this for 4 days and am stuck. Python is also my first language. Why is my code not working?
In case this code looks familiar, it is: I combined elements from another post, which unfortunately I cannot find again. The code was worse before, but I (somehow) fixed it. It is just this (new) piece that I can't fix on my own!
import cv2
import sys
import itk
import os,glob
from os import listdir,makedirs
from os.path import isfile,join
path = '/Users/admin/Desktop/ff'
dstpath = '/Users/admin/Desktop/test'
PixelType = itk.UC
Dimension = 2
ImageType = itk.Image[PixelType, Dimension]
RGBPixelType = itk.RGBPixel[PixelType]
RGBImageType = itk.Image[RGBPixelType, Dimension]
ColormapType = itk.CustomColormapFunction[PixelType, RGBPixelType]
colormap = ColormapType.New()
ColormapFilterType = itk.ScalarToRGBColormapImageFilter[ImageType,RGBImageType]
colormapFilter1 = ColormapFilterType.New()
colormapFilter1.SetInput(reader.GetOutput())
colormapFilter1.SetColormap(colormap)
WriterType = itk.ImageFileWriter[RGBImageType]
writer = WriterType.New()
writer.SetFileName(dstpath)
writer.SetInput(colormapFilter1.GetOutput())
try:
    makedirs(dstpath)
except:
    print ("Directory already exist, images will be written in same folder")
files = [f for f in listdir(path) if isfile(join(path,f))]
for image in files:
    try:
        reader = ReaderType(os.path.join(path,image))
        map = ColormapFilterType(reader, PixelType, RGBImageType, ImageType)
        dstPath = join(dstpath,image)
        cv2.imwrite(dstPath,map)
    except:
        print ("{} is not converted".format(image))
for fil in glob.glob("*.jpg"):
    try:
        img = ReaderType(os.path.join(path,fil))
        map_imag = ColormapType(img, PixelType, RGBImageType,ImageType)
        cv2.imwrite(os.path.join(dstpath,fil),map_image)
    except:
        print('{} is not converted')
Why don't you start from a working example, and gradually change it to suit your needs? Examples can be found in the quick-start guide and this blog post.
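While adapting a working example, it may also help to stop the bare except: clauses from hiding the real error. In the posted code, ReaderType is never defined, for instance, but the loop only ever reports "is not converted". The sketch below is stdlib-only and uses a hypothetical broken_convert function as a stand-in for the ITK pipeline; it only shows how to surface the exception type and message.

```python
def convert_all(names, convert):
    """Apply convert to each name; collect (name, error message) for failures."""
    failures = []
    for name in names:
        try:
            convert(name)
        except Exception as exc:
            message = f"{type(exc).__name__}: {exc}"
            failures.append((name, message))
            print(f"{name} is not converted: {message}")
    return failures

def broken_convert(name):
    # Stand-in for the real pipeline; it uses an undefined name on purpose.
    return UndefinedReader(name)

failures = convert_all(["thumb0496.jpg"], broken_convert)
```

With the error type printed next to the filename, it becomes clear whether the failure is in reading, in the colormap step, or in writing.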

Add page title, img2pdf

I've recently found this (wonderful) Python software for converting multiple images into a single PDF, img2pdf. After creating the first PDF I realized that the pages have no titles, so it's difficult to identify the original image for each page (there are 400 of them). Does anyone know how I can add a page title?
Thanks in advance.
I looked for the same thing but ended up writing a Python program to solve it. I don't know if it helps you, but here is a solution nonetheless.
In Python I used PIL.Image and ImageDraw to go through all the images and draw the filename onto each of them. After that I used img2pdf as a Python library to generate the PDF.
It must be run in the same folder as the images.
import os
import traceback
import img2pdf
from PIL import Image, ImageDraw, ImageFont, ExifTags
# Enter the path to the font you want, 'fc-list' on ubuntu will get a list of fonts you can use.
#image_text_font = ImageFont.truetype('/Library/Fonts/Arial.ttf', 15)
image_text_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf", 32)
# Tags the images with 'file name' in the upper left corner
def tag_images():
    for file in os.listdir('.'):
        if file.endswith(".jpg") and str(file+"_tagged.jpg") not in os.listdir('.') and not file.endswith("_tagged.jpg"):
            one_image = check_and_adjust_rotation(Image.open(file))
            one_image_draw = ImageDraw.Draw(one_image)
            # Add textbox to image
            size = one_image_draw.textsize(file, font=image_text_font)
            offset = image_text_font.getoffset(file)
            one_image_draw.rectangle((10, 10, 10 + size[0] + offset[0], 10 + size[1] + offset[1]), fill='white', outline='black')
            # Add text to image
            one_image_draw.text((10,10), file, font=image_text_font, fill='black')
            # Save tagged image
            one_image.save(file + "_tagged.jpg")
            print(f'Tagged and saved "{file}_tagged.jpg".')
# Generate the PDF
def generate_pdf_from_multiple_images():
    with open("output.pdf", "wb") as f:
        f.write(img2pdf.convert([image_file for image_file in os.listdir('.') if image_file.endswith("_tagged.jpg")]))
# Use exif information about rotation to apply proper rotation to the image
def check_and_adjust_rotation(image):
    try :
        for orientation in ExifTags.TAGS.keys() :
            if ExifTags.TAGS[orientation]=='Orientation' : break
        exif=dict(image._getexif().items())
        print(exif[orientation])
        if exif[orientation] == 3 :
            image=image.rotate(180, expand=True)
        elif exif[orientation] == 6 :
            image=image.rotate(270, expand=True)
        elif exif[orientation] == 8 :
            image=image.rotate(90, expand=True)
    except:
        # images without EXIF data end up here and are returned unrotated
        traceback.print_exc()
    return image
def main():
    tag_images()
    generate_pdf_from_multiple_images()
if __name__ == '__main__':
    main()
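The file-selection rule used in tag_images can be tried out without any images. The sketch below is stdlib-only; the function name files_to_tag is made up here, and the sorting is an addition rather than something the answer does (os.listdir returns names in arbitrary order, so sorting makes the order predictable).

```python
def files_to_tag(names):
    """Return the .jpg names that still need a tagged copy, in sorted order."""
    existing = set(names)
    return sorted(
        name for name in names
        if name.endswith(".jpg")
        and not name.endswith("_tagged.jpg")       # skip the tagged copies themselves
        and name + "_tagged.jpg" not in existing   # skip files already tagged
    )

print(files_to_tag(["b.jpg", "a.jpg", "a.jpg_tagged.jpg", "notes.txt"]))
```

Applying the same sorted() call to the list passed to img2pdf.convert would be one way to keep the PDF pages in filename order.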

Copying .docx and preserving images

I am trying to copy the elements of one .docx file into another. The text part is easy; the images are where it gets tricky.
Attaching an image to explain the structure of the doc: just some text and one image.
from docx import Document
import io
doc = Document('/Users/neha/Desktop/testing.docx')
new_doc = Document()
for elem in doc.element.body:
    new_doc.element.body.append(elem)
new_doc.save('/Users/neha/Desktop/out.docx')
This gets me the whole structure of the doc in new_doc, but the image is blank. Image below:
The good thing is that the blank image is in the right place, so I thought of getting the byte-level data from the original image and inserting it into the new doc. Here is how I extended the above code:
from docx import Document
import io
doc = Document('/Users/neha/Desktop/testing.docx')
new_doc = Document()
for elem in doc.element.body:
    new_doc.element.body.append(elem)
im = doc.inline_shapes[0]
blip = im._inline.graphic.graphicData.pic.blipFill.blip
rId = blip.embed
doc_part = doc.part
image_part = doc_part.related_parts[rId]
bytes = image_part._blob #Here I get the byte level data for the image
im2 = new_doc.inline_shapes[0]
blip2 = im2._inline.graphic.graphicData.pic.blipFill.blip
rId2 = blip2.embed
document_part2 = new_doc.part
document_part2.related_parts[rId2]._blob = bytes
new_doc.save('/Users/neha/Desktop/out.docx')
But the image still shows empty in the new_doc. What should I do from here?
I figured out a solution a couple of days back. The text loses its formatting this way, but the images are placed correctly.
The idea is: for each paragraph in the source doc, if there is text, I write it to the dest doc. If there is an inline image, I add a unique identifier at that place in the dest doc (refer here to see how these identifiers and contexts work in docxtpl). These identifiers and docxtpl proved to be particularly useful here. Using those unique identifiers I then create a 'context' (as shown below), which is basically a dict mapping each identifier to its InlineImage, and finally I render this context.
Below is my code (Apologies for the unnecessary indentation, I copied it directly from my text editor, and shift+tab doesn't work here :P)
from docxtpl import DocxTemplate, InlineImage
from docx import Document
import io
import xml.etree.ElementTree as ET
# source_path, dest_path and self.img_path are defined elsewhere in the original program
dest = DocxTemplate()
source = Document(source_path)
context = {}
ims = [im for im in source.inline_shapes]
im_addresses = []
im_streams = []
count = 0
for im in ims:
    blip = im._inline.graphic.graphicData.pic.blipFill.blip
    rId = blip.embed
    doc_part = source.part
    image_part = doc_part.related_parts[rId]
    byte_data = image_part._blob
    image_stream = io.BytesIO(byte_data)
    im_streams.append(image_stream)
    image_name = self.img_path+"img_"+"_"+str(count)+".jpeg"
    with open(image_name, "wb") as fh:
        fh.write(byte_data)
    im_addresses.append(image_name)
    count += 1
paras = source.paragraphs
im_idx = 0
for para in paras:
    p = dest.add_paragraph()
    r = p.add_run()
    if(para.text):
        r.add_text(para.text)
    root = ET.fromstring(para._p.xml)
    namespace = {'wp':"http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"}
    inlines = root.findall('.//wp:inline',namespace)
    if(len(inlines) > 0):
        uid = "img_"+str(im_idx)
        r.add_text("{{ " + uid + " }}")
        context[uid] = InlineImage(dest,im_addresses[im_idx])
        im_idx += 1
try:
    dest.render(context)
except Exception as e:
    print(e)
dest.save(dest_path)
PS: If a paragraph has two images, this code will only handle the first one. The following part would have to change:
if(len(inlines) > 0):
    uid = "img_"+str(im_idx)
    r.add_text("{{ " + uid + " }}")
    context[uid] = InlineImage(dest,im_addresses[im_idx])
    im_idx += 1
You would have to add a for loop over inlines inside the if statement. I didn't need that, because my images were usually big enough that they always ended up in separate paragraphs. Just a side note for anyone who may need it.
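As a rough, stdlib-only sketch of that change (the helper name placeholders_for_paragraph and the sample XML are made up for illustration), one identifier can be generated per wp:inline element found in a paragraph:

```python
import xml.etree.ElementTree as ET

WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"

def placeholders_for_paragraph(paragraph_xml, next_index):
    """Return one 'img_N' id per wp:inline element, plus the updated index."""
    root = ET.fromstring(paragraph_xml)
    inlines = root.findall(".//wp:inline", {"wp": WP_NS})
    uids = [f"img_{next_index + offset}" for offset in range(len(inlines))]
    return uids, next_index + len(inlines)

# Hypothetical paragraph containing two inline images.
SAMPLE = f'<p xmlns:wp="{WP_NS}"><r><wp:inline/></r><r><wp:inline/></r></p>'
print(placeholders_for_paragraph(SAMPLE, 3))
```

In the loop above, each returned id would get its own "{{ ... }}" text and its own InlineImage entry in the context, and the returned index would replace im_idx.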
Cheers!
You could try:
Extracting the images from the first document by unzipping the .docx file (per How can I search a word in a Word 2007 .docx file?)
Save those images to the file system (as foo.png, for instance)
Generate the new .docx file with Python and add the .png file using document.add_picture('foo.png').
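The unzipping step needs nothing beyond the standard library, because a .docx file is a zip archive with its images stored under word/media/. Here is a minimal sketch; the function name extract_docx_media is made up, and the in-memory archive only stands in for a real document so that the example runs on its own.

```python
import io
import zipfile

def extract_docx_media(docx_file):
    """Return {filename: bytes} for every entry under word/media/ in a .docx."""
    images = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                images[name.rsplit("/", 1)[-1]] = archive.read(name)
    return images

# Build a tiny stand-in archive in memory so the sketch runs without a real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr("word/media/image1.png", b"\x89PNG fake bytes")
buffer.seek(0)

media = extract_docx_media(buffer)
print(sorted(media))
```

With a real document, a path such as the source .docx file can be passed instead of the buffer, and each value can be written to disk before calling document.add_picture on the saved file.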
This problem is solved by this package https://docxtpl.readthedocs.io/en/latest/

Python, ignore files with no Exif data

I am trying to do a mass extraction of gps exif data, my code below:
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS
def get_exif_data(image):
    exif_data = {}
    info = image._getexif()
    if info:
        for tag, value in info.items():
            decoded = TAGS.get(tag, tag)
            if decoded == "GPSInfo":
                gps_data = {}
                for t in value:
                    sub_decoded = GPSTAGS.get(t, t)
                    gps_data[sub_decoded] = value[t]
                exif_data[decoded] = gps_data
            else:
                exif_data[decoded] = value
    return exif_data
def _get_if_exist(data, key):
    if key in data:
        return data[key]
    else:
        pass
def get_lat_lon(exif_data):
    gps_info = exif_data["GPSInfo"]
    lat = None
    lon = None
    if "GPSInfo" in exif_data:
        gps_info = exif_data["GPSInfo"]
        gps_latitude = _get_if_exist(gps_info, "GPSLatitude")
        gps_latitude_ref = _get_if_exist(gps_info, "GPSLatitudeRef")
        gps_longitude = _get_if_exist(gps_info, "GPSLongitude")
        gps_longitude_ref = _get_if_exist(gps_info, "GPSLongitudeRef")
        if gps_latitude and gps_latitude_ref and gps_longitude and gps_longitude_ref:
            lat = _convert_to_degrees(gps_latitude)
            if gps_latitude_ref != "N":
                lat = 0 - lat
            lon = _convert_to_degrees(gps_longitude)
            if gps_longitude_ref != "E":
                lon = 0 - lon
    return lat, lon
Code source
Which is run like:
if __name__ == "__main__":
    image = Image.open("photo directory")
    exif_data = get_exif_data(image)
    print(get_lat_lon(exif_data))
This works fine for one photo, so I've used glob to iterate over all photos in a file:
import glob
file_names = []
for name in glob.glob(photo directory):
    file_names.append(name)
for item in file_names:
    if __name__ == "__main__":
        image = Image.open(item)
        exif_data = get_exif_data(image)
        print(get_lat_lon(exif_data))
    else:
        pass
This works fine, as long as every file in the folder is a) an image and b) has GPS data. I have tried adding a pass in the _get_if_exist function as well as in my file iteration; however, neither seems to have had any impact, and I'm still receiving KeyError: 'GPSInfo'.
Any ideas on how I can ignore photos with no data or different file types?
A possible approach would be to write a small helper function that first checks whether the file is actually an image file and then checks whether the image contains EXIF data.
def is_metadata_image(filename):
    try:
        image = Image.open(filename)
        return 'exif' in image.info
    except OSError:
        return False
I found that PIL does not work every time with .png files that do contain EXIF information when using _getexif(). So instead I check for the key exif in the info dictionary of an image.
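I've tried this source code.
You simply need to remove
    gps_info = exif_data["GPSInfo"]
from the first line of the get_lat_lon(exif_data) function. That line runs before the "GPSInfo" in exif_data check, so it raises the KeyError for any photo without GPS data. With it removed, the code works well for me.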
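Put together, a guarded version of the lookup might look like the sketch below. It works on plain dicts so it can be run without any photos. Note that convert_to_degrees is a hypothetical stand-in for the _convert_to_degrees helper that the question calls but does not show, and it assumes each coordinate is a (degrees, minutes, seconds) triple of numbers that float() accepts.

```python
def convert_to_degrees(dms):
    """Convert (degrees, minutes, seconds) numbers to decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    return degrees + minutes / 60.0 + seconds / 3600.0

def get_lat_lon(exif_data):
    """Return (lat, lon), or (None, None) when GPS data is missing or incomplete."""
    gps_info = exif_data.get("GPSInfo")   # .get avoids the KeyError
    if not gps_info:
        return None, None
    lat, lat_ref = gps_info.get("GPSLatitude"), gps_info.get("GPSLatitudeRef")
    lon, lon_ref = gps_info.get("GPSLongitude"), gps_info.get("GPSLongitudeRef")
    if not (lat and lat_ref and lon and lon_ref):
        return None, None
    lat_deg = convert_to_degrees(lat)
    lon_deg = convert_to_degrees(lon)
    if lat_ref != "N":
        lat_deg = -lat_deg
    if lon_ref != "E":
        lon_deg = -lon_deg
    return lat_deg, lon_deg

print(get_lat_lon({}))
print(get_lat_lon({"GPSInfo": {"GPSLatitude": (51, 30, 0), "GPSLatitudeRef": "N",
                               "GPSLongitude": (0, 7, 30), "GPSLongitudeRef": "W"}}))
```

In the glob loop, photos that return (None, None) can then simply be skipped, and non-image files can be filtered out beforehand with a check like is_metadata_image.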
