UnrecognizedImageError - image insertion error - python-docx - python

I am trying to insert a WMF file into a .docx document using python-docx, which produces the following traceback:
Traceback (most recent call last):
  File "C:/Users/ADMIN/PycharmProjects/ppt-to-word/ppt_reader.py", line 79, in <module>
    read_ppt(path, file)
  File "C:/Users/ADMIN/PycharmProjects/ppt-to-word/ppt_reader.py", line 73, in read_ppt
    write_docx(ppt_data, False)
  File "C:/Users/ADMIN/PycharmProjects/ppt-to-word/ppt_reader.py", line 31, in write_docx
    document.add_picture(slide_data.get('picture_location'), width=Inches(5.0))
  File "C:\Python34\lib\site-packages\docx\document.py", line 72, in add_picture
    return run.add_picture(image_path_or_stream, width, height)
  File "C:\Python34\lib\site-packages\docx\text\run.py", line 62, in add_picture
    inline = self.part.new_pic_inline(image_path_or_stream, width, height)
  File "C:\Python34\lib\site-packages\docx\parts\story.py", line 56, in new_pic_inline
    rId, image = self.get_or_add_image(image_descriptor)
  File "C:\Python34\lib\site-packages\docx\parts\story.py", line 29, in get_or_add_image
    image_part = self._package.get_or_add_image_part(image_descriptor)
  File "C:\Python34\lib\site-packages\docx\package.py", line 31, in get_or_add_image_part
    return self.image_parts.get_or_add_image_part(image_descriptor)
  File "C:\Python34\lib\site-packages\docx\package.py", line 74, in get_or_add_image_part
    image = Image.from_file(image_descriptor)
  File "C:\Python34\lib\site-packages\docx\image\image.py", line 55, in from_file
    return cls._from_stream(stream, blob, filename)
  File "C:\Python34\lib\site-packages\docx\image\image.py", line 176, in _from_stream
    image_header = _ImageHeaderFactory(stream)
  File "C:\Python34\lib\site-packages\docx\image\image.py", line 199, in _ImageHeaderFactory
    raise UnrecognizedImageError
docx.image.exceptions.UnrecognizedImageError
The image file is in .wmf format.
Any help or suggestion appreciated.

python-docx identifies the type of an image-file by "recognizing" its distinctive header. In this way it can distinguish JPEG from PNG, from TIFF, etc. This is much more reliable than mapping a filename extension and much more convenient than requiring the user to tell you the type. It's a pretty common approach.
This error indicates that python-docx is not finding a header it recognizes. Windows Metafile format (WMF) can be tricky this way: there's a lot of leeway in the proprietary spec and a lot of variation among file specimens in the field.
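To make the idea of "recognizing a header" concrete, here is a minimal sketch of signature sniffing. It is not python-docx's actual code; the names `SIGNATURES_SKETCH` and `sniff_image_type` are made up for illustration, and the WMF entry only covers the "placeable" header variant, which not every WMF file carries.

```python
# Illustrative only: hypothetical names, not the python-docx implementation.
# Each entry is (label, byte offset, magic bytes expected at that offset).
SIGNATURES_SKETCH = (
    ("png", 0, b"\x89PNG\r\n\x1a\n"),
    ("jpeg", 0, b"\xff\xd8"),
    ("gif", 0, b"GIF87a"),
    ("gif", 0, b"GIF89a"),
    ("tiff", 0, b"II*\x00"),  # little-endian TIFF
    ("tiff", 0, b"MM\x00*"),  # big-endian TIFF
    ("bmp", 0, b"BM"),
    # Placeable WMF header key; WMF files without it will not match here.
    ("wmf-placeable", 0, b"\xd7\xcd\xc6\x9a"),
)


def sniff_image_type(header):
    """Return a label for the first matching signature, or None if nothing matches."""
    for label, offset, magic in SIGNATURES_SKETCH:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None


if __name__ == "__main__":
    print(sniff_image_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
    print(sniff_image_type(b"plain text, not an image"))
```

A library that keeps a table like this simply raises an error when no entry matches, which is the situation the traceback describes.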
To fix this, I recommend you read the file with something that does recognize it (I would start with Pillow) and have it "convert" it into the same or another format, hopefully correcting the header in the process.
First I would try just reading it and saving it as WMF (or perhaps EMF if that's an option). This might be enough to do the trick. If you have to change to an intermediate format and then back, that could be lossy, but maybe better than nothing.
ImageMagick might be another good choice to try because it probably has better coverage than Pillow does.

Explanation
python-docx's image module recognizes the supported picture file formats from its SIGNATURES table.
Format
Take 1.jpg and use an image converter to convert it to different file formats.
Use magic to get the MIME type of each file.
File format   MIME type                  add_picture()
.jpg          image/jpeg                 √
.png          image/png                  √
.jfif         image/jpeg                 √
.exif         (not given)                √
.gif          image/gif                  √
.tiff         image/tiff                 √
.bmp          image/x-ms-bmp             √
.eps          application/postscript     ×
.hdr          application/octet-stream   ×
.ico          image/x-icon               ×
.svg          image/svg+xml              ×
.tga          image/x-tga                ×
.wbmp         application/octet-stream   ×
.webp         image/webp                 ×
How to solve
Plan A
Convert unsupported formats to a supported format such as .jpg.
Install
pip install pillow
Code
from pathlib import Path
from PIL import Image

def image_to_jpg(image_path):
    path = Path(image_path)
    # Convert anything without a supported suffix to a JPEG next to the original
    if path.suffix.lower() not in {'.jpg', '.png', '.jfif', '.exif', '.gif', '.tiff', '.bmp'}:
        jpg_image_path = f'{path.parent / path.stem}_result.jpg'
        Image.open(image_path).convert('RGB').save(jpg_image_path)
        return jpg_image_path
    return image_path

if __name__ == '__main__':
    from docx import Document
    document = Document()
    document.add_picture(image_to_jpg('1.jpg'))
    document.add_picture(image_to_jpg('1.webp'))
    document.save('test.docx')
Plan B
First, try adding the picture to Word manually. If that succeeds, Word supports the format. You can then extend the library: subclass BaseImageHeader, implement its from_stream() method, and add the format's signature to SIGNATURES.
Missing file suffix
Rename 1.jpg to 1 (no extension):
from docx import Document
document = Document()
document.add_picture('1')
document.save('test.docx')
It will show this
Use a file object instead:
from docx import Document
document = Document()
document.add_picture(open('1', mode='rb'))
document.save('test.docx')
Conclusion
import io
from pathlib import Path
import magic
from PIL import Image

def image_to_jpg(image_path_or_stream):
    f = io.BytesIO()
    if isinstance(image_path_or_stream, str):
        # A path: decide by file suffix
        path = Path(image_path_or_stream)
        if path.suffix in {'.jpg', '.png', '.jfif', '.exif', '.gif', '.tiff', '.bmp'}:
            f = open(image_path_or_stream, mode='rb')
        else:
            Image.open(image_path_or_stream).convert('RGB').save(f, format='JPEG')
    else:
        # A stream: decide by MIME type sniffed from the content
        buffer = image_path_or_stream.read()
        mime_type = magic.from_buffer(buffer, mime=True)
        if mime_type in {'image/jpeg', 'image/png', 'image/gif', 'image/tiff', 'image/x-ms-bmp'}:
            f = image_path_or_stream
        else:
            Image.open(io.BytesIO(buffer)).convert('RGB').save(f, format='JPEG')
    return f

if __name__ == '__main__':
    from docx import Document
    document = Document()
    document.add_picture(image_to_jpg('1.jpg'))
    document.add_picture(image_to_jpg('1.webp'))
    document.add_picture(image_to_jpg(open('1.jpg', mode='rb')))
    document.add_picture(image_to_jpg(open('1', mode='rb')))  # copy 1.webp and rename it to 1
    document.save('test.docx')

Related

Deep Learning model on M1 chip(TensorFlow and Keras) [duplicate]

I am trying to read a PNG file into a Python Flask application running in Docker and am getting an error that says
ValueError: Could not find a format to read the specified file in mode 'i'
I have uploaded a file using an HTML file and now I am trying to read it for further processing. I see that scipy.misc.imread is deprecated, and I am trying to replace it with imageio.imread.
if request.method=='POST':
    file = request.files['image']
    if not file:
        return render_template('index.html', label="No file")
    #img = misc.imread(file)
    img = imageio.imread(file)
I get this error:
  File "./appimclass.py", line 34, in make_prediction
    img = imageio.imread(file)
  File "/usr/local/lib/python3.6/site-packages/imageio/core/functions.py", line 221, in imread
    reader = read(uri, format, "i", **kwargs)
  File "/usr/local/lib/python3.6/site-packages/imageio/core/functions.py", line 139, in get_reader
    "Could not find a format to read the specified file " "in mode %r" % mode
This is a different case, but it may be helpful. I had an identical error in a different library (skimage), and the solution was to add an extra 'plugin' parameter, like so:
image = io.imread(filename,plugin='matplotlib')
Had the exact same problem recently, and the issue was a single corrupt file. Best is to use something like PIL to check for bad files.
import os
from PIL import Image

dir_path = "/path/"
for filename in os.listdir(dir_path):
    if filename.endswith('.jpg'):
        file_path = os.path.join(dir_path, filename)
        try:
            img = Image.open(file_path)  # open the image file
            img.verify()  # verify that it is, in fact, an image
        except (IOError, SyntaxError) as e:
            print('Bad file:', filename)
            # os.remove(file_path)  # (maybe)
I had this problem today, and found that if I closed the file before reading it into imageio the problem went away.
Error was:
  File "/home/vinny/pvenvs/chess/lib/python3.6/site-packages/imageio/core/functions.py", line 139, in get_reader
    "Could not find a format to read the specified file " "in mode %r" % mode
ValueError: Could not find a format to read the specified file in mode 'i'
Solution:
Put file.close() before images.append(imageio.imread(filename)), not after.
Add the option "pilmode":
imageio.imread(filename,pilmode="RGB")
It worked for me.
I encountered the same error, and at last, I found it was because the picture was damaged.
I had accidentally saved some images as PDF, which caused the error. It was resolved after deleting those incompatible files.
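Several of the answers above come down to a file whose content does not match its extension. As a rough sketch (standard library only, and only checking the first few bytes, so it will not catch truncated or otherwise damaged JPEGs), something like the following can flag such files before they reach imageio. The function name `find_mislabeled_jpegs` is made up for this example.

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8"  # every JPEG stream starts with these two bytes
PDF_MAGIC = b"%PDF-"      # every PDF starts with this marker


def find_mislabeled_jpegs(directory):
    """Return (file name, reason) pairs for .jpg/.jpeg files that do not start like a JPEG."""
    suspects = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        with path.open("rb") as handle:
            head = handle.read(8)
        if head.startswith(JPEG_MAGIC):
            continue
        reason = "looks like a PDF" if head.startswith(PDF_MAGIC) else "unrecognized header"
        suspects.append((path.name, reason))
    return suspects


if __name__ == "__main__":
    for name, reason in find_mislabeled_jpegs("."):
        print("Suspicious file:", name, "-", reason)
```

A header check like this is cheap, so it can be run over a whole directory first, with PIL's verify() reserved for the files that pass.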

Python PIL can't open PDFs for some reason

So my program is able to open PNGs but not PDFs, so I made this just to test, and it still isn't able to open even a simple PDF. And I don't know why.
from PIL import Image
with Image.open(r"Adams, K\a.pdf") as file:
    print file
Traceback (most recent call last):
  File "C:\Users\Hayden\Desktop\Scans\test4.py", line 3, in <module>
    with Image.open(r"Adams, K\a.pdf") as file:
  File "C:\Python27\lib\site-packages\PIL\Image.py", line 2590, in open
    % (filename if filename else fp))
IOError: cannot identify image file 'Adams, K\\a.pdf'
After trying PyPDF2 as suggested (Thanks for the link by the way), I am getting this error with my code.
import PyPDF2
pdf_file= open(r"Adams, K (6).pdf", "rb")
read_pdf= PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
print number_of_pages
Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
Following this article: https://www.geeksforgeeks.org/convert-pdf-to-image-using-python/ you can use the pdf2image package to convert the pdf to a PIL object.
This should solve your problem:
from pdf2image import convert_from_path
fname = r"Adams, K\a.pdf"
pil_image_lst = convert_from_path(fname) # This returns a list even for a 1 page pdf
pil_image = pil_image_lst[0]
I just tried this out with a one page pdf.
As pointed out by @Kevin (see comment below), PIL has support for writing PDFs but not for reading them.
To read a PDF you will need some other library. Here is a tutorial for handling PDFs with PyPDF2:
https://pythonhosted.org/PyPDF2/?utm_source=recordnotfound.com

How to save image using Python

I have an image file decoded by base64.
Now, I want to save the image file to the specified directory.
the directory is described as image_dir_path
image_dir_path = '/images/store/'
image_file = base64.b64decode(image_file)
How can I save the image_file to image_dir_path?
I tried shutil.copy(image_file, image_dir_path), but it doesn't work for my case.
I'm sorry, I couldn't find an existing question like this.
You can write any content to a file with a file object and its write method. As an example, let's grab some base64 encoded data from the web:
import base64
import urllib.request
decoded = base64.b64decode(urllib.request.urlopen("http://git.io/vYT4p").read())
# Open in binary mode: the decoded data is bytes, not text
with open('/tmp/31558315.png', 'wb') as handle:
    handle.write(decoded)
You should be able to open the file under /tmp/31558315.png as a regular image.
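Applied to the question's own variables, the same idea might look like the sketch below. The helper name `save_base64_image` and the file name argument are assumptions for illustration (the question does not say what the file should be called); the key points are joining the directory and file name into one path and writing the decoded bytes in binary mode.

```python
import base64
from pathlib import Path


def save_base64_image(encoded, directory, filename):
    """Decode base64 data and write it as a binary file inside the given directory."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    target = target_dir / filename
    target.write_bytes(base64.b64decode(encoded))  # write_bytes opens the file in binary mode
    return target
```

For example, `save_base64_image(image_file, image_dir_path, 'picture.png')` would be called with the still-encoded string, before the b64decode step shown in the question. shutil.copy does not fit here because it expects the path of an existing file, not the file's contents.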

Issues with excel file. XLRDError: Unsupported format, or corrupt file: What kind of file is this?

I have a bit of code that works with an xls file. It works for everything I've thrown at it except this one file and I don't know how to properly identify what this one file is. I get the file off of a website I am navigating with Selenium. This particular spreadsheet always downloads as a file type that causes this error.
The full error is:
Traceback (most recent call last):
File "/Users/Meir/Documents/PYTHON/IFG User Update/code/ifg_TPA_update_excel.py", line 44, in <module>
rb = open_workbook((os.path.expanduser("~/Documents/PYTHON/Selenium test/TPA_Example.xls")),formatting_info=True)
File "/usr/local/lib/python2.7/site-packages/xlrd/__init__.py", line 443, in open_workbook
ragged_rows=ragged_rows,
File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 94, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 1262, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 1256, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe<\x00S\x00T\x00'
The file I am trying to open displays as an xls file in my finder. However, when I open it, it does not open with the file name as the header but rather displays "Workbook1". When I hit save, it opens the save menu as if I had clicked save as, and defaults to "Workbook1.xlsx". I tried changing my code to open it as an xlsx file, but then it errors out saying it cannot find the file. Whenever I try googling it, I don't know how to phrase it to get a relevant answer.
When I contacted the website's support team to ask what kind of file the TPA bulk ops sheet is, they replied:
The TPA bulk ops is an older version than the rest of the bulk ops, it's due to be rebuilt some time later this year. When downloading the file your best bet is to do a Save As and save it as an older version of .xls, I usually select Microsoft Excel 5.0/95 Workbook, and also format it as text. Formatted that way it should upload without issue.
Any ideas as to how I can open this right from Python?
Currently I am building each part as a separate code and I was going to combine them all together once I get it sorted out. The below is the section of code that will be opening the file and is experiencing the error.
My code:
#!/usr/bin/env python
## Import OS and Modules
import os
import csv
import xlrd
import xlwt
import xlutils
import csv
import collections
## Define Input File from IFG
ifg_user_file = "New_PCs_to_set_up_in_marketing_database_-_4-11-2013.csv"
## Import data
data = [row for row in csv.reader(open (os.path.expanduser("~/Downloads/" + ifg_user_file),'U'))]
## Find number of rows
row_count = sum(1 for row in data)
print row_count
## Set to turn off when reaching the end of data
end_of_data = False
from xlutils.copy import copy # http://pypi.python.org/pypi/xlutils
from xlrd import open_workbook # http://pypi.python.org/pypi/xlrd
from xlwt import easyxf # http://pypi.python.org/pypi/xlwt
##################################################################################
## THE ERROR OCCURS AT THE LINE BELOW
rb = open_workbook((os.path.expanduser("~/Documents/PYTHON/Selenium test/TPA_Example.xls")),formatting_info=True)
r_sheet = rb.sheet_by_index(0) # read only copy to introspect the file
EDIT: I tried to open it with codecs rather than open for diagnostics
rb=codecs.open((os.path.expanduser("~/Documents/PYTHON/Selenium test/TPA_Example.xls")), 'r', encoding='utf16');
print rb;
print rb.readline();
print rb.read(20);
It printed the following result:
<open file '/Users/Meir/Documents/PYTHON/Selenium test/TPA_Example.xls', mode 'rb' at 0x110fe51e0>
<STYLE>
.excel { BORDER-RIGHT: black 1px solid; BORDER-TOP: black 1px solid; BORDER-LEFT: black 1px so
It looks like it is an excel document then. Not sure how to proceed. Is there a universal open an excel document command?
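One diagnostic that may help here is to look at the leading bytes that xlrd quotes in its error message. The sketch below is only an illustration (the helper name `describe_leading_bytes` is made up); it compares the bytes against a few well-known file signatures using the standard library.

```python
import codecs

# The eight bytes quoted in the XLRDError message above
reported = b"\xff\xfe<\x00S\x00T\x00"


def describe_leading_bytes(raw):
    """Give a rough description of a file based on its first few bytes."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
        return "UTF-16 LE text starting with " + repr(text)
    if raw.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "OLE2 compound file (the container used by binary .xls)"
    if raw.startswith(b"PK\x03\x04"):
        return "ZIP archive (the container used by .xlsx)"
    return "unknown"


print(describe_leading_bytes(reported))
```

The reported bytes decode as text beginning with "<ST", which matches the <STYLE> line printed by the codecs experiment. That suggests the download is UTF-16 encoded markup saved with an .xls extension rather than a binary workbook.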

python create jpg file from raw data

I have this strange xml file which apparently contains jpeg image data:
<?xml version="1.0" encoding="UTF-8"?>
<AttachmentDocument xmlns="http://echa.europa.eu/schemas/iuclid5/20070330" documentReferencePK="ECB5-d18039fe-6fb0-44d6-be9e-d6ade38be543/0" encoding="0" fileSize="5788" fileTimestamp="2007-04-17T12:38:44Z" parentDocumentPK="ECB5-fb07efbf-ee93-4cdd-865b-49efa51cbd15/0" version="2007-03-19T14:13:29Z">
<modificationHistory>
<modification date="2007-05-10T09:00:00Z">
<comment>Created</comment>
<modificationBy>European Commision/Joint Research Centre/European Chemicals Bureau</modificationBy>
</modification>
</modificationHistory>
<ownershipProtection copyProtection="false" fractionalDocument="false" sealed="false"/>
<fileName>33952-38-4-V2.jpeg</fileName>
<fileMimetype>image/jpeg</fileMimetype>
<rawContent>
H4sIAAAAAAAAAO2XZ1AU65qAe5iBIQwgOCMZRkHJCIgEySBhyEEyIyDgMBKHLEFQBJEoIHBEQFQE
JUjOSo4iOQ+Ss2QkSZhZvLXn7j11726d3draH1vn7Xp+dH1fd/XzvV+//TZxlDgNnNNQRakCIBAA
gM4OgEgA5JQNVBRv6RrcQGLsBO+52WOQ3iJCwkgeLw+sCwaJ0lBDauipqCG9xUV5BZB29ndtvJw8
kTgvGyes531K4jigDJCTkUHJSMmhUCgFBTklDE4No6KCMdGfp4WzMXOwszGzsiK5hLiRlwQ4WVl5
...
</rawContent>
<MD5>0d80850b0c4085500f80e1430b90c70910d4110cc0d7</MD5>
</AttachmentDocument>
(The full version here)
And I can't read image out of it.
My attempt:
from PIL import Image
import StringIO
import base64
# I've eliminated all newlines and tabs to produce the data string
data="H4sIAAAAAAAAAO2XZ1AU65qAe5..."
im = Image.open(StringIO.StringIO(base64.b64decode(data)))
But I'm getting an error:
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 1980, in open
    raise IOError("cannot identify image file")
If you check the output of the base64 decode, you'll notice that it is a gzip file. Decompress it and you'll get the desired JPEG.
Comment stored in the image:
CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), default quality
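The two steps can be sketched with the standard library as follows. The function name `extract_jpeg` is made up for this example, and the sanity checks simply look at the leading bytes: gzip data starts with 1f 8b (which is why the base64 text begins with "H4sI"), and JPEG data starts with ff d8.

```python
import base64
import gzip


def extract_jpeg(raw_content):
    """Decode the base64 text, decompress the gzip payload, and return the JPEG bytes."""
    # Drop the newlines and tabs that the XML formatting introduced
    compressed = base64.b64decode("".join(raw_content.split()))
    if not compressed.startswith(b"\x1f\x8b"):
        raise ValueError("payload is not gzip-compressed")
    jpeg = gzip.decompress(compressed)
    if not jpeg.startswith(b"\xff\xd8"):
        raise ValueError("decompressed payload does not look like a JPEG")
    return jpeg
```

The returned bytes can be written to a file opened in binary mode, or wrapped in io.BytesIO and passed to Image.open.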
