python create jpg file from raw data - python

I have this strange xml file which apparently contains jpeg image data:
<?xml version="1.0" encoding="UTF-8"?>
<AttachmentDocument xmlns="http://echa.europa.eu/schemas/iuclid5/20070330" documentReferencePK="ECB5-d18039fe-6fb0-44d6-be9e-d6ade38be543/0" encoding="0" fileSize="5788" fileTimestamp="2007-04-17T12:38:44Z" parentDocumentPK="ECB5-fb07efbf-ee93-4cdd-865b-49efa51cbd15/0" version="2007-03-19T14:13:29Z">
<modificationHistory>
<modification date="2007-05-10T09:00:00Z">
<comment>Created</comment>
<modificationBy>European Commision/Joint Research Centre/European Chemicals Bureau</modificationBy>
</modification>
</modificationHistory>
<ownershipProtection copyProtection="false" fractionalDocument="false" sealed="false"/>
<fileName>33952-38-4-V2.jpeg</fileName>
<fileMimetype>image/jpeg</fileMimetype>
<rawContent>
H4sIAAAAAAAAAO2XZ1AU65qAe5iBIQwgOCMZRkHJCIgEySBhyEEyIyDgMBKHLEFQBJEoIHBEQFQE
JUjOSo4iOQ+Ss2QkSZhZvLXn7j11726d3draH1vn7Xp+dH1fd/XzvV+//TZxlDgNnNNQRakCIBAA
gM4OgEgA5JQNVBRv6RrcQGLsBO+52WOQ3iJCwkgeLw+sCwaJ0lBDauipqCG9xUV5BZB29ndtvJw8
kTgvGyes531K4jigDJCTkUHJSMmhUCgFBTklDE4No6KCMdGfp4WzMXOwszGzsiK5hLiRlwQ4WVl5
...
</rawContent>
<MD5>0d80850b0c4085500f80e1430b90c70910d4110cc0d7</MD5>
</AttachmentDocument>
(The full version here)
And I can't read the image out of it.
My attempt:
from PIL import Image
import StringIO
import base64
# I've eliminated all newlines and tabs to produce the data string
data="H4sIAAAAAAAAAO2XZ1AU65qAe5..."
im = Image.open(StringIO.StringIO(base64.b64decode(data)))
But I'm getting an error:
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 1980, in open
raise IOError("cannot identify image file")

If you check the output of the base64 decode, you will notice that it is a gzip file, not a JPEG. Extract the compressed file and you'll get the desired JPEG.
Comment stored in the image:
CREATOR: gd-jpeg v1.0 (using IJG JPEG v62), default quality
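The two-step decode described in this answer can be sketched as follows in Python 3. This is a minimal sketch, not the original poster's code: the helper name `extract_payload` is hypothetical, and `b64_text` is assumed to be the text content of the `<rawContent>` element. The leading `H4sI` in the data is a useful hint, because it is what the gzip magic bytes `1f 8b 08` look like after base64 encoding.

```python
import base64
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def extract_payload(b64_text):
    """Base64-decode the text, then gunzip the result if it is a gzip stream."""
    # Remove newlines, tabs and spaces left over from the XML layout.
    compact = "".join(b64_text.split())
    raw = base64.b64decode(compact)
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw
```

The returned bytes can then be wrapped in `io.BytesIO` and passed to `Image.open`, or written to a `.jpeg` file opened in binary mode.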

Related

cv2.imread file with accent (unicode)

I am trying to load the following file: 'data/chapter_1/capd_yard_signs\\Dueñas_2020.png'
But when I do so, cv2.imread returns an error: imread_('data/chapter_1/capd_yard_signs\Due├▒as_2020.png'): can't open/read file: check file path/integrity load file
When I specified the file name with os.path.join, I tried encoding and decoding the file
f = os.path.join("data/chapter_1/capd_yard_signs", filename.encode().decode())
But that didn't solve the problem.
What am I missing?
This is how I ended up getting it to work:
import numpy as np
from PIL import Image
pil = Image.open(f).convert('RGB') # load the image with pillow and make sure it is in RGB
pilCv = np.array(pil) # convert the image to an array
img = pilCv[:,:,::-1].copy() # convert the array to be in BGR

UnrecognizedImageError - image insertion error - python-docx

I am trying to insert a WMF file into a docx file using python-docx, which produces the following traceback.
Traceback (most recent call last):
File "C:/Users/ADMIN/PycharmProjects/ppt-to-word/ppt_reader.py", line 79, in <module>
read_ppt(path, file)
File "C:/Users/ADMIN/PycharmProjects/ppt-to-word/ppt_reader.py", line 73, in read_ppt
write_docx(ppt_data, False)
File "C:/Users/ADMIN/PycharmProjects/ppt-to-word/ppt_reader.py", line 31, in write_docx
document.add_picture(slide_data.get('picture_location'), width=Inches(5.0))
File "C:\Python34\lib\site-packages\docx\document.py", line 72, in add_picture
return run.add_picture(image_path_or_stream, width, height)
File "C:\Python34\lib\site-packages\docx\text\run.py", line 62, in add_picture
inline = self.part.new_pic_inline(image_path_or_stream, width, height)
File "C:\Python34\lib\site-packages\docx\parts\story.py", line 56, in new_pic_inline
rId, image = self.get_or_add_image(image_descriptor)
File "C:\Python34\lib\site-packages\docx\parts\story.py", line 29, in get_or_add_image
image_part = self._package.get_or_add_image_part(image_descriptor)
File "C:\Python34\lib\site-packages\docx\package.py", line 31, in get_or_add_image_part
return self.image_parts.get_or_add_image_part(image_descriptor)
File "C:\Python34\lib\site-packages\docx\package.py", line 74, in get_or_add_image_part
image = Image.from_file(image_descriptor)
File "C:\Python34\lib\site-packages\docx\image\image.py", line 55, in from_file
return cls._from_stream(stream, blob, filename)
File "C:\Python34\lib\site-packages\docx\image\image.py", line 176, in _from_stream
image_header = _ImageHeaderFactory(stream)
File "C:\Python34\lib\site-packages\docx\image\image.py", line 199, in _ImageHeaderFactory
raise UnrecognizedImageError
docx.image.exceptions.UnrecognizedImageError
The image file is in .wmf format.
Any help or suggestion appreciated.
python-docx identifies the type of an image-file by "recognizing" its distinctive header. In this way it can distinguish JPEG from PNG, from TIFF, etc. This is much more reliable than mapping a filename extension and much more convenient than requiring the user to tell you the type. It's a pretty common approach.
This error indicates python-docx is not finding a header it recognizes. Windows Metafile format (WMF) can be tricky this way; there's a lot of leeway in the proprietary spec and variation in file specimens in the field.
To fix this, I recommend you read the file with something that does recognize it (I would start with Pillow) and have it "convert" it into the same or another format, hopefully correcting the header in the process.
First I would try just reading it and saving it as WMF (or perhaps EMF if that's an option). This might be enough to do the trick. If you have to change to an intermediate format and then back, that could be lossy, but maybe better than nothing.
ImageMagick might be another good choice to try because it probably has better coverage than Pillow does.
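To illustrate what "recognizing a distinctive header" means, here is a minimal sketch of header sniffing. It is not python-docx's actual implementation: the names `KNOWN_HEADERS` and `sniff_image_type` are hypothetical, and only a handful of well-known leading byte sequences are listed.

```python
# Leading "magic" bytes of a few common image formats.
KNOWN_HEADERS = (
    ("jpeg", b"\xff\xd8\xff"),
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("gif", b"GIF87a"),
    ("gif", b"GIF89a"),
    ("bmp", b"BM"),
    ("tiff", b"II*\x00"),  # little-endian TIFF
    ("tiff", b"MM\x00*"),  # big-endian TIFF
)


def sniff_image_type(header):
    """Return a format name if the bytes start with a known header, else None."""
    for name, magic in KNOWN_HEADERS:
        if header.startswith(magic):
            return name
    return None
```

Reading the first 16 bytes of a file in binary mode and passing them to `sniff_image_type` is enough for these formats. A file whose header matches none of the entries returns `None`, which is the situation the UnrecognizedImageError reports.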
Explanation
python-docx/image.py reads the different picture file formats it supports from SIGNATURES
Format
1.jpg
Use Image converter to convert 1.jpg to different file formats.
Use magic to get mime type.
File format   Mime type                   add_picture()
.jpg          image/jpeg                  √
.png          image/png                   √
.jfif         image/jpeg                  √
.exif         (not listed)                √
.gif          image/gif                   √
.tiff         image/tiff                  √
.bmp          image/x-ms-bmp              √
.eps          application/postscript      ×
.hdr          application/octet-stream    ×
.ico          image/x-icon                ×
.svg          image/svg+xml               ×
.tga          image/x-tga                 ×
.wbmp         application/octet-stream    ×
.webp         image/webp                  ×
How to solve
Plan A
Convert other format to supported formats like .jpg
Install
pip install pillow
Code
from pathlib import Path
from PIL import Image

def image_to_jpg(image_path):
    path = Path(image_path)
    if path.suffix not in {'.jpg', '.png', '.jfif', '.exif', '.gif', '.tiff', '.bmp'}:
        jpg_image_path = f'{path.parent / path.stem}_result.jpg'
        Image.open(image_path).convert('RGB').save(jpg_image_path)
        return jpg_image_path
    return image_path

if __name__ == '__main__':
    from docx import Document
    document = Document()
    document.add_picture(image_to_jpg('1.jpg'))
    document.add_picture(image_to_jpg('1.webp'))
    document.save('test.docx')
Plan B
First, try adding the picture to Word manually. If that succeeds, Word supports the format. Then modify this library by inheriting from the BaseImageHeader class and implementing the from_stream() method, and add the image format to SIGNATURES.
Lack of file suffix
Rename 1.jpg to 1 (no file suffix)
from docx import Document
document = Document()
document.add_picture('1')
document.save('test.docx')
It will show this
Using this
from docx import Document
document = Document()
document.add_picture(open('1', mode='rb'))
document.save('test.docx')
Conclusion
import io
from pathlib import Path
import magic
from PIL import Image

def image_to_jpg(image_path_or_stream):
    f = io.BytesIO()
    if isinstance(image_path_or_stream, str):
        path = Path(image_path_or_stream)
        if path.suffix in {'.jpg', '.png', '.jfif', '.exif', '.gif', '.tiff', '.bmp'}:
            f = open(image_path_or_stream, mode='rb')
        else:
            Image.open(image_path_or_stream).convert('RGB').save(f, format='JPEG')
    else:
        buffer = image_path_or_stream.read()
        mime_type = magic.from_buffer(buffer, mime=True)
        if mime_type in {'image/jpeg', 'image/png', 'image/gif', 'image/tiff', 'image/x-ms-bmp'}:
            f = image_path_or_stream
        else:
            Image.open(io.BytesIO(buffer)).convert('RGB').save(f, format='JPEG')
    return f

if __name__ == '__main__':
    from docx import Document
    document = Document()
    document.add_picture(image_to_jpg('1.jpg'))
    document.add_picture(image_to_jpg('1.webp'))
    document.add_picture(image_to_jpg(open('1.jpg', mode='rb')))
    document.add_picture(image_to_jpg(open('1', mode='rb'))) # copy 1.webp and rename it to 1
    document.save('test.docx')

Python PIL can't open PDFs for some reason

So my program is able to open PNGs but not PDFs, so I made this just to test, and it still isn't able to open even a simple PDF. And I don't know why.
from PIL import Image
with Image.open(r"Adams, K\a.pdf") as file:
    print file
Traceback (most recent call last):
File "C:\Users\Hayden\Desktop\Scans\test4.py", line 3, in <module>
with Image.open(r"Adams, K\a.pdf") as file:
File "C:\Python27\lib\site-packages\PIL\Image.py", line 2590, in open
% (filename if filename else fp))
IOError: cannot identify image file 'Adams, K\\a.pdf'
After trying PyPDF2 as suggested (Thanks for the link by the way), I am getting this error with my code.
import PyPDF2
pdf_file= open(r"Adams, K (6).pdf", "rb")
read_pdf= PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
print number_of_pages
Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]
Following this article: https://www.geeksforgeeks.org/convert-pdf-to-image-using-python/ you can use the pdf2image package to convert the pdf to a PIL object.
This should solve your problem:
from pdf2image import convert_from_path
fname = r"Adams, K\a.pdf"
pil_image_lst = convert_from_path(fname) # This returns a list even for a 1 page pdf
pil_image = pil_image_lst[0]
I just tried this out with a one page pdf.
As pointed out by @Kevin (see comment below), PIL has support for writing PDFs but not for reading them.
To read a PDF you will need some other library. A tutorial for handling PDFs with PyPDF2 is available here:
https://pythonhosted.org/PyPDF2/?utm_source=recordnotfound.com

Read image from URL and keep it in memory

I am using Python and requests library. I just want to download an image to a numpy array for example and there are multiple questions where you can find different combinations (using opencv, PIL, requests, urllib...)
None of them work for my case. I basically receive this error when I try to download the image:
cannot identify image file <_io.BytesIO object at 0x7f6a9734da98>
A simple example of my code can be:
import requests
from PIL import Image
response = requests.get(url, stream=True)
response.raw.decode_content = True
image = Image.open(response.raw)
image.show()
The main thing that is driving me crazy is that, if I download the image to a file (using urllib), the whole process runs without any problem!
import os
import urllib.request
urllib.request.urlretrieve(garment.url, os.path.join(download_folder, garment.get_path()))
What can I be doing wrong?
EDIT:
My mistake turned out to be in how I built the URL, not in requests or the PIL library.
My previous code example should work perfectly if the URL is correct.
I think you are somehow consuming data from the response.raw object before passing it to Image.open. The raw object of a requests response is not seekable, so you can read from it only once:
>>> response.raw.seekable()
False
First open is ok:
>>> response.raw.tell()
0
>>> image = Image.open(response.raw)
The second open raises an error (the stream position is already at the end of the file):
>>> response.raw.tell()
695 # this file length https://docs.python.org/3/_static/py.png
>>> image = Image.open(response.raw)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2295, in open
% (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x7f11850074c0>
If you want to use the data several times, save it from the requests response into a file-like object (or, of course, a file):
import io
image_data = io.BytesIO(response.raw.read())
Now you can read image stream and rewind it as many times as needed:
>>> image_data.seekable()
True
image = Image.open(image_data)
image1 = Image.open(image_data)
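The difference between a one-shot stream and a rewindable buffer can be reproduced without any network access. The sketch below uses only the standard library; `ForwardOnlyStream` and `buffer_stream` are hypothetical names, and the class is merely a stand-in that behaves like a non-seekable stream such as response.raw.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Stand-in for a non-seekable stream: data can be read only once."""

    def __init__(self, payload):
        super().__init__()
        self._inner = io.BytesIO(payload)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, target):
        # Fill the caller's buffer; returns 0 once the data is exhausted.
        return self._inner.readinto(target)


def buffer_stream(stream):
    """Read a one-shot stream fully into a seekable in-memory buffer."""
    return io.BytesIO(stream.read())
```

After `buffer_stream(source)` the original stream is exhausted and further reads return empty bytes, while the returned buffer can be rewound with `seek(0)` and read again. With requests, `io.BytesIO(response.content)` gives an equivalent seekable buffer when the response was fetched without `stream=True`.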

How to save image using Python

I have an image file decoded by base64.
Now, I want to save the image file to the specified directory.
The directory is given by image_dir_path:
image_dir_path = '/images/store/'
image_file = base64.b64decode(image_file)
How can I save the image_file to image_dir_path?
I tried shutil.copy(image_file, image_dir_path), but it doesn't work for my case.
I'm sorry, I couldn't find an existing question like this.
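You can write any content to a file with a file object and its write method. As an example, let's grab some base64 encoded data from the web (Python 3):
import base64
import urllib.request
# Download the base64-encoded text and decode it to raw bytes
decoded = base64.b64decode(urllib.request.urlopen("http://git.io/vYT4p").read())
# Open the file in binary mode ('wb') so the bytes are written unchanged
with open('/tmp/31558315.png', 'wb') as handle:
    handle.write(decoded)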
You should be able to open the file under /tmp/31558315.png as a regular image.
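Applied to the question's setup, where the decoded bytes must end up inside a given directory, a minimal sketch might look like this. The helper name `save_bytes` and the file name argument are hypothetical, since the question does not say what the file should be called. Note that shutil.copy cannot be used here because it expects the path of an existing source file, not the image bytes themselves.

```python
import os


def save_bytes(data, directory, filename):
    """Write raw bytes to directory/filename and return the full path."""
    # Create the target directory if it does not exist yet.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    # Binary mode keeps the image bytes exactly as they are.
    with open(path, "wb") as handle:
        handle.write(data)
    return path
```

For the question's variables this would be called as `save_bytes(image_file, image_dir_path, 'picture.png')`, where `'picture.png'` is a placeholder name.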
