I am using Django-media-tree to import images into a site image library. I am hitting a bug in PIL where some unknown EXIF data on the image causes an unhandled exception during generation of the thumbnail images. Rather than hacking around in PIL, I am looking to simply remove all EXIF data from the image before it is handled by PIL.
Using chilkat.CkXmp() I am attempting to rewrite the image to a new directory in clean form, however the RemoveAllEmbedded() method returns None, and the image is rewritten with the EXIF data intact.
import os
import sys
import chilkat

ALLOWED_EXTENSIONS = ['.jpg', 'jpeg', '.png', '.gif', 'tiff']

def listdir_fullpath(d):
    list = []
    for f in os.listdir(d):
        if len(f) > 3:
            if f[-4:] in ALLOWED_EXTENSIONS:
                list.append(os.path.join(d, f))
    return list

def trim_xmp_data(file, dir):
    xmp = chilkat.CkXmp()
    success = xmp.UnlockComponent("Anything for 30-day trial.")
    if (success != True):
        print xmp.lastErrorText()
        sys.exit()
    success = xmp.LoadAppFile(file)
    if (success != True):
        print xmp.lastErrorText()
        sys.exit()
    print "Num embedded XMP docs: %d" % xmp.get_NumEmbedded()
    xmp.RemoveAllEmbedded()
    # Save the JPG.
    fn = "%s/amended/%s" % (dir, file.rsplit('/')[-1])
    success = xmp.SaveAppFile(fn)
    if (success != True):
        print xmp.lastErrorText()
        sys.exit()

for item in listdir_fullpath('/Users/harrin2/Desktop/tmp/'):
    trim_xmp_data(item, '/Users/harrin2/Desktop/tmp')
Can anyone tell me where I am going wrong? If there is a better method of cleaning the images, I am open to suggestions.
TIA
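One possible direction, offered as a sketch rather than a verified fix: as far as I know, XMP and EXIF are separate metadata blocks, so removing the embedded XMP documents would not necessarily touch EXIF (I have not checked how CkXmp behaves here). In a JPEG file both are stored in APP1 segments, so for JPEGs you can drop those segments with the standard library alone. The function name strip_app1_segments is made up for this example, it handles JPEG only (not PNG, GIF or TIFF), and removing EXIF also removes the orientation tag.

```python
import struct


def strip_app1_segments(jpeg_bytes):
    """Return a copy of a JPEG byte string without its APP1 (EXIF/XMP) segments.

    Hypothetical helper: walks the marker segments that precede the image
    data and copies everything except APP1 (marker 0xFFE1).
    """
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError("corrupt JPEG: expected a marker at offset %d" % pos)
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, copy it verbatim.
            break
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        if length < 2:
            raise ValueError("corrupt JPEG: bad segment length at offset %d" % pos)
        segment_end = pos + 2 + length
        if marker != 0xE1:
            out += jpeg_bytes[pos:segment_end]
        pos = segment_end
    out += jpeg_bytes[pos:]
    return bytes(out)
```

To use it, read the source file in binary mode, pass the bytes through strip_app1_segments, and write the result into the "amended" directory before the image reaches PIL.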
Related
I have a lot of URLs of images stored on the web, example of a URL is as follows :
https://m.media-amazon.com/images/M/MV5BOWE4M2UwNWEtODFjOS00M2JiLTlhOGQtNTljZjI5ZTZlM2MzXkEyXkFqcGdeQXVyNjUwNzk3NDc#._V1_QL75_UX190_CR0
I want to load images from a similar URL as mentioned above and then do some operations on that image then return the resulting image.
So here's my code :
def get_image_from_url(url, path):
    try:
        # downloading image from url
        img = requests.get(url)
        with open(path, 'wb') as f:
            f.write(img.content)
        # reading image, str(path) since path is object of Pathlib's path class
        img = cv2.imread(str(path), cv2.IMREAD_COLOR)
        # some operations
        # deleting that downloaded image since it is of no use now
        if os.path.exists(path):
            os.remove(path)
        return resulting_image
    except Exception as e:
        return np.zeros((224, 224, 3), np.uint8)
But this process takes too much time, so instead of downloading and then deleting the image, I would like to load the image at the URL directly into a variable.
Something like this :
def store_image_from_url(url):
    image = get_image_from_url(url) # without downloading it into my computer
    # do some operations
    return resulting_image
Is there any way to do the same?
Thank you
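As described in "How can I read an image from an Internet URL in Python cv2, scikit image and mahotas?", it can be something like this:
import cv2
import numpy as np
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

def get_image_from_url(url):
    # Read the raw bytes of the response and decode them in memory.
    with urlopen(url) as req:
        arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
    # -1 is cv2.IMREAD_UNCHANGED; imdecode returns None if the data cannot be decoded.
    img = cv2.imdecode(arr, -1)
    return img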
Code:
import zxing
from PIL import Image
reader = zxing.BarCodeReader()
path = 'C:/Users/UI UX/Desktop/Uasa.png'
im = Image.open(path)
barcode = reader.decode(path)
print(barcode)
When I use the code above, it works fine and returns this result:
BarCode(raw='P<E....
I need to use this code:
import zxing
import cv2
reader = zxing.BarCodeReader()
path = 'C:/Users/UI UX/Desktop/Uasa.png'
img = cv2.imread (path)
cv2.imshow('img', img)
cv2.waitKey(0)
barcode = reader.decode(img)
print(barcode)
But this code returns an error:
TypeError: expected str, bytes or os.PathLike object, not numpy.ndarray
In another program I have the image as base64. Could that help somewhere here?
Could anybody help me with this?
ZXing does not support passing an image directly as it is using an external application to process the barcode image. If you're not locked into using the ZXing library for decoding PDF417 barcodes you can take a look at the PyPI package pdf417decoder.
If you're starting with a Numpy array like in your example then you have to convert it to a PIL image first.
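import cv2
from pdf417decoder import PDF417Decoder
from PIL import Image

path = 'C:/Users/UI UX/Desktop/Uasa.png'  # same image path as in the question
npimg = cv2.imread(path)
cv2.imshow('img', npimg)
cv2.waitKey(0)
# OpenCV loads images as BGR; convert to RGB before handing the array to PIL.
img = Image.fromarray(cv2.cvtColor(npimg, cv2.COLOR_BGR2RGB))
decoder = PDF417Decoder(img)
if decoder.decode() > 0:
    print(decoder.barcode_data_index_to_string(0))
else:
    print("Failed to decode barcode.")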
You cannot. If you look at the source code, you will see that it calls a Java app with the provided path (specifically com.google.zxing.client.j2se.CommandLineRunner).
If you need to pre-process your image, you will have to save it somewhere and pass that path to the library.
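For the base64 case mentioned in the question, a minimal sketch of the "save it somewhere and pass the path" idea could look like the following. The names temp_image_file and decode_from_base64 are invented for this example, and decode_path stands for any callable that takes a file path, for instance zxing.BarCodeReader().decode.

```python
import base64
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temp_image_file(image_bytes, suffix=".png"):
    """Write image_bytes to a temporary file, yield its path, then delete it."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(image_bytes)
        yield path
    finally:
        # Clean up even if the decoder raises.
        os.remove(path)


def decode_from_base64(image_b64, decode_path, suffix=".png"):
    """Decode a base64 image via a path-only decoder (hypothetical helper)."""
    with temp_image_file(base64.b64decode(image_b64), suffix) as path:
        return decode_path(path)
```

With the zxing package installed, the call would be along the lines of decode_from_base64(image_b64, zxing.BarCodeReader().decode). The temporary file is removed as soon as the decoder returns.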
I fixed this by writing the image to disk and passing its path to the reader:
path = os.getcwd()
# print(path)
writeStatus = cv2.imwrite(os.path.join(path, 'test.jpg'), pdf_image)
if writeStatus is True:
    print("image written")
else:
    print("problem") # or raise exception, handle problem, etc.
sss = (os.path.join(path, 'test.jpg'))
# print(sss)
pp = sss.replace('\\', '/')
# print(pp)
reader = zxing.BarCodeReader()
barcode = reader.decode(pp)
The zxing package is not recommended. It is just a command line tool to invoke Java ZXing libraries.
You should use zxing-cpp, which is a Python module built with ZXing C++ code. Here is the sample code:
import cv2
import zxingcpp
img = cv2.imread('myimage.png')
results = zxingcpp.read_barcodes(img)
for result in results:
    print("Found barcode:\n Text: '{}'\n Format: {}\n Position: {}"
          .format(result.text, result.format, result.position))
if len(results) == 0:
    print("Could not find any barcode.")
I would like to extract text from scanned PDFs.
My "test" code is as follows:
from pdf2image import convert_from_path
from pytesseract import image_to_string
from PIL import Image
converted_scan = convert_from_path('test.pdf', 500)
for i in converted_scan:
    i.save('scan_image.png', 'png')
text = image_to_string(Image.open('scan_image.png'))
with open('scan_text_output.txt', 'w') as outfile:
    outfile.write(text.replace('\n\n', '\n'))
I would like to know if there is a way to extract the content of the image directly from the object converted_scan, without saving the scan as a new "physical" image file on the disk?
Basically, I would like to skip this part:
for i in converted_scan:
    i.save('scan_image.png', 'png')
I have a few thousand scans to extract text from. Although the generated image files are not particularly heavy, the total is not negligible, and writing them all to disk seems like overkill.
EDIT
Here's a slightly different, more compact approach than Colonder's answer, based on this post. For .pdf files with many pages, it might be worth adding a progress bar to each loop using e.g. the tqdm module.
from wand.image import Image as w_img
from PIL import Image as p_img
import pyocr.builders
import regex, pyocr, io

infile = 'my_file.pdf'
tools = pyocr.get_available_tools()
tool = tools[0]
req_image = []
txt = ''

# to convert pdf to img and extract text
with w_img(filename = infile, resolution = 200) as scan:
    image_png = scan.convert('png')
    for i in image_png.sequence:
        img_page = w_img(image = i)
        req_image.append(img_page.make_blob('png'))
for i in req_image:
    content = tool.image_to_string(
        p_img.open(io.BytesIO(i)),
        lang = tool.get_available_languages()[0],
        builder = pyocr.builders.TextBuilder()
    )
    txt += content

# to save the output as a .txt file
with open(infile[:-4] + '.txt', 'w') as outfile:
    full_txt = regex.sub(r'\n+', '\n', txt)
    outfile.write(full_txt)
UPDATE MAY 2021
I realized that although pdf2image simply calls a subprocess, you don't have to save the images to disk in order to OCR them afterwards. You can pass the returned PIL images straight to the OCR call (pytesseract works as the OCR library as well):
from pdf2image import convert_from_path

# tool and lang are set up as in the full script further down
for img in convert_from_path("some_pdf.pdf", 300):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
EDIT: you can also try and use pdftotext library
pdf2image is a simple wrapper around pdftoppm and pdftocairo. Internally it does nothing more than call subprocess. This script should do what you want, but you need the wand library as well as pyocr (I think this is a matter of preference, so feel free to use any library you want for text extraction).
from PIL import Image as Pimage, ImageDraw
from wand.image import Image as Wimage
import sys
import numpy as np
from io import BytesIO
import pyocr
import pyocr.builders
def _convert_pdf2jpg(in_file_path: str, resolution: int=300) -> Pimage:
    """
    Convert PDF file to JPG
    :param in_file_path: path of pdf file to convert
    :param resolution: resolution with which to read the PDF file
    :return: PIL Image
    """
    with Wimage(filename=in_file_path, resolution=resolution).convert("jpg") as all_pages:
        for page in all_pages.sequence:
            with Wimage(page) as single_page_image:
                # transform wand image to bytes in order to transform it into PIL image
                yield Pimage.open(BytesIO(bytearray(single_page_image.make_blob(format="jpeg"))))

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)
# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))
# Ex: Will use lang 'fra'
# Note that languages are NOT sorted in any way. Please refer
# to the system locale settings for the default language
# to use.
for img in _convert_pdf2jpg("some_pdf.pdf"):
    txt = tool.image_to_string(img,
                               lang=lang,
                               builder=pyocr.builders.TextBuilder())
I believe this is my first StackOverflow question, so please be nice.
I am OCRing a repository of PDFs (~1GB in total) ranging from 50-200 pages each and found that suddenly all of the available 100GB of remaining harddrive space on my Macbook Pro were gone. Based on a previous post, it seems that ImageMagick is the culprit as shown here.
I found that these files are called 'magick-*' and are stored in /private/var/tmp. For only 23 PDFs it had created 3576 files totaling 181GB.
How can I delete these files immediately within the code after they are no longer needed? Thank you in advance for any suggestions to remedy this issue.
Here is the code:
import io, os
import json
import unicodedata
from PIL import Image as PI
import pyocr
import pyocr.builders
from wand.image import Image
from tqdm import tqdm
# Where you want to save the PDFs
destination_folder = 'contract_data/Contracts_Backlog/'
pdfs = [unicodedata.normalize('NFKC',f.decode('utf8')) for f in os.listdir(destination_folder) if f.lower().endswith('.pdf')]
txt_files = [unicodedata.normalize('NFKC',f.decode('utf8')) for f in os.listdir(destination_folder) if f.lower().endswith('.txt')]
### Perform OCR on PDFs
def ocr_pdf_to_text(filename):
    tool = pyocr.get_available_tools()[0]
    lang = 'spa'
    req_image = []
    final_text = []
    image_pdf = Image(filename=filename, resolution=300)
    image_jpeg = image_pdf.convert('jpeg')
    for img in image_jpeg.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('jpeg'))
    for img in req_image:
        txt = tool.image_to_string(
            PI.open(io.BytesIO(img)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        final_text.append(txt)
    return final_text

for filename in tqdm(pdfs):
    txt_file = filename[:-3] +'txt'
    txt_filename = destination_folder + txt_file
    if not txt_file in txt_files:
        print 'Converting ' + filename
        try:
            ocr_txt = ocr_pdf_to_text(destination_folder + filename)
            with open(txt_filename,'w') as f:
                for i in range(len(ocr_txt)):
                    f.write(json.dumps({i:ocr_txt[i].encode('utf8')}))
                    f.write('\n')
            f.close()
        except:
            print "Could not OCR " + filename
A hacky way of dealing with this was to add an os.remove() statement within the main loop to remove the tmp files after creation.
tempdir = '/private/var/tmp/'
files = os.listdir(tempdir)
for file in files:
    if "magick" in file:
        os.remove(os.path.join(tempdir,file))
Image should be used as a context manager, because Wand decides when to dispose of resources, including temporary files, in-memory buffers, and so on. A with block helps Wand know the boundaries: when these Image objects are still needed and when they are no longer necessary.
See also the official docs.
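A rough sketch of the page-extraction part of ocr_pdf_to_text rewritten with with blocks, so that every Wand image is closed as soon as its blob has been taken. The function name pdf_to_jpeg_blobs and the image_cls parameter are my own additions; image_cls exists only so the function can be tried without Wand installed and defaults to wand.image.Image. Whether this removes all of the magick-* files on your system is something to verify.

```python
def pdf_to_jpeg_blobs(filename, resolution=300, image_cls=None):
    """Return one JPEG blob per PDF page, closing each Wand image promptly.

    image_cls is a hypothetical hook for testing; by default the real
    wand.image.Image class is imported and used.
    """
    if image_cls is None:
        from wand.image import Image as image_cls  # imported lazily
    blobs = []
    # Each with block tells Wand when the image is no longer needed.
    with image_cls(filename=filename, resolution=resolution) as pdf:
        with pdf.convert('jpeg') as jpeg:
            for frame in jpeg.sequence:
                with image_cls(image=frame) as page:
                    blobs.append(page.make_blob('jpeg'))
    return blobs
```

The returned blobs can then be fed to PI.open(io.BytesIO(blob)) and tool.image_to_string exactly as in the question's second loop.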
I have just written a small function to download and save some images to my hard disk. Some URLs redirect and/or have bad file extensions, so I added some validation. However, the checks make the script stop as soon as it hits a bad URL. I would like to modify the script so that the loop continues past any bad URLs and stops once an image has been downloaded successfully (I only need one successful download). Can you please take a look at my code and share some tips? Thank you
from pattern.web import URL, DOM, plaintext, extension
import requests, re, os, sys, datetime, time, re, random
def download_single_image(query, folder, image_options=None):
    download_fault = 0
    url_link = None
    valid_image_ext_list = ['.png', '.jpg', '.gif', '.bmp', '.tiff', 'jpeg'] # not comprehensive
    pic_links = scrape_links(query, image_options) # pic_links contains an array of urls
    for url in pic_links:
        url = URL(url)
        print "checking re-direction"
        if url.redirect:
            print "redirected, returning"
            return # if there is a redirect, return
        file_ext = extension(url.page)
        print "checking file extension", file_ext
        if file_ext.lower() not in valid_image_ext_list:
            print "not a valid extension, returning"
            return # return if not valid image extension found
        # Download the image.
        print('Downloading image %s... ' % (pic))
        res = requests.get(pic)
        try:
            res.raise_for_status()
        except Exception as exc:
            print('There was a problem: %s' % (exc))
        print ('Saving image to %s...'% (folder))
        if not os.path.exists(folder + '/' + os.path.basename(pic)):
            imageFile = open(os.path.join(folder, os.path.basename(pic)), mode='wb')
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)
            imageFile.close()
            print('pic saved %s' % os.path.basename(pic))
        else:
            print('File already exists!')
        return os.path.basename(pic)
Change this:
return # return if not valid image extension found
to this:
continue # skip this URL if no valid image extension is found
The first exits the whole function; the second skips to the next URL in the loop. The same change applies to the return in the redirect check.
PS. File extensions mean nothing on the Internet. I would rather send a HEAD request (for example with curl) and check whether it is an image by the Content-Type the server returns.
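To make both points concrete, here is a small standard-library sketch (Python 3) that skips bad URLs with continue and stops at the first one whose Content-Type looks like an image. The names content_type_of and first_image_url are made up for this example, and the get_content_type parameter is only there so the loop can be tried without network access.

```python
import urllib.request


def content_type_of(url, timeout=10):
    """Send a HEAD request and return the Content-Type the server reports."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type()


def first_image_url(urls, get_content_type=content_type_of):
    """Return the first URL served as an image, or None if there is none."""
    for url in urls:
        try:
            content_type = get_content_type(url)
        except (OSError, ValueError):
            # Unreachable host, HTTP error, timeout or malformed URL:
            # move on to the next candidate instead of leaving the function.
            continue
        if not content_type.startswith("image/"):
            continue
        return url
    return None
```

Some servers do not answer HEAD requests properly, so treat a failure here as "skip this URL" rather than as proof that the file is not an image.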