I have been using pyexiv2 to read EXIF information from JPEG files in Python, and noticed that one tag in particular - ExposureTime - is not reported the same by exiv2 as by another EXIF library, libexif.
Any exiv2-based utility I've tried will simplify the ExposureTime tag to a "rational" such as 0/1, 0, or similar. libexif-based utilities (in particular, the command-line tool "exif") report a much more detailed value, "1/-21474836 sec.", for the same tag in the same image.
Firstly I'd like to understand: what can account for this difference? I'm assuming that the latter of the two is correct.
Secondly, and assuming that the more detailed value as reported by libexif is correct, I'd like to be able to obtain this value in Python, which as far as I can see is not possible with any of the EXIF tools I have come across (pyexiv2, for example). Is there a tool or method that I am not considering?
I have stumbled across one potential solution: using the libexif C library from Python via ctypes, as noted in this previously answered question - though I could not find examples of how to do this.
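For reference, the kind of thing I have in mind with ctypes would be roughly the following (untested; the library soname, the partial ExifData struct layout, the file name and the tag constant are assumptions taken from the libexif headers, so treat it only as a starting point):

import ctypes

libexif = ctypes.CDLL("libexif.so.12")   # adjust for your platform

EXIF_IFD_EXIF = 2                        # index of the Exif sub-IFD in ExifData.ifd[]
EXIF_TAG_EXPOSURE_TIME = 0x829A

class ExifData(ctypes.Structure):
    # only the leading ifd[] array is declared here; that is all we dereference
    _fields_ = [("ifd", ctypes.c_void_p * 5)]

libexif.exif_data_new_from_file.restype = ctypes.POINTER(ExifData)
libexif.exif_data_new_from_file.argtypes = [ctypes.c_char_p]
libexif.exif_content_get_entry.restype = ctypes.c_void_p
libexif.exif_content_get_entry.argtypes = [ctypes.c_void_p, ctypes.c_int]
libexif.exif_entry_get_value.restype = ctypes.c_char_p
libexif.exif_entry_get_value.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_uint]

data = libexif.exif_data_new_from_file(b"image.jpg")
if data:
    entry = libexif.exif_content_get_entry(data.contents.ifd[EXIF_IFD_EXIF],
                                           EXIF_TAG_EXPOSURE_TIME)
    if entry:
        buf = ctypes.create_string_buffer(256)
        libexif.exif_entry_get_value(entry, buf, len(buf))
        print(buf.value)                 # libexif's formatted value, e.g. b'1/-21474836 sec.'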
Any help is greatly appreciated. Thanks!
In case this helps, here are some hacks I recently did to set missing lens/F-Number information, since I was using a manual lens. I also computed the actual absolute EV for automatic retrieval by later HDR processing tools (Luminance HDR). I commented out the "write" actions below for safety. It should be pretty much self-explanatory.
The files section at the top builds a list of files to work on in the current folder (here, all *.ARW Sony raw files). Adjust the pattern and path as needed.
#!/usr/bin/env python
import os
import time
import array
import math

# make file list (take all *.ARW files in the current folder)
files = [f for f in os.listdir(".") if f.endswith(".ARW")]
files.sort()  # just to be nice

# dict of the tags to work with in particular, pre-filled with the manual-lens / base-exposure values
tags = {'Aperture': 10., 'Exposure Time ': 1./1250, 'Shutter Speed': 1./1250,
        'ISO': 200., 'Stops Above Base ISO': 0., 'Exposure Compensation': 0.}

# arbitrarily chosen base EV (EV = log2(N^2 / t)) to get the final EV compensation numbers into a +/-10 range
EVref = math.log(math.pow(tags['Aperture'], 2.0) / tags['Shutter Speed'], 2.0) - 4
print('EVref=', EVref)

for f in files:
    print(f)
    # read all tags of the file with exiftool and parse them line by line ("Tag Name : value")
    meta = os.popen("exiftool " + f).readlines()
    for tag in meta:
        set = str(tag).rstrip("\n").split(":")
        for t, x in tags.items():
            if str(set[0]).strip(" ") == t:
                # evaluate the reported value (e.g. "1/1250") with the external 'calc' tool
                tags[t] = float(str(os.popen("calc -- " + set[1]).readlines()).strip("[]'~\\t\\n"))
                print(t, tags[t], set[1])
    ev = math.log(math.pow(tags['Aperture'], 2.0) / tags['Shutter Speed'], 2.0)
    EV = EVref - ev + tags['Stops Above Base ISO']
    print('EV=', EV)
    # uncomment/edit to update EXIF in place:
    # os.system('exiftool -ExposureCompensation=' + str(EV) + ' ' + f)
    # os.system('exiftool -FNumber=10 ' + f)
    # os.system('exiftool -FocalLength=1000.0 ' + f)
    # os.system('exiftool -FocalLengthIn35mmFormat=1000.0 ' + f)
For machine learning purposes (scikit-learn), I need to extract the raw text from lots of PDF files. First off, I was using xpdf's pdftotext to do this task:
import os
import subprocess

# xpdf_path and pdf are defined elsewhere in the script
exe = r'"'+os.path.join(xpdf_path,"pdftotext.exe")+'"'
cmd = exe+" "+"\""+pdf+"\""+" "+"\""+pdf+".txt"+"\""
subprocess.check_output(cmd)

with open(pdf+".txt") as f:
    texto_converted = f.read()
But unfortunately, for a few of them, I was unable to get the text because they use "stream" objects in their PDF source, like this one.
The result is something like this:
59!"#$%&'()*+,-.#/#01"21"" 345667.0*(879:4$;<;4=<6>4?$#"12!/ 21#$#A$3A$>#>BCDCEFGCHIJKIJLMNIJILOCNPQRDS QPFTRPUCTCVQWBCTTQXFPYTO"21 "#/!"#(Z[12\&A+],$3^_3;9`Z &a# .2"#.b#"(#c#A(87*95d$d4?$d3e#Z"f#\"#2b?2"#`Z 2"!eb2"#H1TBRgF JhiO
jFK# 2"k#`Z !#212##"elf/e21m#*c!n2!!#/bZ!#2#`Z "eo ]$5<$#;A533> "/\ko/f\#e#e#p
I even tried using zlib + regex:
import re
import zlib

pdf = open("pdfa.pdf", "rb").read()
stream = re.compile(b'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in re.findall(stream, pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s).decode('UTF-8'))
        print("")
    except:
        pass
The result was something like this:
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
I even tried pdftopng (xpdf) so that I could run tesseract afterwards, without success.
So, is there any way to extract pure text from a PDF like that using Python or a third-party app?
If you want to decompress the streams in a PDF file, I can recommend using qpdf, but on this file
qpdf --decrypt --stream-data=uncompress document.pdf out.pdf
doesn't help either.
I am not sure though why your efforts with xpdf and tesseract did not work out. Using ImageMagick's convert
to create PNG files in a temporary directory and then running tesseract on them, you can do:
import os
from pathlib import Path
from tempfile import TemporaryDirectory
import subprocess

DPI = 600

def call(*args):
    cmd = [str(x) for x in args]
    # print(cmd)   # uncomment to get some sense of progress
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')

def ocr(docpath, lang):
    result = []
    abs_path = Path(docpath).expanduser().resolve()
    old_dir = os.getcwd()
    out = Path('out.txt')
    with TemporaryDirectory() as tmpdir:
        os.chdir(tmpdir)
        # render every page of the PDF to out-<n>.png
        call('convert', '-density', DPI, abs_path, 'out.png')
        index = -1
        while True:
            # names have no leading zeros on the digits, would be difficult to sort glob() output
            # so just count them
            index += 1
            png = Path(f'out-{index}.png')
            if not png.exists():
                break
            call('tesseract', '--dpi', DPI, png, out.stem, '-l', lang)
            result.append(out.read_text())
        os.chdir(old_dir)
    return result

pages = ocr('~/Downloads/document.pdf', 'por')
print('\n'.join(pages[1].splitlines()[21:24]))
which gives:
DA NÃO REALIZAÇÃO DE AUDIÊNCIA DE AUTOCOMPOSIÇÃO NO CASO EM CONCRETO
Com vista a obter maior celeridade processual, assim como da impossibilidade de conciliação entre
If you are on Windows, make sure your PDF file is not open in a different process (like a PDF viewer), as Windows doesn't seem to like that.
The final print is limited as the full output is quite large.
This converting and OCR-ing takes a while so you might want to uncomment the print in call() to get some sense of progress.
There are two fairly simple techniques you can use.
1) Google's "Tesseract" open source OCR (optical character recognition). You could apply this evenly to all PDFs, though converting all that data into pixels and then working magic upon them is going to be more computationally expensive. Which is more important, engineer time or CPU time? There's a pytesseract module. Note that this tool works on image formats, so you'd have to use something like Ghostscript (another open source project) to convert all of a PDF's pages to images, then run [py]tesseract on those images (a minimal sketch of that pipeline is given after this list).
2) pyPDF can get each page and programmatically extract any text draw operations in the order they were drawn onto the page. This may be nothing like the logical reading order of the page... While a PDF could draw all the 'a's and then all the 'b's (and so forth), it's actually more efficient to draw everything in "font a", then everything in "font b". It's important to note that "font b" might just be the italic version of "font a". This produces a shorter/more efficient stream of drawing commands, though probably not by such an amount as to be a good business decision to do so.
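For option 1, a minimal sketch of that Ghostscript-plus-pytesseract pipeline might look like the following (untested; the input file name, resolution and language code are placeholders):

import glob
import subprocess

from PIL import Image
import pytesseract

# render each page of the PDF to a PNG with Ghostscript (300 dpi, 24-bit colour)
subprocess.check_call([
    "gs", "-dBATCH", "-dNOPAUSE", "-r300",
    "-sDEVICE=png16m", "-sOutputFile=page-%03d.png", "input.pdf",
])

# OCR every rendered page and collect the text
text = []
for png in sorted(glob.glob("page-*.png")):
    text.append(pytesseract.image_to_string(Image.open(png), lang="por"))

print("\n".join(text))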
The kicker here is that a random pile of PDF files might require you to do some OCR. A poorly assembled PDF (one with a font subset that has no "to unicode" data) can't be properly mined for text even though it has nothing but text drawing operations. "Draw glyphs one through five from font C" doesn't mean much if you don't know that those first five glyphs are "g-l-y-p-h", because that's the order they were used in.
On the other hand, if you've got home-grown PDFs or all your pdfs are from some known source (Word's pdf converter for example), you'll know what to expect in advance.
Note that the only thing mentioned above that I've actually used is Ghostscript. I remember it having a solid command line interface we used to generate images for some online PDF viewer Many Years Ago.
I've searched the documentation for python-docx and other packages, as well as Stack Overflow, but could not find how to remove all images from docx files with Python.
My exact use case: I need to convert hundreds of Word documents to a "draft" format to be viewed by clients. Those drafts should be identical to the original documents, but all the images must be deleted/redacted from them.
Sorry for not including an example of things I tried; what I have tried is hours of research that didn't turn up any info. I found this question on how to extract images from Word files, but it doesn't delete them from the actual document: Extract pictures from Word and Excel with Python
From there and other sources I've found out that docx files can be read as simple zip files. I don't know if that means it's possible to "re-zip" them without the images without affecting the integrity of the docx file (edit: simply deleting the images works, but it prevents python-docx from continuing to work with the file because of the missing references to the images), but I thought this might be a path to a solution.
Any ideas?
If your goal is to redact images, maybe this code I used for a similar use case could be useful:
import sys
import zipfile
from PIL import Image, ImageFilter
import io

blur = ImageFilter.GaussianBlur(40)

def redact_images(filename):
    outfile = filename.replace(".docx", "_redacted.docx")
    with zipfile.ZipFile(filename) as inzip:
        with zipfile.ZipFile(outfile, "w") as outzip:
            for info in inzip.infolist():
                name = info.filename
                print(info)
                content = inzip.read(info)
                if name.endswith((".png", ".jpeg", ".gif")):
                    # blur the embedded image and keep the zip entry metadata in sync
                    fmt = name.split(".")[-1]
                    img = Image.open(io.BytesIO(content))
                    img = img.convert().filter(blur)
                    outb = io.BytesIO()
                    img.save(outb, fmt)
                    content = outb.getvalue()
                    info.file_size = len(content)
                    info.CRC = zipfile.crc32(content)
                outzip.writestr(info, content)
Here I used PIL to blur images in some files, but instead of the blur filter any other suitable operation could be used. This worked quite nicely for my use case.
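For example, if the images should be blanked out entirely rather than blurred, the filter line could be swapped for a plain placeholder of the same size (an untested variation of the code above):

# instead of: img = img.convert().filter(blur)
img = Image.new("RGB", img.size, (255, 255, 255))  # solid white rectangle, same dimensions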
I don't think it's currently implemented in python-docx.
Pictures in the Word Object Model are defined as either floating shapes or inline shapes. The docx documentation states that it only supports inline shapes.
The Word Object Model for Inline Shapes supports a Delete() method, which should be accessible. However, it is not listed in the examples of InlineShapes and there is also a similar method for paragraphs. For paragraphs, there is an open feature request to add this functionality - which dates back to 2014! If it's not added to paragraphs it won't be available for InlineShapes as they are implemented as discrete paragraphs.
You could do this with win32com if you have a machine with Word and Python installed.
This would allow you to call the Word Object Model directly, giving you access to the Delete() method. In fact you could probably cheat - rather than scrolling through the document to get each image, you can call Find and Replace to clear the image. This SO question talks about win32com find and replace:
import win32com.client
from os import getcwd, listdir

docs = [i for i in listdir('.') if i[-3:] == 'doc' or i[-4:] == 'docx']  # all Word files

FromTo = {"First Name": "John",
          "Last Name": "Smith"}  # you can insert as many as you want

word = win32com.client.DispatchEx("Word.Application")
word.Visible = True   # comment this out after testing
word.DisplayAlerts = False

for doc in docs:
    word.Documents.Open('{}\\{}'.format(getcwd(), doc))
    for From in FromTo.keys():
        word.Selection.Find.Text = From
        word.Selection.Find.Replacement.Text = FromTo[From]
        word.Selection.Find.Execute(Replace=2, Forward=True)  # Replace=2 => wdReplaceAll (replace every occurrence)
    name = doc.rsplit('.', 1)[0]
    ext = doc.rsplit('.', 1)[1]
    word.ActiveDocument.SaveAs('{}\\{}_2.{}'.format(getcwd(), name, ext))

word.Quit()  # releases Word object from memory
In this case, since we want images, we would need to use the special code ^g as the Find text and a blank string as the replacement:

find = word.Selection.Find
find.Text = "^g"
find.Replacement.Text = ""
find.Execute(Replace=2, Forward=True)  # 2 = wdReplaceAll, so every image is removed
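Alternatively, rather than going through Find and Replace, the Delete() method mentioned above can be called directly on each shape through COM. A rough sketch (untested; the file paths are placeholders, and floating shapes are handled separately from inline ones):

import win32com.client

word = win32com.client.DispatchEx("Word.Application")
word.Visible = False
word.DisplayAlerts = False

doc = word.Documents.Open(r"C:\path\to\document.docx")

# COM collections are 1-based; delete from the end so the indices stay valid
for i in range(doc.InlineShapes.Count, 0, -1):
    doc.InlineShapes(i).Delete()
for i in range(doc.Shapes.Count, 0, -1):   # floating images/shapes, if any
    doc.Shapes(i).Delete()

doc.SaveAs(r"C:\path\to\document_draft.docx")
doc.Close()
word.Quit()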
I don't know about this library, but looking through the documentation I found this section about images. It mentions that it is currently not possible to insert images other than inline. If that is what you currently have in your documents, I assume you can also retrieve these by looking in the Document object and then remove them?
The Document is explained here.
Although not a duplicate, you might also want to look at this question's answer where user "scanny" explains how he finds images using the library.
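Building on that idea, a rough sketch of removing the inline pictures by dropping their XML from the document might look like this (untested; it reaches into python-docx internals, so attribute names may differ between versions):

from docx import Document

doc = Document("input.docx")

# each InlineShape wraps a <wp:inline> element; removing the surrounding
# <w:drawing> element takes the picture out of the run that holds it
for shape in list(doc.inline_shapes):
    drawing = shape._inline.getparent()
    drawing.getparent().remove(drawing)

doc.save("draft_without_images.docx")

Note that the image parts themselves stay inside the package (just unreferenced), so the file still opens cleanly; combine this with the zip-level approach above if the image bytes themselves must go.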
I need to generate a customized PDF copy of a template document.
The easiest way - I thought - was to create a source PDF that has some placeholder text where the customization needs to happen, i.e. <first_name> and <last_name>, and then replace these with the correct values.
I've searched high and low, but is there really no way of basically taking the source template PDF, replace the placeholders with actual values and write to a new PDF?
I looked at PyPDF2 and ReportLab but neither seem to be able to do so.
Any suggestions? Most of my searches led to using a Perl app, CAM::PDF, but I'd prefer to keep it all in Python.
There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
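For instance, going the generate-instead-of-edit route with ReportLab (which the question already mentions) could be as simple as the following sketch, where the text, names and coordinates are just placeholders:

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def make_letter(path, first_name, last_name):
    c = canvas.Canvas(path, pagesize=letter)
    c.setFont("Helvetica", 12)
    # draw the "template" text with the customized values already substituted
    c.drawString(72, 720, "Dear {} {},".format(first_name, last_name))
    c.drawString(72, 700, "Thank you for your order.")
    c.showPage()
    c.save()

make_letter("out.pdf", "Jane", "Doe")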
If that's not an option, you can create a PDF form in something like Acrobat, then use a PDF manipulation library like iText (AGPL) or pdfbox, which has a nice clojure wrapper called pdfboxing that can handle some of that.
From my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for an iText license if you're using this for commercial purposes. I've had pretty good results writing Python wrappers around PDF-manipulation CLI tools like pdfboxing and ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.
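As a sketch of that wrapper approach (assuming Ghostscript is installed and on the PATH; the flags shown simply rewrite a PDF, as a stand-in for whatever manipulation you actually need):

import subprocess

def run_pdf_tool(*args):
    # thin wrapper around a PDF CLI tool; raises CalledProcessError on failure
    return subprocess.run([str(a) for a in args], check=True,
                          capture_output=True, text=True).stdout

run_pdf_tool("gs", "-dBATCH", "-dNOPAUSE", "-q",
             "-sDEVICE=pdfwrite", "-sOutputFile=out.pdf", "in.pdf")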
There is no definitive solution, but I found two approaches that work most of the time.
In Python, https://github.com/JoshData/pdf-redactor gives good results. Here is some example code:
import re
import pdf_redactor

options = pdf_redactor.RedactorOptions()

# Redact a specific name and anything that looks like a social security number,
# replacing the text with X's.
options.content_filters = [
    # First black out a specific name.
    (
        re.compile(u"Tom Xavier"),
        lambda m: "XXXXXXX"
    ),
    # Then do an actual SSN regex.
    # See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
    (
        re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
        lambda m: "XXX-XX-XXXX"
    ),
]

# Perform the redaction, reading the PDF from standard input and writing to standard output.
pdf_redactor.redactor(options)
The full example can be found here.
In Ruby, https://github.com/gettalong/hexapdf works for blacking out text.
Example code:
require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor
  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?
    if @to_hide_arr.include? boxes.string
      @canvas.stroke_color(0, 0, 0)
      boxes.each do |box|
        x, y = *box.lower_left
        tx, ty = *box.upper_right
        @canvas.rectangle(x, y, tx - x, ty - y).fill
      end
    end
  end

  alias :show_text_with_positioning :show_text
end

file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."

doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)
puts "Writing updated file [#{new_file_name}]."
With this you can black out only the selected strings, while the rest of the text remains visible.
As another solution, you may try the Aspose.PDF Cloud SDK for Python; it provides a feature to replace text in a PDF document.
First things first, install the Aspose.PDF Cloud SDK for Python:
pip install asposepdfcloud
Sample code to upload a PDF file to your cloud storage and replace multiple strings in the document:
import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
# Get App key and App SID from https://aspose.cloud
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
#upload PDF file to storage
pdf_api.upload_file(remote_name,filename)
#Replace Text
text_replace1 = asposepdfcloud.models.TextReplace(old_value='origami',new_value='aspose',regex='true')
text_replace2 = asposepdfcloud.models.TextReplace(old_value='candy',new_value='biscuit',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace1,text_replace2])
response = pdf_api.post_document_text_replace(remote_name, text_replace_list)
print(response)
I'm a developer evangelist at Aspose.
I have a high resolution triangular mesh with about 2 million triangles. I want to reduce the number of triangles and vertices to about ~10000 each, while preserving its general shape as much as possible.
I know this can be done in MATLAB using reducepatch. Another alternative is the qslim package. There is also decimation functionality in VTK, which has a Python interface, so technically it is possible in Python as well. MeshLab is probably usable from Python as well (?).
How can I do this kind of mesh decimation in python? Examples would be greatly appreciated.
Here is a minimal Python prototype translated from the equivalent C++ VTK example (http://www.vtk.org/Wiki/VTK/Examples/Cxx/Meshes/Decimation), as MrPedru22 suggested.
from vtk import (vtkSphereSource, vtkPolyData, vtkDecimatePro)

def decimation():
    sphereS = vtkSphereSource()
    sphereS.Update()

    inputPoly = vtkPolyData()
    inputPoly.ShallowCopy(sphereS.GetOutput())

    print("Before decimation\n"
          "-----------------\n"
          "There are " + str(inputPoly.GetNumberOfPoints()) + " points.\n"
          "There are " + str(inputPoly.GetNumberOfPolys()) + " polygons.\n")

    decimate = vtkDecimatePro()
    decimate.SetInputData(inputPoly)
    decimate.SetTargetReduction(.10)
    decimate.Update()

    decimatedPoly = vtkPolyData()
    decimatedPoly.ShallowCopy(decimate.GetOutput())

    print("After decimation \n"
          "-----------------\n"
          "There are " + str(decimatedPoly.GetNumberOfPoints()) + " points.\n"
          "There are " + str(decimatedPoly.GetNumberOfPolys()) + " polygons.\n")

if __name__ == "__main__":
    decimation()
I would recommend you to use vtkQuadricDecimation; the quality of the output model is visually better than with vtkDecimatePro (at least without careful tuning of the latter's settings).
from vtk import vtkQuadricDecimation

decimate = vtkQuadricDecimation()
decimate.SetInputData(inputPoly)
decimate.SetTargetReduction(0.9)  # ask for a 90% reduction in the number of triangles
One of the most important things is to use Binary representation when saving STL:
stlWriter = vtkSTLWriter()
stlWriter.SetFileName(filename)
stlWriter.SetFileTypeToBinary()
stlWriter.SetInputConnection(decimate.GetOutputPort())
stlWriter.Write()
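To run this on an actual high-resolution mesh rather than the sphere from the first example, the input polydata can be read from an STL file in the same way (a small sketch; the file name is a placeholder):

from vtk import vtkSTLReader, vtkPolyData

reader = vtkSTLReader()
reader.SetFileName("high_res_mesh.stl")
reader.Update()

inputPoly = vtkPolyData()
inputPoly.ShallowCopy(reader.GetOutput())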
An elegant Python decimation tool built on MeshLab (mainly the MeshlabXML library) can be found in Dr. Hussein Bakri's repository:
https://github.com/HusseinBakri/3DMeshBulkSimplification
I use it all the time. Have a look at the code.
Another option is to use the open-source library MeshLib, which can be called both from C++ and Python code (for Python it is installed via pip).
The decimation code will look like:
import meshlib.mrmeshpy as mr
# load high-resolution mesh:
mesh = mr.loadMesh(mr.Path("busto.stl"))
# decimate it with max possible deviation 0.5:
settings = mr.DecimateSettings()
settings.maxError = 0.5
result = mr.decimateMesh(mesh, settings)
print(result.facesDeleted)
# 708298
print(result.vertsDeleted)
# 354149
# save low-resolution mesh:
mr.saveMesh(mesh, mr.Path("simplified-busto.stl"))
Visually both meshes look as follows:
I have a question regarding logging for somescript.py
The script performs some actions to find matches for words the user is looking for in some pages that have become unreadable due to re-formatting and printing of the pages.
Because of this, OCR techniques don't work for us anymore, so I've come up with a script that compares contours of words to find matches.
The script looks something like:
import cv2
from cv2 import *
import numpy as np
method = cv.CV_TM_SQDIFF_NORMED
template_name = "this.png"
image_name = "3.tif"
needle = cv2.imread(template_name)
haystack = cv2.imread(image_name)
# Convert to gray:
needle_g = cv2.cvtColor(needle, cv2.CV_32FC1)
haystack_g = cv2.cvtColor(haystack, cv2.CV_32FC1)
# Attempt match
d = cv2.matchTemplate(needle_g, haystack_g, cv2.cv.CV_TM_SQDIFF_NORMED)
#we want the minimum squared difference
mn,_,mnLoc,_ = cv2.minMaxLoc(d)
print mnLoc
# Draw the rectangle
MPx,MPy = mnLoc
trows,tcols = needle_g.shape[:2]
#Normed methods give better results, ie matchvalue = [1,3,5], others sometimes show errors
cv2.rectangle(haystack, (MPx,MPy),(MPx+tcols,MPy+trows),(0,0,255),2)
cv2.imshow('output',haystack)
cv2.waitKey(0)
import sys
sys.exit(0)
Now I want to log the various tasks that the script performs, like:
converting the image to grayscale
attempting a match
drawing the rectangle
I have seen a few scripts on Stack Overflow explaining how to log an entire script or its entire output, but I haven't found anything that just logs a few actions.
Also, I would like to add the date and time each activity was performed.
Furthermore, I have written a function that calculates an MD5 and SHA1 hash of the input files (in this particular case 'this.png' and '3.tif'). I have yet to integrate this piece of code, but would it be easy to log that as well?
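For illustration, a minimal sketch of the kind of logging (and hash logging) I have in mind might look like this (untested; the logger name and log file are just placeholders):

import hashlib
import logging

# timestamped log lines, written to a file next to the script
logging.basicConfig(
    filename="somescript.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("somescript")

log.info("converting images to grayscale")
# ... cv2.cvtColor(...) goes here ...
log.info("attempting match")
# ... cv2.matchTemplate(...) goes here ...
log.info("drawing rectangle")

# the file hashes could be logged the same way
for name in ("this.png", "3.tif"):
    with open(name, "rb") as fh:
        data = fh.read()
    log.info("%s md5=%s sha1=%s", name,
             hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest())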
I am a Python noob, so if the answers are obvious to you guys, you know why I couldn't figure it out myself.
I hope you can help me out on this one!