An efficient way to convert document to pdf format

I have been trying to find an efficient way to convert documents (e.g. doc, docx, ppt, pptx) to PDF. So far I have tried docsplit and oowriter, but both took more than 10 seconds to convert a 1.7 MB pptx file. Can anyone suggest a better way, or improvements to my approach?
What I have tried:
from subprocess import Popen, PIPE
import time
def convert(src, dst):
    d = {'src': src, 'dst': dst}
    commands = [
        '/usr/bin/docsplit pdf --output %(dst)s %(src)s' % d,
        'oowriter --headless -convert-to pdf:writer_pdf_Export %(dst)s %(src)s' % d,
    ]
    for i in range(len(commands)):
        command = commands[i]
        st = time.time()
        process = Popen(command, stdout=PIPE, stderr=PIPE, shell=True) # I am aware of consequences of using `shell=True`
        out, err = process.communicate()
        errcode = process.returncode
        if errcode != 0:
            raise Exception(err)
        en = time.time() - st
        print 'Command %s: Completed in %s seconds' % (str(i+1), str(round(en, 2)))
if __name__ == '__main__':
    src = '/path/to/source/file/'
    dst = '/path/to/destination/folder/'
    convert(src, dst)
Output:
Command 1: Completed in 11.91 seconds
Command 2: Completed in 11.55 seconds
Environment:
Linux - Ubuntu 12.04
Python 2.7.3
More tools result:
jodconverter took 11.32 seconds

Try calling unoconv from your Python code. It took about 8.6 seconds on my local machine; I don't know if that's fast enough for you:
time unoconv 15.\ Text-Files.pptx
real 0m8.604s
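Calling unoconv from Python could look roughly like the sketch below. The function names and the paths are made up for illustration; the only unoconv options used are -f (output format) and -o (output location). Running a persistent listener (unoconv --listener) may also save the office start-up time on each call, though I have not measured that here.

```python
import subprocess

def build_unoconv_command(src, out_dir, fmt="pdf"):
    """Return the argument list that converts src and writes into out_dir."""
    # -f selects the output format, -o the output location.
    return ["unoconv", "-f", fmt, "-o", out_dir, src]

def convert_with_unoconv(src, out_dir):
    """Run unoconv; raises CalledProcessError if it exits with an error."""
    # Passing a list (no shell=True) avoids quoting problems with file
    # names such as "15. Text-Files.pptx".
    subprocess.check_call(build_unoconv_command(src, out_dir))

# Hypothetical paths, shown only to illustrate the resulting command line.
print(build_unoconv_command("/path/to/15. Text-Files.pptx", "/path/to/out/"))
```

Calling convert_with_unoconv(src, out_dir) then takes the place of one entry in the commands list from the question.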

Pandoc is a wonderful tool capable of doing what you'd like quickly. Since you're using Popen to effectively shell out the command for the tool, it doesn't matter what language the tool is written in (Pandoc is written in Haskell).

Unfortunately I don't have the time to do a full benchmark, but you may want to check out xtopdf, my Python toolkit for PDF creation. It doesn't do the full range of conversions you want, and some of the conversions have limitations, but it may be of use. xtopdf links:
Online presentation about xtopdf - a good summary of what it is, what it does, platforms, features, users, uses etc.: http://slid.es/vasudevram/xtopdf
xtopdf on Bitbucket: https://bitbucket.org/vasudevram/xtopdf
Many blog posts showing how to use xtopdf for various purposes, including many that show how to use it to convert different input formats to PDF: http://jugad2.blogspot.com/search/label/xtopdf
HTH,
Vasudev Ram

For doc and docx (but not ppt/pptx), you could try our independent (but commercial) high fidelity rendering engine online at OnlineDemo/docx_to_pdf
By "high fidelity", I mean it is designed from the ground up to have the same line and paragraph breaks, tab stops etc etc as Microsoft Word.

Related

use Acrobat to reduce file size with Python or from command line

I would like to reduce the size of some PDF files. There are many ways to do so, but most of them don't work for my purposes. For example, pdftk, cpdf, and pdfoptsize all fail to reduce the sizes of my files. Ghostscript can reduce the file size, but only at an unacceptable cost in terms of legibility of the figures. There seem to be some great APIs for size reduction, but I don't want to pay. So I would like to automate the "Reduce File Size" option in Acrobat, which works well. Is there a way to do this in Python or from the command line?
I am running Windows 10 with Acrobat DC; I also have access to Acrobat X. I can set up a "Batch Processing" job in Acrobat, but even then, I would need to run it from Python or from the command line.
I can use the Acrobat API from Python, but I don't see how to use it to run the "Reduce File Size" command. I can set the PDSaveCollectGarbage flag, but it doesn't help. Here is a minimal example of a Python script that opens and resaves a file -- it illustrates the extent of my knowledge in this area:
import os
from win32com.client.dynamic import Dispatch
src = os.path.abspath('original.pdf')
PDSaveFull = 0x01
PDSaveCollectGarbage = 0x20
SAVEFLAG = PDSaveFull | PDSaveCollectGarbage
try:
    app = Dispatch("AcroExch.AVDoc")
    if app.Open(src, src):
        pddoc = app.GetPDDoc()
        pddoc.Save(SAVEFLAG, os.path.abspath('./new.pdf'))
except Exception as e:
    print(str(e))
finally:
    app.Close(-1)

FFmpeg-split.py can't determine video length

First, I'm not a developer. I'm trying to split a movie into 1-minute clips using the ffmpeg-split.py Python script. I made sure FFmpeg is installed by trying a simple command, and it worked like magic:
ffmpeg -i soccer.mp4 -ss 00:00:00 -codec copy -t 10 soccer1.mp4
A new video file was created in the same folder.
I saved ffmpeg-split.py in the same directory, updated the Python PATH, and typed the following command:
python ffmpeg-split.py -f soccer.mp4 -s 10
What I got back was:
can't determine video length
I believe it just can't find the file. I tried a different video file, and even deleted the file, and got the same message.
Any ideas?

That's the first time I've seen that script. Since you were able to run ffmpeg from the command line and can run basic Python, I recommend following my example instead; it avoids any directory or path handling problems in that script (which I have not looked at).
Assuming you ran
ffmpeg -i soccer.mp4 ...stuff... soccer1.mp4
from a Windows command line, it would be better to write
ffmpeg -t 10 -i "Z:\\full\\input\\path.mp4" -c copy "Z:\\full\\output\\path.mp4"
This says: run ffmpeg; -t gives the duration to read, in seconds; -i says the input file comes next; the full input path is in quotes because it may contain spaces; -c copy copies all streams without re-encoding; and the full output path is quoted too, to be safe. Nothing else is needed.
Calling ffmpeg from Python on Windows works great for this:
import subprocess
def pegRunner(cmd): # Takes a list of strings that we pass to Windows.
    command = list(cmd) # "peg" is short for mpeg
    result = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, err = result.communicate() # err is None: stderr is merged into stdout
    print(result.returncode)
    return "pegRannered"
#########
# Find the duration from properties or something. If you need to do this
# often it's more complicated. Let's say you found 4mins33secs.
############
leng = 4*60+33 # time in seconds
last_dur = leng % 60 # remaining time after the 4 one-minute videos
if last_dur == 0:
    num_vids = leng // 60
else:
    num_vids = leng // 60 + 1
for i in range(num_vids):
    da_command = ['ffmpeg']
    da_command.append('-ss')
    da_command.append(str(i*60))
    da_command.append('-t')
    # Every clip is 60 seconds long, except a shorter final one.
    if i == num_vids - 1 and last_dur != 0:
        da_command.append(str(last_dur))
    else:
        da_command.append('60')
    da_command.append('-i')
    da_command.append('Z:\\full\\input\\path.mp4') # this format!
    da_command.append('-c')
    da_command.append('copy')
    # optionally, to overwrite existing files: da_command.append('-y')
    da_command.append('Z:\\full\\output\\path\\filename_'+str(i)+'.mp4')
    print(pegRunner(da_command))
    print("Finished "+str(i+1)+" filez.")
This should handle the one-minute pieces and provide a good starting point for running ffmpeg from Python.
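If you would rather not look up the duration by hand, something like the following might work. It is a rough sketch, not what ffmpeg-split.py does: the function names are my own, and probe_duration assumes the ffprobe tool that ships with FFmpeg is on the PATH.

```python
import subprocess

def plan_segments(total_seconds, segment_seconds=60):
    """Return (start, duration) pairs that together cover total_seconds."""
    segments = []
    start = 0
    while start < total_seconds:
        # The final segment is shortened to whatever is left.
        duration = min(segment_seconds, total_seconds - start)
        segments.append((start, duration))
        start += segment_seconds
    return segments

def probe_duration(path):
    """Ask ffprobe for the container duration in seconds (a float)."""
    out = subprocess.check_output([
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        path,
    ])
    return float(out.decode().strip())

# 4 minutes 33 seconds, as in the example above.
print(plan_segments(273))
```

Each (start, duration) pair maps onto the -ss and -t arguments of one ffmpeg call, and plan_segments(probe_duration(path)) would replace the hard-coded length.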

is there a better way to extract the data from the output using python code

I'm writing some beginner Python code to list the software installed on the system the code runs on. If a piece of software is not installed, I plan to tell the user.
The output of dpkg -l looks something like this (a snippet):
----------------
ii git 1:1.7.9.5-1 fast, scalable, distributed revision control system
ii git-man 1:1.7.9.5-1 fast, scalable, distributed revision control system (manual pages)
import subprocess
c = subprocess.Popen(['dpkg','-l'],stdout=subprocess.PIPE,stderr=subprocess.PIPE)
list_of_packages,error = c.communicate()
for item in list_of_packages.split('\n'):
    print item.split('ii')[-1]
This splits each line, but it looks like I will have to apply a few more splits to get the data I need: git and 1.7.9.5 (the package name and version).
I'm just trying to figure out if there is a better way of achieving this.
Please advise.
Thanks,
-Vijay

Trying to parse human-readable output is fragile, as you've observed. Fortunately you can replace dpkg -l with dpkg-query -W -f='${Package}\t${Version}\n', which is designed to produce machine-readable output. See http://manpages.ubuntu.com/manpages/lucid/man1/dpkg-query.1.html for the full list of options to dpkg-query.
>>> args = ["dpkg-query", "-W", "-f=${Package}\t${Version}\n"]
>>> out, err = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
>>> print out #output is summarized, clearly
git 1:1.7.9.5-1
git-man 1:1.7.9.5-1
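Once the output is tab-separated, turning it into a lookup table takes only a few lines. A small sketch, where parse_package_versions is a name I made up and the sample string stands in for the real dpkg-query output; it assumes every listed package should count as installed:

```python
def parse_package_versions(text):
    """Map package name -> version from tab-separated dpkg-query output."""
    versions = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, version = line.partition("\t")
        versions[name] = version
    return versions

# Sample input in the "${Package}\t${Version}\n" format shown above.
sample = "git\t1:1.7.9.5-1\ngit-man\t1:1.7.9.5-1\n"
installed = parse_package_versions(sample)
for wanted in ("git", "mercurial"):
    if wanted in installed:
        print("%s %s is installed" % (wanted, installed[wanted]))
    else:
        print("%s is not installed" % wanted)
```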

dpkg -l also outputs lines that do not start with ii (the header, and packages in other states), and a package name could itself contain ii.
I would do it this way:
for item in list_of_packages.splitlines():
    if item.startswith('ii'):
        print item[4:]
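To go one step further and pull out the name and version, splitting on whitespace should be enough, since neither field contains spaces. A sketch under that assumption (installed_packages is a hypothetical helper, and the rc line in the sample is invented to show a non-installed state):

```python
def installed_packages(dpkg_l_output):
    """Return {name: version} for lines whose status column is 'ii'."""
    packages = {}
    for line in dpkg_l_output.splitlines():
        fields = line.split()
        # Column 0 is the status, column 1 the name, column 2 the version.
        if len(fields) >= 3 and fields[0] == "ii":
            packages[fields[1]] = fields[2]
    return packages

sample = """\
ii  git      1:1.7.9.5-1  fast, scalable, distributed revision control system
ii  git-man  1:1.7.9.5-1  fast, scalable, distributed revision control system (manual pages)
rc  oldpkg   0.1-1        invented example of a removed package
"""
print(installed_packages(sample))
```

Comparing the first field with == rather than using startswith also skips header lines that merely begin with similar characters.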

Automating PDF generation

What would be a solid tool to use for generating PDF reports? Particularly, we are interested in creating interactive PDFs that have video, like the example found here.
Right now we are using Python and reportlab to generate PDFs, but have not explored the library completely (mostly because the license pricing is a little prohibitive).
We have been looking at Adobe's SDK and the iText libraries, but it's hard to say what the capabilities of either are.
The ability to generate a document from a template PDF would be a plus.
Any pointers or comments will be appreciated.
Thanks,

Recently, I needed to create PDF reports for a Django application; a ReportLab license was available, but I ended up choosing LaTeX. The benefit of this approach is that we could use Django templates to generate the LaTeX source, and not get encumbered with writing lots of code for the many reports we needed to create. Plus, we could take advantage of the much more concise LaTeX syntax (which does have its many quirks and is not suitable for every purpose).
This snippet provides a general overview of the approach. I found it necessary to make some changes, which I have provided at the end of this answer. The main addition is detection of "Rerun LaTeX" messages, which indicate that an additional pass is required. Usage is as simple as:
def my_view(request):
    pdf_stream = process_latex(
        'latex_template.tex',
        context=RequestContext(request, {'context_obj': context_obj})
    )
    return HttpResponse(pdf_stream, content_type='application/pdf')
It is possible to embed videos in LaTeX-generated PDFs; however, I do not have any experience with it. Here is a top Google result.
This solution does require spawning a new process (pdflatex), so if you want a pure Python solution keep looking.
import os
from subprocess import Popen, PIPE
from tempfile import NamedTemporaryFile
from django.template import loader, Context
class LaTeXException(Exception):
    pass
def process_latex(template, context={}, type='pdf', outfile=None):
    """
    Processes a template as a LaTeX source file.
    Output is either being returned or stored in outfile.
    At the moment only pdf output is supported.
    """
    t = loader.get_template(template)
    c = Context(context)
    r = t.render(c)
    tex = NamedTemporaryFile()
    tex.write(r)
    tex.flush()
    base = tex.name
    names = dict((x, '%s.%s' % (base, x)) for x in (
        'log', 'aux', 'pdf', 'dvi', 'png'))
    output = names[type]
    stdout = None
    if type == 'pdf' or type == 'dvi':
        stdout = pdflatex(base, type)
    elif type == 'png':
        stdout = pdflatex(base, 'dvi')
        out, err = Popen(
            ['dvipng', '-bg', '-transparent', names['dvi'], '-o', names['png']],
            cwd=os.path.dirname(base), stdout=PIPE, stderr=PIPE
        ).communicate()
    os.remove(names['log'])
    os.remove(names['aux'])
    # pdflatex appears to ALWAYS return 1, never returning 0 on success, at
    # least on the version installed from the Ubuntu apt repository.
    # so instead of relying on the return code to determine if it failed,
    # check if it successfully created the pdf on disk.
    if not os.path.exists(output):
        details = '*** pdflatex output: ***\n%s\n*** LaTeX source: ***\n%s' % (
            stdout, r)
        raise LaTeXException(details)
    if not outfile:
        o = file(output).read()
        os.remove(output)
        return o
    else:
        os.rename(output, outfile)
def pdflatex(file, type='pdf'):
    out, err = Popen(
        ['pdflatex', '-interaction=nonstopmode', '-output-format', type, file],
        cwd=os.path.dirname(file), stdout=PIPE, stderr=PIPE
    ).communicate()
    # If the output tells us to rerun, do it by recursing over ourself.
    if 'Rerun LaTeX.' in out:
        return pdflatex(file, type)
    else:
        return out
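One caveat with the rerun check: the recursion has no upper bound, so a document that always asks for another pass would never finish. A bounded variant might look like the sketch below. The name run_until_stable, the max_passes limit and the fake runner are my own additions for illustration, not part of the snippet above.

```python
def run_until_stable(run_once, max_passes=5, marker="Rerun LaTeX."):
    """Call run_once() until its output stops asking for a rerun.

    run_once is any callable returning the tool's output as a string,
    for example a wrapper around a single pdflatex invocation.
    Returns (output_of_last_pass, number_of_passes).
    """
    output = ""
    for passes in range(1, max_passes + 1):
        output = run_once()
        if marker not in output:
            return output, passes
    raise RuntimeError("still asked to rerun after %d passes" % max_passes)

# Demonstration with a fake runner that asks for a rerun twice.
fake_outputs = iter([
    "pass 1 ... Rerun LaTeX.",
    "pass 2 ... Rerun LaTeX.",
    "pass 3 ... done",
])
final_output, passes = run_until_stable(lambda: next(fake_outputs))
print(passes)
```

In process_latex, run_once would be a small function that performs one Popen call and returns its stdout.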

I suggest using https://github.com/mreiferson/py-wkhtmltox to render HTML to PDF,
and any tool you like to render the reports as HTML. I like http://www.makotemplates.org/

Is there any way to get ps output programmatically?

I've got a webserver that I'm presently benchmarking for CPU usage. What I'm doing is essentially running one process to slam the server with requests, then running the following bash script to determine the CPU usage:
#! /bin/bash
for (( ;; ))
do
    echo "`python -c 'import time; print time.time()'`, `ps -p $1 -o '%cpu' | grep -vi '%CPU'`"
    sleep 5
done
It would be nice to be able to do this in Python so I can run it in one script instead of having to run two. I can't seem to find any platform-independent way (or at least one that works on both Linux and OS X) to get the ps output in Python without actually launching another process to run the command. I can do that, but it would be really nice if there were an API for doing this.
Is there a way to do this, or am I going to have to launch the external script?

You could check out this question about parsing ps output using Python.
One of the answers suggests using the PSI python module. It's an extension though, so I don't really know how suitable that is for you.
It also shows in the question how you can call a ps subprocess using python :)

My preference is to do something like this.
collection.sh
for (( ;; ))
do
    date; ps -p $1 -o '%cpu'
    sleep 5   # same sampling interval as in the question
done
Then run collection.sh >someFile while you "slam the server with requests".
Then kill this collection.sh operation after the server has been slammed.
At the end, you'll have a file with your log of date stamps and CPU values.
analysis.py
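import datetime
date = None
with open("someFile", "r") as source:
    for line in source:
        line = line.strip()
        if not line or line == "%CPU": continue
        try:
            # A line that parses as a timestamp is the output of `date`...
            date = datetime.datetime.strptime(line, "%a %b %d %H:%M:%S %Z %Y")
        except ValueError:
            # ...anything else is the CPU value that ps printed after it.
            cpu = float(line)
            print("%s %s" % (date, cpu)) # or whatever else you want to do with this data.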
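If a single Python script is the goal, the collection loop itself can also be written in Python. This still launches ps for every sample, which the question says is acceptable as a fallback. A sketch, with function names of my own choosing and subprocess.run (Python 3.5+) assumed to be available:

```python
import subprocess
import time

def parse_ps_cpu(ps_output):
    """Extract the %CPU value from the output of `ps -p PID -o %cpu`."""
    lines = [line.strip() for line in ps_output.splitlines() if line.strip()]
    # The first line is the "%CPU" header, the second the value.
    # If the process has exited there is no value line.
    if len(lines) < 2:
        return None
    return float(lines[1])

def sample_cpu(pid, interval=5, samples=3):
    """Yield (timestamp, cpu_percent) pairs, one every `interval` seconds."""
    for _ in range(samples):
        proc = subprocess.run(
            ["ps", "-p", str(pid), "-o", "%cpu"],
            stdout=subprocess.PIPE, universal_newlines=True,
        )
        yield time.time(), parse_ps_cpu(proc.stdout)
        time.sleep(interval)

# Parsing a captured sample; no process is started here.
print(parse_ps_cpu("%CPU\n 12.5\n"))
```

Iterating over sample_cpu(server_pid) in the script that generates the load would give the same timestamp and CPU pairs as the bash loop.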

You could query the CPU usage with PySNMP. This has the added benefit of being able to take measurements from a remote computer. For that matter, you could install a VM of Zenoss or its kin, and let it do the monitoring for you.

If you don't want to invoke ps, why not try the /proc file system? You can write your Python program to read the files under /proc and extract the data you want. I did this in Perl, with inlined C code in the Perl script, and I think you can find a similar way in Python. It's doable, but you need to go through the /proc file system and figure out what you want and how to get it.
http://www.faqs.org/docs/kernel/x716.html
The URL above might give you an initial push.
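As a starting point in Python, the accumulated CPU time of a process can be read from /proc/<pid>/stat, where fields 14 and 15 hold the user and system time in clock ticks. This is a Linux-only sketch (OS X has no /proc); the helper names and the sample line are invented for illustration.

```python
import os

def cpu_seconds_from_stat(stat_text, ticks_per_second):
    """Return user + system CPU seconds from the text of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before splitting on whitespace.
    after_comm = stat_text.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so fields 14 (utime) and
    # 15 (stime) sit at indexes 11 and 12.
    utime, stime = int(after_comm[11]), int(after_comm[12])
    return (utime + stime) / float(ticks_per_second)

def read_cpu_seconds(pid):
    """Read the accumulated CPU time of a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as handle:
        return cpu_seconds_from_stat(handle.read(), os.sysconf("SC_CLK_TCK"))

# Invented sample line: utime=250 and stime=50 ticks at 100 ticks/second.
sample = "1234 (my server) S 1 1234 1234 0 -1 4194304 100 0 0 0 250 50 0 0 20 0 1 0 5000 1000000 200"
print(cpu_seconds_from_stat(sample, 100))
```

To get a percentage over an interval, take two readings, subtract them, and divide by the elapsed wall-clock time.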
