Python: How to replace text in a PDF

I have a PDF file and I want to replace some text in it and generate a new PDF. How can I do that in Python?
I have tried reportlab, but it does not have any function to search for text and replace it. What other module can I use?

You can try the Aspose.PDF Cloud SDK for Python. Aspose.PDF Cloud is a REST API for PDF processing. It is a paid API, and its free plan provides 50 credits per month.
I'm a developer evangelist at Aspose.
import os
import asposepdfcloud
from asposepdfcloud.apis.pdf_api import PdfApi
# Get App key and App SID from https://cloud.aspose.com
pdf_api_client = asposepdfcloud.api_client.ApiClient(
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    app_sid='xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx')
pdf_api = PdfApi(pdf_api_client)
filename = '02_pages.pdf'
remote_name = '02_pages.pdf'
copied_file = '02_pages_new.pdf'
# Upload the PDF file to storage
pdf_api.upload_file(remote_name, filename)
# Copy the uploaded file so the original stays untouched
pdf_api.copy_file(remote_name, copied_file)
# Replace text in the copy
text_replace = asposepdfcloud.models.TextReplace(old_value='origami',new_value='polygami',regex='true')
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace])
response = pdf_api.post_document_text_replace(copied_file, text_replace_list)
print(response)

Have a look at this thread for one of the many ways to read text from a PDF. You will then need to create a new PDF yourself because, as far as I know, these approaches will not retrieve any formatting for you.

The CAM::PDF Perl library can output text that's not too hard to parse (although it seems to split lines of text fairly randomly). I couldn't be bothered to learn too much Perl, so I wrote two really basic Perl command-line scripts. One reads a single-page PDF to a text file: perl read.pl pdfIn.pdf textOut.txt. The other writes the text (which you can modify in the meantime) back to a PDF: perl write.pl pdfIn.pdf textIn.txt pdfOut.pdf.
#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";
$pdfIn = $ARGV[0];
$textOut = $ARGV[1];
$pdf = CAM::PDF->new($pdfIn);
$page = $pdf->getPageContent(1);
open(my $fh, '>', $textOut);
print $fh $page;
close $fh;
exit;
and
#!/usr/bin/perl
use Module::Load;
load "CAM::PDF";
$pdfIn = $ARGV[0];
$textIn = $ARGV[1];
$pdfOut = $ARGV[2];
$pdf = CAM::PDF->new($pdfIn);
my $page;
open(my $fh, '<', $textIn) or die "cannot open file $textIn";
{
local $/;
$page = <$fh>;
}
close($fh);
$pdf->setPageContent(1, $page);
$pdf->cleanoutput($pdfOut);
exit;
You can call these from Python on either side of doing some regex (or similar) processing on the output text file.
If you're completely new to Perl (like I was), make sure that Perl and CPAN are installed, then run sudo cpan and, at the prompt, enter install "CAM::PDF";. This installs the required modules.
Also, I realise that I should probably be using stdout etc., but I was in a hurry :-)
Also also, any ideas what format CAM::PDF outputs? Is there any documentation for it?
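Calling the two scripts from Python could look roughly like the sketch below. It assumes read.pl and write.pl sit in the current directory and that perl is on the PATH. The helper names (replace_in_content, replace_pdf_text) and the page.txt work file are made up for illustration. It also assumes the text you want to change appears as one contiguous string in the dumped page content, which is not guaranteed for every PDF.

```python
import subprocess
from pathlib import Path


def replace_in_content(content: bytes, old: str, new: str) -> bytes:
    """Replace every literal occurrence of `old` in the dumped page content."""
    # Work on bytes so nothing else in the dump is re-encoded by accident.
    return content.replace(old.encode("latin-1"), new.encode("latin-1"))


def replace_pdf_text(pdf_in: str, pdf_out: str, old: str, new: str,
                     work_file: str = "page.txt") -> None:
    """Dump page 1 with read.pl, edit the dump, write it back with write.pl."""
    subprocess.run(["perl", "read.pl", pdf_in, work_file], check=True)
    dump = Path(work_file)
    dump.write_bytes(replace_in_content(dump.read_bytes(), old, new))
    subprocess.run(["perl", "write.pl", pdf_in, work_file, pdf_out], check=True)


# Example call (needs Perl, CAM::PDF and both scripts):
# replace_pdf_text("pdfIn.pdf", "pdfOut.pdf", "origami", "polygami")
```

Using check=True makes Python raise an exception if either Perl script exits with an error, instead of silently continuing with a stale text file.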

Related

How to stop printing the properties of an .rtf file out when I use print(file.read()) in python

I am new to coding in Python and have trouble when I print the contents of a file (I have only tried .rtf), as it displays all the file properties. I've tried a variety of ways to code the same thing, but the output is always similar. Example of the code and the output:
opener=open("file.rtf","r")
print(opener.read())
opener.close()
The file only contains this:
Camila
Employee
Try it
But the outcome is always:
{\rtf1\ansi\ansicpg1252\cocoartf1671\cocoasubrtf600
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
{\*\expandedcolortbl;;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
\f0\fs24 \cf0 Camila\
\
Employees\
\
Try it}
Help? How to stop that from happening or what am I doing wrong?
The RTF file type contains more information than just the text, such as fonts.
Python reads the RTF file as plain text and therefore includes this information.
If you want to get the plain text, you need a module that can translate it, like striprtf.
Make sure the module is installed by running this on the command line:
pip install striprtf
Then, to get your text:
from striprtf.striprtf import rtf_to_text
with open("file.rtf", "r") as file:
    plaintext = rtf_to_text(file.read())
Use this package https://github.com/joshy/striprtf.
from striprtf.striprtf import rtf_to_text
rtf = "some rtf encoded string"
text = rtf_to_text(rtf)
print(text)

Python & MS Word: Convert .doc to .docx?

I found several questions that were similar to mine, but none of the answers came close to what I need.
Specifications: I'm working with Python 3 and do not have MS Word. My programming machine runs OS X and the cloud machine runs Linux (Ubuntu).
I'm using python-docx to extract values from a .doc file that is sent to me nightly. However, python-docx only works with .docx files, so I need to convert the file to that extension first.
So, I've got a .doc file that I need to convert to .docx. This script might have to run in the cloud so I can't install any kind of Office or Office-like software. Can this be done?
Since you are working with Linux/Ubuntu, you can use LibreOffice's built-in converter.
Syntax:
lowriter --convert-to docx *.doc
Example:
lowriter --convert-to docx testdoc.doc
The first form converts all .doc files in the current folder to .docx and saves the results in that same folder; the example converts a single file.
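Since the question is about a Python script, the converter can be driven through subprocess. The following is only a sketch: the function names are hypothetical, and it assumes lowriter is on the PATH. The --headless and --outdir options are standard LibreOffice command-line switches that suppress the GUI and choose the output directory.

```python
import subprocess
from pathlib import Path


def convert_command(doc_path: str, out_dir: str) -> list:
    """Build the LibreOffice command line for one .doc file."""
    return ["lowriter", "--headless", "--convert-to", "docx",
            "--outdir", out_dir, doc_path]


def convert_to_docx(doc_path: str, out_dir: str = ".") -> Path:
    """Convert doc_path and return the path where the .docx is expected."""
    subprocess.run(convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


# Example call (needs LibreOffice installed):
# docx_path = convert_to_docx("testdoc.doc")
```

The returned path can then be passed straight to python-docx.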
You could use unoconv (Universal Office Converter), which converts between any document formats supported by LibreOffice/OpenOffice.
unoconv -d document --format=docx *.doc
From Python:
import subprocess
filename = 'testdoc.doc'  # placeholder: the .doc file to convert
subprocess.call(['unoconv', '-d', 'document', '--format=docx', filename])
Aspose.Words Cloud SDK for Python can convert DOC to DOCX. The package can open, generate, edit, split, merge, compare and convert a Word document in Python on any platform without depending on MS Word.
It is a paid product, but the free plan provides 150 free monthly API calls.
P.S: I'm developer evangelist at Aspose.
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Get your credentials from https://dashboard.aspose.cloud (free registration is required).
words_api = asposewordscloud.WordsApi(app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx',app_key='xxxxxxxxxxxxxxxxxxxxxxxxx')
words_api.api_client.configuration.host = 'https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.doc'
dest_name = 'C:/Temp/02_pages.docx'
# Convert DOC to DOCX
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='docx')
result = words_api.convert_document(request)
copyfile(result, dest_name)
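Alternatively, using the aspose.words package:
import os
import aspose.words as aw
path1 = "doc file path"
path2 = "path to save converted file"
file = "example.doc"  # placeholder: name of the .doc file inside path1
file2 = file.rsplit('.', 1)[0] + '.docx'
filename1 = os.path.join(path2, file2)
filename = os.path.join(path1, file)
doc = aw.Document(filename)
doc.save(filename1)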
First you will need to be using Windows. If that is an acceptable barrier then please read on....
Next you need to install the Microsoft Office Compatibility Pack.
Now download and install the Microsoft Office Migration Planning Manager.
To run the tool you need to create a .ini file that controls the program. An example .ini file and further information is available on this blog post.
There is more detailed information from Microsoft here.

Reading a gnuplot file from a Python script

I am trying to run my gnuplot commands from a Python script. I came across a suggestion that I can save all the gnuplot commands in a file (with format codes for the variables I need to drop in) and then just read that file in with Python and format it. However, I couldn't find any examples to understand how it's done. Can somebody help me out?
PS: I am using gnuplot version 4.2 and I have already tried using -e pipe but for some reason it doesn't work.
PPS: I would prefer not to use Gnuplot-py package
A simple example: Start with a text file with formatting codes,
The {adjective1} {adjective2} {mammal1} has escaped!
Then in Python you can substitute values like so:
# load the template
with open("template.txt") as inf:
    template = inf.read()
# substitute values
result = template.format(
    adjective1 = "lazy",
    adjective2 = "brown",
    mammal1 = "bear"
)
# save the result
with open("plot.dem", "w") as outf:
    outf.write(result)
The plot.dem file now contains
The lazy brown bear has escaped!
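Applied to gnuplot, the same idea might look like the sketch below. The template contents, the placeholder names (output, title, datafile) and the file name plot.gp are all invented for illustration, and it assumes a gnuplot executable on the PATH that accepts a script file as its argument.

```python
import subprocess

# Hypothetical gnuplot script with str.format placeholders.
# A literal brace in the gnuplot script would have to be doubled ({{ or }}).
TEMPLATE = """\
set terminal png
set output "{output}"
set title "{title}"
plot "{datafile}" using 1:2 with lines
"""


def render(template: str, **values) -> str:
    """Fill the placeholders in the gnuplot template."""
    return template.format(**values)


def run_gnuplot(script_text: str, script_path: str = "plot.gp") -> None:
    """Write the rendered script to disk and hand it to gnuplot."""
    with open(script_path, "w") as outf:
        outf.write(script_text)
    subprocess.run(["gnuplot", script_path], check=True)


script = render(TEMPLATE, output="out.png", title="Demo", datafile="data.dat")
# run_gnuplot(script)  # uncomment when gnuplot and data.dat are available
```

Writing the script to a file and passing the file name avoids the -e/pipe route mentioned in the question.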

How to simulate command prompt in Perl and Python

I normally use WGET to download an image or two from some web-page, I do something like this from the command prompt: wget 'webpage-url' -P 'directory to where I wanna save it'. Now how do I automate it in Perl and Python? That is what command shall enable me to simulate as if I am entering the command at the command-prompt? In Python there are so many similar looking modules like subprocess, os, etc that I am quite confused.
In Perl, the easiest way is to use LWP::Simple:
use LWP::Simple qw(getstore);
getstore('http://www.example.com', '/path/to/saved/file.ext');
In Python, you can call wget through subprocess:
import subprocess
subprocess.call(["wget", "www.example.com", "-P", "/dir/to/save"])
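If you want to read a URL and process the response (urllib2 is Python 2 only; in Python 3 the same function is urllib.request.urlopen):
import urllib2
response = urllib2.urlopen('http://example.com/')
html = response.read()
How to extract images from the HTML is covered in other answers here on SO.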
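If you would rather not shell out to wget at all, a download into a chosen directory can be sketched with the Python 3 standard library alone. The helper names below are made up, and falling back to index.html for URLs without a file name is an assumption chosen to resemble what wget does.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlsplit


def target_path(url: str, directory: str) -> str:
    """Pick a local file name from the URL, similar in spirit to wget -P."""
    name = os.path.basename(urlsplit(url).path) or "index.html"
    return os.path.join(directory, name)


def download(url: str, directory: str) -> str:
    """Fetch url and store it under directory; returns the saved path."""
    os.makedirs(directory, exist_ok=True)
    destination = target_path(url, directory)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Example call (needs network access):
# download("http://example.com/", "/dir/to/save")
```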
In Perl you can also use qx(yourcommandhere), which calls an external program.
So, in your example: qx(wget 'webpage-url' -P '/home/myWebPages/'). That is enough for you.
But, as s0me0ne said, using LWP::Simple is better.
If you have a list of urls in a file, you can use this code:
my $fh; # file handle
open $fh, "<", "fileWithUrls.txt" or die "can't find file with urls!";
my @urls = <$fh>; # read all URLs, one per row of the file
chomp @urls; # strip the trailing newlines
my $wget = '/path/to/wget.exe';
for my $url (@urls) {
    qx($wget $url -P '/home/myWebPages/');
}

Remote control or script Open Office to edit Word document from Python

I want to (preferably on Windows) start Open Office on a particular document, search for a fixed string and replace it with another string selected by my program.
How do I do that, from an external Python program? OLE-something? The native Python scripting solution?
(The document is in the Word 97-2003 format, but that is probably not relevant?)
I'd say using the Python-UNO bridge. Does this work for you?
import uno
ctx = uno.getComponentContext()
service_manager = ctx.getServiceManager()
desktop = service_manager.createInstanceWithContext("com.sun.star.frame.Desktop", ctx)
document = desktop.loadComponentFromURL("file:///file.doc", "_blank", 0, ())
replace_desc = document.createReplaceDescriptor()
replace_desc.setSearchString("text_to_replace")
find_iter = document.findFirst(replace_desc)
while find_iter:
    find_iter.String = "replacement_text"
    find_iter = document.findNext(find_iter.End, replace_desc)
See the XSearchable docs for details on searching. Also, make sure to have OpenOffice started with the following command line: swriter "-accept=socket,host=localhost,port=2002;urp;".
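When the script runs outside the office process, it usually has to connect to the listening instance first. The sketch below shows one way this is commonly done with the UNO URL resolver. It assumes the office was started with the -accept option shown above and that the uno module shipped with OpenOffice is importable. The function names are hypothetical, and replace_all uses the document's replaceAll method as an alternative to the find loop above.

```python
def uno_url(host: str = "localhost", port: int = 2002) -> str:
    """Connection string matching the -accept=socket,... option above."""
    return f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext"


def connect(host: str = "localhost", port: int = 2002):
    """Return the Desktop service of an already running office instance."""
    import uno  # ships with OpenOffice/LibreOffice; imported lazily on purpose

    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(uno_url(host, port))
    return ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)


def replace_all(document, search: str, replacement: str) -> int:
    """Replace every match in one call; returns the number of replacements."""
    desc = document.createReplaceDescriptor()
    desc.setSearchString(search)
    desc.setReplaceString(replacement)
    return document.replaceAll(desc)


# Example (needs a running office instance and a real file URL):
# desktop = connect()
# document = desktop.loadComponentFromURL("file:///file.doc", "_blank", 0, ())
# replace_all(document, "text_to_replace", "replacement_text")
```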
