I have a MS Word document that consists of several hundred pages.
Each page is identical apart from the name of a person which is unique across each page. (One page is one user).
I would like to automate splitting this Word document so that each page is saved individually. I would end up with several hundred Word documents, one per person, rather than one document covering everyone, which I can then distribute to the individual people.
I have been using the module python-docx found here: https://python-docx.readthedocs.io/en/latest/
I am struggling on how to achieve this task.
As far as I can tell from my research, it is not possible to loop over the pages, because page boundaries are not stored in the .docx file itself but are computed by the rendering program, i.e. Microsoft Word.
However, python-docx can read the text, and since every page is the same, can I not tell Python: when you see this text (the last piece of text on a given page), treat it as the end of a page, and treat anything after that point as a new page?
Ideally I could write a loop that finds each such point and creates a document up to that point, repeating this over all pages. It would need to carry over all formatting and pictures as well.
I am not against other methods, such as converting to PDF first if that is an option.
Any thoughts?
I had the exact same problem. Unfortunately I could not find a way to split a .docx by page. The solution was to first use python-docx or docx2python (whichever you like) to iterate over the document's text in page order, extract the unique (person) information, and put it in a list so you end up with:
people = ['person_A', 'person_B', 'person_C', ....]
Then save the .docx as a PDF, split the PDF by page, and save each page as person_A.pdf etc., like this:
from PyPDF2 import PdfFileWriter, PdfFileReader

inputpdf = PdfFileReader(open("document.pdf", "rb"))
for i in range(inputpdf.numPages):
    # One writer per page, named after the person on that page
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open(f"{people[i]}.pdf", "wb") as outputStream:
        output.write(outputStream)
The result is a set of one-page PDFs saved as person_A.pdf, person_B.pdf etc.
Hope that helps.
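The "end of page" marker idea from the question can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the sample paragraphs, the marker text "Kind regards" and the "Dear ..." greeting are made up, and the helper names are hypothetical. With python-docx the input list would come from `[p.text for p in Document(path).paragraphs]`. It recovers the text and the list of people only, not formatting or pictures.

```python
def split_on_marker(paragraphs, end_marker):
    """Group paragraph texts into pages; a page ends at the paragraph containing end_marker."""
    pages, current = [], []
    for text in paragraphs:
        current.append(text)
        if end_marker in text:
            pages.append(current)
            current = []
    if current:  # trailing paragraphs without a closing marker
        pages.append(current)
    return pages


def name_from_greeting(line, prefix="Dear "):
    """Pull the name out of a hypothetical greeting such as 'Dear person_A,'."""
    if line.startswith(prefix):
        line = line[len(prefix):]
    return line.rstrip(",").strip()


# Made-up sample data standing in for the paragraphs of the real document.
paragraphs = [
    "Dear person_A,", "Body text", "Kind regards",
    "Dear person_B,", "Body text", "Kind regards",
]
pages = split_on_marker(paragraphs, end_marker="Kind regards")
people = [name_from_greeting(page[0]) for page in pages]
print(people)
```

The resulting `people` list is in page order, so it lines up with the page indices used when splitting the PDF.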
I would suggest another package, aspose-words-cloud, to split a Word document into separate pages. Currently, it works with cloud storage (Aspose cloud storage, Amazon S3, DropBox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage and FTP Storage). However, in the near future it will support processing files from the request body (streams).
P.S.: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
import os
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxx-xxxxx-xxxx-xxxxx-xxxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
remoteFolder = 'Temp'
localFolder = 'C:/Temp'
localFileName = '02_pages.docx'
remoteFileName = '02_pages.docx'
#upload file
words_api.upload_file(asposewordscloud.models.requests.UploadFileRequest(open(localFolder + '/' + localFileName,'rb'),remoteFolder + '/' + remoteFileName))
#Split DOCX pages as a zip file
request = asposewordscloud.models.requests.SplitDocumentRequest(name=remoteFileName, format='docx', folder=remoteFolder, zip_output= 'true')
result = words_api.split_document(request)
print("Result {}".format(result.split_result.zipped_pages.href))
#download file
request_download=asposewordscloud.models.requests.DownloadFileRequest(result.split_result.zipped_pages.href)
response_download = words_api.download_file(request_download)
copyfile(response_download, 'C:/'+ result.split_result.zipped_pages.href)
I'm looking to password protect a PDF for editing, but without needing the password to view the file.
Is there a way to do this?
I looked at PyPDF2, but I could only find full encryption.
Just be aware the ISO standard describes this optional feature as
ISO 32000-1:
»Once the document has been opened and decrypted successfully, a conforming reader technically has access to the entire contents of the document. There is nothing inherent in PDF encryption that enforces the document permissions specified in the encryption dictionary.«
It is better described as
Specifying access restrictions for a document, such as Printing allowed: None disables the respective function in Acrobat. However, this not necessarily holds true for third-party PDF viewers or other software. It is up to the developer of PDF tools whether or not access permissions are honored. Indeed, several PDF tools are known to ignore permission settings altogether; commercially available PDF cracking tools can be used to disable all access restrictions. This has nothing to do with cracking the encryption; there is simply no way that a PDF file can make sure it won’t be printed while it still remains viewable.
Do not count on restrictions to stop commercial exploitation of your limited file: many online sites will unlock files for "FREE", so a restricted file is, if anything, more likely to be passed around freely on the web (great for web coverage but poor for your personal demands). In fact, anyone just needs to use their browser to save a copy of the PDF contents. See here the "protected" Adobe Master book, in two common viewers.
You can use the permissions flag. For example:
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_writer = PdfFileWriter()
# CREATE THE PDF HERE

# This is the key line; note the permissions_flag parameter.
# The empty user password means no password is needed to view the file;
# PASSWORD is the owner password you choose.
pdf_writer.encrypt(user_pwd='', owner_pwd=PASSWORD, permissions_flag=0b0100)
with open('NewPDF.pdf', "wb") as out_file:
    pdf_writer.write(out_file)
You could try pikepdf, which can set PDF permissions.
e.g.
from pikepdf import Pdf, Permissions, Encryption, PasswordError

userkey = 'abcd'
ownerkey = 'king'
with Pdf.open(ori_pdf_file) as pdf:
    # Disallow both low- and high-resolution printing
    no_print = Permissions(print_lowres=False, print_highres=False)
    pdf.save(enc_pdf_file, encryption=Encryption(user=userkey, owner=ownerkey, allow=no_print))
This prevents printing when enc_pdf_file is opened with 'abcd' (the user key).
You could read the documents for more details:
https://pikepdf.readthedocs.io/en/latest/tutorial.html#saving-changes
https://pikepdf.readthedocs.io/en/latest/api/models.html#pikepdf.Permissions
I'm late to answer this, and of course #MYK is right. It is just worth mentioning that in PyPDF2 version 3.0.0, PdfFileReader and PdfFileWriter are deprecated, and some methods such as getPage() and addPage() have been renamed.
from PyPDF2 import PdfWriter, PdfReader

out = PdfWriter()
file = PdfReader("yourFile.pdf")
num = len(file.pages)
# Copy every page into the writer before encrypting
for idx in range(num):
    page = file.pages[idx]
    out.add_page(page)
password = "pass"
out.encrypt(user_pwd='', owner_pwd=password, permissions_flag=0b0100)
with open("yourfile_encrypted.pdf", "wb") as f:
    out.write(f)
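For reference, the permission bits stored in the PDF encryption dictionary are defined in ISO 32000-1 (Table 22). The sketch below spells them out as a Python IntFlag. The class and member names are hypothetical and are not part of PyPDF2. It shows that 0b0100 sets only the print bit, so modification stays disallowed in viewers that honor the flags. How a given library treats the remaining reserved bits is not covered here.

```python
from enum import IntFlag


class PdfPermission(IntFlag):
    """User access permission bits per ISO 32000-1, Table 22 (bit 1 is the lowest bit)."""
    PRINT = 1 << 2                   # bit 3: print the document
    MODIFY = 1 << 3                  # bit 4: modify the contents
    EXTRACT = 1 << 4                 # bit 5: copy or extract text and graphics
    ANNOTATE = 1 << 5                # bit 6: add or modify annotations
    FILL_FORMS = 1 << 8              # bit 9: fill in existing form fields
    EXTRACT_ACCESSIBILITY = 1 << 9   # bit 10: extract for accessibility
    ASSEMBLE = 1 << 10               # bit 11: insert, rotate or delete pages
    PRINT_HIGH_QUALITY = 1 << 11     # bit 12: print at full resolution


# The value used in the answers above: printing allowed, nothing else.
print_only = PdfPermission.PRINT
print(int(print_only) == 0b0100)

# A combined value that would allow printing and copying, but not editing.
print_and_copy = PdfPermission.PRINT | PdfPermission.EXTRACT
print(bin(print_and_copy))
```

A combined value such as `int(print_and_copy)` could then be passed as `permissions_flag` in the same way as `0b0100` above.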
I am using Amazon Textract to extract text from PDF files. For some of these documents, I want to be able to specify the pages from which data is to be extracted, rather than having to go through the entire thing. Is this possible? If so, how do I do it? I cannot seem to find an answer in the docs.
I do not believe Textract offers this feature, but you can easily implement it programmatically. Since your tags mention Python, I'll suggest a way to do this using Python.
You can use a library like PyPDF2, which lets you specify which pages you want to extract and creates a new PDF with just those pages.
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'Unknown.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfFileReader(pdf_file_path)
pages = [0, 2, 4]  # page 1, 3, 5 (indices are zero-based)
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
# The with block closes the file, so no explicit close() is needed
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)
This library can be used with AWS Lambda as a layer. You can save the file temporarily in the /tmp/ folder on lambda.
Source: https://learndataanalysis.org/how-to-extract-pdf-pages-and-save-as-a-separate-pdf-file-using-python/
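If the pages to extract are supplied by a user rather than hard-coded, a small parser can turn a 1-based specification into the zero-based indices used above. This is a minimal sketch; the function name and the "1,3,5-7" input format are assumptions for illustration, not something Textract or PyPDF2 provides.

```python
def parse_page_ranges(spec):
    """Turn a 1-based page spec such as '1,3,5-7' into sorted zero-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            indices.update(range(start - 1, end))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"page numbers start at 1: {part!r}")
            indices.add(page - 1)
    return sorted(indices)


pages = parse_page_ranges("1,3,5")
print(pages)
```

The returned list can be used directly as the `pages` list in the loop that copies pages into the writer.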
Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os

with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('Password123')
    print(f"Number of page: {reader.getNumPages()}")
    for i in range(reader.numPages):
        output = PdfFileWriter()
        output.addPage(reader.getPage(i))
        with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
        print(outputStream)
        for page in output.pages:  # failing here
            print page.extractText()  # failing here
The program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory; this is working fine. However, after this I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).
Ideally, I can use my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?
You have not described how the last two lines are failing, but extractText does not work well on some PDFs:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.
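Assuming text extraction does work for your file, writing one .txt file per page needs only the standard library. The sketch below is an illustration: the helper name `save_page_texts` is hypothetical, and it takes the already extracted page strings as input so that it does not depend on any particular PDF library.

```python
from pathlib import Path


def save_page_texts(page_texts, out_dir="txt_versions"):
    """Write each page's text to <out_dir>/document-page<i>.txt and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i, text in enumerate(page_texts):
        target = out_path / f"document-page{i}.txt"
        # Pages without extractable text still get an (empty) file.
        target.write_text(text or "", encoding="utf-8")
        written.append(target)
    return written


# With the reader from the question, the input could be produced like this:
# save_page_texts(reader.getPage(i).extractText() for i in range(reader.numPages))
```

Reading the text from `reader.getPage(i)` directly, rather than from the writer object, also avoids iterating over `output.pages`.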
I'm trying to write a program for data extraction from PDF in Python (an Excel macro could also be an option).
First, I want to select a text or a position in a PDF file and generate a local path/link to that file at that position. This link will be copied into an Excel cell. When I click on the link, the PDF document should open at the coordinates of the previously selected text.
I know the question is very broad. I'm an enthusiastic beginner and need a nudge in the right direction and to know if it is possible.
How can I get the path of the active PDF file on the desktop, and the coordinates of the selected text? I could then pass these automatically as parameters to my program.
Thank you !
There are a lot of ways to do this. I would suggest looking into Slate (https://pypi.python.org/pypi/slate) or pdfminer (http://www.unixuser.org/~euske/python/pdfminer/index.html).
And yes, it's quite easy. Also look into pyPdf:
# Note: file() and the print statement below are Python 2 syntax.
import pyPdf

def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content

print getPDFContent("test.pdf")
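On the linking part of the question, one possible starting point is a file URI with a `#page=N` fragment, which is one of the PDF open parameters documented by Adobe and honored by a number of viewers. The sketch below only builds such a link; the function name is hypothetical, it targets a page rather than exact coordinates, and whether Excel or the operating system passes the fragment on to the viewer is an assumption you would need to verify.

```python
from pathlib import Path


def pdf_page_link(pdf_path, page):
    """Build a file:// URI asking the viewer to open pdf_path at a 1-based page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # resolve() makes the path absolute, which as_uri() requires
    return f"{Path(pdf_path).resolve().as_uri()}#page={page}"


link = pdf_page_link("report.pdf", 3)
print(link)
```

The resulting string can be pasted into a cell or written with any spreadsheet library as a hyperlink target.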
Hello, I am trying to write a Python function that saves a list of URLs to a .txt file.
Example: visit http://forum.domain.com/ and save every URL containing viewtopic.php?t= to a .txt file:
http://forum.domain.com/viewtopic.php?t=1333
http://forum.domain.com/viewtopic.php?t=2333
I use this function, but it does not save anything.
I am very new to Python. Can someone help me create this?
web_obj = opener.open('http://forum.domain.com/')
data = web_obj.read()
fl_url_list = open('urllist.txt', 'r')
url_arr = fl_url_list.readlines()
fl_url_list.close()
This is far from trivial and can have quite a few corner cases (I assume the page you're referring to is a web page).
To give you a few pointers, you need to:
download the web page: you're already doing it (in data)
extract the URLs: this is the hard part. Most probably you'll want to use an HTML parser, extract the <a> tags, fetch the href attribute of each, and put the values into a list. Then filter that list to keep only the URLs formatted the way you like (say, containing viewtopic). Let's say you got it into urlList
then open a file for writing text (thus wt, not r)
write the content: f.write('\n'.join(urlList))
close the file
I advise you to follow these steps and ask specific questions when you're stuck on a particular issue.
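The steps above can be sketched with the standard library alone. This is a minimal illustration, not a robust crawler: the class and function names are hypothetical, the sample HTML is made up, and the download step is left out so that the example runs without network access (in the question, `data` would be the HTML to feed in, decoded to a string).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_topic_urls(html, base_url, needle="viewtopic.php?t="):
    """Return the absolute URLs of all links whose URL contains needle."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [url for url in parser.links if needle in url]


def save_urls(urls, path):
    """Write one URL per line; 'w' opens the file for writing text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(urls))


# Made-up sample page standing in for the downloaded forum HTML.
sample = (
    '<a href="viewtopic.php?t=1333">Topic A</a> '
    '<a href="/index.php">Home</a> '
    '<a href="http://forum.domain.com/viewtopic.php?t=2333">Topic B</a>'
)
url_list = extract_topic_urls(sample, "http://forum.domain.com/")
print(url_list)
```

Calling `save_urls(url_list, "urllist.txt")` would then write the filtered URLs to the file; the with block takes care of closing it.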