Remove OCG layers from PDF with Python

Is there a way to delete an OCG layer from a PDF within Python?
I normally work with pymupdf but couldn't find the functionality there. Is there any other library with this functionality?

Disclaimer: I am the author of borb, the library mentioned in this answer.
borb will turn any input PDF into a JSON-like data structure. If you know what to delete in the content-tree, you can simply remove that item from the dictionary as you would in a normal dictionary object.
Reading the Document is easy:
# Import path as used in borb 2.x
from borb.pdf import PDF

with open("input.pdf", "rb") as in_file_handle:
    document = PDF.loads(in_file_handle)
You need document["XRef"]["Trailer"]["Root"]["OCGs"] which will be a List of layers. Remove whatever element(s) you want.
If you then store the PDF, the layer will be gone.
with open("output.pdf", "wb") as out_file_handle:
    PDF.dumps(out_file_handle, document)
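To illustrate the "remove it as you would from a normal dictionary" step without depending on borb itself, here is a minimal sketch on a plain Python dict that stands in for the loaded document. The key path is the one quoted above; the layer entries and their "Name" field are made up for the example, and the entries in a real document may look different.

```python
# Stand-in for the loaded document; the layer entries are hypothetical.
document = {
    "XRef": {
        "Trailer": {
            "Root": {
                "OCGs": [
                    {"Name": "Background"},
                    {"Name": "Watermark"},
                    {"Name": "Annotations"},
                ]
            }
        }
    }
}


def remove_layers(doc, names_to_remove):
    """Drop every layer whose (assumed) "Name" is in names_to_remove."""
    layers = doc["XRef"]["Trailer"]["Root"]["OCGs"]
    # Slice assignment mutates the existing list in place.
    layers[:] = [layer for layer in layers if layer["Name"] not in names_to_remove]
    return doc


remove_layers(document, {"Watermark"})
remaining = [layer["Name"] for layer in document["XRef"]["Trailer"]["Root"]["OCGs"]]
print(remaining)
```

With a real document you would first inspect the list to see how each layer is identified before deciding what to filter on.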

Related

The extractText() function does not return text

import PyPDF2

pdfFileObject = open('MDD.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())
Above is my code. When I run the script, it just outputs a bunch of numbers and numerals, not the text of the file. Could anyone help me with that?

This function doesn't work for all PDF files. This is explained in the documentation:
This works well for some PDF
files, but poorly for others, depending on the generator used. This will
be refined in the future. Do not rely on the order of text coming out of
this function, as it will change if this function is made more
sophisticated.
:return: a unicode string object.
Try your code on this file. I'm sure it should work, so it seems that the problem is not in your code.
If you really need to parse files that are created the same way as your original MDD.pdf you have to choose another library.
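Before switching libraries, it can help to check programmatically which files come out garbled. The sketch below is only a heuristic and not part of PyPDF2: the function name looks_like_text and the 0.5 threshold are assumptions picked for illustration.

```python
def looks_like_text(extracted, min_letter_ratio=0.5):
    """Return True if at least min_letter_ratio of the non-space characters are letters."""
    chars = [c for c in extracted if not c.isspace()]
    if not chars:
        return False
    letters = sum(c.isalpha() for c in chars)
    return letters / len(chars) >= min_letter_ratio


good_sample = "This page contains ordinary sentences."
bad_sample = "12 0 44 3 17 9 201 5 66 8"
print(looks_like_text(good_sample), looks_like_text(bad_sample))
```

You could run this on the string returned for each page and only fall back to another library for the pages that fail the check.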

How can I create a word (.docx) document if not found using python and write in it?

How can I create a word (.docx) document if not found using python and write in it?
I certainly cannot do either of the following:
file = open(file_name, 'r')
file = open(file_name, 'w')
or, to create or append if found:
f = open(file_name, 'a+')
Also I cannot find any related info in python-docx documentation at:
https://python-docx.readthedocs.io/en/latest/
NOTE:
I need to create an automated report via python with text and pie charts, graphs etc.

Probably the safest way to create a new file for writing is to use 'xb' mode. 'x' will raise a FileExistsError if the file is already there, so an existing file is never truncated by accident. 'b' is necessary because a Word document is fundamentally a binary file: it's a zip archive with XML and other files inside it. You can't compress and decompress a zip file if you convert bytes through character encoding.
Document.save accepts streams, so you can pass in a file object opened like that to save your document.
Your work-flow could be something like this:
import docx

doc = docx.Document(...)
...
# Make your document
...
with open('outfile.docx', 'xb') as f:
    doc.save(f)
It's a good idea to use with blocks instead of raw open to ensure the file gets closed properly even in case of an error.
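The behaviour of 'xb' can be checked with the standard library alone. This is a small sketch that uses a temporary directory and placeholder bytes instead of a real document; the file name is just an example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "outfile.docx")

    # First attempt: the file does not exist yet, so 'xb' creates it.
    with open(path, "xb") as f:
        f.write(b"placeholder bytes")

    # Second attempt: the file exists, so open() refuses to touch it.
    try:
        with open(path, "xb") as f:
            f.write(b"this would have overwritten the file")
        second_open_failed = False
    except FileExistsError:
        second_open_failed = True

    # The original contents are still intact.
    with open(path, "rb") as f:
        contents = f.read()

print(second_open_failed, contents)
```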
In the same way that you can't simply write to a Word file directly, you can't append to it either. The way to "append" is to open the file, load the Document object, and then write it back, overwriting the original content. Since the Word file is a zip archive, it's very likely that appended text won't even be at the end of the XML file it's in, much less the whole docx file:
doc = docx.Document('file_to_append.docx')
...
# Modify the contents of doc
...
doc.save('file_to_append.docx')
Keep in mind that the python-docx library may not support loading some elements, which may end up being permanently discarded when you save the file this way.
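That a docx file is a zip archive with XML inside can be seen with the standard zipfile module. The sketch below builds a tiny in-memory archive as a stand-in. It is not a valid Word document, and the member names only mimic a typical layout.

```python
import io
import zipfile

# Build a small archive in memory (a stand-in, not a real .docx).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<w:document>hello</w:document>")

raw = buffer.getvalue()
# Zip data starts with the local file header signature b"PK".
starts_with_pk = raw[:2] == b"PK"

# Reading it back requires treating the data as bytes, not text.
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    members = archive.namelist()
    body = archive.read("word/document.xml").decode("utf-8")

print(starts_with_pk, members, body)
```

Calling zipfile.ZipFile(path).namelist() on one of your own .docx files should list its parts in the same way.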

Looks like I found an answer:
The important point here was to create a new file, if not found, or
otherwise edit the already present file.
import os
from docx import Document

file_name = 'file_name.docx'

# Create and save a blank document only if the file is not already present
if not os.path.isfile(file_name):
    document = Document()
    document.save(file_name)

# ------------ editing file_name.docx now ------------
# Open the existing document
document = Document(file_name)
# Edit it
document.add_heading("hello world", 0)
# Save the document at the end
document.save(file_name)
Further edits/suggestions are welcome.

How do I store dictionaries in a file and read/write that file?

I am using tkinter to manage the GUI for a note retrieval program. I can pull my notes by typing a key word and hitting Enter in a text field but I would like to move my dictionary to a file so that my code space is not filled up with a massive dictionary.
I have been looking around but I am not sure how I would go about doing this.
I have the file in my directory. I know I can use open("filename", "mode") to open said file for reading, but how do I call each section of the notes?
For example what I do now is just call a keyword from my dictionary and have it write the definition for that keyword to a text box in my GUI. Can I do the same from the file?
How would I go about reading from the file the keyword and returning the definition to a variable or directly to the text box? For now I just need to figure out how to read the data. I think once I know that I can figure out how to write new notes or edit existing notes.
This is how I am set up now.
To call my function:
root.bind('<Return>', kw_entry)
How I return my definition to my text box:
def kw_entry(event=None):
    e1Current = keywordEntry.get().lower()
    if e1Current in notes:
        root.text.delete(1.0, tkinter.END)
        root.text.insert(tkinter.END, notes[e1Current])
        root.text.see(tkinter.END)
    else:
        root.text.delete(1.0, tkinter.END)
        root.text.insert(tkinter.END, "Not a Keyword")
        root.text.see(tkinter.END)

Sounds like you need to load the dictionary into memory at init time and use it like a normal dictionary.
I am assuming your dictionary is a standard Python dict of strings, so I recommend using the Python json lib.
The easiest way to do this is to export the dictionary as JSON once to a file, using something like:
import json

with open(filename, 'w') as fp:
    json.dump(dictionary, fp)
and then change your code to load the dict at init time using:
with open(filename) as fp:
    dictionary = json.load(fp)
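Put together, a round trip could look like the sketch below. The sample notes, the file name notes.json and the lookup helper are made up for the example; the helper mirrors the "Not a Keyword" fallback from the question.

```python
import json
import os
import tempfile

# Hypothetical sample data standing in for the real notes dictionary.
notes = {
    "python": "A programming language.",
    "tkinter": "Python's standard GUI toolkit.",
}


def lookup(notes_dict, keyword):
    """Return the note for keyword, or the fallback text used in the question."""
    return notes_dict.get(keyword.lower(), "Not a Keyword")


with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "notes.json")

    # One-time export of the dictionary.
    with open(filename, "w", encoding="utf-8") as fp:
        json.dump(notes, fp, indent=2)

    # At program start: load it back into memory.
    with open(filename, encoding="utf-8") as fp:
        loaded = json.load(fp)

print(lookup(loaded, "Python"))
print(lookup(loaded, "missing"))
```

In the GUI, the string returned by lookup is what you would insert into the text box.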
Alternatively, if your data is more complex than text, you can use python shelve which is a persistent, dictionary-like object to which you can pass any pickle-able object. Note that shelve has its drawbacks so read the attached doc.
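A minimal shelve sketch, assuming a throwaway file in a temporary directory and an invented note structure:

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes_shelf")

    # Store a value that is richer than a plain string.
    with shelve.open(path) as shelf:
        shelf["python"] = {"definition": "A programming language.", "tags": ["language"]}

    # Reopen the shelf later and read the value back.
    with shelve.open(path) as shelf:
        entry = shelf["python"]
        keys = sorted(shelf.keys())

print(entry["tags"], keys)
```

One of the drawbacks to be aware of: changing a stored value in place (for example appending to the tags list) is not written back unless you reassign the key or open the shelf with writeback=True.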

sqlitedict is a project providing a persistent dictionary using sqlite in the background. You can use it like a normal dictionary e.g. by assigning arbitrary (picklable) objects to it.
If you access an element from the dictionary, only the value you requested is loaded from disk.
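To show the idea behind this without relying on sqlitedict's actual API, here is a rough sketch built on the standard sqlite3 module. The class name SqliteNotes and the table layout are invented for the example, and it only handles string values.

```python
import sqlite3


class SqliteNotes:
    """Hypothetical minimal key/value store; this is not sqlitedict's API."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS notes (key TEXT PRIMARY KEY, value TEXT)"
        )

    def __setitem__(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO notes (key, value) VALUES (?, ?)", (key, value)
        )
        self.conn.commit()

    def __getitem__(self, key):
        # Only the requested row is fetched from the database.
        row = self.conn.execute(
            "SELECT value FROM notes WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]


store = SqliteNotes()
store["python"] = "A programming language."
store["python"] = "A programming language (updated)."
print(store["python"])
```

Passing a file path instead of ":memory:" would make the data persist between runs.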

Pulling data out of MS Word with pywin32

I am running python 3.3 in Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week on the best method to do this. Originally I tried to save the .docx files as .txt and parse through using RE's, but I had some formatting problems with hidden characters - I was using a script to open a .docx and save as .txt. I am wondering if I did a proper File>SaveAs>.txt would it strip out the odd formatting and then I could properly parse through? I don't know but I gave up on this method.
I tried to use the docx module but I've been told it is not compatible with python 3.3. So I am left with using pywin32 and the COM. I have used this successfully with Excel to get the data I need but I am having trouble with Word because there is FAR less documentation and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.
The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: code I was using to save the files from docx to txt:
for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()

If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # Use the variable infile here, not the string literal 'infile'
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
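For the plain-text route, one possible shape for process_the_file is sketched below. The function signature, the sample text and the INV-\d+ pattern are all made up for illustration; the sketch writes its own UTF-16 file so it runs without Word.

```python
import os
import re
import tempfile


def process_the_file(f, pattern):
    """Hypothetical helper: return every regex match found in the open file."""
    regex = re.compile(pattern)
    matches = []
    for line in f:
        matches.extend(regex.findall(line))
    return matches


with tempfile.TemporaryDirectory() as tmp:
    txtpath = os.path.join(tmp, "sample.txt")

    # Stand-in for the file Word would have saved as Unicode text.
    with open(txtpath, "w", encoding="utf-16") as f:
        f.write("Invoice: INV-1001\nSome other line\nInvoice: INV-2002\n")

    with open(txtpath, encoding="utf-16") as f:
        found = process_the_file(f, r"INV-\d+")

print(found)
```

Since you already have regular expressions that find your strings, you would pass those in place of the example pattern.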

oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:
from oodocx import oodocx
d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')
If you just want to remove strings, you can pass an empty string to the "replace" parameter.

cPickle.load( ) error

I am working with cPickle to convert structured data into a data-stream format and pass it to a library. I have to read the contents of a manually written file named "targetstrings.txt" and convert them into the format the Netcdf library needs, in the following manner:
Note: targetstrings.txt contains latin characters
op=open("targetstrings.txt",'rb')
targetStrings=cPickle.load(op)
The Netcdf library takes the contents as strings.
While loading the file, it fails with the following error:
cPickle.UnpicklingError: invalid load key, 'A'.
Please tell me how I can rectify this error. I have googled around but did not find an appropriate solution.
Any suggestions?

pickle is not for reading/writing generic text files, but to serialize/deserialize Python objects to file. If you want to read text data you should use Python's usual IO functions.
with open('targetstrings.txt', 'r') as f:
    fileContent = f.read()
If, as it seems, the library just wants to have a list of strings, taking each line as a list element, you just have to do:
with open('targetstrings.txt', 'r') as f:
    lines = [l for l in f]
# now in lines you have the lines read from the file
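Since the question mentions Latin characters, it may be worth passing an explicit encoding and stripping the trailing newlines. The sketch below assumes the file is encoded as latin-1, which the question does not actually state, and writes its own sample file so it can run on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "targetstrings.txt")

    # Sample file; the latin-1 encoding is an assumption.
    with open(path, "w", encoding="latin-1") as f:
        f.write("café\nniño\nüber\n")

    # Read it back with the same encoding and drop the line endings.
    with open(path, "r", encoding="latin-1") as f:
        lines = [line.rstrip("\n") for line in f]

print(lines)
```

If the file was saved with a different encoding (UTF-8, for instance), use that name instead.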

As stated, pickle is not meant to be used in this way.
If you need to manually edit complex Python objects that are to be read and passed as Python objects to another function, there are plenty of other formats to use, for example XML, JSON, or Python files themselves. Pickle uses a Python-specific protocol that, while not binary (in version 0 of the protocol) and not changing across Python versions, is not meant for this, and is not even the recommended method to record Python objects for persistence or communication (although it can be used for those purposes).
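A short sketch of the difference, using the Python 3 pickle module rather than cPickle and made-up sample strings: unpickling hand-written text fails, while data that pickle wrote itself, or a hand-editable JSON string, loads fine.

```python
import json
import pickle

# Hand-written text is not a pickle stream, so loading it fails.
plain_text = b"Alpha\nBeta\n"
try:
    pickle.loads(plain_text)
    unpickle_error = None
except pickle.UnpicklingError as exc:
    unpickle_error = exc

# pickle only reads back what pickle itself produced.
strings = ["Alpha", "Beta"]
restored = pickle.loads(pickle.dumps(strings))

# JSON is a text format that can be written and edited by hand.
as_json = json.dumps(strings)
from_json = json.loads(as_json)

print(type(unpickle_error).__name__, restored, as_json)
```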
