Problem with exiting a Word doc using Python - python

This is my first time using this, so be kind :) I am making a program that opens many Microsoft Word 2007 docs (well in excess of 1000), reads from a certain table in each document, and writes that info to an Excel file. I have all of this working; the only problem is that when I run my code it does not close MS Word after opening each doc. I have to do this manually at the end of the run by opening Word and selecting the Exit Word option from the Home menu. Another problem is that if I run the program twice in a row, on the second run everything goes to hell: it prints the same thing repeatedly no matter which doc is selected. I think this may have to do with how MS Word decides which doc is active, e.g. is it still opening the last active document that was not closed from the previous run? Anyway, here is my code for the opening and closing part; I won't bore you guys with the rest:
MSWord = win32com.client.Dispatch("Word.Application")
MSWord.Visible = 0
# Open a specific file
#myWordDoc = tkFileDialog.askopenfilename()
MSWord.Documents.Open("C:\\Documents and Settings\\fdosier" + chosen_doc)
#Get the textual content
docText = MSWord.Documents[0].Content
charText = MSWord.Documents[0].Characters
# Get a list of tables
ListTables = MSWord.Documents[0].Tables
------Main Code---------
MSWord.Documents.Close
MSWord.Documents.Quit
del MSWord

Basically, Python is not VBA, so this:
MSWord.Documents.Close
is equivalent to:
getattr(MSWord.Documents, "Close")
i.e. you just get some method object and do nothing with it. You need to call the method with the call operator (the parentheses :) :
MSWord.Documents.Close()
The same goes for .Quit, which is a method of the application object rather than of Documents: MSWord.Quit()
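The difference between looking a method up and calling it can be seen without Word installed. This is a minimal sketch with a stand-in class; FakeDocuments is hypothetical and not part of win32com:

```python
class FakeDocuments:
    """Hypothetical stand-in for MSWord.Documents; not part of win32com."""

    def __init__(self):
        self.closed = False

    def Close(self):
        self.closed = True


docs = FakeDocuments()

docs.Close                          # only looks the method up; nothing runs
closed_after_lookup = docs.closed   # still False

docs.Close()                        # the parentheses actually call the method
closed_after_call = docs.closed     # now True
```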

Before your MSWord.Quit() did you try using:
MSWord.ActiveWindow.Close()
Or, even more simply, just doing:
MSWord.Quit()
I don't really understand whether you are trying to close a document or the application.

I think you need an MSWord.Quit() at the end (before and/or instead of the del).
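Putting the answers together, one possible shape for the whole loop is sketched below. It assumes pywin32 on Windows; read_all, start_word and the read_doc callback are illustrative names, not part of any library. The try/finally blocks are there so each document is closed and Word is quit even if reading one file fails.

```python
def read_all(app, paths, read_doc):
    """Open each path in the Word instance `app`, apply read_doc, close it.

    `read_doc` is a hypothetical callback that pulls what you need
    (for example a table) out of an open document.
    """
    results = []
    try:
        for path in paths:
            doc = app.Documents.Open(path)
            try:
                results.append(read_doc(doc))
            finally:
                doc.Close(False)   # False: close without saving changes
    finally:
        app.Quit()                 # quit Word even if something failed
    return results


def start_word():
    """Start Word via pywin32 (Windows only, so imported lazily)."""
    import win32com.client
    app = win32com.client.Dispatch("Word.Application")
    app.Visible = False
    return app
```

A call might look like read_all(start_word(), list_of_paths, lambda doc: doc.Tables.Count), where list_of_paths is your own list of file names.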

Related

Python text command doesn't output the string data on the actual text file

I am just learning about the text file functions in Python 3 using https://www.w3schools.com/python/python_file_write.asp and https://www.geeksforgeeks.org/reading-writing-text-files-python/. Although the program seems correct, the text written by the program does not show up in the actual text file.
Is there a mistake in the program below?
My Python version is 3.7.5.
File = open("NewTextFile.Txt", "a")
string = "ABC"
File.write(string)
File.close
You forgot to put () after File.close, so the close method is never called and the file is not properly closed. Try adding the ().
It is often recommended to use a with clause:
with open('NewTextFile.Txt', 'a') as file:
    string = 'ABC'
    file.write(string)
Note that you don't need to explicitly close the file here. The file is kept open within the clause. Once your python program exits the with clause, the file is automatically closed; in this way your program gets less prone to mistakes.
For more information, see a relevant python doc:
It is good practice to use the with keyword when dealing with file objects. The advantage is that the file is properly closed after its suite finishes, even if an exception is raised at some point.
— https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
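The quoted behaviour can be checked with a small sketch. The temporary directory and the simulated ValueError are only there for illustration; the point is that the file ends up closed, and the text written, even though the block is left through an exception:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "NewTextFile.txt")

try:
    with open(path, "a") as f:
        f.write("ABC")
        raise ValueError("simulated failure after the write")
except ValueError:
    pass

closed_after_with = f.closed      # the with block closed the file for us

with open(path) as check:
    contents = check.read()       # the write still reached the file
```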

How to access remote and encrypted PDF text without writing to local drive [closed]

I am very new to the coding world and have been stuck on this one problem for 3 days now, searching everywhere for an answer, so any help will be greatly appreciated. I need to extract a small amount of text from a PDF file located at a URL. I'm using sessions.get(chart_PDF) to fetch it, where chart_PDF is the example URL below.
Example URL is https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf
I know I am able to write it to my local drive, but I don't want to do that. I want to be able to do it remotely, since I only need a couple of numbers from it.
I have tried finding the password for decrypting on the URL page, but couldn't find it. I've tried to use PyPDF2, pdfminer and pikepdf (probably not well).
I only need to retrieve two numbers near the bottom of the PDF that can be used for the rest of my code. Please help, even if it is a simple fix; I'm new to all this and need some help. Thanks.
from io import BytesIO
from pikepdf import Pdf as PDF
from pdfminer import high_level
chart_PDF = "https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf"
retrieve = s.get(chart_PDF)
content = retrieve.content
response =urllib.request.urlopen(chart_PDF)
p = BytesIO(content)
p.getbuffer()
check = PDFPage.get_pages(p, check_extractable=False)
extract = high_level.extract_text(p)
I'm getting:
PDFTextExtractionNotAllowedWarning: The PDF <_io.BytesIO object at 0x000001B007ABEC20> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding.
warnings.warn(warning_msg, PDFTextExtractionNotAllowedWarning)
Alternately, if I try this:
from pikepdf import Pdf as PDF
from pdfminer.pdfpage import PDFPage
from PyPDF2 import PdfFileReader
new_pdf = PDF.new()
with PDF.open(p) as pdf:
    print(len(pdf.pages))
    page1 = pdf.pages[0]
    if PdfFileReader.getIsEncrypted(pdf):
        print(True)
        PdfFileReader.decrypt(page1, password='')
pdf.close()
I get:
line 1987, in decrypt
return self._decrypt(password)
AttributeError: _decrypt
UPDATE 3/8/21
Thank you so much K J! You've seriously been a huge help!
from io import BytesIO
from pdfminer.pdfpage import PDFPage
from pdfminer import high_level
retrieve = s.get(chart_PDF)
content = retrieve.content
bytes = BytesIO(content)
bytes.getbuffer()
PDFPage.get_pages(bytes, check_extractable=False)
extract = high_level.extract_text(bytes, password='') #THIS LINE THROWS ERROR
joined = ''.join(extract)
find_txt = re.findall(r'[(]\d*[-]\d[.]\d[)]', joined)
print(find_txt)
bytes.close()
This is now working well and I have been able to pull the numbers that I need (I have basically pulled all numbers from inside brackets off the PDF). I'll sort through that to find which one I need.
Strangely enough, although it's giving me what I need, my extract = high_level.extract_text(bytes, password='') line still throws the warning (warning_msg, PDFTextExtractionNotAllowedWarning), which is rather annoying. I'm not sure how this process works, but it's still letting the info out.
I can't use try/except or it skips over it. What is the way around this? How can I stop that warning coming up?
FINAL UPDATE
I got around the warning and it works well now.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    extract = high_level.extract_text(bytes)
Cheers fellas for putting up with my ignorance, you've helped so much.
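A blanket simplefilter("ignore") hides every warning raised inside the block. If only one category should be silenced, the filter can be given that category. The sketch below uses stand-ins so it runs without pdfminer: ExtractionNotAllowedWarning and noisy_extract are hypothetical substitutes for the real warning class and for high_level.extract_text, and record=True is used only to show which warnings still get through.

```python
import warnings


class ExtractionNotAllowedWarning(UserWarning):
    """Hypothetical stand-in for the warning pdfminer emits."""


def noisy_extract():
    """Hypothetical stand-in for high_level.extract_text()."""
    warnings.warn("text extraction not allowed", ExtractionNotAllowedWarning)
    warnings.warn("something else worth seeing", RuntimeWarning)
    return "(1380-4.4)"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # record every warning...
    # ...except the one category we decided to silence
    warnings.simplefilter("ignore", ExtractionNotAllowedWarning)
    text = noisy_extract()

remaining = [w.category.__name__ for w in caught]
```

Here remaining contains only the RuntimeWarning, so unrelated problems are still visible. In real code you would pass the library's own warning class instead of the stand-in.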
The whole file has to be downloaded to the device, at least into RAM, so that the blob can be parsed as a FILE from the very END for one OR more %%EOF markers and for the location of page 0 (it gets converted to 1 or i), which could be ANYWHERE IN THE STREAM.
THEN you can navigate to the other sequentially numbered pages in the RANDOM order they were built. Any complaints, please contact Adobe.
However, it is easiest if it is cached as a physical FILE object. If you don't want that on disk, use a RAM drive for your browser.
Again, those two objects at the bottom of page one could be anywhere, mixed into the content of "page" 99's objects or otherwise. Each letter in a PDF can, in the extreme, be more than one object anywhere in the file, but a good authoring editor would try to keep them together line by line. (There is no such thing in a PDF as a word or a paragraph.)
We can print the file as plain text to see how it is composed; although the file is secured, that is allowed.
I tried printing from browser with little success but know that can depend on browser system and OS print drivers. Here I have printed the page as text using Acrobat portable, so we can see the sequential offsets of each text block from Left Hand margin JUST LIKE a PDF VIEWER would need to rebuild them.
UPDATE
You said your target is (1380-4.4) to the RIGHT of ALTERNATE but again A PDF has no concept of Left and Right or BEFORE or AFTER so we find IN THIS FILE the variable target is in 2 separate pieces PRIOR to the KNOWN characters which luckily is a complete single block (alternate). Thus here proximity of plain text could well work if the capture is confined to that nearby locality. However there is no guarantee that ALTERNATE would always be a single block.
It was perhaps not a good Idea To show the way a Printer would be given a stream of sequential data
Here is the way one PDF viewer goes about decrypting the file
As stated on this occasion the word ALTERNATE is defined as text however the next item is the "3" under "B" which is text as a vector path it is not called a "character" although it looks like one but a numbered glyph from a font table. We do see later that some of those numbers are stored as "text" and for your target it is mixed in with similar text in the same object.
Thus you need to call a PDF interpreter to give you a meaningful translation of all bits and pieces of objects so that you can extract the "right" text.
The easiest way for a "simple" one line target in a complex file is to use MuPDF to first tidy up the file
mutool clean -gggg -D infile.pdf outfile.pdf
combined with
PDFTOTXT -layout outfile.pdf outfile.txt
or similar, to hopefully export that text on a line-by-line basis, so that you can consistently find your target immediately before or after ALTERNATE.
N.B. mutool convert to HTML would place the target value in a table entry AFTER the keyword, and if the lines are consistent in number that would be a simpler way to find or grep it.

Is there a way to close only the word doc being used by win32com object, without impacting other open docs?

I am initializing a word application object using:
import win32com.client as win32
wordapp = win32.Dispatch("Word.Application")
Now after making some changes, I am closing the doc and killing the object.
doc.Close()
wordapp.quit()
Now on quitting, all my Word docs are closed. Is there a way to close only the doc used by win32com and leave all other open docs untouched?
If I don't quit the object, lots of different errors crop up when I use the application a second time.
This bit works for me:
wordapp.ActiveDocument.Save()
wordapp.ActiveDocument.Close()
This question is old, but I see it has been viewed more than 2000 times, so it is worth answering:
When you use the Dispatch function, Python checks whether Word (in this case) is already open. If not, it opens a new instance; otherwise it connects you to the one that is already open. So when you close 'your' Word, you are actually closing that first instance.
Therefore, you just need to use the DispatchEx function, which always creates a new instance of the application. This is how it is done:
from win32com import client as wc
word = wc.DispatchEx('Word.Application')
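One way to make sure the private instance is always shut down is to wrap it in a context manager. This is only a sketch: private_word and its factory parameter are illustrative names, and the factory hook exists so the function can be exercised without Word installed.

```python
from contextlib import contextmanager


@contextmanager
def private_word(factory=None):
    """Yield a Word instance that this code owns, and quit it afterwards.

    `factory` is an illustrative hook for testing; by default a new
    instance is created with DispatchEx (Windows only, needs pywin32).
    """
    if factory is None:
        from win32com.client import DispatchEx
        factory = lambda: DispatchEx("Word.Application")
    app = factory()
    try:
        yield app
    finally:
        app.Quit()   # quits only the instance created above
```

Used as "with private_word() as word:", everything inside the block works on the separate instance, and documents the user has open in their own Word window are left alone.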

Blocking until a file is closed in python

I have Python set up to create and open a txt file [see Open document with default application in Python ], which I then manually make some changes to and close. Immediately after this is complete I want Python to open the next txt file. I currently have this set up so that Python waits for a key command that I type after I have closed the file, and on that key it opens the next one for me to edit.
Is there a way of getting Python to open the next document as soon as the prior one is closed (i.e to skip out having python wait for a key to be clicked). ... I will be repeating this task approximately 100,000 times, and thus every fraction of a second of clicking mounts up very quickly. I basically want to get rid of having to interface with python, and simply to have the next txt file automatically appear as soon as prior one is closed.
I couldn't work out how to do it, but was thinking along the lines of a wait until the prior file is closed (wasn't sure if there was a way for python to be able to tell if a file is open/closed).
For reference, I am using python2.7 and Windows.
Use the subprocess module's Popen Constructor to open the file. It will return an object with a wait() method which will block until the file is closed.
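A minimal sketch of that suggestion follows. The function name edit_and_wait and the default of notepad.exe are assumptions for illustration; substitute whatever editor you use. Note that wait() blocks until the editor process exits, so it will return early with editors that hand the file to an already-running instance.

```python
import subprocess


def edit_and_wait(path, editor_cmd=("notepad.exe",)):
    """Open `path` in an editor and block until that editor process exits."""
    proc = subprocess.Popen(list(editor_cmd) + [path])
    return proc.wait()    # returns the editor's exit code
```

Looping over the files then becomes "for name in list_of_files: edit_and_wait(name)", with the next file opening as soon as the previous editor window is closed.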
How about something like:
for fname in list_of_files:
    with open(fname, mode) as f:
        # do stuff
In case of interest, the following code using the modified time method worked:
os.startfile(text_file_name)
modified = time.ctime(os.path.getmtime(text_file_name))
created = time.ctime(os.path.getctime(text_file_name))
while modified == created:
    sleep(0.5)
    modified = time.ctime(os.path.getmtime(text_file_name))
print modified
print "moving on to next item"
sleep(0.5)
sys.stdout.flush()
Although I think I will use the Popen constructor in the future, since that seems a much more elegant way of doing it (and it also allows for situations where the file is closed without an edit being needed).

Deleting temporary files in python

I really would like to learn how to submit questions using the cool formatting that seems to be available, but it is not obvious to me just how to do that....
My question: My plan was to print "birdlist" (output from a listbox) to the file "Plain.txt" and then
delete the file so as to make it unavailable after the program exits. The problem with this is that for some reason "Plain.txt" gets deleted before the printing even starts...
The code below works quite well so long as I don't un-comment the last two lines in order to delete "Plain.txt"... I have also tried to use the "tempfile" module, but it does not like me sending formatted string data to the temporary file. Is there a way to do this that is simple enough for my bird-brain to understand???
text_file = open("Plain.txt","w")
for name,place,time in birdlist:
    text_file.write('{0:30}\t {1:>5}\t {2:10}\n'.format(name, place, time))
win32api.ShellExecute (0,"print",'Plain.txt','/d:"%s"' % win32print.GetDefaultPrinter (),".",0)
text_file.close()
#os.unlink(text_file.name)
#os.path.exists(text_file.name)
The problem is that Windows ShellExecute will just start the process and then return to you. It won't wait until the print process has finished with it.
If using the windows API directly, you can wait using the ShellExecuteEx function, but it doesn't appear to be in win32api.
If the user is going to be using your application for a while, you can keep a record of the file and delete it later.
Or you can write your own printing code so you don't have to hand it off to somebody else. See Print to standard printer from Python?
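On the tempfile part of the question: NamedTemporaryFile opens in binary mode by default, which is why it rejects formatted strings. A minimal sketch of a text-mode version is below. The bird data is made up, and the printing step is replaced by a plain read, so the place where you wait for the printer is an assumption you would need to fill in.

```python
import os
import tempfile

birdlist = [("Robin", "12", "07:30"), ("Wren", "3", "08:15")]  # sample data

# mode="w" makes it a text file (the default is binary, which rejects str);
# delete=False keeps it on disk after close so another program can read it.
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as tmp:
    for name, place, time_seen in birdlist:
        tmp.write("{0:30}\t {1:>5}\t {2:10}\n".format(name, place, time_seen))
    tmp_path = tmp.name

try:
    # Hand tmp_path to the printing step here and wait for it to finish.
    with open(tmp_path) as f:
        line_count = len(f.readlines())
finally:
    os.unlink(tmp_path)   # delete only after the consumer is done

still_there = os.path.exists(tmp_path)
```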
I had a similar issue with a program I'm writing. I was calling win32api.ShellExecute() in a for loop to print a list of files and delete them afterwards. I started getting Notepad.exe popup messages on my screen telling me the file doesn't exist. After inserting some raw_input("press enter") statements to debug, I discovered that I needed a delay to avoid deleting the file too soon, so adding a time.sleep(.25) line after my ShellExecute("print",...) seemed to do the trick.
Might not be the cleanest approach, but I couldn't find anything more elegant for printing in Python that handles it better.
One thing I've been thinking about is using the instance handle that is returned by ShellExecute() calls: if it's greater than 32, the call was successful. Maybe only run the delete if ShellExecute returns a value in that range, rather than relying on an arbitrary time.sleep value. The only problem with this is that it raises an exception if it's not successful and breaks out of the program.
Hope this helps!
