Pulling data out of MS Word with pywin32


I am running Python 3.3 on Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week for the best way to do this. Originally I tried saving the .docx files as .txt and parsing them with regular expressions, but I ran into formatting problems with hidden characters (I was using a script to open each .docx and save it as .txt). I wonder whether a proper File > Save As > .txt would strip out the odd formatting so that I could parse the text properly. I don't know, because I gave up on this method.
I tried to use the docx module, but I've been told it is not compatible with Python 3.3. So I am left with pywin32 and COM. I have used this successfully with Excel to get the data I need, but I am having trouble with Word because there is FAR less documentation, and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.
The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: code I was using to save the files from docx to txt:
for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()

If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # Build the .txt path from the variable infile, not the literal string 'infile'
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
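Going back to the plain-text route: process_the_file is left undefined above. Here is a minimal sketch of what it might look like, assuming you want every match of a regular expression. The pattern argument and the INV-1001 sample data are my own inventions for illustration, and the list of control characters is only what I would expect Word to leave behind (most visibly in text returned by doc.Content.Text), so adjust it to what you actually see.

```python
import io
import re

# Control characters Word can leave in text, e.g. in doc.Content.Text:
# \x07 marks the end of a table cell, \x0b is a manual line break,
# \x0c is a page or section break. (Assumed set; extend as needed.)
WORD_CONTROL_CHARS = re.compile(r'[\x07\x0b\x0c]')

def process_the_file(f, pattern):
    """Return every regex match found in the open text file f."""
    compiled = re.compile(pattern)
    matches = []
    for line in f:
        # Turn stray paragraph marks into newlines and blank out control characters.
        line = WORD_CONTROL_CHARS.sub(' ', line.replace('\r', '\n'))
        matches.extend(compiled.findall(line))
    return matches

# Stand-in for a file opened with open(txtpath, encoding='utf-16').
sample = io.StringIO('Invoice: INV-1001\x07Total\x07\nSee also INV-2002\x0b(revised)\n')
found = process_the_file(sample, r'INV-\d+')
print(found)
```

The same function should work on a string taken straight from COM if you wrap it in io.StringIO, which avoids the temporary file when the plain text is all you need.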

oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:
from oodocx import oodocx
d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')
If you just want to remove strings, you can pass an empty string to the "replace" parameter.

Related

LibreOffice/other method of filling template .txt file for import into Zim Wiki

I am using the application Zim Wiki (cross-platform, FOSS) to keep a personal wiki with lots of data coming from tables, copying and pasting, my own writing, and downloaded .png and .html files that I attach for viewing and offline access.
The data that is not written or pasted can be stored in tables in the form of names, url addresses, and the names and locations of images and other attachments.
To insert into Zim, I can use the WYSIWYG front end, or, to make the skeleton of each entry, I could modify a template text entry. If I do this, nothing matters except the location and identity of each character in each line.
By supplying the text in this image:
[image: DandelionDemo source text]
I can make this entry for Dandelion:
[image: DandelionDemo Wiki]
So, I can generate and name the Wiki entry in Zim, which creates the .txt file for me, and inserts the time stamp and title, so, the template for this type of entry without the pasted fields would be:
**Full Scientific Name: **[[|]]**[syn]**
**Common Name(s): **
===== =====
**USDA PLANTS entry for Code:** [[https://plants.usda.gov/core/profile?symbol=|]] **- CalPhotos available images for:** [[https://calphotos.berkeley.edu/cgi/img_query?query_src=photos_index&where-taxon=|]]
**---**
**From - Wikipedia **[[wp?]] **- **[[/Docs/Plants/]]
{{/Docs/Plants/?height=364}}{{/Docs/Plants/?height=364}}
**()** //,// [[|(source)]]
**()** //// [[|(source)]]
**Wikipedia Intro: **////
---
So the first line with content, after the 31st character(which is a tab), you paste "http... {etc}. Then the procedure would insert "Taraxacum officinale... {etc}" after the "|", or what was the 32nd character, and so on. This data could be from "table1" and "table2", or combining the tables to make an un-normalized "table1table2", where each row could be converted to text or a .csv or I don't know, what do you think?
Is there a way to do this in LibreOffice? I have used LibreOffice Base to generate a "book" form that populated fields, but it was much less complex data, without wiki linking and drag-and-drop pasting of images and attachments. So maybe the answer is to go simpler? The tables are not currently part of a registered database, but I could do that, once I have decided on the method of doing this.
I am ultimately looking for a "way", hopefully an "easy" way. However, that may not be in LibreOffice. If not, I know that I could do this in Python, but I haven't learned much about Python yet. If it involves a language, that is the first and only one I don't know that I will invest in learning for this project. If you know a "way" to do this in Python, let me know, and my first project and way of framing my study process will be in learning the methods that you share.
If you know of some other Linux GUI, I am definitely interested, but only in active free and open source builds that involve minimal/no compiling. I know the basics of SQL and DBMSs. In the past, I have gotten Microsoft SQL Server lite to work, but not DBeaver, yet. If you know of a CLI way, also let me know, but I am a self-taught, outdoors-loving Linux newb and mostly know how to tweak little settings in programs and how to use moderately easy programs like ImageMagick, and I have built a few LAMP stacks for Drupal and WordPress (no Bash etc...).
Thank you very much!

OK, since you want to learn some Python, let me propose a way to do this. First you need a template engine such as jinja2 (there are many others); a data source, in our example a .csv file (it could be something else, such as a database); and finally some code that reads the CSV line by line and merges its content with the template.
Sample CSV file:
1;sample
2;dandelion
3;just for fun
Sample template:
**Full Scientific Name: **[[|]]**[syn]**
**Common Name(s): *{{name}}*
===== =====
USDA PLANTS entry for Code: *{{symbol}}*
---
Sample code:
#!/usr/bin/env python
#
# Using the file system loader
#
# We assume there is a file called test_template.zim
# in the same directory as this script.
#
from jinja2 import Environment, FileSystemLoader
import os
import csv
# Capture our current directory
THIS_DIR = os.path.dirname(os.path.abspath(__file__))
def print_zim_doc():
    # Create the jinja2 environment.
    # Notice the use of trim_blocks, which greatly helps control whitespace.
    j2_env = Environment(loader=FileSystemLoader(THIS_DIR),
                         trim_blocks=True)
    template = j2_env.get_template('test_template.zim')
    # The CSV file lives next to this script, like the template
    with open(os.path.join(THIS_DIR, 'example.csv')) as csv_file:
        reader = csv.reader(csv_file, delimiter=';')
        for row in reader:
            result = template.render(symbol=row[0], name=row[1])
            # Save the result; the with block closes the file for us
            with open(row[0] + ".txt", "wt") as fh:
                fh.write(result)
if __name__ == '__main__':
    print_zim_doc()
The code is pretty simple: it reads the template located in the same folder as the Python code, opens the CSV file (also located in the same place), iterates over each line of the CSV, and renders the template, using the values of the CSV columns to fill the {{var_name}} placeholders. Finally, it saves each rendered result in a new file named after one of the CSV column values. This sample will generate 3 files (1.txt, 2.txt, 3.txt). From here you can extend and improve the code to get your desired results.
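If installing jinja2 is a hurdle while you are still getting started, the same read-a-row, fill-the-template loop can be sketched with string.Template from the standard library. This is a different, much simpler engine than jinja2 (no loops or filters), and it uses $name placeholders instead of {{name}}. The template text and the render_rows helper below are hypothetical, cut down from the sample above.

```python
import csv
import io
from string import Template

# Zim markup with $-placeholders instead of jinja2's {{...}} syntax.
ZIM_TEMPLATE = Template(
    "**Common Name(s):** $name\n"
    "USDA PLANTS entry for Code: $symbol\n"
)

def render_rows(csv_file):
    """Yield (symbol, rendered_text) for each row of a ';'-delimited CSV."""
    for row in csv.reader(csv_file, delimiter=';'):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        symbol, name = row[0], row[1]
        yield symbol, ZIM_TEMPLATE.substitute(symbol=symbol, name=name)

# Stand-in for open('example.csv'); same rows as the sample CSV above.
sample_csv = io.StringIO("1;sample\n2;dandelion\n3;just for fun\n")
pages = dict(render_rows(sample_csv))
print(pages["2"])
```

Each rendered string could then be written to its own .txt file exactly as in the jinja2 version. Once the templates need conditionals or loops, jinja2 is likely the better fit.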

Open a word document with python using windows

I am trying to open a Word document with Python in Windows, but I am unfamiliar with Windows.
My code is as follows.
import docx as dc
doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
Through another post, I learned that I had to put the r in front of my string to convert it to a raw string or it would interpret the \U as an escape sequence.
The error I get is
PackageNotFoundError: Package not found at 'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx'
I'm unsure of why it cannot find my document, 01100-Allergan-UD1314-SUMMARY OF WORK.docx. The pathway is correct as I copied it directly from the file system.
Any help is appreciated thanks.

try this
from io import BytesIO
from docx import Document
file = r'H:\myfolder\wordfile.docx'
# A .docx file is a zip archive, so read it in binary mode
with open(file, 'rb') as f:
    source_stream = BytesIO(f.read())
document = Document(source_stream)
source_stream.close()
http://python-docx.readthedocs.io/en/latest/user/documents.html
Also, with regard to debugging the file-not-found error, simplify your directory and file names. Rename the file to 'file' instead of referring to a long path with spaces, etc.
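Before renaming anything, it may help to find out what is actually wrong with the path. As far as I can tell, python-docx raises PackageNotFoundError both when the file is missing and when it is not a zip archive (for example an old .doc renamed to .docx). Here is a rough sketch using only the standard library; diagnose_docx_path is a hypothetical helper of mine, not part of python-docx.

```python
import os
import tempfile
import zipfile

def diagnose_docx_path(path):
    """Return a short reason why a .docx path might fail to open, or 'ok'."""
    if not os.path.exists(path):
        return 'missing'
    if not os.path.isfile(path):
        return 'not a file'
    if not zipfile.is_zipfile(path):
        return 'not a zip archive'
    return 'ok'

# Demonstration on throwaway files instead of a real document.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.docx')
    with zipfile.ZipFile(good, 'w') as z:
        z.writestr('word/document.xml', '<w:document/>')
    bad = os.path.join(tmp, 'bad.docx')
    with open(bad, 'w') as fh:
        fh.write('plain text, not a zip')
    missing = os.path.join(tmp, 'nope.docx')
    results = [diagnose_docx_path(p) for p in (good, bad, missing)]
print(results)
```

Running diagnose_docx_path on the exact raw string you pass to Document should tell you whether the problem is the path or the file contents.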

If you want to open the document in Microsoft Word try using os.startfile().
In your example it would be:
os.startfile(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
This will open the document in Word on your computer, provided Word is the application associated with .docx files.

cPickle.load( ) error

I am working with cPickle to convert structured data into a data stream and pass it to a library. I have to read the contents of a manually written file named "targetstrings.txt" and convert them into the format that the NetCDF library needs, in the following manner:
Note: targetstrings.txt contains Latin characters
op=open("targetstrings.txt",'rb')
targetStrings=cPickle.load(op)
The NetCDF library takes the contents as strings.
While loading the file, it fails with the following error:
cPickle.UnpicklingError: invalid load key, 'A'.
Please tell me how I can rectify this error. I have googled around but did not find an appropriate solution.
Any suggestions?

pickle is not for reading/writing generic text files, but to serialize/deserialize Python objects to file. If you want to read text data you should use Python's usual IO functions.
with open('targetstrings.txt', 'r') as f:
    fileContent = f.read()
If, as it seems, the library just wants to have a list of strings, taking each line as a list element, you just have to do:
with open('targetstrings.txt', 'r') as f:
    lines=[l for l in f]
# now in lines you have the lines read from the file
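To make the difference concrete, here is a small sketch in Python 3 (where cPickle has become plain pickle). It shows that unpickling plain text fails on the very first byte, that reading the same bytes as text works, and that pickle is happy with data it wrote itself. The sample content and the 'latin-1' encoding are assumptions on my part; use whatever encoding your file really has.

```python
import io
import pickle

# Stand-in for the bytes of targetstrings.txt (assumed to be Latin-1 text).
text_bytes = 'Alpha\nB\xeata\n'.encode('latin-1')

# Plain text is not a pickle stream, so loading it fails on the first byte.
try:
    pickle.load(io.BytesIO(text_bytes))
    outcome = 'loaded'
except pickle.UnpicklingError:
    outcome = 'UnpicklingError'

# Reading the same bytes as text, with an explicit encoding, works fine.
reader = io.TextIOWrapper(io.BytesIO(text_bytes), encoding='latin-1')
lines = reader.read().splitlines()

# pickle only round-trips data that pickle itself wrote.
restored = pickle.loads(pickle.dumps(lines))
print(outcome, lines, restored == lines)
```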

As stated - Pickle is not meant to be used in this way.
If you need to manually edit complex Python objects that are to be read and passed as Python objects to another function, there are plenty of other formats to use, for example XML, JSON, or Python files themselves. Pickle uses a Python-specific protocol that, while not binary (in version 0 of the protocol) and not changing across Python versions, is not meant for this, and is not even the recommended method for recording Python objects for persistence or communication (although it can be used for those purposes).

Python - Moving entire text between two .doc files

I have been having this issue for a while and cannot figure out how to start doing this with Python. My OS is Windows XP Pro. I need a script that moves the entire text (100% of it) from one .doc file to another. But it's not as easy as it sounds. There is not just one possible target .doc file; there can be many of them. All the target .doc files are always in the same folder (same path), but they don't all have the same name. There is only one .doc file FROM which I want to move the entire text; it is always in the same folder (same path) and always has the same file name.
The names of the target files are similar but, as I said before, not the same. Here is the point of the whole script:
Target .doc files have the names:
HD1.doc HD2.doc HD3.doc HD4.doc
and so on
What I would like is to have the entire text (really all of it, 100%) moved into the .doc file with the highest ( ! ) number. The target .doc files will always start with ''HD'' and will always be similar to the examples above.
It is possible that there is only one target file, so only HD1.doc. In that case ''1'' is the maximum number and the text is moved into this file.
Sometimes the target file is empty, but usually it won't be. If it isn't, the text should be appended to the end of the existing text, on the first new line (no empty lines in between).
So for example in the target file which has the maximum number in its name is the following text:
a
b
c
In the file from which I want to move the text is:
d
This means I need in the target file this:
a
b
c
d
But no empty lines anywhere.
I have found (showing three different codes):
http://paste.pocoo.org/show/169309/
But none of them makes any sense to me. I know I would need to begin by finding the correct target file (the HDX file where X is the highest number; again, all HD files are and will be in the same folder), but I have no idea how to do this.
I mean Microsoft Office Word .doc files. They contain "pure text", by which I mean that I am also able to view them in Notepad (as .txt). But I need to work with the .doc extension. I want Python because I need this as an automated system, so that I wouldn't even need to open any file. Why exactly Python and not another programming language? Because I recently started learning Python and need this script for my work; Python is the "only" programming language I'm interested in, and that's why I would like to write this script in it. By "really 100%" I mean that the entire text (everything in the source file, every single line, no matter whether there are 2 or several thousand) should be moved to the correct target file (which one is correct is described in my first post). I cannot move the whole file: the source file will always be the same file, but its text content will always be different, and I need that text in the correct .doc file with the correct name, together with (that is, inside the same file as) any text that already exists in the target file. It is also possible that the correct target file is empty.
If someone could suggest me anything, I would really appreciate it.
Thank you, best wishes.
I have tried asking on the OpenOffice forum, but they don't answer. I have seen that the code could be something like this:
from time import sleep
import win32com.client
from win32com.client import Dispatch
wordApp = win32com.client.Dispatch('Word.Application')
wordApp.Visible=False
wordApp.Documents.Open('C:\\test.doc')
sleep(5)
HD1 = wordApp.Documents.Open('C:\\test.doc') #HD1 word document as object.
HD1.Content.Select.Copy() #Selects entire document and copies it.
But I have no idea what that means. Also, I cannot refer to the .doc file like that, because I never know the correct file name in advance (HDX.doc where X is the maximum integer number, all HD files are in the same directory path), so I cannot hard-code its name; the script should find the correct file. Also, ''filename'' = wordApp.Documents.open... would surely give me a syntax error. :-(

Openoffice ships with full python scripting support, have a look: http://wiki.services.openoffice.org/wiki/Python
Might be easier than trying to mess around with MS Word and COM apis.

So you want to take the text from a doc file and append it to the end of the text in another doc file. The problem here is that they are MS Word files. It's a proprietary format, and as far as I know there is no module to access them from Python.
If you are on Windows, you can access them via the COM API, but that's pretty complicated. Still, look into that. Otherwise I recommend that you not use MS Word files. The above sounds like some sort of logging facility, and it sounds like a bad idea to use Word files for this; it's too fragile.
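Whichever route you take for the Word part, the step the question is stuck on (finding the HDX.doc file with the highest X) needs nothing beyond the standard library. Here is a minimal sketch, assuming the names really are HD followed by digits and .doc; highest_hd_file is a hypothetical helper name, and the demonstration runs on empty throwaway files.

```python
import os
import re
import tempfile

# Matches HD1.doc, HD2.doc, HD10.doc, ... and captures the number.
HD_NAME = re.compile(r'^HD(\d+)\.doc$', re.IGNORECASE)

def highest_hd_file(folder):
    """Return the path of the HD<number>.doc file with the largest number, or None."""
    best_number, best_path = -1, None
    for name in os.listdir(folder):
        match = HD_NAME.match(name)
        # Compare as integers so that HD10 ranks above HD9.
        if match and int(match.group(1)) > best_number:
            best_number = int(match.group(1))
            best_path = os.path.join(folder, name)
    return best_path

with tempfile.TemporaryDirectory() as tmp:
    for name in ('HD1.doc', 'HD2.doc', 'HD10.doc', 'notes.doc'):
        open(os.path.join(tmp, name), 'w').close()
    target = os.path.basename(highest_hd_file(tmp))
print(target)
```

The returned path can then be handed to wordApp.Documents.Open as in the question's snippet. For the appending step, Word's Range object has an InsertAfter method that may suit, though I have not tested that part here.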

Python Script to find instances of a set of strings in a set of files

I have a file which I use to centralize all strings used in my application. Let's call it Strings.txt:
TITLE="Title"
T_AND_C="Accept my terms and conditions please"
START_BUTTON="Start"
BACK_BUTTON="Back"
...
This helps me with I18n. The issue is that my application is now a lot larger and has evolved, so a lot of these strings are probably not used anymore. I want to eliminate the ones that are no longer used and tidy up the file.
I want to write a Python script. Using regular expressions I can get all of the string aliases, but how can I search all files in a Java package hierarchy for an instance of a string? If there is a reason I should use Perl or Bash instead, let me know, as I can, but I'd prefer to stick to one scripting language.
Please ask for clarification if this doesn't make sense. Hopefully this is straightforward; I just haven't used Python much.
Thanks in advance,
Gav

Assuming the files are of reasonable size (as source files will be) so you can easily read them in memory, and that you're looking for the parts in quotes right of the = signs:
import collections
import os
files_by_str = collections.defaultdict(list)
thestrings = []
with open('Strings.txt') as f:
    for line in f:
        if '=' not in line:
            continue  # skip blank lines and anything that is not KEY="value"
        text = line.split('=', 1)[1]
        text = text.strip().replace('"', '')
        thestrings.append(text)
for root, dirs, files in os.walk('/top/dir/of/interest'):
    for name in files:
        path = os.path.join(root, name)
        with open(path) as f:
            data = f.read()
        # Check every text; no break, so a file is recorded under each text it contains
        for text in thestrings:
            if text in data:
                files_by_str[text].append(path)
This gives you a dict with the texts (those that are present in 1+ files, only), as keys, and lists of the paths to the files containing them as values. If you care only about a yes/no answer to the question "is this text present somewhere", and don't care where, you can save some memory by keeping only a set instead of the defaultdict; but I think that often knowing what files contained each text will be useful, so I suggest this more complete version.

To parse your strings.txt you don't need regular expressions:
all_strings = [i.partition('=')[0] for i in open('strings.txt')]
To parse your source you could use the dumbest regex (note the raw string, so that \b means a word boundary rather than a backspace):
re.search(r'\bTITLE\b', source) # for each string in all_strings
To walk the source directory you could use os.walk.
A successful re.search means that the string is still in use, so remove it from all_strings: you'll be left with the strings that need to be removed from strings.txt.
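Put together, the steps above might look like the sketch below. It is only an illustration: unused_aliases, its extension parameter, and the toy file contents are all made up, and it assumes the aliases appear verbatim in the Java sources.

```python
import os
import re
import tempfile

def unused_aliases(strings_path, source_root, extension='.java'):
    """Return the aliases from strings_path that never appear under source_root."""
    with open(strings_path) as f:
        remaining = {line.partition('=')[0].strip() for line in f if '=' in line}
    for root, dirs, files in os.walk(source_root):
        for name in files:
            if not name.endswith(extension):
                continue
            with open(os.path.join(root, name), errors='replace') as f:
                source = f.read()
            # Drop every alias that occurs as a whole word in this file.
            remaining -= {alias for alias in remaining
                          if re.search(r'\b' + re.escape(alias) + r'\b', source)}
    return sorted(remaining)

# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    strings_path = os.path.join(tmp, 'Strings.txt')
    with open(strings_path, 'w') as f:
        f.write('TITLE="Title"\nSTART_BUTTON="Start"\nBACK_BUTTON="Back"\n')
    package_dir = os.path.join(tmp, 'src', 'com', 'example')
    os.makedirs(package_dir)
    with open(os.path.join(package_dir, 'Main.java'), 'w') as f:
        f.write('label.setText(Strings.TITLE); String x = BACK_BUTTON_OLD;\n')
    unused = unused_aliases(strings_path, os.path.join(tmp, 'src'))
print(unused)
```

Because of the word boundaries, BACK_BUTTON_OLD in the toy source does not count as a use of BACK_BUTTON. Whether that is what you want depends on how the aliases are referenced in your code.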

You might consider using ack.
% ack --java 'search_string'
This will search under the current directory.

You should consider using YAML: easy to use, human readable.

You are re-inventing gettext, the standard for translating programs in the Free Software sphere (even outside python).
Gettext works with, in principle, large files with strings like these :-). Helper programs exist to merge in new marked strings from the source into all translated versions, marking unused strings etc etc. Perhaps you should take a look at it.
