Open a word document with python using windows - python

I am trying to open a word document with python in windows, but I am unfamiliar with windows.
My code is as follows.
import docx as dc
doc = dc.Document(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
Through another post, I learned that I had to put the r in front of my string to convert it to a raw string or it would interpret the \U as an escape sequence.
The error I get is
PackageNotFoundError: Package not found at 'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx'
I'm unsure of why it cannot find my document, 01100-Allergan-UD1314-SUMMARY OF WORK.docx. The pathway is correct as I copied it directly from the file system.
Any help is appreciated thanks.

Try this:
from io import BytesIO
from docx import Document
file = r'H:\myfolder\wordfile.docx'
# A .docx file is a zip archive, so read it in binary mode
with open(file, 'rb') as f:
    source_stream = BytesIO(f.read())
document = Document(source_stream)
source_stream.close()
http://python-docx.readthedocs.io/en/latest/user/documents.html
Also, to debug the file-not-found error, simplify your directory and file names. Rename the file to something short such as 'file' instead of referring to a long path with spaces, etc.
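Before renaming anything, it can help to check what Python actually sees at that path. The sketch below uses only the standard library; describe_path is a hypothetical helper name, not part of python-docx. It also checks whether the file is a zip archive, on the assumption that python-docx reports the same error for a file that exists but is not a real .docx package.

```python
import os
import zipfile


def describe_path(path):
    """Return a short diagnosis for a path that will not open (hypothetical helper)."""
    if os.path.isfile(path):
        # A .docx file is a zip archive; a renamed .doc or an empty file is not
        if zipfile.is_zipfile(path):
            return "ok"
        return "exists, but is not a zip-based .docx package"
    if os.path.isdir(path):
        return "is a directory"
    folder = os.path.dirname(path) or "."
    if not os.path.isdir(folder):
        return "folder not found"
    # Show names that start the same way, to expose a hidden or doubled extension
    stem = os.path.splitext(os.path.basename(path))[0].lower()
    similar = [name for name in os.listdir(folder) if name.lower().startswith(stem)]
    return "file not found; similar names: %r" % sorted(similar)
```

Calling print(describe_path(r'C:\Users\...\file.docx')) with your own path shows whether the problem is the folder, the file name, or the file contents.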

If you want to open the document in Microsoft Word, try using os.startfile(). It is only available on Windows, and you need to import os first.
In your example it would be:
os.startfile(r'C:\Users\justin.white\Desktop\01100-Allergan-UD1314-SUMMARY OF WORK.docx')
This will open the document in Word on your computer.

Related

Inserting txt file info into docx

I am using Python 3.5, python-docx, and Tkinter. Is there a way to type a .txt file name into an entry box and have Python insert the contents of that .txt file into a specific place in a docx? I know how to get information from Entry boxes in Tkinter and how to convert it into a string.
I want to be able to enter someone's name (which would also be the name of the text file) into the entry box and have Python insert the content of the .txt file.
Thanks
Update: here is the code that I'm trying
from tkinter import *
from tkinter import ttk
import tkinter as tk
from docx import Document
root=Tk()
def make_document():
    testbox=ProjectEngineerEntry.get()
    TestBox=str(testbox)
    def projectengineer():
        with open(+TestBox+'.txt') as f:
            for line in f:
                document.add_paragraph(line)
    document=Document()
    h1=document.add_heading('engineer test',level=1)
    h1a=document.add_heading('Project Engineer',level=2)
    projectengineer()
    document.save('test.docx')
notebook=ttk.Notebook()
notebook.pack(fill=BOTH)
frame1=ttk.Frame(notebook)
notebook.add(frame1,text='Tester',sticky=W)
tester1=Label(frame1,text='Test1')
ProjectEngineerEntry=Entry(frame1)
tester1.pack()
ProjectEngineerEntry.pack()
save=Button(frame1,text='Save',command=make_document).pack()
As you can see, I am trying to take the information from the entry box, convert it to a string, and then use that string to open a text file with that specific name. However, I keep getting:
TypeError: bad operand type for unary +: 'str'
I don't understand what's going on here. In the actual document I used the ++ method when saving the file (it saves it as the current date and time).
python-docx doesn't have this functionality and it's unlikely it ever will. Typically this sort of thing is done to suit by the script or application using python-docx. Something like this:
from docx import Document
document = Document()
with open('textfile.txt') as f:
    for line in f:
        document.add_paragraph(line)
document.save('wordfile.docx')
You'll need to deal with the particulars, like how a paragraph is separated (perhaps a blank line) and so on, but the code doesn't need to be much longer than this.
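As one way of handling the paragraph question, here is a minimal sketch that treats one or more blank lines as a paragraph break and joins the lines in between. split_paragraphs is a hypothetical helper, and the blank-line convention is an assumption about how the text file is laid out.

```python
def split_paragraphs(text):
    """Split plain text into paragraphs separated by one or more blank lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # Non-blank line: it belongs to the paragraph being collected
            current.append(line.strip())
        elif current:
            # Blank line: close the paragraph collected so far
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Each returned string can then be passed to document.add_paragraph() instead of adding every raw line, which also avoids carrying the trailing newline into the document.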
Take a look at the following python library: https://python-docx.readthedocs.io/en/latest/
Just a tidied-up adaptation of scanny's answer:
%pip install python-docx
from docx import Document
def to_docx(file_path_plain_text, file_path_docx):
    document = Document()
    with open(file_path_plain_text) as f:
        for line in f:
            document.add_paragraph(line)
    document.save(file_path_docx)
to_docx('my_file.txt', 'report.docx')
Charmap and encoding errors may arise when reading the content of the plain-text file, but writing a robust function for that is outside the scope of this topic.
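For readers who do hit those encoding errors, one possible approach is to try a short list of encodings in turn. This is only a sketch: read_lines is a hypothetical helper, and the default candidates (utf-8, then cp1252) are an assumption about where the text files come from.

```python
def read_lines(path, encodings=("utf-8", "cp1252")):
    """Return the lines of a text file, trying each candidate encoding in turn."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return [line.rstrip("\n") for line in f]
        except UnicodeDecodeError:
            # Wrong guess: fall through to the next candidate
            continue
    # Last resort: decode with the first encoding and replace undecodable bytes
    with open(path, encoding=encodings[0], errors="replace") as f:
        return [line.rstrip("\n") for line in f]
```

The returned lines have their trailing newline removed, so they can be handed to document.add_paragraph() directly.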

Regex search term used in Notepad++ does not work with python

I'm working with a large .json file filled with Twitter bios and would like to extract the screen_names. To keep the search from also returning users mentioned in the bio section, it is important to extract only the first match of each line.
When I open the file in Notepad++ I can use the following regex to do exactly that:
(^.*?)\K"screen_name": "(\w+)"
Using the same as part of an re.findall or re.search in python does not result in any matches.
I'm totally new to both Python and regex so I'm fairly certain I'm not fully aware of all the necessary coding.
Many thanks in advance!
As noted by other users, Python's re module and Notepad++ use different regex flavours (re does not support \K, for example), so to achieve my wanted result I used the following code:
import re
regex = re.compile(r'"screen_name":\s*"(\w+)"')
# Open the output file once instead of reopening it for every line
with open("followers.json", "r") as f, open("followers.txt", "a") as outp:
    for line in f:
        output = regex.search(line)
        if output:  # skip lines that contain no screen_name
            outp.write(output.group(1) + "\n")
This will analyse your specified .json file, read it line by line, and save every first match of each line in the file "followers.txt".
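The "first match per line" behaviour comes from re.search, which stops at the first match it finds, so no \K or leading lazy group is needed. A small self-contained sketch of the same idea, with first_screen_names as a hypothetical helper name:

```python
import re

SCREEN_NAME = re.compile(r'"screen_name":\s*"(\w+)"')


def first_screen_names(lines):
    """Return the first screen_name found on each line, skipping lines without one."""
    names = []
    for line in lines:
        match = SCREEN_NAME.search(line)  # search() returns only the first match
        if match:
            names.append(match.group(1))
    return names
```

Passing an open file object works as well as a list, since both yield one line at a time.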

python cannot open and edit a .reg file

I am trying to edit a .reg file in python to replace strings in a file. I can do this for any other file type such as .txt.
Here is the python code:
with open("C:/Users/UKa51070/Desktop/regFile.reg", "r") as myfile:
    data = myfile.read()
print(data)
It returns an empty string.
I am not sure why you are not seeing any output, perhaps you could try:
print(len(data))
Depending on your version of Windows, your REG file will be saved using UTF-16 encoding, unless you specifically export it using the Win9x/NT4 format.
You could try using the following script:
import codecs
with codecs.open("C:/Users/UKa51070/Desktop/regFile.reg", encoding='utf-16') as myfile:
    data = myfile.read()
print(data)
It's probably not a good idea to edit .reg files manually. My suggestion is to look for a Python package that handles it for you. I think the built-in _winreg module (named winreg in Python 3) is what you are looking for.
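If you are not sure which export format a given .reg file uses, you can look at its first bytes for a byte order mark before choosing an encoding. This is a sketch only: detect_reg_encoding is a hypothetical helper, and the cp1252 fallback for files without a BOM is an assumption that depends on the system's ANSI code page.

```python
import codecs


def detect_reg_encoding(path, fallback="cp1252"):
    """Guess a text encoding from the byte order mark at the start of the file."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec reads the BOM itself and picks the byte order
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No BOM: assumed to be an older ANSI export (fallback is a guess)
    return fallback
```

The result can be passed straight to open(path, encoding=...) or codecs.open(path, encoding=...).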

How to open a .data file extension

I am working on side stuff where the data provided is in a .data file. How do I open a .data file to see what the data looks like and also how do I read from a .data file programmatically through python? I have Mac OSX
NOTE: The Data I am working with is for one of the KDD cup challenges
Kindly try using Notepad or Gedit to check delimiters in the file (.data files are text files too). After you have confirmed this, then you can use the read_csv method in the Pandas library in python.
import pandas as pd
file_path = "~/AI/datasets/wine/wine.data"
# above .data file is comma delimited
wine_data = pd.read_csv(file_path, delimiter=",")
It vastly depends on what is in it. It could be a binary file or it could be a text file.
If it is a text file then you can open it in the same way you open any file (f=open(filename,"r"))
If it is a binary file you can just add a "b" to the open command (open(filename,"rb")). There is an example here:
Reading binary file in Python and looping over each byte
Depending on the type of data in there, you might want to try passing it through a csv reader (csv python module) or an xml parsing library (an example of which is lxml)
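If you cannot tell in advance whether the file is text or binary, a rough check is to read the first few kilobytes in binary mode and look for NUL bytes. This is a heuristic sketch, not a reliable detector: looks_binary is a hypothetical helper, and UTF-16 text will also be flagged because it contains NUL bytes.

```python
def looks_binary(path, sample_size=4096):
    """Heuristic: treat a file as binary if its first bytes contain a NUL byte."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    return b"\x00" in sample
```

If it returns False, opening the file in text mode or with a csv reader is a reasonable next step.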
After further info from above and looking at the page, the format is:
Data Format
The datasets use a format similar as that of the text export format from relational databases:
One header lines with the variables names
One line per instance
Separator tabulation between the values
There are missing values (consecutive tabulations)
Therefore see this answer:
parsing a tab-separated file in Python
I would advise trying to process one line at a time rather than loading the whole file, but if you have the ram why not...
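A minimal sketch of that line-at-a-time approach, using the csv module from the standard library. iter_rows is a hypothetical helper; it assumes the format described above (one header line, tab separators, empty fields for missing values) and that quote characters in the data carry no special meaning.

```python
import csv


def iter_rows(path):
    """Yield one dict per data line of a tab-separated file with a header row."""
    with open(path, newline="") as f:
        # QUOTE_NONE: treat quote characters as ordinary data (an assumption)
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Consecutive tabs produce empty strings; report them as None
            yield {key: (value if value != "" else None) for key, value in row.items()}
```

Because it is a generator, only one line is held in memory at a time; wrap it in list() only if the whole file fits in RAM.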
I suspect it doesn't open in Sublime because the file is huge, but that is just a guess.
To get a quick overview of what the file may contain, you could do this in a terminal using strings or cat, for example:
$ strings file.data
or
$ cat -v file.data
If you forget to pass the -v option to cat and the file is binary, you could mess up your terminal and need to reset it:
$ reset
I was just dealing with this issue myself, so I thought I would share my answer. I have a .data file and was unable to open it by simply right-clicking it. macOS recommended I open it using Xcode, so I tried that, but it did not work.
Next I tried opening it with a program named "Brackets". It is a text editor primarily used for HTML and CSS. Brackets did work.
I also tried PyCharm, as I am a Python programmer. PyCharm worked as well, and I was also able to read from the file using the following lines of code:
inf = open("processed-1.cleveland.data", "r")
lines = inf.readlines()
for line in lines:
    print(line, end="")
It works for me.
import pandas as pd
file_path = "your_file.data"  # placeholder: set your own file path here
your_data = pd.read_csv(file_path, sep=',')
your_data.head()
In other words, just treat it as a CSV file if it is separated with ','.
Solution from @mustious.

Pulling data out of MS Word with pywin32

I am running python 3.3 in Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week on the best method to do this. Originally I tried to save the .docx files as .txt and parse through using RE's, but I had some formatting problems with hidden characters - I was using a script to open a .docx and save as .txt. I am wondering if I did a proper File>SaveAs>.txt would it strip out the odd formatting and then I could properly parse through? I don't know but I gave up on this method.
I tried to use the docx module but I've been told it is not compatible with python 3.3. So I am left with using pywin32 and the COM. I have used this successfully with Excel to get the data I need but I am having trouble with Word because there is FAR less documentation and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.
The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: code I was using to save the files from docx to txt:
for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()
If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # Use the variable infile here, not the literal string 'infile'
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
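Once a plain-text copy exists, the regular expressions from the question can be applied line by line with nothing but the standard library. The sketch below is one hypothetical way to fill in the process_the_file step; find_in_text_file and its parameters are illustrative names, and the utf-16 default assumes the file was saved as Unicode text.

```python
import re


def find_in_text_file(path, pattern, encoding="utf-16"):
    """Return (line_number, matched_text) for every regex match in a text file."""
    regex = re.compile(pattern)
    hits = []
    with open(path, encoding=encoding) as f:
        for line_number, line in enumerate(f, start=1):
            # finditer() yields every non-overlapping match on the line
            for match in regex.finditer(line):
                hits.append((line_number, match.group(0)))
    return hits
```

For example, find_in_text_file(txtpath, r'Invoice \d+') would list every string of that shape together with the line it was found on.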
oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:
from oodocx import oodocx
d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')
If you just want to remove strings, you can pass an empty string to the "replace" parameter.
