Python - Moving entire text between two .doc files - python

I have been stuck on this for a while and cannot figure out how to start doing it with Python. My OS is Windows XP Pro. I need a script that moves the entire text (100% of it) from one .doc file to another. But it is not as easy as it sounds. There can be many target .doc files, not just one. All the target .doc files are always in the same folder (same path), but they do not all have the same name. There is only one .doc file FROM which I want to move the text; it is always in the same folder (same path) and always has the same file name.
The names of the target files are similar but, as I said, not identical. Here is the point of the whole script:
Target .doc files have the names:
HD1.doc HD2.doc HD3.doc HD4.doc
and so on
What I would like is to move the entire text (really all of it, 100%) into the .doc file with the highest (!) number. The target .doc files will always start with ''HD'' and always look like the examples above.
It is possible that there is only one target file, so only HD1.doc. In that case ''1'' is the maximum number and the text is moved into this file.
Sometimes the target file is empty, but usually it won't be. If it is not empty, the text should be appended at the end of the existing text, on the first new line (no empty lines in between).
For example, suppose the target file with the maximum number in its name contains the following text:
a
b
c
The file from which I want to move the text contains:
d
This means I need the target file to contain:
a
b
c
d
But no empty lines anywhere.
I have found (showing three different codes):
http://paste.pocoo.org/show/169309/
But none of them makes any sense to me. I know I need to begin by finding the correct target file (the correct HDX file where X is the highest number; again, all HD files are and will be in the same folder), but I have no idea how to do this.
I mean Microsoft Office Word .doc files. They contain "pure text". By pure text I mean that I am also able to view them in Notepad (as .txt). But I need to work with the .doc extension. I want Python because I need this as an automated system, so that I would not even need to open any file. Why exactly Python and not another programming language? Because I recently started learning Python and need this script for my work. Python is the "only" programming language I am interested in, and that is why I would like to write this script with it.
By "really 100%" I mean that the entire text (everything in the source file, every single line, no matter whether there are 2 or several thousand) should be moved to the correct target file (which one is correct is described in my first post). I cannot move the whole file, because the text has to end up in the correct .doc file with the correct name, together with (meaning inside the same file as) any text that already exists in the target file. The source file will always be the same file, but its content will always be different (different words in the lines). It is also possible that the correct target file is empty.
If anyone could suggest anything, I would really appreciate it.
Thank you, best wishes.
I tried asking on the OpenOffice forum but got no answer. From what I have seen, the code could be something like this:
from time import sleep
import win32com.client
wordApp = win32com.client.Dispatch('Word.Application')
wordApp.Visible = False
HD1 = wordApp.Documents.Open('C:\\test.doc')  # HD1: the Word document as an object
sleep(5)
HD1.Content.Copy()  # copies the entire document content to the clipboard
But I have no idea what that means. Also, I cannot refer to the .doc file like that, because I never know the correct file name in advance (HDX.doc, where X is the maximum integer; all HD files are in the same directory), so I cannot hard-code its name. The script should find the correct file. Also, ''filename'' = wordApp.Documents.Open... would surely give me a syntax error. :-(

OpenOffice ships with full Python scripting support; have a look: http://wiki.services.openoffice.org/wiki/Python
It might be easier than trying to mess around with MS Word and the COM APIs.

So you want to take the text from one doc file and append it to the end of the text in another doc file. The problem here is that these are MS Word files. It's a proprietary format, and as far as I know there is no module to access them from Python.
If you are on Windows, you can access them via the COM API, but that's pretty complicated. Still, look into that. Otherwise I recommend that you not use MS Word files. The above sounds like some sort of logging facility, and it sounds like a bad idea to use Word files for this; it's too fragile.
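The step the question asks about first, picking the HD file with the highest number, needs no Word support at all. Below is a rough sketch of that step. The function names `find_highest_hd` and `move_text` are made up for illustration, and `move_text` leans on the asker's statement that the files are readable as plain text in Notepad; for real binary Word documents it would corrupt the file, and something like the COM route would be needed instead.

```python
import os
import re

# Matches names such as HD1.doc or HD12.doc and captures the number.
HD_PATTERN = re.compile(r"^HD(\d+)\.doc$", re.IGNORECASE)


def find_highest_hd(folder):
    """Return the path of the HD<number>.doc file with the largest number, or None."""
    best_number, best_name = -1, None
    for name in os.listdir(folder):
        match = HD_PATTERN.match(name)
        if match and int(match.group(1)) > best_number:
            best_number, best_name = int(match.group(1)), name
    return os.path.join(folder, best_name) if best_name else None


def move_text(source_path, target_path):
    """Append all text from source to target, then empty the source.

    ASSUMPTION: both files hold plain text despite the .doc extension.
    """
    with open(source_path, "r") as src:
        new_text = src.read().strip("\r\n")
    with open(target_path, "r") as tgt:
        old_text = tgt.read().rstrip("\r\n")
    # Join without a blank line in between; skip whichever side is empty.
    parts = [part for part in (old_text, new_text) if part]
    with open(target_path, "w") as tgt:
        tgt.write("\n".join(parts) + ("\n" if parts else ""))
    open(source_path, "w").close()  # "move": leave the source file empty
```

Comparing the captured numbers as integers matters here: a plain string sort would rank HD9.doc above HD10.doc.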

Related

How to store a txt file in your program and reference it

Let me preface by saying I am very new to programming. I'm creating a fun program that I can use to start my day at work. One of the things I want it to do is display a random compliment. I made a text file that has multiple lines in it. How do I store that text file then open it?
I've opened text files before that were on my desktop but I want this one to be embedded in the code so when I compile the program I can take it to any computer.
I've googled a ton of different key words and keep finding the basics of opening and reading txt files but that's not exactly what I need.
Perhaps start with defining a default path to your file; this makes it easier to change the path when moving to another computer. Next, define a function in your program to read and return the contents of the file:
FILE_PATH = "my/path/to/file/"
def read_file(file_name):
    with open(FILE_PATH + file_name) as f:
        return f.read()
With that in place, you can use this function to read, modify, or display the file contents, for example to edit something from your file:
def edit_comments():
    text = read_file("daily_comments.txt")
    text = text.replace("foo", "foo2")
    return text
There are obviously many ways to approach this task, this is just a simple example to get you started.
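If the goal is a single script that can be carried to any computer without an accompanying file, one option is to keep the lines in a multi-line string inside the program itself. A minimal sketch of that idea follows; `COMPLIMENTS` and `random_compliment` are made-up names, and the sample lines are placeholders for the contents of the text file.

```python
import random

# Placeholder lines; in a real program these would be the text file's contents.
COMPLIMENTS = """\
You write clear code.
Your questions are well thought out.
You make the team better.
"""


def random_compliment(text=COMPLIMENTS):
    """Return one randomly chosen non-empty line from the embedded text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return random.choice(lines)


if __name__ == "__main__":
    print(random_compliment())
```

The trade-off is that editing the compliments means editing the script, whereas the file-based approach above keeps the text separate from the code.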

Create variable from unassigned text

I'm trying to set up a program that can read in text located in the program file itself but not assigned to a variable.
What I mean by that is:
There once was a boy who went on an adventure.
He did many, many thing. Yada, yada, yada.
[begin code here to read text]
I'm trying to design it to take in the typed lines of text and then give a variable name to each line until it reaches a stopping point, such as a blank line or a line with only a period. Assume the text will begin on line 2.
If anybody has any ideas on how to make this work they would be very much appreciated.
Maybe if you tried smuggling the text in as a docstring?
"""Test Doc String"""
print(__doc__)
Results in (if this code is saved in source.py):
me@Bob:~$ python source.py
Test Doc String
This relies upon keeping your program within a single file. If you import that file (e.g. from source.py to target.py) you would then have to refer to that docstring using the name of the source module (e.g. source.__doc__). Anyway, you can assign __doc__ to the variable of your choice and then parse away...
But do you really want to do this? Why not read the text in from a separate text file?
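Wherever the text comes from (`__doc__`, a string constant, or a separate file), the "parse away" step could look roughly like the sketch below. It reads lines until a blank line or a line holding only a period, then names each line by its position in a dictionary rather than creating separate variables. `STORY`, `lines_until_stop` and the `line_N` keys are invented for illustration.

```python
# Sample text; the blank line marks the stopping point.
STORY = """There once was a boy who went on an adventure.
He did many, many things. Yada, yada, yada.

This line comes after the blank line and is ignored.
"""


def lines_until_stop(text):
    """Collect lines until a blank line or a line holding only a period."""
    collected = []
    for line in text.splitlines():
        if line.strip() in ("", "."):
            break
        collected.append(line)
    return collected


# Name each collected line by its position: line_1, line_2, ...
named = {"line_%d" % i: line
         for i, line in enumerate(lines_until_stop(STORY), start=1)}
```

A dictionary keeps the number of lines flexible; generating real variable names at run time is possible but usually harder to work with.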

How to extract text from several .txt files with Python?

I'm relatively new to programming and using Python, and I couldn't find anything on here that quite answered my question. Basically what I'm looking to do is extract a certain section of about 150 different .txt files and collect each of these pieces into a single .txt file.
Each of the .txt files contains DNA sequence alignment data, and each file basically reads out several dozen different possible sequences. I'm only interested in one of the sequences in each file, and I want to be able to use a script to excise that sequence from all of the files and combine them into a single file that I can then feed into a program that translates the sequences into protein code. Really what I'm trying to avoid is having to go one by one through each of the 150 files and copy/paste the desired sequence into the software.
Does anyone have any idea how I might do this? Thanks!
Edit: I tried to post an image of one of the text files, but apparently I don't have enough "reputation."
Edit2: Hi y'all, I'm sorry I didn't get back to this sooner. I've uploaded the image, here's a link to the upload: http://imgur.com/k3zBTu8
I'm assuming you have 150 FASTA files and that each file contains the sequence id whose sequence you want. You could use the Biopython module for this. Put all 150 files in a folder such as "C:\seq_folder" (the folder should not contain any other files, and the txt files should not be open):
import os
from Bio import SeqIO
os.chdir('C:\\seq_folder')  # change the working directory so Python can find the txt files easily
seq_id = 'x'  # replace 'x' with the id of the sequence you want
txt_list = os.listdir('C:\\seq_folder')
result = open('result.fa', 'w')
for item in txt_list:
    with open(item, 'r') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            if record.id == seq_id:
                # write the matching record in FASTA format
                result.write('>' + record.id + '\n')
                result.write(str(record.seq) + '\n')
result.close()
This code collects the sequence with your desired id from all the files and writes them to 'result.fa' in FASTA format. You can also translate them into protein using the Biopython module.

Pulling data out of MS Word with pywin32

I am running python 3.3 in Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week on the best method to do this. Originally I tried to save the .docx files as .txt and parse through using RE's, but I had some formatting problems with hidden characters - I was using a script to open a .docx and save as .txt. I am wondering if I did a proper File>SaveAs>.txt would it strip out the odd formatting and then I could properly parse through? I don't know but I gave up on this method.
I tried to use the docx module but I've been told it is not compatible with python 3.3. So I am left with using pywin32 and the COM. I have used this successfully with Excel to get the data I need but I am having trouble with Word because there is FAR less documentation and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there and I have no idea how to actually parse through to grab the data I need. I can create RE's that will successfully find the strings that I'm looking for, I just don't know how to implement them into the program using the COM.
The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: code I was using to save the files from docx to txt:
for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()
If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # build the .txt path from the actual file name, not the literal string 'infile'
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
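If you would rather keep using doc.Content.Text directly, the "odd formatting" may simply be Word's control characters. The sketch below assumes that is the case: that paragraph marks arrive as \r, table cell ends as \x07, and manual line and page breaks as \x0b and \x0c. It normalizes those before applying a regular expression. `clean_word_text` and `find_strings` are hypothetical helper names, not part of pywin32.

```python
import re


def clean_word_text(raw):
    """Turn Word's paragraph marks into newlines and drop other control characters."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Assumed meanings: \x0b manual line break, \x0c page break, \x07 end of table cell.
    text = text.replace("\x0b", "\n").replace("\x0c", "\n")
    return text.replace("\x07", "")


def find_strings(pattern, raw):
    """Return every match of pattern in the cleaned document text."""
    return re.findall(pattern, clean_word_text(raw))
```

Inside the loop from the question this would be called as, for example, find_strings(r'ID-\d+', doc.Content.Text), where the pattern is only a placeholder for your own regular expression.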
oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:
from oodocx import oodocx
d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')
If you just want to remove strings, you can pass an empty string to the "replace" parameter.

What is the best way to do a find and replace of multiple queries on multiple files?

I have a file that has over 200 lines in this format:
name old_id new_id
The name is useless for what I'm trying to do currently, but I still want it there because it may become useful for debugging later.
Now I need to go through every file in a folder and find all the instances of old_id and replace them with new_id. The files I'm scanning are code files that could be thousands of lines long. I need to scan every file with each of the 200+ ids that I have, because some may be used in more than one file, and multiple times per file.
What is the best way to go about doing this? So far I've been creating Python scripts to figure out the list of old ids and new ids and which ones match up with each other, but I've been doing it very inefficiently: I scanned the first file line by line, took the id on the current line, and then scanned the second file line by line until I found a match. Then I repeated this for each line in the first file, which meant reading the second file many times. I didn't mind doing this inefficiently because they were small files.
Now that I'm searching probably somewhere around 30-50 files that can each have thousands of lines of code, I want it to be a little more efficient. This is just a hobbyist project, so it doesn't need to be super good. I just don't want it to take more than 5 minutes to find and replace everything, then look at the result, see that I made a little mistake, and need to do it all over again. Taking a few minutes is fine (although I'm sure computers nowadays can do this almost instantly), but I just don't want it to be ridiculous.
So what's the best way to go about doing this? So far I've been using python but it doesn't need to be a python script. I don't care about elegance in the code or way I do it or anything, I just want an easy way to replace all of my old ids with my new ids using whatever tool is easiest to use or implement.
Examples:
Here is a line from the list of ids. The first part is the name and can be ignored, the second part is the old id, and the third part is the new id that needs to replace the old id.
unlock_music_play_grid_thumb_01 0x108043c 0x10804f0
Here is an example line in one of the files to be replaced:
const v1, 0x108043c
I need to be able to replace that id with the new id so it looks like this:
const v1, 0x10804f0
Use something like multiwordReplace (I've edited it for your situation) with mmap.
import os
import os.path
import re
from mmap import mmap
from contextlib import closing
id_filename = 'path/to/id/file'
directory_name = 'directory/to/replace/in'
# read the ids into a dictionary mapping old to new
with open(id_filename) as id_file:
    ids = dict(line.split()[1:] for line in id_file)
# compile a regex to do the replacement
id_regex = re.compile('|'.join(map(re.escape, ids)))
def translate(match):
    return ids[match.group(0)]
def multiwordReplace(text):
    return id_regex.sub(translate, text)
# note: under Python 3 an mmap yields bytes, so the pattern and ids would have to be bytes too
for code_filename in os.listdir(directory_name):
    code_path = os.path.join(directory_name, code_filename)
    with open(code_path, 'r+') as code_file:
        with closing(mmap(code_file.fileno(), 0)) as code_map:
            new_file = multiwordReplace(code_map)
    # write the result only after the map and the file handle are closed
    with open(code_path, 'w') as code_file:
        code_file.write(new_file)
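For files of a few thousand lines, reading each one fully into memory is usually fast enough, so the same one-pass idea can also be sketched without mmap. The version below is written for Python 3 and is an illustration, not the answer's original method; `load_ids` and `replace_ids_in_folder` are made-up names.

```python
import os
import re


def load_ids(id_path):
    """Map old id -> new id from lines of the form: name old_id new_id."""
    ids = {}
    with open(id_path) as id_file:
        for line in id_file:
            fields = line.split()
            if len(fields) == 3:
                ids[fields[1]] = fields[2]
    return ids


def replace_ids_in_folder(folder, ids):
    """Rewrite every file in folder, swapping old ids for new ones in one pass."""
    if not ids:
        return
    # Longest ids first, so a short id never shadows a longer one sharing its prefix.
    ordered = sorted(ids, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(old) for old in ordered))
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path) as code_file:
            text = code_file.read()
        new_text = pattern.sub(lambda m: ids[m.group(0)], text)
        if new_text != text:
            with open(path, "w") as code_file:
                code_file.write(new_text)
```

Because all ids are replaced in a single pass over each file, a new id that happens to equal some other old id is not replaced a second time, which can happen when the replacements are run one id at a time.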
