Search and Replace not working in header? Python docx - python

I'm using the python-docx module to edit a large number of documents. They all contain a header in which I need to replace a number, but every time I do this the document won't open, and Word reports that the content is unreadable. Does anyone have any idea why this is happening, or a working code sample? Thanks.
from docx import *
#document = yourdocument.docx
filename = "NUR-ADM-2001"
relationships = relationshiplist()
document = opendocx("C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx")
docbody = document.xpath('/w:document/w:body',namespaces=nsprefixes)[0]
advReplace(docbody, "NUR-NPM 101", "NUR-NPM 202")
# Create our properties, contenttypes, and other support files
coreprops = coreproperties(title='Nursing Doc',subject='Policies',creator='IA',keywords=['Policy'])
appprops = appproperties()
contenttypes = contenttypes()
websettings = websettings()
wordrelationships = wordrelationships(relationships)
# Save our document
savedocx(document,coreprops,appprops,contenttypes,websettings, wordrelationships,"C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx")
Edit: So it eventually can open the document, but it says some content cannot be displayed and the headers have vanished... thoughts?

I don't know this module, but in general you should not edit a file in place. Open file "A", write file "/tmp/A". Close both files and make sure you have no errors, then move "/tmp/A" to "A". Otherwise you risk clobbering your file if something goes wrong during the write.
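The write-then-move pattern described above can be sketched as follows. This is a minimal illustration, not python-docx code: `safe_write` is a hypothetical helper, and it assumes the new file contents are already available as bytes.

```python
import os
import tempfile


def safe_write(path, data):
    """Write data to a temporary file, then move it over path (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final move
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Only replace the original once the write has finished without errors.
        os.replace(tmp_path, path)
    except BaseException:
        # Something went wrong: discard the partial file, keep the original.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the write fails halfway, the original document is left untouched and only the temporary file is discarded.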

Related

How to download more than one file in Streamlit

I need to make a download button for more than one file. Streamlit's download button doesn't let you download more than one file. I tried to make a few buttons, but the rest just disappear when I click the first one. Is there any way to download two or more files in Streamlit?
I tried this solution from Github, this is what the code looks like:
if st.button("Rozpocznij proces"):
    raport2 = Raport.raport_naj_10(gender,year,week,engine)
    raportM = raport2[0]
    raportO = raport2[1]
    st.dataframe(raportM)
    st.dataframe(raportO)
    zipObj = ZipFile("sample.zip", "w")
    # Add multiple files to the zip
    zipObj.write("raportM")
    zipObj.write("raportO")
    # close the Zip File
    zipObj.close()
    ZipfileDotZip = "sample.zip"
    with open(ZipfileDotZip, "rb") as f:
        bytes = f.read()
        b64 = base64.b64encode(bytes).decode()
        href = f"<a href=\"data:file/zip;base64,{b64}\" download='{ZipfileDotZip}.zip'>\
            Click last model weights\
            </a>"
    st.sidebar.markdown(href, unsafe_allow_html=True)
But I get this error:
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'raportM'
It says that it can't find the file named "raportM".
You get this error because the code assumes the files already exist on disk and only need to be zipped. zipObj.write("raportM") looks for a file named "raportM", and there isn't one: you are passing variable names as if they were file names, and that is not going to work.
What you need to do is save the data held in those variables as CSV files on your local machine before doing the operations above.
So let's modify your code. Before that, we need to initialize a session state for the button st.button("Rozpocznij proces"), because a Streamlit button does not keep its state: it returns True only on the rerun triggered by the click.
processbtn = st.button("Rozpocznij proces")
# Initialized session states
if "processbtn_state" not in st.session_state:
    st.session_state.processbtn_state = False
if processbtn or st.session_state.processbtn_state:
    st.session_state.processbtn_state = True
    raport2 = Raport.raport_naj_10(gender,year,week,engine)
    raportM = raport2[0]
    raportO = raport2[1]
    st.dataframe(raportM)
    st.dataframe(raportO)
    # Save files
    raportM.to_csv('raportM.csv') # You can specify a directory where you want
    raportO.to_csv('raportO.csv') # these files to be stored
    # Create a zip folder
    zipObj = ZipFile("sample.zip", "w")
    # Add multiple files to the zip
    zipObj.write("raportM.csv")
    zipObj.write("raportO.csv")
    # close the Zip File
    zipObj.close()
    ZipfileDotZip = "sample.zip"
    with open(ZipfileDotZip, "rb") as f:
        bytes = f.read()
        b64 = base64.b64encode(bytes).decode()
        href = f"<a href=\"data:file/zip;base64,{b64}\" download='{ZipfileDotZip}.zip'>\
            Click last model weights\
            </a>"
    st.sidebar.markdown(href, unsafe_allow_html=True)
After this runs, you will find the 'raportM.csv' and 'raportO.csv' files in your working directory. If you don't want to keep them, you can add a condition so that the files are deleted once a download has been made.
Note: You may still encounter a FileNotFoundError. That does not mean the approach won't work; you just need to check where the files are being saved.
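As an alternative to writing the CSV files to disk first, the archive can be built in memory. The sketch below is an illustration under stated assumptions rather than the method used above: `build_zip` is a hypothetical helper that takes file names mapped to bytes, and in the app the values would presumably come from something like `raportM.to_csv().encode("utf-8")`.

```python
import io
import zipfile


def build_zip(files):
    """Return the bytes of a zip archive built from {name: bytes} (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            # writestr stores in-memory data, so no file needs to exist on disk.
            archive.writestr(name, content)
    return buffer.getvalue()


zip_bytes = build_zip({
    "raportM.csv": b"a,b\n1,2\n",  # placeholder data standing in for the reports
    "raportO.csv": b"a,b\n3,4\n",
})
```

The resulting bytes can be base64-encoded for the link shown above, or passed as the `data` argument of `st.download_button` in Streamlit versions that provide it.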

How to avoid MS-Word dialog box of a .docx file containing comments to pause python execution at saving?

Problem:
I need to batch some Word files with python to:
check if they are .doc files
if so change their name
save them as .docx files
So that I can then extract some info from the tables contained in the document with docx lib.
I run into an issue when saving .docx files that contain comments: a popup appears asking me to confirm that I want to save the file with comments. It pauses the code execution until an operator manually confirms by clicking OK in the popup.
This prevents the code from running automatically without operator input.
Note: The comments don't need to be kept in the .docx files since I won't use them for further computation.
What I do:
Here's the code I have right now. It stops before the end of execution until you confirm in Word that you accept keeping the comments (if your .doc file contains any):
import win32com.client
doc_file = "path\\of\\document.doc"
docx_file = "path\\of\\new_document.docx"
word = win32com.client.Dispatch("Word.application")
#get the file extension
file_extension = '.'+doc_file.split('\\').pop().split('.').pop()
#test file extension and convert it to docx if original document is a .doc
if file_extension.lower() == '.doc':
    wordDoc = word.Documents.Open(doc_file, False, False, False)
    wordDoc.SaveAs2(docx_file, FileFormat = 12)
    wordDoc.Close()
#test file extension and print a message in the console if not a .doc document
else:
    print('Extension of document {0} is not .doc, will not be treated'.format(doc_file))
word.Quit()
What I've tried:
I tried to look for solutions to remove the comments before saving since I do not use them later in the .docx file I created, but I didn't find any satisfying solution.
Maybe I'm just using the wrong approach and there's a super simple way to dismiss the dialog box or something, but somehow didn't find it.
Thanks!
This seems to do the job, but removes all comments:
import win32com.client
doc_file = "path\\of\\document.doc"
docx_file = "path\\of\\new_document.docx"
word = win32com.client.Dispatch("Word.application")
#get the file extension
file_extension = '.'+doc_file.split('\\').pop().split('.').pop()
#test file extension and convert it to docx if original document is a .doc
if file_extension.lower() == '.doc':
    wordDoc = word.Documents.Open(doc_file, False, False, False)
    # Accept all revisions
    word.ActiveDocument.Revisions.AcceptAll()
    # Delete all comments
    if word.ActiveDocument.Comments.Count >= 1:
        word.ActiveDocument.DeleteAllComments()
    wordDoc.SaveAs2(docx_file, FileFormat = 12)
    wordDoc.Close()
#test file extension and print a message in the console if not a .doc document
else:
    print('Extension of document {0} is not .doc, will not be treated'.format(doc_file))
word.Quit()
Compared with the original code, I just added the part below, which accepts the revisions and removes the comments:
# Accept all revisions
word.ActiveDocument.Revisions.AcceptAll()
# Delete all comments
if word.ActiveDocument.Comments.Count >= 1:
    word.ActiveDocument.DeleteAllComments()
I found the solution here: Python - Using win32com.client to accept all changes in Word Documents
But it still doesn't fully answer the initial question, because it just gets rid of the comments, which I don't need in my own situation. If you need to keep the comments, I still don't know how to proceed.
I stumbled upon this today:
import win32com.client
doc_file = "path\\of\\document.doc"
docx_file = "path\\of\\new_document.docx"
word = win32com.client.Dispatch("Word.application")
#Disable save with comments warning
word.Options.WarnBeforeSavingPrintingSendingMarkup = False
#get the file extension
file_extension = '.'+doc_file.split('\\').pop().split('.').pop()
#test file extension and convert it to docx if original document is a .doc
if file_extension.lower() == '.doc':
    wordDoc = word.Documents.Open(doc_file, False, False, False)
    wordDoc.SaveAs2(docx_file, FileFormat = 12)
    wordDoc.Close()
#test file extension and print a message in the console if not a .doc document
else:
    print('Extension of document {0} is not .doc, will not be treated'.format(doc_file))
word.Quit()
An even easier solution is to use wordconv.exe, which is located in your Office installation next to WinWord.exe.
The commandline is like this:
wordconv.exe -oice -nme inputfilePath outputFilePath
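Separately from the dialog issue, the extension test used in the snippets above (two chained split/pop calls) can be written more directly. The following is only a sketch: `is_legacy_doc` is a hypothetical helper name, and `ntpath` is used on the assumption that the paths are Windows-style, as in the question.

```python
import ntpath


def is_legacy_doc(path):
    """Return True if a Windows-style path ends in .doc (hypothetical helper)."""
    # ntpath applies Windows path rules on any platform, so a dot inside a
    # folder name is not mistaken for the file extension.
    _, extension = ntpath.splitext(path)
    return extension.lower() == ".doc"


print(is_legacy_doc("path\\of\\document.doc"))
print(is_legacy_doc("path\\of\\new_document.docx"))
```

A file name without any dot yields an empty extension here, so it is simply reported as not being a .doc file.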

Create hyperlinks from urls in text file using QTextBrowser

I have a text file with some basic text:
For more information on this topic, go to (http://moreInfo.com)
This tool is available from (https://www.someWebsite.co.uk)
Contacts (https://www.contacts.net)
I would like the urls to show up as hyperlinks in a QTextBrowser, so that when clicked, the web browser will open and load the website. I have seen this post which uses:
Bar
but as the text file can be edited by anyone (i.e. they might include text which does not provide a web address), I would like it if these addresses, if any, can be automatically hyperlinked before being added to the text browser.
This is how I read the text file:
def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    text_browser.setText(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
Edit:
Made some headway and managed to get the hyperlinks using the following (assuming the links are inside parentheses):
import re
def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    for x in urls:
        if x in text:
            # Wrap each URL in an anchor tag so that it is rendered as a link
            text = text.replace(x, '<a href="http' + x.replace('http', '') + '">' + x + '</a>')
    text_browser.setHtml(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
However, it all appears in one line and not in the same format as in the text file. How could I solve this?
Matching URLs correctly is more complex than your current solution might suggest. For a full breakdown of the issues, see: What is the best regular expression to check if a string is a valid URL?
The other problem is much easier to solve. To preserve newlines, you can use this:
text = '<br>'.join(text.splitlines())
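Putting the two parts together, a helper along these lines could prepare the text before it is handed to the text browser. It is a sketch under assumptions: `linkify` and `URL_PATTERN` are hypothetical names, and the pattern is deliberately simple (a URL runs until whitespace, a parenthesis, or an angle bracket), not a full URL validator.

```python
import html
import re

# Deliberately simple pattern, not a complete URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s()<>]+")


def linkify(text):
    """Convert plain text to HTML with clickable URLs and preserved line breaks."""
    # Escape first so characters such as < or & in the file are shown literally.
    escaped = html.escape(text)
    linked = URL_PATTERN.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), escaped
    )
    # Keep the line structure of the original text file.
    return "<br>".join(linked.splitlines())
```

The returned string would then be passed to setHtml instead of the raw file contents.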

Python: Use Dropbox API - Save .ODT File

I'm using the Dropbox API with Python. I don't have problems with the Dropbox API itself; I get through all the authentication steps without problems.
When I use this code:
pdf_dropbox = client.get_file('/Example.pdf')
new_file = open('/home/test.pdf','w')
new_file.write(pdf_dropbox.read())
it generates a file at /home/test.pdf. It's a PDF file and its content is displayed the same as the original.
But when I try the same code with an .odt file, the new file is not generated correctly:
odt_dropbox = client.get_file('/Example.odt')
new_file = open('/home/test_odt.odt','w')
new_file.write(odt_dropbox.read())
This new file test_odt.odt has errors and I can't see its content.
# With this instruction I have the content of the odt file inside odt_dropbox
odt_dropbox = client.get_file('/Example.odt')
Which is the best way to save the content of an .odt file?
Is there a better way to write LibreOffice files?
I'd appreciate any helpful information,
Thanks
Solved. I forgot two things:
Open the file for binary writing, 'wb' instead of 'w':
new_file = open('/home/test_odt.odt','wb')
Close the file after writing with new_file.close(), so that the data is flushed to disk.
Full Code:
odt_dropbox = client.get_file('/Example.odt')
new_file = open('/home/test_odt.odt','wb')
new_file.write(odt_dropbox.read())
new_file.close()
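The same two fixes can be expressed with a `with` block, which closes (and therefore flushes) the file even if the write raises. In this sketch `save_stream` is a hypothetical helper, and any binary file-like object with a `read()` method can stand in for the object returned by `client.get_file`.

```python
import io
import shutil


def save_stream(source, destination_path):
    """Copy a binary file-like object to disk (hypothetical helper)."""
    # "wb" writes the bytes unchanged; the with block closes the file on exit.
    with open(destination_path, "wb") as destination:
        shutil.copyfileobj(source, destination)


# io.BytesIO stands in for the downloaded file object in this example.
example_source = io.BytesIO(b"PK\x03\x04 example bytes\r\n")
```

Text mode ('w') may translate newline bytes or expect strings, which is harmless for some files but corrupts zip-based formats such as .odt.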

How to structure Python function so that it continues after error?

I am new to Python, and with some really great assistance from StackOverflow, I've written a program that:
1) Looks in a given directory, and for each file in that directory:
2) Runs a HTML-cleaning program, which:
Opens each file with BeautifulSoup
Removes blacklisted tags & content
Prettifies the remaining content
Runs Bleach to remove all non-whitelisted tags & attributes
Saves out as a new file
It works very well, except when it hits a certain kind of file content that throws up a bunch of BeautifulSoup errors and aborts the whole thing. I want it to be robust against that, as I won't have control over what sort of content winds up in this directory.
So, my question is: How can I re-structure the program so that when it errors on one file within the directory, it reports that it was unable to process that file, and then continues to run through the remaining files?
Here is my code so far (with extraneous detail removed):
def clean_dir(directory):
    os.chdir(directory)
    for filename in os.listdir(directory):
        clean_file(filename)
def clean_file(filename):
    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p', 'div']
    attr_white_list = {'*': ['title']}
    with open(filename, 'r') as fhandle:
        text = BeautifulSoup(fhandle)
        text.encode("utf-8")
        print "Opened "+ filename
        # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
        [s.decompose() for s in text(tag_black_list)]
        pretty = (text.prettify())
        print "Prettified"
        # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
        cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
        fout = open("../posts-cleaned/"+filename, "w")
        fout.write(cleaned.encode("utf-8"))
        fout.close()
        print "Saved " + filename +" in /posts-cleaned"
print "Done"
clean_dir("../posts/")
I'm looking for any guidance on how to write this so that it will keep running after hitting a parsing/encoding/content/attribute/etc. error within the clean_file function.
You can handle the errors using try-except-finally.
You can do the error handling inside clean_file or in the for loop:
for filename in os.listdir(directory):
    try:
        clean_file(filename)
    except:
        print "Error processing file %s" % filename
If you know which exception gets raised, you can catch that specific exception instead of using a bare except.
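A slightly fuller version of that loop, which also records which files failed, might look like the sketch below. It is written in Python 3 syntax (the question's code is Python 2), `process_all` is a hypothetical helper, and the exception types `OSError` and `ValueError` are placeholders: replace them with whatever the parser actually raises on the problem files.

```python
def process_all(names, handler):
    """Call handler on every name; return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for name in names:
        try:
            handler(name)
        except (OSError, ValueError) as error:
            # Report the problem and move on to the next item.
            print("Error processing file %s: %s" % (name, error))
            failed.append(name)
        else:
            succeeded.append(name)
    return succeeded, failed
```

Called as `process_all(os.listdir(directory), clean_file)`, it would run through the whole directory and return the list of files that could not be processed. Exceptions outside the listed types still stop the run, which is usually what you want for genuine bugs.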
