Code works fine on Mac/Windows, but not linux? - python

I've written a pretty simple code to determine the version of a Revit file. For those that are not aware, there is no way to tell what "Revit year" (i.e. 2020, 2021, 2022, 2023) a .rvt file is in without it either -
being in the name of the file
opening the file in notepad and searching for something along the lines of "Format: 2022"
So I've written a code for it on my windows computer. The code is below.
import shutil
import os
import re
rvt_file = 'revit_file_2020.rvt'
if '.rvt' in rvt_file:
new_f = rvt_file.split(".")[0] + ".txt"
shutil.copyfile(rvt_file, new_f)
content = open(new_f, 'rb').read().decode('utf-8', errors='ignore')
new_s = ""
for char in content:
if char.isalpha() or char.isdigit() or char == ":":
new_s += char
if "build" in new_s.lower():
ind = [m.start() for m in re.finditer('build', new_s.lower())]
new_s = new_s[ind[0]-60:ind[0]+40]
ind = [m.start() for m in re.finditer('format', new_s.lower())][0]
new_s = new_s[ind:ind+15]
rvt_version = ""
for char in new_s:
if char.isdigit():
rvt_version += char
print(f"Revit Version: {rvt_version}")
else:
print("not found")
else:
print("Please send a revit file instead.")
This code works great on windows and mac, but for some reason it will not work on linux for me. It will create the .txt file just fine, but cannot find any of the query words I search for when the txt file is opened. I am trying to incorporate this into a simple discord server that's hosted on linux.
I have a feeling that it has to do with the way windows and mac create/read the .txt file it creates. I tried using an encoding system different than utf-8, but that didn't work.
Any idea what could be so different on linux and how I could alter the code to run there?
For anyone wanting to test it out, I've provided some revit files here:
https://send.johnprovost.com/download/822837095733c2d1/#8rZThFq4UGqc2mOpK6w9PQ
Thanks!

Related

How can I safely implement a user python script that edits a root file

My current goal is to edit the /etc/hosts file using a python script so I can block/unblock certain websites with the execution of a .desktop file (the python script comments/uncomments certain lines)
I already have the python script created and it works fine when run as root
However I'm having trouble running the .desktop as a non-root user, while also allowing the python script to edit the root permission /etc/hosts.
I read that its not recommended to add programs into your /usr/share/applications so I tried using ~/.local/share/applications and no surprise, it wont work.
Essentially I'm wondering if there is a way to execute a root python script through a user level .desktop application, while also maintaining secure permissions.
I'm not even sure if this is possible, but I thought I'd ask as it seems like something out of my league as an Ubuntu newbie.
The code for my python is here:
with open('hosts.txt', 'r') as file:
# read a list of lines into data
data = file.readlines()
t_change = False
for index, line in enumerate(data):
clean_line = ' '.join(line.split())
if t_change and clean_line != "":
if clean_line == "#end-block":
t_change = False
elif line[0] == "#":
data[index] = line[1:]
elif line[0] != "#":
data[index] = "#" + line
elif clean_line == "#start-block":
t_change = True
else:
data[index] = line
# and write everything back
with open('hosts.txt', 'w') as file:
file.writelines(data)
https://pastebin.com/p8Bat817
And the code for my .desktop application is here:
[Desktop Entry]
Type=Application
Terminal=false
Name=focus
Icon=/usr/share/applications/icons/focus.svg
Path=/usr/share/applications/
Exec=python3 /home/main/.local/share/applications/focus.py
https://pastebin.com/MkpZcmjP
Note: The .desktop is favorited to my taskbar so im able to toggle it quickly (not sure if this would impact any of the permissions)

Python open .doc file

I'm working on a project in which I need to read the text from multiple doc and docx files. The docx files were easily done with the docx2txt module but I cannot for the love of me make it work for doc files. I've tried with textract, but it doesn't seem to work on Windows. I just need the text in the file, no pictures or anything like that. Any ideas?
I found that this seems to work:
import win32com.client
text = win32com.client.Dispatch("Word.Application")
text.visible = False
wb = text.Documents.Open("myfile.doc")
document = text.ActiveDocument
print(document.Range().Text)
I had a similar issue, the following function worked for me.
def get_string(path: Path) -> str:
string = ''
with open(path, 'rb') as stream:
stream.seek(2560)
current_stream = stream.read(1)
while not (str(current_stream) == "b'\\x00'"):
if str(current_stream) in special_chars.keys():
string += special_chars[str(current_stream)]
else:
try:
char = current_stream.decode('UTF-8')
if char.isalnum() or char == ' ':
string += char
except UnicodeDecodeError:
string += ''
current_stream = stream.read(1)
return string
I tested it on a .doc file looking like the following:
picture of .doc file
The output from:
string = get_string(filepath)
print(string)
is:
The big red fox jumped over the small barrier to get to the chickens on the other side
And the chickens ran about but had no luck in surviving the day
this||||that||||The other||||

Making a Corpus out of Wiki DumpFile using Python in NLTK

I am trying to create a corpus out of Wiki DumpFile.
I've downloaded the Wiki enwiki-latest-pages-articles.xml.bz2 file, but when I run the code(script) it gives me some errors.
I am relatively new to this, but I do not understand how the python code and the wiki file should be placed (same folders, which folder, etc.).
I've run this command: make_wiki_corpus enwiki-latest-pages-articles.xml.bz2 wiki_en.txt
make_wiki_corpus - being my python script
enwiki-latest-pages-articles.xml.bz2 - is the wikipedia database
wiki_en.txt - the textfile I want to write into.
import sys
from gensim.corpora import WikiCorpus
def make_corpus(in_f, out_f):
"""Convert Wikipedia xml dump file to text corpus"""
output = open(out_f, 'w')
wiki = WikiCorpus(in_f)
i = 0
for text in wiki.get_texts():
output.write(bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n')
i = i + 1
if (i % 10000 == 0):
print('Processed ' + str(i) + ' articles')
output.close()
print('Processing complete!')
if __name__ == '__main__':
if len(sys.argv) != 3:
print('Usage: python make_wiki_corpus.py <wikipedia_dump_file> <processed_text_file>')
sys.exit(1)
in_f = sys.argv[1]
out_f = sys.argv[2]
make_corpus(in_f, out_f)
I ran the command, containing this code, being in the same file with the enwiki-latest-pages-articles.xml.bz2 file, but at the command prompt I get some error messages like:
line 636 in \__init__
line 92 in __init__
FileNotFound Eroor : [Errorno21] No such file or directory "enwiki-latest-pages-articles.xml.bz2"
Maybe some of these ideas would be useful for u (hopefully):
suggestion #1: if I'm not mistaken python make_wiki_corpus {your bz2 file} {your txt file} should be used;
suggestion #2: try to apply the whole directory path to your needed file;
suggestion #3: u could also print the code from the development environment itself (to avoid any complications possibly).

Watch logs in a folder in real time with Python

I'm trying to make a custom logwatcher of a log folder using python. The objective is simple, finding a regex in the logs and write a line in a text if find it.
The problem is that the script must be running constantly against a folder in where could be multiple log files of unknown names, not a single one, and it should detect the creation of new log files inside the folder on the fly.
I made some kind of tail -f (copying part of the code) in python which is constantly reading a specific log file and write a line in a txt file if regex is found in it, but I don't know how could I do it with a folder instead a single log file, and how can the script detect the creation of new log files inside the folder to read them on the fly.
#!/usr/bin/env python
import time, os, re
from datetime import datetime
# Regex used to match relevant loglines
error_regex = re.compile(r"ERROR:")
start_regex = re.compile(r"INFO: Service started:")
# Output file, where the matched loglines will be copied to
output_filename = os.path.normpath("log/script-log.txt")
# Function that will work as tail -f for python
def follow(thefile):
thefile.seek(0,2)
while True:
line = thefile.readline()
if not line:
time.sleep(0.1)
continue
yield line
logfile = open("log/service.log")
loglines = follow(logfile)
counter = 0
for line in loglines:
if (error_regex.search(line)):
counter += 1
sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ')
out_file=open(output_filename, "a")
out_file.write(sttime + line)
out_file.close()
if (start_regex.search(line)):
sttime = datetime.now().strftime('%Y%m%d_%H:%M:%S - ')
out_file=open(output_filename, "a")
out_file.write(sttime + "SERVICE STARTED\n" + sttime + "Number of errors detected during the startup = {}\n".format(counter))
counter = 0
out_file.close()
You can use watchgod for this purpose. This may be a comment too, not sure if it deserves to be na answer.

.doc to pdf using python

I'am tasked with converting tons of .doc files to .pdf. And the only way my supervisor wants me to do this is through MSWord 2010. I know I should be able to automate this with python COM automation. Only problem is I dont know how and where to start. I tried searching for some tutorials but was not able to find any (May be I might have, but I don't know what I'm looking for).
Right now I'm reading through this. Dont know how useful this is going to be.
A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')
You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
I have tested many solutions but no one of them works efficiently on Linux distribution.
I recommend this solution :
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
filename = re.search('-> (.*?) using filter', process.stdout.decode())
return filename.group(1)
def libreoffice_exec():
# TODO: Provide support for more platforms
if sys.platform == 'darwin':
return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
I have worked on this problem for half a day, so I think I should share some of my experience on this matter. Steven's answer is right, but it will fail on my computer. There are two key points to fix it here:
(1). The first time when I created the 'Word.Application' object, I should make it (the word app) visible before open any documents. (Actually, even I myself cannot explain why this works. If I do not do this on my computer, the program will crash when I try to open a document in the invisible model, then the 'Word.Application' object will be deleted by OS. )
(2). After doing (1), the program will work well sometimes but may fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM Server may not be able to response so quickly. So I add a delay before I tried to open a document.
After doing these two steps, the program will work perfectly with no failure anymore. The demo code is as below. If you have encountered the same problems, try to follow these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of outputpdf file 2'
# print out filenames
print in_file
print out_file
print in_file2
print out_file2
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
unoconv (writen in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.
As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
ExportFormat=17, #17 = PDF output, 18=XPS output
OpenAfterExport=False,
OptimizeFor=0, #0=Print (higher res), 1=Screen (lower res)
CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
DocStructureTags=True
);
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
It's worth noting that Stevens answer works, but make sure if using a for loop to export multiple files to place the ClientObject or Dispatch statements before the loop - it only needs to be created once - see my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
If you don't mind using PowerShell have a look at this Hey, Scripting Guy! article. The code presented could be adopted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.
I have modified it for ppt support as well. My solution support all the below-specified extensions.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF
I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing which was usually an order of magnitude bigger than expected. After looking how to disable the dialogs when using a virtual PDF printer I came across Bullzip PDF Printer and I've been rather impressed with its features. It's now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client
def print_to_Bullzip(file):
util = win32com.client.Dispatch("Bullzip.PDFUtil")
settings = win32com.client.Dispatch("Bullzip.PDFSettings")
settings.PrinterName = util.DefaultPrinterName # make sure we're controlling the right PDF printer
outputFile = re.sub("\.[^.]+$", ".pdf", file)
statusFile = re.sub("\.[^.]+$", ".status", file)
settings.SetValue("Output", outputFile)
settings.SetValue("ConfirmOverwrite", "no")
settings.SetValue("ShowSaveAS", "never")
settings.SetValue("ShowSettings", "never")
settings.SetValue("ShowPDF", "no")
settings.SetValue("ShowProgress", "no")
settings.SetValue("ShowProgressFinished", "no") # disable balloon tip
settings.SetValue("StatusFile", statusFile) # created after print job
settings.WriteSettings(True) # write settings to the runonce.ini
util.PrintFile(file, util.DefaultPrinterName) # send to Bullzip virtual printer
# wait until print job completes before continuing
# otherwise settings for the next job may not be used
timestamp = datetime.datetime.now()
while( (datetime.datetime.now() - timestamp).seconds < 10):
if os.path.exists(statusFile) and os.path.isfile(statusFile):
error = util.ReadIniString(statusFile, "Status", "Errors", '')
if error != "0":
raise IOError("PDF was created with errors")
os.remove(statusFile)
return
time.sleep(0.1)
raise IOError("PDF creation timed out")
I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...
import os
import win32com.client
wdFormatPDF = 17
for root, dirs, files in os.walk(r'your directory here'):
for f in files:
if f.endswith(".doc") or f.endswith(".odt") or f.endswith(".rtf"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-4]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
elif f.endswith(".docx") or f.endswith(".dotm") or f.endswith(".docm"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-5]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
else:
pass
The try and except was for those documents I couldn't read and won't exit the code until the last document.
You should start from investigating so called virtual PDF print drivers.
As soon as you will find one you should be able to write batch file that prints your DOC files into PDF files. You probably can do this in Python too (setup printer driver output and issue document/print command in MSWord, later can be done using command line AFAIR).
import docx2txt
from win32com import client
import os
files_from_folder = r"c:\\doc"
directory = os.fsencode(files_from_folder)
amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for file in os.listdir(directory):
filename = os.fsdecode(file)
print(filename)
if filename.endswith('docx'):
text = docx2txt.process(os.path.join(files_from_folder, filename))
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
elif filename.endswith('doc'):
doc = word.Documents.Open(os.path.join(files_from_folder, filename))
text = doc.Range().Text
doc.Close()
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html
I would suggest ignoring your supervisor and use OpenOffice which has a Python api. OpenOffice has built in support for Python and someone created a library specific for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with word.

Categories