I am trying to read the header of a Word document using python-docx and watchdog. Whenever a new file is created or modified, the script reads the file and gets the contents of the header, but I am getting a
docx.opc.exceptions.PackageNotFoundError: Package not found at 'Test6.docx'
error. I have tried everything, including opening the file as a stream, but nothing has worked, and yes, the document is populated.
For reference, this is my code.
**main.py**
import time
import os

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

from docx import Document


class Watcher:
    DIRECTORY_TO_WATCH = "/path/to/my/directory"

    def __init__(self):
        self.observer = Observer()

    def run(self):
        event_handler = Handler()
        self.observer.schedule(event_handler, path='C:/Users/abdsak11/OneDrive - Lärande', recursive=True)
        self.observer.start()
        try:
            while True:
                time.sleep(5)
        except:
            self.observer.stop()
            print("Error")
        self.observer.join()


class Handler(FileSystemEventHandler):

    @staticmethod
    def on_any_event(event):
        if event.is_directory:
            return None
        elif event.event_type == 'created':
            # Take any action here when a file is first created.
            path = event.src_path
            extenstion = '.docx'
            base = os.path.basename(path)
            if extenstion in path:
                print("Received created event - %s." % event.src_path)
                time.sleep(10)
                print(base)
                doc = Document(base)
                print(doc)
                section = doc.sections[0]
                header = section.header
                print(header)
        elif event.event_type == 'modified':
            # Take any action here when a file is modified.
            path = event.src_path
            extenstion = '.docx'
            base = os.path.basename(path)
            if extenstion in base:
                print("Received modified event - %s." % event.src_path)
                time.sleep(10)
                print(base)
                doc = Document(base)
                print(doc)
                section = doc.sections[0]
                header = section.header
                print(header)


if __name__ == '__main__':
    w = Watcher()
    w.run()
Edit:
I tried changing the extension from .doc to .docx and that worked, but is there any way to open .docx files, because that is what I am receiving?
Another thing: when opening the ".doc" file and trying to read the header, all I am getting is
<docx.document.Document object at 0x03195488>
<docx.section._Header object at 0x0319C088>
and what I am trying to do is extract the text from the header.
You are trying to print the object itself; instead, you should access its properties:
...
doc = Document(base)
section = doc.sections[0]
header = section.header
print(header.paragraphs[0].text)
according to https://python-docx.readthedocs.io/en/latest/user/hdrftr.html
UPDATE
As I played with the python-docx package, it turned out that PackageNotFoundError is very generic: it can occur simply because the file is not accessible for some reason (it does not exist, cannot be found, or lacks permissions), as well as when the file is empty or corrupted. For example, in the case of watchdog, it may very well happen that between the "created" event firing and the Document being constructed, the file is renamed, deleted, etc. And for some reason you make this situation more probable by waiting 10 seconds before creating the Document. So, try checking whether the file exists first:
if not os.path.exists(base):
    raise OSError('{}: file does not exist!'.format(base))
doc = Document(base)
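Since the file can disappear (or still be mid-write) between the event firing and the open, a short polling loop is often more robust than a fixed 10-second sleep. A minimal sketch; the attempt count and delay are arbitrary choices, not requirements:

```python
import os
import time


def wait_for_file(path, attempts=5, delay=0.5):
    """Poll until `path` exists and its size has stopped changing, or give up."""
    last_size = -1
    for _ in range(attempts):
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size == last_size and size > 0:
                return True  # size stable across two polls: likely fully written
            last_size = size
        time.sleep(delay)
    return False
```

The handler would then call `wait_for_file(path)` and only construct the `Document` when it returns `True`.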
UPDATE2
Note also that this may happen when the opening program creates a lock file based on the file name; e.g. running your code on Linux and opening the file with LibreOffice causes
PackageNotFoundError: Package not found at '.~lock.xxx.docx#'
because this lock file is not a docx file! So you should update your filtering condition to:
if path.endswith(extenstion):
...
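Putting the filtering advice together, a small guard function might look like this. The `'~$'` prefix for Word owner files is an assumption worth verifying on your system; `.~lock.` is the LibreOffice lock-file prefix mentioned above:

```python
import os


def should_process(path, extension='.docx'):
    """True only for real .docx paths: skips LibreOffice lock files
    ('.~lock.xxx.docx#') and, as an assumption, Word owner files ('~$xxx.docx')."""
    base = os.path.basename(path)
    if base.startswith('.~lock.') or base.startswith('~$'):
        return False
    return path.endswith(extension)
```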
Related
I'm building a Python program to sort pictures by the "EXIF DateTimeOriginal" tag using the exifread module. There is an error when a non-picture file (e.g. an .mp3 file) is processed by exifread.process_file(item). I would like Python to ignore files without EXIF tags, so I use a try/except statement, but it still returns the error File format not recognized, which terminates the program.
I added tags["EXIF DateTimeOriginal"] == True, which stopped program termination, but the error is still printed.
Does anyone have an idea how to make the exifread module ignore files which are not pictures?
import exifread

item = "D:\\TEMP\\Vesna.mp3"
with open(item, 'rb') as file:
    try:
        tags = exifread.process_file(file, stop_tag="EXIF DateTimeOriginal")
        tags["EXIF DateTimeOriginal"] == True
    except:
        print("No tag")
    else:
        taken = tags["EXIF DateTimeOriginal"]
        print(tags["EXIF DateTimeOriginal"])
**returns**
File format not recognized.
No tag
I could sort out non-picture files before feeding them to the exifread commands, but I have the impression that it would take more time, and some images could still lack the required tag.
From the source, process_file catches file-type errors, logs a warning message and returns an empty dict. You could test for the empty dict or, since you are also concerned about a specific entry in that dict, use get with a default value for the test. And you can change what happens to the logged warning with the logging module.
import exifread
import logging

item = "D:\\TEMP\\Vesna.mp3"
logging.basicConfig(level=logging.ERROR)
with open(item, 'rb') as file:
    tags = exifread.process_file(file, stop_tag="EXIF DateTimeOriginal")
    if tags.get("EXIF DateTimeOriginal", True):
        taken = tags["EXIF DateTimeOriginal"]
        print(tags["EXIF DateTimeOriginal"])
    else:
        print("No tag")
Your solution @tdelaney still raised an error, so I tweaked it slightly and here is the result. Thanks for the introduction to the logging module :)
import exifread
import logging

item = "D:\\TEMP\\20160130_215245.jpg"
logging.basicConfig(level=logging.ERROR)
try:
    file = open(item, 'rb')
    tags = exifread.process_file(file, stop_tag="EXIF DateTimeOriginal")
    taken = tags["EXIF DateTimeOriginal"]
except:
    print("No tag")
else:
    print(tags["EXIF DateTimeOriginal"])
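The error the earlier snippet raised likely came from `tags.get("EXIF DateTimeOriginal", True)`: when the tag is missing, the default `True` makes the condition succeed, and the following `tags["EXIF DateTimeOriginal"]` lookup raises KeyError. Testing the value itself avoids that; a small sketch of the idea:

```python
def taken_timestamp(tags):
    """Return the EXIF DateTimeOriginal value, or None when the tag is absent."""
    value = tags.get("EXIF DateTimeOriginal")  # no truthy default
    return str(value) if value is not None else None
```

`taken_timestamp(exifread.process_file(f, stop_tag="EXIF DateTimeOriginal"))` then yields `None` for non-picture files instead of raising.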
I am new to Python and I am trying to read in a JSON file that, for now, I just want to write out to a new file without any changes. I have been attempting to use the Python package Click to do this but keep running into errors.
I'm sure this is relatively basic, but any help would be appreciated. The latest version of the code I've tried is below.
import json
import os

import click


def dazprops():
    """Read Daz3D property file."""
    path = click.prompt('Please specify a location for the Daz3D Properties File')
    path = os.path.realpath(path)
    #dir_name = os.path.dirname(path)
    print(path)
    dazpropsf = open('path')
    print(dazpropsf)


if __name__ == '__main__':
    dazprops()
Something like this could give you an idea of how to achieve that using Click:
import click


def read_file(fin):
    content = None
    with open(fin, "r") as f_in:
        content = f_in.read()
    return content


def write_file(fout, content):
    try:
        print("Writing file...")
        with open(fout, "w") as f_out:
            f_out.write(content)
        print(f"File created: {fout}")
    except IOError as e:
        print(f"Couldn't write a file at {fout}. Error: {e}")


@click.command()
@click.argument('fin', type=click.Path(exists=True))
@click.argument('fout', type=click.Path())
def init(fin, fout):
    """
    FIN is an input filepath
    FOUT is an output filepath
    """
    content = read_file(fin)
    if content:
        write_file(fout, content)


if __name__ == "__main__":
    init()
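Since the original goal was specifically a JSON file, parsing the input instead of copying it byte for byte also validates it on the way through. A minimal sketch of that variant (the `indent=2` pretty-printing is a stylistic assumption):

```python
import json


def copy_json(fin, fout):
    # Parse the input so malformed JSON fails loudly, then re-serialize it
    with open(fin, "r") as f_in:
        data = json.load(f_in)
    with open(fout, "w") as f_out:
        json.dump(data, f_out, indent=2)
    return data
```

This `copy_json` could be called from the `init` command above in place of the raw read/write pair.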
I have this script which downloads an Excel file from Google Drive, updates the external links, then runs a desired macro. However, I would like the file to be deleted from the path after the macro has been run, so that I don't need to update the script to change the file name every time.
I've included the script from the run-macro part onwards.
def run_macro(workbook_name, com_instance):
    wb = com_instance.workbooks.open(workbook_name)
    com_instance.AskToUpdateLinks = False
    try:
        wb.UpdateLink(Name=wb.LinkSources())
    except Exception as e:
        print(e)
    finally:
        wb.Close(True)
        wb = None
    return True


def main():
    dir_root = ("C:\\users\\ciara\\desktop\\test4.xlsm")
    xl_app = Dispatch("Excel.Application")
    xl_app.Visible = False
    xl_app.DisplayAlerts = False
    for root, dirs, files in os.walk(dir_root):
        for fn in files:
            if fn.endswith(".xlsx") and fn[0] != "~":
                run_macro(os.path.join(root, fn), xl_app)
    xl_app.Quit()
    xl = None


import unittest
import os.path
import win32com.client


class ExcelMacro(unittest.TestCase):

    def test_excel_macro(self):
        try:
            xlApp = win32com.client.DispatchEx('Excel.Application')
            xlsPath = os.path.expanduser('C:\\users\\ciara\\desktop\\test4.xlsm')
            wb = xlApp.Workbooks.Open(Filename=xlsPath)
            xlApp.Run('BridgeHit')
            wb.Save()
            xlApp.Quit()
            print("Macro ran successfully!")
        except:
            print("Error found while running the excel macro!")
            xlApp.Quit()


if __name__ == "__main__":
    unittest.main()
    if os.path.exists('C:\\users\\ciara\\desktop\\test4.xlsm'):
        os.remove('C:\\users\\ciara\\desktop\\test4.xlsm')
    else:
        print('File does not exists')
The console does not display anything, not even 'File does not exists'.
If there is any other feedback on the code above, I would appreciate it! I'm just starting to learn.
Try adding an indent to your if/else block.
You should move the file-removal if/else block into the ExcelMacro class. Only then, before (or after) running the macro, can it remove the file if it exists. Note that unittest.main() exits the interpreter by default, so code placed after it never runs.
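One way to attach the cleanup to the test itself is unittest's `addCleanup`, which runs after the test whether it passes or fails. A sketch under the assumption that the workbook path is the one from the question (treat it as a placeholder):

```python
import os
import unittest


class ExcelMacro(unittest.TestCase):
    XLSM_PATH = r'C:\users\ciara\desktop\test4.xlsm'  # placeholder path

    def _remove_workbook(self):
        # Registered via addCleanup: runs after the test, pass or fail
        if os.path.exists(self.XLSM_PATH):
            os.remove(self.XLSM_PATH)
        else:
            print('File does not exist')

    def test_excel_macro(self):
        self.addCleanup(self._remove_workbook)
        # ... open Excel and run the macro here, as in the original test ...
```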
I created a document site with Plone, to which file uploads can be made. I saw that Plone saves them in the filesystem in the form of blobs; now I need to fetch them with a Python script that will process the downloaded PDFs with OCR. Does anyone have any idea how to do it? Thank you.
Not sure how to extract PDFs from blob-storage, or if it's possible at all, but you can extract them from a running Plone site (e.g. by executing the script via a browser view):
import os
from Products.CMFCore.utils import getToolByName


def isPdf(search_result):
    """Check mime_type for Plone >= 5.1, otherwise check file-extension."""
    if mimeTypeIsPdf(search_result) or search_result.id.endswith('.pdf'):
        return True
    return False


def mimeTypeIsPdf(search_result):
    """
    Plone-5.1 introduced the mime_type-attribute on files.
    Try to get it; if it doesn't exist, fail silently.
    Return True if mime_type exists and is PDF, otherwise False.
    """
    try:
        mime_type = search_result.mime_type
        if mime_type == 'application/pdf':
            return True
    except AttributeError:
        pass
    return False


def exportPdfFiles(context, export_path):
    """
    Get all PDF-files of the site and write them to export_path on the filesystem.
    Retain the folder-structure of the site.
    """
    catalog = getToolByName(context, 'portal_catalog')
    search_results = catalog(portal_type='File', Language='all')
    for search_result in search_results:
        # For each PDF-file:
        if isPdf(search_result):
            file_path = export_path + search_result.getPath()
            file_content = search_result.getObject().data
            parent_path = '/'.join(file_path.split('/')[:-1])
            # Create missing directories on the fly:
            if not os.path.exists(parent_path):
                os.makedirs(parent_path)
            # Write the PDF in binary mode:
            with open(file_path, 'wb') as fil:
                fil.write(file_content)
            print('Wrote ' + file_path)
    print('Finished exporting PDF-files to ' + export_path)
The example keeps the folder structure of the Plone site in the export directory. If you want the files flat in one directory, a handler for duplicate file names is needed.
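For the flat-directory variant, such a duplicate-name handler could simply append a counter before the extension. A minimal sketch (the ' (1)' suffix style is an arbitrary choice):

```python
import os


def unique_path(directory, filename):
    """Return a collision-free path in `directory`, appending ' (1)', ' (2)', ...
    before the extension when the name is already taken."""
    root, ext = os.path.splitext(filename)
    candidate = os.path.join(directory, filename)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, '{} ({}){}'.format(root, counter, ext))
        counter += 1
    return candidate
```

In the export loop above, `file_path = unique_path(export_path, search_result.id)` would replace the path built from `getPath()`.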
I have updated the code below in an effort to allow my script to loop through multiple files in a directory (as opposed to one):
@classmethod
def find_file(cls):
    """Finds the Excel files to process."""
    all_files = list()
    archive = ZipFile(config.FILE_LOCATION)
    for file in archive.filelist:
        if 'Horrible Data Site ' in file.filename:
            all_files.append(archive.extract(file.filename, config.UNZIP_LOCATION))
    return all_files
Before declaring all_files = list() in my find_file method, this was working on one file in the directory. I added all_files in an attempt to loop through all files in a directory.
Also, in the main.py below, I just added the for loop right before PENDING_RECORDS for this purpose.
"""Start Point"""
from data.find_pending_records import FindPendingRecords
from vital.vital_entry import VitalEntry
from time import sleep
if __name__ == "__main__":
try:
for PENDING_RECORDS in FindPendingRecords().get_excel_data():
# Do operations on PENDING_RECORDS
# Reads excel to map data from excel to vital
MAP_DATA = FindPendingRecords().get_mapping_data()
# Configures Driver
VITAL_ENTRY = VitalEntry()
# Start chrome and navigate to vital website
VITAL_ENTRY.instantiate_chrome()
# Begin processing Records
VITAL_ENTRY.process_records(PENDING_RECORDS, MAP_DATA)
print (PENDING_RECORDS)
print("All done")
except Exception as exc:
print(exc)
With all_files and the for loop added, the Anaconda Prompt now outputs the following error:
(base) C:\Python>python main.py
Invalid file path or buffer object type: <class 'list'>
This is the config.py:
FILE_LOCATION = r"C:\Zip\DATA Docs.zip"
UNZIP_LOCATION = r"C:\Zip\Pending"
VITAL_URL = 'http://horriblewebsite:8080/START'
HEADLESS = False
PROCESSORS = 4
MAPPING_DOC = ".//map/mappingDOC.xlsx"
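The pandas error above ("Invalid file path or buffer object type: <class 'list'>") suggests that the list returned by find_file is being passed as a whole to something that expects a single path (such as pd.read_excel). A hedged sketch of the fix, iterating over the list instead; `read_one` is a hypothetical stand-in for whatever reads a single workbook, since get_excel_data's body is not shown:

```python
def read_all(all_files, read_one):
    """Yield the parsed data for each extracted workbook path in turn.
    `read_one` stands in for the single-file reader (e.g. pd.read_excel)."""
    for path in all_files:
        yield read_one(path)
```

Inside get_excel_data, the equivalent change would be to loop over the paths from find_file and read each one individually rather than handing the list to the reader.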