zipfile.ZipFile(docx) not finding word/document.xml - textract - python

I am using textract to convert doc/docx files to text.
Here is my method:
def extract_text_from_doc(file):
    temp = None
    temp = textract.process(file)
    temp = temp.decode("UTF-8")
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    return ''.join(text)
I have two doc files, both with the .docx extension. When I try to convert one of them to a string it works fine, but the other one throws this exception:
'There is no item named \'word/document.xml\' in the archive'
I dug further and found that zipfile.ZipFile(docx) is where the problem starts.
The code looks like this:
def process(docx, img_dir=None):
    text = u''
    # unzip the docx in memory
    zipf = zipfile.ZipFile(docx)
    filelist = zipf.namelist()
    # get header text
    # there can be 3 header files in the zip
    header_xmls = 'word/header[0-9]*.xml'
    for fname in filelist:
        if re.match(header_xmls, fname):
            text += xml2text(zipf.read(fname))
    # get main text
    doc_xml = 'word/document.xml'
    text += xml2text(zipf.read(doc_xml))
    # some more code
For the file that works, the code above returns a filelist with entries like 'word/document.xml' and 'word/header1.xml'.
For the file that fails, filelist contains:
['[Content_Types].xml', '_rels/.rels', 'theme/theme/themeManager.xml', 'theme/theme/theme1.xml', 'theme/theme/_rels/themeManager.xml.rels']
Since the second filelist does not contain 'word/document.xml', the lines
    doc_xml = 'word/document.xml'
    text += xml2text(zipf.read(doc_xml))
throw the exception (internally zipf.read tries to open the archive member named word/document.xml).
Can anyone please help? I don't know whether the problem is with the docx file or with the code.
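As a quick diagnostic, you could check the archive's contents before handing the file to textract. This is a minimal sketch (the function name is made up for illustration); a valid .docx always contains word/document.xml, so if this returns False the file itself is the problem. The filelist you posted (theme parts only) looks like a Word theme file that was given a .docx extension, though that is only a guess from the member names:

```python
import zipfile

def looks_like_docx(path):
    """Return True if the zip archive contains the main document part
    that textract/python-docx expect inside a real .docx file."""
    try:
        with zipfile.ZipFile(path) as zf:
            return 'word/document.xml' in zf.namelist()
    except zipfile.BadZipFile:
        # not even a zip archive
        return False
```

Running this over both of your files should tell you immediately which side the bug is on.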


Deleting pdf files from a folder if the search word is present using python

Hi, I am trying to delete the PDF files in a folder that contain the word "Publications périodiques" on the first page. So far I am able to search for the word, but I don't know how to delete the matching files.
Here is the code I use to search for the word in a PDF file:
import PyPDF2
import re

object = PyPDF2.PdfFileReader("202105192101394-60.pdf")
NumPages = object.getNumPages()  # this was missing in the original snippet
String = "Publications périodiques"
for i in range(0, NumPages):
    PageObj = object.getPage(i)
    print("this is page " + str(i))
    Text = PageObj.extractText()
    # print(Text)
    ResSearch = re.search(String, Text)
    print(ResSearch)
Also, how do I loop this over multiple files?
You can delete any file using:
import os
os.remove("C://fake/path/to/file.pdf")
In order to delete a file use
import os
os.unlink(file_path)
where file_path is the path to the relevant file
For browsing through files:
from os import walk
mypath= "./"
_, _, filenames = next(walk(mypath))
Process each file:
for file in filenames:
    foundWord = yourFunction(file)
    if foundWord:
        os.remove(file)  # Delete the file
Write yourFunction() such that it returns True/False.
I suppose your re.search() is already functional? Or is that part of your question?
If it is functional, you can use os to get all the files, and perhaps filter them through a list comprehension to get only the PDF files, like so:
import os
all_files = os.listdir("C:/../or_whatever_path")
only_pdf_files = [file for file in all_files if ".pdf" in file]
From that point on, you can iterate through the PDF files, execute the code you've already written for each one, and delete the file via the os.remove() method when ResSearch matches:
for file in only_pdf_files:
    object = PyPDF2.PdfFileReader(file)
    NumPages = object.getNumPages()
    String = "Publications périodiques"
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        print("this is page " + str(i))
        Text = PageObj.extractText()
        # print(Text)
        ResSearch = re.search(String, Text)
        if ResSearch:
            os.remove(file)
            break  # stop once the file has been deleted
EDIT:
If your PDF files aren't in the same directory as your Python script, the directory path has to be included in what you pass to the os.remove() method.
for file in only_pdf_files:
    object = PyPDF2.PdfFileReader(file)
    NumPages = object.getNumPages()
    String = "Publications périodiques"
    for i in range(0, NumPages):
        PageObj = object.getPage(i)
        Text = PageObj.extractText()
        # print(Text)
        ResSearch = re.search(String, Text)
        if ResSearch:
            os.remove(file)
            break  # stop once the file has been deleted
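Putting the pieces together, one way to sketch the whole task is a small helper that joins the folder to each filename (so os.remove works regardless of the current working directory). Here contains_word is a hypothetical stand-in for your PyPDF2 search code; any callable that takes a full path and returns a bool will do:

```python
import os

def delete_matching_pdfs(folder, contains_word):
    """Remove every .pdf file in `folder` for which contains_word(path)
    returns True; return the names of the removed files."""
    removed = []
    for name in os.listdir(folder):
        if not name.lower().endswith('.pdf'):
            continue  # skip non-PDF files
        path = os.path.join(folder, name)  # full path, not just the name
        if contains_word(path):
            os.remove(path)
            removed.append(name)
    return removed
```

You would then call it as delete_matching_pdfs("some/folder", your_pypdf2_search_function).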

How to make my Tesseract-OCR conversion code run faster

I have a conversion script that converts PDF files and image files to text files, but it takes forever to run. It took almost 48 hours to finish 2000 PDF documents, and right now I have a pool of around 12000+ documents that I need to convert. Based on my previous rate, I can't imagine how long the conversion would take with my current code. Is there anything I can do or change in my code to make it run faster?
Here is the code that I used:
def tesseractOCR_pdf(pdf):
    filePath = pdf
    pages = convert_from_path(filePath, 500)
    # Counter to store images of each page of PDF to image
    image_counter = 1
    # Iterate through all the pages stored above
    for page in pages:
        # Declaring filename for each page of PDF as JPG
        # For each page, filename will be:
        # PDF page 1 -> page_1.jpg
        # PDF page 2 -> page_2.jpg
        # ....
        # PDF page n -> page_n.jpg
        filename = "page_" + str(image_counter) + ".jpg"
        # Save the image of the page in system
        page.save(filename, 'JPEG')
        # Increment the counter to update filename
        image_counter = image_counter + 1
    # Variable to get count of total number of pages
    filelimit = image_counter - 1
    # Create an empty string for storing purposes
    text = ""
    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):
        # Set filename to recognize text from
        # Again, these files will be:
        # page_1.jpg ... page_n.jpg
        filename = "page_" + str(i) + ".jpg"
        # Recognize the text as string in image using pytesseract
        text += str(pytesseract.image_to_string(Image.open(filename)))
        text = text.replace('-\n', '')
    # Delete all the jpg files created above
    for i in glob.glob("*.jpg"):
        os.remove(i)
    return text
def tesseractOCR_img(img):
    filePath = img
    text = str(pytesseract.image_to_string(filePath, lang='eng', config='--psm 6'))
    text = text.replace('-\n', '')
    return text
def Tesseract_ALL(docDir, txtDir, troubleDir):
    if docDir == "": docDir = os.getcwd() + "\\"  # if no docDir passed in
    for doc in os.listdir(docDir):  # iterate through docs in doc directory
        try:
            fileExtension = doc.split(".")[-1]
            if fileExtension == "pdf":
                pdfFilename = docDir + doc
                text = tesseractOCR_pdf(pdfFilename)  # get string of text content of pdf
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w")  # make text file
                textFile.write(text)  # write text to text file
            else:
                # elif (fileExtension == "tif") | (fileExtension == "tiff") | (fileExtension == "jpg"):
                imgFilename = docDir + doc
                text = tesseractOCR_img(imgFilename)  # get string of text content of img
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w")  # make text file
                textFile.write(text)  # write text to text file
        except:
            print("Error in file: " + str(doc))
            shutil.move(os.path.join(docDir, doc), troubleDir)
    for filename in os.listdir(txtDir):
        fileExtension = filename.split(".")[-2]
        if fileExtension == "pdf":
            os.rename(txtDir + filename, txtDir + filename.replace('.pdf', ''))
        elif fileExtension == "tif":
            os.rename(txtDir + filename, txtDir + filename.replace('.tif', ''))
        elif fileExtension == "tiff":
            os.rename(txtDir + filename, txtDir + filename.replace('.tiff', ''))
        elif fileExtension == "jpg":
            os.rename(txtDir + filename, txtDir + filename.replace('.jpg', ''))

docDir = "/drive/codingstark/Project/pdf/"
txtDir = "/drive/codingstark/Project/txt/"
troubleDir = "/drive/codingstark/Project/trouble_pdf/"
Tesseract_ALL(docDir, txtDir, troubleDir)
Does anyone know how I can edit my code to make it run faster?
I think a process pool would be perfect for your case.
First you need to figure out which parts of your code can run independently of each other, then wrap each of them in a function.
Here is an example
from concurrent.futures import ProcessPoolExecutor

def do_some_OCR(filename):
    pass

with ProcessPoolExecutor() as executor:
    for file in file_list:
        _ = executor.submit(do_some_OCR, file)
The code above submits each file to the pool, and the files are processed in parallel across worker processes.
You can find the official documentation here: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor
There is also a really awesome video that shows step-by-step how to use processes for exactly this: https://www.youtube.com/watch?v=fKl2JW_qrso
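If you also need the extracted text back (for example, to write the .txt files afterwards), executor.map returns the results in input order. A minimal sketch; ocr_one_file here is a hypothetical stand-in for your tesseractOCR_pdf/tesseractOCR_img dispatch:

```python
from concurrent.futures import ProcessPoolExecutor

def ocr_one_file(path):
    # Placeholder for the real work: call tesseractOCR_pdf(path)
    # or tesseractOCR_img(path) depending on the extension.
    return path.upper()

def ocr_all(paths):
    # One worker process per CPU core by default; results come back
    # in the same order as `paths`.
    with ProcessPoolExecutor() as executor:
        return list(executor.map(ocr_one_file, paths))
```

Note that on Windows (and anywhere the spawn start method is used) the call to ocr_all must live under an `if __name__ == "__main__":` guard, because the worker processes re-import the module.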
Here is a compact version of the function with the file-writing removed. I think this should work based on what I was reading in the APIs, but I haven't tested it.
Note that I changed from a string to a list, because appending to a list is MUCH less costly than concatenating strings (see How slow is Python's string concatenation vs. str.join?). The TL;DR is that string concatenation creates a new string every time, so with large strings you end up copying them over and over.
Also, calling replace on the accumulated string after each concatenation was again creating a new string each time, so I moved it to operate on each page's string as it is generated. If for some reason the '-\n' artifact only occurs as a result of the concatenation, then the replace should be moved back out: return ''.join(pageText).replace('-\n', ''), but realize that this creates a new string with the join and then a whole new string again with the replace.
def tesseractOCR_pdf(pdf):
    pages = convert_from_path(pdf, 500)
    # Create an empty list for storing purposes
    pageText = []
    # Each page is a PIL Image; iterate through all of them
    for page in pages:
        # Recognize the text in the image using pytesseract,
        # remove the -\n characters, and add the result to the list
        pageText.append(str(pytesseract.image_to_string(page)).replace('-\n', ''))
    return ''.join(pageText)
An even more compact one-liner version:
def tesseractOCR_pdf(pdf):
    # Extract the text of each page, remove -\n, and combine the text
    return ''.join([str(pytesseract.image_to_string(page)).replace('-\n', '') for page in convert_from_path(pdf, 500)])
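The join-versus-concatenation point above can be sketched in isolation. Both helpers below build the same string, but the list version allocates the final string only once, inside join():

```python
def build_by_concat(parts):
    text = ""
    for p in parts:
        text += p  # may copy the accumulated string on every iteration
    return text

def build_by_join(parts):
    chunks = []
    for p in parts:
        chunks.append(p)  # O(1) append; no string copying yet
    return "".join(chunks)  # single allocation for the final string
```

For a handful of short pages the difference is negligible; it matters when you accumulate thousands of large page strings.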

Python - FileNotFoundError, parameter appears to pull wrong path?

I'm trying to update a program to pull/read 10-K HTML filings and am getting a FileNotFoundError. The error is thrown inside the readHTML function. It looks like the FileName parameter is building a path from the Form10KName column when it should be using the FileName column. I have no idea why this is happening; any help?
Here is the error code:
  File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 105, in <module>
    main()
  File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 92, in main
    match = readHTML(FileName)
  File "C:/Users/crabtreec/Downloads/4_ReadHTML.py", line 18, in readHTML
    input_file = open(input_path, 'r+')
FileNotFoundError: [Errno 2] No such file or directory: './HTML/a10-k20189292018.htm'
And here is what I'm running.
import os
import re
import csv
from bs4 import BeautifulSoup  # <---- Need to install this package manually using pip
from urllib.request import urlopen

os.chdir('C:/Users/crabtreec/Downloads/')  # The location of the file "CompanyList.csv"
htmlSubPath = "./HTML/"  # <=== The subfolder with the 10-K files in HTML format
txtSubPath = "./txt/"  # <=== The subfolder with the extracted text files
DownloadLogFile = "10kDownloadLog.csv"  # a csv file (output of the 3DownloadHTML.py script) with the download history of 10-K forms
ReadLogFile = "10kReadlog.csv"  # a csv file (output of the current script) showing whether item 1 is successfully extracted from 10-K forms
def readHTML(FileName):
    input_path = htmlSubPath + FileName
    output_path = txtSubPath + FileName.replace(".htm", ".txt")
    input_file = open(input_path, 'r+')
    page = input_file.read()  # <=== Read the HTML file into Python
    # Pre-process the html content by removing extra white space and combining it into one line
    page = page.strip()  # <=== remove white space at the beginning and end
    page = page.replace('\n', ' ')  # <=== replace the \n (new line) character with space
    page = page.replace('\r', '')  # <=== remove the \r (carriage return, if you're on Windows)
    page = page.replace('&nbsp;', ' ')  # <=== replace "&nbsp;" (a special character for space in HTML) with space
    while '  ' in page:
        page = page.replace('  ', ' ')  # <=== remove extra space
    # Use regular expressions to extract text that matches a pattern
    # The following patterns find ITEM 1 and ITEM 1A as displayed as subtitles
    # (.+?) represents everything between the two subtitles
    # If you want to extract something else, here is what you should change
    # Define a list of potential patterns to find ITEM 1 and ITEM 1A as subtitles
    regexs = ('bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.',  # <=== pattern 1: with an attribute bold before the item subtitle
              'b>\s*Item 1\.(.+?)b>\s*Item 1A\.',  # <=== pattern 2: with a tag <b> before the item subtitle
              'Item 1\.\s*<\/b>(.+?)Item 1A\.\s*<\/b>',  # <=== pattern 3: with a tag </b> after the item subtitle
              'Item 1\.\s*Business\.\s*<\/b(.+?)Item 1A\.\s*Risk Factors\.\s*<\/b')  # <=== pattern 4: with a tag </b> after the item+description subtitle
    # Now we try to see if a match can be found...
    for regex in regexs:
        match = re.search(regex, page, flags=re.IGNORECASE)  # <=== search for the pattern in the HTML using re.search from the re package; ignore case
        # If a match exists...
        if match:
            # The extracted content is still in HTML format.
            # We turn it into a BeautifulSoup object
            # so that we can remove the html tags and keep only the text.
            soup = BeautifulSoup(match.group(1), "html.parser")  # <=== match.group(1) returns the text inside the parentheses (.+?)
            # soup.text removes the html tags and keeps only the text
            rawText = soup.text.encode('utf8')  # <=== you have to change the encoding to unicode
            # remove space at the beginning and end and the subtitle "business" at the beginning
            # ^ matches the beginning of the text
            outText = re.sub("^business\s*", "", rawText.strip(), flags=re.IGNORECASE)
            output_file = open(output_path, "w")
            output_file.write(outText)
            output_file.close()
            break  # <=== if a match is found, we break the for loop; otherwise the for loop continues
    input_file.close()
    return match
def main():
    if not os.path.isdir(txtSubPath):  # <=== keep all text files in this subfolder
        os.makedirs(txtSubPath)
    csvFile = open(DownloadLogFile, "r")  # <=== A csv file with the list of 10-K file names (the file should have no header)
    csvReader = csv.reader(csvFile, delimiter=",")
    csvData = list(csvReader)
    logFile = open(ReadLogFile, "a+")  # <=== A log file to track which file is successfully extracted
    logWriter = csv.writer(logFile, quoting=csv.QUOTE_NONNUMERIC)
    logWriter.writerow(["filename", "extracted"])
    i = 1
    for rowData in csvData[1:]:
        if len(rowData):
            FileName = rowData[7]
            if ".htm" in FileName:
                match = readHTML(FileName)
                if match:
                    logWriter.writerow([FileName, "yes"])
                else:
                    logWriter.writerow([FileName, "no"])
                i = i + 1
    csvFile.close()
    logFile.close()
    print("done!")

if __name__ == "__main__":
    main()
CSV of file info
Your error message shows that Python is not looking inside the "HTML" directory for the file.
I would avoid using os.chdir to change the working directory; it is likely to complicate things. Instead, use pathlib and join paths correctly to make the file paths less error prone.
Try this:
from pathlib import Path

base_dir = Path('C:/Users/crabtreec/Downloads/')  # The location of the file "CompanyList.csv"
htmlSubPath = base_dir.joinpath("HTML")  # <=== The subfolder with the 10-K files in HTML format
txtSubPath = base_dir.joinpath("txt")  # <=== The subfolder with the extracted text files
DownloadLogFile = "10kDownloadLog.csv"  # a csv file (output of the 3DownloadHTML.py script) with the download history of 10-K forms
ReadLogFile = "10kReadlog.csv"  # a csv file (output of the current script) showing whether item 1 is successfully extracted from 10-K forms

def readHTML(FileName):
    input_path = htmlSubPath.joinpath(FileName)
    output_path = txtSubPath.joinpath(FileName.replace(".htm", ".txt"))
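The rest of readHTML can then use those Path objects directly, since open() accepts Path as well as str. A minimal sketch of the pattern (the function name and file layout here are placeholders, not the asker's full code):

```python
from pathlib import Path

def read_html_file(base_dir, file_name):
    # Path's / operator joins components portably, so the file is
    # always looked up inside the HTML subfolder of base_dir,
    # regardless of the current working directory.
    input_path = Path(base_dir) / 'HTML' / file_name
    with open(input_path, 'r') as input_file:
        return input_file.read()
```

Because the path is anchored to base_dir rather than to whatever the working directory happens to be, the FileNotFoundError from the relative './HTML/...' lookup goes away.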

Search,replace text and save as based on text in document in Python

All, I am just getting started with Python, and I thought this might be a good time to see if it can help me automate a lot of the repetitive tasks I have to complete.
I am using a script I found on GitHub that searches, replaces, and then writes a new file named output.txt. It works fine, but since I have lots of these files, I need to be able to give them different names based on the text in the final modified document.
To make this a little more difficult, the name of the file is based on the text I will be modifying the document with.
So basically, after I run this script I have a file at C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Result named output.txt. I would like to rename this modified file based on the text in a particular line of the file: specifically, the plain text on line 35 of output.txt.
I have figured out how to read the line within the file using linecache:
import linecache
line = linecache.getline("readme.txt", 1)
>>> line
'This is Python version 3.5.1\n'
I just need to figure out how to rename the file based on the variable line.
Any ideas?
#!/usr/bin/python
import os
import sys
import string
import re

# information\replacingvalues.txt is the text of the values you want in your final document
information = open(r"C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\information\replacingvalues.txt", 'r')
# Text_Find_and_Replace\Result\output.txt is the dir and the final document
output = open(r"C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Result\output.txt", 'w')
# Field is the file of words you will be replacing
field = open(r"C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Field\values.txt", 'r')
# modified code for autohotkey
# Text_Find_and_Replace\Test\remedy line 1.ahk is the original doc you want modified
with open(r"C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Test\remedy line 1.ahk", 'r') as myfile:
    inline = myfile.read()
# orig code
## with open(r"C:\Program Files (x86)\Python35-32\Scripts\Text_Find_and_Replace\Test\input.txt", 'r') as myfile:
##     inline = myfile.read()
informations = []
fields = []
dictionary = {}
i = 0
for line in information:
    informations.append(line.splitlines())
for lines in field:
    fields.append(lines.split())
    i = i + 1
if len(fields) != len(informations):
    print("replacing values and values have different numbers")
    exit()
else:
    for i in range(0, i):
        rightvalue = str(informations[i])
        rightvalue = rightvalue.strip('[]')
        rightvalue = rightvalue[1:-1]
        leftvalue = str(fields[i])
        leftvalue = leftvalue.strip('[]')
        leftvalue = leftvalue.strip("'")
        dictionary[leftvalue] = rightvalue
robj = re.compile('|'.join(dictionary.keys()))
result = robj.sub(lambda m: dictionary[m.group(0)], inline)
output.write(result)
information.close()
output.close()
field.close()
I figured out how:
import os
import linecache

linecache.clearcache()
newfilename = linecache.getline(r"C:\python 3.5\remedy line 1.txt", 37)
filename = r"C:\python 3.5\output.ahk"
os.rename(filename, newfilename.strip())
linecache.clearcache()
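For reference, the same idea can be wrapped in a slightly more defensive helper that keeps the original extension and directory, and strips characters Windows forbids in filenames (the function name and arguments here are made up; pass in whatever paths your script uses):

```python
import os
import re

def rename_from_line(target, source, line_no):
    """Rename `target` to the text found on line `line_no` (1-based)
    of `source`, keeping target's directory and file extension."""
    with open(source) as f:
        for i, line in enumerate(f, start=1):
            if i == line_no:
                new_stem = line.strip()
                break
        else:
            raise ValueError('source has fewer than %d lines' % line_no)
    # replace characters that are illegal in Windows filenames
    new_stem = re.sub(r'[\\/:*?"<>|]', '_', new_stem)
    new_path = os.path.join(os.path.dirname(target),
                            new_stem + os.path.splitext(target)[1])
    os.rename(target, new_path)
    return new_path
```

Unlike the linecache version, this fails loudly if the file is shorter than the requested line instead of silently renaming to an empty string.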

File naming problem with Python

I am trying to iterate through a number of .rtf files and, for each file: read the file, perform some operations, and then write a new file into a sub-directory as a plain text file with the same name as the original file, but with a .txt extension. The problem I am having is with the file naming.
If a file is named foo.rtf, I want the new file in the subdirectory to be foo.txt. here is my code:
import glob
import os
import numpy as np

dir_path = '/Users/me/Desktop/test/'
file_suffix = '*.rtf'
output_dir = os.mkdir('sub_dir')
for item in glob.iglob(dir_path + file_suffix):
    with open(item, "r") as infile:
        reader = infile.readlines()
        matrix = []
        for row in reader:
            row = str(row)
            row = row.split()
            row = [int(value) for value in row]
            matrix.append(row)
        np_matrix = np.array(matrix)
        inv_matrix = np.transpose(np_matrix)
    new_file_name = item.replace('*.rtf', '*.txt')  # i think this line is the problem?
    os.chdir(output_dir)
    with open(new_file_name, mode="w") as outfile:
        outfile.write(inv_matrix)
When I run this code, I get a Type Error:
TypeError: coercing to Unicode: need string or buffer, NoneType found
How can I fix my code to write new files into a subdirectory and change the file extensions from .rtf to .txt? Thanks for the help.
Instead of item.replace, check out some of the functions in the os.path module (http://docs.python.org/library/os.path.html). They're made for splitting up and recombining parts of filenames. For instance, os.path.splitext will split a pathname into its root and its file extension.
Let's say you have a file /tmp/foo.rtf and you want to move it to /tmp/foo.txt:
old_file = '/tmp/foo.rtf'
(file,ext) = os.path.splitext(old_file)
print 'File=%s Extension=%s' % (file,ext)
new_file = '%s%s' % (file,'.txt')
print 'New file = %s' % (new_file)
Or if you want the one line version:
old_file = '/tmp/foo.rtf'
new_file = '%s%s' % (os.path.splitext(old_file)[0],'.txt')
I've never used glob, but here's an alternative way without using a module:
You can easily strip the suffix using
name = name[:name.rfind('.')]
and then add the new suffix:
name = name + '.txt'
Why not use a function?
def change_suffix(string, new_suffix):
    i = string.rfind('.')
    if i < 0:
        raise ValueError('string does not have a suffix')
    if not new_suffix.startswith('.'):
        new_suffix = '.' + new_suffix
    return string[:i] + new_suffix
glob.iglob() yields pathnames without the character '*'.
Therefore your line should be:
new_file_name = item.replace('.rtf', '.txt')
Consider working with clearer names (reserve 'filename' for a file name and use 'path' for a complete path to a file; use 'path_original' instead of 'item'), together with os.extsep ('.' on Windows) and os.path.splitext():
path_txt = os.extsep.join([os.path.splitext(path_original)[0], 'txt'])
now the best hint of all:
numpy can probably read your file directly:
data = np.genfromtxt(filename, unpack=True)
(see also here)
To better understand where your TypeError comes from, wrap your code in the following try/except block:
try:
    (your code)
except:
    import traceback
    traceback.print_exc()
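Putting the answers together: the NoneType in the traceback comes from `output_dir = os.mkdir('sub_dir')`, since os.mkdir returns None, so the later os.chdir(output_dir) receives None. A small sketch of the naming half, using the os.path helpers from the answers above (the helper name is made up for illustration):

```python
import os

def txt_path_for(rtf_path, out_dir):
    """Map .../foo.rtf to out_dir/foo.txt using os.path helpers."""
    stem = os.path.splitext(os.path.basename(rtf_path))[0]
    return os.path.join(out_dir, stem + '.txt')

# Note: os.mkdir() returns None, so keep the directory *name* in its
# own variable instead of the return value:
#   out_dir = 'sub_dir'
#   os.makedirs(out_dir, exist_ok=True)
```

Writing the output to txt_path_for(item, 'sub_dir') also removes the need for os.chdir entirely.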
