I just started using Spark for the first time, for an OCR task: I have a folder of PDF files containing scanned text documents and I want to convert them to plain text. I first create a parallelized dataset of all the PDFs in the folder and perform a map operation to create the images; I use Wand images for this task. Finally, with a foreach I do the OCR using pytesseract, which is a wrapper for Tesseract.
The problem I have with this approach is that memory use increases with each new document until I finally get the error "os cannot allocate memory". I have the feeling it keeps the complete Img object in memory, but all I need is a list of the locations of the temporary files. If I run this with a few PDF files it works, but with more than 5 files the system crashes...
def toImage(f):
    documentName = f[:-4]

    def imageList(imgObject):
        # get list of generated images
        imagePrefix = "{}tmp/{}/{}".format(path, documentName, documentName)
        if len(img.sequence) > 1:
            images = [("{}-{}.jpg".format(imagePrefix, x.index), documentName) for x in img.sequence]
        else:
            images = [("{}.jpg".format(imagePrefix), documentName)]
        return images

    # store images for each file in tmp directory
    with WandImage(filename=path + f, resolution=300) as img:
        # create tmp directory
        if not os.path.exists(path + "tmp/" + documentName):
            os.makedirs(path + "tmp/" + documentName)
        # save images in tmp directory
        img.format = 'jpeg'
        img.save(filename=path + "tmp/" + documentName + '/' + documentName + '.jpg')
        imageL = imageList(img)
    return imageL
def doOcr(imageList):
    print(imageList[0][1])
    content = "\n\n***NEWPAGE***\n\n".join([pytesseract.image_to_string(Image.open(fullPath), lang='nld') for fullPath, documentName in imageList])
    with open(path + "/txt/" + imageList[0][1] + ".txt", "w") as text_file:
        text_file.write(content)
sc = SparkContext(appName="OCR")
pdfFiles = sc.parallelize([f for f in os.listdir(sys.argv[1]) if f.endswith(".pdf")])
text = pdfFiles.map(toImage).foreach(doOcr)
I'm using Ubuntu with 8 GB of memory, Java 7, and Python 3.5.
Update
I found a solution; the problem appears to be in the part where I create the image list. Using:
def imageList(imgObject):
    # get list of generated images
    # imagePrefix = "{}tmp/{}/{}".format(path, documentName, documentName)
    # if len(img.sequence) > 1:
    #     images = [("{}-{}.jpg".format(imagePrefix, x.index), documentName) for x in img.sequence]
    # else:
    #     images = [("{}.jpg".format(imagePrefix), documentName)]
    fullPath = "{}tmp/{}/".format(path, documentName)
    images = [(fullPath + f, documentName) for f in os.listdir(fullPath) if f.endswith(".jpg")]
    return natsorted(images, key=lambda y: y[0])
works perfectly, but I'm not sure why. Everything gets closed, yet it still remains in memory.
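For what it's worth, the working approach can be sketched with the stdlib alone. Here `list_page_images` is a hypothetical stand-in for the `imageList` closure, with a small regex-based numeric sort replacing `natsorted`; because it only reads filenames from disk, it never holds a reference into Wand's `img.sequence`, which is one plausible reason the memory stops growing:

```python
import os
import re
import tempfile

def list_page_images(tmp_dir, document_name):
    """Collect the JPEGs written for one document, sorted by page number.

    Stdlib-only sketch: tmp_dir and document_name stand in for the paths
    built from `path` and `documentName` in the question.
    """
    def page_key(name):
        # "doc-10.jpg" must sort after "doc-2.jpg", so compare numerically;
        # a single-page "doc.jpg" has no index and sorts first
        m = re.search(r'-(\d+)\.jpg$', name)
        return int(m.group(1)) if m else -1

    files = [f for f in os.listdir(tmp_dir) if f.endswith('.jpg')]
    return [(os.path.join(tmp_dir, f), document_name)
            for f in sorted(files, key=page_key)]

# Demo on a throwaway directory with out-of-order page files
with tempfile.TemporaryDirectory() as d:
    for n in ('doc-2.jpg', 'doc-10.jpg', 'doc-1.jpg'):
        open(os.path.join(d, n), 'w').close()
    pages = list_page_images(d, 'doc')
    print([os.path.basename(p) for p, _ in pages])  # ['doc-1.jpg', 'doc-2.jpg', 'doc-10.jpg']
```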
Related
I'm trying to use a folder to deposit images while running a Python script, storing the result in my Firebase Firestore and the images in Cloud Storage.
At the moment I have my main function, which runs the storing and the getting of the images.
And then 3 helper functions that handle downloading the images, optimizing them (making them smaller and lower quality), and naming the files.
Here are the functions:
Download Images Function:
def dl_jpg(url, file_path, file_name):
    full_path = file_path + file_name + '.jpg'
    path = urllib.request.urlretrieve(url, full_path)
Optimize image (make it smaller and lower quality):
def optimizeImage(name) -> str:
    foo = Image.open(os.path.join('/tmp/', name + '.jpg'))
    foo = foo.resize((525, 394), Image.ANTIALIAS)
    foo.save('/tmp/' + name + '.jpg', optimize=True, quality=50)
    print('Optimized Image: ' + name)
    return '/tmp/' + name + '.jpg'
Give Random Name:
def random_name() -> str:
    # build a random 10-character lowercase name
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(10))
Now, in the main function, I get the images like this:
# Images section
imagesRaw = []
imagesSection = soup.find('div', {'class': 'src__GalleryContainer-sc-bdjcm0-7'})
imagesInfo = imagesSection.find_all('img', {'class': 'gallery-image__StyledImg-sc-jtk816-0'})
image1 = imagesInfo[0].get('src')
for image in imagesInfo:
    img = image.get('data-flickity-lazyload-src')
    imagesRaw.append(img)
imagesRaw.pop(0)
imagesRaw.insert(0, image1)
images = imagesRaw[:12]
imageFile = []
# Here we will store the images in the local folder
for image in images:
    # First we change the ending from webp to jpg
    newURL = image[:-4] + 'jpg'
    print(newURL)
    name = find_between(newURL, "_img", "/origin.jpg")
    if name == "":
        name = random_name()
    print(name)
    # Here the function to download the image
    try:
        dl_jpg(newURL, '/tmp/', name)
    except:
        break
    # Here we optimize the image to size 525 x 394 pixels
    # and get the location of the new image
    try:
        path = optimizeImage(name)
    except:
        break
    # We append the path to the array of paths
    imageFile.append(path)
And finally, in the main function, I upload the images to Firebase Storage and then store the array of URLs from Storage inside the new detail in Firestore:
ref = db.collection('listings').document()
photos = []
for image in listing.photos:
    fullpath = image  # find_between(image, 'scrapping/', '.jpg') + '.jpg'
    filename = fullpath[7:]
    path = fullpath[0:6]
    print('FileName: ' + filename)
    print('path: ' + path)
    imagePath = path + '/' + filename
    bucket = store.get_bucket('testxxxxxx2365963.appspot.com')
    blob = bucket.blob('ListingImages/' + ref.id + '/' + filename)
    blob.upload_from_filename(imagePath)
    blob.make_public()
    photos.append(blob.public_url)
At the moment my problem is that it is adding an extra subfolder when uploading, failing with this error:
"[Errno 2] No such file or directory: '/tmp/h/cabujfoh.jpg'"
Any ideas how to fix this and allow the optimized images to be uploaded?
For any of you tracking this:
I found the problem. Locally I was using the folder:
images/
and have now changed to tmp, which is shorter. In these lines:
filename = fullpath[7:]
path = fullpath[0:6]
I got the route information, and I noticed that the full path wasn't correct, so I changed it to this:
fullpath = image  # find_between(image, 'scrapping/', '.jpg') + '.jpg'
fullpath2 = fullpath[1:]
filename = fullpath2.split('/', 1)[1]
path = '/tmp'
imagePath = path + '/' + filename
Now Working
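As an aside, the same fix can be written without any fixed-offset slicing at all; this is a small stdlib sketch (the `/tmp/cabujfoh.jpg` path is just the example from the error above), since slicing like fullpath[7:] / fullpath[0:6] breaks as soon as the folder name changes length:

```python
import os

# Hypothetical local path of one downloaded image, taken from the error message
image = '/tmp/cabujfoh.jpg'

# os.path derives the parts regardless of how long the folder name is
filename = os.path.basename(image)   # 'cabujfoh.jpg'
path = os.path.dirname(image)        # '/tmp'
imagePath = os.path.join(path, filename)
print(imagePath)  # /tmp/cabujfoh.jpg
```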
I am trying to convert all PDF files to .jpg files and then remove them from the directory. I am able to convert all PDFs to JPGs, but when I try to delete them I get the error "The file is being used by another process".
Could you please help me?
Below is the code.
The script below converts all PDFs to JPEGs and stores them in the same location.
for fn in files:
    doc = fitz.open(pdffile)
    page = doc.loadPage(0)  # number of page
    pix = page.getPixmap()
    fn1 = fn.replace('.pdf', '.jpg')
    output = fn1
    pix.writePNG(output)
    os.remove(fn)  # one file at a time
path = 'D:/python_ml/Machine Learning/New folder/Invoice/'
i = 0
for file in os.listdir(path):
    path_to_zip_file = os.path.join(path, folder)
    if file.endswith('.pdf'):
        os.remove(file)
        i += 1
As @K J noted in their comment, the problem is most probably files not being closed, and indeed your code never closes the doc object(s).
(Based on the line fitz.open(pdffile), I guess you are using the PyMuPDF library.)
The problematic fragment:
doc = fitz.open(pdffile)
page = doc.loadPage(0) # number of page
pix = page.getPixmap()
fn1 = fn.replace('.pdf', '.jpg')
output = fn1
pix.writePNG(output)
...should be adjusted, e.g., in the following way:
with fitz.open(pdffile) as doc:
    page = doc.loadPage(0)  # number of page
    pix = page.getPixmap()
    output = fn.replace('.pdf', '.jpg')
    pix.writePNG(output)
(Side note: the fn1 variable seemed completely redundant, so I got rid of it. Also, shouldn't pdffile be replaced with fn? What is pdffile, actually?)
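The close-before-remove discipline the with-block enforces can be demonstrated with a plain file handle (a hypothetical stand-in for the fitz document; on Windows, an open handle is exactly what blocks os.remove):

```python
import os
import tempfile

# Create a throwaway file standing in for one of the PDFs
fd, path = tempfile.mkstemp(suffix='.pdf')
os.close(fd)

with open(path, 'rb') as doc:
    doc.read()          # the handle is released when the with-block exits

os.remove(path)         # safe: nothing holds the file open any more
print(os.path.exists(path))  # False
```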
I have a conversion script which converts PDF files and image files to text files, but it takes forever to run. It took me almost 48 hours to finish 2000 PDF documents, and I now have a pool of around 12,000+ documents to convert. At my previous rate I can't imagine how long the conversion will take with my code. Is there anything I can do or change in my code to make it run faster?
Here is the code that I used.
def tesseractOCR_pdf(pdf):
    filePath = pdf
    pages = convert_from_path(filePath, 500)
    # Counter for naming each page image of the PDF
    image_counter = 1
    # Iterate through all the pages stored above
    for page in pages:
        # Declaring filename for each page of PDF as JPG
        # For each page, filename will be:
        # PDF page 1 -> page_1.jpg
        # PDF page 2 -> page_2.jpg
        # ....
        # PDF page n -> page_n.jpg
        filename = "page_" + str(image_counter) + ".jpg"
        # Save the image of the page on disk
        page.save(filename, 'JPEG')
        # Increment the counter to update filename
        image_counter = image_counter + 1
    # Variable to get count of total number of pages
    filelimit = image_counter - 1
    # Create an empty string for storing purposes
    text = ""
    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):
        # Set filename to recognize text from
        # Again, these files will be:
        # page_1.jpg
        # page_2.jpg
        # ....
        # page_n.jpg
        filename = "page_" + str(i) + ".jpg"
        # Recognize the text as a string in the image using pytesseract
        text += str(pytesseract.image_to_string(Image.open(filename)))
    text = text.replace('-\n', '')
    # Delete all the jpg files created above
    for i in glob.glob("*.jpg"):
        os.remove(i)
    return text
def tesseractOCR_img(img):
    filePath = img
    text = str(pytesseract.image_to_string(filePath, lang='eng', config='--psm 6'))
    text = text.replace('-\n', '')
    return text
def Tesseract_ALL(docDir, txtDir, troubleDir):
    if docDir == "":
        docDir = os.getcwd() + "\\"  # if no docDir passed in
    for doc in os.listdir(docDir):  # iterate through docs in doc directory
        try:
            fileExtension = doc.split(".")[-1]
            if fileExtension == "pdf":
                pdfFilename = docDir + doc
                text = tesseractOCR_pdf(pdfFilename)  # get string of text content of pdf
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w")  # make text file
                textFile.write(text)  # write text to text file
            else:
                # elif (fileExtension == "tif") | (fileExtension == "tiff") | (fileExtension == "jpg"):
                imgFilename = docDir + doc
                text = tesseractOCR_img(imgFilename)  # get string of text content of img
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w")  # make text file
                textFile.write(text)  # write text to text file
        except:
            print("Error in file: " + str(doc))
            shutil.move(os.path.join(docDir, doc), troubleDir)
    for filename in os.listdir(txtDir):
        fileExtension = filename.split(".")[-2]
        if fileExtension == "pdf":
            os.rename(txtDir + filename, txtDir + filename.replace('.pdf', ''))
        elif fileExtension == "tif":
            os.rename(txtDir + filename, txtDir + filename.replace('.tif', ''))
        elif fileExtension == "tiff":
            os.rename(txtDir + filename, txtDir + filename.replace('.tiff', ''))
        elif fileExtension == "jpg":
            os.rename(txtDir + filename, txtDir + filename.replace('.jpg', ''))
docDir = "/drive/codingstark/Project/pdf/"
txtDir = "/drive/codingstark/Project/txt/"
troubleDir = "/drive/codingstark/Project/trouble_pdf/"
Tesseract_ALL(docDir, txtDir, troubleDir)
Does anyone know how can I edit my code to make it run faster?
I think a process pool would be perfect for your case.
First you need to figure out which parts of your code can run independently of each other, then wrap them in a function.
Here is an example
from concurrent.futures import ProcessPoolExecutor

def do_some_OCR(filename):
    pass

with ProcessPoolExecutor() as executor:
    for file in file_list:
        _ = executor.submit(do_some_OCR, file)
The code above submits each file to a pool of worker processes, so the files are processed in parallel.
You can find the official documentation here: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor
There is also a really awesome video that shows step by step how to use processes for exactly this: https://www.youtube.com/watch?v=fKl2JW_qrso
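To collect the per-file results rather than fire-and-forget, executor.map is handy. A runnable sketch follows, using ThreadPoolExecutor only so it runs anywhere without a __main__ guard; for CPU-bound OCR you would swap in ProcessPoolExecutor with the same interface, and do_some_OCR here is just a stand-in for the real work:

```python
from concurrent.futures import ThreadPoolExecutor

def do_some_OCR(filename):
    # hypothetical stand-in for the real per-file OCR work
    return filename.upper()

file_list = ['a.pdf', 'b.pdf', 'c.pdf']

# executor.map returns results in input order, which makes it easy
# to pair each file with its extracted text afterwards
with ThreadPoolExecutor() as executor:
    results = list(executor.map(do_some_OCR, file_list))
print(results)  # ['A.PDF', 'B.PDF', 'C.PDF']
```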
Here is a compact version of the function, with the file-writing removed. I think this should work based on what I was reading in the APIs, but I haven't tested it.
Note that I changed from a string to a list, because appending to a list is MUCH less costly than concatenating onto a string (see How slow is Python's string concatenation vs. str.join?). The TL;DR is that string concatenation makes a new string every time, so with large strings you end up copying over and over.
Also, calling replace on the whole string after each concatenation again created a new string each time, so I moved it to operate on each page's string as it is generated. If for some reason the '-\n' sequence is an artifact that only occurs because of the concatenation, then the replace should instead go here: return ''.join(pageText).replace('-\n', ''), but realize that putting it there creates one new string for the join and then a whole new string for the replace.
def tesseractOCR_pdf(pdf):
    pages = convert_from_path(pdf, 500)
    # Create an empty list for storing purposes
    pageText = []
    # Iterate through all the pages; each is a PIL Image
    for page in pages:
        # Recognize the text in the image using pytesseract,
        # removing the -\n characters as each page is added
        pageText.append(str(pytesseract.image_to_string(page)).replace('-\n', ''))
    return ''.join(pageText)
An even more compact one-liner version
def tesseractOCR_pdf(pdf):
    # Take each page of the pdf, extract the text, remove -\n, and combine
    return ''.join([str(pytesseract.image_to_string(page)).replace('-\n', '') for page in convert_from_path(pdf, 500)])
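The join-vs-concatenation point above can be seen with a quick timeit check. This is illustrative only: exact numbers vary by machine and CPython version, but the gap widens as the strings grow:

```python
import timeit

# 1000 chunks of 100 characters each, standing in for per-page OCR text
chunks = ['x' * 100] * 1000

def concat():
    s = ''
    for c in chunks:
        s += c          # may re-copy the growing string on each pass
    return s

def joined():
    return ''.join(chunks)  # one allocation for the final string

# Both build the same 100,000-character string
assert concat() == joined()

print('+=  :', timeit.timeit(concat, number=100))
print('join:', timeit.timeit(joined, number=100))
```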
I am a newbie to Python. I am trying to put my 22k images into a matrix before processing them with a CNN. However, I've run into a situation where I can't tell what I did wrong.
path1 = 'C:/Users/Z/Documents/Python Scripts/Data'
path2 = 'C:/Users/Z/Documents/Python Scripts/Data1'
listing = os.listdir(path1)
num_samples = size(listing)
for file in listing:
    im = Image.open(path1 + '\\' + file)
    img_rows, img_cols = 224, 224
    img = im.resize((img_rows, img_cols), 3)
    img.save(path2 + '\\' + file, "JPEG")
imlist = os.listdir(path2)
img_data_list = []
a = Image.open('Data1' + '\\' + imlist[0])  # open one image to get size
im1 = array(a)
m, n = im1.shape[0:3]  # get the size of the images
imnbr = len(imlist)  # get the number of images
num_samples = len(imlist)
I got this error
Your path is incorrect when you open the single image, it should be:
a = Image.open(path2 + '\\'+ imlist[0])
a = Image.open(path2 + '\\'+ imlist[0])
You just had a small code error: 'Data1' isn't a correct path in
a = Image.open('Data1' + '\\' + imlist[0])  # open one image to get size
You are supposed to read from path2, aren't you?
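More generally, os.path.join avoids hand-building paths with '\\' entirely, which sidesteps this whole class of bug. A small sketch, with path2 taken from the question and a hypothetical directory listing:

```python
import os

path2 = 'C:/Users/Z/Documents/Python Scripts/Data1'  # from the question
imlist = ['img001.jpg']                              # hypothetical listing

# os.path.join inserts the platform's separator, instead of
# concatenating like path2 + '\\' + imlist[0]
full = os.path.join(path2, imlist[0])
print(full)
```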
I am trying to export my PSD file to BMP.
If I delete the line marked ###here, it generates test.png correctly,
but I want a BMP file.
If I use the ###here lines, I get "AttributeError: Property 'Photoshop.BMPSaveOptions.Format' can not be set."
import win32com.client
import os
fn='test.psd'
psApp = win32com.client.Dispatch('Photoshop.Application')
options = win32com.client.Dispatch('Photoshop.ExportOptionsSaveForWeb')
options.Format = 13 # PNG
options.PNG8 = False # Sets it to PNG-24 bit
#options = win32com.client.Dispatch('Photoshop.BMPSaveOptions') ###here del
#options.Format = 2 # bmp
#
fd=os.path.abspath('.')
fk=os.path.join(fd, fn)
doc = psApp.Open(fk)
fn='BBB'
fn = os.path.splitext(fk)[0] + '_' + fn + '.png'
#fn = os.path.splitext(fk)[0] + '_' + fn + '.bmp' ###
doc.Export(ExportIn=fn, ExportAs=2, Options=options) #ExportAs=2,
doc.Close(2)
If I am reading your question properly (apologies if I am not), you want to save the file in BMP rather than PNG format. My guess is you need to change options.Format:
options.Format = 13 # PNG
After some research it looks like BMP is 2 so I'd change your code to:
options.Format = 2 # BMP
As a note, I'd also recommend changing the filename when you save the file, to avoid confusion. Maybe this?
fn = os.path.splitext(fk)[0] + '_' + fn + '.bmp'