Python multithreading issue with file manipulation

I've been trying to get the hang of multithreading in Python. However, whenever I attempt to make it do something that might be considered useful, I run into issues.
In this case, I have 300 PDF files. For simplicity, we'll assume that each PDF only has a unique number on it (say 1 to 300). I'm trying to make Python open the file, grab the text from it, and then use that text to rename the file accordingly.
The non-multithreaded version I wrote works great, but it's a bit slow, so I thought I'd see if I could speed it up. However, this version finds the very first file, renames it correctly, and then throws an error saying:
FileNotFoundError: [Errno 2] No such file or directory: './pdfPages/1006941.pdf'
It's basically telling me that it can't find a file by that name. The reason it can't is that the file has already been renamed. And in my head that tells me that I've probably messed something up with this loop and/or with multithreading in general.
Any help would be appreciated.
Source:
import PyPDF2
import os
from os import listdir
from os.path import isfile, join
from PyPDF2 import PdfFileWriter, PdfFileReader
from multiprocessing.dummy import Pool as ThreadPool

# Global
i = 0

def readPDF(allFiles):
    print(allFiles)
    global i
    while i < l:
        i += 1
        pdf_file = open(path+allFiles, 'rb')
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        page = read_pdf.getPage(0)
        page_content = page.extractText()
        pdf_file.close()
        Text = str(page_content.encode('utf-8')).strip("b").strip("'")
        os.rename(path+allFiles, path+pre+"-"+Text+".PDF")

pre = "77"
path = "./pdfPages/"
included_extensions = ['pdf', 'PDF']
allFiles = [f for f in listdir(path) if any(f.endswith(ext) for ext in included_extensions)]  # Get all files in the directory
l = len(allFiles)

pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)
pool.close()
pool.join()

Yes, you have, in fact, messed up with the loop, as you say. The loop should not be there at all; that part is handled implicitly by pool.map(...), which ensures that each function call receives a unique file name from your list to work with. You should not do any other looping.
I have updated your code below by removing the loop and making some other changes (minor, but still improvements, I think):
# Removed a number of imports
import PyPDF2
import os
from multiprocessing.dummy import Pool as ThreadPool

# Removed the global counter variable, which is not needed

def readPDF(allFiles):
    # The while loop is not needed, as pool.map will hand a different
    # file to each call anyway
    print(allFiles)
    pdf_file = open(path+allFiles, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    pdf_file.close()
    Text = str(page_content.encode('utf-8')).strip("b").strip("'")
    os.rename(path+allFiles, path+pre+"-"+Text+".PDF")

pre = "77"
path = "./pdfPages/"
included_extensions = ('pdf', 'PDF')  # A tuple instead of a list allows a simpler f.endswith(...)
allFiles = [f for f in os.listdir(path) if f.endswith(included_extensions)]

pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)  # doThings will be a list of Nones, since readPDF returns nothing
pool.close()
pool.join()
Thus, the global variable and the counter are not needed, since all of that is handled implicitly. But even with these changes, it is not at all certain that this will speed up your execution much. Most likely, the bulk of your program's time is spent waiting for the disk. In that case, even with multiple threads, they will still have to wait for the main resource, i.e. the hard drive. But to know for certain, you have to test.
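If you want to test it, a minimal timing sketch could look like the following. Note that rename_one_pdf and all_files are hypothetical stand-ins for your readPDF function and file list, and since renaming is destructive, each timing run should start from a fresh copy of the pdfPages directory.
import time
from multiprocessing.dummy import Pool as ThreadPool

def time_it(label, fn):
    # Run fn() once and report how long it took.
    start = time.time()
    fn()
    print(label, "took %.2f seconds" % (time.time() - start))

# rename_one_pdf is a stand-in for readPDF, all_files for the list of file names.
# Run each timing against a fresh copy of the directory, since renaming is destructive.
time_it("sequential", lambda: [rename_one_pdf(f) for f in all_files])

pool = ThreadPool(4)
time_it("4 threads", lambda: pool.map(rename_one_pdf, all_files))
pool.close()
pool.join()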

Related

Printing a PDF with Python from a variable instead of a file

I would like to print a PDF file with an external printer. However, since I'm opening, creating, or transforming multiple files in a loop, I would like to print the result without having to save it as a PDF file on every iteration.
Simplified code looks like this:
import PyPDF2
import os
pdf_in = open('tubba.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_in)
pdf_writer = PyPDF2.PdfFileWriter()
page = pdf_reader.getPage(0)
page.rotateClockwise(90)
# Some other operations done on the page, such as scaling, cropping etc.
pdf_writer.addPage(page)
pdf_out = open('rotated.pdf', 'wb')
pdf_writer.write(pdf_out)
pdf_print = os.startfile('rotated.pdf', 'print')
pdf_out.close()
pdf_in.close()
Is there any way to print "page", or "pdf_writer"?
Best regards
You can just use variables. E.g.:
path = r'C:\yourfile.pdf'
os.startfile(path)  # just pass the variable here
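If the goal is simply to avoid cluttering the working directory with one intermediate file per iteration, a possible sketch (assuming Windows, since os.startfile is Windows-only, and the legacy PyPDF2 PdfFileReader/PdfFileWriter API used in the question) is to write the writer's output to a temporary file and hand that to the print verb:
import os
import tempfile
import PyPDF2

pdf_in = open('tubba.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_in)
pdf_writer = PyPDF2.PdfFileWriter()

page = pdf_reader.getPage(0)
page.rotateClockwise(90)
pdf_writer.addPage(page)

# Write to a named temporary file; delete=False so the print spooler
# can reopen it by name after our handle is closed.
with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
    pdf_writer.write(tmp)
    tmp_name = tmp.name

pdf_in.close()
os.startfile(tmp_name, 'print')  # Windows-only
# Delete tmp_name later, once the spooler has finished with it.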

How to run python functions sequentially

In the code below, "list.py" reads target_list.txt and creates a domain list in the form "http://testsites.com".
Only once that process is complete do I know the target list is finished, and only then should my other function run. How do I sequence them properly?
#!/usr/bin/python
import Queue

targetsite = "target_list.txt"

def domaincreate(targetsitelist):
    for i in targetsite.readlines():
        i = i.strip()
        Url = "http://" + i
        DomainList = open("LiveSite.txt", "rb")
        DomainList.write(Url)
        DomainList.close()

def SiteBrowser():
    TargetSite = "LiveSite.txt"
    Tar = open(TargetSite, "rb")
    for Links in Tar.readlines():
        Links = Links.strip()
        UrlSites = "http://www." + Links
        browser = webdriver.Firefox()
        browser.get(UrlSites)
        browser.save_screenshot(Links+".png")
        browser.quit()

domaincreate(targetsite)
SiteBrowser()
I suspect that, whatever problem you have, a large part of it is that you are trying to write to a file that is open read-only. If you're running on Windows, you may also run into trouble later because you are opening the files in binary mode but writing text (on a UNIX-based system there's no problem).
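Calling the two functions one after the other at module level already runs them sequentially, so the main fixes are the file modes and iterating over the opened file rather than the filename string. A rough sketch of that reading of the intent follows; it assumes Selenium's Firefox driver as in your code, and it stores bare domains in LiveSite.txt so the URL is only built in one place:
from selenium import webdriver

def domaincreate(target_filename):
    # Write mode ("w"), not read mode, so writing to LiveSite.txt works.
    with open(target_filename, "r") as targets, open("LiveSite.txt", "w") as domain_list:
        for line in targets:
            domain_list.write(line.strip() + "\n")

def sitebrowser():
    # Runs only after domaincreate() has returned, so LiveSite.txt is complete.
    with open("LiveSite.txt", "r") as sites:
        for domain in sites:
            domain = domain.strip()
            browser = webdriver.Firefox()
            browser.get("http://www." + domain)
            browser.save_screenshot(domain + ".png")
            browser.quit()

domaincreate("target_list.txt")
sitebrowser()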

Python: pickle error with multiprocessing

As suggested below, I have changed my code to use Pool instead. I've also simplified my functions and included all my code below. However, now I'm getting a different error: NameError: global name 'split_files' is not defined
What I want to do is pass the actual file chunk into the parse_csv_chunk function but I'm not sure how to do that.
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

csv_filename = 'test.csv'

def parse_csv_chunk(files_index):
    global split_files
    print files_index
    print len(split_files)
    return 1

def split(infilename, num_chunks):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'Original file size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'Target chunk size:', chunk_size
    print 'Target number of chunks:', num_chunks
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        infile.next()
        infile.next()
        infile.next()
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    # write 3 lines before checking if still < chunk_size;
                    # this is done to improve performance, so each chunk
                    # will not be exactly the same size
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                    temp_file.write(infile.next())
                except StopIteration:  # end of original file
                    break
            # rewind each chunk
            temp_file.seek(0)
            files.append(temp_file)
    return files

if __name__ == '__main__':
    start = time.time()
    num_chunks = mp.cpu_count()
    split_files = split(csv_filename, num_chunks)
    print 'Number of files after splitting: ', len(split_files)
    pool = mp.Pool(processes=num_chunks)
    results = [pool.apply_async(parse_csv_chunk, args=(x,)) for x in range(num_chunks)]
    output = [p.get() for p in results]
    print output
I'm trying to split up a csv file into parts and have them processed by each of my CPU's cores. This is what I have so far:
import csv
from itertools import islice
from collections import deque
import time
import math
import multiprocessing as mp
import os
import sys
import tempfile

def parse_csv_chunk(infile):
    #code here
    return

def split(infilename, num_chunks):
    #code here
    return files

def get_header_indices(infilename):
    #code here
    return

if __name__ == '__main__':
    start = time.time()  # start measuring performance
    num_chunks = mp.cpu_count()  # record number of CPU cores
    files = split(csv_filename, num_chunks)  # split csv file into as many chunks as there are CPU cores and store them in a list
    print 'number of files after splitting: ', len(files)
    get_header_indices(csv_filename)  # get headers of csv file
    print headers_list
    processes = [mp.Process(target=parse_csv_chunk,
                            args=ifile) for ifile in enumerate(files)]  # create a process for each file chunk
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    end = time.time()
    print "Execution time: %.2f" % (end - start)  # display performance
There seems to be a problem at the line 'p.start()'. I see a lot of output on the console, which eventually indicates an error:
pickle.PicklingError: Can't pickle <built-in method write of file object at 0x0222EAC8>: it's not found as __main__.write
I did not include the code for the functions I called as they are quite long, but I can if needed. I'm wondering if I'm using multiprocessing correctly.
First off, is there a reason you are not using a Pool and the Pool's imap method?
Second, it's very hard to tell any specifics without seeing your code, especially since the error points to parts of the code that are not provided.
However, it looks like you are using multiprocessing correctly from what you have provided -- and it's a serialization problem.
Note that if you use dill, you can serialize the write method.
>>> import dill
>>>
>>> f = open('foo.bar', 'w')
>>> dill.dumps(f.write)
'\x80\x02cdill.dill\n_get_attr\nq\x00cdill.dill\n_create_filehandle\nq\x01(U\x07foo.barq\x02U\x01wq\x03K\x00\x89c__builtin__\nopen\nq\x04\x89K\x00U\x00q\x05tq\x06Rq\x07U\x05writeq\x08\x86q\tRq\n.'
Most versions of multiprocessing use cPickle (or a version of pickle that is built in C), and while dill can inject its types into the Python version of pickle, it can't do so in the C equivalent.
There is a dill-activated fork of multiprocessing, so you might try that; if it's purely a pickling problem, you should get past it with multiprocess.
See: https://github.com/uqfoundation/multiprocess.
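Assuming the multiprocess package is installed (e.g. via pip), it is intended as a drop-in replacement, so a usage sketch is just an import swap:
# pip install multiprocess   (assumed to be installed)
import multiprocess as mp    # same API as multiprocessing, but serializes with dill

def work(x):
    return x * x

if __name__ == '__main__':
    pool = mp.Pool(4)
    print(pool.map(work, range(8)))
    pool.close()
    pool.join()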
EDIT (after OP update): The global declaration in your helper function isn't going to play well with pickle. Why not just use a payload function (like split) that reads a portion of the file and returns the contents, or writes to the target file? Don't return a list of files. I know they are TemporaryFile objects, but unless you use dill (and even then it's touchy) you can't pickle a file. If you absolutely have to, return the file name, not the file, and don't use a TemporaryFile. pickle will choke trying to pass the file. So, you should refactor your code, or as I suggested earlier, try to see if you can bypass serialization issues by using multiprocess (which uses dill).
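A minimal sketch of that last suggestion, with hypothetical names: split the input into ordinary named files and pass only the file names to the workers, so nothing unpicklable ever crosses the process boundary (for brevity this reads the whole file into memory, unlike your buffered split):
import multiprocessing as mp

def split_to_named_files(infilename, num_chunks):
    # Write each chunk to 'chunk_0.csv', 'chunk_1.csv', ... and return
    # just the file names, which pickle without trouble.
    with open(infilename, 'rb') as infile:
        lines = infile.readlines()
    per_chunk = max(1, len(lines) // num_chunks)
    chunk_names = []
    for i in range(num_chunks):
        start = i * per_chunk
        end = None if i == num_chunks - 1 else (i + 1) * per_chunk
        name = 'chunk_%d.csv' % i
        with open(name, 'wb') as out:
            out.writelines(lines[start:end])
        chunk_names.append(name)
    return chunk_names

def parse_csv_chunk(chunk_name):
    # Each worker opens its own chunk by name; no globals needed.
    with open(chunk_name, 'rb') as f:
        return sum(1 for _ in f)  # stand-in for the real parsing work

if __name__ == '__main__':
    num_chunks = mp.cpu_count()
    names = split_to_named_files('test.csv', num_chunks)
    pool = mp.Pool(processes=num_chunks)
    print(pool.map(parse_csv_chunk, names))
    pool.close()
    pool.join()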

How to define an input for python multiprocessing function to take all the files in a directory?

This question may sound basic, because I do not know that much about multiprocessing; I am just learning.
I have python code which processes a bunch of files in a directory.
with Pool(processes=cores) as pp:
    pp.map(function, list)
Here is my code:
path = '/data/personal'
print("Running with PID: %d" % getpid())

psl = PublicSuffixList()
d = defaultdict(set)
start = time()
#
files_list = glob(path)
for filename in files:
    print(filename)
    f = open(filename, 'r')
    for n, line in enumerate(f):
        line = line[:-1]
        ip, reversed_domain_1 = line.split('|')
        reversed_domain_2 = reversed_domain_1.split('.')
        reversed_domain_3 = list(reversed(reversed_domain_2))
        domain = ('.'.join(reversed_domain_3))
        domain = psl.get_public_suffix(domain)
        d[ip].add(domain)
###
for ip, domains in d.iteritems():
    for domain in domains:
        print(ip, domain)
How can I convert it to run in a multiprocessing pool?
You can process each file in a separate process like this:
from os import getpid
from collections import defaultdict
from glob import glob
from multiprocessing import Pool
from time import time
from functools import partial

path = '/data/personal'
print("Running with PID: %d" % getpid())

def process_file(psl, filename):
    print(filename)
    f = open(filename, 'r')
    for n, line in enumerate(f):
        line = line[:-1]
        ip, reversed_domain_1 = line.split('|')
        reversed_domain_2 = reversed_domain_1.split('.')
        reversed_domain_3 = list(reversed(reversed_domain_2))
        domain = ('.'.join(reversed_domain_3))
        domain = psl.get_public_suffix(domain)
    return ip, domain

if __name__ == "__main__":
    psl = PublicSuffixList()
    d = defaultdict(set)
    start = time()
    files_list = glob(path)
    pp = Pool(processes=cores)
    func = partial(process_file, psl)
    results = pp.imap_unordered(func, files_list)
    for ip, domain in results:
        d[ip].add(domain)
    pp.close()
    pp.join()
    for ip, domains in d.iteritems():
        for domain in domains:
            print(ip, domain)
Note that the defaultdict is populated in the parent process, because you can't actually share the same defaultdict between multiple processes without using a multiprocessing.Manager. You could do that here if you wanted, but I don't think it's necessary. Instead, as soon as any child has a result available, we add it to the defaultdict in the parent. Using imap_unordered instead of map enables us to receive results on demand, rather than having to wait for all of them to be ready. The only other notable thing is the use of partial to pass the psl object to all the child processes, in addition to an item from files_list, with imap_unordered.
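If you did want a single dict shared across processes, a rough sketch with multiprocessing.Manager might look like this (the file names and the per-line parsing are hypothetical stand-ins, and the proxy adds IPC overhead, which is why I don't think it's necessary here):
from multiprocessing import Pool, Manager
from functools import partial

def process_file(shared_d, filename):
    # Each worker writes into the managed dict through its proxy.
    with open(filename, 'r') as f:
        for line in f:
            ip, domain = line.rstrip('\n').split('|')  # stand-in for the real per-line work
            existing = shared_d.get(ip, set())
            existing.add(domain)
            shared_d[ip] = existing  # reassign so the proxy registers the change

if __name__ == '__main__':
    manager = Manager()
    shared_d = manager.dict()
    pool = Pool(processes=4)
    pool.map(partial(process_file, shared_d), ['a.txt', 'b.txt'])  # hypothetical input files
    pool.close()
    pool.join()
    print(dict(shared_d))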
One important note here: using multiprocessing for this kind of operation may not actually improve performance. A lot of the work you're doing is reading from disk, which can't be sped up by multiple processes; your hard drive can only do one read operation at a time. Getting read requests for different files from a bunch of processes at once can actually slow things down compared to reading them sequentially, because the drive potentially has to constantly seek to different areas of the physical disk to read a new line from each file. Now, it's possible that the CPU-bound work you're doing with each line is expensive enough to dominate that I/O time, in which case you will see a speed boost.

how do I generate multiprocessing jobs inside a list without making duplicates?

How do I get a multiprocessing setup to work when it generates its jobs inside a list?
I keep getting:
assert self._popen is None, 'cannot start a process twice'
AttributeError: 'Worker' object has no attribute '_popen'
which makes sense, because I'm basically making multiple instances of the same job. So how do I fix that? Do I need to set up a multiprocessing pool?
Let me know if I need to clarify things more.
Here is my multiprocessing class:
class Worker(multiprocessing.Process):
    def __init__(self, output_path, source, file_name):
        self.output_path = output_path
        self.source = source
        self.file_name = file_name

    def run(self):
        t = HTML(self.source)
        output = open(self.output_path+self.file_name+'.html', 'w')
        word_out = open(self.output_path+self.file_name+'.txt', 'w')
        try:
            output.write(t.tokenized)
            for w in word_list:
                if w:
                    word_out.write(w+'\n')
            word_out.close()
            output.close()
            word_list = []
        except IndexError:
            output.write(s[1])
            output.close()
            word_out.close()
        except UnboundLocalError:
            output.write(s[1])
            output.close()
            word_out.close()
Here is the class that implements this whole thing:
class implement(HTML):
    def __init__(self, input_path, output_path):
        self.input_path = input_path
        self.output_path = output_path

    def ensure_dir(self, directory):
        if not os.path.exists(directory):
            os.makedirs(directory)
        return directory

    def prosses_epubs(self):
        for root, dirs, files in os.walk(self.input_path+"\\"):
            epubs = [root+file for file in files if file.endswith('.epub')]
            output_file = [self.ensure_dir(self.output_path+"\\"+os.path.splitext(os.path.basename(e))[0]+'_output\\') for e in epubs]
            count = 0
            for e in epubs:
                epub = epubLoader(e)
                jobs = []
                # this is what's breaking everything right here. I'm not sure how to fix it.
                for output_epub in epub.get_html_from_epub():
                    worker = Worker(output_file[count], output_epub[1], output_epub[0])
                    jobs.append(worker)
                    worker.start()
                for j in jobs:
                    j.join()
                count += 1
        print "done!"

if __name__ == '__main__':
    test = implement('some local directory', 'some local directory')
    test.prosses_epubs()
Any help on this would be greatly appreciated. Also, let me know if something I'm doing in my code can be done better; I'm always trying to learn how to do things the best way.
Don't use classes when functions suffice. In this case, your classes each have essentially one meaty method, and the __init__ method is simply holding arguments used in the meaty method. You can make your code sleeker by just making the meaty method a function and passing the arguments directly to it.
Separate the idea of "jobs" (i.e. tasks) from the idea of "workers" (i.e. processes). Your machine has a limited number of processors, but the number of jobs could be far greater. You don't want to open a new process for each job, since that might swamp your CPUs -- essentially fork-bombing yourself.
Use the with statement to guarantee that your file handles get closed. I see output.close() and word_out.close() getting called in three different places each. You can eliminate all those lines by using the with statement, which will automatically close those file handles once Python leaves the with suite.
I think a multiprocessing Pool would work well with your code. The jobs can be sent to the workers in the pool using pool.apply_async. Each call queues a job, which will wait until a worker in the pool is available to handle it. pool.join() causes the main process to wait until all the jobs are done.
Use os.path.join instead of joining directories with '\\'. This will make your code compatible with non-Windows machines.
Use enumerate instead of manually implementing/incrementing counters. It's less typing and will make your code a bit more readable.
The following code will not run since epubLoader, HTML, and word_list are not defined, but it may give you a clearer idea of what I am suggesting above:
import multiprocessing as mp
import os

def worker(output_path, source, filename):
    t = HTML(source)
    output_path = output_path + filename
    output = open(output_path + '.html', 'w')
    word_out = open(output_path + '.txt', 'w')
    with output, word_out:
        try:
            output.write(t.tokenized)
            for w in word_list:
                if w:
                    word_out.write(w + '\n')
            word_list = []
        except IndexError:
            output.write(s[1])
        except UnboundLocalError:
            output.write(s[1])

def ensure_dir(directory):
    if not os.path.exists(directory):
        os.makedirs(directory)
    return directory

def process_epubs(input_path, output_path):
    pool = mp.Pool()
    for root, dirs, files in os.walk(input_path):
        epubs = [os.path.join(root, file) for file in files
                 if file.endswith('.epub')]
        output_file = [
            ensure_dir(os.path.join(
                output_path,
                os.path.splitext(os.path.basename(e))[0] + '_output'))
            for e in epubs]
        for count, e in enumerate(epubs):
            epub = epubLoader(e)
            for filename, source in epub.get_html_from_epub():
                pool.apply_async(
                    worker,
                    args=(output_file[count], source, filename))
    pool.close()
    pool.join()
    print "done!"

if __name__ == '__main__':
    process_epubs('some local directory', 'some local directory')
