Python multiprocessing Pool collision (file writing) due to process race condition

The objective of the code is to read an sqlite3 file, process the text, and write the result to another file (gzip format). I am trying to use multiprocessing with a Pool, but
sometimes it generates errors and stops with the message "Cannot create a file when that file already exists". After the failure, if I repeat the same code, it works fine in most cases, which means this happens only occasionally.
I guess this is related to a race between the processes in the pool, but I cannot find a way to solve the problem. Normally it works fine, but sometimes it causes this problem.
I also tried terminating all the processes at each directory level and starting new processes for the next directory.
P.S. Environment: Windows Server 64-bit, Python 2.7 64-bit
import sqlite3 as lite
import gzip
import glob
import time
import multiprocessing

def convert_txt((infile, outfile)):
    try:
        conn = lite.connect(infile)
        conn.text_factory = str
    except:
        print 'Sql Lite error:', infile
        return
    try:
        fout = gzip.open(outfile, 'wb')
    except:
        print 'File write error:', outfile
        return
    for line in conn.iterdump():
        fout.write(line.replace('abc', 'def'))
    fout.close()
for directory in directory_list:
    filenames = glob.glob('B:\\Hebercity UT\\*.txt')
    p = multiprocessing.Pool(min(10, len(filenames)))
    file_list = []
    for input_file in filenames:
        output_file = input_file.replace('.txt', '.csv')
        file_list.append([input_file, output_file])
    p.map(convert_txt, file_list)
    time.sleep(1)
    p.close()  # close the pool and start a new pool in the next directory
Traceback (most recent call last):
  File "B:\gws_txt_converter_multi.py", line 100, in <module>
    p.map(convert_txt_msg, fflist)
  File "C:\opt\Anaconda\lib\multiprocessing\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "C:\opt\Anaconda\lib\multiprocessing\pool.py", line 558, in get
    raise self._value
WindowsError: [Error 183] Cannot create a file when that file already exists: 'B:\\Hebercity UT'
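Note that the failing path 'B:\\Hebercity UT' is the directory itself, not one of the output files, which points at the usual Windows multiprocessing culprit discussed in the related answers below: without an if __name__ == '__main__': guard, every pool worker re-imports the script and re-runs its top-level code. A minimal sketch of the guarded loop, assuming directory_list and convert_txt are defined as above; the per-directory glob pattern is an assumption, since the original globs a fixed path:
import os
import glob
import multiprocessing

if __name__ == '__main__':   # keeps pool workers from re-running this loop when they import the script
    for directory in directory_list:
        filenames = glob.glob(os.path.join(directory, '*.txt'))   # assumed: one glob per directory
        if not filenames:
            continue
        file_list = [(f, f.replace('.txt', '.csv')) for f in filenames]
        p = multiprocessing.Pool(min(10, len(filenames)))
        p.map(convert_txt, file_list)
        p.close()
        p.join()   # wait for the workers instead of sleeping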

Related

`concurrent.futures.ProcessPoolExecutor` in Python is run from the beginning of the file instead of just the defined function

I'm having trouble with concurrent.futures. For some short background, I was trying to do massive image manipulation with python-opencv2. I stumbled upon a performance issue, which is a pain considering it can take hours to process only hundreds of images. I found a solution in using concurrent.futures to utilize multiple CPU cores to make the process faster (because I noticed that while it took a really long time to process, it was only using about 16% of my 6-core processor, which is roughly a single core). So I created the code, but then I noticed that the multiprocessing actually starts from the beginning of the code instead of being isolated around the function I just created. Here's a minimal working reproduction of the error:
import glob
import concurrent.futures
import cv2
import os

def convert_this(filename):
    ### Read in the image data
    img = cv2.imread(filename)
    ### Convert the image to grayscale
    res = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    res.save("output/"+filename)

try:
    # create output dir
    os.mkdir("output")
    with concurrent.futures.ProcessPoolExecutor() as executor:
        files = glob.glob("../project/temp/")
        executor.map(convert_this, files)
except Exception as e:
    print("Encountered Error!")
    print(e)
    filelist = glob.glob("output")
    for f in filelist:
        os.remove(f)
    os.rmdir("output")
It gave me an error:
Encountered Error!
Encountered Error!
[WinError 183] Cannot create a file when that file already exists: 'output'
Traceback (most recent call last):
File "M:\pythonproject\testfolder\test.py", line 17, in <module>
os.mkdir("output")
[WinError 183] Cannot create a file when that file already exists: 'output'
Encountered Error!
[WinError 183] Cannot create a file when that file already exists: 'output'
Traceback (most recent call last):
File "M:\pythonproject\testfolder\test.py", line 17, in <module>
os.mkdir("output")
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'output'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\<username>\Anaconda3\envs\py37\lib\multiprocessing\spawn.py", line 105, in spawn_main
Encountered Error!
[WinError 183] Cannot create a file when that file already exists: 'output'
Traceback (most recent call last):
File "M:\pythonproject\testfolder\test.py", line 17, in <module>
os.mkdir("output")
FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'output'
...
(it was repeating errors of the same "can't create file")
As you can see, os.mkdir was run even though it's outside the convert_this function I just defined. I'm not that new to Python, but I'm definitely new to multiprocessing and threading. Is this just how concurrent.futures behaves? Or am I missing some documentation reading?
Thanks.
Yes, multiprocessing must load the file in the new processes before it can run the function (just as it does when you run the file yourself), so it runs all the code you have written. So either (1) move your multiprocessing code to a separate file with nothing extra in it and call that, or (2) enclose your top-level code in a function (e.g., main()), and at the bottom of your file write
if __name__ == "__main__":
    main()
This code will only be run when you start the script yourself, not by the multiprocess-spawned workers. See the Python docs for details on this construction.
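Applied to the code in the question, that restructuring could look roughly like this (a sketch, not a drop-in fix: res.save does not exist on the NumPy array returned by cv2.cvtColor, so cv2.imwrite is used instead, and the glob pattern is given a wildcard so it matches files rather than the directory itself):
import glob
import concurrent.futures
import cv2
import os

def convert_this(filename):
    # same per-file work as in the question
    img = cv2.imread(filename)
    res = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    cv2.imwrite(os.path.join("output", os.path.basename(filename)), res)

def main():
    os.makedirs("output", exist_ok=True)     # only the parent process runs this now
    files = glob.glob("../project/temp/*")   # '*' added so files are matched, not the directory
    with concurrent.futures.ProcessPoolExecutor() as executor:
        executor.map(convert_this, files)

if __name__ == "__main__":
    main()   # worker processes import this module but never call main()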

Why doesn't fileinput throw an error when there's a bad path?

import fileinput

def main():
    try:
        lines = fileinput.input()
        res = process_lines(lines)
        # ...more code
    except Exception:
        print('is your file path bad?')

if __name__ == '__main__':
    main()
When I run this code with a bad path, it doesn't throw an error, yet the docs say that an OS error will be thrown if there's an IO error. How do I test for bad paths then?
fileinput.input() returns an iterator, not an ad-hoc list:
In [1]: fileinput.input()
Out[1]: <fileinput.FileInput at 0x7fa9bea55a50>
Proper usage of this function is done via a for loop:
with fileinput.input() as files:
    for line in files:
        process_line(line)
or using a conversion to list:
lines = list(fileinput.input())
I.e. the files are opened only when you actually iterate over this object.
Although I wouldn't recommend the second way, as it is counter to the philosophy of how such scripts are supposed to work.
You are supposed to parse as little as you need to output data, and then output it as soon as possible. This avoids issues with large inputs, and if your script is used within a larger pipeline, speeds up the processing significantly.
With regards to checking whether the path is correct or not:
as soon as you iterate down to the file that doesn't exist, the iterator will throw an exception:
# script.py
import fileinput

with fileinput.input() as files:
    for line in files:
        print(repr(line))
$ echo abc > /tmp/this_exists
$ echo xyz > /tmp/this_also_exists
$ python script.py /tmp/this_exists /this/does/not /tmp/this_also_exists
'abc\n'
Traceback (most recent call last):
  File "/tmp/script.py", line 6, in <module>
    for line in files:
  File "/home/mrmino/.pyenv/versions/3.7.7/lib/python3.7/fileinput.py", line 252, in __next__
    line = self._readline()
  File "/home/mrmino/.pyenv/versions/3.7.7/lib/python3.7/fileinput.py", line 364, in _readline
    self._file = open(self._filename, self._mode)
FileNotFoundError: [Errno 2] No such file or directory: '/this/does/not'
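So, for the original question about testing for bad paths: put the try/except around the iteration itself rather than around the fileinput.input() call. A minimal sketch, with process_line standing in for whatever per-line work you do:
import sys
import fileinput

try:
    with fileinput.input() as files:
        for line in files:
            process_line(line)   # hypothetical per-line handler
except OSError as e:             # FileNotFoundError is a subclass of OSError
    print('is your file path bad?', e, file=sys.stderr)
    sys.exit(1)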

Python multiprocessing race condition

I discovered a strange error when using concurrent.futures to read from multiple text files.
Here is a small reproducible example:
import os
import concurrent.futures

def read_file(file):
    with open(os.path.join(data_dir, file), buffering=1000) as f:
        for row in f:
            try:
                print(row)
            except Exception as e:
                print(str(e))

if __name__ == '__main__':
    data_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data'))
    files = ['file1', 'file2']
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for file, _ in zip(files, executor.map(read_file, files)):
            pass
file1 and file2 are arbitrary text files in the data directory.
I am getting the following error (basically a process tries to read the data_dir variable before it is assigned):
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 153, in _process_chunk
    return [fn(*args) for args in chunk]
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 153, in <listcomp>
    return [fn(*args) for args in chunk]
  File "C:\Users\my_username\Downloads\example.py", line 5, in read_file
    with open(os.path.join(data_dir, file),buffering=1000) as f:
NameError: name 'data_dir' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "example.py", line 16, in <module>
    for file,_ in zip(files,executor.map(read_file,files)):
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 556, in result_iterator
    yield future.result()
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 405, in result
    return self.__get_result()
  File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 357, in __get_result
    raise self._exception
NameError: name 'data_dir' is not defined
If I place data_dir assignment before if __name__ == '__main__': block, I don't get this error and the code executes as expected.
What is causing this error? Clearly, data_dir is assigned before any asynchronous calls should be made in both cases.
ProcessPoolExecutor spawns a new Python process, imports the right module and calls the function you provide.
As data_dir will only be defined when you run the module, not when you import it, the error is to be expected.
Passing data_dir as an argument to read_file might work instead (processes do inherit some resources, such as open file descriptors, from their parents, but you'd need to check what applies here).
If you were to use a ThreadPoolExecutor, however, your example should work as-is, since the spawned threads share the parent's memory.
fork() is not available on Windows, so Python uses spawn to start new processes. That starts a fresh Python interpreter, so no memory is shared, but Python tries to recreate the worker function's environment in the new process by re-importing the module; that's why a module-level variable (assigned outside the if __name__ == '__main__': block) works. See the docs for more detail.
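One way to make the original example work under spawn (besides moving the data_dir assignment above the guard) is to pass data_dir to the workers explicitly instead of relying on a module-level global. A minimal sketch, sending each worker a (directory, filename) pair:
import os
import concurrent.futures

def read_file(args):
    data_dir, file = args   # the directory now travels with each task
    with open(os.path.join(data_dir, file), buffering=1000) as f:
        for row in f:
            print(row)

if __name__ == '__main__':
    data_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data'))
    files = ['file1', 'file2']
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # consume the iterator so any worker exception is re-raised here
        list(executor.map(read_file, [(data_dir, f) for f in files]))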

Python IOError: [Errno 24] Too many open files

I plan to pass a user code script to python logger, run it using python bdb, then log the output in a file.
Here is my code in python logger:
try:
    logger._runscript(script_str)
except bdb.BdbQuit:
    pass
finally:
    logger.finalize(filename)
where logger.finalize is defined below as the function finalizer(output, filename).
bdb will spawn a new thread and call the following finalizer function after execution:
def finalizer(output, filename):
    outfile = open(filename, 'a')
    outfile.write(json.dumps(output, indent=4))
    outfile.close()
Here output is the execution result and we write it to the file given by filename.
I tested the three lines in the finalizer function on their own, and they ran okay.
However, when they are called from the python logger, I always get the following error message:
IOError: [Errno 24] Too many open files: filename
I only open one file, append a string to its end, then close it. Why are there "too many open files"? Can anyone kindly point me to the problem?
Here is the TraceBack info:
Traceback (most recent call last):
  File "./exec.py", line 95, in <module>
  File "./exec.py", line 82, in main
  File "./exec.py", line 45, in run
  File "path to project/logger.py", line 1321, in exec_script_str
  File "path to project/logger.py", line 1292, in finalize
  File "./exec.py", line 24, in finalizer
IOError: [Errno 24] Too many open files: 'test01.py'
This doesn't look like the logger from the standard library. Rather, it looks like you are trying to use a third-party library that provides a sandbox environment. One of its restrictions is that it forbids file access (see line 1233).
If you are creating the logger object yourself, you can disable these security checks by constructing it with the appropriate flag, e.g.:
def exec_str_with_user_ns(script_str, user_ns, finalizer_func):
    logger = PGLogger(False, False, False, finalizer_func, disable_security_checks=True)
    try:
        logger._runscript(script_str, user_ns)
    except bdb.BdbQuit:
        pass
    finally:
        return logger.finalize()

python errno 24 on cgi script using subprocess

I have a python cgi script that runs an application via subprocess over and over again (several thousand times). I keep getting the same error...
Traceback (most recent call last):
  File "/home/linuser/Webpages/cgi/SnpEdit.py", line 413, in <module>
    webpage()
  File "/home/linuser/Webpages/cgi/SnpEdit.py", line 406, in main
    displayOmpResult(form['odfFile'].value)
  File "/home/linuser/Webpages/cgi/SnpEdit.py", line 342, in displayContainerDiv
    makeSection(position,sAoiInput)
  File "/home/linuser/Webpages/cgi/SnpEdit.py", line 360, in displayData
    displayTable(i,j,lAmpAndVars,dOligoSet[key],position)
  File "/home/linuser/Webpages/cgi/SnpEdit.py", line 247, in displayTable
    p = subprocess.Popen(['/usr/bin/pDat',sInputFileLoc,sOutputFileLoc],stdout=fh, stderr=fh)
  File "/usr/lib/python2.6/subprocess.py", line 633, in __init__
    errread, errwrite)
  File "/usr/lib/python2.6/subprocess.py", line 1039, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
The function causing it is below.
def displayTable(sData):
    # convert the data to the proper format
    sFormattedData = convertToFormat(sData)
    # write the formatted data to file
    sInputFile = tempfile.mkstemp(prefix='In_')[1]
    fOpen = open(sInputFile,'w')
    fOpen.write(sFormattedData)
    fOpen.close()
    sOutputFileLoc = sInputFile.replace('In_','Out_')
    # run app, requires two files; an input and an output
    # temp file to hold stdout/stderr of the subprocess
    fh = tempfile.TemporaryFile(mode='w',dir=tempfile.gettempdir())
    p = subprocess.Popen(['/usr/bin/pDat',sInputFileLoc,sOutputFileLoc],stdout=fh, stderr=fh)
    p.communicate()
    fh.close()
    # open output file and parse the data into a list of dictionaries
    sOutput = open(sOutputFileLoc).read()
    lOutputData = parseOutput(sOutput)
    displayTableHeader(lOutputData)
    displaySimpleTable(lOutputData)
As far as I can tell, I'm closing the files properly. When I run...
import resource
print resource.getrlimit(resource.RLIMIT_NOFILE)
I get...
(1024, 1024)
Do I have to increase this value? I read that subprocess opens several file descriptors. I tried adding "close_fds = True" and I tried using the with statement when creating my file, but the result was the same. I suspect the problem may be with the application that I'm running via subprocess, pDat, but this program was made by someone else. It requires two inputs: an input file and the location where you want the output file written. I suspect it may not be closing the output file that it creates. Aside from this, I can't see what I might be doing wrong. Any suggestions? Thanks.
EDIT:
I'm on Ubuntu 10.04 running Python 2.6.5 and Apache 2.2.14.
Instead of this...
sInputFile = tempfile.mkstemp(prefix='In_')[1]
fOpen = open(sInputFile,'w')
fOpen.write(sFormattedData)
fOpen.close()
I should have done this...
iFileHandle,sInputFile = tempfile.mkstemp(prefix='In_')
fOpen = open(sInputFile,'w')
fOpen.write(sFormattedData)
fOpen.close()
os.close(iFileHandle)
The mkstemp function creates an OS-level handle to the file, and I wasn't closing it. The solution is described in more detail here...
http://www.logilab.org/blogentry/17873
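Equivalently, the OS-level handle returned by mkstemp can be wrapped with os.fdopen so that only one descriptor is ever created; a sketch, with sFormattedData as in the question:
import os
import tempfile

iFileHandle, sInputFile = tempfile.mkstemp(prefix='In_')
with os.fdopen(iFileHandle, 'w') as fOpen:   # wraps the OS-level handle and closes it on exit
    fOpen.write(sFormattedData)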
You want to add close_fds=True to the Popen call (just in case).
Then, here:
# open output file and print parsed data into a list of dictionaries
sOutput = open(sOutputFileLoc).read()
lOutputData = parseOutput(sOutput)
...I might remember wrong, but unless you use the with syntax, I do not think that the output file descriptor has been closed.
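For example, a sketch of the same read with the descriptor reliably released:
with open(sOutputFileLoc) as fOutput:   # closed even if the parsing below raises
    sOutput = fOutput.read()
lOutputData = parseOutput(sOutput)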
UPDATE: the main problem is that you need to know which files are open. On Windows this would require something like Process Explorer. On Linux it's a bit simpler: you just have to invoke the CGI from the command line, or make sure there is only one instance of the CGI running, and fetch its pid with the ps command.
Once you have the pid, run ls -la on the contents of the /proc/<PID>/fd directory. All open file descriptors will be listed there, along with the names of the files they point to. Knowing that file so-and-so is opened 377 times goes a long way towards finding out where exactly that file is opened (but not closed).
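If a shell is inconvenient (for example from inside the CGI itself), the same information can be read from Python; a sketch that lists a process's open descriptors on Linux:
import os
import sys

fd_dir = '/proc/%d/fd' % os.getpid()          # or substitute the pid found with ps
for fd in sorted(os.listdir(fd_dir), key=int):
    try:
        target = os.readlink(os.path.join(fd_dir, fd))
    except OSError:
        continue                              # descriptor closed between listdir and readlink
    sys.stdout.write('%s -> %s\n' % (fd, target))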
