I discovered a strange error when using concurrent.futures to read from multiple text files.
Here is a small reproducible example:
import os
import concurrent.futures

def read_file(file):
    with open(os.path.join(data_dir, file), buffering=1000) as f:
        for row in f:
            try:
                print(row)
            except Exception as e:
                print(str(e))

if __name__ == '__main__':
    data_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data'))
    files = ['file1', 'file2']

    with concurrent.futures.ProcessPoolExecutor() as executor:
        for file, _ in zip(files, executor.map(read_file, files)):
            pass
file1 and file2 are arbitrary text files in the data directory.
I am getting the following error (essentially, a worker process tries to read the data_dir variable before it is assigned):
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 175, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 153, in _process_chunk
return [fn(*args) for args in chunk]
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\process.py", line 153, in <listcomp>
return [fn(*args) for args in chunk]
File "C:\Users\my_username\Downloads\example.py", line 5, in read_file
with open(os.path.join(data_dir, file),buffering=1000) as f:
NameError: name 'data_dir' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "example.py", line 16, in <module>
for file,_ in zip(files,executor.map(read_file,files)):
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 556, in result_iterator
yield future.result()
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 405, in result
return self.__get_result()
File "C:\Users\my_username\AppData\Local\Continuum\Anaconda3\lib\concurrent\futures\_base.py", line 357, in __get_result
raise self._exception
NameError: name 'data_dir' is not defined
If I place the data_dir assignment before the if __name__ == '__main__': block, I don't get this error and the code executes as expected.
What is causing this error? In both cases, data_dir is clearly assigned before any asynchronous calls are made.
ProcessPoolExecutor spawns a new Python process, imports the right module and calls the function you provide.
As data_dir will only be defined when you run the module, not when you import it, the error is to be expected.
Passing data_dir as an argument to read_file might also work, since call arguments are pickled and sent to each worker process. You'd need to check, though.
If you were to use a ThreadPoolExecutor, however, your example should work as written, as the spawned threads share memory with the parent.
fork() is not available on Windows, so Python uses spawn to start new processes. Spawn starts a fresh Python interpreter, so no memory is shared, but Python re-imports the module in the new process to recreate the worker function's environment; that's why a module-level variable works. See the multiprocessing docs for more detail.
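To make that concrete, here is a minimal sketch of both fixes: define data_dir at module level so each spawned worker recreates it when it re-imports the module, or pass it explicitly so it travels to the worker as a pickled argument. The name read_file_from is just for this illustration:

import os
import concurrent.futures

# Option 1: module-level assignment, recreated when the worker re-imports this module
data_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), 'data'))

# Option 2: pass the directory explicitly; arguments are pickled and sent to the worker
def read_file_from(data_dir, file):
    with open(os.path.join(data_dir, file)) as f:
        for row in f:
            print(row)

if __name__ == '__main__':
    files = ['file1', 'file2']
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # map over the two iterables in parallel, pairing data_dir with each file name
        list(executor.map(read_file_from, [data_dir] * len(files), files))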
import fileinput

def main():
    try:
        lines = fileinput.input()
        res = process_lines(lines)
        # ... more code
    except Exception:
        print('is your file path bad?')

if __name__ == '__main__':
    main()
When I run this code with a bad path, it doesn't throw an error, yet the docs say that an OSError will be raised if there's an I/O error. How do I test for bad paths then?
fileinput.input() returns an iterator, not an ad-hoc list:
In [1]: fileinput.input()
Out[1]: <fileinput.FileInput at 0x7fa9bea55a50>
Proper usage of this function is done via a for loop:
with fileinput.input() as files:
    for line in files:
        process_line(line)
or using a conversion to list:
lines = list(fileinput.input())
I.e. the files are opened only when you actually iterate over this object.
Although I wouldn't recommend the second way, as it is counter to the philosophy of how such scripts are supposed to work.
You are supposed to parse as little as you need to output data, and then output it as soon as possible. This avoids issues with large inputs, and if your script is used within a larger pipeline, speeds up the processing significantly.
With regards to checking whether the path is correct or not:
As soon as you iterate down to a file that doesn't exist, the iterator will raise an exception:
# script.py
import fileinput

with fileinput.input() as files:
    for line in files:
        print(repr(line))
$ echo abc > /tmp/this_exists
$ echo xyz > /tmp/this_also_exists
$ python script.py /tmp/this_exists /this/does/not /tmp/this_also_exists
'abc\n'
Traceback (most recent call last):
File "/tmp/script.py", line 6, in <module>
for line in files:
File "/home/mrmino/.pyenv/versions/3.7.7/lib/python3.7/fileinput.py", line 252, in __next__
line = self._readline()
File "/home/mrmino/.pyenv/versions/3.7.7/lib/python3.7/fileinput.py", line 364, in _readline
self._file = open(self._filename, self._mode)
FileNotFoundError: [Errno 2] No such file or directory: '/this/does/not'
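So to detect bad paths yourself, wrap the iteration (not the fileinput.input() call) in the try/except. A minimal sketch of that, reusing the process_line placeholder from above:

import sys
import fileinput

def main():
    try:
        with fileinput.input() as files:
            for line in files:
                process_line(line)   # your per-line processing goes here
    except OSError as exc:           # FileNotFoundError is a subclass of OSError
        print('is your file path bad?', exc, file=sys.stderr)
        sys.exit(1)

if __name__ == '__main__':
    main()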
I'm trying to run these lines of code in Atom with Python 3.6:
from pycall import CallFile, Call, Application
import sys

def call():
    c = Call('SIP/200')
    a = Application('Playback', 'hello-world')
    cf = CallFile(c, a)
    cf.spool()

if __name__ == '__main__':
    call()
But I receive this error:
Traceback (most recent call last):
File "/home/pd/gits/voiphone/main.py", line 12, in <module>
call()
File "/home/pd/gits/voiphone/main.py", line 9, in call
cf.spool()
File "/home/pd/telephonerelayEnv/lib/python3.6/site-packages/pycall/callfile.py", line 135, in spool
self.writefile()
File "/home/pd/telephonerelayEnv/lib/python3.6/site-packages/pycall/callfile.py", line 123, in writefile
f.write(self.contents)
File "/home/pd/telephonerelayEnv/lib/python3.6/site-packages/pycall/callfile.py", line 118, in contents
return '\n'.join(self.buildfile())
File "/home/pd/telephonerelayEnv/lib/python3.6/site-packages/pycall/callfile.py", line 100, in buildfile
raise ValidationError
pycall.errors.ValidationError
I would appreciate any help in solving this problem.
Thank you in advance.
Looking at the source code for the validity check, it appears that the only check that could be catching you out is the one that verifies the spool directory. By default this is set to /var/spool/asterisk/outgoing, but it can be changed when you create the call file:
cf = CallFile(c, a, spool_dir='/my/asterisk/spool/outgoing')
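Dropped into your function, the change would look something like this; the spool path shown is only a placeholder for whatever directory your Asterisk installation actually watches:

from pycall import CallFile, Call, Application

def call():
    c = Call('SIP/200')
    a = Application('Playback', 'hello-world')
    # Point pycall at the spool directory Asterisk is actually using
    cf = CallFile(c, a, spool_dir='/my/asterisk/spool/outgoing')
    cf.spool()

if __name__ == '__main__':
    call()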
I'm trying to evaluate a number of processes in a multiprocessing pool but keep running into errors and I can't work out why... There's a simplified version of the code below:
from multiprocessing import Pool

class Object_1():
    def add_godd_spd_column(self):
        # calculate_correlations is defined inside the method (a local function)
        def calculate_correlations(arg1, arg2, arg3):
            return {'a': 1}

        processes = {}
        pool = Pool(processes=6)
        for i in range(1, 10):
            processes[i] = pool.apply_async(calculate_correlations,
                                            args=(arg1, arg2, arg3,))
        correlations = {}
        for i in range(0, 10):
            correlations[i] = processes[i].get()
This returns the following error:
Traceback (most recent call last):
File "./02_results.py", line 116, in <module>
correlations[0] = processes[0].get()
File "/opt/anaconda3/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
File "/opt/anaconda3/lib/python3.5/multiprocessing/pool.py", line 385, in
_handle_tasks
put(task)
File "/opt/anaconda3/lib/python3.5/multiprocessing/connection.py", line 206, in send
self._send_bytes(ForkingPickler.dumps(obj))
File "/opt/anaconda3/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'SCADA.add_good_spd_column.<locals>.calculate_correlations
When I call the following:
correlations[0].successful()
I get the following error:
Traceback (most recent call last):
File "./02_results.py", line 116, in <module>
print(processes[0].successful())
File "/opt/anaconda3/lib/python3.5/multiprocessing/pool.py", line 595, in
successful
assert self.ready()
AssertionError
Is this because the process isn't actually finished before the .get() is called? The function being evaluated just returns a dictionary which should definitely be pickle-able...
Cheers,
The error is occurring because pickling a function nested in another function is not supported, and multiprocessing.Pool needs to pickle the function you pass as an argument to apply_async in order to execute it in a worker process. You have to move the function to the top level of the module, or make it an instance method of the class. Keep in mind that if you make it an instance method, the instance of the class itself must also be picklable.
And yes, the assertion error when calling successful() occurs because you're calling it before a result is ready. From the docs:
successful()
Return whether the call completed without raising an exception. Will raise AssertionError if the result is not ready.
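A minimal sketch of the first option, moving the worker function to module level so the pool can pickle a reference to it (the argument values here are placeholders):

from multiprocessing import Pool

# Defined at module level, so worker processes can look it up by name
def calculate_correlations(arg1, arg2, arg3):
    return {'a': 1}

if __name__ == '__main__':
    pool = Pool(processes=6)
    processes = {}
    for i in range(1, 10):
        processes[i] = pool.apply_async(calculate_correlations, args=(1, 2, 3))
    # Use the same range for submission and collection to avoid missing keys
    correlations = {i: processes[i].get() for i in range(1, 10)}
    pool.close()
    pool.join()
    print(correlations)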
The objective of the code is to read a sqlite3 file, process the text, and write the result to another file (gzip format). I am trying to use multiprocessing with a pool, but
sometimes it generates errors and stops with the message "Cannot create a file when that file already exists". After the failure, if I repeat the same code it works fine in most cases, which means this happens only occasionally.
I guess this is related to a race between processes in the pool, but I cannot find a way to solve the problem. Normally it works fine, but sometimes it causes this problem.
I also tried to terminate all the processes at each directory level and start new processes for the next directory.
P.S. Environment: Windows Server 64-bit, Python 2.7 64-bit
import sqlite3 as lite
import gzip
import glob
import time
import multiprocessing

def convert_txt((infile, outfile)):
    try:
        conn = lite.connect(infile)
        conn.text_factory = str
    except:
        print 'Sql Lite error:', infile
        return
    try:
        fout = gzip.open(outfile, 'wb')
    except:
        print 'File write error:', outfile
        return
    for line in conn.iterdump():
        fout.write(line.replace('abc', 'def'))
    fout.close()

for directory in directory_list:
    filenames = glob.glob('B:\\Hebercity UT\\*.txt')
    p = multiprocessing.Pool(min(10, len(filenames)))
    file_list = []
    for input_file in filenames:
        output_file = input_file.replace('.txt', '.csv')
        file_list.append([input_file, output_file])
    p.map(convert_txt, file_list)
    time.sleep(1)
    p.close()  # close the pool and start the pool in the next directory
Traceback (most recent call last):
File "B:\gws_txt_converter_multi.py", line 100, in <module>
p.map(convert_txt_msg, fflist)
File "C:\opt\Anaconda\lib\multiprocessing\pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\opt\Anaconda\lib\multiprocessing\pool.py", line 558, in get
raise self._value
WindowsError: [Error 183] Cannot create a file when that file already exists: 'B:\\Hebercity UT'
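If the failures really are caused by two workers touching the same output path at the same time, one way to rule that out is to have each worker write to a unique temporary name and rename it into place only at the end. This is just a sketch of that idea for the Python 2.7 environment described above, not something taken from the code itself:

import os
import gzip
import tempfile

def write_gzip_safely(outfile, lines):
    # Write to a unique temp file in the target directory, then rename into place.
    out_dir = os.path.dirname(outfile) or '.'
    fd, tmp_path = tempfile.mkstemp(suffix='.gz.partial', dir=out_dir)
    os.close(fd)
    fout = gzip.open(tmp_path, 'wb')
    try:
        for line in lines:
            fout.write(line.replace('abc', 'def'))
    finally:
        fout.close()
    if os.path.exists(outfile):
        os.remove(outfile)      # os.rename will not overwrite an existing file on Windows
    os.rename(tmp_path, outfile)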
I have the following working code (Python 3.5) which uses concurrent futures to parse files in a threaded manner, and then does some post-processing on the results when they come back (in any order).
from concurrent import futures

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # A dictionary with the future as the key and the corresponding filename as the value
    jobs = {}

    # Loop through the files and run the parse function for each one,
    # passing the file name along with the kwargs in parser_variables.
    # The results of the functions can come back in any order.
    for this_file in files_list:
        job = executor.submit(parse_log_file.parse, this_file, **parser_variables)
        jobs[job] = this_file

    # Get the completed jobs whenever they are done
    for job in futures.as_completed(jobs):
        debug.checkpointer("Multi-threaded Parsing File finishing")
        # Get the result of the job (job.result()) and the file it was based on (jobs[job])
        result_content = job.result()
        this_file = jobs[job]
I want to convert this to use processes instead of threads because threads don't offer any speedup. In theory I just need to change ThreadPoolExecutor into ProcessPoolExecutor.
The problem is, if I do that I get this exception:
Process Process-2:
Traceback (most recent call last):
File "C:\Python35\lib\multiprocessing\process.py", line 254, in _bootstrap
self.run()
File "C:\Python35\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\Python35\lib\concurrent\futures\process.py", line 169, in _process_worker
call_item = call_queue.get(block=True)
File "C:\Python35\lib\multiprocessing\queues.py", line 113, in get
return ForkingPickler.loads(res)
TypeError: Required argument 'fileno' (pos 1) not found
Traceback (most recent call last):
File "c:/myscript/main.py", line 89, in <module>
main()
File "c:/myscript/main.py", line 59, in main
system_counters = process_system(system, filename)
File "c:\myscript\per_system.py", line 208, in process_system
system_counters = process_filelist(**file_handling_variables)
File "c:\myscript\per_logfile.py", line 31, in process_filelist
results_list = job.result()
File "C:\Python35\lib\concurrent\futures\_base.py", line 398, in result
return self.__get_result()
File "C:\Python35\lib\concurrent\futures\_base.py", line 357, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I think that this might have something to do with pickling, but googling for the error hasn't found anything.
How do I convert the above to use multiple processes?
It turns out this is because one of the things I'm passing inside parser_variables is a class (a reader from a third-party module). If I remove the class, the above works fine.
For whatever reason, pickle doesn't seem to be able to handle this particular object.
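If the worker still needs that kind of object, a common workaround is to pass only picklable data (a path or constructor arguments) and build the unpicklable object inside the worker process. A minimal, self-contained sketch of that pattern, using an open file handle as a stand-in for the third-party reader and placeholder file names:

import concurrent.futures

def parse_one(path):
    # Open the file (an unpicklable object) inside the worker process
    # instead of passing an open handle or reader object from the parent.
    with open(path) as handle:
        return path, sum(1 for _ in handle)   # toy "parse": count lines

if __name__ == '__main__':
    files_list = ['file1.log', 'file2.log']   # placeholder file names
    with concurrent.futures.ProcessPoolExecutor() as executor:
        jobs = {executor.submit(parse_one, f): f for f in files_list}
        for job in concurrent.futures.as_completed(jobs):
            this_file = jobs[job]
            path, line_count = job.result()
            print(this_file, line_count)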