ThreadPoolExecutor behaving unexpectedly - python

I've got a directory of files such as:
input_0.data
input_1.data
and so forth. I want to parse these files with a function that has been shown to output 47 lines for input_0.data when run by itself. However, when I bring a ThreadPoolExecutor into the mix and actually run more than one thread, the output from input_0.data becomes huge, quickly exceeding the known good 47 lines.
The code I'm trying to use is as follows, with needless details cut fairly obviously:
from sys import argv
from os import path
from glob import glob
from concurrent.futures import ThreadPoolExecutor

def find_moves(param_list):
    input_filename = param_list[0]
    output_filename = param_list[1]
    print(input_filename+" "+output_filename, flush=True)
    input_file = open(input_filename, "r")
    output_file = open(output_filename, "w")
    for line in input_file:
        if do_log_line(line):
            log = format_log(line)
            print(log, file=output_file, flush=True)
    input_file.close()
    output_file.close()
if len(argv) != 3:
    print("Usage:\n\tmoves.py [input_dir] [output_dir]")
    quit()

input_files = list()
for file in glob(path.join(argv[1], "input_*.data")):
    input_files.append(file)
input_files = sorted(input_files)

with ThreadPoolExecutor(max_workers=8) as executor:
    for file_number, input_filename in enumerate(input_files):
        output_filename = "moves_"+str(file_number)+".csv"
        output_filename = path.join(argv[2], output_filename)
        executor.submit(find_moves, (input_filename, output_filename))
It's obvious I'm using this tool incorrectly, but it's not obvious to me where my mistake is. I'd appreciate some guidance in the matter.
It seems like the threads are writing to each other's files, even though they explicitly state they're working on the right file.

Related

Undesired deletion of temporary files

I am trying to create some temporary files and perform some operations on them inside a loop. Then I will access the information in all of the temporary files and do some operations with that information. For simplicity, I have put together the following code that reproduces my issue:
import tempfile

tmp_files = []
for i in range(40):
    tmp = tempfile.NamedTemporaryFile(suffix=".txt")
    with open(tmp.name, "w") as f:
        f.write(str(i))
    tmp_files.append(tmp.name)

string = ""
for tmp_file in tmp_files:
    with open(tmp_file, "r") as f:
        data = f.read()
        string += data
print(string)
ERROR:
    with open(tmp_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpynh0kbnw.txt'
When I look in the /tmp directory (after adding some time.sleep(2) in the loop) I see that the files are deleted and only one is preserved, hence the error.
Of course I could keep all the files with the flag tempfile.NamedTemporaryFile(suffix=".txt", delete=False), but that is not the idea. I would like to keep the temporary files just for the running time of the script. I could also delete the files with os.remove. But my question is more about why this happens, because I expected the files to be kept until the end of the run, since I don't close the files during execution (or do I?).
Many thanks in advance.
tdelaney has already answered your actual question.
I would just like to offer you an alternative to NamedTemporaryFile: why not create a temporary folder which is removed (with all the files in it) at the end of the script?
Instead of using a NamedTemporaryFile, you could use tempfile.TemporaryDirectory. The directory will be deleted when closed.
The example below uses the with statement which closes the file handle automatically when the block ends (see John Gordon's comment).
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_folder:
    tmp_files = []
    for i in range(40):
        tmp_file = os.path.join(temp_folder, f"{i}.txt")
        with open(tmp_file, "w") as f:
            f.write(str(i))
        tmp_files.append(tmp_file)

    string = ""
    for tmp_file in tmp_files:
        with open(tmp_file, "r") as f:
            data = f.read()
            string += data
    print(string)
By default, a NamedTemporaryFile deletes its file when closed. It's a bit subtle, but tmp = tempfile.NamedTemporaryFile(suffix=".txt") in the loop causes the previous file to be deleted when tmp is reassigned (the old object loses its last reference, gets finalized, and its file is removed). One option is to use the delete=False parameter. Or, just keep the file open and seek to the beginning after the write.
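If you go the delete=False route, a minimal sketch might look like this (files created with delete=False survive being closed, so removing them at the end is your responsibility):

import os
import tempfile

tmp_files = []
try:
    for i in range(40):
        # delete=False: the file is NOT removed when the handle is closed
        with tempfile.NamedTemporaryFile(suffix=".txt", mode="w", delete=False) as tmp:
            tmp.write(str(i))
        tmp_files.append(tmp.name)

    string = ""
    for tmp_file in tmp_files:
        with open(tmp_file, "r") as f:
            string += f.read()
    print(string)
finally:
    # delete=False means cleanup is now up to us
    for tmp_file in tmp_files:
        os.remove(tmp_file)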
For the second option: a NamedTemporaryFile is already a file object, so you can write to it directly without reopening it. Just make sure the mode is "write plus" and text, not binary. Put the code in a try/finally block to make sure the files are really deleted at the end.
import tempfile

tmp_files = []
try:
    for i in range(40):
        tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w+")
        tmp.write(str(i))
        tmp.seek(0)
        tmp_files.append(tmp)

    string = ""
    for tmp_file in tmp_files:
        data = tmp_file.read()
        string += data
finally:
    for tmp_file in tmp_files:
        tmp_file.close()

print(string)

Write to csv with Python Multiprocessing apply_async causes missing data

I have a csv file from which I read URLs line by line to make a request to each endpoint. Each response is parsed and the data is written to output.csv. This process is parallelized.
The issue is with the written data: some portions are partially missing, or missing entirely (blank lines). I suppose this is happening because of collisions or conflicts between the async processes. Can you please advise how to fix that?
import csv
import re
import requests
from multiprocessing import Pool

def parse_data(url, line_num):
    print line_num, url
    r = requests.get(url)
    htmltext = r.text.encode("utf-8")
    pois = re.findall(re.compile('<pois>(.+?)</pois>'), htmltext)
    for poi in pois:
        write_data(poi)

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        writer.writerow([poi])
    resfile.close()

def main():
    pool = Pool(processes=4)
    with open("input.csv", "rb") as f:
        reader = csv.reader(f)
        for line_num, line in enumerate(reader):
            url = line[0]
            pool.apply_async(parse_data, args=(url, line_num))
    pool.close()
    pool.join()
Try to add file locking:
import fcntl

def write_data(poi):
    with open('output.csv', 'ab') as resfile:
        writer = csv.writer(resfile)
        fcntl.flock(resfile, fcntl.LOCK_EX)
        writer.writerow([poi])
        fcntl.flock(resfile, fcntl.LOCK_UN)
    # Note that you don't have to close the file. The 'with' will take care of it
Concurrent writes to the same file are indeed a known cause of data loss / file corruption. The safe solution here is the "map / reduce" pattern: each process writes to its own result file (map), then you concatenate those files together (reduce).
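A minimal sketch of that pattern (Python 3 here, unlike the snippets above; fetch_pois is a hypothetical stand-in for the requests/regex parsing in the question): each worker writes only to its own part file named after the input line number, and the parent process concatenates the parts once the pool has finished.

import csv
import os
from multiprocessing import Pool

def fetch_pois(url):
    # stand-in for the requests/re parsing in the question
    return ["poi from " + url]

def parse_to_part_file(args):
    url, line_num = args
    part_name = "output_part_{}.csv".format(line_num)
    # map step: each process writes only to its own part file, so there is no contention
    with open(part_name, "w", newline="") as part:
        writer = csv.writer(part)
        for poi in fetch_pois(url):
            writer.writerow([poi])
    return part_name

if __name__ == "__main__":
    with open("input.csv", newline="") as f:
        jobs = [(line[0], line_num) for line_num, line in enumerate(csv.reader(f))]

    with Pool(processes=4) as pool:
        part_names = pool.map(parse_to_part_file, jobs)

    # reduce step: concatenate the per-process files into the final output
    with open("output.csv", "w") as out:
        for part_name in part_names:
            with open(part_name) as part:
                out.write(part.read())
            os.remove(part_name)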

race-condition: reading/writing file (windows)

I have the following situation:
- Different users (all on Windows OS) run a Python script that can either read or write to a pickle file located in a shared folder.
- The "system" is designed in such a way that only one user at a time will be writing to the file (therefore no race condition of multiple processes trying to WRITE to the file at the same time).
- The basic code to write would be this:
with open(path + r'\final_db.p', 'wb') as f:
    pickle.dump((x, y), f)
- While the code to read would be:
with open(path + r'\final_db.p', 'rb') as f:
    x, y = pickle.load(f)
- x is a list of 5K or more elements, where each element is a class instance containing many attributes and functions; y is a date.
QUESTION:
Am I correct in assuming that there is a race condition when a reading and a writing process overlap, and that the reading one can end up with a corrupt file?
PROPOSED SOLUTIONS:
1. A possible solution I thought of is using filelock:
code to write:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'wb') as f:
        pickle.dump((x, y), f)
code to read:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'rb') as f:
        x, y = pickle.load(f)
This solution should work (??), but if a process crashes, the file remains blocked until the "file_path + '.lock'" file is removed.
2. Another solution could be to use portalocker.
code to write:
with open(path + r'\final_db.p', 'wb') as f:
    portalocker.lock(f, portalocker.LOCK_EX)
    pickle.dump((x, y), f)
code to read:
segnale = True
while segnale:
    try:
        with open(path + r'\final_db.p', 'rb') as f:
            x, y = pickle.load(f)
            segnale = False
    except:
        pass
The reading process, if another process started writing before it, will keep looping until the file is unlocked (the PermissionError is caught and ignored).
If the writing process started after the reading process, the reading should keep looping while the file is corrupt.
What I am not sure about is whether the reading process could end up reading a partially written file.
Any advice? Better solutions?

ValueError: I/O operation on closed file while looping

Hi, I am facing an I/O error while looping file execution. The code raises 'ValueError: I/O operation on closed file.' while running. Does anyone have any idea why it says the operation is on a closed file, given that I am opening a new one while looping? Many thanks.
code below:
import csv
import datetime
import numpy as np
import pandas as pd

with open('inputlist.csv', 'r') as f: #input list reading
    reader = csv.reader(f)
    queries2Google = reader
    print(queries2Google)

def QGN(query2Google):
    s = '"'+query2Google+'"' #Keywords for query, to solve the + for space
    s = s.replace(" ","+")
    date = str(datetime.datetime.now().date()) #timestamp
    filename = query2Google+"_"+date+"_"+'SearchNews.csv' #csv filename
    f = open(filename,"wb") #open output file
    pass
    df = np.reshape(df,(-1,3))
    itemnum,col = df.shape
    itemnum = str(itemnum)
    df1 = pd.DataFrame(df,columns=['Title','URL','Brief'])
    print("Done! "+itemnum+" pieces found.")
    df1.to_csv(filename, index=False,encoding='utf-8')
    f.close()
    return

for query2Google in queries2Google:
    QGN(query2Google) #output should be multiple files
with closes the file that you are trying to read once it is done. So you are opening the file, making a csv reader, closing the underlying file, and then trying to read from it. See more about file I/O here.
The solution is to do all of your work on your queries2Google reader INSIDE the with statement:
with open('inputlist.csv', 'r') as f: #input list reading
    reader = csv.reader(f)
    for q2g in reader:
        QGN(q2g)
Some additional points:
That pass isn't doing anything, and you should probably be using with again inside the QGN function, since the file is opened and closed in there. Python doesn't need empty returns. You also don't seem to even be using f in the QGN function.
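Putting those points together, a trimmed-down QGN might look something like the sketch below; the search code that actually fills df is elided in the question, so a small placeholder array stands in for it here.

import datetime
import numpy as np
import pandas as pd

def QGN(query2Google):
    s = '"' + query2Google + '"'  # keywords for the query, + for spaces
    s = s.replace(" ", "+")
    date = str(datetime.datetime.now().date())  # timestamp
    filename = query2Google + "_" + date + "_" + 'SearchNews.csv'
    # placeholder data; the real search/scrape code is elided in the question
    df = np.array([["Title", "http://example.com", "Brief"]])
    df = np.reshape(df, (-1, 3))
    itemnum, col = df.shape
    df1 = pd.DataFrame(df, columns=['Title', 'URL', 'Brief'])
    print("Done! " + str(itemnum) + " pieces found.")
    # to_csv opens and closes the output file itself, so no manual open()/close() is needed
    df1.to_csv(filename, index=False, encoding='utf-8')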

Python read from a file, and only do work if a string isn't found

So I'm trying to make a reddit bot that will exec code from a submission. I have my own sub for controlling these clients.
while __name__ == '__main__':
    string = open('config.txt').read()
    for submission in subreddit.get_new(limit = 1):
        if submission.url not in string:
            f.write(submission.url + "\n")
            f.close()
            f = open('config.txt', "a")
            string = open('config.txt').read()
So what this is supposed to do is read from the config file, then only do work if the submission URL isn't in config.txt. However, it always sees the most recent post and does its work. This is how f is opened:
if not os.path.exists('file'):
    open('config.txt', 'w').close()
f = open('config.txt', "a")
First a critique of your existing code (in comments):
# the next two lines are not needed; open('config.txt', "a")
# will create the file if it doesn't exist.
if not os.path.exists('file'):
    open('config.txt', 'w').close()

f = open('config.txt', "a")

# this is an unusual condition which will confuse readers
while __name__ == '__main__':

    # the next line will open a file handle and never explicitly close it
    # (it will probably get closed automatically when it goes out of scope,
    # but it's not good form)
    string = open('config.txt').read()

    for submission in subreddit.get_new(limit = 1):
        # the next line should check for a full-line match; as written, it
        # will match "http://www.test.com" if "http://www.test.com/level2"
        # is in config.txt
        if submission.url not in string:
            f.write(submission.url + "\n")

            # the next two lines could be replaced with f.flush()
            f.close()
            f = open('config.txt', "a")

            # this is a cumbersome way to keep your string synced with the file,
            # and it never explicitly releases the new file handle
            string = open('config.txt').read()

    # If subreddit.get_new() doesn't return any results, this will act as
    # a busy loop, repeatedly requesting new results as fast as possible.
    # If that is undesirable, you might want to sleep here.

# file handle f should get closed after the loop
None of the problems pointed out above should keep your code from working (except maybe the imprecise matching). But simpler code may be easier to debug. Here's some code that does the same thing. Note: I assume there is no chance any other process is writing to config.txt at the same time. You could try this code (or your code) with pdb, line-by-line, to see whether it works as expected.
import time
import praw

r = praw.Reddit(...)
subreddit = r.get_subreddit(...)

if __name__ == '__main__':
    # open config.txt for reading and writing without truncating.
    # moves pointer to end of file; closes file at end of block
    with open('config.txt', "a+") as f:
        # move pointer to start of file
        f.seek(0)
        # make a set of existing lines; also move pointer to end of file
        lines = set(f.read().splitlines())
        while True:
            got_one = False
            for submission in subreddit.get_new(limit=1):
                got_one = True
                if submission.url not in lines:
                    lines.add(submission.url)
                    f.write(submission.url + "\n")
                    # write data to disk immediately
                    f.flush()
                    ...
            if not got_one:
                # wait a little while before trying again
                time.sleep(10)
