race-condition: reading/writing file (windows) - python

I have the following situation:
- different users (all on Windows) run a Python script that can either read from or write to a pickle file located on a shared folder.
- the "system" is designed in such a way that only one user at a time will be writing to the file (therefore there is no race condition between multiple processes trying to WRITE to the file at the same time).
- the basic code to write would be this:
with open(path + r'\final_db.p', 'wb') as f:
    pickle.dump((x, y), f)
- while the code to read would be:
with open(path + r'\final_db.p', 'rb') as f:
    x, y = pickle.load(f)
- x is a list of 5,000 or more elements, where each element is a class instance containing many attributes and functions; y is a date.
QUESTION:
Am I correct in assuming that there is a race condition when a reading and a writing process overlap, and that the reading process can end up with a corrupt file?
PROPOSED SOLUTIONS:
1. a possible solution I thought of is using filelock:
code to write:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'wb') as f:
        pickle.dump((x, y), f)
code to read:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'rb') as f:
        x, y = pickle.load(f)
This solution should work (??), but if a process crashes, the file remains locked until the "file_path + '.lock'" file is deleted.
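One mitigation I can think of (not tested) is to acquire the lock with a finite timeout, so a reader does not wait forever on a stale lock; filelock raises Timeout in that case:

import filelock

file_path = path + r'\final_db.p'
lock = filelock.FileLock(file_path + '.lock', timeout=10)  # give up after 10 seconds
try:
    with lock:
        with open(file_path, 'rb') as f:
            x, y = pickle.load(f)
except filelock.Timeout:
    # the lock could not be acquired within 10 seconds (possibly stale/abandoned);
    # decide here whether to retry, warn the user, or give up
    pass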
2. another solution could be to use portalocker
code to write:
with open(path + r'\final_db.p', 'wb') as f:
    portalocker.lock(f, portalocker.LOCK_EX)
    pickle.dump((x, y), f)
code to read:
segnale = True
while segnale:
    try:
        with open(path + r'\final_db.p', 'rb') as f:
            x, y = pickle.load(f)
        segnale = False
    except:
        pass
The reading process, if another process started writing before it, will keep looping until the file is unlocked (it swallows the PermissionError). If the writing process starts after the reading process, the read should keep looping while the file is corrupt. What I am not sure about is whether the reading process could end up reading a partially written file.
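A variant of solution 2 I am considering (untested) would be to take a shared lock on the read side as well, so the reader blocks until the writer releases its exclusive lock instead of looping on exceptions:

import portalocker

with open(path + r'\final_db.p', 'rb') as f:
    # LOCK_SH: shared lock; this call should block while another process holds LOCK_EX
    portalocker.lock(f, portalocker.LOCK_SH)
    x, y = pickle.load(f)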
Any advice? better solutions?

Related

Undesired deletion of temporary files

I am trying to create some temporary files and perform some operations on them inside a loop. Then I will access the information in all of the temporary files and do some operations with that information. For simplicity I brought the following code that reproduces my issue:
import tempfile

tmp_files = []
for i in range(40):
    tmp = tempfile.NamedTemporaryFile(suffix=".txt")
    with open(tmp.name, "w") as f:
        f.write(str(i))
    tmp_files.append(tmp.name)

string = ""
for tmp_file in tmp_files:
    with open(tmp_file, "r") as f:
        data = f.read()
        string += data
print(string)
ERROR:
    with open(tmp_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpynh0kbnw.txt'
When I look in the /tmp directory (with some time.sleep(2) in the loop) I see that the files are deleted and only one is preserved, hence the error.
Of course I could keep all the files with the flag tempfile.NamedTemporaryFile(suffix=".txt", delete=False), but that is not the idea. I would like to hold the temporary files just for the running time of the script. I could also delete the files with os.remove. But my question is more about why this happens, because I expected the files to last until the end of the run, since I don't close them during execution (or do I?).
A lot of thanks in advance.
tdelaney already answers your actual question.
I would just like to offer you an alternative to NamedTemporaryFile: why not create a temporary folder which is removed (with all the files in it) at the end of the script?
Instead of using NamedTemporaryFile, you could use tempfile.TemporaryDirectory. The directory will be deleted when closed.
The example below uses the with statement, which closes the file handle automatically when the block ends (see John Gordon's comment).
import os
import tempfile

with tempfile.TemporaryDirectory() as temp_folder:
    tmp_files = []
    for i in range(40):
        tmp_file = os.path.join(temp_folder, f"{i}.txt")
        with open(tmp_file, "w") as f:
            f.write(str(i))
        tmp_files.append(tmp_file)

    string = ""
    for tmp_file in tmp_files:
        with open(tmp_file, "r") as f:
            data = f.read()
            string += data
    print(string)
By default, a NamedTemporaryFile deletes its file when closed. It's a bit subtle, but tmp = tempfile.NamedTemporaryFile(suffix=".txt") in the loop causes the previous file to be deleted when tmp is reassigned. One option is to use the delete=False parameter. Or, just keep the file open and seek to the beginning after the write.
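For completeness, a minimal sketch of the delete=False route (the files then have to be removed manually, here with os.remove) could look like this:

import os
import tempfile

tmp_files = []
for i in range(40):
    # delete=False keeps the file on disk after the handle is closed
    with tempfile.NamedTemporaryFile(suffix=".txt", mode="w", delete=False) as tmp:
        tmp.write(str(i))
        tmp_files.append(tmp.name)

string = ""
for tmp_file in tmp_files:
    with open(tmp_file, "r") as f:
        string += f.read()
print(string)

# manual cleanup, since delete=False was used
for tmp_file in tmp_files:
    os.remove(tmp_file)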
NamedTemporaryFile is already a file object: you can write to it directly without reopening. Just make sure the mode is "write plus" and text, not binary. Put the code in a try/finally block to make sure the files are really deleted at the end.
import tempfile

tmp_files = []
try:
    for i in range(40):
        tmp = tempfile.NamedTemporaryFile(suffix=".txt", mode="w+")
        tmp.write(str(i))
        tmp.seek(0)
        tmp_files.append(tmp)

    string = ""
    for tmp_file in tmp_files:
        data = tmp_file.read()
        string += data
finally:
    for tmp_file in tmp_files:
        tmp_file.close()
print(string)

ThreadPoolExecutor behaving unexpectedly

I've got a directory of files such as:
input_0.data
input_1.data
and so forth. I want to parse these files with a function that has been shown to output 47 lines for input_0.data when run by itself. However, when I bring a ThreadPoolExecutor into the mix and actually run more than one thread, the output from input_0.data becomes huge, quickly exceeding the known good 47 lines.
The code I'm trying to use is as follows, with needless details cut fairly obviously:
def find_moves(param_list):
    input_filename = param_list[0]
    output_filename = param_list[1]
    print(input_filename + " " + output_filename, flush=True)
    input_file = open(input_filename, "r")
    output_file = open(output_filename, "w")
    for line in input_file:
        if do_log_line(line):
            log = format_log(line)
            print(log, file=output_file, flush=True)
    input_file.close()
    output_file.close()

if len(argv) != 3:
    print("Usage:\n\tmoves.py [input_dir] [output_dir]")
    quit()

input_files = list()
for file in glob(path.join(argv[1], "input_*.data")):
    input_files.append(file)
input_files = sorted(input_files)

with ThreadPoolExecutor(max_workers=8) as executor:
    for file_number, input_filename in enumerate(input_files):
        output_filename = "moves_" + str(file_number) + ".csv"
        output_filename = path.join(argv[2], output_filename)
        executor.submit(find_moves, (input_filename, output_filename))
It's obvious I'm using this tool incorrectly, but it's not obvious to me where my mistake is. I'd appreciate some guidance in the matter.
It seems like the threads are writing to each other's files, even though they explicitly state they're working on the right file.
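A side note in case it matters: I am not looking at the futures returned by submit, so any exception raised inside find_moves would be silently dropped. If that is relevant, collecting and checking them would look roughly like this:

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = []
    for file_number, input_filename in enumerate(input_files):
        output_filename = path.join(argv[2], "moves_" + str(file_number) + ".csv")
        futures.append(executor.submit(find_moves, (input_filename, output_filename)))
    for future in futures:
        future.result()  # re-raises any exception from the worker thread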

ValueError: I/O operation on closed file while looping

Hi, I am facing an I/O error while looping file execution. The code raises 'ValueError: I/O operation on closed file.' while running. Does anyone have any idea why it says the operation is on a closed file, as I am opening a new one while looping? Many thanks.
code below:
with open('inputlist.csv', 'r') as f:  # input list reading
    reader = csv.reader(f)
    queries2Google = reader
print(queries2Google)

def QGN(query2Google):
    s = '"' + query2Google + '"'  # keywords for query, to solve the + for space
    s = s.replace(" ", "+")
    date = str(datetime.datetime.now().date())  # timestamp
    filename = query2Google + "_" + date + "_" + 'SearchNews.csv'  # csv filename
    f = open(filename, "wb")  # open output file
    pass
    df = np.reshape(df, (-1, 3))
    itemnum, col = df.shape
    itemnum = str(itemnum)
    df1 = pd.DataFrame(df, columns=['Title', 'URL', 'Brief'])
    print("Done! " + itemnum + " pieces found.")
    df1.to_csv(filename, index=False, encoding='utf-8')
    f.close()
    return

for query2Google in queries2Google:
    QGN(query2Google)  # output should be multiple files
with closes the file that you are trying to read once it is done. So you are opening the file, making a csv reader, and then closing the underlying file and then trying to read from it. See more about file I/O here.
The solution is to do all of your work on your queries2Google reader INSIDE the with statement:
with open('inputlist.csv', 'r') as f:  # input list reading
    reader = csv.reader(f)
    for q2g in reader:
        QGN(q2g)
Some additional stuff:
That pass isn't doing anything, and you should probably be using with again inside the QGN function since a file is opened and closed in there. Python doesn't need empty returns. You also don't seem to be using f in the QGN function at all.
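For what it's worth, since to_csv takes a filename and handles opening and closing itself, a rough sketch of QGN without the manual open()/close() might look like this (the DataFrame construction below is just a placeholder for the elided scraping code):

import datetime
import pandas as pd

def QGN(query2Google):
    date = str(datetime.datetime.now().date())
    filename = query2Google + "_" + date + "_" + 'SearchNews.csv'
    # placeholder: df1 would be built from the scraped results as in the original code
    df1 = pd.DataFrame([["a title", "http://example.com", "a brief"]],
                       columns=['Title', 'URL', 'Brief'])
    print("Done! " + str(len(df1)) + " pieces found.")
    # to_csv opens and closes the output file itself, so no manual file handling is needed
    df1.to_csv(filename, index=False, encoding='utf-8')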

Python read file into memory for repeated FTP copy

I need to read a local file and copy it to a remote location with FTP. I copy the same file, file.txt, to the remote location repeatedly, hundreds of times, with different names like f1.txt, f2.txt, ... f1000.txt etc. Now, is it necessary to always open, read, and close my local file.txt for every single FTP copy, or is there a way to store the content in a variable and use that every time, avoiding the open and close calls? file.txt is a small file of 6KB. Below is the code I am using:
for i in range(1, 101):
    fname = 'file' + str(i) + '.txt'
    fp = open('file.txt', 'rb')
    ftp.storbinary('STOR ' + fname, fp)
    fp.close()
I tried reading into a string variable to replace fp, but ftp.storbinary requires its second argument to have a read() method. Please suggest if there is a better way to avoid the file open/close, or let me know if it brings no performance improvement at all. I am using Python 2.7.10 on Windows 7.
Simply open it before the loop, and close it after:
fp = open('file.txt', 'rb')
for i in range(1, 101):
    fname = 'file' + str(i) + '.txt'
    fp.seek(0)
    ftp.storbinary('STOR ' + fname, fp)
fp.close()
Update: Make sure you add fp.seek(0) before the call to ftp.storbinary, otherwise the read call will exhaust the file in the first iteration, as noted by @eryksun.
Update 2: Depending on the size of the file, it will probably be faster to use BytesIO. This way the file content is kept in memory but will still be a file-like object (i.e. it will have a read method).
from io import BytesIO

with open('file.txt', 'rb') as f:
    output = BytesIO()
    output.write(f.read())

for i in range(1, 101):
    fname = 'file' + str(i) + '.txt'
    output.seek(0)
    ftp.storbinary('STOR ' + fname, output)  # upload from the in-memory copy, not the closed file object

Using too many Open calls. How do I close all files?

I'm trying to change a lot of PDF files. Because of this I must open a lot of files, and I use the open call too many times, so Python gives the error "too many open files".
I hope my code is clear; the reader and writer calls are many and all very similar:
readerbanner = PyPDF2.pdf.PdfFileReader(open('transafe.pdf', 'rb'))
readertestpages = PyPDF2.pdf.PdfFileReader(open(os.path.join(Cache_path, cache_file_name), 'rb'))
writeroutput.write(open(os.path.join(Output_path,cache_file_name), 'wb'))
or
writer_output.write(open(os.path.join(Cache_path, str(NumPage) + "_" + pdf_file_name), 'wb'))
reader_page_x = PyPDF2.pdf.PdfFileReader(open(os.path.join(PDF_path, pdf_file_name), 'rb'))
None of the open calls use the form f_name = open("path", "r"); the file objects are all created inline, as above, so I never keep a reference to them. I know where the files are opened, but I don't know how to close all of the open files.
To close a file just call close() on it.
You can also use a context manager which handles file closing for you:
with open('file.txt') as myfile:
    # do something with myfile here

# here myfile is automatically closed
As far as I know, this code should not open too many files, unless it is run a lot of times.
Regardless, the problem is that you are calling:
PyPDF2.pdf.PdfFileReader(open('transafe.pdf', 'rb'))
and similar. This creates a file object but saves no reference to it.
What you need to do for all open calls is:
file = open('transafe.pdf', 'rb')
PyPDF2.pdf.PdfFileReader(file)
And then:
file.close()
when you do not use the file anymore.
If you want to close many files at the same time, put them in a list.
with statement
with open("abc.txt", "r") as file1, open("123.txt", "r") as file2:
# use files
foo = file1.readlines()
# they are closed automatically
print(file1.closed)
# -> True
print(file2.closed)
# -> True
wrapper function
files = []

def my_open(*args):
    f = open(*args)
    files.append(f)
    return f

# use my_open
foo = my_open("text.txt", "r")

# close all files
list(map(lambda f: f.close(), files))
wrapper class
class OpenFiles():
    def __init__(self):
        self.files = []

    def open(self, *args):
        f = open(*args)
        self.files.append(f)
        return f

    def close(self):
        list(map(lambda f: f.close(), self.files))

files = OpenFiles()

# use open method
foo = files.open("text.txt", "r")

# close all files
files.close()
ExitStack can be useful:
https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack
with ExitStack() as stack:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    # All opened files will automatically be closed at the end of
    # the with statement, even if attempts to open files later
    # in the list raise an exception
