I am using Python's csv module to extract data from a CSV file that is constantly being updated by an external tool. I have run into a problem where, when I reach the end of the file, I get a StopIteration error; however, I would like the script to continue to loop, waiting for more lines to be added by the external tool.
What I came up with so far to do this is:
f = open('file.csv')
csvReader = csv.reader(f, delimiter=',')
while 1:
    try:
        doStuff(csvReader.next())
    except StopIteration:
        depth = f.tell()
        f.close()
        f = open('file.csv')
        f.seek(depth)
        csvReader = csv.reader(f, delimiter=',')
This has the intended functionality, but it also seems terrible. Looping after catching the StopIteration is not possible, since once StopIteration is thrown, it will throw a StopIteration on every subsequent call to next(). Does anyone have any suggestions on how to implement this in such a way that I don't have to do this silly tell and seeking? Or is there a different Python module that can easily support this functionality?
Your problem is not with the CSV reader, but with the file object itself. You may still have to do the crazy gyrations you're doing in your snippet above, but it would be better to create a file object wrapper or subclass that does it for you, and use that with your CSV reader. That keeps the complexity isolated from your csv processing code.
For instance (warning: untested code):
class ReopeningFile(object):
    def __init__(self, filename):
        self.filename = filename
        self.f = open(self.filename)

    def next(self):
        try:
            return self.f.next()
        except StopIteration:
            depth = self.f.tell()
            self.f.close()
            self.f = open(self.filename)
            self.f.seek(depth)
            # May need to sleep here to allow more data to come in
            # Also may need a way to signal a real StopIteration
            return self.next()

    def __iter__(self):
        return self
Then your main code becomes simpler, as it is freed from having to manage the file reopening (note that you also don't have to restart your csv_reader whenever the file restarts):
import csv
csv_reader = csv.reader(ReopeningFile('data.csv'))
for each in csv_reader:
    process_csv_line(each)
Producer-consumer stuff can get a bit tricky. How about using seek and reading bytes instead? What about using a named pipe?
Heck, why not communicate over a local socket?
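For what the named-pipe idea might look like, here is a rough, untested sketch (Unix only; the FIFO path and the assumption that the external tool can be pointed at it are mine). The reader simply blocks until more data arrives, with no tell()/seek() gymnastics:

import csv
import os

fifo_path = 'rows.fifo'                 # hypothetical path shared with the writer
if not os.path.exists(fifo_path):
    os.mkfifo(fifo_path)                # create the named pipe once

with open(fifo_path) as f:              # blocks until the writer opens its end
    for row in csv.reader(f):           # readline() blocks while the writer is idle
        doStuff(row)                    # doStuff as in the question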
You rarely need to catch StopIteration explicitly. Do this:
for row in csvReader:
    doStuff(row)
As for detecting when new lines are written to the file, you can either popen a tail -f process or write out the Python code for what tail -f does. (It isn't complicated; it basically just stats the file every second to see if it's changed. Here's the C source code of tail.)
EDIT: Disappointingly, popening tail -f doesn't work as I expected in Python 2.x. It seems iterating over the lines of a file is implemented using fread and a largeish buffer, even if the file is supposed to be unbuffered (like when subprocess.py creates the file, passing bufsize=0). But popening tail would be a mildly ugly hack anyway.
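For reference, here is a rough, untested sketch of the "write the tail -f yourself" idea: a generator that never raises StopIteration but instead sleeps and retries at end-of-file, fed straight into csv.reader. The helper name and poll interval are my own choices:

import csv
import time

def follow(path, poll_interval=1.0):
    """Yield complete lines from path forever, waiting for the writer to append more."""
    with open(path) as f:
        buf = ''
        while True:
            chunk = f.readline()
            if not chunk:
                time.sleep(poll_interval)   # at EOF: wait for the external tool to append more
                continue
            buf += chunk
            if buf.endswith('\n'):          # only hand out complete lines
                yield buf
                buf = ''

for row in csv.reader(follow('file.csv')):
    doStuff(row)                            # doStuff as in the question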
I want to open multiple files using a with statement (so I get the benefit of the context manager) based on boolean flags which instruct whether my program should or should not actually open each file.
I know I can use a with statement to open multiple files, like:
with open('log.txt', 'w') as logfile, open('out_a.txt', 'w') as out_a, open('out_b.txt', 'w') as out_b:
    # do something with logfile, out_a and out_b
    # all files are closed here
I want to run a similar statement, but only opening certain files based on their corresponding flags. I thought about implementing it as a conditional_open function, something like:
write_log = True
write_out_a = False
write_out_b = True
with conditional_open('log.txt', 'w', cond=write_log) as logfile, conditional_open('out_a.txt', 'w', cond=write_out_a) as out_a, conditional_open('out_b.txt', 'w', cond=write_out_b) as out_b:
    # do something with logfile, out_a and out_b
    # all files are closed here
But I'm a little confused as to how to properly create that function. Ideally, conditional_open would either return an open file handle or None (in which case the file is never created/touched/deleted):
def conditional_open(filename, mode, cond):
    return open(filename, mode) if cond else None
But I fear that this skips the benefits of the context manager when opening a file, since I'm calling open outside of it. Is this assumption correct?
Can anyone give some ideas about how I could be doing this? I know I could create mock file objects based on the conditions and write to them instead, but it sounds a bit too convoluted to me - this seems like a simple problem, which should have a simple solution in Python.
Just set up your function as a context manager.
from contextlib import contextmanager

@contextmanager
def conditional_open(f_name, mode, cond):
    if not cond:
        yield None
        return
    resource = open(f_name, mode)
    try:
        yield resource
    finally:
        resource.close()
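A possible usage sketch (my own, not from the question): since the manager yields None when cond is False, guard each handle before writing to it.

write_log, write_out_a = True, False

with conditional_open('log.txt', 'w', cond=write_log) as logfile, \
     conditional_open('out_a.txt', 'w', cond=write_out_a) as out_a:
    if logfile:
        logfile.write('started\n')
    if out_a:
        out_a.write('a-output\n')
# whichever files were actually opened are closed here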
Is there some way to "capture" all attempted writes to a particular file /my/special/file, and instead write that output to a BytesIO or StringIO object, or get that output some other way without actually writing to disk?
The use case is: there's a 'handler' function, whose contract is that it should write its output to /my/special/file. I don't have any control over this handler function -- I don't write it, I don't know its contents and I can't change its contents, and the contract cannot change. I'd like to be able to do something like this:
# 'output' has whatever 'handler' has written to `/my/special/file`
output = handler.run(data)
Even if this is an odd request, I'd like to be able to do this even with a 'hackier' answer.
EDIT: my code (and handler) will be invoked many times on a lot of chunks of data, so performance (both latency and throughput) are important.
Thanks.
If you're talking about code in your own Python program, you could monkey-patch the built-in open function before that code gets called. Here's a really stupid example, but it shows that you can do this. This causes code that thinks it's writing to a file to instead write into an in-memory buffer. The calling code then prints what the foreign code wrote to the file:
import io

# The function you don't have access to that writes to a file
def foo():
    f = open("/tmp/foo", "w")
    f.write("blahblahblah\n")
    f.close()

# The buffer to contain the captured text
capture_buffer = ""

# My silly file-like object that only handles write(str) and close()
class MyFileClass:
    def write(self, str):
        global capture_buffer
        capture_buffer += str

    def close(self):
        pass

# patch open to return a MyFileClass instance
def my_open2(*args, **kwargs):
    return MyFileClass()

open = my_open2

# Call the target function
foo()

# Print what the function wrote to "the file"
print(capture_buffer)
Result:
blahblahblah
Sorry for not spending more time with this. Just showing you it's possible. As others say, a mocking module might be the way to go to not have to grow your own thing here. I don't know if they allow access to what is written. I guess they must. Such a module is just going to do a better job of what I've shown here.
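For what it's worth, a hedged sketch of the mocking-module route using the standard unittest.mock (Python 3; the mock backport works similarly on Python 2). mock_open records every write() call, so the captured text is available without hand-rolling a file class:

from unittest import mock

def foo():
    f = open("/tmp/foo", "w")
    f.write("blahblahblah\n")
    f.close()

m = mock.mock_open()
with mock.patch('builtins.open', m):    # replace the built-in open during the call
    foo()

# join every chunk that was passed to write(), in order
captured = ''.join(c[0][0] for c in m().write.call_args_list)
print(captured)                          # -> blahblahblah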
If your program does other file IO with open, or whichever method the mystery code uses to open the file, you'd check the incoming path and only return your special object if it was the one path you're interested in. Otherwise, you could just call the original open, which you could stash away under another name.
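Here is an untested sketch of that selective approach, with hypothetical names: stash the real open, divert only /my/special/file into an in-memory buffer, and pass every other path straight through (assumes Python 3, where the built-in lives in the builtins module):

import builtins
import io

class _CaptureIO(io.StringIO):
    def close(self):
        pass                        # keep the contents readable even if the handler close()s it

_real_open = builtins.open
_captured = _CaptureIO()

def _selective_open(path, *args, **kwargs):
    if path == '/my/special/file':
        return _captured            # divert the special file into memory
    return _real_open(path, *args, **kwargs)    # everything else behaves normally

builtins.open = _selective_open
try:
    handler.run(data)               # the opaque handler and data from the question
finally:
    builtins.open = _real_open      # always restore the real open

output = _captured.getvalue()       # whatever was "written" to /my/special/file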
As in the thread How do you append to a file?, most answers are about opening a file and appending to it, for instance:
def FileSave(content):
    with open(filename, "a") as myfile:
        myfile.write(content)

FileSave("test1 \n")
FileSave("test2 \n")
Why don't we just pull myfile out and only write to it when FileSave is invoked?
global myfile
myfile = open(filename)

def FileSave(content):
    myfile.write(content)

FileSave("test1 \n")
FileSave("test2 \n")
Is the latter code better because it opens the file only once and writes to it multiple times?
Or is there no difference, because Python internally guarantees the file is opened only once even though the open method is invoked multiple times?
There are a number of problems with your modified code that aren't really relevant to your question: you open the file in read-only mode, you never close the file, you have a global statement that does nothing…
Let's ignore all of those and just talk about the advantages and disadvantages of opening and closing a file over and over:
The downside: it wastes a bit of time. If you're really unlucky, the file could even just barely keep falling out of the disk cache and waste even more time.
The upside: it ensures that you're always appending to the end of the file, even if some other program is also appending to the same file. (This is pretty important for, e.g., syslog-type logs.)[1]
It ensures that you've flushed your writes to disk at some point, which reduces the chance of lost data if your program crashes or gets killed.
It also ensures that you've flushed your writes to disk as soon as you write them. If you try to open and read the file elsewhere in the same program, or in a different program, or if the end user just opens it in Notepad, you won't be missing the last 1.73KB worth of lines because they're still in a buffer somewhere and won't be written until later.[2]
So, it's a tradeoff. Often, you want one of those guarantees, and the performance cost isn't a big deal. Sometimes, it is a big deal and the guarantees don't matter. Sometimes, you really need both, so you have to write something complicated where you manually buffer up bits and write-and-flush them all at once.
[1] As the Python docs for open make clear, this will happen anyway on some Unix systems. But not on other Unix systems, and not on Windows.
[2] Also, if you have multiple writers, they're all appending a line at a time, rather than appending whenever they happen to flush, which is again pretty important for logfiles.
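If you want to keep the file open but still get the flushed-promptly guarantee, one middle-ground sketch (my own illustration, not from the answer above; the filename is a placeholder) is to flush, and optionally fsync, after every write:

import os

myfile = open('log.txt', 'a')

def file_save(content):
    myfile.write(content)
    myfile.flush()                  # push Python's buffer down to the OS
    os.fsync(myfile.fileno())       # optionally force the OS to write it to disk

file_save("test1 \n")
file_save("test2 \n")
myfile.close()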
In general global should be avoided if possible.
The reason people use a with statement when dealing with files is that it explicitly controls the scope. Once the with block is done, the file is closed and the file variable is discarded.
You can avoid using the with statement, but then you must remember to call myfile.close(), particularly if you're dealing with a lot of files.
One way that avoids the with block and also avoids using global is:
def filesave(f_obj, string):
    f_obj.write(string)

f = open(filename, 'a')
filesave(f, "test1\n")
filesave(f, "test2\n")
f.close()
However, at this point you'd be better off getting rid of the function and simply doing:
f = open(filename, 'a')
f.write("test1\n")
f.write("test2\n")
f.close()
At which point you could easily put it within a with block:
with open(filename, 'a') as f:
    f.write("test1\n")
    f.write("test2\n")
So yes. There's no hard reason to not do what you're doing. It's just not very Pythonic.
The latter code may be more efficient, but the former code is safer: it makes sure that the content from each call to FileSave gets flushed to the filesystem so that other processes can read the updated content, and by closing the file handle on each call (using open as a context manager), it gives other processes a chance to write to the file as well (which matters specifically on Windows).
It really depends on the circumstances, but here are some thoughts:
A with block absolutely guarantees that the file will be closed once the block is exited. Python does not make any weird optimizations for appending to files.
In general, globals make your code less modular, and therefore harder to read and maintain. You would think that the original FileSave function is attempting to avoid globals, but it's using the global name filename, so you may as well use a global file altogether at that point, as it will save you some I/O overhead.
A better option would be to avoid globals altogether, or at least to use them properly. You really don't need a separate function to wrap file.write, but if it represents something more complex, here is a design suggestion:
def save(file, content):
    print(content, file=file)

def my_thing(filename):
    with open(filename, 'a') as f:
        # do some stuff
        save(f, 'test1')
        # do more stuff
        save(f, 'test2')

if __name__ == '__main__':
    my_thing('myfile.txt')
Notice that when you call the module as a script, a file name defined in the global scope will be passed in to the main routine. However, since the main routine does not reference global variables, you can A) read it more easily because it's self-contained, and B) test it without having to wonder how to feed it inputs without breaking everything else.
Also, by using print instead of file.write, you avoid having to append newlines manually.
I'm running a program that takes in data from other clients, and I have been having an enormous amount of trouble writing to, and changing information in, a file. I want to save the information so that if the program stops for some reason, the data is preserved. I feel like I have tried everything: using file.flush, using os.fsync() with it, using with open(file) as file: statements to close the file when the program stops, and currently I'm trying atexit to have a function write to the file when the program closes, which hasn't worked out, and it doesn't get called on errors anyway, so it's somewhat irrelevant. I'm looking for a way to write to a file repeatedly that, well, works. I may not understand something, so please explain it to me. I have been having trouble without end and need help.
EDIT
AccData = {}
client = discord.Client()
User = discord.User

def SaveData():
    pickle.dump(AccData, data)
    data.close()
    print("data saved")

atexit.register(SaveData)

f = open('DisCoin.json','rb')
AccData = pickle.load(open('DisCoin.json','rb'))
f.seek(0)
f.close()

data = open('DisCoin.json','wb')
Python catches its own exceptions and most signals, and on exit() it runs atexit routines for cleanup. So you can deal with normal badness there.
But other bad things happen: a segmentation fault or other internal error, an unknown signal, code that calls os._exit(). These will cause an early termination, and data not yet flushed will be lost. Bad things can happen to any program, and if it needs extra resiliency, it needs some method to handle that.
You can write things to temporary files and rename them to the "live" file only when they are complete. If a program terminates, at least its last saved data is still there.
You can write a log or journal of changes and rebuild the data you want by scanning that log. That's how many file systems work, and "Big Data" map/reduce systems do basically the same thing.
You can move to a database and use its transaction processing, or any OLTP system, to make sure you do all-or-none updates to your data store.
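As a rough illustration of the journaling idea (my own sketch, with a hypothetical log file and field names, not wired into the questioner's bot): append one JSON record per change, and rebuild the dictionary by replaying the log on startup.

import json

LOG_PATH = 'discoin.log'            # hypothetical journal file

def record_change(account, balance):
    # append-only: one JSON object per line, flushed and closed immediately
    with open(LOG_PATH, 'a') as log:
        log.write(json.dumps({'account': account, 'balance': balance}) + '\n')

def rebuild():
    data = {}
    try:
        with open(LOG_PATH) as log:
            for line in log:
                try:
                    entry = json.loads(line)
                except ValueError:  # a partially written last line after a crash
                    continue
                data[entry['account']] = entry['balance']
    except IOError:                 # no journal yet
        pass
    return data

AccData = rebuild()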
Your example code is especially fragile because
data = open('DisCoin.json','wb')
trashes existing data on disk. There is no going back with this code! Step one, then, is don't do that. Keep old data until the new stuff is ready.
Here is an example class that manages temporary files for you. Use it instead of open and it will create a temporary file for you to update; the data only goes live if the with block exits without an exception. There is no need for an atexit handler if you use this in a with block.
import shutil
import os

class SidelineFile:
    def __init__(self, *args, **kw):
        self.args = list(args)
        self.kw = kw

    def __enter__(self):
        self.closed = False
        self.orig_filename = self.args[0]
        self.args[0] += '.tmp'
        # figure out the requested mode so append mode can start from the old data
        try:
            mode = self.args[1]
        except IndexError:
            try:
                mode = self.kw['mode']
            except KeyError:
                mode = 'r'
        if 'a' in mode:
            shutil.copy2(self.orig_filename, self.args[0])
        self.file_obj = open(*self.args, **self.kw)
        return self.file_obj

    def __exit__(self, exc_type, exc_value, traceback):
        if not self.closed:
            self.file_obj.close()
            self.closed = True
        if not exc_type:
            # success: replace the live file with the temporary one
            os.rename(self.args[0], self.orig_filename)
        else:
            # an exception escaped the with block: discard the temporary file
            os.remove(self.args[0])
fn = 'test.txt'

with SidelineFile(fn, 'w') as fp:
    fp.write("foo")
print(1, repr(open(fn).read()))

with SidelineFile(fn, mode='a') as fp:
    fp.write("bar")
print(2, repr(open(fn).read()))

with SidelineFile(fn, 'w') as fp:
    fp.write("foo")
print(3, repr(open(fn).read()))

try:
    with SidelineFile(fn, 'a') as fp:
        fp.write("bar")
        raise IndexError()
except IndexError:
    pass
print(4, repr(open(fn).read()))
Personally, I like to achieve this by defining a print function for it.
import os

def fprint(text, **kwargs):
    os.chdir('C:\\mypath')
    myfile = open('output.txt', 'a')
    if kwargs:
        print(text, end=kwargs['end'], file=myfile)
    else:
        print(text, file=myfile)
    myfile.close()

fprint('Hello')
input()
fprint('This is here too', end='!!\n')
The above code will write 'Hello' into the file 'output.txt' at C:\mypath, save it, then after you enter some input will write 'This is here too!!' into the file. If you check the file while the script is waiting for input, it should already contain 'Hello'.
So I want to write some files that might be locked/blocked for write/delete by other processes, and I'd like to test for that upfront.
As I understand it, os.access(path, os.W_OK) only looks at the permissions and will return True even though the file cannot currently be written. So I have this little function:
def write_test(path):
    try:
        fobj = open(path, 'a')
        fobj.close()
        return True
    except IOError:
        return False
It actually works pretty well when I try it with a file that I manually open in another program. But as a wannabe-good-developer I want to put it in a test to automatically see if it works as expected.
Thing is: if I just open(path, 'a') the file, I can still open() it again, no problem! Even from another Python instance. Although Explorer will actually tell me that the file is currently open in Python!
I looked up other posts here & there about locking. Most suggest installing a package. You might understand that I don't want to do that just to test a handful of lines of code. So I dug into those packages to see the actual spot where the locking is eventually done...
fcntl? I don't have that. win32con? Don't have it either... Now in filelock there is this:
self.fd = os.open(self.lockfile, os.O_CREAT|os.O_EXCL|os.O_RDWR)
When I do that on a file it moans that the file exists!! Ehhm ... yea! That's the idea! But even when I do it on a non-existent path, I can still open(path, 'a') it! Even from another Python instance...
I'm beginning to think that I fail to understand something very basic here. Am I looking for the wrong thing? Can someone point me into the right direction?
Thanks!
You are trying to implement file locking using just the system call open(). Unix-like systems use advisory file locking by default. This means that cooperating processes may use locks to coordinate access to a file among themselves, but uncooperative processes are free to ignore the locks and access the file in any way they choose. In other words, file locks lock out other file lockers only, not I/O. See Wikipedia.
As stated in the open() system call reference, the solution for performing atomic file locking using a lockfile is to create a unique file on the same file system (e.g., incorporating hostname and pid), then use link(2) to make a link to the lockfile. If link() returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check whether its link count has increased to 2, in which case the lock is also successful.
That is why filelock also uses the function fcntl.flock() and puts all that stuff in a module, as it should be.
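For completeness, a hedged sketch of the Unix-side equivalent using fcntl.flock() in non-blocking mode (the question itself is on Windows, where msvcrt is used below; and since the lock is advisory, this only detects locks taken by other flock() users):

import fcntl
import os

def lock_test_unix(path):
    """True if the file can currently be flock()ed, False if another process holds the lock."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)   # fails immediately if already locked
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except (OSError, IOError):
        return False
    finally:
        os.close(fd)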
Alright! Thanks to those guys I actually have something now! So this is my function:
import os

def lock_test(path):
    """
    Checks if a file can, aside from its permissions, be changed right now (True)
    or is already locked by another process (False).

    :param str path: file to be checked
    :rtype: bool
    """
    import msvcrt
    try:
        fd = os.open(path, os.O_APPEND | os.O_EXCL | os.O_RDWR)
    except OSError:
        return False

    try:
        msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
        msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
        os.close(fd)
        return True
    except (OSError, IOError):
        os.close(fd)
        return False
And the unittest could look something like this:
class Test(unittest.TestCase):
def test_lock_test(self):
testfile = 'some_test_name4142351345.xyz'
testcontent = 'some random blaaa'
with open(testfile, 'w') as fob:
fob.write(testcontent)
# test successful locking and unlocking
self.assertTrue(lock_test(testfile))
os.remove(testfile)
self.assertFalse(os.path.exists(testfile))
# make file again, lock and test False locking
with open(testfile, 'w') as fob:
fob.write(testcontent)
fd = os.open(testfile, os.O_APPEND | os.O_RDWR)
msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
self.assertFalse(lock_test(testfile))
msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
self.assertTrue(lock_test(testfile))
os.close(fd)
with open(testfile) as fob:
content = fob.read()
self.assertTrue(content == testcontent)
os.remove(testfile)
Works. Downsides are that it's kind of testing itself with itself, so the initial OSError catch is not even tested, only locking again with msvcrt. But I don't know how to make it better right now.