Writing to file without losing data after crash

Writing to file without losing data after crash - python

I have a file that I open before a loop starts, and I'm writing to that file almost at each iteration of the loop. Then I close the file once the loop has finished. So e.g. something like:
testfile = open('datagathered','w')
for i in range(n):
...
testfile.write(line)
testfile.close()
The issue I'm having is that, in case the program crashes or I want to crash it, what has already been written to testfile will be deleted, and the text file datagathered will be empty. I understand that this happens because I'm closing the file only after the loop, but if I close and open the file after each write (i.e. in the loop) doesn't that lead to an incredible slow-down?
If yes, what alternatives do I have for doing the writing, and making sure that in case of a crash the already-written-lines won't get lost, in an efficient way?
The linked posts do bring up good suggestions that arguably answer this question, but they don't cover risks and efficiency differences involved. More precisely: Are there any risks involved with playing with the buffersize? e.g. testfile = open('datagathered','w',0) Finally is using with open... still a viable alternative if there are multiple files to write to?
Small note: This is asked in the context of a very long run, where the file is being written to for 2-3 days. Thus having a speedy and safe way of doing the writing is definitely valuable here.

From the question I understood that you are talking about exceptions may occur at runtime and SIGINT.
You may use 'try-except-finally' block to achieve your goal. It enables you to catch both exceptions and SIGINT signal. Since the finally block will be executed either exception is caught or everything goes well, closing file there is the best choice. Following sample code would solve your problem I guess.
testfile = open('datagathered','w')
try:
for i in range(n):
...
testfile.write(line)
except KeyboardInterrupt:
print "Interrupt from keyboard"
except:
print "Other exception"
finally:
testfile.close()

Use a context:
with open('datagathered','w') as f:
f.write(data)

Related

How do I add a separate function for average calculation?

I am stuck on this problem. Code I have so far works but my Professor wants to see some changes. I need to add error handing and I need a separate function for calculating average which I will call in main. Here is the what I have so far...
import os
def process_file(filename):
f = open(filename,'r')
lines = f.readlines()[1:]
f.close()
scores = []
for line in lines:
parsed = line.split(",")
count = int(parsed[1])
scores.append(count)
calculate_result(scores)
def calculate_result(scores):
print("High: ", max(scores))
print("Low: ", min(scores))
print("Average: ", sum(scores)/len(scores))
def main():
filename = "scores.text"
if os.path.isfile(filename):
process_file(filename)
else:
print ("File does not exist")
return 0
main()

I guess there are 2 parts:
I need to add error handling
and
I need a separate function for calculating average which I will call in main
The second part I don't think you need help with. But error handling is kind of an art, so I can see where you might be stuck on that. Here are some suggestions to help get started.
The most common type of error handling involves dealing with input. Thinking more broadly, we could expand that to anything that crosses the boundary of the programs memory space. This includes not just user input, but also output; filesystem interaction; using network interfaces (or any communication device or hardware interface); starting/stopping or otherwise interacting with other programs; calling a library that does any of these things on our behalf; and many more....
So what parts of your program are interacting with "the outside" ? I can see a few:
in main() the program is making an assumption about the existence of a file. You are already checking to make sure this file exists, and returning 0 if it doesn't (you might want to change that to a non-zero value, since 0 is usually used to signal that no error occurred)
process_file() does this: f = open(filename,'r') but are you sure that will work? Are there conditions where this could fail?
What if the user that is running the program doesn't have permissions to read that file?
What if the file was deleted or changed between the time it was checked in main and the subsequent open call in process_file? This is a TOCTOU race condition, and it is something that every software developer needs to watch out for.
Probably the most obvious source of potential errors for this program is the content of the input file:
We're assuming the input is comma-separated. What if the user uses tabs or some other character?
While processing the lines, you've got: count = int(parsed[1]), but how do you know that parsed[1] can be cast to an int?
What will happen if the file exists, but is empty (hint: len(scores)==0)? Always look at these edge cases.
Finally, it looks like you are using if-then statements for error checking. That is fine, but another powerful tool for dealing with errors are try-except statements. They are not mutually exclusive: sometimes it's easier to use an if statement, and sometimes catching an exception with try-except is better. Some of the errors you'll need to deal with are easier to handle using one approach over the other.

File open and close in python

I have read that when file is opened using the below format
with open(filename) as f:
#My Code
f.close()
explicit closing of file is not required . Can someone explain why is it so ? Also if someone does explicitly close the file, will it have any undesirable effect ?

The mile-high overview is this: When you leave the nested block, Python automatically calls f.close() for you.
It doesn't matter whether you leave by just falling off the bottom, or calling break/continue/return to jump out of it, or raise an exception; no matter how you leave that block. It always knows you're leaving, so it always closes the file.*
One level down, you can think of it as mapping to the try:/finally: statement:
f = open(filename)
try:
# My Code
finally:
f.close()
One level down: How does it know to call close instead of something different?
Well, it doesn't really. It actually calls special methods __enter__ and __exit__:
f = open()
f.__enter__()
try:
# My Code
finally:
f.__exit__()
And the object returned by open (a file in Python 2, one of the wrappers in io in Python 3) has something like this in it:
def __exit__(self):
self.close()
It's actually a bit more complicated than that last version, which makes it easier to generate better error messages, and lets Python avoid "entering" a block that it doesn't know how to "exit".
To understand all the details, read PEP 343.
Also if someone does explicitly close the file, will it have any undesirable effect ?
In general, this is a bad thing to do.
However, file objects go out of their way to make it safe. It's an error to do anything to a closed file—except to close it again.
* Unless you leave by, say, pulling the power cord on the server in the middle of it executing your script. In that case, obviously, it never gets to run any code, much less the close. But an explicit close would hardly help you there.

Closing is not required because the with statement automatically takes care of that.
Within the with statement the __enter__ method on open(...) is called and as soon as you go out of that block the __exit__ method is called.
So closing it manually is just futile since the __exit__ method will take care of that automatically.
As for the f.close() after, it's not wrong but useless. It's already closed so it won't do anything.
Also see this blogpost for more info about the with statement: http://effbot.org/zone/python-with-statement.htm

multiprocessing when getting URLs python 3.2

I've made a script to get inventory data from the Steam API and I'm a bit unsatisfied with the speed. So I read a bit about multiprocessing in python and simply cannot wrap my head around it. The program works as such: it gets the SteamID from a list, gets the inventory and then appends the SteamID and the inventory in a dictionary with the ID as the key and inventory contents as the value.
I've also understood that there are some issues involved with using a counter when multiprocessing, which is a small problem as I'd like to be able to resume the program from the last fetched inventory rather than from the beginning again.
Anyway, what I'm asking for is really a concrete example of how to do multiprocessing when opening the URL that contains the inventory data so that the program can fetch more than one inventory at a time rather than just one.
onto the code:
with open("index_to_name.json", "r", encoding=("utf-8")) as fp:
index_to_name=json.load(fp)
with open("index_to_quality.json", "r", encoding=("utf-8")) as fp:
index_to_quality=json.load(fp)
with open("index_to_name_no_the.json", "r", encoding=("utf-8")) as fp:
index_to_name_no_the=json.load(fp)
with open("steamprofiler.json", "r", encoding=("utf-8")) as fp:
steamprofiler=json.load(fp)
with open("itemdb.json", "r", encoding=("utf-8")) as fp:
players=json.load(fp)
error=list()
playerinventories=dict()
c=127480
while c<len(steamprofiler):
inventory=dict()
items=list()
try:
url=urllib.request.urlopen("http://api.steampowered.com/IEconItems_440/GetPlayerItems/v0001/?key=DD5180808208B830FCA60D0BDFD27E27&steamid="+steamprofiler[c]+"&format=json")
inv=json.loads(url.read().decode("utf-8"))
url.close()
except (urllib.error.HTTPError, urllib.error.URLError, socket.error, UnicodeDecodeError) as e:
c+=1
print("HTTP-error, continuing")
error.append(c)
continue
try:
for r in inv["result"]["items"]:
inventory[r["id"]]=r["quality"], r["defindex"]
except KeyError:
c+=1
error.append(c)
continue
for key in inventory:
try:
if index_to_quality[str(inventory[key][0])]=="":
items.append(
index_to_quality[str(inventory[key][0])]
+""+
index_to_name[str(inventory[key][1])]
)
else:
items.append(
index_to_quality[str(inventory[key][0])]
+" "+
index_to_name_no_the[str(inventory[key][1])]
)
except KeyError:
print("keyerror, uppdate def_to_index")
c+=1
error.append(c)
continue
playerinventories[int(steamprofiler[c])]=items
c+=1
if c % 10==0:
print(c, "inventories downloaded")
I hope my problem was clear, otherwise just say so obviously. I would optimally avoid using 3rd party libraries but if it's not possible it's not possible. Thanks in advance

So you're assuming the fetching of the URL might be the thing slowing your program down? You'd do well to check that assumption first, but if it's indeed the case using the multiprocessing module is a huge overkill: for I/O bound bottlenecks threading is quite a bit simpler and might even be a bit faster (it takes a lot more time to spawn another python interpreter than to spawn a thread).
Looking at your code, you might get away with sticking most of the content of your while loop in a function with c as a parameter, and starting a thread from there using another function, something like:
def process_item(c):
# The work goes here
# Replace al those 'continue' statements with 'return'
for c in range(127480, len(steamprofiler)):
thread = threading.Thread(name="inventory {0}".format(c), target=process_item, args=[c])
thread.start()
A real problem might be that there's no limit to the amount of threads being spawned, which might break the program. Also the guys at Steam might not be amused at getting hammered by your script, and they might decide to un-friend you.
A better approach would be to fill a collections.deque object with your list of c's and then start a limited set of threads to do the work:
def process_item(c):
# The work goes here
# Replace al those 'continue' statements with 'return'
def process():
while True:
process_item(work.popleft())
work = collections.deque(range(127480, len(steamprofiler)))
threads = [threading.Thread(name="worker {0}".format(n), target=process)
for n in range(6)]
for worker in threads:
worker.start()
Note that I'm counting on work.popleft() to throw an IndexError when we're out of work, which will kill the thread. That's a bit sneaky, so consider using a try...except instead.
Two more things:
Consider using the excellent Requests library instead of urllib (which, API-wise, is by far the worst module in the entire Python standard library that I've worked with).
For Requests, there's an add-on called grequests which allows you to do fully asynchronous HTTP-requests. That would have made for even simpler code.
I hope this helps, but please keep in mind this is all untested code.

The outermost while loop seems to be distributed over a few processes(or tasks).
When you break the loop into tasks, note that you are sharing playerinventories and error object between processes. You will need to use multiprocessing.Manager for the sharing issue.
I recommend you to start modifying your code from this snippet.

Close all open files in ipython

Sometimes when using ipython you might hit an exception in a function which has opened a file in write mode. This means that the next time you run the function you get a value error,
ValueError: The file 'filename' is already opened. Please close it before reopening in write mode.
However since the function bugged out, the file handle (which was created inside the function) is lost, so it can't be closed. The only way round it seems to be to close the ipython session, at which point you get the message:
Closing remaining open files: filename... done
Is there a way to instruct ipython to close the files without quitting the session?

You should try to always use the with statement when working with files. For example, use something like
with open("x.txt") as fh:
...do something with the file handle fh
This ensures that if something goes wrong during the execution of the with block, and an exception is raised, the file is guaranteed to be closed. See the with documentation for more information on this.
Edit: Following a discussion in the comments, it seems that the OP needs to have a number of files open at the same time and needs to use data from multiple files at once. Clearly having lots of nested with statements, one for each file opened, is not an option and goes against the ideal that "flat is better than nested".
One option would be to wrap the calculation in a try/finally block. For example
file_handles = []
try:
for file in file_list:
file_handles.append(open(file))
# Do some calculations with open files
finally:
for fh in file_handles:
fh.close()
The finally block contains code which should be run after any try, except or else block, even if an exception occured. From the documentation:
If finally is present, it specifies a "cleanup" handler. The try clause is executed, including any except and else clauses. If an exception occurs in any of the clauses and is not handled, the exception is temporarily saved. The finally clause is executed. If there is a saved exception, it is re-raised at the end of the finally clause. If the finally clause raises another exception or executes a return or break statement, the saved exception is lost. The exception information is not available to the program during execution of the finally clause.

A few ideas:
use always finally (or a with block) when working with files, so they are properly closed.
you can blindly close the non standard file descriptors using os.close(n) where n is a number greater than 2 (this is unix specific, so you might want to peek /proc/ipython_pid/fd/ to see what descriptors the process have opened so far).
you can inspect the captured stack frames locals to see if you can find the reference to the wayward file and close it... take a look to sys.last_traceback

Overriding basic signals (SIGINT, SIGQUIT, SIGKILL??) in Python

I'm writing a program that adds normal UNIX accounts (i.e. modifying /etc/passwd, /etc/group, and /etc/shadow) according to our corp's policy. It also does some slightly fancy stuff like sending an email to the user.
I've got all the code working, but there are three pieces of code that are very critical, which update the three files above. The code is already fairly robust because it locks those files (ex. /etc/passwd.lock), writes to to a temporary files (ex. /etc/passwd.tmp), and then, overwrites the original file with the temporary. I'm fairly pleased that it won't interefere with other running versions of my program or the system useradd, usermod, passwd, etc. programs.
The thing that I'm most worried about is a stray ctrl+c, ctrl+d, or kill command in the middle of these sections. This has led me to the signal module, which seems to do precisely what I want: ignore certain signals during the "critical" region.
I'm using an older version of Python, which doesn't have signal.SIG_IGN, so I have an awesome "pass" function:
def passer(*a):
pass
The problem that I'm seeing is that signal handlers don't work the way that I expect.
Given the following test code:
def passer(a=None, b=None):
pass
def signalhander(enable):
signallist = (signal.SIGINT, signal.SIGQUIT, signal.SIGABRT, signal.SIGPIPE, signal.SIGALRM, signal.SIGTERM, signal.SIGKILL)
if enable:
for i in signallist:
signal.signal(i, passer)
else:
for i in signallist:
signal.signal(i, abort)
return
def abort(a=None, b=None):
sys.exit('\nAccount was not created.\n')
return
signalhander(True)
print('Enabled')
time.sleep(10) # ^C during this sleep
The problem with this code is that a ^C (SIGINT) during the time.sleep(10) call causes that function to stop, and then, my signal handler takes over as desired. However, that doesn't solve my "critical" region problem above because I can't tolerate whatever statement encounters the signal to fail.
I need some sort of signal handler that will just completely ignore SIGINT and SIGQUIT.
The Fedora/RH command "yum" is written is Python and does basically exactly what I want. If you do a ^C while it's installing anything, it will print a message like "Press ^C within two seconds to force kill." Otherwise, the ^C is ignored. I don't really care about the two second warning since my program completes in a fraction of a second.
Could someone help me implement a signal handler for CPython 2.3 that doesn't cause the current statement/function to cancel before the signal is ignored?
As always, thanks in advance.
Edit: After S.Lott's answer, I've decided to abandon the signal module.
I'm just going to go back to try: except: blocks. Looking at my code there are two things that happen for each critical region that cannot be aborted: overwriting file with file.tmp and removing the lock once finished (or other tools will be unable to modify the file, until it is manually removed). I've put each of those in their own function inside a try: block, and the except: simply calls the function again. That way the function will just re-call itself in the event of KeyBoardInterrupt or EOFError, until the critical code is completed.
I don't think that I can get into too much trouble since I'm only catching user provided exit commands, and even then, only for two to three lines of code. Theoretically, if those exceptions could be raised fast enough, I suppose I could get the "maximum reccurrsion depth exceded" error, but that would seem far out.
Any other concerns?
Pesudo-code:
def criticalRemoveLock(file):
try:
if os.path.isFile(file):
os.remove(file)
else:
return True
except (KeyboardInterrupt, EOFError):
return criticalRemoveLock(file)
def criticalOverwrite(tmp, file):
try:
if os.path.isFile(tmp):
shutil.copy2(tmp, file)
os.remove(tmp)
else:
return True
except (KeyboardInterrupt, EOFError):
return criticalOverwrite(tmp, file)

There is no real way to make your script really save. Of course you can ignore signals and catch a keyboard interrupt using try: except: but it is up to your application to be idempotent against such interrupts and it must be able to resume operations after dealing with an interrupt at some kind of savepoint.
The only thing that you can really to is to work on temporary files (and not original files) and move them after doing the work into the final destination. I think such file operations are supposed to be "atomic" from the filesystem prospective. Otherwise in case of an interrupt: restart your processing from start with clean data.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.