I've written a script that fetches bitcoin data and saves it in .txt files or, if the .txt files already exist, updates them. The .txt files hold the nodes, and the relationships connecting the nodes, for neo4j.
At the beginning of the script:
If the files exist, it opens them and appends new lines, OR
if the files do not exist, the script creates them and starts appending lines.
The .txt files stay open the whole time while the script writes the new data. They are closed when all the data has been written or when I terminate the execution.
My question is:
Should I open, write, close each .txt file for each iteration and for each .txt file?
or
Should I keep it the way it is now: open the .txt files, do all the writing, and close the .txt files when the writing is done?
I am saving data from 6013 blocks. Which way would minimize the risk of corrupting the data written in the .txt files?
Keeping the files open will be faster. In the comments you mentioned that "Loss of data previously written is not an option". The probability of corrupting a file is higher while it is open, so opening and closing the file on each iteration is more reliable.
There is also an option to keep data in some buffer and to write/append buffer to file when all data is received or on user/system interrupt or network timeout.
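The buffer option could look roughly like this. It is only a sketch: fetch_blocks is a made-up stand-in for whatever actually downloads the data, and the file name and the flush_every batch size are arbitrary assumptions.

```python
import os
import tempfile


def fetch_blocks(n):
    """Hypothetical stand-in for the real data source (one line per block)."""
    for height in range(n):
        yield f"block,{height}\n"


def _append(path, lines):
    # Open, append and close in one go, so the file is only open briefly.
    with open(path, "a") as f:
        f.writelines(lines)


def save_buffered(path, lines, flush_every=1000):
    """Collect lines in memory and append them to `path` in batches."""
    buffer = []
    try:
        for line in lines:
            buffer.append(line)
            if len(buffer) >= flush_every:
                _append(path, buffer)
                buffer.clear()
    finally:
        # Runs on normal completion, on Ctrl+C (KeyboardInterrupt) and on
        # most other exceptions, so whatever is still buffered gets written.
        if buffer:
            _append(path, buffer)


# Demo in a temporary directory; 6013 is the block count from the question.
path = os.path.join(tempfile.mkdtemp(), "nodes.txt")
save_buffered(path, fetch_blocks(6013))
```

The trade-off is that anything still sitting in the buffer is lost if the process is killed outright (for example by a power failure), so a smaller flush_every means less data at risk.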
I think keeping the file open will be more efficient, because python won't need to search for the file and open it every time you want to read/write the file.
I guess it should look like this:
with open(filename, "a") as file:
    while True:
        data = get_data()  # placeholder: fetch the next piece of data
        file.write(data)
"Run a benchmark and see for yourself" would be the typical answer for this kind of question.
Nevertheless opening and closing a file does have a cost. Python needs to allocate memory for the buffer and data structures associated with the file and call some operating system functions, e.g. the open syscall which in turn would search the file in cache or on disk.
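If you do want to measure it, a benchmark can be as small as the sketch below. The file names and N_LINES are arbitrary choices for illustration, and the timings will depend heavily on the operating system and disk, so no numbers are claimed here.

```python
import os
import tempfile
import timeit

N_LINES = 2000  # arbitrary size, chosen only for illustration


def keep_open(path):
    # Open once, write everything, close once.
    with open(path, "a") as f:
        for i in range(N_LINES):
            f.write(f"{i}\n")


def reopen_each_time(path):
    # Open, write one line and close again on every iteration.
    for i in range(N_LINES):
        with open(path, "a") as f:
            f.write(f"{i}\n")


tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "keep_open.txt")
path_b = os.path.join(tmp_dir, "reopen.txt")

t_keep = timeit.timeit(lambda: keep_open(path_a), number=1)
t_reopen = timeit.timeit(lambda: reopen_each_time(path_b), number=1)
print(f"keep open: {t_keep:.4f}s, reopen each time: {t_reopen:.4f}s")
```

Both variants produce the same file contents; only the time spent differs.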
On the other hand, there is a limit on the number of files that a program, a user, or the whole system can have open at the same time. For example, on Linux the value in /proc/sys/fs/file-max denotes the maximum number of file handles that the kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit (source).
If your program runs in such a restrictive environment then it would be good to keep the file open only when needed.
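To see which limits apply to your own process, something like the following should work on Unix-like systems. The function name open_file_limits is made up for this sketch; the resource module is not available on Windows, and /proc/sys/fs/file-max only exists on Linux.

```python
import os
import resource  # Unix-only standard library module


def open_file_limits():
    """Return (soft, hard, system_wide) limits on open files.

    soft and hard are the per-process limits; system_wide is the Linux
    file-max value, or None when /proc/sys/fs/file-max does not exist.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    system_wide = None
    if os.path.exists("/proc/sys/fs/file-max"):
        with open("/proc/sys/fs/file-max") as f:
            system_wide = int(f.read().strip())
    return soft, hard, system_wide


soft, hard, system_wide = open_file_limits()
print(f"per-process: soft={soft}, hard={hard}; system-wide: {system_wide}")
```

The per-process soft limit is usually the one a script hits first, long before the system-wide value matters.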
Related
When opening and appending to a file in python, does that file get loaded into memory? I'm asking this because I'm writing a program where I write to several files in a round-robin fashion where I have the guarantee that any one file can fit into memory but not all files can fit into memory at the same time. Opening and closing files every time I append is not an option since that would be too slow. As such, I would need all the files opened simultaneously.
The answer is no. According to the documentation, open() wraps a system call and returns a file object (not the content of the file): https://docs.python.org/2/library/functions.html#open
Open a file, returning an object of the file type described in section File Objects.
The file contents are not loaded into RAM unless you read the file, e.g. with readlines() or read().
I am running a long Python program which prints values to a .txt file in an iterative way. I am trying to read the values using the terminal commands gedit/tail/less and trying to plot them in Gnuplot. But I am not able to read the .txt file until the whole execution is over. What is the correct way to handle the file in this case?
The data is written to the file when the file is closed, when flush() is called, or when the internal buffer fills up.
That is, even when you call file.write("something"), "something" may not actually be written to the file until you close the file or the with block is over.
with open("temp.txt","w") as w:
    w.write("hey")
    x=input("touch")
    w.write("\nhello")
    w.write(x)
Run this code and try to read the file before answering the "touch" prompt: it will be empty, but once the with block is over you can see the contents.
If you are going to access the file from many sources, then you have to be careful of this, and also not to modify it from multiple sources.
EDIT: I forgot to say, if you want some other program to read the file while you are writing to it, you have to either call flush() after each write or keep closing the file and reopening it in append mode.
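For what it's worth, a flush() may already be enough for another reader to see the data, without closing the file. A small sketch, assuming default buffering and a regular file on disk; read_back is just a helper made up here to play the role of the other program:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "temp.txt")


def read_back(p):
    """Read the file through a separate handle, like another program would."""
    with open(p) as reader:
        return reader.read()


with open(path, "w") as w:
    w.write("hey")
    # With default buffering this is normally still empty, because "hey"
    # is sitting in Python's buffer and has not reached the file yet.
    before_flush = read_back(path)
    w.flush()  # push the buffered text to the operating system
    after_flush = read_back(path)  # the other reader can now see "hey"

print(repr(before_flush), repr(after_flush))
```

So for tail or Gnuplot to follow the file while the script runs, calling flush() after each write (or each batch of writes) is an alternative to reopening the file.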
I am writing a program with a while loop, which would write a huge amount of data into a CSV file. There may be more than 1 million rows.
Considering running time, memory usage, debugging and so on, what is the better option between the two:
Open a CSV file, keep it open and write line by line until all 1 million rows are written
Open a file, write about 100 lines, close(), open again, write about 100 lines, ......
I guess I just want to know whether it would take more memory if we keep the file open all the time. And which one will take longer?
I can't run the code to compare because I'm using a VPN for the code, and testing it would cost too much $$ for me. So just some rules of thumb would be enough for this thing.
I believe write() hands the data to a small buffer that is written out to disk as it fills, so there isn't any benefit that I can see from closing and reopening the file. The file isn't loaded into memory when it's opened; you essentially just get a handle to the file, and then read or write a portion of it at a time.
Edit
To be more explicit, no, opening a large file will not use a large amount of memory. Similarly writing a large amount of data will not use a large amount of memory as long as you don't hold the data in memory after it has been written to the file.
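As a rough illustration of the "keep it open and write line by line" option: the sketch below uses a generator so that only one row is in memory at a time. generate_rows, the column names and the row count are all made up for the example.

```python
import csv
import os
import tempfile


def generate_rows(n):
    """Hypothetical row source; yields one row at a time instead of a list."""
    for i in range(n):
        yield (i, i * i)


path = os.path.join(tempfile.mkdtemp(), "big.csv")

# newline="" is what the csv module documentation recommends for files
# passed to csv.writer, so that line endings are not translated twice.
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["n", "n_squared"])  # header row
    for row in generate_rows(100_000):
        writer.writerow(row)  # each row can be discarded after this call
```

Memory use stays flat here regardless of the row count, because neither the file nor the rows are ever held in memory as a whole.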
I've seen a few questions related to this but nothing that definitively answers my question.
I have a short python script that does some simple tasks, then outputs some text to a log file, waits for more input, and loops.
At times, the file is opened in write mode ("w") and other times it is opened in append mode ("a") depending on the results of the other tasks. For the sake of simplicity let's say it is in write mode/append mode 50/50.
I am opening files by saying:
with open(fileName, mode) as file:
and writing to them by saying:
file.write(line)
While these files are being opened, written to, appended to, etc., I expect a command prompt to be doing some read activities on them (findstr, specifically).
1) What's going to happen if my script tries to write to the same file the command window is reading from?
2) Is there a way to explicitly set the open to shared mode?
3) Does using the 'logger' module help at all/handle this instead of just manually making my own log files?
Thanks
What you are referring to is generally called a "race condition" where two programs are trying to read / write the same file at the same time. Some operating systems can help you avoid this by implementing a file-lock mutex system, but on most operating systems you just get a corrupted file, a crashed program, or both.
Here's an interesting article talking about how to avoid race conditions in python:
http://blog.gocept.com/2013/07/15/reliable-file-updates-with-python/
One suggestion that the author makes is to copy the file to a temp file, make your writes/appends there and then move the file back. Race conditions happen when files are kept open for a long time. This way you are never actually opening the main file for writing in Python, so the only point at which a collision could occur is during the OS copy / move operations, which are much faster.
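A minimal sketch of that copy, write, move-back idea, not taken from the article itself: append_via_temp_copy is a made-up name, and the temp file is created in the same directory as the target so that the final os.replace stays on one filesystem.

```python
import os
import tempfile


def append_via_temp_copy(path, new_lines):
    """Append new_lines to path by writing a temp copy and swapping it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            # Copy the existing contents, if any, into the temp file.
            if os.path.exists(path):
                with open(path) as original:
                    for line in original:
                        tmp.write(line)
            tmp.writelines(new_lines)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the copy is on disk first
        # Swap the finished copy in place of the original in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original untouched and clean up the partial copy.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


log_path = os.path.join(tempfile.mkdtemp(), "log.txt")
append_via_temp_copy(log_path, ["a\n", "b\n"])
append_via_temp_copy(log_path, ["c\n"])
```

Copying the whole file on every update gets expensive for large files, and on Windows the replace step can still fail if another program holds the target file open.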
I am writing a Python logger script which writes to a CSV file in the following manner:
Open the file
Append data
Close the file (I think this is necessary to save the changes, to be safe after every logging routine.)
PROBLEM:
The file is very much accessible through Windows Explorer (I'm using XP). If the file is opened in Excel, access to it is locked by Excel. When the script tries to append data, obviously it fails, then it aborts altogether.
OBJECTIVE:
Is there a way to lock a file using Python so that any access to it remains exclusive to the script? Or perhaps my methodology is poor in the first place?
Rather than closing and reopening the file after each access, just flush its buffer:
theloggingfile.flush()
This way, you keep it open for writing in Python, which should lock the file from other programs opening it for writing. I think Excel will be able to open it as read-only while it's open in Python, but I can't check that without rebooting into Windows.
EDIT: I don't think you need the step below. .flush() should send it to the operating system, and if you try to look at it in another program, the OS should give it the cached version. Use os.fsync to force the OS to really write it to the hard drive, e.g. if you're concerned about sudden power failures.
os.fsync(theloggingfile.fileno())
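Put together, a logging routine that keeps the file open and flushes after each line might look like this. log_line and the file name are made up for the sketch, and the fsync call is only worth its cost if you care about surviving a sudden power failure.

```python
import os
import tempfile


def log_line(log_file, line):
    """Write one line and push it all the way to disk."""
    log_file.write(line + "\n")
    log_file.flush()             # Python's buffer -> operating system
    os.fsync(log_file.fileno())  # operating system cache -> disk


path = os.path.join(tempfile.mkdtemp(), "log.csv")

with open(path, "a") as theloggingfile:
    for i in range(3):
        log_line(theloggingfile, f"reading,{i}")
    # The file is still open here, yet a second handle already sees the data.
    with open(path) as reader:
        seen_while_open = reader.read()

print(seen_while_open)
```

Calling fsync on every line is slow on most disks, so for high-volume logging it may be better to fsync only every so often.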
As far as I know, Windows does not support file locking. In other words, applications that don't know about your file being locked can't be prevented from reading a file.
But the remaining question is: how can Excel accomplish this?
You might want to try writing to a temporary file first (one that Excel does not know about) and replacing the original file with it later on.