So I have a couple of questions:
Do reads and writes in Python using file.read() and file.write() happen under locks?
Or can a read and a write happen at the same time?
I have pickled an object to a file on disk. Now multiple threads can read the pickled model, but I also want to update the pickled file. One way is to lock the file on every read and write, but that would be inefficient, since locking the file when it is only being read by different processes is unnecessary.
How should I solve this?
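One common way to avoid locking every read in a situation like this is to make the update atomic at the file level: write the new pickle to a temporary file and rename it over the old one, so readers always see a complete file. A minimal sketch of that idea (the function names and paths are illustrative):

    import os
    import pickle
    import tempfile

    def load_model(path):
        # Readers just open and unpickle; no locking is needed as long as
        # updates are published with an atomic replace (see save_model).
        with open(path, "rb") as f:
            return pickle.load(f)

    def save_model(model, path):
        # Write the new pickle to a temporary file in the same directory,
        # then swap it into place.  os.replace is an atomic rename on POSIX,
        # so a concurrent reader sees either the old file or the new one,
        # never a half-written pickle.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "wb") as f:
                pickle.dump(model, f)
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise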
Related
I have a bunch of processes accessing and modifying a csv file from Python at the same time (safely doing so with FileLock). I access the file using the Pandas package.
I would like to launch another batch of processes which are going to access this same file (reading) but without modifying it.
Can I access this file without using a locking system, and still have the first writing processes unaffected by this?
If it is okay to do this, will reading the file have to wait for a writing process to finish while the file is locked?
To be clear, I do not want the writing processes to be affected by the reading ones. However, the modifications the writing processes make to the file are not relevant to the reading processes in my case, so I do not mind what stage of writing they are in.
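For concreteness, the setup being asked about looks roughly like this (file names are illustrative): writers serialize among themselves with FileLock, while readers deliberately skip the lock and therefore risk observing a partially written file.

    import pandas as pd
    from filelock import FileLock

    CSV_PATH = "shared.csv"          # illustrative paths
    LOCK_PATH = CSV_PATH + ".lock"

    def writer_update(new_rows: pd.DataFrame):
        # Writers take the FileLock so they never step on each other.
        with FileLock(LOCK_PATH):
            df = pd.read_csv(CSV_PATH)
            df = pd.concat([df, new_rows], ignore_index=True)
            df.to_csv(CSV_PATH, index=False)

    def reader_snapshot() -> pd.DataFrame:
        # A reader that skips the lock entirely.  It never blocks a writer,
        # but it may see a partially written file if a writer happens to be
        # in the middle of to_csv() at that moment.
        return pd.read_csv(CSV_PATH)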
I have a Python script that writes to disk every time it runs, which happens very frequently. It collects data that changes just as often, so I cannot change the rate at which the script runs. I would like to reduce the number of disk writes by only writing after every x minutes. I need the different script results to persist in memory over multiple script runs. The result data consists of lines of text, which are all strings. This script is running on Debian Linux.
Is there a way to store multiple strings/files directly in memory until I write them all to the disk at a later time?
Of course, there are many options for that. One way is to create a temporary string buffer (i.e. a container) where you collect the strings until the disk write is triggered. If the strings are more specific, timestamped, or similar, you can use a dictionary to keep track of their IDs as well.
You can use a buffer, queue, or pipe to hold the data, and then, whenever it suits you, write it out through the file pointer and flush() it to disk.
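Assuming the collection loop lives in a long-running process (so the buffer can survive between data points), a minimal sketch of the buffer-and-flush idea (the interval and output path are illustrative):

    import time

    FLUSH_INTERVAL = 10 * 60          # write to disk every 10 minutes (illustrative)
    _buffer = []                      # lines held in memory between flushes
    _last_flush = time.monotonic()

    def collect(line: str):
        # Called every time new data arrives; only appends to the buffer.
        global _last_flush
        _buffer.append(line)
        if time.monotonic() - _last_flush >= FLUSH_INTERVAL:
            flush("results.txt")      # illustrative output path
            _last_flush = time.monotonic()

    def flush(path: str):
        # Append everything buffered so far to the file in a single write.
        if not _buffer:
            return
        with open(path, "a") as f:
            f.write("\n".join(_buffer) + "\n")
            f.flush()                 # push Python's buffer through to the OS
        _buffer.clear()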
I have to read a file in chunks of 2KB and do some operation on those chunks. Where I'm actually stuck is making the data thread-safe. From what I've seen in online tutorials and StackOverflow answers, we define a worker thread and override its run method. The run method consumes data from a queue that we pass as an argument and that contains the actual data. But to load that queue with data, I'll have to go through the file serially, which eliminates parallelism. I want multiple threads to read the file in parallel, so I'll have to handle the reading inside the run method itself. But I'm not sure how to go about that. Help needed.
Reading the file serially is your best option, since (hardware-wise) it gives you the best read throughput.
Usually the slow part is not reading the data but processing it...
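In other words, let one thread do the sequential reads and hand the 2KB chunks to a pool of workers. A minimal sketch of that pattern (the worker count and names are illustrative):

    import threading
    from queue import Queue

    CHUNK_SIZE = 2048          # 2KB chunks, as in the question
    NUM_WORKERS = 4            # illustrative

    def process(chunk: bytes):
        pass                   # placeholder for the real per-chunk work

    def worker(q: Queue):
        while True:
            chunk = q.get()
            if chunk is None:  # sentinel: no more data
                q.task_done()
                break
            process(chunk)
            q.task_done()

    def run(path: str):
        q = Queue(maxsize=NUM_WORKERS * 2)   # bounded so the reader can't race too far ahead
        threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
        for t in threads:
            t.start()

        # One thread (here, the main one) reads the file serially...
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                q.put(chunk)                 # ...and the workers process chunks in parallel

        for _ in threads:
            q.put(None)                      # one sentinel per worker
        for t in threads:
            t.join()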
I have a thread that writes to a file (writeThread) periodically and another (readThread) that reads from the file asynchronously. Can readThread access the file using a different handle without messing anything up?
If not, does Python have a shared lock that can be used by writeThread but does not block readThread? I would rather not use a simple non-shared lock, because file access takes on the order of a millisecond and writeThread's write period is of the same order (the period depends on some external parameters). A situation may therefore arise where, even though writeThread releases the lock, it re-acquires it immediately and causes starvation.
One solution I can think of is to maintain multiple copies of the file, one for reading and another for writing, and avoid the whole situation altogether. However, the file sizes involved may become huge, which makes this method unattractive.
Are there any other alternatives, or is this a bad design?
Thanks
Yes, you can open the file multiple times and get independent access to it. Each file object has its own buffers and position, so, for instance, a seek on one will not mess up the other. It works much like access from multiple programs, and you have to be careful when reading and writing the same area of the file. For instance, a write that appends to the end of the file won't be seen by the reader until the writing object flushes. Rewrites of existing data won't be seen by the reader until both the reader and writer flush. Writes won't be atomic, so if you are writing records the reader may see partial records. Async select or poll events on the reader may be funky... not sure about that one.
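A small demonstration of that behaviour with two independent handles on the same local file (the file name is arbitrary):

    # Each handle has its own buffer and position.
    with open("demo.txt", "w"):
        pass                              # start from a known, empty file

    writer = open("demo.txt", "a")
    reader = open("demo.txt", "r")

    writer.write("first record\n")        # sits in the writer's buffer for now
    print(repr(reader.read()))            # '' -> the reader sees nothing yet

    writer.flush()                        # hand the data over to the OS
    reader.seek(0)                        # drop the reader's stale buffer
    print(repr(reader.read()))            # 'first record\n' -> now it is visible

    writer.close()
    reader.close()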
An alternative is mmap but I haven't used it enough to know the gotchas.
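For completeness, a minimal read-only mmap looks roughly like this (the file must be non-empty, and the same consistency caveats apply):

    import mmap

    with open("demo.txt", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            print(mm[:20])                # slice bytes without an explicit read()
            print(mm.find(b"record"))     # search the mapped file like a bytes object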
I have a program that wants to be called from the command line many times, but involves reading a large pickle file, so each call can be potentially expensive. Is there any way that I can make cPickle just mmap the file into memory rather than read it in its entirety?
You probably don't even need to do this explicitly as your OS's disk cache will probably do a damn good job already.
Any poor performance might actually be related to the cost of deserialization and not the cost of reading it off the disk. You can test this by creating a temporary ram disk and putting the file there.
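One rough way to see which cost dominates, short of setting up a ram disk, is to time the raw read and the unpickling separately (a sketch; the path is illustrative):

    import pickle
    import time

    path = "model.pkl"                 # illustrative

    t0 = time.perf_counter()
    with open(path, "rb") as f:
        raw = f.read()                 # pure I/O (likely served from the page cache)
    t1 = time.perf_counter()
    obj = pickle.loads(raw)            # pure deserialization
    t2 = time.perf_counter()

    print(f"read: {t1 - t0:.3f}s  unpickle: {t2 - t1:.3f}s")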
And the way to remove the cost of deserialization is to move the loading of the file into a separate Python process and call it as a service. Building a quick-and-dirty REST service in Python is super-easy and super-useful in these cases.
Take a look at the socket docs for how to do this with a raw socket. The echo server is a good example to start from.
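A rough sketch of the "load once, serve many" idea, using a plain socket in the spirit of the docs' echo server (the model path, port, and wire format are all illustrative assumptions):

    import pickle
    import socket

    HOST, PORT = "127.0.0.1", 50007    # illustrative

    with open("model.pkl", "rb") as f:
        model = pickle.load(f)         # the expensive load happens exactly once

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            with conn:
                query = conn.recv(4096).decode()
                # Replace this with whatever lookup or prediction you need;
                # here we just confirm the query was received.
                reply = f"model loaded, got query: {query!r}"
                conn.sendall(reply.encode())

Each command-line invocation then becomes a cheap client that connects, sends its query, and reads the reply, instead of re-reading and re-unpickling the whole file every time.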