How to read and write numpy arrays to protected file - python

I would like several processes running in parallel to read and write to the same numpy array. To avoid problems, where two processes try to read/write to the same memory, I need to protect the file I am writing to. How do I do that?
I assume that np.savetxt does not protect the file. I have tried the library portalocker. But by opening a file and locking it, np.savetxt is not allowed to write to the file.

See this question "Downloading over 1000 files in python" (link) for examples of using a worker thread pool.
Basically, you split up all of the work beforehand, put it into a queue, and let a pool of worker threads process each piece. The workers put their results onto another queue, which a separate thread processes to put all of the pieces together.
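A minimal sketch of that pattern, using only the standard library. The names `process`, `worker`, `collector` and `run` are hypothetical, and squaring a number stands in for the real work on each piece:

```python
import queue
import threading

_STOP = object()  # sentinel telling a thread that no more items will arrive


def process(item):
    # Hypothetical stand-in for the real work done on one piece.
    return item * item


def worker(work_q, result_q):
    # Pull pieces of work until the sentinel shows up.
    while True:
        item = work_q.get()
        if item is _STOP:
            break
        result_q.put((item, process(item)))


def collector(result_q, results):
    # Only this thread touches `results`, so it needs no lock.
    while True:
        entry = result_q.get()
        if entry is _STOP:
            break
        key, value = entry
        results[key] = value


def run(items, n_workers=4):
    work_q, result_q = queue.Queue(), queue.Queue()
    results = {}
    # Split up all of the work beforehand and queue it.
    for item in items:
        work_q.put(item)
    for _ in range(n_workers):
        work_q.put(_STOP)  # one sentinel per worker
    workers = [
        threading.Thread(target=worker, args=(work_q, result_q))
        for _ in range(n_workers)
    ]
    collect = threading.Thread(target=collector, args=(result_q, results))
    for t in workers:
        t.start()
    collect.start()
    for t in workers:
        t.join()
    result_q.put(_STOP)  # all workers are done, so the collector can stop
    collect.join()
    return results
```

Because the collector is the only thread that assembles results, it would also be the only place that writes the output file (for example, a single np.savetxt call), which avoids the need for a file lock in this design.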

Related

Do I need FileLock to read a file that is being written by other batch of processes using file-locking?

I have a bunch of processes accessing and modifying a csv file from Python at the same time (safely doing so with FileLock). I access the file using the Pandas package.
I would like to launch another batch of processes which are going to access this same file (reading) but without modifying it.
Can I access this file without using a locking system, and still have the first writing processes unaffected by this?
In case that it is okay to do this, will the reading of the file have to wait for a writing process to finish if it is locked?
I must note, in case that it is not clear, that I do not want the writing processes to be affected by the reading ones. However, the modifications done by the writing processes on the file are not relevant for the reading processes in my case, so I do not mind in which stage of writing they are.

Is it possible for multiple processes to simultaneously only read (not write to) from a file in Python?

I have spawned multiple processes using multiprocessing.Process in a loop and each process is trying to read the same file. Will this cause an issue? References to the answers are most welcome.
It's no problem. This applies not only to files but also to RAM. You only get into trouble when someone is writing.
The phenomenon is called a data race (emphasis mine):
access [==read] a memory location at the same time that a memory operation in another thread is writing to that memory location

Read a file multi-threaded in python in chunks of 2KB.

I have to read a file in chunks of 2KB and do some operation on those chunks. Where I'm stuck is making the data handling thread-safe. From what I've seen in online tutorials and StackOverflow answers, we define a worker thread and override its run method. The run method takes its data from a queue that we pass in as an argument. But to load that queue with data, I have to go through the file serially, which eliminates parallelism. I want multiple threads to read the file in parallel, so the read would have to happen inside the run method itself. I'm not sure how to go about that. Help needed.
Reading the file serially is your best option since (hardware wise) it gives you the best read throughput.
Usually the slow part is not in the data reading but in its processing...
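One way to act on this is to keep the read serial and hand only the processing to a thread pool. The following is a sketch under assumptions: `read_chunks`, `handle_chunk` and `process_file` are hypothetical names, and `handle_chunk` merely returns the chunk length in place of the real operation:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 2048  # 2KB, as in the question


def read_chunks(path, chunk_size=CHUNK_SIZE):
    # Serial read: a single file handle, one chunk after another.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def handle_chunk(chunk):
    # Hypothetical placeholder for the real per-chunk operation.
    return len(chunk)


def process_file(path, n_workers=4):
    # The pool runs handle_chunk concurrently; results keep file order.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(handle_chunk, read_chunks(path)))
```

Note that Executor.map submits every chunk up front, so the whole file ends up queued in memory. For very large files a bounded queue.Queue between the reader and the workers would probably be a better fit. For CPU-bound work in CPython, threads may not bring a speedup because of the GIL.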

What's the Best Way to Schedule and Manage Multiple Processes in Python 3

I'm working on a project in Python 3 that involves reading lines from a text file, manipulating those lines in some way, and then writing the results of said manipulation into another text file. Implementing that flow in a serial way is trivial.
However, running every step serially takes a long time (I'm working on text files that are several hundred megabytes/several gigabytes in size). I thought about breaking up the process into multiple, actual system processes. Based on the recommended best practices, I'm going to use Python's multiprocessing library.
Ideally, there should be one and only one Process to read from and write to the text files. The manipulation part, however, is where I'm running into issues.
When the "reader process" reads a line from the initial text file, it places that line in a Queue. The "manipulation processes" then pull that line from the Queue, do their thing, then put the result into yet another Queue, which the "writer process" then takes and writes to another text file. As it stands right now, the manipulation processes simply check to see if the "reader Queue" has data in it, and if it does, they get() the data from the Queue and do their thing. However, those processes may be running before the reader process runs, thus causing the program to stall.
What, in your opinions, would be the "Best Way" to schedule the processes in such a way so the manipulation processes won't run until the reader process has put data into the Queue, and vice-versa with the writer process? I considered firing off custom signals, but I'm not sure if that's the most appropriate way forward. Any help will be greatly appreciated!
If I were you, I would separate the tasks of dividing your file into tractable chunks and the compute-intensive manipulation part. If that is not possible (for example, if lines are not independent for some reason), then you might have to do a purely serial implementation anyway.
Once you have N chunks in separate files, you can just start your serial manipulation script N times, once for each chunk. Afterwards, combine the output back into one file. If you do it this way, no queue is needed and you will save yourself some work.
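The same split, process, combine idea can be sketched inside one program. Note that this is a variation and not the approach described above: it keeps the chunks in memory and uses multiprocessing.Pool instead of separate files and N script runs. All function names are hypothetical, and upper-casing stands in for the real manipulation:

```python
import multiprocessing


def split_into_chunks(lines, n_chunks):
    # Ceiling division so that at most n_chunks chunks are produced.
    size = max(1, -(-len(lines) // n_chunks))
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def manipulate_chunk(chunk):
    # Hypothetical placeholder: the real per-line manipulation goes here.
    return [line.upper() for line in chunk]


def run_parallel(lines, n_chunks=4):
    chunks = split_into_chunks(lines, n_chunks)
    # Each chunk is processed independently in its own worker process.
    with multiprocessing.Pool(processes=n_chunks) as pool:
        processed = pool.map(manipulate_chunk, chunks)
    # pool.map preserves chunk order, so the output lines stay in order.
    return [line for chunk in processed for line in chunk]
```

When run as a script, the call to run_parallel should sit under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that start workers by spawning a fresh interpreter. This only works if the lines are independent of each other, as noted above.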
You're describing a task queue. Celery is a task queue: http://www.celeryproject.org/

Multiple threads reading from single folder on Linux

My project needs multiple threads reading files from the same folder. This folder has incoming files, and each file should be processed by only one of those threads. Afterwards, the thread that read the file deletes it.
EDIT after the first answer: I don't want a single thread in charge of reading filenames and feeding those names to other threads, so that they can read it.
Is there any efficient way of achieving this in python?
You should probably use the Queue module. From the docs:
The Queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads.
I would use a FIFO approach, with a thread in charge of checking for inbound files and queuing them, and a number of workers processing them. A LIFO approach or an approach in which priority is assigned with a custom method are also supported by the module.
EDIT: If you don't want to use the Queue module and you are on a *nix system, you could use fcntl.lockf instead. An alternative, on platforms that define the flag (BSD and macOS, but not Linux), is to open the files with os.open('filename', os.O_RDONLY | os.O_EXLOCK).
Depending on how often you perform this operation, you might find it slower than using Queue, as you will have to account for race conditions (i.e.: you might acquire the name of the file to open, but the file might get locked by another thread before you get a chance to open it, throwing an exception that you will have to trap). Queue is there for a reason! ;)
EDIT2: Comments in this and other questions are bringing up the problem with simultaneous disk access to different files and the consequent performance hit. I was thinking that task_done would have been used for preventing this, but reading others' comments it occurred to me that instead of queuing file names, one could queue the files' content directly. This second alternative would work only for a limited number of queued files of limited size, given that RAM would fill up rather quickly otherwise.
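A sketch of what trapping those races could look like on Linux. It deliberately uses fcntl.flock rather than fcntl.lockf: POSIX record locks taken through lockf belong to the whole process, so they would not keep two threads of the same process apart, whereas flock locks are tied to each separate open() of the file. The names `try_claim` and `process_and_delete` are hypothetical:

```python
import fcntl
import os


def try_claim(path):
    """Return an exclusively locked file object, or None if unavailable."""
    try:
        f = open(path, "rb")
    except FileNotFoundError:
        return None  # another worker already processed and deleted it
    try:
        # Non-blocking: fail immediately if someone else holds the lock.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        f.close()
        return None  # another worker is processing this file right now
    if os.fstat(f.fileno()).st_nlink == 0:
        # The file was deleted between our open() and our lock
        # (assumes the incoming files have no extra hard links).
        f.close()
        return None
    return f


def process_and_delete(path, handle):
    f = try_claim(path)
    if f is None:
        return False
    try:
        handle(f.read())  # `handle` is the caller's processing function
        os.remove(path)
    finally:
        f.close()  # closing the file releases the lock
    return True
```

Each worker can then loop over os.listdir() and call process_and_delete on every name, skipping the files for which it returns False.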
I'm unaware if RAID and other parallel disk configurations would already take care of reading one file per disk rather than bouncing back and forth between two files on both disks.
HTH!
If you want multiple threads to read several files directly from the same folder in parallel, then I must disappoint you. Reading in parallel from a single disk is not a viable option. A single disk needs to spin and seek to the next location to be read. If you're reading with multiple threads, you are just bouncing the disk head around between seeks, and the performance is much worse than a simple sequential read.
Just stick to mac's advice and use a single thread for reading.
