I am running some numerical code in Python (using numpy) on a Linux cluster, submitted as an array job. This means the same code runs in thousands of parallel instances, each generating some data (from a random input) and saving it as a .npy file.
To avoid producing thousands of output files, I want to save all the data to the same file. However, I'm worried that if two instances try to write to that file simultaneously, it might get corrupted.
What is the best way to do this?
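One common pattern that avoids concurrent writes entirely is to let each task write its own part file, named by its array index, and then merge everything in a single post-processing step once the whole array job has finished. A minimal sketch, assuming a SLURM array job (other schedulers expose a similar index variable):

import glob
import os
import numpy as np

# Each task writes its own part file, keyed by the array index
# (SLURM_ARRAY_TASK_ID is an assumption; adapt for your scheduler).
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
os.makedirs("results", exist_ok=True)

data = np.random.rand(1000)  # stand-in for the real computation
np.save(f"results/part_{task_id:05d}.npy", data)

# Run once, after the whole array job has completed, to get a single file:
def merge(parts_dir="results", out="all_results.npy"):
    parts = sorted(glob.glob(os.path.join(parts_dir, "part_*.npy")))
    np.save(out, np.stack([np.load(p) for p in parts]))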
The context:
I am using PyArrow to read a folder structured as exchange/symbol/date.parquet. The folder contains multiple exchanges, multiple symbols and multiple files; at the time of writing it is about 30 GB across roughly 1.85M files.
If I use a single PyArrow Dataset to read/manage the entire folder, the simplest process with just the dataset defined occupies 2.3 GB of RAM. The problem is that I am instantiating this dataset in multiple processes, but since every process only needs some exchanges (typically just one), I don't need to read all folders and files in every single process.
So I tried to use a UnionDataset composed of single-exchange Datasets. That way, every process only loads the required folders/files as a dataset. In a simple test, each process now occupies just 868 MB of RAM, a 63% reduction.
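For reference, the per-exchange setup looks roughly like this (a sketch; the root path, exchange names and the "price" column are made up):

import pyarrow.dataset as ds

# Only the exchanges this particular process needs.
exchanges = ["exchange_a", "exchange_b"]

# One Dataset per exchange folder.
children = [ds.dataset(f"root/{exchange}", format="parquet") for exchange in exchanges]

# Passing a list of Datasets to ds.dataset() returns a UnionDataset.
union = ds.dataset(children)

# Filtered reads are then done against the union, e.g.:
table = union.to_table(filter=ds.field("price") > 0)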
The problem:
When using a single Dataset for the entire folder/files, I have no problem at all. I can read filtered data without issues and it's very fast.
But when I read filtered data from the UnionDataset, I always get a "Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)" error. After looking into every possible source of the problem, I noticed that if I create a dummy folder with multiple exchanges but only some symbols, to limit the number of files to read, I don't get that error and everything works normally. If I then copy in more symbol folders (any of them), I get the error again.
So I've come to think that the problem is not in my code, but is instead linked to the number of files that the UnionDataset is able to manage.
Am I correct or am I doing something wrong? Thank you all, have a nice day and good work.
I use PySpark
and MLUtils.saveAsLibSVMFile to save an RDD of LabeledPoints.
It works, but it leaves the output scattered across all the worker nodes as many files under /_temporary/.
No error is thrown. I would like to save the files in the proper folder, and preferably to have all the output in a single libsvm file located on the nodes or on the master.
Is that possible?
Edit:
No matter what I do, I can't use MLUtils.loadLibSVMFile() to load the libsvm data from the same path I used to save it. Maybe something is wrong with how the file is written?
This is normal behavior for Spark. All writing and reading is performed in parallel directly on the worker nodes; data is not passed to or from the driver node.
This is why reading and writing should be done through storage that can be accessed from every machine, like a distributed file system, an object store or a database. Using Spark with a local file system has very limited applications.
For testing you can use a network file system (it is quite easy to deploy), but it won't work well in production.
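As a rough sketch of the single-output-file route (the HDFS path is an assumption; any storage reachable from every node works), coalesce(1) forces a single partition, so the output directory contains a single part file, at the cost of funnelling all data through one task:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

sc = SparkContext.getOrCreate()

# Tiny stand-in RDD; in practice this is your RDD of LabeledPoints.
points = sc.parallelize([
    LabeledPoint(1.0, [0.5, 1.2]),
    LabeledPoint(0.0, [0.1, 0.4]),
])

# Write to storage every node can reach; coalesce(1) yields one part-* file.
MLUtils.saveAsLibSVMFile(points.coalesce(1), "hdfs:///data/points_libsvm")

# Load back from the same shared path.
loaded = MLUtils.loadLibSVMFile(sc, "hdfs:///data/points_libsvm")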
I have a Python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
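For completeness, a minimal sketch of that reload approach (mymodule is a hypothetical module you are editing):

import importlib
import mymodule  # hypothetical module containing the code you keep changing

# Keep the interpreter (and the already-read data) alive, edit mymodule.py,
# then pull in the new code without restarting:
importlib.reload(mymodule)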
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to demonstrate the bug you are observing? Most probably it is only a small part of your input file that is relevant.
Secondly, are these particular files required, or will the problem show up on any large amount of data? If it shows up only on particular files, then once again it is most probably related to some feature of those files and will also show up on a smaller file with the same feature. If the main reason is just the sheer amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the file? Is it just a hard-drive performance issue, or do you do some heavy processing of the read data in your script before actually coming to the part that generates problems? In the latter case, you might be able to do that processing once and write the results to a new file, then modify your script to load this processed data instead of redoing the processing each time (a sketch of this is below).
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
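A minimal sketch of the "process once, cache the result" idea, using /dev/shm as the cache location (read_and_process is a placeholder for your real loading code):

import os
import pickle

CACHE = "/dev/shm/preprocessed.pkl"  # tmpfs on Linux, so the cache lives in RAM

def read_and_process():
    # Placeholder for the slow part: reading the large files plus any
    # preprocessing that does not change between debugging runs.
    return {"example": list(range(10))}

def load_cached():
    # The first run pays the full cost and writes the cache;
    # later runs just unpickle the already-processed data.
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    data = read_and_process()
    with open(CACHE, "wb") as f:
        pickle.dump(data, f)
    return data

data = load_cached()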
I'm using the PsychoPy 1.82.01 Coder and its iohub functionality (on Ubuntu 14.04 LTS). It is working, but I was wondering whether there is a way to dynamically rename the hdf5 file it produces during an experiment (so that, in the end, I know which participant it belongs to, and two participants get two files without one overwriting the other).
It seems to me that the filename is determined in this file: https://github.com/psychopy/psychopy/blob/df68d434973817f92e5df78786da313b35322ae8/psychopy/iohub/default_config.yaml
But is there a way to change this dynamically?
If you want to create a different hdf5 file for each experiment run, then the options depend on how you are starting the ioHub process. Assuming you are using the psychopy.iohub.launchHubServer() function to start ioHub, then you can pass the 'experiment_code' kwarg to the function and that will be used as the hdf5 file name.
For example, if you created a script with the following code and ran it:
import psychopy.iohub as iohub
io = iohub.launchHubServer(experiment_code="exp_sess_1")
# your experiment code here ....
# ...
io.quit()
An ioHub hdf5 file called 'exp_sess_1.hdf5' will be created in the same folder as the script file.
As a side note, you do not have to save each experiment sessions data into a separate hdf5 file. The ioHub hdf5 file structure is designed to save multiple participants / sessions data in a single file. Each time the experiment is run, a unique session code is required, and the data from each run is saved in the hdf5 file with a session id that is associated with the session code.
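As a sketch of that multi-session, single-file route (the session_code keyword is passed the same way as experiment_code; the participant id here is illustrative):

import psychopy.iohub as iohub

# One hdf5 file per experiment; each run is stored under its own session code.
participant = "P01"  # e.g. collected from a session info dialog
io = iohub.launchHubServer(experiment_code="my_experiment",
                           session_code=participant)

# ... run the experiment ...

io.quit()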
I've accumulated a set of 500 or so files, each of which has an array and header that stores metadata. Something like:
2,.25,.9,26 #<-- header, which is actually cryptic metadata
1.7331,0
1.7163,0
1.7042,0
1.6951,0
1.6881,0
1.6825,0
1.678,0
1.6743,0
1.6713,0
I'd like to read these arrays into memory selectively. We've built a GUI that lets users select one or multiple files from disk, and each one is then read into the program. If users want to read in all 500 files, the program is slow because it opens and closes each file. So my question is: will it speed up my program to store all of these in a single structure, something like hdf5? Ideally, this would give faster access than the individual files. What is the best way to go about this? I haven't ever dealt with these kinds of considerations. What's the best way to speed up this bottleneck in Python? The total data is only a few megabytes; I'd even be amenable to keeping it in the program somewhere, not just on disk (but I don't know how to do this).
Reading 500 files in Python should not take much time, as the overall file size is only a few MB. The data structure within each file is plain and simple, so I don't expect parsing to take much time either.
If the actual slowness comes from opening and closing the files, there may be an OS-related issue (it may have very poor I/O).
Did you time how long it takes to read all the files?
You can also try using a small database such as SQLite, where you can store your file data and access the required parts on the fly.
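A rough sketch of the SQLite idea, with a made-up schema (one row per original file, the header kept as text and the two-column array stored as raw bytes):

import sqlite3
import numpy as np

conn = sqlite3.connect("traces.db")
conn.execute("""CREATE TABLE IF NOT EXISTS traces
                (name TEXT PRIMARY KEY, header TEXT, data BLOB)""")

def add_trace(name, header, array):
    # Store the 2-column float array as a raw float64 blob.
    conn.execute("INSERT OR REPLACE INTO traces VALUES (?, ?, ?)",
                 (name, header, array.astype(np.float64).tobytes()))
    conn.commit()

def get_trace(name):
    header, blob = conn.execute(
        "SELECT header, data FROM traces WHERE name = ?", (name,)).fetchone()
    return header, np.frombuffer(blob, dtype=np.float64).reshape(-1, 2)

add_trace("file_001", "2,.25,.9,26", np.array([[1.7331, 0.0], [1.7163, 0.0]]))
print(get_trace("file_001"))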