Make a huge file persist in memory (Python)

I have a Python script that needs to read a huge file into a variable and then search it and do other work.
The problem is that the web server calls this script multiple times, and every call has a latency of around 8 seconds while the file loads.
Is it possible to make the file persist in memory so it can be accessed faster on later calls?
I know I could run the script as a service using supervisor, but I can't do that in this case.
Any other suggestions, please?
PS: I am already using var = pickle.load(open(file))

You should take a look at h5py (http://docs.h5py.org/en/latest/). It lets you perform various operations on huge files without loading them entirely into memory. It's what NASA uses.
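For example (untested sketch, assuming the data can be stored once as a NumPy array; file and dataset names are made up):

import h5py
import numpy as np

# One-time conversion (hypothetical names): store the big array in HDF5
# instead of a pickle.
data = np.random.rand(10000000)            # stand-in for the real data
with h5py.File("big_data.h5", "w") as f:
    f.create_dataset("data", data=data)

# On each web request: opening the file is cheap, and slicing reads only
# the requested part from disk instead of the whole thing.
with h5py.File("big_data.h5", "r") as f:
    chunk = f["data"][1000:2000]           # loads just this slice into memory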

Not an easy problem. I assume you can do nothing about the fact that your web server calls your application multiple times. In that case I see two solutions:
(1) Write TWO separate applications. The first application, A, loads the large file and then it just sits there, waiting for the other application to access the data. "A" provides access as required, so it's basically a sort of custom server. The second application, B, is the one that gets called multiple times by the web server. On each call, it extracts the necessary data from A using some form of interprocess communication. This ought to be relatively fast. The Python standard library offers some tools for interprocess communication (socket, http server) but they are rather low-level. Alternatives are almost certainly going to be operating-system dependent. A rough sketch of this approach is shown below, after option (2).
(2) Perhaps you can pre-digest or pre-analyze the large file, writing out a more compact file that can be loaded quickly. A similar idea is suggested by tdelaney in his comment (some sort of database arrangement).
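Here is that rough sketch of option (1), using multiprocessing.managers.BaseManager from the standard library. The file name, port, authkey and the dict-like data are all assumptions, not the asker's actual setup:

# server.py -- run once and leave running; keeps the pickled data in memory.
import pickle
from multiprocessing.managers import BaseManager

with open("huge_file.pkl", "rb") as f:      # hypothetical file name
    DATA = pickle.load(f)                   # assumed here to be a large dict

class DataManager(BaseManager):
    pass

# Clients get a proxy to DATA; method calls on the proxy run in this process.
DataManager.register("get_data", callable=lambda: DATA)

if __name__ == "__main__":
    manager = DataManager(address=("127.0.0.1", 50000), authkey=b"secret")
    manager.get_server().serve_forever()

# client.py -- what the web server invokes; no slow reload of the file here.
from multiprocessing.managers import BaseManager

class DataManager(BaseManager):
    pass

DataManager.register("get_data")
manager = DataManager(address=("127.0.0.1", 50000), authkey=b"secret")
manager.connect()
data = manager.get_data()          # proxy to the object held by the server
print(data.get("some_key"))        # each call is a small IPC round trip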

You are talking about memory-caching a large array, essentially…?
There are three fairly viable options for large arrays:
use memory-mapped arrays
use h5py or pytables as a back-end
use an array caching-aware package like klepto or joblib.
Memory-mapped arrays index the array on disk as if it were in memory.
h5py or pytables give you fast access to arrays on disk, and can also avoid loading the entire array into memory. klepto and joblib can store arrays as a collection of "database" entries (typically a directory tree of files on disk), so you can load portions of the array into memory easily. Each has a different use case, so the best choice for you depends on what you want to do. (I'm the klepto author, and it can use SQL database tables as a backend instead of files.)
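For the memory-mapped option, a minimal sketch with numpy.memmap (file name, dtype and shape are just placeholders):

import numpy as np

# Create the disk-backed array once (hypothetical dtype/shape).
arr = np.memmap("big_array.dat", dtype="float64", mode="w+", shape=(100000, 100))
arr[0, :] = 1.0              # writes go through to the file
arr.flush()
del arr

# Later, e.g. on each web request: reopen without reading the whole file.
arr = np.memmap("big_array.dat", dtype="float64", mode="r", shape=(100000, 100))
row = arr[42]                # only this part is actually read from disk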

Related

Reading multiple (CERN) ROOT files into NumPy array - using n nodes and say, 2n GPUs

I am reading many (say 1k) CERN ROOT files using a loop and storing some data into a nested NumPy array. The use of a loop makes it a serial task, and each file takes quite some time to process. Since I am working on a deep learning model, I must create a large enough dataset, but the reading time itself is taking very long (reading 835 events takes about 21 minutes). Can anyone please suggest whether it is possible to use multiple GPUs to read the data, so that less time is required for reading? If so, how?
Adding some more details: I pushed the program to GitHub so that it can be seen (please let me know if posting a GitHub link is not allowed; in that case, I will post the relevant portion here):
https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py
I run the program as:
python read_n_train.py <input-file-list>
where the argument is a text file containing the list of files with their paths. I was opening the ROOT files in a loop in the read_data_into_list() function. But as I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed gets worse as more and more data is read.
Meanwhile, I tried to use the slurmpy package: https://github.com/brentp/slurmpy
With this, I can distribute the job across, say, N worker nodes. In that case, each individual reading program will read the file assigned to it and return a corresponding list. It is just that, in the end, I need to add the lists together, and I couldn't figure out a way to do this.
Any help is highly appreciated.
Regards,
Kolahal
You're looping over all the events sequentially from Python; that's probably the bottleneck.
You can look into root_numpy to load the data you need from the ROOT file into NumPy arrays:
root_numpy is a Python extension module that provides an efficient interface between ROOT and NumPy. root_numpy’s internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations.
I'm also currently looking at root_pandas, which seems similar.
While this solution does not precisely answer the request for parallelization, it may make parallelization unnecessary. And if it is still too slow, it can still be run in parallel using slurm or something else.
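A rough sketch of what that could look like with root_numpy (the tree and branch names here are made up; substitute your own):

import numpy as np
from root_numpy import root2array

# Read the needed branches from many ROOT files in one call; the loop over
# events happens in compiled C++ rather than in Python.
filenames = open("input_file_list.txt").read().split()
data = root2array(filenames, treename="events", branches=["x", "y", "energy"])
print(data.shape, data.dtype)   # structured NumPy array, one row per event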

Why is protobuf bad for large data structures?

I'm new to protobuf. I need to serialize a complex graph-like structure and share it between C++ and Python clients.
I'm trying to apply protobuf because:
It is language-agnostic and has generators for both C++ and Python.
It is binary; I can't afford text formats because my data structure is quite large.
But Protobuf user guide says:
Protocol Buffers are not designed to handle large messages. As a
general rule of thumb, if you are dealing in messages larger than a
megabyte each, it may be time to consider an alternate strategy.
https://developers.google.com/protocol-buffers/docs/techniques#large-data
I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.
Why is protobuf bad for serializing large datasets? What should I use instead?
It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protocol buffers based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf and the graphs it stores are often up to 1 GB in size.
However, OpenStreetMap does not have the entire file as a single message. Instead it consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message only encodes e.g. one node.
The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.
If you need a random access format that is compatible across many languages, I would suggest HDF5 or sqlite.
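To illustrate that splitting approach, here is a rough sketch of the usual length-prefix framing; the Node message type and the my_graph_pb2 module are hypothetical:

import struct
from my_graph_pb2 import Node   # hypothetical generated module

def write_messages(path, nodes):
    # Write each message with a 4-byte length prefix so it can be read back
    # one at a time instead of as a single giant message.
    with open(path, "wb") as f:
        for node in nodes:
            payload = node.SerializeToString()
            f.write(struct.pack("<I", len(payload)))
            f.write(payload)

def read_messages(path):
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (size,) = struct.unpack("<I", header)
            node = Node()
            node.ParseFromString(f.read(size))
            yield node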
It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.
The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only have to have part of the data in memory at once.
If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.

Live-analysis of simulation data using pytables / hdf5

I am working on some CFD simulations with C/CUDA and Python; at the moment the workflow goes like this:
Start a simulation written in pure c / cuda
Write output to a binary file
Reopen the files with Python, i.e. numpy.fromfile, and do some analysis.
Since I have a lot of data and also some metadata, I thought it would be better to switch to the HDF5 file format. So my idea was something like this:
Create some initial conditions data for my simulations using pytables.
Reopen and write to the datasets in C by using the standard HDF5 library.
Reopen files using pytables for analysis.
I would really like to do some live analysis of the data, i.e. write from the C program to HDF5 and read directly from Python using pytables. This would be pretty useful, but I am really not sure how well this is supported by pytables.
Since I have never worked with pytables or HDF5, it would be good to know whether this is a good approach or whether there are pitfalls.
I think it is a reasonable approach, but there is a pitfall indeed. The HDF5 C library is not thread-safe (there is a "parallel" version, more on this later). That means your scenario does not work out of the box: one process writing data to a file while another process is reading (not necessarily the same dataset) will result in a corrupted file. To make it work, you must either:
implement file locking, making sure that no process is reading while the file is being written to, or
serialize access to the file by delegating reads/writes to a distinguished process. You must then communicate with this process through some IPC technique (Unix domain sockets, ...). Of course, this might affect performance because data is being copied back and forth.
Recently, the HDF group published an MPI-based parallel version of HDF5, which makes concurrent read/write access possible. Cf. http://www.hdfgroup.org/HDF5/PHDF5/. It was created for use cases like yours.
To my knowledge, pytables does not provide any bindings to parallel HDF5. You should use h5py instead, which provides very user-friendly bindings to parallel HDF5. See the examples on this website: http://docs.h5py.org/en/2.3/mpi.html
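A minimal sketch along the lines of those examples, assuming h5py was built against parallel HDF5 (the file and dataset names are made up):

# Run with e.g.: mpiexec -n 4 python demo_parallel_hdf5.py
from mpi4py import MPI
import h5py

rank = MPI.COMM_WORLD.rank
size = MPI.COMM_WORLD.size

# Every rank opens the same file collectively through the MPI-IO driver.
with h5py.File("parallel_test.h5", "w", driver="mpio", comm=MPI.COMM_WORLD) as f:
    dset = f.create_dataset("test", (size,), dtype="i")
    dset[rank] = rank    # each process writes its own element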
Unfortunately, parallel HDF5 has a major drawback: to date, it does not support writing compressed datasets (reading is possible, though). Cf. http://www.hdfgroup.org/hdf5-quest.html#p5comp

How do I use Python with a database?

I wrote a Python program that handles very large data. As it processes the data, it puts the processed data into an array, which easily grows to hundreds of megabytes or even over a gigabyte.
The reason I set it up like that is that Python needs to continuously access the data in the array. As the array gets larger and larger, the process becomes error-prone and very slow.
Is there a way to store the array-like data in a file or database module and access it on an as-needed basis?
Perhaps this is a very basic task, but I have no clue.
You can use sqlite3 if you want; it is part of the Python standard library and is the simplest option for basic usage.
If you need a full database server instead, there are also MySQL and Postgres drivers for Python.
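A small sketch of the sqlite3 route (table and column names are invented): write processed rows to a database file as you go, and read back only what you need.

import sqlite3

conn = sqlite3.connect("processed_data.db")
conn.execute("CREATE TABLE IF NOT EXISTS results (id INTEGER PRIMARY KEY, value REAL)")

# Instead of appending to a huge in-memory list, insert as you process.
conn.executemany("INSERT INTO results (value) VALUES (?)", [(v,) for v in (1.5, 2.5, 3.5)])
conn.commit()

# Later, pull back only what you need instead of holding everything in RAM.
for row in conn.execute("SELECT value FROM results WHERE value > ?", (2.0,)):
    print(row[0])
conn.close()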

How to speed up the code?

In my program I have a method which requires about 4 files to be opened each time it is called, as I need to take some data from them. I have been storing all this data from the files in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there a better way of handling these files? And is storing the whole data in a list time-consuming; is there a better alternative to a list?
I can give some code, but my previous question was closed because the code only confused everyone: it is part of a big program and would need to be explained completely to be understood. So I am not giving any code; please suggest approaches, treating this as a general question.
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static, and relatively small. Then, the 10k calls will read an in-memory cache rather than a file. Much faster.
If you are modifying the data, the alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading/manipulation/storage of this data, and they usually have their own internal in-memory cache; all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
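A rough sketch of that caching idea (file names and the parsing step are placeholders): load each file once, keep the parsed result in a module-level dict, and have the method read from the cache.

# Module-level cache: each file is parsed once, then reused on later calls.
_cache = {}

def load_data(path):
    if path not in _cache:
        with open(path) as f:
            _cache[path] = f.read().splitlines()   # replace with real parsing
    return _cache[path]

def my_method():
    # Called ~10,000 times, but the files are only read from disk once.
    a = load_data("file_a.txt")
    b = load_data("file_b.txt")
    # ... work with a and b ...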
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on the database. Then it will be very fast to make simple queries against your data. You don't need a lot of work to set up a database. You can create an SQLite database without requiring a separate process and it doesn't have a complicated installation process.
Open the files in the calling method (the one that calls this method 10,000 times) and pass the data in as parameters, instead of reopening the files inside the method each time.
If the files are structured, like configuration files, it might be good to use the ConfigParser library; if the data has some other structure, it might be better to store it as JSON or XML and perform the necessary operations on that.
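For the ConfigParser case, a small sketch (the file, section and option names are invented):

import configparser

config = configparser.ConfigParser()
config.read("settings.ini")                   # parsed once, kept in memory

host = config.get("server", "host")           # string value
timeout = config.getint("server", "timeout")  # converted to int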
