Why is protobuf bad for large data structures? - python

I'm new to protobuf. I need to serialize a complex graph-like structure and share it between C++ and Python clients.
I'm trying to apply protobuf because:
It is language agnostic and has generators for both C++ and Python
It is binary. I can't afford text formats because my data structure is quite large
But Protobuf user guide says:
Protocol Buffers are not designed to handle large messages. As a
general rule of thumb, if you are dealing in messages larger than a
megabyte each, it may be time to consider an alternate strategy.
https://developers.google.com/protocol-buffers/docs/techniques#large-data
I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.
Why is protobuf bad at serializing large datasets? What should I use instead?

It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protocol buffers based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf and the graphs it stores are often up to 1 GB in size.
However, OpenStreetMap does not store the entire file as a single message. Instead, it consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message only encodes e.g. one node.
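As a rough illustration of that approach on the Python side (assuming a generated module graph_pb2 with a Node message; the names are hypothetical), each node can be written as its own length-prefixed message so that no single message grows unbounded:

    # Sketch: store the graph as many small length-prefixed messages rather than
    # one giant message. graph_pb2.Node is a hypothetical generated message type.
    import struct
    import graph_pb2

    def write_nodes(path, nodes):
        with open(path, "wb") as f:
            for node in nodes:                            # each node is a graph_pb2.Node
                payload = node.SerializeToString()
                f.write(struct.pack("<I", len(payload)))  # 4-byte length prefix
                f.write(payload)

    def read_nodes(path):
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                (size,) = struct.unpack("<I", header)
                node = graph_pb2.Node()
                node.ParseFromString(f.read(size))
                yield node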
The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.
If you need a random access format that is compatible across many languages, I would suggest HDF5 or sqlite.
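For example, HDF5 accessed through h5py lets you read just a slice of a dataset without loading the rest of the file (the file and dataset names below are placeholders):

    # Minimal sketch of random access with h5py; "graph.h5" and "edges" are placeholders.
    import h5py

    with h5py.File("graph.h5", "r") as f:
        edges = f["edges"][10000:10100]   # only this slice is read from disk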

It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.
The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only have to have part of the data in memory at once.
If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.

Related

Caching a large data structure in python across instances while maintaining types

I'm looking to use a distributed cache in Python. I have a FastAPI application and want every instance to have access to the same data, as our load balancer may route incoming requests differently. The problem is that I'm storing / editing information about a relatively big data set from an Arrow feather file and processing it with Vaex. The feather file automatically loads the correct types for the data. The data structure I need to store will use a user id as a key and the value will be a large array of arrays of numbers. I've looked at memcached and Redis as possible caching solutions, but both seem to store entries as strings / simple values. I'm looking to avoid parsing strings and doing extra processing on a large amount of data. Is there a distributed caching strategy that will let me persist types?
One solution we came up with is to store the data in multiple feather files in a directory that is accessible to all instances of the app, but this seems messy, as you would need to clean up / delete the files after each session.
Redis 'strings' are actually able to store arbitrary binary data; they aren't limited to actual text. From https://redis.io/topics/data-types:
Redis Strings are binary safe, this means that a Redis string can contain any kind of data, for instance a JPEG image or a serialized Ruby object.
A String value can be at max 512 Megabytes in length.
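So, as a minimal sketch (connection details and the key naming scheme are assumptions), you can put a numpy array into Redis as raw bytes and get the correct dtype back when you load it:

    # Sketch: cache a numpy array in Redis as raw bytes, preserving dtype and shape.
    import io
    import numpy as np
    import redis

    r = redis.Redis(host="localhost", port=6379)   # connection details are assumptions

    def cache_array(user_id, arr):
        buf = io.BytesIO()
        np.save(buf, arr)                          # the .npy format keeps dtype and shape
        r.set(f"user:{user_id}", buf.getvalue())

    def load_array(user_id):
        raw = r.get(f"user:{user_id}")
        return np.load(io.BytesIO(raw)) if raw is not None else None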
Another option is to use Flatbuffers, which is a serialisation protocol specifically designed to allow reading/writing serialised objects without expensive deserialisation.
Although I would suggest reconsidering storing large, complex data structures as cache values. The drawback is that any change means rewriting the entire thing in cache, which can get expensive, so consider breaking it up into smaller k/v pairs if possible. You could use the Redis Hash data type to make this easier to implement.
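A rough sketch of that chunked layout (the key and field names are assumptions): each block of rows becomes its own hash field, so updating one block does not rewrite the whole value.

    # Sketch: split one large array into per-chunk hash fields in Redis.
    import io
    import numpy as np
    import redis

    r = redis.Redis()

    def cache_chunks(user_id, arr, chunk_rows=10000):
        for i in range(0, len(arr), chunk_rows):
            buf = io.BytesIO()
            np.save(buf, arr[i:i + chunk_rows])
            r.hset(f"user:{user_id}", f"chunk:{i // chunk_rows}", buf.getvalue())

    def load_chunk(user_id, chunk_index):
        raw = r.hget(f"user:{user_id}", f"chunk:{chunk_index}")
        return np.load(io.BytesIO(raw)) if raw is not None else None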

Extracting only a few columns from a FITS file that is freely available online to download using python

I'm working on a model of the universe for which I'm using data available on the Sloan Digital Sky Survey site. The problem is that some files are more than 4 GB large (more than 50 GB in total), and while I know those files contain a lot of data columns, I only want data from a few of them. I had heard about web scraping, so I searched for how to do it, but it didn't help, as all the tutorials explained how to download the whole file using Python. Is there any way I can extract only a few columns from a file, so that I get just the data I need and don't have to download the whole large file for a small fraction of its data?
Sorry, my question is just words and no code, because I'm not that experienced in Python. I searched online and learned how to do basic web scraping, but it didn't solve my problem.
It would be even more helpful if you could suggest other ways to reduce the amount of data I have to download.
Here is the URL to download FITS files: https://data.sdss.org/sas/dr12/boss/lss/
I only want to extract the columns that have coordinates (ra, dec), distance, velocity and redshifts from the files.
Also, is there a way to do the same thing with CSV files or a general way to do it with any file?
I'm afraid what you're asking is generally not possible, at least not without significant effort and software support on both the client and server side.
First of all, FITS tables are stored in a row-oriented binary layout, meaning that if you wanted to stream a portion of a FITS table you could read it one row at a time. But to read individual columns you would need to make partial reads of each row, for every single row in the table. Some web servers support what are called "range requests", meaning you can request only a few ranges of bytes from a file instead of the whole file. The web server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented this could be feasible, as you could download just the header of the file to determine the byte ranges of the columns, and then download just those ranges.
Unfortunately, since FITS tables are row-oriented, loading say 3 columns from a table with a million rows would involve 3 million range requests, which would likely involve enough overhead that you wouldn't gain anything from it (and I'm honestly not sure what limits web servers place on how many ranges you can request in a single request, but I suspect most won't allow anything so extreme).
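For completeness, this is roughly what a byte-range request looks like from Python; it only helps if the server honors the Range header and the file is uncompressed (the exact file name below is hypothetical):

    # Sketch: request only the first 2880-byte FITS header block via an HTTP range request.
    # The file name is hypothetical; the directory is the one linked in the question.
    import requests

    url = "https://data.sdss.org/sas/dr12/boss/lss/some_catalog.fits"
    resp = requests.get(url, headers={"Range": "bytes=0-2879"})
    if resp.status_code == 206:       # 206 Partial Content: the server honored the range
        header_block = resp.content
    else:                             # 200 means the server ignored the Range header
        print("Server does not support range requests")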
There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.
Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here but for the reasons discussed above it would mostly be useful for a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU) or possibly some other specialized cases such as sections of tile-compressed images.
As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.
Worse still, I looked at the link you provided and all the files there are gzip-compressed, so it is especially difficult to deal with since you can't know what ranges of them to request without decompressing them first.
As an aside, since you asked, you will have the same problem with CSV, only worse since CSV fields are not typically in fixed-width format, so there is no way to know how to extract individual columns without downloading the whole file.
For FITS, maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. I don't know whether such a thing already exists, but I don't think it does in any very general sense. So this would a) have to be developed, and b) require anyone hosting the files you want to access to run such a service.
Your best bet is to just download the whole file, extract the data you need from it, and delete the original file assuming you no longer need it. It's possible the information you need is also already accessible through some online database.
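If you go that route, a short sketch with astropy would look like this (the column names are guesses based on your description; check the actual FITS header for the real names):

    # Sketch: keep only a few columns from a downloaded FITS table and save a smaller file.
    # "galaxy_catalog.fits" and the column names are placeholders.
    from astropy.table import Table

    table = Table.read("galaxy_catalog.fits")
    subset = table["RA", "DEC", "Z"]                    # e.g. coordinates and redshift
    subset.write("catalog_subset.fits", overwrite=True)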

Save Data in C++, Load from Python - Recommended Data Formats

I have a ROS/CPP simulator that saves large amounts of data to a rosbag (around 90 MB). I want to read this data frequently from Python and since reading rosbags is slow and cumbersome, I currently have another python script that reads the rosbag and saves the relevant contents to a HDF5 file.
It would be nice though to be able to just save the data from the simulator directly (in C++) and then read it from my scripts (in Python). So I was wondering which data format I should use.
It should be:
Fast to load from Python
Be compact (so ideally a binary of some sort)
Be easy to use
You might be wondering why I don't just save to HDF5 from my C++ simulator, but it just doesn't seem to be easy. There is basically nothing on forums such as Stack Overflow, and the HDF5 Group website is opaque, seems to have some complicated licensing, and has very poor examples. I just want something quick and dirty that I can get running this afternoon.
You may want to have a look at HDFql, as it is a high-level language (similar to SQL) to manage HDF5 files. Amongst others, HDFql supports C++ and Python. There are some examples that illustrate how to use HDFql in these languages here.
I see two solutions that can be useful for your problem:
LV: length-value records that you can write directly as binary to a file (a Python-side sketch follows after this list)
JSON: this does not add much data beyond what you need, and there are many libraries in Python and C++ that can simplify the work for you
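Here is what the length-value idea could look like on the Python side; the framing (a 4-byte little-endian length followed by the payload) is an assumption and must match whatever your C++ writer emits:

    # Sketch: read length-prefixed binary records written by the C++ simulator.
    import struct

    def read_records(path):
        records = []
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                (size,) = struct.unpack("<I", header)   # little-endian 4-byte length
                records.append(f.read(size))
        return records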
Protocol Buffers is an option with language bindings in C++ and Python, though it might be more of a time investment than something quick and dirty you can get running this afternoon.

python make huge file persist in memory

I have a Python script that needs to read a huge file into a variable and then search it and perform other operations.
The problem is that the web server calls this script multiple times, and every time I have a latency of around 8 seconds while the file loads.
Is it possible to make the file persist in memory so I have faster access to it at later times?
I know I can make the script a service using supervisor, but I can't do that in this case.
Any other suggestions, please?
PS: I am already using var = pickle.load(open(file))
You should take a look at http://docs.h5py.org/en/latest/. It allows you to perform various operations on huge files. It's what NASA uses.
Not an easy problem. I assume you can do nothing about the fact that your web server calls your application multiple times. In that case I see two solutions:
(1) Write TWO separate applications. The first application, A, loads the large file and then it just sits there, waiting for the other application to access the data. "A" provides access as required, so it's basically a sort of custom server. The second application, B, is the one that gets called multiple times by the web server. On each call, it extracts the necessary data from A using some form of interprocess communication. This ought to be relatively fast. The Python standard library offers some tools for interprocess communication (socket, http server) but they are rather low-level. Alternatives are almost certainly going to be operating-system dependent.
(2) Perhaps you can pre-digest or pre-analyze the large file, writing out a more compact file that can be loaded quickly. A similar idea is suggested by tdelaney in his comment (some sort of database arrangement).
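A minimal sketch of option (1) using the standard library's multiprocessing.connection; the file name, port, auth key and the lookup protocol are all assumptions:

    # Sketch of a tiny "data server": load the big pickle once, then answer lookups
    # over a local connection. Assumes the pickled object is dict-like.
    import pickle
    from multiprocessing.connection import Listener

    with open("huge_file.pkl", "rb") as f:
        data = pickle.load(f)                       # the slow load happens only once

    with Listener(("localhost", 6000), authkey=b"secret") as listener:
        while True:
            with listener.accept() as conn:
                key = conn.recv()                   # the web-facing script sends a key
                conn.send(data.get(key))            # and receives just that item back

    # The script the web server calls would then do:
    #   from multiprocessing.connection import Client
    #   with Client(("localhost", 6000), authkey=b"secret") as conn:
    #       conn.send(key)
    #       value = conn.recv()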
You are talking about memory-caching a large array, essentially…?
There are three fairly viable options for large arrays:
use memory-mapped arrays
use h5py or pytables as a back-end
use an array caching-aware package like klepto or joblib.
Memory-mapped arrays index the array on file, as if it were in memory.
h5py or pytables give you fast access to arrays on disk and can also avoid loading the entire array into memory. klepto and joblib can store arrays as a collection of "database" entries (typically a directory tree of files on disk), so you can load portions of the array into memory easily. Each has a different use case, so the best choice for you depends on what you want to do. (I'm the klepto author, and it can use SQL database tables as a backend instead of files.)
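As a minimal illustration of the memory-mapped option (the file name, dtype and shape are assumptions and must match whatever wrote the file):

    # Sketch: memory-map a large binary array so each call only touches the parts it needs.
    import numpy as np

    # One-time conversion from the in-memory object to a flat binary file:
    #   np.asarray(big_array).tofile("huge_array.dat")

    arr = np.memmap("huge_array.dat", dtype=np.float64, mode="r", shape=(1000000, 128))
    chunk = arr[5000:5010]            # only these rows are actually read from disk
    print(chunk.mean())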

Live-analysis of simulation data using pytables / hdf5

I am working on some CFD simulations with C/CUDA and Python; at the moment the workflow goes like this:
Start a simulation written in pure c / cuda
Write output to a binary file
Reopen the files with Python, i.e. numpy.fromfile, and do some analysis.
Since I have a lot of data and also some metadata, I thought it would be better to switch to the HDF5 file format. So my idea was something like:
Create some initial conditions data for my simulations using pytables.
Reopen and write to the datasets in C by using the standard HDF5 library.
Reopen files using pytables for analysis.
I really would like to do some live analysis of the data, i.e. write from the C program to HDF5 and directly read from Python using pytables. This would be pretty useful, but I am really not sure how much of this is supported by pytables. Since I have never worked with pytables or HDF5, it would be good to know whether this is a good approach or whether there are some pitfalls.
I think it is a reasonable approach, but there is a pitfall indeed. The HDF5 C library is not thread-safe (there is a "parallel" version, more on this later). That means your scenario does not work out of the box: one process writing data to a file while another process is reading (not necessarily the same dataset) will result in a corrupted file. To make it work, you must either:
implement file locking, making sure that no process is reading while the file is being written to (a sketch of this follows after this list), or
serialize access to the file by delegating reads/writes to a distinguished process. You must then communicate with this process through some IPC technique (Unix domain sockets, ...). Of course, this might affect performance because data is being copied back and forth.
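A rough sketch of the locking route on the Python/reader side, using the third-party filelock package; the file and node names are placeholders, and the C/CUDA writer would have to take the same lock around its writes, which is the awkward part:

    # Sketch of option 1 (file locking) for the pytables reader.
    from filelock import FileLock
    import tables

    lock = FileLock("simulation.h5.lock")

    def read_latest(node_path="/results/velocity"):
        with lock:                                   # blocks while the writer holds the lock
            with tables.open_file("simulation.h5", mode="r") as f:
                return f.get_node(node_path).read()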
Recently, the HDF group published an MPI-based parallel version of HDF5, which makes concurrent read/write access possible. Cf. http://www.hdfgroup.org/HDF5/PHDF5/. It was created for use cases like yours.
To my knowledge, pytables does not provide any bindings to parallel HDF5. You should use h5py instead, which provides very user-friendly bindings to parallel HDF5. See the examples on this website: http://docs.h5py.org/en/2.3/mpi.html
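Following the example on that page, the core of it looks roughly like this (it requires h5py built against parallel HDF5 and must be launched with mpiexec; the file and dataset names are placeholders):

    # Sketch of parallel HDF5 with h5py + mpi4py, modeled on the h5py documentation.
    from mpi4py import MPI
    import h5py

    rank = MPI.COMM_WORLD.rank
    with h5py.File("parallel_output.h5", "w", driver="mpio", comm=MPI.COMM_WORLD) as f:
        dset = f.create_dataset("test", (MPI.COMM_WORLD.size,), dtype="i")
        dset[rank] = rank             # each MPI rank writes its own element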
Unfortunately, parallel HDF5 has a major drawback: to date, it does not support writing compressed datasets (reading is possible, though). Cf. http://www.hdfgroup.org/hdf5-quest.html#p5comp
