Large overhead for reading pickled panda to R - python

I have a 10GB .pkl file, which contains four pandas. Following advice in this blog post, I try to load it in R. R takes up to 25GB of memory for this process and I wonder whether the computational overhead arises. How to make this more feasible? I could save as csv but, but there should be a better solution, no?
For reference, the post suggests to save a python function read_pickle_file.py:
import pandas as pd
def read_pickle_file(file):
pickle_data = pd.read_pickle(file)
return pickle_data
and then in R:
require("reticulate")
source_python("pickle_reader.py")
pickle_data <- read_pickle_file("C:/tsa/dataset.pickle")

The only workable solution for me was to export from Python using feather, which allows to also import from R. Resulted in (almost) no overhead (but is not a proper answer to my question, if you have better one I'd be happy to accept).
A feather tutorial can be found here

Related

Storing large dense 2D matrix of float32 data

I am searching for the best possible solution for storing large dense 2D matrices of floating-point data, generally float32 data.
The goal would be to share scientific data more easily from websites like the Internet Archive and make such data FAIR.
My current approaches (listed below) fall short of the desired goal, so I thought I might ask here, hoping to find something new, such as a better data structure. Even though my examples are in Python, the solution does not need to be in Python. Any good solution will do, even one in COBOL!
CSV-based
One approach I tried would be to store the values as compressed CSVs using pandas, but this is excruciatingly slow, and the end compression is not exactly optimal (generally a 50% from the plain CSV on the data I tried this on, which is not flawed but not sufficient to make this viable.) In this example, I am using gzip. I have also tried LZMA, but it is generally way slower, and, at least on the data I tried it on, it does not yield a significantly better result.
import pandas as pd
my_data: pd.DataFrame = create_my_data_doing_something()
my_data.to_csv("my_data_saved.csv.gz")
NumPy Based
Another solution is to store the data into a NumPy-like matrix and then compress it on a disk.
import numpy as np
my_data: np.ndarray = create_my_data_doing_something()
np.save("my_data_saved.npy", my_data)
and afterwards
gzip -k my_data_saved.npy
Numpy with H5
Another possible compression is H5, but as shown in the benchmark below it does not do better than the plain numpy.
with h5py.File("my_data_saved.h5", 'w') as hf:
hf.create_dataset("my_data_saved", data=my_data)
Issues in using the Numpy format
While this may be a good solution, we limit the scope of usability of the data to people that can use Python. Of course, this is a vast pool of people, though, in my circles, many biologists and mathematicians abhor Python and like to stick to Matlab and R (and therefore would not know what to do with a .npy.gz file).
Pickle
Another solution, as correctly pointed out by #lojza, is to store the data into a pickle object, which may also be compressed to a disk. Pickle obtains in my benchmarks (see below) a compression ratio comparable with what is obtained with Numpy.
import pickle
import compress_pickle
my_data: np.ndarray = create_my_data_doing_something()
# Not compressed pickle
with open("my_data_saved.pkl", "wb") as f:
pickle.dump(my_data, f)
# Compressed version
compress_pickle.dump(df, "my_data_saved.pkl.gz")
Issues in using Pickle format
The issue in using Pickle is two-fold: first, the same Python-dependency issue discussed above. Secondly, there is a significant security issue: the Pickle format can be used for arbitrary code execution exploits. People should be wary of downloading random pickle files from the internet (and here, the goal is to make people share datasets on the internet).
python import pickle
# Build the exploit
command = b"""cat flag.txt"""
x = b"c__builtin__\ngetattr\nc__builtin__\n__import__\nS'os'\n\x85RS'system'\n\x86RS'%s'\n\x85R."%command
# Test it
pickle.load(x)
Benchmarks
I have executed benchmarks to provide a baseline on which to improve. Here, we can see that generally, Numpy is the best performing choice, and to my knowledge, it should not have any security risk in its format. It follows that a compressed NumPy array is currently the best contender, please let's find a better one!
Examples
To share some actual use cases of this relatively simple task, I share a couple of graphs embedding the complete OBO Foundry graph. If there were a way to make these files smaller, sharing them would be significantly more accessible, allowing for more reproducible experiments and accelerating research in the bio-ontologies (for this specific data) and other fields.
OBO Foundry embedded using CBOW
OBO Foundry embedded using TransE
What other approaches may I try?

Optimising RAM in reading large file in Python

I have a neural network that takes in arrays of 1024 doubles and returns a single value. I've trained it, and now I want to use it on a set of test data.
The file that I want to test it on is very big (8.2G), and whenever I try to import it into Google Colab it crashes the RAM.
I'm reading it in using read_csv as follows:
testsignals = pd.read_csv("/content/drive/My Drive/MyFile.txt", delimiter="\n",header=None).
What I'd like to know is if there's a more efficient way of reading in the file, or whether I will just have to work with smaller data sets.
Edit: following Prune's comment, I looked at the advice from the people who had commented, and I tried this:
import csv
testsignals=[]
with open('/content/drive/MyDrive/MyFile') as csvfile:
reader=csv.reader(csvfile)
for line in reader:
testsignals.append(line)
But it still exceed the RAM.
If anyone can give me some help I'd be very grateful!

Reading file with huge number of columns in python

I have a huge file csv file with around 4 million column and around 300 rows. File size is about 4.3G. I want to read this file and run some machine learning algorithm on the data.
I tried reading the file via pandas read_csv in python but it is taking long time for reading even a single row ( I suspect due to large number of columns ). I checked few other options like numpy fromfile, but nothing seems to be working.
Can someone please suggest some way to load file with many columns in python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtext might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row using readline and str.split. Then pass that to Pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format for easier ingestion later. hdf5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez or my favorite the speedy bloscpack.(un)pack_ndarray_file.
csv is very inefficient for storing large datasets. You should convert your csv file into a better suited format. Try hdf5 (h5py.org or pytables.org), it is very fast and allows you to read parts of the dataset without fully loading it into memory.
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.

how to rapidaly load data into memory with python?

I have a large csv file (5 GB) and I can read it with pandas.read_csv(). This operation takes a lot of time 10-20 minutes.
How can I speed it up?
Would it be useful to transform the data in a sqllite format? In case what should I do?
EDIT: More information:
The data contains 1852 columns and 350000 rows. Most of the columns are float65 and contain numbers. Some other contains string or dates (that I suppose are considered as string)
I am using a laptop with 16 GB of RAM and SSD hard drive. The data should fit fine in memory (but I know that python tends to increase the data size)
EDIT 2 :
During the loading I receive this message
/usr/local/lib/python3.4/dist-packages/pandas/io/parsers.py:1164: DtypeWarning: Columns (1841,1842,1844) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
EDIT: SOLUTION
Read one time the csv file and save it as
data.to_hdf('data.h5', 'table')
This format is incredibly efficient
This actually depends on which part of reading it is taking 10 minutes.
If it's actually reading from disk, then obviously any more compact form of the data will be better.
If it's processing the CSV format (you can tell this because your CPU is at near 100% on one core while reading; it'll be very low for the other two), then you want a form that's already preprocessed.
If it's swapping memory, e.g., because you only have 2GB of physical RAM, then nothing is going to help except splitting the data.
It's important to know which one you have. For example, stream-compressing the data (e.g., with gzip) will make the first problem a lot better, but the second one even worse.
It sounds like you probably have the second problem, which is good to know. (However, there are things you can do that will probably be better no matter what the problem.)
Your idea of storing it in a sqlite database is nice because it can at least potentially solve all three at once; you only read the data in from disk as-needed, and it's stored in a reasonably compact and easy-to-process form. But it's not the best possible solution for the first two, just a "pretty good" one.
In particular, if you actually do need to do array-wide work across all 350000 rows, and can't translate that work into SQL queries, you're not going to get much benefit out of sqlite. Ultimately, you're going to be doing a giant SELECT to pull in all the data and then process it all into one big frame.
Writing out the shape and structure information, then writing the underlying arrays in NumPy binary form. Then, for reading, you have to reverse that. NumPy's binary form just stores the raw data as compactly as possible, and it's a format that can be written blindingly quickly (it's basically just dumping the raw in-memory storage to disk). That will improve both the first and second problems.
Similarly, storing the data in HDF5 (either using Pandas IO or an external library like PyTables or h5py) will improve both the first and second problems. HDF5 is designed to be a reasonably compact and simple format for storing the same kind of data you usually store in Pandas. (And it includes optional compression as a built-in feature, so if you know which of the two you have, you can tune it.) It won't solve the second problem quite as well as the last option, but probably well enough, and it's much simpler (once you get past setting up your HDF5 libraries).
Finally, pickling the data may sometimes be faster. pickle is Python's native serialization format, and it's hookable by third-party modules—and NumPy and Pandas have both hooked it to do a reasonably good job of pickling their data.
(Although this doesn't apply to the question, it may help someone searching later: If you're using Python 2.x, make sure to explicitly use pickle format 2; IIRC, NumPy is very bad at the default pickle format 0. In Python 3.0+, this isn't relevant, because the default format is at least 3.)
Python has two built-in libraries called pickle and cPickle that can store any Python data structure.
cPickle is identical to pickle except that cPickle has trouble with Unicode stuff and is 1000x faster.
Both are really convenient for saving stuff that's going to be re-loaded into Python in some form, since you don't have to worry about some kind of error popping up in your file I/O.
Having worked with a number of XML files, I've found some performance gains from loading pickles instead of raw XML. I'm not entirely sure how the performance compares with CSVs, but it's worth a shot, especially if you don't have to worry about Unicode stuff and can use cPickle. It's also simple, so if it's not a good enough boost, you can move on to other methods with minimal time lost.
A simple example of usage:
>>> import pickle
>>> stuff = ["Here's", "a", "list", "of", "tokens"]
>>> fstream = open("test.pkl", "wb")
>>> pickle.dump(stuff,fstream)
>>> fstream.close()
>>>
>>> fstream2 = open("test.pkl", "rb")
>>> old_stuff = pickle.load(fstream2)
>>> fstream2.close()
>>> old_stuff
["Here's", 'a', 'list', 'of', 'tokens']
>>>
Notice the "b" in the file stream openers. This is important--it preserves cross-platform compatibility of the pickles. I've failed to do this before and had it come back to haunt me.
For your stuff, I recommend writing a first script that parses the CSV and saves it as a pickle; when you do your analysis, the script associated with that loads the pickle like in the second block of code up there.
I've tried this with XML; I'm curious how much of a boost you will get with CSVs.
If the problem is in the processing overhead, then you can divide the file into smaller files and handle them in different CPU cores or threads. Also for some algorithms the python time will increase non-linearly and the dividing method will help in these cases.

Storing Numpy or Pandas Data in Mongodb

I'm trying to decide the best way to store my time series data in mongodb. Outside of mongo I'm working with them as numpy arrays or pandas DataFrames. I have seen a number of people (such as in this post) recommend pickling it and storing the binary, but I was under the impression that pickle should never be used for long term storage. Is that only true for data structures that might have underlying code changes to their class structures? To put it another way, numpy arrays are probably stable so fine to pickle, but pandas DataFrames might go bad as pandas is still evolving?
UPDATE:
A friend pointed me to this, which seems to be a good start on exactly what I want:
http://docs.scipy.org/doc/numpy/reference/routines.io.html
Numpy has its own binary file format, which should be long term storage stable. Once I get it actually working I'll come back and post my code. If someone else has made this work already I'll happily accept your answer.
We've built an open source library for storing numeric data (Pandas, numpy, etc.) in MongoDB:
https://github.com/manahl/arctic
Best of all, it's easy to use, pretty fast and supports data versioning, multiple data libraries and more.

Categories