What data format for large files in R? - python

I produce a very large data file with Python, consisting mostly of 0 (false) and only a few 1 (true) values. It has about 700,000 columns and 15,000 rows, which gives a size of 10.5 GB. The first row is the header.
This file then needs to be read and visualized in R.
I'm looking for the right data format to export my file from Python.
As stated here:
HDF5 is row based. You get MUCH efficiency by having tables that are
not too wide but are fairly long.
As I have a very wide table, I assume HDF5 is inappropriate in my case?
So what data format suits best for this purpose?
Would it also make sense to compress (zip) it?
Example of my file:
id,col1,col2,col3,col4,col5,...
1,0,0,0,1,0,...
2,1,0,0,0,1,...
3,0,1,0,0,1,...
4,...

Zipping won't help you, as you'll have to unzip it to process it. If you could post your code that generates the file, that might help a lot.
Also, what do you want to accomplish in R? Might it be faster to visualize it in Python, avoiding the read/write of 10.5 GB?
Perhaps rethinking your approach to how you're storing the data (e.g., store the coordinates of the 1's if there are very few) might be a better angle here.
For instance, instead of storing a 700K by 15K table of all zeroes except for a 1 in line 600492 column 10786, I might just store the tuple (600492, 10786) and achieve the same visualization in R.
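A minimal sketch of that coordinate-only idea, assuming the matrix is already in a NumPy array (the array and file names here are made up for illustration):

import numpy as np

# Small made-up matrix standing in for the full 15,000 x 700,000 one.
data = np.zeros((4, 6), dtype=np.uint8)
data[1, 4] = 1
data[3, 2] = 1

# Keep only the (row, column) coordinates of the 1s and write those.
rows, cols = np.nonzero(data)
np.savetxt("ones_coordinates.csv", np.column_stack([rows, cols]),
           fmt="%d", delimiter=",", header="row,col", comments="")

The resulting two-column file is tiny compared to the dense table and can be read in R with read.csv() and plotted directly.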

SciPy has scipy.io.mmwrite which makes files that can be read by R's readMM command. SciPy also supports several different sparse matrix representations.
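A rough sketch of that route, using a small made-up matrix in place of the real one:

import numpy as np
from scipy import io, sparse

# Small stand-in for the real 15,000 x 700,000 matrix; for the real data you
# would build the sparse matrix directly from the (row, col) indices of the 1s
# instead of materializing a dense array first.
dense = np.zeros((4, 6), dtype=np.uint8)
dense[1, 4] = 1
dense[3, 2] = 1

# Write it in Matrix Market format.
io.mmwrite("matrix.mtx", sparse.coo_matrix(dense))

# In R (Matrix package):
#   library(Matrix)
#   m <- readMM("matrix.mtx")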

Related

How do I store a multidimensional array?

I am trying to write code in Python that will display the trajectory of a projectile on a 2D graph. The initial velocity and launch angle will vary. Instead of calculating it every time, I was wondering if there is any way to create a data file which will store all the values of the coordinates for each of those different combinations of speed and launch angle. That is a 4-dimensional database. Is this even possible?
This sounds like a pretty ideal case for using CSV as your file format. It's not a "4-dimensional" database so much as a "4-column" one:
initial_velocity, launch_angle, end_x, end_y
which you can write out and read in easily, using either the standard library's csv module or pandas' read_csv().
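A minimal sketch of that layout (the velocity/angle values are made up):

import csv
import pandas as pd

# Made-up results: one row per (velocity, angle) combination.
rows = [
    {"initial_velocity": 10.0, "launch_angle": 30.0, "end_x": 8.8, "end_y": 0.0},
    {"initial_velocity": 10.0, "launch_angle": 45.0, "end_x": 10.2, "end_y": 0.0},
]

with open("trajectories.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["initial_velocity", "launch_angle", "end_x", "end_y"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back later:
df = pd.read_csv("trajectories.csv")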
I think you should look at the HDF5 format, which is specialized for working with big data at NASA and is bulletproof in very large scale applications.
From the website:
HDF5 lets you store huge amounts of numerical data, and easily
manipulate that data from NumPy. For example, you can slice into
multi-terabyte datasets stored on disk, as if they were real NumPy
arrays. Thousands of datasets can be stored in a single file,
categorized and tagged however you want.
What I would add is that NumPy was developed to work with multidimensional arrays very efficiently. Good luck!
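A small sketch with h5py, using made-up file and dataset names, just to show the write-then-slice pattern described above:

import numpy as np
import h5py

# Write a dataset to an HDF5 file (a small stand-in for a huge one).
with h5py.File("data.h5", "w") as f:
    f.create_dataset("matrix", data=np.random.rand(1000, 500))

# Later: open the file and slice into the dataset; only the requested
# part is read from disk.
with h5py.File("data.h5", "r") as f:
    first_rows = f["matrix"][:10, :]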

Which data file output format from C++ is most size-efficient to read in Python (or other languages)?

I want to write output files containing tabular data (float values in rows and columns) from a C++ program.
I need to open those files later on with other languages/software (here Python and ParaView, but that might change).
What would be the most size-efficient output format for tabular files that is also compatible with other languages?
E.g., txt files, CSV, XML, binary or not?
Thanks for any advice.
HDF5 might be a good option for you. It’s a standard format for storing large amounts of data, and there are Python and C++ libraries for reading and writing to it.
See here for an example
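As a rough sketch of the Python side, assuming the C++ program wrote a 2-D float dataset named "table" (a hypothetical name) into results.h5 via the HDF5 C++ API:

import h5py

# Open the file written by the C++ program and read from it with h5py.
with h5py.File("results.h5", "r") as f:
    table = f["table"][...]          # load the whole table as a NumPy array
    first_column = f["table"][:, 0]  # or read just one column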
1- Your output files contain tabular data (float values in rows and columns), in other words, a kind of matrix.
2- You need to open those files later on with other languages/software.
3- You want size-efficient files.
That said, you should consider one of the two formats below:
CSV: if your data are very simple (a matrix of floats without particular structure)
JSON: if you need a minimum of structure for your files
These two formats are standard and supported by almost all known languages and maintained software.
Last, if your data have a very complex structure, look at a format like XML instead, but the price to pay is then in the size of your files!
Hope this helps!
First of all, the efficiency of I/O operations is limited by the buffer size, so if you want to achieve higher throughput you may have to play with the input/output buffers. As for how to write the data to the files, that depends on your data and on what delimiters you want to use to separate the values in the files.

Reading file with huge number of columns in python

I have a huge CSV file with around 4 million columns and around 300 rows. The file size is about 4.3 GB. I want to read this file and run some machine learning algorithms on the data.
I tried reading the file via pandas read_csv in Python, but it is taking a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options like numpy fromfile, but nothing seems to be working.
Can someone please suggest some way to load a file with this many columns in Python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split. Then pass that to Pandas or numpy, assuming that's how you intend to operate on the data. You could then save it to disk in a format for easier ingestion later. hdf5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez or my favorite, the speedy bloscpack.(un)pack_ndarray_file.
csv is very inefficient for storing large datasets. You should convert your csv file into a better suited format. Try hdf5 (h5py.org or pytables.org), it is very fast and allows you to read parts of the dataset without fully loading it into memory.
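A rough sketch of such a conversion with h5py, assuming the shape of the CSV is known in advance, there is no header row, and the values are small integers (as in the loadtxt example above; use a float dtype otherwise):

import numpy as np
import h5py

n_rows, n_cols = 300, 4_000_000  # assumed shape of the CSV; adjust to yours

with open("data.csv") as src, h5py.File("data.h5", "w") as dst:
    dset = dst.create_dataset("data", shape=(n_rows, n_cols), dtype=np.uint8)
    for i, line in enumerate(src):
        # Parse one CSV line and write it as one row of the HDF5 dataset.
        dset[i, :] = np.fromiter((int(v) for v in line.strip().split(",")),
                                 dtype=np.uint8, count=n_cols)

# Later, read only what you need, e.g. the first 1,000 columns of every row:
with h5py.File("data.h5", "r") as f:
    subset = f["data"][:, :1000]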
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.

PyTables, Creating a table without opening an hdf5 file

Is it possible to create a PyTables table without opening or creating an hdf5 file?
What I mean, and what I need, is to create a table (well actually very many tables) in different processes, work with these tables and store the tables into an hdf5 file only in the end after some calculations (and ensuring only one process at a time performs the storage).
In principle I could do all the calculations on normal Python data (arrays, strings, etc.) and perform the storage at the end. However, the reason I would appreciate working with PyTables right from the start is the sanity checks. I want to always ensure that the data I work with fits into the predefined tables and does not violate shape constraints etc. (and since PyTables checks for those problems, I don't need to implement it all myself).
Thanks a lot and kind regards,
Robert
You are looking for pandas which has great Pytables integration. You will be working with tables all the way and in the end you will be able to save to hdf5 in the easiest possible way.
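A minimal sketch of that workflow with pandas (file and key names are made up; pandas' HDF5 support requires PyTables to be installed):

import numpy as np
import pandas as pd

# Work on an ordinary DataFrame during the calculations...
df = pd.DataFrame({"a": np.arange(5), "b": np.random.rand(5)})

# ...and only at the end write it out to HDF5 (PyTables under the hood).
df.to_hdf("results.h5", key="my_table", format="table")

# Read it back later:
df2 = pd.read_hdf("results.h5", key="my_table")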
You can create a NumPy array with a given shape and datatype:
my_array = np.empty(shape=my_shape, dtype=np.float64)
If you need indexing by name, have a look at NumPy record arrays (numpy.recarray).
But if you work directly with the PyTables object it can be faster (see the benchmark here).

Python: handling a large set of data. Scipy or Rpy? And how?

In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data is stored in a text file. Loading it into Excel is not possible.
2) I need to sum certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (Scipy or Rpy) is best suited for this task?
If so, could you provide me some pointers (e.g. documentation or online example) that can help me to implement a solution?
Speed is a concern. Ideally SciPy and Rpy can handle the large files even when the files are so large that they cannot be fitted into memory.
Neither Rpy nor SciPy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
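A minimal sketch of that loop, assuming comma-separated float columns and a header line (the file name is made up):

import numpy as np

totals = None

with open("financial_data.txt") as f:  # made-up file name
    next(f)                            # skip the header line, if there is one
    for line in f:
        # Parse one row and add it to the running column totals.
        row = np.array(line.strip().split(","), dtype=float)
        totals = row if totals is None else totals + row

print(totals)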
Python's file I/O doesn't have bad performance, so you can just use the file module directly. You can see what functions are available in it by typing help(file) in the interactive interpreter. Creating a file is part of the core language functionality and doesn't require you to import file.
Something like:
f = open ("C:\BigScaryFinancialData.txt", "r");
for line in f.readlines():
#line is a string type
#do whatever you want to do on a per-line basis here, for example:
print len(line)
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).
As #gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
How huge is your data? Is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load the text data into a numpy array. For example:
import numpy as np

with open("data.csv") as f:
    title = f.readline()  # if your data has a title line
    data = np.loadtxt(f, delimiter=",")  # if your data is separated by ","

print(np.sum(data, axis=0))  # sum along axis 0 to get the sum of every column
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
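One way to gather them on the fly is Python's heapq module instead of a hand-rolled dictionary; a rough sketch, assuming each parsed row gives a total, a row id, and the row's values:

import heapq

top_rows = []  # min-heap holding at most the 10 best (total, row_id, values) entries

def consider(total, row_id, values):
    # Keep the entry only if it beats the current smallest of the top 10.
    entry = (total, row_id, values)
    if len(top_rows) < 10:
        heapq.heappush(top_rows, entry)
    elif total > top_rows[0][0]:
        heapq.heapreplace(top_rows, entry)

# After the single pass over the file:
# best_first = sorted(top_rows, reverse=True)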
Since this has the R tag I'll give some R solutions:
Overview
http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)
