I have a JSON file containing properties of some mathematical objects (Calabi-Yau manifolds). Each object is defined by a matrix and a vector; two additional properties I store are the matrix size (so that it does not need to be recomputed) and the Euler number of the manifold (an integer). In total there are roughly 1 million entries, and the biggest matrix is 16 x 20.
I would like to convert the matrices and vectors to numpy arrays. Is it possible to do this directly when loading from JSON, or, failing that, how do I convert them afterwards? The reason for converting is that I will need some numpy functions later in any case, but I also hope (especially if the conversion is done on loading) that it will speed up my code: at the moment loading the complete dataset takes roughly 90 seconds (a previous version using the json module needed only 20 s; I will open another thread on this if using numpy does not improve things).
Here is a minimal working code:
import pandas as pd
import numpy as np
json = '''
{"1":{"euler":2610,"matrix":[[6]],"size":[1,1],"vec":[5]},
"2":{"euler":2190,"matrix":[[2,5]],"size":[1,2],"vec":[6]},
"4":{"euler":1632,"matrix":[[2,2,4]],"size":[1,3],"vec":[7]},
"6":{"euler":1152,"matrix":[[2,2,2,3]],"size":[1,4],"vec":[8]},
"7":{"euler":960,"matrix":[[2,2,2,2,2]],"size":[1,5],"vec":[9]},
"8":{"euler":2160,"matrix":[[2],[5]],"size":[2,1],"vec":[1,4]},
"9":{"euler":1836,"matrix":[[0,2],[2,4]],"size":[2,2],"vec":[1,5]}}
'''
data = pd.read_json(json, orient="index")
data.sort_index(inplace=True)
My first guess was to use the numpy argument, but it fails with an error:
>>> data = pd.read_json(json, orient="index", numpy=True)
ValueError: cannot reshape array of size 51 into shape (7,4,2,2)
Then I tried passing the dtype argument, but it does not seem to change anything (my hope was that specifying a numpy type would convert the lists to arrays):
>>> dtype = {"euler": np.int16, "matrix": np.int8, "vector": np.int8,
... "size": np.int8, "number": np.int32}
>>> data = pd.read_json(json, orient="index", dtype=dtype)
>>> type(data["matrix"][1])
<class 'list'>
For the conversion I was wondering if there is a more subtle (and perhaps more efficient) way than the brute-force conversion:
data["matrix"] = data["matrix"].apply(lambda x: np.array(x, dtype=np.int8))
Related
I am attempting to read in a FITS file and convert the data into a multidimensional numpy array (so I can easily index the data).
The FITS data is structured like:
FITS_rec([(time, [rate, rate, rate, rate], [ERROR, ERROR, ERROR, ERROR], TOTCOUNTS, fraxexp)])
That is one 'row' (i.e. data[0], where data = hdul[1].data); in my case the number of 'rate' (or error) entries varies between FITS files.
I wish to make this data into a numpy array, but when I do:
arr = np.asarray(data), I get a 1D object array which I cannot index easily, i.e. arr[:][0] is just equal to data[0]. I have also tried np.split with no benefit.
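One workaround is to pull out each named column (astropy returns each as an ndarray) and stack the vector-valued ones; a minimal sketch, assuming the columns are called TIME, RATE and ERROR and the file is named lightcurve.fits (the real names depend on the file and can be checked with data.columns.names):
from astropy.io import fits
import numpy as np

with fits.open("lightcurve.fits") as hdul:   # hypothetical file name
    data = hdul[1].data
    print(data.columns.names)                # check the actual column names

    time = np.asarray(data["TIME"])          # shape (n_rows,)
    rate = np.vstack(data["RATE"])           # shape (n_rows, n_rates)
    error = np.vstack(data["ERROR"])         # shape (n_rows, n_rates)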
I have timeseries data in sequential (packed C-struct) format in very large files. Each structure contains K fields of different types in some order. The file is essentially an array of these structures (row-wise). I would like to mmap the file and map each field to a numpy array (or another structure) that understands the stride (the size of the struct), so that the fields can be aliased as columns in a dataframe.
An example struct might be:
struct {
int32_t a;
double b;
int16_t c;
}
Such a file of records could be generated with python as:
from struct import pack, unpack

db = open("binarydb", "wb")
for i in range(1, 1000):
    packed = pack('<idh', i, i * 3.14, i * 2)
    db.write(packed)
db.close()
The question is then how to view such a file efficiently as a dataframe. If we assume the file is hundreds of millions of rows long, we would need a memory-mapped solution.
Using memmap, how can I map a numpy array (or an alternative array structure) onto the sequence of integers for column a? It seems to me that I would need to be able to indicate a stride (14 bytes) and an offset (0 in this case) for the int32 series "a", an offset of 4 for the float64 series "b", and an offset of 12 for the int16 series "c".
I have seen that one can easily create a numpy array against an mmap'ed file if the file contains a single dtype. Is there a way to pull the different series out of this file by indicating a type, offset, and stride? With that approach I could present memory-mapped columns to pandas or another dataframe implementation.
Even better, is there a simple way to integrate a custom memory-mapped format into Dask, so that I get the benefits of lazy paging into the file?
You can use numpy.memmap to do that. Since your records are not a single native type, you need a structured Numpy dtype. Note that you need to know the size of the array ahead of time, since Numpy only supports fixed-size arrays, not unbounded streams.
import numpy as np

size = 999
datatype = np.dtype([('a', np.int32), ('b', np.float64), ('c', np.int16)])

# The final memory-mapped array ('write' creates or overwrites the file)
data = np.memmap("binarydb", dtype=datatype, mode='write', shape=size)

# Indices are 0-based, while the generating script wrote values for i = 1..999
for i in range(1, 1 + size):
    data[i - 1]['a'] = i
    data[i - 1]['b'] = i * 3.14
    data[i - 1]['c'] = i * 2
Note that vectorized operations are generally much faster than element-by-element indexing in Numpy (see the sketch below). Numba can also be used to speed up direct indexing if the operation cannot be vectorized.
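For example, the fill loop above could be replaced by whole-column assignments; a sketch using the same field names and size as above:
import numpy as np

i = np.arange(1, size + 1)   # 1 .. 999, matching the generating script
data['a'] = i                # assign each column in one vectorized operation
data['b'] = i * 3.14
data['c'] = i * 2
data.flush()                 # push the changes out to the file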
Note that a memory-mapped array can be flushed, but it cannot yet be explicitly closed in Numpy.
Extrapolating from Jérôme Richard's answer above, here is code to read from a binary sequence of records:
import numpy as np

size = 999
datatype = np.dtype([('a', np.int32), ('b', np.float64), ('c', np.int16)])

# The final memory-mapped array ('readonly' maps the existing file)
data = np.memmap("binarydb", dtype=datatype, mode='readonly', shape=size)
Each series can then be pulled out as:
data['a']
data['b']
data['c']
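If the end goal is a dataframe, the memory-mapped fields can then be handed to pandas; a sketch (note that pandas may copy the data into its own memory, so this alone does not give the lazy paging asked about for Dask):
import pandas as pd

df = pd.DataFrame({name: data[name] for name in datatype.names})
print(df.head())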
I have a sample dataset of names. It is a CSV file with 2 columns, each 200 lines long; both columns contain random names. I have the following code to load the CSV file into a pandas DataFrame, convert the DataFrame into a numpy array, and then convert the numpy array into a standard Python list:
import sys
import pandas as pd

x_df = pd.read_csv("names.csv")
x_np = x_df.to_numpy()
x_list = x_np.tolist()
print("Pandas dataframe:")
print('Using sys.getsizeof(): {}'.format(sys.getsizeof(x_df)))
print('Using pandas_df.memory_usage(): {}'.format(x_df.memory_usage(index=True, deep=True).sum()))
print('\nNumpy ndarray (dtype: {}):'.format(x_np.dtype))
print('Using sys.getsizeof(): {}'.format(sys.getsizeof(x_np)))
print('Using ndarray.nbytes: {}'.format(x_np.nbytes))
total_mem = 0
for row in x_np:
    for name in row:
        total_mem += sys.getsizeof(name)
print('Using sys.getsizeof() on each element in np array: {}'.format(total_mem))
print('\nStandard list:')
print('Using sys.getsizeof(): {}'.format(sys.getsizeof(x_list)))
total_mem = sum([sys.getsizeof(x) for sublist in x_list for x in sublist])
print('Using sys.getsizeof() on each element in list: {}'.format(total_mem))
The output of this code is as follows:
Pandas dataframe:
Using sys.getsizeof(): 25337
Using pandas_df.memory_usage(): 25305
Numpy ndarray (dtype: object):
Using sys.getsizeof(): 112
Using ndarray.nbytes: 3200
Using sys.getsizeof() on each element in np array: 21977
Standard list:
Using sys.getsizeof(): 1672
Using sys.getsizeof() on each element in list: 21977
I think I understand why, for the standard Python list, sys.getsizeof() returns such a small value compared to calling sys.getsizeof() on every element of that list: calling it on the list only measures the list object itself, which holds references to its elements.
Does the same logic apply to the numpy array? Why exactly is the array's nbytes so small compared to the summed size of the elements? Does numpy have excellent memory management, or does the numpy array consist of references rather than the actual objects? If it consists of references, does this apply to all dtypes, or just the object dtype?
A dataframe containing strings will be object dtype.
2*200*8 (400 elements times 8 bytes per pointer) is 3200, the array's nbytes. The 112 is just the size of the array object itself (shape, strides, etc.), not its data buffer. The array is apparently a view of the one that x_df uses to store its references.
Pandas data storage is more complicated than numpy's, but apparently if the dtype is uniform across columns it does use a 2d ndarray. I don't know exactly how getsizeof and memory_usage (with those parameters) work, though the numbers suggest they measure roughly the same thing.
Your per-element total suggests that the strings are on average 6-7 characters long (sys.getsizeof of a short ASCII str is roughly 49 bytes plus one byte per character). That seems small for unicode, but you haven't told us about those 'random names'.
The per-element enumeration of the list gives the same total as the numpy enumeration. At first glance it is surprising that 1672 is so much smaller than 3200, but getsizeof on the nested list only counts the outer list's 200 row pointers (about 200 x 8 bytes plus list overhead); the 400 element pointers that nbytes counts sit in the inner row lists, which getsizeof does not include.
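A quick way to see that the object array holds references rather than copies of the strings (a sketch, reusing x_df and x_np from the question):
import sys

# Each cell of the object array is a pointer to a Python str object
print(x_np.dtype)                      # object
print(x_np[0][0] is x_df.iloc[0, 0])   # True: the very same str object, not a copy

# nbytes only counts the 8-byte pointers; the str objects live elsewhere
pointer_bytes = x_np.nbytes
string_bytes = sum(sys.getsizeof(s) for s in x_np.ravel())
print(pointer_bytes, string_bytes)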
I am working on a dataset with mixed sparse/dense columns. As the sparse columns greatly outnumber the dense ones, I wanted to see if I could store them efficiently using the sparse data structures in pandas. However, while testing the functionality I found that dataframes with sparse columns appear to take up more memory. Consider the following example:
import numpy as np
import pandas as pd
a = np.zeros(10000000)
b = np.zeros(10000000)
a[3000:3100] = 2
b[300:310] = 1
df = pd.DataFrame({'a':pd.SparseArray(a), 'b':pd.SparseArray(b)})
print(df.info())
This prints memory usage: 228.9 MB.
Next:
df = pd.DataFrame({'a':a, 'b':b})
print(df.info())
This prints memory usage: 152.6 MB.
Why does the non-sparse dataframe take up less space? Am I misunderstanding something?
Installation info:
pandas 0.25.0
python 3.7.2
I've reproduced those exact numbers. From the docs:
Pandas provides data structures for efficiently storing sparse data.
These are not necessarily sparse in the typical “mostly 0”. Rather,
you can view these objects as being “compressed” where any data
matching a specific value (NaN / missing value, though any value can
be chosen, including 0) is omitted. The compressed values are not
actually stored in the array.
Which means you have to specify that it's the 0 elements that should be compressed. You can do that by using fill_value=0, like so:
df = pd.DataFrame({'a':pd.SparseArray(a, fill_value=0), 'b':pd.SparseArray(b, fill_value=0)})
The result of df.info() is 1.4 KB of memory usage in this case, quite a dramatic difference.
As to why the sparse frame is initially bigger in your example than the normal "uncompressed" one, my guess is that it is the sparse index data added on top of all the original values, which are all still stored (including the zeros in your case). The arithmetic supports this: with the default fill_value nothing is omitted, so each column stores 10 million float64 values plus an int32 index per value, and 10,000,000 x (8 + 4) bytes x 2 columns is about 228.9 MiB.
Additional reading in the docs would tell you that 0 is the default fill_value only for arrays of integer dtype, which yours were not (np.zeros returns float64, for which the default fill_value is NaN).
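For example, if the data really is integer-valued, casting before constructing the sparse arrays gives the 0-compression by default; a sketch based on the arrays a and b from the question:
df_int = pd.DataFrame({'a': pd.SparseArray(a.astype(np.int64)),
                       'b': pd.SparseArray(b.astype(np.int64))})
print(df_int.info())                 # memory usage should drop to a few KB
print(df_int['a'].sparse.density)    # fraction of values actually stored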
How do I build data.frames containing multiple types of data (strings, int, logical) and both continuous variables and factors in Python/Numpy?
The following code turns my headers into NaNs and everything except my float values into NaNs:
from numpy import genfromtxt
my_data = genfromtxt('FlightDataTraining.csv', delimiter=',')
This puts a "b'data'" on all of my data, such that year becomes "b'year'"
import numpy as np
d = np.loadtxt('FlightDataTraining.csv',delimiter=',',dtype=str)
Try genfromtxt('FlightDataTraining.csv', delimiter=',', dtype=None). This tells genfromtxt to intelligently guess the dtype of each column. If that does not work, please post a sample of your CSV and what the desired output should look like.
The b in b'data' is Python's way of representing bytes as opposed to str objects. So the b'data' is okay. If you want strs, you would need to decode the bytes.
NumPy does not have a dtype for representing factors, though Pandas does have a pd.Categorical type.
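Putting these pieces together, a sketch of what the loading might look like (the file name is taken from the question; names=True assumes the first row contains the column headers, and encoding='utf-8' makes genfromtxt return str instead of bytes on NumPy >= 1.14):
import numpy as np
import pandas as pd

# Structured array: each column gets its own guessed dtype
my_data = np.genfromtxt('FlightDataTraining.csv', delimiter=',',
                        dtype=None, names=True, encoding='utf-8')
print(my_data.dtype)

# For factor-like columns, pandas is more convenient
df = pd.read_csv('FlightDataTraining.csv')
df['year'] = df['year'].astype('category')   # 'year' used as an example column from the question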