Save dict to .npy format in Python

I have a dict, and I associate an array with each key (the key itself is a number).
Minimal example:
import numpy as np
data = {}
data[2.5] = np.array([np.array([1,2,3,4]), np.array([5,6,7,8])])
Then I save the dict:
np.save('file.npy', data)
and then reload it:
datanew = np.load('file.npy', allow_pickle=True)  # allow_pickle is required for object arrays in NumPy >= 1.16.3
Now, in order to access what is stored in each key, I cannot just do:
datanew[2.5]
But I have to do
datanew[()][2.5]
Why?
Is there a better way to save dicts?

The reason is that np.save's arr argument expects an array. When you pass a dictionary, it gets wrapped in a 0-dimensional object array instead. So when you load it, you need to extract the element from that dimensionless array (i.e. with [()]). You can just do this when you call np.load, though, and then never worry about it again:
datanew = np.load('file.npy', allow_pickle=True)[()]
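Equivalently, calling .item() on the loaded 0-d array returns the dictionary directly:
datanew = np.load('file.npy', allow_pickle=True).item()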
Alternatively, since you're trying to save a dictionary, you could use pickle. np.save is supposed to be optimized for numerical arrays; I don't know whether you still get that benefit once you've put your arrays inside a dictionary.
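For reference, a minimal sketch of the pickle route (the file name 'file.pkl' is just an example), reusing the data dict from the question:
import pickle

# write the dict out as a pickle; no 0-d array wrapper is involved
with open('file.pkl', 'wb') as f:
    pickle.dump(data, f)

# read it back in
with open('file.pkl', 'rb') as f:
    datanew = pickle.load(f)

print(datanew[2.5])  # plain dict access, no [()] needed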

Related

h5py: how to use keys() to loop over HDF5 Groups and Datasets

print(list(file.keys()))
When I run this code I get:
T00000000,T00000001,T00000002,T00000003, ... ,T00000474
Now, I have analyzed T00000000, but I want to scan them all with a for loop. I couldn't do it because this is a string. Is there any way to do this?
@python_student, there is more to this than the initial answer explains. Based on the syntax of your question, it appears you are using h5py to read the HDF5 file. To effectively access the file contents, you need a basic understanding of HDF5 and h5py. I suggest starting here: h5py Quick Start Guide. In addition, there are many good questions and answers here on StackOverflow with details and examples.
An HDF5 file has 2 basic objects:
Datasets: array-like collections of data
Groups: folder-like containers that hold datasets and other groups
h5py uses dictionary syntax to access Group objects, and reads Datasets using NumPy syntax. (Note: Group objects are not Python dictionaries; they just "look" like them!)
As you noted, the keys() are the NAMES of the objects (groups or datasets) at the root level of your file. Your code created a list from the group keys: list(file.keys()). In general there is no reason to do this. Typically, you will iterate over keys() or items() instead of creating a list.
Here is a short code segment to show how you might do this. I can add more details once I know more about your data schema. (HDF5 is a general data container and can have almost any schema.)
# loop on names:
for name in file.keys():
    print(name)

# loop on names and H5 objects:
for name, h5obj in file.items():
    if isinstance(h5obj, h5py.Group):
        print(name, 'is a Group')
    elif isinstance(h5obj, h5py.Dataset):
        print(name, 'is a Dataset')
        # return a np.array using the dataset object:
        arr1 = h5obj[:]
        # return a np.array using the dataset name:
        arr2 = file[name][:]
        # compare arr1 to arr2 (should always return True):
        print(np.array_equal(arr1, arr2))
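If your file nests groups inside groups, looping one level at a time gets tedious; h5py's visititems() walks the whole tree recursively. A sketch, assuming the open file object from above:
# visititems calls the function once for every group and dataset in the file:
def visitor(name, h5obj):
    if isinstance(h5obj, h5py.Dataset):
        print(name, 'is a Dataset with shape', h5obj.shape)
    else:
        print(name, 'is a Group')

file.visititems(visitor)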
Yes, you can use the split() method.
If the string is "T00000000,T00000001,T00000002,T00000003, ... ,T00000474", you can use split() to turn it into a list like this:
string = "T00000000,T00000001,T00000002,T00000003, ... ,T00000474"
values = string.split(",")
So the list values becomes ["T00000000", "T00000001", "T00000002", ... , "T00000474"].
Then you can use this in a for loop.
If you don't want to create a list, you can simply do:
for value in string.split(","):
    # your code here...
The for loop will be executed with the values T00000000, T00000001, T00000002, ...

When saving a numpy array of float arrays to a .npy file using numpy.save/numpy.load, is there any reason why the order of the arrays would change?

I currently have data where each row has a text passage and a numpy float array.
As far as I know, it's not efficient to save these two datatypes into one data format (correct me if I am wrong). So I am going to save them separately, with another column of ints that will be used to map the two datasets back together when I want to join them again.
I am having trouble figuring out how to append a column of ints next to the float arrays (if anyone has a solution to that, I would love to hear it) and then save the numpy array.
But then I realized I can just save the float arrays as is with numpy.save without the extra int column if I can get a confirmation that numpy.save and numpy.load will never change the order of the arrays.
That way I can just append the loaded numpy float arrays to the pandas dataframe as is.
Logically, I don't see any reason why the order of the rows would change, but perhaps there is some optimization or compression step that I am unaware of.
Would numpy.save or numpy.load ever change the order of a numpy array of float arrays?
The order will not be changed by numpy save/load. You are saving the numpy object as is, and an array is an ordered object.
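As a quick sanity check, a sketch with made-up data confirms the round trip preserves row order:
import numpy as np

arr = np.random.rand(1000, 128)        # 1000 rows of float arrays
np.save('check.npy', arr)
loaded = np.load('check.npy')
print(np.array_equal(arr, loaded))     # True: rows come back in the same order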
Note: if you want to save multiple data arrays to the same file, you can use np.savez.
>>> np.savez('out.npz', f=array_of_floats, s=array_of_strings)
You can retrieve back each with the following:
>>> data = np.load('out.npz')
>>> array_of_floats = data['f']
>>> array_of_strings = data['s']

Preallocate very large array in Python leads to MemoryError

I am trying to preallocate a list in Python:
c = [1]*mM  # preallocate list
My problem is that I run into a MemoryError, since
mM = 4999999950000000
What is the best way to deal with this? I am thinking about creating a new object where I split my list at about 500000000 elements.
Is this what I should do or is there a best practice to create an array with a lot of inputs?
Using a Generator
You are attempting to create an object that you very likely will not be able to fit into your computer's memory. If you truly need to represent a list of that length, you can use a generator that dynamically produces values as they are needed.
def ones_generator(length):
    for _ in range(length):
        yield 1

gen = ones_generator(4999999950000000)
for i in gen:
    print(i)  # prints 1, a lot
Note: The question is tagged for Python 3, but if you are using Python 2.7, you will want to use xrange instead of range.
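As an aside (a sketch, not required for the answer), the standard library's itertools.repeat yields the same lazy stream of ones without a hand-written generator:
from itertools import repeat

gen = repeat(1, 4999999950000000)  # lazily yields 1 that many times; no list is materialized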
Using a Dictionary
By the sound of your question, you do not actually need to preallocate a list of that length; you want to store values very sparsely at indexes that are very large. This pattern matches the dict type in Python more than the list. You can simply store values in a dictionary, without pre-allocating the keys/space; Python handles that under the hood for you.
dct = {}
dct[100000] = "A string"
dct[592091] = 123
dct[4999999950000000] = "I promise, I need to be at this index"
print(dct[4999999950000000])
# I promise, I need to be at this index
In that example I just stored str and int values, but they can be any object in Python. The best part is that this dictionary will not consume memory based on the maximum index (like a list would), but instead based on how many values are stored within it.
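A quick way to see this (a small sketch; exact byte counts vary by Python version): sys.getsizeof shows that the dict's footprint depends on how many entries it holds, not on how large its keys are.
import sys

small_keys = {0: 'a', 1: 'b', 2: 'c'}
huge_keys = {100000: 'a', 592091: 'b', 4999999950000000: 'c'}
# both hold three entries, so their footprints match despite the huge keys:
print(sys.getsizeof(small_keys) == sys.getsizeof(huge_keys))  # True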

Python - Alternative for using numpy array as key in dictionary

I'm pretty new to Python and numpy. I attempted to use a numpy array as a key in a dictionary in one of my functions, and was then told by the Python interpreter that numpy arrays are not hashable. I've found out that one way to work around this issue is to use the repr() function to convert the numpy array to a string, but that seems very expensive. Is there a better way to achieve the same effect?
Update: I could create a new class to contain the numpy array, which seems to be the right way to achieve what I want. Just wondering if there is any better method?
Update 2: Using a class to contain the data in the array and then overriding the __hash__ function is acceptable; however, I'd prefer the solution provided by @hpaulj. Converting the array/list to a tuple fits my need better, as it does not require an additional class.
If you want to quickly store a numpy.ndarray as a key in a dictionary, a fast option is to use ndarray.tobytes(), which returns a raw Python bytes string that is immutable.
my_array = numpy.arange(4).reshape((2,2))
my_dict = {}
my_dict[my_array.tobytes()] = None
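One caveat (a sketch using the arrays above): tobytes() discards the shape and dtype, so if you ever need to rebuild the array from the key, you must carry those along yourself.
key = my_array.tobytes()
# numpy.frombuffer restores the flat data; the shape must be reapplied manually:
restored = numpy.frombuffer(key, dtype=my_array.dtype).reshape(my_array.shape)
print(numpy.array_equal(restored, my_array))  # True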
After doing some research and reading through all the comments, I think I know the answer to my own question, so I'll just write it down.
1. Write a class to contain the data in the array and then override the __hash__ function to amend the way it is hashed, as mentioned by ZdaR.
2. Convert the array to a tuple, which makes it hashable instantaneously. Thanks to hpaulj.
I'd prefer method No. 2 because it fits my need better and is simpler. However, using a class might bring some additional benefits, so it could also be useful.
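For completeness, a sketch of method No. 2: a 1-D array converts directly with tuple(), while a 2-D array needs nested tuples before it is hashable.
import numpy as np

arr1d = np.array([1, 2, 3])
d = {tuple(arr1d): 'value'}              # 1-D: tuple() is enough
arr2d = np.arange(4).reshape(2, 2)
d[tuple(map(tuple, arr2d))] = 'value2'   # 2-D: nest the tuples row by row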
I just ran into that issue and there's a very simple solution to it using a list comprehension:
import numpy as np
lookup = {'key1': 1, 'key2': 2}  # avoid naming this `dict`, which would shadow the builtin
my_array = np.array(['key1', 'key2'])
result = np.array([lookup[element] for element in my_array])
print(result)
The result should be:
[1 2]
I don't know how efficient this is, but it seems like a very practical and straightforward solution; no conversions or new classes needed :)
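If you prefer to stay inside NumPy, np.vectorize over the dictionary's .get method performs the same elementwise lookup (a sketch; it still calls Python per element, so don't expect it to be faster than the comprehension):
result = np.vectorize(lookup.get)(my_array)
print(result)  # [1 2]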

Accessing Data from .mat (version 8.1) structure in Python

I have a Matlab (.mat, version >7.3) file that contains a structure (data) that itself contains many fields. Each field is a single-column array. Each field represents an individual sensor, and the array is the time-series data. I am trying to open this file in Python to do some more analysis. I am using PyTables to read the data in:
import tables
impdat = tables.openFile('data_file.mat')
This reads the file in and I can enter the fileObject and get the names of each field by using:
impdat.root.data.__members__
This prints a list of the fields:
['rdg', 'freqlabels', 'freqbinsctr',... ]
Now, what I would like is a way to take each field in data and make a Python variable (perhaps a dictionary) with the field name as the key (if it is a dictionary) and the corresponding array as its value. I can see the size of the array by doing, for example:
impdat.root.data.rdg
which returns this:
/data/rdg (EArray(1, 1286920), zlib(3))
atom := Int32Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (1, 16290)
My question is: how do I access some of the data stored in that large array (1, 1286920)? How can I read that array into another Python variable (a list, dictionary, numpy array, etc.)? Any thoughts or guidance would be appreciated.
I have come up with a working solution. It is not very elegant, as it requires an eval. First I create a new variable (alldata) for the data I want to access, then I create an empty dictionary datastruct, and finally I loop over all the members of data and assign the arrays to the appropriate key in the dictionary:
alldata = impdat.root.data
datastruct = {}
for names in impdat.root.data.__members__:
    datastruct[names] = eval('alldata.' + names + '[0][:]')
The '[0]' could be superfluous depending on the structure of the data you are trying to access. In my case the data is stored in an array of arrays and I just want the first one. If you come up with a better solution, please feel free to share.
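A slightly tidier variant of the same loop (a sketch using the names from above) replaces the eval with getattr, which looks the attribute up by name directly:
alldata = impdat.root.data
datastruct = {}
for name in alldata.__members__:
    # getattr(alldata, name) fetches the same node as alldata.<name>
    datastruct[name] = getattr(alldata, name)[0][:]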
I can't seem to replicate your code. I get an error when trying to open the file (which I made in 8.0) using tables.
How about taking the variables within the structure and saving them to a new .mat file which contains only a collection of variables? This would make it much easier to deal with, and this has already been answered quite eloquently here.
That answer states that .mat files saved in the v7.3 format are simply HDF5 files, which can be read with:
import numpy as np, h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to numpy array
Not sure what size data set you're working with. If it's large, I'm sure I could come up with a script to pull the fields out of the structures. I did find this tool, which may be helpful: it recursively gets all of the structure field names.
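Building on that snippet, here is a sketch of how you might assemble the field-name-to-array dictionary the question asks for, assuming each field of the struct is stored as a dataset under the 'data' group:
import numpy as np, h5py

datastruct = {}
with h5py.File('data_file.mat', 'r') as f:
    for name in f['data'].keys():
        # each field of the struct becomes a numpy array keyed by its name
        datastruct[name] = np.array(f['data'][name])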
