Numpy and Applying Method to a Column - python

I have a numpy array that contains objects.
For example my array is:
a = np.array([{'a':1,'b':2}, ..., {'a':n,'b':n+1}])
The data itself is not that important; what I need to do is call a method on each object in the column.
Using my dictionary example, I want to call keys() on each row's object and get the keys back as a numpy array:
a[0].keys()
If I were using Pandas, I could leverage apply() on the column and use lambda functions to do this. For this case, I CANNOT use Pandas, so how can I do the same operation on a single numpy array column?
I tried using apply_along_axis, but the lambda receives the whole arr rather than one row at a time, so I would basically need a for loop inside my lambda to call my method.
np.apply_along_axis(lambda b: b.keys(), axis=0, arr=self.data)
The above code does not work! (I know this).
Is there a way to do a pandas.apply() using a numpy array?
The desired result in this case would be an N-row numpy array with lists of [a, b] in each element.

An object array like this can be treated as a list:
In [110]: n=2;a = np.array(({'a':1,'b':2},{'a':n,'b':n+1}))
In [111]: a
Out[111]: array([{'a': 1, 'b': 2}, {'a': 2, 'b': 3}], dtype=object)
In [112]: [d.keys() for d in a]
Out[112]: [dict_keys(['a', 'b']), dict_keys(['a', 'b'])]
You could also use frompyfunc, which will apply a function to all elements of an array (or broadcasted elements of several arrays):
In [114]: np.frompyfunc(lambda d:d.keys(),1,1)(a)
Out[114]: array([dict_keys(['a', 'b']), dict_keys(['a', 'b'])], dtype=object)
It returns an object array, which is fine in this case. np.vectorize uses this function as well, but takes an otypes parameter.
As a general rule, iterating on an object dtype array is faster than iterating on a numeric array (since all it has to do is return a pointer), but slower than the equivalent iteration on a list. Calculations on object dtype arrays are not as fast as the compiled numeric array calculations.
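For reference, a small sketch of the np.vectorize route mentioned above; otypes=[object] is passed so the result stays an object array, and list(d.keys()) is used so each element is a plain list rather than a dict_keys view:
import numpy as np

a = np.array([{'a': 1, 'b': 2}, {'a': 2, 'b': 3}])
get_keys = np.vectorize(lambda d: list(d.keys()), otypes=[object])
get_keys(a)      # object array whose elements are the lists ['a', 'b']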


default values for numpy ndarray

I was working with numpy.ndarray and something interesting happened.
I created an array with the shape of (2, 2) and left everything else with the default values.
It created an array for me with these values:
array([[2.12199579e-314, 0.00000000e+000],
[5.35567160e-321, 7.72406468e-312]])
I created another array with the same default values and it also gave me the same result.
Then I created a new array (using the default values and the shape (2, 2)) and filled it with zeros using the 'fill' method.
The interesting part is that now whenever I create a new array with ndarray it gives me an array with 0 values.
So what is going on behind the scenes?
See https://numpy.org/doc/stable/reference/generated/numpy.empty.html#numpy.empty:
(Precisely as @Michael Butscher commented)
np.empty([2, 2]) creates an array without touching the contents of the memory chunk allocated for the array; thus, the array may look as if filled with some more or less random values.
np.ndarray([2, 2]) does the same.
Other creation methods, however, fill the memory with some values:
np.zeros([2, 2]) fills the memory with zeros,
np.full([2, 2], 9) fills the memory with nines, etc.
Now, if you create a new array via np.empty() after creating (and disposing of, i.e. automatically garbage collected) an array filled with e.g. ones, your new array may be allocated the same chunk of memory and thus look as if "filled" with ones.
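A minimal sketch of that effect (as a plain script; the reuse is not guaranteed, so the printed values are illustrative only):
import numpy as np

_ = np.ones((2, 2))      # create an array of ones...
del _                    # ...and dispose of it, freeing its buffer
arr = np.empty((2, 2))   # may be handed the just-freed buffer
print(arr)               # might print 1.0 everywhere, or arbitrary garbage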
np.empty explicitly says it returns:
Array of uninitialized (arbitrary) data of the given shape, dtype, and
order. Object arrays will be initialized to None.
It's compiled code, so I can't say for sure, but I strongly suspect it just calls np.ndarray with the shape and dtype.
ndarray describes itself as a low level function, and lists many better alternatives.
In an ipython session I can make two arrays:
In [2]: arr = np.empty((2,2), dtype='int32'); arr
Out[2]:
array([[ 927000399, 1267404612],
[ 1828571807, -1590157072]])
In [3]: arr1 = np.ndarray((2,2), dtype='int32'); arr1
Out[3]:
array([[ 927000399, 1267404612],
[ 1828571807, -1590157072]])
The values are the same, but when I check the "location" of their data buffers, I see that they are different:
In [4]: arr.__array_interface__['data'][0]
Out[4]: 2213385069328
In [5]: arr1.__array_interface__['data'][0]
Out[5]: 2213385068176
We can't use that number in code to fiddle with the values, but it's useful as a human-readable indicator of where the data is stored. (Do you understand the basics of how arrays are stored, with shape, dtype, strides, and data-buffer?)
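(A hedged sketch of those basics, inspecting the arr created above:)
arr = np.empty((2, 2), dtype='int32')
arr.shape                            # (2, 2)    -- elements along each dimension
arr.dtype                            # int32     -- how each element's bytes are read
arr.strides                          # (8, 4)    -- bytes to step to the next row / next element
arr.__array_interface__['data'][0]   # integer address of the underlying data buffer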
Why the "uninitialized values" are the same is anyones guess; my guess it's just an artifact of the how that bit of memory was used before. np.empty stresses that we shouldn't place an significance to those values.
Doing the ndarray again, produces different values and location:
In [9]: arr1 = np.ndarray((2,2), dtype='int32'); arr1
Out[9]:
array([[1469865440, 515],
[ 0, 0]])
In [10]: arr1.__array_interface__['data'][0]
Out[10]: 2213403372816
apparent reuse
If I don't assign the array to a variable, or otherwise "hang on to it", numpy may reuse the data buffer memory:
In [17]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[17]: 2213403374512
In [18]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[18]: 2213403374512
In [19]: np.ndarray((2,2), dtype='int').__array_interface__['data'][0]
Out[19]: 2213403374512
In [20]: np.empty((2,2), dtype='int').__array_interface__['data'][0]
Out[20]: 2213403374512
Again, we shouldn't place too much significance on this reuse, and certainly shouldn't count on it for any calculations.
object dtype
If we specify the object dtype, then the values are initialized to None. This dtype contains references/pointers to objects in memory, and "random" pointers wouldn't be safe.
In [14]: arr1 = np.ndarray((2,2), dtype='object'); arr1
Out[14]:
array([[None, None],
[None, None]], dtype=object)
In [15]: arr1 = np.ndarray((2,2), dtype='U3'); arr1
Out[15]:
array([['', ''],
['', '']], dtype='<U3')

How come when a Series is passed to Numpy's exp() function, a Series is returned?

Unless numpy itself is programmed to return a Series when a Series is passed to it, this is very confusing, yet the documentation on this function doesn't mention that it returns a Series when a Series is passed to it.
Understand that I come from a Java background and I am new to Python.
The NumPy ufunc machinery has built-in hooks to customize how objects are treated by ufuncs. In this particular case, the numpy.exp ufunc calls the Series's __array__ method to get an array to work with, computes the exponential over the array, and then calls the Series's __array_wrap__ method on the resulting array to post-process it.
__array__ is how the ufunc gets an object it knows how to work with, and __array_wrap__ is how the result gets converted back to a Series instead of an array.
You can see the same mechanisms in action by writing your own class with those methods:
In [9]: class ArrayWrapper(object):
   ...:     def __init__(self, arr):
   ...:         self.arr = arr
   ...:     def __repr__(self):
   ...:         return 'ArrayWrapper({!r})'.format(self.arr)
   ...:     def __array__(self):
   ...:         return self.arr
   ...:     def __array_wrap__(self, arr):
   ...:         return ArrayWrapper(arr)
   ...:
In [10]: numpy.exp(ArrayWrapper(numpy.array([1, 2, 3])))
Out[10]: ArrayWrapper(array([ 2.71828183, 7.3890561 , 20.08553692]))
The difference between a Series and an ndarray object is that the Series object allows you to define your own labeled index and to access the elements of the Series using that index, whose labels can be strings, floats, ints, etc., whereas the ndarray object has a fixed integer index starting from 0.
The downside is that a Series is about 10 times slower than an ndarray.
The Series is the primary building block of pandas. A Series represents a one-dimensional labeled indexed array based on the NumPy ndarray. Like an array, a Series can hold zero or more values of any single data type. A Series can be created and initialized by passing either a scalar value, a NumPy ndarray, a Python list, or a Python Dict as the data parameter of the Series constructor.
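A short sketch of that (the labels and values here are made up for illustration):
import numpy as np
import pandas as pd

s = pd.Series({'x': 1.0, 'y': 2.0, 'z': 3.0})   # dict keys become the labeled index
s['y']       # 2.0 -- access by label
s.iloc[1]    # 2.0 -- access by position, like an ndarray
s.values     # the underlying NumPy ndarray
np.exp(s)    # returns a Series again, carrying the same 'x', 'y', 'z' index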
For more info, see pandas and NumPy arrays explained

How to efficiently extract values from nested numpy arrays generated by loadmat function?

Is there a more efficient method in Python to extract data from a nested numpy array such as A = array([[array([[12000000]])]], dtype=object)? I have been using A[0][0][0][0], but it does not seem like an efficient method when you have lots of data like A.
I have also used
numpy.squeeze(array([[array([[12000000]])]], dtype=object)) but this gives me
array(array([[12000000]]), dtype=object)
PS: The nested array was generated by loadmat() function in scipy module to load a .mat file which consists of nested structures.
Creating such an array is a bit tedious, but loadmat does it to handle the MATLAB cells and 2d matrix:
In [5]: A = np.empty((1,1),object)
In [6]: A[0,0] = np.array([[1.23]])
In [7]: A
Out[7]: array([[array([[ 1.23]])]], dtype=object)
In [8]: A.any()
Out[8]: array([[ 1.23]])
In [9]: A.shape
Out[9]: (1, 1)
squeeze compresses the shape, but does not cross the object boundary
In [10]: np.squeeze(A)
Out[10]: array(array([[ 1.23]]), dtype=object)
But if you have one item in an array (regardless of shape), item() can extract it. Indexing also works: A[0,0].
In [11]: np.squeeze(A).item()
Out[11]: array([[ 1.23]])
item again to extract the number from that inner array:
In [12]: np.squeeze(A).item().item()
Out[12]: 1.23
Or we don't even need the squeeze:
In [13]: A.item().item()
Out[13]: 1.23
loadmat has a squeeze_me parameter.
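For example (a hedged sketch; the .mat filename is a placeholder):
from scipy.io import loadmat

# squeeze_me=True collapses the singleton MATLAB dimensions at load time,
# so you get back plain scalars/arrays instead of array([[array([[...]])]]).
data = loadmat('data.mat', squeeze_me=True)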
Indexing is just as easy:
In [17]: A[0,0]
Out[17]: array([[ 1.23]])
In [18]: A[0,0][0,0]
Out[18]: 1.23
astype can also work (though it can be picky about the number of dimensions).
In [21]: A.astype(float)
Out[21]: array([[ 1.23]])
With single-item arrays like this, efficiency isn't much of an issue. All these methods are quick. Things become more complicated when the array has many items, or the items are themselves large.
You could use A.all() or A.any() to get a scalar. This would only work if A contains one element.
Try A.flatten()[0]
This will flatten the array into a single dimension and extract the first item from it. In your case, the first item is the only item.
What worked in my case was the following:
import scipy.io
xcat = scipy.io.loadmat(os.path.join(dir_data, file_name))
pars = xcat['pars'] # Extract numpy.void element from the loadmat object
# Note that you are dealing with a numpy structured array object when you enter pars[0][0].
# Thus you can access names and all that...
dict_values = [x[0][0] for x in pars[0][0]] # Extract all elements in one go
dict_keys = list(pars.dtype.names) # Extract the corresponding names/tags
dict_xcat = dict(zip(dict_keys, dict_values)) # Pack it up again in a dict
The idea behind this is: first extract ALL the values I want, and format them into a nice Python dict.
This saves me from cumbersome indexing later in the file...
Of course, this is a very specific solution, since in my case the values I needed were all floats/ints.

Save repeated calculations in Python Pandas

In Pandas, I can use .apply to apply functions to two columns. For example,
df = pd.DataFrame({'A':['a', 'a', 'a', 'b'], 'B':[3, 3, 2, 5], 'C':[2, 2, 2, 8]})
formula = lambda x: (x.B + x.C)**2
df.apply(formula, axis=1)
But notice that the results on the first two rows are the same, since all the inputs are the same. In a large dataset with complicated operations, these repeated calculations are likely to slow down my program. Is there a way I can program it so that I save time on these repeated calculations?
You can use a technique called memoization. For functions which accept hashable arguments, you can use the built-in functools.lru_cache.
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_function(B, C):
    return (B + C)**2

def formula(x):
    return cached_function(x.B, x.C)
Notice that I had to pass the values through to the cached function for lru_cache to work correctly because Series objects aren't hashable.
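Usage is then the same apply call as in the question; repeated (B, C) pairs are answered from the cache instead of being recomputed:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b'], 'B': [3, 3, 2, 5], 'C': [2, 2, 2, 8]})
df.apply(formula, axis=1)
# rows 0 and 1 share (B=3, C=2), so the second row reuses the cached result
# values: 25, 25, 16, 169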
You could use np.unique to create a copy of the dataframe consisting of only the unique rows, then do the calculation on those, and construct the full results.
For example:
import numpy as np
# convert to records for use with numpy
rec = df.to_records(index=False)
arr, ind = np.unique(rec, return_inverse=True)
# find dataframe of unique rows
df_small = pd.DataFrame(arr)
# Apply the formula & construct the full result
df_small.apply(formula, axis=1).iloc[ind].reset_index()
Even faster than using apply here would be to use broadcasting: for example, simply compute
(df.B + df.C) ** 2
If this is still too slow, you can use this method on the de-duplicated dataframe, as above.
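A hedged sketch combining both ideas with pandas' own tools (drop_duplicates plus a merge, instead of np.unique):
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b'], 'B': [3, 3, 2, 5], 'C': [2, 2, 2, 8]})

# compute on the unique (B, C) pairs only, then map the results back onto every row
unique_bc = df[['B', 'C']].drop_duplicates()
unique_bc['result'] = (unique_bc.B + unique_bc.C) ** 2
result = df.merge(unique_bc, on=['B', 'C'], how='left')['result']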

How to reshape a pandas.Series

It looks to me like a bug in pandas.Series.
a = pd.Series([1,2,3,4])
b = a.reshape(2,2)
b
b has type Series but cannot be displayed; the last statement raises a very lengthy exception whose last line is "TypeError: %d format: a number is required, not numpy.ndarray". b.shape returns (2,2), which contradicts its type Series. I am guessing that perhaps pandas.Series does not implement a reshape function and I am calling the version from np.array? Has anyone else seen this error? I am on pandas 0.9.1.
You can call reshape on the values array of the Series:
In [4]: a.values.reshape(2,2)
Out[4]:
array([[1, 2],
[3, 4]], dtype=int64)
I actually think it won't always make sense to apply reshape to a Series (do you ignore the index?), and that you're correct in thinking it's just numpy's reshape:
a.reshape?
Docstring: See numpy.ndarray.reshape
That said, I agree that the fact that it lets you try to do this looks like a bug.
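If you do want to keep labels after reshaping, one hedged follow-up (the column names here are made up) is to wrap the reshaped values in a DataFrame:
import pandas as pd

a = pd.Series([1, 2, 3, 4])
pd.DataFrame(a.values.reshape(2, 2), columns=['left', 'right'])
#    left  right
# 0     1      2
# 1     3      4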
The reshape function takes the new shape as a tuple rather than as multiple arguments:
In [4]: a.reshape?
Type: function
String Form:<function reshape at 0x1023d2578>
File: /Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/numpy/core/fromnumeric.py
Definition: numpy.reshape(a, newshape, order='C')
Docstring:
Gives a new shape to an array without changing its data.
Parameters
----------
a : array_like
Array to be reshaped.
newshape : int or tuple of ints
The new shape should be compatible with the original shape. If
an integer, then the result will be a 1-D array of that length.
One shape dimension can be -1. In this case, the value is inferred
from the length of the array and remaining dimensions.
Reshape is actually implemented in Series and will return an ndarray:
In [11]: a
Out[11]:
0 1
1 2
2 3
3 4
In [12]: a.reshape((2, 2))
Out[12]:
array([[1, 2],
[3, 4]])
You can directly use a.reshape((2,2)) to reshape a Series, but you cannot reshape a pandas DataFrame directly, because there is no reshape function for a pandas DataFrame; you can, however, reshape the underlying numpy ndarray:
convert the DataFrame to a numpy ndarray
do the reshape
convert back
e.g.
a = pd.DataFrame([[1,2,3],[4,5,6]])
b = a.as_matrix().reshape(3,2)
a = pd.DataFrame(b)
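Note that as_matrix() was removed in later pandas versions; the same steps with .values (or .to_numpy()) look like this:
a = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
b = a.values.reshape(3, 2)    # a.to_numpy().reshape(3, 2) in recent pandas
a = pd.DataFrame(b)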
Just use the code below:
b = a.values.reshape(2,2)
I think it will help you.
You can directly use the reshape() function, but it will give a FutureWarning.
For example, if we have a Series, we can change it to a DataFrame like this:
a = pd.DataFrame(a)
