I have a DataFrame where one column consists of tuples, i.e.
df['A'].values
# array([(1, 2), (5, 6), (11, 12)], dtype=object)
Now I want to split this into two different columns. A working solution is
df['A1'] = df['A'].apply(lambda x: x[0])
But this is extremely slow; on my DataFrame it takes several minutes. So I would like to vectorize this, to something like
df['A1'] = df['A'][:,0]
using pandas, numpy, or anything else. But all of them give me an error similar to
"*** KeyError: 'key of type tuple not found and not a MultiIndex'"
Is there any vectorized way? This feels like a super simple question and task, but I cannot find any working, properly vectorized function.
# Split each tuple at position n into two new columns (still apply-based, but builds both columns in one pass)
n: int = 1
df = pd.DataFrame(df["A"].apply(lambda x: (x[:n], x[n:])).tolist(), index=df.index)
You can also have a look at pandarallel.
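A minimal sketch of what that could look like, assuming pandarallel's initialize()/parallel_apply() interface:
from pandarallel import pandarallel
import pandas as pd

pandarallel.initialize()  # start the worker processes (default settings assumed)
df = pd.DataFrame({"A": [(1, 2), (5, 6), (11, 12)]})
# Same per-element apply as before, but distributed across cores
df["A1"] = df["A"].parallel_apply(lambda x: x[0])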
I'll do it in numpy and skip over the pandas bits.
You can get a decent speedup using np.fromiter together with either itertools.chain.from_iterable to extract everything in one go or operator.itemgetter for individual columns.
import operator as op
import itertools as it
import numpy as np
# Build an object array holding 10000 (x, y) tuples
a = [*zip(range(10000), range(10000, 20000))]
A = np.empty(10000, object)
A[...] = a
A
# array([(0, 10000), (1, 10001), (2, 10002), ..., (9997, 19997),
# (9998, 19998), (9999, 19999)], dtype=object)
# Extract all columns at once: flatten the tuples, then reshape and transpose
(*np.fromiter(it.chain.from_iterable(A), int, len(A[0])*A.size).reshape(A.size, -1).T,)
# (array([ 0, 1, 2, ..., 9997, 9998, 9999]), array([10000, 10001,
# 10002, ..., 19997, 19998, 19999]))
# Or extract a single column with itemgetter
np.fromiter(map(op.itemgetter(0), A), int, A.size)
# array([ 0, 1, 2, ..., 9997, 9998, 9999])
In pandas, it is simple to slice a Series (or array) such as [1,1,1,1,2,2,1,1,1,1] into groups of [1,1,1,1], [2,2], [1,1,1,1]. To do this, I use the syntax:
datagroups= df[key].groupby(df[key][df[key][variable] == some condition].index.to_series().diff().ne(1).cumsum())
...where I would obtain individual groups by df[key][variable] == some condition. Groups that have the same value of some condition but aren't contiguous are treated as separate groups. If the condition were x < 2, I would end up with [1,1,1,1], [1,1,1,1] from the above example.
I am attempting to do the same thing in xarray package, because I am working with multidimensional data, but the above syntax obviously doesn't work.
What I have been successful doing so far:
a) apply some condition to separate the values I want by NaNs:
datagroups_notsplit = df[key].where(df[key][variable] == some condition)
So now I have groups as in the example above, [1,1,1,1,NaN,NaN,1,1,1,1] (if some condition was x < 2). The question is, how do I cut these groups so that it becomes [1,1,1,1], [1,1,1,1]?
b) Alternatively, group by some condition...
datagroups_agglomerated = df[key].groupby_bins('variable', bins = [cleverly designed for some condition])
But then, following the example above, I end up with groups [1,1,1,1,1,1,1], [2,2]. Is there a way to then groupby the groups on noncontiguous index values?
Without knowing more about what your 'some condition' can be, or the domain of your data (small integers only?), I'd just workaround the missing pandas functionality, something like:
import pandas as pd
import xarray as xr
dat = xr.DataArray([1,1,1,1,2,2,1,1,1,1], dims='x')
# Use `diff()` to get groups of contiguous values
(dat.diff('x') != 0)
# ...prepend a leading 0 (pedantic syntax for xarray)
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x')
# ...take cumsum() to get group indices
xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum()
# array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
dat.groupby(xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum().rename('group'))
# DataArrayGroupBy, grouped over 'group'
# 3 groups with labels 0, 1, 2.
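A hypothetical follow-up, not part of the original answer: once the runs are labeled, you can keep just the runs that satisfy the condition (here x < 2):
groups = xr.concat([xr.DataArray(0), (dat.diff('x') != 0)], 'x').cumsum().rename('group')
# Keep only the contiguous runs whose values all satisfy the condition
runs = [g for _, g in dat.groupby(groups) if (g < 2).all()]
# -> two DataArrays, [1, 1, 1, 1] and [1, 1, 1, 1]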
The xarray "How do I ..." docs page could use some recipes like this ("group contiguous values"); I suggest you contact the developers and have it added.
My use case was a bit more complicated than the minimal example I posted, due to the use of time-series indices and the desire to subselect certain conditions; however, I was able to adapt the answer of smci above in the following way:
(1) create indexnumber variable:
df = Dataset(
    data_vars={
        'some_data': (('date'), some_data),
        'more_data': (('date'), more_data),
        'indexnumber': (('date'), arange(0, len(date_arr))),
    },
    coords={
        'date': date_arr
    }
)
(2) get the indices for the groupby groups:
ind_slice = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum().indexes
(3) get the cumsum field:
sumcum = (df.where(df['more_data'] == some_condition)['indexnumber'].dropna(dim='date').diff(dim='date') != 1).cumsum()
(4) reconstitute a new df:
df2 = df.loc[ind_slice]
(5) add the cumsum field:
df2['sumcum'] = sumcum
(6) groupby:
groups = df2.groupby(df2['sumcum'])
Hope this helps anyone else out there looking to do this.
I am trying to create a numpy array with 2 columns and multiple rows. The first column is meant to represent an input vector of size 3; the second column, an output vector of size 2.
arr = np.array([
    [np.array([1,2,3]), np.array([1,0])],
    [np.array([4,5,6]), np.array([0,1])]
])
I was expecting arr[:, 0].shape to return (2, 3), but it returns (2,).
What is the proper way to arrange input and output vectors into a matrix using numpy?
If you are sure the elements in each column have the same size/length, you can select and then stack the result using numpy.row_stack:
np.row_stack(arr[:,0]).shape
# (2, 3)
np.row_stack(arr[:,1]).shape
# (2, 2)
So, the code
arr = np.array([
[np.array([1,2,3]), np.array([1,0])],
[np.array([4,5,6]), np.array([0,1])]
])
creates an object array; indexing the first column gives you back two rows with one object in each, which accounts for the (2,) shape. To get what you want, you'd need to wrap it in something like
np.vstack(arr[:, 0])
which creates an array out of the objects in the first column. This isn't very convenient; it would make more sense to me to store these in a dictionary, something like
io = {'in': np.array([[1,2,3],[4,5,6]]),
'out':np.array([[1,0], [0,1]])
}
A structured array gives you a bit of both. Creation is a bit tricky; for the example given,
arr = np.array([
    ((1,2,3), (1,0)),
    ((4,5,6), (0,1)) ],
    dtype=[('in', '3int64'), ('out', '2float64')])
This creates a structured array with fields 'in' and 'out', consisting of 3 integers and 2 floats respectively. Rows can be accessed as usual,
In [72]: arr[0]
Out[72]: ([1, 2, 3], [ 1.,  0.])
Or by the field name
In [73]: arr['in']
Out[73]:
array([[1, 2, 3],
[4, 5, 6]])
The numpy manual has many more details (https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html). I can't add much more from experience; I've been intending to use structured arrays in a project for some time, but haven't yet.
I have a ValueError: 'object too deep for desired array' in a Python program.
I get this error while using numpy.digitize.
I think it's how I use Pandas DataFrames:
To keep it simple (because this is done through an external library), I have a list in my program but the library needs a DataFrame so I do something like this:
import numpy
import pandas

ts = range(1000)
df = pandas.DataFrame(ts)
res = numpy.digitize(df.values, bins)  # bins is defined elsewhere in my program
But then it seems like df.values is an array of lists instead of an array of floats. I mean:
array([[ 0],
[ 1],
[ 2],
...,
[997],
[998],
[999]], dtype=int64)
Help please, I spent too much time on this.
Try this:
numpy.digitize(df.iloc[:, 0], bins)
You are trying to get the values from a whole DataFrame. That is why you get the 2D array. Each row in the array is a row of the DataFrame.
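To make the shapes concrete, a small sketch (same setup as the question; the bins values are arbitrary, only for illustration):
import numpy
import pandas

ts = range(1000)
df = pandas.DataFrame(ts)
bins = [0, 250, 500, 750, 1000]       # arbitrary bins, only for illustration

df.values.shape                       # (1000, 1) -- the 2-D shape that triggered the error
df.iloc[:, 0].shape                   # (1000,)   -- a plain 1-D column
numpy.digitize(df.iloc[:, 0], bins)   # 1-D array of bin indices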
I know this is a relatively common topic on Stack Overflow, but I couldn't find the answer I was looking for. Basically, I am trying to write very efficient code (I have rather large data sets) to get certain columns of data from a matrix. Below is what I have so far. It gives me this error: could not broadcast input array from shape (2947,1) into shape (2947)
def get_data(self, colHeaders):
    temp = np.zeros((self.matrix_data.shape[0], len(colHeaders)))
    for col in colHeaders:
        index = self.header2matrix[col]
        temp[:, index:] = self.matrix_data[:, index]
    data = np.matrix(temp)
    return temp
Maybe this simple example will help:
In [70]: data=np.arange(12).reshape(3,4)
In [71]: header={'a':0,'b':1,'c':2}
In [72]: col=['c','a']
In [73]: index=[header[i] for i in col]
In [74]: index
Out[74]: [2, 0]
In [75]: data[:,index]
Out[75]:
array([[ 2, 0],
[ 6, 4],
[10, 8]])
data is some sort of 2D array; header is a dictionary mapping names to column numbers. Using the input col, I construct a column index list. With that list you can select all the columns at once, rather than one by one.
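Applied to the question's method, a hypothetical rewrite (attribute names copied from the question, not tested against the real class) might look like:
def get_data(self, colHeaders):
    # Map the requested header names to column numbers, then take all columns at once
    index = [self.header2matrix[col] for col in colHeaders]
    return self.matrix_data[:, index]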
I'd like to filter a NumPy 2-d array by checking whether another array contains a column value. How can I do that?
import numpy as np
ar = np.array([[1,2],[3,-5],[6,-15],[10,7]])
another_ar = np.array([1,6])
new_ar = ar[ar[:,0] in another_ar]
print new_ar
I hope to get [[1,2],[6,-15]] but above code prints just [1,2].
You can use np.where, but note that since ar[:,0] is an array of the first elements of ar, you need to loop over it and check for membership:
>>> ar[np.where([i in another_ar for i in ar[:,0]])]
array([[ 1, 2],
[ 6, -15]])
Instead of using in, you can use np.in1d to check which values in the first column of ar are also in another_ar and then use the boolean index returned to fetch the rows of ar:
>>> ar[np.in1d(ar[:,0], another_ar)]
array([[ 1, 2],
[ 6, -15]])
This is likely to be much faster than using any kind of for loop and testing membership with in.
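On newer NumPy versions, np.isin is the recommended spelling of the same check; a quick sketch with the same data:
>>> ar[np.isin(ar[:, 0], another_ar)]
array([[  1,   2],
       [  6, -15]])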