Pandas groupby overlapping list - python

I have a dataframe like this
data
0 1.5
1 1.3
2 1.3
3 1.8
4 1.3
5 1.8
6 1.5
And I have a list of lists like this:
indices = [[0, 3, 4], [0, 3], [2, 6, 4], [1, 3, 4, 5]]
I want to produce sums of each of the groups in my dataframe using the list of lists, so
group1 = df[0] + df[3] + df[4]
group2 = df[0] + df[3]
group3 = df[2] + df[6] + df[4]
group4 = df[1] + df[3] + df[4] + df[5]
so I am looking for something like df.groupby(indices).sum().
I know this can be done iteratively using a for loop and applying the sum to each of the df.iloc[sublist], but I am looking for a faster way.
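For reference, a minimal setup reproducing the frame and the index lists above (the exact construction is an assumption, not shown in the original question):

import numpy as np   # used by the answers below
import pandas as pd

# sample frame and overlapping index groups from the question
df = pd.DataFrame({'data': [1.5, 1.3, 1.3, 1.8, 1.3, 1.8, 1.5]})
indices = [[0, 3, 4], [0, 3], [2, 6, 4], [1, 3, 4, 5]]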

Use list comprehension:
a = [df.loc[x, 'data'].sum() for x in indices]
print (a)
[4.6, 3.3, 4.1, 6.2]
Or, for better performance, convert the column to a NumPy array first:
arr = df['data'].values
a = [arr[x].sum() for x in indices]
print (a)
[4.6, 3.3, 4.1, 6.2]
A solution with groupby + sum is also possible, but I am not sure it has better performance:
df1 = pd.DataFrame({
    'd' : df['data'].values[np.concatenate(indices)],
    'g' : np.arange(len(indices)).repeat([len(x) for x in indices])
})
print (df1)
d g
0 1.5 0
1 1.8 0
2 1.3 0
3 1.5 1
4 1.8 1
5 1.3 2
6 1.5 2
7 1.3 2
8 1.3 3
9 1.8 3
10 1.3 3
11 1.8 3
print(df1.groupby('g')['d'].sum())
g
0 4.6
1 3.3
2 4.1
3 6.2
Name: d, dtype: float64
Performance tested on small sample data - on real data it may be different:
In [150]: %timeit [df.loc[x, 'data'].sum() for x in indices]
4.84 ms ± 80.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [151]: %%timeit
     ...: arr = df['data'].values
     ...: [arr[x].sum() for x in indices]
     ...:
20.9 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [152]: %timeit pd.DataFrame({'d' : df['data'].values[np.concatenate(indices)],'g' : np.arange(len(indices)).repeat([len(x) for x in indices])}).groupby('g')['d'].sum()
1.46 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
On real data
In [37]: %timeit [df.iloc[x, 0].sum() for x in indices]
158 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [38]: arr = df['data'].values
...: %timeit \
...: [arr[x].sum() for x in indices]
5.99 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In[49]: %timeit pd.DataFrame({'d' : df['last'].values[np.concatenate(sample_indices['train'])],'g' : np.arange(len(sample_indices['train'])).repeat([len(x) for x in sample_indices['train']])}).groupby('g')['d'].sum()
...:
5.97 ms ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Interesting, both of the bottom answers are fast.

Related

How do I check if a value already appeared in pandas df column?

I have a Dataframe of stock prices...
I wish to have a boolean column that indicates if the price had reached a certain threshold in the previous rows or not.
My output should be something like this (let's say my threshold is 100):
index  price  bool
0      98     False
1      99     False
2      100.5  True
3      101    True
4      99     True
5      98     True
I've managed to do this with the following code but it's not efficient and takes a lot of time:
(df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
Please, any suggestions?
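For reference, a minimal frame reproducing the example above (assumed construction):

import pandas as pd

df = pd.DataFrame({'price': [98, 99, 100.5, 101, 99, 98]})
threshold = 100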
Use a comparison and cummax (once the price has reached the threshold, the cumulative maximum of the boolean comparison stays True):
threshold = 100
df['bool'] = df['price'].ge(threshold).cummax()
Note that it would work the other way around (although maybe less efficiently*):
threshold = 100
df['bool'] = df['price'].cummax().ge(threshold)
Output:
index price bool
0 0 98.0 False
1 1 99.0 False
2 2 100.5 True
3 3 101.0 True
4 4 99.0 True
5 5 98.0 True
* indeed on a large array:
%%timeit
df['price'].ge(threshold).cummax()
# 193 µs ± 4.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df['price'].cummax().ge(threshold)
# 309 µs ± 4.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Timing:
# setting up a dummy example with 10M rows
np.random.seed(0)
df = pd.DataFrame({'price': np.random.choice([0,1], p=[0.999,0.001], size=10_000_000)})
threshold = 0.5
## comparison
%%timeit
df['bool'] = (df.loc[:, 'price'] > threshold).cumsum().fillna(0).gt(0)
# 271 ms ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['bool'] = df['price'].ge(threshold).cummax()
# 109 ms ± 5.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df['bool'] = np.maximum.accumulate(df['price'].to_numpy()>threshold)
# 75.8 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fastest ways to filter for values in pandas dataframe

A similar dataframe can be created:
import pandas as pd
df = pd.DataFrame()
df["nodes"] = list(range(1, 11))
df["x"] = [1,4,9,12,27,87,99,121,156,234]
df["y"] = [3,5,6,1,8,9,2,1,0,-1]
df["z"] = [2,3,4,2,1,5,9,99,78,1]
df.set_index("nodes", inplace=True)
So the dataframe looks like this:
x y z
nodes
1 1 3 2
2 4 5 3
3 9 6 4
4 12 1 2
5 27 8 1
6 87 9 5
7 99 2 9
8 121 1 99
9 156 0 78
10 234 -1 1
My first try for searching e.g. all nodes containing number 1 is:
>>> df[(df == 1).any(axis=1)].index.values
[1 4 5 8 10]
As I have to do this for many numbers and my real dataframe is much bigger than this one, I'm searching for a very fast way to do this.
I just tried something that may be enlightening.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10000, 4)), columns=list('ABCD'))
df.set_index("A", inplace=True)
df_no_index = df.reset_index()
So set up a dataframe with ints right the way through. This is not the same as yours but it will suffice.
Then I ran four tests
%timeit df[(df == 1).any(axis=1)].index.values
%timeit df[(df['B'] == 1) | (df['C']==1)| (df['D']==1)].index.values
%timeit df_no_index[(df_no_index == 1).any(axis=1)].A.values
%timeit df_no_index[(df_no_index['B'] == 1) | (df_no_index['C']==1)| (df_no_index['D']==1)].A.values
The results I got were:
940 µs ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.47 ms ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.08 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.55 ms ± 51.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This shows that the initial method you took, with the index, seems to be the fastest of these approaches. Removing the index does not improve the speed on a moderately sized dataframe.
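Since the question mentions repeating this lookup for many numbers, one possible sketch (using the question's original x/y/z frame; the targets list is hypothetical) checks all of them in a single vectorized pass with isin:

targets = [1, 9]                      # hypothetical set of numbers to look up
mask = df.isin(targets).any(axis=1)   # True where any column matches any target
matching_nodes = df.index[mask].values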

Numpy merge 2 columns values

Let's say I have this numpy array:
[[3 2 1 5]
[3 2 1 5]
[3 2 1 5]
[3 2 1 5]]
How to merge the values of the last column into the first column (or any column to any column). Expected output:
[[8 2 1]
[8 2 1]
[8 2 1]
[8 2 1]]
I've found this solution. But, is there any better way than that?
As per the comment, you need to create a view or a copy of the array in order to get a new array with a different size. Here is a short comparison of the performance of view vs copy:
x = np.tile([1,3,2,4],(4,1))

def f(x):
    # calculation + view
    x[:,0] = x[:,0] + x[:,-1]
    return x[:,:-1]

def g(x):
    # calculation + copy
    x[:,0] = x[:,0] + x[:,-1]
    return np.delete(x, -1, 1)

def h(x):
    # calculation only
    x[:,0] = x[:,0] + x[:,-1]

%timeit f(x)
%timeit g(x)
%timeit h(x)
9.16 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
35 µs ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
7.81 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And if len(x) were 1M:
6.13 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
18 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
5.83 ms ± 720 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So the solution in the link is very economical: it applies the calculation plus an instant view.
I don't know if this is the best, but it's kind of clever.
In [66]: np.add.reduceat(arr[:,[0,3,1,2]], [0,2,3], axis=1)
Out[66]:
array([[8, 2, 1],
       [8, 2, 1],
       [8, 2, 1],
       [8, 2, 1]])
reduceat applies add to groups of columns (axis 1). I first reordered the columns to put the ones to be added together.
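A small copy-based helper along the same lines, generalizing "any column to any column" (the merge_columns name is just for illustration, not from either answer):

import numpy as np

def merge_columns(a, src, dst):
    # add column `src` into column `dst`, then drop `src` (works on a copy)
    out = a.copy()
    out[:, dst] += out[:, src]
    return np.delete(out, src, axis=1)

arr = np.array([[3, 2, 1, 5]] * 4)
print(merge_columns(arr, src=-1, dst=0))
# [[8 2 1]
#  [8 2 1]
#  [8 2 1]
#  [8 2 1]]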

How to generate 2 columns of incremental values and all their unique combinations with pandas?

I need to create a two-column dataframe.
The first column contains the values from 7000 to 15000 in increments of 500 (7000, 7500, 8000, ..., 14500, 15000).
The second column contains all the integers from 6 to 24.
I need a simple way to generate these values and all their unique combinations:
6,7000
6,7500
6,8000
....
24,14500
24,15000
You can use numpy.arange to generate the sequences of numbers, numpy.repeat and numpy.tile to generate the cross-product, and stack them using numpy.c_ or numpy.column_stack:
x = np.arange(6, 25)
y = np.arange(7000, 15001, 500)
pd.DataFrame(np.c_[x.repeat(len(y)),np.tile(y, len(x))])
# pd.DataFrame(np.column_stack([x.repeat(len(y)),np.tile(y, len(x))]))
0 1
0 6 7000
1 6 7500
2 6 8000
3 6 8500
4 6 9000
.. .. ...
318 24 13000
319 24 13500
320 24 14000
321 24 14500
322 24 15000
[323 rows x 2 columns]
Another idea is to use itertools.product
from itertools import product
pd.DataFrame(list(product(x,y)))
Timeit results:
# Henry's answer in the comments
In [44]: %timeit pd.DataFrame([(x,y) for x in range(6,25) for y in range(7000,15001,500)])
657 µs ± 169 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# My solution
In [45]: %%timeit
...: x = np.arange(6, 25)
...: y = np.arange(7000, 15001, 500)
...:
...: pd.DataFrame(np.c_[x.repeat(len(y)),np.tile(y, len(x))])
...:
...:
155 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Using `np.column_stack`
In [49]: %%timeit
...: x = np.arange(6, 25)
...: y = np.arange(7000, 15001, 500)
...:
...: pd.DataFrame(np.column_stack([x.repeat(len(y)),np.tile(y, len(x))]))
...:
121 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# `itertools.product` solution
In [62]: %timeit pd.DataFrame(list(product(x,y)))
489 µs ± 7.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
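Another pandas-native idea, not included in the timings above, is MultiIndex.from_product, which also enumerates all unique combinations:

import pandas as pd

x = range(6, 25)
y = range(7000, 15001, 500)
df = pd.MultiIndex.from_product([x, y]).to_frame(index=False)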

Pandas DataFrame: copy the contents of a column if it is empty

I have the following DataFrame with named columns and index:
'a' 'a*' 'b' 'b*'
1 5 NaN 9 NaN
2 NaN 3 3 NaN
3 4 NaN 1 NaN
4 NaN 9 NaN 7
The data source has caused some column headings to be copied slightly differently. For example, as above, some column headings are a string and some are the same string with an additional '*' character.
I want to copy any values (which are not null) from a* and b* columns to a and b, respectively.
Is there an efficient way to do such an operation?
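For reference, a minimal frame reproducing the example above (assumed construction):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a':  [5, np.nan, 4, np.nan],
                   'a*': [np.nan, 3, np.nan, 9],
                   'b':  [9, 3, 1, np.nan],
                   'b*': [np.nan, np.nan, np.nan, 7]},
                  index=[1, 2, 3, 4])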
Use np.where
df['a']= np.where(df['a'].isnull(), df['a*'], df['a'])
df['b']= np.where(df['b'].isnull(), df['b*'], df['b'])
Output:
a a* b b*
1 5.0 NaN 9.0 NaN
2 3.0 3.0 3.0 NaN
3 4.0 NaN 1.0 NaN
4 9.0 9.0 7.0 7.0
Using fillna() is a lot slower than np.where but has the advantage of being pandas only. If you want a faster method while keeping it pure pandas, you can use combine_first(), which according to the documentation is used to:
Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Translation: this is a method designed to do exactly what is asked in the question.
How do I use it?
df['a'].combine_first(df['a*'])
Performance:
df = pd.DataFrame({'A': [0, None, 1, 2, 3, None] * 10000, 'A*': [4, 4, 5, 6, 7, 8] * 10000})
def using_fillna(df):
    return df['A'].fillna(df['A*'])

def using_combine_first(df):
    return df['A'].combine_first(df['A*'])

def using_np_where(df):
    return np.where(df['A'].isnull(), df['A*'], df['A'])

def using_np_where_numpy(df):
    return np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_np_where_numpy(df)
1.34 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
281 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
257 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
166 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For better performance, it is possible to use numpy.isnan and convert the Series to numpy arrays with .values:
df['a'] = np.where(np.isnan(df['a'].values), df['a*'].values, df['a'].values)
df['b'] = np.where(np.isnan(df['b'].values), df['b*'].values, df['b'].values)
Another, more general solution, if the DataFrame contains only pairs of columns with/without '*' and the '*' columns need to be removed:
First create a MultiIndex by appending '*val' and splitting on '*':
df.columns = (df.columns + '*val').str.split('*', expand=True, n=1)
Then select each level with DataFrame.xs, so that DataFrame.fillna works nicely:
df = df.xs('*val', axis=1, level=1).fillna(df.xs('val', axis=1, level=1))
print (df)
a b
1 5.0 9.0
2 3.0 3.0
3 4.0 1.0
4 9.0 7.0
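If there are many such column pairs, a simpler (if less vectorized) sketch is to loop over the base columns and fill each one from its '*' twin before dropping the twins; the '*' suffix naming convention is assumed:

base_cols = [c for c in df.columns if not c.endswith('*')]
for col in base_cols:
    star = col + '*'
    if star in df.columns:
        # keep the base column's value, fall back to the starred twin where it is NaN
        df[col] = df[col].fillna(df[star])
df = df[base_cols]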
Performance (depends on the number of missing values and the length of the DataFrame):
df = pd.DataFrame({'A': [0, np.nan, 1, 2, 3, np.nan] * 10000,
                   'A*': [4, 4, 5, 6, 7, 8] * 10000})

def using_fillna(df):
    df['A'] = df['A'].fillna(df['A*'])
    return df

def using_np_where(df):
    df['B'] = np.where(df['A'].isnull(), df['A*'], df['A'])
    return df

def using_np_where_numpy(df):
    df['C'] = np.where(np.isnan(df['A'].values), df['A*'].values, df['A'].values)
    return df

def using_combine_first(df):
    df['D'] = df['A'].combine_first(df['A*'])
    return df
%timeit -n 100 using_fillna(df)
%timeit -n 100 using_np_where(df)
%timeit -n 100 using_combine_first(df)
%timeit -n 100 using_np_where_numpy(df)
1.15 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
533 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
591 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
423 µs ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
