How to improve performance of dataframe slice matching? - python

I need to improve the performance of the following dataframe slice matching.
What I need to do is find the matching trips between two dataframes, based on the sequence column values, with order preserved.
My 2 dataframes:
>>> df1
   trips sequence
0     11        a
1     11        d
2     21        d
3     21        a
4     31        a
5     31        b
6     31        c
>>> df2
   trips sequence
0     12        a
1     12        d
2     22        c
3     22        b
4     22        a
5     32        a
6     32        d
Expected output:
['11 match 12']
This is the code I'm using:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31],
                    'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32],
                    'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})

route_match = []
for trip1 in df1['trips'].drop_duplicates():
    for trip2 in df2['trips'].drop_duplicates():
        route1 = df1[df1['trips'] == trip1]['sequence']
        route2 = df2[df2['trips'] == trip2]['sequence']
        if np.array_equal(route1.values, route2.values):
            route_match.append(str(trip1) + ' match ' + str(trip2))
            break
        else:
            continue
Despite working, this is very time-consuming and inefficient, as my real dataframes are much longer.
Any suggestions?

You can aggregate each trip's sequence as a tuple with groupby.agg, then merge the two outputs to identify identical routes:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
               on='sequence')
output:
   trips_x sequence  trips_y
0       11   (a, d)       12
1       11   (a, d)       32
If you only want the first match, drop_duplicates the output of the df2 aggregation to prevent unnecessary merging:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence')
output:
   trips_x sequence  trips_y
0       11   (a, d)       12
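If you need the original 'X match Y' strings rather than the merged frame, you can build them from the merge result; a minimal sketch, assuming the out dataframe from above:
# build the "trip1 match trip2" strings from the merged frame `out`
route_match = (out['trips_x'].astype(str) + ' match ' + out['trips_y'].astype(str)).tolist()
print(route_match)  # ['11 match 12']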

Related

Get the minimum for each name in the data frame

I have this data frame:
data = {'name': ['a', 'a', 'b', 'c', 'd', 'b', 'b', 'a', 'c'],
        'number': [32, 25, 9, 43, 8, 5, 11, 21, 0]}
and I want to get the minimum number for each name, ignoring rows where the number column is 0.
For my example, I want this result:
data = {'col1': ['a', 'b', 'c', 'd'],
        'col2': [21, 5, 43, 8]}
I don't want repeated names.
IIUC, you can try:
df = df.mask(df.number.eq(0)).dropna().groupby('name', as_index=False).min()
OUTPUT:
  name  number
0    a    21.0
1    b     5.0
2    c    43.0
3    d     8.0
Try with sort_values + drop_duplicates
out = df.loc[df.number!=0].sort_values('number').drop_duplicates('name')
Out[24]:
  name  number
5    b       5
4    d       8
7    a      21
3    c      43
Try:
df = df.query('number != 0')
df.loc[df.groupby('name')['number'].idxmin().tolist()]
Output:
  name  number
7    a      21
5    b       5
3    c      43
4    d       8
replace with groupby:
df.replace({"number":{0:np.nan}}).groupby("name",as_index=False)['number'].min()
  name  number
0    a    21.0
1    b     5.0
2    c    43.0
3    d     8.0
Cast it back to int if you want, using astype.
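For example, a minimal sketch (assuming the aggregated result is stored in a dataframe named out):
out['number'] = out['number'].astype(int)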

pandas lambda function returns both df and series, why?

Given a df and a lambda function:
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'],
                   't': [1, 2, 3, 4, 5, 1, 2, 3, 4, 1, 2],
                   'x': [48, 6, 30, 30, 53, 48, 25, 51, 9, 55, 2]})
top3 = lambda x: x.groupby('t')['x'].idxmax().head(3)
I tried a few combinations of label and got varying results when the function is called:
print(df.groupby('label').apply(top3))
label  t
a      1     0
       2     1
       3     2
b      1     5
       2     6
       3     7
c      1     9
       2    10
Name: x, dtype: int64
df2 = df[df.label=='a']
print(df2.groupby('label').apply(top3))
t      1  2  3
label
a      0  1  2
df3 = df[df.label.isin(['a', 'b'])]
print(df3.groupby('label').apply(top3))
t      1  2  3
label
a      0  1  2
b      5  6  7
The first result is a Series while the next two are DataFrames. Why is this so?
.groupby().apply() has a lot of magic behind it that tries to coerce the combined result into what it thinks the best shape is. When every group returns a result of the same length (groups 'a' and 'b' each yield three values), the per-group results can be stacked into a clean rectangular DataFrame. Group 'c' has only two distinct t values, so its result is shorter; with 'c' included, pandas falls back to concatenating the results into a Series with a MultiIndex:
In [71]: df[df.label.isin(['a', 'c'])].groupby('label').apply(top3)
Out[71]:
label  t
a      1     0
       2     1
       3     2
c      1     9
       2    10
Name: x, dtype: int64
If you want to follow the rabbit hole in pandas' code, you can start here: https://github.com/pandas-dev/pandas/blob/30362ed828bebdd58d4f1f74d70236d32547d52a/pandas/core/groupby/ops.py#L189
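A quick way to see the shape mismatch, as a minimal sketch using the df and top3 defined above: group 'c' produces only two rows from head(3), while 'a' and 'b' produce three, so the per-group results cannot be stacked into one rectangular DataFrame.
lengths = df.groupby('label').apply(lambda g: len(top3(g)))
print(lengths)
# label
# a    3
# b    3
# c    2
# dtype: int64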

Python - Pandas - Edit duplicate items keeping last

Let's say my df is:
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'col2': [10, 20, 30, 10, 20, 10, 10, 20, 30]})
How can I set all the numbers to zero, keeping only the last one for each value of col1? In this case the result should be:
col1  col2
a        0
a        0
a       30
b        0
b       20
c       10
d        0
d        0
d       30
Thanks!
Use loc and duplicated with the argument keep='last':
df.loc[df.duplicated(subset='col1',keep='last'), 'col2'] = 0
>>> df
  col1  col2
0    a     0
1    a     0
2    a    30
3    b     0
4    b    20
5    c    10
6    d     0
7    d     0
8    d    30
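An equivalent one-liner with Series.mask (a sketch, not from the original answer) zeroes col2 on every row that is not the last occurrence of its col1 value:
df['col2'] = df['col2'].mask(df.duplicated(subset='col1', keep='last'), 0)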

Pandas - aggregate over inconsistent value types (string vs list)

Given the following DataFrame, I am trying to aggregate over columns 'A' and 'C': for 'A', count unique appearances of the strings, and for 'C', sum the values.
The problem arises when some of the samples in 'A' are actually lists of those strings.
Here's a simplified example:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2],
                   'A': ['a', 'a', 'a', 'b', ['b', 'c', 'd'], 'a', 'a', ['a', 'b', 'c']],
                   'C': [1, 2, 15, 5, 13, 6, 7, 1]})
df
Out[100]:
   ID          A   C
0   1          a   1
1   1          a   2
2   1          a  15
3   1          b   5
4   1  [b, c, d]  13
5   2          a   6
6   2          a   7
7   2  [a, b, c]   1
aggs = {'A': lambda x: x.nunique(dropna=True),
        'C': 'sum'}
# This will result in an error: TypeError: unhashable type: 'list'
agg_df = df.groupby('ID').agg(aggs)
I'd like the following output:
print(agg_df)
    A   C
ID
1   4  36
2   3  14
This is because for 'ID' = 1 we had 'a', 'b', 'c' and 'd', and for 'ID' = 2 we had 'a', 'b' and 'c'.
One solution is to split your problem into two parts: first flatten your dataframe so that df['A'] consists only of strings, then concatenate a couple of GroupBy operations.
Step 1: Flatten your dataframe
You can use itertools.chain and numpy.repeat to chain and repeat values as appropriate.
from itertools import chain
import numpy as np
import pandas as pd

# wrap scalar entries in lists so every value of A is a list
A = df['A'].apply(lambda x: [x] if not isinstance(x, list) else x)
lens = A.map(len)
# repeat ID and C once per element of the corresponding list in A
res = pd.DataFrame({'ID': np.repeat(df['ID'], lens),
                    'A': list(chain.from_iterable(A)),
                    'C': np.repeat(df['C'], lens)})
print(res)
#    A   C  ID
# 0  a   1   1
# 1  a   2   1
# 2  a  15   1
# 3  b   5   1
# 4  b  13   1
# 4  c  13   1
# 4  d  13   1
# 5  a   6   2
# 6  a   7   2
# 7  a   1   2
# 7  b   1   2
# 7  c   1   2
Step 2: Concatenate GroupBy on original and flattened
agg_df = pd.concat([res.groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
#     A   C
# ID
# 1   4  36
# 2   3  14
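On newer pandas (0.25+), DataFrame.explode can replace the manual flattening; a minimal sketch, assuming the same df as above (explode expands the list entries of 'A' and leaves the scalar entries untouched):
agg_df = pd.concat([df.explode('A').groupby('ID')['A'].nunique(),
                    df.groupby('ID')['C'].sum()], axis=1)
print(agg_df)
#     A   C
# ID
# 1   4  36
# 2   3  14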

Conditional multiplication of multiple series with another series

I would like to multiply (in place) values in one column of a DataFrame by values in another column, based on a condition in a third column. For example:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 33, 56, 79, 2], 'b': [9, 12, 14, 5, 5], 'c': np.arange(5)})
data.loc[data.a > 10, ['a', 'b']] *= data.loc[data.a > 10, 'c']
What I would like this to do is multiply the values of both 'a' and 'b' by the corresponding (same row) value in 'c' based on a condition. However, the above code just results in NaN values in the desired range.
The closest workaround I've found has been to do this:
data.loc[data.a > 10, ['a', 'b']] = (data.loc[data.a > 10, ['a', 'b']].as_matrix().T * data.loc[data.a > 10, 'c']).T
which works, but it seems like there is a better (more Pythonic) way that I'm missing.
You can use the mul(..., axis=0) method:
In [122]: mask = data.a > 10
In [125]: data.loc[mask, ['a','b']] = data.loc[mask, ['a','b']].mul(data.loc[mask, 'c'], 0)
In [126]: data
Out[126]:
     a   b  c
0    1   9  0
1   33  12  1
2  112  28  2
3  237  15  3
4    2   5  4
Here is an alternative using Series.where() to update values conditionally:
data[['a', 'b']] = data[['a', 'b']].apply(lambda m: m.where(data.a <= 10, m*data.c))
Use update:
data.update(data.query('a > 10')[['a', 'b']].mul(data.query('a > 10').c, 0))
data
Well it seems NumPy could be an alternative here -
arr = data.values
mask = arr[:,0] > 10
arr[mask,:2] *= arr[mask,2,None]
We just extracted the values as an array, which is a view into the dataframe; that lets us work on the array, and the updates are automatically reflected in the dataframe. Here's a sample run to show the progress -
In [507]: data # Input dataframe
Out[507]:
    a   b  c
0   1   9  0
1  33  12  1
2  56  14  2
3  79   5  3
4   2   5  4
Run the proposed code -
In [508]: arr = data.values
In [509]: mask = arr[:,0] > 10
In [510]: arr[mask,:2] *= arr[mask,2,None]
Verify results with dataframe -
In [511]: data
Out[511]:
     a   b  c
0    1   9  0
1   33  12  1
2  112  28  2
3  237  15  3
4    2   5  4
Let's verify in another way that we were indeed working with a view there -
In [512]: np.may_share_memory(data,arr)
Out[512]: True
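One caveat (my own note, not part of the answer): whether .values gives a view depends on the dtypes and the pandas version, so a safer variant is to start from the original data, compute on an explicit array, and assign the result back:
arr = data[['a', 'b', 'c']].to_numpy()   # may be a copy rather than a view
mask = arr[:, 0] > 10
arr[mask, :2] *= arr[mask, 2, None]
data[['a', 'b']] = arr[:, :2]            # write the result back explicitly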
# %%
import pandas as pd
import numpy as np

data = pd.DataFrame({'a': [1, 33, 56, 79, 2],
                     'b': [9, 12, 14, 5, 5],
                     'c': np.arange(5)})

# note: DataFrame.append and .sort() only exist in older pandas versions
(data.loc[data.a > 10, ['a', 'b']]\
    .T * data.loc[data.a > 10, 'c'])\
    .T.append(data.loc[data.a <= 10, ['a', 'b']])\
    .T.append(data.c).T.sort()
# %%
Out[17]:
     a   b  c
0    1   9  0
1   33  12  1
2  112  28  2
3  237  15  3
4    2   5  4
