Finding combinations that meet thresholds for subgroups - python

I need to find all the combinations of rows where multiple conditions are met.
I tried to use the powerset recipe from itertools and the answer here by adding multiple conditions, but I can't seem to get the conditions to work properly.
The code I've come up with is:
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

df_groups = pd.concat(
    [data.reindex(l).assign(Group=n) for n, l in enumerate(powerset(data.index))
     if ((data.loc[l, 'Account'] == 'COS').any() & (data.loc[l, 'Amount'].sum() >= 100)
         & (data.loc[l, 'Account'] == 'Rev').any() & (data.loc[l, 'Amount'].sum() >= 150)
         & (data.loc[l, 'Account'] == 'Inv').any() and (data.loc[l, 'Amount'].sum() >= 60))])
What I'm trying to do above is find only those combinations where the following thresholds are met/exceeded:
Account Amount
COS 150
Rev 100
Inv 60
Sample data:
Entity Account Amount Location
A10 Rev 60 A
B01 Rev 90 B
C11 Rev 80 C
B01 COS 90 B
C11 COS 80 C
A10 Inv 60 A
Apologies in advance for the poor question-writing etiquette; it's the first time I haven't been able to find an answer on Stack Overflow and have had to ask a question.
Also, aware that this will get very slow as len(data) increases so any suggestions on that end are also greatly appreciated.
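For illustration, here is a minimal sketch of one way the per-account check could be expressed, assuming the thresholds in the table above apply to the sum of each account within a candidate combination (the meets_thresholds and valid_groups helpers are made up for this sketch):
from itertools import chain, combinations

import pandas as pd

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# Thresholds taken from the table above; each account's subtotal must meet its own minimum.
THRESHOLDS = {'COS': 150, 'Rev': 100, 'Inv': 60}

def meets_thresholds(rows, data):
    # Sum Amount per Account within this combination and compare against each threshold.
    sums = data.loc[list(rows)].groupby('Account')['Amount'].sum()
    return all(sums.get(account, 0) >= minimum for account, minimum in THRESHOLDS.items())

def valid_groups(data):
    # Keep only non-empty combinations that satisfy every per-account threshold.
    keep = [rows for rows in powerset(data.index) if rows and meets_thresholds(rows, data)]
    return pd.concat([data.loc[list(rows)].assign(Group=n) for n, rows in enumerate(keep)])
This still enumerates the full powerset, so it will slow down quickly as len(data) grows; it is only meant to show where the per-account condition fits.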

Let's start by creating the dataframe that OP mentions in the question
df = pd.DataFrame({'Entity': ['A10', 'B01', 'C11', 'B01', 'C11', 'A10'],
                   'Account': ['Rev', 'Rev', 'Rev', 'COS', 'COS', 'Inv'],
                   'Amount': [60, 90, 80, 90, 80, 60],
                   'Location': ['A', 'B', 'C', 'B', 'C', 'A']})
[Out]:
Entity Account Amount Location
0 A10 Rev 60 A
1 B01 Rev 90 B
2 C11 Rev 80 C
3 B01 COS 90 B
4 C11 COS 80 C
5 A10 Inv 60 A
Then, to achieve OP's goal of filtering based on the specific constraints, one can use a one-liner with pandas.concat and pandas.DataFrame.query, as follows
df_new = pd.concat([df[df['Account'] == 'Rev'].query('Amount <= 100'), df[df['Account'] == 'COS'].query('Amount <= 150'), df[df['Account'] == 'Inv'].query('Amount <= 60')])
[Out]:
Entity Account Amount Location
0 A10 Rev 60 A
1 B01 Rev 90 B
2 C11 Rev 80 C
3 B01 COS 90 B
4 C11 COS 80 C
5 A10 Inv 60 A
As the sample dataframe doesn't give a clear picture of whether it is working, let us create a new random dataframe for testing purposes.
import numpy as np
df = pd.DataFrame({'Entity': np.random.choice(['A10', 'B01', 'C11', 'B01', 'C11', 'A10'], 1000),
                   'Account': np.random.choice(['Rev', 'COS', 'Inv'], 1000),
                   'Amount': np.random.randint(0, 1000, 1000),
                   'Location': np.random.choice(['A', 'B', 'C'], 1000)})
[Out]:
Entity Account Amount Location
0 B01 Rev 497 A
1 B01 Rev 52 C
2 B01 Rev 42 C
3 B01 Rev 285 B
4 A10 COS 714 B
5 A10 Rev 288 B
6 B01 Rev 396 B
7 A10 Inv 277 B
8 C11 Inv 435 C
9 C11 COS 228 C
If one runs the one-liner on that newly created dataframe, one gets the following
df_new = pd.concat([df[df['Account'] == 'Rev'].query('Amount <= 100'), df[df['Account'] == 'COS'].query('Amount <= 150'), df[df['Account'] == 'Inv'].query('Amount <= 60')])
[Out]:
Entity Account Amount Location
1 B01 Rev 52 C
2 B01 Rev 42 C
21 B01 Rev 1 A
31 C11 Rev 38 A
47 A10 Rev 83 C
60 B01 Rev 41 C
156 B01 Rev 81 C
197 C11 Rev 61 C
206 C11 Rev 90 A
224 C11 Rev 23 B
which, judging from the sample shown, satisfies the requirements.
There are additional ways to solve this.
Another example is using pandas.DataFrame.apply and a lambda function as follows
df_new = df[df.apply(lambda x: x['Amount'] <= 100 if x['Account'] == 'Rev' else x['Amount'] <= 150 if x['Account'] == 'COS' else x['Amount'] <= 60, axis=1)]
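An alternative, not shown above, is to vectorise the same three conditions with boolean masks instead of a row-wise apply; a sketch assuming the same df and column names:
import numpy as np

# Build one boolean mask per account/threshold pair and keep rows matching any of them.
conditions = [
    df['Account'].eq('Rev') & df['Amount'].le(100),
    df['Account'].eq('COS') & df['Amount'].le(150),
    df['Account'].eq('Inv') & df['Amount'].le(60),
]
df_new = df[np.logical_or.reduce(conditions)]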

Related

Optimising finding row pairs in a dataframe

I have a dataframe with rows that describe a movement of value between nodes in a system. This dataframe looks like this:
index from_node to_node value invoice_number
0 A E 10 a
1 B F 20 a
2 C G 40 c
3 D H 60 d
4 E I 35 c
5 X D 43 d
6 Y F 50 d
7 E H 70 a
8 B A 55 b
9 X B 33 a
I am looking to find "swaps" in the invoice history. A swap is defined where a node both receives and sends a value to a different node within the same invoice number. In the above dataset there are two swaps in invoice "a", and one swap in invoice "d" ("sent to" and "received from" could be the same node in the same row):
index node sent_to sent_value received_from received_value invoice_number
0 B F 20 X 33 a
1 E H 70 A 10 a
2 D H 60 X 43 d
I have solved this problem by iterating over all of the unique invoice numbers in the dataset and then iterating over each row within that invoice number to find pairs:
import pandas as pd

df = pd.DataFrame({
    'from_node': ['A', 'B', 'C', 'D', 'E', 'X', 'Y', 'E', 'B', 'X'],
    'to_node': ['E', 'F', 'G', 'H', 'I', 'D', 'F', 'H', 'A', 'B'],
    'value': [10, 20, 40, 60, 35, 43, 50, 70, 55, 33],
    'invoice_number': ['a', 'a', 'c', 'd', 'c', 'd', 'd', 'a', 'b', 'a'],
})

invoices = set(df.invoice_number)
list_df_swap = []
for invoice in invoices:
    df_inv = df[df.invoice_number.isin([invoice])]
    for r in df_inv.itertuples():
        df_is_swap = df_inv[df_inv.to_node.isin([r.from_node])]
        if len(df_is_swap.index) == 1:
            swap = {'node': r.from_node,
                    'sent_to': r.to_node,
                    'sent_value': r.value,
                    'received_from': df_is_swap.iloc[0]['from_node'],
                    'received_value': df_is_swap.iloc[0]['value'],
                    'invoice_number': r.invoice_number}
            list_df_swap.append(pd.DataFrame(swap, index=[0]))
df_swap = pd.concat(list_df_swap, ignore_index=True)
The total dataset consists of several hundred million rows, and this approach is not very efficient. Is there a way to solve this problem using some kind of vectorised solution, or another method that would speed up the execution time?
Calculate all possible swaps, regardless of the invoice number:
swaps = df.merge(df, left_on='from_node', right_on='to_node')
Then select those that have the same invoice number:
columns = ['from_node_x', 'to_node_x', 'value_x', 'from_node_y', 'value_y',
'invoice_number_x']
swaps[swaps.invoice_number_x == swaps.invoice_number_y][columns]
# from_node_x to_node_x value_x from_node_y value_y invoice_number_x
#1 B F 20 X 33 a
#3 D H 60 X 43 d
#5 E H 70 A 10 a
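To match the column names of the desired output, the filtered merge result could additionally be renamed; a sketch building on the swaps frame above:
# Rename the merged columns into the layout shown in the question (node, sent_to, ...).
result = (
    swaps[swaps.invoice_number_x == swaps.invoice_number_y]
    .rename(columns={'from_node_x': 'node', 'to_node_x': 'sent_to', 'value_x': 'sent_value',
                     'from_node_y': 'received_from', 'value_y': 'received_value',
                     'invoice_number_x': 'invoice_number'})
    [['node', 'sent_to', 'sent_value', 'received_from', 'received_value', 'invoice_number']]
    .reset_index(drop=True)
)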

rolling mean with a moving window

My dataframe has a daily price column and a window size column :
df = pd.DataFrame(columns = ['price', 'window'],
data = [[100, 1],[120, 2], [115, 2], [116, 2], [100, 4]])
df
price window
0 100 1
1 120 2
2 115 2
3 116 2
4 100 4
I would like to compute the rolling mean of price for each row using the window of the window column.
The result would be this :
df
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
I don't find any elegant way to do it with apply and I refuse to loop over each row of my DataFrame...
The best solutions, in terms of raw speed and complexity, are based on the idea of a summed-area table; the problem can be considered its one-dimensional counterpart. Below you can find several approaches, ranked from best to worst.
Numpy + Linear complexity
size = len(df['price'])
# Prefix sums of price, with a leading zero so that price[i] is the sum of the first i prices.
price = np.zeros(size + 1)
price[1:] = df['price'].values.cumsum()
# For each row, the index where its window starts, clipped at 0.
window = np.clip(np.arange(size) - (df['window'].values - 1), 0, None)
# Windowed sum = difference of prefix sums; divide by the window length for the mean.
df['rolling_mean_price'] = (price[1:] - price[window]) / df['window'].values
print(df)
Output
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75
Loopy + Linear complexity
price = df['price'].values.cumsum()
df['rolling_mean_price'] = [(price[i] - float((i - w) > -1) * price[i-w]) / w for i, w in enumerate(df['window'])]
Loopy + Quadratic complexity
price = df['price'].values
df['rolling_mean_price'] = [price[i - (w - 1):i + 1].mean() for i, w in enumerate(df['window'])]
I would not recommend this approach using pandas.DataFrame.apply() (reasons described here), but if you insist on it, here is one solution:
# For each row, compute the full rolling mean with that row's window size
# and pick out the value at that row's position (quadratic, hence slow).
df['rolling_mean_price'] = df.apply(
    lambda row: df.rolling(row.window).price.mean().iloc[row.name], axis=1)
The output looks like this:
>>> print(df)
price window rolling_mean_price
0 100 1 100.00
1 120 2 110.00
2 115 2 117.50
3 116 2 115.50
4 100 4 112.75

stop groupby from making 2 combinations of the same pair in a python dataframe

I am working on the IPL dataset from Kaggle (https://www.kaggle.com/manasgarg/ipl).
I want to sum up the runs made by two people as a pair, and I have prepared my data accordingly.
When I try a groupby on the dataframe columns (batsman and non_striker), it produces two combinations of the same pair,
like (a, b) and (b, a), whereas I want them treated as the same pair.
I can't drop rows any further.
import pandas as pd
df = pd.read_csv("C:\\Users\\Yash\\AppData\\Local\\Programs\\Python\\Python36-32\\Machine Learning\\IPL\\deliveries.csv")
df = df[(df["is_super_over"] != 1)]
df["pri_key"] = df["match_id"].astype(str) + "-" + df["inning"].astype(str)
openners = df[(df["over"] == 1) & (df["ball"] == 1)]
openners = openners[["pri_key", "batsman", "non_striker"]]
openners = openners.rename(columns = {"batsman":"batter1", "non_striker":"batter2"})
df = pd.merge(df, openners, on="pri_key")
df = df[["batsman", "non_striker", "batter1", "batter2", "batsman_runs"]]
df = df[((df["batsman"] == df["batter1"]) | (df["batsman"] == df["batter2"]))
& ((df["non_striker"] == df["batter1"]) | (df["non_striker"] == df["batter2"]))]
df1 = df.groupby(["batsman" , "non_striker"], group_keys = False)["batsman_runs"].agg("sum")
df1.nlargest(10)
Result:
batsman non_striker
DA Warner S Dhawan 1294
S Dhawan DA Warner 823
RV Uthappa G Gambhir 781
DR Smith BB McCullum 684
CH Gayle V Kohli 674
MEK Hussey M Vijay 666
M Vijay MEK Hussey 629
G Gambhir RV Uthappa 611
BB McCullum DR Smith 593
CH Gayle TM Dilshan 537
I want to keep each pair as a single, unique combination.
For those who don't follow cricket, here is a simplified example. I have a dataframe:
batsman non_striker runs
a b 2
a b 3
b a 1
c d 6
d c 1
d c 4
b a 3
e f 1
f e 2
f e 6
df1 = df.groupby(["batsman" , "non_striker"], group_keys = False)["batsman_runs"].agg("sum")
df1.nlargest(30)
output:
batsman non_striker runs
a b 5
b a 4
c d 6
d c 5
e f 1
f e 8
expected output:
batsman non_striker runs
a b 9
c d 11
e f 9
what should I do? Please advise....
You can sort the batsman and non_striker values within each row and then group the data:
df[['batsman', 'non_striker']] = df[['batsman', 'non_striker']].apply(sorted, axis=1)
df.groupby(['batsman', 'non_striker']).batsman_runs.sum().nlargest(10)
Edit: You can also use numpy for sorting the columns, which will be faster than using pandas sorted
df[['batsman', 'non_striker']] = np.sort(df[['batsman', 'non_striker']],1)
df.groupby(['batsman', 'non_striker'], sort = False).batsman_runs.sum().nlargest(10).sort_index()
Either way, you will get,
batsman non_striker
CH Gayle V Kohli 2650
DA Warner S Dhawan 2242
AB de Villiers V Kohli 2135
G Gambhir RV Uthappa 1795
M Vijay MEK Hussey 1302
BB McCullum DR Smith 1277
KA Pollard RG Sharma 1220
MEK Hussey SK Raina 1129
AT Rayudu RG Sharma 1121
AM Rahane SR Watson 1118
Create a new DataFrame using np.sort, then groupby and sum.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.sort(df[['batsman', 'non_striker']].values, 1),
                   index=df.index,
                   columns=['player_1', 'player_2']).assign(runs=df.runs)
df1.groupby(['player_1', 'player_2']).runs.sum()
Output:
player_1 player_2
a b 9
c d 11
e f 9
Name: runs, dtype: int64
I hope I understand you right...
What you can do is always put the smaller value in column A and the greater value in column B.
import pandas as pd
import numpy as np
# generate example
values = ['a', 'b' , 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame()
df['batsman'] = np.random.choice(values, size=10)
df['no_striker'] = np.random.choice(values, size=10)
# column evaluation
df['smaller'] = df['batsman'].where(df['batsman'] < df['no_striker'], df['no_striker'])
df['greater'] = df['batsman'].where(df['batsman'] > df['no_striker'], df['no_striker'])
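The grouping step itself is not shown above; a sketch of how it could be finished on the normalised columns, with a made-up runs column added to the toy frame so the example is self-contained:
# Add a toy runs column (not in the generated example above), then aggregate per normalised pair.
df['runs'] = np.random.randint(0, 7, size=len(df))
pair_totals = df.groupby(['smaller', 'greater'])['runs'].sum().nlargest(10)
print(pair_totals)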

New column based in multiple conditions

a b
0 100 90
1 30 117
2 90 99
3 200 94
I want to create a new df["c"] with next conditions:
If a > 50 and b is into (a ± 0.5a), then c = a
If a > 50 and b is out (a ± 0.5a), then c = b
If a <= 50, then *c = a*
Output should be:
a b c
0 100 90 100
1 30 117 30
2 90 99 90
3 200 94 94
I've tried:
df['c'] = np.where(df.eval("0.5 * a <= b <= 1.5 * a"), df.a, df.b)
But I don't know how to include the last condition (if a <= 50, then c = a) in this expression.
You're almost there, you'll just need to add an or clause inside your eval string.
np.where(df.eval("(0.5 * a <= b <= 1.5 * a) or (a <= 50)"), df.a, df.b)
# ~~~~~~~~~~~~
array([100, 30, 90, 94])
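Another option, not part of the answer above, is to spell out the three cases explicitly with numpy.select; a sketch assuming the same a and b columns:
import numpy as np

# Conditions are checked in order: a <= 50 wins first, then the "b within a ± 0.5a" band.
within_band = df['b'].between(0.5 * df['a'], 1.5 * df['a'])
df['c'] = np.select([df['a'].le(50), within_band], [df['a'], df['a']], default=df['b'])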

selecting indexes with multiple years of observations

I wish to select only the rows that have observations across multiple years. For example, suppose
mlIndx = pd.MultiIndex.from_tuples([('x', 0,),('x',1),('z', 0), ('y', 1),('t', 0),('t', 1)])
df = pd.DataFrame(np.random.randint(0,100,(6,2)), columns = ['a','b'], index=mlIndx)
In [18]: df
Out[18]:
a b
x 0 6 1
1 63 88
z 0 69 54
y 1 27 27
t 0 98 12
1 69 31
My desired output is
Out[19]:
a b
x 0 6 1
1 63 88
t 0 98 12
1 69 31
My current solution is blunt, so something that scales up more easily would be great. You can assume a sorted index.
df.reset_index(level=0, inplace=True)
df[df.level_0.duplicated() | df.level_0.duplicated(keep='last')]
Out[30]:
level_0 a b
0 x 6 1
1 x 63 88
0 t 98 12
1 t 69 31
You can figure this out with groupby (on the first level of the index) + transform, and then use boolean indexing to filter out those rows:
df[df.groupby(level=0).a.transform('size').gt(1)]
a b
x 0 67 83
1 2 34
t 0 18 87
1 63 20
Details
Output of the groupby -
df.groupby(level=0).a.transform('size')
x 0 2
1 2
z 0 1
y 1 1
t 0 2
1 2
Name: a, dtype: int64
Filtering from here is straightforward, just find those rows with size > 1.
Use the group by filter
You can pass a function that returns a boolean to
df.groupby(level=0).filter(lambda x: len(x) > 1)
a b
x 0 7 33
1 31 43
t 0 71 18
1 68 72
I've spent my fair share of time focused on speed. Not all solutions need to be the fastest, but since the subject has come up, I'll offer what I think should be a fast solution. My intent is to keep future readers informed.
Results of Time Test
res.plot(loglog=True)
res.div(res.min(1), 0).T
10 30 100 300 1000 3000
cs 4.425970 4.643234 5.422120 3.768960 3.912819 3.937120
wen 2.617455 4.288538 6.694974 18.489803 57.416648 148.860403
jp 6.644870 21.444406 67.315362 208.024627 569.421257 1525.943062
pir 6.043569 10.358355 26.099766 63.531397 165.032540 404.254033
pir_pd_factorize 1.153351 1.132094 1.141539 1.191434 1.000000 1.000000
pir_np_unique 1.058743 1.000000 1.000000 1.000000 1.021489 1.188738
pir_best_of 1.000000 1.006871 1.030610 1.086425 1.068483 1.025837
Simulation Details
from timeit import timeit

def pir_pd_factorize(df):
    f, u = pd.factorize(df.index.get_level_values(0))
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_np_unique(df):
    u, f = np.unique(df.index.get_level_values(0), return_inverse=True)
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_best_of(df):
    if len(df) > 1000:
        return pir_pd_factorize(df)
    else:
        return pir_np_unique(df)

def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000],
    columns='cs wen jp pir pir_pd_factorize pir_np_unique pir_best_of'.split(),
    dtype=float
)

np.random.seed([3, 1415])
for i in res.index:
    d = pd.DataFrame(
        dict(a=range(i)),
        pd.MultiIndex.from_arrays([
            np.random.randint(i // 4 * 3, size=i),
            range(i)
        ])
    )
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)
Just a new way
s=df.a.count(level=0)
df.loc[s[s>1].index.tolist()]
Out[12]:
a b
x 0 1 31
1 70 29
t 0 42 26
1 96 29
And if you want to keep using duplicated:
s=df.index.get_level_values(level=0)
df.loc[s[s.duplicated()].tolist()]
Out[18]:
a b
x 0 1 31
1 70 29
t 0 42 26
1 96 29
I'm not convinced groupby is necessary:
df = df.sort_index()
df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]
# a b
# x 0 16 3
# 1 97 36
# t 0 9 18
# 1 37 30
Some benchmarking:
df = pd.concat([df]*10000).sort_index()
def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]
%timeit cs(df) # 19.5ms
%timeit pir(df) # 33.8ms
%timeit wen(df) # 17.0ms
%timeit jp(df) # 22.3ms
