I am struggling with pandas' where condition, especially inside a groupby. I have the following dataframe:
import pandas as pd

data = {'script': ['a', 'a', 'a', 'b', 'b'],
        'call_put': ['C', 'P', 'P', 'C', 'P'],
        'strike': [280, 260, 275, 280, 285],
        'premium': [10, 20, 35, 38, 50]}
df = pd.DataFrame(data)
df['t'] = df['premium'].cumsum()
df
script call_put strike premium t
0 a C 280 10 10
1 a P 260 20 30
2 a P 275 35 65
3 b C 280 38 103
4 b P 285 50 153
I want two additional columns holding a running count based on script, call_put and premium > 0. Expected output:
script call_put    t  k1  k2
a      C          10   1   1   <- call_put is "C", so the first value should be 1; k2 should also be 1 as the "P" count is 0
a      P          30   1   1   <- call_put is "P", so the second count starts at 1
a      P          65   1   2   <- the value is "P" again, so the cumulative count increases by 1
b      C         103   1   1   <- the script value changed; "C" is 1 and "P" = 0, so 1
b      P         153   1   1   <- "C" = 1 and "P" = 1
can you please tell me how to do this?
Based on your explanation, this is what you need:
import numpy as np

df['k1'] = df.loc[df["premium"] > 0].groupby(["script"])['call_put'].apply(lambda x: np.cumsum(x == 'C'))
df['k2'] = df.loc[df["premium"] > 0].groupby(["script"])['call_put'].apply(lambda x: np.cumsum(x == 'P'))
Output
script call_put strike premium t k1 k2
a C 280 10 10 1 0
a P 260 20 30 1 1
a P 275 35 65 1 2
b C 280 38 103 1 0
b P 285 50 153 1 1
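If you'd rather avoid apply, here is a minimal sketch of the same running count using a boolean cumsum per group (assuming the sample frame from the question, and that premium may contain non-positive values):
import pandas as pd

df = pd.DataFrame({'script': ['a', 'a', 'a', 'b', 'b'],
                   'call_put': ['C', 'P', 'P', 'C', 'P'],
                   'premium': [10, 20, 35, 38, 50]})

# count only rows with a positive premium, restarting at each script
mask = df['premium'] > 0
df['k1'] = (df['call_put'].eq('C') & mask).groupby(df['script']).cumsum()
df['k2'] = (df['call_put'].eq('P') & mask).groupby(df['script']).cumsum()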
Maybe you need four columns to represent the cumulative sum, since there are four different combinations of script and call_put. The following code does what you described; note that each count starts from zero here.
import numpy as np
import pandas as pd

data = {'script': ['a', 'a', 'a', 'b', 'b'],
        'call_put': ['C', 'P', 'P', 'C', 'P'],
        'strike': [280, 260, 275, 280, 285],
        'premium': [10, 20, 35, 38, 50]}
df = pd.DataFrame(data)
df['t'] = df['premium'].cumsum()
## column cond_col holds the unique combination of script, call_put and premium > 0
df["cond_col"] = df["script"] + "-" + df["call_put"] + "-" + (df["premium"] > 0).astype(str)
## add a new boolean column for each unique combination
for col in np.unique(df["cond_col"]):
    df[col] = df["cond_col"] == col
## do the cumsum in each unique combination column
for col in np.unique(df["cond_col"]):
    df[col] = df[col].cumsum()
## maybe the solution you want ends here
## if you want to combine the columns, you can do the following
df["k1"] = df["a-C-True"].where(df["cond_col"] == "a-C-True", df["b-C-True"])
df["k2"] = df["a-P-True"].where(df["cond_col"] == "a-P-True", df["b-P-True"])
df
Output
script call_put strike premium t cond_col a-C-True a-P-True b-C-True b-P-True k1 k2
0 a C 280 10 10 a-C-True 1 0 0 0 1 0
1 a P 260 20 30 a-P-True 1 1 0 0 0 1
2 a P 275 35 65 a-P-True 1 2 0 0 0 2
3 b C 280 38 103 b-C-True 1 2 1 0 1 0
4 b P 285 50 153 b-P-True 1 2 1 1 1 1
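Continuing from the frame above, an equivalent and more compact sketch uses cumcount on both grouping keys at once (running_count is an illustrative name; rows with a non-positive premium would be left as NaN):
# running count (starting at 1) within each script/call_put pair,
# counting only rows with a positive premium
pos = df[df['premium'] > 0]
df['running_count'] = pos.groupby(['script', 'call_put']).cumcount() + 1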
Related
I have the following dataframe from a database download that I cleaned up a bit. Unfortunately, some single numbers were split into a second column (see row 9). I'm trying to merge the two columns but exclude the zero values.
city crashes crashes_1 total_crashes
1 ABERDEEN 710 0 710
2 ACHERS LODGE 1 0 1
3 ACME 1 0 1
4 ADVANCE 55 0 55
5 AFTON 2 0 2
6 AHOSKIE 393 0 393
7 AKERS CENTER 1 0 1
8 ALAMANCE 50 0 50
9 ALBEMARLE 1 58 59
So for row 9 I want to end with:
9 ALBEMARLE 1 58 158
I tried a few snippets but nothing seems to work:
df['total_crashes'] = df['crashes'].astype(str).str.zfill(0) + df['crashes_1'].astype(str).str.zfill(0)
df['total_crashes'] = df['total_crashes'].astype(str).replace('\0', '', regex=True)
df['total_crashes'] = df['total_crashes'].apply(lambda x: ''.join(x[x!=0]))
df['total_crashes'] = df['total_crashes'].str.cat(df['total_crashes'], x[x!=0])
df['total_crashes'] = df.drop[0].sum(axis=1)
Thanks for any help.
You can use a where condition:
df['total_crashes'] = df['crashes'].astype(str) + df['crashes_1'].astype(str).where(df['crashes_1'] != 0, "")
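For example, a quick check with a small frame (a sketch, assuming crashes and crashes_1 are integer columns):
import pandas as pd

df = pd.DataFrame({'city': ['ABERDEEN', 'ALBEMARLE'],
                   'crashes': [710, 1],
                   'crashes_1': [0, 58]})

# append crashes_1 as text only where it is non-zero
df['total_crashes'] = df['crashes'].astype(str) + \
    df['crashes_1'].astype(str).where(df['crashes_1'] != 0, "")
print(df)
#         city  crashes  crashes_1 total_crashes
# 0   ABERDEEN      710          0           710
# 1  ALBEMARLE        1         58           158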
I have a Pandas data frame like below.
    X    Y  Z
0  10  101  1
1  12  120  2
2  15  112  3
3   6  115  4
4   7  125  1
5  17  131  2
6  14  121  1
7  11  127  2
8  13  107  3
9   2  180  4
10 19  114  1
I want to calculate the average of the values in column X according to the group values in Z.
That is something like
X Z
(10+7+14+19)/4  1
(12+17+11)/3    2
(15+13)/2       3
(6+2)/2         4
What is an optimal way of doing this using Pandas?
It works this way:
from statistics import mean

sample_data = [['X', 'Y', 'Z'], [10, 101, 1], [12, 120, 2], [15, 112, 3],
               [6, 115, 4], [7, 125, 1], [17, 131, 2]]

def group_X_based_on_Z(data):
    # pair each row's Z (group key) with its X value
    value_pair = [(row[2], row[0]) for row in data[1:]]
    dictionary_with_grouped_values = {}
    for z, x in value_pair:
        dictionary_with_grouped_values.setdefault(z, []).append(x)
    return dictionary_with_grouped_values

def cal_avg_values(data):
    grouped_dictionary = group_X_based_on_Z(data)
    avg_value_dictionary = {}
    for z, x in grouped_dictionary.items():
        avg_value_dictionary[z] = mean(x)
    return avg_value_dictionary

print(cal_avg_values(sample_data))
Is there a Pandas-specific method for this?
Use the groupby function.
df.groupby('Z').agg(x_avg = ('X', 'mean'))
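A quick check on the sample data (a sketch; the Y column is omitted since it is not used):
import pandas as pd

df = pd.DataFrame({'X': [10, 12, 15, 6, 7, 17, 14, 11, 13, 2, 19],
                   'Z': [1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 1]})

# named aggregation: one output column per (source column, function) pair
print(df.groupby('Z').agg(x_avg=('X', 'mean')))
#        x_avg
# Z
# 1  12.500000
# 2  13.333333
# 3  14.000000
# 4   4.000000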
Try
s = df.groupby('Z', as_index=False).X.mean()
s
Z X
0 1 12.500000
1 2 13.333333
2 3 14.000000
3 4 4.000000
First time asking a question here, so hopefully I will make my issue clear. I am trying to understand how to better apply a list of scenarios (via a for loop) to the same dataset and summarize the results. Note that once a scenario is applied and I have pulled the relevant statistics from the dataframe into the summary table, I do not need to retain the information. iterrows is painfully slow, as I have tens of thousands of scenarios to run. Thank you for taking the time to review.
I have two Pandas dataframes: df_analysts and df_results:
1) df_analysts contains a specific list of factors (e.g. TB, JK, SF, PWR) and scenarios of weights (e.g. 50, 50, 50, 50):
TB JK SF PWR
0 50 50 50 50
1 50 50 50 100
2 50 50 50 150
3 50 50 50 200
4 50 50 50 250
2) df_results holds results by date, group, and entrant, then a ranking by each factor, and finally the finish result:
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W3 W4 SUM(W)
0 11182017 1 1 2 1 2 1 2
1 11182017 1 2 3 2 3 2 1
2 11182017 1 3 1 3 1 3 3
3 11182017 2 1 1 2 2 1 1
4 11182017 2 2 2 1 1 2 1
3) I am using iterrows to:
loop through each scenario in the df_analysts dataframe
apply the weight scenario to each factor rank (if rank = 1, then 1.0*weight; if rank = 2, then 0.68*weight; if rank = 3, then 0.32*weight) and put those results into the W1-W4 columns
sum the W1-W4 columns
rank the SUM(W) column
Result sample below for a single scenario (e.g. 50,50,50,50)
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W3 W4 SUM(W) Rank
0 11182017 1 1 2 1 2 1 1 34 50 34 50 168 1
1 11182017 1 2 3 2 3 2 3 16 34 16 34 100 3
2 11182017 1 3 1 3 1 3 2 50 16 50 16 132 2
3 11182017 2 1 2 2 2 1 1 34 34 34 50 152 2
4 11182017 2 2 1 1 1 2 1 50 50 50 34 184 1
4) Finally, for each scenario, I am creating a new dataframe for the summary results (df_summary), which logs the factor/weight scenario used (from df_analysts), compares the RANK result to the finish by date and group, and keeps a tally of where they land. A sample is below (only the 50,50,50,50 scenario shown above, which results in a 1,1).
Factors Weights Top Top2
0 (TB,JK,SF,PWR) (50,50,50,50) 1 1
1 (TB,JK,SF,PWR) (50,50,50,100) 1 0
2 (TB,JK,SF,PWR) (50,50,50,150) 1 1
3 (TB,JK,SF,PWR) (50,50,50,200) 1 0
4 (TB,JK,SF,PWR) (50,50,50,250) 1 1
You could merge your analyst and results dataframes and then perform the calculations.
import pandas as pd

def factor_rank(x, y):
    if x == 1: return y
    elif x == 2: return y * 0.68
    elif x == 3: return y * 0.32

df_analysts.index.name = 'SCENARIO'
df_analysts.reset_index(inplace=True)

# cross join: pair every scenario with every result row
df_analysts['key'] = 1
df_results['key'] = 1
df = pd.merge(df_analysts, df_results, on='key')
df.drop(['key'], axis=1, inplace=True)

df['W1'] = df.apply(lambda r: factor_rank(r['TB-R'], r['TB']), axis=1)
df['W2'] = df.apply(lambda r: factor_rank(r['JK-R'], r['JK']), axis=1)
df['W3'] = df.apply(lambda r: factor_rank(r['SF-R'], r['SF']), axis=1)
df['W4'] = df.apply(lambda r: factor_rank(r['PWR-R'], r['PWR']), axis=1)

df['SUM(W)'] = df.W1 + df.W2 + df.W3 + df.W4
df["rank"] = df.groupby(['GR', 'SCENARIO'])['SUM(W)'].rank(ascending=False)
You may also want to check out this question which deals with improving processing times on row based calculations:
How to apply a function to mulitple columns of a pandas DataFrame in parallel
I have a Pandas dataframe as shown below. What I'm trying to do is partition (or group) by BlockID and LineID (with words ordered by WordID), and then within each group use the current WordStartX minus the previous (WordStartX + WordWidth) to derive another column, WordDistance, indicating the distance between this word and the previous one.
This post, Row operations within a group of a pandas dataframe, is very helpful, but in my case multiple columns are involved (WordStartX and WordWidth).
   BlockID  LineID  WordID  WordStartX  WordWidth  WordDistance
0        0       0       0         275        150  0
1        0       0       1         431         96  431-(275+150)=6
2        0       0       2         642         90  642-(431+96)=115
3        0       0       3         746        104  746-(642+90)=14
4        1       0       0         273         69  ...
5        1       0       1         352        151  ...
6        1       0       2         510         92
7        1       0       3         647         90
8        1       0       4         752        105
The diff() and shift() functions are usually helpful for calculations that refer to previous or next rows:
df['WordDistance'] = (df.groupby(['BlockID', 'LineID'])
.apply(lambda g: g['WordStartX'].diff() - g['WordWidth'].shift()).fillna(0).values)
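An equivalent sketch that uses groupby shift directly, avoiding apply (built on the sample data above):
import pandas as pd

df = pd.DataFrame({'BlockID':    [0, 0, 0, 0, 1, 1, 1, 1, 1],
                   'LineID':     [0, 0, 0, 0, 0, 0, 0, 0, 0],
                   'WordStartX': [275, 431, 642, 746, 273, 352, 510, 647, 752],
                   'WordWidth':  [150, 96, 90, 104, 69, 151, 92, 90, 105]})

# end position (start + width) of the previous word within each group
g = df.groupby(['BlockID', 'LineID'])
prev_end = g['WordStartX'].shift() + g['WordWidth'].shift()
df['WordDistance'] = (df['WordStartX'] - prev_end).fillna(0)
# block 0 distances: 0, 6, 115, 14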
I have some data imported from a CSV; to create something similar I used this:
import pandas as pd

data = pd.DataFrame([[1, 0, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [1, 1, 2, 3, 4, 5], [0, 0, 2, 3, 4, 5]],
                    columns=['split', 'sex', 'group0Low', 'group0High', 'group1Low', 'group1High'])
means = data.groupby(['split', 'sex']).mean()
so the dataframe looks something like this:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
You'll notice that each column actually contains two variables (group # and height). (It was set up this way for running repeated-measures ANOVA in SPSS.)
I want to split the columns up so I can also group by "group", like this (I actually messed up the order of the numbers, but hopefully the idea is clear):
low high
split sex group
0 0 95 265
0 0 1 123 54
1 0 120 220
1 1 98 111
1 0 0 150 190
0 1 211 300
1 0 139 86
1 1 132 250
How do I achieve this?
The first trick is to gather the columns into a single column using stack:
In [6]: means
Out[6]:
group0Low group0High group1Low group1High
split sex
0 0 2 3 4 5
1 2 3 4 5
1 0 2 3 4 5
1 2 3 4 5
In [13]: stacked = means.stack().reset_index(level=2)
In [14]: stacked.columns = ['group_level', 'mean']
In [15]: stacked.head(2)
Out[15]:
group_level mean
split sex
0 0 group0Low 2
0 group0High 3
Now we can do whatever string operations we want on group_level using pd.Series.str as follows:
In [18]: stacked['group'] = stacked.group_level.str[:6]
In [21]: stacked['level'] = stacked.group_level.str[6:]
In [22]: stacked.head(2)
Out[22]:
group_level mean group level
split sex
0 0 group0Low 2 group0 Low
0 group0High 3 group0 High
Now you're in business and you can do whatever you want. For example, sum each group/level:
In [31]: stacked.groupby(['group', 'level']).sum()
Out[31]:
mean
group level
group0 High 12
Low 8
group1 High 20
Low 16
How do I group by everything?
If you want to group by split, sex, group and level you can do:
In [113]: stacked.reset_index().groupby(['split', 'sex', 'group', 'level']).sum().head(4)
Out[113]:
mean
split sex group level
                        mean
split sex group  level
0     0   group0 High      3
                 Low       2
          group1 High      5
                 Low       4
What if the split is not always at location 6?
This SO answer will show you how to do the splitting more intelligently.
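For instance, a sketch using a regular expression instead of fixed positions (assuming the stacked frame from above and column names of the form group<digits><Low|High>):
# split 'group0Low' into 'group0' and 'Low' wherever the boundary falls
parts = stacked['group_level'].str.extract(r'^(group\d+)(Low|High)$')
stacked['group'] = parts[0]
stacked['level'] = parts[1]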
This can be done by first constructing a multi-level index on the column names and then reshaping the dataframe with stack.
import pandas as pd
import numpy as np
# some artificial data
# ==================================
multi_index = pd.MultiIndex.from_arrays([[0,0,1,1], [0,1,0,1]], names=['split', 'sex'])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(50,300, (4,4)), columns='group0Low group0High group1Low group1High'.split(), index=multi_index)
df
group0Low group0High group1Low group1High
split sex
0 0 222 97 167 242
1 117 245 153 59
1 0 261 71 292 86
1 137 120 266 138
# processing
# ==============================
level_group = np.where(df.columns.str.contains('0'), 0, 1)
# output: array([0, 0, 1, 1])
level_low_high = np.where(df.columns.str.contains('Low'), 'low', 'high')
# output: array(['low', 'high', 'low', 'high'], dtype='<U4')
multi_level_columns = pd.MultiIndex.from_arrays([level_group, level_low_high], names=['group', 'val'])
df.columns = multi_level_columns
df.stack(level='group')
val high low
split sex group
0 0 0 97 222
1 242 167
1 0 245 117
1 59 153
1 0 0 71 261
1 86 292
1 0 120 137
1 138 266