Python pandas: case statement in agg function

I have a SQL statement like this:
select id
, avg(case when rate = 1 then rate end) as "P_Rate"
, stddev(case when rate = 1 then rate end) as "std P_Rate"
, avg(case when f_rate = 1 then f_rate else 0 end) as "A_Rate"
, stddev(case when f_rate = 1 then f_rate else 0 end) as "std A_Rate"
from (
select id, connected_date, payment_type, acc_type,
max(case when s_rate > 1 then 1 else 0 end) / count(open) as rate,
sum(case when hire_days <= 5 and paid > 1000 then 1 else 0 end) / count(open) as f_rate
from analysis_table where alloc_date <= '2016-01-01' group by 1,2
) a group by id
I am trying to rewrite it using pandas.
First I create a DataFrame for the "inner" table:
filtered_data = data.where(data['alloc_date'] <= analysis_date)
Then I group this data:
grouped = filtered_data.groupby(['id','connected_date'])
But what should I use to filter each column and apply max/sum to it?
I tried something like this:
def my_agg_function(hire_days, paid, open):
    r_arr = []
    if hire_days <= 5 and paid > 1000:
        r_arr.append(1)
    else:
        r_arr.append(0)
    return np.max(r_arr) / len(????)

inner_table['f_rate'] = grouped.agg(lambda row: my_agg_function(row['hire_days'], row['paid'], row['open']))
and something similar for rate

You should put a little DataFrame in your question to make it easier to answer.
For your need, you might want to use the agg method of groupby DataFrames. Let's suppose you have the following DataFrame:
connected_date id number_of_clicks time_spent
0 Mon matt 15 124
1 Tue john 13 986
2 Mon matt 48 451
3 Thu jack 68 234
4 Sun john 52 976
5 Sat sabrina 13 156
And you want to get the sum of the time spent by user by day, and the maximum number of clicks in a single session. Then you use groupby this way:
df.groupby(['id', 'connected_date'], as_index=False).agg({'number_of_clicks': 'max', 'time_spent': 'sum'})
Output:
id connected_date time_spent number_of_clicks
0 jack Thu 234 68
1 john Sun 976 52
2 john Tue 986 13
3 matt Mon 575 48
4 sabrina Sat 156 13
Note that I passed as_index=False only for clarity of the output.
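To map this back to the question, the SQL's conditional aggregation (CASE WHEN inside an aggregate) can be expressed with boolean masks. A minimal sketch, assuming the column names from the SQL (s_rate, hire_days, paid, open) and the data / analysis_date variables from the question:

import pandas as pd

# Inner query: per (id, connected_date), boolean masks play the role of CASE WHEN.
filtered = data[data['alloc_date'] <= analysis_date]
inner = (filtered
         .groupby(['id', 'connected_date'])
         .apply(lambda g: pd.Series({
             # max(case when s_rate > 1 then 1 else 0 end) / count(open)
             'rate': (g['s_rate'] > 1).max() / g['open'].count(),
             # sum(case when hire_days <= 5 and paid > 1000 then 1 else 0 end) / count(open)
             'f_rate': ((g['hire_days'] <= 5) & (g['paid'] > 1000)).sum() / g['open'].count(),
         }))
         .reset_index())

# Outer query: avg/stddev over the CASE-filtered values.
result = inner.groupby('id').agg(
    P_Rate=('rate', lambda s: s[s == 1].mean()),             # avg(case when rate = 1 then rate end)
    std_P_Rate=('rate', lambda s: s[s == 1].std()),
    A_Rate=('f_rate', lambda s: s.where(s == 1, 0).mean()),  # the "else 0" branch keeps every row
    std_A_Rate=('f_rate', lambda s: s.where(s == 1, 0).std()),
)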

Related

Pandas: query + mul + groupby + cumsum

My dataframe looks like this:
CUST_NO  ORDER_AMOUNT  PAYT_CODE  IS_PAYMENT_SUCCESSFUL
001      50            OR         1
001      20            IC         0
001      10            IC         1
002      55            IC         1
002      300           MR         1
002      215           MR         0
I want to know the total amount a customer has successfully paid all-time, specifically from the payment codes 'OR', 'IC'. The dataframe is sorted and indexed by order date.
The expected output is shown in the CUMSUM_OR_IC_SUCCESSFUL column:
CUST_NO  ORDER_AMOUNT  PAYT_CODE  IS_PAYMENT_SUCCESSFUL  CUMSUM_OR_IC_SUCCESSFUL
001      50            OR         1                      0
001      20            IC         0                      50
001      10            IC         1                      50
002      55            IC         1                      0
002      300           MR         1                      55
002      215           MR         0                      55
I already have some code that should work, but it just keeps running until the kernel crashes.
df["CUMSUM_OR_IC_SUCCESSFUL "] = (df.query("PAYT_CODE == ('OR', 'IC')")["IS_PAYMENT_SUCCESSFUL"].mul(df["ORDER_AMOUNT"])
.groupby(df["CUST_NO"])
.transform(lambda x: x.cumsum().shift().fillna(0))
)
Any help is appreciated!
Answer
agg = df.groupby("CUST_NO").apply(
    lambda x: (x["ORDER_AMOUNT"]
               * x["PAYT_CODE"].isin(["IC", "OR"])
               * x["IS_PAYMENT_SUCCESSFUL"]).cumsum())
df["CUMSUM_OR_IC_SUCCESSFUL"] = agg.to_numpy()
Output
Although this is not quite the same as your expected output, I suspect your expected table contains a small mistake.
If you want to shift CUMSUM_OR_IC_SUCCESSFUL by one position, use agg.shift().to_numpy().
CUST_NO ORDER_AMOUNT ... IS_PAYMENT_SUCCESSFUL CUMSUM_OR_IC_SUCCESSFUL
0 1 50 ... 1 50
1 1 20 ... 0 50
2 1 10 ... 1 60
3 2 55 ... 1 55
4 2 300 ... 1 55
5 2 215 ... 0 55
Explanation
apply runs once for each group.
After some experimenting, this one worked:
df["CUMSUM_GUARANTEED_SUCCESSFUL"] = (
    df["ORDER_AMOUNT"]
    .mul(df["PAYMENT_SUCCESSFUL"])
    .mul(df["PAYT_CODE"].isin(['IC', 'OC']))
    .groupby(df["CUST_NO"])
    .transform(lambda x: x.cumsum().shift().fillna(0))
)

Pandas can only convert an array of size 1 to a Python scalar

I have this dataframe, df_pm:
Player GameWeek Minutes \
PlayerMatchesDetailID
1 Alisson 1 90
2 Virgil van Dijk 1 90
3 Joseph Gomez 1 90
ForTeam AgainstTeam \
1 Liverpool Norwich City
2 Liverpool Norwich City
3 Liverpool Norwich City
Goals ShotsOnTarget ShotsInBox CloseShots \
1 0 0 0 0
2 1 1 1 1
3 0 0 0 0
TotalShots Headers GoalAssists ShotOnTargetCreated \
1 0 0 0 0
2 1 1 0 0
3 0 0 0 0
ShotInBoxCreated CloseShotCreated TotalShotCreated \
1 0 0 0
2 0 0 0
3 0 0 1
HeadersCreated
1 0
2 0
3 0
this second dataframe, df_melt:
MatchID GameWeek Date Team Home \
0 46605 1 2019-08-09 Liverpool Home
1 46605 1 2019-08-09 Norwich City Away
2 46606 1 2019-08-10 AFC Bournemouth Home
AgainstTeam
0 Norwich City
1 Liverpool
2 Sheffield United
3 AFC Bournemouth
...
575 Sheffield United
576 Newcastle United
577 Southampton
and this snippet, which uses both:
match_ids = []
home_away = []
dates = []
# For each row in the player matches dataframe...
for row in df_pm.itertuples():
    # Look up the match id from the team matches dataframe
    team = row.ForTeam
    againstteam = row.AgainstTeam
    gameweek = row.GameWeek
    print(team, againstteam, gameweek)
    match_id = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                           & (df_melt['Team'] == team)
                           & (df_melt['AgainstTeam'] == againstteam),
                           'MatchID'].item()
    date = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                       & (df_melt['Team'] == team)
                       & (df_melt['AgainstTeam'] == againstteam),
                       'Date'].item()
    home = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                       & (df_melt['Team'] == team)
                       & (df_melt['AgainstTeam'] == againstteam),
                       'Home'].item()
    match_ids.append(match_id)
    home_away.append(home)
    dates.append(date)
On the first iteration, it prints:
Liverpool
Norwich City
1
But I'm getting the error:
Traceback (most recent call last):
File "tableau_data_generation.py", line 166, in <module>
'MatchID'].item()
File "/Users/me/anaconda2/envs/data_science/lib/python3.7/site-packages/pandas/core/base.py", line 652, in item
return self.values.item()
ValueError: can only convert an array of size 1 to a Python scalar
Printing the whole df_melt dataframe, I see that these datetime values are flawed:
540 46875 28 TBC Aston Villa Home
541 46875 28 TBC Sheffield United Away
...
548 46879 28 TBC Manchester City Home
549 46879 28 TBC Arsenal Away
How do I fix this?
When you use item() on a Series, you should actually have received:
FutureWarning: `item` has been deprecated and will be removed in a future version
Since item() was deprecated in version 0.25.0, it looks like you are using an outdated version of Pandas, so you should probably start by upgrading it.
Even in newer versions of Pandas you can still use item(), but on a NumPy array (that use is not deprecated, at least for now).
So change your code to:
df_melt.loc[...].values.item()
Another option is to use iloc[0], so you can also change your code to:
df_melt.loc[...].iloc[0]
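For illustration, a tiny hypothetical example showing both fixes on a one-element selection:

import pandas as pd

s = pd.Series([46605])    # what df_melt.loc[...] returns when exactly one row matches
print(s.values.item())    # 46605: item() on the underlying NumPy array
print(s.iloc[0])          # 46605: positional indexing on the Series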
Edit
The above solution still can raise an exception (IndexError) if df_melt
does not find any row meeting the given criteria.
To make your code resilient to such cases (and return some default value instead), you can add a function that gets the given attribute (attr, actually a column) from the first row meeting the given criteria (gameweek, team, and againstteam):
def getAttr(gameweek, team, againstteam, attr, default=None):
    xx = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                     & (df_melt['Team'] == team)
                     & (df_melt['AgainstTeam'] == againstteam)]
    return default if xx.empty else xx.iloc[0].loc[attr]
Then, instead of the three ... = df_melt.loc[...].item() instructions, run:
match_id = getAttr(gameweek, team, againstteam, 'MatchID', default=-1)
date = getAttr(gameweek, team, againstteam, 'Date')
home = getAttr(gameweek, team, againstteam, 'Home', default='????')

How can I merge these two datasets on 'Name' and 'Year'?

I am new to this field and stuck on this problem. I have two datasets:
all_batsman_df, which has 5 columns ('years', 'team', 'pos', 'name', 'salary'):
years team pos name salary
0 1991 SF 1B Will Clark 3750000.0
1 1991 NYY 1B Don Mattingly 3420000.0
2 1991 BAL 1B Glenn Davis 3275000.0
3 1991 MIL DH Paul Molitor 3233333.0
4 1991 TOR 3B Kelly Gruber 3033333.0
all_batting_statistics_df, which has 31 columns:
Year Rk Name Age Tm Lg G PA AB R ... SLG OPS OPS+ TB GDP HBP SH SF IBB Pos Summary
0 1988 1 Glen Davis 22 SDP NL 37 89 83 6 ... 0.289 0.514 48.0 24 1 1 0 1 1 987
1 1988 2 Jim Acker 29 ATL NL 21 6 5 0 ... 0.400 0.900 158.0 2 0 0 0 0 0 1
2 1988 3 Jim Adduci* 28 MIL AL 44 97 94 8 ... 0.383 0.641 77.0 36 1 0 0 3 0 7D/93
3 1988 4 Juan Agosto* 30 HOU NL 75 6 5 0 ... 0.000 0.000 -100.0 0 0 0 1 0 0 1
4 1988 5 Luis Aguayo 29 TOT MLB 99 260 237 21 ... 0.354 0.663 88.0 84 6 1 1 1 3 564
I want to merge these two datasets on 'year' and 'name'. The problem is that the two data frames spell some names differently: the first dataset has 'Glenn Davis' while the second has 'Glen Davis'.
How can I merge them using the difflib library even though the names differ?
Any help will be appreciated ...
Thanks in advance.
I have used this code, which I found in another question on this platform, but it is not working for me. It adds a new merge-key column after matching names across the two datasets. I know this is not a good approach; kindly suggest whether I can do it in a better way.
import cdifflib
import pandas as pd

df_a = all_batting_statistics_df
df_b = all_batters
df_a = df_a.astype(str)
df_b = df_b.astype(str)
df_a['merge_year'] = df_a['Year']  # we will use these as the merge keys
df_a['merge_name'] = df_a['Name']

for comp_a, addr_a in df_a[['Year', 'Name']].values:
    for ixb, (comp_b, addr_b) in enumerate(df_b[['years', 'name']].values):
        if cdifflib.CSequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_year'] = comp_a  # creates a merge key in df_b
        if cdifflib.CSequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_name'] = addr_a  # creates a merge key in df_b

merged_df = pd.merge(df_a, df_b, on=['merge_name', 'merge_year'], how='inner')
You can do
import difflib

df_b['name'] = df_b['name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0])
to replace names in df_b with the closest match from df_a, then do your merge. See also this post.
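For a concrete picture, here is a minimal, self-contained sketch of that approach on two toy rows (the small frames below are invented for illustration):

import difflib
import pandas as pd

df_a = pd.DataFrame({'years': [1991], 'name': ['Glenn Davis'], 'salary': [3275000.0]})
df_b = pd.DataFrame({'Year': [1991], 'Name': ['Glen Davis'], 'TB': [24]})

# Replace each name in df_b with its closest match from df_a.
# Caveat: get_close_matches returns an empty list when nothing is close
# enough, so the [0] would raise an IndexError in that case.
df_b['Name'] = df_b['Name'].apply(
    lambda x: difflib.get_close_matches(x, df_a['name'])[0])

merged = df_a.merge(df_b, left_on=['years', 'name'], right_on=['Year', 'Name'])
print(merged)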
Let me approach your problem by assuming that you want to end up with a data set of two columns: 1. 'year' and 2. 'name'.
1. First, we will fix all the names which are wrong.
Assuming you know which names in all_batting_statistics_df are wrong, fix them with something like this:
all_batting_statistics_df = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')
Once you have corrected all the spellings, work from the smaller dataset that has the names you know, so it doesn't take long.
2. Next, both data sets need the same columns, i.e. only 'year' and 'name'.
Use this to drop the columns we don't need:
all_batsman_df_1 = all_batsman_df.drop(['team', 'pos', 'salary'], axis=1)
all_batting_statistics_df_1 = all_batting_statistics_df.drop(['Rk', 'Age', 'Tm', 'Lg', 'G', 'PA', 'AB', 'R', 'Summary'], axis=1)  # keep 'Name'; it is needed for the merge
I cannot see all 31 columns, so I left some out; you will have to add the rest to the code above.
3. Rename the columns so they match, i.e. 'year' and 'name', using DataFrame.rename:
df_new_1 = all_batting_statistics_df_1.rename(columns={'Year': 'year', 'Name': 'name'})
4. Finally, merge them:
all_batsman_df_1.merge(df_new_1, left_on=['years', 'name'], right_on=['year', 'name'])
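Put together, steps 1-4 amount to something like the following sketch (the column subsets are abbreviated assumptions, not the full 31 columns):

import pandas as pd

# Step 1: fix the known misspellings.
stats = all_batting_statistics_df.replace(regex=r'^Glen Davis$', value='Glenn Davis')

# Steps 2-3: keep only the merge keys and align the column names.
stats = stats[['Year', 'Name']].rename(columns={'Year': 'year', 'Name': 'name'})
batsmen = all_batsman_df.rename(columns={'years': 'year'})[['year', 'name']]

# Step 4: merge on the now-identical keys.
merged = batsmen.merge(stats, on=['year', 'name'])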
FINAL THOUGHTS:
If you don't want to do all this, find a way to export the data sets to Google Sheets or Microsoft Excel and edit them with those tools. If you like pandas, it's not that difficult; you will find a way. All the best!

how to get sum of the row by using column*value in pandas pivot table?

I'm trying to get the following output but am stuck on the Total column.
Here is my code:
def generate_invoice_summary_info():
    file_path = 'output.xlsx'
    df = pd.read_excel(file_path, sheet_name='Invoice Details', usecols="E:F,I,L:M")
    df['Price'] = df['Price'].astype(float)
    # df['Total'] = df.groupby(["Invoice Cost Centre", "Invoice Category"]).agg({'Price': 'sum'}).reset_index()
    df = pd.pivot_table(df, index=["Invoice Cost Centre", "Invoice Category"],
                        columns=['Price', 'Reporting Frequency', 'Data Feed'],
                        aggfunc=len, fill_value=0, margins=True)
    print(df.head())
    df.to_excel('a.xlsx', sheet_name='Invoice Summary')
The above code produces the following output (90% right), but I got stuck finding the Total column.
The Total column should be calculated for each row, based on count * price:
Total = count * price, summed across the columns
How can I do that in a pivot table? I used the margins attribute, but it gives the plain row sum only.
Edit
print(df):
Price 10.4 ... 85.0 All
Reporting Frequency M ... M
Data Feed BWH EMAIL ... StarBOS
Invoice Cost Centre Invoice Category ...
D3TM Reseller Non Equity 21 10 ... 0 125
EQUITYEMP Baileys 0 7 ... 0 10
Energy NSW 16 0 ... 0 32
Far North Queensland 3 0 ... 0 6
South East 6 0 ... 0 16
Cooper & Dysart 0 0 ... 0 3
Petro Fuel & Lubricants 8 0 ... 0 20
South East QLD Fuels 0 0 ... 0 19
R1M Retail QLD 60 0 ... 0 867
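One possible approach, sketched under the assumption that Price is the top level of the pivot's column MultiIndex as in the printout above: weight each count column by the price in its label, then sum across the row.

counts = df.drop('All', axis=1, level=0)                       # drop the margins column
prices = counts.columns.get_level_values('Price').astype(float).to_numpy()
df['Total'] = (counts * prices).sum(axis=1)                    # sum of count * price per row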

Pandas dataframe vectorizing/filtering: ValueError: Can only compare identically-labeled Series objects

I have two dataframes with NHL hockey stats. One contains every game played by every team for the last ten years, and the other is where I want to fill it up with calculated values. Simply put, I want to take a metric from a team's first five games, sum it, and put that into the other df. I've trimmed my dfs below to exclude other stats and will only look at one stat.
df_all contains all of the games:
>>> df_all
season gameId playerTeam opposingTeam gameDate xGoalsFor xGoalsAgainst
1 2008 2008020001 NYR T.B 20081004 2.287 2.689
6 2008 2008020003 NYR T.B 20081005 1.793 0.916
11 2008 2008020010 NYR CHI 20081010 1.938 2.762
16 2008 2008020019 NYR PHI 20081011 3.030 3.020
21 2008 2008020034 NYR N.J 20081013 1.562 3.454
... ... ... ... ... ... ... ...
142576 2015 2015030185 L.A S.J 20160422 2.927 2.042
142581 2017 2017030171 L.A VGK 20180411 1.275 2.279
142586 2017 2017030172 L.A VGK 20180413 1.907 4.642
142591 2017 2017030173 L.A VGK 20180415 2.452 3.159
142596 2017 2017030174 L.A VGK 20180417 2.427 1.818
df_sum_all will contain the calculated stats, for now it has a bunch of empty columns:
>>> df_sum_all
season team xg5 xg10 xg15 xg20
0 2008 NYR 0 0 0 0
1 2009 NYR 0 0 0 0
2 2010 NYR 0 0 0 0
3 2011 NYR 0 0 0 0
4 2012 NYR 0 0 0 0
.. ... ... ... ... ... ...
327 2014 L.A 0 0 0 0
328 2015 L.A 0 0 0 0
329 2016 L.A 0 0 0 0
330 2017 L.A 0 0 0 0
331 2018 L.A 0 0 0 0
Here's my function for calculating the ratio of xGoalsFor and xGoalsAgainst.
def calcRatio(statfor, statagainst, games, season, team, statsdf):
    tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)]
                    .nsmallest(games, 'gameDate').eval(statfor).sum())
    tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)]
                        .nsmallest(games, 'gameDate').eval(statagainst).sum())
    tempRatio = tempFor / tempAgainst
    return tempRatio
I believe it's logical enough. I input the stat I want to make a ratio from, how many games to sum, the season and team to match on, and then where to get the stats from. I've tested these functions separately and know that I can filter just fine, and sum the stats, and so forth. Here's an example of a standalone implementation of the tempFor calculation:
>>> statsdf = df_all
>>> team = 'TOR'
>>> season = 2015
>>> games = 3
>>> tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
>>> print(tempFor)
8.618
See? It returns a value. However I can't do the same across the whole dataframe. What am I missing? I thought the way this works is essentially for every row, it sets the 'xg5' column to the output of the calcRatio function, which uses that row's 'season' and 'team' to filter on df_all.
>>> df_sum_all['xg5'] = calcRatio('xGoalsFor','xGoalsAgainst',5,df_sum_all['season'], df_sum_all['team'], df_all)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 2, in calcRatio
File "/home/sebastian/.local/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 1142, in wrapper
raise ValueError("Can only compare identically-labeled " "Series objects")
ValueError: Can only compare identically-labeled Series objects
Cheers, thanks for any help!
Update: I used iterrows() and it worked fine, so I must just not understand vectorization very well. It's the same function, though - why does it work in one fashion, but not another?
>>> emptyseries = []
>>> for index, row in df_sum_all.iterrows():
... emptyseries.append(calcRatio('xGoalsFor','xGoalsAgainst',5,row['season'],row['team'], df_all))
...
>>> df_sum_all['xg5'] = emptyseries
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> df_sum_all
season team xg5 xg10 xg15 xg20
0 2008 NYR 0.826260 0 0 0
1 2009 NYR 1.288390 0 0 0
2 2010 NYR 0.915942 0 0 0
3 2011 NYR 0.730498 0 0 0
4 2012 NYR 0.980744 0 0 0
.. ... ... ... ... ... ...
327 2014 L.A 0.823998 0 0 0
328 2015 L.A 1.147412 0 0 0
329 2016 L.A 1.054947 0 0 0
330 2017 L.A 1.369005 0 0 0
331 2018 L.A 0.721411 0 0 0
[332 rows x 6 columns]
"ValueError: Can only compare identically-labeled Series objects"
tempFor = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statfor).sum())
tempAgainst = float(statsdf[(statsdf.playerTeam == team) & (statsdf.season == season)].nsmallest(games, 'gameDate').eval(statagainst).sum())
The inputs for the variables are:
team: df_sum_all['team']
season: df_sum_all['season']
statsdf: df_all
So in the expression (statsdf.playerTeam == team), a Series from df_all is compared with a Series from df_sum_all. Since these two are not identically labeled, you see the above error. With iterrows you instead pass scalars (a single team and a single season), and comparing a Series with a scalar needs no label alignment, which is why the loop version works.
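As a side note, a vectorized alternative is possible (a sketch, not from the original answer): compute the first-five-games sums for every (season, team) pair at once with groupby, then merge the ratio in, dropping the placeholder xg5 column first.

first5 = (df_all.sort_values('gameDate')
                .groupby(['season', 'playerTeam'])
                .head(5)                                   # first 5 games per season/team
                .groupby(['season', 'playerTeam'])[['xGoalsFor', 'xGoalsAgainst']]
                .sum())
first5['xg5'] = first5['xGoalsFor'] / first5['xGoalsAgainst']

df_sum_all = (df_sum_all.drop(columns='xg5')
              .merge(first5['xg5'].reset_index()
                                  .rename(columns={'playerTeam': 'team'}),
                     on=['season', 'team'], how='left'))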
