Sequentially extract top n from group depending on group value - python

I am working on calculating some football stats.
I have the following dataframe:
{'Player': {8: 'Darrel Williams', 2: 'Mark Ingram', 3: 'Michael Carter', 4: 'Najee Harris', 10: 'James Conner', 0: 'Buffalo Bills', 15: 'Davante Adams', 1: 'Aaron Rodgers', 5: 'Tyler Bass', 11: 'Corey Davis', 6: 'Van Jefferson', 14: 'Matt Ryan', 7: 'T.J. Hockenson', 9: 'Antonio Brown', 12: 'Alvin Kamara', 13: 'Tyler Boyd'}, 'Position': {8: 'RB', 2: 'RB', 3: 'RB', 4: 'RB', 10: 'RB', 0: 'DEF', 15: 'WR', 1: 'QB', 5: 'K', 11: 'WR', 6: 'WR', 14: 'QB', 7: 'TE', 9: 'WR', 12: 'RB', 13: 'WR'}, 'Score': {8: 24.9, 2: 18.8, 3: 16.2, 4: 15.3, 10: 13.9, 0: 12.0, 15: 11.3, 1: 10.48, 5: 9.0, 11: 8.8, 6: 6.9, 14: 1.68, 7: 0.0, 9: 0.0, 12: 0.0, 13: 0.0}}
Player           Position  Score
Darrel Williams  RB        24.9
Mark Ingram      RB        18.8
Michael Carter   RB        16.2
Najee Harris     RB        15.3
James Conner     RB        13.9
Buffalo Bills    DEF       12
Davante Adams    WR        11.3
Aaron Rodgers    QB        10.48
Tyler Bass       K         9
Corey Davis      WR        8.8
Van Jefferson    WR        6.9
Matt Ryan        QB        1.68
T.J. Hockenson   TE        0
Antonio Brown    WR        0
Alvin Kamara     RB        0
Tyler Boyd       WR        0
What I am looking to do, given the following requirements_dictionary, is to extract the top-scoring rows (Score in the dataframe) for each key (Position in the dataframe), taking as many rows as that key's value specifies:
requirements_dictionary = {'QB': 1, 'RB': 2, 'WR': 2, 'TE': 1, 'K': 1, 'DEF': 1, 'FLEX': 2}
What makes this challenging is the final key, FLEX, which matches no position in the dataframe directly, because that slot could be filled by an RB, WR, or TE.
Final output should look like:
Player           Position  Score
Darrel Williams  RB        24.9
Mark Ingram      RB        18.8
Michael Carter   RB        16.2
Najee Harris     RB        15.3
Buffalo Bills    DEF       12
Davante Adams    WR        11.3
Aaron Rodgers    QB        10.48
Tyler Bass       K         9
Corey Davis      WR        8.8
T.J. Hockenson   TE        0
Since those are the top 2 RB, 1 QB, 2 WR, 1 TE, 1 K, 1 DEF, and 2 FLEX.
I have tried the following code which gets me close:
all_points.groupby('Position')['Score'].nlargest(2)
Position
DEF       0     12.00
K         5      9.00
QB        1     10.48
          14     1.68
RB        8     24.90
          2     18.80
TE        7      0.00
WR        15    11.30
          11     8.80
Name: Score, dtype: float64
However, that does not account for the FLEX "position".
I could alternatively loop through the dataframe and do this manually, but that seems very intensive.
How can I achieve the intended result?

Create a custom function that selects a number of players according to your requirements for each group, and keep this index as idx_best. Then exclude all already-selected players and select FLEX more players as idx_flex. Finally, take the union of these two indexes.
FLEX = requirements_dictionary['FLEX']
select_players = lambda x: x.nlargest(requirements_dictionary[x.name])
idx_best = df.groupby('Position')['Score'].apply(select_players).index.levels[1]
idx_flex = df.loc[df.index.difference(idx_best), 'Score'].nlargest(FLEX).index
out = df.loc[idx_best.union(idx_flex)].sort_values('Score', ascending=False)
Output:
>>> out
Player Position Score
8 Darrel Williams RB 24.90
2 Mark Ingram RB 18.80
3 Michael Carter RB 16.20
4 Najee Harris RB 15.30
0 Buffalo Bills DEF 12.00
15 Davante Adams WR 11.30
1 Aaron Rodgers QB 10.48
5 Tyler Bass K 9.00
11 Corey Davis WR 8.80
7 T.J. Hockenson TE 0.00
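One caveat: idx_flex as written draws the extra FLEX picks from all remaining players, not only RB/WR/TE. That happens to give the right answer for this data, but if FLEX must be limited to RB, WR or TE as stated in the question, a small variation (a sketch, assuming the same df, idx_best and FLEX as above) restricts the candidate pool first:
# keep only players at FLEX-eligible positions who were not already selected
flex_pool = df.index[df['Position'].isin(['RB', 'WR', 'TE'])].difference(idx_best)
idx_flex = df.loc[flex_pool, 'Score'].nlargest(FLEX).index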

Use the requirements dictionary to get the rows matching each position, then sort by score and take the head equal to the dictionary value for that position. FLEX is the top 2 across the RB, WR, and TE positions; I concatenate the flex results onto the rest. In my view this solution is more intuitive and easier to follow.
import io
import pandas as pd

txt="""Player,Position,Score
Darrel Williams,RB,24.9
Mark Ingram,RB,18.8
Michael Carter,RB,16.2
Najee Harris,RB,15.3
Buffalo Bills,DEF,12
Davante Adams,WR,11.3
Aaron Rodgers,QB,10.48
Tyler Bass,K,9
Corey Davis,WR,8.8
T.J. Hockenson,TE,0"""
df = pd.read_csv(io.StringIO(txt),sep=',')
requirements_dictionary = {'QB': 1, 'RB': 2, 'WR': 2, 'TE': 1, 'K': 1, 'DEF': 1, 'FLEX': 2}
#print(df)
df_top_rows = pd.DataFrame()
for position in requirements_dictionary.keys():
    df_top_rows = df_top_rows.append(df[df['Position'] == position].sort_values(by='Score', ascending=False).head(requirements_dictionary[position]))
print(df_top_rows)
position='FLEX'
df_flex_rows = df_top_rows.append(df[df['Position'].isin(['RB','WR','TE'])].sort_values(by='Score', ascending=False).head(requirements_dictionary[position]))
#print(df_flex_rows)
df_result=pd.concat([df_top_rows,df_flex_rows],axis=0)
df_result.drop_duplicates(inplace=True)
print(df_result)
Output:
Player Position Score
6 Aaron Rodgers QB 10.48
0 Darrel Williams RB 24.90
1 Mark Ingram RB 18.80
5 Davante Adams WR 11.30
8 Corey Davis WR 8.80
9 T.J. Hockenson TE 0.00
7 Tyler Bass K 9.00
4 Buffalo Bills DEF 12.00
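Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above fails on recent versions. A minimal sketch of the same idea using pd.concat instead, assuming the same df and requirements_dictionary:
import pandas as pd

# top rows per real position (FLEX handled separately below)
top_rows = [
    df[df['Position'] == position]
      .sort_values(by='Score', ascending=False)
      .head(count)
    for position, count in requirements_dictionary.items()
    if position != 'FLEX'
]
# FLEX candidates: top scores among RB, WR and TE (duplicates dropped below)
flex_rows = (
    df[df['Position'].isin(['RB', 'WR', 'TE'])]
      .sort_values(by='Score', ascending=False)
      .head(requirements_dictionary['FLEX'])
)
df_result = pd.concat(top_rows + [flex_rows]).drop_duplicates()
print(df_result)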


Grouping together lists in pandas

I have a database of patents citing other patents looking like this:
{'index': {0: 0, 1: 1, 2: 2, 12: 12, 21: 21},
'docdb_family_id': {0: 57904406,
1: 57904406,
2: 57906556,
12: 57909419,
21: 57942222},
'cited_docdbs': {0: [15057621,
16359315,
18731820,
19198211,
19198218,
19198340,
19550248,
19700609,
20418230,
22144166,
22513333,
22800966,
22925564,
23335606,
23891186,
25344297,
25345599,
25414615,
25495423,
25588955,
26530649,
27563473,
34277948,
36626718,
38801947,
40454852,
40885675,
40957530,
41249600,
41377563,
41378429,
41444278,
41797413,
42153280,
42340085,
42340086,
42678557,
42709962,
42709963,
42737942,
43648036,
44691991,
44947081,
45352855,
45815534,
46254922,
46382961,
47830116,
49676686,
49912209,
54191614],
1: [15057621,
16359315,
18731820,
19198211,
19198218,
19198340,
19550248,
19700609,
20418230,
22144166,
22513333,
22800966,
22925564,
23335606,
23891186,
25344297,
25345599,
25414615,
25495423,
25588955,
26530649,
27563473,
34277948,
36626718,
38801947,
40454852,
40885675,
40957530,
41249600,
41377563,
41378429,
41444278,
41797413,
42153280,
42340085,
42340086,
42678557,
42709962,
42709963,
42737942,
43648036,
44691991,
44947081,
45352855,
45815534,
46254922,
46382961,
47830116,
49676686,
49912209,
54191614],
2: [6078355,
8173164,
14235835,
16940834,
18152411,
18704525,
27343995,
45467248,
46172598,
49878759,
50995553,
52668238],
12: [6293366,
7856452,
16980051,
23177359,
26477802,
27453602,
41135094,
53004244,
54332594,
55018863],
21: [7913900,
13287798,
18834564,
23971781,
26904791,
27304292,
29720924,
34622252,
35197847,
37766575,
39873073,
42075013,
44508652,
44530218,
45571357,
48222848,
48747089,
49111776,
49754218,
50024241,
50474222,
50545849,
52580625,
58800268]},
'doc_std_name': {0: 'SEEO INC',
1: 'BOSCH GMBH ROBERT',
2: 'SAMSUNG SDI CO LTD',
12: 'NAGAI TAKAYUKI',
21: 'SAMSUNG SDI CO LTD'}}
Now, what I would like to do is perform a groupby by firm as follows:
df_grouped_byfirm=data_min.groupby("doc_std_name").agg(publn_nrs=('docdb_family_id',"unique")).reset_index()
but merging together the lists of cited_docdbs. So, for instance in the example above, for SAMSUNG SDI CO LTD the final list of cited_docdbs should become one mega-list where the cited docdbs of both of its docdb_family_ids are merged together:
[6078355,
8173164,
14235835,
16940834,
18152411,
18704525,
27343995,
45467248,
46172598,
49878759,
50995553,
52668238,
7913900,
13287798,
18834564,
23971781,
26904791,
27304292,
29720924,
34622252,
35197847,
37766575,
39873073,
42075013,
44508652,
44530218,
45571357,
48222848,
48747089,
49111776,
49754218,
50024241,
50474222,
50545849,
52580625,
58800268]
Thank you
You can flatten the nested lists with dict.fromkeys to remove duplicates while preserving the original order:
f = lambda x: list(dict.fromkeys(z for y in x for z in y))
df=df.groupby("doc_std_name").agg(publn_nrs=('cited_docdbs',f))
print (df)
publn_nrs
doc_std_name
BOSCH GMBH ROBERT [15057621, 16359315, 18731820, 19198211, 19198...
NAGAI TAKAYUKI [6293366, 7856452, 16980051, 23177359, 2647780...
SAMSUNG SDI CO LTD [6078355, 8173164, 14235835, 16940834, 1815241...
SEEO INC [15057621, 16359315, 18731820, 19198211, 19198...
If order is not important, use a set to remove duplicates:
f = lambda x: list(set(z for y in x for z in y))
df=df.groupby("doc_std_name").agg(publn_nrs=('cited_docdbs',f))
print (df)
publn_nrs
doc_std_name
BOSCH GMBH ROBERT [19700609, 19198211, 19198340, 44947081, 19198...
NAGAI TAKAYUKI [27453602, 7856452, 26477802, 23177359, 550188...
SAMSUNG SDI CO LTD [48222848, 18834564, 42075013, 58800268, 18704...
SEEO INC [19700609, 19198211, 19198340, 44947081, 19198...
You can just use sum in agg to concatenate the lists within each group.
df.groupby("doc_std_name").agg({"cited_docdbs": sum}).reset_index()
This will give the following:
doc_std_name cited_docdbs
0 BOSCH GMBH ROBERT [15057621, 16359315, 18731820, 19198211, 19198...
1 NAGAI TAKAYUKI [6293366, 7856452, 16980051, 23177359, 2647780...
2 SAMSUNG SDI CO LTD [6078355, 8173164, 14235835, 16940834, 1815241...
3 SEEO INC [15057621, 16359315, 18731820, 19198211, 19198...
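A side note: sum concatenates the lists pairwise, which can get slow when a group contains many long lists. A sketch of an equivalent single-pass version with itertools.chain (keeping duplicates, just as sum does):
from itertools import chain

# concatenate each group's lists in one pass instead of repeated list additions
df.groupby("doc_std_name").agg(
    cited_docdbs=("cited_docdbs", lambda lists: list(chain.from_iterable(lists)))
).reset_index()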

Compare one column (df1) with two columns (df2) in Python

I would like to compare one column (df1) with two columns (df2).
df1
name area
Cody California
Billy Connecticut
Jeniffer Indiana
Franc Georgia
Mark Illinois
Tamis Connecticut
Danye Illinois
Leesa Indiana
Hector Illinois
Coy California
df2
name1  name2     points
Billy  NA        20
Cody   NA        27.5
Coy    NA        25
Danye  NA        21
Franc  NA        19
NA     Hector    40
NA     Jeniffer  30
NA     Leesa     20
NA     Mark      50
NA     Tamis     90
Output
name area points
Cody California 27.5
Billy Connecticut 20
Jeniffer Indiana 30
Franc Georgia 19
Mark Illinois 50
Tamis Connecticut 90
Danye Illinois 21
Leesa Indiana 20
Hector Illinois 40
Coy California 25
You could try as follows:
import pandas as pd
import numpy as np
data = {'name': {0: 'Cody', 1: 'Billy', 2: 'Jeniffer', 3: 'Franc', 4: 'Mark',
5: 'Tamis', 6: 'Danye', 7: 'Leesa', 8: 'Hector', 9: 'Coy'},
'area': {0: 'California', 1: 'Connecticut', 2: 'Indiana', 3: 'Georgia',
4: 'Illinois', 5: 'Connecticut', 6: 'Illinois', 7: 'Indiana',
8: 'Illinois', 9: 'California'}}
df = pd.DataFrame(data)
data2 = {'name1': {0: 'Billy', 1: 'Cody', 2: 'Coy', 3: 'Danye', 4: 'Franc',
5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan},
'name2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: 'Hector',
6: 'Jeniffer', 7: 'Leesa', 8: 'Mark', 9: 'Tamis'},
'points': {0: 20.0, 1: 27.5, 2: 25.0, 3: 21.0, 4: 19.0, 5: 40.0,
6: 30.0, 7: 20.0, 8: 50.0, 9: 90.0}}
df2 = pd.DataFrame(data2)
# fill NaNs in `name2` based on `name1`
df2['name2'] = df2['name2'].fillna(df2['name1'])
# merge dfs
df_new = df.merge(df2[['name2','points']], left_on='name', right_on='name2')
print(df_new)
name area points
0 Cody California 27.5
1 Billy Connecticut 20.0
2 Jeniffer Indiana 30.0
3 Franc Georgia 19.0
4 Mark Illinois 50.0
5 Tamis Connecticut 90.0
6 Danye Illinois 21.0
7 Leesa Indiana 20.0
8 Hector Illinois 40.0
9 Coy California 25.0
Alternatively, instead of merge you could use map to add a points column to your first df:
df['points'] = df['name'].map(df2.set_index('name2')['points'])
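If you would rather not modify df2 first, the fillna step can be folded into the lookup itself; a small sketch, assuming the same df and df2 as above:
# build one combined name column on the fly, then map points onto df
lookup = df2.set_index(df2['name2'].fillna(df2['name1']))['points']
df['points'] = df['name'].map(lookup)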

Optimal conditional joining of pandas dataframe

I have a situation where I am trying to join df_a to df_b.
In reality, these dataframes have shapes (389944, 121) and (1098118, 60).
I need to conditionally join these two dataframes if any of the below conditions is true. If multiple conditions match, the pair only needs to be joined once:
df_a.player == df_b.handle
df_a.website == df_b.url
df_a.website == df_b.web_addr
df_a.website == df_b.notes
For an example...
df_a:
player          website                merch
michael jordan  www.michaeljordan.com  Y
Lebron James    www.kingjames.com      Y
Kobe Bryant     www.mamba.com          Y
Larry Bird      www.larrybird.com      Y
luka Doncic     www.77.com             N
df_b:
platform  url                              web_addr                                  notes                        handle       followers  following
Twitter   https://twitter.com/luka7doncic  www.77.com                                NaN                          luka7doncic  1500000    347
Twitter   www.larrybird.com                https://en.wikipedia.org/wiki/Larry_Bird  www.larrybird.com            nh           0          0
Twitter   NaN                              https://www.michaeljordansworld.com/      www.michaeljordan.com        nh           0          0
Twitter   https://twitter.com/kobebryant   https://granitystudios.com/               https://granitystudios.com/  Kobe Bryant  14900000   514
Twitter   fooman.com                       thefoo.com                                foobar                       foobarman    1          1
Twitter   www.stackoverflow.com            NaN                                       NaN                          nh           0          0
Ideally, df_a gets left joined to df_b to bring in the handle, followers, and following fields
player          website                merch  handle       followers  following
michael jordan  www.michaeljordan.com  Y      nh           0          0
Lebron James    www.kingjames.com      Y      null         null       null
Kobe Bryant     www.mamba.com          Y      Kobe Bryant  14900000   514
Larry Bird      www.larrybird.com      Y      nh           0          0
luka Doncic     www.77.com             N      luka7doncic  1500000    347
A minimal, reproducible example is below:
import pandas as pd, numpy as np
df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan', 1: 'Lebron James', 2: 'Kobe Bryant', 3: 'Larry Bird', 4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com', 1: 'www.kingjames.com', 2: 'www.mamba.com', 3: 'www.larrybird.com', 4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter', 1: 'Twitter', 2: 'Twitter', 3: 'Twitter', 4: 'Twitter', 5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic', 1: 'www.larrybird.com', 2: np.nan, 3: 'https://twitter.com/kobebryant', 4: 'fooman.com', 5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com', 1: 'https://en.wikipedia.org/wiki/Larry_Bird', 2: 'https://www.michaeljordansworld.com/', 3: 'https://granitystudios.com/', 4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan, 1: 'www.larrybird.com', 2: 'www.michaeljordan.com', 3: 'https://granitystudios.com/', 4: 'foobar', 5: np.nan}, 'handle': {0: 'luka7doncic', 1: 'nh', 2: 'nh', 3: 'Kobe Bryant', 4: 'foobarman', 5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})
cols_to_join = ['url', 'web_addr', 'notes']
on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')
res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)
final
However, this produces erroneous results with duplicate columns.
How can I do this more efficiently and with correct results?
Use:
# for the same input, drop the result columns from df_a
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)
# melt df_b so the cols_to_join columns become one 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
# because of duplicates, keep only one row per website (the one with the most followers)
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')
print (df2)
followers following handle platform variable \
9 14900000 514 Kobe Bryant Twitter web_addr
3 14900000 514 Kobe Bryant Twitter url
6 1500000 347 luka7doncic Twitter web_addr
12 1500000 347 luka7doncic Twitter notes
0 1500000 347 luka7doncic Twitter url
10 1 1 foobarman Twitter web_addr
4 1 1 foobarman Twitter url
16 1 1 foobarman Twitter notes
5 0 0 nh Twitter url
7 0 0 nh Twitter web_addr
8 0 0 nh Twitter web_addr
1 0 0 nh Twitter url
14 0 0 nh Twitter notes
website
9 https://granitystudios.com/
3 https://twitter.com/kobebryant
6 www.77.com
12 NaN
0 https://twitter.com/luka7doncic
10 thefoo.com
4 fooman.com
16 foobar
5 www.stackoverflow.com
7 https://en.wikipedia.org/wiki/Larry_Bird
8 https://www.michaeljordansworld.com/
1 www.larrybird.com
14 www.michaeljordan.com
# merge twice; because the index values match, fill missing values from the first merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')
dffin = dffin2.fillna(dffin1)
print (dffin)
player website merch followers following \
0 michael jordan www.michaeljordan.com Y 0.0 0.0
1 Lebron James www.kingjames.com Y NaN NaN
2 Kobe Bryant www.mamba.com Y 14900000.0 514.0
3 Larry Bird www.larrybird.com Y 0.0 0.0
4 luka Doncic www.77.com N 1500000.0 347.0
handle
0 nh
1 NaN
2 Kobe Bryant
3 nh
4 luka7doncic
You can pass left_on and right_on with lists -
final = df_a.merge(
    right=df_b,
    left_on=['player', 'website', 'website', 'website'],
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)

How to calculate a win streak in Python/Pandas

I'm trying to calculate the win streak or losing streak going into a game. My goal is to generate a betting decision based on these streak factors or a recent record. I am new to Python and Pandas (and programming in general), so any detailed explanation of what the code does would be welcome.
Here's my data
Season Game Date Game Index Away Team Away Score Home Team Home Score Winner Loser
0 2014 Regular Season Saturday, March 22, 2014 2014032201 Los Angeles Dodgers 3 Arizona D'Backs 1 Los Angeles Dodgers Arizona D'Backs
1 2014 Regular Season Sunday, March 23, 2014 2014032301 Los Angeles Dodgers 7 Arizona D'Backs 5 Los Angeles Dodgers Arizona D'Backs
2 2014 Regular Season Sunday, March 30, 2014 2014033001 Los Angeles Dodgers 1 San Diego Padres 3 San Diego Padres Los Angeles Dodgers
3 2014 Regular Season Monday, March 31, 2014 2014033101 Seattle Mariners 10 Los Angeles Angels 3 Seattle Mariners Los Angeles Angels
4 2014 Regular Season Monday, March 31, 2014 2014033102 San Francisco Giants 9 Arizona D'Backs 8 San Francisco Giants Arizona D'Backs
5 2014 Regular Season Monday, March 31, 2014 2014033103 Boston Red Sox 1 Baltimore Orioles 2 Baltimore Orioles Boston Red Sox
6 2014 Regular Season Monday, March 31, 2014 2014033104 Minnesota Twins 3 Chicago White Sox 5 Chicago White Sox Minnesota Twins
7 2014 Regular Season Monday, March 31, 2014 2014033105 St. Louis Cardinals 1 Cincinnati Reds 0 St. Louis Cardinals Cincinnati Reds
8 2014 Regular Season Monday, March 31, 2014 2014033106 Kansas City Royals 3 Detroit Tigers 4 Detroit Tigers Kansas City Royals
9 2014 Regular Season Monday, March 31, 2014 2014033107 Colorado Rockies 1 Miami Marlins 10 Miami Marlins Colorado Rockies
Dictionary below:
{'Away Score': {0: 3, 1: 7, 2: 1, 3: 10, 4: 9},
'Away Team': {0: 'Los Angeles Dodgers',
1: 'Los Angeles Dodgers',
2: 'Los Angeles Dodgers',
3: 'Seattle Mariners',
4: 'San Francisco Giants'},
'Game Date': {0: 'Saturday, March 22, 2014',
1: 'Sunday, March 23, 2014',
2: 'Sunday, March 30, 2014',
3: 'Monday, March 31, 2014',
4: 'Monday, March 31, 2014'},
'Game Index': {0: 2014032201,
1: 2014032301,
2: 2014033001,
3: 2014033101,
4: 2014033102},
'Home Score': {0: 1, 1: 5, 2: 3, 3: 3, 4: 8},
'Home Team': {0: "Arizona D'Backs",
1: "Arizona D'Backs",
2: 'San Diego Padres',
3: 'Los Angeles Angels',
4: "Arizona D'Backs"},
'Loser': {0: "Arizona D'Backs",
1: "Arizona D'Backs",
2: 'Los Angeles Dodgers',
3: 'Los Angeles Angels',
4: "Arizona D'Backs"},
'Season': {0: '2014 Regular Season',
1: '2014 Regular Season',
2: '2014 Regular Season',
3: '2014 Regular Season',
4: '2014 Regular Season'},
'Winner': {0: 'Los Angeles Dodgers',
1: 'Los Angeles Dodgers',
2: 'San Diego Padres',
3: 'Seattle Mariners',
4: 'San Francisco Giants'}}
I've tried looping through the season and the team, and then creating a streak count based on this GitHub project: https://github.com/nhcamp/EPL-Betting/blob/master/EPL%20Match%20Results%20DF.ipynb
I run into key errors early when building my loops, and I have trouble identifying the data.
game_table = pd.read_csv('MLB_Scores_2014_2018.csv')
# Get Team List
team_list = game_table['Away Team'].unique()
# Get Season List
season_list = game_table['Season'].unique()
# Defining "chunks" to append game data to the total dataframe
chunks = []
for season in season_list:
    # Looping through seasons. Streaks reset for each season
    season_games = game_table[game_table['Season'] == season]
    for team in team_list:
        # Looping through teams
        season_team_games = season_games[(season_games['Away Team'] == team | season_games['Home Team'] == team)]
        # Setting streak list and streak counter values
        streak_list = []
        streak = 0
        # Looping through each game
        for game in season_team_games.iterrow():
            # Check if team is a winner, and up the streak
            if game_table['Winner'] == team:
                streak_list.append(streak)
                streak += 1
            # If not the winner, append streak and set to zero
            elif game_table['Winner'] != team:
                streak_list.append(streak)
                streak = 0
            # Just in case something weird happens with the scores
            else:
                streak_list.append(streak)
        game_table['Streak'] = streak_list
        chunk_list.append(game_table)
And that's kind of where I lose it. How do I append separately if each team is the home team or the away team? Is there a better way to display this data?
As a general matter, I want to add a win-streak and/or losing-streak for each team in each game. Headers would look like this:
| Season | Game Date | Game Index | Away Team | Away Score | Home Team | Home Score | Winner | Loser | Away Win Streak | Away Lose Streak | Home Win Streak | Home Lose Streak |
Edit: this error message has been resolved
I also get an error creating the dataframe 'season_team_games':
TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]
The error you are seeing comes from the statement:
season_team_games = season_games[(season_games['Away Team'] == team | season_games['Home Team'] == team)]
When you're combining two boolean conditions, you need to separate them with parentheses. This is because the | operator takes precedence over the == operator. So this should become:
season_team_games = season_games[(season_games['Away Team'] == team) | (season_games['Home Team'] == team)]
I know there is more to the question than this error, but as mentioned in the comment, once you provide some text-based data, it will be easier to help.

How can I improve performance on my apply() with fuzzy matching statement

I've written a function called muzz that leverages the fuzzywuzzy module to 'merge' two pandas dataframes. It works great, but the performance is pretty bad on larger frames. Please take a look at my apply() that does the extracting/scoring and let me know if you have any ideas that could speed it up.
import pandas as pd
import numpy as np
import fuzzywuzzy as fw
Create a frame of raw data
dfRaw = pd.DataFrame({'City': {0: u'St Louis',
                               1: 'Omaha',
                               2: 'Chicogo',
                               3: 'Kansas city',
                               4: 'Des Moine'},
                      'State': {0: 'MO', 1: 'NE', 2: 'IL', 3: 'MO', 4: 'IA'}})
Which yields
City State
0 St Louis MO
1 Omaha NE
2 Chicogo IL
3 Kansas city MO
4 Des Moine IA
Then a frame that represents the good data that we want to look up
dfLocations = pd.DataFrame({'City': {0: 'Saint Louis',
                                     1: u'Omaha',
                                     2: u'Chicago',
                                     3: u'Kansas City',
                                     4: u'Des Moines'},
                            'State': {0: 'MO', 1: 'NE', 2: 'IL',
                                      3: 'KS', 4: 'IA'},
                            u'Zip': {0: '63201', 1: '68104', 2: '60290',
                                     3: '68101', 4: '50301'}})
Which yields
City State Zip
0 Saint Louis MO 63201
1 Omaha NE 68104
2 Chicago IL 60290
3 Kansas City KS 68101
4 Des Moines IA 50301
and now the muzz function. EDIT: Added the choices = right[match_col_name] line and used choices in the apply per Brenbarn's suggestion. I also, per Brenbarn's suggestion, ran some tests with extractOne() without the apply, and it appears to be the bottleneck. Maybe there's a faster way to do the fuzzy matching?
def muzz(left, right, on, match_col_name='match_on', score_col_name='score_match',
         right_suffix='_match', score_cutoff=80):
    right[match_col_name] = np.sum(right[on], axis=1)
    choices = right[match_col_name]
    ### The offending statement ###
    left[[match_col_name, score_col_name]] = pd.Series(np.sum(left[on], axis=1)).apply(
        lambda x: pd.Series(fw.process.extractOne(x, choices, score_cutoff=score_cutoff)))
    dfTemp = pd.merge(left, right, how='left', on=match_col_name, suffixes=('', right_suffix))
    return dfTemp.drop(match_col_name, axis=1)
Calling muzz
muzz(dfRaw.copy(),dfLocations,on=['City','State'], score_cutoff=85)
Which yields
City State score_match City_match State_match Zip
0 St Louis MO 87 Saint Louis MO 63201
1 Omaha NE 100 Omaha NE 68104
2 Chicogo IL 89 Chicago IL 60290
3 Kansas city MO NaN NaN NaN NaN
4 Des Moine IA 96 Des Moines IA 50301
