I want to create a DataFrame that links two columns together (customer ID to each order ID the customer placed). The row index + 1 corresponds to the customer ID. Is there a way to do this through mapping?
Data: invoice_df
Order Id,Date,Meal Id,Company Id,Date of Meal,Participants,Meal Price,Type of Meal
839FKFW2LLX4LMBB,27-05-2016,INBUX904GIHI8YBD,LJKS5NK6788CYMUU,2016-05-31 07:00:00+02:00,['David Bishop'],469,Breakfast
97OX39BGVMHODLJM,27-09-2018,J0MMOOPP709DIDIE,LJKS5NK6788CYMUU,2018-10-01 20:00:00+02:00,['David Bishop'],22,Dinner
041ORQM5OIHTIU6L,24-08-2014,E4UJLQNCI16UX5CS,LJKS5NK6788CYMUU,2014-08-23 14:00:00+02:00,['Karen Stansell'],314,Lunch
YT796QI18WNGZ7ZJ,12-04-2014,C9SDFHF7553BE247,LJKS5NK6788CYMUU,2014-04-07 21:00:00+02:00,['Addie Patino'],438,Dinner
6YLROQT27B6HRF4E,28-07-2015,48EQXS6IHYNZDDZ5,LJKS5NK6788CYMUU,2015-07-27 14:00:00+02:00,['Addie Patino' 'Susan Guerrero'],690,Lunch
AT0R4DFYYAFOC88Q,21-07-2014,W48JPR1UYWJ18NC6,LJKS5NK6788CYMUU,2014-07-17 20:00:00+02:00,['David Bishop' 'Susan Guerrero' 'Karen Stansell'],181,Dinner
2DDN2LHS7G85GKPQ,29-04-2014,1MKLAKBOE3SP7YUL,LJKS5NK6788CYMUU,2014-04-30 21:00:00+02:00,['Susan Guerrero' 'David Bishop'],14,Dinner
FM608JK1N01BPUQN,08-05-2014,E8WJZ1FOSKZD2MJN,36MFTZOYMTAJP1RK,2014-05-07 09:00:00+02:00,['Amanda Knowles' 'Cheryl Feaster' 'Ginger Hoagland' 'Michael White'],320,Breakfast
CK331XXNIBQT81QL,23-05-2015,CTZSFFKQTY7SBZ4J,36MFTZOYMTAJP1RK,2015-05-18 13:00:00+02:00,['Cheryl Feaster' 'Amanda Knowles' 'Ginger Hoagland'],697,Lunch
FESGKOQN2OZZWXY3,10-01-2016,US0NQYNNHS1SQJ4S,36MFTZOYMTAJP1RK,2016-01-14 22:00:00+01:00,['Glenn Gould' 'Amanda Knowles' 'Ginger Hoagland' 'Michael White'],451,Dinner
YITOTLOF0MWZ0VYX,03-10-2016,RGYX8772307H78ON,36MFTZOYMTAJP1RK,2016-10-01 22:00:00+02:00,['Ginger Hoagland' 'Amanda Knowles' 'Michael White'],263,Dinner
8RIGCF74GUEQHQEE,23-07-2018,5XK0KTFTD6OAP9ZP,36MFTZOYMTAJP1RK,2018-07-27 08:00:00+02:00,['Amanda Knowles'],210,Breakfast
TH60C9D8TPYS7DGG,15-12-2016,KDSMP2VJ22HNEPYF,36MFTZOYMTAJP1RK,2016-12-13 08:00:00+01:00,['Cheryl Feaster' 'Bret Adams' 'Ginger Hoagland'],755,Breakfast
W1Y086SRAVUZU1AL,17-09-2017,8IUOYVS031QPROUG,36MFTZOYMTAJP1RK,2017-09-14 13:00:00+02:00,['Bret Adams'],469,Lunch
WKB58Q8BHLOFQAB5,31-08-2016,E2K2TQUMENXSI9RP,36MFTZOYMTAJP1RK,2016-09-03 14:00:00+02:00,['Michael White' 'Ginger Hoagland' 'Bret Adams'],502,Lunch
N8DOG58MW238BHA9,25-12-2018,KFR2TAYXZSVCHAA2,36MFTZOYMTAJP1RK,2018-12-20 12:00:00+01:00,['Ginger Hoagland' 'Cheryl Feaster' 'Glenn Gould' 'Bret Adams'],829,Lunch
DPDV9UGF0SUCYTGW,25-05-2017,6YV61SH7W9ECUZP0,36MFTZOYMTAJP1RK,2017-05-24 22:00:00+02:00,['Michael White'],708,Dinner
KNF3E3QTOQ22J269,20-06-2018,737T2U7604ABDFDF,36MFTZOYMTAJP1RK,2018-06-15 07:00:00+02:00,['Glenn Gould' 'Cheryl Feaster' 'Ginger Hoagland' 'Amanda Knowles'],475,Breakfast
LEED1HY47M8BR5VL,22-10-2017,I22P10IQQD06MO45,36MFTZOYMTAJP1RK,2017-10-22 14:00:00+02:00,['Glenn Gould'],27,Lunch
LSJPNJQLDTIRNWAL,27-01-2017,247IIVNN6CXGWINB,36MFTZOYMTAJP1RK,2017-01-23 13:00:00+01:00,['Amanda Knowles' 'Bret Adams'],672,Lunch
6UX5RMHJ1GK1F9YQ,24-08-2014,LL4AOPXDM8V5KP5S,H3JRC7XX7WJAD4ZO,2014-08-27 12:00:00+02:00,['Anthony Emerson' 'Irvin Gentry' 'Melba Inlow'],552,Lunch
5SYB15QEFWD1E4Q4,09-07-2017,KZI0VRU30GLSDYHA,H3JRC7XX7WJAD4ZO,2017-07-13 08:00:00+02:00,"['Anthony Emerson' 'Emma Steitz' 'Melba Inlow' 'Irvin Gentry'
'Kelly Killebrew']",191,Breakfast
W5S8VZ61WJONS4EE,25-03-2017,XPSPBQF1YLIG26N1,H3JRC7XX7WJAD4ZO,2017-03-25 07:00:00+01:00,['Irvin Gentry' 'Kelly Killebrew'],471,Breakfast
795SVIJKO8KS3ZEL,05-01-2015,HHTLB8M9U0TGC7Z4,H3JRC7XX7WJAD4ZO,2015-01-06 22:00:00+01:00,['Emma Steitz'],588,Dinner
8070KEFYSSPWPCD0,05-08-2014,VZ2OL0LREO8V9RKF,H3JRC7XX7WJAD4ZO,2014-08-09 12:00:00+02:00,['Lewis Eyre'],98,Lunch
RUQOHROBGBOSNUO4,10-06-2016,R3LFUK1WFDODC1YF,H3JRC7XX7WJAD4ZO,2016-06-09 08:00:00+02:00,['Anthony Emerson' 'Kelly Killebrew' 'Lewis Eyre'],516,Breakfast
6P91QRADC2O9WOVT,25-09-2016,L2F2HEGB6Q141080,H3JRC7XX7WJAD4ZO,2016-09-26 07:00:00+02:00,"['Kelly Killebrew' 'Lewis Eyre' 'Irvin Gentry' 'Emma Steitz'
'Anthony Emerson']",664,Breakfast
Code:
import re
import pandas as pd

# Convert a string like "['name' 'name2']" to a list ['name', 'name2']
# Returns a list of participant names
def string_to_list(participant_string):
    return re.findall(r"'(.*?)'", participant_string)
invoice_df["Participants"] = invoice_df["Participants"].apply(string_to_list)
# Obtain an array of all unique customer names
customers = invoice_df["Participants"].explode().unique()
# Create new customer dataframe
customers_df = pd.DataFrame(customers, columns = ["CustomerName"])
# Add customer id
customers_df["customer_id"] = customers_df.index + 1
# Create first_name and last_name columns
customers_df["first_name"] = customers_df["CustomerName"].apply(lambda x: x.split(" ")[0])
# Slice from index 1 and rejoin, in case the person has multiple last names
customers_df["last_name"] = customers_df["CustomerName"].apply(lambda x: " ".join(x.split(" ")[1:]))
Solution
# Find all the occurrences of customer names
# then explode to convert values in lists to rows
cust = invoice_df['Participants'].str.findall(r"'(.*?)'").explode()
# Join with orderid
customers_df = invoice_df[['Order Id']].join(cust)
# factorize to encode the unique values in participants
customers_df['Customer Id'] = customers_df['Participants'].factorize()[0] + 1
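Side note: factorize also returns the unique names in encounter order, so if you want a standalone customer table as well, something like the sketch below should work (customer_lookup is just an illustrative name, not something from the question):
codes, uniques = customers_df['Participants'].factorize()
customer_lookup = pd.DataFrame({
    'Customer Id': range(1, len(uniques) + 1),
    'CustomerName': uniques,
})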
Result
Order Id Participants Customer Id
0 839FKFW2LLX4LMBB David Bishop 1
1 97OX39BGVMHODLJM David Bishop 1
2 041ORQM5OIHTIU6L Karen Stansell 2
3 YT796QI18WNGZ7ZJ Addie Patino 3
4 6YLROQT27B6HRF4E Addie Patino 3
4 6YLROQT27B6HRF4E Susan Guerrero 4
5 AT0R4DFYYAFOC88Q David Bishop 1
5 AT0R4DFYYAFOC88Q Susan Guerrero 4
5 AT0R4DFYYAFOC88Q Karen Stansell 2
6 2DDN2LHS7G85GKPQ Susan Guerrero 4
6 2DDN2LHS7G85GKPQ David Bishop 1
7 FM608JK1N01BPUQN Amanda Knowles 5
7 FM608JK1N01BPUQN Cheryl Feaster 6
7 FM608JK1N01BPUQN Ginger Hoagland 7
7 FM608JK1N01BPUQN Michael White 8
8 CK331XXNIBQT81QL Cheryl Feaster 6
8 CK331XXNIBQT81QL Amanda Knowles 5
8 CK331XXNIBQT81QL Ginger Hoagland 7
9 FESGKOQN2OZZWXY3 Glenn Gould 9
9 FESGKOQN2OZZWXY3 Amanda Knowles 5
9 FESGKOQN2OZZWXY3 Ginger Hoagland 7
9 FESGKOQN2OZZWXY3 Michael White 8
10 YITOTLOF0MWZ0VYX Ginger Hoagland 7
10 YITOTLOF0MWZ0VYX Amanda Knowles 5
10 YITOTLOF0MWZ0VYX Michael White 8
11 8RIGCF74GUEQHQEE Amanda Knowles 5
12 TH60C9D8TPYS7DGG Cheryl Feaster 6
12 TH60C9D8TPYS7DGG Bret Adams 10
12 TH60C9D8TPYS7DGG Ginger Hoagland 7
13 W1Y086SRAVUZU1AL Bret Adams 10
14 WKB58Q8BHLOFQAB5 Michael White 8
14 WKB58Q8BHLOFQAB5 Ginger Hoagland 7
14 WKB58Q8BHLOFQAB5 Bret Adams 10
15 N8DOG58MW238BHA9 Ginger Hoagland 7
15 N8DOG58MW238BHA9 Cheryl Feaster 6
15 N8DOG58MW238BHA9 Glenn Gould 9
15 N8DOG58MW238BHA9 Bret Adams 10
16 DPDV9UGF0SUCYTGW Michael White 8
17 KNF3E3QTOQ22J269 Glenn Gould 9
17 KNF3E3QTOQ22J269 Cheryl Feaster 6
17 KNF3E3QTOQ22J269 Ginger Hoagland 7
17 KNF3E3QTOQ22J269 Amanda Knowles 5
18 LEED1HY47M8BR5VL Glenn Gould 9
19 LSJPNJQLDTIRNWAL Amanda Knowles 5
19 LSJPNJQLDTIRNWAL Bret Adams 10
20 6UX5RMHJ1GK1F9YQ Anthony Emerson 11
20 6UX5RMHJ1GK1F9YQ Irvin Gentry 12
20 6UX5RMHJ1GK1F9YQ Melba Inlow 13
21 5SYB15QEFWD1E4Q4 Anthony Emerson 11
21 5SYB15QEFWD1E4Q4 Emma Steitz 14
21 5SYB15QEFWD1E4Q4 Melba Inlow 13
21 5SYB15QEFWD1E4Q4 Irvin Gentry 12
21 5SYB15QEFWD1E4Q4 Kelly Killebrew 15
22 W5S8VZ61WJONS4EE Irvin Gentry 12
22 W5S8VZ61WJONS4EE Kelly Killebrew 15
23 795SVIJKO8KS3ZEL Emma Steitz 14
24 8070KEFYSSPWPCD0 Lewis Eyre 16
25 RUQOHROBGBOSNUO4 Anthony Emerson 11
25 RUQOHROBGBOSNUO4 Kelly Killebrew 15
25 RUQOHROBGBOSNUO4 Lewis Eyre 16
26 6P91QRADC2O9WOVT Kelly Killebrew 15
26 6P91QRADC2O9WOVT Lewis Eyre 16
26 6P91QRADC2O9WOVT Irvin Gentry 12
26 6P91QRADC2O9WOVT Emma Steitz 14
26 6P91QRADC2O9WOVT Anthony Emerson 11
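To answer the original "through mapping" question directly: yes, the same link can be built with Series.map once you have a name-to-id lookup. A minimal sketch, assuming Participants has already been converted to real lists by string_to_list and that customers_df here means the table built in your own code above (with CustomerName and customer_id columns), not the one redefined in the solution:
# Build a name -> customer_id lookup, explode to one row per (order, participant),
# then map the names onto their ids
id_map = dict(zip(customers_df['CustomerName'], customers_df['customer_id']))
orders = invoice_df[['Order Id', 'Participants']].explode('Participants')
orders['Customer Id'] = orders['Participants'].map(id_map)

# If the end goal is one row per customer with every order they placed:
orders_per_customer = orders.groupby('Customer Id')['Order Id'].apply(list).reset_index()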
Below is my code and my DataFrames. stats_df is much bigger. Not sure if it matters, but the column values are EXACTLY as they appear in the actual files. I can't merge the two DataFrames without losing 'Alex Len', even though both have the same PlayerID value of '20000852'.
stats_df = pd.read_csv('stats_todate.csv')
matchup_df = pd.read_csv('matchup.csv')
new_df = pd.merge(stats_df, matchup_df[['PlayerID','Matchup','Started','GameStatus']])
I have also tried:
stats_df['PlayerID'] = stats_df['PlayerID'].astype(str)
matchup_df['PlayerID'] = matchup_df['PlayerID'].astype(str)
stats_df['PlayerID'] = stats_df['PlayerID'].str.strip()
matchup_df['PlayerID'] = matchup_df['PlayerID'].str.strip()
Any ideas?
Here are my two DataFrames:
DF1:
PlayerID SeasonType Season Name Team Position
20001713 1 2018 A.J. Hammons MIA C
20002725 2 2022 A.J. Lawson ATL SG
20002038 2 2021 Élie Okobo BKN PG
20002742 2 2022 Aamir Simms NY PF
20000518 3 2018 Aaron Brooks MIN PG
20000681 1 2022 Aaron Gordon DEN PF
20001395 1 2018 Aaron Harrison DAL SG
20002680 1 2022 Aaron Henry PHI SF
20002005 1 2022 Aaron Holiday PHO PG
20001981 3 2018 Aaron Jackson HOU PF
20002539 1 2022 Aaron Nesmith BOS SF
20002714 1 2022 Aaron Wiggins OKC SG
20001721 1 2022 Abdel Nader PHO SF
20002251 2 2020 Abdul Gaddy OKC PG
20002458 1 2021 Adam Mokoka CHI SG
20002619 1 2022 Ade Murkey SAC PF
20002311 1 2022 Admiral Schofield ORL PF
20000783 1 2018 Adreian Payne ORL PF
20002510 1 2022 Ahmad Caver IND PG
20002498 2 2020 Ahmed Hill CHA PG
20000603 1 2022 Al Horford BOS PF
20000750 3 2018 Al Jefferson IND C
20001645 1 2019 Alan Williams BKN PF
20000837 1 2022 Alec Burks NY SG
20001882 1 2018 Alec Peters PHO PF
20002850 1 2022 Aleem Ford ORL SF
20002542 1 2022 Aleksej Pokuševski OKC PF
20002301 3 2021 Alen Smailagic GS PF
20001763 1 2019 Alex Abrines OKC SG
20001801 1 2022 Alex Caruso CHI SG
20000852 1 2022 Alex Len SAC C
DF2:
PlayerID Name Date Started Opponent GameStatus Matchup
20000681 Aaron Gordon 4/1/2022 1 MIN 16
20002005 Aaron Holiday 4/1/2022 0 MEM 21
20002539 Aaron Nesmith 4/1/2022 0 IND 13
20002714 Aaron Wiggins 4/1/2022 1 DET 14
20002311 Admiral Schofield 4/1/2022 0 TOR 10
20000603 Al Horford 4/1/2022 1 IND 13
20002542 Aleksej Pokuševski 4/1/2022 1 DET 14
20000852 Alex Len 4/1/2022 1 HOU 22
You need to specify the column you want to merge on using the on keyword argument:
new_df = pd.merge(stats_df, matchup_df[['PlayerID','Matchup','Started','GameStatus']], on=['PlayerID'])
Otherwise it will merge on every column the two DataFrames share, and with the default inner join a row survives only if all of those shared columns match.
Here is the explanation from the pandas docs:
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If on is None and not merging on indexes then this defaults
to the intersection of the columns in both DataFrames.
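If rows such as Alex Len still disappear after adding on='PlayerID', the next thing to check is a dtype mismatch between the two PlayerID columns (int in one file, string in the other). A minimal debugging sketch, assuming the column names from your question; indicator=True adds a _merge column showing which side each row came from:
# Compare the key dtypes first
print(stats_df['PlayerID'].dtype, matchup_df['PlayerID'].dtype)

check = stats_df.merge(
    matchup_df[['PlayerID', 'Matchup', 'Started', 'GameStatus']],
    on='PlayerID',
    how='left',
    indicator=True,
)
# Rows marked 'left_only' failed to find a partner in matchup_df
print(check['_merge'].value_counts())
print(check.loc[check['Name'] == 'Alex Len', ['PlayerID', '_merge']])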
This is similar to some other questions posted, but I can't find an answer that fits my needs.
I have a DataFrame with the following:
RK PLAYER SCHOOL YEAR POS POS RK HT WT 2019 2018 2017 2016
0 1 Nick Bosa Ohio St. Jr EDGE 1 6-4 266 Jr
1 2 Quinnen Williams Alabama Soph DL 1 6-3 303 Soph
2 3 Josh Allen Kentucky Sr EDGE 2 6-5 262 Sr
3 4 Ed Oliver Houston Jr DL 2 6-2 287 Jr
The 2018, 2017, and 2016 columns hold np.NaN values, but I can't format the table correctly with them in it.
I also have a separate list containing the following:
season = ['Sr', 'Jr', 'Soph', 'Fr']
The 2019 column shows each player's current status, and I would like the 2018 column to show their status as of the prior year. So if 2019 is 'Sr', 2018 should be 'Jr'. Essentially, I want the column to look up the value in the season list, move one index ahead, and take that value back into the column. The result for 2018 should be:
RK PLAYER SCHOOL YEAR POS POS RK HT WT 2019 2018 2017 2016
0 1 Nick Bosa Ohio St. Jr EDGE 1 6-4 266 Jr Soph
1 2 Quinnen Williams Alabama Soph DL 1 6-3 303 Soph Fr
2 3 Josh Allen Kentucky Sr EDGE 2 6-5 262 Sr Jr
3 4 Ed Oliver Houston Jr DL 2 6-2 287 Jr Soph
I can think of a way to do this with a for k, v in iteritems loop that checks the values, but I'm wondering if there's a better way?
I'm not sure if this is much smarter than what you already have, but it's a suggestion:
import pandas as pd

def get_season(curr_season, curr_year, prev_year):
    season = ['Sr', 'Jr', 'Soph', 'Fr']
    try:
        return season[season.index(curr_season) + (curr_year - prev_year)]
    except IndexError:
        # Return some meaningful message perhaps?
        return '-'

df = pd.DataFrame({'2019': ['Jr', 'Soph', 'Sr', 'Jr']})
df['2018'] = [get_season(s, 2019, 2018) for s in df['2019']]
df['2017'] = [get_season(s, 2019, 2017) for s in df['2019']]
df['2016'] = [get_season(s, 2019, 2016) for s in df['2019']]
df
Out[18]:
2019 2018 2017 2016
0 Jr Soph Fr -
1 Soph Fr - -
2 Sr Jr Soph Fr
3 Jr Soph Fr -
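If you only ever need to step back one class year at a time, a plain dictionary with Series.map also works and avoids the Python-level loop. A minimal sketch (prev_season is just an illustrative name; anything that would fall before 'Fr' comes out as NaN rather than '-'):
prev_season = {'Sr': 'Jr', 'Jr': 'Soph', 'Soph': 'Fr'}
df['2018'] = df['2019'].map(prev_season)
df['2017'] = df['2018'].map(prev_season)
df['2016'] = df['2017'].map(prev_season)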
Another possible solution is to write a function that accepts a row, slices the seasons list starting from the row's '2019' value, and returns that slice as a pandas.Series. We can then apply that function row-wise using apply(). I used part of your input DataFrame for testing.
In [3]: df
Out[3]:
WT 2019 2018 2017 2016
0 266 Jr NaN NaN NaN
1 303 Soph NaN NaN NaN
2 262 Sr NaN NaN NaN
3 287 Jr NaN NaN NaN
In [4]: def fill_row(row):
   ...:     season = ['Sr', 'Jr', 'Soph', 'Fr']
   ...:     data = season[season.index(row['2019']):]
   ...:     return pd.Series(data)
In [5]: cols_to_update = ['2019', '2018', '2017', '2016']
In [6]: df[cols_to_update] = df[cols_to_update].apply(fill_row, axis=1)
In [7]: df
Out[7]:
WT 2019 2018 2017 2016
0 266 Jr Soph Fr NaN
1 303 Soph Fr NaN NaN
2 262 Sr Jr Soph Fr
3 287 Jr Soph Fr NaN
Sorry, I just asked this question: Pythonic Way to have multiple Or's when conditioning in a dataframe. I marked it as answered prematurely because the accepted answer passed my overly simplistic test case, but it isn't working more generally. (If it is possible to merge and reopen the question, that would be great.)
Here is the full issue:
sum(data['Name'].isin(eligible_players))
> 0
sum(data['Name'] == "Antonio Brown")
> 68
"Antonio Brown" in eligible_players
> True
Basically, if I understand correctly, I am showing that Antonio Brown is in eligible_players and he is in the DataFrame. However, for some reason .isin() isn't working properly.
As I said in my prior question, I am looking for a way to check many ORs at once to select the proper rows.
____ EDIT ____
In[14]:
eligible_players
Out[14]:
Name
Antonio Brown 378
Demaryius Thomas 334
Jordy Nelson 319
Dez Bryant 309
Emmanuel Sanders 293
Odell Beckham 289
Julio Jones 288
Randall Cobb 284
Jeremy Maclin 267
T.Y. Hilton 255
Alshon Jeffery 252
Golden Tate 250
Mike Evans 236
DeAndre Hopkins 223
Calvin Johnson 220
Kelvin Benjamin 218
Julian Edelman 213
Anquan Boldin 213
Steve Smith 213
Roddy White 208
Brandon LaFell 205
Mike Wallace 205
A.J. Green 203
DeSean Jackson 200
Jordan Matthews 194
Eric Decker 194
Sammy Watkins 190
Torrey Smith 186
Andre Johnson 186
Jarvis Landry 178
Eddie Royal 176
Brandon Marshall 175
Vincent Jackson 175
Rueben Randle 174
Marques Colston 173
Mohamed Sanu 171
Keenan Allen 170
James Jones 168
Malcom Floyd 168
Kenny Stills 167
Greg Jennings 162
Kendall Wright 162
Doug Baldwin 160
Michael Floyd 159
Robert Woods 158
Name: Pts, dtype: int64
and
In [31]:
data.tail(110)
Out[31]:
Name Pts year week pos Team
28029 Dez Bryant 25 2014 17 WR DAL
28030 Antonio Brown 25 2014 17 WR PIT
28031 Jordan Matthews 24 2014 17 WR PHI
28032 Randall Cobb 23 2014 17 WR GB
28033 Rueben Randle 21 2014 17 WR NYG
28034 Demaryius Thomas 19 2014 17 WR DEN
28035 Calvin Johnson 19 2014 17 WR DET
28036 Torrey Smith 18 2014 17 WR BAL
28037 Roddy White 17 2014 17 WR ATL
28038 Steve Smith 17 2014 17 WR BAL
28039 DeSean Jackson 16 2014 17 WR WAS
28040 Mike Evans 16 2014 17 WR TB
28041 Anquan Boldin 16 2014 17 WR SF
28042 Adam Thielen 15 2014 17 WR MIN
28043 Cecil Shorts 15 2014 17 WR JAC
28044 A.J. Green 15 2014 17 WR CIN
28045 Jordy Nelson 14 2014 17 WR GB
28046 Brian Hartline 14 2014 17 WR MIA
28047 Robert Woods 13 2014 17 WR BUF
28048 Kenny Stills 13 2014 17 WR NO
28049 Emmanuel Sanders 13 2014 17 WR DEN
28050 Eddie Royal 13 2014 17 WR SD
28051 Marques Colston 13 2014 17 WR NO
28052 Chris Owusu 12 2014 17 WR NYJ
28053 Brandon LaFell 12 2014 17 WR NE
28054 Dontrelle Inman 12 2014 17 WR SD
28055 Reggie Wayne 11 2014 17 WR IND
28056 Paul Richardson 11 2014 17 WR SEA
28057 Cole Beasley 11 2014 17 WR DAL
28058 Jarvis Landry 10 2014 17 WR MIA
(Aside: once you posted what you were actually using, it only took seconds to see the problem.)
Series.isin(something) iterates over something to determine the set of things you want to test membership in. But your eligible_players isn't a list, it's a Series. And iteration over a Series is iteration over the values, even though membership (in) is with respect to the index:
In [72]: eligible_players = pd.Series([10,20,30], index=["A","B","C"])
In [73]: list(eligible_players)
Out[73]: [10, 20, 30]
In [74]: "A" in eligible_players
Out[74]: True
So in your case, you could use eligible_players.index instead to pass the right names:
In [75]: df = pd.DataFrame({"Name": ["A","B","C","D"]})
In [76]: df
Out[76]:
Name
0 A
1 B
2 C
3 D
In [77]: df["Name"].isin(eligible_players) # remember, this will be [10, 20, 30]
Out[77]:
0 False
1 False
2 False
3 False
Name: Name, dtype: bool
In [78]: df["Name"].isin(eligible_players.index)
Out[78]:
0 True
1 True
2 True
3 False
Name: Name, dtype: bool
In [79]: df["Name"].isin(eligible_players.index).sum()
Out[79]: 3
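Applied back to the variables in your question (a sketch, using the names you posted): since eligible_players is a Series keyed by player name, pass its index to isin to filter the rows you want:
mask = data['Name'].isin(eligible_players.index)
mask.sum()                 # should now be non-zero instead of 0
eligible_rows = data[mask]  # only the rows for eligible players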