I want to create a data frame that links two columns: each customer ID to every order ID that customer placed. The customer ID is the row index + 1 of the customer table built in the code below. Is there a way to do this through mapping?
Data: invoice_df
Order Id,Date,Meal Id,Company Id,Date of Meal,Participants,Meal Price,Type of Meal
839FKFW2LLX4LMBB,27-05-2016,INBUX904GIHI8YBD,LJKS5NK6788CYMUU,2016-05-31 07:00:00+02:00,['David Bishop'],469,Breakfast
97OX39BGVMHODLJM,27-09-2018,J0MMOOPP709DIDIE,LJKS5NK6788CYMUU,2018-10-01 20:00:00+02:00,['David Bishop'],22,Dinner
041ORQM5OIHTIU6L,24-08-2014,E4UJLQNCI16UX5CS,LJKS5NK6788CYMUU,2014-08-23 14:00:00+02:00,['Karen Stansell'],314,Lunch
YT796QI18WNGZ7ZJ,12-04-2014,C9SDFHF7553BE247,LJKS5NK6788CYMUU,2014-04-07 21:00:00+02:00,['Addie Patino'],438,Dinner
6YLROQT27B6HRF4E,28-07-2015,48EQXS6IHYNZDDZ5,LJKS5NK6788CYMUU,2015-07-27 14:00:00+02:00,['Addie Patino' 'Susan Guerrero'],690,Lunch
AT0R4DFYYAFOC88Q,21-07-2014,W48JPR1UYWJ18NC6,LJKS5NK6788CYMUU,2014-07-17 20:00:00+02:00,['David Bishop' 'Susan Guerrero' 'Karen Stansell'],181,Dinner
2DDN2LHS7G85GKPQ,29-04-2014,1MKLAKBOE3SP7YUL,LJKS5NK6788CYMUU,2014-04-30 21:00:00+02:00,['Susan Guerrero' 'David Bishop'],14,Dinner
FM608JK1N01BPUQN,08-05-2014,E8WJZ1FOSKZD2MJN,36MFTZOYMTAJP1RK,2014-05-07 09:00:00+02:00,['Amanda Knowles' 'Cheryl Feaster' 'Ginger Hoagland' 'Michael White'],320,Breakfast
CK331XXNIBQT81QL,23-05-2015,CTZSFFKQTY7SBZ4J,36MFTZOYMTAJP1RK,2015-05-18 13:00:00+02:00,['Cheryl Feaster' 'Amanda Knowles' 'Ginger Hoagland'],697,Lunch
FESGKOQN2OZZWXY3,10-01-2016,US0NQYNNHS1SQJ4S,36MFTZOYMTAJP1RK,2016-01-14 22:00:00+01:00,['Glenn Gould' 'Amanda Knowles' 'Ginger Hoagland' 'Michael White'],451,Dinner
YITOTLOF0MWZ0VYX,03-10-2016,RGYX8772307H78ON,36MFTZOYMTAJP1RK,2016-10-01 22:00:00+02:00,['Ginger Hoagland' 'Amanda Knowles' 'Michael White'],263,Dinner
8RIGCF74GUEQHQEE,23-07-2018,5XK0KTFTD6OAP9ZP,36MFTZOYMTAJP1RK,2018-07-27 08:00:00+02:00,['Amanda Knowles'],210,Breakfast
TH60C9D8TPYS7DGG,15-12-2016,KDSMP2VJ22HNEPYF,36MFTZOYMTAJP1RK,2016-12-13 08:00:00+01:00,['Cheryl Feaster' 'Bret Adams' 'Ginger Hoagland'],755,Breakfast
W1Y086SRAVUZU1AL,17-09-2017,8IUOYVS031QPROUG,36MFTZOYMTAJP1RK,2017-09-14 13:00:00+02:00,['Bret Adams'],469,Lunch
WKB58Q8BHLOFQAB5,31-08-2016,E2K2TQUMENXSI9RP,36MFTZOYMTAJP1RK,2016-09-03 14:00:00+02:00,['Michael White' 'Ginger Hoagland' 'Bret Adams'],502,Lunch
N8DOG58MW238BHA9,25-12-2018,KFR2TAYXZSVCHAA2,36MFTZOYMTAJP1RK,2018-12-20 12:00:00+01:00,['Ginger Hoagland' 'Cheryl Feaster' 'Glenn Gould' 'Bret Adams'],829,Lunch
DPDV9UGF0SUCYTGW,25-05-2017,6YV61SH7W9ECUZP0,36MFTZOYMTAJP1RK,2017-05-24 22:00:00+02:00,['Michael White'],708,Dinner
KNF3E3QTOQ22J269,20-06-2018,737T2U7604ABDFDF,36MFTZOYMTAJP1RK,2018-06-15 07:00:00+02:00,['Glenn Gould' 'Cheryl Feaster' 'Ginger Hoagland' 'Amanda Knowles'],475,Breakfast
LEED1HY47M8BR5VL,22-10-2017,I22P10IQQD06MO45,36MFTZOYMTAJP1RK,2017-10-22 14:00:00+02:00,['Glenn Gould'],27,Lunch
LSJPNJQLDTIRNWAL,27-01-2017,247IIVNN6CXGWINB,36MFTZOYMTAJP1RK,2017-01-23 13:00:00+01:00,['Amanda Knowles' 'Bret Adams'],672,Lunch
6UX5RMHJ1GK1F9YQ,24-08-2014,LL4AOPXDM8V5KP5S,H3JRC7XX7WJAD4ZO,2014-08-27 12:00:00+02:00,['Anthony Emerson' 'Irvin Gentry' 'Melba Inlow'],552,Lunch
5SYB15QEFWD1E4Q4,09-07-2017,KZI0VRU30GLSDYHA,H3JRC7XX7WJAD4ZO,2017-07-13 08:00:00+02:00,"['Anthony Emerson' 'Emma Steitz' 'Melba Inlow' 'Irvin Gentry'
'Kelly Killebrew']",191,Breakfast
W5S8VZ61WJONS4EE,25-03-2017,XPSPBQF1YLIG26N1,H3JRC7XX7WJAD4ZO,2017-03-25 07:00:00+01:00,['Irvin Gentry' 'Kelly Killebrew'],471,Breakfast
795SVIJKO8KS3ZEL,05-01-2015,HHTLB8M9U0TGC7Z4,H3JRC7XX7WJAD4ZO,2015-01-06 22:00:00+01:00,['Emma Steitz'],588,Dinner
8070KEFYSSPWPCD0,05-08-2014,VZ2OL0LREO8V9RKF,H3JRC7XX7WJAD4ZO,2014-08-09 12:00:00+02:00,['Lewis Eyre'],98,Lunch
RUQOHROBGBOSNUO4,10-06-2016,R3LFUK1WFDODC1YF,H3JRC7XX7WJAD4ZO,2016-06-09 08:00:00+02:00,['Anthony Emerson' 'Kelly Killebrew' 'Lewis Eyre'],516,Breakfast
6P91QRADC2O9WOVT,25-09-2016,L2F2HEGB6Q141080,H3JRC7XX7WJAD4ZO,2016-09-26 07:00:00+02:00,"['Kelly Killebrew' 'Lewis Eyre' 'Irvin Gentry' 'Emma Steitz'
'Anthony Emerson']",664,Breakfast
Code:
import re
import pandas as pd
# Function to convert string ['name' 'name2'] to list ['name', 'name2']
# Returns a list of participant names
def string_to_list(participant_string):
    return re.findall(r"'(.*?)'", participant_string)
invoice_df["Participants"] = invoice_df["Participants"].apply(string_to_list)
# Obtain an array of all unique customer names
customers = invoice_df["Participants"].explode().unique()
# Create new customer dataframe
customers_df = pd.DataFrame(customers, columns = ["CustomerName"])
# Add customer id
customers_df["customer_id"] = customers_df.index + 1
# Create a first_name and last_name column
customers_df["first_name"] = customers_df["CustomerName"].apply(lambda x: x.split(" ")[0])
# Slice the list from index 1 onward in the event the person has multiple last names
customers_df["last_name"] = customers_df["CustomerName"].apply(lambda x: " ".join(x.split(" ")[1:]))
Solution
# Find all the occurrences of customer names
# then explode to convert values in lists to rows
cust = invoice_df['Participants'].str.findall(r"'(.*?)'").explode()
# Join with orderid
customers_df = invoice_df[['Order Id']].join(cust)
# factorize to encode the unique values in participants
customers_df['Customer Id'] = customers_df['Participants'].factorize()[0] + 1
Result
Order Id Participants Customer Id
0 839FKFW2LLX4LMBB David Bishop 1
1 97OX39BGVMHODLJM David Bishop 1
2 041ORQM5OIHTIU6L Karen Stansell 2
3 YT796QI18WNGZ7ZJ Addie Patino 3
4 6YLROQT27B6HRF4E Addie Patino 3
4 6YLROQT27B6HRF4E Susan Guerrero 4
5 AT0R4DFYYAFOC88Q David Bishop 1
5 AT0R4DFYYAFOC88Q Susan Guerrero 4
5 AT0R4DFYYAFOC88Q Karen Stansell 2
6 2DDN2LHS7G85GKPQ Susan Guerrero 4
6 2DDN2LHS7G85GKPQ David Bishop 1
7 FM608JK1N01BPUQN Amanda Knowles 5
7 FM608JK1N01BPUQN Cheryl Feaster 6
7 FM608JK1N01BPUQN Ginger Hoagland 7
7 FM608JK1N01BPUQN Michael White 8
8 CK331XXNIBQT81QL Cheryl Feaster 6
8 CK331XXNIBQT81QL Amanda Knowles 5
8 CK331XXNIBQT81QL Ginger Hoagland 7
9 FESGKOQN2OZZWXY3 Glenn Gould 9
9 FESGKOQN2OZZWXY3 Amanda Knowles 5
9 FESGKOQN2OZZWXY3 Ginger Hoagland 7
9 FESGKOQN2OZZWXY3 Michael White 8
10 YITOTLOF0MWZ0VYX Ginger Hoagland 7
10 YITOTLOF0MWZ0VYX Amanda Knowles 5
10 YITOTLOF0MWZ0VYX Michael White 8
11 8RIGCF74GUEQHQEE Amanda Knowles 5
12 TH60C9D8TPYS7DGG Cheryl Feaster 6
12 TH60C9D8TPYS7DGG Bret Adams 10
12 TH60C9D8TPYS7DGG Ginger Hoagland 7
13 W1Y086SRAVUZU1AL Bret Adams 10
14 WKB58Q8BHLOFQAB5 Michael White 8
14 WKB58Q8BHLOFQAB5 Ginger Hoagland 7
14 WKB58Q8BHLOFQAB5 Bret Adams 10
15 N8DOG58MW238BHA9 Ginger Hoagland 7
15 N8DOG58MW238BHA9 Cheryl Feaster 6
15 N8DOG58MW238BHA9 Glenn Gould 9
15 N8DOG58MW238BHA9 Bret Adams 10
16 DPDV9UGF0SUCYTGW Michael White 8
17 KNF3E3QTOQ22J269 Glenn Gould 9
17 KNF3E3QTOQ22J269 Cheryl Feaster 6
17 KNF3E3QTOQ22J269 Ginger Hoagland 7
17 KNF3E3QTOQ22J269 Amanda Knowles 5
18 LEED1HY47M8BR5VL Glenn Gould 9
19 LSJPNJQLDTIRNWAL Amanda Knowles 5
19 LSJPNJQLDTIRNWAL Bret Adams 10
20 6UX5RMHJ1GK1F9YQ Anthony Emerson 11
20 6UX5RMHJ1GK1F9YQ Irvin Gentry 12
20 6UX5RMHJ1GK1F9YQ Melba Inlow 13
21 5SYB15QEFWD1E4Q4 Anthony Emerson 11
21 5SYB15QEFWD1E4Q4 Emma Steitz 14
21 5SYB15QEFWD1E4Q4 Melba Inlow 13
21 5SYB15QEFWD1E4Q4 Irvin Gentry 12
21 5SYB15QEFWD1E4Q4 Kelly Killebrew 15
22 W5S8VZ61WJONS4EE Irvin Gentry 12
22 W5S8VZ61WJONS4EE Kelly Killebrew 15
23 795SVIJKO8KS3ZEL Emma Steitz 14
24 8070KEFYSSPWPCD0 Lewis Eyre 16
25 RUQOHROBGBOSNUO4 Anthony Emerson 11
25 RUQOHROBGBOSNUO4 Kelly Killebrew 15
25 RUQOHROBGBOSNUO4 Lewis Eyre 16
26 6P91QRADC2O9WOVT Kelly Killebrew 15
26 6P91QRADC2O9WOVT Lewis Eyre 16
26 6P91QRADC2O9WOVT Irvin Gentry 12
26 6P91QRADC2O9WOVT Emma Steitz 14
26 6P91QRADC2O9WOVT Anthony Emerson 11
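The same steps can be tried on a small self-contained sample. The sketch below is only an illustration: it copies four rows from the data above, assumes Participants still holds the raw strings (if the column has already been converted to lists, `.explode()` can be called on it directly), and the names `orders_df` and `customers_df` are chosen here rather than taken from the answer. It also keeps the unique names returned by `pd.factorize` so that a separate customer lookup table can be built with matching IDs.

```python
import pandas as pd

# Four sample rows copied from invoice_df; Participants is still the raw string.
invoice_df = pd.DataFrame({
    "Order Id": ["839FKFW2LLX4LMBB", "041ORQM5OIHTIU6L",
                 "6YLROQT27B6HRF4E", "AT0R4DFYYAFOC88Q"],
    "Participants": ["['David Bishop']", "['Karen Stansell']",
                     "['Addie Patino' 'Susan Guerrero']",
                     "['David Bishop' 'Susan Guerrero' 'Karen Stansell']"],
})

# Extract the quoted names, then turn each list element into its own row.
cust = invoice_df["Participants"].str.findall(r"'(.*?)'").explode()

# Joining on the (repeated) row index pairs every name with its order.
orders_df = invoice_df[["Order Id"]].join(cust)

# factorize returns integer codes in order of first appearance plus the unique values.
codes, names = pd.factorize(orders_df["Participants"])
orders_df["Customer Id"] = codes + 1

# Lookup table: one row per customer, IDs consistent with orders_df.
customers_df = pd.DataFrame({
    "Customer Id": range(1, len(names) + 1),
    "CustomerName": names,
})

print(orders_df)
print(customers_df)
```

Because the codes follow order of first appearance, the IDs match the "row index + 1" numbering from the question as long as the customer table is built from the same exploded column.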
I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00   54.00   26.00
67.00   77.00   23.00
        20.00
        23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep=r"\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After that you could replace the NaNs if you like, but whether you should depends on whether you need to distinguish between "the value is known to be 0" and "the value is unknown":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
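To make the NaN-versus-0 distinction concrete, here is a minimal sketch that rebuilds the frame from the question inline (instead of reading money.txt) and compares per-year medians with and without `fillna(0)`. The variable names `wide`, `median_nan`, and `median_zero` are illustrative only. pandas reductions such as `median` skip NaN by default, so the padded cells do not affect the statistics unless they are filled.

```python
import pandas as pd

# Data from the question, built inline so the example is self-contained.
df = pd.DataFrame({
    "Name": ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year": [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# Step 1: number the rows within each year (0, 1, 2, ...).
df["yindex"] = df.groupby("Year").cumcount()

# Step 2: one column per year; shorter years are padded with NaN.
wide = df.pivot(index="yindex", columns="Year", values="Money")

# NaN is skipped by default, so each median uses only the real awards.
median_nan = wide.median()

# Filling with 0 adds two fake zero awards to 2010 and 2012.
median_zero = wide.fillna(0).median()

print(pd.DataFrame({"NaN kept": median_nan, "filled with 0": median_zero}))
```

For 2010 the median drops from 24.5 to 11.5 once the zeros are counted, which suggests keeping the NaNs when the frame feeds a summary such as a boxplot.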