Python - Link 2 columns

I want to create a data frame that links two columns together (each customer ID to every order ID that customer placed). The row index + 1 corresponds to the customer ID. Is there a way to do this through mapping?
Data: invoice_df
Order Id,Date,Meal Id,Company Id,Date of Meal,Participants,Meal Price,Type of Meal
839FKFW2LLX4LMBB,27-05-2016,INBUX904GIHI8YBD,LJKS5NK6788CYMUU,2016-05-31 07:00:00+02:00,['David Bishop'],469,Breakfast
97OX39BGVMHODLJM,27-09-2018,J0MMOOPP709DIDIE,LJKS5NK6788CYMUU,2018-10-01 20:00:00+02:00,['David Bishop'],22,Dinner
041ORQM5OIHTIU6L,24-08-2014,E4UJLQNCI16UX5CS,LJKS5NK6788CYMUU,2014-08-23 14:00:00+02:00,['Karen Stansell'],314,Lunch
YT796QI18WNGZ7ZJ,12-04-2014,C9SDFHF7553BE247,LJKS5NK6788CYMUU,2014-04-07 21:00:00+02:00,['Addie Patino'],438,Dinner
6YLROQT27B6HRF4E,28-07-2015,48EQXS6IHYNZDDZ5,LJKS5NK6788CYMUU,2015-07-27 14:00:00+02:00,['Addie Patino' 'Susan Guerrero'],690,Lunch
AT0R4DFYYAFOC88Q,21-07-2014,W48JPR1UYWJ18NC6,LJKS5NK6788CYMUU,2014-07-17 20:00:00+02:00,['David Bishop' 'Susan Guerrero' 'Karen Stansell'],181,Dinner
2DDN2LHS7G85GKPQ,29-04-2014,1MKLAKBOE3SP7YUL,LJKS5NK6788CYMUU,2014-04-30 21:00:00+02:00,['Susan Guerrero' 'David Bishop'],14,Dinner
FM608JK1N01BPUQN,08-05-2014,E8WJZ1FOSKZD2MJN,36MFTZOYMTAJP1RK,2014-05-07 09:00:00+02:00,['Amanda Knowles' 'Cheryl Feaster' 'Ginger Hoagland' 'Michael White'],320,Breakfast
CK331XXNIBQT81QL,23-05-2015,CTZSFFKQTY7SBZ4J,36MFTZOYMTAJP1RK,2015-05-18 13:00:00+02:00,['Cheryl Feaster' 'Amanda Knowles' 'Ginger Hoagland'],697,Lunch
FESGKOQN2OZZWXY3,10-01-2016,US0NQYNNHS1SQJ4S,36MFTZOYMTAJP1RK,2016-01-14 22:00:00+01:00,['Glenn Gould' 'Amanda Knowles' 'Ginger Hoagland' 'Michael White'],451,Dinner
YITOTLOF0MWZ0VYX,03-10-2016,RGYX8772307H78ON,36MFTZOYMTAJP1RK,2016-10-01 22:00:00+02:00,['Ginger Hoagland' 'Amanda Knowles' 'Michael White'],263,Dinner
8RIGCF74GUEQHQEE,23-07-2018,5XK0KTFTD6OAP9ZP,36MFTZOYMTAJP1RK,2018-07-27 08:00:00+02:00,['Amanda Knowles'],210,Breakfast
TH60C9D8TPYS7DGG,15-12-2016,KDSMP2VJ22HNEPYF,36MFTZOYMTAJP1RK,2016-12-13 08:00:00+01:00,['Cheryl Feaster' 'Bret Adams' 'Ginger Hoagland'],755,Breakfast
W1Y086SRAVUZU1AL,17-09-2017,8IUOYVS031QPROUG,36MFTZOYMTAJP1RK,2017-09-14 13:00:00+02:00,['Bret Adams'],469,Lunch
WKB58Q8BHLOFQAB5,31-08-2016,E2K2TQUMENXSI9RP,36MFTZOYMTAJP1RK,2016-09-03 14:00:00+02:00,['Michael White' 'Ginger Hoagland' 'Bret Adams'],502,Lunch
N8DOG58MW238BHA9,25-12-2018,KFR2TAYXZSVCHAA2,36MFTZOYMTAJP1RK,2018-12-20 12:00:00+01:00,['Ginger Hoagland' 'Cheryl Feaster' 'Glenn Gould' 'Bret Adams'],829,Lunch
DPDV9UGF0SUCYTGW,25-05-2017,6YV61SH7W9ECUZP0,36MFTZOYMTAJP1RK,2017-05-24 22:00:00+02:00,['Michael White'],708,Dinner
KNF3E3QTOQ22J269,20-06-2018,737T2U7604ABDFDF,36MFTZOYMTAJP1RK,2018-06-15 07:00:00+02:00,['Glenn Gould' 'Cheryl Feaster' 'Ginger Hoagland' 'Amanda Knowles'],475,Breakfast
LEED1HY47M8BR5VL,22-10-2017,I22P10IQQD06MO45,36MFTZOYMTAJP1RK,2017-10-22 14:00:00+02:00,['Glenn Gould'],27,Lunch
LSJPNJQLDTIRNWAL,27-01-2017,247IIVNN6CXGWINB,36MFTZOYMTAJP1RK,2017-01-23 13:00:00+01:00,['Amanda Knowles' 'Bret Adams'],672,Lunch
6UX5RMHJ1GK1F9YQ,24-08-2014,LL4AOPXDM8V5KP5S,H3JRC7XX7WJAD4ZO,2014-08-27 12:00:00+02:00,['Anthony Emerson' 'Irvin Gentry' 'Melba Inlow'],552,Lunch
5SYB15QEFWD1E4Q4,09-07-2017,KZI0VRU30GLSDYHA,H3JRC7XX7WJAD4ZO,2017-07-13 08:00:00+02:00,"['Anthony Emerson' 'Emma Steitz' 'Melba Inlow' 'Irvin Gentry'
'Kelly Killebrew']",191,Breakfast
W5S8VZ61WJONS4EE,25-03-2017,XPSPBQF1YLIG26N1,H3JRC7XX7WJAD4ZO,2017-03-25 07:00:00+01:00,['Irvin Gentry' 'Kelly Killebrew'],471,Breakfast
795SVIJKO8KS3ZEL,05-01-2015,HHTLB8M9U0TGC7Z4,H3JRC7XX7WJAD4ZO,2015-01-06 22:00:00+01:00,['Emma Steitz'],588,Dinner
8070KEFYSSPWPCD0,05-08-2014,VZ2OL0LREO8V9RKF,H3JRC7XX7WJAD4ZO,2014-08-09 12:00:00+02:00,['Lewis Eyre'],98,Lunch
RUQOHROBGBOSNUO4,10-06-2016,R3LFUK1WFDODC1YF,H3JRC7XX7WJAD4ZO,2016-06-09 08:00:00+02:00,['Anthony Emerson' 'Kelly Killebrew' 'Lewis Eyre'],516,Breakfast
6P91QRADC2O9WOVT,25-09-2016,L2F2HEGB6Q141080,H3JRC7XX7WJAD4ZO,2016-09-26 07:00:00+02:00,"['Kelly Killebrew' 'Lewis Eyre' 'Irvin Gentry' 'Emma Steitz'
'Anthony Emerson']",664,Breakfast
Code:
import re
import pandas as pd

# Function to convert string "['name' 'name2']" to list ['name', 'name2']
# Returns a list of participant names
def string_to_list(participant_string):
    return re.findall(r"'(.*?)'", participant_string)
invoice_df["Participants"] = invoice_df["Participants"].apply(string_to_list)
# Obtain an array of all unique customer names
customers = invoice_df["Participants"].explode().unique()
# Create new customer dataframe
customers_df = pd.DataFrame(customers, columns = ["CustomerName"])
# Add customer id
customers_df["customer_id"] = customers_df.index + 1
# Create a first_name and last_name column
customers_df["first_name"] = customers_df["CustomerName"].apply(lambda x: x.split(" "[0])
# Slice the list from index 1 onward in case the person has multiple last names
customers_df["last_name"] = customers_df["CustomerName"].apply(lambda x: " ".join(x.split(" ")[1:]))

Solution
# Find all the occurrences of customer names
# then explode to convert values in lists to rows
cust = invoice_df['Participants'].str.findall(r"'(.*?)'").explode()
# Join with orderid
customers_df = invoice_df[['Order Id']].join(cust)
# factorize to encode the unique values in participants
customers_df['Customer Id'] = customers_df['Participants'].factorize()[0] + 1
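For context, factorize assigns each distinct value an integer code in order of first appearance, which is why the Customer Ids below follow the order in which the names first occur; a small illustrative example:
codes, uniques = pd.factorize(pd.Series(['David Bishop', 'Karen Stansell', 'David Bishop']))
# codes   -> array([0, 1, 0])
# uniques -> Index(['David Bishop', 'Karen Stansell'], dtype='object')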
Result
Order Id Participants Customer Id
0 839FKFW2LLX4LMBB David Bishop 1
1 97OX39BGVMHODLJM David Bishop 1
2 041ORQM5OIHTIU6L Karen Stansell 2
3 YT796QI18WNGZ7ZJ Addie Patino 3
4 6YLROQT27B6HRF4E Addie Patino 3
4 6YLROQT27B6HRF4E Susan Guerrero 4
5 AT0R4DFYYAFOC88Q David Bishop 1
5 AT0R4DFYYAFOC88Q Susan Guerrero 4
5 AT0R4DFYYAFOC88Q Karen Stansell 2
6 2DDN2LHS7G85GKPQ Susan Guerrero 4
6 2DDN2LHS7G85GKPQ David Bishop 1
7 FM608JK1N01BPUQN Amanda Knowles 5
7 FM608JK1N01BPUQN Cheryl Feaster 6
7 FM608JK1N01BPUQN Ginger Hoagland 7
7 FM608JK1N01BPUQN Michael White 8
8 CK331XXNIBQT81QL Cheryl Feaster 6
8 CK331XXNIBQT81QL Amanda Knowles 5
8 CK331XXNIBQT81QL Ginger Hoagland 7
9 FESGKOQN2OZZWXY3 Glenn Gould 9
9 FESGKOQN2OZZWXY3 Amanda Knowles 5
9 FESGKOQN2OZZWXY3 Ginger Hoagland 7
9 FESGKOQN2OZZWXY3 Michael White 8
10 YITOTLOF0MWZ0VYX Ginger Hoagland 7
10 YITOTLOF0MWZ0VYX Amanda Knowles 5
10 YITOTLOF0MWZ0VYX Michael White 8
11 8RIGCF74GUEQHQEE Amanda Knowles 5
12 TH60C9D8TPYS7DGG Cheryl Feaster 6
12 TH60C9D8TPYS7DGG Bret Adams 10
12 TH60C9D8TPYS7DGG Ginger Hoagland 7
13 W1Y086SRAVUZU1AL Bret Adams 10
14 WKB58Q8BHLOFQAB5 Michael White 8
14 WKB58Q8BHLOFQAB5 Ginger Hoagland 7
14 WKB58Q8BHLOFQAB5 Bret Adams 10
15 N8DOG58MW238BHA9 Ginger Hoagland 7
15 N8DOG58MW238BHA9 Cheryl Feaster 6
15 N8DOG58MW238BHA9 Glenn Gould 9
15 N8DOG58MW238BHA9 Bret Adams 10
16 DPDV9UGF0SUCYTGW Michael White 8
17 KNF3E3QTOQ22J269 Glenn Gould 9
17 KNF3E3QTOQ22J269 Cheryl Feaster 6
17 KNF3E3QTOQ22J269 Ginger Hoagland 7
17 KNF3E3QTOQ22J269 Amanda Knowles 5
18 LEED1HY47M8BR5VL Glenn Gould 9
19 LSJPNJQLDTIRNWAL Amanda Knowles 5
19 LSJPNJQLDTIRNWAL Bret Adams 10
20 6UX5RMHJ1GK1F9YQ Anthony Emerson 11
20 6UX5RMHJ1GK1F9YQ Irvin Gentry 12
20 6UX5RMHJ1GK1F9YQ Melba Inlow 13
21 5SYB15QEFWD1E4Q4 Anthony Emerson 11
21 5SYB15QEFWD1E4Q4 Emma Steitz 14
21 5SYB15QEFWD1E4Q4 Melba Inlow 13
21 5SYB15QEFWD1E4Q4 Irvin Gentry 12
21 5SYB15QEFWD1E4Q4 Kelly Killebrew 15
22 W5S8VZ61WJONS4EE Irvin Gentry 12
22 W5S8VZ61WJONS4EE Kelly Killebrew 15
23 795SVIJKO8KS3ZEL Emma Steitz 14
24 8070KEFYSSPWPCD0 Lewis Eyre 16
25 RUQOHROBGBOSNUO4 Anthony Emerson 11
25 RUQOHROBGBOSNUO4 Kelly Killebrew 15
25 RUQOHROBGBOSNUO4 Lewis Eyre 16
26 6P91QRADC2O9WOVT Kelly Killebrew 15
26 6P91QRADC2O9WOVT Lewis Eyre 16
26 6P91QRADC2O9WOVT Irvin Gentry 12
26 6P91QRADC2O9WOVT Emma Steitz 14
26 6P91QRADC2O9WOVT Anthony Emerson 11
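If you prefer the mapping approach mentioned in the question, an equivalent sketch is to build a name-to-id dictionary from the customer table created in the question's code (customers_df, with Participants already converted to lists) and map it over the exploded names; order_customer_df is a hypothetical name:
name_to_id = dict(zip(customers_df["CustomerName"], customers_df["customer_id"]))

order_customer_df = (
    invoice_df[["Order Id", "Participants"]]
    .explode("Participants")                      # one row per (order, participant)
    .rename(columns={"Participants": "CustomerName"})
)
order_customer_df["Customer Id"] = order_customer_df["CustomerName"].map(name_to_id)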

Related

Merge two related dataframes into one

How can I create a new DataFrame such that each teacher contains a list of students?
Teacher df
name married school
0 Pep Guardiola True Manchester High School
1 Jurgen Klopp True Liverpool High School
2 Mikel Arteta False Arsenal High
3 Zinadine Zidane True NaN
Student df
teacher name age height weight
0 Mikel Arteta Bukayo Saka 21 2.1m 80kg
1 Mikel Arteta Gabriel Martinelli 21 2.1m 75kg
2 Pep Guardiola Jack Grealish 27 2.1m 80kg
3 Jurgen Klopp Roberto Firmino 31 2.1m 65kg
4 Jurgen Klopp Andrew Robertson 28 2.1m 70kg
5 Jurgen Klopp Darwin Nunez 23 2.1m 75kg
6 Pep Guardiola Ederson Moraes 29 2.1m 90kg
7 Pep Guardiola Manuel Akanji 27 2.1m 80kg
8 Mikel Arteta Thomas Partey 29 2.1m 80kg
If you need a new column filled with a list of students, use Series.map with an aggregated list:
df1['students'] = df1['name'].map(df2.groupby('teacher')['name'].agg(list))
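A minimal runnable sketch of this approach, using cut-down versions of the frames above (df1 holds teachers, df2 holds students):
import pandas as pd

df1 = pd.DataFrame({'name': ['Pep Guardiola', 'Mikel Arteta']})
df2 = pd.DataFrame({'teacher': ['Mikel Arteta', 'Pep Guardiola', 'Mikel Arteta'],
                    'name': ['Bukayo Saka', 'Jack Grealish', 'Thomas Partey']})

# groupby produces a Series of lists indexed by teacher; map looks each teacher up
df1['students'] = df1['name'].map(df2.groupby('teacher')['name'].agg(list))
# df1['students'] -> [['Jack Grealish'], ['Bukayo Saka', 'Thomas Partey']]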
You can consider using:
df.merge(df.groupby('teacher', as_index=False).agg({'name': list}),
         how='left',
         on='teacher',
         suffixes=('', '_list'))
Returning:
teacher name age height weight name_list
0 Mikel Arteta Saka 21 2.1m 80kg [Saka, Martinelli, Partey]
1 Mikel Arteta Martinelli 21 2.1m 75kg [Saka, Martinelli, Partey]
2 Pep Guardiola Grealish 27 2.1m 80kg [Grealish, Moraes, Akanji]
3 Jurgen Klopp Firmino 31 2.1m 65kg [Firmino, Robertson, Nunez]
4 Jurgen Klopp Robertson 28 2.1m 70kg [Firmino, Robertson, Nunez]
5 Jurgen Klopp Nunez 23 2.1m 75kg [Firmino, Robertson, Nunez]
6 Pep Guardiola Moraes 29 2.1m 90kg [Grealish, Moraes, Akanji]
7 Pep Guardiola Akanji 27 2.1m 80kg [Grealish, Moraes, Akanji]
8 Mikel Arteta Partey 29 2.1m 80kg [Saka, Martinelli, Partey]

How to flatten a dataframe by a column containing ranges

Input dataframe:
df = pd.DataFrame(columns=['id', 'antibiotic', 'start_date', 'end_date'],
data=[['Sophie', 'amoxicillin', 15, 17],
['Sophie', 'doxycycline', 19, 21],
['Sophie', 'amoxicillin', 20, 22],
['Robert', 'cephalexin', 12, 14],
['Robert', 'ciprofloxacin', 17, 18],
['Robert', 'clindamycin', 18, 18],
['Robert', 'cephalexin', 17, 19]
])
I would like to flatten out/expand the dates, and also join ('/') the antibiotic fields when they coincide on the same date, like below:
df_flat = pd.DataFrame(columns=['id', 'date', 'antibiotic'],
data=[['Sophie', 15, 'amoxicillin'],
['Sophie', 16, 'amoxicillin'],
['Sophie', 17, 'amoxicillin'],
['Sophie', 18, NaN],
['Sophie', 19, 'doxycycline'],
['Sophie', 20, 'doxycycline/amoxicillin'],
['Sophie', 21, 'doxycycline/amoxicillin'],
['Sophie', 22, 'amoxicillin'],
['Robert', 12, 'cephalexin'],
['Robert', 13, 'cephalexin'],
['Robert', 14, 'cephalexin'],
['Robert', 15, NaN],
['Robert', 16, NaN],
['Robert', 17, 'ciprofloxacin/cephalexin'],
['Robert', 18, 'ciprofloxacin/clindamycin/cephalexin'],
['Robert', 19, 'cephalexin']
])
What I'm trying...
# get min/max of dates
minmax = df.groupby('id').agg({'start_date':min,'end_date':max})
#create multiindex manually with mins and maxes
multi_index = []
for i, row in minmax.iterrows():
    for d in range(row.start_date, row.end_date):
        tup = (i, d)
        multi_index.append(tup)
# create output dataframe with this multiindex
df_flat = pd.DataFrame(index=pd.MultiIndex.from_tuples(multi_index),
                       columns=['date', 'antibiotics'])
# And then fill up this df_flat using the original dataframe by matching the index of df_flat
# with the date values of original df, in a for loop :)
for i, row in df.iterrows():
    for tup in multi_index:
        if tup[1] >= row.start_date & ...
.
.
.
but this seems inefficient and inelegant. I'm sure something smarter can be done.
One option is to generate a range for each row, explode to create one row per date, then aggregate per id/date:
(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
.explode('date')
.groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
.reset_index()
)
output:
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 17 ciprofloxacin/cephalexin
4 Robert 18 ciprofloxacin/clindamycin/cephalexin
5 Robert 19 cephalexin
6 Sophie 15 amoxicillin
7 Sophie 16 amoxicillin
8 Sophie 17 amoxicillin
9 Sophie 19 doxycycline
10 Sophie 20 doxycycline/amoxicillin
11 Sophie 21 doxycycline/amoxicillin
12 Sophie 22 amoxicillin
keeping the NaNs:
(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
.explode('date')
.groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
.reset_index(level=0)
.groupby('id')['antibiotic']
.apply(lambda g: g.reindex(range(g.index.min(), g.index.max()+1)))
.reset_index()
)
output:
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 15 NaN
4 Robert 16 NaN
5 Robert 17 ciprofloxacin/cephalexin
6 Robert 18 ciprofloxacin/clindamycin/cephalexin
7 Robert 19 cephalexin
8 Sophie 15 amoxicillin
9 Sophie 16 amoxicillin
10 Sophie 17 amoxicillin
11 Sophie 18 NaN
12 Sophie 19 doxycycline
13 Sophie 20 doxycycline/amoxicillin
14 Sophie 21 doxycycline/amoxicillin
15 Sophie 22 amoxicillin
alternative to have all days for all patients:
(df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date']+1), axis=1))
.explode('date')
.groupby(['id', 'date'], dropna=False)['antibiotic'].agg('/'.join)
.unstack('date')
.stack('date', dropna=False).rename('antibiotic')
.reset_index()
)
output:
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 15 NaN
4 Robert 16 NaN
5 Robert 17 ciprofloxacin/cephalexin
6 Robert 18 ciprofloxacin/clindamycin/cephalexin
7 Robert 19 cephalexin
8 Robert 20 NaN
9 Robert 21 NaN
10 Robert 22 NaN
11 Sophie 12 NaN
12 Sophie 13 NaN
13 Sophie 14 NaN
14 Sophie 15 amoxicillin
15 Sophie 16 amoxicillin
16 Sophie 17 amoxicillin
17 Sophie 18 NaN
18 Sophie 19 doxycycline
19 Sophie 20 doxycycline/amoxicillin
20 Sophie 21 doxycycline/amoxicillin
21 Sophie 22 amoxicillin
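A hypothetical equivalent of the unstack/stack trick, assuming the same df as above: build the full (id, date) grid with MultiIndex.from_product and reindex the aggregated Series against it.
s = (df.assign(date=lambda d: d.apply(lambda r: range(r['start_date'], r['end_date'] + 1), axis=1))
       .explode('date')
       .astype({'date': int})                     # explode leaves object dtype; cast so reindex aligns
       .groupby(['id', 'date'])['antibiotic'].agg('/'.join))

full_index = pd.MultiIndex.from_product(
    [s.index.levels[0], range(df['start_date'].min(), df['end_date'].max() + 1)],
    names=['id', 'date'])
df_flat = s.reindex(full_index).reset_index()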
First repeat each row by the number of days it spans (end_date minus start_date plus one) with Index.repeat, then add a per-group counter to the start_date column and aggregate with join, and last add the missing ranges:
df = df.loc[df.index.repeat(df['end_date'].sub(df['start_date']).add(1))].copy()
df['date'] = df['start_date'].add(df.groupby(level=0).cumcount())
df = (df.groupby(['id','date'], sort=False)['antibiotic'].agg('/'.join)
.reset_index(level=0)
.groupby('id', sort=False)['antibiotic']
.apply(lambda x: x.reindex(range(x.index.min(), x.index.max()+1)))
.reset_index()
)
print (df)
id date antibiotic
0 Sophie 15 amoxicillin
1 Sophie 16 amoxicillin
2 Sophie 17 amoxicillin
3 Sophie 18 NaN
4 Sophie 19 doxycycline
5 Sophie 20 doxycycline/amoxicillin
6 Sophie 21 doxycycline/amoxicillin
7 Sophie 22 amoxicillin
8 Robert 12 cephalexin
9 Robert 13 cephalexin
10 Robert 14 cephalexin
11 Robert 15 NaN
12 Robert 16 NaN
13 Robert 17 ciprofloxacin/cephalexin
14 Robert 18 ciprofloxacin/clindamycin/cephalexin
15 Robert 19 cephalexin
You could also cross-merge against your own data, then subset back to the matching rows before grouping and joining.
df = df.merge(
pd.RangeIndex(df['start_date'].min(), df['end_date'].max() + 1).to_series().rename('date'),
how='cross')
df = df[df['date'] <= df['end_date']]
df = df[df['date'] >= df['start_date']]
# df.sort_values(by=['id', 'date'], inplace=True)
final = df.groupby(['id', 'date'])['antibiotic'] \
.agg('/'.join) \
.reset_index()
Producing:
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 17 ciprofloxacin/cephalexin
4 Robert 18 cephalexin/ciprofloxacin/clindamycin
5 Robert 19 cephalexin
6 Sophie 15 amoxicillin
7 Sophie 16 amoxicillin
8 Sophie 17 amoxicillin
9 Sophie 19 doxycycline
10 Sophie 20 doxycycline/amoxicillin
11 Sophie 21 doxycycline/amoxicillin
12 Sophie 22 amoxicillin
Use:
def comb(row):
    out = ''
    for part in row:
        out += part
        out += '/'
    return out[:-1]

df['temp'] = list(zip(df.start_date, df.end_date))
df.explode('temp').groupby(['id', 'temp'])['antibiotic'].apply(comb)
First you combine the two date columns into one tuple column. Then you explode on the new column. Then you group by id and date and aggregate using the custom function.
Result
id temp
Robert 12 cephalexin
14 cephalexin
17 ciprofloxacin/cephalexin
18 ciprofloxacin/clindamycin/clindamycin
19 cephalexin
Sophie 25 amoxicillin
27 amoxicillin
29 doxycycline
30 amoxicillin
31 doxycycline
32 amoxicillin
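Note that the custom comb function does the same thing as joining with a separator, so (assuming the same temp column) the aggregation could equally be written as:
df.explode('temp').groupby(['id', 'temp'])['antibiotic'].agg('/'.join)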
def function1(dd: pd.DataFrame):
    dd1 = dd.assign(date=pd.to_datetime(dd.date, format='%d')).set_index('date').asfreq(freq='d')
    dd1.index = dd1.index.strftime('%d')
    return dd1
df.apply(lambda ss:range(ss.start_date,ss.end_date+1),axis=1).explode()\
.to_frame('date').join(df).groupby(['id','date'])['antibiotic'].agg('/'.join)\
.reset_index(level=1).groupby(level=0).apply(function1)\
.reset_index()
id date antibiotic
0 Robert 12 cephalexin
1 Robert 13 cephalexin
2 Robert 14 cephalexin
3 Robert 15 NaN
4 Robert 16 NaN
5 Robert 17 ciprofloxacin/cephalexin
6 Robert 18 ciprofloxacin/clindamycin/cephalexin
7 Robert 19 cephalexin
8 Sophie 15 amoxicillin
9 Sophie 16 amoxicillin
10 Sophie 17 amoxicillin
11 Sophie 18 NaN
12 Sophie 19 doxycycline
13 Sophie 20 doxycycline/amoxicillin
14 Sophie 21 doxycycline/amoxicillin
15 Sophie 22 amoxicillin

Binning Categorical Columns Programmatically Using Python

I am trying to bin categorical columns programmatically - any ideas on how I can achieve this without manually hard-coding each value in that column?
Essentially, what I would like is a function that counts values up to 80% of the column (leaving those city names as they are) and replaces the remaining 20% of city names with the word 'Other'.
i.e. if the first 17 city names make up 80% of that column, keep those city names as is, else return 'Other'.
EG:
0 Brighton
1 Yokohama
2 Levin
3 Melbourne
4 Coffeyville
5 Whakatane
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Te Awamutu
11 Bishkek
12 Melbourne
13 Whanganui
14 Coffeyville
15 New York
16 Brisbane
17 Greymouth
18 Brisbane
19 Chuo City
20 Accra
21 Levin
22 Waiouru
23 Brisbane
24 New York
25 Chuo City
26 Lucerne
27 Whanganui
28 Los Angeles
29 Melbourne
df['city'].head(30).value_counts(ascending=False, normalize=True)*100
Melbourne 16.666667
Levin 10.000000
Brisbane 10.000000
Whanganui 6.666667
Coffeyville 6.666667
New York 6.666667
Chuo City 6.666667
Waiouru 3.333333
Greymouth 3.333333
Te Awamutu 3.333333
Bishkek 3.333333
Lucerne 3.333333
Ashburn 3.333333
Yokohama 3.333333
Whakatane 3.333333
Accra 3.333333
Brighton 3.333333
Los Angeles 3.333333
From Ashburn down, the names should be renamed to 'Other'.
I have tried the below which is a start, but not exactly what I want:
city_map = dict(df['city'].value_counts(ascending=False, normalize=True)*100)
df['city_count']= df['city'].map(city_map)
def count(df):
    if df["city_count"] > 10:
        return "High"
    elif df["city_count"] < 0:
        return "Medium"
    else:
        return "Low"
df.apply(count, axis=1)
I'm not expecting any code - just some guidance on where to start or ideas on how I can achieve this
We can group by city and get the size of each city. We divide those values by the length of the dataframe with len and calculate the cumulative sum. The last step is to check at which point we exceed the threshold, so we can broadcast the boolean series back to your dataframe with map.
import numpy as np

threshold = 0.7
m = df['city'].map(df.groupby('city')['city'].size().sort_values(ascending=False).div(len(df)).cumsum().le(threshold))
df['city'] = np.where(m, df['city'], 'Other')
city
0 Other
1 Other
2 Levin
3 Melbourne
4 Coffeyville
5 Other
6 Melbourne
7 Melbourne
8 Levin
9 Ashburn
10 Other
11 Bishkek
12 Melbourne
13 Other
14 Coffeyville
15 New York
16 Brisbane
17 Other
18 Brisbane
19 Chuo City
20 Other
21 Levin
22 Other
23 Brisbane
24 New York
25 Chuo City
26 Other
27 Other
28 Other
29 Melbourne
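To keep this programmatic for any column, the same logic can be wrapped in a small helper (bin_rare_categories is a hypothetical name, not from the answer above):
import numpy as np

def bin_rare_categories(s, threshold=0.8, other='Other'):
    share = s.value_counts(normalize=True)      # frequencies, sorted descending by default
    keep = share.cumsum().le(threshold)         # True for categories inside the threshold
    mask = s.map(keep).fillna(False)            # values missing from value_counts fall through to Other
    return np.where(mask, s, other)

# df['city'] = bin_rare_categories(df['city'], threshold=0.8)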
old method
If I understand you correctly, you want to calculate a cumulative sum with .cumsum and check when it exceeds your set threshold.
Then we use np.where to conditionally fill in the City name or Other.
threshold = 80
m = df['Normalized'].cumsum().le(threshold)
df['City'] = np.where(m, df['City'], 'Other')
City Normalized
0 Auckland 40.399513
1 Christchurch 13.130783
2 Wellington 12.267604
3 Hamilton 4.026242
4 Tauranga 3.867353
5 (not set) 3.540075
6 Dunedin 2.044508
7 Other 1.717975
8 Other 1.632849
9 Other 1.520342
10 Other 1.255651
11 Other 1.173878
12 Other 1.040508
13 Other 0.988166
14 Other 0.880502
15 Other 0.766877
16 Other 0.601468
17 Other 0.539067
18 Other 0.471824
19 Other 0.440903
20 Other 0.440344
21 Other 0.405884
22 Other 0.365836
23 Other 0.321131
24 Other 0.306602
25 Other 0.280524
26 Other 0.237123
27 Other 0.207878
28 Other 0.186084
29 Other 0.167085
30 Other 0.163732
31 Other 0.154977
Note: this method assumes that your Normalized column is sorted in descending order.
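If it isn't, you can sort it first, for example:
df = df.sort_values('Normalized', ascending=False).reset_index(drop=True)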

Create a dataframe including group by sums and total sum

I have the following data frame:
Race Course Horse Year Month Day Amount Won/Lost
0 Aintree Red Rum 2017 5 12 11.58 won
1 Punchestown Camelot 2016 12 22 122.52 won
2 Sandown Beef of Salmon 2016 11 17 20.00 lost
3 Ayr Corbiere 2016 11 3 25.00 lost
4 Fairyhouse Red Rum 2016 12 2 65.75 won
5 Ayr Camelot 2017 3 11 12.05 won
6 Aintree Hurricane Fly 2017 5 12 11.58 won
7 Punchestown Beef or Salmon 2016 12 22 112.52 won
8 Sandown Aldaniti 2016 11 17 10.00 lost
9 Ayr Henry the Navigator 2016 11 1 15.00 lost
10 Fairyhouse Jumanji 2016 10 2 65.75 won
11 Ayr Came Second 2017 3 11 12.05 won
12 Aintree Murder 2017 5 12 5.00 lost
13 Punchestown King Arthur 2016 6 22 52.52 won
14 Sandown Filet of Fish 2016 11 17 20.00 lost
15 Ayr Denial 2016 11 3 25.00 lost
16 Fairyhouse Don't Gamble 2016 12 12 165.75 won
17 Ayr Ireland 2017 1 11 22.05 won
I am trying to create another data frame which includes only the total number of races (rows) and the total number of races won. It would ideally look like the following:
total races 18
total won 11
However, all I have been able to do is group by counts, counting total won and total lost. This is what I have attempted:
df = df.groupby(['Won/Lost']).size().add_prefix('total ')
And this is what it returns:
Won/Lost
total lost 7
total won 11
dtype: int64
I am at a dead end and cannot figure out a simple solution.
Assuming content of races.csv is:
Race Course,Horse,Year,Month,Day,Amount,Won/Lost
Aintree,Red Rum,2017,5,12,11.58,won
Punchestown,Camelot,2016,12,22,122.52,won
Sandown,Beef of Salmon,2016,11,17,20.00,lost
Ayr,Corbiere,2016,11,3,25.00,lost
Fairyhouse,Red Rum,2016,12,2,65.75,won
Ayr,Camelot,2017,3,11,12.05,won
Aintree,Hurricane Fly,2017,5,12,11.58,won
Punchestown,Beef or Salmon,2016,12,22,112.52,won
Sandown,Aldaniti,2016,11,17,10.00,lost
Ayr,Henry the Navigator,2016,11,1,15.00,lost
Fairyhouse,Jumanji,2016,10,2,65.75,won
Ayr,Came Second,2017,3,11,12.05,won
Aintree,Murder,2017,5,12,5.00,lost
Punchestown,King Arthur,2016,6,22,52.52,won
Sandown,Filet of Fish,2016,11,17,20.00,lost
Ayr,Denial,2016,11,3,25.00,lost
Fairyhouse,Don't Gamble,2016,12,12,165.75,won
Ayr,Ireland,2017,1,11,22.05,won
Steps to get the new dataframe:
>>> races_df = pd.read_csv('races.csv')
>>> races_df
Race Course Horse Year Month Day Amount Won/Lost
0 Aintree Red Rum 2017 5 12 11.58 won
1 Punchestown Camelot 2016 12 22 122.52 won
2 Sandown Beef of Salmon 2016 11 17 20.00 lost
3 Ayr Corbiere 2016 11 3 25.00 lost
4 Fairyhouse Red Rum 2016 12 2 65.75 won
5 Ayr Camelot 2017 3 11 12.05 won
6 Aintree Hurricane Fly 2017 5 12 11.58 won
7 Punchestown Beef or Salmon 2016 12 22 112.52 won
8 Sandown Aldaniti 2016 11 17 10.00 lost
9 Ayr Henry the Navigator 2016 11 1 15.00 lost
10 Fairyhouse Jumanji 2016 10 2 65.75 won
11 Ayr Came Second 2017 3 11 12.05 won
12 Aintree Murder 2017 5 12 5.00 lost
13 Punchestown King Arthur 2016 6 22 52.52 won
14 Sandown Filet of Fish 2016 11 17 20.00 lost
15 Ayr Denial 2016 11 3 25.00 lost
16 Fairyhouse Don't Gamble 2016 12 12 165.75 won
17 Ayr Ireland 2017 1 11 22.05 won
>>>
>>> total_races = len(races_df)
>>>
>>> total_win = races_df[races_df['Won/Lost'] == 'won']['Won/Lost'].count()
>>>
>>> new_df = pd.DataFrame({'total_races': total_races, 'total_win': total_win}, index=pd.RangeIndex(1))
>>>
>>> new_df
total_races total_win
0 18 11
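As a shorter alternative for the win count, comparing the column to 'won' gives a boolean Series whose sum is the number of wins (a sketch in the same REPL style):
>>> total_win = (races_df['Won/Lost'] == 'won').sum()
>>> pd.DataFrame({'total_races': [len(races_df)], 'total_win': [total_win]})
   total_races  total_win
0           18         11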

How to narrow data in a list

I am having an issue with a Python script that attempts to get one of the 22 top trending topics from PyTrends (https://github.com/GeneralMills/pytrends/) out of the printed output. I am trying to create a random number from 1 to 22 and then use it to choose one of the 22 results printed on lines 176-198 in the Python shell.
import pytrends
import random
pytrend = TrendReq()
random = random.randint(1,22)
random = random + 99
itemList = list(pytrend.trending_searches())
Data = itemList.index(random) # This is one of the issue lines, as I cannot figure out how to index the output as needed.
Data = str(Data)
Data = Data[1:21] # An attempt at indexing output
print (Data)
This is my output on the Shell:
<bound method NDFrame.head of date exploreUrl \
0 20180504 /trends/explore?q=Free+Comic+Book+Day&date=now...
1 20180504 /trends/explore?q=Brad+Marchand&date=now+7-d&g...
2 20180504 /trends/explore?q=jrue+holiday&date=now+7-d&ge...
3 20180504 /trends/explore?q=Kentucky+Derby&date=now+7-d&...
4 20180504 /trends/explore?q=Cinco+de+Mayo&date=now+7-d&g...
5 20180504 /trends/explore?q=Warriors&date=now+7-d&geo=US
6 20180504 /trends/explore?q=Bruins&date=now+7-d&geo=US
7 20180504 /trends/explore?q=Rockets&date=now+7-d&geo=US
8 20180504 /trends/explore?q=Matt+Harvey&date=now+7-d&geo=US
9 20180504 /trends/explore?q=DJ+Khaled&date=now+7-d&geo=US
10 20180504 /trends/explore?q=Matthew+Lawrence&date=now+7-...
11 20180504 /trends/explore?q=junot+diaz&date=now+7-d&geo=US
12 20180504 /trends/explore?q=nashville+predators&date=now...
13 20180504 /trends/explore?q=albert+pujols&date=now+7-d&g...
14 20180504 /trends/explore?q=indians+vs+yankees&date=now+...
15 20180504 /trends/explore?q=zoe+saldana&date=now+7-d&geo=US
16 20180504 /trends/explore?q=Rihanna&date=now+7-d&geo=US
17 20180504 /trends/explore?q=Becky+Hammon&date=now+7-d&ge...
18 20180504 /trends/explore?q=dte+outage+map&date=now+7-d&...
19 20180504 /trends/explore?q=hawaii+news+now&date=now+7-d...
20 20180504 /trends/explore?q=Colton+Haynes&date=now+7-d&g...
21 20180504 /trends/explore?q=Audrey+Hepburn&date=now+7-d&...
22 20180504 /trends/explore?q=Carol+Burnett&date=now+7-d&g...
formattedTraffic hotnessColor hotnessLevel \
0 20,000+ #f0a049 2.0
1 20,000+ #f0a049 2.0
2 20,000+ #f0a049 2.0
3 2,000,000+ #d04108 5.0
4 1,000,000+ #db601e 4.0
5 500,000+ #db601e 4.0
6 200,000+ #e68033 3.0
7 200,000+ #e68033 3.0
8 200,000+ #e68033 3.0
9 200,000+ #e68033 3.0
10 100,000+ #e68033 3.0
11 100,000+ #e68033 3.0
12 100,000+ #e68033 3.0
13 100,000+ #e68033 3.0
14 100,000+ #e68033 3.0
15 50,000+ #f0a049 2.0
16 50,000+ #f0a049 2.0
17 50,000+ #f0a049 2.0
18 50,000+ #f0a049 2.0
19 50,000+ #f0a049 2.0
20 50,000+ #f0a049 2.0
21 50,000+ #f0a049 2.0
22 50,000+ #f0a049 2.0
imgLinkUrl imgSource \
0 https://wtop.com/entertainment/2018/05/grab-fr... WTOP
1 http://www.espn.com/nhl/story/_/id/23414142/nh... ESPN
2 https://www.slamonline.com/nba/jrue-holiday-an... SLAM Online
3 https://www.nbcnews.com/business/business-news... NBCNews.com
4 https://www.nytimes.com/2018/05/05/business/ci... New York Times
5 https://www.goldenstateofmind.com/2018/5/5/173... Golden State of Mind
6 https://www.bostonglobe.com/sports/bruins/2018... The Boston Globe
7 http://www.espn.com/nba/story/_/id/23409022/ho... ESPN
8 https://www.forbes.com/sites/tomvanriper/2018/... Forbes
9 http://people.com/music/dj-khaled-2015-video-w... PEOPLE.com
10 https://www.goodhousekeeping.com/life/a2015507... GoodHousekeeping.com
11 https://www.washingtonpost.com/news/arts-and-e... Washington Post
12 https://www.tennessean.com/story/sports/nhl/pr... The Tennessean
13 https://www.cbssports.com/mlb/news/leaderboard... CBSSports.com
14 https://www.mlb.com/news/miguel-andujar-yankee... MLB.com
15 http://people.com/movies/mila-kunis-gets-emoti... PEOPLE.com
16 http://www.bbc.com/news/newsbeat-44000486 BBC News
17 http://www.espn.com/nba/story/_/id/23407719/be... ESPN
18 https://www.lansingstatejournal.com/story/news... Lansing State Journal
19 http://www.hawaiinewsnow.com/story/38110613/li... Hawaii News Now
20 http://people.com/tv/colton-haynes-denies-rumo... PEOPLE.com
21 http://people.com/movies/see-audrey-hepburn-in... PEOPLE.com
22 https://www.vanityfair.com/hollywood/2018/05/c... Vanity Fair
imgUrl \
0 //t1.gstatic.com/images?q=tbn:ANd9GcRgX9VkY3X0...
1 //t2.gstatic.com/images?q=tbn:ANd9GcQNtvvQkzuu...
2 //t1.gstatic.com/images?q=tbn:ANd9GcSWuoUKvQM1...
3 //t0.gstatic.com/images?q=tbn:ANd9GcSvx53B96Jy...
4 //t3.gstatic.com/images?q=tbn:ANd9GcS7m8935VXh...
5 //t1.gstatic.com/images?q=tbn:ANd9GcQw4FYzAfaN...
6 //t1.gstatic.com/images?q=tbn:ANd9GcQKEOxhee7r...
7 //t1.gstatic.com/images?q=tbn:ANd9GcTMGOQfUc7u...
8 //t3.gstatic.com/images?q=tbn:ANd9GcQrbRgWqQM-...
9 //t3.gstatic.com/images?q=tbn:ANd9GcTH2gEcxXtQ...
10 //t2.gstatic.com/images?q=tbn:ANd9GcQuOq7biu30...
11 //t1.gstatic.com/images?q=tbn:ANd9GcQroHePQnEr...
12 //t0.gstatic.com/images?q=tbn:ANd9GcSgdsziSLo-...
13 //t1.gstatic.com/images?q=tbn:ANd9GcT8Z0CYLzOL...
14 //t0.gstatic.com/images?q=tbn:ANd9GcQJUrmvZbvz...
15 //t3.gstatic.com/images?q=tbn:ANd9GcSBQuX6A0c3...
16 //t0.gstatic.com/images?q=tbn:ANd9GcQU6AztveLs...
17 //t2.gstatic.com/images?q=tbn:ANd9GcQX6uw7bDSG...
18 //t2.gstatic.com/images?q=tbn:ANd9GcTKzcn18NOd...
19 //t1.gstatic.com/images?q=tbn:ANd9GcRSizKTqReb...
20 //t1.gstatic.com/images?q=tbn:ANd9GcTjJAoEQ0A2...
21 //t0.gstatic.com/images?q=tbn:ANd9GcRWzAeeA3c3...
22 //t0.gstatic.com/images?q=tbn:ANd9GcTCjUox_o9U...
newsArticlesList \
0 [{'title': 'Grab a freebie on <b>Free Comic Bo...
1 [{'title': 'NHL to give <b>Brad Marchand</b> e...
2 [{'title': 'Pelicans' <b>Jrue Holiday</b>:...
3 [{'title': '<b>Kentucky Derby</b> Field: No. 5...
4 [{'title': 'What Is <b>Cinco de Mayo</b>?', 'l...
5 [{'title': '<b>Warriors</b> deservedly get the...
6 [{'title': 'Dan Girardi lifts Lightning over <...
7 [{'title': '<b>Rockets</b> take 2-1 lead by bl...
8 [{'title': '<b>Matt Harvey</b> And Mets Just C...
9 [{'title': '<b>DJ Khaled</b> Faces Critics Aft...
10 [{'title': '<b>Matthew Lawrence</b> Proposed t...
11 [{'title': 'Pulitzer Prize-winning author <b>J...
12 [{'title': '<b>Predators</b> coach Peter Lavio...
13 [{'title': 'Leaderboarding: The astounding Hal...
14 [{'title': 'Andujar walks off Yanks to 13th wi...
15 [{'title': 'Mila Kunis Gets Emotional at BFF <...
16 [{'title': '<b>Rihanna</b> opens up about her ...
17 [{'title': 'Sources: Spurs assistant <b>Becky ...
18 [{'title': 'Crews restoring power quickly afte...
19 [{'title': 'LIST: Lava threat forces evacuatio...
20 [{'title': '<b>Colton Haynes</b> Shuts Down Ru...
21 [{'title': 'See <b>Audrey Hepburn</b> in Gorge...
22 [{'title': '<b>Carol Burnett</b> Wants to Be L...
relatedSearchesList safe \
0 [] 1.0
1 [] 1.0
2 [] 1.0
3 [{'query': 'Kentucky Derby 2018 horses', 'safe... 1.0
4 [{'query': 'Cinco De Mayo 2018 Events', 'safe'... 1.0
5 [{'query': 'Warriors Vs Pelicans', 'safe': Tru... 1.0
6 [{'query': 'Boston Bruins', 'safe': True}, {'q... 1.0
7 [{'query': 'Rockets Vs Jazz', 'safe': True}] 1.0
8 [] 1.0
9 [{'query': 'Dj Khaled Wife', 'safe': True}] 1.0
10 [{'query': 'Cheryl Burke', 'safe': True}] 1.0
11 [] 1.0
12 [{'query': 'Predators', 'safe': True}] 1.0
13 [] 1.0
14 [] 1.0
15 [] 1.0
16 [] 1.0
17 [] 1.0
18 [] 1.0
19 [] 1.0
20 [] 1.0
21 [] 1.0
22 [] 1.0
shareUrl startTime \
0 https://www.google.com/trends/hottrends?stt=Fr... 1.525540e+09
1 https://www.google.com/trends/hottrends?stt=Br... 1.525543e+09
2 https://www.google.com/trends/hottrends?stt=Jr... 1.525532e+09
3 https://www.google.com/trends/hottrends?stt=Ke... 1.525460e+09
4 https://www.google.com/trends/hottrends?stt=Ci... 1.525453e+09
5 https://www.google.com/trends/hottrends?stt=Wa... 1.525482e+09
6 https://www.google.com/trends/hottrends?stt=Br... 1.525486e+09
7 https://www.google.com/trends/hottrends?stt=Ro... 1.525493e+09
8 https://www.google.com/trends/hottrends?stt=Ma... 1.525468e+09
9 https://www.google.com/trends/hottrends?stt=DJ... 1.525475e+09
10 https://www.google.com/trends/hottrends?stt=Ma... 1.525439e+09
11 https://www.google.com/trends/hottrends?stt=Ju... 1.525457e+09
12 https://www.google.com/trends/hottrends?stt=Na... 1.525435e+09
13 https://www.google.com/trends/hottrends?stt=Al... 1.525439e+09
14 https://www.google.com/trends/hottrends?stt=In... 1.525486e+09
15 https://www.google.com/trends/hottrends?stt=Zo... 1.525446e+09
16 https://www.google.com/trends/hottrends?stt=Ri... 1.525432e+09
17 https://www.google.com/trends/hottrends?stt=Be... 1.525493e+09
18 https://www.google.com/trends/hottrends?stt=Dt... 1.525468e+09
19 https://www.google.com/trends/hottrends?stt=Ha... 1.525453e+09
20 https://www.google.com/trends/hottrends?stt=Co... 1.525489e+09
21 https://www.google.com/trends/hottrends?stt=Au... 1.525475e+09
22 https://www.google.com/trends/hottrends?stt=Ca... 1.525478e+09
title titleLinkUrl \
0 Free Comic Book Day //www.google.com/search?q=Free+Comic+Book+Day
1 Brad Marchand //www.google.com/search?q=Brad+Marchand
2 Jrue Holiday //www.google.com/search?q=Jrue+Holiday
3 Kentucky Derby //www.google.com/search?q=Kentucky+Derby
4 Cinco de Mayo //www.google.com/search?q=Cinco+de+Mayo
5 Warriors //www.google.com/search?q=Warriors
6 Bruins //www.google.com/search?q=Bruins
7 Rockets //www.google.com/search?q=Rockets
8 Matt Harvey //www.google.com/search?q=Matt+Harvey
9 DJ Khaled //www.google.com/search?q=DJ+Khaled
10 Matthew Lawrence //www.google.com/search?q=Matthew+Lawrence
11 Junot Diaz //www.google.com/search?q=Junot+Diaz
12 Nashville Predators //www.google.com/search?q=Nashville+Predators
13 Albert Pujols //www.google.com/search?q=Albert+Pujols
14 Indians Vs Yankees //www.google.com/search?q=Indians+Vs+Yankees
15 Zoe Saldana //www.google.com/search?q=Zoe+Saldana
16 Rihanna //www.google.com/search?q=Rihanna
17 Becky Hammon //www.google.com/search?q=Becky+Hammon
18 Dte Outage Map //www.google.com/search?q=Dte+Outage+Map
19 Hawaii News Now //www.google.com/search?q=Hawaii+News+Now
20 Colton Haynes //www.google.com/search?q=Colton+Haynes
21 Audrey Hepburn //www.google.com/search?q=Audrey+Hepburn
22 Carol Burnett //www.google.com/search?q=Carol+Burnett
trafficBucketLowerBound
0 20000.0
1 20000.0
2 20000.0
3 2000000.0
4 1000000.0
5 500000.0
6 200000.0
7 200000.0
8 200000.0
9 200000.0
10 100000.0
11 100000.0
12 100000.0
13 100000.0
14 100000.0
15 50000.0
16 50000.0
17 50000.0
18 50000.0
19 50000.0
20 50000.0
21 50000.0
22 50000.0 >
pytrends returns a pandas dataframe as an output. Pandas dataframes have all sorts of useful methods for subsetting and indexing, so when you call list and str on it rather than the native methods you are getting some weird results.
To take a random sample of a dataframe, you can use the sample method:
data.sample(n)
So, to get you a randomly chosen row from the request:
from pytrends.request import TrendReq
pytrend = TrendReq()
mydata = pytrend.trending_searches()
print(mydata.sample(1)) #or assign it, or get the required rows etc
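If you specifically want the random-number approach from the question, you can pick a row by position with iloc instead of converting the frame to a list; a sketch building on mydata above (the column contents depend on the pytrends version):
import random
i = random.randint(0, len(mydata) - 1)   # a random positional index into the trending results
print(mydata.iloc[i])                    # the chosen trending topic as a Series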
