Merge two related dataframes into one - Python

How can I create a new DataFrame such that each teacher row contains a list of that teacher's students?
Teacher df
name married school
0 Pep Guardiola True Manchester High School
1 Jurgen Klopp True Liverpool High School
2 Mikel Arteta False Arsenal High
3 Zinadine Zidane True NaN
Student df
teacher name age height weight
0 Mikel Arteta Bukayo Saka 21 2.1m 80kg
1 Mikel Arteta Gabriel Martinelli 21 2.1m 75kg
2 Pep Guardiola Jack Grealish 27 2.1m 80kg
3 Jurgen Klopp Roberto Firmino 31 2.1m 65kg
4 Jurgen Klopp Andrew Robertson 28 2.1m 70kg
5 Jurgen Klopp Darwin Nunez 23 2.1m 75kg
6 Pep Guardiola Ederson Moraes 29 2.1m 90kg
7 Pep Guardiola Manuel Akanji 27 2.1m 80kg
8 Mikel Arteta Thomas Partey 29 2.1m 80kg

If you need a new column filled with a list of students, use Series.map with an aggregated list:
df1['students'] = df1['name'].map(df2.groupby('teacher')['name'].agg(list))
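A minimal runnable sketch of this approach, using small stand-in frames (df1 for teachers, df2 for students, matching the names in the snippet above):

```python
import pandas as pd

# Stand-in data mirroring the frames above
df1 = pd.DataFrame({"name": ["Pep Guardiola", "Jurgen Klopp", "Mikel Arteta"]})
df2 = pd.DataFrame({
    "teacher": ["Mikel Arteta", "Pep Guardiola", "Jurgen Klopp", "Pep Guardiola"],
    "name": ["Bukayo Saka", "Jack Grealish", "Roberto Firmino", "Ederson Moraes"],
})

# groupby(...).agg(list) yields a Series indexed by teacher name,
# so mapping df1['name'] through it attaches each teacher's student list
df1["students"] = df1["name"].map(df2.groupby("teacher")["name"].agg(list))
print(df1)
```

Teachers with no students (like Zinadine Zidane above) would simply get NaN in the new column.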

You can also consider merging the student dataframe (df2) with its own per-teacher aggregation:
df2.merge(df2.groupby('teacher', as_index=False).agg({'name': list}),
          how='left',
          on='teacher',
          suffixes=('', '_list'))
Returning:
teacher name age height weight name_list
0 Mikel Arteta Saka 21 2.1m 80kg [Saka, Martinelli, Partey]
1 Mikel Arteta Martinelli 21 2.1m 75kg [Saka, Martinelli, Partey]
2 Pep Guardiola Grealish 27 2.1m 80kg [Grealish, Moraes, Akanji]
3 Jurgen Klopp Firmino 31 2.1m 65kg [Firmino, Robertson, Nunez]
4 Jurgen Klopp Robertson 28 2.1m 70kg [Firmino, Robertson, Nunez]
5 Jurgen Klopp Nunez 23 2.1m 75kg [Firmino, Robertson, Nunez]
6 Pep Guardiola Moraes 29 2.1m 90kg [Grealish, Moraes, Akanji]
7 Pep Guardiola Akanji 27 2.1m 80kg [Grealish, Moraes, Akanji]
8 Mikel Arteta Partey 29 2.1m 80kg [Saka, Martinelli, Partey]
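The merge can be sketched end to end on a small stand-in student frame (names abbreviated as in the output above); the suffixes argument keeps the original name column and renames the aggregated one to name_list:

```python
import pandas as pd

df2 = pd.DataFrame({
    "teacher": ["Mikel Arteta", "Pep Guardiola", "Mikel Arteta"],
    "name": ["Saka", "Grealish", "Partey"],
})

# Left-merge each student row with the per-teacher list of all students
merged = df2.merge(
    df2.groupby("teacher", as_index=False).agg({"name": list}),
    how="left", on="teacher", suffixes=("", "_list"),
)
print(merged)
```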

Python - Link 2 columns

I want to create a data frame to link 2 columns together (customer ID to each order ID the customer placed). The row index + 1 correlates to the customer ID. Is there a way to do this through mapping?
Data: invoice_df
Order Id,Date,Meal Id,Company Id,Date of Meal,Participants,Meal Price,Type of Meal
839FKFW2LLX4LMBB,27-05-2016,INBUX904GIHI8YBD,LJKS5NK6788CYMUU,2016-05-31 07:00:00+02:00,['David Bishop'],469,Breakfast
97OX39BGVMHODLJM,27-09-2018,J0MMOOPP709DIDIE,LJKS5NK6788CYMUU,2018-10-01 20:00:00+02:00,['David Bishop'],22,Dinner
041ORQM5OIHTIU6L,24-08-2014,E4UJLQNCI16UX5CS,LJKS5NK6788CYMUU,2014-08-23 14:00:00+02:00,['Karen Stansell'],314,Lunch
YT796QI18WNGZ7ZJ,12-04-2014,C9SDFHF7553BE247,LJKS5NK6788CYMUU,2014-04-07 21:00:00+02:00,['Addie Patino'],438,Dinner
6YLROQT27B6HRF4E,28-07-2015,48EQXS6IHYNZDDZ5,LJKS5NK6788CYMUU,2015-07-27 14:00:00+02:00,['Addie Patino' 'Susan Guerrero'],690,Lunch
AT0R4DFYYAFOC88Q,21-07-2014,W48JPR1UYWJ18NC6,LJKS5NK6788CYMUU,2014-07-17 20:00:00+02:00,['David Bishop' 'Susan Guerrero' 'Karen Stansell'],181,Dinner
2DDN2LHS7G85GKPQ,29-04-2014,1MKLAKBOE3SP7YUL,LJKS5NK6788CYMUU,2014-04-30 21:00:00+02:00,['Susan Guerrero' 'David Bishop'],14,Dinner
FM608JK1N01BPUQN,08-05-2014,E8WJZ1FOSKZD2MJN,36MFTZOYMTAJP1RK,2014-05-07 09:00:00+02:00,['Amanda Knowles' 'Cheryl Feaster' 'Ginger Hoagland' 'Michael White'],320,Breakfast
CK331XXNIBQT81QL,23-05-2015,CTZSFFKQTY7SBZ4J,36MFTZOYMTAJP1RK,2015-05-18 13:00:00+02:00,['Cheryl Feaster' 'Amanda Knowles' 'Ginger Hoagland'],697,Lunch
FESGKOQN2OZZWXY3,10-01-2016,US0NQYNNHS1SQJ4S,36MFTZOYMTAJP1RK,2016-01-14 22:00:00+01:00,['Glenn Gould' 'Amanda Knowles' 'Ginger Hoagland' 'Michael White'],451,Dinner
YITOTLOF0MWZ0VYX,03-10-2016,RGYX8772307H78ON,36MFTZOYMTAJP1RK,2016-10-01 22:00:00+02:00,['Ginger Hoagland' 'Amanda Knowles' 'Michael White'],263,Dinner
8RIGCF74GUEQHQEE,23-07-2018,5XK0KTFTD6OAP9ZP,36MFTZOYMTAJP1RK,2018-07-27 08:00:00+02:00,['Amanda Knowles'],210,Breakfast
TH60C9D8TPYS7DGG,15-12-2016,KDSMP2VJ22HNEPYF,36MFTZOYMTAJP1RK,2016-12-13 08:00:00+01:00,['Cheryl Feaster' 'Bret Adams' 'Ginger Hoagland'],755,Breakfast
W1Y086SRAVUZU1AL,17-09-2017,8IUOYVS031QPROUG,36MFTZOYMTAJP1RK,2017-09-14 13:00:00+02:00,['Bret Adams'],469,Lunch
WKB58Q8BHLOFQAB5,31-08-2016,E2K2TQUMENXSI9RP,36MFTZOYMTAJP1RK,2016-09-03 14:00:00+02:00,['Michael White' 'Ginger Hoagland' 'Bret Adams'],502,Lunch
N8DOG58MW238BHA9,25-12-2018,KFR2TAYXZSVCHAA2,36MFTZOYMTAJP1RK,2018-12-20 12:00:00+01:00,['Ginger Hoagland' 'Cheryl Feaster' 'Glenn Gould' 'Bret Adams'],829,Lunch
DPDV9UGF0SUCYTGW,25-05-2017,6YV61SH7W9ECUZP0,36MFTZOYMTAJP1RK,2017-05-24 22:00:00+02:00,['Michael White'],708,Dinner
KNF3E3QTOQ22J269,20-06-2018,737T2U7604ABDFDF,36MFTZOYMTAJP1RK,2018-06-15 07:00:00+02:00,['Glenn Gould' 'Cheryl Feaster' 'Ginger Hoagland' 'Amanda Knowles'],475,Breakfast
LEED1HY47M8BR5VL,22-10-2017,I22P10IQQD06MO45,36MFTZOYMTAJP1RK,2017-10-22 14:00:00+02:00,['Glenn Gould'],27,Lunch
LSJPNJQLDTIRNWAL,27-01-2017,247IIVNN6CXGWINB,36MFTZOYMTAJP1RK,2017-01-23 13:00:00+01:00,['Amanda Knowles' 'Bret Adams'],672,Lunch
6UX5RMHJ1GK1F9YQ,24-08-2014,LL4AOPXDM8V5KP5S,H3JRC7XX7WJAD4ZO,2014-08-27 12:00:00+02:00,['Anthony Emerson' 'Irvin Gentry' 'Melba Inlow'],552,Lunch
5SYB15QEFWD1E4Q4,09-07-2017,KZI0VRU30GLSDYHA,H3JRC7XX7WJAD4ZO,2017-07-13 08:00:00+02:00,"['Anthony Emerson' 'Emma Steitz' 'Melba Inlow' 'Irvin Gentry'
'Kelly Killebrew']",191,Breakfast
W5S8VZ61WJONS4EE,25-03-2017,XPSPBQF1YLIG26N1,H3JRC7XX7WJAD4ZO,2017-03-25 07:00:00+01:00,['Irvin Gentry' 'Kelly Killebrew'],471,Breakfast
795SVIJKO8KS3ZEL,05-01-2015,HHTLB8M9U0TGC7Z4,H3JRC7XX7WJAD4ZO,2015-01-06 22:00:00+01:00,['Emma Steitz'],588,Dinner
8070KEFYSSPWPCD0,05-08-2014,VZ2OL0LREO8V9RKF,H3JRC7XX7WJAD4ZO,2014-08-09 12:00:00+02:00,['Lewis Eyre'],98,Lunch
RUQOHROBGBOSNUO4,10-06-2016,R3LFUK1WFDODC1YF,H3JRC7XX7WJAD4ZO,2016-06-09 08:00:00+02:00,['Anthony Emerson' 'Kelly Killebrew' 'Lewis Eyre'],516,Breakfast
6P91QRADC2O9WOVT,25-09-2016,L2F2HEGB6Q141080,H3JRC7XX7WJAD4ZO,2016-09-26 07:00:00+02:00,"['Kelly Killebrew' 'Lewis Eyre' 'Irvin Gentry' 'Emma Steitz'
'Anthony Emerson']",664,Breakfast
Code:
import re

# Convert strings like "['name' 'name2']" to a list ['name', 'name2']
# Returns a list of participant names
def string_to_list(participant_string):
    return re.findall(r"'(.*?)'", participant_string)

invoice_df["Participants"] = invoice_df["Participants"].apply(string_to_list)
# Obtain an array of all unique customer names
customers = invoice_df["Participants"].explode().unique()
# Create a new customer dataframe
customers_df = pd.DataFrame(customers, columns=["CustomerName"])
# Add a customer id (row index + 1)
customers_df["customer_id"] = customers_df.index + 1
# Create first_name and last_name columns
customers_df["first_name"] = customers_df["CustomerName"].apply(lambda x: x.split(" ")[0])
# Slice from index 1 in case the person has multiple last names
customers_df["last_name"] = customers_df["CustomerName"].apply(lambda x: " ".join(x.split(" ")[1:]))
Solution
# Find all the occurrences of customer names
# then explode to convert values in lists to rows
cust = invoice_df['Participants'].str.findall(r"'(.*?)'").explode()
# Join with orderid
customers_df = invoice_df[['Order Id']].join(cust)
# factorize to encode the unique values in participants
customers_df['Customer Id'] = customers_df['Participants'].factorize()[0] + 1
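The solution can be checked on a small stand-in frame (hypothetical order ids, participant strings in the same space-separated format as the real data):

```python
import pandas as pd

# Stand-in invoice data in the same shape as invoice_df above
invoice_df = pd.DataFrame({
    "Order Id": ["A1", "A2", "A3"],
    "Participants": ["['David Bishop']",
                     "['Karen Stansell' 'David Bishop']",
                     "['Addie Patino']"],
})

# findall extracts the quoted names; explode turns each list into rows
cust = invoice_df["Participants"].str.findall(r"'(.*?)'").explode()
# join aligns on the original row index, duplicating the Order Id per participant
customers_df = invoice_df[["Order Id"]].join(cust)
# factorize numbers each unique name in order of first appearance
customers_df["Customer Id"] = customers_df["Participants"].factorize()[0] + 1
print(customers_df)
```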
Result
Order Id Participants Customer Id
0 839FKFW2LLX4LMBB David Bishop 1
1 97OX39BGVMHODLJM David Bishop 1
2 041ORQM5OIHTIU6L Karen Stansell 2
3 YT796QI18WNGZ7ZJ Addie Patino 3
4 6YLROQT27B6HRF4E Addie Patino 3
4 6YLROQT27B6HRF4E Susan Guerrero 4
5 AT0R4DFYYAFOC88Q David Bishop 1
5 AT0R4DFYYAFOC88Q Susan Guerrero 4
5 AT0R4DFYYAFOC88Q Karen Stansell 2
6 2DDN2LHS7G85GKPQ Susan Guerrero 4
6 2DDN2LHS7G85GKPQ David Bishop 1
7 FM608JK1N01BPUQN Amanda Knowles 5
7 FM608JK1N01BPUQN Cheryl Feaster 6
7 FM608JK1N01BPUQN Ginger Hoagland 7
7 FM608JK1N01BPUQN Michael White 8
8 CK331XXNIBQT81QL Cheryl Feaster 6
8 CK331XXNIBQT81QL Amanda Knowles 5
8 CK331XXNIBQT81QL Ginger Hoagland 7
9 FESGKOQN2OZZWXY3 Glenn Gould 9
9 FESGKOQN2OZZWXY3 Amanda Knowles 5
9 FESGKOQN2OZZWXY3 Ginger Hoagland 7
9 FESGKOQN2OZZWXY3 Michael White 8
10 YITOTLOF0MWZ0VYX Ginger Hoagland 7
10 YITOTLOF0MWZ0VYX Amanda Knowles 5
10 YITOTLOF0MWZ0VYX Michael White 8
11 8RIGCF74GUEQHQEE Amanda Knowles 5
12 TH60C9D8TPYS7DGG Cheryl Feaster 6
12 TH60C9D8TPYS7DGG Bret Adams 10
12 TH60C9D8TPYS7DGG Ginger Hoagland 7
13 W1Y086SRAVUZU1AL Bret Adams 10
14 WKB58Q8BHLOFQAB5 Michael White 8
14 WKB58Q8BHLOFQAB5 Ginger Hoagland 7
14 WKB58Q8BHLOFQAB5 Bret Adams 10
15 N8DOG58MW238BHA9 Ginger Hoagland 7
15 N8DOG58MW238BHA9 Cheryl Feaster 6
15 N8DOG58MW238BHA9 Glenn Gould 9
15 N8DOG58MW238BHA9 Bret Adams 10
16 DPDV9UGF0SUCYTGW Michael White 8
17 KNF3E3QTOQ22J269 Glenn Gould 9
17 KNF3E3QTOQ22J269 Cheryl Feaster 6
17 KNF3E3QTOQ22J269 Ginger Hoagland 7
17 KNF3E3QTOQ22J269 Amanda Knowles 5
18 LEED1HY47M8BR5VL Glenn Gould 9
19 LSJPNJQLDTIRNWAL Amanda Knowles 5
19 LSJPNJQLDTIRNWAL Bret Adams 10
20 6UX5RMHJ1GK1F9YQ Anthony Emerson 11
20 6UX5RMHJ1GK1F9YQ Irvin Gentry 12
20 6UX5RMHJ1GK1F9YQ Melba Inlow 13
21 5SYB15QEFWD1E4Q4 Anthony Emerson 11
21 5SYB15QEFWD1E4Q4 Emma Steitz 14
21 5SYB15QEFWD1E4Q4 Melba Inlow 13
21 5SYB15QEFWD1E4Q4 Irvin Gentry 12
21 5SYB15QEFWD1E4Q4 Kelly Killebrew 15
22 W5S8VZ61WJONS4EE Irvin Gentry 12
22 W5S8VZ61WJONS4EE Kelly Killebrew 15
23 795SVIJKO8KS3ZEL Emma Steitz 14
24 8070KEFYSSPWPCD0 Lewis Eyre 16
25 RUQOHROBGBOSNUO4 Anthony Emerson 11
25 RUQOHROBGBOSNUO4 Kelly Killebrew 15
25 RUQOHROBGBOSNUO4 Lewis Eyre 16
26 6P91QRADC2O9WOVT Kelly Killebrew 15
26 6P91QRADC2O9WOVT Lewis Eyre 16
26 6P91QRADC2O9WOVT Irvin Gentry 12
26 6P91QRADC2O9WOVT Emma Steitz 14
26 6P91QRADC2O9WOVT Anthony Emerson 11

pd.read_html() not reading date

When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back as empty. Is there a workaround for this? I've tried using beautiful soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
Here's an image of one of the tables in question:
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select the correct table, here the first one:
tbl = soup.select("table")[0]
# remove the "(aged XX)" part:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]
df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF
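The cell cleanup inside the loop can be checked in isolation. Assuming the trailing text node of a date-of-birth cell looks like the hypothetical string below, keeping everything before the first "(" strips the age annotation:

```python
# Hypothetical trailing text node of a date-of-birth cell
raw = " 12 June 1976 (aged 25)"

# Same cleanup as in the loop: keep the text before the first "("
cleaned = raw.split("(")[0].strip()
print(cleaned)
```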
Alternatively, try setting the parse_dates parameter to True in the read_html call.

Print out largest value from column in Pandas [duplicate]

This question already has answers here:
How to find maximum value of a column in python dataframe
(4 answers)
Closed 1 year ago.
I was given a task where I'm supposed to find largest value in 'Salary' column and print the value.
Here's the code:
import numpy as np
import pandas as pd
# TODO: Find the largest number in 'Salary' column
data_one = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Employee.xlsx")
dataframe_data = pd.DataFrame(data_one)
find_largest_salary = dataframe_data['Salary']
Output:
First Name Last Name Gender Age Experience (Years) Salary
0 Arnold Carter Male 21 10 8344
1 Arthur Farrell Male 20 7 6437
2 Richard Perry Male 28 3 8338
3 Ellia Thomas Female 26 4 8870
4 Jacob Kelly Male 21 4 548
... ... ... ... ... ... ...
95 Leonardo Barnes Male 23 6 7120
96 Kristian Mason Male 21 7 7018
97 Paul Perkins Male 21 6 2929
98 Justin Moore Male 25 0 3141
99 Naomi Ryan Female 22 10 7486
The output I wanted (example):
99999
The way you have described the question, the answer is simply this:
find_largest_salary = dataframe_data['Salary'].max()
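A self-contained sketch with stand-in rows (instead of reading the Excel file):

```python
import pandas as pd

# Stand-in for the Employee sheet above
dataframe_data = pd.DataFrame({
    "First Name": ["Arnold", "Ellia", "Jacob"],
    "Salary": [8344, 8870, 548],
})

# Series.max() returns the largest value in the column
find_largest_salary = dataframe_data["Salary"].max()
print(find_largest_salary)
```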

How to sort dataframe with values

I want to sort my dataframe in descending order by "Total Confirmed cases".
My Code
high_cases_sorted_df = df.sort_values(by='Total Confirmed cases',ascending=False)
print(high_cases_sorted_df)
Output
state Total Confirmed cases
19 Maharashtra 8590
14 Jharkhand 82
24 Puducherry 8
9 Goa 7
32 West Bengal 697
13 Jammu and Kashmir 546
15 Karnataka 512
30 Uttarakhand 51
16 Kerala 481
6 Chandigarh 40
12 Himachal Pradesh 40
7 Chhattisgarh 37
4 Assam 36
10 Gujarat 3548
5 Bihar 345
1 Andaman and Nicobar Islands 33
25 Punjab 313
8 Delhi 3108
11 Haryana 296
26 Rajasthan 2262
18 Madhya Pradesh 2168
17 Ladakh 20
20 Manipur 2
29 Tripura 2
31 Uttar Pradesh 1955
I don't know why it shows like this; it should be
(1. Maharashtra, 2. Gujarat, 3. Delhi, etc.)
complete script Here
The column is stored as strings, so the sort is lexicographic. Simply convert it to integer before sorting:
df['Total Confirmed cases'] = df['Total Confirmed cases'].astype(int)
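A small sketch showing the fix, with stand-in rows where string and integer ordering differ:

```python
import pandas as pd

# Stand-in frame where the counts were read in as strings
df = pd.DataFrame({
    "state": ["Maharashtra", "West Bengal", "Jharkhand"],
    "Total Confirmed cases": ["8590", "697", "82"],
})

# As strings, "82" sorts above "697" (character by character);
# converting to int restores numeric ordering
df["Total Confirmed cases"] = df["Total Confirmed cases"].astype(int)
out = df.sort_values(by="Total Confirmed cases", ascending=False)
print(out)
```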

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep="\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
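The two-step process runs as-is on an inline copy of the data, with no money.txt needed:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year": [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# Step 1: position of each row within its own year
df["yindex"] = df.groupby("Year").cumcount()
# Step 2: pivot so the years become columns of different effective length
new_df = df.pivot(index="yindex", columns="Year", values="Money")
print(new_df)
```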
