How to split a dataframe row into two rows in Pandas?

I have a dataframe as follows:
  location   | amount
  -----------|-------
1 new york   | $27.00
2 california | $21.00
3 florida    | $19.00
4 texas      | $18.00
What I want to do is split the row where location equals 'california' into two rows, where 'california' becomes 'sacramento' and 'los angeles', and the amount ($21.00) is divided evenly between the two new rows.
This is the desired result:
  location    | amount
  ------------|-------
1 new york    | $27.00
2 los angeles | $10.50
3 sacramento  | $10.50
4 florida     | $19.00
5 texas       | $18.00

Duplicating & Removing
cal = df.loc[df["location"] == "california"]
half = cal["amount"].iloc[0] / 2  # assumes a numeric amount column
new_rows = pd.DataFrame({
    "location": ["sacramento", "los angeles"],
    "amount": [half, half],
})
# DataFrame.append was removed in pandas 2.0, so concat the new rows instead
df = pd.concat([df.drop(cal.index), new_rows], ignore_index=True)
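An alternative sketch that avoids building the new rows by hand: map the split location to a list of replacements, explode it, and divide each amount by the number of pieces. This likewise assumes the amount column is numeric rather than '$' strings:
# Replace 'california' with its two new locations; every other row keeps itself.
df["location"] = df["location"].apply(
    lambda loc: ["sacramento", "los angeles"] if loc == "california" else [loc]
)
df = df.explode("location")
# explode repeats the original index, so group on it to count the pieces per row.
df["amount"] = df["amount"] / df.groupby(level=0)["location"].transform("size")
df = df.reset_index(drop=True)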
Sources: https://www.codeforests.com/2020/09/27/pandas-split-one-row-of-data-into-multiple-rows/
Python pandas: fill a dataframe row by row

Related

Group by, Pivot with multiple columns and condition count

I have the dataframe:
df = pd.DataFrame({
    "Agreement": ["Peace", "Peace", "Love", "Love", "Sun", "Sun", "Sun"],
    "country1": ["USA", "UK", "Germany", "Spain", "Italy", "India", "China"],
    "country2": ["Canada", "France", "Portugal", "Italy", "India", "Spain", "UK"],
    "EP1": [1, 0, 1, 0, 0, 1, 1],
    "EP2": [0, 0, 0, 0, 0, 0, 0],
    "EP3": [1, 0, 1, 0, 1, 1, 1],
})
I would like to group by or pivot so that I get the count of times a country is in an agreement with at least one EP equal to or greater than 1. I would like as output:
df = pd.DataFrame({
    "Country": ["USA", "UK", "Germany", "Spain", "Italy", "India", "China", "Canada", "France", "Portugal"],
    "Agreement with at least one EP per country": [1, 1, 1, 1, 1, 2, 1, 1, 0, 1],
})
I have tried with pivot, groupby, and a loop, but I never reach the desired output. Thanks.
Summarize the 'EPx' columns into a boolean 'Agreement' column, then flatten your dataframe with melt. Finally, group by Country to count the number of agreements.
cols = ['country1', 'country2', 'Agreement']
out = (df.assign(Agreement=df.filter(like='EP').any(axis=1))[cols]
         .melt('Agreement', value_name='Country')
         .groupby('Country', sort=False)['Agreement'].sum().reset_index())
print(out)
print(out)
# Output
Country Agreement
0 USA 1
1 UK 1
2 Germany 1
3 Spain 1
4 Italy 1
5 India 2
6 China 1
7 Canada 1
8 France 0
9 Portugal 1
Update
I am interested in the count of times a country is in a unique agreement with at least one EP equal to or greater than 1.
cols = ['country1', 'country2', 'Agreement']
out = (df.assign(Agreement=df.filter(like='EP').any(axis=1))[cols]
         .melt('Agreement', value_name='Country')
         .groupby('Country', sort=False)['Agreement'].max().astype(int).reset_index())
print(out)
print(out)
# Output
Country Agreement
0 USA 1
1 UK 1
2 Germany 1
3 Spain 1
4 Italy 1
5 India 1
6 China 1
7 Canada 1
8 France 0
9 Portugal 1
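Note that max simply caps each country's count at 1. If "unique agreement" should instead count each distinct agreement name at most once per country, a sketch that keeps the agreement name and drops duplicate (Agreement, Country) pairs before counting; it gives the same output here, but would count a country twice if it appeared in two different qualifying agreements:
out = (df.assign(has_ep=df.filter(like='EP').any(axis=1))
         .melt(['Agreement', 'has_ep'], value_vars=['country1', 'country2'], value_name='Country')
         .drop_duplicates(['Agreement', 'Country'])
         .groupby('Country', sort=False)['has_ep'].sum().astype(int)
         .reset_index(name='Agreement'))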

Combine text using delimiter for duplicate column values

What I'm trying to achieve is to combine Name into one value, using a comma delimiter, whenever the Country column is duplicated, and to sum the values in the Salary column.
Current input :
pd.DataFrame({'Name': {0: 'John', 1: 'Steven', 2: 'Ibrahim', 3: 'George', 4: 'Nancy', 5: 'Mo', 6: 'Khalil'},
              'Country': {0: 'USA', 1: 'UK', 2: 'UK', 3: 'France', 4: 'Ireland', 5: 'Ireland', 6: 'Ireland'},
              'Salary': {0: 100, 1: 200, 2: 200, 3: 100, 4: 50, 5: 100, 6: 10}})
Name Country Salary
0 John USA 100
1 Steven UK 200
2 Ibrahim UK 200
3 George France 100
4 Nancy Ireland 50
5 Mo Ireland 100
6 Khalil Ireland 10
Expected output :
Rows 1 & 2 (in the input) got grouped into one, since the Country column is duplicated, and the Salary column got summed up.
The same goes for rows 4, 5 & 6.
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
What I have tried (but I'm not sure how to combine the text in the Name column):
df.groupby(['Country'],as_index=False)['Salary'].sum()
[Out:]
Country Salary
0 France 100
1 Ireland 160
2 UK 400
3 USA 100
Use groupby() and agg():
out = df.groupby('Country', as_index=False).agg({'Name': ', '.join, 'Salary': 'sum'})
If you need only the unique values of the 'Name' column, use:
out = (df.groupby('Country', as_index=False)
         .agg({'Name': lambda x: ', '.join(set(x)), 'Salary': 'sum'}))
Note: use pd.unique() in place of set() if the order of the unique values matters.
Output of out:
Country Name Salary
0 France George 100
1 Ireland Nancy, Mo, Khalil 160
2 UK Steven, Ibrahim 400
3 USA John 100
Use agg:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})
And to restore the original column order, you can append [df.columns] to the pipe:
df.groupby(['Country'], as_index=False).agg({'Name': ', '.join, 'Salary':'sum'})[df.columns]
Name Country Salary
0 John USA 100
1 Steven, Ibrahim UK 400
2 George France 100
3 Nancy, Mo, Khalil Ireland 160
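For the same result with explicit output column names, named aggregation (available since pandas 0.25) is a possible variant:
out = df.groupby('Country', as_index=False).agg(
    Name=('Name', ', '.join),
    Salary=('Salary', 'sum'),
)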

How can I get top 5 names with their working hours, who worked the most using pandas data frame

I want to individually print each of the top names with their working hours, one by one.
pandas dataframe:
df = pd.DataFrame({'NAME': ['Joesph Morse', 'Katie Plotkin', 'Denny Heaps', 'Evelia Chesson',
                            'Drew Hassett', 'Robt Buckles', 'Suzy Lafler'],
                   'CITY': ['New York', 'Boston', 'Los Angeles', 'Chicago', 'Atlanta',
                            'Salt Lake City', 'Dallas'],
                   'WORK HOURS': [3, 7, 0, 6, 10, 1, 9]})
Currently I'm targeting the 'WORK HOURS' column with nlargest, which filters out the largest numbers, but it does not help me get the name of the worker along with their work hours. How can I get their names too?
row = df['WORK HOURS']
leading_workers = row.nlargest(5, 'all')
print('Top First worker',leading_workers.values[0]) # user_1
print('Top Second worker',leading_workers.values[1]) # user_2
print('Top Third worker',leading_workers.values[2]) # user_3
print('Top Forth worker',leading_workers.values[3]) # user_4
print('Top Fifth worker',leading_workers.values[4]) # user_5
Use DataFrame.nlargest, specifying the column to rank by, and select NAME for leading_workers:
leading_workers = df.nlargest(5, 'WORK HOURS', 'all')['NAME']
print(leading_workers)
4 Drew Hassett
6 Suzy Lafler
1 Katie Plotkin
3 Evelia Chesson
0 Joesph Morse
Name: NAME, dtype: object
for w in leading_workers:
    print(w)
Drew Hassett
Suzy Lafler
Katie Plotkin
Evelia Chesson
Joesph Morse
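If the goal is to print the hours alongside the names, one small sketch keeps the full top-5 rows and iterates over the name/hours pairs:
top5 = df.nlargest(5, 'WORK HOURS')
for rank, (name, hours) in enumerate(zip(top5['NAME'], top5['WORK HOURS']), start=1):
    print(f'Top worker #{rank}: {name} ({hours} hours)')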
import pandas as pd

df = pd.DataFrame({'NAME': ['Joesph Morse', 'Katie Plotkin', 'Denny Heaps',
                            'Evelia Chesson', 'Drew Hassett', 'Robt Buckles', 'Suzy Lafler'],
                   'CITY': ['New York', 'Boston', 'Los Angeles', 'Chicago', 'Atlanta',
                            'Salt Lake City', 'Dallas'],
                   'WORK HOURS': [3, 7, 0, 6, 10, 1, 9]})
df.sort_values(by='WORK HOURS', ascending=False, inplace=True)
This is another way to do it, with the output:
NAME CITY WORK HOURS
4 Drew Hassett Atlanta 10
6 Suzy Lafler Dallas 9
1 Katie Plotkin Boston 7
3 Evelia Chesson Chicago 6
0 Joesph Morse New York 3
5 Robt Buckles Salt Lake City 1
2 Denny Heaps Los Angeles 0
and if you want the first 5 rows
df.head(5)
NAME CITY WORK HOURS
4 Drew Hassett Atlanta 10
6 Suzy Lafler Dallas 9
1 Katie Plotkin Boston 7
3 Evelia Chesson Chicago 6
0 Joesph Morse New York 3

Python split one column into multiple columns and reattach the split columns into original dataframe

I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd

data = {'ID': ['1', '2', '3', '4', '5', '6', '7'],
        'Residence': ['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON',
                      'USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
        'Name': ['Ann', 'Betty', 'Carl', 'David', 'Emily', 'Frank', 'George'],
        'Gender': ['F', 'F', 'M', 'M', 'F', 'M', 'M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID': ['1', '2', '4', '7'],
                        'Name': ['Ann', 'Betty', 'David', 'George'],
                        'Gender': ['F', 'F', 'M', 'M'],
                        'Country': ['USA', 'USA', 'USA', 'USA'],
                        'State': ['CA', 'MA', 'FL', 'AZ'],
                        'County': ['Los Angeles', 'Suffolk', 'Charlotte', 'None'],
                        'City': ['Los Angeles', 'Boston', 'None', 'None']})
nonUSAdata = pd.DataFrame({'ID': ['3', '6'],
                           'Name': ['Carl', 'Frank'],
                           'Gender': ['M', 'M'],
                           'Country': ['Canada', 'Canada'],
                           'State': ['ON', 'QC']})
I'm stuck here, though. How can I split my original dataframe based on whether Residence includes USA, and attach the split columns from Residence (USA and nonUSA) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
The index of the original data is unique and is not changed by the following code for either DataFrame, so you can use concat to join the two pieces back together and then add them to the original with DataFrame.join, or with concat with axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# columns are renamed only after selecting two of them, to avoid a length-mismatch error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems this can be simplified:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
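To finish what the question asks for, a sketch that starts from the simplified split above and divides the frame into the USA and non-USA parts (dropping the row with no country information, whose split Country is the literal string 'NA'):
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';', expand=True)
is_usa = df['Country'] == 'USA'
USAdata = df[is_usa].drop(columns='Residence').reset_index(drop=True)
nonUSAdata = (df[~is_usa & (df['Country'] != 'NA')]
              .drop(columns=['Residence', 'County', 'City'])
              .reset_index(drop=True))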

Dataframe merging in Pandas

I have two dataframes. The first (df1) contains Name, ID, and PIN. The second (df2) contains Identifier, City, and Country. The dataframes are shown below.
df1 = pd.DataFrame({"Name": ["Sam", "Ajay", "Lee", "Lee Yong Dae", "Cai Yun"],
                    "ID": ["S01", "A01", "L02", "L03", "C01"],
                    "PIN": ["SM392", "AA09", "Lee101", "Lee201", "C101"]})
df2 = pd.DataFrame({"Identifier": ["Sam", "L02", "C101"],
                    "City": ["Moscow", "Seoul", "Beijing"],
                    "Country": ["Russia", "Korea", "China"]})
I want to merge the dataframes if either name or ID or PIN matches with the identifier of df2. The expected output is:
City Country Name PIN Student ID
0 Moscow Russia Sam SM392 S01
1 0 0 Ajay AA09 A01
2 Seoul Korea Lee Lee101 L02
3 0 0 Lee Yong Dae Lee201 L03
4 Beijing China Cai Yun C101 C01
This is perhaps not the most elegant solution, but it works for me.
You have to create 3 separate merges and combine the results.
The code below gives the expected output (with nan values instead of 0 for the unmatched elements of the DataFrame)
import pandas as pd

# Initial data
df1 = pd.DataFrame({"Name": ["Sam", "Ajay", "Lee", "Lee Yong Dae", "Cai Yun"],
                    "ID": ["S01", "A01", "L02", "L03", "C01"],
                    "PIN": ["SM392", "AA09", "Lee101", "Lee201", "C101"]})
df2 = pd.DataFrame({"Identifier": ["Sam", "L02", "C101"],
                    "City": ["Moscow", "Seoul", "Beijing"],
                    "Country": ["Russia", "Korea", "China"]})

def merge_three(df1, df2):
    # Perform three separate merges, one per candidate key
    df3 = df1.merge(df2, how='outer', left_on='ID', right_on='Identifier')
    df4 = df1.merge(df2, how='outer', left_on='Name', right_on='Identifier')
    df5 = df1.merge(df2, how='outer', left_on='PIN', right_on='Identifier')
    # Copy the 2nd and 3rd merge results into df3
    df3['City_x'] = df4['City']
    df3['Country_x'] = df4['Country']
    df3['City_y'] = df5['City']
    df3['Country_y'] = df5['Country']
    # Keep whichever merge matched: combine_first takes the first non-NaN value
    # (np.max over these object columns can raise a TypeError on str/NaN comparisons in Python 3)
    df6 = df3[['City', 'Country', 'Name', 'PIN', 'ID']].copy()
    df6['City'] = df3['City'].combine_first(df3['City_x']).combine_first(df3['City_y'])
    df6['Country'] = df3['Country'].combine_first(df3['Country_x']).combine_first(df3['Country_y'])
    # Remove the extra unmatched rows introduced by the outer merges
    df_final = df6[df6['Name'].notnull()]
    return df_final

df_out = merge_three(df1, df2)
Output:
df_out
City Country Name PIN ID
0 Moscow Russia Sam SM392 S01
1 NaN NaN Ajay AA09 A01
2 Seoul Korea Lee Lee101 L02
3 NaN NaN Lee Yong Dae Lee201 L03
4 Beijing China Cai Yun C101 C01
Not sure, but maybe this is what you are looking for:
a = df1.merge(df2, left_on='ID', right_on='Identifier')
b = df1.merge(df2, left_on='Name', right_on='Identifier')
c = df1.merge(df2, left_on='PIN', right_on='Identifier')
# DataFrame.append was removed in pandas 2.0; concat gives the same result
df = pd.concat([a, b, c])
df
ID Name PIN City Country Identifier
0 L02 Lee Lee101 Seoul Korea L02
0 S01 Sam SM392 Moscow Russia Sam
0 C01 Cai Yun C101 Beijing China C101
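A more compact variant (a sketch, assuming pandas >= 1.1 for ignore_index and that each row matches at most one identifier): melt the three candidate key columns into one, merge once against df2, and join the matches back by the original row index:
long = df1.melt(value_vars=['Name', 'ID', 'PIN'],
                value_name='Identifier', ignore_index=False)
matched = long.reset_index().merge(df2, on='Identifier')
# 'index' holds the original df1 row number, so the matches align back exactly
out = df1.join(matched.set_index('index')[['City', 'Country']]).fillna(0)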
