How to add a suffix to the first N columns in pandas? - python

I want to add a suffix to the first N columns. But I can't.
This is how to add a suffix to all columns:
import pandas as pd
df = pd.DataFrame( {"name" : ["John","Alex","Kate","Martin"], "surname" : ["Smith","Morgan","King","Cole"],
"job": ["Engineer","Dentist","Coach","Teacher"],"Age":[25,20,25,30],
"Id": [1,2,3,4]})
df.add_suffix("_x")
And this is the result:
name_x surname_x job_x Age_x Id_x
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
But I want to add the suffix only to the first N columns, let's say the first 3. Desired output is:
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4

Work with the indices and take slices to modify a subset of them:
df.columns = (df.columns[:3]+'_x').union(df.columns[3:], sort=False)
print(df)
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4

This should work:
N = 3
cols = list(df.columns[:N])
new_cols = [c + '_x' for c in cols]
dict_cols = dict(zip(cols, new_cols))
# rename returns a new DataFrame, so assign the result back
df = df.rename(columns=dict_cols)

Set the column labels using a list comprehension:
n = 3
df.columns = [f'{c}_x' if i < n else c for i, c in enumerate(df.columns)]
This results in:
name_x surname_x job_x Age Id
0 John Smith Engineer 25 1
1 Alex Morgan Dentist 20 2
2 Kate King Coach 25 3
3 Martin Cole Teacher 30 4
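The list-comprehension approach above can be wrapped in a small helper. This is only a minimal sketch: the function name `suffix_first_n` and its parameters are hypothetical, not part of pandas. It builds the new labels by position and returns a copy, so the original DataFrame is left unchanged.

```python
import pandas as pd


def suffix_first_n(df, n, suffix="_x"):
    """Return a copy of df with `suffix` appended to the first n column labels."""
    out = df.copy()
    # Rename by position: columns 0..n-1 get the suffix, the rest are kept as-is.
    out.columns = [f"{c}{suffix}" if i < n else c for i, c in enumerate(df.columns)]
    return out


df = pd.DataFrame({
    "name": ["John", "Alex"],
    "surname": ["Smith", "Morgan"],
    "job": ["Engineer", "Dentist"],
    "Age": [25, 20],
    "Id": [1, 2],
})

result = suffix_first_n(df, 3)
print(list(result.columns))  # ['name_x', 'surname_x', 'job_x', 'Age', 'Id']
```

Because the labels are rebuilt by position rather than matched by name, the helper does not depend on the column labels being unique.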

Related

Pandas group by a specific value in any of given columns

Given the pandas dataframe as follows:
Partner1 Partner2 Interactions
0 Ann Alice 1
1 Alice Kate 8
2 Kate Tony 9
3 Tony Ann 2
How can I group by a specific partner, let's say to find the total number of interactions of Ann?
Something like
gb = df.groupby(['Partner1'] or ['Partner2']).agg({'Interactions': 'sum'})
and getting the answer:
Partner Interactions
Ann 3
Alice 9
Kate 17
Tony 11
You can use melt together with groupby. First melt:
df = pd.melt(df, id_vars='Interactions', value_vars=['Partner1', 'Partner2'], value_name='Partner')
This will give:
Interactions variable Partner
0 1 Partner1 Ann
1 8 Partner1 Alice
2 9 Partner1 Kate
3 2 Partner1 Tony
4 1 Partner2 Alice
5 8 Partner2 Kate
6 9 Partner2 Tony
7 2 Partner2 Ann
Now, group by Partner and sum:
df.groupby('Partner')[['Interactions']].sum()
Result:
Partner Interactions
Alice 9
Ann 3
Kate 17
Tony 11
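For reference, here is the melt-then-groupby idea as one self-contained script built from the question's data. The variable names `long_df` and `totals` are chosen for illustration only; keeping the melted frame in a separate variable leaves the original `df` intact.

```python
import pandas as pd

df = pd.DataFrame({
    "Partner1": ["Ann", "Alice", "Kate", "Tony"],
    "Partner2": ["Alice", "Kate", "Tony", "Ann"],
    "Interactions": [1, 8, 9, 2],
})

# Stack Partner1 and Partner2 into a single 'Partner' column,
# repeating each row's interaction count for both partners.
long_df = df.melt(
    id_vars="Interactions",
    value_vars=["Partner1", "Partner2"],
    value_name="Partner",
)

# Total interactions per partner, regardless of which column they came from.
totals = long_df.groupby("Partner")["Interactions"].sum()

print(totals["Ann"])  # 3
```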
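You can also merge the original DataFrame with itself. This works for data like the example, where each partner appears exactly once in each column:
# join the df to itself: match each Partner1 with the row where they are Partner2
join_df = df.merge(df, left_on='Partner1', right_on='Partner2', suffixes=('', '_'))
# sum the two interaction counts row-wise
join_df['InteractionsSum'] = join_df[['Interactions', 'Interactions_']].sum(axis=1)
join_df = join_df[['Partner1', 'InteractionsSum']].copy()
print(join_df)
Partner1 InteractionsSum
0 Ann 3
1 Alice 9
2 Kate 17
3 Tony 11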

Adding rows to a column for every element in the list for every unique value in another column in python pandas

I have two lists of unequal length:
Name = ['Tom', 'Jack', 'Nick', 'Juli', 'Harry']
bId= list(range(0,3))
I want to build a data frame that would look like below:
'Name' 'bId'
Tom 0
Tom 1
Tom 2
Jack 0
Jack 1
Jack 2
Nick 0
Nick 1
Nick 2
Juli 0
Juli 1
Juli 2
Harry 0
Harry 1
Harry 2
Please suggest.
Use itertools.product with DataFrame constructor:
from itertools import product
df = pd.DataFrame(product(Name, bId), columns=['Name','bId'])
print (df)
Name bId
0 Tom 0
1 Tom 1
2 Tom 2
3 Jack 0
4 Jack 1
5 Jack 2
6 Nick 0
7 Nick 1
8 Nick 2
9 Juli 0
10 Juli 1
11 Juli 2
12 Harry 0
13 Harry 1
14 Harry 2
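As an alternative that stays within pandas, the same Cartesian product can be sketched with `pd.MultiIndex.from_product`. This is a different technique from the `itertools.product` answer above, shown only for comparison.

```python
import pandas as pd

Name = ['Tom', 'Jack', 'Nick', 'Juli', 'Harry']
bId = list(range(0, 3))

# Build every (Name, bId) combination as a MultiIndex,
# then turn the index levels into ordinary columns.
idx = pd.MultiIndex.from_product([Name, bId], names=['Name', 'bId'])
df = idx.to_frame(index=False)

print(df.shape)  # (15, 2)
```

The row order matches the desired output: all `bId` values for the first name, then all for the second, and so on.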

pandas to find earliest occurrence of statement and set to starter

Consider the following df
data = {'Name' : ['John','John','Lucy','Lucy','Lucy'],
'Payroll' : [15,15,75,75,75],
'Week' : [1,2,1,2,3]}
df = pd.DataFrame(data)
Name Payroll Week
0 John 15 1
1 John 15 2
2 Lucy 75 1
3 Lucy 75 2
4 Lucy 75 3
What I'm attempting to do is apply a Boolean condition across a DataFrame very similar to this one, with 2m+ rows and 20+ columns, to find out when someone started.
To find if someone is active or not I pass a condition to another df:
df2 = df.loc[df.Week == df.Week.max()]
This gives me the final week. I then use an isin filter to find out whether the person is active or has left:
df['Status'] = np.where(df['Payroll'].isin(df2['Payroll']), 'Active','Leaver')
Using the above code I get the following, which is great: since John is not in the latest week, he has left the company.
Name Payroll Week Status
0 John 15 1 Leaver
1 John 15 2 Leaver
2 Lucy 75 1 Active
3 Lucy 75 2 Active
4 Lucy 75 3 Active
What I'm trying to achieve is to know when John started with us. I could try a mask for each week of the year and an isin to check when each person first appeared, but I figured there must be a more pythonic way to do this!
Desired output:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Leaver
2 Lucy 75 1 Starter
3 Lucy 75 2 Active
4 Lucy 75 3 Active
Any help is much appreciated.
Edit for Clarity :
data = {'Name' : ['John','John','John','John','Lucy','Lucy','Lucy','Lucy','Lucy'],
'Payroll' : [15,15,15,15,75,75,75,75,75],
'Week' : [1,2,3,4,1,2,3,4,5]}
df = pd.DataFrame(data)
desired output:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Active
2 John 15 3 Active
3 John 15 4 Leaver
4 Lucy 75 1 Starter
5 Lucy 75 2 Active
6 Lucy 75 3 Active
7 Lucy 75 4 Active
8 Lucy 75 5 Active
Things to note:
Max week is 5, so anyone not in week 5 is a leaver.
The first week a person appears in the df makes them a starter.
All weeks in between are set to Active.
Use numpy.select with additional conditions built with duplicated (this assumes the rows are sorted by Week within each Payroll):
a = df.loc[df.Week == df.Week.max(), 'Payroll']
m1 = ~df['Payroll'].isin(a)
m2 = ~df['Payroll'].duplicated()
m3 = ~df['Payroll'].duplicated(keep='last')
df['Status'] = np.select([m2, m1 & m3], ['Starter', 'Leaver'], 'Active')
print (df)
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Active
2 John 15 3 Active
3 John 15 4 Leaver
4 Lucy 75 1 Starter
5 Lucy 75 2 Active
6 Lucy 75 3 Active
7 Lucy 75 4 Active
8 Lucy 75 5 Active
The simplest way that I have come across is using groupby and finding minimal index for the name in the group:
for _, dfg in df.groupby(df['Name']):
    gidx = min(dfg.index)
    df.loc[df.index == gidx, 'Status'] = 'Starter'
print(df)
And the df is then:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Leaver
2 Lucy 75 1 Starter
3 Lucy 75 2 Active
4 Lucy 75 3 Active
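The loop above can also be expressed without iterating over groups. The following is a sketch under stated assumptions, not either answerer's exact method: it compares each row's Week with the first and last week of that Payroll via `groupby(...).transform`, so it does not rely on row order. The names `first`, `final`, `is_starter` and `is_leaver` are illustrative, and it uses the data from the question's edit.

```python
import numpy as np
import pandas as pd

data = {'Name': ['John', 'John', 'John', 'John', 'Lucy', 'Lucy', 'Lucy', 'Lucy', 'Lucy'],
        'Payroll': [15, 15, 15, 15, 75, 75, 75, 75, 75],
        'Week': [1, 2, 3, 4, 1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

last_week = df['Week'].max()

# First and last week seen for each payroll number, broadcast to every row.
weeks = df.groupby('Payroll')['Week']
first = weeks.transform('min')
final = weeks.transform('max')

is_starter = df['Week'] == first
# A leaver row is the person's final week, when that week is before the overall last week.
is_leaver = (df['Week'] == final) & (final < last_week)

# Conditions are checked in order, so 'Starter' wins if both are true for a row.
df['Status'] = np.select([is_starter, is_leaver], ['Starter', 'Leaver'], default='Active')
print(df)
```

One design choice to be aware of: a person who appears in a single week only is labelled Starter rather than Leaver, because the starter condition is listed first.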

Python: how to drop rows in Pandas if two columns don't appear in another pandas column?

I have two dataframes df and df1. df contains name and attributes of people.
df Name Age
0 Jack 33
1 Anna 25
2 Emilie 49
3 Frank 19
4 John 42
while df1 contains the info of the number of contacts between two people. In df1 we can have some people that don't appear in df.
df1 Name1 Name2 c
0 Frank Paul 2
1 Julia Anna 5
2 Frank John 1
3 Emilie Jack 3
4 Tom Steven 2
5 Tom Jack 5
I want to drop all the rows from df1 where Name1 or Name2 doesn't appear in df.
df1 Name1 Name2 c
0 Frank John 1
1 Emilie Jack 3
Use isin (pass a list, because DataFrame.isin aligns on the index when given a Series):
df1[df1[['Name1', 'Name2']].isin(df.Name.tolist()).all(axis=1)]
# Name1 Name2 c
#2 Frank John 1
#3 Emilie Jack 3
Or:
df1[df1.Name1.isin(df.Name) & df1.Name2.isin(df.Name)]
# Name1 Name2 c
#2 Frank John 1
#3 Emilie Jack 3
You can also use np.isin:
df1[np.isin(df1.Name1, df.Name) &
np.isin(df1.Name2, df.Name)]
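To make the filtering easy to try, here is a self-contained sketch using the question's data. The names `known`, `mask` and `filtered` are illustrative, and the final `reset_index(drop=True)` is an added step to reproduce the 0, 1 index shown in the desired output.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jack', 'Anna', 'Emilie', 'Frank', 'John'],
    'Age': [33, 25, 49, 19, 42],
})
df1 = pd.DataFrame({
    'Name1': ['Frank', 'Julia', 'Frank', 'Emilie', 'Tom', 'Tom'],
    'Name2': ['Paul', 'Anna', 'John', 'Jack', 'Steven', 'Jack'],
    'c': [2, 5, 1, 3, 2, 5],
})

# Names that are allowed to appear in df1.
known = set(df['Name'])

# Keep a row only if BOTH names are present in df.
mask = df1['Name1'].isin(known) & df1['Name2'].isin(known)
filtered = df1[mask].reset_index(drop=True)
print(filtered)
```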

Pandas intersection of groups

Hi, I'm trying to find the unique players who show up in every Team.
df =
Team Player Number
A Joe 8
A Mike 10
A Steve 11
B Henry 9
B Steve 19
B Joe 4
C Mike 18
C Joe 6
C Steve 18
C Dan 1
C Henry 3
and the result should be:
result =
Team Player Number
A Joe 8
A Steve 11
B Joe 4
B Steve 19
C Joe 6
C Steve 18
since Joe and Steve are the only players who appear in every Team.
You can use a GroupBy.transform to get a count of unique teams that each player is a member of, and compare this to the overall count of unique teams. This will give you a Boolean array, which you can use to filter your DataFrame:
df = df[df.groupby('Player')['Team'].transform('nunique') == df['Team'].nunique()]
The resulting output:
Team Player Number
0 A Joe 8
2 A Steve 11
4 B Steve 19
5 B Joe 4
7 C Joe 6
8 C Steve 18
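Since the question is phrased as an intersection of groups, the same result can also be sketched with plain Python sets. This is an alternative to the `transform('nunique')` answer above, not a replacement for it; the names `players_per_team` and `common` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': list('AAABBBCCCCC'),
    'Player': ['Joe', 'Mike', 'Steve', 'Henry', 'Steve', 'Joe',
               'Mike', 'Joe', 'Steve', 'Dan', 'Henry'],
    'Number': [8, 10, 11, 9, 19, 4, 18, 6, 18, 1, 3],
})

# One set of players per team.
players_per_team = df.groupby('Team')['Player'].apply(set)

# Players present in every team's set.
common = set.intersection(*players_per_team)

# Keep only the rows for those players.
result = df[df['Player'].isin(common)]
print(result)
```

The set version makes the list of common players available as a value in its own right, which can be handy if it is needed elsewhere; the `transform` version avoids building Python sets and is likely the better fit for large frames.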
