I am trying to fill a column based on a condition. Can you please help me with how to do this?
Example:
df:
Name Age
0 Tom 20
1 nick 21
2 nick 19
3 jack 18
4 shiv 21
5 shiv 22
6 jim 23
I have created the dataframe with one more column:
df['New'] = df['Name'].shift()
Name Age New
0 Tom 20 NaN
1 nick 21 Tom
2 nick 19 nick
3 jack 18 nick
4 shiv 21 jack
5 shiv 22 shiv
6 jim 23 shiv
Expected Output:
Name Age New order
0 Tom 20 NaN 1
1 nick 21 Tom 2
2 nick 19 nick 2
3 jack 18 nick 3
4 shiv 21 jack 4
5 shiv 22 shiv 4
6 jim 23 shiv 5
Condition:
If Name matches the New column, look at the previous row's number and fill in the same number; otherwise fill in the next number.
It is quite similar to dense_rank(), but I don't want to use the dense_rank concept here. So is there any way to fill this column?
Use .cumsum() over a boolean Series:
df['order'] = (df['Name'] != df['Name'].shift()).cumsum()
print(df)
Prints:
Name Age order
0 Tom 20 1
1 nick 21 2
2 nick 19 2
3 jack 18 3
4 shiv 21 4
5 shiv 22 4
6 jim 23 5
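To see why this works: the comparison against the shifted column marks each row where a new run of names begins, and the cumulative sum of those boolean markers turns them into consecutive group numbers. A minimal sketch using the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'nick', 'nick', 'jack', 'shiv', 'shiv', 'jim'],
                   'Age': [20, 21, 19, 18, 21, 22, 23]})

# True wherever a row's Name differs from the previous row's Name
starts = df['Name'] != df['Name'].shift()
# Summing booleans: each True increments the running group number
df['order'] = starts.cumsum()
print(df['order'].tolist())  # [1, 2, 2, 3, 4, 4, 5]
```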
Related
I have two lists of unequal length:
Name = ['Tom', 'Jack', 'Nick', 'Juli', 'Harry']
bId= list(range(0,3))
I want to build a data frame that would look like below:
'Name' 'bId'
Tom 0
Tom 1
Tom 2
Jack 0
Jack 1
Jack 2
Nick 0
Nick 1
Nick 2
Juli 0
Juli 1
Juli 2
Harry 0
Harry 1
Harry 2
Please suggest.
Use itertools.product with DataFrame constructor:
from itertools import product
import pandas as pd

df = pd.DataFrame(product(Name, bId), columns=['Name','bId'])
print(df)
Name bId
0 Tom 0
1 Tom 1
2 Tom 2
3 Jack 0
4 Jack 1
5 Jack 2
6 Nick 0
7 Nick 1
8 Nick 2
9 Juli 0
10 Juli 1
11 Juli 2
12 Harry 0
13 Harry 1
14 Harry 2
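If you would rather stay entirely inside pandas, the same Cartesian product can also be sketched with MultiIndex.from_product (an alternative, not part of the original answer):

```python
import pandas as pd

Name = ['Tom', 'Jack', 'Nick', 'Juli', 'Harry']
bId = list(range(0, 3))

# Build the cross product as a MultiIndex, then flatten it into columns
df = pd.MultiIndex.from_product([Name, bId], names=['Name', 'bId']).to_frame(index=False)
print(len(df))  # 15 rows: 5 names x 3 ids
```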
I'm having trouble with a pandas groupby issue. I have this dataframe:
Letter name num_exercises
A carl 1
A Lenna 2
A Harry 3
A Joe 4
B Carl 5
B Lenna 3
B Harry 3
B Joe 6
C Carl 6
C Lenna 3
C Harry 4
C Joe 7
And I want to add a column to it, called num_exercises_total, which contains the total sum of num_exercises for each letter. Please note that this value must be repeated for each row in the letter group.
The output would be as follows:
Letter name num_exercises num_exercises_total
A carl 1 10
A Lenna 2 10
A Harry 3 10
A Joe 4 10
B Carl 5 17
B Lenna 3 17
B Harry 3 17
B Joe 6 17
C Carl 6 20
C Lenna 3 20
C Harry 4 20
C Joe 7 20
I've tried adding the new column like this:
df['num_exercises_total'] = df.groupby(['Letter'])['num_exercises'].sum()
But it returns the value NaN for all the rows.
Any help would be highly appreciated.
Thank you very much in advance!
You may want to check transform:
df.groupby(['Letter'])['num_exercises'].transform('sum')
0 10
1 10
2 10
3 10
4 17
5 17
6 17
7 17
8 20
9 20
10 20
11 20
Name: num_exercises, dtype: int64
df['num_of_total']=df.groupby(['Letter'])['num_exercises'].transform('sum')
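The reason the original assignment produced NaN is index alignment: groupby(...).sum() collapses to one row per letter, indexed by the group keys, so it cannot align with the frame's integer index, while transform('sum') broadcasts each group's sum back to the original rows. A small sketch with simplified data:

```python
import pandas as pd

df = pd.DataFrame({'Letter': ['A', 'A', 'B', 'B'],
                   'num_exercises': [1, 2, 5, 3]})

# sum() returns one row per group, indexed by 'Letter' ...
print(df.groupby('Letter')['num_exercises'].sum().index.tolist())  # ['A', 'B']

# ... while transform('sum') keeps the original row index, so it assigns cleanly
df['num_exercises_total'] = df.groupby('Letter')['num_exercises'].transform('sum')
print(df['num_exercises_total'].tolist())  # [3, 3, 8, 8]
```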
Transform works perfectly for this question. WenYoBen is right; I am just putting a slightly different version here.
df['num_of_total'] = df['num_exercises'].groupby(df['Letter']).transform('sum')
>>> df
Letter name num_exercises num_of_total
0 A carl 1 10
1 A Lenna 2 10
2 A Harry 3 10
3 A Joe 4 10
4 B Carl 5 17
5 B Lenna 3 17
6 B Harry 3 17
7 B Joe 6 17
8 C Carl 6 20
9 C Lenna 3 20
10 C Harry 4 20
11 C Joe 7 20
Consider the following df
data = {'Name' : ['John','John','Lucy','Lucy','Lucy'],
        'Payroll' : [15,15,75,75,75],
        'Week' : [1,2,1,2,3]}
df = pd.DataFrame(data)
Name Payroll Week
0 John 15 1
1 John 15 2
2 Lucy 75 1
3 Lucy 75 2
4 Lucy 75 3
What I'm attempting to do is apply a Boolean throughout a DataFrame very similar to this one, with 2m+ rows and 20+ columns, to find out when someone started.
To find if someone is active or not I pass a condition to another df:
df2 = df.loc[df.Week == df.Week.max()]
This gives me the final week; I then use an isin filter to find out if the person is active or has left:
df['Status'] = np.where(df['Payroll'].isin(df2['Payroll']), 'Active','Leaver')
So using the above code I get the following, which is great: it tells me that since John is not in the latest week, he has left the company.
Name Payroll Week Status
0 John 15 1 Leaver
1 John 15 2 Leaver
2 Lucy 75 1 Active
3 Lucy 75 2 Active
4 Lucy 75 3 Active
What I'm trying to achieve is to know when John started with us. I could try a mask for each week of the year and an isin to check for when they first appeared, but I figured there must be a more pythonic way to do this!
Desired output:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Leaver
2 Lucy 75 1 Starter
3 Lucy 75 2 Active
4 Lucy 75 3 Active
Any help is much appreciated.
Edit for clarity:
data = {'Name' : ['John','John','John','John','Lucy','Lucy','Lucy','Lucy','Lucy'],
        'Payroll' : [15,15,15,15,75,75,75,75,75],
        'Week' : [1,2,3,4,1,2,3,4,5]}
df = pd.DataFrame(data)
desired output:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Active
2 John 15 3 Active
3 John 15 4 Leaver
4 Lucy 75 1 Starter
5 Lucy 75 2 Active
6 Lucy 75 3 Active
7 Lucy 75 4 Active
8 Lucy 75 5 Active
Things to note:
Max week is 5, so anyone not in week 5 is a Leaver.
A person's first week in the df makes them a Starter.
All weeks in between are set to Active.
Use numpy.select with a new condition created by duplicated:
a = df.loc[df.Week == df.Week.max(), 'Payroll']
m1 = ~df['Payroll'].isin(a)
m2 = ~df['Payroll'].duplicated()
m3 = ~df['Payroll'].duplicated(keep='last')
df['Status'] = np.select([m2, m1 & m3], ['Starter', 'Leaver'], 'Active')
print (df)
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Active
2 John 15 3 Active
3 John 15 4 Leaver
4 Lucy 75 1 Starter
5 Lucy 75 2 Active
6 Lucy 75 3 Active
7 Lucy 75 4 Active
8 Lucy 75 5 Active
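To see how the three masks from the answer above interact, they can be inspected on the clarified sample data. np.select evaluates conditions in order (first match wins), and everything unmatched falls through to the 'Active' default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['John']*4 + ['Lucy']*5,
                   'Payroll': [15]*4 + [75]*5,
                   'Week': [1, 2, 3, 4, 1, 2, 3, 4, 5]})

a = df.loc[df.Week == df.Week.max(), 'Payroll']
m1 = ~df['Payroll'].isin(a)                  # True for anyone missing from the last week
m2 = ~df['Payroll'].duplicated()             # True on each person's first row
m3 = ~df['Payroll'].duplicated(keep='last')  # True on each person's last row
df['Status'] = np.select([m2, m1 & m3], ['Starter', 'Leaver'], 'Active')
print(df['Status'].tolist())
```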
The simplest way that I have come across is using groupby and finding the minimal index for each name in the group:
for _, dfg in df.groupby(df['Name']):
    gidx = min(dfg.index)
    df.loc[df.index == gidx, 'Status'] = 'Starter'
print(df)
And the df is then:
Name Payroll Week Status
0 John 15 1 Starter
1 John 15 2 Leaver
2 Lucy 75 1 Starter
3 Lucy 75 2 Active
4 Lucy 75 3 Active
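The loop above can also be collapsed into a single vectorized assignment, since duplicated() already marks every row after a name's first appearance. A sketch on the original sample data, assuming the Status column was initialised beforehand (simplified here to 'Active'):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'John', 'Lucy', 'Lucy', 'Lucy'],
                   'Payroll': [15, 15, 75, 75, 75],
                   'Week': [1, 2, 1, 2, 3]})

df['Status'] = 'Active'  # stand-in for the Active/Leaver values computed earlier
# ~duplicated() is True only on each name's first row
df.loc[~df['Name'].duplicated(), 'Status'] = 'Starter'
print(df['Status'].tolist())  # ['Starter', 'Active', 'Starter', 'Active', 'Active']
```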
I have a df as below:
Index Site Name
0 Site_1 Tom
1 Site_2 Tom
2 Site_4 Jack
3 Site_8 Rose
5 Site_11 Marrie
6 Site_12 Marrie
7 Site_21 Jacob
8 Site_34 Jacob
I would like to strip the 'Site_' and only leave the number in the "Site" column, as shown below:
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
What is the best way to do this operation?
Using pd.Series.str.extract
This produces a copy with an updated column:
df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
To persist the result, reassign to the DataFrame name:
df = df.assign(Site=df.Site.str.extract(r'\D+(\d+)', expand=False))
Using pd.Series.str.split
df.assign(Site=df.Site.str.split('_', n=1).str[1])
Alternative
Update in place instead of producing a copy:
df.update(df.Site.str.extract(r'\D+(\d+)', expand=False))
# Or
# df.update(df.Site.str.split('_', n=1).str[1])
df
Site Name
Index
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
Make an array consisting of the column names you want, then call:
yourarray = pd.DataFrame(yourpd, columns=yournamearray)
Just call replace on the column to replace all instances of "Site_":
df['Site'] = df['Site'].str.replace('Site_', '')
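One note on str.replace: its default regex behavior has changed across pandas versions, so passing regex=False makes the literal-substring intent explicit (a defensive sketch, not part of the original answer; harmless here since 'Site_' contains no regex metacharacters):

```python
import pandas as pd

s = pd.Series(['Site_1', 'Site_11'])
# regex=False requests a plain literal replacement regardless of pandas version
print(s.str.replace('Site_', '', regex=False).tolist())  # ['1', '11']
```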
Use .apply() to apply a function to each element in a series:
df['Site Name'] = df['Site Name'].apply(lambda x: x.split('_')[-1])
You can use exactly what you wanted (the strip method)
>>> df["Site"] = df.Site.str.strip("Site_")
Output
Index Site Name
0 1 Tom
1 2 Tom
2 4 Jack
3 8 Rose
5 11 Marrie
6 12 Marrie
7 21 Jacob
8 34 Jacob
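One caveat worth knowing about the strip approach: str.strip treats its argument as a set of characters to remove from both ends, not as a literal substring, so strip('Site_') also eats any trailing 'S', 'i', 't', 'e', or '_'. It happens to be safe here because the suffixes are purely numeric, but the pitfall is easy to demonstrate:

```python
# str.strip removes any of the listed characters from both ends
assert 'Site_12'.strip('Site_') == '12'  # works: digits are not in the character set
assert 'Site_7e'.strip('Site_') == '7'   # pitfall: the trailing 'e' is stripped too
```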
I have a pandas dataframe that looks something like this:
df = pd.DataFrame({'Name' : ['Kate', 'John', 'Peter','Kate', 'John', 'Peter'],'Distance' : [23,16,32,15,31,26], 'Time' : [3,5,2,7,9,4]})
df
Distance Name Time
0 23 Kate 3
1 16 John 5
2 32 Peter 2
3 15 Kate 7
4 31 John 9
5 26 Peter 4
I want to add a column that tells me, for each Name, what's the order of the times.
I want something like this:
Order Distance Name Time
0 16 John 5
1 31 John 9
0 23 Kate 3
1 15 Kate 7
0 32 Peter 2
1 26 Peter 4
I can do it using a for loop:
df2 = df[df['Name'] == 'aaa'].reset_index().reset_index() # I did this just to create an empty data frame with the columns I want
for name, row in df.groupby('Name').count().iterrows():
    table = df[df['Name'] == name].sort_values('Time').reset_index().reset_index()
    to_concat = [df2, table]
    df2 = pd.concat(to_concat)
df2.drop('index', axis = 1, inplace = True)
df2.columns = ['Order', 'Distance', 'Name', 'Time']
df2
This works; the problem is (apart from being very unpythonic) that for large tables (my actual table has about 50 thousand rows) it takes about half an hour to run.
Can someone help me write this in a simpler way that runs faster?
I'm sorry if this has been answered somewhere, but I didn't really know how to search for it.
Best,
Use sort_values with cumcount:
df = df.sort_values(['Name','Time'])
df['Order'] = df.groupby('Name').cumcount()
print (df)
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
If need first column use insert:
df = df.sort_values(['Name','Time'])
df.insert(0, 'Order', df.groupby('Name').cumcount())
print (df)
Order Distance Name Time
1 0 16 John 5
4 1 31 John 9
0 0 23 Kate 3
3 1 15 Kate 7
2 0 32 Peter 2
5 1 26 Peter 4
In [67]: df = df.sort_values(['Name','Time']) \
                .assign(Order=df.groupby('Name').cumcount())
In [68]: df
Out[68]:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
PS I'm not sure this is the most elegant way to do this...
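Another option, if you would rather keep the original row order and skip the sort entirely, is a per-group rank on Time (an alternative sketch, not one of the answers above):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kate', 'John', 'Peter', 'Kate', 'John', 'Peter'],
                   'Distance': [23, 16, 32, 15, 31, 26],
                   'Time': [3, 5, 2, 7, 9, 4]})

# rank(method='first') numbers each name's times from 1 without reordering rows;
# subtract 1 to match the 0-based Order column from the question
df['Order'] = df.groupby('Name')['Time'].rank(method='first').astype(int) - 1
print(df['Order'].tolist())  # [0, 0, 0, 1, 1, 1]
```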