Python Pandas: Sequentially Count Up Occurrences for Unique Pairs in a MultiIndex

I have a dataframe logging exercises completed, with a two-column MultiIndex: Day and Person. Each day, each person logs which exercises they did (if they exercised). I would like to add another column that sequentially counts the entries made into this log, as shown below, so that each unique pair of day and person counts up by 1.
Day  Person  Exercise  EntryNumber
1    Joe     Curls     1
1    Joe     Squats    1
1    Sandy   Sprints   2
1    Sandy   Bench     2
2    Joe     Curls     3
2    Sandy   Squats    4
3    Bob     Pushups   5
Here is the code to generate that above dataframe.
import pandas as pd

df = pd.DataFrame({'Day': [1, 1, 1, 1, 2, 2, 3],
                   'Person': ['Joe', 'Joe', 'Sandy', 'Sandy', 'Joe', 'Sandy', 'Bob'],
                   'Exercise': ['Curls', 'Squats', 'Sprints', 'Bench', 'Curls', 'Squats', 'Pushups']})
df = df.set_index(['Day', 'Person'])
How would I go about creating the EntryNumber column? I've tried all manner of groupby and cumcount but have not yet figured it out.
Thanks!

Maybe you can try groupby followed by ngroup():
# Generating df from above
import pandas as pd

df = pd.DataFrame({'Day': [1, 1, 1, 1, 2, 2, 3],
                   'Person': ['Joe', 'Joe', 'Sandy', 'Sandy', 'Joe', 'Sandy', 'Bob'],
                   'Exercise': ['Curls', 'Squats', 'Sprints', 'Bench', 'Curls', 'Squats', 'Pushups']})
df = df.set_index(['Day', 'Person'])

# applying reset_index and ngroup
df.reset_index(inplace=True)
df['Entry Number'] = df.groupby(['Day', 'Person']).ngroup() + 1
df
Result:
   Day Person Exercise  Entry Number
0    1    Joe    Curls             1
1    1    Joe   Squats             1
2    1  Sandy  Sprints             2
3    1  Sandy    Bench             2
4    2    Joe    Curls             3
5    2  Sandy   Squats             4
6    3    Bob  Pushups             5
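
If you'd rather keep the MultiIndex in place (no reset_index), grouping on the index levels should give the same numbering; a minimal sketch (sort=False numbers the groups in order of first appearance):

# sketch: group directly on the index levels instead of resetting the index
df['EntryNumber'] = df.groupby(level=['Day', 'Person'], sort=False).ngroup() + 1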

Another way is to factorize the index, without having to group:
df['EntryNumber'] = df.index.factorize()[0] + 1
# df = df.reset_index()  # if you want to reset the index
print(df)
            Exercise  EntryNumber
Day Person
1   Joe        Curls            1
    Joe       Squats            1
    Sandy    Sprints            2
    Sandy      Bench            2
2   Joe        Curls            3
    Sandy     Squats            4
3   Bob      Pushups            5
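
A quick check (an editorial note, not from either answer): both approaches assign one number per unique (Day, Person) pair, so if the same pair logged again later in the file it would get its original number back rather than a new one. Verifying that the two approaches agree on the question's data:

# compare the two approaches on the df from the question (still indexed by Day/Person)
via_factorize = df.index.factorize()[0] + 1
via_ngroup = df.groupby(level=['Day', 'Person'], sort=False).ngroup().to_numpy() + 1
print((via_factorize == via_ngroup).all())  # True for this data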

Related

Add blank row to a pandas dataframe after every period

I have a pandas dataframe that is quite similar to this:
name    status
eric    single
.       0
xavier  couple
sarah   couple
.       0
aaron   divorced
.       0
I would like to add a new blank row after every period, as below:
name    status
eric    single
.       0

xavier  couple
sarah   couple
.       0

aaron   divorced
.       0
Appreciate any guidance on this!
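For reference, a minimal reconstruction of the input frame (values read off the table above; the 0 under status for the '.' rows is kept as shown):

import pandas as pd

# hypothetical reconstruction of the question's frame, based on the table above
df = pd.DataFrame({'name':   ['eric', '.', 'xavier', 'sarah', '.', 'aaron', '.'],
                   'status': ['single', 0, 'couple', 'couple', 0, 'divorced', 0]})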
You can use groupby and, for each group, apply a concatenation with a dummy blank row:
(df
 .groupby(df['name'].shift().eq('.').cumsum(), group_keys=False)
 .apply(lambda g: pd.concat([g, pd.DataFrame(index=[0], columns=g.columns)]).fillna(''))
)
output:
     name    status
0    eric    single
1       .         0
0
2  xavier    couple
3   sarah    couple
4       .         0
0
5   aaron  divorced
6       .         0
0
Or extract the rows with '.' and concat:
df2 = df[df['name'].eq('.')].copy()
df2.loc[:] = ''
pd.concat([df, df2]).sort_index(kind='stable')
output:
     name    status
0    eric    single
1       .         0
1
2  xavier    couple
3   sarah    couple
4       .         0
4
5   aaron  divorced
6       .         0
6
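
Both outputs keep the original index labels (hence the repeated 0/1/4/6). If a clean 0..n index is wanted, either result can be finished off with reset_index; for example, with df2 from the second approach:

out = pd.concat([df, df2]).sort_index(kind='stable').reset_index(drop=True)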

Find the indices of rows in one data frame that exist in another data frame [duplicate]

I have two dataframes.
df1:
   Name   Place  Price
0  John      NY      0
1  Alex  London     10
2   Bob  Sydney     20
3  Will  Munich     15
4  Alex  London     10
df2:
   Name   Place  Price
0  John      NY      0
1  Alex  London     10
2   Tim      HK      6
I want an output as follows:
df2:
   Name   Place  Price   Index
0  John      NY      0     [0]
1  Alex  London     10  [1, 4]
2   Tim      HK      6      []   (empty list)
I tried:
index_list = []
for _, row in df2.iterrows():
    i = df1[(df1['Name'] == row['Name']) & (df1['Place'] == row['Place']) & (df1['Price'] == row['Price'])].index
    index_list.append(i.to_list())
df2['Index'] = index_list
Is there an efficient (and elegant) way to do this?
Not sure if this is what you have in mind; a merge, coupled with a groupby works:
grouper = [*df2.columns]
(df1.reset_index()
    .merge(df2, on=grouper)
    .groupby(grouper, sort=False, as_index=False)['index']
    .agg(list)
)
   Name   Place  Price   index
0  John      NY      0     [0]
1  Alex  London     10  [1, 4]
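
One caveat (my observation, not part of the answer): the inner merge drops df2 rows with no match in df1, which is why Tim does not appear in the output above. If unmatched rows should keep an empty list, as in the desired output, one option is to look the index lists up from df1's groups. A sketch, assuming df1 and df2 as defined in the question:

keys = ['Name', 'Place', 'Price']
idx_map = df1.groupby(keys).groups  # maps each (Name, Place, Price) tuple to the matching df1 index labels
df2['Index'] = [list(idx_map.get(key, []))
                for key in df2[keys].itertuples(index=False, name=None)]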

Add the same row multiple times from a pandas dataframe to a new one, each time altering a value in a specific column

I have a df like this:
   MEMBER_ID FirstName LastName   I  MONTH
0          1      John      Doe  10      0
1          2      Mary    Jones  15      0
2          3      Andy    Right   8      0
I need to create a new df (df_new) that contains each row corresponding to a unique MEMBER_ID, replicated the number of times given in its 'I' column, with the 'MONTH' column filled from 0 up to and including the value of 'I' in the original df. For example, the first row (MEMBER_ID == 1) has to be replicated 10 times (the value of 'I'), and the only difference is the 'MONTH' column, which goes from 0 to 10. After that, the rows continue for the next unique value in the 'MEMBER_ID' column.
So I need the df_new to look like this:
     MEMBER_ID FirstName LastName   I  MONTH
0            1      John      Doe  10      0
1            1      John      Doe  10      1
2            1      John      Doe  10      2
3            1      John      Doe  10      3
...
10           1      John      Doe  10     10
11           2      Mary    Jones  15      0
12           2      Mary    Jones  15      1
13           2      Mary    Jones  15      2
...
N-1          3      Andy    Right   8      7
N            3      Andy    Right   8      8
I have tried this but it gives me gibberish:
df_new = pd.DataFrame(columns=['MEMBER_ID', 'FirstName', 'LastName', 'I', 'MONTH'])
for i in range(len(df)):
    max_i = df.iloc[i]["I"]  # this gets the value in the "I" column
    for j in range(0, max_i + 1):  # to append same row max_i+1 times since I need MONTH to start with 0
        df_new.loc[i] = df.iloc[i]  # this picks the whole row from the original df
        df_new["MONTH"] = j  # this assigns the value of each iteration to the MONTH column
        df_new = df_new.append(df_new.loc[i], ignore_index=True)
Thank you for your help, dear community!
I was able to fix the SettingWithCopyWarning with this:
index = 0
for i in range(len(df)):
    for j in range(df.iloc[i]["I"] + 1):
        row = df.iloc[i]
        df_new = df_new.append(row, ignore_index=True)
        df_new.at[index, 'MONTH'] = j
        index += 1
df.head()
The problem is that you overwrite df_new many times.
This should work. df is the old DataFrame:
df_new = pd.DataFrame()
for member in range(len(df)):  # iterate over every member
    for count in range(df.iloc[member]['I'] + 1):  # you want to add 'I'+1 rows
        row = df.iloc[member]  # select the row you want to add
        row['MONTH'] = count  # change the MONTH value of the row to add
        df_new = df_new.append(row, ignore_index=True)  # add the row to the new DataFrame
df_new
Otherwise, please show what's wrong with the output.
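
As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and appending row by row gets slow as the frame grows. A vectorized sketch of the same idea, using index.repeat plus a per-member cumcount (assuming df as shown in the question):

# repeat each row I + 1 times, then number the copies 0..I within each MEMBER_ID
df_new = df.loc[df.index.repeat(df['I'] + 1)].reset_index(drop=True)
df_new['MONTH'] = df_new.groupby('MEMBER_ID').cumcount()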

How can I generate a new column to group by membership in Pandas?

I have a dataframe:
df = pd.DataFrame({'name':['John','Fred','John','George','Fred']})
How can I transform this to generate a new column giving me group membership by value? Such that:
new_df = pd.DataFrame({'name':['John','Fred','John','George','Fred'], 'group':[1,2,1,3,2]})
Use factorize:
df['group'] = pd.factorize(df['name'])[0] + 1
print (df)
     name  group
0    John      1
1    Fred      2
2    John      1
3  George      3
4    Fred      2
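
groupby(...).ngroup() gives the same numbering; with sort=False it follows order of first appearance, matching factorize (a sketch):

df['group'] = df.groupby('name', sort=False).ngroup() + 1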

Sort Pandas Dataframe by substrings of a column

Given a DataFrame:
    name             email
0   Carl    carl@yahoo.com
1    Bob     bob@gmail.com
2  Alice   alice@yahoo.com
3  David  dave@hotmail.com
4    Eve     eve@gmail.com
How can it be sorted according to the email's domain name (alphabetically, ascending), and then, within each domain group, according to the string before the "@"?
The result of sorting the above should then be:
    name             email
0    Bob     bob@gmail.com
1    Eve     eve@gmail.com
2  David  dave@hotmail.com
3  Alice   alice@yahoo.com
4   Carl    carl@yahoo.com
Use:
df = df.reset_index(drop=True)
idx = df['email'].str.split('@', expand=True).sort_values([1, 0]).index
df = df.reindex(idx).reset_index(drop=True)
print (df)
    name             email
0    Bob     bob@gmail.com
1    Eve     eve@gmail.com
2  David  dave@hotmail.com
3  Alice   alice@yahoo.com
4   Carl    carl@yahoo.com
Explanation:
First, reset_index with drop=True for a unique default index
Then split the values into a new DataFrame and sort_values
Last, reindex to the new order
Option 1
sorted + reindex
df = df.set_index('email')
df.reindex(sorted(df.index, key=lambda x: x.split('@')[::-1])).reset_index()
              email   name
0     bob@gmail.com    Bob
1     eve@gmail.com    Eve
2  dave@hotmail.com  David
3   alice@yahoo.com  Alice
4    carl@yahoo.com   Carl
Option 2
sorted + pd.DataFrame
As an alternative, you can ditch the reindex call from Option 1 by creating a new DataFrame directly.
pd.DataFrame(
    sorted(df.values, key=lambda x: x[1].split('@')[::-1]),
    columns=df.columns
)
    name             email
0    Bob     bob@gmail.com
1    Eve     eve@gmail.com
2  David  dave@hotmail.com
3  Alice   alice@yahoo.com
4   Carl    carl@yahoo.com
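
Alternatively (not from the answers above), two helper columns plus a single sort_values call avoid sorting by hand; a sketch, assuming the original df with name and email columns:

out = (df.assign(domain=df['email'].str.split('@').str[1],
                 local=df['email'].str.split('@').str[0])
         .sort_values(['domain', 'local'])
         .drop(columns=['domain', 'local'])
         .reset_index(drop=True))
print(out)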
