Add blank row to a pandas dataframe after every period - python

I have a pandas dataframe that is quite similar to this:

name      status
eric      single
.         0
xavier    couple
sarah     couple
.         0
aaron     divorced
.         0
I would like to add a new (blank) row after every period, as below:

name      status
eric      single
.         0

xavier    couple
sarah     couple
.         0

aaron     divorced
.         0
Appreciate any guidance on this!
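For reference, the answers below assume a frame built roughly like this (a minimal sketch reconstructed from the table above; the real data may differ):

import pandas as pd

# sample frame matching the question's table
df = pd.DataFrame({
    'name':   ['eric', '.', 'xavier', 'sarah', '.', 'aaron', '.'],
    'status': ['single', '0', 'couple', 'couple', '0', 'divorced', '0'],
})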

You can use groupby, then apply a concatenation of each group with a dummy row:
(df
 .groupby(df['name'].shift().eq('.').cumsum(), group_keys=False)
 .apply(lambda g: pd.concat([g, pd.DataFrame(index=[0], columns=g.columns)]).fillna(''))
)
output:
     name    status
0    eric    single
1       .         0
0
2  xavier    couple
3   sarah    couple
4       .         0
0
5   aaron  divorced
6       .         0
0
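The grouping key is the interesting part: shift() looks at the previous row, eq('.') flags the row immediately after each period, and cumsum() turns those flags into group numbers, so every group ends on a '.' row. A quick check against the sample frame above:

key = df['name'].shift().eq('.').cumsum()
print(key.tolist())  # [0, 0, 1, 1, 1, 2, 2] -> three groups, each ending on a '.' row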
Or extract the rows with '.', blank them out, and concat:
df2 = df[df['name'].eq('.')].copy()
df2.loc[:] = ''
pd.concat([df, df2]).sort_index(kind='stable')
output:
     name    status
0    eric    single
1       .         0
1
2  xavier    couple
3   sarah    couple
4       .         0
4
5   aaron  divorced
6       .         0
6
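Both approaches leave duplicate index labels behind; if you would rather have a clean 0..n-1 index, a final reset_index(drop=True) (a small sketch on the second approach) renumbers the rows:

out = pd.concat([df, df2]).sort_index(kind='stable').reset_index(drop=True)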

Related

Python Transpose data and change to 0,1's

I have some data in the following format, thousands of rows.
I want to transpose the data and also change the format to 1s and 0s:

Name  Codes
Dave  DSFFS
Dave  SDFDF
stu   SDFDS
stu   DSGDSG

I want to keep the Name column as rows, but spread the Codes column into columns of 1s and 0s.
You can use df.transpose() in pandas!
pd.get_dummies() should help here.
df = pd.DataFrame({'Name': ['Dave', 'Dave', 'stu', 'stu'],
                   'Codes': ['DSFFS', 'SDFDF', 'SDFDS', 'DSGDSG']})
print(df)

    Codes  Name
0   DSFFS  Dave
1   SDFDF  Dave
2   SDFDS   stu
3  DSGDSG   stu

print(pd.get_dummies(df, columns=['Codes']))

   Name  Codes_DSFFS  Codes_DSGDSG  Codes_SDFDF  Codes_SDFDS
0  Dave            1             0            0            0
1  Dave            0             0            1            0
2   stu            0             0            0            1
3   stu            0             1            0            0
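If you also want one row per Name, as the question's "retain the Name column in row format" suggests, a follow-up groupby can collapse the dummies; a sketch, assuming taking the max per Name is the intended semantics:

# one row per Name, 1 wherever the code appears for that person
out = pd.get_dummies(df, columns=['Codes']).groupby('Name', as_index=False).max()
print(out)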

Python Pandas Sequentially Count Up Occurrences For Unique Pairs in Multiindex

I have a dataframe logging exercises completed, with a two column multiindex: Day and Person. Each day, each person logs which exercises they did (if they exercised). I would like to add another column which sequentially counts the entries made into this log, as shown below. So for each unique pair of day and person, count up by 1.
Day  Person  Exercise  EntryNumber
1    Joe     Curls     1
1    Joe     Squats    1
1    Sandy   Sprints   2
1    Sandy   Bench     2
2    Joe     Curls     3
2    Sandy   Squats    4
3    Bob     Pushups   5
Here is the code to generate the dataframe above:

import pandas as pd

df = pd.DataFrame({'Day': [1, 1, 1, 1, 2, 2, 3],
                   'Person': ['Joe', 'Joe', 'Sandy', 'Sandy', 'Joe', 'Sandy', 'Bob'],
                   'Exercise': ['Curls', 'Squats', 'Sprints', 'Bench', 'Curls', 'Squats', 'Pushups']})
df = df.set_index(['Day', 'Person'])
How would I go about creating the EntryNumber column? I've tried all manner of groupby and cumcount but have not yet figured it out.
Thanks!
Maybe you can try groupby followed by ngroup():
# generating df from above
import pandas as pd

df = pd.DataFrame({'Day': [1, 1, 1, 1, 2, 2, 3],
                   'Person': ['Joe', 'Joe', 'Sandy', 'Sandy', 'Joe', 'Sandy', 'Bob'],
                   'Exercise': ['Curls', 'Squats', 'Sprints', 'Bench', 'Curls', 'Squats', 'Pushups']})
df = df.set_index(['Day', 'Person'])

# applying reset_index and ngroup
df.reset_index(inplace=True)
df['Entry Number'] = df.groupby(['Day', 'Person']).ngroup() + 1
df
Result:

   Day Person Exercise  Entry Number
0    1    Joe    Curls             1
1    1    Joe   Squats             1
2    1  Sandy  Sprints             2
3    1  Sandy    Bench             2
4    2    Joe    Curls             3
5    2  Sandy   Squats             4
6    3    Bob  Pushups             5
Another way is to factorize the index, without having to group:

df['EntryNumber'] = df.index.factorize()[0] + 1
# df = df.reset_index()  # if you want to reset the index
print(df)
           Exercise  EntryNumber
Day Person
1   Joe       Curls            1
    Joe      Squats            1
    Sandy   Sprints            2
    Sandy     Bench            2
2   Joe       Curls            3
    Sandy    Squats            4
3   Bob     Pushups            5
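A caveat on the two approaches: ngroup() numbers groups in sorted-key order under the default groupby(sort=True), while factorize() numbers them in order of first appearance; they agree here only because the log is already sorted by Day and Person. If order of appearance is what you want regardless of sorting, a sketch with sort=False makes that explicit:

df['EntryNumber'] = df.groupby(['Day', 'Person'], sort=False).ngroup() + 1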

Efficient method for generating a list of values from a column in a data frame based on common secondary columns

I have a data frame (df) in Python with 4 columns (ID, Status, Person, Output). Each ID is repeated 6 times and the Output is the same for each ID. For each ID, the Status will be On/Off (3 of each).
I need to generate a new column with a list of people for each unique ID/Status combination. I also need a second new column with a group ID for each unique list of people.
This is my current code which works but is very slow when working with a large data frame due to the apply(list) function. Is there a more efficient way to do this?
people = df.groupby(['ID', 'Status'])['Person'].apply(list).reset_index(name='Names_ID')
people['Group_ID'] = people['Names_ID'].rank(method='dense')
df = df.drop_duplicates(subset=['ID', 'Status'])
df = df.merge(people, on=['ID', 'Status'])
Here is an example input data frame:
df =

ID  Status  Person   Output
0   On      John     1
0   On      Mark     1
0   On      Michael  1
0   Off     Peter    1
0   Off     Tim      1
0   Off     Jake     1
1   On      Peter    0.5
1   On      Dennis   0.5
1   On      Jasper   0.5
1   Off     John     0.5
1   Off     Mark     0.5
1   Off     Michael  0.5
2   On      John     2
2   On      Mark     2
2   On      Larry    2
2   Off     Peter    2
2   Off     Dennis   2
2   Off     Jasper   2
The desired output is:
df =

ID  Status  People                   Group_ID  Output
0   On      [John, Mark, Michael]    0         1
0   Off     [Peter, Tim, Jake]       1         1
1   On      [Peter, Dennis, Jasper]  2         0.5
1   Off     [John, Mark, Michael]    0         0.5
2   On      [John, Mark, Larry]      3         2
2   Off     [Peter, Dennis, Jasper]  2         2
Try this:
df_out = df.groupby(['ID', 'Status'])['Person'].apply(list).reset_index()
df_out['Group_ID'] = pd.factorize(df_out['Person'].apply(tuple))[0]
df_out
Output:
   ID Status                   Person  Group_ID
0   0    Off       [Peter, Tim, Jake]         0
1   0     On    [John, Mark, Michael]         1
2   1    Off    [John, Mark, Michael]         1
3   1     On  [Peter, Dennis, Jasper]         2
4   2    Off  [Peter, Dennis, Jasper]         2
5   2     On      [John, Mark, Larry]         3
OR
df_out = df.groupby(['ID', 'Status'])['Person'].apply(', '.join).reset_index()
df_out['Group_ID'] = pd.factorize(df_out['Person'])[0]
df_out
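Note that the question's desired output also keeps the Output column, which the two snippets above drop. Since Output is constant per ID, one option (a sketch along the lines of the answer further down) is to carry it through the grouping keys:

df_out = (df.groupby(['ID', 'Status', 'Output'], sort=False)['Person']
            .apply(list)
            .reset_index())
df_out['Group_ID'] = pd.factorize(df_out['Person'].apply(tuple))[0]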
import pandas as pd
df = pd.read_clipboard()
df
One method is to use shift twice and join the three columns into a list. Then use groupby to figure out the Group_ID and merge it back into the dataframe.
df['Person1'] = df['Person'].shift(-1)
df['Person2'] = df['Person'].shift(-2)
df['People'] = '[' + df['Person'] + ',' + df['Person1'] + ',' + df['Person2'] + ']'

mult_3 = []
for i in df.index:
    if i == 0 or i % 3 == 0:
        mult_3.append(i)

df = df.loc[df.index.isin(mult_3)].drop(['Person', 'Person1', 'Person2'], axis=1)
df_people = df.groupby('People').Status.count().reset_index().drop(['Status'], axis=1).reset_index()
df = df.merge(df_people, how='left', on='People').rename(columns={'index': 'Group_ID'})
df = df[['ID', 'Status', 'People', 'Group_ID', 'Output']]
df
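As an aside, the multiple-of-3 loop above can be replaced by a vectorised mask, since i == 0 is already covered by i % 3 == 0 (a sketch, assuming the default RangeIndex):

# keep every third row, starting at 0
df = df[df.index % 3 == 0].drop(['Person', 'Person1', 'Person2'], axis=1)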
Python 3.7.6 and Pandas 1.0.3: the bottleneck here is probably the apply calls.

people = df.groupby(['ID', 'Status', 'Output'])['Person'].apply(list).reset_index(name='People')
people['Group_ID'] = people['People'].apply(str).astype('category').cat.codes
Output:
ID Status Output People Group_ID
0 0 Off 1 [Peter, Tim, Jake] 3
1 0 On 1 [John, Mark, Michael] 1
2 1 Off 0.5 [John, Mark, Michael] 1
3 1 On 0.5 [Peter, Dennis, Jasper] 2
4 2 Off 2 [Peter, Dennis, Jasper] 2
5 2 On 2 [John, Mark, Larry] 0

Check if value from one dataframe exists in another dataframe

I have 2 dataframes.
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
I want to loop over every row in Df1['name'] and check if each name is somewhere in Df2['IDs'].
The result should return 1 if the name is in there and 0 if it is not, like so:
Marc 1
Jake 1
Sam 0
Brad 0
Thank you.
Use isin:

Df1.name.isin(Df2.IDs).astype(int)

0    1
1    1
2    0
3    0
Name: name, dtype: int32

Show the result in a dataframe:

Df1.assign(InDf2=Df1.name.isin(Df2.IDs).astype(int))

   name  InDf2
0  Marc      1
1  Jake      1
2   Sam      0
3  Brad      0

Or in a Series object:

pd.Series(Df1.name.isin(Df2.IDs).values.astype(int), Df1.name.values)

Marc    1
Jake    1
Sam     0
Brad    0
dtype: int32
This should do it:
Df1 = Df1.assign(result=Df1['name'].isin(Df2['IDs']).astype(int))
By using merge:

s = Df1.merge(Df2, left_on='name', right_on='IDs', how='left')
s.IDs = s.IDs.notnull().astype(int)
s
Out[68]:
   name  IDs
0  Marc    1
1  Jake    1
2   Sam    0
3  Brad    0
This is one way. Convert to set for O(1) lookup and use astype(int) to represent Boolean values as integers.
values = set(Df2['IDs'])
Df1['Match'] = Df1['name'].isin(values).astype(int)
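A further variant is merge with indicator=True, which tags each row as 'both' or 'left_only'; a sketch, assuming Df2['IDs'] has no duplicates so the merge stays one row per name:

m = Df1.merge(Df2, left_on='name', right_on='IDs', how='left', indicator=True)
Df1['Match'] = m['_merge'].eq('both').astype(int)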

Sort Pandas Dataframe by substrings of a column

Given a DataFrame:
    name             email
0   Carl    carl#yahoo.com
1    Bob     bob#gmail.com
2  Alice   alice#yahoo.com
3  David  dave#hotmail.com
4    Eve     eve#gmail.com
How can it be sorted according to the email's domain name (alphabetically, ascending), and then, within each domain group, according to the string before the "#"?
The result of sorting the above should then be:
    name             email
0    Bob     bob#gmail.com
1    Eve     eve#gmail.com
2  David  dave#hotmail.com
3  Alice   alice#yahoo.com
4   Carl    carl#yahoo.com
Use:
df = df.reset_index(drop=True)
idx = df['email'].str.split('#', expand=True).sort_values([1,0]).index
df = df.reindex(idx).reset_index(drop=True)
print(df)

    name             email
0    Bob     bob#gmail.com
1    Eve     eve#gmail.com
2  David  dave#hotmail.com
3  Alice   alice#yahoo.com
4   Carl    carl#yahoo.com
Explanation:
First, reset_index with drop=True for unique default indices.
Then split the values into a new DataFrame and sort_values.
Last, reindex to the new order.
Option 1
sorted + reindex
df = df.set_index('email')
df.reindex(sorted(df.index, key=lambda x: x.split('#')[::-1])).reset_index()
              email   name
0     bob#gmail.com    Bob
1     eve#gmail.com    Eve
2  dave#hotmail.com  David
3   alice#yahoo.com  Alice
4    carl#yahoo.com   Carl
Option 2
sorted + pd.DataFrame
As an alternative, you can ditch the reindex call from Option 1 by re-creating a new DataFrame.
pd.DataFrame(
    sorted(df.values, key=lambda x: x[1].split('#')[::-1]),
    columns=df.columns
)

    name             email
0    Bob     bob#gmail.com
1    Eve     eve#gmail.com
2  David  dave#hotmail.com
3  Alice   alice#yahoo.com
4   Carl    carl#yahoo.com
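On pandas 1.1+, sort_values also accepts a key callable, but a version that works on any version is to sort on temporary helper columns (a sketch; _local and _domain are hypothetical names, and the '#' separator matches the question):

parts = df['email'].str.split('#', expand=True)  # column 0 = local part, column 1 = domain
out = (df.assign(_local=parts[0], _domain=parts[1])
         .sort_values(['_domain', '_local'])
         .drop(columns=['_local', '_domain'])
         .reset_index(drop=True))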
