Split single column into two based on column values - python

I have a dataframe that looks like this:
Supervisor  Score
Bill        Pass
Bill        Pass
Susan       Fail
Susan       Fail
Susan       Fail
I would like to do some aggregates (such as getting the % of Pass by supervisor) and would like to split up the Score column so that all the Pass flags are in one column and all the Fail flags are in another, like this:
Supervisor  Pass  Fail
Bill        0     1
Bill        0     1
Susan       1     0
Susan       1     0
Susan       1     0
Any ideas? Would a simple groupby work by grouping both the supervisor and score columns and getting a count of Score?

pd.get_dummies
Removes the columns you specify from your DataFrame and replaces each with N dummy columns, named 'OrigName_UniqueVal' by default. Specifying empty strings for the prefix and separator gives you column headers that are just the unique values.
pd.get_dummies(df, columns=['Score'], prefix_sep='', prefix='')
Supervisor Fail Pass
0 Bill 0 1
1 Bill 0 1
2 Susan 1 0
3 Susan 1 0
4 Susan 1 0
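With the dummies in place, the pass rate per supervisor is one groupby away, since the mean of a 0/1 column is the share of rows in that category. A minimal sketch reusing the frame above:
dummies = pd.get_dummies(df, columns=['Score'], prefix_sep='', prefix='')
# Mean of the 0/1 indicator columns = fraction of rows in each category
dummies.groupby('Supervisor')[['Pass', 'Fail']].mean()
#            Pass  Fail
# Supervisor
# Bill        1.0   0.0
# Susan       0.0   1.0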
If in the end you just want the % of each category by supervisor, then you don't really need the dummies; you can groupby. I use a reindex to ensure the resulting DataFrame has each category represented for each Supervisor.
(df.groupby(['Supervisor']).Score.value_counts(normalize=True)
   .reindex(pd.MultiIndex.from_product([df.Supervisor.unique(), df.Score.unique()]))
   .fillna(0))
#Bill Pass 1.0
# Fail 0.0
#Susan Pass 0.0
# Fail 1.0
#Name: Score, dtype: float64

IIUC, you want DataFrame.pivot_table + DataFrame.join
new_df = df[['Supervisor']].join(df.pivot_table(columns='Score',
                                                index=df.index,
                                                values='Supervisor',
                                                aggfunc='count',
                                                fill_value=0))
print(new_df)
Supervisor Fail Pass
0 Bill 0 1
1 Bill 0 1
2 Susan 1 0
3 Susan 1 0
4 Susan 1 0
For the output you expect:
new_df = df[['Supervisor']].join(df.pivot_table(columns='Score',
                                                index=df.index,
                                                values='Supervisor',
                                                aggfunc='count',
                                                fill_value=0)
                                   .eq(0)
                                   .astype(int))
print(new_df)
Supervisor Fail Pass
0 Bill 1 0
1 Bill 1 0
2 Susan 0 1
3 Susan 0 1
4 Susan 0 1
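A shorter route to the per-row dummies in the first result, assuming Score holds plain strings, is Series.str.get_dummies (a sketch, not part of the original answer):
new_df = df[['Supervisor']].join(df['Score'].str.get_dummies())
print(new_df)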

Let's try this one:
df = pd.DataFrame({'Supervisor': ['Bill', 'Bill', 'Susan', 'Susan', 'Susan'],
                   'Score': ['Pass', 'Pass', 'Fail', 'Fail', 'Fail']}).set_index('Supervisor')
pd.get_dummies(df['Score'])
PANDAS 100 tricks
For more pandas tricks, see: https://www.kaggle.com/python10pm/pandas-100-tricks

To get the df you want, you can do it like this:
df["Pass"] = df["Score"].apply(lambda x: 0 if x == "Pass" else 1)
df["Fail"] = df["Score"].apply(lambda x: 0 if x == "Fail" else 1)


Pandas Concat dataframes with Duplicates

I am facing issues concatenating two DataFrames of different lengths. Below is the issue:
df1 =
emp_id emp_name counts
1 sam 0
2 joe 0
3 john 0
df2 =
emp_id emp_name counts
1 sam 0
2 joe 0
2 joe 1
3 john 0
My Expected output is:
Please note that my expectation is not to merge the two dataframes into one; I would like to concat them side by side and highlight the differences in such a way that, if there is a duplicate row in one df (e.g. df2), the corresponding row of df1 shows NaN/blank/None (any null-like value).
Expected_output_df =
df1 df2
empId emp_name counts emp_id emp_name counts
1 sam 0 1 sam 0
2 joe 0 2 joe 0
NaN NaN NaN 2 joe 1
3 john 0 3 john 0
whereas I am getting the output below:
actual_output_df = pd.concat([df1, df2], axis='columns', keys=['df1', 'df2'])
The above code gives me the DataFrame shown next; how can I get the one in the expected output?
actual_output_df =
df1 df2
empId emp_name counts emp_id emp_name counts
1 sam 0 1 sam 0
2 joe 0 2 joe 0
3 john 0 2 joe 1
NaN NaN NaN 3 john 0
I tried pd.concat with different parameters but am not getting the expected result.
The main issue I have with concat is that I am not able to move the duplicate rows one row down.
Can anyone please help me with this? Thanks in advance.
This does not give the exact output you asked for, but it may solve your problem anyway:
df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)
Output:
emp_id emp_name counts _merge
0 1 sam 0 both
1 2 joe 0 both
2 3 john 0 both
3 2 joe 1 right_only
You don't have rows with NaNs as you wanted, but this way you can check whether a row is in the left df, right df, or both by looking at the _merge column. You can also give a custom name to that column using indicator='name'.
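For example (the name source here is arbitrary):
df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator='source')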
Update
To get the exact output you want you can do the following:
import numpy as np

output_df = df1.merge(df2, on=['emp_id', 'emp_name', 'counts'], how='outer', indicator=True)
output_df[['emp_id2', 'emp_name2', 'counts2']] = output_df[['emp_id', 'emp_name', 'counts']]
output_df.loc[output_df._merge == 'right_only', ['emp_id', 'emp_name', 'counts']] = np.nan
output_df.loc[output_df._merge == 'left_only', ['emp_id2', 'emp_name2', 'counts2']] = np.nan
output_df = output_df.drop('_merge', axis=1)
output_df.columns = pd.MultiIndex.from_tuples([('df1', 'emp_id'), ('df1', 'emp_name'), ('df1', 'counts'),
                                               ('df2', 'emp_id'), ('df2', 'emp_name'), ('df2', 'counts')])
Output:
df1 df2
emp_id emp_name counts emp_id emp_name counts
0 1.0 sam 0.0 1.0 sam 0.0
1 2.0 joe 0.0 2.0 joe 0.0
2 3.0 john 0.0 3.0 john 0.0
3 NaN NaN NaN 2.0 joe 1.0
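If you need the same side-by-side view for other pairs of frames, the steps above generalize into a small helper. A sketch under the assumption that both frames share the columns you pass in (the name side_by_side is made up for illustration):
import numpy as np
import pandas as pd

def side_by_side(df1, df2, cols):
    # Outer-merge with an indicator marking which frame each row came from
    out = df1.merge(df2, on=cols, how='outer', indicator=True)
    left = out[cols].copy()
    right = out[cols].copy()
    # Blank out the side a row is missing from
    left.loc[out['_merge'] == 'right_only', :] = np.nan
    right.loc[out['_merge'] == 'left_only', :] = np.nan
    return pd.concat([left, right], axis=1, keys=['df1', 'df2'])

print(side_by_side(df1, df2, ['emp_id', 'emp_name', 'counts']))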

One-hot encoding across multiple columns - but as one group

I have a Python Pandas DataFrame:
Name  Item1   Item2  Item3
John  Sword
Mary  Shield  Ring
Doe   Ring    Sword
Desired output:
Name  Item-Sword  Item-Shield  Item-Ring
John  1           0            0
Mary  0           1            1
Doe   1           0            1
Is there any way to achieve this without manual processing?
Use get_dummies after converting the Name column to the index and dropping columns that contain only missing values; then take the max across the duplicated columns so the output holds only 0/1 values, add a prefix, and convert the index back to a column:
df = (pd.get_dummies(df.set_index('Name')
                       .dropna(axis=1, how='all'), prefix='', prefix_sep='')
        .max(axis=1, level=0)
        .add_prefix('Item-')
        .reset_index())
print (df)
Name Item-Ring Item-Shield Item-Sword
0 John 0 0 1
1 Mary 1 1 0
2 Doe 1 0 1
An alternative with melt and crosstab (@sammywemmy's solution, with drop_duplicates):
df1 = (df.melt("Name")
         .assign(value=lambda x: "Item-" + x.value)
         .drop_duplicates(['Name', 'value']))
df1 = pd.crosstab(df1.Name, df1.value)
print (df1)
value Item-Ring Item-Shield Item-Sword
Name
Doe 1 0 1
John 0 0 1
Mary 1 1 0
Another solution with DataFrame.melt + DataFrame.groupby
new_df = (df.melt('Name').groupby(['Name', 'value'])
            .count()
            .clip(0, 1)
            .unstack('value', fill_value=0)
            .droplevel(0, axis=1)
            .add_prefix('Item-')
            .rename_axis(columns=None)
            .reset_index())
print(new_df)
Or DataFrame.pivot_table
df2 = df.melt('Name')
new_df = (df2.pivot_table(index='Name', columns='value', values='variable',
                          aggfunc='any', fill_value=0)
             .astype(int)
             .add_prefix('Item-')
             .rename_axis(columns=None)
             .reset_index())
print(new_df)
Output
Name Item-Ring Item-Shield Item-Sword
0 Doe 1 0 1
1 John 0 0 1
2 Mary 1 1 0
If you set your index to "Name" and then stack() your data into a single Series, you can use pd.get_dummies to encode it. Then use max to get the maximum value for each "Name" (the logic being: we don't care whether "Mary" has a ring as item 1 or item 2, so long as she has a ring). Once that's done, tidy up by adding a prefix and resetting the index back into the DataFrame:
out = (df.set_index("Name")
         .stack()
         .pipe(pd.get_dummies)
         .max(level="Name")
         .add_prefix("Item-")
         .reset_index())
print(out)
Name Item-Ring Item-Shield Item-Sword
0 John 0 0 1
1 Mary 1 1 0
2 Doe 1 0 1
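One caveat: the level argument of max (used here and in the max(axis=1, level=0) variant above) was deprecated in pandas 1.3 and removed in 2.0, so on recent versions the same step is spelled as a groupby on the index level. A sketch:
out = (df.set_index("Name")
         .stack()
         .pipe(pd.get_dummies)
         .groupby(level="Name").max()
         .add_prefix("Item-")
         .reset_index())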

How can I generate a new column to group by membership in Pandas?

I have a dataframe:
df = pd.DataFrame({'name':['John','Fred','John','George','Fred']})
How can I transform this to generate a new column giving me group membership by value? Such that:
new_df = pd.DataFrame({'name':['John','Fred','John','George','Fred'], 'group':[1,2,1,3,2]})
Use factorize:
df['group'] = pd.factorize(df['name'])[0] + 1
print (df)
name group
0 John 1
1 Fred 2
2 John 1
3 George 3
4 Fred 2
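An equivalent spelling uses groupby with ngroup; sort=False numbers the groups in order of first appearance, matching factorize (a sketch):
df['group'] = df.groupby('name', sort=False).ngroup() + 1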

Efficient method for generating a list of values from a column in a data frame based on common secondary columns

I have a data frame (df) in Python with 4 columns (ID, Status, Person, Output). Each ID is repeated 6 times and the Output is the same for each ID. For each ID, the Status will be On/Off (3 of each).
I need to generate a new column with a list of people for each unique ID/Status combination. I also need a second new column with a group ID for each unique list of people.
This is my current code, which works but is very slow on a large data frame due to the apply(list) call. Is there a more efficient way to do this?
people = df.groupby(['ID','Status'])['Person'].apply(list).reset_index(name='Names_ID')
people['Group_ID'] = people['Names_ID'].rank(method='dense')
df = df.drop_duplicates(subset=['ID','Status'])
df = df.merge(people, on = ('ID', 'Status'))
Here is an example input data frame:
df=
ID Status Person Output
0 On John 1
0 On Mark 1
0 On Michael 1
0 Off Peter 1
0 Off Tim 1
0 Off Jake 1
1 On Peter 0.5
1 On Dennis 0.5
1 On Jasper 0.5
1 Off John 0.5
1 Off Mark 0.5
1 Off Michael 0.5
2 On John 2
2 On Mark 2
2 On Larry 2
2 Off Peter 2
2 Off Dennis 2
2 Off Jasper 2
The desired output is:
df =
ID Status People Group_ID Output
0 On [John, Mark, Michael ] 0 1
0 Off [Peter, Tim, Jake ] 1 1
1 On [Peter, Dennis, Jasper ] 2 0.5
1 Off [John, Mark, Michael ] 0 0.5
2 On [John, Mark, Larry ] 3 2
2 Off [Peter, Dennis, Jasper ] 2 2
Try this:
df_out = df.groupby(['ID', 'Status'])['Person'].apply(list).reset_index()
df_out['Group_ID'] = pd.factorize(df_out['Person'].apply(tuple))[0]
df_out
Output:
ID Status Person Group_ID
0 0 Off [Peter, Tim, Jake] 0
1 0 On [John, Mark, Michael] 1
2 1 Off [John, Mark, Michael] 1
3 1 On [Peter, Dennis, Jasper] 2
4 2 Off [Peter, Dennis, Jasper] 2
5 2 On [John, Mark, Larry] 3
OR
df_out = df.groupby(['ID', 'Status'])['Person'].apply(', '.join).reset_index()
df_out['Group_ID'] = pd.factorize(df_out['Person'])[0]
df_out
import pandas as pd

# Read the example frame in from the clipboard
df = pd.read_clipboard()
df
One method is to use shift twice and join the three columns into a list. Then use groupby to figure out the Group_ID and merge it back into the dataframe.
df['Person1'] = df['Person'].shift(-1)
df['Person2'] = df['Person'].shift(-2)
df['People'] = '[' + df['Person'] + ',' + df['Person1'] + ',' + df['Person2'] + ']'

# Keep only the first row of each block of three
mult_3 = []
for i in df.index:
    if i % 3 == 0:
        mult_3.append(i)

df = df.loc[df.index.isin(mult_3)].drop(['Person', 'Person1', 'Person2'], axis=1)
df_people = df.groupby('People').Status.count().reset_index().drop(['Status'], axis=1).reset_index()
df = df.merge(df_people, how='left', on='People').rename(columns={'index': 'Group_ID'})
df = df[['ID', 'Status', 'People', 'Group_ID', 'Output']]
df
Python 3.7.6 and Pandas 1.0.3: the bottleneck here is probably the apply calls.
people = df.groupby(['ID','Status', "Output"])['Person'].apply(list).reset_index(name = 'People')
people['Group_ID'] = people["People"].apply(str).astype('category').cat.codes
Output:
ID Status Output People Group_ID
0 0 Off 1 [Peter, Tim, Jake] 3
1 0 On 1 [John, Mark, Michael] 1
2 1 Off 0.5 [John, Mark, Michael] 1
3 1 On 0.5 [Peter, Dennis, Jasper] 2
4 2 Off 2 [Peter, Dennis, Jasper] 2
5 2 On 2 [John, Mark, Larry] 0
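Whichever variant you use, the Group_ID frame can be merged back onto the deduplicated data, as in the question's own code. A sketch reusing the people frame from the last answer:
# One row per ID/Status, with the person list and group id attached
df_final = (df.drop_duplicates(subset=['ID', 'Status'])
              .drop(columns='Person')
              .merge(people[['ID', 'Status', 'People', 'Group_ID']], on=['ID', 'Status']))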

Check if value from one dataframe exists in another dataframe

I have 2 dataframes.
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
I want to loop over every row in Df1['name'] and check if each name is somewhere in Df2['IDs'].
The result should return 1 if the name is in there, 0 if it is not like so:
Marc 1
Jake 1
Sam 0
Brad 0
Thank you.
Use isin
Df1.name.isin(Df2.IDs).astype(int)
0 1
1 1
2 0
3 0
Name: name, dtype: int32
To show the result in a DataFrame:
Df1.assign(InDf2=Df1.name.isin(Df2.IDs).astype(int))
name InDf2
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
Or in a Series object:
pd.Series(Df1.name.isin(Df2.IDs).values.astype(int), Df1.name.values)
Marc 1
Jake 1
Sam 0
Brad 0
dtype: int32
This should do it:
Df1 = Df1.assign(result=Df1['name'].isin(Df2['IDs']).astype(int))
By using merge
s=Df1.merge(Df2,left_on='name',right_on='IDs',how='left')
s.IDs=s.IDs.notnull().astype(int)
s
Out[68]:
name IDs
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
This is one way. Convert to set for O(1) lookup and use astype(int) to represent Boolean values as integers.
values = set(Df2['IDs'])
Df1['Match'] = Df1['name'].isin(values).astype(int)
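If you are already working with numpy arrays, np.isin performs the same membership test (a sketch):
import numpy as np

Df1['Match'] = np.isin(Df1['name'], Df2['IDs']).astype(int)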
