One-hot encoding across multiple columns, but as one group (Python)

I have a Python pandas DataFrame:
Name  Item1   Item2  Item3
John  Sword
Mary  Shield  Ring
Doe   Ring    Sword
Desired output:
Name  Item-Sword  Item-Shield  Item-Ring
John  1           0            0
Mary  0           1            1
Doe   1           0            1
Is there any way to achieve this short of manual processing?

Use get_dummies after converting the Name column to the index and dropping columns that contain only missing values; then use max so the output contains only 0/1 values, add a prefix, and convert the index back to a column:
df = (pd.get_dummies(df.set_index('Name')
                       .dropna(axis=1, how='all'), prefix='', prefix_sep='')
        .max(axis=1, level=0)
        .add_prefix('Item-')
        .reset_index())
print(df)
   Name  Item-Ring  Item-Shield  Item-Sword
0  John          0            0           1
1  Mary          1            1           0
2   Doe          1            0           1
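Note that the level argument of DataFrame.max was deprecated in pandas 1.5 and removed in 2.0, so the snippet above errors on current versions. A sketch of an equivalent for modern pandas, which stacks the item columns first (the sample frame is rebuilt from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Mary", "Doe"],
    "Item1": ["Sword", "Shield", "Ring"],
    "Item2": [None, "Ring", "Sword"],
    "Item3": [None, None, None],
})

# stack() collapses the Item columns into one Series and drops the NaNs,
# so get_dummies sees a single set of categories; the per-Name max then
# collapses the rows back to one per person
out = (pd.get_dummies(df.set_index("Name").stack())
         .groupby(level="Name", sort=False).max()
         .astype(int)
         .add_prefix("Item-")
         .reset_index())
print(out)
```

The result matches the output above, with one 0/1 column per distinct item.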
An alternative with melt and crosstab (@sammywemmy's solution, with drop_duplicates added):
df1 = (df.melt("Name")
         .assign(value=lambda x: "Item-" + x.value)
         .drop_duplicates(['Name', 'value']))
df1 = pd.crosstab(df1.Name, df1.value)
print(df1)
value  Item-Ring  Item-Shield  Item-Sword
Name
Doe            1            0           1
John           0            0           1
Mary           1            1           0

Another solution with DataFrame.melt + DataFrame.groupby:
new_df = (df.melt('Name')
            .groupby(['Name', 'value'])
            .count()
            .clip(0, 1)
            .unstack('value', fill_value=0)
            .droplevel(0, axis=1)
            .add_prefix('Item-')
            .rename_axis(columns=None)
            .reset_index())
print(new_df)
Or DataFrame.pivot_table:
df2 = df.melt('Name')
new_df = (df2.pivot_table(index='Name', columns='value', values='variable',
                          aggfunc='any', fill_value=0)
             .astype(int)
             .add_prefix('Item-')
             .rename_axis(columns=None)
             .reset_index())
print(new_df)
Output
   Name  Item-Ring  Item-Shield  Item-Sword
0   Doe          1            0           1
1  John          0            0           1
2  Mary          1            1           0

If you set your index to "Name" and then stack() your data into a single Series, you can use pd.get_dummies to encode the data. Then you'll need max to get the maximum value for each "Name" (this logic boils down to: we don't care whether "Mary" has a ring as item 1 or item 2, so long as she has a ring). Once that's done, we can tidy up by adding a prefix and resetting the index back into the DataFrame:
out = (df.set_index("Name")
         .stack()
         .pipe(pd.get_dummies)
         .max(level="Name")
         .add_prefix("Item-")
         .reset_index())
print(out)
   Name  Item-Ring  Item-Shield  Item-Sword
0  John          0            0           1
1  Mary          1            1           0
2   Doe          1            0           1

Related

How can I generate a new column to group by membership in Pandas?

I have a dataframe:
df = pd.DataFrame({'name':['John','Fred','John','George','Fred']})
How can I transform this to generate a new column giving me group membership by value? Such that:
new_df = pd.DataFrame({'name':['John','Fred','John','George','Fred'], 'group':[1,2,1,3,2]})
Use factorize:
df['group'] = pd.factorize(df['name'])[0] + 1
print(df)
     name  group
0    John      1
1    Fred      2
2    John      1
3  George      3
4    Fred      2
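An alternative sketch uses GroupBy.ngroup, which numbers each group; sort=False keeps first-appearance order, so the result matches factorize:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Fred', 'John', 'George', 'Fred']})

# ngroup() assigns an integer to each group; sort=False numbers groups in
# order of first appearance (like factorize), and +1 makes it 1-based
df['group'] = df.groupby('name', sort=False).ngroup() + 1
print(df)
```
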

Count frequency of each word contained in column string values

For example, I have a dataframe like this:
data = {'id': [1, 1, 1, 2, 2],
        'value': ['red', 'red and blue', 'yellow', 'oak', 'oak wood']}
df = pd.DataFrame(data, columns=['id', 'value'])
I want:
id   value  count
 1     red      2
 1    blue      1
 1  yellow      1
 2     oak      2
 2    wood      1
Many thanks!
A solution for pandas 0.25+: use DataFrame.explode on the lists created by Series.str.split, then GroupBy.size:
df1 = (df.assign(value=df['value'].str.split())
         .explode('value')
         .groupby(['id', 'value'], sort=False)
         .size()
         .reset_index(name='count'))
print(df1)
   id   value  count
0   1     red      2
1   1     and      1
2   1    blue      1
3   1  yellow      1
4   2     oak      2
5   2    wood      1
For older pandas versions, use DataFrame.set_index with Series.str.split(expand=True) to get a DataFrame, reshape it with DataFrame.stack, turn the MultiIndex Series back into columns, and apply the same solution as above:
df1 = (df.set_index('id')['value']
         .str.split(expand=True)
         .stack()
         .reset_index(name='value')
         .groupby(['id', 'value'], sort=False)
         .size()
         .reset_index(name='count'))
print(df1)
   id   value  count
0   1     red      2
1   1     and      1
2   1    blue      1
3   1  yellow      1
4   2     oak      2
5   2    wood      1
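For pandas 1.1+, DataFrame.value_counts can replace the groupby + size step; a sketch on the same data (row order will be by count rather than by appearance):

```python
import pandas as pd

data = {'id': [1, 1, 1, 2, 2],
        'value': ['red', 'red and blue', 'yellow', 'oak', 'oak wood']}
df = pd.DataFrame(data)

# split into word lists, explode to one word per row, then count each
# (id, word) pair in one step (DataFrame.value_counts, pandas 1.1+)
df1 = (df.assign(value=df['value'].str.split())
         .explode('value')[['id', 'value']]
         .value_counts()
         .reset_index(name='count'))
print(df1)
```
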

Split single column into two based on column values

I have a dataframe that looks like this:
Supervisor  Score
Bill        Pass
Bill        Pass
Susan       Fail
Susan       Fail
Susan       Fail
I would like to do some aggregates (such as getting the % of pass by supervisor) and would like to split up the Score column so all the Pass are in one column and all the Fail are in another column. Like this:
Supervisor  Pass  Fail
Bill        0     1
Bill        0     1
Susan       1     0
Susan       1     0
Susan       1     0
Any ideas? Would a simple groupby work by grouping both the supervisor and score columns and getting a count of Score?
pd.get_dummies
Removes any columns you specify from your DataFrame in favor of N dummy columns with the default naming convention 'OrigName_UniqueVal'. Specifying empty strings for the prefix and separator gives you column headers of only the unique values.
pd.get_dummies(df, columns=['Score'], prefix_sep='', prefix='')
  Supervisor  Fail  Pass
0       Bill     0     1
1       Bill     0     1
2      Susan     1     0
3      Susan     1     0
4      Susan     1     0
If all you want in the end is the % of each category per supervisor, then you don't really need the dummies; you can use groupby. I use reindex to ensure the resulting DataFrame has every category represented for each Supervisor.
(df.groupby(['Supervisor']).Score.value_counts(normalize=True)
   .reindex(pd.MultiIndex.from_product([df.Supervisor.unique(), df.Score.unique()]))
   .fillna(0))
# Bill   Pass    1.0
#        Fail    0.0
# Susan  Pass    0.0
#        Fail    1.0
# Name: Score, dtype: float64
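If the final goal really is the pass rate per supervisor, the normalized counts can also be unstacked into one column per score, which avoids the reindex. A sketch, rebuilding the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'Supervisor': ['Bill', 'Bill', 'Susan', 'Susan', 'Susan'],
                   'Score': ['Pass', 'Pass', 'Fail', 'Fail', 'Fail']})

# normalize=True gives within-group proportions; unstack pivots the Score
# level into columns, filling combinations that never occur with 0
pct = (df.groupby('Supervisor')['Score']
         .value_counts(normalize=True)
         .unstack(fill_value=0))
print(pct)
```
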
IIUC, you want DataFrame.pivot_table + DataFrame.join:
new_df = df[['Supervisor']].join(df.pivot_table(columns='Score',
                                                index=df.index,
                                                values='Supervisor',
                                                aggfunc='count',
                                                fill_value=0))
print(new_df)
  Supervisor  Fail  Pass
0       Bill     0     1
1       Bill     0     1
2      Susan     1     0
3      Susan     1     0
4      Susan     1     0
For the output you expect (with the 0s and 1s flipped):
new_df = df[['Supervisor']].join(df.pivot_table(columns='Score',
                                                index=df.index,
                                                values='Supervisor',
                                                aggfunc='count',
                                                fill_value=0)
                                   .eq(0)
                                   .astype(int))
print(new_df)
  Supervisor  Fail  Pass
0       Bill     1     0
1       Bill     1     0
2      Susan     0     1
3      Susan     0     1
4      Susan     0     1
**Let's try this one**
df = pd.DataFrame({'Supervisor': ['Bill', 'Bill', 'Susan', 'Susan', 'Susan'],
                   'Score': ['Pass', 'Pass', 'Fail', 'Fail', 'Fail']}).set_index('Supervisor')
pd.get_dummies(df['Score'])
**PANDAS 100 tricks**
For more pandas tricks, see: https://www.kaggle.com/python10pm/pandas-100-tricks
To get the df you want you can do it like this:
# note: these match the question's desired table, where a "Pass" row gets Pass = 0
df["Pass"] = df["Score"].apply(lambda x: 0 if x == "Pass" else 1)
df["Fail"] = df["Score"].apply(lambda x: 0 if x == "Fail" else 1)

How to Hot Encode with Pandas without combining rows levels

I have created a really big dataframe in pandas, similar to the following:
             0         1
user
0     product4  product0
1     product3  product1
I want to use something like pd.get_dummies() so that the final df would be:
      product0  product1  product2  product3  product4
user
0            1         0         0         0         1
1            0         1         0         1         0
instead of getting the following from pd.get_dummies():
      0_product3  0_product4  1_product0  1_product1
user
0              0           1           1           0
1              1           0           0           1
In summary, I do not want that the rows are combined into the binary columns.
Thanks a lot!
Use reindex with get_dummies:
In [539]: dff = pd.get_dummies(df, prefix='', prefix_sep='')

In [540]: s = dff.columns.str[-1].astype(int)

In [541]: cols = 'product' + pd.RangeIndex(s.min(), s.max() + 1).astype(str)

In [542]: dff.reindex(columns=cols, fill_value=0)
Out[542]:
      product0  product1  product2  product3  product4
user
0            1         0         0         0         1
1            0         1         0         1         0
df = pd.get_dummies(df, prefix='', prefix_sep='')  # remove the prefix and underscore separator from the dummy column names
df = df.sort_index(axis=1)                         # order the columns by name
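Note the sort_index variant leaves out categories that never occur (product2 here). A sketch that stacks first and then reindexes over the full product range (treating products 0-4, inferred from the sample, as the complete set):

```python
import pandas as pd

df = pd.DataFrame({0: ['product4', 'product3'], 1: ['product0', 'product1']})
df.index.name = 'user'

# stack both columns into one Series so get_dummies sees one set of
# categories, collapse back to one row per user, then reindex to add
# the products (like product2) that never occur
dummies = (pd.get_dummies(df.stack())
             .groupby(level='user').max()
             .reindex(columns=['product' + str(i) for i in range(5)],
                      fill_value=0)
             .astype(int))
print(dummies)
```
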

How to Compare Values of two Dataframes in Pandas?

I have two dataframes, df and df2, like this:
df:
    id initials
0  100        J
1  200        S
2  300        Y
df2:
     name initials
0    John        J
1   Smith        S
2  Nathan        N
I want to compare the values in the initials columns of df and df2, and copy the name from df2 whose initial matches the initial in the first dataframe (df).
import pandas as pd

for i in df.initials:
    for j in df2.initials:
        if i == j:
            pass  # copy the name value for this initial into df
The output should be like this:
    id   name
0  100   John
1  200  Smith
2  300
Any idea how to solve this problem?
How about:
df3 = (df.merge(df2, on='initials', how='outer')
         .drop(['initials'], axis=1)
         .dropna(subset=['id']))
>>> df3
      id   name
0  100.0   John
1  200.0  Smith
2  300.0    NaN
So the 'initials' column is dropped, and so is any row with np.nan in the 'id' column.
If you don't want the np.nan in there, tack on a .fillna(''):
df3 = (df.merge(df2, on='initials', how='outer')
         .drop(['initials'], axis=1)
         .dropna(subset=['id'])
         .fillna(''))
>>> df3
      id   name
0  100.0   John
1  200.0  Smith
2  300.0
df1
    id initials
0  100        J
1  200        S
2  300        Y
df2
     name initials
0    John        J
1   Smith        S
2  Nathan        N
Use Boolean masks: df2.initials == df1.initials will tell you which values in the two initials columns are the same.
0     True
1     True
2    False
Use this mask to create a new column:
df1['name'] = df2.name[df2.initials==df1.initials]
Remove the initials column from df1:
df1.drop('initials', axis=1, inplace=True)
Replace the NaN using fillna(''):
df1.fillna('', inplace=True)  # inplace to avoid creating a copy
    id   name
0  100   John
1  200  Smith
2  300
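The loop in the question can also be avoided entirely with Series.map; a sketch using a lookup Series built from df2:

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 200, 300], 'initials': ['J', 'S', 'Y']})
df2 = pd.DataFrame({'name': ['John', 'Smith', 'Nathan'], 'initials': ['J', 'S', 'N']})

# build an initials -> name lookup and map it onto df; initials with no
# match (here 'Y') become NaN, which fillna('') blanks out
lookup = df2.set_index('initials')['name']
out = (df.assign(name=df['initials'].map(lookup).fillna(''))
         .drop(columns='initials'))
print(out)
```
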
