Python - Pandas - Edit duplicate items keeping last

Let's say my df is:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'col2': [10, 20, 30, 10, 20, 10, 10, 20, 30]})

How can I set all the numbers to zero, keeping only the last one in each group of col1? In this case the result should be:
col1  col2
a        0
a        0
a       30
b        0
b       20
c       10
d        0
d        0
d       30
Thanks!

Use loc and duplicated with the argument keep='last': duplicated then marks every row except the last occurrence of each col1 value, and loc sets col2 to 0 on exactly those rows:
df.loc[df.duplicated(subset='col1',keep='last'), 'col2'] = 0
>>> df
  col1  col2
0    a     0
1    a     0
2    a    30
3    b     0
4    b    20
5    c    10
6    d     0
7    d     0
8    d    30
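If you prefer not to build a boolean mask by hand, an equivalent sketch (assuming the same df as above) uses groupby.cumcount:

# Sketch of an equivalent approach, assuming the sample df above.
# cumcount(ascending=False) is 0 on the last row of each col1 group, so the
# mask is True only there; where() keeps those values and zeroes the rest.
is_last = df.groupby('col1').cumcount(ascending=False) == 0
df['col2'] = df['col2'].where(is_last, 0)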

Related

How to efficiently reorder rows based on condition?

My dataframe:
df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 10 f
6 20 g
7 20 h
I don't want consecutive rows with col_1 = 10; instead, the row below a repeated 10 should move up by one (in this case index 6 should become index 5 and vice versa), so that the order is always 10, 20, 10, 20, ...
My current solution:
for idx, row in df.iterrows():
    if row['col_1'] == 10 and df.iloc[idx + 1]['col_1'] != 20:
        df = df.rename({idx + 1: idx + 2, idx + 2: idx + 1})
        df = df.sort_index()
df
gives me:
   col_1 col_2
0     10     a
1     20     b
2     10     c
3     20     d
4     10     e
5     20     g
6     10     f
7     20     h
which is what I want, but it is very slow (2.34 s for a dataframe with just over 8000 rows).
Is there a way to avoid the loop here?
Thanks
You can use a custom key in sort_values with groupby.cumcount:
df.sort_values(by='col_1', kind='stable', key=lambda s: df.groupby(s).cumcount())
Output:

   col_1 col_2
0     10     a
1     20     b
2     10     c
3     20     d
4     10     e
6     20     g
5     10     f
7     20     h
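To see why this works: the key maps each row of col_1 to its occurrence number within its group, and the stable sort then interleaves the groups by that counter. A quick peek at the intermediate values, using the same sample df:

# Occurrence counter per col_1 value; the stable sort orders rows by it.
print(df.groupby('col_1').cumcount().tolist())
# [0, 0, 1, 1, 2, 3, 2, 3]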

How can pd.get_dummies() be used to dummy-code a list of categories?

I understand that pd.get_dummies() works very well for creating a dummy set to represent a categorical variable (in my case for a decision tree algorithm). My question is, how can this be adapted to handle entries that are a list of categories?
MWE:
import pandas as pd

a = pd.DataFrame({
    'id': ['i', 'j', 'k', 'l'],
    'category': [['a', 'b'], 'b', 'c', ['b', 'c']],
    'x': ['p', 'q', 'r', 's'],
    'y': [10, 20, 30, 40]
})
...
a_dummied

  id  a  b  c  x   y
0  i  1  1  0  p  10
1  j  0  1  0  q  20
2  k  0  0  1  r  30
3  l  0  1  1  s  40
You can explode the category column and then call pd.get_dummies:
print(pd.get_dummies(a.explode('category').set_index('id'), prefix='', prefix_sep='').groupby(level=0).sum())
Prints:
    a  b  c
id
i   1  1  0
j   0  1  0
k   0  0  1
l   0  1  1
EDIT: To work with more columns, first run pd.get_dummies() on the category column only and then .join the result with the original dataframe:
c = pd.get_dummies(a[['id', 'category']].explode('category').set_index('id'), prefix='', prefix_sep='').groupby(level=0).sum()
print(a.set_index('id').drop(columns='category').join(c))
Prints:
    x   y  a  b  c
id
i   p  10  1  1  0
j   q  20  0  1  0
k   r  30  0  0  1
l   s  40  0  1  1
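A possible alternative sketch, if you prefer pd.crosstab over get_dummies (assuming the same frame a as in the question):

# Alternative sketch: cross-tabulate the exploded column, then join the
# indicator columns back onto the remaining columns of `a`.
e = a[['id', 'category']].explode('category')
dummies = pd.crosstab(e['id'], e['category'])
result = a.drop(columns='category').join(dummies, on='id')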

Create a unique indicator to join two datasets in pandas/python

How can I combine four columns in a dataframe in pandas/python to create a unique indicator and do a left join?
Is this even the best way to do what I am trying to accomplish?
Example: make a unique indicator (col5), then set up a join with another dataframe using the same logic:

col1   col2  col3   col4  col5
apple  pear  mango  tea   applepearmangotea
then do a join something like
pd.merge(df1, df2, how='left', on='col5')
This problem is the same whether it's 4 columns or 2. You don't need to create a unique combined key; you just need to merge on multiple columns.
Consider the two dataframes d1 and d2. They share two columns in common.
d1 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [1, 1, 'g', 'h']
], columns=list('ABCD'))

d2 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [2, 0, 'g', 'h']
], columns=list('ABEF'))
d1

   A  B  C  D
0  0  0  a  b
1  0  1  c  d
2  1  0  e  f
3  1  1  g  h

d2

   A  B  E  F
0  0  0  a  b
1  0  1  c  d
2  1  0  e  f
3  2  0  g  h
We can perform the equivalent of a left join using pd.DataFrame.merge:

d1.merge(d2, 'left')

   A  B  C  D    E    F
0  0  0  a  b    a    b
1  0  1  c  d    c    d
2  1  0  e  f    e    f
3  1  1  g  h  NaN  NaN
We can also be explicit about the columns:

d1.merge(d2, 'left', on=['A', 'B'])

   A  B  C  D    E    F
0  0  0  a  b    a    b
1  0  1  c  d    c    d
2  1  0  e  f    e    f
3  1  1  g  h  NaN  NaN
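If you want a sanity check on the join, merge also accepts indicator and validate arguments; a sketch, assuming a reasonably recent pandas version:

# Sketch: indicator adds a _merge column showing which side each row came
# from, and validate raises if the ['A', 'B'] keys are not one-to-one.
d1.merge(d2, how='left', on=['A', 'B'], indicator=True, validate='one_to_one')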

Create and populate a dataframe using the unique values of another dataframe

I have a dataframe df like this:
  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan
I would like to create a new dataframe newdf which has one column (uentries) containing the unique entries of df, plus the three columns of df filled with 0 and 1 depending on whether the entry of uentries exists in the respective column of df.
My desired output would therefore look as follows (uentries does not need to be ordered):
  uentries  X1  X2  X3
0        a   1   0   1
1        b   1   0   0
2        c   1   1   1
3        d   1   0   0
4        e   0   1   1
Currently, I do it like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})

uniqueEntries = set([x for x in np.ravel(df.values) if str(x) != 'nan'])
newdf = pd.DataFrame()
newdf['uentries'] = list(uniqueEntries)
for coli in df.columns:
    newdf[coli] = newdf['uentries'].isin(df[coli])
newdf.loc[:, 'X1':'X3'] = newdf.loc[:, 'X1':'X3'].astype(int)
which gives me the desired output.
Is it possible to fill newdf in a more efficient manner?
This is a simple way to approach this problem using pd.value_counts.
newdf = df.apply(pd.value_counts).fillna(0)
newdf['uentries'] = newdf.index
newdf = newdf[['uentries', 'X1','X2','X3']]
newdf
     uentries  X1  X2  X3
a           a   1   0   1
b           b   1   0   0
c           c   1   1   1
d           d   1   0   0
e           e   0   1   1
nan       nan   0   2   1
Then you can just drop the row with the nan values:
newdf.drop(['nan'])

    uentries  X1  X2  X3
a          a   1   0   1
b          b   1   0   0
c          c   1   1   1
d          d   1   0   0
e          e   0   1   1
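The same idea can be written without copying the index into a column by hand; a small sketch using rename_axis/reset_index, assuming the 'nan' strings of the question data:

# Sketch: count values per column, drop the 'nan' string row, then turn the
# index into the uentries column.
newdf = (df.apply(lambda col: col.value_counts())
           .fillna(0).astype(int)
           .drop(index='nan')
           .rename_axis('uentries')
           .reset_index())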
You can use get_dummies, sum and finally concat with fillna:
import pandas as pd

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
print(df)
  X1   X2   X3
0  a    c    a
1  b    e    c
2  c  nan    e
3  d  nan  nan
a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()
print(pd.concat([a, b, c], axis=1, keys=['X1', 'X2', 'X3']).fillna(0))
     X1  X2  X3
a     1   0   1
b     1   0   0
c     1   1   1
d     1   0   0
e     0   1   1
nan   0   2   1
If you use np.nan in the test data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})
print(df)

a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()
print(pd.concat([a, b, c], axis=1, keys=['X1', 'X2', 'X3']).fillna(0))
   X1  X2  X3
a   1   0   1
b   1   0   0
c   1   1   1
d   1   0   0
e   0   1   1
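On a newer pandas a more compact route is also possible with melt and crosstab; a sketch, assuming the np.nan variant of the data just above:

# Sketch: melt to long format, drop the missing entries, then cross-tabulate
# entries against their source column and clip the counts to 0/1 indicators.
melted = df.melt(var_name='col', value_name='uentries').dropna()
newdf = (pd.crosstab(melted['uentries'], melted['col'])
           .clip(upper=1)
           .reset_index())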

Pandas Groupby: group by a column containing tuples

I'm trying to group by a column containing tuples. Each tuple has a different length.
I'd like to perform simple groupby operations on this column of tuples, such as sum or count.
Example:

df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [('a', 'b'), ('a'), ('b', 'n', 'k'), ('a', 'c', 'k', 'z')],
})
print(df)
outputs:

   col1          col2
0     1        (a, b)
1     2        (a, m)
2     3     (b, n, k)
3     4  (a, c, k, z)
I'd like to be able to group col1 by the elements of col2, for instance with a sum.
Expected output would be:

  col2  sum_col1
0    a         7
1    b         4
2    c         4
3    n         3
4    m         2
5    k         7
6    z         4
I feel that pd.melt might be of use here, but I can't see exactly how.
Here is an approach using .get_dummies and .melt:
import pandas as pd

df = pd.DataFrame(data={
    'col1': [1, 2, 3, 4],
    'col2': [('a', 'b'), ('a'), ('b', 'n', 'k'), ('a', 'c', 'k', 'z')],
})
value_col = 'col1'
id_col = 'col2'
Unpack the tuples into separate columns:
df = df.join(df.col2.apply(lambda x: pd.Series(x)))
Create dummy columns from the tuple values:
dummy_cols = df.columns.difference(df[[value_col, id_col]].columns)
dfd = pd.get_dummies(df[dummy_cols | pd.Index([value_col])])
Producing:

   col1  0_a  0_b  1_b  1_c  1_n  2_k  3_z
0     1    1    0    1    0    0    0    0
1     2    1    0    0    0    0    0    0
2     3    0    1    0    0    1    1    0
3     4    1    0    0    1    0    1    1
Then .melt it and strip the prefixes from the variable column:

dfd = pd.melt(dfd, value_vars=dfd.columns.difference([value_col]).tolist(), id_vars=value_col)
dfd['variable'] = dfd.variable.str.replace(r'\d_', '', regex=True)
print(dfd.head())
Yielding:

   col1 variable  value
0     1        a      1
1     2        a      1
2     3        a      0
3     4        a      1
4     1        b      0
And finally get your output:

dfd[dfd.value != 0].groupby('variable')[value_col].sum()

variable
a    7
b    4
c    4
k    7
n    3
z    4
Name: col1, dtype: int64
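On pandas 0.25+ there is a much shorter route; a sketch (assuming the same df as above) that explodes the tuple column and groups on the result:

# Sketch: explode turns each tuple element into its own row, after which a
# plain groupby sum gives the per-element totals.
out = (df.explode('col2')
         .groupby('col2', as_index=False)['col1'].sum()
         .rename(columns={'col1': 'sum_col1'}))
print(out)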
