I have a dataframe df like this:
X1 X2 X3
0 a c a
1 b e c
2 c nan e
3 d nan nan
I would like to create a new dataframe newdf which has one column (uentries) containing the unique entries of df, plus the three columns of df, filled with 0 and 1 depending on whether the entry of uentries exists in the respective column of df.
My desired output would therefore look as follows (uentries does not need to be ordered):
uentries X1 X2 X3
0 a 1 0 1
1 b 1 0 0
2 c 1 1 1
3 d 1 0 0
4 e 0 1 1
Currently, I do it like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
uniqueEntries = set([x for x in np.ravel(df.values) if str(x) != 'nan'])
newdf = pd.DataFrame()
newdf['uentries'] = list(uniqueEntries)
for coli in df.columns:
    newdf[coli] = newdf['uentries'].isin(df[coli])
newdf.loc[:, 'X1':'X3'] = newdf.loc[:, 'X1':'X3'].astype(int)
which gives me the desired output.
Is it possible to fill newdf in a more efficient manner?
This is a simple way to approach this problem using pd.value_counts.
newdf = df.apply(pd.value_counts).fillna(0).astype(int)
newdf['uentries'] = newdf.index
newdf = newdf[['uentries', 'X1','X2','X3']]
newdf
uentries X1 X2 X3
a a 1 0 1
b b 1 0 0
c c 1 1 1
d d 1 0 0
e e 0 1 1
nan nan 0 2 1
Then you can just drop the row with the nan values:
newdf.drop(['nan'])
uentries X1 X2 X3
a a 1 0 1
b b 1 0 0
c c 1 1 1
d d 1 0 0
e e 0 1 1
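For reference, the index-to-column and 'nan'-filtering steps can be chained in one go; a minimal sketch, assuming the string 'nan' placeholders from the question (if a value could repeat within a column, add .gt(0).astype(int) to force strict 0/1 indicators):
newdf = (df.apply(pd.value_counts).fillna(0).astype(int)
           .rename_axis('uentries').reset_index()
           .query("uentries != 'nan'"))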
You can use get_dummies, sum and finally concat with fillna:
import pandas as pd
df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
print(df)
X1 X2 X3
0 a c a
1 b e c
2 c nan e
3 d nan nan
a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()
print(pd.concat([a,b,c], axis=1, keys=['X1','X2','X3']).fillna(0))
X1 X2 X3
a 1 0 1
b 1 0 0
c 1 1 1
d 1 0 0
e 0 1 1
nan 0 2 1
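With more columns, the three per-column lines generalize to a dict comprehension; a minimal sketch, assuming the same df:
print(pd.concat({c: pd.get_dummies(df[c]).sum() for c in df.columns}, axis=1).fillna(0))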
If you use np.nan in test data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})
print(df)
a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()
print(pd.concat([a,b,c], axis=1, keys=['X1','X2','X3']).fillna(0))
X1 X2 X3
a 1 0 1
b 1 0 0
c 1 1 1
d 1 0 0
e 0 1 1
Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs match but the second row's data is not filled. However, I want to leave column 'A' intact. I am looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN then ffill()+bfill() using groupby()+apply():
df[['B','C']]=df[['B','C']].replace('0',float('NaN'))
df[['B','C']]=df.groupby('ID')[['B','C']].apply(lambda x:x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use transform() method in place of apply() method
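For reference, a minimal sketch of that transform() variant, assuming the same df and the replace step above:
df[['B','C']] = df.groupby('ID')[['B','C']].transform(lambda x: x.ffill().bfill()).fillna('0')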
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does it solve your problem? I would also look into other pandas functions, such as join and merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
My data frame looks like this
[image: pandas data frame with multiple categorical variables for a user]
I made sure there are no duplicates in it. I want to encode it, and I want my final output like this:
[image: desired encoded output]
I tried using pandas dummies directly but I am not getting the desired result.
Can anyone help me with this?
IIUC, your user column is empty and everything is in name. If that's the case, you can do:
pd.pivot_table(df, index=df.name.str[0], columns=df.name.str[1:].values, aggfunc='count').fillna(0)
You can split each row in name using r'(\d+)' to separate digits from letters, and use pd.crosstab:
d = pd.DataFrame(df.name.str.split(r'(\d+)').values.tolist())
pd.crosstab(columns=d[2], index=d[1], values=d[1], aggfunc='count')
You could try the str accessor's get_dummies with a groupby on the user column:
df.name.str.get_dummies().groupby(df.user).sum()
Example
Given your sample DataFrame
df = pd.DataFrame({'user': [1]*4 + [2]*4 + [3]*3,
                   'name': ['a', 'b', 'c', 'd']*2 + ['d', 'e', 'f']})
df_dummies = df.name.str.get_dummies().groupby(df.user).sum()
print(df_dummies)
[out]
a b c d e f
user
1 1 1 1 1 0 0
2 1 1 1 1 0 0
3 0 0 0 1 1 1
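Since each name here is a single token with no '|' separators, an equivalent spelling without the str accessor is possible; a sketch of my own, not from the original answer:
pd.get_dummies(df.name).groupby(df.user).sum()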
Assuming the following dataframe:
user name
0 1 a
1 1 b
2 1 c
3 1 d
4 2 a
5 2 b
6 2 c
7 3 d
8 3 e
9 3 f
You could group by user and then use get_dummies:
import pandas as pd
# create data-frame
data = [[1, 'a'], [1, 'b'], [1, 'c'], [1, 'd'], [2, 'a'],
        [2, 'b'], [2, 'c'], [3, 'd'], [3, 'e'], [3, 'f']]
df = pd.DataFrame(data=data, columns=['user', 'name'])
# group and get_dummies
grouped = df.groupby('user')['name'].apply(lambda x: '|'.join(x))
print(grouped.str.get_dummies())
Output
a b c d e f
user
1 1 1 1 1 0 0
2 1 1 1 0 0 0
3 0 0 0 1 1 1
As a side-note, you can do it all in one line:
result = df.groupby('user')['name'].apply(lambda x: '|'.join(x)).str.get_dummies()
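As an aside, since each (user, name) pair occurs at most once here, pd.crosstab gives the same 0/1 table directly; a minimal sketch, not from the original answer:
pd.crosstab(df.user, df.name)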
Let's say my df is:
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'col2': [10, 20, 30, 10, 20, 10, 10, 20, 30]})
How can I set all numbers to zero, keeping only the last one in each col1 group? In this case the result should be:
col1 col2
a 0
a 0
a 30
b 0
b 20
c 10
d 0
d 0
d 30
Thanks!
Use loc and duplicated with the argument keep='last':
df.loc[df.duplicated(subset='col1',keep='last'), 'col2'] = 0
>>> df
col1 col2
0 a 0
1 a 0
2 a 30
3 b 0
4 b 20
5 c 10
6 d 0
7 d 0
8 d 30
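An equivalent sketch using Series.mask instead of .loc assignment (the same duplicated logic, expressed as a replacement):
df['col2'] = df['col2'].mask(df.duplicated('col1', keep='last'), 0)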
How can I combine four columns in a dataframe in pandas/python to create a unique indicator and do a left join?
Is this even the best way to do what I am trying to accomplish?
Example: make a unique indicator (col5),
then set up a join with another dataframe using the same logic:
col1 col2 col3 col4 col5
apple pear mango tea applepearmangotea
then do a join, something like:
pd.merge(df1, df2, how='left', on='col5')
This problem is the same whether it's 4 columns or 2. You don't need to create a unique combined key; you just need to merge on multiple columns.
Consider the two dataframes d1 and d2. They share two columns in common.
d1 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [1, 1, 'g', 'h']
], columns=list('ABCD'))
d2 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [2, 0, 'g', 'h']
], columns=list('ABEF'))
d1
A B C D
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 1 1 g h
d2
A B E F
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 2 0 g h
We can perform the equivalent of a left join using pd.DataFrame.merge
d1.merge(d2, 'left')
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
We can be explicit with the columns
d1.merge(d2, 'left', on=['A', 'B'])
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
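For completeness, if you really do want the concatenated key from the question, a minimal sketch assuming df1 and df2 both hold col1..col4 as strings (merging on the column list, as above, is still the better option):
df1['col5'] = df1['col1'] + df1['col2'] + df1['col3'] + df1['col4']
df2['col5'] = df2['col1'] + df2['col2'] + df2['col3'] + df2['col4']
pd.merge(df1, df2, how='left', on='col5')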
I am attempting to add at least one, or even multiple columns to a dataframe from a mapped dictionary. I have a dictionary keyed on product catalog numbers containing a list of standardized hierarchical nomenclature for that product number. Example below.
dict = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame( {"product": [1, 2, 3]})
df['catagory'] = df['product'].map(dict)
print(df)
I get the following result:
product catagory
0 1 [a, b, c, d]
1 2 [w, x, y, z]
2 3 NaN
I would like to obtain the following:
product cat1 cat2 cat3 cat4
0 1 a b c d
1 2 w x y z
2 3 NaN NaN NaN NaN
Or even better:
product category
0 1 d
1 2 z
2 3 NaN
I have been trying just to parse out one of the items from the list within the dictionary and append it to the dataframe, but have only found advice for mapping dictionaries that contain a single item per list, per this example.
Any help appreciated.
Notice:
Never use built-in names like list, type, dict... as variable names, because doing so masks the built-in functions.
So if you use:
#dict is now a variable name
dict = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
#creating a dictionary via dict() is no longer possible, because dict refers to this variable
print (dict(a=1, b=2))
you get an error:
TypeError: 'dict' object is not callable
and debugging is very complicated. (After testing, restart the IDE.)
So use another variable name like d or categories:
d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
print (dict(a=1, b=2))
{'a': 1, 'b': 2}
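(In an interactive session you can also recover the built-in by deleting the shadowing variable; a side note of mine, not from the original answer:)
del dict   # removes the variable binding; the built-in dict is visible again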
I think you need DataFrame.from_dict with join:
d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame( {"product": [1, 2, 3]})
print (df)
product
0 1
1 2
2 3
df1 = pd.DataFrame.from_dict(d, orient='index')
df1.columns = ['cat' + (str(i+1)) for i in df1.columns]
print(df1)
cat1 cat2 cat3 cat4
1 a b c d
2 w x y z
df2 = df.join(df1, on='product')
print (df2)
product cat1 cat2 cat3 cat4
0 1 a b c d
1 2 w x y z
2 3 NaN NaN NaN NaN
Then it is possible to use melt or stack:
df3 = df2.melt('product', value_name='category').drop('variable', axis=1)
print (df3)
product category
0 1 a
1 2 w
2 3 NaN
3 1 b
4 2 x
5 3 NaN
6 1 c
7 2 y
8 3 NaN
9 1 d
10 2 z
11 3 NaN
df2 = (df.set_index('product')
         .join(df1)
         .stack(dropna=False)
         .reset_index(level=1, drop=True)
         .rename('category')
         .reset_index())
print (df2)
product category
0 1 a
1 1 b
2 1 c
3 1 d
4 2 w
5 2 x
6 2 y
7 2 z
8 3 NaN
9 3 NaN
10 3 NaN
11 3 NaN
If the category column is already in df, the solution is similar; it is only necessary to first remove rows with NaN by DataFrame.dropna:
d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
df = pd.DataFrame( {"product": [1, 2, 3]})
df['category'] = df['product'].map(d)
print(df)
df1 = df.dropna(subset=['category'])
df1 = pd.DataFrame(df1['category'].values.tolist(), index=df1['product'])
df1.columns = ['cat' + (str(i+1)) for i in df1.columns]
print(df1)
cat1 cat2 cat3 cat4
product
1 a b c d
2 w x y z
df2 = df[['product']].join(df1, on='product')
print (df2)
product cat1 cat2 cat3 cat4
0 1 a b c d
1 2 w x y z
2 3 NaN NaN NaN NaN
import numpy as np

d = {1: ['a', 'b', 'c', 'd'], 2: ['w', 'x', 'y', 'z']}
#Split product into 4 columns
df[['product']].join(
    df.apply(lambda x: pd.Series(d.get(x['product'], [np.nan])), axis=1)
      .rename(columns=lambda x: 'cat{}'.format(x + 1))
)
Out[187]:
product cat1 cat2 cat3 cat4
0 1 a b c d
1 2 w x y z
2 3 NaN NaN NaN NaN
#only take the last element
df['catagory'] = df.apply(lambda x: d.get(x['product'],[np.nan])[-1],axis=1)
df
Out[171]:
product catagory
0 1 d
1 2 z
2 3 NaN
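A shorter sketch for this last-element version, relying on .str indexing working element-wise on lists (my suggestion, not part of the original answer):
df['catagory'] = df['product'].map(d).str[-1]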
Let's use set_index, apply, add_prefix, reset_index:
df_out = (df.set_index('product')['catagory']
            .apply(lambda x: pd.Series(x)))
df_out.columns = df_out.columns + 1
df_out.add_prefix('cat').reset_index()
Output:
product cat1 cat2 cat3 cat4
0 1 a b c d
1 2 w x y z
2 3 NaN NaN NaN NaN
To take it to the next step, the 'even better' output:
(df.set_index('product')['catagory']
   .apply(lambda x: pd.Series(x))
   .stack(dropna=False)
   .rename('category')
   .reset_index()
   .drop('level_1', axis=1)
   .drop_duplicates()
)
Output:
product category
0 1 a
1 1 b
2 1 c
3 1 d
4 2 w
5 2 x
6 2 y
7 2 z
8 3 NaN