How can I binarize a dataset according to the index? E.g.
        A  B  C
idUser
3       1  1  1
2       0  1  0
4       1  0  0
I have tried using pd.get_dummies, but the result is not quite what I need.
import pandas as pd

dictio = {'idUser': [3, 3, 3, 2, 4], 'artist': ['A', 'B', 'C', 'B', 'A']}
df = pd.DataFrame(dictio)
df = df.set_index('idUser')
df_binary = pd.get_dummies(df, columns=['artist'])
print(df_binary)
        artist_A  artist_B  artist_C
idUser
3              1         0         0
3              0         1         0
3              0         0         1
2              0         1         0
4              1         0         0
In [27]: df_binary.groupby(level=0).any().astype(int)
Out[27]:
        artist_A  artist_B  artist_C
idUser
2              0         1         0
3              1         1         1
4              1         0         0
Alternatively, starting from your df before the .set_index() call:
In [33]: df.pivot_table(index='idUser', columns='artist', aggfunc='size', fill_value=0).rename_axis(columns=None)
Out[33]:
        A  B  C
idUser
2       0  1  0
3       1  1  1
4       1  0  0
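In the same pre-set_index spirit, pd.crosstab gives an equivalent table; a small sketch (the .clip(upper=1) is only a guard in case a user/artist pair ever appears more than once):

out = (pd.crosstab(df['idUser'], df['artist'])
         .clip(upper=1)
         .rename_axis(columns=None))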
I want to transform the following data from df1 to df2:
df1:
   ID  a  b  c  d  a-d  c-d  a-c-d
0   1  0  0  0  0    0    0      1
1   2  0  0  1  0    1    0      0
2   3  0  1  0  0    0    1      0
3   4  0  0  0  0    1    0      1
4   5  0  0  1  1    0    0      0
And df2 is:
   ID  a  b  c  d
0   1  1  0  1  1
1   2  1  0  1  1
2   3  0  1  1  1
3   4  2  0  1  2
4   5  0  0  1  1
Basically, I want to get the total count of "a" from all the columns whose names contain the letter "a". E.g. in the 4th row of df1 there are 2 column names containing "a" (a-d and a-c-d); if you sum up all the "a"s from that row, there would be a total of 2 a's. I want a single column for "a" in the new dataset (df2). Note that a 1 for "a-c-d" is a 1 for EACH of "a", "c", and "d".
If you know the unique categories in advance (e.g. ["a", "b", "c", "d"]), you can take a little shortcut: use df.filter to gather all of the columns containing that letter, then .sum(axis=1) to sum across those columns and produce the expected summary value:
data = {"ID": df["ID"]}
for letter in ["a", "b", "c", "d"]:
    # collect every column whose name contains this letter, then sum row-wise
    data[letter] = df.filter(like=letter).sum(axis=1)

final_df = pd.DataFrame(data)
print(final_df)
   ID  a  b  c  d
0   1  1  0  1  1
1   2  1  0  1  1
2   3  0  1  1  1
3   4  2  0  1  2
4   5  0  0  1  1
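One caveat: df.filter(like=...) does plain substring matching, which works here because every category is a single distinct letter, but it would over-match if names shared characters (say, hypothetical categories "a" and "ab"). A sketch of a stricter variant using the regex form, anchoring each letter to the '-' separators:

data = {"ID": df["ID"]}
for letter in ["a", "b", "c", "d"]:
    # match the letter only as a whole '-'-separated token
    data[letter] = df.filter(regex=rf"(^|-){letter}($|-)").sum(axis=1)
final_df = pd.DataFrame(data)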
Let's try melt to stack the column names, then str.split followed by explode to split the compound labels and duplicate the data:
(df1.melt('ID', var_name='col')
    .assign(col=lambda x: x['col'].str.split('-'))
    .explode('col')
    .pivot_table(index='ID', columns='col', values='value', aggfunc='sum')
    .reset_index()
)
Output:
col  ID  a  b  c  d
0     1  1  0  1  1
1     2  1  0  1  1
2     3  0  1  1  1
3     4  2  0  1  2
4     5  0  0  1  1
Something like split, then explode, and groupby with sum:
out = (df.T.reset_index()
         .assign(index=lambda x: x['index'].str.split('-'))
         .explode('index')
         .groupby('index').sum()
         .T)
Out[102]:
index  ID  a  b  c  d
0       1  1  0  1  1
1       2  1  0  1  1
2       3  0  1  1  1
3       4  2  0  1  2
4       5  0  0  1  1
Well, just to complete the answers here, a more manual method is as follows:
df1.loc[:, 'a'] = df1.loc[:, 'a'] + df1.loc[:, 'a-d'] + df1.loc[:, 'a-c-d']
df1.loc[:, 'c'] = df1.loc[:, 'c'] + df1.loc[:, 'c-d'] + df1.loc[:, 'a-c-d']
df1.loc[:, 'd'] = df1.loc[:, 'd'] + df1.loc[:, 'a-d'] + df1.loc[:, 'c-d'] + df1.loc[:, 'a-c-d']
df1 = df1.drop(columns=['a-d', 'c-d', 'a-c-d'])  # drop the combined columns once folded in
Output:
   ID  a  b  c  d
0   1  1  0  1  1
1   2  1  0  1  1
2   3  0  1  1  1
3   4  2  0  1  2
4   5  0  0  1  1
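If you'd rather not hard-code which combined columns feed each letter, the same manual idea can be generalized; a sketch, starting again from the original df1 (before the in-place updates above) and assuming the single-letter base columns are known:

base = ['a', 'b', 'c', 'd']
out = df1[['ID'] + base].copy()
# every remaining column is a '-'-joined combination; fold it into each of its letters
for col in df1.columns.difference(['ID'] + base):
    for letter in col.split('-'):
        out[letter] += df1[col]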
I have a dataset with multiple columns that I wish to one-hot encode. However, I don't want a separate encoding for each column, since the columns all draw from the same set of items. What I want is one "set" of dummy variables covering all the columns together. See my code for a better explanation.
Suppose my dataframe looks like this:
In [103]: dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})
In [104]: dum
Out[104]:
  ch1 ch2 ch3
0   A   B   C
1   C   G   D
2   A   F   E
If I execute
pd.get_dummies(dum)
the output will be:
   ch1_A  ch1_C  ch2_B  ch2_F  ch2_G  ch3_C  ch3_D  ch3_E
0      1      0      1      0      0      1      0      0
1      0      1      0      0      1      0      1      0
2      1      0      0      1      0      0      0      1
However, what I would like to obtain is something like this:
A  B  C  D  E  F  G
1  1  1  0  0  0  0
0  0  1  1  0  0  1
1  0  0  0  1  1  0
Instead of having multiple columns representing the encoding, e.g. ch1_A and ch1_C, I only wish to have one group of columns (A, B, and so on) with value 1 whenever that value shows up in any of ch1, ch2, ch3.
To clarify: in my original dataset, a single row won't contain the same value (A, B, C, ...) more than once; each value appears in at most one of the columns.
Using stack and str.get_dummies
dum.stack().str.get_dummies().sum(level=0)
Out[938]:
   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0
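Note that summing with a level argument (.sum(level=0)) was deprecated in pandas 1.3 and removed in 2.0; on recent versions the equivalent spelling is a groupby on the index level:

dum.stack().str.get_dummies().groupby(level=0).sum()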
You could use pd.crosstab to create a frequency table:
import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})

stacked = dum.stack()                      # one long Series holding every value
index = stacked.index.get_level_values(0)  # the original row labels
result = pd.crosstab(index=index, columns=stacked)
result.index.name = None
result.columns.name = None
print(result)
yields
   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0
Call it this way:
x = pd.get_dummies(dum, prefix="", prefix_sep="")
And then print it with:
print(x.to_string(index=False))
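One thing to watch: stripping the prefixes does not merge columns, so when the same value appears under more than one of ch1/ch2/ch3 (as 'C' does here) you end up with duplicate column labels. A sketch that collapses any duplicates afterwards:

x = pd.get_dummies(dum, prefix="", prefix_sep="")
x = x.T.groupby(level=0).max().T  # merge duplicate labels into a single column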
You can create dummies for separate columns and concat the results:
temp = pd.concat([pd.get_dummies(dum[col]) for col in dum], axis=1)
   A  C  B  F  G  C  D  E
0  1  0  1  0  0  1  0  0
1  0  1  0  0  1  0  1  0
2  1  0  0  1  0  0  0  1
temp.groupby(level=0, axis=1).sum()
   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0
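On newer pandas versions, where grouping along axis=1 is deprecated, the same collapse can be written by transposing first:

temp.T.groupby(level=0).sum().T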
Example
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> s
0    a
1    b
2    c
3    a
dtype: object
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
Now I would like to map a and b to a dummy variable, but nothing else. How can I do that?
What I tried
>>> pd.get_dummies(s, columns=['a', 'b'])
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
A simpler method is to just mask the resultant df with the cols of interest:
In[16]:
pd.get_dummies(s)[list('ab')]
Out[16]:
   a  b
0  1  0
1  0  1
2  0  0
3  1  0
This sub-selects the resultant dummies df with just the cols of interest.
If you don't want to compute dummy columns for values you aren't interested in to begin with, you can filter the Series down to the rows of interest, but this requires reindexing with a fill_value (thanks to @jezrael for the suggestion):
In[20]:
pd.get_dummies(s[s.isin(list('ab'))]).reindex(s.index, fill_value=0)
Out[20]:
   a  b
0  1  0
1  0  1
2  0  0
3  1  0
Setting everything else to nan is one option:
s[~((s == 'a') | (s == 'b'))] = float('nan')
which yields:
>>> pd.get_dummies(s)
   a  b
0  1  0
1  0  1
2  0  0
3  1  0
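If you'd rather not mutate s in place, Series.where does the same masking non-destructively:

pd.get_dummies(s.where(s.isin(['a', 'b'])))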
Another way
In [3907]: pd.DataFrame({c:s.eq(c).astype(int) for c in ['a', 'b']})
Out[3907]:
   a  b
0  1  0
1  0  1
2  0  0
3  1  0
Or, equivalently, (s == c).astype(int) for each column.
Is there a more efficient way to create multiple new columns in a pandas dataframe df initialized to zero than:
for col in add_cols:
df.loc[:, col] = 0
UPDATE: using @Jeff's method, but doing it dynamically:
In [208]: add_cols = list('xyz')
In [209]: df.assign(**{i:0 for i in add_cols})
Out[209]:
   a  b  c  x  y  z
0  4  8  6  0  0  0
1  3  7  0  0  0  0
2  4  0  1  0  0  0
3  5  4  5  0  0  0
4  1  3  0  0  0  0
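Note that assign returns a new DataFrame rather than modifying df in place; dict.fromkeys builds the same zero-valued mapping a little more directly:

df = df.assign(**dict.fromkeys(add_cols, 0))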
OLD answer:
Another method:
df[add_cols] = pd.DataFrame(0, index=df.index, columns=add_cols)
Demo:
In [343]: df = pd.DataFrame(np.random.randint(0, 10, (5,3)), columns=list('abc'))
In [344]: add_cols = list('xyz')
In [345]: add_cols
Out[345]: ['x', 'y', 'z']
In [346]: df
Out[346]:
   a  b  c
0  4  9  0
1  1  1  1
2  8  8  1
3  0  1  4
4  8  5  6
In [347]: df[add_cols] = pd.DataFrame(0, index=df.index, columns=add_cols)
In [348]: df
Out[348]:
   a  b  c  x  y  z
0  4  9  0  0  0  0
1  1  1  1  0  0  0
2  8  8  1  0  0  0
3  0  1  4  0  0  0
4  8  5  6  0  0  0
In [13]: df = pd.DataFrame(np.random.randint(0, 10, (5,3)), columns=list('abc'))
In [14]: df
Out[14]:
   a  b  c
0  7  2  3
1  7  0  7
2  5  1  5
3  9  1  4
4  2  1  4
In [15]: df.assign(x=0, y=0, z=0)
Out[15]:
   a  b  c  x  y  z
0  7  2  3  0  0  0
1  7  0  7  0  0  0
2  5  1  5  0  0  0
3  9  1  4  0  0  0
4  2  1  4  0  0  0
Here is a hack (the list comprehension is used purely for its side effects; df.insert(0, col, 0) places each new column at the front):
[df.insert(0, col, 0) for col in add_cols]
You can treat a DataFrame with a dict-like syntax:
for col in add_cols:
df[col] = 0
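A reindex-based variant does the same thing in one call; a sketch, assuming none of add_cols already exist in df (reindex fills new labels with 0 and leaves existing columns untouched):

df = df.reindex(columns=[*df.columns, *add_cols], fill_value=0)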
I'm fairly new to Python.
I have two columns in a dataframe, something like this:
db = pd.read_excel(path_to_file/file.xlsx)
db = db.loc[:,['col1','col2']]
col1  col2
C     4
C     5
A     1
B     6
B     1
A     2
C     4
I need them to be like this:
   1  2  3  4  5  6
A  1  1  0  0  0  0
B  1  0  0  0  0  1
C  0  0  0  2  1  0
so that the values of col1 become the rows, the values of col2 become the columns, and each cell holds the number of occurrences.
Say your columns are called cat and val:
In [26]: df = pd.DataFrame({'cat': ['C', 'C', 'A', 'B', 'B', 'A', 'C'], 'val': [4, 5, 1, 6, 1, 2, 4]})
In [27]: df
Out[27]:
  cat  val
0   C    4
1   C    5
2   A    1
3   B    6
4   B    1
5   A    2
6   C    4
Then you can group the table hierarchically and unstack it, counting the pairs rather than summing their values:
In [28]: df.val.groupby([df.cat, df.val]).count().unstack().fillna(0).astype(int)
Out[28]:
val  1  2  4  5  6
cat
A    1  1  0  0  0
B    1  0  0  0  1
C    0  0  2  1  0
Edit
As IanS pointed out, 3 is missing here (thanks!). If there's a range of columns you must have, then you can use
r = df.val.groupby([df.cat, df.val]).count().unstack().fillna(0).astype(int)
for c in set(range(1, 7)) - set(df.val.unique()):
    r[c] = 0
r = r[sorted(r.columns)]  # keep the columns in order after appending
I think you need to aggregate by size and add the missing columns with reindex:
print (df)
   a  b
0  C  4
1  C  5
2  A  1
3  B  6
4  B  1
5  A  2
6  C  4
df1 = (df.b.groupby([df.a, df.b])
           .size()
           .unstack()
           .reindex(columns=range(1, df.b.max() + 1))
           .fillna(0)
           .astype(int))
df1.index.name = None
df1.columns.name = None
print (df1)
   1  2  3  4  5  6
A  1  1  0  0  0  0
B  1  0  0  0  0  1
C  0  0  0  2  1  0
Instead of size you can use count; size counts NaN values, count does not.
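pd.crosstab is a more compact spelling of the same count table; a sketch against the same df, with reindex restoring the missing 3 column:

out = pd.crosstab(df.a, df.b).reindex(columns=range(1, df.b.max() + 1), fill_value=0)
out.index.name = None
out.columns.name = None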