I have this data frame df
A B C D
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
and this array of events
events = ['A', 'C', None, None, 'B']
I want to put a 1 in the DataFrame for every column where the corresponding event occurred, and nothing if the event is None. So my resulting DataFrame would be:
A B C D
1 0 0 0
0 0 1 0
0 0 0 0
0 0 0 0
0 1 0 0
The obvious approach would be a loop:
for i, event in enumerate(events):
    if event is not None:
        df.loc[i, event] = 1
Is there a more efficient way when the number of rows is huge?
You can use str.get_dummies on a Series built from events and then reindex the columns to match df.
events = ['A', 'C', None, None, 'B']
df_ = (pd.Series(events)
.str.get_dummies()
.reindex(columns=df.columns, fill_value=0)
)
print(df_)
A B C D
0 1 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 0 1 0 0
The reindex is really just to add the missing column D here; in your real case you may not need it. Note that str.get_dummies treats the None entries as NaN, so those rows come out as all zeros, which is what you want.
Use numpy broadcasting to compare df.columns against events and populate the values:
import numpy as np
df[:] = (df.columns.to_numpy() == np.array(events)[:,None]).astype(int)
Out[44]:
A B C D
0 1 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 0 1 0 0
If you want something more verbose:
df[:] = np.equal(df.columns, np.array(events)[:,None]).astype(int)
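For very large numbers of rows, a positional-assignment variant may also be worth sketching: instead of comparing every column label against every event, compute the row and column positions that actually need a 1 and use fancy indexing. This is only a sketch of the idea, not from the answers above; the names rows, cols and arr are illustrative, and it assumes events lines up with df's positional rows.
import numpy as np
import pandas as pd

events = ['A', 'C', None, None, 'B']
df = pd.DataFrame(0, index=range(len(events)), columns=list('ABCD'))

# positions of the events that are not None, and of their target columns
rows = np.array([i for i, e in enumerate(events) if e is not None], dtype=int)
cols = df.columns.get_indexer([e for e in events if e is not None])

# build the 0/1 array in one shot and assign it back into the frame
arr = np.zeros((len(events), len(df.columns)), dtype=int)
arr[rows, cols] = 1
df[:] = arr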
Related
I have a pandas DataFrame whose elements are 0 or 1, and each row and column has a label.
For example:
import pandas as pd
import numpy as np
N = 100
M = 200
p = 0.8
df = pd.DataFrame(np.random.choice([0,1], (M,N), p=(p, 1-p)),
columns=sorted((list(range(10))*N)[0:N]),
index=sorted((list(range(10))*N)[0:M]))
What I would like to do is sort the elements of each block according to their degree (row and column sums); each label defines a block.
The idea is to use the sum of the rows and columns of each block.
For example, for the first block:
df.loc[0, 0].sum(axis=0)
df.loc[0, 0].sum(axis=1)
and use these values to sort all the rows and columns with label 0.
Imagine you have:
   0 0 0 1 1 1
0  1 1 1 0 0 1
0  0 0 1 0 0 0
0  1 1 0 0 0 1
1  0 0 0 1 0 0
1  0 0 0 1 1 0
1  0 0 0 1 1 1
The desired output is:
   0 0 0 1 1 1
0  1 1 1 0 0 1
0  1 1 0 0 0 1
0  1 0 0 0 0 0
1  0 0 0 1 1 1
1  0 0 0 1 1 0
1  0 0 0 1 0 0
What matters is the sum of the elements within each block, not the sum over the entire row or column.
The following is a starting point that might lead to a more general solution:
import pandas as pd
import numpy as np
d= """1 1 1 0 0 1
0 0 1 0 0 0
1 1 0 0 0 1
0 0 0 1 0 0
0 0 0 1 1 0
0 0 0 1 1 1"""
a = [[int(x) for x in row.split()] for row in d.split('\n')]
df = pd.DataFrame(a)
print(df)
print('-' * 10)
# sort the rows: block label first, then descending row sum within each block
df['rsum'] = df.sum(axis=1)
df['row'] = [0, 0, 0, 1, 1, 1]
df.sort_values(by=['row', 'rsum'], ascending=[True, False], inplace=True)
df.drop(['rsum', 'row'], axis=1, inplace=True)
print(df)
print('-' * 10)
# repeat the same procedure on the columns by working on the transpose
df = df.transpose()
df.reset_index(drop=True, inplace=True)
df['rsum'] = df.sum(axis=1)
df['row'] = [0, 0, 0, 1, 1, 1]
df.sort_values(by=['row', 'rsum'], ascending=[True, False], inplace=True)
df.drop(['rsum', 'row'], axis=1, inplace=True)
df = df.transpose()
print(df.values)
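As a hedged sketch of one possible generalisation (my addition, not part of the starting point above): if the block labels are stored as the DataFrame's index and column labels, the same "sort the rows, then transpose and repeat" idea can be written once as a helper. Like the snippet above, it sorts by full row/column sums within each label rather than strictly per diagonal block, and it reuses the string d defined above; labels, df2, sort_one_axis and sort_blocks are illustrative names.
import numpy as np
import pandas as pd

def sort_one_axis(frame):
    # order rows by block label (ascending), then by row sum (descending)
    # within each label; assumes the labels are sortable by np.lexsort
    # (they are plain integers here)
    order = np.lexsort((-frame.sum(axis=1).to_numpy(), frame.index.to_numpy()))
    return frame.iloc[order]

def sort_blocks(frame):
    # sort the rows, then do the same to the columns via the transpose
    return sort_one_axis(sort_one_axis(frame).T).T

# give the toy data its block labels as index/column labels
labels = [0, 0, 0, 1, 1, 1]
df2 = pd.DataFrame([[int(x) for x in row.split()] for row in d.split('\n')],
                   index=labels, columns=labels)
print(sort_blocks(df2).values)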
I have a dataset that looks like this:
df = pd.DataFrame(data= [[0,0,1],[1,0,0],[0,1,0]], columns = ['A','B','C'])
A B C
0 0 0 1
1 1 0 0
2 0 1 0
I want to create a new column where each row holds the name of the column that contains a 1:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Use dot:
df['value'] = df.values.dot(df.columns)
Output:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
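Why the dot trick works (a small illustration, my addition): with 0/1 integers and string column labels, the dot product falls back to Python object semantics, where 0 * 'A' is '' and 1 * 'C' is 'C', so summing each row's products yields the matching label.
import numpy as np

np.array([0, 0, 1]).dot(np.array(['A', 'B', 'C'], dtype=object))  # -> 'C'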
Using pd.DataFrame.idxmax:
df['value'] = df.idxmax(1)
print(df)
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
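One hedged caveat (my addition, not part of the answers above): if a row could contain no 1 at all, idxmax would still report the first column for it, so you may want to mask such rows explicitly.
import pandas as pd

# same frame as above, plus an all-zero row for illustration
df = pd.DataFrame([[0, 0, 1], [1, 0, 0], [0, 0, 0]], columns=['A', 'B', 'C'])

flags = df[['A', 'B', 'C']]
# keep the column name only where the row actually has a 1, else NaN
df['value'] = flags.idxmax(axis=1).where(flags.any(axis=1))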
Example
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> s
0 a
1 b
2 c
3 a
dtype: object
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Now I would like to map only a and b to dummy variables, and nothing else. How can I do that?
What I tried
>>> pd.get_dummies(s, columns=['a', 'b'])
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
A simpler method is to just subset the resulting dummies DataFrame with the columns of interest:
In[16]:
pd.get_dummies(s)[list('ab')]
Out[16]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
So this will sub-select the resulting dummies DataFrame with the columns of interest.
If you don't want to compute dummy columns for values you aren't interested in to begin with, you could filter the Series down to the rows of interest first, but this requires reindexing with a fill_value (thanks to @jezrael for the suggestion):
In[20]:
pd.get_dummies(s[s.isin(list('ab'))]).reindex(s.index, fill_value=0)
Out[20]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
Setting everything else to NaN is one option:
s[~((s == 'a') | (s == 'b'))] = float('nan')
which yields:
>>> pd.get_dummies(s)
a b
0 1 0
1 0 1
2 0 0
3 1 0
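A non-mutating variant of the same idea (a sketch, my addition, so the original s is left untouched): use Series.where to blank out the unwanted values instead of assigning into s.
import pandas as pd

s = pd.Series(list('abca'))
# values outside {'a', 'b'} become NaN, which get_dummies simply encodes as all zeros
pd.get_dummies(s.where(s.isin(['a', 'b'])))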
Another way:
In [3907]: pd.DataFrame({c:s.eq(c).astype(int) for c in ['a', 'b']})
Out[3907]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
Or, equivalently, (s == c).astype(int).
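One more sketch (my addition): if you would rather have get_dummies itself only know about a and b, cast the Series to a categorical restricted to those categories first; everything else becomes NaN and encodes as all zeros.
import pandas as pd

s = pd.Series(list('abca'))
# 'c' is not a listed category, so it becomes NaN and gets an all-zero row
pd.get_dummies(s.astype(pd.CategoricalDtype(categories=['a', 'b'])))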
Given a pandas DataFrame, how does one convert several binary columns (where 1 denotes the value exists, 0 denotes it doesn't) into a single categorical column?
Another way to think of this is: how does one perform the reverse of pd.get_dummies()?
Here is an example of converting a categorical column into several binary columns:
import pandas as pd
s = pd.Series(list('ABCDAB'))
df = pd.get_dummies(s)
df
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
What I would like to accomplish is, given a DataFrame
df1
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
how do I convert it into
df1
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
One way would be to use idxmax to find the 1s:
In [32]: df["category"] = df.idxmax(axis=1)
In [33]: df
Out[33]:
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
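If you are on pandas 1.5 or newer, there is also a built-in inverse, pd.from_dummies; a brief sketch (it expects exactly one 1 per row unless you pass default_category):
import pandas as pd

s = pd.Series(list('ABCDAB'))
df1 = pd.get_dummies(s)

# from_dummies returns a one-column DataFrame; take that column positionally
df1['category'] = pd.from_dummies(df1).iloc[:, 0]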
For example, I have a DataFrame which has 5 rows and 5 columns. The rows and columns have the same names. Example:
...A B C D E
A 0 0 0 0 0
B 0 0 0 0 0
C 0 0 0 0 0
D 0 0 0 0 0
E 0 0 0 0 0
How can I loop through my DataFrame and compare column and row names in order to set a value of 1 where they are equal?
...A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 0
E 0 0 0 0 1
You could use numpy.fill_diagonal on the values of your DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((5,5)), columns=list('ABCDE'), index=list("ABCDE"))
In [37]: np.fill_diagonal(df.values, 1)
In [38]: df
Out[38]:
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 0
E 0 0 0 0 1
EDIT
If you need to fill values only where the row and column labels are the same, you could slice the diagonal values where that's true and assign to them whatever you want:
df = pd.DataFrame(np.zeros((5,5)), columns=list('ABCDE'), index=list("ABCGE"))
mask = df.columns == df.index
df.values[mask, mask] = 1
In [72]: df
Out[72]:
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
G 0 0 0 0 0
E 0 0 0 0 1
Or if your rows and columns are not ordered:
df.apply(lambda row: row.index == row.name, axis=1).astype(int)
The .astype(int) at the end converts booleans to integers.
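A vectorized sketch equivalent to the apply above (my addition): broadcast-compare the row labels against the column labels once instead of once per row.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 5)), columns=list('ABCDE'), index=list('ABCGE'))
# True exactly where the row label equals the column label, then cast to 0/1
out = pd.DataFrame(
    (df.index.to_numpy()[:, None] == df.columns.to_numpy()).astype(int),
    index=df.index, columns=df.columns,
)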