Get name of the row in dataframe (python pandas)

For example, I have a dataframe with 5 rows and 5 columns. The rows and columns have the same names. Example:
  A B C D E
A 0 0 0 0 0
B 0 0 0 0 0
C 0 0 0 0 0
D 0 0 0 0 0
E 0 0 0 0 0
How can I loop through my dataframe, comparing each column name with each row name, and set the value to 1 wherever they are equal?
  A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 0
E 0 0 0 0 1

You could use numpy.fill_diagonal on the values of your dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((5,5)), columns=list('ABCDE'), index=list("ABCDE"))
In [37]: np.fill_diagonal(df.values, 1)
In [38]: df
Out[38]:
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 0
E 0 0 0 0 1
EDIT
If you only need to fill the positions where the row and column labels actually match, you can build a boolean mask, use it to slice the diagonal positions where that's true, and assign whatever value you want:
df = pd.DataFrame(np.zeros((5,5)), columns=list('ABCDE'), index=list("ABCGE"))
mask = df.columns == df.index
df.values[mask, mask] = 1
In [72]: df
Out[72]:
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
G 0 0 0 0 0
E 0 0 0 0 1

Or, if your rows and columns are not in the same order:
df.apply(lambda row: row.index == row.name, axis=1).astype(int)
The .astype(int) at the end converts booleans to integers.
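The label-comparison idea can also be written directly with NumPy broadcasting. A minimal sketch, assuming unique row and column labels:

```python
import numpy as np
import pandas as pd

# Rows deliberately in a different order from the columns.
df = pd.DataFrame(np.zeros((5, 5), dtype=int),
                  columns=list("ABCDE"), index=list("ACBDE"))

# Compare every row label against every column label at once,
# then write the resulting 0/1 matrix back into the frame.
df[:] = (df.index.to_numpy()[:, None] == df.columns.to_numpy()).astype(int)
```

Each cell is 1 exactly where its row label equals its column label, regardless of ordering.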

Related

alter dataframe rows based on exogenous array values

I have this data frame df
A B C D
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
and this array of events
events = ['A', 'C', None, None, 'B']
I want to put 1 in the dataframe for every column where the corresponding event occurred, and nothing if the event is None. So my result dataframe would be
A B C D
1 0 0 0
0 0 1 0
0 0 0 0
0 0 0 0
0 1 0 0
The obvious manner would be a loop:
for i, event in enumerate(events):
    if event is not None:
        df.loc[i, event] = 1
Using .loc avoids chained indexing (df[event][i] = 1), whose assignment is not guaranteed to write through to df.
Is there a more efficient manner when the number of rows is huge?
You can use str.get_dummies on a Series created from events and then reindex the columns to match df.
events = ['A', 'C', None, None, 'B']
df_ = (pd.Series(events)
.str.get_dummies()
.reindex(columns=df.columns, fill_value=0)
)
print(df_)
A B C D
0 1 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 0 1 0 0
The reindex is really there to add the missing column D; in your real case you may not need it.
Use NumPy broadcasting to compare df.columns against events and populate the values:
import numpy as np
df[:] = (df.columns.to_numpy() == np.array(events)[:,None]).astype(int)
Out[44]:
A B C D
0 1 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 0 1 0 0
If you want to be more verbose:
df[:] = np.equal(df.columns, np.array(events)[:,None]).astype(int)
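Both answers produce the same 0/1 frame; a self-contained sketch checking them against each other:

```python
import numpy as np
import pandas as pd

events = ['A', 'C', None, None, 'B']
df = pd.DataFrame(0, index=range(len(events)), columns=list('ABCD'))

# Approach 1: get_dummies on the events Series, then align the columns.
via_dummies = (pd.Series(events)
                 .str.get_dummies()
                 .reindex(columns=df.columns, fill_value=0))

# Approach 2: broadcast the column labels against the events array;
# None never equals a label, so those rows stay all zero.
via_broadcast = df.copy()
via_broadcast[:] = (df.columns.to_numpy()
                    == np.array(events)[:, None]).astype(int)
```

The broadcasting version avoids building an intermediate dummies frame, which matters when the number of rows is huge.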

Create a categorical column based on different binary columns in python

I have a dataset that looks like this:
df = pd.DataFrame(data= [[0,0,1],[1,0,0],[0,1,0]], columns = ['A','B','C'])
A B C
0 0 0 1
1 1 0 0
2 0 1 0
I want to create a new column that, on each row, contains the name of the column holding the 1:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Use dot:
df['value'] = df.values.dot(df.columns)
Output:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Using pd.DataFrame.idxmax:
df['value'] = df.idxmax(axis=1)
print(df)
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
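Both answers agree when every row contains exactly one 1; a sketch contrasting them, with their edge-case caveats in the comments:

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 1], [1, 0, 0], [0, 1, 0]], columns=['A', 'B', 'C'])

# dot: 0 * 'A' is '' and 1 * 'C' is 'C', and summing strings
# concatenates, so each row yields exactly its matching label.
# Caveat: a row with several 1s would concatenate labels (e.g. 'AB').
via_dot = df.to_numpy().dot(df.columns)

# idxmax: label of the first maximum in each row.
# Caveat: an all-zero row silently returns the first column label.
via_idxmax = df.idxmax(axis=1)
```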

How do I create dummy variables for a subset of a categorical variable?

Example
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> s
0 a
1 b
2 c
3 a
dtype: object
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Now I would like to create dummy variables for a and b only, and nothing else. How can I do that?
What I tried
>>> pd.get_dummies(s, columns=['a', 'b'])
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
A simpler method is to just select the columns of interest from the resulting df:
In[16]:
pd.get_dummies(s)[list('ab')]
Out[16]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
This sub-selects the columns of interest from the resulting dummies df.
If you don't want to compute dummy columns for the values you're not interested in in the first place, you can filter the Series down to the rows of interest first, but this requires reindexing with a fill_value (thanks to @jezrael for the suggestion):
In[20]:
pd.get_dummies(s[s.isin(list('ab'))]).reindex(s.index, fill_value=0)
Out[20]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
Setting everything else to NaN is one option:
s[~((s == 'a') | (s == 'b'))] = float('nan')
which yields:
>>> pd.get_dummies(s)
a b
0 1 0
1 0 1
2 0 0
3 1 0
Another way:
In [3907]: pd.DataFrame({c:s.eq(c).astype(int) for c in ['a', 'b']})
Out[3907]:
a b
0 1 0
1 0 1
2 0 0
3 1 0
Or, equivalently, use (s == c).astype(int) inside the comprehension.
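Another option is to declare the categories up front with pd.Categorical, so get_dummies never builds the unwanted columns at all. A sketch (depending on your pandas version the dummies may come back as bool rather than int, hence the explicit cast when checking):

```python
import pandas as pd

s = pd.Series(list('abca'))

# Values outside the declared categories become NaN, so get_dummies
# emits columns only for 'a' and 'b'; NaN rows are all zero.
dummies = pd.get_dummies(pd.Categorical(s, categories=['a', 'b']))
```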

Pandas DataFrame: How to convert binary columns into one categorical column?

Given a pandas DataFrame, how does one convert several binary columns (where 1 denotes the value exists, 0 denotes it doesn't) into a single categorical column?
Another way to think of this: how does one perform the reverse of pd.get_dummies()?
Here is an example of converting a categorical column into several binary columns:
import pandas as pd
s = pd.Series(list('ABCDAB'))
df = pd.get_dummies(s)
df
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
What I would like to accomplish is given a dataframe
df1
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
how do I convert it into
df1
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
One way would be to use idxmax to find the 1s:
In [32]: df["category"] = df.idxmax(axis=1)
In [33]: df
Out[33]:
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
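On pandas 1.5 or newer there is also a built-in inverse, pd.from_dummies. A sketch (the result column is unnamed when no sep is given, so it is selected positionally here):

```python
import pandas as pd

df1 = pd.get_dummies(pd.Series(list('ABCDAB')))

# from_dummies inverts get_dummies; it requires exactly one 1 per row
# (or a default_category for all-zero rows).
category = pd.from_dummies(df1)
```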

How to get dummies in complicated condition

I have dataframe below
col
A a
A b
A c
B d
B e
C f
I would like to get dummy variable
a b c d e f
A 1 1 1 0 0 0
B 0 0 0 1 1 0
C 0 0 0 0 0 1
How can I get this?
I tried
df.col.str.get_dummies()
But I couldn't figure out how to group the result by the index.
You need to group by the index and aggregate with max:
print(df.col.str.get_dummies().groupby(level=0).max())
a b c d e f
A 1 1 1 0 0 0
B 0 0 0 1 1 0
C 0 0 0 0 0 1
Or:
print(pd.get_dummies(df.col).groupby(level=0).max())
a b c d e f
A 1 1 1 0 0 0
B 0 0 0 1 1 0
C 0 0 0 0 0 1
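pd.crosstab gives the same table in one call: counting (index, value) pairs that each occur at most once is equivalent to the dummies-plus-max approach. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'col': list('abcdef')}, index=list('AAABBC'))

# Cross-tabulate row labels against values: each (label, value)
# pair appears once, so the counts are already 0/1.
table = pd.crosstab(df.index, df['col'])
```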
