Sort rows and columns of a DataFrame - python

I have a pandas DataFrame whose elements are 0 or 1, and each row and column has a label.
For example:
import pandas as pd
import numpy as np
N = 100
M = 200
p = 0.8
df = pd.DataFrame(np.random.choice([0, 1], (M, N), p=(p, 1 - p)),
                  columns=sorted((list(range(10)) * N)[0:N]),
                  index=sorted((list(range(10)) * N)[0:M]))
What I would like to do is sort the rows and columns of each block by degree (the row and column sums within the block); each distinct label defines a block.
The idea is to use the sums of the rows and columns of each block.
For example, for the first block (note that the labels are integers, not strings):
df.loc[0, 0].sum(axis=0)
df.loc[0, 0].sum(axis=1)
and use these values to sort all the rows and columns with label 0.
Imagine to have:
0 0 0 1 1 1
0 1 1 1 0 0 1
0 0 0 1 0 0 0
0 1 1 0 0 0 1
1 0 0 0 1 0 0
1 0 0 0 1 1 0
1 0 0 0 1 1 1
the desired output is:
0 0 0 1 1 1
0 1 1 1 0 0 1
0 1 1 0 0 0 1
0 1 0 0 0 0 0
1 0 0 0 1 1 1
1 0 0 0 1 1 0
1 0 0 0 1 0 0
What matters is the sum of the elements within each block, not the sum over the entire row or the entire column.

The following is a starting point that might lead to a more general solution:
import pandas as pd
import numpy as np

d = """1 1 1 0 0 1
0 0 1 0 0 0
1 1 0 0 0 1
0 0 0 1 0 0
0 0 0 1 1 0
0 0 0 1 1 1"""

# parse the block of text into a DataFrame of ints
a = [r.split(' ') for r in d.split('\n')]
df = pd.DataFrame([[int(x) for x in row] for row in a])
print(df)
print('-' * 10)

# sort the rows: within each row label, order by descending row sum
# (this uses the full row sum; here the off-block entries are all zero,
# so it coincides with the within-block sum)
df['rsum'] = df.sum(axis=1)
df['row'] = [0, 0, 0, 1, 1, 1]
df.sort_values(by=['row', 'rsum'], ascending=[True, False], inplace=True)
df.drop(['rsum', 'row'], axis=1, inplace=True)
print(df)
print('-' * 10)

# transpose and repeat the same procedure to sort the columns
df = df.transpose()
df.reset_index(drop=True, inplace=True)
df['rsum'] = df.sum(axis=1)
df['row'] = [0, 0, 0, 1, 1, 1]
df.sort_values(by=['row', 'rsum'], ascending=[True, False], inplace=True)
df.drop(['rsum', 'row'], axis=1, inplace=True)
df = df.transpose()
print(df.values)
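A more general sketch (my own attempt, not taken from the question): for every label, sort the rows by their within-block row sums and the columns by their within-block column sums, working with positional indices so that duplicate labels are handled cleanly. The function name `sort_blocks` is made up for illustration.

```python
import numpy as np
import pandas as pd

def sort_blocks(df):
    """Reorder rows and columns so that, inside each label block,
    rows are sorted by their within-block row sum (descending)
    and columns by their within-block column sum (descending)."""
    row_pos, col_pos = [], []
    # row order: per row label, sort by the row sum over same-labelled columns
    for lab in pd.unique(df.index):
        rows = np.where(df.index == lab)[0]
        cols = np.where(df.columns == lab)[0]
        block = df.to_numpy()[np.ix_(rows, cols)]
        order = np.argsort(-block.sum(axis=1), kind='stable')
        row_pos.extend(rows[order])
    # column order: per column label, sort by the column sum over same-labelled rows
    for lab in pd.unique(df.columns):
        rows = np.where(df.index == lab)[0]
        cols = np.where(df.columns == lab)[0]
        block = df.to_numpy()[np.ix_(rows, cols)]
        order = np.argsort(-block.sum(axis=0), kind='stable')
        col_pos.extend(cols[order])
    return df.iloc[row_pos, col_pos]

# the 6x6 toy example from the question
data = [[1, 1, 1, 0, 0, 1],
        [0, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 1],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 1, 1, 1]]
labels = [0, 0, 0, 1, 1, 1]
out = sort_blocks(pd.DataFrame(data, index=labels, columns=labels))
print(out.to_numpy())
```

The same function applies unchanged to the random 200x100 frame above, since it only relies on the duplicated index and column labels.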

Related

Pandas merge columns with similar prefixes

I have a pandas dataframe with binary columns that looks like this:
DEM_HEALTH_PRIV DEM_HEALTH_PRE DEM_HEALTH_HOS DEM_HEALTH_OUT
0 1 0 0
0 0 1 1
I want to take the suffix of each variable and convert the binary variables to one categorical variable that corresponds with the prefix. For example, merge all DEM_HEALTH variables to include a list of "PRE", "HOS", "OTH" etc. where the value of the column is equal to 1.
Output
0           ['PRE']
1    ['HOS', 'OUT']
Any help would be much appreciated!
Try this -
# original dataframe is called df
new_cols = [tuple(i.rsplit('_', 1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
data = (df[df == 1]
        .stack()
        .reset_index(-1)
        .groupby(level=0)['level_1']
        .apply(list))
Explanation
IIUC your data looks something like the following
print(df)
DEM_HEALTH_PRIV DEM_HEALTH_OUT DEM_HEALTH_PRE DEM_HEALTH_HOS
0 0 1 1 1
1 0 1 0 0
2 0 0 1 0
3 0 1 0 0
4 1 0 0 0
5 0 0 1 1
6 1 0 1 0
7 1 0 0 1
8 0 1 0 0
9 0 1 1 0
1. Create multi-index by rsplit
First step is to rsplit (reverse split) each column name at the last occurrence of the "_" substring. Then create a MultiIndex: DEM_HEALTH is level 0 and PRE, HOS, etc. are level 1.
new_cols = [tuple(i.rsplit('_',1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
print(df)
DEM_HEALTH
PRIV OUT PRE HOS
0 0 1 1 1
1 0 1 0 0
2 0 0 1 0
3 0 1 0 0
4 1 0 0 0
5 0 0 1 1
6 1 0 1 0
7 1 0 0 1
8 0 1 0 0
9 0 1 1 0
2. Stack and Groupby over level=0
data = (df[df == 1]
        .stack()
        .reset_index(-1)
        .groupby(level=0)['level_1']
        .apply(list))
0 [HOS, OUT, PRE]
1 [OUT]
2 [PRE]
3 [OUT]
4 [PRIV]
5 [HOS, PRE]
6 [PRE, PRIV]
7 [HOS, PRIV]
8 [OUT]
9 [OUT, PRE]
Name: level_1, dtype: object
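For reference, the stack-based recipe relies on stack() dropping the NaNs left by df[df == 1], a default that has shifted across pandas versions. A version-stable sketch of the same idea, rebuilding a two-row frame with made-up values:

```python
import pandas as pd

# small frame in the shape of the question's example (values made up)
df = pd.DataFrame({"DEM_HEALTH_PRIV": [0, 0],
                   "DEM_HEALTH_PRE":  [1, 0],
                   "DEM_HEALTH_HOS":  [0, 1],
                   "DEM_HEALTH_OUT":  [0, 1]})

# collect, for each row, the suffixes of the columns that hold a 1
suffixes = [c.rsplit("_", 1)[1] for c in df.columns]
result = df.apply(lambda row: [s for s, v in zip(suffixes, row) if v == 1],
                  axis=1)
print(result.tolist())  # [['PRE'], ['HOS', 'OUT']]
```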

alter dataframe rows based on exogenous array values

I have this data frame df
A B C D
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
and this array of events
events = ['A', 'C', None, None, 'B']
I want to put a 1 in the dataframe for every column where the corresponding event occurred, and nothing if None. So my result dataframe would be
A B C D
1 0 0 0
0 0 1 0
0 0 0 0
0 0 0 0
0 1 0 0
The obvious manner would be to do the loop
for i, event in enumerate(events):
    if event is not None:
        df.loc[i, event] = 1
Is there a more efficient way when the number of rows is huge?
You can use str.get_dummies on a Series created from events and then reindex the columns to match df.
events = ['A', 'C', None, None, 'B']
df_ = (pd.Series(events)
       .str.get_dummies()
       .reindex(columns=df.columns, fill_value=0))
print(df_)
A B C D
0 1 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 0 1 0 0
The reindex is really there to add the missing column D; in your real case, you may not need it.
Use numpy broadcasting to compare df.columns against events and populate the values:
import numpy as np
df[:] = (df.columns.to_numpy() == np.array(events)[:,None]).astype(int)
Out[44]:
A B C D
0 1 0 0 0
1 0 0 1 0
2 0 0 0 0
3 0 0 0 0
4 0 1 0 0
If you want a more verbose version:
df[:] = np.equal(df.columns, np.array(events)[:,None]).astype(int)
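As a quick self-contained check of the broadcasting approach, rebuilding the question's frame of zeros (None never compares equal to a column name, so those rows stay all zero):

```python
import numpy as np
import pandas as pd

# the question's frame: 5 all-zero rows, columns A-D
df = pd.DataFrame(0, index=range(5), columns=list("ABCD"))
events = ['A', 'C', None, None, 'B']

# compare every event (rows) against every column name (columns)
df[:] = (df.columns.to_numpy() == np.array(events)[:, None]).astype(int)
print(df.to_numpy())
```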

Fill a numpy matrix's certain column with 1 if all columns are 0

Let's say I have this matrix :
> mat
index values
0 0 0 0 0 0
1 0 0 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 1 0 0
7 0 0 1 0 0
8 0 0 0 0 0
I want to fill mat's first column with the value 1 if all the columns in the iterated row are 0.
So that mat will look like this :
> mat
index values
0 1 0 0 0 0
1 1 0 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 0 0 1 0 0
7 0 0 1 0 0
8 1 0 0 0 0
Here's what I have tried :
for i in range(len(mat)):
    for j in range(5):
        if mat[i][j] != 1:
            mat[i][0] = 1
But this puts a 1 in the first column of every row. Why?
for i in range(len(mat)):
    for j in range(5):
        if mat[i][j] != 1:
            mat[i][0] = 1
This doesn't work because it sets the first column to 1 as soon as any element in the row is not 1. You want to set the first column to 1 only if all columns are 0.
This would work (note the for-else: the assignment runs only when the inner loop finishes without hitting a 1):
for i in range(len(mat)):
    for j in range(5):
        if mat[i][j] == 1:
            break
    else:
        mat[i][0] = 1
Also, a much better solution would be to use sum:
for i in range(len(mat)):
    if sum(mat[i]) == 0:
        mat[i][0] = 1
An alternative solution is to evaluate the row with numpy.any():
for i in range(len(mat)):
    mat[i][0] = 0 if np.any(mat[i]) else 1
or simply without a for-loop:
mat[:, 0] = ~np.any(mat, axis=1)
or, equivalently, assigning only into the all-zero rows (np.bool has been removed from NumPy; use the builtin bool):
mat[~mat.sum(axis=1).astype(bool), 0] = 1
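A quick check of the vectorised one-liner on the matrix from the question, assuming mat is a plain NumPy integer array:

```python
import numpy as np

# the 9x5 matrix from the question
mat = np.array([[0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0],
                [0, 1, 0, 0, 0],
                [0, 1, 0, 0, 0],
                [0, 0, 0, 0, 0],
                [0, 0, 0, 0, 0],
                [0, 0, 1, 0, 0],
                [0, 0, 1, 0, 0],
                [0, 0, 0, 0, 0]])

# rows with no 1 anywhere get a 1 in the first column
mat[:, 0] = ~np.any(mat, axis=1)
print(mat[:, 0])  # [1 1 0 0 1 1 0 0 1]
```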

Pandas DataFrame: How to convert binary columns into one categorical column?

Given a pandas DataFrame, how does one convert several binary columns (where 1 denotes the value exists, 0 denotes it doesn't) into a single categorical column?
Another way to think of this is how to perform the "reverse pd.get_dummies()"?
Here is an example of converting a categorical column into several binary columns:
import pandas as pd
s = pd.Series(list('ABCDAB'))
df = pd.get_dummies(s)
df
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
What I would like to accomplish is: given a dataframe
df1
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
how do I convert it into
df1
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
One way would be to use idxmax to find the 1s:
In [32]: df["category"] = df.idxmax(axis=1)
In [33]: df
Out[33]:
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
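The round-trip can be checked end-to-end: build dummies from a known series, then recover the categories with idxmax:

```python
import pandas as pd

# dummies built from a known series, then recovered via idxmax
s = pd.Series(list("ABCDAB"))
df = pd.get_dummies(s)
df["category"] = df.idxmax(axis=1)
print(df["category"].tolist())  # ['A', 'B', 'C', 'D', 'A', 'B']
```

Note that idxmax returns the first column holding the row maximum, so this assumes exactly one 1 per row; rows with several 1s would silently report only the first.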

Get name of the row in dataframe (python pandas)

For example, I have a dataframe which has 5 rows and 5 columns, and the rows and columns have the same names. Example:
...A B C D E
A 0 0 0 0 0
B 0 0 0 0 0
C 0 0 0 0 0
D 0 0 0 0 0
E 0 0 0 0 0
How can I loop through my dataframe to compare column names and row names, in order to set a value of 1 where the column and row names are equal?
...A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 0
E 0 0 0 0 1
You could use numpy.fill_diagonal on the values of the dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros((5,5)), columns=list('ABCDE'), index=list("ABCDE"))
In [37]: np.fill_diagonal(df.values, 1)
In [38]: df
Out[38]:
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
D 0 0 0 1 0
E 0 0 0 0 1
EDIT
If you need to fill values only where the row and column labels are the same, you can slice the positions where that's true and assign to them whatever you want:
df = pd.DataFrame(np.zeros((5,5)), columns=list('ABCDE'), index=list("ABCGE"))
mask = df.columns == df.index
df.values[mask, mask] = 1
In [72]: df
Out[72]:
A B C D E
A 1 0 0 0 0
B 0 1 0 0 0
C 0 0 1 0 0
G 0 0 0 0 0
E 0 0 0 0 1
Or if your rows and columns are not ordered:
df.apply(lambda row: row.index == row.name, axis=1).astype(int)
The .astype(int) at the end converts booleans to integers.
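Note that writing through df.values may hit a copy under pandas' copy-on-write mode. A version-safe sketch that compares the labels directly and builds the result in one step, on the same example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 5)), columns=list("ABCDE"), index=list("ABCGE"))

# 1 wherever the row label equals the column label, 0 elsewhere
out = pd.DataFrame(
    (df.index.to_numpy()[:, None] == df.columns.to_numpy()).astype(int),
    index=df.index, columns=df.columns)
print(out)
```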
