I have a pandas dataframe with binary columns that looks like this:
DEM_HEALTH_PRIV DEM_HEALTH_PRE DEM_HEALTH_HOS DEM_HEALTH_OUT
0 1 0 0
0 0 1 1
I want to take the suffix of each variable and convert the binary variables to one categorical variable that corresponds with the prefix. For example, merge all DEM_HEALTH variables to include a list of "PRE", "HOS", "OUT", etc. where the value of the column is equal to 1.
Output
DEM_HEALTH
['PRE']
['HOS','OUT']
Any help would be much appreciated!
Try this -
# original dataframe is called df
import pandas as pd

# split each column name on the last '_' into a (prefix, suffix) tuple
new_cols = [tuple(i.rsplit('_', 1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols

# keep only the 1s, stack the suffix level into the index,
# then collect the suffixes per original row
data = df[df==1]\
    .stack()\
    .reset_index(-1)\
    .groupby(level=0)['level_1']\
    .apply(list)
Explanation
IIUC, your data looks something like the following:
print(df)
DEM_HEALTH_PRIV DEM_HEALTH_OUT DEM_HEALTH_PRE DEM_HEALTH_HOS
0 0 1 1 1
1 0 1 0 0
2 0 0 1 0
3 0 1 0 0
4 1 0 0 0
5 0 0 1 1
6 1 0 1 0
7 1 0 0 1
8 0 1 0 0
9 0 1 1 0
1. Create multi-index by rsplit
First step is to rsplit (reverse split) the columns on the last occurrence of the "_" substring. Then create a MultiIndex: DEM_HEALTH is level 0, and PRE, HOS, etc. are level 1.
new_cols = [tuple(i.rsplit('_',1)) for i in df.columns]
new_cols = pd.MultiIndex.from_tuples(new_cols)
df.columns = new_cols
print(df)
DEM_HEALTH
PRIV OUT PRE HOS
0 0 1 1 1
1 0 1 0 0
2 0 0 1 0
3 0 1 0 0
4 1 0 0 0
5 0 0 1 1
6 1 0 1 0
7 1 0 0 1
8 0 1 0 0
9 0 1 1 0
2. Stack and Groupby over level=0
data = df[df==1]\
.stack()\
.reset_index(-1)\
.groupby(level=0)['level_1']\
.apply(list)
0 [HOS, OUT, PRE]
1 [OUT]
2 [PRE]
3 [OUT]
4 [PRIV]
5 [HOS, PRE]
6 [PRE, PRIV]
7 [HOS, PRIV]
8 [OUT]
9 [OUT, PRE]
Name: level_1, dtype: object
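As an aside, the same result can be had without a MultiIndex via DataFrame.dot, which concatenates column labels wherever a row has a 1. A minimal sketch (the sample frame is a stand-in for the original data; note an all-zero row would produce [''] and would need extra handling):
import pandas as pd

df = pd.DataFrame({'DEM_HEALTH_PRIV': [0, 0], 'DEM_HEALTH_PRE': [1, 0],
                   'DEM_HEALTH_HOS': [0, 1], 'DEM_HEALTH_OUT': [0, 1]})

# suffix of each column, e.g. 'DEM_HEALTH_PRE' -> 'PRE'
suffixes = pd.Index([c.rsplit('_', 1)[1] for c in df.columns])

# dot with 'SUFFIX,' strings: 1 * 'PRE,' is 'PRE,' and 0 * 'PRE,' is '',
# so each row concatenates to e.g. 'HOS,OUT,' - strip the comma, then split
data = df.dot(suffixes + ',').str.rstrip(',').str.split(',')
print(data)
# 0         [PRE]
# 1    [HOS, OUT]
# dtype: object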
Related
I have the following dataframe with multi-level columns
In [1]: data = {('A', '10'):[1,3,0,1],
('A', '20'):[3,2,0,0],
('A', '30'):[0,0,3,0],
('B', '10'):[3,0,0,0],
('B', '20'):[0,5,0,0],
('B', '30'):[0,0,1,0],
('C', '10'):[0,0,0,2],
('C', '20'):[1,0,0,0],
('C', '30'):[0,0,0,0]
}
df = pd.DataFrame(data)
df
Out[1]:
A B C
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0
1 3 2 0 0 5 0 0 0 0
2 0 0 3 0 0 1 0 0 0
3 1 0 0 0 0 0 2 0 0
In a new column results I want to return the combined column names containing the maximum value within each subset (i.e. each first-level column group)
My desired output should look like the below
Out[2]:
A B C
10 20 30 10 20 30 10 20 30 results
0 1 3 0 3 0 0 0 1 0 A20&B10&C20
1 3 2 0 0 5 0 0 0 0 A10&B20
2 0 0 3 0 0 1 0 0 0 A30&B30
3 1 0 0 0 0 0 2 0 0 A10&C10
For example the first row:
For column 'A' the max value is under column '20' &
for column 'B' there is only 1 value under '10' &
for column 'C' also it is only one value under '20' &
so the result would be A20&B10&C20
Edit: replaced "+" with "&" in the results column; apparently I was misunderstood and you guys thought I needed the summation, while I need the column names joined by a separator.
Edit2:
The solution provided by @A.B below didn't work for me for some reason, although it works on his side and for the sample data on Google Colab.
Somehow the .idxmax(skipna=True) step raises ValueError: No axis named 1 for object type Series.
I found a workaround by transposing the data before this step, and then transposing it back after.
import numpy as np

map_res = lambda x: ",".join(list(filter(None, ['' if isinstance(x[a], float) else (x[a][0] + x[a][1]) for a in x.keys()])))

# a comment after a trailing backslash is a syntax error,
# so the chain is wrapped in parentheses instead
df['results'] = (df.replace(0, np.nan)
                 .T                        # transpose here
                 .groupby(level=0)         # axis=1 removed from here
                 .idxmax(skipna=True)
                 .T                        # transpose back here
                 .apply(map_res, axis=1))
I am still interested to know why it is not working without the transpose, though.
The idea is to replace 0 with NaN so that DataFrame.stack removes all rows with NaNs. Then get the indices of the maxima with DataFrameGroupBy.idxmax, map the second and third tuple values with map, and aggregate join into a new column per the first index level:
df['results'] = (df.replace(0, np.nan)
.stack([0,1])
.groupby(level=[0,1])
.idxmax()
.map(lambda x: f'{x[1]}{x[2]}')
.groupby(level=0)
.agg('&'.join))
print (df)
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 A20&B10&C20
1 3 2 0 0 5 0 0 0 0 A10&B20
2 0 0 3 0 0 1 0 0 0 A30&B30
3 1 0 0 0 0 0 2 0 0 A10&C10
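For intuition, the intermediate idxmax result holds one full index tuple per (row, group) pair, from which x[1] + x[2] builds strings like 'A20'. A quick peek (the results column, just added above, is dropped first):
import numpy as np

s = df.drop(columns='results').replace(0, np.nan).stack([0, 1])
print(s.groupby(level=[0, 1]).idxmax().head(3))
# 0  A    (0, A, 20)
#    B    (0, B, 10)
#    C    (0, C, 20)
# dtype: object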
Try:
df["results"] = df.groupby(level=0, axis=1).max().sum(1)
print(df)
Prints:
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 7
1 3 2 0 0 5 0 0 0 0 8
2 0 0 3 0 0 1 0 0 0 4
3 1 0 0 0 0 0 2 0 0 3
Group by level 0 with axis=1.
Use idxmax to get the max sub-level indexes as tuples (while skipping NaNs).
Apply a function to the rows (axis=1) to concatenate the names.
In the function (that you apply to rows), iterate over the keys/columns and concatenate the column levels. NaNs (which have type float) are replaced with an empty string and filtered out later.
You won't need df.replace(0, np.nan) if you initially have NaNs and let them remain.
import numpy as np

map_res = lambda x: ",".join(list(filter(None, ['' if isinstance(x[a], float) else (x[a][0] + x[a][1]) for a in x.keys()])))
df['results'] = df.replace(0, np.nan)\
    .groupby(level=0, axis=1)\
    .idxmax(skipna=True)\
    .apply(map_res, axis=1)
Here's the output:
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 A20,B10,C20
1 3 2 0 0 5 0 0 0 0 A10,B20
2 0 0 3 0 0 1 0 0 0 A30,B30
3 1 0 0 0 0 0 2 0 0 A10,C10
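For intuition, the intermediate frame after idxmax holds one (level0, level1) tuple per group, or NaN where a whole group is zero. A peek, run before the results column is added and assuming a pandas version where the axis=1 groupby still works:
print(df.replace(0, np.nan).groupby(level=0, axis=1).idxmax(skipna=True))
#          A        B        C
# 0  (A, 20)  (B, 10)  (C, 20)
# 1  (A, 10)  (B, 20)      NaN
# 2  (A, 30)  (B, 30)      NaN
# 3  (A, 10)      NaN  (C, 10)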
How do I do this operation using pandas?
Initial Df:
A B C D
0 0 1 0 0
1 0 1 0 0
2 0 0 1 1
3 0 1 0 1
4 1 1 0 0
5 1 1 1 0
Final Df:
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
Basically, Param is the count of 1s in that row that appear for the first time in their own column.
Example:
index 0 : the 1 in column B appears for the first time, hence Param = 1
index 1 : none of the 1s appears for the first time in its own column, hence Param = 0
index 2 : the 1s in columns C and D appear for the first time in their columns, hence Param = 2
index 3 : none of the 1s appears for the first time in its own column, hence Param = 0
index 4 : the 1 in column A appears for the first time in its column, hence Param = 1
index 5 : none of the 1s appears for the first time in its own column, hence Param = 0
I will do idxmax and value_counts:
df['Param'] = df.idxmax().value_counts().reindex(df.index, fill_value=0)
df
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
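To see why this works: df.idxmax() returns, per column, the row label where the first 1 appears, and value_counts tallies how many columns "debut" at each row. A quick check on the sample df (before Param is added; tie order in value_counts may vary):
print(df[['A', 'B', 'C', 'D']].idxmax())
# A    4
# B    0
# C    2
# D    2
print(df[['A', 'B', 'C', 'D']].idxmax().value_counts())
# 2    2
# 4    1
# 0    1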
You can check for duplicated values, multiply with df and sum:
df['Param'] = df.apply(lambda x: ~x.duplicated()).mul(df).sum(1)
Output:
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
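Here x.duplicated() flags every repeat of a value within a column, so ~x.duplicated() is True exactly at each value's first appearance; multiplying by df then keeps only the first-time 1s (the True at a column's first 0 is harmless, since it multiplies a 0). An illustrative check on column A:
print(~df['A'].duplicated())
# 0     True
# 1    False
# 2    False
# 3    False
# 4     True
# 5    False
# Name: A, dtype: bool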
Assuming these are integers, you can use cumsum() twice to isolate the first occurrence of 1.
df2 = (df.cumsum() > 0).cumsum() == 1
df['Param'] = df2.sum(axis = 1)
print(df)
If df elements are strings, you should first convert them to integers.
df = df.astype(int)
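For reference, (df.cumsum() > 0) turns True from each column's first 1 onward, and the second cumsum equals 1 only on that first True row, so df2 flags exactly the first occurrences:
print(df2)
#        A      B      C      D
# 0  False   True  False  False
# 1  False  False  False  False
# 2  False  False   True   True
# 3  False  False  False  False
# 4   True  False  False  False
# 5  False  False  False  False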
I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that adds cumulative 1's from column A, and starts over if the value in column A becomes 0 again. So desired output:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let us do cumsum with groupby and cumcount:
df['B'] = (df.groupby(df.A.eq(0).cumsum()).cumcount()).where(df.A == 1, 0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
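The grouper here is df.A.eq(0).cumsum(), which bumps a counter at every 0, so each run of 1s (together with its leading 0) forms its own group; cumcount then numbers the positions within each group, and where(df.A==1, 0) zeroes everything outside the 1s. A quick look at the grouper:
print(df.A.eq(0).cumsum().tolist())
# [1, 1, 2, 3, 3, 3, 3, 3, 4]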
Use shift with ne and groupby.cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
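Here the grouper df['A'].shift().ne(df['A']).cumsum() starts a new group id whenever the value changes, so each run of equal values is its own group and the cumulative sum of A restarts per run:
print(df['A'].shift().ne(df['A']).cumsum().tolist())
# [1, 2, 3, 3, 4, 4, 4, 4, 5]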
I have a DataFrame:
df.head()
Index Value
0 1.0,1.0,1.0,1.0
1 1.0,1.0
2 1.0,1.0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0
4 4
I'd like to count the occurrences of values in the Value column:
Index Value 1 2 3 4
0 1.0,1.0,1.0,1.0 4 0 0 0
1 1.0,1.0 2 0 0 0
2 1.0,1.0 2 0 0 0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0 0 0 6 2
4 4 0 0 0 1
I've done this before with string values but I used Counter - which I found you can't use with floats?
df_counts = df['Value'].apply(lambda x: pd.Series(Counter(x.split(','))), 1).fillna(0).astype(int)
Use map to convert the values to floats, and last convert the columns to integers:
df_counts = (df['Value'].apply(lambda x: pd.Series(Counter(map(float, x.split(',')))), 1)
.fillna(0)
.astype(int)
.rename(columns=int))
print (df_counts)
1 3 4
0 4 0 0
1 2 0 0
2 2 0 0
3 0 6 2
4 0 0 1
Last, if necessary, add all the missing categories with reindex and join to the original:
cols = np.arange(df_counts.columns.min(), df_counts.columns.max() + 1)
df = df.join(df_counts.reindex(columns=cols, fill_value=0))
print (df)
Value 1 2 3 4
Index
0 1.0,1.0,1.0,1.0 4 0 0 0
1 1.0,1.0 2 0 0 0
2 1.0,1.0 2 0 0 0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0 0 0 6 2
4 4 0 0 0 1
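On newer pandas (0.25+), an alternative without Counter is to split, explode, and count per row; a sketch on the same data, producing the same counts before the reindex step:
counts = (df['Value'].str.split(',')
            .explode()                   # one element per row, original index kept
            .astype(float).astype(int)   # '1.0' -> 1.0 -> 1
            .groupby(level=0)
            .value_counts()
            .unstack(fill_value=0))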
Let's say I have a dataframe:
df = pd.DataFrame(np.random.randint(0,5, size=(5,6)), columns=list('ABCDEF'))
Crossing variables with pd.crosstab is simple enough:
table = pd.crosstab(index=df['A'], columns=df['B'])
Yields:
B 1 2 3 4
A
0 1 0 0 0
1 0 0 0 1
2 0 1 1 0
3 0 1 0 0
Whereas I would, for example, want a table like this:
B (1+2+3) 1 2 3 4
A
0 1 1 0 0 0
1 0 0 0 0 1
2 2 0 1 1 0
3 1 0 1 0 0
Can anyone set me on the right track here?
Use sum with a subset of columns. Note that with a small random df the sampled values change between runs, so the column values will differ; use np.random.seed(100) to get the same test output as in my answer.
table['(1+2+3)'] = table[[1,2,3]].sum(axis=1)
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(0,5, size=(5,6)), columns=list('ABCDEF'))
table = pd.crosstab(index=df['A'], columns=df['B'])
table['(1+2+3)'] = table[[1,2,3]].sum(axis=1)
print (table)
B 0 1 2 3 4 (1+2+3)
A
0 1 0 0 0 1 0
1 0 0 0 1 0 1
2 0 0 1 0 0 1
3 0 1 0 0 0 1