Pandas: Reshaping dataframe - python

I have a panda's related question. My dataframe looks something like this:
id val1 val2
0 1 0 1
1 1 1 0
2 1 0 0
3 2 1 1
4 2 1 1
5 2 1 0
6 3 0 0
7 3 0 1
8 3 1 1
9 4 1 0
10 4 0 1
11 4 0 0
I want to transform it into something like:
a b c
id a0 a1 b0 b1 c0 c1
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 1 1 1 1
4 1 0 0 1 0 0
I thought of something like adding a sub_id column that is enumerated cyclically by a, b and c and then do an unstack of the frame. Is there an easier/smarter solution?
Thanks a lot!
Tim

If possible numbers instead abc is use GroupBy.cumcount for counter, create MultiIndex by DataFrame.set_index and reshape by DataFrame.unstack and last sorting second level with DataFrame.swaplevel:
g = df.groupby('id').cumcount()
df = df.set_index(['id', g]).unstack().sort_index(axis=1, level=1).swaplevel(0,1,axis=1)
print (df)
0 1 2
val1 val2 val1 val2 val1 val2
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
If want a,b,c values is possible generate dictionary from string.ascii_lowercase and rename columns:
import string
d = dict(enumerate(string.ascii_lowercase))
df = df.rename(columns=d)
print (df)
a b c
val1 val2 val1 val2 val1 val2
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
Solution for rename both levels is first create default columns names by range after set_index:
g = df.groupby('id').cumcount()
df = df.set_index(['id', g])
df.columns = range(len(df.columns))
df = df.unstack().sort_index(axis=1, level=1).swaplevel(0,1,axis=1)
print (df)
0 1 2
0 1 0 1 0 1
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
And last in list comprehension set new values:
import string
d = dict(enumerate(string.ascii_lowercase))
df.columns = pd.MultiIndex.from_tuples([(d[a], f'{d[a]}{b}') for a, b in df.columns])
print (df)
a b c
a0 a1 b0 b1 c0 c1
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0

One of possible solutions:
Start from reformatting values for each id into a single row:
res = df.set_index('id').groupby('id').apply(
lambda grp: pd.Series(grp.values.flatten()))
For now the result is:
0 1 2 3 4 5
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
Then set proper column names:
res.columns = pd.MultiIndex.from_tuples(
[(x, x + y) for x in list('abc') for y in list('01')])
The finale result is:
a b c
a0 a1 b0 b1 c0 c1
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0

Related

How do you count the common 1's in pandas data frame?

I have this data for example:
A
B
C
Class_label
0
1
1
B_C
1
1
1
A_B_C
0
0
1
C
How do you obtain (classified label column) this and count the common ones and display that as well using pandas dataframe?
Use DataFrame.assign for add new columns by DataFrame.dot with columns names for labels and sum for count 1, but only numeric columns selected by DataFrame.select_dtypes:
df1 = df.select_dtypes(np.number)
df = df.assign(classifiedlabel = df1.dot(df1.columns + '_').str[:-1],
countones = df1.sum(axis=1))
print (df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
If column classifiedlabel exist simpliest is use sum only:
df["countones"] = df.sum(axis=1)
print (df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
If values are 1/0 then you can use:
(
df.assign(
count=df._get_numeric_data().sum(axis=1)
)
)
Output:
A B C D classifiedlabel count
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
Try:
df["number_of_ones"] = (df == 1).astype(int).sum(axis=1)
print(df)
A B C D classifiedlabel number_of_ones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2

Convert one-hot encoded data-frame columns into one column

In the pandas data frame, the one-hot encoded vectors are present as columns, i.e:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
4 0 0 0 0 1
How to convert these columns into one data frame column by label encoding them in python? i.e:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
Also need suggestion on this that some rows have multiple 1s, how to handle those rows because we can have only one category at a time.
Try with argmax
#df=df.set_index('Rows')
df['New']=df.values.argmax(1)+1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
4 0 0 0 0 1 5
argmaxis the way to go, adding another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1))+1
#df.idxmax(1).map(df.columns.get_loc)+1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
Also need suggestion on this that some rows have multiple 1s, how to
handle those rows because we can have only one category at a time.
In this case you dot your DataFrame of dummies with an array of all the powers of 2 (based on the number of columns). This ensures that the presence of any unique combination of dummies (A, A+B, A+B+C, B+C, ...) will have a unique category label. (Added a few rows at the bottom to illustrate the unique counting)
df['Category'] = df.dot(2**np.arange(df.shape[1]))
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
Another readable solution on top of other great solutions provided that works for ANY type of variables in your dataframe:
df['variables'] = np.where(df.values)[1]+1
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5

How do I create a column such that its values is count of the number of,1, in that row, which are appearing for the first time in their own column?

How do I do this operation using pandas?
Initial Df:
A B C D
0 0 1 0 0
1 0 1 0 0
2 0 0 1 1
3 0 1 0 1
4 1 1 0 0
5 1 1 1 0
Final Df:
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
Basically Param is the number of the 1 in that row which is appearing for the first time in its own column.
Example:
index 0 : 1 in the column B is appearing for the first time hence Param1 = 1
index 1 : none of the 1 is appearing for the first time in its own column hence Param1 = 0
index 2 : 1 in the column C and D is appearing for the first time in their columns hence Paramm1 = 2
index 3 : none of the 1 is appearing for the first time in its own column hence Param1 = 0
index 4 : 1 in the column A is appearing for the first time in the column hence Paramm1 = 1
index 5 : none of the 1 is appearing for the first time in its own column hence Param1 = 0
I will do idxmax and value_counts
df['Param']=df.idxmax().value_counts().reindex(df.index,fill_value=0)
df
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
You can check for duplicated values, multiply with df and sum:
df['Param'] = df.apply(lambda x: ~x.duplicated()).mul(df).sum(1)
Output:
A B C D Param
0 0 1 0 0 1
1 0 1 0 0 0
2 0 0 1 1 2
3 0 1 0 1 0
4 1 1 0 0 1
5 1 1 1 0 0
Assuming these are integers, you can use cumsum() twice to isolate the first occurrence of 1.
df2 = (df.cumsum() > 0).cumsum() == 1
df['Param'] = df2.sum(axis = 1)
print(df)
If df elements are strings, you should first convert them to integers.
df = df.astype(int)

Cumulative count in a pandas df

I am trying to export a cumulative count based off two columns in a pandas df.
An example is the df below. I'm trying to export a count based off Value and Count. So when the count increase I want attribute that to the adjacent value
import pandas as pd
d = ({
'Value' : ['A','A','B','C','D','A','B','A'],
'Count' : [0,1,1,2,3,3,4,5],
})
df = pd.DataFrame(d)
I have used this:
for val in ['A','B','C','D']:
cond = df.Value.eq(val) & df.Count.eq(int)
df.loc[cond, 'Count_' + val] = cond[cond].cumsum()
If I alter int to a specific number it will return the count. But I need this to read any number as the Count column keeps increasing.
My intended output is:
Value Count A_Count B_Count C_Count D_Count
0 A 0 0 0 0 0
1 A 1 1 0 0 0
2 B 1 1 0 0 0
3 C 2 1 0 1 0
4 D 3 1 0 1 1
5 A 3 1 0 1 1
6 B 4 1 1 1 1
7 A 5 2 1 1 1
So the count increase on the second row so 1 to Value A. Count increases again on row 4 and it's the first time for Value C so 1. Same again for rows 5 and 7. The count increases on row 8 so A becomes 2.
You could use str.get_dummies and diff and cumsum
In [262]: df['Value'].str.get_dummies().multiply(df['Count'].diff().gt(0), axis=0).cumsum()
Out[262]:
A B C D
0 0 0 0 0
1 1 0 0 0
2 1 0 0 0
3 1 0 1 0
4 1 0 1 1
5 1 0 1 1
6 1 1 1 1
7 2 1 1 1
Which is
In [266]: df.join(df['Value'].str.get_dummies()
.multiply(df['Count'].diff().gt(0), axis=0)
.cumsum().add_suffix('_Count'))
Out[266]:
Value Count A_Count B_Count C_Count D_Count
0 A 0 0 0 0 0
1 A 1 1 0 0 0
2 B 1 1 0 0 0
3 C 2 1 0 1 0
4 D 3 1 0 1 1
5 A 3 1 0 1 1
6 B 4 1 1 1 1
7 A 5 2 1 1 1

Pandas DataFrame: How to convert binary columns into one categorical column?

Given a pandas DataFrame, how does one convert several binary columns (where 1 denotes the value exists, 0 denotes it doesn't) into a single categorical column?
Another way to think of this is how to perform the "reverse pd.get_dummies()"?
Here is an example of converting a categorical column into several binary columns:
import pandas as pd
s = pd.Series(list('ABCDAB'))
df = pd.get_dummies(s)
df
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
What I would like to accomplish is given a dataframe
df1
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
could do I convert it into
df1
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
One way would be to use idxmax to find the 1s:
In [32]: df["category"] = df.idxmax(axis=1)
In [33]: df
Out[33]:
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B

Categories