How do you count the common 1's in a pandas DataFrame?

I have this data for example:
A B C Class_label
0 1 1 B_C
1 1 1 A_B_C
0 0 1 C
How do you obtain this classified label column, count the 1s in each row, and display that count as well, using a pandas DataFrame?

Use DataFrame.assign to add the new columns: DataFrame.dot with the column names builds the labels, and sum counts the 1s. Both work only on the numeric columns, selected by DataFrame.select_dtypes:
import numpy as np
df1 = df.select_dtypes(np.number)
df = df.assign(classifiedlabel = df1.dot(df1.columns + '_').str[:-1],
               countones = df1.sum(axis=1))
print (df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
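For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The input frame is rebuilt by hand from the printed output above, and the variable names (num, labels, counts, result) are only illustrative:

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the printed output above (an assumption)
df = pd.DataFrame({
    "A": [0, 1, 0, 0],
    "B": [1, 1, 0, 1],
    "C": [0, 0, 1, 1],
    "D": [1, 1, 0, 0],
})

# Work only on the numeric 0/1 columns
num = df.select_dtypes(np.number)

# 0 * 'A_' is '' and 1 * 'A_' is 'A_', so the dot product concatenates
# the names of the columns holding a 1; .str[:-1] drops the trailing '_'
labels = num.dot(num.columns + "_").str[:-1]

# Row-wise sum of 0/1 values equals the number of 1s
counts = num.sum(axis=1)

result = df.assign(classifiedlabel=labels, countones=counts)
print(result)
```

The trick relies on the values being exactly 0 or 1; other integers would repeat the column name instead of just including it.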
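If the classifiedlabel column already exists, the simplest approach is to use sum alone (numeric_only=True skips the string column, which recent pandas versions no longer drop silently):
df["countones"] = df.sum(axis=1, numeric_only=True)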
print (df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2

If values are 1/0 then you can use:
(
    df.assign(
        count=df._get_numeric_data().sum(axis=1)
    )
)
Output:
A B C D classifiedlabel count
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2

Try:
df["number_of_ones"] = (df == 1).astype(int).sum(axis=1)
print(df)
A B C D classifiedlabel number_of_ones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
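The sum-based answers and the comparison-based answer only agree when every value is 0 or 1. A small sketch of the difference, using a made-up frame that contains a 2 (the data and names here are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: column A contains a 2, so sum and "count of 1s" differ
df = pd.DataFrame({
    "A": [0, 2, 1],
    "B": [1, 1, 0],
    "label": ["x", "y", "z"],
})

num = df.select_dtypes("number")

# Adds up the values, so the 2 contributes 2
total = num.sum(axis=1)

# Counts only the cells that are exactly 1
ones = num.eq(1).sum(axis=1)

print(pd.DataFrame({"total": total, "ones": ones}))
```

If the columns are guaranteed to be binary, either form works; the comparison is the safer choice when they might not be.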

Related

Pandas: Reshaping dataframe

I have a pandas-related question. My dataframe looks something like this:
id val1 val2
0 1 0 1
1 1 1 0
2 1 0 0
3 2 1 1
4 2 1 1
5 2 1 0
6 3 0 0
7 3 0 1
8 3 1 1
9 4 1 0
10 4 0 1
11 4 0 0
I want to transform it into something like:
a b c
id a0 a1 b0 b1 c0 c1
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 1 1 1 1
4 1 0 0 1 0 0
I thought of something like adding a sub_id column that is enumerated cyclically by a, b and c and then do an unstack of the frame. Is there an easier/smarter solution?
Thanks a lot!
Tim
If numbers are acceptable instead of a, b, c, use GroupBy.cumcount for the counter, create a MultiIndex with DataFrame.set_index, reshape with DataFrame.unstack, and finally sort by the second column level and swap the levels with DataFrame.swaplevel:
g = df.groupby('id').cumcount()
df = df.set_index(['id', g]).unstack().sort_index(axis=1, level=1).swaplevel(0,1,axis=1)
print (df)
0 1 2
val1 val2 val1 val2 val1 val2
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
If you want a, b, c values, generate a dictionary from string.ascii_lowercase and rename the columns:
import string
d = dict(enumerate(string.ascii_lowercase))
df = df.rename(columns=d)
print (df)
a b c
val1 val2 val1 val2 val1 val2
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
To rename both levels, first replace the column names with default integers from range after set_index:
g = df.groupby('id').cumcount()
df = df.set_index(['id', g])
df.columns = range(len(df.columns))
df = df.unstack().sort_index(axis=1, level=1).swaplevel(0,1,axis=1)
print (df)
0 1 2
0 1 0 1 0 1
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
Finally, build the new column names in a list comprehension:
import string
d = dict(enumerate(string.ascii_lowercase))
df.columns = pd.MultiIndex.from_tuples([(d[a], f'{d[a]}{b}') for a, b in df.columns])
print (df)
a b c
a0 a1 b0 b1 c0 c1
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
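Putting the steps above together, a self-contained sketch might look like this. The input frame is typed in from the question, and wide and letters are just illustrative names:

```python
import string
import pandas as pd

# Input frame from the question
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "val1": [0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
    "val2": [1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0],
})

# Position of each row within its id group: 0, 1, 2
g = df.groupby("id").cumcount()

wide = df.set_index(["id", g])
wide.columns = range(len(wide.columns))          # val1 -> 0, val2 -> 1

# Move the within-group position into the columns, then order by it
wide = wide.unstack().sort_index(axis=1, level=1).swaplevel(0, 1, axis=1)

# Map 0, 1, 2 to a, b, c and build the second level as a0, a1, ...
letters = dict(enumerate(string.ascii_lowercase))
wide.columns = pd.MultiIndex.from_tuples(
    [(letters[a], f"{letters[a]}{b}") for a, b in wide.columns]
)
print(wide)
```

This assumes every id has the same number of rows; groups of different sizes would leave NaN in the unstacked frame.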
One possible solution:
Start by flattening the values for each id into a single row:
res = df.set_index('id').groupby('id').apply(
    lambda grp: pd.Series(grp.values.flatten()))
For now the result is:
0 1 2 3 4 5
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0
Then set proper column names:
res.columns = pd.MultiIndex.from_tuples(
    [(x, x + y) for x in list('abc') for y in list('01')])
The final result is:
a b c
a0 a1 b0 b1 c0 c1
id
1 0 1 1 0 0 0
2 1 1 1 1 1 0
3 0 0 0 1 1 1
4 1 0 0 1 0 0

Create a categorical column based on different binary columns in python

I have a dataset that looks like this:
df = pd.DataFrame(data= [[0,0,1],[1,0,0],[0,1,0]], columns = ['A','B','C'])
A B C
0 0 0 1
1 1 0 0
2 0 1 0
I want to create a new column that holds, for each row, the name of the column that contains a 1:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Use dot:
df['value'] = df.values.dot(df.columns)
Output:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Using pd.DataFrame.idxmax:
df['value'] = df.idxmax(1)
print(df)
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
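Both answers give the same result when each row has exactly one 1. They behave differently otherwise, which the following sketch tries to show on a made-up frame (the data is hypothetical, with one row holding two 1s and one row holding none):

```python
import pandas as pd

# Hypothetical data: row 1 has two 1s, row 2 has none
df = pd.DataFrame([[0, 0, 1], [1, 1, 0], [0, 0, 0]], columns=list("ABC"))

# dot concatenates every matching column name; an all-zero row gives ''
by_dot = df.dot(df.columns)

# idxmax returns the first column holding the row maximum,
# so an all-zero row still yields the first column name
by_idxmax = df.idxmax(axis=1)

print(pd.DataFrame({"dot": by_dot, "idxmax": by_idxmax}))
```

So idxmax is only a safe choice when the columns are strictly one-hot; with dot, an empty string flags rows that have no 1 at all.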

Cumulative count in a pandas df

I am trying to export a cumulative count based off two columns in a pandas df.
An example is the df below. I'm trying to export a count based on Value and Count: whenever Count increases, I want to attribute that increase to the adjacent Value.
import pandas as pd
d = ({
    'Value' : ['A','A','B','C','D','A','B','A'],
    'Count' : [0,1,1,2,3,3,4,5],
    })
df = pd.DataFrame(d)
I have used this:
for val in ['A','B','C','D']:
    cond = df.Value.eq(val) & df.Count.eq(int)
    df.loc[cond, 'Count_' + val] = cond[cond].cumsum()
If I alter int to a specific number it will return the count. But I need this to read any number as the Count column keeps increasing.
My intended output is:
Value Count A_Count B_Count C_Count D_Count
0 A 0 0 0 0 0
1 A 1 1 0 0 0
2 B 1 1 0 0 0
3 C 2 1 0 1 0
4 D 3 1 0 1 1
5 A 3 1 0 1 1
6 B 4 1 1 1 1
7 A 5 2 1 1 1
The count increases on the second row, so 1 is attributed to Value A. The count increases again on row 4, which is the first time for Value C, so C gets 1. The same happens on rows 5 and 7. The count increases on row 8, so A becomes 2.
You could use str.get_dummies and diff and cumsum
In [262]: df['Value'].str.get_dummies().multiply(df['Count'].diff().gt(0), axis=0).cumsum()
Out[262]:
A B C D
0 0 0 0 0
1 1 0 0 0
2 1 0 0 0
3 1 0 1 0
4 1 0 1 1
5 1 0 1 1
6 1 1 1 1
7 2 1 1 1
Joined back onto the original frame with a _Count suffix, that gives:
In [266]: df.join(df['Value'].str.get_dummies()
.multiply(df['Count'].diff().gt(0), axis=0)
.cumsum().add_suffix('_Count'))
Out[266]:
Value Count A_Count B_Count C_Count D_Count
0 A 0 0 0 0 0
1 A 1 1 0 0 0
2 B 1 1 0 0 0
3 C 2 1 0 1 0
4 D 3 1 0 1 1
5 A 3 1 0 1 1
6 B 4 1 1 1 1
7 A 5 2 1 1 1
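Broken into named steps, the one-liner above could be sketched roughly like this (increased, dummies, counts and out are illustrative names, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    "Value": ["A", "A", "B", "C", "D", "A", "B", "A"],
    "Count": [0, 1, 1, 2, 3, 3, 4, 5],
})

# True on rows where Count is larger than on the previous row
increased = df["Count"].diff().gt(0)

# One indicator column per distinct Value
dummies = df["Value"].str.get_dummies()

# Keep an indicator only where Count increased, then accumulate per column
counts = dummies.multiply(increased, axis=0).cumsum().add_suffix("_Count")

out = df.join(counts)
print(out)
```

The first row never counts as an increase because diff() yields NaN there, and NaN > 0 is False.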

How to set index to an existing dataframe in the form of cartesian product?

I have a list. I want to set the index of a dataframe to the cartesian product of the list values with the dataframe's rows, i.e.
li = ['A','B']
df = pd.DataFrame([[0,0,0],[1,1,1],[2,2,2]])
I want the resulting dataframe to be like
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
How can I do this?
Option 1
pd.concat with keys argument
pd.concat([df] * len(li), keys=li)
0 1 2
A 0 0 0 0
1 1 1 1
2 2 2 2
B 0 0 0 0
1 1 1 1
2 2 2 2
To replicate your output exactly:
pd.concat([df] * len(li), keys=li).reset_index(1, drop=True)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
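As a runnable version of Option 1, here is a short sketch using the question's li and df; stacked and out are illustrative names:

```python
import pandas as pd

li = ["A", "B"]
df = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])

# One copy of df per list value; keys labels each copy in an outer index level
stacked = pd.concat([df] * len(li), keys=li)

# Drop the inner level (the original 0, 1, 2 row labels) to match the target
out = stacked.reset_index(level=1, drop=True)
print(out)
```

Keeping the intermediate stacked frame can be handy if the original row labels are still needed, since they survive as the second index level.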
Option 2
np.tile and np.repeat
pd.DataFrame(np.tile(df, [len(li), 1]), np.repeat(li, len(df)), df.columns)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Use MultiIndex.from_product with reindex:
mux = pd.MultiIndex.from_product([li, df.index])
df = df.reindex(mux, level=1).reset_index(level=1, drop=True)
print (df)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Or you can attach the list to every row and expand it with apply(pd.Series) and stack:
li = [['A','B']]
df['New']=li*len(df)
df.set_index([0,1,2])['New'].apply(pd.Series).stack().to_frame().rename(columns={0:'keys'})\
.reset_index().drop(columns='level_3').sort_values('keys')
Out[698]:
0 1 2 keys
0 0 0 0 A
2 1 1 1 A
4 2 2 2 A
1 0 0 0 B
3 1 1 1 B
5 2 2 2 B

Pandas DataFrame: How to convert binary columns into one categorical column?

Given a pandas DataFrame, how does one convert several binary columns (where 1 denotes the value exists, 0 denotes it doesn't) into a single categorical column?
Another way to think of this is how to perform the "reverse pd.get_dummies()"?
Here is an example of converting a categorical column into several binary columns:
import pandas as pd
s = pd.Series(list('ABCDAB'))
df = pd.get_dummies(s)
df
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
What I would like to accomplish is given a dataframe
df1
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
how could I convert it into
df1
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
One way would be to use idxmax to find the 1s:
In [32]: df["category"] = df.idxmax(axis=1)
In [33]: df
Out[33]:
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
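A quick way to convince yourself that idxmax really reverses pd.get_dummies is a round trip. This is only a sketch and assumes each row has exactly one indicator set; recovered is an illustrative name:

```python
import pandas as pd

s = pd.Series(list("ABCDAB"))

# Forward direction: one indicator column per category
dummies = pd.get_dummies(s)

# Reverse direction: the column holding the row maximum is the category
recovered = dummies.idxmax(axis=1)

# Attach it as a new column, as in the answer above
df1 = dummies.assign(category=recovered)
print(df1)
```

Rows with no indicator set would silently come back as the first column, so it may be worth checking that dummies.sum(axis=1) is 1 everywhere before relying on this.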
