Iterate and assign value in Pandas dataframe based on condition - python

I have a pandas dataframe composed of 8 columns (c1 to c7, plus a last column called total). Columns c1 to c7 contain only 0s and 1s.
The column total should be assigned the length of the longest run of 1s within c1 to c7. Since c1 to c7 represent weekdays, the sequence wraps around: day 7 should flip back to day 1.
For example, say we have an initial dataframe like df:
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | total |
|:---|:---|:---|:---|:---|:---|:---|:-----:|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
My initial thought was to create a loop with an if statement inside to evaluate the criteria for each column and assign the value to the column total:
i = "c1"
d =
for i in df.iloc[:, 0:7]:
    if df[i] == 1 and df[i-1] == 1:
        df["total"] := df["total"] + 1
I would expect df to look like:
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | total |
|:---|:---|:---|:---|:---|:---|:---|:-----:|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 5 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 2 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 6 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 3 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
I haven't been able to get to a result; I was trying to build it step by step but kept getting an error when evaluating the if statement:
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
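(For context, a minimal sketch of why this comparison is ambiguous; the frame here is hypothetical:)
import pandas as pd

df = pd.DataFrame({'c1': [1, 0, 1]})
mask = df['c1'] == 1   # a boolean Series, not a single True/False
# if mask: ...         # raises the ValueError quoted above
if mask.any():         # .any()/.all() reduce the Series to one bool
    print('at least one row has c1 == 1')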

One way is to loop over the row twice, using a modulus so a run of 1s can wrap from the last column back to the first:
df = pd.DataFrame([[1,1,1,1,1,0,0], [0,0,1,0,0,0,0], [1,1,0,1,1,0,0]])

def fun(x):
    count = 0
    result = 0
    n = len(x)
    # loop over the row twice so a run can wrap from the tail back to the head
    for i in range(0, 2*n):
        if x.iloc[i % n] == 0:
            result = max(result, count)
            count = 0
        else:
            count += 1
    # an all-ones row never hits the zero branch; cap the doubled count at n
    return min(max(result, count), n)

df['total'] = df.apply(lambda x: fun(x), axis=1)
0 1 2 3 4 5 6 total
0 1 1 1 1 1 0 0 5
1 0 0 1 0 0 0 0 1
2 1 1 0 1 1 0 0 2
Bugs in your loop:
- df[i-1] will throw an error: with integer column labels it fails when i == 0, and with string labels like "c1" it fails immediately, since you cannot subtract 1 from a string
- df[i] gives the values of the i-th column across all rows, not a single cell, which is why the if comparison raises the ambiguous-truth-value ValueError
- "7 should then flip to 1": this part is missing in your code
To flip the tail (c7) of the row back to the head (c1), place a copy of the row at the tail and then check for consecutive 1's. This can also be done by looping over the row twice and using a modulus operator, as in the code above and the sketch below.
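A minimal sketch of that duplicate-the-row idea, assuming the asker's df with columns c1 to c7:
import numpy as np

def longest_wrap_run(row):
    # duplicating the row makes wrap-around runs contiguous
    doubled = np.concatenate([row, row])
    best = cur = 0
    for v in doubled:
        cur = cur + 1 if v == 1 else 0
        best = max(best, cur)
    # cap at the row length so an all-ones week returns 7, not 14
    return min(best, len(row))

df['total'] = df.iloc[:, 0:7].apply(longest_wrap_run, axis=1, raw=True)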

Related

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE.
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is observing a subset of the features, and if there are duplicate rows, to keep the first and then denote which id: label pair is the duplicate.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols; see the sketch after this list)
Find all duplicate rows in a pandas dataframe
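(A hedged sketch of that adaptation: since every column in cols is constant within its group, transform('idxmin') on any one of them returns the group's first index.)
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# 'ft1' is an arbitrary member of cols; its values are identical within each
# group, so idxmin simply returns the first index of that group
df['index_original'] = df.groupby(cols)['ft1'].transform('idxmin')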
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is for a particular row, determine which rows are duplicates of it by saving those rows as id: label combination. So while I'm able to extract the id and label for each duplicate, I have no ability to map it back to the original row for which it is a duplicate.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# mask for rows that are duplicates of an earlier row
m = sub_df.duplicated()
# create (id, label) tuples, then aggregate each group to a dict
s = (df.assign(a = df[['id','label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
# add new column
df = df.join(s.rename('duplicates'), on=cols)
# replace missing values and non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
An alternative uses a custom function to assign all dupes except the first to the new column of each group's first row; at the end, a different mask is used to replace the remaining values with empty strings:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# first row of each group that has at least one duplicate
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)

def f(x):
    # write the {id: label} dict of the non-first rows into the first row
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x

df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10

Group by as a list and create a new column for each value

I have a dataframe where every row is a user id and an item the user has:
| user | item_id |
|------|---------|
| 1 | a |
| 1 | b |
| 2 | b |
| 3 | c |
| 4 | a |
| 4 | c |
I want to create n columns, where n is the number of distinct item_id values, group to one row per user, and fill each column with 1/0 according to whether the item is present for that user.
| user | item_a | item_b | item_c |
|------|--------|--------|--------|
| 1 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 1 |
Use pivot_table:
import pandas as pd
df = pd.DataFrame({'user': [1,1,2,3,4,4], 'item_id': list('abbcac')})
df = df.assign(val=1).pivot_table(values='val',
index='user',
columns='item_id',
fill_value=0)
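To match the requested layout exactly, the pivoted frame still needs its index reset, its columns prefixed, and the columns-axis name dropped; one possible finishing step:
# hedged follow-up to the pivot_table call above
df = df.add_prefix('item_').rename_axis(columns=None).reset_index()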
Or use crosstab:
pd.crosstab(df.user, df.item_id).add_prefix('item_').reset_index()
Yet another approach is to use get_dummies and a groupby sum:
pd.get_dummies(df, columns=['item_id']).groupby('user').sum().reset_index()
which produces the desired result:
user item_id_a item_id_b item_id_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
and to rename the columns:
df.columns = df.columns.str.replace(r"_id", "")
df
user item_a item_b item_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
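One caveat: crosstab and the get_dummies sum count occurrences, so if a (user, item_id) pair repeats in the input, values can exceed 1; clipping restores a 0/1 indicator:
pd.get_dummies(df, columns=['item_id']).groupby('user').sum().clip(upper=1).reset_index()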

pandas: Split and convert series of alphanumeric texts to columns and rows

Current data frame: I have a pandas data frame where each employee has a text code (all codes start with T) and an associated frequency right next to the code. All text codes have 8 characters.
+----------+-------------------------------------------------------------+
| emp_id | text |
+----------+-------------------------------------------------------------+
| E0001 | [T0431516,-8,T0401531,-12,T0517519,12] |
| E0002 | [T0701540,-1,T0431516,-2] |
| E0003 | [T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]|
| E0004 | [T0516319,-3] |
| E0005 | [T0431516,2] |
+----------+-------------------------------------------------------------+
Expected data frame: I am trying to make the text codes present in the data frame as individual columns and if an employee has a frequency for that code then populate frequency else 0.
+----------+----------------------------------------------------------------------------------------+
| emp_id | T0431516 | T0401531 | T0517519 | T0701540 | T0421531 | T0516319 | T0500371 | T0309711 |
+----------+----------------------------------------------------------------------------------------+
| E0001 | -8 | -12 | 12 | 0 | 0 | 0 | 0 | 0 |
| E0002 | -2 | 0 | 0 | -1 | 0 | 0 | 0 | 0 |
| E0003 | 0 | 0 | -1 | 0 | -7 | 9 | -6 | -3 |
| E0004 | 0 | 0 | 0 | 0 | 0 | -3 | 0 | 0 |
| E0005 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------+----------------------------------------------------------------------------------------+
Sample data:
pd.DataFrame({'emp_id' : {0: 'E0001', 1: 'E0002', 2: 'E0003', 3: 'E0004', 4: 'E0005'},
'text' : {0: '[T0431516,-8,T0401531,-12,T0517519,12]', 1: '[T0701540,-1,T0431516,-2]', 2: '[T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]', 3: '[T0516319,-3]', 4: '[T0431516,2]'}
})
So far, my attempts have been unsuccessful. Any pointers/help is much appreciated!
You can explode the dataframe and then create a pivot_table:
df = pd.DataFrame({'emp_id': ['E0001', 'E0002', 'E0003', 'E0004', 'E0005'],
                   'text': [['T0431516',-8,'T0401531',-12,'T0517519',12],
                            ['T0701540',-1,'T0431516',-2],
                            ['T0517519',-1,'T0421531',-7,'T0516319',9,'T0500371',-6,'T0309711',-3],
                            ['T0516319',-3],
                            ['T0431516',2]]})
df = df.explode('text')
# each frequency immediately follows its code, so shift(-1) pairs them up
df['freq'] = df['text'].shift(-1)
# keep only the code rows; their paired frequency is now in 'freq'
df = df[df['text'].str[0] == 'T']
df['freq'] = df['freq'].astype(int)
df = pd.pivot_table(df, index='emp_id', columns='text', values='freq', aggfunc='sum').fillna(0).astype(int)
df
Out[1]:
text T0309711 T0401531 T0421531 T0431516 T0500371 T0516319 T0517519 \
emp_id
E0001 0 -12 0 -8 0 0 12
E0002 0 0 0 -2 0 0 0
E0003 -3 0 -7 0 -6 9 -1
E0004 0 0 0 0 0 -3 0
E0005 0 0 0 2 0 0 0
text T0701540
emp_id
E0001 0
E0002 -1
E0003 0
E0004 0
E0005 0
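Note that the sample data in the question stores text as a single string per row rather than a list; a hedged one-liner for converting it first (the frequencies stay as strings, which the astype(int) step above already handles):
# assumes the bracketed string form from the question's sample data
df['text'] = df['text'].str.strip('[]').str.split(',')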

Dataframe conditional column subtract until zero

This is different from the usual 'subtract until 0' questions on here, as it is conditional on another column. This question is about creating that conditional column.
This dataframe consists of three columns.
Column 'quantity' tells you how much to add/subtract.
Column 'in' tells you when to subtract.
Column 'cumulative_in' tells you how much you have.
+----------+----+---------------+
| quantity | in | cumulative_in |
+----------+----+---------------+
| 5 | 0 | |
| 1 | 0 | |
| 3 | 1 | 3 |
| 4 | 1 | 7 |
| 2 | 1 | 9 |
| 1 | 0 | |
| 1 | 0 | |
| 3 | 0 | |
| 1 | -1 | |
| 2 | 0 | |
| 1 | 0 | |
| 2 | 0 | |
| 3 | 0 | |
| 3 | 0 | |
| 1 | 0 | |
| 3 | 0 | |
+----------+----+---------------+
Whenever column 'in' equals -1, starting from the next row I want to create a column 'out' (0/1) that tells it to keep subtracting until 'cumulative_in' reaches 0. Doing it by hand:
Column 'out' tells you when to keep subtracting.
Column 'cumulative_subtracted' tells you how much you have already subtracted.
I subtract 'cumulative_subtracted' from 'cumulative_in' until it reaches 0; the output looks something like this:
+----------+----+---------------+-----+-----------------------+
| quantity | in | cumulative_in | out | cumulative_subtracted |
+----------+----+---------------+-----+-----------------------+
| 5 | 0 | | | |
| 1 | 0 | | | |
| 3 | 1 | 3 | | |
| 4 | 1 | 7 | | |
| 2 | 1 | 9 | | |
| 1 | 0 | | | |
| 1 | 0 | | | |
| 3 | 0 | | | |
| 1 | -1 | | | |
| 2 | 0 | 7 | 1 | 2 |
| 1 | 0 | 6 | 1 | 3 |
| 2 | 0 | 4 | 1 | 5 |
| 3 | 0 | 1 | 1 | 8 |
| 3 | 0 | 0 | 1 | 9 |
| 1 | 0 | | | |
| 3 | 0 | | | |
+----------+----+---------------+-----+-----------------------+
I couldn't find a vectorized solution to this. I would love to see one. However, the problem is not that hard when going through row by row. I hope your dataframe is not too big!
First set up the data.
import numpy as np
import pandas as pd

data = {
    "quantity": [5, 1, 3, 4, 2, 1, 1, 3, 1, 2, 1, 2, 3, 3, 1, 3],
    "in": [0, 0, 1, 1, 1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0],
    "cumulative_in": [
        np.NaN, np.NaN, 3, 7, 9, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN,
        np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN
    ],
}
Then set up the dataframe and extra columns. I used np.NaN for 'out', but 0 was easier for 'cumulative_subtracted'.
df=pd.DataFrame(data)
df['out'] = np.NaN
df['cumulative_subtracted'] = 0
Set the initial variables
last_in = 0.
reduce = False
Go through the dataframe row by row, unfortunately.
for i in df.index:
    # check if necessary to adjust last_in value
    if ~np.isnan(df.at[i, "cumulative_in"]) and reduce == False:
        last_in = df.at[i, "cumulative_in"]
    # check if -1 and change reduce to True
    elif df.at[i, "in"] == -1:
        reduce = True
    # if reduce is True, implement the reductions
    elif reduce == True:
        df.at[i, "out"] = 1
        if df.at[i, "quantity"] <= last_in:
            last_in -= df.at[i, "quantity"]
            df.at[i, "cumulative_in"] = last_in
            df.at[i, "cumulative_subtracted"] = (
                df.at[i - 1, "cumulative_subtracted"] + df.at[i, "quantity"]
            )
        elif df.at[i, "quantity"] > last_in:
            df.at[i, "cumulative_in"] = 0
            df.at[i, "cumulative_subtracted"] = (
                df.at[i - 1, "cumulative_subtracted"] + last_in
            )
            last_in = 0
            reduce = False
This works for the data given, and hopefully for all your dataset.
print(df)
quantity in cumulative_in out cumulative_subtracted
0 5 0 NaN NaN 0
1 1 0 NaN NaN 0
2 3 1 3.0 NaN 0
3 4 1 7.0 NaN 0
4 2 1 9.0 NaN 0
5 1 0 NaN NaN 0
6 1 0 NaN NaN 0
7 3 0 NaN NaN 0
8 1 -1 NaN NaN 0
9 2 0 7.0 1.0 2
10 1 0 6.0 1.0 3
11 2 0 4.0 1.0 5
12 3 0 1.0 1.0 8
13 3 0 0.0 1.0 9
14 1 0 NaN NaN 0
15 3 0 NaN NaN 0
It is not clear to me what happens when the quantity to subtract has not yet reached zero and another '1' appears in the 'in' column.
Yet, here is a rough solution for a simple case:
import pandas as pd
import numpy as np
size = 20
df = pd.DataFrame(
    {
        "quantity": np.random.randint(1, 6, size),
        "in": np.full(size, np.nan),
    }
)
# These are just to place a random 1 and -1 into 'in', not important
df.loc[np.random.choice(df.iloc[:size//3, :].index, 1), 'in'] = 1
df.loc[np.random.choice(df.iloc[size//3:size//2, :].index, 1), 'in'] = -1
df.loc[np.random.choice(df.iloc[size//2:, :].index, 1), 'in'] = 1
# Fill up with 1/-1 values the missing values after each entry up to the
# next 1/-1 entry.
df.loc[:, 'in'] = df['in'].fillna(method='ffill')
# Calculates the cumulative sum with a negative value for subtractions
df["cum_in"] = (df["quantity"] * df['in']).cumsum()
# Subtraction indicator and cumulative column
df['out'] = (df['in'] == -1).astype(int)
df["cumulative_subtracted"] = df.loc[df['in'] == -1, 'quantity'].cumsum()
# Remove values when the 'cum_in' turns to negative
df.loc[
    df["cum_in"] < 0, ["in", "cum_in", "out", "cumulative_subtracted"]
] = np.NaN
print(df)

Get Pandas Duplicate Row Count with Original Index

I need to find duplicate rows in a Pandas dataframe, and then add an extra column with the count. Let's say we have a dataframe:
>>> print(df)
+----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 |
| 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+
The above frame would then become the one below with an additional column with the count. You can see that we are still preserving the index column.
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 1 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 1 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 1 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
I've seen other solutions to this such as:
df.groupby(list(df.columns.values)).size()
But that returns a matrix with gaps and with no initial index.
You can use reset_index first to convert the index to a column, and then aggregate by first and size.
Also, because the groupby uses all columns, it is necessary to remove the index column with difference:
print (df.columns.difference(['index']))
Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')
print (df.reset_index()
         .groupby(df.columns.difference(['index']).tolist())['index']
         .agg(['first', 'size'])
         .reset_index()
         .set_index(['first'])
         .sort_index()
         .rename_axis(None))
2 3 4 5 6 7 8 9 size
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1
If the new column should be named 10 (the next column number), a rename is needed:
#if necessary convert to str
last_col = str(df.columns.astype(int).max() + 1)
print (last_col)
10
print (df.reset_index()
         .groupby(df.columns.difference(['index']).tolist())['index']
         .agg(['first', 'size'])
         .reset_index()
         .set_index(['first'])
         .sort_index()
         .rename_axis(None)
         .rename(columns={'size': last_col}))
2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1
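A possible alternative (a sketch, assuming string column labels as above): compute each row's group size with transform, then drop the later duplicates, which preserves the original index:
# every row gets the count of its duplicate group; duplicated() then
# drops all but the first occurrence, keeping the original index
size = df.groupby(df.columns.tolist())[df.columns[0]].transform('size')
result = df.loc[~df.duplicated()].assign(**{'10': size})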
