How to keep track of how many times a unique condition occurs - python

I have a df that looks like this:
time val
0 1
1 1
2 2
3 3
4 1
5 2
6 3
7 3
8 3
9 3
10 1
11 1
How do I create new columns that track how many times a condition has occurred, without counting consecutive repeats? In this case, I want a column for each unique value in val holding the cumulative number of occurrences up to the given row, but the count should not increment while the value stays the same.
Expected outcome below:
time val sum_1 sum_2 sum_3
0 1 1 0 0
1 1 1 0 0
2 2 1 1 0
3 3 1 1 1
4 1 2 1 1
5 2 2 2 1
6 3 2 2 2
7 3 2 2 2
8 3 2 2 2
9 3 2 2 2
10 1 3 2 2
11 1 3 2 2
EDIT
To be more specific with the condition:
I want to count the number of times a unique value appears in val. For example, using the code below, I could get this result:
df['sum_1'] = (df['val'] == 1).cumsum()
df['sum_2'] = (df['val'] == 2).cumsum()
df['sum_3'] = (df['val'] == 3).cumsum()
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 2 0 0
2 2 2 2 1 0
3 3 3 2 1 1
4 4 1 3 1 1
5 5 2 3 2 1
However, this code counts EVERY occurrence of the condition. In the six rows above, the value 1 occurs 3 times in total, but only twice as a consecutive grouping. I want to treat consecutive occurrences of a value as a single group and count only the number of such groupings.

You can build a mask that is True at the first row of each consecutive run by comparing the column against its shifted values (Series.ne with Series.shift), combine it with an equality test using bitwise AND (&), and run that test for every unique value of column val:
uniq = df['val'].unique()
# True at the first row of each consecutive run
m = df['val'].ne(df['val'].shift())
for c in uniq:
    df[f'sum_{c}'] = (df['val'].eq(c) & m).cumsum()
print(df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
For (hopefully) better performance, here is a NumPy alternative:
import numpy as np

a = df['val'].to_numpy()
uniq = np.unique(a)
# True at the first row of each consecutive run
m = np.concatenate(([True], a[1:] != a[:-1]))
# one indicator column per unique value, counted only at run starts
arr = np.cumsum((a[:, None] == uniq) & m[:, None], axis=0)
df = df.join(pd.DataFrame(arr, index=df.index, columns=uniq).add_prefix('sum_'))
print(df)
time val sum_1 sum_2 sum_3
0 0 1 1 0 0
1 1 1 1 0 0
2 2 2 1 1 0
3 3 3 1 1 1
4 4 1 2 1 1
5 5 2 2 2 1
6 6 3 2 2 2
7 7 3 2 2 2
8 8 3 2 2 2
9 9 3 2 2 2
10 10 1 3 2 2
11 11 1 3 2 2
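A compact pandas alternative (a sketch along the same lines, not from the answer above): one-hot encode val only at the rows where a new run starts, then take the cumulative sum of each indicator column:
starts = df['val'].ne(df['val'].shift())  # True at the first row of each run
dummies = pd.get_dummies(df.loc[starts, 'val'], dtype=int)
df = df.join(dummies.reindex(df.index, fill_value=0).cumsum().add_prefix('sum_'))
Each indicator is 1 exactly once per consecutive run, so the cumulative sums match the sum_1/sum_2/sum_3 columns above.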

Related

Duplicate a selected row and put the duplicate just below in a Pandas DataFrame

I have a Pandas dataframe like this:
id A B
0 2 2
1 1 1
2 3 3
3 7 7
And I want to duplicate the first row 3 times just below the selected row:
id A B
0 2 2
1 2 2
2 2 2
3 2 2
4 1 1
5 3 3
6 7 7
Is there a method that already exists in the Pandas library?
There is no built-in method for doing exactly this. However, you can build a list of repeat counts and use df.loc with df.index.repeat:
new_df = df.loc[df.index.repeat([4] + [1] * (len(df) - 1))].reset_index(drop=True)
Output:
>>> new_df
id A B
0 0 2 2
1 0 2 2
2 0 2 2
3 0 2 2
4 1 1 1
5 2 3 3
6 3 7 7
Use reindex with Index.repeat to create your dataframe (the first row is repeated 4 times: the original plus 3 duplicates):
>>> df.reindex(df.index.repeat([4] + [1] * (len(df) - 1)))
id A B
0 0 2 2
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Another way:
>>> df.loc[[df.index[0]] * 4 + df.index[1:].tolist()]
id A B
0 0 2 2
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
A more generalized way, proposed by @MuhammadHassan:
row_index = 0
repeat_time = 3
>>> df.reindex(df.index.tolist() + [row_index]*repeat_time).sort_index()
id A B
0 0 2 2
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Let us try df.loc with pd.concat (DataFrame.append was removed in pandas 2.0):
n = 3  # number of duplicates to add below the original row
row = 0
df = pd.concat([df, df.loc[[row] * n]]).sort_index()
df
id A B
0 0 2 2
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
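If this comes up often, the repeat-counts idea generalizes into a small helper (a hypothetical convenience function, not part of pandas; it assumes a unique index):
import numpy as np

def repeat_row(df, pos, n):
    # per-row repeat counts: n copies for the chosen position, 1 everywhere else
    counts = np.ones(len(df), dtype=int)
    counts[pos] = n
    return df.loc[df.index.repeat(counts)].reset_index(drop=True)

Calling repeat_row(df, pos=0, n=4) then reproduces the expected output above.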

Counting weight of unique combinations within groups

I have the following dataframe:
Group from to
1 2 1
1 1 2
1 3 2
1 3 1
2 1 4
2 3 1
2 1 2
2 3 1
I want to create a 4th column that counts the occurrences of each unique combination (from, to) within each group, dropping any repeated combination within the group (leaving only one row per combination).
Expected output:
Group from to weight
1 2 1 1
1 1 2 1
1 3 2 1
1 3 1 1
2 1 4 1
2 3 1 2
2 1 2 1
In the expected output, the second (from=3, to=1) row in group 2 was dropped because it is a duplicate.
In your case, we just need groupby with size:
out = df.groupby(df.columns.tolist()).size().to_frame(name='weight').reset_index()
Out[258]:
Group from to weight
0 1 1 2 1
1 1 2 1 1
2 1 3 1 1
3 1 3 2 1
4 2 1 2 1
5 2 1 4 1
6 2 3 1 2
You can group by the 3 columns using .groupby() and take each group's size with GroupBy.size(), as follows:
df_out = df.groupby(['Group', 'from', 'to'], sort=False).size().reset_index(name='weight')
Result:
print(df_out)
Group from to weight
0 1 2 1 1
1 1 1 2 1
2 1 3 2 1
3 1 3 1 1
4 2 1 4 1
5 2 3 1 2
6 2 1 2 1
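Assuming pandas >= 1.1, DataFrame.value_counts gives the same result as a one-liner sketch (it counts unique rows across all columns):
out = df.value_counts(sort=False).reset_index(name='weight')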

Python: create new column conditionally on values from two other columns

I would like to combine two columns into a new column.
Let's suppose I have:
Index A B
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 2
6 1 2
7 1 2
8 1 2
9 1 2
10 1 2
Now I would like to create a column C with the entries from column A for Index 0 to 4 and from column B for Index 5 to 10. It should look like this:
Index A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
Is there Python code for how I can get this? Thanks in advance!
If Index is an actual column, you can use numpy.where and specify your condition:
import numpy as np
df['C'] = np.where(df['Index'] <= 4, df['A'], df['B'])
Index A B C
0 0 1 0 1
1 1 1 0 1
2 2 1 0 1
3 3 1 0 1
4 4 1 0 1
5 5 1 2 2
6 6 1 2 2
7 7 1 2 2
8 8 1 2 2
9 9 1 2 2
10 10 1 2 2
If Index is the DataFrame's actual index, you can slice it with iloc and create the column with pd.concat:
df['C'] = pd.concat([df['A'].iloc[:5], df['B'].iloc[5:]])
print(df)
A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
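A third option (a sketch assuming the default integer index): Series.where keeps A wherever the condition holds and falls back to B elsewhere:
df['C'] = df['A'].where(df.index <= 4, df['B'])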

How to add a row with countif in pandas

I have a DataFrame with values 1 and 2.
I want to add a row at the end of the DataFrame, counting the number of 1s in each column. It should be similar to
COUNTIF(A:A,1) dragged across all columns in Excel.
I tried something like df.loc['lastrow'] = df.count()[1], but the result is not correct.
How can I do it, and what does this expression (count()[1]) do?
You can append the result of the sum. (As for your attempt: df.count() returns the number of non-NA cells per column, so df.count()[1] is just the non-NA count of the column labeled 1; assigning that single number fills the whole new row with it.) Note that DataFrame.append was removed in pandas 2.0, so pd.concat is the modern equivalent:
out = pd.concat([df, df.eq(1).sum().to_frame().T], ignore_index=True)
You can just compare your dataframe to the value you are interested in (1 for example), and then perform a sum on these booleans, like:
>>> df
0 1 2 3 4
0 2 2 2 1 2
1 2 2 2 2 2
2 2 1 2 1 1
3 1 2 2 1 1
4 2 2 1 2 1
5 2 2 2 2 2
6 1 1 1 1 2
7 2 2 1 1 1
8 1 1 1 2 1
9 2 2 1 2 1
>>> (df == 1).sum()
0 3
1 3
2 5
3 5
4 6
dtype: int64
You can thus append that row (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
>>> pd.concat([df, (df == 1).sum().to_frame().T], ignore_index=True)
0 1 2 3 4
0 2 2 2 1 2
1 2 2 2 2 2
2 2 1 2 1 1
3 1 2 2 1 1
4 2 2 1 2 1
5 2 2 2 2 2
6 1 1 1 1 2
7 2 2 1 1 1
8 1 1 1 2 1
9 2 2 1 2 1
10 3 3 5 5 6
The last row thus contains, per column, the number of 1s in the previous rows.
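If you would rather label the summary row than give it the next integer index (echoing the df.loc['lastrow'] attempt from the question), one in-place option is:
df.loc['count_1'] = df.eq(1).sum()  # mixes a string label into an integer index
This mutates df and leaves a mixed-type index, so it is best kept for display purposes.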

Pandas - Assign unique ID to each group in grouped data

I have a dataframe with many attributes. I want to assign an id for all unique combinations of these attributes.
Assume this is my df:
df = pd.DataFrame(np.random.randint(1,3, size=(10, 3)), columns=list('ABC'))
A B C
0 2 1 1
1 1 1 1
2 1 1 1
3 2 2 2
4 1 2 2
5 1 2 1
6 1 2 2
7 1 2 1
8 1 2 2
9 2 2 1
Now, I need to append a new column with an id for unique combinations. It has to be 0 if the combination occurs only once. In this case:
A B C unique_combination
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
My first approach was to use a for loop and check, for every row, whether the row's combination of values occurs more than once in the dataframe, using .query:
unique_combination = 1  # acts as a counter
df['unique_combination'] = 0
for idx, row in df.iterrows():
    if len(df.query('A == @row.A & B == @row.B & C == @row.C')) > 1:
        # check if one occurrence of the combination already has a value > 0???
        df.loc[idx, 'unique_combination'] = unique_combination
        unique_combination += 1
However, I have no idea how to check whether an ID has already been assigned for a combination (see the comment in the code). Additionally, my approach feels very slow and hacky (I have over 15000 rows). Do you data wranglers see a different approach to my problem?
Thank you very much!
Step 1: Assign a new column with value 0:
df['new'] = 0
Step 2: Create a mask of combinations that occur more than once, i.e.
mask = df.groupby(['A','B','C'])['new'].transform(lambda x: len(x) > 1)
Step 3: Assign ids to the masked rows by factorizing the combined key, i.e.
df.loc[mask,'new'] = df.loc[mask,['A','B','C']].astype(str).sum(1).factorize()[0] + 1
# or
# df.loc[mask,'new'] = df.loc[mask,['A','B','C']].groupby(['A','B','C']).ngroup()+1
Output:
A B C new
0 2 1 1 0
1 1 1 1 1
2 1 1 1 1
3 2 2 2 0
4 1 2 2 2
5 1 2 1 3
6 1 2 2 2
7 1 2 1 3
8 1 2 2 2
9 2 2 1 0
GroupBy.ngroup, added in pandas 0.20.2, creates a column of unique ids automatically for you:
df['unique_id'] = df.groupby(['A', 'B', 'C']).ngroup()
gives the following output
A B C unique_id
0 2 1 2 3
1 2 2 1 4
2 1 2 1 1
3 1 2 2 2
4 1 1 1 0
5 1 2 1 1
6 1 1 1 0
7 2 2 2 5
8 1 2 2 2
9 1 2 2 2
The groups are given ids based on the order they would be iterated over.
See the documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups
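To also satisfy the original requirement that singleton combinations get 0, here is a sketch combining transform('size') with ngroup (sort=False numbers groups in order of first appearance, matching the expected output):
sizes = df.groupby(['A', 'B', 'C'])['A'].transform('size')
df['unique_combination'] = 0
mask = sizes > 1
df.loc[mask, 'unique_combination'] = df.loc[mask].groupby(['A', 'B', 'C'], sort=False).ngroup() + 1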
