group consecutive equal values and count up - python

The df:
a b
0 1
0 3
0 3
0 1
1 1
1 2
1 4
I would like to group by a and count up the equal consecutive rows in a group:
a b c
0 1 1
0 3 2
0 3 2
0 1 3
1 1 1
1 2 2
1 4 3
I tried:
df['c'] = df.b.groupby([df.a, df.b.diff().ne(0).cumsum()])
which raised an error:
Length of values does not match length of index

In your case, you can use factorize within each group:
s=df.b.diff().ne(0).cumsum().groupby(df.a).transform(lambda x : x.factorize()[0])+1
Out[276]:
0 1
1 2
2 2
3 3
4 1
5 2
6 3
Name: b, dtype: int32
df['c']=s
Or
df.b.groupby(df.a).transform(lambda x : x.diff().ne(0).cumsum())
Out[277]:
0 1
1 2
2 2
3 3
4 1
5 2
6 3
Name: b, dtype: int32

Another approach:
s = df.ne(df.shift()).any(axis=1).astype(int)
df['c'] = s.groupby(df['a']).cumsum()
Output:
a b c
0 0 1 1
1 0 3 2
2 0 3 2
3 0 1 3
4 1 1 1
5 1 2 2
6 1 4 3
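For reference, the attempt in the question assigns the GroupBy object itself, with no aggregation or transform applied to it, so there is nothing of the right length to store in the column. The following is a minimal self-contained sketch of one way to get the counter, assuming the sample data from the question; the variable name changed is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 0, 0, 1, 1, 1],
                   "b": [1, 3, 3, 1, 1, 2, 4]})

# True wherever b differs from the previous row within the same a group.
# The first row of each group is compared against NaN, so it is always True.
changed = df.groupby("a")["b"].shift().ne(df["b"])

# A running count of those changes per group labels each consecutive run.
df["c"] = changed.astype(int).groupby(df["a"]).cumsum()

print(df)
```

Shifting inside the groupby means a run never carries over from one value of a to the next, even when the last b of one group equals the first b of the following group.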

Related

Pandas group consecutive and label the length

I want to label each row with the length of the consecutive run it belongs to. My data:
a
---
1
0
1
1
0
1
1
1
0
1
1
I want :
a | c
--------
1 1
0 0
1 2
1 2
0 0
1 3
1 3
1 3
0 0
1 2
1 2
Then I can calculate the mean of the "b" column grouped by "c". I tried shift, cumsum and cumcount, but none of them worked.
Use GroupBy.transform on the consecutive groups to get each run's size, then set the result to 0 wherever column a is not 1:
df['c1'] = (df.groupby(df.a.ne(df.a.shift()).cumsum())['a']
              .transform('size')
              .where(df.a.eq(1), 0))
print (df)
a b c c1
0 1 1 1 1
1 0 2 0 0
2 1 3 2 2
3 1 2 2 2
4 0 1 0 0
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 0 2 0 0
9 1 2 2 2
10 1 1 2 2
If the column contains only 0 and 1 values, you can multiply by a instead:
df['c1'] = (df.groupby(df.a.ne(df.a.shift()).cumsum())['a']
              .transform('size')
              .mul(df.a))
print (df)
a b c c1
0 1 1 1 1
1 0 2 0 0
2 1 3 2 2
3 1 2 2 2
4 0 1 0 0
5 1 3 3 3
6 1 1 3 3
7 1 3 3 3
8 0 2 0 0
9 1 2 2 2
10 1 1 2 2
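The whole pipeline, including the mean of b per label that the question is ultimately after, can be sketched as follows. This assumes the a and b values shown in the printed output above; run_id and run_len are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1],
                   "b": [1, 2, 3, 2, 1, 3, 1, 3, 2, 2, 1]})

# A new run starts whenever a differs from the previous row.
run_id = df["a"].ne(df["a"].shift()).cumsum()

# Size of the run each row belongs to, broadcast back to every row.
run_len = df.groupby(run_id)["a"].transform("size")

# Keep the run length only where a == 1; everything else becomes 0.
df["c"] = run_len.where(df["a"].eq(1), 0)

# Mean of b for each label in c.
mean_b = df.groupby("c")["b"].mean()
print(mean_b)
```

Note that grouping by c pools separate runs of the same length together (both runs of length 2 land in one group). If each run should be kept apart, grouping by run_id instead is one option.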

Duplicate a selected row and put the duplicate just below in a Pandas DataFrame

I have a Pandas dataframe like this :
id A B
0 2 2
1 1 1
2 3 3
3 7 7
And I want to duplicate the first row 3 times just below the selected row :
id A B
0 2 2
1 2 2
2 2 2
3 2 2
4 1 1
5 3 3
6 7 7
Is there an existing method for this in the Pandas library?
There is no built-in method for doing just this. However, you can create a list of indexes, and use df.loc + df.index.repeat:
new_df = df.loc[df.index.repeat([4] + [1] * (len(df) - 1))].reset_index(drop=True)
Output:
>>> new_df
id A B
0 0 2 2
1 0 2 2
2 0 2 2
3 0 2 2
4 1 1 1
5 2 3 3
6 3 7 7
Use reindex and Index.repeat to create your dataframe:
>>> df.reindex(df.index.repeat([3] + [1] * (len(df) - 1)))
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Another way:
>>> df.loc[[df.index[0]]*3 + df.index[1:].tolist()]
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
A more generalized way proposed by @MuhammadHassan:
row_index = 0
repeat_time = 3
>>> df.reindex(df.index.tolist() + [row_index]*repeat_time).sort_index()
id A B
0 0 2 2
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
Let us try
n=3
row = 0
df = pd.concat([df, df.loc[[row]*(n-1)]]).sort_index()
df
id A B
0 0 2 2
0 0 2 2
0 0 2 2
1 1 1 1
2 2 3 3
3 3 7 7
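The answers above differ in whether the repeat count includes the original row. A small helper makes that explicit; this is only a sketch, and duplicate_row, label and extra_copies are hypothetical names, not part of pandas.

```python
import numpy as np
import pandas as pd

def duplicate_row(df, label, extra_copies):
    """Return a copy of df in which the row(s) with index `label` are
    followed by `extra_copies` duplicates. Hypothetical helper."""
    # Repeat count per row: 1 everywhere, extra_copies + 1 for the target.
    counts = np.where(df.index == label, extra_copies + 1, 1)
    # Work with positions so a non-unique index does not multiply rows.
    positions = np.repeat(np.arange(len(df)), counts)
    return df.iloc[positions].reset_index(drop=True)

df = pd.DataFrame({"id": [0, 1, 2, 3], "A": [2, 1, 3, 7], "B": [2, 1, 3, 7]})
new_df = duplicate_row(df, label=0, extra_copies=3)
print(new_df)
```

With extra_copies=3 the selected row appears four times in total, which matches the desired output in the question.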

Python: create new column conditionally on values from two other columns

I would like to combine two columns in a new column.
Let's suppose I have:
Index A B
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 2
6 1 2
7 1 2
8 1 2
9 1 2
10 1 2
Now I would like to create a column C with the entries from A from Index 0 to 4 and from column B from Index 5 to 10. It should look like this:
Index A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
Is there a way to do this in Python? Thanks in advance!
If Index is an actual column, you can use numpy.where and specify your condition:
import numpy as np
df['C'] = np.where(df['Index'] <= 4, df['A'], df['B'])
Index A B C
0 0 1 0 1
1 1 1 0 1
2 2 1 0 1
3 3 1 0 1
4 4 1 0 1
5 5 1 2 2
6 6 1 2 2
7 7 1 2 2
8 8 1 2 2
9 9 1 2 2
10 10 1 2 2
If Index is the DataFrame's actual index, you can slice both columns by position with iloc and build the new column with concat:
df['C'] = pd.concat([df['A'].iloc[:5], df['B'].iloc[5:]])
print(df)
A B C
0 1 0 1
1 1 0 1
2 1 0 1
3 1 0 1
4 1 0 1
5 1 2 2
6 1 2 2
7 1 2 2
8 1 2 2
9 1 2 2
10 1 2 2
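When Index is the DataFrame's real index, the same condition can also be written directly against df.index, without slicing by position. A minimal sketch, assuming the default integer index and the sample values from the question (column C2 is added only to show the equivalent Series.where form):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1] * 11, "B": [0] * 5 + [2] * 6})

# Take A where the index label is at most 4, otherwise take B.
df["C"] = np.where(df.index <= 4, df["A"], df["B"])

# Equivalent form using Series.where: keep A where the condition holds.
df["C2"] = df["A"].where(df.index <= 4, df["B"])

print(df)
```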

Pandas - updating sequence of values

I have this Sample DataFrame:
pd.DataFrame(data={1:[0,3,4,1], 2:[4,1,0,0], 3:[0,0,1,2], 4:[1,2,3,4] })
1 2 3 4
0 0 4 0 1
1 3 1 0 2
2 4 0 1 3
3 1 0 2 4
But I want to convert it to the format below:
pd.DataFrame(data={1:[1,1,1,1], 2:[0,2,0,2], 3:[0,3,3,0], 4:[4,0,4,4] })
1 2 3 4
0 1 0 0 4
1 1 2 3 0
2 1 0 3 4
3 1 2 0 4
Is there a function or a vectorized way to do this? I have more than 100,000 rows, so approaches based on for loops, dictionaries or lists would be too slow.
My entry:
data = df.reset_index().melt("index").query("value > 0")
out = data.pivot(index="index", columns="value", values="value").fillna(0).astype(int)
giving
In [273]: out
Out[273]:
value 1 2 3 4
index
0 1 0 0 4
1 1 2 3 0
2 1 0 3 4
3 1 2 0 4
Unfortunately, if you want to get rid of the index and column names you have to clear them yourself, for example with out.index.name = out.columns.name = None or out.rename_axis(None).rename_axis(None, axis=1).
Using get_dummies:
s = pd.get_dummies(df, columns=df.columns, prefix_sep='', prefix='')
out = s.T.groupby(level=0).sum().T.drop(columns='0')
out.mask(out.ne(0)).fillna(dict(zip(out.columns, out.columns))).astype(int)
1 2 3 4
0 1 0 0 4
1 1 2 3 0
2 1 0 3 4
3 1 2 0 4
Using zip and np.isin (test whether each column label occurs in the row, then multiply by the labels):
pd.DataFrame([ np.isin(x, y)*df.columns.values for x , y in zip([df.columns.values]*len(df),df.values)])
Out[900]:
0 1 2 3
0 1 0 0 4
1 1 2 3 0
2 1 0 3 4
3 1 2 0 4
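Since the question rules out Python-level loops, the membership test can also be done for all rows at once with NumPy broadcasting. This is a sketch under the assumption that the column labels are the same integers that appear as cell values, as in the sample; present and cols are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({1: [0, 3, 4, 1], 2: [4, 1, 0, 0],
                   3: [0, 0, 1, 2], 4: [1, 2, 3, 4]})

cols = df.columns.to_numpy()

# Compare every cell with every column label: shape (rows, cells, labels).
# Reducing over the cell axis leaves present[i, j] == True when label
# cols[j] occurs anywhere in row i.
present = (df.to_numpy()[:, :, None] == cols[None, None, :]).any(axis=1)

# Put each label in its own column where present, 0 elsewhere.
out = pd.DataFrame(present * cols, index=df.index, columns=df.columns)
print(out)
```

The intermediate boolean array has rows × columns × columns entries, which is small for a handful of columns but grows quickly with very wide frames.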

Dynamic cumulative summations in Pandas

In the following DataFrame, the column B computes the sum of column A from index 0 to n.
ix A B
---------------
0 1 1
1 1 2
2 1 3
3 1 4
4 2 6
5 -1 5
6 -3 2
Alternatively, the column B sums 1 for each type == 'I' and -1 for each type == 'O'.
ix type B
----------------
0 I 1
1 I 2
2 O 1
3 I 2
4 O 1
5 O 0
6 I 1
How to perform this type of calculations, where the n-th result of one column depends on the aggregated results of another column, up to n?
You can use cumsum:
df['C'] = df.A.cumsum()
print (df)
ix A B C
0 0 1 1 1
1 1 1 2 2
2 2 1 3 3
3 3 1 4 4
4 4 2 6 6
5 5 -1 5 5
6 6 -3 2 2
For the second DataFrame, first map the types to +1 and -1 with a dict, then take the cumulative sum:
df['C'] = df.type.map({'I':1, 'O':-1}).cumsum()
print (df)
ix type B C
0 0 I 1 1
1 1 I 2 2
2 2 O 1 1
3 3 I 2 2
4 4 O 1 1
5 5 O 0 0
6 6 I 1 1
Or:
df['C'] = df.type.replace({'I':1, 'O':-1}).cumsum()
print (df)
ix type B C
0 0 I 1 1
1 1 I 2 2
2 2 O 1 1
3 3 I 2 2
4 4 O 1 1
5 5 O 0 0
6 6 I 1 1
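Both cases can be reproduced from scratch as follows. This is a minimal sketch using the sample values from the question; events is just an illustrative name for the second frame.

```python
import pandas as pd

# Case 1: B is the running total of A.
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, -1, -3]})
df["B"] = df["A"].cumsum()

# Case 2: each 'I' adds 1 and each 'O' subtracts 1.
events = pd.DataFrame({"type": ["I", "I", "O", "I", "O", "O", "I"]})
events["B"] = events["type"].map({"I": 1, "O": -1}).cumsum()

print(df)
print(events)
```

One difference between the two variants above: map produces NaN for any value missing from the dict, whereas replace leaves such values unchanged.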
