Count sets of consecutive True values in a column - Python

Let's say that I have a dataframe as follows:
df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
Then I convert it into boolean form:
df.eq(1)
Out[213]:
A
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 True
8 True
9 False
10 True
11 True
12 True
13 True
14 True
15 False
16 False
17 False
18 False
19 False
20 True
21 True
What I want is to count consecutive sets of True values in the column. In this example, the output would be:
df
Out[215]:
A count
0 1 5.0
1 1 2.0
2 1 5.0
3 1 2.0
4 1 NaN
5 0 NaN
6 0 NaN
7 1 NaN
8 1 NaN
9 0 NaN
10 1 NaN
11 1 NaN
12 1 NaN
13 1 NaN
14 1 NaN
15 0 NaN
16 0 NaN
17 0 NaN
18 0 NaN
19 0 NaN
20 1 NaN
21 1 NaN
I have made some progress using tools such as groupby and cumsum, but honestly I cannot figure out how to solve it. Thanks in advance

You can use df['A'].diff().ne(0).cumsum() to generate a grouper that labels each consecutive run of zeros/ones:
# A side-by-side comparison:
>>> pd.concat([df['A'], df['A'].diff().ne(0).cumsum()], axis=1)
A A
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 0 2
6 0 2
7 1 3
8 1 3
9 0 4
10 1 5
11 1 5
12 1 5
13 1 5
14 1 5
15 0 6
16 0 6
17 0 6
18 0 6
19 0 6
20 1 7
21 1 7
Thus, group by that grouper, compute the group sums, replace zeros with NaN, drop them, and reset the index:
df['count'] = df.groupby(df['A'].diff().ne(0).cumsum()).sum().replace(0, np.nan).dropna().reset_index(drop=True)
Output:
>>> df
A count
0 1 5.0
1 1 2.0
2 1 5.0
3 1 2.0
4 1 NaN
5 0 NaN
6 0 NaN
7 1 NaN
8 1 NaN
9 0 NaN
10 1 NaN
11 1 NaN
12 1 NaN
13 1 NaN
14 1 NaN
15 0 NaN
16 0 NaN
17 0 NaN
18 0 NaN
19 0 NaN
20 1 NaN
21 1 NaN
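For completeness, here is the whole approach as a minimal, self-contained sketch (my assembly of the steps above, assuming the usual numpy/pandas imports):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})

groups = df['A'].diff().ne(0).cumsum()       # label each consecutive run of equal values
run_sums = df.groupby(groups)['A'].sum()     # run length for runs of ones, 0 for runs of zeros
df['count'] = (run_sums.replace(0, np.nan)   # discard the zero runs...
                       .dropna()
                       .reset_index(drop=True))  # ...and align the counts to rows 0, 1, 2, ...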

I propose an alternative approach that makes use of string splitting.
Let's transform the Series df.A into a string and then split it where the zeros are.
df = pd.DataFrame({'A':[1,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,0,0,0,1,1]})
ll = ''.join(df.A.astype('str').tolist()).split('0')
The list ll looks like
print(ll)
['11111', '', '11', '11111', '', '', '', '', '11']
Now we take the length of every non-empty string and collect the results in a list:
[len(item) for item in ll if len(item)>0]
This is doable if the Series is not too long.
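To get the question's 'count' column out of this (my addition, not part of the original answer), the lengths can be assigned back and pandas will pad the remaining rows with NaN:
counts = [len(item) for item in ll if len(item) > 0]
df['count'] = pd.Series(counts, dtype=float)  # rows 4 onwards are filled with NaN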

Related

Isolate sequence of positive numbers in a pandas dataframe

I would like to identify what I call "periods" in data stored in a pandas dataframe.
Let's say I have these values:
values
1 0
2 8
3 1
4 0
5 5
6 6
7 4
8 7
9 0
10 2
11 9
12 1
13 0
I would like to identify sequences of strictly positive numbers with length greater than or equal to 3. Each non-strictly-positive number ends an ongoing sequence.
This would give :
values period
1 0 None
2 8 None
3 1 None
4 0 None
5 5 1
6 6 1
7 4 1
8 7 1
9 0 None
10 2 2
11 9 2
12 1 2
13 0 None
Using boolean arithmetic:
N = 3
m1 = df['values'].le(0)
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
df['period'] = (m1&m2).cumsum().where((~m1)&m2)
output:
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
intermediates:
values m1 m2 CS(m1) m1&m2 CS(m1&m2) (~m1)&m2 period
1 0 True False 1 False 0 False NaN
2 8 False False 1 False 0 False NaN
3 1 False False 1 False 0 False NaN
4 0 True True 2 True 1 False NaN
5 5 False True 2 False 1 True 1.0
6 6 False True 2 False 1 True 1.0
7 4 False True 2 False 1 True 1.0
8 7 False True 2 False 1 True 1.0
9 0 True True 3 True 2 False NaN
10 2 False True 3 False 2 True 2.0
11 9 False True 3 False 2 True 2.0
12 1 False True 3 False 2 True 2.0
13 0 True False 4 False 2 False NaN
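One edge case worth noting (my observation, not from the original answer): each m1.cumsum() group consists of a leading non-positive row plus the run that follows it, except for a run starting at the very first row, which has no leading row, so the .gt(N) test is off by one for that group. A sketch that labels the runs directly and sidesteps the issue:
N = 3
pos = df['values'].gt(0)
run_id = pos.ne(pos.shift()).cumsum()                     # label consecutive runs
good = pos & pos.groupby(run_id).transform('size').ge(N)  # positive runs of length >= N
df['period'] = run_id.where(good).rank(method='dense')    # number the surviving runs 1, 2, ...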
You can try
sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum()  # label each run of same-signed values
df['period'] = (df[sign.eq(1)]                                # exclude non-positive numbers
                .groupby(m)
                ['values'].filter(lambda col: len(col) >= 3)  # keep only runs of length >= 3
                .groupby(m)                                   # regroup the surviving runs...
                .ngroup() + 1                                 # ...and number them 1, 2, ...
                )
print(df)
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
A simple solution:
count = 0
n_groups = 0
seq_idx = [None]*len(df)
for i in range(len(df)):
    if df.iloc[i]['values'] > 0:
        count += 1
    else:
        if count >= 3:
            n_groups += 1
            seq_idx[i-count: i] = [n_groups]*count
        count = 0
# flush a qualifying run that reaches the last row (the loop above only
# closes a run when it meets a non-positive value)
if count >= 3:
    n_groups += 1
    seq_idx[len(df)-count:] = [n_groups]*count
df['period'] = seq_idx
Output:
values period
0 0 NaN
1 8 NaN
2 1 NaN
3 0 NaN
4 5 1.0
5 6 1.0
6 4 1.0
7 7 1.0
8 0 NaN
9 2 2.0
10 9 2.0
11 1 2.0
12 0 NaN
One simple approach using find_peaks to find the plateaus (positive consecutive integers) of at least size 3:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks
df = pd.DataFrame.from_dict({'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})
_, plateaus = find_peaks((df["values"] > 0).to_numpy(), plateau_size=3)
indices = np.arange(len(df["values"]))[:, None]
indices = (indices >= plateaus["left_edges"]) & (indices <= plateaus["right_edges"])
res = (indices * (np.arange(indices.shape[1]) + 1)).sum(axis=1)
df["periods"] = res
print(df)
Output
values periods
0 0 0
1 8 0
2 1 0
3 0 0
4 5 1
5 6 1
6 4 1
7 7 1
8 0 0
9 2 2
10 9 2
11 1 2
12 0 0
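A caveat with find_peaks (my note): a plateau touching the first or last element of the array is not considered a peak, so a qualifying run at the very start or end of the frame would be missed. Zero-padding both ends avoids that:
padded = np.pad((df["values"] > 0).to_numpy().astype(int), 1)  # add a zero on each side
_, plateaus = find_peaks(padded, plateau_size=3)
left = plateaus["left_edges"] - 1    # undo the one-element padding offset
right = plateaus["right_edges"] - 1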
def function1(dd: pd.DataFrame):
    dd.loc[:, 'period'] = None
    if len(dd) >= 4:
        dd.iloc[1:, 2] = dd.iloc[1:, 1]
    return dd

df1.assign(col1=df1.le(0).cumsum().sub(1)).groupby('col1').apply(function1)
out:
values col1 period
0 0 0 None
1 8 0 None
2 1 0 None
3 0 1 None
4 5 1 1
5 6 1 1
6 4 1 1
7 7 1 1
8 0 2 None
9 2 2 2
10 9 2 2
11 1 2 2
12 0 3 None

Add missing rows based on column

I have the following df:
df = pd.DataFrame(data={'day': [1, 1, 1, 2, 2, 3], 'pos': 2*[1, 14, 18], 'value': 2*[1, 2, 3]})
df
day pos value
0 1 1 1
1 1 14 2
2 1 18 3
3 2 1 1
4 2 14 2
5 3 18 3
and I want to fill in rows such that every day has every possible value of column 'pos'.
desired result:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
My attempt:
df.set_index('pos').reindex(pd.Index(3*[1,14,18])).reset_index()
yields:
ValueError: cannot reindex from a duplicate axis
Let's try pivot then stack:
df.pivot('day','pos','value').stack(dropna=False).reset_index(name='value')
Output:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
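A compatibility note (mine, not from the answer): since pandas 2.0 the arguments to pivot are keyword-only, so on recent versions this reads:
df.pivot(index='day', columns='pos', values='value').stack(dropna=False).reset_index(name='value')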
Option 2: merge with MultiIndex:
df.merge(pd.DataFrame(index=pd.MultiIndex.from_product([df['day'].unique(), df['pos'].unique()])),
         left_on=['day','pos'], right_index=True, how='outer')
Output:
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 3 18 3.0
5 2 18 NaN
5 3 1 NaN
5 3 14 NaN
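As the output shows, the unmatched combinations come back out of order and with duplicated index labels; a sort plus index reset tidies this up (my addition, with out standing for the merge result):
# 'out' is the merge result from above (hypothetical name)
out = out.sort_values(['day', 'pos']).reset_index(drop=True)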
You can reindex:
s = pd.MultiIndex.from_product([df["day"].unique(),df["pos"].unique()], names=["day","pos"])
print (df.set_index(["day","pos"]).reindex(s).reset_index())
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
I'd avoid the manual product of all possible values.
Instead, one can get the unique values and just reindex per day:
u = df.pos.unique()
df.groupby('day').apply(lambda s: s.set_index('pos').reindex(u))['value'] \
    .reset_index()
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0
You could use the complete function from pyjanitor to expose the missing values :
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.complete('day', 'pos')
day pos value
0 1 1 1.0
1 1 14 2.0
2 1 18 3.0
3 2 1 1.0
4 2 14 2.0
5 2 18 NaN
6 3 1 NaN
7 3 14 NaN
8 3 18 3.0

Iterating over rows subtracting by 1 in dataframe pandas

I have a pandas dataframe in which, for each row, I would like to start from the last non-null value and subtract 1 from it repeatedly for all following columns.
z = pd.DataFrame({'l': range(10),
                  'r': [4] + [np.nan]*9,
                  'gh': [np.nan]*4 + [15] + [np.nan]*5,
                  'gfh': [np.nan]*9 + [2]})
df = z.transpose().copy()
df.reset_index(inplace=True)
df.drop(['index'],axis=1, inplace=True)
df.columns = ['a','b','c','d','e','f','g','h','i','j']
In [8]: df
Out[8]:
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 NaN NaN NaN NaN NaN
2 0 1 2 3 4 5 6 7 8 9
3 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
I would like to decrement each row of the above dataframe by 1, column by column, after its last non-null value. For example, in the row where the value is 15, I want 14, 13, 12, 11, 10 to follow. Nothing will follow the 2 in the first row since there are no columns left. Also, the 4 in the last row would be followed by 3, 2, 1, 0, 0, 0, and so on.
I reached my desired output by doing the following.
for index, row in df.iterrows():
    df.iloc[index, df.columns.get_loc(df.iloc[index].last_valid_index())+1:] = \
        [(df.iloc[index, df.columns.get_loc(df.iloc[index].last_valid_index()):][0] - (x+1)).astype(int)
         for x in range((df.shape[1]-1) - df.columns.get_loc(df.iloc[index].last_valid_index()))]
df[df < 0] = 0
This gives me the desired output
In [13]: df
Out[13]:
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 14 13 12 11 10
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
BUT: in my real-world data I have 50K-plus columns, and the above code takes WAAAY too long.
Can anyone please suggest how I can make this run faster?
I believe the solution would be to somehow tell the code that once the subtraction reaches zero it can move on to the next row, but I don't know how to do that: even if I use max(0, subtraction formula), the code still wastes time subtracting.
Thank you.
I don't know how fast it will be, but you could experiment with ffill, fillna, and cumsum. For example:
>>> df
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 NaN NaN NaN NaN NaN
2 0 1 2 3 4 5 6 7 8 9
3 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> df.where(~mask, df.fillna(-1).cumsum(axis=1).clip_lower(0))
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 10 9 8 7 6
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
This is a little tricky. First we figure out which cells we need to fill: forward-fill along each row and keep the cells that are null in the original but non-null after filling (there might be a faster way using last_valid_index, but this is the first thing that occurred to me):
>>> mask = df.ffill(axis=1).notnull() & df.isnull()
>>> mask
a b c d e f g h i j
0 False False False False False False False False False False
1 False False False False False True True True True True
2 False False False False False False False False False False
3 False True True True True True True True True True
If we fill the empty spots with -1, we can get the values we want by cumulative summing to the right:
>>> (df.fillna(-1).cumsum(axis=1))
a b c d e f g h i j
0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -7
1 -1 -2 -3 -4 11 10 9 8 7 6
2 0 1 3 6 10 15 21 28 36 45
3 4 3 2 1 0 -1 -2 -3 -4 -5
Many of those values we don't want, but that's okay, because we're only going to insert the ones we need. We should clip to 0, though:
>>> df.fillna(-1).cumsum(axis=1).clip_lower(0)
a b c d e f g h i j
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 11 10 9 8 7 6
2 0 1 3 6 10 15 21 28 36 45
3 4 3 2 1 0 0 0 0 0 0
and finally we can use the original ones where mask is False, and the new values where mask is True:
>>> df.where(~mask, df.fillna(-1).cumsum(axis=1).clip_lower(0))
a b c d e f g h i j
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
1 NaN NaN NaN NaN 15 10 9 8 7 6
2 0 1 2 3 4 5 6 7 8 9
3 4 3 2 1 0 0 0 0 0 0
(Note: this assumes the rows we need to fill look like the ones in your example. If they're messier we'd have to do a little more work, but the same techniques will apply.)
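Two caveats on running this today (my notes, not part of the original answer): clip_lower has since been removed from pandas in favour of clip(lower=...), and filling the leading NaNs with -1 offsets the cumulative sum, which is why the 15 above is followed by 10 rather than the 14 the question asks for. A sketch that fills the leading NaNs with 0 instead:
# cells to fill: null in the original but non-null after forward-filling
mask = df.ffill(axis=1).notnull() & df.isnull()
# leading NaNs become 0 (they no longer shift the sum); trailing NaNs become -1
filled = df.fillna(-1).where(df.ffill(axis=1).notnull(), 0)
df = df.where(~mask, filled.cumsum(axis=1).clip(lower=0))
With this variant the second row comes out as 15, 14, 13, 12, 11, 10.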

How to calculate a rolling count of a categorical variable in pandas

I'm attempting to do a rolling count on a dataframe. The problem that I am having is specifying the condition since it is a string, not an integer. The dataframe below is a snippet, along with a snippet of a dictionary.
GameID Event
0 100 NaN
1 100 NaN
2 100 Ben
3 100 NaN
4 100 Steve
5 100 Ben
6 100 NaN
7 100 Steve
8 100 NaN
9 100 NaN
10 101 NaN
11 101 NaN
12 101 Joe
13 101 NaN
14 101 Will
15 101 Joe
16 101 NaN
17 101 Will
18 101 NaN
19 101 NaN
gamedic = {'100':['Ben','Steve'], '101':['Joe','Will']}
Ultimately, I would want the dataframe to look like the following. I named the columns Ben and Steve for this example but in reality they will be First and Second, corresponding to their place in the dictionary.
GameID Event Ben Steve
0 100 NaN 0 0
1 100 NaN 0 0
2 100 Ben 0 0
3 100 NaN 1 0
4 100 Steve 1 0
5 100 Ben 1 1
6 100 NaN 2 1
7 100 Steve 2 1
8 100 NaN 2 2
9 100 NaN 2 2
10 101 NaN 0 0
11 101 NaN 0 0
12 101 Joe 0 0
13 101 NaN 1 0
14 101 Will 1 0
15 101 Joe 1 1
16 101 NaN 2 1
17 101 Will 2 1
18 101 NaN 2 2
19 101 NaN 2 2
pd.rolling_count(df.Event, 1000,0).shift(1)
ValueError: could not convert string to float: Steve
I'm not sure if this is a complicated problem or if I'm missing something obvious in pandas. The whole string concept makes it tough for me to even get going.
First you want to use your dictionary to get a column containing just "First" and "Second". I can't think of a clever way to do this, so let's just iterate over the rows:
import numpy as np
df['Winner'] = np.nan
for i, row in df.iterrows():
    # gamedic keys are strings in the question, so convert the numeric GameID
    if row.Event == gamedic[str(row.GameID)][0]:
        df.loc[i, 'Winner'] = 'First'
    if row.Event == gamedic[str(row.GameID)][1]:
        df.loc[i, 'Winner'] = 'Second'
You can use pd.get_dummies to convert a string column (representing a categorical variable) to indicator variables; in your case this will give you
pd.get_dummies(df.Winner)
Out[46]:
First Second
0 0 0
1 0 0
2 1 0
3 0 0
4 0 1
5 1 0
6 0 0
7 0 1
8 0 0
9 0 0
10 0 0
11 0 0
12 1 0
13 0 0
14 0 1
15 1 0
16 0 0
17 0 1
18 0 0
19 0 0
You can add these onto your original dataframe with pd.concat:
df = pd.concat([df,pd.get_dummies(df.Winner)],axis=1)
Then you can get your cumulative sums with groupby.cumsum as in #Brian's answer
df.groupby('GameID').cumsum()
Out[60]:
First Second
0 0 0
1 0 0
2 1 0
3 1 0
4 1 1
5 2 1
6 2 1
7 2 2
8 2 2
9 2 2
10 0 0
11 0 0
12 1 0
13 1 0
14 1 1
15 2 1
16 2 1
17 2 2
18 2 2
19 2 2
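Note that the expected table in the question counts each player's events strictly before the current row, i.e. the count increments on the row after the event. If that offset matters, shifting within each game should reproduce it; a sketch building on the Winner column above (my addition):
dummies = pd.get_dummies(df['Winner'])
df[['First', 'Second']] = (dummies.groupby(df['GameID']).cumsum()             # running counts per game
                                  .groupby(df['GameID']).shift(fill_value=0)) # count only prior events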
Is this what you're looking for?
df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
                  columns=['A'])
df
A
0 a
1 a
2 a
3 b
4 b
5 a
df.groupby('A').cumcount()
0 0
1 1
2 2
3 0
4 1
5 3
dtype: int64
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.cumcount.html

Pandas: Find empty/missing values and add them to DataFrame

I have a dataframe where column 1 should have all the values from 1 to 169. If a value doesn't exist, I'd like to add a new row to my dataframe which contains said value (and some zeros).
I can't get the following code to work, even though there are no errors:
for i in range(1,170):
    if i in df.col1 is False:
        df.loc[len(df)+1] = [i,0,0]
    else:
        continue
Any advice?
It would be better to do something like:
In [37]:
# create our test df, we have values 1 to 9 in steps of 2
df = pd.DataFrame({'a':np.arange(1,10,2)})
df['b'] = np.NaN
df['c'] = np.NaN
df
Out[37]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
In [38]:
# now set the index to a, this allows us to reindex the values with optional fill value, then reset the index
df = df.set_index('a').reindex(index = np.arange(1,10), fill_value=0).reset_index()
df
Out[38]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
So just to explain the above:
In [40]:
# set the index to 'a', this allows us to reindex and fill missing values
df = df.set_index('a')
df
Out[40]:
b c
a
1 NaN NaN
3 NaN NaN
5 NaN NaN
7 NaN NaN
9 NaN NaN
In [41]:
# now reindex and pass fill_value for the extra rows we want
df = df.reindex(index = np.arange(1,10), fill_value=0)
df
Out[41]:
b c
a
1 NaN NaN
2 0 0
3 NaN NaN
4 0 0
5 NaN NaN
6 0 0
7 NaN NaN
8 0 0
9 NaN NaN
In [42]:
# now reset the index
df = df.reset_index()
df
Out[42]:
a b c
0 1 NaN NaN
1 2 0 0
2 3 NaN NaN
3 4 0 0
4 5 NaN NaN
5 6 0 0
6 7 NaN NaN
7 8 0 0
8 9 NaN NaN
If you modified your loop to the following then it would work:
In [63]:
for i in range(1,10):
    if any(df.a.isin([i])) == False:
        df.loc[len(df)+1] = [i,0,0]
    else:
        continue
df
Out[63]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
EDIT
If you wanted the missing rows to appear at the end of the df, you could create a temporary df with the full range of values and the other columns set to zero, then filter it down to the values missing from the original df and concatenate the two:
In [70]:
df_missing = pd.DataFrame({'a':np.arange(10),'b':0,'c':0})
df_missing
Out[70]:
a b c
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 0 0
7 7 0 0
8 8 0 0
9 9 0 0
In [73]:
df = pd.concat([df,df_missing[~df_missing.a.isin(df.a)]], ignore_index=True)
df
Out[73]:
a b c
0 1 NaN NaN
1 3 NaN NaN
2 5 NaN NaN
3 7 NaN NaN
4 9 NaN NaN
5 0 0 0
6 2 0 0
7 4 0 0
8 6 0 0
9 8 0 0
The expression if i in df.col1 is False always evaluates to False: Python chains it as (i in df.col1) and (df.col1 is False), and in any case in on a Series checks the index, not the values. Also, rather than growing the dataframe row by row via df.loc[], it is better in modern pandas to use pandas.concat once at the end.
I would recommend gathering all the missing values in a list and then concatenating them to the dataframe. For instance:
>>> df = pd.DataFrame({'col1': list(range(5)) + [i + 6 for i in range(5)], 'col2': range(10)})
>>> print(df)
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
>>> to_add = []
>>> for i in range(11):
...     if i not in df.col1.values:
...         to_add.append([i, 0])
...     else:
...         continue
...
>>> pd.concat([df, pd.DataFrame(to_add, columns=['col1', 'col2'])])
col1 col2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 6 5
6 7 6
7 8 7
8 9 8
9 10 9
0 5 0
I assume you don't care about the index values of the rows you add.
