Pandas: conditional rolling count - python

I have a Series that looks like the following:
col
0 B
1 B
2 A
3 A
4 A
5 B
It's a time series, therefore the index is ordered by time.
For each row, I'd like to count how many times the value has appeared consecutively, i.e.:
Output:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
I found 2 related questions, but I can't figure out how to "write" that information as a new column in the DataFrame, for each row (as above). Using rolling_apply does not work well.
Related:
Counting consecutive events on pandas dataframe by their index
Finding consecutive segments in a pandas data frame

I think there is a nice way to combine the solutions of @chrisb and @CodeShaman (as was pointed out, CodeShaman's solution counts total and not consecutive values).
df['count'] = df.groupby((df['col'] != df['col'].shift(1)).cumsum()).cumcount()+1
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
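For illustration (my own sketch, reusing the sample data above): the comparison against shift(1) is True at the start of every new run, so its cumsum assigns one id per consecutive block, and cumcount then numbers the rows within each block.
import pandas as pd

df = pd.DataFrame({'col': list('BBAAAB')})
blocks = (df['col'] != df['col'].shift(1)).cumsum()
print(blocks.tolist())        # [1, 1, 2, 2, 2, 3] - one id per consecutive run
df['count'] = df.groupby(blocks).cumcount() + 1
print(df['count'].tolist())   # [1, 2, 1, 2, 3, 1]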

One-liner:
df['count'] = df.groupby('col').cumcount()
or
df['count'] = df.groupby('col').cumcount() + 1
if you want the counts to begin at 1. (As pointed out above, this counts all occurrences of each value rather than consecutive runs, so it only matches the desired output when each value appears in a single consecutive block.)

Based on the second answer you linked, assuming s is your series.
df = pd.DataFrame(s)
df['block'] = (df['col'] != df['col'].shift(1)).astype(int).cumsum()
df['count'] = df.groupby('block')['col'].transform(lambda x: range(1, len(x) + 1))
In [88]: df
Out[88]:
col block count
0 B 1 1
1 B 1 2
2 A 2 1
3 A 2 2
4 A 2 3
5 B 3 1

I like the answer by @chrisb, but I wanted to share my own solution, since some people might find it more readable and easier to use with similar problems.
1) Create a function that uses static variables
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0        # static variable
rolling_count.previous = None  # static variable
2) Apply it to your Series after converting it to a DataFrame:
df = pd.DataFrame(s)
df['count'] = df['col'].apply(rolling_count) #new column in dataframe
output of df
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
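One caveat (my own note, not part of the original answer): the function keeps its state between calls, so before applying it to another Series you should reset the two attributes, e.g.:
rolling_count.count = 0
rolling_count.previous = None
df2['count'] = df2['col'].apply(rolling_count)   # df2 is a second, hypothetical DataFrame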

If you wish to do the same thing, but with the consecutive runs defined by two columns instead of one, you can use this:
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist()
        for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df
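For example, a call could look like the following (the input frame here is my own guess, chosen to match the output shown below):
df = pd.DataFrame({'col_a': [1, 1, 1, 2, 2, 2],
                   'col_b': ['B', 'B', 'A', 'A', 'A', 'B']})
df = count_consecutive_items_n_cols(df, ['col_a', 'col_b'], 'count')
print(df)
which gives: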
col_a col_b count
0 1 B 1
1 1 B 2
2 1 A 1
3 2 A 1
4 2 A 2
5 2 B 1

Related

Sort column names using wildcard using pandas

I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the below
a) sort the column names based on the quarter number (ex: Q1, Q2, Q3, Q4, Q5 .. Q100 .. Q1000) within each column pattern
b) by column pattern, I mean the keyword before the underscore, which is rev and tx.
So, I tried the below, but it doesn't work, and it also shifts the ID column to the back:
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to be as below. In my real data, there are more than 100 columns with more than 30 patterns like rev, tx, etc. I want my ID column to be in the first position, as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
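To illustrate why plain lexicographic sorting breaks down past Q9 (a quick sketch with made-up column names):
from natsort import natsorted

cols = ['rev_Q1', 'rev_Q10', 'rev_Q2']
print(sorted(cols))      # ['rev_Q1', 'rev_Q10', 'rev_Q2'] - lexicographic puts Q10 before Q2
print(natsorted(cols))   # ['rev_Q1', 'rev_Q2', 'rev_Q10'] - natural order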
Or using manual sorting with np.lexsort:
idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]
Something like:
new_order = list(df.columns)
new_order.remove("ID")
new_order = ['ID'] + sorted(new_order)
df = df[new_order]
We manually put "ID" in front and then sort what remains.
The idea is to create a dataframe from the column names, with two columns: one for the variable (V) and another for the quarter number (Q). Finally, sort this dataframe by those values, then extract the index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
         .fillna(0).astype({'Q': int})
         .sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']))
V Q
0 0 0
1 rev 1
5 rev 2
4 rev 3
3 rev 4
2 rev 5
9 tx 1
8 tx 2
6 tx 3
10 tx 4
7 tx 5

Sort Pandas DataFrame based on previous row in another column

I have the following DataFrame in my Python project:
df1 = pd.DataFrame({"Col A":[1,2,3],"Col B":[3,2,2]})
I wish to order it in this kind of way:
df2 = pd.DataFrame({"Col A":[1,3,2],"Col B":[3,2,2]})
My goal is that each value in Col A matches the previous row's value in Col B.
Do you have any idea how to make this work properly and with as little effort as possible?
I tried to work with .sort_values(by=) but that's also where my current knowledge stops.
If you need to roll one value per Col B group, use a lambda function:
df1 = pd.DataFrame({"Col A":[1,2,3,7,4,8],"Col B":[3,2,2,1,1,1]})
print (df1)
Col A Col B
0 1 3
1 2 2
2 3 2
3 7 1
4 4 1
5 8 1
df1['Col A'] = df1.groupby('Col B')['Col A'].transform(lambda x: np.roll(x, -1))
print (df1)
Col A Col B
0 1 3
1 3 2
2 2 2
3 4 1
4 8 1
5 7 1
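What np.roll(x, -1) does within each group (my own illustration, using the Col B == 1 group above): every value shifts up one position and the first value wraps around to the end.
import numpy as np

print(np.roll([7, 4, 8], -1))   # [4 8 7]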
Yes, you can achieve the desired output by using sort_values() and creating a mapping dictionary, like so:
import pandas as pd
df1 = pd.DataFrame({"Col A":[1,2,3],"Col B":[3,2,2]})
# mapping_dict for ordering
mapping_dict = {1:3, 3:2, 2:2}
df1["sort_order"] = df1["Col A"].map(mapping_dict)
df2 = df1.sort_values(by="sort_order").drop(columns=["sort_order"])
print(df2)
Output:
Col A Col B
0 1 3
2 3 2
1 2 2

Count rows that have same value in all columns

If I have a dataframe like this:
A B C D E F
------------------
1 2 3 4 5 6
1 1 1 1 1 1
0 0 0 0 0 0
1 1 1 1 1 1
How can I get the number of rows that have value 1 in every column?
In this case, 2 rows have 1 in every field.
I know one way, for example if we only take columns A and B:
count = df2.query("A == 1 & B == 1").shape[0]
But I have to put in the name of every column; is there a fancier approach?
Thanks in advance
Try:
(df == 1).all(axis=1).sum()
Output:
2
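Breaking that down on the sample data (my own sketch): df == 1 gives a boolean frame, .all(axis=1) collapses each row to a single flag, and .sum() counts the True flags.
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [1, 1, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0, 0],
                   [1, 1, 1, 1, 1, 1]],
                  columns=list('ABCDEF'))
mask = (df == 1).all(axis=1)
print(mask.tolist())   # [False, True, False, True]
print(mask.sum())      # 2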
For a large data frame with many rows, you may want to try any instead, since it can yield a result as soon as it detects the first non-matching item:
sum(~df.ne(1).any(axis=1))
2

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value for 'B' in each group, and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve 'apply'? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default, then Series.reset_index and DataFrame.drop_duplicates to keep the top value per group:
df = (df_test.groupby('A')['B']
      .value_counts()
      .rename_axis(['A','most_freq'])
      .reset_index(name='freq')
      .drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
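If you also want a clean 0..n-1 index, as in the desired output, chain a final reset (my own addition):
df = df.reset_index(drop=True)
print(df)
#    A  most_freq  freq
# 0  0          3     2
# 1  1          0     1
# 2  2          6     1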

Use pandas to group by column and then create a new column based on a condition

I need to reproduce with pandas what SQL does so easily:
select
del_month
, sum(case when off0_on1 = 1 then 1 else 0 end) as on1
, sum(case when off0_on1 = 0 then 1 else 0 end) as off0
from a1
group by del_month
order by del_month
Here is a sample, illustrative pandas dataframe to work on:
a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 'off0_on1':[0,0,1,1,0,1,1,1]})
Here are my attempts to reproduce the above SQL with pandas. The first line works. The second line gives an error:
a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(sum)
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(sum(lambda x: 1 if x == 0 else 0))
Here's the second line's error:
TypeError: 'function' object is not iterable
This previous question of mine had a problem with the lambda function, which was solved. The bigger problem is how to reproduce SQL's "sum(case when)" logic on grouped data. I'm looking for a general solution, since I need to do this sort of thing often. The answers in my previous question suggested using map() inside the lambda function, but the following results for the "off0" column are not what I need. The "on1" column is what I want. The answer should be the same for the whole group (i.e. "del_month").
Simply sum the Trues in your conditional logic expressions:
import pandas as pd
a1 = pd.DataFrame({'del_month': [1,1,1,1,2,2,2,2],
                   'off0_on1': [0,0,1,1,0,1,1,1]})
a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==1))
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==0))
print(a1)
# del_month off0_on1 on1 off0
# 0 1 0 2 2
# 1 1 0 2 2
# 2 1 1 2 2
# 3 1 1 2 2
# 4 2 0 3 1
# 5 2 1 3 1
# 6 2 1 3 1
# 7 2 1 3 1
Similarly, you can do the same in SQL if your dialect supports it, which most do:
select
del_month
, sum(off0_on1 = 1) as on1
, sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month
And to replicate the above SQL in pandas, don't use transform; instead, send multiple aggregates in a groupby().apply() call:
def aggfunc(x):
    data = {'on1': sum(x['off0_on1'] == 1),
            'off0': sum(x['off0_on1'] == 0)}
    return pd.Series(data)
g = a1.groupby('del_month').apply(aggfunc)
print(g)
# on1 off0
# del_month
# 1 2 2
# 2 3 1
Using get_dummies would only need a single groupby call, which is simpler.
v = pd.get_dummies(df.pop('off0_on1')).groupby(df.del_month).transform(sum)
df = pd.concat([df, v.rename({0: 'off0', 1: 'on1'}, axis=1)], axis=1)
df
del_month off0 on1
0 1 2 2
1 1 2 2
2 1 2 2
3 1 2 2
4 2 1 3
5 2 1 3
6 2 1 3
7 2 1 3
Additionally, for the case of aggregation, call sum directly instead of using apply:
(pd.get_dummies(df.pop('off0_on1'))
.groupby(df.del_month)
.sum()
.rename({0: 'off0', 1: 'on1'}, axis=1))
off0 on1
del_month
1 2 2
2 1 3
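As a side note (my own addition, not from the original answers): with pandas >= 0.25, named aggregation can express the same sum(case when ...) pattern in a single agg call:
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1':  [0, 0, 1, 1, 0, 1, 1, 1]})

g = a1.groupby('del_month')['off0_on1'].agg(
    on1=lambda s: (s == 1).sum(),    # rows where off0_on1 == 1
    off0=lambda s: (s == 0).sum(),   # rows where off0_on1 == 0
)
print(g)
#            on1  off0
# del_month
# 1            2     2
# 2            3     1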
