Pandas Conditional Rolling Count - python

I have a question that extends from Pandas: conditional rolling count. I would like to create a new column in a dataframe that reflects the cumulative count of rows that meets several criteria.
Using the following example and code from stackoverflow 25119524
import pandas as pd
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
def rolling_count(val):
if val == rolling_count.previous:
rolling_count.count +=1
else:
rolling_count.previous = val
rolling_count.count = 1
return rolling_count.count
rolling_count.count = 0 #static variable
rolling_count.previous = None #static variable
cowmast['xmast'] = cowmast['Cow'].apply(rolling_count) #new column in dataframe
cowmast
The output is xmast (number of times mastitis) for each cow
Cow Lact DIM xmast
0 1 1 45 1
1 1 2 25 2
2 1 2 28 3
3 2 2 70 1
4 2 2 95 2
5 2 2 98 3
6 2 2 120 4
7 2 3 80 5
What I would like to do is restart the count for each cow (cow) lactation (Lact) and only increment the count when the number of days (DIM) between rows is more than 7.
To incorporate more than one condition to reset the count for each cows lactation (Lact) I used the following code.
def count_consecutive_items_n_cols(df, col_name_list, output_col):
cum_sum_list = [
(df[col_name] != df[col_name].shift(1)).cumsum().tolist() for col_name in col_name_list
]
df[output_col] = df.groupby(
["_".join(map(str, x)) for x in zip(*cum_sum_list)]
).cumcount() + 1
return df
count_consecutive_items_n_cols(cowmast, ['Cow', 'Lact'], ['Lxmast'])
That produces the following output
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
I would appreciate insight as to how to add another condition in the cumulative count that takes into consideration the time between mastitis events (difference in DIM between rows for cows within the same Lact). If the difference in DIM between rows for the same cow and lactation is less than 7 then the count should not increment.
The output I am looking for is called "Adjusted" in the table below.
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
In the example above for cow 1 lact 2 the count is not incremented when the dim goes from 25 to 28 as the difference between the two events is less than 7 days. Same for cow 2 lact 2 when is goes from 95 to 98. For the larger increments 70 to 95 and 98 to 120 the count is increased.
Thank you for your help
John

Actually, your codes to set up xmast and Lxmast can be much simplified if you had used the solution with the highest upvotes in the referenced question.
Renaming your dataframe cowmast to df, you can set up xmast as follows:
df['xmast'] = df.groupby((df['Cow'] != df['Cow'].shift(1)).cumsum()).cumcount()+1
Similarly, to set up Lxmast, you can use:
df['Lxmast'] = (df.groupby([(df['Cow'] != df['Cow'].shift(1)).cumsum(),
(df['Lact'] != df['Lact'].shift()).cumsum()])
.cumcount()+1
)
Data Input
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
df = cowmast
Output
print(df)
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
Now, continue with the last part of your requirement highlighted in bold below:
What I would like to do is restart the count for each cow (cow)
lactation (Lact) and only increment the count when the number of days
(DIM) between rows is more than 7.
we can do it as follows:
To make the codes more readable, let's define 2 grouping sequences for the codes we have so far:
m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()
Then, we can rewrite the codes to set up Lxmast in a more readable format, as follows:
df['Lxmast'] = df.groupby([m_Cow, m_Lact]).cumcount()+1
Now, turn to the main works here. Let's say we create another new column Adjusted for it:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().abs().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Result:
print(df)
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
Here, after df.groupby([m_Cow, m_Lact]), we take the column DIM and check for each row's difference with previous row by .diff() and take the absolute value by .abs(), then check whether it is > 7 by .gt(7) in the code fragment ['DIM'].diff().abs().gt(7). We then group by the same grouping again .groupby([m_Cow, m_Lact]) since this 3rd condition is within the grouping of the first 2 conditions. The final step we use .cumsum() on the 3rd condition, so that only when the 3rd condition is true we increment the count.
Just in case you want to increment the count only when the DIM is inreased by > 7 only (e.g. 70 to 78) and exclude the case decreased by > 7 (not from 78 to 70), you can remove the .abs() part in the codes above:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Edit (Possible simplification depending on your data sequence)
As your sample data have the main grouping keys Cow and Lact somewhat already in sorted sequence, there's opportunity for further simplification of the codes.
Different from the sample data from the referenced question, where:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
Here, the last B in the last row is separated from other B's and it required the count be reset to 1 rather than continuing from the last count of 2 of the previous B (to become 3). Hence, the grouping needs to compare current row with previous row to get the correct grouping. Otherwise, when we use .groupby() and the values of B are grouped together during processing, the count value may not be correctly reset to 1 for the last entry.
If your data for the main grouping keys Cow and Lact are already naturally sorted during data construction, or have been sorted by instruction such as:
df = df.sort_values(['Cow', 'Lact'])
Then, we can simplify our codes, as follows:
(when data already sorted by [Cow, Lact]):
df['xmast'] = df.groupby('Cow').cumcount()+1
df['Lxmast'] = df.groupby(['Cow', 'Lact']).cumcount()+1
df['Adjusted'] = (df.groupby(['Cow', 'Lact'])
['DIM'].diff().abs().gt(7)
.groupby([df['Cow'], df['Lact']])
.cumsum()+1
)
Same result and output values in the 3 columns xmast, Lxmast and Adjusted

Related

How to create a counter based on another column?

I've created this data frame -
Range = np.arange(0,9,1)
A={
0:2,
1:2,
2:2,
3:2,
4:3,
5:3,
6:3,
7:2,
8:2
}
Table = pd.DataFrame({"Row": Range})
Table["Intervals"]=(Table["Row"]%9).map(A)
Table
Row Intervals
0 0 2
1 1 2
2 2 2
3 3 2
4 4 3
5 5 3
6 6 3
7 7 2
8 8 2
I'd like to create another column that will be based on the intervals columns and will act as sort of a counter - so the values will be 1,2,1,2,1,2,3,1,2.
The logic is that I want to count by the value of the intervals column.
I've tried to use group by but the issue is that the values are displayed multiple times.
Logic:
We have 2 different values - 2 and 3. Each value will occur in the intervals column as the value itself - so 2 for example will occur twice 2,2. And 3 will occur 3 times - 3,3,3.
For the first 4 rows, the value 2 is displayed twice - that is why the new column should be 1,2 (counter of the first 2) and then again 1,2 (counter of the second 2).
Afterward, there is 3, so the values are 1,2,3.
And then once again 2, so the values are 1,2.
Hope I managed to explain myself.
Thanks in advance!
You can use groupby.cumcount combined with mod:
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
Table['Counter'] = Table.groupby(group).cumcount().mod(Table['Intervals']).add(1)
Or:
group = Table['Intervals'].ne(Table['Intervals'].shift()).cumsum()
Table['Counter'] = (Table.groupby(group)['Intervals']
.transform(lambda s: np.arange(len(s))%s.iloc[0]+1)
)
Output:
Row Intervals Counter
0 0 2 1
1 1 2 2
2 2 2 1
3 3 2 2
4 4 3 1
5 5 3 2
6 6 3 3
7 7 2 1
8 8 2 2

Allocate lowest value over n rows to n rows in DataFrame

I need to take the lowest value over n rows and add it to these n rows in a new colomn of the dataframe. For example:
n=3
Column 1 Column 2
5 3
3 3
4 3
7 2
8 2
2 2
5 4
4 4
9 4
8 2
2 2
3 2
5 2
Please take note that if the number of rows is not dividable by n, the last values are incorporated in the last group. So in this example n=4 for the end of the dataframe.
Thanking you in advance!
I do not know any straight forward way to do this, but here is a working example (not elegant, but working...).
If you do not worry about the number of rows being dividable by n, you could use .groupby():
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
df['new_col']=df.groupby(df.index // n).transform('min')
which yields:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 4
7 6 4
8 4 4
9 1 1
10 2 1
However, we can see that the last 2 rows are grouped together, instead of them being grouped with the 3 previous values in this case.
A way around would be to look at the .count() of elements in each group generated by grouby, and check the last one:
import pandas as pd
d = {'col1': [1, 2,1,5,3,2,5,6,4,1,2] }
df = pd.DataFrame(data=d)
n=3
# Temporary dataframe
A = df.groupby(df.index // n).transform('min')
# The min value of each group in a second dataframe
min_df = df.groupby(df.index // n).min()
# The size of the last group
last_batch = df.groupby(df.index // n).count()[-1:]
# if the last size is not equal to n
if last_batch.values[0][0] !=n:
last_group = last_batch+n
A[-last_group.values[0][0]:]=min_df[-2:].min()
# Assign the temporary modified dataframe to df
df['new_col'] = A
which yields the expected result:
col1 new_col
0 1 1
1 2 1
2 1 1
3 5 2
4 3 2
5 2 2
6 5 1
7 6 1
8 4 1
9 1 1
10 2 1

Count number of consecutive rows that are greater than current row value but less than the value from other column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For the column A I need to know how many next and previous rows are greater than current row value but less than value in column B.
So my expected output is :
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation :
First row is calculated as : since 3 and 2 are greater than 0 but less than corresponding B value 8 and 3
Second row is calculated as : since next value 2 is not greater than 3
Third row is calculated as : since 9 is greater than 2 but not greater than its corresponding B value
Similarly, previous count is calculated
Note : I know how to solve this problem by looping using list comprehension or using the pandas apply method but still I won't mind a clear and concise apply approach. I was looking for a more pandaic approach.
My Solution
Here is the apply solution, which I think is inefficient. Also, as people said that there might be no vector solution for the question. So as mentioned, a more efficient apply solution will be accepted for this question.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
next_nrow = df.loc[row['index']+1:,['A', 'B']]
prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
if (next_nrow.size == 0):
return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
if (prev_nrow.size == 0):
return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need to reset_index() you can access the index with .name
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0)
Applied different logic to get the desired value via first_false, this speeds things up significantly.
def first_false(val1, val2, A):
i = 0
for x, y in zip(val1, val2):
if A < x < y:
i += 1
else:
break
return i
def get_prev_next_count(row):
A = row['A']
next_nrow = df.loc[row.name+1:,['A', 'B']]
prev_nrow = df2.loc[row.name-1:,['A', 'B']]
if next_nrow.empty:
return 0, first_false(prev_nrow.A, prev_nrow.B, A)
if prev_nrow.empty:
return first_false(next_nrow.A, next_nrow.B, A), 0
return (first_false(next_nrow.A, next_nrow.B, A),
first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec

Pandas - Delete only contiguous rows that equal zero

I have a large time series df (2.5mil rows) that contain 0 values in a given row, some of which are legitimate. However if there are repeated continuous occurrences of zero values I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9] I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0 to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of zero values before deletion being a parameter that has to be set - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove the row if it is 0 and either previous or next row in same column is 0. You can use shift to look for previous and next value and compare with current value as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive
Following example in link, adding new column to track consecutive occurrence and later checking it to filter:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive>10) & (df.ColA==0))]
We need build a new para meter here, then using drop_duplicates
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation :
#df.A.eq(0) to find the value equal to 0
#diff().ne(0).cumsum() if they are not equal to 0 then we would count them in same group .

Python / Pandas: Eliminate for Loop using 2 DataFrames

First time asking a question here so hopefully I will make my issue clear. I am trying to understand how to better apply a list of scenarios (via for loop) to the same dataset and summarize results. *Note that once a scenario is applied, and I pull the relevant statistical data from dataframe and put into the summary table, I do not need to retain the information. Iterrows is painfully slow as I have tens of thousands of scenarios I want to run. Thank you for taking the time to review.
I have two Pandas dataframes: df_analysts and df_results:
1) df_analysts contains a specific list of factors (e.g. TB,JK,SF,PWR) scenarios of weights (e.g. 50,50,50,50)
TB JK SF PWR
0 50 50 50 50
1 50 50 50 100
2 50 50 50 150
3 50 50 50 200
4 50 50 50 250
2) df_results holds results by date and group and entrant an then ranking by each factor, finally it has the final finish result.
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W2 W4 SUM(W)
0 11182017 1 1 2 1 2 1 2
1 11182017 1 2 3 2 3 2 1
2 11182017 1 3 1 3 1 3 3
3 11182017 2 1 1 2 2 1 1
4 11182017 2 2 2 1 1 2 1
3) I am using iterrows to
loop through each scenario in the df_analysts dataframe
apply weight scenario to each factor rank (if rank = 1, then 1.0*weight, rank = 2, then 0.68*weight, rank = 3, then 0.32*weight). Those results go into the W1-W4 columns.
Sum the W1-W4 columns.
Rank the SUM(W) column.
Result sample below for a single scenario (e.g. 50,50,50,50)
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W2 W4 SUM(W) Rank
0 11182017 1 1 2 1 2 1 1 34 50 34 50 168 1
1 11182017 1 2 3 2 3 2 3 16 34 16 34 100 3
2 11182017 1 3 1 3 1 3 2 50 16 50 16 132 2
3 11182017 2 1 2 2 2 1 1 34 34 34 50 152 2
4 11182017 2 2 1 1 1 2 1 50 50 50 34 184 1
4) Finally, for each scenario, I am creating a new dataframe for the summary results (df_summary) which logs the factor / weight scenario used (from df_analysts) and compares the RANK result to the Finish by date and group and keeps a tally where they land. Sample below (only the 50,50,50,50 scenario is shown above which results in a 1,1).
Factors Weights Top Top2
0 (TB,JK,SF,PWR) (50,50,50,50) 1 1
1 (TB,JK,SF,PWR) (50,50,50,100) 1 0
2 (TB,JK,SF,PWR) (50,50,50,150) 1 1
3 (TB,JK,SF,PWR) (50,50,50,200) 1 0
4 (TB,JK,SF,PWR) (50,50,50,250) 1 1
You could merge your analyst and results dataframe and then perform the calculations.
def factor_rank(x,y):
if (x==1): return y
elif (x==2): return y*0.68
elif (x==3): return y*0.32
df_analysts.index.name='SCENARIO'
df_analysts.reset_index(inplace=True)
df_analysts['key'] = 1
df_results['key'] = 1
df = pd.merge(df_analysts, df_results, on='key')
df.drop(['key'],axis=1,inplace=True)
df['W1'] = df.apply(lambda r: factor_rank(r['TB-R'], r['TB']), axis=1)
df['W2'] = df.apply(lambda r: factor_rank(r['JK-R'], r['JK']), axis=1)
df['W3'] = df.apply(lambda r: factor_rank(r['SF-R'], r['SF']), axis=1)
df['W4'] = df.apply(lambda r: factor_rank(r['PWR-R'], r['PWR']), axis=1)
df['SUM(W)'] = df.W1 + df.W1 + df.W3 + df.W4
df["rank"] = df.groupby(['GR','SCENARIO'])['SUM(W)'].rank(ascending=False)
You may also want to check out this question which deals with improving processing times on row based calculations:
How to apply a function to mulitple columns of a pandas DataFrame in parallel

Categories