Pandas: sum column until condition met in other column

I need to sum the value column until I hit a break.
df = pd.DataFrame({'value': [1,2,3,4,5,6,7,8], 'break': [0,0,1,0,0,1,0,0]})
value break
0 1 0
1 2 0
2 3 1
3 4 0
4 5 0
5 6 1
6 7 0
7 8 0
Expected output
value break
0 6 1
1 15 1
I was thinking of a groupby, but I can't seem to get anywhere with it. I don't even need the break column at the end.

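You're on the right track. Try a groupby on the reversed cumulative sum of break. Reversing before the cumsum gives each break row the same group label as the rows leading up to it: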
(df.groupby(df['break'][::-1].cumsum()[::-1],
            as_index=False, sort=False)
   .sum()
   .query('`break` != 0')  # remove this for full data
)
Output:
value break
0 6 1
1 15 1
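To see why the reversed cumsum works, it can help to look at the group key on its own. The following is a minimal sketch using the question's data; the names key and sums are only illustrative, and it selects the value column directly instead of using as_index=False and query.

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 5, 6, 7, 8],
                   'break': [0, 0, 1, 0, 0, 1, 0, 0]})

# Reverse, take the running total of break, then reverse back.
# Every row gets the number of breaks at or after it, so a break row
# shares its label with the rows that lead up to it.
key = df['break'][::-1].cumsum()[::-1]
print(key.tolist())   # [2, 2, 2, 1, 1, 1, 0, 0]

# Sum value per label, keeping the groups in order of first appearance.
sums = df.groupby(key, sort=False)['value'].sum()

# Label 0 holds the trailing rows that never reach a break; drop them.
sums = sums[sums.index != 0]
print(sums.tolist())  # [6, 15]
```

The rows after the last break (values 7 and 8) end up under label 0, which is why the answer filters that group out.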

Related

Sort column names using wildcard using pandas

I have a big dataframe with more than 100 columns. I am sharing a miniature version of my real dataframe below
ID rev_Q1 rev_Q5 rev_Q4 rev_Q3 rev_Q2 tx_Q3 tx_Q5 tx_Q2 tx_Q1 tx_Q4
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
I would like to do the below
a) sort the column names based on Quarters (ex:Q1,Q2,Q3,Q4,Q5..Q100..Q1000) for each column pattern
b) By column pattern, I mean the keyword that is before underscore which is rev and tx.
So, I tried the below but it doesn't work and it also shifts the ID column to the back
df = df.reindex(sorted(df.columns), axis=1)
I expect my output to look like the table below. In the real data, there are more than 100 columns with more than 30 patterns like rev, tx, etc. I want my ID column to be in the first position, as shown below.
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1

For the provided example, df.sort_index(axis=1) should work fine.
If you have Q values higher than 9, use natural sorting with natsort:
from natsort import natsort_key
out = df.sort_index(axis=1, key=natsort_key)
Or using manual sorting with np.lexsort:
idx = df.columns.str.split('_Q', expand=True, n=1)
order = np.lexsort([idx.get_level_values(1).astype(float), idx.get_level_values(0)])
out = df.iloc[:, order]

Something like:
new_order = list(df.columns)
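new_order.remove("ID")  # list.remove works in place and returns None
new_order = ['ID'] + sorted(new_order)
df = df[new_order]
We manually put "ID" in front and then sort what remains. Note that this is a plain string sort, so Q10 would come before Q2.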
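If the quarter numbers go past 9, the same "ID first, then sort the rest" idea can be combined with a sort key that compares the quarter as a number. This is only a sketch: column_sort_key is a hypothetical helper, and the column names below are made up to include two-digit quarters.

```python
import re
import pandas as pd

def column_sort_key(name):
    """Sort key: non-matching names (such as ID) first, then by prefix,
    then by the quarter number compared as an integer."""
    match = re.fullmatch(r'(.+)_Q(\d+)', name)
    if match is None:
        return (0, name, 0)
    return (1, match.group(1), int(match.group(2)))

# Hypothetical miniature frame with quarters above 9.
df = pd.DataFrame([[1] * 7],
                  columns=['ID', 'rev_Q10', 'rev_Q2', 'tx_Q3',
                           'rev_Q1', 'tx_Q1', 'tx_Q12'])

ordered = sorted(df.columns, key=column_sort_key)
df = df[ordered]
print(list(df.columns))
# ['ID', 'rev_Q1', 'rev_Q2', 'rev_Q10', 'tx_Q1', 'tx_Q3', 'tx_Q12']
```

Because the key is a tuple, Python compares the "is it a pattern column" flag first, then the prefix, then the numeric quarter, so rev_Q2 lands before rev_Q10.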

The idea is to create a dataframe from the column names. Create two columns: one for Variable and another one for Quarter number. Finally sort this dataframe by values then extract index.
idx = (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']).index)
df = df.iloc[:, idx]
Output:
>>> df
ID rev_Q1 rev_Q2 rev_Q3 rev_Q4 rev_Q5 tx_Q1 tx_Q2 tx_Q3 tx_Q4 tx_Q5
0 1 1 1 1 1 1 1 1 1 1 1
1 2 1 1 1 1 1 1 1 1 1 1
>>> (df.columns.str.extract(r'(?P<V>[^_]+)_Q(?P<Q>\d+)')
.fillna(0).astype({'Q': int})
.sort_values(by=['V', 'Q']))
V Q
0 0 0
1 rev 1
5 rev 2
4 rev 3
3 rev 4
2 rev 5
9 tx 1
8 tx 2
6 tx 3
10 tx 4
7 tx 5

Count number of consecutive rows that are greater than current row value but less than the value from other column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For the column A I need to know how many next and previous rows are greater than current row value but less than value in column B.
So my expected output is :
A  B  next count  previous count
0  9           2               0
3  8           0               0
2  3           0               1
9  5           0               0
1  5           0               0
0  5           2               1
4  5           1               0
7  8           0               0
3  0           0               2
2  4           0               0
Explanation:
First row: next count is 2, since the next values 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3. Counting stops at 9, which is not less than its B value 5.
Second row: next count is 0, since the next value 2 is not greater than 3.
Third row: next count is 0, since 9 is greater than 2 but not less than its corresponding B value 5.
The previous count is calculated the same way, walking backwards from the current row.
Note: I know how to solve this problem by looping with a list comprehension or by using the pandas apply method. I wouldn't mind a clear and concise apply approach, but I was looking for a more pandas-idiomatic (vectorized) approach.
My Solution
Here is the apply solution, which I think is inefficient. People have said there might be no vectorized solution for this question, so, as mentioned, a more efficient apply solution will also be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:,['A', 'B']]
    prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
    if (next_nrow.size == 0):
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if (prev_nrow.size == 0):
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
   A  B  next count  previous count
0  0  9           2               0
1  3  8           0               0
2  2  3           0               1
3  9  5           0               0
4  1  5           0               0
5  0  5           2               1
6  4  5           1               0
7  7  8           0               0
8  3  0           0               2
9  2  4           0               0

I made some optimizations:
You don't need reset_index(); you can access the index with .name.
Passing only df[['A']] instead of the whole frame may help.
prev_nrow.empty is the same as (prev_nrow.size == 0).
I applied different logic to get the desired value via first_false, which speeds things up significantly.
def first_false(val1, val2, A):
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i
def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:,['A', 'B']]
    prev_nrow = df2.loc[row.name-1:,['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
   A  B  next count  previous count
0  0  9           2               0
1  3  8           0               0
2  2  3           0               1
3  9  5           0               0
4  1  5           0               0
5  0  5           2               1
6  4  5           1               0
7  7  8           0               0
8  3  0           0               2
9  2  4           0               0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
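The same counting rule can also be written without slicing the DataFrame for every row, by converting both columns to plain lists once and walking outward from each position. This is a rough sketch rather than a benchmarked replacement: run_length is a hypothetical helper, and no timing is claimed for it.

```python
import pandas as pd

def run_length(a, b, start, step, threshold):
    """Count consecutive positions, starting at `start` and moving by
    `step` (+1 forward, -1 backward), where threshold < a[i] < b[i]."""
    count = 0
    i = start
    while 0 <= i < len(a) and threshold < a[i] < b[i]:
        count += 1
        i += step
    return count

df = pd.DataFrame({'A': [0, 3, 2, 9, 1, 0, 4, 7, 3, 2],
                   'B': [9, 8, 3, 5, 5, 5, 5, 8, 0, 4]})

# Pull the columns out once so the loop works on plain Python lists.
a = df['A'].tolist()
b = df['B'].tolist()

df['next count'] = [run_length(a, b, i + 1, 1, a[i]) for i in range(len(a))]
df['previous count'] = [run_length(a, b, i - 1, -1, a[i]) for i in range(len(a))]
print(df)
```

On the ten-row sample this reproduces the expected next count and previous count columns from the question.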

Counting consecutive elements in a dataframe and storing them in a new column

So I have this code:
import pandas as pd
id_1=[0,0,0,0,0,0,2,0,4,5,6,7,1,0,5,3]
exp_1=[1,2,3,4,5,6,1,7,1,1,1,1,1,8,2,1]
df = pd.DataFrame(list(zip(id_1,exp_1)), columns =['Patch', 'Exploit'])
df = (
df.groupby((df.Patch != df.Patch.shift(1)).cumsum())
.agg({"Patch": ("first", "count")})
.reset_index(drop=True)
)
print(df)
the output is:
   Patch
   first count
0      0     6
1      2     1
2      0     1
3      4     1
4      5     1
5      6     1
6      7     1
7      1     1
8      0     1
9      5     1
10     3     1
I wanted to create a dataframe with a new column called count that stores the number of consecutive appearances of each patch (id_1).
However, the above code produces two-level column labels under Patch (first and count), and I don't know how to work with only the values stored in the count column.
For example, suppose I want to remove all the 0s from id_1 and then count the consecutive appearances.
Or what if I have to find the average of the count column only?

If you want to remove all 0 from column Patch, then you can filter the dataframe just before .groupby. For example:
df = (
df[df.Patch != 0]
.groupby((df.Patch != df.Patch.shift(1)).cumsum())
.agg({"Patch": ("first", "count")})
.reset_index(drop=True)
)
print(df)
Prints:
  Patch
  first count
0     2     1
1     4     1
2     5     1
3     6     1
4     7     1
5     1     1
6     5     1
7     3     1
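For the other part of the question (working with the count column alone, for example to average it), one option is named aggregation, which yields flat column names instead of the two-level Patch header. This is a sketch under that assumption; the names run_id, runs, patch and mean_count are illustrative only.

```python
import pandas as pd

id_1 = [0, 0, 0, 0, 0, 0, 2, 0, 4, 5, 6, 7, 1, 0, 5, 3]
df = pd.DataFrame({'Patch': id_1})

# A new run starts whenever Patch differs from the previous row.
run_id = (df['Patch'] != df['Patch'].shift()).cumsum()

# Named aggregation gives ordinary single-level columns.
runs = (df.groupby(run_id)
          .agg(patch=('Patch', 'first'), count=('Patch', 'count'))
          .reset_index(drop=True))

# The count column can now be used like any other Series.
mean_count = runs['count'].mean()
print(runs)
print(mean_count)
```

With flat columns, runs['count'] selects the run lengths directly, so filtering or averaging them needs no special handling.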

Count the number of events in Python

In continuation to my previous Question I need some more help.
The dataframe is like
time eve_id sub_id flag
0 5 2 0
1 5 2 0
2 5 2 1
3 5 2 1
4 5 2 0
5 4 25 0
6 4 30 0
7 5 2 1
I need to count the eve_id rows in each stretch of time where flag keeps the same value: first while flag is 0 (until it goes from 0 to 1),
then while flag stays 1, and so on.
the output will look like this
time flag count
0 0 2
2 1 2
4 0 3
Can someone help me here ?

First we make a grouper indicator which checks whether the difference in flag between two consecutive rows is not equal to 0, which means the flag changed. Taking the cumulative sum of that check gives every run of equal flags its own label.
Then we groupby on this indicator and use agg. Since pandas 0.25.0 we have named aggregations:
s = df['flag'].diff().ne(0).cumsum()
grpd = df.groupby(s).agg(time=('time', 'first'),
flag=('flag', 'first'),
count=('flag', 'size')).reset_index(drop=True)
Output
time flag count
0 0 0 2
1 2 1 2
2 4 0 3
3 7 1 1
If time is your index, use:
grpd = df.assign(time=df.index).groupby(s).agg(time=('time', 'first'),
flag=('flag', 'first'),
count=('flag', 'size')).reset_index(drop=True)
Note: the extra row appears because the flag in the last row also differs from the flag in the row before it, so it starts a run of its own.
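The grouper itself is easy to inspect, which makes the run boundaries visible. A small sketch using the question's data follows; the frame is rebuilt here by hand, and sizes is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'time':   [0, 1, 2, 3, 4, 5, 6, 7],
                   'eve_id': [5, 5, 5, 5, 5, 4, 4, 5],
                   'sub_id': [2, 2, 2, 2, 2, 25, 30, 2],
                   'flag':   [0, 0, 1, 1, 0, 0, 0, 1]})

# diff() is NaN on the first row and non-zero wherever flag changes;
# ne(0) turns that into True at the start of every run, and cumsum()
# numbers the runs 1, 2, 3, ...
s = df['flag'].diff().ne(0).cumsum()
print(s.tolist())      # [1, 1, 2, 2, 3, 3, 3, 4]

# Rows per run, matching the count column in the output above.
sizes = df.groupby(s).size()
print(sizes.tolist())  # [2, 2, 3, 1]
```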

Change aggregate function sum to GroupBy.size:
df1 = (df.groupby([df['flag'].ne(df['flag'].shift()).cumsum(), 'flag'])
.size()
.reset_index(level=0, drop=True)
.reset_index(name='count'))
print (df1)
flag count
0 0 2
1 1 2
2 0 3
3 1 1

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
injury light
1 1 b
2 5 b
3 5 c
4 5 a
5 2 a
6 2 a
9 4 a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
index
injury 1 2 4 5
light
a 0 2 1 1
b 1 0 0 1
c 0 0 0 1
d 0 0 0 0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
    idx2 = (testdf['light'].isin([v]))
    df2 = testdf[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?

Try this:
df = pd.crosstab(df.light, df.injury,margins=True)
df
injury 1 2 4 5 All
light
a 0 2 1 1 4
b 1 0 0 1 2
c 0 0 0 1 1
All 1 2 1 3 7
df["All"]
light
a 4
b 2
c 1
All 7
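To also get a zero row for values that only appear in the external list, such as 'd', one possible follow-up is to reindex the crosstab against that list. This is a sketch rather than part of the answer above; it uses the question's testdf, leaves out margins, and the names wanted and table are illustrative.

```python
import pandas as pd

testdf = pd.DataFrame({'light': ['b', 'b', 'c', 'a', 'a', 'a', 'a'],
                       'injury': [1, 5, 5, 5, 2, 2, 4]},
                      index=[1, 2, 3, 4, 5, 6, 9])

# External list of categories, including one that is absent from the data.
wanted = ['a', 'b', 'c', 'd']

# Cross-tabulate, then force the row index to follow the external list;
# rows missing from the data are filled with 0.
table = (pd.crosstab(testdf['light'], testdf['injury'])
           .reindex(wanted, fill_value=0))
print(table)
```

reindex keeps the existing rows as they are and adds any missing labels from wanted, so 'd' shows up with a count of 0 in every injury column.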
