Iterate over a groupby dataframe to operate in each row - python

I have a DataFrame like this:
subject trial attended
0 1 1 1
1 1 3 0
2 1 4 1
3 1 7 0
4 1 8 1
5 2 1 1
6 2 2 1
7 2 6 1
8 2 8 0
9 2 9 1
10 2 11 1
11 2 12 1
12 2 13 1
13 2 14 1
14 2 15 1
I would like to GroupBy subject.
Then iterate in each row of the GroupBy dataframe.
If for a row 'attended' == 1, then to increase a variable sum_reactive by 1.
If the sum_reactive variable reaches == 4, then to add in a dictionary the 'subject' and 'trial' in which the variable sum_reactive reached a value of 4.
I as trying to define a function for this, but it doesn't work:
def count_attended():
sum_reactive = 0
dict_attended = {}
for i, g in reactive.groupby(['subject']):
for row in g:
if g['attended'][row] == 1:
sum_reactive += 1
if sum_reactive == 4:
dict_attended.update({g['subject'] : g['trial'][row]})
return dict_attended
return dict_attended
I think that I don't have clear how to iterate inside each GroupBy dataframe. I'm quite new using pandas.

IIUC try,
df = df.query('attended == 1')
df.loc[df.groupby('subject')['attended'].cumsum() == 4, ['subject', 'trial']].to_dict(orient='record')
Output:
[{'subject': 2, 'trial': 9}]
Using groupby with cumsum will do the counting attended, then check to see when this value equals to 4 to create a boolean series. You can use this boolean series to do boolean indexing to filter your dataframe to certain rows. Lastly, with lock and column filtering select subject and trial.

Related

Count number of consecutive rows that are greater than current row value but less than the value from other column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For the column A I need to know how many next and previous rows are greater than current row value but less than value in column B.
So my expected output is :
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation :
First row is calculated as : since 3 and 2 are greater than 0 but less than corresponding B value 8 and 3
Second row is calculated as : since next value 2 is not greater than 3
Third row is calculated as : since 9 is greater than 2 but not greater than its corresponding B value
Similarly, previous count is calculated
Note : I know how to solve this problem by looping using list comprehension or using the pandas apply method but still I won't mind a clear and concise apply approach. I was looking for a more pandaic approach.
My Solution
Here is the apply solution, which I think is inefficient. Also, as people said that there might be no vector solution for the question. So as mentioned, a more efficient apply solution will be accepted for this question.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
next_nrow = df.loc[row['index']+1:,['A', 'B']]
prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
if (next_nrow.size == 0):
return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
if (prev_nrow.size == 0):
return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output :
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output :
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need to reset_index() you can access the index with .name
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0)
Applied different logic to get the desired value via first_false, this speeds things up significantly.
def first_false(val1, val2, A):
i = 0
for x, y in zip(val1, val2):
if A < x < y:
i += 1
else:
break
return i
def get_prev_next_count(row):
A = row['A']
next_nrow = df.loc[row.name+1:,['A', 'B']]
prev_nrow = df2.loc[row.name-1:,['A', 'B']]
if next_nrow.empty:
return 0, first_false(prev_nrow.A, prev_nrow.B, A)
if prev_nrow.empty:
return first_false(next_nrow.A, next_nrow.B, A), 0
return (first_false(next_nrow.A, next_nrow.B, A),
first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec

Filtering rows that have unique value in a column using pandas

I have a df:
id value
1 10
2 15
1 10
1 10
2 13
3 10
3 20
I am trying to keep only rows that have 1 unique value in column value so that the result df looks like this:
id value
1 10
1 10
1 10
I dropped id = 2, 3 because it has more than 1 unique value in column value, 15, 13 & 10, 20 respectively.
I read this answer.
But this simply removes duplicates whereas I want to check if a given column - in this case column value has more than 1 unique value.
I tried:
df['uniques'] = pd.Series(df.groupby('id')['value'].nunique())
But this returns nan for every row since I am trying to fit n returns on n+m rows after grouping. I can write a function and apply it to every row but I was wondering if there is a smart quick filter that achieves my goal.
Use transform with groupby to align the group values to the individual rows:
df['nuniques'] = df.groupby('id')['value'].transform('nunique')
Output:
id value nuniques
0 1 10 1
1 2 15 2
2 1 10 1
3 1 10 1
4 2 13 2
5 3 10 2
6 3 20 2
If you only need to filter your data, you don't need to assign the new column:
df[df.groupby('id')['value'].transform('nunique') == 1]
Let us do filter
out = df.groupby('id').filter(lambda x : x['value'].nunique()==1)
Out[6]:
id value
0 1 10
2 1 10
3 1 10

counting consequtive elements in a dataframe and storing them in a new column

so i have this code:
import pandas as pd
id_1=[0,0,0,0,0,0,2,0,4,5,6,7,1,0,5,3]
exp_1=[1,2,3,4,5,6,1,7,1,1,1,1,1,8,2,1]
df = pd.DataFrame(list(zip(id_1,exp_1)), columns =['Patch', 'Exploit'])
df = (
df.groupby((df.Patch != df.Patch.shift(1)).cumsum())
.agg({"Patch": ("first", "count")})
.reset_index(drop=True)
)
print(df)
the output is:
Patch
first count
0 0 6
1 2 1
2 0 1
3 4 1
4 5 1
5 6 1
6 7 1
7 1 1
8 0 1
9 5 1
10 3 1
I wanted to create a data frame with a new column called count where I can store the consecutive appearance of the patch (id_1).
However, the above code creates a dictionary of the patch and I don't know how to individually manipulate only the values stored in the column called count.
suppose I want to remove all the 0 from id_1 and then count the consecutive appearance.
or I have to find the average of the count column only then?
If you want to remove all 0 from column Patch, then you can filter the dataframe just before .groupby. For example:
df = (
df[df.Patch != 0]
.groupby((df.Patch != df.Patch.shift(1)).cumsum())
.agg({"Patch": ("first", "count")})
.reset_index(drop=True)
)
print(df)
Prints:
Patch
first count
0 2 1
1 4 1
2 5 1
3 6 1
4 7 1
5 1 1
6 5 1
7 3 1

pandas - Access value in a previous row in a Dataframe

I am trying to add a new column to my dataframe that depends on values that may or may not exist in previous rows. My dataframe looks like this:
index id timestamp sequence_index value prev_seq_index
0 10 1 0 5 0
1 10 1 1 1 2
2 10 1 2 2 0
3 10 2 0 9 0
4 10 2 1 10 1
5 10 2 2 3 1
6 11 2 0 42 1
7 11 2 1 13 0
Note: there is no relation between index and sequence_index, index is just a counter.
What I want to do is add a column prev_value, that finds the value of the most recent row with the same id and sequence_index == prev_seq_index, if no such previous row exist, use default value, for the purpose of this question I will use default value of -1
index id timestamp sequence_index value prev_seq_index prev_value
0 10 1 0 5 0 -1
1 10 1 1 1 2 -1
2 10 1 2 2 0 -1
3 10 2 0 9 0 5 # value from df[index == 0]
4 10 2 1 10 1 1 # value from df[index == 1]
5 10 2 2 3 1 1 # value from df[index == 1]
6 11 2 0 42 1 -1
7 11 2 1 13 0 -1
My current solution is a brute force which is very slow, and I was wondering if there was a faster way:
prev_values = np.zeros(len(df))
i = 0
for index, row in df.iterrows():
# filter for previous rows with the same id and desired sequence index
tmp_df = df[(df.id == row.id) & (df.timestamp < row.timestamp) \
& (df.sequence_index == row.prev_seq_index)]
if (len(tmp_df) > 0):
# get value from the most recent row
prev_value = tmp_df[tmp_df.index == tmp_df.timestamp.idxmax()].value
else:
prev_value = -1
prev_values[i] = prev_value
i += 1
df['prev_value'] = prev_values
i would suggest tackling this via a left join. However first you'll need to make sure that your data doesn't have duplicates. You'll need to create a dataframe of most recent timestamps and grab the values.
agg=pd.groupby(['sequence_index']).agg({'timestamp':'max'})
agg=pd.merge(agg,df['timestamp','sequence_index','value'], how='inner', on = ['timestamp','sequence_index'])
agg.rename(columns={'value': 'prev_value'}, inplace=True)
now you can join the data back on itself
df=pd.merge(df,agg,how='left',left_on='prev_seq_index',right_on='sequence_index')
now you can deal with the NaN values
df.prev_value=df.prev_value.fillna(-1)

Pandas - Delete only contiguous rows that equal zero

I have a large time series df (2.5mil rows) that contain 0 values in a given row, some of which are legitimate. However if there are repeated continuous occurrences of zero values I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9] I would like to remove the [0,0,0] and [0,0,0,0] from the middle and leave the remaining 0 to make a new df [1,2,3,0,4,5,1,2,3,0,8,8,9].
The length of zero values before deletion being a parameter that has to be set - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove the row if it is 0 and either previous or next row in same column is 0. You can use shift to look for previous and next value and compare with current value as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for more than 2 consecutive
Following example in link, adding new column to track consecutive occurrence and later checking it to filter:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive>10) & (df.ColA==0))]
We need build a new para meter here, then using drop_duplicates
df['New']=df.A.eq(0).astype(int).diff().ne(0).cumsum()
s=pd.concat([df.loc[df.A.ne(0),:],df.loc[df.A.eq(0),:].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation :
#df.A.eq(0) to find the value equal to 0
#diff().ne(0).cumsum() if they are not equal to 0 then we would count them in same group .

Categories