Search N consecutive rows with same value in one dataframe - python

I need to write Python code that searches a DataFrame column for N (a variable) consecutive rows with the same value, where that value is not NaN, like this.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea how I can do it?
Fruit   2 matches  5 matches
Apple   No         No
NaN     No         No
Pear    No         No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        Yes
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
Banana  No         No
Banana  Yes        No
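For reference, a minimal sketch constructing the Fruit column above (the two match columns are what needs to be computed):
import pandas as pd
import numpy as np

# Sample data from the table above; NaN marks missing fruit
df = pd.DataFrame({'Fruit': ['Apple', np.nan, 'Pear', 'Pear', 'Pear', 'Pear',
                             'Pear', np.nan, np.nan, np.nan, np.nan, np.nan,
                             'Banana', 'Banana']})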
Update: testing the solution by @Corralien
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode returns the message 'Frame skipped from debugging during step-in.' when executing the last line, and an exception is raised in the previous for loop.

Compute the length of each run of consecutive values and set NaN rows to 0. Once you have the cumulative counter, you just have to check whether the counter is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
Fruit 2 matches 5 matches
0 Apple No No
1 NaN No No
2 Pear No No
3 Pear Yes No
4 Pear Yes No
5 Pear Yes No
6 Pear Yes Yes
7 NaN No No
8 NaN No No
9 NaN No No
10 NaN No No
11 NaN No No
12 Banana No No
13 Banana Yes No
>>> counts
0 1
1 0
2 1
3 2
4 3
5 4
6 5
7 0
8 0
9 0
10 0
11 0
12 1
13 2
dtype: int64
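For intuition, the "virtual groups" expression can be inspected on its own (a small sketch). Note that each NaN starts its own run, because NaN != NaN in pandas; those rows are zeroed out by the .where(...) step anyway:
# Run ids increment whenever the current value differs from the previous one
groups = df['Fruit'].ne(df['Fruit'].shift()).cumsum()
print(groups.tolist())
# [1, 2, 3, 3, 3, 3, 3, 4, 5, 6, 7, 8, 9, 9]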
Update
If you need to replace "Yes" with the fruit name, for example:
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
Fruit 2 matches
0 Apple No
1 NaN No
2 Pear No
3 Pear Pear
4 Pear Pear
5 Pear Pear
6 Pear Pear
7 NaN No
8 NaN No
9 NaN No
10 NaN No
11 NaN No
12 Banana No
13 Banana Banana

Related

Flagging NaN values based on a condition and year

I am trying to flag values as NaN based on a condition and a particular year; below is my code:
import pandas as pd
import numpy as np
s={'Fruits':['Apple','Orange', 'Banana', 'Mango'],'month':['201401','201502','201603','201604'],'weight':[2,4,1,6],'Quant':[251,178,298,300]}
p=pd.DataFrame(data=s)
upper = 250
How would I be able to flag values as NaN for months 201603 and 201604 (03 and 04 are the months) when Quant exceeds upper (250)? Basically my intention is to check whether the Quant value is greater than the defined upper value, but only for specific dates, i.e. 201603 and 201604.
This is how the output should look:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN
You can use .loc:
p.loc[(p.Quant > upper) & (p.month.str[-2:].isin(['03','04'])), 'Quant'] = np.nan
Output:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN
You could build a boolean condition that checks if "Quant" is greater than "upper" and the month is "03" or "04", and mask "Quant" column:
p['Quant'] = p['Quant'].mask(p['Quant'].gt(upper) & p['month'].str[-2:].isin(['03','04']))
Output:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN
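If month is better treated as a date than sliced as a string, a variant sketch (assuming every value follows the YYYYMM pattern):
# Parse YYYYMM strings into datetimes and test the month component directly
months = pd.to_datetime(p['month'], format='%Y%m').dt.month
p['Quant'] = p['Quant'].mask(p['Quant'].gt(upper) & months.isin([3, 4]))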
Use boolean filtering with index alignment (rows excluded by the mask come back as NaN):
p['Quant1'] = p[~(((p['month'] == '201603') | (p['month'] == '201604')) & (p['Quant'] > 250))]['Quant']

Pandas: Create column with rolling sum of previous n rows of another column within the same id/group

Sample dataset:
id fruit
0 7 NaN
1 7 apple
2 7 NaN
3 7 mango
4 7 apple
5 7 potato
6 3 berry
7 3 olive
8 3 olive
9 3 grape
10 3 NaN
11 3 mango
12 3 potato
In the fruit column, NaN and potato count as 0; every other string counts as 1. I want to generate a new column sum_last3 where each row holds the sum of the previous 3 rows (inclusive) of the fruit column. When a new id appears, the sum should restart from the beginning of that id.
Output I want:
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
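For reference, a sketch constructing the sample frame above:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [7]*6 + [3]*7,
    'fruit': [np.nan, 'apple', np.nan, 'mango', 'apple', 'potato',
              'berry', 'olive', 'olive', 'grape', np.nan, 'mango', 'potato'],
})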
My Code:
df['sum_last5'] = (df['fruit'].ne('potato') & df['fruit'].notna())
.groupby('id',sort=False, as_index=False)['fruit']
.rolling(min_periods=1, window=3).sum().astype(int).values
You can modify your code slightly, as follows:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.droplevel(0)
)
or use .values as in your code:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.values
)
Your code is close; you just need to change 'id' to df['id'] in the .groupby() call: since the object .groupby() is now called on is a boolean Series rather than df itself, .groupby() cannot resolve the column by the label 'id' alone and needs the fully qualified column.
Also remove as_index=False, since that parameter applies to DataFrames rather than to a (boolean) Series here.
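The .droplevel(0) is needed because groupby(...).rolling(...) returns a result indexed by a MultiIndex of (group key, original index); dropping the group level lets the values align back to df on assignment. A small sketch to see it:
flags = df['fruit'].ne('potato') & df['fruit'].notna()
rolled = flags.groupby(df['id'], sort=False).rolling(min_periods=1, window=3).sum()
print(rolled.index[:3].tolist())   # [(7, 0), (7, 1), (7, 2)] -> (id, original index)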
Result:
print(df)
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1

Redefining a pandas dataframe based on its group

I am using this dataframe:
source fruit 2019 2020 2021
0 a apple 3 1 1
1 a banana 4 3 5
2 a orange 2 2 2
3 b apple 3 4 5
4 b banana 4 5 2
5 b orange 1 6 4
I want to refine it like this:
source fruit 2019 2020 2021
0 a total 9 6 8
1 a seeds 5 3 3
2 a banana 4 3 5
3 b total 8 15 11
4 b seeds 4 10 9
5 b banana 4 5 2
total is the sum of all fruits in that year for each source.
seeds is the sum of the fruits that contain seeds, per year and per source.
I tried appending new empty rows (Insert a new row after every nth row & Insert row at any position), but I wasn't getting the expected result.
What would be the best way to get the desired output?
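For reference, a sketch constructing the input frame:
import pandas as pd

df = pd.DataFrame({
    'source': ['a']*3 + ['b']*3,
    'fruit': ['apple', 'banana', 'orange']*2,
    '2019': [3, 4, 2, 3, 4, 1],
    '2020': [1, 3, 2, 4, 5, 6],
    '2021': [1, 5, 2, 5, 2, 4],
})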
Try:
df1 = df.groupby('source', as_index=False).sum().assign(fruit = 'total')
seeds = ['orange','apple']
df2 = df.loc[df['fruit'].isin(seeds)].groupby('source', as_index=False).sum().assign(fruit = 'seeds')
final_df = pd.concat([df.loc[~df['fruit'].isin(seeds)], df1,df2])
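The concatenated frame is not yet in the desired row order; a sketch to reorder it (assuming total, then seeds, then the remaining fruits within each source):
# Rank rows so 'total' sorts first, 'seeds' second, other fruits last
final_df = (final_df
            .reset_index(drop=True)
            .assign(rank=lambda d: d['fruit'].map({'total': 0, 'seeds': 1}).fillna(2))
            .sort_values(['source', 'rank'])
            .drop(columns='rank')
            .reset_index(drop=True))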

update pandas groupby group with column value

I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple', 'Apple','Orange','Orange','Orange','Pears','Pears'],
'B': [1,2,9,6,4,3,2,1]
})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in column 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
but I'm not sure it is pertinent to my problem. This one looks promising:
Pandas Groupby and Sum Only One Column
I need to compute the maximum percentage change in column 'B' per group of column 'A' and insert it into the column maxpercchng on all rows of the group.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add to all rows in group the maxpercchng col?
I believe you need transform, which returns a Series of the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
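If the aggregated values from df1 need to be broadcast back onto every row, a mapping sketch gives the same result as the transform approach:
df['maxpercchng'] = df['A'].map(df1.set_index('A')['B'])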

FuzzyWuzzy using two pandas dataframes python

I want to find the fuzz.ratio of strings that are in two dataframes. Let's say I have 2 dataframes: df with columns A, B and bt_df with columns A1, B1. I want to compare the column df['B'] with bt_df['B1'] and return the best matching score and its corresponding id in df['A'].
df
Out[8]:
A B
0 11111111111111111111 Cheesesalad
1 22222222222222222222 Cheese
2 33333333333333333333 salad
3 44444444444444444444 BMWSalad
4 55555555555555555555 BMW
5 66666666666666666666 Apple
6 77777777777777777777 Apple####
7 88888888888888888888 Macrooni!
bt_df
Out[9]:
A1 B1
0 180336 NaN
1 154263 Cheese
2 130876 Salad
3 204430 Macrooni
4 153490 NaN
5 48879 NaN
6 185495 NaN
7 105099 NaN
8 8645 Apple
9 54038 NaN
10 156523 NaN
11 18156 BWM
Hence the result should be:
B1 matchedstring score id
Cheese Cheese 100 22222222222222222222
.....
.....
Thanks in advance.
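A minimal sketch of one possible approach, using fuzzywuzzy's process.extractOne (which returns the best (match, score) pair from a list of choices); the id lookup assumes the strings in df['B'] are unique:
import pandas as pd
from fuzzywuzzy import fuzz, process

# Map each candidate string in df['B'] to its id in df['A']
choices = df.set_index('B')['A'].to_dict()

def best_match(s):
    if pd.isna(s):                 # leave rows with missing B1 unmatched
        return pd.Series([None, None, None])
    match, score = process.extractOne(s, list(choices), scorer=fuzz.ratio)
    return pd.Series([match, score, choices[match]])

bt_df[['matchedstring', 'score', 'id']] = bt_df['B1'].apply(best_match)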
