Pandas Custom Cumulative Calculation Over Group By in DataFrame

I am trying to run a simple calculation over the values of each row within a group of a dataframe, but I'm having trouble with the syntax. I think I'm specifically confused about which data object I should return, i.e. a dataframe vs. a series, etc.
For context, I have a bunch of stock values for each product I am tracking and I want to estimate the number of sales via a custom function which essentially does the following:
# Because stock can go up and down, I'm looking to record the difference
# when the stock is less than the previous stock number from the previous row.
# How do I access each row of the dataframe and then return the series I need?
def get_stock_sold(x):
    # Written in pseudo
    stock_sold = previous_stock_no - current_stock_no if current_stock_no < previous_stock_no else 0
    return pd.Series(stock_sold)
I then have the following dataframe:
# 'Order' is a date in the real dataset.
data = {
'id' : ['1', '1', '1', '2', '2', '2'],
'order' : [1, 2, 3, 1, 2, 3],
'current_stock' : [100, 150, 90, 50, 48, 30]
}
df = pd.DataFrame(data)
df = df.sort_values(by=['id', 'order'])
df['previous_stock'] = df.groupby('id')['current_stock'].shift(1)
I'd like to create a new column (stock_sold) and apply the logic from above to each row within the grouped dataframe object:
df['stock_sold'] = df.groupby('id').apply(get_stock_sold)
Desired output would look as follows:
| id | order | current_stock | previous_stock | stock_sold |
|----|-------|---------------|----------------|------------|
| 1 | 1 | 100 | NaN | 0 |
| | 2 | 150 | 100.0 | 0 |
| | 3 | 90 | 150.0 | 60 |
| 2 | 1 | 50 | NaN | 0 |
| | 2 | 48 | 50.0 | 2 |
| | 3 | 30 | 48.0 | 18 |

Try:
import numpy as np
df["previous_stock"] = df.groupby("id")["current_stock"].shift()
df["stock_sold"] = np.where(
df["current_stock"] > df["previous_stock"].fillna(0),
0,
df["previous_stock"] - df["current_stock"],
)
print(df)
Prints:
id order current_stock previous_stock stock_sold
0 1 1 100 NaN 0.0
1 1 2 150 100.0 0.0
2 1 3 90 150.0 60.0
3 2 1 50 NaN 0.0
4 2 2 48 50.0 2.0
5 2 3 30 48.0 18.0
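As a possible alternative, the same numbers can be computed without the helper `previous_stock` column by taking the per-group difference directly. This is only a sketch built on the sample data from the question; it assumes the frame is already sorted by `id` and `order`, and that the first row of each group should count as 0 sold.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "1", "1", "2", "2", "2"],
    "order": [1, 2, 3, 1, 2, 3],
    "current_stock": [100, 150, 90, 50, 48, 30],
}).sort_values(["id", "order"])

# diff() within each id gives current - previous (NaN on the first row of a group)
change = df.groupby("id")["current_stock"].diff()

# a drop in stock is a negative change: flip the sign, then zero out
# increases (clip) and the first row of each group (fillna)
df["stock_sold"] = (-change).clip(lower=0).fillna(0).astype(int)
print(df)
```

Because `diff` is computed per group, the first row of `id` 2 is never compared against the last row of `id` 1.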

Related

Optimizing a Pandas DataFrame Transformation to Link two Columns

Given the following df:
SequenceNumber | ID | CountNumber | Side | featureA | featureB
0 0 | 0 | 3 | Sell | 4 | 2
1 0 | 1 | 1 | Buy | 12 | 45
2 0 | 2 | 1 | Buy | 1 | 4
3 0 | 3 | 1 | Buy | 3 | 36
4 1 | 0 | 1 | Sell | 5 | 11
5 1 | 1 | 1 | Sell | 7 | 12
6 1 | 2 | 2 | Buy | 5 | 35
I want to create a new df such that for every SequenceNumber value, it takes the rows with the CountNumber == 1, and creates new rows where if the Side == 'Buy' then put their ID in a column named To. Otherwise put their ID in a column named From. Then the empty column out of From and To will take the ID of the row with the CountNumber > 1 (there is only one per each SequenceNumber value). The rest of the features should be preserved.
NOTE: basically, each SequenceNumber represents one transaction that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links the buyers and sellers, where From is the Seller ID and To is the Buyer ID.
The output should look like this:
SequenceNumber | From | To | featureA | featureB
0 0 | 0 | 1 | 12 | 45
1 0 | 0 | 2 | 1 | 4
2 0 | 0 | 3 | 3 | 36
3 1 | 0 | 2 | 5 | 11
4 1 | 1 | 2 | 7 | 12
I implemented a method that does this, but it uses for loops, which take a long time to run on large data. I am looking for a faster, scalable method. Any suggestions?
Here is the original df:
df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
'ID': [0, 1, 2, 3, 0, 1, 2],
'CountNumber': [3, 1, 1, 1, 1, 1, 2],
'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
'featureA': [4, 12, 1, 3, 5, 7, 5],
'featureB': [2, 45, 4, 36, 11, 12, 35]})

You can reshape with a pivot, select the features to keep with a mask and rework the output with groupby.first then concat:
features = list(df.filter(like='feature'))
out = (
# repeat the rows with CountNumber > 1
df.loc[df.index.repeat(df['CountNumber'])]
# rename Sell/Buy into from/to and de-duplicate the rows per group
.assign(Side=lambda d: d['Side'].map({'Sell': 'from', 'Buy': 'to'}),
n=lambda d: d.groupby(['SequenceNumber', 'Side']).cumcount()
)
# mask the features where CountNumber > 1
.assign(**{f: lambda d, f=f: d[f].mask(d['CountNumber'].gt(1)) for f in features})
.drop(columns='CountNumber')
# reshape with a pivot
.pivot(index=['SequenceNumber', 'n'], columns='Side')
)
out = (
pd.concat([out['ID'], out.drop(columns='ID').groupby(level=0, axis=1).first()], axis=1)
.reset_index('SequenceNumber')
)
Output:
SequenceNumber from to featureA featureB
n
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
0 1 0 2 5.0 11.0
1 1 1 2 7.0 12.0
Alternative using a merge, as suggested by ifly6:
features = list(df.filter(like='feature'))
df1 = df.query('Side=="Sell"').copy()
df1[features] = df1[features].mask(df1['CountNumber'].gt(1))
df2 = df.query('Side=="Buy"').copy()
df2[features] = df2[features].mask(df2['CountNumber'].gt(1))
out = (df1.merge(df2, on='SequenceNumber').rename(columns={'ID_x': 'from', 'ID_y': 'to'})
.set_index(['SequenceNumber', 'from', 'to'])
.filter(like='feature')
.pipe(lambda d: d.groupby(d.columns.str.replace('_.*?$', '', regex=True), axis=1).first())
.reset_index()
)
Output:
SequenceNumber from to featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0

Initial response, to get the answer half complete: split the data into sellers and buyers, then merge it against itself on the sequence number:
ndf = df.query('Side == "Sell"').merge(
df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
.rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
I then drop the side variable.
ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])
This creates a very wide table:
SequenceNumber From CountNumber_sell featureA_sell featureB_sell To CountNumber_buy featureA_buy featureB_buy
0 0 0 3 4 2 1 1 12 45
1 0 0 3 4 2 2 1 1 4
2 0 0 3 4 2 3 1 3 36
3 1 0 1 5 11 2 2 5 35
4 1 1 1 7 12 2 2 5 35
This leaves you, however, with two featureA and featureB columns. I don't think your question clearly establishes which one takes precedence. Please provide more information on that.
Is it to select the side with the lower CountNumber? Is it when CountNumber == 1? If the latter, then just null out the entries before the merge stage, do the merge, and then forward fill your appropriate columns to recover the proper values.
Re nulling: if you null the portions of featureA and featureB where the CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.
s = df.query('Side == "Sell"').copy()
s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
b = df.query('Side == "Buy"').copy()
b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
ndf = s.merge(
b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
.rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
.ffill(axis=1).iloc[:, -1]
ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
.ffill(axis=1).iloc[:, -1]
ndf = ndf.drop(
columns=[i for i in ndf.columns if i.startswith('Side')
or i.endswith('_sell') or i.endswith('_buy')])
The final version of ndf then is:
SequenceNumber From To featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0

Here is an alternative approach
df1 = df.loc[df['CountNumber'] == 1].copy()
df1['From'] = (df1['ID'].where(df1['Side'] == 'Sell', df1['SequenceNumber']
.map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
)
df1['To'] = (df1['ID'].where(df1['Side'] == 'Buy', df1['SequenceNumber']
.map(df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']))
)
df1 = df1.drop(['ID', 'CountNumber', 'Side'], axis=1)
df1 = df1[['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]
df1.reset_index(drop=True, inplace=True)
print(df1)
SequenceNumber From To featureA featureB
0 0 0 1 12 45
1 0 0 2 1 4
2 0 0 3 3 36
3 1 0 2 5 11
4 1 1 2 7 12
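The approach above builds the same `CountNumber > 1` lookup twice. A minimal sketch of the same idea with the lookup computed once is shown here; the names `hub`, `single`, `other` and `is_sell` are illustrative only, and it assumes, as the question states, exactly one row with `CountNumber > 1` per `SequenceNumber`.

```python
import pandas as pd

df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
                   'ID': [0, 1, 2, 3, 0, 1, 2],
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})

# ID of the single counterparty (CountNumber > 1) for each SequenceNumber
hub = df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID']

# rows that become output rows, plus the counterparty ID aligned to them
single = df.loc[df['CountNumber'] == 1]
other = single['SequenceNumber'].map(hub)
is_sell = single['Side'].eq('Sell')

out = pd.DataFrame({
    'SequenceNumber': single['SequenceNumber'],
    'From': single['ID'].where(is_sell, other),   # seller: own ID if Sell
    'To': single['ID'].where(~is_sell, other),    # buyer: own ID if Buy
    'featureA': single['featureA'],
    'featureB': single['featureB'],
}).reset_index(drop=True)
print(out)
```

If a `SequenceNumber` had no row with `CountNumber > 1`, `map` would yield NaN for that group, so that case would need separate handling.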

Pandas find cell location that matches regex

I'm currently trying to parse excel files that contain somewhat structured information. The data I am interested in is in a subrange of an excel sheet. Basically the excel contains key-value pairs where the key is usually named in a predictable manner (found with regex). Keys are in the same column and the value pair is on the right side of the key in the excel sheet.
Regex pattern pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment' predictably matches the keys. Therefore if I can find the column where the keys are located and the rows where the keys are present, I am able to find the subrange of interest and parse it further.
Goals:
Get list of row indices that match regex (e.g. [5, 6, 8, 9])
Find which column contains keys that match regex (e.g. Unnamed: 3)
When I read in the excel using df_original = pd.read_excel(filename, sheet_name=sheet) the dataframe looks like this
df_original = pd.DataFrame({'Unnamed: 0':['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
'Unnamed: 1':['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
'Unnamed: 2':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'Unnamed: 3':['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
'Unnamed: 4':['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
'Unnamed: 5':[np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
'Unnamed: 6':[np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'Add water'],
})
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------|
| 0 | Value | High | nan | Other | Other 2 | nan | nan |
| 1 | Name | New York | nan | 125 | 25 | nan | nan |
| 2 | nan | nan | nan | 127 | 14.125 | nan | nan |
| 3 | Mark | 5000 | nan | nan | nan | nan | nan |
| 4 | Molly | 5250 | nan | nan | nan | nan | nan |
| 5 | Jack | 4600 | nan | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | nan | Strength | 1500 | nan | nan |
| 7 | Lena | 4950 | nan | nan | nan | nan | nan |
| 8 | nan | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+--------------+-----------------+--------------+--------------+--------------+
This code finds the rows of interest and solves Goal 1.
df = df_original.dropna(how='all', axis=1)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
mask = np.column_stack([df[col].str.contains(pattern, regex=True, na=False) for col in df])
row_range = df.loc[(mask.any(axis=1))].index.to_list()
print(df.loc[(mask.any(axis=1))].index.to_list())
[5, 6, 8, 9]
display(df.loc[row_range])
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
| | Unnamed: 0 | Unnamed: 1 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+--------------+--------------+-----------------+--------------+--------------+--------------|
| 5 | Jack | 4600 | Temperature (C) | nan | 25 | nan |
| 6 | Tom | 2500 | Strength | 1500 | nan | nan |
| 8 | nan | nan | Temperature (F) | nan | 77 | nan |
| 9 | nan | nan | Comment | nan | Looks OK | Add water |
+----+--------------+--------------+-----------------+--------------+--------------+--------------+
What is the easiest way to solve Goal 2? Basically I want to find the columns that contain at least one value that matches the regex pattern. The wanted output would be [Unnamed: 3]. There may be some easy way to solve goals 1 and 2 at the same time. For example:
col_of_interest = 'Unnamed: 3' # <- find this value
col_range = df_original.columns[df_original.columns.to_list().index(col_of_interest): ]
print(col_range)
Index(['Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6'], dtype='object')
target = df_original.loc[row_range, col_range]
display(target)
+----+-----------------+--------------+--------------+--------------+
| | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 |
|----+-----------------+--------------+--------------+--------------|
| 5 | Temperature (C) | nan | 25 | nan |
| 6 | Strength | 1500 | nan | nan |
| 8 | Temperature (F) | nan | 77 | nan |
| 9 | Comment | nan | Looks OK | Add water |
+----+-----------------+--------------+--------------+--------------+

One option is xlsx_cells from pyjanitor, which reads each cell as a single row. This gives you more freedom to manipulate the data and can be a handy alternative for your use case:
# pip install pyjanitor
import pandas as pd
import janitor as jn
Read in data
df = jn.xlsx_cells('test.xlsx', include_blank_cells=False)
df.head()
value internal_value coordinate row column data_type is_date number_format
0 Value Value A2 2 1 s False General
1 High High B2 2 2 s False General
2 Other Other D2 2 4 s False General
3 Other 2 Other 2 E2 2 5 s False General
4 Name Name A3 3 1 s False General
Filter for rows that match the pattern:
bools = df.value.str.startswith(('Temperature', 'Strength', 'Comment'), na = False)
vals = df.loc[bools, ['value', 'row', 'column']]
vals
value row column
16 Temperature (C) 7 4
20 Strength 8 4
24 Temperature (F) 10 4
26 Comment 11 4
Look for values that are on the same row as vals, and are in columns greater than the column in vals:
bools = df.column.gt(vals.column.unique().item()) & df.row.between(vals.row.min(), vals.row.max())
result = df.loc[bools, ['value', 'row', 'column']]
result
value row column
17 25 7 6
21 1500 8 5
25 77 10 6
27 Looks OK 11 6
28 Add water 11 7
Merge vals and result to get the final output
(vals
.drop(columns='column')
.rename(columns={'value':'val'})
.merge(result.drop(columns='column'))
)
val row value
0 Temperature (C) 7 25
1 Strength 8 1500
2 Temperature (F) 10 77
3 Comment 11 Looks OK
4 Comment 11 Add water

Try one of the following 2 options:
Option 1 (assuming there is no non-NaN data below the row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature'
idx, col = df_original.stack().str.contains(pattern, regex=True, na=False).idxmax()
res = df_original.loc[idx:, col:].dropna(how='all')
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
First, we use df.stack to add column names as a level to the index, and get all the data just in one column.
Now, we can apply Series.str.contains to find a match for r'[Tt]emperature'. We chain Series.idxmax to "[r]eturn the row label of the maximum value". I.e. this will be the first True, so we will get back (5, 'Unnamed: 3'), to be stored in idx and col respectively.
Now, we know where to start our selection from the df, namely at index 5 and column Unnamed: 3. If we simply want all the data (to the right, and to bottom) from here on, we can use: df_original.loc[idx:, col:] and finally, drop all remaining rows that have only NaN values.
Option 2 (potential data below row with "[Tt]emperature (C)" that we don't want to include)
pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'
tmp = df_original.stack().str.contains(pattern, regex=True, na=False)
tmp = tmp[tmp].index
res = df_original.loc[tmp.get_level_values(0), tmp.get_level_values(1)[0]:]
print(res)
Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
5 Temperature (C) NaN 25 NaN
6 Strength 1500 NaN NaN
8 Temperature (F) NaN 77 NaN
9 Comment NaN Looks OK Add water
Explanation
Basically, the procedure here is the same as with option 1, except that we want to retrieve all the index values, rather than just the first one (for "[Tt]emperature (C)"). After tmp[tmp].index, we get tmp as:
MultiIndex([(5, 'Unnamed: 3'),
(6, 'Unnamed: 3'),
(8, 'Unnamed: 3'),
(9, 'Unnamed: 3')],
)
In the next step, we use these values as coordinates for df.loc. I.e. for the index selection, we want all values, so we use index.get_level_values; for the column, we only need the first value (they should all be the same of course: Unnamed: 3).
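Both goals from the question can also be read off a single boolean mask, which may be easier to follow than the stacked index. The sketch below is one possible way to do it, not the method used in the answers above. It casts every cell to `str` first (an assumption made here so that the all-NaN column does not need to be dropped beforehand); the names `mask`, `rows` and `cols` are illustrative.

```python
import numpy as np
import pandas as pd

df_original = pd.DataFrame({
    'Unnamed: 0': ['Value', 'Name', np.nan, 'Mark', 'Molly', 'Jack', 'Tom', 'Lena', np.nan, np.nan],
    'Unnamed: 1': ['High', 'New York', np.nan, '5000', '5250', '4600', '2500', '4950', np.nan, np.nan],
    'Unnamed: 2': [np.nan] * 10,
    'Unnamed: 3': ['Other', 125, 127, np.nan, np.nan, 'Temperature (C)', 'Strength', np.nan, 'Temperature (F)', 'Comment'],
    'Unnamed: 4': ['Other 2', 25, 14.125, np.nan, np.nan, np.nan, '1500', np.nan, np.nan, np.nan],
    'Unnamed: 5': [np.nan, np.nan, np.nan, np.nan, np.nan, 25, np.nan, np.nan, 77, 'Looks OK'],
    'Unnamed: 6': [np.nan] * 9 + ['Add water'],
})

pattern = r'[Tt]emperature|[Ss]tren|[Cc]omment'

# True where a cell, viewed as text, matches the pattern
mask = df_original.apply(lambda col: col.astype(str).str.contains(pattern, regex=True))

rows = mask.index[mask.any(axis=1)].tolist()    # Goal 1: rows with a match
cols = mask.columns[mask.any(axis=0)].tolist()  # Goal 2: columns with a match

# everything from the key column rightwards, for the matching rows only
target = df_original.loc[rows, cols[0]:]
print(rows, cols)
print(target)
```

Slicing with `cols[0]:` assumes the keys sit in a single column; if `cols` had several entries, the choice of starting column would need to be made explicitly.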

Pandas labeling in a for loop

Here's the thing: I want to add a new column as a label for a selection of rows.
When failure is 1, select the 2 rows before and the 1 row after, then add a label column. Here is my attempt:
n = 0
df_new = pd.DataFrame()
for i in range(0, len(df)):
    if df.iloc[i]['failure'] == 1:
        n += 1
        df_new = df_new.append(df.iloc[i-2:i+2])
        df_new = df_new.append({'label': n}, ignore_index=True)
The result of that:
var_1 | var_2 | failure | label
------------------------------------
0 75.0 | 55.0 | 0.0 | NaN
------------------------------------
1 45.0 | 19.0 | 0.0 | NaN
------------------------------------
2 76.0 | 46.0 | 1.0 | NaN
------------------------------------
3 18.0 | 63.0 | 0.0 | NaN
------------------------------------
4 NaN | NaN | NaN | 1.0
------------------------------------
But I want...
var_1 | var_2 | failure | label
------------------------------------
0 75.0 | 55.0 | 0.0 | 1
------------------------------------
1 45.0 | 19.0 | 0.0 | 1
------------------------------------
2 76.0 | 46.0 | 1.0 | 1
------------------------------------
3 18.0 | 63.0 | 0.0 | 1
------------------------------------

Instead of a for loop, a more idiomatic pandas approach would be to first compute the sum as a series, and then add it to your frame with a condition.
For example, signal = df['failure'].rolling(window=4).sum().shift(-3) (You'll want to double check the shift offset to make sure it's what you intend).
Then you can create df['label'] = np.where(signal == 1, 1, 0).
Does that fit what you need?

For the dataset:
a dataset with 10,000 rows and 6 columns of random integers between 0 and 99 (inclusive), plus a last column holding a random integer that is either 0 or 1
df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 6)), columns=['a', 'b', 'c', 'd', 'e', 'f'])
df['g'] = np.random.randint(0, 2, size=10000)
df.columns = [ 'var_' + str(i) for i in range(1, 7) ] + ['failure']
df['failure'] = np.random.binomial(1, 0.1, size=10000)
when failure is 1, select 2 rows before and 1 after then add a label column
n = 0
df_new = pd.DataFrame()
for i in range(0, len(df)):
    if df.iloc[i]['failure'] == 1:
        n += 1
        df_new = df_new.append(df.iloc[i-2:i+2])
        df_new = df_new.append({'label': n}, ignore_index=True)
df_new['label'].fillna(method='bfill', inplace=True)
df_new.dropna(inplace=True)
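Note that `DataFrame.append` was removed in pandas 2.0, so the loop above only runs on older versions. A rough sketch of the same labeling built with `pd.concat` follows. The helper name `label_windows` and its `before`/`after` parameters are made up for illustration, and the five-row frame is hypothetical sample data modeled on the question's table.

```python
import numpy as np
import pandas as pd

def label_windows(df, before=2, after=1):
    """Collect the rows around each failure and tag them with a running label."""
    positions = np.flatnonzero(df["failure"].to_numpy() == 1)
    pieces = []
    for n, pos in enumerate(positions, start=1):
        start = max(pos - before, 0)  # do not wrap around at the top of the frame
        window = df.iloc[start:pos + after + 1].copy()
        window["label"] = n
        pieces.append(window)
    if not pieces:
        # no failures: return an empty frame that still has a label column
        return df.iloc[:0].assign(label=pd.Series(dtype="int64"))
    return pd.concat(pieces)

# hypothetical sample data
df = pd.DataFrame({
    "var_1": [75, 45, 76, 18, 30],
    "var_2": [55, 19, 46, 63, 12],
    "failure": [0, 0, 1, 0, 0],
})
df_new = label_windows(df)
print(df_new)
```

When two failures are close together their windows overlap, and the shared rows then appear once per label; whether that is wanted depends on the use case.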

List in pandas dataframe columns

I have the following pandas dataframe
| A | B |
| :-|:------:|
| 1 | [2,3,4]|
| 2 | np.nan |
| 3 | np.nan |
| 4 | 10 |
I would like to unlist the first row and place those values sequentially in the subsequent rows. The outcome will look like this:
| A | B |
| :-|:------:|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 10 |
How can I achieve this in a very large dataset where this phenomenon occurs in many rows?

If the NaN values serve as "slack" space that the list elements can slot into, i.e. if the lengths match, then you can explode column "B", drop the NaN values with dropna, reset the index and assign back to "B":
df['B'] = df['B'].explode().dropna().reset_index(drop=True)
Output:
A B
0 1 2
1 2 3
2 3 4
3 4 10
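Since this one-liner silently misaligns values when the lengths do not match, a small guard can make the assumption explicit. The following is a sketch on the question's sample data; the `ValueError` check is an addition for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [[2, 3, 4], np.nan, np.nan, 10]})

# one row per list element, NaN placeholders removed, fresh 0..n-1 index
flat = df["B"].explode().dropna().reset_index(drop=True)

# the flattened values must fill the column exactly
if len(flat) != len(df):
    raise ValueError("list lengths do not match the number of NaN slots")

df["B"] = flat
print(df)
```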

If the number of consecutive NaNs does not match the length of the list, you can make groups starting with the non-NaN elements and explode while keeping the length of each group constant.
I used a slightly different example for clarity (I also assigned to a different column):
df['C'] = (df['B']
.groupby(df['B'].notna().cumsum())
.apply(lambda s: s.explode().iloc[:len(s)])
.values
)
Output:
A B C
0 1 [2, 3, 4] 2
1 2 NaN 3
2 3 NaN 4
3 4 NaN NaN
4 5 10 10
Used input:
df = pd.DataFrame({'A': range(1,6),
'B': [[2,3,4], np.nan, np.nan, np.nan, 10]
})

in a few conditions, compare row with a previous row and drop rows with condition in python pandas

I have a concept of what I need to do, but I can't write working code for it. Please take a look and give some advice.
Step 1. Find the rows that contain values in the second column.
Step 2. For those rows, compare the value in the first column with that of the previous row.
Step 3. Of each such pair, drop the row with the larger first-column value.
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | 60 |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 1 | 180 |
| 1 | nan |
| 0 | 120 |
E.g. I want to compare the missing values of the rows that have a value in diff [120, 180, 60] with those of their previous rows. In the end, the desired dataframe will look like:
|missing | diff |
|--------|------|
| 0 | nan |
| 1 | nan |
| 0 | nan |
| 0 | nan |
| 0 | 120 |
Update: I changed the code according to the answer, but got back the same df as the original df:
import pandas as pd
import numpy as np
data={'missing':[0,1,1,0,0,1,1,0],'diff':[np.nan,60,np.nan,np.nan,np.nan,180,np.nan,120]}
df=pd.DataFrame(data)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0
if df['diff'][ind]!=np.nan:
if ind!=0:
if df['missing'][ind]>df['missing'][ind-1]:
df=df.drop(ind,0)
else:
df=df.drop(ind-1,0)
df
missing diff
0 0 NaN
1 1 60.0
2 1 NaN
3 0 NaN
4 0 NaN
5 1 180.0
6 1 NaN
7 0 120.0

IIUC, you can try:
m = df['diff'].notna()
df = (
pd.concat([
df[df['diff'].isna()],
df[m][df[m.shift(-1).fillna(False)]['missing'].values >
df[m]['missing'].values]
])
)
OUTPUT:
missing diff
1 0 <NA>
3 1 <NA>
4 0 <NA>
5 0 <NA>
7 1 <NA>
8 0 120

This should work:
for ind in df.index:
    if not np.isnan(df['diff'][ind]):
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind)
            else:
                df = df.drop(ind-1)

This will work:
for ind in df.index:
    if pd.notna(df['diff'][ind]):
        if ind != 0:
            if df['missing'][ind] > df['missing'][ind-1]:
                df = df.drop(ind)
            else:
                df = df.drop(ind-1)

import numpy as np
import pandas as pd

# define dictionary
data = {'missing': [0, 1, 1, 0, 0, 1, 1, 0], 'diff': [np.nan, 60, np.nan, np.nan, np.nan, 180, np.nan, 120]}
# dictionary to dataframe
df = pd.DataFrame(data)
print(df)
# for each row in dataframe
for ind in df.index:
    # only rows whose diff value is a number (not NaN)
    if pd.notna(df['diff'][ind]):
        if ind != 0:
            # compare the first-column value with that of the previous row
            if df['missing'][ind] > df['missing'][ind-1]:
                # drop the row with the larger first-column value
                df = df.drop(ind)
            else:
                df = df.drop(ind-1)
print(df)
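For larger frames, the same rule can probably be expressed without a loop. The sketch below is a vectorized reading of the three steps from the question; it assumes rows with a value in `diff` are never adjacent (otherwise a sequential loop and this version could disagree), and the names `has_diff`, `drop_current` and `drop_previous` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'missing': [0, 1, 1, 0, 0, 1, 1, 0],
    'diff': [np.nan, 60, np.nan, np.nan, np.nan, 180, np.nan, 120],
})

# rows that carry a diff value; the first row has no previous row to compare with
has_diff = df['diff'].notna()
has_diff.iloc[0] = False

# is the current row's 'missing' larger than the previous row's?
larger_than_prev = df['missing'] > df['missing'].shift()

# drop the current row if it is larger, otherwise drop the previous row
drop_current = has_diff & larger_than_prev
drop_previous = (has_diff & ~larger_than_prev).shift(-1, fill_value=False)

result = df[~(drop_current | drop_previous)]
print(result)
```

On the sample data this keeps the rows with index 0, 2, 3, 4 and 7, which matches the desired output in the question.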
