I have a dataset that has many columns: among them AMS card number, registration date, and first purchase date. The data has duplicates for a large number of AMS card numbers. The final dataset needs to be unique on card number. I need to keep the rows in the dataset corresponding to the latest registration date and earliest first purchase date and this is how I've done it. I'm pretty sure it works, but it's too slow, since the dataset has over 1 million rows. In the grand scheme of python and pandas this is not an exorbitant number, which is why I'm certain my algorithm is poor and needs to be rewritten. I'm new to Pandas and fairly new to Python.
amsset = set(df["AMS Card"]) #capture all unique AMS numbers for each in amsset:
samecarddf = df.loc[df["AMS Card"] == each] #put all rows of df with same ams numbers in samecarddf
lensamecarddf = len(samecarddf)
if lensamecarddf > 1: #if there is more than one row with the same ams number in samecarddf
latestreg = samecarddf['Customer Reg Date'].max() #find the latest registration date
samecarddf = samecarddf.loc[samecarddf['Customer Reg Date'] == latestreg] #keep the rows with the latest registration date
earliestpur = samecarddf['Customer First Purchase Date'].min() #find earliest first purchase date
samecarddf = samecarddf.loc[samecarddf["Customer First Purchase Date"] == earliestpur] #keep the rows with the earliest first purchase date
dffinal = dffinal.append(samecarddf).drop_duplicates() #put all rows with 1 ams or those with latest registration and earliest first purchase and drop any remaining duplicates
Here's a way to do what your question asks:
# Update df to contain only unique `AMS Card` values,
# and in case of duplicates, choose the row with latest `Customer Reg Date` and
# (among duplicates thereof) earliest `Customer First Purchase Date`.
dffinal = ( df
.sort_values(['AMS Card', 'Customer Reg Date', 'Customer First Purchase Date'], ascending=[True, False, True])
.drop_duplicates(['AMS Card'])
.drop_duplicates(['AMS Card', 'Customer Reg Date']) )
Sample input:
AMS Card Customer Reg Date Customer First Purchase Date some_data
0 1 2020-01-01 2021-01-01 1
1 2 2020-01-01 2021-02-01 2
2 2 2020-01-01 2021-03-01 3
3 3 2020-01-01 2021-04-01 4
4 3 2020-02-01 2021-05-01 5
5 3 2020-02-01 2021-06-01 6
Output:
AMS Card Customer Reg Date Customer First Purchase Date some_data
0 1 2020-01-01 2021-01-01 1
1 2 2020-01-01 2021-02-01 2
4 3 2020-02-01 2021-05-01 5
As an alternative, the sorting can be split into two parts (so that we sort on Customer First Purchase Date only after removing duplicates of Customer Reg Date):
dffinal = ( df
.sort_values(['AMS Card', 'Customer Reg Date'], ascending=[True, False])
.drop_duplicates(['AMS Card'])
.sort_values(['AMS Card', 'Customer First Purchase Date'], ascending=[True, True])
.drop_duplicates(['AMS Card', 'Customer Reg Date']) )
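As a further variation (just a sketch of an equivalent approach, not part of the answer above), a single sort followed by groupby().head(1) keeps the same rows, since the first row of each card in the sorted order is the one with the latest registration date and, among ties, the earliest first purchase date:
dffinal = ( df
.sort_values(['Customer Reg Date', 'Customer First Purchase Date'], ascending=[False, True])
.groupby('AMS Card', sort=False)
.head(1) )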
I have a dataframe where every row is a product transaction. I want to sort by date (ascending), group by product id and find every product id where there has been a transaction type X and no subsequent event type Y or Z. The sorting and grouping are straightforward, but how do I accomplish that last part?
idx id evt_type date price
57773 10 listed 2021-11-19 14:51:30.964379 250.00
59013 10 sold 2021-11-21 15:00:48.439708 250.00
60111 10 listed 2021-11-25 00:18:08.863694 255.00
60806 10 sold 2021-11-27 10:24:31.431779 255.00
61445 10 refund_req 2021-11-27 11:40:39.033327 NaN
61455 10 refunded 2021-11-27 11:49:39.033327 255.00
60808 10 listed 2021-11-27 12:30:05.368266 280.00
62177 10 sold 2021-11-30 13:21:07.421889 280.00
61887 10 refund_req 2021-11-30 13:21:07.421889 NaN
63742 10 listed 2021-12-04 00:27:00.393276 290.00
64222 10 sold 2021-12-30 13:21:07.421889 290.00
In the above example, I have filtered out only product id 10, and I only care about the fact that there was a refund_req event without a following refunded event. So group by 'id' and return the ids where there's a refund_req evt_type without a following refunded evt_type.
Current code:
df.sort_values('date').groupby('id')
But I'm not sure about the aggregation type that delivers the desired ids.
I was thinking about iterating over all known product ids and attempting from there...
EDIT1:
This gets me partly there
df.sort_values('date').groupby(['id', 'evt_type']).agg({'date': 'max'})
Which outputs:
id evt_type date
10 listed 2021-11-15 20:47:51.364352
sold 2022-01-10 15:07:42.048301
refund_req 2021-11-30 15:51:41.443962
refunded 2021-11-27 00:55:55.05162
TLDR
I want all "id"s where there is a refund_req "evt_type" not followed by (higher "date") a refunded "evt_type"
How about:
# Calculate last date of refund_req and refunded event types for each id
last_refund_reqs = (
df
.loc[df['evt_type'] == 'refund_req']
.groupby('id')
[['date']].max()
)
last_refunded = (
df
.loc[df['evt_type'] == 'refunded']
.groupby('id')
[['date']].max()
)
# Merge and compare last dates - see those where refund_req
# comes after the last refunded or where there are no refunded events
merged = last_refund_reqs.join(
last_refunded, lsuffix='_refund_req', rsuffix='_refunded', how='left'
)
merged.loc[
(merged['date_refunded'] < merged['date_refund_req']) |
merged['date_refunded'].isna()
]
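If only the offending ids are needed rather than the whole rows, the index of that filtered frame already holds them (a small follow-up sketch; open_refunds is just an illustrative name):
open_refunds = merged.loc[
    (merged['date_refunded'] < merged['date_refund_req']) |
    merged['date_refunded'].isna()
]
open_refunds.index.tolist()  # [10] for the sample data above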
Basically, this is the challenge I have:
I have a data set with a transaction date and an ID, and what I need to do is find whether an ID is duplicated within a date range.
123 transaction 1/1/2021
345 transaction 1/1/2021
123 transaction 1/2/2021
123 transaction 1/20/2021
Here I want to return 1 for ID 123 because the duplicate transaction is within a range of 7 days.
I can do this with Excel, and I added some more date ranges depending on the day, for example a range of up to 6 days for Wednesday, 5 days for Thursday, and 4 days for Friday. But I have no idea how to accomplish this with pandas...
The reason I want to do this with pandas is that each data set has up to 1M rows, which takes forever to process in Excel, and on top of that I need to split by category, so it's just a pain to do all that manual work.
Are there any recommendations or ideas on how to accomplish this task?
The df:
df = pd.read_csv(StringIO(
"""id,trans_date
123,1/1/2021
345,1/1/2021
123,1/2/2021
123,1/20/2021
345,1/3/2021
"""
)) # added extra record for demo
df
id trans_date
0 123 1/1/2021
1 345 1/1/2021
2 123 1/2/2021
3 123 1/20/2021
4 345 1/3/2021
df['trans_date'] = pd.to_datetime(df['trans_date'])
As you have to look at each of the ids separately, you can group by id and then get the maximum and minimum dates; if the difference is greater than 7 days, mark the id as 1, otherwise 0.
result = df.groupby('id')['trans_date'].apply(
    lambda x: (x.max() - x.min()).days > 7)
result
id
123 True
345 False
Name: trans_date, dtype: bool
If you just need the required ids, then
result.index[result].values
array([123])
The context and data you've provided about your situation are scanty, but you can probably do something like this:
>>> df
id type date
0 123 transaction 2021-01-01
1 345 transaction 2021-01-01
2 123 transaction 2021-01-02
3 123 transaction 2021-01-20
>>> dupes = df.groupby(pd.Grouper(key='date', freq='W'))['id'].apply(pd.Series.duplicated)
>>> dupes
0 False
1 False
2 True
3 False
Name: id, dtype: bool
There, item 2 (the third item) is True because 123 already occurred in the past week.
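If you just need the flagged ids rather than the boolean row mask, you can index back into df with it (a small sketch):
df.loc[dupes, 'id'].unique()  # ids duplicated within the same weekly bin -> array([123])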
As far as I can understand the question, I think this is what you need.
from datetime import datetime
import pandas as pd
df = pd.DataFrame({
"id": [123, 345, 123, 123],
"name": ["transaction", "transaction", "transaction", "transaction"],
"date": ["01/01/2021", "01/01/2021", "01/02/2021", "01/10/2021"]
})
def dates_in_range(dates):
    num_days_frame = 6
    processed_dates = sorted(datetime.strptime(date, "%m/%d/%Y") for date in dates)
    difference_in_range = any(
        abs(processed_dates[i] - processed_dates[i - 1]).days < num_days_frame
        for i in range(1, len(processed_dates))
    )
    return 1 if difference_in_range else 0
group = df.groupby("id")
df_new = group.apply(lambda x: dates_in_range(x["date"]))
print(df_new)
"""
print(df_new)
id
123 1
345 0
"""
Here you first group by the id such that you get all dates for that particular id in the same row.
After that, a function is applied to each group's dates: they are sorted first, and then each pair of consecutive dates is checked to see whether their difference falls within the defined range. The sorting ensures that comparing consecutive dates is enough to detect dates that are close together.
Finally, if any pair of consecutive sorted dates differs by less than num_days_frame (6) days, we return 1, otherwise 0.
All that being said, this might not be very performant, since the dates of every group are sorted separately. One way to avoid that is to sort the entire df first and then apply the group operation, so the dates are already in order; a sketch of that idea follows below.
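A rough sketch of that sort-first idea (assuming the same df and column names as in the snippet above):
import pandas as pd
# Sort once, then compare each transaction to the previous one of the same id.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
df = df.sort_values(["id", "date"])
gaps = df.groupby("id")["date"].diff().dt.days  # days since the previous transaction of the same id
flagged = (gaps < 6).groupby(df["id"]).any().astype(int)
print(flagged)
# id
# 123    1
# 345    0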
I have health diagnosis data for the last year and I would like to get a count of diagnoses for each month. Here is my data:
import pandas as pd
cars2 = {'ID': [22,100,47,35,60],
'Date': ['2020-04-11','2021-04-12','2020-05-13','2020-05-14', '2020-06-15'],
'diagnosis': ['bacteria sepsis','bacteria sepsis','Sepsis','Risk sepsis','Neonatal sepsis'],
'outcome': ['alive','alive','dead','alive','dead']
}
df2 = pd.DataFrame(cars2, columns = ['ID','Date', 'diagnosis', 'outcome'])
print (df2)
How can I get diagnosis counts for each month? For example, how many diagnoses of bacteria sepsis did we have in a given month? The final result should be a table showing value counts of diagnoses for each month.
If you want to see results per month, you can use pivot_table.
df2.pivot_table(index=['outcome','diagnosis'], columns=pd.to_datetime(df2['Date']).dt.month, aggfunc='size', fill_value=0)
Date 4 5 6
outcome diagnosis
alive Risk sepsis 0 1 0
bacteria sepsis 2 0 0
dead Neonatal sepsis 0 0 1
Sepsis 0 1 0
4,5,6 are the months in your dataset.
Try playing around with the parameters here, you might be able to land on a better view that suits your ideal result better.
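Since the data spans two years (April 2020 and April 2021), plain month numbers will merge those periods into one column; a variation (just a sketch) is to use year-month periods as the columns instead:
df2.pivot_table(index=['outcome','diagnosis'], columns=pd.to_datetime(df2['Date']).dt.to_period('M'), aggfunc='size', fill_value=0)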
I modified your dataframe by setting the Date column as index:
import pandas as pd
cars2 = {'ID': [22,100,47,35,60],
'Date': ['2020-04-11','2021-04-12','2020-05-13','2020-05-14', '2020-06-15'],
'diagnosis': ['bacteria sepsis','bacteria sepsis','Sepsis','Risk sepsis','Neonatal sepsis'],
'outcome': ['alive','alive','dead','alive','dead']
}
df2 = pd.DataFrame(cars2, columns = ['ID','Date', 'diagnosis', 'outcome'])
df2.index = pd.to_datetime(df2['Date']) # <--- I set your Date column as the index (also convert it to datetime)
df2.drop('Date',inplace=True, axis=1) # <--- Drop the Date column
print (df2)
If you group the dataframe with a pd.Grouper and the columns you want to group by (diagnosis and outcome):
df2.groupby([pd.Grouper(freq='M'), 'diagnosis','outcome']).count()
Output:
ID
Date diagnosis outcome
2020-04-30 bacteria sepsis alive 1
2020-05-31 Risk sepsis alive 1
Sepsis dead 1
2020-06-30 Neonatal sepsis dead 1
2021-04-30 bacteria sepsis alive 1
Note: the freq='M' in pd.Grouper groups the dataframe by month. Read more about the freq attribute here
Edit: Assigning the grouped dataframe to new_df and resetting the other indices except Date:
new_df = df2.groupby([pd.Grouper(freq='M'), 'diagnosis','outcome']).count()
new_df.reset_index(level=[1,2],inplace=True)
Iterate over each month and get the table separately inside df_list:
import numpy as np
df_list = []  # <--- this will contain each separate table for each month
for month in np.unique(new_df.index):
    df_list += [pd.DataFrame(new_df.loc[[month]])]
df_list[0] # <-- get the first dataframe in df_list
will return:
diagnosis outcome ID
Date
2020-04-30 bacteria sepsis alive 1
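An equivalent way to split by month without the explicit loop (again just a sketch) is to group on the Date index level and collect the pieces into a dict keyed by month end:
df_by_month = {month: frame for month, frame in new_df.groupby(level='Date')}
df_by_month[pd.Timestamp('2020-04-30')]  # same table as df_list[0]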
First you need to create a month variable through the to_datetime() function; then you can group by the month and take value_counts() within each month:
import pandas as pd
df2['month'] = pd.to_datetime(df2['Date']).dt.month
df2.groupby('month').apply(lambda x: x['diagnosis'].value_counts())
month
4 bacteria sepsis 2
5 Risk sepsis 1
Sepsis 1
6 Neonatal sepsis 1
Name: diagnosis, dtype: int64
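If you prefer a wide table with one column per diagnosis, the same result can be reshaped (a sketch building on the code above):
df2.groupby('month')['diagnosis'].value_counts().unstack(fill_value=0)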
I think what you mean by "for each month" is not just the month number alone, but the year-month combination. As such, let's approach it as follows:
First, we create a 'year-month' column according to the Date column. Then use .groupby() on this new year-month column and get .value_counts() on column diagnosis, as follows:
df2['year-month'] = pd.to_datetime(df2['Date']).dt.strftime("%Y-%m")
df2.groupby('year-month')['diagnosis'].value_counts().to_frame(name='Count').reset_index()
Result:
year-month diagnosis Count
0 2020-04 bacteria sepsis 1
1 2020-05 Risk sepsis 1
2 2020-05 Sepsis 1
3 2020-06 Neonatal sepsis 1
4 2021-04 bacteria sepsis 1
I am trying to identify only the first orders of unique "items" purchased by "test" customers in the simplified sample dataframe created below:
df=pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900', 'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
"date":['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016','03/01/2016',
'04/30/2016', '05/16/2016','09/27/2016', '04/20/2016','04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016' ],
"item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A', 'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21',
'BG10A', 'CG10BA', 'BG10A']
})
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending = True)
The desired output would look as shown in the picture: all unique items ordered by date of purchase, with a new column called "cust_item_rank", and any repeated (duplicated) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should have the same rank, as shown in the picture for customer A55 (A10BABA and A10DBDB are both ranked 1).
I have spent a fair bit of time using a combination of groupby and/or rank operations, but have been unsuccessful thus far. As an example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
Yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-01-11 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this question, I built upon the excellent initial answer by cs95 and called on the rank function in pandas as follows:
#remove duplicates as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
#rank by date after grouping by customer
df2["cust_item_rank"]= df2.groupby(["cust"])["date"].rank(ascending=1,method='dense').astype(int)
This resulted in the desired output.
It appears that this problem can be solved using either the "min" or the "dense" ranking method, but I chose "dense" to avoid skipping any ranks.
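A tiny illustration (made-up dates, not the original data) of how the two methods differ on tied dates:
import pandas as pd
s = pd.Series(pd.to_datetime(['2016-01-11', '2016-01-11', '2016-02-01']))
print(s.rank(method='min'))    # 1.0, 1.0, 3.0 -> rank 2 is skipped
print(s.rank(method='dense'))  # 1.0, 1.0, 2.0 -> consecutive ranks, no gaps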
I have 3 dataframes, already sorted by date and p_id and with no null values, as follows:
First DataFrame
df1 = pd.DataFrame([['2018-07-05',8.0,1],
['2018-07-15',1.0,1],
['2018-08-05',2.0,1],
['2018-08-05',2.0,2]],
columns=["purchase_date", "qty", "p_id"])
Second DataFrame
df2 = pd.DataFrame([['2018-07-15',2.0,1],
['2018-08-04',7.0,1],
['2018-08-15',1.0,2]],
columns=["sell_date", "qty", "p_id"])
Third DataFrame
df3 = pd.DataFrame([['2018-07-25',1.0,1],
['2018-08-15',1.0,1]],
columns=["expired_date", "qty", "p_id"])
The dataframes look like:
1st: (Holds Purchase details)
purchase_date qty p_id
0 2018-07-05 8.0 1
1 2018-07-15 1.0 1
2 2018-08-05 2.0 1
3 2018-08-05 2.0 2
2nd: (Holds Sales Details)
sell_date qty p_id
0 2018-07-15 2.0 1
1 2018-08-04 7.0 1
2 2018-08-15 1.0 2
3rd: (Holds Expiry Details)
expired_date qty p_id
0 2018-07-25 1.0 1
1 2018-08-15 1.0 1
Now what I want to do is find when the product that has expired was bought, following FIFO (the product purchased first expires first).
Explanation: Consider product with id 1
By date 2018-07-15
We had 8+1 purchased quantity and -2 sold quantity, i.e. a total of 8+1-2 = 7 in stock (the negative sign signifies quantity deduction)
By date 2018-07-25
1 quantity expired, so the first entry for our new when_product_expired dataframe will be:
purchase_date expired_date p_id
2018-07-05 2018-07-25 1
And then for next expiry entry
By date 2018-08-04
7 quantity were sold out so current quantity will be 8+1-2-7 = 0
By date 2018-08-05
2 quantity were bought so current quantity is 0+2
By date 2018-08-15
1 quantity expired
So a new and final entry will be:
purchase_date expired_date p_id
2018-07-05 2018-07-25 1
2018-08-05 2018-08-15 1
This time the product that expired was one that was purchased on 2018-08-05
Actually I have datetimes, so purchase and sell times will never be equal (you may assume). Also, before selling and expiring there will always be some quantity of the product in stock, i.e. the data is consistent.
And Thank you in advance :-)
Updated
What I am thinking now is to rename all date fields to the same field name and concatenate the purchase, sell and expired dataframes with negative signs on their quantities, but that alone won't get me there:
df2.qty = df2.qty*-1
df3.qty=df3.qty*-1
new = pd.concat([df1, df2, df3], sort=False)\
    .sort_values(by=["purchase_date"], ascending=True)\
    .reset_index(drop=True)
What you essentially want is this FIFO list of items in stock. In my experience pandas is not the right tool to relate different rows to each other. The workflow should be split-apply-combine. If you split it and don't really see a way to puzzle it back together, it may be an ill-formulated problem. You can still get a lot done with groupby, but this is something I would not try to solve with some clever trick in pandas. Even if you make it work, it will be hell to maintain.
I don't know how performance critical your problem is (i.e. how large your DataFrames are). If it's just a few tens of thousands of entries, you can just explicitly loop over the pandas rows (warning: this is slow) and build the FIFO list by hand.
I hacked together some code for this. The DataFrames you proposed are in there. I loop over all rows and do bookkeeping on how many items are in stock. This is done in a queue q which contains an element for each individual item, and that element conveniently is the purchase_date.
import queue
import pandas as pd
from pandas import Series, DataFrame
# modified (see text)
df1 = pd.DataFrame([['2018-07-05',8.0,1],
['2018-07-15',3.0,1],
['2018-08-05',2.0,1],
['2018-08-05',2.0,2]],
columns=["purchase_date", "qty", "p_id"])
df2 = pd.DataFrame([['2018-07-15',2.0,1],
['2018-08-04',7.0,1],
['2018-08-15',1.0,2]],
columns=["sell_date", "qty", "p_id"])
df3 = pd.DataFrame([['2018-07-25',1.0,1],
['2018-08-15',1.0,1]],
columns=["expired_date", "qty", "p_id"])
df1 = df1.rename(columns={'purchase_date':'date'})
df2 = df2.rename(columns={'sell_date':'date'})
df3 = df3.rename(columns={'expired_date' : 'date'})
df3['qty'] *= -1
df2['qty'] *= -1
df = pd.concat([df1,df2])\
.sort_values(by=["date"],ascending=True)\
.reset_index(drop=True)
# Necessary to distinguish between sold and expired items while looping
df['expired'] = False
df3['expired'] = True
df = pd.concat([df,df3])\
.sort_values(by=["date"],ascending=True)\
.reset_index(drop=True)
#date qty p_id expired
#7-05 8.0 1 False
#7-15 1.0 1 False
#7-15 -2.0 1 False
#7-25 -1.0 1 True
#8-04 -7.0 1 False
#8-05 2.0 1 False
#8-05 2.0 2 False
#8-15 -1.0 2 False
#8-15 -1.0 1 True
# Iteratively build up when_product_expired
when_product_expired = []
# p_id hardcoded here
p_id = 1
# q contains purchase dates for all individual items 'currently' in stock
q = queue.Queue()
for index, row in df[df['p_id'] == p_id].iterrows():
    # if items are bought, put as many as 'qty' into q
    if row['qty'] > 0:
        for tmp in range(int(round(row['qty']))):
            date = row['date']
            q.put(date)
    # if items are sold or expired, remove as many from q.
    # if expired, additionally save purchase and expiration date into when_product_expired
    elif row['qty'] < 0:
        for tmp in range(int(round(-row['qty']))):
            purchase_date = q.get()
            if row['expired']:
                print('item p_id 1 was bought on', purchase_date)
                when_product_expired.append([purchase_date, row['date'], p_id])

when_product_expired = DataFrame(when_product_expired,
                                 columns=['purchase_date', 'expired_date', 'p_id'])
A few remarks:
I relied on your guarantee that
before selling and expire, there will always be some quantity of product in stock
This is not given for your example DataFrames: by 2018-08-04 only 9 items with p_id 1 have been bought, while the sale of 7 pieces on that date needs 7 in stock when only 6 are left (2 already sold and 1 expired). I modified df1 so that 11 pieces are bought.
If this assumption is violated, the Queue will try to get an item that is not there; since q.get() blocks by default, that shows up as an apparent endless loop. You might want to use q.get(block=False) and catch the queue.Empty exception instead.
The queue is not implemented efficiently in the least: if many items are in stock, there will be a lot of duplicated data.
You can generalize this to more p_id's either by putting everything into a function and using .groupby('p_id').apply(function), or by looping over df['p_id'].unique(); a rough sketch of the first option follows.
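Sketch (assumptions: df is the combined and sorted frame built above with columns date, qty, p_id and expired; the function name is just illustrative):
import queue
import pandas as pd

def expiries_for_product(group):
    q = queue.Queue()
    rows = []
    pid = group['p_id'].iloc[0]
    for _, row in group.iterrows():
        if row['qty'] > 0:        # purchase: enqueue one entry per unit
            for _ in range(int(round(row['qty']))):
                q.put(row['date'])
        elif row['qty'] < 0:      # sale or expiry: dequeue FIFO
            for _ in range(int(round(-row['qty']))):
                purchase_date = q.get()
                if row['expired']:
                    rows.append([purchase_date, row['date'], pid])
    return pd.DataFrame(rows, columns=['purchase_date', 'expired_date', 'p_id'])

when_product_expired = (df.groupby('p_id', group_keys=False)
                          .apply(expiries_for_product)
                          .reset_index(drop=True))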
So while this is not a scalable solution, I hope it helps you a bit. Good luck!