Looking at a specific timeframe for consecutive dates - python

This is a follow-up to the question Check for 3 consecutive declined payment dates based on name that I asked the other day.
The following is the same as the question above, except that Jim has 2 loans to repay and the dates in df2 are not sorted in any particular order.
Jim and Bob are part of a loan repayment program. They need to make monthly payments to pay off their loans. Sometimes a payment can be declined for various reasons. I would like to find when there are 3 consecutive declined payments in a row (so no complete payment in between any of the 3).
I believe the proper way to solve this is to look at a certain time frame. Something that looks within a 3-month window prior to the due date seems like a good strategy.
Here are the dataframes:
DF1
| Name | ID | Due Date |
| -----| ---- |------------------------|
| Jim | 1 | 2020-05-10 |
| Bob | 2 | 2021-06-11 |
| Jim | 3 | 2022-06-10 |
Here we have the "payment" dataframe.
DF2
| Name | Payment Date | Declined/Complete |
| -----|------------ | -------------------|
| Jim | 2020-04-5 | declined |
| Jim | 2020-03-9 | declined |
| Jim | 2020-05-6 | declined |
| Bob | 2021-04-11 | declined |
| Bob | 2021-03-20 | complete |
| Bob | 2021-05-11 | declined |
| Jim | 2022-04-3 | declined |
| Jim | 2022-03-5 | complete |
| Jim | 2022-05-15 | declined |
Jim (ID = 1) has had 3 consecutive declined payments before his due date, so he gets flagged (1).
Bob (ID = 2) had a complete payment in between his last 3 payments, so he does not get flagged (0).
Jim (ID = 3) has a complete payment between two declines, so he does not get flagged (0).
Expected Output
| Name | ID |Due Date | 3 consecutive declines |
| -----|----|-------------|---------------------------|
| Jim | 1 |2020-05-10 |1 |
| Bob | 2 |2021-06-11 |0 |
| Jim | 3 |2022-06-10 |0 |

Here is one way to do it.
BTW, we need an ID column in DF2 in order to map; otherwise, both Jims will be treated as a single person.
import numpy as np
import pandas as pd

# assumes df2 carries the loan ID and the column names have no spaces
# (ID, DueDate, PaymentDate, Declined_Complete)

# map the due date into df2, so we can filter on payments prior to the due date
df2['DueDate'] = df2['ID'].map(df1.set_index('ID')['DueDate'])

# parse the date columns as datetimes
df2['PaymentDate'] = pd.to_datetime(df2['PaymentDate'])
df2['DueDate'] = pd.to_datetime(df2['DueDate'])

# count the declined payments before the due date for each ID
ref = df2[(df2['PaymentDate'] <= df2['DueDate']) &
          (df2['Declined_Complete'] == 'declined')
         ].groupby('ID', as_index=False).size()
ref

# map the declined count back onto DF1
df1['flag'] = df1['ID'].map(ref.set_index('ID')['size'])

# flag the customers that reached the threshold number of missed payments, here 3
missed_payment_count = 3
df1['flag'] = np.where(df1['flag'] >= missed_payment_count, 1, 0)
df1
Name ID DueDate flag
0 Jim 1 2020-05-10 1
1 Bob 2 2021-06-11 0
2 Jim 3 2022-06-10 0
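Counting declines flags the right rows for this sample, but it does not literally enforce "no complete payment in between". A minimal sketch of a stricter check, reusing the DueDate/PaymentDate columns created above and the 3-month window the question mentions (the helper function is mine, not part of the original answer):
# keep only payments inside the 3-month window before each due date
window = df2[(df2['PaymentDate'] <= df2['DueDate']) &
             (df2['PaymentDate'] >= df2['DueDate'] - pd.DateOffset(months=3))]

# flag an ID only when its last `n` payments in the window are all declined
def last_n_all_declined(g, n=3):
    last = g.sort_values('PaymentDate').tail(n)
    return int(len(last) == n and (last['Declined_Complete'] == 'declined').all())

flags = window.groupby('ID').apply(last_n_all_declined)
df1['3 consecutive declines'] = df1['ID'].map(flags).fillna(0).astype(int)
On the sample data this also yields 1, 0, 0, and a complete payment inside the window always breaks the run.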

Related

How to generate new variable (column) based on a specific given column containing cycles of numbers from 0 to n, where n is a positive integer

The dataset contains data about COVID-19 patients. It comes in both Excel and CSV file formats and contains several variables and over 7 thousand records (rows), which makes the problem extremely hard and very time-consuming to solve manually. Below are the 4 most important variables (columns) needed to solve the problem: 1: id, identifying each record (row); 2: day_at_hosp, one row for each day a patient remained admitted at the hospital; 3: sex of the patient; 4: death, whether the patient eventually died or survived.
I want to create a new variable total_days_at_hosp that contains the total number of days a patient remained admitted at the hospital.
Old Table:
_______________________________________
| id | day_at_hosp | sex | death |
|_______|_____________|________|________|
| 1 | 0 | male | no |
| 2 | 1 | | |
| 3 | 2 | | |
| 4 | 0 | female | no |
| 5 | 1 | | |
| 6 | 0 | male | no |
| 7 | 0 | female | no |
| 8 | 0 | male | no |
| 9 | 1 | | |
| 10 | 2 | | |
| 11 | 3 | | |
| 12 | 4 | | |
| ... | ... | ... | ... |
| 7882 | 0 | female | no |
| 7883 | 1 | | |
|_______|_____________|________|________|
New Table:
I want to convert table above into table below:
____________________________________________
| id |total_days_at_hosp| sex | death |
|_______|__________________|________|________|
| 1 | 3 | male | no |
| 4 | 2 | male | yes |
| 6 | 1 | male | yes |
| 7 | 1 | female | no |
| 8 | 5 | male | no |
| ... | ... | ... | ... |
| 2565 | 2 | female | no |
|_______|__________________|________|________|
NOTE: the id column numbers every record entered, and multiple records were entered for each patient depending on how long the patient remained admitted at the hospital. The day_at_hosp variable counts days: 0 = initial day at hospital, 1 = second day at hospital, ..., n = last day at hospital.
The record (row) where day_at_hosp is 0 carries the entries for all the other columns; if day_at_hosp is not 0 (say 1, 2, 3, ..., n), the row belongs to the patient right above it, and all the corresponding variables (columns) are left blank.
However, the dataset I need should look like the New Table above. It should include a new variable (column) called total_days_at_hosp, generated from day_at_hosp. The new variable is more useful for the statistical tests to be conducted and will replace day_at_hosp, so that all blank rows can be deleted.
To move from the old table to the new table, the program should do the following:
day_at_hosp ===> total_days_at_hosp
0
1 ---> 3
2
-------------------------------------
0 ---> 2
1
-------------------------------------
0 ---> 1
-------------------------------------
0 ---> 1
-------------------------------------
0
1
2 ---> 5
3
4
-------------------------------------
...
-------------------------------------
0 ---> 2
1
-------------------------------------
How can I achieve this?
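A minimal pandas sketch of this transformation (pandas is used elsewhere on this page; the CSV file name is a placeholder, and this assumes each hospital day has its own row, exactly as in the old table):
import pandas as pd

df = pd.read_csv('patients.csv')  # placeholder file name

# every 0 in day_at_hosp opens a new patient block
df['patient'] = (df['day_at_hosp'] == 0).cumsum()

# one row per patient: block size = total days, other columns come from the block's first row
new = (df.groupby('patient')
         .agg(id=('id', 'first'),
              total_days_at_hosp=('day_at_hosp', 'size'),
              sex=('sex', 'first'),
              death=('death', 'first'))
         .reset_index(drop=True))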
Here is another formula option, one that works without a dummy value placed at the end of the Old/New Table.
1] Create the New Table:
Copy and paste all the Old Table data to an unused area.
Click "AutoFilter".
In the "days_at_hospital" column, select the value 0.
Copy and paste the filtered admissions into New Table column F.
Delete all 0s in the rows of column G.
Then,
2] In G2, enter this formula and copy it down:
=IF(F2="","",IF(F3="",MATCH(9^9,A:A)+1,MATCH(F3,A:A,0))-MATCH(F2,A:A,0))
The difference of the two MATCH positions is the number of rows in the patient's block, which equals the total days at hospital; MATCH(9^9,A:A) locates the last numeric id so that the final patient is counted too.
Remark: if your ID column contains text values, the formula changes to:
=IF(F2="","",IF(F3="",MATCH("zzz",A:A)+1,MATCH(F3,A:A,0))-MATCH(F2,A:A,0))
It is apparent that your data are sorted by patient, and that your desired table will be much 'shorter'. Accordingly, the starting point for this answer is to apply an AutoFilter to your original data, setting the filter criterion to days_at_hospital = 0, and then to copy the filtered admissions to column F.
After deleting the old column G data, the formula below can then be entered in cell G2 and copied down:
=INDEX(B:B,MATCH(F3,A:A,0)-1)+1
To keep the formula simple, the same dummy maximum value should be entered at the end of both the old and new tables.

How to identify a first occurrence of a condition in a Python data frame and perform a calculation on it?

I am really struggling to work out the logic for this. I have a data set with a column called Col, as shown below. I am using Python and pandas.
I want to add a new column called "STATUS". The logic is:
a. When Col == 0, I will Buy. But a Buy can only happen when Col == 0 is the first value in the data set or when it comes after a Sell in the Status column. There cannot be two Buy values without a Sell in between.
b. When Col <= -8, I will Sell. But this can only happen if there is a Buy preceding it in the Status column. There cannot be two Sells without a Buy in between them.
I have provided an example of the output I want. Any help is really appreciated.
Here the raw data is in the column Col, and the output I want is in Status:
+-------+--------+
| Col | Status |
+-------+--------+
| 0 | Buy |
| -1.41 | 0 |
| 0 | 0 |
| -7.37 | 0 |
| -8.78 | Sell |
| -11.6 | 0 |
| 0 | Buy |
| -5 | 0 |
| -6.1 | 0 |
| -8 | Sell |
| -11 | 0 |
| 0 | Buy |
| 0 | 0 |
| -9 | Sell |
+-------+--------+
Took me some time.
This relies on the following property: the last order you can see from now, even if you chose not to send it, is always the last decision that you took. (Otherwise it would have been sent.)
# +1 marks a possible Buy (Col == 0), -1 a possible Sell (Col <= -8)
df['order'] = (df['Col'] == 0).astype(int) - (df['Col'] <= -8).astype(int)
orders_no_filter = df.loc[df['order'] != 0, 'order']
# only the first of a run of identical orders is actually sent (Buy/Sell must alternate)
possible = (orders_no_filter != orders_no_filter.shift(1))
df['order'] = df['order'] * possible.reindex(df.index, fill_value=0)
# map back to the labels used in the expected output
df['Status'] = df['order'].map({1: 'Buy', -1: 'Sell', 0: 0})
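The shift comparison works because two consecutive non-zero orders with the same sign would mean two Buys or two Sells in a row, which the rules forbid, so only the first of such a run is kept. With the sample data this reproduces the Status column shown in the expected output.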

Transforming data frame (row to column and count)

Sorry for the dumb question, but I got stuck. I have a dataframe with the following structure:
| # | ID | Cause | Date |
|---|----|--------|---------------------|
| 1 | AR | SGNLss | 10-05-2019 05:01:00 |
| 2 | TD | PTRXX | 12-05-2019 12:15:00 |
| 3 | GZ | FAIL | 10-05-2019 05:01:00 |
| 4 | AR | PTRXX | 12-05-2019 12:15:00 |
| 5 | GZ | SGNLss | 10-05-2019 05:01:00 |
| 6 | AR | FAIL | 10-05-2019 05:01:00 |
What I want is to convert the Date column values into columns rounded to the day, so that the expected DF will have ID plus 10-05-2019, 11-05-2019, 12-05-2019, ... columns, and the values will be the number of events (Causes) that happened for that ID.
It's not a problem to round to the day and to count values separately, but I can't work out how to do both operations together.
You can use pd.crosstab:
pd.crosstab(df['ID'], df['Date'].dt.date)
Output:
Date 2019-10-05 2019-12-05
ID
AR 2 1
GZ 2 0
TD 0 1
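Note that this assumes df['Date'] already holds datetime values. If the column is stored as text, convert it first; the format string below is an assumption based on the month-first dates shown in the output above:
df['Date'] = pd.to_datetime(df['Date'], format='%m-%d-%Y %H:%M:%S')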

Reorganize Pandas Dataframe

I have the following columns in a DataFrame:
| invoice_number | client | tax_rate_1_isp | tax_base_1_isp | tax_1_isp | tax_rate_2_isp | tax_base_2_isp | tax_2_isp | tax_rate_1_no_isp | tax_base_1_no_isp | tax_1_no_isp | tax_rate_2_no_isp | tax_base_2_no_isp | tax_2_no_isp | status |
|----------------|---------|----------------|----------------|-----------|----------------|----------------|-----------|-------------------|-------------------|--------------|-------------------|-------------------|--------------|---------|
| #1 | client1 | 15% | 100 | 15 | | | | 0% | 100 | 0 | 10% | 200 | 20 | correct |
| #2 | client2 | 0% | 300 | 0 | | | | 10% | 100 | 10 | | | | correct |
And I would like to reorganize the DataFrame so it looks like this:
invoice_number client tax_type tax_rate tax_base tax status
#1 client1 isp 15% 100 15 correct
#1 client1 no_isp 0% 100 0 correct
#1 client1 no_isp 10% 200 20 correct
#2 client2 isp 0% 300 0 correct
#2 client2 no_isp 10% 100 10 correct
where a new line is created for each group of tax_rate, tax_base and tax, keeping the same information for the rest of the columns, and a new column specifies which type of tax (isp or no_isp) the row corresponds to, as identified in the column names of the first DataFrame.
The goal of doing this is at the end be able to create a pivot table from the data.
Is there an efficient way to do that?
What I am doing now, and it is painful, is to create different DataFrames selecting the columns that correspond to the same tax group, filter those DataFrames to keep only the rows with data, and append them to a DataFrame that has the structure I need.
What I shared is an example; the actual data could easily have more than 50 tax groups...
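A minimal sketch with pd.wide_to_long, assuming every tax column follows the tax_rate/tax_base/tax plus _{n}_{isp|no_isp} naming shown above (df is the original frame):
import pandas as pd

# melt the numbered tax columns into long format, one row per (invoice, tax group)
long = pd.wide_to_long(
    df,
    stubnames=['tax_rate', 'tax_base', 'tax'],
    i=['invoice_number', 'client', 'status'],
    j='group',
    sep='_',
    suffix=r'\d+_(?:no_)?isp',
).reset_index()

# '1_isp' -> 'isp', '2_no_isp' -> 'no_isp'
long['tax_type'] = long['group'].str.split('_', n=1).str[1]

# drop the empty tax groups and order the columns as in the expected output
long = (long.dropna(subset=['tax_base'])
            .drop(columns='group')
            [['invoice_number', 'client', 'tax_type',
              'tax_rate', 'tax_base', 'tax', 'status']])
This scales to any number of tax groups, because the group number and tax type are recovered from the column names instead of being listed by hand.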

Last Touch Attribution in MySQL

Conversions
| user_id | tag | timestamp |
|---------|--------|---------------------|
| 1 | click1 | 2016-11-01 01:20:39 |
| 2 | click2 | 2016-11-01 09:48:10 |
| 3 | click1 | 2016-11-04 14:27:22 |
| 4 | click4 | 2016-11-05 17:50:14 |
User Sessions
| user_id | utm_campaign | session_start |
|---------|--------------|---------------------|
| 1 | outbrain_2 | 2016-11-01 00:15:34 |
| 1 | email | 2016-11-01 01:00:29 |
| 2 | google_1 | 2016-11-01 08:24:39 |
| 3 | google_4 | 2016-11-04 14:25:06 |
| 4 | google_1 | 2016-11-05 17:43:02 |
Given the 2 tables above, I want to map each conversion event to the most recent campaign that brought a particular user to a site (aka last touch/last click attribution).
The desired output is a table of the format:
| user_id | tag | timestamp | campaign |
|---------|--------|---------------------|----------|
| 1 | click1 | 2016-11-01 01:20:39 | email |
| 2 | click2 | 2016-11-01 09:48:10 | google_1 |
| 3 | click1 | 2016-11-04 14:27:22 | google_4 |
| 4 | click4 | 2016-11-05 17:50:14 | google_1 |
Note how user 1 visited the site via the outbrain_2 campaign and then came back to the site via the email campaign. Sometime during the user's second visit, they converted, thus the conversion should be attributed to email and not outbrain_2.
Is there a way to do this in MySQL or Python?
You can do this in Python with pandas. I assume you can load the data from the MySQL tables into pandas dataframes conversions and sessions. First, concatenate both tables:
import numpy as np
import pandas as pd

# "events" avoids shadowing the Python builtin all()
events = pd.concat([conversions, sessions])
Some of the elements in the new frame will be NAs. Create a new column that collects the time stamps from both tables:
events["ts"] = np.where(events["session_start"].isnull(),
                        events["timestamp"],
                        events["session_start"])
Sort by this column, forward-fill the campaign values within each user (a plain ffill over the whole frame could leak one user's campaign into another user's rows), group by the user ID, and select the last (most recent) row from each group:
events = events.sort_values("ts")
events["utm_campaign"] = events.groupby("user_id")["utm_campaign"].ffill()
groups = events.groupby("user_id", as_index=False).last()
Select the right columns and rename utm_campaign to match the desired output:
result = (groups[["user_id", "tag", "timestamp", "utm_campaign"]]
          .rename(columns={"utm_campaign": "campaign"}))
I tried this code with your sample data and got the right answer.
