I'm working with a dataframe that has two possible ways of labeling canceled loans, and I want to create a column with a calculated CHARGEOFF_DATE.
First, the account can be labeled 'Canceled' in df['ACCOUNT_CODE'], in which case the charge-off date is df['ACCOUNT_CODE_DATE'].
Second, I want to treat any loan that is 120+ days delinquent as charged off, with a date 121 days after df['LAST_PAYMENT_DATE'].
I keep getting an error about data types when using nested np.where, which confuses me because both 'true' values I am returning from np.where are in datetime format.
Pseudocode: if ACCOUNT_CODE is equal to 'Canceled', then return ACCOUNT_CODE_DATE.
Else if DAYS_DELINQUENT is greater than 120, return LAST_PAYMENT_DATE shifted forward 121 days.
Else, return BLANK (I tried np.nan and '')
Here is the code I have so far:
import datetime as dt
import numpy as np

delta = dt.timedelta(days=121)  # set delta
df['CHARGEOFF_DATE'] = np.where(df['ACCOUNT_CODE'] == 'Canceled',
                                df['ACCOUNT_CODE_DATE'],
                                np.where(df['DAYS_DELINQUENT'] > 120,
                                         df['LAST_PAYMENT_DATE'] + delta,
                                         np.nan))
error I'm receiving:
23 delta = dt.timedelta(days=121) #set delta
---> 24 df['CHARGEOFF_DATE'] = np.where(df['ACCOUNT_CODE'] == 'Canceled',df['ACCOUNT_CODE_DATE'],(np.where(df['DAYS_DELINQUENT'] > 120,(df['LAST_PAYMENT_DATE']+delta), '')))
25
26
<__array_function__ internals> in where(*args, **kwargs)
TypeError: The DTypes <class 'numpy.dtype[float16]'> and <class 'numpy.dtype[datetime64]'> do not have a common DType. For example they cannot be stored in a single array unless the dtype is `object`.
EDIT:
Here are the column data types
LAST_PAYMENT_DATE datetime64[ns]
DAYS_DELINQUENT int64
ACCOUNT_CODE object
ACCOUNT_CODE_DATE datetime64[ns]
Mock Dataset:
| LAST_PAYMENT_DATE| DAYS_DELINQUENT | ACCOUNT_CODE |ACCOUNT_CODE_DATE|
| :----------------| :-------------- | :----------- | :-------------- |
| 06/15/2020 | 0 | Paid in Full | 06/15/2020 |
| 03/10/2021 | 362 | Legal Category | 07/09/2021 |
| 10/03/2019 | 1186 | Canceled | 01/15/2020 |
| 08/04/2021 | 150 | Legal Category | 02/07/2021 |
| 09/02/2021 | 90 | Automatic Payment | 03/20/2019 |
I was able to land a solution by defining a function:
delta = dt.timedelta(days=121)  # set days delta

def coFunc(row):  # test the account code first, otherwise test days delinquent for greater than 120
    if row['ACCOUNT_CODE'] == 'Canceled':
        return row['ACCOUNT_CODE_DATE']
    elif row['DAYS_DELINQUENT'] > 120:
        return row['LAST_PAYMENT_DATE'] + delta
    else:
        return ''

df['CHARGEOFF_DATE'] = df.apply(coFunc, axis=1)  # apply function to df
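For what it's worth, the same logic can also be written as a single vectorized expression if the "blank" value is pd.NaT rather than '' or np.nan, since NaT is the missing-value marker that keeps the column as datetime64[ns] and avoids the mixed-dtype problem entirely. A minimal sketch, assuming the column names and dtypes listed above:

import datetime as dt
import pandas as pd

delta = dt.timedelta(days=121)

# Inner .where(): shifted LAST_PAYMENT_DATE where 120+ days delinquent, NaT elsewhere.
# Outer .where(): ACCOUNT_CODE_DATE where Canceled, otherwise fall back to the inner result.
df['CHARGEOFF_DATE'] = df['ACCOUNT_CODE_DATE'].where(
    df['ACCOUNT_CODE'] == 'Canceled',
    (df['LAST_PAYMENT_DATE'] + delta).where(df['DAYS_DELINQUENT'] > 120),
)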
I have a pandas df (called df2) like this:
id | orderdate |
___________________
123|2020-11-01 |
123|2020-08-01 |
233|2020-07-01 |
233|2020-11-04 |
444|2020-11-04 |
444|2020-05-03 |
444|2020-04-01 |
444|2020-11-25 |
The values of orderdate are datetime with the format '%Y%m%d'. They represent orders of a client. I want to calculate the delta time between the first order and the second one for each id (each client).
I come up with:
for i in list(set(df2.id)):
    list_sorted = list(set(df2.loc[df2['id'] == i, 'orderdate']))
    list_sorted = sorted(list_sorted)  # sorted list of the order dates in ascending order
    min_list = list_sorted[0]  # first element is the first order
    df2.loc[df2['id'] == i, 'First Order'] = min_list
    if len(list_sorted) > 1:
        penultimate_list = list_sorted[1]  # second element is the second order
        df2.loc[df2['id'] == i, 'Second Order'] = penultimate_list
        df2.loc[df2['id'] == i, 'Delta orders'] = penultimate_list - min_list  # second minus first gives a positive delta
    else:
        df2.loc[df2['id'] == i, 'Delta orders'] = None
My expected outcome is:
id | orderdate | First Order | Second Order| Delta Orders
______________________________________________
123|2020-11-01 |2020-08-01 | 2020-11-01 | 92 days
123|2020-08-01 |2020-08-01 | 2020-11-01 | 92 days
233|2020-07-01 |2020-07-01 | 2020-11-04 | 126 days
233|2020-11-04 |2020-07-01 | 2020-11-04 | 126 days
444|2020-11-04 |2020-04-01 | 2020-05-03 | 32 days
444|2020-05-03 |2020-04-01 | 2020-05-03 | 32 days
444|2020-04-01 |2020-04-01 | 2020-05-03 | 32 days
444|2020-11-25 |2020-04-01 | 2020-05-03 | 32 days
It works but I feel like it's cumbersome. Any easier way to do it?
Slightly different from what you want, but it's a start:
import pandas as pd
from io import StringIO
data = StringIO(
"""id|orderdate
123|2020-11-01
123|2020-08-01
233|2020-07-01
233|2020-11-04
444|2020-11-04
444|2020-05-03
444|2020-04-01
444|2020-11-25 """)
df = pd.read_csv(data, sep='|')
df['orderdate'] = pd.to_datetime(df['orderdate'], infer_datetime_format=True)
df = df.sort_values(['id', 'orderdate'], ascending=False)
def date_diff(df):
    df['order_time_diff'] = (df['orderdate'] - df['orderdate'].shift(-1)).dt.days
    df = df.dropna()
    return df
# this calculates all order differences
df.groupby('id').apply(date_diff)
# this will get the data as requested
df.groupby('id', as_index=False).apply(date_diff).groupby('id').tail(1)
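If you want exactly the First Order / Second Order / Delta orders columns from the expected output, a groupby/transform variant may be closer. This is only a sketch, assuming df2 has the id and orderdate (datetime64[ns]) columns from the question; add a drop_duplicates step first if the same date can appear twice for one client:

import pandas as pd

g = df2.groupby('id')['orderdate']

df2['First Order'] = g.transform('min')
# second-earliest date per id; NaT for clients with only one order
df2['Second Order'] = g.transform(
    lambda s: s.sort_values().iloc[1] if len(s) > 1 else pd.NaT
)
df2['Delta orders'] = df2['Second Order'] - df2['First Order']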
I have daily profit data and I'm trying to find the best combinations of two assets that will return the highest profit. I need to purchase one asset long and short the other and find the best performing pair over a time window.
I can accomplish this by searching through all the permutations, but it's extremely slow. (no surprise there) I think this might be the type of problem suited for linear optimization with a library like PuLP.
Here's a sample of solving the problem exhaustively. I'm intentionally keeping the data simple, but I have 1000 assets that I need to search through. It took about 45 minutes to finish with the inefficient, manual approach I outline below.
Note: Because going long "Alpha" and short "Bravo" is different than going long "Bravo" and going short "Alpha", I'm using permutations, not combinations.
Edit: in case some aren't familiar with going long and short: I'm trying to pair the highest profit with the lowest profit (with a short, the more negative the value is, the more profit I make).
The logic would read something like this:
For all the permutations of nodes, add the node one profit to the inverse of the node two profit to get a total profit. Find the pair that has the highest total profit.
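For example, if node one makes 10 on a given day and node two makes -4, the pair's total profit for that day is 10 + (-1 × -4) = 14.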
Here's my very inefficient (but working) implementation:
import itertools

import pandas as pd

# Sample data
profits = [
('2019-11-18', 'Alpha', -79.629698),
('2019-11-19', 'Alpha', -17.452517),
('2019-11-20', 'Alpha', -19.069558),
('2019-11-21', 'Alpha', -66.061564),
('2019-11-18', 'Bravo', -87.698670),
('2019-11-19', 'Bravo', -73.812616),
('2019-11-20', 'Bravo', 198.513246),
('2019-11-21', 'Bravo', -69.579466),
('2019-11-18', 'Charlie', 66.302287),
('2019-11-19', 'Charlie', -16.132065),
('2019-11-20', 'Charlie', -123.735898),
('2019-11-21', 'Charlie', -30.046416),
('2019-11-18', 'Delta', -131.682322),
('2019-11-19', 'Delta', 13.296473),
('2019-11-20', 'Delta', 23.595053),
('2019-11-21', 'Delta', 14.103027),
]
profits_df = pd.DataFrame(profits, columns=('Date','Node','Profit')).sort_values('Date')
profits_df looks like this:
+----+------------+---------+-------------+
| | Date | Node | Profit |
+----+------------+---------+-------------+
| 0 | 2019-11-18 | Alpha | -79.629698 |
| 4 | 2019-11-18 | Bravo | -87.698670 |
| 8 | 2019-11-18 | Charlie | 66.302287 |
| 12 | 2019-11-18 | Delta | -131.682322 |
| 1 | 2019-11-19 | Alpha | -17.452517 |
+----+------------+---------+-------------+
To solve the problem manually, I can do this:
date_dfs = []

# I needed a way to take my rows and combine them pairwise; this
# is kind of gross but it does work
for date, date_df in profits_df.groupby('Date'):
    tuples = [tuple(x) for x in date_df[['Node', 'Profit']].to_numpy()]
    pp = list(itertools.permutations(tuples, 2))
    flat_pp = [[p[0][0], p[0][1], p[1][0], p[1][1]] for p in pp]
    df = pd.DataFrame(flat_pp, columns=['Long', 'LP', 'Short', 'SP'])
    date_dfs.append(df)

result_df = pd.concat(date_dfs)
result_df['Pair'] = result_df['Long'] + '/' + result_df['Short']
result_df['Profit'] = result_df['LP'] + result_df['SP'].multiply(-1)
result_df.groupby('Pair')['Profit'].sum().sort_values(ascending=False)
By computing the profits for all permutations each day then summing them up, I get this result:
+-----------------------------+
| Pair |
+-----------------------------+
| Bravo/Alpha 149.635831 |
| Delta/Alpha 101.525568 |
| Charlie/Alpha 78.601245 |
| Bravo/Charlie 71.034586 |
| Bravo/Delta 48.110263 |
| Delta/Charlie 22.924323 |
| Charlie/Delta -22.924323 |
| Delta/Bravo -48.110263 |
| Charlie/Bravo -71.034586 |
| Alpha/Charlie -78.601245 |
| Alpha/Delta -101.525568 |
| Alpha/Bravo -149.635831 |
+-----------------------------+
I'm certain there is a more efficient way to go about this. I don't understand the intricacies of optimization, but I know of it enough to know it's a possible solution. I don't understand the difference between linear optimization and non-linear, so I apologize if I'm getting the nomenclature wrong.
Can anyone suggest an approach I should try?
Summary of what I did:

- Create a dictionary from the profits list.
- Run the permutations for each key/value pair.
- Iterate through each pair to combine the names and the amounts separately.
- Sort the container list by name, group by name, sum the amounts for each group, and load the final result into a dictionary.
- Read the dictionary into a dataframe and sort values by Profit in descending order.
I believe all the processing should be done before the data comes into the dataframe; that should give you a significant speed-up:
from collections import defaultdict
from operator import itemgetter
from itertools import permutations, groupby

import pandas as pd

d = defaultdict(list)
for k, v, s in profits:
    d[k].append((v, s))

container = []
for k, v in d.items():
    l = permutations(v, 2)
    # here I combine the names and the amounts separately into A and B
    for i, j in l:
        A = i[0] + '_' + j[0]
        B = i[-1] + (j[-1] * -1)
        container.append([A, B])

# here I sort the list, then groupby (groupby won't work if you don't sort first)
container = sorted(container, key=itemgetter(0, 1))

sam = dict()
for name, amount in groupby(container, key=itemgetter(0)):
    sam[name] = sum(i[-1] for i in amount)

outcome = (pd.DataFrame
           .from_dict(sam, orient='index', columns=['Profit'])
           .sort_values(by='Profit', ascending=False))
Profit
Bravo_Alpha 149.635831
Delta_Alpha 101.525568
Charlie_Alpha 78.601245
Bravo_Charlie 71.034586
Bravo_Delta 48.110263
Delta_Charlie 22.924323
Charlie_Delta -22.924323
Delta_Bravo -48.110263
Charlie_Bravo -71.034586
Alpha_Charlie -78.601245
Alpha_Delta -101.525568
Alpha_Bravo -149.635831
When I ran it on my PC it came in at 1.24 ms, while yours took 14.1 ms. Hopefully someone can produce something much faster.
UPDATE:
Everything I did in the first version was unnecessary. There is no need for permutations, because the multiplier is just -1: get the sum for each name, pair the names (with no repeats), multiply one of the values by -1 and add it to the other, and once we have the lump sum for a pair, multiply by -1 again to get the reverse pairing. I got a speed of about 18.6 µs, which shoots up to 273 µs once pandas is introduced; still a significant speed-up, and most of the compute is now spent reading the data into pandas. Here goes:
from collections import defaultdict
from operator import itemgetter
from itertools import combinations, chain

import pandas as pd


def optimizer(profits):
    nw = defaultdict(list)
    for dat, node, profit in profits:
        nw[node].append(profit)

    # sum the total for each key
    B = {key: sum(value) for key, value in nw.items()}

    # multiply the value of the second item in the tuple by -1,
    # add that to the value of the first item in the tuple,
    # and pair the result back to the tuple to form a dict
    sumr = {(first, last): sum((B[first], B[last] * -1))
            for first, last in combinations(B.keys(), 2)}

    # reverse the positions in the tuple for each key,
    # multiply the value by -1, and pair to form a dict
    rev = {tuple(reversed(k)): v * -1 for k, v in sumr.items()}

    # join the two dictionaries into one,
    # sort in descending order,
    # and create a dictionary
    result = dict(sorted(chain(sumr.items(), rev.items()),
                         key=itemgetter(-1),
                         reverse=True))

    # load into pandas
    # (trying to reduce the compute time here by reducing pandas' workload)
    return pd.DataFrame(list(result.values()),
                        index=list(result.keys()))
I would probably delay reading into the dataframe till it is unavoidable. I'd love to know what the actual speed was when you run it on your end.
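A quick usage sketch, assuming the profits list of (date, node, profit) tuples from the question:

# `profits` is the list of (date, node, profit) tuples from the question
outcome = optimizer(profits)
print(outcome)  # pairs as the index, summed long/short profit as the single column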
This isn't technically the answer because it's not solved using optimization techniques, but hopefully someone might find it useful.
From testing, it's the construction and concatenation of the DataFrames that's the slow part. It's trivially fast to use NumPy to create a matrix of pair profits:
arr = df['profit'].values + df['profit'].multiply(-1).values[:, None]
This produces a matrix of pairwise profits, where each column is the node held long and each row the node held short:
+---+-------------+------------+------------+------------+
| | 0 | 1 | 2 | 3 |
+---+-------------+------------+------------+------------+
| 0 | 0.000000 | 149.635831 | 78.598163 | 101.525670 |
+---+-------------+------------+------------+------------+
| 1 | -149.635831 | 0.000000 | -71.037668 | -48.110161 |
+---+-------------+------------+------------+------------+
| 2 | -78.598163 | 71.037668 | 0.000000 | 22.927507 |
+---+-------------+------------+------------+------------+
| 3 | -101.525670 | 48.110161 | -22.927507 | 0.000000 |
+---+-------------+------------+------------+------------+
If you construct an empty numpy array with dimensions of number of nodes * number of nodes, then you can simply add the daily array to the totals array:
total_arr = np.zeros((4, 4))
# Do this for each day
arr = df['profit'].values + df['profit'].multiply(-1).values[:, None]
total_arr += arr
Once you have that, you need to do some Pandas voodoo to assign the node names to the matrix and unstack the matrix into individual long/short/profit rows.
My original (exhaustive) search took 47 minutes with 60 days of data. It's down to 13 seconds now.
Full working example:
import numpy as np
import pandas as pd

profits = [
{'date':'2019-11-18', 'node':'A', 'profit': -79.629698},
{'date':'2019-11-19', 'node':'A', 'profit': -17.452517},
{'date':'2019-11-20', 'node':'A', 'profit': -19.069558},
{'date':'2019-11-21', 'node':'A', 'profit': -66.061564},
{'date':'2019-11-18', 'node':'B', 'profit': -87.698670},
{'date':'2019-11-19', 'node':'B', 'profit': -73.812616},
{'date':'2019-11-20', 'node':'B', 'profit': 198.513246},
{'date':'2019-11-21', 'node':'B', 'profit': -69.579466},
{'date':'2019-11-18', 'node':'C', 'profit': 66.3022870},
{'date':'2019-11-19', 'node':'C', 'profit': -16.132065},
{'date':'2019-11-20', 'node':'C', 'profit': -123.73898},
{'date':'2019-11-21', 'node':'C', 'profit': -30.046416},
{'date':'2019-11-18', 'node':'D', 'profit': -131.68222},
{'date':'2019-11-19', 'node':'D', 'profit': 13.2964730},
{'date':'2019-11-20', 'node':'D', 'profit': 23.5950530},
{'date':'2019-11-21', 'node':'D', 'profit': 14.1030270},
]
# Initialize a Numpy array of node_length * node_length dimension
profits_df = pd.DataFrame(profits)
nodes = profits_df['node'].unique()
total_arr = np.zeros((len(nodes), len(nodes)))
# For each date, calculate the pairs profit matrix and add it to the total
for date, date_df in profits_df.groupby('date'):
    df = date_df[['node', 'profit']].reset_index()
    arr = df['profit'].values + df['profit'].multiply(-1).values[:, None]
    total_arr += arr
# This will label each column and row
nodes_series = pd.Series(nodes, name='node')
perms_df = pd.concat((nodes_series, pd.DataFrame(total_arr, columns=nodes_series)), axis=1)
# This collapses our matrix back to long, short, and profit rows with the proper column names
perms_df = perms_df.set_index('node').unstack().to_frame(name='profit').reset_index()
perms_df = perms_df.rename(columns={'level_0': 'long', 'node': 'short'})
# Get rid of long/short pairs where the nodes are the same (not technically necessary)
perms_df = perms_df[perms_df['long'] != perms_df['short']]
# Let's see our profit
perms_df.sort_values('profit', ascending=False)
Result:
+----+------+-------+-------------+
| | long | short | profit |
+----+------+-------+-------------+
| 4 | B | A | 149.635831 |
+----+------+-------+-------------+
| 12 | D | A | 101.525670 |
+----+------+-------+-------------+
| 8 | C | A | 78.598163 |
+----+------+-------+-------------+
| 6 | B | C | 71.037668 |
+----+------+-------+-------------+
| 7 | B | D | 48.110161 |
+----+------+-------+-------------+
| 14 | D | C | 22.927507 |
+----+------+-------+-------------+
| 11 | C | D | -22.927507 |
+----+------+-------+-------------+
| 13 | D | B | -48.110161 |
+----+------+-------+-------------+
| 9 | C | B | -71.037668 |
+----+------+-------+-------------+
| 2 | A | C | -78.598163 |
+----+------+-------+-------------+
| 3 | A | D | -101.525670 |
+----+------+-------+-------------+
| 1 | A | B | -149.635831 |
+----+------+-------+-------------+
Thanks to sammywemmy for helping me organize the problem and come up with something useful.
This one is a bit of a doozy.
At a high level, I'm trying to figure out how to run a nested for loop. I'm essentially trying to iterate through columns and rows and perform a computational check that the outcome meets a specified requirement. If it does, the loop moves to the next row; if not, the pair is kicked out and the loop moves on to the next user.
Specifically, I want to perform a T-Test between a control/treatment group of users, and make sure the result is less than a pre-determined value.
Example:
I have my table of values - "DF" - there are 7 columns. The user_id column specifies the user's unique identifier. The user_type column is a binary classifier, users can be of either T (treatment) or C (control) types. The 3 "hour" columns are dummy number columns, values that I'll perform computation on. The mon column is the month, and tval is the number that the computation will have to be less than to be accepted.
In this case, the month is all January data. Each month can have a different tval.
DF
| user_id | user_type | hour1 | hour2 | hour3 | mon | tval |
|---------|-----------|-------|-------|-------|-----|------|
| 4 | T | 1 | 10 | 100 | 1 | 2.08 |
| 5 | C | 2 | 20 | 200 | 1 | 2.08 |
| 6 | C | 3 | 30 | 300 | 1 | 2.08 |
| 7 | T | 4 | 40 | 400 | 1 | 2.08 |
| 8 | T | 5 | 50 | 500 | 1 | 2.08 |
My goal is to iterate through each T user and, for each one, loop through each C user. For each pair, I want to perform a computation (t-test) between their hour1 values. If the value is less than tval, move on to the hour2 values, and so on. If not, the pair gets kicked out and the loop moves to the next C user without completing that C user's loop. If a pair passes every check, the two user_ids would be appended to a list or something external.
The output would hopefully look like a table of pairs: the T user and C user that passed the check for every hour column, plus the month that passed (since each set of users has data for all 12 months).
Output:
| t_userid | c_userid | month |
|--------- |-----------|-------|
| 4 | 5 | 1 |
| 8 | 6 | 1 |
To sum it all up:
For each T user:
    For each C user:
        If the t-test on t.hour1 and c.hour1 is less than X (passing the test):
            move to the next hour (hour2) and repeat
            If all hours pass, add the pair (T user_id, C user_id) to a separate list/series/df, etc.
        Else:
            skip the remaining hours and move on to the next C user.
I'm wondering if my data format is also incorrect. Would this be easier if I unpivoted my hourly data and iterated over each row? Any help is greatly appreciated. Thanks, and let me know if any clarification is necessary.
EDIT:
So far I've split the data between Treat and Control groups, and calculated the average and standard deviation of each user's monthly data (which is normally broken down by day) and added them as columns, hour1_avg and hour1_stdev. I've attempted another for loop, but am getting a ValueError.
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this is because I can't compare a pandas Series to a float, int, str, etc. I will make another post addressing this question.
Here's what I have so far:
for i in treatment.user_id:
    for j in control.user_id:
        if np.absolute((treatment['hour1_avg'] - control['hour1_avg'])
                       / np.sqrt((treatment['hour1_stdev']**2 / 31)
                                 + (control['hour1_stdev']**2 / 31))) > treatment.tval:
            "End loop, move to next control user"
        else:
            "copy/paste the if statement above, but for hour2, etc."
1. Split the dataframe into control and treatment groups.
2. Join the resulting dataframes on a common field (this will create all pairs).
3. Use a combination of apply and any to make the decision.
4. Filter the join using the decision vector.
Code to illustrate the idea:
# assuming the input is in df
control = df[df['user_type'] == 'C']
treatment = df[df['user_type'] == 'T']
# part 2: pairs will be created month-wise.
# If you want all vs all, create a temp field, e.g.: control['_temp'] = 1
pairs = treatment.merge(control, left_on='mon', right_on='mon')
# part 3
def test(row):
    # all() will stop executing at the first False
    return all(
        row['hour%d_x' % i] - row['hour%d_y' % i] < row['tval_x']
        for i in range(1, 4))
# all_less is a series of bool
all_less = pairs.apply(test, axis=1)
# part 4
output = pairs.loc[all_less, ['user_id_x', 'user_id_y', 'mon']].rename(
columns={'user_id_x': 't_user_id', 'user_id_y': 'c_user_id'})
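If you want the actual t-statistic from the question's edit rather than a plain difference, the same merge/apply pattern works. This is only a sketch, assuming the merged pairs frame carries hourN_avg and hourN_stdev columns for both users (suffixed _x for treatment and _y for control by the merge), roughly 31 daily observations behind each monthly average, and the per-month tval threshold:

import numpy as np

N_DAYS = 31  # assumed number of daily observations behind each monthly avg/stdev

def t_test_row(row, hours=(1, 2, 3)):
    # Welch-style t statistic per hour column; stop at the first failing hour
    for h in hours:
        t_stat = np.abs(
            (row['hour%d_avg_x' % h] - row['hour%d_avg_y' % h])
            / np.sqrt(row['hour%d_stdev_x' % h] ** 2 / N_DAYS
                      + row['hour%d_stdev_y' % h] ** 2 / N_DAYS)
        )
        if t_stat >= row['tval_x']:
            return False
    return True

passed = pairs.apply(t_test_row, axis=1)
output = pairs.loc[passed, ['user_id_x', 'user_id_y', 'mon']].rename(
    columns={'user_id_x': 't_user_id', 'user_id_y': 'c_user_id'})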