Find Pips value - 3 to 5 Digits forex pricing calculation - python

Newbie in Python asking! I am trying to subtract Price (trade opened) from Price.1 (trade closed) to get the number of pips, formatted properly without a decimal point. I could not proceed because of a restriction involving splitting into lists, so I tried the solution below, but it seems redundant.
I created 4 lists and 4 loops to transform the floats to strings and change the format so I could do the subtraction. Any idea how to get the number formatted correctly? Ideally something that goes directly into a Result column as a float: if there are 3 digits before the decimal point, multiply the difference by 100; if there is one digit before it, multiply by 10000.
# Price Trade Opened
listp = []
listpf = []
for i in df2['Price']:
    listp.append(format(i, '.5f'))
for i in listp:
    listpf.append(str(i))
# Price.1 trade closed.
listpp = []
listppf = []
for i in df2['Price.1']:
    listpp.append(format(i, '.5f'))
for i in listpp:
    listppf.append(str(i))
# Transform the lists into DataFrames and remove the decimal point, so that I could subtract.
df3 = pd.DataFrame(listp)
df3.columns = ['Price']
df3 = df3.stack().str.replace('.', '', regex=False).unstack()
df4 = pd.DataFrame(listpp)
df4.columns = ['Price1']
df4 = df4.stack().str.replace('.', '', regex=False).unstack()
dfc = pd.concat([df3, df4], axis=1)
dfc = dfc.fillna(0)
dfc.replace({'nan': 0}, inplace=True)
dfc['Price'] = pd.to_numeric(dfc['Price'])
dfc['Price1'] = pd.to_numeric(dfc['Price1'])
dfc['Result'] = (dfc['Price'] - dfc['Price1'])
dfc.head()

You should be able to calculate the difference between open and close values, and divide by the relevant multiplier for the pair. Like this:
def pip_calc(open, close):
    if str(open).index('.') >= 3:  # JPY pair
        multiplier = 0.01
    else:
        multiplier = 0.0001
    pips = round((close - open) / multiplier)
    return int(pips)
pip_calc(112.65, 112.68)
# 3
pip_calc(1.6566, 1.6568)
# 2
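If you want the result straight into a column without a per-row call, here is a minimal vectorized sketch, assuming df2 holds float columns 'Price' (open) and 'Price.1' (close) as in the question; it spots JPY-style pairs with a price >= 100 heuristic instead of the string index:
import numpy as np
# JPY-style pairs trade around 100, i.e. 3 digits before the decimal point
multiplier = np.where(df2['Price'] >= 100, 0.01, 0.0001)
df2['Result'] = ((df2['Price.1'] - df2['Price']) / multiplier).round().astype(int)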


Maximum Consecutive Ones/Trues per year that also considers the boundaries (Start-of-year and End-of-year)

The title says most of it: find the maximum run of consecutive Ones/1s (or Trues) for each year, and if the consecutive 1s at the end of a year continue into the following year, merge the two runs together.
I have tried to implement this below, but it feels like a bit of a hack, and I am wondering if there is a better way to do it.
Reproducible Example Code:
# Modules needed
import pandas as pd
import numpy as np
# Example Input array of Ones and Zeroes with a datetime-index (Original data is time-series)
InputArray = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
InputArray.index = (pd.date_range('2000-12-22', '2001-01-06'))
boolean_array = InputArray == 1 #convert to boolean
# Wanted Output
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Below is my initial code to achieve the wanted output:
# function to get max consecutive for a particular array. i.e. will be done for each year below (groupby)
def GetMaxConsecutive(boolean_array):
    distinct = boolean_array.ne(boolean_array.shift()).cumsum()  # associate each run of trues/falses with a number
    distinct = distinct[boolean_array]  # only consider trues from the distinct values
    consect = distinct.value_counts().max()  # count each run of distinct values, then take the maximum
    return consect
# Find the maximum consecutive 'Trues' for each year.
MaxConsecutive = boolean_array.groupby(lambda x: x.year).apply(GetMaxConsecutive)
print(MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 7
# 2001 3
However, the output above is still not what we want, because the groupby cuts the data at each year boundary.
So the code below tries to 'fix' this by computing the consecutive Ones at the boundaries (i.e. current_year-01-01 and previous_year-12-31), and if the boundary count is larger than the original MaxConsecutive value from the output above, we replace it.
# First) we acquire all start_of_year and end_of_year data
start_of_year = boolean_array.loc[(boolean_array.index.month == 1) & (boolean_array.index.day == 1)]
end_of_year = boolean_array.loc[(boolean_array.index.month == 12) & (boolean_array.index.day == 31)]
# Second) we mask the start_of_year and end_of_year data to keep only elements that are "True"
start_of_year = start_of_year[start_of_year]
end_of_year = end_of_year[end_of_year]
# Third) Change the index to contain only years (rather than a datetime index).
# For the "start_of_year" array, subtract 1 from the years when setting the index,
# so we can match end_of_year to start_of_year arrays!
start_of_year = pd.Series(start_of_year)
start_of_year.index = start_of_year.index.year - 1
end_of_year = pd.Series(end_of_year)
end_of_year.index = end_of_year.index.year
# Combine index-years that are 'matched'
matched_years = pd.concat([end_of_year, start_of_year], axis=1)
matched_years = matched_years.dropna()
matched_years = matched_years.index
# Finally) Compute the consecutive 1s/trues at the boundaries
# for each matched year
Modify_MaxConsecutive = MaxConsecutive.copy()
for year in matched_years:
    # Compute the number of consecutive 1s/trues at the start of the following year
    start = boolean_array.loc[boolean_array.index.year == (year + 1)]
    distinct = start.ne(start.shift()).cumsum()  # associate each run of trues/falses with a number
    distinct_masked = distinct[start]  # keep only the trues, i.e. drop elements of "distinct" that are "False" in the boolean array
    count_distincts = distinct_masked.value_counts()  # index: run number; value: length of that run
    start_consecutive = count_distincts.loc[distinct_masked.min()]  # run length at the start of the year (the minimum run number)
    # Compute the number of consecutive 1s/trues at the previous end-of-year
    end = boolean_array.loc[boolean_array.index.year == year]
    distinct = end.ne(end.shift()).cumsum()  # associate each run of trues/falses with a number
    distinct_masked = distinct[end]  # keep only the trues
    count_distincts = distinct_masked.value_counts()  # run number -> run length
    end_consecutive = count_distincts.loc[distinct_masked.max()]  # run length at the end of the year (the maximum run number)
    # Merge/add the runs at the boundary (start-of-year plus previous end-of-year)
    ConsecutiveAtBoundaries = start_consecutive + end_consecutive
    # Now we modify the original MaxConsecutive if ConsecutiveAtBoundaries is larger
    # (note: the copy is taken once, before the loop, so updates accumulate)
    if Modify_MaxConsecutive.loc[year] < ConsecutiveAtBoundaries:
        Modify_MaxConsecutive.loc[year] = ConsecutiveAtBoundaries
# Wanted Output is achieved!
print(Modify_MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Now I've got the time. Here is my solution:
# Modules needed
import pandas as pd
import numpy as np
input_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1], dtype=bool)
input_dates = pd.date_range('2000-12-22', '2001-01-06')
df = pd.DataFrame({"input": input_array, "dates": input_dates})
# index of the first day of each streak: True now, False (or nothing) the day before
streak_starts = df.index[~df.input.shift(1, fill_value=False) & df.input]
# index one past the last day of each streak: True now, False (or nothing) the day after
streak_ends = df.index[~df.input.shift(-1, fill_value=False) & df.input] + 1
streak_lengths = streak_ends - streak_starts
# one row per streak, stamped with its start date and length; a streak that crosses
# the year boundary is attributed to the year it starts in
streak_df = df.iloc[streak_starts].copy()
streak_df["streak_length"] = streak_lengths
longest_streak_per_year = streak_df.groupby(streak_df.dates.dt.year).streak_length.max()
Output:
dates
2000 9
2001 3
Name: streak_length, dtype: int64
Not sure if this is the most efficient, but it's one solution:
arr = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
arr.index = (pd.date_range('2000-12-22', '2001-01-06'))
arr = arr.astype(bool)
df = arr.reset_index() # convert to df
df['adj_year'] = df['index'].dt.year # adj_year will be adjusted for streaks
mask = (df[0].eq(True)) & (df[0].shift().eq(True))
df.loc[mask, 'adj_year'] = np.NaN  # we mark streaks as NaN and fill from above
df.adj_year = df.adj_year.fillna(method='ffill').astype('int')
# find max streak for each adjusted year
df.groupby('adj_year').apply(lambda x: ((x[0] == x[0].shift()).cumsum() + 1).max())
Output:
adj_year
2000 9
2001 3
dtype: int64
Note:
By convention, variable names in Python (except for classes) are lower case, so arr as opposed to InputArray
1 and 0 are equivalent to True and False, so you can convert them to boolean without the explicit comparison
cumsum is zero-indexed (as is usual in Python), so we add 1
This solution doesn't answer the question exactly, so it won't be the final answer: it counts the max consecutive Trues at the boundary for both the current year and the following year.
boolean_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1]).astype(bool)
boolean_array.index = (pd.date_range('2000-12-22', '2001-01-06'))
distinct = boolean_array.ne(boolean_array.shift()).cumsum()
distinct_masked = distinct[boolean_array]
streak_sum = distinct_masked.value_counts()
streak_sum_series = pd.Series(streak_sum.loc[distinct_masked].values, index = distinct_masked.index.copy())
max_consect = streak_sum_series.groupby(lambda x: x.year).max()
Output:
max_consect
2000 9
2001 9
dtype: int64

How to not set value to slice of copy [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 2 years ago.
I am trying to replace string values in a column without creating a copy. I have looked at the docs provided in the warning and also this question. I have also tried using .replace() with the same results. What am I not understanding?
Code:
import pandas as pd
from datetime import timedelta
# set csv file as constant
TRADER_READER = pd.read_csv('TastyTrades.csv')
TRADER_READER['Strategy'] = ''
def iron_condor():
    TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
    a = 0
    b = 1
    c = 2
    d = 3
    for row in TRADER_READER.index:
        start_time = TRADER_READER['Date'][a]
        end_time = start_time + timedelta(seconds=5)
        e = TRADER_READER.iloc[a]
        f = TRADER_READER.iloc[b]
        g = TRADER_READER.iloc[c]
        h = TRADER_READER.iloc[d]
        if start_time <= f['Date'] <= end_time and f['Underlying Symbol'] == e['Underlying Symbol']:
            if start_time <= g['Date'] <= end_time and g['Underlying Symbol'] == e['Underlying Symbol']:
                if start_time <= h['Date'] <= end_time and h['Underlying Symbol'] == e['Underlying Symbol']:
                    e.loc[e['Strategy']] = 'Iron Condor'
                    f.loc[f['Strategy']] = 'Iron Condor'
                    g.loc[g['Strategy']] = 'Iron Condor'
                    h.loc[h['Strategy']] = 'Iron Condor'
                    print(e, f, g, h)
        if (d + 1) > int(TRADER_READER.index[-1]):
            break
        else:
            a += 1
            b += 1
            c += 1
            d += 1
iron_condor()
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
Hopefully this satisfies the data needed to replicate:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL
Expected result:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT,Iron Condor
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL,Iron Condor
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT,Iron Condor
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL,Iron Condor
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL,
Let's start from some improvements in the initial part of your code:
The leftmost column of your input file is apparently the index column, so it should be read as the index. The consequence is a somewhat different approach to accessing rows (details later).
The Date column can be converted to datetime64 as early as at the reading time.
So the initial part of your code can be:
TRADER_READER = pd.read_csv('Input.csv', index_col=0, parse_dates=['Date'])
TRADER_READER['Strategy'] = ''
Then I decided to organize the loop another way:
indStart is the integer position within the index.
As you process your file in "overlapping" groups of 4 consecutive rows, a more natural way to organize the loop is to stop at the 4th row from the end, so the loop runs over range(TRADER_READER.index.size - 3).
The indices of the 4 rows of interest can be read from the respective slice of the index, i.e. [indStart : indStart + 4].
The check of a particular row can be performed with a nested function.
To avoid your warning, setting values in the Strategy column should be performed using loc on the original DataFrame, with the row parameter for the respective row and the column parameter for Strategy.
The whole update (for the current group of 4 rows) can be performed in a single instruction, specifying the row parameter as a slice from a through d.
So the code can be something like below:
def iron_condor():
    def rowCheck(row):
        return start_time <= row.Date <= end_time and row['Underlying Symbol'] == undSymb
    for indStart in range(TRADER_READER.index.size - 3):
        a, b, c, d = TRADER_READER.index[indStart : indStart + 4]
        e = TRADER_READER.loc[a]
        undSymb = e['Underlying Symbol']
        start_time = e.Date
        end_time = start_time + pd.Timedelta('5S')
        if rowCheck(TRADER_READER.loc[b]) and rowCheck(TRADER_READER.loc[c]) and rowCheck(TRADER_READER.loc[d]):
            TRADER_READER.loc[a:d, 'Strategy'] = 'Iron Condor'
            print('New values:')
            print(TRADER_READER.loc[a:d])
No need to increment a, b, c and d. Neither break is needed.
Edit
If for some reason you have to do other updates on the rows in question,
you can change my code accordingly.
But I don't understand "this csv file will make a new column" in your comment. For now, everything you do is performed on the DataFrame in memory; only after that can you save the DataFrame back to the original file. Note that even your code changes the type of the Date column, so I assume you do it once, after which the type of this column is just datetime64.
So you should probably change the type of the Date column as a separate operation and then (possibly many times) update the DataFrame and save the updated content back to the source file.
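A hedged sketch of that round trip (the file name comes from the question; dates are parsed once at read time, as suggested above):
TRADER_READER = pd.read_csv('TastyTrades.csv', index_col=0, parse_dates=['Date'])
# ... update the 'Strategy' column, possibly many times ...
TRADER_READER.to_csv('TastyTrades.csv')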
Edit following the comment as of 21:22:46Z
re.search('.*TO_OPEN$', row['Action']) returns a re.Match object if a match has been found, otherwise None.
So you cannot compare this result with the string you searched for. If you wanted to get the matched string, you would run e.g.:
mtch = re.search('.*TO_OPEN$', row['Action'])
textFound = None
if mtch:
    textFound = mtch.group(0)
But you actually don't need to do that. It is enough to check whether a match has been found, so the condition can be:
found = bool(re.search('.*TO_OPEN$', row['Action']))
(note that bool(None) is False, while a Match object is truthy).
Yet another (probably simpler and quicker) solution is to run just:
row.Action.endswith('TO_OPEN')
without invoking any regex function.
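For instance, a hypothetical variant of the rowCheck helper from the code above with that condition folded in (the endswith test is my addition, not part of the original answer):
def rowCheck(row):
    # same check as before, plus a restriction to opening trades (hypothetical)
    return (start_time <= row.Date <= end_time
            and row['Underlying Symbol'] == undSymb
            and row['Action'].endswith('TO_OPEN'))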
Here is quite an elaborate post that not only answers your question but also explains in detail why things are the case:
Deal with SettingWithCopyWarning
In short, if you want to set values on the original df, either use .replace(inplace=True) or df.loc[condition, theColtoBeSet] = new_val.
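A minimal sketch of the difference on a toy frame (column names mirror the question):
import pandas as pd
df = pd.DataFrame({'Action': ['BUY_TO_OPEN', 'SELL_TO_CLOSE'], 'Strategy': ['', '']})
# chained indexing writes to what may be a temporary copy and triggers the warning:
# df[df['Action'] == 'BUY_TO_OPEN']['Strategy'] = 'Iron Condor'
# a single .loc call writes through to the original frame:
df.loc[df['Action'] == 'BUY_TO_OPEN', 'Strategy'] = 'Iron Condor'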

Pandas capitalization of compound interests

I am writing an emulation of a bank deposit account in pandas.
I got stuck with Compound interest (It is the result of reinvesting interest, so that interest in the next period is then earned on the principal sum plus previously accumulated interest.)
So far I have the following code:
import pandas as pd
from pandas.tseries.offsets import MonthEnd
from datetime import datetime
# Create a date range
start = '21/11/2017'
now = datetime.now()
date_rng = pd.date_range(start=start, end=now, freq='d')
# Create an example data frame with the timestamp data
df = pd.DataFrame(date_rng, columns=['Date'])
# Add column (EndOfMonth) - shows the last day of the current month
df['LastDayOfMonth'] = pd.to_datetime(df['Date']) + MonthEnd(0)
# Add columns for Debit, Credit, Total, Description
df['Debit'] = 0
df['Credit'] = 0
df['Total'] = 0
df['Description'] = ''
# Set the "IsItLastDay" flag (no loop needed; this is a vectorized column comparison)
df['IsItLastDay'] = (df['LastDayOfMonth'] == df['Date'])
# Add the transaction of the first deposit
df.loc[df.Date == '2017-11-21', ['Debit', 'Description']] = 10000, "First deposit"
# Calculate the principal sum (the sum of all deposits minus all withdrawals plus all compounded interest)
df['Total'] = (df.Debit - df.Credit).cumsum()
# Calculate interest per day and Cumulative interest
# 11% is the interest rate per year
df['InterestPerDay'] = (df['Total'] * 0.11) / 365
df['InterestCumulative'] = ((df['Total'] * 0.11) / 365).cumsum()
# Change the order of columns
df = df[['Date', 'LastDayOfMonth', 'IsItLastDay', 'InterestPerDay', 'InterestCumulative', 'Debit', 'Credit', 'Total', 'Description']]
df.to_excel("results.xlsx")
The output file looks fine, but I need the following:
The "InterestCumulative" column adds to the "Total" column at the last day of each months (compounding the interests)
At the beggining of each month the "InterestCumulative" column should be cleared (Because the interest were added to the Principal sum).
How can I do this?
You're going to need to loop, as your total changes depending on previous rows, which then affects the later rows. As a result your current interest calculations are wrong.
total = 0
cumulative_interest = 0
total_per_day = []
interest_per_day = []
cumulative_per_day = []
for day in df.itertuples():
    total += day.Debit - day.Credit
    interest = total * 0.11 / 365
    cumulative_interest += interest
    if day.IsItLastDay:
        total += cumulative_interest  # compound: fold the accrued interest into the principal
    total_per_day.append(total)
    interest_per_day.append(interest)
    cumulative_per_day.append(cumulative_interest)
    if day.IsItLastDay:
        cumulative_interest = 0  # reset the accrual at each month end
df.Total = total_per_day
df.InterestPerDay = interest_per_day
df.InterestCumulative = cumulative_per_day
This is unfortunately a lot more confusing looking, but that's what happens when values depend on previous values. Depending on your exact requirements there may be nice ways to simplify this using math, but otherwise this is what you've got.
I've written this directly into stackoverflow so it may not be perfect.
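For what it's worth, if the only transaction is the initial deposit, the month-end compounding can be collapsed into per-month growth factors; a minimal sketch of that simplification (assuming the df built in the question, with its Date and LastDayOfMonth columns, and the single 10000 deposit at the start of the range):
# days in each (possibly partial) month of the range
days_in_month = df.groupby('LastDayOfMonth')['Date'].count()
# growth factor per month: simple daily accrual on the month's opening balance,
# credited (compounded) at month end
factors = 1 + days_in_month * 0.11 / 365
# balance at each month end, starting from the single 10000 deposit
month_end_balance = 10000 * factors.cumprod()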

New column based off certain input parameter to select what columns to use - Python

I have a pandas dataframe that includes multiple columns of monthly finance data. There is an input, period, that is specified by the person running the program; it's currently just saved as period, as shown below within the code.
#coded into python
period = ?? (user adds this in from input screen)
I need to create another column of data that uses the input period number to perform a calculation of other columns.
So, in the above table I'd like to create a new column 'calculation' that depends on the period input. For example, if a period of 1 was used, calc1 would be computed (with the math actually done); if period = 2, then calc2; if period = 3, then calc3. I only need one column calculated for the given period number, but I added three examples in the picture below to show how it'd work.
I can do this in SQL using case when. So using the input period then sum what columns I need to.
select Account #,
'&Period' AS Period,
'&Year' AS YR,
case
When '&Period' = '1' then sum(d_cf+d_1)
when '&Period' = '2' then sum(d_cf+d_1+d_2)
when '&Period' = '3' then sum(d_cf+d_1+d_2+d_3)
I am unsure how to do this easily in python (newer learner). Yes, I could create a column for every possible period (1-12) and then select only the one I need, but I'd like to learn a more efficient way.
Can you help more or point me in a better direction?
You could certainly do something like
df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
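For instance, a quick check of that expression on a toy frame (my own example data; assumes period = 2 and columns named d_cf, d_1, d_2, d_3 as in the SQL above):
import pandas as pd
df = pd.DataFrame({'d_cf': [1, 2], 'd_1': [10, 20], 'd_2': [100, 200], 'd_3': [1000, 2000]})
period = 2
df['calculation'] = df[['d_cf'] + [f'd_{i}' for i in range(1, period + 1)]].sum(axis=1)
# sums d_cf + d_1 + d_2 per row -> 111 and 222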
You can do this using a simple function in python:
def get_calculation(df, period=None):
    '''
    df = pandas data frame
    period = integer type
    '''
    if period == 1:
        return df.apply(lambda x: x['d_0'] + x['d_1'], axis=1)
    if period == 2:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'], axis=1)
    if period == 3:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'] + x['d_3'], axis=1)
new_df = get_calculation(df, period=1)
Setup:
df = pd.DataFrame({'d_0': list(range(1, 7)),
                   'd_1': list(range(10, 70, 10)),
                   'd_2': list(range(100, 700, 100)),
                   'd_3': list(range(1000, 7000, 1000))})
Setup:
import pandas as pd
ddict = {
    'Year': ['2018', '2018', '2018', '2018', '2018'],
    'Account_Num': ['1111', '1122', '1133', '1144', '1155'],
    'd_cf': ['1', '2', '3', '4', '5'],
}
data = pd.DataFrame(ddict)
Create value calculator:
def get_calcs(period):
    # Convert the period to a string
    s = str(period)
    # Convert to an integer and add 1
    n = int(period) + 1
    # This will repeat each digit of the period (period + 1) times
    return ''.join([i * n for i in s])
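For example, get_calcs(3) returns '3333': s is '3', n is 4, so the digit is repeated 4 times. That is the value that lands in the d_3 column in the Result below.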
The main function copies the data frame, iterates through the period values, and sets the calculated values in the correct spot, index-wise, for each relevant column:
def process_data(data_frame=data, period_column='d_cf'):
    # Copy the data_frame argument
    df = data_frame.copy(deep=True)
    # Run through each value in our period column
    for i in df[period_column].values.tolist():
        # Name the column for this period
        new_column = 'd_{}'.format(i)
        # Pass the period into our calculator; capture the result
        calculated_value = get_calcs(i)
        # Create a new (empty) column based on our period number
        df[new_column] = ''
        # Use indexing to place the calculated value into our desired location
        df.loc[df[period_column] == i, new_column] = calculated_value
    # Return the result
    return df
Start:
Year Account_Num d_cf
0 2018 1111 1
1 2018 1122 2
2 2018 1133 3
3 2018 1144 4
4 2018 1155 5
Result:
process_data(data)
   Year Account_Num d_cf d_1  d_2   d_3    d_4     d_5
0  2018        1111    1  11
1  2018        1122    2      222
2  2018        1133    3           3333
3  2018        1144    4                 44444
4  2018        1155    5                        555555

Pandas row analysis for consecutive dates

Following a "chain" of rows and counting the consecutive months from a CSV file.
Currently I am reading a CSV file with 5 columns of interest (based on insurance policies):
CONTRACT_ID START-DATE END-DATE CANCEL_FLAG OLD_CON_ID
123456 2015-05-30 2016-05-30 0 8788
123457 2014-03-20 2015-03-20 0 12000
123458 2009-12-20 2010-12-20 0 NaN
...
I want to count the number of consecutive months a Contract chain goes for.
Example: take the START-DATE from the contract at the "front" of the chain (the oldest contract) and the END-DATE from the end of the chain (the newest contract). The oldest contract is defined as either the one before a cancelled contract in a chain, or the one that has no OLD_CON_ID value.
Each row represents a contract, and OLD_CON_ID points to the previous contract's ID. The desired output is how many months the contract chain goes back until there is a gap (i.e. the customer didn't have a contract for a period of time). If there is nothing in that column, then that row is the first contract in its chain.
CANCEL_FLAG should also cut the chain, because a value of 1 means the contract was cancelled.
Current code counts the number of active contracts for each year by editing the dataframe like so:
df_contract = df_contract[
    (df_contract['START_DATE'] <= pd.to_datetime('2015-05-31')) &
    (df_contract['END_DATE'] >= pd.to_datetime('2015-05-31')) &
    (df_contract['CANCEL_FLAG'] == 0)
]
df_contract = df_contract[df_contract['CANCEL_FLAG'] == 0]
activecount = df_contract.count()
print(activecount['CONTRACT_ID'])
Here are the first 6 lines of code in which I create the dataframes and adjust the datetime values:
file_name = 'EXAMPLENAME.csv'
df = pd.read_csv(file_name)
df_contract = pd.read_csv(file_name)
df_CUSTOMERS = pd.read_csv(file_name)
df_contract['START_DATE'] = pd.to_datetime(df_contract['START_DATE'])
df_contract['END_DATE'] = pd.to_datetime(df_contract['END_DATE'])
Ideal output is something like:
FIRST_CONTRACT_ID CHAIN_LENGTH CON_MONTHS
1234567 5 60
1500001 1 4
800 10 180
Those data points would then be graphed.
EDIT2: CSV file changed, might be easier now. Question updated.
Not sure if I totally understand your requirement, but does something like this work?:
df_contract['TOTAL_YEARS'] = (df_contract['END_DATE'] - df_contract['START_DATE']) / np.timedelta64(1, 'Y')
df_contract.loc[(df_contract['CANCEL_FLAG'] == 1) & (df_contract['TOTAL_YEARS'] > 1), 'TOTAL_YEARS'] = 1
After a lot of trial and error I got it working!
This finds the time difference between the first and last contracts in the chain and finds the length of the chain.
Not the cleanest code by far, but it works:
test = 'START_DATE'
df_short = df_policy[['OLD_CON_ID', test, 'CONTRACT_ID']]
df_short.rename(columns={'OLD_CON_ID': 'PID', 'CONTRACT_ID': 'CID'}, inplace=True)
df_test = df_policy[['CONTRACT_ID', 'END_DATE']]
df_test.rename(columns={'CONTRACT_ID': 'CID', 'END_DATE': 'PED'}, inplace=True)
df_copy1 = df_short.copy()
df_copy2 = df_short.copy()
df_copy2.rename(columns={'PID': 'PPID', 'CID': 'PID'}, inplace=True)
df_merge1 = pd.merge(df_short, df_copy2, how='left', on=['PID'])
df_merge1['START_DATE_y'].fillna(df_merge1['START_DATE_x'], inplace=True)
df_merge1.rename(columns={'START_DATE_x': '1_EFF', 'START_DATE_y': '2_EFF'}, inplace=True)
The copy, merge, fillna, and rename code is repeated for 5 merged dataframes then:
df_merged = pd.merge(df_merge5, df_test, how='right', on=['CID'])
df_merged['TOTAL_MONTHS'] = (df_merged['PED'] - df_merged['6_EFF']) / np.timedelta64(1, 'M')
df_merged4 = df_merged[df_merged['PED'] >= pd.to_datetime('2015-07-06')]
df_merged4['CHAIN_LENGTH'] = df_merged4.drop(['PED', '1_EFF', '2_EFF', '3_EFF', '4_EFF', '5_EFF'], axis=1).apply(lambda row: len(pd.unique(row)), axis=1) - 3
Hopefully my code is understood and will help someone in the future.
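For the record, here is a rough alternative sketch of the chain walk itself (my own, not the method above): since OLD_CON_ID points at the previous contract, each chain can be followed with a plain dictionary lookup. It assumes the df_policy frame used above and ignores the CANCEL_FLAG cut for brevity:
import pandas as pd
# map each contract to its predecessor
prev_of = dict(zip(df_policy['CONTRACT_ID'], df_policy['OLD_CON_ID']))
def chain_length(cid):
    # walk back until a contract has no predecessor on file
    length = 1
    while not pd.isna(prev_of.get(cid, float('nan'))):
        cid = prev_of[cid]
        length += 1
    return length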
