I have a historical collection of ~500k loans, some of which have defaulted and others have not. My dataframe is lcd_temp. lcd_temp has information on the loan size (loan_amnt), whether the loan has defaulted or not (Total_Defaults), the annual loan rate (clean_rate), the term of the loan (clean_term), and the months from origination to default (mos_to_default). mos_to_default is equal to clean_term if there is no default.
I would like to calculate the Cumulative Cashflow [cum_cf] for each loan as the sum of all coupons paid until default plus (1-severity) of the loan amount if the loan defaults, and the coupons plus principal (loan_amnt) if it pays back on time.
Here's my code, which takes an awful long time to run:
severity = 1
for i in range(0, len(lcd_temp['Total_Defaults']) - 1):
    if lcd_temp.loc[i, 'Total_Defaults'] == 1:
        # Default: pay coupons only until time of default, plus (1-severity) of principal
        lcd_temp.loc[i, 'cum_cf'] = ((lcd_temp.loc[i, 'mos_to_default'] / 12) * lcd_temp.loc[i, 'clean_rate']) + (1 - severity) * lcd_temp.loc[i, 'loan_amnt']
    else:
        # Total cf is the sum of coupons (non-compounded) plus principal
        lcd_temp.loc[i, 'cum_cf'] = (1 + lcd_temp.loc[i, 'clean_term'] / 12 * lcd_temp.loc[i, 'clean_rate']) * lcd_temp.loc[i, 'loan_amnt']
Any thoughts or suggestions on improving the speed (which takes over an hour so far) welcomed!
Assuming you are using Pandas/NumPy, the standard way to replace an if-then construction such as the one you are using is to use np.where(mask, A, B). The mask is an array of boolean values. When True, the corresponding value from A is returned. When False, the corresponding value from B is returned. The result is an array of the same shape as mask with values from A and/or B.
import numpy as np

severity = 1
mask = (lcd_temp['Total_Defaults'] == 1)
A = ((lcd_temp['mos_to_default'] / 12) * lcd_temp['clean_rate']
     + (1 - severity) * lcd_temp['loan_amnt'])
B = (1 + lcd_temp['clean_term'] / 12 * lcd_temp['clean_rate']) * lcd_temp['loan_amnt']
lcd_temp['cum_cf'] = np.where(mask, A, B)
Notice that this performs the calculation on whole columns instead of row-by-row. This improves performance greatly because it gives Pandas/NumPy the opportunity to pass larger arrays of values to fast underlying C/Fortran functions (in this case, to perform the arithmetic). When you work row-by-row, you are performing scalar arithmetic inside a Python loop, which gives NumPy zero chance to shine.
If you had to compute row-by-row, you would be just as well (and maybe better) off using plain Python.
Even though A and B compute values for the entire column -- and some of those values are not used in the final result returned by np.where -- this is still faster than computing row-by-row, assuming there are more than a trivial number of rows.
BACKGROUND
I am calculating racial segregation statistics between and within firms using the Theil Index. The data structure is a multi-indexed pandas dataframe. The calculation involves a lot of df.groupby()['foo'].transform(), where the transformation is the entropy function from scipy.stats. I have to calculate entropy on smaller and smaller groups within this structure, which means calling entropy more and more times on the groupby objects. I get the impression that this is O(n), but I wonder whether there is an optimization that I am missing.
EXAMPLE
The key part of this dataframe comprises five variables: county, firm, race, occ, and size. The units of observation are counts: each row tells you the SIZE of the workforce of a given RACE in a given OCCupation in a FIRM in a specific COUNTY. Hence the multiindex:
df = df.set_index(['county', 'firm', 'occ', 'race']).sort_index()
The Theil Index is the size-weighted sum of sub-units' entropy deviations from the unit's entropy. To calculate segregation between counties, for example, you can do this:
from scipy.stats import entropy
from numpy import where

# Helper to calculate the actual components of the Theil statistic
def Hcmp(w_j, w, e_j, e):
    return where(e == 0, 0, (w_j / w) * ((e - e_j) / e))

df['size_county'] = df.groupby(['county', 'race'])['size'].transform('sum')
df['size_total'] = df['size'].sum()

# Create a dataframe with observations aggregated over county/race tuples
counties = df.groupby(['county', 'race'])[['size_county', 'size_total']].first()
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4)  # <--
# The base for entropy is 4 because there are four recorded racial categories.

# Assume that counties['entropy_total'] has already been calculated.
counties['seg_cmpnt'] = Hcmp(counties['size_county'], counties['size_total'],
                             counties['entropy_county'], counties['entropy_total'])
county_segregation = counties['seg_cmpnt'].sum()
Focus on this line:
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4)
The starting dataframe has 3,130,416 rows. When grouped by county, though, the resulting groupby object has just 2,267 groups. This runs quickly enough. When I calculate segregation within counties and between firms, the corresponding line is this:
firms['entropy_firm'] = firms.groupby('firm')['size_firm'].transform(entropy, base=4)
Here, the groupby object has 86,956 groups (the count of firms in the data). This takes about 40 times as long as the prior, which looks suspiciously like O(n). And when I try to calculate segregation within firms, between occupations...
# Grouping by firm and occupation because occupations are not nested within firms
occs['entropy_occ'] = occs.groupby(['firm', 'occ'])['size_occ'].transform(entropy, base=4)
...There are 782,604 groups. Eagle-eyed viewers will notice that this is exactly 1/4th the size of the raw dataset, because I have one observation for each firm/race/occupation tuple, and four racial categories. It is also nine times the number of groups in the by-firm groupby object, because the data break employment out into nine occupational categories.
This calculation takes about nine times as long: four or five minutes. When the underlying research project involves 40-50 years of data, this part of the process can take three or four hours.
THE PROBLEM, RESTATED
I think the issue is that, even though scipy.stats.entropy() is being applied in a smart, vectorized way, the necessity of calculating it over a very large number of small groups--and thus calling it many, many times--is swamping the performance benefits of vectorized calculations.
I could pre-calculate the necessary logarithms that entropy requires, for example with numpy.log(). If I did that, though, I'd still have to group the data to first get each firm/occupation/race's share within the firm/occupation. I would also lose any advantage of readable code that looks similar at different levels of analysis.
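For concreteness, the pre-computation I have in mind might look something like this rough, unbenchmarked sketch (it uses the counties frame built above and base 4 as before; I have not checked it against scipy's output):
import numpy as np

# Per-county entropy assembled from whole-column operations instead of
# calling scipy.stats.entropy once per group. Assumes the 'counties' frame
# (MultiIndex: county, race) constructed above.
size = counties['size_county']
group_total = size.groupby(level='county').transform('sum')
p = size / group_total                          # within-county share of each race
xlogx = p * np.log(p.where(p > 0, 1.0))         # p*log(p), with 0*log(0) treated as 0
counties['entropy_county'] = (
    -xlogx.groupby(level='county').transform('sum') / np.log(4)  # base-4 entropy
)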
Thus my question, stated as clearly as I can: is there a more computationally efficient way to call something like this entropy function, when calculating it over a very large number of relatively small groups in a large dataset?
I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month  ProductID(SKU)  Family  Sales    ProporcionVenta
1      1234            FISH    10000.0  0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month relative to the total sales of its family-month. For example, if the FISH family has sold 100,000 in month 1, then in this specific case the proportion would be calculated as 10,000/100,000 (productid-month sales / family-month sales).
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family) & (testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family) & (testingAgain['Month']==month) & (testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth / salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku) & (testingAgain['Family']==family) & (testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works, it runs, and I have even individually printed the proportions and calculated them in Excel and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though they should have been assigned the new one.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (and even NumPy), unlike general-purpose Python, you should avoid for loops, since there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs put it, values broadcast to match the shape of the input array.
Currently, your code attempts to assign a value to a subsetted slice of a DataFrame column, which should raise a SettingWithCopyWarning. Such an operation does not affect the original DataFrame. Your loop can instead use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, you can avoid looping entirely, since transform works nicely for assigning new DataFrame columns. Note that div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
I am pulling in a handful of different datasets daily, performing a few simple data quality checks, and then shooting off emails if a dataset fails the checks.
My checks are as simple as checking for duplicates in the dataset, as well as checking that the number of rows and columns in a dataset hasn't changed -- see below.
assert df.shape == (1016545, 8)
assert len(df) - len(df.drop_duplicates()) == 0
Since these datasets are updated daily and may change the number of rows, is there a better way to check instead of hardcoding the specific number?
For instance, one dataset might have only 400 rows, and another might have 2 million.
Could I check within 'one standard deviation' of yesterday's number of rows? But in that case, I would need to start collecting previous days' counts in a separate table, and that could get ugly.
Right now, for tables that change daily, I'm doing the following rudimentary check:
assert df.shape[0] <= 1016545 + 100
assert df.shape[0] >= 1016545 - 100
But obviously this is not sustainable.
Any suggestions are much appreciated.
Yes, you would need to store some previous information, but since you don't seem to need perfect statistical accuracy, I think you can cheat a little. If you keep the average record count from previous samples, the deviation you calculated previously, and the number of samples you have taken, you can get reasonably close to what you are looking for by taking a weighted average of the previous deviation and the current deviation.
For example:
Say the average count has been 1016545 with a deviation of 85, captured over 10 samples, and today's count is 1016612. The difference from the mean is 1016612 - 1016545 = 67, so the weighted average of the previous deviation and the current one is (85*10 + 67)/11 ≈ 83.
This way you are only storing a handful of variables for each dataset instead of every record count back in time, but it also means this is not a true standard deviation.
As for storage, you could store your data in a database or a json file or any number of other locations -- I won't go into detail for that since it's not clear what environment you are working in or what resources you have available.
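For illustration, here is a minimal sketch of that running check (the function name, dict keys, and tolerance factor are just assumptions for the example, not anything prescribed above):
def check_row_count(count, stats, tolerance=3):
    """Flag today's row count if it is far from the running mean, then update stats.

    stats is a small dict with keys 'mean', 'dev', 'n' persisted between runs
    (for example in a JSON file or a small table).
    """
    diff = abs(count - stats['mean'])
    ok = diff <= tolerance * max(stats['dev'], 1)
    # weighted update of the stored deviation and mean, as described above
    stats['dev'] = (stats['dev'] * stats['n'] + diff) / (stats['n'] + 1)
    stats['mean'] = (stats['mean'] * stats['n'] + count) / (stats['n'] + 1)
    stats['n'] += 1
    return ok

# Using the numbers from the example above:
stats = {'mean': 1016545, 'dev': 85, 'n': 10}
print(check_row_count(1016612, stats))   # True: a difference of 67 is well within 3 * 85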
Hope that helps!
Could you please help me with this issue? I have searched a lot but cannot solve it. I have a multivariate dataframe for electricity consumption and I am doing forecasting using a VAR (Vector Auto-Regression) model for time series.
I made the predictions, but I need to reverse the transformation of the time series (energy_log_diff), since I applied a seasonal log difference to make the series stationary, in order to get the real energy values:
df['energy_log'] = np.log(df['energy'])
df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(1)
For that, I did first:
df['energy'] = np.exp(df['energy_log_diff'])
This is supposed to give the energy difference between two values lagged by 365 days, but I am not sure about this either.
How can I do this?
The reason we use log differences is that they are additive, so we can take the cumulative sum and then multiply by the last observed value.
# scale the cumulated (predicted) log differences by the last observed level
last_energy = df['energy'].iloc[-1]
df['energy'] = np.exp(df['energy_log_diff'].cumsum()) * last_energy
As for seasonality: if you de-seasoned the log differences, add (or multiply) the seasonal component back before the step above; if you de-seasoned the original series, add it back afterwards.
Short answer - you have to run inverse transformations in the reversed order which in your case means:
Inverse transform of differencing
Inverse transform of log
How to convert differenced forecasts back is described, for example, here (the post has an R flag, but there is no code and the idea is the same for Python). In your post you calculate the exponential, but you have to reverse the differencing first before doing that.
You could try this:
energy_log_diff_rev = []
v_prev = v_0
for v in df['energy_log_diff']:
    v_prev += v
    energy_log_diff_rev.append(v_prev)
Or, if you prefer the pandas way, you can try this (only for a first-order difference):
energy_log_diff_rev = df['energy_log_diff'].expanding(min_periods=0).sum() + v_0
Note the v_0 value, which is the original value (after the log transformation, before differencing); it is described in the link above.
Then, after this step, you can do the exponential (inverse of log):
energy_orig = np.exp(energy_log_diff_rev)
Notes/Questions:
You mention values lagged by 365, but you are shifting the data by 1. Does this mean you have yearly data? Or would you rather do df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(365) instead (in the case of daily granularity)?
You want to get the reversed time series from the predictions, is that right? Or am I missing something? In that case you would apply the inverse transformations to the predictions, not to the data I used above for the explanation.
I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I must create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), sorted by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is a conspicuous consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I was reading that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing. So I was wondering if there is a faster way to do that. As you may have realized, I am an amateur programmer (a molecular biologist in "real life") and my questions are probably rather naïve.
First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
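A rough sketch of that idea, assuming the two columns of your array are available as timestamps and values (the names and the example time window are illustrative, not from your data):
import numpy as np

# Bin by value first, then sort each bin by timestamp so that any date range
# can be located with binary search (np.searchsorted).
bin_idx = np.clip((values - 40) // 10, 0, 15)        # 16 value bins: 40-49, 50-59, ...
bins = [np.sort(timestamps[bin_idx == b]) for b in range(16)]

# Count the datapoints of bin 3 that fall inside a [start, stop) time window
start, stop = 1_000_000, 2_000_000                   # example timestamps (seconds)
count = np.searchsorted(bins[3], stop) - np.searchsorted(bins[3], start)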
In order to do this, there is a function in numpy, numpy.bincount. It is blazingly fast. It is so fast that you can create a bin for each integer (161 bins) and day (maybe 30000 different days?) resulting in a few million bins.
The procedure:
calculate an integer index for each bin (e.g. 17 * the number of days since the first day in the file + (integer - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data the integer bin calculation code could be something like this:
# let us assume we have the data as:
#   timestamps: 64-bit integer (seconds since something)
#   values: 8-bit unsigned integer with integers between 40 and 200

# find the first day in the sample (86400 seconds per day)
first_day = np.min(timestamps) // 86400

# we intend to do this but fast:
indices = (timestamps // 86400 - first_day) * 17 + (values - 40) // 10

# get the bincount vector
b = np.bincount(indices)

# calculate the number of days in the sample
no_days = (len(b) + 16) // 17

# reshape b (pads the last day with zeros if needed)
b.resize((no_days, 17))
It should be noted that the first and last days in b depend on the data. In testing, most of the time is spent calculating the indices (around 400 ms on an i7 processor). If that needs to be reduced, it can be brought down to approximately 100 ms with the numexpr module. However, the actual implementation depends heavily on the form of the timestamps; some are faster to calculate, some slower.
That said, I doubt that any other binning method will be faster if the data is needed down to the daily level.
I did not quite understand from your question whether you wanted separate views of the data (one by year, one by week, etc.) or some other binning method. In any case, that boils down to summing the relevant rows together.
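As a rough sketch of that summing step (assuming b and first_day from the code above, and that the timestamps count seconds since the Unix epoch so that day numbers convert directly to dates):
import numpy as np

# Map each daily row of b to its calendar month, then sum the rows per month.
days = np.arange(b.shape[0]) + first_day              # day numbers since the epoch
months = days.astype('datetime64[D]').astype('datetime64[M]')
monthly = {m: b[months == m].sum(axis=0) for m in np.unique(months)}
# monthly[np.datetime64('2015-03')] would then be the 17-bin histogram for that month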
Here is a solution, employing the group_by functionality found in the link below:
http://pastebin.com/c5WLWPbp
import numpy as np
dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40, 200, len(dates))
years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')
from grouping import group_by
bins = np.linspace(40, 200, 17)
for m, g in zip(*group_by(months)(values)):
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.