I am trying to perform an interpolation based on the input data (test_data_Inputs) on the test_data data frame. The way I have it set up now, I do it by Peril: I first create a dataframe that only contains the Fire peril (see below) and then perform the interpolation on that specific peril group:
The goal is to end up with test_data_Inputs carrying both the Peril Type and the corresponding Factor. One of the issues I have been encountering is the situation where the amount of insurance in test_data_Inputs is a perfect match within the test_data dataframe; it still interpolates regardless of whether it is a perfect match or not.
fire_peril_test=test_data[test_data['Peril Type']=='Fire']
from scipy import interpolate
x=fire_peril_test['Amount of Insurance']
y=fire_peril_test['Factor']
f=interpolate.interp1d(x,y)
xnew=test_data_Inputs["Amount of Insurance"]
ynew=f(xnew)
Sample data:
test_data_Inputs=pd.DataFrame({'Amount of Insurance':[320000,330000,340000]})
test_data=pd.DataFrame({'Amount of Insurance':[300000,350000,400000,300000,350000,400000],'Peril Type':['Fire','Fire','Fire','Water','Water','Water'],'Factor':[.10,.20,.35,.20,.30,.40]})
Appreciate all of the assistance.
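As a quick aside (not part of the original question): scipy's linear interp1d passes through the points it is given, so a perfect match on Amount of Insurance should come back as the tabulated factor, up to floating-point rounding. A small check against the sample data above:

from scipy import interpolate

fire = test_data[test_data['Peril Type'] == 'Fire']
f = interpolate.interp1d(fire['Amount of Insurance'], fire['Factor'])
print(f(350000))  # 350000 is an exact knot, so this should print ~0.20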
This is my solution with my actual data: essentially I melted the data so I could loop through the unique Peril Types. If you have any alternatives, let me know...

amount_of_insurance=pd.DataFrame()
# Melt the wide table so each row is (Amount Of Insurance, Peril Type, Factor)
df['Amount of Insurance']=pd.melt(df['Amount of Insurance'],id_vars=['Amount Of Insurance'],var_name='Peril Type',value_name='Factor')
for peril in df['Amount of Insurance']['Peril Type'].unique():
    # Build one interpolator per peril (e.g. peril='Fire')
    x=df['Amount of Insurance']['Amount Of Insurance'][df['Amount of Insurance']['Peril Type']==str(peril)]
    y=df['Amount of Insurance']['Factor'][df['Amount of Insurance']['Peril Type']==str(peril)]
    f=interpolate.interp1d(x,y)
    xnew=data_for_rater[['Enter Amount of Insurance']]
    ynew=f(xnew)
    # Attach the interpolated factors for this peril and accumulate the results
    append_frame=data_for_rater[['Group','Enter Amount of Insurance']]
    append_frame['Peril Type']=str(peril)
    append_frame['Factor']=ynew
    amount_of_insurance=amount_of_insurance.append(append_frame)  # DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement
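One possible alternative, sketched against the small sample frames above (test_data and test_data_Inputs as defined in the question), is to loop over the peril groups with groupby and let np.interp do the lookup; np.interp also returns the tabulated factor on an exact match:

import numpy as np
import pandas as pd

results = []
for peril, grp in test_data.groupby('Peril Type'):
    grp = grp.sort_values('Amount of Insurance')  # np.interp expects ascending x
    out = test_data_Inputs.copy()
    out['Peril Type'] = peril
    # Note: np.interp clamps values outside the table instead of raising like interp1d
    out['Factor'] = np.interp(out['Amount of Insurance'], grp['Amount of Insurance'], grp['Factor'])
    results.append(out)
interpolated = pd.concat(results, ignore_index=True)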
I am currently taking a course in Data Science on how to win data science competitions. The final project is a Kaggle competition that we have to participate in.
My training dataset has close to 3 million rows, and one of the columns is a "date of purchase" column.
I want to calculate the distance of each date to the nearest public holiday.
E.g. if the date is 31/12/2014, the nearest PH would be 01/01/2015. The number of days apart would be "1".
I cannot think of an efficient way to do this operation. I have a list with a number of Timestamps, each one is a public holiday in Russia (the dataset is from Russia).
def dateDifference(target_date_raw):
    # Number of days until the nearest public holiday that falls on or after the target date
    deltas_from_target_date = np.subtract(russian_public_holidays, target_date_raw)
    non_negative_deltas = [i.days for i in deltas_from_target_date if i.days >= 0]
    return np.min(non_negative_deltas)
where 'russian_public_holidays' is the list of public holiday dates and 'target_date_raw' is the date for which I want to calculate distance to the nearest public holiday.
This is the code I use to create a new column in my DataFrame for the difference of dates.
training_data['closest_public_holiday'] = [dateDifference(i) for i in training_data['date']]
This code ran for nearly 25 minutes and showed no signs of completing, which is why I turn to you guys for help.
I understand that this is probably the least Pandorable way of doing things, but I couldn't really find a clean way of operating on a single column during my research. I saw a lot of people say that using the "apply" function on a single column is a bad way of doing things. I am very new to working with such large datasets, which is why clean and efficient practices seem to elude me for now. Please do let me know what would be the best way to tackle this!
Try this and see if it helps with the timing. I worry that it will take up too much memory; I don't have the data to test, but you can try.
Assuming holidays of 1/1/2021, 8/9/2021, and 12/25/2021:

import numpy as np
import pandas as pd

df = pd.DataFrame(pd.date_range('01/01/2021','12/31/2021',freq='M'),columns=['Date'])
holidays = pd.to_datetime(np.array(['1/1/2021','12/25/2021','8/9/2021'])).to_numpy()
df['Days Away'] = (
    np.min(np.absolute(df.Date.to_numpy().reshape(-1, 1) - holidays), axis=1)
    / np.timedelta64(1, 'D')
)
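If the full broadcast against ~3 million dates does turn out to be too heavy on memory, a lower-memory sketch using np.searchsorted (reusing the df and holidays names from the example above) could look like this:

import numpy as np

holidays_sorted = np.sort(holidays)
dates = df.Date.to_numpy()

# Index of the first holiday on or after each date, clipped so both neighbours exist
idx = np.searchsorted(holidays_sorted, dates)
prev_idx = np.clip(idx - 1, 0, len(holidays_sorted) - 1)
next_idx = np.clip(idx, 0, len(holidays_sorted) - 1)

days_prev = np.abs(dates - holidays_sorted[prev_idx]) / np.timedelta64(1, 'D')
days_next = np.abs(dates - holidays_sorted[next_idx]) / np.timedelta64(1, 'D')
df['Days Away'] = np.minimum(days_prev, days_next)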
I'm currently working with CESM Large Ensemble data on the cloud (à la https://medium.com/pangeo/cesm-lens-on-aws-4e2a996397a1) using xarray and Dask, and am trying to plot the trends in extreme precipitation in each season over the historical period (Dec-Jan-Feb and Jun-Jul-Aug specifically).
E.g. if one had a daily time series split into months like:
1920: J,F,M,A,M,J,J,A,S,O,N,D
1921: J,F,M,A,M,J,J,A,S,O,N,D
...
My aim is to group together the JJA days in each year and then take the maximum value within that group of days for each year. Ditto for DJF, however here you have to be careful because DJF is a year-skipping season; the most natural way to define it is 1921's DJF = 1920 D + 1921 JF.
Using iris this would be simple (though quite inefficient), as you could just add auxiliary time-coordinates for season and season_year and then aggregate/groupby those two coordinates and take a maximum; this would give you a (year, lat, lon) output where each year contains the maximum of the precipitation field in the chosen season (e.g. maximum DJF precip in 1921 in each lat, lon pixel).
In xarray this operation is not as natural, because you can't natively group by multiple coordinates; see https://github.com/pydata/xarray/issues/324 for further info on this. In that GitHub issue someone suggests a simple, nested workaround to the problem using xarray's .apply() functionality:
def nested_groupby_apply(dataarray, groupby, apply_fn):
    if len(groupby) == 1:
        return dataarray.groupby(groupby[0]).apply(apply_fn)
    else:
        return dataarray.groupby(groupby[0]).apply(nested_groupby_apply, groupby=groupby[1:], apply_fn=apply_fn)
I'd be quite keen to try and use this workaround myself, but I have two main questions beforehand:
1) I can't seem to work out how to group by coordinates in a way that avoids taking the DJF maximum within the same calendar year.
Eg. If one simply applies the function like (for a suitable xr_max() function):
outp = nested_groupby_apply(daily_prect, ['time.season', 'time.year'], xr_max)
outp_djf = outp.sel(season='DJF')
Then you effectively define 1921's DJF as 1921 D + 1921 JF, which isn't actually what you want to look at! This is because the 'time.year' grouping doesn't account for the year-skipping behaviour of seasons like DJF. I'm not sure how to work around this.
2) This nested groupby function is incredibly slow! As such, I was wondering if anyone in the community had found a more efficient solution to this problem, with similar functionality?
Thanks ahead of time for your help, all! Let me know if anything needs clarifying.
EDIT: Since posting this, I've discovered there already is a workaround for this in the specific case of taking DJF/JJA means each year (Take maximum rainfall value for each season over a time period (xarray)), however I'm keeping this question open because the general problem of an efficient workaround for multi-coord grouping is still unsolved.
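On question 1), one possible workaround (not from the linked issue; a sketch under the assumption that daily_prect is the daily (time, lat, lon) DataArray from the question) is to build a season_year label by hand so that each December is counted toward the following year's DJF, and group on that:

# Keep only DJF days, label each with the year its JF part falls in, then take the max per label
djf = daily_prect.where(daily_prect['time'].dt.season == 'DJF', drop=True)
season_year = djf['time'].dt.year + (djf['time'].dt.month == 12)
djf_max = djf.assign_coords(season_year=season_year).groupby('season_year').max(dim='time')

# A faster built-in alternative: quarters anchored on December, so each DJF block is
# labelled by its starting December (the first and last blocks may be partial seasons)
seasonal_max = daily_prect.resample(time='QS-DEC').max(dim='time')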
I've got a data set where income is one of many variables. I want to add a column immediately to the right of the income variable that is the z-score. I know there's a question on here about how to do this to all but one column or many columns, but I need it for the one column, and without replacing the values. This is probably the long way of doing it but I've extracted just the income column and then applied the z-score to it. However, I can't figure out how to rename the column "Norm_Income" and then put it back into the main data frame, right next to the income. Any help is greatly appreciated. Here's what I have (I know it's not much):
## HW Part 3: Standardizing Income Attribute with Z-Score Normalization
Income=pd.DataFrame(bank_df,columns=['income'])
from scipy.stats import zscore
Norm_Income=Income.apply(zscore)
Norm_Income
Edit: This is so weird: this worked last night, but now I get an error. Here's my code:
## HW Part 3: Standardizing Income Attribute with Z-Score Normalization
Income=pd.DataFrame(bank_df,columns=['income'])
from scipy.stats import zscore
Income["Norm_Income"] = Income.apply(zscore)
bank_df=bank_df[["id","age","income","Norm_Income","children","gender","region","married","car","savings_acct","current_acct","mortgage","pep"]]
bank_df
Here's the new error:
You already have a series, so it's pretty straightforward to put it in the dataframe; take a look at Adding new column to existing DataFrame in Python pandas.
You just need:
Income["Norm_Income"] = Income.apply(zscore)
instead of your 3rd line
So please disregard my comment to the answer. I figured out code that worked in the context of my problem.
## HW Part 3: Standardizing Income Attribute with Z-Score Normalization
Income=pd.DataFrame(bank_df,columns=['income'])
from scipy.stats import zscore
bank_df["norm_income"] = Income.apply(zscore)
bank_df["norm_income"]
bank_df=bank_df[["id","age","income","norm_income","children","gender","region","married","car","savings_acct","current_acct","mortgage","pep"]]
bank_df
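Since the original goal was to have the new column sit immediately to the right of income, a hedged alternative is DataFrame.insert, which avoids reordering the columns by hand (this assumes norm_income is not already a column of bank_df):

from scipy.stats import zscore

# Insert norm_income directly after income
pos = bank_df.columns.get_loc("income") + 1
bank_df.insert(pos, "norm_income", zscore(bank_df["income"]))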
I'm editing a large dataframe in Python. How do you drop entire rows of the dataframe if a specific column has the value 0.0 in that row?
When I drop the 0.0s in the overall satisfaction column the edits are not displayed in my scatterplot matrix of the large dataframe.
I have tried:
filtered_df = filtered_df.drop([('overall_satisfaction'==0)], axis=0)
also tried replacing 0.0 with nulls & dropping the nulls:
filtered_df = filtered_df.['overall_satisfaction'].replace(0.0, np.nan), axis=0)
filtered_df = filtered_df[filtered_NZ_df['overall_satisfaction'].notnull()]
What concept am I missing? Thanks :)
So it seems like your values are small enough to be displayed as zeros, but are not actually zeros. This usually happens when calculations produce vanishingly small numbers (values that approach zero but are not quite zero), so equality comparisons do not give you the result you're looking for.
In cases like this, numpy has a handy function called isclose that lets you test whether a number is close enough to another number within a certain tolerance.
In your case, doing
df = df[~np.isclose(df['overall_satisfaction'], 0)]
seems to work.
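If the values really are stored as exact zeros, plain boolean indexing (just a hypothetical comparison point, not part of the answer above) would also do it:

filtered_df = filtered_df[filtered_df['overall_satisfaction'] != 0]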
I've been developing a tool that automatically preprocesses data in pandas.DataFrame format. During this preprocessing step, I want to treat continuous and categorical data differently. In particular, I want to be able to apply, e.g., a OneHotEncoder to only the categorical data.
Now, let's assume that we're provided a pandas.DataFrame and have no other information about the data in the DataFrame. What is a good heuristic to use to determine whether a column in the pandas.DataFrame is categorical?
My initial thoughts are:
1) If there are strings in the column (e.g., the column data type is object), then the column very likely contains categorical data
2) If some percentage of the values in the column is unique (e.g., >=20%), then the column very likely contains continuous data
I've found 1) to work fine, but 2) hasn't panned out very well. I need better heuristics. How would you solve this problem?
Edit: Someone requested that I explain why 2) didn't work well. There were some test cases where we still had continuous values in a column, but there weren't many unique values in the column. The heuristic in 2) obviously failed in that case. There were also issues where we had a categorical column that had many, many unique values, e.g., passenger names in the Titanic data set. Same column-type misclassification problem there.
Here are a couple of approaches:
Find the ratio of the number of unique values to the total number of values. Something like the following:
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1.*df[var].nunique()/df[var].count() < 0.05  # or some other threshold
Check if the top n unique values account for more than a certain proportion of all values:
top_n = 10
likely_cat = {}
for var in df.columns:
    likely_cat[var] = 1.*df[var].value_counts(normalize=True).head(top_n).sum() > 0.8  # or some other threshold
Approach 1) has generally worked better for me than Approach 2). But Approach 2) is better if there is a 'long-tailed distribution', where a small number of categories have high frequency while a large number of categories have low frequency.
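Either way, a hypothetical follow-up step to turn the likely_cat dictionary into a list of columns to treat as categorical might look like:

cat_cols = [var for var, is_cat in likely_cat.items() if is_cat]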
There are many places where you could "steal" the definitions of formats that can be cast as numbers; ##,#e-# would be one such format, just to illustrate. Maybe you'll be able to find a library to do so.
I try to cast everything to numbers first, and for whatever is left, well, there's no other option but to keep it as categorical.
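A minimal sketch of that cast-then-check idea using pandas' own parser (the function name and the 0.95 threshold are my assumptions, not anything from the answer above):

import pandas as pd

def split_numeric_categorical(df, min_numeric_fraction=0.95):
    # Try to coerce each column to numbers; columns where most values
    # fail to convert are treated as categorical.
    numeric_cols, cat_cols = [], []
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors='coerce')
        if converted.notna().mean() >= min_numeric_fraction:
            numeric_cols.append(col)
        else:
            cat_cols.append(col)
    return numeric_cols, cat_cols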
You could define which datatypes count as numerics and then exclude the corresponding variables
If the initial dataframe is df:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
dataframe = df.select_dtypes(exclude=numerics)
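In recent pandas versions the same idea can be written without listing the dtypes by hand, e.g.:

dataframe = df.select_dtypes(exclude='number')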
I think the real question here is whether you'd like to bother the user once in a while or silently fail once in a while.
If you don't mind bothering the user, maybe detecting ambiguity and raising an error is the way to go.
If you don't mind failing silently, then your heuristics are ok. I don't think you'll find anything that's significantly better. I guess you could make this into a learning problem if you really want to. Download a bunch of datasets, assume they are collectively a decent representation of all data sets in the world, and train based on features over each data set / column to predict categorical vs. continuous.
But of course in the end nothing can be perfect. E.g. is the column [1, 8, 22, 8, 9, 8] referring to hours of the day or to dog breeds?
I've been thinking about a similar problem and the more that I consider it, it seems that this itself is a classification problem that could benefit from training a model.
I bet if you examined a bunch of datasets and extracted these features for each column / pandas.Series (see the sketch after this list):
% floats: percentage of values that are float
% int: percentage of values that are whole numbers
% string: percentage of values that are strings
% unique string: number of unique string values / total number
% unique integers: number of unique integer values / total number
mean numerical value (non numerical values considered 0 for this)
std deviation of numerical values
and trained a model, it could get pretty good at inferring column types, where the possible output values are: categorical, ordinal, quantitative.
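A rough sketch of pulling those per-column features out of a pandas.Series (the function and feature names are mine, not from the answer):

import pandas as pd

def column_features(s):
    # Per-column summary features of the kind listed above
    n = len(s)
    is_str = s.apply(lambda v: isinstance(v, str))
    numeric = pd.to_numeric(s, errors='coerce')
    is_num = numeric.notna()
    is_int = is_num & (numeric % 1 == 0)
    return {
        'pct_float': (is_num & ~is_int).mean(),
        'pct_int': is_int.mean(),
        'pct_string': is_str.mean(),
        'pct_unique_string': s[is_str].nunique() / n,
        'pct_unique_int': numeric[is_int].nunique() / n,
        'mean_numeric': numeric.fillna(0).mean(),
        'std_numeric': numeric.fillna(0).std(),
    }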
Side note: as far as a Series with a limited number of numerical values goes, it seems like the interesting problem would be determining categorical vs ordinal; it doesn't hurt to think a variable is ordinal if it turns out to be quantitative right? The preprocessing steps would encode the ordinal values numerically anyways without one-hot encoding.
A related problem that is interesting: given a group of columns, can you tell if they are already one-hot encoded? E.g in the forest-cover-type-prediction kaggle contest, you would automatically know that soil type is a single categorical variable.
IMO the opposite strategy, identifying categoricals, is better because it depends on what the data is about. Technically address data can be thought of as unordered categorical data, but usually I wouldn't use it that way.
For survey data, an idea would be to look for Likert scales, e.g. 5-8 values, either strings (which would probably need hardcoded (and translated) levels to look for "good", "bad", ".agree.", "very .*", ...) or int values in the 0-8 range + NA.
Countries and such things might also be identifiable...
Age groups (".-.") might also work.
I've been looking at this and thought it may be useful to share what I have. This builds on @Rishabh Srivastava's answer.
import pandas as pd

def remove_cat_features(X, method='fraction_unique', cat_cols=None, min_fraction_unique=0.05):
    """Removes categorical features using a given method.
    X: pd.DataFrame, dataframe to remove categorical features from."""
    if method == 'fraction_unique':
        # Keep only columns whose fraction of unique values exceeds the threshold
        unique_fraction = X.apply(lambda col: len(pd.unique(col)) / len(col))
        reduced_X = X.loc[:, unique_fraction > min_fraction_unique]
    if method == 'named_columns':
        # Drop the explicitly named categorical columns
        non_cat_cols = [col not in cat_cols for col in X.columns]
        reduced_X = X.loc[:, non_cat_cols]
    return reduced_X
You can then call this function, giving a pandas df as X; you can either remove named categorical columns, or choose to remove columns with a low fraction of unique values (controlled by min_fraction_unique).
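A quick hypothetical usage example (the toy frame and the threshold are mine):

df = pd.DataFrame({
    'colour': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
    'price': [10.5, 23.1, 7.8, 15.0, 9.9, 18.2],
})

# Drop an explicitly named categorical column
numeric_only = remove_cat_features(df, method='named_columns', cat_cols=['colour'])

# Or drop columns whose fraction of unique values is at or below the threshold
numeric_only = remove_cat_features(df, method='fraction_unique', min_fraction_unique=0.5)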