In Python 3.7, I have a time series represented by a pandas DataFrame in which the index is a DatetimeIndex and the single value column is the stock price:
The gaps correspond to NaN "price" values, and there are 126 non-NaN values and 20 NaN values. What I'm trying to do is to interpolate the non-NaN values to predict the values that are NaN. I tried several interpolation methods (linear, cubic spline) but they're not sufficiently accurate, and looking at the plot above, it appears there is a significant upward trend and also some traces of weekly periodicity, so I decided to use the statsmodels ARIMA model. Here is my code:
def fill_in_dataframe_ARIMA( df ):
    price_is_not_NaN = df[ 'price' ].notnull()
    price_is_NaN = np.logical_not( price_is_not_NaN )
    # Convert the datetimes of the index into milliseconds:
    datetime_ms = df.index.map( to_ms )
    # Train the ARIMA model:
    train_datetime_ms = datetime_ms[ price_is_not_NaN ]
    train_price = df.price[ price_is_not_NaN ]
    arima_model = ARIMA( train_price, ( 5, 1, 2 ), train_datetime_ms ).fit()
    # Use model to predict the missing prices:
    missing_datetime_ms = datetime_ms[ price_is_NaN ]
    missing_price = arima_model.predict( exog = missing_datetime_ms )
    return missing_price
What I'm expecting is that missing_price ends up being an array-like object of twenty entries, like missing_datetime_ms. Instead, missing_price has 125 entries, one fewer than the number of samples in train_datetime_ms and train_price.
Clearly I am not understanding what's meant by endogenous and exogenous (not to mention interpolate vs. extrapolate). Can someone please explain how I can get the intended result of 20 predicted entries?
I have a pandas DataFrame with a date index column and a second column with the Port outs that I need to predict using a time series model.
For better prediction accuracy I need to normalize and preprocess the second column, so I created another column named Normalized Variable.
The problem is that when I denormalize to get the actual predicted numbers, the accuracy falls.
How can I change the denormalize function so that it does not lose accuracy, and is there a way to do it without using the original training data values, as I think they affect my result? I also tried the sklearn preprocessing utilities, but I find it difficult to apply them accurately to a single DataFrame column. These are the stats of my dataset:
         PORT_OUTS  NORMALIZED VARIABLE
count    19.000000            19.000000
mean   6026.631579             1.522419
std    1001.819689             0.183148
min    4281.000000             1.203291
25%    5350.500000             1.398812
50%    5922.000000             1.503291
75%    6889.000000             1.680073
max    7843.000000             1.854479
And this is the code I used:
print(f'Mean Absolute Percentage Error = {mean_absolute_percentage_error(test[train_variable],test_predictions)}')
def NormalizeDataForMul(data):
    return ((data - np.min(data)) / (np.max(data) - np.min(data))) + 1
def DeNormalizeData(data_to_denormalize,orginal_data):
    return (data_to_denormalize-1)*(np.max(orginal_data) - np.min(orginal_data)) + np.min(orginal_data)
actual_values = DeNormalizeData(test[train_variable],forecasting_dataset['PORT_OUTS'])
predicted_values = DeNormalizeData(test_predictions,forecasting_dataset['PORT_OUTS'])
print(f'Mean Absolute Percentage Error = {mean_absolute_percentage_error(actual_values,predicted_values)}')
And the output:
Mean Absolute Percentage Error = 0.08221370354752675
Mean Absolute Percentage Error = 0.11749164277173904
I also added 1 in the first function because I needed values > 0 to use the model I want.
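A small sketch may help separate the two effects in this question. It assumes the scaling bounds are computed once and then reused, and it uses a handful of values borrowed from the summary statistics above plus a made-up constant prediction error. The helper names (fit_min_max, normalize, denormalize, mape) are hypothetical and not part of the original code. It illustrates that the round trip through the scaler is exact, while MAPE is not invariant under a shift-and-scale transform, so the two printed errors are expected to differ even when nothing is lost.

```python
import numpy as np

def fit_min_max(data):
    """Return the (min, max) bounds to reuse for scaling and unscaling."""
    data = np.asarray(data, dtype=float)
    return data.min(), data.max()

def normalize(data, lo, hi):
    """Min-max scale into [1, 2] using stored bounds (same +1 shift as above)."""
    return (np.asarray(data, dtype=float) - lo) / (hi - lo) + 1

def denormalize(data, lo, hi):
    """Invert normalize() using the same stored bounds."""
    return (np.asarray(data, dtype=float) - 1) * (hi - lo) + lo

def mape(actual, predicted):
    """Mean absolute percentage error, expressed as a fraction."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual))

# Illustrative values only (taken from the describe() output above).
original = np.array([4281.0, 5350.5, 5922.0, 6889.0, 7843.0])
lo, hi = fit_min_max(original)

scaled = normalize(original, lo, hi)
round_trip = denormalize(scaled, lo, hi)

# Made-up predictions: a constant error of 0.05 in the scaled space.
scaled_pred = scaled + 0.05
mape_scaled = mape(scaled, scaled_pred)
mape_original = mape(original, denormalize(scaled_pred, lo, hi))

print(np.allclose(round_trip, original))  # the scaler itself loses nothing
print(mape_scaled, mape_original)         # same predictions, different MAPE
```

Because the percentage error divides by the actual value, shifting every value by a constant changes the denominator, so the MAPE on the original scale is the one to report.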
Here is what I have (a time series) in a pandas DataFrame:
screenshot
(The dates were also converted from timestamps.)
My goal is to plot not only the observations but the whole range of dates. I need to see a horizontal line or a gap wherever there are no new observations.
Dealing with data that is not observed equidistant in time is a typical challenge with real-world time series data. Given your problem, this code should work.
from datetime import datetime
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
# sample Frame
df = pd.DataFrame({'time' : ['2022,7,3,0,1,21', '2022,7,3,0,2,47', '2022,7,3,0,2,47', '2022,7,3,0,5,5',
'2022,7,3,0,5,5'],
'balance' : [12.6, 12.54, 12.494426, 12.482481, 12.449206]})
df['time'] = pd.to_datetime(df['time'], format='%Y,%m,%d,%H,%M,%S')
# aggregate time duplicates by mean
df = df.groupby('time').mean()
df.reset_index(inplace=True)
# pick equidistant time grid
df_new = pd.DataFrame({'time' : pd.date_range(start=df.loc[0]['time'], end=df.loc[2]['time'], freq='S')})
df = pd.merge(left=df_new, right=df, on='time', how='left')
# fill nan
df['balance'] = df['balance'].fillna(method='pad')
df.set_index("time", inplace=True)
# plot
_ = df.plot(title='Time Series of Balance')
There are several caveats to this solution.
First, your data has a high temporal resolution (seconds). However, there are hours-long gaps between observations. You can either coarsen the timestamps by rounding (e.g. to minutes or hours) or keep the time series at second-by-second resolution and accept that most of your balance values will be filled-in values rather than true observations.
Second, you have different balance values for the same timestamp which indicates faulty entries or a misspecified timestamp. I unified those entries via grouping by timestamp and averaged the balance over those non-unique timestamps.
Third, filled-in gaps and true observations have the same visual representation in the plot (blue dots in the graph). Commenting out the fillna() line would show only the true observations, leaving everything in between blank.
Finally, the missing values are merely filled in via padding. Look up different values of the argument method in the documentation in case you want to linearly interpolate etc.
Summary
The problems described above are typical for event-driven time series data. Since you deal with a (financial) balance that constitutes a state that is only changed by events (orders), I believe that the assumptions made above are reasonable and can be adjusted easily for your or many other use cases.
This helped:
data = data.set_index('time').resample('1M').mean()
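As a rough sketch of how resampling produces either a gap or a horizontal line, the snippet below puts the sample balances from the answer above on a one-minute grid. The one-minute frequency and the variable names (obs, with_gaps, stepped) are assumptions chosen for illustration; the right grid depends on how coarse the real data should be.

```python
import pandas as pd

# Sample observations from the answer above (duplicate timestamps included).
obs = pd.DataFrame({
    "time": pd.to_datetime([
        "2022-07-03 00:01:21", "2022-07-03 00:02:47", "2022-07-03 00:02:47",
        "2022-07-03 00:05:05", "2022-07-03 00:05:05",
    ]),
    "balance": [12.6, 12.54, 12.494426, 12.482481, 12.449206],
})

# Coarsen to a regular one-minute grid; duplicates are averaged per bin and
# minutes without any observation become NaN.
with_gaps = obs.set_index("time")["balance"].resample("1min").mean()

# Forward-fill so that empty minutes repeat the last known balance.
stepped = with_gaps.ffill()

print(with_gaps)
print(stepped)
```

Plotting with_gaps leaves the empty minutes blank, while plotting stepped draws a horizontal line across them.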
I'm currently struggling with my DataFrame in pandas (I'm new to this).
I have a three-column DataFrame: Categorical_data1, Categorical_data2, Output (2400 rows x 3 columns).
Both categorical columns (the inputs) are strings, and the output depends on the inputs.
Categorical_data1 = ['type1','type2', ... , 'type6']
Categorical_data2 = ['rain1','rain2', 'rain3','rain4']
So there are 24 possible pairs of categorical values.
I want to plot a heatmap (using seaborn, for instance) of the number of 0s in Output for each pair (Cat_data1, Cat_data2). I tried several things using boolean masks.
I tried to figure out how to compute the exact number of 0s:
count = ((df['Output'] == 0) & (df(['Categorical_Data1'] == 'type1') & (df(['Categorical_Data2'] == 'rain1')))).sum()
but it failed.
The output belongs to [0,1], with a large number of 0s (around 1200 out of 2400). My goal is to have something like this Source by jcdoming (I can't upload images...), with months = Categorical Data1, years = Categorical Data2, and the number of 0s in the outputs as the cell values.
Thank you for your help.
Use a seaborn countplot. It gives counts of categorical data occurrences in a certain feature. Use hue to add in the second feature to the visualization:
import seaborn as sns
sns.countplot(data=dataframe, x='Categorical_Data1', hue='Categorical_Data2')
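If a heatmap of the zero counts is still wanted, one possible approach is to build the count table with pandas first. The sketch below uses a tiny made-up DataFrame with the column names from the question; zero_counts and single_pair_count are hypothetical names. It also shows a working form of the single-pair count, where the columns are indexed with square brackets rather than called like a function.

```python
import pandas as pd

# Tiny made-up sample using the column names from the question.
df = pd.DataFrame({
    "Categorical_Data1": ["type1", "type1", "type2", "type2", "type1", "type2"],
    "Categorical_Data2": ["rain1", "rain2", "rain1", "rain1", "rain1", "rain2"],
    "Output": [0, 1, 0, 0, 0, 1],
})

# Count of zeros for a single pair: combine boolean masks with &.
single_pair_count = (
    (df["Output"] == 0)
    & (df["Categorical_Data1"] == "type1")
    & (df["Categorical_Data2"] == "rain1")
).sum()

# Count of zeros for every pair: mark zeros as 1, sum per group, then pivot
# the second category into columns (pairs never seen are filled with 0).
is_zero = df["Output"].eq(0).astype(int)
zero_counts = (
    is_zero.groupby([df["Categorical_Data1"], df["Categorical_Data2"]])
    .sum()
    .unstack(fill_value=0)
)

print(single_pair_count)
print(zero_counts)

# With seaborn installed, the table can be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(zero_counts, annot=True, fmt="d")
```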
I would like to get a DataFrame of important features. With the code below I got the shap_values, but I am not sure what the values mean. My df has 142 features and 67 experiments, but I got an array with ca. 2500 values.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
I have tried to store them in a df:
rf_resultX = pd.DataFrame(shap_values, columns = ['shap_values'])
but got: ValueError: Shape of passed values is (18, 142), indices imply (18, 1)
142 is the number of features.
18: I have no idea.
I believe it works as follows:
shap_values need to be averaged
and paired with the feature names: pd.DataFrame(feature_names, columns = ['feature_names'])
Does anybody have experience interpreting shap_values?
At first I thought that the number of values was the number of features x the number of rows.
Combining the other two answers like this worked for me.
feature_names = X_train.columns
rf_resultX = pd.DataFrame(shap_values, columns = feature_names)
vals = np.abs(rf_resultX.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)),
columns=['col_name','feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'],
ascending=False, inplace=True)
shap_importance.head()
shap_values have (num_rows, num_features) shape; if you want to convert it to dataframe, you should pass the list of feature names to the columns parameter: rf_resultX = pd.DataFrame(shap_values, columns = feature_names).
Each sample has its own shap value for each feature; the shap value tells you how much that feature has contributed to the prediction for that particular sample; this is called a local explanation. You could average shap values for each feature to get a feeling of global feature importance, but I'd suggest you take a look at the documentation since the shap package itself provides much more powerful visualizations/interpretations.
From https://github.com/slundberg/shap/issues/632
vals = np.abs(shap_values.values).mean(0)
feature_names = train_x.columns  # columns is an attribute, not a method
feature_importance = pd.DataFrame(list(zip(feature_names, vals)),
                                  columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],
                               ascending=False, inplace=True)
feature_importance.head()
I wrote a short function for this which also works for multi-class classifications. It expects the data as a pandas DataFrame, a list of shap value arrays with one array for each class, and optionally a list of columns for which you want the average shap values.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
def shap_feature_ranking(data, shap_values, columns=None):
    # If columns are not given, take all columns
    if not columns:
        columns = data.columns.tolist()
    # Get column locations for desired columns in given dataframe
    c_idxs = [data.columns.get_loc(column) for column in columns]
    if isinstance(shap_values, list):  # If shap values is a list of arrays (i.e., several classes)
        # Compute mean absolute shap values per class
        means = [np.abs(shap_values[class_][:, c_idxs]).mean(axis=0) for class_ in range(len(shap_values))]
        shap_means = np.sum(np.column_stack(means), 1)  # Sum of shap values over all classes
    else:  # Else there is only one 2D array of shap values
        assert len(shap_values.shape) == 2, 'Expected two-dimensional shap values array.'
        # Restrict to the requested columns so the result lines up with `columns`
        shap_means = np.abs(shap_values[:, c_idxs]).mean(axis=0)
    # Put into dataframe along with columns and sort by shap_means, reset index to get ranking
    df_ranking = pd.DataFrame({'feature': columns, 'mean_shap_value': shap_means}).sort_values(by='mean_shap_value', ascending=False).reset_index(drop=True)
    df_ranking.index += 1
    return df_ranking
For the latest version 0.40.0:
feature_names = shap_values.feature_names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
vals = np.abs(shap_df.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
I am using pandas qcut to split some data into 20 bins as part of data prep for training of a binary classification model like so:
data['VAR_BIN'] = pd.qcut(cc_data[var], 20, labels=False)
My question is, how can I apply the same binning logic derived from the qcut statement above to a new set of data, say for model validation purposes. Is there an easy way to do this?
Thanks
You can do it by passing retbins=True.
Consider the following DataFrame:
import pandas as pd
import numpy as np
prng = np.random.RandomState(0)
df = pd.DataFrame(prng.randn(100, 2), columns = ["A", "B"])
pd.qcut(df["A"], 20, retbins=True, labels=False) returns a tuple whose second element is the bins. So you can do:
ser, bins = pd.qcut(df["A"], 20, retbins=True, labels=False)
ser holds the integer bin codes (because labels=False) and bins are the break points. Now you can pass bins to pd.cut to apply the same grouping to the other column:
pd.cut(df["B"], bins=bins, labels=False, include_lowest=True)
Out[38]:
0 13
1 19
2 3
3 9
4 13
5 17
...
User @Karen said:
By using this logic, I am getting Na values in my validation set. Is there some way to solve it?
If this is happening to you, it most likely means that the validation set has values below the smallest (or above the greatest) value in the training data. Those values fall outside the range of the bins and are therefore not assigned a bin.
You can solve this problem by extending the range of the training data:
# Make smallest value arbitrarily smaller
train.loc[train['value'].eq(train['value'].min()), 'value'] = train['value'].min() - 100
# Make greatest value arbitrarily greater
train.loc[train['value'].eq(train['value'].max()), 'value'] = train['value'].max() + 100
# Make bins from training data
s, b = pd.qcut(train['value'], 20, retbins=True)
# Cut validation data
test['bin'] = pd.cut(test['value'], b)
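An alternative that leaves the training values untouched is to widen only the outermost bin edges returned by retbins=True. The sketch below uses made-up data, and the names edges and open_edges are hypothetical. It replaces the first and last edge with -inf and +inf, so any out-of-range validation value falls into the first or last bin.

```python
import numpy as np
import pandas as pd

# Made-up training and validation data; the validation set deliberately
# contains values below and above the training range.
train = pd.DataFrame({"value": np.arange(100, dtype=float)})
test = pd.DataFrame({"value": [-5.0, 0.0, 42.0, 99.0, 250.0]})

# Learn the 20 quantile edges on the training data only.
_, edges = pd.qcut(train["value"], 20, retbins=True, labels=False)

# With the raw edges, out-of-range values are not assigned a bin (NaN).
closed_bins = pd.cut(test["value"], bins=edges, labels=False, include_lowest=True)

# Open the outer edges so every value falls into the first or last bin.
open_edges = edges.copy()
open_edges[0] = -np.inf
open_edges[-1] = np.inf
test["bin"] = pd.cut(test["value"], bins=open_edges, labels=False)

print(closed_bins)
print(test)
```

The inner edges are still the training quantiles, so in-range values get the same bins as before; only the two outer bins are extended.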