Array to columns in dataframe - python

I've built a functioning classification model following this tutorial.
I bring in a CSV and then pass each row's text value into a function which calls the classification model to make a prediction. The function returns an array which I need to put into columns in the dataframe.
Function:
def get_top_k_predictions(model, X_test, k):
    # get probabilities instead of predicted labels, since we want to collect the top k
    np.set_printoptions(suppress=True)
    probs = model.predict_proba(X_test)
    # GET TOP K PREDICTIONS BY PROB - note these are just indices
    best_n = np.argsort(probs, axis=1)[:, -k:]
    # GET CATEGORY OF PREDICTIONS
    preds = [
        [(model.classes_[predicted_cat], distribution[predicted_cat])
         for predicted_cat in prediction]
        for distribution, prediction in zip(probs, best_n)]
    preds = [item[::-1] for item in preds]
    return preds
Function Call:
for index, row in df.iterrows():
    category_test_features = category_loaded_transformer.transform(df['Text'].values.astype('U'))
    df['PREDICTION'] = get_top_k_predictions(category_loaded_model, category_test_features, 9)
This is the output from the function:
[[('Learning Activities', 0.001271131465669718),
('Communication', 0.002696299964802842),
('Learning Objectives', 0.002774964762863968),
('Learning Technology', 0.003557563051027678),
('Instructor/TAs', 0.004512712287403168),
('General', 0.006675929282872587),
('Learning Materials', 0.013051869950436862),
('Course Structure', 0.02781481160602757),
('Community', 0.9376447176288959)]]
I want the output to look like this in the end.

Your function returns a list that contains a list of tuples? Why the double-nested list? One way I can think of:
tmp = {}
for index, row in df.iterrows():
    predictions = get_top_k_predictions(...)
    tmp[index] = {
        key: value for key, value in predictions[0]
    }
tmp = pd.DataFrame(tmp).T
df = df.join(tmp)  # join returns a new frame, so reassign
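Since get_top_k_predictions already works on the whole feature matrix, the iterrows loop in the question isn't strictly needed either. A minimal vectorized sketch, reusing the transformer and model names from the question:
category_test_features = category_loaded_transformer.transform(df['Text'].values.astype('U'))
preds = get_top_k_predictions(category_loaded_model, category_test_features, 9)
# one {label: probability} dict per row, expanded into one column per label
pred_df = pd.DataFrame([dict(row) for row in preds], index=df.index)
df = df.join(pred_df)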


Grouped Time Series forecasting with scikit-hts

I am trying to forecast sales for multiple time series I took from Kaggle's Store Item Demand Forecasting Challenge. It consists of long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. For each store and each item, I have 5 years of daily records with weekly and annual seasonalities.
In total there are: 365.2 days * 5 years * 10 stores * 50 items = 913,000 records.
From my understanding, based on what I've read so far on Hierarchical and Grouped time series, the whole dataframe could be structured as a Grouped Time Series and not simply as a strict Hierarchical Time Series, as aggregation could be done at the store or item level interchangeably.
I want to find a way to forecast all 500 time series (for store1_item1, store1_item2, ..., store10_item50) for the next year (from 01-Jan-2015 to 31-Dec-2015) using the scikit-hts library and its AutoArimaModel function, which is a wrapper around pmdarima's AutoArima function.
To handle the two levels of seasonality, I added Fourier terms as exogenous features to deal with the annual seasonality, while auto_arima deals with the weekly seasonality.
My problem is that I get an error during the prediction step.
Here's the error message:
ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).
I assume something is wrong with the exogenous dictionary but I do not know how to solve the issue as I'm using scikit-hts for the first time. To do this, I followed the official documentation of scikit-hts here.
EDIT :______________________________________________________________
I had not seen that a similar bug was reported on GitHub. After implementing the proposed fix locally, I could get some results. However, even though the code now runs without error, some of the forecasts are negative, as raised in the comments below this post, and we even get disproportionate values for the positive ones.
Here are the plots for all the combinations of store and item. You can see that this seems to work for only one combination.
df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()
df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()
df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()
df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()
_____________________________________________________________________
Complete code:
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor
# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']
# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]
# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]
# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index= 'date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)
# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
store_item_ts.columns = store_item_ts.columns.droplevel(0)
store_item_ts.columns = col_names
# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(1)
# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
df = pd.concat([df, stores_ts, items_ts, store_item_ts], 1)
# Build fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)
# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns = [f'store_{i}_item_{j}_' + x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)
# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis= 1)
# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns = [f'store_{i}_item_{j}_' + x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)
# Build the hierarchy of the Grouped Time Series
stores = [i for i in stores_ts.columns]
items = [i for i in items_ts.columns]
store_items = col_names
# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}
# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}
# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}
# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)
# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)
# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)
# Make predictions
# Set the exogenous_df param
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
Other approaches I thought of, and that I already implemented successfully for one series (for store 1 and item 1, for example):
TBATS applied to each series independently inside a loop across all 500 time series
auto_arima (SARIMAX) with exogenous features (=Fourier terms to deal with the weekly and annual seasonalities) for each series independently + a loop across all 500 time series
What do you think of these approaches? Do you have other suggestions on how to scale ARIMA to multiple time series?
I also want to try LSTM but I'm new to data science and deep learning and do not know how to prepare the data. Should I keep the data in their original form (long format) and apply one hot encoding to train_data['store'] and train_data['item'] columns or should I start with the df I ended up with here?
I hope this helps you fix the issue with the exogenous regressors. To handle the negative forecasts, I would suggest trying a square-root transformation.
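A minimal sketch of what that could look like with the variables from the question's code (untested with scikit-hts; only the target series are transformed, the Fourier columns are left untouched):
import numpy as np
# transform only the target series, not the exogenous Fourier columns
value_cols = [c for c in df.columns if c not in exog_train_df.columns]
df_sqrt = df.copy()
df_sqrt[value_cols] = np.sqrt(df_sqrt[value_cols])
model = autoarima.fit(df_sqrt, hierarchy, exogenous=exogenous)
predictions_sqrt = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
predictions = predictions_sqrt ** 2  # squaring back guarantees non-negative forecasts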

Drop bad data from dataset Tensorflow

I have a training pipeline using tf.data. Inside the dataset there are some bad elements, in my case values of 0. How do I drop these bad elements based on their value? I want to be able to remove them within the pipeline while training, since the dataset is large.
Assume the following pseudo code:
def parse_function(element):
    height = element['height']
    if height <= 0: skip()  # How to skip this value?
    labels = element['label']
    features['height'] = height
    return features, labels

ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.map(parse_function)
A suggestion would be using ds.skip(1) based on the feature value, or providing some sort of neutral weight/loss?
You can use tf.data.Dataset.filter:
def filter_func(elem):
    """Return True if the element is to be kept."""
    return tf.math.greater(elem['height'], 0)

ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.filter(filter_func)
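For context, a minimal end-to-end sketch of the same idea, assuming each element is a dict with 'height' and 'label' keys as implied by the question's pseudo code:
import tensorflow as tf

# toy dataset with one bad element (height == 0)
ds = tf.data.Dataset.from_tensor_slices({'height': [170.0, 0.0, 182.0],
                                         'label': [1, 0, 1]})

def parse_function(element):
    features = {'height': element['height']}
    return features, element['label']

clean_ds = (ds
            .filter(lambda e: tf.math.greater(e['height'], 0))  # drop bad rows first
            .map(parse_function))

for features, label in clean_ds:
    print(features['height'].numpy(), label.numpy())  # prints 170.0 1 and 182.0 1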
Assuming that element is a data frame in your code, then it would be:
def parse_function(element):
    element = element.query('height>0')
    labels = element['label']
    features['height'] = element['height']
    return features, labels

ds = tf.data.Dataset.from_tensor_slices(ds_files)
clean_ds = ds.map(parse_function)

Extract Dictionary Values from Classifier Output

I'm trying zero-shot classification. I get an output like below
[{'labels': ['rep_appreciation',
'cx_service_appreciation',
'issue_resolved',
'recommend_product',
'callback_realted',
'billing_payment_related',
'disppointed_product'],
'scores': [0.9198898673057556,
0.8672246932983398,
0.79215407371521,
0.6239275336265564,
0.4782547056674957,
0.39024001359939575,
0.010263209231197834],
'sequence': 'Alan Edwards provided me with nothing less the excellent assistance'}]
Above is the output for one row in a data frame.
I'm hoping to finally build a data frame with columns and output values mapped like below: 1s for labels whose scores are above a certain threshold.
Any nudge/help to solve this is highly appreciated.
Define a function which returns a key: value dictionary for every row, with the key being the label and the value being 1/0 based on the threshold:
def get_label_score_dict(row, threshold):
    result_dict = dict()
    for _label, _score in zip(row['labels'], row['scores']):
        if _score > threshold:
            result_dict.update({_label: 1})
        else:
            result_dict.update({_label: 0})
    return result_dict
Now if you have a list_of_rows with each row being in the form shown above, you can use the map function to get the above-mentioned dictionary for every row. Once you have this, convert it into a DataFrame.
th = 0.5 #whatever threshold value you want
result = list(map(lambda x: get_label_score_dict(x, th), list_of_rows))
result_df = pd.DataFrame(result)
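If the classifier outputs live in a column of your original data frame, a small sketch of how the two could be stitched together (assuming a hypothetical column named 'zero_shot_output' holding one such dict per row):
import pandas as pd

# 'zero_shot_output' is a hypothetical column holding the classifier output per row
label_df = pd.DataFrame(
    [get_label_score_dict(out, th) for out in df['zero_shot_output']],
    index=df.index,
)
df = df.join(label_df)  # adds one 0/1 column per label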

DataFrame has float values but calling to_csv() on it generates an empty CSV

Consider my following code:
columns = ['tf-idf', 'bag_of_words']
index = ['MultinomialNB', 'LinearSVC', 'LogisticRegression',
         'DecisionTreeClassifier', 'MLPClassifier']
df = pd.DataFrame(columns=columns, index=index)
estimators_dict = OrderedDict([('MultiNomialNB', MultinomialNB()),
                               ('LinearSVC', LinearSVC()),
                               ('LogisticRegression', LogisticRegression()),
                               ('DecisionTreeClassifier', DecisionTreeClassifier()),
                               ('MLPClassifier', MLPClassifier(max_iter=10))])
transformers_dict = OrderedDict([('tf-idf', TfidfVectorizer(max_features=500)),
                                 ('bag_of_words', CountVectorizer())])
steps = []
for transformer_name, transformer in transformers_dict.items():
    steps.append((transformer_name, transformer))
    for estimator_name, estimator in estimators_dict.items():
        steps.append((estimator_name, estimator))
        model = Pipeline(steps)
        predicted_labels = cross_val_predict(model, all_features, all_labels, cv=5)
        # f1 is float
        f1 = f1_score(all_labels, predicted_labels, average='weighted')
        # writing to DataFrame
        df[transformer_name][estimator_name] = round(f1, 2)
        # This correctly shows the value which was just written
        print(str(df[str(transformer_name)][str(estimator_name)]))  # line a
        del steps[1]
        del steps[0]
# but writing to csv creates a file with no values whatsoever
df.to_csv('classification_results_f1score')  # line b
Quick little context: In my classification task I am using a set of feature transformers and another set of sklearn classifiers. I am running all possible combinations of these two sets to see which model performs the best.
I am calculating the f1-score (a float value) of each model and storing it in a dataframe. The value is successfully written to the dataframe; I am able to verify this by accessing it (line a).
But after all the model runs are over (end of both for loops), when I write the dataframe to a CSV (line b), it generates a CSV as follows:
,tf-idf,bag_of_words
MultinomialNB,,
LinearSVC,,
LogisticRegression,,
DecisionTreeClassifier,,
MLPClassifier,,
What seems to be the issue here? Why are the values not showing up in the csv?

issue in analysis using decision tree algorithm in Spark and Python

I am doing a churn analysis for the telecom industry and I have a sample dataset. I have written the code below, where I am using the decision tree algorithm in Spark through Python. The dataset has multiple columns and I am selecting the columns that I need for my feature set.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import os.path
import numpy as np
inputPath = os.path.join('file1.csv')
file_name = os.path.join(inputPath)
data = sc.textFile(file_name).zipWithIndex().filter(lambda (line,rownum): rownum>0).map(lambda (line, rownum): line)
final_data = data.map(lambda line: line.split(",")).filter(lambda line: len(line)>1).map(lambda line:LabeledPoint(1 if line[5] == 'True' else 0,[line[6],line[7]]))
(trainingdata, testdata) = final_data.randomSplit([0.7, 0.3])
model = DecisionTree.trainRegressor(trainingdata, categoricalFeaturesInfo={},
                                    impurity='variance', maxDepth=5, maxBins=32)
predictions = model.predict(testdata.map(lambda x: x.features))
prediction= predictions.collect()
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
Now this code works fine and does the prediction, but what I am missing is the identifier for each customer in the prediction set, or testdata. In my dataset there is a column for customerid (column number 4) which as of now I am not selecting, as it's not a feature to be considered in the model. I am having difficulty associating this customerid column with the testdata for the customers whose details are in the testdata. If I select this column from the dataset in the feature vector I am forming in the LabeledPoint, this leads to an error, as it's not a feature value.
How can I add this column to my analysis so that I can get, say, the top 50 customers who have the highest churn value?
You can do it exactly the same way as you add the label after prediction.
Small helper:
customerIndex = ...  # Put index of the column

def extract(line):
    """Given a line create a tuple (customerId, labeledPoint)"""
    label = 1 if line[5] == 'True' else 0
    point = LabeledPoint(label, [line[6], line[7]])
    customerId = line[customerIndex]
    return (customerId, point)
Prepare data using the extract function:
final_data = (data
              .map(lambda line: line.split(","))
              .filter(lambda line: len(line) > 1)
              .map(extract))  # Map to tuples
Train:
# As before
(trainingdata, testdata) = final_data.randomSplit([0.7, 0.3])
# Use only points, put the rest of the arguments in place of ...
model = DecisionTree.trainRegressor(trainingdata.map(lambda x: x[1]), ...)
Predict:
# Make predictions using points
predictions = model.predict(testdata.map(lambda x: x[1].features))
# Add customer id and label
labelsIdsAndPredictions = (testdata
                           .map(lambda x: (x[0], x[1].label))
                           .zip(predictions))
Extract top 50:
top50 = labelsIdsAndPredictions.top(50, key=lambda x: x[1])
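If you only need the ids of those customers, a small follow-up sketch (given the zip above, each element of top50 has the shape ((customerId, label), prediction)):
# pull out just the customer ids of the top 50 predicted churners
top50_ids = [customer_id for ((customer_id, label), prediction) in top50]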
