I am doing a churn analysis for the telecom industry on a sample dataset. In the code below I am using the decision tree algorithm in Spark through Python. The dataset has multiple columns and I am selecting only the columns I need for my feature set.
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import os.path
import numpy as np
inputPath = os.path.join('file1.csv')
file_name = os.path.join(inputPath)
# Load the CSV and drop the header row (row index 0)
data = (sc.textFile(file_name).zipWithIndex()
        .filter(lambda x: x[1] > 0).map(lambda x: x[0]))
# Column 5 is the churn flag ('True'/'False'); columns 6 and 7 are the features
final_data = (data.map(lambda line: line.split(","))
              .filter(lambda line: len(line) > 1)
              .map(lambda line: LabeledPoint(1 if line[5] == 'True' else 0, [line[6], line[7]])))
(trainingdata, testdata) = final_data.randomSplit([0.7, 0.3])
model = DecisionTree.trainRegressor(trainingdata, categoricalFeaturesInfo={},
                                    impurity='variance', maxDepth=5, maxBins=32)
predictions = model.predict(testdata.map(lambda x: x.features))
prediction = predictions.collect()
labelsAndPredictions = testdata.map(lambda lp: lp.label).zip(predictions)
Now this code works fine and does the prediction, but what I am missing is an identifier for each customer in the prediction set or test data. My dataset has a customerid column (column number 4) which I am not currently selecting, since it is not a feature to be considered in the model. I am having difficulty associating this customerid column with the test data for the customers whose details are in it. If I add this column to the feature vector I am building in the LabeledPoint, it leads to an error, because it is not a feature value.
How can I include this column in my analysis so that I can get, say, the top 50 customers with the highest predicted churn values?
You can do it exactly the same way as you add the label after prediction.
Small helper:
customerIndex = ... # Put index of the column
def extract(line):
"""Given a line create a tuple (customerId, labeledPoint)"""
label = 1 if line[5] == 'True' else 0
point = LabeledPoint(label, [line[6], line[7]])
customerId = line[customerIndex]
return (customerId, point)
Prepare the data using the extract function:
final_data = (data
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1)
    .map(extract))  # Map to (customerId, LabeledPoint) tuples
Train:
# As before
(trainingdata, testdata) = final_data.randomSplit([0.7, 0.3])
# Use only points, put the rest of the arguments in place of ...
model = DecisionTree.trainRegressor(trainingdata.map(lambda x: x[1]), ...)
Predict:
# Make predictions using the points only
predictions = model.predict(testdata.map(lambda x: x[1].features))
# Add customer id and label
labelsIdsAndPredictions = (testdata
    .map(lambda x: (x[0], x[1].label))
    .zip(predictions))
Extract top 50:
top50 = labelsIdsAndPredictions.top(50, key=lambda x: x[1])
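Each element of labelsIdsAndPredictions is a ((customerId, label), prediction) tuple, so x[1] in the key function above is the predicted churn score. A small sketch of inspecting the result, using the variable names from above:
# each element of top50 is ((customerId, label), prediction)
for (customer_id, label), churn_score in top50:
    print(customer_id, label, churn_score)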
I am trying to forecast sales for multiple time series I took from Kaggle's Store Item Demand Forecasting Challenge. It consists of long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. For each store and each item, I have 5 years of daily records with weekly and annual seasonalities.
In total there are: 365.2 days × 5 years × 10 stores × 50 items = 913,000 records.
From my understanding, based on what I've read so far on hierarchical and grouped time series, the whole dataframe could be structured as a grouped time series rather than a strict hierarchical time series, since aggregation can be done at the store or item level interchangeably.
I want to find a way to forecast all 500 time series (store1_item1, store1_item2, ..., store10_item50) for the next year (from 01-Jan-2015 to 31-Dec-2015) using the scikit-hts library and its AutoArimaModel function, which is a wrapper around pmdarima's AutoArima.
To handle the two levels of seasonality, I added Fourier terms as exogenous features to deal with the annual seasonality, while auto_arima deals with the weekly seasonality.
My problem is that I get an error during the prediction step.
Here's the error message :
ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).
I assume something is wrong with the exogenous dictionary, but I do not know how to solve the issue as I'm using scikit-hts for the first time. I followed the official documentation of scikit-hts here.
EDIT:
I had not seen that a similar bug was already reported on GitHub. After applying the proposed fix locally, I do get some results. However, even though the code now runs without error, some of the forecasts are negative, as raised in the comments below this post, and the positive ones take disproportionate values.
Here are the plots for all the combinations of store and item. You can see that this seems to work for only one combination.
df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()
df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()
df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()
df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()
Complete code:
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor
# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']
# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]
# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]
# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index= 'date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)
# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
store_item_ts.columns = col_names
# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(axis=1)
# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
df = pd.concat([df, stores_ts, items_ts, store_item_ts], axis=1)
# Build fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)
# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns = [f'store_{i}_item_{j}_' + x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)
# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis=1)
# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns = [f'store_{i}_item_{j}_' + x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)
# Build the hierarchy of the Grouped Time Series
stores = [i for i in stores_ts.columns]
items = [i for i in items_ts.columns]
store_items = col_names
# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}
# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}
# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}
# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)
# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)
# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)
# Make predictions
# Set the exogenous_df param
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
Other approaches I thought of, and that I already implemented successfully for one series (for store 1 and item 1, for example):
- TBATS applied to each series independently, inside a loop across all 500 time series
- auto_arima (SARIMAX) with exogenous features (= Fourier terms to deal with the weekly and annual seasonalities) for each series independently, inside a loop across all 500 time series
What do you think of these approaches? Do you have other suggestions on how to scale ARIMA to multiple time series?
I also want to try LSTM, but I'm new to data science and deep learning and do not know how to prepare the data. Should I keep the data in their original (long) format and one-hot encode the train_data['store'] and train_data['item'] columns, or should I start from the df I ended up with here?
I hope this helped you fix the issue with the exogenous regressors. To handle the negative forecasts, I would suggest trying a square-root transformation.
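As a rough sketch of that idea (not tied to any particular scikit-hts API; it reuses the df and exog_test_df built in your code, and the 'FOURIER' column filter is the same naming you already rely on): take the square root of the target series before fitting, then square the forecasts, which guarantees non-negative values.
import numpy as np
# sketch only: transform the target columns, leave the Fourier exogenous columns untouched
value_cols = [c for c in df.columns if 'FOURIER' not in c]
df_sqrt = df.copy()
df_sqrt[value_cols] = np.sqrt(df_sqrt[value_cols])
# fit exactly as before, but on df_sqrt
# model = autoarima.fit(df_sqrt, hierarchy, exogenous=exogenous)
# predictions_sqrt = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
# back-transform: squaring keeps the forecasts non-negative
# predictions = predictions_sqrt[value_cols] ** 2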
I've built a functioning classification model following this tutorial.
I bring in a csv and then pass each row's text value into a function which calls the classification model to make a prediction. The function returns an array which I need to put into columns in the dataframe.
Function:
import numpy as np

def get_top_k_predictions(model, X_test, k):
    # get probabilities instead of predicted labels, since we want to collect the top k
    np.set_printoptions(suppress=True)
    probs = model.predict_proba(X_test)
    # get the top k predictions by probability - note these are just indices
    best_n = np.argsort(probs, axis=1)[:, -k:]
    # map the indices to (category, probability) pairs
    preds = [
        [(model.classes_[predicted_cat], distribution[predicted_cat])
         for predicted_cat in prediction]
        for distribution, prediction in zip(probs, best_n)]
    preds = [item[::-1] for item in preds]
    return preds
Function Call:
for index, row in df.iterrows():
    category_test_features = category_loaded_transformer.transform(df['Text'].values.astype('U'))
    df['PREDICTION'] = get_top_k_predictions(category_loaded_model, category_test_features, 9)
This is the output from the function:
[[('Learning Activities', 0.001271131465669718),
('Communication', 0.002696299964802842),
('Learning Objectives', 0.002774964762863968),
('Learning Technology', 0.003557563051027678),
('Instructor/TAs', 0.004512712287403168),
('General', 0.006675929282872587),
('Learning Materials', 0.013051869950436862),
('Course Structure', 0.02781481160602757),
('Community', 0.9376447176288959)]]
I want the output to look like this in the end.
Your function returns a list that contains a list of tuples. Why the double-nested list? One way I can think of:
tmp = {}
for index, row in df.iterrows():
    predictions = get_top_k_predictions(...)
    tmp[index] = {
        key: value for key, value in predictions[0]
    }
tmp = pd.DataFrame(tmp).T
df = df.join(tmp)
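If the row-by-row loop is not actually needed, a variant of the same idea (assuming the category_loaded_transformer and category_loaded_model objects from your snippet) is to transform the whole column once and expand the (label, probability) pairs directly into columns:
# sketch only: transform all rows in one call, predict once, then expand into columns
features = category_loaded_transformer.transform(df['Text'].values.astype('U'))
preds = get_top_k_predictions(category_loaded_model, features, 9)
# each element of preds is a list of (label, probability) pairs for one row
pred_df = pd.DataFrame([dict(row) for row in preds], index=df.index)
df = df.join(pred_df)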
I have a csv file with 3 columns containing an image data set. The 1st column, 'ID', is the patient id; the 2nd and 3rd columns are the side and the label of the data set respectively. I would like to split this dataframe into train and test sets according to patient ID, such that a patient ID never appears in both sets, i.e. the train IDs are not present in the test set. I am using the code below.
# Splitting the dataframe into train and test sets
df_Datacopy = df_Data.copy() # copy the df
#df_Datacopy= df_Datacopy.sort_values(by=['ID'])
df_Datacopy = df_Datacopy.sample(frac=1)
train_df = df_Datacopy.sample(frac=0.80, random_state=0) # train split size 80%
# sorted according to ID
train_df= train_df.sort_values(by=['ID'])
# test split and by removing train index
test_df = df_Datacopy.drop(train_df.index)
# sorted according to ID
test_df= test_df.sort_values(by=['ID'])
u1 = np.unique(train_df['ID'])
u2 = np.unique(test_df['ID'])
print(set(u1).union(set(u2)))
I tried to split the data into test and train sets, but the problem is that I see some IDs present in both the test and the train set. It would be a great help if I could get some guidance, including a code example.
Simple Python Lists Approach
I would recommend using simple Python lists for this as the preferred and simpler approach. Since you started with pandas, I'll also provide a way to use pandas methods to achieve something similar, though with a possibly worse outcome.
whole_dataset_list = df_copy.to_numpy().tolist()
patientid_list = df['ID'].to_numpy().tolist()
patientid_set = list(set(patientid_list))
import random as rand
rand.shuffle(patientid_set)
# Change the numbers so the slices represent an 80/10/10 split of your patient IDs
train_set_by_patientID = patientid_set[0:800]    # 80%
val_set_by_patientID = patientid_set[800:900]    # 10%
test_set_by_patientID = patientid_set[900:]      # 10%
After splitting these lists you can use them to obtain the final train/test/val splits as such.
train_set, val_set, test_set = [], [], []
for i in range(len(whole_dataset_list)):
    # the patient ID is the first column ('ID') of each row
    curr_pt_id = whole_dataset_list[i][0]
    if curr_pt_id in train_set_by_patientID:
        train_set.append(whole_dataset_list[i])
    elif curr_pt_id in val_set_by_patientID:
        val_set.append(whole_dataset_list[i])
    elif curr_pt_id in test_set_by_patientID:
        test_set.append(whole_dataset_list[i])
    else:
        raise RuntimeError("Row %d has a patient ID that is not in any split" % i)
Finally, you can convert back to dataframes if you want:
train_df = pd.DataFrame(train_set, columns=df_copy.columns)
val_df = pd.DataFrame(val_set, columns=df_copy.columns)
test_df = pd.DataFrame(test_set, columns=df_copy.columns)
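To verify that no patient appears in more than one split (the same check you were doing with np.unique, just written as assertions over the dataframes built above):
train_ids, val_ids, test_ids = set(train_df['ID']), set(val_df['ID']), set(test_df['ID'])
assert not (train_ids & val_ids)
assert not (train_ids & test_ids)
assert not (val_ids & test_ids)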
Second Option using Pandas Only:
Here sop_uid is a unique index. I am using a train/val/test split instead of a train/test split but that can be changed easily.
dff.sort_values(by="patient_id", axis=0, inplace=True)
# groupby_agg (from the pyjanitor extension) counts the rows (sop_uid) per patient_id
count_study = dff.groupby_agg(by='patient_id', agg='count',
                              agg_column_name='sop_uid', new_column_name="count_instances")
df_Datacopy = count_study
# train split size 90%, weighted by the number of instances per patient
train_df = df_Datacopy.sample(frac=0.90, weights='count_instances', random_state=0)
train_df = train_df.sort_values(by=['count_instances'], ascending=False)
# test split obtained by removing the train index
test_df = df_Datacopy.drop(train_df.index)
# sorted according to count_instances
test_df = test_df.sort_values(by=['count_instances'], ascending=False)
# sample again to carve a validation set out of the remaining data
train_df = train_df.sample(frac=0.89, weights='count_instances', random_state=0)
train_df = train_df.sort_values(by=['count_instances'], ascending=False)
val_df = df_Datacopy.drop(train_df.index.append(test_df.index))
I recommend using a boolean mask to filter the dataset.
If you want to split 50/50, checking whether the ID is even or odd might work.
Since you didn't provide any sample data or further detail on the criteria to split by, I suggest:
train_df = df[df.ID % 2 == 0]
test_df = df[df.ID % 2 != 0]
Is that what you wanted to achieve?
If not, please provide more information on the result you want.
PLEASE READ:
I have looked at all the other answers related to this question and none of them solve my specific problem so please carry on reading below.
I have the code below. What it basically does is keep the Title column and concatenate the rest of the columns into one, in order to be able to create a cosine similarity matrix.
The main point is the recommendations function, which is supposed to take a title as input and return the top 10 matches based on that title, but what I get at the end is the error index 0 is out of bounds for axis 0 with size 0 and I have no idea why.
import pandas as pd
from rake_nltk import Rake
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7')
df = df[['Title','Genre','Director','Actors','Plot']]
df.head()
df['Key_words'] = ""
for index, row in df.iterrows():
    plot = row['Plot']
    # instantiating Rake, by default it uses english stopwords from NLTK
    # and discards all punctuation characters as well
    r = Rake()
    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)
    # getting the dictionary with key words as keys and their scores as values
    key_words_dict_scores = r.get_word_degrees()
    # assigning the key words to the new column for the corresponding movie
    row['Key_words'] = list(key_words_dict_scores.keys())
# dropping the Plot column
df.drop(columns = ['Plot'], inplace = True)
# instantiating and generating the count matrix
df['bag_of_words'] = df[df.columns[1:]].apply(lambda x: ' '.join(x.astype(str)), axis=1)
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim
indices = pd.Series(df.index)
# defining the function that takes in movie title
# as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim = cosine_sim):
    #print(title)
    # initializing the empty list of recommended movies
    recommended_movies = []
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is ' + str(idx))
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
This line:
idx = indices[indices == title].index[0]
will fail if you do not return a match:
df.loc[df['Title']=='This is not a valid title'].index[0]
returns:
IndexError: index 0 is out of bounds for axis 0 with size 0
You need to confirm that the title you are passing in is actually in DF before trying to access any data associated with it:
def recommendations(title, cosine_sim = cosine_sim):
    #print(title)
    # initializing the empty list of recommended movies
    recommended_movies = []
    if title not in indices.values:
        raise KeyError("title is not in indices")
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is ' + str(idx))
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
This expression also seems to be doing nothing:
for index, row in df.iterrows():
    plot = row['Plot']
If you just want a single plot record with which to do some development try:
plot = df['Plot'].sample(n=1)
Finally, it appears that recommendations uses the global variable indices - in general this is bad practice, because if indices changes outside the scope of recommendations, the function might break. I would consider refactoring this to be a little less brittle overall.
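A minimal sketch of that refactor, with the same logic as the function above but the dependencies passed in explicitly rather than read from globals:
def recommendations(title, indices, cosine_sim, df, n=10):
    # fail early with a clear message if the title is unknown
    if title not in indices.values:
        raise KeyError(f"{title!r} is not a known title")
    idx = indices[indices == title].index[0]
    # similarity scores for this movie, highest first (position 0 is the movie itself)
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    top_indexes = list(score_series.iloc[1:n + 1].index)
    return [list(df.index)[i] for i in top_indexes]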
Consider my following code:
columns = ['tf-idf', 'bag_of_words']
index = ['MultinomialNB', 'LinearSVC', 'LogisticRegression',
         'DecisionTreeClassifier', 'MLPClassifier']
df = pd.DataFrame(columns=columns, index=index)
estimators_dict = OrderedDict([('MultiNomialNB', MultinomialNB()),
                               ('LinearSVC', LinearSVC()),
                               ('LogisticRegression', LogisticRegression()),
                               ('DecisionTreeClassifier', DecisionTreeClassifier()),
                               ('MLPClassifier', MLPClassifier(max_iter=10))])
transformers_dict = OrderedDict([('tf-idf', TfidfVectorizer(max_features=500)),
                                 ('bag_of_words', CountVectorizer())])
steps = []
for transformer_name, transformer in transformers_dict.items():
    steps.append((transformer_name, transformer))
    for estimator_name, estimator in estimators_dict.items():
        steps.append((estimator_name, estimator))
        model = Pipeline(steps)
        predicted_labels = cross_val_predict(model, all_features, all_labels, cv=5)
        # f1 is float
        f1 = f1_score(all_labels, predicted_labels, average='weighted')
        # writing to DataFrame
        df[transformer_name][estimator_name] = round(f1, 2)
        # This correctly shows the value which was just written
        print(str(df[str(transformer_name)][str(estimator_name)]))  # line a
        del steps[1]
    del steps[0]
# but writing to csv creates a file with no values whatsoever
df.to_csv('classification_results_f1score')  # line b
Quick little context: in my classification task I am using a set of feature transformers and another set of sklearn classifiers. I am running all possible combinations of these two sets to see which model performs best.
I am calculating the f1-score (a float value) of each model and storing it in a dataframe. The value is successfully written to the dataframe; I am able to verify this by accessing it (line a).
But after all the model runs are over (end of both for loops), when I write the dataframe to a csv, it generates a csv as follows:
,tf-idf,bag_of_words
MultinomialNB,,
LinearSVC,,
LogisticRegression,,
DecisionTreeClassifier,,
MLPClassifier,,
What seems to be the issue here? Why are the values not showing up in the csv?