Inconsistent results using ALS in Apache Spark - python

I'm very new to Apache Spark and big data in general. I'm using the ALS method to create rating predictions based on a matrix of users, items, and ratings. The confusing part is that when I run the script to calculate the predictions, the results are different every time, without the input or the requested predictions changing. Is this expected behavior, or should the results be identical? Below is the Python code for reference.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS

sc = SparkContext("local", "CF")

# parse a "user,item,rating" line into a (user, item, rating) tuple
def parseRating(line):
    fields = line.split(',')
    return (int(fields[0]), int(fields[1]), float(fields[2]))

# define input and output files
ratingsFile = 's3n://weburito/data/weburito_ratings.dat'
unratedFile = 's3n://weburito/data/weburito_unrated.dat'
predictionsFile = '/root/weburito/data/weburito_predictions.dat'

# read training set
training = sc.textFile(ratingsFile).map(parseRating).cache()

# read the (user, item) pairs whose ratings are unknown
unrated = sc.textFile(unratedFile).map(parseRating)

# train the model
model = ALS.train(training, rank=5, iterations=20)

# generate predictions
predictions = model.predictAll(unrated.map(lambda x: (x[0], x[1]))).collect()

This is expected behaviour. The factor matrices in ALS are initialized randomly (well actually one of them is, and the other is solved based on that initialization in the first step).
So different runs will give slightly different results.
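If you need reproducible runs, ALS.train also accepts a seed argument; fixing it makes the random factor initialization (and therefore the resulting predictions) deterministic. A minimal sketch against the code above:

# a fixed seed makes the random initialization of the factor matrices
# deterministic, so repeated runs on the same input give identical predictions
model = ALS.train(training, rank=5, iterations=20, seed=42)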

Related

Batching large input file into MLlib model

Is there any way to batch a large input file (111 MB) made up of 22 million cells (222 rows by ~110k columns) in MLlib, similar to this tutorial made in Keras: Keras batching tutorial?
The file contains the actual features extracted from 222 images using the above tutorial, but instead of using a Keras model I would like to replicate that code using pyspark and MLlib.
Unfortunately I don't have enough resources to handle such a big file in memory, and the computation fails with a Java heap space error.
The file structure is as follows: each row (representing an image) has the 0/1 label in column "_c0" and the extracted features in columns "_c1" up to "_c100353".
Here's my code. I don't care about precision and accuracy; I'm just interested in running the model to collect resource-usage metrics.
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

sql, sc = init_spark()

df = sql.read.option("maxColumns", 100400).load(file3, format="csv", inferSchema="true", sep=',', header="false")
labelIndexer = StringIndexer(inputCol="_c0", outputCol="indexedLabel").fit(df)

# assemble every column except the label into a single feature vector
cols = df.columns
cols.remove("_c0")
assembler = VectorAssembler(inputCols=cols, outputCol="features")
data = assembler.transform(df)

featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=100).fit(data)

(trainingData, testData) = data.randomSplit([0.7, 0.3])
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(100)
predictions.printSchema()

evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g" % accuracy)
Please don't suggest using the sparkdl library with DeepImageFeaturizer for feature extraction, because it's completely broken.
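One aside worth noting (a sketch, not a full answer): inferSchema="true" forces Spark to make an extra full pass over the 100k+ columns just to guess their types, which adds significant memory pressure here. Since the layout described above is known to be all numeric, a hedged alternative is to supply the schema explicitly; the column names below are assumed to follow the "_c0" ... "_c100353" pattern from the question:

from pyspark.sql.types import StructType, StructField, DoubleType

# build the schema up front: "_c0" is the label, "_c1".."_c100353" are features
# (assumes every value in the file is numeric, as described above)
schema = StructType([StructField("_c%d" % i, DoubleType(), True)
                     for i in range(100354)])

df = sql.read.option("maxColumns", 100400) \
    .csv(file3, schema=schema, sep=',', header=False)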

Pyspark with sklearn and dataframe types conversion

I'm trying to use sklearn with pyspark, but I'm having some performance issues. Let's say I have a dataset that has already gone through a pipeline where the features were vectorized and normalized. In order to use sklearn I have to provide the algorithm with an array or a pandas dataframe. Considering the size of my dataset (2.8M+ rows, and it should grow bigger soon), converting the train set to pandas is painfully slow, so I'm using a numpy approach:
train = np.array(train.select('features').collect()).squeeze()
This is relatively slow, as I need to use collect to push the data back to the driver. Is there any other approach that would be faster and better? Additionally, due to the nature of the problem, I'm currently handling my scoring function in a non-standard way:
def score(fitmodel, test):
    predY = fitmodel.score_samples(test)
    return np.full((1, len(predY)), np.mean(predY)).transpose()
The idea is to compute the average of the predictions and then return an array in which this result is replicated as many times as there are records tested. E.g. if my test set has 450 records, I return an array of shape (450, 1) where all 450 entries hold the same value (the average of the predictions). Although slow, so far so good: everything works as intended. My problem is that I need to repeat this test multiple times (changing the test set) and append the results to a single array in order to evaluate the model's performance later. My code:
for _ in tqdm(range(450, 800)):
    # get a different "chunk" of the dataset each iteration
    _test = df.where(col('index') == _)
    _test = _test.coalesce(2)  # coalesce returns a new dataframe, so reassign it
    # apply the pipeline transforms
    test = pipelineModel.transform(_test)
    y_test = np.array(test.select('label').collect())
    x_test = np.array(test.select('features').collect()).squeeze()
    pred = newScore(x_test, model)
    if _ == 450:  # first run
        trueY = y_test
        predY = pred
    else:
        trueY = np.append(trueY, y_test)
        predY = np.append(predY, pred)
Briefly: I grab a specific portion of the dataset, test it, and then append both the predictions and the "true" labels for later evaluation. My main problem is that doing two collects and an np.append per iteration takes a long time, and I need to find an alternative. Testing the entire test set (about 400k entries) in a single pass takes me ~1 min, but with all of this in place it increases to 2h20min.
On top of this, I have to convert the array back to a pyspark dataframe to use the mllib evaluation functions, which adds a bit more time to the process.
With all this being said, could anyone point me in a direction where I could accomplish this more efficiently? Maybe there is another way to combine spark and sklearn?
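Not a definitive answer, but two things usually dominate here: the per-group collect round trips and np.append, which copies the whole array on every call. A hedged sketch that collects once and concatenates once at the end, assuming the same df, pipelineModel, model and newScore as above:

import numpy as np
from pyspark.sql.functions import col

# transform and collect all the groups in one pass instead of 350 round trips
test = pipelineModel.transform(df.where(col('index').between(450, 799)))
rows = test.select('index', 'label', 'features').collect()

# group the rows in plain Python and score each chunk; the parts are
# concatenated once at the end instead of copying the array every iteration
trueY_parts, predY_parts = [], []
for i in range(450, 800):
    chunk = [r for r in rows if r['index'] == i]
    if not chunk:
        continue
    y_test = np.array([r['label'] for r in chunk])
    x_test = np.array([r['features'].toArray() for r in chunk])
    trueY_parts.append(y_test)
    predY_parts.append(newScore(x_test, model))

trueY = np.concatenate(trueY_parts)
predY = np.concatenate(predY_parts)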

LightFM train_interactions shared among train and test sets: This will cause incorrect evaluation, check your data split

tl;dr: I'm working with the Yelp Dataset to make a recommendation system, but I'm running into the error Test interactions matrix and train interactions matrix share 68 interactions. This will cause incorrect evaluation, check your data split. when running the following LightFM code:
test_auc = auc_score(model,
                     test,
                     #train_interactions=train, # unable to run with this line uncommented
                     item_features=sparse_features_matrix,
                     num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)
Full story: I'm working with the Yelp Dataset to build a recommendation system, going off the code provided in the example documentation (https://making.lyst.com/lightfm/docs/examples/hybrid_crossvalidated.html) for hybrid collaborative filtering.
I ran my code the following way:
from sklearn.model_selection import train_test_split
from lightfm import LightFM
from scipy import sparse
from lightfm.evaluation import auc_score

train, test = train_test_split(sparse_Rating_Matrix, test_size=0.25, random_state=4)

# Set the number of threads; you can increase this
# if you have more physical cores available.
NUM_THREADS = 2
NUM_COMPONENTS = 100
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

# Define a new model instance
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Fit the hybrid model. Note that this time, we pass
# in the item features matrix.
model = model.fit(train,
                  item_features=sparse_features_matrix,
                  epochs=NUM_EPOCHS,
                  num_threads=NUM_THREADS)

# Don't forget to pass in the item features again!
train_auc = auc_score(model,
                      train,
                      item_features=sparse_features_matrix,
                      num_threads=NUM_THREADS).mean()
print('Hybrid training set AUC: %s' % train_auc)

test_auc = auc_score(model,
                     test,
                     #train_interactions=train, # unable to run with this line uncommented
                     item_features=sparse_features_matrix,
                     num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)
I had two problems:
1) Running with the line in question uncommented (train_interactions=train) originally yielded an inconsistent shape error, which was resolved as follows: the "test" data set was modified by the block of code below, which appends rows of zeros until its dimensions match those of my train data set (per this recommendation: https://github.com/lyst/lightfm/issues/369):
import numpy as np

# add X users to test so that the number of rows in train and test match
N = train.shape[0]              # rows in the train set
n, m = test.shape               # rows & columns in the test set
z = np.zeros([(N - n), m])      # create the necessary rows of zeros with m columns
test = test.todense()           # temporarily convert test into a dense numpy matrix
test = np.vstack((test, z))     # stack the blank users below test
test = sparse.csr_matrix(test)  # convert back to sparse
2) After the shape issue was resolved, I tried to pass train_interactions=train again, but ran into Test interactions matrix and train interactions matrix share 68 interactions. This will cause incorrect evaluation, check your data split.
I'm not sure how to resolve this second issue. Any ideas?
Details:
-"sparse_features_matrix" is a sparse matrix of {items x categories} where if an item was "Italian" and "Pizza" then the category of "Italian" and "Pizza" would have a value "1" for that item's row ... "0" elsewhere.
-"sparse_Rating_Matrix" is a sparse matrix of {users x items} containing values of the user's ratings to the restaurant (item).
04/08/2020 update:
LightFM has a whole Dataset class that you should use to prep your data set prior to model evaluation. I found a great GitHub post (https://github.com/lyst/lightfm/issues/494) where user Med-ELOMARI provides an amazing walkthrough on a small test data set.
When I prepped my data through this method, I was able to add the user_features I wanted to model (e.g. User_1592 likes "Thai", "Mexican" and "Sushi" cuisines).
Per Turbo's comment, I used LightFM's random_train_test_split method (I had originally split my data via sklearn's train_test_split method) and ran auc_score with the new train/test sets AND the correctly (as far as I'm aware) prepared model. I still run into the same error:
Input:
%%time
# LightFM's method to split the interactions
(train, test) = random_train_test_split(lightfm_interactions, test_percentage=0.25)

# Don't forget to pass in the user features again!
train_auc = auc_score(model_users,
                      train,
                      user_features=lightfm_user_features_list,
                      num_threads=NUM_THREADS).mean()
print('User_feature training set AUC: %s' % train_auc)

test_auc = auc_score(model_users,
                     test,
                     #train_interactions=train, # still can't get this to work
                     user_features=lightfm_user_features_list,
                     num_threads=NUM_THREADS).mean()
print('User_feature test set AUC: %s' % test_auc)
Output if "train_interactions=train" is used:
ValueError: Test interactions matrix and train interactions matrix share 435 interactions. This will cause incorrect evaluation, check your data split.
Good news however is --- by switching from sklearn's train_test_split to LightFM's random_train_test_split my model's AUC score went from 0.49 to 0.96 on training. So I guess it's important to stick with LightFM's methods if available!
LightFM provides a way of splitting your dataset, did you look at it? With it, it might work.
https://making.lyst.com/lightfm/docs/cross_validation.html
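For reference, a minimal sketch of that helper (the interactions variable is assumed to be a scipy sparse user-item matrix). Because random_train_test_split assigns every interaction to exactly one side, the resulting matrices are disjoint by construction, so train_interactions=train should no longer raise the shared-interactions ValueError:

import numpy as np
from lightfm.cross_validation import random_train_test_split

# split the interactions into disjoint train/test matrices;
# fixing random_state makes the split reproducible
train, test = random_train_test_split(interactions,
                                      test_percentage=0.25,
                                      random_state=np.random.RandomState(42))

# every interaction lands on exactly one side, so the overlap is empty
assert train.multiply(test).nnz == 0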

How to make features using featuretools for new data (on which we want to make predictions)

I have a single dataframe and want to use featuretools for the auto feature engineering part. I am able to do it with the normalize_entity function. The code snippet is below:
es = ft.EntitySet(id='obs_data')
es = es.entity_from_dataframe(entity_id='obs', dataframe=X_train,
                              variable_types=variable_types, make_index=True, index="Id")
# interaction columns are found using xgbfir
for feat in interaction:
    es = es.normalize_entity(base_entity_id='obs', new_entity_id=feat, index=feat)
features, feature_names = ft.dfs(entityset=es,
                                 target_entity='obs',
                                 max_depth=2)
It's creating features. Now I want to do the same thing for X_test. I read blogs on this, and they suggest combining X_train and X_test and then repeating the process. But suppose there are 5 observations in X_test: if I combine them with X_train, then each observation (from X_test) will also be influenced by the other 4 observations (from X_test), which is not a good idea.
Can anyone suggest how to do feature engineering using featuretools for new data?
You can try using cutoff times, which specify the last point in time at which an observation can be used for a feature calculation. The labels can be passed along with the cutoff times to ensure that they stay aligned with the feature matrix. Then you can split the feature matrix into X_train and X_test.
With new data, the normalization should be repeatable so that the entity set can have the same structure. Then you can calculate features with cutoff times as usual. You may also want to look into Compose, which automatically generates the cutoff times based on how you define the prediction problem. If cutoff times don't work in your use case, I will need more details to better understand how each observation would affect the others. Let me know if this helps.
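A hedged sketch of the cutoff-times idea, assuming the entity set from the question has a time index and that the hypothetical cutoff_times frame pairs each "Id" with the last usable timestamp (the extra label column is simply passed through, so it stays aligned with the feature matrix):

import pandas as pd
import featuretools as ft

# one row per observation: its id, the last point in time whose data may be
# used to compute that row's features, and the label (passed through unchanged)
cutoff_times = pd.DataFrame({
    "Id": [0, 1, 2],
    "time": pd.to_datetime(["2020-01-05", "2020-01-10", "2020-01-15"]),
    "label": [1, 0, 1],
})

feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='obs',
                                      cutoff_time=cutoff_times,
                                      max_depth=2)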
It is possible with calculate_feature_matrix() in featuretools. You can find a detailed guide on its webpage: https://docs.featuretools.com/en/stable/guides/deployment.html#calculating-feature-matrix-for-new-data
Suppose the new data is X_test. If it is a dataframe, you should create an entityset for it:
es_test = es.entity_from_dataframe(entity_id='entity', dataframe=X_test)
Otherwise, if it is an entity already, you can skip the previous step. Suppose your test entityset is es_test and your generated feature definitions are feature_names. Using the train data's feature definitions, you can create a new feature matrix for the test data:
test_feat_generated = ft.calculate_feature_matrix(feature_names, es_test)
For later reuse of feature_names, look at the load_features() and save_features() functions.
Note: the train and test entities should have the same entity_id, otherwise you will get an error.
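A small sketch of persisting the feature definitions between training and scoring time (assuming feature_names holds the definitions returned by ft.dfs above):

import featuretools as ft

# at training time: persist the feature definitions to disk
ft.save_features(feature_names, "feature_definitions.json")

# at scoring time: reload them and compute the same features on the test entityset
saved_features = ft.load_features("feature_definitions.json")
test_feat_generated = ft.calculate_feature_matrix(saved_features, es_test)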

10-fold cross validation in Python

There is a deep-learning-based model using transfer learning and LSTM in this article; the authors used 10-fold cross validation (as explained in Table 3) and took the average of the results.
I am familiar with 10-fold cross validation, in that we need to divide the data and pass the folds to the model, but in this code (here) I can't figure out how to partition the data and pass it.
There are two train/test/dev datasets (one for emotion analysis and one for sentiment analysis; we use both for transfer learning, but my focus is on emotion analysis). The raw data is in a couple of txt files, and after running the model it produces two new txt files, one with predicted labels and one with true labels.
There is a line of code in the main file:
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
if args.mode == 'train':
    model.train(data)
    sess = model.restore_last_session()
    model.predict(data, sess)
if args.mode == 'test':
    sess = model.restore_last_session()
    model.predict(data, sess)
in which 'data' is an instance of the Data class (code) that includes the test/train/dev datasets:
I think this is where I need to pass the partitioned data. If I am right, how can I do the partitioning and perform 10-fold cross validation?
data = Data('./data/' + args.data_name + 'data_sample.bin',
            './data/' + args.data_name + 'vocab_sample.bin',
            './data/' + args.data_name + 'word_embed_weight_sample.bin',
            args.batch_size)

class Data(object):
    def __init__(self, data_path, vocab_path, pretrained, batch_size):
        self.batch_size = batch_size
        data, vocab, pretrained = self.load_vocab_data(data_path, vocab_path, pretrained)
        self.train = data['train']
        self.valid = data['valid']
        self.test = data['test']
        self.train2 = data['train2']
        self.valid2 = data['valid2']
        self.test2 = data['test2']
        self.word_size = len(vocab['word2id']) + 1
        self.max_sent_len = vocab['max_sent_len']
        self.max_topic_len = vocab['max_topic_len']
        self.word2id = vocab['word2id']
        word2id = vocab['word2id']
        #self.id2word = dict((v, k) for k, v in word2id.iteritems())
        self.id2word = {}
        for k, v in six.iteritems(word2id):
            self.id2word[v] = k
        self.pretrained = pretrained
By the look of it, it seems the train method can take a session and continue training from an existing model: def train(self, data, sess=None).
So with very minimal changes to the existing code and libraries you can do something like this.
First load all the data and build the model:
data = Data('./data/' + args.data_name + 'data_sample.bin',
            './data/' + args.data_name + 'vocab_sample.bin',
            './data/' + args.data_name + 'word_embed_weight_sample.bin',
            args.batch_size)
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
Then create the cross-validation data sets, something like:
def get_new_data_object():
    return Data('./data/' + args.data_name + 'data_sample.bin',
                './data/' + args.data_name + 'vocab_sample.bin',
                './data/' + args.data_name + 'word_embed_weight_sample.bin',
                args.batch_size)

cross_validation = []
for i in range(10):
    tmp_data = get_new_data_object()
    tmp_data.train = ...   # get 90% of tmp_data.train
    tmp_data.valid = ...   # get 90% of tmp_data.valid
    tmp_data.test = ...    # get 90% of tmp_data.test
    tmp_data.train2 = ...  # get 90% of tmp_data.train2
    tmp_data.valid2 = ...  # get 90% of tmp_data.valid2
    tmp_data.test2 = ...   # get 90% of tmp_data.test2
    cross_validation.append(tmp_data)
Then run the model n times (10 for 10-fold cross validation):
sess = None
for data in cross_validation:
    model.train(data, sess)
    sess = model.restore_last_session()
Keep in mind some key ideas:
- I don't know how your data is structured exactly, but that affects the way it should be split into test, train and (in your case) valid sets.
- The split has to be the same kind of split for each triple of test, train and valid; it can be done randomly, or by taking a different part every time, as long as it is consistent.
- You can train the model n times with cross validation, or create n models and pick the best one, to avoid overfitting.
- This code is just a draft; you can implement it however you would like. There are some great libraries that already implement such functionality, and of course it can be optimized (e.g. not reading the whole data files each time). See the KFold sketch below for one way to fill in the "get 90%" placeholders.
- One more consideration is to separate the model creation from the data, especially the data argument of the model constructor: from a quick look it seems it only uses the dimensions of the data, so it's good practice not to pass the whole object.
- Moreover, if the model integrates other properties of the data object into its state when it is created (like the data itself), my code might not work and a more surgical approach would be needed.
Hope it helps, and points you in the right direction.
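For those placeholders, a minimal sketch using sklearn's KFold to produce the per-fold 90% subsets (assuming each split, e.g. data.train, is an indexable sequence of examples; train2/valid2/test2 would be handled the same way):

import numpy as np
from sklearn.model_selection import KFold

def fold_subsets(split, n_splits=10, seed=42):
    # yield the 90% training subset of `split` for each of the n folds
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, _ in kf.split(np.arange(len(split))):
        yield [split[i] for i in train_idx]

# build the 10 fold-specific Data objects
base = get_new_data_object()
cross_validation = []
folds = zip(fold_subsets(base.train), fold_subsets(base.valid), fold_subsets(base.test))
for train_part, valid_part, test_part in folds:
    tmp_data = get_new_data_object()
    tmp_data.train, tmp_data.valid, tmp_data.test = train_part, valid_part, test_part
    cross_validation.append(tmp_data)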
