I am new to PySpark and am trying to create an ML model in PySpark.
My goal is to create a TF-IDF vectorizer and pass those features to my SVM model.
I tried this:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("Stream")
sc = SparkContext(conf=conf)
parallelized = sc.parallelize(Dataset.CleanText)
#Dataset is a pandas dataframe with CleanText as one of its columns
from pyspark.mllib.feature import HashingTF, IDF
hashingTF = HashingTF()
tf = hashingTF.transform(parallelized)
# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes:
# First to compute the IDF vector and second to scale the term frequencies by IDF.
#tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
print ("vecs: ",tfidf.glom().collect())
#This is printing all the TFidf vectors
import numpy as np
labels = np.array(Dataset['LabelNo'])
Now, how should I pass these TF-IDF features and label values to my model?
I followed this
http://spark.apache.org/docs/2.0.0/api/python/pyspark.mllib.html
and tried to create a labeled point as:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.appName("SparkSessionZipsExample").getOrCreate()
dd = [(labels[i], Vectors.dense(tfidf[i])) for i in range(len(labels))]
df = spark.createDataFrame(sc.parallelize(dd),schema=["label", "features"])
print ("df: ",df.glom().collect())
But this is giving me an error:
---> 15 dd = [(labels[i], Vectors.dense(tfidf[i])) for i in range(len(labels))]
16 df = spark.createDataFrame(sc.parallelize(dd),schema=["label", "features"])
17
TypeError: 'RDD' object does not support indexing
The error explains itself: RDDs do not support indexing. You are trying to get the i-th row of tfidf by using i as its index (tfidf[i] in line 15). RDDs don't work like lists; they are distributed datasets whose rows are spread across workers.
You would have to collect tfidf to a single node to make that code work, but that would defeat the purpose of a distributed framework like Spark.
I would advise you to work with DataFrames instead of RDDs: they are much faster, and the DataFrame-based ml package supports most of the operations (HashingTF, IDF) provided by mllib.
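For example, here is a minimal sketch of the DataFrame-based approach, assuming the pandas DataFrame Dataset from your question (with columns CleanText and LabelNo) and Spark 2.2+; it uses LinearSVC from pyspark.ml as the SVM, which only supports binary labels:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.appName("TfidfSVM").getOrCreate()

# Dataset is assumed to be the pandas DataFrame from the question
df = spark.createDataFrame(Dataset[["CleanText", "LabelNo"]]) \
          .withColumnRenamed("LabelNo", "label")

tokenizer = Tokenizer(inputCol="CleanText", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")
svm = LinearSVC(featuresCol="features", labelCol="label")  # binary classifier

pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, svm])
model = pipeline.fit(df)
predictions = model.transform(df)

The Pipeline keeps everything in DataFrames, so there is no need to index into an RDD or build LabeledPoint objects by hand.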
Related
I have a large corpus of text from various communities in the form of posts. Specifically, the dataset is from Reddit, where each row is a comment (subreddit and text).
I'm hoping to run CountVectorizer on the dataset, but I require a distributed approach like PySpark. I found this site which contains some scripts. This is the script I ended up with for defining my model:
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, Tokenizer
tokenizer = Tokenizer(inputCol="body", outputCol="words")
vectorizer = CountVectorizer(binary = True, minTF = 1, vocabSize = 80000, inputCol = "words", outputCol = "rawFeatures")
pipeline = Pipeline(stages=[tokenizer, vectorizer])
model = pipeline.fit(data)
The problem arises when I try to aggregate the model's output by subreddit. I tried something like this, but I keep getting an error:
total_counts = model.transform(data)\
.select('subreddit','rawFeatures').rdd\
.map(lambda row: (row["subreddit"], row['rawFeatures'].toArray()))\
.reduceByKey(lambda x,y: x+y)
Let me know if you have any ideas on how I can aggregate the CountVectorizer vectors by the community in which they occur.
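One possible approach (a sketch I have not tested on this data, assuming Spark 3.0+ where pyspark.ml.stat.Summarizer.sum is available) keeps the aggregation in the DataFrame API and sums the sparse count vectors per subreddit:

from pyspark.ml.stat import Summarizer

counts = model.transform(data)
total_counts = counts.groupBy("subreddit") \
                     .agg(Summarizer.sum(counts.rawFeatures).alias("total_counts"))
total_counts.show(truncate=False)

On older Spark versions, Summarizer.mean (available since 2.4) works the same way, and the reduceByKey approach above should also work in principle once each rawFeatures SparseVector has been converted with toArray().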
I have loaded an sklearn model successfully, but I am unable to make predictions on a PySpark DataFrame. When running the code below, I get the error shown underneath it. Please help me get the code to make predictions with the sklearn model on a PySpark DataFrame; I have also searched relevant questions but could not find a solution.
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
braodcast_model.value
#update prediction method
def predictor(cols):
    #call predict method for model
    return model.value.predict(*cols)
udf_predictor = udf(predictor, FloatType())
#apply the udf to dataframe
df_prediction = df.withColumn("prediction", udf_predictor(df.select(list_of_columns)))
I get the following error message
TypeError: Invalid argument, not a string or column. For column literals, use 'lit', 'array',
'struct' or 'create_map' function.
I think you were on the right track for reaching your expected output.
I managed to find two possible solutions for this problem: one uses a Spark UDF, the other uses a Pandas UDF.
Spark UDF
from pyspark.sql.functions import udf

@udf('integer')
def predict_udf(*cols):
    return int(braodcast_model.value.predict((cols,)))

list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_udf(*list_of_columns))
Pandas UDF
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('integer')
def predict_pandas_udf(*cols):
    X = pd.concat(cols, axis=1)
    return pd.Series(braodcast_model.value.predict(X))

list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_pandas_udf(*list_of_columns))
Reproducible example
Here I used a Databricks Community cluster with Spark 3.1.2, pandas==1.2.4 and pyarrow==4.0.0.
braodcast_model is a simple logistic regression from scikit-learn, trained on the breast cancer dataset.
import pandas as pd
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from pyspark.sql.functions import udf, pandas_udf
# load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# split in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)
# create a small pipeline with standardization and model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
# save and reload the fitted pipeline
path = '/databricks/driver/test_model.joblib'
joblib.dump(pipe, path)
loaded_model = joblib.load(path)
# sample of unseen data
df = spark.createDataFrame(X_test.sample(50, random_state=42))
# create broadcasted model
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
Then I used the two methods illustrated above, and you will see that the output df_prediction is the same in both cases.
I have data that I am trying to build a tree model on. The dataset has 6 million rows.
I'm successfully using a pandas UDF to train multiple LightGBM models on multiple nodes.
I'm using Databricks with the following configuration:
worker type: r5.24xlarge
Driver type: r5.24xlarge
Max workers: 12
Since a pandas UDF supposedly does the computation in parallel for each group, I thought I could finish the training faster by using a cluster with more workers.
Checking the Stage Detail page in the Web UI, I see that there is only one stage responsible for the pandas UDF training, and it took 4 hours. Accessing the details of this stage, I see it had only one task, with {'Locality Level': 'PROCESS_LOCAL'}, which took the whole 4 hours. Screenshots below.
My assumption is that the training is not being parallelized. I have attached the code below, but I am not able to pinpoint where the mistake is.
#reading data into pandas dataframe
train_gbm_x = s3_read_csv()
train_gbm_x_original = train_gbm_x.copy()

#create replicas of the dataset so that I can run 4 versions of these in parallel
for i in range(1, 5, 1):
    x_t_c = train_gbm_x.copy()
    x_t_c['id'] = i
    print(train_gbm_x_original.shape)
    train_gbm_x_original = train_gbm_x_original.append([x_t_c], ignore_index=True)

train_gbm_sp = spark.createDataFrame(train_gbm_x_original)  # convert to spark dataframe
# main UDF
schema = StructType([
    StructField('r_squared', DoubleType(), True)])

@f.pandas_udf(schema, f.PandasUDFType.GROUPED_MAP)
def train_RF(data):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LinearRegression
    from scipy.stats.stats import pearsonr
    import numpy as np
    import pandas as pd
    import lightgbm as lgb

    xgb_model = lgb.LGBMClassifier(bagging_fraction=0.9500000000000001, boosting_type='goss',
                                   feature_fraction=0.53, learning_rate=0.045075773005757123,
                                   max_depth=2, metric='binary_logloss', min_child_samples=70,
                                   min_child_weight=0, min_data_in_bin=54, min_data_in_leaf=219,
                                   min_gain_to_split=0.75, n_estimators=9900, num_leaves=1000,
                                   objective='binary', random_state=4, reg_alpha=7.906971438083483,
                                   reg_lambda=9.905206242985752, verbosity=-1)

    # create data and label groups
    y_train = data['label']
    X_train = data.drop(['employee_id', 'date', 'label'], axis=1)
    xgb_model.fit(X_train, y_train)

    # make predictions on the training data
    y_pred = xgb_model.predict(X_train)
    r = pearsonr(y_pred, y_train)
    print(y_pred)

    # return the R-squared value
    return pd.DataFrame({'r_squared': (r[0]**2)}, index=[0])
The call to the UDF is as follows:
train_gbm_sp = train_gbm_sp.sort(F.col("emp_id"), F.col("id"))
model_coeffs_r = train_gbm_sp.groupby('id').apply(train_RF)
model_coeffs_r.show(5)
[Spark UI screenshot: the stage view shows everything being run on a single executor ID.]
I am not sure where the exact issue is; could someone help point out what I might be doing wrong in the setup here?
Thank you
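A diagnostic sketch (my own checks, not from the original post; the names are taken from the code above): groupBy().apply() processes each group inside a single task, so the number of distinct id values is an upper bound on parallelism, and it is worth checking how the input DataFrame is partitioned:

print(train_gbm_sp.select("id").distinct().count())  # number of groups = upper bound on parallel training tasks
print(train_gbm_sp.rdd.getNumPartitions())           # how the input DataFrame is currently partitioned

train_gbm_sp = train_gbm_sp.repartition("id")        # spread the groups across executors before the apply
model_coeffs_r = train_gbm_sp.groupby("id").apply(train_RF)

With only a handful of replica ids created above, only that many training tasks can run in parallel, regardless of how many workers the cluster has.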
I have generated a tf-idf model on ~20,000,000 documents using the following code, which works well. The problem is that when I try to calculate similarity scores using linear_kernel, the memory usage blows up:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
train_file = "docs.txt"
train_docs = DocReader(train_file) #DocReader is a generator for individual documents
vectorizer = TfidfVectorizer(stop_words='english',max_df=0.2,min_df=5)
X = vectorizer.fit_transform(train_docs)
#predicting a new vector, this works well when I check the predictions
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])
#This is where the memory blows up
similarities = linear_kernel(invec, X).flatten()
It seems like this shouldn't take up much memory: comparing a 1-row CSR matrix against a 20-million-row CSR matrix should output a 1 x 20 million ndarray.
Just FYI: X is a CSR matrix of ~12 GB in memory (my computer only has 16 GB). I have tried looking into gensim to replace this, but I can't find a great example.
Any thoughts on what I am missing?
You can do the processing in batches. Here is an example based on your code snippet, but replacing the dataset with one bundled in sklearn. For this smaller dataset, I compute it the original way as well to show that the results are equivalent. You can probably use a larger batch size.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.datasets import fetch_20newsgroups
train_docs = fetch_20newsgroups(subset='train')
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.2,min_df=5)
X = vectorizer.fit_transform(train_docs.data)
#predicting a new vector, this works well when I check the predictions
indoc = "This is an example of a new doc to be predicted"
invec = vectorizer.transform([indoc])
#compute the similarities in batches to keep memory bounded
batchsize = 1024
similarities = []
for i in range(0, X.shape[0], batchsize):
    similarities.extend(linear_kernel(invec, X[i:min(i+batchsize, X.shape[0])]).flatten())
similarities = np.array(similarities)
similarities_orig = linear_kernel(invec, X)
print((similarities == similarities_orig).all())
Output:
True
I am trying to do the classic job of clustering text documents by pre-processing, generating a tf-idf matrix, and then applying K-means. However, testing this workflow on the classic 20 Newsgroups dataset results in most documents being clustered into one cluster. (I initially tried to cluster all documents from 6 of the 20 groups, so I expected to end up with 6 clusters.)
I am implementing this in Apache Spark, as my purpose is to use this technique on millions of documents. Here is the code, written in PySpark on Databricks:
#declare path to folder containing 6 of 20 news group categories
path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % MOUNT_NAME

#read all the text files from the 6 folders. Each entity is an entire document.
text_files = sc.wholeTextFiles(path).cache()

#convert rdd to dataframe
df = text_files.toDF(["filePath", "document"]).cache()
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer
#tokenize the document text
tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
tokenized = tokenizer.transform(df).cache()
from pyspark.ml.feature import StopWordsRemover
remover = StopWordsRemover(inputCol="tokens", outputCol="stopWordsRemovedTokens")
stopWordsRemoved_df = remover.transform(tokenized).cache()
hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", outputCol="rawFeatures", numFeatures=200000)
tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
idfModel = idf.fit(tfVectors)
tfIdfVectors = idfModel.transform(tfVectors).cache()
#note that I have also tried to use normalized data, but get the same result
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors
normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
l2NormData = normalizer.transform(tfIdfVectors)
from pyspark.ml.clustering import KMeans
# Trains a KMeans model.
kmeans = KMeans().setK(6).setMaxIter(20)
km_model = kmeans.fit(l2NormData)
clustersTable = km_model.transform(l2NormData)
ID number_of_documents_in_cluster
0 3024
3 5
1 3
5 2
2 2
4 1
As you can see, most of my data points are clustered into cluster 0, and I cannot figure out what I am doing wrong, as all the tutorials and code I have come across online point to this method.
In addition, I have also tried normalizing the tf-idf matrix before K-means, but that produces the same result. I know cosine distance is a better measure to use, but I expected that standard K-means in Apache Spark would still provide meaningful results.
Can anyone help with regard to whether I have a bug in my code, or whether something is missing in my data clustering pipeline?
Thank you in advance!
Here is the scikit-learn implementation in Python, which does not group all documents together even with a high number of max features:
#imports
import pandas as pd
import os
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, MiniBatchKMeans
vectorizer = TfidfVectorizer(max_features=200000, lowercase=True,
                             min_df=5, stop_words='english',
                             use_idf=True)
X = vectorizer.fit_transform(df['document'])

#Apply K-means to create cluster
from time import time
km = KMeans(n_clusters=20, init='k-means++', max_iter=20, n_init=1,
            verbose=False)
km.fit(X)
#result
3 2634
6 1720
18 1307
15 780
0 745
1 689
16 504
8 438
7 421
5 369
11 347
14 330
4 243
13 165
10 136
17 118
9 113
19 106
12 87
2 62
I would have thought that we could replicate something similar in PySpark using KMeans with Euclidean distance before trying cosine or Jaccard distances. Any solutions or comments?
Just a few quick comments:
K-Means isn't the best algorithm for text analysis in general, since it can perform badly in high dimensions. I'd recommend LDA instead (a short sketch follows the code below).
With K-Means, if you decrease the number of features to maybe 2000, then you're more likely to get multiple distinct clusters. (I tried this quickly on the 20news dataset provided in Databricks CE at /databricks-datasets/news20.binary/data-001/training and was able to get distinct clusters.)
Unrelated: MLlib code can be less verbose if you put all of the transformers and K-Means into a Pipeline and then only call fit() and transform() once. : )
Here's the code I modified from yours and ran. Caveat: I have not tuned it at all, so the clusters are currently pretty useless (but it does find distinct clusters).
df = spark.read.parquet("/databricks-datasets/news20.binary/data-001/training")
df.cache().count()
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="stopWordsRemovedTokens")
hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", outputCol="rawFeatures", numFeatures=2000)
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=20)
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[tokenizer, remover, hashingTF, idf, kmeans])
model = pipeline.fit(df)
results = model.transform(df)
results.cache()
display(results.groupBy("prediction").count()) # Note "display" is for Databricks; use show() for OSS Apache Spark
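And here is a minimal, untuned sketch of the LDA alternative mentioned in the first comment (my suggestion, not part of the original code). LDA is usually fed raw term counts, so it swaps HashingTF/IDF for CountVectorizer:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.clustering import LDA

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="stopWordsRemovedTokens")
counts = CountVectorizer(inputCol="stopWordsRemovedTokens", outputCol="features", vocabSize=2000, minDF=5.0)
lda = LDA(k=20, maxIter=20)

lda_pipeline = Pipeline(stages=[tokenizer, remover, counts, lda])
lda_model = lda_pipeline.fit(df)
topics = lda_model.transform(df)  # adds a topicDistribution column per document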
@Nassir,
Spark k-means (Scala MLlib API) consistently produces highly skewed cluster size distributions in my experiments as well (see figure 1); the majority of the data points are assigned to one cluster. This experiment was conducted using the 20 Newsgroups data, for which ground truth is available: the ~10K data points were manually categorized into 20 fairly balanced groups. http://qwone.com/~jason/20Newsgroups/
Initially I suspected that the vector creation step (using Spark's HashingTF and IDF libraries) was the cause of the incorrect clustering. However, even after implementing my own version of a TF-IDF based vector representation, I still got similar clustering results with a highly skewed size distribution.
Eventually I implemented my own version of k-means on top of Spark, which uses the standard TF-IDF vector representation and negative cosine similarity as the distance metric. The results from this k-means look right; see figure 2 below.
Additionally, I experimented with plugging in Euclidean distance as the distance metric (in my own version of k-means), and the results continue to look right, not as skewed as Spark k-means.
[Figures 1 and 2: cluster size distributions from Spark k-means (highly skewed) and from the custom cosine-similarity k-means (not skewed).]
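For reference, here is a minimal numpy sketch (my own illustration, not the poster's Spark code, and assuming dense arrays for clarity) of the assignment step such a cosine-similarity k-means uses: rows are L2-normalized so that dot products are cosine similarities, and each document is assigned to the most similar centroid.

import numpy as np

def assign_to_nearest_centroid(X, centroids):
    # X: (n_documents, n_features) TF-IDF matrix; centroids: (k, n_features)
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    C_norm = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    similarities = X_norm @ C_norm.T  # cosine similarities, shape (n_documents, k)
    # minimizing negative cosine similarity == maximizing cosine similarity
    return similarities.argmax(axis=1)

In a Spark implementation this step would typically run against broadcast centroids (for example inside mapPartitions), followed by recomputing the centroids from the new assignments.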