Pyspark: Grouping a CountVectorizer by a specific key set - python

I have a large corpus of text originating from various communities in the form of posts. Specifically, the dataset is Reddit, where each row is a comment (subreddit and text).
I'm hoping to run CountVectorizer on the dataset, but I need a distributed approach like PySpark. I found a site with some example scripts, and this is the one I ended up with for defining my pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, Tokenizer
tokenizer = Tokenizer(inputCol="body", outputCol="words")
vectorizer = CountVectorizer(binary = True, minTF = 1, vocabSize = 80000, inputCol = "words", outputCol = "rawFeatures")
pipeline = Pipeline(stages=[tokenizer, vectorizer])
model = pipeline.fit(data)
The problem arises when I try to aggregate the transformed output by subreddit. I tried something like this, but I keep getting an error:
total_counts = model.transform(data)\
    .select('subreddit', 'rawFeatures').rdd\
    .map(lambda row: (row["subreddit"], row['rawFeatures'].toArray()))\
    .reduceByKey(lambda x, y: x + y)
Let me know if you have any ideas about how I can aggregate the CountVectorizer vectors by the community they occur in.
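One way to do the aggregation while staying in the DataFrame API, assuming Spark 3.0 or later where pyspark.ml.stat.Summarizer supports the sum metric, is to sum the count vectors per subreddit with groupBy. This is only a sketch, not a verified fix for the exact error above:
from pyspark.sql import functions as F
from pyspark.ml.stat import Summarizer

# Sum the per-comment count vectors of each subreddit
# (Summarizer.sum requires Spark 3.0+)
total_counts = (
    model.transform(data)
         .select("subreddit", "rawFeatures")
         .groupBy("subreddit")
         .agg(Summarizer.sum(F.col("rawFeatures")).alias("summed_counts"))
)
total_counts.show(truncate=False)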

Related

Scikit learn: TypeError: float() argument must be a string or a number, not 'Bunch'

I want to apply SVM using the following approach, but apparently the "Bunch" type is not appropriate.
Usually, with a Bunch (a dictionary-like object), the interesting attributes are 'data', the data to learn from, and 'target', the classification labels. You can access the .data and .target attributes accordingly. How can I make the code below work?
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Fetch the data with scikit-learn, which stores it in a Bunch
# (cats is a list of category names defined elsewhere)
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=cats)
vectorizer = TfidfVectorizer(stop_words='english')  # new
vectors = vectorizer.fit_transform(newsgroups_train.data)  # new
vectors_test = vectorizer.transform(newsgroups_test.data)  # new
max_abs_scaler = preprocessing.MaxAbsScaler()
scaled_train_data = max_abs_scaler.fit_transform(vectors)  # corrected
scaled_test_data = max_abs_scaler.transform(vectors_test)
clf = CalibratedClassifierCV(OneVsRestClassifier(SVC(C=1)))
clf.fit(scaled_train_data, train_labels)  # train_labels is the part I am asking about
predictions = clf.predict(scaled_test_data)
proba = clf.predict_proba(scaled_test_data)
In the clf.fit line, in the position of "train_labels", I put "vectorizer.vocabulary_.keys()", but it gives: ValueError: bad input shape (). What should I do to get the training labels and make it work?
You are trying to apply a numerical scaling operation to text data, which is logically incorrect. If you look at the official documentation of MaxAbsScaler, its function is to:
Scale each feature by its maximum absolute value
If you want to obtain vectors for the text data, you need to use something like CountVectorizer. See this example from the official documentation here.
Alternatively, you can try TfidfTransformer as well. Here is an example of using it with the newsgroups data.
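As the question itself notes, the labels live in the Bunch's .target attribute, so a common pattern is to vectorize .data and fit on .target. Here is a minimal sketch along the lines the answer suggests (CountVectorizer followed by TfidfTransformer); the category list cats is hypothetical and only for illustration:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

cats = ['sci.space', 'rec.autos']  # hypothetical categories, just for illustration

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=cats)

# Counts first, then TF-IDF weighting, fitted on the training split only
count_vect = CountVectorizer(stop_words='english')
tfidf = TfidfTransformer(use_idf=True)
X_train = tfidf.fit_transform(count_vect.fit_transform(newsgroups_train.data))
X_test = tfidf.transform(count_vect.transform(newsgroups_test.data))

# The training labels are simply the Bunch's .target array
clf = CalibratedClassifierCV(OneVsRestClassifier(SVC(C=1)))
clf.fit(X_train, newsgroups_train.target)
print(clf.score(X_test, newsgroups_test.target))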

Can You Consistently Keep Track of Column Labels Using Sklearn's Transformer API?

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'
Right now, any method that uses the transformer API in sklearn returns a numpy array as its result. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how they relate to the original column labels makes it difficult to use this section of the library to its fullest.
As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:
numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
cat_columns = train.select_dtypes(include=np.object).columns.tolist()

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)

train_clean = combined_pipe.fit_transform(train)
test_clean = combined_pipe.transform(test)
In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.
I could easily have different arrangements if I used different modules that use the same API: OrdinalEncoder, SelectKBest, etc.
If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?
There's an extensive discussion about it here, but I don't think anything has been finalized yet.
Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep it generic at the level of numpy arrays. The latest progress on adding feature names to sklearn estimators can be tracked here.
Anyhow, we can create wrappers to get the feature names of the ColumnTransformer. I am not sure whether it can capture all possible types of ColumnTransformers, but at least it can solve your problem.
From Documentation of ColumnTransformer:
Notes
The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
Try this!
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer

train = pd.DataFrame({'age': [23, 12, 12, np.nan],
                      'Gender': ['M', 'F', np.nan, 'F'],
                      'income': ['high', 'low', 'low', 'medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo': [1, 0, 0, 1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0, 1, 1, 1]})

numeric_columns = ['age']
cat_columns = ['Gender', 'income']

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]

combined_pipe = ColumnTransformer(
    transformers, remainder='passthrough')

transformed_data = combined_pipe.fit_transform(
    train.drop('y', 1), train['y'])

def get_feature_out(estimator, feature_in):
    if hasattr(estimator, 'get_feature_names'):
        if isinstance(estimator, _VectorizerMixin):
            # handling all vectorizers
            return [f'vec_{f}'
                    for f in estimator.get_feature_names()]
        else:
            return estimator.get_feature_names(feature_in)
    elif isinstance(estimator, SelectorMixin):
        return np.array(feature_in)[estimator.get_support()]
    else:
        return feature_in

def get_ct_feature_names(ct):
    # handles all estimators, pipelines inside ColumnTransformer
    # doesn't work when remainder == 'passthrough',
    # which requires the input column names.
    output_features = []

    for name, estimator, features in ct.transformers_:
        if name != 'remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator == 'passthrough':
            output_features.extend(ct._feature_names_in[features])

    return output_features

pd.DataFrame(transformed_data,
             columns=get_ct_feature_names(combined_pipe))
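As a side note, newer scikit-learn releases add get_feature_names_out() to ColumnTransformer itself, which may remove the need for a wrapper like the one above. A small, version-dependent sketch, assuming every step of the fitted combined_pipe implements that method:
# Version-dependent: requires a recent scikit-learn in which all steps
# of the ColumnTransformer implement get_feature_names_out()
feature_names = combined_pipe.get_feature_names_out()
pd.DataFrame(transformed_data, columns=feature_names)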

Training ML algorithm in pyspark

I am new to PySpark and am trying to create an ML model in PySpark.
My goal is to create a TF-IDF vectorizer and pass those features to my SVM model.
I tried this
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[2]").setAppName("Stream")
sc = SparkContext(conf=conf)
parallelized = sc.parallelize(Dataset.CleanText)
# Dataset is a pandas dataframe with CleanText as one of the columns
from pyspark.mllib.feature import HashingTF, IDF
hashingTF = HashingTF()
tf = hashingTF.transform(parallelized)
# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes:
# First to compute the IDF vector and second to scale the term frequencies by IDF.
#tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
print ("vecs: ",tfidf.glom().collect())
#This is printing all the TFidf vectors
import numpy as np
labels = np.array(Dataset['LabelNo'])
Now, how should I pass these TF-IDF vectors and label values to my model?
I followed this
http://spark.apache.org/docs/2.0.0/api/python/pyspark.mllib.html
and tried to create labeled points like this:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.appName("SparkSessionZipsExample").getOrCreate()
dd = [(labels[i], Vectors.dense(tfidf[i])) for i in range(len(labels))]
df = spark.createDataFrame(sc.parallelize(dd),schema=["label", "features"])
print ("df: ",df.glom().collect())
But this is giving me an error:
---> 15 dd = [(labels[i], Vectors.dense(tfidf[i])) for i in range(len(labels))]
16 df = spark.createDataFrame(sc.parallelize(dd),schema=["label", "features"])
17
TypeError: 'RDD' object does not support indexing
The error explains itself: an RDD does not support indexing. You are trying to get the i-th row of tfidf by using i as its index (tfidf[i] in line 15). RDDs don't work like lists; they are distributed datasets whose rows are spread across the workers.
You would have to collect tfidf to a single node for your code to work, but that would defeat the purpose of a distributed framework like Spark.
I would advise you to work with DataFrames instead of RDDs, as they are much faster, and the DataFrame-based pyspark.ml library supports most of the operations (HashingTF, IDF) provided by mllib. A sketch of that approach follows.
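To illustrate that advice, here is a minimal sketch using the DataFrame-based pyspark.ml API, assuming Dataset is the pandas dataframe from the question with CleanText and LabelNo columns; it is a sketch of the approach, not a drop-in replacement for the code above:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.appName("TfidfSvmExample").getOrCreate()

# Build a Spark DataFrame directly from the pandas dataframe in the question
df = (spark.createDataFrame(Dataset[['CleanText', 'LabelNo']])
          .withColumnRenamed('CleanText', 'text')
          .withColumnRenamed('LabelNo', 'label'))

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")
# LinearSVC handles binary labels; wrap it in OneVsRest for multi-class problems
svm = LinearSVC(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, hashing_tf, idf, svm])
model = pipeline.fit(df)
model.transform(df).select("label", "prediction").show(5)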

Scikit SGD classifier with Hashing Vectorizer accuracy stuck at 58%

I am trying my hand at machine learning and have been using the Python-based scikit-learn library for it.
I wish to solve a classification problem in which a chunk of text (say 1k-2k words) is classified into one or more categories. For this I have been studying scikit-learn for a while now.
As my data is in the range of 2-3 million rows, I have been using SGDClassifier with HashingVectorizer and the partial_fit learning technique, coded as below:
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
import numpy as np
from sklearn.externals import joblib
import copy
import nltk  # needed for word_tokenize below

data = pd.read_csv(
    open('train_shuffled.csv'), error_bad_lines=False)
data_all = copy.deepcopy(data)
target = data['category']
del data['category']

cls = np.unique(target)
model = SGDClassifier(loss='log', verbose=1)
vect = HashingVectorizer(stop_words='english', strip_accents='unicode', analyzer='word')

loop = len(target) // 100  # number of 100-row mini-batches

for passes in range(0, 5):
    count, r = 0, 0
    print("Pass " + str(passes + 1))
    for q in range(0, loop):
        d = nltk.word_tokenize(data['content'][r:r + 100])
        d = vect.fit_transform(d)
        t = np.array(target[r:r + 100])
        model.partial_fit(d, t, cls)
        r = r + 100
    # reshuffle the training data before the next pass
    data = copy.deepcopy(data_all)
    data = data.iloc[np.random.permutation(len(data))]
    data = data.reset_index(drop=True)
    target = data['category']
    del data['category']

print(model)
joblib.dump(model, 'Model.pkl')
joblib.dump(vect, 'Vectorizer.pkl')
While going through the learning process, I read in an answer here on Stack Overflow that manually shuffling the training data on each iteration results in a better model.
Using the classifier and vectorizer with default parameters, I got an accuracy score of ~58.4%. Since then, I have tried playing with different parameter settings for both the vectorizer and the classifier, but with no increase in accuracy.
Is anyone able to tell me whether I have been doing something wrong, or what should be done to improve the model score?
Any help will be highly appreciated.
Thanks!
1) Consider using GridSearchCV to tune parameters (a minimal sketch follows this list). http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html
2) Consider feature engineering to combine existing features into new features, e.g. the polynomial features, feature selection, and feature union tools provided in sklearn.
3) Try different models. Not all models work on all problems. Try using an ensemble of simpler models and some kind of decision function to take the outputs of those models and make a prediction. Some are in the ensemble module, but you can use the voting classifiers to make your own.
But by far the best and most important thing to do is look at the data. Find examples where the classifier performed badly. Why did it perform badly? Can you classify the text by reading it (i.e. is it reasonable to expect an algorithm to classify that text)? If it can be classified, what does the model miss?
All of these will help guide what to do next.
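To illustrate point 1, here is a minimal, hypothetical sketch of tuning an SGDClassifier + HashingVectorizer pipeline with GridSearchCV; the parameter grid is only an example, and texts and labels are assumed to be a (sub)sample of the training data, since with millions of rows you would normally tune on a subsample rather than the full dataset:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.model_selection import GridSearchCV

# texts, labels: a (sub)sample of the training data, assumed defined elsewhere
pipe = Pipeline([
    ('vect', HashingVectorizer(stop_words='english', analyzer='word')),
    ('clf', SGDClassifier(loss='log')),
])

param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-6, 1e-4, 1e-2],
    'clf__penalty': ['l2', 'elasticnet'],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)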

Predicting Classifications with some Words not in the training set (Naive Bayes)

I created a Naive Bayes model to predict whether the outcome is 'negative' or 'positive'. The problem I am having is running the model on a new set of data with some words that are not in the model. The error I receive when predicting on a new data set is:
ValueError: Expected input with 6 features, got 4 instead
I read that I would have to put a Laplace smoother in my model, and BernoulliNB() already has a default alpha of 1. What else can I do to fix my error? Thank you.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn import cross_validation
from sklearn.metrics import classification_report
import numpy as np
from sklearn.metrics import accuracy_score
import textblob as TextBlob
#scikit
comments = list(['happy', 'sad', 'this is negative', 'this is positive', 'i like this', 'why do i hate this'])
classes = list(['positive', 'negative', 'negative', 'positive', 'positive', 'negative'])

# preprocess creates the term frequency matrix for the review data set
stop = stopwords.words('english')
count_vectorizer = CountVectorizer(analyzer=u'word', stop_words=stop, ngram_range=(1, 3))
comments = count_vectorizer.fit_transform(comments)
tfidf_comments = TfidfTransformer(use_idf=True).fit_transform(comments)

# preparing data for split validation: 80% training, 20% test
data_train, data_test, target_train, target_test = cross_validation.train_test_split(tfidf_comments, classes, test_size=0.2, random_state=43)
classifier = BernoulliNB().fit(data_train, target_train)

# new data
comments_new = list(['positive', 'zebra', 'george', 'nothing'])
comments_new = count_vectorizer.fit_transform(comments_new)
tfidf_comments_new = TfidfTransformer(use_idf=True).fit_transform(comments_new)
classifier.predict(tfidf_comments_new)
You should not fit a new estimator on the new data using fit_transform; use the previously built count_vectorizer with just transform. That will ignore all words that were not in the vocabulary.
I disagree with Maxim: while this doesn't make a difference for CountVectorizer, using TfidfTransformer on the joined dataset will leak information from the test set into the training set, which you need to avoid. A sketch of the transform-only approach follows.
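Here is a minimal sketch of that transform-only pattern, reusing the question's variable names (the NLTK stop-word list is omitted to keep the sketch self-contained): fit the vectorizer and the TfidfTransformer on the training comments once, then only call transform on new data.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

comments = ['happy', 'sad', 'this is negative', 'this is positive', 'i like this', 'why do i hate this']
classes = ['positive', 'negative', 'negative', 'positive', 'positive', 'negative']

count_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 3))
tfidf = TfidfTransformer(use_idf=True)

# Fit on the training data only
train_counts = count_vectorizer.fit_transform(comments)
train_tfidf = tfidf.fit_transform(train_counts)
classifier = BernoulliNB().fit(train_tfidf, classes)

# New data: transform only, so unseen words are simply dropped
comments_new = ['positive', 'zebra', 'george', 'nothing']
new_counts = count_vectorizer.transform(comments_new)
new_tfidf = tfidf.transform(new_counts)
print(classifier.predict(new_tfidf))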
You are creating a count matrix from the words in 'comments'. When creating the count matrix you must use all of the words you will encounter in your problem. Imagine the simpler case where you create a membership matrix: each column stands for a specific word, and each row for a specific example from the dataset (for example, an email text). The matrix holds 0 if the specific word is not in the example and 1 if it is. Obviously, if you have built such a matrix for emails that contain, say, 100 different words, the matrix will have 100 columns. But if you then try to use the trained classifier on new data containing a new word that wasn't in the training set, it will simply fail, since there is no column in the original matrix to hold values for that new word. So, once again, during vectorization of the text you must provide all the terms you will ever face in the train and test datasets.
So instead of calling CountVectorizer and TfidfTransformer on 'comments' alone, you would join comments and comments_new into one list and call CountVectorizer and TfidfTransformer on the joined list.
