I want to build a content-based recommender system in Python that uses multiple attributes to decide whether two items are similar. In my case, the "items" are packages hosted by the C# package manager (example) that have various attributes such as name, description, and tags, which could help identify similar packages.
I have a prototype recommender system here that currently uses only a single attribute, the description, to decide whether packages are similar. It computes TF-IDF rankings for the descriptions and prints out the top 10 recommendations based on that:
# Code mostly stolen from http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def train(dataframe):
    tfidf = TfidfVectorizer(analyzer='word',
                            ngram_range=(1, 3),
                            min_df=0,
                            stop_words='english')
    tfidf_matrix = tfidf.fit_transform(dataframe['description'])
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
    for idx, row in dataframe.iterrows():
        # Highest-scoring indices for this row (includes the item itself, filtered out below)
        similar_indices = cosine_similarities[idx].argsort()[:-10:-1]
        similar_items = [(dataframe['id'][i], cosine_similarities[idx][i])
                         for i in similar_indices]
        id = row['id']
        similar_items = [it for it in similar_items if it[0] != id]
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items, ())
        try_print("Top 10 recommendations for %s: %s" % (id, flattened))  # try_print is my print helper
How can I combine cosine_similarities with other similarity measures (based on same author, similar names, shared tags, etc.) to give more context to my recommendations?
For some context, my work with content-based recommenders has revolved primarily around raw text and categorical data/features. Here's a high-level approach I've taken that has worked out nicely and is pretty simple to implement.
Suppose I have three feature columns that I can potentially use to make recommendations: description, name, and tags. To me, the path of least resistance entails combining these three feature sets in a useful way.
You're off to a good start, using TF-IDF to encode description. So why not treat name and tags in a similar way by creating a feature "corpus" consisting of description, name, and tags? Literally, this would mean concatenating the contents of each of the three columns into one long text column.
Be wise about the concatenation, though, as it's probably to your advantage to preserve which column a given word came from, especially for features like name and tags, which presumably have much lower cardinality than description. To put it more explicitly: instead of just creating your corpus column like this:
df['corpus'] = (pd.Series(df[['description', 'name', 'tags']]
                          .fillna('')
                          .values.tolist()
                          ).str.join(' '))
You might try preserving information about where particular data points in name and tags come from. Something like this:
df['name_feature'] = ['name_{}'.format(x) for x in df['name']]
df['tags_feature'] = ['tags_{}'.format(x) for x in df['tags']]
And after you do that, I would take things a step further by considering how the default tokenizer (which you're using above) works in TfidfVectorizer. Suppose you have the name of a given package's author: "Johnny 'Lightning' Thundersmith". If you just concatenate that literal string, the tokenizer will split it up and roll each of "Johnny", "Lightning", and "Thundersmith" into separate features, which could potentially diminish the information added by that row's value for name. I think it's best to try to preserve that information. So I would do something like this to each of your lower-cardinality text columns (e.g. name or tags):
import string

def raw_text_to_feature(s, sep=' ', join_sep='x', to_include=string.ascii_lowercase):
    def filter_word(word):
        return ''.join([c for c in word if c in to_include])
    # lowercase first so uppercase letters aren't silently dropped by the filter
    return join_sep.join([filter_word(word) for word in s.lower().split(sep)])

df['name_feature'] = df['name'].apply(raw_text_to_feature)
The same sort of critical thinking should be applied to tags. If you've got a comma-separated "list" of tags, you'll probably have to parse those individually and figure out the right way to use them.
Ultimately, once you've got all of your <x>_feature columns created, then you can create your final "corpus" and plug that into your recommender system as inputs.
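For instance, a minimal sketch of that last step, assuming the name_feature and tags_feature columns created above plus the original description column (column names are just illustrations):

from sklearn.feature_extraction.text import TfidfVectorizer

# Build the combined corpus from the engineered feature columns plus the raw description
df['corpus'] = (df['description'].fillna('') + ' '
                + df['name_feature'].fillna('') + ' '
                + df['tags_feature'].fillna(''))

# The corpus then feeds the same TF-IDF + cosine-similarity pipeline as before
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['corpus'])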
This whole system takes some engineering, to be sure, but I've found it's the easiest way to introduce new information from other columns that have different cardinalities.
As I understand your question, there are two ways this can be done:
Combine the other features with tfidf_matrix and then calculate the cosine similarity
Calculate the similarity of other features using other methods and then somehow combine them with the cosine similarity of tfidf_matrix to get a meaningful metric.
I was talking about the first one.
For example, let's say that for your data the tfidf_matrix (for only the 'description' column) has shape (3000, 4000),
where 3000 is the number of rows in the data and 4000 is the number of unique words (the vocabulary) found by the TfidfVectorizer.
Now let's say you do some feature processing on the other columns ('authors', 'id', etc.) and that produces 5 columns, so the shape of that data is (3000, 5).
I was saying to combine the two matrices (combine the columns) so that the new shape of your data is (3000, 4005) and then calculate the cosine_similarity.
See below example:
from scipy import sparse
from sklearn.metrics.pairwise import linear_kernel

# This is your original matrix
tfidf_matrix = tfidf.fit_transform(dataframe['description'])

# This is the matrix of other features
other_matrix = some_processing_on_other_columns()

combined_matrix = sparse.hstack((tfidf_matrix, other_matrix))
cosine_similarities = linear_kernel(combined_matrix, combined_matrix)
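For example, some_processing_on_other_columns() could be as simple as one-hot encoding a categorical column. A minimal sketch, assuming the dataframe has an 'authors' column (the column name is just an illustration):

from sklearn.preprocessing import OneHotEncoder

def some_processing_on_other_columns():
    # One-hot encode a hypothetical 'authors' column; returns a sparse matrix
    # of shape (n_rows, n_unique_authors) that can be hstack-ed with tfidf_matrix.
    encoder = OneHotEncoder(handle_unknown='ignore')
    return encoder.fit_transform(dataframe[['authors']])

Depending on how many extra columns you add, you may also want to scale or weight them so a handful of one-hot features isn't drowned out by thousands of TF-IDF columns.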
You have a vector for a user $\gamma_u$ and an item $\gamma_i$. The scoring function for your recommendation is:
$$f(u, i) = \gamma_u \cdot \gamma_i$$
Right now you said your feature vector has only one component, but once you add more, this model will scale to handle them.
In this case you have already engineered your vectors, but typically in recommenders the features are learned through matrix factorization. That is called a latent factor model, whereas here you have a hand-crafted model.
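As a minimal sketch of that scoring with hand-crafted vectors (the numbers below are made up):

import numpy as np

gamma_u = np.array([0.3, 0.0, 0.9])  # hypothetical vector for user/query u
gamma_i = np.array([0.1, 0.5, 0.8])  # hypothetical vector for item i

score = float(gamma_u @ gamma_i)     # dot-product scoring
print(score)                         # 0.75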
I am trying to achieve something similar to the product-similarity calculation used in this example: how-to-build-recommendation-system-word2vec-python/
I have a dictionary where the key is the item_id and the value is the product associated with it. For example: dict_items([('100018', ['GRAVY MIX PEPPER']), ('100025', ['SNACK CHEEZIT WHOLEGRAIN']), ('100040', ['CAULIFLOWER CELLO 6 CT.']), ('100042', ['STRIP FRUIT FLY ELIMINATOR'])....)
The data structure is the same as in the example (as far as I know). However, I am getting KeyError: "word '100018' not in vocabulary" when calling the similarity function on the model using the key present in the dictionary.
from gensim.models import Word2Vec

# train word2vec model
model = Word2Vec(window = 10, sg = 1, hs = 0,
                 negative = 10, # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed = 14)

model.build_vocab(purchases_train, progress_per=200)

model.train(purchases_train, total_examples = model.corpus_count,
            epochs=10, report_delay=1)
def similar_products(v, n = 6):  # similarity function
    # extract most similar products for the input vector
    ms = model.similar_by_vector(v, topn= n+1)[1:]
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
I am calling the function using:
similar_products(model['100018'])
Note: I was able to run the example code with the very similar data structure input which was also a dictionary. Can someone tell me what I am missing here?
If you get a KeyError telling you a word isn't in your model, then the word genuinely isn't in the model.
If you've trained the model yourself, and expected the word to be in the resulting model, but it isn't, something went wrong with training.
You should look at the corpus (purchases_train in your code) to make sure each item is of the form the model expects: a list of words. You should enable logging during training, and watch the output to confirm the expected amount of word-discovery and training is happening. You can also look at the exact list-of-words known-to-the-model (in model.wv.key_to_index) to make sure it has all the words you expect.
One common gotcha is that, for the best operation of the word2vec algorithm, the Word2Vec class uses min_count=5 by default. (Word2vec only works well with multiple, varied examples of a word's usage; a word appearing just once, or just a few times, usually won't get a good vector and may even make other surrounding words' vectors worse, so the usual best practice is to discard very-rare words.)
Does the (pseudo-)word '100018' appear in your corpus fewer than 5 times? If so, the model will ignore it as a word too rare to get a good vector, or to have any positive influence on other word-vectors.
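A quick sketch of both checks, assuming the purchases_train corpus and model from your code:

from collections import Counter

# How often does '100018' actually appear in the training corpus?
counts = Counter(word for sentence in purchases_train for word in sentence)
print(counts['100018'])                   # if this is < 5, the default min_count drops it

# Did it survive into the model's vocabulary?
print('100018' in model.wv.key_to_index)  # True only if the word was kept

If the count really is below 5 and you still want a vector for it, you could pass min_count=1 to Word2Vec, though rare words tend to get poor vectors.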
Separately, the site you're taking example code from may not be a quality source. It changes a bunch of default values for no good reason - such as setting alpha and min_alpha to peculiar non-standard values, with no comment as to why. That is usually a sign that someone who doesn't know what they're doing is copying the odd choices of someone else who didn't know what they were doing either.
I have a dataframe like this with columns - ["A","B","C","D"]
A --> Categorical feature with 2 values, say Yes or No
B --> Categorical feature with 10 unique values, like "AAXX-10","BBYY-20" etc
C --> A date-time field
D --> Text-based column describing whether a person was interested in the movie or not, based on short text (basically their comments after coming out of the theatre)
Sample df
A | B | C | D
------------------------------------------------------------------------------
Yes|AAXX-10|8/10/2018|"Yes I liked the movie, it was great"
------------------------------------------------------------------------------
Yes|BBYY-20|8/10/2017|"I liked the performance of the cast in the movie but as a whole, It was just average"
------------------------------------------------------------------------------
No |AANN-88|8/10/2013|"Never seen a ridiculous movie like this"
I have two questions here -
I want to make a fifth column, say "Interest", based on the column "D" which would have 4 categories ["Liked", "Didn't like", "Average", "Cannot comment"]. How could I do that?
--On the basis of "D", the "Interest" column should have ["Liked", "Average", "Didn't like"]--.
Since most of the columns are categorical or date-time, with only one text column, how should I go about the feature engineering in this particular scenario to be able to feed the data to KMeans?
How do I get features out of column "D", which is a text feature?
Should I convert column A to binary 0s and 1s?
Should I do one-hot encoding/label encoding on the second column?
How to make use of the date-time feature in the clustering?
Things I have tried -
I did preprocessing and feature engineering on columns A (converted to binary), B (label encoding), C (converted the dates to year and month features) and D (ignored this feature as I did not know how I could use it).
Based on this, I got clusters using kmeans.labels_, but those clusters are numeric 1,2,3,4.
How can I actually map those to ["Liked", "Didn't like", "Average", "Cannot comment"]?
How can I use the text column efficiently to make the clusters?
Just short answers to my query would do. I don't need any implementation.
To answer the second question first:
A: can be turned to binary
B: what information can you get from a list of unique strings by encoding? After encoding you are left with either the identity matrix (one-hot) or a list of monotonically increasing ints (label encoding).
C: you might be better off transforming this to a Unix epoch timestamp if the date range allows it; that lets you calculate distances properly (see the short sketch at the end of this answer).
D: This is the bread and butter of the project. The processing is fairly involved, but here is a short summary:
A basic recipe includes but is not limited to:
Text normalization:
convert to lower or upper case
convert numbers into words or remove numbers
remove punctuation, accent marks and other diacritics
remove leading or trailing white space
Corpus tokenization (split each row into a list of single words)
remove stop words (a, the, ...); they contain very little information and are common
Stemming or lemmatization. These reduce words to a base form. Stemming is quite crude and can produce invalid words, but is fast; lemmatization produces valid words based on a dictionary, but is slower
... and many more possible steps
Finally, feature extraction with TF-IDF: this is a sort of encoding that gives each word an importance score. The method increases the weight of a word when it appears many times in a document, and lowers its weight when the word is common across many documents.
Example for tf-idf:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.shape)
After these steps, you have the features you need to answer your first question.
You can find code on how to do all this stuff here (with NLTK). You might not be allowed to use NLTK however, in which case, you will have a hard time doing all these steps.
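For columns A and C specifically, a minimal sketch of the encodings suggested above (using sample values modeled on the question's dataframe; the derived column names are just illustrations):

import pandas as pd

# Sample rows modeled on the question's dataframe
df = pd.DataFrame({'A': ['Yes', 'Yes', 'No'],
                   'C': ['8/10/2018', '8/10/2017', '8/10/2013']})

# A: binary encoding
df['A_bin'] = (df['A'] == 'Yes').astype(int)

# C: Unix epoch seconds, so distances between dates are meaningful for k-means
df['C_epoch'] = (pd.to_datetime(df['C']) - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
print(df)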
I am currently working with the TFIDF Vectorizer within SciKit-Learn. The Vectorizer is supposed to apply a formula to detect the most frequent word pairs (bigrams) within a Pandas DataFrame.
The below code section however only returns the frequency analysis for five bigrams while the dataset includes thousands of bigrams for which the frequencies should be calculated.
Does anyone have a smart idea to get rid of my error that limits the number of calculations to 5 responses? I have been researching regarding a solution but have not found the right tweak yet.
The relevant code section is shown below:
def get_top_n_bigram_Group2(corpus, n=None):
    # settings that you use for count vectorizer will go here
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 2), stop_words='english', use_idf=True).fit(corpus)
    # just send in all your docs here
    tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(corpus)
    # get the first vector out (for the first document)
    first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]
    # place tf-idf values in a pandas data frame
    df1 = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"])
    df2 = df1.sort_values(by=["tfidf"], ascending=False)
    return df2
And the calling code looks like this:
for i in ['txt_pro','txt_con','txt_adviceMgmt','txt_main']:
    # Loop over the common words inside the JSON object
    common_words = get_top_n_bigram_Group2(df[i], 500)
    common_words.to_csv('output.csv')
The proposed changes to achieve what you asked for, also taking into account your comments, are as follows:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

def get_top_n_bigram_Group2(corpus, n=None, my_vocabulary=None):
    # settings that you use for count vectorizer will go here
    count_vectorizer = CountVectorizer(ngram_range=(2, 2),
                                       stop_words='english',
                                       vocabulary=my_vocabulary,
                                       max_features=n)
    # just send in all your docs here
    count_vectorizer_vectors = count_vectorizer.fit_transform(corpus)
    # Create a list of (bigram, frequency) tuples sorted by their frequency
    sum_bigrams = count_vectorizer_vectors.sum(axis=0)
    bigram_freq = [(bigram, sum_bigrams[0, idx]) for bigram, idx in count_vectorizer.vocabulary_.items()]
    # place bigrams and their frequencies in a pandas data frame
    df1 = pd.DataFrame(bigram_freq, columns=["bigram", "frequency"]).set_index("bigram")
    df1 = df1.sort_values(by=["frequency"], ascending=False)
    return df1

# a list of predefined bigrams
my_vocabulary = ['bigram 1', 'bigram 2', 'bigram 3']

for i in ['text']:
    # Loop over the common words inside the JSON object
    common_words = get_top_n_bigram_Group2(df[i], 500, my_vocabulary)
    common_words.to_csv('output.csv')
If you do not provide the my_vocabulary argument in the get_top_n_bigram_Group2() then the CountVectorizer will count all bigrams without any restriction and will return only the top 500 (or whatever number you request in the second argument).
Please let me know if this is what you were looking for.
Note that the TFIDF is not returning frequencies but rather scores (or if you prefer 'weights').
I would understand the need for TF-IDF if you did not have a predefined list of bigrams and were looking for a way to score all possible bigrams, rejecting those that appear in nearly every document and therefore carry little information (for example, the bigram "it is" appears very frequently in texts but means very little).
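To make the distinction concrete, here is a small sketch (with a made-up corpus) showing that CountVectorizer returns raw bigram counts while TfidfVectorizer returns real-valued weights:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["it is a good movie", "it is a bad movie", "great cast overall"]

counts = CountVectorizer(ngram_range=(2, 2)).fit_transform(corpus)
print(counts.toarray())   # integer frequencies per bigram

weights = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(corpus)
print(weights.toarray())  # real-valued TF-IDF scores per bigram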
I'm having issues with presenting my data in a form which sklearn will accept.
My raw data is a few hundred strings, and these are classified into one of 5 classes. I have a list of the strings I'd like to classify and a parallel list of their respective classes. I'm using GaussianNB().
Example Data:
For such a large, successful business, I really feel like they need to be
either choosier in their employee selection or teach their employees to
better serve their customers.|||Class:4
Which represents a given "feature" and a classification
Naturally, the strings themselves have to be converted to vectors prior to their use in the classifier. I've attempted to use DictVectorizer to perform this task:
dictionaryTraining = convertListToSentence(data)
vec = DictVectorizer()
print(dictionaryTraining)
vec.fit_transform(dictionaryTraining)
However, in order to do it, I have to attach the actual classification of the data to the dictionary; otherwise I get the error 'str' object has no attribute 'items'. I understand this is because .fit_transform requires features and indices, but I don't fully understand the purpose of the indices.
fit_transform(X[, y]) Learn a list of feature name -> indices mappings and transform X.
My question is: how can I take a list of strings and a list of numbers representing their classifications, and provide these to a GaussianNB() classifier such that I can present it with a similar string in the future and it will estimate that string's class?
Since your input data are raw text and not a dictionary like {"word": number_of_occurrences}, I believe you should go with a CountVectorizer, which will split your input text on white space and transform it into the input vectors you need.
A simple example of such a transformation would be:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.', 'This is the second second document.',
          'And the third one.', 'Is this the first document?',]
x = CountVectorizer().fit_transform(corpus)
print(x.todense())  # x holds your features. Here I am only visualizing it
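From there, a minimal end-to-end sketch with GaussianNB (the example strings and labels below are made up; substitute your own parallel lists):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

texts = ["great service and friendly staff", "rude employees, terrible service"]
labels = [4, 1]  # hypothetical class labels, parallel to texts

vec = CountVectorizer()
X = vec.fit_transform(texts).toarray()   # GaussianNB needs a dense array
clf = GaussianNB().fit(X, labels)

# Classify a new string using the *same* fitted vectorizer
new = vec.transform(["friendly staff, great experience"]).toarray()
print(clf.predict(new))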
Company 1 has this vector:
['books','video','photography','food','toothpaste','burgers'] ... ...
Company 2 has this vector:
['video','processor','photography','LCD','power supply', 'books'] ... ...
Suppose this is a frequency distribution (I could make it a tuple but too much to type).
As you can see...these vectors have things that overlap. "video" and "photography" seem to be "similar" between two vectors due to the fact that they are in similar positions. And..."books" is obviously a strong point for company 1.
Ordering and positioning does matter, as this is a frequency distribution.
What algorithms could you use to play around with this? What algorithms could you use that could provide valuable data for these companies, using these vectors?
I am new to text-mining and information-retrieval. Could someone guide me about those topics in relation to this question?
If position is very important, as you emphasize, then the crucial metric will be based on the difference of positions between the same items in the different vectors (you can, for example, sum the absolute values of the differences, or their squares). The big issue that needs to be solved is: how much to weigh an item that's present (say it's the N-th one) in one vector and completely absent in the other. Is that a relatively minor issue - as if the missing item were actually present right after the actual ones, for example - or a really, really big deal? That's impossible to say without more understanding of the actual application area. You can try various ways to deal with this issue and see what results they give on example cases you care about!
For example, suppose "a missing item is roughly the same as if it were present, right after the actual ones". Then, you can preprocess each input vector into a dict mapping item to position (crucial optimization if you have to compare many pairs of input vectors!):
def makedict(avector):
    return dict((item, i) for i, item in enumerate(avector))
and then, to compare two such dicts:
def comparedicts(d1, d2):
    allitems = set(d1) | set(d2)
    distances = [d1.get(x, len(d1)) - d2.get(x, len(d2)) for x in allitems]
    return sum(d * d for d in distances)
(or, abs(d) instead of the squaring in the last statement). To make missing items weigh more (make dicts, i.e. vectors, be considered further away), you could use twice the lengths instead of just the lengths, or some large constant such as 100, in an otherwise similarly structured program.
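For the two company vectors from the question, a quick usage check of these helpers might look like:

a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']

d1, d2 = makedict(a), makedict(b)
print(comparedicts(d1, d2))  # 78 with the squared-difference version above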
I would suggest a book called Programming Collective Intelligence. It's a very nice book on how you can retrieve information from simple data like this one. There are code examples included (in Python :)
Edit:
Just replying to gbjbaanb: This is Python!
>>> a = ['books','video','photography','food','toothpaste','burgers']
>>> b = ['video','processor','photography','LCD','power supply', 'books']
>>> a = set(a)
>>> b = set(b)
>>> a.intersection(b)
set(['photography', 'books', 'video'])
>>> b.intersection(a)
set(['photography', 'books', 'video'])
>>> b.difference(a)
set(['LCD', 'power supply', 'processor'])
>>> a.difference(b)
set(['food', 'toothpaste', 'burgers'])
Take a look at Hamming Distance
As mbg mentioned, the Hamming distance is a good start. It basically assigns a bit to every possible item, indicating whether it is contained in a company's vector.
E.g. toothpaste is 1 for company A but 0 for company B. You then count the bits which differ between the companies. The Jaccard coefficient is related to this.
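As a quick sketch of the Jaccard coefficient on the two example vectors (treated as sets):

a = {'books', 'video', 'photography', 'food', 'toothpaste', 'burgers'}
b = {'video', 'processor', 'photography', 'LCD', 'power supply', 'books'}

jaccard = len(a & b) / len(a | b)
print(jaccard)  # 3 shared items out of 9 distinct items -> 0.333...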
Hamming distance will actually not be able to capture similarity between things like "video" and "photography". Obviously, a company that sells one is also more likely to sell the other than a company that sells toothpaste.
For this, you can use approaches like LSI (which is also used for dimensionality reduction) or factorial codes (e.g. neural-network approaches such as Restricted Boltzmann Machines, autoencoders or predictability minimization) to get more compact representations which you can then compare using the Euclidean distance.
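A rough sketch of the LSI idea using scikit-learn's TruncatedSVD, treating each company's item list as a pseudo-document (this is only an illustration, not a tuned model):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import euclidean_distances

# Each company's item list becomes one pseudo-document
docs = ['books video photography food toothpaste burgers',
        'video processor photography lcd power_supply books']

counts = CountVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2).fit_transform(counts)  # compact latent representation
print(euclidean_distances(lsi))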
Pick the rank of each entry (higher rank is better) and take the sum of geometric means between matches.
For two vectors:
sum(sqrt(vector_multiply(x,y))) // multiply matches
The sum of ranks over each vector should be the same for every vector (preferably 1).
That way you can compare more than two vectors.
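One reading of this suggestion, as a sketch under that interpretation (not necessarily exactly what the author had in mind): give every item a rank-based weight that sums to 1 per vector, then score a pair of vectors by summing the square roots of the products of weights for shared items.

import math

def rank_weights(items):
    # Earlier positions get larger weights; weights sum to 1 per vector.
    n = len(items)
    total = n * (n + 1) / 2
    return {item: (n - i) / total for i, item in enumerate(items)}

def similarity(x, y):
    wx, wy = rank_weights(x), rank_weights(y)
    return sum(math.sqrt(wx[item] * wy[item]) for item in set(wx) & set(wy))

a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']
print(similarity(a, b))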
If you apply ikkebr's method you can find how similar a is to b.
in that case just use
sum( b( b.intersection(a) ))
You could use the std::set_intersection algorithm. The two vectors must be sorted first (call sort on them), then pass in four iterators and you'll get a collection back with the common elements inserted into it. There are a few other algorithms that operate similarly.