Using a recommender system with new user - python

I am using the "implicit" package (https://github.com/benfred/implicit) to create a recommender system in Python. More precisely, I am using the implicit alternating least squares (ALS) algorithm.
The library is pretty easy to use: I was able to make predictions for already existing users and to find similar items, no problem. But how can I make predictions for a new user who was not in the input data? My goal is to get predictions from a new vector of items (essentially a new user). All of the items exist in the input data.
This library and other equivalent ones usually provide a predict method only for users already existing in the dataset.
My first attempt was to get a prediction vector for each item and sum them all, but that does not feel right, does it?
This seems like a common use case, so I think I am missing something. What would be the right method to use? Thank you for your help.

It depends on what you are recommending, but if it is something like movies, then for a brand-new user we would generally just recommend the most popular movies (the classic cold-start strategy). Then, as we get to know more about the user, we can switch to the usual matrix factorization.
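For what it's worth, implicit itself can score a brand-new interaction vector without retraining, via recalculate_user=True. Below is a rough sketch, assuming a recent implicit version (0.5+) where fit() takes a user x item matrix and recommend() returns (ids, scores) arrays; the data and item ids are made up:

import numpy as np
from scipy.sparse import csr_matrix
from implicit.als import AlternatingLeastSquares

# Hypothetical training data: user x item interaction matrix
rng = np.random.default_rng(0)
user_items = csr_matrix(rng.integers(0, 2, size=(100, 50)).astype(np.float32))

model = AlternatingLeastSquares(factors=32)
model.fit(user_items)

# A brand-new user, described only by the items they interacted with
new_user_row = np.zeros((1, user_items.shape[1]), dtype=np.float32)
new_user_row[0, [3, 7, 12]] = 1.0  # hypothetical item ids
new_user_row = csr_matrix(new_user_row)

# recalculate_user=True folds the new user in from the learned item factors,
# so no retraining is needed
ids, scores = model.recommend(0, new_user_row, N=10, recalculate_user=True)
print(ids, scores)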

Related

Can the NGBoost algorithm process missing values automatically?

I came across a new GBDT algorithm called NGBoost, created by the Stanford ML Group. I wanted to use it, so I ran
pip install ngboost==0.2.0
to install it.
I then trained on a dataset without imputing or deleting the missing values.
However, I get an error:
Input contains NaN, infinity or a value too large for dtype('float32').
Does this mean NGBoost cannot process missing values automatically, the way XGBoost can?
There are two possibilities behind this error.
1- You have some really large values. Check the max of your columns.
2- The algorithm doesn't support NaN and inf values, so you have to handle them yourself, as with some other regression models.
Here's a response from one of the ngboost creators about that
Hey #omsuchak, thanks for the suggestion. There is no one "natural" or good way to generically handle missing data. If ngboost were to do this for you, we would be making a number of choices behind the scenes that would be obscured from the user.
If we limited ourselves to use cases where the base learner is a regression tree (like we do with the feature importances) there are some reasonable default choices for what to do with missing data. Implementing those strategies here is probably not crazy hard to do but it's also not a trivial task. Either way, I'd want the user to have a transparent choice about what is going on. I'd be open to review pull requests on that front as they satisfy that requirement, but it's not something I plan on working on myself in the foreseeable future. I'll close for now but if anyone wants to try to add this please feel free to comment.
And then you can look at other answers about how to solve that, for example with the sklearn.impute.MissingIndicator module (to indicate to the model the presence of missing values) or with an imputer module.
If you need a practical example you can try with the survival example (located in the repo!).
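For instance, a minimal sketch of the imputation route, assuming a recent ngboost that exposes NGBRegressor (the data here is made up):

import numpy as np
from sklearn.impute import SimpleImputer
from ngboost import NGBRegressor

# Hypothetical data with missing values
X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Impute before training, since ngboost does not handle NaN by itself
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

model = NGBRegressor().fit(X_imputed, y)
print(model.predict(X_imputed))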

How do you solve the long tail problem in a recommendation system?

I want to make a movie recommendation system using binary ratings, i.e. whether a person has seen the movie or not. I am using various cosine similarity techniques, but the issue is the long tail in the recommendation system. I am not able to find any concrete solution which uses just viewed/not-viewed (i.e. 0 or 1) rather than explicit ratings for the recommendation. What other popular algorithms can be used here? I need to remove the long tail issue.
I have used Adaptive Clustering, but it needs many derived variables and those are not present here.
I tried other approaches like Total Clustering, but with no luck.
I tried Popularity Sensitive Clustering, but had the same issue.
I have been stuck on this long tail issue and cannot find a good implementation for my work or even a research paper that helps.
Everyone is using either ratings or user data, but my data has no user info and no ratings, just the binary values.
The long tail issue in recommendation systems is basically about how to recommend items that do not have many interactions (ratings/likes, etc.), since similarity algorithms like cosine similarity and clustering algorithms fail at recommending them. You need to look into diversity-increasing algorithms.
What I mean is: rather than calculating similarity, try calculating dissimilarity.
Here, R is the recommendation list and d(i, j) is the dissimilarity between items i and j.
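One standard way to write such a diversity objective (this exact form is my assumption; the original formula may differ) is the average pairwise dissimilarity over the list:

\mathrm{Diversity}(R) = \frac{1}{|R|(|R|-1)} \sum_{i \in R} \sum_{j \in R,\, j \neq i} d(i, j)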
You can use the surprise library to generate R here, using matrix factorization algorithms.
Also, when you generate a user vs. item matrix where matrix[user_i][item_j] denotes the rating, you can convert it to 1 where a rating exists and 0 otherwise, and it will still work. These binary "ratings" are generally called interactions the user had with the item.
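As a rough sketch of the diversity idea, here is one way to compute the average pairwise dissimilarity of a recommendation list from a binary interaction matrix, using 1 minus cosine similarity as d(i, j) (the matrix and list below are made up):

import numpy as np
from scipy.spatial.distance import cosine

# Hypothetical user x item interaction matrix: 1 = seen, 0 = not seen
interactions = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
])

# Item vectors are the columns of the interaction matrix
item_vectors = interactions.T

def dissimilarity(i, j):
    # d(i, j) = 1 - cosine similarity between the two item vectors
    return cosine(item_vectors[i], item_vectors[j])

def diversity(recommended_items):
    # Average pairwise dissimilarity over the recommendation list R
    pairs = [(i, j) for i in recommended_items for j in recommended_items if i != j]
    return sum(dissimilarity(i, j) for i, j in pairs) / len(pairs)

print(diversity([0, 1, 3]))  # hypothetical recommendation list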

Python: Create Nomograms from Data (using PyNomo)

I am working in Python 2.7. I want to create nomograms based on data for various variables in order to predict one variable. I have looked into, and installed, the PyNomo package.
However, from the documentation here and here and from the examples, it seems that nomograms can only be made when you have equation(s) relating these variables, and not from the data itself. For example, the examples here show how to use equations to create nomograms. What I want is to create a nomogram from the data and use it to predict things. How do I do that? In other words, how do I make the nomograph take data as input and not a function? Is it even possible?
Any input would be helpful. If PyNomo cannot do it, please suggest some other package (in any language). For example, I am trying the nomogram function from the rms package in R, but I am not having luck figuring out how to use it properly. I have asked a separate question for that here.
The term "nomogram" has become somewhat confused of late as it now refers to two entirely different things.
A classic nomogram performs a full calculation - you mark two scales, draw a straight line across the marks and read your answer from a third scale. This is the type of nomogram that PyNomo produces, and as you correctly say, you need a formula. So producing this kind of nomogram from data is definitely a two-step process: first fit a formula (for example a regression) to your data, then hand that formula to PyNomo.
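For example, the first step could look like the following scikit-learn sketch; the fitted equation is then what you would describe to PyNomo as the nomogram's defining formula (the data and variable names are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two predictors and one target
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

model = LinearRegression().fit(X, y)
a, b = model.coef_
c = model.intercept_

# The fitted equation y = a*x1 + b*x2 + c is what you would then
# encode as PyNomo's defining formula
print("y = %.3f*x1 + %.3f*x2 + %.3f" % (a, b, c))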
The other use of the term (very popular recently) is to refer to regression nomograms. These are graphical depictions of regression models (usually logistic regression models). For these, a group of parallel scales, one per predictor variable, is drawn with a common points scale at the bottom; for each predictor you read the 'score' off the scale and add these up. These types of nomograms have become very popular in the last few years, and that's what the rms package will draw. I haven't used it, but my understanding is that it works directly from the data.
Hope this is of some use! :-)

Python Deep Learning find Duplicates

I want to try and learn deep learning with Python.
The first thing that came to my mind as a useful scenario is a duplicate check.
Let's say you have a customer table with name, address, tel and email, and you want to insert new customers.
E.g.:
In Table:
Max Test,Teststreet 5, 00642 / 58458,info#max.de
To Insert:
Max Test, NULL, (+49)0064258458, test#max.de
This should be recognised as a duplicate entry.
Are there already tutorials out there for this use case? Or is it even possible with deep learning?
Duplicate matching is a special case of similarity matching. You can define the input features as either individual characters or whole fields and then train your network. It's a binary classification problem (duplicate / not duplicate), unless you want a similarity score (e.g. 95% match). The network should be able to learn that punctuation and whitespace are irrelevant, and something like an 'or' over the fields, so that at least one matching field produces a true positive.
This sounds like a fairly simple case for deep learning.
I don't know of any specific tutorial for this, but I have tried to give you some keywords to look for.
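As a toy sketch of that idea, here is a simple linear classifier standing in for the network, trained on per-field similarity features; the pairs and labels are made up:

import numpy as np
import Levenshtein  # pip install python-Levenshtein
from sklearn.linear_model import LogisticRegression

def pair_features(rec_a, rec_b):
    # one similarity score per field (name, address, tel, email)
    return [Levenshtein.ratio(a, b) for a, b in zip(rec_a, rec_b)]

# Hypothetical labelled pairs: 1 = duplicate, 0 = not a duplicate
pairs = [
    (("Max Test", "Teststreet 5", "00642 / 58458", "info#max.de"),
     ("Max Test", "", "(+49)0064258458", "test#max.de"), 1),
    (("Max Test", "Teststreet 5", "00642 / 58458", "info#max.de"),
     ("Anna Muster", "Ring 9", "1234", "anna#muster.de"), 0),
]

X = np.array([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = LogisticRegression().fit(X, y)  # stand-in for the network described above
print(clf.predict_proba(X)[:, 1])     # duplicate probability per pair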
You can use duplicates = dataset.duplicated() (assuming dataset is a pandas DataFrame).
It will return a boolean Series marking the rows which are duplicates of an earlier row.
Then:
print(sum(duplicates))
to print the count of duplicated rows.
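A minimal runnable sketch of that (the table is made up); note that duplicated() only flags exact repeats, so it will not catch the fuzzy variants from the question:

import pandas as pd

# Hypothetical customer table
dataset = pd.DataFrame({
    "name":  ["Max Test", "Max Test", "Anna Muster"],
    "email": ["info#max.de", "info#max.de", "anna#muster.de"],
})

duplicates = dataset.duplicated()  # True for rows that exactly repeat an earlier row
print(dataset[duplicates])         # the duplicated rows themselves
print(sum(duplicates))             # count of duplicated rows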
In your case, finding duplicates in numeric and categorical data should be simpler. The problem arises when it is free text. I think you should try fuzzy matching techniques to start with. There is a good distance metric available in Python called the Levenshtein distance, and the library for calculating it, python-Levenshtein, is pretty fast. See if you get good results using this distance metric; if you want to improve further, you can go for deep learning algorithms like RNNs, LSTMs, etc., which are good for text data.
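A quick sketch of the fuzzy matching, reusing the records from the question (the 0.7 threshold is a made-up choice you would tune on labelled pairs):

import Levenshtein  # pip install python-Levenshtein

existing = "Max Test,Teststreet 5,00642 / 58458,info#max.de"
incoming = "Max Test,,(+49)0064258458,test#max.de"

# ratio() returns a similarity score in [0, 1]
similarity = Levenshtein.ratio(existing, incoming)
print(similarity)
if similarity > 0.7:  # hypothetical threshold
    print("possible duplicate")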
The problem of finding duplicate instances in a relational database is a traditional research topic in databases and data mining, called "entity matching" or "entity resolution". Deep learning has also been applied in this domain.
Many related works can be found on Google Scholar by searching for "entity matching" + "deep learning".
I think it's easier to build some functions that can check different input schemes than to train a network to do so. The hard part would be building a large enough data set to train your network correctly.

group detection in large data sets python

I am a newbie in Python and have been trying my hand at different problems which introduce me to different modules and functionalities (I find this a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups that use Facebook as a medium to mass-send information.
I want to group these posts by whether they are content-wise the same.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar as they both ask to go the group's website and register.
P.S.: Just a clarification: if either of the links had been abc.com, the posts wouldn't have been similar.
Priority goes to the source and then to the action (the action being registering, here).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is, can Python make this job easier for me somehow? Some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of the URLs contained within it, I would do it like this:
from collections import defaultdict

groups = defaultdict(list)
for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
That greatly depends on your definition of "content-wise the same". A straightforward approach is to use a so-called Term Frequency - Inverse Document Frequency (TF-IDF) model.
Simply put: make a long list of all the words in all your posts, filter out stop words (articles, determiners, etc.), and for each document (= post) count how often each term occurs, multiplying that by the importance of the term (the inverse document frequency, i.e. the log of the inverse of the fraction of documents in which the term occurs). This way, words which are very rare become more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will carry significant weight, so similar documents might be the ones where the same term achieves the highest score (i.e. the highest component of the document vectors is the same), or perhaps where the Euclidean distance between the three highest-weighted components is below some threshold. That sounds very complicated, but (of course) there's a module for that.
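For example, a minimal sketch with scikit-learn's TfidfVectorizer (the first two posts are from the question, the third is made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "xyz.com is selling free domains. Go register at xyz.com",
    "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost.",
    "abc.com has a new photo contest this week.",
]

# Each post becomes a sparse TF-IDF vector; common words are down-weighted
vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)

# Pairwise cosine similarity between posts; higher means more similar content
print(cosine_similarity(vectors))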
