Suppose I have been given data sets with headers:
id, query, product_title, product_description, brand, color, relevance.
Only id and relevance are in numeric format, while all the others consist of words and numbers. Relevance is the relevance or ranking of a product with respect to a given query. For example: query = "abc" and product_title = "product_x" --> relevance = "2.3"
In the training set, all these fields are filled, but in the test set, relevance is not given and I have to predict it using some machine learning algorithm. I am having trouble determining which features I should use in such a problem. For example, I could use TF-IDF here. What other features can I obtain from such data sets?
Moreover, if you can recommend any book or resource specifically on the topic of feature extraction, that would be great. I always struggle with this phase. Thanks in advance.
I think there is no book that will give the answers you need, as feature extraction is the phase that relates directly to the problem being solved and the data at hand; the only tip you will find is to create features that describe the data you have. In the past I worked on a problem similar to yours, and some features I used were:
Number of query words in product title.
Number of query words in product description.
n-gram counts
tf-idf
Cosine similarity
All this after some preprocessing, like converting all text to upper (or lower) case, stemming, and standard dictionary normalization.
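To make this concrete, here is a rough sketch (with made-up data, assuming pandas and scikit-learn, and lower-casing as the only preprocessing) of how a few of those features could be computed:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # toy frame with the same columns as in the question (values are hypothetical)
    df = pd.DataFrame({
        'query': ['led light bulb'],
        'product_title': ['philips led bulb 60w'],
        'product_description': ['a warm white led light bulb for living rooms'],
    })

    def overlap(query, text):
        # number of distinct query words that also appear in the text
        return len(set(query.lower().split()) & set(text.lower().split()))

    df['query_words_in_title'] = df.apply(
        lambda r: overlap(r['query'], r['product_title']), axis=1)
    df['query_words_in_description'] = df.apply(
        lambda r: overlap(r['query'], r['product_description']), axis=1)

    # tf-idf cosine similarity between each query and its product title
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(df['query'].tolist() + df['product_title'].tolist())
    n = len(df)
    df['query_title_cosine'] = [
        cosine_similarity(tfidf[i], tfidf[n + i])[0, 0] for i in range(n)]

    print(df[['query_words_in_title', 'query_words_in_description', 'query_title_cosine']])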
Again, this depends on the problem and the data, and you will not find a direct answer; it's like posting the question: "I need to develop a product selling system, how do I do it? Is there any book?" You will find books on programming and software engineering, but you will not find a book on developing your specific system; you'll have to use general knowledge and creativity to craft your solution.
I have a list of merchant categories:
[
'General Contractors–Residential and Commercial',
'Air Conditioning, Heating and Plumbing Contractors',
'Electrical Contractors',
....,
'Insulation, Masonry, Plastering, Stonework and Tile Setting Contractors'
]
I want to exclude merchants from my dataframe if df['merchant_category'].str.contains() matches any of these merchant categories.
However, I cannot guarantee that the value in my dataframe has the full long name as in the list of merchant categories. It could be that my dataframe value is just air conditioning.
As such, df = df[~df['merchant_category'].isin(list_of_merchant_category)] will not work.
If you can collect a long list of positive examples (categories you definitely want to keep) and negative examples (categories you definitely want to exclude), you could try to train a text classifier on that data.
It would then be able to look at new texts and make a reasonable guess as to whether you want them included or excluded, based on their similarity to your examples.
So, as you're working in Python, I suggest you look for online tutorials and examples of "binary text classification" using Scikit-Learn.
While there's a bewildering variety of possible approaches both to representing/vectorizing your text and to learning classifications from those vectors, you may have success with some very simple ones commonly used in intro examples. For example, you could represent your textual categories with bag-of-words and/or character-n-gram (word-fragment) representations. Then try Naive Bayes or SVC classifiers (and others, if you need to experiment for possibly-better results).
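As a minimal sketch (the category strings and labels below are made up purely for illustration), such a classifier might look like this:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # hypothetical training examples: 1 = exclude (contractor-like), 0 = keep
    texts = [
        'Electrical Contractors',
        'Air Conditioning, Heating and Plumbing Contractors',
        'air conditioning',
        'Grocery Stores and Supermarkets',
        'Book Stores',
        'Restaurants',
    ]
    labels = [1, 1, 1, 0, 0, 0]

    # bag-of-words features; character n-grams would be a reasonable alternative
    clf = Pipeline([
        ('vec', CountVectorizer(lowercase=True, ngram_range=(1, 2))),
        ('nb', MultinomialNB()),
    ])
    clf.fit(texts, labels)

    print(clf.predict(['plumbing contractor']))        # likely [1] -> exclude
    print(clf.predict_proba(['plumbing contractor']))  # per-class 'confidence'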
Some of these will even report a sort of 'confidence' in their predictions, so you could potentially accept the strong predictions but highlight the weak ones for human review. When a human then looks at, and definitively rules on, a new 'category' string – because it was highlighted as an iffy prediction, or noticed as an error – you can then improve the overall system by:
adding that to the known set that are automatically included/excluded based on an exact literal comparison
re-training the system, so that it has a better chance at getting other new similar strings correct
(I know this is a very high-level answer, but once you've worked through some attempts based on other intro tutorials, and hit issues with your data, you'll be able to ask more specific questions here on SO to get over any specific issues.)
I am trying to automate the task of searching for opportunities (tenders) on 40+ websites for a company. The opportunities are usually displayed in table format. They have a title, date published, and a clickable link that takes you to a detailed description of what the opportunity is.
One website example is:
http://www.eib.org/en/about/procurement/index.htm
The goal would be to retrieve the new opportunities that are posted everyday and that fit specific criteria. So I need to look at specific keywords within the opportunities' title. These keywords are the fields and regions in which the company had previous experience.
My question is: After I extract these tables, with the tenders' titles, in a dataframe format, how do I search for the right opportunities and sort them by relevance (given a list of keywords)? Do I use NLP in this case and turn the words in the titles into binary code (0s and 1s)? Or are there other simpler methods I should be looking at?
Thanks in advance!
To sort the tenders by relevance, you first need to define relevance.
In this case you could count the number of occurrences of your keywords in the tender, and this would be your relevance score. You can then keep only the tenders that contain at least one keyword.
This is a first attempt; you can improve it by adding keywords, or by assigning a higher score if the keyword appears in the title rather than in the detailed description...
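A minimal sketch of that idea (the keyword list, titles, and column name are assumptions):

    import pandas as pd

    keywords = ['water', 'infrastructure', 'energy']   # hypothetical keyword list

    tenders = pd.DataFrame({
        'title': ['Water supply infrastructure works',
                  'Office furniture procurement',
                  'Renewable energy feasibility study'],
    })

    def relevance(title, keywords):
        # count how many times any keyword occurs in the title
        words = title.lower().split()
        return sum(words.count(k) for k in keywords)

    tenders['score'] = tenders['title'].apply(lambda t: relevance(t, keywords))

    # keep only tenders with at least one keyword, sorted by score
    shortlist = tenders[tenders['score'] > 0].sort_values('score', ascending=False)
    print(shortlist)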
The task you might be trying to solve here is information retrieval: rank documents (the tenders) given their relevance to a query (your keywords).
You can then use weighting schemes like TF-IDF or BM25, etc. But it depends on your needs; maybe counting the keywords is more than enough!
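If plain counts turn out to be too coarse, a tf-idf variant of the same ranking could look roughly like this (again with made-up titles):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    titles = ['Water supply infrastructure works',
              'Office furniture procurement',
              'Renewable energy feasibility study']
    query = 'water infrastructure energy'   # the keywords joined into one query

    vec = TfidfVectorizer()
    doc_vectors = vec.fit_transform(titles)
    query_vector = vec.transform([query])

    # cosine similarity between the keyword query and each tender title
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    for score, title in sorted(zip(scores, titles), reverse=True):
        print(round(score, 3), title)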
I have a file called train.dat which has three fields - userID, movieID and rating.
I need to predict the rating in the test.dat file based on this.
I want to know how I can use scikit-learn's KMeans to group similar users, given that I have only one feature - rating.
Does this even make sense to do? After the clustering step, I could do a regression step to get the ratings for each user-movie pair in test.dat
Edit: I have some extra files which contain the actors in each movie, the directors and also the genres that the movie falls into. I'm unsure how to use these to start with and I'm asking this question because I was wondering whether it's possible to get a simple model working with just rating and then enhance it with the other data. I read that this is called content based recommendation. I'm sorry, I should've written about the other data files as well.
scikit-learn is not a library for recommender systems, nor is k-means a typical tool for clustering such data. The things you are trying to do deal with graphs, and are usually analyzed either at the graph level or using various matrix factorization techniques.
In particular, k-means only works in Euclidean spaces, and you do not have such a thing here. What you can do is use DBSCAN (or any other clustering technique that accepts an arbitrary similarity, but this one is actually in scikit-learn) and define the similarity between two users by some measure of agreement in their tastes, for example:
sim(user1, user2) = # movies both users like / # movies at least one of them likes
which is known as the Jaccard coefficient for similarity between binary vectors. You have ratings, not just "liking", but I am giving the simplest possible example here; you can come up with dozens of other things to try out. The point is: for the simplest approach, all you have to do is define a notion of per-user similarity and apply a clustering method that accepts such a setting (like the mentioned DBSCAN).
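A minimal sketch of that idea, assuming you first binarize ratings into "liked" (say, rating >= 4) and build a user-by-user distance matrix:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # hypothetical user x movie 'liked' matrix (1 = liked, 0 = not liked / unrated)
    liked = np.array([
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
    ])

    def jaccard_distance(a, b):
        # 1 - (# movies both users like / # movies at least one of them likes)
        both = np.logical_and(a, b).sum()
        either = np.logical_or(a, b).sum()
        return 1.0 - both / either if either else 1.0

    n = liked.shape[0]
    dist = np.array([[jaccard_distance(liked[i], liked[j]) for j in range(n)]
                     for i in range(n)])

    labels = DBSCAN(eps=0.6, min_samples=1, metric='precomputed').fit_predict(dist)
    print(labels)  # cluster id per user (-1 would mean noise)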
Clustering users makes sense. But if your only feature is the rating, I don't think it could produce a useful model for prediction. Below are the assumptions behind this claim:
The quality of movies should follow a Gaussian distribution.
If we look at the rating distribution of a typical user, it should also be roughly Gaussian.
I don't exclude the possibility that a few users only give ratings when they see a bad movie (thus all low ratings), and vice versa. But across a large population of users, this should be unusual behavior.
Thus I can imagine that after clustering, you get small groups of users at the two extremes, and most users in the middle (because they share the Gaussian-like rating behavior). Using this model, you will probably get good results for users in the two small (extreme) groups; however, for the majority of users, you cannot expect good predictions.
I want to try and learn Deep Learning with Python.
The first thing that came to my mind for a useful scenario would be a Duplicate-Check.
Let's say you have a customer table with name, address, tel, and email, and want to insert new customers.
E.g.:
In Table:
Max Test, Teststreet 5, 00642 / 58458, info@max.de
To Insert:
Max Test, NULL, (+49)0064258458, test@max.de
This should be recognised as a duplicate entry.
Are there already tutorials out there for this usecase? Or is it even possible with deep learning?
Duplicate matching is a special case of similarity matching. You can define the input features as either individual characters or whole fields and then train your network. It's a binary classification problem (true/false), unless you want a similarity score (e.g. 95% match). The network should be able to learn that punctuation and whitespace are irrelevant, and to learn an 'or function' that produces a true positive when at least one of the fields matches.
Sounds like a fairly simple case for deep learning.
I don't know of any specific tutorial for this, but I tried to give you some keywords to look for.
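As a very rough sketch of the idea (using difflib field similarities and a scikit-learn logistic regression as a stand-in for a network; the example pairs are made up):

    from difflib import SequenceMatcher
    from sklearn.linear_model import LogisticRegression

    def field_similarity(a, b):
        # character-level similarity in [0, 1], ignoring case and whitespace
        a = (a or '').lower().replace(' ', '')
        b = (b or '').lower().replace(' ', '')
        return SequenceMatcher(None, a, b).ratio()

    def pair_features(rec1, rec2):
        # one similarity feature per field: name, address, tel, email
        return [field_similarity(rec1[i], rec2[i]) for i in range(4)]

    # hypothetical labelled pairs: 1 = duplicate, 0 = distinct
    pairs = [
        (('Max Test', 'Teststreet 5', '00642 / 58458', 'info@max.de'),
         ('Max Test', None, '(+49)0064258458', 'test@max.de'), 1),
        (('Max Test', 'Teststreet 5', '00642 / 58458', 'info@max.de'),
         ('Anna Muster', 'Hauptstr. 1', '030 / 1234', 'anna@muster.de'), 0),
    ]
    X = [pair_features(r1, r2) for r1, r2, _ in pairs]
    y = [label for _, _, label in pairs]

    clf = LogisticRegression().fit(X, y)
    new_pair = (('Max Test', None, '0064258458', 'info@max.de'),
                ('Max Test', 'Teststreet 5', '00642 / 58458', 'info@max.de'))
    print(clf.predict([pair_features(*new_pair)]))  # 1 would mean duplicate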
You can use duplicates = dataset.duplicated()
It will return a boolean Series that marks each row which is a duplicate of an earlier row.
Then:
print(sum(duplicates))
to print the count of duplicated rows.
In your case, finding duplicates for numeric and categorical data should be simpler; the problem arises when it is free text. I think you should try out fuzzy matching techniques to start with. There is a good distance metric available in Python called the Levenshtein distance; the library for calculating it is python-Levenshtein, and it is pretty fast. See if you get good results using this distance metric. If you want to improve further, you can go for deep learning algorithms like RNNs, LSTMs, etc., which are good for text data.
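A tiny sketch with python-Levenshtein (the records and the threshold are made up):

    import Levenshtein

    a = 'Max Test, Teststreet 5, 00642 / 58458, info@max.de'
    b = 'Max Test, , (+49)0064258458, test@max.de'

    # raw edit distance and a normalized similarity ratio in [0, 1]
    print(Levenshtein.distance(a, b))
    print(Levenshtein.ratio(a, b))

    # a possible rule: flag record pairs above some similarity threshold for review
    if Levenshtein.ratio(a, b) > 0.7:
        print('possible duplicate, send for review')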
The problem of finding duplicate instances in a relational database is a traditional research topic in databases and data mining, called "entity matching" or "entity resolution". Deep learning has also been applied in this domain.
Many related works can be found on Google Scholar by searching for "entity matching" + "deep learning".
I think that it's easier to build some functions that can check different input schemes than to train a network to do so. The hard part would be building a large enough data set to train your network correctly.
I have a list of questions in a text file, extracted from an online website. I am new to nltk (in Python) and am going through the initial chapters from ( http://shop.oreilly.com/product/9780596516499.do ). Can anybody please help me categorize my questions under different headings?
I don't know the headings of the questions in advance. So, how do I create headings and then categorize the questions under them?
Your task consists of document clustering, where each question is a document, and cluster labeling, where the label designates the topic.
Note that if your questions are short and/or hard to separate, e.g. they belong to similar categories, then the quality will not be very high.
Take a look at a simple recipe for document clustering, and at the related questions first and second.
As a baseline for labels, try the words with the highest tf-idf scores among each cluster's words or from the cluster centroids.
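A small sketch of that baseline with scikit-learn (the example questions and the number of clusters are made up, and get_feature_names_out needs a reasonably recent version):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    questions = [
        'How do I install Python on Windows?',
        'What is the best way to install packages with pip?',
        'How can I train a neural network for image classification?',
        'Which loss function should I use for classification?',
    ]

    vec = TfidfVectorizer(stop_words='english')
    X = vec.fit_transform(questions)

    k = 2  # number of topics, chosen by hand here
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)

    # label each cluster with the highest tf-idf terms of its centroid
    terms = np.array(vec.get_feature_names_out())
    for i, centroid in enumerate(km.cluster_centers_):
        top = terms[np.argsort(centroid)[::-1][:3]]
        print('cluster', i, 'label:', ', '.join(top))
        for q, c in zip(questions, km.labels_):
            if c == i:
                print('   ', q)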