I'm trying to write a free-text search algorithm for finding specific posts on a wall (a similar kind of wall to Facebook's). A user is supposed to be able to type some words into a search field and get hits on posts that contain those words, with the best match on top and the other posts in decreasing order of match score.
I'm using the Levenshtein edit distance e(x, y) = e to score each post word y against each query word x according to: score(x, y) = 2^(2 - e) * (1 - min(e, |x|) / |x|), where |x| is the number of letters in the query word.
Each word in a post contributes to the total score for that specific post. This approach seems to work well when the posts are of roughly the same size, but sometimes large posts manage to rack up a high score purely by containing a lot of words, while in practice not being relevant to the query.
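For reference, in Python the scoring looks roughly like this (a minimal sketch; summing over all query-word/post-word pairs is my reading of "each word contributes", and the helper names are illustrative):

    def levenshtein(a, b):
        # Standard dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def word_score(x, y):
        # score(x, y) = 2^(2 - e) * (1 - min(e, |x|) / |x|)
        e = levenshtein(x, y)
        return 2 ** (2 - e) * (1 - min(e, len(x)) / len(x))

    def post_score(query_words, post_words):
        # Every word in the post contributes to the post's total score.
        return sum(word_score(x, y) for x in query_words for y in post_words)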
Am I approaching this problem in the wrong way or is there some way to normalize the score that I haven't thought of?
Yes. There are many normalization methods you could use. This is a well-researched field!
Take a look at the vector space model. TF-IDF could be relevant to what you're doing. It's not strictly related to the method you're using, but it could give you some normalization leads.
Also note that comparing the query against every post will be O(N) and could get very slow. Instead of string distance, you may get better results with stemming. You can then put the stemmed terms into a VSM inverted index.
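As a rough sketch of the normalization idea (not your exact scoring): TF-IDF vectors compared with cosine similarity are length-normalized, so a long post can't win just by having many words. Assuming scikit-learn and a list of post strings (the data here is a placeholder):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    posts = ["first post text", "second post text"]  # your posts go here
    vectorizer = TfidfVectorizer(stop_words="english")
    post_vectors = vectorizer.fit_transform(posts)

    def search(query, top_n=10):
        # Cosine similarity divides by vector length, so post length is normalized away.
        scores = cosine_similarity(vectorizer.transform([query]), post_vectors).ravel()
        return scores.argsort()[::-1][:top_n]  # indices of the best-matching posts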
Many databases (including MySQL and Postgres) have full-text search. That's probably more practical than doing it yourself.
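For instance, in Postgres that could look roughly like this (a sketch assuming a posts table with a body column, queried via psycopg2; names and connection string are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
    cur = conn.cursor()
    query = "search words"
    cur.execute(
        """
        SELECT id, body,
               ts_rank(to_tsvector('english', body),
                       plainto_tsquery('english', %s)) AS score
        FROM posts
        WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
        ORDER BY score DESC
        LIMIT 10
        """,
        (query, query),
    )
    rows = cur.fetchall()  # best-matching posts first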
I am relatively new to the field of NLP/text processing. I would like to know how to identify domain-related important keywords from a given text.
For example, if I have to build a Q&A chatbot to be used in the banking domain, a question would look like: What is the maturity date for TRADE:12345 ?
From the Q, I would like to extract the keywords: maturity date & TRADE:12345.
From the extracted information, I would frame a SQL-like query, search the DB, retrieve the SQL output and provide the response back to the user.
Any help would be appreciated.
Thanks in advance.
So, this is where the work comes in.
Normally people start with a stop-word list. There are several; choose wisely. More than likely you'll experiment and/or start from a base list and then add more words to it.
Depending on the list, it will take out: "what", "is", "the", "for", "?"
Since this is a pretty easy example, most lists will do that. But notice that what's happening is the opposite of what you asked for: you wanted the domain-specific words extracted, but what the stop-word list actually does is remove all the other cruft around them.
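A minimal sketch with NLTK's stop-word list (assuming nltk is installed and the stopwords corpus has been downloaded once with nltk.download("stopwords")):

    import string
    from nltk.corpus import stopwords

    stop = set(stopwords.words("english"))
    question = "What is the maturity date for TRADE:12345 ?"

    # Keep tokens that are neither stop words nor bare punctuation.
    keywords = [t for t in question.split()
                if t.lower() not in stop and t not in string.punctuation]
    print(keywords)  # roughly: ['maturity', 'date', 'TRADE:12345']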
From here it will depend on what you use. NLTK or Spacy are common choices. Regardless of which you pick, get a real understanding of the concepts or they can bite you (like pretty much anything in Data Science).
Expect to start thinking in terms of linguistic patterns. So, in your example:
What is the maturity date for TRADE:12345 ?
'What' is an interrogative, 'the' is a definite article, 'for' starts a prepositional phrase.
There may be other clues, such as the ':' or the fact that TRADE is in all caps. But those conventions might not always hold.
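For example, with spaCy (assuming the en_core_web_sm model is installed) you can inspect the part-of-speech tags and keep the content words:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("What is the maturity date for TRADE:12345 ?")

    # Inspect the tag spaCy assigns to each token.
    for token in doc:
        print(token.text, token.pos_)

    # One crude heuristic: keep nouns and proper nouns as candidate keywords.
    keywords = [t.text for t in doc if t.pos_ in ("NOUN", "PROPN")]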
That should get you started but you might look at some of the other StackExchange sites for deeper expertise.
Finally, you want to break a question like this into more than one question (assuming that you've done the research and determined the question hasn't already been asked -- repeatedly). NLTK and NLP are still reasonably new ground, but SQL queries are usually just a Google search away.
Having used Spacy to find similarity across a few texts, I'm now trying to find similar texts among millions of entries (instantaneously).
I have an app with millions of texts and I'd like to present the user with similar texts if they ask to.
How do sites like StackOverflow find similar questions so fast?
I can imagine 2 approaches:
Each time a text is inserted, it is compared against the entire DB and a link is created between similar texts (in an intermediate table with both foreign keys)
Each time a text is inserted, its vector is stored in a field associated with that text. Whenever a user asks for similar texts, the app searches the DB for similar vectors.
My doubt is with the second option: is storing the word vector enough to search quickly for similar texts?
Comparing all the texts every time a new request comes in is infeasible.
To be really fast on large datasets I can recommend Locality-Sensitive Hashing (LSH). It gives you entries that are similar with high probability, and it significantly reduces the complexity of your algorithm.
However, you do have to build the index once - that may take time - but after that lookups are very fast.
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
Here is a tutorial that seems close to your application:
https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/
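As a minimal sketch of LSH with the datasketch library (MinHash over word sets, so "similar" here means high Jaccard similarity; the document collection below is a placeholder):

    from datasketch import MinHash, MinHashLSH

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in set(text.lower().split()):
            m.update(token.encode("utf8"))
        return m

    documents = {"doc1": "first text ...", "doc2": "second text ..."}  # placeholder

    # Build the index once; inserts and queries are then fast.
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    for doc_id, text in documents.items():
        lsh.insert(doc_id, minhash(text))

    # Candidate documents that are similar to a new text, with high probability.
    candidates = lsh.query(minhash("some new text ..."))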
You want a function that can map quickly from a text, into a multi-dimensional space. Your collection of documents should be indexed with respect to that space such that you can quickly find the shortest-distance match between your text, and those in the space.
Algorithms exist that will speed up that indexing process - but it could be as simple as sub-indexing the space into shards or blocks on a less granular basis and narrowing down the search that way.
One simple way of defining such a space is term frequency (TF) or term frequency-inverse document frequency (TF-IDF). Without a limit on your vocabulary size these can suffer from space/accuracy issues, but with a vocabulary of, say, the 100 most specific words in a corpus you should be able to get a reasonable indication of similarity that would scale to millions of results. It depends on your corpus.
There are plenty of alternative features you might consider - but all of them will resolve to having a reliable method of transforming your document into a geometric vector, which you can then interrogate for similarity.
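A sketch of that idea with scikit-learn, assuming a list of document strings: index TF-IDF vectors with a nearest-neighbour structure and query it with the new text.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    corpus = ["document one ...", "document two ..."]  # placeholder documents

    # Limit the vocabulary, as discussed above, to keep the space small.
    vectorizer = TfidfVectorizer(max_features=100, stop_words="english")
    X = vectorizer.fit_transform(corpus)

    # n_neighbors=2 only because the placeholder corpus is tiny; raise it on real data.
    index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
    distances, ids = index.kneighbors(vectorizer.transform(["some new text"]))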
I need my score to take into account only how close together the terms (in a multi-term search) are. It looks like when implementing your own weighting function (the docs), you only get access to one term of the search at a time, so you cannot look at the distance between two terms.
The best solution I've found is to index each sentence alone. This isn't ideal, since it'll allow no highly scored exceptions to come through.
Sorry for the lousy title, but let me explain the problem I'm having. I'm currently working on a project, and part of it is a search engine for addresses, which I have in Elasticsearch. What I'm trying to do is run fuzzy_like_this_field queries whenever a new character is entered in my search bar, to generate autocomplete results and try to "guess" which of the (~1 million) addresses the user is typing.
My issue is that I currently have a size limit on my query, as returning all of the results was both unnecessary and expensive, time-wise. But I often don't get the "correct" result unless I return 1000 or more results from the query. For example, if I enter "100 broad" while searching for "100 broadway" and only return 200 results (about the max I can do without it taking too long), "100 broadway" is nowhere to be found, even though all of the returned results have a higher Levenshtein distance than the result I want. I get "100 broadway" as the first result if I return 2000 results from my query, but that takes too long. I can't even filter the returned results to bring the correct one to the top, because it isn't being returned at all.
Shouldn't putting a size limit of N on the query return the best N results, rather than a seemingly random subset of them?
Sorry if this is poorly worded or too vague.
I think you might have some misapprehensions about the fuzzy_like_this query.
From the docs: "Fuzzifies ALL terms provided as strings and then picks the best n differentiating terms... For each source term the fuzzy variants are held in a BooleanQuery with no coord factor..."
If you just want a fuzzy search based on Levenshtein distance, use a fuzzy query.
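For example (a sketch via the Python client; the index and field names are placeholders, and the exact client syntax varies by Elasticsearch version):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    resp = es.search(
        index="addresses",              # placeholder index name
        body={
            "query": {
                "fuzzy": {
                    "address": {        # placeholder field name
                        "value": "100 broad",
                        "fuzziness": 2  # maximum Levenshtein edit distance
                    }
                }
            },
            "size": 20
        },
    )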
You can write a custom analyzer using the edge ngram tokenizer, which would help you achieve what you're looking for. Here is a technique demonstrated by Elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html
Then doing a simple query such as
{
  "query": {
    "match": {
      "address": "100 Broadway"
    }
  }
}
would do the job. You might also consider using a different analyzer for the search, which is also shown in the tutorial (at the end). This lets you tokenize and preprocess the search query in a manner different from the index-time analysis.
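Roughly, the index-time setup from that guide looks like this via the Python client (index and field names are placeholders, and the exact mapping syntax depends on your Elasticsearch version):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(
        index="addresses",
        body={
            "settings": {
                "analysis": {
                    "filter": {
                        "autocomplete_filter": {
                            "type": "edge_ngram",  # per token: 'b', 'br', 'bro', ...
                            "min_gram": 1,
                            "max_gram": 20
                        }
                    },
                    "analyzer": {
                        "autocomplete": {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": ["lowercase", "autocomplete_filter"]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "address": {
                        "type": "text",
                        "analyzer": "autocomplete",    # edge ngrams at index time
                        "search_analyzer": "standard"  # plain tokens at search time
                    }
                }
            }
        },
    )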
I am a newbie in Python and have been trying my hand at different problems that introduce me to different modules and functionalities (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups that use Facebook as a medium to broadcast information.
I want to make groups out of these posts that are content-wise the same.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar, as they both ask people to go to the group's website and register.
P.S.: Just a clarification: if either of the links had been abc.com, the posts wouldn't have been similar.
Priority goes to the source and then to the action (the action here being registering).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is, can Python make this job easier for me somehow? Some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of the URLs contained within it, I would do it like this:
from collections import defaultdict

# Group posts by the URLs they contain; every post that mentions a given
# URL ends up in the same bucket.
groups = defaultdict(list)
for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
That greatly depends on your definition of "content-wise same". A straightforward approach is to use a so-called Term Frequency - Inverse Document Frequency (TF-IDF) model.
Simply put: make a long list of all words in all your posts, filter out stop words (articles, determiners etc.), and for each document (= post) count how often each term occurs, multiplying that count by the importance of the term (its inverse document frequency, calculated as the log of the ratio of the total number of documents to the number of documents containing the term). This way, words which are very rare become more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will be significantly high, so similar documents might be the ones where the same term achieves the highest score (i.e. the highest component of the document vectors is the same), or maybe the Euclidean distance between the three highest values is below some parameter. That sounds very complicated, but (of course) there's a module for that.
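For instance, scikit-learn's TfidfVectorizer builds exactly that sparse term table, and you can hand the vectors to a clustering algorithm. A sketch (the posts list is a placeholder and the distance threshold is something you'd tune):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN

    posts = ["xyz.com is selling free domains. Go register at xyz.com",
             "Everyone needs to register again at xyz.com."]  # your posts go here

    # Sparse TF-IDF vectors, with stop words removed.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)

    # Group posts whose vectors are within a cosine distance of 0.5 of each other;
    # posts labelled -1 end up in no group. Tune eps on your data.
    labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(vectors)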