group detection in large data sets python


I am new to Python and have been trying my hand at different problems that introduce me to different modules and functionality (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to this problem.
I have a large data set of Facebook posts from various groups that use Facebook as a medium for mass distribution of information.
I want to group these posts so that posts with the same content end up together.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar, as they both ask the reader to go to the group's website and register.
P.S.: Just to clarify, if either of the links had been abc.com, the posts would not have been similar.
Priority goes to the source and then to the action (the action here being registering).
Is there a simple way to do this in Python (a module, maybe)?
I know it requires some sort of clustering algorithm (correct me if I am wrong). My question is: can Python make this job easier for me somehow, with some module or anything else?
Any help is much appreciated!

Assuming you have a function called geturls that takes a string and returns a list of the URLs it contains, I would do it like this:
    from collections import defaultdict

    # Map each URL to the list of posts that mention it.
    groups = defaultdict(list)
    for post in facebook_posts:
        for url in geturls(post):
            groups[url].append(post)
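The geturls function is only assumed above. A minimal sketch of what it might look like follows, using a deliberately simple regular expression that also catches bare domains such as xyz.com. The pattern, the group_by_url helper, and the choice to strip the scheme and a leading "www." are all assumptions made for illustration; real posts will likely need a more careful URL matcher.

```python
import re
from collections import defaultdict

# Hypothetical pattern: optional scheme and "www.", then a domain name
# ending in a top-level domain of at least two letters.
URL_PATTERN = re.compile(
    r"(?:https?://)?(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,})",
    re.IGNORECASE,
)


def geturls(text):
    """Return the distinct domains mentioned in text, lowercased and sorted."""
    return sorted({match.lower() for match in URL_PATTERN.findall(text)})


def group_by_url(posts):
    """Map each domain to the list of posts that mention it."""
    groups = defaultdict(list)
    for post in posts:
        for url in geturls(post):
            groups[url].append(post)
    return groups
```

Because geturls returns each domain once per post, a post that mentions the same site twice is stored only once in that site's group.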

That greatly depends on your definition of "content-wise same". A straightforward approach is to use a so-called Term Frequency - Inverse Document Frequency (TF-IDF) model.
Simply put, make a long list of all words in all your posts, filter out stop words (articles, determiners, etc.), and for each document (= post) count how often each term occurs, then multiply that count by the importance of the term (the inverse document frequency, calculated as the logarithm of the total number of documents divided by the number of documents in which the term occurs). This way, words which are very rare will be more important than common words.
You end up with a huge table in which every document (we're still talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will have significantly high scores, so similar documents might be the ones where the same term achieved the highest score (i.e. the largest component of the document vectors is the same), or the ones where the Euclidean distance between the three highest values is below some threshold. That sounds very complicated, but (of course) there's a module for that.
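To make the weighting described above concrete, here is a small standard-library sketch. It is not the module the answer refers to; the tiny STOP_WORDS set, the tokenizer, and the function names are assumptions chosen for illustration. Each post becomes a sparse dictionary of term weights, and cosine similarity is used here as one possible comparison metric.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"a", "all", "at", "been", "due", "go", "has", "is", "the", "to"}


def tokenize(text):
    """Lowercase, keep dotted tokens such as 'xyz.com', drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one sparse {term: tf * idf} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    total = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log(total / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0
```

A term that occurs in every post gets weight zero, so it contributes nothing to the comparison, while a term repeated within one post and rare elsewhere dominates that post's vector.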

Related

How do you solve long tail problem in a recommendation system?

I want to build a movie recommendation system using binary ratings, that is, whether a person has seen the movie or not. I am using various cosine similarity techniques, but the issue is the long tail
in recommendation systems. I am not able to find any concrete solution that uses just viewed or not viewed (i.e. either 0 or 1) rather than ratings. What other popular algorithms can be used for this? I need to address the long tail issue.
I have used Adaptive Clustering, but it needs many derived variables, and those are not present here.
I have tried other approaches such as Total Clustering, but to no avail.
I have tried Popularity Sensitive Clustering, but ran into the same issue.
I have been stuck on this long tail issue and cannot find a good implementation for my work or a research paper that helps.
Everyone uses either ratings or user data, but my data has no user information and no ratings, just the binary values.

The long tail issue in recommendation systems is basically about how to recommend items that do not have many interactions (ratings, likes, etc.). Similarity-based methods such as cosine similarity, and clustering algorithms, tend to fail at recommending such items, so you need to look into diversity-increasing algorithms.
What I mean is that, rather than calculating similarity, you try calculating dissimilarity.
Here R is the recommendation list and d(i, j) is the dissimilarity between items i and j.
You can use the Surprise library to generate R using matrix factorization algorithms.
Also, when you build a user vs. item matrix where matrix[user_i][item_j] denotes the rating, you can set the entry to 1 where a rating exists and 0 otherwise, and it will still work. These binary ratings are generally called interactions between the user and the item.
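As an illustration of scoring a recommendation list R by dissimilarity d(i, j), the sketch below computes the average pairwise dissimilarity of a list (often called intra-list diversity). Treating this as the measure meant above is an assumption, as is the choice of Jaccard dissimilarity over the sets of users who interacted with each item; the function names and the viewers mapping are hypothetical.

```python
from itertools import combinations


def jaccard_dissimilarity(users_a, users_b):
    """1 - |A & B| / |A | B| for two sets of user ids (0.0 if both are empty)."""
    union = users_a | users_b
    if not union:
        return 0.0
    return 1.0 - len(users_a & users_b) / len(union)


def intra_list_diversity(recommendations, viewers):
    """Average d(i, j) over all unordered item pairs in the list.

    recommendations: list of item ids (the list R)
    viewers: dict mapping item id -> set of users who interacted with it
    """
    pairs = list(combinations(recommendations, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard_dissimilarity(viewers[i], viewers[j]) for i, j in pairs)
    return total / len(pairs)
```

A list made up of items watched by the same audience scores close to 0, while a list that mixes in long-tail items with different audiences scores closer to 1, so this value can be used to compare or re-rank candidate lists.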

How to find text similarity within millions of entries?

Having used spaCy to find similarity across a few texts, I'm now trying to find similar texts among millions of entries (instantaneously).
I have an app with millions of texts, and I'd like to present the user with similar texts on request.
How do sites like StackOverflow find similar questions so fast?
I can imagine 2 approaches:
Each time a text is inserted, it is compared against the entire DB and a link is created between similar texts (in an intermediate table holding both foreign keys).
Each time a text is inserted, its vector is stored in a field associated with that text. Whenever a user asks for similar texts, the app searches the DB for similar vectors.
My doubt is about the second option. Is storing the word vector enough to search quickly for similar texts?

Comparing all the texts every time a new request comes in is infeasible.
To be really fast on large datasets, I can recommend locality-sensitive hashing (LSH). It returns entries that are similar with high probability, and it significantly reduces the complexity of the search.
However, you have to build the index once, which may take time, but after that lookups are very fast.
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
Here is a tutorial that seems close to your application:
https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/
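A rough standard-library sketch of the idea follows, using MinHash signatures with banding, which is one common LSH scheme for set similarity. It is not taken from the linked tutorial; the class name, the word-pair shingles, and the parameter values (32 hashes in 8 bands) are assumptions for illustration, and a production system would more likely rely on a dedicated library.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(items, num_hashes=32):
    """MinHash signature: for each seeded hash, the minimum value over items."""
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items)
        for seed in range(num_hashes)
    ]


class LSHIndex:
    """Buckets documents by bands of their MinHash signature."""

    def __init__(self, num_hashes=32, bands=8):
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.buckets = defaultdict(set)

    def _keys(self, text):
        sig = minhash(shingles(text), self.num_hashes)
        for b in range(self.bands):
            yield (b, tuple(sig[b * self.rows:(b + 1) * self.rows]))

    def add(self, doc_id, text):
        for key in self._keys(text):
            self.buckets[key].add(doc_id)

    def candidates(self, text):
        """Ids of documents sharing at least one band with the query text."""
        found = set()
        for key in self._keys(text):
            found |= self.buckets.get(key, set())
        return found
```

A query only touches the buckets its own bands hash to, so the cost does not grow with a full scan of the collection. The candidates are then typically re-checked with an exact similarity measure, since LSH can return false positives and miss some true matches.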

You want a function that can map quickly from a text, into a multi-dimensional space. Your collection of documents should be indexed with respect to that space such that you can quickly find the shortest-distance match between your text, and those in the space.
Algorithms exist that will speed up that indexing process, but it could be as simple as sub-indexing the space into shards or blocks on a less granular basis and narrowing down the search that way.
One simple way of defining such a space is term frequency (TF) or term frequency-inverse document frequency (TF-IDF). Without a limit on your vocabulary size, these can suffer from space and accuracy issues. Still, with a vocabulary of the 100 most specific words in a corpus, you should be able to get a reasonable indication of similarity that would scale to millions of results. It depends on your corpus.
There are plenty of alternative features you might consider - but all of them will resolve to having a reliable method of transforming your document into a geometric vector, which you can then interrogate for similarity.
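The following sketch shows one way to turn texts into fixed-length vectors over a capped vocabulary and query them by cosine similarity. How "most specific" should be defined is not stated above; picking the terms with the lowest document frequency that still occur in at least two texts is an assumption, as are all function names. The linear scan in nearest stands in for the spatial index a real system would use.

```python
import math
from collections import Counter


def build_vocabulary(texts, size=100):
    """Pick up to `size` terms: rarest first, but shared by at least 2 texts."""
    doc_freq = Counter(w for t in texts for w in set(t.lower().split()))
    shared = [w for w, df in doc_freq.items() if df >= 2]
    shared.sort(key=lambda w: (doc_freq[w], w))
    return shared[:size]


def to_vector(text, vocabulary):
    """Term-frequency vector with one component per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, texts, vocabulary):
    """Return the stored text whose vector is closest to the query's vector."""
    q = to_vector(query, vocabulary)
    return max(texts, key=lambda t: cosine(q, to_vector(t, vocabulary)))
```

In practice the vectors for stored texts would be computed once at insert time and kept alongside each text, so only the query needs to be vectorized per request.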

How to measure similarity between two python code blocks?

Many would want to measure code similarity to catch plagiarism; however, my intention is to cluster a set of Python code blocks (say, answers to the same programming question) into different categories and distinguish the different approaches taken by students.
If you have any idea how this could be achieved, I would appreciate it if you share it here.

You can choose any scheme you like that essentially hashes the contents of the code blocks, and place code blocks with identical hashes into the same category.
Of course, what will turn out to be similar will then depend highly on how you defined the hashing function. For instance, a truly stupid hashing function H(code)==0 will put everything in the same bin.
A hard problem is finding a hashing function that classifies code blocks in a way that seems similar in a natural sense. Despite lots of research, nobody has yet found anything better to judge this than "I'll know if they are similar when I see them."
You surely do not want it to depend on layout, indentation, whitespace, or comments; otherwise slight changes to these will classify blocks differently even if their semantic content is identical.
There are three major schemes people have commonly used to find duplicated (or similar) code:
Metrics-based schemes, which compute the hash from a metric obtained by counting various types of operators and operands. (Note: this uses lexical tokens.) These often operate only at the function level. I know of no practical tools based on this.
Lexically based schemes, which break the input stream into lexemes, convert identifiers and literals into fixed special constants (i.e., treat them as undifferentiated), and then essentially hash N-grams (sequences of N tokens) over these sequences. There are many clone detectors based on essentially this idea; they work tolerably well, but also find stupid matches because nothing forces alignment with program structure boundaries.
(The sequence
return ID; } void ID ( int ID ) {
is an 11-gram which occurs frequently in C-like languages but clearly isn't a useful clone.) The result is that false positives tend to occur, i.e., you get claimed matches where there isn't one.
Abstract syntax tree based matching (hashing over subtrees), which automatically aligns clones to language boundaries by virtue of using the ASTs, which represent the language structures directly. (I'm the author of the original paper on this, and built a commercial product, CloneDR, based on the idea; see my bio.) These tools have the advantage that they can match code that contains sequences of tokens of different lengths in the middle of a match, e.g., one statement (of arbitrary size) is replaced by another.
This paper provides a survey of the various techniques: http://www.cs.usask.ca/~croy/papers/2009/RCK_SCP_Clones.pdf. It shows that AST-based clone detection tools appear to be the most effective at producing clones that people agree are similar blocks of code, which seems key to OP's particular interest; see Table 14.
[There are graph-based schemes that match control and data flow graphs. They should arguably produce even better matches but apparently do not do much better in practice.]
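For Python code specifically, the standard ast module makes a very rough version of the AST-based idea easy to try. The sketch below is not how CloneDR works; it simply hashes a whole normalized tree rather than matching subtrees, and the class and function names are assumptions. Identifiers are replaced by a placeholder and literals by their type name, so layout, comments, and naming do not affect the hash.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Replace names and literal values with fixed placeholders."""

    def visit_Name(self, node):
        node.id = "ID"
        return node

    def visit_arg(self, node):
        node.arg = "ID"
        return node

    def visit_FunctionDef(self, node):
        node.name = "ID"
        self.generic_visit(node)  # keep normalizing the function body
        return node

    def visit_Constant(self, node):
        # Keep only the literal's type, e.g. 0 and 10 both become 'int'.
        node.value = type(node.value).__name__
        return node


def structure_hash(source):
    """Hash of the code's syntax tree, ignoring names, literals, and layout."""
    tree = _Normalizer().visit(ast.parse(source))
    return hashlib.sha1(ast.dump(tree).encode()).hexdigest()
```

Blocks with equal hashes can be placed in the same category. Attribute and class names are left untouched in this sketch, and only exact structural matches are detected, so near-miss clones would need subtree-level comparison on top of this.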

One approach would be to count the number of functions, objects, and keywords, possibly grouped into categories such as branching, creating, and manipulating, as well as the number of variables of each type, without relying on the methods and variables having the same name(s).
For a given problem, similar approaches will tend to come out with similar scores for these. For example, a student who used a decision tree would have a high number of branch statements, while one who used a decision table would have far fewer.
This approach would be much quicker to implement than parsing the code structure and comparing the results.
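A minimal sketch of such counting, using Python's ast module to tally syntax nodes by category, might look as follows. The CATEGORIES mapping and the profile function are hypothetical choices for illustration; the categories worth counting depend on the assignment.

```python
import ast
from collections import Counter

# Hypothetical grouping of syntax node types into coarse categories.
CATEGORIES = {
    "branching": (ast.If, ast.IfExp),
    "looping": (ast.For, ast.While),
    "functions": (ast.FunctionDef, ast.Lambda),
    "calls": (ast.Call,),
}


def profile(source):
    """Count how many syntax nodes of each category the code contains."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        for label, node_types in CATEGORIES.items():
            if isinstance(node, node_types):
                counts[label] += 1
    return counts
```

The resulting counts can be treated as a feature vector per submission and fed to any clustering routine. Note that an elif is parsed as a nested If node, so an if/elif/else chain counts as two branches.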

similarity of genes given gene name, in BioPython

How can I find the similarity of two genes, given the gene names?
By similarity, I think I mean the similarity of the sequences. I am new to this area and was given this work by my professor. I do not know many types of similarity.
Can this be done with Biopython?
Thank you so much.
Update in response:
Thanks, but I tried that.
My main problem is that when I retrieve gene sequences from the database, some results come back as nucleotide sequences and others as protein sequences. I think that if we want to compare them, I need to make sure they are all nucleotide sequences or all protein sequences, right?
Here is the code I use:
from Bio import Entrez
handle = Entrez.efetch(db="nucleotide", id=t, rettype="gb")
record = handle.read()
Then, for some ids, I got a sequence of agtc, and for others I got a sequence like mwvllvffll tltylfwpkt. Those are proteins, right?
I got stuck here and I do not know what to do next.

You should start off by reading through the Biopython Tutorial, which covers all of the basics. Your problem is pretty straightforward (assuming you already know how to program in Python): Read in the gene name or accession ID, retrieve the sequences, align the sequences, then generate summary information (percent identity, percent homology, gap score, etc.). All of these functions are covered in the tutorial and the cookbook. The Biopython API documentation is also very helpful when working with the individual classes and methods.
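To show what the "align the sequences, then compute percent identity" step involves, here is a small pure-Python sketch of global alignment (Needleman-Wunsch) with a simple scoring scheme. It does not use Biopython, whose own alignment tools are the better choice for real sequences; the function names and the scores of +1 for a match and -1 for a mismatch or gap are assumptions for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Share of alignment columns where both sequences have the same residue."""
    x, y = global_align(a, b)
    if not x:
        return 0.0
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)
```

Both inputs must be of the same kind (both nucleotide or both protein) for the result to mean anything. This table-based version uses memory proportional to the product of the two lengths, so it is only suitable for short sequences.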
Good luck!

If you're really into this, you should learn what scores and e-values mean: high scores and low e-values correspond to better similarity.
You must compare sequences of the same type, but if you want to compare nucleotides to proteins anyway, first translate the DNA to protein.
Take a look at the NCBI, ENSEMBL, and EBI websites. They provide almost all the tools you need.
If you have lots of sequences to compare, it is wise to use Biopython, but first work through the cookbook, as MattDMo said. Look around the internet, see how other programmers did it, and try to understand their code.
Good luck

Writing a post search algorithm

I'm trying to write a free-text search algorithm for finding specific posts on a wall (a similar kind of wall to the one Facebook uses). A user is supposed to be able to write some words in a search field and get hits on posts that contain the words, with the best match on top and the other posts in decreasing order of match score.
I'm using the edit distance (Levenshtein) "e(x, y) = e" to calculate the score for each post when comparing the query word "x" and post word "y" according to: score(x, y) = 2^(2 - e)(1 - min(e, |x|) / |x|), where "|x|" is the number of letters in the query word.
Each word in a post contributes to the total score for that specific post. This approach seems to work well when the posts are of roughly the same size, but sometimes certain large posts manage to rack up a high score solely by having a lot of words in them, while in practice not being relevant to the query.
Am I approaching this problem in the wrong way or is there some way to normalize the score that I haven't thought of?

Yes. There are many normalization methods you could use. This is a well-researched field!
Take a look at the vector space model. TF-IDF could be relevant to what you're doing. It's not strictly related to the method you're using, but it could give you some normalization leads.
Also note that comparing against every post is O(N) per query and could get very slow. Instead of string distance, you may get better results with stemming. You can then put the stems into a VSM inverted index.
Many databases (including MySQL and Postgres) have full-text search. That's probably more practical than doing it yourself.
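As one simple example of normalization, the sketch below implements the score from the question and divides the post total by the number of words in the post, so that long posts cannot accumulate score through length alone. Dividing by word count is only one of many possible choices and is an assumption here, as are the function names.

```python
def levenshtein(x, y):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # delete from x
                           cur[j - 1] + 1,              # insert into x
                           prev[j - 1] + (cx != cy)))   # substitute or match
        prev = cur
    return prev[-1]


def word_score(x, y):
    """score(x, y) = 2^(2 - e) * (1 - min(e, |x|) / |x|), as in the question."""
    e = levenshtein(x, y)
    return 2 ** (2 - e) * (1 - min(e, len(x)) / len(x))


def post_score(query, post):
    """Sum of word scores over all query/post word pairs, per post word."""
    words = post.lower().split()
    if not words:
        return 0.0
    total = sum(word_score(q, w) for q in query.lower().split() for w in words)
    return total / len(words)
```

With this normalization, a short post containing the exact query word outranks a long post made up of many near misses, which is the behaviour the question asks for.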

