Python usage of breadth-first search on a social graph

I've been reading a lot of Stack Overflow questions about how to use breadth-first search, DFS, A*, etc. My question is: what is the optimal usage, and how do you implement it on real rather than simulated graphs? E.g.
Consider you have a social graph of Twitter/Facebook/Some social networking site, to me it seems a search algorithm would work as follows:
Say user A has 10 friends, one of those friends has 2 friends and another has 3. The search would first figure out who user A's friends are, then it would have to look up the friends of each of those ten users. To me this seems like BFS?
However, I'm not sure if that's the way to go about implementing the algorithm.
Thanks,

For my two cents, if you're just trying to traverse the whole graph it doesn't matter a whole lot what algorithm you use so long as it only hits each node once. This seems to be what you're saying when you note:
I'm just trying to traverse the whole graph
This means your terminology is technically flawed: you're talking about walking a graph, not searching one. Unless you're actually trying to search for something in particular, which you don't seem to mention in the question at all.
With that said, Facebook and Twitter are very different graph structures that do have an impact on how you walk them:
Facebook is fundamentally an undirected graph. If X is friends with Y, Y MUST be friends with X. (Or in a relationship with, or related to, etc).
Twitter is fundamentally a directed graph. If X follows Y, Y does not have to follow X.
These issues will significantly impact the graph walking algorithm. To be quite honest, if you just want to visit all the nodes, do you even need a graph? Why not just iterate over all of them? If you have all the nodes in some iterable data structure MY_DATA, you could just write a generator like this:
def nodeGenerator(MY_DATA):
    for node in MY_DATA:
        yield node
Clearly, you'd need to adjust the nodeGenerator internals to handle how you're actually accessing the nodes. With that said, most graph structures implement a node iterator. Then you can just create an iterator anytime you want to do things via:
for node in nodeGenerator(MY_DATA):
    # (Do something here)
Maybe I'm just missing the point of the question here, but at present you've posed a question about search algorithms without a search problem. Due to the No Free Lunch nature of optimization and search, the worth of any search algorithm will be entirely dependent on the search problem you're trying to examine.
This is true even among the same data set. After all, if you were searching for everybody whose name starts with the letter D, a great approach would be to just sort everyone alphabetically and do a binary search. If instead you're trying to find everyone's degree of separation from Kevin Bacon, you're going to want an algorithm that starts with Mr. Bacon and recursively iterates over everyone who knows him and everyone who they know. These are both things you COULD do on Facebook or Twitter, but without any specifics there's really no way to recommend an algorithm. Hence, if you know nothing, just iterate over everyone as a list. It's just as good as anything else. If you then want to optimize, cache any calculations.
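For the degree-of-separation case specifically, that "start with Mr. Bacon and expand outward" description is exactly a breadth-first search. A minimal sketch, assuming the graph is already available as a plain dict mapping each person to a set of friends (the structure and variable names here are illustrative assumptions, not anything a real social site exposes):

from collections import deque

def degrees_of_separation(graph, start):
    # BFS from `start`; the first time we reach a person gives the shortest distance.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    return distance

On an undirected graph like Facebook's this visits each friendship once in each direction; on a directed graph like Twitter's the same code works but only follows the "follows" direction.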

I have around 300 friends on Facebook and my friends also have around 300 friends each on average. If you build a graph out of it, it's going to be huge. Correct me if I am wrong. Wouldn't a BFS be quite demanding in this scenario?
Thanks
J

Related

What is Grover's algorithm, in simple terms?

I'm new to quantum computing. I mean extremely new. I saw that there is some kind of algorithm called Grover's search algorithm. I have read that it searches through a database containing N elements in order to find a specific element. I also read that standard computers would take many, many years to do this while quantum computers would manage it in just a few seconds. And that is what confuses me the most. Here is how I understand it:
Let's say we want to search a database containing 50,000 different names and we are looking for the name "Jack". A standard computer wouldn't take years to do that, right? I think it's a matter of seconds or minutes, as searching through a database of names, which is probably text, won't take long...
Example in Python:
names = ["Mark", "Bob", "Katty", "Susan", "Jack"]
for i in range(len(names)):
    if names[i] == "Jack":
        print("It's Jack!")
    else:
        print("It's not Jack :(")
That's how I understand it. So let's imagine this list contains 50,000 names and we want to search for "Jack". I guess it wouldn't take long.
So how does this Grover's algorithm work? I really can't figure it out.
Grover's search is indeed not a good replacement for classical database lookup methods. (Note that classical databases have indices in them that will speed up the lookup way beyond your implementation.) You can see this paper for a discussion of practical applications of Grover search.
It is more correct to think about the oracle as a tool to recognize the answer, not to find it. For example, if you're looking to solve a SAT problem, the oracle circuit will encode the Boolean formula for a specific instance of a problem you're trying to solve.
If you were to use Grover's algorithm for database search, the oracle would have to encode not only the condition you're searching for but also whether the element is actually present in the database. For example, if you're looking for a name starting with A, the oracle needs to recognize all strings starting with A, but it also needs to recognize which of those strings are present in the database - otherwise the algorithm will yield a random string starting with A, which is probably not what you were looking for.
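To make the "recognize, not find" distinction concrete, here is a purely classical toy sketch (my own illustration, not quantum code): the oracle below can only check a candidate assignment for a small, made-up Boolean formula. A classical search still has to feed it every candidate, whereas Grover's algorithm needs only about sqrt(N) oracle queries.

def oracle(x1, x2, x3):
    # Recognizes solutions of f = (x1 OR x2) AND (NOT x2 OR x3); it cannot produce one.
    return (x1 or x2) and ((not x2) or x3)

candidates = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
solutions = [x for x in candidates if oracle(*x)]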
Grover's algorithm has practical application when generalized to amplitude amplification, which shows up as a component of many other quantum algorithms. Amplitude amplification is a way of improving the success likelihood of a probabilistic quantum algorithm.

My algorithm isn't correct. Why not?

I am trying to solve the following problem:
The group of people consists of N members. Every member has one or more friends in the group. You are to write a program that divides this group into two teams. Every member of each team must have friends on the other team.
Input:
The first line of input contains the only number N (N ≤ 100). Members are numbered from 1 to N. The second, the third, ..., and the (N+1)th lines contain the lists of friends of the first, the second, ..., and the Nth member respectively. Each list is terminated by a zero. Remember that friendship is always mutual in this group.
Output:
The first line of output should contain the number of people in the first team, or zero if it is impossible to divide the people into two teams. If a solution exists you should write the list of the first team on the second line of output, with the numbers separated by single spaces. If there is more than one solution you may output any of them.
My algorithm looks like this:
create a dictionary where each player maps to a list of friends
team1 = ['1']
team2 = []
left = []
for player in dictionary:
    if a friend of player is in team1:
        add player to team2
    elif a friend of player is in team2:
        add player to team1
    else:
        add player to left
But still it isn't correct. There may be cycles in the dictionary, where the only friend of 6 is 7 and the only friend of 7 is 6. What should I do in such a case? I do not know how long such a cycle may be. Since I have a while loop around my code, I currently run into an infinite loop. I am also trying to add players from left to the teams, but it's not working since they have cycles among themselves. I don't know how to solve this problem.
Thanks.
Since this is a competition problem and it's clear you want to learn from it, I'll be a little sparse on details and explain more about how I thought about the problem.
First, consider a connected friendship component and pick any vertex. Since the friendship relation is symmetric, it's easy to see that picking an edge and putting its two endpoints on different teams "solves" both vertices. This seems to suggest something like finding a perfect matching.
However, requiring a perfect matching is too strong: the complete graph on three vertices has no perfect matching, yet it can be solved. Thinking about it a little more, a Hamiltonian path seems sufficient, because you can just alternate teams along it.
If you consider a sufficiently large tree, though, it should be clear that there's no Hamiltonian path, yet the obvious split of teams by even or odd depth produces the right result. So the answer seems to be that if you can find a spanning tree, that tree can be used to split the teams in two.
This can be repeated for each component, and just from playing around with graphs it should be convincing enough for a competition: every component has a spanning tree, so there's nowhere left to expand to. I'm not sure what a graph with no possible assignment would look like. Maybe an isolated node is what's considered invalid?
Update: I found an even simpler solution. The original answer is at the bottom. This one is cleaner and comes with a proof ;)
We will be building the solution incrementally. The initial state is that all the people are unallocated, and both teams are empty. We will extend the solution using one of two actions below. After each step, the division will be legal, meaning every allocated person will have a friend allocated to the other team.
Action 1: pick any two unallocated guys that are friends. Put one of them in Team A, the other in Team B. The invariant holds, because newly allocated people know each other and are on separate teams.
Action 2: pick any unallocated guy who has an allocated friend, and place him on the team opposite that friend. The invariant holds, because the newly allocated person is placed exactly so as to satisfy it.
So at every step you pick any doable action and execute it. Repeat until there are no more possible actions. When does that happen? It would mean that none of the unallocated people has any friends at all. Since we assumed that everyone has at least one friend, you will be able to keep executing actions until there is nobody left.
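A minimal Python sketch of this two-action scheme, assuming the input has already been parsed into a dict friends mapping each member to a set of friends (mutual, as the problem statement guarantees; the function and variable names are mine):

def split_into_teams(friends):
    team = {}                       # person -> 'A' or 'B'
    unallocated = set(friends)
    while unallocated:
        progressed = False
        # Action 2: allocate everyone who already has an allocated friend.
        for p in list(unallocated):
            ally = next((f for f in friends[p] if f in team), None)
            if ally is not None:
                team[p] = 'B' if team[ally] == 'A' else 'A'
                unallocated.discard(p)
                progressed = True
        # Action 1: seed the teams with a pair of unallocated friends.
        if not progressed:
            for p in list(unallocated):
                q = next((f for f in friends[p] if f in unallocated), None)
                if q is not None:
                    team[p], team[q] = 'A', 'B'
                    unallocated -= {p, q}
                    progressed = True
                    break
        if not progressed:          # someone is left with no friends at all
            return None
    return team

The people mapped to 'A' would then form the first team in the output format described above.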
Original answer:
The problem seems complicated at first, but in fact does not require rocket science. The constraint on the division is rather loose - everyone needs just one friend on the other team.
Consider a simpler case first. Let's say you are given two teams of people and one extra player who got to the party late and needs to be allocated to one of the two existing teams. If he has no friends at all, that's impossible. But if he does have any friends, you pick one of them and allocate the newcomer to the other team.
The outcome? If you can start with some small teams and then arrange the rest of the people in such a way that they always know someone who came before, you're golden. This means we have reduced the initial big problem to two smaller ones.
Tackling the first one is easy. In order to bootstrap the teams, just pick any two guys that know each other, put one in Team A, the other in Team B, and it works.
Now, the second: adding the rest of the people. Look at all the people already allocated to teams and see if they have any unallocated friends. Case 1: one of the already allocated guys has an unallocated friend. You can easily add that friend using the rule above. Case 2: all the friends of the allocated guys are already allocated too. This means the initial friendship graph was not connected, which doesn't hurt at all - just take any random unallocated guy and place him anywhere; his friends will later be placed on the opposite team.

group detection in large data sets python

I am a newbie in Python and have been trying my hand at different problems which introduce me to different modules and functionalities (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups on Facebook that use it as a medium to broadcast knowledge.
I want to group these posts together when they are content-wise the same.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar, as they both ask people to go to the group's website and register.
P.S.: Just a clarification: if either of the links had been abc.com instead, they wouldn't have been similar.
Priority goes to the source and then to the action (the action here being registering).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is whether Python can make this job easier for me somehow, with some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of urls contained within, I would do it like this:
from collections import defaultdict

groups = defaultdict(list)
for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
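The snippet above assumes a geturls helper already exists; a rough sketch of one, using a simple regex to pull domains out of a post (real URL extraction is messier, so treat this as illustrative only):

import re

URL_RE = re.compile(r'(?:https?://)?(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)', re.IGNORECASE)

def geturls(text):
    # Return the domains mentioned in a post, lower-cased for grouping.
    return [match.lower() for match in URL_RE.findall(text)]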
That greatly depends on your definition of being "content-wise same". A straightforward approach is to use a so-called term frequency - inverse document frequency (TF-IDF) model.
Simply put: make a long list of all the words in all your posts, filter out stop words (articles, determiners, etc.), and for each document (= post) count how often each term occurs, multiplying that count by the importance of the term (its inverse document frequency, calculated as the log of the number of documents divided by the number of documents in which the term occurs). This way, words which are very rare will be more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will score significantly, so similar documents might be the ones where the same term achieves the highest score (i.e. the highest component of the document vectors is the same), or maybe ones where the Euclidean distance between the three highest values is below some threshold. That sounds very complicated, but (of course) there's a module for that.
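The answer doesn't name the module, but one likely candidate is scikit-learn, which bundles both the TF-IDF step and the comparison; a minimal sketch, assuming the posts are plain strings (the two example posts are the ones from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "xyz.com is selling free domains. Go register at xyz.com",
    "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(posts)    # sparse document-term matrix
similarity = cosine_similarity(tfidf)      # pairwise similarity between posts
print(similarity)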

Classify words as "good" or "bad"

I have a list of domain names and want to determine whether a domain's name looks like that of a porn site or not. What is the best way to do this? A list of porn domains is at http://dumpz.org/56957/ . These domains can be used to teach the system what porn domains look like. I also have another list - http://dumpz.org/56960/ - many domains on this list are also porn, and I want to identify them by name.
Use a Bayesian filter, e.g. SpamBayes or Divmod's Reverend. You train it with the list you have, and it can score how likely a given domain is to be porn.
For a short overview look at this article.
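SpamBayes and Reverend are the libraries this answer names; as a rough sketch of the same idea with a different library (a substitution on my part), a naive Bayes classifier over character n-grams of the domain name could look like this, with made-up training domains standing in for the real lists:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data; the real lists would come from the URLs above.
train_domains = ["bad-example-xxx.com", "innocent-example.org", "another-bad-one.net", "harmless-site.com"]
train_labels = ["porn", "ok", "porn", "ok"]

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    MultinomialNB(),
)
model.fit(train_domains, train_labels)
print(model.predict(["some-new-domain.com"]))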
You can't rely on the domain name for that, there are far too many porn domains with decent names and few others with porn-like names but with safe content.
It might depend on what your goals are. I'm guessing that you are mostly interested in minimizing false negatives (accidentally calling a domain a good domain if it isn't). This might be true if, for example, you want all porn links in a forum to be reviewed for spam before being posted. If some non-porn links get flagged for review, it's OK.
In this case, you could probably do something fairly simple. If you could come up with a list of porn'ish words, you could just mark all of the domains that contain any of those words as a substring. This would catch some safe domains though: expertsexchange.com could match "sex" or "sexchange", but "yahoo" wouldn't ever flag positive. Easy to implement, easy to understand, easy to tweak.
Lists of obscene words can be found using your favorite search engine. You could use your list of domains to extract common long substrings across the domains as words as well.
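A minimal sketch of that substring check, with a stand-in word list (the real one would come from the sources just mentioned):

FLAG_WORDS = ["sex", "xxx", "porn"]   # illustrative, not a complete list

def looks_suspicious(domain):
    domain = domain.lower()
    return any(word in domain for word in FLAG_WORDS)

# As noted above, this happily flags false positives such as expertsexchange.com.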
If you want to really get the answers correct, though, you'll need to see what is on those domains. Site-About-Kitty-Porn.com could be a lolcats domain or illegal porn; it's impossible to know unless you do some crawling. If you fetch the actual content and match it against your word list, you'll do a little better.
You could also try each domain against some third party service, such as a child-safe internet filter, or even trying to test if the domain will appear for safe-search results in your favorite search engine. Of course, make sure you are following each service's TOS and all of that.
As someone already pointed out, you need some kind of classification to achieve what you are trying to do. But the overall accuracy (precision and recall) depends on the training dataset you have. You could use classifiers like SVMs, decision trees, etc. for this purpose.
I would suggest going for a semi-supervised approach where you cluster the different URLs and check a few representative URLs from each cluster to see whether they are porn or not. The benefit is that you don't need any training, and you can find porn URLs that your training dataset probably doesn't cover. Common clustering techniques are k-means, hierarchical clustering, DBSCAN, etc.
This will still not cover porn sites that do not have porn-like URLs. For those you have to fetch the page and do similar training/clustering on the content of the webpage(s).
Do you mean something like this?
scala> val pornList = List("porn1.com","porn2.com","porn3.com")
pornList: List[java.lang.String] = List(porn1.com, porn2.com, porn3.com)
scala> val sites = List("porn1.com","site1.com","porn3.com","site2.com","site3.com")
sites: List[java.lang.String] = List(porn1.com, site1.com, porn3.com, site2.com, site3.com)
scala> val result = sites filterNot { pornList contains _ }
result: List[java.lang.String] = List(site1.com, site2.com, site3.com)
Check out this blog post on classifying webpages by topic. Start with the list of bad sites as your positive examples and use any heuristic for finding good sites (a basic web crawler seeded with some innocent Google searches) as negative examples. The post walks you through the process of extracting content from the pages and touches on Weka and how you might apply some of its basic learners.
Note that you may want to add additional data to your training set that is specific to the domain of your problem instead of just using page contents. For example the number of pictures or size of pictures on the page may be a factor that you may want to consider.

Adjustment to a shortest path algorithm

For a data structures & algorithms class in college we have to implement an algorithm presented in a paper. The paper can be found here.
So I fully implemented the algorithm, with some errors still left (but that's not really why I'm asking this question; if you want to see how I implemented it thus far, you can find it here).
The real reason why I'm asking a question on Stack Overflow is the second part of the assignment: we have to try to make the algorithm better. I had a few ways in mind, but they all sound good in theory and won't really do well in practice:
Draw a line between the source and end node, find the node closest to the middle of that line and divide the "path" in two recursively. The base case would be a graph small enough that a single Dijkstra run can do the computation. This isn't really an adjustment to the current algorithm, and with some thought it's clear this wouldn't give an optimal solution.
Try to give the algorithm some sense of direction by giving a higher priority to edges that point towards the end node. This also won't be optimal...
So now I'm all out of ideas and hoping that someone here could give me a little hint towards a possible adjustment. It doesn't really have to improve the algorithm; I think the main reason they asked us to do this is so we don't just implement the algorithm from the paper without knowing what's behind it.
(If Stackoverflow isn't the right place to ask this question, my apologies :) )
A short description of the algorithm:
The algorithm tries to select which nodes look promising. By promising I mean that they have a good chance of lying on a shortest path. How promising a node is is represented by its 'reach'. The reach of a vertex on a path is the minimum of its distances to the start and to the end of the path. The reach of a vertex in the graph is the maximum of its reaches over all shortest paths.
To decide whether a node is added to the priority queue in Dijkstra's algorithm, a test() function is added. test() returns true if the reach of the vertex in the graph is greater than or equal to the weight of the path from the origin to v at the time v is to be inserted into the priority queue, or if the reach of the vertex in the graph is greater than or equal to the Euclidean distance from v to the end vertex.
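To make that concrete, here is a rough sketch of how I read the test slotting into a standard Dijkstra loop (my own reading of the summary above, not the paper's exact formulation; the reach values are assumed to be precomputed):

import heapq
import math

def reach_pruned_dijkstra(graph, coords, reach, source, target):
    # graph:  dict vertex -> list of (neighbor, weight)
    # coords: dict vertex -> (x, y), used for the Euclidean lower bound
    # reach:  dict vertex -> precomputed reach value
    def euclid(v):
        (x1, y1), (x2, y2) = coords[v], coords[target]
        return math.hypot(x1 - x2, y1 - y2)

    def test(v, dist_so_far):
        # Insert v only if its reach is at least the distance travelled so far,
        # or at least the straight-line distance still remaining to the target.
        return reach[v] >= dist_so_far or reach[v] >= euclid(v)

    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            return d
        if d > dist.get(u, math.inf):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf) and test(v, nd):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return math.inf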
Harm De Weirdt
Your best bet in cases like this is to think like a researcher: research in general, and computer science research in particular, is about incremental improvement. One person shows that they can compute something faster using Dijkstra's algorithm, and then later they, or someone else, show that they can compute the same thing a little faster using A*. It's a series of small steps.
That said, the best place to look for ways to improve an algorithm presented in a paper is in the future directions section. This paper gives you a little bit to work on in that direction, but your gold mine in this case lies in sections 5 and 6. There are multiple places where the authors admit to different possible approaches. Try researching some of these approaches, this should lead you either to a possible improvement in the algorithm or at least an arguable one.
Best of luck!
