I have been Googling for some time, but I guess I am using the wrong set of keywords. Does anyone know the URI that lets me request permission from Facebook to crawl their network? Last time I was using Python to do this, someone suggested that I look at it, but I couldn't find that post either.
Amazingly enough, that's given in their robots.txt.
The link you're looking for is this one:
http://www.facebook.com/apps/site_scraping_tos.php
If you're not a huge organization already, don't expect to be explicitly whitelisted there. If you're not explicitly whitelisted, you're not allowed to crawl at all, according to the robots.txt and the TOS. You must use the API instead.
Don't even think about pretending to be one of the whitelisted crawlers. Facebook filters by whitelisted IP for each crawler and anything else that looks at all like crawling gets an instant perma-ban. For a while users who simply clicked too fast could occasionally run into this.
Since this is a community behind a login and password, I am not sure how much of it is legally crawlable. If you look, even Google indexes only the user profile pages, not wall posts, photos, etc.
I would suggest posting this question in the Facebook Forum. In the meantime, you can check these resources:
Facebook Developers
Facebook Developers Documentation
Facebook Developers Forum
Related
I'm new to web programming, so I guess my question would seem very stupid :)
I have simple website on Python/Django. There is some url, which users may open without any authentication.
I need to remember this user somehow and recognize him when he re-opens this URL (not for a long time; say, for several hours).
By "same user" I mean "user uses same browser on same device".
How can I achieve this? Thanks in advance :)
Cookies are your answer. They will work for any URL for the same browser, assuming your user has agreed to use them.
An alternative would be to tag the links with parameters, but that is specific to the link and could be shared with others.
I am trying to crawl a website for the first time, using Python's urllib2.
I am currently trying to log into the Foursquare social networking site using Python's urllib2 and BeautifulSoup. To view a particular page, I need to provide a username and password.
So, I followed the Basic Authentication described on the documentation page.
I guess everything worked well, but the site throws up a security check asking me to type some text (a CAPTCHA) before sending me the required page. It obviously looks like the site is detecting that the page is being requested not by a human but by a crawler.
So, what is the way to avoid being detected? How can I make urllib2 get the desired page without having to stop at the security check? Please help.
You probably want to use the foursquare API instead.
You have to use the foursquare API; I guess there is no other way. APIs are designed for exactly such purposes.
Crawlers that depend solely on the HTML structure of a page will fail in the future when that HTML changes.
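For example, a userless venue search against Foursquare's v2 API can be sketched like this in modern Python (the `client_id`/`client_secret` values are placeholders you get by registering an app with Foursquare; the version date is just an example):

```python
# Sketch of a userless Foursquare v2 API call.
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.foursquare.com/v2"


def build_url(endpoint, **params):
    # Every v2 request carries credentials and an API version date.
    params.setdefault("v", "20120101")
    query = urllib.parse.urlencode(sorted(params.items()))
    return "%s/%s?%s" % (API_BASE, endpoint, query)


def search_venues(client_id, client_secret, lat, lng):
    url = build_url("venues/search", client_id=client_id,
                    client_secret=client_secret, ll="%s,%s" % (lat, lng))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# e.g. search_venues("CLIENT_ID", "CLIENT_SECRET", 40.7, -74.0)
```

The JSON you get back is stable across page redesigns, which is exactly why the API beats scraping the HTML.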
First of all, I am a noob when it comes to coding and things like that, so forgive me if any of this sounds stupid. Anyway, using YouTube and Google I managed to get my own Google App Engine proxy server up and running. The problem is, it won't let me sign into websites (e.g. Facebook, Twitter, etc.) while using the proxy. I enter the URL into the URL bar on my website/server, whatever you want to call it, and it takes me there without any problems. For example, if I enter Facebook.com into the bar, it takes me to Facebook.com just fine, but when I try signing in to Facebook with my account information, the following message appears:
405 Method Not Allowed
The method POST is not allowed for this resource.
No matter what website it is, whenever I try signing in with my account, I get this error. Like I said before, my knowledge of coding is limited, but I somewhat know what I'm doing. I am using Python for the coding.
The second problem I have is that whenever I try to play a flash game it doesn't load correctly, no matter what website it is. Any help?
I have just started learning Scrapy, and I want to try some parsing with Python and Scrapy.
I am thinking of getting the list of questions I have posted under specific tags on SO, and then parsing them.
But I am not sure how I can log in with OpenID and Scrapy.
Can someone please guide me on which URL I have to submit the data to? When I type the OpenID, the site gets transferred to the OpenID provider's URL, so how can I enter the password there?
Use your RSS feed (it's basically XML).
One other way I can think of is to:
Login normally in browser
export cookies to file
use that file in script
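Step 3 can be done with Python's standard library (the modern successor of urllib2), assuming the cookies were exported in the Netscape `cookies.txt` format that most browser extensions produce:

```python
# Build an opener that sends the browser's exported cookies,
# so requests go out "logged in".
import urllib.request
from http.cookiejar import MozillaCookieJar


def opener_from_cookie_file(cookie_file):
    # MozillaCookieJar reads the Netscape/Mozilla cookies.txt format.
    jar = MozillaCookieJar(cookie_file)
    jar.load(ignore_discard=True, ignore_expires=True)
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

# opener = opener_from_cookie_file("cookies.txt")
# html = opener.open("https://stackoverflow.com/questions").read()
```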
I can think of a situation where the Fanatic badge would become less valued: people could simply schedule a cron job on their server to visit SO every day! So I advise against doing anything of this sort.
Here is a good example of how to simulate a user login request using Scrapy.
I've looked at a lot of questions and libs and didn't find exactly what I wanted. Here's the thing: I'm developing an application in Python for a user to get all sorts of things from social network accounts. I'm having trouble with Facebook. I would like, if possible, a step-by-step tutorial on the code and libs to use to get a user's information, from posts to photo information (with the user's login information, and how to do it, because I've had a lot of problems with authentication).
Thank you
I strongly encourage you to use Facebook's own APIs.
First of all, check out documentation on Facebook's Graph API https://developers.facebook.com/docs/reference/api/. If you are not familiar with JSON, DO read a tutorial on it (for instance http://secretgeek.net/json_3mins.asp).
Once you grasp the concepts, start using this API. For Python, there are several alternatives:
facebook/python-sdk https://github.com/facebook/python-sdk
pyFaceGraph https://github.com/iplatform/pyFaceGraph/
It is also semi-trivial to write a simple HTTP client that uses the Graph API.
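For instance, a bare-bones Graph API client needs little more than URL building and JSON decoding (a sketch in modern Python; `ACCESS_TOKEN` is a placeholder you obtain through Facebook's OAuth flow or the Graph API Explorer):

```python
# Minimal Graph API client sketch: fetch a Graph object as JSON.
import json
import urllib.parse
import urllib.request

GRAPH_BASE = "https://graph.facebook.com"


def graph_url(object_id, access_token=None, **params):
    # Build e.g. https://graph.facebook.com/me?access_token=...
    if access_token:
        params["access_token"] = access_token
    query = urllib.parse.urlencode(sorted(params.items()))
    return "%s/%s%s" % (GRAPH_BASE, object_id,
                        "?" + query if query else "")


def get_object(object_id, access_token=None):
    with urllib.request.urlopen(graph_url(object_id, access_token)) as resp:
        return json.load(resp)

# e.g. get_object("me", access_token="ACCESS_TOKEN")
```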
I would suggest checking out the Python libraries, trying the examples in their documentation, and seeing whether they work and do the stuff you need.
Only as a last resort would I write a scraper and try to extract data with screen scraping (it is much more painful and breaks far more easily).
I have not used this with Facebook, but in the past when I had to scrape a site that required login I used Mechanize to handle the login and scraping and Beautiful Soup to parse the resulting HTML.