I've looked at a lot of questions and libs and didn't found exactly what I wanted. Here's the thing, I'm developing an application in python for a user to get all sorts of things from social networks accounts. I'm having trouble with facebook. I would like, if possible, a step-by-step tutorial on the code and libs to use to get a user's information, from posts to photos information (with the user's login information, and how to do it, because I've had a lot of problem with authentication).
Thank you
I strongly encourage you to use Facebook's own APIs.
First of all, check out documentation on Facebook's Graph API https://developers.facebook.com/docs/reference/api/. If you are not familiar with JSON, DO read a tutorial on it (for instance http://secretgeek.net/json_3mins.asp).
Once you grasp the concepts, start using this API. For Python, there are at several alternatives:
facebook/python-sdk https://github.com/facebook/python-sdk
pyFaceGraph https://github.com/iplatform/pyFaceGraph/
It is also semitrivial to write a simple HTTP client that uses the graph API
I would suggest you to check out the Python libraries, try out the examples in their documentation and see if they are working and do the stuff you need.
Only as a last resort, would I write a scraper and try to extract data with screenscraping (it is much more painful and breaks more easily).
I have not used this with Facebook, but in the past when I had to scrape a site that required login I used Mechanize to handle the login and scraping and Beautiful Soup to parse the resulting HTML.
Related
I'm pretty inexperienced with any form of http request more complicated than a basic GET query. I've tried to do research online, but I'm having trouble figuring out where to start because I don't know all the required terminology.
For several years I've worked a side job for a data entry company. Basically what I do is Google several things, find the results from a few specific webpages, and copy those URLs into the company's system. About two years ago I wrote a very basic Python program to do the Googling part for me, and now I want to rewrite it and expand it to do the rest of it as well.
The website uses a combination of POST and PATCH requests to update the information on the database, and because the information is attached to my account I assume there is some form of authentication involved. I don't have access to the system's backend so the best I can do is head to the Network tab under Inspect Element. I can't find anything in the requests' headers that seem to attach to my account.
What do I need to do to authenticate, and if it's not that simple, where's the best place to start learning?
Let me know if you need more information and I'll try to give you what you need--I don't know exactly what's required.
I would like to create a tool with Python and the Twitter API to be able to create lists of tweets that match certain criteria like "contains the word Python" or "has at least 2 likes". Or simple stats like top posters, most liked, etc.
All of my search pointed me to the Tweepy project. But for that I need 0Auth tokens. So I applied for a developer account and was denied with the comment "we are unable to serve your use case".
Do I have any alternatives?
Well, as a general answer for these situations, you can always use a web-based automation tool, which is basically a library that interacts with the remote feature of the browsers and replicates what would be you "opening the website, logging in, etc" and can subsequently parse all data from the rendered elements.
Try looking at selenium, i've used that library in the past to raw scrap facebook and it worked flawless.
Edit: Note that this isn't a twitter specific library, you will have to find the html tags in the login website and use them to log in, same for parsing data, etc.
I am working on my machine learning project where I am in a great need to access the textual posts of the people from Facebook, of course with permission, and then process it to find the best personality traits.
I am struggling to find a way to merge my application with something that can give me the post of the people from their profiles. Hence, I found something in the documentation of the Django-Facebook API. Here is the link for the documentation.
Here what I am not able to understand is how can I use this for extracting the post of the people? Can anyone give me a small example that can be helpful in understanding what I can do to extract the text from the people profile and use it in my application.
I am trying to crawl a website for the first time. I am using urllib2 Python
I am currently trying to log into Foursquare social networking site using Python urlib2 and Beautifulsoup. To view a particular page, I need to provide username and password.
So,I followed the Basic Authentication described on the ducumentation page.
I guess, everything worked well, but the site throws up a security check asking me to type a text (capcha), before sending me the required page. It obviously looks like, the site is detecting that, a page is being requested not by a human, but a crawler.
So, what is the way, to avoid being detected. How to make urllib2 get the desired page, without having to stop at the security check? Pls help..
You probably want to use foursquare API instead.
You have to use the foursquare API. I guess, there is no other way. API are designed for such purposes.
Crawlers depending solely on the HTML format of the page will fail in the furture when the HTML page changes
I'm trying to scrap a page in youtube with python which has lot of ajax in it
I've to call the java script each time to get the info. But i'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
Youtube (and everything else Google makes) have EXTENSIVE APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at The Youtube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
Main problem is, you're violating the TOS (terms of service) for the youtube site. Youtube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then, on you head be it -- technically, your best bet are python-spidermonkey and selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: Install Firebug on Firefox, then turn the NET on in firebug and click on the desired link on YouTube. Now see what happens and what pages are requested. Find the one that are responsible for the AJAX part of page. Now you can use urllib or Mechanize to fetch the link. If you CAN pull the same content this way, then you have what you are looking for, then just parse the content. If you CAN'T pull the content this way, then that would suggest that the requested page might be looking at user login credentials, sessions info or other header fields such as HTTP_REFERER ... etc. Then you might want to look at something more extensive like the scrapy ... etc. I would suggest that you always follow the simple path first. Good luck and happy "responsibly" scraping! :)
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to access javascript links.
You could sniff the network traffic with something like Wireshark then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as scraPY.