I would like to build a tool with Python and the Twitter API that creates lists of tweets matching certain criteria, like "contains the word Python" or "has at least 2 likes", or that computes simple stats like top posters, most-liked tweets, etc.
All of my searches pointed me to the Tweepy project. But for that I need OAuth tokens, so I applied for a developer account and was denied with the comment "we are unable to serve your use case".
Do I have any alternatives?
Well, as a general answer for these situations, you can always use a browser-automation tool: a library that drives a real browser remotely, replicating what you would do by hand (opening the website, logging in, and so on), and then parses the data out of the rendered elements.
Try looking at Selenium; I've used that library in the past to scrape Facebook directly and it worked flawlessly.
Edit: Note that this isn't a Twitter-specific library; you will have to find the HTML elements on the login page and use them to log in, and do the same to parse out the data.
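As a rough sketch of the pattern (assuming Selenium 4 with Chrome; the field names and CSS selectors below are illustrative guesses and will need to be replaced with whatever Twitter's login page actually uses, which changes often):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://twitter.com/login")

# Fill in the login form -- these element locators are assumptions
driver.find_element(By.NAME, "session[username_or_email]").send_keys("your_username")
driver.find_element(By.NAME, "session[password]").send_keys("your_password")
driver.find_element(By.CSS_SELECTOR, "div[role='button']").click()

# Once logged in, load a search page and read the rendered tweets
driver.get("https://twitter.com/search?q=python")
tweets = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "article")]
print(tweets)

driver.quit()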
Related
For a project I've been using an API to get information from Instagram. However, I would like to get info from posts using keywords (words included in the post description). This is a feature available in the app
(see here); so far I have only been able to search by hashtag, which is not what I want.
I would like to know if any of you know of an API or tool able to accomplish this task.
Here is a way to search instagram posts by words or phrases (not hashtags):
First, you would need to use tools for web scraping Google search results; check out these answers for guidance. The URL that you make the request to would be something like:
https://www.google.com/search?q=site%3Ahttps%3A%2F%2Fwww.instagram.com%2Fp%2F+**put+the+phrase+here**
Once you have the URLs of the posts that contain those words, you can use an API (e.g. from RapidAPI), write your own web-scraping code, or use Python packages such as instagramy to get metadata from the Instagram posts you found.
Usually the information comes back as JSON when using an API, so it is not very difficult to extract the data and put it into a pandas DataFrame if you want to.
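A rough sketch of the first step, assuming the requests and beautifulsoup4 packages (note that Google actively rate-limits and blocks automated queries, so treat this as illustrative only):

import urllib.parse
import requests
from bs4 import BeautifulSoup

phrase = "sunset in lisbon"  # hypothetical search phrase
query = urllib.parse.quote(f'site:https://www.instagram.com/p/ "{phrase}"')
resp = requests.get(
    f"https://www.google.com/search?q={query}",
    headers={"User-Agent": "Mozilla/5.0"},  # Google blocks the default UA
)
soup = BeautifulSoup(resp.text, "html.parser")

# Keep only the links that point at Instagram posts
post_urls = [a["href"] for a in soup.find_all("a", href=True)
             if "instagram.com/p/" in a["href"]]
print(post_urls)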
I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.
What I would like to do is use Python to scrape all of the URLs from a Google search, which I can then use in a separate script to do text analysis of a large corpus (mainly news sites). This seems relatively straightforward, but none of the attempts I've tried have worked properly.
This is as close as I got:
from google import search
for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)
This returned about 300 URLs before I got blocked. An actual search with these parameters gives about 1,000 results, and I'd like all of them.
First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be why you've been blocked.
The relevant clause in Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
I haven't been able to find a definite number, but it seems their limit on the number of search queries per day is rather strict too: 100 search queries/day, per the JSON Custom Search API documentation here. (A sketch of using that official API follows the list below.)
Nonetheless, there's no harm in trying out other alternatives to see if they work better:
BeautifulSoup
Scrapy
ParseHub - this one is not a code library, but a useful piece of software with good documentation. Link to their tutorial on how to scrape a list of URLs.
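And if you want to stay within Google's terms, here is a hedged sketch of the official Custom Search JSON API mentioned above, assuming the requests package; the API key and search engine ID are placeholders you create in Google's developer console, and the free tier caps you at 100 queries per day:

import requests

API_KEY = "YOUR_API_KEY"          # placeholder
SEARCH_ENGINE_ID = "YOUR_CSE_ID"  # placeholder

urls = []
for start in range(1, 100, 10):   # the API returns at most 10 results per page
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID,
                "q": '"kinder morgan" "trans mountain" protest',
                "start": start},
    )
    items = resp.json().get("items", [])
    if not items:
        break
    urls.extend(item["link"] for item in items)

# Write the collected URLs to a txt file for use in another script
with open("urls.txt", "w") as f:
    f.write("\n".join(urls))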
I am trying to crawl a website for the first time, using Python's urllib2.
I am currently trying to log into the Foursquare social networking site using Python's urllib2 and BeautifulSoup. To view a particular page, I need to provide a username and password.
So, I followed the Basic Authentication described on the documentation page.
I guess everything worked well, but the site throws up a security check asking me to type in some text (a CAPTCHA) before sending me the required page. It looks like the site is detecting that the page is being requested not by a human, but by a crawler.
So, what is the way to avoid being detected? How do I make urllib2 get the desired page without stopping at the security check? Please help.
You probably want to use the Foursquare API instead.
You have to use the Foursquare API; I don't think there is any other way. APIs are designed for exactly this purpose.
Crawlers that depend solely on the HTML structure of a page will break in the future when the HTML changes.
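For example, here is a minimal sketch of calling the Foursquare v2 REST API with the requests package; the client ID and secret are placeholders you get by registering an app on Foursquare's developer site:

import requests

params = {
    "client_id": "CLIENT_ID",          # placeholder credentials
    "client_secret": "CLIENT_SECRET",
    "v": "20180323",                   # API version date
    "ll": "40.7243,-74.0018",          # latitude,longitude to search near
    "query": "coffee",
}
resp = requests.get("https://api.foursquare.com/v2/venues/search", params=params)
for venue in resp.json()["response"]["venues"]:
    print(venue["name"])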
I am making a web application that will monitor the number of members and discussions in each of the groups listed here (http://www.codecademy.com/groups#web) and display that information in nice graphs.
However, as you have already seen, it looks like I need to create an account and log in with it.
Bearing in mind that my project uses Python on the server side, how do I do it? Which API is easiest (Google, Facebook, or Twitter)?
I would really love it if you could also provide some examples, because I am really new at this (and at Python too).
A well-known wrapper around the Twitter API for Python is this one. I used it and it's very easy. You should first read this page and also register an application to get OAuth keys.
Example:
import twitter

# Remember to fill in these values
api = twitter.Api(consumer_key="",
                  consumer_secret="",
                  access_token_key="",
                  access_token_secret="")

# Get your timeline
print(api.GetHomeTimeline())
Hope it helps.
I've looked at a lot of questions and libraries and didn't find exactly what I wanted. Here's the thing: I'm developing an application in Python for a user to get all sorts of things from their social network accounts, and I'm having trouble with Facebook. If possible, I would like a step-by-step tutorial on the code and libraries to use to get a user's information, from posts to photo metadata (using the user's login credentials, and how to handle them, because I've had a lot of problems with authentication).
Thank you
I strongly encourage you to use Facebook's own APIs.
First of all, check out the documentation on Facebook's Graph API: https://developers.facebook.com/docs/reference/api/. If you are not familiar with JSON, DO read a tutorial on it (for instance http://secretgeek.net/json_3mins.asp).
Once you grasp the concepts, start using this API. For Python, there are several alternatives:
facebook/python-sdk https://github.com/facebook/python-sdk
pyFaceGraph https://github.com/iplatform/pyFaceGraph/
It is also semi-trivial to write a simple HTTP client that uses the Graph API.
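For instance, a minimal sketch of such a client using only the standard library; the access token is a placeholder you would obtain through Facebook's OAuth flow:

import json
import urllib.request

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder token

def graph_get(path):
    # Fetch one Graph API object and decode the JSON response
    url = f"https://graph.facebook.com/{path}?access_token={ACCESS_TOKEN}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

print(graph_get("me"))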
I would suggest you check out the Python libraries, try the examples in their documentation, and see whether they work and do what you need.
Only as a last resort would I write a scraper and try to extract data with screen scraping (it is much more painful and breaks much more easily).
I have not used this with Facebook, but in the past when I had to scrape a site that required login, I used Mechanize to handle the login and scraping, and Beautiful Soup to parse the resulting HTML.
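A rough sketch of that Mechanize + Beautiful Soup pattern; the URLs and form field names below are assumptions that depend entirely on the target site:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # many login pages disallow robots

br.open("https://example.com/login")   # hypothetical login URL
br.select_form(nr=0)                   # pick the first form on the page
br["username"] = "your_username"       # field names are assumptions
br["password"] = "your_password"
br.submit()

# Fetch a page behind the login and parse it
page = br.open("https://example.com/protected")  # hypothetical page
soup = BeautifulSoup(page.read(), "html.parser")
print(soup.title.string)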