For a research project, I am collecting tweets using Python-Twitter. However, running our program nonstop on a single computer, we manage to collect only about 20 MB of data per week. I am running this program on only one machine so that we do not collect the same tweets twice.
Our program runs a loop that calls getPublicTimeline() every 60 seconds. I tried to improve this by calling getUserTimeline() on some of the users that appeared in the public timeline. However, this consistently got me banned from collecting tweets at all for about half an hour each time. Even without the ban, it seemed that there was very little speed-up by adding this code.
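For reference, the loop described above looks roughly like this; a minimal sketch assuming python-twitter's older Api interface (method names such as GetPublicTimeline may differ by version, and very old versions allowed unauthenticated public-timeline reads):

import json
import time
import twitter  # python-twitter

api = twitter.Api()

seen = set()  # the public timeline only returns ~20 tweets, so dedupe across polls
while True:
    for status in api.GetPublicTimeline():
        if status.id not in seen:
            seen.add(status.id)
            with open("tweets.jsonl", "a") as fh:
                fh.write(json.dumps(status.AsDict()) + "\n")
    time.sleep(60)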
I know about Twitter's "whitelisting", which allows a user to submit more requests per hour. I applied for this about three weeks ago and have not heard back since, so I am looking for alternatives that will allow our program to collect tweets more efficiently without going over the standard rate limit. Does anyone know of a faster way to collect public tweets from Twitter? We'd like to get about 100 MB per week.
Thanks.
How about using the streaming API? This is exactly the use-case it was created to address. With the streaming API you will not have any problems gathering megabytes of tweets. You still won't be able to access all tweets or even a statistically significant sample without being granted access by Twitter though.
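For illustration, a minimal sketch using tweepy's pre-4.0 StreamListener interface (the credentials are placeholders; adapt to whichever library you prefer):

import tweepy

# Placeholder credentials (assumed); create an app at dev.twitter.com for real ones.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

class SampleListener(tweepy.StreamListener):
    def on_status(self, status):
        # Append each tweet's text to disk; megabytes accumulate quickly this way.
        with open("tweets.txt", "a") as fh:
            fh.write(status.text + "\n")

    def on_error(self, status_code):
        return status_code != 420  # stop reconnecting on rate-limit disconnects

stream = tweepy.Stream(auth=auth, listener=SampleListener())
stream.sample()  # the ~1% random sample endpoint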
I did a similar project analyzing data from tweets. If you're just going at this from a pure data collection/analysis angle, you can just scrape any of the better sites that collect these tweets for various reasons. Many sites allow you to search by hashtag, so throw in a popular enough hashtag and you've got thousands of results. I just scraped a few of these sites for popular hashtags, collected these into a large list, queried that list against the site, and scraped all of the usable information from the results. Some sites also allow you to export the data directly, making this task even easier. You'll get a lot of garbage results that you'll probably need to filter (spam, foreign language, etc), but this was the quickest way that worked for our project. Twitter will probably not grant you whitelisted status, so I definitely wouldn't count on that.
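The scrape itself can be as simple as the sketch below; the URL and CSS selector are purely hypothetical stand-ins for whichever aggregator site you pick (check its robots.txt and terms first):

import requests
from bs4 import BeautifulSoup

HASHTAGS = ["#python", "#datascience"]  # seed with popular tags, as described above

for tag in HASHTAGS:
    # example-tweet-archive.com and .tweet-text are placeholders for illustration
    resp = requests.get("https://example-tweet-archive.com/search", params={"q": tag})
    soup = BeautifulSoup(resp.text, "html.parser")
    for node in soup.select(".tweet-text"):
        print(node.get_text(strip=True))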
There is a pretty good tutorial from Ars Technica on using the streaming API in Python that might be helpful here.
Otherwise you could try doing it via cURL.
I just recently finished a nearly 6-week-long conversation with multiple people on Twitter. Since several things that were said were quite interesting (particularly in hindsight), I'd like to be able to archive the entire conversation for reference later. From what I can tell, there are no existing solutions similar to threadreaderapp.com to recursively unroll an entire conversation. As such, I looked into doing it in Python with the Twitter API. In researching it, I found several people saying the free version of the API only lets you search replies from the last 7 days. But then I found some places (e.g., here) that seemed to indicate the Twitter API v2 added access to a "conversation ID" that enabled this limitation to be avoided. However, when I tried to run that code to get the replies to my tweet, the response kept coming back empty. Specifically, as best I can tell, the request from line 19 of this code (link ... which is the code from step 7 of the previously mentioned article: direct link) is not returning data.
Am I missing something? Is it possible to recursively get all replies to a tweet from the past 6 weeks without needing to be considered an "Academic Researcher" to be able to access the full Twitter archive (reference)?
Ultimately, I can get all the tweets from the website in the browser, so I suppose if I knew what I was doing I could just use some sort of HTML scraper, but I don't.
The Twitter API v2 allows you to use the conversation_id as a search parameter on both the recent search, and full archive search, endpoints. The difference is that the recent search API covers the past seven days (available in the Essential access tier / most users), and the full archive search API is limited to Academic access at this time.
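For the recent-search case, a minimal sketch with raw requests (the bearer token and conversation ID are placeholders). Note that when nothing matches, the v2 response simply omits the "data" key, which can look like the "empty" result you describe:

import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"       # placeholder
CONVERSATION_ID = "1234567890123456789"  # placeholder: the id of the root Tweet

resp = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={
        "query": f"conversation_id:{CONVERSATION_ID}",
        "max_results": 100,
        "tweet.fields": "author_id,created_at,in_reply_to_user_id",
    },
)
resp.raise_for_status()
for tweet in resp.json().get("data", []):  # "data" is absent on zero matches
    print(tweet["id"], tweet["text"])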
So, to directly answer your question: no, the API does not allow you to recursively get all replies to a Tweet from the past 6 weeks, unless you are indeed a qualified academic researcher with access to the full archive search functionality.
Other retrieval methods are beyond the scope of the API and are not supported by Twitter.
I realize that versions of this question have been asked and I spent several hours the other day trying a number of strategies.
What I would like to do is use Python to scrape all of the URLs from a Google search, which I can then use in a separate script to do text analysis of a large corpus (news sites mainly). This seems relatively straightforward, but none of the attempts I've tried have worked properly.
This is as close as I got:
from google import search
for url in search('site:cbc.ca "kinder morgan" and "trans mountain" and protest*', stop=100):
    print(url)
This returned about 300 URLs before I got kicked. An actual search using these parameters provides about 1000 results and I'd like all of them.
First: is this possible? Second: does anyone have any suggestions to do this? I basically just want a txt file of all the URLs that I can use in another script.
It seems that this package uses screen scraping to retrieve search results from Google, so it doesn't play well with Google's Terms of Service, which could be the reason why you've been blocked.
The relevant clause in Google's Terms of Service:
Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide. You may use our Services only as permitted by law, including applicable export and re-export control laws and regulations. We may suspend or stop providing our Services to you if you do not comply with our terms or policies or if we are investigating suspected misconduct.
I haven't been able to find a definite number, but it seems like their limit on the number of search queries per day is rather strict too: the JSON Custom Search API documentation here mentions 100 search queries per day.
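If that 100-queries/day quota is workable for you, a minimal sketch of calling the Custom Search JSON API with requests (the key and engine id are placeholders you create in Google's console):

import requests

API_KEY = "YOUR_API_KEY"  # placeholder, from the Google developer console
CX = "YOUR_ENGINE_ID"     # placeholder, a Custom Search Engine id

params = {
    "key": API_KEY,
    "cx": CX,
    "q": 'site:cbc.ca "kinder morgan" "trans mountain" protest',
    "num": 10,    # the API returns at most 10 results per call
    "start": 1,   # increment by 10 to page through results
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["link"])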
Nonetheless, there's no harm trying out other alternatives to see if they work better:
BeautifulSoup (see the sketch after this list)
Scrapy
ParseHub - this one is not in code, but is a useful piece of software with good documentation. Link to their tutorial on how to scrape a list of URLs.
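For the BeautifulSoup route, here is a generic sketch that pulls every link out of a fetched page and appends it to a text file, which matches what you said you want. The URL is a placeholder; pointing this at Google's own results pages runs into the ToS issue quoted earlier:

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/search-results"  # placeholder; not a Google results page

resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

with open("urls.txt", "a") as fh:
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("http"):
            fh.write(a["href"] + "\n")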
I am writing a script in Python, that uses tweepy to search for tweets with a given keyword. Here is the snippet:
for tweet in tweepy.Cursor(api.search, q=keyword, lang="en").items(10):
    print tweet.id
I have everything authenticated properly and the code works most of the time. However, when I try to search for some keywords (examples below) it doesn't return anything.
The keywords that cause trouble are "digitalkidz" (a tech conference) and "newtrendbg" (a Bulgarian company). If you do a quick search on Twitter for either of those you will see that there are results. However, tweepy doesn't find anything. Again, it does work for pretty much any other keyword I use.
Do you have any ideas what might be the problem and how to fix it?
Thank you
I believe you're forgetting an important aspect of the Twitter API: it's not exhaustive.
Taken from the API docs:
Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.
Regardless of whether you're using the streaming or REST API, you're going to have issues with this if you're looking for specific tweets.
REST API
When looking for historical tweets, you unfortunately won't be able to obtain anything that is older than a week using api.search(). This is also shown in the docs.
Keep in mind that the search index has a 7-day limit. In other words, no tweets will be found for a date older than one week.
There are other ways of getting older tweets; this post details those options.
Streaming API
While it doesn't sound like you're using twitter's streaming API, it should be noted that this only gives a small sample of twitter's current tweet traffic (~1-2%).
Hopefully this is helpful. Let me know if you have any questions.
So I have run into the issue of getting data from Google Finance. They have an HTML access system that you can use to fetch pages giving stock data in simple text format (ideal for minimizing parsing). However, if you access this service too frequently, Google locks you out and you need to enter a captcha. I currently have a list of about 50 stocks and I want to update my price data every 15 seconds, but I soon get locked out (after about 3-4 minutes).
Does anyone have any solutions to this, or know how frequently I can ping Google for this information before getting locked out?
Not sure why a feature like this would be on a service designed to give data like this... but similar alternative services with realtime data would also be accepted.
Probably because your usage is not what they intended. Making that many requests every 15 seconds seems a bit excessive. They had an API that got discontinued some years ago, and there's another SO question with some available alternatives, which is probably also a bit out of date.
From Google, there is also its Finance service with getStockInfo, which allows you to query its database, but read their warnings.
Yahoo YQL works fairly well, but it throws numerous HTTP 500 errors that need to be handled; they are all benign. TradeKing is an option; however, it requires the oauth2 package, which is very difficult to install properly.
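For those YQL 500s, a sketch of a simple retry wrapper. The endpoint and datatable here date from the era of this answer and have since been retired, so treat this purely as the retry pattern:

import time
import requests

YQL_URL = "https://query.yahooapis.com/v1/public/yql"  # historical endpoint
QUERY = 'select * from yahoo.finance.quotes where symbol in ("AAPL","GOOG")'

def fetch_quotes(retries=5, backoff=2.0):
    params = {"q": QUERY, "format": "json",
              "env": "store://datatables.org/alltableswithkeys"}
    for attempt in range(retries):
        resp = requests.get(YQL_URL, params=params)
        if resp.status_code == 500:  # benign per above: wait and retry
            time.sleep(backoff * (attempt + 1))
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("YQL kept returning HTTP 500")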
Is there a way that I can get unlimited or 500 tweets from Twitter?
I'm using Python.
I can get 100 tweets by using twitter_search = Twitter(domain="search.twitter.com"), but it is limited to 100 tweets.
Edit:
I'm using the pypi.python.org/pypi/twitter/1.9.0 library.
I want the public tweets, not the tweets from my account and my followers.
I've been having the same issue. As far as I can tell, there is really no getting around the twitter API limits, and there don't seem to be any other APIs that give access to archives of tweets.
One option, albeit a challenging one, is downloading all tweets in bulk from archive.org:
http://archive.org/details/twitterstream
Each month of data is >30 GB, compressed, so it won't be easy to handle. But if you are determined, this will give you full control over the raw data with no limits.
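Once a tarball is downloaded and extracted, iterating over it looks roughly like this; I'm assuming the archive's layout of many small per-minute .json.bz2 files with one JSON object per line:

import bz2
import glob
import json

for path in glob.glob("2013/01/01/*/*.json.bz2"):
    with bz2.open(path, "rt", encoding="utf-8") as fh:  # requires Python 3.3+
        for line in fh:
            tweet = json.loads(line)
            if "text" in tweet:  # skip delete notices and other control messages
                print(tweet["text"])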
The Twitter API limits the results to a maximum of 200 per request. How to set the count in your example to this maximum depends on the library you are using (which you didn't state, so I can't give you any information on that).
So, unlimited won't be possible in one request, no matter which library you are using.
See the "count" parameter here: https://dev.twitter.com/docs/api/1.1/get/statuses/home_timeline
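Since your edit says you're using the PyPI twitter package, here is a sketch of passing the maximum count with it (credentials are placeholders, and exact parameter support depends on the library version):

from twitter import Twitter, OAuth  # the PyPI "twitter" package from your edit

# Placeholder credentials (assumed); create an app on dev.twitter.com for real ones.
t = Twitter(auth=OAuth("TOKEN", "TOKEN_SECRET", "CONSUMER_KEY", "CONSUMER_SECRET"))

# count caps out at 200 for this endpoint.
tweets = t.statuses.home_timeline(count=200)
print(len(tweets))

If you need more than one page, pass the oldest id you've seen as max_id on the next call and keep paging backwards.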
If you can shift to Orange, you can get up to 9999 tweets per request from there. Hope someone will find it helpful.
https://medium.com/analytics-vidhya/twitter-sentiment-analysis-with-orange-vader-powerbi-part-1-184b693b9d70