I have a Python script that streams tweets into a CSV file. I pass a runtime parameter of 46800 seconds (13 hours), which dictates how long tweets are streamed into that CSV. The script ran for the full duration until yesterday, when it streamed for only 7.5 hours and then stopped. I believe there were no tweets about the topic I was tracking for a certain period, so the stream disconnected; when people started tweeting about the topic again later, the connection was never re-established, and the script stopped writing tweets to the CSV file. I had to restart the script in another instance and stream the tweets into another CSV file. Today I ran into the same issue: the stream disconnected after 6 hours, and I had to restart again.
But I am not sure that is actually what happened. Below is the script I used. Please advise what could have happened, and if my guess is right, how can I avoid it?
import csv
import json
import time

from bs4 import BeautifulSoup
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

runtime = 46800  # 13 hours

class listener(StreamListener):
    def on_data(self, data):
        data1 = json.loads(data)
        created_at = data1["created_at"]
        # Strip any markup/entities from the tweet text
        # (the original referenced an undefined name "tweet" here).
        tweet1 = BeautifulSoup(data1["text"], "lxml").get_text()
        url = "https://twitter.com/{}/status/{}".format(
            data1["user"]["screen_name"], data1["id_str"])
        with open('MARCH_DATA.csv', 'a') as f:
            csv.writer(f).writerow([created_at, tweet1, url])
        return True

# consumer_key, consumer_secret, access_token and access_token_secret
# are assumed to be defined above.
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
twitterStream = Stream(auth, listener())
# "is_async" was spelled "async" in older tweepy versions; "async" is a
# reserved word in Python 3.7+.
twitterStream.filter(track=["MTA"], is_async=True)
time.sleep(runtime)
twitterStream.disconnect()
Thanks
This worked for my streaming exercise.
# the regular imports, as well as this:
from urllib3.exceptions import ProtocolError

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
twitterStream = Stream(auth, listener())

while True:
    try:
        # Run the stream synchronously: with is_async=True, filter()
        # returns immediately and raises errors in a background thread,
        # so this except clause would never see them.
        twitterStream.filter(track=["MTA"], stall_warnings=True)
    except (ProtocolError, AttributeError):
        # Reconnect and resume streaming after a dropped connection.
        continue
I recently started to learn Python; I had zero knowledge of Python, and in the last few weeks I've been studying Python and the Twitter API.
I decided to work on a simple Twitter bot that automatically posts whatever people send to my direct messages, and I managed to do so.
Here's my code:
import tweepy
import json
import sys
import re
import time

print("accessing api...")

consumer_key = 'consumer_key'
consumer_secret = 'consumer_secret'
key = 'key'
secret = 'key_secret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(key, secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
print('api accessed')

print('collecting dms')
direct_message = api.list_direct_messages()
text = direct_message[0].message_create["message_data"]["text"]

def send_tweet():
    print('sending tweet...')
    api.update_status('/bot: ' + text)

while True:
    send_tweet()
    time.sleep(30)
The code works, but not 100%. I'm able to extract the text from the DM and use it in api.update_status("/bot: " + text), but when it loops I get the following error:
tweepy.error.TweepError: [{'code': 187, 'message': 'Status is a duplicate.'}]
I've looked it up and tried many things, for example:
while True:
    try:
        send_tweet()
        time.sleep(30)
    except tweepy.TweepError as error:
        if error.api_code == 187:
            print('duplicate message')
        else:
            print('waiting for new message')
            time.sleep(30)
All I want is a way to keep the code going, ignore messages and tweets that were already sent, and wait for new DMs.
Another thing that kept happening: when I tried testing the extraction of the text, I kept getting the same text from the same DM, instead of a newer one.
Direct messages stay in the inbox even after you read or reply to them. It is best to delete a DM after you process it (in your case, tweet it) so you don't need to worry about it the next time you fetch messages.
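A minimal sketch of that flow, assuming tweepy 3.x and the api object from the question above (list_direct_messages, update_status, and destroy_direct_message are the tweepy calls involved; the 30-second poll and the '/bot: ' prefix follow the question's code):

import time

def post_and_clear_dms(api):
    # Tweet every DM currently in the inbox, then delete each one so it
    # isn't fetched and re-tweeted on the next poll.
    for dm in api.list_direct_messages():
        text = dm.message_create["message_data"]["text"]
        api.update_status('/bot: ' + text)
        api.destroy_direct_message(dm.id)

while True:
    post_and_clear_dms(api)
    time.sleep(30)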
I'm looking for the fastest way to check in real time whether a specific user (TwitterID) has tweeted. To achieve this I have used Tweepy and the stream function, which results in a notification of the tweet after about 5 seconds. Is there a faster way to check if someone has tweeted, using another library, raw requests, or code optimization?
Thanks in advance.
import tweepy

TwitterID = "148137271"

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api):
        self.api = api
        self.me = api.me()

    def on_status(self, tweet):
        # Filter if ID has tweeted
        if tweet.user.id_str == TwitterID:
            print("Tweeted:", tweet.text)

    def on_error(self, status):
        print("Error detected")
        print(status)

# Authenticate to Twitter (OAuthHandler needs both the consumer key
# and the consumer secret)
auth = tweepy.OAuthHandler("X", "X")
auth.set_access_token("Y", "Z")

# Create API object
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

tweets_listener = MyStreamListener(api)
stream = tweepy.Stream(api.auth, tweets_listener)
stream.filter(follow=[TwitterID])
I'd say around 5 seconds is a reasonable latency, given that your program is not running on the same server as Twitter's core systems. You're subject to network and API latency, and those things are outside of your control. There's no real way to rewrite this logic to reduce the time between a Tweet being posted and it reaching the API. If you think about the internal work going on inside Twitter itself from a Tweet being posted to it being fanned out to potentially millions of followers, the fact that the API, at the end of an unknown network connection, delivers the Tweet data in under 5 seconds is pretty crazy in itself.
So what I want to do is live-stream Tweets from Twitter's API: for just the hashtag 'Brexit', only in the English language, and for a specific number of Tweets (1k-2k).
So far my code will live-stream the Tweets, but however I modify it I either end up with it ignoring the count and just streaming indefinitely, or I get errors. If I change it to stream only a specific user's Tweets, the count works, but it ignores the hashtag. If I stream everything for the given hashtag, it completely ignores the count. I've had a decent go at trying to fix it but am quite inexperienced and have really hit a brick wall with it.
If I could get some help with how to tick all these boxes at the same time, it would be much appreciated!
The code below so far will just stream 'Brexit' Tweets indefinitely, so it ignores the count=10.
The bottom of the code is a bit of a mess due to me playing with it, apologies:
import numpy as np
import pandas as pd
import tweepy
from tweepy import API
from tweepy import Cursor
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

import Twitter_Credentials

import matplotlib.pyplot as plt

# Twitter client - hash out to stream all
class TwitterClient:
    def __init__(self, twitter_user=None):
        self.auth = TwitterAuthenticator().authenticate_twitter_app()
        self.twitter_client = API(self.auth)
        self.twitter_user = twitter_user

    def get_twitter_client_api(self):
        return self.twitter_client

# Twitter authenticator
class TwitterAuthenticator:
    def authenticate_twitter_app(self):
        auth = OAuthHandler(Twitter_Credentials.consumer_key, Twitter_Credentials.consumer_secret)
        auth.set_access_token(Twitter_Credentials.access_token, Twitter_Credentials.access_secret)
        return auth

class TwitterStreamer():
    # Class for streaming and processing live Tweets
    def __init__(self):
        self.twitter_authenticator = TwitterAuthenticator()

    def stream_tweets(self, fetched_tweets_filename, hash_tag_list):
        # This handles Twitter authentication and the connection to the Twitter API
        listener = TwitterListener(fetched_tweets_filename)
        auth = self.twitter_authenticator.authenticate_twitter_app()
        stream = Stream(auth, listener)
        # This line filters the Twitter stream to capture data by keywords
        stream.filter(track=hash_tag_list)

# Twitter stream listener
class TwitterListener(StreamListener):
    # A listener class that prints incoming Tweets to stdout
    def __init__(self, fetched_tweets_filename):
        self.fetched_tweets_filename = fetched_tweets_filename

    def on_data(self, data):
        try:
            print(data)
            with open(self.fetched_tweets_filename, 'a') as tf:
                tf.write(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        if status == 420:
            # Return False in case rate limiting occurs
            return False
        print(status)

class TweetAnalyzer():
    # Functionality for analysing and categorising content from tweets
    def tweets_to_data_frame(self, tweets):
        df = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['tweets'])
        df['id'] = np.array([tweet.id for tweet in tweets])
        df['len'] = np.array([len(tweet.text) for tweet in tweets])
        df['date'] = np.array([tweet.created_at for tweet in tweets])
        df['source'] = np.array([tweet.source for tweet in tweets])
        df['likes'] = np.array([tweet.favorite_count for tweet in tweets])
        df['retweets'] = np.array([tweet.retweet_count for tweet in tweets])
        return df

if __name__ == "__main__":
    auth = OAuthHandler(Twitter_Credentials.consumer_key, Twitter_Credentials.consumer_secret)
    auth.set_access_token(Twitter_Credentials.access_token, Twitter_Credentials.access_secret)
    api = tweepy.API(auth)

    for tweet in Cursor(api.search, q="#brexit", count=10,
                        lang="en",
                        since="2019-04-03").items():
        fetched_tweets_filename = "tweets.json"
        twitter_streamer = TwitterStreamer()
        hash_tag_list = ["Brexit"]
        twitter_streamer.stream_tweets(fetched_tweets_filename, hash_tag_list)
You're trying to use two different methods of accessing the Twitter API: streaming is realtime, and searching is a one-off API call.
Since streaming is continuous and realtime, there's no way to apply a count of results to it - the code simply opens a connection, says "hey, send me all the Tweets from now onwards that contain the hash_tag_list", and sits listening. At that point you drop into the StreamListener, where for each Tweet received, you write it to a file.
You could apply a counter here, but you'd need to wrap it inside your StreamListener on_data handler, and increment the counter for each Tweet received. When you get to 1000 Tweets, stop listening.
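A minimal sketch of that counter, assuming the tweepy 3.x StreamListener API already used in the question (the 1000 limit comes from the question's target range):

class TwitterListener(StreamListener):
    # Same listener as in the question, with a Tweet counter added.
    def __init__(self, fetched_tweets_filename, max_tweets=1000):
        self.fetched_tweets_filename = fetched_tweets_filename
        self.max_tweets = max_tweets
        self.count = 0

    def on_data(self, data):
        with open(self.fetched_tweets_filename, 'a') as tf:
            tf.write(data)
        self.count += 1
        # Returning False tells tweepy to close the streaming connection.
        return self.count < self.max_tweets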
For the search option, you have a couple of issues... the first one is that you're asking for Tweets since 2019, but the standard search API can only go back 7 days in time. You've obviously asked for only 10 Tweets there. The way you've written the method though, what's actually happening is that for each Tweet in the collection of 10 that the API returns, you then create a realtime streaming connection and start listening and writing to a file. So that's not going to work.
You'll need to choose one: either search for 1000 Tweets and write them to a file (never set up TwitterStreamer()), or listen for 1000 Tweets and write them to a file (drop the for tweet in Cursor(api.search...) loop and jump straight to the streamer).
Simply add the hashtag symbol to the search phrase in the list, and it'll match tweets that use that specific hashtag. It's case-sensitive, so you may want to add as many variants as you need to the track list. Merely using "Brexit" matches tweets that may or may not use the hashtag but do contain the keyword "Brexit".
hash_tag_list = ["#Brexit"]
I am very new to Python. I'm using the tweepy library to scrape tweets via the Twitter streaming API, but the connection gets broken after running for an hour. I want to know if there is any way to stop the program before the connection gets broken; in short, limiting the number of tweets.
I have tried the .items() method, but it didn't work, as it raises a NameError.
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

ckey = "xxxxxxxxxxxxxxxxxxxxxxxxxxx"
csecret = "xxxxxxxxxxxxxxxxxxxxxx"
atoken = "xxxxxxxxxxxxxxxxxxxxx"
asecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxx"

class listener(StreamListener):
    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["Obama"])
thanks
To solve your connection issue, take help from this:
Tweepy Connection broken: IncompleteRead - best way to handle exception? or, can threading help avoid?
To limit the tweets, return False from the listener's on_data method once the desired number of tweets has been fetched. Set the maximum number of tweets in the __init__ method, and use try/except for error handling. This might help:
# These methods live inside your StreamListener subclass;
# json must be imported at the top of the file.
def __init__(self):
    super().__init__()
    self.max_tweets = 10
    self.tweet_count = 0

def on_data(self, data):
    try:
        decoded = json.loads(data)  # raises TypeError if data is not a string
    except TypeError:
        print("completed")
    else:
        self.tweet_count += 1
        if self.tweet_count == self.max_tweets:
            print("completed")
            return False
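For completeness, here is a sketch of how those pieces might be wired together. CountListener is an illustrative name, and the credentials and track term are the ones from the question above:

import json

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class CountListener(StreamListener):
    def __init__(self):
        super().__init__()
        self.max_tweets = 10
        self.tweet_count = 0

    def on_data(self, data):
        decoded = json.loads(data)
        print(decoded.get("text"))
        self.tweet_count += 1
        # Returning False disconnects the stream at the limit.
        return self.tweet_count < self.max_tweets

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
Stream(auth, CountListener()).filter(track=["Obama"])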
While running this program to retrieve Twitter data using Python 2.7.8:
#imports
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

#setting up the keys
consumer_key = '…………...'
consumer_secret = '………...'
access_token = '…………...'
access_secret = '……………..'

class TweetListener(StreamListener):
    # A listener handles tweets as they are received from the stream.
    # This is a basic listener that just prints received tweets to standard output.
    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

#printing all the tweets to the standard output
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
stream = Stream(auth, TweetListener())
t = u"سوريا"
stream.filter(track=[t])
After running this program for 5 hours, I got this error message:
Traceback (most recent call last):
File "/Users/Mona/Desktop/twitter.py", line 32, in <module>
stream.filter(track=[t])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweepy/streaming.py", line 316, in filter
self._start(async)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweepy/streaming.py", line 237, in _start
self._run()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweepy/streaming.py", line 173, in _run
self._read_loop(resp)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweepy/streaming.py", line 225, in _read_loop
next_status_obj = resp.read( int(delimited_string) )
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 543, in read
return self._read_chunked(amt)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 612, in _read_chunked
value.append(self._safe_read(chunk_left))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 660, in _safe_read
raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(0 bytes read, 976 more expected)
>>>
Actually, I don't know what to do about this problem!
You should check to see if you're failing to process tweets quickly enough, using the stall_warnings parameter.
stream.filter(track=[t], stall_warnings=True)
These messages are handled by Tweepy (check out the implementation here) and will inform you if you're falling behind. Falling behind means that you're unable to process tweets as quickly as the Twitter API is sending them to you. From the Twitter docs:
Setting this parameter to the string true will cause periodic messages to be delivered if the client is in danger of being disconnected. These messages are only sent when the client is falling behind, and will occur at a maximum rate of about once every 5 minutes.
In theory, you should receive a disconnect message from the API in this situation. However, that is not always the case:
The streaming API will attempt to deliver a message indicating why a stream was closed. Note that if the disconnect was due to network issues or a client reading too slowly, it is possible that this message will not be received.
The IncompleteRead could also be due to a temporary network issue and may never happen again. If it happens reproducibly after about 5 hours though, falling behind is a pretty good bet.
I've just had this problem. The other answer is factually correct: it's almost certainly that your program isn't keeping up with the stream, and you get a stall warning if that's the case.
In my case, I was reading the tweets into postgres for later analysis, across a fairly dense geographic area as well as keywords (London, in fact, and about 100 keywords). It's quite possible that, even though you're just printing it, your local machine is doing a bunch of other things, and system processes get priority, so the tweets will back up until Twitter disconnects you. (This typically manifests as an apparent memory leak: the program increases in size until it gets killed, or Twitter disconnects you, whichever comes first.)
The thing that made sense here was to push the processing off to a queue. So I used a redis and django-rq solution. It took about 3 hours to implement on dev and then my production server, including researching, installing, rejigging existing code, being stupid about my installation, testing, and misspelling things as I went.
Install redis on your machine
Start the redis server
Install Django-RQ (or just Install RQ if you're working solely in python)
Now, in your django directory (where appropriate - ymmv for straight python applications) run:
python manage.py rqworker &
You now have a queue! You can add jobs to it by changing your handler like this:
(At the top of the file)
import django_rq
Then in your handler section:
def on_data(self, data):
    django_rq.enqueue(print, data)
    return True
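If you're working in plain Python rather than Django, the equivalent with RQ directly looks roughly like this (a sketch, assuming a local Redis server and a worker started with rq worker):

from redis import Redis
from rq import Queue

q = Queue(connection=Redis())

# Inside your listener:
def on_data(self, data):
    # Hand the Tweet off to the queue and return immediately,
    # so the stream reader keeps pace with Twitter.
    q.enqueue(print, data)
    return True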
As an aside - if you're interested in stuff emanating from Syria, rather than just mentioning Syria, then you could add to the filter like this:
stream.filter(track=[t], locations=[35.6626, 32.7930, 42.4302, 37.2182])
That's a very rough geobox centred on Syria, but which will pick up bits of Iraq/Turkey around the edges. Since this is an optional extra, it's worth pointing this out:
Bounding boxes do not act as filters for other filter parameters. For example track=twitter&locations=-122.75,36.8,-121.75,37.8 would match any tweets containing the term Twitter (even non-geo tweets) OR coming from the San Francisco area.
From this answer, which helped me, and the Twitter docs.
Edit: I see from your subsequent posts that you're still going down the road of using the Twitter API, so hopefully you got this sorted anyway; in any case this may be useful for someone else! :)
This worked for me.
# Requires: from urllib3.exceptions import ProtocolError
# StdOutListener is a basic print-to-stdout StreamListener, as above.
l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, l)

while True:
    try:
        stream.filter(track=['python', 'java'], stall_warnings=True)
    except (ProtocolError, AttributeError):
        # Reconnect after a dropped connection.
        continue
A solution is restarting the stream immediately after catching the exception.
# imports
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

# setting up the keys
consumer_key = "XXXXX"
consumer_secret = "XXXXX"
access_token = "XXXXXX"
access_secret = "XXXXX"

# printing all the tweets to the standard output
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

class TweetListener(StreamListener):
    # A listener handles tweets as they are received from the stream.
    # This is a basic listener that just prints received tweets to standard output.
    def on_data(self, data):
        print(data)
        return True

    def on_exception(self, exception):
        print('exception', exception)
        # Restart the stream as soon as tweepy surfaces an exception.
        start_stream()

    def on_error(self, status):
        print(status)

def start_stream():
    stream = Stream(auth, TweetListener())
    t = u"سوريا"
    stream.filter(track=[t])

start_stream()
For me, the back-end application that the URL points to was directly returning the string. At the start I just returned text like:
return original_message
I changed it to:
return Response(response=original_message, status=200, content_type='application/text')
I think this answer works only for my case.
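For context, a minimal Flask sketch of the before and after (the route name and payload are illustrative, not from the original answer):

from flask import Flask, Response

app = Flask(__name__)

@app.route('/message')
def message():
    original_message = 'some text'  # illustrative payload
    # Before: returning the bare string triggered IncompleteRead
    # on the client for this poster.
    # return original_message
    # After: wrap it in an explicit Response.
    return Response(response=original_message, status=200,
                    content_type='application/text')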