Python issue with saving the output

As a new Python user I have hit an issue with the following code. Instead of only printing the results of a Twitter search to the screen, I need to save them to a file (ideally pipe-delimited, which I don't yet know how to produce...). The following code runs OK but doesn't create the Output.txt file. It did once and then never again. I am running it on Mac OS and ending the script with Ctrl+C (as I still don't know how to modify it to only return a specific number of tweets). I thought the issue might be related to flushing, but after trying the options from this post on flushing issues, none of them seemed to work (unless I did something wrong, which is more than probable...).
import tweepy
import json
import sys

# Authentication details. To obtain these visit dev.twitter.com
consumer_key = 'xxxxxx'
consumer_secret = 'xxxxx'
access_token = 'xxxxx-xxxx'
access_token_secret = 'xxxxxxxx'

# This is the listener, responsible for receiving data
class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        # Also, we convert UTF-8 to ASCII ignoring all bad characters sent by users
        print '#%s: %s' % (decoded['user']['screen_name'], decoded['text'].encode('ascii', 'ignore'))
        print ''
        return True

    def on_error(self, status):
        print status

if __name__ == '__main__':
    l = StdOutListener()
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    print "Showing all new tweets for #Microsoft"

    stream = tweepy.Stream(auth, l)
    stream.filter(track=['Microsoft'])
    sys.stdout = open('Output.txt', 'w')

I think you would be much better off changing StdOutListener and having it write to the file directly. Assigning sys.stdout to a file is... weird. This way, you can still print things for debug output. Also note that file mode "w" will truncate the file when it's opened.
class TweepyFileListener(tweepy.StreamListener):
    def on_data(self, data):
        print "on_data called"
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        msg = '#%s: %s\n' % (
            decoded['user']['screen_name'],
            decoded['text'].encode('ascii', 'ignore'))

        # you should really open the file in __init__
        # You should also use a RotatingFileHandler or this guy will get massive
        with open("Output.txt", "a") as tweet_log:
            print "Received: %s\n" % msg
            tweet_log.write(msg)
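The question also mentions wanting pipe-delimited output and a way to stop after a fixed number of tweets. Here is a minimal sketch of how the listener above could be extended; the field choice and the limit of 100 are my assumptions for illustration:

class PipeDelimitedListener(tweepy.StreamListener):
    def __init__(self, filename='Output.txt', max_tweets=100):
        super(PipeDelimitedListener, self).__init__()
        self.tweet_log = open(filename, 'a')  # opened once, not once per tweet
        self.max_tweets = max_tweets          # assumed limit, for illustration
        self.count = 0

    def on_data(self, data):
        decoded = json.loads(data)
        # pipe-delimited row: screen_name|text, with newlines stripped so each row stays on one line
        text = decoded['text'].encode('ascii', 'ignore').replace('\n', ' ')
        self.tweet_log.write('%s|%s\n' % (decoded['user']['screen_name'], text))
        self.count += 1
        if self.count >= self.max_tweets:
            self.tweet_log.close()
            return False  # returning False from on_data disconnects the stream
        return True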


Avoid twitter stream from getting disconnected

I have a Python script that streams tweets into a csv file. I provide a runtime parameter of 46800 seconds (13 hours), which dictates how long the tweets are streamed into that csv. It was running fine for the specified duration until yesterday, when the script ran for only 7.5 hours and stopped streaming afterwards. I believe there were no tweets about the topic I was streaming for a certain duration, and hence the disconnect happened. So even when people started tweeting about the topic later, the connection didn't get re-established, and the script didn't stream those tweets to the csv file. I had to restart the script in another instance and let it stream the tweets into another csv file. Today I ran into a similar issue: the stream got disconnected after running for 6 hours, so I had to restart again.
But I am not sure if that was the case. Below is the script I used; please advise what could have happened, and how I can avoid it.
import csv
import json
import time

from bs4 import BeautifulSoup
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# consumer_key, consumer_secret, access_token, access_token_secret defined elsewhere

runtime = 46800

class listener(StreamListener):
    def on_data(self, data):
        data1 = json.loads(data)
        time = data1["created_at"]
        tweet = data1["text"]
        tweet1 = BeautifulSoup(tweet, "lxml").get_text()
        url = "https://twitter.com/{}/status/{}".format(data1["user"]["screen_name"], data1["id_str"])
        file = open('MARCH_DATA.csv', 'a')
        csv_writer = csv.writer(file)
        csv_writer.writerow([time, tweet1, url])
        file.close()

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["MTA"], async=True)
time.sleep(runtime)
twitterStream.disconnect()
Thanks
This worked for my streaming exercise.
# the regular imports, as well as this:
from urllib3.exceptions import ProtocolError

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
twitterStream = Stream(auth, listener())

while True:
    try:
        # run synchronously so a dropped connection raises here and the loop reconnects
        twitterStream.filter(track=["MTA"], stall_warnings=True)
    except (ProtocolError, AttributeError):
        continue
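A bare continue reconnects immediately, which can hammer the connection if the problem persists. A variation I'd suggest only as a sketch (the backoff numbers are arbitrary assumptions) waits with an increasing delay before retrying:

import time

delay = 1  # seconds; doubles after each failure, resets on success
while True:
    try:
        twitterStream.filter(track=["MTA"], stall_warnings=True)
        delay = 1
    except (ProtocolError, AttributeError):
        time.sleep(delay)
        delay = min(delay * 2, 320)  # cap the backoff at just over 5 minutes
        continue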

How to get tweets data that contain multiple keywords

I'm trying to accumulate tweet data using this fairly typical code. As you can see, I attempt to track tweets containing 'UniversalStudios', 'Disneyland' OR 'Los Angeles'. But what I really want are tweets that contain the keywords "UniversalStudios", "Disneyland" AND "LosAngeles" all together. Can anyone tell me how to achieve that?
Thanks a lot in advance :)
#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        all_data = json.loads(data)
        tweet = TextBlob(all_data["text"])
        # Add the sentiment data to all_data
        #all_data['sentiment'] = tweet.sentiment
        #print(tweet)
        #print(tweet.sentiment)

        # Open json text file to save the tweets
        with open('tweets.json', 'a') as tf:
            # Write a new line
            tf.write('\n')
            # Write the json data directly to the file
            json.dump(all_data, tf)
            # Alternatively: tf.write(json.dumps(all_data))
        return True

    def on_error(self, status):
        print (status)

if __name__ == '__main__':
    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    #This line filters Twitter Streams to capture data by the keywords: 'UniversalStudios', 'Disneyland', 'LosAngeles'
    stream.filter(languages = ['en'], track=['UniversalStudios','Disneyland', "LosAngeles"])
Twitter's API documentation (see "track") mentions you need spaces between the phrases to mean AND (commas are ORs). I'm not sure how the library you're using handles it, but my bet would be:
track=['UniversalStudios Disneyland LosAngeles']
The quote from the docs:
By this model, you can think of commas as logical ORs, while spaces are equivalent to logical ANDs (e.g. ‘the twitter’ is the AND twitter, and ‘the,twitter’ is the OR twitter).
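As a concrete illustration of that rule, using the phrases from the question (a sketch against the same stream object as above):

# spaces within one phrase = AND: only tweets mentioning all three terms
stream.filter(track=['UniversalStudios Disneyland LosAngeles'])

# separate list items = OR: tweets mentioning any one of the terms
stream.filter(track=['UniversalStudios', 'Disneyland', 'LosAngeles'])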

"IncompleteRead" Error when retrieving Twitter Data using Python

While running this program to retrieve Twitter data using Python 2.7.8 :
#imports
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

#setting up the keys
consumer_key = '…………...'
consumer_secret = '………...'
access_token = '…………...'
access_secret = '……………..'

class TweetListener(StreamListener):
    # A listener handles tweets as they are received from the stream.
    # This is a basic listener that just prints received tweets to standard output
    def on_data(self, data):
        print (data)
        return True

    def on_error(self, status):
        print (status)

#printing all the tweets to the standard output
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
stream = Stream(auth, TweetListener())
t = u"سوريا"
stream.filter(track=[t])
After running this program for 5 hours, I got this error message:
Traceback (most recent call last):
  File "/Users/Mona/Desktop/twitter.py", line 32, in <module>
    stream.filter(track=[t])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweepy/streaming.py", line 316, in filter
    self._start(async)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweepy/streaming.py", line 237, in _start
    self._run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweepy/streaming.py", line 173, in _run
    self._read_loop(resp)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tweepy/streaming.py", line 225, in _read_loop
    next_status_obj = resp.read( int(delimited_string) )
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 543, in read
    return self._read_chunked(amt)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 612, in _read_chunked
    value.append(self._safe_read(chunk_left))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 660, in _safe_read
    raise IncompleteRead(''.join(s), amt)
IncompleteRead: IncompleteRead(0 bytes read, 976 more expected)
Actually, I don't know what to do about this problem!
You should check to see if you're failing to process tweets quickly enough using the stall_warnings parameter.
stream.filter(track=[t], stall_warnings=True)
These messages are handled by Tweepy (check out implementation here) and will inform you if you're falling behind. Falling behind means that you're unable to process tweets as quickly as the Twitter API is sending them to you. From the Twitter docs:
Setting this parameter to the string true will cause periodic messages to be delivered if the client is in danger of being disconnected. These messages are only sent when the client is falling behind, and will occur at a maximum rate of about once every 5 minutes.
In theory, you should receive a disconnect message from the API in this situation. However, that is not always the case:
The streaming API will attempt to deliver a message indicating why a stream was closed. Note that if the disconnect was due to network issues or a client reading too slowly, it is possible that this message will not be received.
The IncompleteRead could also be due to a temporary network issue and may never happen again. If it happens reproducibly after about 5 hours though, falling behind is a pretty good bet.
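Tweepy surfaces those stall warnings to the listener, so one way to see them is a minimal sketch like this (assuming the tweepy 3.x listener interface, which has an on_warning hook):

class TweetListener(StreamListener):
    def on_data(self, data):
        print(data)
        return True

    def on_warning(self, notice):
        # called when Twitter sends a stall warning: the client is falling behind
        print('stall warning received:', notice)

    def on_error(self, status):
        print(status)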
I've just had this problem. The other answer is factually correct: it's almost certainly that your program isn't keeping up with the stream, and you get a stall warning if that's the case.
In my case, I was reading the tweets into postgres for later analysis, across a fairly dense geographic area, as well as keywords (London, in fact, and about 100 keywords). It's quite possible that, even though you're just printing it, your local machine is doing a bunch of other things, and system processes get priority, so the tweets back up until Twitter disconnects you. (This typically manifests as an apparent memory leak: the program grows until it gets killed, or Twitter disconnects it, whichever comes first.)
The thing that made sense here was to push the processing off to a queue. So I used a redis and django-rq solution. It took about 3 hours to implement on dev and then on my production server, including researching, installing, rejigging existing code, being stupid about my installation, testing, and misspelling things as I went.
Install redis on your machine
Start the redis server
Install Django-RQ (or just Install RQ if you're working solely in python)
Now, in your django directory (where appropriate - ymmv for straight python applications) run:
python manage.py rqworker &
You now have a queue! You can add jobs to it by changing your handler like this:
# at the top of the file
import django_rq

Then in your handler section:

    def on_data(self, data):
        django_rq.enqueue(print, data)
        return True
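If you're working solely in Python (no Django), the equivalent with plain RQ would look something like this sketch, assuming a local Redis server on the default port; process_tweet is a hypothetical handler function of your own:

from redis import Redis
from rq import Queue

q = Queue(connection=Redis())  # connects to localhost:6379 by default

    def on_data(self, data):
        # hand the tweet off to a worker; the listener returns immediately
        q.enqueue(process_tweet, data)  # process_tweet: your own (hypothetical) handler
        return True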
As an aside, if you're interested in stuff emanating from Syria, rather than just mentioning Syria, you could add to the filter like this:
stream.filter(track=[t], locations=[35.6626, 32.7930, 42.4302, 37.2182])
That's a very rough geobox centred on Syria, but which will pick up bits of Iraq/Turkey around the edges. Since this is an optional extra, it's worth pointing this out:
Bounding boxes do not act as filters for other filter parameters. For example track=twitter&locations=-122.75,36.8,-121.75,37.8 would match any tweets containing the term Twitter (even non-geo tweets) OR coming from the San Francisco area.
From this answer, which helped me, and the twitter docs.
Edit: I see from your subsequent posts that you're still going down the road of using the Twitter API, so hopefully you got this sorted anyway; with luck this will be useful for someone else too! :)
This worked for me.
from urllib3.exceptions import ProtocolError  # needed for the except clause below

l = StdOutListener()
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, l)

while True:
    try:
        stream.filter(track=['python', 'java'], stall_warnings=True)
    except (ProtocolError, AttributeError):
        continue
A solution is restarting the stream immediately after catching an exception.
# imports
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener

# setting up the keys
consumer_key = "XXXXX"
consumer_secret = "XXXXX"
access_token = "XXXXXX"
access_secret = "XXXXX"

# printing all the tweets to the standard output
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

class TweetListener(StreamListener):
    # A listener handles tweets as they are received from the stream.
    # This is a basic listener that just prints received tweets to standard output
    def on_data(self, data):
        print(data)
        return True

    def on_exception(self, exception):
        print('exception', exception)
        start_stream()

    def on_error(self, status):
        print(status)

def start_stream():
    stream = Stream(auth, TweetListener())
    t = u"سوريا"
    stream.filter(track=[t])

start_stream()
For me, the back-end application the URL points to was returning a plain string directly. I changed it to
return Response(response=original_message, status=200, content_type='application/text')
where at the start I had just returned the text like
return original_message
I think this answer only works for my case.

How to add a location filter to tweepy module

I have found the following piece of code that works pretty well for letting me view, in the Python shell, the standard 1% of the Twitter firehose:
import sys
import tweepy

consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print status.text

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(track=['manchester united'])
How do I add a filter to only parse tweets from a certain location? I've seen people adding GPS to other Twitter-related Python code, but I can't find anything specific to sapi within the Tweepy module.
Any ideas?
Thanks
The streaming API doesn't allow you to filter by location AND keyword simultaneously.
Bounding boxes do not act as filters for other filter parameters. For example track=twitter&locations=-122.75,36.8,-121.75,37.8 would match any tweets containing the term Twitter (even non-geo tweets) OR coming from the San Francisco area.
Source: https://dev.twitter.com/docs/streaming-apis/parameters#locations
What you can do is ask the streaming API for keyword or located tweets and then filter the resulting stream in your app by looking into each tweet.
If you modify the code as follows, you will capture tweets in the United Kingdom, and then filter those tweets to show only the ones that contain "manchester united":
import sys
import tweepy

consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if 'manchester united' in status.text.lower():
            print status.text

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream

sapi = tweepy.streaming.Stream(auth, CustomStreamListener())
sapi.filter(locations=[-6.38, 49.87, 1.77, 55.81])
Juan gave the correct answer. I'm filtering for Germany only using this:
# Bounding boxes for geolocations
# Online-Tool to create boxes (c+p as raw CSV): http://boundingbox.klokantech.com/
GEOBOX_WORLD = [-180,-90,180,90]
GEOBOX_GERMANY = [5.0770049095, 47.2982950435, 15.0403900146, 54.9039819757]
stream.filter(locations=GEOBOX_GERMANY)
This is a pretty crude box that includes parts of some other countries. If you want a finer grain you can combine multiple boxes to fill out the location you need.
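Since the locations parameter takes a flat list of longitude/latitude pairs, multiple boxes can simply be concatenated. A sketch (the Austria box is an assumed rough example, like the ones above):

# each box contributes four values: sw_lon, sw_lat, ne_lon, ne_lat
GEOBOX_AUSTRIA = [9.47, 46.35, 17.16, 49.02]  # assumed rough box, for illustration only

# tweets matching ANY of the boxes come through
stream.filter(locations=GEOBOX_GERMANY + GEOBOX_AUSTRIA)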
It should be noted, though, that filtering by geotag limits the number of tweets quite a bit. This is from roughly 5 million tweets in my test database (the query returns the percentage of tweets that actually contain a geolocation):
> db.tweets.find({coordinates:{$ne:null}}).count() / db.tweets.count()
0.016668392651547598
So only 1.67% of my sample of the 1% stream includes a geotag. However, there are other ways of figuring out a user's location:
http://arxiv.org/ftp/arxiv/papers/1403/1403.2345.pdf
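One cheap approximation along those lines is the free-text location field on the user's profile. A sketch (my own illustration, not from the paper; the field is unverified and often empty):

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # status.user.location is whatever the user typed into their profile
        loc = (status.user.location or '').lower()
        if 'manchester' in loc:
            print status.text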
You can't filter by both while streaming, but you could filter at the output stage if you were writing the tweets to a file.
sapi.filter(track=['manchester united'], locations=['GPS Coordinates'])

Downloading Full JSON data from Tweets Using Rest API and Tweepy, Querying by Tweet ID

Brand new to using tweepy and Twitter's API(s) in general, I've realized (too late) that I made a number of mistakes in collecting some Twitter data. I've been collecting tweets about the Winter Olympics and had been using the Streaming API to filter by search terms. However, instead of retrieving all the data available, I've only retrieved the text, datetime, and tweet ID. An example of the implemented stream listener is below:
import os
import sys
import tweepy

os.chdir('/my/preferred/location/Twitter Olympics/Data')

consumer_key = 'cons_key'
consumer_secret = 'cons_sec'
access_token = 'access_token'
access_secret = 'access_sec'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

# count is used to give an approximation of how many tweets I'm pulling at a given time.
count = []
f = open('feb24.txt', 'a')

class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print 'Running...'
        info = status.text, status.created_at, status.id
        f.write(str(info))
        for i in info:
            count.append(1)

    def on_error(self, status_code):
        print >> sys.stderr, "Encountered error with status code: ", status_code

    def on_timeout(self):
        print >> sys.stderr, "Timeout..."
        return True

sapi = tweepy.streaming.Stream(auth, StreamListener())
sapi.filter(track=["olympics", "olympics 2014", "sochi", "Sochi2014", "sochi 2014", "2014Sochi", "winter olympics"])
An example of the output that is stored in the .txt file is here:
('RT #Visa: There can only be one winner. Soak it in #TeamUSA, this is your #everywhere #Sochi2014 http://t.co/dVKYUln1r7', datetime.datetime(2014, 2, 15, 18, 9, 51), 111111111111111111).
So, here's my question. If I'm able to get the tweet IDs into a list, is there a way to iterate over them to query the Twitter REST API and retrieve the full JSON for each tweet? My hunch is yes, but I'm unsure about the implementation, and mainly about how to save the resulting data as a JSON file (since I've been using .txt files here). Thanks in advance for reading.
Figured it out. For anyone who has made this terrible mistake (just get all the data to begin with!), here's some code with a regular expression that will extract the ID numbers and store them as a list:
import re

# Read in your ugly text file.
tweet_string = open('nameoffile.txt', 'rU')
tweet_string = tweet_string.read()

# Find all the ID numbers with a regex ({18} matches exactly 18 digits).
id_finder = re.compile('[0-9]{18}')

# Go through the tweet_string object and find all
# the IDs that match the regex.
idList = re.findall(id_finder, tweet_string)
Now you can iterate over the list idList and feed each ID as a query to the API (assuming you've authenticated and have an instance of the api class). You can then append the results to a list. Something like:
tweet_list = []
for id in idList:
    tweet = api.get_status(id)
    tweet_list.append(tweet)
An important note: what gets appended to the tweet_list variable is a tweepy Status object. I still need a workaround for that, but the above problem is solved.
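Since the goal was to save full JSON, it may help that tweepy model objects keep the raw API payload on a _json attribute. A minimal sketch of dumping everything to a JSON file (the filename and the wait_on_rate_limit flag are my assumptions, not part of the answer above):

import json

# wait_on_rate_limit makes tweepy sleep through rate-limit windows instead of raising
api = tweepy.API(auth, wait_on_rate_limit=True)

# ._json is the original dict the REST API returned for each status
tweets_json = [api.get_status(tweet_id)._json for tweet_id in idList]

with open('full_tweets.json', 'w') as out:
    json.dump(tweets_json, out)

If the list of IDs is long, tweepy's api.statuses_lookup accepts up to 100 IDs per call, which is far friendlier to the rate limits than one get_status call per tweet.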
