I am using the Tweepy package in Python to collect tweets. I track several users and collect their latest tweets. For some users I get an error like "Failed to parse JSON payload: ", e.g. "Failed to parse JSON payload: Expecting ',' delimiter or '}': line 1 column 694303 (char 694302)". I noted the user ID and tried to reproduce the error and debug the code. The second time I ran the code for that particular user, I got results (i.e. tweets) with no problem. I adjusted my code so that when I get this error I try once more to extract the tweets. So I might get this error once or twice for a user, but on a second or third attempt the code returns the tweets as usual, without the error. I see similar behaviour for other user IDs too.
My question is: why does this error appear randomly? Nothing else has changed. I searched on the internet but couldn't find a similar report. A snippet of my code follows:
#initialize a list to hold all the tweepy Tweets
alltweets = []
ntries = 0
#make initial request for most recent tweets (200 is the maximum allowed count)
while True:
    try:  #if process fails due to connection problems, retry.
        if beforeid:
            new_tweets = api.user_timeline(user_id=user, count=200, since_id=sinceid, max_id=beforeid)
        else:
            new_tweets = api.user_timeline(user_id=user, count=200, since_id=sinceid)
        break
    except tweepy.error.RateLimitError:
        print "Rate limit error:", sys.exc_info()[0]
        print("Timeout, retry in 5 minutes...\n")
        time.sleep(60 * 5)
        continue
    except tweepy.error.TweepError as er:
        print('TweepError: ' + er.message)
        if er.message == 'Not authorized.':
            new_tweets = []
            break
        else:
            print(str(ntries))
            ntries += 1
            pass
    except:
        print "Unexpected error:", sys.exc_info()[0]
        new_tweets = []
        break
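Since ntries is incremented but never actually checked in the snippet above, the retries are unbounded. A minimal sketch of a capped retry, reusing api, user, sinceid and beforeid from the code above (the MAX_TRIES value and the 5-second pause are my own placeholder choices, and the rate-limit branch is left out for brevity):
import time
import tweepy

MAX_TRIES = 3  # hypothetical cap, not part of the original code
new_tweets = []
for attempt in range(MAX_TRIES):
    try:
        if beforeid:
            new_tweets = api.user_timeline(user_id=user, count=200,
                                           since_id=sinceid, max_id=beforeid)
        else:
            new_tweets = api.user_timeline(user_id=user, count=200,
                                           since_id=sinceid)
        break  # got a page of tweets, stop retrying
    except tweepy.error.TweepError as er:
        # the "Failed to parse JSON payload" error surfaces here as a TweepError
        print('TweepError on attempt {}: {}'.format(attempt + 1, er))
        time.sleep(5)  # short pause before the next attempt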
I'm new to Python, but I don't see much information on Stack Overflow about paginating with the links method. The loop works perfectly in that it pulls all the data I want, but it only breaks when there's a timeout error after my Mac falls asleep. Sometimes it runs for 2 hours until my Mac sleeps. I'm wondering if there's a faster way to retrieve this data? Here is my Python script:
import requests
import pandas as pd

res = []
url = "https://horizon.stellar.org/accounts/GCQJVAXWHB23WBNIG7TWEWHWUGGB6HWBC2ASPF5HMSADO5R5UKI4T7SD/trades"
querystring = {"limit":"200"}
try:
    while True:
        response = requests.request("GET", url, params=querystring)
        data = response.json()
        res += data['_embedded']['records']
        if "href" not in data['_links']['next']:
            break
        url = data['_links']['next']['href']
except Exception as ex:
    print("Exception:", ex)
df = pd.json_normalize(res)
df.to_csv('stellar_t7sd_trades.csv')
It returns with the following:
Exception: ('Connection aborted.', TimeoutError(60, 'Operation timed out'))
But it does write the desired data to the CSV file.
Is there a problem with my loop in that it doesn't properly break when it's done returning the data? I'm just trying to figure out a way so it doesn't run for 2 hours; other than that, I get the desired data.
I solved this by breaking out after a fixed number of loop iterations, n. This only works because I know exactly how many iterations of the loop will pull the data I need.
res = []
url = "https://horizon.stellar.org/accounts/GCQJVAXWHB23WBNIG7TWEWHWUGGB6HWBC2ASPF5HMSADO5R5UKI4T7SD/trades"
querystring = {"limit":"200"}
n = 32
try:
    while n > 0:  #True:
        response = requests.request("GET", url, params=querystring)
        n -= 1
        data = response.json()
        res += data['_embedded']['records']
        if "href" not in data['_links']['next']:
            break
        elif n == 32:
            break
        url = data['_links']['next']['href']
except Exception as ex:
    print("Exception:", ex)
df = pd.json_normalize(res)
df.to_csv('stellar_t7sd_tradestest.csv')
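A stopping condition that does not hard-code the number of pages is to break as soon as Horizon returns a page with no records. This is only a sketch of that idea; it assumes the next link stays present even on the last, empty page (which would explain why the href check alone never fires):
import requests
import pandas as pd

res = []
url = "https://horizon.stellar.org/accounts/GCQJVAXWHB23WBNIG7TWEWHWUGGB6HWBC2ASPF5HMSADO5R5UKI4T7SD/trades"
querystring = {"limit": "200"}

while True:
    data = requests.get(url, params=querystring).json()
    records = data['_embedded']['records']
    if not records:  # an empty page means we've paged past the last trade
        break
    res += records
    url = data['_links']['next']['href']  # follow the pagination link

df = pd.json_normalize(res)
df.to_csv('stellar_t7sd_trades.csv')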
For a data visualization project I need to gather all tweets (would that even be possible?) with a certain hashtag. For this purpose I am using the code below; it uses Tweepy and the REST API. However, it only downloads around 2,500 tweets or fewer. I was wondering how I can get past this limitation: is there a pro subscription or anything else I should purchase, or how should I modify the code?
#!/usr/bin/python
# -*- coding: utf-8 -*-
# this file is configured for rtl language and farsi characters
import sys
import json
import tweepy
from key import *

# imported from the key.py file
API_KEY = KAPI_KEY
API_SECRET = KAPI_SECRET
OAUTH_TOKEN = KOAUTH_TOKEN
OAUTH_TOKEN_SECRET = KOAUTH_TOKEN_SECRET

auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)
if not api:
    print("Can't Authenticate")
    sys.exit(-1)

def write_unicode(text, charset='utf-8'):
    return text.encode(charset)

searchQuery = "#کرونا"       # this is what we're searching for
maxTweets = 100000           # some arbitrary large number
tweetsPerQry = 100           # this is the max the API permits
fName = 'Corona-rest8.txt'   # we'll store the tweets in a text file
sinceId = None
max_id = -1
tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, "wb") as f:
    while tweetCount < maxTweets:
        try:
            if max_id <= 0:
                if not sinceId:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if not sinceId:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                #print(tweet._json["created_at"])
                if str(tweet._json["user"]["location"]) != "":
                    print(tweet._json["user"]["location"])
                myDict = json.dumps(tweet._json["text"], ensure_ascii=False).encode('utf8') + "\n".encode('ascii')
                f.write(myDict)
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            break
print("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))
The tweepy API Reference for api.search() provides a bit of color on this:
Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface.
To answer your question directly: it is not possible to obtain an exhaustive list of tweets from the API (because of several limitations). However, a few scraping-based Python libraries are available to work around these API limitations, like @taspinar's twitterscraper.
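If you stay on the standard Search API, tweepy's Cursor can at least take over the max_id bookkeeping from the manual loop in the question. A rough sketch, assuming Tweepy 3.x on Python 3 and reusing the api, searchQuery, maxTweets and fName variables from above; it does not get around the indexing limit described in the quote, it only simplifies the pagination:
import tweepy

tweets_written = 0
with open(fName, "w", encoding="utf-8") as f:
    # Cursor pages through the results and handles max_id internally
    for tweet in tweepy.Cursor(api.search, q=searchQuery, count=100).items(maxTweets):
        f.write(tweet.text + "\n")
        tweets_written += 1
print("Downloaded {0} tweets".format(tweets_written))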
My question title may not quite match the question content.
statuscode = []
statuscode.append(200)
for x in find_from_sublister(hostname):
    x2 = x.strip()
    url = "http://" + x2
    try:
        req = requests.get(url)
        req1 = str(req.status_code) + " " + str(url) + '\n'
        req2 = str(req.status_code)
        req3 = str(url)
        dict = {req2: req3}
        print " \n " + str(req1)
    except requests.exceptions.RequestException as e:
        print "Can't make the request to this Subdomain " + str(url) + '\n'
for keys, values in dict.iteritems():
    print "finding the url's whose status code is 200"
    if keys == statuscode[0]:
        print values
    else:
        print "po"
I am using the code to do the following:
It finds the subdomains from the Sub-lister (locally).
Then it finds the status code of each subdomain returned by the sublister, with the help of the for loop: for x in find_from_sublister(hostname):
Note: find_from_sublister(hostname) is a function for finding the subdomains with the sublister.
Then it prints the status code with the URL, here: print " \n " + str(req1)
[All goes well, but the problem starts here]
Now what I want is to separate the URLs which have a 200 status code.
I heard this can be done with a dictionary in Python, so I tried to use one. As you can see: dict = {req2 : req3}
I also make a list whose first entry is the value 200.
Then I compare the keys to that entry, here: keys == statuscode[0]
and if they match, it should print all the URLs which have a 200 status code.
But the result I am getting is below:
finding the url's whose status code is 200
po
You can see the value po, which comes from the else branch:
else:
    print "po"
Now the problem is: why am I getting this else value and not the URLs which have status code 200?
I hope I explained it clearly, and I'm hoping someone can walk me through this.
Thanks
Note: I am using Python 2.7.
In your case, you don't even need the dictionary. The comparison keys == statuscode[0] can never match, because your keys are strings (req2 = str(req.status_code)) while statuscode[0] is the integer 200, and the dict is rebuilt with a single pair on every iteration anyway.
I've tried to clean up the code as much as possible, including more descriptive variable names (it's also a bad idea to shadow built-in names like dict).
urls_returning_200 = []
for url in find_from_sublister(hostname):
    url = "http://" + url.strip()
    try:
        response = requests.get(url)
        if response.status_code == 200:
            urls_returning_200.append(url)
        print " \n {} {}\n".format(response.status_code, url)
    except requests.exceptions.RequestException as e:
        print "Can't make the request to this Subdomain {}\n".format(url)
print "finding the url's whose status code is 200"
print urls_returning_200
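If you do want the dictionary the question asked about, a more natural shape is one that maps each status code to the list of URLs that returned it, rather than a single pair rebuilt on every iteration. A sketch in the same Python 2 style, reusing find_from_sublister and hostname from the question (urls_by_status is just an illustrative name):
from collections import defaultdict
import requests

urls_by_status = defaultdict(list)  # e.g. {200: [url1, url2], 404: [url3]}
for url in find_from_sublister(hostname):
    url = "http://" + url.strip()
    try:
        response = requests.get(url)
        urls_by_status[response.status_code].append(url)
    except requests.exceptions.RequestException:
        print "Can't make the request to this Subdomain {}".format(url)

print "finding the url's whose status code is 200"
for url in urls_by_status[200]:
    print url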
I wrote a hiscore checker for a game that I play. Basically, you enter a list of usernames into the .txt file and it outputs the results in found.txt.
However, if the page responds with a 404 it throws an error instead of returning " 0 " as output and continuing with the list.
Example of the script:
#!/usr/bin/python
import urllib2

def get_total(username):
    try:
        req = urllib2.Request('http://services.runescape.com/m=hiscore/index_lite.ws?player=' + username)
        res = urllib2.urlopen(req).read()
        parts = res.split(',')
        return parts[1]
    except urllib2.HTTPError, e:
        if e.code == 404:
            return "0"
    except:
        return "err"

filename = "check.txt"
accs = []
handler = open(filename)
for entry in handler.read().split('\n'):
    if "No Displayname" not in entry:
        accs.append(entry)
handler.close()

for account in accs:
    display_name = account.split(':')[len(account.split(':')) - 1]
    total = get_total(display_name)
    if "err" not in total:
        rStr = account + ' - ' + total
        handler = open('tried.txt', 'a')
        handler.write(rStr + '\n')
        handler.close()
        if total != "0" and total != "49":
            handler = open('found.txt', 'a')
            handler.write(rStr + '\n')
            handler.close()
            print rStr
    else:
        print "Error searching"
        accs.append(account)
print "Done"
The HTTPError exception handler that doesn't seem to be working:
except urllib2.HTTPError, e:
    if e.code == 404:
        return "0"
except:
    return "err"
Error response shown below.
Now I understand the error shown doesn't seem to be related to a 404 response; however, this only occurs with users whose request returns a 404, and any other request works fine. So I can assume the issue is within the 404 response exception.
I believe the issue may lie in the fact that the 404 is a custom page which you get redirected to?
So the original page is " example.com/index.php " but the 404 is " example.com/error.php "?
Not sure how to fix this.
For testing purposes, the format to use is:
ID:USER:DISPLAY
which is placed into check.txt
It seems that total can end up being None. In that case you can't check that it has 'err' in it. To fix the crash, try changing that line to:
if total is not None and "err" not in total:
To be more specific, get_total is returning None, which means that either
parts[1] is None or
except urllib2.HTTPError, e: is executed but e.code is not 404.
In the latter case None is returned as the exception is caught but you're only dealing with the very specific 404 case and ignoring other cases.
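A small rewrite of get_total along those lines, so that it always returns a string, avoids the crash without touching the rest of the script. This is only a sketch of that idea (Python 2, like the question):
import urllib2

def get_total(username):
    try:
        req = urllib2.Request('http://services.runescape.com/m=hiscore/index_lite.ws?player=' + username)
        res = urllib2.urlopen(req).read()
        return res.split(',')[1]
    except urllib2.HTTPError, e:
        if e.code == 404:
            return "0"   # player not on the hiscores
        return "err"     # any other HTTP error, instead of falling through to None
    except Exception:
        return "err"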
I am trying to get all the replies to a particular user, so the replies I want have an in_reply_to_user_id_str of 151791801. I tried to print out all the replies but I'm not sure how; I only managed to print out one of the replies. Can anyone help me print out all of them?
My code is:
for page in tweepy.Cursor(api.user_timeline, id="253346744").pages(1):
    for item in page:
        if item.in_reply_to_user_id_str == "151791801":
            print item.text
            a = api.get_status(item.in_reply_to_status_id_str)
            print a.text
First, find the reply thread of your conversation with your service provider:
# Find the last tweet
for page in tweepy.Cursor(api.user_timeline, id="253346744").pages(1):
    for item in page:
        if item.in_reply_to_user_id_str == "151791801":
            last_tweet = item
The variable last_tweet will contain their last reply to you. From there, you can loop back to your original tweet:
# Loop until the original tweet
while True:
    print(last_tweet.text)
    prev_tweet = api.get_status(last_tweet.in_reply_to_status_id_str)
    last_tweet = prev_tweet
    if not last_tweet.in_reply_to_status_id_str:
        break
It's not pretty, but it gets the job done.
Good luck!
user_name = "@nameofuser"
replies = tweepy.Cursor(api.search, q='to:{} filter:replies'.format(user_name),
                        tweet_mode='extended').items()
while True:
    try:
        reply = replies.next()
        if not hasattr(reply, 'in_reply_to_user_id_str'):
            continue
        if str(reply.in_reply_to_user_id_str) == "151791801":
            logging.info("reply of :{}".format(reply.full_text))
    except tweepy.RateLimitError as e:
        logging.error("Twitter api rate limit reached: {}".format(e))
        time.sleep(60)
        continue
    except tweepy.TweepError as e:
        logging.error("Tweepy error occured: {}".format(e))
        break
    except StopIteration:
        break
    except Exception as e:
        logging.error("Failed while fetching replies: {}".format(e))
        break