I have a simple for loop:
for tweet in tweets.xpath("/tweets/tweet/content"):
    count = count + 1
    print("Tweet n°%s" % count)
    print("=> " + tweet.text)
    print("===================================")
And I want to know how I can automatically create a variable for every tweet here, each in a different variable: if there are 30 tweets, then 30 different variables are created automatically. I don't know if I'm clear, but thanks for any help!
You should use a list:
from lxml import etree
xmldoc = etree.parse("tweets.xml")
tweets = xmldoc.xpath("/tweets/tweet/content/text()")
To access any of the tweet texts, access them by an index:
print "first one", tweets[0]
print "last one", tweets[-1]
print "number of tweets", len(tweets)
I know you asked for dynamically creating new variable names for the values, but you would then have to pass information about those dynamically created variables on to later processing anyway.
On the other hand, if you consider tweets[1] a "variable name", you have the
solution you asked for.
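If you really want name-like access, a dict keyed by tweet number gives you that without creating variables dynamically. A minimal sketch building on the tweets list above:

tweet_by_number = {i: text for i, text in enumerate(tweets, start=1)}
print(tweet_by_number[1])  # the first tweet, i.e. "Tweet n°1"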
EDIT: Code modified not to reuse variable tweets for multiple purposes.
In your case, you can do:
for count, tweet in enumerate(tweets.xpath("/tweets/tweet/content"), start=1):
    print("Tweet n°%s" % count)
    print("=> " + tweet.text)
    print("===================================")
If you want to store the result, you can do:
contents = dict()  # a new name, so the parsed `tweets` element is not shadowed
for count, tweet in enumerate(tweets.xpath("/tweets/tweet/content"), start=1):
    contents[count] = tweet.text
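Since the keys are just consecutive integers, a plain list built with a comprehension does the same job in one line:

contents = [tweet.text for tweet in tweets.xpath("/tweets/tweet/content")]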
I'm a beginner at Python and I'm trying to gather data from Twitter using the API. I want to gather the username, the date, and the clean tweet text without @username mentions, hashtags and links, and then put it into a dataframe.
I found a way to achieve this using ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split()), but when I implement it in my code, it returns NameError: name 'tweet' is not defined.
Here is my code:
tweets = tw.Cursor(api.search, q=keyword, lang="id", since=date).items()
raw_tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())
data_tweet = [[tweet.user.screen_name, tweet.created_at, raw_tweet] for tweet in tweets]
dataFrame = pd.DataFrame(data=data_tweet, columns=['user', "date", "tweet"])
I know the problem is in the data_tweet, but I don't know how to fix it. Please help me
Thank you.
The problem is actually in the second line:
raw_tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())
Here, you are using tweet.text. However, you have not defined what tweet is yet, only tweets. Also, from reading your third line where you actually define tweet:
for tweet in tweets
I'm assuming you want tweet to be the value you get while iterating through tweets.
So what you have to do is to run both lines through an iterator together, assuming my earlier hypothesis is correct.
So:
data_tweet = []
for tweet in tweets:
    raw_tweet = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",tweet.text).split())
    data_tweet.append([tweet.user.screen_name, tweet.created_at, raw_tweet])
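With the rows collected, the DataFrame line from your own snippet then works unchanged:

dataFrame = pd.DataFrame(data=data_tweet, columns=['user', 'date', 'tweet'])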
You can also use a regex to remove any words that start with '@' (usernames) or 'http' (links) in a pre-defined function, and apply the function to the pandas DataFrame column:
import re

def remove_usernames_links(tweet):
    tweet = re.sub(r'@[^\s]+', '', tweet)
    tweet = re.sub(r'http[^\s]+', '', tweet)
    return tweet

df['tweet'] = df['tweet'].apply(remove_usernames_links)
If you encounter an "expected string or bytes-like object" error, just cast to str first:
import re

def remove_usernames_links(tweet):
    tweet = re.sub(r'@[^\s]+', '', str(tweet))
    tweet = re.sub(r'http[^\s]+', '', str(tweet))
    return tweet

df['tweet'] = df['tweet'].apply(remove_usernames_links)
Credit: https://www.datasnips.com/59/remove-usernames-http-links-from-tweet-data/
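A quick sanity check of the function on a made-up example string (the output shows what the two substitutions above produce):

print(remove_usernames_links("thanks @fred for the tip, see http://example.com for details"))
# -> 'thanks  for the tip, see  for details'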
I am working with Python, attempting to store tweets (more precisely, only their date, user, bio and text) related to a specific keyword in a CSV file.
As I am working with the free-to-use Twitter API, I am limited to 450 tweets every 15 minutes.
So I have coded something which is supposed to store exactly 450 tweets in 15 minutes.
But the problem is that something goes wrong when extracting the tweets, so that at some point the same tweet is stored again and again.
Any help would be much appreciated !!
Thanks in advance
import time
from twython import Twython, TwythonError, TwythonStreamer

twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET)

sfile = "tweets_" + keyword + todays_date + ".csv"
id_list = [last_id]
count = 0
while count < 3*60*60*2:  # we set the loop to run for 3 hours
    # tweet extract method with the last list item as the max_id
    print("new crawl, max_id:", id_list[-1])
    tweets = twitter.search(q=keyword, count=2, max_id=id_list[-1])["statuses"]
    time.sleep(2)  # 2 seconds rest between api calls (450 allowed within 15min window)
    for status in tweets:
        id_list.append(status["id"])  # append tweet ids
        if status == tweets[0]:
            continue
        if status == tweets[1]:
            date = status["created_at"].encode('utf-8')
            user = status["user"]["screen_name"].encode('utf-8')
            bio = status["user"]["description"].encode('utf-8')
            text = status["text"].encode('utf-8')
            with open(sfile, 'a') as sf:
                sf.write(str(status["id"]) + "|||" + str(date) + "|||" + str(user) + "|||" + str(bio) + "|||" + str(text) + "\n")
            count += 1
            print(count)
            print(date, text)
You should use Python's CSV library to write your CSV files. It takes a list containing all of the items for a row and automatically adds the delimiters for you. If a value contains a comma, it automatically adds quotes for you (which is how CSV files are meant to work). It can even handle newlines inside a value. If you open the resulting file into a spreadsheet application you will see it is correctly read in.
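A minimal sketch of that behaviour (the file name and row values are just for illustration):

import csv

with open('example.csv', 'w', newline='') as f:
    csv_output = csv.writer(f)
    # the comma and the newline in the last two values are quoted automatically
    csv_output.writerow(['1111', '01.01.2019', 'Fred', 'I am Fred, hello', 'Line one\nline two'])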
Rather than trying to use time.sleep(), a better approach is to work with absolute times. So the idea is to take your starting time and add three hours to it. You can then keep looping until this finish_time is reached.
The same approach can be applied to your API call allocation. Keep a counter holding how many calls you have left and count it down; if it reaches 0, stop making calls until the next fifteen-minute slot is reached.
timedelta() can be used to add minutes or hours to an existing datetime object. By doing it this way, your times will never slip out of sync.
The following shows a simulation of how you can make things work. You just need to add back your code to get your Tweets:
from datetime import datetime, timedelta
import time
import csv
import random  # just for simulating a random ID

fifteen = timedelta(minutes=15)
finish_time = datetime.now() + timedelta(hours=3)

calls_allowed = 450
calls_remaining = calls_allowed

now = datetime.now()
next_allocation = now + fifteen
todays_date = now.strftime("%d_%m_%Y")

ids_seen = set()

with open(f'tweets_{todays_date}.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    while now < finish_time:
        time.sleep(2)
        now = datetime.now()

        if now >= next_allocation:
            next_allocation += fifteen
            calls_remaining = calls_allowed
            print("New call allocation")

        if calls_remaining:
            calls_remaining -= 1
            print(f"Get tweets - {calls_remaining} calls remaining")

            # Simulate a tweet response
            id = random.choice(["1111", "2222", "3333", "4444"])  # pick a random ID
            date = "01.01.2019"
            user = "Fred"
            bio = "I am Fred"
            text = "Hello, this is a tweet\nusing a comma and a newline."

            if id not in ids_seen:
                csv_output.writerow([id, date, user, bio, text])
                ids_seen.add(id)
As for the problem of writing the same tweets again and again: you can use a set() to hold all of the IDs that you have already written, and test whether a new tweet's ID has been seen before writing it, which is what ids_seen does above.
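To confirm the quoting that csv added for you, you can read the file back; each row comes back as the same list of five fields:

import csv

with open(f'tweets_{todays_date}.csv', newline='') as f_input:
    for row in csv.reader(f_input):
        print(row)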
I'm using tweepy to automatically tweet a list of URLs. However, if my list is too long (it can vary from tweet to tweet), I am not allowed to post. Is there any way that tweepy can create a thread of tweets when the content is too long? My tweepy code looks like this:
import tweepy

def get_api(cfg):
    auth = tweepy.OAuthHandler(cfg['consumer_key'],
                               cfg['consumer_secret'])
    auth.set_access_token(cfg['access_token'],
                          cfg['access_token_secret'])
    return tweepy.API(auth)

def main():
    # Fill in the values noted in previous step here
    cfg = {
        "consumer_key": "VALUE",
        "consumer_secret": "VALUE",
        "access_token": "VALUE",
        "access_token_secret": "VALUE"
    }

    api = get_api(cfg)
    tweet = "Hello, world!"
    status = api.update_status(status=tweet)
    # Yes, the tweet is called 'status', rather confusingly

if __name__ == "__main__":
    main()
Your code isn't relevant to the problem you're trying to solve. Not only does main() not take any arguments (the tweet text?), but you also don't show how you are currently approaching the matter. Consider the following code:
import random

TWEET_MAX_LENGTH = 280

# Sample tweet seed
tweet = """I'm using tweepy to automatically tweet a list of URLs. However if my list is too long (it can vary from tweet to tweet) I am not allowed."""

# Create a list of tweets of random length
tweets = []
for _ in range(10):
    tweets.append(tweet * random.randint(1, 10))

# Print total initial tweet count and a list of lengths for each tweet
print("Initial Tweet Count:", len(tweets), [len(x) for x in tweets])

# Create a list for the formatted tweet texts
to_tweet = []
for tweet in tweets:
    while len(tweet) > TWEET_MAX_LENGTH:
        # Take only the first 280 chars
        cut = tweet[:TWEET_MAX_LENGTH]
        # Save as a separate tweet to post later
        to_tweet.append(cut)
        # Replace the existing 'tweet' variable with the remaining chars
        tweet = tweet[TWEET_MAX_LENGTH:]
    # Keep the last piece (or any tweet already < 280)
    to_tweet.append(tweet)

# Print total final tweet count and a list of lengths for each tweet
print("Formatted Tweet Count:", len(to_tweet), [len(x) for x in to_tweet])
It's separated out as much as possible for ease of interpretation. The gist is that you start with a list of texts to be used as tweets. The variable TWEET_MAX_LENGTH defines where each tweet is split to allow for multi-tweets.
The to_tweet list then contains each tweet, in the order of your initial list, expanded into multiple tweets of <= TWEET_MAX_LENGTH characters.
You can feed that list into your actual tweepy posting call (a sketch follows below). This approach is pretty willy-nilly and doesn't do any checks for maintaining the sequence of split tweets. Depending on how you're implementing your final tweet functions, that might be an issue, but it is also a matter for a separate question.
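One hedged sketch of the posting step: chaining each chunk as a reply to the previous status is what makes it render as a thread. This assumes api is an authenticated tweepy.API as in your get_api(), and that your tweepy version supports the auto_populate_reply_metadata argument:

previous = None
for text in to_tweet:
    if previous is None:
        previous = api.update_status(status=text)
    else:
        # replying to the previous status chains the tweets into a thread
        previous = api.update_status(status=text,
                                     in_reply_to_status_id=previous.id,
                                     auto_populate_reply_metadata=True)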
I have a program set up so it searches tweets based on the hashtag I give it, and I can edit how many tweets to search and display, but I can't figure out how to place the searched tweets into a string. This is the code I have so far:
while True:
    for status in tweepy.Cursor(api.search, q=hashtag).items(2):
        tweet = [status.text]
        print tweet
When this is run, it only outputs one tweet, even though it is set to search for two.
Your code looks like there's nothing to break out of the while loop. One method that comes to mind is to set a variable to an empty list and then, with each tweet, append it to the list.
foo = []
for status in tweepy.Cursor(api.search, q=hashtag).items(2):
    tweet = status.text
    foo.append(tweet)

print foo
Of course, this will print a list. If you want a string instead, use the string join() method. Adjust the last line of code to look like this:
bar = ' '.join(foo)
print bar
I'm trying to get a sorted list or table of users from a loaded dict. I was able to print them as below, but I couldn't figure out how to sort them in descending order according to the number of tweets the user name made in the sample. If I can do that, I might figure out how to track the "to" user as well. Thanks!
tweets = urllib2.urlopen("http://search.twitter.com/search.json?q=ECHO&rpp=100")
tweets_json = tweets.read()
data = json.loads(tweets_json)
for tweet in data['results']:
    print tweet['from_user_name']
    print tweet['to_user_name']
    print
tweets = data['results']
tweets.sort(key=lambda tw: tw['from_user_name'], reverse=True)
This assumes tw['from_user_name'] contains the number of tweets from the given username. If tw['from_user_name'] contains the username instead, then:
from collections import Counter
tweets = data['results']
count = Counter(tw['from_user_name'] for tw in tweets)
tweets.sort(key=lambda tw: count[tw['from_user_name']], reverse=True)
To print the top 10 usernames by the number of tweets they sent, you don't need to sort tweets; note that most_common() returns (name, count) pairs, so they need formatting before joining:
print("\n".join("%s %d" % pair for pair in count.most_common(10)))