Speed up parsing Twitter from json to csv (python)

This is my first post so please bear with me.
I have a large (~1GB) json file of Tweets I collected via Twitter's Streaming API. I am able to successfully parse it out into a CSV with the fields I need; however, it is painfully slow, even with the few entities I am extracting (user id, lat/long, and parsing Twitter's date string to a date/time). What methods could I potentially use to try and speed this up? It currently takes several hours, and I'm anticipating collecting more data....
import ujson
from datetime import datetime
from dateutil import tz
from csv import writer
import time

def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60.
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

start_time = time.time()

with open('G:\Programming Projects\GGS 681\dmv_raw_tweets1.json', 'r') as in_file, \
     open('G:\Programming Projects\GGS 681\dmv_tweets1.csv', 'w') as out_file:
    print >> out_file, 'user_id,timestamp,latitude,longitude'
    csv = writer(out_file)
    tweets_count = 0
    for line in in_file:
        tweets_count += 1
        tweets = ujson.loads(line)
        timestamp = []
        lats = ''
        longs = ''
        for tweet in tweets:
            tweet = tweets
            from_zone = tz.gettz('UTC')
            to_zone = tz.gettz('America/New_York')
            times = tweet['created_at']
        for tweet in tweets:
            times = tweets['created_at']
            utc = datetime.strptime(times, '%a %b %d %H:%M:%S +0000 %Y')
            utc = utc.replace(tzinfo=from_zone)  # comment out to parse to utc
            est = utc.astimezone(to_zone)  # comment out to parse to utc
            timestamp = est.strftime('%m/%d/%Y %I:%M:%S %p')  # use %p to differentiate AM/PM
        for tweet in tweets:
            if tweets['geo'] and tweets['geo']['coordinates'][0]:
                lats, longs = tweets['geo']['coordinates'][:2]
            else:
                pass
        row = (
            tweets['user']['id'],
            timestamp,
            lats,
            longs
        )
        values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
        csv.writerow(values)

end_time = time.time()
print "{} to execute this".format(hms_string(end_time - start_time))

It appears I may have solved this. Looking at the code I was actually running, it looks like my if/else statement below was incorrect.
for tweet in tweets:
    if tweets['geo'] and tweets['geo']['coordinates'][0]:
        lats, longs = tweets['geo']['coordinates'][:2]
    else:
        None
I was using else: None when I should have been using pass or continue. I also removed the inner iteration over tweets from my original code. With those changes it parsed a 60 MB file in about 4 minutes. Still, if anyone has tips for making this any faster, I'm open to suggestions.
Edit: I also used ujson, which significantly increased the speed of loading/dumping the JSON data from Twitter.
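For reference, a minimal single-pass sketch of the same extraction (a sketch only, written for Python 3 with placeholder file names): the timezone objects are created once instead of per tweet, the redundant inner loops are gone, and each line is decoded and written exactly once.

import ujson
from csv import writer
from datetime import datetime
from dateutil import tz

FROM_ZONE = tz.gettz('UTC')              # built once, not inside the loop
TO_ZONE = tz.gettz('America/New_York')

def tweet_to_row(tweet):
    """Pull user id, local timestamp, and coordinates out of one tweet dict."""
    utc = datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    est = utc.replace(tzinfo=FROM_ZONE).astimezone(TO_ZONE)
    lats = longs = ''
    if tweet.get('geo') and tweet['geo']['coordinates'][0]:
        lats, longs = tweet['geo']['coordinates'][:2]
    return (tweet['user']['id'], est.strftime('%m/%d/%Y %I:%M:%S %p'), lats, longs)

with open('dmv_raw_tweets1.json') as in_file, \
     open('dmv_tweets1.csv', 'w', newline='') as out_file:
    csv_out = writer(out_file)
    csv_out.writerow(['user_id', 'timestamp', 'latitude', 'longitude'])
    for line in in_file:  # one JSON object per line, as the Streaming API writes them
        csv_out.writerow(tweet_to_row(ujson.loads(line)))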

Related

How to retrieve only tweets about a hashtag within an hour?

I have been trying to write a Python script that uses snscrape to retrieve tweets about a hashtag from within the last hour, but my code returns an empty dataframe each time.
This is what I have tried so far:
import snscrape.modules.twitter as sntwitter
import pandas as pd
from datetime import datetime, timedelta

now = datetime.utcnow()
since = now - timedelta(hours=1)
since_str = since.strftime('%Y-%m-%d %H:%M:%S.%f%z')
until_str = now.strftime('%Y-%m-%d %H:%M:%S.%f%z')

# Query tweets with hashtag #SOSREX in the last one hour
query = '#SOSREX Since:' + since_str + ' until:' + until_str

SOSREX_data = []
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(SOSREX_data) > 100:
        break
    else:
        SOSREX_data.append([tweet.date, tweet.user.username, tweet.user.displayname,
                            tweet.content, tweet.likeCount, tweet.retweetCount,
                            tweet.sourceLabel, tweet.user.followersCount, tweet.user.location])

# Creating a dataframe from the tweets list above
Tweets_data = pd.DataFrame(SOSREX_data,
                           columns=["Date_tweeted", "username", "display_name",
                                    "Tweets", "Number_of_Likes", "Number_retweets",
                                    "Source_of_Tweet",
                                    "number_of_followers", "location"])
print("Tweets_data")
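For comparison, a hedged sketch of one way to narrow the window without relying on datetime-formatted search operators: query with the lowercase since:/until: operators (which accept plain YYYY-MM-DD dates) and enforce the one-hour cutoff on tweet.date in Python. The hashtag and the 100-tweet cap follow the question; whether the scraper still returns results depends on the snscrape and Twitter versions in use.

from datetime import datetime, timedelta, timezone

import pandas as pd
import snscrape.modules.twitter as sntwitter

now = datetime.now(timezone.utc)
cutoff = now - timedelta(hours=1)

# Date-only operators; the hour-level filtering happens on tweet.date below.
query = '#SOSREX since:{} until:{}'.format(
    cutoff.strftime('%Y-%m-%d'),
    (now + timedelta(days=1)).strftime('%Y-%m-%d'))

rows = []
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if tweet.date < cutoff:      # results are typically newest-first, so stop here
        break
    rows.append([tweet.date, tweet.user.username, tweet.content])
    if len(rows) > 100:
        break

Tweets_data = pd.DataFrame(rows, columns=["Date_tweeted", "username", "Tweets"])
print(Tweets_data)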

ValueError: Excel does not support datetimes with timezones

When I try to run my Streamlit app, which has this function:
def get_tweets(Topic, Count):
    i = 0
    #my_bar = st.progress(100) # To track progress of Extracted tweets
    for tweet in tweepy.Cursor(api.search_tweets, q=Topic, count=100, lang="en", exclude='retweets').items():
        time.sleep(0.1)
        #my_bar.progress(i)
        df.loc[i, "Date"] = tweet.created_at
        df.loc[i, "User"] = tweet.user.name
        df.loc[i, "IsVerified"] = tweet.user.verified
        df.loc[i, "Tweet"] = tweet.text
        df.loc[i, "Likes"] = tweet.favorite_count
        df.loc[i, "RT"] = tweet.retweet_count
        df.loc[i, "User_location"] = tweet.user.location
        df.to_csv("TweetDataset.csv", index=False)
        df.to_excel('{}.xlsx'.format("TweetDataset"), index=False)  ## Save as Excel
        i = i + 1
        if i > Count:
            break
        else:
            pass
I get this error:
ValueError: Excel does not support datetimes with timezones. Please ensure that datetimes are timezone unaware before writing to Excel.
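One common workaround, sketched below under the assumption that the Date column holds timezone-aware datetimes from tweepy, is to strip the timezone before calling to_excel (pandas' Excel writers refuse tz-aware values; openpyxl must be installed for .xlsx output):

from datetime import datetime, timezone

import pandas as pd

# Stand-in for a tweepy created_at value, which is timezone-aware (UTC).
created_at = datetime(2023, 1, 1, 12, 30, tzinfo=timezone.utc)

df = pd.DataFrame({"Date": [created_at], "Tweet": ["example"]})

# Excel cannot store timezone-aware datetimes, so drop the tzinfo first.
df["Date"] = pd.to_datetime(df["Date"], utc=True).dt.tz_localize(None)

df.to_excel("TweetDataset.xlsx", index=False)  # now succeeds

Equivalently, the loop in the question could store tweet.created_at.replace(tzinfo=None) in the Date column so the values are naive from the start.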

ValueError: time data 'None' does not match format '%Y-%m-%dT%H:%M:%S.%f'

For the node 'TransactionDate' I have some logic to apply before updating it for policy "POL000002NGJ".
The logic I am trying to implement: if the existing 'TransactionDate' is earlier than today, add 5 days to the current value and write it back to the XML.
Transaction date format in the XML: 2020-03-23T10:56:15.00
Please note that if I parse the DateTime value as below, it works fine, but I don't want to hardcode the value. I want to read it from the XML and handle any datetime in the format "%Y-%m-%dT%H:%M:%S.%f".
# <TransactionDate>
today = datetime.now()
TransactionDate = doc.find('TransactionDate')
Date = '2020-03-24T10:56:15.00'
previous_update = datetime.strptime(Date, "%Y-%m-%dT%H:%M:%S.%f")
if previous_update < today:
    today = previous_update - timedelta(days=-5)
    TransactionDate = today.strftime("%Y-%m-%dT%H:%M:%S.%f")
In the code below, where I parse the value as a DateTime object instead, I have an issue. I got stuck here and referenced other answers on Stack Overflow and in Python forums, but I still cannot resolve it.
Any help fixing this would be greatly appreciated. The code uses lxml; I have already completed the other nodes. My understanding is that the Date variable is coming back as None, but I am stuck on how to fix it.
# <TransactionDate>
today = datetime.now()
TransactionDate = doc.find('TransactionDate')
Date = str(TransactionDate)
previous_update = datetime.strptime(Date, "%Y-%m-%dT%H:%M:%S.%f")
if previous_update < today:
    today = previous_update - timedelta(days=-5)
    TransactionDate = today.strftime("%Y-%m-%dT%H:%M:%S.%f")
Full code is below:
from lxml import etree
from datetime import datetime, timedelta
import random, string

doc = etree.parse(r'C:\Users\python.xml')

# <PolicyId> - Random generated policy number
Policy_Random_Choice = 'POL' + ''.join(random.choices(string.digits, k=6)) + 'NGJ'

# <TransactionDate>
today = datetime.now()
TransactionDate = doc.find('TransactionDate')
Date = str(TransactionDate)
previous_update = datetime.strptime(Date, "%Y-%m-%dT%H:%M:%S.%f")
if previous_update < today:
    today = previous_update - timedelta(days=-5)
    TransactionDate = today.strftime("%Y-%m-%dT%H:%M:%S.%f")

# Parsing the variables
replacements = [Policy_Random_Choice, TransactionDate]
targets = doc.xpath('//ROW[PolicyId="POL000002NGJ"]')
for target in targets:
    target.xpath('./PolicyId')[0].text = replacements[0]
    target.xpath('.//TransactionDate')[0].text = replacements[1]

print(etree.tostring(doc).decode())
Sample XML
<TABLE>
  <ROW>
    <PolicyId>POL000002NGJ</PolicyId>
    <BusinessCoverageCode>COV00002D3X1</BusinessCoverageCode>
    <TransactionDate>2020-03-23T10:56:15.00</TransactionDate>
  </ROW>
  <ROW>
    <PolicyId>POL111111NGJ</PolicyId>
    <BusinessCoverageCode>COV00002D3X4</BusinessCoverageCode>
    <TransactionDate>2020-03-23T10:56:15.00</TransactionDate>
  </ROW>
</TABLE>
Maybe the find method is wrong. Try this one
# <TransactionDate>
today = datetime.now()
TransactionDate = doc.xpath('//ROW/TransactionDate') # Change find to xpath
Date = str(TransactionDate[0].text) # Use the first one
previous_update = datetime.strptime(Date, "%Y-%m-%dT%H:%M:%S.%f")
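Putting that together with the date logic from the question, a hedged end-to-end sketch (path, policy id, and the 5-day rule all taken from the question) might look like this:

from datetime import datetime, timedelta
from lxml import etree

doc = etree.parse(r'C:\Users\python.xml')

today = datetime.now()

# Scope the lookup to the policy being updated instead of the whole document.
node = doc.xpath('//ROW[PolicyId="POL000002NGJ"]/TransactionDate')[0]
previous_update = datetime.strptime(node.text, "%Y-%m-%dT%H:%M:%S.%f")

if previous_update < today:
    # Same effect as subtracting timedelta(days=-5): push the date 5 days forward.
    node.text = (previous_update + timedelta(days=5)).strftime("%Y-%m-%dT%H:%M:%S.%f")

print(etree.tostring(doc).decode())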

Adding time to x axis via python

I am writing a script that collects my Fitbit data through the API; an extract is shown below. Does anyone know how to display the x axis as 24-hour time? The program creates a CSV, and the file does have Date and Time fields, but I could not get them to display on the graph.
(I have deleted the beginning of this code, but it just contained the imports and the CLIENT_ID and CLIENT_SECRET values.)
server = Oauth2.OAuth2Server(CLIENT_ID, CLIENT_SECRET)
server.browser_authorize()

ACCESS_TOKEN = str(server.fitbit.client.session.token['access_token'])
REFRESH_TOKEN = str(server.fitbit.client.session.token['refresh_token'])

auth2_client = fitbit.Fitbit(CLIENT_ID, CLIENT_SECRET, oauth2=True,
                             access_token=ACCESS_TOKEN, refresh_token=REFRESH_TOKEN)

yesterday = str((datetime.datetime.now() - datetime.timedelta(days=1)).strftime("%Y%m%d"))
yesterday2 = str((datetime.datetime.now() - datetime.timedelta(days=1)).strftime("%Y-%m-%d"))
yesterday3 = str((datetime.datetime.now() - datetime.timedelta(days=1)).strftime("%d/%m/%Y"))
today = str(datetime.datetime.now().strftime("%Y%m%d"))

fit_statsHR = auth2_client.intraday_time_series('activities/heart',
                                                base_date=yesterday2, detail_level='15min')

time_list = []
val_list = []
for i in fit_statsHR['activities-heart-intraday']['dataset']:
    val_list.append(i['value'])
    time_list.append(i['time'])

heartdf = pd.DataFrame({'Heart Rate': val_list, 'Time': time_list, 'Date': yesterday3})
heartdf.to_csv('/Users/zabiullahmohebzadeh/Desktop/python-fitbit-master/python-fitbit-master/Data/HeartRate - ' +
               yesterday + '.csv',
               columns=['Date', 'Time', 'Heart Rate'], header=True,
               index=False)

plt.plot(val_list, 'r-')
plt.ylabel('Heart Rate')
plt.show()
You can pass your x-values to plt.plot:
plt.plot(time_list, val_list, 'r-')
Without knowing how your time_list is formatted I can't advise on the best way to get it into 24hr time I'm afraid.
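If the time values are 'HH:MM:SS' strings (which is what the intraday dataset appears to return, an assumption here), one approach is to parse them into datetimes and let matplotlib format the axis in 24-hour time:

from datetime import datetime

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Sample data standing in for the lists built from the intraday response.
time_list = ['00:00:00', '00:15:00', '00:30:00', '00:45:00']
val_list = [62, 64, 61, 66]

times = [datetime.strptime(t, '%H:%M:%S') for t in time_list]

fig, ax = plt.subplots()
ax.plot(times, val_list, 'r-')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))  # 24-hour tick labels
ax.set_ylabel('Heart Rate')
plt.show()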

Python & Tweepy - How to compare and change times.

I am trying to create a number of constraints for some other code based on twitter handle sets.
I am having issues with the following code because:
TypeError: can't compare datetime.datetime to str
It seems that even though I initially convert last_post to a datetime object, by the time I compare it to datetime.datetime.today() it has turned into a string. Yes, I have checked to make sure that last_post is converting properly. I'm not really sure what is going on. Help?
for handle in handles:
    try:
        user = api.get_user(handle)
        #print json.dumps(user, indent = 4)
        verified = user["verified"]
        name = user['name']
        language = user['lang']
        follower_count = user['followers_count']
        try:
            last_post = user['status']['created_at']
            last_post = datetime.strptime(last_post, '%a %b %d %H:%M:%S +0000 %Y')
        except:
            last_post = "User has not posted ever"
        location = user['location']
        location_ch = location_check(location)
        if location_ch is not "United States":
            location_output.append(False)
        else:
            location_output.append(True)
        new_sum.append(follower_count)
        if language is not "en":
            lang_output.append(False)
        else:
            lang_output.append(True)
        if datetime.datetime.today() - datetime.timedelta(days=30) > last_post:
            recency.append(False)
        else:
            recency.append(True)
I think you need to convert the twitter date to a timestamp:
import time
ts = time.strftime('%Y-%m-%d %H:%M:%S', time.strptime(tweet['created_at'],'%a %b %d %H:%M:%S +0000 %Y'))
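Another way to avoid the TypeError in the loop above, sketched here as a suggestion rather than a drop-in fix: keep last_post as a datetime (or None when the user has never posted) instead of a sentinel string, and only do the date arithmetic when it is set.

from datetime import datetime, timedelta

def is_recent(last_post, days=30):
    """Return True if the account posted within the last `days` days.

    last_post is a datetime parsed from Twitter's created_at string,
    or None when the user has never posted.
    """
    if last_post is None:
        return False
    return last_post > datetime.utcnow() - timedelta(days=days)

# Example with Twitter's created_at format:
last_post = datetime.strptime('Mon Mar 23 10:56:15 +0000 2020',
                              '%a %b %d %H:%M:%S +0000 %Y')
print(is_recent(last_post))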
