I have been trying to write Python code that uses snscrape to retrieve tweets about a hashtag from the last hour, but my code returns an empty dataframe every time I run it.
This is what I have tried so far:
now = datetime.utcnow()
since = now - timedelta(hours=1)
since_str = since.strftime('%Y-%m-%d %H:%M:%S.%f%z')
until_str = now.strftime('%Y-%m-%d %H:%M:%S.%f%z')
# Query tweets with hashtag #SOSREX in the last one hour
query = '#SOSREX Since:' + since_str + ' until:' + until_str
SOSREX_data = []
for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(SOSREX_data)>100:
        break
    else:
        SOSREX_data.append([tweet.date,tweet.user.username,tweet.user.displayname,
                            tweet.content,tweet.likeCount,tweet.retweetCount,
                            tweet.sourceLabel,tweet.user.followersCount,tweet.user.location
                            ])
# Creating a dataframe from the tweets list above
Tweets_data = pd.DataFrame(SOSREX_data,
columns=["Date_tweeted","username","display_name",
"Tweets","Number_of_Likes","Number_retweets",
"Source_of_Tweet",
"number_of_followers","location"
])
print("Tweets_data")
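One possible cause, offered as an assumption rather than a confirmed diagnosis: the query above uses a capitalised Since: operator and passes a full timestamp with spaces, while the search operators are lowercase and since:/until: expect a plain YYYY-MM-DD date. Twitter search has also accepted since_time: and until_time: with Unix timestamps, which suits a one-hour window. The sketch below only builds the query string; the helper name build_hourly_query is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def build_hourly_query(hashtag, now=None):
    """Build a search query covering the hour that ends at `now` (UTC).

    Assumes the search backend accepts the since_time:/until_time:
    operators with Unix timestamps in seconds.
    """
    now = now or datetime.now(timezone.utc)
    since = now - timedelta(hours=1)
    return (f"{hashtag} since_time:{int(since.timestamp())} "
            f"until_time:{int(now.timestamp())}")


query = build_hourly_query("#SOSREX")
print(query)
```

The resulting string would be passed to sntwitter.TwitterSearchScraper(query) as before. Note also that print("Tweets_data") prints the literal text; print(Tweets_data) prints the dataframe.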
Is it possible to exclusively gather Tweets which mention countries by name? I am only gathering Tweets from the US.
I know that Twitter allows us to access context_annotations from the payload, and that context_annotations identifies whether a tweet mentions a country. Here, https://developer.twitter.com/en/docs/twitter-api/annotations/overview, they mention that countries are domain number 160 in context annotations.
I'm wondering if it is possible to exclusively gather Tweets that mention country names. I am not familiar with Tweepy, so I've finally managed to obtain Tweets from the US, but am still unable to specify the code to obtain only tweets which mention countries.
This is my current code:
client = tweepy.Client(bearer_token=bearer_token)
# Specify Query
query = ' "favorite country" place_country:US'
start_time = '2022-03-05T00:00:00Z'
end_time = '2022-03-11T00:00:00Z'
tweets = client.search_all_tweets(query=query, tweet_fields=['context_annotations', 'created_at', 'geo'],
place_fields = ['place_type','geo'], expansions='geo.place_id',
start_time=start_time,
end_time=end_time, max_results=10000)
# Prepare to write to csv file
f = open('tweetSheet.csv','w')
writer = csv.writer(f)
# Write to csv file
for tweet in tweets.data:
    print(tweet.text)
    print(tweet.created_at)
    writer.writerow(['0', tweet.id, tweet.created_at, tweet.text])
# Close csv file
f.close()
has:geo
One way of doing this is to keep only tweets that carry a country attribute.
You can use the has:geo operator in your query instead of the place_country: operator seen in the Twitter Docs. This returns all geo-tagged tweets, and every geo-tagged tweet has a country attribute.
includes
Another way is to check whether the response has a non-empty includes attribute, which is empty when the tweet has no geo attributes: response.includes != {}. If you need the country code, response.includes['places'][0].country_code works just fine (.country gives the country name). This is not very well documented in the Tweepy Docs, so here are all the geo attributes found in the Twitter Docs for a tweet:
twt_geo = 1602695447298162689
twt_no_geo = 1602719044645408768
response = client.get_tweet(
    twt_geo, place_fields=['country', 'country_code', 'place_type', 'name'], expansions=['geo.place_id'])
if(response.includes != {}):
    print(response.includes)
    print(response.includes['places'][0].country)
    print(response.includes['places'][0].country_code)
    print(response.includes['places'][0].place_type)
    print(response.includes['places'][0].name)
    print(response.includes['places'][0].full_name)
    print(response.includes['places'][0])
    print(response.data.geo)
    print(response.data.geo['place_id'])
else:
    print(response.data.id)
Hashtags
If by country mentions you mean tweets that contain country names as hashtags, you can get the tweet text with response.data.text and compare it against the country names you want to keep.
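A minimal sketch of both filtering ideas from this thread, working on plain Python values so it runs without API access. The function names and the sample country list are made up for illustration. The domain id 160 comes from the question, and the annotation layout (a list of dicts with a "domain" entry holding an "id") is an assumption about the v2 payload.

```python
import re

COUNTRY_DOMAIN_ID = "160"  # domain id for countries, as stated in the question


def mentions_country_in_text(text, country_names):
    """True if any country name appears as a whole word (also as a hashtag)."""
    for name in country_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


def has_country_annotation(context_annotations):
    """True if any annotation belongs to the country domain."""
    for annotation in context_annotations or []:
        domain_id = annotation.get("domain", {}).get("id")
        if str(domain_id) == COUNTRY_DOMAIN_ID:
            return True
    return False


countries = ["Japan", "Brazil", "Kenya"]  # illustrative list only
print(mentions_country_in_text("My favorite country is #Japan", countries))
print(has_country_annotation([{"domain": {"id": "160"}, "entity": {"name": "Japan"}}]))
```

With Tweepy, the text would come from tweet.text and the annotations from tweet.context_annotations, provided context_annotations was requested in tweet_fields as in the question.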
I have recently created a Python program that imports my finances from a .csv file and transfers them to Google Sheets. However, I am struggling to figure out how to clean up the names that my bank gives me.
Example:
ME DC SI XXXXXXXXXXXXXXXX NETFLIX should just be NETFLIX,
POS XXXXXXXXXXXXXXXX STEAM PURCHASE should just be STEAM and so on
Forgive me if this is a stupid question as I am a newbie when it comes to coding and I am just looking to use it to automate certain situations in my life.
import csv
from unicodedata import category
import gspread
import time
MONTH = 'June'
# Set month name
file = f'HDFC_{MONTH}_2022.csv'
#the file we need to extract data from
transactions = []
# Create empty list to add data to
def hdfcFin(file):
    '''Create a function that allows us to export data to google sheets'''
    with open(file, mode = 'r') as csv_file:
        csv_reader = csv.reader(csv_file)
        for row in csv_reader:
            date = row[0]
            name = row[1]
            expense = float(row[2])
            income = float(row[3])
            category = 'other'
            transaction = ((date, name, expense, income, category))
            transactions.append(transaction)
        return transactions
sa = gspread.service_account()
# connect json to api
sh = sa.open('Personal Finances')
wks = sh.worksheet(f'{MONTH}')
rows = hdfcFin(file)
for row in rows:
    wks.insert_row([row[0], row[1], row[4], row[2], row[3]], 8)
    time.sleep(2)
    # time delay because of api restrictions
If the names don't follow a specific format, you can use the logic below, which relies on key-value pairs. If a key matches the name, you replace the name with its value.
d={'ME DC SI XXXXXXXXXXXXXXXX NETFLIX':'NETFLIX','POS XXXXXXXXXXXXXXXX STEAM PURCHASE':'STEAM'}
test='POS XXXXXXXXXXXXXXXX STEAM PURCHASE'
if test in d.keys():
    test=d[test]
print(test)
Output:
STEAM
If you only need the last word of the name, you can use the logic below. Note that this works for the NETFLIX example but not for STEAM PURCHASE, where the last word is PURCHASE.
test='ME DC SI XXXXXXXXXXXXXXXX NETFLIX'
test=test.split(" ")[-1]
print(test)
Output:
NETFLIX
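The exact-match lookup above only works when the whole bank string is a key. A small variation, sketched here under the assumption that each merchant can be recognised by a keyword somewhere in the name, checks whether the keyword appears inside the string. The MERCHANTS table and the clean_name helper are made up for illustration.

```python
# Keyword to look for in the bank's description -> name to store in the sheet
MERCHANTS = {
    "NETFLIX": "NETFLIX",
    "STEAM": "STEAM",
}


def clean_name(raw_name, merchants=MERCHANTS):
    """Return the short merchant name if a known keyword occurs in raw_name."""
    upper = raw_name.upper()
    for keyword, label in merchants.items():
        if keyword in upper:
            return label
    return raw_name  # unknown merchant: leave the name untouched


print(clean_name("ME DC SI XXXXXXXXXXXXXXXX NETFLIX"))
print(clean_name("POS XXXXXXXXXXXXXXXX STEAM PURCHASE"))
```

In the question's hdfcFin function this could be applied as name = clean_name(row[1]); new merchants are then handled by adding one entry to the table.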
I have a class assignment to write a Python program that downloads end-of-day data for the last 25 years for the major global stock market indices from Yahoo Finance:
Dow Jones Index (USA)
S&P 500 (USA)
NASDAQ (USA)
DAX (Germany)
FTSE (UK)
HANGSENG (Hong Kong)
KOSPI (Korea)
CNX NIFTY (India)
Unfortunately, when I run the program an error occurs.
File "C:\ProgramData\Anaconda2\lib\site-packages\yahoofinancials\__init__.py", line 91, in format_date
form_date = datetime.datetime.fromtimestamp(int(in_date)).strftime('%Y-%m-%d')
ValueError: timestamp out of range for platform localtime()/gmtime() function
Below is the code that I have written. I'm trying to debug my mistakes. Can you help me out, please? Thanks
from yahoofinancials import YahooFinancials
import pandas as pd
# Select Tickers and stock history dates
index1 = '^DJI'
index2 = '^GSPC'
index3 = '^IXIC'
index4 = '^GDAXI'
index5 = '^FTSE'
index6 = '^HSI'
index7 = '^KS11'
index8 = '^NSEI'
freq = 'daily'
start_date = '1993-06-30'
end_date = '2018-06-30'
# Function to clean data extracts
def clean_stock_data(stock_data_list):
    new_list = []
    for rec in stock_data_list:
        if 'type' not in rec.keys():
            new_list.append(rec)
    return new_list
# Construct yahoo financials objects for data extraction
dji_financials = YahooFinancials(index1)
gspc_financials = YahooFinancials(index2)
ixic_financials = YahooFinancials(index3)
gdaxi_financials = YahooFinancials(index4)
ftse_financials = YahooFinancials(index5)
hsi_financials = YahooFinancials(index6)
ks11_financials = YahooFinancials(index7)
nsei_financials = YahooFinancials(index8)
# Clean returned stock history data and remove dividend events from price history
daily_dji_data = clean_stock_data(dji_financials
    .get_historical_stock_data(start_date, end_date, freq)[index1]['prices'])
daily_gspc_data = clean_stock_data(gspc_financials
    .get_historical_stock_data(start_date, end_date, freq)[index2]['prices'])
daily_ixic_data = clean_stock_data(ixic_financials
    .get_historical_stock_data(start_date, end_date, freq)[index3]['prices'])
daily_gdaxi_data = clean_stock_data(gdaxi_financials
    .get_historical_stock_data(start_date, end_date, freq)[index4]['prices'])
daily_ftse_data = clean_stock_data(ftse_financials
    .get_historical_stock_data(start_date, end_date, freq)[index5]['prices'])
daily_hsi_data = clean_stock_data(hsi_financials
    .get_historical_stock_data(start_date, end_date, freq)[index6]['prices'])
daily_ks11_data = clean_stock_data(ks11_financials
    .get_historical_stock_data(start_date, end_date, freq)[index7]['prices'])
daily_nsei_data = clean_stock_data(nsei_financials
    .get_historical_stock_data(start_date, end_date, freq)[index8]['prices'])
stock_hist_data_list = [{'^DJI': daily_dji_data}, {'^GSPC': daily_gspc_data}, {'^IXIC': daily_ixic_data},
{'^GDAXI': daily_gdaxi_data}, {'^FTSE': daily_ftse_data}, {'^HSI': daily_hsi_data},
{'^KS11': daily_ks11_data}, {'^NSEI': daily_nsei_data}]
# Function to construct data frame based on a stock and its market index
def build_data_frame(data_list1, data_list2, data_list3, data_list4, data_list5, data_list6, data_list7, data_list8):
    data_dict = {}
    i = 0
    for list_item in data_list2:
        if 'type' not in list_item.keys():
            data_dict.update({list_item['formatted_date']: {'^DJI': data_list1[i]['close'], '^GSPC': list_item['close'],
                                                            '^IXIC': data_list3[i]['close'], '^GDAXI': data_list4[i]['close'],
                                                            '^FTSE': data_list5[i]['close'], '^HSI': data_list6[i]['close'],
                                                            '^KS11': data_list7[i]['close'], '^NSEI': data_list8[i]['close']}})
            i += 1
    tseries = pd.to_datetime(list(data_dict.keys()))
    df = pd.DataFrame(data=list(data_dict.values()), index=tseries,
                      columns=['^DJI', '^GSPC', '^IXIC', '^GDAXI', '^FTSE', '^HSI', '^KS11', '^NSEI']).sort_index()
    return df
Your problem is that your datetime stamps are in the wrong format. If you look at the error message, it tells you as much:
datetime.datetime.fromtimestamp(int(in_date)).strftime('%Y-%m-%d')
Notice the int(in_date) part?
It wants a Unix timestamp. There are several ways to get one: the time module, the calendar module, or Arrow.
import datetime
import calendar
date = datetime.datetime.strptime("1993-06-30", "%Y-%m-%d")
start_date = calendar.timegm(date.utctimetuple())
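As a self-contained sketch of the conversion in both directions: the helper names to_unix and from_unix are made up here. The reverse direction uses timedelta arithmetic instead of datetime.fromtimestamp, on the assumption that the platform's fromtimestamp rejects negative (pre-1970) values, as the ValueError in the question suggests.

```python
import calendar
import datetime


def to_unix(date_str):
    """Convert a 'YYYY-MM-DD' string to a Unix timestamp, treating it as UTC."""
    parsed = datetime.datetime.strptime(date_str, "%Y-%m-%d")
    return calendar.timegm(parsed.utctimetuple())


def from_unix(timestamp):
    """Convert a Unix timestamp to 'YYYY-MM-DD' without calling fromtimestamp()."""
    epoch = datetime.datetime(1970, 1, 1)
    return (epoch + datetime.timedelta(seconds=timestamp)).strftime("%Y-%m-%d")


start_date = to_unix("1993-06-30")
end_date = to_unix("2018-06-30")
print(start_date, end_date)
print(from_unix(-86400))  # one day before the epoch
```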
* UPDATED *
OK so I fixed up to the dataframes portion. Here is my current code:
# Select Tickers and stock history dates
index = {'DJI': YahooFinancials('^DJI'),
         'GSPC': YahooFinancials('^GSPC'),
         'IXIC': YahooFinancials('^IXIC'),
         'GDAXI': YahooFinancials('^GDAXI'),
         'FTSE': YahooFinancials('^FTSE'),
         'HSI': YahooFinancials('^HSI'),
         'KS11': YahooFinancials('^KS11'),
         'NSEI': YahooFinancials('^NSEI')}
freq = 'daily'
start_date = '1993-06-30'
end_date = '2018-06-30'
# Clean returned stock history data and remove dividend events from price history
daily = {}
for k in index:
    tmp = index[k].get_historical_stock_data(start_date, end_date, freq)
    if tmp:
        daily[k] = tmp['^{}'.format(k)]['prices'] if 'prices' in tmp['^{}'.format(k)] else []
Unfortunately I had to fix a couple things in the yahoo module. For the class YahooFinanceETL:
@staticmethod
def format_date(in_date, convert_type):
    # Infer the direction from the input: an integer-like value is a Unix
    # timestamp, anything else is treated as a 'YYYY-MM-DD' string
    try:
        x = int(in_date)
        convert_type = 'standard'
    except:
        convert_type = 'unixstamp'
    if convert_type == 'standard':
        if in_date < 0:
            # Negative (pre-1970) timestamps: add to the epoch instead of calling fromtimestamp()
            form_date = datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=in_date)
        else:
            form_date = datetime.datetime.fromtimestamp(int(in_date)).strftime('%Y-%m-%d')
    else:
        split_date = in_date.split('-')
        d = date(int(split_date[0]), int(split_date[1]), int(split_date[2]))
        form_date = int(time.mktime(d.timetuple()))
    return form_date
AND:
# private static method to scrape data from yahoo finance
@staticmethod
def _scrape_data(url, tech_type, statement_type):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    script = soup.find("script", text=re.compile("root.App.main")).text
    data = loads(re.search(r"root.App.main\s+=\s+(\{.*\})", script).group(1))
    if tech_type == '' and statement_type != 'history':
        stores = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]
    elif tech_type != '' and statement_type != 'history':
        stores = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"][tech_type]
    else:
        # Fall back to the quote summary when no price history is present
        if "HistoricalPriceStore" in data["context"]["dispatcher"]["stores"]:
            stores = data["context"]["dispatcher"]["stores"]["HistoricalPriceStore"]
        else:
            stores = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]
    return stores
You will want to look at the daily dict and rewrite your build_data_frame function, which should be a lot simpler now that you are already working with a dictionary.
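A rough sketch of what that rewrite could look like, assuming each entry of daily is a list of records with 'formatted_date' and 'close' keys (the keys the question's build_data_frame reads) and that event records carry a 'type' key. The helper name closes_by_date and the sample values are made up.

```python
def closes_by_date(daily):
    """Collect closing prices as {formatted_date: {ticker: close}}."""
    table = {}
    for ticker, prices in daily.items():
        for rec in prices:
            if 'type' in rec:
                # Skip event records, as clean_stock_data does in the question
                continue
            table.setdefault(rec['formatted_date'], {})[ticker] = rec['close']
    return table


# Made-up records, only to show the shape of the input and output
sample = {
    'DJI': [{'formatted_date': '2018-06-28', 'close': 1.0},
            {'formatted_date': '2018-06-29', 'close': 2.0}],
    'GSPC': [{'formatted_date': '2018-06-29', 'close': 3.0},
             {'formatted_date': '2018-06-29', 'type': 'DIVIDEND'}],
}
table = closes_by_date(sample)
for day in sorted(table):
    print(day, table[day])
```

The result can be handed to pandas with pd.DataFrame.from_dict(table, orient='index').sort_index(). Keying by date rather than by list position means the indices do not need to share the same trading days; dates missing for a ticker simply come out as NaN.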
I am actually the maintainer and author of YahooFinancials. I just saw this post and wanted to personally apologize for the inconvenience and let you all know I will be working on fixing the module this evening.
Could you please open an issue on the module's Github page detailing this?
It would also be very helpful to know which version of python you were running when you encountered these issues.
https://github.com/JECSand/yahoofinancials/issues
I am at work right now, however as soon as I get home in ~7 hours or so I will attempt to code a fix and release it. I'll also work on the exception handling. I try my best to maintain this module, but my day (and often night time) job is rather demanding. I will report back with the final results of these fixes and publish to pypi when it is done and stable.
Also, if anyone else has feedback or personal fixes to offer, it would be a huge help in fixing this. Proper credit will be given, of course. I am also in desperate need of contributors, so if anyone is interested in that as well, let me know. I really want to take YahooFinancials to the next level and have this project become a stable and reliable alternative for free financial data for Python projects.
Thank you for your patience and for using YahooFinancials.
I've been working on the quandl API recently and I've been stuck on an issue for a while.
My question is how to write a method that computes the difference between
one date and the date before for a stock index. The data comes back as
an array, for example [[u'2015-04-30', 17840.52]] for the Dow Jones
Industrial Average. I'd also like a way to get the change for the day
before the latest one, say, getting Friday's value and the change
between that and the day before.
My code:
def fetchData(apikey, url):
    '''Returns JSON data of the Dow Jones Average.'''
    parameters = {'rows' : 1, 'auth_token' : apikey}
    req = requests.get(url, params=parameters)
    data = json.loads(req.content)
    parsedData = []
    stockData = {}
    for datum in data:
        if data['code'] == 'COMP':
            stockData['name'] = data['name']
            stockData['description'] = '''The NASDAQ Composite Index measures all
NASDAQ domestic and international based common type stocks listed on The NASDAQ Stock Market.'''
            stockData['data'] = data['data']
            stockData['code'] = data['code']
        else:
            stockData['name'] = data['name']
            stockData['description'] = data['description']
            stockData['data'] = data['data']
            stockData['code'] = data['code']
        parsedData.append(stockData)
    return parsedData
I've tried tacking [1] onto data to get just the current day, but getting the day before has stumped me.
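One way to approach this, sketched under two assumptions: the request asks for two rows instead of one ('rows': 2 in parameters), and the API returns rows newest first. The helper name latest_change is made up, and the second sample row below is invented purely to make the example runnable.

```python
def latest_change(rows):
    """Return (latest_date, previous_date, change, percent_change).

    Expects rows shaped like [[date, value], ...] with the newest row first.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to compute a change")
    (latest_date, latest_value), (previous_date, previous_value) = rows[0], rows[1]
    change = latest_value - previous_value
    percent_change = change / previous_value * 100
    return latest_date, previous_date, change, percent_change


# First row is from the question; the second row is a made-up value
rows = [[u'2015-04-30', 17840.52], [u'2015-04-29', 18000.00]]
latest_date, previous_date, change, percent_change = latest_change(rows)
print(latest_date, previous_date, round(change, 2), round(percent_change, 2))
```

To get the change for the day before the latest one, the same function can be called on rows[1:] after requesting three rows.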