Download intraday historical stock data - python

I want to download historical intraday stock data. I've found AlphaVantage offers two years of data. It's the longest history of data that I've found for free.
I'm writing a script to download the full two years of data for every ticker symbol they offer, in all timeframes. They provide the data in 30-day slices counted back from the current day (or the last trading day, I'm not sure), with rows ordered from newest to oldest timestamp. I want to reverse that order and concatenate all the months so the column headers appear only once, giving me a single CSV file per stock and timeframe with two years of data ordered from oldest to newest timestamp.
The problem is that I also want to use the script to update the data, and I don't know how to append only the rows that don't already appear in my files. The data I've downloaded runs from 2020-09-28 07:15:00 to 2020-10-26 20:00:00 in 15-minute intervals (where they exist; some are missing). When I run the script again I'd like it to update the data: drop the rows that are already present and append only the rest. So if the last datetime in the file is, for example, 2020-10-26 20:00:00, it should continue appending from 2020-10-26 20:15:00 if that row exists. How can I update the data correctly?
Also, when updating, if the file already exists the script writes the column headers again, which I don't want. Edit: I've solved this with header=(not os.path.exists(file)), but checking whether the file exists on every iteration seems very inefficient.
I also have to make the script comply with the API limits of 5 calls per minute and 500 calls per day. Is there a way for the script to stop when it reaches the daily limit and pick up from that point the next time it runs? Or should I just add a 173-second sleep between API calls?
import os
import glob
import pandas as pd
from typing import List
from requests import get
from pathlib import Path
import os.path
import sys

BASE_URL = 'https://www.alphavantage.co/'

def download_previous_data(
    file: str,
    ticker: str,
    timeframe: str,
    slices: List,
):
    for _slice in slices:
        url = f'{BASE_URL}query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol={ticker}&interval={timeframe}&slice={_slice}&apikey=demo&datatype=csv'
        pd.read_csv(url).iloc[::-1].to_csv(file, mode='a', index=False, encoding='utf-8-sig')

def main():
    # Get a list of all ticker symbols
    print('Downloading ticker symbols:')
    #df = pd.read_csv('https://www.alphavantage.co/query?function=LISTING_STATUS&apikey=demo')
    #tickers = df['symbol'].tolist()
    tickers = ['IBM']
    timeframes = ['1min', '5min', '15min', '30min', '60min']
    # To download the data in a subdirectory where the script is located
    modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
    # Make sure the download folders exist
    for timeframe in timeframes:
        download_path = f'{modpath}/{timeframe}'
        #download_path = f'/media/user/Portable Drive/Trading/data/{timeframe}'
        Path(download_path).mkdir(parents=True, exist_ok=True)
    # For each ticker symbol download all data available for each timeframe
    # except for the last month which would be incomplete.
    # Each download iteration has to be in a 'try except' in case the ticker symbol isn't available on alphavantage
    for ticker in tickers:
        print(f'Downloading data for {ticker}...')
        for timeframe in timeframes:
            download_path = f'{modpath}/{timeframe}'
            filepath = f'{download_path}/{ticker}.csv'
            # NOTE:
            # To ensure optimal API response speed, the trailing 2 years of intraday data is evenly divided into 24 "slices" - year1month1, year1month2,
            # year1month3, ..., year1month11, year1month12, year2month1, year2month2, year2month3, ..., year2month11, year2month12.
            # Each slice is a 30-day window, with year1month1 being the most recent and year2month12 being the farthest from today.
            # By default, slice=year1month1
            if Path(filepath).is_file():  # if the file already exists
                # download the previous to last month
                slices = ['year1month2']
                download_previous_data(filepath, ticker, timeframe, slices)
            else:  # if the file doesn't exist
                # download the two previous years
                #slices = ['year2month12', 'year2month11', 'year2month10', 'year2month9', 'year2month8', 'year2month7', 'year2month6', 'year2month5', 'year2month4', 'year2month3', 'year2month2', 'year2month1', 'year1month12', 'year1month11', 'year1month10', 'year1month9', 'year1month8', 'year1month7', 'year1month6', 'year1month5', 'year1month4', 'year1month3', 'year1month2']
                slices = ['year1month2']
                download_previous_data(filepath, ticker, timeframe, slices)

if __name__ == '__main__':
    main()

You have an awful lot of questions within your question!
These are suggestions for you to try, but I have no way to test them:
Read all your file names into a list (or set) once and check new names against that, rather than pinging the OS on every iteration.
Read the data from the existing file, append everything in pandas, and write out a new file. I can't tell whether you're appending the CSV files successfully, but if you're having difficulty there, just read the data and append the new data until you figure out how to append correctly. Or save each new download to its own file and consolidate the files later.
Look into drop_duplicates() if you're concerned about duplicate rows (a minimal sketch combining this with rate limiting follows this list).
Look into time.sleep() from the time module in your loops to throttle the API calls.
If you have 1min data you can resample() it to 5min, 15min, etc. rather than downloading all of those timeframes separately.
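Putting a few of these suggestions together, here is a minimal sketch of the read, concatenate, de-duplicate, rewrite approach for your update problem. It assumes the extended-intraday CSV has a 'time' column (check the header of your actual downloads); the helper name update_file and the exact sleep interval are illustrative, not anything AlphaVantage prescribes:
import os
import time

import pandas as pd

def update_file(filepath, url):
    # Download one slice; AlphaVantage returns rows newest-first, so reverse them.
    new = pd.read_csv(url).iloc[::-1]
    if os.path.exists(filepath):
        old = pd.read_csv(filepath)
        merged = pd.concat([old, new], ignore_index=True)
    else:
        merged = new
    # Keep each timestamp once and keep everything in oldest-to-newest order.
    merged = merged.drop_duplicates(subset='time', keep='last').sort_values('time')
    # Rewriting the whole file means the header is written exactly once.
    merged.to_csv(filepath, index=False, encoding='utf-8-sig')

# Rate limits: 5 calls/minute is one call every 12+ seconds, and 500 calls/day
# works out to one call every ~173 seconds (86400 / 500 = 172.8), so
# time.sleep(173) after each request respects both limits in a single long run.
A call to update_file(filepath, url) could then replace the to_csv line inside download_previous_data.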

Related

Converting Glob.glob file into a pandas dataframe and append to an excel sheet

Here is the code I am working on to automate a daily task, but I'm stuck on picking up files whose names change daily.
df = pd.read_excel('Sales Data 2020.xlsx')
dbt = pd.read_excel('Sales Data 2020.xlsx', sheet_name='January')
test = glob.glob('January Sales 29122020.xlsx')
Every day I get sales data in this format January Sales 29122020. I am trying to have all the data copied and pasted over to 'Sales Data 2020.xlsx', sheet_name='January'.
The difficulty is that the date in the file name changes daily, e.g. January Sales 30122020, and there are 20 other files just like this whose data I need to copy into their relevant tabs. I looked at wildcards to match the parts of the file name that do not change.
As for the code, I am stuck because I need to read January Sales 29122020.xlsx into a dataframe, which I then don't know how to append/concat with the dbt variable.
This uses a wildcard to find all files starting with January Sales and then concatenates the dataframes:
import glob
import pandas as pd
pd.concat([pd.read_excel(name, sheet_name='January') for name in glob.glob("January Sales*.xlsx")])
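If you then want to push the consolidated data back into 'Sales Data 2020.xlsx', a sketch along these lines might work; it assumes openpyxl is installed and pandas 1.3+ (for the if_sheet_exists option), and it replaces the 'January' sheet rather than appending below existing rows:
import glob

import pandas as pd

# Consolidate every "January Sales *.xlsx" file into one dataframe.
january = pd.concat(
    [pd.read_excel(name, sheet_name='January') for name in glob.glob("January Sales*.xlsx")],
    ignore_index=True,
)

# Write it back, replacing the 'January' sheet of the master workbook.
with pd.ExcelWriter('Sales Data 2020.xlsx', engine='openpyxl',
                    mode='a', if_sheet_exists='replace') as writer:
    january.to_excel(writer, sheet_name='January', index=False)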

Python: Issue with rapidly reading and writing excel files after web scraping? Works for a bit then weird issues come up

So I developed a script that would pull data from a live-updated site tracking coronavirus data. I set it up to pull data every 30 minutes but recently tested it on updates every 30 seconds.
The idea is that it creates the request to the site, pulls the html, creates a list of all of the data I need, then restructures into a dataframe (basically it's the country, the cases, deaths, etc.).
Then it takes each row and appends it to the corresponding one of the 123 Excel files for the various countries. This works well for, I believe, somewhere in the range of 30-50 iterations before it causes file corruption or strange data entries.
I have my code below. I know it's poorly written (my initial reasoning was that I felt confident I could set it up quickly and I wanted to collect data quickly; unfortunately I overestimated my abilities, but now I want to learn what went wrong). Below my code I'll include sample output.
Please note that the 30-second interval is only for quick testing; I don't usually send that many requests for months at a time. I just wanted to see what the issue was. It was originally set to pull every 30 minutes when I detected this issue.
See below for the code:
import schedule
import time

def RecurringProcess2():
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import datetime
    import numpy as np
    from os import listdir
    import os
    try:
        extractTime = datetime.datetime.now()
        extractTime = str(extractTime)
        print("Access Initiated at " + extractTime)
        link = 'https://www.worldometers.info/coronavirus/'
        response = requests.get(link)
        soup = BeautifulSoup(response.text,'html.parser').findAll('td')#[1107].get_text()
        table = pd.DataFrame(columns=['Date and Time','Country','Total Cases','New Cases','Total Deaths','New Deaths','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop'])
        soupList = []
        for i in range(1107):
            value = soup[i].get_text()
            soupList.insert(i,value)
        table = np.reshape(soupList,(123,-1))
        table = pd.DataFrame(table)
        table.columns=['Country','Total Cases','New Cases (+)','Total Deaths','New Deaths (+)','Total Recovered','Active Cases','Serious Critical','Total Cases/1M pop']
        table['Date & Time'] = extractTime
        #Below code is run once to generate the initial files. That's it.
        # for i in range(122):
        #     fileName = table.iloc[i,0] + '.xlsx'
        #     table.iloc[i:i+1,:].to_excel(fileName)
        FilesDirectory = 'D:\\Professional\\Coronavirus'
        fileType = '.csv'
        filenames = listdir(FilesDirectory)
        DataFiles = [filename for filename in filenames if filename.endswith(fileType)]
        for file in DataFiles:
            countryData = pd.read_csv(file, index_col=0)
            MatchedCountry = table.loc[table['Country'] == str(file)[:-4]]
            if file == ' USA .csv':
                print("Country Data Rows: ", len(countryData))
                if os.stat(file).st_size < 1500:
                    print("File Size under 1500")
            countryData = countryData.append(MatchedCountry)
            countryData.to_csv(FilesDirectory+'\\'+file, index=False)
    except:
        pass
    print("Process Complete!")
    return

schedule.every(30).seconds.do(RecurringProcess2)

while True:
    schedule.run_pending()
    time.sleep(1)
When I check the output after some number of iterations (usually successful for around 30-50), the file has either kept only 2 rows and lost all the others, or it keeps appending while the row above loses one entry, the row above that loses two entries, and so on (essentially forming a triangle of sorts).
Above the affected rows there would be a few hundred empty rows. Does anyone have an idea of what is going wrong here? I'd consider this a failed attempt, but I'd still like to learn from it. I appreciate any help in advance.
Hi, as per my understanding the webpage only has one table element, so my suggestion would be to use the pandas read_html method, as it returns a clean and structured table.
Try the code below; you can modify it to run on your schedule:
import requests
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
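Starting from that dataframe, the per-country append can stay close to your existing loop. A sketch follows; the 'Country,Other' column name is an assumption based on the current page layout, so inspect df.columns before relying on it:
import datetime
import os

import pandas as pd
import requests

url = 'https://www.worldometers.info/coronavirus/'
df = pd.read_html(requests.get(url).content)[-1]
df['Date & Time'] = str(datetime.datetime.now())

# Append one country's row to its CSV, writing the header only when the file is new.
row = df[df['Country,Other'] == 'USA']   # column name is an assumption; check df.columns
row.to_csv('USA.csv', mode='a', index=False, header=not os.path.exists('USA.csv'))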
Disclaimer: I'm still evaluating this solution. So far it works almost perfectly for 77 rows.
Originally I had set the script up to run for .xlsx files. I converted everything to .csv but retained the index column code:
countryData = pd.read_csv(file,index_col=0)
I started realizing that things were being ordered differently every time the script ran. I have since removed that from the code and so far it works. Almost.
Unnamed: 0 Unnamed: 0.1
0 7
7
For some reason I have the above output in every file. I don't know why. But it's in the first 2 columns yet it still seems to be reading and writing correctly. Not sure what's going on here.
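A likely explanation (an assumption, since the full write path isn't shown): to_csv writes the dataframe index as an unnamed leading column by default, and re-reading that file without index_col=0 turns it into an 'Unnamed: 0' column; a further round trip then produces 'Unnamed: 0.1', and so on. Writing with index=False avoids it entirely:
import pandas as pd

df = pd.DataFrame({'Country': ['USA'], 'Total Cases': [123]})

df.to_csv('USA.csv')                       # index written as an extra, unnamed column
print(pd.read_csv('USA.csv').columns)      # ['Unnamed: 0', 'Country', 'Total Cases']

df.to_csv('USA.csv', index=False)          # no index column written
print(pd.read_csv('USA.csv').columns)      # ['Country', 'Total Cases']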

Reading files from folder and appending it to xlsx file

I have a folder that has, say, a few hundred files and is growing every hour. I am trying to consolidate all the data into a single file for analysis. But the script I wrote is not very effective, as it reads all the content in the folder and appends it to an xlsx file, so the processing time is simply too long.
What I'm seeking is to enhance and improve my script:
1) To read and extract data only from new files that have not been previously read
2) To append that data to the existing xlsx file
I just need some enlightenment to help me improve the script.
Part of my code is as follows
import pandas as pd
import numpy as np
import os
import dask.dataframe as dd
import glob
import schedule
import time
import re
import datetime as dt
def job():
    # Select the path to download the files
    path = r'V:\DB\ABCD\BEFORE\8_INCHES'
    files = glob.glob(path + "/*.csv")
    df = None
    # Extracting of information from files
    for i, file in enumerate(files):
        if i == 0:
            df = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
            df['Path'] = file
            df['Machine No'] = re.findall("MC-11", str(df["Path"]))
            df['Process'] = re.findall("ABCD", str(df["Path"]))
            df['Before/After'] = re.findall("BEFORE", str(df["Path"]))
            df['Wafer Size'] = re.findall("8_INCHES", str(df["Path"]))
            df['Employee ID'] = df["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
            df['Date'] = df["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
            df['Lot Number'] = df["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
            df['Part Number'] = df["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
            df["Part Number"].fillna("ENGINNERING SAMPLE", inplace=True)
        else:
            tmp = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
            tmp['Path'] = file
            tmp['Machine No'] = tmp["Path"].str.extract(r'(\D{3}\d{2})', expand=False)
            tmp['Process'] = tmp["Path"].str.extract(r'(\w{8})', expand=False)
            tmp['Before/After'] = tmp["Path"].str.extract(r'([B][E][F][O][R][E])', expand=False)
            tmp['Wafer Size'] = tmp["Path"].str.extract(r'(\d\_\D{6})', expand=False)
            tmp['Employee ID'] = tmp["Path"].str.extract(r'(?<!\d)(\d{6})(?!\d)', expand=False)
            tmp['Date'] = tmp["Path"].str.extract(r'(\d{4}_\d{2}_\d{2})', expand=False)
            tmp['Lot Number'] = tmp["Path"].str.extract(r'(\d{7}\D\d)', expand=False)
            tmp['Part Number'] = tmp["Path"].str.extract(r'([A-Z]{2,3}\d{3,4}[A-Z][A-Z]\d{2,4}[A-Z])', expand=False)
            tmp["Part Number"].fillna("ENGINNERING SAMPLE", inplace=True)
            df = df.append(tmp)
    export_excel = df.to_excel(r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx')

#schedule to run every hour
schedule.every(1).hour.do(job)

while True:
    schedule.run_pending()
    time.sleep(1)
In general terms you'll want to do the following:
Read in the xlsx file at the start of your script.
Extract a set of all the filenames already parsed (the Path attribute).
For each file you iterate over, check whether it is contained in that set of already processed files (a sketch follows below).
This assumes that existing files don't have their content updated. If that could happen, you may want to track metrics like last change date (a checksum would be most reliable, but probably too expensive to compute).
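A minimal sketch of that idea, assuming the consolidated workbook already exists and still contains the Path column written by the code in the question (the MASTER path below is copied from it):
import glob

import numpy as np
import pandas as pd

MASTER = r'C:\Users\hoosk\Documents\Python Scripts\hoosk\test26_feb_2020.xlsx'
path = r'V:\DB\ABCD\BEFORE\8_INCHES'

master = pd.read_excel(MASTER)                      # what has already been consolidated
already_parsed = set(master['Path'].astype(str))    # filenames seen in previous runs

frames = [master]
for file in glob.glob(path + "/*.csv"):
    if file in already_parsed:
        continue                                    # skip files processed earlier
    tmp = np.transpose(pd.read_csv(file, delimiter="|", index_col=False))
    tmp['Path'] = file
    # ... same per-file enrichment as in the question ...
    frames.append(tmp)

pd.concat(frames).to_excel(MASTER)                  # rewrite the consolidated workbook once
Rewriting the workbook once per run also avoids repeatedly appending to an xlsx file, which is what makes the current script slow.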

Python get files from N days ago and before

I am trying to write a function that can find files from a certain date or older and delete them. I was playing around with fabric and I want to delete my old log files from my server. The folder has files in the following format:
['user-2015-10-16.log.gz', 'user-2015-10-19.log.gz', 'user-2015-10-22.log.gz', 'user-2015-10-25.log.gz', 'admin-2015-10-17.log.gz', 'admin-2015-10-20.log.gz', 'admin-2015-10-23.log.gz', 'requests.log', 'user-2015-10-17.log.gz', 'user-2015-10-20.log.gz', 'user-2015-10-23.log.gz', 'extra.log', 'admin-2015-10-18.log.gz', 'admin-2015-10-21.log.gz', 'admin-2015-10-24.log.gz', 'user-2015-10-18.log.gz', 'user-2015-10-21.log.gz', 'user-2015-10-24.log.gz', 'admin-2015-10-16.log.gz', 'admin-2015-10-19.log.gz', 'admin-2015-10-22.log.gz', 'admin-2015-10-25.log.gz']
What I want to do is keep files from today back to 4 days ago, i.e. keep the ones from the 25th, 24th, 23rd, and 22nd, delete the rest, and keep extra.log and requests.log.
I tried this:
import datetime

days = 4
user = []
admin = []
for i in range(days):
    that_date = datetime.datetime.now() - datetime.timedelta(days=i)
    use = 'user-{}.log.gz'.format(that_date)
    adm = 'admin-{}.log.gz'.format(that_date)
    # user.append(user)
    # admin.append(admin)
    print use, adm
But I realized, embarrassingly late, that this gives me the files I want to keep, not the ones I want to delete.
Any help will be greatly appreciated.
Edit: in case it isn't clear, the files are generated daily in the user-(today's date) format, so I can't hardcode anything.
You might consider using glob with user-* and admin-* and then get the file creation times with os.stat
NOT TESTED, but something like:
import glob
import os
import time

target = 4*24*60*60  # 4 days in seconds
for fn in glob.glob('user-*') + glob.glob('admin-*'):
    if time.time() - os.path.getctime(fn) > target:
        # delete that file...
        os.remove(fn)
You need to change the working directory (or change the glob) to the target directory.
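Since the date is embedded in the filenames themselves (and, per the edit, the names are generated daily), an alternative that doesn't rely on creation times is to parse that date out of the name and compare it with a cutoff. A sketch, with the actual delete left commented out:
import datetime
import glob
import os
import re

cutoff = datetime.date.today() - datetime.timedelta(days=4)

for fn in glob.glob('user-*.log.gz') + glob.glob('admin-*.log.gz'):
    match = re.search(r'(\d{4}-\d{2}-\d{2})', fn)
    if match is None:
        continue                                    # skip names without a date
    file_date = datetime.datetime.strptime(match.group(1), '%Y-%m-%d').date()
    if file_date < cutoff:
        print('would delete: ' + fn)
        # os.remove(fn)                             # uncomment once the selection looks right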

Trying to understand how to use Exception Handling with my code

I'm reading in stock data from Yahoo associated with the "tickers" (stock codes) provided to me in a CSV file. However, some of the stock codes are actually not available on Yahoo, so I was wondering if there is a way to account for this in my code below via Exception Handling.
import pandas
import pandas.io.data as web
import datetime
import csv

f1 = open('C:\Users\Username\Documents\Programming\Financialdata.csv')  # Enter the location of the file
c1 = csv.reader(f1)
tickers = []
for row in c1:  # reading tickers from the csv file
    tickers.append(row)
start = datetime.datetime(2012,1,1)
end = datetime.datetime(2013,1,1)
l = []; m = []; tickernew = []
i = 0; j = 0; k = 0; z = []
for tick in tickers[0]:
    f = web.DataReader(tick, 'yahoo', start, end)
    if len(f) == 250:  # checking if the stock was traded for 250 days
        tickernew.append(tick)  # new ticker list to keep track of the new index number of tickers
        k = k + 1  # k keeps track of the number of new tickers
        for i in range(0, len(f)-1):
            m.append(f['Adj Close'][i+1]/f['Adj Close'][i])  # calculating returns
Absolutely. Your first step should be to look at the traceback you get when your program crashes because of the invalid input you mention.
Then, simply wrap the line of code you're crashing on in a try/except. Good Python style encourages you to be specific about what type of exception you're handling. So, for example, if the crash raises a ValueError, you'll want to do this:
try:
    bad_line_of_code
except ValueError:
    handle_the_issue
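Applied to the loop in the question (reusing its web, tickers, start and end), it might look like the sketch below. The exact exception class depends on what your traceback names for a missing ticker, so the broad except Exception here is a placeholder to narrow down once you've seen it:
for tick in tickers[0]:
    try:
        f = web.DataReader(tick, 'yahoo', start, end)
    except Exception as err:           # replace with the specific class from your traceback
        print('Skipping {}: {}'.format(tick, err))
        continue                       # move on to the next ticker
    if len(f) == 250:                  # the stock traded on all 250 days
        tickernew.append(tick)
        k = k + 1
        for i in range(0, len(f) - 1):
            m.append(f['Adj Close'][i + 1] / f['Adj Close'][i])   # daily returns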
