How to resample OHLC data with multiple stocks in index? - python

I haven't been able to find anything too similar to this. I have OHLC data pulled from yfinance for multiple stocks, which results in a MultiIndex of columns with OHLC fields and stock names.
Python Script
import requests
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta
N_DAYS_AGO = 15
now = datetime.now()
today = datetime(now.year,now.month,now.day, now.hour)
n_days_ago = today - timedelta(days=N_DAYS_AGO)
df = yf.download(['SPY','TLT'], start=n_days_ago, end=now, interval = "60m") #no error with 1 stock
ohlc_dict = {
    'Adj Close': 'last',
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Close': 'last',
    'Volume': 'sum'
}
df_sample = df.resample('W-FRI', closed='left').agg(ohlc_dict)
df_sample #error with 2 stocks
The code above works with a single stock but fails when there are multiple stocks / MultiIndex columns.
I've tried stacking and unstacking but haven't found a good way to resample this data. What's the simplest path forward here?
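One possible path (a sketch, assuming yfinance's usual layout where column level 0 is the OHLC field and level 1 is the ticker) is to expand the aggregation dict so every (field, ticker) column gets its own rule. Illustrated here with synthetic data in place of a live download:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the yfinance frame: column level 0 is the
# OHLC field, level 1 the ticker (the layout yf.download usually returns).
idx = pd.date_range("2023-01-02", periods=10, freq="D")
cols = pd.MultiIndex.from_product(
    [["Open", "High", "Low", "Close", "Volume"], ["SPY", "TLT"]]
)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((len(idx), len(cols))), index=idx, columns=cols)

ohlc_dict = {"Open": "first", "High": "max", "Low": "min",
             "Close": "last", "Volume": "sum"}

# Map every full column tuple to the aggregation for its field,
# so .agg() works directly on the MultiIndex columns.
agg_map = {col: ohlc_dict[col[0]] for col in df.columns}
df_sample = df.resample("W-FRI", closed="left").agg(agg_map)
```

The same `agg_map` construction should carry over to the real frame from `yf.download`, with an `'Adj Close'` entry added to `ohlc_dict` if that column is present.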

Related

How do I export this pandas array of stock data into an excel spreadsheet?

import pandas as pd
import yfinance as yf
import pendulum
from datetime import date
today = str(date.today())
print(today)
pd.options.display.max_rows=390
start = pendulum.parse('2022-12-5 08:30')
end = pendulum.parse('2022-12-5 15:00')
stock = input("Enter a stock ticker symbol: ")
print(stock + " 1 Minute Data")
print(start)
print(yf.download(tickers= stock, interval="1m", start=start, end=end))
Running the code and typing in "TSLA" will load every 1-minute bar for the specified date. How would I export this DataFrame in a clean fashion to Excel?
Side note: I was also trying to put today's date instead of pendulum's manual date '2022-12-5'
Is there a way to also use the current date for pendulum.parse instead of manually typing it out every time? I tried making the date a variable but got an error etc.
Well, I suspect yf.download is returning a pandas DataFrame, so you could just save it to Excel using pandas' to_excel method, unless there is more structure or processing you need to do.
df = yf.download(...)
df.to_excel('ticker_data.xlsx')
Question 1:
It returns a DataFrame, but you need to reset the index to move the Datetime column from the index back into a regular column. You can chain all the commands together.
Question 2:
pendulum.parse() takes a str, so you just need to use an f-string.
import pendulum
import yfinance as yf
start = pendulum.parse(f"{pendulum.now().date()} 08:30")
end = pendulum.parse(f"{pendulum.now().date()} 15:00")
stock = input("Enter a stock ticker symbol: ")
(yf
.download(tickers=stock, interval="1m", start=start, end=end)
.reset_index()
.to_excel(f"/path/to/file/{stock.lower()}.xlsx", index=False, sheet_name=stock)
)
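One wrinkle worth noting (an assumption about the environment, since intraday yfinance data usually comes back with a timezone-aware index): to_excel refuses tz-aware datetimes, so you may need to strip the timezone before exporting. A minimal sketch:

```python
import pandas as pd

# Stand-in for an intraday frame with a timezone-aware index,
# as yf.download typically returns for 1m data.
idx = pd.date_range("2022-12-05 08:30", periods=3, freq="1min",
                    tz="America/New_York")
df = pd.DataFrame({"Close": [1.0, 2.0, 3.0]}, index=idx)

# Drop the timezone; to_excel raises on tz-aware datetimes.
df.index = df.index.tz_localize(None)
out = df.reset_index()
```

After this, `out.to_excel(...)` should go through cleanly.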

Python: clear a specific range of data from a column in a dataframe

I have the problem that the data from my import (stock prices from Yahoo) is not correct for a specific time period. I want to clear the data from 2010-01-01 until 2017-10-17 for "VAR1.DE" and replace it with NaN. I have found the pandas function "drop", but that deletes the whole column.
How can I solve this?
Here is my code:
from pandas_datareader import data as web
import pandas as pd
import numpy as np
from datetime import datetime
assets = ['1211.HK','BABA','BYND','CAP.DE','JKS','PLUG','QCOM','VAR1.DE']
weights = np.array([0.125,0.125,0.125,0.125,0.125,0.125,0.125,0.125])
stockStartDate='2010-01-01'
today = datetime.today().strftime('%Y-%m-%d')
df = pd.DataFrame()
for stock in assets:
    df[stock] = web.DataReader(stock, data_source='yahoo', start=stockStartDate, end=today)['Adj Close']
Instead of using a for loop, you can simply do:
df = web.DataReader(name=assets, data_source='yahoo', start=stockStartDate, end=today)['Adj Close']
Since the returned DataFrame is indexed by datetime (i.e. a pd.DatetimeIndex), you can then simply do:
df.loc[:'2017-10-17', 'VAR1.DE'] = np.nan
This reassigns the 'VAR1.DE' values up to and including '2017-10-17' to NaN.
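A self-contained sketch of that slice assignment (with made-up numbers, since the Yahoo reader needs network access) shows that .loc date-string slicing is inclusive on both ends:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-10-15", periods=5, freq="D")
df = pd.DataFrame({"VAR1.DE": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "BABA": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# The date-string slice is inclusive, so 2017-10-17 itself is cleared too,
# and only the targeted column is touched.
df.loc[:"2017-10-17", "VAR1.DE"] = np.nan
```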

pandas - get a dataframe for every day

I have a DataFrame with dates in the index, and I make a subset of the DataFrame for every day. Is there any way to write a function or a loop to generate these steps automatically?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize
import datetime as dt
#Get the channel feeds from Thinkspeak
response = requests.get("https://api.thingspeak.com/channels/518038/feeds.json?api_key=XXXXXX&results=500")
#Convert Json object to Python object
response_data = response.json()
channel_head = response_data["channel"]
channel_bottom = response_data["feeds"]
#Create DataFrame with Pandas
df = pd.DataFrame(channel_bottom)
#rename Parameters
df = df.rename(columns={"field1":"PM 2.5","field2":"PM 10"})
#Drop all entries with at least one NaN
df = df.dropna(how="any")
#Convert time to datetime object
df["created_at"] = df["created_at"].apply(lambda x:dt.datetime.strptime(x,"%Y-%m-%dT%H:%M:%SZ"))
#Set dates as Index
df = df.set_index(keys="created_at")
#Make a DataFrame for every day
df_2018_12_07 = df.loc['2018-12-07']
df_2018_12_06 = df.loc['2018-12-06']
df_2018_12_05 = df.loc['2018-12-05']
df_2018_12_04 = df.loc['2018-12-04']
df_2018_12_03 = df.loc['2018-12-03']
df_2018_12_02 = df.loc['2018-12-02']
Supposing that you run this on the first day of the next week (so, exporting Monday through Sunday on the following Monday), you can do it as follows:
from datetime import date, timedelta
today = date.today()
day = today - timedelta(days=7)  # so, if today is Monday, we start from the previous Monday
while day < today:
    df1 = df.loc[str(day)]
    df1.to_csv('mypath' + str(day) + '.csv')  # so the exported files have different names
    day = day + timedelta(days=1)
Alternatively, to export only the current day, you can use:
from datetime import date
today = str(date.today())
df = df.loc[today]
and schedule the script using any scheduler such as crontab.
You can create dictionary of DataFrames - then select by keys for DataFrame:
dfs = dict(tuple(df.groupby(df.index.strftime('%Y-%m-%d'))))
print (dfs['2018-12-07'])
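A runnable sketch of that dictionary-of-DataFrames idea on tiny made-up sensor data:

```python
import pandas as pd

idx = pd.to_datetime(["2018-12-06 10:00", "2018-12-06 11:00",
                      "2018-12-07 09:30"])
df = pd.DataFrame({"PM 2.5": [1.0, 2.0, 3.0]}, index=idx)

# One DataFrame per calendar day, keyed by the date string.
dfs = dict(tuple(df.groupby(df.index.strftime("%Y-%m-%d"))))
```

Each value in `dfs` keeps the original timestamps of that day, so `dfs["2018-12-06"]` is the same subset as `df.loc["2018-12-06"]`.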

Python: Time Series with Pandas

I want to use time series with Pandas. I read multiple time series one by one from a CSV file that has the date in a column named "Date", formatted as YYYY-MM-DD:
Date,Business,Education,Holiday
2005-01-01,6665,8511,86397
2005-02-01,8910,12043,92453
2005-03-01,8834,12720,78846
2005-04-01,8127,11667,52644
2005-05-01,7762,11092,33789
2005-06-01,7652,10898,34245
2005-07-01,7403,12787,42020
2005-08-01,7968,13235,36190
2005-09-01,8345,12141,36038
2005-10-01,8553,12067,41089
2005-11-01,8880,11603,59415
2005-12-01,8331,9175,70736
df = pd.read_csv(csv_file, index_col = 'Date',header=0)
Series_list = df.keys()
The time series can have different frequencies: day, week, month, quarter, year, and I want to index the time series according to a frequency I decide before I generate the ARIMA model. Could someone please explain how I can define the frequency of the series?
stepwise_fit = auto_arima(df[Series_name]....
pandas has a built-in function, pandas.infer_freq().
import pandas as pd
df = pd.DataFrame({'Date': ['2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01'],
                   'Date1': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04'],
                   'Date2': ['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01'],
                   'Date3': ['2006-01-01', '2006-02-06', '2006-03-11', '2006-04-01']})
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['Date3'] = pd.to_datetime(df['Date3'])
pd.infer_freq(df.Date)
#'MS'
pd.infer_freq(df.Date1)
#'D'
pd.infer_freq(df.Date2)
#'AS-JAN'
Alternatively, you can use the datetime accessor on the columns.
df.Date.dt.freq
#'MS'
Of course if your data doesn't actually have a real frequency, then you won't get anything.
pd.infer_freq(df.Date3)
#None
The frequency descriptions are documented under offset aliases.
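To actually attach the inferred frequency to the index before fitting (auto_arima and most statsmodels estimators work better with an index that carries a freq), asfreq can be used. A sketch using the first rows of the CSV above:

```python
import io
import pandas as pd

# First rows of the CSV from the question, inlined for a self-contained run.
csv = io.StringIO(
    "Date,Business\n"
    "2005-01-01,6665\n"
    "2005-02-01,8910\n"
    "2005-03-01,8834\n"
)
df = pd.read_csv(csv, index_col="Date", parse_dates=True)

freq = pd.infer_freq(df.index)  # 'MS' for month-start data
df = df.asfreq(freq)            # stamp the frequency onto the index
```

With the frequency set, `df.index.freqstr` reports it directly instead of requiring inference each time.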

Pivot pandas timeseries by year

Is there a shorter or more elegant way to pivot a timeseries by year in pandas? The code below does what I want but I wonder if there is a better way to accomplish this:
import pandas
import numpy
daterange = pandas.date_range(start='2000-01-01', end='2017-12-31', freq='10T')
# generate a fake timeseries of measured wind speeds from 2000 to 2017 in 10min intervals
wind_speed = pandas.Series(data=numpy.random.rand(daterange.size), index=daterange)
# group by year
wind_speed_groups = wind_speed.groupby(wind_speed.index.year).groups
# assemble data frame with columns of wind speed data for every year
wind_speed_pivot = pandas.DataFrame()
for key, group in wind_speed_groups.items():
    series = wind_speed[group]
    series.name = key
    series.index = series.index - pandas.Timestamp(str(key) + '-01-01')
    wind_speed_pivot = wind_speed_pivot.join(series, how='outer')
print(wind_speed_pivot)
I'm not sure if this is the fastest method, as I'm adding two columns to your initial dataframe (it's possible to add just one if you want to overwrite it).
import pandas as pd
import numpy as np
import datetime as dt
daterange = pd.date_range(start='2000-01-01', end='2017-12-31', freq='10T')
# generate a fake timeseries of measured wind speeds from 2000 to 2017 in 10min intervals
wind_speed = pd.Series(data=np.random.rand(daterange.size), index=daterange)
df = wind_speed.to_frame("windspeed")
df["year"] = df.index.year
df["pv_index"] = df.index - df["year"].apply(lambda x: dt.datetime(x,1,1))
wind_speed_pivot = df.pivot_table(index=["pv_index"], columns=["year"], values=["windspeed"])
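For reference, a shortened, runnable version of that pivot on a daily two-year range (passing values='windspeed' as a plain string keeps the columns as bare years rather than a MultiIndex):

```python
import datetime as dt
import numpy as np
import pandas as pd

daterange = pd.date_range(start="2000-01-01", end="2001-12-31", freq="D")
wind_speed = pd.Series(np.random.default_rng(0).random(daterange.size),
                       index=daterange)

df = wind_speed.to_frame("windspeed")
df["year"] = df.index.year
# Offset of each timestamp from the start of its own year,
# so rows from different years line up.
df["pv_index"] = df.index - df["year"].apply(lambda y: dt.datetime(y, 1, 1))

wind_speed_pivot = df.pivot_table(index="pv_index", columns="year",
                                  values="windspeed")
```

The row count is 366 here because 2000 is a leap year; the 2001 column simply gets a NaN at the extra offset.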
