How can I deal with "Expect data.index as DatetimeIndex"? - python

I am planning to get a candlestick plot from the bitcoin data.
Here is my code for selecting the dataframe I want after loading the CSV file.
df['Date'] = pd.to_datetime(df['Date'])
start_date = '2016-02-27'
end_date = '2021-02-27'
mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
df
Then I entered the code below to make the candlestick plot:
import matplotlib.pyplot as plt
! pip install --upgrade mplfinance
import mplfinance as mpf
import matplotlib.dates as mpl_dates
mpf.plot(df, type='candle', style='charles',
         title='Bitcoin Price',
         ylabel='Price (USD$)',
         volume=True,
         ylabel_lower='Shares \nTraded',
         mav=(3, 6, 9),
         savefig='chart-mplfinance.png')
It said "TypeError: Expect data.index as DatetimeIndex".
So I looked up solutions for this on Google and tried this:
df = dict()
df['Date'] = []
df['High'] = []
df['Low'] = []
df['Open'] = []
df['Close'] = []
df['Volume'] = []
for dict in df:
    df['Date'].append(datetime.datetime.fromtimestamp(t).strftime('%Y-%m-%d %H:%M:%S')
    df['High'].append(dict['High'])
    df['Low'].append(dict['Low'])
    df['Open'].append(dict['Open'])
    df['Close'].append(dict['Close'])
    df['Volume'].append(dict['Vol'])
print("df:", df)
pdata = pd.DataFrame.from_dict(df)
pdata.set_index('Date', inplace=True)
mpf.plot(pdata)
This time, it said "invalid syntax".
I'm not sure where I went wrong; is there anything I have missed?

There are two easy ways to make sure your dataframe has a pandas.DatetimeIndex as the dataframe index:
1. When calling read_csv(), indicate which column you want to use for the index (the column that contains the dates/datetimes), and also set the kwarg parse_dates=True. This will automatically convert the datetime column (which is normally strings within a CSV file) into a DatetimeIndex object and set it as the index. You can see this being done in the examples in the mplfinance repository: look under basic usage, where you will see index_col=0, parse_dates=True in the call to read_csv().
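For example, a minimal sketch of this first approach (the file name is a hypothetical stand-in; it assumes the dates are in the first column of the CSV):
import pandas as pd
# hypothetical file; index_col=0 picks the date column, parse_dates=True converts it
df = pd.read_csv('bitcoin.csv', index_col=0, parse_dates=True)
print(type(df.index))  # <class 'pandas.core.indexes.datetimes.DatetimeIndex'>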
2. Use the pandas.DatetimeIndex() constructor. For example, instead of df['Date'] = pd.to_datetime(df['Date']), you would write:
df.index = pd.DatetimeIndex(df['Date'])
As a side note, once the dataframe has a DatetimeIndex, you don't need the mask in the next section of code, but can simply slice as follows:
start_date = '2016-02-27'
end_date = '2021-02-27'
df = df.loc[start_date:end_date]
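Putting it together with your plot call, a minimal sketch (again, 'bitcoin.csv' is a hypothetical stand-in, assumed to contain Date, Open, High, Low, Close and Volume columns):
import pandas as pd
import mplfinance as mpf
# let read_csv build the DatetimeIndex directly
df = pd.read_csv('bitcoin.csv', index_col=0, parse_dates=True)
# slice the date range directly on the DatetimeIndex
df = df.loc['2016-02-27':'2021-02-27']
mpf.plot(df, type='candle', style='charles',
         title='Bitcoin Price',
         ylabel='Price (USD$)',
         volume=True,
         mav=(3, 6, 9),
         savefig='chart-mplfinance.png')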
hth.

Related

Python Pandas drop() function not working after dictionary conversion; using Yfinance API

Using the yfinance API, I pulled data from its option-chain object and converted it to a dictionary. I tried to delete all rows that contained "True" in the column labeled "inTheMoney", but when I run the program it does not do so.
import yfinance as yf
import pandas as pd
price = 100
ticker = yf.Ticker("SPY")
opt = ticker.option_chain('2022-11-18')
df = pd.DataFrame(opt.puts)
#df = df.drop(df[(df['inTheMoney'] != 'True')].index)
df = df.drop(['contractSymbol', 'lastTradeDate', 'change', 'percentChange', 'volume', 'openInterest', 'impliedVolatility', 'contractSize', 'currency'], axis = 1)
print(df)
I also tried to use a for loop and loc but that did not work either.
for index in range(len(df)):
    #print(df.loc[index, 'strike'])
    if df.loc[index, 'strike'] < 100:
        print(df.loc[index])
Any help is greatly appreciated
Just:
df = df.drop(df[df['inTheMoney'] != True].index)  # do not use quotes around True
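As a side note, since inTheMoney is a boolean column, you can also filter with a boolean mask directly; a minimal sketch:
df = df[~df['inTheMoney']]   # drop the in-the-money rows, as the question describes
# or, keeping only the in-the-money rows:
df = df[df['inTheMoney']]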

Converting separate hour/min/sec columns into a single time column with pandas?

I'm trying to create a single time column so that I can make a time-series plot by resampling on a date/time index. However, I'm having trouble combining the columns into a single column and/or indexing it. Below is my code and what I've tried to do. Any suggestions would be appreciated!
colnames=['time_ms','power','chisq','stations','alt','hour','min','sec','time_frac','lat','lon']
df = pd.read_csv('/data/selected_lma_matlab_20210914.txt',delim_whitespace=True, header=None, names=colnames)
#df = pd.read_csv('/data/selected_lma_matlab_20210914.txt',delim_whitespace=True, header=None,names=colnames,parse_dates=[[5, 7]], index_col=0)
#df = pd.read_csv('/data/selected_lma_matlab_20210914.txt',delim_whitespace=True, header=None,names=colnames,infer_datetime_format=True,parse_dates=[[5, 6]], index_col=0)
I did try this method to include the date as well, which I believe isn't necessary but would be nice for consistency. However, I wasn't able to get it to work.
s = df['hour'].mul(10000) + df['min'].mul(100) + df['sec']
df['date'] = pd.to_datetime('2021-09-14 ' + s.astype(int), format='%Y-%m-%d %H%M%S.%f')
This method did work to create a new column, but I had trouble indexing it.
df['time'] = (pd.to_datetime(df['hour'].astype(str) + ':' + df['min'].astype(str), format='%H:%M')
              .dt.time)
df['Datetime'] = pd.to_datetime(df['time'])
df.set_index('Datetime')
Creating this column to get counts for a time series:
df['tot'] = 1
Using this to resample the data necessary for the time series into a new df:
df2 = df[['tot']].resample('5min').sum()
However I keep getting datetime/index errors despite what I've tried above.
Link to data: https://drive.google.com/file/d/16GmXfQNMK81aAbB6C-W_Bjm2mcOVrILP/view?usp=sharing
You should keep all the data in the different columns as strings, concatenate them, and then convert the result to datetime. The updated code below does this:
colnames=['time_ms','power','chisq','stations','alt','hour','min','sec','time_frac','lat','lon']
df = pd.read_csv('selected_lma_matlab_20210914.txt',delim_whitespace=True, header=None, names=colnames)
df['date'] = '2021-09-14 ' + df['hour'].astype('string') + ":" + df['min'].astype('string') + ":" + df['sec'].astype('string')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
df.set_index('date', inplace=True)
After this you can do the plots as you need. I tried these and they appear to work fine:
df.alt.plot(kind='line')
df.plot('lat', 'lon', kind='scatter')
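With the datetime index set this way, the 5-minute resampling from the question should also work; a small sketch:
df['tot'] = 1                             # one count per row
df2 = df[['tot']].resample('5min').sum()  # event counts per 5-minute bin
df2.plot()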

How to find Date of 52 Week High and date of 52 Week low using pandas dataframe (Python)?

Please refer to the table below for reference.
I was able to find the 52-week high and low using:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
Can someone please guide me on how to find the date of the 52-week high and the date of the 52-week low? Thanks in advance.
My guess is that the date is another column in the dataframe; assuming its name is 'Date', you can try something like:
df = pd.read_csv(csv_file_name, engine='python')
df['52W H'] = df['HIGH'].rolling(window=252, center=False).max()
df['52W L'] = df['LOW'].rolling(window=252, center=False).min()
df_low = df[df['LOW']== df['52W L'] ]
low_date = df_low['Date']
Similarly, you can look for the high values, as sketched below.
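A sketch mirroring the low lookup (assuming the same 'HIGH' and 'Date' columns):
df_high = df[df['HIGH'] == df['52W H']]  # rows where the high equals the 52-week high
high_date = df_high['Date']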
Also it would have helped if you shared your sample dataframe structure.
This uses 'pandas_datareader' data. The index is reset first. Then, using the idxmax() and idxmin() functions, the integer positions of the rolling highs and lows are found and arrays are created from these values (rolling().apply() must return a numeric value, which is why integer positions are collected rather than the dates themselves). The 'Date' column is then set as the index again, and the position arrays are used to look the dates up in df.index. Note how the leading NaN values produced by the rolling window are skipped when assigning.
Replace 'High' and 'Low' with the corresponding columns in your df.
import pandas as pd
import pandas_datareader.data as web
import numpy as np
df = web.DataReader('GE', 'yahoo', start='2012-01-10', end='2019-10-09')
df = df.reset_index()
imax = df['High'].rolling(window=252, center=False).apply(lambda x: x.idxmax()).values
imin = df['Low'].rolling(window=252, center=False).apply(lambda x: x.idxmin()).values
count0_imax = np.count_nonzero(np.isnan(imax))
count0_imin = np.count_nonzero(np.isnan(imin))
imax = imax[count0_imax:].astype(int)
imin = imin[count0_imin:].astype(int)
df = df.set_index('Date')
df.loc[df.index[count0_imax]:, '52W H'] = df.index[imax]
df.loc[df.index[count0_imin]:, '52W L'] = df.index[imin]

Python resample function not resampling

I'm trying to resample the data; however, it does not seem to be working properly. I want the data to go from start-of-month to start-of-month.
The code is the following:
df = pd.read_csv('OSEBX_daily.csv')
df = df[['time', 'OSEBX GR']]
df['time'] = pd.to_datetime(df['time']).dt.normalize()
df.set_index('time', inplace=True)
df.index = pd.to_datetime(df.index)
df.resample('1M').mean()
df['returns'] = df['OSEBX GR'].pct_change()
plt.plot(df['returns'])
You forgot to assign it back:
df = df.resample('1M').mean()
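As a side note, '1M' labels each bin with the month end; since you mention start-of-month, pandas also accepts the 'MS' (month start) alias (a sketch, assuming month-start labels are what you want):
df = df.resample('MS').mean()  # bins labeled by the first day of each month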

How to extrapolate a periodic time series in Pandas?

In Python 3.5 and Pandas 0.20, say I have a one-year periodic time series:
import pandas as pd
import numpy as np
start_date = pd.to_datetime("2015-01-01T01:00:00.000Z", infer_datetime_format=True)
end_date = pd.to_datetime("2015-12-31T23:00:00.000Z", infer_datetime_format=True)
index = pd.DatetimeIndex(start=start_date,
                         freq="60min",
                         end=end_date)
time = np.array((index - start_date)/ np.timedelta64(1, 'h'), dtype=int)
df = pd.DataFrame(index=index)
df["foo"] = np.sin( 2 * np.pi * time / len(time))
df.plot()
I want to do some periodic extrapolation of the time series for a new index, i.e. with:
new_start_date = pd.to_datetime("2017-01-01T01:00:00.000Z", infer_datetime_format=True)
new_end_date = pd.to_datetime("2019-12-31T23:00:00.000Z", infer_datetime_format=True)
new_index = pd.DatetimeIndex(start=new_start_date,
                             freq="60min",
                             end=new_end_date)
I would like to use some kind of extrapolate_periodic method to get:
# DO NOT RUN
new_df = df.extrapolate_periodic(index=new_index)
# END DO NOT RUN
new_df.plot()
What is the best way do such a thing in pandas?
How can I define a periodicity and get data from a new index easily?
I think I have what you are looking for, though it is not a simple pandas method.
Carrying on directly from where you left off,
def extrapolate_periodic(df, new_index):
    df_right = df.groupby([df.index.dayofyear, df.index.hour]).mean()
    df_left = pd.DataFrame({'new_index': new_index}).set_index('new_index')
    df_left = df_left.assign(dayofyear=lambda x: x.index.dayofyear,
                             hour=lambda x: x.index.hour)
    df = (pd.merge(df_left, df_right, left_on=['dayofyear', 'hour'],
                   right_index=True, suffixes=('', '_y'))
          .drop(['dayofyear', 'hour'], axis=1))
    return df.sort_index()
new_df = extrapolate_periodic(df, new_index)
# or as a method style
# new_df = df.pipe(extrapolate_periodic, new_index)
new_df.plot()
If you have more than a year's worth of data, it will take the mean of each duplicated day-hour. Here mean could be changed to last if you wanted just the most recent reading.
This will not work if you do not have a full year's worth of data, but you could fix this by adding a reindex to complete the year and then using interpolate with a polynomial method to fill in the missing foo column, as sketched below.
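For that incomplete-year case, a rough sketch (assuming hourly data and the foo column from the question):
# complete one year of hourly timestamps from the first observation (hypothetical span)
full_index = pd.date_range(start=df.index.min(), periods=365 * 24, freq='60min')
# reindex to expose the missing hours as NaN, then fill them with a polynomial fit
df = df.reindex(full_index).interpolate(method='polynomial', order=3)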
Here is some code I've used to solve my problem. The assumption is that the initial series corresponds to one full period of data.
def extrapolate_periodic(df, new_index):
    index = df.index
    start_date = np.min(index)
    end_date = np.max(index)
    period = np.array((end_date - start_date) / np.timedelta64(1, 'h'), dtype=int)
    time = np.array((new_index - start_date) / np.timedelta64(1, 'h'), dtype=int)
    new_df = pd.DataFrame(index=new_index)
    for col in list(df.columns):
        new_df[col] = np.array(df[col].iloc[time % period])
    return new_df
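As with the first answer, it can be used the same way (a usage sketch):
new_df = extrapolate_periodic(df, new_index)
new_df.plot()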
