datetime groupby on a multiindex - python

If I have a multiindex set up like:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from io import StringIO
csv = u"""string,date,number
a string1,2/5/11 9:16am,1.0
a string2,3/5/11 10:44pm,2.0
a string3,4/22/11 12:07pm,3.0
a string4,4/22/11 12:10pm,4.0
a string5,4/29/11 11:59am,1.0
a string6,5/2/11 1:41pm,2.0
a string7,5/2/11 2:02pm,3.0
a string8,5/2/11 2:56pm,4.0
a string9,5/2/11 3:00pm,5.0
a string10,5/2/14 3:02pm,6.0
a string11,5/2/14 3:18pm,7.0"""
df = pd.read_csv(StringIO(csv))
df['date']=pd.to_datetime(df['date'],format='%m/%d/%y %I:%M%p')
df.index = df['date']
df.index = pd.MultiIndex.from_tuples(zip(df['date'], df['string']), names=['alpha', 'bravo'])
How can I do a groupby on the alpha index by month and then sum? What I've tried is:
df.groupby(level='alpha').sum().groupby(df.index.month).sum()
which clearly doesn't work.

Like this?
df.groupby(df.index.get_level_values('alpha').month).number.sum()
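One thing to watch with the one-liner above: `.month` gives only the month number, so May 2011 and May 2014 end up in the same group. Below is a minimal sketch on a shortened copy of the sample data, showing both the month-number grouping and a year-and-month grouping via `to_period("M")`. The names `by_month` and `by_year_month` are just illustrative, and `set_index` is used here in place of `from_tuples` as an assumed equivalent way to build the same index.

```python
from io import StringIO

import pandas as pd

csv = """string,date,number
a string1,2/5/11 9:16am,1.0
a string2,3/5/11 10:44pm,2.0
a string3,4/22/11 12:07pm,3.0
a string4,4/22/11 12:10pm,4.0
a string5,4/29/11 11:59am,1.0
a string6,5/2/11 1:41pm,2.0
a string10,5/2/14 3:02pm,6.0"""

df = pd.read_csv(StringIO(csv))
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y %I:%M%p")
# Build the (alpha, bravo) MultiIndex from the two columns.
df = df.set_index(["date", "string"]).rename_axis(["alpha", "bravo"])

# Pull the datetime level out as a DatetimeIndex.
alpha = df.index.get_level_values("alpha")

# Group by month number only: all Mays are pooled, whatever the year.
by_month = df.groupby(alpha.month)["number"].sum()

# Group by calendar month including the year: May 2011 and May 2014 stay apart.
by_year_month = df.groupby(alpha.to_period("M"))["number"].sum()

print(by_month)
print(by_year_month)
```

Which of the two you want depends on whether "by month" means the month of the year or each individual calendar month.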

Related

MACD stock indicator function using ewm() from pandas library

Here is the test code for my MACD function; however, the values I am getting are incorrect. I don't know if it is because my span is in days and my data is in 2-minute increments, or if it is a separate issue. Any help would be much appreciated :)
import yfinance as yf
import pandas as pd
import pandas_ta as ta
import numpy as np
import datetime as dt
import time
dataTSLA = yf.download(tickers='TSLA', period='1mo', interval='2m', auto_adjust=True)
def indicatorMACD(data):
    exp1 = data['Close'].ewm(span=12, adjust=False).mean()
    exp2 = data['Close'].ewm(span=26, adjust=False).mean()
    macd = exp1 - exp2
    signalLine = macd.ewm(span=9, adjust=False).mean()
    return [macd, signalLine]
print(indicatorMACD(dataTSLA))
I am getting around 0.66 for MACD and 0.23 for the signal line, when they should be -0.23 and -0.64 respectively.
Use min_periods instead of adjust
code:
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
df = pdr.DataReader('BTC-USD' , data_source='yahoo' , start='2020-01-01')
df
Function definition:
def MACD(DF,a,b,c):
    df=DF.copy()
    df['MA FAST'] = df['Close'].ewm(span=a , min_periods = a).mean()
    df['MA SLOW'] = df['Close'].ewm(span=b , min_periods = b).mean()
    df['MACD'] = df['MA FAST'] - df['MA SLOW']
    df['Signal'] = df['MACD'].ewm(span= c , min_periods = c).mean()
    df.dropna(inplace=True)
    return df
Function call:
data = MACD(df , 12,26,9)
data
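On the span question from the original post: `ewm(span=...)` counts rows, not calendar days, so on 2-minute bars `span=12` means 12 two-minute bars. If the reference values come from a daily chart, the data would need to be resampled to daily closes first. Here is a minimal sketch of that idea on synthetic prices. The `macd` helper, the linear price series and the `resample("1D").last()` step are assumptions for illustration, not the original poster's data.

```python
import numpy as np
import pandas as pd


def macd(close, fast=12, slow=26, signal=9):
    """MACD and signal line; spans are measured in rows of `close`."""
    fast_ema = close.ewm(span=fast, adjust=False).mean()
    slow_ema = close.ewm(span=slow, adjust=False).mean()
    macd_line = fast_ema - slow_ema
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({"macd": macd_line, "signal": signal_line})


# Synthetic, steadily rising 2-minute closes (illustrative only).
idx = pd.date_range("2021-01-04 09:30", periods=2000, freq="2min")
close = pd.Series(np.linspace(100.0, 110.0, len(idx)), index=idx)

# Spans of 12/26/9 two-minute bars.
per_bar = macd(close)

# Spans of 12/26/9 days: take the last close of each day first.
daily_close = close.resample("1D").last().dropna()
daily = macd(daily_close)

print(per_bar.tail())
print(daily)
```

The two results are on different time scales, so they are not expected to match each other.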

Formatting HTML pandas tables in python

I am using Pandas to create 3 HTML tables out of 3 dataframes. The output I want is an HTML file. The code I'm currently using prints tables one under the other. I want to print one table on top, and then the other two tables side by side. What could I change in the code to achieve that?
import numpy as np
from numpy.random import randn
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(randn(5,4),columns='W X Y Z'.split())
df1 = pd.DataFrame(randn(5,4),columns='A B C D'.split())
df2 = pd.DataFrame(randn(5,4),columns='E F G K'.split())
with open("a.html", 'w') as _file:
    _file.write(df.head().to_html() + "\n\n" + df1.head().to_html()+ "\n\n" + df2.head().to_html())
Here's my proposal based on your original code:
import numpy as np
from numpy.random import randn
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(randn(5,4),columns='W X Y Z'.split())
df1 = pd.DataFrame(randn(5,4),columns='A B C D'.split())
df2 = pd.DataFrame(randn(5,4),columns='E F G K'.split())
html = """
{table1}
<table>
<tr>
<td>{table2}</td>
<td>{table3}</td>
</tr>
</table>
""".format(
table1=df.head().to_html(),
table2=df1.head().to_html(),
table3=df2.head().to_html()
)
with open("a.html", 'w') as _file:
    _file.write(html)
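As a variation on the wrapper-table layout above, the two lower tables can also be placed side by side with a CSS flex container. This is a minimal sketch, assuming the page is viewed in a browser that supports `display: flex`; the seeded random generator and the inline style are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 4)), columns=list("WXYZ"))
df1 = pd.DataFrame(rng.standard_normal((5, 4)), columns=list("ABCD"))
df2 = pd.DataFrame(rng.standard_normal((5, 4)), columns=list("EFGK"))

# First table on its own row, then a flex row holding the other two.
html = """<div>{top}</div>
<div style="display: flex; gap: 2em;">
<div>{left}</div>
<div>{right}</div>
</div>
""".format(
    top=df.to_html(),
    left=df1.to_html(),
    right=df2.to_html(),
)

print(html[:200])
```

The resulting `html` string can be written to `a.html` in the same way as in the code above.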

Xarray resample inter annually

I am trying to resample my data annually, but I am struggling to set the start day of the resampling.
import xarray as xr
import numpy as np
import pandas as pd
da = xr.DataArray(
np.linspace(0, 11, num=36),
coords=[
pd.date_range(
"15/12/1999", periods=36,
)
],
dims="time",
)
da.resample(time="1Y").mean()
What I am trying to achieve is to get the means of the following periods: 15/12/1999-15/12/2000, 15/12/2000-15/12/2001, 15/12/2001-15/12/2002, ...
I have solved it by shifting the time back to the first day of the month and using the corresponding pandas anchored offset. Afterwards, I reset the time back.
import xarray as xr
import numpy as np
import pandas as pd
da = xr.DataArray(
np.concatenate([np.zeros(365), np.ones(365)]),
coords=[
pd.date_range(
"06/15/2017", "06/14/2019", freq='D'
)
],
dims="time",
)
days_to_first_of_month = pd.Timedelta(days=int(da.time.dt.day[0])-1)
da['time'] = da.time - days_to_first_of_month
month = da.time.dt.strftime("%b")[0].values
resampled = da.resample(time=f'AS-{month}').sum()
resampled['time'] = resampled.time + days_to_first_of_month
print(resampled)
Is there a more efficient or clean way?
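One possible alternative that avoids shifting the time axis back and forth is to label every timestamp with the number of anniversaries of the start date that have passed, then group on that label. The sketch below does this in plain pandas on the same 0/1 test data. The names `before_anniversary` and `period` are hypothetical, and carrying the labels over to an xarray `groupby` is left as an untested assumption.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-06-15", "2019-06-14", freq="D")
s = pd.Series(np.concatenate([np.zeros(365), np.ones(365)]), index=idx)

start = idx[0]

# True where the date falls before this calendar year's anniversary of `start`.
before_anniversary = (idx.month < start.month) | (
    (idx.month == start.month) & (idx.day < start.day)
)

# Number of whole "start-anchored" years elapsed: 0 for the first period, 1 for the next, ...
period = np.asarray(idx.year - start.year) - np.asarray(before_anniversary).astype(int)

sums = s.groupby(period).sum()
print(sums)
```

Because the label is derived from month and day rather than a fixed number of days, leap years do not shift the period boundaries.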

How to apply euclidean distance to dataframe. Calculate each row

Please help me with this problem; I have been stuck on it for about 2 weeks.
I want to use "apply" on a dataframe that I got from the Alpha Vantage API.
I want to compute the Euclidean distance for each row of the dataframe.
import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.neighbors import KNeighborsRegressor
from alpha_vantage.timeseries import TimeSeries
from services.KEY import getApiKey
ts = TimeSeries(key=getApiKey(), output_format='pandas')
This is the chart I currently get:
My chart (sorry, I can't post the image because of my reputation)
My code:
stock, meta_data = ts.get_daily_adjusted(symbol, outputsize='full')
stock = stock.sort_values('date')
open = stock['1. open'].values
low = stock['3. low'].values
high = stock['2. high'].values
close = stock['4. close'].values
sorted_date = stock.index.get_level_values(level='date')
stock_numpy_format = np.stack((sorted_date, open, low
,high, close), axis=1)
df = pd.DataFrame(stock_numpy_format, columns=['date', 'open', 'low', 'high', 'close'])
df = df[df['open']>0]
df = df[(df['date'] >= "2016-01-01") & (df['date'] <= "2018-12-31")]
df = df.reset_index(drop=True)
df['close_next'] = df['close'].shift(-1)
df['daily_return'] = df['close'].pct_change(1)
df['daily_return'].fillna(0, inplace=True)
stock_numeric_close_dailyreturn = df[['close', 'daily_return']]  # a list of column names needs double brackets
stock_normalized = (stock_numeric_close_dailyreturn - stock_numeric_close_dailyreturn.mean()) / stock_numeric_close_dailyreturn.std()
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized) , axis=1)
distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx":euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = df.loc[int(second_smallest)]["date"]
I want my chart to look like this:
The chart that I want
And the code that produced that chart:
distance_columns = ['Close', 'DailyReturn']
stock_numeric = stock[distance_columns]
stock_normalized = (stock_numeric - stock_numeric.mean()) / stock_numeric.std()
stock_normalized.fillna(0, inplace = True)
date_normalized = stock_normalized[stock["Date"] == "2016-06-29"]
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized), axis = 1)
distance_frame = pandas.DataFrame(data = {"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = stock.loc[int(second_smallest)]["Date"]
As far as I can tell, "apply" behaves differently on the dataframe built from the API's pandas output than on one loaded from a CSV file.
Is there an alternative that gives the same output for both formats (pandas and CSV)?
Thank you!
NB: sorry if my English is bad.
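A minimal sketch of the row-wise distance step, under a few assumptions: the prices below are made up, the columns are cast to float explicitly (stacking dates and prices into one NumPy array, as in the first snippet, yields an object-dtype array), and the reference row is taken as a 1-D Series with `.iloc[0]` rather than a one-row dataframe. It uses plain NumPy arithmetic instead of `scipy.spatial.distance.euclidean`.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only.
df = pd.DataFrame({
    "date": pd.date_range("2016-06-27", periods=5, freq="D"),
    "close": [10.0, 10.5, 11.0, 10.4, 12.0],
})
df["daily_return"] = df["close"].pct_change().fillna(0.0)

# Make sure the feature columns are numeric, then z-score them.
features = df[["close", "daily_return"]].astype(float)
normalized = (features - features.mean()) / features.std()

# Reference row as a 1-D Series (not a one-row DataFrame).
ref = normalized.loc[df["date"] == "2016-06-29"].iloc[0]

# Euclidean distance from every row to the reference row.
dist = np.sqrt(((normalized - ref) ** 2).sum(axis=1))

# The smallest distance is the reference row itself, so take the second one.
order = dist.sort_values()
most_similar_idx = order.index[1]
most_similar_date = df.loc[most_similar_idx, "date"]
print(most_similar_date)
```

Whether this matches the desired chart depends on the real data; the sketch only shows the distance computation itself.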

Exception: real is not double (PYTHON)

I'm trying to take the moving average of a stock's volume using TA-Lib, but I'm getting the error above. Any suggestions on how to fix this? Thanks!
See code below:
import pandas_datareader as pdr
import datetime
import pandas as pd
import numpy as np
import talib as ta
#Download Data
aapl = pdr.get_data_yahoo('AAPL', start=datetime.datetime(2006, 10, 1), end=datetime.datetime(2012, 1, 1))
#Saves Data as CSV on desktop
aapl.to_csv('C:\\Users\\JDOG\\Desktop\\aapl_ohlc.csv', encoding='utf-8')
#Save to dataframe
df = pd.read_csv('C:\\Users\\JDOG\\Desktop\\aapl_ohlc.csv', header=0, index_col='Date', parse_dates=True)
twenty_ma = 20
signals = pd.DataFrame(index=aapl.index)
signals['signal'] = 0.0
signals['20 MA'] = ta.SMA(aapl.Volume.values,twenty_ma)
It looks like SMA expects an array of floats rather than ints:
In [11]: ta.SMA(aapl.Volume.values.astype('float64'), twenty_ma)
Out[11]:
array([ nan, nan, nan, ..., 78960385., 76585880.,
73991890.])
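To see the dtype issue without TA-Lib installed, the sketch below checks the dtype of a small made-up volume series and converts it to `float64`, which is what "double" refers to in the error message. The rolling mean at the end is pandas' own simple moving average, shown only as a stand-in; it is not the TA-Lib call.

```python
import numpy as np
import pandas as pd

# Made-up integer volumes, like the int64 Volume column in the question.
volume = pd.Series([100, 200, 300, 400, 500], dtype="int64")
print(volume.dtype)

# Convert to float64 ("double") before handing the array to TA-Lib.
as_float = volume.to_numpy(dtype="float64")
print(as_float.dtype)

# pandas stand-in for a 3-period simple moving average.
sma = pd.Series(as_float).rolling(window=3).mean()
print(sma)
```

As with the TA-Lib output above, the first `window - 1` values are NaN because there is not yet a full window.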
