[Edited to state the root problem more clearly; it behaves differently if you use NumPy 1.8, as dmvianna points out]
I have a DataFrame that has time stamps and other data. In the end I would like not to use a formatted time as the index, because it messes with matplotlib's 3D plotting. I also want to perform a groupby to populate some flag fields. This is causing me to run into a number of weird errors. The first two snippets work as I would expect; once I bring pd.to_datetime into the picture, it starts throwing errors.
runs as expected:
import pandas as pd
import numpy as np
df = pd.DataFrame({'time': np.random.randint(100000, size=1000),
                   'type': np.random.randint(10, size=1000),
                   'value': np.random.rand(1000)})
df['high'] = 0

def high_low(group):
    if group.value.mean() > .5:
        group.high = 1
    return group

grouped = df.groupby('type')
df = grouped.apply(high_low)
works fine:
df = pd.DataFrame({'time': np.random.randint(100000, size=1000),
                   'type': np.random.randint(10, size=1000),
                   'value': np.random.rand(1000)})
df.time = pd.to_datetime(df.time, unit='s')
df['high'] = 0

def high_low(group):
    if group.value.mean() > .5:
        group.high = 1
    return group

grouped = df.groupby('type')
df = grouped.apply(high_low)
throws error:
ValueError: Shape of passed values is (3, 1016), indices imply (3, 1000)
df = pd.DataFrame({'time': np.random.randint(100000, size=1000),
                   'type': np.random.randint(10, size=1000),
                   'value': np.random.rand(1000)})
df.time = pd.to_datetime(df.time, unit='s')
df = df.set_index('time')
df['high'] = 0

def high_low(group):
    if group.value.mean() > .5:
        group.high = 1
    return group

grouped = df.groupby('type')
df = grouped.apply(high_low)
throws error:
ValueError: Shape of passed values is (3, 1016), indices imply (3, 1000)
df = pd.DataFrame({'time': np.random.randint(100000, size=1000),
                   'type': np.random.randint(10, size=1000),
                   'value': np.random.rand(1000)})
df['epoch'] = df.time
df.time = pd.to_datetime(df.time, unit='s')
df = df.set_index('time')
df = df.set_index('epoch')
df['high'] = 0

def high_low(group):
    if group.value.mean() > .5:
        group.high = 1
    return group

grouped = df.groupby('type')
df = grouped.apply(high_low)
Anyone know what I'm missing / doing wrong?
Instead of using pd.to_datetime, I would use np.datetime64. It works in columns and offers the same functionality you would expect from a DatetimeIndex (np.datetime64 is the building block of a DatetimeIndex).
import numpy as np
data['time2'] = np.datetime64(data.time, 's')
Check the Docs
This would also lead to the same result:
import pandas as pd
data['time2'] = pd.to_datetime(data.time, unit='s')
Note, though, that I'm using pandas 0.12.0 and NumPy 1.8.0. NumPy 1.7 has the issues referred to in the comments below.
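For reference, here is a minimal sketch of the pd.to_datetime route on a small, made-up frame of epoch seconds (the values and the time2 column name are just for illustration, not from the question's data):

import pandas as pd

# made-up epoch seconds, purely for illustration
data = pd.DataFrame({'time': [0, 3600, 86400]})

data['time2'] = pd.to_datetime(data.time, unit='s')
print(data['time2'].dtype)   # datetime64[ns]
print(data)

Either way the result is a datetime64[ns] column, so the timestamps can stay in a regular column instead of becoming the index.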
Related
I'm trying to calculate a rolling mean, max, min, and std for specific columns inside a time series pandas dataframe. But I keep getting NaN for the lagged values and I'm not sure how to fix it. My MWE is:
import numpy as np
import pandas as pd
# original data
df = pd.DataFrame()
np.random.seed(0)
days = pd.date_range(start='2015-01-01', end='2015-05-01', freq='1D')
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days)), 'col2': 20+np.random.randn(len(days)), 'col3': 50+np.random.randn(len(days))})
df = df.set_index('Date')
print(df.head(10))
def add_lag(dfObj, window):
    cols = ['col2', 'col3']
    for col in cols:
        rolled = dfObj[col].rolling(window)
        lag_mean = rolled.mean().reset_index()#.astype(np.float16)
        lag_max = rolled.max().reset_index()#.astype(np.float16)
        lag_min = rolled.min().reset_index()#.astype(np.float16)
        lag_std = rolled.std().reset_index()#.astype(np.float16)
        dfObj[f'{col}_mean_lag{window}'] = lag_mean[col]
        dfObj[f'{col}_max_lag{window}'] = lag_max[col]
        dfObj[f'{col}_min_lag{window}'] = lag_min[col]
        dfObj[f'{col}_std_lag{window}'] = lag_std[col]
# add lag feature for 1 day, 3 days
add_lag(df, window=1)
add_lag(df, window=3)
print(df.head(10))
print(df.tail(10))
Just don't do reset_index(). Then it works.
import numpy as np
import pandas as pd
# original data
df = pd.DataFrame()
np.random.seed(0)
days = pd.date_range(start='2015-01-01', end='2015-05-01', freq='1D')
df = pd.DataFrame({'Date': days, 'col1': np.random.randn(len(days)), 'col2': 20+np.random.randn(len(days)), 'col3': 50+np.random.randn(len(days))})
df = df.set_index('Date')
print(df.head(10))
def add_lag(dfObj, window):
    cols = ['col2', 'col3']
    for col in cols:
        rolled = dfObj[col].rolling(window)
        lag_mean = rolled.mean()#.reset_index()#.astype(np.float16)
        lag_max = rolled.max()#.reset_index()#.astype(np.float16)
        lag_min = rolled.min()#.reset_index()#.astype(np.float16)
        lag_std = rolled.std()#.reset_index()#.astype(np.float16)
        dfObj[f'{col}_mean_lag{window}'] = lag_mean#[col]
        dfObj[f'{col}_max_lag{window}'] = lag_max#[col]
        dfObj[f'{col}_min_lag{window}'] = lag_min#[col]
        dfObj[f'{col}_std_lag{window}'] = lag_std#[col]
# add lag feature for 1 day, 3 days
add_lag(df, window=1)
add_lag(df, window=3)
print(df.head(10))
print(df.tail(10))
Whenever you use the rolling function, it creates NaN for the values that it cannot calculate.
For example, consider a single column, col1 = [2, 4, 10, 6], and a rolling window of 2.
The output of the rolling window will be NaN, 3, 7, 8.
This is because the rolling average of the first value cannot be calculated: the window looks at that index and the previous value, and there is no previous value.
Then, when you calculate the mean, std, etc., you are applying series functions without accounting for the NaN. In R you can usually just pass na.rm=T; in Python it is recommended that you drop the NaN values first and then apply the series function.
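As a quick illustration of that NaN behaviour (a sketch on a made-up column, not the question's data):

import pandas as pd

col1 = pd.Series([2, 4, 10, 6])

# default rolling window of 2: the first value has no predecessor, so it is NaN
print(col1.rolling(2).mean())   # NaN, 3.0, 7.0, 8.0

# min_periods=1 lets the first window be computed from the single value available
print(col1.rolling(2, min_periods=1).mean())   # 2.0, 3.0, 7.0, 8.0

# or drop the NaN before any further calculations
print(col1.rolling(2).mean().dropna())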
Please help me, I have a problem I have been stuck on for about two weeks.
I want to use "apply" on a DataFrame that I got from the Alpha Vantage API.
I want to apply the Euclidean distance to each row of the DataFrame.
import math
import numpy as np
import pandas as pd
from scipy.spatial import distance
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.neighbors import KNeighborsRegressor
from alpha_vantage.timeseries import TimeSeries
from services.KEY import getApiKey
ts = TimeSeries(key=getApiKey(), output_format='pandas')
And this is the chart I got:
My chart (sorry, I can't post the image because of my reputation)
In my code
stock, meta_data = ts.get_daily_adjusted(symbol, outputsize='full')
stock = stock.sort_values('date')
open = stock['1. open'].values
low = stock['3. low'].values
high = stock['2. high'].values
close = stock['4. close'].values
sorted_date = stock.index.get_level_values(level='date')
stock_numpy_format = np.stack((sorted_date, open, low,
                               high, close), axis=1)
df = pd.DataFrame(stock_numpy_format, columns=['date', 'open', 'low', 'high', 'close'])
df = df[df['open']>0]
df = df[(df['date'] >= "2016-01-01") & (df['date'] <= "2018-12-31")]
df = df.reset_index(drop=True)
df['close_next'] = df['close'].shift(-1)
df['daily_return'] = df['close'].pct_change(1)
df['daily_return'].fillna(0, inplace=True)
stock_numeric_close_dailyreturn = df['close', 'daily_return']
stock_normalized = (stock_numeric_close_dailyreturn - stock_numeric_close_dailyreturn.mean()) / stock_numeric_close_dailyreturn.std()
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized) , axis=1)
distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx":euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = df.loc[int(second_smallest)]["date"]
And I want my chart to look like this:
The chart that I want
And here is the code from that picture:
distance_columns = ['Close', 'DailyReturn']
stock_numeric = stock[distance_columns]
stock_normalized = (stock_numeric - stock_numeric.mean()) / stock_numeric.std()
stock_normalized.fillna(0, inplace = True)
date_normalized = stock_normalized[stock["Date"] == "2016-06-29"]
euclidean_distances = stock_normalized.apply(lambda row: distance.euclidean(row, date_normalized), axis = 1)
distance_frame = pandas.DataFrame(data = {"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_date = stock.loc[int(second_smallest)]["Date"]
I tried to figure it out: the "apply" on the DataFrame in pandas output format seems to behave differently from the one on a DataFrame read with the CSV reader.
Is there any alternative to get the same output from the different formats (pandas and CSV)?
Thank you!
NB: sorry if my English is bad.
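As a side note, a sketch on a small, made-up frame (not the Alpha Vantage data): selecting two columns needs a list of labels, i.e. df[['close', 'daily_return']] rather than df['close', 'daily_return'], and the reference row passed to scipy's euclidean has to be a 1-D vector. Something along these lines could be tried:

import pandas as pd
from scipy.spatial import distance

# made-up stand-in for the Alpha Vantage frame
df = pd.DataFrame({'close': [10.0, 10.5, 9.8, 10.2],
                   'daily_return': [0.0, 0.05, -0.0667, 0.0408]})

# note the double brackets: a list of column labels
stock_numeric = df[['close', 'daily_return']]
stock_normalized = (stock_numeric - stock_numeric.mean()) / stock_numeric.std()
stock_normalized.fillna(0, inplace=True)

# reference row to compare against (row 0 here, purely for illustration)
date_normalized = stock_normalized.iloc[0]

euclidean_distances = stock_normalized.apply(
    lambda row: distance.euclidean(row, date_normalized), axis=1)
print(euclidean_distances)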
How is one intended to use the output of the pandas ewm().cov() function? I would presume there are functions that allow you to use it directly, in the form returned, for multiplication, but nothing I try seems to work.
For example, take a minimal use case: stock X and Y return time series in DF1, from which we estimate an EWMA covariance matrix. Then, to get the variance estimate for a portfolio with positions A and B (given in DF2), I need to compute $x^T C x$ at each date, but I can't find a command to do this without writing a for loop.
# Python 3.6, pandas 0.20
import pandas as pd
import numpy as np
np.random.seed(100)
DF1 = pd.DataFrame(dict(X = np.random.normal(size = 100), Y = np.random.normal(size = 100)))
DF2 = pd.DataFrame(dict(A = np.random.normal(size = 100), B = np.random.normal(size = 100)))
COV = DF1.ewm(10).cov()
print(DF1)
print(COV)
# All of the following are invalid
print(COV.dot(DF2))
print(DF2.dot(COV))
print(COV.multiply(DF2))
The best I can figure out is this ugly piece of code
COV.reset_index().rename(columns = dict(level_0 = "index", level_1 = "variable"), inplace = True)
DF2m = pd.melt(DF2.reset_index(), id_vars = "index").sort_values("index")
MDF = pd.merge(COV, DF2m, on=["index", "variable"])
VAR = MDF.groupby("index").apply(lambda x: np.dot(np.dot(x["value"], np.matrix([x["X"], x["Y"]])), x["value"])[0,0])
I hold out hope that there is a nice way to do this...
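One possible direction (a sketch, not a documented pandas call for this): since ewm().cov() here comes back as a frame indexed by (date, variable), as the reset_index above shows, its values can be reshaped into a stack of 2x2 matrices and the quadratic form $x^T C x$ computed per date with np.einsum. This assumes the rows are ordered by date and then by variable:

# reshape the MultiIndexed covariance output into (n_dates, 2, 2)
cov_stack = COV.values.reshape(len(DF2), 2, 2)
x = DF2.values  # positions per date, shape (n_dates, 2)

# x^T C x for every date in one vectorised step
var = np.einsum('ij,ijk,ik->i', x, cov_stack, x)
VAR2 = pd.Series(var, index=DF2.index)
print(VAR2.head())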
What I'm trying to do is plot a DataFrame, but I'm encountering some errors that I don't know how to solve.
Python Code:
import numpy as np
from datetime import date,time,datetime
import pandas as pd
import csv
df = pd.read_csv('MainD2.csv', parse_dates=['Time_Stamp'], infer_datetime_format=True)
df["Time_Stamp"] = pd.to_datetime(df["Time_Stamp"]) # convert to Datetime
df_filter = df[df["Curr"].le(3.0)] # new df with less or equal to 0.5
#print(df_filter)
where = (df_filter[df_filter["Time_Stamp"].diff().dt.total_seconds() > 1]["Time_Stamp"] - pd.Timedelta("1s")).astype(str).tolist() # Find where diff > 1 second
df_filter2 = df[df["Time_Stamp"].isin(where)] # Create new df with those
#print(df_filter2)
df_filter2["AC_Input_Current"] = 0.0 # Set c1 to 0.0
#df_filter2
df = df.set_index("Time_Stamp")
df_filter2 = df_filter2.set_index("Time_Stamp")
df.loc[df_filter2.index] = df_filter2
def getMask(start, end):
    mask = (df['Time_Stamp'] > start) & (df['Time_Stamp'] <= end)
    return mask
start = '2017-06-26 01:05:00'
end = '2017-06-26 01:20:00'
timerange = df.loc[getMask(start, end)]
timerange.plot(x='Time_Stamp', y='AC_Input_Current', style='-', color='black')

------------------ Plotting Part -------------------

timerange.plot(x='Time_Stamp', y='AC_Input_Current', style='-', color='black')
I encounter this error when trying to plot:
KeyError: 'Time_Stamp'
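For what it's worth, a minimal sketch on a made-up frame (not MainD2.csv) of the underlying fact: after set_index('Time_Stamp') that column lives in the index, so df['Time_Stamp'] raises a KeyError, while the index itself can still be used for masking and plotting:

import pandas as pd

# made-up stand-in for the real CSV data
df = pd.DataFrame({'Time_Stamp': pd.date_range('2017-06-26 01:00:00', periods=5, freq='T'),
                   'AC_Input_Current': [1.0, 2.0, 3.0, 2.5, 1.5]})
df = df.set_index('Time_Stamp')

# df['Time_Stamp'] would raise KeyError here; use the index instead
mask = (df.index > '2017-06-26 01:01:00') & (df.index <= '2017-06-26 01:04:00')
timerange = df.loc[mask]

# with a DatetimeIndex, plot() uses the index for the x axis by default
timerange.plot(y='AC_Input_Current', style='-', color='black')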
I have three DataFrames for three users with the same column names: time, compass data, accelerometer data, gyroscope data and camera panning information. I want to traverse all the DataFrames simultaneously to check, for a particular time, which user has performed camera panning, and return that user (i.e. in which DataFrame panning has been detected for that time). I have tried using dash to achieve parallelism, but in vain. Below is my code:
import pandas as pd
import glob
import numpy as np
import math
from scipy.signal import butter, lfilter
order=3
fs=30
cutoff=4.0
data=[]
gx=[]
gy=[]
g_x2=[]
g_y2=[]
dataList = glob.glob(r'C:\Users\chaitanya\Desktop\Thesis\*.csv')
for csv in dataList:
    data.append(pd.read_csv(csv))

for i in range(0, len(data)):
    data[i] = data[i].groupby("Time").agg(lambda x: x.value_counts().index[0])
    data[i].reset_index(level=0, inplace=True)

def butter_lowpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    nor = cutoff / nyq
    b, a = butter(order, nor, btype='low', analog=False)
    return b, a

def lowpass_filter(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = lfilter(b, a, data)
    return y

for i in range(0, len(data)):
    gx.append(lowpass_filter(data[i]["Gyro_X"], cutoff, fs, order))
    gy.append(lowpass_filter(data[i]["Gyro_Y"], cutoff, fs, order))
    g_x2.append(gx[i] * gx[i])
    g_y2.append(gy[i] * gy[i])

g_rad = [[] for _ in range(len(data))]
g_ang = [[] for _ in range(len(data))]

for i in range(0, len(data)):
    for j in range(0, len(data[i])):
        g_ang[i].append(math.degrees(math.atan(gy[i][j] / gx[i][j])))
    data[i]["Ang"] = g_ang[i]

panning = [[] for _ in range(len(data))]

for i in range(0, len(data)):
    for j in data[i]["Ang"]:
        if 0 - 30 <= j <= 0 + 30:
            panning[i].append("Panning")
        elif 180 - 30 <= j <= 180 + 30:
            panning[i].append("left")
        else:
            panning[i].append("None")
    data[i]["Panning"] = panning[i]

result = [[] for _ in range(len(data))]
for i in range(0, len(data)):
    result[i].append(data[i].loc[data[i]['Panning'] == 'Panning', 'Ang'])
I'm going to make the assumption that you want to traverse simultaneously in time. In any case, you want your three dataframes to have an index in the dimension you want to traverse.
I'll generate 3 dataframes with rows representing random seconds in a 9 second period.
Then, I'll align these with a pd.concat and ffill to be able to reference the last known data for any gaps.
seconds = pd.date_range('2016-08-31', periods=10, freq='S')
n = 6
ssec = seconds.to_series()
sidx = ssec.sample(n).index
df1 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])
df2 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])
df3 = pd.DataFrame(np.random.randint(1, 10, (n, 3)),
                   ssec.sample(n).index.sort_values(),
                   ['compass', 'accel', 'gyro'])
df4 = pd.concat([df1, df2, df3], axis=1, keys=['df1', 'df2', 'df3']).ffill()
df4
You can then proceed to walk through it via iterrows():
for tstamp, row in df4.iterrows():
    print(tstamp)
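As a usage note (a sketch building on the df4 above): because the concatenated frame carries a (frame, column) MultiIndex on its columns, each row can be split back out per user inside the loop, for example:

for tstamp, row in df4.iterrows():
    # row is indexed by (frame, column), so row['df1'] is user 1's readings at this time
    print(tstamp, row['df1']['gyro'], row['df2']['gyro'], row['df3']['gyro'])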