I've imported deque from collections to limit the size of my data frame. When new data comes in, the oldest entries should be progressively dropped.
Big Picture:
I'm creating a DataFrame of historical values for the previous 26 days, counting back from whatever the current day is.
Confusion:
I think my data arrives each minute as a Series, which I tried to cap in length using deque's maxlen. Then I tried loading the data into a DataFrame. However, I just get NaN values.
Code:
import numpy as np
import pandas as pd
from collections import deque
def initialize(context):
    context.stocks = (symbol('AAPL'))
def before_trading_start(context, data):
    data = data.history(context.stocks, 'close', 20, '1m').dropna()
    length = 5
    d = deque(maxlen = length)
    data = d.append(data)
    index = pd.DatetimeIndex(start='2016-04-03 00:00:00', freq='S', periods=length)
    columns = ['price']
    df = pd.DataFrame(index=index, columns=columns, data=data)
    print df
How can I get this to work?
Mike
If I understand the question correctly, you want to keep all the values from the last twenty-six days. Is the following function enough for you?
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        .loc[lambda x: x.index > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
If the dates are not in the index:
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        # the following line is changed for values in a specific column
        .loc[lambda x: x['column_with_date'] > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
Don't forget to change the hard coded timezone if you are not in France. :-)
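As a side note on the NaN values in the question: `deque.append` modifies the deque in place and returns `None`, so `data = d.append(data)` leaves `data` set to `None`, and a DataFrame built from `data=None` with a given index and columns is filled with NaN. The following is a minimal sketch of keeping a bounded buffer and building a frame from it. The prices are made-up illustration values, not Quantopian data.

```python
from collections import deque

import pandas as pd

# Bounded buffer: once 5 items are stored, appending drops the oldest one.
window = deque(maxlen=5)

# Hypothetical per-minute prices, used only for illustration.
for price in [100.0, 101.5, 99.8, 102.1, 103.0, 104.2, 105.7]:
    result = window.append(price)  # append() works in place and returns None
    assert result is None

# Build the DataFrame from the deque's contents, not from append()'s return value.
df = pd.DataFrame({'price': list(window)})
print(df)
```

In a real algorithm the deque would have to live somewhere persistent (for example on `context`) rather than being recreated on every call, otherwise it never holds more than the latest item.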
I'm preparing a big multivariate time series data set for a supervised learning task and I would like to create time shifted versions of my input features so my model also infers from past values. In pandas there's the shift(n) command that lets you shift a column by n rows. Is there something similar in vaex?
I could not find anything comparable in the vaex documentation.
No, we do not support that yet (https://github.com/vaexio/vaex/issues/660). Because vaex is extensible (see http://docs.vaex.io/en/latest/tutorial.html#Adding-DataFrame-accessors) I thought I would give you the solution in the form of that:
import vaex
import numpy as np
@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df
    def shift(self, column, n, inplace=False):
        # make a copy without column
        df = self.df.copy().drop(column)
        # make a copy with just the column
        df_column = self.df[[column]]
        # slice off the head and tail
        df_head = df_column[-n:]
        df_tail = df_column[:-n]
        # stitch them together
        df_shifted = df_head.concat(df_tail)
        # and join (based on row number)
        return df.join(df_shifted, inplace=inplace)
x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df['shifted_y'] = df.y
df2 = df.mytool.shift('shifted_y', 2)
df2
It generates a single-column DataFrame, slices that up, concatenates and joins it back. All without a single memory copy.
I am assuming here a cyclic shift/rotate.
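To make the difference between a cyclic shift and a pandas-style shift concrete, here is a small sketch that uses only NumPy and pandas (not vaex). It shows that stitching the last `n` values in front of the rest is the same as `np.roll`, whereas `pandas.Series.shift` fills the vacated rows with NaN, or with a `fill_value` if one is given.

```python
import numpy as np
import pandas as pd

y = np.arange(10) ** 2
n = 2

# Cyclic shift: the last n values wrap around to the front.
cyclic = np.roll(y, n)

# The head/tail slicing used in the accessor produces the same result.
head, tail = y[-n:], y[:-n]
stitched = np.concatenate([head, tail])

# pandas shift is not cyclic: the first n rows become NaN ...
non_cyclic = pd.Series(y).shift(n)
# ... or a chosen fill value.
zero_filled = pd.Series(y).shift(n, fill_value=0)
```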
The function needs to be modified slightly in order to work in the latest release (vaex 4.0.0ax), see this thread.
Code by Maarten should be updated as follows:
import vaex
import numpy as np
@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df
    # mytool.shift is the analog of pandas.shift() but returns the shifted column under the specified name, to be joined to the initial df
    def shift(self, column, new_column, n, cyclic=True):
        df = self.df.copy().drop(column)
        df_column = self.df[[column]]
        if cyclic:
            df_head = df_column[-n:]
        else:
            df_head = vaex.from_dict({column: np.ma.filled(np.ma.masked_all(n, dtype=float), 0)})
        df_tail = df_column[:-n]
        df_shifted = df_head.concat(df_tail)
        df_shifted.rename(column, new_column)
        return df_shifted
x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df2 = df.join(df.mytool.shift('y', 'shifted_y', 2))
df2
I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate = p.RefDate) & (r.Item = p.Item) & (p.StartDate >= r.PeriodStartDate) & (p.EndDate <= r.PeriodEndDate)
Dummy data for this can be generated as follows:
import pandas as pd
import numpy as np
from datetime import datetime
######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01','2019-12-31')
p = pd.DataFrame(columns=['RefDate','Item','StartDate','EndDate','Val'])
for item in ['A','B','C','D']:
    for date in daily_range:
        daily_p = pd.DataFrame({ 'RefDate': rng,
                                 'Item':item,
                                 'StartDate':date,
                                 'EndDate':date,
                                 'Val' : np.random.randint(0,100,len(rng))})
        p = p.append(daily_p)
r = pd.DataFrame(columns=['RefDate','Item','PeriodStartDate','PeriodEndDate','AvgVal'])
for item in ['A','B','C','D']:
    r1 = pd.DataFrame({ 'RefDate': rng,
                        'Item':item,
                        'PeriodStartDate':'2019-10-25',
                        'PeriodEndDate':'2019-10-31',#datetime(2019,10,31),
                        'AvgVal' : 0})
    r = r.append(r1)
r.reset_index(drop=True,inplace=True)
######### END CREATION OF DUMMY DATA ##########
The code I currently use for the calculation, and whose performance I would like to improve, is as follows:
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that, when generating the r DataFrame, both PeriodStartDate and
PeriodEndDate are created as datetime values. See the following fragment of your
initialization code, as changed by me:
    r1 = pd.DataFrame({'RefDate': rng, 'Item':item,
        'PeriodStartDate': pd.to_datetime('2019-10-25'),
        'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I then set the index in both DataFrames to RefDate and Item
(the two columns compared for equality) and sorted by index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, the access by index is significantly quicker.
Then I defined the following function computing the mean for rows
from p "related to" the current row from r:
def myMean(row):
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
        pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing left to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine,
and mine ran almost 10 times faster.
Check on your own.
By using iterrows I managed to improve the performance, although there may still be quicker ways.
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price
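One possibly quicker alternative, not taken from any of the answers above, is to avoid the Python-level loop altogether: merge r with p on the equality keys, filter on the date range, and aggregate with groupby. The sketch below uses a much smaller dummy data set than the question and assumes that each (RefDate, Item) pair appears only once in r, as it does in the dummy data.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's dummy data (sizes reduced for illustration).
rng = pd.date_range('2019-01-01', '2019-01-03')
days = pd.date_range('2019-10-20', '2019-11-05')
p = pd.DataFrame(
    [(ref, item, d, d) for item in ['A', 'B'] for d in days for ref in rng],
    columns=['RefDate', 'Item', 'StartDate', 'EndDate'])
p['Val'] = np.random.default_rng(0).integers(0, 100, len(p))

r = pd.DataFrame([(ref, item) for item in ['A', 'B'] for ref in rng],
                 columns=['RefDate', 'Item'])
r['PeriodStartDate'] = pd.Timestamp('2019-10-25')
r['PeriodEndDate'] = pd.Timestamp('2019-10-31')

# 1) Pair every row of r with the rows of p sharing RefDate and Item.
merged = r.merge(p, on=['RefDate', 'Item'], how='left')

# 2) Keep only the p rows whose dates fall inside the period of the r row.
in_period = ((merged['StartDate'] >= merged['PeriodStartDate']) &
             (merged['EndDate'] <= merged['PeriodEndDate']))

# 3) Average Val per (RefDate, Item) and attach the result to r.
avg = (merged[in_period]
       .groupby(['RefDate', 'Item'])['Val']
       .mean()
       .rename('AvgVal')
       .reset_index())
r = r.merge(avg, on=['RefDate', 'Item'], how='left')
```

The intermediate merged frame has one row per matching (r row, p row) pair, so memory use grows with the number of matches. Whether this beats the indexed apply approach on the full data set would need to be timed.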
I'm working with a relatively large dataset (approx 5m observations, made up of about 5.5k firms).
I needed to run OLS regressions with a 60 month rolling window for each firm. I noticed that the performance was insanely slow when I ran the following code:
for idx, sub_df in master_df.groupby("firm_id"):
    # OLS code
However, when I first split my dataframe into about 5.5k dfs and then iterated over each of the dfs, the performance improved dramatically.
grouped_df = master_df.groupby("firm_id")
df_list = [group for group in grouped_df]
for df in df_list:
    my_df = df[1]
    # OLS code
I'm talking 1-2 weeks of time (24/7) to complete in the first version compared to 8-9 hours tops.
Can anyone please explain why splitting the master df into N smaller dfs and then iterating over each smaller df performs better than iterating over the same number of groups within the master df?
Thanks ever so much!
I'm unable to reproduce your observation. Here's some code that generates data and then times the direct and indirect methods separately. The time taken is very similar in either case.
Is it possible that you accidentally sorted the dataframe by the group key between the runs? Sorting by group key results in a noticeable difference in run time.
Otherwise, I'm beginning to think that there might be some other differences in your code. It would be great if you could post the full code.
import numpy as np
import pandas as pd
from datetime import datetime
def generate_data():
    ''' returns a Pandas DF with columns 'firm_id' and 'score' '''
    # configuration
    np.random.seed(22)
    num_groups = 50000 # number of distinct groups in the DF
    mean_group_length = 200 # how many records per group?
    cov_group_length = 0.10 # throw in some variability in the num records per group
    # simulate group lengths
    stdv_group_length = mean_group_length * cov_group_length
    group_lengths = np.random.normal(
        loc=mean_group_length,
        scale=stdv_group_length,
        size=(num_groups,)).astype(int)
    group_lengths[group_lengths <= 0] = mean_group_length
    # final length of DF
    total_length = sum(group_lengths)
    # compute entries for group key column
    firm_id_list = []
    for i, l in enumerate(group_lengths):
        firm_id_list.extend([(i + 1)] * l)
    # construct the DF; data column is 'score' populated with Numpy's U[0, 1)
    result_df = pd.DataFrame(data={
        'firm_id': firm_id_list,
        'score': np.random.rand(total_length)
    })
    # Optionally, shuffle or sort the DF by group keys
    # ALTERNATIVE 1: (badly) unsorted df
    result_df = result_df.sample(frac=1, random_state=13).reset_index(drop=True)
    # ALTERNATIVE 2: sort by group key
    # result_df.sort_values(by='firm_id', inplace=True)
    return result_df
def time_method(df, method):
    ''' time 'method' with 'df' as its argument '''
    t_start = datetime.now()
    method(df)
    t_final = datetime.now()
    delta_t = t_final - t_start
    print(f"Method '{method.__name__}' took {delta_t}.")
    return
def process_direct(df):
    ''' direct for-loop over groupby object '''
    for group, df in df.groupby('firm_id'):
        m = df.score.mean()
        s = df.score.std()
    return
def process_indirect(df):
    ''' indirect method: generate groups first as list and then loop over list '''
    grouped_df = df.groupby('firm_id')
    group_list = [pair for pair in grouped_df]
    for pair in group_list:
        m = pair[1].score.mean()
        s = pair[1].score.std()
df = generate_data()
time_method(df, process_direct)
time_method(df, process_indirect)
Here's what my data looks like:
There are daily records, except for a gap from 2017-06-12 to 2017-06-16.
df2['timestamp'] = pd.to_datetime(df['timestamp'])
df2['timestamp'] = df2['timestamp'].map(lambda x:
    datetime.datetime.strftime(x,'%Y-%m-%d'))
df2 = df2.convert_objects(convert_numeric = True)
df2 = df2.groupby('timestamp', as_index = False).sum()
I need to fill this missing gap and others with values for all fields (e.g. timestamp, temperature, humidity, light, pressure, speed, battery_voltage, etc...).
How can I accomplish this with Pandas?
This is what I have done before
weektime = pd.date_range(start = '06/04/2017', end = '12/05/2017', freq = 'W-SUN')
df['week'] = 'nan'
df['weektemp'] = 'nan'
df['weekhumidity'] = 'nan'
df['weeklight'] = 'nan'
df['weekpressure'] = 'nan'
df['weekspeed'] = 'nan'
df['weekbattery_voltage'] = 'nan'
for i in range(0,len(weektime)):
    df['week'][i+1] = weektime[i]
    df['weektemp'][i+1] = df['temperature'].iloc[7*i+1:7*i+7].sum()
    df['weekhumidity'][i+1] = df['humidity'].iloc[7*i+1:7*i+7].sum()
    df['weeklight'][i+1] = df['light'].iloc[7*i+1:7*i+7].sum()
    df['weekpressure'][i+1] = df['pressure'].iloc[7*i+1:7*i+7].sum()
    df['weekspeed'][i+1] = df['speed'].iloc[7*i+1:7*i+7].sum()
    df['weekbattery_voltage'][i+1] = df['battery_voltage'].iloc[7*i+1:7*i+7].sum()
    i = i + 1
The sums are not correct, because the value for 2017-06-17 is a sum over 2017-06-12 to 2017-06-16, and I do not want to add those again. This is not the only gap in the period, and I want to fill all of them.
Here is a function I wrote that might be helpful to you. It looks for inconsistent jumps in time and fills them in. After using this function, try using a linear interpolation function (pandas has a good one) to fill in your null data values. Note: Numpy arrays are much faster to iterate over and manipulate than Pandas dataframes, which is why I switch between the two.
import numpy as np
import pandas as pd
data_arr = np.array(your_df)
periodicity = 'daily'
def fill_gaps(data_arr, periodicity):
    rows = data_arr.shape[0]
    data_no_gaps = np.copy(data_arr) #avoid altering the thing you're iterating over
    data_no_gaps_idx = 0
    for row_idx in np.arange(1, rows): #iterate once for each row (except the first record; nothing to compare)
        oldtimestamp_str = str(data_arr[row_idx-1, 0])
        oldtimestamp = np.datetime64(oldtimestamp_str)
        currenttimestamp_str = str(data_arr[row_idx, 0])
        currenttimestamp = np.datetime64(currenttimestamp_str)
        period = currenttimestamp - oldtimestamp
        if period != np.timedelta64(900,'s') and period != np.timedelta64(3600,'s') and period != np.timedelta64(86400,'s'):
            if periodicity == 'quarterly':
                desired_period = 900
            elif periodicity == 'hourly':
                desired_period = 3600
            elif periodicity == 'daily':
                desired_period = 86400
            periods_missing = int(period / np.timedelta64(desired_period,'s'))
            for missing in np.arange(1, periods_missing):
                new_time_orig = str(oldtimestamp + missing*(np.timedelta64(desired_period,'s')))
                new_time = new_time_orig.replace('T', ' ')
                data_no_gaps = np.insert(data_no_gaps, (data_no_gaps_idx + missing),
                    np.array((new_time, np.nan, np.nan, np.nan, np.nan, np.nan)), 0) # INSERT VALUES YOU WANT IN THE NEW ROW
            data_no_gaps_idx += (periods_missing-1) #increment the index (zero-based => -1) in accordance with added rows
        data_no_gaps_idx += 1 #allow index to change as we iterate over original data array (main for loop)
    #create a dataframe:
    data_arr_no_gaps = pd.DataFrame(data=data_no_gaps, index=None,columns=['Time', 'temp', 'humidity', 'light', 'pressure', 'speed'])
    return data_arr_no_gaps
Fill time gaps and nulls
Use the function below to ensure expected date sequence exists, and then use forward fill to fill in nulls.
import pandas as pd
import os
def fill_gaps_and_nulls(df, freq='1D'):
    '''
    General steps:
    A) check for extra dates (out of expected frequency/sequence)
    B) check for missing dates (based on expected frequency/sequence)
    C) use forwardfill to fill nulls
    D) use backwardfill to fill remaining nulls
    E) append to file
    '''
    #rename the timestamp to 'date' (rename returns a new frame, so assign it back)
    df = df.rename(columns={"timestamp": "date"})
    #sort to make indexing faster
    df = df.sort_values(by=['date'], inplace=False)
    #create an artificial index of dates at frequency = freq, with the same beginning and ending as the original data
    all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq=freq)
    #record column names
    df_cols = df.columns
    #delete ffill_df.csv so we can begin anew
    try:
        os.remove('ffill_df.csv')
    except FileNotFoundError:
        pass
    #check for extra dates and/or dates out of order. print warning statement for log
    extra_dates = set(df.date).difference(all_dates)
    #if there are extra dates (outside of expected sequence/frequency), deal with them
    if len(extra_dates) > 0:
        #############################
        #INSERT DESIRED BEHAVIOR HERE
        print('WARNING: Extra date(s):\n\t{}\n\t Shifting highlighted date(s) back by 1 day'.format(extra_dates))
        for date in extra_dates:
            #shift extra dates back one day
            df.date[df.date == date] = date - pd.Timedelta(days=1)
        #############################
    #check the artificial date index against df to identify missing gaps in time and fill them with nulls
    gaps = all_dates.difference(set(df.date))
    print('\n-------\nWARNING: Missing dates: {}\n-------\n'.format(gaps))
    #if there are time gaps, deal with them
    if len(gaps) > 0:
        #initialize df of correct size, filled with nulls
        gaps_df = pd.DataFrame(index=gaps, columns=df_cols.drop('date')) #len(index) sets number of rows
        #give index a name
        gaps_df.index.name = 'date'
        #add the region and type
        #(r and t are not defined in this snippet; they come from code outside it)
        gaps_df.region = r
        gaps_df.type = t
        #remove that index so gaps_df and df are compatible
        gaps_df.reset_index(inplace=True)
        #append gaps_df to df
        new_df = pd.concat([df, gaps_df])
        #sort on date
        new_df.sort_values(by='date', inplace=True)
        #fill nulls
        new_df.fillna(method='ffill', inplace=True)
        new_df.fillna(method='bfill', inplace=True)
        #append to file
        new_df.to_csv('ffill_df.csv', mode='a', header=False, index=False)
    #(regions and types are likewise not defined in this snippet)
    return df_cols, regions, types, all_dates
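For the simple case of one record per day with some days missing, a shorter route is pandas' own reindexing: put the timestamps in the index, call `asfreq('D')` to insert the missing days as NaN rows, and then fill them. The sketch below is a minimal illustration with made-up temperature values. It assumes the timestamp column is already datetime and has no duplicate days, which is not guaranteed for the data in the question.

```python
import pandas as pd

# Hypothetical daily readings with a gap from 2017-06-12 to 2017-06-16.
df = pd.DataFrame({
    'timestamp': pd.to_datetime(
        ['2017-06-10', '2017-06-11', '2017-06-17', '2017-06-18']),
    'temperature': [20.0, 21.0, 27.0, 26.0],
})

# Insert one row per missing day; the new rows hold NaN.
daily = df.set_index('timestamp').asfreq('D')

# Fill the gap, here by linear interpolation between the known neighbours.
filled = daily.interpolate(method='linear')

# Forward fill is an alternative if repeating the last value is preferred.
forward_filled = daily.ffill()
```

Which fill method is appropriate depends on the quantity: interpolation may suit temperature, while for something like a daily total it may be better to leave the gap as NaN so that weekly sums do not count invented values.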
I have a dataframe of 600 000 x/y points with date-time information, along with another field, 'status', that holds extra descriptive information.
My objective is, for each record:
sum column 'status' by records that are within a certain spatial temporal buffer
the specific buffer is within t - 8 hours and < 100 meters
Currently I have the data in a pandas data frame.
I could loop through the rows and, for each record, subset the dates of interest, then calculate the distances and restrict the selection further. However, that would still be quite slow with so many records.
THIS TAKES 4.4 hours to run.
I can see that I could create a 3 dimensional kdtree with x, y, date as epoch time. However, I am not certain how to restrict the distances properly when incorporating dates and geographic distances.
Here is some reproducible code for you guys to test on:
Import
import numpy.random as npr
import numpy as np
import pandas as pd
from pandas import DataFrame, date_range
from datetime import datetime, timedelta
Create data
np.random.seed(111)
Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a date range with hour frequency
        date = date_range(start='10/1/2012', end='10/31/2012', freq='H')
        # Create long lat data
        laty = npr.normal(4815862, 5000,size=len(date))
        longx = npr.normal(687993, 5000,size=len(date))
        # status of interest
        status = [0,1]
        # Make a random list of statuses
        random_status = [status[npr.randint(low=0,high=len(status))] for i in range(len(date))]
        # user pool
        user = ['sally','derik','james','bob','ryan','chris']
        # Make a random list of users
        random_user = [user[npr.randint(low=0,high=len(user))] for i in range(len(date))]
        Output.extend(zip(random_user, random_status, date, longx, laty))
    return pd.DataFrame(Output, columns = ['user', 'status', 'date', 'long', 'lat'])
#Create data
data = CreateDataSet(3)
len(data)
#some time deltas
before = timedelta(hours = 8)
after = timedelta(minutes = 1)
Function to speed up
def work(df):
    output = []
    #loop through data index's
    for i in range(0, len(df)):
        l = []
        #first we will filter out the data by date to have a smaller list to compute distances for
        #create a mask to query all dates between range for date i
        date_mask = (df['date'] >= df['date'].iloc[i]-before) & (df['date'] <= df['date'].iloc[i]+after)
        #create a mask to query all users who are not user i (themselves)
        user_mask = df['user']!=df['user'].iloc[i]
        #apply masks
        dists_to_check = df[date_mask & user_mask]
        #for point i, create coordinate to calculate distances from
        a = np.array((df['long'].iloc[i], df['lat'].iloc[i]))
        #create array of distances to check on the masked data
        b = np.array((dists_to_check['long'].values, dists_to_check['lat'].values))
        #for j in the date queried data
        for j in range(1, len(dists_to_check)):
            #compute the Euclidean distance between point a and each point of b (the date masked data)
            x = np.linalg.norm(a-np.array((b[0][j], b[1][j])))
            #if the distance is within our range of interest append the index to a list
            if x <=100:
                l.append(j)
            else:
                pass
        try:
            #use the list of desired index's 'l' to query a final subset of the data
            data = dists_to_check.iloc[l]
            #summarize the column of interest then append to output list
            output.append(data['status'].sum())
        except IndexError as e:
            output.append(0)
            #print "There were no data to add"
    return pd.DataFrame(output)
Run code and time it
start = datetime.now()
out = work(data)
print(datetime.now() - start)
Is there a way to do this query in a vectorized way? Or should I be chasing another technique.
<3
Here is what at least somewhat solves my problem. Since the loop can operate on different parts of the data independently, parallelization makes sense here.
using Ipython...
from IPython.parallel import Client
cli = Client()
cli.ids
cli = Client()
dview=cli[:]
with dview.sync_imports():
    import numpy as np
    import os
    from datetime import timedelta
    import pandas as pd
#We also need to add the time deltas and output list into the function as
#local variables as well as add the Ipython.parallel decorator
@dview.parallel(block=True)
def work(df):
    before = timedelta(hours = 8)
    after = timedelta(minutes = 1)
    output = []
Final time: 1:17:54.910206, about 1/4 of the original time.
I would still be very interested for anyone to suggest small speed improvements within the body of the function.
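One possible improvement inside the function body, offered as a sketch and not as a tested replacement for the code above: sort the data by date once, use `np.searchsorted` to find each row's time window as a pair of positions, and compute all distances inside that window with a single vectorized expression instead of an inner Python loop. The function name and its parameters are hypothetical. Unlike the original inner loop, which starts at `j = 1`, this version also checks the first row of each window.

```python
import numpy as np
import pandas as pd

def sum_status_in_buffer(df, radius=100.0,
                         before=np.timedelta64(8, 'h'),
                         after=np.timedelta64(1, 'm')):
    """For each row, sum 'status' over other users' rows that lie within
    `radius` and within [date - before, date + after]."""
    df = df.sort_values('date')
    t = df['date'].to_numpy()
    xy = df[['long', 'lat']].to_numpy(dtype=float)
    user = df['user'].to_numpy()
    status = df['status'].to_numpy()

    # Positions bounding each row's time window in the sorted array.
    lo = np.searchsorted(t, t - before, side='left')
    hi = np.searchsorted(t, t + after, side='right')

    out = np.zeros(len(df), dtype=np.int64)
    for i in range(len(df)):
        window = slice(lo[i], hi[i])
        # Euclidean distance from row i to every row in its time window.
        dist = np.hypot(*(xy[window] - xy[i]).T)
        keep = (dist <= radius) & (user[window] != user[i])
        out[i] = status[window][keep].sum()

    # Return the sums aligned with the original index order.
    return pd.Series(out, index=df.index).sort_index()

# Tiny hand-made example (values are illustrative only).
example = pd.DataFrame({
    'user': ['a', 'b', 'c', 'a', 'b'],
    'status': [1, 1, 1, 1, 0],
    'date': pd.to_datetime(['2012-10-01 00:00', '2012-10-01 01:00',
                            '2012-10-01 02:00', '2012-10-01 20:00',
                            '2012-10-01 03:00']),
    'long': [0.0, 50.0, 5000.0, 0.0, 0.0],
    'lat': [0.0, 0.0, 0.0, 0.0, 60.0],
})
result = sum_status_in_buffer(example)
```

The outer loop is still in Python, so this mainly removes the per-row boolean masks over the whole frame and the inner distance loop. It can also be combined with the parallel approach above, since each row is still processed independently.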