vaex: shift column by n steps - python

I'm preparing a big multivariate time series data set for a supervised learning task and I would like to create time shifted versions of my input features so my model also infers from past values. In pandas there's the shift(n) command that lets you shift a column by n rows. Is there something similar in vaex?
I could not find anything comparable in the vaex documentation.

No, we do not support that yet (https://github.com/vaexio/vaex/issues/660). Because vaex is extensible (see http://docs.vaex.io/en/latest/tutorial.html#Adding-DataFrame-accessors), I thought I would provide the solution in that form:
import vaex
import numpy as np

@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df

    def shift(self, column, n, inplace=False):
        # make a copy without the column
        df = self.df.copy().drop(column)
        # make a copy with just the column
        df_column = self.df[[column]]
        # slice off the head and tail
        df_head = df_column[-n:]
        df_tail = df_column[:-n]
        # stitch them together
        df_shifted = df_head.concat(df_tail)
        # and join (based on row number)
        return df.join(df_shifted, inplace=inplace)

x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df['shifted_y'] = df.y
df2 = df.mytool.shift('shifted_y', 2)
df2
It creates a single-column dataframe, slices it up, concatenates the pieces, and joins the result back in (based on row number), all without a single memory copy.
Note that I am assuming a cyclic shift/rotate here.
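For the toy frame above, this gives a cyclic rotation by two rows; the expected result works out as:

# y               -> [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# df2['shifted_y'] -> [64, 81, 0, 1, 4, 9, 16, 25, 36, 49]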

The function needs to be modified slightly in order to work in the latest release (vaex 4.0.0ax), see this thread.
The code by Maarten should be updated as follows:
import vaex
import numpy as np

@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df

    # mytool.shift is the analogue of pandas.shift(), but it adds the shifted
    # column with the specified name to the end of the initial df
    def shift(self, column, new_column, n, cyclic=True):
        df = self.df.copy().drop(column)
        df_column = self.df[[column]]
        if cyclic:
            df_head = df_column[-n:]
        else:
            # pad the head with n zeros instead of rotating
            df_head = vaex.from_dict({column: np.ma.filled(np.ma.masked_all(n, dtype=float), 0)})
        df_tail = df_column[:-n]
        df_shifted = df_head.concat(df_tail)
        df_shifted.rename(column, new_column)
        return df_shifted

x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df2 = df.join(df.mytool.shift('y', 'shifted_y', 2))
df2
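For a pandas-style shift without wraparound, the same accessor can be called with cyclic=False, which (per the code above) pads the head with zeros instead of rotating; a short usage sketch:

# non-cyclic: the first two rows of shifted_y come out as 0.0
df3 = df.join(df.mytool.shift('y', 'shifted_y', 2, cyclic=False))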

Related

Can someone explain how to write a function so that the Function Transformer from sklearn understands it

I have several functions that need to go into a pipeline for an assignment, for example:
def Android_iOs_device_os_cange(df: pd.DataFrame) -> pd.DataFrame:
    import numpy as np
    df = df.copy()

    def foung_android_list(df):
        list_for_android = list(df[df['device_os'] == 'Android'].device_brand.unique())
        list_for_android.remove('(not set)')
        return list_for_android

    def foung_iOS_list(df):
        list_for_iOS = list(df[df['device_os'] == 'iOS'].device_brand.unique())
        list_for_iOS.remove('(not set)')
        return list_for_iOS

    df.loc[:, 'device_os'] = np.where(df['device_brand'].isin(foung_iOS_list(df)) & (df['device_os'].isnull()), 'iOS',
                                      df['device_os'])
    df.loc[:, 'device_os'] = np.where(df['device_brand'].isin(foung_android_list(df)) & (df['device_os'].isnull()), 'Android',
                                      df['device_os'])
    df.loc[:, 'device_os'] = np.where(df['device_os'].isnull(), '(not set)', df['device_os'])
    print(df)
    return df
This function changes all empty values in the device_os column to Android or iOS, depending on which phone brand the client specified, or leaves '(not set)' if the device_brand entry is empty. As a standalone function in Jupyter Lab the code runs fine, but when I inserted it into the pipeline, the code raises an error on 'device_brand', i.e. it does not find such a column in the DataFrame.
Because this is a data science task, I have a data preprocessing function that I run outside the pipeline, since the target variable for X and y comes from another dataframe. In theory I could shove everything into the preprocessing function and run the code that way, but the task is the task: if I move Android_iOs_device_os_cange into the preprocessing function, then the from_float_to_int function follows:
def from_float_to_int(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for i in df.columns:
        if df[i].dtype == 'float64':
            df[i] = df[i].astype(int)
    print(df)
    return df
As is easy to guess, it changes the data type from float to int. I had the idea of using a column selector to replace the types instead, but I am not sure that strategy is correct, and at the end there are functions that cannot be replaced so easily.
I then run into the problem that the pipeline does not handle functions built on df.columns, and without that I cannot perform what are probably the most important functions: prepare_for_ohe, make_standard_scatter and Labelencoder_select. The first keeps, in every column except one, only the values that occur more than 80 times; the second applies a StandardScaler to certain numeric columns; and the last applies a LabelEncoder to the categorical columns. All of them rely on df.columns, and if I already hit this problem in from_float_to_int, it would be naive to think it won't appear in the functions below:
def make_standard_scatter(df: pd.DataFrame) -> pd.DataFrame:
    from sklearn.preprocessing import StandardScaler
    df = df.copy()
    data_1 = df[['count_of_action', 'hit_time']]
    std_scaler = StandardScaler()
    std_scaler.fit(data_1)
    std_scaled = std_scaler.transform(data_1)
    list1 = ['count_of_action', 'hit_time']
    list2 = []
    for name in list1:
        std_name = name + '_std'
        list2.append(std_name)
    df[list2] = std_scaled
    print(df)
    return df

def prepare_for_ohe(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df[[i for i in df.columns if i != 'hit_time']] = df[[i for i in df.columns if i != 'hit_time']].apply(
        lambda x: x.where(x.map(x.value_counts()) > 80))
    df = df.dropna()
    return df

def Labelencoder_select(df: pd.DataFrame) -> pd.DataFrame:
    from sklearn.preprocessing import LabelEncoder
    df = df.copy()
    list1 = [i for i in df.columns if (i.split('_')[0] in 'utm') or (i.split('_')[0] in 'device') or (i.split('_')[0] in 'geo')]
    df[list1] = df[list1].apply(LabelEncoder().fit_transform)
    print(df)
    return df
What I am really asking is: how do I write these functions so that column access (df.columns) and row-wise changes are handled correctly by a Pipeline?
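One common pattern, sketched here under the assumption that a pandas DataFrame reaches every step (an error on 'device_brand' typically means an upstream step has already converted the data to a NumPy array); the step names below are hypothetical:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Each step wraps one of the DataFrame-in/DataFrame-out functions defined above.
# With the default validate=False, FunctionTransformer passes the DataFrame
# through untouched, so df.columns and .loc-based row edits keep working.
pipe = Pipeline([
    ('fill_device_os', FunctionTransformer(Android_iOs_device_os_cange)),
    ('float_to_int', FunctionTransformer(from_float_to_int)),
    ('rare_values', FunctionTransformer(prepare_for_ohe)),
    ('label_encode', FunctionTransformer(Labelencoder_select)),
])

df_out = pipe.fit_transform(df)  # df must be passed in as a DataFrame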

How to optimize dataframe iteration in pandas?

I need to iterate a dataframe; for each row I need to create an ID based on two existing columns: name and sex. Eventually I add this new column to the df.
df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)
row_ids = []
for index, row in df.iterrows():
    if (index % 1000) == 0:
        print("Row node index: {}".format(str(index)))
    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)
df['id'] = row_ids
Is there a way to make it much faster without going row by row?
Add more info based on suggested solutions:
Use apply instead:
def func(x):
    if (x.name % 1000) == 0:
        print("Row node index: {}".format(str(x.name)))
    calculated_id = get_id(x['name'], x['sex'])
    return calculated_id

df['id'] = df.apply(func, axis=1)
If you are working on a large dataset then np.vectorize() should help bypass the apply() overhead, which should be a bit faster.
import numpy as np

v = np.vectorize(lambda name, sex: get_id(name, sex))
df['id'] = v(df['name'], df['sex'])
Edit:
To get even more of a speed up you could also just pass the function get_id instead of using a lambda function and pass df.*.values instead of df.*.
v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
Instead of printing updates about the progression through the process try using tqdm to show the progression using a progress bar.
import numpy as np
from tqdm import tqdm

@np.vectorize
def get_id(name, sex):
    global pbar
    ...
    pbar.update(1)
    ...
    return

with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)
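Alternatively, tqdm ships a pandas integration, so a progress bar can be attached to a row-wise apply without a global counter; a minimal sketch (slower than the vectorized call above):

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # registers .progress_apply on pandas objects

df['id'] = df.progress_apply(lambda r: get_id(r['name'], r['sex']), axis=1)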

Python sort out columns in DataFrame for OLS regression

I have a csv file with the following columns:
Date|Mkt-RF|SMB|HML|RF|C|aig-RF|ford-RF|ibm-RF|xom-RF|
I am trying to run a multiple OLS regression in python, regressing 'Mkt-RF', 'SMB' and 'HML' on 'aig-RF' for instance.
It seems like I need to first construct the DataFrame from the arrays, but I cannot seem to understand how:
# Regression
x = df[['Mkt-RF','SMB','HML']]
y = df['aig-RF']
df = pd.DataFrame({'x':x, 'y':y})
df['constant'] = 1
df.head()
sm.OLS(y,df[['constant','x']]).fit().summary()
The full code is:
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn import linear_model
import statsmodels.api as sm
def ReadFF(sIn):
    """
    Purpose:
        Read the FF data
    Inputs:
        sIn     string, name of input file
    Return value:
        df      dataframe, data
    """
    df = pd.read_csv(sIn, header=3, names=["Date", "Mkt-RF", "SMB", "HML", "RF"])
    df = df.dropna(how='any')
    # Reformat the dates, as date-time, and place them as index
    vDate = pd.to_datetime(df["Date"].values, format='%Y%m%d')
    df.index = vDate
    # Add in a constant
    iN = len(vDate)
    df["C"] = np.ones(iN)
    print(df)
    return df

def JoinStock(df, sStock, sPer):
    """
    Purpose:
        Join the stock into the dataframe, as excess returns
    Inputs:
        df      dataframe, data including RF
        sStock  string, name of stock to read
        sPer    string, extension indicating period
    Return value:
        df      dataframe, enlarged
    """
    df1 = pd.read_csv(sStock + "_" + sPer + ".csv", index_col="Date", usecols=["Date", "Adj Close"])
    df1.columns = [sStock]
    # Add prices to original dataframe, to get correct dates
    df = df.join(df1, how="left")
    # Extract returns
    vR = 100 * np.diff(np.log(df[sStock].values))
    # Add a missing, as one observation was lost differencing
    vR = np.hstack([np.nan, vR])
    # Add excess return to dataframe
    df[sStock + "-RF"] = vR - df["RF"]
    print(df)
    return df

def SaveFF(df, asStock, sOut):
    """
    Purpose:
        Save data for FF regressions
    Inputs:
        df      dataframe, all data
        asStock list of strings, stocks
        sOut    string, output file name
    Output:
        file written to disk
    """
    df = df.dropna(how='any')
    asOut = ['Mkt-RF', 'SMB', 'HML', 'RF', 'C']
    for sStock in asStock:
        asOut.append(sStock + "-RF")
    print("Writing columns ", asOut, "to file ", sOut)
    df.to_csv(sOut, columns=asOut, index_label="Date", float_format="%.8g")
    print(df)
    return df

def main():
    sPer = "0018"
    sIn = "Research_Data_Factors_weekly.csv"
    sOut = "ffstocks"
    asStock = ["aig", "ford", "ibm", "xom"]
    # Initialisation
    df = ReadFF(sIn)
    for sStock in asStock:
        df = JoinStock(df, sStock, sPer)
    # Output
    SaveFF(df, asStock, sOut + "_" + sPer + ".csv")
    print("Done")
    # Regression
    x = df[['Mkt-RF', 'SMB', 'HML']]
    y = df['aig-RF']
    df = pd.DataFrame({'x': x, 'y': y})
    df['constant'] = 1
    df.head()
    sm.OLS(y, df[['constant', 'x']]).fit().summary()
What exactly do I need to modify in pd.DataFrame in order to get the multiple OLS regression table?
I propose changing the first chunk of your code to the following (mostly just reordering lines):
# add constant column to the original dataframe
df['constant'] = 1
# define x as a subset of original dataframe
x = df[['Mkt-RF', 'SMB', 'HML', 'constant']]
# define y as a series
y = df['aig-RF']
# pass x as a dataframe, while pass y as a series
sm.OLS(y, x).fit().summary()
Hope this helps.
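As a side note, statsmodels also has a helper for the constant column; a minimal sketch of the same regression using sm.add_constant (assuming df already contains the factor and excess-return columns):

import statsmodels.api as sm

# add_constant appends a 'const' column, replacing the manual df['constant'] = 1
X = sm.add_constant(df[['Mkt-RF', 'SMB', 'HML']])
y = df['aig-RF']
print(sm.OLS(y, X).fit().summary())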

Python looping and Pandas rank/index quirk

This question pertains to one posted here:
Sort dataframe rows independently by values in another dataframe
In the linked question, I utilize a Pandas Dataframe to sort each row independently using values in another Pandas Dataframe. The function presented there works perfectly every single time it is directly called. For example:
import pandas as pd
import numpy as np
import os
##Generate example dataset
d1 = {}
d2 = {}
d3 = {}
d4 = {}
## generate data:
np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    d2[col+'2'] = np.random.random_integers(0, 100, 12)
    d3[col+'3'] = np.random.random_integers(0, 100, 12)
    d4[col+'4'] = np.random.random_integers(0, 100, 12)
t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")
#place data into dataframes
dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)
dat3 = pd.DataFrame(d3, index = t_index)
dat4 = pd.DataFrame(d4, index = t_index)
## Functions
def sortByAnthr(X, Y, Xindex, Reverse=False):
    # order the subset of X.index by Y
    ordrX = [x for (x, y) in sorted(zip(Xindex, Y), key=lambda pair: pair[1], reverse=Reverse)]
    return(ordrX)

def OrderRow(row, df):
    ordrd_row = df.ix[row.dropna().name, row.dropna().values].tolist()
    return(ordrd_row)

def r_selectr(dat2, dat1, n, Reverse=False):
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x, dat2.loc[x.name, :], x.index, Reverse), axis=1).iloc[:, -n:]
    ordr_cols.columns = list(range(0, n))  # assign interpretable column names
    ordr_r = ordr_cols.apply(lambda x: OrderRow(x, dat1), axis=1)
    return([ordr_cols, ordr_r])
## Call functions
ordr_cols2,ordr_r = r_selectr(dat2,dat1,5)
##print output:
print("Ordering set:\n",dat2.iloc[-2:,:])
print("Original set:\n", dat1.iloc[-2:,:])
print("Column ordr:\n",ordr_cols2.iloc[-2:,:])
As can be checked, the columns of dat1 are correctly ordered according to the values in dat2.
However, when called from a loop over dataframes, it does not rank/index correctly and produces completely dubious results. Although I am not quite able to recreate the problem using the reduced version presented here, the idea should be the same.
## Loop test:
out_list = []
data_dicts = {'dat2': dat2, 'dat3': dat3, 'dat4': dat4}
for i in range(3):
    # this outer for loop supplies different parameter values to a wrapper
    # function that calls r_selectr.
    for key in data_dicts.keys():
        ordr_cols, _ = r_selectr(data_dicts[key], dat1, 5)
        out_list.append(ordr_cols)
        # do stuff here
# print output:
print("Ordering set:\n", dat3.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
In my code (almost completely analogous to the example given here), the ordr_cols are no longer ordered correctly for any of the sorting data frames.
I currently solve the issue by separating the ordering and indexing operations with r_selectr into two separate functions. That, for some reason, resolves the issue though I have no idea why.

Trying to use Deque to limit DataFrame of incoming data... suggestions?

I've imported deque from collections to limit the size of my data frame. When new data comes in, the oldest entries should be progressively dropped over time.
Big Picture:
I'm creating a data frame of historical values covering the previous 26 days, counted back from time "whatever day it is..."
Confusion:
I think my data arrives each minute in a Series format. I attempted to restrict its length with deque's maxlen and then load the data into a data frame, but I just get NaN values.
Code:
import numpy as np
import pandas as pd
from collections import deque
def initialize(context):
    context.stocks = (symbol('AAPL'))

def before_trading_start(context, data):
    data = data.history(context.stocks, 'close', 20, '1m').dropna()
    length = 5
    d = deque(maxlen=length)
    data = d.append(data)
    index = pd.DatetimeIndex(start='2016-04-03 00:00:00', freq='S', periods=length)
    columns = ['price']
    df = pd.DataFrame(index=index, columns=columns, data=data)
    print(df)
How can I get this to work?
Mike
If I understand the question correctly, you want to keep all the values of the last twenty-six days. Is the following function enough for you?
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        .loc[lambda x: x.index > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
If the dates are not in the index:
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        # the following line is changed to filter on a specific column
        .loc[lambda x: x['column_with_date'] > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
Don't forget to change the hard coded timezone if you are not in France. :-)
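For completeness, a minimal usage sketch of the function above (history and incoming_batches are hypothetical names; each batch is assumed to be a DataFrame indexed by timestamp):

import pandas as pd

history = pd.DataFrame()
for new_batch in incoming_batches:  # incoming_batches is a placeholder iterable
    # keep only the trailing window after folding in each new batch
    history = select_values_of_the_last_twenty_six_days(history, new_batch)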
