Using classes to operate on DataFrames in Python

In my project, using classes makes a lot of sense:
The class will contain time-series data which will be used repeatedly.
It will have methods operating on those time series that are called many times.
The problem is that I want to use pandas objects as the time-series container (in some instances a single time series, i.e. a pandas Series; in others multiple time series, i.e. a pandas DataFrame), and I have issues with returning and operating on pandas DataFrames within the class.
I've done some reading around this topic, but I couldn't find an elegant and simple solution. Can you please help?
I've prepared a simplified example of DataFrame operations on time series within a class.
class MyClass:
    def __init__(self, my_dataframe, time_series_type=None):
        self.my_dataframe = my_dataframe
        self.time_series_type = time_series_type

    def returns(self):
        r = self.my_dataframe.pct_change()
        r = r.fillna(0)  # fillna(..., inplace=True) returns None, so assign the result instead
        return r  # returns a pandas DataFrame

    def rebase_to_1(self):
        rebase_to_1 = self.returns() + 1  # returns is a method, so it has to be called
        rebase_to_1 = rebase_to_1.cumprod()
        return rebase_to_1  # returns a pandas DataFrame
Example of application:
a = MyClass(my_time_series1)
b = MyClass(my_time_series2)
# Show the rebased time series (pandas DataFrame) in a Jupyter notebook
a.rebase_to_1()
I have quite a lot of functions that I'm trying to put inside the class to streamline operating on time series. I hope the example above illustrates this well enough.
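Since the data is reused and the methods are called many times, one possible refinement (a sketch, not from the question, using functools.cached_property) is to compute the returns frame only once per instance and reuse it:

from functools import cached_property
import pandas as pd

class MyClass:
    def __init__(self, my_dataframe, time_series_type=None):
        self.my_dataframe = my_dataframe
        self.time_series_type = time_series_type

    @cached_property
    def returns(self) -> pd.DataFrame:
        # computed once, then cached on the instance
        return self.my_dataframe.pct_change().fillna(0)

    def rebase_to_1(self) -> pd.DataFrame:
        return (self.returns + 1).cumprod()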

Related

efficient way to find unique values within time windows in python?

I have a large pandas dataframe that contains data similar to the image attached.
I want to get a count of how many unique TN values exist within each 2-second window of the data. I've done this with a simple loop, but it is incredibly slow. Is there a better technique I can use?
My original code is:
uniqueTN = []
tmstart = 5400; tmstop = 86400
for tm in range(int(tmstart), int(tmstop), 2):
    df = rundf[(rundf['time'] >= (tm - 2)) & (rundf['time'] < tm)]
    uniqueTN.append(df['TN'].unique())
This solution would be fine if the data set were not so large.
Here is how you can use the groupby() method together with nunique().
rundf['time'] = (rundf['time'] // 2) * 2
grouped = rundf.groupby('time')['TN'].nunique()
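This floors each timestamp into its 2-second bin and counts the unique TN values per bin. If you prefer not to overwrite the original 'time' column, the same grouping can be done on a derived key (a small sketch, same assumptions as above):

grouped = rundf.groupby(rundf['time'] // 2 * 2)['TN'].nunique()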
An alternative is to use the pandas resample() method together with nunique().
grouped = rundf.resample('2S', on='time')['TN'].nunique()
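Note that resample() needs a datetime-like column or index. If 'time' holds plain seconds, as the loop in the question suggests, it can be converted first (a sketch, assuming the values are seconds since an arbitrary origin):

import pandas as pd

rundf['timestamp'] = pd.to_datetime(rundf['time'], unit='s')
grouped = rundf.resample('2S', on='timestamp')['TN'].nunique()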

List column names used by a custom function taking a pandas DataFrame as input

I am currently building a small app to more easily post-process data stored in HDF5 files (generated by proprietary 1D simulation software; the output is basically a very large table with ~10k variables, the y-axis corresponding to time). I want the user to be able to write custom functions generating series from the data (for plotting). My problem is that the HDF5 file can be quite large, and for performance reasons I would like to open it a single time and extract only the necessary data. The column names to load depend on the user-defined functions, so I need to inspect those functions and collect the calls to df.loc and df.__getitem__() in order to list the columns to extract. I am looking for the best way to do that.
Below is a simplified example of my current implementation, using a 'spy' object that wraps a dummy dataframe and lists the keys used when __getitem__ is called. Something similar could be done for the pandas DataFrame loc method.
import pandas as pd
from scipy import integrate

# Data for the example
data = {
    'TIME': [0, 1, 2, 3, 4, 5],
    'frac[air]': [0, 0, 0, 0, 0, 0],
    'frac[water]': [0.5, 0.4, 0.2, 0, 0, 0],
    'frac[oil]': [0.5, 0.6, 0.8, 1, 1, 1],
    'flowrate': [9999, 0, 0.1, 0.2, 0, 0],
}

def read_data(var_names, loc_filter=None):
    """Simple table reading here, but in the real application I am
    reading a large HDF5 file. The filter is typically used to
    select time slices (or slices corresponding to some states
    of the system described by the dataframe) through the pandas
    DataFrame loc method."""
    if loc_filter is None:
        df = pd.DataFrame(data=data)
    else:
        df = pd.DataFrame(data=data).loc[loc_filter]
    return df[var_names]

# User-defined functions using only a few columns of the large dataframe
# Constraint: new functions can be defined by the user! 2 examples below
def grp_liq(df, var_name):
    """Summing fractions corresponding to liquids"""
    return df[var_name + '[water]'] + df[var_name + '[oil]']

def integ(df, var_name):
    """Cumulative integration of a variable over time"""
    # note: cumtrapz is called cumulative_trapezoid in newer SciPy releases
    return pd.Series(integrate.cumtrapz(df[var_name], df['TIME']))

# Creation of a 'spy' object wrapping a dummy dataframe to be able to
# inspect functions
class DFProxyGetItemSpy:
    def __init__(self):
        self._var_names = []
        self._df_dum = pd.DataFrame({'dum': [1]})

    # __getitem__ defined to list the var_names used
    def __getitem__(self, key):
        if key not in self._var_names:
            self._var_names.append(key)
        # Returning a dummy series to limit the risk of errors being raised
        # if a DataFrame-specific attribute/method is used in the function
        return self._df_dum['dum']

    # __getattr__ defined to pass any attribute to the dummy DataFrame
    def __getattr__(self, name):
        return self._df_dum.__getattribute__(name)

    # Method to reset the spy object
    def reset(self):
        self._var_names = []

    @property
    def var_names(self):
        return self._var_names

##################
# DATA REQUEST 1 #
##################
# Spying on a user function to list the variables used
spy = DFProxyGetItemSpy()
grp_liq(spy, 'frac')
print(spy.var_names)
# Now that we know which variables to extract, we can actually
# read the data and do our operations
df = read_data(spy.var_names)
print(grp_liq(df, 'frac'))

##################
# DATA REQUEST 2 #
##################
spy.reset()
# Spying on a second user function, but this time we also want to
# set a filter --> this function needs to be spied on too ...
spy = DFProxyGetItemSpy()
integ(spy, 'flowrate')
my_filter = lambda df: df['TIME'] > 0
my_filter(spy)
print(spy.var_names)
# Now that we know which variables to extract, we can actually
# read the data and do our operations
df2 = read_data(spy.var_names, loc_filter=my_filter)
print(integ(df2, 'flowrate'))
I find this approach inelegant and I am looking for a better way to do it. I have looked a bit through the dask documentation to see whether the keys used in a function could be extracted from the task graph, but I was not successful. To formalize my question: what would be a good way to allow a user to define custom functions that manipulate a dataframe (using any function from pandas), while listing the necessary keys so as to limit the amount of data to load?
Thanks in advance for any answer; any advice is welcome too!
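For what it's worth, the two "data request" blocks above could be folded into a single reusable helper; a minimal sketch building on the question's own DFProxyGetItemSpy and read_data (the evaluate name and signature are hypothetical):

def evaluate(user_func, *args, loc_filter=None, **kwargs):
    # Spy on the user function (and the optional filter) to collect the
    # columns they touch, read only those columns, then run it for real.
    spy = DFProxyGetItemSpy()
    user_func(spy, *args, **kwargs)
    if loc_filter is not None:
        loc_filter(spy)
    df = read_data(spy.var_names, loc_filter=loc_filter)
    return user_func(df, *args, **kwargs)

# e.g. evaluate(integ, 'flowrate', loc_filter=lambda df: df['TIME'] > 0)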

How to handle dataframes as instance variables in python

All,
I didn't find similar questions online; if there are any, please kindly point me to them.
Here is what I am trying to do:
I have a class that builds a financial model in a data frame (so there are several tens of columns in the data frame).
Each column has a formula, and many of the columns use information from the prior columns.
Can someone please recommend a good way to approach this problem?
# import required libraries
import pandas as pd
import numpy as np

class DataFrameInClass:
    def __init__(self, a: int, b: int):
        self.a = a
        self.b = b
        self.df = pd.DataFrame()

    def get_df(self) -> pd.DataFrame:
        return pd.DataFrame(np.random.randint(self.a, self.b, size=(100, 4)), columns=list('ABCD'))

    def set_df(self):
        self.df = self.get_df()  # call the method; without () this would store the method itself

    def calculate_A_square(self) -> pd.Series:
        # Approach 1 - calculate each column individually
        df = self.df.copy(deep=True)
        df['A_Square'] = df['A'] ** 2
        return pd.Series(data=df['A_Square'], index=df.index)

    def calculate_A_cubed(self):
        # Approach 2 - modify the data frame in place
        self.df['A_Cubed'] = self.df['A'] ** 3
I am not sure whether it is good practice to just modify the dataframe within each method (can someone please help me understand that a little better), or whether I should recalculate the columns in each method (that also feels like a bad idea).
For example, in the case above, if the A_Square values were stored as an instance variable, I could calculate A_Cubed in one line. However, that would also mean my class would end up with very many instance variables.
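A minimal sketch (not from the question) of a third option, the "return a new frame" style: each step takes a DataFrame and returns an enriched copy, so nothing secretly mutates self.df and later columns can still build on earlier ones.

import numpy as np
import pandas as pd

def add_a_square(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(A_Square=df['A'] ** 2)

def add_a_cubed(df: pd.DataFrame) -> pd.DataFrame:
    # reuses the column added by the previous step
    return df.assign(A_Cubed=df['A_Square'] * df['A'])

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 4)), columns=list('ABCD'))
df = add_a_cubed(add_a_square(df))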

Better way to structure a series of df manipulations in your class

How do you better structure the code in your class so that it returns the df you want, without having a main method that calls a lot of other methods in sequential order? I find that in a lot of situations I arrive at this structure, and it seems bad. I have a df that I just keep overwriting with the results of other base functions (which I unit test) until I get what I want.
class A:
    def main(self):
        df = self.load_file_into_df()
        df = self.add_x_columns(df)
        df = self.calculate_y(df)
        df = self.calculate_consequence(df)
        ...
        return df

    def add_x_columns(self, df): ...
    def calculate_y(self, df): ...
    def calculate_consequence(self, df): ...

# now use it somewhere else
df = A().main()
pipe
One feature you may wish to utilize is pd.DataFrame.pipe. This is considered "pandorable" because it facilitates operator chaining.
In my opinion, you should separate reading data into a dataframe from manipulating the dataframe. For example:
class A:
    def main(self):
        df = self.load_file_into_df()
        df = df.pipe(self.add_x_columns)\
               .pipe(self.calculate_y)\
               .pipe(self.calculate_consequence)
        return df
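pipe() also forwards extra arguments to each step, which keeps parameterised transformations in the same chain; a small self-contained sketch with placeholder functions (not taken from the answer above):

import pandas as pd

def scale(df: pd.DataFrame, factor: float = 1.0) -> pd.DataFrame:
    return df.assign(x=df['x'] * factor)

def add_flag(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    return df.assign(flag=df['x'] > threshold)

df = pd.DataFrame({'x': [0.1, 0.6, 0.9]})
result = df.pipe(scale, factor=10).pipe(add_flag, threshold=5)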
compose
Function composition is not native to Python, but the third-party toolz library does offer this feature. It allows you to lazily define chained functions. Note the reversed order of operations, i.e. the last argument of compose is applied first.
from toolz import compose

class A:
    def main(self):
        df = self.load_file_into_df()
        transformer = compose(self.calculate_consequence,
                              self.calculate_y,
                              self.add_x_columns)
        df = df.pipe(transformer)
        return df
In my opinion, compose offers a flexible and adaptable solution. You can, for example, define any number of compositions and apply them selectively or repeatedly at various points in your workflow.
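A tiny self-contained check of that right-to-left order (a sketch with plain functions, not from the answer):

from toolz import compose

add_one = lambda x: x + 1
double = lambda x: x * 2
f = compose(double, add_one)  # add_one is applied first, then double
assert f(3) == 8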

dask.DataFrame.apply and variable length data

I would like to apply a function that returns a Series of variable length to a dask.DataFrame. An example to illustrate this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def generate_variable_length_series(x):
    '''returns a pd.Series of variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_variable_length_series, axis=1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this supposed to always work, or am I just lucky here? Does dask expect all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata if the number of columns is not known beforehand?
Background / use case: in my dataframe each row represents a simulation trial. The function I want to apply extracts time points of certain events from it. Since I do not know the number of events per trial in advance, I do not know how many columns the resulting dataframe will have.
Edit:
As MRocklin suggested, here is an approach that uses dask.delayed to compute the result:
import dask

# convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()
# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_variable_length_series, axis=1))
# use this function on every delayed object
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf_delayed]
# calculate the result. This gives a list of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)
# concatenate them
result = pd.concat(result)
Short answer
No, dask.dataframe does not support this
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html
