All,
I couldn't find similar questions online; if there are any, please kindly point me to them.
Here is what I am trying to do:
I have a class that builds a financial model in a DataFrame (so there are several dozen columns in the frame).
Each column has a formula, and many of the columns use the information in prior columns.
Can someone please recommend a good way to approach this problem?
# import required libraries
import pandas as pd
import numpy as np

class DataFrameInClass:
    def __init__(self, a: int, b: int):
        self.a = a
        self.b = b
        self.df = pd.DataFrame()

    def get_df(self) -> pd.DataFrame:
        return pd.DataFrame(np.random.randint(self.a, self.b, size=(100, 4)),
                            columns=list('ABCD'))

    def set_df(self):
        self.df = self.get_df()  # note the call: assigning self.get_df would store the method itself

    def calculate_A_square(self) -> pd.Series:
        # Approach 1 - calculate the column on a copy and return it
        df = self.df.copy(deep=True)
        df['A_Square'] = df['A'] ** 2
        return df['A_Square']

    def calculate_A_cubed(self):
        # Approach 2 - modify the stored data frame in place
        self.df['A_Cubed'] = self.df['A'] ** 3
I am not sure whether it is good practice to just modify the DataFrame within each method (can someone please help me understand that a little better?), or whether I should recalculate the columns in each method (which also feels like a bad idea).
For example, if the A_Square values were made an instance variable, I could calculate A_Cubed in one line. However, that would also mean my class would accumulate a large number of instance variables.
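For concreteness, here is a minimal sketch of the instance-variable alternative I mean, using functools.cached_property so each derived column is computed once and reused (the class and attribute names are just for illustration):

from functools import cached_property

import numpy as np
import pandas as pd

class DataFrameInClass2:
    def __init__(self, a: int, b: int):
        self.df = pd.DataFrame(np.random.randint(a, b, size=(100, 4)),
                               columns=list('ABCD'))

    @cached_property
    def a_square(self) -> pd.Series:
        # computed once on first access, then cached on the instance
        return self.df['A'] ** 2

    @cached_property
    def a_cubed(self) -> pd.Series:
        # the one-liner: reuse the cached square
        return self.a_square * self.df['A']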
I am currently building a small app to more easily post-process data stored in HDF5 files (generated by proprietary 1D simulation software; the output is basically a very large table with ~10k variables, the y-axis corresponding to time). I want the user to be able to write custom functions that generate series from the data (for plotting). My problem is that the HDF5 file can be quite large, and for performance reasons I would like to open it a single time and extract only the necessary data. The column names to load depend on the user-defined functions, so I need to inspect those functions and collect the calls to df.loc and df.__getitem__() in order to list the columns to extract. I am looking for the best way to do that.
Below is a simplified example of my current implementation, using a 'spy' object that wraps a dummy DataFrame and records the keys used when __getitem__ is called. Something similar could be done for the pandas DataFrame loc method.
import pandas as pd
from scipy import integrate

# Data for the example
data = {
    'TIME': [0, 1, 2, 3, 4, 5],
    'frac[air]': [0, 0, 0, 0, 0, 0],
    'frac[water]': [0.5, 0.4, 0.2, 0, 0, 0],
    'frac[oil]': [0.5, 0.6, 0.8, 1, 1, 1],
    'flowrate': [9999, 0, 0.1, 0.2, 0, 0],
}

def read_data(var_names, loc_filter=None):
    """Simple table reading here, but in the real application I am
    reading a large HDF5 file. The filter is typically used to
    select time slices (or slices corresponding to some states
    of the system described by the dataframe) through the pandas
    DataFrame loc method."""
    if loc_filter is None:
        df = pd.DataFrame(data=data)
    else:
        df = pd.DataFrame(data=data).loc[loc_filter]
    return df[var_names]
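As an aside, read_data relies on the fact that .loc accepts a callable: pandas calls it with the DataFrame and uses the returned boolean mask for the selection. A tiny self-contained check of that behavior:

import pandas as pd

df_check = pd.DataFrame({'TIME': [0, 1, 2], 'flowrate': [9999, 0, 0.1]})
print(df_check.loc[lambda d: d['TIME'] > 0])  # keeps only the rows where TIME > 0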
# User-defined functions using only a few columns of the large dataframe
# Constraint: new functions can be defined by the user! Two examples below.
def grp_liq(df, var_name):
    """Sum the fractions corresponding to liquids."""
    return df[var_name + '[water]'] + df[var_name + '[oil]']

def integ(df, var_name):
    """Cumulative integration of a variable over time."""
    return pd.Series(integrate.cumtrapz(df[var_name], df['TIME']))
# Creation of a 'spy' object wrapping a dummy dataframe to be able to
# inspect functions
class DFProxyGetItemSpy:
    def __init__(self):
        self._var_names = []
        self._df_dum = pd.DataFrame({'dum': [1]})

    # __getitem__ defined to record the var_names that are used
    def __getitem__(self, key):
        if key not in self._var_names:
            self._var_names.append(key)
        # Return a dummy series to limit the risk of an error being raised
        # if a DataFrame-specific attribute/method is used in the function
        return self._df_dum['dum']

    # __getattr__ defined to pass any attribute through to the dummy DataFrame
    def __getattr__(self, name):
        return getattr(self._df_dum, name)

    # Method to reset the spy object
    def reset(self):
        self._var_names = []

    @property
    def var_names(self):
        return self._var_names
##################
# DATA REQUEST 1 #
##################
# Spy on a user function to list the variables it uses
spy = DFProxyGetItemSpy()
grp_liq(spy, 'frac')
print(spy.var_names)

# Now that we know which variables to extract, we can actually
# read the data and do our operations
df = read_data(spy.var_names)
print(grp_liq(df, 'frac'))
##################
# DATA REQUEST 2 #
##################
spy.reset()
# Spy on a second user function, but this time we also want to
# apply a filter --> the filter function needs to be spied on too...
integ(spy, 'flowrate')
my_filter = lambda df: df['TIME'] > 0
my_filter(spy)
print(spy.var_names)

# Now that we know which variables to extract, we can actually
# read the data and do our operations
df2 = read_data(spy.var_names, loc_filter=my_filter)
print(integ(df2, 'flowrate'))
I find this approach inelegant and I am looking for a better way to do it. I have looked a bit in the dask documentation to see whether the keys used in a function could be extracted from the task graph, but I was not successful. To formalize my question: what would be a good way to let a user define custom functions that manipulate a DataFrame (using any pandas function), while listing the keys those functions need so that the amount of data loaded can be kept to a minimum?
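For reference, one direction I have also tried is static inspection of the user function's source with the standard-library ast module (a minimal sketch under strong assumptions: it only catches literal string subscripts on the argument named df, so dynamically built keys such as var_name + '[water]' are missed, and it relies on the Python 3.9+ AST layout):

import ast
import inspect
import textwrap

def collect_subscript_keys(func, df_arg_name='df'):
    """Collect literal string keys used as df[...] subscripts in the
    source of ``func``. Dynamically built keys are not detected."""
    tree = ast.parse(textwrap.dedent(inspect.getsource(func)))
    keys = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and node.value.id == df_arg_name
                and isinstance(node.slice, ast.Constant)
                and isinstance(node.slice.value, str)):
            keys.append(node.slice.value)
    return keys

print(collect_subscript_keys(lambda df: df['TIME'] > 0))  # ['TIME']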
Thanks in advance for any answer; any advice is welcome too!
I am creating a class to generate LaTeX tables automatically using pylatex, and within it I have a method that calculates the number of pandas DataFrame columns (x) and outputs a string of x 'c's. The x 'c's will eventually go into LaTeX code similar to this:
\begin{tabularx}{\textwidth}{@{}cccccccccc@{}}
While writing this code, I have been trying to follow the PEP guidelines (specifically PEP 8 for style, PEP 257 for docstrings and PEP 3107 for function annotations) and have run into a wall over how to declare a pandas DataFrame in a function annotation.
For example:
def _columns_calculator(self, df) -> str:
    """Return the LaTeX table alignment string (centered by default, 'c')."""
    columns = len(df.columns) - 1  # -1 to account for the 0-index column
    tabularx_columns = 'c' * columns
    return tabularx_columns
What type do I declare df to be, if any? Calling type() on a pandas DataFrame returns <class 'pandas.core.frame.DataFrame'>. I have other pylatex classes that are passed to functions too (namely doc = Document(), the base pylatex document class that the other functions need to add to). How, in general, should I annotate a class in a function signature (df: class is not correct, naturally, but shows the kind of answer this question seeks), and specifically a pandas DataFrame?
There could be other functions, not within a class, that take a DataFrame as input, so this question is not specific to classes. I searched on this site and through PEP 3107 and could not see an answer; apologies if this has been asked before. I would appreciate a pointer to the original if this is a duplicate or trivial.
This is how I do it:
import pandas as pd

def _columns_calculator(self, df: pd.DataFrame) -> str:
    """Return the LaTeX table alignment string (centered by default, 'c')."""
    columns = len(df.columns) - 1  # -1 to account for the 0-index column
    tabularx_columns = 'c' * columns
    return tabularx_columns
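The same pattern applies to any class, including the pylatex Document mentioned in the question: annotate with the class itself (add_table below is a hypothetical helper, just to illustrate):

from pylatex import Document

def add_table(doc: Document, alignment: str) -> None:
    """Hypothetical helper: the annotation is simply the class itself."""
    ...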
In my project, using classes makes a lot of sense:
The class will contain time-series data which will be used repeatedly.
It will have methods operating on those time series that are called many times.
The problem is that I want to use pandas DataFrames as the time-series container (in some instances it will be a single time series, a pandas Series; in others, multiple time series, a pandas DataFrame), and I have issues with returning and operating on pandas DataFrames within a class.
I've done some reading around this topic, but I couldn't find an elegant and simple solution. Can you please help?
I've prepared a simplified example of DataFrame operations on time series within a class.
class MyClass:
    def __init__(self, my_dataframe, time_series_type):
        self.my_dataframe = my_dataframe
        self.time_series_type = time_series_type

    def returns(self):
        r = self.my_dataframe.pct_change()
        r = r.fillna(0)  # note: fillna(..., inplace=True) would return None
        return r  # returns a pandas DataFrame

    def rebase_to_1(self):
        rebased = self.returns() + 1  # returns is a method, so it must be called
        rebased = rebased.cumprod()
        return rebased  # returns a pandas DataFrame
Example of application:
a = MyClass(my_time_series1, 'daily')  # __init__ also needs time_series_type; 'daily' is illustrative
b = MyClass(my_time_series2, 'daily')

# Show the rebased time series / pandas DataFrame in a Jupyter notebook
a.rebase_to_1()
I have quite a lot of functions that I'm trying to put inside the class to streamline operating on time series. I hope the example above illustrates this well enough.
I'm struggling to find a suitable way of using pandas to hold an array of quantities and their 1-sigma errors as a pd.Series data attribute. Currently I pass a pd.Series instance to a class and manipulate it using properties, etc.:
import numpy as np
import pandas as pd

class TimeSeriesHandler:
    def __init__(self, time_series: pd.Series):
        self._time_series = time_series
        self.sigmas = np.zeros(self.number_of_timestamps)

    @property
    def timestamps(self) -> np.ndarray:
        return self._time_series.index.values

    @property
    def number_of_timestamps(self) -> int:
        return self.timestamps.shape[0]
... and then use it as follows:
time_series = pd.Series(data=[3, 2, 3, 2], index=[1, 2, 3, 4])
time_series_handler = TimeSeriesHandler(time_series)
which represents the quantities that I'm interested in. But each of these quantities has a 1-sigma error of, say, sigmas = [1, 2, 1, 3].
Are there any suggestions on how I could manage this, please? Is it a good idea to have a separate time-series instance for the sigmas? That would mean unnecessarily duplicating the index values (the timestamps).
Thanks for any help here.
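One option that might fit (a minimal sketch under my own assumptions: the 'value' and 'sigma' column names and the TimeSeriesHandler2 name are made up for illustration) is to keep both arrays as columns of a single DataFrame, so the index is stored only once:

import numpy as np
import pandas as pd

class TimeSeriesHandler2:
    """Sketch: quantities and their 1-sigma errors share one index."""

    def __init__(self, time_series: pd.Series, sigmas=None):
        # Both arrays live in one DataFrame, so the timestamps are stored once
        self._data = pd.DataFrame({'value': time_series})
        self._data['sigma'] = np.zeros(len(self._data)) if sigmas is None else sigmas

    @property
    def values(self) -> pd.Series:
        return self._data['value']

    @property
    def sigmas(self) -> pd.Series:
        return self._data['sigma']

handler = TimeSeriesHandler2(pd.Series([3, 2, 3, 2], index=[1, 2, 3, 4]),
                             sigmas=[1, 2, 1, 3])
print(handler.sigmas)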
How do you better structure the code in your class so that it returns the df you want, without a main method that calls a lot of other methods in sequential order? I find that in a lot of situations I arrive at this structure, and it seems bad: I have a df that I just keep overwriting with the results of other base functions (which I unit test) until I get what I want.
class A:
    def main(self):
        df = self.load_file_into_df()
        df = self.add_x_columns(df)
        df = self.calculate_y(df)
        df = self.calculate_consequence(df)
        ...
        return df

    def add_x_columns(self, df): ...
    def calculate_y(self, df): ...
    def calculate_consequence(self, df): ...

# now use it somewhere else
df = A().main()
pipe
One feature you may wish to utilize is pd.DataFrame.pipe. This is considered "pandorable" because it facilitates method chaining.
In my opinion, you should separate reading data into a dataframe from manipulating the dataframe. For example:
class A:
    def main(self):
        df = self.load_file_into_df()
        df = (df.pipe(self.add_x_columns)
                .pipe(self.calculate_y)
                .pipe(self.calculate_consequence))
        return df
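As a side note, df.pipe(func) simply calls func(df) (forwarding any extra arguments), which is what makes the chain above equivalent to the sequential version in the question. A tiny self-contained check:

import pandas as pd

def double(df):
    return df * 2

frame = pd.DataFrame({'a': [1, 2]})
assert frame.pipe(double).equals(double(frame))  # pipe(func) is just func(frame)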
compose
Function composition is not native to Python, but the third-party toolz library offers this feature. It lets you lazily define chained functions. Note the reversed order of operations: the last argument of compose is applied first.
from toolz import compose

class A:
    def main(self):
        df = self.load_file_into_df()
        transformer = compose(self.calculate_consequence,
                              self.calculate_y,
                              self.add_x_columns)
        df = df.pipe(transformer)
        return df
In my opinion, compose offers a flexible and adaptable solution. You can, for example, define any number of compositions and apply them selectively or repeatedly at various points in your workflow.
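For instance, here is a small self-contained illustration of that idea (the stage functions are made up for the example; each takes and returns a DataFrame):

import pandas as pd
from toolz import compose

def fill_na_stage(df):
    return df.fillna(0)

def add_total_stage(df):
    out = df.copy()
    out['total'] = out.sum(axis=1)
    return out

# compose applies its last argument first: fill the NaNs, then add the total
clean_and_total = compose(add_total_stage, fill_na_stage)

frame = pd.DataFrame({'a': [1, None], 'b': [3, 4]})
print(frame.pipe(clean_and_total))  # the composition can be reused wherever needed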