I'm struggling to find a suitable way of using pandas to hold a quantities array together with its 1-sigma uncertainties. Currently I pass a pd.Series instance to a class and manipulate it using properties, etc.:
import numpy as np
import pandas as pd

class TimeSeriesHandler:
    def __init__(self, time_series: pd.Series):
        self._time_series = time_series
        # One sigma value per timestamp, initialised to zero.
        self.sigmas = np.zeros(self.number_of_timestamps)

    @property
    def timestamps(self) -> np.ndarray:
        return self._time_series.index.values

    @property
    def number_of_timestamps(self) -> int:
        return self.timestamps.shape[0]
... and then use it as follows:
time_series = pd.Series(data=[3,2,3,2], index=[1,2,3,4])
time_series_handler = TimeSeriesHandler(time_series)
which represents the quantities that I'm interested in. But each of these quantities has a 1-sigma error of, say, sigmas = [1, 2, 1, 3].
Are there any suggestions on how I should manage this, please? Is it a good idea to have a separate time series instance for the sigmas? That would mean unnecessarily duplicating the index values (the timestamps).
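For illustration, one alternative I have been considering is a single DataFrame holding both quantities on one shared index — a minimal sketch:

import pandas as pd

# One index (the timestamps), two aligned columns.
ts = pd.DataFrame(
    {'value': [3, 2, 3, 2], 'sigma': [1, 2, 1, 3]},
    index=[1, 2, 3, 4],
)
print(ts['value'] + ts['sigma'])  # aligned arithmetic comes for free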
Thanks for any help here.
I am currently building a small app to more easily post-process data stored in HDF5 files (generated by proprietary 1D simulation software; the output is basically a very large table with ~10k variables, the y-axis corresponding to time). I want the user to be able to write custom functions that generate series from the data (for plotting). My problem is that the HDF5 file can be quite large, and for performance reasons I would like to open it a single time and extract only the necessary data. The column names to load depend on the user-defined functions, so I need to inspect them and collect the calls to 'df.loc' and 'df.__getitem__' to list the columns to extract. I am looking for the best way to do that.
Below is a simplified example of my current implementation, using a 'spy' object that wraps a dummy dataframe and lists the keys used when '__getitem__' is called. Something similar could be done for the pandas 'loc' DataFrame method.
import pandas as pd
from scipy import integrate

# Data for the example
data = {
    'TIME': [0, 1, 2, 3, 4, 5],
    'frac[air]': [0, 0, 0, 0, 0, 0],
    'frac[water]': [0.5, 0.4, 0.2, 0, 0, 0],
    'frac[oil]': [0.5, 0.6, 0.8, 1, 1, 1],
    'flowrate': [9999, 0, 0.1, 0.2, 0, 0],
}
def read_data(var_names, loc_filter=None):
    """Simple table reading here, but in the real application I am
    reading a large HDF5 file. The filter is typically used to
    select time slices (or slices corresponding to some states
    of the system described by the dataframe) through the pandas
    DataFrame loc method."""
    if loc_filter is None:
        df = pd.DataFrame(data=data)
    else:
        df = pd.DataFrame(data=data).loc[loc_filter]
    return df[var_names]
# User-defined functions using only a few columns of the large dataframe.
# Constraint: new functions can be defined by the user! Two examples below.
def grp_liq(df, var_name):
    """Summing fractions corresponding to liquids"""
    return df[var_name + '[water]'] + df[var_name + '[oil]']

def integ(df, var_name):
    """Cumulative integration of a variable over time"""
    # Note: cumtrapz is named cumulative_trapezoid in SciPy >= 1.6.
    return pd.Series(integrate.cumtrapz(df[var_name], df['TIME']))
# Creation of a 'spy' object wrapping a dummy dataframe to be able to
# inspect functions
class DFProxyGetItemSpy:
    def __init__(self):
        self._var_names = []
        self._df_dum = pd.DataFrame({'dum': [1]})

    # __getitem__ defined to record the var_names used
    def __getitem__(self, key):
        if key not in self._var_names:
            self._var_names.append(key)
        # Returning a dummy series to limit the risk of an error being
        # raised if a DataFrame-specific attribute/method is used in
        # the function
        return self._df_dum['dum']

    # __getattr__ defined to pass any attribute to the dummy DataFrame
    def __getattr__(self, name):
        return getattr(self._df_dum, name)

    # Method to reset the spy object
    def reset(self):
        self._var_names = []

    @property
    def var_names(self):
        return self._var_names
##################
# DATA REQUEST 1 #
##################
# Spying a user function to list the variables used
spy = DFProxyGetItemSpy()
grp_liq(spy, 'frac')
print(spy.var_names)

# Now that we know which variables to extract, we can actually
# read the data and do our operations
df = read_data(spy.var_names)
print(grp_liq(df, 'frac'))

##################
# DATA REQUEST 2 #
##################
spy.reset()
# Spying a second user function, but this time we also want to
# set a filter --> this function needs to be spied too ...
integ(spy, 'flowrate')
my_filter = lambda df: df['TIME'] > 0
my_filter(spy)
print(spy.var_names)

# Now that we know which variables to extract, we can actually
# read the data and do our operations
df2 = read_data(spy.var_names, loc_filter=my_filter)
print(integ(df2, 'flowrate'))
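For completeness, a rough sketch of how the same trick might extend to 'loc' (the loc property below is a hypothetical addition; it only records plain keys, not boolean masks or slices):

class DFProxyLocSpy(DFProxyGetItemSpy):
    @property
    def loc(self):
        spy = self
        class _LocRecorder:
            # Record keys used via df.loc[...], like __getitem__ does.
            def __getitem__(self, key):
                if key not in spy._var_names:
                    spy._var_names.append(key)
                return spy._df_dum['dum']
        return _LocRecorder()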
I find this approach inelegant and I am looking for a better way to do it. I have looked a bit at the dask documentation to see whether the keys used in a function could be extracted from the task graph, but I was not successful. To formalize my question: what would be a good way to let a user define custom functions that manipulate a dataframe (using any function from pandas), while listing the necessary keys so as to limit the amount of data loaded?
Thanks in advance for any answer; any advice is welcome too!
All,
I don't see similar questions online; if there are, please kindly point me to them.
Here is what I am trying to do:
I have a class that builds a financial model in a dataframe (so there are several tens of columns in the dataframe).
Each column is computed by a formula, and many of the columns use the information in prior columns.
Can someone please recommend a good way to approach this problem?
# import required libraries
import pandas as pd
import numpy as np

class DataFrameInClass:
    def __init__(self, a: int, b: int):
        self.a = a
        self.b = b
        self.df = pd.DataFrame()

    def get_df(self) -> pd.DataFrame:
        return pd.DataFrame(np.random.randint(self.a, self.b, size=(100, 4)),
                            columns=list('ABCD'))

    def set_df(self):
        # Note the call: assigning self.get_df (no parentheses) would
        # store the bound method itself rather than a DataFrame.
        self.df = self.get_df()

    def calculate_A_square(self) -> pd.Series:
        # Approach 1 - calculate each column individually on a copy
        df = self.df.copy(deep=True)
        df['A_Square'] = df['A'] ** 2
        return df['A_Square']

    def calculate_A_cubed(self):
        # Approach 2 - modify the data frame in place
        self.df['A_Cubed'] = self.df['A'] ** 3
I am not sure whether it is good practice to just modify the dataframe within each method (can someone please help me understand that a little better?), or whether I should recalculate the columns in each method (that also feels like a bad idea).
For example, in the case above, if the A_Square values were stored as an instance variable, I could calculate A_Cubed in one line. However, that would also mean my class would end up with a great many instance variables.
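For reference, the kind of alternative I'm picturing: deriving dependent columns without mutating self.df, via DataFrame.assign (a sketch of a method to add to the class above; the names are illustrative):

def calculate_powers(self) -> pd.DataFrame:
    # Returns a new frame; self.df stays untouched.
    return self.df.assign(
        A_Square=lambda d: d['A'] ** 2,
        # Later keywords can reference columns created by earlier ones
        # (pandas >= 0.23):
        A_Cubed=lambda d: d['A_Square'] * d['A'],
    )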
In my project, using classes makes a lot of sense:
The class will contain time-series data which will be used repeatedly.
It will have methods/functions operating on those time series, called many times.
The problem is that I want to use pandas data frames as the time-series container (in some instances it will be a single time series, a pandas Series; in others, multiple time series, a pandas DataFrame), and I have issues with returning and operating on pandas DataFrames within a class.
I've done some reading around this topic, but I couldn't find an elegant and simple solution. Can you please help?
I've prepared a simplified example of DataFrame operations on time series within a class.
class MyClass:
    def __init__(self, my_dataframe, time_series_type):
        self.my_dataframe = my_dataframe
        self.time_series_type = time_series_type

    def returns(self):
        r = self.my_dataframe.pct_change()
        # Note: fillna(0, inplace=True) returns None, so don't assign it.
        r = r.fillna(0)
        return r  # returns a pandas DataFrame

    def rebase_to_1(self):
        # returns is a method, so it has to be called.
        rebased = self.returns() + 1
        rebased = rebased.cumprod()
        return rebased  # returns a pandas DataFrame
Example of application:
a = MyClass(my_time_series1, 'prices')  # second argument is an illustrative label
b = MyClass(my_time_series2, 'prices')
# Show the rebased time series (a pandas DataFrame) in a Jupyter notebook
a.rebase_to_1()
I have quite a lot of functions that I'm trying to put inside the class to streamline operating on time series. I hope the example above illustrates this well enough.
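Since the same methods will be called many times, one option I have been weighing is caching the computed frames — a minimal sketch, assuming Python >= 3.8 (functools.cached_property):

from functools import cached_property

class MyCachedClass(MyClass):
    @cached_property
    def cached_returns(self):
        # Computed once on first access, then reused.
        return self.returns()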
How do you structure the code in your class so that it returns the df you want, without having a main method that calls a lot of other methods in sequential order? I find that in a lot of situations I arrive at this structure, and it seems bad: I have a df that I just keep overwriting with the results of other base functions (which I unit test) until I get what I want.
class A:
    def main(self):
        df = self.load_file_into_df()
        df = self.add_x_columns(df)
        df = self.calculate_y(df)
        df = self.calculate_consequence(df)
        ...
        return df

    def add_x_columns(self, df): ...
    def calculate_y(self, df): ...
    def calculate_consequence(self, df): ...
...
# now use it somewhere else
df = A().main()
pipe
One feature you may wish to utilize is pd.DataFrame.pipe. This is considered "pandorable" because it facilitates method chaining.
In my opinion, you should separate reading data into a dataframe from manipulating the dataframe. For example:
class A:
    def main(self):
        df = self.load_file_into_df()
        df = (df.pipe(self.add_x_columns)
                .pipe(self.calculate_y)
                .pipe(self.calculate_consequence))
        return df
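Note that pipe also forwards extra arguments to each step, which keeps parameterized steps chainable — a small illustration with a hypothetical clip_outliers helper:

def clip_outliers(df, lower, upper):
    # A plain function taking the dataframe first, as pipe expects.
    return df.clip(lower=lower, upper=upper)

df = df.pipe(clip_outliers, -3, 3)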
compose
Function composition is not native to Python, but the 3rd party toolz library does offer this feature. This allows you to lazily define chained functions. Note the reversed order of operations, i.e. the last argument of compose is performed first.
from toolz import compose

class A:
    def main(self):
        df = self.load_file_into_df()
        transformer = compose(self.calculate_consequence,
                              self.calculate_y,
                              self.add_x_columns)
        df = df.pipe(transformer)
        return df
In my opinion, compose offers a flexible and adaptable solution. You can, for example, define any number of compositions and apply them selectively or repeatedly at various points in your workflow.
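For example, a sketch of that selective reuse (the step methods are assumed from the example above):

from toolz import compose

a = A()
# A reusable partial pipeline ...
prepare = compose(a.calculate_y, a.add_x_columns)
# ... extended into a full one.
full_pipeline = compose(a.calculate_consequence, prepare)
df = a.load_file_into_df().pipe(full_pipeline)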
Let's say I have a pandas.DataFrame that has hourly data for 3 days:
import pandas as pd
import numpy as np
import datetime as dt
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))
I would like to take every, let's say, 6 hours of data and independently fit a curve to it. Since pandas' resample function has a how keyword that is supposed to accept any numpy array function, I thought that I could maybe use resample to do this with polyfit, but apparently there is no way (right?).
So the only alternative I thought of is separating df into a sequence of DataFrames. I am trying to create a function that would work like
l = splitDF(df, '6H')
and would return a list of dataframes, each one with 6 hours of data (except maybe the first and last ones). So far I have nothing that works, except something like the following manual method:
def splitDF(data, rule):
    # In modern pandas, resample() returns a Resampler, so an aggregation
    # is needed before the index can be read.
    res_index = data.resample(rule).first().index
    out = []
    cont = 0
    for date in data.index:
        # ... check for date in res_index ...
        # ... and start cutting at those points ...
But this method would be extremely slow and there is probably a faster way to do it. Is there a fast (maybe even pythonic) way of doing this?
Thank you!
EDIT
A better method (that needs some improvement but it's faster) would be the following:
def splitDF(data, rule):
    # Aggregation needed in modern pandas to get the resampled index.
    res_index = data.resample(rule).first().index
    out = []
    pdate = res_index[0]
    for date in res_index:
        out.append(data[pdate:date][:-1])
        pdate = date
    out.append(data[pdate:])
    return out
But it still seems to me that there should be a better method.
Ok, so this sounds like a textbook case for using groupby. Here's my thinking:
import pandas as pd
import numpy as np

# let's define a function that'll group a datetime-indexed dataframe
# by hour-interval/date
def create_date_hour_groups(df, hr):
    new_df = df.copy()
    hr_int = int(hr)
    # Integer division so that hours 0-5 -> group 0, 6-11 -> group 1, etc.
    new_df['hr_group'] = new_df.index.hour // hr_int
    new_df['dt_group'] = new_df.index.date
    return new_df

# now we define a wrapper for polyfit to pass to groupby.apply
def polyfit_x_y(df, x_col='A', y_col='B', poly_deg=3):
    df_new = df.copy()
    coef_array = np.polyfit(df_new[x_col], df_new[y_col], poly_deg)
    poly_func = np.poly1d(coef_array)
    df_new['poly_fit'] = poly_func(df[x_col])
    return df_new

# to the actual stuff
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24, 2), index=dates, columns=list('AB'))
df = create_date_hour_groups(df, 6)
df_fit = df.groupby(['dt_group', 'hr_group'],
                    as_index=False).apply(polyfit_x_y)
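If only the fitted coefficients are needed rather than the fitted values, a small variant (polyfit_coefs is an illustrative helper) could return one row of coefficients per group:

def polyfit_coefs(df, x_col='A', y_col='B', poly_deg=3):
    # One Series of polynomial coefficients per 6-hour group.
    return pd.Series(np.polyfit(df[x_col], df[y_col], poly_deg))

coefs = df.groupby(['dt_group', 'hr_group']).apply(polyfit_coefs)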
How about?
np.array_split(df, len(df) // 6)
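If calendar-aligned chunks matter (groups starting exactly at 00:00, 06:00, and so on), a pd.Grouper-based split is another possibility — a small sketch reusing the df defined above:

# Split into calendar-aligned 6-hour sub-DataFrames.
chunks = [group for _, group in df.groupby(pd.Grouper(freq='6H'))]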