I am currently building a small app to more easily post-process data stored in HDF5 files (generated by proprietary 1D simulation software; the output is basically a very large table with ~10k variables, the y-axis corresponding to time). I want the user to be able to write custom functions generating series from the data (for plotting). My problem is that the HDF5 file can be quite large, and I would like to open it a single time and extract only the necessary data, for performance reasons. The column names to load depend on the user-defined functions, so I need to inspect them and collect the calls to 'df.loc' and 'df.__getitem__()' to list the columns to extract. I am looking for the best way to do that.
Below is a simplified example of my current implementation, using a 'spy' object that wraps a dummy dataframe and lists the keys used when '__getitem__' is called. Something similar could be done for the pandas DataFrame 'loc' method (a sketch is included after the class below).
import pandas as pd
from scipy import integrate
# Data for the example
data = {
    'TIME': [0, 1, 2, 3, 4, 5],
    'frac[air]': [0, 0, 0, 0, 0, 0],
    'frac[water]': [0.5, 0.4, 0.2, 0, 0, 0],
    'frac[oil]': [0.5, 0.6, 0.8, 1, 1, 1],
    'flowrate': [9999, 0, 0.1, 0.2, 0, 0],
}
def read_data(var_names, loc_filter=None):
    """Simple table reading here, but in the real application I am
    reading a large HDF5 file. The filter is typically used to
    select time slices (or slices corresponding to some states
    of the system described by the dataframe) through the pandas
    DataFrame loc method."""
    if loc_filter is None:
        df = pd.DataFrame(data=data)
    else:
        df = pd.DataFrame(data=data).loc[loc_filter]
    return df[var_names]
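# Side note (an assumption, not the real implementation): if the HDF5 file
# happened to be written by pandas in 'table' format, the read could be made
# column-selective directly. 'results.h5', its key and the where-string are
# placeholders, and where-queries only work on columns stored as data_columns.
def read_data_hdf5(var_names, where=None):
    return pd.read_hdf('results.h5', key='results',
                       columns=var_names, where=where)
# e.g. read_data_hdf5(['TIME', 'flowrate'], where='TIME > 0')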
# User-defined functions using only a few columns of the large dataframe
# Constraint: new functions can be defined by the user ! 2 Examples below
def grp_liq(df, var_name):
    """Summing fractions corresponding to liquids"""
    return df[var_name + '[water]'] + df[var_name + '[oil]']

def integ(df, var_name):
    """Cumulative integration of a variable over time"""
    # (cumtrapz is named cumulative_trapezoid in recent SciPy versions)
    return pd.Series(integrate.cumtrapz(df[var_name], df['TIME']))
# Creation of a 'spy' object wrapping a dummy dataframe to be able to
# inspect functions
class DFProxyGetItemSpy:
    def __init__(self):
        self._var_names = []
        self._df_dum = pd.DataFrame({'dum': [1]})

    # __getitem__ defined to list the var_names used
    def __getitem__(self, key):
        if key not in self._var_names:
            self._var_names.append(key)
        # Returning a dummy series to limit the risk of an error being
        # raised if a DataFrame-specific attribute/method is used in the
        # user function
        return self._df_dum['dum']

    # __getattr__ defined to pass any attribute to the dummy DataFrame
    def __getattr__(self, name):
        return self._df_dum.__getattribute__(name)

    # Method to reset the spy object
    def reset(self):
        self._var_names = []

    @property
    def var_names(self):
        return self._var_names
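# As mentioned above, access through '.loc' can be spied on in a similar
# way. A minimal sketch (an extension of the class above, not used in the
# requests below):
class _LocSpy:
    """Records string labels used inside a .loc[...] call on the proxy."""
    def __init__(self, owner):
        self._owner = owner
    def __getitem__(self, key):
        # Record string labels, including the column part of a
        # (row_indexer, column_label) tuple such as loc[mask, 'flowrate']
        labels = key if isinstance(key, tuple) else (key,)
        for label in labels:
            if isinstance(label, str) and label not in self._owner._var_names:
                self._owner._var_names.append(label)
        # Return the same dummy series as the main proxy
        return self._owner._df_dum['dum']

class DFProxyLocSpy(DFProxyGetItemSpy):
    # A class-level property takes priority over __getattr__, so spy.loc
    # is intercepted here instead of being forwarded to the dummy DataFrame
    @property
    def loc(self):
        return _LocSpy(self)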
##################
# DATA REQUEST 1 #
##################
# Spying a user function to list the variables used
spy = DFProxyGetItemSpy()
grp_liq(spy,'frac')
print(spy.var_names)
# Now that we know what variables to extract, we can actually
# read the data and do our operations
df = read_data(spy.var_names)
print(grp_liq(df,'frac'))
##################
# DATA REQUEST 2 #
##################
spy.reset()
# Spying a second user function, but this time we also want to
# set a filter --> this filter function needs to be spied too...
spy = DFProxyGetItemSpy()
integ(spy,'flowrate')
my_filter = lambda df: df['TIME']>0
my_filter(spy)
print(spy.var_names)
# Now that we know what variables to extract, we can actually
# read the data and do our operations
df2 = read_data(spy.var_names, loc_filter=my_filter)
print(integ(df2,'flowrate'))
I find this approach inelegant and I am looking for a better way to do it. I have been looking a bit at the dask documentation to see whether information about the keys used in a function could be extracted from the task graph, but I was not successful. To formalize my question: what would be a good way to allow a user to define custom functions manipulating a dataframe (using any function from pandas), and to list the necessary keys so as to limit the amount of data to load?
Thanks in advance for any answer; any advice is welcome too!
All,
I couldn't find similar questions online; if there are any, please kindly point me to them.
Here is what I am trying to do:
I have a class that builds a financial model in a data frame (so there are several tens of columns in the data frame)
Each column has a formula, and many of the columns use information from the prior columns.
Can someone please recommend a good way to approach this problem?
# import required libraries
import pandas as pd
import numpy as np
class DataFrameInClass:
    def __init__(self, a: int, b: int):
        self.a = a
        self.b = b
        self.df = pd.DataFrame()

    def get_df(self) -> pd.DataFrame:
        return pd.DataFrame(np.random.randint(self.a, self.b, size=(100, 4)),
                            columns=list('ABCD'))

    def set_df(self):
        # store a freshly generated frame on the instance
        self.df = self.get_df()

    def calculate_A_square(self) -> pd.Series:
        # Approach 1 - calculate each column individually
        df = self.df.copy(deep=True)
        df['A_Square'] = df['A'] ** 2
        return pd.Series(data=df['A_Square'], index=df.index)

    def calculate_A_cubed(self):
        # Approach 2 - Modify the data frame in place
        self.df['A_Cubed'] = self.df['A'] ** 3
I am not sure if it is good practice to just modify the dataframe within each method (can someone please help me understand that a little better), or whether I should recalculate the columns in each method (that also feels like a bad idea).
For example, in the case below, if the a_square value were stored as an instance variable, I could calculate a_cubed in one line. However, that would also mean my class would end up with a great many instance variables.
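To illustrate the trade-off, here is a minimal sketch of that alternative, with the intermediate value stored on the instance (the class and attribute names below are made up):
class DataFrameInClassAlt:
    def __init__(self, a: int, b: int):
        self.df = pd.DataFrame(np.random.randint(a, b, size=(100, 4)),
                               columns=list('ABCD'))
        # Intermediate result kept as an instance variable ...
        self.a_square = self.df['A'] ** 2

    def calculate_a_cubed(self) -> pd.Series:
        # ... so the cube becomes a one-liner, at the cost of one more attribute
        return self.a_square * self.df['A']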
In my project, using classes makes a lot of sense:
The class will contain time-series data which will be used repetitively.
It will have methods/functions operating on those time series that are called many times.
The problem is that I want to use pandas DataFrames as the time-series container (in some instances it will be a single time series - a pandas Series; in others, multiple time series - a pandas DataFrame), and I have issues with returning and operating on pandas DataFrames within a class.
I've done some reading around this topic, but I couldn't find an elegant and simple solution. Can you please help?
I've prepared a simplified example of DataFrame operations on time series within a class.
class MyClass:
    def __init__(self, my_dataframe, time_series_type):
        self.my_dataframe = my_dataframe
        self.time_series_type = time_series_type

    def returns(self):
        r = self.my_dataframe.pct_change()
        # fillna(..., inplace=True) would return None, so assign the result instead
        r = r.fillna(0)
        return r  # returns a pandas DataFrame

    def rebase_to_1(self):
        rebase_to_1 = self.returns() + 1  # returns is a method, so it must be called
        rebase_to_1 = rebase_to_1.cumprod()
        return rebase_to_1  # returns a pandas DataFrame
Example of application:
a = MyClass(my_time_series1)
b = MyClass(my_time_series2)
# Show rebased time-series/pandas DataFrame in a Jupyter notebook
a.rebase_to_1()
I have quite a lot of functions that I'm trying to put inside the class to streamline operating on time series. I hope the example above illustrates this well enough.
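For reference, here is a self-contained version of the application example above, with made-up data and a placeholder value for time_series_type:
import pandas as pd

my_time_series1 = pd.DataFrame({'asset': [100.0, 110.0, 121.0]})
a = MyClass(my_time_series1, 'prices')  # 'prices' is just a placeholder type
print(a.rebase_to_1())  # rebased series: 1.0, 1.1, 1.21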
Lists or numpy arrays can be unpacked into multiple variables if the dimensions match. For a 2xN array, the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
# result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured out that the following works, which is already close to what I'm trying to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores its columns separately, as Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns into a single ndarray, which likely involves copying and type conversions. Also, the resulting container has a single dtype that must accommodate all the data in the data frame.
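A quick illustration of that dtype behaviour, using a small made-up frame:
mixed = pd.DataFrame({'i': [1, 2], 'f': [1.5, 2.5], 's': ['a', 'b']})
mixed.dtypes.tolist()    # int64, float64, object
mixed.to_numpy().dtype   # object -- a single dtype for everything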
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data likely will trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The dataframe.values method shown above is indeed a good solution, but it involves building a numpy array.
In case you want to access pandas Series methods after unpacking, I personally use a different approach.
For people like me who use a lot of method chaining, the solution is to add a custom unpacking method to pandas. Note that this may not be very good for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
    "lat": [30, 40],
    "lon": [0, 1],
})
This approach involves returning a generator from a .unpack() call.
from typing import Iterator

def unpack(self: pd.DataFrame) -> Iterator[pd.Series]:
    # Yield each column as a Series
    return (
        self[col]
        for col in self.columns
    )
pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or, it can be used in method chaining.
Imagine a geo function that has to take a latitude Series as the first argument and a longitude Series as the second, named do_something_geographical(lat, lon):
df_result = (
    df
    .(...some method chaining...)
    .assign(
        geographic_result=lambda dataframe: do_something_geographical(
            *dataframe[["lat", "lon"]].unpack())
    )
    .(...some method chaining...)
)
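For completeness, here is a runnable sketch of the same pattern with a made-up do_something_geographical (its name and behaviour are placeholders); note the * that spreads the generator into the two positional arguments:
def do_something_geographical(lat: pd.Series, lon: pd.Series) -> pd.Series:
    # Placeholder implementation: just combine the two series
    return lat + lon

df_result = df.assign(
    geographic_result=lambda dataframe: do_something_geographical(
        *dataframe[['lat', 'lon']].unpack())
)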
I have a DataFrame in which each value is an object of a custom class, say:
dc = {"c1":{"a1":CAppState(1,1), "a2":CAppState(2,4) }, "c2":{"a2":CAppState(2,5), "a3":CAppState(3,32)} }
df = pd.DataFrame(dc)
where CAppState is a class:
class CAppState(object):
    def __init__(self, nID, nValue):
        self.m_nID = nID
        self.m_nValue = nValue
I was wondering how I could carry out some common operations on this DataFrame, like cumsum() or sorting according to CAppState.m_nValue?
Any suggestion would be appreciated.
There is no built-in way to do this. You must create a Series from your objects and cumsum that. This can be done fairly easily with map. For instance:
df.c1.map(lambda x: x.m_nValue).cumsum()
You could also use operator.attrgetter:
import operator

df.c1.map(operator.attrgetter('m_nValue')).cumsum()
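The sorting part of the question can be handled the same way: build a plain Series from the attribute values and use it to reorder the frame. A minimal sketch (assuming missing cells, if any, should simply sort last):
values = df.c1.map(operator.attrgetter('m_nValue'), na_action='ignore')
df_sorted = df.loc[values.sort_values().index]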