Question
Let's assume the following DataFrame is given
ID               IBM      MSFT      APPL      ORCL        FB      TWTR
date
1986-08-31 -1.332298  0.396217  0.574269 -0.679972 -0.470584  0.234379
1986-09-30 -0.222567  0.281202 -0.505856 -1.392477  0.941539  0.974867
1986-10-31 -1.139867 -0.458111 -0.999498  1.920840  0.478174 -0.315904
1986-11-30 -0.189720 -0.542432 -0.471642  1.506206 -1.506439  0.301714
1986-12-31  1.061092 -0.922713 -0.275050  0.776958  1.371245 -2.540688
and I want to perform some operation on it. This could be some complicated mathematical method. The columns are structurally identical.
Q1: What is the best approach with respect to performance and/or implementation design?
Q2: Should I write a method that disassembles the DataFrame into its numerical part (NumPy arrays) and its indices, so that the necessary calculations are carried out by a submodule on the NumPy array, and the main method is only responsible for collecting the data returned by the submodule and rejoining it with the corresponding indices (see the example code below)?
import pandas as pd

def submodule(np_array):
    # some fancy calculations here
    return modified_array

def main(df):
    cols = df.columns
    indices = df.index
    values = df.values  # .values is a property, not a method
    modified_values = submodule(values)
    new_df = pd.DataFrame(modified_values, columns=cols, index=indices)
    return new_df
Q3: Or should I do the calculations with DataFrames directly?
Q4: Or should I work with objects instead?
Q5: What is better with respect to performance, design, or code structure?
Addendum
Some more practical example would be if I want to do a portfolio optimization.
Q6: Should I pass the whole DataFrame into the optimization, or only the numerical matrix? Strictly speaking, I don't think the labelling information of a DataFrame belongs inside a numerical method, but I am not sure whether that thinking is outdated.
Another example would be calculating the Delta for a number of options (an operation on every single series instead of a matrix operation).
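To illustrate the pattern from Q2 for the portfolio case, here is a hedged sketch (the function names min_variance_weights and optimize_portfolio are made up, and scipy.optimize.minimize merely stands in for whatever optimizer is actually used): only the numeric covariance matrix is handed to the numerical routine, and the labels are reattached at the pandas layer.
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def min_variance_weights(cov):
    # numerical core: plain NumPy in, plain NumPy out
    n = cov.shape[0]
    w0 = np.full(n, 1.0 / n)
    res = minimize(
        lambda w: w @ cov @ w,                             # portfolio variance
        w0,
        bounds=[(0.0, 1.0)] * n,                           # long-only weights
        constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}],
    )
    return res.x

def optimize_portfolio(returns):
    # pandas layer: strip the labels before the numerics, reattach them afterwards
    weights = min_variance_weights(returns.cov().to_numpy())
    return pd.Series(weights, index=returns.columns, name='weight')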
P.S.:
I know that I wouldn't need to use a separate function for disassembling. But it highlights my intentions.
Related
I have a Pandas DataFrame, with columns 'time' and 'current'. It also has lots of other columns, but I don't want to use them for this operation. All values are floats.
df[['time','current']].head()
     time  current
1     0.0      9.6
2   300.0      9.3
3   600.0      9.6
4   900.0      9.5
5  1200.0      9.5
I'd like to calculate the rolling integral of current over time, such that at each point in time, I get the integral up to that point of the current over the time. (I realize that this particular operation is simple, but it's an example. I'm not really looking for this function, but the method as a whole)
Ideally, I'd be able to do something like this:
df[['time','current']].expanding().apply(scipy.integrate.trapezoid)
or
df[['time','current']].expanding(method = 'table').apply(scipy.integrate.trapezoid)
but neither of these works, as I'd like to pass the 'time' column as the function's first argument and 'current' as the second. The function does work with one column (current alone), but I don't like having to divide by the timesteps separately afterwards.
It seems DataFrame columns can't be accessed within expanding().apply().
I've heard that internally the expanding is treated as an array, so I've also tried this:
df[['time','current']].expanding(method = 'table').apply(lambda x:scipy.integrate.trapezoid(x[0], x[1]))
df[['time','current']].expanding(method = 'table').apply(lambda x:scipy.integrate.trapezoid(x['time'], x['current']))
and variations, but I can never access the columns in expanding().
As a matter of fact, even using apply() on a plain DataFrame disallows using columns simultaneously, as each one is treated sequentially as a Series.
df[['time','current']].apply(lambda x:scipy.integrate.trapezoid(x.time,x.current))
...
AttributeError: 'Series' object has no attribute 'time'
This answer mentions the method 'table' for expanding(), but it wasn't out at the time, and I can't seem to figure out what it needs to work here. Their solution was simply to do it manually.
I've also tried defining the function first, but this returns an error too:
def func(x, y):
    return scipy.integrate.trapezoid(x, y)

df[['time','current']].expanding().apply(func)
...
DataError: No numeric types to aggregate
Is what I'm asking even possible with expanding().apply()? Should I just do it another way? Can I apply expanding inside the apply()?
Thanks, and good luck.
Overview
This is not yet fully supported in pandas, but there are workarounds. expanding() and rolling() combined with .agg() or .apply() operate column by column unless you specify method='table' (see Method 2).
Method 1
There is a workaround to get what you want as long as the output is a single column. The trick is to move a column into the index and then reset it inside the function. (Don't actually do this with scipy.integrate.trapezoid because, as @ALollz pointed out, scipy.integrate.cumtrapz already performs a cumulative (expanding) calculation.)
def custom_func(serie):
    subDf = serie.reset_index()
    # work with the sub-DataFrame as you would in a groupby;
    # you have access to subDf.x and subDf.y
    return scipy.integrate.trapezoid(subDf.x, subDf.y)

df.set_index(['y']).expanding().agg(custom_func)
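For this particular integral, the cumulative routine mentioned above can be used directly on the question's 'time' and 'current' columns, skipping the expanding machinery entirely. A minimal sketch (the 'integral' column name is just for illustration; cumulative_trapezoid is the current name of cumtrapz):
from scipy.integrate import cumulative_trapezoid

# one integral value per row; initial=0 keeps the output the same length as the input
df['integral'] = cumulative_trapezoid(df['current'], df['time'], initial=0)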
Method 2
You can make use of method='table' (available since pandas 1.3.0) in expanding()
and rolling(). In that case you need to call .apply(custom_func, raw=True, engine='numba') and write custom_func in numba-compatible Python (beware of types); it will receive the NumPy-array representation of your DataFrame. The function must return one value per input column, so you may have to add dummy columns to the input to satisfy that constraint and rename your columns afterwards.
min_periods = 100

def custom_func(table):
    rep = np.zeros(table.shape[1])  # one value per input column
    # You need something like this if you want to use the min_periods argument
    if len(table) < min_periods:
        return rep
    # Do something with your numpy arrays
    return rep

df.expanding(min_periods, method='table').apply(custom_func, raw=True, engine='numba')
# Rename
df.columns = ...
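To make the skeleton concrete, here is an illustrative (not authoritative) table-mode function for the rolling integral from the earlier question; the column order, the zero-filled second output slot, and the result column names are all assumptions:
import numpy as np

def trapz_table(window):
    # window is the 2-D expanding window: column 0 is 'time', column 1 is 'current'
    out = np.zeros(window.shape[1])  # one value per input column
    total = 0.0
    for i in range(1, window.shape[0]):
        total += 0.5 * (window[i, 1] + window[i - 1, 1]) * (window[i, 0] - window[i - 1, 0])
    out[0] = total                   # the second slot stays 0.0 as a dummy
    return out

res = df[['time', 'current']].expanding(method='table').apply(trapz_table, raw=True, engine='numba')
res.columns = ['integral', 'dummy']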
I need to perform some simple calculations on a large number of combinations of rows or columns for a pandas dataframe. I need to figure out how to do so most efficiently because the number of combinations might go up above a billion.
The basic approach is easy--just performing means, comparison operators, and sums on subselections of a dataframe. But the only way I've figured out involves doing a loop over the combinations, which isn't very pythonic and isn't super efficient. Since efficiency will matter as the number of samples goes up I'm hoping there might be some smarter way to do this.
Right now I am building the list of combinations and then selecting those rows and doing the calculations using built-in pandas tools (see pseudo-code below). One possibility is to parallelize this, which should be pretty easy. However, I wonder if I'm missing a deeper way to do this more efficiently.
A few thoughts, ordered from big to small:
Is there some smart pandas/python or even some smart linear algebra way to do this? I haven't figured such out, but want to check.
Is the best approach to stick with pandas? Or convert to a numpy array and just do everything using numeric indices there, and then convert back to easier-to-understand data-frames?
Is the built-in mean() the best approach, or should I use some kind of apply()?
Is it faster to select rows or columns in any way? The matrix is symmetric so it's easy to grab either.
I'm currently selecting 18 rows, because each of the 6 rows actually has three entries with slightly different parameters; I could combine those into single rows beforehand if selecting 6 rows is faster than selecting 18 for some reason.
Here's a rough-sketch of what I'm doing:
from itertools import combinations
import pandas as pd

df = from_excel()          # test case is 30 rows & cols
df = df.set_index('Col1')  # column and row 1 are names, the rest are the actual matrix values
allSets = combinations(df.columns, 6)

temp = []
for s in allSets:
    avg1 = df.loc[list(s)].mean().mean()
    cnt1 = df.loc[list(s)].gt(0).sum().sum()
    temp.append([s, avg1, cnt1])

dfOut = pd.DataFrame(temp, columns=['Set', 'Average', 'Count'])
A few general considerations that should help:
Not that I know of, but the best place to ask is Mathematics or Math Professionals, and it is worth a try. There may be a better way to frame the question if you are doing something very specific with the results (looking for a minimum/maximum, etc.).
In general you are right that pandas, as a layer on top of NumPy, is probably not going to speed things up. However, most of the heavy lifting is done at the NumPy level, so until you are sure pandas is to blame, keep using it.
The built-in mean is better than your own function applied across rows or columns, because under the hood it uses NumPy's C implementation of mean, which is always going to be faster than Python.
Given that pandas organizes data column-wise (i.e. each column is backed by a contiguous NumPy array), it is faster to select columns than rows.
It would be great to see an example of data here.
Now, some comments on the code:
use iloc and numeric indices instead of loc - it is way faster
it is unnecessary to turn tuples into list here: df.loc[list(s)].gt(0).sum().sum()
just use: df.loc[s].gt(0).sum().sum()
you should rather use a generator instead of a for loop that appends elements to a temporary list (which is awfully slow and unnecessary, because you are creating a pandas DataFrame either way). Also, use tuples instead of lists wherever possible for maximum speed:
def gen_fun():
    allSets = combinations(df.columns, 6)
    for s in allSets:
        avg1 = df.loc[list(s)].mean().mean()
        cnt1 = df.loc[list(s)].gt(0).sum().sum()
        yield (s, avg1, cnt1)

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])
Another thing: you can precompute the positive-value mask once, before the loop, to avoid repeating the gt(0) operation in each iteration. In this way you spare both memory and CPU time.
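A minimal sketch of that precomputation (reusing the generator above; the variable name positive is just illustrative):
from itertools import combinations
import pandas as pd

positive = df.gt(0)  # computed once, outside the loop

def gen_fun():
    for s in combinations(df.columns, 6):
        rows = list(s)
        avg1 = df.loc[rows].mean().mean()
        cnt1 = positive.loc[rows].sum().sum()
        yield (s, avg1, cnt1)

dfOut = pd.DataFrame(gen_fun(), columns=['Set', 'Average', 'Count'])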
I want to create pandas dataframes and be able to manipulate them in optimised code (numba).
Most optimised code will take series or dataframe inputs and store results in preallocated outputs.
from numba import njit

@njit
def calc(u, v, w):
    for i in range(w.shape[0]):
        w[i] = some_f(u[i], v[i])
where some_f is a placeholder for operations that can be complex, with tests and loops, hence the use of numba. Most importantly, I want to avoid any useless copies of data in the process.
For vectorized functions such as the above, I want to use the same code for series and dataframes.
So for series u, v, w, I'll use:
calc(u.values, v.values, w.values)
For dataframes, I thought of reusing the same function with
calc(u.values.reshape(-1), v.values.reshape(-1), w.values.reshape(-1))
This only works if
The array ordering of the dataframe (C or Fortran) is consistent across the three dataframes
The reshape method is passed an argument order='C' or 'F' matching the original dataframe ordering, otherwise a copy is made.
Pandas does not seem to have a consistent policy for DataFrame ordering.
For instance, the constructor
df = pandas.DataFrame(index=..., columns=..., data=0.0)
returns a C-ordered array, while
df.copy()
returns a Fortran-ordered one.
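For reference, the layout can be inspected on the underlying array and, if necessary, forced before calling calc(); a quick sketch (the exact C/F behaviour may vary across pandas versions):
import numpy as np
import pandas as pd

df = pd.DataFrame(index=range(5), columns=list('abc'), data=0.0)
print(df.values.flags['C_CONTIGUOUS'], df.values.flags['F_CONTIGUOUS'])

df2 = df.copy()
print(df2.values.flags['C_CONTIGUOUS'], df2.values.flags['F_CONTIGUOUS'])

# forcing a known layout costs a copy whenever the layout does not already match
u = np.ascontiguousarray(df.values)   # C order
# u = np.asfortranarray(df.values)    # F order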
I wanted to know whether others have encountered similar issues and found a consistent way to ensure that DataFrames always use the same ordering (C or Fortran) without cluttering ordinary pandas code too much.
I'm working with time series data and have transformed numbers to logarithmic differences with numpy.
df['dlog']= np.log(df['columnx']).diff()
Then I made predictions with that transformation.
How can I return to normal numbers?
Reversing the transformation shouldn't be necessary, because columnx still exists in df.
.diff() calculates the difference of a Series element compared with another element in the Series.
The first row of dlog is NaN; without a "base" number (e.g. np.log(764677)) there is no way to step back that transformation.
df = pd.DataFrame({'columnx': [np.random.randint(1_000_000) for _ in range(100)]})
df['dlog'] = np.log(df.columnx).diff()
Output:
columnx       dlog
 764677        NaN
 884574   0.145653
 621005  -0.353767
 408960  -0.417722
 248456  -0.498352
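That said, if the base value is available (as it is here, since columnx is still in df), the log-difference series can be inverted by cumulatively summing before exponentiating. A minimal sketch:
base = np.log(df['columnx'].iloc[0])                          # the "base" number
reconstructed = np.exp(base + df['dlog'].fillna(0).cumsum())  # recovers columnx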
Undo np.log with np.exp
Use np.exp to transform from a logarithmic to linear scale.
df = pd.DataFrame({'columnx': [np.random.randint(1_000_000) for _ in range(100)]})
df['log'] = np.log(df.columnx)
df['linear'] = np.exp(df.log)
Output:
columnx        log    linear
 412863  12.930871  412863.0
 437565  12.988981  437565.0
 690926  13.445788  690926.0
 198166  12.196860  198166.0
 427894  12.966631  427894.0
Further Notes:
Without a reproducible data set, it's not possible to offer further solutions.
You can include some data: How to make good reproducible pandas examples
Include the code used to transform the data: Minimal, Reproducible Example
Another option is to produce the predictions without taking np.log.
I am creating a Python script that drives an old Fortran code to locate earthquakes. I want to vary the input parameters to the Fortran code from the Python script and record the results, as well as the values that produced them, in a DataFrame. The results from each run are also convenient to put in a DataFrame, which leads me to a situation where I have a nested DataFrame (i.e. a DataFrame assigned to an element of another DataFrame). For example:
import pandas as pd
import numpy as np
def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res

# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object)  # make sure generic types can be used

# loop over each row, call some_operation and store the results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and the like, but I am really just using the DataFrame as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class, but I am not sure whether it is the proper solution for my application. I would hate to forge ahead and apply the hack shown above to some code, only to have it unsupported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df2 = pd.DataFrame({'a':[100], 'b':[200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do), I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store trained scikit-learn estimators from cross-validation across a large grid of parameters (though I can't recall the exact context at the moment...).
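If you ever want to avoid the object column entirely, one simple alternative (an illustrative sketch, not the only option) is to keep the nested frames in a plain dict keyed by the master index:
# the results frames live next to df_master instead of inside it
results = {ind: some_operation(row) for ind, row in df_master.iterrows()}
first_run = results[0]   # the results DataFrame for master row 0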