I'm trying to rework much of my signal-processing analysis code to use DataFrames instead of NumPy arrays. However, I'm having a hard time figuring out how to pass the entire matrix of a DataFrame to a function as a single unit.
E.g., if I'm computing the common average reference of a signal, I have something like:
avg = signal.mean(axis=1)
CAR = signal - avg
What I'd like to do is pass a pandas DataFrame to this function and have it return a DataFrame with CAR as the values. I'd like to do this without just returning an array and then re-converting it back into a DataFrame.
It sounds like when you use df.apply(), it goes row-wise or column-wise, and doesn't pass in the whole matrix. I could alter the CAR code to make this work, but it seems like that would slow it down quite a bit compared to letting NumPy operate on the whole thing at once. It probably wouldn't make a big difference for computing the mean, but I foresee this being a problem with other, longer-running functions in the future.
Can anyone point me in the right direction?
EDIT: To clarify, I'm not just doing this to subtract the mean; that was just a simple example. A more realistic example would be linearly filtering the array along axis 0. I'd like to use scipy.signal.filtfilt to filter my array. This is quite easy if I can just pass it a tpts x feats matrix, but right now it seems the only way to do it is column-wise using apply.
You can get the raw numpy array version of a DataFrame with df.values. However, in many cases you can just pass the DataFrame itself, since it still allows use of the normal numpy API (i.e., it has all the right methods).
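For instance, for the filtfilt case mentioned in the question, a minimal sketch might look like this (the Butterworth coefficients here are an illustrative assumption, not from the question):
import numpy as np
import pandas as pd
from scipy import signal as sps

df = pd.DataFrame(np.random.randn(1000, 4))  # tpts x feats
b, a = sps.butter(4, 0.1)                    # example filter coefficients

# filtfilt works on the raw ndarray; wrapping the result restores the
# original index and columns
filtered = pd.DataFrame(sps.filtfilt(b, a, df.values, axis=0),
                        index=df.index, columns=df.columns)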
See http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html: this will let you perform operations row-wise or column-wise across the DataFrame.
import random
import pandas as pd

signal = pd.DataFrame([[10 * random.random() for _ in range(3)] for _ in range(5)])

def testm(frame, average=0):
    return frame - average

signal.apply(testm, average=signal.mean(), axis=1)
results:
signal
Out[57]:
0 1 2
0 5.566445 7.612070 8.554966
1 0.869158 2.382429 6.197272
2 5.933192 3.564527 9.805669
3 9.676292 1.707944 2.731479
4 5.319629 3.348337 6.476631
signal.mean()
Out[59]:
0 5.472943
1 3.723062
2 6.753203
dtype: float64
signal.apply(testm,average=signal.mean(),axis=1)
Out[58]:
0 1 2
0 0.093502 3.889008 1.801763
1 -4.603785 -1.340632 -0.555932
2 0.460249 -0.158534 3.052466
3 4.203349 -2.015117 -4.021724
4 -0.153314 -0.374724 -0.276572
This will take the mean of each column, and subtract it from each value in that column.
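Note that for this particular operation you don't need apply at all: subtracting a Series from a DataFrame aligns on the column labels, so broadcasting gives the same result in one vectorised step.
# same result as the apply() call above: each column's mean is
# subtracted from every value in that column
demeaned = signal - signal.mean()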
Related
I have a Pandas DataFrame, with columns 'time' and 'current'. It also has lots of other columns, but I don't want to use them for this operation. All values are floats.
df[['time','current']].head()
time current
1 0.0 9.6
2 300.0 9.3
3 600.0 9.6
4 900.0 9.5
5 1200.0 9.5
I'd like to calculate the rolling integral of current over time, so that at each point in time I get the integral of the current over time up to that point. (I realize this particular operation is simple, but it's an example; I'm not really looking for this specific function, but for the method as a whole.)
Ideally, I'd be able to do something like this:
df[['time','current']].expanding().apply(scipy.integrate.trapezoid)
or
df[['time','current']].expanding(method = 'table').apply(scipy.integrate.trapezoid)
but neither of these works, as I'd like to pass the 'time' column as the function's first argument and 'current' as the second. The function does work with one column (current alone), but I don't like having to divide by the timesteps separately afterwards.
It seems DataFrame columns can't be accessed within expanding().apply().
I've heard that internally the expanding window is treated as an array, so I've also tried this:
df[['time','current']].expanding(method = 'table').apply(lambda x:scipy.integrate.trapezoid(x[0], x[1]))
df[['time','current']].expanding(method = 'table').apply(lambda x:scipy.integrate.trapezoid(x['time'], x['current']))
and variations, but I can never access the columns in expanding().
As a matter of fact, even using apply() on a plain DataFrame disallows using columns simultaneously, as each one is treated sequentially as a Series.
df[['time','current']].apply(lambda x:scipy.integrate.trapezoid(x.time,x.current))
...
AttributeError: 'Series' object has no attribute 'time'
This answer mentions the method 'table' for expanding(), but it wasn't out at the time, and I can't seem to figure out what it needs to work here. Their solution was simply to do it manually.
I've also tried defining the function first, but this returns an error too:
def func(x, y):
    return scipy.integrate.trapezoid(x, y)

df[['time','current']].expanding().apply(func)
...
DataError: No numeric types to aggregate
Is what I'm asking even possible with expanding().apply()? Should I just do it another way? Can I apply expanding inside the apply()?
Thanks, and good luck.
Overview
This is not yet fully implemented in pandas, but there are workarounds. expanding() and rolling(), combined with .agg() or .apply(), work column by column unless you specify method='table' (see Method 2).
Method 1
There is a workaround to get what you want, as long as you output one column. The trick is to move the columns into the index and then reset it inside the function. (Don't actually do this with scipy.integrate.trapezoid, because, as @ALollz pointed out, scipy.integrate.cumtrapz is already a cumulative (expanding) calculation.)
def custom_func(serie):
    # work with the sub-DataFrame as you would in a groupby;
    # you have access to subDf.x and subDf.y
    subDf = serie.reset_index()
    return scipy.integrate.trapezoid(subDf.x, subDf.y)

df.set_index(['y']).expanding().agg(custom_func)
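Applied to the time/current question above, a minimal sketch of this trick might look like the following (assuming the 'time' and 'current' columns from the example):
import scipy.integrate

def rolling_integral(serie):
    # reset_index() turns the expanding window back into a small
    # DataFrame with 'time' and 'current' columns
    sub = serie.reset_index()
    return scipy.integrate.trapezoid(sub['current'], x=sub['time'])

result = df[['time', 'current']].set_index('time').expanding().agg(rolling_integral)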
Method 2
You can make use of method='table' (available since pandas 1.3.0) in expanding() and rolling(). In that case you need to use .apply(custom_func, raw=True, engine='numba') and write custom_func in numba-compatible Python (beware of types); it will receive the NumPy array representation of your DataFrame. Your custom_func also needs to output an array whose length matches the input, so you might have to add dummy columns to the input to satisfy this, and rename your columns afterwards.
min_periods = 100

def custom_func(table):
    rep = np.zeros(len(table))
    # You need something like this if you want to use the min_periods argument
    if len(table) < min_periods:
        return rep
    # Do something with your numpy arrays
    return rep

df.expanding(min_periods, method='table').apply(custom_func, raw=True, engine='numba')
# Rename
df.columns = ...
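As a concrete illustration, here is a hedged sketch of Method 2 for the time/current question above. It assumes pandas >= 1.3.0 with numba installed, and that with method='table' the UDF receives each window as a 2D array (rows x columns) and returns one value per column, as in the pandas docs example; np.trapz is used because numba supports it, unlike scipy functions.
import numpy as np

def integrate_table(window):
    # one output slot per input column (hence the dummy-column caveat above)
    out = np.zeros(window.shape[1])
    t, i = window[:, 0], window[:, 1]  # column 0: time, column 1: current
    out[1] = np.trapz(i, t)            # trapezoidal integral over the window
    return out

result = df[['time', 'current']].expanding(method='table').apply(
    integrate_table, raw=True, engine='numba')
# the integral lands in the 'current' column; the 'time' column is a dummy slot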
I am trying to perform automated Excel operations, but with cryptocurrency data from Binance. I converted the data to a pandas DataFrame, and then into a NumPy array.
However, I only have one column of data, which is volume.
I want to write a loop that takes the latest data in my NumPy array, adds the last value to the next value, and then iterates through the indices one calculation at a time.
I seem to be stuck trying to write a loop that can iterate through my single column and perform basic math operations on only a few rows at a time, while still being dynamic, so that I don't have to type in the indices of the rows I want to operate on each time.
Example:
Volume
[0] 9212
[1] 3021 [1]+[0] 3021+9212 = x
[2] 3201
[3] 3921 [3]+[2] 3921+3201 = x
[4] 2010
[5] 1999 [5]+[4] 1999+2010 = x
The idea is that it iterates upward through the array, performing the additions (or any other math operations) along the way.
Any suggestions as to what I can do?
When your data is in a pandas DataFrame, you can try performing a rolling sum, like so:
df['Rolling_Volume'] = df['Volume'].rolling(2).sum()
This will create a new column called "Rolling_Volume" which will have the information you are looking for.
You can read more here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
If you insist on performing a rolling sum on a NumPy array, it is explained here: Fast rolling-sum for list of data vectors (2d matrix)
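For completeness, one common NumPy approach (a sketch; the linked question discusses alternatives) is a convolution with a window of ones:
import numpy as np

volume = np.array([9212, 3021, 3201, 3921, 2010, 1999])

# a rolling sum with window 2; mode='valid' keeps only full windows
rolling_sum = np.convolve(volume, np.ones(2, dtype=int), mode='valid')
# array([12233, 6222, 7122, 5931, 4009])

# every other element gives the non-overlapping pairs from the example
rolling_sum[::2]  # array([12233, 7122, 4009])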
I would like to select a cycle of data in Python (in this case with pandas, but it's a flexible and recurring problem), such that the selected data circles back around to the start for the first n rows. I keep running into this problem, and while I have a variety of working solutions, I'm interested in whether there's some built-in method for this that I don't know about.
Here's an example pandas DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame([[0,1],[2,3],[5,6],[7,8]],columns = ['x','y'])
This DataFrame would, of course, look like this:
x y
0 0 1
1 2 3
2 5 6
3 7 8
A good-enough solution I've found is to create an index array:
n = 1  # how far into the start of the DataFrame to select
select = np.concatenate((range(len(df)), range(n)))
The output of select would be:
array([0, 1, 2, 3, 0])
If I do df.x[select], I get what I'm looking for:
0 0
1 2
2 5
3 7
0 0
Name: x, dtype: int64
Is there some functionality in NumPy, pandas, or any other Python module that allows this kind of looping selection of data? Something like df.x[0:+1] that would work analogously to how df.x[0:-1] currently works? Or a NumPy method that does what I'm making np.arange() do by combining it with np.concatenate()? As I mentioned, I keep running into this problem, and for a language as convenient as Python, this sure seems like a feature that would or should exist. Am I missing or forgetting something?
Edit to clarify request:
avloss suggested using np.tile, which has the degree of simplicity and generality I'm looking for, but it's excessive for my applications. These are cases where you have to close the loop of cyclic data, either for plotting or for calculation (e.g. matching the slopes of the beginning and end of a closed curve using a finite-difference method). In these cases you only need the first n data points in a series (where n is usually somewhere between 1 and 3) repeated in order to close the calculation or plot the cycle.
This is not exactly what you're asking for, but np.tile comes pretty close:
https://numpy.org/doc/stable/reference/generated/numpy.tile.html
Or, if you want to do this through indices, you might use mod division:
np.arange(5) % 4  # array([0, 1, 2, 3, 0])
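Combined with iloc, that gives a compact cyclic selection for the DataFrame from the question:
import numpy as np

n = 1  # how many leading rows to repeat at the end
select = np.arange(len(df) + n) % len(df)
df.iloc[select]  # rows 0, 1, 2, 3, then row 0 again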
Question
Let's assume the following DataFrame is given
ID IBM MSFT APPL ORCL FB TWTR
date
1986-08-31 -1.332298 0.396217 0.574269 -0.679972 -0.470584 0.234379
1986-09-30 -0.222567 0.281202 -0.505856 -1.392477 0.941539 0.974867
1986-10-31 -1.139867 -0.458111 -0.999498 1.920840 0.478174 -0.315904
1986-11-30 -0.189720 -0.542432 -0.471642 1.506206 -1.506439 0.301714
1986-12-31 1.061092 -0.922713 -0.275050 0.776958 1.371245 -2.540688
and I want to do some operations on it. This could be some complicated mathematical method. The columns are structurally the same.
Q1: What is the best method with respect to performance and/or implementation design?
Q2: Should I write a method that disassembles the DataFrame into its numerical parts (NumPy arrays) and indices? The necessary calculations would then be carried out by a submodule operating on the NumPy array, and the main method would only be responsible for collecting the data returned by the submodule and rejoining it with the corresponding indices (see the example code below).
def submodule(np_array):
    # some fancy calculations here
    return modified_array

def main(df):
    cols = df.columns
    indices = df.index
    values = df.values  # .values is an attribute, not a method
    modified_values = submodule(values)
    new_df = pd.DataFrame(modified_values, columns=cols, index=indices)
    return new_df
Q3: Or should I do the calculations with DataFrames directly?
Q4: Or should I work with objects instead?
Q5: What is better with respect to performance, design, or code structure?
Addendum
Some more practical example would be if I want to do a portfolio optimization.
Q6: Should I pass the whole DataFrame into the optimization, or only the numerical matrix? Strictly speaking, I don't think the DataFrame's metadata should be passed into a numerical method, but I am not sure whether my thinking is outdated.
Another example would be calculating the Delta for a number of options (an operation on each individual series rather than a matrix operation).
P.S.:
I know I wouldn't need a separate function for the disassembling, but it highlights my intentions.
I'd like to do some math on a Series vector: specifically, to take the difference between two rows. My first intuition was:
def row_diff(prev, next):
    return next - prev
and then using it
my_col_vec.apply(row_diff)
but this doesn't do what I'd like. It appears apply is row-wise, which is fine, but I can't seem to find an equivalent operation that will let me easily create a new vector from the old one by subtracting each previous row from the next.
Is there a better way to do this? I've been reading this document and it doesn't look like it.
Thanks!
To calculate inter-row differences use diff:
In [6]:
df = pd.DataFrame({'a':np.random.rand(5)})
df
Out[6]:
a
0 0.525220
1 0.031826
2 0.260853
3 0.273792
4 0.281368
In [7]:
df['diff'] = df['a'].diff()
df
Out[7]:
a diff
0 0.525220 NaN
1 0.031826 -0.493394
2 0.260853 0.229027
3 0.273792 0.012940
Also, please try to avoid using apply, as there is usually a vectorised method available.
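For reference, diff itself is just a vectorised shorthand for subtracting a shifted copy of the column:
# equivalent to df['a'].diff()
df['diff'] = df['a'] - df['a'].shift(1)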