Vectorizing function on arrays in DataFrame column? - python

I want to use a function (scipy.signal.savgol_filter) on every element in a DataFrame column (every element of the column is an array). Looping seems a little unnecessary, but I can't wrap my head around a vectorized solution.
I tried the obvious .apply method as well as calling the function on the column directly. Both raise an error like "setting an array element with a sequence".
Example code with lists instead of arrays (but same results):
import pandas as pd
from scipy import signal
df = pd.DataFrame(data={'A': [[1,3,9], [7,2,3], [3,2,6,3], [2,3,4]]})
df['smooth'] = df.apply(signal.savgol_filter, args=(3, 0))
Respectively:
df['smooth'] = signal.savgol_filter(df['A'], 3, 0)
Or:
df['smooth'] = signal.savgol_filter(df['A'].values, 3, 0)
None of those work, I think because the whole column is given to the function.
Is there a way to use the function on all the elements (= arrays) in the column at the same time, or do I have to loop over every row?

The problem is that your elements aren't all the same shape, so they can't be treated as a single multidimensional array.
If you just want to apply that function to each row you need to select the column explicitly:
df['smooth'] = df['A'].apply(signal.savgol_filter, args=(3, 0))
This is not really a vectorized solution, though.
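For what it's worth, a truly vectorized call is only possible when all the arrays have the same length, since they can then be stacked into a 2-D array and filtered along one axis in a single call. A minimal sketch under that assumption (savgol_filter accepts an axis argument):
import numpy as np
import pandas as pd
from scipy import signal

# sketch assuming every array in the column has the same length
df = pd.DataFrame(data={'A': [[1, 3, 9], [7, 2, 3], [2, 3, 4]]})

stacked = np.stack(df['A'].values)                       # shape (n_rows, n_points)
smoothed = signal.savgol_filter(stacked, 3, 0, axis=1)   # filter every row at once
df['smooth'] = list(smoothed)                            # back to one array per cell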
Edit:
It's worth adding that there is discussion over on the numpy issue tracker about the ambiguity of this error message.

Related

Get rows from a DataFrame by using a list of slices

I have a DataFrame with several million rows and a list of interesting sections I need to select out of it. I'm looking for a highly efficient (read: fastest possible) way of doing this.
I know I can do this:
slices = [slice(0,10), slice(20,50), slice(1000,5000)]
for slice in slices:
    df.loc[slice, 'somecolumn'] = True
... but that just seems like an inefficient way of getting the job done. It's really slow.
This seems faster than the for loop above, but I'm not sure if this is the best possible approach:
from itertools import chain
ranges = chain.from_iterable(slices)
df.loc[ranges, 'somecolumns'] = True
This also doesn't work, even though it seems that maybe it should:
df.loc[slices, 'somecolumns'] = True
TypeError: unhashable type: 'slice'
My primary concern in this is performance. I need the best I can get out of this due to the size of the data frames I am dealing with.
pandas
You can try a couple of tricks:
Use np.r_ to concatenate slice objects into a single NumPy array. Indexing with NumPy arrays is usually efficient as these are used internally in the Pandas framework.
Use positional integer indexing via pd.DataFrame.iloc instead of primarily label-based loc. The former is more restrictive and more closely aligned with NumPy indexing.
Here's a demo:
import numpy as np
import pandas as pd

# some example dataframe
df = pd.DataFrame(dict(zip('ABCD', np.arange(100).reshape((4, 25)))))

# concatenate multiple slices
slices = np.r_[slice(0, 3), slice(6, 10), slice(15, 20)]

# use integer indexing
df.iloc[slices, df.columns.get_loc('C')] = 0
numpy
If your series is held in a contiguous memory block, which is usually the case with numeric (or Boolean) arrays, you can try updating the underlying NumPy array in-place. First define slices via np.r_ as above, then use:
df['C'].values[slices] = 0
This bypasses the Pandas interface and any associated checks that occur via the regular indexing methods.
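For reference, a quick sketch of that in-place route on the demo frame above. This assumes the column really is backed by a contiguous numeric block; whether the write propagates back into the DataFrame can also depend on the pandas version and copy-on-write settings, so treat it as an optimization to verify, not a guarantee:
import numpy as np
import pandas as pd

# same demo frame and slices as above
df = pd.DataFrame(dict(zip('ABCD', np.arange(100).reshape((4, 25)))))
slices = np.r_[slice(0, 3), slice(6, 10), slice(15, 20)]

# write through the underlying NumPy array, bypassing the pandas indexing layer
df['C'].values[slices] = 0
print(df['C'].head())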
IIUC, you are looking to slice on axis=0 (the row index). Instead of slices, I'm using numpy's arange method to build explicit row positions; with a default integer index you can pass them to df.loc (note that df.ix is deprecated in current pandas):
slices = np.concatenate([np.arange(0, 10), np.arange(20, 50), np.arange(1000, 5000)])  # add other row slices here
df.loc[slices, 'some_col']
You can try building a full indexer for the rows first, then do your assignment:
row_indexer = np.concatenate([df.index[sub_slice] for sub_slice in slices])
df.loc[row_indexer, column] = True

pandas df.apply returns series of the same list (like map) where should return one list

I have a function that takes a row of the dataframe (pd.Series) and returns one list. The idea is to apply it to the dataframe and generate a new pd.Series of lists, one per row:
sale_candidats = closings.apply(get_candidates_3, axis=1,
                                sales=sales_ts,
                                settings=settings,
                                reduce=True)
However, it seems that pandas tries to map the list it returns (for the first row, probably) back onto the original row, and raises an error (even with reduce=True):
ValueError: Shape of passed values is (10, 8), indices imply (10, 23)
When I convert the function to return a set instead of a list, the whole thing starts working - except that it returns a DataFrame with the same shape and index/column names as the original, where every cell is filled with the corresponding row's set().
Looks a lot like a bug to me... how can I return one pd.Series instead?
It seems that this behaviour is, indeed, a bug in the latest version of pandas. Take a look at the issue:
https://github.com/pandas-dev/pandas/pull/18577
You could just apply the function in a for loop, because that's all that apply does. You wouldn't notice a large speed penalty.
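For illustration, a minimal sketch of that loop-based workaround, assuming get_candidates_3, closings, sales_ts and settings exist as in the question:
import pandas as pd

# call the function once per row and collect the lists ourselves
results = {}
for idx, row in closings.iterrows():
    results[idx] = get_candidates_3(row, sales=sales_ts, settings=settings)

# one list per original row, indexed like the original frame
sale_candidats = pd.Series(results)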

Shifting all rows in dask dataframe

In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

with pd.HDFStore(path) as store:
    data = dd.from_hdf(store, 'sim')[col1]
    shifted = data.shift(1)
    idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that this method would also catch changes from a signed value to zero.)
I would then use the boolean series to index a different Dask dataframe for plotting.
Rolling functions
Currently dask.dataframe does not implement the shift operation. It could, though, if you raise an issue. In principle this is not so dissimilar from the rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc.
Actually, if you were to create a pandas function that adheres to the same API as these pandas.rolling_foo functions, then you could use the dask.dataframe.rolling.wrap_rolling function to turn your pandas-style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)
The following code might help to shift down the series.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you are rolling and finding the sum, for a particular row,
result[i] = row[i-1] + row[i]
Then by subtracting the old value of the column from the result, you are doing the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column getting shifted down once.
Tip:
If you want to shift it down multiple rows, you should actually execute the whole operation again that many times with the same window.
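To make the trick concrete, here is a small sketch comparing it against pandas' own shift on a toy column (the column name is just a placeholder):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'column': [1.0, 2.0, 3.0, 4.0, 5.0]})
dd_df = dd.from_pandas(pdf, npartitions=2)

# rolling(2).sum() gives row[i-1] + row[i]; subtracting row[i] leaves row[i-1]
shifted = dd_df['column'].rolling(window=2).sum() - dd_df['column']

print(shifted.compute())        # matches the pandas result below
print(pdf['column'].shift(1))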

Should pandas dataframes be nested?

I am creating a python script that drives an old fortran code to locate earthquakes. I want to vary the input parameters to the fortran code in the python script and record the results, as well as the values that produced them, in a dataframe. The results from each run are also convenient to put in a dataframe, leading me to a situation where I have a nested dataframe (i.e. a DataFrame assigned to an element of a DataFrame). For example:
import pandas as pd
import numpy as np

def some_operation(row):
    results = np.random.rand(50, 3) * row['p1'] / row['p2']
    res = pd.DataFrame(results, columns=['foo', 'bar', 'rms'])
    return res

# Init master df
df_master = pd.DataFrame(columns=['p1', 'p2', 'results'], index=range(3))
df_master['p1'] = np.random.rand(len(df_master))
df_master['p2'] = np.random.rand(len(df_master))
df_master = df_master.astype(object)  # make sure generic types can be used

# loop over each row, call some_operation and store results DataFrame
for ind, row in df_master.iterrows():
    df_master.loc[ind, "results"] = some_operation(row)
Which raises this exception:
ValueError: Incompatible indexer with DataFrame
It works as expected, however, if I change the last line to this:
df_master["results"][ind] = some_operation(row)
I have a few questions:
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc., it seems to work fine.
Should the DataFrame be used in this way? I know that dtype object can be ultra slow for sorting and whatnot, but I am really just using the dataframe as a convenient container, because the column/index notation is quite slick. If DataFrames should not be used in this way, is there a similar alternative? I was looking at the Panel class, but I am not sure if it is the proper solution for my application. I would hate to forge ahead and apply the hack shown above to some code and then have it not be supported in future releases of pandas.
Why does .loc (and .ix) fail when the slice assignment succeeds? If the some_operation function returned a list, dictionary, etc. it seems to work fine.
This is a strange little corner case of the code. It stems from the fact that if the item being assigned is a DataFrame, loc and ix assume that you want to fill the given indices with the content of the DataFrame. For example:
>>> df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
>>> df2 = pd.DataFrame({'a':[100], 'b':[200]})
>>> df1.loc[[0], ['a', 'b']] = df2
>>> df1
     a    b
0  100  200
1    2    5
2    3    6
If this syntax also allowed storing a DataFrame as an object, it's not hard to imagine a situation where the user's intent would be ambiguous, and ambiguity does not make a good API.
Should the DataFrame be used in this way?
As long as you know the performance drawbacks of the method (and it sounds like you do) I think this is a perfectly suitable way to use a DataFrame. For example, I've seen a similar strategy used to store the trained scikit-learn estimators in cross-validation across a large grid of parameters (though I can't recall the exact context of this at the moment...)
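For comparison, a minimal sketch of one alternative that avoids nesting altogether: keep the parameter DataFrame flat and hold each run's results in a plain dict keyed by the same index (the names mirror the question's example):
import numpy as np
import pandas as pd

# flat parameter table
df_master = pd.DataFrame({'p1': np.random.rand(3), 'p2': np.random.rand(3)})

# per-run results stored outside the DataFrame, keyed by the master index
results = {}
for ind, row in df_master.iterrows():
    vals = np.random.rand(50, 3) * row['p1'] / row['p2']
    results[ind] = pd.DataFrame(vals, columns=['foo', 'bar', 'rms'])

# look up the parameters and results for run 0
print(df_master.loc[0])
print(results[0].head())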

Doing column math with numpy in python

I am looking for coding examples to learn NumPy.
My usage would be dtype='object'.
To construct the array, the code used would be
a = np.asarray(d, dtype='object')
rather than np.asarray(d) or np.asarray(d, dtype='float32').
Is sorting any different than with float32/64?
Coming from Excel "cell" equations, I'm wrapping my head around row/column math.
Ex:
A = np.array([['a', 2, 3, 4], ['b', 5, 6, 2], ['c', 5, 1, 5]], dtype='object')
# array([['a', 2, 3, 4],
#        ['b', 5, 6, 2],
#        ['c', 5, 1, 5]], dtype=object)
Create a new array with:
How would I sort high to low by column [3]?
How would I calc for an entire column, (1,1) - (1,0)? Example without sorting A:
['b', 3],
['c', 0]
How would I calc for the entire array, (1,1) - (2,0)? Example without sorting A:
['b', 2],
['c', -1]
Despite the fact that I still cannot understand exactly what you are asking, here is my best guess. Let's say you want to sort A by the values in the 3rd column:
import numpy as np

A = np.array([['a', 2, 3, 4], ['b', 5, 6, 2], ['c', 5, 1, 5]], dtype='object')
ii = np.argsort(A[:, 2])
print(A[ii, :])
Here the rows have been sorted according to the 3rd column, but the contents of each row are left untouched.
Subtracting entire columns is a problem because of the string objects; however, if you exclude them, you can, for example, subtract the 3rd row from the 1st with:
A[0,1:] - A[2,1:]
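If the goal is that kind of row-to-row arithmetic over the whole array, one option is to cast just the numeric columns and difference consecutive rows in one call; a small sketch:
import numpy as np

A = np.array([['a', 2, 3, 4], ['b', 5, 6, 2], ['c', 5, 1, 5]], dtype='object')

# cast the numeric columns to float, then take row[i+1] - row[i] for each column
num = A[:, 1:].astype(float)
print(np.diff(num, axis=0))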
If I didn't understand the basic point of your question, then please revise it. I highly recommend you take a look at the numpy tutorial and documentation if you have not done so already:
http://docs.scipy.org/doc/numpy/reference/
http://docs.scipy.org/doc/numpy/user/
