Vectorizing standard deviation calculations for pandas dataseries - python

I have a pandas Series, like so,
data = [1,2,3,2,4,5,6,3,5]
ds = pd.Series(data)
print (ds)
0 1
1 2
2 3
3 2
4 4
5 5
6 6
7 3
8 5
I am interested in getting the cumulative standard deviation at each index. For example, when I am at index 5, I want to calculate the standard deviation of all values up to that point.
I have done this with the following code,
df = pd.DataFrame(columns = ['data', 'avreturns', 'sd'])
df.data = data
for i in df.index:
    # .ix was removed in pandas 1.0; .loc[0:i] slices by label and includes i
    dataslice = df.loc[0:i]
    df.loc[i, 'avreturns'] = dataslice.data.mean()
    df.loc[i, 'sd'] = dataslice.data.std()
print (df)
data avreturns sd
0 1 1 NaN
1 2 1.5 0.7071068
2 3 2 1
3 2 2 0.8164966
4 4 2.4 1.140175
5 5 2.833333 1.47196
6 6 3.285714 1.799471
7 3 3.25 1.669046
8 5 3.444444 1.666667
This works, but I am using a loop and it is slow. Is there a way to vectorize this?
I was able to vectorize the mean calculations by using the cumsum() function:
df.data.cumsum()/(df.index+1)
Is there a way to vectorize the standard deviation calculations?

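You might be interested in pd.expanding_std, which calculates the cumulative standard deviation for you (pandas 0.18 and later spell this ds.expanding().std(); the top-level pd.expanding_* functions were later removed):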
>>> pd.expanding_std(ds)
0 NaN
1 0.707107
2 1.000000
3 0.816497
4 1.140175
5 1.471960
6 1.799471
7 1.669046
8 1.666667
dtype: float64
For what it's worth, this type of cumulative operation might be very fiddly to vectorise: the Pandas implementation appears to loop using Cython for speed.
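As a minimal sketch of the same result on current pandas, the snippet below uses Series.expanding().std() and, for comparison, a cumulative-sum formula in the spirit of the cumsum() trick from the question. The cumulative-sum version is an illustration only: it assumes the textbook identity var_n = (sum(x^2) - (sum(x))^2 / n) / (n - 1), which can lose precision on large or badly scaled data.

```python
import numpy as np
import pandas as pd

ds = pd.Series([1, 2, 3, 2, 4, 5, 6, 3, 5])

# Expanding (cumulative) sample standard deviation; ddof=1 by default.
sd_expanding = ds.expanding().std()

# Vectorised alternative built from cumulative sums (illustrative only):
# var_n = (sum(x^2) - (sum(x))^2 / n) / (n - 1)
x = ds.to_numpy(dtype=float)
n = np.arange(1, len(x) + 1)
s1 = np.cumsum(x)
s2 = np.cumsum(x ** 2)
with np.errstate(divide="ignore", invalid="ignore"):
    var = (s2 - s1 ** 2 / n) / (n - 1)  # first entry is 0/0, i.e. NaN
sd_cumsum = pd.Series(np.sqrt(var), index=ds.index)

print(pd.DataFrame({"expanding": sd_expanding, "cumsum": sd_cumsum}))
```

Both columns should agree with the output shown above, with NaN in the first row because a single observation has no sample standard deviation.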

To expand on @ajcr's answer, I ran %timeit against the two ways to do this. On 1000 rows the loop takes 1min 36s and expanding_std takes 126 µs, so the improvement is far more than 1000x (several hundred thousand times faster):
data = [x for x in range(1000)]
ds = pd.Series(data)
df = pd.DataFrame(columns = ['data', 'avreturns', 'sd'])
df.data = data
def foo(df):
    for i in df.index:
        dataslice = df.ix[0:i]  # .ix as originally timed; removed in pandas 1.0 (use .loc)
        df['avreturns'].loc[i] = dataslice.data.mean()
        df['sd'].loc[i] = dataslice.data.std()
    return (df)
%timeit foo(df)
1 loops, best of 3: 1min 36s per loop
%timeit pd.expanding_std(df.data)
10000 loops, best of 3: 126 µs per loop

Related

Python | Pandas Dataframe: find for each entry in df A the closest timestamp in df B

I am new to python pandas. What I have are 2 pandas dataframes. Among other data both of them contain a timestamp column.
Assume we have df A
x y z timestamp
1 2 3 1.4
4 5 6 1.73
7 8 9 4.1
and df B is:
x y z timestamp
7 4 1 1.7
8 5 2 1.73
9 6 3 3.5
4 5 6 4.8
I would like to compute for each row in A the difference to the position in B which is closest to the timestamp in A. We can assume that the df are both sorted by timestamp. However these timestamps do not share the same start or end time but do certainly have some overlap.
Furthermore the two data frames are not necessarily of same length.
I have a brute force implementation in place which does exactly what I want and which I also can easily extend to potentially interpolate between timestamps -- something which I want to achieve in an improved version. However, my implementation is terribly slow.
I am sure there is a more performant way of implementing the following:
idxA = 0
idxB = 0
endA = len(A)
endB = len(B)
while idxA < endA and idxB < endB:
    currentA_ts = A['timestamp'][idxA]
    currentB_ts = B['timestamp'][idxB]
    if idxB < endB-1:
        nextB_ts = B['timestamp'][idxB+1]
        if abs(currentB_ts - currentA_ts) > abs(nextB_ts - currentA_ts):
            idxB += 1
    currentClosestB_row = B.iloc[idxB]
    currentA_row = A.iloc[idxA]
    B_location = currentClosestB_row[['x','y','z']]
    A_location = currentA_row[['x', 'y', 'z']]
    direction = get_direction_vector(B_location, A_location)
    currentA_row['dir_x'] = direction[0]
    currentA_row['dir_y'] = direction[1]
    currentA_row['dir_z'] = direction[2]
    out_df.append(currentA_row)
    idxA += 1
I hope that code snippet clarifies what I try to achieve. But as mentioned above, this is terribly slow as the df A and B both have several 100k entries.
I see two ways of improving the above code:
The general structure of how I try to achieve the described goal.
I can imagine that how I use python and pandas is not optimal. I am using pandas for the very first time, also python is not my main programming language - so please let me know in case you see something that can be improved.
Any feedback on how to speed up that code is highly appreciated.
Many thanks in advance.
Matching rows to the closest values is called an asof merge, i.e. a “left join except that we match on nearest key rather than equal keys” − both columns need to be sorted.
>>> pd.merge_asof(df1, df2, on='timestamp', suffixes=('_a', '_b'), direction='nearest')
x_a y_a z_a timestamp x_b y_b z_b
0 1 2 3 1.40 7 4 1
1 4 5 6 1.73 8 5 2
2 7 8 9 4.10 9 6 3
If you want to be able to subtract the two timestamp columns, you need them to be named differently. You can add suffixes before the merge:
>>> df = pd.merge_asof(df1.add_suffix('_a'), df2.add_suffix('_b'), direction='nearest',
... left_on='timestamp_a', right_on='timestamp_b')
>>> df['delta'] = df['timestamp_a'] - df['timestamp_b']
>>> df
x_a y_a z_a timestamp_a x_b y_b z_b timestamp_b delta
0 1 2 3 1.40 7 4 1 1.70 -0.3
1 4 5 6 1.73 8 5 2 1.73 0.0
2 7 8 9 4.10 9 6 3 3.50 0.6
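To connect this back to the question, here is a minimal sketch that builds the two example frames (named A and B as in the question) and computes a per-row direction after the nearest-timestamp merge. The question's get_direction_vector is not shown, so the sketch assumes it is a plain component-wise difference B - A; that is an assumption, not the asker's actual function.

```python
import pandas as pd

A = pd.DataFrame({"x": [1, 4, 7], "y": [2, 5, 8], "z": [3, 6, 9],
                  "timestamp": [1.4, 1.73, 4.1]})
B = pd.DataFrame({"x": [7, 8, 9, 4], "y": [4, 5, 6, 5], "z": [1, 2, 3, 6],
                  "timestamp": [1.7, 1.73, 3.5, 4.8]})

# Both frames must already be sorted by their timestamp column.
merged = pd.merge_asof(A.add_suffix("_a"), B.add_suffix("_b"),
                       left_on="timestamp_a", right_on="timestamp_b",
                       direction="nearest")

# Assumed stand-in for get_direction_vector: component-wise B - A.
for axis in ("x", "y", "z"):
    merged[f"dir_{axis}"] = merged[f"{axis}_b"] - merged[f"{axis}_a"]

print(merged[["timestamp_a", "timestamp_b", "dir_x", "dir_y", "dir_z"]])
```

Because the merge and the subtraction both operate on whole columns, there is no Python-level loop over the several 100k rows.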

Getting average of rows in dataframe greater than or equal to zero

I would like to get the average value of a row in a dataframe where I only use values greater than or equal to zero.
For example:
if my dataframe looked like:
df = pd.DataFrame([[3,4,5], [4,5,6],[4,-10,6]])
3 4 5
4 5 6
4 -10 6
Currently, to get the average of each row I write:
df['mean'] = df.mean(axis = 1)
and get:
3 4 5 4
4 5 6 5
4 -10 6 0
I would like to get a dataframe that only uses values greater than or equal to zero to compute the average. I would like a dataframe that looks like:
3 4 5 4
4 5 6 5
4 -10 6 5
In the above example -10 is excluded in the average. Is there a command that excludes the -10?
You can use df[df > 0] to filter the data frame before calculating the average; df[df > 0] returns a data frame where cells smaller than or equal to zero are replaced with NaN, and NaN is ignored when calculating the mean (use df[df >= 0] instead if zeros should count towards the average, as the question title asks):
df[df > 0].mean(1)
#0 4.0
#1 5.0
#2 5.0
#dtype: float64
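The difference between the > 0 and >= 0 masks only shows up when a row contains a zero. A small sketch, with a fourth row added purely for illustration (it is not part of the question's data):

```python
import pandas as pd

# The last row is a made-up example containing a zero and a negative value.
df = pd.DataFrame([[3, 4, 5], [4, 5, 6], [4, -10, 6], [0, 2, -1]])

# Keep values >= 0, turn everything else into NaN, then average each row.
mean_nonnegative = df.where(df >= 0).mean(axis=1)

# Same idea with a strict comparison: zeros are dropped as well.
mean_positive = df.where(df > 0).mean(axis=1)

print(pd.DataFrame({">= 0": mean_nonnegative, "> 0": mean_positive}))
```

For the three rows from the question the two masks give the same means; only the added row differs.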
Not nearly as succinct as @Psidom's answer, but if you want to use NumPy for some added speed:
v0 = df.values
v1 = np.where(v0 > 0, v0, np.nan)
v2 = np.nanmean(v1, axis=1)
df.assign(Mean=v2)
0 1 2 Mean
0 3 4 5 4.0
1 4 5 6 5.0
2 4 -10 6 5.0
Timing
small data
%timeit df.assign(Mean=df[df > 0].mean(1))
1000 loops, best of 3: 1.71 ms per loop
%%timeit
v0 = df.values
v1 = np.where(v0 > 0, v0, np.nan)
v2 = np.nanmean(v1, axis=1)
df.assign(Mean=v2)
1000 loops, best of 3: 407 µs per loop

min value till row pandas

I have a problem where the data is sorted by date, for example something like this:
date, value, min
2015-08-17, 3, nan
2015-08-18, 2, nan
2015-08-19, 4, nan
2015-08-28, 1, nan
2015-08-29, 5, nan
Now I want to save min values in min column till this row, so result would look something like this:
date, value, min
2015-08-17, 3, 3
2015-08-18, 2, 2
2015-08-19, 4, 2
2015-08-28, 1, 1
2015-08-29, 5, 1
I've tried some options, but still don't get what I'm doing wrong, here is one example that I tried:
data['min'] = min(data['value'], data['min'].shift())
I don't want to iterate through all rows because the data I have is big. What is the best strategy you can write using pandas for this kind of problem?
Since you mentioned that you are working with a big dataset, with a focus on performance, here's one using NumPy's np.minimum.accumulate -
df['min'] = np.minimum.accumulate(df.value)
Sample run -
In [70]: df
Out[70]:
date value min
0 2015-08-17 3 NaN
1 2015-08-18 2 NaN
2 2015-08-19 4 NaN
3 2015-08-28 1 NaN
4 2015-08-29 5 NaN
In [71]: df['min'] = np.minimum.accumulate(df.value)
In [72]: df
Out[72]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
Runtime test -
In [65]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# @MaxU's soln using pandas cummin
In [66]: %timeit df['min'] = df.value.cummin()
100 loops, best of 3: 6.84 ms per loop
In [67]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# Using NumPy
In [68]: %timeit df['min'] = np.minimum.accumulate(df.value)
100 loops, best of 3: 3.97 ms per loop
Use the cummin() method:
In [53]: df['min'] = df.value.cummin()
In [54]: df
Out[54]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
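For reference, a self-contained sketch that rebuilds the sample frame from the question and checks that the two approaches above agree. The variable name numpy_min is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2015-08-17", "2015-08-18", "2015-08-19",
                            "2015-08-28", "2015-08-29"]),
    "value": [3, 2, 4, 1, 5],
})

# Running minimum with pandas.
df["min"] = df["value"].cummin()

# Same running minimum on the underlying NumPy array.
numpy_min = np.minimum.accumulate(df["value"].to_numpy())

print(df)
print((numpy_min == df["min"].to_numpy()).all())
```

Both compute the minimum of all values up to and including the current row, so neither needs the shift() attempted in the question.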

python - possible to apply percentile cuts to each column in a dataframe?

Is it possible to put percentile cuts on all columns of a dataframe without using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
    df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!
pd.qcut accepts a 1D array or Series as its argument. To apply pd.qcut to every column requires multiple calls to pd.qcut. So no matter how you dress it up, there will be a loop -- either explicit or implicit.
You could, for example, use apply to call pd.qcut for each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
0 1 2 3 4
0 4 0 3 0 3
1 0 0 2 3 0
2 3 4 1 2 3
3 4 1 1 1 4
4 3 2 2 4 1
5 2 4 3 0 1
6 2 3 0 4 4
7 1 3 4 2 2
8 0 1 4 3 0
9 1 2 0 1 2
but under the hood, df.apply is using a for-loop, so it really isn't very different than your for-loop:
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
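To illustrate the point about column labels, here is a small sketch with string column names, where indexing by position such as df[0] would fail. The column names and the seeded random data are made up for the example, and labels=False is used instead of labels=list(range(5)) so that pd.qcut returns plain integer bin numbers.

```python
import numpy as np
import pandas as pd

# Seeded random data with non-integer column labels (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])

# apply passes each column to the function as a Series, whatever its label.
df_q = df.apply(lambda col: pd.qcut(col, 5, labels=False))

print(df_q)
```

With 10 distinct values and 5 quantile bins, each column ends up with two rows per bin.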

keep only lowest value per row in a Python Pandas dataset

In a Pandas dataset I only want to keep the lowest value per line. All other values should be deleted.
I need the original dataset intact. Just remove all values (replace by NaN) which are not the minimum.
What is the best way to do this - speed/performance wise.
I can also transpose the dataset if the operation is easier per column.
Thanks
Robert
Since the operation you are contemplating does not rely on the columns or index, it might be easier (and faster) to do this using NumPy rather than Pandas.
You can find the location (i.e. column index) of the minimums for each row using
idx = np.argmin(arr, axis=1)
You could then make a new array filled with NaNs and copy the minimum values
to the new array.
import numpy as np
import pandas as pd
def nan_all_but_min(df):
    arr = df.values
    idx = np.argmin(arr, axis=1)
    newarr = np.full_like(arr, np.nan, dtype='float')
    newarr[np.arange(arr.shape[0]), idx] = arr[np.arange(arr.shape[0]), idx]
    df = pd.DataFrame(newarr, columns=df.columns, index=df.index)
    return df
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
df = nan_all_but_min(df)
print(df)
yields
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
Here is a benchmark comparing nan_all_but_min vs using_where:
def using_where(df):
    return df.where(df.values == df.min(axis=1)[:,None])
In [73]: df = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [74]: %timeit using_where(df)
1000 loops, best of 3: 701 µs per loop
In [75]: %timeit nan_all_but_min(df)
10000 loops, best of 3: 105 µs per loop
Note that using_where and nan_all_but_min behave differently if a row contains the same min value more than once. using_where will preserve all the mins, nan_all_but_min will preserve only one min. For example:
In [76]: using_where(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[76]:
0 1 2
0 0 0 NaN
1 1 NaN 1
In [77]: nan_all_but_min(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[77]:
0 1 2
0 0 NaN NaN
1 1 NaN NaN
Piggybacking off @unutbu's excellent answer, the following minor change should accommodate your modified question.
The where method
In [26]: df2 = df.copy()
In [27]: df2
Out[27]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [28]: df2.where(df2.values == df2.min(axis=1)[:,None])
Out[28]:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN NaN
3 9 NaN NaN
Mandatory speed test.
In [29]: df3 = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [30]: %timeit df3.where(df3.values == df3.min(axis=1)[:,None])
1000 loops, best of 3: 723 µs per loop
If your data frame already contains NaN values, you can use NumPy's nanmin to take the row-wise minimum while ignoring them:
df2.where(df2.values == np.nanmin(df2.values, axis=1)[:, None])
I just found and tried out the answer by unutbu.
I tried the .where method, but the `[:, None]` indexing on the Series returned by df.min(axis=1) raises a deprecation warning:
FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
However, I got the following working instead. It uses a lambda function, so it is most likely slower...
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
mask = df.apply(lambda d:(d == df.min(axis=1)))
print (df[mask])
Should yield:
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
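Following the hint in the warning text, a minimal sketch that converts the row minima to a NumPy array before adding the extra axis, alongside a pandas-only variant using DataFrame.eq. The sample values are rounded stand-ins for the random data above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.54, 0.50, 0.06],
                   [0.68, 0.16, 0.89],
                   [0.39, 0.65, 0.51],
                   [0.63, 0.84, 0.86]])

# Convert to a NumPy array first, then reshape to a column for broadcasting.
row_min = df.min(axis=1).to_numpy()[:, None]
only_min = df.where(df.to_numpy() == row_min)

# Pandas-only variant: compare every column against the row minima by index.
only_min_eq = df.where(df.eq(df.min(axis=1), axis=0))

print(only_min)
```

Like using_where above, both variants keep every occurrence of the minimum when a row contains ties.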
