I have some problem where data is sorted by date, for example something like this:
date, value, min
2015-08-17, 3, nan
2015-08-18, 2, nan
2015-08-19, 4, nan
2015-08-28, 1, nan
2015-08-29, 5, nan
Now I want to save min values in min column till this row, so result would look something like this:
date, value, min
2015-08-17, 3, 3
2015-08-18, 2, 2
2015-08-19, 4, 2
2015-08-28, 1, 1
2015-08-29, 5, 1
I've tried some options, but still don't get what I'm doing wrong, here is one example that I tried:
data['min'] = min(data['value'], data['min'].shift())
I don't want to iterate through all rows because the data I have is big. What is the best strategy you can write using pandas for this kind of problem?
Since you mentioned that you are working with big dataset, with focus on performance, here's one using NumPy's np.minimum.accumulate -
df['min'] = np.minimum.accumulate(df.value)
Sample run -
In [70]: df
Out[70]:
date value min
0 2015-08-17 3 NaN
1 2015-08-18 2 NaN
2 2015-08-19 4 NaN
3 2015-08-28 1 NaN
4 2015-08-29 5 NaN
In [71]: df['min'] = np.minimum.accumulate(df.value)
In [72]: df
Out[72]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
Runtime test -
In [65]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# #MaxU's soln using pandas cummin
In [66]: %timeit df['min'] = df.value.cummin()
100 loops, best of 3: 6.84 ms per loop
In [67]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# Using NumPy
In [68]: %timeit df['min'] = np.minimum.accumulate(df.value)
100 loops, best of 3: 3.97 ms per loop
Use cummin() method:
In [53]: df['min'] = df.value.cummin()
In [54]: df
Out[54]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
Related
Given the DataFrame:
import pandas as pd
df = pd.DataFrame([6, 4, 2, 4, 5], index=[2, 6, 3, 4, 5], columns=['A'])
Results in:
A
2 6
6 4
3 2
4 4
5 5
Now, I would like to sort by values of Column A AND the index.
e.g.
df.sort_values(by='A')
Returns
A
3 2
6 4
4 4
5 5
2 6
Whereas I would like
A
3 2
4 4
6 4
5 5
2 6
How can I get a sort on the column first and index second?
You can sort by index and then by column A using kind='mergesort'.
This works because mergesort is stable.
res = df.sort_index().sort_values('A', kind='mergesort')
Result:
A
3 2
4 4
6 4
5 5
2 6
Using lexsort from numpy may be other way and little faster as well:
df.iloc[np.lexsort((df.index, df.A.values))] # Sort by A.values, then by index
Result:
A
3 2
4 4
6 4
5 5
2 6
Comparing with timeit:
%%timeit
df.iloc[np.lexsort((df.index, df.A.values))] # Sort by A.values, then by index
Result:
1000 loops, best of 3: 278 µs per loop
With reset index and set index again:
%%timeit
df.reset_index().sort_values(by=['A','index']).set_index('index')
Result:
100 loops, best of 3: 2.09 ms per loop
The other answers are great. I'll throw in one other option, which is to provide a name for the index first using rename_axis and then reference it in sort_values. I have not tested the performance but expect the accepted answer to still be faster.
df.rename_axis('idx').sort_values(by=['A', 'idx'])
A
idx
3 2
4 4
6 4
5 5
2 6
You can clear the index name afterward if you want with df.index.name = None.
I would like to get the average value of a row in a dataframe where I only use values greater than or equal to zero.
For example:
if my dataframe looked like:
df = pd.DataFrame([[3,4,5], [4,5,6],[4,-10,6]])
3 4 5
4 5 6
4 -10 6
currently if I get the average of the row I write :
df['mean'] = df.mean(axis = 1)
and get:
3 4 5 4
4 5 6 5
4 -10 6 0
I would like to get a dataframe that only used values greater than zero to computer the average. I would like a dataframe that looked like:
3 4 5 4
4 5 6 5
4 -10 6 5
In the above example -10 is excluded in the average. Is there a command that excludes the -10?
You can use df[df > 0] to query the data frame before calculating the average; df[df > 0] returns a data frame where cells smaller or equal to zero will be replaced with NaN and get ignored when calculating the mean:
df[df > 0].mean(1)
#0 4.0
#1 5.0
#2 5.0
#dtype: float64
Not nearly as succinct as #Psidom. But if you wanted to use numpy and get some added quickness.
v0 = df.values
v1 = np.where(v0 > 0, v0, np.nan)
v2 = np.nanmean(v1, axis=1)
df.assign(Mean=v2)
0 1 2 Mean
0 3 4 5 4.0
1 4 5 6 5.0
2 4 -10 6 5.0
Timing
small data
%timeit df.assign(Mean=df[df > 0].mean(1))
1000 loops, best of 3: 1.71 ms per loop
%%timeit
v0 = df.values
v1 = np.where(v0 > 0, v0, np.nan)
v2 = np.nanmean(v1, axis=1)
df.assign(Mean=v2)
1000 loops, best of 3: 407 µs per loop
Is it possible to put percentile cuts on all columns of a dataframe with using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!
pd.qcut accepts an 1D array or Series as its argument. To apply pd.qcut to every column requires multiple calls to pd.qcut. So no matter how you dress it up, there will be a loop -- either explicit or implicit.
You could for example, use apply to call pd.qcut for each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
0 1 2 3 4
0 4 0 3 0 3
1 0 0 2 3 0
2 3 4 1 2 3
3 4 1 1 1 4
4 3 2 2 4 1
5 2 4 3 0 1
6 2 3 0 4 4
7 1 3 4 2 2
8 0 1 4 3 0
9 1 2 0 1 2
but under the hood, df.apply is using a for-loop, so it really isn't very different than your for-loop:
df_q = pd.DataFrame()
for col in df:
df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
I have a pandas Series, like so,
data = [1,2,3,2,4,5,6,3,5]
ds = pd.Series(data)
print (ds)
0 1
1 2
2 3
3 2
4 4
5 5
6 6
7 3
8 5
I am interested in getting the standard deviation for each index. For example, when I at index 5, I want to calculate the standard deviations for ds[0:4].
I have done this with the following code,
df = pd.DataFrame(columns = ['data', 'avreturns', 'sd'])
df.data = data
for i in df.index:
dataslice = df.ix[0:i]
df['avreturns'].loc[i] = dataslice.data.mean()
df['sd'].loc[i] = dataslice.data.std()
print (df)
data avreturns sd
0 1 1 NaN
1 2 1.5 0.7071068
2 3 2 1
3 2 2 0.8164966
4 4 2.4 1.140175
5 5 2.833333 1.47196
6 6 3.285714 1.799471
7 3 3.25 1.669046
8 5 3.444444 1.666667
This works, but I using a loop and it is slow. Is there a way to vectorize this?
I was able to vectorize the mean calculations by using the cumsum() function:
df.data.cumsum()/(df.index+1)
Is there a way to vectorize the standard deviation calculations?
You might be interested in pd.expanding_std, which calculates the cumulative standard deviation for you:
>>> pd.expanding_std(ds)
0 NaN
1 0.707107
2 1.000000
3 0.816497
4 1.140175
5 1.471960
6 1.799471
7 1.669046
8 1.666667
dtype: float64
For what it's worth, this type of cumulative operation might be very fiddly to vectorise: the Pandas implementation appears to loop using Cython for speed.
To expand #ajcr's answer, I ran a %timeit against the two ways to do this. I think there is 1000x improvement by using expanding_stds...
data = [x for x in range(1000)]
ds = pd.Series(data)
df = pd.DataFrame(columns = ['data', 'avreturns', 'sd'])
df.data = data
def foo(df):
for i in df.index:
dataslice = df.ix[0:i]
df['avreturns'].loc[i] = dataslice.data.mean()
df['sd'].loc[i] = dataslice.data.std()
return (df)
%timeit foo(df)
1 loops, best of 3: 1min 36s per loop
%timeit pd.expanding_std(df.data)
10000 loops, best of 3: 126 µs per loop
In a Pandas dataset I only want to keep the lowest value per line. All other values should be deleted.
I need the original dataset intact. Just remove all values (replace by NaN) which are not the minimum.
What is the best way to do this - speed/performance wise.
I can also transpose the dataset if the operation is easier per column.
Thanks
Robert
Since the operation you are contemplating does not rely on the columns or index, it might be easier (and faster) to do this using NumPy rather than Pandas.
You can find the location (i.e. column index) of the minimums for each row using
idx = np.argmin(arr, axis=1)
You could then make a new array filled with NaNs and copy the minimum values
to the new array.
import numpy as np
import pandas as pd
def nan_all_but_min(df):
arr = df.values
idx = np.argmin(arr, axis=1)
newarr = np.full_like(arr, np.nan, dtype='float')
newarr[np.arange(arr.shape[0]), idx] = arr[np.arange(arr.shape[0]), idx]
df = pd.DataFrame(newarr, columns=df.columns, index=df.index)
return df
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
df = nan_all_but_min(df)
print(df)
yields
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
Here is a benchmark comparing nan_all_but_min vs using_where:
def using_where(df):
return df.where(df.values == df.min(axis=1)[:,None])
In [73]: df = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [74]: %timeit using_where(df)
1000 loops, best of 3: 701 µs per loop
In [75]: %timeit nan_all_but_min(df)
10000 loops, best of 3: 105 µs per loop
Note that using_where and nan_all_but_min behave differently if a row contains the same min value more than once. using_where will preserve all the mins, nan_all_but_min will preserve only one min. For example:
In [76]: using_where(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[76]:
0 1 2
0 0 0 NaN
1 1 NaN 1
In [77]: nan_all_but_min(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[77]:
0 1 2
0 0 NaN NaN
1 1 NaN NaN
Piggybacking off #unutbu's excellent answer, the following minor change should accommodate your modified question.
The where method
In [26]: df2 = df.copy()
In [27]: df2
Out[27]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [28]: df2.where(df2.values == df2.min(axis=1)[:,None])
Out[28]:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN NaN
3 9 NaN NaN
Mandatory speed test.
In [29]: df3 = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [30]: %timeit df3.where(df3.values == df3.min(axis=1)[:,None])
1000 loops, best of 3: 723 µs per loop
If your data frame already contains NaN values, you must use numpy's nanmin as follows:
df2.where(df2.values==np.nanmin(df2,axis=0))
I just found and tried out the answer by unutbu.
I tried the .where method, but it will be deprecated soon.
FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
However, i got this sucker working instead. However, it is a lambda function, and most likely slower...
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
mask = df.apply(lambda d:(d == df.min(axis=1)))
print (df[mask])
Should yield:
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN