keep only lowest value per row in a Python Pandas dataset - python

In a Pandas DataFrame I only want to keep the lowest value per row. All other values should be deleted.
I need the original DataFrame intact; just remove all values (replace them with NaN) which are not the minimum.
What is the best way to do this, speed/performance-wise?
I can also transpose the DataFrame if the operation is easier per column.
Thanks
Robert

Since the operation you are contemplating does not rely on the columns or index, it might be easier (and faster) to do this using NumPy rather than Pandas.
You can find the location (i.e. column index) of the minimums for each row using
idx = np.argmin(arr, axis=1)
You could then make a new array filled with NaNs and copy the minimum values
to the new array.
import numpy as np
import pandas as pd

def nan_all_but_min(df):
    arr = df.values
    idx = np.argmin(arr, axis=1)
    newarr = np.full_like(arr, np.nan, dtype='float')
    newarr[np.arange(arr.shape[0]), idx] = arr[np.arange(arr.shape[0]), idx]
    df = pd.DataFrame(newarr, columns=df.columns, index=df.index)
    return df
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
df = nan_all_but_min(df)
print(df)
yields
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
Here is a benchmark comparing nan_all_but_min vs using_where:
def using_where(df):
    return df.where(df.values == df.min(axis=1)[:, None])
In [73]: df = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [74]: %timeit using_where(df)
1000 loops, best of 3: 701 µs per loop
In [75]: %timeit nan_all_but_min(df)
10000 loops, best of 3: 105 µs per loop
Note that using_where and nan_all_but_min behave differently if a row contains the same min value more than once: using_where will preserve all the mins, while nan_all_but_min will preserve only one min. For example:
In [76]: using_where(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[76]:
0 1 2
0 0 0 NaN
1 1 NaN 1
In [77]: nan_all_but_min(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[77]:
0 1 2
0 0 NaN NaN
1 1 NaN NaN
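If you want the NumPy speed of nan_all_but_min but also want to keep duplicated minimums (as using_where does), here is a small sketch of a variant (not part of the original answer):
import numpy as np
import pandas as pd
def nan_all_but_min_keep_ties(df):
    arr = df.values
    # keepdims=True keeps the row minimums as a column vector, so the comparison
    # broadcasts over the full array and marks every occurrence of each row's minimum
    mask = arr == arr.min(axis=1, keepdims=True)
    return pd.DataFrame(np.where(mask, arr, np.nan),
                        columns=df.columns, index=df.index)
print(nan_all_but_min_keep_ties(pd.DataFrame([(0, 0, 1), (1, 2, 1)])))
#      0    1    2
# 0  0.0  0.0  NaN
# 1  1.0  NaN  1.0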

Piggybacking off @unutbu's excellent answer, the following minor change should accommodate your modified question.
The where method
In [26]: df2 = df.copy()
In [27]: df2
Out[27]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [28]: df2.where(df2.values == df2.min(axis=1)[:,None])
Out[28]:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN NaN
3 9 NaN NaN
Mandatory speed test.
In [29]: df3 = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [30]: %timeit df3.where(df3.values == df3.min(axis=1)[:,None])
1000 loops, best of 3: 723 µs per loop

If your data frame already contains NaN values, use NumPy's nanmin so the existing NaNs are ignored when computing each row's minimum:
df2.where(df2.values == np.nanmin(df2.values, axis=1)[:, None])
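For example, a quick sketch of the behaviour on a frame that already holds a NaN:
import numpy as np
import pandas as pd
df2 = pd.DataFrame([[1.0, np.nan, 3.0],
                    [4.0, 2.0, 0.5]])
# np.nanmin ignores the existing NaN when computing each row's minimum
print(df2.where(df2.values == np.nanmin(df2.values, axis=1)[:, None]))
#      0   1    2
# 0  1.0 NaN  NaN
# 1  NaN NaN  0.5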

I just found and tried out the answer by unutbu.
I tried the .where method, but it raises a FutureWarning, because indexing a Series with [:, None] is deprecated:
FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
I got the following working instead. However, it uses apply with a lambda function and is most likely slower...
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
mask = df.apply(lambda d:(d == df.min(axis=1)))
print (df[mask])
Should yield:
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
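A variant that avoids both the multi-dimensional-indexing warning and apply (a sketch, not taken from the answers above) is to let pandas align the row minimums itself with eq(..., axis=0):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((4, 3)))
# df.eq(series, axis=0) compares every column against the row-wise minimums,
# so no manual [:, None] broadcasting is needed; ties are kept, like using_where
print(df.where(df.eq(df.min(axis=1), axis=0)))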

Related

How to use previous N values in pandas column to fill NaNs?

Say I have time series data as below.
df
priceA priceB
0 25.67 30.56
1 34.12 28.43
2 37.14 29.08
3 NaN 34.23
4 32 NaN
5 18.75 41.1
6 NaN 45.12
7 23 39.67
8 NaN 36.45
9 36 NaN
Now I want to fill the NaNs in column priceA by taking the mean of the previous N values in that column; in this case take N=3.
For column priceB I have to fill each NaN with the value M rows above (current index - M).
I tried to write a for loop for it, which is not good practice as my data is too large. Is there a better way to do this?
N = 3
M = 2

def fillPriceA(df, indexval, n):
    temp = []
    for i in range(n):
        if indexval - (i + 1) < 0:
            continue
        temp.append(df.loc[indexval - (i + 1), 'priceA'])
    return np.nanmean(np.array(temp, dtype=float))

def fillPriceB(df, indexval, m):
    return df.loc[indexval - m, 'priceB']

for idx, rows in df.iterrows():
    if idx < N:
        continue
    if pd.isnull(rows['priceA']):
        df.loc[idx, 'priceA'] = fillPriceA(df, idx, N)
    if pd.isnull(rows['priceB']):
        df.loc[idx, 'priceB'] = fillPriceB(df, idx, M)
Expected output:
priceA priceB
0 25.67 30.56
1 34.12 28.43
2 37.14 29.08
3 32.31 34.23
4 32 29.08
5 18.75 41.1
6 27.68 45.12
7 23 39.67
8 23.14 36.45
9 36 39.67
A solution could be to only work with the NaN index (see dataframe boolean indexing):
param = dict(priceA=3, priceB=2)  # number of previous values to consider
for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # iterate over the NaN index
        _window = df.iloc[max(0, i - param[col]):i][col]  # get the n previous elements
        df.loc[i, col] = _window.mean() if col == 'priceA' else _window.iloc[0]  # replace with the right method
print(df)
Result:
priceA priceB
0 25.670000 30.56
1 34.120000 28.43
2 37.140000 29.08
3 32.310000 34.23
4 32.000000 29.08
5 18.750000 41.10
6 27.686667 45.12
7 23.000000 39.67
8 23.145556 36.45
9 36.000000 39.67
Note
1. Using np.isnan() implies that your columns are numeric. If they are not, convert them first with pd.to_numeric():
...
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
...
Or use pd.isnull() instead (see the example below). Be aware of the performance difference (NumPy is faster):
from random import randint
# A sample with 10k elements and some np.nan
arr = np.random.rand(10000)
for i in range(100):
    arr[randint(0, 9999)] = np.nan
# Performances
%timeit pd.isnull(arr)
10000 loops, best of 3: 24.8 µs per loop
%timeit np.isnan(arr)
100000 loops, best of 3: 5.6 µs per loop
2. A more generic alternative could be to define, for each column, the window size and the method to apply in a dict:
import pandas as pd
param = {}
param['priceA'] = {'n': 3,
                   'method': lambda x: np.nanmean(x)}
param['priceB'] = {'n': 2,
                   'method': lambda x: x[0]}
param now contains n, the number of elements, and method, a lambda expression. Accordingly, rewrite your loops:
for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # iterate over the NaN index
        _window = df.iloc[max(0, i - param[col]['n']):i][col]  # get the n previous elements
        df.loc[i, col] = param[col]['method'](_window.values)  # apply the right method
print(df)  # this leads to a similar result
You can use an NA mask to do what you need per column:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2,3,4, None, 5, 6], 'b': [1, None, 2, 3, 4, None, 7]})
df
# a b
# 0 1.0 1.0
# 1 2.0 NaN
# 2 3.0 2.0
# 3 4.0 3.0
# 4 NaN 4.0
# 5 5.0 NaN
# 6 6.0 7.0
for col in df.columns:
    s = df[col]
    na_indices = s[s.isnull()].index.tolist()
    prev = 0
    for k in na_indices:
        s[k] = np.mean(s[prev:k])
        prev = k
    df[col] = s
print(df)
# a b
# 0 1.0 1.0
# 1 2.0 1.0
# 2 3.0 2.0
# 3 4.0 3.0
# 4 2.5 4.0
# 5 5.0 2.5
# 6 6.0 7.0
While this is still a custom operation, I am pretty sure it will be slightly faster, because it iterates only over the NA values rather than over every row, and I am assuming the NAs are sparse compared to the actual data.
To fill priceA, use rolling, then shift, and use the result in fillna:
# make some data
df = pd.DataFrame({'priceA': range(10)})
#make some rows missing
df.loc[[4, 6], 'priceA'] = np.nan
n = 3
df.priceA = df.priceA.fillna(df.priceA.rolling(n, min_periods=1).mean().shift(1))
The only edge case here is when two NaNs fall within n rows of one another, since the rolling mean is computed over the original (unfilled) values; it seems to handle the data in your question, but check that case.
For priceB just use shift,
df = pd.DataFrame({'priceB': range(10)})
df.loc[[4, 8], 'priceB'] = np.nan
m = 2
df.priceB = df.priceB.fillna(df.priceB.shift(m))
Like before, there is an edge case when a NaN sits exactly m rows before another NaN.

min value till row pandas

I have a problem where data is sorted by date, for example something like this:
date, value, min
2015-08-17, 3, nan
2015-08-18, 2, nan
2015-08-19, 4, nan
2015-08-28, 1, nan
2015-08-29, 5, nan
Now I want to save, in the min column, the minimum value seen up to and including each row, so the result would look something like this:
date, value, min
2015-08-17, 3, 3
2015-08-18, 2, 2
2015-08-19, 4, 2
2015-08-28, 1, 1
2015-08-29, 5, 1
I've tried some options but still don't get what I'm doing wrong. Here is one example that I tried:
data['min'] = min(data['value'], data['min'].shift())
I don't want to iterate through all rows because the data I have is big. What is the best strategy you can write using pandas for this kind of problem?
Since you mentioned that you are working with a big dataset, with a focus on performance, here's one approach using NumPy's np.minimum.accumulate:
df['min'] = np.minimum.accumulate(df.value)
Sample run -
In [70]: df
Out[70]:
date value min
0 2015-08-17 3 NaN
1 2015-08-18 2 NaN
2 2015-08-19 4 NaN
3 2015-08-28 1 NaN
4 2015-08-29 5 NaN
In [71]: df['min'] = np.minimum.accumulate(df.value)
In [72]: df
Out[72]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
Runtime test -
In [65]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# #MaxU's soln using pandas cummin
In [66]: %timeit df['min'] = df.value.cummin()
100 loops, best of 3: 6.84 ms per loop
In [67]: df = pd.DataFrame(np.random.randint(0,100,(1000000)), columns=list(['value']))
# Using NumPy
In [68]: %timeit df['min'] = np.minimum.accumulate(df.value)
100 loops, best of 3: 3.97 ms per loop
Use the cummin() method:
In [53]: df['min'] = df.value.cummin()
In [54]: df
Out[54]:
date value min
0 2015-08-17 3 3
1 2015-08-18 2 2
2 2015-08-19 4 2
3 2015-08-28 1 1
4 2015-08-29 5 1
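One caveat worth noting (not covered in the answers above): the two approaches differ once the value column itself contains NaN, because pandas' cummin skips NaN by default while np.minimum propagates it. A small sketch:
import numpy as np
import pandas as pd
s = pd.Series([3.0, np.nan, 4.0, 1.0, 5.0])
print(s.cummin().tolist())
# [3.0, nan, 3.0, 1.0, 1.0]  -> NaN only at the NaN position
print(np.minimum.accumulate(s.values).tolist())
# [3.0, nan, nan, nan, nan]  -> NaN propagates from the first NaN onwards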

python - possible to apply percentile cuts to each column in a dataframe?

Is it possible to put percentile cuts on all columns of a dataframe without using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
    df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!
pd.qcut accepts a 1D array or Series as its argument. To apply pd.qcut to every column requires multiple calls to pd.qcut. So no matter how you dress it up, there will be a loop, either explicit or implicit.
You could for example, use apply to call pd.qcut for each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
0 1 2 3 4
0 4 0 3 0 3
1 0 0 2 3 0
2 3 4 1 2 3
3 4 1 1 1 4
4 3 2 2 4 1
5 2 4 3 0 1
6 2 3 0 4 4
7 1 3 4 2 2
8 0 1 4 3 0
9 1 2 0 1 2
but under the hood, df.apply is using a for-loop, so it really isn't very different from your for-loop:
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
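For example, a small sketch with non-integer column labels, where range(len(df.columns)) would no longer line up with the actual column keys:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'c'])
df_q = pd.DataFrame()
for col in df:  # iterates over the column labels 'a', 'b', 'c'
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))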

Vectorizing standard deviation calculations for pandas dataseries

I have a pandas Series, like so,
data = [1,2,3,2,4,5,6,3,5]
ds = pd.Series(data)
print (ds)
0 1
1 2
2 3
3 2
4 4
5 5
6 6
7 3
8 5
I am interested in getting the standard deviation for each index. For example, when I am at index 5, I want to calculate the standard deviation for ds[0:4].
I have done this with the following code,
df = pd.DataFrame(columns = ['data', 'avreturns', 'sd'])
df.data = data
for i in df.index:
    dataslice = df.ix[0:i]
    df['avreturns'].loc[i] = dataslice.data.mean()
    df['sd'].loc[i] = dataslice.data.std()
print (df)
data avreturns sd
0 1 1 NaN
1 2 1.5 0.7071068
2 3 2 1
3 2 2 0.8164966
4 4 2.4 1.140175
5 5 2.833333 1.47196
6 6 3.285714 1.799471
7 3 3.25 1.669046
8 5 3.444444 1.666667
This works, but I am using a loop and it is slow. Is there a way to vectorize this?
I was able to vectorize the mean calculations by using the cumsum() function:
df.data.cumsum()/(df.index+1)
Is there a way to vectorize the standard deviation calculations?
You might be interested in pd.expanding_std, which calculates the cumulative standard deviation for you:
>>> pd.expanding_std(ds)
0 NaN
1 0.707107
2 1.000000
3 0.816497
4 1.140175
5 1.471960
6 1.799471
7 1.669046
8 1.666667
dtype: float64
For what it's worth, this type of cumulative operation might be very fiddly to vectorise: the Pandas implementation appears to loop using Cython for speed.
To expand on @ajcr's answer, I ran %timeit against the two ways to do this. The speedup from using expanding_std is several orders of magnitude:
data = [x for x in range(1000)]
ds = pd.Series(data)
df = pd.DataFrame(columns = ['data', 'avreturns', 'sd'])
df.data = data
def foo(df):
    for i in df.index:
        dataslice = df.ix[0:i]
        df['avreturns'].loc[i] = dataslice.data.mean()
        df['sd'].loc[i] = dataslice.data.std()
    return df
%timeit foo(df)
1 loops, best of 3: 1min 36s per loop
%timeit pd.expanding_std(df.data)
10000 loops, best of 3: 126 µs per loop
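Note that pd.expanding_std has since been removed from pandas; in current versions the equivalent is the expanding window method (a short sketch):
import pandas as pd
ds = pd.Series([1, 2, 3, 2, 4, 5, 6, 3, 5])
# .expanding() replaces the old pd.expanding_* helpers
print(ds.expanding().mean())  # cumulative mean (the avreturns column above)
print(ds.expanding().std())   # cumulative standard deviation (the sd column above)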

Updating a DataFrame based on another DataFrame

Given DataFrame df:
Id Sex Group Time Time!
0 21 M 2 2.31 NaN
1 2 F 2 2.29 NaN
and update:
Id Sex Group Time
0 21 M 2 2.36
1 2 F 2 2.09
2 3 F 1 1.79
I want to match on Id, Sex and Group and either update Time! with the Time value (from the update df) if there is a match, or insert a new record if there is not.
Here is how I do it:
df = df.set_index(['Id', 'Sex', 'Group'])
update = update.set_index(['Id', 'Sex', 'Group'])
for i, row in update.iterrows():
    if i in df.index:  # update
        df.ix[i, 'Time!'] = row['Time']
    else:  # insert new record
        cols = update.columns.values
        row = np.array(row).reshape(1, len(row))
        _ = pd.DataFrame(row, index=[i], columns=cols)
        df = df.append(_)
print(df)
Time Time!
Id Sex Group
21 M 2 2.31 2.36
2 F 2 2.29 2.09
3 F 1 1.79 NaN
The code seems to work and my desired result matches the above. However, I have noticed it behaving incorrectly on a big data set, with the conditional
if i in df.index:
...
else:
...
obviously working wrong (it would proceed to else and vice versa where it shouldn't; I guess this MultiIndex may be the cause somehow).
So my question is, do you know any other way, or a more robust version of mine, to update one df based on another df?
I think I would do this with a merge, and then update the columns with a where. First remove the Time column from up:
In [11]: times = up.pop('Time') # up = the update DataFrame
In [12]: df1 = df.merge(up, how='outer')
In [13]: df1
Out[13]:
Id Sex Group Time Time!
0 21 M 2 2.31 NaN
1 2 F 2 2.29 NaN
2 3 F 1 NaN NaN
Fill Time! where Time is not NaN, and fill Time where it is NaN:
In [14]: df1['Time!'] = df1['Time'].where(df1['Time'].isnull(), times)
In [15]: df1['Time'] = df1['Time'].where(df1['Time'].notnull(), times)
In [16]: df1
Out[16]:
Id Sex Group Time Time!
0 21 M 2 2.31 2.36
1 2 F 2 2.29 2.09
2 3 F 1 1.79 NaN
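A more robust variant of the same idea that avoids iterrows entirely is to use index set operations (a sketch, not from the answer above; it assumes the Id/Sex/Group keys uniquely identify rows):
import numpy as np
import pandas as pd
# the question's frames, reconstructed so the sketch is self-contained
df = pd.DataFrame({'Id': [21, 2], 'Sex': ['M', 'F'], 'Group': [2, 2],
                   'Time': [2.31, 2.29], 'Time!': [np.nan, np.nan]})
update = pd.DataFrame({'Id': [21, 2, 3], 'Sex': ['M', 'F', 'F'], 'Group': [2, 2, 1],
                       'Time': [2.36, 2.09, 1.79]})
keys = ['Id', 'Sex', 'Group']
df_i = df.set_index(keys)
up_i = update.set_index(keys)
# rows present in both frames: copy the update's Time into Time!
common = df_i.index.intersection(up_i.index)
df_i.loc[common, 'Time!'] = up_i.loc[common, 'Time']
# rows only in the update: append them as new records (their Time! stays NaN)
new_rows = up_i.loc[up_i.index.difference(df_i.index)]
result = pd.concat([df_i, new_rows])
print(result)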
