How to use previous N values in pandas column to fill NaNs? - python

Say I have a time series data as below.
df
   priceA  priceB
0   25.67   30.56
1   34.12   28.43
2   37.14   29.08
3     NaN   34.23
4      32     NaN
5   18.75    41.1
6     NaN   45.12
7      23   39.67
8     NaN   36.45
9      36     NaN
Now I want to fill the NaNs in column priceA with the mean of the previous N values in that column; in this case take N=3.
For column priceB, each NaN should be filled with the value M rows above it (current index - M).
I tried to write a for loop for this, which is not good practice since my data is too large. Is there a better way to do this?
import numpy as np
import pandas as pd

N = 3
M = 2

def fillPriceA(df, indexval, n):
    temp = []
    for i in range(n):
        if indexval - (i + 1) < 0:  # don't look before the start of the frame
            continue
        temp.append(df.loc[indexval - (i + 1), 'priceA'])
    return np.nanmean(np.array(temp, dtype=float))

def fillPriceB(df, indexval, m):
    return df.loc[indexval - m, 'priceB']

for idx, row in df.iterrows():
    if idx < N:
        continue
    if pd.isnull(row['priceA']):
        df.loc[idx, 'priceA'] = fillPriceA(df, idx, N)
    if pd.isnull(row['priceB']):
        df.loc[idx, 'priceB'] = fillPriceB(df, idx, M)
Expected output:
   priceA  priceB
0   25.67   30.56
1   34.12   28.43
2   37.14   29.08
3   32.31   34.23
4      32   29.08
5   18.75    41.1
6   27.68   45.12
7      23   39.67
8   23.14   36.45
9      36   39.67

A solution could be to work only on the NaN indices (see dataframe boolean indexing):
param = dict(priceA=3, priceB=2)  # number of previous values to consider

for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # iterate over NaN indices only
        _window = df.iloc[max(0, i - param[col]):i][col]  # get the n expected elements
        df.loc[i, col] = _window.mean() if col == 'priceA' else _window.iloc[0]  # apply the right fill
print(df)
Result:
      priceA  priceB
0  25.670000   30.56
1  34.120000   28.43
2  37.140000   29.08
3  32.310000   34.23
4  32.000000   29.08
5  18.750000   41.10
6  27.686667   45.12
7  23.000000   39.67
8  23.145556   36.45
9  36.000000   39.67
Note
1. Using np.isnan() implies that your columns are numeric. If they are not, convert them first with pd.to_numeric():
...
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
...
Or use pd.isnull() instead (see example below), but be aware of the performance difference (numpy is faster):
from random import randint
# a sample with 10k elements and some np.nan
arr = np.random.rand(10000)
for i in range(100):
    arr[randint(0, 9999)] = np.nan
# performance
%timeit pd.isnull(arr)
10000 loops, best of 3: 24.8 µs per loop
%timeit np.isnan(arr)
100000 loops, best of 3: 5.6 µs per loop
2. A more generic alternative could be to define, in a dict, the method and window size to apply for each column:
import numpy as np
import pandas as pd

param = {}
param['priceA'] = {'n': 3,
                   'method': lambda x: np.nanmean(x)}
param['priceB'] = {'n': 2,
                   'method': lambda x: x[0]}
param now contains n, the number of elements, and method, a lambda expression. Rewrite your loop accordingly:
for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # iterate over NaN indices only
        _window = df.iloc[max(0, i - param[col]['n']):i][col]  # get the n expected elements
        df.loc[i, col] = param[col]['method'](_window.values)  # apply the right method
print(df)  # this leads to a similar result

You can use an NA mask to do what you need per column:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2,3,4, None, 5, 6], 'b': [1, None, 2, 3, 4, None, 7]})
df
# a b
# 0 1.0 1.0
# 1 2.0 NaN
# 2 3.0 2.0
# 3 4.0 3.0
# 4 NaN 4.0
# 5 5.0 NaN
# 6 6.0 7.0
for col in df.columns:
    s = df[col]
    na_indices = s[s.isnull()].index.tolist()
    prev = 0
    for k in na_indices:
        s[k] = np.mean(s[prev:k])  # mean of the values since the previous NaN
        prev = k
    df[col] = s
print(df)
#      a    b
# 0 1.0 1.0
# 1 2.0 1.0
# 2 3.0 2.0
# 3 4.0 3.0
# 4 2.5 4.0
# 5 5.0 2.5
# 6 6.0 7.0
While this is still a custom operation, I am pretty sure it will be slightly faster because it is not iterating over every row, just over the NA values, which I assume are sparse compared to the actual data.
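A quick sketch to check that assumption (hypothetical data; timings will vary with machine, data size, and NaN density):
import numpy as np
import pandas as pd

# Hypothetical benchmark: 100k rows with ~1% NaNs
s = pd.Series(np.random.rand(100_000))
s[s.sample(frac=0.01, random_state=0).index] = np.nan
df = pd.DataFrame({'a': s})

%timeit for idx, row in df.iterrows(): pass        # visit every row
%timeit for k in df[df['a'].isnull()].index: pass  # visit only the NaN rows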

To fill priceA, compute a rolling mean, shift it, and use the result in fillna:
# make some data
df = pd.DataFrame({'priceA': range(10)})
#make some rows missing
df.loc[[4, 6], 'priceA'] = np.nan
n = 3
df.priceA = df.priceA.fillna(df.priceA.rolling(n, min_periods=1).mean().shift(1))
The only edge case here is when two NaNs fall within n of one another, but it seems to handle this as in your question (min_periods=1 lets the rolling mean skip NaNs inside the window).
For priceB, just use shift:
df = pd.DataFrame({'priceB': range(10)})
df.loc[[4, 8], 'priceB'] = np.nan
m = 2
df.priceB = df.priceB.fillna(df.priceB.shift(m))
As before, there is an edge case where a NaN sits exactly m rows before another NaN; that second NaN stays unfilled. A workaround is sketched below.
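One workaround for that edge case (a sketch; it assumes the remaining NaNs eventually have a non-NaN value m rows above them) is to repeat the shift-based fill until it stops making progress:
while df.priceB.isna().any():
    filled = df.priceB.fillna(df.priceB.shift(m))
    if filled.isna().sum() == df.priceB.isna().sum():
        break  # no progress; remaining NaNs have no source value m rows above
    df.priceB = filled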

Nested lists to python dataframe

I have a nested numpy.ndarray of the following format (each of the sublists has the same size)
len(exp_data) # Timepoints
Out[205]: 42
len(exp_data[0])
Out[206]: 1
len(exp_data[0][0]) # Y_bins
Out[207]: 13
len(exp_data[0][0][0]) # X_bins
Out[208]: 43
type(exp_data[0][0][0][0])
Out[209]: numpy.float64
I want to move these into a pandas DataFrame such that there are 3 columns numbered from 0 to N and the last one with the float value.
I could do this with a series of loops, but that seems like a very non-efficient way of solving the problem.
In addition I would like to get rid of any nan values (not present in sample data). Do I do this after creating the df or is there a way to skip adding them in the first place?
NOTE: code below has been edited and I've added sample data
import random
import numpy as np
import pandas as pd

exp_data = [[[ [random.random() for x in range(5)],
               [random.random() for x in range(5)],
               [random.random() for x in range(5)],
             ]]] * 5
exp_data[0][0][0][1] = np.nan
df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [int(t), int(y), int(x), x_bin]
df = df.dropna().reset_index(drop=True)
The final format should be as follows (except I'd prefer integers instead of floats in the first 3 columns; not essential, but int(t) etc. doesn't do the trick):
df
Out[291]:
Timepoint Y_bin X_bin Values
0 0.0 0.0 0.0 0.095391
1 0.0 0.0 2.0 0.963608
2 0.0 0.0 3.0 0.855735
3 0.0 0.0 4.0 0.392637
4 0.0 1.0 0.0 0.555199
5 0.0 1.0 1.0 0.118981
6 0.0 1.0 2.0 0.201782
...
len(df) # has received a total of 75 (5*3*5) input values of which 5 are nan
Out[293]: 70
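For the efficiency concern, a vectorized sketch (assuming, as stated, that all sublists have the same length, so the nested list converts cleanly to a 4-D array; column names follow the question):
arr = np.asarray(exp_data)[:, 0, :, :]   # shape (Timepoints, Y_bins, X_bins)
t, y, x = np.indices(arr.shape)          # integer index grids
df = pd.DataFrame({'Timepoint': t.ravel(), 'Y_bin': y.ravel(),
                   'X_bin': x.ravel(), 'Values': arr.ravel()})
df = df.dropna().reset_index(drop=True)  # drop NaN values after construction
This also keeps the first three columns as integers, since the index grids are integer arrays.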
You can change the display format of the floats by adding this line at the end of your code:
pd.options.display.float_format = '{:,.0f}'.format
For example:
df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [t, y, x, x_bin]
df = df.dropna().reset_index(drop=True)
pd.options.display.float_format = '{:,.0f}'.format
df
Out[250]:
Timepoint Y_bin X_bin Values
0 0 4 10 -2
1 0 4 11 -1
2 0 4 12 -2
3 0 4 13 -2
4 0 4 14 -2
5 0 4 15 -2
6 0 4 16 -3
...
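Note that float_format only changes how the values are displayed; the underlying dtype is still float. To actually store integers in the first three columns, cast them after dropping the NaNs:
df[['Timepoint', 'Y_bin', 'X_bin']] = df[['Timepoint', 'Y_bin', 'X_bin']].astype(int)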

How to calculate mean values for a particular column in dataframe?

I have a dataframe whose values are of type object. The dataframe also contains NaN values. I want to ignore the NaN values and calculate the mean for all the remaining values in the column.
Mean is calculated as follows:
Upperbound value = 30
Lowerbound value = 0
(Upperbound and lowerbound are fixed and all values need to be calculated wrt to them.)
So,
for '>20', mean = (20+30)/2 = 25
for '>1', mean = (1+30)/2 = 15.5
for '<5', mean = (5+0)/2 = 2.5
for '<10', mean = (10+0)/2 = 5
Dataframe:
column1
>20
NaN
<5
12
>1
<10
NaN
8
Note: the values in the column above are strings, and I want to convert them to numeric values.
Final converted dataframe should be:
column1
25
NaN
2.5
12
15.5
5
NaN
8
Note: values like 8 and 12 are not converted with the mean rule; I only want to apply it to values prefixed with > or <. The remaining values just need to be converted from string to numeric.
There's probably a better way, but this works too:
df['num'] = df.column1.str.extract(r'(\d+)')
df['sign'] = df.column1.str.extract(r'([<>])').fillna('=')

def get_avg(row):
    if pd.isnull(row.num):
        return row.num
    elif row.sign == '>':
        return (int(row.num) + 30) / 2
    elif row.sign == '<':
        return (int(row.num) + 0) / 2
    else:
        return row.num

df['avg'] = df.apply(get_avg, axis=1)
Output:
  column1 sign  num   avg
0     >20    >   20    25
1     NaN    =  NaN   NaN
2      <5    <    5   2.5
3      12    =   12    12
4      >1    >    1  15.5
5     <10    <   10     5
6     NaN    =  NaN   NaN
7       8    =    8     8
The below code applies a custom function that checks the first character of each element and calculates the average based on that.
import numpy as np
import pandas as pd

upper = 30
lower = 0
df = pd.DataFrame({'col1': ['>20', np.nan, '<5', '12', '>1', '<10', np.nan, '8']})

def avg(val):
    if pd.isnull(val):
        return np.nan
    char = val[0]
    if char == '>':
        return (float(val[1:]) + upper) / 2
    elif char == '<':
        return (float(val[1:]) + lower) / 2
    else:
        return float(val)

print(df["col1"].apply(avg))
Output:
0 25.0
1 NaN
2 2.5
3 12.0
4 15.5
5 5.0
6 NaN
7 8.0
You can use np.select to assign the value you want to average with. And then you can average, after converting column1 to a number.
import pandas as pd
import numpy as np

lt = df[df.column1.notnull()].column1.str.contains('<')
gt = df[df.column1.notnull()].column1.str.contains('>')
conds = [lt, gt, ~(lt | gt)]
choice = [0, 30, pd.to_numeric(df[df.column1.notnull()].column1, errors='coerce')]
df.loc[df.column1.notnull(), 'column2'] = np.select(conds, choice)
df['column1'] = pd.to_numeric(df.column1.str.replace('<|>', '', regex=True))
df['Avg'] = df.mean(axis=1)
Output:
column1 column2 Avg
0 20.0 30.0 25.0
1 NaN NaN NaN
2 5.0 0.0 2.5
3 12.0 12.0 12.0
4 1.0 30.0 15.5
5 10.0 0.0 5.0
6 NaN NaN NaN
7 8.0 8.0 8.0
You could write a function to calculate your "custom average" then call apply on your column.
import numpy as np
import pandas as pd

x = np.array([['>20'], [np.nan], ['<5'], ['>1'], ['<10'], [np.nan]])
df = pd.DataFrame(x, columns=["column1"])

def myFunc(content, up, low):
    try:
        if content.isnumeric():
            return float(content)
        return {
            '>': (float(content[1:]) + up) / 2,
            '<': (float(content[1:]) + low) / 2,
        }[content[0]]
    except (AttributeError, KeyError, ValueError):
        return np.nan

df["avg"] = df.column1.apply(lambda x: myFunc(x, up=30, low=0))

How to replace infinite values with the maximum value of a pandas column?

I have a dataframe that looks like:
City Crime_Rate
A 10
B 20
C inf
D 15
I want to replace the inf with the max value of the Crime_Rate column, so that my resulting dataframe looks like:
City Crime_Rate
A 10
B 20
C 20
D 15
I tried
df['Crime_Rate'].replace([np.inf], max(df['Crime_Rate']), inplace=True)
But Python takes inf as the maximum value. Where am I going wrong here?
Filter out inf values first and then get max of Series:
m = df.loc[df['Crime_Rate'] != np.inf, 'Crime_Rate'].max()
df['Crime_Rate'].replace(np.inf,m,inplace=True)
Another solution:
mask = df['Crime_Rate'] != np.inf
df.loc[~mask, 'Crime_Rate'] = df.loc[mask, 'Crime_Rate'].max()
print (df)
City Crime_Rate
0 A 10.0
1 B 20.0
2 C 20.0
3 D 15.0
Here is a solution for a whole matrix/data frame:
highest_non_inf = df.max().loc[lambda v: v < np.inf].max()
df = df.replace(np.inf, highest_non_inf)
Set use_inf_as_na to True and then use fillna (use this if you want to treat both inf and nan as missing values):
pd.options.mode.use_inf_as_na = True
df['Crime_Rate'].fillna(df['Crime_Rate'].max(), inplace=True)
City Crime_Rate
0 A 10.0
1 B 20.0
2 C 20.0
3 D 15.0
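Note that use_inf_as_na is deprecated in recent pandas versions; an equivalent without the global option is to replace the infinities with NaN explicitly first:
df['Crime_Rate'] = df['Crime_Rate'].replace([np.inf, -np.inf], np.nan)
df['Crime_Rate'] = df['Crime_Rate'].fillna(df['Crime_Rate'].max())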
Another way is to call replace(np.inf, np.nan) inside max(): inf is replaced by nan for the computation happening inside max(), so max() returns the expected maximum value rather than inf.
In the example below, the max value is 100, which then replaces the infs:
#Create dummy data frame
import pandas as pd
import numpy as np
a = float('Inf')
v = [1,2,5,a,10,5,a,5,100,2]
df = pd.DataFrame({'Col_A': v})
#Data frame looks like this
In [33]: df
Out[33]:
Col_A
0 1.000000
1 2.000000
2 5.000000
3 inf
4 10.000000
5 5.000000
6 inf
7 5.000000
8 100.000000
9 2.000000
# Replace inf
df['Col_A'].replace([np.inf], df['Col_A'].replace(np.inf, np.nan).max(), inplace=True)
In[35]: df
Out[35]:
Col_A
0 1.0
1 2.0
2 5.0
3 100.0
4 10.0
5 5.0
6 100.0
7 5.0
8 100.0
9 2.0
Hope that works!
Use numpy clip. It's elegant and blazingly fast:
import numpy as np
import pandas as pd
df = pd.DataFrame({"x": [-np.inf, +np.inf, np.nan, 4, 3]})
df["x"] = np.clip(df["x"], -np.inf, 100)
# Out:
# x
# 0 -inf
# 1 100.0
# 2 NaN
# 3 4.0
# 4 3.0
To get rid of the negative infinity as well, replace -np.inf with a small number. NaN is always unaffected. Note that this caps values at a bound you choose (here 100) rather than at the observed column maximum; to recover the maximum afterwards, use df["x"].max(), which skips NaN. A sketch for capping at the observed finite extremes follows.
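To cap at the observed finite extremes instead of a hand-picked bound, a sketch:
finite = df.loc[np.isfinite(df["x"]), "x"]               # drop inf, -inf, and NaN
df["x"] = np.clip(df["x"], finite.min(), finite.max())   # cap both infinities; NaN is unaffected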

Fastest way to convert python iterator output to pandas dataframe

I have a generator that returns an unknown number of rows of data that I want to convert to an indexed pandas dataframe. The fastest way I know of is to write a CSV to disk and then parse it back in via read_csv. I'm aware that it is not efficient to create an empty dataframe and constantly append new rows, and I can't create a pre-sized dataframe because I do not know how many rows will be returned. Is there a way to convert the iterator output to a pandas dataframe without writing to disk?
Iteratively appending to a pandas data frame is not the best solution. It is better to build your data as a list, and then pass it to pd.DataFrame.
import random
import pandas as pd
alpha = list('abcdefghijklmnopqrstuvwxyz')
Here we create a generator, use it to construct a list, then pass it to the dataframe constructor:
%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
my_data = [x for x in gen]
df = pd.DataFrame(my_data, columns=['letter','value'])
# result: 1 loop, best of 3: 373 ms per loop
This is quite a bit faster than creating a generator, constructing an empty dataframe, and appending rows, as seen here:
%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
df = pd.DataFrame(columns=['letter','value'])
for tup in gen:
    df.loc[df.shape[0], :] = tup
# result: 1 loop, best of 3: 13.6 s per loop
This is incredibly slow at 13 seconds to construct 10000 rows.
Would something general like this do the trick?
def make_equal_length_cols(df, new_iter, col_name):
    # convert the generator to a list so we can measure its length
    new_iter = list(new_iter)
    if len(new_iter) < df.shape[0]:
        # if the generator yields fewer elements than the dataframe has rows,
        # pad the new column with NaN until the lengths are equal
        new_iter += [np.nan] * (df.shape[0] - len(new_iter))
    else:
        # otherwise add NaN rows to the dataframe, one per extra element in new_iter
        new_rows = [{c: np.nan for c in df.columns} for _ in range(len(new_iter) - df.shape[0])]
        new_rows_df = pd.DataFrame(new_rows)
        df = pd.concat([df, new_rows_df]).reset_index(drop=True)
    df[col_name] = new_iter
    return df
Test it out:
make_equal_length_cols(df, (x for x in range(20)), 'new')
Out[22]:
A B new
0 0.0 0.0 0
1 1.0 1.0 1
2 2.0 2.0 2
3 3.0 3.0 3
4 4.0 4.0 4
5 5.0 5.0 5
6 6.0 6.0 6
7 7.0 7.0 7
8 8.0 8.0 8
9 9.0 9.0 9
10 NaN NaN 10
11 NaN NaN 11
12 NaN NaN 12
13 NaN NaN 13
14 NaN NaN 14
15 NaN NaN 15
16 NaN NaN 16
17 NaN NaN 17
18 NaN NaN 18
19 NaN NaN 19
And it also works when the passed generator is shorter than the dataframe:
make_equal_length_cols(df, (x for x in range(5)), 'new')
Out[26]:
A B new
0 0 0 0.0
1 1 1 1.0
2 2 2 2.0
3 3 3 3.0
4 4 4 4.0
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 8 8 NaN
9 9 9 NaN
Edit: removed row-by-row pandas.DataFrame.append call, and constructed separate dataframe to append in one shot. Timings:
New append:
%timeit make_equal_length_cols(df, (x for x in range(10000)), 'new')
10 loops, best of 3: 40.1 ms per loop
Old append:
very slow...
Pandas DataFrame accepts an iterator as the data source in the constructor. You can dynamically generate rows and feed them to the data frame as you read and transform the source data.
This is most easily done by writing a generator function that uses yield to produce the rows.
After the data frame has been generated, you can use set_index to choose any column as the index.
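A minimal sketch of the idea (hypothetical data and column names):
import pandas as pd

def gen_rows():
    # yield one tuple per row; any lazy source works here
    for i in range(5):
        yield (i, i * i)

df = pd.DataFrame(gen_rows(), columns=["key", "value"]).set_index("key")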
Here is a fuller example:
def create_timeline(self) -> pd.DataFrame:
    """Create a timeline feed of how we traded over a course of time.

    Note: We assume each position has only one enter and exit event, not position increases over the lifetime.

    :return: DataFrame with timestamp and timeline_event columns
    """
    # https://stackoverflow.com/questions/42999332/fastest-way-to-convert-python-iterator-output-to-pandas-dataframe
    def gen_events():
        """Generate data for the dataframe.

        Use Python generators to dynamically fill the Pandas dataframe.
        Each row gets timestamp and timeline_event columns.
        """
        for pair_id, history in self.asset_histories.items():
            for position in history.positions:
                open_event = TimelineEvent(
                    pair_id=pair_id,
                    position=position,
                    type=TimelineEventType.open,
                )
                yield (position.opened_at, open_event)
                # If the position is closed, generate a second event
                if position.is_closed():
                    close_event = TimelineEvent(
                        pair_id=pair_id,
                        position=position,
                        type=TimelineEventType.close,
                    )
                    yield (position.closed_at, close_event)

    df = pd.DataFrame(gen_events(), columns=["timestamp", "timeline_event"])
    df = df.set_index(["timestamp"])
    return df
The full open source example can be found here.

keep only lowest value per row in a Python Pandas dataset

In a Pandas dataset I want to keep only the lowest value per row. All other values should be deleted.
I need the original dataset otherwise intact: just replace all values which are not the row minimum with NaN.
What is the best way to do this, speed/performance-wise?
I can also transpose the dataset if the operation is easier per column.
Thanks
Robert
Since the operation you are contemplating does not rely on the columns or index, it might be easier (and faster) to do this using NumPy rather than Pandas.
You can find the location (i.e. column index) of the minimums for each row using
idx = np.argmin(arr, axis=1)
You could then make a new array filled with NaNs and copy the minimum values into the new array.
import numpy as np
import pandas as pd

def nan_all_but_min(df):
    arr = df.values
    idx = np.argmin(arr, axis=1)
    newarr = np.full_like(arr, np.nan, dtype='float')
    newarr[np.arange(arr.shape[0]), idx] = arr[np.arange(arr.shape[0]), idx]
    df = pd.DataFrame(newarr, columns=df.columns, index=df.index)
    return df
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
df = nan_all_but_min(df)
print(df)
yields
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
Here is a benchmark comparing nan_all_but_min vs using_where:
def using_where(df):
    return df.where(df.values == df.min(axis=1)[:, None])
In [73]: df = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [74]: %timeit using_where(df)
1000 loops, best of 3: 701 µs per loop
In [75]: %timeit nan_all_but_min(df)
10000 loops, best of 3: 105 µs per loop
Note that using_where and nan_all_but_min behave differently if a row contains the same min value more than once. using_where will preserve all the mins, nan_all_but_min will preserve only one min. For example:
In [76]: using_where(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[76]:
0 1 2
0 0 0 NaN
1 1 NaN 1
In [77]: nan_all_but_min(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[77]:
0 1 2
0 0 NaN NaN
1 1 NaN NaN
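If the rows may already contain NaNs, a nan-safe variant of nan_all_but_min is sketched below; np.nanargmin ignores NaNs when locating the minimum (but raises on all-NaN rows):
def nan_all_but_min_nansafe(df):
    arr = df.to_numpy(dtype='float')
    idx = np.nanargmin(arr, axis=1)  # ignores NaNs; raises on all-NaN rows
    newarr = np.full_like(arr, np.nan)
    rows = np.arange(arr.shape[0])
    newarr[rows, idx] = arr[rows, idx]
    return pd.DataFrame(newarr, columns=df.columns, index=df.index)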
Piggybacking off @unutbu's excellent answer, the following minor change should accommodate your modified question.
The where method
In [26]: df2 = df.copy()
In [27]: df2
Out[27]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [28]: df2.where(df2.values == df2.min(axis=1)[:,None])
Out[28]:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN NaN
3 9 NaN NaN
Mandatory speed test.
In [29]: df3 = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [30]: %timeit df3.where(df3.values == df3.min(axis=1)[:,None])
1000 loops, best of 3: 723 µs per loop
If your data frame already contains NaN values, you must use numpy's nanmin, with the same row-wise broadcast as above:
df2.where(df2.values == np.nanmin(df2.values, axis=1)[:, None])
I just found and tried out the answer by unutbu.
I tried the .where method, but it raised a warning:
FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
The warning comes from calling [:, None] on the pandas Series returned by df.min(axis=1); converting to a numpy array first, as in df.min(axis=1).to_numpy()[:, None], avoids it. However, I got the following working instead, although as an apply with a lambda it is most likely slower:
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
mask = df.apply(lambda d:(d == df.min(axis=1)))
print(df[mask])
Should yield:
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
