Fastest way to convert python iterator output to pandas dataframe


I have a generator that returns an unknown number of rows of data that I want to convert to an indexed pandas dataframe. The fastest way I know of is to write a CSV to disk then parse back in via 'read_csv'. I'm aware that it is not efficient to create an empty dataframe then constantly append new rows. I can't create a pre-sized dataframe because I do not know how many rows will be returned. Is there a way to convert the iterator output to a pandas dataframe without writing to disk?

Iteratively appending to a pandas data frame is not the best solution. It is better to build your data as a list, and then pass it to pd.DataFrame.
import random
import pandas as pd
alpha = list('abcdefghijklmnopqrstuvwxyz')
Here we create a generator, use it to construct a list, then pass it to the dataframe constructor:
%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
my_data = [x for x in gen]
df = pd.DataFrame(my_data, columns=['letter','value'])
# result: 1 loop, best of 3: 373 ms per loop
This is quite a bit faster than creating a generator, constructing an empty dataframe, and appending rows one at a time, as seen here:
%%timeit
gen = ((random.choice(alpha), random.randint(0,100)) for x in range(10000))
df = pd.DataFrame(columns=['letter','value'])
for tup in gen:
    df.loc[df.shape[0],:] = tup
# result: 1 loop, best of 3: 13.6 s per loop
This is incredibly slow: 13.6 seconds to construct 10000 rows, roughly 36 times slower than building the list first.

Would something general like this do the trick?
import numpy as np
def make_equal_length_cols(df, new_iter, col_name):
    # convert the generator to a list so we can measure and pad it
    new_iter = list(new_iter)
    # if the passed generator (as a list) has fewer elements than the dataframe, pad it with NaN until the lengths are equal
    if len(new_iter) < df.shape[0]:
        new_iter += [np.nan]*(df.shape[0]-len(new_iter))
    else:
        # otherwise, the dataframe gets n new all-NaN rows, where n is the difference between the number of elements in new_iter and the length of the dataframe
        new_rows = [{c: np.nan for c in df.columns} for _ in range((len(new_iter)-df.shape[0]))]
        new_rows_df = pd.DataFrame(new_rows)
        # DataFrame.append was removed in pandas 2.0; pd.concat performs the same one-shot append
        df = pd.concat([df, new_rows_df]).reset_index(drop=True)
    df[col_name] = new_iter
    return df
Test it out:
make_equal_length_cols(df, (x for x in range(20)), 'new')
Out[22]:
A B new
0 0.0 0.0 0
1 1.0 1.0 1
2 2.0 2.0 2
3 3.0 3.0 3
4 4.0 4.0 4
5 5.0 5.0 5
6 6.0 6.0 6
7 7.0 7.0 7
8 8.0 8.0 8
9 9.0 9.0 9
10 NaN NaN 10
11 NaN NaN 11
12 NaN NaN 12
13 NaN NaN 13
14 NaN NaN 14
15 NaN NaN 15
16 NaN NaN 16
17 NaN NaN 17
18 NaN NaN 18
19 NaN NaN 19
And it also works when the passed generator is shorter than the dataframe:
make_equal_length_cols(df, (x for x in range(5)), 'new')
Out[26]:
A B new
0 0 0 0.0
1 1 1 1.0
2 2 2 2.0
3 3 3 3.0
4 4 4 4.0
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 8 8 NaN
9 9 9 NaN
Edit: removed row-by-row pandas.DataFrame.append call, and constructed separate dataframe to append in one shot. Timings:
New append:
%timeit make_equal_length_cols(df, (x for x in range(10000)), 'new')
10 loops, best of 3: 40.1 ms per loop
Old append:
very slow...
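A shorter route to the same padding behaviour is to let reindex add the missing rows. The following is only a sketch and not the code timed above: pad_and_add_column and the sample frame are hypothetical names, and it assumes the frame's existing index can be replaced by a default 0..n-1 range.

```python
import pandas as pd

def pad_and_add_column(df, new_iter, col_name):
    """Hypothetical helper: add new_iter as a column, padding the shorter side with NaN."""
    values = pd.Series(list(new_iter))
    n_rows = max(len(df), len(values))
    # Reindexing on a fresh 0..n_rows-1 range appends all-NaN rows when the frame is shorter
    out = df.reset_index(drop=True).reindex(range(n_rows))
    # The new column is aligned on the same range, so missing positions become NaN
    out[col_name] = values.reindex(range(n_rows))
    return out

df = pd.DataFrame({"A": range(10), "B": range(10)})
longer = pad_and_add_column(df, (x for x in range(20)), "new")
shorter = pad_and_add_column(df, (x for x in range(5)), "new")
```

Because both the frame and the new values are reindexed to the same range, no row-by-row appending is needed in either direction.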

The pandas DataFrame constructor accepts an iterator as its data source. You can generate rows dynamically and feed them to a data frame as you read and transform the source data.
This is most easily done by writing a generator function that uses yield to produce the rows.
After the data frame has been built, you can use set_index to choose any column as the index.
Here is an example:
def create_timeline(self) -> pd.DataFrame:
    """Create a timeline feed of how we traded over the course of time.
    Note: We assume each position has only one enter and exit event, not position increases over the lifetime.
    :return: DataFrame with timestamp and timeline_event columns
    """
    # https://stackoverflow.com/questions/42999332/fastest-way-to-convert-python-iterator-output-to-pandas-dataframe
    def gen_events():
        """Generate data for the dataframe.
        Use Python generators to dynamically fill Pandas dataframe.
        Each dataframe gets timestamp, timeline_event columns.
        """
        for pair_id, history in self.asset_histories.items():
            for position in history.positions:
                open_event = TimelineEvent(
                    pair_id=pair_id,
                    position=position,
                    type=TimelineEventType.open,
                )
                yield (position.opened_at, open_event)
                # If the position is closed, generate a second event
                if position.is_closed():
                    close_event = TimelineEvent(
                        pair_id=pair_id,
                        position=position,
                        type=TimelineEventType.close,
                    )
                    yield (position.closed_at, close_event)
    df = pd.DataFrame(gen_events(), columns=["timestamp", "timeline_event"])
    df = df.set_index(["timestamp"])
    return df
The full open source example can be found here.
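The method above depends on classes from its own project, so here is a smaller self-contained sketch of the same idea. The generator gen_rows, its sample values and the column names are made up for illustration.

```python
import pandas as pd

def gen_rows():
    """Hypothetical generator: yields (timestamp, value) tuples one at a time."""
    for day in range(1, 4):
        yield (pd.Timestamp(2021, 1, day), day * 10)

# The constructor consumes the generator; no intermediate file or pre-sized frame is needed
df = pd.DataFrame(gen_rows(), columns=["timestamp", "value"])
# Choose the index column once the frame exists
df = df.set_index("timestamp")
```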

Related

How to overwrite multiple rows from one row (iloc/loc difference)?

I have a dataframe and would like to assign multiple values from one row to multiple other rows.
It works with .iloc, but for some reason when I use conditions with .loc it only returns NaN.
df = pd.DataFrame(dict(A = [1,2,0,0],B=[0,0,0,10],C=[3,4,5,6]))
df.index = ['a','b','c','d']
When I use loc with conditions or with direct index names:
df.loc[df['A']>0, ['B','C']] = df.loc['d',['B','C']]
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']]
it will return
A B C
a 1.0 NaN NaN
b 2.0 NaN NaN
c 0.0 0.0 5.0
d 0.0 10.0 6.0
but when I use .iloc it actually works as expected
df.iloc[0:2,1:3] = df.iloc[3,1:3]
A B C
a 1 10 6
b 2 10 6
c 0 0 5
d 0 10 6
is there a way to do this with .loc or do I need to rewrite my code to get the row numbers from my mask?

When you use labels, pandas performs index alignment. In your case the two sides share no common index labels, hence the NaNs; position-based indexing does not align.
You can assign a numpy array to prevent index alignment:
df.loc[['a','b'], ['B','C']] = df.loc['d',['B','C']].values
output:
A B C
a 1 10 6
b 2 10 6
c 0 0 5
d 0 10 6
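The same fix also works with the boolean mask from the question. A minimal sketch, using to_numpy() as an equivalent of .values and the hypothetical names mask and source_row:

```python
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 0, 0], B=[0, 0, 0, 10], C=[3, 4, 5, 6]),
                  index=["a", "b", "c", "d"])

mask = df["A"] > 0
# Converting the source row to a plain array drops its labels,
# so the assignment is positional and no index alignment takes place
source_row = df.loc["d", ["B", "C"]].to_numpy()
df.loc[mask, ["B", "C"]] = source_row
```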

Creating an empty dataframe or List with column names then add data by column names

I am trying to learn Python 2.7 by converting code I wrote in VB to Python. I have column names and I am trying to create an empty dataframe or list and then add rows by iterating (see below). I do not know the total number of rows I will need to add in advance. I can create a dataframe with the column names but can't figure out how to add the data. I have looked at several questions like mine but the row/columns of data are unknown in advance.
snippet of code:
cnames=['Security','Time','Vol_21D','Vol2_21D','MaxAPV_21D','MinAPV_21D' ]
df_Calcs = pd.DataFrame(index=range(10), columns=cnames)
this creates the empty df (df_Calcs)...then the code below is where I get the data to fill the rows...I use n as a counter for the new row # to insert (there are 20 other columns that I add to the row), but the below should explain what I am trying to do.
i = 0
n = 0
while True:
    df_Calcs.Security[n] = i + 1
    df_Calcs.Time[n] = '09:30:00'
    df_Calcs.Vol_21D[n] = i + 2
    df_Calcs.Vol2_21D[n] = i + 3
    df_Calcs.MaxAPV_21D[n] = i + 4
    df_Calcs.MinAPV_21D[n] = i + 5
    i = i +1
    n = n +1
    if i > 4:
        break
print df_Calcs
If I should use a list or array instead please let me know, I am trying to do this in the fastest most efficient way. This data will then be sent to a MySQL db table.
Result...
Security Time Vol_21D Vol2_21D MaxAPV_21D MinAPV_21D
0 1 09:30:00 2 3 4 5
1 2 09:30:00 3 4 5 6
2 3 09:30:00 4 5 6 7
3 4 09:30:00 5 6 7 8
4 5 09:30:00 6 7 8 9
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN

You have many ways to do that.
Create empty dataframe:
cnames=['Security', 'Time', 'Vol_21D', 'Vol2_21D', 'MaxAPV_21D', 'MinAPV_21D']
df = pd.DataFrame(columns=cnames)
Output:
Empty DataFrame
Columns: [Security, Time, Vol_21D, Vol2_21D, MaxAPV_21D, MinAPV_21D]
Index: []
Then, in a loop, you can create a pd.Series and append it to your dataframe (DataFrame.append was removed in pandas 2.0, so on current versions use pd.concat instead). Example:
df.append(pd.Series([1, 2, 3, 4, 5, 6], cnames), ignore_index=True)
Or you can append a dict:
df.append({'Security': 1,
'Time': 2,
'Vol_21D': 3,
'Vol2_21D': 4,
'MaxAPV_21D': 5,
'MinAPV_21D': 6
}, ignore_index=True)
It will be the same output:
Security Time Vol_21D Vol2_21D MaxAPV_21D MinAPV_21D
0 1 2 3 4 5 6
But I think a faster and more Pythonic way is to first create a list, append all rows to it, and then make the data frame from the list.
data = []
for i in range(0,5):
    data.append([1,2,3,4,i,6])
df = pd.DataFrame(data, columns=cnames)
I hope it helps.
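Applied to the loop from the question, the list-first approach could look like the sketch below. It collects one dict per row, keyed by column name, so the number of rows need not be known in advance. The name rows is an assumption; the values mirror the question's sample loop.

```python
import pandas as pd

cnames = ['Security', 'Time', 'Vol_21D', 'Vol2_21D', 'MaxAPV_21D', 'MinAPV_21D']

rows = []
for i in range(5):
    # One dict per row; keys are the column names
    rows.append({
        'Security': i + 1,
        'Time': '09:30:00',
        'Vol_21D': i + 2,
        'Vol2_21D': i + 3,
        'MaxAPV_21D': i + 4,
        'MinAPV_21D': i + 5,
    })

# Build the frame once, after all rows are known
df_Calcs = pd.DataFrame(rows, columns=cnames)
```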

How to efficiently add multiple columns to pandas dataframe with values that depend on other columns

What I have:
A dataframe with many rows, and several existing columns (python, pandas).
Python 3.6, so a solution that relies on that particular version is fine with me (but obviously solutions that also work for earlier versions are fine too)
What I want to do:
Add multiple additional columns to the dataframe, where the values in these new columns all depend on some way on values in existing columns in the same row.
The original order of the dataframe must be preserved. If a solution changes the ordering, I could restore it afterwards by manually sorting based on one of the existing columns, but obviously this introduces extra overhead.
I already have the following code, which does work correctly. However, profiling has indicated that this code is one of the important bottlenecks in my code, so I'd like to optimize it if possible, and I also have reason to believe that should be possible:
df["NewColumn1"] = df.apply(lambda row: compute_new_column1_value(row), axis=1)
df["NewColumn2"] = df.apply(lambda row: compute_new_column2_value(row), axis=1)
# a few more lines of code like the above
I based this solution on answers to questions like this one (which is a question similar to mine, but specifically about adding one new column, whereas my question is about adding many new columns). I suppose that each of these df.apply() calls is internally implemented with a loop through all the rows, and I suspect it should be possible to optimize this with a solution that only loops through all the rows once (as opposed to once per column I want to add).
In other answers, I have seen references to the assign() function, which does indeed support adding multiple columns at once. I tried using this in the following way:
# WARNING: this does NOT work
df = df.assign(
NewColumn1=lambda row: compute_new_column1_value(row),
NewColumn2=lambda row: compute_new_column2_value(row),
# more lines like the two above
)
The reason this doesn't work is that the lambdas don't receive rows of the dataframe as arguments at all; they simply seem to get the entire dataframe at once. Each lambda is then expected to return a complete column/Series/array of values at once. So, my problem here is that I'd have to end up implementing manual loops through all the rows myself inside those lambdas, which is obviously going to be even worse for performance.
I can think of two solutions conceptually, but have been unable to find how to actually implement them so far:
Something like df.assign() (which supports adding multiple columns at once), but with the ability to pass rows into the lambda's instead of the complete dataframe
A way to vectorize my compute_new_columnX_value() functions, so that they can be used as lambda's in the way that df.assign() expects them to be used.
My problem with the second solution so far is that the row-based versions of some of my functions look as follows, and I have difficulty finding how to properly vectorize them:
def compute_new_column1_value(row):
    if row["SomeExistingColumn"] in some_dictionary:
        return some_dictionary[row["SomeExistingColumn"]]
    else:
        return some_default_value

Have you tried initializing the columns as nan, iterating through the dataframe by row, and assigning the values with loc?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 20, (10, 5)))
df[5] = np.nan
df[6] = np.nan
for i, row in df.iterrows():
    df.loc[i, 5] = row[1] + row[4]
    df.loc[i, 6] = row[3] * 2
print(df)
yields
0 1 2 3 4
0 17 4 3 11 10
1 16 1 14 11 16
2 4 18 12 19 7
3 11 3 7 10 5
4 11 0 10 1 17
5 5 17 10 3 8
6 0 0 7 3 6
7 7 18 18 13 8
8 16 4 12 11 16
9 13 9 15 8 19
0 1 2 3 4 5 6
0 17 4 3 11 10 14.0 22.0
1 16 1 14 11 16 17.0 22.0
2 4 18 12 19 7 25.0 38.0
3 11 3 7 10 5 8.0 20.0
4 11 0 10 1 17 17.0 2.0
5 5 17 10 3 8 25.0 6.0
6 0 0 7 3 6 6.0 6.0
7 7 18 18 13 8 26.0 26.0
8 16 4 12 11 16 20.0 22.0
9 13 9 15 8 19 28.0 16.0

If you only have 50 conditions to check for it is probably better to iterate through the conditions and fill the cells in blocks rather than going through the whole frame row by row. By the way .assign() doesn't just accept lambda functions and the code can also be made a lot more readable than in my previous suggestion. Below is a modified version that also fills the extra columns in place. If this data frame had 10,000,000 rows and I only wanted to apply different operations to 10 groups of number ranges in column A this would be a very neat way of filling the extra columns.
import pandas as pd
import numpy as np
# Create data frame
rnd = np.random.randint(1, 10, 10)
rnd2 = np.random.randint(100, 1000, 10)
df = pd.DataFrame(
{'A': rnd, 'B': rnd2, 'C': np.nan, 'D': np.nan, 'E': np.nan })
# Define different ways of filling the extra cells
def f1():
    return df['A'].mul(df['B'])
def f2():
    return np.log10(df['A'])
def f3():
    return df['B'] - df['A']
def f4():
    return df['A'].div(df['B'])
def f5():
    return np.sqrt(df['B'])
def f6():
    return df['A'] + df['B']
# First assign() dependent on a boolean mask (same mask on both sides)
df[df['A'] < 50] = df[df['A'] < 50].assign(C = f1(), D = f2(), E = f3())
# Second assign() dependent on a boolean mask
df[df['A'] >= 50] = df[df['A'] >= 50].assign(C = f4(), D = f5(), E = f6())
print(df)
A B C D E
0 4.0 845.0 3380.0 0.602060 841
1 3.0 967.0 2901.0 0.477121 964
2 3.0 468.0 1404.0 0.477121 465
3 2.0 548.0 1096.0 0.301030 546
4 3.0 393.0 1179.0 0.477121 390
5 7.0 741.0 5187.0 0.845098 734
6 1.0 269.0 269.0 0.000000 268
7 4.0 731.0 2924.0 0.602060 727
8 4.0 193.0 772.0 0.602060 189
9 3.0 306.0 918.0 0.477121 303

Rather than trying to bring the row labels into .assign(), you can apply a boolean mask to your data frame before chaining .assign() to it. The example below can easily be extended to multiple boolean conditions and multiple lambdas with or without additional for loops or if statements.
import numpy as np
import pandas as pd
# Create data frame
idx = np.arange(0, 10)
rnd = pd.Series(np.random.randint(10, 20, 10))
alpha_idx = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame({'idx': idx, 'A': rnd, 'B': 100})
df.index = alpha_idx
# First assign() dependent on a boolean mask
df_tmp = df[df['A'] < 15].assign(AmulB = lambda x: (x.A.mul(x.B)),
                                 A_B = lambda x: x.B - x.A)
# Second assign() dependent on a boolean mask
df_tmp2 = df[df['A'] >= 15].assign(AmulB = lambda x: (x.A.div(x.B)),
                                   A_B = lambda x: x.B + x.A)
# Create a new df with different lambdas combined
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df_lambdas = pd.concat([df_tmp, df_tmp2])
# Sort values
df_lambdas.sort_values('idx', axis=0, inplace=True)
print(df_lambdas)
A B idx
a 19 100 0
b 17 100 1
c 16 100 2
d 13 100 3
e 15 100 4
f 10 100 5
g 16 100 6
h 15 100 7
i 13 100 8
j 10 100 9
A B idx A_B AmulB
a 19 100 0 119 0.19
b 17 100 1 117 0.17
c 16 100 2 116 0.16
d 13 100 3 87 1300.00
e 15 100 4 115 0.15
f 10 100 5 90 1000.00
g 16 100 6 116 0.16
h 15 100 7 115 0.15
i 13 100 8 87 1300.00
j 10 100 9 90 1000.00

The answers provided so far do not provide a speedup for my specific case, for reasons I provided in the comments. The best solution I've been able to find so far is primarily based on this answer to another question. It didn't provide me a large speedup (about 10%), but it's the best I've been able to do so far. I'd still be very much interested in faster solutions if they exist!
It turns out that, like the assign function, apply can in fact also be provided with lambda's that return a series of values for multiple columns at once, instead of only lambda's that return a single scalar. So, the fastest implementation I have so far looks as follows:
# first initialize all the new columns with standard values for entire df at once
# this turns out to be very important. Skipping this comes at a high computational cost
for new_column in ["NewColumn1", "NewColumn2", "etc."]:
    df[new_column] = np.nan
df = df.apply(compute_all_new_columns, axis=1)
And then, instead of having all those separate lambda's for all the different new columns, they're all implemented in the same function like this:
def compute_all_new_columns(row):
    if row["SomeExistingColumn"] in some_dictionary:
        row["NewColumn1"] = some_dictionary[row["SomeExistingColumn"]]
    else:
        row["NewColumn1"] = some_default_value
    if some_other_condition:
        row["NewColumn2"] = whatever
    else:
        row["NewColumn2"] = row["SomeExistingColumn"] * whatever
    # assign values to other new columns here
    # apply(axis=1) builds the new frame from the returned rows, so the row must be returned
    return row
The resulting dataframe contains all the columns it previously did, plus values for all the new columns as inserted on a row-by-row basis by the compute_all_new_columns function. The original ordering is preserved. This solution contains no explicit Python loops in my own code, only a single pass over the rows "behind the scenes" inside the pandas apply function.
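For the dictionary-lookup case specifically, a vectorized version that fits what assign() expects can be sketched as follows. This is not the approach timed above. The dictionary contents, the default value and the rule for NewColumn2 are hypothetical stand-ins, and fillna assumes the dictionary itself never maps a key to NaN.

```python
import numpy as np
import pandas as pd

some_dictionary = {1: "one", 2: "two"}   # hypothetical lookup table
some_default_value = "other"             # hypothetical default

df = pd.DataFrame({"SomeExistingColumn": [1, 2, 3, 1]})

df = df.assign(
    # Whole-column dictionary lookup; keys missing from the dict become NaN, then the default
    NewColumn1=lambda d: d["SomeExistingColumn"].map(some_dictionary).fillna(some_default_value),
    # Element-wise if/else evaluated on the whole column at once
    NewColumn2=lambda d: np.where(d["SomeExistingColumn"] > 1, d["SomeExistingColumn"] * 10, 0),
)
```

Each lambda receives the full dataframe and returns a full column, so the row order is untouched and there is no per-row Python call.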

I am really taken by this question so here is another example involving external dictionaries:
import pandas as pd
import numpy as np
# Create data frame and external dictionaries
rnd = pd.Series(np.random.randint(10, 100, 10))
names = 'Rafael Roger Grigor Alexander Dominic Marin David Jack Stan Pablo'
name = names.split(' ')
surnames = 'Nadal Federer Dimitrov Zverev Thiem Cilic Goffin Sock Wawrinka Busta'
surname = surnames.split()
countries_str = ('Spain Switzerland Bulgaria Germany Austria Croatia Belgium USA Switzerland Spain')
country = countries_str.split(' ')
player = dict(zip(name, surname))
player_country = dict(zip(name, country))
df = pd.DataFrame(
{'A': rnd, 'B': 100, 'Name': name, 'Points': np.nan, 'Surname': np.nan, 'Country': np.nan})
df = df[['A', 'B', 'Name', 'Surname', 'Country', 'Points']]
df.loc[9, 'Name'] = 'Dennis'
print(df)
# Functions to fill the empty columns
def f1():
    return df['A'].mul(df['B'])
def f2():
    return np.random.randint(1, 10)
def f3():
    return player[key]
def f4():
    return player_country[key]
def f5():
    return 'Unknown'
def f6():
    return 0
# .assign() dependent on a boolean mask
for key, value in player.items():
    df[df['Name'] == key] = df[df['Name'] == key].assign(
        Surname = f3(), Country = f4(), Points = f1())
df[df['Name']=='Dennis'] = df[df['Name'] == 'Dennis'].assign(
    Surname = f5(), Country = f5(), Points = f6())
df = df.sort_values('Points', ascending=False)
print(df)
A B Name Surname Country Points
1 97.0 100.0 Roger Federer Switzerland 9700.0
4 93.0 100.0 Dominic Thiem Austria 9300.0
8 92.0 100.0 Stan Wawrinka Switzerland 9200.0
5 86.0 100.0 Marin Cilic Croatia 8600.0
6 67.0 100.0 David Goffin Belgium 6700.0
7 61.0 100.0 Jack Sock USA 6100.0
0 35.0 100.0 Rafael Nadal Spain 3500.0
2 34.0 100.0 Grigor Dimitrov Bulgaria 3400.0
3 25.0 100.0 Alexander Zverev Germany 2500.0
9 48.0 100.0 Dennis Unknown Unknown 0.0

How to use previous N values in pandas column to fill NaNs?

Say I have a time series data as below.
df
priceA priceB
0 25.67 30.56
1 34.12 28.43
2 37.14 29.08
3 Nan 34.23
4 32 Nan
5 18.75 41.1
6 Nan 45.12
7 23 39.67
8 Nan 36.45
9 36 Nan
Now I want to fill the NaNs in column priceA with the mean of the previous N values in that column. In this case N=3.
For column priceB I have to fill each NaN with the value M rows above (current index - M).
I tried writing a for loop for this, but that is not good practice because my data is too large. Is there a better way to do this?
N=3
M=2
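def fillPriceA(df, indexval, n):
    temp = []
    for i in range(n):
        if i < 0:
            continue
        temp.append(df.loc[indexval-(i+1), 'priceA'])
    # np.float was removed in NumPy 1.24; the builtin float is equivalent
    return np.nanmean(np.array(temp, dtype=float))
def fillPriceB(df, indexval, m):
    return df.loc[indexval-m, 'priceB']
for idx, rows in df.iterrows():
    if idx < N:
        continue
    else:
        # NaN never compares equal to None, so test with pd.isnull
        if pd.isnull(rows['priceA']):
            # write to df itself; the row yielded by iterrows is a copy
            df.loc[idx, 'priceA'] = fillPriceA(df, idx, N)
        if pd.isnull(rows['priceB']):
            df.loc[idx, 'priceB'] = fillPriceB(df, idx, M)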
Expected output:
priceA priceB
0 25.67 30.56
1 34.12 28.43
2 37.14 29.08
3 32.31 34.23
4 32 29.08
5 18.75 41.1
6 27.68 45.12
7 23 39.67
8 23.14 36.45
9 36 39.67

A solution could be to only work with the nan index (see dataframe boolean indexing):
param = dict(priceA = 3, priceB = 2) #Number of previous values to consider
for col in df.columns:
    for i in df[np.isnan(df[col])].index: #Iterate over nan index
        _window = df.iloc[max(0,(i-param[col])):i][col] #get the previous n elements
        # a single .loc call writes to df; chained df.loc[i][col] = ... may write to a copy
        df.loc[i, col] = _window.mean() if col == 'priceA' else _window.iloc[0] #Replace with right method
print(df)
Result:
priceA priceB
0 25.670000 30.56
1 34.120000 28.43
2 37.140000 29.08
3 32.310000 34.23
4 32.000000 29.08
5 18.750000 41.10
6 27.686667 45.12
7 23.000000 39.67
8 23.145556 36.45
9 36.000000 39.67
Note
1. Using np.isnan() implies that your columns are numeric. If not convert your columns before with pd.to_numeric():
...
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors = 'coerce')
...
Or use pd.isnull() instead (see example below). Be aware of the performance difference (np.isnan is faster):
from random import randint
#A sample with 10k elements and some np.nan
arr = np.random.rand(10000)
for i in range(100):
    arr[randint(0,9999)] = np.nan
#Performances
%timeit pd.isnull(arr)
10000 loops, best of 3: 24.8 µs per loop
%timeit np.isnan(arr)
100000 loops, best of 3: 5.6 µs per loop
2. A more generic alternative could be to define methods and window size to apply for each column in a dict:
import pandas as pd
param = {}
param['priceA'] = {'n':3,
                   'method':lambda x: np.mean(x)}
param['priceB'] = {'n':2,
                   'method':lambda x: x[0]}
param now contains, for each column, n (the number of elements) and method (a lambda expression). Rewrite your loops accordingly:
for col in df.columns:
    for i in df[np.isnan(df[col])].index: #Iterate over nan index
        _window = df.iloc[max(0,(i-param[col]['n'])):i][col] #get the previous n elements
        df.loc[i, col] = param[col]['method'](_window.values) #Replace with right method
print(df)#This leads to a similar result.

You can use an NA mask to do what you need per column:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2,3,4, None, 5, 6], 'b': [1, None, 2, 3, 4, None, 7]})
df
# a b
# 0 1.0 1.0
# 1 2.0 NaN
# 2 3.0 2.0
# 3 4.0 3.0
# 4 NaN 4.0
# 5 5.0 NaN
# 6 6.0 7.0
for col in df.columns:
    s = df[col]
    na_indices = s[s.isnull()].index.tolist()
    prev = 0
    for k in na_indices:
        s[k] = np.mean(s[prev:k])
        prev = k
    df[col] = s
print(df)
# a b
# 0 1.0 1.0
# 1 2.0 1.0
# 2 3.0 2.0
# 3 4.0 3.0
# 4 2.5 4.0
# 5 5.0 2.5
# 6 6.0 7.0
While this is still a custom operation, I am pretty sure it will be slightly faster because it is not iterating over each row, just over the NA values, which I am assuming will be sparse compared to the actual data

To fill priceA use rolling, then shift and use this result in fillna,
# make some data
df = pd.DataFrame({'priceA': range(10)})
#make some rows missing
df.loc[[4, 6], 'priceA'] = np.nan
n = 3
df.priceA = df.priceA.fillna(df.priceA.rolling(n, min_periods=1).mean().shift(1))
The only edge case here is when two nans are within n of one another but it seems to handle this as in your question.
For priceB just use shift,
df = pd.DataFrame({'priceB': range(10)})
df.loc[[4, 8], 'priceB'] = np.nan
m = 2
df.priceB = df.priceB.fillna(df.priceB.shift(m))
Like before, there is the edge case where there is a nan exactly m before another nan.
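Putting both pieces together on the data from the question gives the sketch below (the names prev_mean and filled are assumptions). It reproduces the expected priceB column exactly. For priceA, row 3 matches (32.31), but rows 6 and 8 come out as 25.375 and 20.875 rather than the 27.68 and 23.14 in the question, because rolling only sees the original values and skips earlier NaNs instead of using their filled replacements. That is the edge case described above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "priceA": [25.67, 34.12, 37.14, np.nan, 32, 18.75, np.nan, 23, np.nan, 36],
    "priceB": [30.56, 28.43, 29.08, 34.23, np.nan, 41.1, 45.12, 39.67, 36.45, np.nan],
})
N, M = 3, 2

# Mean of up to N previous original values, shifted so row i only sees rows before i
prev_mean = df["priceA"].rolling(N, min_periods=1).mean().shift(1)

filled = pd.DataFrame({
    "priceA": df["priceA"].fillna(prev_mean),
    # The value M rows above the missing one
    "priceB": df["priceB"].fillna(df["priceB"].shift(M)),
})
```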

Using fillna() selectively in pandas

I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is at or below a specified threshold. For example, if the threshold is 3 then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?

Working with contiguous groups is still a little awkward in pandas.. or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    filled = df.ffill(limit=3)
    unfilled = nulls & (~filled.notnull())
    nf = nulls.replace({False: 2.0, True: np.nan})
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
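For reference, the compare-cumsum-groupby version can be wrapped into a reusable function and run on the data from the question. This is a sketch: the name ffill_short_gaps and its max_gap parameter are made up here, and the run length is computed with transform("count") rather than transform(len), which gives the same number because the run ids contain no NaN.

```python
import numpy as np
import pandas as pd

def ffill_short_gaps(df, max_gap):
    """Hypothetical helper: forward-fill only NaN runs of length <= max_gap."""
    nulls = df.isnull()
    # A new run id starts wherever the null/non-null state changes within a column
    run_ids = (nulls != nulls.shift()).cumsum()
    # Length of the run each cell belongs to, computed column by column
    run_len = run_ids.apply(lambda col: col.groupby(col).transform("count"))
    fillable = nulls & (run_len <= max_gap)
    # Keep original values except in short NaN runs, which take the forward-filled value
    return df.where(~fillable, df.ffill())

nan = np.nan
df = pd.DataFrame([[2, 5, 4],
                   [nan, nan, nan],
                   [nan, nan, nan],
                   [5, nan, nan],
                   [9, 3, nan],
                   [7, 9, 1]])
result = ffill_short_gaps(df, max_gap=3)
```

The third column keeps its run of four NaNs, while the shorter runs in the first two columns are filled.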
