Filling missing data with historical mean quickly and efficiently in pandas - python

I am working with a large panel dataset (longitudinal data) with 500k observations. Currently, I am trying to fill the missing data (at most 30% of observations) using the mean of each variable up to time t. (The reason I do not fill the data with the overall mean is to avoid a forward-looking bias that arises from using data only available at a later point in time.)
I wrote the following function, which does the job but runs extremely slowly (5 hours for 500k rows!). In general, I find that filling missing data in pandas is a computationally tedious task. Please enlighten me on how you normally fill missing values and how you make it run fast.
Function to fill NaNs with the mean up to time "t":
import time
import numpy as np

def meanTillTimeT(x, cols):
    start = time.time()
    print('Started')
    x.reset_index(inplace=True)
    for i in cols:
        l1 = []  # running list of the non-zero, non-NaN values seen so far
        for j in range(x.shape[0]):
            if x.loc[j, i] != 0 and not np.isnan(x.loc[j, i]):
                l1.append(x.loc[j, i])
            elif np.isnan(x.loc[j, i]):
                x.loc[j, i] = np.mean(l1)  # fill the NaN with the mean of values seen so far
    end = time.time()
    print("time elapsed:", end - start)
    return x

Let us build a DataFrame for illustration:
import pandas as pd
import numpy as np
df = pd.DataFrame({"value1": [1, 2, 1, 5, np.nan, np.nan, 8, 3],
                   "value2": [0, 8, 1, np.nan, np.nan, 8, 9, np.nan]})
Here is the DataFrame:
value1 value2
0 1.0 0.0
1 2.0 8.0
2 1.0 1.0
3 5.0 NaN
4 NaN NaN
5 NaN 8.0
6 8.0 9.0
7 3.0 NaN
Now, I suggest first computing the cumulative sums with pandas.DataFrame.cumsum, together with the cumulative count of non-NaN values, so as to get the running means. After that, it is enough to forward-fill those means and use them to fill the NaNs of the original DataFrame with pandas.DataFrame.fillna, which is going to be much faster than Python loops:
df_mean = df.cumsum() / (~df.isna()).cumsum()  # running mean of the non-NaN values seen so far
df_mean = df_mean.ffill()                      # carry the last running mean over NaN positions
df = df.fillna(value=df_mean)                  # fill each NaN with the running mean at that row
The result is:
value1 value2
0 1.00 0.0
1 2.00 8.0
2 1.00 1.0
3 5.00 3.0
4 2.25 3.0
5 2.25 8.0
6 8.00 9.0
7 3.00 5.2
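Since the question is about panel data, the same running mean can also be computed within each entity. Below is a minimal sketch using a hypothetical frame with an entity column named "id" (that column is an assumption, not part of the example above); note that on a single frame, df.fillna(df.expanding().mean()) produces the same result as the cumsum version:
# Hypothetical panel frame: an entity column "id" plus a value column.
panel = pd.DataFrame({"id":     [1, 1, 1, 1, 2, 2, 2, 2],
                      "value1": [1, 2, np.nan, 5, 4, np.nan, 8, 3]})
value_cols = ["value1"]
running_means = (panel.groupby("id")[value_cols]
                 .expanding().mean()                 # per-entity mean of the values seen so far
                 .reset_index(level=0, drop=True))   # drop the "id" level to realign with panel
panel[value_cols] = panel[value_cols].fillna(running_means)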

Related

Performance-warning when operating on dataframe

This code results in a performance warning, but I have a hard time optimizing it.
for i in range(len(data['Vektoren'][0])):
    tmp_lst = []
    for v in data['Vektoren']:
        tmp_lst.append(v[i])
    data[i] = tmp_lst
DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
You seem to want to convert your Series of lists/arrays into several columns.
Rather use:
data = data.join(pd.DataFrame(data['Vektoren'].tolist(), index=data.index))
Or:
data = pd.concat([data, pd.DataFrame(data['Vektoren'].tolist(), index=data.index)],
                 axis=1)
Example output:
Vektoren 0 1 2 3
0 [1, 2, 3, 4] 1.0 2.0 3.0 4.0
1 [5, 6] 5.0 6.0 NaN NaN
2 [] NaN NaN NaN NaN
Used input:
data = pd.DataFrame({'Vektoren': [[1,2,3,4],[5,6],[]]})
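As an optional follow-up (a sketch, not part of the answer above), you can drop the original list column and give the expanded columns a prefix instead of bare integer names:
# Expand the list column, prefix the new columns, then drop the original one.
expanded = pd.DataFrame(data['Vektoren'].tolist(), index=data.index).add_prefix('v')
data = data.drop(columns='Vektoren').join(expanded)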

Filling NaN values with rolling mean of the previous non-NaN values

I have recently come across a case where I would like to replace NaN values with the rolling mean of the previous non-NaN values in such a way that each newly generated rolling mean is then considered a non-NaN and is used for the next NaN. This is the sample data set:
df = pd.DataFrame({'col1': [1, 3, 4, 5, 6, np.NaN, np.NaN, np.NaN]})
df
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 6.0
5 NaN # (6.0 + 5.0) / 2
6 NaN # (5.5 + 6.0) / 2
7 NaN # ...
I have also found a solution for this which I am struggling to understand:
from functools import reduce
reduce(lambda x, _: x.fillna(x.rolling(2, min_periods=2).mean().shift()),
       range(df['col1'].isna().sum()),
       df)
My problem with this solution is that reduce takes three arguments: we first define the lambda function and then specify the iterable. In the solution above I don't understand the last df that we put in the call to reduce, and I struggle to understand how it works in general to populate the NaNs.
I would appreciate any explanation of how it works. Also, is there any pandas/numpy-based solution, as reduce does not seem efficient here?
for i in df.index:
    if np.isnan(df.loc[i, "col1"]):
        df.loc[i, "col1"] = (df.loc[i - 1, "col1"] + df.loc[i - 2, "col1"]) / 2
This can be a start using a for loop; it will fail if the first two values of the DataFrame are NaN.
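For reference, here is the same loop written as a small helper (just a sketch): it fills each NaN with the mean of the two previously filled values and simply skips positions that do not yet have two predecessors, so it does not fail on leading NaNs:
def fill_with_rolling_mean(s, window=2):
    # Fill each NaN with the mean of the `window` previous (already filled) values.
    s = s.copy()
    for i in range(len(s)):
        if pd.isna(s.iloc[i]) and i >= window:
            s.iloc[i] = s.iloc[i - window:i].mean()
    return s

df["col1"] = fill_with_rolling_mean(df["col1"])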

Doubts about how the pandas axis works; my code may be off

My issue is the following: I'm creating a pandas DataFrame from a dictionary, and it ends up with shape [70k, 300]. I'm trying to normalise each cell, either by columns and then rows, or the other way around, rows and then columns.
I had asked a similar question before, but that was with a [70k, 70k] DataFrame, so square, and it worked just by doing this:
dfNegInfoClearRev = (df - df.mean(axis=1)) / df.std(axis=1).replace(0, 1)
dfNegInfoClearRev = (dfNegInfoClearRev - dfNegInfoClearRev.mean(axis=0)) / dfNegInfoClearRev.std(axis=0).replace(0, 1)
print(dfNegInfoClearRev)
This did what I needed for the [70k, 70k] case. A problem came up when I tried the same principle with a [70k, 300] frame. If I do this:
dfRINegInfo = (dfRI - dfRI.mean(axis=0)) / dfRI.std(axis=0).replace(0, 1)
dfRINegInfoRows = (dfRINegInfo - dfRINegInfo.mean(axis=1)) / dfRINegInfo.std(axis=1).replace(0, 1)
I somehow end up with a [70k, 70k+300] full of NaNs with the same names.
I ended up doing this:
dfRIInter = dfRINegInfo.sub(dfRINegInfo.mean(axis=1), axis=0)
dfRINegInfoRows = dfRIInter.div(dfRIInter.std(axis=1), axis=0).fillna(1).replace(0, 1)
print(dfRINegInfoRows)
But I'm not sure this is what I was trying to do, and I don't really understand why, after the column normalisation (which does work and keeps the [70k, 300] shape), the row normalisation gives me a [70k, 70k+300]. Is it actually doing what I'm trying to do? Any help?
I think your new code is doing what you want.
If we look at a 3x3 toy example:
df = pd.DataFrame([
    [1, 2, 3],
    [2, 4, 6],
    [3, 6, 9],
])
The axis=1 mean is:
df.mean(axis=1)
# 0 2.0
# 1 4.0
# 2 6.0
# dtype: float64
And the subtraction applies to each row (i.e., [1,2,3] - [2,4,6] element-wise, [2,4,6] - [2,4,6], and [3,6,9] - [2,4,6]):
df - df.mean(axis=1)
# 0 1 2
# 0 -1.0 -2.0 -3.0
# 1 0.0 0.0 0.0
# 2 1.0 2.0 3.0
So if we have df2 shaped 3x2:
df2 = pd.DataFrame([
    [1, 2],
    [3, 6],
    [5, 10],
])
The axis=1 mean is still length 3:
df2.mean(axis=1)
# 0 1.5
# 1 4.5
# 2 7.5
# dtype: float64
And subtraction will result in the 3rd column being nan (i.e., [1,2,nan] - [1.5,4.5,7.5] element-wise, [3,6,nan] - [1.5,4.5,7.5], and [5,10,nan] - [1.5,4.5,7.5]):
df2 - df2.mean(axis=1)
# 0 1 2
# 0 -0.5 -2.5 NaN
# 1 1.5 1.5 NaN
# 2 3.5 5.5 NaN
If you make the subtraction itself along axis=0 then it works as expected:
df2.sub(df2.mean(axis=1), axis=0)
# 0 1
# 0 -0.5 0.5
# 1 -1.5 1.5
# 2 -2.5 2.5
So when you use the default subtraction between a (70000, 300) DataFrame and a length-70000 Series, alignment on the labels produces 69700 extra columns full of NaN.
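For completeness, here is a sketch of both normalisations with explicit axes, applied to the small df2 above (the same pattern applies to the (70000, 300) frame from the question):
# Row-wise z-score: subtract/divide along axis=0 so the row statistics are
# matched against the index, keeping the original shape.
row_mean = df2.mean(axis=1)
row_std = df2.std(axis=1).replace(0, 1)   # avoid division by zero
df2_rows = df2.sub(row_mean, axis=0).div(row_std, axis=0)

# Column-wise normalisation aligns on column labels by default, so the plain
# operators already do the right thing.
df2_both = (df2_rows - df2_rows.mean(axis=0)) / df2_rows.std(axis=0).replace(0, 1)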

Find if sum of any two columns exceed X in pandas dataframe

Columns are attributes, rows are observations.
I would like to extract rows where the sum of any two attributes exceeds a specified value (say 0.7). Then, in two new columns, list the column headers with the bigger and the smaller contribution to that sum.
I am new to Python, so I am stuck on how to proceed after generating my DataFrame.
You can do this:
import pandas as pd
from itertools import combinations

THRESHOLD = 8.0

def valuation_formula(row):
    # all pairs of values in the row whose sum exceeds the threshold, each pair sorted ascending
    l = [sorted(x) for x in combinations(row, r=2) if sum(x) > THRESHOLD]
    if len(l) == 0:
        row["smaller"], row["larger"] = None, None
    else:
        row["smaller"], row["larger"] = l[0]  # since not specified by OP, we take the first such pair
    return row

contribution_df = df.apply(lambda row: valuation_formula(row), axis=1)
So that, if
df = pd.DataFrame({"a" : [1.0, 2.0, 4.0], "b" : [5.0, 6.0, 7.0]})
a b
0 1.0 5.0
1 2.0 6.0
2 4.0 7.0
then, contribution_df is
a b smaller larger
0 1.0 5.0 NaN NaN
1 2.0 6.0 NaN NaN
2 4.0 7.0 4.0 7.0
HTH.
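A vectorized alternative (a sketch, not the approach above): a row has some pair of columns whose sum exceeds the threshold exactly when the sum of its two largest values does, so it is enough to look at those two. Note that, unlike the apply version, this always reports the largest pair:
import numpy as np

top2 = np.sort(df.to_numpy(), axis=1)[:, -2:]   # two largest values per row, ascending
mask = top2.sum(axis=1) > THRESHOLD
result = df[mask].copy()
result["smaller"], result["larger"] = top2[mask, 0], top2[mask, 1]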

How to group near-duplicate values in a pandas dataframe?

If there are duplicate values in a DataFrame, pandas already provides functions to replace or drop them. In many experimental datasets, on the other hand, one might have 'near' duplicates.
How can one replace these near-duplicate values with, e.g., their mean?
The example data looks as follows:
df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})
I tried to hack together something to bin near duplicates, but it uses for loops and seems like a hack against pandas:
def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()
Which does the grouping and averaging:
print(cluster_near_values(df, 'x', 0.1))
x y
cluster_group
0.0 1.000000 1.00
1.0 2.005000 2.10
2.0 3.000000 3.00
3.0 4.016667 4.17
4.0 5.000000 5.50
Is there a better way to achieve this?
Here's an example where items are grouped to one digit of precision. You can modify this as needed, and you can also adapt it for binning values with a threshold larger than 1.
df.groupby(np.ceil(df['x'] * 10) // 10).mean()
x y
x
1.0 1.000000 1.00
2.0 2.005000 2.10
3.0 3.000000 3.00
4.0 4.016667 4.17
5.0 5.000000 5.50
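A different grouping rule (a sketch, not part of the answer above) avoids fixed bin edges splitting a pair of near-duplicates: sort the values, start a new group whenever the gap to the previous value exceeds the bin size, then average within each group:
# Sort x, then open a new group wherever consecutive sorted values differ by
# more than the bin size; near-duplicates stay together regardless of bin edges.
order = df['x'].sort_values()
group_id = order.diff().gt(0.1).cumsum()
print(df.loc[order.index].groupby(group_id).mean())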
