Performance-warning when operating on dataframe - python

This code results in a performance warning, but I have a hard time optimizing it.
for i in range(len(data['Vektoren'][0])):
    tmp_lst = []
    for v in data['Vektoren']:
        tmp_lst.append(v[i])
    data[i] = tmp_lst
DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()

You seem to want to convert your Series of lists/arrays into several columns.
Rather use:
data = data.join(pd.DataFrame(data['Vektoren'].tolist(), index=data.index))
Or:
data = pd.concat([data, pd.DataFrame(data['Vektoren'].tolist(), index=data.index)],
                 axis=1)
Example output:
Vektoren 0 1 2 3
0 [1, 2, 3, 4] 1.0 2.0 3.0 4.0
1 [5, 6] 5.0 6.0 NaN NaN
2 [] NaN NaN NaN NaN
Used input:
data = pd.DataFrame({'Vektoren': [[1,2,3,4],[5,6],[]]})
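For completeness, if you would rather keep the column-by-column logic of the original loop, the warning itself suggests the fix: build all the new columns first and attach them with a single pd.concat(axis=1) call instead of inserting them one at a time. A minimal sketch of that idea (it assumes, like the original loop, that every list in 'Vektoren' has the same length):
import pandas as pd

data = pd.DataFrame({'Vektoren': [[1, 2, 3, 4], [5, 6, 7, 8]]})

# Collect the new columns in a dict, then attach them in one concat call
# instead of repeated data[i] = ... inserts that fragment the frame.
new_cols = {i: [v[i] for v in data['Vektoren']]
            for i in range(len(data['Vektoren'][0]))}
data = pd.concat([data, pd.DataFrame(new_cols, index=data.index)], axis=1)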

Related

Filling NaN values with rolling mean of the previous non-NaN values

I have recently come across a case where I would like to replace NaN values with the rolling mean of the previous non-NaN values in such a way that each newly generated rolling mean is then considered a non-NaN and is used for the next NaN. This is the sample data set:
df = pd.DataFrame({'col1': [1, 3, 4, 5, 6, np.NaN, np.NaN, np.NaN]})
df
col1
0 1.0
1 3.0
2 4.0
3 5.0
4 6.0
5 NaN # (6.0 + 5.0) / 2
6 NaN # (5.5 + 6.0) / 2
7 NaN # ...
I have also found a solution for this which I am struggling to understand:
from functools import reduce
reduce(lambda x, _: x.fillna(x.rolling(2, min_periods=2).mean().shift()), range(df['col1'].isna().sum()), df)
My problem with this solution is that reduce takes three arguments: we first define the lambda function, then we specify the iterable. In the call above I don't understand the last df we put into the reduce call, and I struggle to understand how it works in general to populate the NaNs.
I would appreciate any explanation of how it works. Also, is there a pandas/numpy based solution? reduce does not seem efficient here.
for i in df.index:
    if np.isnan(df["col1"][i]):
        df.loc[i, "col1"] = (df["col1"][i - 1] + df["col1"][i - 2]) / 2
This can be a starting point using a for loop; it will fail if the first two values of the DataFrame are NaN.
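Regarding the "how does reduce work here" part of the question: reduce(f, iterable, initializer) starts from the initializer (that trailing df) and repeatedly applies f to the running result; the items of range(...) are ignored and only control how many passes are made. As I read it, the one-liner unrolls to roughly this explicit loop (a sketch, not code from either post):
result = df   # the trailing df is reduce's "initializer" argument
for _ in range(df['col1'].isna().sum()):   # one pass per NaN to fill
    # rolling(2).mean().shift() places the mean of the two previous rows at each position,
    # so each pass fills every NaN whose two preceding values are already known
    result = result.fillna(result.rolling(2, min_periods=2).mean().shift())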

Dealing with NaN while searching for incomplete duplicate rows in the DataFrame

It's a bit hard to explain, but bear with me. Suppose we have the following dataset:
df = pd.DataFrame({'foo': [1, 1, 1, 8, 1, 5, 5, 5],
                   'bar': [2, float('nan'), 2, 5, 2, 3, float('nan'), 6],
                   'abc': [3, 3, 3, 7, float('nan'), 9, 9, 7],
                   'def': [4, 4, 4, 2, 4, 8, 8, 8]})
print(df)
>>>
foo bar abc def
0 1 2.0 3.0 4
1 1 NaN 3.0 4
2 1 2.0 3.0 4
3 8 5.0 7.0 2
4 1 2.0 NaN 4
5 5 3.0 9.0 8
6 5 NaN 9.0 8
7 5 6.0 7.0 8
Our goal is to find all duplicate rows. However, some of these duplicates are incomplete, because they have NaN values. Nevertheless, we want to find these duplicates too. So the expected result is:
foo bar abc def
0 1 2.0 3.0 4
1 1 NaN 3.0 4
2 1 2.0 3.0 4
4 1 2.0 NaN 4
5 5 3.0 9.0 8
6 5 NaN 9.0 8
If we try to do this in a straightforward way, that only gives us complete rows:
print(df[df.duplicated(keep=False)])
>>>
foo bar abc def
0 1 2.0 3.0 4
2 1 2.0 3.0 4
We can try to circumvent it by using only columns that don't have any missing values:
print(df[df.duplicated(['foo', 'def'], keep=False)])
>>>
foo bar abc def
0 1 2.0 3.0 4
1 1 NaN 3.0 4
2 1 2.0 3.0 4
4 1 2.0 NaN 4
5 5 3.0 9.0 8
6 5 NaN 9.0 8
7 5 6.0 7.0 8
Very close, but not quite. Turns out we're missing out on a crucial piece of information in the 'abc' column that lets us determine that row 7 is not a duplicate. So we'd want to include it:
print(df[df.duplicated(['foo', 'def', 'abc'], keep=False)])
>>>
foo bar abc def
0 1 2.0 3.0 4
1 1 NaN 3.0 4
2 1 2.0 3.0 4
5 5 3.0 9.0 8
6 5 NaN 9.0 8
And it succeeds in removing row 7. However, it also removes row 4. NaN is considered its own separate value, rather than something that could be equal to anything, so its presence in row 4 prevents us from detecting this duplicate.
Now, I'm aware that we don't know for sure if row 4 really is [1, 2, 3, 4]. For all we know, it can be something else entirely, like [1, 2, 9, 4]. But let's say that values 1 and 4 are actually some other values that are oddly specific. For example, 34900 and 23893. And let's say that there are many more columns that are also exactly the same. Moreover, the complete duplicate rows are not just 0 and 2, there are over two hundred of them, and then another 40 rows that have these same values in all columns except for 'abc', where they have NaN. So for this particular group of duplicates such coincidences are extremely improbable, and that's how we know for certain that the record [1, 2, 3, 4] is problematic, and that row 4 is almost certainly a duplicate.
However, if [1, 2, 3, 4] is not the only group of duplicates, then it's possible that some other groups have very unspecific values in the 'foo' and 'def' columns, like 1 and 500. And it so happens that including the column 'abc' in the subset would be extremely helpful in resolving this issue, because the values in the 'abc' column are nearly always very specific, and allow us to determine all duplicates with near-certainty. But there's a drawback: the 'abc' column has missing values, so by using it we're sacrificing detection of some duplicates with NaNs. Some of them we know for a fact to be duplicates (like the aforementioned 40), so it's a hard dilemma.
What would be the best way to deal with this situation? It would be nice if we could somehow make NaNs equal to everything, rather than nothing, for the duration of duplicate detection, that would resolve this issue. But I doubt this is possible. Am I supposed to just go group by group and check this manually?
Thanks to @cs95 for help in figuring this out. When we sort values, NaNs are put at the end of each sorting group by default, and if an incomplete record has a duplicate with an actual value in place of that NaN, the duplicate will end up right on top of the NaN. That means we can fill the NaN with that value using the ffill() method. In other words, we forward-fill missing data from the rows that are closest to them, and can then make a more accurate determination of whether a row is a duplicate.
The code I ended up using (adjusted to this reproducible example) looks like this:
#printing all duplicates
col_list = ['foo', 'def', 'abc', 'bar']
show_mask = df.sort_values(col_list).ffill().duplicated(col_list, keep=False).sort_index()
df[show_mask].sort_values(col_list)
#deleting duplicates, but keeping one record per duplicate group
delete_mask = df.sort_values(col_list).ffill().duplicated(col_list).sort_index()
df = df[~delete_mask].reset_index(drop=True)
It's possible to use bfill() instead of ffill(), since it's the same principle applied upside down. But it requires changing some default parameters of methods used to opposite ones, namely na_position='first' and keep='last'. sort_index() is used just to silence the reindexing warning.
Note that the order in which you list the columns is very important, as it is used for sorting priorities. To make sure that the record above the missing value is the correct value to be copied, you have to enumerate all the columns that don't have any missing values first, and only then the ones that do. For the former columns the order doesn't really matter. For the latter ones it is crucial to start from the column that has the most diverse/specific values and end with the least diverse/specific one (float -> int -> string -> bool is a good rule of thumb, but it largely depends on what exact kind of variables the columns represent in your dataset). In this example they're all the same, but even here you won't get the right answer if you put 'bar' before 'abc'.
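As a quick illustration of why the ordering matters, here is my own check against the example df above (so treat it as an assumption rather than part of the original answer): putting 'bar' before 'abc' makes the sort place row 6 under row 7 inside the foo=5, def=8 group, so ffill copies bar=6 instead of bar=3, and the row 5 / row 6 pair is no longer detected:
wrong_order = ['foo', 'def', 'bar', 'abc']
mask = df.sort_values(wrong_order).ffill().duplicated(wrong_order, keep=False).sort_index()
print(df[mask])   # rows 5 and 6 are now missing from the duplicates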
And even then it's not a perfect solution. It does a pretty good job of putting the most complete version of the record at the top and transferring the information in it to the less complete versions below whenever needed. But there's a possibility that a fully complete version of the record simply doesn't exist. For example, say there are records [5, 3, NaN, 8] and [5, NaN, 9, 8] (and there's no [5, 3, 9, 8] record). This solution is not capable of letting them swap the missing pieces with each other: it will put 9 in the former, but the NaN in the latter will remain empty, and these duplicates will go unnoticed.
This is not an issue if you're dealing with just a single incomplete column, but each added incomplete column will make such cases more frequent. However, it is still preferable to add all the columns, because failing to detect some duplicates is better than ending up with false duplicates in your list, which is a distinct possibility unless you use all the columns.

Filling missing data with historical mean fast and efficiently in pandas

I am working with a large panel dataset (longitudinal data) with 500k observations. Currently, I am trying to fill the missing data (at most 30% of observations) using the mean of up till time t of each variable. (The reason why I do not fill the data with overall mean, is to avoid a forward looking bias that arises from using data only available at a later point in time.)
I wrote the following function which does the job, but runs extremely slowly (5 hours for 500k rows!). In general, I find that filling missing data in pandas is a computationally tedious task. Please enlighten me on how you normally fill missing values, and how you make it run fast.
Function to fill with mean till time "t":
def meanTillTimeT(x, cols):
    start = time.time()
    print('Started')
    x.reset_index(inplace=True)
    for i in cols:
        l1 = []
        for j in range(x.shape[0]):
            if x.loc[j, i] != 0 and np.isnan(x.loc[j, i]) == False:
                l1.append(x.loc[j, i])
            elif np.isnan(x.loc[j, i]) == True:
                x.loc[j, i] = np.mean(l1)
    end = time.time()
    print("time elapsed:", end - start)
    return x
Let us build a DataFrame for illustration:
import pandas as pd
import numpy as np
df = pd.DataFrame({"value1": [1, 2, 1, 5, np.nan, np.nan, 8, 3],
"value2": [0, 8, 1, np.nan, np.nan, 8, 9, np.nan]})
Here is the DataFrame:
value1 value2
0 1.0 0.0
1 2.0 8.0
2 1.0 1.0
3 5.0 NaN
4 NaN NaN
5 NaN 8.0
6 8.0 9.0
7 3.0 NaN
Now, I suggest first computing the cumulative sums with pandas.DataFrame.cumsum, along with the cumulative count of non-NaN values, so as to compute the running means. After that, it is enough to fill the NaNs with those means and insert them into the original DataFrame. Both actions use pandas.DataFrame.fillna, which is going to be much, much faster than Python loops:
# running mean of the non-NaN values seen so far, per column
df_mean = df.cumsum() / (~df.isna()).cumsum()
# carry the last available running mean forward across NaN stretches
df_mean = df_mean.fillna(method = "ffill")
# fill the original NaNs with those running means
df = df.fillna(value = df_mean)
The result is:
value1 value2
0 1.00 0.0
1 2.00 8.0
2 1.00 1.0
3 5.00 3.0
4 2.25 3.0
5 2.25 8.0
6 8.00 9.0
7 3.00 5.2
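If I'm reading the expanding window API right, the same result can also be had in one line, because Series.expanding().mean() skips NaNs and therefore already equals the mean of the previous non-NaN values at each NaN position (a sketch against the same df; worth benchmarking against the cumsum version on the full 500k rows):
# expanding().mean() ignores NaNs, so at a NaN row it is the mean of all
# earlier non-NaN values in that column; fillna then plugs those means in.
df = df.fillna(df.expanding().mean())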

pandas - Groupby two functions

I've been trying to get a cumsum on a pandas groupby object. I need the cumsum to be shifted by one, which is achieved by shift(). However, doing both of these functions on a single groupby object gives some unwanted results:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [2, 3, 5, 2, 3, 5]})
df.groupby('A').cumsum().shift()
which gives:
B
0 NaN
1 2.0
2 5.0
3 10.0
4 2.0
5 5.0
I.e. the last value of the cumsum() on group 1 is shifted into the first value of group 2. What I want is for these groups to stay separated, and to get:
B
0 NaN
1 2.0
2 5.0
3 NaN
4 2.0
5 5.0
But I'm not sure how to get both functions to work on the groupby object combined. Can't find this question anywhere else. Have been playing around with agg but can't seem to work that out. Any help would be appreciated.
Use a lambda function with GroupBy.apply; it is also necessary to specify the column(s) after the groupby for processing:
df['B'] = df.groupby('A')['B'].apply(lambda x: x.cumsum().shift())
print (df)
A B
0 1 NaN
1 1 2.0
2 1 5.0
3 2 NaN
4 2 2.0
5 2 5.0
The result of your first operation df.groupby('A').cumsum() is a regular dataframe. It is equivalent to df.groupby('A')[['B']].cumsum(), but Pandas conveniently allows you to omit the [['B']] indexing part.
Any subsequent operation on this dataframe therefore will not by default be performed groupwise, unless you use GroupBy again:
res = df.groupby('A').cumsum().groupby(df['A']).shift()
But, as you can see, this repeats the grouping operation and will be inefficient. You can instead define a single function which combines cumsum and shift in the correct order, then apply this function on a single GroupBy object. Defining this single function is known as function composition, and it's not native to Python. Here are a few alternatives:
Define a new named function
This is an explicit and recommended solution:
def cum_shift(x):
    return x.cumsum().shift()
res1 = df.groupby('A')[['B']].apply(cum_shift)
Define an anonymous lambda function
A one-line version of the above:
res2 = df.groupby('A')[['B']].apply(lambda x: x.cumsum().shift())
Use a library which composes
This is a pure functional solution; for example, via the 3rd-party toolz library:
from toolz import compose
from operator import methodcaller
cumsum_shift_comp = compose(methodcaller('shift'), methodcaller('cumsum'))
res3 = df.groupby('A')[['B']].apply(cumsum_shift_comp)
All the above give the equivalent result:
assert res.equals(res1) and res1.equals(res2) and res2.equals(res3)
print(res1)
B
0 NaN
1 2.0
2 5.0
3 NaN
4 2.0
5 5.0

Why doesn't first and last in a groupby give me first and last

I'm posting this because the topic just got brought up in another question/answer and the behavior isn't very well documented.
Consider the dataframe df
df = pd.DataFrame(dict(
    A=list('xxxyyy'),
    B=[np.nan, 1, 2, 3, 4, np.nan]
))
A B
0 x NaN
1 x 1.0
2 x 2.0
3 y 3.0
4 y 4.0
5 y NaN
I wanted to get the first and last rows of each group defined by column 'A'.
I tried
df.groupby('A').B.agg(['first', 'last'])
first last
A
x 1.0 2.0
y 3.0 4.0
However, this doesn't give me the np.NaNs that I expected.
How do I get the actual first and last values in each group?
As noted here by @unutbu:
The groupby.first and groupby.last methods return the first and last non-null values respectively.
To get the actual first and last values, do:
def h(x):
    return x.values[0]

def t(x):
    return x.values[-1]
df.groupby('A').B.agg([h, t])
h t
A
x NaN 2.0
y 3.0 NaN
One option is to use the .nth method:
>>> gb = df.groupby('A')
>>> gb.nth(0)
B
A
x NaN
y 3.0
>>> gb.nth(-1)
B
A
x 2.0
y NaN
>>>
However, I haven't found a way to aggregate them neatly. Of course, one can always use a pd.DataFrame constructor:
>>> pd.DataFrame({'first':gb.B.nth(0), 'last':gb.B.nth(-1)})
first last
A
x NaN 2.0
y 3.0 NaN
Note: I explicitly used the gb.B attribute, or else you have to use .squeeze().
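For reference, and assuming a pandas version with named aggregation, the same positional idea can be written as a single agg call with iloc-based callables (the output labels 'first' and 'last' are just the keyword names I picked):
# iloc picks the literal first/last row of each group, NaNs included.
out = df.groupby('A')['B'].agg(
    first=lambda s: s.iloc[0],
    last=lambda s: s.iloc[-1],
)
print(out)
#    first  last
# A
# x    NaN   2.0
# y    3.0   NaN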
