I have the following dataframe, grouped by datafile, and I want to apply fillna(method='bfill') only to those 'groups' that contain more than half of the data.
df.groupby('datafile').count()
datafile column1 column2 column3 column4
datafile1 5 5 3 4
datafile2 5 5 4 5
datafile3 5 5 5 5
datafile4 5 5 0 0
datafile5 5 5 1 1
As you can see in the df above, I'd like to fill the groups that contain most of the information, but not those that have little or no information. So I was thinking of a condition along the lines of: fillna for the groups that have more than half of the counts, and leave the rest (those with less than half) untouched.
I'm struggling with how to set up the condition, since it involves combining the result of a groupby with the original df.
Help is appreciated.
example df:
index datafile column1 column2 column3 column4
0 datafile1 5 5 NaN 20
1 datafile1 6 6 NaN 21
2 datafile1 7 7 9 NaN
3 datafile1 8 8 10 23
4 datafile1 9 9 11 24
5 datafile2 3 3 2 7
6 datafile2 4 4 3 8
7 datafile2 5 5 4 9
8 datafile2 6 6 NaN 10
9 datafile2 7 7 6 11
10 datafile3 10 10 24 4
11 datafile3 11 11 25 5
12 datafile3 12 12 26 6
13 datafile3 13 13 27 7
14 datafile3 14 14 28 8
15 datafile4 4 4 NaN NaN
16 datafile4 5 5 NaN NaN
17 datafile4 6 6 NaN NaN
18 datafile4 7 7 NaN NaN
19 datafile4 8 8 NaN NaN
19 datafile4 9 9 NaN NaN
20 datafile5 7 7 1 3
21 datafile5 8 8 NaN NaN
22 datafile5 9 9 NaN NaN
23 datafile5 10 10 NaN NaN
24 datafile5 11 1 NaN NaN
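For reference, one way to build this example frame (values copied from the table above; the separate index column in the listing is left as the default RangeIndex here):
import numpy as np
import pandas as pd

# Reconstruction of the example df above.
df = pd.DataFrame({
    'datafile': ['datafile1'] * 5 + ['datafile2'] * 5 + ['datafile3'] * 5
                + ['datafile4'] * 6 + ['datafile5'] * 5,
    'column1': [5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14,
                4, 5, 6, 7, 8, 9, 7, 8, 9, 10, 11],
    'column2': [5, 6, 7, 8, 9, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14,
                4, 5, 6, 7, 8, 9, 7, 8, 9, 10, 1],
    'column3': [np.nan, np.nan, 9, 10, 11, 2, 3, 4, np.nan, 6,
                24, 25, 26, 27, 28, *[np.nan] * 6, 1, *[np.nan] * 4],
    'column4': [20, 21, np.nan, 23, 24, 7, 8, 9, 10, 11,
                4, 5, 6, 7, 8, *[np.nan] * 6, 3, *[np.nan] * 4],
})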
expected output df:
index datafile column1 column2 column3 column4
0 datafile1 5 5 9 20
1 datafile1 6 6 9 21
2 datafile1 7 7 9 23
3 datafile1 8 8 10 23
4 datafile1 9 9 11 24
5 datafile2 3 3 2 7
6 datafile2 4 4 3 8
7 datafile2 5 5 4 9
8 datafile2 6 6 6 10
9 datafile2 7 7 6 11
10 datafile3 10 10 24 4
11 datafile3 11 11 25 5
12 datafile3 12 12 26 6
13 datafile3 13 13 27 7
14 datafile3 14 14 28 8
15 datafile4 4 4 NaN NaN
16 datafile4 5 5 NaN NaN
17 datafile4 6 6 NaN NaN
18 datafile4 7 7 NaN NaN
19 datafile4 8 8 NaN NaN
19 datafile4 9 9 NaN NaN
20 datafile5 7 7 1 3
21 datafile5 8 8 NaN NaN
22 datafile5 9 9 NaN NaN
23 datafile5 10 10 NaN NaN
24 datafile5 11 1 NaN NaN
If, within a group, the proportion of non-null values in a column is greater than or equal to 0.5, that column is filled with the bfill method:
rate = 0.5
not_na = df.notna()
g = not_na.groupby(df['datafile'])

df_fill = (
    df.bfill()
      .where(
          g.transform('sum')                              # non-null count per group and column
           .div(g['datafile'].transform('size'), axis=0)  # divided by the group size -> proportion
           .ge(rate)                                      # True where the proportion >= rate
          | not_na                                        # always keep values that were already present
      )
)
print(df_fill)
index datafile column1 column2 column3 column4
0 0 datafile1 5 5 9.0 20.0
1 1 datafile1 6 6 9.0 21.0
2 2 datafile1 7 7 9.0 23.0
3 3 datafile1 8 8 10.0 23.0
4 4 datafile1 9 9 11.0 24.0
5 5 datafile2 3 3 2.0 7.0
6 6 datafile2 4 4 3.0 8.0
7 7 datafile2 5 5 4.0 9.0
8 8 datafile2 6 6 6.0 10.0
9 9 datafile2 7 7 6.0 11.0
10 10 datafile3 10 10 24.0 4.0
11 11 datafile3 11 11 25.0 5.0
12 12 datafile3 12 12 26.0 6.0
13 13 datafile3 13 13 27.0 7.0
14 14 datafile3 14 14 28.0 8.0
15 15 datafile4 4 4 NaN NaN
16 16 datafile4 5 5 NaN NaN
17 17 datafile4 6 6 NaN NaN
18 18 datafile4 7 7 NaN NaN
19 19 datafile4 8 8 NaN NaN
20 19 datafile4 9 9 NaN NaN
21 20 datafile5 7 7 1.0 3.0
22 21 datafile5 8 8 NaN NaN
23 22 datafile5 9 9 NaN NaN
24 23 datafile5 10 10 NaN NaN
25 24 datafile5 11 1 NaN NaN
Alternatively, we can use:
m = (not_na.groupby(df['datafile'], sort=False)
           .sum()
           .div(df['datafile'].value_counts(), axis=0)
           .ge(rate)
           .reindex(df['datafile'])
           .reset_index(drop=True))
df.bfill().where(m | not_na)
Both methods return the same result for the sample dataframe. Timings:
%%timeit
rate = 0.5
not_na = df.notna()
m = (not_na.groupby(df['datafile'], sort=False)
           .sum()
           .div(df['datafile'].value_counts(), axis=0)
           .ge(rate)
           .reindex(df['datafile'])
           .reset_index(drop=True))
df.bfill().where(m | not_na)
11.1 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
rate = 0.5
not_na = df.notna()
g = not_na.groupby(df['datafile'])
df_fill = (df.bfill()
             .where(g.transform('sum')
                     .div(g['datafile'].transform('size'), axis=0)
                     .ge(rate)
                    | not_na))
12.9 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
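To double-check that the two approaches fill exactly the same cells (a quick sanity check, assuming rate, not_na, g, m and df_fill were built as in the snippets above, outside of the %%timeit cells):
from pandas.testing import assert_frame_equal

# Raises an AssertionError if the two fill strategies disagree anywhere.
assert_frame_equal(df_fill, df.bfill().where(m | not_na))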
Use pandas GroupBy.filter:
def most_not_null(x):
    # Keep groups whose null count is below half of their non-null count.
    return x.isnull().sum().sum() < (x.notnull().sum().sum() // 2)
filtered_groups = df.groupby('datafile').filter(most_not_null)
df.loc[filtered_groups.index] = filtered_groups.bfill()
Output
>>> df
index datafile column1 column2 column3 column4
0 0 datafile1 5 5 9.0 20.0
1 1 datafile1 6 6 9.0 21.0
2 2 datafile1 7 7 9.0 23.0
3 3 datafile1 8 8 10.0 23.0
4 4 datafile1 9 9 11.0 24.0
5 5 datafile2 3 3 2.0 7.0
6 6 datafile2 4 4 3.0 8.0
7 7 datafile2 5 5 4.0 9.0
8 8 datafile2 6 6 6.0 10.0
9 9 datafile2 7 7 6.0 11.0
10 10 datafile3 10 10 24.0 4.0
11 11 datafile3 11 11 25.0 5.0
12 12 datafile3 12 12 26.0 6.0
13 13 datafile3 13 13 27.0 7.0
14 14 datafile3 14 14 28.0 8.0
15 15 datafile4 4 4 NaN NaN
16 16 datafile4 5 5 NaN NaN
17 17 datafile4 6 6 NaN NaN
18 18 datafile4 7 7 NaN NaN
19 19 datafile4 8 8 NaN NaN
20 19 datafile4 9 9 NaN NaN
21 20 datafile5 7 7 1.0 3.0
22 21 datafile5 8 8 NaN NaN
23 22 datafile5 9 9 NaN NaN
24 23 datafile5 10 10 NaN NaN
25 24 datafile5 11 1 NaN NaN
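Note that the assignment above modifies df in place. If you would rather keep the original frame untouched, the same logic can be run on a copy:
out = df.copy()
filtered_groups = out.groupby('datafile').filter(most_not_null)
out.loc[filtered_groups.index] = filtered_groups.bfill()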
I am trying to merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes share one identical row (although there could be several overlapping rows).
So I want to get a df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 NaN 27 NaN
09.01.2021 NaN 28 NaN
10.01.2021 NaN 29 NaN
11.01.2021 NaN 30 NaN
12.01.2021 NaN 31 NaN
13.01.2021 NaN 32 NaN
I've tried
df3 = df1.merge(df2, on='Date', how='outer')
but it creates duplicated columns with _x/_y suffixes. Could you give me some idea of how to get df3?
Thanks a lot.
Do an outer merge without specifying on (by default, on is the intersection of the columns of the two DataFrames, in this case ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
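Equivalently, you can spell out both shared columns as the keys; since ['Date', 'B'] is exactly the column intersection, this produces the same df3:
df3 = df1.merge(df2, on=['Date', 'B'], how='outer')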
Assuming you always want to keep the first, full version of a row, you can concat df2 onto the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
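Note that the result above keeps each frame's original row labels (0-6 followed by 1-6). If you want a clean 0..n-1 index, reset it after dropping duplicates:
df3 = (pd.concat([df1, df2])
         .drop_duplicates(subset='Date')
         .reset_index(drop=True))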
I am trying to filter out rows, grouping by id, whenever all values in column a are NaN for that group. That is, for a given id, if every observation across the dates is NaN, I want to drop its rows. E.g. I want to filter out id = 2.
df
date id a b c d
0 1/1/2000 1 10 20 10.0 11
1 1/1/2000 2 NaN 21 1.0 11
2 1/1/2000 3 15 20 14.0 11
3 1/1/2000 4 NaN 24 13.0 11
4 1/2/2000 1 10 25 10.0 11
5 1/2/2000 2 NaN 20 13.0 15
6 1/2/2000 3 10 26 22.0 11
7 1/2/2000 4 10 20 16.0 13
8 1/3/2000 1 10 20 10.0 11
9 1/3/2000 2 NaN 20 13.0 11
10 1/3/2000 3 10 20 18.0 11
11 1/3/2000 4 10 20 10.0 11
desired dataframe
date id a b c d
0 1/1/2000 1 10 20 10.0 11
1 1/1/2000 3 15 20 14.0 11
2 1/1/2000 4 NaN 24 13.0 11
3 1/2/2000 1 10 25 10.0 11
4 1/2/2000 3 10 26 22.0 11
5 1/2/2000 4 10 20 16.0 13
6 1/3/2000 1 10 20 10.0 11
7 1/3/2000 3 10 20 18.0 11
8 1/3/2000 4 10 20 10.0 11
Test for non-missing values with Series.notna and then flag every group with at least one non-missing value using GroupBy.transform('any'), which returns a Series the same length as the original, so you can filter with boolean indexing:
df = df[df['a'].notna().groupby(df['id']).transform('any')]
print (df)
date id a b c d
0 1/1/2000 1 10.0 20 10.0 11
2 1/1/2000 3 15.0 20 14.0 11
3 1/1/2000 4 NaN 24 13.0 11
4 1/2/2000 1 10.0 25 10.0 11
6 1/2/2000 3 10.0 26 22.0 11
7 1/2/2000 4 10.0 20 16.0 13
8 1/3/2000 1 10.0 20 10.0 11
10 1/3/2000 3 10.0 20 18.0 11
11 1/3/2000 4 10.0 20 10.0 11
Or use DataFrame.loc to select the id values that have a non-missing a, and then filter the original DataFrame with Series.isin and boolean indexing:
df = df[df['id'].isin(df.loc[df['a'].notna(), 'id'])]
print (df)
date id a b c d
0 1/1/2000 1 10.0 20 10.0 11
2 1/1/2000 3 15.0 20 14.0 11
3 1/1/2000 4 NaN 24 13.0 11
4 1/2/2000 1 10.0 25 10.0 11
6 1/2/2000 3 10.0 26 22.0 11
7 1/2/2000 4 10.0 20 16.0 13
8 1/3/2000 1 10.0 20 10.0 11
10 1/3/2000 3 10.0 20 18.0 11
11 1/3/2000 4 10.0 20 10.0 11
Note: not as fast as transform, but easy to understand:
out = df.groupby('id').filter(lambda x: x['a'].notnull().any())
Out[31]:
date id a b c d
0 1/1/2000 1 10.0 20 10.0 11
2 1/1/2000 3 15.0 20 14.0 11
3 1/1/2000 4 NaN 24 13.0 11
4 1/2/2000 1 10.0 25 10.0 11
6 1/2/2000 3 10.0 26 22.0 11
7 1/2/2000 4 10.0 20 16.0 13
8 1/3/2000 1 10.0 20 10.0 11
10 1/3/2000 3 10.0 20 18.0 11
11 1/3/2000 4 10.0 20 10.0 11
I am sure this has been answered before, but I cannot seem to find the right solution. I have tried pd.merge, combine_first and update, and none of them seems to do the job. They either create a new variable with an _x suffix or they simply stack the new rows below. I wish to merge df1 into df where column c has missing values, doing this for each id on each date.
Example df for task
df
date id a b c d
1/1/2000 1 10 20 10 11
1/1/2000 2 11 21 NaN 11
1/1/2000 3 15 20 NaN 11
1/1/2000 4 12 24 13 11
1/2/2000 1 10 25 10 11
1/2/2000 2 10 20 NaN 15
1/2/2000 3 10 26 NaN 11
1/2/2000 4 10 20 16 13
1/3/2000 1 10 20 10 11
1/3/2000 2 10 20 NaN 11
1/3/2000 3 10 20 NaN 11
1/3/2000 4 10 20 10 11
df1
date id c
12/29/1999 2 1
12/30/1999 3 1
12/30/1999 2 1
12/31/1999 3 1
12/31/1999 2 1
12/31/1999 4 1
1/1/2000 2 1
1/1/2000 3 14
1/2/2000 2 13
1/2/2000 3 22
1/3/2000 2 13
1/3/2000 3 18
desired df after combining df and d1
df
date id a b c d
1/1/2000 1 10 20 10 11
1/1/2000 2 11 21 1 11
1/1/2000 3 15 20 14 11
1/1/2000 4 12 24 13 11
1/2/2000 1 10 25 10 11
1/2/2000 2 10 20 13 15
1/2/2000 3 10 26 22 11
1/2/2000 4 10 20 16 13
1/3/2000 1 10 20 10 11
1/3/2000 2 10 20 13 11
1/3/2000 3 10 20 18 11
1/3/2000 4 10 20 10 11
Let's create a MultiIndex from the date and id columns in both DataFrames, then use Series.fillna to fill the NaN values in column c of df with the corresponding values from df1:
df['c'] = df.set_index(['date', 'id'])['c']\
            .fillna(df1.set_index(['date', 'id'])['c']).tolist()
date id a b c d
0 1/1/2000 1 10 20 10.0 11
1 1/1/2000 2 11 21 1.0 11
2 1/1/2000 3 15 20 14.0 11
3 1/1/2000 4 12 24 13.0 11
4 1/2/2000 1 10 25 10.0 11
5 1/2/2000 2 10 20 13.0 15
6 1/2/2000 3 10 26 22.0 11
7 1/2/2000 4 10 20 16.0 13
8 1/3/2000 1 10 20 10.0 11
9 1/3/2000 2 10 20 13.0 11
10 1/3/2000 3 10 20 18.0 11
11 1/3/2000 4 10 20 10.0 11
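The same idea also works without converting back to a list: set the (date, id) MultiIndex on df itself so that Series.fillna aligns the replacement values by key rather than by position (a sketch using the df/df1 names from the question):
# Align both frames on ('date', 'id'); fillna then matches rows by key.
df = df.set_index(['date', 'id'])
df['c'] = df['c'].fillna(df1.set_index(['date', 'id'])['c'])
df = df.reset_index()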
At least in your example, you can fill the NA values from a plain list of values (without any index alignment), as long as df1 supplies exactly one value per missing entry, in the same row order:
df = df.reset_index(drop=True)
df1 = df1.reset_index(drop=True)
df.loc[df['c'].isna(), 'c'] = list(df1['c'])
Result:
date id a b c d
0 1/1/2000 1 10 20 10.0 11
1 1/1/2000 2 11 21 1.0 11
2 1/1/2000 3 15 20 14.0 11
3 1/1/2000 4 12 24 13.0 11
4 1/2/2000 1 10 25 10.0 11
5 1/2/2000 2 10 20 13.0 15
6 1/2/2000 3 10 26 22.0 11
7 1/2/2000 4 10 20 16.0 13
8 1/3/2000 1 10 20 10.0 11
9 1/3/2000 2 10 20 13.0 11
10 1/3/2000 3 10 20 18.0 11
11 1/3/2000 4 10 20 10.0 11
One column has NaN and some values, and the other column also has NaN and some values. It is not possible for both columns to have values in the same row, but it is possible for both to be NaN. Is there a way I can merge the columns together?
I've tried selecting one column and using df.fillna with a formula, but that doesn't work.
quad_data['new'] = quad_data.apply(lambda x: function(x.a, x.b, const_a, const_b), axis=1)
df1 = pd.merge(df1, quad_data[['a','b','new']], left_on=['a','b'], right_on = ['a','b'], how='inner')
new_x new_y
0 NaN 0.997652
1 NaN 0.861592
2 0 NaN
3 0.997652 NaN
4 0.861592 NaN
5 2.673742 NaN
6 2.618845 NaN
7 NaN 0.432525
8 NaN NaN
9 0.582576 NaN
10 0.50845 NaN
11 NaN 0.341510
12 NaN 0.351510
13 1.404787 NaN
14 2.410116 NaN
15 0.540265 NaN
16 NaN 1.404787
17 NaN 2.410116
18 NaN 0.540265
19 NaN 1.403903
20 1.448987 NaN
combine_first and fillna are good alternatives in general, but the alternatives below also work because your NaNs are mutually exclusive.
Option 1
DataFrame.max
s = quad_data.max(axis=1)
print(s)
0 0.997652
1 0.861592
2 0.000000
3 0.997652
4 0.861592
5 2.673742
6 2.618845
7 0.432525
8 NaN
9 0.582576
10 0.508450
11 0.341510
12 0.351510
13 1.404787
14 2.410116
15 0.540265
16 1.404787
17 2.410116
18 0.540265
19 1.403903
20 1.448987
dtype: float64
Option 2
DataFrame.sum
s = quad_data.sum(axis=1)
print(s)
0 0.997652
1 0.861592
2 0.000000
3 0.997652
4 0.861592
5 2.673742
6 2.618845
7 0.432525
8 NaN
9 0.582576
10 0.508450
11 0.341510
12 0.351510
13 1.404787
14 2.410116
15 0.540265
16 1.404787
17 2.410116
18 0.540265
19 1.403903
20 1.448987
dtype: float64
quad_data['new'] = s
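One caveat with Option 2: in recent pandas versions, sum skips NaN by default and returns 0.0 for an all-NaN row (row 8 here) instead of the NaN shown above. Passing min_count=1 preserves the NaN:
# Keep NaN when every value in the row is missing.
s = quad_data.sum(axis=1, min_count=1)
quad_data['new'] = s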
Try this .. LOL
bfill along the columns axis back-fills each NaN in new_x from new_y in the same row; then select the merged column:
df.bfill(axis=1)['new_x']
Out[45]:
0 0.997652
1 0.861592
2 0.000000
3 0.997652
4 0.861592
5 2.673742
6 2.618845
7 0.432525
8 NaN
9 0.582576
10 0.508450
11 0.341510
12 0.351510
13 1.404787
14 2.410116
15 0.540265
16 1.404787
17 2.410116
18 0.540265
19 1.403903
20 1.448987
Name: new_x, dtype: float64
You can use combine_first
df['new'] = df['new_x'].combine_first(df['new_y'])
Or simply
df['new'] = df['new_x'].fillna(df['new_y'])
You get
new_x new_y new
0 NaN 0.997652 0.997652
1 NaN 0.861592 0.861592
2 0.000000 NaN 0.000000
3 0.997652 NaN 0.997652
4 0.861592 NaN 0.861592
5 2.673742 NaN 2.673742
6 2.618845 NaN 2.618845
7 NaN 0.432525 0.432525
8 NaN NaN NaN
9 0.582576 NaN 0.582576
10 0.508450 NaN 0.508450
11 NaN 0.341510 0.341510
12 NaN 0.351510 0.351510
13 1.404787 NaN 1.404787
14 2.410116 NaN 2.410116
15 0.540265 NaN 0.540265
16 NaN 1.404787 1.404787
17 NaN 2.410116 2.410116
18 NaN 0.540265 0.540265
19 NaN 1.403903 1.403903
20 1.448987 NaN 1.448987
I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
time Group blocks
0 1 A 4
1 2 A 7
2 3 A 12
3 4 A 17
4 5 A 21
5 6 A 26
6 7 A 33
7 8 A 39
8 9 A 48
9 10 A 59
.... .... ....
36 35 A 231
37 1 B 1
38 2 B 1.5
39 3 B 3
40 4 B 5
41 5 B 6
.... .... ....
911 35 Z 349
This is a dataframe holding multiple time-series-like sets of data, with time ranging from min=1 to max=35. Each Group has values over the range time=1 to time=35.
I would like to segment this dataframe into columns Group A, Group B, Group C, etc.
How does one "unconcatenate" this dataframe?
Is that what you want?
In [84]: df.pivot_table(index='time', columns='Group')
Out[84]:
blocks
Group A B
time
1 4.0 1.0
2 7.0 1.5
3 12.0 3.0
4 17.0 5.0
5 21.0 6.0
6 26.0 NaN
7 33.0 NaN
8 39.0 NaN
9 48.0 NaN
10 59.0 NaN
35 231.0 NaN
data:
In [86]: df
Out[86]:
time Group blocks
0 1 A 4.0
1 2 A 7.0
2 3 A 12.0
3 4 A 17.0
4 5 A 21.0
5 6 A 26.0
6 7 A 33.0
7 8 A 39.0
8 9 A 48.0
9 10 A 59.0
36 35 A 231.0
37 1 B 1.0
38 2 B 1.5
39 3 B 3.0
40 4 B 5.0
41 5 B 6.0
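If you would rather end up with a single level of columns, one per group, pass values explicitly; you can then pull out any group's series directly (a sketch):
pivoted = df.pivot_table(index='time', columns='Group', values='blocks')
# pivoted now has one plain column per Group: 'A', 'B', ..., indexed by time
print(pivoted['A'])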