One column has NaN and some values, and the other column also has NaN and some values. It is not possible for both columns to have values on the same row, but it is possible for both to be NaN. Is there a way I can merge the columns together?
I've tried selecting one column and using df.fillna with a formula, but that doesn't work.
quad_data['new'] = quad_data.apply(lambda x: function(x.a, x.b, const_a, const_b), axis=1)
df1 = pd.merge(df1, quad_data[['a','b','new']], left_on=['a','b'], right_on = ['a','b'], how='inner')
new_x new_y
0 NaN 0.997652
1 NaN 0.861592
2 0 NaN
3 0.997652 NaN
4 0.861592 NaN
5 2.673742 NaN
6 2.618845 NaN
7 NaN 0.432525
8 NaN NaN
9 0.582576 NaN
10 0.50845 NaN
11 NaN 0.341510
12 NaN 0.351510
13 1.404787 NaN
14 2.410116 NaN
15 0.540265 NaN
16 NaN 1.404787
17 NaN 2.410116
18 NaN 0.540265
19 NaN 1.403903
20 1.448987 NaN
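For reference, a frame of this shape can be rebuilt like so (first five rows only; the column names new_x and new_y are assumed from the printout above):
import numpy as np
import pandas as pd

quad_data = pd.DataFrame({
    'new_x': [np.nan, np.nan, 0.0, 0.997652, 0.861592],
    'new_y': [0.997652, 0.861592, np.nan, np.nan, np.nan],
})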
combine_first and fillna are good alternatives in general, but the following also work because your NaNs are mutually exclusive.
Option 1
df.max
s = quad_data.max(axis=1)  # row-wise max; NaN is skipped, so whichever value is present wins
print(s)
0 0.997652
1 0.861592
2 0.000000
3 0.997652
4 0.861592
5 2.673742
6 2.618845
7 0.432525
8 NaN
9 0.582576
10 0.508450
11 0.341510
12 0.351510
13 1.404787
14 2.410116
15 0.540265
16 1.404787
17 2.410116
18 0.540265
19 1.403903
20 1.448987
dtype: float64
Option 2
df.sum
s = quad_data.sum(axis=1, min_count=1)  # min_count=1 keeps rows where both values are NaN as NaN
print(s)
0 0.997652
1 0.861592
2 0.000000
3 0.997652
4 0.861592
5 2.673742
6 2.618845
7 0.432525
8 NaN
9 0.582576
10 0.508450
11 0.341510
12 0.351510
13 1.404787
14 2.410116
15 0.540265
16 1.404787
17 2.410116
18 0.540265
19 1.403903
20 1.448987
dtype: float64
quad_data['new'] = s
Try this:
df.bfill(axis=1)['new_x']
Out[45]:
0 0.997652
1 0.861592
2 0.000000
3 0.997652
4 0.861592
5 2.673742
6 2.618845
7 0.432525
8 NaN
9 0.582576
10 0.508450
11 0.341510
12 0.351510
13 1.404787
14 2.410116
15 0.540265
16 1.404787
17 2.410116
18 0.540265
19 1.403903
20 1.448987
Name: new_x, dtype: float64
You can use combine_first
df['new'] = df['new_x'].combine_first(df['new_y'])
Or simply
df['new'] = df['new_x'].fillna(df['new_y'])
You get
new_x new_y new
0 NaN 0.997652 0.997652
1 NaN 0.861592 0.861592
2 0.000000 NaN 0.000000
3 0.997652 NaN 0.997652
4 0.861592 NaN 0.861592
5 2.673742 NaN 2.673742
6 2.618845 NaN 2.618845
7 NaN 0.432525 0.432525
8 NaN NaN NaN
9 0.582576 NaN 0.582576
10 0.508450 NaN 0.508450
11 NaN 0.341510 0.341510
12 NaN 0.351510 0.351510
13 1.404787 NaN 1.404787
14 2.410116 NaN 2.410116
15 0.540265 NaN 0.540265
16 NaN 1.404787 1.404787
17 NaN 2.410116 2.410116
18 NaN 0.540265 0.540265
19 NaN 1.403903 1.403903
20 1.448987 NaN 1.448987
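Note that if both columns could ever be non-null on the same row, combine_first (and the fillna variant) would keep the value from new_x, whereas the max/sum options above would combine the two values; with mutually exclusive NaNs all of these give the same result.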
Related
Suppose I had a column in a dataframe like:
colname
Na
Na
Na
1
2
3
4
Na
Na
Na
Na
2
8
5
44
Na
Na
Does anyone know of a function to forward fill the non-NA values with the first value in each non-NA run? To produce:
colname
Na
Na
Na
1
1
1
1
Na
Na
Na
Na
2
2
2
2
Na
Na
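For reproduction, a frame with this shape can be built as follows (treating the Na entries as NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'colname': [np.nan, np.nan, np.nan, 1, 2, 3, 4,
                               np.nan, np.nan, np.nan, np.nan, 2, 8, 5, 44,
                               np.nan, np.nan]})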
Use GroupBy.transform with 'first': build group labels by flagging missing values with Series.isna and taking the cumulative sum with Series.cumsum, then restore the NaNs with Series.where and Series.duplicated:
s = df['colname'].isna().cumsum()
df['colname'] = df.groupby(s)['colname'].transform('first').where(s.duplicated())
print(df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
Or filter only the non-missing values by inverting the mask m and processing only those groups:
m = df['colname'].isna()
df.loc[~m, 'colname'] = df[~m].groupby(m.cumsum())['colname'].transform('first')
print(df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
Solution without groupby:
m = df['colname'].isna()
m1 = m.cumsum().shift().bfill()
m2 = ~m1.duplicated() & m.duplicated(keep=False)
df['colname'] = df['colname'].where(m2).ffill().mask(m)
print(df)
colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
You could try groupby and cumsum with shift and transform('first'):
>>> df.groupby(df['colname'].isna().ne(df['colname'].isna().shift()).cumsum()).transform('first')
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
>>>
Or try something like:
>>> import numpy as np
>>> x = df.groupby(df['colname'].isna().cumsum()).transform('first')
>>> x.loc[~x.duplicated()] = np.nan
>>> x
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
>>>
I am trying to merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes have one same row (although there could be several overlappings).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 NaN 27 NaN
09.01.2021 NaN 28 NaN
10.01.2021 NaN 29 NaN
11.01.2021 NaN 30 NaN
12.01.2021 NaN 31 NaN
13.01.2021 NaN 32 NaN
I've tried
df3=df1.merge(df2, on='Date', how='outer') but it gives extra A,B,C columns. Could you give some idea how to get df3?
Thanks a lot.
Merge with how='outer' without specifying on (by default on is the intersection of the columns of the two DataFrames, in this case ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
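A small optional follow-up, not part of the original answer: if Date should end up as a real datetime and in chronological order, it can be parsed and sorted afterwards (assuming day-first strings like '01.01.2021'):
df3['Date'] = pd.to_datetime(df3['Date'], format='%d.%m.%Y')
df3 = df3.sort_values('Date').reset_index(drop=True)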
Assuming you always want to keep the first full version, you can concat df2 onto the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
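If a clean 0..n index matters, the same answer can end with a reset_index (a minor optional addition, not in the original):
df3 = (pd.concat([df1, df2])
         .drop_duplicates(subset='Date')
         .reset_index(drop=True))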
full['Name'].head(10)
here is a Series which show like below:
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
5 Mr
6 Mr
7 Master
8 Mrs
9 Mrs
Name: Name, dtype: object
And after using the map dict function:
full['Name']=full['Name'].map({'Mr':1})
full['Name'].head(100)
it turns out to be:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 NaN
28 NaN
29 NaN
And it is strange that I have succeeded in doing this on another Series in the DataFrame full, which really confuses me.
Please help.
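For what it's worth, Series.map with a dict returns NaN for any value that is not a key of the dict, so mapping {'Mr': 1} turns every non-'Mr' title into NaN; and if the cell is run a second time, the 1s produced by the first run are not keys either and the whole column becomes NaN. A small sketch of that behaviour on made-up data:
import pandas as pd

s = pd.Series(['Mr', 'Mrs', 'Miss'])
once = s.map({'Mr': 1})      # 'Mr' -> 1, everything else -> NaN
twice = once.map({'Mr': 1})  # 1 is not a key, so even the mapped rows become NaN
print(once.tolist())         # [1.0, nan, nan]
print(twice.tolist())        # [nan, nan, nan]
If only some values should be mapped and the rest kept, Series.replace({'Mr': 1}) leaves unmapped values untouched.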
I have the following dataframe grouped by datafile, and I want to fillna(method='bfill') only for those 'groups' that contain more than half of the data.
df.groupby('datafile').count()
datafile column1 column2 column3 column4
datafile1 5 5 3 4
datafile2 5 5 4 5
datafile3 5 5 5 5
datafile4 5 5 0 0
datafile5 5 5 1 1
As you can see in the df above, I'd like to fill the groups that contain most of the information, but not those that have little or none. So I was thinking of a condition along the lines of: fillna for groups that have more than half of the counts, and don't fill the rest.
I'm struggling with how to set up the condition, since it involves working with the result of a groupby and the original df.
Help is appreciated.
example df:
index datafile column1 column2 column3 column4
0 datafile1 5 5 NaN 20
1 datafile1 6 6 NaN 21
2 datafile1 7 7 9 NaN
3 datafile1 8 8 10 23
4 datafile1 9 9 11 24
5 datafile2 3 3 2 7
6 datafile2 4 4 3 8
7 datafile2 5 5 4 9
8 datafile2 6 6 NaN 10
9 datafile2 7 7 6 11
10 datafile3 10 10 24 4
11 datafile3 11 11 25 5
12 datafile3 12 12 26 6
13 datafile3 13 13 27 7
14 datafile3 14 14 28 8
15 datafile4 4 4 NaN NaN
16 datafile4 5 5 NaN NaN
17 datafile4 6 6 NaN NaN
18 datafile4 7 7 NaN NaN
19 datafile4 8 8 NaN NaN
19 datafile4 9 9 NaN NaN
20 datafile5 7 7 1 3
21 datafile5 8 8 NaN NaN
22 datafile5 9 9 NaN NaN
23 datafile5 10 10 NaN NaN
24 datafile5 11 1 NaN NaN
expected output df:
index datafile column1 column2 column3 column4
0 datafile1 5 5 9 20
1 datafile1 6 6 9 21
2 datafile1 7 7 9 23
3 datafile1 8 8 10 23
4 datafile1 9 9 11 24
5 datafile2 3 3 2 7
6 datafile2 4 4 3 8
7 datafile2 5 5 4 9
8 datafile2 6 6 6 10
9 datafile2 7 7 6 11
10 datafile3 10 10 24 4
11 datafile3 11 11 25 5
12 datafile3 12 12 26 6
13 datafile3 13 13 27 7
14 datafile3 14 14 28 8
15 datafile4 4 4 NaN NaN
16 datafile4 5 5 NaN NaN
17 datafile4 6 6 NaN NaN
18 datafile4 7 7 NaN NaN
19 datafile4 8 8 NaN NaN
19 datafile4 9 9 NaN NaN
20 datafile5 7 7 1 3
21 datafile5 8 8 NaN NaN
22 datafile5 9 9 NaN NaN
23 datafile5 10 10 NaN NaN
24 datafile5 11 1 NaN NaN
If the proportion of non-null values in a column within a group is greater than or equal to 0.5, that column is filled with the bfill method:
rate = 0.5
not_na = df.notna()
g = not_na.groupby(df['datafile'])
df_fill = (
    df.bfill()
      .where(
          g.transform('sum')
           .div(g['datafile'].transform('size'), axis=0)
           .ge(rate) |
          not_na
      )
)
print(df_fill)
index datafile column1 column2 column3 column4
0 0 datafile1 5 5 9.0 20.0
1 1 datafile1 6 6 9.0 21.0
2 2 datafile1 7 7 9.0 23.0
3 3 datafile1 8 8 10.0 23.0
4 4 datafile1 9 9 11.0 24.0
5 5 datafile2 3 3 2.0 7.0
6 6 datafile2 4 4 3.0 8.0
7 7 datafile2 5 5 4.0 9.0
8 8 datafile2 6 6 6.0 10.0
9 9 datafile2 7 7 6.0 11.0
10 10 datafile3 10 10 24.0 4.0
11 11 datafile3 11 11 25.0 5.0
12 12 datafile3 12 12 26.0 6.0
13 13 datafile3 13 13 27.0 7.0
14 14 datafile3 14 14 28.0 8.0
15 15 datafile4 4 4 NaN NaN
16 16 datafile4 5 5 NaN NaN
17 17 datafile4 6 6 NaN NaN
18 18 datafile4 7 7 NaN NaN
19 19 datafile4 8 8 NaN NaN
20 19 datafile4 9 9 NaN NaN
21 20 datafile5 7 7 1.0 3.0
22 21 datafile5 8 8 NaN NaN
23 22 datafile5 9 9 NaN NaN
24 23 datafile5 10 10 NaN NaN
25 24 datafile5 11 1 NaN NaN
Also we can use:
m = (not_na.groupby(df['datafile'], sort=False)
           .sum()
           .div(df['datafile'].value_counts(), axis=0)
           .ge(rate)
           .reindex(df['datafile']).reset_index(drop=True))
df.bfill().where(m | not_na)
Both methods return similar results for the sample dataframe:
%%timeit
rate = 0.5
not_na = df.notna()
m = (not_na.groupby(df['datafile'], sort=False)
           .sum()
           .div(df['datafile'].value_counts(), axis=0)
           .ge(rate)
           .reindex(df['datafile']).reset_index(drop=True))
df.bfill().where(m | not_na)
11.1 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
rate = 0.5
not_na = df.notna()
g = not_na.groupby(df['datafile'])
df_fill = (df.bfill()
             .where(g.transform('sum')
                     .div(g['datafile'].transform('size'), axis=0)
                     .ge(rate) |
                    not_na))
12.9 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Use pandas.groupby.filter
def most_not_null(x):
    return x.isnull().sum().sum() < (x.notnull().sum().sum() // 2)
filtered_groups = df.groupby('datafile').filter(most_not_null)
df.loc[filtered_groups.index] = filtered_groups.bfill()
Output
>>> df
index datafile column1 column2 column3 column4
0 0 datafile1 5 5 9.0 20.0
1 1 datafile1 6 6 9.0 21.0
2 2 datafile1 7 7 9.0 23.0
3 3 datafile1 8 8 10.0 23.0
4 4 datafile1 9 9 11.0 24.0
5 5 datafile2 3 3 2.0 7.0
6 6 datafile2 4 4 3.0 8.0
7 7 datafile2 5 5 4.0 9.0
8 8 datafile2 6 6 6.0 10.0
9 9 datafile2 7 7 6.0 11.0
10 10 datafile3 10 10 24.0 4.0
11 11 datafile3 11 11 25.0 5.0
12 12 datafile3 12 12 26.0 6.0
13 13 datafile3 13 13 27.0 7.0
14 14 datafile3 14 14 28.0 8.0
15 15 datafile4 4 4 NaN NaN
16 16 datafile4 5 5 NaN NaN
17 17 datafile4 6 6 NaN NaN
18 18 datafile4 7 7 NaN NaN
19 19 datafile4 8 8 NaN NaN
20 19 datafile4 9 9 NaN NaN
21 20 datafile5 7 7 1.0 3.0
22 21 datafile5 8 8 NaN NaN
23 22 datafile5 9 9 NaN NaN
24 23 datafile5 10 10 NaN NaN
25 24 datafile5 11 1 NaN NaN
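One hedged caveat on the filter approach: bfill over the concatenated filtered groups can pull a value across the boundary between two adjacent kept datafiles. If the fill must stay inside each group, a group-wise variant could look like this sketch (value_cols is assumed from the example columns):
value_cols = ['column1', 'column2', 'column3', 'column4']
filtered_groups = df.groupby('datafile').filter(most_not_null)
df.loc[filtered_groups.index, value_cols] = (
    filtered_groups.groupby('datafile')[value_cols].bfill()
)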
Note: I asked a worse version of this yesterday, which I quickly deleted. However, #FlorianGD left a comment which gave me the answer I needed, so I've posted this anyway with the solution he advised.
1 Prepare start dataframe
I have some dates:
date_dict = {0: '1/31/2010',
1: '12/15/2009',
2: '3/19/2010',
3: '10/25/2009',
4: '1/17/2009',
5: '9/4/2009',
6: '2/21/2010',
7: '8/30/2009',
8: '1/31/2010',
9: '11/30/2008',
10: '2/8/2009',
11: '4/9/2010',
12: '9/13/2009',
13: '10/19/2009',
14: '1/24/2010',
15: '3/8/2009',
16: '11/30/2008',
17: '7/30/2009',
18: '12/12/2009',
19: '3/8/2009',
20: '6/18/2010',
21: '11/30/2008',
22: '12/30/2009',
23: '10/28/2009',
24: '1/28/2010'}
Convert to dataframe and datetime format:
import pandas as pd
from datetime import datetime
df = pd.DataFrame(list(date_dict.items()), columns=['Ind', 'Game_date'])
df['Date'] = df['Game_date'].apply(lambda x: datetime.strptime(x.strip(), "%m/%d/%Y"))
df.sort_values(by='Date', inplace=True)
df.reset_index(drop=True, inplace=True)
del df['Ind'], df['Game_date']
df['Count'] = 1
df
Date
0 2008-11-30
1 2008-11-30
2 2008-11-30
3 2009-01-17
4 2009-02-08
5 2009-03-08
6 2009-03-08
7 2009-07-30
8 2009-08-30
9 2009-09-04
10 2009-09-13
11 2009-10-19
12 2009-10-25
13 2009-10-28
14 2009-12-12
15 2009-12-15
16 2009-12-30
17 2010-01-24
18 2010-01-28
19 2010-01-31
20 2010-01-31
21 2010-02-21
22 2010-03-19
23 2010-04-09
24 2010-06-18
Now what I want to do is resample this dataframe to group rows into weekly bins, and then return that grouped information to the original dataframe.
2 Use resample() to group for each week and return count
I resample into weekly bins ending on Tuesday:
c_index = df.set_index('Date', drop=True).resample('1W-TUE').sum()['Count'].reset_index()
c_index.dropna(subset=['Count'], axis=0, inplace=True)
c_index = c_index.reset_index(drop=True)
c_index['Index_Col'] = c_index.index + 1
c_index
Date Count Index_Col
0 2008-12-02 3.0 1
1 2009-01-20 1.0 2
2 2009-02-10 1.0 3
3 2009-03-10 2.0 4
4 2009-08-04 1.0 5
5 2009-09-01 1.0 6
6 2009-09-08 1.0 7
7 2009-09-15 1.0 8
8 2009-10-20 1.0 9
9 2009-10-27 1.0 10
10 2009-11-03 1.0 11
11 2009-12-15 2.0 12
12 2010-01-05 1.0 13
13 2010-01-26 1.0 14
14 2010-02-02 3.0 15
15 2010-02-23 1.0 16
16 2010-03-23 1.0 17
17 2010-04-13 1.0 18
18 2010-06-22 1.0 19
This shows the number of rows in df that fall within each week in c_index; for example, for the week ending 2008-12-02 there were 3 rows in that week.
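One hedged caveat: on recent pandas versions resample(...).sum() returns 0 rather than NaN for empty weeks, so the dropna(subset=['Count']) step above may no longer drop them; an equivalent filter would be:
c_index = c_index[c_index['Count'] > 0].reset_index(drop=True)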
3 Broadcast information back to original df
Now, I want to merge those columns back onto the original df, essentially broadcasting the grouped data onto the individual rows.
This should give:
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 3 1
1 2008-11-30 1 3 1
2 2008-11-30 1 3 1
3 2009-01-17 1 1 2
4 2009-02-08 1 1 3
5 2009-03-08 1 2 4
6 2009-03-08 1 2 4
7 2009-07-30 1 1 5
8 2009-08-30 1 1 6
9 2009-09-04 1 1 7
10 2009-09-13 1 1 8
11 2009-10-19 1 1 9
12 2009-10-25 1 1 10
13 2009-10-28 1 1 11
14 2009-12-12 1 2 12
15 2009-12-15 1 2 12
16 2009-12-30 1 1 13
17 2010-01-24 1 1 14
18 2010-01-28 1 3 15
19 2010-01-31 1 3 15
20 2010-01-31 1 3 15
21 2010-02-21 1 1 16
22 2010-03-19 1 1 17
23 2010-04-09 1 1 18
24 2010-06-18 1 1 19
So the Count_Total represents the total number in that group, and Index_Col tracks the order of the groups.
For example, in this case, the group info for 2010-02-02 has been assigned to 2010-01-28, 2010-01-31, and 2010-01-31.
To do this I have tried the following:
Failed attempt
df.merge(c_index, on='Date', how='left', suffixes=('_Raw', '_Total'))
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 NaN NaN
1 2008-11-30 1 NaN NaN
2 2008-11-30 1 NaN NaN
3 2009-01-17 1 NaN NaN
4 2009-02-08 1 NaN NaN
5 2009-03-08 1 NaN NaN
6 2009-03-08 1 NaN NaN
7 2009-07-30 1 NaN NaN
8 2009-08-30 1 NaN NaN
9 2009-09-04 1 NaN NaN
10 2009-09-13 1 NaN NaN
11 2009-10-19 1 NaN NaN
12 2009-10-25 1 NaN NaN
13 2009-10-28 1 NaN NaN
14 2009-12-12 1 NaN NaN
15 2009-12-15 1 2.0 12.0
16 2009-12-30 1 NaN NaN
17 2010-01-24 1 NaN NaN
18 2010-01-28 1 NaN NaN
19 2010-01-31 1 NaN NaN
20 2010-01-31 1 NaN NaN
21 2010-02-21 1 NaN NaN
22 2010-03-19 1 NaN NaN
23 2010-04-09 1 NaN NaN
24 2010-06-18 1 NaN NaN
Reason for failure: this merges the two dataframes only where the date in c_index is also present in df. In this example the only week that has had information added is 2009-12-15, as this is the only date common to both dataframes.
How can I do a better merge to get what I'm after?
As indicated by #FlorianGD, this can be achieved using pandas.merge_asof with the direction='forward' argument:
pd.merge_asof(left=df, right=c_index, on='Date', suffixes=('_Raw', '_Total'), direction='forward')
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 3.0 1
1 2008-11-30 1 3.0 1
2 2008-11-30 1 3.0 1
3 2009-01-17 1 1.0 2
4 2009-02-08 1 1.0 3
5 2009-03-08 1 2.0 4
6 2009-03-08 1 2.0 4
7 2009-07-30 1 1.0 5
8 2009-08-30 1 1.0 6
9 2009-09-04 1 1.0 7
10 2009-09-13 1 1.0 8
11 2009-10-19 1 1.0 9
12 2009-10-25 1 1.0 10
13 2009-10-28 1 1.0 11
14 2009-12-12 1 2.0 12
15 2009-12-15 1 2.0 12
16 2009-12-30 1 1.0 13
17 2010-01-24 1 1.0 14
18 2010-01-28 1 3.0 15
19 2010-01-31 1 3.0 15
20 2010-01-31 1 3.0 15
21 2010-02-21 1 1.0 16
22 2010-03-19 1 1.0 17
23 2010-04-09 1 1.0 18
24 2010-06-18 1 1.0 19
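For clarity, direction='forward' makes merge_asof match each row of df to the first row of c_index whose Date is greater than or equal to that row's Date, i.e. the week-ending label the row falls into; both frames must already be sorted on Date, which they are here.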