Pandas Merge Resample Result for Missing Rows - python

Note: I asked a worse version of this yesterday, which I quickly deleted. However @FlorianGD left a comment that gave me the answer I needed, so I've posted this cleaned-up version with the solution he advised.
Prepare the Starting Dataframe
I have some dates:
date_dict = {0: '1/31/2010',
1: '12/15/2009',
2: '3/19/2010',
3: '10/25/2009',
4: '1/17/2009',
5: '9/4/2009',
6: '2/21/2010',
7: '8/30/2009',
8: '1/31/2010',
9: '11/30/2008',
10: '2/8/2009',
11: '4/9/2010',
12: '9/13/2009',
13: '10/19/2009',
14: '1/24/2010',
15: '3/8/2009',
16: '11/30/2008',
17: '7/30/2009',
18: '12/12/2009',
19: '3/8/2009',
20: '6/18/2010',
21: '11/30/2008',
22: '12/30/2009',
23: '10/28/2009',
24: '1/28/2010'}
Convert to dataframe and datetime format:
import pandas as pd
from datetime import datetime
df = pd.DataFrame(list(date_dict.items()), columns=['Ind', 'Game_date'])
# Parse the 'm/d/Y' strings into real datetimes (stripping stray whitespace first)
df['Date'] = df['Game_date'].apply(lambda x: datetime.strptime(x.strip(), "%m/%d/%Y"))
df.sort_values(by='Date', inplace=True)
df.reset_index(drop=True, inplace=True)
del df['Ind'], df['Game_date']
df['Count'] = 1
df
         Date  Count
0  2008-11-30      1
1  2008-11-30      1
2  2008-11-30      1
3  2009-01-17      1
4  2009-02-08      1
5  2009-03-08      1
6  2009-03-08      1
7  2009-07-30      1
8  2009-08-30      1
9  2009-09-04      1
10 2009-09-13      1
11 2009-10-19      1
12 2009-10-25      1
13 2009-10-28      1
14 2009-12-12      1
15 2009-12-15      1
16 2009-12-30      1
17 2010-01-24      1
18 2010-01-28      1
19 2010-01-31      1
20 2010-01-31      1
21 2010-02-21      1
22 2010-03-19      1
23 2010-04-09      1
24 2010-06-18      1
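As an aside, the apply/strptime step above can be collapsed into one vectorized call; a sketch that should produce the same Date column for this data:
df['Date'] = pd.to_datetime(df['Game_date'].str.strip(), format='%m/%d/%Y')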
Now I want to resample this dataframe to group the rows into weekly bins, then return that grouped information to the original dataframe.
Use resample() to group for each week and return the count
I resample into weekly bins, with each week ending on a Tuesday:
# Weekly bins ending on Tuesday; sum the Count column per bin
c_index = df.set_index('Date', drop=True).resample('1W-TUE').sum()['Count'].reset_index()
# Drop the empty weeks, then build a 1-based index over the remaining groups
c_index.dropna(subset=['Count'], axis=0, inplace=True)
c_index = c_index.reset_index(drop=True)
c_index['Index_Col'] = c_index.index + 1
c_index
Date Count Index_Col
0 2008-12-02 3.0 1
1 2009-01-20 1.0 2
2 2009-02-10 1.0 3
3 2009-03-10 2.0 4
4 2009-08-04 1.0 5
5 2009-09-01 1.0 6
6 2009-09-08 1.0 7
7 2009-09-15 1.0 8
8 2009-10-20 1.0 9
9 2009-10-27 1.0 10
10 2009-11-03 1.0 11
11 2009-12-15 2.0 12
12 2010-01-05 1.0 13
13 2010-01-26 1.0 14
14 2010-02-02 3.0 15
15 2010-02-23 1.0 16
16 2010-03-23 1.0 17
17 2010-04-13 1.0 18
18 2010-06-22 1.0 19
This shows the number of rows in df that fall within each week in c_index; for the week ending 2008-12-02, for example, 3 rows fell within that week.
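For reference, the same weekly counts can be computed without setting the index, via pd.Grouper; a sketch that should be equivalent:
df.groupby(pd.Grouper(key='Date', freq='W-TUE'))['Count'].sum()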
Broadcast Information back to original df
Now, I want to merge those columns back onto the original df, essentially broadcasting the grouped data onto the individual rows.
This should give:
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 3 1
1 2008-11-30 1 3 1
2 2008-11-30 1 3 1
3 2009-01-17 1 1 2
4 2009-02-08 1 1 3
5 2009-03-08 1 2 4
6 2009-03-08 1 2 4
7 2009-07-30 1 1 5
8 2009-08-30 1 1 6
9 2009-09-04 1 1 7
10 2009-09-13 1 1 8
11 2009-10-19 1 1 9
12 2009-10-25 1 1 10
13 2009-10-28 1 1 11
14 2009-12-12 1 2 12
15 2009-12-15 1 2 12
16 2009-12-30 1 1 13
17 2010-01-24 1 1 14
18 2010-01-28 1 3 15
19 2010-01-31 1 3 15
20 2010-01-31 1 3 15
21 2010-02-21 1 1 16
22 2010-03-19 1 1 17
23 2010-04-09 1 1 18
24 2010-06-18 1 1 19
So the Count_Total represents the total number in that group, and Index_Col tracks the order of the groups.
For example, in this case, the group info for 2010-02-02 has been assigned to 2010-01-28, 2010-01-31, and 2010-01-31.
To do this I have tried the following:
Failed attempt
df.merge(c_index, on='Date', how='left', suffixes=('_Raw', '_Total'))
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 NaN NaN
1 2008-11-30 1 NaN NaN
2 2008-11-30 1 NaN NaN
3 2009-01-17 1 NaN NaN
4 2009-02-08 1 NaN NaN
5 2009-03-08 1 NaN NaN
6 2009-03-08 1 NaN NaN
7 2009-07-30 1 NaN NaN
8 2009-08-30 1 NaN NaN
9 2009-09-04 1 NaN NaN
10 2009-09-13 1 NaN NaN
11 2009-10-19 1 NaN NaN
12 2009-10-25 1 NaN NaN
13 2009-10-28 1 NaN NaN
14 2009-12-12 1 NaN NaN
15 2009-12-15 1 2.0 12.0
16 2009-12-30 1 NaN NaN
17 2010-01-24 1 NaN NaN
18 2010-01-28 1 NaN NaN
19 2010-01-31 1 NaN NaN
20 2010-01-31 1 NaN NaN
21 2010-02-21 1 NaN NaN
22 2010-03-19 1 NaN NaN
23 2010-04-09 1 NaN NaN
24 2010-06-18 1 NaN NaN
Reasons for failure: This merges the two dataframes only where the date in c_index is also present in df. In this example the only week that has had information added is 2009-12-15, as this is the only date common to both dataframes.
How can I do a better merge to get what I'm after?

As indicated by @FlorianGD, this can be achieved using pandas.merge_asof with the direction='forward' argument:
pd.merge_asof(left=df, right=c_index, on='Date', suffixes=('_Raw', '_Total'), direction='forward')
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 3.0 1
1 2008-11-30 1 3.0 1
2 2008-11-30 1 3.0 1
3 2009-01-17 1 1.0 2
4 2009-02-08 1 1.0 3
5 2009-03-08 1 2.0 4
6 2009-03-08 1 2.0 4
7 2009-07-30 1 1.0 5
8 2009-08-30 1 1.0 6
9 2009-09-04 1 1.0 7
10 2009-09-13 1 1.0 8
11 2009-10-19 1 1.0 9
12 2009-10-25 1 1.0 10
13 2009-10-28 1 1.0 11
14 2009-12-12 1 2.0 12
15 2009-12-15 1 2.0 12
16 2009-12-30 1 1.0 13
17 2010-01-24 1 1.0 14
18 2010-01-28 1 3.0 15
19 2010-01-31 1 3.0 15
20 2010-01-31 1 3.0 15
21 2010-02-21 1 1.0 16
22 2010-03-19 1 1.0 17
23 2010-04-09 1 1.0 18
24 2010-06-18 1 1.0 19
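If the data could ever contain dates beyond the last resampled week, a tolerance keeps such rows unmatched (NaN) instead of silently binning them far away; a hedged sketch, where the 7-day cap is my assumption:
pd.merge_asof(left=df, right=c_index, on='Date', suffixes=('_Raw', '_Total'),
              direction='forward', tolerance=pd.Timedelta('7D'))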

Related

Pandas: start a new group on every non-NA value

I am looking for a method to create an array of numbers that labels groups, based on the value of the 'number' column, if that's possible.
With this abbreviated example DF:
import pandas as pd
from numpy import nan

number = [nan, nan, 1, nan, nan, nan, 2, nan, nan, 3, nan, nan, nan, nan, nan, 4, nan, nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the ints in column 'number', so there would effectively be runs labelled 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB: if you had zeros instead of NaNs: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
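These labels then plug straight into groupby if you need per-group aggregates; a sketch where first is just an example aggregation:
# marker value of each group (NaN for the leading group 0)
df.groupby('group')['number'].first()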
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
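Note the result is float because NaN forces a float dtype; to store integer labels in a group column, cast after filling, e.g. this sketch:
df['group'] = df['number'].ffill().fillna(0).astype(int)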

Forward fill non-NA values with last observation carried forward in Python

Suppose I had a column in a dataframe like:
colname
Na
Na
Na
1
2
3
4
Na
Na
Na
Na
2
8
5
44
Na
Na
Does anyone know of a function to fill the non-NA values with the first value in each non-NA run? To produce:
colname
Na
Na
Na
1
1
1
1
Na
Na
Na
Na
2
2
2
2
Na
Na
Use GroupBy.transform with 'first': flag missing values with Series.isna and form group ids with a cumulative sum via Series.cumsum (each NaN starts a new group), then blank the opening row of each group back to NaN using Series.where with Series.duplicated:
s = df['colname'].isna().cumsum()
df['colname'] = df.groupby(s)['colname'].transform('first').where(s.duplicated())
print (df)
   colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
Or filter only the non-missing values by inverting the mask m, processing just those groups:
m = df['colname'].isna()
df.loc[~m, 'colname'] = df[~m].groupby(m.cumsum())['colname'].transform('first')
print (df)
   colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
A solution without groupby:
m = df['colname'].isna()
m1 = m.cumsum().shift().bfill()
m2 = ~m1.duplicated() & m.duplicated(keep=False)
df['colname'] = df['colname'].where(m2).ffill().mask(m)
print (df)
   colname
0 NaN
1 NaN
2 NaN
3 1.0
4 1.0
5 1.0
6 1.0
7 NaN
8 NaN
9 NaN
10 NaN
11 2.0
12 2.0
13 2.0
14 2.0
15 NaN
16 NaN
You could try groupby and cumsum with shift and transform('first'):
>>> df.groupby(df['colname'].isna().ne(df['colname'].isna().shift()).cumsum()).transform('first')
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
>>>
Or try something like:
>>> x = df.groupby(df['colname'].isna().cumsum()).transform('first')
>>> x.loc[~x.duplicated()] = np.nan
>>> x
colname
0 NaN
1 NaN
2 NaN
3 1
4 1
5 1
6 1
7 NaN
8 NaN
9 NaN
10 NaN
11 2
12 2
13 2
14 2
15 NaN
16 NaN
>>>
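For reference, the two ideas above compress into a single chained expression; a sketch assuming the same 'colname' column:
s = df['colname'].isna()
df['colname'] = (df['colname']
                 .groupby(s.cumsum())  # each NaN starts a new run
                 .transform('first')   # first non-null value of the run
                 .mask(s))             # restore the original NaNs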

Pandas merge 2 dataframes

I am trying to merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes contain one identical row (although there could be several overlaps).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 NaN 27 NaN
09.01.2021 NaN 28 NaN
10.01.2021 NaN 29 NaN
11.01.2021 NaN 30 NaN
12.01.2021 NaN 31 NaN
13.01.2021 NaN 32 NaN
I've tried
df3 = df1.merge(df2, on='Date', how='outer')
but it gives extra A, B, C columns. Could you give some idea how to get df3?
Thanks a lot.
Merge with how='outer' and without specifying on (by default on is the intersection of the columns of the two DataFrames, here ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
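For contrast, spelling out both join keys gives the same result and shows why the original attempt grew extra columns: with on='Date' alone, B sat outside the join keys and was kept twice (as B_x and B_y). A sketch:
df3 = df1.merge(df2, on=['Date', 'B'], how='outer')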
Assuming you always want to keep the first full version, you can concat df2 onto the end of df1 and drop duplicates on the Date column (a reset_index sketch follows the output).
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN
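If you would rather have a clean 0..n-1 index than the leftover labels above, reset it after dropping duplicates; a small sketch:
df3 = pd.concat([df1, df2]).drop_duplicates(subset='Date').reset_index(drop=True)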

Get surface weighted average of multiple columns in pandas dataframe

I want to take the surface-weighted average of the columns in my dataframe. I have two surface columns (A1, A2) and two U-value columns (U1, U2), and I want to create an extra column 'U_av' (the surface-weighted average U-value), where U_av = (A1*U1 + A2*U2) / (A1+A2). If NaN occurs in one of the columns, NaN should be returned.
Initial df:
ID A1 A2 U1 U2
0 14 2 1.0 10.0 11
1 16 2 2.0 12.0 12
2 18 2 3.0 24.0 13
3 20 2 NaN 8.0 14
4 22 4 5.0 84.0 15
5 24 4 6.0 84.0 16
Desired Output:
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.33
1 16 2 2.0 12.0 12 12
2 18 2 3.0 24.0 13 17.4
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.66
5 24 4 6.0 84.0 16 43.2
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [14, 16, 18, 20, 22, 24],
                   "A1": [2, 2, 2, 2, 4, 4],
                   "U1": [10, 12, 24, 8, 84, 84],
                   "A2": [1, 2, 3, np.nan, 5, 6],
                   "U2": [11, 12, 13, 14, 15, 16]})

print(df)
#the mean of two columns U1 and U2 and dropping NaN is easy (U1+U2/2 in this case)
#but what to do for the surface-weighted mean (U_av = (A1*U1 + A2*U2) / (A1+A2))?
df.loc[:,'Umean'] = df[['U1','U2']].dropna().mean(axis=1)
EDIT:
adding to the solutions below:
df["U_av"] = (df.A1.mul(df.U1) + df.A2.mul(df.U2)).div(df[['A1','A2']].sum(axis=1))

Hope I got you right:
df['U_av'] = (df['A1']*df['U1'] + df['A2']*df['U2']) / (df['A1']+df['A2'])
df
ID A1 U1 A2 U2 U_av
0 14 2 10 1.0 11 10.333333
1 16 2 12 2.0 12 12.000000
2 18 2 24 3.0 13 17.400000
3 20 2 8 NaN 14 NaN
4 22 4 84 5.0 15 45.666667
5 24 4 84 6.0 16 43.200000
Try this code:
numerator = df.A1.mul(df.U1) + (df.A2.mul(df.U2))
denominator = df.A1.add(df.A2)
df["U_av"] = numerator.div(denominator)
df
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.333333
1 16 2 2.0 12.0 12 12.000000
2 18 2 3.0 24.0 13 17.400000
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.666667
5 24 4 6.0 84.0 16 43.200000
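If more surface/U pairs turn up later, the same weighted average generalizes; a sketch assuming hypothetical paired columns (A1, U1), (A2, U2), and so on:
def weighted_u(df, pairs):
    # pairs: list of (area_col, u_col) tuples, e.g. [('A1', 'U1'), ('A2', 'U2')]
    num = sum(df[a] * df[u] for a, u in pairs)
    den = sum(df[a] for a, _ in pairs)
    return num / den

df['U_av'] = weighted_u(df, [('A1', 'U1'), ('A2', 'U2')])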

Pandas: Fill missing dataframe values from other dataframe

I have two dataframes of different size:
df1 = pd.DataFrame({'A':[1,2,None,4,None,6,7,8,None,10], 'B':[11,12,13,14,15,16,17,18,19,20]})
df1
A B
0 1.0 11
1 2.0 12
2 NaN 13
3 4.0 14
4 NaN 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
df2 = pd.DataFrame({'A':[2,3,4,5,6,8], 'B':[12,13,14,15,16,18]})
df2['A'] = df2['A'].astype(float)
df2
A B
0 2.0 12
1 3.0 13
2 4.0 14
3 5.0 15
4 6.0 16
5 8.0 18
I need to fill the missing values (and only them) in column A of the first dataframe with values from the second dataframe, matched on the common key in column B. It is equivalent to the SQL query:
UPDATE df1 JOIN df2
ON df1.B = df2.B
SET df1.A = df2.A WHERE df1.A IS NULL;
I tried to use answers to similar questions from this site, but they do not give what I need:
df1.fillna(df2)
A B
0 1.0 11
1 2.0 12
2 4.0 13
3 4.0 14
4 6.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
df1.combine_first(df2)
A B
0 1.0 11
1 2.0 12
2 4.0 13
3 4.0 14
4 6.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
Intended output is:
A B
0 1.0 11
1 2.0 12
2 3.0 13
3 4.0 14
4 5.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
How do I get this result?
You were right about using combine_first(), except that both dataframes must share the same index, and that index must be column B:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()
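An alternative that touches only the NaNs is a B -> A lookup built from df2 with map; a sketch that should give the intended output:
# fill missing A values via the B -> A mapping taken from df2
df1['A'] = df1['A'].fillna(df1['B'].map(df2.set_index('B')['A']))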
