Pandas: Fill missing dataframe values from other dataframe

Pandas: Fill missing dataframe values from other dataframe - python

I have two dataframes of different size:
df1 = pd.DataFrame({'A':[1,2,None,4,None,6,7,8,None,10], 'B':[11,12,13,14,15,16,17,18,19,20]})
df1
A B
0 1.0 11
1 2.0 12
2 NaN 13
3 4.0 14
4 NaN 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
df2 = pd.DataFrame({'A':[2,3,4,5,6,8], 'B':[12,13,14,15,16,18]})
df2['A'] = df2['A'].astype(float)
df2
A B
0 2.0 12
1 3.0 13
2 4.0 14
3 5.0 15
4 6.0 16
5 8.0 18
I need to fill missing values (and only them) in column A of the first dataframe with values from the second dataframe with common key in the column B. It is equivalent to a SQL query:
UPDATE df1 JOIN df2
ON df1.B = df2.B
SET df1.A = df2.A WHERE df1.A IS NULL;
I tried to use answers to similar questions from this site, but it does not work as I need:
df1.fillna(df2)
A B
0 1.0 11
1 2.0 12
2 4.0 13
3 4.0 14
4 6.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
df1.combine_first(df2)
A B
0 1.0 11
1 2.0 12
2 4.0 13
3 4.0 14
4 6.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
Intended output is:
A B
0 1.0 11
1 2.0 12
2 3.0 13
3 4.0 14
4 5.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
How do I get this result?

You were right about using combine_first(), except that both dataframes must share the same index, and the index must be the column B:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()

Related

Pandas: start a new group on every non-NA value

I am looking for a method to create an array of numbers to label groups, based on the value of the 'number' column. If it's possible?
With this abbreviated example DF:
number = [nan,nan,1,nan,nan,nan,2,nan,nan,3,nan,nan,nan,nan,nan,4,nan,nan]
df = pd.DataFrame(columns=['number'])
df = pd.DataFrame.assign(df, number=number)
Ideally I would like to make a new column, 'group', based on the int in column 'number' - so there would be effectively be array's of 1, ,2, 3, etc. FWIW, the DF is 1000's lines long, with sporadically placed int's.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!

You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. if you had zeros: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4

You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64

Create two columns (interval) from a column in pandas

I have a df with a column
Value
3
8
10
15
I would like to obtain a dataframe that has an 'interpolated' From To values as the following:
Value From To
Nan 0 3
3 3 5
Nan 5 8
8 8 10
10 10 12
Nan 12 15
15 15 17
The increment is always 2 when a value exist.

I found a solution
This is the starting dataframe:
df = pd.DataFrame({'Value' : [0, 2,5,7,9,14,21,25]})
df['From'] = df['Value']
df['To'] = df['Value'] + 2
>>> print(df)
Value From To
0 0 0 2
1 2 2 4
2 5 5 7
3 7 7 9
4 9 9 11
5 14 14 16
6 21 21 23
7 25 25 27
And this code allow to habe Fromand To columns without empties intervals
new_rows = []
for i in range(len(df)-1):
if df.loc[i, 'To'] != df.loc[i+1, 'Value']:
new_row = {'Value': np.nan,'From': df.loc[i, 'To'], 'To': df.loc[i+1, 'Value']}
else:
new_row = {'Value': np.nan, 'From': np.nan, 'To': np.nan}
new_rows.append(new_row)
df1 = df.append(new_rows, ignore_index=True)
df1.dropna(how='all', inplace=True)
df1.sort_values(by=['To'], inplace=True)
df1
>>> df1
Value From To
0 0.0 0.0 2.0
1 2.0 2.0 4.0
9 NaN 4.0 5.0
2 5.0 5.0 7.0
3 7.0 7.0 9.0
4 9.0 9.0 11.0
12 NaN 11.0 14.0
5 14.0 14.0 16.0
13 NaN 16.0 21.0
6 21.0 21.0 23.0
14 NaN 23.0 25.0
7 25.0 25.0 27.0
It is probably improvable....

Get surface weighted average of multiple columns in pandas dataframe

I want to take the surface-weighted average of the columns in my dataframe. I have two surface-columns and two U-value-columns. I want to create an extra column 'U_av' (surface-weighted-average U-value) and U_av = (A1*U1 + A2*U2) / (A1+A2). If NaN occurs in one of the columns, NaN should be returned.
Initial df:
ID A1 A2 U1 U2
0 14 2 1.0 10.0 11
1 16 2 2.0 12.0 12
2 18 2 3.0 24.0 13
3 20 2 NaN 8.0 14
4 22 4 5.0 84.0 15
5 24 4 6.0 84.0 16
Desired Output:
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.33
1 16 2 2.0 12.0 12 12
2 18 2 3.0 24.0 13 17.4
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.66
5 24 4 6.0 84.0 16 43.2
Code:
import numpy as np import pandas as pd  df = pd.DataFrame({"ID": [14,16,18,20,22,24],
"A1": [2,2,2,2,4,4], "U1": [10,12,24,8,84,84], "A2": [1,2,3,np.nan,5,6], "U2": [11,12,13,14,15,16]})  print(df)
#the mean of two columns U1 and U2 and dropping NaN is easy (U1+U2/2 in this case)
#but what to do for the surface-weighted mean (U_av = (A1*U1 + A2*U2) / (A1+A2))?
df.loc[:,'Umean'] = df[['U1','U2']].dropna().mean(axis=1) EDIT:
adding to the solutions below:
df["U_av"] = (df.A1.mul(df.U1) + df.A2.mul(df.U2)).div(df[['A1','A2']].sum(axis=1))

Hope I got you correct:
df['U_av'] = (df['A1']*df['U1'] + df['A2']*df['U2']) / (df['A1']+df['A2'])
df
ID A1 U1 A2 U2 U_av
0 14 2 10 1.0 11 10.333333
1 16 2 12 2.0 12 12.000000
2 18 2 24 3.0 13 17.400000
3 20 2 8 NaN 14 NaN
4 22 4 84 5.0 15 45.666667
5 24 4 84 6.0 16 43.200000

Try this code:
numerator = df.A1.mul(df.U1) + (df.A2.mul(df.U2))
denominator = df.A1.add(df.A2)
df["U_av"] = numerator.div(denominator)
df
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.333333
1 16 2 2.0 12.0 12 12.000000
2 18 2 3.0 24.0 13 17.400000
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.666667
5 24 4 6.0 84.0 16 43.200000

How to sum each row and replace each value in row with sum?

If I have a data frame like this:
1 2 3 4 5
1 2 4 5 NaN 3
2 3 5 6 1 2
3 3 1 1 1 1
How do I sum each row and replace the values in that row with the sum so I get something like this:
1 2 3 4 5
1 14 14 14 NaN 14
2 17 17 17 17 17
3 7 7 7 7 7

Use mask for replace all non missing values by sum:
df = df.mask(df.notnull(), df.sum(axis=1), axis=0)
print (df)
1 2 3 4 5
1 14 14 14 NaN 14
2 17 17 17 17.0 17
3 7 7 7 7.0 7
Or use numpy.broadcast_to with numpy.where:
arr = df.values
a = np.broadcast_to(np.nansum(arr, axis=1)[:, None], df.shape)
df = pd.DataFrame(np.where(np.isnan(arr), np.nan, a),
index=df.index, columns=df.columns)
#alternative
df[:] = np.where(np.isnan(arr), np.nan, a)
print (df)
1 2 3 4 5
1 14.0 14.0 14.0 NaN 14.0
2 17.0 17.0 17.0 17.0 17.0
3 7.0 7.0 7.0 7.0 7.0

Using mul
df.notnull().replace(False,np.nan).mul(df.sum(1),axis=0).astype(float)
1 2 3 4 5
1 14.0 14.0 14.0 NaN 14.0
2 17.0 17.0 17.0 17.0 17.0
3 7.0 7.0 7.0 7.0 7.0

fill pandas missing places with different method for different column

I have a pandas dataframe df and I want the final output dataframe final_df as
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 12 a 25 4
3 13 a 29 5
In [18]: final_df
Out[18]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 0.0 5.0
In [19]: dates=[10,11,12,13,14]
That is as you can see I want to fill up the missing dates and fill the corresponding values with 0 for cost column but for column prev I want to fill it with the value from previous date. As the single date may contains multiple symbol I am using the pivot_table.
If I use the ffill
In [12]: df.pivot_table(index="Date",columns="symbol").reindex(dates,method="ffill").stack().reset_index()
Out[12]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 30.0 9.0
3 11 b 33.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 29.0 5.0
This gives almost final data structure (it has 7 rows as final_df) except for cost column where it copies previous data but I want 0 there.
So I tried to fill missing values of different columns with different method, but that gives a problem, like
In [13]: df1=df.pivot_table(index="Date",columns="symbol").reindex(dates)
In [14]: df1["cost"]=df1["cost"].fillna(0)
In [15]: df1["prev"]=df1["prev"].ffill()
In [16]: df1.stack().reset_index()
Out[16]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 12 b 0.0 10.0
6 13 a 29.0 5.0
7 13 b 0.0 10.0
8 14 a 0.0 5.0
9 14 b 0.0 10.0
As you can see in output there is data with symbol "b" for date 12,13,14 but I don't want that because in initial dataframe there was no data data with symbol "b" for date 12,13 and I want to keep it that way and also there must not be one in new date 14 as it follows 13.
So how can I solve this problem and get the final_df output?
EDIT
Here is another example to check the program.
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 14 a 29 5
In [18]: dates=range(10,17)
In [19]: final_df
Out[19]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 11 a 0 9
3 11 b 0 10
4 12 a 0 9
5 12 b 0 10
6 13 a 0 9
7 13 b 0 10
8 14 a 29 5
9 15 a 0 5
10 16 a 0 5
Solution
I have found this way to the problem. Here I using a trick that keeps track of the missing places in in the initial pivot_table and removes finally.
In [44]: df1=df.pivot_table(index="Date",columns='symbol',fill_value="missing").reindex(dates)
In [45]: df1["cost"]= df1["cost"].fillna(0)
In [46]: df1["prev"]=df1["prev"].ffill()
In [47]: df1.stack().replace(to_replace="missing",value=np.nan).dropna().reset_index()
Out[47]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 0.0 9.0
5 12 b 0.0 10.0
6 13 a 0.0 9.0
7 13 b 0.0 10.0
8 14 a 29.0 5.0
9 15 a 0.0 5.0
10 16 a 0.0 5.0

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas: Fill missing dataframe values from other dataframe - python

You were right about using combine_first(), except that both dataframes must share the same index, and the index must be the column B: df1.set_index('B').combine_first(df2.set_index('B')).reset_index()

Related

Pandas: start a new group on every non-NA value

Create two columns (interval) from a column in pandas

Get surface weighted average of multiple columns in pandas dataframe

How to sum each row and replace each value in row with sum?

fill pandas missing places with different method for different column

Categories

Resources