I have two dataframes of different size:
df1 = pd.DataFrame({'A':[1,2,None,4,None,6,7,8,None,10], 'B':[11,12,13,14,15,16,17,18,19,20]})
df1
A B
0 1.0 11
1 2.0 12
2 NaN 13
3 4.0 14
4 NaN 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
df2 = pd.DataFrame({'A':[2,3,4,5,6,8], 'B':[12,13,14,15,16,18]})
df2['A'] = df2['A'].astype(float)
df2
A B
0 2.0 12
1 3.0 13
2 4.0 14
3 5.0 15
4 6.0 16
5 8.0 18
I need to fill the missing values (and only those) in column A of the first dataframe with values from the second dataframe, matching rows on the common key in column B. It is equivalent to this SQL query:
UPDATE df1 JOIN df2
ON df1.B = df2.B
SET df1.A = df2.A WHERE df1.A IS NULL;
I tried the answers to similar questions on this site, but they do not work as I need:
df1.fillna(df2)
A B
0 1.0 11
1 2.0 12
2 4.0 13
3 4.0 14
4 6.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
df1.combine_first(df2)
A B
0 1.0 11
1 2.0 12
2 4.0 13
3 4.0 14
4 6.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
Intended output is:
A B
0 1.0 11
1 2.0 12
2 3.0 13
3 4.0 14
4 5.0 15
5 6.0 16
6 7.0 17
7 8.0 18
8 NaN 19
9 10.0 20
How do I get this result?
You were right about using combine_first(), except that both dataframes must share the same index, and the index must be the column B:
df1.set_index('B').combine_first(df2.set_index('B')).reset_index()
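As an alternative sketch (not the approach from the answer above), the same fill can be done without changing the index by mapping the key column through a lookup Series. The names `lookup` and `filled` are illustrative, and this assumes the values in column B of df2 are unique:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, None, 4, None, 6, 7, 8, None, 10],
                    'B': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]})
df2 = pd.DataFrame({'A': [2.0, 3.0, 4.0, 5.0, 6.0, 8.0],
                    'B': [12, 13, 14, 15, 16, 18]})

# Lookup table: key B -> value A (assumes B is unique in df2)
lookup = df2.set_index('B')['A']

filled = df1.copy()
# Only the NaN entries of A are replaced; keys missing from df2 stay NaN
filled['A'] = filled['A'].fillna(filled['B'].map(lookup))
print(filled)
```

This keeps the original column order (A, B) and the original row index of df1.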
I am looking for a way to create an array of numbers that labels groups, based on the values in the 'number' column. Is that possible?
With this abbreviated example DF:
import numpy as np
import pandas as pd
nan = np.nan
number = [nan,nan,1,nan,nan,nan,2,nan,nan,3,nan,nan,nan,nan,nan,4,nan,nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the int in column 'number', so there would effectively be runs of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB. If the gaps were marked with zeros instead of NaN: df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
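The two answers above give the same labels here only because the markers happen to be 1, 2, 3, 4. A small sketch with made-up marker values (5 and 9 are assumptions, not taken from the question) shows how they differ: cumsum numbers the groups consecutively, while forward fill reuses the marker value itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data: markers are not consecutive integers
s = pd.Series([np.nan, 5, np.nan, np.nan, 9, np.nan])

# Counts how many markers have been seen so far: 0, 1, 1, 1, 2, 2
by_cumsum = s.notna().cumsum()

# Propagates the last marker value downwards: 0, 5, 5, 5, 9, 9
by_ffill = s.ffill().fillna(0).astype(int)

print(pd.DataFrame({'number': s, 'cumsum': by_cumsum, 'ffill': by_ffill}))
```

Which one is preferable depends on whether the group label should be a running counter or the marker value.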
I have a df with a column
Value
3
8
10
15
I would like to obtain a dataframe with 'interpolated' From and To columns, like the following:
Value From To
NaN 0 3
3 3 5
NaN 5 8
8 8 10
10 10 12
NaN 12 15
15 15 17
The increment is always 2 when a value exists.
I found a solution
This is the starting dataframe:
df = pd.DataFrame({'Value' : [0, 2,5,7,9,14,21,25]})
df['From'] = df['Value']
df['To'] = df['Value'] + 2
>>> print(df)
Value From To
0 0 0 2
1 2 2 4
2 5 5 7
3 7 7 9
4 9 9 11
5 14 14 16
6 21 21 23
7 25 25 27
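This code fills the gaps so that the From and To columns leave no empty intervals (it needs numpy imported as np; pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
new_rows = []
for i in range(len(df) - 1):
    if df.loc[i, 'To'] != df.loc[i + 1, 'Value']:
        # gap between this interval and the next value: add a filler row
        new_row = {'Value': np.nan, 'From': df.loc[i, 'To'], 'To': df.loc[i + 1, 'Value']}
    else:
        # no gap: all-NaN placeholder row, dropped below
        new_row = {'Value': np.nan, 'From': np.nan, 'To': np.nan}
    new_rows.append(new_row)
df1 = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
df1.dropna(how='all', inplace=True)
df1.sort_values(by=['To'], inplace=True)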
>>> df1
Value From To
0 0.0 0.0 2.0
1 2.0 2.0 4.0
9 NaN 4.0 5.0
2 5.0 5.0 7.0
3 7.0 7.0 9.0
4 9.0 9.0 11.0
12 NaN 11.0 14.0
5 14.0 14.0 16.0
13 NaN 16.0 21.0
6 21.0 21.0 23.0
14 NaN 23.0 25.0
7 25.0 25.0 27.0
This can probably be improved.
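One possible improvement, as a sketch rather than a definitive version, is to find the gaps with shift instead of a Python loop. The names `next_value`, `has_gap`, `gaps` and `result` are illustrative; unlike the loop above, this resets the row index instead of keeping the appended positions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [0, 2, 5, 7, 9, 14, 21, 25]})
df['From'] = df['Value']
df['To'] = df['Value'] + 2

# Value of the following row (NaN for the last row)
next_value = df['Value'].shift(-1)

# A gap exists where an interval ends before the next value starts
has_gap = next_value.notna() & (df['To'] != next_value)

# One filler row per gap: from the end of this interval to the next value
gaps = pd.DataFrame({'Value': np.nan,
                     'From': df.loc[has_gap, 'To'],
                     'To': next_value[has_gap]})

result = (pd.concat([df, gaps])
            .sort_values('From')
            .reset_index(drop=True))
print(result)
```

Each row's To then equals the next row's From, so the intervals are contiguous.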
I want to take the surface-weighted average of the columns in my dataframe. I have two surface-columns and two U-value-columns. I want to create an extra column 'U_av' (surface-weighted-average U-value) and U_av = (A1*U1 + A2*U2) / (A1+A2). If NaN occurs in one of the columns, NaN should be returned.
Initial df:
ID A1 A2 U1 U2
0 14 2 1.0 10.0 11
1 16 2 2.0 12.0 12
2 18 2 3.0 24.0 13
3 20 2 NaN 8.0 14
4 22 4 5.0 84.0 15
5 24 4 6.0 84.0 16
Desired Output:
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.33
1 16 2 2.0 12.0 12 12
2 18 2 3.0 24.0 13 17.4
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.66
5 24 4 6.0 84.0 16 43.2
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({"ID": [14,16,18,20,22,24],
"A1": [2,2,2,2,4,4],
"U1": [10,12,24,8,84,84],
"A2": [1,2,3,np.nan,5,6],
"U2": [11,12,13,14,15,16]})
print(df)
#the mean of two columns U1 and U2 and dropping NaN is easy ((U1+U2)/2 in this case)
#but what to do for the surface-weighted mean (U_av = (A1*U1 + A2*U2) / (A1+A2))?
df.loc[:,'Umean'] = df[['U1','U2']].dropna().mean(axis=1)
EDIT:
adding to the solutions below:
df["U_av"] = (df.A1.mul(df.U1) + df.A2.mul(df.U2)).div(df[['A1','A2']].sum(axis=1))
I hope I understood you correctly:
df['U_av'] = (df['A1']*df['U1'] + df['A2']*df['U2']) / (df['A1']+df['A2'])
df
ID A1 U1 A2 U2 U_av
0 14 2 10 1.0 11 10.333333
1 16 2 12 2.0 12 12.000000
2 18 2 24 3.0 13 17.400000
3 20 2 8 NaN 14 NaN
4 22 4 84 5.0 15 45.666667
5 24 4 84 6.0 16 43.200000
Try this code:
numerator = df.A1.mul(df.U1) + (df.A2.mul(df.U2))
denominator = df.A1.add(df.A2)
df["U_av"] = numerator.div(denominator)
df
ID A1 A2 U1 U2 U_av
0 14 2 1.0 10.0 11 10.333333
1 16 2 2.0 12.0 12 12.000000
2 18 2 3.0 24.0 13 17.400000
3 20 2 NaN 8.0 14 NaN
4 22 4 5.0 84.0 15 45.666667
5 24 4 6.0 84.0 16 43.200000
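If more surface/U-value pairs are added later, the same formula can be written over lists of column names. This is a sketch under that assumption; `area_cols` and `u_cols` are illustrative names, and the two lists must be in matching order. NumPy's sum propagates NaN, so a missing value in any column yields NaN for that row, as requested.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [14, 16, 18, 20, 22, 24],
                   "A1": [2, 2, 2, 2, 4, 4],
                   "U1": [10, 12, 24, 8, 84, 84],
                   "A2": [1, 2, 3, np.nan, 5, 6],
                   "U2": [11, 12, 13, 14, 15, 16]})

area_cols = ["A1", "A2"]   # surfaces
u_cols = ["U1", "U2"]      # U-values, same order as area_cols

areas = df[area_cols].to_numpy(dtype=float)
u_values = df[u_cols].to_numpy(dtype=float)

# Surface-weighted average per row; NaN in any input gives NaN
df["U_av"] = (areas * u_values).sum(axis=1) / areas.sum(axis=1)
print(df)
```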
If I have a data frame like this:
1 2 3 4 5
1 2 4 5 NaN 3
2 3 5 6 1 2
3 3 1 1 1 1
How do I sum each row and replace the values in that row with the sum so I get something like this:
1 2 3 4 5
1 14 14 14 NaN 14
2 17 17 17 17 17
3 7 7 7 7 7
Use mask to replace all non-missing values with the row sum:
df = df.mask(df.notnull(), df.sum(axis=1), axis=0)
print (df)
1 2 3 4 5
1 14 14 14 NaN 14
2 17 17 17 17.0 17
3 7 7 7 7.0 7
Or use numpy.broadcast_to with numpy.where:
arr = df.values
a = np.broadcast_to(np.nansum(arr, axis=1)[:, None], df.shape)
df = pd.DataFrame(np.where(np.isnan(arr), np.nan, a),
                  index=df.index, columns=df.columns)
# alternative: assign in place
df[:] = np.where(np.isnan(arr), np.nan, a)
print (df)
1 2 3 4 5
1 14.0 14.0 14.0 NaN 14.0
2 17.0 17.0 17.0 17.0 17.0
3 7.0 7.0 7.0 7.0 7.0
Using mul
df.notnull().replace(False,np.nan).mul(df.sum(1),axis=0).astype(float)
1 2 3 4 5
1 14.0 14.0 14.0 NaN 14.0
2 17.0 17.0 17.0 17.0 17.0
3 7.0 7.0 7.0 7.0 7.0
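A variant of the mul idea that avoids replacing values inside a boolean frame is sketched below. It multiplies the notna mask by the row sums and then blanks out the originally missing cells with where. The frame is rebuilt from the question's example, and `row_sums` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 4, 5, np.nan, 3],
                   [3, 5, 6, 1, 2],
                   [3, 1, 1, 1, 1]],
                  index=[1, 2, 3], columns=[1, 2, 3, 4, 5])

row_sums = df.sum(axis=1)   # NaN is skipped by default
present = df.notna()

# True * sum -> sum, False * sum -> 0; then restore NaN where data was missing
result = present.mul(row_sums, axis=0).where(present)
print(result)
```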
I have a pandas dataframe df and I want the final output dataframe final_df as
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 12 a 25 4
3 13 a 29 5
In [18]: final_df
Out[18]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 0.0 5.0
In [19]: dates=[10,11,12,13,14]
That is, as you can see, I want to add the missing dates and fill the corresponding values with 0 in the cost column, but in the prev column I want to fill them with the value from the previous date. As a single date may contain multiple symbols, I am using pivot_table.
If I use the ffill
In [12]: df.pivot_table(index="Date",columns="symbol").reindex(dates,method="ffill").stack().reset_index()
Out[12]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 30.0 9.0
3 11 b 33.0 10.0
4 12 a 25.0 4.0
5 13 a 29.0 5.0
6 14 a 29.0 5.0
This gives almost final data structure (it has 7 rows as final_df) except for cost column where it copies previous data but I want 0 there.
So I tried to fill missing values of different columns with different method, but that gives a problem, like
In [13]: df1=df.pivot_table(index="Date",columns="symbol").reindex(dates)
In [14]: df1["cost"]=df1["cost"].fillna(0)
In [15]: df1["prev"]=df1["prev"].ffill()
In [16]: df1.stack().reset_index()
Out[16]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 25.0 4.0
5 12 b 0.0 10.0
6 13 a 29.0 5.0
7 13 b 0.0 10.0
8 14 a 0.0 5.0
9 14 b 0.0 10.0
As you can see, the output contains rows with symbol "b" for dates 12, 13 and 14. I don't want those, because the initial dataframe had no data with symbol "b" for dates 12 and 13 and I want to keep it that way; there must also be none for the new date 14, as it follows 13.
So how can I solve this problem and get the final_df output?
EDIT
Here is another example to check the program.
In [17]: df
Out[17]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 14 a 29 5
In [18]: dates=range(10,17)
In [19]: final_df
Out[19]:
Date symbol cost prev
0 10 a 30 9
1 10 b 33 10
2 11 a 0 9
3 11 b 0 10
4 12 a 0 9
5 12 b 0 10
6 13 a 0 9
7 13 b 0 10
8 14 a 29 5
9 15 a 0 5
10 16 a 0 5
Solution
I have found this way to solve the problem. The trick is to mark the missing places in the initial pivot_table with a placeholder and remove them at the end.
In [44]: df1=df.pivot_table(index="Date",columns='symbol',fill_value="missing").reindex(dates)
In [45]: df1["cost"]= df1["cost"].fillna(0)
In [46]: df1["prev"]=df1["prev"].ffill()
In [47]: df1.stack().replace(to_replace="missing",value=np.nan).dropna().reset_index()
Out[47]:
Date symbol cost prev
0 10 a 30.0 9.0
1 10 b 33.0 10.0
2 11 a 0.0 9.0
3 11 b 0.0 10.0
4 12 a 0.0 9.0
5 12 b 0.0 10.0
6 13 a 0.0 9.0
7 13 b 0.0 10.0
8 14 a 29.0 5.0
9 15 a 0.0 5.0
10 16 a 0.0 5.0
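Since the ffill attempt from the question already produces the right set of rows, another option is to keep it and only zero out cost on the dates that were not in the original data. The sketch below uses the first example; `filled` and `is_new_date` are illustrative names, and the explicit dropna and sort_values calls are there on the assumption that stack may keep all-NaN rows or order rows differently depending on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'Date': [10, 10, 12, 13],
                   'symbol': ['a', 'b', 'a', 'a'],
                   'cost': [30, 33, 25, 29],
                   'prev': [9, 10, 4, 5]})
dates = [10, 11, 12, 13, 14]

filled = (df.pivot_table(index='Date', columns='symbol')
            .reindex(dates, method='ffill')   # new dates copy the previous date
            .stack()
            .dropna()                         # drop symbols absent on a date
            .reset_index()
            .sort_values(['Date', 'symbol'], ignore_index=True))

# Dates that did not exist in the original data get cost 0; prev stays filled
is_new_date = ~filled['Date'].isin(df['Date'])
filled.loc[is_new_date, 'cost'] = 0.0
print(filled)
```

This avoids storing a string placeholder in numeric columns.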