Replace NaN values with values from other table - python

Please help.
My first table looks like:
id val1 val2
0 4 30
1 5 NaN
2 3 10
3 2 8
4 3 NaN
My second table looks like
id val1 val2_estimate
0 1 8
1 2 12
2 3 13
3 4 16
4 5 22
I want to replace NaN in the 1st table with the estimated values from column val2_estimate in the 2nd table where val1 matches. The val1 values in the 2nd table are unique. The end result needs to look like this:
id val1 val2
0 4 30
1 5 22
2 3 10
3 2 8
4 3 13
I want to replace NaN values only.

Use merge to get the corresponding estimate from df2 for each row of df, then use fillna:
df['val2'] = df['val2'].fillna(
    df.merge(df2, on=['val1'], how='left')['val2_estimate'])
df
id val1 val2
0 0 4 30.0
1 1 5 22.0
2 2 3 10.0
3 3 2 8.0
4 4 3 13.0
Many ways to skin a cat; this is one of them.
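For completeness, here is a self-contained sketch of the merge route, reconstructing the question's two tables (the names df and df2 follow the answer; they are not defined in the question itself):

import numpy as np
import pandas as pd

# Reconstruct the two tables from the question.
df = pd.DataFrame({'id': [0, 1, 2, 3, 4],
                   'val1': [4, 5, 3, 2, 3],
                   'val2': [30, np.nan, 10, 8, np.nan]})
df2 = pd.DataFrame({'val1': [1, 2, 3, 4, 5],
                    'val2_estimate': [8, 12, 13, 16, 22]})

# The left merge attaches val2_estimate to every row of df; fillna
# then copies it across only where val2 is NaN. This relies on df
# having its default RangeIndex, because merge returns a new one.
df['val2'] = df['val2'].fillna(
    df.merge(df2, on=['val1'], how='left')['val2_estimate'])
print(df)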

Use fillna with map from a pd.Series created using set_index:
df['val2'] = df['val2'].fillna(df['val1'].map(df2.set_index('val1')['val2_estimate']))
df
Output:
val1 val2
id
0 4 30.0
1 5 22.0
2 3 10.0
3 2 8.0
4 3 13.0
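The map route avoids building a merged frame: set_index('val1') turns df2 into a lookup Series and map performs a dictionary-style lookup per row. Reusing the df2 from the sketch above (and assuming a fresh, unfilled df):

# Build a val1 -> val2_estimate lookup Series from df2.
lookup = df2.set_index('val1')['val2_estimate']

# map() translates each val1 into its estimate; fillna applies it
# only where val2 is missing, leaving existing values untouched.
df['val2'] = df['val2'].fillna(df['val1'].map(lookup))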

Related

Summarize rows in pandas dataframe by column value and append specific column values as columns [duplicate]

I have a dataframe as follows with multiple rows per id (maximum 3).
dat = pd.DataFrame({'id':[1,1,1,2,2,3,4,4], 'code': ["A","B","D","B","D","A","A","D"], 'amount':[11,2,5,22,5,32,11,5]})
id code amount
0 1 A 11
1 1 B 2
2 1 D 5
3 2 B 22
4 2 D 5
5 3 A 32
6 4 A 11
7 4 D 5
I want to consolidate the df and have only one row per id so that it looks as follows:
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11 B 2 D 5
1 2 B 22 D 5 NaN NaN
2 3 A 32 NaN NaN NaN NaN
3 4 A 11 D 5 NaN NaN
How can I achieve this in pandas?
Use GroupBy.cumcount to build a counter per id, reshape with DataFrame.unstack and DataFrame.sort_index, then flatten the MultiIndex columns and convert id back to a column with DataFrame.reset_index:
df = (dat.set_index(['id', dat.groupby('id').cumcount().add(1)])
         .unstack()
         .sort_index(axis=1, level=1, sort_remaining=False))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print(df)
id code1 amount1 code2 amount2 code3 amount3
0 1 A 11.0 B 2.0 D 5.0
1 2 B 22.0 D 5.0 NaN NaN
2 3 A 32.0 NaN NaN NaN NaN
3 4 A 11.0 D 5.0 NaN NaN
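To see why the counter works, here is a small runnable illustration using the dat frame from the question; cumcount is what turns the repeated ids into the 1, 2, 3 suffixes of the final column names:

import pandas as pd

dat = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3, 4, 4],
                    'code': ["A", "B", "D", "B", "D", "A", "A", "D"],
                    'amount': [11, 2, 5, 22, 5, 32, 11, 5]})

# cumcount numbers the rows within each id group from 0;
# add(1) makes the counter 1-based.
print(dat.groupby('id').cumcount().add(1).tolist())
# [1, 2, 3, 1, 2, 1, 1, 2]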

fill nan values with values from another row with common values in two or more columns [duplicate]

I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
Using columns one and two as keys, whenever a group is not entirely NaN in column three, I want to impute the missing entries from the row in that group that does have a value in column three.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that the group with keys 1 and 3 stays empty because no value exists for it.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill, which gives me a rather strange result where it forward fills column two instead. I am using this code for the forward fill:
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If there is only one non-NaN value per group, use ffill (forward fill) and bfill (backward fill) per group, which needs apply with a lambda:
df['three'] = (df.groupby(['one', 'two'], sort=False)['three']
                 .apply(lambda x: x.ffill().bfill()))
print(df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if there are multiple values per group and you need to replace NaN with some constant, e.g. the mean per group:
print(df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = (df.groupby(['one', 'two'], sort=False)['three']
                 .apply(lambda x: x.fillna(x.mean())))
print(df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
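The same mean-fill can also be written with GroupBy.transform, which broadcasts each group's mean back to the original row positions and avoids apply (a variant sketch, not from the original answer):

# transform('mean') returns a Series aligned with df, holding each
# row's group mean; fillna uses it only where 'three' is NaN.
df['three'] = df['three'].fillna(
    df.groupby(['one', 'two'], sort=False)['three'].transform('mean'))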
You can sort the data by the column with missing values, then groupby and forward fill. Sorting puts the NaN rows after the observed ones within each group, so ffill always has a value to propagate; note the frame stays sorted afterwards unless you restore the original order (e.g. with df.sort_index()):
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
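As a final note, when each group holds at most one non-NaN value, GroupBy.transform('first') reaches the desired result in a single pass, because 'first' skips NaN and returns each group's lone observed value (a sketch reconstructing the question's frame; all-NaN groups stay NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1] * 8,
                   'two': [1, 1, 1, 2, 2, 2, 3, 3],
                   'three': [10, np.nan, np.nan, np.nan,
                             20, np.nan, np.nan, np.nan]})

# Every row receives its group's first non-NaN value.
df['three'] = df.groupby(['one', 'two'])['three'].transform('first')
print(df)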

Pandas: Create a new column by comparing 2 columns in 2 different data frames

I've 2 data frames in pandas.
in_degree:
Target in_degree
0 2 1
1 4 24
2 5 53
3 6 98
4 7 34
out_degree
Source out_degree
0 1 4
1 2 4
2 3 5
3 4 5
4 5 5
By matching the key columns of the two frames, I'd like to create a new data frame that adds the in_degree and out_degree values together and displays the result.
The sample output should look like this:
Source/Target out_degree
0 1 4
1 2 5
2 3 5
3 4 29
4 5 58
Any help would be appreciated.
Thanks.
Traditionally, this would need a merge, but I think you can take advantage of pandas' index-aligned arithmetic to do this a bit faster.
x = df2.set_index('Source')
y = df1.set_index('Target').rename_axis('Source')  # treat Target as the same key
y.columns = x.columns  # rename in_degree so add() can line the columns up
x.add(y.reindex(x.index), fill_value=0).reset_index()
Source out_degree
0 1 4.0
1 2 5.0
2 3 5.0
3 4 29.0
4 5 58.0
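For reference, here is a reconstruction of the two frames from the question that the snippet above runs against (the names df1 and df2 are assumed from the answer). The reindex to x.index is what restricts the result to the Source values, which is why Targets 6 and 7 drop out:

import pandas as pd

df1 = pd.DataFrame({'Target': [2, 4, 5, 6, 7],
                    'in_degree': [1, 24, 53, 98, 34]})
df2 = pd.DataFrame({'Source': [1, 2, 3, 4, 5],
                    'out_degree': [4, 4, 5, 5, 5]})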
The "traditional" SQL way of solving this would be using merge:
v = df1.merge(df2, left_on='Target', right_on='Source', how='right')
dct = dict(
Source=v['Source'],
out_degree=v['in_degree'].add(v['out_degree'], fill_value=0))
pd.DataFrame(dct).sort_values('Source')
Source out_degree
3 1 4.0
0 2 5.0
4 3 5.0
1 4 29.0
2 5 58.0
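The merge version can be condensed by assigning straight back onto the merged frame rather than building an intermediate dict (a variant sketch, reusing df1 and df2 from above):

v = df1.merge(df2, left_on='Target', right_on='Source', how='right')
# Sum the two degree columns, treating a missing in_degree as 0.
v['out_degree'] = v['in_degree'].add(v['out_degree'], fill_value=0)
print(v[['Source', 'out_degree']].sort_values('Source'))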

Pandas - Delete cells based on ranking within column

I want to delete values based on their relative rank within their column. Specifically, I want to isolate the X highest and X lowest values within several columns. So if X=2 and my dataframe looks like this:
ID Val1 Val2 Val3
001 2 8 14
002 10 15 8
003 3 1 20
004 11 11 7
005 14 4 19
The output should look like this:
ID Val1 Val2 Val3
001 2 NaN NaN
002 NaN 15 8
003 3 1 20
004 11 11 7
005 14 4 19
I know that I can make a sub-table to isolate the high and low rank using:
df = df.sort_values('Column Name')  # DataFrame.sort was removed in modern pandas
df2 = df.head(X)  # OR: df.tail(X)
And I figure I can clear these sub-tables of the values from the other columns using:
df2['Other Column'] = np.nan
df2['Other Column B'] = np.nan
Then merge the sub-tables back together in a way that replaces NaN values when there is data in one of the tables. I tried:
df2.update(df3) # df3 is a sub-table made the same way as df2 using a different column
which only updated rows already present in df2.
I tried:
out = pd.merge(df2, df3, how='outer')
which gave me separate rows when a row appeared in both df2 and df3.
I tried:
out = df2.combine_first(df3)
which in some cases overwrote numerical values with NaN, making it unsuitable.
There must be a way to do this: I want the original dataframe back with NaN plugged in wherever a value is not among the X highest or X lowest values in its column.
Interesting question. You can get each value's position within its column's sorted order (the mask DataFrame below), then keep only the values whose position falls within your defined boundary.
In [98]:
print(df)
Val1 Val2 Val3
ID
1 2 8 14
2 10 15 8
3 3 1 20
4 11 11 7
5 14 4 19
In [99]:
mask = df.apply(lambda x: np.searchsorted(sorted(x), x))
print(mask)
Val1 Val2 Val3
ID
1 0 2 2
2 2 4 1
3 1 0 4
4 3 3 0
5 4 1 3
In [100]:
print((mask <= 1) | (mask >= len(mask) - 2))
Val1 Val2 Val3
ID
1 True False False
2 False True True
3 True True True
4 True True True
5 True True True
In [101]:
print(df.where((mask <= 1) | (mask >= len(mask) - 2)))
Val1 Val2 Val3
ID
1 2 NaN NaN
2 NaN 15 8
3 3 1 20
4 11 11 7
5 14 4 19
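A rank-based variant of the same idea avoids searchsorted entirely: rank each column, then keep the values whose rank falls in the bottom X or top X (a sketch, not from the original answer; method='first' breaks ties by position):

import pandas as pd

df = pd.DataFrame({'Val1': [2, 10, 3, 11, 14],
                   'Val2': [8, 15, 1, 11, 4],
                   'Val3': [14, 8, 20, 7, 19]},
                  index=pd.Index([1, 2, 3, 4, 5], name='ID'))

X = 2
r = df.rank(method='first')          # 1 = smallest ... len(df) = largest
keep = (r <= X) | (r > len(df) - X)  # bottom X or top X per column
print(df.where(keep))                # everything else becomes NaN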
