Setting nan to rows in pandas dataframe based on column value - python

Using:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = pd.read_csv('file.csv', na_values=['-9999.0'], decimal=',')
a.index = pd.to_datetime(a[['Year', 'Month', 'Day', 'Hour', 'Minute']])
pd.options.mode.chained_assignment = None
The dataframe is something like:
Index A B C D
2016-07-20 18:00:00 9 4.0 NaN 2
2016-07-20 19:00:00 9 2.64 0.0 3
2016-07-20 20:00:00 12 2.59 0.0 1
2016-07-20 21:00:00 9 4.0 NaN 2
The main objective is to set the entire row to np.nan if the value in column A is 9 and the value in column D is 2 at the same time, for example:
Output expectation
Index A B C D
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9 2.64 0.0 3
2016-07-20 20:00:00 12 2.59 0.0 1
2016-07-20 21:00:00 NaN NaN NaN NaN
Would be thankful if someone could help.

Option 1
This is the opposite of @Jezrael's mask solution.
a.where(a.A.ne(9) | a.D.ne(2))
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Option 2
pd.DataFrame.reindex
a[a.A.ne(9) | a.D.ne(2)].reindex(a.index)
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN

Try this:
df.loc[df.A.eq(9) & df.D.eq(2)] = [np.nan] * len(df.columns)
Demo:
In [158]: df
Out[158]:
A B C D
Index
2016-07-20 18:00:00 9 4.00 NaN 2
2016-07-20 19:00:00 9 2.64 0.0 3
2016-07-20 20:00:00 12 2.59 0.0 1
2016-07-20 21:00:00 9 4.00 NaN 2
In [159]: df.loc[df.A.eq(9) & df.D.eq(2)] = [np.nan] * len(df.columns)
In [160]: df
Out[160]:
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Alternatively, we can use the DataFrame.where() method:
In [174]: df = df.where(~(df.A.eq(9) & df.D.eq(2)))
In [175]: df
Out[175]:
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN

Use mask, which creates NaN values by default:
df = a.mask((a['A'] == 9) & (a['D'] == 2))
print (df)
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Or use boolean indexing with NaN assignment:
a[(a['A'] == 9) & (a['D'] == 2)] = np.nan
print (a)
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN
Timings:
np.random.seed(123)
N = 1000000
L = list('abcdefghijklmnopqrst'.upper())
a = pd.DataFrame(np.random.choice([np.nan,2,9], size=(N,20)), columns=L)
#jez2
In [256]: %timeit a[(a['A'] == 9) & (a['D'] == 2)] = np.nan
10 loops, best of 3: 25.8 ms per loop
#jez2upr
In [257]: %timeit a.loc[(a['A'] == 9) & (a['D'] == 2)] = np.nan
10 loops, best of 3: 27.6 ms per loop
#Wen
In [258]: %timeit a.mul(np.where((a.A==9)&(a.D==2),np.nan,1),0)
10 loops, best of 3: 90.5 ms per loop
#jez1
In [259]: %timeit a.mask((a['A'] == 9) & (a['D'] == 2))
1 loop, best of 3: 316 ms per loop
#maxu2
In [260]: %timeit a.where(~(a.A.eq(9) & a.D.eq(2)))
1 loop, best of 3: 318 ms per loop
#pir1
In [261]: %timeit a.where(a.A.ne(9) | a.D.ne(2))
1 loop, best of 3: 316 ms per loop
#pir2
In [263]: %timeit a[a.A.ne(9) | a.D.ne(2)].reindex(a.index)
1 loop, best of 3: 355 ms per loop

Or you can try using .mul after np.where:
a=np.where((df2.A==9)&(df2.D==2),np.nan,1)
df2.mul(a,0)
#one line: df.mul(np.where((df.A==9)&(df.D==2),np.nan,1), 0)
A B C D
Index
2016-07-20 18:00:00 NaN NaN NaN NaN
2016-07-20 19:00:00 9.0 2.64 0.0 3.0
2016-07-20 20:00:00 12.0 2.59 0.0 1.0
2016-07-20 21:00:00 NaN NaN NaN NaN

Related

Getting an error while trying to drop rows in pandas

I have the following DataFrame:
fin_data[fin_data['Ticker']=='DNMR']
high low open close volume adj_close Ticker CUMLOGRET_1 PCTRET_1 CUMPCTRET_1 OBV EMA_5 EMA_10 EMA_20 VWMA_15 BBL_20_2.0 BBM_20_2.0 BBU_20_2.0 RSI_14 PVT MACD_10_20_9 MACDh_10_20_9 MACDs_10_20_9 VOLUME_SMA_10 NAV Status Premium_over_NAV
date
2020-05-28 4.700000 4.700000 4.700000 4.700000 100.0 4.700000 DNMR NaN NaN NaN 100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10 Completed -0.530
2020-05-29 4.700000 4.700000 4.700000 4.700000 0.0 4.700000 DNMR 0.000000 0.000000 0.000000 100.0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000e+00 NaN NaN NaN NaN 10 Completed -0.530
2020-06-01 9.660000 9.630000 9.630000 9.660000 2000.0 9.660000 DNMR 0.720431 1.055319 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-02 9.660000 9.650000 9.650000 9.660000 60020 9.660000 DNMR 0.720431 0.000000 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-03 9.720000 9.630000 9.720000 9.630000 1100.0 9.630000 DNMR 0.717321 -0.003106 1.052214 1000.0 7.670000 NaN NaN NaN NaN NaN NaN 99.303423 2.107222e+05 NaN NaN NaN NaN 10 Completed -0.037
I'd like to either drop the first two rows where the close price is 4.70 or replace 4.70 by 9.66.
In order to drop the rows I tried this but it's giving me an error:
fin_data.drop(fin_data[fin_data['Ticker']=='DNMR'],axis=0,inplace=True)
KeyError: "['high' 'low' 'open' 'close' 'volume' 'adj_close' 'Ticker' 'CUMLOGRET_1'\n 'PCTRET_1' 'CUMPCTRET_1' 'OBV' 'EMA_5' 'EMA_10' 'EMA_20' 'VWMA_15'\n 'BBL_20_2.0' 'BBM_20_2.0' 'BBU_20_2.0' 'RSI_14' 'PVT' 'MACD_10_20_9'\n 'MACDh_10_20_9' 'MACDs_10_20_9' 'VOLUME_SMA_10' 'NAV' 'Status'\n 'Premium_over_NAV'] not found in axis"
Then I tried to replace the 4.70 values, but even though the code executed without an error, the DataFrame is unchanged.
fin_data.loc[fin_data['Ticker']=='DNMR','adj_close'][0:2] = 9.66
Please note that I don't want to delete the data for those two dates (2020-05-28 and 2020-05-29) for other Tickers in the database, just for this one ('DNMR').
Thanks.
You are using it wrong; to drop the rows in question (or actually, to select the opposite ones) you should do:
fin_data = fin_data[~((fin_data['Ticker'] == 'DNMR') & (fin_data['close'] == 4.7))]
To point out something about your drop issue:
the drop() method expects a "single label or list-like Index or column labels to drop" (see DataFrame.drop), so the following works (notice the .index after the subset):
df
a b c d
0 10 8 3 5
1 5 5 3 1
2 2 2 8 6
df.drop(df[df["a"]== 10].index, axis= 0, inplace= True)
df
a b c d
1 5 5 3 1
2 2 2 8 6
BUT, if you have dates as indexes and there are multiple rows with the same dates, you would also drop those.
A solution would be to reset the index to integers... but (even though I don't see why you would want non-unique indexes) that may not be what you want, and you should stick to Jimmar's answer :)
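A minimal sketch of that pitfall and the reset_index workaround (the frame and ticker names here are hypothetical):
import pandas as pd
df = pd.DataFrame(
    {"Ticker": ["DNMR", "XYZ", "DNMR"], "close": [4.7, 4.7, 9.66]},
    index=pd.to_datetime(["2020-05-28", "2020-05-28", "2020-06-01"]),
)
# dropping by date label removes BOTH 2020-05-28 rows, including the unrelated XYZ one
both_gone = df.drop(df[(df["Ticker"] == "DNMR") & (df["close"] == 4.7)].index)
# resetting to a unique integer index first keeps the XYZ row
tmp = df.reset_index()
kept = tmp.drop(tmp[(tmp["Ticker"] == "DNMR") & (tmp["close"] == 4.7)].index).set_index("index")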
It's quite often simpler to use a mask. The example below shows both: values updated, then rows dropped.
import pandas as pd
import io
fin_data = pd.read_csv(io.StringIO("""date high low open close volume adj_close Ticker CUMLOGRET_1 PCTRET_1 CUMPCTRET_1 OBV EMA_5 EMA_10 EMA_20 VWMA_15 BBL_20_2.0 BBM_20_2.0 BBU_20_2.0 RSI_14 PVT MACD_10_20_9 MACDh_10_20_9 MACDs_10_20_9 VOLUME_SMA_10 NAV Status Premium_over_NAV
2020-05-28 4.700000 4.700000 4.700000 4.700000 100.0 4.700000 DNMR NaN NaN NaN 100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 10 Completed -0.530
2020-05-29 4.700000 4.700000 4.700000 4.700000 0.0 4.700000 DNMR 0.000000 0.000000 0.000000 100.0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000e+00 NaN NaN NaN NaN 10 Completed -0.530
2020-06-01 9.660000 9.630000 9.630000 9.660000 2000.0 9.660000 DNMR 0.720431 1.055319 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-02 9.660000 9.650000 9.650000 9.660000 60020 9.660000 DNMR 0.720431 0.000000 1.055319 2100.0 NaN NaN NaN NaN NaN NaN NaN 100.000000 2.110638e+05 NaN NaN NaN NaN 10 Completed -0.034
2020-06-03 9.720000 9.630000 9.720000 9.630000 1100.0 9.630000 DNMR 0.717321 -0.003106 1.052214 1000.0 7.670000 NaN NaN NaN NaN NaN NaN 99.303423 2.107222e+05 NaN NaN NaN NaN 10 Completed -0.037"""), sep="\s+")
fin_data.date=pd.to_datetime(fin_data.date)
fin_data = fin_data.set_index(["date"])
mask = fin_data["Ticker"].eq("DNMR") & fin_data["close"].eq(4.7)
fin_data.loc[mask, "close"] = 0
print(fin_data.iloc[:,0:6].to_markdown())
| date                | high | low  | open | close | volume | adj_close |
|:--------------------|-----:|-----:|-----:|------:|-------:|----------:|
| 2020-05-28 00:00:00 | 4.7  | 4.7  | 4.7  | 0     | 100    | 4.7       |
| 2020-05-29 00:00:00 | 4.7  | 4.7  | 4.7  | 0     | 0      | 4.7       |
| 2020-06-01 00:00:00 | 9.66 | 9.63 | 9.63 | 9.66  | 2000   | 9.66      |
| 2020-06-02 00:00:00 | 9.66 | 9.65 | 9.65 | 9.66  | 60020  | 9.66      |
| 2020-06-03 00:00:00 | 9.72 | 9.63 | 9.72 | 9.63  | 1100   | 9.63      |
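As an aside, if the goal is the replacement the question actually asks for (4.70 replaced by 9.66) rather than zeroing out close, the same mask can be reused, since it was computed before the update above (a sketch; adjust the column list as needed):
fin_data.loc[mask, ['close', 'adj_close']] = 9.66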
fin_data = fin_data.drop(fin_data.loc[mask].index, axis=0)
print(fin_data.iloc[:,0:6].to_markdown())
| date                | high | low  | open | close | volume | adj_close |
|:--------------------|-----:|-----:|-----:|------:|-------:|----------:|
| 2020-06-01 00:00:00 | 9.66 | 9.63 | 9.63 | 9.66  | 2000   | 9.66      |
| 2020-06-02 00:00:00 | 9.66 | 9.65 | 9.65 | 9.66  | 60020  | 9.66      |
| 2020-06-03 00:00:00 | 9.72 | 9.63 | 9.72 | 9.63  | 1100   | 9.63      |

Concat multiindex pandas into single one

I have 3 pandas DataFrames with a MultiIndex from groupby(['location','date']):
print(a)
location date hosp
976 2020-10-02 9
2020-10-03 10
2020-10-04 10
print(b)
incid_hosp
location date
976 2020-10-02 1
2020-10-03 1
2020-10-04 0
print(c)
P T
location date
978 2020-10-02 5 60
2020-10-02 4 52
2020-10-03 4 2
I want to concat them to get:
print(result)
hosp incid_hosp P T
location date
976 2020-10-02 9 1 NaN NaN
2020-10-03 10 1 NaN NaN
2020-10-04 10 0 NaN NaN
978 2020-10-02 NaN NaN 5 60
2020-10-03 NaN NaN 4 52
2020-10-04 NaN NaN 4 2
I have tried
result = pd.concat([a,b,c], axis=1, sort=False)
But it produces too many NaN values ...
Try this using combine_first and reduce:
from functools import reduce
reduce(lambda x, y: x.combine_first(y), [a,b,c])
Output:
P T hosp incid_hosp
location date
976 2020-10-02 NaN NaN 9.0 1.0
2020-10-03 NaN NaN 10.0 1.0
2020-10-04 NaN NaN 10.0 0.0
978 2020-10-02 5.0 60.0 NaN NaN
2020-10-02 4.0 52.0 NaN NaN
2020-10-03 4.0 2.0 NaN NaN
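If the column order of the desired result matters, the reduce output can simply be reordered afterwards (a small sketch using the column names from the question):
result = reduce(lambda x, y: x.combine_first(y), [a, b, c])
result = result[['hosp', 'incid_hosp', 'P', 'T']]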
For three dataframes, you can use a chained join:
a.join(b,how='outer').join(c, how='outer')
Output:
hosp incid_hosp P T
location date
976 2020-10-02 9.0 1.0 NaN NaN
2020-10-03 10.0 1.0 NaN NaN
2020-10-04 10.0 0.0 NaN NaN
978 2020-10-02 NaN NaN 5.0 60.0
2020-10-02 NaN NaN 4.0 52.0
2020-10-03 NaN NaN 4.0 2.0
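If the number of frames is not fixed, the same chained join can be expressed with functools.reduce (a sketch, assuming all frames share the ['location', 'date'] MultiIndex):
from functools import reduce
result = reduce(lambda left, right: left.join(right, how='outer'), [a, b, c])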

Remove NaN from each column and rearranging it with python pandas/numpy

I have a similar issue to my previous question:
Remove zero from each column and rearranging it with python pandas/numpy
But in this case, I need to remove NaN. I have tried many solutions including modifying solutions from my previous post:
a = a[a!=np.nan].reshape(-1,3)
but it gave me a weird result.
Here is my initial matrix from Dataframe :
A B C D E F
nan nan nan 0.0 27.7 nan
nan nan nan 5.0 27.5 nan
nan nan nan 10.0 27.4 nan
0.0 29.8 nan nan nan nan
5.0 29.9 nan nan nan nan
10.0 30.0 nan nan nan nan
nan nan 0.0 28.6 nan nan
nan nan 5.0 28.6 nan nan
nan nan 10.0 28.5 nan nan
nan nan 15.0 28.4 nan nan
nan nan 20.0 28.3 nan nan
nan nan 25.0 28.2 nan nan
And I expect to have result like this :
A B
0.0 27.7
5.0 27.5
10.0 27.4
0.0 29.8
5.0 29.9
10.0 30.0
0.0 28.6
5.0 28.6
10.0 28.5
15.0 28.4
20.0 28.3
25.0 28.2
Solution:
Given the input dataframe a:
a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
This will give you the desired output.
Example:
import numpy as np
import pandas as pd
a = pd.DataFrame({ 'A':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'B':[np.nan,np.nan,np.nan,np.nan,np.nan,4],
'C':[7,np.nan,9,np.nan,2,np.nan],
'D':[1,3,np.nan,7,np.nan,np.nan],
'E':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})
print (a)
A B C D E
0 NaN NaN 7.0 1.0 NaN
1 NaN NaN NaN 3.0 NaN
2 NaN NaN 9.0 NaN NaN
3 NaN NaN NaN 7.0 NaN
4 NaN NaN 2.0 NaN NaN
5 NaN 4.0 NaN NaN NaN
a_new = a.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
print(a_new)
C D
0 7.0 1.0
1 9.0 3.0
2 2.0 7.0
Use np.isnan to test for missing values, with ~ to invert the mask, assuming there are always 2 non-missing values per row:
a = df.to_numpy()
df = pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
print (df)
0 1
0 0.0 27.7
1 5.0 27.5
2 10.0 27.4
3 0.0 29.8
4 5.0 29.9
5 10.0 30.0
6 0.0 28.6
7 5.0 28.6
8 10.0 28.5
9 15.0 28.4
10 20.0 28.3
11 25.0 28.2
Another idea is to use the justify function and then drop the columns that are all NaN (a sketch of justify is given after the output below):
df1 = (pd.DataFrame(justify(a, invalid_val=np.nan),
columns=df.columns).dropna(how='all', axis=1))
print (df1)
A B
0 0.0 27.7
1 5.0 27.5
2 10.0 27.4
3 0.0 29.8
4 5.0 29.9
5 10.0 30.0
6 0.0 28.6
7 5.0 28.6
8 10.0 28.5
9 15.0 28.4
10 20.0 28.3
11 25.0 28.2
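The justify helper is not defined in the answer; a minimal sketch of such a function (modelled on the widely shared NumPy implementation, assuming a 2D float array) could look like this:
import numpy as np
def justify(a, invalid_val=0, axis=1, side='left'):
    # Push the valid values of `a` to one side along `axis`,
    # filling the remaining cells with `invalid_val`.
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=float)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out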
EDIT:
df = pd.concat([df] * 1000, ignore_index=True)
a = df.to_numpy()
print (a.shape)
(12000, 6)
In [168]: %timeit df.apply(lambda x: pd.Series(x.dropna().values)).dropna(axis='columns')
8.06 ms ± 597 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [172]: %%timeit
...: a = df.to_numpy()
...: pd.DataFrame(a[~np.isnan(a)].reshape(-1,2))
...:
...:
...:
422 µs ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [173]: %timeit pd.DataFrame(justify(a, invalid_val=np.nan),columns=df.columns).dropna(how='all', axis=1)
2.88 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

How to find the highest count of sequential (numbers|increasing|decreasing) in pandas DataFrame column of values

How do I find the highest count of sequential occurrences, such as the same number, increasing values, or decreasing values, in the same column?
So, given something like:
h_diff l_diff monotonic
timestamp
2000-01-18 NaN NaN NaN
2000-01-19 2.75 2.93 1.0
2000-01-20 12.75 10.13 1.0
2000-01-21 -7.25 -3.31 0.0
2000-01-24 -1.50 -5.07 0.0
2000-01-25 0.37 -2.75 1.0
2000-01-26 1.07 7.38 1.0
2000-01-27 -1.19 -2.75 0.0
2000-01-28 -2.13 -6.38 0.0
2000-01-31 -7.00 -6.12 0.0
The highest count of consecutive positive values in h_diff is 2, and for negative values it is 3; the same applies to l_diff. So given a rolling window of 10 or n, how would I find the highest monotonic count while still being able to change the window size dynamically?
lambda x: np.all(np.diff(x) > 0) gives me the 1.0 value for the monotonic column, and lambda x: np.count_nonzero(np.diff(x) > 0) counts the total number of 1.0 values in the window, but what I am trying to find is the longest run within a given window.
What I am hoping for is something like:
h_diff l_diff monotonic
timestamp
2000-01-18 NaN NaN NaN
2000-01-19 2.75 2.93 1.0
2000-01-20 12.75 10.13 2.0
2000-01-21 -7.25 -3.31 0.0
2000-01-24 -1.50 -5.07 0.0
2000-01-25 0.37 -2.75 1.0
2000-01-26 1.07 7.38 2.0
2000-01-27 1.19 -2.75 3.0
2000-01-28 2.13 -6.38 4.0
2000-01-31 -7.00 -6.12 0.0
Use GroupBy.cumcount + Series.where.
Initial DataFrame
h_diff l_diff
timestamp
2000-01-18 NaN NaN
2000-01-19 2.75 2.93
2000-01-20 12.75 10.13
2000-01-21 -7.25 -3.31
2000-01-24 -1.50 -5.07
2000-01-25 0.37 -2.75
2000-01-26 1.07 7.38
2000-01-27 1.19 -2.75
2000-01-28 2.13 -6.38
2000-01-31 -7.00 -6.12
h = df['h_diff'].gt(0)
#h = np.sign(df['h_diff'])
df['monotonic_h']=h.groupby(h.ne(h.shift()).cumsum()).cumcount().add(1).where(h,0)
print(df)
h_diff l_diff monotonic_h
timestamp
2000-01-18 NaN NaN 0
2000-01-19 2.75 2.93 1
2000-01-20 12.75 10.13 2
2000-01-21 -7.25 -3.31 0
2000-01-24 -1.50 -5.07 0
2000-01-25 0.37 -2.75 1
2000-01-26 1.07 7.38 2
2000-01-27 1.19 -2.75 3
2000-01-28 2.13 -6.38 4
2000-01-31 -7.00 -6.12 0
df['monotonic_h'].max()
#4
Detail
h.ne(h.shift()).cumsum()
timestamp
2000-01-18 1
2000-01-19 2
2000-01-20 2
2000-01-21 3
2000-01-24 3
2000-01-25 4
2000-01-26 4
2000-01-27 4
2000-01-28 4
2000-01-31 5
Name: h_diff, dtype: int64
UPDATE
df = df.join( h.groupby(h.ne(h.shift()).cumsum()).cumcount().add(1)
.to_frame('values')
.assign(monotic = np.where(h,'monotic_h_greater_0',
'monotic_h_not_greater_0'),
index = lambda x: x.index)
.where(df['h_diff'].notna())
.pivot_table(columns = 'monotic',
index = 'index',
values = 'values',
fill_value=0) )
print(df)
h_diff l_diff monotic_h_greater_0 monotic_h_not_greater_0
timestamp
2000-01-18 NaN NaN NaN NaN
2000-01-19 2.75 2.93 1.0 0.0
2000-01-20 12.75 10.13 2.0 0.0
2000-01-21 -7.25 -3.31 0.0 1.0
2000-01-24 -1.50 -5.07 0.0 2.0
2000-01-25 0.37 -2.75 1.0 0.0
2000-01-26 1.07 7.38 2.0 0.0
2000-01-27 1.19 -2.75 3.0 0.0
2000-01-28 2.13 -6.38 4.0 0.0
2000-01-31 -7.00 -6.12 0.0 1.0
The code below should do the trick of finding the sequential occurrences of positive or negative numbers (it assumes a default integer index); the code is for column h_diff:
df1[df1.h_diff.gt(0)].index.to_series().diff().ne(1).cumsum().value_counts().max() #sequential occurrences greater than 0
df1[df1.h_diff.lt(0)].index.to_series().diff().ne(1).cumsum().value_counts().max() #sequential occurrences less than 0
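Note that this relies on the index being consecutive integers; since the question's frame has a timestamp index, a sketch of the same idea after resetting to a RangeIndex:
tmp = df1.reset_index(drop=True)  # consecutive rows now differ by exactly 1 in the index
longest_pos = tmp[tmp.h_diff.gt(0)].index.to_series().diff().ne(1).cumsum().value_counts().max()
longest_neg = tmp[tmp.h_diff.lt(0)].index.to_series().diff().ne(1).cumsum().value_counts().max()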

Combining dataframes in pandas with the same rows and columns, but different cell values

I'm interested in combining two dataframes in pandas that have the same row indices and column names, but different cell values. See the example below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A':[22,2,np.NaN,np.NaN],
'B':[23,4,np.NaN,np.NaN],
'C':[24,6,np.NaN,np.NaN],
'D':[25,8,np.NaN,np.NaN]})
df2 = pd.DataFrame({'A':[np.NaN,np.NaN,56,100],
'B':[np.NaN,np.NaN,58,101],
'C':[np.NaN,np.NaN,59,102],
'D':[np.NaN,np.NaN,60,103]})
In[6]: print(df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In[7]: print(df2)
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I would like the resulting frame to look like this:
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I have tried different ways of pd.concat and pd.merge but some of the data always gets replaced with NaNs. Any pointers in the right direction would be greatly appreciated.
Use combine_first:
print (df1.combine_first(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or fillna:
print (df1.fillna(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or update:
df1.update(df2)
print (df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Use combine_first
df1.combine_first(df2)
