Getting most recent observation & date from several columns - python

Take the following toy DataFrame:
import numpy as np
import pandas as pd

data = np.arange(35, dtype=np.float32).reshape(7, 5)
data = pd.concat((
    pd.DataFrame(list('abcdefg'), columns=['field1']),
    pd.DataFrame(data, columns=['field2', '2014', '2015', '2016', '2017'])),
    axis=1)
data.iloc[1:4, 4:] = np.nan
data.iloc[4, 3:] = np.nan
print(data)
field1 field2 2014 2015 2016 2017
0 a 0.0 1.0 2.0 3.0 4.0
1 b 5.0 6.0 7.0 NaN NaN
2 c 10.0 11.0 12.0 NaN NaN
3 d 15.0 16.0 17.0 NaN NaN
4 e 20.0 21.0 NaN NaN NaN
5 f 25.0 26.0 27.0 28.0 29.0
6 g 30.0 31.0 32.0 33.0 34.0
I'd like to replace the "year" columns (2014-2017) with two fields: the most recent non-null observation, and the corresponding year of that observation. Assume field1 is a unique key. (I'm not looking to do any groupby ops, just 1 row per record.) I.e.:
field1 field2 obs date
0 a 0.0 4.0 2017
1 b 5.0 7.0 2015
2 c 10.0 12.0 2015
3 d 15.0 17.0 2015
4 e 20.0 21.0 2014
5 f 25.0 29.0 2017
6 g 30.0 34.0 2017
I've gotten this far:
pd.melt(data, id_vars=['field1', 'field2'],
        value_vars=['2014', '2015', '2016', '2017'])\
  .dropna(subset=['value'])
field1 field2 variable value
0 a 0.0 2014 1.0
1 b 5.0 2014 6.0
2 c 10.0 2014 11.0
3 d 15.0 2014 16.0
4 e 20.0 2014 21.0
5 f 25.0 2014 26.0
6 g 30.0 2014 31.0
# ...
But I'm struggling with how to pivot back to the desired format.

Maybe:
d2 = data.melt(id_vars=["field1", "field2"], var_name="date", value_name="obs").dropna(subset=["obs"])
d2["date"] = d2["date"].astype(int)
df = d2.loc[d2.groupby(["field1", "field2"])["date"].idxmax()]
which gives me
field1 field2 date obs
21 a 0.0 2017 4.0
8 b 5.0 2015 7.0
9 c 10.0 2015 12.0
10 d 15.0 2015 17.0
4 e 20.0 2014 21.0
26 f 25.0 2017 29.0
27 g 30.0 2017 34.0
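To match the desired layout exactly, the grouped result can then be tidied up, e.g. (a rough sketch, assuming the df produced above):
out = (df.sort_values("field1")
         .reset_index(drop=True)
         [["field1", "field2", "obs", "date"]])
print(out)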

What about the following approach? (Note that the 2017 value in row a has been changed to -10.0 here, to show that the last non-null observation is taken rather than the row maximum.)
In [160]: df
Out[160]:
field1 field2 2014 2015 2016 2017
0 a 0.0 1.0 2.0 3.0 -10.0
1 b 5.0 6.0 7.0 NaN NaN
2 c 10.0 11.0 12.0 NaN NaN
3 d 15.0 16.0 17.0 NaN NaN
4 e 20.0 21.0 NaN NaN NaN
5 f 25.0 26.0 27.0 28.0 29.0
6 g 30.0 31.0 32.0 33.0 34.0
In [180]: df.groupby(lambda x: 'obs' if x.isdigit() else x, axis=1) \
     ...:   .last() \
     ...:   .assign(date=df.filter(regex=r'^\d{4}').loc[:, ::-1].notnull().idxmax(1))
Out[180]:
field1 field2 obs date
0 a 0.0 -10.0 2017
1 b 5.0 7.0 2015
2 c 10.0 12.0 2015
3 d 15.0 17.0 2015
4 e 20.0 21.0 2014
5 f 25.0 29.0 2017
6 g 30.0 34.0 2017

last_valid_index + agg('last')
A = data.iloc[:, 2:].apply(lambda x: x.last_valid_index(), axis=1)
B = data.groupby(['value'] * data.shape[1], axis=1).agg('last')
data['date'] = A
data['obs'] = B
data
Out[1326]:
field1 field2 2014 2015 2016 2017 date obs
0 a 0.0 1.0 2.0 3.0 4.0 2017 4.0
1 b 5.0 6.0 7.0 NaN NaN 2015 7.0
2 c 10.0 11.0 12.0 NaN NaN 2015 12.0
3 d 15.0 16.0 17.0 NaN NaN 2015 17.0
4 e 20.0 21.0 NaN NaN NaN 2014 21.0
5 f 25.0 26.0 27.0 28.0 29.0 2017 29.0
6 g 30.0 31.0 32.0 33.0 34.0 2017 34.0
Using assign, we can combine these into one line, as below:
data.assign(date=data.iloc[:, 2:].apply(lambda x: x.last_valid_index(), axis=1),
            obs=data.groupby(['value'] * data.shape[1], axis=1).agg('last'))
Out[1340]:
field1 field2 2014 2015 2016 2017 date obs
0 a 0.0 1.0 2.0 3.0 4.0 2017 4.0
1 b 5.0 6.0 7.0 NaN NaN 2015 7.0
2 c 10.0 11.0 12.0 NaN NaN 2015 12.0
3 d 15.0 16.0 17.0 NaN NaN 2015 17.0
4 e 20.0 21.0 NaN NaN NaN 2014 21.0
5 f 25.0 26.0 27.0 28.0 29.0 2017 29.0
6 g 30.0 31.0 32.0 33.0 34.0 2017 34.0
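In recent pandas versions, grouping along axis=1 is deprecated, so a similar result can be obtained by working on the year columns directly. A rough sketch, assuming the data frame from the question:
year_cols = [c for c in data.columns if c.isdigit()]
out = data[['field1', 'field2']].assign(
    # last non-null value per row across the year columns
    obs=data[year_cols].ffill(axis=1).iloc[:, -1],
    # column label of the last non-null value per row
    date=data[year_cols].apply(lambda r: r.last_valid_index(), axis=1),
)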

Also another possibility by using sort_values and drop_duplicates:
data.melt(id_vars=["field1", "field2"], var_name="date",
          value_name="obs")\
    .dropna(subset=['obs'])\
    .sort_values(['field1', 'date'], ascending=[True, False])\
    .drop_duplicates('field1', keep='first')
which gives you
field1 field2 date obs
21 a 0.0 2017 4.0
8 b 5.0 2015 7.0
9 c 10.0 2015 12.0
10 d 15.0 2015 17.0
4 e 20.0 2014 21.0
26 f 25.0 2017 29.0
27 g 30.0 2017 34.0
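Sorting the 'date' strings works here only because all years have the same width; converting them to integers first is a little safer (a sketch of the same idea):
melted = data.melt(id_vars=["field1", "field2"], var_name="date",
                   value_name="obs").dropna(subset=["obs"])
melted["date"] = melted["date"].astype(int)
melted.sort_values(["field1", "date"], ascending=[True, False])\
      .drop_duplicates("field1", keep="first")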

Related

Change yearly ordered dataframe to seasonally ordered dataframe

In Pandas, I would like to create columns, which will represent the season (e.g. travel season) starting from November and ending in October next year.
This is my snippet:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'date': pd.date_range('1990-01-01', freq='M', periods=12),
    'travel_2016': np.random.randint(10, size=12),
    'travel_2017': np.random.randint(10, size=12),
    'travel_2018': np.random.randint(10, size=12),
    'travel_2019': np.random.randint(10, size=12),
    'travel_2020': np.random.randint(10, size=12)})
df['month_date'] = df['date'].dt.strftime('%m')
df = df.drop(columns=['date'])
I was trying the approach from pandas groupby by customized year, e.g. a school year.
I failed after 'unpivoting' the table with both solutions. It would be easier for me to keep the pivoted table for future operations.
My desired output would be something like this:
season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 8 7 7 4 11
1 0 1 4 8 12
2 1 4 5 9 01
3 8 3 5 7 02
4 4 7 8 3 03
5 6 8 4 4 04
6 5 8 3 1 05
7 7 0 1 1 06
8 1 2 1 3 07
9 8 9 7 5 08
10 7 7 7 8 09
11 9 1 4 0 10
Many thanks!
Your table is already roughly formatted as you want: you're basically shifting all the rows down by 2 and moving the 2 bottom rows up to the start, but shifted into the next year's column.
>>> year_ends = df.shift(-10)
>>> year_ends = year_ends.drop(columns=['month_date']).shift(axis='columns').join(year_ends['month_date'])
>>> year_ends
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN
The rest is pretty easy:
>>> seasons = df.shift(2).fillna(year_ends)
>>> seasons
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Of course you should now rename the columns appropriately:
>>> seasons.rename(columns=lambda c: c if not c.startswith('travel_') else f"season_{int(c[7:]) - 1}/{c[7:]}")
season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Note that the first 2 values of the 2015/2016 season are NaN, which makes sense, as those months were not in the initial dataframe.
An alternate way is to use datetime tools. This may be more generic:
>>> data = df.set_index('month_date').rename_axis('year', axis='columns').stack().reset_index(name='data')
>>> data.head()
month_date year data
0 01 travel_2016 5
1 01 travel_2017 8
2 01 travel_2018 4
3 01 travel_2019 3
4 01 travel_2020 2
>>> dates = data['year'].str[7:].str.cat(data['month_date']).transform(pd.to_datetime, format='%Y%m')
>>> dates.head()
0 2016-01-01
1 2017-01-01
2 2018-01-01
3 2019-01-01
4 2020-01-01
Name: year, dtype: datetime64[ns]
Then, as in the linked question, get the fiscal year starting in November:
>>> season = dates.dt.to_period('Q-OCT').dt.qyear.rename('season')
>>> seasonal_data = data.join(season).pivot('month_date', 'season', 'data')
>>> seasonal_data.rename(columns=lambda c: f"season_{c - 1}/{c}", inplace=True)
>>> seasonal_data.reindex([*df['month_date'][-2:], *df['month_date'][:-2]]).reset_index()
season month_date season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 season_2020/2021
0 11 NaN 7.0 8.0 3.0 2.0 4.0
1 12 NaN 6.0 9.0 3.0 7.0 9.0
2 01 5.0 8.0 4.0 3.0 2.0 NaN
3 02 0.0 8.0 3.0 7.0 0.0 NaN
4 03 3.0 1.0 0.0 0.0 0.0 NaN
5 04 3.0 6.0 3.0 1.0 4.0 NaN
6 05 7.0 7.0 5.0 9.0 5.0 NaN
7 06 9.0 7.0 0.0 9.0 5.0 NaN
8 07 3.0 8.0 2.0 0.0 6.0 NaN
9 08 5.0 1.0 3.0 4.0 8.0 NaN
10 09 2.0 5.0 8.0 7.0 4.0 NaN
11 10 4.0 9.0 1.0 3.0 1.0 NaN
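As a quick sanity check of the Q-OCT trick (a small sketch, not part of the original data): a fiscal year ending in October means that November dates roll into the next season year.
pd.Series(pd.to_datetime(['2016-10-31', '2016-11-01'])).dt.to_period('Q-OCT').dt.qyear
# -> 2016 and 2017: October still belongs to 2016, November starts the 2017 season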

Pandas rolling but involving the last rows' values

I have this dataframe
import pandas as pd

hour = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
visitor = [4,6,2,4,3,7,5,7,8,3,2,8,3,6,4,5,1,8,9,4,2,3,4,1]
df = pd.DataFrame({"Hour": hour, "Total_Visitor": visitor})
print(df)
I applied a 6-window rolling sum:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
The first 5 rows give NaN values.
The problem is that I want to know the sum of total visitors from 9pm to 3am, so I have to sum visitors from hour 21, wrap around through hour 0, and continue up to hour 3.
How do you do that automatically with rolling?
I think you need to prepend the last N values, then use rolling and filter by the length of the original Series:
N = 6
df_roll = df.iloc[-N:].append(df).rolling(N).sum().iloc[-len(df):]
print (df_roll)
Hour Total_Visitor
0 105.0 18.0
1 87.0 20.0
2 69.0 20.0
3 51.0 21.0
4 33.0 20.0
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
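DataFrame.append is deprecated and later removed in newer pandas, so the same idea can be written with pd.concat (a sketch, assuming the df from the question):
N = 6
df_roll = pd.concat([df.iloc[-N:], df]).rolling(N).sum().iloc[-len(df):]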
Check original solution:
df_roll = df.rolling(6, min_periods=6).sum()
print(df_roll)
Hour Total_Visitor
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 15.0 26.0
6 21.0 27.0
7 27.0 28.0
8 33.0 34.0
9 39.0 33.0
10 45.0 32.0
11 51.0 33.0
12 57.0 31.0
13 63.0 30.0
14 69.0 26.0
15 75.0 28.0
16 81.0 27.0
17 87.0 27.0
18 93.0 33.0
19 99.0 31.0
20 105.0 29.0
21 111.0 27.0
22 117.0 30.0
23 123.0 23.0
A NumPy alternative with strides is more complicated, but faster for one large Series:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# fv is assumed to be a pandas Series defined elsewhere
N = 3
x = np.concatenate([fv[-N+1:], fv.to_numpy()])
cv = pd.Series(rolling_window(x, N).sum(axis=1), index=fv.index)
print (cv)
0 5
1 4
2 4
3 6
4 5
dtype: int64
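Applied to the hourly visitor data from the question, the same strided trick might look like this (a sketch reusing the rolling_window helper above):
N = 6
v = df['Total_Visitor'].to_numpy()
x = np.concatenate([v[-N + 1:], v])  # prepend the last N-1 hours for the wrap-around
wrap_sum = pd.Series(rolling_window(x, N).sum(axis=1), index=df.index)
print(wrap_sum.head())  # 18, 20, 20, 21, 20 -- matches the wrapped rolling sums above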
Though you mentioned a Series, see if this is helpful:
import pandas as pd

def cyclic_roll(s, n):
    s = s.append(s[:n-1])
    result = s.rolling(n).sum()
    return result[-n+1:].append(result[n-1:-n+1])

fv = pd.DataFrame([1, 2, 3, 4, 5])
cv = fv.apply(cyclic_roll, n=3)
cv.reset_index(inplace=True, drop=True)
print(cv)
Output
0
0 10.0
1 8.0
2 6.0
3 9.0
4 12.0
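Since Series.append has been removed in newer pandas, a sketch of the same helper using pd.concat:
def cyclic_roll(s, n):
    s = pd.concat([s, s[:n - 1]])
    result = s.rolling(n).sum()
    return pd.concat([result[-n + 1:], result[n - 1:-n + 1]])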

How do I change multiple values in a pandas df column to np.nan, based on a condition in another column?

I don't have much experience in coding and this is my first question, so please be patient with me. I need to find a way to change multiple values of a pandas df column to np.nan, based on a condition in another column. Therefore I have created copies of the required columns "Vorgabe" and "Temp".
Whenever the value in "Grad" isn't 0, I want to change the values in a defined area of "Vorgabe" and "Temp" to np.nan.
print(df)
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 22.0 20.5
2 22.0 21.0 8 1.0 22.0 21.0
3 22.0 21.0 6 0.0 22.0 21.0
4 22.0 23.5 7 0.0 22.0 20.0
5 23.0 21.5 1 0.0 23.0 21.5
6 24.0 22.5 3 1.0 24.0 22.5
7 24.0 23.0 4 0.0 24.0 23.0
8 24.0 25.5 9 0.0 24.0 25.5
So I want to achieve something like this:
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 nan nan <- one row above
2 22.0 21.0 8 1.0 nan nan
3 22.0 21.0 6 0.0 nan nan <- one row below
4 22.0 23.5 7 0.0 22.0 20.0
5 23.0 21.5 1 0.0 nan nan
6 24.0 22.5 3 1.0 nan nan
7 24.0 23.0 4 0.0 nan nan
8 24.0 25.5 9 0.0 24.0 25.5
Does somebody have a solution to my problem?
EDIT: I may have been unclear. The goal is to change every value in "Vorgabe" and "Temp" within a defined area to NaN. In my example the area would be one row above, the row with 1.0 in it, and one row below. So not only the row where 1.0 is located, but also the rows above and below it.
Use loc:
df.loc[df.Grad != 0.0, ['Vorgabe', 'Temp']] = np.nan
print(df)
Output
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 22.0 20.5
2 22.0 21.0 8 1.0 NaN NaN
3 22.0 21.0 6 0.0 22.0 21.0
4 22.0 23.5 7 0.0 22.0 20.0
5 23.0 21.5 1 0.0 23.0 21.5
6 24.0 22.5 3 1.0 NaN NaN
7 24.0 23.0 4 0.0 24.0 23.0
8 24.0 25.5 9 0.0 24.0 25.5
You could use numpy.where.
import numpy as np
df['Vorgabe'] = np.where(df['Grad'] != 0, np.nan, df['OptOpTemp'])
df['Temp'] = np.where(df['Grad'] != 0, np.nan, df['OpTemp'])
Chain 3 conditions with | for bitwise OR; for the rows above and below a 1, use masks with shift:
mask1 = df['Grad'] == 1
mask2 = df['Grad'].shift() == 1
mask3 = df['Grad'].shift(-1) == 1
Or, to test against any non-zero value instead of exactly 1 (shift introduces NaN, which compares unequal to 0, so fill it first):
mask1 = df['Grad'] != 0
mask2 = df['Grad'].shift().fillna(0) != 0
mask3 = df['Grad'].shift(-1).fillna(0) != 0
Then combine the masks and apply them:
mask = mask1 | mask2 | mask3
df.loc[mask, ['Vorgabe', 'Temp']] = np.nan
print (df)
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 NaN NaN
2 22.0 21.0 8 1.0 NaN NaN
3 22.0 21.0 6 0.0 NaN NaN
4 22.0 23.5 7 0.0 22.0 20.0
5 23.0 21.5 1 0.0 NaN NaN
6 24.0 22.5 3 1.0 NaN NaN
7 24.0 23.0 4 0.0 NaN NaN
8 24.0 25.5 9 0.0 24.0 25.5
General solution for multiple rows:
N = 1
# create range of shift offsets from -N to N
r = np.concatenate([np.arange(0, N+1), np.arange(-1, -N-1, -1)])
# create boolean mask by comparing with shift and combine with logical_or.reduce
mask = np.logical_or.reduce([df['Grad'].shift(x) == 1 for x in r])
df.loc[mask, ['Vorgabe', 'Temp']] = np.nan
EDIT:
You can join both masks together:
N = 1
r1 = np.concatenate([np.arange(0, N+1), np.arange(-1, -N-1, -1)])
mask1 = np.logical_or.reduce([df['Grad'].shift(x) == 1 for x in r1])
N = 2
r2 = np.concatenate([np.arange(0, N+1), np.arange(-1, -N-1, -1)])
mask2 = np.logical_or.reduce([df['Grad'].shift(x) == 1.5 for x in r2])
# if == 1.5 does not work because of floating-point precision, use np.isclose instead:
#mask2 = np.logical_or.reduce([np.isclose(df['Grad'].shift(x), 1.5) for x in r2])
mask = mask1 | mask2
df.loc[mask, ['Vorgabe', 'Temp']] = np.nan
print (df)
OptOpTemp OpTemp BSP Grad Vorgabe Temp
0 22.0 20.0 5 0.0 22.0 20.0
1 22.0 20.5 7 0.0 NaN NaN
2 22.0 21.0 8 1.0 NaN NaN
3 22.0 21.0 6 0.0 NaN NaN
4 22.0 23.5 7 0.0 NaN NaN
5 23.0 21.5 1 0.0 NaN NaN
6 24.0 22.5 3 1.5 NaN NaN <- changed value to 1.5
7 24.0 23.0 4 0.0 NaN NaN
8 24.0 25.5 9 0.0 NaN NaN
You can use df.apply(f, axis=1) and define f to do what you want on each row. Your description seems to be saying you want:
def f(row):
    if row['Grad'] != 0:
        row.loc[['Vorgabe', 'Temp']] = np.nan
    return row
However, your example seems to be saying you want something else.
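If the goal really is to blank out the neighbouring rows as well, one more option (a sketch, not taken from the other answers) is a centered rolling window over the condition:
import numpy as np
# True for any row that is within one row of a non-zero Grad value
near = df['Grad'].ne(0).astype(int).rolling(3, center=True, min_periods=1).max().astype(bool)
df.loc[near, ['Vorgabe', 'Temp']] = np.nan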

Efficiently updating NaNs in a pandas dataframe from a prior row & a specific column's value

I have a pandas DataFrame; it looks like this:
# Output
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 NaN
# 5 5.0 2.0 5.0 34.0
# 6 NaN NaN NaN NaN
# 7 NaN NaN NaN NaN
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 NaN NaN NaN NaN
# 12 23.0 NaN 23.0 34.0
# 13 61.0 NaN 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
What I would like to do is fill the NaNs in columns A, B and C with the nearest preceding row's non-null B value. For column D, I would like NaNs replaced with zeros.
I've looked into ffill and fillna, but neither seems to be able to do the job.
My solution so far:
def fix_abc(row, column, df):
    # If the row/column value is null/nan
    if pd.isnull(row[column]):
        # Get the value of column B from the row before
        prior = row.name
        value = df[prior-1:prior]['B'].values[0]
        # If that value is empty, go to the row before that
        while pd.isnull(value) and prior >= 1:
            prior = prior - 1
            value = df[prior-1:prior]['B'].values[0]
    else:
        value = row[column]
    return value

df['A'] = df.apply(lambda x: fix_abc(x, 'A', df), axis=1)
df['B'] = df.apply(lambda x: fix_abc(x, 'B', df), axis=1)
df['C'] = df.apply(lambda x: fix_abc(x, 'C', df), axis=1)

def fix_d(x):
    if pd.isnull(x['D']):
        return 0
    return x['D']

df['D'] = df.apply(lambda x: fix_d(x), axis=1)
This feels quite inefficient and slow, so I'm wondering if there is a quicker, more efficient way to do it.
Example output;
# A B C D
# 0 3.0 6.0 7.0 3.0
# 1 42.0 44.0 1.0 42.0
# 2 4.0 2.0 3.0 4.0
# 3 90.0 83.0 53.0 90.0
# 4 22.0 23.0 24.0 0.0
# 5 5.0 2.0 5.0 5.0
# 6 2.0 2.0 2.0 0.0
# 7 2.0 2.0 2.0 0.0
# 8 2.0 12.0 65.0 2.0
# 9 5.0 7.0 32.0 5.0
# 10 2.0 13.0 6.0 2.0
# 11 13.0 13.0 13.0 0.0
# 12 23.0 13.0 23.0 23.0
# 13 61.0 13.0 63.0 61.0
# 14 32.0 43.0 12.0 32.0
# 15 24.0 2.0 34.0 24.0
I have dumped the code including the data for the dataframe into a python fiddle available (here)
fillna allows for various ways to do the filling. In this case, column D can simply be filled with 0, column B can be filled via pad (forward fill), and then columns A and C can be filled from column B, like:
Code:
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
Test Code:
df = pd.read_fwf(StringIO(u"""
A B C D
3.0 6.0 7.0 4.0
42.0 44.0 1.0 3.0
4.0 2.0 3.0 62.0
90.0 83.0 53.0 23.0
22.0 23.0 24.0 NaN
5.0 2.0 5.0 34.0
NaN NaN NaN NaN
NaN NaN NaN NaN
2.0 12.0 65.0 1.0
5.0 7.0 32.0 7.0
2.0 13.0 6.0 12.0
NaN NaN NaN NaN
23.0 NaN 23.0 34.0
61.0 NaN 63.0 3.0
32.0 43.0 12.0 76.0
24.0 2.0 34.0 2.0"""), header=1)
print(df)
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
print(df)
Results:
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 NaN
5 5.0 2.0 5.0 34.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 NaN NaN NaN NaN
12 23.0 NaN 23.0 34.0
13 61.0 NaN 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 0.0
5 5.0 2.0 5.0 34.0
6 2.0 2.0 2.0 0.0
7 2.0 2.0 2.0 0.0
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 13.0 13.0 13.0 0.0
12 23.0 13.0 23.0 34.0
13 61.0 13.0 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
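As a side note, fillna(method='pad') is deprecated in recent pandas in favour of the ffill method, so the column B line could instead be written as (a sketch):
df['B'] = df['B'].ffill()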

Combining dataframes in pandas with the same rows and columns, but different cell values

I'm interested in combining two dataframes in pandas that have the same row indices and column names, but different cell values. See the example below:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A':[22,2,np.NaN,np.NaN],
'B':[23,4,np.NaN,np.NaN],
'C':[24,6,np.NaN,np.NaN],
'D':[25,8,np.NaN,np.NaN]})
df2 = pd.DataFrame({'A':[np.NaN,np.NaN,56,100],
'B':[np.NaN,np.NaN,58,101],
'C':[np.NaN,np.NaN,59,102],
'D':[np.NaN,np.NaN,60,103]})
In[6]: print(df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In[7]: print(df2)
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I would like the resulting frame to look like this:
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
I have tried different ways of pd.concat and pd.merge but some of the data always gets replaced with NaNs. Any pointers in the right direction would be greatly appreciated.
Use combine_first:
print (df1.combine_first(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or fillna:
print (df1.fillna(df2))
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Or update:
df1.update(df2)
print (df1)
A B C D
0 22.0 23.0 24.0 25.0
1 2.0 4.0 6.0 8.0
2 56.0 58.0 59.0 60.0
3 100.0 101.0 102.0 103.0
Use combine_first
df1.combine_first(df2)
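One practical difference worth keeping in mind (not shown above): combine_first and fillna return a new DataFrame, while update modifies df1 in place and returns None. A small usage sketch:
result = df1.combine_first(df2)  # new DataFrame, df1 untouched
df1.update(df2)                  # modifies df1 in place, returns None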
