Python: remove rows with max value in each hour - python

I have a pandas data frame df like this.
date id eng math sci
2021-08-01 00:00:37 23 4.0 5.0 7.0
2021-08-01 00:05:37 23 4.0 4.0 5.0
2021-08-01 00:10:37 23 4.0 4.0 6.0
2021-08-01 00:15:38 23 4.0 4.0 5.0
2021-08-01 00:20:37 23 4.0 5.0 6.0
2021-08-01 00:25:37 23 4.0 5.0 7.0
... ... ... ... ...
2021-08-31 23:38:40 1995 4.0 4.0 5.0
2021-08-31 23:43:40 1995 4.0 4.0 4.0
2021-08-31 23:48:40 1995 4.0 5.0 5.0
2021-08-31 23:53:40 1995 4.0 4.0 4.0
2021-08-31 23:58:40 1995 4.0 5.0 7.0
1661089 rows × 4 columns
I want to remove rows with the maximum sci value in each hour.
In each hour, I would like to remove exactly 1 maximum sci value. If there are 2 maximum values in the same hour, as in the case above, remove just the first row.
So the result should look like:
date id eng math sci
2021-08-01 00:05:37 23 4.0 4.0 5.0
2021-08-01 00:10:37 23 4.0 4.0 6.0
2021-08-01 00:15:38 23 4.0 4.0 5.0
2021-08-01 00:20:37 23 4.0 5.0 6.0
2021-08-01 00:25:37 23 4.0 5.0 7.0
... ... ... ... ...
2021-08-31 23:38:40 1995 4.0 4.0 5.0
2021-08-31 23:43:40 1995 4.0 4.0 4.0
2021-08-31 23:48:40 1995 4.0 5.0 5.0
2021-08-31 23:53:40 1995 4.0 4.0 4.0
My first attempt:
df_filtered = df.reset_index()
df_temp_max = (df_filtered.groupby(['id', pd.Grouper(key='date', freq='1H')])
.agg({'sci': 'max'})
.reset_index())
df_test_max = pd.Series(df_temp_max['sci'].values)
df_filtered.insert(5, 'sci_max', df_test_max, True)
I got:
date id eng math sci sci_max
0 2021-08-01 00:00:37 23 4.0 5.0 7.0 7.0
1 2021-08-01 00:05:37 23 4.0 4.0 5.0 7.0
2 2021-08-01 00:10:37 23 4.0 4.0 6.0 7.0
3 2021-08-01 00:15:38 23 4.0 4.0 5.0 7.0
4 2021-08-01 00:20:37 23 4.0 5.0 6.0 7.0
... ... ... ... ... ... ...
1661084 2021-08-31 23:38:40 1995 4.0 4.0 5.0 NaN
1661085 2021-08-31 23:43:40 1995 4.0 4.0 4.0 NaN
1661086 2021-08-31 23:48:40 1995 4.0 5.0 5.0 NaN
1661087 2021-08-31 23:53:40 1995 4.0 4.0 4.0 NaN
1661088 2021-08-31 23:58:40 1995 4.0 5.0 7.0 NaN
Of course, this is not correct: there are many NaN values.
I tried using a for loop, but it took too much time, and removing a row caused an indexing error as well.
Could you help me solve this problem, please? Thank you so much!

Use idxmax instead of max to get the index to remove per group:
idx = df.groupby(['id', pd.Grouper(key='date', freq='H')])['sci'].idxmax()
out = df.drop(idx)
Output:
>>> idx
id date
23 2021-08-01 00:00:00 0
1995 2021-08-31 23:00:00 10
Name: sci, dtype: int64
>>> out
date id eng math sci
1 2021-08-01 00:05:37 23 4.0 4.0 5.0
2 2021-08-01 00:10:37 23 4.0 4.0 6.0
3 2021-08-01 00:15:38 23 4.0 4.0 5.0
4 2021-08-01 00:20:37 23 4.0 5.0 6.0
5 2021-08-01 00:25:37 23 4.0 5.0 7.0
6 2021-08-31 23:38:40 1995 4.0 4.0 5.0
7 2021-08-31 23:43:40 1995 4.0 4.0 4.0
8 2021-08-31 23:48:40 1995 4.0 5.0 5.0
9 2021-08-31 23:53:40 1995 4.0 4.0 4.0
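The same idea can be checked on a small, made-up frame. This is only a sketch: the sample rows are invented, and the hour is derived from dt.normalize() and dt.hour instead of pd.Grouper, an assumption made here to avoid frequency aliases that differ between pandas versions. Because idxmax returns the first occurrence of the maximum, a tie inside an hour removes only the first tied row.

```python
import pandas as pd

# Invented sample: id 23 has a tie (7.0 twice) in hour 00 and a second hour 01.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-08-01 00:00:37", "2021-08-01 00:05:37", "2021-08-01 00:25:37",
        "2021-08-01 01:00:37", "2021-08-01 01:05:37",
        "2021-08-31 23:53:40", "2021-08-31 23:58:40",
    ]),
    "id": [23, 23, 23, 23, 23, 1995, 1995],
    "sci": [7.0, 5.0, 7.0, 4.0, 6.0, 4.0, 7.0],
})

# One group per (id, calendar day, hour of day).
keys = [
    df["id"],
    df["date"].dt.normalize().rename("day"),
    df["date"].dt.hour.rename("hour"),
]

# Index label of the first maximum in each group.
idx = df.groupby(keys)["sci"].idxmax()

# Drop exactly one row per group.
out = df.drop(idx)
print(out)
```

Row 2 (the second 7.0 in hour 00) is kept, which matches the "remove just the first row" requirement.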

Related

Change yearly ordered dataframe to seasonally ordered dataframe

In Pandas, I would like to create columns, which will represent the season (e.g. travel season) starting from November and ending in October next year.
This is my snippet:
import numpy as np
from numpy import random
import pandas as pd
np.random.seed(0)
df = pd.DataFrame({
'date': pd.date_range('1990-01-01', freq='M', periods=12),
'travel_2016': random.randint(10, size=(12)),
'travel_2017': random.randint(10, size=(12)),
'travel_2018': random.randint(10, size=(12)),
'travel_2019': random.randint(10, size=(12)),
'travel_2020': random.randint(10, size=(12))})
df['month_date'] = df['date'].dt.strftime('%m')
df = df.drop(columns = ['date'])
I tried the approach from "pandas groupby by customized year, e.g. a school year".
I failed after 'unpivoting' the table with both solutions. It would be easier for me to keep the pivot table for future operations.
My desired output would be something like this:
season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 8 7 7 4 11
1 0 1 4 8 12
2 1 4 5 9 01
3 8 3 5 7 02
4 4 7 8 3 03
5 6 8 4 4 04
6 5 8 3 1 05
7 7 0 1 1 06
8 1 2 1 3 07
9 8 9 7 5 08
10 7 7 7 8 09
11 9 1 4 0 10
Many thanks!
Your table is already formatted roughly as you want: you're basically shifting all the rows down by 2, and moving the 2 bottom rows up to the start, but shifted into the next year.
>>> year_ends = df.shift(-10)
>>> year_ends = year_ends.drop(columns=['month_date']).shift(axis='columns').join(year_ends['month_date'])
>>> year_ends
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN
The rest is pretty easy:
>>> seasons = df.shift(2).fillna(year_ends)
>>> seasons
travel_2016 travel_2017 travel_2018 travel_2019 travel_2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Of course you should now rename the columns appropriately:
>>> seasons.rename(columns=lambda c: c if not c.startswith('travel_') else f"season_{int(c[7:]) - 1}/{c[7:]}")
season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 month_date
0 NaN 7.0 8.0 3.0 2.0 11
1 NaN 6.0 9.0 3.0 7.0 12
2 5.0 8.0 4.0 3.0 2.0 01
3 0.0 8.0 3.0 7.0 0.0 02
4 3.0 1.0 0.0 0.0 0.0 03
5 3.0 6.0 3.0 1.0 4.0 04
6 7.0 7.0 5.0 9.0 5.0 05
7 9.0 7.0 0.0 9.0 5.0 06
8 3.0 8.0 2.0 0.0 6.0 07
9 5.0 1.0 3.0 4.0 8.0 08
10 2.0 5.0 8.0 7.0 4.0 09
11 4.0 9.0 1.0 3.0 1.0 10
Note that the 2 first values of 2015 are NaN, which makes sense, as those were not in the initial dataframe.
An alternate way is to use datetime tools. This may be more generic:
>>> data = df.set_index('month_date').rename_axis('year', axis='columns').stack().reset_index(name='data')
>>> data.head()
month_date year data
0 01 travel_2016 5
1 01 travel_2017 8
2 01 travel_2018 4
3 01 travel_2019 3
4 01 travel_2020 2
>>> dates = data['year'].str[7:].str.cat(data['month_date']).transform(pd.to_datetime, format='%Y%m')
>>> dates.head()
0 2016-01-01
1 2017-01-01
2 2018-01-01
3 2019-01-01
4 2020-01-01
Name: year, dtype: datetime64[ns]
Then, as in the linked question, get the fiscal year starting in November:
>>> season = dates.dt.to_period('Q-OCT').dt.qyear.rename('season')
>>> seasonal_data = data.join(season).pivot(index='month_date', columns='season', values='data')
>>> seasonal_data.rename(columns=lambda c: f"season_{c - 1}/{c}", inplace=True)
>>> seasonal_data.reindex([*df['month_date'][-2:], *df['month_date'][:-2]]).reset_index()
season month_date season_2015/2016 season_2016/2017 season_2017/2018 season_2018/2019 season_2019/2020 season_2020/2021
0 11 NaN 7.0 8.0 3.0 2.0 4.0
1 12 NaN 6.0 9.0 3.0 7.0 9.0
2 01 5.0 8.0 4.0 3.0 2.0 NaN
3 02 0.0 8.0 3.0 7.0 0.0 NaN
4 03 3.0 1.0 0.0 0.0 0.0 NaN
5 04 3.0 6.0 3.0 1.0 4.0 NaN
6 05 7.0 7.0 5.0 9.0 5.0 NaN
7 06 9.0 7.0 0.0 9.0 5.0 NaN
8 07 3.0 8.0 2.0 0.0 6.0 NaN
9 08 5.0 1.0 3.0 4.0 8.0 NaN
10 09 2.0 5.0 8.0 7.0 4.0 NaN
11 10 4.0 9.0 1.0 3.0 1.0 NaN
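To see how the 'Q-OCT' period maps months to seasons, here is a minimal sketch on a handful of hand-picked dates (the dates and the label format are chosen for illustration only). A quarterly period whose fiscal year ends in October puts November and December into the following year's qyear, which is exactly the "season ending year".

```python
import pandas as pd

# Hand-picked months around the October/November boundary.
months = pd.Series(pd.to_datetime([
    "2016-10-01", "2016-11-01", "2016-12-01",
    "2017-01-01", "2017-10-01", "2017-11-01",
]))

# Fiscal year ending in October: qyear is the year in which the season ends.
season_end = months.dt.to_period("Q-OCT").dt.qyear

# Cross-check with plain arithmetic: November and December belong to next year's season.
by_hand = months.dt.year + (months.dt.month >= 11)

labels = [f"season_{y - 1}/{y}" for y in season_end]
print(labels)
```

October 2016 lands in season_2015/2016 while November 2016 lands in season_2016/2017, consistent with the pivoted output above.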

pandas groupby rolling behaviour

Here is my DataFrame:
df = pd.DataFrame({
'location': ['USA','USA','USA','USA', 'France','France','France','France'],
'date':['2020-11-20','2020-11-21','2020-11-22','2020-11-23', '2020-11-20','2020-11-21','2020-11-22','2020-11-23'],
'dm':[5.,4.,2.,2.,17.,3.,3.,7.]
})
For each location (so groupby is needed) I want the rolling mean of dm over 2 days. If I use this:
df['rolling']=df.groupby('location').dm.rolling(2).mean().values
I obtain this incorrect result:
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 10.0
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 5.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 4.5
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 2.0
While it should be:
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 4.5
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 2.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 10
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 5.0
Two questions:
What is my syntax actually doing?
What is the correct way to proceed?
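The problem is that groupby adds a new level to the index (making it a MultiIndex), so to match the original index values you need to remove that level with Series.reset_index and drop=True. If you use .values there is no alignment by index, so the values are assigned in the grouped order (France first, then USA), which differs from the original row order. The fix: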
df['rolling']=df.groupby('location').dm.rolling(2).mean().reset_index(level=0, drop=True)
print (df)
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 4.5
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 2.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 10.0
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 5.0
Details:
print (df.groupby('location').dm.rolling(2).mean())
location
France 4 NaN
5 10.0
6 3.0
7 5.0
USA 0 NaN
1 4.5
2 3.0
3 2.0
Name: dm, dtype: float64
print (df.groupby('location').dm.rolling(2).mean().reset_index(level=0, drop=True))
4 NaN
5 10.0
6 3.0
7 5.0
0 NaN
1 4.5
2 3.0
3 2.0
Name: dm, dtype: float64
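As an alternative sketch (not part of the answer above), GroupBy.transform returns a result already aligned to the original index, so no reset_index is needed. The example also reproduces the misaligned .values assignment for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["USA"] * 4 + ["France"] * 4,
    "date": ["2020-11-20", "2020-11-21", "2020-11-22", "2020-11-23"] * 2,
    "dm": [5.0, 4.0, 2.0, 2.0, 17.0, 3.0, 3.0, 7.0],
})

# transform keeps the original row order and index.
df["rolling"] = df.groupby("location")["dm"].transform(lambda s: s.rolling(2).mean())

# What .values does: groups come back sorted by key (France, then USA),
# so a positional assignment puts France's means on the USA rows.
misaligned = df.groupby("location")["dm"].rolling(2).mean().to_numpy()

print(df)
print(misaligned)
```

The lambda is slower than the built-in groupby rolling on large frames, so this is mostly useful when readability matters more than speed.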

how to sort dataframe rows in pandas wrt to months from Jan to Dec

How can we sort the rows of the dataframe below by month, from Jan to Dec?
Currently the dataframe is in alphabetical order.
0 Col1 Col2 Col3 ... Col22 Col23 Col24
1 April 53.0 0.0 ... 11.0 0.0 0.0
2 August 43.0 0.0 ... 11.0 3.0 5.0
3 December 36.0 0.0 ... 4.0 1.0 0.0
4 February 48.0 0.0 ... 16.0 0.0 0.0
5 January 55.0 0.0 ... 24.0 4.0 0.0
6 July 45.0 0.0 ... 4.0 8.0 1.0
7 June 34.0 0.0 ... 4.0 8.0 1.0
8 March 34.0 2.0 ... 24.0 4.0 1.0
9 May 52.0 1.0 ... 3.0 2.0 1.0
10 November 33.0 0.0 ... 7.0 2.0 3.0
11 October 21.0 1.0 ... 7.0 1.0 2.0
12 September 27.0 0.0 ... 5.0 3.0 3.0
We can also use pd.date_range with month_name() and month:
month = pd.date_range(start='2018-01', freq='M', periods=12)
df.loc[df['Col1'].map(dict(zip(month.month_name(),month.month))).sort_values().index]
Col1 Col2 Col3 Col22 Col23 Col24
5 January 55.0 0.0 24.0 4.0 0.0
4 February 48.0 0.0 16.0 0.0 0.0
8 March 34.0 2.0 24.0 4.0 1.0
1 April 53.0 0.0 11.0 0.0 0.0
9 May 52.0 1.0 3.0 2.0 1.0
7 June 34.0 0.0 4.0 8.0 1.0
6 July 45.0 0.0 4.0 8.0 1.0
2 August 43.0 0.0 11.0 3.0 5.0
12 September 27.0 0.0 5.0 3.0 3.0
11 October 21.0 1.0 7.0 1.0 2.0
10 November 33.0 0.0 7.0 2.0 3.0
3 December 36.0 0.0 4.0 1.0 0.0
You can use calendar to create a mapping from month name to month number, then sort the values and reindex:
import calendar
df.reindex(df['Col1'].map({i:e
for e,i in enumerate(calendar.month_name)}).sort_values().index)
Col1 Col2 Col3 ... Col22 Col23 Col24
5 January 55.0 0.0 ... 24.0 4.0 0.0
4 February 48.0 0.0 ... 16.0 0.0 0.0
8 March 34.0 2.0 ... 24.0 4.0 1.0
1 April 53.0 0.0 ... 11.0 0.0 0.0
9 May 52.0 1.0 ... 3.0 2.0 1.0
7 June 34.0 0.0 ... 4.0 8.0 1.0
6 July 45.0 0.0 ... 4.0 8.0 1.0
2 August 43.0 0.0 ... 11.0 3.0 5.0
12 September 27.0 0.0 ... 5.0 3.0 3.0
11 October 21.0 1.0 ... 7.0 1.0 2.0
10 November 33.0 0.0 ... 7.0 2.0 3.0
3 December 36.0 0.0 ... 4.0 1.0 0.0
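Another option, sketched here on a reduced, made-up version of the table, is to turn the month column into an ordered categorical so that sort_values follows calendar order directly. This assumes the month names are the English ones produced by calendar.month_name under the default locale.

```python
import calendar

import pandas as pd

# Reduced sample in alphabetical order, as in the question.
df = pd.DataFrame({
    "Col1": ["April", "August", "December", "February", "January"],
    "Col2": [53.0, 43.0, 36.0, 48.0, 55.0],
})

# calendar.month_name[0] is an empty string, so skip it.
month_order = list(calendar.month_name)[1:]

# An ordered categorical sorts by category position, not alphabetically.
df["Col1"] = pd.Categorical(df["Col1"], categories=month_order, ordered=True)
out = df.sort_values("Col1")
print(out)
```

Names that are not in month_order (for example abbreviations) would become NaN, so the column should be checked before converting.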

Getting most recent observation & date from several columns

Take the following toy DataFrame:
data = np.arange(35, dtype=np.float32).reshape(7, 5)
data = pd.concat((
pd.DataFrame(list('abcdefg'), columns=['field1']),
pd.DataFrame(data, columns=['field2', '2014', '2015', '2016', '2017'])),
axis=1)
data.iloc[1:4, 4:] = np.nan
data.iloc[4, 3:] = np.nan
print(data)
field1 field2 2014 2015 2016 2017
0 a 0.0 1.0 2.0 3.0 4.0
1 b 5.0 6.0 7.0 NaN NaN
2 c 10.0 11.0 12.0 NaN NaN
3 d 15.0 16.0 17.0 NaN NaN
4 e 20.0 21.0 NaN NaN NaN
5 f 25.0 26.0 27.0 28.0 29.0
6 g 30.0 31.0 32.0 33.0 34.0
I'd like to replace the "year" columns (2014-2017) with two fields: the most recent non-null observation, and the corresponding year of that observation. Assume field1 is a unique key. (I'm not looking to do any groupby ops, just 1 row per record.) I.e.:
field1 field2 obs date
0 a 0.0 4.0 2017
1 b 5.0 7.0 2015
2 c 10.0 12.0 2015
3 d 15.0 17.0 2015
4 e 20.0 21.0 2014
5 f 25.0 29.0 2017
6 g 30.0 34.0 2017
I've gotten this far:
pd.melt(data, id_vars=['field1', 'field2'],
value_vars=['2014', '2015', '2016', '2017'])\
.dropna(subset=['value'])
field1 field2 variable value
0 a 0.0 2014 1.0
1 b 5.0 2014 6.0
2 c 10.0 2014 11.0
3 d 15.0 2014 16.0
4 e 20.0 2014 21.0
5 f 25.0 2014 26.0
6 g 30.0 2014 31.0
# ...
But am struggling with how to pivot back to desired format.
Maybe:
d2 = data.melt(id_vars=["field1", "field2"], var_name="date", value_name="obs").dropna(subset=["obs"])
d2["date"] = d2["date"].astype(int)
df = d2.loc[d2.groupby(["field1", "field2"])["date"].idxmax()]
which gives me
field1 field2 date obs
21 a 0.0 2017 4.0
8 b 5.0 2015 7.0
9 c 10.0 2015 12.0
10 d 15.0 2015 17.0
4 e 20.0 2014 21.0
26 f 25.0 2017 29.0
27 g 30.0 2017 34.0
What about the following approach:
In [160]: df
Out[160]:
field1 field2 2014 2015 2016 2017
0 a 0.0 1.0 2.0 3.0 -10.0
1 b 5.0 6.0 7.0 NaN NaN
2 c 10.0 11.0 12.0 NaN NaN
3 d 15.0 16.0 17.0 NaN NaN
4 e 20.0 21.0 NaN NaN NaN
5 f 25.0 26.0 27.0 28.0 29.0
6 g 30.0 31.0 32.0 33.0 34.0
In [180]: df.groupby(lambda x: 'obs' if x.isdigit() else x, axis=1) \
...: .last() \
...: .assign(date=df.filter(regex='^\d{4}').loc[:, ::-1].notnull().idxmax(1))
Out[180]:
field1 field2 obs date
0 a 0.0 -10.0 2017
1 b 5.0 7.0 2015
2 c 10.0 12.0 2015
3 d 15.0 17.0 2015
4 e 20.0 21.0 2014
5 f 25.0 29.0 2017
6 g 30.0 34.0 2017
last_valid_index + agg('last')
A=data.iloc[:,2:].apply(lambda x : x.last_valid_index(),1)
B=data.groupby(['value'] * data.shape[1], 1).agg('last')
data['date']=A
data['obs']=B
data
Out[1326]:
field1 field2 2014 2015 2016 2017 date obs
0 a 0.0 1.0 2.0 3.0 4.0 2017 4.0
1 b 5.0 6.0 7.0 NaN NaN 2015 7.0
2 c 10.0 11.0 12.0 NaN NaN 2015 12.0
3 d 15.0 16.0 17.0 NaN NaN 2015 17.0
4 e 20.0 21.0 NaN NaN NaN 2014 21.0
5 f 25.0 26.0 27.0 28.0 29.0 2017 29.0
6 g 30.0 31.0 32.0 33.0 34.0 2017 34.0
By using assign we can put them into one line, as below:
data.assign(date=data.iloc[:,2:].apply(lambda x : x.last_valid_index(),1),obs=data.groupby(['value'] * data.shape[1], 1).agg('last'))
Out[1340]:
field1 field2 2014 2015 2016 2017 date obs
0 a 0.0 1.0 2.0 3.0 4.0 2017 4.0
1 b 5.0 6.0 7.0 NaN NaN 2015 7.0
2 c 10.0 11.0 12.0 NaN NaN 2015 12.0
3 d 15.0 16.0 17.0 NaN NaN 2015 17.0
4 e 20.0 21.0 NaN NaN NaN 2014 21.0
5 f 25.0 26.0 27.0 28.0 29.0 2017 29.0
6 g 30.0 31.0 32.0 33.0 34.0 2017 34.0
Also another possibility by using sort_values and drop_duplicates:
data.melt(id_vars=["field1", "field2"], var_name="date",
value_name="obs")\
.dropna(subset=['obs'])\
.sort_values(['field1', 'date'], ascending=[True, False])\
.drop_duplicates('field1', keep='first')
which gives you
field1 field2 date obs
21 a 0.0 2017 4.0
8 b 5.0 2015 7.0
9 c 10.0 2015 12.0
10 d 15.0 2015 17.0
4 e 20.0 2014 21.0
26 f 25.0 2017 29.0
27 g 30.0 2017 34.0

Efficiently updating NaN's in a pandas dataframe from a prior row & specific columns value

I have a pandas DataFrame that looks like this:
# Output
# A B C D
# 0 3.0 6.0 7.0 4.0
# 1 42.0 44.0 1.0 3.0
# 2 4.0 2.0 3.0 62.0
# 3 90.0 83.0 53.0 23.0
# 4 22.0 23.0 24.0 NaN
# 5 5.0 2.0 5.0 34.0
# 6 NaN NaN NaN NaN
# 7 NaN NaN NaN NaN
# 8 2.0 12.0 65.0 1.0
# 9 5.0 7.0 32.0 7.0
# 10 2.0 13.0 6.0 12.0
# 11 NaN NaN NaN NaN
# 12 23.0 NaN 23.0 34.0
# 13 61.0 NaN 63.0 3.0
# 14 32.0 43.0 12.0 76.0
# 15 24.0 2.0 34.0 2.0
What I would like to do is fill the NaNs with the B value of the nearest preceding row that has one. The exception is column D, where I would like NaNs replaced with zeros.
I've looked into ffill and fillna, but neither seems to be able to do the job.
My solution so far:
def fix_abc(row, column, df):
    # If the row/column value is null/nan
    if pd.isnull( row[column] ):
        # Get the value of row[column] from the row before
        prior = row.name
        value = df[prior-1:prior]['B'].values[0]
        # If that values empty, go to the row before that
        while pd.isnull( value ) and prior >= 1 :
            prior = prior - 1
            value = df[prior-1:prior]['B'].values[0]
    else:
        value = row[column]
    return value
df['A'] = df.apply( lambda x: fix_abc(x,'A',df), axis=1 )
df['B'] = df.apply( lambda x: fix_abc(x,'B',df), axis=1 )
df['C'] = df.apply( lambda x: fix_abc(x,'C',df), axis=1 )
def fix_d(x):
    if pd.isnull(x['D']):
        return 0
    return x
df['D'] = df.apply( lambda x: fix_d(x), axis=1 )
It feels like this is quite inefficient and slow, so I'm wondering if there is a quicker, more efficient way to do this.
Example output;
# A B C D
# 0 3.0 6.0 7.0 3.0
# 1 42.0 44.0 1.0 42.0
# 2 4.0 2.0 3.0 4.0
# 3 90.0 83.0 53.0 90.0
# 4 22.0 23.0 24.0 0.0
# 5 5.0 2.0 5.0 5.0
# 6 2.0 2.0 2.0 0.0
# 7 2.0 2.0 2.0 0.0
# 8 2.0 12.0 65.0 2.0
# 9 5.0 7.0 32.0 5.0
# 10 2.0 13.0 6.0 2.0
# 11 13.0 13.0 13.0 0.0
# 12 23.0 13.0 23.0 23.0
# 13 61.0 13.0 63.0 61.0
# 14 32.0 43.0 12.0 32.0
# 15 24.0 2.0 34.0 24.0
I have dumped the code including the data for the dataframe into a python fiddle available (here)
fillna allows for various ways to do the filling. In this case, column D can just fill with 0. Column B can fill via pad. And then columns A and C can fill from column B, like:
Code:
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
Test Code:
from io import StringIO
import pandas as pd
df = pd.read_fwf(StringIO(u"""
A B C D
3.0 6.0 7.0 4.0
42.0 44.0 1.0 3.0
4.0 2.0 3.0 62.0
90.0 83.0 53.0 23.0
22.0 23.0 24.0 NaN
5.0 2.0 5.0 34.0
NaN NaN NaN NaN
NaN NaN NaN NaN
2.0 12.0 65.0 1.0
5.0 7.0 32.0 7.0
2.0 13.0 6.0 12.0
NaN NaN NaN NaN
23.0 NaN 23.0 34.0
61.0 NaN 63.0 3.0
32.0 43.0 12.0 76.0
24.0 2.0 34.0 2.0"""), header=1)
print(df)
df['D'] = df.D.fillna(0)
df['B'] = df.B.fillna(method='pad')
df['A'] = df.A.fillna(df['B'])
df['C'] = df.C.fillna(df['B'])
print(df)
Results:
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 NaN
5 5.0 2.0 5.0 34.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 NaN NaN NaN NaN
12 23.0 NaN 23.0 34.0
13 61.0 NaN 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
A B C D
0 3.0 6.0 7.0 4.0
1 42.0 44.0 1.0 3.0
2 4.0 2.0 3.0 62.0
3 90.0 83.0 53.0 23.0
4 22.0 23.0 24.0 0.0
5 5.0 2.0 5.0 34.0
6 2.0 2.0 2.0 0.0
7 2.0 2.0 2.0 0.0
8 2.0 12.0 65.0 1.0
9 5.0 7.0 32.0 7.0
10 2.0 13.0 6.0 12.0
11 13.0 13.0 13.0 0.0
12 23.0 13.0 23.0 34.0
13 61.0 13.0 63.0 3.0
14 32.0 43.0 12.0 76.0
15 24.0 2.0 34.0 2.0
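The same four steps can be sketched with Series.ffill(), which does the same forward fill as fillna(method='pad') and avoids the method argument that recent pandas releases deprecate. The small frame below is made up to cover the three cases: a fully empty row, a row where only B is missing, and a row with nothing missing.

```python
import numpy as np
import pandas as pd

# Made-up rows: complete, all-NaN, complete, all-NaN, only B missing.
df = pd.DataFrame({
    "A": [5.0, np.nan, 2.0, np.nan, 23.0],
    "B": [2.0, np.nan, 13.0, np.nan, np.nan],
    "C": [5.0, np.nan, 6.0, np.nan, 23.0],
    "D": [34.0, np.nan, 12.0, np.nan, 34.0],
})

df["D"] = df["D"].fillna(0)        # D: NaN becomes 0
df["B"] = df["B"].ffill()          # B: carry the last known value forward
df["A"] = df["A"].fillna(df["B"])  # A and C: take the (already filled) B of the same row
df["C"] = df["C"].fillna(df["B"])
print(df)
```

The order matters: B has to be forward-filled before A and C are filled from it.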
