I have had a rethink of the issue and have reformulated my question.
I have a dataframe (df) which has time series data for a number of factors. The time series for each factor can start on different days, which is fine. On some specific days there is missing data (whitespace) for FactorB and FactorC (in this example 07/01/2017). For those whitespace days I would like to fill the holes for FactorB and FactorC with the value of that factor from the previous day. For example:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
In df data is missing for FactorB and FactorC on 07/01/2017. I would like the resulting df to look like:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62 5.93 5.91
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
I am wondering if I need to explicitly change the whitespace for FactorB and FactorC on the dates with holes (in this example 07/01/2017) to NaN before I then apply
df = df.replace('', np.NaN).ffill()
So my intermediate output for the issue would look like:
FactorA FactorB FactorC
01/01/2017 5.50
02/01/2017 5.31
03/01/2017 5.62
04/01/2017 5.84 5.62 5.74
05/01/2017 5.95 5.85 5.86
06/01/2017 5.94 5.93 5.91
07/01/2017 5.62 NaN NaN
08/01/2017 6.01 6.20 6.21
09/01/2017 6.12 6.20 3.23
But how would I apply NaN only to days where data is legitimately missing (without changing the days before the FactorB and FactorC time series started)? Also, is there a way to do this without explicitly naming a date, since the holes could be on any date?
I have tried the following, but when I check the data the whitespace is still there and I feel like I'm getting nowhere:
col = ['FactorB', 'FactorC']
df[col] = df[col].ffill()
I've also tried:
df.fillna(method='ffill')
and
df = df.replace('', np.NaN).ffill()
If some values are missing as empty strings rather than NaN, convert them first and then forward-fill:
df = df.replace('', np.NaN).ffill()
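A minimal sketch of the same idea restricted to the affected columns (column names taken from the question; this assumes the blanks really are empty strings). Note that ffill never fills leading NaNs, so the days before FactorB and FactorC start are left untouched:
cols = ['FactorB', 'FactorC']
df[cols] = (df[cols]
            .replace('', np.nan)   # blank strings -> NaN
            .ffill())              # fill each hole with the previous day's value
# optionally convert back to numbers if the columns were read as strings:
# df[cols] = df[cols].apply(pd.to_numeric)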
I have this dataframe df:
alpha1 week_day calendar_week
0 2.49 Freitag 2022-04-(01/07)
1 1.32 Samstag 2022-04-(01/07)
2 2.70 Sonntag 2022-04-(01/07)
3 3.81 Montag 2022-04-(01/07)
4 3.58 Dienstag 2022-04-(01/07)
5 3.48 Mittwoch 2022-04-(01/07)
6 1.79 Donnerstag 2022-04-(01/07)
7 2.12 Freitag 2022-04-(08/14)
8 2.41 Samstag 2022-04-(08/14)
9 1.78 Sonntag 2022-04-(08/14)
10 3.19 Montag 2022-04-(08/14)
11 3.33 Dienstag 2022-04-(08/14)
12 2.88 Mittwoch 2022-04-(08/14)
13 2.98 Donnerstag 2022-04-(08/14)
14 3.01 Freitag 2022-04-(15/21)
15 3.04 Samstag 2022-04-(15/21)
16 2.72 Sonntag 2022-04-(15/21)
17 4.11 Montag 2022-04-(15/21)
18 3.90 Dienstag 2022-04-(15/21)
19 3.16 Mittwoch 2022-04-(15/21)
and so on, with ascending calendar weeks.
I performed a pivot table to generate a heatmap.
df_pivot = pd.pivot_table(df, values=['alpha1'], index=['week_day'], columns=['calendar_week'])
What I get is:
alpha1 \
calendar_week 2022-(04-29/05-05) 2022-(05-27/06-02) 2022-(07-29/08-04)
week_day
Dienstag 3.32 2.09 4.04
Donnerstag 3.27 2.21 4.65
Freitag 2.83 3.08 4.19
Mittwoch 3.22 3.14 4.97
Montag 2.83 2.86 4.28
Samstag 2.62 3.62 3.88
Sonntag 2.81 3.25 3.77
\
calendar_week 2022-(08-26/09-01) 2022-04-(01/07) 2022-04-(08/14)
week_day
Dienstag 2.92 3.58 3.33
Donnerstag 3.58 1.79 2.98
Freitag 3.96 2.49 2.12
Mittwoch 3.09 3.48 2.88
Montag 3.85 3.81 3.19
Samstag 3.10 1.32 2.41
Sonntag 3.39 2.70 1.78
As you can see, the sorting of the pivot table is messed up. I need the same sorting for the columns (calendar weeks) as in the original dataframe.
I have been looking all over but couldn't find how to achieve this.
It would also be very nice if the sorting of the rows remained the same.
Any help will be greatly appreciated.
UPDATE
I didn't paste all the data; it would have been too long.
The calendar_week column consists of the following elements:
'2022-04-(01/07)',
'2022-04-(08/14)',
'2022-04-(15/21)',
'2022-04-(22/28)',
'2022-(04-29/05-05)',
'2022-05-(06/12)',
'2022-05-(13/19)',
'2022-05-(20/26)',
'2022-(05-27/06-02)',
'2022-06-(03/09)'
'2022-06-(10/16)'
'2022-06-(17/23)'
'2022-06-(24/30)'
'2022-07-(01/07)'
'2022-07-(08/14)'
'2022-07-(15/21)'
'2022-07-(22/28)'
'2022-(07-29/08-04)'
'2022-08-(05/11)'
etc....
Each occurs 7 times in df. It represents a calendar week.
The sorting is the natural time sorting.
After pivoting the dataframe, the sorting of the columns gets messed up, and I guess it's due to the two different formats: 2022-(07-29/08-04) and 2022-07-(15/21).
Try running this:
df_pivot.sort_values(by=['calendar_week'], axis=1, ascending=True)
I got the following output. Is this what you wanted?
calendar_week  2022-04-(01/07)  2022-04-(08/14)  2022-04-(15/21)
week_day
Dienstag                  3.58             3.33             3.90
Donnerstag                1.79             2.98              NaN
Freitag                   2.49             2.12             3.01
Mittwoch                  3.48             2.88             3.16
Montag                    3.81             3.19             4.11
Be sure to handle the NaN values, for example with fillna().
I hope that answers it. :)
You can use an ordered Categorical for your week days and sort the dates after pivoting with sort_index:
# define the desired order of the days
days = ['Montag', 'Dienstag', 'Mittwoch', 'Donnerstag',
        'Freitag', 'Samstag', 'Sonntag']

df_pivot = (df
    .assign(week_day=pd.Categorical(df['week_day'], categories=days,
                                    ordered=True))
    .pivot_table(values='alpha1', index='week_day',
                 columns='calendar_week')
    .sort_index(axis=1)
)
output:
calendar_week 2022-04-(01/07) 2022-04-(08/14) 2022-04-(15/21)
week_day
Montag 3.81 3.19 4.11
Dienstag 3.58 3.33 3.90
Mittwoch 3.48 2.88 3.16
Donnerstag 1.79 2.98 NaN
Freitag 2.49 2.12 3.01
Samstag 1.32 2.41 3.04
Sonntag 2.70 1.78 2.72
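If you also want the calendar weeks in their original (natural time) order rather than the lexicographic order that sort_index(axis=1) gives, one possible follow-up (a sketch, not part of the original answer) is to reindex the columns with the order in which the weeks first appear in df, which per the question is already the natural time order:
week_order = df['calendar_week'].drop_duplicates().tolist()
df_pivot = df_pivot.reindex(columns=week_order)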
I have Excel files with multiple sheets, each of which looks a little like this (but much longer):
Sample CD4 CD8
Day 1 8311 17.3 6.44
8312 13.6 3.50
8321 19.8 5.88
8322 13.5 4.09
Day 2 8311 16.0 4.92
8312 5.67 2.28
8321 13.0 4.34
8322 10.6 1.95
The first column is actually four cells merged vertically.
When I read this using pandas.read_excel, I get a DataFrame that looks like this:
Sample CD4 CD8
Day 1 8311 17.30 6.44
NaN 8312 13.60 3.50
NaN 8321 19.80 5.88
NaN 8322 13.50 4.09
Day 2 8311 16.00 4.92
NaN 8312 5.67 2.28
NaN 8321 13.00 4.34
NaN 8322 10.60 1.95
How can I either get Pandas to understand merged cells, or quickly and easily remove the NaN and group by the appropriate value? (One approach would be to reset the index, step through to find the values and replace NaNs with values, pass in the list of days, then set the index to the column. But it seems like there should be a simpler approach.)
You could use the Series.fillna method to forward-fill the NaN values:
df.index = pd.Series(df.index).fillna(method='ffill')
For example,
In [42]: df
Out[42]:
Sample CD4 CD8
Day 1 8311 17.30 6.44
NaN 8312 13.60 3.50
NaN 8321 19.80 5.88
NaN 8322 13.50 4.09
Day 2 8311 16.00 4.92
NaN 8312 5.67 2.28
NaN 8321 13.00 4.34
NaN 8322 10.60 1.95
[8 rows x 3 columns]
In [43]: df.index = pd.Series(df.index).fillna(method='ffill')
In [44]: df
Out[44]:
Sample CD4 CD8
Day 1 8311 17.30 6.44
Day 1 8312 13.60 3.50
Day 1 8321 19.80 5.88
Day 1 8322 13.50 4.09
Day 2 8311 16.00 4.92
Day 2 8312 5.67 2.28
Day 2 8321 13.00 4.34
Day 2 8322 10.60 1.95
[8 rows x 3 columns]
df = df.fillna(method='ffill', axis=0) # resolved updating the missing row entries
To casually come back 8 years later, pandas.read_excel() can solve this internally for you with the index_col parameter.
df = pd.read_excel('path_to_file.xlsx', index_col=[0])
Passing index_col as a list will cause pandas to look for a MultiIndex. In the case of a list of length one, pandas creates a regular Index and fills in the missing (merged-cell) values.
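A short usage sketch for the "group by the appropriate value" part of the question (the file path and column names are taken from the examples above; the mean aggregation is an assumption): once the "Day ..." labels are restored in the index by either approach, you can group on it directly:
import pandas as pd

# read_excel fills the merged first column straight into the index
df = pd.read_excel('path_to_file.xlsx', index_col=[0])

# group the measurements by the restored "Day ..." labels, e.g. mean per day
means_per_day = df.groupby(level=0)[['CD4', 'CD8']].mean()
print(means_per_day)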
I would like to merge or combine two dataframes of different sizes, df1 and df2, based on a range of dates. For example:
df1:
Date Open High Low
2021-07-01 8.43 8.44 8.22
2021-07-02 8.36 8.4 8.28
2021-07-06 8.22 8.23 8.06
2021-07-07 8.1 8.19 7.98
2021-07-08 8.07 8.1 7.91
2021-07-09 7.97 8.11 7.92
2021-07-12 8 8.2 8
2021-07-13 8.15 8.18 8.06
2021-07-14 8.18 8.27 8.12
2021-07-15 8.21 8.26 8.06
2021-07-16 8.12 8.23 8.07
df2:
Day of month Revenue Earnings
01 45000 4000
07 43500 5000
12 44350 6000
15 39050 7000
results should be something like this:
combination:
Date Open High Low Earnings
2021-07-01 8.43 8.44 8.22 4000
2021-07-02 8.36 8.4 8.28 4000
2021-07-06 8.22 8.23 8.06 4000
2021-07-07 8.1 8.19 7.98 5000
2021-07-08 8.07 8.1 7.91 5000
2021-07-09 7.97 8.11 7.92 5000
2021-07-12 8 8.2 8 6000
2021-07-13 8.15 8.18 8.06 6000
2021-07-14 8.18 8.27 8.12 6000
2021-07-15 8.21 8.26 8.06 7000
2021-07-16 8.12 8.23 8.07 7000
The Earnings column is merged based on a range of dates. How can I do this in Python/pandas?
Try merge_asof
# df1.Date = pd.to_datetime(df1.Date)  # if the Date column is not already datetime
df1['Day of month'] = df1.Date.dt.day
out = pd.merge_asof(df1, df2, on='Day of month', direction='backward')
out
Out[213]:
Date Open High Low Day of month Revenue Earnings
0 2021-07-01 8.43 8.44 8.22 1 45000 4000
1 2021-07-02 8.36 8.40 8.28 2 45000 4000
2 2021-07-06 8.22 8.23 8.06 6 45000 4000
3 2021-07-07 8.10 8.19 7.98 7 43500 5000
4 2021-07-08 8.07 8.10 7.91 8 43500 5000
5 2021-07-09 7.97 8.11 7.92 9 43500 5000
6 2021-07-12 8.00 8.20 8.00 12 44350 6000
7 2021-07-13 8.15 8.18 8.06 13 44350 6000
8 2021-07-14 8.18 8.27 8.12 14 44350 6000
9 2021-07-15 8.21 8.26 8.06 15 39050 7000
10 2021-07-16 8.12 8.23 8.07 16 39050 7000
A more general approach is the following:
First you introduce a key both dataframes share.
In this case, that is the day of the month (or, potentially, multiple keys like day of the month and month): df1["day"] = df1["Date"].dt.day
If you were to merge (left-join df2 onto df1) now, you wouldn't have enough keys in df2, as there are days missing. To fill the gaps, we could interpolate, or use the naïve approach: if we don't know the Revenue / Earnings for a specific day, we take the last known value and apply no further calculation. One way to achieve this is described here: How to replace NaNs by preceding or next values in pandas DataFrame? df.fillna(method='ffill')
Now we merge on our key. Following the doc https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html , we do it like this: df1.merge(df2, left_on='day', right_on='Day of month', how='left'). A fuller sketch follows below.
Voilà!
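Putting those steps together, here is a rough sketch of the general approach (the reindex over days 1-31 and the dtype handling are assumptions based on the sample data above, not code from the answer):
# make sure the day key is numeric on both sides
df1['day'] = df1['Date'].dt.day
df2['Day of month'] = df2['Day of month'].astype(int)   # in case it was read as '01', '07', ...

# expand df2 so every possible day of the month has a row,
# forward-filling the last known Revenue / Earnings into the gaps
df2_full = (df2.set_index('Day of month')
               .reindex(range(1, 32))
               .ffill()
               .rename_axis('day')
               .reset_index())

# left-join on the shared day key
combination = df1.merge(df2_full, on='day', how='left')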
So I have a DataFrame, f, with weekly indexes:
Open High Low Close Volume
Date
2017-07-24 5.05 5.120 5.010 5.19 16306737.0
2017-07-31 5.31 5.475 5.280 5.24 45182199.0
2017-08-07 5.69 5.740 5.640 5.67 10167161.0
2017-08-14 5.65 5.680 5.440 5.76 28296416.0
2017-08-21 5.49 5.605 5.480 5.55 16126060.0
2017-08-28 6.00 6.030 5.940 5.95 19398271.0
2017-09-04 5.86 5.965 5.845 6.01 20218389.0
2017-09-11 5.98 6.030 5.830 5.98 15812289.0
2017-09-18 5.71 5.770 5.540 5.81 30786508.0
2017-09-25 5.16 5.190 5.090 5.17 13641128.0
I want to pass a datetime object to it; if that datetime exists in the index I'll use the data in that row, otherwise I want to grab the next row after where my date would fall.
E.g. if I look up f.loc[datetime.datetime(2017, 9, 7)]
then that isn't in the index so I want it to grab the row
2017-09-11 5.98 6.030 5.830 5.98 15812289.0
since that is the next indexed date after 7 September.
One straightforward solution is using np.searchsorted:
df.iloc[[np.searchsorted(df.index, '2017-09-07')]]
Open High Low Close Volume
Date
2017-09-11 5.98 6.03 5.83 5.98 15812289.0
Details
df
Open High Low Close Volume
Date
2017-07-24 5.05 5.120 5.010 5.19 16306737.0
2017-07-31 5.31 5.475 5.280 5.24 45182199.0
2017-08-07 5.69 5.740 5.640 5.67 10167161.0
2017-08-14 5.65 5.680 5.440 5.76 28296416.0
2017-08-21 5.49 5.605 5.480 5.55 16126060.0
2017-08-28 6.00 6.030 5.940 5.95 19398271.0
2017-09-04 5.86 5.965 5.845 6.01 20218389.0
2017-09-11 5.98 6.030 5.830 5.98 15812289.0
2017-09-18 5.71 5.770 5.540 5.81 30786508.0
2017-09-25 5.16 5.190 5.090 5.17 13641128.0
df.index.dtype
dtype('<M8[ns]')
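A pandas-only alternative that does the same "next row on or after the date" lookup (a sketch; it assumes the index is sorted, as above, and get_indexer returns -1 if no later date exists):
pos = df.index.get_indexer([pd.Timestamp('2017-09-07')], method='bfill')[0]
row = df.iloc[pos]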