I’ve a lot of DataFrames with 2 columns, like this:
Fecha unidades
0 2020-01-01 2.0
84048 2020-09-01 4.0
149445 2020-10-01 11.0
532541 2020-11-01 4.0
660659 2020-12-01 2.0
1515682 2021-03-01 9.0
1563644 2021-04-01 2.0
1759823 2021-05-01 1.0
2226586 2021-07-01 1.0
As you can see, some months are missing. Which months are missing depends on the DataFrame: one may have only 2 months, another 10, another may be complete, another may have just one. I need to complete the "Fecha" column with the missing months (from 2020-01-01 to 2021-12-01) and, for each date added to "Fecha", set the "unidades" column to 0.
Each element in the "Fecha" column is of class 'pandas._libs.tslibs.timestamps.Timestamp'.
How can I fill in the missing dates for each DataFrame?
You could create a date range, then use the "Fecha" column with set_index + reindex to add the missing months. fillna + reset_index then gives the desired outcome:
df['Fecha'] = pd.to_datetime(df['Fecha'])
df = (df.set_index('Fecha')
        .reindex(pd.date_range('2020-01-01', '2021-12-01', freq='MS'))
        .rename_axis(['Fecha'])
        .fillna(0)
        .reset_index())
Output:
Fecha unidades
0 2020-01-01 2.0
1 2020-02-01 0.0
2 2020-03-01 0.0
3 2020-04-01 0.0
4 2020-05-01 0.0
5 2020-06-01 0.0
6 2020-07-01 0.0
7 2020-08-01 0.0
8 2020-09-01 4.0
9 2020-10-01 11.0
10 2020-11-01 4.0
11 2020-12-01 2.0
12 2021-01-01 0.0
13 2021-02-01 0.0
14 2021-03-01 9.0
15 2021-04-01 2.0
16 2021-05-01 1.0
17 2021-06-01 0.0
18 2021-07-01 1.0
19 2021-08-01 0.0
20 2021-09-01 0.0
21 2021-10-01 0.0
22 2021-11-01 0.0
23 2021-12-01 0.0
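Since the question mentions many DataFrames, the same chain can be wrapped in a small helper and applied to each one. This is only a sketch: the function name fill_missing_months and the frames dictionary are made up for illustration, and it assumes every frame has at most one row per month (reindex needs unique index values).

```python
import pandas as pd

def fill_missing_months(df, start='2020-01-01', end='2021-12-01'):
    """Return a copy of df with one row per month start between start and end."""
    months = pd.date_range(start, end, freq='MS')
    out = df.copy()
    out['Fecha'] = pd.to_datetime(out['Fecha'])
    return (out.set_index('Fecha')
               .reindex(months)            # adds the missing months as NaN rows
               .rename_axis('Fecha')
               .fillna({'unidades': 0})    # only touch the "unidades" column
               .reset_index())

# Two small frames standing in for the many real ones (hypothetical data)
frames = {
    'a': pd.DataFrame({'Fecha': ['2020-01-01', '2020-09-01'], 'unidades': [2.0, 4.0]}),
    'b': pd.DataFrame({'Fecha': ['2021-07-01'], 'unidades': [1.0]}),
}
filled = {name: fill_missing_months(frame) for name, frame in frames.items()}
print(filled['a'].head())
```

Each filled frame then has 24 rows, whatever months the original contained.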
I have a df and I want to flatten its values into a single row. First I want to select a specific period, and replace the NaN values with the value from the day before. Here is a simple example: I only want the values from 2020, laid out in one row in time order, with each NaN replaced by the previous day's value.
df = pd.DataFrame()
df['day'] =[ '2020-01-01', '2019-01-01', '2020-01-02','2020-01-03', '2018-01-01', '2020-01-15','2020-03-01', '2020-02-01', '2017-01-01' ]
df['value_1'] = [ 1, np.nan, 32, 48, 5, -1, 5,10,2]
df['value_2'] = [ np.nan, 121, 23, 34, 15, 21, 15, 12, 39]
df
day value_1 value_2
0 2020-01-01 1.0 NaN
1 2019-01-01 NaN 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
4 2018-01-01 5.0 15.0
5 2020-01-15 -1.0 21.0
6 2020-03-01 5.0 15.0
7 2020-02-01 10.0 12.0
8 2017-01-01 2.0 39.0
The output:
_1 _2 _3 _4 _5 _6 _7 _8 _9 _10 _11 _12
0 1 121 1 23 48 34 -1 21 10 12 -1 21
I have tried to use the following code, but it does not solve my problem:
val_cols = df.filter(like='value_').columns
output = (df.pivot('day', val_cols).groupby(level=0, axis=1).apply(lambda x:x.ffill(axis=1).bfill(axis=1)).sort_index(axis=1, level=1))
I don't know what the output is supposed to be, but I think this should do at least part of what you're trying to do:
df['day'] = pd.to_datetime(df['day'], format='%Y-%m-%d')
df = df.sort_values(by=['day'])
filter_2020 = df['day'].dt.year == 2020
val_cols = df.filter(like='value_').columns
df.loc[filter_2020, val_cols] = df.loc[:,val_cols].ffill().loc[filter_2020]
print(df)
day value_1 value_2
8 2017-01-01 2.0 39.0
4 2018-01-01 5.0 15.0
1 2019-01-01 NaN 121.0
0 2020-01-01 1.0 121.0
2 2020-01-02 32.0 23.0
3 2020-01-03 48.0 34.0
5 2020-01-15 -1.0 21.0
7 2020-02-01 10.0 12.0
6 2020-03-01 5.0 15.0
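If the goal is a single row built from the 2020 values, one possible reading (an assumption, since the expected row in the question does not follow an obvious order) is to forward-fill in date order, keep the 2020 rows, and flatten them day by day. The names only_2020, flat and one_row are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'day': ['2020-01-01', '2019-01-01', '2020-01-02', '2020-01-03', '2018-01-01',
            '2020-01-15', '2020-03-01', '2020-02-01', '2017-01-01'],
    'value_1': [1, np.nan, 32, 48, 5, -1, 5, 10, 2],
    'value_2': [np.nan, 121, 23, 34, 15, 21, 15, 12, 39],
})
df['day'] = pd.to_datetime(df['day'])
val_cols = df.filter(like='value_').columns

# Fill from earlier dates first, then keep only the 2020 rows
filled = df.sort_values('day').set_index('day')[val_cols].ffill()
only_2020 = filled[filled.index.year == 2020]

# Flatten row by row: value_1, value_2 of the first day, then the next day, ...
flat = only_2020.to_numpy().ravel()
one_row = pd.DataFrame([flat], columns=[f'_{i}' for i in range(1, flat.size + 1)])
print(one_row)
```

With the sample data this gives 12 columns, starting 1, 121, 32, 23, because 2020-01-01 picks up value_2 from 2019-01-01.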
I have a dataframe with two columns, date and value.
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['date'] = ['2020-03-01 00:00:00','2020-03-01 00:00:15', '2020-03-01 00:00:30', '2020-03-02 00:00:00','2020-03-02 00:00:15', '2020-03-02 00:00:30' , '2020-03-03 00:00:15', '2020-03-03 00:00:30', '2020-03-05 00:00:00', '2020-03-05 00:00:30']
df['value'] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4]
df
date value
0 2020-03-01 00:00:00 1
1 2020-03-01 00:00:15 2
2 2020-03-01 00:00:30 3
3 2020-03-02 00:00:00 4
4 2020-03-02 00:00:15 5
5 2020-03-02 00:00:30 6
6 2020-03-03 00:00:15 1
7 2020-03-03 00:00:30 2
8 2020-03-05 00:00:00 3
9 2020-03-05 00:00:30 4
In the date column some days are missing (I want every day: 1, 2, 3, 4, ...; in this example there is no data for 2020-03-04, so it should be NaN). First I want to build this df, which shows the days and times for which I have no data:
day 00:00:00 00:00:15 00:00:30
0 2020-03-01 1.0 2.0 3.0
1 2020-03-02 4.0 5.0 6.0
2 2020-03-03 NaN 1.0 2.0
3 2020-03-04 NaN NaN NaN
4 2020-03-05 3.0 NaN 4.0
Then replace the NaN values with the mean of each column, like:
day 00:00:00 00:00:15 00:00:30
0 2020-03-01 1.000000 2.000000 3.000000
1 2020-03-02 4.000000 5.000000 6.000000
2 2020-03-03 2.666667 1.000000 2.000000
3 2020-03-04 2.666667 2.666667 2.666667
4 2020-03-05 3.000000 5.000000 4.000000
And then build a df with a single row, like this (the column names are not important):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 4 5 6 2.67 1 2 2.67 2.67 2.67 3 5 4
I have been trying pivot and groupby, but I could not solve it, especially the missing date. Could you please help me with that?
You can use resample():
df['date']=pd.to_datetime(df['date'])
dfx=df.set_index('date').resample('15S').first()
resample() creates a row for every 15-second slot of every day, but we only need the slots between 00:00:00 and 00:00:30:
dfx = dfx.between_time("00:00:00", "00:00:30").reset_index()
print(dfx)
'''
date value
0 2020-03-01 00:00:00 1.0
1 2020-03-01 00:00:15 2.0
2 2020-03-01 00:00:30 3.0
3 2020-03-02 00:00:00 4.0
4 2020-03-02 00:00:15 5.0
5 2020-03-02 00:00:30 6.0
6 2020-03-03 00:00:00 NaN
7 2020-03-03 00:00:15 1.0
8 2020-03-03 00:00:30 2.0
9 2020-03-04 00:00:00 NaN
10 2020-03-04 00:00:15 NaN
11 2020-03-04 00:00:30 NaN
12 2020-03-05 00:00:00 3.0
13 2020-03-05 00:00:15 NaN
14 2020-03-05 00:00:30 4.0
'''
Then I convert the times into columns using crosstab:
dfx=pd.crosstab(dfx['date'].dt.date, dfx['date'].dt.time,values=dfx['value'],aggfunc='sum',dropna=False)
print(dfx)
'''
date 00:00:00 00:00:15 00:00:30
date
2020-03-01 1.0 2.0 3.0
2020-03-02 4.0 5.0 6.0
2020-03-03 0.0 1.0 2.0
2020-03-04 0.0 0.0 0.0
2020-03-05 3.0 0.0 4.0
'''
Values of 0 are times that are not in the data set (the sum of their NaN values is 0). I replace them with NaN and fill them with the column means. Note that a genuine 0 in the data would be replaced as well:
dfx=dfx.replace(0,np.nan)
for i in dfx.columns:
    dfx[i]=dfx[i].fillna(dfx[i].mean())
print(dfx)
'''
date 00:00:00 00:00:15 00:00:30
date
2020-03-01 1.000000 2.000000 3.00
2020-03-02 4.000000 5.000000 6.00
2020-03-03 2.666667 1.000000 2.00
2020-03-04 2.666667 2.666667 3.75
2020-03-05 3.000000 2.666667 4.00
'''
I did not fully understand what you want at the last stage; if you describe it in detail, I will edit my answer.
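For the last stage, one way to read the request is "lay the filled table out as a single row, day after day". The sketch below makes that assumption and swaps resample + crosstab for pivot + reindex, which leaves absent slots as NaN directly, so a real 0 in the data is not mistaken for a missing value. The names wide, flat and one_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2020-03-01 00:00:00', '2020-03-01 00:00:15', '2020-03-01 00:00:30',
             '2020-03-02 00:00:00', '2020-03-02 00:00:15', '2020-03-02 00:00:30',
             '2020-03-03 00:00:15', '2020-03-03 00:00:30',
             '2020-03-05 00:00:00', '2020-03-05 00:00:30'],
    'value': [1, 2, 3, 4, 5, 6, 1, 2, 3, 4],
})
df['date'] = pd.to_datetime(df['date'])
df['day'] = df['date'].dt.normalize()   # midnight of each day
df['time'] = df['date'].dt.time         # time of day as a datetime.time

# One row per day, one column per time of day
wide = df.pivot(index='day', columns='time', values='value')

# Add the days that have no data at all (here 2020-03-04)
all_days = pd.date_range(wide.index.min(), wide.index.max(), freq='D')
wide = wide.reindex(all_days)

# Fill each NaN with the mean of its column
wide = wide.fillna(wide.mean())

# Flatten day by day into a single row
flat = wide.to_numpy().ravel()
one_row = pd.DataFrame([flat])
print(one_row.round(2))
```

With the sample data this gives 15 values (5 days x 3 times), and the filled values match the table printed above.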
I have a large dataset with a date column that starts in 2019. I want to generate week numbers, in a separate column, from those dates.
Here is what the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Starting from the first day the data was collected, I want to count 7 days using the date column and make a week out of them. For example, the first 7 days form week one, and I create a column that labels them as such. I want to repeat this until the last week in which data was collected.
It may be a good idea to sort the dates from the first date to the most recent one.
I have tried this, but it does not generate the weeks in order; it actually produces repeated week numbers:
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is to count 7 days from the first date the data was collected and store that as week one, then continue incrementally until the last week, say week number 66.
Here is the expected column of weeks created from the date column:
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
If I understand correctly, use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print (df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
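The snippet above takes the first row as the start date. If the rows are not sorted, or the first row is NaT, a variant that starts from the earliest date may be safer. This is a sketch on a shortened, shuffled sample; the names first_day and elapsed_days are illustrative, and the nullable Int64 dtype is used only to keep the week numbers as integers next to missing values.

```python
import pandas as pd

# Shuffled sample with one missing date
df = pd.DataFrame({'date': ['2019-10-07', 'NaN', '2019-09-10', '2019-11-04']})
df['date'] = pd.to_datetime(df['date'], errors='coerce')

first_day = df['date'].min()                       # min() skips NaT
elapsed_days = (df['date'] - first_day).dt.days    # NaN where the date is missing
df['week'] = (elapsed_days // 7 + 1).astype('Int64')
print(df.sort_values('date'))
```

Here 2019-09-10 becomes week 1, 2019-10-07 week 4 and 2019-11-04 week 8, matching the numbers in the output above.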
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0
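Note that the "real week number" here counts calendar weeks (to_period('W') uses Monday-to-Sunday weeks by default), not blocks of 7 days starting at the first date, so it can differ by one from a day-based count. A small sketch with two of the dates from the question shows the difference; the variable names are illustrative.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2019-09-10', '2019-10-07']))

# Blocks of 7 days counted from the first date
by_days = (dates - dates.iloc[0]).dt.days // 7 + 1

# Calendar weeks between the two dates
periods = dates.dt.to_period('W')
by_calendar_week = (periods - periods.iloc[0]).apply(lambda x: x.n) + 1

print(periods.tolist())
print(by_days.tolist(), by_calendar_week.tolist())
```

2019-10-07 is 27 days after 2019-09-10, so it falls in the fourth 7-day block, but it is a Monday and therefore opens the fifth calendar week.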
I am a new Python user and have a few questions about filling NAs in a data frame.
Currently, I have a data frame with a series of monthly dates from 2022-08-01 to 2037-08-01.
However, after 2027-06-01 the pricing data stops, and I would like to extrapolate the values forward to fill out the rest of the dates. Essentially, I would like to take the last 12 months of prices and repeat them forward for the rest of the data frame. I was thinking of some kind of groupby on month with fillna(method='ffill'); however, when I do this it just fills the last value in the df forward. Below is an example of my code.
In the picture above you can see that the values stop at 12/1/2023. I wish to fill the previous 12 values forward for the rest of the maturity dates, so all prices from 1/1/2023 to 12/1/2023 are filled forward for all later months.
import pandas as pd
mat = pd.DataFrame(pd.date_range('01/01/2020','01/01/2022',freq='MS'))
prices = pd.DataFrame(['179.06','174.6','182.3','205.59','204.78','202.19','216.17','218.69','220.73','223.28','225.16','226.31'])
example = pd.concat([mat,prices],axis=1)
example.columns = ['maturity', 'price']
Output
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 NaN
13 2021-02-01 NaN
14 2021-03-01 NaN
15 2021-04-01 NaN
16 2021-05-01 NaN
17 2021-06-01 NaN
18 2021-07-01 NaN
19 2021-08-01 NaN
20 2021-09-01 NaN
21 2021-10-01 NaN
22 2021-11-01 NaN
23 2021-12-01 NaN
24 2022-01-01 NaN
Is this what you're looking for?
out = example.groupby(example.maturity.dt.month).ffill()
print(out)
Output:
maturity price
0 2020-01-01 179.06
1 2020-02-01 174.6
2 2020-03-01 182.3
3 2020-04-01 205.59
4 2020-05-01 204.78
5 2020-06-01 202.19
6 2020-07-01 216.17
7 2020-08-01 218.69
8 2020-09-01 220.73
9 2020-10-01 223.28
10 2020-11-01 225.16
11 2020-12-01 226.31
12 2021-01-01 179.06
13 2021-02-01 174.6
14 2021-03-01 182.3
15 2021-04-01 205.59
16 2021-05-01 204.78
17 2021-06-01 202.19
18 2021-07-01 216.17
19 2021-08-01 218.69
20 2021-09-01 220.73
21 2021-10-01 223.28
22 2021-11-01 225.16
23 2021-12-01 226.31
24 2022-01-01 179.06
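For reference, here is a self-contained version of the same idea that can be run as is. It is a sketch under two assumptions: the prices are stored as floats rather than the strings used in the question, and the frame is rebuilt from scratch with the question's column names.

```python
import pandas as pd

maturity = pd.date_range('2020-01-01', '2022-01-01', freq='MS')
known = [179.06, 174.6, 182.3, 205.59, 204.78, 202.19,
         216.17, 218.69, 220.73, 223.28, 225.16, 226.31]

example = pd.DataFrame({'maturity': maturity})
example['price'] = pd.Series(known)  # aligned on the index; later rows stay NaN

# Within each calendar month (all Januaries, all Februaries, ...),
# carry the most recent known price forward
example['price'] = example.groupby(example['maturity'].dt.month)['price'].ffill()
print(example.tail(3))
```

Grouping by the month number is what makes ffill repeat the 12-month pattern instead of copying only the very last price.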
Suppose I have a dataframe which has Date and $ columns like below:
>>> import pandas as pd
>>> data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
>>> df = pd.DataFrame(data, columns=['Date','$'])
Since the original data has date gaps, I filled them using an approach recommended in this StackOverflow post, as shown below:
>>> df.Date = pd.to_datetime(df.Date)
>>> mn = df.Date.min()
>>> mx = df.Date.max()
>>>
>>> dr = pd.date_range(
... mn - pd.tseries.offsets.MonthBegin(),
... mx + pd.tseries.offsets.MonthEnd(),
... name="Date",
... )
>>>
>>> df = df.set_index("Date").reindex(dr).ffill().bfill().reset_index()
>>> print(df)
Date $
0 2021-01-01 1.0
1 2021-01-02 1.0
2 2021-01-03 1.0
3 2021-01-04 1.0
4 2021-01-05 2.0
5 2021-01-06 2.0
6 2021-01-07 2.0
7 2021-01-08 2.0
8 2021-01-09 2.0
9 2021-01-10 2.0
10 2021-01-11 2.0
11 2021-01-12 2.0
12 2021-01-13 2.0
13 2021-01-14 2.0
14 2021-01-15 2.0
15 2021-01-16 2.0
16 2021-01-17 2.0
17 2021-01-18 2.0
18 2021-01-19 2.0
19 2021-01-20 2.0
20 2021-01-21 2.0
21 2021-01-22 2.0
22 2021-01-23 2.0
23 2021-01-24 2.0
24 2021-01-25 2.0
25 2021-01-26 2.0
26 2021-01-27 2.0
27 2021-01-28 2.0
28 2021-01-29 2.0
29 2021-01-30 2.0
30 2021-01-31 2.0
31 2021-02-01 2.0 # <== here, the $ value should be 3.0 and onward
32 2021-02-02 2.0
33 2021-02-03 2.0
34 2021-02-04 2.0
35 2021-02-05 3.0
36 2021-02-06 3.0
37 2021-02-07 3.0
38 2021-02-08 3.0
39 2021-02-09 3.0
40 2021-02-10 3.0
41 2021-02-11 3.0
42 2021-02-12 3.0
43 2021-02-13 3.0
44 2021-02-14 3.0
45 2021-02-15 3.0
46 2021-02-16 3.0
47 2021-02-17 3.0
48 2021-02-18 3.0
49 2021-02-19 3.0
50 2021-02-20 3.0
51 2021-02-21 3.0
52 2021-02-22 3.0
53 2021-02-23 3.0
54 2021-02-24 3.0
55 2021-02-25 3.0
56 2021-02-26 3.0
57 2021-02-27 3.0
58 2021-02-28 3.0
With that approach, the forward fill (ffill) does not check month boundaries, so it copies the $ values a bit too far, as you can see above. I looked around StackOverflow and found (e.g., in this post) that groupby() can be used to find the boundary of each month. That leads me to this point:
>>> start_date_of_each_month = (df.set_index('Date').index.to_series().groupby(pd.Grouper(freq='M')).min())
>>> start_date_of_each_month
Date
2021-01-31 2021-01-01
2021-02-28 2021-02-01
Freq: M, Name: Date, dtype: datetime64[ns]
Q: How can I utilize this to correct the $ above so that each month's values are contained within that particular month? In particular, how do I transform the df to look like this?
Date $
0 2021-01-01 1.0
1 2021-01-02 1.0
2 2021-01-03 1.0
3 2021-01-04 1.0
4 2021-01-05 2.0
5 2021-01-06 2.0
6 2021-01-07 2.0
7 2021-01-08 2.0
8 2021-01-09 2.0
9 2021-01-10 2.0
10 2021-01-11 2.0
11 2021-01-12 2.0
12 2021-01-13 2.0
13 2021-01-14 2.0
14 2021-01-15 2.0
15 2021-01-16 2.0
16 2021-01-17 2.0
17 2021-01-18 2.0
18 2021-01-19 2.0
19 2021-01-20 2.0
20 2021-01-21 2.0
21 2021-01-22 2.0
22 2021-01-23 2.0
23 2021-01-24 2.0
24 2021-01-25 2.0
25 2021-01-26 2.0
26 2021-01-27 2.0
27 2021-01-28 2.0
28 2021-01-29 2.0
29 2021-01-30 2.0
30 2021-01-31 2.0
31 2021-02-01 3.0 # <== here, the $ value should be 3.0
32 2021-02-02 3.0
33 2021-02-03 3.0
34 2021-02-04 3.0
35 2021-02-05 3.0
36 2021-02-06 3.0
37 2021-02-07 3.0
38 2021-02-08 3.0
39 2021-02-09 3.0
40 2021-02-10 3.0
41 2021-02-11 3.0
42 2021-02-12 3.0
43 2021-02-13 3.0
44 2021-02-14 3.0
45 2021-02-15 3.0
46 2021-02-16 3.0
47 2021-02-17 3.0
48 2021-02-18 3.0
49 2021-02-19 3.0
50 2021-02-20 3.0
51 2021-02-21 3.0
52 2021-02-22 3.0
53 2021-02-23 3.0
54 2021-02-24 3.0
55 2021-02-25 3.0
56 2021-02-26 3.0
57 2021-02-27 3.0
58 2021-02-28 3.0
Thanks in advance for your answers/suggestions!
Some options:
pd.Grouper:
import pandas as pd
data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(
mn - pd.tseries.offsets.MonthBegin(),
mx + pd.tseries.offsets.MonthEnd(),
name="Date",
)
df = df.set_index("Date").reindex(dr).reset_index()
df['$'] = df.groupby(pd.Grouper(key='Date', freq='1M'))['$'].ffill().bfill()
print(df)
dt.strftime %m:
import pandas as pd
data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(
mn - pd.tseries.offsets.MonthBegin(),
mx + pd.tseries.offsets.MonthEnd(),
name="Date",
)
df = df.set_index("Date").reindex(dr).reset_index()
df['$'] = df.groupby(df['Date'].dt.strftime('%m'))['$'].ffill().bfill()
print(df)
Resample:
import pandas as pd
data = [['2021-01-02', 1.0], ['2021-01-05', 2.0], ['2021-02-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(
mn - pd.tseries.offsets.MonthBegin(),
mx + pd.tseries.offsets.MonthEnd(),
name="Date",
)
df = df.set_index("Date").reindex(dr)
df['$'] = df.resample('M')['$'].transform(lambda s: s.ffill()).bfill()
df = df.reset_index()
print(df)
All produce:
Date $
0 2021-01-01 1.0
1 2021-01-02 1.0
2 2021-01-03 1.0
3 2021-01-04 1.0
4 2021-01-05 2.0
5 2021-01-06 2.0
6 2021-01-07 2.0
7 2021-01-08 2.0
...
30 2021-01-31 2.0
31 2021-02-01 3.0
32 2021-02-02 3.0
33 2021-02-03 3.0
34 2021-02-04 3.0
35 2021-02-05 3.0
36 2021-02-06 3.0
37 2021-02-07 3.0
38 2021-02-08 3.0
...
57 2021-02-27 3.0
58 2021-02-28 3.0
Depending on how you want to handle months with no values, you can instead apply both ffill and bfill within each month:
import pandas as pd
data = [['2020-12-02', 1.0], ['2021-02-05', 2.0], ['2021-03-05', 3.0]]
df = pd.DataFrame(data, columns=['Date', '$'])
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(
mn - pd.tseries.offsets.MonthBegin(),
mx + pd.tseries.offsets.MonthEnd(),
name="Date",
)
df = df.set_index("Date").reindex(dr)
df['$'] = df.resample('M')['$'].apply(lambda s: s.ffill().bfill())
df = df.reset_index()
print(df.to_string())
January has no values, so it stays NaN, whereas the previous methods would have back-filled January with the next known value, 2.0.
Date $
0 2020-12-01 1.0
...
30 2020-12-31 1.0
31 2021-01-01 NaN
...
61 2021-01-31 NaN
62 2021-02-01 2.0
...
89 2021-02-28 2.0
90 2021-03-01 3.0
91 2021-03-02 3.0
...
120 2021-03-31 3.0
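The month-bounded fill can also be wrapped in a reusable helper. The sketch below is one possible way to do it, not the only one: the function name fill_within_month and its parameters are made up for illustration, it groups by the month period of the index, and it takes the date range from the first and last month periods rather than from offset arithmetic.

```python
import pandas as pd

def fill_within_month(df, date_col='Date', value_col='$'):
    """Expand df to daily rows and fill gaps without crossing month boundaries."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])

    # First day of the first month through the last day of the last month
    first = out[date_col].min().to_period('M').start_time
    last = out[date_col].max().to_period('M').end_time.normalize()
    days = pd.date_range(first, last, freq='D', name=date_col)

    out = out.set_index(date_col).reindex(days)
    month = out.index.to_period('M')
    # ffill then bfill inside each month; months with no data stay NaN
    out[value_col] = (out.groupby(month)[value_col]
                         .transform(lambda s: s.ffill().bfill()))
    return out.reset_index()

data = [['2020-12-02', 1.0], ['2021-02-05', 2.0], ['2021-03-05', 3.0]]
result = fill_within_month(pd.DataFrame(data, columns=['Date', '$']))
print(result.to_string())
```

On the sample data this leaves all of January as NaN and starts February at 2.0, as in the output above.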