Add rows for missing hourly data in a pandas dataframe - python

I have a pandas dataframe with 2 columns: Created'(%Y-%m-%d %H) and Count which is an integer.
It is counting the amount of "tickets" registered per hour.
The problem is that there are many hours in the day that there are not registered any tickets.
I would like to add these hours as new rows with a Count of 0. The dataframe looks like this:
Created Count
0 2020-10-26 10 11
1 2020-10-26 09 123
2 2020-10-26 08 36
3 2020-10-26 07 28
4 2020-10-26 06 7
But I would need it to add rows like this:
Created Count
enter code here
0 2020-10-26 10 11
1 2020-10-26 09 123
2 2020-10-26 08 36
3 2020-10-26 07 28
4 2020-10-26 06 7
1 2020-10-26 05. 0
3 2020-10-26 04. 0
Also adding that it needs to be able to update continuously as new dates are added to the original dataframe.

You can resample with:
import datetime
import pandas as pd
df = pd.DataFrame({
'Created': ['2020-10-26 10', '2020-10-26 08','2020-10-26 09','2020-10-26 07','2020-10-26 06'],
'count': [11, 10,14,16,20]})
df['Created'] = pd.to_datetime(df['Created'], format='%Y-%m-%d %H')
df.sort_values(by=['Created'],inplace=True)
df.set_index('Created',inplace=True)
df_Date=pd.date_range(start=df.index.min().replace(hour=0), end=(df.index.max()), freq='H')
df=df.reindex(df_Date,fill_value=0)
df.reset_index(inplace=True)
df.rename(columns={'index': 'Created'},inplace=True)
print(df)
Result:
Created count
0 2020-10-26 00:00:00 0
1 2020-10-26 01:00:00 0
2 2020-10-26 02:00:00 0
3 2020-10-26 03:00:00 0
4 2020-10-26 04:00:00 0
5 2020-10-26 05:00:00 0
6 2020-10-26 06:00:00 20
7 2020-10-26 07:00:00 16
8 2020-10-26 08:00:00 10
9 2020-10-26 09:00:00 14
10 2020-10-26 10:00:00 11

Related

Extract year from pandas datetime column as numeric value with NaN for empty cells instead of NaT

I want to extract the year from a datetime column into a new 'yyyy'-column AND I want the missing values (NaT) to be displayed as 'NaN', so the datetime-dtype of the new column should be changed I guess but there I'm stuck..
Initial df:
Date ID
0 2016-01-01 12
1 2015-01-01 96
2 NaT 20
3 2018-01-01 73
4 2017-01-01 84
5 NaT 26
6 2013-01-01 87
7 2016-01-01 64
8 2019-01-01 11
9 2014-01-01 34
Desired df:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 NaN
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 NaN
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014
Code:
import pandas as pd
import numpy as np

# example df
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"Date": ['2016-01-01', '2015-01-01', np.nan, '2018-01-01', '2017-01-01', np.nan, '2013-01-01', '2016-01-01', '2019-01-01', '2014-01-01']})

df.ID = pd.to_numeric(df.ID)

df.Date = pd.to_datetime(df.Date)
print(df)
#extraction of year from date
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y')

#Try to set NaT to NaN or datetime to numeric, PROBLEM: empty cells keep 'NaT'
df.loc[(df['yyyy'].isna()), 'yyyy'] = np.nan

 #(try1)
df.yyyy = df.Date.astype(float)
 #(try2)
df.yyyy = pd.to_numeric(df.Date)
 #(try3)
print(df)
Use Series.dt.year with converting to integers with Int64:
df.Date = pd.to_datetime(df.Date)
df['yyyy'] = df.Date.dt.year.astype('Int64')
print (df)
ID Date yyyy
0 12 2016-01-01 2016
1 96 2015-01-01 2015
2 20 NaT <NA>
3 73 2018-01-01 2018
4 84 2017-01-01 2017
5 26 NaT <NA>
6 87 2013-01-01 2013
7 64 2016-01-01 2016
8 11 2019-01-01 2019
9 34 2014-01-01 2014
With no convert floats to integers:
df['yyyy'] = df.Date.dt.year
print (df)
ID Date yyyy
0 12 2016-01-01 2016.0
1 96 2015-01-01 2015.0
2 20 NaT NaN
3 73 2018-01-01 2018.0
4 84 2017-01-01 2017.0
5 26 NaT NaN
6 87 2013-01-01 2013.0
7 64 2016-01-01 2016.0
8 11 2019-01-01 2019.0
9 34 2014-01-01 2014.0
Your solution convert NaT to strings NaT, so is possible use replace.
Btw, in last versions of pandas replace is not necessary, it working correctly.
df['yyyy'] = pd.to_datetime(df.Date).dt.strftime('%Y').replace('NaT', np.nan)
Isn't it:
df['yyyy'] = df.Date.dt.year
Output:
Date ID yyyy
0 2016-01-01 12 2016.0
1 2015-01-01 96 2015.0
2 NaT 20 NaN
3 2018-01-01 73 2018.0
4 2017-01-01 84 2017.0
5 NaT 26 NaN
6 2013-01-01 87 2013.0
7 2016-01-01 64 2016.0
8 2019-01-01 11 2019.0
9 2014-01-01 34 2014.0
For pandas 0.24.2+, you can use Int64 data type for nullable integers:
df['yyyy'] = df.Date.dt.year.astype('Int64')
which gives:
Date ID yyyy
0 2016-01-01 12 2016
1 2015-01-01 96 2015
2 NaT 20 <NA>
3 2018-01-01 73 2018
4 2017-01-01 84 2017
5 NaT 26 <NA>
6 2013-01-01 87 2013
7 2016-01-01 64 2016
8 2019-01-01 11 2019
9 2014-01-01 34 2014

Divide Single to Multiple Columns by Delimiter Python Dataframe

i have a dataframe called "dates" with shape 4380,1 that looks like this -
date
0 2017-01-01 00:00:00
1 2017-01-01 06:00:00
2 2017-01-01 12:00:00
3 2017-01-01 18:00:00
4 2017-01-02 00:00:00
...
4375 2019-12-30 18:00:00
4376 2019-12-31 00:00:00
4377 2019-12-31 06:00:00
4378 2019-12-31 12:00:00
4379 2019-12-31 18:00:00
but i need to divide the single column of dates by the delimiter "-" or dash so that I can use this to groupby the month e.g., 01, 02,...12. So, my final result for the new dataframe should have shape 4380,4 and look like:
Year Month Day HHMMSS
0 2017 01 01 00:00:00
1 2017 01 01 06:00:00
...
4379 2019 12 31 18:00:00
I cannot find how to do this python transformation from single to multiple columns based on a delimiter. Thank you much!
Use Series.dt.strftime and Series.str.split:
new_df = df['date'].dt.strftime('%Y-%m-%d-%H:%M:%S').str.split('-',expand=True)
new_df.columns = ['Year','Month','Day', 'HHMMSS']
print(new_df)
Year Month Day HHMMSS
0 2017 01 01 00:00:00
1 2017 01 01 06:00:00
2 2017 01 01 12:00:00
3 2017 01 01 18:00:00
4 2017 01 02 00:00:00
4375 2019 12 30 18:00:00
4376 2019 12 31 00:00:00
4377 2019 12 31 06:00:00
4378 2019 12 31 12:00:00
4379 2019 12 31 18:00:00

Set two levels of indexes (indices?) as datetime.date() and datetime.time() using a column of datetime objects

I have a DataFrame that has a column 'timestamp' which consists of datetime objects (YYYY-mm-dd HH:MM:SS). I would like to extract the date (datetime.date()) from these timestamps and set it as the level 0 index, and the time (datetime.time()) as the level 1 index.
Example:
timestamp value1 value2
index
0 2018-01-01 09:00:00 10 20
1 2018-01-01 09:01:00 11 21
2 2018-01-02 09:00:00 12 22
3 2018-01-02 09:01:00 13 23
Would become:
value1 value2
date time
2018-01-01 09:00:00 10 20
09:01:00 11 21
2018-01-02 09:00:00 12 22
09:01:00 13 23
Option 1
Use drop and set_index
df.set_index([df.timestamp.dt.date, df.timestamp.dt.time]).drop('timestamp', 1)
value1 value2
timestamp timestamp
2018-01-01 09:00:00 10 20
09:01:00 11 21
2018-01-02 09:00:00 12 22
09:01:00 13 23
Option 2
d = df.set_index('timestamp')
d.index = [d.index.date, d.index.time]
d
value1 value2
2018-01-01 09:00:00 10 20
09:01:00 11 21
2018-01-02 09:00:00 12 22
09:01:00 13 23
Use set_index with MultiIndex.from_arrays and last drop original column:
mux = pd.MultiIndex.from_arrays([df['timestamp'].dt.date, df['timestamp'].dt.time],
names=('date','time'))
df = df.set_index(mux).drop('timestamp', 1)
Or add rename_axis:
df = (df.set_index([df['timestamp'].dt.date, df['timestamp'].dt.time])
.drop('timestamp', 1)
.rename_axis(('date','time')))
print (df)
value1 value2
date time
2018-01-01 09:00:00 10 20
09:01:00 11 21
2018-01-02 09:00:00 12 22
09:01:00 13 23

pandas - groupby and filtering for consecutive values

I have this dataframe df:
U,Datetime
01,2015-01-01 20:00:00
01,2015-02-01 20:05:00
01,2015-04-01 21:00:00
01,2015-05-01 22:00:00
01,2015-07-01 22:05:00
02,2015-08-01 20:00:00
02,2015-09-01 21:00:00
02,2014-01-01 23:00:00
02,2014-02-01 22:05:00
02,2015-01-01 20:00:00
02,2014-03-01 21:00:00
03,2015-10-01 20:00:00
03,2015-11-01 21:00:00
03,2015-12-01 23:00:00
03,2015-01-01 22:05:00
03,2015-02-01 20:00:00
03,2015-05-01 21:00:00
03,2014-01-01 20:00:00
03,2014-02-01 21:00:00
made by U and a Datetime object. What I would like to do is to filter U values having at least three consecutive occurrences in months/year. So far I have grouped by by U, year and month as:
m = df.groupby(['U',df.index.year,df.index.month]).size()
obtaining:
U
1 2015 1 1
2 1
4 1
5 1
7 1
2 2014 1 1
2 1
3 1
2015 1 1
8 1
9 1
3 2014 1 1
2 1
2015 1 1
2 1
5 1
10 1
11 1
12 1
The third column is related to the occurrences in different months/year. In this case only U values of 02 and 03 contain at least three consecutive values in months/year. Now I can't figured out how can I select those users and getting them out in a list, for instance, or just keeping them in the original dataframe df and discard the others. I tried also:
g = m.groupby(level=[0,1]).diff()
But I can't get any useful information.
Finally I could come up with the solution :) .
to give you an idea of how custom function works , simply it subtracts the value of the month from it's preceding value , the result should be one of course , and this should happen twice , for example if you have a list of numbers [5 , 6 , 7] , so 7 - 6 = 1 and 6 - 5 = 1 , 1 here appeared twice so the condition has been fulfilled
In [80]:
df.reset_index(inplace=True)
In [281]:
df['month'] = df.Datetime.dt.month
df['year'] = df.Datetime.dt.year
df
Out[281]:
Datetime U month year
0 2015-01-01 20:00:00 1 1 2015
1 2015-02-01 20:05:00 1 2 2015
2 2015-04-01 21:00:00 1 4 2015
3 2015-05-01 22:00:00 1 5 2015
4 2015-07-01 22:05:00 1 7 2015
5 2015-08-01 20:00:00 2 8 2015
6 2015-09-01 21:00:00 2 9 2015
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
9 2015-01-01 20:00:00 2 1 2015
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
17 2014-01-01 20:00:00 3 1 2014
18 2014-02-01 21:00:00 3 2 2014
In [284]:
g = df.groupby([df['U'] , df.year])
In [86]:
res = g.filter(lambda x : is_at_least_three_consec(x['month'].diff().values.tolist()))
res
Out[86]:
Datetime U month year
7 2014-01-01 23:00:00 2 1 2014
8 2014-02-01 22:05:00 2 2 2014
10 2014-03-01 21:00:00 2 3 2014
11 2015-10-01 20:00:00 3 10 2015
12 2015-11-01 21:00:00 3 11 2015
13 2015-12-01 23:00:00 3 12 2015
14 2015-01-01 22:05:00 3 1 2015
15 2015-02-01 20:00:00 3 2 2015
16 2015-05-01 21:00:00 3 5 2015
if you want to see the result of the custom function
In [84]:
res = g['month'].agg(lambda x : is_at_least_three_consec(x.diff().values.tolist()))
res
Out[84]:
U year
1 2015 False
2 2014 True
2015 False
3 2014 False
2015 True
Name: month, dtype: bool
this is how custom function implemented
In [53]:
def is_at_least_three_consec(month_diff):
consec_count = 0
#print(month_diff)
for index , val in enumerate(month_diff):
if index != 0 and val == 1:
consec_count += 1
if consec_count == 2:
return True
else:
consec_count = 0
​
return False

Changing time components of pandas datetime64 column

I have a dataframe that can be simplified as:
date id
0 02/04/2015 02:34 1
1 06/04/2015 12:34 2
2 09/04/2015 23:03 3
3 12/04/2015 01:00 4
4 15/04/2015 07:12 5
5 21/04/2015 12:59 6
6 29/04/2015 17:33 7
7 04/05/2015 10:44 8
8 06/05/2015 11:12 9
9 10/05/2015 08:52 10
10 12/05/2015 14:19 11
11 19/05/2015 19:22 12
12 27/05/2015 22:31 13
13 01/06/2015 11:09 14
14 04/06/2015 12:57 15
15 10/06/2015 04:00 16
16 15/06/2015 03:23 17
17 19/06/2015 05:37 18
18 23/06/2015 13:41 19
19 27/06/2015 15:43 20
It can be created using:
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"]})
The data has the following types:
tempDF.dtypes
date object
id int64
dtype: object
I have set the 'date' variable to be Pandas datefime64 format (if that's the right way to describe it) using:
import numpy as np
import pandas as pd
tempDF['date'] = pd_to_datetime(tempDF['date'])
So now, the dtypes look like:
tempDF.dtypes
date datetime64[ns]
id int64
dtype: object
I want to change the hours of the original date data. I can use .normalize() to convert to midnight via the .dt accessor:
tempDF['date'] = tempDF['date'].dt.normalize()
And, I can get access to individual datetime components (e.g. year) using:
tempDF['date'].dt.year
This produces:
0 2015
1 2015
2 2015
3 2015
4 2015
5 2015
6 2015
7 2015
8 2015
9 2015
10 2015
11 2015
12 2015
13 2015
14 2015
15 2015
16 2015
17 2015
18 2015
19 2015
Name: date, dtype: int64
The question is, how can I change specific date and time components? For example, how could I change the midday (12:00) for all the dates? I've found that datetime.datetime has a .replace() function. However, having converted dates to Pandas format, it would make sense to keep in that format. Is there a way to do that without changing the format again?
EDIT :
A vectorized way to do this would be to normalize the series, and then add 12 hours to it using timedelta. Example -
tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Demo -
In [59]: tempDF
Out[59]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
In [60]: tempDF['date'].dt.normalize() + datetime.timedelta(hours=12)
Out[60]:
0 2015-02-04 12:00:00
1 2015-06-04 12:00:00
2 2015-09-04 12:00:00
3 2015-12-04 12:00:00
4 2015-04-15 12:00:00
5 2015-04-21 12:00:00
6 2015-04-29 12:00:00
7 2015-04-05 12:00:00
8 2015-06-05 12:00:00
9 2015-10-05 12:00:00
10 2015-12-05 12:00:00
11 2015-05-19 12:00:00
12 2015-05-27 12:00:00
13 2015-01-06 12:00:00
14 2015-04-06 12:00:00
15 2015-10-06 12:00:00
16 2015-06-15 12:00:00
17 2015-06-19 12:00:00
18 2015-06-23 12:00:00
19 2015-06-27 12:00:00
dtype: datetime64[ns]
Timing information for both methods at bottom
One method would be to use Series.apply along with the .replace() method OP mentions in his post. Example -
tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
Demo -
In [12]: tempDF
Out[12]:
date id
0 2015-02-04 02:34:00 1
1 2015-06-04 12:34:00 2
2 2015-09-04 23:03:00 3
3 2015-12-04 01:00:00 4
4 2015-04-15 07:12:00 5
5 2015-04-21 12:59:00 6
6 2015-04-29 17:33:00 7
7 2015-04-05 10:44:00 8
8 2015-06-05 11:12:00 9
9 2015-10-05 08:52:00 10
10 2015-12-05 14:19:00 11
11 2015-05-19 19:22:00 12
12 2015-05-27 22:31:00 13
13 2015-01-06 11:09:00 14
14 2015-04-06 12:57:00 15
15 2015-10-06 04:00:00 16
16 2015-06-15 03:23:00 17
17 2015-06-19 05:37:00 18
18 2015-06-23 13:41:00 19
19 2015-06-27 15:43:00 20
In [13]: tempDF['date'] = tempDF['date'].apply(lambda x:x.replace(hour=12,minute=0))
In [14]: tempDF
Out[14]:
date id
0 2015-02-04 12:00:00 1
1 2015-06-04 12:00:00 2
2 2015-09-04 12:00:00 3
3 2015-12-04 12:00:00 4
4 2015-04-15 12:00:00 5
5 2015-04-21 12:00:00 6
6 2015-04-29 12:00:00 7
7 2015-04-05 12:00:00 8
8 2015-06-05 12:00:00 9
9 2015-10-05 12:00:00 10
10 2015-12-05 12:00:00 11
11 2015-05-19 12:00:00 12
12 2015-05-27 12:00:00 13
13 2015-01-06 12:00:00 14
14 2015-04-06 12:00:00 15
15 2015-10-06 12:00:00 16
16 2015-06-15 12:00:00 17
17 2015-06-19 12:00:00 18
18 2015-06-23 12:00:00 19
19 2015-06-27 12:00:00 20
Timing information
In [52]: df = pd.DataFrame([[datetime.datetime.now()] for _ in range(100000)],columns=['date'])
In [54]: %%timeit
....: df['date'].dt.normalize() + datetime.timedelta(hours=12)
....:
The slowest run took 12.53 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 32.3 ms per loop
In [57]: %%timeit
....: df['date'].apply(lambda x:x.replace(hour=12,minute=0))
....:
1 loops, best of 3: 1.09 s per loop
Here's the solution I used to replace the time component of the datetime values in a Pandas DataFrame. Not sure how efficient this solution is, but it fit my needs.
import pandas as pd
# Create a list of EOCY dates for a specified period
sDate = pd.Timestamp('2022-01-31 23:59:00')
eDate = pd.Timestamp('2060-01-31 23:59:00')
dtList = pd.date_range(sDate, eDate, freq='Y').to_pydatetime()
# Create a DataFrame with a single column called 'Date' and fill the rows with the list of EOCY dates.
df = pd.DataFrame({'Date': dtList})
# Loop through the DataFrame rows using the replace function to replace the hours and minutes of each date value.
for i in range(df.shape[0]):
df.iloc[i, 0]=df.iloc[i, 0].replace(hour=00, minute=00)
Not sure how efficient this solution is, but it fit my needs.

Categories