Group by and fill missing datetime values - python

What I'm trying to do is group a Pandas DataFrame by contract and date, and fill in missing datetime values.
My input is this:
contract datetime value1 value2
x 2019-01-01 00:00:00 50 60
x 2019-01-01 01:00:00 30 60
x 2019-01-01 02:00:00 70 80
y 2019-01-01 00:00:00 30 100
What I want to do is to have all possible datetimes (from 00:00:00 to 23:00:00) for each contract, and fill missing values with NaN or None.
Thank you very much.

You can use DataFrame.reindex per group with DataFrame.groupby and a lambda function:
df['datetime'] = pd.to_datetime(df['datetime'])

f = lambda x: x.reindex(pd.date_range(x.index.min().floor('d'),
                                      x.index.max().floor('d') + pd.Timedelta(23, 'H'),
                                      freq='H'))

df1 = (df.set_index('datetime')
         .groupby('contract')
         .apply(f)
         .drop('contract', axis=1)
         .reset_index())
print(df1)
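For reference, a minimal reproduction of the sample input above, so the snippet can be run end to end:

import pandas as pd

df = pd.DataFrame({
    'contract': ['x', 'x', 'x', 'y'],
    'datetime': ['2019-01-01 00:00:00', '2019-01-01 01:00:00',
                 '2019-01-01 02:00:00', '2019-01-01 00:00:00'],
    'value1': [50, 30, 70, 30],
    'value2': [60, 60, 80, 100],
})

After the reindex, each contract carries all 24 hourly rows for its day, with NaN in value1 and value2 for the hours missing from the input.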

Related

Time Series Data Reformat

I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, the header being [Date, Time, Value]. I want to reformat the dataframe to index by the date and use a header with the time (i.e. 0:00, 1:00, ..., 23:00). The dataframe will be filled in with the values.
Here is the DataFrame I currently have. Essentially I'd like to move the index to a single day and show the hours across the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[:, 0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
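Note that pivot requires the (Date, Time) pairs to be unique and raises a ValueError on duplicates. If the data may contain duplicate timestamps, pivot_table is the safer call since it aggregates them (a sketch; the default aggfunc is 'mean'):

df = df.pivot_table(index='Date', columns='Time', values='Total', aggfunc='mean')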
You could try this: split the time part to get only the hour and prefix it with hr.
df = pd.DataFrame([['2019-01-01', '00:00:00', -127.57],
                   ['2019-01-01', '01:00:00', -137.57],
                   ['2019-01-02', '00:00:00', -147.57]],
                  columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr' + str(int(x.split(':')[0])))
print(pd.pivot_table(df, values='Totals', index=['Date'], columns='hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN
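One caveat with string labels like hr0, hr1, ...: once the hours reach double digits they sort lexicographically (hr10, hr11, ... before hr2). Keeping the zero-padded hour avoids this (a sketch, reusing the frame above):

df['hours'] = df['Time'].apply(lambda x: 'hr' + x.split(':')[0])  # 'hr00', 'hr01', ..., 'hr23'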

Python/Pandas: dataframe merge and fillna

I am trying to merge two Pandas DataFrames based on date and then forward-fill NaN values up to a specific date. I have the following data example:
df_1:
date   value1
01/12  10
02/12  20
03/12  30
04/12  40
05/12  50
06/12  60
07/12  70

df_2:
date   value2
01/12  100
03/12  300
05/12  500
I use the following line:
df = pd.merge(df_1, df_2, how='left', on=['date'])
I get this:
date   value1  value2
01/12  10      100
02/12  20      NaN
03/12  30      300
04/12  40      NaN
05/12  50      500
06/12  60      NaN
07/12  70      NaN
What I want to achieve is to forward-fill the NaN values in df['value2'] up to 05/12, not all the way to 07/12.
First, convert date to datetime format so that it supports comparison operators; it will render as YYYY-MM-DD by default.
Next, create a mask for your condition (forward-fill up to 05/12) and use loc with fillna.
Lastly, convert date back from datetime to string:
df['date'] = pd.to_datetime(df['date'], format='%d/%m')
mask = df['date'].lt(pd.to_datetime('05/12', format='%d/%m'))
df.loc[mask, 'value2'] = df.loc[mask, 'value2'].fillna(method='ffill')
df['date'] = df['date'].dt.strftime('%d/%m')
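With the sample frames above, this gives the expected result (04/12 is filled forward from 03/12, while 06/12 and 07/12 stay NaN because they fall outside the mask):

date   value1  value2
01/12  10      100.0
02/12  20      100.0
03/12  30      300.0
04/12  40      300.0
05/12  50      500.0
06/12  60      NaN
07/12  70      NaN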

Python - Pandas, count time diff from first record in a group

In continuation of this question.
Having the following DF:
group_id timestamp
A 2020-09-29 06:00:00 UTC
A 2020-09-29 08:00:00 UTC
A 2020-09-30 09:00:00 UTC
B 2020-09-01 04:00:00 UTC
B 2020-09-01 06:00:00 UTC
I would like to count, across all groups, the time deltas of each record from the first record of its group, without counting deltas between groups. Result for the above example:
delta count
2 2
27 1
Explanation: In group A the deltas are
06:00:00 -> 08:00:00 (2 hours)
08:00:00 -> 09:00:00 on the next day (27 hours from the first event)
And in group B:
04:00:00 -> 06:00:00 (2 hours)
How can I achieve this using Python Pandas?
The first idea is to use a custom lambda function with Series.cumsum for the cumulative sum:
df['timestamp'] = pd.to_datetime(df['timestamp'])

df1 = (df.groupby("group_id")['timestamp']
         .apply(lambda x: x.diff().dt.total_seconds().cumsum())
         .div(3600)
         .value_counts()
         .rename_axis('delta')
         .reset_index(name='count')
       )
print(df1)
delta count
0 2.0 2
1 27.0 1
Or add another groupby with GroupBy.cumsum:
df['timestamp'] = pd.to_datetime(df['timestamp'])

df1 = (df.groupby("group_id")['timestamp']
         .diff()
         .dt.total_seconds()
         .div(3600)
         .groupby(df['group_id'])
         .cumsum()
         .value_counts()
         .rename_axis('delta')
         .reset_index(name='count')
       )
print(df1)
delta count
0 2.0 2
1 27.0 1
Another idea is to subtract the first value per group via GroupBy.transform with GroupBy.first; to remove each group's first row (which would otherwise contribute a 0 delta), a filter by Series.duplicated is added:
df['timestamp'] = pd.to_datetime(df['timestamp'])

df1 = (df['timestamp'].sub(df.groupby("group_id")['timestamp'].transform('first'))
         .loc[df['group_id'].duplicated()]
         .dt.total_seconds()
         .div(3600)
         .value_counts()
         .rename_axis('delta')
         .reset_index(name='count')
       )
print(df1)
delta count
0 2.0 2
1 27.0 1
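For completeness, a minimal reproduction of the sample frame used by all three variants (pd.to_datetime parses the trailing "UTC" into a timezone-aware timestamp):

import pandas as pd

df = pd.DataFrame({
    'group_id': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': ['2020-09-29 06:00:00 UTC', '2020-09-29 08:00:00 UTC',
                  '2020-09-30 09:00:00 UTC', '2020-09-01 04:00:00 UTC',
                  '2020-09-01 06:00:00 UTC'],
})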

Create a list of years with pandas

I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Let's use pd.to_datetime + max to compute the largest date in the column date, then use pd.date_range to generate dates at a one-year offset frequency with the number of periods equal to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
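An equivalent sketch using an explicit year-start frequency alias instead of the DateOffset (note: recent pandas spells the alias 'YS', older versions use 'AS'):

last = pd.to_datetime(df['date']).max()
# since the sample dates fall on January 1st, the range starts at the max
# date itself, hence the [1:] slice
dates = pd.date_range(last, periods=k + 1, freq='YS')[1:].strftime('%Y-%m-%d').tolist()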
Here you go:
import pandas as pd

# this is your k
k = 5

# creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)

# extracting the year column and taking its maximum
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()

# creating a new DF populated with the next k years
# (DataFrame.append was removed in pandas 2.0, so build the rows up front)
years_df = pd.DataFrame({'dates': [f'{year1 + i}-01-01' for i in range(1, k + 1)]})
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01

Pandas Pivot with datetime index

I am having some trouble pivoting a dataframe with a datetime value as the index.
My df looks like this:
Timestamp Value
2016-01-01 00:00:00 16.546900
2016-01-01 01:00:00 16.402375
2016-01-01 02:00:00 16.324250
Where Timestamp is a datetime64[ns]. I am trying to pivot the table so that it looks like this:
Hour 0 1 2 4 ....
Date
2016-01-01 16.5 16.4 16.3 17 ....
....
....
I've tried using the code below but am getting an error when I run it.
df3 = pd.pivot_table(df2,index=np.unique(df2.index.date),columns=np.unique(df2.index.hour),values=df2.Temp)
KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>()
      1 # Pivot Table
----> 2 df3 = pd.pivot_table(df2,index=np.unique(df2.index.date),columns=np.unique(df2.index.hour),values=df2.Temp)

~\Anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name)
     56     for i in values:
     57         if i not in data:
---> 58             raise KeyError(i)
     59
     60     to_filter = []

KeyError: 16.5469
Any help or insights would be greatly appreciated.
A different way of accomplishing this without lambda helper columns is to build the indices directly from the DatetimeIndex:
df2 = pd.pivot_table(df, index=df.index.date, columns=df.index.hour, values="Value")
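With the three sample rows above, this yields dates as the index and integer hours as the columns (a sketch of the expected output):

                  0          1         2
2016-01-01  16.5469  16.402375  16.32425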
I slightly extended the input data as below (assuming no duplicated entries in the same date/hour):
Timestamp Value
2016-01-01 00:00:00 16.546900
2016-01-01 01:00:00 16.402375
2016-01-01 02:00:00 16.324250
2016-01-01 04:00:00 16.023928
2016-01-03 04:00:00 16.101919
2016-01-05 23:00:00 13.405928
It looks a bit awkward, but something like the following works:
df2['Date'] = df2.Timestamp.apply(lambda x: str(x).split(" ")[0])
df2['Hour'] = df2.Timestamp.apply(lambda x: str(x).split(" ")[1].split(":")[0])
df3 = pd.pivot_table(df2, values='Value', index='Date', columns='Hour')
[Output]
Hour 00 01 02 04 23
Date
2016-01-01 16.5469 16.402375 16.32425 16.023928 NaN
2016-01-03 NaN NaN NaN 16.101919 NaN
2016-01-05 NaN NaN NaN NaN 13.405928
Finally, if your columns need to be integers:
df3.columns = [int(x) for x in df3.columns]
Hope this helps.
Adapting @Seanny123's answer above for an arbitrary cadence:
import datetime
from datetime import date

import numpy as np
import pandas as pd
import pytz

start = [2018, 1, 1, 0, 0, 0]
end = [date.today().year, date.today().month, date.today().day]
quant = 'freq'
cadence = '5min'

# build a timezone-aware range at the chosen cadence
sTime_tmp = datetime.datetime(start[0], start[1], start[2], tzinfo=pytz.UTC)
eTime_tmp = datetime.datetime(end[0], end[1], end[2], tzinfo=pytz.UTC)
t = pd.date_range(start=sTime_tmp, end=eTime_tmp, freq=cadence)

# one column of placeholder values, then pivot time-of-day vs. date
keo = pd.DataFrame(np.nan, index=t, columns=[quant])
keo[quant] = 0
keo = pd.pivot_table(keo, index=keo.index.time, columns=keo.index.date, values=quant)
keo
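The result is a keogram-style layout: one row per time of day at the chosen cadence, one column per date. A quick sanity check on the shape (assuming the 5min cadence above):

print(keo.shape)  # (288, number_of_days) -- 24 * 60 / 5 = 288 five-minute slots per day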
