MRE:
idx = pd.date_range('2015-07-03 08:00:00', periods=30,
freq='H')
data = np.random.randint(1, 100, size=len(idx))
df = pd.DataFrame({'index':idx, 'col':data})
df.set_index("index", inplace=True)
which looks like:
col
index
2015-07-03 08:00:00 96
2015-07-03 09:00:00 79
2015-07-03 10:00:00 15
2015-07-03 11:00:00 2
2015-07-03 12:00:00 84
2015-07-03 13:00:00 86
2015-07-03 14:00:00 5
.
.
.
Note that the dataframe spans multiple days. Since the frequency is hourly, it contains hourly data starting from 07/03 08:00:00.
I want the data to start from 05:00:00 on 07/03, even if the added rows contain the value 0 in the "col" column.
I want to extend it backwards so it starts from 05:00:00.
No, I can't just start from 05:00:00, since I already have a dataframe that starts from 08:00:00. I am trying to keep everything the same but add 3 rows at the beginning to cover 05:00:00, 06:00:00, and 07:00:00.
The reindex method is handy for changing the index values:
idx = pd.date_range('2015-07-03 08:00:00', periods=30, freq='H')
data = np.random.randint(1, 100, size=len(idx))
# use the index param to set index or you might lose the freq
df = pd.DataFrame({'col':data}, index=idx)
# reindex with a new index
# tshift was removed in pandas 2.0; move the first timestamp back 3 hours directly
start = df.index[0] - pd.Timedelta(hours=3)
end = df.index[-1]
new_index = pd.date_range(start, end, freq='H')
new_df = df.reindex(new_index)
resample is also very useful for date indices
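If the added rows should hold 0 rather than NaN, as the question mentions, reindex accepts a fill_value. A minimal sketch, assuming a deterministic "col" in place of the random data and a pd.Timedelta step in place of the 'H' alias (both are choices made here for illustration, not part of the original answer):

```python
import numpy as np
import pandas as pd

# One-hour step expressed as a Timedelta (assumption: used instead of a freq alias).
hour = pd.Timedelta(hours=1)

# Same shape as the MRE, but with predictable values 1..30 instead of random ones.
idx = pd.date_range("2015-07-03 08:00:00", periods=30, freq=hour)
df = pd.DataFrame({"col": np.arange(1, len(idx) + 1)}, index=idx)

# New start: 05:00:00 on the same day as the first existing row.
new_start = df.index[0].normalize() + pd.Timedelta(hours=5)
new_index = pd.date_range(new_start, df.index[-1], freq=hour)

# Rows that did not exist before are created with col == 0.
extended = df.reindex(new_index, fill_value=0)
print(extended.head())
```

Because fill_value is an integer, the column keeps its integer dtype instead of being upcast to float to hold NaN.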
Just change the start time from 08:00:00 to 05:00:00 in your code to create the 3 missing rows, then prepend them to the existing dataframe.
idx1 = pd.date_range('2015-07-03 05:00:00', periods=3,freq='H')
df1 = pd.DataFrame({'index': idx1 ,'col':np.random.randint(1,100,size = 3)})
df1.set_index('index',inplace=True)
df = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
print(df)
Add this snippet to your code...
I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
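The min/max behaviour described in UPDATE #2 can also be sketched with Series.clip. This is an alternative to the code above, not the answerer's method; it assumes the times are parsed with format='%H:%M:%S' so that every row shares the dummy date 1900-01-01, and it writes the result back as strings:

```python
import pandas as pd

df = pd.DataFrame({"time": ["08:45:00", "09:30:00", "18:00:00", "15:00:00"]})

# Parse the strings; with this format the date part defaults to 1900-01-01.
parsed = pd.to_datetime(df["time"], format="%H:%M:%S")

# Bounds of the allowed window on that same dummy date.
lower = pd.Timestamp("1900-01-01 09:00:00")
upper = pd.Timestamp("1900-01-01 17:00:00")

# Values before 09:00 become 09:00, values after 17:00 become 17:00.
clamped = parsed.clip(lower=lower, upper=upper)

# Convert back to HH:MM:SS strings.
df["time"] = clamped.dt.strftime("%H:%M:%S")
print(df)
```

Unlike the hour-shifting version, this discards the minutes of out-of-window times, so 08:45:00 becomes 09:00:00 rather than 09:45:00.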
Since your 'time' column contains strings, they can be kept as strings, with new string values assigned where appropriate. To filter on your criteria it is convenient to: create a datetime Series from the 'time' column, create boolean Series by comparing the datetime Series with your criteria, and use the boolean Series to select the rows that need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = '^(?:00|01|02|03|04|05|06|07|08)'
pattern_gt_seventeen = '^(?:17|18|19|20|21|22|23)'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
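A further string-only variant avoids regex altogether. It relies on the assumption that every value is a zero-padded HH:MM:SS string, in which case plain string comparison orders the values the same way as the times themselves. A minimal sketch under that assumption:

```python
import pandas as pd

dg = pd.DataFrame(
    {"time": ["08:45:00", "09:30:00", "18:00:00", "15:00:00", "07:22:00", "22:02:06"]}
)

# Zero-padded HH:MM:SS strings sort lexicographically in time order,
# so ordinary string comparison yields the boolean masks.
too_early = dg["time"] < "09:00:00"
too_late = dg["time"] > "17:00:00"

dg.loc[too_early, "time"] = "09:00:00"
dg.loc[too_late, "time"] = "17:00:00"
print(dg.to_string())
```

This breaks down if any value lacks the leading zero (for example '9:30:00'), so the datetime-based masks are the safer choice for untidy input.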
I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, with the headers [Date, Time, Value]. I want to reformat the dataframe so that it is indexed by date, with the times as column headers (i.e. 0:00, 1:00, ... , 23:00). The cells will be filled with the values.
Here is the dataframe I currently have.
Essentially, I'd like to move the index to a single day and show the hours across the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
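For a self-contained illustration of the pivot call, here is a small sketch. The four rows and their values are made up for the example; the column names 'Date', 'Time' and 'Total' follow the call above:

```python
import pandas as pd

# Long format: one row per (date, time) pair. Values are illustrative only.
df = pd.DataFrame(
    {
        "Date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02"],
        "Time": ["00:00:00", "01:00:00", "00:00:00", "01:00:00"],
        "Total": [1.0, 2.0, 3.0, 4.0],
    }
)

# Wide format: one row per date, one column per time of day.
wide = df.pivot(index="Date", columns="Time", values="Total")
print(wide)
```

pivot raises an error if the same (Date, Time) pair occurs more than once; pd.pivot_table aggregates such duplicates instead (by mean, by default).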
You could try this.
Split the time part to get only the hour, then add the prefix 'hr' to it.
df = pd.DataFrame([['2019-01-01', '00:00:00',-127.57],['2019-01-01', '01:00:00',-137.57],['2019-01-02', '00:00:00',-147.57],], columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr'+ str(int(x.split(':')[0])))
print(pd.pivot_table(df, values ='Totals', index=['Date'], columns = 'hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN
I have a dataframe with a column of dates of the form
2004-01-01
2005-01-01
2006-01-01
2007-01-01
2008-01-01
2009-01-01
2010-01-01
2011-01-01
2012-01-01
2013-01-01
2014-01-01
2015-01-01
2016-01-01
2017-01-01
2018-01-01
2019-01-01
Given an integer number k, let's say k=5, I would like to generate an array of the next k years after the maximum date of the column. The output should look like:
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
Use pd.to_datetime + max to find the latest date in the date column, then use pd.date_range with a one-year offset as the frequency and the number of periods equal to k=5:
strt, offs = pd.to_datetime(df['date']).max(), pd.DateOffset(years=1)
dates = pd.date_range(strt + offs, freq=offs, periods=k).strftime('%Y-%m-%d').tolist()
print(dates)
['2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01', '2024-01-01']
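The two lines above assume that df and k already exist. A self-contained version, assuming the column is named 'date' and holds the sixteen yearly strings from the question:

```python
import pandas as pd

k = 5

# Rebuild the question's column: 2004-01-01 through 2019-01-01.
df = pd.DataFrame({"date": [f"{year}-01-01" for year in range(2004, 2020)]})

# Latest date in the column and a one-year calendar step.
last = pd.to_datetime(df["date"]).max()
step = pd.DateOffset(years=1)

# Start one step after the maximum and generate k yearly dates.
dates = pd.date_range(last + step, periods=k, freq=step).strftime("%Y-%m-%d").tolist()
print(dates)
```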
Here you go:
import pandas as pd
# this is your k
k = 5
# Creating a test DF
array = {'dt': ['2018-01-01', '2019-01-01']}
df = pd.DataFrame(array)
# Extracting column of year
df['year'] = pd.DatetimeIndex(df['dt']).year
year1 = df['year'].max()
# creating a new DF populated with the next k years
# (DataFrame.append was removed in pandas 2.0, so build all rows at once)
years_df = pd.DataFrame({'dates': [str(year1 + i) + '-01-01' for i in range(1, k + 1)]})
years_df
The output:
dates
2020-01-01
2021-01-01
2022-01-01
2023-01-01
2024-01-01
I have three dataframes in Pandas, say df1, df2 and df3. The first column of all dataframes is the Timestamp (DateTime format like 2017-01-01 12:30:00 etc.) Here is an example of each's first column:-
df1 TimeStamp
2016-01-01 12:00:00
2016-01-01 12:10:00
.....
df2 TimeStamp
2016-01-01 12:00:00
2016-01-01 12:10:00
.....
df3 TimeStamp
2016-13-01 12:00:00
2016-13-01 12:30:00
.....
As you can see, the first two are at 10-minute intervals, while the third is at 30-minute intervals. I would like to merge all 3 dataframes so that where there is no exact match because data is unavailable (e.g. 12:10:00 is not available in the 3rd dataframe), the preceding measurement (12:00:00) is used for merging purposes (the date must of course be the same). Note that the dataframes have different sizes, but I would like to merge them on Timestamp for analytical purposes. Thank you!
DESIRED RESULT:
df_final TimeStamp .. Columns of df1 Columns of df2 Columns of df3
2016-13-01 12:00:00
2016-13-01 12:10:00
2016-13-01 12:20:00
.....
MORE DETAILS BASED ON ANSWER SUGGESTED
Firstly, as my dataframes (all 3) did not have index as TimeStamps, but had columns as TimeStamps, I set index for each as the TimeStamps:
df1.index = df1.TimeStamp
df2.index = df2.TimeStamp
df3.index = df3.TimeStamp
On using this
u_index = df3.index.union(df2.index.union(df1.index))
Strangely, I get odd output that is not at regular 10-minute intervals as needed.
Index(['2016-01-01 00:00:00.000', '2016-01-01 00:00:00.000',
'2016-01-01 00:00:00.000', '2016-01-01 00:00:00.000',
...
'2017-12-31 23:50:00.000', '2017-12-31 23:50:00.000',
'2017-12-31 23:50:00.000', '2017-12-31 23:50:00.000',
dtype='object', name='TimeStamp', length=3199372)
Accordingly, the final df1_n dataframe is at 30 min intervals and not 10 mins (as the Union of indices was not properly done). I think that there is something going wrong here and once Step 2 suggested (u_index) is working properly, everything will be easy to merge the dataframes.
I'm not 100% sure, but I take it you are asking how to fill in the missing values after merging the three dataframes, using the next valid observation.
If so, this is the quickest way I found to do it (not the most elegant...):
1. Create a new index which is the union of the three indexes (this will give timestamps at 10-minute intervals in your case).
2. Reindex all three dfs to the new index, filling in missing values for each one separately.
3. Merge the columns of the three dfs (which is easy, since after step 2 they share the same index).
taking a portion of the data:
df1
Out[48]:
val_1
TimeStamp
2016-01-01 12:00:00 11
2016-01-01 12:10:00 12
df2
Out[49]:
val_2
TimeStamp
2016-01-01 12:00:00 21
2016-01-01 12:10:00 22
df3
Out[50]:
val_3
TimeStamp
2016-01-01 12:00:00 31
2016-13-01 12:30:00 32
step NO.1
u_index = df3.index.union(df2.index.union(df1.index))
u_index
Out[38]: Index(['2016-01-01 12:00:00', '2016-01-01 12:10:00', '2016-13-01 12:30:00'], dtype='object', name='TimeStamp')
step NO.2
df3_n = df3.reindex(index=u_index,method='bfill')
df2_n = df2.reindex(index=u_index,method='bfill')
df1_n = df1.reindex(index=u_index,method='bfill')
step NO.3
df1_n.merge(df2_n,on='TimeStamp').merge(df3_n,on='TimeStamp')
Out[47]:
val_1 val_2 val_3
TimeStamp
2016-01-01 12:00:00 11.0 21.0 31
2016-01-01 12:10:00 12.0 22.0 32
2016-13-01 12:30:00 NaN NaN 32
You might need to adjust the last row, since it has no following row to fill values from, but that's pretty much it.
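The question asks for the preceding measurement rather than the next one, and the dtype='object' index in its output suggests the timestamps are still strings. A hedged sketch of one way to handle both, assuming the TimeStamp columns can be parsed with pd.to_datetime and using pd.merge_asof with direction='backward' in place of the reindex/bfill steps above (sample values are illustrative):

```python
import pandas as pd

stamps_10min = ["2016-01-01 12:00:00", "2016-01-01 12:10:00", "2016-01-01 12:20:00"]
df1 = pd.DataFrame({"TimeStamp": stamps_10min, "val_1": [11, 12, 13]})
df2 = pd.DataFrame({"TimeStamp": stamps_10min, "val_2": [21, 22, 23]})
df3 = pd.DataFrame(
    {"TimeStamp": ["2016-01-01 12:00:00", "2016-01-01 12:30:00"], "val_3": [31, 32]}
)

# Parse the strings into real datetimes and sort; merge_asof requires sorted keys.
frames = [
    frame.assign(TimeStamp=pd.to_datetime(frame["TimeStamp"])).sort_values("TimeStamp")
    for frame in (df1, df2, df3)
]

# For each left row, take the most recent right row at or before its timestamp.
merged = frames[0]
for right in frames[1:]:
    merged = pd.merge_asof(merged, right, on="TimeStamp", direction="backward")

print(merged)
```

Here the 12:10 and 12:20 rows pick up val_3 from 12:00, the last df3 reading available at those times. merge_asof also accepts a tolerance argument if matches should be limited to a maximum time gap.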
I want to read in weekday data and then reindex it so that the weekend is filled with Friday's data. I have tried the following code, but it does not reindex the data. set_index produces a length error message.
import pandas as pd
def fill_dataframe(filename):
    dataf = pd.read_csv(filename, header= None, index_col = [0])
    return(dataf)
rng = pd.date_range('10/1/2010', periods=61)
date_rng = pd.DataFrame(rng,index = rng)
data_1.reindex(date_rng, method = 'ffill')
The data read in has 41 rows and the generated date values have 61 rows. Any suggestions?
data read in by csv (1st 7 rows)
X0 X1
10/1/2010 71.27
10/4/2010 70.33
10/5/2010 72.94
10/6/2010 74.15
10/7/2010 71.40
10/8/2010 72.58
10/11/2010 72.66
dates generated by rng in the second Data Frame (first 11 rows)
0
2010-10-01 2010-10-01 00:00:00
2010-10-02 2010-10-02 00:00:00
2010-10-03 2010-10-03 00:00:00
2010-10-04 2010-10-04 00:00:00
2010-10-05 2010-10-05 00:00:00
2010-10-06 2010-10-06 00:00:00
2010-10-07 2010-10-07 00:00:00
2010-10-08 2010-10-08 00:00:00
2010-10-09 2010-10-09 00:00:00
2010-10-10 2010-10-10 00:00:00
2010-10-11 2010-10-11 00:00:00
Reindexing by the (1D) date range itself, or by a Series built from it, works (in 0.10.1):
data_1.reindex(rng, method = 'ffill')
data_1.reindex(pd.Series(rng, index=rng), method = 'ffill')
With date_rng as the DataFrame I get TypeError: Cannot compare Timestamp with 0. I suspect this could be a bug, but I'm not entirely sure what the expected behaviour should be...
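A minimal sketch of the weekend fill on current pandas, using three of the question's values built inline instead of read from the CSV. It assumes the index is a real DatetimeIndex; when reading from a file, passing parse_dates=True to read_csv is one way to get that:

```python
import pandas as pd

# Weekday-only data: Friday 2010-10-01, then Monday and Tuesday.
weekdays = pd.to_datetime(["2010-10-01", "2010-10-04", "2010-10-05"])
data_1 = pd.DataFrame({"X1": [71.27, 70.33, 72.94]}, index=weekdays)

# Daily range covering the weekend as well (default frequency is one day).
rng = pd.date_range("2010-10-01", periods=5)

# Forward fill: Saturday and Sunday take Friday's value.
filled = data_1.reindex(rng, method="ffill")
print(filled)
```

The key point is to reindex with the date range itself, not with a DataFrame built from it.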