I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
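The nanosecond arithmetic above can also be written with pd.to_timedelta, which some readers may find easier to follow. This is only a sketch, assuming the column is already datetime64; the name shift_hours is illustrative and not part of the original answer. Like the code above, it moves the hour into the 9 to 17 range and leaves minutes and seconds untouched.

```python
import pandas as pd

df = pd.DataFrame({'time': pd.to_datetime([
    '2022-06-06 08:45:00', '2022-06-06 09:30:00',
    '2022-06-06 18:00:00', '2022-06-06 15:00:00'])})

hour = df['time'].dt.hour
# Hours to add (negative means subtract) so that the hour lands in [9, 17]
shift_hours = hour.clip(lower=9, upper=17) - hour
df['time'] = df['time'] + pd.to_timedelta(shift_hours, unit='h')
print(df)
```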
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
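As a possible alternative for the min/max behaviour, the times can be compared against two boundary timestamps and replaced with Series.mask. This is a minimal sketch, not the answer's method: it assumes the column holds 'HH:MM:SS' strings and should end up as strings again, and the names earliest, latest and clamped are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'time': ['08:45:00', '09:30:00', '18:00:00', '15:00:00']})

# Parsing with an explicit format puts every time on the same dummy date (1900-01-01)
t = pd.to_datetime(df['time'], format='%H:%M:%S')
earliest = pd.Timestamp('1900-01-01 09:00:00')
latest = pd.Timestamp('1900-01-01 17:00:00')

# Replace anything before 09:00 with 09:00 and anything after 17:00 with 17:00
clamped = t.mask(t < earliest, earliest).mask(t > latest, latest)
df['time'] = clamped.dt.strftime('%H:%M:%S')
print(df)
```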
Since your 'time' column contains strings, the values can be kept as strings, with new string values assigned where appropriate. To filter for your criteria it is convenient to: (1) create a datetime Series from the 'time' column, (2) create boolean Series by comparing the datetime Series with your criteria, and (3) use the boolean Series to select the rows that need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
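The same assignment can probably also be written with np.where, which is what the question attempted. np.where needs a value for the True case and a value for the False case, so the second rule is nested as the False branch of the first. This is a sketch that uses the same thresholds as the boolean Series above (hour < 9 and hour >= 17); the variable name hours is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['08:45:00', '09:30:00', '18:00:00', '15:00:00']})
hours = pd.to_datetime(df['time'], format='%H:%M:%S').dt.hour

# np.where(condition, value_if_true, value_if_false);
# the inner call handles the second rule and otherwise keeps the original string
df['time'] = np.where(hours < 9, '09:00:00',
                      np.where(hours >= 17, '17:00:00', df['time']))
print(df.to_string())
```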
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
pattern_lt_nine = '^00|01|02|03|04|05|06|07|08'
pattern_gt_seventeen = '^17|18|19|20|21|22|23'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
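The alternation patterns above can be shortened with character classes. The sketch below is one way to write them, assuming zero-padded 'HH:MM:SS' strings; the names before_nine and from_seventeen are illustrative. Since .str.match only matches at the start of each string, no explicit '^' is needed.

```python
import pandas as pd

dg = pd.DataFrame({'time': ['08:45:00', '09:30:00', '18:00:00',
                            '15:00:00', '07:22:00', '22:02:06']})

# Hours 00-08 followed by the first colon
before_nine = dg['time'].str.match(r'0[0-8]:')
# Hours 17-19 or 20-23 followed by the first colon
from_seventeen = dg['time'].str.match(r'(?:1[7-9]|2[0-3]):')

dg.loc[before_nine, 'time'] = '09:00:00'
dg.loc[from_seventeen, 'time'] = '17:00:00'
print(dg.to_string())
```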
Time series / date functionality
Working with text data
I have the below data sample
date,00:00:00,00:15:00,00:30:00,00:45:00,01:00:00,01:15:00,01:30:00,01:45:00,02:00:00,event
2008-01-01,115.87869701,115.37569504,79.9510802,123.68891355,110.89528693, 112.15190765,110.1277647,76.16662078,100.39338951,A
2008-01-02,104.29757522,89.11652179,91.80890697,109.91423556,112.91809129,114.91459611,117.50170579,111.08030786,81.5893157,B
2008-01-02,81.16506701,97.13170328,89.25478466,93.51884481,107.11447296,120.40638709,116.1653649,79.8861492,111.99530301,C
2008-01-02,121.98507602,105.20973701,84.46996209,96.2210916,107.65437228,121.4604217,120.96638889,117.94695867,94.33309319,D
2008-01-02,82.5839125,104.3308685,98.32658468,101.79562494,86.02883206,90.61788466,109.89027977,107.89093632,101.64082595,E
2008-01-02,100.68446746,89.90700858,115.97450984,112.85364917,100.76204374,87.49141078,81.69930821,79.78106694,99.97354515,F
2008-01-02,98.49917234,112.93161335,85.30015915,120.59233515,102.15602621,84.9536008,116.98786228,107.95753105,112.75693735,G
2008-01-02,76.5186262,111.22137123,102.20065099,88.4490991,84.67584098,86.00205813,95.02734271,114.29076806,102.62969032,H
2008-01-02,93.27785451,122.90242719,123.27263927,102.83454346,87.84973282,95.38098403,88.03719802,108.68335342,97.6581398,I
2008-01-02,119.589143,94.15858259,94.32809506,120.5637488,120.43827996,79.66190052,100.40782173,89.90362719,80.46005726,J
I want to assign clusters to the data and have the final output in the below format
Expected output
time date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 0
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 1
I have tried the below and the current output does not return 'time' label in the first row
import pandas as pd
import numpy as np
from datetime import datetime
from scipy.cluster.vq import kmeans, vq, whiten
from scipy.spatial.distance import cdist
from sklearn import metrics
#read data
df = pd.read_csv('df.csv', index_col=0)
df = df.drop(['event'], axis=1)
#stack the data
df = df.stack()
df.index = pd.to_datetime([' '.join(i) for i in df.index])
df = df.rename_axis('event_timestamp').reset_index(name='value')
df.index = df.event_timestamp
df = df.drop(['event_timestamp'], axis=1)
df.columns = ['value']
#normalize the df
df_norm = (df - df.mean()) / (df.max() - df.min())
df['time'] = df.index.map(lambda t: t.time())
df['date'] = df.index.map(lambda t: t.date())
df_norm['time'] = df_norm.index.map(lambda t: t.time())
df_norm['date'] = df_norm.index.map(lambda t: t.date())
#pivot data
df_daily = pd.pivot_table(df, values='value', index='date', columns='time', aggfunc='mean')
df_daily_norm = pd.pivot_table(df_norm, values='value', index='date', columns='time', aggfunc='mean')
#assign clusters to daily data
df_daily_matrix_norm = np.matrix(df_daily_norm.dropna())
centers, _ = kmeans(df_daily_matrix_norm, 2)
cluster, _ = vq(df_daily_matrix_norm, centers)
clusterdf = pd.DataFrame(cluster, columns=['cluster_num'])
dailyclusters = pd.concat([df_daily.dropna().reset_index(), clusterdf], axis=1)
print(dailyclusters)
Current output
date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 0
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 1
What do I need to do to get the desired output with the 'time' label?
Simply add the name to the index:
dailyclusters.index.name = "time"
Use:
dailyclusters = df_daily.dropna().assign(cluster_num=cluster).reset_index()
print(dailyclusters)
# Output
time date 00:00:00 00:15:00 00:30:00 00:45:00 01:00:00 01:15:00 01:30:00 01:45:00 02:00:00 cluster_num
0 2008-01-01 115.878697 115.375695 79.951080 123.688914 110.895287 112.151908 110.127765 76.166621 100.393390 1
1 2008-01-02 97.622322 102.989982 98.326255 105.193686 101.066410 97.876583 105.187030 101.935633 98.115212 0
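A label such as 'time' can belong either to the row index or to the column axis, and the two print differently. pivot_table(..., columns='time') names the column axis, which may be why the label survives assign(...).reset_index() but not the pd.concat call in the question. The sketch below uses a small made-up frame (base is a hypothetical name) to show both placements.

```python
import pandas as pd

base = pd.DataFrame({'date': ['2008-01-01', '2008-01-02'],
                     'cluster_num': [0, 1]})

# Name on the column axis: printed on the header row, left of the column names
by_columns = base.rename_axis(columns='time')
print(by_columns)

# Name on the row index: printed on its own line below the header
by_index = base.rename_axis(index='time')
print(by_index)
```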
I'm attempting to create a rolling average over 10 minutes on an irregularly time stepped data set. I get the error shown below
Traceback (most recent call last):
File "asosreaderpandas.py", line 13, in <module>
df.rolling('10min').mean()
File "/opt/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 8900, in rolling
on=on, axis=axis, closed=closed)
File "/opt/anaconda3/lib/python3.6/site-packages/pandas/core/window.py", line 2469, in rolling
return Rolling(obj, **kwds)
File "/opt/anaconda3/lib/python3.6/site-packages/pandas/core/window.py", line 80, in __init__
self.validate()
File "/opt/anaconda3/lib/python3.6/site-packages/pandas/core/window.py", line 1478, in validate
raise ValueError("window must be an integer")
ValueError: window must be an integer
This is the code I am using to create my rolling average. I would enter my timestamps manually, as that has solved the issue in the past, but the .txt file is 98,000 lines long.
import pandas as pd
from datetime import datetime
df = pd.read_csv('KART.txt', header = 0)
#indexing the date format from txt file
pd.to_datetime(df.index, format='%Y-%m-%d %H:%M')
#creating ten minute average
df.rolling('10min').mean()
print(df)
I don't understand the pandas module well. I have tried multiple ways of assigning my datetime, to no avail. Am I going about this completely wrong?
Dataset Sample
0,1
2019-01-01 00:00:00,4
2019-01-01 00:05:00,4
2019-01-01 00:10:00,4
2019-01-01 00:15:00,4
2019-01-01 00:25:00,5
2019-01-01 00:30:00,4
2019-01-01 00:35:00,4
2019-01-01 00:40:00,4
2019-01-01 00:45:00,4
2019-01-01 00:50:00,4
2019-01-01 00:55:00,4
2019-01-01 00:56:00,4
2019-01-01 01:00:00,4
...
You have multiple issues in your code:
you load the dataframe without specifying an index column, so it gets an automatic integer index (which you later try to convert to datetime, and that is not what you want)
you don't save the index when you convert it to datetime
Here's the fixed version:
import pandas as pd
from datetime import datetime
df = pd.read_csv('KART.txt', header = 0, index_col=0) # <- specified column index
df.index = pd.to_datetime(df.index, format='%Y-%m-%d %H:%M:%S') # <- saving index when converting it to datetime (format includes seconds, as in the sample)
df.rolling('10min').mean()
> 1
0
2019-01-01 00:00:00 4.0
2019-01-01 00:05:00 4.0
2019-01-01 00:10:00 4.0
2019-01-01 00:15:00 4.0
2019-01-01 00:25:00 5.0
2019-01-01 00:30:00 4.5
2019-01-01 00:35:00 4.0
2019-01-01 00:40:00 4.0
2019-01-01 00:45:00 4.0
2019-01-01 00:50:00 4.0
2019-01-01 00:55:00 4.0
2019-01-01 00:56:00 4.0
2019-01-01 01:00:00 4.0
...
EDIT
Thanks to Parfait's comment, you can get an even shorter version of the code by parsing the dates right in the read_csv call:
import pandas as pd
from datetime import datetime
df = pd.read_csv('KART.txt',
header = 0,
index_col=0, # <-- specified column index
parse_dates=True) # <-- parsed dates from txt
df.rolling('10min').mean()
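Note that rolling(...).mean() returns a new object rather than changing df, so the result has to be assigned (or printed directly) to be seen. The sketch below assumes a few rows copied from the sample in the question, read from an in-memory string instead of KART.txt; the names raw and rolled are illustrative.

```python
import io
import pandas as pd

raw = """0,1
2019-01-01 00:15:00,4
2019-01-01 00:25:00,5
2019-01-01 00:30:00,4
2019-01-01 00:35:00,4
"""

df = pd.read_csv(io.StringIO(raw), header=0, index_col=0, parse_dates=True)

# Each window covers the 10 minutes up to and including the row's own timestamp
rolled = df.rolling('10min').mean()
print(rolled)
```

Here df itself still holds the original values; only rolled contains the averages.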
I have a csv-file with time series data, the first column is the date in the format %Y:%m:%d and the second column is the intraday time in the format '%H:%M:%S'. I would like to import this csv-file into a multiindex dataframe or panel object.
With this code, it already works:
_file_data = pd.read_csv(_file,
sep=",",
header=0,
index_col=['Date', 'Time'],
thousands="'",
parse_dates=True,
skipinitialspace=True
)
It returns the data in the following format:
Date Time Volume
2016-01-04 2018-04-25 09:01:29 53645
2018-04-25 10:01:29 123
2018-04-25 10:01:29 1345
....
2016-01-05 2018-04-25 10:01:29 123
2018-04-25 12:01:29 213
2018-04-25 10:01:29 123
1st question:
I would like to show the second index level as a pure time object, not a datetime. To do that, I have to declare two different date parsers in the read_csv function, but I can't figure out how. What is the "best" way to do that?
2nd question:
After I created the Dataframe, I converted it to a panel-object. Would you recommend doing that? Is the panel-object the better choice for such a data structure? What are the benefits (drawbacks) of a panel-object?
1st question:
You can create multiple converter functions and pass them to read_csv in a dictionary:
import pandas as pd
from io import StringIO
temp=u"""Date,Time,Volume
2016:01:04,09:00:00,53645
2016:01:04,09:20:00,0
2016:01:04,09:40:00,0
2016:01:04,10:00:00,1468
2016:01:05,10:00:00,246
2016:01:05,10:20:00,0
2016:01:05,10:40:00,0
2016:01:05,11:00:00,0
2016:01:05,11:20:00,0
2016:01:05,11:40:00,0
2016:01:05,12:00:00,213"""
def converter1(x):
    #convert to datetime and then to times
    return pd.to_datetime(x).time()
def converter2(x):
    #define format of datetime
    return pd.to_datetime(x, format='%Y:%m:%d')
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                 index_col=['Date','Time'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter1, 'Date': converter2})
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
Sometimes it is possible to use the built-in parser, e.g. if the format of the dates is YYYY-MM-DD:
import pandas as pd
from io import StringIO
temp=u"""Date,Time,Volume
2016-01-04,09:00:00,53645
2016-01-04,09:20:00,0
2016-01-04,09:40:00,0
2016-01-04,10:00:00,1468
2016-01-05,10:00:00,246
2016-01-05,10:20:00,0
2016-01-05,10:40:00,0
2016-01-05,11:00:00,0
2016-01-05,11:20:00,0
2016-01-05,11:40:00,0
2016-01-05,12:00:00,213"""
def converter(x):
    #convert to datetime and then to time
    return pd.to_datetime(x).time()
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                 index_col=['Date','Time'],
                 parse_dates=['Date'],
                 thousands="'",
                 skipinitialspace=True,
                 converters={'Time': converter})
print (df.index.get_level_values(0))
DatetimeIndex(['2016-01-04', '2016-01-04', '2016-01-04', '2016-01-04',
'2016-01-05', '2016-01-05', '2016-01-05', '2016-01-05',
'2016-01-05', '2016-01-05', '2016-01-05'],
dtype='datetime64[ns]', name='Date', freq=None)
The last possible solution, if the second level was parsed as datetimes, is to convert that level to times after processing with set_levels. Pass the unique level values (df.index.levels[1]), not get_level_values(1), otherwise the times end up attached to the wrong rows:
df.index = df.index.set_levels(df.index.levels[1].time, level=1)
print (df)
Volume
Date Time
2016-01-04 09:00:00 53645
09:20:00 0
09:40:00 0
10:00:00 1468
2016-01-05 10:00:00 246
10:20:00 0
10:40:00 0
11:00:00 0
11:20:00 0
11:40:00 0
12:00:00 213
2nd question:
Panel has been deprecated since pandas 0.20 and was removed in pandas 0.25, so a DataFrame with a MultiIndex is the better choice for this kind of data.
To keep only the time of day, convert the column with pd.to_timedelta.
Ex:
import pandas as pd
df = pd.DataFrame({"Time": ["2018-04-25 09:01:29", "2018-04-25 10:01:29", "2018-04-25 10:01:29"]})
df["Time"] = pd.to_timedelta(pd.to_datetime(df["Time"]).dt.strftime('%H:%M:%S'))
print(df["Time"])
Output:
0 09:01:29
1 10:01:29
2 10:01:29
Name: Time, dtype: timedelta64[ns]
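For comparison, here is a small sketch of two ways to drop the date part without the strftime round trip: .dt.time gives datetime.time objects, while subtracting the normalized (midnight) timestamp gives a timedelta64 column like the one above. The names as_time and as_delta are illustrative, and which form fits better depends on whether arithmetic on the times is needed afterwards.

```python
import datetime
import pandas as pd

s = pd.to_datetime(pd.Series(["2018-04-25 09:01:29", "2018-04-25 10:01:29"]))

# Option 1: datetime.time objects (object dtype)
as_time = s.dt.time

# Option 2: offset from midnight as timedelta64, which supports arithmetic
as_delta = s - s.dt.normalize()

print(as_time)
print(as_delta)
```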
I have a large data set like this
user category
time
2014-01-01 00:00:00 21155349 2
2014-01-01 00:00:00 56347479 6
2014-01-01 00:00:00 68429517 13
2014-01-01 00:00:00 39055685 4
2014-01-01 00:00:00 521325 13
I want to make it as
user category
time
00:00:00 21155349 2
00:00:00 56347479 6
00:00:00 68429517 13
00:00:00 39055685 4
00:00:00 521325 13
How can I do this using pandas?
If you want to mutate a series (column) in pandas, the pattern is to apply a function to it (one that updates one element of the series at a time), and then to assign that series back into the dataframe:
import io
import pandas
# load data
data = '''date,user,category
2014-01-01 00:00:00, 21155349, 2
2014-01-01 00:00:00, 56347479, 6
2014-01-01 00:00:00, 68429517, 13
2014-01-01 00:00:00, 39055685, 4
2014-01-01 00:00:00, 521325, 13'''
df = pandas.read_csv(io.StringIO(data))
df['date'] = pandas.to_datetime(df['date'])
# make the required change
without_date = df['date'].apply(lambda d: d.time())
df['date'] = without_date
# display results
print(df)
If the problem is because the date is the index, you've got a few more hoops to jump through:
df = pandas.read_csv(io.StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
# set_index returns a new dataframe, so assign the result
df = df.set_index(ser.apply(lambda d: d.time()))
As suggested by @DSM, if you have pandas later than 0.15.2, you can use the .dt accessor on the series to do fast updates:
df = pandas.read_csv(io.StringIO(data), index_col='date')
ser = pandas.to_datetime(df.index).to_series()
df = df.set_index(ser.dt.time)