I am learning to use the pandas resample() function, but the following code returns an empty DataFrame instead of the daily summary I expected. I resampled the time series by day.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print df.head()
weekly_summary = pd.DataFrame()
weekly_summary['speed'] = df.speed.resample('D').mean()
weekly_summary['distance'] = df.distance.resample('D').sum()
print weekly_summary.head()
Output
speed distance cumulative_distance
2015-01-01 00:00:00 40 10.00 10.00
2015-01-01 00:15:00 6 1.50 11.50
2015-01-01 00:30:00 31 7.75 19.25
2015-01-01 00:45:00 41 10.25 29.50
2015-01-01 01:00:00 59 14.75 44.25
[5 rows x 3 columns]
Empty DataFrame
Columns: [speed, distance]
Index: []
[0 rows x 2 columns]
How you do this will vary depending on your pandas version.
In pandas 0.19.0, your code works as expected:
In [7]: pd.__version__
Out[7]: '0.19.0'
In [8]: df.speed.resample('D').mean().head()
Out[8]:
2015-01-01 28.562500
2015-01-02 30.302083
2015-01-03 30.864583
2015-01-04 29.197917
2015-01-05 30.708333
Freq: D, Name: speed, dtype: float64
In older versions your code might not work, but at least in 0.14.1 you can tweak it to do so by passing how='mean' instead of chaining .mean():
>>> pd.__version__
'0.14.1'
>>> df.speed.resample('D').mean()  # in 0.14.1, resample('D') already aggregates, so .mean() collapses to a scalar
29.41087328767123
>>> df.speed.resample('D', how='mean').head()
2015-01-01 29.354167
2015-01-02 26.791667
2015-01-03 31.854167
2015-01-04 26.593750
2015-01-05 30.312500
Freq: D, Name: speed, dtype: float64
This looks like an issue with an old version of pandas; newer versions will enlarge the DataFrame when you assign a new column whose index is not the same shape. What should work is to not make an empty DataFrame, and instead pass the initial call to resample as the data argument to the DataFrame constructor:
In [8]:
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print (df.head())
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
weekly_summary['distance'] = df.distance.resample('D').sum()
print( weekly_summary.head())
speed distance cumulative_distance
2015-01-01 00:00:00 28 7.0 7.0
2015-01-01 00:15:00 8 2.0 9.0
2015-01-01 00:30:00 10 2.5 11.5
2015-01-01 00:45:00 56 14.0 25.5
2015-01-01 01:00:00 6 1.5 27.0
speed distance
2015-01-01 27.895833 669.50
2015-01-02 29.041667 697.00
2015-01-03 27.104167 650.50
2015-01-04 28.427083 682.25
2015-01-05 27.854167 668.50
Here I pass the call to resample as the data argument to the DataFrame constructor; this takes the index and column name from the resampled Series and creates a single-column DataFrame:
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
Subsequent assignments should then work as expected.
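On newer pandas you can also skip the intermediate DataFrame entirely and compute both aggregations in one resample pass; a minimal sketch, assuming the same df as above:
# one resample, a different aggregation per column
weekly_summary = df.resample('D').agg({'speed': 'mean', 'distance': 'sum'})
print(weekly_summary.head())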
Related
I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
# view the datetimes as int64 nanoseconds and add the hour deficit, converted to nanoseconds
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
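A note on the arithmetic above: viewing a datetime64[ns] column as int64 gives nanoseconds since the epoch, so the code adds the hour deficit converted to nanoseconds. A sketch of the same adjustment with pd.to_timedelta, which may read more clearly (same assumptions as above):
# shift out-of-window rows by whole hours instead of raw nanoseconds
early = df.time.dt.hour < 9
late = df.time.dt.hour > 17
df.loc[early, 'time'] = df.time + pd.to_timedelta(9 - df.time.dt.hour, unit='h')
df.loc[late, 'time'] = df.time + pd.to_timedelta(17 - df.time.dt.hour, unit='h')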
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [t.time() for t in pd.to_datetime(df.time)]  # keep only the time-of-day part
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
    'year': df['time'].dt.year, 'month': df['time'].dt.month, 'day': df['time'].dt.day,
    'hour': [17]*len(df.index)}))
df['time'] = [t.time() for t in pd.to_datetime(df['time'])]  # keep only the time-of-day part
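An alternative sketch for the same clamping, using Series.clip with per-row bounds built from each timestamp's own date (this assumes the 'time' column is already datetime, as above):
# clamp each timestamp between 09:00 and 17:00 on its own date
day = df['time'].dt.normalize()  # midnight of each row's date
df['time'] = df['time'].clip(lower=day + pd.Timedelta(hours=9),
                             upper=day + pd.Timedelta(hours=17))
df['time'] = [t.time() for t in df['time']]  # keep only the time-of-day part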
Since your 'time' column contains strings, it can be kept as strings, assigning new string values where appropriate. To filter for your criteria it is convenient to: create a datetime Series from the 'time' column; create boolean Series by comparing the datetime Series against your criteria; then use the boolean Series to select the rows which need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns (grouped so the start-of-string anchor applies to every alternative)
pattern_lt_nine = '^(00|01|02|03|04|05|06|07|08)'
pattern_gt_seventeen = '^(17|18|19|20|21|22|23)'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
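As a side note, zero-padded HH:MM:SS strings sort lexicographically in chronological order, so the same masks can be built with plain string comparisons and no regex at all; a small sketch:
# lexicographic order equals chronological order for zero-padded times
lt_nine = dg['time'] < '09:00:00'
gt_seventeen = dg['time'] >= '17:00:00'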
See also the pandas docs: Time series / date functionality, and Working with text data.
So I am reading in a csv file of a 30 minute timeseries going from "2015-01-01 00:00" up to and including "2020-12-31 23:30". There are five of these timeseries, one per location, each with 105215 rows of 30-minute intervals. My job is to go through and find the timedelta between each row, for each column. It should be 30 minutes for each one, except sometimes it isn't, and I have to find those cases.
So far I'm reading in the data fine via
ca_time = np.array(ca.iloc[0:, 1], dtype= "datetime64")
ny_time = np.array(ny.iloc[0:, 1], dtype = "datetime64")
tx_time = np.array(tx.iloc[0:, 1], dtype = "datetime64")
#I'm then passing these to a pandas dataframe for more convenient manipulation
frame_ca = pd.DataFrame(data = ca_time, dtype = "datetime64[s]")
frame_ny = pd.DataFrame(data = ny_time, dtype = "datetime64[s]")
frame_tx = pd.DataFrame(data = tx_time, dtype = "datetime64[s]")
#Then concatenating them into an array with 100k+ rows, and the five columns represent each location
full_array = pd.concat([frame_ca, frame_ny, frame_tx], axis = 1)
I now want to find the timedelta between each cell for each respective location.
Currently I'm trying this as a simple test:
first_row = full_array2.loc[1:1, :1]
second_row = full_array2.loc[2:2, :1]
delta = first_row - second_row
I'm getting back
0 0 0
1 NaT NaT NaT
2 NaT NaT NaT
This seems simple enough, but I don't know how I'm getting Not a Time (NaT) here.
For reference, below are both those rows I'm trying to subtract
ca ny tx fl az
1 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00
2 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00
Any help appreciated!
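For what it's worth, the NaT values come from index alignment: first_row has index {1} and second_row has index {2}, so the subtraction pairs nothing up and pandas fills with NaT. A minimal sketch of a row-to-row approach, assuming the full_array built above:
# diff() subtracts each row from the previous one within each column, with no alignment issues
deltas = full_array.diff().iloc[1:]  # the first diff row is NaT, drop it
# rows where any location deviates from the expected 30-minute step
irregular = deltas[(deltas != pd.Timedelta(minutes=30)).any(axis=1)]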
Currently I have two data frames representing Excel spreadsheets. I wish to join the data where the dates are equal. This is a one-to-many join: one spreadsheet has a single date per row, and I need to attach the data from the other, which has multiple rows covering the same date range.
An example:
A                        B
   date        data         date                       data
0  2015-01-01  ...       0  2015-01-01 to 2015-01-02   ...
1  2015-01-02  ...       1  2015-01-01 to 2015-01-02   ...
In this case both rows from A would receive rows 0 and 1 from B, because both dates fall within that range.
I tried using
df3 = pandas.merge(df2, df1, how='right', validate='1:m', left_on='Travel Date/Range', right_on='End')
to accomplish this but received this error.
Traceback (most recent call last):
File "<pyshell#61>", line 1, in <module>
df3 = pandas.merge(df2, df1, how='right', validate='1:m', left_on='Travel Date/Range', right_on='End')
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 61, in merge
validate=validate)
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 555, in __init__
self._maybe_coerce_merge_keys()
File "C:\Users\M199449\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 990, in _maybe_coerce_merge_keys
raise ValueError(msg)
ValueError: You are trying to merge on object and datetime64[ns] columns. If you wish to proceed you should use pd.concat
I can add more information as needed of course
So here's the option with merging:
Assume you have two DataFrames:
import pandas as pd
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'],
'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'],
'data': ['E', 'F', 'G']})
Now do some cleaning to get all of the dates you need and make sure they are datetime
df1['date'] = pd.to_datetime(df1.date)
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
# No need for this anymore
df2 = df2.drop(columns='date')
Now merge it all together on a dummy key. This is a full cross join, so you get len(df1) × len(df2) rows (99 × 10K with your real data):
df = df1.assign(dummy=1).merge(df2.assign(dummy=1), on='dummy').drop(columns='dummy')
And subset to the dates that fall in between the ranges:
df[(df.date >= df.start) & (df.date <= df.end)]
# date data_x data_y start end
#0 2015-01-01 A E 2015-01-01 2015-01-02
#1 2015-01-01 A F 2015-01-01 2015-01-02
#3 2015-01-02 B E 2015-01-01 2015-01-02
#4 2015-01-02 B F 2015-01-01 2015-01-02
#5 2015-01-02 B G 2015-01-02 2015-01-03
#8 2015-01-03 C G 2015-01-02 2015-01-03
If, for instance, some dates in df2 were a single date, then since we're using .str.split we would get None for the second date. Just use .loc to fill the end appropriately:
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03',
'2015-01-03'],
'data': ['E', 'F', 'G', 'H']})
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2.loc[df2.end.isnull(), 'end'] = df2.loc[df2.end.isnull(), 'start']
# data start end
#0 E 2015-01-01 2015-01-02
#1 F 2015-01-01 2015-01-02
#2 G 2015-01-02 2015-01-03
#3 H 2015-01-03 2015-01-03
Now the rest follows unchanged
Let's use this numpy method by #piRSquared:
df1 = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03'],
'data': ['A', 'B', 'C']})
df2 = pd.DataFrame({'date': ['2015-01-01 to 2015-01-02', '2015-01-01 to 2015-01-02', '2015-01-02 to 2015-01-03'],
'data': ['E', 'F', 'G']})
df2[['start', 'end']] = df2['date'].str.split(' to ', expand=True)
df2['start'] = pd.to_datetime(df2.start)
df2['end'] = pd.to_datetime(df2.end)
df1['date'] = pd.to_datetime(df1['date'])
a = df1['date'].values    # the dates to test
bl = df2['start'].values  # interval lower bounds
bh = df2['end'].values    # interval upper bounds
# broadcast the dates against both bounds: (i, j) pairs where df1 row i falls inside df2 row j's range
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.DataFrame(np.column_stack([df1.values[i], df2.values[j]]),
columns=df1.columns.append(df2.columns))
Output:
date data date data start end
0 2015-01-01 00:00:00 A 2015-01-01 to 2015-01-02 E 2015-01-01 00:00:00 2015-01-02 00:00:00
1 2015-01-01 00:00:00 A 2015-01-01 to 2015-01-02 F 2015-01-01 00:00:00 2015-01-02 00:00:00
2 2015-01-02 00:00:00 B 2015-01-01 to 2015-01-02 E 2015-01-01 00:00:00 2015-01-02 00:00:00
3 2015-01-02 00:00:00 B 2015-01-01 to 2015-01-02 F 2015-01-01 00:00:00 2015-01-02 00:00:00
4 2015-01-02 00:00:00 B 2015-01-02 to 2015-01-03 G 2015-01-02 00:00:00 2015-01-03 00:00:00
5 2015-01-03 00:00:00 C 2015-01-02 to 2015-01-03 G 2015-01-02 00:00:00 2015-01-03 00:00:00
I have a dataframe like
import pandas as pd
import numpy as np
range = pd.date_range('2015-01-01', '2015-01-5', freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['otherF'] = np.random.randint(low=2, high=42, size=len(df.index))
I can easily resample and apply a builtin like sum():
df['speed'].resample('1D').sum()
Out[121]:
2015-01-01 2865
2015-01-02 2923
2015-01-03 2947
2015-01-04 2751
I can also apply a custom function returning multiple values:
def mu_cis(x):
    x_ = x[~np.isnan(x)]
    CI = np.std(x_)/np.sqrt(x.shape)
    return np.mean(x_), np.mean(x_)-CI, np.mean(x_)+CI, len(x_)
df['speed'].resample('1D').agg(mu_cis)
Out[122]:
2015-01-01 (29.84375, [28.1098628611], [31.5776371389], 96)
2015-01-02 (30.4479166667, [28.7806726396], [32.115160693...
2015-01-03 (30.6979166667, [29.0182072972], [32.377626036...
2015-01-04 (28.65625, [26.965228204], [30.347271796], 96)
As I have read in "pandas apply function that returns multiple values to rows in pandas dataframe", I can even return multiple values with names:
def myfunc1(x):
    x_ = x[~np.isnan(x)]
    CI = np.std(x_)/np.sqrt(x.shape)
    e = np.mean(x_)
    f = np.mean(x_)+CI
    g = np.mean(x_)-CI
    return pd.Series([e, f, g], index=['MU', 'MU+', 'MU-'])
df['speed'].resample('1D').agg(myfunc1)
which gives
Out[124]:
2015-01-01 MU 29.8438
MU+ [31.5776371389]
MU- [28.1098628611]
2015-01-02 MU 30.4479
MU+ [32.1151606937]
MU- [28.7806726396]
2015-01-03 MU 30.6979
MU+ [32.3776260361]
MU- [29.0182072972]
2015-01-04 MU 28.6562
MU+ [30.347271796]
MU- [26.965228204]
However, when I try to apply this to all the original columns, I only get NaNs:
df.resample('1D').agg(myfunc1)
Out[127]:
speed otherF
2015-01-01 NaN NaN
2015-01-02 NaN NaN
2015-01-03 NaN NaN
2015-01-04 NaN NaN
2015-01-05 NaN NaN
Results do not change using agg or apply after the resample().
What is the right way to do this?
The problem is in myfunc1: it always tries to return a pd.Series, while here x is a pd.DataFrame. The following seems to work just fine.
def myfunc1(x):
    x_ = x[~np.isnan(x)]
    CI = np.std(x_)/np.sqrt(x.shape)
    e = np.mean(x_)
    f = np.mean(x_)+CI
    g = np.mean(x_)-CI
    try:
        return pd.DataFrame([e, f, g], index=['MU', 'MU+', 'MU-'], columns=x.columns)
    except AttributeError:  # a Series has no .columns; errors of other kinds still raise
        return pd.Series([e, f, g], index=['MU', 'MU+', 'MU-'])
Alternatively:
def myfunc1(x):
    x_ = x[~np.isnan(x)]
    CI = np.std(x_)/np.sqrt(x.shape)
    e = np.mean(x_)
    f = np.mean(x_)+CI
    g = np.mean(x_)-CI
    if x.ndim > 1:  # equivalent to len(x.shape) > 1
        return pd.DataFrame([e, f, g], index=['MU', 'MU+', 'MU-'], columns=x.columns)
    return pd.Series([e, f, g], index=['MU', 'MU+', 'MU-'])
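If per-column summary statistics are all that is needed, a list of builtin aggregations sidesteps the Series/DataFrame branching entirely; a sketch that swaps the confidence-interval arithmetic for plain named aggregations:
# builtin aggregations broadcast over every column, one MultiIndex column level per stat
df.resample('1D').agg(['mean', 'std', 'count'])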
I would like to retrieve the sampling frequency of a dataframe say as an integer in microseconds, or a float in seconds.
I found the following to work
import pandas as pd
(pd.datetime(1,1,1) + data_frame.index.freq - pd.datetime(1,1,1)).total_seconds()
but somehow I think there might be a less cumbersome way of doing it…
You might want to use pd.Timedelta.
import pandas as pd
import numpy as np
# your dataframe with some unknown freq
# ====================================
df = pd.DataFrame(np.random.randn(100), columns=['col'], index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='20ms'))
Out[263]:
col
2015-01-01 00:00:00.000 0.8647
2015-01-01 00:00:00.020 -0.2269
2015-01-01 00:00:00.040 0.8112
2015-01-01 00:00:00.060 0.2878
2015-01-01 00:00:00.080 -0.5385
2015-01-01 00:00:00.100 1.9085
2015-01-01 00:00:00.120 -0.4758
2015-01-01 00:00:00.140 1.4407
2015-01-01 00:00:00.160 -1.1491
2015-01-01 00:00:00.180 0.8057
... ...
2015-01-01 00:00:01.800 -0.6615
2015-01-01 00:00:01.820 0.7059
2015-01-01 00:00:01.840 -0.3586
2015-01-01 00:00:01.860 0.7320
2015-01-01 00:00:01.880 -0.0364
2015-01-01 00:00:01.900 0.5889
2015-01-01 00:00:01.920 -0.7796
2015-01-01 00:00:01.940 0.4763
2015-01-01 00:00:01.960 0.8339
2015-01-01 00:00:01.980 1.3138
[100 rows x 1 columns]
# processing using pd.Timedelta()
# =================================
# get the freq in ms
(df.index[1] - df.index[0])/pd.Timedelta('1ms')
Out[262]: 20.0
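Following up on the exact units the question asked for, the resulting Timedelta converts directly; a short usage sketch:
step = df.index[1] - df.index[0]   # a pd.Timedelta
step.total_seconds()               # 0.02  -> float seconds
int(step / pd.Timedelta('1us'))    # 20000 -> integer microseconds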