DatetimeIndex cannot perform the operation median for pandas series - python

I get the error "DatetimeIndex cannot perform the operation median" when computing the median of a series.
Is there a suggested workaround? Thanks.
Repro code is below.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': pd.date_range("2012", periods=3, freq='D')})
df['a'].median()
...
TypeError: DatetimeIndex cannot perform the operation median

One workaround is to convert the column to native Unix-time integers, take the median, and convert back to a datetime:
df = pd.DataFrame({'a': pd.date_range("2012", periods=3, freq='D')})
m = np.median(df['a'].to_numpy().astype(np.int64))
print (pd.Timestamp(m))
2012-01-02 00:00:00
Detail:
print (df['a'].to_numpy().astype(np.int64))
[1325376000000000000 1325462400000000000 1325548800000000000]
Another idea, thanks to #cs95:
print (pd.Timestamp(df['a'].astype(np.int64).median()))
2012-01-02 00:00:00
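For reuse, the int64 round-trip can be wrapped in a small helper; a minimal sketch (the name `datetime_median` is just for illustration, and newer pandas versions can compute `Series.median()` on datetimes directly):

```python
import numpy as np
import pandas as pd

def datetime_median(s):
    """Median of a datetime64 Series via its nanosecond integers."""
    ns = s.to_numpy().astype(np.int64)   # datetimes -> ns since the epoch
    return pd.Timestamp(int(np.median(ns)))

s = pd.Series(pd.date_range("2012", periods=3, freq="D"))
print(datetime_median(s))  # 2012-01-02 00:00:00
```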

Related

join two dataframes after a specific column but showing no index

I am converting my dates into MATLAB datenum values and saving them to a text file. Then I read that CSV and try to add those values to an existing dataframe after the id column. I can add the column at the end or as the first column, but I can't add it after a specific column. The number of rows is equal in both dataframes. It throws the following error:
KeyError: 'datenum'
Here is what I am doing:
import pandas as pd
import datetime
import numpy as np
from datetime import datetime as dt
from datetime import timedelta
import os
df = pd.read_csv(r'C:\example.csv')

def datenum(d):
    return 366 + d.toordinal() + (d - dt.fromordinal(d.toordinal())).total_seconds()/(24*60*60)

d = dt.strptime('2021-01-01 00:15:00', '%Y-%m-%d %H:%M:%S')
column = df['date']
new_column = [dt.strptime(i, '%Y-%m-%d %H:%M:%S') for i in column]
end_column = [datenum(i) for i in new_column]
for i in end_column:
    print(i)
df['Datetime'] = pd.to_datetime(df['date'])
df[df['Datetime'].diff() > pd.Timedelta('15min')]
np.savetxt('timePminus.txt', end_column, fmt='% 1.5f')
After that I read the CSV back:
import pandas as pd
import datetime
import numpy as np
from datetime import datetime as dt
from datetime import timedelta
import os
df = pd.read_csv(r'example.csv')
df1 = pd.read_csv(r'time.csv')
df2 = pd.concat([df, df1], axis=1, join='outer')
print(df2)
print(df2.get('datenum'))
df3 = pd.DataFrame().assign(B=df2['B'], A=df2['A'], datenum=df2['datenum'], D=df2['D'])
print(df3)
It throws KeyError: 'datenum'.
Here is the dataframe/CSV:
A,B,C
2021-01-02 00:15:00,"43289,95698800",236985
2021-01-01 00:30:00,"425962,555555",236985
2021-01-01 00:45:00,"2368,56980000",236985
2021-01-01 01:00:00,"2368,56980000",236985
2021-01-15 01:15:00,"2368,56980000",236985
2021-05-01 01:30:00,"2368,56980000",236985
If I do
print(df2.get('datenum'))
the output is None.
The 2nd dataframe:
datenum
738157.01042
738157.02083
738157.03125
738157.04167
738157.05208
738157.06250
Can someone please tell me what is wrong? I have been trying for hours.
Thanks in advance.
If you just want to re-arrange your dataframe columns after the concat, you can do:
column_order = ['id', 'datenum', 'A', 'B', 'C']
df = df[column_order]
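To insert the new column directly after a specific column rather than re-listing every name, DataFrame.insert can also be used; a minimal sketch with hypothetical stand-in frames (the column names id, A, B are assumptions, not from the original CSVs):

```python
import pandas as pd

# hypothetical stand-ins for example.csv and time.csv
df = pd.DataFrame({'id': [1, 2, 3], 'A': ['x', 'y', 'z'], 'B': [10, 20, 30]})
df1 = pd.DataFrame({'datenum': [738157.01042, 738157.02083, 738157.03125]})

# insert datenum right after the 'id' column; both frames must have the
# same length and row order
pos = df.columns.get_loc('id') + 1
df.insert(pos, 'datenum', df1['datenum'].to_numpy())
print(df.columns.tolist())  # ['id', 'datenum', 'A', 'B']
```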

getting mean values of dates in pandas dataframe

I can't seem to understand the difference between <M8[ns] and datetime formats, and how these operations relate to why this does or doesn't work.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a'] = pd.to_datetime(df['a'])
# ultimate goal is to be able to go. * df.mean() * and be able to see mean DATE
# but this doesn't seem to work so...
df['a'].mean().strftime('%Y-%m-%d') ### ok this works... I can mess around and concat stuff...
# But why won't this work?
df2 = df.select_dtypes('datetime')
df2.mean() # WONT WORK
df2['a'].mean() # WILL WORK?
What I seem to be running into, unless I am missing something, is the difference between 'datetime' and '<M8[ns]' and how that matters when I'm trying to get the mean date.
You can try passing the numeric_only parameter to the mean() method:
out=df.select_dtypes('datetime').mean(numeric_only=False)
output of out:
a 2021-06-03 04:48:00
dtype: datetime64[ns]
Note: it will throw an error if the dtype is string.
The mean function you apply is different in each case.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a']=pd.to_datetime(df['a'])
df.mean()
This is the DataFrame mean function, and it works on numeric data only. To see which columns are numeric, do:
df._get_numeric_data()
b
0 100
1 200
2 0
3 400
4 500
But df['a'] is a datetime series.
df['a'].dtype, type(df)
(dtype('<M8[ns]'), pandas.core.frame.DataFrame)
So df['a'].mean() applies a different mean function, one that works on datetime values. That's why df['a'].mean() outputs the mean of the datetime values.
df['a'].mean()
Timestamp('2021-06-03 04:48:00')
read more here:
difference-between-data-type-datetime64ns-and-m8ns
DataFrame.mean() ignores datetime series
#28108
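The two behaviors can be seen side by side; a short sketch assuming a recent pandas (1.1 or later), where numeric_only=False makes the frame-level mean include datetime columns:

```python
import pandas as pd

my_dates = ['2021-02-03', '2021-02-05', '2020-12-25', '2021-12-27', '2021-12-12']
my_numbers = [100, 200, 0, 400, 500]
df = pd.DataFrame({'a': pd.to_datetime(my_dates), 'b': my_numbers})

# the Series-level reduction handles datetime64 directly
print(df['a'].mean())  # 2021-06-03 04:48:00

# the frame-level reduction needs numeric_only=False to keep column 'a'
out = df.mean(numeric_only=False)
print(out['a'])
```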

Python: How to develop a between_time similar method when on pandas 0.9.0?

I am stuck with pandas 0.9.0 as I'm working under Python 2.5, hence I have no between_time method available.
I have a DataFrame of dates and would like to filter all the dates that are between certain hours, e.g. between 08:00 and 09:00 for all the dates within the DataFrame df.
import pandas as pd
import numpy as np
import datetime
dates = pd.date_range(start="08/01/2009",end="08/01/2012",freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])
How can I develop a method that provides same functionality as between_time method?
N.B.: The original problem I am trying to solve is described in Python: Filter DataFrame in Pandas by hour, day and month grouped by year
UPDATE:
Try:
df.loc[df.index.indexer_between_time('08:00','09:50')]
OLD answer:
I'm not sure it'll work on pandas 0.9.0, but it's worth a try:
df[(df.index.hour >= 8) & (df.index.hour <= 9)]
PS: please be aware this is not the same as between_time, as it only checks hours, while between_time can check times like df.between_time('08:01:15','09:13:28').
Hint: download the source code for a newer version of pandas and take a look at the definition of the indexer_between_time() function in pandas/tseries/index.py; you can clone it for your needs.
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
Here is a NumPy-based way of doing it:
import pandas as pd
import numpy as np
import datetime
dates = pd.date_range(start="08/01/2009",end="08/01/2012",freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1)*1500, index=dates, columns=['Power'])
epoch = np.datetime64('1970-01-01')
start = np.datetime64('1970-01-01 08:00:00')
end = np.datetime64('1970-01-01 09:00:00')
# convert the dates to a NumPy datetime64 array
date_array = df.index.asi8.astype('<M8[ns]')
# replace the year/month/day with 1970-01-01
truncated = (date_array - date_array.astype('M8[D]')) + epoch
# compare the hour/minute/seconds etc with `start` and `end`
mask = (start <= truncated) & (truncated <= end)
print(df[mask])
yields
Power
2009-08-01 08:00:00 1007.289466
2009-08-01 08:10:00 770.732422
2009-08-01 08:20:00 617.388909
2009-08-01 08:30:00 1348.384210
...
2012-07-31 08:30:00 999.133350
2012-07-31 08:40:00 1451.500408
2012-07-31 08:50:00 1161.003167
2012-07-31 09:00:00 670.545371
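For finer-than-hour cutoffs on a pandas too old for indexer_between_time, the hour mask shown in the old answer can be extended to minutes since midnight; a sketch assuming inclusive endpoints:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="08/01/2009", end="08/01/2012", freq="10min")
df = pd.DataFrame(np.random.rand(len(dates), 1) * 1500, index=dates, columns=['Power'])

# minutes since midnight for every timestamp in the index
minutes = df.index.hour * 60 + df.index.minute

# rows between 08:00 and 09:00 inclusive, like between_time('08:00', '09:00')
start, end = 8 * 60, 9 * 60
subset = df[(minutes >= start) & (minutes <= end)]
print(subset.index.min().time(), subset.index.max().time())  # 08:00:00 09:00:00
```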

Obtain Gradient of values against time stamp python

I have a df, self.meter_readings, where the index is datetime values and there is a column of numbers, as below:
self.meter_readings['PointProduction']
2012-03 7707.443
2012-04 9595.481
2012-05 5923.493
2012-06 4813.446
2012-07 5384.159
2012-08 4108.496
2012-09 6370.271
2012-10 8829.357
2012-11 7495.700
2012-12 13709.940
2013-01 6148.129
2013-02 7249.951
2013-03 6546.819
2013-04 7290.730
2013-05 5056.485
Freq: M, Name: PointProduction, dtype: float64
I want to get the gradient of PointProduction against time. i.e. y=PointProduction x=time. I'm currently trying to obtain m using a linear regression:
m,c,r,x,y = stats.linregress(list(self.meter_readings.index),list(self.meter_readings['PointProduction']))
However I am getting an error:
raise TypeError(other).
This is seemingly due to the x-axis being timestamps as opposed to just numbers.
How can I correct this?
You could try converting each index entry to its ordinal: linregress should then work with your freq='M' index.
import pandas as pd
from scipy import stats
data = [
7707.443,
9595.481,
5923.493,
4813.446,
5384.159,
4108.496,
6370.271,
8829.357,
7495.700,
13709.940,
6148.129,
7249.951,
6546.819,
7290.730,
5056.485
]
period_index = pd.period_range(start='2012-03', periods=len(data), freq='M')
df = pd.DataFrame(data=data,
                  index=period_index,
                  columns=['PointProduction'])
# these ordinals are months since the start of the Unix epoch
df['ords'] = [tstamp.ordinal for tstamp in df.index]
m, c, r, x, y = stats.linregress(list(df.ords),
                                 list(df['PointProduction']))
Another approach: convert the timestamps on the x-axis to epoch time in seconds.
If the index entries are datetime objects, you need to convert them to epoch time. For example, if ts is a datetime object, the following does the conversion:
ts_epoch = int(ts.strftime('%s'))
Here is an example piece of code for converting the index column into epoch seconds:
import pandas as pd
from datetime import datetime
import numpy as np
rng = pd.date_range('1/1/2011', periods=5, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
t = ts.index
print [int(t[x].strftime('%s')) for x in range(len(t)) ]
This code works on Python 2.7.
Applied to your problem, the solution could be the following:
t = self.meter_readings.index
indexes = [int(t[x].strftime('%s')) for x in range(len(t)) ]
m,c,r,x,y = stats.linregress(indexes,list(self.meter_readings['PointProduction']))
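On a modern pandas, a portable alternative to strftime('%s') (which is platform-dependent) is the index's int64 nanosecond view via asi8; a sketch that uses np.polyfit in place of stats.linregress to stay dependency-light, on hypothetical evenly spaced data:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=5, freq='h')
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=rng)

# asi8 views the DatetimeIndex as int64 nanoseconds since the epoch;
# subtract the first stamp to keep the fit well conditioned, then
# integer-divide to get whole seconds
seconds = (ts.index.asi8 - ts.index.asi8[0]) // 10**9

# fit y = m*x + c; stats.linregress accepts the same arrays
m, c = np.polyfit(seconds, ts.to_numpy(), 1)
print(m * 3600)  # slope per hour; 1.0 for this evenly spaced series
```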

Pivoting out Datetimes and then calling an operation in Pandas/Python

I've seen several articles about using datetime and dateutil to convert strings into datetime objects.
However, I can't seem to figure out how to convert a column into datetime objects so I can pivot out that column and perform operations against it.
I have a dataframe as such:
Col1 Col 2
a 1/1/2013
a 1/12/2013
b 1/5/2013
b 4/3/2013 ....etc
What I want is :
pivott = pivot_table( df, rows ='Col1', values='Col2', and then I want to get the range of dates for each value in Col1)
I am not sure how to approach this correctly. Even after using
df['Col2'] = pd.to_datetime(df['Col2'])
I couldn't do operations against the dates since they were still strings...
Any advice?
Use datetime.strptime
import pandas as pd
from datetime import datetime
df = pd.read_csv('somedata.csv')
convertdatetime = lambda d: datetime.strptime(d,'%d/%m/%Y')
converted = df['DATE_TIME_IN_STRING'].apply(convertdatetime)
converted[:10] # you should be getting dtype: datetime64[ns]
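Once Col2 really is datetime64, the per-group date range the question asks for falls out of a plain groupby; a sketch using the sample data from the question (the %m/%d/%Y format is an assumption about how the sample dates are written):

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['a', 'a', 'b', 'b'],
    'Col2': ['1/1/2013', '1/12/2013', '1/5/2013', '4/3/2013'],
})
df['Col2'] = pd.to_datetime(df['Col2'], format='%m/%d/%Y')

# after the conversion the dtype is datetime64[ns], so min/max work per group
span = df.groupby('Col1')['Col2'].agg(lambda s: s.max() - s.min())
print(span['a'], span['b'])  # 11 days 00:00:00 88 days 00:00:00
```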
