I'm creating two supposedly identical date ranges using Pandas and Matplotlib. After converting a numpy.float64 to a Pandas Timestamp I have a 1-minute difference. Why?
import pandas as pd
import matplotlib.dates as mdates
import datetime as dt
dstart = dt.date(2013,12,5)
dend = dt.date(2013,12,10)
d1 = pd.date_range(dstart, dend, freq='H')
d2 = mdates.drange(dstart, dend, dt.timedelta(hours=1))
print(d1[2])
print(pd.Timestamp(mdates.num2date(d2[2])))
And I get this result:
2013-12-05 02:00:00
2013-12-05 02:01:00.504201+00:00
Note also that the lengths of the two ranges are not the same:
>>> len(d1)
121
>>> len(d2)
120
I think this could be considered a bug in mdates.drange, but the error is introduced because you are using dates as input and not datetimes (which is what the docstring says it expects). At least, mdates.drange could check for this, I think.
When using datetimes, it works as expected:
In [50]: dstart = dt.datetime(2013,12,5)
In [51]: dend = dt.datetime(2013,12,10)
In [52]: d1 = pd.date_range(dstart, dend, freq='H')
In [53]: d2 = mdates.drange(dstart, dend, dt.timedelta(hours=1))
In [54]: print(d1[2])
2013-12-05 02:00:00
In [55]: print(pd.Timestamp(mdates.num2date(d2[2])))
2013-12-05 02:00:00+00:00
Notice that the length is still different, because mdates.drange produces a half-open interval (so dend is not included), while pd.date_range produces a closed interval.
The technical explanation of why this fails is that the calculation of the end value of the range in mdates.drange goes wrong when given a date (https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/dates.py#L361). In your case the end value should fall on an hour boundary, but because a date carries no time component, the hours are dropped and a wrong interval is created.
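As a workaround sketch if you want to keep date inputs, you can promote the dates to datetimes before calling mdates.drange (attaching midnight via datetime.combine is my assumption about the intended start time):

import datetime as dt
import matplotlib.dates as mdates

dstart = dt.date(2013, 12, 5)
dend = dt.date(2013, 12, 10)
d2 = mdates.drange(dt.datetime.combine(dstart, dt.time()),
                   dt.datetime.combine(dend, dt.time()),
                   dt.timedelta(hours=1))
print(len(d2))  # 120: half-open interval, so dend itself is not included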
I want to convert a datetime (2014-12-23 00:00:00) into Unix time. I tried it with the datetime function but it didn't work. I have the datetime stamps in an array.
Zeit = np.array(Jahresgang1.ix[:, 'Zeitstempel'])
t = pd.to_datetime(Zeit, unit='s')
unixtime = pd.DataFrame(t)
print(unixtime)
Thanks a lot
I think you can subtract the date 1970-01-01 to create a timedelta and then access the total_seconds attribute (this uses import datetime as dt; the old pd.datetime alias has since been removed from pandas):
In [130]:
s = pd.Series(dt.datetime(2012, 1, 1))
s
Out[130]:
0 2012-01-01
dtype: datetime64[ns]
In [158]:
(s - dt.datetime(1970,1,1)).dt.total_seconds()
Out[158]:
0 1325376000
dtype: float64
To emphasize EdChum's first comment, you can directly get Unix time like this:
import pandas as pd
s = pd.to_datetime(["2014-12-23 00:00:00"])
unix = s.astype("int64")
print(unix)
# Int64Index([1419292800000000000], dtype='int64')
or for a pd.Timestamp:
print(pd.to_datetime("2014-12-23 00:00:00").value)
# 1419292800000000000
Notes
- The output precision is nanoseconds. If you want another unit, divide appropriately, e.g. by 10⁹ to get seconds or by 10⁶ to get milliseconds, as sketched below.
- This assumes the input date/time is UTC, unless a time zone / UTC offset is specified.
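A minimal sketch of that unit conversion:

import pandas as pd

unix_ns = pd.to_datetime("2014-12-23 00:00:00").value
unix_s = unix_ns // 10**9   # 1419292800 (seconds)
unix_ms = unix_ns // 10**6  # 1419292800000 (milliseconds)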
I have a pandas DataFrame with dtype=numpy.datetime64
In the data I want to change
'2011-11-14T00:00:00.000000000'
to:
'2010-11-14T00:00:00.000000000'
or any other year. The timedelta is not known, only the year number to assign.
This displays the year as an int:
Dates_profit.iloc[50][stock].astype('datetime64[Y]').astype(int)+1970
but I can't assign a value.
Does anyone know how to assign a year to a numpy.datetime64?
Since you're using a DataFrame, consider using pandas.Timestamp.replace:
In [1]: import pandas as pd
In [2]: dates = pd.DatetimeIndex([f'200{i}-0{i+1}-0{i+1}' for i in range(5)])
In [3]: df = pd.DataFrame({'Date': dates})
In [4]: df
Out[4]:
Date
0 2000-01-01
1 2001-02-02
2 2002-03-03
3 2003-04-04
4 2004-05-05
In [5]: df.loc[:, 'Date'] = df['Date'].apply(lambda x: x.replace(year=1999))
In [6]: df
Out[6]:
Date
0 1999-01-01
1 1999-02-02
2 1999-03-03
3 1999-04-04
4 1999-05-05
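A hedged vectorised alternative (a sketch, assuming the column contains no Feb-29 entries, since 1999 is not a leap year) is to assemble the column from its date components with pd.to_datetime:

# rebuild each date with the year swapped out; the scalar year broadcasts
df['Date'] = pd.to_datetime({'year': 1999,
                             'month': df['Date'].dt.month,
                             'day': df['Date'].dt.day})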
numpy.datetime64 objects are hard to work with. To update a value, it is normally easier to convert the date to a standard Python datetime object, make the change, and then convert it back to a numpy.datetime64 value:
import numpy as np
from datetime import datetime
dt64 = np.datetime64('2011-11-14T00:00:00.000000000')
# convert to timestamp:
ts = (dt64 - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
# standard utctime from timestamp
dt = datetime.utcfromtimestamp(ts)
# get the new updated year
dt = dt.replace(year=2010)
# convert back to numpy.datetime64:
dt64 = np.datetime64(dt)
There might be simpler ways, but this works, at least.
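As a shorter (hedged) variant, you can cast to second precision and let numpy hand back a standard datetime directly:

dt64 = np.datetime64('2011-11-14T00:00:00.000000000')
as_dt = dt64.astype('datetime64[s]').astype(datetime)  # a plain datetime.datetime
dt64 = np.datetime64(as_dt.replace(year=2010))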
This vectorised solution gives the same result as using pandas to iterate over the values with x.replace(year=n), but the speed-up on large arrays is at least 10x.
It is important to remember that the year the datetime64 object is replaced with should be a leap year. Using the Python datetime library, datetime(2012,2,29).replace(year=2011) crashes; here, the function replace_year will simply move 2012-02-29 to 2011-03-01 instead.
I'm using numpy v 1.13.1.
import numpy as np
import pandas as pd

def replace_year(x, year):
    """Year must be a leap year for this to work"""
    # Add number of days x is from JAN-01 to year-01-01
    x_year = np.datetime64(str(year) + '-01-01') + (x - x.astype('M8[Y]'))
    # Due to leap years, calculate an offset of 1 day for days in a non-leap year
    yr_mn = x.astype('M8[Y]') + np.timedelta64(59, 'D')
    leap_day_offset = (yr_mn.astype('M8[M]') - yr_mn.astype('M8[Y]') - 1).astype(int)
    # However, due to days in non-leap years prior to March-01,
    # correct the previous step by removing an extra day
    non_leap_yr_beforeMarch1 = (x.astype('M8[D]') - x.astype('M8[Y]')).astype(int) < 59
    non_leap_yr_beforeMarch1 = np.logical_and(non_leap_yr_beforeMarch1, leap_day_offset).astype(int)
    day_offset = np.datetime64('1970') - (leap_day_offset - non_leap_yr_beforeMarch1).astype('M8[D]')
    # Finally, apply the day offset
    x_year = x_year - day_offset
    return x_year
x = np.arange('2012-01-01', '2014-01-01', dtype='datetime64[h]')
x_datetime = pd.to_datetime(x)
x_year = replace_year(x, 1992)
x_datetime = x_datetime.map(lambda x: x.replace(year=1992))
print(x)
print(x_year)
print(x_datetime)
print(np.all(x_datetime.values == x_year))
I have a df, self.meter_readings, where the index is datetime values and there is a column of numbers, as below:
self.meter_readings['PointProduction']
2012-03 7707.443
2012-04 9595.481
2012-05 5923.493
2012-06 4813.446
2012-07 5384.159
2012-08 4108.496
2012-09 6370.271
2012-10 8829.357
2012-11 7495.700
2012-12 13709.940
2013-01 6148.129
2013-02 7249.951
2013-03 6546.819
2013-04 7290.730
2013-05 5056.485
Freq: M, Name: PointProduction, dtype: float64
I want to get the gradient of PointProduction against time, i.e. y = PointProduction, x = time. I'm currently trying to obtain m using a linear regression:
m,c,r,x,y = stats.linregress(list(self.meter_readings.index),list(self.meter_readings['PointProduction']))
However I am getting an error:
raise TypeError(other).
This is seemingly due to the x-axis values being timestamps as opposed to plain numbers.
How can I correct this?
You could try converting each period in the index to its integer ordinal: linregress should then work with your freq='M' index. (Note that scipy's linregress returns slope, intercept, rvalue, pvalue and stderr, so the names x and y in your unpacking actually receive the p-value and standard error.)
import pandas as pd
from scipy import stats
data = [
7707.443,
9595.481,
5923.493,
4813.446,
5384.159,
4108.496,
6370.271,
8829.357,
7495.700,
13709.940,
6148.129,
7249.951,
6546.819,
7290.730,
5056.485
]
period_index = pd.period_range(start='2012-03', periods=len(data), freq='M')
df = pd.DataFrame(data=data,
                  index=period_index,
                  columns=['PointProduction'])
# these ordinals are months since the start of the Unix epoch
df['ords'] = [period.ordinal for period in df.index]
m, c, r, p, stderr = stats.linregress(list(df.ords),
                                      list(df['PointProduction']))
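If you would rather use true Gregorian ordinals (days since 0001-01-01), a hedged alternative sketch, converting the PeriodIndex to timestamps first:

# Timestamp inherits datetime.toordinal(), so this gives day-based ordinals
ords = [ts.toordinal() for ts in df.index.to_timestamp()]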
Convert the datetime stamps on the x-axis to epoch time in seconds.
If the indexes are datetime objects, you need to convert them to epoch time first. For example, if ts is a datetime object, the following does the conversion:
ts_epoch = int(ts.strftime('%s'))
Here is an example snippet that converts the index column into epoch seconds:
import pandas as pd
from datetime import datetime
import numpy as np
rng = pd.date_range('1/1/2011', periods=5, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
t = ts.index
print([int(t[x].strftime('%s')) for x in range(len(t))])
This code works on Python 2.7 and 3; note, however, that '%s' is a platform-specific strftime extension (Linux/macOS) and is not supported on Windows.
Applied to your problem, the solution could be the following:
t = self.meter_readings.index
indexes = [int(t[x].strftime('%s')) for x in range(len(t))]
m, c, r, p, stderr = stats.linregress(indexes, list(self.meter_readings['PointProduction']))
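A more portable (hedged) variant of the same conversion that avoids the non-standard '%s' directive, assuming the index holds Timestamp/datetime values:

import datetime as dt

# seconds since the Unix epoch via plain timedelta arithmetic
epoch = dt.datetime(1970, 1, 1)
indexes = [int((ts - epoch).total_seconds()) for ts in self.meter_readings.index]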
What is going on here?
I need to generate a dataframe with beginning-of-month dates (1-1-2014 to 12-1-2014). FWIW, I use the fcast_yr variable elsewhere, where I need the end of month, hence the date math.
from pandas.tseries.offsets import *
fcast_yr=pd.to_datetime('2014-12-31')
x=(fcast_yr + pd.DateOffset(days= -30)) # to set x to 2014-12-01
d=pd.date_range((x +pd.DateOffset(months=-10)), periods=12, freq='MS') #"MS" means start of month!!
print(d.values)
This gives these end-of-month values... yech!!
['2014-01-31T18:00:00.000000000-0600' '2014-02-28T18:00:00.000000000-0600'
'2014-03-31T19:00:00.000000000-0500' '2014-04-30T19:00:00.000000000-0500'
'2014-05-31T19:00:00.000000000-0500' '2014-06-30T19:00:00.000000000-0500'
'2014-07-31T19:00:00.000000000-0500' '2014-08-31T19:00:00.000000000-0500'
'2014-09-30T19:00:00.000000000-0500' '2014-10-31T19:00:00.000000000-0500'
'2014-11-30T18:00:00.000000000-0600' '2014-12-31T18:00:00.000000000-0600']
Using 0.13.0 of Pandas.
You don't need to coerce the timestamp to the beginning of the month; the frequency will do it (but your answer is correct).
The 'values' are just the way numpy represents dates (they are UTC).
In [8]: pd.date_range((pd.Timestamp('20141231') + pd.DateOffset(months=-11)), periods=12, freq='MS')
Out[8]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-02-01, ..., 2015-01-01]
Length: 12, Freq: MS, Timezone: None
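For the range the question actually wants (January through December 2014), a minimal sketch is simply:

import pandas as pd

# freq='MS' snaps directly to month starts, no day arithmetic needed
d = pd.date_range('2014-01-01', periods=12, freq='MS')
print(d)  # month starts from 2014-01-01 through 2014-12-01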
I'm trying to calculate daily sums of values using pandas. Here's the test file: http://pastebin.com/uSDfVkTS
This is the code I came up with so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may start not at 00:30 of a day but at any time, and the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values that are necessary to calculate daily sums? (e.g. if there are fewer than 40 values in a single day, put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
import numpy as np
import pandas as pd

s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
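To also apply the 40-value threshold from the question, a hedged sketch on top of the same grouping:

import numpy as np

g = s.groupby(pd.Grouper(freq='1D'))
# keep each daily sum only where at least 40 non-NaN values contributed
d = g.sum().where(g.count() >= 40, np.nan)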