I have a df, self.meter_readings, where the index is datetime values and there is a column of numbers, as below:
self.meter_readings['PointProduction']
2012-03 7707.443
2012-04 9595.481
2012-05 5923.493
2012-06 4813.446
2012-07 5384.159
2012-08 4108.496
2012-09 6370.271
2012-10 8829.357
2012-11 7495.700
2012-12 13709.940
2013-01 6148.129
2013-02 7249.951
2013-03 6546.819
2013-04 7290.730
2013-05 5056.485
Freq: M, Name: PointProduction, dtype: float64
I want to get the gradient of PointProduction against time, i.e. y = PointProduction, x = time. I'm currently trying to obtain m using a linear regression:
m, c, r, p, std_err = stats.linregress(list(self.meter_readings.index), list(self.meter_readings['PointProduction']))
However I am getting an error:
raise TypeError(other).
This is seemingly due to the x-axis values being timestamps rather than plain numbers.
How can I correct this?
You could try converting each Period in the index to its ordinal: linregress should then work with your freq='M' index.
import pandas as pd
from scipy import stats
data = [
    7707.443,
    9595.481,
    5923.493,
    4813.446,
    5384.159,
    4108.496,
    6370.271,
    8829.357,
    7495.700,
    13709.940,
    6148.129,
    7249.951,
    6546.819,
    7290.730,
    5056.485
]
period_index = pd.period_range(start='2012-03', periods=len(data), freq='M')
df = pd.DataFrame(data=data,
                  index=period_index,
                  columns=['PointProduction'])
# these ordinals are months since the start of the Unix epoch
df['ords'] = [tstamp.ordinal for tstamp in df.index]
m, c, r, p, std_err = stats.linregress(df['ords'],
                                       df['PointProduction'])
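Since the monthly ordinals are consecutive integers, regressing against a plain position index gives the same slope, only the intercept shifts; a minimal sanity check, assuming the df built above:
import numpy as np
m2, c2, r2, p2, se2 = stats.linregress(np.arange(len(df)), df['PointProduction'])
# m2 equals m: both slopes are in production units per month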
Convert the timestamps on the x-axis to epoch time in seconds.
If the indexes are datetime objects, you need to convert them to epoch time. For example, if ts is a datetime object, the following does the conversion:
ts_epoch = int(ts.strftime('%s'))
Here is a piece of code that could work for you, converting the index column into epoch seconds:
import pandas as pd
from datetime import datetime
import numpy as np
rng = pd.date_range('1/1/2011', periods=5, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
t = ts.index
print([int(t[x].strftime('%s')) for x in range(len(t))])
This code is fully working on Python 2.7.
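Note that '%s' is not a standard strftime directive and is platform-dependent (it works on Linux/macOS but not on Windows). A portable sketch for a naive UTC datetime uses the calendar module instead:
import calendar
ts_epoch = calendar.timegm(ts.timetuple())  # seconds since the epoch, treating ts as UTC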
To apply this to your problem, the solution could be the following:
t = self.meter_readings.index
indexes = [int(t[x].strftime('%s')) for x in range(len(t))]
m, c, r, p, std_err = stats.linregress(indexes, list(self.meter_readings['PointProduction']))
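If the index is a DatetimeIndex rather than a PeriodIndex, a vectorised sketch avoids the Python loop entirely (pandas stores timestamps as nanoseconds since the epoch):
indexes = self.meter_readings.index.astype('int64') // 10**9  # nanoseconds -> seconds
m, c, r, p, std_err = stats.linregress(indexes, self.meter_readings['PointProduction'])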
I am trying to do time series analysis on financial data, and I want to perform seasonal decomposition:
from statsmodels.tsa.seasonal import seasonal_decompose
import pandas as pd
import datetime
import pandas_datareader as data
df = data.get_data_yahoo('UGA', start=everSince, end=today)
df_close = df[['Close']]
result = seasonal_decompose(df_close, model='multiplicative')
The error I get is:
You must specify a period or x must be a pandas object with a PeriodIndex or a DatetimeIndex with a freq not set to None
I know I can specify the frequency with df.asfreq(), but financial data do not have a daily frequency (i.e., there is no entry for every single day) since trading runs Monday to Friday and sometimes there are holidays.
How can I apply seasonal_decompose to this kind of data? I have also tried df_close.index = df_close.index.to_period('B'), but that did not work.
An example of the df is:
Close
Date
2008-02-28 49.790001
2008-02-29 49.610001
2008-03-03 49.810001
2008-03-04 47.450001
2008-03-05 49.049999
2008-03-06 49.369999
2008-03-07 50.230000
2008-03-10 50.610001
2008-03-11 50.700001
2008-03-12 50.919998
2008-03-13 49.939999
2008-03-14 50.049999
2008-03-17 46.869999
2008-03-18 48.980000
2008-03-19 47.540001
2008-03-20 48.070000
2008-03-24 48.459999
2008-03-25 49.490002
2008-03-26 50.320000
2008-03-27 50.110001
2008-03-28 50.009998
2008-03-31 48.509998
2008-04-01 48.840000
2008-04-02 51.130001
2008-04-03 50.419998
2008-04-04 50.900002
2008-04-07 51.430000
2008-04-08 50.959999
2008-04-09 51.290001
2008-04-10 51.540001
where indices are of type pandas.core.indexes.datetimes.DatetimeIndex.
Your issue can be solved by:
1. filling the missing date gaps in the dataframe (if you don't have daily data) and setting the corresponding values to 0
2. setting a period/frequency for the target attribute so the seasonality can be modelled:
# import libraries
import numpy as np
import pandas as pd
import datetime as dt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
print(sm.__version__)
# Generate some data
TODAY = dt.date.today()
DATE_SPAN = dt.timedelta(days=107)  # spans the sample date range
ONE_DAY = dt.timedelta(days=1)
# Create pandas dataframe
df = pd.DataFrame({'Date': [TODAY-DATE_SPAN, TODAY-3*ONE_DAY, TODAY], 'Close': [42, 45, 127]})
# Date Close
#0 2021-09-02 42
#1 2021-12-15 45
#2 2021-12-18 127
# Fill the missing dates and relative attribute with 0
r = pd.date_range(start=df.Date.min(), end=df.Date.max())
df = df.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index().dropna()
# Set period/frequency using set_index() dates
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').asfreq('D').dropna()
# Close
#Date
#2021-09-02 42.0
#2021-09-03 0.0
#2021-09-04 0.0
#2021-09-05 0.0
#2021-09-06 0.0
#... ...
#2021-12-14 0.0
#2021-12-15 45.0
#2021-12-16 0.0
#2021-12-17 0.0
#2021-12-18 127.0
# 108 rows × 1 columns
Finally, we can use seasonal_decompose() to decompose the time series into its components:
# inspect frequency attribute
print(df.index.freq) #<Day>
# Reproduce the example for OP and plot output
seasonal_decompose(df, model='additive').plot()
This outputs the standard four-panel decomposition plot (observed, trend, seasonal, residual).
Note: the decomposition does not work with model='multiplicative' here, because the zero-filled values trigger:
ValueError: Multiplicative seasonality is not appropriate for zero and negative values
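If you would rather keep the multiplicative model, an alternative sketch (assuming the question's df_close with its business-day DatetimeIndex) interpolates the weekend/holiday gaps instead of zero-filling them, then passes an explicit period, here 7 for a weekly cycle on the daily grid (the period argument needs statsmodels >= 0.11):
df_daily = df_close.asfreq('D').interpolate()  # upsample to calendar days, interpolate the gaps
result = seasonal_decompose(df_daily, model='multiplicative', period=7)
result.plot()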
I can't seem to understand the difference between <M8[ns] and datetime formats, and how it relates to why these operations do or don't work.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a'] = pd.to_datetime(df['a'])
# ultimate goal is to be able to call df.mean() and see the mean DATE
# but this doesn't seem to work so...
df['a'].mean().strftime('%Y-%m-%d') ### ok this works... I can mess around and concat stuff...
# But why won't this work?
df2 = df.select_dtypes('datetime')
df2.mean() # WONT WORK
df2['a'].mean() # WILL WORK?
What I seem to be running into, unless I am missing something, is the difference between 'datetime' and '<M8[ns]' and how that matters when I'm trying to get the mean date.
You can try passing the numeric_only parameter to the mean() method:
out=df.select_dtypes('datetime').mean(numeric_only=False)
Output of out:
a 2021-06-03 04:48:00
dtype: datetime64[ns]
Note: it will throw an error if the dtype is string.
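A quick way to see that note in action (a sketch, rebuilding the frame from the question without the pd.to_datetime step, so column a stays string-typed):
df_raw = pd.DataFrame({'a': my_dates, 'b': my_numbers})
df_raw.select_dtypes('object').mean(numeric_only=False)  # raises TypeError on the string column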
The mean function you apply is different in each case.
import pandas as pd
import datetime as dt
import numpy as np
my_dates = ['2021-02-03','2021-02-05','2020-12-25', '2021-12-27','2021-12-12']
my_numbers = [100,200,0,400,500]
df = pd.DataFrame({'a':my_dates, 'b':my_numbers})
df['a']=pd.to_datetime(df['a'])
df.mean()
This is the DataFrame mean function, and it works on numeric data only. To see which columns are numeric, do:
df._get_numeric_data()
b
0 100
1 200
2 0
3 400
4 500
But df['a'] is a datetime series.
df['a'].dtype, type(df)
(dtype('<M8[ns]'), pandas.core.frame.DataFrame)
So df['a'].mean() applies a different mean function, one that works on datetime values. That's why df['a'].mean() outputs the mean of the datetime values.
df['a'].mean()
Timestamp('2021-06-03 04:48:00')
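Under the hood this Series mean is effectively an average of the underlying int64 nanosecond values; a minimal sketch (assuming the df from above) reproduces it:
ns = df['a'].astype('int64')           # nanoseconds since the epoch
print(pd.Timestamp(int(ns.mean())))    # Timestamp('2021-06-03 04:48:00')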
Read more here:
difference-between-data-type-datetime64ns-and-m8ns
DataFrame.mean() ignores datetime series #28108
I have a pandas DataFrame with dtype=numpy.datetime64
In the data I want to change
'2011-11-14T00:00:00.000000000'
to:
'2010-11-14T00:00:00.000000000'
or other year. Timedelta is not known, only year number to assign.
This displays the year as an int:
Dates_profit.iloc[50][stock].astype('datetime64[Y]').astype(int)+1970
but I can't assign a value.
Anyone know how to assign year to numpy.datetime64?
Since you're using a DataFrame, consider using pandas.Timestamp.replace:
In [1]: import pandas as pd
In [2]: dates = pd.DatetimeIndex([f'200{i}-0{i+1}-0{i+1}' for i in range(5)])
In [3]: df = pd.DataFrame({'Date': dates})
In [4]: df
Out[4]:
Date
0 2000-01-01
1 2001-02-02
2 2002-03-03
3 2003-04-04
4 2004-05-05
In [5]: df.loc[:, 'Date'] = df['Date'].apply(lambda x: x.replace(year=1999))
In [6]: df
Out[6]:
Date
0 1999-01-01
1 1999-02-02
2 1999-03-03
3 1999-04-04
4 1999-05-05
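If you want to avoid the Python-level apply, a vectorised sketch reassembles the dates from their components with pd.to_datetime (it shares replace's caveat: a Feb 29 source date with a non-leap target year raises an error):
parts = pd.DataFrame({'year': 1999, 'month': df['Date'].dt.month, 'day': df['Date'].dt.day})
df['Date'] = pd.to_datetime(parts)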
numpy.datetime64 objects are hard to work with. To update a value, it is normally easier to convert the date to a standard Python datetime object, do the change and then convert it back to a numpy.datetime64 value again:
import numpy as np
from datetime import datetime
dt64 = np.datetime64('2011-11-14T00:00:00.000000000')
# convert to timestamp:
ts = (dt64 - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
# standard utctime from timestamp
dt = datetime.utcfromtimestamp(ts)
# get the new updated year
dt = dt.replace(year=2010)
# convert back to numpy.datetime64:
dt64 = np.datetime64(dt)
There might be simpler ways, but this works, at least.
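A slightly shorter route to the intermediate datetime object is numpy's own conversion: a second-precision datetime64 turns into a datetime.datetime via .item() (a sketch, valid for dates within the datetime range):
dt64 = np.datetime64('2011-11-14T00:00:00.000000000')
dt = dt64.astype('datetime64[s]').item()  # numpy.datetime64 -> datetime.datetime
dt64 = np.datetime64(dt.replace(year=2010))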
This vectorised solution gives the same result as using pandas to iterate with x.replace(year=n), but on large arrays it is at least 10x faster.
It is important to remember that the year the datetime64 objects are replaced with should be a leap year. Using the Python datetime library, datetime(2012, 2, 29).replace(year=2011) raises an error; here, the function replace_year will simply move 2012-02-29 to 2011-03-01.
I'm using numpy v1.13.1.
import numpy as np
import pandas as pd
def replace_year(x, year):
    """ Year must be a leap year for this to work """
    # Add number of days x is from JAN-01 to year-01-01
    x_year = np.datetime64(str(year) + '-01-01') + (x - x.astype('M8[Y]'))
    # Due to leap years calculate offset of 1 day for those days in non-leap year
    yr_mn = x.astype('M8[Y]') + np.timedelta64(59, 'D')
    leap_day_offset = (yr_mn.astype('M8[M]') - yr_mn.astype('M8[Y]') - 1).astype(int)
    # However, due to days in non-leap years prior March-01,
    # correct for previous step by removing an extra day
    non_leap_yr_beforeMarch1 = (x.astype('M8[D]') - x.astype('M8[Y]')).astype(int) < 59
    non_leap_yr_beforeMarch1 = np.logical_and(non_leap_yr_beforeMarch1, leap_day_offset).astype(int)
    day_offset = np.datetime64('1970') - (leap_day_offset - non_leap_yr_beforeMarch1).astype('M8[D]')
    # Finally, apply the day offset
    x_year = x_year - day_offset
    return x_year
x = np.arange('2012-01-01', '2014-01-01', dtype='datetime64[h]')
x_datetime = pd.to_datetime(x)
x_year = replace_year(x, 1992)
x_datetime = x_datetime.map(lambda x: x.replace(year=1992))
print(x)
print(x_year)
print(x_datetime)
print(np.all(x_datetime.values == x_year))
Hey: I spent several hours trying to do quite a simple thing, but couldn't figure it out.
I have a dataframe with a column, df['Time'], which contains times starting from 0 up to 20 minutes, like this:
1:10,10
1:16,32
3:03,04
The first field is minutes, the second is seconds, and the third is hundredths of a second (only two digits).
Is there a way to automatically transform that column into seconds with Pandas, and without making that column the time index of the series?
I already tried the following but it won't work:
pd.to_datetime(df['Time']).convert('s') # AttributeError: 'Series' object has no attribute 'convert'
If the only way is to parse the time, just point that out and I will prepare a proper, detailed answer to this question; don't waste your time =)
Thank you!
Code:
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame({'Time':['1:10,10', '1:16,32', '3:03,04']})
df['time'] = df.Time.apply(lambda x: datetime.datetime.strptime(x,'%M:%S,%f'))
df['timedelta'] = df.time - datetime.datetime.strptime('00:00,0','%M:%S,%f')
df['secs'] = df['timedelta'].apply(lambda x: x / np.timedelta64(1, 's'))
print(df)
Output:
Time time timedelta secs
0 1:10,10 1900-01-01 00:01:10.100000 00:01:10.100000 70.10
1 1:16,32 1900-01-01 00:01:16.320000 00:01:16.320000 76.32
2 3:03,04 1900-01-01 00:03:03.040000 00:03:03.040000 183.04
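For the non-negative times shown, a vectorised sketch (Python 3) skips datetime entirely: extract the three fields with a regex and do the arithmetic, treating the two fractional digits as hundredths of a second:
parts = df['Time'].str.extract(r'(\d+):(\d+),(\d+)').astype(int)
df['secs'] = parts[0] * 60 + parts[1] + parts[2] / 100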
If you have also negative time deltas:
import pandas as pd
import numpy as np
import datetime
import re
regex = re.compile(r"(?P<minus>-)?((?P<minutes>\d+):)?(?P<seconds>\d+)(,(?P<centiseconds>\d{2}))?")
def parse_time(time_str):
    parts = regex.match(time_str)
    if not parts:
        return
    parts = parts.groupdict()
    time_params = {}
    for (name, param) in parts.items():
        if param and (name != 'minus'):
            time_params[name] = int(param)
    if 'centiseconds' in time_params:
        # two fractional digits are hundredths of a second
        time_params['milliseconds'] = time_params['centiseconds'] * 10
        del time_params['centiseconds']
    return (-1 if parts['minus'] else 1) * datetime.timedelta(**time_params)
df = pd.DataFrame({'Time': ['-1:10,10', '1:16,32', '3:03,04']})
df['timedelta'] = df.Time.apply(parse_time)
df['secs'] = df['timedelta'].apply(lambda x: x / np.timedelta64(1, 's'))
print(df)
Output:
Time timedelta secs
0 -1:10,10 -00:01:10.100000 -70.10
1 1:16,32 00:01:16.320000 76.32
2 3:03,04 00:03:03.040000 183.04
I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
The actual problem now is that the data may start not at 00:30 but at any time of day. Also, the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values necessary to calculate daily sums? (e.g. if there are fewer than 40 values in a single day, then put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
import numpy as np
import pandas as pd
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
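To keep the 40-reading threshold from the question with this approach, a sketch (assuming s from above) computes the sums and counts in one grouping and masks the short days:
g = s.groupby(pd.Grouper(freq='1D'))
d = g.sum().where(g.count() >= 40)  # days with fewer than 40 values become NaN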