I have been learning Python for a few days now and I am having a hard time converting a variable to a time type. I would be grateful if anyone could help me.
The variable is of the type: pandas.core.series.Series
And looks like the following:
2018S1;
2017S2;
2017S1
The idea is that Python recognizes this as time data so that I can plot it and use it in regressions. I have searched the forum and the internet but did not find a similar problem.
Kind regards
It looks like your data consists of years and seasons. For plotting purposes you could use a date (using typical year, month, day) in the middle of the season.
There is a post where someone was determining seasons based on a date; it might give you some ideas: Determine season given timestamp in Python using datetime
For pandas periods, have a look here.
If the last number means month, use pd.Period(pp, freq='M').
If the last number means quarter, use pd.Period(pp, freq='Q').
The following workaround generates a pandas Series which you can use for regressions and more:
import numpy as np
import pandas as pd

A = np.array(['2018S1', '2017S2', '2017S1'])
periods = []
for a in A:
    yr = a[0:4]   # year part, e.g. '2018'
    ss = a[-1]    # season number, e.g. '1'
    pp = yr + '-' + ss
    periods.append(pd.Period(pp, freq='Q'))
ts = pd.Series(np.random.randn(3), periods)
ts
In the case of quarters we get:
2018Q1 0.531245
2017Q1 -0.126469
2017Q1 0.250046
Freq: Q-DEC, dtype: float64
In the case of months we get:
2018-01 0.098571
2017-02 1.407439
2017-01 -0.406087
Freq: M, dtype: float64
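As a vectorized alternative (a minimal sketch, assuming the trailing digit really is a quarter number), the same conversion can be done without a loop by rewriting the strings into pandas' quarter notation:
import pandas as pd

# hypothetical Series of 'YYYYS<n>' strings, as in the question
s = pd.Series(['2018S1', '2017S2', '2017S1'])

# 'S' -> 'Q' turns '2018S1' into '2018Q1', which pandas can parse directly
idx = pd.PeriodIndex(s.str.replace('S', 'Q'), freq='Q')

ts = pd.Series([0.1, -0.2, 0.3], index=idx)
print(ts)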
Related
The question can be reframed as "How to remove daily seasonality from the dataset in Python?"
Please read the following:
I have a time series and have used seasonal_decompose() from statsmodels to remove seasonality from the series. As I have used seasonal_decompose() on "Months" data, I get the seasonality only in months. How do I convert these months into days/dates? Can I use seasonal_decompose() to remove daily seasonality? I tried the option of setting frequency=365, but it raises the following error:
x must have 2 complete cycles requires 730 observations. x only has 24 observation(s)
Snippet of the code:
grp_month = train.append(test).groupby(data['Month']).sum()['Var1']
season_result = seasonal_decompose(grp_month, model='addition', period=12)
This gives me the output:
Month
2018-01-01   -17707.340278
2018-02-01   -49501.548611
2018-03-01   -28172.590278
...
2019-12-01   -13296.173611
As you can see in the table, seasonal_decompose() gives me the monthly seasonality. Is there any way I can get the daily data from this? Or can I convert this into a date-wise series?
Edit:
I tried to remove daily seasonality as follows, but I'm not really sure if this is the way to go.
period = 365
season_mean = data.groupby(data.index % period).transform('mean')
data -= season_mean
print(data.head())
If you want to subtract these values from a daily DataFrame, you should upsample season_result using pandas.DataFrame.resample; this way you will be able to subtract the monthly seasonality from your original series.
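A minimal sketch of that idea (season_result is assumed to come from the monthly seasonal_decompose call above, and daily_series is assumed to be your original series at daily frequency):
import pandas as pd

# monthly seasonal component from the decomposition (month-start index)
monthly_seasonal = season_result.seasonal

# upsample to daily frequency, repeating each month's value across its days
daily_seasonal = monthly_seasonal.resample('D').ffill()

# align with the daily series and subtract the seasonal component;
# the extra ffill covers days after the last month-start label
deseasonalized = daily_series - daily_seasonal.reindex(daily_series.index).ffill()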
I work with a variety of instruments, and one is particularly troublesome in that the exported data is in XLS or XLSX format with multiple pages and multiple columns. I only want some pages and some columns; I have already achieved reading this into pandas.
I want to convert time (see below) into a decimal, in hours. This would be measured from an initial time (in the timestamp data) at the top of the column, so a timedelta is probably the more correct value, in hours. I am only concerned with this column. How do I convert an entire column of data from one format to another?
date/time (absolute time), timestamped format YYYY-MM-DD HH:MM:SS
I have found quite a few answers but they don't seem to apply to this particular case, mostly focusing on individual cells or manually entered small data sets. My thousands of data files each have as many as 500,000 lines so something more automated is preferred. There is no upper limit to the number of hours.
What might be part of the same question (someone asked me): since this is already in a pandas DataFrame, should it be converted before or after being read in?
This might seem an amateur-ish question, and it is. I've avoided writing code for years; now I have to learn to data-wrangle for my job and it's frustrating, so go easy on me.
Going about it the usual way, by trying to adapt most of the solutions I found to a column, I get errors.
This is the code which works:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime  # not used
import time  # not used
import numpy as np  # not used

loc1 = r"path\file.xls"
pd.read_excel(loc1)
filename = Path(loc1).stem
str_1 = filename
df = pd.concat(pd.read_excel(loc1, sheet_name=[3, 4, 5, 6, 7, 8, 9]), ignore_index=False)
***I NEED A CODE TO CONVERT DATESTAMPS TO HOURS (decimal) most likely a form of timedelta***
df.plot(x='Relative Time(h:min:s.ms)',y='Voltage(V)', color='blue')
plt.xlabel("relative time") # This is a specific value
plt.ylabel("voltage (V)")
plt.title(str_1) # filename is used in each sample as a graph title
plt.show()
Image of relevant information (already described above)
You should provide a minimal reproducible example, to help others understand exactly what issues you are facing.
Setup
Reading between the lines, here is a setup that hopefully exemplifies the kind of data you have:
import pandas as pd

vals = pd.Series([
    '2019-10-21 17:22:06',         # absolute date
    '2019-10-21 23:22:06.236',     # absolute date, with milliseconds
    '2019-10-21 12:00:00.236145',  # absolute date, with microseconds
    '5:10:10',                     # timedelta
    '40:10:10.123',                # timedelta, with milliseconds
    '345:10:10.123456',            # timedelta, with microseconds
])
Solution
Now we can use two great tools that pandas offers to quickly convert string Series into Timestamps (pd.to_datetime) and Timedeltas (pd.to_timedelta), for absolute date-times and durations, respectively.
In both cases, we use errors='coerce' to convert what is convertible and leave the rest as NaT.
origin = pd.Timestamp('2019-01-01 00:00:00') # origin for absolute dates
a = pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce') - origin
b = pd.to_timedelta(vals, errors='coerce')
tdelta = a.where(~a.isna(), b)
hours = tdelta.dt.total_seconds() / 3600
With the above:
>>> hours
0 7049.368333
1 7055.368399
2 7044.000066
3 5.169444
4 40.169479
5 345.169479
dtype: float64
Explanation
Let's examine some of the pieces above. a handles absolute date-times. Before subtraction of origin to obtain a Timedelta, it is still a Series of Timestamps:
>>> pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce')
0 2019-10-21 17:22:06.000000
1 2019-10-21 23:22:06.236000
2 2019-10-21 12:00:00.236145
3 NaT
4 NaT
5 NaT
dtype: datetime64[ns]
b handles values that are already expressed as durations:
>>> b
0 NaT
1 NaT
2 NaT
3 0 days 05:10:10
4 1 days 16:10:10.123000
5 14 days 09:10:10.123456
dtype: timedelta64[ns]
tdelta is the merge of the non-NaN values of a and b:
>>> tdelta
0 293 days 17:22:06
1 293 days 23:22:06.236000
2 293 days 12:00:00.236145
3 0 days 05:10:10
4 1 days 16:10:10.123000
5 14 days 09:10:10.123456
dtype: timedelta64[ns]
Of course, you can change your origin to be any particular date of reference.
Addendum
After clarifying comments, it seems that the main issue is how to adapt the solution above (or any similar existing example) to their specific problem.
Using the names seen in the images of the edited question, I would suggest:
# (...)
# df = pd.concat(pd.read_excel(loc1, sheet_name=[3,4,5,6,7,8,9]), ignore_index=False)
# note: if df['Absolute Time'] is still of dtypes str, then do this:
# (adapt format as needed; hard to be sure from the image)
df['Absolute Time'] = pd.to_datetime(
    df['Absolute Time'],
    format='%m.%d.%Y %H:%M:%S.%f',
    errors='coerce')
# origin of time; this may have to be taken over multiple sheets
# if all experiments share an absolute origin
origin = df['Absolute Time'].min()
df['Time in hours'] = (df['Absolute Time'] - origin).dt.total_seconds() / 3600
I am trying to forecast daily profit using time series analysis, but daily profit is not only recorded unevenly; some of the data is also missing.
Raw Data:
Date        Revenue
2020/1/19   10$
2020/1/20   7$
2020/1/25   14$
2020/1/29   18$
2020/2/1    12$
2020/2/2    17$
2020/2/9    28$
The above table is an example of the kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 do not exist. Not only that, say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing from the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onward.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. So my cleaned data will look something like this:
Date                     Revenue
2020/1/16 ~ 2020/1/21    17$
2020/1/22 ~ 2020/1/27    14$
2020/1/28 ~ 2020/2/2     47$
2020/2/3 ~ 2020/2/8      ? (to predict)
After applying this to a time series model, I would like to further predict the profit from 2020/2/9 onward.
This is my general idea, but as a beginner in Python using the pandas library, I have trouble executing my ideas. Could you please show me how to aggregate the profit every 6 days so that the data looks like the above table?
The easiest way is to use the pandas resample function.
Provided you have a DatetimeIndex, resampling to aggregate profits every 6 days is as simple as your_dataframe.resample('6D').sum().
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
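For instance, a minimal sketch with the data from the question (by default the bins start at the first date, 2020/1/19; a later answer shows how to anchor them at 2020/1/16 with base/origin):
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2020/1/19', '2020/1/20', '2020/1/25', '2020/1/29',
                            '2020/2/1', '2020/2/2', '2020/2/9']),
    'Revenue': [10, 7, 14, 18, 12, 17, 28],
})

# resample needs a DatetimeIndex, so set 'Date' as the index first
six_day_totals = df.set_index('Date').resample('6D').sum()
print(six_day_totals)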
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
import pandas as pd

df = pd.DataFrame([['2020/1/19', 10],
                   ['2020/1/20', 7],
                   ['2020/1/25', 14],
                   ['2020/1/29', 18],
                   ['2020/2/1', 12],
                   ['2020/2/2', 17],
                   ['2020/2/9', 28]], columns=['Date', 'Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum for each 6-day period. We're not interested in every day's six-day revenue total, though, only every sixth day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last thing with summary_df is just to format it the way you'd like, so that it clearly states the date range to which each row refers.
summary_df['Start Date'] = summary_df.index - pd.Timedelta('5d')  # each window covers the 6 days ending on the index date
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True, inplace=True)
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
Say I have a dataset at daily scale, but not all days have valid data; in other words, some days are missing from the data. I want to compute the summer season mean from the dataset, and want to remove any month that has fewer than 20 days of valid data.
How do I achieve this (in a Pythonic fashion)?
Say my dataframe (df) is like this:
DATE VAR
1900-01-01 123
1900-01-02 456
1900-01-10 789
...
I know how to compute the count:
df_count = df.resample('MS').count()
I also know how to compute the summer season mean:
df_summer = df.resample('Q-NOV').mean()
You can use df_count to filter out the months which have fewer than 20 days of valid data. After that, compute the summer season mean using your formula.
df_count = df.resample('MS').count()
relevant_month = df_count[df_count >= 20].index
df_summer = df[df.index.isin(relevant_month)].resample('Q-NOV').mean()
I suppose you store the date in the index. If the month or time is stored in a different column, change df.index.isin(relevant_month) to df.columnName.isin(relevant_month).
I also don't know the format of your time column (date or datetime), so you might need to modify the df.index.isin(relevant_month) part accordingly. This is just the general idea.
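To make that idea concrete, here is a minimal sketch, assuming df has a daily DatetimeIndex and a single VAR column as in the question (both sides of the membership test are converted to monthly periods so that daily timestamps match their month):
import pandas as pd

# count the valid (non-missing) days in each month
days_per_month = df['VAR'].resample('MS').count()

# keep only the months with at least 20 days of valid data
good_months = days_per_month[days_per_month >= 20].index.to_period('M')

# drop rows belonging to the discarded months, then take the seasonal means
filtered = df[df.index.to_period('M').isin(good_months)]
df_summer = filtered['VAR'].resample('Q-NOV').mean()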
I am trying to use pandas to compute daily climatology. My code is:
import pandas as pd
import random

dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
rand_data = [int(1000*random.random()) for i in xrange(len(dates))]
cum_data = pd.Series(rand_data, index=dates)
cum_data.to_csv('test.csv', sep="\t")
cum_data is the Series indexed by daily dates from 1st Jan 1950 to 31st Dec 1953. I want to create a new vector of length 365, with the first element containing the average of rand_data for January 1st over 1950, 1951, 1952 and 1953, and so on for the second element...
Any suggestions on how I can do this using pandas?
You can group by the day of the year, and then calculate the mean for these groups:
cum_data.groupby(cum_data.index.dayofyear).mean()
However, you have to be aware of leap years; they will cause problems with this approach. As an alternative, you can also group by the month and the day:
In [13]: cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()
Out[13]:
1 1 462.25
2 631.00
3 615.50
4 496.00
...
12 28 378.25
29 427.75
30 528.50
31 678.50
Length: 366, dtype: float64
Hoping it can be of help, I want to post my solution for getting a climatology series with the same index and length as the original time series.
I use joris' solution to get a "model climatology" of 365/366 elements, then I build my desired series by taking values from this model climatology and the time index from my original time series.
This way, things like leap years are automatically taken care of.
#I start with my time series named 'serData'.
#I apply joris' solution to it, getting a 'model climatology' of length 365 or 366.
serClimModel = serData.groupby([serData.index.month, serData.index.day]).mean()
#Now I build the climatology series, taking values from serClimModel depending on the index of serData.
serClimatology = serClimModel[zip(serData.index.month, serData.index.day)]
#Now serClimatology has a time index like this: [1,1] ... [12,31].
#So, as a final step, I take as time index the one of serData.
serClimatology.index = serData.index
@joris: Thanks, your answer was just what I needed to use pandas to calculate daily climatologies, but you stopped short of the final step: re-mapping the (month, day) index back to an index of day of the year for all years, including leap years, i.e. 1 through 366. So I thought I'd share my solution for other users. 1950 through 1953 is 4 years with one leap year, 1952. Note: since random values are used, each run will give different results.
...
from datetime import date
doy = []
doy_mean = []
doy_size = []
for name, group in cum_data.groupby([cum_data.index.month, cum_data.index.day]):
    (mo, dy) = name
    # Note: can use any leap year here.
    yrday = (date(1952, mo, dy)).timetuple().tm_yday
    doy.append(yrday)
    doy_mean.append(group.mean())
    doy_size.append(group.count())
    # Note: useful climatology stats are also available via group.describe() returned as dict
    # desc = group.describe()
    # desc["mean"], desc["min"], desc["max"], std, quartiles, etc.
    # we lose the counts here.
new_cum_data = pd.Series(doy_mean, index=doy)
print new_cum_data.ix[366]
>> 634.5
pd_dict = {}
pd_dict["mean"] = doy_mean
pd_dict["size"] = doy_size
cum_data_df = pd.DataFrame(data=pd_dict, index=doy)
print cum_data_df.ix[366]
>> mean 634.5
>> size 4.0
>> Name: 366, dtype: float64
# and just to check Feb 29
print cum_data_df.ix[60]
>> mean 343
>> size 1
>> Name: 60, dtype: float64
Grouping by month and day is a good solution. However, the neater groupby(dayofyear) approach is still possible if you use an xarray CFTimeIndex instead of a pandas.DatetimeIndex, i.e.:
Delete Feb 29 by using
rand_data = rand_data[~((rand_data.index.month == 2) & (rand_data.index.day == 29))]
Replace the index of the above data with an xarray CFTimeIndex, i.e.:
index = xarray.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar='noleap')
index = index[~((index.month == 2) & (index.day == 29))]
rand_data['time'] = index
Now, for both non-leap and leap years, the 60th day of year is March 1st and the total number of days in a year is 365, so grouping by dayofyear gives the correct climatological daily mean.
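A minimal, self-contained sketch of that approach (names are illustrative, using the 1950-1953 daily range from the question; requires the cftime package):
import numpy as np
import pandas as pd
import xarray as xr

# daily data on the standard (leap-aware) calendar
dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
values = np.random.rand(len(dates))

# drop Feb 29 so the values line up with a 365-day calendar
keep = ~((dates.month == 2) & (dates.day == 29))
values = values[keep]

# matching no-leap (365-day) time axis
times = xr.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar='noleap')

da = xr.DataArray(values, coords={'time': times}, dims='time')

# exactly 365 groups; March 1st is day 60 in every year
climatology = da.groupby('time.dayofyear').mean()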