I want to use time series with Pandas. I read multiple time series one by one, from a csv file which has the date in the column named "Date" as (YYYY-MM-DD):
Date,Business,Education,Holiday
2005-01-01,6665,8511,86397
2005-02-01,8910,12043,92453
2005-03-01,8834,12720,78846
2005-04-01,8127,11667,52644
2005-05-01,7762,11092,33789
2005-06-01,7652,10898,34245
2005-07-01,7403,12787,42020
2005-08-01,7968,13235,36190
2005-09-01,8345,12141,36038
2005-10-01,8553,12067,41089
2005-11-01,8880,11603,59415
2005-12-01,8331,9175,70736
df = pd.read_csv(csv_file, index_col = 'Date',header=0)
Series_list = df.keys()
The time series can have different frequencies: day, week, month, quarter, year and I want to index the time series according to a frequency I decide before I generate the Arima model. Could someone please explain how can I define the frequency of the series?
stepwise_fit = auto_arima(df[Series_name]....
pandas has a built in function pandas.infer_freq()
import pandas as pd
df = pd.DataFrame({'Date': ['2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01'],
'Date1': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04'],
'Date2': ['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01'],
'Date3': ['2006-01-01', '2006-02-06', '2006-03-11', '2006-04-01']})
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['Date3'] = pd.to_datetime(df['Date3'])
pd.infer_freq(df.Date)
#'MS'
pd.infer_freq(df.Date1)
#'D'
pd.infer_freq(df.Date2)
#'AS-JAN'
Alternatively you could also make use of the datetime functionality of the columns.
df.Date.dt.freq
#'MS'
Of course if your data doesn't actually have a real frequency, then you won't get anything.
pd.infer_freq(df.Date3)
#
The frequency descriptions are docmented under offset-aliases.
Related
i have a csv file with many lines and three column. first column is the unix time, second column the price, and third column represents the volume of the symbol that has been traded at that specific price. what i'm doing is, calculating ohlc for different time frames (e.g. 1h, 4h, 12h, 1d) out of tha csv file. that is working very well by first converting the unix time into datetime
code:
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['date'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('date')
df = df['price'].resample('4h').ohlc()
df.to_csv('file_4h_ohlc.csv')
result:
date,open,high,low,close
2017-05-01 20:00:00,0.757881,1.07,0.650011,1.069999
target:
i wanna now converte the datetime (2017-05-01 20:00:00) back to the unix time (1493658000) within the same file by keeping the ohlc values. or if not possible so, to save into a different file.
thanks a lot for support and sorry if such question has been already answered, but i didnt find it
-hotshot
You can create a new date column instead of overwriting the existing one, so you can re-use it as the index.
import pandas as pd
df = pd.read_csv('file.csv', names=['date', 'price', 'volume'])
df['datestamp'] = pd.to_datetime(df['date'], unit='s')
df = df.set_index('datestamp')
df = df['price'].resample('4h').ohlc()
# Set the index back to the original (after calculating ohlc)
df = df.set_index('date')
# Optional: Drop the datestamp column
df = df.drop(columns=['datestamp'])
df.to_csv('file_4h_ohlc.csv')
Alternatively, you can convert the existing datetime column to a Unix timestamp like so:
df['date'].apply(lambda x : (x - datetime.datetime(1970, 1, 1)).total_seconds())
Is there a way to create a new data frame from a time series with the daily diffence?
This means, suppose that on October 5 I had 5321 counts and on October 6 5331 counts. This represents the difference of 10; what I want is, for example, that my DataFrame shows 10 on October 6.
Here's my code of the raw dataframe:
import pandas as pd
from datetime import datetime, timedelta
url = 'https://raw.githubusercontent.com/mariorz/covid19-mx-time-series/master/data/covid19_confirmed_mx.csv'
df = pd.read_csv(url, index_col=0)
df = df.loc['Colima','18-03-2020':'06-10-2020']
df = pd.DataFrame(df)
df.index = pd.to_datetime(df.index, format='%d-%m-%Y')
df
This is the raw outcome:
Thank you guys!
There's an inbuilt diff function just for these kind of operations:
df['Diff'] = df.Colima.diff()
Yes, you can use the shift method to access the preceding row's value to calculate the difference.
df['difference'] = df.Colima - df.Colima.shift(1)
I am trying to avoid loops as this is just a subset of the bigger dataframe I have which has over 30k rows. All I want to do is create a new column with difference between the date in that row and today's date.
What's the best way to do it?
import pandas as pd
df = pd.DataFrame({'Date' : ['2014-03-27', '2014-03-28', '2014-03-31', '2014-04-01', '2014-04-02', '2014-04-03', '2014-04-04', '2014-04-07','2014-04-08', '2014-04-09'],
})
df['diff'] = (datetime.datetime.now()-pd.to_datetime(df['Date'])).dt.days
df['Date'] = pd.to_date_time(df['Date'])
df['num_days_diff'] = (np.datetime64('today', 'D') - df['Date'])/np.timedelta64(1, 'D')
I have data like this that I want to plot by month and year using matplotlib.
df = pd.DataFrame({'date':['2018-10-01', '2018-10-05', '2018-10-20','2018-10-21','2018-12-06',
'2018-12-16', '2018-12-27', '2019-01-08','2019-01-10','2019-01-11',
'2019-01-12', '2019-01-13', '2019-01-25', '2019-02-01','2019-02-25',
'2019-04-05','2019-05-05','2018-05-07','2019-05-09','2019-05-10'],
'counts':[10,5,6,1,2,
5,7,20,30,8,
9,1,10,12,50,
8,3,10,40,4]})
First, I converted the datetime format, and get the year and month from each date.
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
Then, I tried to do groupby like this.
aggmonth = df.groupby(['year', 'month']).sum()
And I want to visualize it in a barchart or something like that. But as you notice above, there are missing months in between the data. I want those missing months to be filled with 0s. I don't know how to do that in a dataframe like this. Previously, I asked this question about filling missing dates in a period of data. where I converted the dates to period range in month-year format.
by_month = pd.to_datetime(df['date']).dt.to_period('M').value_counts().sort_index()
by_month.index = pd.PeriodIndex(by_month.index)
df_month = by_month.rename_axis('month').reset_index(name='counts')
df_month
idx = pd.period_range(df_month['month'].min(), df_month['month'].max(), freq='M')
s = df_month.set_index('month').reindex(idx, fill_value=0)
s
But when I tried to plot s using matplotlib, it returned an error. It turned out you cannot plot a period data using matplotlib.
So basically I got these two ideas in my head, but both are stuck, and I don't know which one I should keep pursuing to get the result I want.
What is the best way to do this? Thanks.
Convert the date column to pandas datetime series, then use groupby on monthly period and aggregate the data using sum, next use DataFrame.resample on the aggregated dataframe to resample using monthly frequency:
df['date'] = pd.to_datetime(df['date'])
df1 = df.groupby(df['date'].dt.to_period('M')).sum()
df1 = df1.resample('M').asfreq().fillna(0)
Plotting the data:
df1.plot(kind='bar')
Say I'm looking at the Rdataset acme.csv found here. How do I import this with appropriately coarse date? Using parse_dates, it assigns the day to the present day (today being the 18th of July), since no day was specified. Can I make it deal with just month/year like the table is but keep using the date functionality of PANDAS?
import pandas as pd
url = 'http://vincentarelbundock.github.io/Rdatasets/csv/boot/acme.csv'
df = pd.read_csv(url, parse_dates=[1])
df.drop('Unnamed: 0', axis=1, inplace=True)
Don't parse dates in read_csv() but use to_datetime with format
df['month'] = pd.to_datetime(df['month'], format='%m/%y')
or you can use that function in read_csv() using lambda
df = pd.read_csv(url, parse_dates=['month'], date_parser=lambda x:pd.to_datetime(x, format='%m/%y'))
But you always get some day number in datetime.
BTW: In datetime you have always time too, but sometimes pandas doesn't show it.
print df['month'].head()
print df['month'].apply(lambda x:x.time()).head()