I have a DataFrame with a few time series:
divida movav12 var varmovav12
Date
2004-01 0 NaN NaN NaN
2004-02 0 NaN NaN NaN
2004-03 0 NaN NaN NaN
2004-04 34 NaN inf NaN
2004-05 30 NaN -0.117647 NaN
2004-06 44 NaN 0.466667 NaN
2004-07 35 NaN -0.204545 NaN
2004-08 31 NaN -0.114286 NaN
2004-09 30 NaN -0.032258 NaN
2004-10 24 NaN -0.200000 NaN
2004-11 41 NaN 0.708333 NaN
2004-12 29 24.833333 -0.292683 NaN
2005-01 31 27.416667 0.068966 0.104027
2005-02 28 29.750000 -0.096774 0.085106
2005-03 27 32.000000 -0.035714 0.075630
2005-04 30 31.666667 0.111111 -0.010417
2005-05 31 31.750000 0.033333 0.002632
2005-06 39 31.333333 0.258065 -0.013123
2005-07 36 31.416667 -0.076923 0.002660
I want to decompose the first time series divida in a way that I can separate its trend from its seasonal and residual components.
I found an answer here, and am trying to use the following code:
import statsmodels.api as sm
s=sm.tsa.seasonal_decompose(divida.divida)
However I keep getting this error:
Traceback (most recent call last):
File "/Users/Pred_UnBR_Mod2.py", line 78, in <module> s=sm.tsa.seasonal_decompose(divida.divida)
File "/Library/Python/2.7/site-packages/statsmodels/tsa/seasonal.py", line 58, in seasonal_decompose _pandas_wrapper, pfreq = _maybe_get_pandas_wrapper_freq(x)
File "/Library/Python/2.7/site-packages/statsmodels/tsa/filters/_utils.py", line 46, in _maybe_get_pandas_wrapper_freq
freq = index.inferred_freq
AttributeError: 'Index' object has no attribute 'inferred_freq'
How can I proceed?
Works fine when you convert your index to a DatetimeIndex:
df.reset_index(inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
s=sm.tsa.seasonal_decompose(df.divida)
<statsmodels.tsa.seasonal.DecomposeResult object at 0x110ec3710>
Access the components via:
s.resid
s.seasonal
s.trend
statsmodels will decompose the series only if it knows the frequency. A time series index usually carries a frequency (daily, business days, weekly, and so on); yours does not, so it raises the error. You can remove this error in two ways:
What Stefan did was pass the index column to pandas' to_datetime function. The resulting DatetimeIndex exposes inferred_freq, which pandas works out internally with infer_freq, so the frequency can be found from the index.
Otherwise, you can set the frequency on your index yourself, as df.index.asfreq(freq='m'). Here m represents month. You can set the frequency if you have domain knowledge or by d.
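To illustrate both options, here is a minimal sketch using pandas only. The series below is a hypothetical stand-in for the question's data (month strings as a plain index), and the "MS" (month start) alias is my assumption about how such labels are parsed, not something taken from the answers above.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: month strings as a plain index.
months = ["2004-01", "2004-02", "2004-03", "2004-04", "2004-05", "2004-06"]
s = pd.Series([0, 0, 0, 34, 30, 44], index=months, name="divida")

# Option 1: convert the string labels to timestamps so the spacing can be inferred.
s.index = pd.to_datetime(s.index)
inferred = s.index.inferred_freq  # month-start spacing for this data

# Option 2: state the frequency explicitly on the series.
s_monthly = s.asfreq("MS")
explicit = s_monthly.index.freqstr
```

Either series now has a datetime index with a monthly frequency, which is the kind of index the decomposition call in the accepted answer relies on.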
It depends on the index format. You can have a DatetimeIndex or a PeriodIndex. Stefan presented the example for DatetimeIndex; here is my example for PeriodIndex.
My original DataFrame has a MultiIndex index with year in first level and month in second level. Here is how I convert it to PeriodIndex:
df["date"] = pd.PeriodIndex(df.index.map(lambda x: "{0}{1:02d}".format(*x)), freq="M")
df = df.set_index("date")
Now it is ready to be used by seasonal_decompose.
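For anyone who wants to try this on toy data, here is a small self-contained sketch of the same idea. The frame and its values are made up, and I format the labels as "YYYY-MM" (with a dash) purely as an assumption to keep the parsing unambiguous.

```python
import pandas as pd

# Hypothetical frame indexed by (year, month), as described above.
idx = pd.MultiIndex.from_tuples(
    [(2004, 1), (2004, 2), (2004, 3)], names=["year", "month"]
)
df = pd.DataFrame({"divida": [0, 0, 34]}, index=idx)

# Build "YYYY-MM" labels and let PeriodIndex parse them as monthly periods.
labels = df.index.map(lambda ym: "{0}-{1:02d}".format(*ym))
df.index = pd.PeriodIndex(labels, freq="M", name="date")
```

After this, df.index is a monthly PeriodIndex, so the frequency is stored on the index itself rather than inferred.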
Make it simple:
Follow three steps:
1. If not already done, put the date column in yyyy-mm-dd or dd-mm-yyyy format (for example, using Excel).
2. Then convert it to a datetime using pandas:
df['Date'] = pd.to_datetime(df['Date'])
3. Decompose it using:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition=seasonal_decompose(ts_log)
And finally:
Try parsing the date column using parse_dates, and then specify the index column.
from statsmodels.tsa.seasonal import seasonal_decompose
data=pd.read_csv(airline,header=0,squeeze=True,index_col=[0],parse_dates=[0])
res=seasonal_decompose(data)
I am trying to drop intervals of rows in my Dataframes from the maximal value (exclusive) to the rest (end) of the column. Here is an example of one of the column of my df (dflist['time']):
0 0.000000
1 0.021528
2 0.042135
3 0.062925
4 0.083498
...
88 1.796302
89 1.816918
90 1.837118
91 1.857405
92 1.878976
Name: time, Length: 93, dtype: float64
I have tried to use .iloc and .drop in conjunction with .index to achieve this result, but without any success so far:
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        v_max = dflist['velocity'].max()
        v_max_idx = dflist['velocity'].index[dflist['velocity'] == v_max]
        dflist['time'] = dflist['time'].iloc[0:[v_max_idx]]
I have also tried several variations, like converting 'v_max_idx' to a list with .list or to an int with .int to change the type passed to .iloc, as that seems to be the problem:
TypeError: cannot do positional indexing on RangeIndex with these indexers [[Int64Index([15], dtype='int64')]] of type list
I don't know why I am not able to do this, and it is quite frustrating, as it seems to be a pretty basic operation.
Any help would therefore be greatly appreciated !
##EDIT REGARDING THE dropna() PROBLEM
I tried with .notna() :
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        v_max = dflist['velocity'].max()
        v_max_idx = dflist['velocity'].index[dflist['velocity'] == v_max]
        dflist['velocity'] = dflist['velocity'].iloc[0:list(v_max_idx)[0]]
        dflist['velocity'] = dflist['velocity'][dflist['velocity'].notna()]
        dflist['time'] = dflist['time'].iloc[0:list(v_max_idx)[0]]
        dflist['time'] = dflist['time'][dflist['time'].notna()]
and tried with dropna():
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        v_max = dflist['velocity'].max()
        v_max_idx = dflist['velocity'].index[dflist['velocity'] == v_max]
        dflist['velocity'] = dflist['velocity'].iloc[0:list(v_max_idx)[0]].dropna()
        dflist['time'] = dflist['time'].iloc[0:list(v_max_idx)[0]].dropna()
No error messages, it just doesn't do anything:
19 0.385243 1.272031
20 0.405416 1.329072
21 0.425477 1.352059
22 0.445642 1.349657
23 0.465755 1.378407
24 NaN NaN
25 NaN NaN
26 NaN NaN
27 NaN NaN
28 NaN NaN
29 NaN NaN
30 NaN NaN
31 NaN NaN
32 NaN NaN
33 NaN NaN
34 NaN NaN
35 NaN NaN
36 NaN NaN
In your example, indexing .index with a boolean mask returns another index object, here a pandas.Int64Index.
pandas.DataFrame.iloc accepts inputs such as a slice object with ints, e.g. 1:7.
In your code, neither v_max_idx (a pandas.Index object) nor [v_max_idx] (a list containing one) meets the requirements for a slice bound in iloc.
You can use list(v_max_idx) to convert the pandas.Index object to a list, then use [0] to access the value, like
dflist['time'] = dflist['time'].iloc[0:list(v_max_idx)[0]]
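Regarding the edit about dropna() appearing to do nothing: assigning a shortened Series back to a column aligns it on the frame's index, so the removed rows come back as NaN. A possible way around this is to trim the whole frame at once. The sketch below uses made-up data and assumes you want to keep the row holding the maximum velocity; drop the `+ 1` if you do not.

```python
import pandas as pd

# Hypothetical frame standing in for one of the dflist objects.
df = pd.DataFrame({
    "time": [0.0, 0.1, 0.2, 0.3, 0.4],
    "velocity": [1.0, 1.3, 1.5, 1.2, 0.9],
})

# Label of the first row where velocity is at its maximum.
peak_label = df["velocity"].idxmax()
# Convert the label to an integer position so it can be used with iloc.
peak_pos = df.index.get_loc(peak_label)

# Keep everything up to and including the peak; later rows are dropped.
trimmed = df.iloc[:peak_pos + 1]
```

Inside the nested loops you would then have to store the result back under its key (for example nested_dict[key] = trimmed while iterating over .items()), since rebinding the loop variable alone does not change the dictionary.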
I have a dataframe of price data that looks like the following: (with more than 10,000 columns)
      Unamed: 0   01973JAC3 corp  Unamed: 2   019754AA8 corp  Unamed: 4   01265RTJ7 corp  Unamed: 6   01988PAD0 corp  Unamed: 8   019736AB3 corp
1     2004-04-13  101.1           2008-06-16  99.1            2010-06-14  110.0           2008-06-18  102.1           NaT         NaN
2     2004-04-14  101.2           2008-06-17  100.4           2010-07-05  110.3           2008-06-19  102.6           NaT         NaN
3     2004-04-15  101.6           2008-06-18  100.4           2010-07-12  109.6           2008-06-20  102.5           NaT         NaN
4     2004-04-16  102.8           2008-06-19  100.9           2010-07-19  110.1           2008-06-21  102.6           NaT         NaN
5     2004-04-19  103.0           2008-06-20  101.3           2010-08-16  110.3           2008-06-22  102.8           NaT         NaN
...   ...         ...             ...         ...             ...         ...             ...         ...             NaT         NaN
3431  NaT         NaN             2021-12-30  119.2           NaT         NaN             NaT         NaN             NaT         NaN
3432  NaT         NaN             2021-12-31  119.4           NaT         NaN             NaT         NaN             NaT         NaN
(Those are 9-digit CUSIPs in the header. So every two columns represent date and closed price for a security.)
I would like to
find and get rid of empty pairs of date and price, like "Unamed: 8" and "019736AB3 corp",
then rearrange the dataframe into a panel of monthly close prices, as follows:
Date        01973JAC3  019754AA8  01265RTJ7  01988PAD0
2004-04-30  102.1      NaN        NaN        NaN
2004-05-31  101.2      NaN        NaN        NaN
...         ...        ...        ...        ...
2021-12-30  NaN        119.2      NaN        NaN
2021-12-31  NaN        119.4      NaN        NaN
Edit:
I want to clarify my question.
My dataframe has more than 10,000 columns, which makes it impossible to drop columns by name or rename them one by one. The pairs of date and price start and end at different times, and they differ in length and frequency. I'm looking for an efficient way to arrange them into a less messy form. Thanks.
Here is a sample of 30 columns. https://github.com/txd2x/datastore file name: sample-question2022-01.xlsx
I figured it out: stacking and then reshaping. Thanks for the help.
import pandas as pd

# stack all the (date, price) column pairs into one long frame
frames = []
for i in range(len(price.columns) // 2):
    temp = pd.DataFrame({
        'Date': price.iloc[:-1, 2 * i],             # all rows but the last, as in my original slice
        'ClosedPrice': price.iloc[:-1, 2 * i + 1],
    })
    temp['CUSIP'] = price.columns[2 * i + 1][:9]    # first 9 characters of the header are the CUSIP
    frames.append(temp)
df = pd.concat(frames, ignore_index=True)
df = df.dropna(axis=0, how='any')  # drop NaN rows
df = df.pivot(index='Date', columns='CUSIP', values='ClosedPrice')  # reshape to have Date as index and CUSIPs as column headers
df_monthly = df.resample('M').last()  # last price of each month
If you want to get rid of unneeded columns, run the following:
df.drop("name_of_column", axis=1, inplace=True)
If you want to drop empty rows, use:
df.drop(df.index[row_number], inplace=True)
If you want to rearrange the data by date, convert the column to datetime and then make it the index:
import pandas as pd
df.Date=pd.to_datetime(df.Date)
df = df.set_index('Date')
You will probably want to rename the columns before doing any of the above: df.rename(columns={'first_column': 'first', 'second_column': 'second'}, inplace = True)
Updated01:
If you want to keep just some of those 10,000 columns, say 7 or 10 of them, use df = df[["first_column","second_column", ....]]
If you want to get rid of all empty columns, use df.dropna(axis=1, how = 'all'). The "how" keyword takes two values: "all" drops a row or column only if it is entirely NaN, and "any" drops it if it has at least one NaN.
Update02:
If you have a lot of date columns and want to keep just one of them, and the date column you chose has no NaN values, use the following code:
columns = df.columns.tolist()
for column in columns:
    try:
        # try to parse text columns as dates
        if df[column].dtypes == 'object':
            df[column] = pd.to_datetime(df[column])
        # drop every date column except the one named 'Date'
        if df[column].dtypes == 'datetime64[ns]' and column != 'Date':
            df.drop(column, axis=1, inplace=True)
    except ValueError:
        pass
rearrange the dataframe using months:
import datetime
df.Date=pd.to_datetime(df.Date)
df['Month']=df.Date.dt.month
df['Year']=df.Date.dt.year
df = df.set_index('Month')
df.groupby(["Year","Month"]).mean()
update03:
To combine all date columns while preserving data use the following code:
import pandas as pd
import numpy as np
df=pd.read_excel('sample_question2022-01.xlsx')
columns=df.columns.tolist()
for column in columns:
    if (df[column].isnull().sum()>2300):
        df.drop(column,axis=1,inplace=True)
columns=df.columns.tolist()
import itertools
count_date=itertools.count(1)
count_price=itertools.count(1)
for column in columns:
    if(df[column].dtypes=='datetime64[ns]'):
        df.rename(columns={column:f'date{next(count_date)}'},inplace=True)
    else:
        df.rename(columns={column:f'Price{next(count_price)}'},inplace=True)
columns=df.columns.tolist()
merged=df[[columns[0],columns[1]]].set_index('date1')
k=2
for i in range(2,len(columns)-1,2):
    merged=pd.merge(merged,df[[columns[i],columns[i+1]]].set_index(f'date{k}'),how='outer',left_index=True,right_index=True)
    k+=1
The only problem left is that it throws a MemoryError:
MemoryError: Unable to allocate 27.4 GiB for an array with shape (3677415706,) and data type int64
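One possible way to sidestep the chain of outer merges (which may be what blows up the memory, since repeated or missing dates can multiply rows in a join) is to reduce each date/price pair to one value per month first and only then combine them. This is only a sketch on a tiny made-up frame laid out like the question's data; the column names and values are hypothetical, and the result is indexed by monthly periods rather than month-end dates.

```python
import pandas as pd

# Hypothetical miniature of the wide layout: (date, price) column pairs.
price = pd.DataFrame({
    "Unnamed: 0": pd.to_datetime(["2004-04-13", "2004-04-14", "2004-05-03"]),
    "01973JAC3 corp": [101.1, 101.2, 101.6],
    "Unnamed: 2": pd.to_datetime(["2008-06-16", "2008-06-17", None]),
    "019754AA8 corp": [99.1, 100.4, None],
    "Unnamed: 4": pd.to_datetime([None, None, None]),
    "019736AB3 corp": [None, None, None],
})

monthly = {}
for i in range(0, price.shape[1], 2):
    pair = price.iloc[:, [i, i + 1]].dropna()   # drop rows with a missing date or price
    if pair.empty:                              # an entirely empty pair is skipped
        continue
    cusip = price.columns[i + 1][:9]            # first 9 characters of the header
    s = pd.Series(
        pair.iloc[:, 1].to_numpy(dtype=float),
        index=pd.DatetimeIndex(pair.iloc[:, 0].to_numpy()),
    )
    # Keep the last observed price within each calendar month.
    monthly[cusip] = s.groupby(s.index.to_period("M")).last()

# Align all securities on the union of months; gaps become NaN.
panel = pd.concat(monthly, axis=1).sort_index()
```

Because every Series holds at most one row per month before the concat, the combined frame can never have more rows than there are months in the data.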
So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted. But it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated that with zeroes over the time range I care about and then I copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt
cnt = data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count()
index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1,13):
        index.append('%04d-%02d'%(year, month))
cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d'%i,'count'] = row[0]
cnt_new.plot(kind = 'bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the results from an SQL query. Actual data is my company's so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for resample command. I just need to run that on the Timestamp data series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
> data.index = data.Timestamp
> data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month"
Using resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non observed months, we need to create a baseline calendar
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign to it the result of our resample
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
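If you would rather see zeros than NaN for the months without events, one option is to count per month and then reindex against the calendar with a fill value. The sketch below uses made-up rows and monthly periods instead of resample; the date range is an assumption you would replace with your own.

```python
import pandas as pd

# Hypothetical events; only the Timestamp column matters for counting.
data = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["2018-01-15", "2018-01-20", "2018-03-02", "2018-06-30"]
    ),
    "FirstName": ["Fred", "Sam", "Helen", "Ann"],
})

# Map every event to its calendar month.
months = data["Timestamp"].dt.to_period("M")

# Baseline calendar covering the range of interest, including empty months.
calendar = pd.period_range("2018-01", "2018-06", freq="M")

# Count events per month, then fill the unobserved months with 0.
counts = data.groupby(months).size().reindex(calendar, fill_value=0)

# counts.plot(kind="bar") would then draw one bar per month, with no legend clutter.
```

Since counts is a single Series, the bar chart has one bar per month instead of one sub-bar per original column.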
I have gathered data from the penultimate worksheet in this Excel file along with all the data in the last Worksheet from "Maturity Years" of 5.5 onward. I have code that does this. However, I am now looking to restructure the dataframe such that it has the following columns and am struggling to do this:
My code is below.
import urllib2
import pandas as pd
import os
import xlrd
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)
xd = pd.ExcelFile(socket)
#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)
bigdata = df1.append(df2,ignore_index = True)
Edit: The Dataframe currently looks as follows. The current Dataframe is pretty disorganized unfortunately:
0 1 2 3 4 5 6 \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 Maturity NaN NaN NaN NaN NaN NaN
3 years: NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 2005-01-03 00:00:00 NaN NaN NaN NaN NaN NaN
6 2005-01-04 00:00:00 NaN NaN NaN NaN NaN NaN
... ... ... .. .. ... ... ...
5410 2015-04-20 00:00:00 NaN NaN NaN NaN 0.367987 0.357069
5411 2015-04-21 00:00:00 NaN NaN NaN NaN 0.362478 0.352581
It has 5440 rows and 61 columns
However, I want the dataframe to be of the format:
I think Columns 1,2,3,4,5 and 6 contain Yield Curve Data. However, I am unsure where the data associated with "Maturity Years" is in the current DataFrame.
Date(which is the 2nd Column in the current Dataframe) Update time(which would just be a column with datetime.datetime.now()) Currency(which would just be a column with 'GBP') Maturity Date Yield Data from SpreadSheet
I use the pandas.io.excel.read_excel function to read xls from url. Here is one way to clean this UK yield curve dataset.
Note: executing the cubic spline interpolation via the apply function takes quite a long time (about 2 minutes on my PC). It interpolates from about 100 points to 300 points, row by row (2638 in total).
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
# check the sheet number, spot: 9/9, short end 7/9
# (recent pandas versions spell this keyword sheet_name; older ones used sheetname)
spot_curve = read_excel(url, sheet_name=8)
short_end_spot_curve = read_excel('uknom05_mdaily.xls', sheet_name=6)
# preprocessing spot_curve
# ==============================================
# do a few inspection on the table
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape
spot_curve.shape
Out[184]: (2715, 40)
# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape
short_end_spot_curve.shape
Out[185]: (2715, 60)
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
combined_data.head()
combined_data.tail()
combined_data.shape
# deal with NaN: the most sound approach is fit the non-arbitrage NSS curve
# however, this is not currently supported in python.
# do a cubic spline instead
# ==============================================
# if more than half of the maturity points are NaN, then interpolation is likely to be unstable, so I'll remove all rows with NaNs count greater than 50
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
# no. of rows down from 2715 to 2628
combined_data.shape
combined_data.shape
Out[186]: (2628, 100)
from scipy.interpolate import interp1d
# mapping points, monthly frequency, 1 mon to 25 years
maturity = pd.Series((np.arange(12 * 25) + 1) / 12)
# do the interpolation day by day
key = lambda x: x.date
by_day = combined_data.groupby(level=0)
# write out apply function
def interpolate_maturities(group):
    # transpose row vector to column vector and drop all NaNs
    a = group.T.dropna().reset_index()
    f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
    return pd.Series(maturity.apply(f).values, index=maturity.values)
# this may take a while .... apply provides flexibility but speed is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)
# a quick look on the data
cleaned_spot_curve.iloc[[1,1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')
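To get from a cleaned wide table (dates down the rows, maturities across the columns) to the long layout asked for in the question, something like melt may be enough. This is a sketch on a tiny made-up frame, not the real Bank of England data; the column names "Maturity" and "Yield" are my own labels, and maturity is left in years because the question does not say how a maturity date should be derived.

```python
import datetime
import pandas as pd

# Hypothetical wide frame: one row per date, one column per maturity in years.
wide = pd.DataFrame(
    [[0.45, 0.52], [0.44, 0.51]],
    index=pd.to_datetime(["2015-04-20", "2015-04-21"]),
    columns=[0.5, 1.0],
)

# One row per (date, maturity) pair.
long = (
    wide.rename_axis("Date")
        .reset_index()
        .melt(id_vars="Date", var_name="Maturity", value_name="Yield")
        .dropna(subset=["Yield"])
)

# Constant columns requested in the question.
long["Update time"] = datetime.datetime.now()
long["Currency"] = "GBP"
long = long[["Date", "Update time", "Currency", "Maturity", "Yield"]]
```

Applied to a frame like cleaned_spot_curve above, this would produce one row per date and maturity, which is closer to the column list in the question.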
I have a data frame that consists of a time series data with 15-second intervals:
date_time value
2012-12-28 11:11:00 103.2
2012-12-28 11:11:15 103.1
2012-12-28 11:11:30 103.4
2012-12-28 11:11:45 103.5
2012-12-28 11:12:00 103.3
The data spans many years. I would like to group by both year and time to look at the distribution of time-of-day effect over many years. For example, I may want to compute the mean and standard deviation of every 15-second interval across days, and look at how the means and standard deviations change from 2010, 2011, 2012, etc. I naively tried data.groupby(lambda x: [x.year, x.time]) but it didn't work. How can I do such grouping?
In case date_time is not your index, a date_time-indexed DataFrame could be created with:
dfts = df.set_index('date_time')
From there you can group by intervals using
dfts.groupby(lambda x : x.month).mean()
to see mean values for each month. Similarly, you can do
dfts.groupby(lambda x : x.year).std()
for standard deviations across the years.
If I understood the example task you would like to achieve, you could simply split the data into years using xs, group them and concatenate the results and store this in a new DataFrame.
years = range(2012, 2015)
yearly_month_stats = [dfts.xs(str(year)).groupby(lambda x : x.month).mean() for year in years]
df2 = pd.concat(yearly_month_stats, axis=1, keys = years)
From which you get something like
2012 2013 2014
value value value
1 NaN 5.324165 15.747767
2 NaN -23.193429 9.193217
3 NaN -14.144287 23.896030
4 NaN -21.877975 16.310195
5 NaN -3.079910 -6.093905
6 NaN -2.106847 -23.253183
7 NaN 10.644636 6.542562
8 NaN -9.763087 14.335956
9 NaN -3.529646 2.607973
10 NaN -18.633832 0.083575
11 NaN 10.297902 14.059286
12 33.95442 13.692435 22.293245
You were close:
data.groupby([lambda x: x.year, lambda x: x.time()])
Also be sure to set date_time as the index, as in kermit666's answer
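For reference, here is a small sketch of the same grouping written with the index attributes instead of lambdas, on made-up 15-second data. The level names "year" and "time" are just labels I chose for readability.

```python
import pandas as pd

# Hypothetical readings: two days in 2011 and one in 2012.
idx = pd.to_datetime([
    "2011-03-01 11:11:00", "2011-03-02 11:11:00",
    "2011-03-01 11:11:15", "2012-03-01 11:11:00",
])
data = pd.DataFrame({"value": [100.0, 102.0, 103.0, 110.0]}, index=idx)

# Group by calendar year and by time of day, then summarise each slot.
stats = (
    data.groupby([data.index.year, data.index.time])["value"]
        .agg(["mean", "std"])
)
stats.index.names = ["year", "time"]
```

Each row of stats is one time-of-day slot within one year, so stats.xs(2011, level="year") gives the intraday profile for 2011 and can be compared against other years.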