Python Pandas - Rolling regressions for multiple columns in a dataframe

I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
                0      1   2   3   4   5       6      7   8  \
31.12.2009  30.75  66.99 NaN NaN NaN NaN  393.87  57.04 NaN
01.01.2010  30.75  66.99 NaN NaN NaN NaN  393.87  57.04 NaN
04.01.2010  31.85  66.99 NaN NaN NaN NaN  404.93  57.04 NaN
05.01.2010  33.26  66.99 NaN NaN NaN NaN  400.00  58.75 NaN
06.01.2010  33.26  66.99 NaN NaN NaN NaN  400.00  58.75 NaN
Now I want to run a rolling regression over a 250-day window for each column across the whole sample period and save the coefficients in another dataframe.
Iterating over the columns and rows with two for-loops isn't very efficient, so I tried the following, but I get this error message:
def regress(start, end):
    y = df_returns.iloc[start:end].values
    if np.isnan(y).any() == False:
        X = np.arange(len(y))
        X = sm.add_constant(X, has_constant="add")
        model = sm.OLS(y,X).fit()
        return model.params[1]
    else:
        return np.nan

regression_window = 250
for t in (regression_window, len(df_returns.index)):
    df_coef[t] = df_returns.apply(regress(t-regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')

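Here is my version, using df.rolling() and iterating over the columns. The TypeError in your code comes from calling regress(...) yourself: apply then receives its return value (a float) instead of a function.
I am not completely sure this is what you were looking for, so don't hesitate to comment.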
import numpy as np
import pandas as pd
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2

df_returns = pd.DataFrame({'0': [30, 30, 31, 32, 32], '1': [60, 60, 60, 60, 60], '2': [np.nan, np.nan, np.nan, np.nan, np.nan]})

def regress(X, Z):
    # X: values of one rolling window; Z: design matrix (constant + time trend)
    if not np.isnan(X).any():
        model = sm.OLS(X, Z).fit()
        return model.params[1]  # slope on the time trend
    else:
        return np.nan

regression_window = 3
Z = np.arange(regression_window)
Z = sm2.add_constant(Z, has_constant="add")
df_coef = pd.DataFrame()
for col in df_returns.columns:
    # raw=True passes each window to regress as a NumPy array
    df_coef[col] = df_returns[col].rolling(window=regression_window).apply(lambda window: regress(window, Z), raw=True)
df_coef
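With 10,000 columns, fitting a statsmodels model for every window may be slow. Since the only regressor here is a time trend, the slope has a closed form, so a lighter alternative might look like the sketch below. It assumes only the slope is needed; rolling_trend_slope is a hypothetical helper name, not something from the answer above.

```python
import numpy as np
import pandas as pd

def rolling_trend_slope(df, window):
    """OLS slope of each column on a 0..window-1 time trend, per rolling window."""
    t = np.arange(window, dtype=float)
    t_centered = t - t.mean()
    denom = (t_centered ** 2).sum()
    # With an intercept, slope = sum((t - mean(t)) * y) / sum((t - mean(t))**2).
    # Windows containing NaN yield NaN because min_periods defaults to the window size.
    return df.rolling(window).apply(lambda y: np.dot(t_centered, y) / denom, raw=True)

# Same toy data as in the answer above.
df_returns = pd.DataFrame({
    "0": [30, 30, 31, 32, 32],
    "1": [60, 60, 60, 60, 60],
    "2": [np.nan] * 5,
})
df_coef = rolling_trend_slope(df_returns, window=3)
```

This handles all columns in one rolling call, but it returns no standard errors or other regression statistics, so the statsmodels version is still the one to use if those are needed.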

Related

Sum Data variables of Dataset

I've merged a list of DataArrays into one Dataset using the code below:
surface_dataarray = []
for (key, value) in surfaces_item_service.items():
    print(f'{key}: {value}')
    single_surface_class = __yearly_surface_type(folder_path, provider, data_source, bins, key)
    single_surface_class.name = key
    if single_surface_class.count() > 1:
        single_surface_class.rio.to_raster(output_file_path + f'/{key}.tif', driver="GTiff")
        surface_dataarray.append(single_surface_class)
surface_data = xr.merge(surface_dataarray)
And I have obtained a Dataset like the one below:
<xarray.Dataset>
Dimensions:            (x: 1868, y: 1373)
Coordinates:
    band               int64 1
  * x                  (x) float64 4.269e+05 4.269e+05 ... 4.455e+05 4.455e+05
  * y                  (y) float64 4.53e+06 4.53e+06 ... 4.516e+06 4.516e+06
    spatial_ref        int64 0
    year               int64 2020
    variable           <U15 'CLASSIFIED DATA'
Data variables:
    water_surface      (y, x) float32 nan nan nan nan nan ... nan nan nan nan
    non_green_surface  (y, x) float32 nan nan nan nan nan ... nan nan nan nan
    green_surface      (y, x) float32 nan nan nan nan nan ... nan nan nan nan
Is it possible to sum the data variables?
I need to save the result as a single band.
NB: there are NaN values because my area doesn't have values at the borders.

Yes, one way to do it is to use Dataset.to_array, which will combine all the data variables into a single array along a new "variable" dimension:
sum_of_data_variables = ds.to_array().sum("variable")
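Because the data contain NaNs at the borders, it may matter how the sum treats them. The sketch below uses a tiny stand-in Dataset (the values are illustrative, not the asker's data) and passes min_count=1 so that pixels where every variable is NaN stay NaN instead of becoming 0. The name total_surface is a placeholder.

```python
import numpy as np
import xarray as xr

# Tiny stand-in for the merged Dataset; values are illustrative only.
ds = xr.Dataset(
    {
        "water_surface": (("y", "x"), np.array([[1.0, np.nan], [np.nan, np.nan]], dtype="float32")),
        "non_green_surface": (("y", "x"), np.array([[2.0, 3.0], [np.nan, np.nan]], dtype="float32")),
        "green_surface": (("y", "x"), np.array([[4.0, np.nan], [5.0, np.nan]], dtype="float32")),
    }
)

# Stack the data variables along a new "variable" dimension, then sum over it.
# NaNs are skipped by default; min_count=1 keeps NaN where all variables are NaN.
total = ds.to_array().sum("variable", min_count=1)
total.name = "total_surface"
```

The result is a single DataArray with dimensions (y, x), so it can presumably be written out with the same rio.to_raster call used in the question.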

Writing a function to iterate a data frame [duplicate]

I have the following dataframe and need to:
- subtract the free package limit from the total number of call minutes
- multiply the result by the calling plan's per-minute rate
- add the monthly charge, depending on the calling plan
There are 2 plan types:
Surf: monthly charge $20, 500 monthly minutes
- After exceeding the limit: 3 cents per minute
Ultimate: monthly charge $70, 3000 monthly minutes
- After exceeding the limit: 1 cent per minute
Here is the function I wrote and tried to apply to create a new column: https://pastebin.com/iB3GwrQ9
def extra_cost(row):
    calls = row['min_1', 'min_2', 'min_3', 'min_4', 'min_5', 'min_6', 'min_7', 'min_8', 'min_9', 'min_10', 'min_11', 'min_12']
    if plan_name == 'surf':
        if calls > 500:
            return (calls - 500)
        else:
            return "NaN"
    if plane_name == 'ultimate':
        if calls > 3000:
            return (calls - 3000)
        else:
            return "NaN"
    else:
        return "NaN"

user_calls1['sum'] = user_calls1.apply(extra_cost(row))
Here is the dataframe head: https://pastebin.com/8HJWbgUr
   user_id first_name  last_name plan_name  min_1  min_2  min_3  min_4   min_5  \
      1000   Anamaria      Bauer  ultimate    NaN    NaN    NaN    NaN     NaN
      1001     Mickey  Wilkerson      surf    NaN    NaN    NaN    NaN     NaN
      1002     Carlee    Hoffman      surf    NaN    NaN    NaN    NaN     NaN
      1003   Reynaldo    Jenkins      surf    NaN    NaN    NaN    NaN     NaN
      1004    Leonila   Thompson      surf    NaN    NaN    NaN    NaN  181.58

   user_id first_name  last_name plan_name   min_6   min_7   min_8   min_9  \
      1000   Anamaria      Bauer  ultimate     NaN     NaN     NaN     NaN
      1001     Mickey  Wilkerson      surf     NaN     NaN  171.14  297.69
      1002     Carlee    Hoffman      surf     NaN     NaN     NaN     NaN
      1003   Reynaldo    Jenkins      surf     NaN     NaN     NaN     NaN
      1004    Leonila   Thompson      surf  261.32  358.45  334.86  284.60

   user_id first_name  last_name plan_name  min_10  min_11   min_12
      1000   Anamaria      Bauer  ultimate     NaN     NaN   116.83
      1001     Mickey  Wilkerson      surf  374.11  404.59   392.93
      1002     Carlee    Hoffman      surf   54.13  359.76   363.24
      1003   Reynaldo    Jenkins      surf     NaN     NaN  1041.00
      1004    Leonila   Thompson      surf  341.63  452.98   403.53
I have not been able to figure out why it is not working; it raises "name 'row' is not defined". If there is a better solution, please let me know!

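Could you give this a shot? Pass the function itself to apply (not extra_cost(row), which is why 'row' was undefined), with axis=1 so that it is called once per row.
import numpy as np

def extra_cost(row):
    calls = row[['min_1', 'min_2', 'min_3', 'min_4', 'min_5', 'min_6', 'min_7', 'min_8', 'min_9', 'min_10', 'min_11', 'min_12']]
    if row.plan_name == 'surf':
        # sum of the monthly values that exceed the 500-minute limit
        total = calls[calls > 500].sum()
        if total != 0:
            return total
        else:
            return np.nan
    elif row.plan_name == 'ultimate':
        # sum of the monthly values that exceed the 3000-minute limit
        total = calls[calls > 3000].sum()
        if total != 0:
            return total
        else:
            return np.nan
    else:
        return np.nan

user_calls1['sum'] = user_calls1.apply(extra_cost, axis=1)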
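Note that the function above sums the monthly values that exceed the limit rather than the minutes above the limit. If the goal is the overage charge described in the question, a vectorized version could look like the following sketch. It assumes the limit applies per month and that NaN means no usage in that month. PLANS and yearly_overage_charge are hypothetical names, and the sample rows are three users and the last three months copied from the question's printout.

```python
import numpy as np
import pandas as pd

# Plan terms taken from the question: free minutes and per-minute overage rate.
PLANS = {
    "surf": {"free_minutes": 500, "rate": 0.03},
    "ultimate": {"free_minutes": 3000, "rate": 0.01},
}

def yearly_overage_charge(df, minute_cols):
    """Return the overage charge per row, summed over the given month columns."""
    free = df["plan_name"].map(lambda p: PLANS[p]["free_minutes"])
    rate = df["plan_name"].map(lambda p: PLANS[p]["rate"])
    # Minutes above the monthly allowance; NaN months are treated as zero usage.
    extra = df[minute_cols].fillna(0).sub(free, axis=0).clip(lower=0)
    return extra.sum(axis=1).mul(rate)

minute_cols = ["min_10", "min_11", "min_12"]
sample = pd.DataFrame({
    "user_id": [1001, 1003, 1000],
    "plan_name": ["surf", "surf", "ultimate"],
    "min_10": [374.11, np.nan, np.nan],
    "min_11": [404.59, np.nan, np.nan],
    "min_12": [392.93, 1041.00, 116.83],
})
sample["overage_charge"] = yearly_overage_charge(sample, minute_cols)
```

Here only user 1003 exceeds the allowance (1041 minutes on the surf plan in month 12), so only that row gets a non-zero charge. The fixed monthly fee is left out because the question does not say whether it applies in months without usage.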

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

I have a pandas DataFrame containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values, which take a time period as an option. For example, ADX with timeperiod=5 shows the ADX for the last 5 days.
Because of this time period, the generated array of indicator values is always shorter than the DataFrame, since the prices of the first 5 days are used to generate the ADX for day 6.
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-Lib for example, the tulip library does not provide NaN values for these first few empty days.
Is there an easy way to prepend these NaNs to the ndarray?
Or to insert into df at a certain index and have it create NaN for the rows before it automatically?
Thanks in advance, I've been researching for days!

Maybe do the shift yourself in the code?
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.nan
# assign in one step via .iloc; chained indexing (df['mdi_14'][...] = ...) may not write back to df
df.iloc[period - 1:, df.columns.get_loc('mdi_14')] = mdi14
I hope the library will fill the first values with NaN in the future. It's dangerous to leave time series data like this without any label.
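If the exact offset for a given indicator is not known, the padding length can instead be derived from the length difference. The sketch below assumes the indicator output lines up with the most recent rows of the DataFrame. pad_to_index is a hypothetical helper, and the indicator array is a stand-in rather than real tulipy output.

```python
import numpy as np
import pandas as pd

def pad_to_index(values, index):
    """Left-pad a shorter indicator array with NaN so it lines up with index."""
    values = np.asarray(values, dtype=float)
    n_missing = len(index) - len(values)
    if n_missing < 0:
        raise ValueError("indicator output is longer than the index")
    padded = np.concatenate([np.full(n_missing, np.nan), values])
    return pd.Series(padded, index=index)

df = pd.DataFrame({"close": [81.59, 81.06, 82.87, 83.0, 83.61, 83.15]})
# Stand-in for an indicator output that is two rows shorter than df.
indicator = np.array([1.0, 2.0, 3.0, 4.0])
df["indicator"] = pad_to_index(indicator, df.index)
```

The first two rows of df["indicator"] are NaN and the remaining four hold the indicator values, so the column assignment no longer raises the length-mismatch error.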

Full MCVE
import numpy as np
import pandas as pd
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)
b = np.full((6, df.shape[1]), np.nan)
c = np.vstack([b, a])  # np.row_stack is a deprecated alias of np.vstack
d = pd.DataFrame(c, df.index, df.columns)
d
     A    B    C
0  NaN  NaN  NaN
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
5  NaN  NaN  NaN
6  2.0  2.0  2.0
7  2.0  2.0  2.0
8  2.0  2.0  2.0
9  2.0  2.0  2.0

The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the python bindings library, tulipy, includes this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
import numpy as np
import pandas as pd
import tulipy as ti
#Create the dataframe with close prices (a list rather than a set, so the order is preserved)
prices = pd.DataFrame(data=[81.59, 81.06, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
#Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real = prices['close'].to_numpy(), period = 5, stddev = 2)))
#Dynamically realign the index; note from the tulip library documentation that the price/volume data is expected to be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
#Put the indicator values with the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
    close  BBANDS_5_2_low  BBANDS_5_2_mid  BBANDS_5_2_up
0   81.59             NaN             NaN            NaN
1   81.06             NaN             NaN            NaN
2   82.87             NaN             NaN            NaN
3   83.00             NaN             NaN            NaN
4   83.61       80.530042          82.426      84.321958
5   83.15       80.987142          82.738      84.488858
6   82.84       82.533343          83.094      83.654657
7   83.99       82.471983          83.318      84.164017
8   84.55       82.417750          83.628      84.838250
9   84.36       82.435203          83.778      85.120797
10  85.53       82.511331          84.254      85.996669
11  86.54       83.142618          84.994      86.845382
12  86.89       83.536488          85.574      87.611512
13  87.77       83.870324          86.218      88.565676
14  87.29       85.288871          86.804      88.319129

I'm trying to modify a pandas data frame so that I will have 2 columns: a frequency column and a date column.

Basically, what I'm working with is a dataframe with all of the parking tickets given out in one year. Every ticket takes up its own row in the unaltered dataframe. What I want to do is group all the tickets by date so that I have 2 columns (date, and the amount of tickets issued on that day). Right now I can achieve that, however, the date is not considered a column by pandas.
import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
                    'infraction_description', 'set_fine_amount', 'time_of_infraction',
                    'location1', 'location2', 'location3', 'location4',
                    'province']
df1 = df1.drop (unnecessary_cols, 1)
df1 = (df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
df1['frequency'] = (df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
print (df1)
df1 = (df1.iloc[121:274])
The output is:
date_of_infraction date_of_infraction frequency
20120101 1059 NaN
20120102 2711 NaN
20120103 6889 NaN
20120104 8030 NaN
20120105 7991 NaN
20120106 8693 NaN
20120107 7237 NaN
20120108 5061 NaN
20120109 7974 NaN
20120110 8872 NaN
20120111 9110 NaN
20120112 8667 NaN
20120113 7247 NaN
20120114 7211 NaN
20120115 6116 NaN
20120116 9168 NaN
20120117 8973 NaN
20120118 9016 NaN
20120119 7998 NaN
20120120 8214 NaN
20120121 6400 NaN
20120122 6355 NaN
20120123 7777 NaN
20120124 8628 NaN
20120125 8527 NaN
20120126 8239 NaN
20120127 8667 NaN
20120128 7174 NaN
20120129 5378 NaN
20120130 7901 NaN
... ... ...
20121202 5342 NaN
20121203 7336 NaN
20121204 7258 NaN
20121205 8629 NaN
20121206 8893 NaN
20121207 8479 NaN
20121208 7680 NaN
20121209 5357 NaN
20121210 7589 NaN
20121211 8918 NaN
20121212 9149 NaN
20121213 7583 NaN
20121214 8329 NaN
20121215 7072 NaN
20121216 5614 NaN
20121217 8038 NaN
20121218 8194 NaN
20121219 6799 NaN
20121220 7102 NaN
20121221 7616 NaN
20121222 5575 NaN
20121223 4403 NaN
20121224 5492 NaN
20121225 673 NaN
20121226 1488 NaN
20121227 4428 NaN
20121228 5882 NaN
20121229 3858 NaN
20121230 3817 NaN
20121231 4530 NaN
Essentially, I want to move all the columns over by one to the right. Right now pandas only considers the last two columns as actual columns. I hope this made sense.

The count of infractions per date should be achievable with just one call to groupby. Try this:
import numpy as np
import pandas as pd
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
                    'infraction_description', 'set_fine_amount', 'time_of_infraction',
                    'location1', 'location2', 'location3', 'location4',
                    'province']
# pass axis as a keyword; recent pandas versions no longer accept it positionally
df1 = df1.drop(unnecessary_cols, axis=1)
# reset_index() to move the dates into their own column
counts = df1.groupby('date_of_infraction').count().reset_index()
print(counts)
Note that any dates with zero tickets will not show up as 0; instead, they will simply be absent from counts.
If this doesn't work, it would be helpful for us to see the first few rows of df1 after you drop the unnecessary columns.
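As a variation, groupby(...).size() counts rows per date directly and lets the count column be named in the same step. The sketch below uses a small made-up tickets frame, not the real CSV, and also shows one possible way to include days without tickets as 0, assuming the dates are stored as YYYYMMDD integers as in the printed output.

```python
import pandas as pd

# Small stand-in for the ticket data; one row per ticket.
tickets = pd.DataFrame({
    "date_of_infraction": [20120101, 20120101, 20120103, 20120103, 20120103],
})

# size() counts rows per group; reset_index(name=...) turns the result into two columns.
counts = (
    tickets.groupby("date_of_infraction")
    .size()
    .reset_index(name="frequency")
)

# Optional: include dates with no tickets as 0 by reindexing over a full date range.
dates = pd.to_datetime(counts["date_of_infraction"].astype(str), format="%Y%m%d")
daily = pd.Series(counts["frequency"].to_numpy(), index=dates)
full_range = pd.date_range(dates.min(), dates.max(), freq="D")
daily = (
    daily.reindex(full_range, fill_value=0)
    .rename_axis("date")
    .reset_index(name="frequency")
)
```

counts has the two columns date_of_infraction and frequency, while daily additionally contains a row for 2012-01-02 with a frequency of 0.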

Try using as_index=False.
For example:
import numpy as np
import pandas as pd
data = {"date_of_infraction":["20120101", "20120101", "20120202", "20120202"],
        "foo":np.random.random(4)}
df = pd.DataFrame(data)
df
  date_of_infraction       foo
0           20120101  0.681286
1           20120101  0.826723
2           20120202  0.669367
3           20120202  0.766019
(df.groupby("date_of_infraction", as_index=False) # <-- acts like reset_index()
   .foo.count()
   .rename(columns={"foo":"frequency"})
)
  date_of_infraction  frequency
0           20120101          2
1           20120202          2

Restructuring Dataframe in Python

I have gathered data from the penultimate worksheet in this Excel file, along with all the data in the last worksheet from "Maturity Years" of 5.5 onward. I have code that does this. However, I am now looking to restructure the dataframe so that it has the columns listed below, and I am struggling to do this.
My code is below.
import urllib2
import pandas as pd
import os
import xlrd
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)
xd = pd.ExcelFile(socket)
#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)
bigdata = df1.append(df2,ignore_index = True)
Edit: The Dataframe currently looks as follows. The current Dataframe is pretty disorganized unfortunately:
                        0    1    2    3    4         5         6  \
0                     NaN  NaN  NaN  NaN  NaN       NaN       NaN
1                     NaN  NaN  NaN  NaN  NaN       NaN       NaN
2                Maturity  NaN  NaN  NaN  NaN       NaN       NaN
3                  years:  NaN  NaN  NaN  NaN       NaN       NaN
4                     NaN  NaN  NaN  NaN  NaN       NaN       NaN
5     2005-01-03 00:00:00  NaN  NaN  NaN  NaN       NaN       NaN
6     2005-01-04 00:00:00  NaN  NaN  NaN  NaN       NaN       NaN
...                   ...  ...   ..   ..  ...       ...       ...
5410  2015-04-20 00:00:00  NaN  NaN  NaN  NaN  0.367987  0.357069
5411  2015-04-21 00:00:00  NaN  NaN  NaN  NaN  0.362478  0.352581
It has 5440 rows and 61 columns.
However, I want the dataframe to have the following columns:
- Date (the 2nd column in the current dataframe)
- Update time (just a column with datetime.datetime.now())
- Currency (just a column with 'GBP')
- Maturity Date
- Yield Data from the spreadsheet
I think columns 1, 2, 3, 4, 5 and 6 contain yield curve data. However, I am unsure where the data associated with "Maturity Years" is in the current dataframe.

I use the pandas read_excel function to read the xls file from the URL. Here is one way to clean this UK yield curve dataset.
Note: executing the cubic spline interpolation via the apply function takes quite a bit of time (about 2 minutes on my PC). It interpolates from about 100 points to 300 points, row by row (2628 rows in total).
import pandas as pd
import numpy as np
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
# check the sheet number, spot: 9/9, short end 7/9
# (sheet_name replaces the older sheetname keyword in current pandas)
spot_curve = pd.read_excel(url, sheet_name=8)
short_end_spot_curve = pd.read_excel(url, sheet_name=6)
# preprocessing spot_curve
# ==============================================
# do a few inspection on the table
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape
# Out[184]: (2715, 40)
# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape
# Out[185]: (2715, 60)
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
combined_data.head()
combined_data.tail()
combined_data.shape
# deal with NaN: the most sound approach is fit the non-arbitrage NSS curve
# however, this is not currently supported in python.
# do a cubic spline instead
# ==============================================
# if more than half of the maturity points are NaN, interpolation is likely to be unstable, so remove all rows with more than 50 NaNs
# (a boolean row mask is used here; groupby(...).filter expects a scalar per group, not a Series)
combined_data = combined_data[combined_data.isnull().sum(axis=1) <= 50]
# no. of rows down from 2715 to 2628
combined_data.shape
# Out[186]: (2628, 100)
from scipy.interpolate import interp1d
# mapping points, monthly frequency, 1 month to 25 years
# (12.0 forces float division, which matters on Python 2)
maturity = pd.Series((np.arange(12 * 25) + 1) / 12.0)
# do the interpolation day by day
key = lambda x: x.date
by_day = combined_data.groupby(level=0)
# write out apply function
def interpolate_maturities(group):
    # transpose row vector to column vector and drop all NaNs
    a = group.T.dropna().reset_index()
    f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
    return pd.Series(maturity.apply(f).values, index=maturity.values)
# this may take a while ... apply provides flexibility but speed is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)
# a quick look on the data
cleaned_spot_curve.iloc[[1,1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')
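The question also asked for a long layout with Date, Update time, Currency, Maturity and Yield columns. Once a wide table such as cleaned_spot_curve exists (dates as index, maturities as columns), one possible way to get there is melt, as sketched below. The two yields per date are copied from the question's printout; the maturity labels 0.5 and 1.0 are placeholders, since the printout does not show them, and Maturity is kept in years rather than converted to a maturity date.

```python
import datetime
import pandas as pd

# Stand-in for the cleaned wide table: index = dates, columns = maturities in years.
# The maturity labels 0.5 and 1.0 are placeholders, not values from the spreadsheet.
wide = pd.DataFrame(
    {0.5: [0.367987, 0.362478], 1.0: [0.357069, 0.352581]},
    index=pd.to_datetime(["2015-04-20", "2015-04-21"]),
)
wide.index.name = "Date"
wide.columns.name = "Maturity"

# Melt to one row per (date, maturity) pair and add the constant columns.
long = wide.reset_index().melt(id_vars="Date", var_name="Maturity", value_name="Yield")
long["Update time"] = datetime.datetime.now()
long["Currency"] = "GBP"
long = long[["Date", "Update time", "Currency", "Maturity", "Yield"]]
```

Each row of long then holds a single yield observation, which is usually easier to load into a database table than the wide layout.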
