How can I fill pandas column with a function (sin,line)? - python

I have a dataframe with some columns that i have been adding myself. There is one specific column that gathers the max and min tide levels.
Pandas Column mostly empty but with some reference values
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4],'b':[np.nan,np.nan,3,4]},columns=['a','b'])
df
The problem is that the column is mostly empty because it only shows those peak values and not the intermediate ones. I would like to fill the missing values with a function similiar to the image shown below.
I want to fill it with a function of this kind
Thank you in advance.

Since you didn't specify, which datetime format your pandas dataframe uses, here is an example with index data. You can use them, if they are evenly spaced and they don't have gaps.
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit
tide = np.asarray([-1.2,np.nan,np.nan,3.4,np.nan,np.nan,-1.6,np.nan,np.nan,3.7,np.nan,np.nan,-1.4,])
tide_time = np.arange(len(tide))
df = pd.DataFrame({'a':tide_time,'b':tide})
#define your fit function with amplitude, frequence, phase and offset
def fit_func(x, ampl, freq, phase, offset):
return ampl * np.sin(freq * x + phase) + offset
#extract rows that contain your values
df_nona = df.dropna()
#perform the least square fit, get the coefficients for your fitted data
coeff, _mat = curve_fit(fit_func, df_nona["a"], df_nona["b"])
print(coeff)
#append a column with fit data
df["fitted_b"] = fit_func(df["a"], *coeff)
Output for my sample data
#amplitude frequency phase offset
[ 2.63098177 1.12805625 -2.17037976 1.0127173 ]
a b fitted_b
0 0 -1.2 -1.159344
1 1 NaN -1.259341
2 2 NaN 1.238002
3 3 3.4 3.477807
4 4 NaN 2.899605
5 5 NaN 0.164376
6 6 -1.6 -1.601058
7 7 NaN -0.378513
8 8 NaN 2.434439
9 9 3.7 3.622127
10 10 NaN 1.826826
11 11 NaN -0.899136
12 12 -1.4 -1.439532

Related

Python/Pandas For Loop Time Series

I am working with panel time-series data and am struggling with creating a fast for loop, to sum up, the past 50 numbers at the current i. The data is like 600k rows, and it starts to churn around 30k. Is there a way to use pandas or Numpy to do the same at a fraction of the time?
The change column is of type float, with 4 decimals.
Index Change
0 0.0410
1 0.0000
2 0.1201
... ...
74327 0.0000
74328 0.0231
74329 0.0109
74330 0.0462
SEQ_LEN = 50
for i in range(SEQ_LEN, len(df)):
df.at[i, 'Change_Sum'] = sum(df['Change'][i-SEQ_LEN:i])
Any help would be highly appreciated! Thank you!
I tried this with 600k rows and the average time was
20.9 ms ± 1.35 ms
This will return a series with the rolling sum for the last 50 Change in the df:
df['Change'].rolling(50).sum()
you can add it to a new column like so:
df['change50'] = df['Change'].rolling(50).sum()
Disclaimer: This solution cannot compete with .rolling(). Plus, if a .groupby() case, just do a df.groupby("group")["Change"].rolling(50).sum() and then reset index. Therefore please accept the other answer.
Explicit for loop can be avoided by translating your recursive partial sum into the difference of cumulative sum (cumsum). The formula:
Sum[x-50:x] = Sum[:x] - Sum[:x-50] = Cumsum[x] - Cumsum[x-50]
Code
For showcase purpose, I have shorten len(df["Change"]) to 10 and SEQ_LEN to 5. A million records completed almost immediately in this way.
import pandas as pd
import numpy as np
# data
SEQ_LEN = 5
np.random.seed(111) # reproducibility
df = pd.DataFrame(
data={
"Change": np.random.normal(0, 1, 10) # a million rows
}
)
# step 1. Do cumsum
df["Change_Cumsum"] = df["Change"].cumsum()
# Step 2. calculate diff of cumsum: Sum[x-50:x] = Sum[:x] - Sum[:x-50]
df["Change_Sum"] = np.nan # or zero as you wish
df.loc[SEQ_LEN:, "Change_Sum"] = df["Change_Cumsum"].values[SEQ_LEN:] - df["Change_Cumsum"].values[:(-SEQ_LEN)]
# add idx=SEQ_LEN-1
df.at[SEQ_LEN-1, "Change_Sum"] = df.at[SEQ_LEN-1, "Change_Cumsum"]
Output
df
Out[30]:
Change Change_Cumsum Change_Sum
0 -1.133838 -1.133838 NaN
1 0.384319 -0.749519 NaN
2 1.496554 0.747035 NaN
3 -0.355382 0.391652 NaN
4 -0.787534 -0.395881 -0.395881
5 -0.459439 -0.855320 0.278518
6 -0.059169 -0.914489 -0.164970
7 -0.354174 -1.268662 -2.015697
8 -0.735523 -2.004185 -2.395838
9 -1.183940 -3.188125 -2.792244

how to construct an index from percentage change time series?

consider the values below
array1 = np.array([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
I convert these into a pandas Series object
import numpy as np
import pandas as pd
df = pd.Series(array1)
And compute the percentage change as
df = (1+df.pct_change(periods=1))
from here, how do i construct an index (base=100)? My desired output should be:
0 100.00
1 100.43
2 101.82
3 101.82
4 101.82
5 101.43
6 102.19
7 101.68
8 101.07
9 101.02
10 101.01
11 101.01
12 100.88
13 100.54
14 99.95
15 99.45
I can achieve the objective through an iterative (loop) solution, but that may not be a practical solution, if the data depth and breadth is large. Secondly, is there a way in which i can get this done in a single step on multiple columns? thank you all for any guidance.
An index (base=100) is the relative change of a series in retation to its first element. So there's no need to take a detour to relative changes and recalculate the index from them when you can get it directly by
df = pd.Series(array1)/array1[0]*100
As far as I know, there is still no off-the-shelf expanding_window version for pct_change(). You can avoid the for-loop by using apply:
# generate data
import pandas as pd
series = pd.Series([526.59, 528.88, 536.19, 536.18, 536.18, 534.14, 538.14, 535.44,532.21, 531.94, 531.89, 531.89, 531.23, 529.41, 526.31, 523.67])
# copmute percentage change with respect to first value
series.apply(lambda x: ((x / series.iloc[0]) - 1) * 100) + 100
Output:
0 100.000000
1 100.434873
2 101.823050
3 101.821151
4 101.821151
5 101.433753
6 102.193357
7 101.680624
8 101.067244
9 101.015971
10 101.006476
11 101.006476
12 100.881141
13 100.535521
14 99.946828
15 99.445489
dtype: float64

Is there a pandas way of getting the averages between consecutive rows?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(30,3))
df.head()
which gives:
0 1 2
0 0.741955 0.913681 0.110109
1 0.079039 0.662438 0.510414
2 0.469055 0.201658 0.259958
3 0.371357 0.018394 0.485339
4 0.850254 0.808264 0.469885
Say I want to add another column that will build the averages in column 2: between index (0,1) (1,2)... (28,29).
I imagine this is a common task as column 2 are the x axis positions and I want the categorical labels on a plot to appear in the middle between the 2 points on the x axis.
So I was wondering if there is a pandas way for this:
averages = []
for index, item in enumerate(df[2]):
if index < df[2].shape[0] -1:
averages.append((item + df[2].iloc[index + 1]) / 2)
df["averages"] = pd.Series(averages)
df.head()
which gives:
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333138
3 0.996487 0.272300 0.334554 0.586686
as you can see 0.31 is an average of 0.21 and 0.42.
Thanks!
I think that you can do this with pandas.DataFrame.rolling. Using your dataframe head as an example:
df['averages'] = df[2].rolling(2).mean().shift(-1)
returns:
>>> df
0 1 2 averages
0 0.997044 0.965708 0.211980 0.318781
1 0.716349 0.724811 0.425583 0.378653
2 0.729991 0.985072 0.331723 0.333139
3 0.996487 0.272300 0.334554 NaN
The NaN at the end is there because there is no row indexed 4; but in your full dataframe, it would go on until the second to last row (the average of value at indices 28 and 29, i.e. your 29th and 30th values). I just wanted to show that this gives the same values as your desired output, so I used the exact data you provided. (for future reference, if you want to provide a reproducible dataframe for us from random numbers, use and show us a random seed such as np.random.seed(42) before creating the df, that way, we'll all have the same one.)
breaking it down:
df[2] is there because you're interested in column 2; .rolling(2) is there because you want to get the mean of 2 values (if you wanted the mean of 3 values, use .rolling(3), etc...), .mean() is whatever function you want (in your case, the mean); finally .shift(-1) makes sure that the new column is in the proper place (i.e., makes sure you show the mean of each value in column 2 and the value below, as the default would be the value above)
This is one way, though slightly loopy. But #sacul's solution is better. I leave this here for reference only.
import pandas as pd
import numpy as np
from itertools import zip_longest
df = pd.DataFrame(np.random.rand(30, 3))
v = df.values[:, -1]
df = df.join(pd.DataFrame(np.array([np.mean([i, j], axis=0) for i, j in \
zip_longest(v, v[1:], fillvalue=v[-1])]), columns=['2_pair_avg']))
# 0 1 2 2_pair_avg
# 0 0.382656 0.228837 0.053199 0.373678
# 1 0.812690 0.255277 0.694156 0.697738
# 2 0.040521 0.211511 0.701320 0.491044
# 3 0.558739 0.697916 0.280768 0.615398
# 4 0.262771 0.912669 0.950029 0.489550
# 5 0.217489 0.405125 0.029071 0.101794
# 6 0.577929 0.933565 0.174517 0.214530
# 7 0.067030 0.452027 0.254544 0.613225
# 8 0.580869 0.556112 0.971907 0.582547
# 9 0.483528 0.951537 0.193188 0.175215
# 10 0.481141 0.589833 0.157242 0.159363
# 11 0.087057 0.823691 0.161485 0.108634
# 12 0.319516 0.161386 0.055784 0.285276
# 13 0.901529 0.365992 0.514768 0.386599
# 14 0.270118 0.454583 0.258430 0.245463
# 15 0.379739 0.299569 0.232497 0.214943
# 16 0.017621 0.182647 0.197389 0.538386
# 17 0.720688 0.147093 0.879383 0.732239
# 18 0.859594 0.538390 0.585096 0.503846
# 19 0.360718 0.571567 0.422596 0.287384
# 20 0.874800 0.391535 0.152171 0.239078
# 21 0.935150 0.379871 0.325984 0.294485
# 22 0.269607 0.891331 0.262986 0.212050
# 23 0.140976 0.414547 0.161115 0.542682
# 24 0.851434 0.059209 0.924250 0.801210
# 25 0.389025 0.774885 0.678170 0.388856
# 26 0.679247 0.982517 0.099542 0.372649
# 27 0.670354 0.279138 0.645756 0.336031
# 28 0.393414 0.970737 0.026307 0.343947
# 29 0.479611 0.349401 0.661587 0.661587

Irregular binning p2 python pandas

I have one small follow up question regarding binning in python pandas.
I have a data-frame like the following:
df =
variable test_score
-1 52.0
1 53.0
4 54.0
6 64.0
6 64.0
-6 64.0
5 71.0
10 73.0
-15 75.0
4 77.0
....... etc, etc....
I would like to bin with respect to the column/variable "variable", so that the same number of rows "X" (say 100) appear in each "variable" bin.
I would then like to scatter plot the central value of each variable bin ((variable_bin_min + variable_bin_max)/2) against the mean of the test scores for that variable bin.
I cannot see a simple way to do this and would be grateful for any guidance!
This should get it done. I manufactured the data, so it won't look like yours.
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(dict(variable=np.random.choice(range(20), (1000,)),
test_score=np.random.rand(1000,).round(2) * 100))
df_ = df.groupby(pd.qcut(df.variable, len(df) / 100)).agg([np.min, np.max, np.mean])
pd.concat([df_.variable.apply(lambda x: x.loc[['amin', 'amax']].mean(), axis=1),
df_.test_score['mean']],
axis=1,
keys=['bin_center', 'mean_score']).plot.scatter('bin_center', 'mean_score')
For your bins containing 5 items then pd.cut() to further slice the data :
LL = df['test_score'].tolist()
bins = LL[::5]

Restructuring Dataframe in Python

I have gathered data from the penultimate worksheet in this Excel file along with all the data in the last Worksheet from "Maturity Years" of 5.5 onward. I have code that does this. However, I am now looking to restructure the dataframe such that it has the following columns and am struggling to do this:
My code is below.
import urllib2
import pandas as pd
import os
import xlrd
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)
xd = pd.ExcelFile(socket)
#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)
bigdata = df1.append(df2,ignore_index = True)
Edit: The Dataframe currently looks as follows. The current Dataframe is pretty disorganized unfortunately:
0 1 2 3 4 5 6 \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 Maturity NaN NaN NaN NaN NaN NaN
3 years: NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 2005-01-03 00:00:00 NaN NaN NaN NaN NaN NaN
6 2005-01-04 00:00:00 NaN NaN NaN NaN NaN NaN
... ... ... .. .. ... ... ...
5410 2015-04-20 00:00:00 NaN NaN NaN NaN 0.367987 0.357069
5411 2015-04-21 00:00:00 NaN NaN NaN NaN 0.362478 0.352581
It has 5440 rows and 61 columns
However, I want the dataframe to be of the format:
I think Columns 1,2,3,4,5 and 6 contain Yield Curve Data. However, I am unsure where the data associated with "Maturity Years" is in the current DataFrame.
Date(which is the 2nd Column in the current Dataframe) Update time(which would just be a column with datetime.datetime.now()) Currency(which would just be a column with 'GBP') Maturity Date Yield Data from SpreadSheet
I use the pandas.io.excel.read_excel function to read xls from url. Here is one way to clean this UK yield curve dataset.
Note: executing the cubic spline interpolation via the apply function takes quite a mount of time (about 2 minutes in my PC). It interpolates from about 100 points to 300 points, row by row (2638 in total).
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
# check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8)
short_end_spot_curve = read_excel('uknom05_mdaily.xls', sheetname=6)
# preprocessing spot_curve
# ==============================================
# do a few inspection on the table
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape
spot_curve.shape
Out[184]: (2715, 40)
# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape
short_end_spot_curve.shape
Out[185]: (2715, 60)
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
combined_data.head()
combined_data.tail()
combined_data.shape
# deal with NaN: the most sound approach is fit the non-arbitrage NSS curve
# however, this is not currently supported in python.
# do a cubic spline instead
# ==============================================
# if more than half of the maturity points are NaN, then interpolation is likely to be unstable, so I'll remove all rows with NaNs count greater than 50
def filter_func(group):
return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
# no. of rows down from 2715 to 2628
combined_data.shape
combined_data.shape
Out[186]: (2628, 100)
from scipy.interpolate import interp1d
# mapping points, monthly frequency, 1 mon to 25 years
maturity = pd.Series((np.arange(12 * 25) + 1) / 12)
# do the interpolation day by day
key = lambda x: x.date
by_day = combined_data.groupby(level=0)
# write out apply function
def interpolate_maturities(group):
# transpose row vector to column vector and drops all nans
a = group.T.dropna().reset_index()
f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
return pd.Series(maturity.apply(f).values, index=maturity.values)
# this may take a while .... apply provides flexibility but spead is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)
# a quick look on the data
cleaned_spot_curve.iloc[[1,1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')

Categories