Pandas indexed Series Subset (of a DataFrame) not changing values - python

I have the following table:
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'code': ['A121','A121','A121','H812','H812','H812','Z198','Z198','Z198','S222','S222','S222'],
                   'mode': ['stk','sup','cons','stk','sup','cons','stk','sup','cons','stk','sup','cons'],
                   datetime.date(year=2021,month=5,day=1): [4,2,np.nan,2,2,np.nan,6,np.nan,np.nan,np.nan,2,np.nan],
                   datetime.date(year=2021,month=5,day=2): [1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan,np.nan],
                   datetime.date(year=2021,month=5,day=3): [12,5,np.nan,13,5,np.nan,12,np.nan,np.nan,np.nan,5,np.nan],
                   datetime.date(year=2021,month=5,day=4): [np.nan,1,np.nan,np.nan,4,np.nan,np.nan,np.nan,np.nan,np.nan,7,np.nan]})
df = df.set_index('mode')
I want the rows where mode is cons to be set according to a simple arithmetic calculation:
cons for the corresponding date and code needs to be set to prev_date stk - current_date stk + current_date sup (for example, A121 on 2021-05-02: 4 - 1 + 0 = 3).
I have tried the code below:
dates = list(df.columns)
dates.remove('code')

for date in dates:
    prev_date = date - datetime.timedelta(days=1)
    if df.loc["stk"].get(prev_date, None) is not None:
        opn_stk = df.loc["stk", prev_date].reset_index(drop=True)
        cls_stk = df.loc["stk", date].reset_index(drop=True)
        sup = df.loc["sup", date].fillna(0).reset_index(drop=True)
        cons = opn_stk - cls_stk + sup
        df.loc["cons", date] = cons
I do not receive any error; however, the cons values do not change at all.
I suspect this is because df.loc["cons", date] is a Series indexed by label, while the result of opn_stk - cls_stk + sup has a plain RangeIndex (from reset_index), so the assignment aligns on labels and nothing matches.
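A minimal sketch of the alignment behaviour I suspect (toy data, not my real frame):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [np.nan, np.nan]}, index=['cons', 'cons'])
vals = pd.Series([10.0, 20.0])   # plain RangeIndex 0, 1 (like after reset_index)
toy.loc['cons', 'x'] = vals      # .loc aligns on labels; 0 and 1 never match 'cons'
print(toy)                       # still NaN, i.e. it looks like nothing changed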
Any idea how to fix this?
P.S. I am using loops to calculate this; is there a vectorized way that would be more efficient?
Expected Output

Let's try a groupby apply instead:
def calc_cons(g):
    # Transpose
    t = g[g.columns[g.columns != 'code']].T
    # Update cons
    g.loc[g.index == 'cons', g.columns != 'code'] = (-t['stk'].diff() +
                                                     t['sup'].fillna(0)).to_numpy()
    return g

df = df.groupby('code', as_index=False, sort=False).apply(calc_cons)
# print(df[df.index == 'cons'])
print(df)
code 2021-05-01 2021-05-02 2021-05-03 2021-05-04
mode
stk A121 4.0 1.0 12.0 NaN
sup A121 2.0 NaN 5.0 1.0
cons A121 NaN 3.0 -6.0 NaN
stk H812 2.0 3.0 13.0 NaN
sup H812 2.0 NaN 5.0 4.0
cons H812 NaN -1.0 -5.0 NaN
stk Z198 6.0 2.0 12.0 NaN
sup Z198 NaN NaN NaN NaN
cons Z198 NaN 4.0 -10.0 NaN
stk S222 NaN NaN NaN NaN
sup S222 2.0 NaN 5.0 7.0
cons S222 NaN NaN NaN NaN
*Assumes columns are in sorted order by date in 1 day intervals.
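If the date columns might not already be in order, a quick sketch to sort them before running the groupby above (assumes every non-'code' column is a datetime.date, as in the question):
date_cols = sorted(c for c in df.columns if c != 'code')
df = df[['code'] + date_cols]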

Although @Henry Ecker's answer is very elegant, it is much slower than my approach (over 10x slower), so I would like to go ahead with my own implementation.
My implementation, fixed as per Henry Ecker's suggestion (df.loc["cons", date] = cons.to_numpy()):
dates = list(df.columns)
dates.remove('code')

for date in dates:
    prev_date = date - datetime.timedelta(days=1)
    if df.loc["stk"].get(prev_date, None) is not None:
        opn_stk = df.loc["stk", prev_date].reset_index(drop=True)   # stock of the previous date
        cls_stk = df.loc["stk", date].reset_index(drop=True)        # stock of the current date
        sup = df.loc["sup", date].fillna(0).reset_index(drop=True)  # supply of the current date
        cons = opn_stk - cls_stk + sup
        df.loc["cons", date] = cons.to_numpy()  # .to_numpy() avoids index alignment
Just as a side note:
My implementation runs on the full data (not this; I created this as a toy example) in 0:00:00.053309, whereas Henry Ecker's implementation runs in 0:00:00.568888, so more than 10x slower.
This is probably because he iterates over the codes whereas I iterate over the dates. At any given point in time I will have at most 30 dates, but there can be more than 500 codes.
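For completeness, a fully vectorized sketch with no Python loop at all, under the same assumption that the date columns are sorted at one-day intervals (this reuses df exactly as built in the question, with 'mode' as the index; it is a sketch, not a drop-in replacement for either answer):
date_cols = [c for c in df.columns if c != 'code']

stk = df.loc['stk'].set_index('code')[date_cols]
sup = df.loc['sup'].set_index('code')[date_cols]

# previous-day stk minus current-day stk plus current-day sup (missing sup treated as 0)
cons = stk.shift(1, axis=1) - stk + sup.fillna(0)

# the stk/sup/cons rows appear in the same code order, so positional assignment is safe
df.loc['cons', date_cols] = cons.to_numpy()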

Related

Python - Find percent change for previous 7-day period's average

I have time-series data in a dataframe. Is there any way to calculate for each day the percent change of that day's value from the average of the previous 7 days?
I have tried
df['Change'] = df['Column'].pct_change(periods=7)
However, this simply finds the difference between t and t-7 days. I need something like:
For each value of Ti, find the average of the previous 7 days, and subtract from Ti
Sure, you can for example use:
s = df['Column']
n = 7
mean = s.rolling(n, closed='left').mean()
df['Change'] = (s - mean) / mean
Note on closed='left'
There was a bug prior to pandas=1.2.0 that caused incorrect handling of closed for fixed windows. Make sure you have pandas>=1.2.0; for example, pandas=1.1.3 will not give the result below.
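A quick guard if you are unsure which version is installed (a sketch using the packaging module, assuming it is available in your environment):
import pandas as pd
from packaging.version import Version

assert Version(pd.__version__) >= Version("1.2.0"), \
    "closed= is unreliable for fixed windows before pandas 1.2.0"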
As described in the docs:
closed: Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. Defaults to ‘right’.
A simple way to understand is to try with some very simple data and a small window:
a = pd.DataFrame(range(5), index=pd.date_range('2020', periods=5))
b = a.assign(
    sum_left=a.rolling(2, closed='left').sum(),
    sum_right=a.rolling(2, closed='right').sum(),
    sum_both=a.rolling(2, closed='both').sum(),
    sum_neither=a.rolling(2, closed='neither').sum(),
)
>>> b
0 sum_left sum_right sum_both sum_neither
2020-01-01 0 NaN NaN NaN NaN
2020-01-02 1 NaN 1.0 1.0 NaN
2020-01-03 2 1.0 3.0 3.0 NaN
2020-01-04 3 3.0 5.0 6.0 NaN
2020-01-05 4 5.0 7.0 9.0 NaN
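If you also want a value for the first days, when fewer than 7 prior observations exist, min_periods can relax that requirement (a sketch; the first rows then use partial averages):
mean = s.rolling(n, min_periods=1, closed='left').mean()
df['Change'] = (s - mean) / mean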

Bind one row cell with multiple row cells for an Excel sheet in pandas (Jupyter notebook)

I have an excel sheet like this.
If I search using the method below, I get only 1 row.
df4 = df.loc[(df['NAME '] == 'HIR')]
df4
But I want to get all rows connected with this name (and the same for birthdate and place).
expected output:
How can I achieve this? How can I bind these things?
You need to forward fill the data with ffill():
df = df.replace('', np.nan) # in case you don't have null values, but you have empty strings
df['NAME '] = df['NAME '].ffill()
df4 = df.loc[(df['NAME '] == 'HIR')]
df4
That will then bring up all of the rows when you use loc. You can do this on other columns as well.
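For example, a sketch extending the same forward fill to the other merged columns mentioned in the question (the column names here are assumed from the screenshot and may need adjusting, e.g. the trailing space in 'NAME '):
import numpy as np

merged_cols = ['NAME ', 'BIRTHDATE', 'PLACE']   # assumed header names
df[merged_cols] = df[merged_cols].replace('', np.nan).ffill()
df4 = df.loc[df['NAME '] == 'HIR']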
First you need to remove those blank rows in your Excel file, then fill the values from the previous row:
import pandas as pd
df = pd.read_excel('so.xlsx')
df = df[~df['HOBBY'].isna()]
df[['SNO','NAME']] = df[['SNO','NAME']].ffill()
df
SNO NAME HOBBY COURSE BIRTHDATE PLACE
0 1.0 HIR DANCING BTECH 1990.0 USA
1 1.0 HIR MUSIC MTECH NaN NaN
2 1.0 HIR TRAVELLING AI NaN NaN
4 2.0 BH GAMES BTECH 1992.0 INDIA
5 2.0 BH BOOKS AI NaN NaN
6 2.0 BH SWIMMING NaN NaN NaN
7 2.0 BH MUSIC NaN NaN NaN
8 2.0 BH DANCING NaN NaN NaN

Remove group of empty or nan in pandas groupby

In a dataframe with some empty (NaN) values in some rows - example below:
s = pd.DataFrame([[39877380, 158232151, 20],
                  [39877380, 332086469, None],
                  [39877380, 39877381, 14],
                  [39877380, 39877383, 8],
                  [73516838, 6439138, 1],
                  [73516838, 6500551, None],
                  [735571896, 203559638, None],
                  [735571896, 282186552, None],
                  [736453090, 6126187, None],
                  [673117474, 12196071, None],
                  [673117474, 12209800, None],
                  [673117474, 618058747, 6]],
                 columns=['start', 'end', 'total'])
When I group by the start and end columns
s.groupby(['start', 'end']).total.sum()
the output I get is
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
735571896 203559638 nan
282186552 nan
736453090 6126187 nan
I want to exclude all start groups in which every end value is NaN. Expected output:
start end
39877380 39877381 14.00
39877383 8.00
158232151 20.00
332086469 nan
73516838 6439138 1.00
6500551 nan
673117474 12196071 nan
12209800 nan
618058747 6.00
I tried dropna(), but it removes all the NaN values, not the NaN groups.
I am a newbie in Python and pandas. Can someone help me with this? Thank you.
In newer pandas versions it is necessary to use min_count=1 to keep missing values when using sum:
s1 = s.groupby(['start', 'end']).total.sum(min_count=1)
# older pandas versions
# s1 = s.groupby(['start', 'end']).total.sum()
Then it is possible to filter for groups with at least one non-missing value in the first level, using Series.notna with GroupBy.transform('any'); the filtering itself is boolean indexing:
s2 = s1[s1.notna().groupby(level=0).transform('any')]
# older pandas versions
# s2 = s1[s1.notnull().groupby(level=0).transform('any')]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
Or it is possible to get the unique first-level index values with MultiIndex.get_level_values and filter with .loc:
idx = s1.index.get_level_values(0)
s2 = s1.loc[idx[s1.notna()].unique()]
# older pandas versions
# s2 = s1.loc[idx[s1.notnull()].unique()]
print (s2)
start end
39877380 39877381 14.0
39877383 8.0
158232151 20.0
332086469 NaN
73516838 6439138 1.0
6500551 NaN
673117474 12196071 NaN
12209800 NaN
618058747 6.0
Name: total, dtype: float64
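If readability matters more than raw speed, an equivalent sketch with GroupBy.filter, which drops every first-level group whose values are all missing:
s2 = s1.groupby(level=0).filter(lambda g: g.notna().any())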

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

I have a pandas Dataframe containing EOD financial data (OHLC) for analysis.
I'm using the https://github.com/cirla/tulipy library to generate technical indicator values that take a certain time period as an option. For example, ADX with timeperiod=5 gives the ADX over the last 5 days.
Because of this time period, the generated array of indicator values is always shorter than the DataFrame, since the prices of the first 5 days are needed to generate the ADX for day 6, and so on.
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-LIB for example, this tulip library does not provide NaN-values for these first couple of empty days...
Is there an easy way to prepend these NaN to the ndarray?
Or insert into df at a certain index & have it create NaN for the rows before it automatically?
Thanks in advance, I've been researching for days!
Maybe make the shift yourself in the code?
period = 14
pdi14, mdi14 = ti.di(
    high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.nan
df.loc[df.index[period - 1:], 'mdi_14'] = mdi14  # avoids chained-indexing assignment
I hope they will fill the first values with NAN in the lib in the future. It's dangerous to leave time series data like this without any label.
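If you would rather pad the array itself than pre-allocate the column, a sketch along the same lines (assuming mdi14 and pdi14 are 1-D float arrays of length len(df) - period + 1):
import numpy as np

pad = np.full(period - 1, np.nan)   # one NaN per leading row without an indicator value
df['mdi_14'] = np.concatenate([pad, mdi14])
df['pdi_14'] = np.concatenate([pad, pdi14])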
Full MCVE
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)
b = np.full((6, df.shape[1]), np.nan)
c = np.row_stack([b, a])
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0
The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the python bindings library, tulipy, includes this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
# Create the DataFrame with close prices (a list ordered oldest to newest)
prices = pd.DataFrame(data=[81.06, 81.59, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
                            84.36, 85.53, 86.54, 86.89, 87.77, 87.29], columns=['close'])
# Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real=prices['close'].to_numpy(), period=5, stddev=2)))
# Dynamically realign the index; note from the tulip library documentation that the price/volume data is expected to be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
# Put the indicator values with the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.06 NaN NaN NaN
1 81.59 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 81.494061 82.844 84.193939
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129

Restructuring Dataframe in Python

I have gathered data from the penultimate worksheet in this Excel file along with all the data in the last Worksheet from "Maturity Years" of 5.5 onward. I have code that does this. However, I am now looking to restructure the dataframe such that it has the following columns and am struggling to do this:
My code is below.
import urllib2
import pandas as pd
import os
import xlrd
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)
xd = pd.ExcelFile(socket)
#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)
bigdata = df1.append(df2,ignore_index = True)
Edit: The Dataframe currently looks as follows. The current Dataframe is pretty disorganized unfortunately:
0 1 2 3 4 5 6 \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 Maturity NaN NaN NaN NaN NaN NaN
3 years: NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 2005-01-03 00:00:00 NaN NaN NaN NaN NaN NaN
6 2005-01-04 00:00:00 NaN NaN NaN NaN NaN NaN
... ... ... .. .. ... ... ...
5410 2015-04-20 00:00:00 NaN NaN NaN NaN 0.367987 0.357069
5411 2015-04-21 00:00:00 NaN NaN NaN NaN 0.362478 0.352581
It has 5440 rows and 61 columns
However, I want the dataframe to be of the format:
Date (which is the 2nd column in the current Dataframe), Update time (which would just be a column with datetime.datetime.now()), Currency (which would just be a column with 'GBP'), Maturity Date, Yield Data from the spreadsheet.
I think columns 1, 2, 3, 4, 5 and 6 contain yield curve data. However, I am unsure where the data associated with "Maturity Years" is in the current DataFrame.
I use the pandas.io.excel.read_excel function to read xls from url. Here is one way to clean this UK yield curve dataset.
Note: executing the cubic spline interpolation via the apply function takes quite an amount of time (about 2 minutes on my PC). It interpolates from about 100 points to 300 points, row by row (2638 rows in total).
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
# check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8)
short_end_spot_curve = read_excel(url, sheetname=6)
# preprocessing spot_curve
# ==============================================
# do a few inspection on the table
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape
spot_curve.shape
Out[184]: (2715, 40)
# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape
short_end_spot_curve.shape
Out[185]: (2715, 60)
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
combined_data.head()
combined_data.tail()
combined_data.shape
# deal with NaN: the most sound approach is fit the non-arbitrage NSS curve
# however, this is not currently supported in python.
# do a cubic spline instead
# ==============================================
# if more than half of the maturity points are NaN, then interpolation is likely to be unstable, so I'll remove all rows with NaNs count greater than 50
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
# no. of rows down from 2715 to 2628
combined_data.shape
combined_data.shape
Out[186]: (2628, 100)
from scipy.interpolate import interp1d
# mapping points, monthly frequency, 1 mon to 25 years
maturity = pd.Series((np.arange(12 * 25) + 1) / 12)
# do the interpolation day by day
key = lambda x: x.date
by_day = combined_data.groupby(level=0)
# write out apply function
def interpolate_maturities(group):
    # transpose the row vector to a column vector and drop all NaNs
    a = group.T.dropna().reset_index()
    f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
    return pd.Series(maturity.apply(f).values, index=maturity.values)
# this may take a while ... apply provides flexibility but speed is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)
# a quick look on the data
cleaned_spot_curve.iloc[[1,1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')
