Winsorize dataframe columns per month while ignoring NaN's - python

I have a dataframe with monthly data and the following columns: date, bm and cash
date bm cash
1981-09-30 0.210308 2.487146
1981-10-31 0.241291 2.897529
1981-11-30 0.221529 2.892758
1981-12-31 0.239002 2.726372
1981-09-30 0.834520 4.387087
1981-10-31 0.800472 4.297658
1981-11-30 0.815778 4.459382
1981-12-31 0.836681 4.895269
Now I want to winsorize my data per month while keeping NaN values in the data. I.e. I want to group the data per month and overwrite observations above the 0.99 quantile and below the 0.01 quantile with the 0.99 and 0.01 quantile values, respectively. From Winsorizing data by column in pandas with NaN I found that I should do this with the clip function. My code looks as follows:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date'])
df_grouped = df.groupby(pd.Grouper(freq='M'))
cols = df.columns
for c in cols:
    df[c] = df_grouped[c].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))
I get the following error: ValueError: cannot reindex from a duplicate axis
P.S. I realize that I have not included my required output, but I hope that the required output is clear. Otherwise I can try to put something together.
Edit: The solution from @Allolz is already a great help, but it does not work exactly as it is supposed to. Before running the code from @Allolz, I ran:
df_in.groupby(pd.Grouper(freq='M', key='date'))['secured'].quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which returned:
date
1980-01-31 0.00 1.580564e+00
0.01 1.599805e+00
0.25 2.388106e+00
0.50 6.427071e+00
0.75 1.200685e+01
0.99 5.133111e+01
1.00 5.530329e+01
After winsorizing I get:
date
1980-01-31 0.00 1.599805
0.01 1.617123
0.25 2.388106
0.50 6.427071
0.75 12.006854
0.99 47.756152
1.00 51.331114
It is clear that the new 0.0 and 1.0 quantiles are equal to the original 0.01 and 0.99 quantiles, which is what we would expect. However, the new 0.01 and 0.99 quantiles are not equal to the original 0.01 and 0.99 quantiles, where I would expect that these should remain the same. What can cause this and what could solve it? My hunch is that it might have to do with NaNs in the data, but I'm not sure if that is really the cause.
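Update: a small synthetic example (random normal data, no NaNs at all) suggests the shift comes from the clipping itself rather than from the NaNs: clipping piles roughly 1% of the observations onto each bound, so the new min and max equal the old 0.01 and 0.99 quantiles, while the new 0.01 and 0.99 quantiles interpolate between a bound and its nearest unclipped neighbour and therefore move slightly inwards.
import numpy as np
import pandas as pd

s = pd.Series(np.random.normal(size=10_000))
lb, ub = s.quantile(0.01), s.quantile(0.99)
wins = s.clip(lower=lb, upper=ub)

print(s.quantile([0.01, 0.99]))      # original 0.01 / 0.99 quantiles
print(wins.min(), wins.max())        # equal to the original 0.01 / 0.99 quantiles
print(wins.quantile([0.01, 0.99]))   # slightly inside the original 0.01 / 0.99 quantiles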

One method which will be faster requires you to create helper columns. We will use groupby + transform to broadcast the 0.01 and 0.99 quantiles (for that month's group) back to the DataFrame, and then you can use those Series to clip the original all at once. (clip will leave NaN alone, so it satisfies that requirement too.) Then, if you want, remove the helper columns (I'll leave them in for clarity).
Sample Data
import numpy as np
import pandas as pd
np.random.seed(123)
N = 10000
df = pd.DataFrame({'date': np.random.choice(pd.date_range('2010-01-01', freq='MS', periods=12), N),
                   'val': np.random.normal(1, 0.95, N)})
Code
gp = df.groupby(pd.Grouper(freq='M', key='date'))['val']
# Assign the lower-bound ('lb') and upper-bound ('ub') for Winsorizing
df['lb'] = gp.transform('quantile', 0.01)
df['ub'] = gp.transform('quantile', 0.99)
# Winsorize
df['val_wins'] = df['val'].clip(upper=df['ub'], lower=df['lb'])
Output
The majority of rows will not be changed (only those outside of the 1st-99th percentile range), so we can check the small subset of rows that did change to see that it works. You can see that rows for the same month have the same bounds, and the winsorized value ('val_wins') is properly clipped to the bound it exceeds.
df[df['val'] != df['val_wins']]
# date val lb ub val_wins
#42 2010-09-01 -1.686566 -1.125862 3.206333 -1.125862
#96 2010-04-01 -1.255322 -1.243975 2.995711 -1.243975
#165 2010-08-01 3.367880 -1.020273 3.332030 3.332030
#172 2010-09-01 -1.813011 -1.125862 3.206333 -1.125862
#398 2010-09-01 3.281198 -1.125862 3.206333 3.206333
#... ... ... ... ... ...
#9626 2010-12-01 3.626950 -1.198967 3.249161 3.249161
#9746 2010-11-01 3.472490 -1.259557 3.261329 3.261329
#9762 2010-09-01 3.460467 -1.125862 3.206333 3.206333
#9768 2010-06-01 -1.625013 -1.482529 3.295520 -1.482529
#9854 2010-12-01 -1.475515 -1.198967 3.249161 -1.198967
#
#[214 rows x 5 columns]
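If you want to drop the helper columns afterwards (keeping the names 'lb' and 'ub' used above), something like this will do:
df = df.drop(columns=['lb', 'ub'])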

Related

Issue in executing a specific type of nested 'for' loop on columns of a pandas dataframe

I have a pandas dataframe that has values like below, though in reality I am working with a lot more columns and historical data.
AUD USD JPY EUR
0 0.67 1 140 1.05
I want to iterate over the columns to create a dataframe with columns AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR and JPYEUR,
where, for example, AUDUSD is calculated as the product of the AUD column and the USD column.
I tried below
for col in df:
    for cols in df:
        cf[col+cols] = df[col]*df[cols]
But it generates a table with unnecessary values like AUDAUD, USDUSD or duplicate values like AUDUSD and USDAUD. I think if I can somehow set "cols = col+1 till end of df" in the second for loop I should be able to resolve the issue, but I don't know how to do that.
The result I am looking for is a table with the below columns and their values:
AUDUSD, AUDJPY, AUDEUR, USDJPY, USDEUR, JPYEUR
You can use itertools.combinations with pandas.Series.mul and pandas.concat.
Try this :
from itertools import combinations

combos = list(combinations(df.columns, 2))

out = pd.concat([df[col[1]].mul(df[col[0]]) for col in combos], axis=1, keys=combos)

out.columns = out.columns.map("".join)
# Output :
print(out)
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
# Used input :
df = pd.DataFrame({'AUD': [0.67], 'USD': [1], 'JPY': [140], 'EUR': [1.05]})
Since your first instinct was to use an inner/outer loop, here is a solution that works in the same spirit:
# Added a Second Row for testing
df = pd.DataFrame(
    {'AUD': [0.67, 0.91], 'USD': [1, 1], 'JPY': [140, 130], 'EUR': [1.05, 1]},
)
# Instantiated the Second DataFrame
cf = pd.DataFrame()
# Call the index of the columns as an integer
for i in range(len(df.columns)):
    # Increment the index + 1, so you aren't looking at the same column twice
    # Also, limit the range to the length of your columns
    for j in range(i+1, len(df.columns)):
        print(f'{df.columns[i]}' + f'{df.columns[j]}')  # VERIFY
        # Create a variable of the column names mashed together
        combine = f'{df.columns[i]}' + f'{df.columns[j]}'
        # Assign the rows to be a product of the mashed column series
        cf[combine] = df[df.columns[i]] * df[df.columns[j]]
print(cf)  # VERIFY
The console Log looks like this:
AUDUSD
AUDJPY
AUDEUR
USDJPY
USDEUR
JPYEUR
AUDUSD AUDJPY AUDEUR USDJPY USDEUR JPYEUR
0 0.67 93.8 0.7035 140 1.05 147.0
1 0.91 118.3 0.9100 130 1.00 130.0
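If you prefer to stay closer to numpy, another possible sketch (just a variant, not from either answer above) uses the upper-triangular index pairs so that each pair of columns is multiplied exactly once:
import numpy as np
import pandas as pd

i, j = np.triu_indices(len(df.columns), k=1)   # unique column pairs with i < j
pairs = pd.DataFrame(
    df.values[:, i] * df.values[:, j],
    columns=[df.columns[a] + df.columns[b] for a, b in zip(i, j)],
    index=df.index,
)
print(pairs)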

Lowpass filter to get second derivative of data from pandas dataframe

I have a dataframe (an excerpt is shown below):
Time head hip_center left_ankle
0 0.00 1916.654646 1487.842416 1152.102052
1 0.01 1916.800455 1487.870595 1152.110548
2 0.02 1916.913416 1487.934406 1152.113837
3 0.03 1916.992517 1488.334658 1152.083790
4 0.04 1917.109599 1488.298676 1152.239034
And what I want to do is calculate the acceleration of each column for each row, i.e. the second derivative. But I also need to apply a lowpass filter to filter out the noise.
I've defined the filter like so:
#Lowpass filter
from scipy.signal import butter, filtfilt
def butter_lowpass_filter(data, cutoff, fs, order):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    # Get the filter coefficients
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    y = filtfilt(b, a, data.iloc[:, 1:])
    return y
#call function
butter_lowpass_filter(acc_filt, 8, 100, 2)
So what I want is, for each column and each row, to get the second derivative with a filter applied to smooth it. I'm not so sure, however, at what stage I should calculate the derivative, and how. I originally did it with .diff(), but that is what gives all the crazy noise in the results.
I want the output to also be in a dataframe. Any help on this?
EDIT: I know about the savgol filter, I will also be applying this, but I need to do a lowpass one as well independently.
EDIT2: This is the formula for the second derivative:
A(t) = (x(t+1) - 2 * x(t) + x(t-1)) / (SamplingPeriod * SamplingPeriod)
where Sampling Period is 0.01 secs.
Borrowing from this answer here, which used shift to compute the second derivative:
df = df.set_index("Time")
for col in df:
df[col+"_second_der"] = df[col] - 2*df[col].shift(1) + df[col].shift(2)
Output:
head hip_center left_ankle head_second_der hip_center_second_der left_ankle_second_der
Time
0.00 1916.65 1487.84 1152.10 NaN NaN NaN
0.01 1916.80 1487.87 1152.11 NaN NaN NaN
0.02 1916.91 1487.93 1152.11 -0.03 0.04 -0.01
0.03 1916.99 1488.33 1152.08 -0.03 0.34 -0.03
0.04 1917.11 1488.30 1152.24 0.04 -0.44 0.19
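If you also want to include the lowpass filtering and the division by the sampling period from your formula, a possible sketch (assuming the column names from your excerpt, that 'Time' is already the index as above, a 100 Hz sampling rate and the 8 Hz cutoff you used) is to filter each column first and then take the centred second difference:
import pandas as pd
from scipy.signal import butter, filtfilt

def butter_lowpass_filter(data, cutoff, fs, order):
    nyq = 0.5 * fs
    b, a = butter(order, cutoff / nyq, btype='low', analog=False)
    return filtfilt(b, a, data)

dt = 0.01  # sampling period in seconds (fs = 100 Hz)
acc = pd.DataFrame(index=df.index)
for col in ['head', 'hip_center', 'left_ankle']:
    smoothed = pd.Series(butter_lowpass_filter(df[col].to_numpy(), cutoff=8, fs=100, order=2),
                         index=df.index)
    # centred second difference: (x(t+1) - 2*x(t) + x(t-1)) / dt**2
    acc[col + '_acc'] = (smoothed.shift(-1) - 2 * smoothed + smoothed.shift(1)) / dt**2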
Let me know whether this works for you!

Conditional If Statement applied to multiple columns of dataframe

I have a dataframe of minute stock returns and I would like to create a new column that is conditional on whether a limit was exceeded (positive or negative): if so, that row is equal to the limit (positive or negative), otherwise it is equal to the last return column that was checked. The example below illustrates this:
import pandas as pd
dict = [
    {'ticker': 'jpm', 'date': '2016-11-28', 'returns1': 0.02, 'returns2': 0.03, 'limit': 0.1},
    {'ticker': 'ge', 'date': '2016-11-28', 'returns1': 0.2, 'returns2': -0.3, 'limit': 0.1},
    {'ticker': 'fb', 'date': '2016-11-28', 'returns1': -0.2, 'returns2': 0.5, 'limit': 0.1},
]
df = pd.DataFrame(dict)
df['date'] = pd.to_datetime(df['date'])
df=df.set_index(['date','ticker'], drop=True)
The target would be this:
fin_return limit returns1 returns2
date ticker
2016-11-28 jpm 0.03 0.1 0.02 0.03
ge 0.10 0.1 0.20 -0.30
fb -0.10 0.1 -0.20 0.50
So in the first row, the returns never exceeded the limit, so the value becomes equal to the value in returns2 (0.03). In row 2, the returns were exceeded on the upside, so the value should be the positive limit. In row 3, the returns were exceeded on the downside first, so the value should be the negative limit.
My actual dataframe has a couple thousand columns, so I am not quite sure how to do this (maybe a loop?). I appreciate any suggestions.
The idea is to test a stop-loss or limit trading algorithm. Whenever the lower limit is triggered, it should replace the final column with the lower limit; the same goes for the upper limit, whichever comes first for that row. Once either one is triggered, the rest of that row is irrelevant and the next row should be tested.
I am adding a different example with one more column here to make this a bit clearer (the limit is +/- 0.1)
fin_return limit returns1 returns2 returns3
date ticker
2016-11-28 jpm 0.02 0.1 0.01 0.04 0.02
ge 0.10 0.1 0.20 -0.30 0.6
fb -0.10 0.1 -0.02 -0.20 0.7
In the first row, the limit was never triggered, so the final return is from returns3 (0.02). In row 2, the limit was triggered on the upside in returns1, so fin_return is equal to the upper limit (anything that happens in returns2 and returns3 is irrelevant for this row). In row 3, the limit was exceeded on the downside in returns2, so fin_return becomes -0.1, and anything in returns3 is irrelevant.
Use:
import numpy as np

dict = [
    {'ticker': 'jpm', 'date': '2016-11-28', 'returns1': 0.02, 'returns2': 0.03, 'limit': 0.1, 'returns3': 0.02},
    {'ticker': 'ge', 'date': '2016-11-28', 'returns1': 0.2, 'returns2': -0.3, 'limit': 0.1, 'returns3': 0.6},
    {'ticker': 'fb', 'date': '2016-11-28', 'returns1': -0.02, 'returns2': -0.2, 'limit': 0.1, 'returns3': 0.7},
]
df = pd.DataFrame(dict)
df['date'] = pd.to_datetime(df['date'])
df = df.set_index(['date', 'ticker'], drop=True)
# select all columns except the first (here the 'limit' column)
df1 = df.iloc[:, 1:]
# compare whether all columns are within the +/- limit
mask = df1.lt(df['limit'], axis=0) & df1.gt(-df['limit'], axis=0)
m1 = mask.all(axis=1)
print (m1)
date ticker
2016-11-28 jpm True
ge False
fb False
dtype: bool
# mask values that are within the limits with NaN, back-fill across columns,
# select the first column (the first out-of-limit value per row) and test whether it exceeds the upper limit
m2 = df1.mask(mask).bfill(axis=1).iloc[:, 0].gt(df['limit'])
print (m2)
date ticker
2016-11-28 jpm False
ge True
fb False
dtype: bool
arr = np.select([m1,m2, ~m2], [df1.iloc[:, -1], df['limit'], -df['limit']])
#set first column in DataFrame by insert
df.insert(0, 'fin_return', arr)
print (df)
fin_return limit returns1 returns2 returns3
date ticker
2016-11-28 jpm 0.02 0.1 0.02 0.03 0.02
ge 0.10 0.1 0.20 -0.30 0.60
fb -0.10 0.1 -0.02 -0.20 0.70
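Since you mention a couple of thousand columns: the steps above already work for any number of return columns, but as a sketch you could wrap them in a small helper (first_breach_return is just a made-up name here; it assumes 'limit' is the first column and the return columns follow in chronological order):
import numpy as np
import pandas as pd

def first_breach_return(frame):
    returns = frame.iloc[:, 1:]                      # every column after 'limit'
    within = returns.lt(frame['limit'], axis=0) & returns.gt(-frame['limit'], axis=0)
    never_breached = within.all(axis=1)
    # first out-of-limit value per row (NaN if the limits are never breached)
    first_outside = returns.mask(within).bfill(axis=1).iloc[:, 0]
    breached_up = first_outside.gt(frame['limit'])
    return pd.Series(np.select([never_breached, breached_up, ~breached_up],
                               [returns.iloc[:, -1], frame['limit'], -frame['limit']]),
                     index=frame.index, name='fin_return')

# usage, on a frame whose first column is 'limit' (before 'fin_return' is inserted):
# df.insert(0, 'fin_return', first_breach_return(df))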

Winsorizing data by column in pandas with NaN

I'd like to winsorize several columns of data in a pandas Data Frame. Each column has some NaN, which affects the winsorization, so they need to be removed. The only way I know how to do this is to remove them for all of the data, rather than remove them only column-by-column.
MWE:
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize
# Create Dataframe
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one','two','three','four']
# Now scale them differently so you can see the winsorization
df['four'] = df['four']*(10**5)
df['three'] = df['three']*(10**2)
df['two'] = df['two']*(10**-1)
df['one'] = df['one']*(10**-4)
# Create NaN
df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan
Here is the baseline distribution:
df.quantile([0, 0.01, 0.5, 0.99, 1])
output:
one two three four
0.00 2.336618e-10 2.294259e-07 0.002437 2.305353
0.01 9.862626e-07 9.742568e-04 0.975807 1003.814520
0.50 4.975859e-05 4.981049e-02 50.290946 50374.548980
0.99 9.897463e-05 9.898590e-02 98.978263 98991.438985
1.00 9.999983e-05 9.999966e-02 99.996793 99999.437779
This is how I'm winsorizing:
def using_mstats(s):
    return winsorize(s, limits=[0.01, 0.01])
wins = df.apply(using_mstats, axis=0)
wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which gives this:
Out[356]:
one two three four
0.00 0.000001 0.001060 1.536882 1003.820149
0.01 0.000001 0.001060 1.536882 1003.820149
0.25 0.000025 0.024975 25.200378 25099.994780
0.50 0.000050 0.049810 50.290946 50374.548980
0.75 0.000075 0.074842 74.794537 75217.343920
0.99 0.000099 0.098986 98.978263 98991.436957
1.00 0.000100 0.100000 99.996793 98991.436957
Column four is correct because it has no NaN, but the others are incorrect. The 99th percentile and the max should be the same. The observation counts are identical for both:
In [357]: df.count()
Out[357]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
In [358]: wins.count()
Out[358]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
This is how I can 'solve' it, but at the cost of losing a lot of my data:
wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Output:
Out[360]:
one two three four
0.00 9.686203e-07 0.000928 0.965702 1005.209503
0.01 9.686203e-07 0.000928 0.965702 1005.209503
0.25 2.486052e-05 0.024829 25.204032 25210.837443
0.50 4.980946e-05 0.049894 50.299004 50622.227179
0.75 7.492750e-05 0.075059 74.837900 75299.906415
0.99 9.895563e-05 0.099014 98.972310 99014.311761
1.00 9.895563e-05 0.099014 98.972310 99014.311761
In [361]: wins2.count()
Out[361]:
one 51700
two 51700
three 51700
four 51700
dtype: int64
How can I winsorize the non-NaN data in each column while maintaining the data shape (i.e. not removing rows)?
As often happens, simply creating the MWE helped clarify. I need to use clip() in combination with quantile() as below:
df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Output:
one two three four
0.00 9.862626e-07 0.000974 0.975807 1003.814520
0.01 9.862666e-07 0.000974 0.975816 1003.820092
0.25 2.485043e-05 0.024975 25.200378 25099.994780
0.50 4.975859e-05 0.049810 50.290946 50374.548980
0.75 7.486737e-05 0.074842 74.794537 75217.343920
0.99 9.897462e-05 0.098986 98.978245 98991.436977
1.00 9.897463e-05 0.098986 98.978263 98991.438985
In [384]: df2.count()
Out[384]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
The numbers are different from above because I have maintained all of the data in each column that is not missing (NaN).
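If you would rather keep scipy's winsorize (note it works on order statistics, so the cut-offs can differ slightly from the interpolated quantiles that clip/quantile uses), a sketch of an alternative is to winsorize only the non-NaN values of each column and write them back in place, which also keeps the data shape:
import numpy as np
from scipy.stats.mstats import winsorize

def using_mstats_nan(s):
    out = s.copy()
    notna = s.notna()
    out[notna] = np.asarray(winsorize(s[notna].to_numpy(), limits=[0.01, 0.01]))
    return out

wins3 = df.apply(using_mstats_nan, axis=0)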

Group rows in fixed duration windows that satisfy multiple conditions

I have a df as below. Consider that df is indexed by timestamps with dtype='datetime64[ns]', e.g. 1970-01-01 00:00:27.603046999; I am putting dummy timestamps here.
Timestamp Address Type Arrival_Time Time_Delta
0.1 2 A 0.25 0.15
0.4 3 B 0.43 0.03
0.9 1 B 1.20 0.20
1.3 1 A 1.39 0.09
1.5 3 A 1.64 0.14
1.7 3 B 1.87 0.17
2.0 3 A 2.09 0.09
2.1 1 B 2.44 0.34
I have three unique "addresses" (1, 2,3).
I have two unique "types" (A, B)
Now I am trying to do two things in a simple way (possibly using the pd.Grouper and groupby functions in pandas).
I want to group rows into fixed bins of 1 second duration (using the timestamp values). Then, in each 1-second bin, for each "Address", find the mean and sum of "Time_Delta" only if "Type" = A.
I want to group rows into fixed bins of 1 second duration (using the timestamp values). Then, in each bin, for each "Address", find the mean and sum of the Inter-Arrival Time (IAT).
IAT = Arrival Time (i) - Arrival Time (i-1)
Note: if the timestamps span 100 seconds, we should have exactly 100 rows in the output dataframe and six columns, i.e. two (mean, sum) for each address.
For Problem 1:
I tried the following code:
df = pd.DataFrame({'Timestamp': Timestamp, 'Address': Address, 'Type': Type,
                   'Arrival_Time': Arrival_time, 'Time_Delta': Time_delta})
# Set index to Datetime
index = pd.DatetimeIndex(df[df.columns[3]]*10**9) # Convert the timestamp column into datetime format
df = df.set_index(index) # Set timestamp as index
df_1 = df[df.columns[2]].groupby([pd.TimeGrouper('1S'), df['Address']]).mean().unstack(fill_value=0)
which gives results:
Timestamp 1 2 3
1970-01-01 00:00:00 0.20 0.15 0.030
1970-01-01 00:00:01 0.09 0.00 0.155
1970-01-01 00:00:02 0.34 0.00 0.090
As you can see, it gives the mean Time_Delta for each address in each 1S bin, but I want to add the second condition, i.e. find the mean for each address only if Type=A. I hope Problem 1 is now clear.
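One thing I have considered (just a sketch, using the dummy column names from above) is to filter on Type first and then group by the 1-second bin and the address, but I am not sure this is the right approach:
df_a = df[df['Type'] == 'A']
df_1A = (df_a.groupby([pd.Grouper(freq='1S'), 'Address'])['Time_Delta']
             .agg(['mean', 'sum'])
             .unstack(fill_value=0))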
For Problem 2:
It's a bit complicated. I want to get the mean IAT for each address in the same format (see below):
One possible way is to add an extra column df['IAT'] to the original df, for example:
df['IAT'] = df['Arrival_Time'] - df['Arrival_Time'].shift(1)  # Arrival_Time(i) - Arrival_Time(i-1)
Then apply the same code as above to find the mean of IAT for each address if Type=A.
Actual Data
Timestamp Address Type Time Delta Arrival Time
1970-01-01 00:00:00.000000000 28:5a:ec:16:00:22 Control frame 0.000000 Nov 10, 2017 22:39:20.538561000
1970-01-01 00:00:00.000287000 28:5a:ec:16:00:23 Data frame 0.000287 Nov 10, 2017 22:39:20.548121000
1970-01-01 00:00:00.000896000 28:5a:ec:16:00:22 Control frame 0.000609 Nov 10, 2017 22:39:20.611256000
1970-01-01 00:00:00.001388000 28:5a:ec:16:00:21 Data frame 0.000492 Nov 10, 2017 22:39:20.321745000
... ...
