Winsorize a dataframe with percentile values - python

I'd like to replicate this method of winsorizing a dataframe with specified percentile bounds in Python. I tried scipy's winsorize function, but it didn't give the results I was looking for.
Example expected output for a dataframe winsorized at the 0.01 (low) and 0.99 (high) quantiles across each date:
Original df:
A B C D E
2020-06-30 0.033 -0.182 -0.016 0.665 0.025
2020-07-31 0.142 -0.175 -0.016 0.556 0.024
2020-08-31 0.115 -0.187 -0.017 0.627 0.027
2020-09-30 0.032 -0.096 -0.022 0.572 0.024
Winsorized data:
A B C D E
2020-06-30 0.033 -0.175 -0.016 0.64 0.025
2020-07-31 0.142 -0.169 -0.016 0.54 0.024
2020-08-31 0.115 -0.18 -0.017 0.606 0.027
2020-09-30 0.032 -0.093 -0.022 0.55 0.024
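One way to get this behaviour (a sketch, not necessarily the method the expected output came from) is to compute per-row quantiles with `DataFrame.quantile(axis=1)` and clip each row against them with `DataFrame.clip(axis=0)`. Using the first two example rows:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0.033, 0.142], "B": [-0.182, -0.175], "C": [-0.016, -0.016],
     "D": [0.665, 0.556], "E": [0.025, 0.024]},
    index=["2020-06-30", "2020-07-31"],
)

# Per-date (row-wise) quantile bounds
low = df.quantile(0.01, axis=1)
high = df.quantile(0.99, axis=1)

# Clip each row to its own bounds
winsorized = df.clip(lower=low, upper=high, axis=0)
print(winsorized.round(3))
```

For these rows this clips B up to about -0.175 / -0.169 and D down to about 0.640 / 0.539, which matches the expected output to rounding.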


Conditional mean of a dataframe based on datetime column names

I'm new to Python. I'm looking for a way to generate the mean of row values based on column names (the column names are dates running from January to December). I want to generate a mean for every 10 days over the period of a year. My dataframe is in the format below (2000 rows):
import pandas as pd
df = pd.DataFrame({'A': [81, 80.09, 83, 85, 88],
                   'B': [21.8, 22.04, 21.8, 21.7, 22.06],
                   '20210113': [0, 0.05, 0, 0, 0.433],
                   '20210122': [0, 0.13, 0, 0, 0.128],
                   '20210125': [0.056, 0, 0.043, 0.062, 0.16],
                   '20210213': [0.9, 0.56, 0.32, 0.8, 0],
                   '20210217': [0.7, 0.99, 0.008, 0.23, 0.56],
                   '20210219': [0.9, 0.43, 0.76, 0.98, 0.5]})
Expected Output:
In [2]: df
Out[2]:
A B C (mean of 20210111..20210119) D (mean of 20210120..20210129) ...
0 81 21.8
1 80.09 22.04
2 83 21.8
3 85 21.7
4 88 22.06
One way would be to isolate the date columns from the rest of the DataFrame, transpose them so that normal grouping operations apply, then transpose back and merge onto the unaffected portion of the DataFrame.
import pandas as pd
df = pd.DataFrame({'A': [81, 80.09, 83, 85, 88],
                   'B': [21.8, 22.04, 21.8, 21.7, 22.06],
                   '20210113A.2': [0, 0.05, 0, 0, 0.433],
                   '20210122B.1': [0, 0.13, 0, 0, 0.128],
                   '20210125C.3': [0.056, 0, 0.043, 0.062, 0.16],
                   '20210213': [0.9, 0.56, 0.32, 0.8, 0],
                   '20210217': [0.7, 0.99, 0.008, 0.23, 0.56],
                   '20210219': [0.9, 0.43, 0.76, 0.98, 0.5]})
# Unaffected Columns Go Here
keep_columns = ['A', 'B']
# Get All Affected Columns
new_df = df.loc[:, ~df.columns.isin(keep_columns)]
# Strip Extra Information From Column Names
new_df.columns = new_df.columns.map(lambda c: c[0:8])
# Transpose
new_df = new_df.T
# Convert index to DateTime for easy use
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
# Resample every 10 Days on new DT index (Drop any rows with no values)
new_df = new_df.resample("10D").mean().dropna(how='all')
# Transpose and Merge Back on DF
df = df[keep_columns].merge(new_df.T, left_index=True, right_index=True)
# For Display
print(df.to_string())
Output:
A B 2021-01-13 00:00:00 2021-01-23 00:00:00 2021-02-12 00:00:00
0 81.00 21.80 0.0000 0.056 0.833333
1 80.09 22.04 0.0900 0.000 0.660000
2 83.00 21.80 0.0000 0.043 0.362667
3 85.00 21.70 0.0000 0.062 0.670000
4 88.00 22.06 0.2805 0.160 0.353333
Step by step: after selecting the date columns, stripping the column-name suffixes, and transposing, new_df looks like this:
new_df
0 1 2 3 4
20210113 0.000 0.05 0.000 0.000 0.433
20210122 0.000 0.13 0.000 0.000 0.128
20210125 0.056 0.00 0.043 0.062 0.160
20210213 0.900 0.56 0.320 0.800 0.000
20210217 0.700 0.99 0.008 0.230 0.560
20210219 0.900 0.43 0.760 0.980 0.500
new_df.index = pd.to_datetime(new_df.index, format='%Y%m%d')
new_df
0 1 2 3 4
2021-01-13 0.000 0.05 0.000 0.000 0.433
2021-01-22 0.000 0.13 0.000 0.000 0.128
2021-01-25 0.056 0.00 0.043 0.062 0.160
2021-02-13 0.900 0.56 0.320 0.800 0.000
2021-02-17 0.700 0.99 0.008 0.230 0.560
2021-02-19 0.900 0.43 0.760 0.980 0.500
new_df = new_df.resample("10D").mean().dropna(how='all')
new_df
0 1 2 3 4
2021-01-13 0.000000 0.09 0.000000 0.000 0.280500
2021-01-23 0.056000 0.00 0.043000 0.062 0.160000
2021-02-12 0.833333 0.66 0.362667 0.670 0.353333
new_df.T
2021-01-13 2021-01-23 2021-02-12
0 0.0000 0.056 0.833333
1 0.0900 0.000 0.660000
2 0.0000 0.043 0.362667
3 0.0000 0.062 0.670000
4 0.2805 0.160 0.353333
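One detail worth noting: resample("10D") anchors its 10-day bins at the first timestamp in the index (2021-01-13 above), which is why the output groups start at Jan 13 rather than at the Jan 11 / Jan 20 boundaries implied by the expected output. Passing origin (available since pandas 1.1) pins the bin edges explicitly; a small sketch with made-up values:

```python
import pandas as pd

# Made-up series on the same dates as the example
idx = pd.to_datetime(["20210113", "20210122", "20210125", "20210213"],
                     format="%Y%m%d")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Anchor the 10-day bins at 2021-01-11 instead of the first data point
out = s.resample("10D", origin=pd.Timestamp("2021-01-11")).mean().dropna()
print(out)  # bins labelled 2021-01-11, 2021-01-21, 2021-02-10
```

The bins are closed/labelled on the left by default, so 2021-01-22 and 2021-01-25 fall together into the [2021-01-21, 2021-01-31) window.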

All non-significant or NaN p-values in Logit

I'm running a logit with statsmodels that has around 25 regressors, a mix of categorical, ordinal, and continuous variables.
My code is the following, with its output:
a = np.asarray(data_nobands[[*all 25 columns*]], dtype=float)
mod_logit = sm.Logit(np.asarray(data_nobands['cured'], dtype=float), a)
logit_res = mod_logit.fit(method="nm", cov_type="cluster",
                          cov_kwds={"groups": data_nobands['AGREEMENT_NUMBER']})
"""
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 17316
Model: Logit Df Residuals: 17292
Method: MLE Df Model: 23
Date: Wed, 05 Aug 2020 Pseudo R-squ.: -0.02503
Time: 19:49:27 Log-Likelihood: -10274.
converged: False LL-Null: -10023.
Covariance Type: cluster LLR p-value: 1.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 3.504e-05 0.009 0.004 0.997 -0.017 0.017
x2 1.944e-05 nan nan nan nan nan
x3 3.504e-05 2.173 1.61e-05 1.000 -4.259 4.259
x4 3.504e-05 2.912 1.2e-05 1.000 -5.707 5.707
x5 3.504e-05 0.002 0.016 0.988 -0.004 0.004
x6 3.504e-05 0.079 0.000 1.000 -0.154 0.154
x7 3.504e-05 0.003 0.014 0.989 -0.005 0.005
x8 3.504e-05 0.012 0.003 0.998 -0.023 0.023
x9 3.504e-05 0.020 0.002 0.999 -0.039 0.039
x10 3.504e-05 0.021 0.002 0.999 -0.041 0.041
x11 3.504e-05 0.011 0.003 0.997 -0.021 0.022
x12 8.831e-06 5.74e-06 1.538 0.124 -2.42e-06 2.01e-05
x13 4.82e-06 9.23e-06 0.522 0.602 -1.33e-05 2.29e-05
x14 3.504e-05 0.000 0.248 0.804 -0.000 0.000
x15 3.504e-05 4.02e-05 0.871 0.384 -4.38e-05 0.000
x16 1.815e-05 1.58e-05 1.152 0.249 -1.27e-05 4.9e-05
x17 3.504e-05 0.029 0.001 0.999 -0.057 0.057
x18 3.504e-05 0.000 0.190 0.849 -0.000 0.000
x19 9.494e-06 nan nan nan nan nan
x20 1.848e-05 nan nan nan nan nan
x21 3.504e-05 0.026 0.001 0.999 -0.051 0.051
x22 3.504e-05 0.037 0.001 0.999 -0.072 0.072
x23 -0.0005 0.000 -2.596 0.009 -0.001 -0.000
x24 3.504e-05 0.006 0.006 0.995 -0.011 0.011
x25 3.504e-05 0.011 0.003 0.998 -0.022 0.022
==============================================================================
"""
With any other method, such as bfgs, lbfgs, or minimize, the output is the following:
"""
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 17316
Model: Logit Df Residuals: 17292
Method: MLE Df Model: 23
Date: Wed, 05 Aug 2020 Pseudo R-squ.: -0.1975
Time: 19:41:22 Log-Likelihood: -12003.
converged: False LL-Null: -10023.
Covariance Type: cluster LLR p-value: 1.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 0 0.152 0 1.000 -0.299 0.299
x2 0 724.618 0 1.000 -1420.225 1420.225
x3 0 20.160 0 1.000 -39.514 39.514
x4 0 23.008 0 1.000 -45.094 45.094
x5 0 0.010 0 1.000 -0.020 0.020
x6 0 1.335 0 1.000 -2.617 2.617
x7 0 0.020 0 1.000 -0.039 0.039
x8 0 0.109 0 1.000 -0.214 0.214
x9 0 0.070 0 1.000 -0.137 0.137
x10 0 0.175 0 1.000 -0.343 0.343
x11 0 0.045 0 1.000 -0.088 0.088
x12 0 1.24e-05 0 1.000 -2.42e-05 2.42e-05
x13 0 2.06e-05 0 1.000 -4.04e-05 4.04e-05
x14 0 0.001 0 1.000 -0.002 0.002
x15 0 5.16e-05 0 1.000 -0.000 0.000
x16 0 1.9e-05 0 1.000 -3.73e-05 3.73e-05
x17 0 0.079 0 1.000 -0.155 0.155
x18 0 0.000 0 1.000 -0.001 0.001
x19 0 1145.721 0 1.000 -2245.573 2245.573
x20 0 nan nan nan nan nan
x21 0 0.028 0 1.000 -0.055 0.055
x22 0 0.037 0 1.000 -0.072 0.072
x23 0 0.000 0 1.000 -0.000 0.000
x24 0 0.005 0 1.000 -0.010 0.010
x25 0 0.015 0 1.000 -0.029 0.029
==============================================================================
"""
As you can see, I get p-values that are either nan or wildly non-significant.
What could the problem be?

Values disappear in dataframe multiindex after set_index()

I have a dataframe that looks like that:
scn cl_name lqd_mp lqd_wp gas_mp gas_wp res_mp res_wp
12 C6 Hexanes 3.398 1.723 2.200 5.835 2.614 2.775
13 NaN Me-Cyclo-pentane 1.193 0.591 0.439 1.146 0.707 0.733
14 NaN Benzene 0.037 0.017 0.013 0.030 0.021 0.020
15 NaN Cyclo-hexane 1.393 0.690 0.697 1.820 0.944 0.979
16 C7 Heptanes 6.359 3.748 1.122 3.477 2.980 3.679
17 NaN Me-Cyclo-hexane 4.355 2.515 0.678 2.068 1.985 2.401
18 NaN Toluene 0.407 0.220 0.061 0.174 0.183 0.208
19 C8 Octanes 10.277 6.901 0.692 2.438 4.092 5.759
20 NaN Ethyl-benzene 0.146 0.091 0.010 0.032 0.058 0.076
21 NaN Meta/Para-xylene 0.885 0.553 0.029 0.095 0.333 0.436
22 NaN Ortho-xylene 0.253 0.158 0.002 0.007 0.091 0.119
23 C9 Nonanes 8.683 6.552 0.280 1.113 3.266 5.160
24 NaN Tri-Me-benzene 0.496 0.351 0.000 0.000 0.176 0.261
25 C10 Decanes 8.216 6.877 0.108 0.451 2.985 5.233
I'd like to replace all the NaN values with the value from the previous row of the 'scn' column, and then reindex the dataframe using a multiindex on the two columns 'scn' and 'cl_name'.
I do it with these two lines of code:
df['scn'] = df['scn'].ffill()
df.set_index(['scn', 'cl_name'], inplace=True)
The first line with ffill() does what I want, replacing the NaNs with the values above them. But after set_index() those values seem to disappear, leaving blank cells.
lqd_mp lqd_wp gas_mp gas_wp res_mp res_wp
scn cl_name
C6 Hexanes 3.398 1.723 2.200 5.835 2.614 2.775
Me-Cyclo-pentane 1.193 0.591 0.439 1.146 0.707 0.733
Benzene 0.037 0.017 0.013 0.030 0.021 0.020
Cyclo-hexane 1.393 0.690 0.697 1.820 0.944 0.979
C7 Heptanes 6.359 3.748 1.122 3.477 2.980 3.679
Me-Cyclo-hexane 4.355 2.515 0.678 2.068 1.985 2.401
Toluene 0.407 0.220 0.061 0.174 0.183 0.208
C8 Octanes 10.277 6.901 0.692 2.438 4.092 5.759
Ethyl-benzene 0.146 0.091 0.010 0.032 0.058 0.076
Meta/Para-xylene 0.885 0.553 0.029 0.095 0.333 0.436
Ortho-xylene 0.253 0.158 0.002 0.007 0.091 0.119
C9 Nonanes 8.683 6.552 0.280 1.113 3.266 5.160
Tri-Me-benzene 0.496 0.351 0.000 0.000 0.176 0.261
C10 Decanes 8.216 6.877 0.108 0.451 2.985 5.233
I'd like no blanks in the 'scn' part of the index. What am I doing wrong?
Thanks
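Most likely nothing is being lost: after set_index, repeated values in an outer MultiIndex level are printed only once, and the "blanks" are just this sparsified display. A minimal sketch (with a made-up subset of the rows) showing the values are still there, and how to print them repeated:

```python
import pandas as pd

df = pd.DataFrame({
    "scn": ["C6", None, "C7"],
    "cl_name": ["Hexanes", "Benzene", "Heptanes"],
    "lqd_mp": [3.398, 0.037, 6.359],
})
df["scn"] = df["scn"].ffill()
df.set_index(["scn", "cl_name"], inplace=True)

# The repeated labels are still in the index...
print(list(df.index.get_level_values("scn")))  # ['C6', 'C6', 'C7']

# ...they are only hidden when printing. To display them repeated:
with pd.option_context("display.multi_sparse", False):
    print(df)
```

So lookups like df.loc[("C6", "Benzene")] work exactly as expected; only the default rendering hides the repeats.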

Pandas : interpolate a dataframe and replace values

For each column of a dataframe, I did an interpolation using the pandas function interpolate, and I'm trying to replace the values of the dataframe with the values of the interpolated curve (like a trend curve in Excel).
I have the following dataframe, named data:
0 1
0 0.000 0.002
1 0.001 0.002
2 0.001 0.003
3 0.003 0.004
4 0.003 0.005
5 0.003 0.005
6 0.004 0.006
7 0.005 0.006
8 0.006 0.007
9 0.006 0.007
10 0.007 0.008
11 0.007 0.009
12 0.008 0.010
13 0.008 0.010
14 0.010 0.012
I then ran the following code:
for i in range(len(data.columns)):
    data[i].interpolate(method="polynomial", order=2, inplace=True)
I thought that inplace would replace the values, but it doesn't seem to work. Does someone know how to do that?
Thanks and have a good day :)
Try this,
import pandas as pd
import numpy as np
I created a mini text file with some crazy values so you can see how interpolate is working.
File looks like this,
0,1
0.0,.002
0.001,.3
NaN,NaN
4.003,NaN
.004,19
.005,234
NaN,444
1,777
Here is how to import and process your data:
df = pd.read_csv('datafile.txt', header=0)
for column in df:
    # assign back rather than relying on inplace=True on a column
    # selection, which can silently operate on a copy
    df[column] = df[column].interpolate(method="polynomial", order=2)
print(df.head())
the dataframe now looks like this,
0 1
0 0.000000 0.002000
1 0.001000 0.300000
2 2.943616 -30.768123
3 4.003000 -70.313176
4 0.004000 19.000000
5 0.005000 234.000000
6 0.616931 444.000000
7 1.000000 777.000000
Also, if you mean you want to interpolate between the points in your dataframe, that is something different. Something like that would be:
df1 = df.reindex(df.index.union(np.linspace(.11,.25,8)))
df1.interpolate('index')
the results of that look like,
0 1
0.00 0.00000 0.00200
0.11 0.00011 0.03478
0.13 0.00013 0.04074
0.15 0.00015 0.04670
0.17 0.00017 0.05266
0.19 0.00019 0.05862
0.21 0.00021 0.06458
0.23 0.00023 0.07054
0.25 0.00025 0.07650
1.00 0.00100 0.30000
In the end I got it working with scipy.interpolate.UnivariateSpline.
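That makes sense: pandas' interpolate only fills NaN entries and never changes existing values, so it cannot produce an Excel-style trend curve. Fitting a spline and evaluating it at every point does replace all values with the fitted trend. A sketch of that idea (the quadratic data here is made up):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

x = np.arange(10, dtype=float)
y = 0.5 * x**2 + 1.0                 # made-up column values

# Fit a degree-2 smoothing spline, then evaluate it at every x,
# replacing the original values with the fitted trend curve
spline = UnivariateSpline(x, y, k=2)
fitted = spline(x)
print(fitted.round(3))
```

For a dataframe you would do this per column, e.g. data[col] = UnivariateSpline(data.index, data[col], k=2)(data.index); the smoothing factor s controls how closely the curve follows the data.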

feed empty pandas.dataframe with several files

I would like to fill an empty dataframe by appending several files of the same type and structure. However, I can't see what's wrong here:
def files2df(colnames, ext):
    df = DataFrame(columns=colnames)
    for inf in sorted(glob.glob(ext)):
        dfin = read_csv(inf, sep='\t', skiprows=1)
        print(dfin.head(), '\n')
        df.append(dfin, ignore_index=True)
    return df
The resulting dataframe is empty. Could someone give me a hand?
1.0 16.59 0.597 0.87 1.0.1 3282 100.08
0 0.953 14.52 0.561 0.80 0.99 4355 -
1 1.000 31.59 1.000 0.94 1.00 6322 -
2 1.000 6.09 0.237 0.71 1.00 10568 -
3 1.000 31.29 1.000 0.94 1.00 14363 -
4 1.000 31.59 1.000 0.94 1.00 19797 -
1.0 6.69 0.199 0.74 1.0.1 186 13.16
0 1 0.88 0.020 0.13 0.99 394 -
1 1 0.75 0.017 0.11 0.99 1052 -
2 1 3.34 0.097 0.57 1.00 1178 -
3 1 1.50 0.035 0.26 1.00 1211 -
4 1 20.59 0.940 0.88 1.00 1583 -
1.0 0.12 0.0030 0.04 0.97 2285 2.62
0 1 1.25 0.135 0.18 0.99 2480 -
1 1 0.03 0.001 0.04 0.97 7440 -
2 1 0.12 0.003 0.04 0.97 8199 -
3 1 1.10 0.092 0.16 0.99 11174 -
4 1 0.27 0.007 0.06 0.98 11310 -
0.244 0.07 0.0030 0.02 0.76 41314 1.32
0 0.181 0.64 0.028 0.03 0.36 41755 -
1 0.161 0.18 0.008 0.01 0.45 42420 -
2 0.161 0.18 0.008 0.01 0.45 42461 -
3 0.237 0.25 0.011 0.02 0.56 43060 -
4 0.267 1.03 0.047 0.07 0.46 43321 -
0.163 0.12 0.0060 0.01 0.5 103384 1.27
0 0.243 0.27 0.014 0.02 0.56 104693 -
1 0.215 0.66 0.029 0.04 0.41 105192 -
2 0.190 0.10 0.005 0.01 0.59 105758 -
3 0.161 0.12 0.006 0.01 0.50 109783 -
4 0.144 0.16 0.007 0.01 0.42 110067 -
Empty DataFrame
Columns: array([D, LOD, r2, CIlow, CIhi, Dist, T-int], dtype=object)
Index: array([], dtype=object)
df.append(dfin, ignore_index=True) returns a new DataFrame; it does not change df in place.
Use df = df.append(dfin, ignore_index=True). With that change the loop works, since append concatenates along axis=0 (rows), which is how you want to combine the data. (Note that DataFrame.append was deprecated and later removed from pandas, so it is best avoided in new code.)
In this scenario (reading multiple files and using all the data to build a single DataFrame), I would use pandas.concat(). The code below gives you a frame with the columns named by colnames, whose rows are formed by the data in the csv files.
def files2df(colnames, ext):
    files = sorted(glob.glob(ext))
    frames = [read_csv(inf, sep='\t', skiprows=1, names=colnames) for inf in files]
    return concat(frames, ignore_index=True)
I did not try this code, I just wrote it here; you may need to tweak it to get it running, but the idea should be clear (I hope).
Also, I found another solution, but I don't know which one is faster:
def files2df(colnames, ext):
    dflist = []
    for inf in sorted(glob.glob(ext)):
        dflist.append(read_csv(inf, names=colnames, sep='\t', skiprows=1))
    df = concat(dflist, axis=0, ignore_index=True)
    return df
