I have a dataframe that looks like this:
scn cl_name lqd_mp lqd_wp gas_mp gas_wp res_mp res_wp
12 C6 Hexanes 3.398 1.723 2.200 5.835 2.614 2.775
13 NaN Me-Cyclo-pentane 1.193 0.591 0.439 1.146 0.707 0.733
14 NaN Benzene 0.037 0.017 0.013 0.030 0.021 0.020
15 NaN Cyclo-hexane 1.393 0.690 0.697 1.820 0.944 0.979
16 C7 Heptanes 6.359 3.748 1.122 3.477 2.980 3.679
17 NaN Me-Cyclo-hexane 4.355 2.515 0.678 2.068 1.985 2.401
18 NaN Toluene 0.407 0.220 0.061 0.174 0.183 0.208
19 C8 Octanes 10.277 6.901 0.692 2.438 4.092 5.759
20 NaN Ethyl-benzene 0.146 0.091 0.010 0.032 0.058 0.076
21 NaN Meta/Para-xylene 0.885 0.553 0.029 0.095 0.333 0.436
22 NaN Ortho-xylene 0.253 0.158 0.002 0.007 0.091 0.119
23 C9 Nonanes 8.683 6.552 0.280 1.113 3.266 5.160
24 NaN Tri-Me-benzene 0.496 0.351 0.000 0.000 0.176 0.261
25 C10 Decanes 8.216 6.877 0.108 0.451 2.985 5.233
I'd like to replace all the NaN values with the values from the previous row in 'scn' column and then to reindex the dataframe using multiindex on two columns 'scn' and 'cl_name'.
I do it with those two lines of code:
df['scn'] = df['scn'].ffill()
df.set_index(['scn', 'cl_name'], inplace=True)
The first line with ffill() does what I want, replacing the NaNs with the values above. But after set_index() those values seem to disappear, leaving blank cells:
lqd_mp lqd_wp gas_mp gas_wp res_mp res_wp
scn cl_name
C6 Hexanes 3.398 1.723 2.200 5.835 2.614 2.775
Me-Cyclo-pentane 1.193 0.591 0.439 1.146 0.707 0.733
Benzene 0.037 0.017 0.013 0.030 0.021 0.020
Cyclo-hexane 1.393 0.690 0.697 1.820 0.944 0.979
C7 Heptanes 6.359 3.748 1.122 3.477 2.980 3.679
Me-Cyclo-hexane 4.355 2.515 0.678 2.068 1.985 2.401
Toluene 0.407 0.220 0.061 0.174 0.183 0.208
C8 Octanes 10.277 6.901 0.692 2.438 4.092 5.759
Ethyl-benzene 0.146 0.091 0.010 0.032 0.058 0.076
Meta/Para-xylene 0.885 0.553 0.029 0.095 0.333 0.436
Ortho-xylene 0.253 0.158 0.002 0.007 0.091 0.119
C9 Nonanes 8.683 6.552 0.280 1.113 3.266 5.160
Tri-Me-benzene 0.496 0.351 0.000 0.000 0.176 0.261
C10 Decanes 8.216 6.877 0.108 0.451 2.985 5.233
I'd like no blanks in the 'scn' part of the index. What am I doing wrong?
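For reference, a minimal check I ran; my guess is that the blanks might only be how pandas displays a sparsified MultiIndex rather than missing labels, but I'm not sure:

import pandas as pd

df['scn'] = df['scn'].ffill()
df.set_index(['scn', 'cl_name'], inplace=True)

# the labels may still be stored in the index even if they are not repeated on screen
print(df.index.get_level_values('scn'))

# turning off sparsification repeats the 'scn' label on every row when printing
with pd.option_context('display.multi_sparse', False):
    print(df)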
Thanks
I have a dataframe like this:
participant time1 time2 ... time27
1 0.003 0.001 0.003
1 0.003 0.002 0.001
1 0.006 0.003 0.003
1 0.003 0.001 0.003
2 0.003 0.003 0.001
2 0.003 0.003 0.001
3 0.006 0.003 0.003
3 0.007 0.044 0.006
3 0.000 0.005 0.007
I need to transform every value with np.log1p() and divide it by that participant's maximum for the same column:
log(X + 1) / Xmax
How can I do this?
You can use:
import numpy as np

# per participant: log1p of each value divided by that participant's column maximum,
# joined back to the original frame with a '_trans' suffix
df.join(df.groupby('participant')
          .transform(lambda s: np.log1p(s) / s.max())
          .add_suffix('_trans'))
Output (as new columns):
participant time1 time2 time27 time1_trans time2_trans time27_trans
0 1 0.003 0.001 0.003 0.499251 0.333167 0.998503
1 1 0.003 0.002 0.001 0.499251 0.666001 0.333167
2 1 0.006 0.003 0.003 0.997012 0.998503 0.998503
3 1 0.003 0.001 0.003 0.499251 0.333167 0.998503
4 2 0.003 0.003 0.001 0.998503 0.998503 0.999500
5 2 0.003 0.003 0.001 0.998503 0.998503 0.999500
6 3 0.006 0.003 0.003 0.854582 0.068080 0.427930
7 3 0.007 0.044 0.006 0.996516 0.978625 0.854582
8 3 0.000 0.005 0.007 0.000000 0.113353 0.996516
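A sketch of an equivalent, typically faster form that replaces the Python-level lambda with a built-in 'max' transform (assuming the value columns are everything except participant):

import numpy as np
import pandas as pd

value_cols = df.columns.drop('participant')  # time1 ... time27

# per-participant, per-column maximum, broadcast back to each row
group_max = df.groupby('participant')[value_cols].transform('max')

# log1p of every value divided by that participant's column maximum
out = df.join((np.log1p(df[value_cols]) / group_max).add_suffix('_trans'))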
I'd like to replicate this method of winsorizing a dataframe within specified percentile bounds in Python. I tried scipy's winsorize function, but it didn't give the results I was looking for.
Example expected output for a dataframe winsorized at the 0.01 (lower) and 0.99 (upper) quantiles across each date:
Original df:
A B C D E
2020-06-30 0.033 -0.182 -0.016 0.665 0.025
2020-07-31 0.142 -0.175 -0.016 0.556 0.024
2020-08-31 0.115 -0.187 -0.017 0.627 0.027
2020-09-30 0.032 -0.096 -0.022 0.572 0.024
Winsorized data:
A B C D E
2020-06-30 0.033 -0.175 -0.016 0.64 0.025
2020-07-31 0.142 -0.169 -0.016 0.54 0.024
2020-08-31 0.115 -0.18 -0.017 0.606 0.027
2020-09-30 0.032 -0.093 -0.022 0.55 0.024
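One possible sketch in plain pandas (rather than scipy.stats.mstats.winsorize): clip each date's row at its own 0.01 and 0.99 quantiles. On the four rows above this appears to reproduce the expected values, though it rests on the assumption that "across each date" means row-wise:

import pandas as pd

# row-wise (per-date) quantile bounds
low = df.quantile(0.01, axis=1)
high = df.quantile(0.99, axis=1)

# clip each row to its own bounds
winsorized = df.clip(lower=low, upper=high, axis=0)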
I have the following functions developed using the TA-Lib package in Python:
def adx(df):
    adx = ta.ADX(df['high'], df['low'], df['close'], timeperiod=2)
    return pd.DataFrame(adx)

def calc_adx(df):
    ret_val = ta.ADX(df['high'], df['low'], df['close'], timeperiod=2)
    return ret_val

# compute the ADX per ticker and stitch the pieces back together
ticker_group = df.groupby('ticker')
subsets = []
for ticker, ticker_df in ticker_group:
    ticker_df['adx'] = ta.ADX(ticker_df['high'], ticker_df['low'], ticker_df['close'], timeperiod=2)
    subsets.append(ticker_df)
df = pd.concat(subsets)
However, I now wish to run the analysis with multiple time periods: 2, 4, 6 and 10.
I have tried adding this function:
def adx_periods(df, weeks):
    for weeks in weeks:
        df['{}week_adx'.format(weeks)] = adx(df, weeks)
    return df
periods = adx_periods(df, [2,4,6,10])
But this fails with internal errors. Please help.
An extract of the dataframe is below:
ticker date open high low close volume
0 A2M 2015-04-03 0.555 0.595 0.530 0.555 11.972594
1 A2M 2015-04-10 0.545 0.550 0.530 0.535 1.942575
2 A2M 2015-04-17 0.535 0.550 0.520 0.540 3.003353
3 A2M 2015-04-24 0.535 0.535 0.490 0.505 3.909057
4 A2M 2015-05-01 0.505 0.510 0.475 0.500 2.252260
5 A2M 2015-05-08 0.505 0.510 0.490 0.495 4.999979
6 A2M 2015-05-15 0.500 0.510 0.465 0.465 1.925071
7 A2M 2015-05-22 0.480 0.490 0.470 0.470 1.327491
8 A2M 2015-05-29 0.480 0.495 0.455 0.465 10.907722
9 A2M 2015-06-05 0.470 0.535 0.460 0.520 10.903146
10 A2M 2015-06-12 0.520 0.535 0.515 0.525 3.473838
11 A2M 2015-06-19 0.530 0.540 0.500 0.510 3.066124
12 A2M 2015-06-26 0.615 0.720 0.555 0.650 18.185325
13 A2M 2015-07-03 0.635 0.690 0.625 0.660 5.487445
14 A2M 2015-07-10 0.670 0.680 0.640 0.680 10.724293
15 A2M 2015-07-17 0.665 0.685 0.655 0.665 3.383546
16 A2M 2015-07-24 0.650 0.750 0.635 0.730 9.850991
17 A2M 2015-07-31 0.735 0.785 0.730 0.735 4.988930
18 A2M 2015-08-07 0.732 0.750 0.710 0.735 1.448889
19 A2M 2015-08-14 0.735 0.740 0.705 0.710 2.624986
Change your adx function so it accepts a time period:
def adx(df, p):
    adx = ta.ADX(df['high'], df['low'], df['close'], timeperiod=p)
    return pd.DataFrame(adx)
And adx_periods so it loops over each period:
def adx_periods(df, weeks):
    for week in weeks:
        df['{}week_adx'.format(week)] = adx(df, week)
    return df
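If the per-ticker split from the first loop is still wanted, a sketch that combines it with the multi-period version (assuming ta is the TA-Lib wrapper, i.e. import talib as ta; ticker_adx is a helper introduced here to align the result back to the original index):

import pandas as pd
import talib as ta

def ticker_adx(g, period):
    # ADX for one ticker's history, returned as a Series aligned to g's index
    values = ta.ADX(g['high'].values, g['low'].values, g['close'].values,
                    timeperiod=period)
    return pd.Series(values, index=g.index)

def adx_periods(df, weeks):
    # one ADX column per period, computed per ticker so histories don't mix
    for week in weeks:
        df['{}week_adx'.format(week)] = (
            df.groupby('ticker', group_keys=False)
              .apply(lambda g: ticker_adx(g, week))
        )
    return df

df = adx_periods(df, [2, 4, 6, 10])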
I'm running a logit with statsmodels that has around 25 regressors, a mix of categorical, ordinal and continuous variables.
My code is the following, with its output:
import numpy as np
import statsmodels.api as sm

a = np.asarray(data_nobands[[*all 25 columns*]], dtype=float)
mod_logit = sm.Logit(np.asarray(data_nobands['cured'], dtype=float), a)
logit_res = mod_logit.fit(method="nm", cov_type="cluster", cov_kwds={"groups": data_nobands['AGREEMENT_NUMBER']})
"""
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 17316
Model: Logit Df Residuals: 17292
Method: MLE Df Model: 23
Date: Wed, 05 Aug 2020 Pseudo R-squ.: -0.02503
Time: 19:49:27 Log-Likelihood: -10274.
converged: False LL-Null: -10023.
Covariance Type: cluster LLR p-value: 1.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 3.504e-05 0.009 0.004 0.997 -0.017 0.017
x2 1.944e-05 nan nan nan nan nan
x3 3.504e-05 2.173 1.61e-05 1.000 -4.259 4.259
x4 3.504e-05 2.912 1.2e-05 1.000 -5.707 5.707
x5 3.504e-05 0.002 0.016 0.988 -0.004 0.004
x6 3.504e-05 0.079 0.000 1.000 -0.154 0.154
x7 3.504e-05 0.003 0.014 0.989 -0.005 0.005
x8 3.504e-05 0.012 0.003 0.998 -0.023 0.023
x9 3.504e-05 0.020 0.002 0.999 -0.039 0.039
x10 3.504e-05 0.021 0.002 0.999 -0.041 0.041
x11 3.504e-05 0.011 0.003 0.997 -0.021 0.022
x12 8.831e-06 5.74e-06 1.538 0.124 -2.42e-06 2.01e-05
x13 4.82e-06 9.23e-06 0.522 0.602 -1.33e-05 2.29e-05
x14 3.504e-05 0.000 0.248 0.804 -0.000 0.000
x15 3.504e-05 4.02e-05 0.871 0.384 -4.38e-05 0.000
x16 1.815e-05 1.58e-05 1.152 0.249 -1.27e-05 4.9e-05
x17 3.504e-05 0.029 0.001 0.999 -0.057 0.057
x18 3.504e-05 0.000 0.190 0.849 -0.000 0.000
x19 9.494e-06 nan nan nan nan nan
x20 1.848e-05 nan nan nan nan nan
x21 3.504e-05 0.026 0.001 0.999 -0.051 0.051
x22 3.504e-05 0.037 0.001 0.999 -0.072 0.072
x23 -0.0005 0.000 -2.596 0.009 -0.001 -0.000
x24 3.504e-05 0.006 0.006 0.995 -0.011 0.011
x25 3.504e-05 0.011 0.003 0.998 -0.022 0.022
==============================================================================
"""
With any other method, such as bfgs, lbfgs or minimize, the output is the following:
"""
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 17316
Model: Logit Df Residuals: 17292
Method: MLE Df Model: 23
Date: Wed, 05 Aug 2020 Pseudo R-squ.: -0.1975
Time: 19:41:22 Log-Likelihood: -12003.
converged: False LL-Null: -10023.
Covariance Type: cluster LLR p-value: 1.000
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 0 0.152 0 1.000 -0.299 0.299
x2 0 724.618 0 1.000 -1420.225 1420.225
x3 0 20.160 0 1.000 -39.514 39.514
x4 0 23.008 0 1.000 -45.094 45.094
x5 0 0.010 0 1.000 -0.020 0.020
x6 0 1.335 0 1.000 -2.617 2.617
x7 0 0.020 0 1.000 -0.039 0.039
x8 0 0.109 0 1.000 -0.214 0.214
x9 0 0.070 0 1.000 -0.137 0.137
x10 0 0.175 0 1.000 -0.343 0.343
x11 0 0.045 0 1.000 -0.088 0.088
x12 0 1.24e-05 0 1.000 -2.42e-05 2.42e-05
x13 0 2.06e-05 0 1.000 -4.04e-05 4.04e-05
x14 0 0.001 0 1.000 -0.002 0.002
x15 0 5.16e-05 0 1.000 -0.000 0.000
x16 0 1.9e-05 0 1.000 -3.73e-05 3.73e-05
x17 0 0.079 0 1.000 -0.155 0.155
x18 0 0.000 0 1.000 -0.001 0.001
x19 0 1145.721 0 1.000 -2245.573 2245.573
x20 0 nan nan nan nan nan
x21 0 0.028 0 1.000 -0.055 0.055
x22 0 0.037 0 1.000 -0.072 0.072
x23 0 0.000 0 1.000 -0.000 0.000
x24 0 0.005 0 1.000 -0.010 0.010
x25 0 0.015 0 1.000 -0.029 0.029
==============================================================================
"""
As you can see, I get p-values that are either "nan" or far from significant.
What could the problem be?
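Not a definitive fix, but a sketch of the first things worth checking given converged: False in both outputs: keep the DataFrame so the summary shows real column names, add an explicit intercept, and raise the iteration limit. cols below is a placeholder for the 25 regressor names:

import statsmodels.api as sm

cols = [...]  # placeholder: the 25 regressor column names

# keep column names for a readable summary and add an explicit intercept
X = sm.add_constant(data_nobands[cols].astype(float))
y = data_nobands['cured'].astype(float)

logit_res = sm.Logit(y, X).fit(
    method='bfgs', maxiter=500,
    cov_type='cluster',
    cov_kwds={'groups': data_nobands['AGREEMENT_NUMBER']},
)
print(logit_res.summary())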
I've noticed that pandas' groupby().filter() is slow for large datasets, much slower than an equivalent merge. Here's my example:
import numpy as np
import pandas as pd
size = 50000000
df = pd.DataFrame({'M': np.random.randint(10, size=size), 'A': np.random.randn(size), 'B': np.random.randn(size)})
%%time
gb = df.groupby('M').filter(lambda x : x['A'].count()%2==0)
Wall time: 14 s
%%time
gb_int = df.groupby('M').count()%2==0
gb_int = gb_int[gb_int['A'] == True]
gb = df.merge(gb_int, left_on='M', right_index=True)
Wall time: 8.39 s
Can anyone help me understand why groupby filter is so slow?
Using %%prun, you can see that the faster merge relies on inner_join and pandas.hashtable.Int64Factorizer, whereas the slower filter uses groupby_indices and a sort (showing only calls consuming more than 0.02 s):
`merge`: 3361 function calls (3285 primitive calls) in 5.420 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.092 1.092 1.092 1.092 {pandas.algos.inner_join}
4 0.768 0.192 0.768 0.192 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
1 0.578 0.578 0.578 0.578 {pandas.algos.take_2d_axis1_float64_float64}
4 0.512 0.128 0.512 0.128 {method 'take' of 'numpy.ndarray' objects}
1 0.425 0.425 0.425 0.425 {method 'get_labels' of 'pandas.hashtable.Int64HashTable' objects}
1 0.381 0.381 0.381 0.381 {pandas.algos.take_2d_axis0_float64_float64}
1 0.296 0.296 0.296 0.296 {pandas.algos.take_2d_axis1_int64_int64}
1 0.203 0.203 1.563 1.563 groupby.py:3730(count)
1 0.194 0.194 0.194 0.194 merge.py:746(_get_join_keys)
1 0.130 0.130 5.420 5.420 <string>:2(<module>)
2 0.109 0.054 0.109 0.054 common.py:250(_isnull_ndarraylike)
3 0.099 0.033 0.107 0.036 internals.py:4768(needs_filling)
2 0.099 0.050 0.875 0.438 merge.py:687(_factorize_keys)
2 0.094 0.047 0.200 0.100 groupby.py:3740(<genexpr>)
2 0.083 0.041 0.083 0.041 {pandas.algos.take_2d_axis1_bool_bool}
1 0.081 0.081 0.772 0.772 algorithms.py:156(factorize)
7 0.058 0.008 1.406 0.201 common.py:733(take_nd)
1 0.049 0.049 2.521 2.521 merge.py:322(_get_join_info)
1 0.035 0.035 2.196 2.196 merge.py:516(_get_join_indexers)
1 0.030 0.030 0.030 0.030 {built-in method numpy.core.multiarray.putmask}
1 0.030 0.030 0.033 0.033 merge.py:271(_maybe_add_join_keys)
1 0.028 0.028 3.725 3.725 merge.py:26(merge)
28 0.021 0.001 0.021 0.001 {method 'reduce' of 'numpy.ufunc' objects}
And the slower filter:
3751 function calls (3694 primitive calls) in 9.110 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.158 2.158 2.158 2.158 {pandas.algos.groupby_indices}
2 1.214 0.607 1.214 0.607 {pandas.algos.take_2d_axis1_float64_float64}
1 1.017 1.017 1.017 1.017 {method 'sort' of 'numpy.ndarray' objects}
4 0.859 0.215 0.859 0.215 {method 'take' of 'numpy.ndarray' objects}
2 0.586 0.293 0.586 0.293 {pandas.algos.take_2d_axis1_int64_int64}
1 0.534 0.534 0.534 0.534 {pandas.algos.take_1d_int64_int64}
1 0.420 0.420 0.420 0.420 {built-in method pandas.algos.ensure_object}
1 0.395 0.395 0.395 0.395 {method 'get_labels' of 'pandas.hashtable.Int64HashTable' objects}
1 0.349 0.349 0.349 0.349 {pandas.algos.groupsort_indexer}
2 0.324 0.162 0.340 0.170 indexing.py:1794(maybe_convert_indices)
2 0.223 0.112 3.109 1.555 internals.py:3625(take)
1 0.129 0.129 0.129 0.129 {built-in method numpy.core.multiarray.concatenate}
1 0.124 0.124 9.109 9.109 <string>:2(<module>)
1 0.124 0.124 0.124 0.124 {method 'copy' of 'numpy.ndarray' objects}
1 0.086 0.086 0.086 0.086 {pandas.lib.generate_slices}
31 0.083 0.003 0.083 0.003 {method 'reduce' of 'numpy.ufunc' objects}
1 0.076 0.076 0.710 0.710 algorithms.py:156(factorize)
5 0.074 0.015 2.415 0.483 common.py:733(take_nd)
1 0.067 0.067 0.068 0.068 numeric.py:2476(array_equal)
1 0.063 0.063 8.985 8.985 groupby.py:3523(filter)
1 0.062 0.062 2.640 2.640 groupby.py:4300(_groupby_indices)
10 0.059 0.006 0.059 0.006 common.py:250(_isnull_ndarraylike)
1 0.030 0.030 0.030 0.030 {built-in method numpy.core.multiarray.putmask}
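As a side note, a boolean mask built with groupby().transform() is often faster than filter() for a predicate like this, since the counting stays in a single vectorized aggregation; a sketch under the same setup:

import numpy as np
import pandas as pd

size = 50000000
df = pd.DataFrame({'M': np.random.randint(10, size=size),
                   'A': np.random.randn(size),
                   'B': np.random.randn(size)})

# keep rows whose 'M' group has an even non-null count in column 'A'
mask = df.groupby('M')['A'].transform('count') % 2 == 0
gb = df[mask]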