Python - decision tree in lightgbm with odd values

I am trying to fit a single decision tree using the Python module lightgbm. However, I find the output a little strange. I have 15 explanatory variables, and the numerical response variable has the following characteristics:
count 653.000000
mean 31.503813
std 11.838267
min 13.750000
25% 22.580000
50% 28.420000
75% 38.250000
max 76.750000
Name: X2, dtype: float64
I do the following to fit the tree: I first construct the Dataset object
df_train = lightgbm.Dataset(
    df,                                   # The data
    label=df[response],                   # The response series
    feature_name=features,                # A list with the names of all explanatory variables
    categorical_feature=categorical_vars  # A list with the names of the categorical ones
)
Next, I define the parameters and fit the model:
param = {
    # make it a single tree:
    'objective': 'regression',
    'bagging_freq': 0,       # disable bagging
    'feature_fraction': 1,   # don't randomly select features; consider all
    'num_trees': 1,
    # tuning parameters
    'max_leaves': 20,
    'max_depth': -1,
    'min_data_in_leaf': 20
}
model = lightgbm.train(param, df_train)
From the model I extract the leaves of the tree as:
tree = model.trees_to_dataframe()[[
    'right_child',
    'node_depth',
    'value',
    'count']]
leaves = tree[tree.right_child.isnull()]
print(leaves)
right_child node_depth value count
5 None 6 29.957982 20
6 None 6 30.138253 28
8 None 6 30.269373 34
9 None 6 30.404353 38
12 None 6 30.528705 33
13 None 6 30.651690 62
14 None 5 30.842856 59
17 None 5 31.080432 51
19 None 6 31.232860 21
20 None 6 31.358547 26
22 None 5 31.567571 43
23 None 5 31.795345 46
28 None 6 32.034321 27
29 None 6 32.247890 24
31 None 6 32.420886 22
32 None 6 32.594289 21
34 None 5 32.920932 20
35 None 5 33.210205 22
37 None 4 33.809376 36
38 None 4 34.887632 20
Now, if you look at the values, they range from (approximately) 30 to 35. This is far from capturing the distribution (shown above with min = 13.75 and max = 76.75) of the response variable.
Can anyone explain to me what is going on here?
Follow Up Based On Accepted Answer:
I tried to add 'learning_rate':1 and 'min_data_in_bin':1 to the parameter dict which resulted in the following tree:
right_child node_depth value count
5 None 6 16.045500 20
6 None 6 17.824074 27
8 None 6 19.157500 36
9 None 6 20.529730 37
12 None 6 21.805834 36
13 None 6 23.048387 62
14 None 5 24.975263 57
17 None 5 27.335385 52
19 None 6 29.006800 25
20 None 6 30.234286 21
22 None 5 32.221591 44
23 None 5 34.472272 44
28 None 6 36.808889 27
29 None 6 38.944583 24
31 None 6 40.674546 22
32 None 6 42.408572 21
34 None 5 45.675000 20
35 None 5 48.567728 22
37 None 4 54.559445 36
38 None 4 65.341999 20
This is much more desirable. It means that we can now use lightgbm to mimic the behavior of a single decision tree with categorical features. Unlike sklearn, lightgbm handles "true" categorical variables natively, whereas sklearn requires one-hot encoding all categorical variables, which can turn out really badly; see this kaggle post.
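For reference, a hedged, consolidated sketch of the resulting single-tree setup (it simply combines the Dataset construction from the question with the parameters suggested in the accepted answer; df, features, categorical_vars and response are the objects defined above):
import lightgbm

df_train = lightgbm.Dataset(
    df[features],                          # explanatory variables only
    label=df[response],
    categorical_feature=categorical_vars   # treated as true categoricals, no one-hot encoding needed
)
param = {
    'objective': 'regression',
    'num_trees': 1,          # a single tree
    'learning_rate': 1,      # no shrinkage, so leaf values sit on the response scale
    'min_data_in_bin': 1,    # effectively disable feature binning
    'min_data_in_leaf': 20,
    'num_leaves': 20,        # same as 'max_leaves' above ('max_leaves' is an alias)
}
model = lightgbm.train(param, df_train)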

As you may know, LightGBM does a couple of tricks to speed things up. One of them is feature binning, where the values of the features are assigned to bins to reduce the possible number of splits. The minimum number of samples per bin is controlled by min_data_in_bin, which defaults to 3, so for example if you have 100 samples you'd end up with at most around 33 bins.
Another important thing here when using a single tree is that LightGBM does boosting by default, which means that it will start from an initial score and try to gradually improve on it. That gradual change is controlled by the learning_rate, which by default is 0.1, so the predictions from each tree are multiplied by this number and added to the current score. For regression the initial score is the mean of the label by default, which is why the leaf values in the question cluster around 31.5: each reported value is roughly mean + 0.1 × (leaf mean − mean), e.g. 31.50 + 0.1 × (16.05 − 31.50) ≈ 29.96, the first leaf in the original output.
The last thing to consider is that the tree size is controlled by num_leaves which is 31 by default. If you want to fully grow the tree you have to set this number to your number of samples.
So if you want to replicate a full-grown decision tree in LightGBM you have to adjust these parameters. Here's an example:
import lightgbm as lgb
import numpy as np
import pandas as pd
X = np.linspace(1, 2, 100)[:, None]
y = X[:, 0]**2
ds = lgb.Dataset(X, y)
params = {'num_leaves': 100, 'min_child_samples': 1, 'min_data_in_bin': 1, 'learning_rate': 1}
bst = lgb.train(params, ds, num_boost_round=1)
print(pd.concat([
    bst.trees_to_dataframe().loc[lambda x: x['left_child'].isnull(), 'value'].describe().rename('leaves'),
    pd.Series(y).describe().rename('y'),
], axis=1))
        leaves         y
count      100       100
mean   2.33502   2.33502
std   0.882451  0.882451
min          1         1
25%    1.56252   1.56252
50%    2.25003   2.25003
75%    3.06252   3.06252
max          4         4
Having said that, if you're looking for a decision tree it's easier to use scikit-learn's:
from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor().fit(X, y)
np.allclose(bst.predict(X), tree.predict(X))
# True

Related

Calculate a prediction interval for a dataset Python

I have the following table:
perc
0 59.98797
1 61.89383
2 61.08403
3 61.00661
4 62.64753
5 62.18118
6 60.74520
7 57.83964
8 62.09705
9 57.07985
10 58.62777
11 60.02589
12 58.74948
13 59.14136
14 58.37719
15 58.27401
16 59.67806
17 58.62855
18 58.45272
19 57.62186
20 58.64749
21 58.88152
22 54.80138
23 59.57697
24 60.26713
25 60.96022
26 55.59813
27 60.32104
28 57.95403
29 58.90658
30 53.72838
31 57.03986
32 58.14056
33 53.62257
34 57.08174
35 57.26881
36 48.80800
37 56.90632
38 59.08444
39 57.36432
consisting of various percentages.
I'm interested in creating a probability distribution based on these percentages for the sake of coming up with a prediction interval (say 95%) of what we would expect a new observation of this percentage to be within.
I initially was doing the following, but upon testing with my sample data I remembered that CIs capture the mean, not a new observation.
import scipy.stats as st
import numpy as np

# Get data in a list
lst = list(percDone['perc'])

# create 95% confidence interval
st.t.interval(alpha=0.95, df=len(lst)-1,
              loc=np.mean(lst),
              scale=st.sem(lst))
Thanks!
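For what it's worth, a minimal sketch of the standard normal-theory prediction interval, which widens the confidence interval above by a factor of sqrt(1 + 1/n); it assumes the percentages are roughly independent and normally distributed (percDone is the table from the question):
import numpy as np
import scipy.stats as st

lst = list(percDone['perc'])
n = len(lst)
mean = np.mean(lst)
s = np.std(lst, ddof=1)             # sample standard deviation
t_crit = st.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value

# 95% prediction interval for a single new observation
half_width = t_crit * s * np.sqrt(1 + 1/n)
print(mean - half_width, mean + half_width)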

Pre-processing single feature containing different scales

How do I preprocess this data containing a single feature with different scales? This will then be used for supervised machine learning classification.
Data
import pandas as pd
import numpy as np
rng = np.random.default_rng(4)  # seed the generator so the example is reproducible
df_eur_jpy = pd.DataFrame({"value": rng.uniform(0.07, 3.85, 50)})
df_usd_cad = pd.DataFrame({"value": rng.uniform(0.0004, 0.02401, 50)})
df_usd_cad["ticker"] = "usd_cad"
df_eur_jpy["ticker"] = "eur_jpy"
df = pd.concat([df_eur_jpy,df_usd_cad],axis=0)
df.head(1)
value ticker
0 0.161666 eur_jpy
We can see the different tickers contain data with a different scale when looking at the max/min of this groupby:
df.groupby("ticker")["value"].agg(['min', 'max'])
min max
ticker
eur_jpy 0.079184 3.837519
usd_cad 0.000405 0.022673
I have many tickers in my real data and would like to combine all of these in the one feature (pandas column) and use with an estimator in sci-kit learn for supervised machine learning classification.
If I understand correctly (IIUC), you can use the min-max scaling formula, x_scaled = (x - x_min) / (x_max - x_min), applied to each ticker separately.
You can apply this formula to your dataframe with sklearn.preprocessing.MinMaxScaler like below:
from sklearn.preprocessing import MinMaxScaler
df2 = df.pivot(columns='ticker', values='value')
# ticker eur_jpy usd_cad
# 0 3.204568 0.021455
# 1 1.144708 0.013810
# ...
# 48 1.906116 0.002058
# 49 1.136424 0.022451
df2[['min_max_scl_eur_jpy', 'min_max_scl_usd_cad']] = MinMaxScaler().fit_transform(df2[['eur_jpy', 'usd_cad']])
print(df2)
Output:
ticker eur_jpy usd_cad min_max_scl_eur_jpy min_max_scl_usd_cad
0 3.204568 0.021455 0.827982 0.896585
1 1.144708 0.013810 0.264398 0.567681
2 2.998154 0.004580 0.771507 0.170540
3 1.916517 0.003275 0.475567 0.114361
4 0.955089 0.009206 0.212517 0.369558
5 3.036463 0.019500 0.781988 0.812471
6 1.240505 0.006575 0.290608 0.256373
7 1.224260 0.020711 0.286163 0.864584
8 3.343022 0.020564 0.865864 0.858280
9 2.710383 0.023359 0.692771 0.978531
10 1.218328 0.008440 0.284540 0.336588
11 2.005472 0.022898 0.499906 0.958704
12 2.056680 0.016429 0.513916 0.680351
13 1.010388 0.005553 0.227647 0.212368
14 3.272408 0.000620 0.846543 0.000149
15 2.354457 0.018608 0.595389 0.774092
16 3.297936 0.017484 0.853528 0.725720
17 2.415297 0.009618 0.612035 0.387285
18 0.439263 0.000617 0.071386 0.000000
19 3.335262 0.005988 0.863740 0.231088
20 2.767412 0.013357 0.708375 0.548171
21 0.830678 0.013824 0.178478 0.568255
22 1.056041 0.007806 0.240138 0.309336
23 1.497400 0.023858 0.360896 1.000000
24 0.629698 0.014088 0.123489 0.579604
25 3.758559 0.020663 0.979556 0.862509
26 0.964214 0.010302 0.215014 0.416719
27 3.680324 0.023647 0.958150 0.990918
28 3.169445 0.017329 0.818372 0.719059
29 1.898905 0.017892 0.470749 0.743299
30 3.322663 0.020508 0.860293 0.855869
31 2.735855 0.010578 0.699741 0.428591
32 2.264645 0.017853 0.570816 0.741636
33 2.613166 0.021359 0.666173 0.892456
34 1.976168 0.001568 0.491888 0.040928
35 3.076169 0.013663 0.792852 0.561335
36 3.330470 0.013048 0.862429 0.534891
37 3.600527 0.012340 0.936318 0.504426
38 0.653994 0.008665 0.130137 0.346288
39 0.587896 0.013134 0.112052 0.538567
40 0.178353 0.011326 0.000000 0.460781
41 3.727127 0.016738 0.970956 0.693658
42 1.719622 0.010939 0.421696 0.444123
43 0.460177 0.021131 0.077108 0.882665
44 3.124722 0.010328 0.806136 0.417826
45 1.011988 0.007631 0.228085 0.301799
46 3.833281 0.003896 1.000000 0.141076
47 3.289872 0.017223 0.851322 0.714495
48 1.906116 0.002058 0.472721 0.062020
49 1.136424 0.022451 0.262131 0.939465
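As a side note, since the real data has many tickers, a hedged alternative sketch is to scale each ticker directly in the long format with a groupby-transform (same min-max formula, no pivot needed):
# per-ticker min-max scaling without pivoting
df['value_scaled'] = df.groupby('ticker')['value'].transform(
    lambda s: (s - s.min()) / (s.max() - s.min())
)
print(df.groupby('ticker')['value_scaled'].agg(['min', 'max']))  # each ticker now spans [0, 1]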

Select/Group rows from a data frame with the nearest values for a specific column(s)

I have the two columns in a data frame (you can see a sample down below)
Usually in columns A & B I get 10 to 12 rows with similar values.
So for example: from index 1 to 10 and then from index 11 to 21.
I would like to group these values and get the mean and standard deviation of each group.
I found this following line code where I can get the index of the nearest value. but I don't know how to do this repetitively:
Index = df['A'].sub(df['A'][0]).abs().idxmin()
Anyone has any ideas on how to approach this problem?
A B
1 3652.194531 -1859.805238
2 3739.026566 -1881.965576
3 3742.095325 -1878.707674
4 3747.016899 -1878.728626
5 3746.214554 -1881.270329
6 3750.325368 -1882.915532
7 3748.086576 -1882.406672
8 3751.786422 -1886.489485
9 3755.448968 -1885.695822
10 3753.714126 -1883.504098
11 -337.969554 24.070990
12 -343.019575 23.438956
13 -344.788697 22.250254
14 -346.433460 21.912217
15 -343.228579 22.178519
16 -345.722368 23.037441
17 -345.923108 23.317620
18 -345.526633 21.416528
19 -347.555162 21.315934
20 -347.229210 21.565183
21 -344.575181 22.963298
22 23.611677 -8.499528
23 26.320500 -8.744512
24 24.374874 -10.717384
25 25.885272 -8.982414
26 24.448127 -9.002646
27 23.808744 -9.568390
28 24.717935 -8.491659
29 25.811393 -8.773649
30 25.084683 -8.245354
31 25.345618 -7.508419
32 23.286342 -10.695104
33 -3184.426285 -2533.374402
34 -3209.584366 -2553.310934
35 -3210.898611 -2555.938332
36 -3214.234899 -2558.244347
37 -3216.453616 -2561.863807
38 -3219.326197 -2558.739058
39 -3214.893325 -2560.505207
40 -3194.421934 -2550.186647
41 -3219.728445 -2562.472566
42 -3217.630380 -2562.132186
43 234.800448 -75.157523
44 236.661235 -72.617806
45 238.300501 -71.963103
46 239.127539 -72.797922
47 232.305335 -70.634125
48 238.452197 -73.914015
49 239.091210 -71.035163
50 239.855953 -73.961841
51 238.936811 -73.887023
52 238.621490 -73.171441
53 240.771812 -73.847028
54 -16.798565 4.421919
55 -15.952454 3.911043
56 -14.337879 4.236691
57 -17.465204 3.610884
58 -17.270147 4.407737
59 -15.347879 3.256489
60 -18.197750 3.906086
A simpler approach consists of grouping consecutive values whose absolute percentage change stays below a given threshold (say 0.5), starting a new group whenever the change exceeds it:
df['Group'] = (df.A.pct_change().abs()>0.5).cumsum()
df.groupby('Group').agg(['mean', 'std'])
Output:
A B
mean std mean std
Group
0 3738.590934 30.769420 -1880.148905 7.582856
1 -344.724684 2.666137 22.496995 0.921008
2 24.790470 0.994361 -9.020824 0.977809
3 -3210.159806 11.646589 -2555.676749 8.810481
4 237.902230 2.439297 -72.998817 1.366350
5 -16.481411 1.341379 3.964407 0.430576
Note: I have only used the "A" column, since the "B" column appears to follow the same pattern of consecutive nearest values. You can check if the identified groups are the same between columns with:
grps = (df[['A','B']].pct_change().abs()>1).cumsum()
grps.A.eq(grps.B).all()
I would say that if you know the length of each group/index set you want, then you can first subset the rows of the column with:
df['A'].iloc[0:11].mean()
and compute the standard deviation the same way with .std(); see the sketch below.
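A hedged sketch of that idea, looping over hand-picked chunk boundaries (the positions below are simply read off the sample shown above and would need adjusting for the real data):
# positional boundaries of the six groups visible in the sample (hypothetical)
bounds = [0, 10, 21, 32, 42, 53, 60]
for start, stop in zip(bounds[:-1], bounds[1:]):
    chunk = df.iloc[start:stop]
    print(chunk['A'].mean(), chunk['A'].std(),
          chunk['B'].mean(), chunk['B'].std())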

Given a discrete distribution, how do I round a number to the closest value in that distribution?

What I ultimately want to do is round the expected value of a discrete random variable distribution to a valid number in the distribution. For example if I am drawing evenly from the numbers [1, 5, 6], the expected value is 4 but I want to return the closest number to that (ie, 5).
import numpy as np
from scipy.stats import rv_discrete

xk = (1, 5, 6)
pk = np.ones(len(xk)) / len(xk)
custom = rv_discrete(name='custom', values=(xk, pk))
print(custom.expect())
# 4.0

def round_discrete(discrete_rv_dist, val):
    # do something here
    return answer

print(round_discrete(custom, custom.expect()))
# 5.0
I don't know a priori what distribution will be used (i.e. it might not be integers, it might be an unbounded distribution), so I'm really struggling to think of an algorithm that is sufficiently generic. Edit: I just learned that rv_discrete doesn't work on non-integer xk values.
As to why I want to do this: I'm putting together a Monte Carlo simulation and want a "nominal" value for each distribution. I think the EV is the most physically appropriate, rather than the mode or median. I might have values in the downstream simulation that have to be one of several discrete choices, so passing a value that is not within that set is not acceptable.
If there's already a nice way to do this in Python that would be great, otherwise I can interpret math into code.
Here is R code that I think will do what you want, using Poisson data to illustrate:
set.seed(322)
x = rpois(100, 7) # 100 obs from POIS(7)
a = mean(x); a
[1] 7.16 # so 7 is the value we want
d = min(abs(x-a)); d # min distance btw a and actual Pois val
[1] 0.16
u = unique(x); u # unique Pois values observed
[1] 7 5 4 10 2 9 8 6 11 3 13 14 12 15
v = u[abs(u-a)==d]; v # unique val closest to a
[1] 7
Hope you can translate it to Python.
Another run:
set.seed(323)
x = rpois(100, 20)
a = mean(x); a
[1] 20.32
d = min(abs(x-a)); d
[1] 0.32
u = unique(x)
v = u[abs(u-a)==d]; v
[1] 20
x
[1] 17 16 20 23 23 20 19 23 21 19 21 20 22 25 13 15 19 19 14 27 19 30 17 19 23
[26] 16 23 26 33 16 11 23 14 21 24 12 18 20 20 19 26 12 22 24 20 22 17 23 11 19
[51] 19 26 17 17 11 17 23 21 26 13 18 28 22 14 17 25 28 24 16 15 25 26 22 15 23
[76] 27 19 21 17 23 21 24 23 22 23 18 25 14 24 25 19 19 21 22 16 28 18 11 25 23
u
[1] 17 16 20 23 19 21 22 25 13 15 14 27 30 26 33 11 24 12 18 28
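A rough NumPy translation of the same idea (not seed-compatible with the R runs above, so the exact numbers will differ):
import numpy as np

rng = np.random.default_rng(322)
x = rng.poisson(7, 100)             # 100 obs from Pois(7)
a = x.mean()                        # sample mean, the value to round
u = np.unique(x)                    # unique values observed
v = u[np.argmin(np.abs(u - a))]     # unique value closest to the mean
print(a, v)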
Figured it out, and tested it working. If I plug my value X into the cdf, then I can plug that probability P = cdf(X) into the ppf. The values at ppf(P +- epsilon) will give me the closest values in the set to X.
Or more geometrically, for a discrete pmf, the point (X,P) will lie on a horizontal portion of the corresponding cdf. When you invert the cdf, (P,X) is now on a vertical section of the ppf. Taking P +- eps will give you the 2 nearest flat portions of the ppf connected to that vertical jump, which correspond to the valid values X1, X2. You can then do a simple difference to figure out which is closer to your target value.
import numpy as np
eps = np.finfo(float).eps
ev = custom.expect()
p = custom.cdf(ev)
ev_candidates = custom.ppf([p - eps, p, p + eps])
ev_candidates_distance = abs(ev_candidates - ev)
ev_closest = ev_candidates[np.argmin(ev_candidates_distance)]
print(ev_closest)
# 5.0
Terms:
pmf - probability mass function
cdf - cumulative distribution function (cumulative sum of the pmf)
ppf - percentage point function (inverse of the cdf)
eps - epsilon (smallest possible increment)
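For completeness, a hedged sketch that wraps the cdf/ppf trick above into the round_discrete signature from the question (custom is the rv_discrete instance defined there):
import numpy as np

def round_discrete(discrete_rv_dist, val):
    # nudge the cumulative probability up and down by machine epsilon to land on
    # the neighbouring support points, then keep whichever is closer to val
    eps = np.finfo(float).eps
    p = discrete_rv_dist.cdf(val)
    candidates = discrete_rv_dist.ppf([p - eps, p, p + eps])
    return candidates[np.argmin(np.abs(candidates - val))]

print(round_discrete(custom, custom.expect()))
# 5.0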
Would the function ceil from the math library help? For example:
from math import ceil
print(float(ceil(3.333333333333333)))

Looped regression model in Python/Sklearn

I'm trying to systematically regress a couple of different dependent variables (countries) on the same set of inputs/independent variables, and want to do this in a looped fashion in Python using Sklearn. The dependent variables look like this:
Europe UK Japan USA Canada
Jan-10 10 13 39 42 16
Feb-10 13 16 48 51 19
Mar-10 15 18 54 57 21
Apr-10 12 15 45 48 18
May-10 11 14 42 45 17
while the independent variables look like this:
Input 1 Input 2 Input 3 Input 4
Jan-10 90 50 3 41
Feb-10 95 54 5 43
Mar-10 92 52 1 45
Apr-10 91 60 1 49
May-10 90 67 11 49
I find it easy to manually regress them and store the predictions one at a time (i.e. Europe on all four inputs, then Japan, etc.) but haven't figured out how to write a single looped function that could do them all in one go. I suspect I may need to use a list/dictionary to store the dependent variables and call them one by one, but don't quite know how to write this in a Pythonic way.
The existing code for a single loop looks like this:
X = pd.read_csv('countryinputs.csv')
countries = pd.read_csv('countryoutputs.csv')
y = countries['Europe']

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)
Simply iterate through the column names, then pass each name into a defined function. In fact, you can wrap the process in a dictionary comprehension and pass it into the DataFrame constructor to return a dataframe of predicted values (same shape as the original dataframe):
X = pd.DataFrame(...)
countries = pd.DataFrame(...)
def reg_proc(label):
y = countries[label]
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)
return(y_pred)
pred_df = pd.DataFrame({lab: reg_proc(lab) for lab in countries.columns},
                       columns=countries.columns)
To demonstrate with random, seeded data where tools below would be your countries:
Data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
np.random.seed(7172018)
tools = pd.DataFrame({'pandas': np.random.uniform(0, 1000, 50),
                      'r': np.random.uniform(0, 1000, 50),
                      'julia': np.random.uniform(0, 1000, 50),
                      'sas': np.random.uniform(0, 1000, 50),
                      'spss': np.random.uniform(0, 1000, 50),
                      'stata': np.random.uniform(0, 1000, 50)},
                     columns=['pandas', 'r', 'julia', 'sas', 'spss', 'stata'])
X = pd.DataFrame({'Input1': np.random.randn(50)*10,
                  'Input2': np.random.randn(50)*10,
                  'Input3': np.random.randn(50)*10,
                  'Input4': np.random.randn(50)*10})
Model
def reg_proc(label):
y = tools[label]
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)
return(y_pred)
pred_df = pd.DataFrame({lab: reg_proc(lab) for lab in tools.columns},
                       columns=tools.columns)
print(pred_df.head(10))
# pandas r julia sas spss stata
# 0 547.631679 576.025733 682.390046 507.767567 246.020799 557.648181
# 1 577.334819 575.992992 280.579234 506.014191 443.044139 396.044620
# 2 430.494827 576.211105 541.096721 441.997575 386.309627 558.472179
# 3 440.662962 524.582054 406.849303 420.017656 508.701222 393.678200
# 4 588.993442 472.414081 453.815978 479.208183 389.744062 424.507541
# 5 520.215513 489.447248 670.708618 459.375294 314.008988 516.235188
# 6 515.266625 459.292370 477.485995 436.398180 446.777292 398.826234
# 7 423.930650 414.069751 629.444118 378.059735 448.760240 449.062734
# 8 549.769034 406.531405 653.557937 441.425445 348.725447 456.089921
# 9 396.826924 399.327683 717.285415 361.235709 444.830491 429.967976
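As a side note (an alternative to the loop, not part of the answer above): scikit-learn's LinearRegression also accepts a 2-D target, so all columns can be fit in a single call:
# fit every target column at once; LinearRegression handles multi-output y natively
multi_reg = LinearRegression().fit(X, tools)
pred_df_all = pd.DataFrame(multi_reg.predict(X), columns=tools.columns)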
