Why is RNG different for TensorFlow 2 and 1?

import numpy as np
np.random.seed(1)
import random
random.seed(2)
import tensorflow as tf
tf.compat.v1.set_random_seed(3) # graph-level seed
if tf.__version__[0] == '2':
    tf.random.set_seed(4)  # global seed
else:
    tf.set_random_seed(4)  # global seed
from tensorflow.keras.initializers import glorot_uniform as GlorotUniform
from tensorflow.keras import backend as K
init = GlorotUniform(seed=5)(shape=(4, 4))
print(K.eval(init))
[[-0.75889236 0.5744677 0.82025963 -0.26889956]
[ 0.0180248 -0.24747121 -0.0666492 0.23440498]
[ 0.61886185 0.05548459 0.39713246 0.126324 ]
[ 0.6639387 -0.58397514 0.39671892 0.67872125]] # TF 2
[[ 0.2515846 -0.41902617 -0.7859829 0.41573995]
[ 0.8099498 -0.6861247 -0.46198446 -0.7579694 ]
[ 0.29976922 0.0310365 0.5031274 0.314076 ]
[-0.62062943 -0.01889879 0.7725797 -0.65635633]] # TF 1
Why the difference? It creates severe reproducibility problems between the two versions - and this, or something related, also shows up within the same version (TF2) between Graph and Eager execution. More importantly, can TF1's RNG sequence be reproduced in TF2?

With enough digging - yes. TL;DR:
TF2 behavior in TF1: from tensorflow.python.keras.initializers import GlorotUniformV2 as GlorotUniform
TF1 behavior in TF2: from tensorflow.python.keras.initializers import GlorotUniform
TF2 essentially executes the first bullet under the hood; GlorotUniform is actually GlorotUniformV2.
Some details:
Found the docs - but the code itself terminates at some pywrapped compiled code (TF1, TF2; for some reason GitHub refuses to show gen_stateless_random_ops for TF2 and gen_random_ops for TF1, but you can find both in the local install):
tensorflow.python.ops.gen_random_ops.truncated_normal Outputs random values from a truncated normal distribution.
The generated values follow a normal distribution with mean 0 and
standard deviation 1, except that values whose magnitude is more
than 2 standard deviations from the mean are dropped and re-picked.
tensorflow.python.ops.gen_stateless_random_ops.truncated_normal Outputs deterministic pseudorandom values from a truncated normal distribution.
The generated values follow a normal distribution with mean 0 and
standard deviation 1, except that values whose magnitude is more
than 2 standard deviations from the mean are dropped and re-picked.
The outputs are a deterministic function of shape and seed.
The first and second are ultimately where GlorotUniform and GlorotUniformV2 route to, respectively. TF2's from tensorflow.keras.initializers imports from init_ops_v2 (i.e. V2), whereas TF1's from init_ops.
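For completeness, a minimal sketch of how the TL;DR imports could be exercised inside TF2 to compare the two streams. The private tensorflow.python.keras paths are the ones quoted in the answer above and may have moved in newer releases, so treat them as assumptions:
import tensorflow as tf
from tensorflow.python.keras.initializers import GlorotUniform as GlorotUniformV1   # TF1-style (stateful) RNG
from tensorflow.python.keras.initializers import GlorotUniformV2                    # TF2-style (stateless) RNG

tf.compat.v1.set_random_seed(3)  # graph-level seed, as in the question

v1_vals = GlorotUniformV1(seed=5)(shape=(4, 4))  # should reproduce the "TF 1" matrix above
v2_vals = GlorotUniformV2(seed=5)(shape=(4, 4))  # should reproduce the "TF 2" matrix above
print(v1_vals.numpy())
print(v2_vals.numpy())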

Related

How to fix: frozen and None value

The problem was to create a normal distribution with mean 32 and standard deviation 4.5, setting the random seed to 1, and create a random sample of 100 elements from the above defined distribution. Finally, compute the absolute difference between the sample mean and the distribution mean.
This is one of the beginner stats problems in the course. I have experience in Python but not in stats.
x = stats.norm(loc=32,state=4.5)
y = np.random.seed(1)
mean1 = np.mean(x)
mean2 = np.mean(y)
diff = abs(mean1 - mean2)
The error I've been encountering is that x is a frozen value and y is None.
np.random.seed(1) sets the state of the pseudorandom number generator so that every run of this script gives the same output - and gives identical results for all students...
You need to execute this before generating your random numbers. The seed function doesn't have anything to return, so it returns None, which is the default return value in Python for functions that don't return anything specific.
Then you create your sample of size 100, and calculate its mean. As it is a sample, its mean will differ from the mean of the distribution (32): we calculate the absolute difference between these means.
You can experiment with different sample sizes, and see how the difference tends towards 0 when the size of the sample grows - you'll learn more about it in your course!
from scipy.stats import norm
import numpy as np
np.random.seed(1)
distribution_mean = 32
sample = norm.rvs(loc=distribution_mean, scale=4.5, size=100)
sample_mean = np.mean(sample)
print('sample:', sample)
print('sample mean:', sample_mean)
abs_diff = abs(sample_mean - distribution_mean)
print('absolute difference:', abs_diff)
Output:
sample: [39.30955414 29.24709614 29.62322711 27.1716412 35.89433433 21.64307586
39.85165294 28.57456895 33.43567593 30.87783331 38.57948572 22.72936681
30.54912258 30.2717554 37.10196249 27.0504893 31.22407307 28.04963712
32.18996186 34.62266846 27.0472137 37.15125669 36.05715824 34.26122453
36.05385177 28.92322463 31.44699399 27.78903755 30.79450364 34.3865996
28.88752662 30.21460913 28.90772285 28.19657461 28.97939241 31.9430093
26.97210343 33.05487064 39.4691098 35.33919872 31.13674001 28.00566966
28.63778768 39.6160457 32.2286349 29.13351959 32.85911968 41.45114811
32.54071529 34.77741399 33.35076644 30.41487569 26.85866811 30.42795775
31.05997595 34.63980436 35.77542536 36.18995937 33.28514296 35.98313524
28.60520927 37.6379067 34.30818419 30.65858224 34.19833166 31.65992729
37.09233224 38.83917567 41.83508933 25.71576649 25.50148788 29.72990362
32.72016681 35.94276015 33.42035726 22.90009453 30.62208194 35.72588589
33.03542631 35.42905031 30.99952336 31.09658869 32.83952626 33.84523241
32.89234874 32.53553891 28.98201971 33.69903704 32.54819572 37.08267759
37.39513046 32.83320388 30.31121772 29.12571317 33.90572459 32.34803031
30.45265846 32.19618586 29.2099962 35.14114415]
sample mean: 32.27262283434065
absolute difference: 0.2726228343406518
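You can see this effect directly; here is a small follow-up sketch (not part of the original exercise) that repeats the sampling at increasing sizes and prints the shrinking absolute difference:
from scipy.stats import norm
import numpy as np

np.random.seed(1)
for size in [10, 100, 1000, 10000, 100000]:
    sample = norm.rvs(loc=32, scale=4.5, size=size)   # sample from N(32, 4.5)
    print(size, abs(sample.mean() - 32))              # difference shrinks as size grows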

Isolation Forest gives different results when predicting one point instead of all

I am trying to detect anomalies in some data. I have normal data and data which are considered anomalous.
I use Isolation Forest from the scikit-learn library in Python. I have created a model from the normal data like this:
model = IsolationForest(n_estimators=100, contamination=0.002)
model.fit(new_features)
When I try to do prediction:
predicted = model.predict(transformed_anomaly)
It works correctly. 35 out of 36 are detected as anomalies.
If I do this:
for anomaly in transformed_anomaly:
    predicted = model.predict(anomaly.reshape(1, -1))
Suddenly all points are classified as inliers.
I checked the shape of anomaly.reshape(1, -1); it is (1, 2).
The shape of transformed_anomaly is (36, 2).
Could someone point out the problem with it?
Pass random_state=0 to IsolationForest to get the same results on every run.
model = IsolationForest(n_estimators=100, contamination=0.002, random_state=0)
I have one more solution - why not fix the seed value like this:
# Set a seed value
seed_value= 123
# 1. Set `PYTHONHASHSEED` environment variable at a fixed value
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
# 2. Set `python` built-in pseudo-random generator at a fixed value
import random
random.seed(seed_value)
# 3. Set `numpy` pseudo-random generator at a fixed value
import numpy as np
np.random.seed(seed_value)
This will help you get the same result every time on the same data, as it removes randomness from the model.
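Putting the two suggestions together, a minimal sketch of a fully reproducible run (new_features and transformed_anomaly are the arrays from the question and are assumed to exist):
import os
import random
import numpy as np
from sklearn.ensemble import IsolationForest

seed_value = 123
os.environ['PYTHONHASHSEED'] = str(seed_value)   # fix Python's hash seed
random.seed(seed_value)                          # fix the built-in RNG
np.random.seed(seed_value)                       # fix NumPy's RNG

model = IsolationForest(n_estimators=100, contamination=0.002, random_state=seed_value)
model.fit(new_features)
predicted = model.predict(transformed_anomaly)   # -1 = anomaly, 1 = inlier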

Python scikit learn pca.explained_variance_ratio_ cutoff

When choosing the number of principal components (k), we choose k to be the smallest value such that, for example, 99% of the variance is retained.
However, in Python scikit-learn, I am not 100% sure that pca.explained_variance_ratio_ = 0.99 is equal to "99% of the variance is retained". Could anyone clarify? Thanks.
The Python Scikit learn PCA manual is here
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
Yes, you are nearly right. The pca.explained_variance_ratio_ attribute returns a vector of the variance explained by each dimension. Thus pca.explained_variance_ratio_[i] gives the variance explained solely by the (i+1)-st dimension.
You probably want to do pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] returns the cumulative variance explained by the first i+1 dimensions.
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(0)
my_matrix = np.random.randn(20, 5)
my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)
print(my_model.explained_variance_)
print(my_model.explained_variance_ratio_)
print(my_model.explained_variance_ratio_.cumsum())
[ 1.50756565 1.29374452 0.97042041 0.61712667 0.31529082]
[ 0.32047581 0.27502207 0.20629036 0.13118776 0.067024 ]
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.
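If you want to automate the "smallest k" rule described above, here is a minimal sketch reusing my_model fitted earlier and assuming a 99% target:
import numpy as np

cum_var = my_model.explained_variance_ratio_.cumsum()
k = int(np.searchsorted(cum_var, 0.99)) + 1   # first index where the cumulative ratio reaches 0.99
print(k, cum_var[k - 1])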
Although this question is more than two years old, I want to provide an update on this.
I wanted to do the same and it looks like sklearn now provides this feature out of the box.
As stated in the docs
if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components
So the code required is now
my_model = PCA(n_components=0.99, svd_solver='full')
my_model.fit_transform(my_matrix)
This worked for me with even less typing in the PCA section. The rest is added for convenience; only 'data' needs to be defined at an earlier stage.
from sklearn.preprocessing import StandardScaler as ss
from sklearn.decomposition import PCA
st = ss().fit_transform(data)
pca = PCA(0.80)
pc = pca.fit_transform(st)  # << to retain the components in an object
pc  # displays the retained components when run interactively
#pca.explained_variance_ratio_
print("Components =", pca.n_components_,
      ";\nTotal explained variance =", round(pca.explained_variance_ratio_.sum(), 5))

OLS using statsmodel.formula.api versus statsmodel.api

Can anyone explain to me the difference between ols in statsmodel.formula.api versus ols in statsmodel.api?
Using the Advertising data from the ISLR text, I ran an ols using both, and got different results. I then compared with scikit-learn's LinearRegression.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
df = pd.read_csv("C:\...\Advertising.csv")
x1 = df.loc[:,['TV']]
y1 = df.loc[:,['Sales']]
print "Statsmodel.Formula.Api Method"
model1 = smf.ols(formula='Sales ~ TV', data=df).fit()
print model1.params
print "\nStatsmodel.Api Method"
model2 = sm.OLS(y1, x1)
results = model2.fit()
print results.params
print "\nSci-Kit Learn Method"
model3 = LinearRegression()
model3.fit(x1, y1)
print model3.coef_
print model3.intercept_
The output is as follows:
Statsmodel.Formula.Api Method
Intercept 7.032594
TV 0.047537
dtype: float64
Statsmodel.Api Method
TV 0.08325
dtype: float64
Sci-Kit Learn Method
[[ 0.04753664]]
[ 7.03259355]
The statsmodel.api method returns a different parameter for TV from the statsmodel.formula.api and the scikit-learn methods.
What kind of ols algorithm is statsmodel.api running that would produce a different result? Does anyone have a link to documentation that could help answer this question?
Came across this issue today and wanted to elaborate on stellasia's answer, because the statsmodels documentation is perhaps a bit ambiguous.
Unless you are using actual R-style string formulas when instantiating OLS, you need to add a constant (literally a column of 1s) under both statsmodels.formula.api and plain statsmodels.api. Chetan is using R-style formatting here (formula='Sales ~ TV'), so he will not run into this subtlety, but for people with some Python knowledge but no R background this could be very confusing.
Furthermore, it doesn't matter whether you specify the hasconst parameter when building the model. (Which is kind of silly.) In other words, unless you are using R-style string formulas, hasconst is ignored even though it is supposed to
[Indicate] whether the RHS includes a user-supplied constant
because, in the footnotes
No constant is added by the model unless you are using formulas.
The example below shows that both .formula.api and .api will require a user-added column vector of 1s if not using R-style string formulas.
# Generate some relational data
import numpy as np
import statsmodels.api as sm

np.random.seed(123)
nobs = 25
x = np.random.random((nobs, 2))
x_with_ones = sm.add_constant(x, prepend=False)
beta = [.1, .5, 1]
e = np.random.random(nobs)
y = np.dot(x_with_ones, beta) + e
Now throw x and y into Excel and run Data>Data Analysis>Regression, making sure "Constant is zero" is unchecked. You'll get the following coefficients:
Intercept 1.497761024
X Variable 1 0.012073045
X Variable 2 0.623936056
Now, try running this regression on x, not x_with_ones, in either statsmodels.formula.api or statsmodels.api with hasconst set to None, True, or False. You'll see that in each of those 6 scenarios, there is no intercept returned. (There are only 2 parameters.)
import statsmodels.formula.api as smf
import statsmodels.api as sm
print('smf models')
print('-' * 10)
for hc in [None, True, False]:
    model = smf.OLS(endog=y, exog=x, hasconst=hc).fit()
    print(model.params)
# smf models
# ----------
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
# [ 1.46852293 1.8558273 ]
Now running things correctly with a column vector of 1.0s added to x. You can use smf here but it's really not necessary if you're not using formulas.
print('sm models')
print('-' * 10)
for hc in [None, True, False]:
    model = sm.OLS(endog=y, exog=x_with_ones, hasconst=hc).fit()
    print(model.params)
# sm models
# ----------
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
# [ 0.01207304 0.62393606 1.49776102]
The difference is due to the presence or absence of an intercept:
in statsmodels.formula.api, similarly to the R approach, a constant is automatically added to your data and an intercept is fitted
in statsmodels.api, you have to add a constant yourself (see the documentation here). Try using add_constant from statsmodels.api
x1 = sm.add_constant(x1)
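Applied to the question's code, a sketch of what the fix should look like; the values in the comment are simply the ones already reported above for the formula API and scikit-learn, which the plain-API fit should now match:
import statsmodels.api as sm

x1_const = sm.add_constant(x1)        # adds a 'const' column of 1.0s
results = sm.OLS(y1, x1_const).fit()
print(results.params)                 # expect const ~ 7.0326 and TV ~ 0.0475, as above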
I had a similar issue with the Logit function.
(I used patsy to create my matrices, so the intercept was there.)
My sm.logit was not converging.
My sm.formula.logit was converging however.
Data going in was exactly the same.
I changed the solver method to 'newton' and the sm.logit converged also.
Is it possible the two versions have different default solver methods?
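For reference, a sketch of how the solver can be selected explicitly at fit time, assuming y and a design matrix X with an intercept column already exist (e.g. built with patsy as mentioned above):
import statsmodels.api as sm

logit_model = sm.Logit(y, X)
logit_result = logit_model.fit(method='newton', maxiter=100)   # pick the solver explicitly
print(logit_result.params)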

What's the correct usage of matplotlib.mlab.normpdf()?

I intend for part of a program I'm writing to automatically generate Gaussian distributions of various statistics over multiple raw text sources; however, I'm having some issues generating the graphs as per the guide at:
python pylab plot normal distribution
The general gist of the plot code is as follows.
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as pyplot
meanAverage = 222.89219487179491 # typical value calculated beforehand
standardDeviation = 3.8857889432054091 # typical value calculated beforehand
x = np.linspace(-3,3,100)
pyplot.plot(x,mlab.normpdf(x,meanAverage,standardDeviation))
pyplot.show()
All it does is produce a rather flat looking and useless y = 0 line!
Can anyone see what the problem is here?
Cheers.
If you read the documentation of matplotlib.mlab.normpdf, this function is deprecated and you should use scipy.stats.norm.pdf instead.
Deprecated since version 2.2: scipy.stats.norm.pdf
And because your distribution mean is about 222, you should evaluate x around that mean, for example np.linspace(210, 235, 100) (roughly mean ± 3 standard deviations).
So your code will look like:
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as pyplot
meanAverage = 222.89219487179491 # typical value calculated beforehand
standardDeviation = 3.8857889432054091 # typical value calculated beforehand
x = np.linspace(210, 235, 100)  # roughly mean ± 3 standard deviations
pyplot.plot(x, norm.pdf(x, meanAverage, standardDeviation))
pyplot.show()
It looks like you made a few small but significant errors: either you are choosing your x vector wrong, or you swapped your standard deviation and mean. Since your mean is at about 222, you probably want your x vector in that area, maybe something like 150 to 300. That way you capture all the interesting part of the curve; right now you are looking at -3 to 3, which is far out in the tail of the distribution. Hope that helps.
I see that, for the *args to which you are sending meanAverage and standardDeviation, the correct things to send are:
mu : a numdims array of means of a
sigma : a numdims array of standard deviations of a
Does this help?
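For completeness, the curve that mlab.normpdf computed (a Gaussian with the given mu and sigma) can also be produced directly with NumPy, avoiding the deprecated helper entirely; a minimal sketch:
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 222.89219487179491, 3.8857889432054091
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 200)           # cover mean ± 4 sigma
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
plt.plot(x, pdf)
plt.show()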
