Library errors with pmdarima and statsmodels - python

I have a problem with some libraries for time series.
In particular, the first error is raised when I import this library:
from pmdarima.arima import auto_arima
As suggested in another post, I used the command !pip install pmdarima to solve this problem. But then I have to restart the runtime, otherwise the code won't run, and I also have to re-run the command every time I open my Colab/Jupyter notebook.
So my first question is related to this issue: is there any way to avoid this process every time?
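One way to sidestep the per-session reinstall is to guard the install behind the import (a minimal sketch, assuming a standard Colab/Jupyter environment where pip is available for the running interpreter):
# Install pmdarima only if it is missing; on Colab the package disappears
# whenever the VM is recycled, so this cell is cheap to keep at the top.
import subprocess
import sys
try:
    import pmdarima
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pmdarima"])
    import pmdarima
Note that a restart is then only needed right after a fresh install, not on every run.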
The second problem is connected to the first one, because I also import these other libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import datetime
from pmdarima.arima import auto_arima
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse
from statsmodels.tsa.stattools import adfuller
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from pmdarima import auto_arima
from statsmodels.tsa.statespace.sarimax import SARIMAX
Assuming the first problem is solved, I then have several lines of code for the time series prediction, and the error appears when I use a function that builds the ARIMA model:
def Predict(train, test, Order1, Order2, Order3, parForecastLenght=31):
    # Build model
    model = ARIMA(train.astype("float32"), order=(Order1, Order2, Order3))
    fitted = model.fit(disp=-1)
    # Forecast
    fc, se, conf = fitted.forecast(parForecastLenght, alpha=0.05)
    # Make as pandas series
    fc_series = pd.Series(fc, index=test.iloc[0:parForecastLenght].index)
    lower_series = pd.Series(conf[:, 0], index=test.iloc[0:parForecastLenght].index)
    upper_series = pd.Series(conf[:, 1], index=test.iloc[0:parForecastLenght].index)
    # Plot
    plt.figure(figsize=(12, 5), dpi=100)
    plt.plot(train, label='training')
    plt.plot(test, label='actual')
    plt.plot(fc_series, label='forecast')
    plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.15)
    plt.title('Forecast vs Actuals')
    plt.legend(loc='upper left', fontsize=8)
    plt.show()
    return fc_series
When I try to execute this code:
model1 = Predict(train_Att_Assunzioni, test_Att_Assunzioni, 0, 0, 0, 30)
this kind of error appears:
NotImplementedError:
statsmodels.tsa.arima_model.ARMA and statsmodels.tsa.arima_model.ARIMA have
been removed in favor of statsmodels.tsa.arima.model.ARIMA (note the .
between arima and model) and statsmodels.tsa.SARIMAX.
statsmodels.tsa.arima.model.ARIMA makes use of the statespace framework and
is both well tested and maintained. It also offers alternative specialized
parameter estimators.
So again I checked posts on Stack Overflow and tried to implement the suggested operations, but nothing seems to work except for substituting the import from statsmodels.tsa.arima_model import ARIMA with from statsmodels.tsa.arima.model import ARIMA,
but then the first problem rises again.
N.B. I tried to install statsmodels and pmdarima, and I tried to change my work environment from Colab to JupyterLab, but nothing worked.
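For reference, here is a sketch of the Predict function ported to the new API (an assumption: statsmodels >= 0.12, where fit() no longer accepts disp and interval forecasts come from get_forecast()); it relies on the pandas/numpy/matplotlib imports already listed above:
from statsmodels.tsa.arima.model import ARIMA  # note the dot between arima and model
def Predict(train, test, Order1, Order2, Order3, parForecastLenght=31):
    # Build and fit the model (the new fit() takes no disp argument)
    model = ARIMA(train.astype("float32"), order=(Order1, Order2, Order3))
    fitted = model.fit()
    # get_forecast() exposes both the point forecast and the confidence interval
    forecast = fitted.get_forecast(parForecastLenght)
    fc_series = pd.Series(np.asarray(forecast.predicted_mean),
                          index=test.iloc[0:parForecastLenght].index)
    conf = np.asarray(forecast.conf_int(alpha=0.05))
    lower_series = pd.Series(conf[:, 0], index=fc_series.index)
    upper_series = pd.Series(conf[:, 1], index=fc_series.index)
    # Plotting is the same as in the original function
    plt.figure(figsize=(12, 5), dpi=100)
    plt.plot(train, label='training')
    plt.plot(test, label='actual')
    plt.plot(fc_series, label='forecast')
    plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.15)
    plt.title('Forecast vs Actuals')
    plt.legend(loc='upper left', fontsize=8)
    plt.show()
    return fc_series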

Related

cannot find regression in sklearn.metrics

I'm trying to use the following:
from fireTS.models import NARX, DirectAutoRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import numpy as np
import scipy
import sklearn
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
However, upon running the first line, this error appears:
ModuleNotFoundError: No module named 'sklearn.metrics.regression'
Interestingly, I cannot find anything on the web about this problem (even in a recently asked Stack Overflow question about it from 26+ days ago).
Has anyone encountered the same and been able to fix this?
EDIT:
So I found the fix.
I went to the directory where my fireTS package is installed and opened models.py.
I changed the following:
from sklearn.metrics.regression import r2_score, mean_squared_error
to
from sklearn.metrics import r2_score, mean_squared_error
and voilà, NO MORE ERRORS :)
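If you would rather not edit the installed package, a hedged alternative is to register a stand-in module under the removed name before importing fireTS (this assumes r2_score and mean_squared_error are the only names fireTS pulls from sklearn.metrics.regression):
import sys
import types
import sklearn.metrics as metrics
# sklearn.metrics.regression no longer exists; recreate it so the old import resolves.
shim = types.ModuleType("sklearn.metrics.regression")
shim.r2_score = metrics.r2_score
shim.mean_squared_error = metrics.mean_squared_error
sys.modules["sklearn.metrics.regression"] = shim
from fireTS.models import NARX, DirectAutoRegressor  # now imports cleanly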

No module named 'statsmodels.tsa.arima' in Colab but not in Pycharm

# ARIMA example
from statsmodels.tsa.arima.model import ARIMA
data = [200,30,30,35,30,20,26,35,30,33,40,29,29,30,30,30,30,20,26,35,30,33,40,29,29,30,30,30]
# fit model
model = ARIMA(data, order=(10, 1, 10))
model_fit = model.fit()
# make prediction
yhat = model_fit.predict(len(data), len(data), typ='levels')
print(yhat)
The import from statsmodels.tsa.arima.model import ARIMA is working perfectly in PyCharm, but while running the same code in Colab it throws No module named 'statsmodels.tsa.arima'.
There is very little support on the internet for this library, so I would appreciate any sort of help or workaround, please.
Try:
from statsmodels.tsa.arima_model import ARIMA
If you don't have statsmodels installed, then also run:
pip install statsmodels
You can import statsmodels like this:
import statsmodels.api as sm
And then you can use ARIMA like this:
model = sm.tsa.arima.ARIMA(data, order=(10, 1, 10))
The arima_model import is deprecated. You can read more about using ARIMA here.
You need a newer version. Try to run the following in your Colab:
!pip install statsmodels==0.12.1
It will allow the import that you want.
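As a quick sanity check before importing (hedged: the statsmodels.tsa.arima.model path only exists from statsmodels 0.12 onward):
import statsmodels
print(statsmodels.__version__)  # should be >= 0.12 for statsmodels.tsa.arima.model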

MultiGPU Kmeans clustering with RAPIDs freezes

I am new to Python and RAPIDS.AI, and I am trying to recreate scikit-learn's KMeans on a multi-GPU setup (I have 2 GPUs) using Dask and RAPIDS (I am using RAPIDS with its Docker image, which also mounts a Jupyter Notebook).
The code I show below (along with an example on the Iris dataset) freezes, and the Jupyter notebook cell never finishes. I tried the %debug magic and the Dask dashboard, but I could not draw any clear conclusions (the only candidate I found is device_m_csv.iloc, but I am not sure about it). Another possibility is that I am forgetting some wait(), compute(), or persist() call (really, I am not sure on what occasions they should be used).
I will explain the code, for easier reading:
First of all, do the needed imports.
Next, the KMeans algorithm starts (delimiter: #######################...).
Create a CUDA cluster with 2 workers, one per GPU (I have 2 GPUs) and 1 thread per worker (I have read this is the recommended value), and start a client.
Read the dataset from CSV, making 2 partitions (chunksize='2kB').
Split the dataset into data (better known as X) and labels (better known as y).
Instantiate cu_KMeans using Dask.
Fit the model.
Predict values.
Check the obtained score.
Sorry for not being able to offer more data, but I couldn't get it. I will be happy to offer whatever is necessary to resolve the question.
Where do you think the problem is?
Thank you very much in advance.
%%time
# Import libraries and show its versions
import numpy as np; print('NumPy Version:', np.__version__)
import pandas as pd; print('Pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
import nvstrings, nvcategory
import cupy; print('cuPY Version:', cupy.__version__)
import cudf; print('cuDF Version:', cudf.__version__)
import cuml; print('cuML Version:', cuml.__version__)
import dask; print('Dask Version:', dask.__version__)
import dask_cuda; print('DaskCuda Version:', dask_cuda.__version__)
import dask_cudf; print('DaskCuDF Version:', dask_cudf.__version__)
import matplotlib; print('MatPlotLib Version:', matplotlib.__version__)
import seaborn as sns; print('SeaBorn Version:', sns.__version__)
#import time
import warnings
from dask import delayed
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait
from dask_ml.cluster import KMeans as skmKMeans
from dask_cuda import LocalCUDACluster
from sklearn import metrics
from sklearn.cluster import KMeans as skKMeans
from sklearn.metrics import adjusted_rand_score as sk_adjusted_rand_score, silhouette_score as sk_silhouette_score
from cuml.cluster import KMeans as cuKMeans
from cuml.dask.cluster.kmeans import KMeans as cumKMeans
from cuml.metrics import adjusted_rand_score as cu_adjusted_rand_score
# Configure matplotlib library
import matplotlib.pyplot as plt
%matplotlib inline
# Configure seaborn library
sns.set()
#sns.set(style="white", color_codes=True)
%config InlineBackend.figure_format = 'svg'
# Configure warnings
#warnings.filterwarnings("ignore")
####################################### KMEANS #############################################################
# Create local cluster
cluster = LocalCUDACluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
# Identify number of workers
n_workers = len(client.has_what().keys())
# Read data in host memory
device_m_csv = dask_cudf.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',', chunksize='2kB') # Get complete CSV. Chunksize is 2kb for getting 2 partitions
#x = host_data.iloc[:, [0,1,2,3]].values
device_m_data = device_m_csv.iloc[:, [0, 1, 2, 3]] # Get data columns
device_m_labels = device_m_csv.iloc[:, 4] # Get labels column
# Plot data
#sns.pairplot(device_csv.to_pandas(), hue='variety');
# Define variables
label_type = { 'Setosa': 1, 'Versicolor': 2, 'Virginica': 3 } # Dictionary of variables type
# Create KMeans
cu_m_kmeans = cumKMeans(init='k-means||',
                        n_clusters=len(device_m_labels.unique()),
                        oversampling_factor=40,
                        random_state=0)
# Fit data in KMeans
cu_m_kmeans.fit(device_m_data)
# Predict data
cu_m_kmeans_labels_predicted = cu_m_kmeans.predict(device_m_data).compute()
# Check score
#print('Cluster centers:\n',cu_m_kmeans.cluster_centers_)
#print('adjusted_rand_score: ', sk_adjusted_rand_score(device_m_labels, cu_m_kmeans.labels_))
#print('silhouette_score: ', sk_silhouette_score(device_m_data.to_pandas(), cu_m_kmeans_labels_predicted))
# Close local cluster
client.close()
cluster.close()
Iris dataset example: (screenshot not included here)
EDIT 1
@Corey, this is my output using your code:
NumPy Version: 1.17.5
Pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1
cuPY Version: 6.7.0
cuDF Version: 0.12.0
cuML Version: 0.12.0
Dask Version: 2.10.1
DaskCuda Version: 0+unknown
DaskCuDF Version: 0.12.0
MatPlotLib Version: 3.1.3
SeaBorn Version: 0.10.0
Cluster centers:
0 1 2 3
0 5.006000 3.428000 1.462000 0.246000
1 5.901613 2.748387 4.393548 1.433871
2 6.850000 3.073684 5.742105 2.071053
adjusted_rand_score: 0.7302382722834697
silhouette_score: 0.5528190123564102
I modified your reproducible example slightly and was able to produce an output on the most recent nightly of RAPIDS.
This is the output of the script.
(cuml_dev_2) cjnolet@deeplearn ~ $ python ~/kmeans_mnmg_reproduce.py
NumPy Version: 1.18.1
Pandas Version: 0.25.3
Scikit-Learn Version: 0.22.2.post1
cuPY Version: 7.2.0
cuDF Version: 0.13.0a+3237.g61e4d9c
cuML Version: 0.13.0a+891.g4f44f7f
Dask Version: 2.11.0+28.g10db6ba
DaskCuda Version: 0+unknown
DaskCuDF Version: 0.13.0a+3237.g61e4d9c
MatPlotLib Version: 3.2.0
SeaBorn Version: 0.10.0
/share/software/miniconda3/envs/cuml_dev_2/lib/python3.7/site-packages/dask/array/random.py:27: FutureWarning: dask.array.random.doc_wraps is deprecated and will be removed in a future version
FutureWarning,
/share/software/miniconda3/envs/cuml_dev_2/lib/python3.7/site-packages/distributed/dashboard/core.py:79: UserWarning:
Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn("\n" + msg)
bokeh.server.util - WARNING - Host wildcard '*' will allow connections originating from multiple (or possibly all) hostnames or IPs. Use non-wildcard values to restrict access explicitly
Cluster centers:
0 1 2 3
0 5.883607 2.740984 4.388525 1.434426
1 5.006000 3.428000 1.462000 0.246000
2 6.853846 3.076923 5.715385 2.053846
adjusted_rand_score: 0.7163421126838475
silhouette_score: 0.5511916046195927
And here is the modified script that produced this output:
# Import libraries and show its versions
import numpy as np; print('NumPy Version:', np.__version__)
import pandas as pd; print('Pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
import nvstrings, nvcategory
import cupy; print('cuPY Version:', cupy.__version__)
import cudf; print('cuDF Version:', cudf.__version__)
import cuml; print('cuML Version:', cuml.__version__)
import dask; print('Dask Version:', dask.__version__)
import dask_cuda; print('DaskCuda Version:', dask_cuda.__version__)
import dask_cudf; print('DaskCuDF Version:', dask_cudf.__version__)
import matplotlib; print('MatPlotLib Version:', matplotlib.__version__)
import seaborn as sns; print('SeaBorn Version:', sns.__version__)
#import time
import warnings
from dask import delayed
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait
from dask_ml.cluster import KMeans as skmKMeans
from dask_cuda import LocalCUDACluster
from sklearn import metrics
from sklearn.cluster import KMeans as skKMeans
from sklearn.metrics import adjusted_rand_score as sk_adjusted_rand_score, silhouette_score as sk_silhouette_score
from cuml.cluster import KMeans as cuKMeans
from cuml.dask.cluster.kmeans import KMeans as cumKMeans
from cuml.metrics import adjusted_rand_score as cu_adjusted_rand_score
# Configure matplotlib library
import matplotlib.pyplot as plt
# Configure seaborn library
sns.set()
#sns.set(style="white", color_codes=True)
# Configure warnings
#warnings.filterwarnings("ignore")
####################################### KMEANS #############################################################
# Create local cluster
cluster = LocalCUDACluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
# Identify number of workers
n_workers = len(client.has_what().keys())
# Read data in host memory
from sklearn.datasets import load_iris
loader = load_iris()
#x = host_data.iloc[:, [0,1,2,3]].values
device_m_data = dask_cudf.from_cudf(cudf.from_pandas(pd.DataFrame(loader.data)), npartitions=2) # Get data columns
device_m_labels = dask_cudf.from_cudf(cudf.from_pandas(pd.DataFrame(loader.target)), npartitions=2)
# Plot data
#sns.pairplot(device_csv.to_pandas(), hue='variety');
# Define variables
label_type = { 'Setosa': 1, 'Versicolor': 2, 'Virginica': 3 } # Dictionary of variables type
# Create KMeans
cu_m_kmeans = cumKMeans(init='k-means||',
                        n_clusters=len(np.unique(loader.target)),
                        oversampling_factor=40,
                        random_state=0)
# Fit data in KMeans
cu_m_kmeans.fit(device_m_data)
# Predict data
cu_m_kmeans_labels_predicted = cu_m_kmeans.predict(device_m_data).compute()
# Check score
print('Cluster centers:\n',cu_m_kmeans.cluster_centers_)
print('adjusted_rand_score: ', sk_adjusted_rand_score(loader.target, cu_m_kmeans_labels_predicted.values.get()))
print('silhouette_score: ', sk_silhouette_score(device_m_data.compute().to_pandas(), cu_m_kmeans_labels_predicted))
# Close local cluster
client.close()
cluster.close()
Can you please provide your output for the versions of these libraries? I would also recommend running the modified script to see whether it runs successfully for you. If not, we can dive in further to find out whether it's Docker-related, RAPIDS-version-related, or something else.
If you have access to the command prompt that's running your Jupyter notebook, it might be helpful to enable logging by passing verbose=True when constructing the KMeans object. This can help us isolate where things are getting stuck.
The Dask documentation is really good and extensive, though I admit the flexibility and the number of features it provides can sometimes be a little overwhelming. I think it helps to see Dask as an API for distributed computing that gives the user control over a few different layers of execution, each layer providing more fine-grained control.
compute(), wait(), and persist() are concepts that come from the manner in which the tasks that underlie a series of distributed computations are scheduled on a set of workers. What's common to all of these computations is an execution graph that represents remote tasks and their inter-dependencies. At some point, this execution graph gets scheduled on a set of workers. Dask provides two APIs, depending on whether the tasks underlying the graph are scheduled immediately (eagerly) or whether the computation needs to be triggered manually (lazily).
Both of these APIs build the execution graph as tasks are created that depend on the results of other tasks. The former uses the dask.futures API for immediate asynchronous execution, the results of which you may sometimes want to wait() on before doing other operations. The dask.delayed API is used for lazy executions and requires the invocation of methods like compute() or persist() in order to begin computation.
Most often, users of libraries like RAPIDS are more concerned with manipulating their data and aren't as concerned with how those manipulations are scheduled on the set of workers. The dask.dataframe and dask.array objects are built on top of the delayed and futures APIs. Most users interact with these data structures rather than interacting with delayed and futures objects, but it's not a bad idea to be aware of them if you should ever need to do some data transformations outside of what the distributed dataframe and array objects provide.
dask.dataframe and dask.array both build lazy execution graphs wherever possible and provide a compute() method to materialize the graph and return the result to the client. They both also provide a persist() method to start computation asynchronously in the background. wait() is useful if you want to begin computation in the background but do not want to return the results to the client.
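As a small illustration of the distinction described above (a sketch with hypothetical arrays, not tied to the question's dataset):
import dask.array as da
from dask.distributed import Client, wait
client = Client()  # local scheduler and workers
x = da.random.random((10_000, 100), chunks=(1_000, 100))  # lazy: builds a graph
y = (x - x.mean(axis=0)).sum()  # still lazy: extends the graph
result = y.compute()  # materialize the graph and return the result to the client
persisted = x.persist()  # start computing in the background, keep results on the workers
wait(persisted)  # block until the background computation finishes
client.close()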
I hope this is helpful.

Scikit learn does not appear to respect global / local random_states in unittests

I'm trying to write an integration test that uses the descriptive statistics (.describe().to_list()) of the results of a model prediction (model.predict(X)). However, even though I've set np.random.seed(###), the descriptive statistics are different after running the tests in the console vs. in the environment created by PyCharm:
Here's a MRE for local:
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
np.random.seed(42)
X, y = make_regression(n_features=2, random_state=42)
regr = ElasticNet(random_state=42)
regr.fit(X, y)
pred = regr.predict(X)
# Theory: This result should be the same from the result in a class
pd.Series(pred).describe().to_list()
And an example test-file:
from unittest import TestCase
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
np.random.seed(42)
class TestPD(TestCase):
    def testExpectedPrediction(self):
        np.random.seed(42)
        X, y = make_regression(n_features=2, random_state=42)
        regr = ElasticNet(random_state=42)
        regr.fit(X, y)
        pred = pd.Series(regr.predict(X))
        for i in pred.describe().to_list():
            print(i)
        # here we would have a self.assertTrue/Equals for each element
What appears to happen is that when I run this test in the Python Console, I get one result; but when I run it using PyCharm's unit tests for the folder, I get another result. Importantly, in PyCharm the project interpreter is used to create an environment for the console that ought to be the same as the test environment. This leads me to believe that I'm missing something about the way random_state is passed along. My expectation, given that I have set a seed, is that the results would be reproducible. But that doesn't appear to be the case, and I would like to understand:
Why aren't they equal?
What can I do to make them equal?
I haven't been able to find many best practices with respect to testing against expected model results, so commentary in that regard would also be helpful.
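On the last point, one hedged pattern (an illustration building on the MRE above, not a fix for the console/PyCharm discrepancy itself) is to compare against a recorded baseline with a numeric tolerance rather than asserting exact equality across environments:
import numpy as np
# EXPECTED is hypothetical: record it once from a run you trust, then hard-code it.
EXPECTED = pd.Series(pred).describe().to_list()
# In the test, allow small floating-point drift between environments.
np.testing.assert_allclose(pd.Series(pred).describe().to_list(), EXPECTED, rtol=1e-6)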

catboost shows very bad result on a toy dataset

Today I tried to test the amazing CatBoost library recently published by Yandex, but it shows very poor results even on a toy dataset. I've tried to find the root of my problem, but due to the lack of proper documentation and topics about the library, I can't figure out what's going on. Please help me =)
I'm using Anaconda 3 x64 with Python 3.6.
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plotting at the end
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, f1_score, make_scorer
from catboost import CatBoostClassifier
X, y = make_classification(n_classes=2,
                           n_clusters_per_class=2,
                           n_features=10,
                           n_informative=4,
                           n_repeated=2,
                           shuffle=True,
                           random_state=564,
                           n_samples=10000)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
cb = CatBoostClassifier(depth=3,
                        custom_loss=['Accuracy', 'AUC'],
                        logging_level='Silent',
                        iterations=500,
                        od_type='Iter',
                        od_wait=20)
cb.fit(X_train, y_train, eval_set=(X_test, y_test), plot=True, use_best_model=True)
pred = cb.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_score=pred, y_true=y_test)  # roc_curve returns (fpr, tpr, thresholds)
# just to show the difference
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier().fit(X_train, y_train)
pred_gbc = gbc.predict_proba(X_test)[:, 1]
fpr_gbc, tpr_gbc, _ = roc_curve(y_score=pred_gbc, y_true=y_test)
plt.plot(fpr, tpr, color='orange')
plt.plot(fpr_gbc, tpr_gbc, color='red')
plt.show()
It was a bug. Be careful and ensure you are using the latest version; the bug was fixed in version 0.6.1.
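A quick way to check the installed version (a hedged snippet; only the 0.6.1 threshold comes from the answer above):
import catboost
print(catboost.__version__)  # the fix discussed above landed in 0.6.1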
