MultiGPU Kmeans clustering with RAPIDs freezes - python

I am new to Python and RAPIDS, and I am trying to recreate scikit-learn KMeans on a multi-GPU setup (I have 2 GPUs) using Dask and RAPIDS (I am using the RAPIDS Docker image, which also mounts a Jupyter Notebook).
The code I show below (I also show an example with the Iris dataset) freezes, and the Jupyter notebook cell never finishes. I tried the %debug magic and the Dask dashboard, but I could not draw any clear conclusions (the only thing I suspect is device_m_csv.iloc, but I am not sure about it). It may also be that I am missing a wait(), compute() or persist() call (honestly, I am not sure in which situations they should be used).
I will explain the code for easier reading:
First, do the needed imports.
Then the KMeans part starts (after the #######################... delimiter):
Create a CUDA cluster with 2 workers, one per GPU (I have 2 GPUs), and 1 thread per worker (I have read this is the recommended value), and start a client.
Read the dataset from CSV into 2 partitions (chunksize = '2kB').
Split the previous dataset into data (better known as X) and labels (better known as y).
Instantiate cu_KMeans using Dask.
Fit the model.
Predict values.
Check the obtained score.
Sorry for not being able to offer more details, but I could not get any. I will gladly provide whatever is needed to resolve the question.
Where do you think the problem is?
Thank you very much in advance.
%%time
# Import libraries and show its versions
import numpy as np; print('NumPy Version:', np.__version__)
import pandas as pd; print('Pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
import nvstrings, nvcategory
import cupy; print('cuPY Version:', cupy.__version__)
import cudf; print('cuDF Version:', cudf.__version__)
import cuml; print('cuML Version:', cuml.__version__)
import dask; print('Dask Version:', dask.__version__)
import dask_cuda; print('DaskCuda Version:', dask_cuda.__version__)
import dask_cudf; print('DaskCuDF Version:', dask_cudf.__version__)
import matplotlib; print('MatPlotLib Version:', matplotlib.__version__)
import seaborn as sns; print('SeaBorn Version:', sns.__version__)
#import time
#import warnings
from dask import delayed
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait
from dask_ml.cluster import KMeans as skmKMeans
from dask_cuda import LocalCUDACluster
from sklearn import metrics
from sklearn.cluster import KMeans as skKMeans
from sklearn.metrics import adjusted_rand_score as sk_adjusted_rand_score, silhouette_score as sk_silhouette_score
from cuml.cluster import KMeans as cuKMeans
from cuml.dask.cluster.kmeans import KMeans as cumKMeans
from cuml.metrics import adjusted_rand_score as cu_adjusted_rand_score
# Configure matplotlib library
import matplotlib.pyplot as plt
%matplotlib inline
# Configure seaborn library
sns.set()
#sns.set(style="white", color_codes=True)
%config InlineBackend.figure_format = 'svg'
# Configure warnings
#warnings.filterwarnings("ignore")
####################################### KMEANS #############################################################
# Create local cluster
cluster = LocalCUDACluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
# Identify number of workers
n_workers = len(client.has_what().keys())
# Read data in host memory
device_m_csv = dask_cudf.read_csv('./DataSet/iris.csv', header = 0, delimiter = ',', chunksize='2kB') # Get complete CSV. Chunksize is 2kb for getting 2 partitions
#x = host_data.iloc[:, [0,1,2,3]].values
device_m_data = device_m_csv.iloc[:, [0, 1, 2, 3]] # Get data columns
device_m_labels = device_m_csv.iloc[:, 4] # Get labels column
# Plot data
#sns.pairplot(device_csv.to_pandas(), hue='variety');
# Define variables
label_type = { 'Setosa': 1, 'Versicolor': 2, 'Virginica': 3 } # Dictionary of variables type
# Create KMeans
cu_m_kmeans = cumKMeans(init = 'k-means||',
                        n_clusters = len(device_m_labels.unique()),
                        oversampling_factor = 40,
                        random_state = 0)
# Fit data in KMeans
cu_m_kmeans.fit(device_m_data)
# Predict data
cu_m_kmeans_labels_predicted = cu_m_kmeans.predict(device_m_data).compute()
# Check score
#print('Cluster centers:\n',cu_m_kmeans.cluster_centers_)
#print('adjusted_rand_score: ', sk_adjusted_rand_score(device_m_labels, cu_m_kmeans.labels_))
#print('silhouette_score: ', sk_silhouette_score(device_m_data.to_pandas(), cu_m_kmeans_labels_predicted))
# Close local cluster
client.close()
cluster.close()
Iris dataset example:
EDIT 1
@Corey, this is my output using your code:
NumPy Version: 1.17.5
Pandas Version: 0.25.3
Scikit-Learn Version: 0.22.1
cuPY Version: 6.7.0
cuDF Version: 0.12.0
cuML Version: 0.12.0
Dask Version: 2.10.1
DaskCuda Version: 0+unknown
DaskCuDF Version: 0.12.0
MatPlotLib Version: 3.1.3
SeaBorn Version: 0.10.0
Cluster centers:
0 1 2 3
0 5.006000 3.428000 1.462000 0.246000
1 5.901613 2.748387 4.393548 1.433871
2 6.850000 3.073684 5.742105 2.071053
adjusted_rand_score: 0.7302382722834697
silhouette_score: 0.5528190123564102

I modified your reproducible example slightly and was able to produce an output on the most recent nightly build of RAPIDS.
This is the output of the script:
(cuml_dev_2) cjnolet@deeplearn ~ $ python ~/kmeans_mnmg_reproduce.py
NumPy Version: 1.18.1
Pandas Version: 0.25.3
Scikit-Learn Version: 0.22.2.post1
cuPY Version: 7.2.0
cuDF Version: 0.13.0a+3237.g61e4d9c
cuML Version: 0.13.0a+891.g4f44f7f
Dask Version: 2.11.0+28.g10db6ba
DaskCuda Version: 0+unknown
DaskCuDF Version: 0.13.0a+3237.g61e4d9c
MatPlotLib Version: 3.2.0
SeaBorn Version: 0.10.0
/share/software/miniconda3/envs/cuml_dev_2/lib/python3.7/site-packages/dask/array/random.py:27: FutureWarning: dask.array.random.doc_wraps is deprecated and will be removed in a future version
FutureWarning,
/share/software/miniconda3/envs/cuml_dev_2/lib/python3.7/site-packages/distributed/dashboard/core.py:79: UserWarning:
Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.
warnings.warn("\n" + msg)
bokeh.server.util - WARNING - Host wildcard '*' will allow connections originating from multiple (or possibly all) hostnames or IPs. Use non-wildcard values to restrict access explicitly
Cluster centers:
0 1 2 3
0 5.883607 2.740984 4.388525 1.434426
1 5.006000 3.428000 1.462000 0.246000
2 6.853846 3.076923 5.715385 2.053846
adjusted_rand_score: 0.7163421126838475
silhouette_score: 0.5511916046195927
And here is the modified script that produced this output:
# Import libraries and show its versions
import numpy as np; print('NumPy Version:', np.__version__)
import pandas as pd; print('Pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
import nvstrings, nvcategory
import cupy; print('cuPY Version:', cupy.__version__)
import cudf; print('cuDF Version:', cudf.__version__)
import cuml; print('cuML Version:', cuml.__version__)
import dask; print('Dask Version:', dask.__version__)
import dask_cuda; print('DaskCuda Version:', dask_cuda.__version__)
import dask_cudf; print('DaskCuDF Version:', dask_cudf.__version__)
import matplotlib; print('MatPlotLib Version:', matplotlib.__version__)
import seaborn as sns; print('SeaBorn Version:', sns.__version__)
#import time
#import warnings
from dask import delayed
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait
from dask_ml.cluster import KMeans as skmKMeans
from dask_cuda import LocalCUDACluster
from sklearn import metrics
from sklearn.cluster import KMeans as skKMeans
from sklearn.metrics import adjusted_rand_score as sk_adjusted_rand_score, silhouette_score as sk_silhouette_score
from cuml.cluster import KMeans as cuKMeans
from cuml.dask.cluster.kmeans import KMeans as cumKMeans
from cuml.metrics import adjusted_rand_score as cu_adjusted_rand_score
# Configure matplotlib library
import matplotlib.pyplot as plt
# Configure seaborn library
sns.set()
#sns.set(style="white", color_codes=True)
# Configure warnings
#warnings.filterwarnings("ignore")
####################################### KMEANS #############################################################
# Create local cluster
cluster = LocalCUDACluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
# Identify number of workers
n_workers = len(client.has_what().keys())
# Read data in host memory
from sklearn.datasets import load_iris
loader = load_iris()
#x = host_data.iloc[:, [0,1,2,3]].values
device_m_data = dask_cudf.from_cudf(cudf.from_pandas(pd.DataFrame(loader.data)), npartitions=2) # Get data columns
device_m_labels = dask_cudf.from_cudf(cudf.from_pandas(pd.DataFrame(loader.target)), npartitions=2)
# Plot data
#sns.pairplot(device_csv.to_pandas(), hue='variety');
# Define variables
label_type = { 'Setosa': 1, 'Versicolor': 2, 'Virginica': 3 } # Dictionary of variables type
# Create KMeans
cu_m_kmeans = cumKMeans(init = 'k-means||',
                        n_clusters = len(np.unique(loader.target)),
                        oversampling_factor = 40,
                        random_state = 0)
# Fit data in KMeans
cu_m_kmeans.fit(device_m_data)
# Predict data
cu_m_kmeans_labels_predicted = cu_m_kmeans.predict(device_m_data).compute()
# Check score
print('Cluster centers:\n',cu_m_kmeans.cluster_centers_)
print('adjusted_rand_score: ', sk_adjusted_rand_score(loader.target, cu_m_kmeans_labels_predicted.values.get()))
print('silhouette_score: ', sk_silhouette_score(device_m_data.compute().to_pandas(), cu_m_kmeans_labels_predicted))
# Close local cluster
client.close()
cluster.close()
Can you please provide your output for the versions of these libraries? I would also recommend running the modified script to see whether it runs successfully for you. If not, we can dig in further to find out whether it is Docker-related, RAPIDS-version-related, or something else.
If you have access to the command prompt that is running your Jupyter notebook, it might be helpful to enable logging by passing verbose=True when constructing the KMeans object. This can help us isolate where things are getting stuck.
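For instance, a minimal sketch of that change against the modified script above (verbose is the only addition):
cu_m_kmeans = cumKMeans(init = 'k-means||',
                        n_clusters = len(np.unique(loader.target)),
                        oversampling_factor = 40,
                        random_state = 0,
                        verbose = True)  # enables cuML logging so we can see where it gets stuck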

The Dask documentation is really good and extensive, though I admit the flexibility and number of features it provides can sometimes be a little overwhelming. I think it helps to see Dask as an API for distributed computing that gives the user control over a few different layers of execution, each layer providing more fine-grained control.
compute(), wait() and persist() are concepts that come from the way the tasks underlying a series of distributed computations are scheduled on a set of workers. What is common to all of these computations is an execution graph that represents remote tasks and their inter-dependencies. At some point, this execution graph gets scheduled on a set of workers. Dask provides two APIs, depending on whether the tasks underlying the graph are scheduled immediately (eagerly) or whether the computation needs to be triggered manually (lazily).
Both of these APIs build the execution graph as tasks are created that depend on the results of other tasks. The former uses the dask.futures API for immediate asynchronous execution, the results of which you may sometimes want to wait() on before doing other operations. The dask.delayed API is used for lazy executions and requires the invocation of methods like compute() or persist() in order to begin computation.
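A minimal sketch of the two APIs (the Client() here is just an illustrative local cluster; in your script the client is backed by LocalCUDACluster):
from dask import delayed
from dask.distributed import Client, wait

client = Client()  # illustrative local cluster

# Futures API: scheduled eagerly, runs in the background as soon as it is submitted
fut = client.submit(sum, [1, 2, 3])
wait(fut)            # optionally block until the result is ready
print(fut.result())  # 6

# Delayed API: builds a lazy graph; nothing runs until compute() (or persist()) is called
lazy = delayed(sum)([1, 2, 3])
print(lazy.compute())  # 6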
Most often, users of libraries like RAPIDS are more concerned with manipulating their data and aren't as concerned with how those manipulations are scheduled on the set of workers. The dask.dataframe and dask.array objects are built on top of the delayed and futures APIs. Most users interact with these data structures rather than interacting with delayed and futures objects, but it's not a bad idea to be aware of them if you should ever need to do some data transformations outside of what the distributed dataframe and array objects provide.
dask.dataframe and dask.array both build lazy execution graphs wherever possible and provide a compute() method to materialize the graph and return the result to the client. They both also provide a persist() method to start the computation asynchronously in the background. wait() is useful if you want to begin computation in the background but do not want to return the results to the client.
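A small illustration of those three calls on a dask.array collection (a sketch, assuming a distributed client is already running):
import dask.array as da
from dask.distributed import wait

x = da.random.random((10000, 10), chunks=(1000, 10))  # lazy collection, nothing has run yet
y = (x - x.mean(axis=0)).persist()                     # start computing in the background on the workers
wait(y)                                                # block until the background computation finishes
total = y.sum().compute()                              # materialize the final result on the client
print(total)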
I hope this is helpful.

Related

Library errors with pmdarima and statsmodels

I have a problem with some libraries for time series.
In particular, the first error is raised when I import this library:
from pmdarima.arima import auto_arima
As suggested in another post, I used the command !pip install pmdarima to solve this problem. But then I have to restart the runtime, otherwise I can't compile, and I also have to re-run the command every time I open my Colab/Jupyter notebook.
So my first question is related to this issue: is there any solution to avoid this process every time?
The second problem is connected to the first one, because I import these other libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib as mpl
import datetime as datetime
from pmdarima.arima import auto_arima
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.seasonal import seasonal_decompose
from dateutil.parser import parse
from statsmodels.tsa.stattools import adfuller
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from pmdarima import auto_arima
from statsmodels.tsa.statespace.sarimax import SARIMAX
Assuming I know how to solve the first problem, I then have several lines of code for the time series prediction, including a function that uses the ARIMA model:
def Predict(train, test, Order1, Order2, Order3, parForecastLenght=31):
    # Build model
    model = ARIMA(train.astype("float32"), order=(Order1, Order2, Order3))
    fitted = model.fit(disp=-1)
    # Forecast
    fc, se, conf = fitted.forecast(parForecastLenght, alpha=0.05)
    # Make as pandas series
    fc_series = pd.Series(fc, index=test.iloc[0:parForecastLenght].index)
    lower_series = pd.Series(conf[:, 0], index=test.iloc[0:parForecastLenght].index)
    upper_series = pd.Series(conf[:, 1], index=test.iloc[0:parForecastLenght].index)
    # Plot
    plt.figure(figsize=(12, 5), dpi=100)
    plt.plot(train, label='training')
    plt.plot(test, label='actual')
    plt.plot(fc_series, label='forecast')
    plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.15)
    plt.title('Forecast vs Actuals')
    plt.legend(loc='upper left', fontsize=8)
    plt.show()
    return fc_series
When I try to execute this code:
model1 = Predict(train_Att_Assunzioni, test_Att_Assunzioni, 0, 0, 0, 30)
this error appears:
NotImplementedError:
statsmodels.tsa.arima_model.ARMA and statsmodels.tsa.arima_model.ARIMA have
been removed in favor of statsmodels.tsa.arima.model.ARIMA (note the .
between arima and model) and statsmodels.tsa.SARIMAX.
statsmodels.tsa.arima.model.ARIMA makes use of the statespace framework and
is both well tested and maintained. It also offers alternative specialized
parameter estimators.
So again I checked posts on Stack Overflow and tried to implement the suggested operations, but nothing seems to work except substituting the import from statsmodels.tsa.arima_model import ARIMA with from statsmodels.tsa.arima.model import ARIMA, but then the first problem arises again.
N.B. I tried to reinstall statsmodels and pmdarima, and I tried to change my working environment from Colab to JupyterLab, but nothing worked.
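For reference, a sketch of what the Predict function might look like with the new API the error message points to (statsmodels.tsa.arima.model.ARIMA). The parameter names mirror the original function; the exact handling of the forecast and confidence intervals is an assumption based on the statespace results API, not the original poster's code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA  # note: arima.model, not arima_model

def Predict(train, test, Order1, Order2, Order3, parForecastLenght=31):
    # Build model with the statespace-based ARIMA
    model = ARIMA(train.astype("float32"), order=(Order1, Order2, Order3))
    fitted = model.fit()  # the new fit() has no disp argument
    # Forecast with confidence intervals
    forecast = fitted.get_forecast(parForecastLenght)
    fc = np.asarray(forecast.predicted_mean)
    conf = np.asarray(forecast.conf_int(alpha=0.05))
    # Make as pandas series
    fc_series = pd.Series(fc, index=test.iloc[0:parForecastLenght].index)
    lower_series = pd.Series(conf[:, 0], index=fc_series.index)
    upper_series = pd.Series(conf[:, 1], index=fc_series.index)
    # Plot
    plt.figure(figsize=(12, 5), dpi=100)
    plt.plot(train, label='training')
    plt.plot(test, label='actual')
    plt.plot(fc_series, label='forecast')
    plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.15)
    plt.title('Forecast vs Actuals')
    plt.legend(loc='upper left', fontsize=8)
    plt.show()
    return fc_series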

How to get Reproducible Results with AutoKeras

I need to reproduce results with AutoKeras for the same input and configuration. I tried the following at the beginning of my notebook but still did not get the same results.
I am using TensorFlow 2.0.4 and AutoKeras 1.0.12.
seed_value= 0
import os
os.environ['PYTHONHASHSEED']=str(seed_value)
os.environ['TF_CUDNN_DETERMINISTIC'] = str(seed_value)
import tensorflow as tf
tf.random.set_seed(seed_value)
from keras import backend as K
import autokeras as ak
import random
random.seed(seed_value)
import numpy as np
np.random.seed(seed_value)
Note:
I want to reproduce results at different times, i.e. to get the same result after closing the notebook and running the code again, not during the same session.
I guess you need to seed the generators before each call you want to be reproducible. The best option is to make a decorator (or a context manager) like this:
import contextlib

@contextlib.contextmanager
def reproducable(seed_value=0):
    import os
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    os.environ['TF_CUDNN_DETERMINISTIC'] = str(seed_value)
    import tensorflow as tf
    tf.random.set_seed(seed_value)
    from keras import backend as K
    import autokeras as ak
    import random
    random.seed(seed_value)
    import numpy as np
    np.random.seed(seed_value)
    yield

@reproducable()
def main():
    pass  # ...put your code here...
UPD
Note: I want to reproduce results at different times; i.e. to get the same result after closing the notebook, and run the code again .. not during the same session.
and?

How to add MLlib library to Spark?

I was given an assignment to run some code and show results using Apache Spark with Python. I installed the Apache Spark server following these steps: https://phoenixnap.com/kb/install-spark-on-windows-10. I tried my code and everything was fine. Now I have been given another assignment: it needs MLlib linear regression, and we were provided with some code that should run, to which we will then add additional code. When I try to run the code I get some errors and warnings; part of them appeared in the previous assignment, but it still worked. I believe the issue is that there are additional things related to the MLlib library that should be added so the code runs correctly. Does anybody have any idea which files should be added to Spark so it can run the MLlib-related code?
I am using Windows 10 and spark-3.0.1-bin-hadoop2.7.
This is my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StandardScaler
conf = SparkConf().setMaster("local").setAppName("LinearRegression")
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
# Load training data
df = sqlContext.read.format("libsvm").option("numFeatures", 13).load("boston_housing.txt")
# Data needs to be scaled for better results and interpretation
# Initialize the `standardScaler`
standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")
# Fit the DataFrame to the scaler
scaler = standardScaler.fit(df)
# Transform the data in `df` with the scaler
scaled_df = scaler.transform(df)
# Initialize the linear regression model
lr = LinearRegression(labelCol="label", maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the data to the model
linearModel = lr.fit(scaled_df)
# Print the coefficients for the model
print("Coefficients: %s" % str(linearModel.coefficients))
print("Intercept: %s" % str(linearModel.intercept))
Here is a screenshot of what I get when I run the code:
Try pip install numpy (or pip3 install numpy if that fails). The traceback says the numpy module is not found.

Dill installed - throwing error that part of the module is missing

I'm writing code in a Jupyter Notebook that involves cleaning and analyzing a large amount of consumer data. I'm trying to use dill to save the dataframes (thousands of rows) so I don't have to re-run the code every time I want to make an adjustment, so dill seemed like the perfect package... except I'm getting this error when attempting to pickle the notebook session:
AttributeError: module 'dill' has no attribute 'dump_session'
Let me know if the program code is necessary - I don't think it should make a difference. The imports are:
import numpy as np
import pandas as pd
import dill
import scipy
from matplotlib import pyplot as plt
from __future__ import division
from collections import OrderedDict
from sklearn.cluster import KMeans
pd.options.display.max_columns = None
and when I run this code I get the error from above:
dill.dump_session('recengine.db')
Is there another package that's interfering with dill's use of pickle vs. cpickle?
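One quick diagnostic sketch (an assumption, not a confirmed cause): a local file named dill.py, or a stale/partial install, would shadow the real package and produce exactly this AttributeError, so it is worth checking what is actually being imported:
import dill
print(dill.__version__)                # dump_session is available in released dill versions
print(dill.__file__)                   # should point into site-packages, not your project folder
print(hasattr(dill, 'dump_session'))   # False here means the wrong module is being picked up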

scikit-learn GridSearchCV doesn't work as samples increase

The following script runs fine on my machine with n_samples=1000, but dies (no error, it just stops working) with n_samples=10000. This only happens with the Anaconda Python distribution (numpy 1.8.1); it is fine with Enthought's (numpy 1.9.2). Any ideas what could be causing this?
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.metrics.scorer import log_loss_scorer
from sklearn.cross_validation import KFold
from sklearn import datasets
import numpy as np
X, y = datasets.make_classification(n_samples=10000, n_features=50,
                                    n_informative=35, n_redundant=10,
                                    random_state=1984)
lr = LogisticRegression(random_state=1984)
param_grid = {'C': np.logspace(-1, 2, 4, base=2)}
kf = KFold(n=y.size, n_folds=5, shuffle=True, random_state=1984)
gs = GridSearchCV(estimator=lr, param_grid=param_grid, scoring=log_loss_scorer, cv=kf, verbose=100,
                  n_jobs=-1)
gs.fit(X, y)
Note: I'm using sklearn 0.16.1 in both distributions and am using OS X.
I've noticed that upgrading to numpy version 1.9.2 with Enthought distribution (by updating manually) breaks the grid search. I haven't had any luck downgrading Anaconda numpy version to 1.8.1 though.
Are you on Windows? If so, you need to protect the code with:
if __name__ == "__main__":
    do_stuff()
Otherwise multiprocessing will not work.
Per Andreas's comment, the problem seems to be with multithreading in the linear algebra library. I solved it with the following command in the terminal:
export VECLIB_MAXIMUM_THREADS=1
My (weak) understanding is that this limits the linear algebra library's use of multiple threads and lets multiprocessing handle multithreading as it wants.
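If editing the shell environment is inconvenient, a sketch of the same idea done from Python (the variable must be set before numpy/scikit-learn are imported; this placement at the top of the script is an assumption):
import os
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"  # limit the BLAS library to a single thread

import numpy as np
from sklearn.linear_model import LogisticRegression
# ... rest of the script unchanged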
