I am trying to dummy encode a dask dataframe, train_final[categorical_var]. However, when I run the code I get a memory error. Should this happen, given that dask is supposed to load and process the data chunk by chunk?
The code is below:
from dask_ml.preprocessing import DummyEncoder
de = DummyEncoder()
train_final_cat = de.fit_transform(train_final[categorical_var])
The error:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-84-e21592c13279> in <module>
1 from dask_ml.preprocessing import DummyEncoder
2 de = DummyEncoder()
----> 3 train_final_cat = de.fit_transform(train_final[categorical_var])
~/env/lib/python3.5/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
460 if y is None:
461 # fit method of arity 1 (unsupervised transformation)
--> 462 return self.fit(X, **fit_params).transform(X)
463 else:
464 # fit method of arity 2 (supervised transformation)
~/env/lib/python3.5/site-packages/dask_ml/preprocessing/data.py in fit(self, X, y)
602
603 self.transformed_columns_ = pd.get_dummies(
--> 604 sample, drop_first=self.drop_first
605 ).columns
606 return self
~/env/lib/python3.5/site-packages/pandas/core/reshape/reshape.py in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
890 dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,
891 dummy_na=dummy_na, sparse=sparse,
--> 892 drop_first=drop_first, dtype=dtype)
893 with_dummies.append(dummy)
894 result = concat(with_dummies, axis=1)
~/env/lib/python3.5/site-packages/pandas/core/reshape/reshape.py in _get_dummies_1d(data, prefix, prefix_sep, dummy_na, sparse, drop_first, dtype)
978
979 else:
--> 980 dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
981
982 if not dummy_na:
~/env/lib/python3.5/site-packages/numpy/lib/twodim_base.py in eye(N, M, k, dtype, order)
184 if M is None:
185 M = N
--> 186 m = zeros((N, M), dtype=dtype, order=order)
187 if k >= M:
188 return m
MemoryError:
Would anyone be able to give me some direction in this regard?
Thanks,
Michael
Encoding dummy variables is a very memory-intensive task, as you're creating a new column for each unique value in your categorical columns. If a categorical column has high cardinality, then even a single chunk can explode in size. In addition, creating dummies is not "embarrassingly parallel", so workers can't just process each chunk independently: they need to communicate and replicate some data during the computation.
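If you want to gauge how bad the blow-up will be before encoding, a quick sketch along these lines (assuming train_final and categorical_var are defined as in the question) counts the unique values per column, since each unique value becomes its own dummy column:

import dask

# Count unique values per categorical column; each unique value becomes a
# separate dummy column, so large counts explain the memory blow-up.
uniques = {col: train_final[col].nunique() for col in categorical_var}
(counts,) = dask.compute(uniques)
for col, n in counts.items():
    print(col, n)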
I'm trying to replicate the results described in How to Determine the Best Fitting Data Distribution Using Python. I used the following code:
import numpy as np
from distfit import distfit
# Generate 10000 normal distribution samples with mean 0, std dev of 3
X = np.random.normal(0, 3, 10000)
# Initialize distfit
dist = distfit()
# Determine best-fitting probability distribution for data
dist.fit_transform(X)
However, I obtained the following error:
[distfit] >fit..
[distfit] >transform..
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-02f73e7f157d> in <module>
9
10 # Determine best-fitting probability distribution for data
---> 11 dist.fit_transform(X)
~\Anaconda3\lib\site-packages\distfit\distfit.py in fit_transform(self, X, verbose)
275 self.fit(verbose=verbose)
276 # Transform X based on functions
--> 277 self.transform(X, verbose=verbose)
278 # Store
279 results = _store(self.alpha,
~\Anaconda3\lib\site-packages\distfit\distfit.py in transform(self, X, verbose)
214 if self.method=='parametric':
215 # Compute best distribution fit on the empirical X
--> 216 out_summary, model = _compute_score_distribution(X, X_bins, y_obs, self.distributions, self.stats, verbose=verbose)
217 # Determine confidence intervals on the best fitting distribution
218 model = _compute_cii(self, model, verbose=verbose)
~\Anaconda3\lib\site-packages\distfit\distfit.py in _compute_score_distribution(data, X, y_obs, DISTRIBUTIONS, stats, verbose)
906 model['params'] = (0.0, 1.0)
907 best_score = np.inf
--> 908 df = pd.DataFrame(index=range(0, len(DISTRIBUTIONS)), columns=['distr', 'score', 'LLE', 'loc', 'scale', 'arg'])
909 max_name_len = np.max(list(map(lambda x: len(x.name), DISTRIBUTIONS)))
910
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
346 dtype=dtype, copy=copy)
347 elif isinstance(data, dict):
--> 348 mgr = self._init_dict(data, index, columns, dtype=dtype)
349 elif isinstance(data, ma.MaskedArray):
350 import numpy.ma.mrecords as mrecords
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
449 nan_dtype = dtype
450 v = construct_1d_arraylike_from_scalar(np.nan, len(index),
--> 451 nan_dtype)
452 arrays.loc[missing] = [v] * missing.sum()
453
~\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in construct_1d_arraylike_from_scalar(value, length, dtype)
1194 else:
1195 if not isinstance(dtype, (np.dtype, type(np.dtype))):
-> 1196 dtype = dtype.dtype
1197
1198 # coerce if we have nan for an integer dtype
AttributeError: type object 'object' has no attribute 'dtype'
(I'm using Jupyter.)
How can I fix this problem?
As noted in the comments on the question, the solution to this error was to upgrade pandas; the issue appears in pandas versions 1.0.4 and lower.
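For reference, a quick way to check which pandas version is installed before upgrading (the upgrade command assumes a pip-based environment):

import pandas as pd

print(pd.__version__)
# If this prints 1.0.4 or lower, upgrading should resolve the error:
#   pip install --upgrade pandas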
I am working on the Kaggle ASL Dataset, and during preprocessing I tried to scale the values of each pixel column.
I did the following steps in Google Colab:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
train = pd.read_csv("sign-language-mnist/sign_mnist_train.csv")
scaler = MinMaxScaler()
new_df = train.apply(lambda x: scaler.fit_transform(x.values.reshape(1,-1)),axis=0)
While trying to run this piece of code, I am getting the following error:
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7550 kwds=kwds,
7551 )
-> 7552 return op.get_result()
7553
7554 def applymap(self, func) -> "DataFrame":
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_standard(self)
272
273 # wrap results
--> 274 return self.wrap_results(results, res_index)
275
276 def apply_series_generator(self) -> Tuple[ResType, "Index"]:
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in wrap_results(self, results, res_index)
313 # see if we can infer the results
314 if len(results) > 0 and 0 in results and is_sequence(results[0]):
--> 315 return self.wrap_results_for_axis(results, res_index)
316
317 # dict of scalars
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in wrap_results_for_axis(self, results, res_index)
369
370 try:
--> 371 result = self.obj._constructor(data=results)
372 except ValueError as err:
373 if "arrays must all be same length" in str(err):
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
466
467 elif isinstance(data, dict):
--> 468 mgr = init_dict(data, index, columns, dtype=dtype)
469 elif isinstance(data, ma.MaskedArray):
470 import numpy.ma.mrecords as mrecords
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in init_dict(data, index, columns, dtype)
281 arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
282 ]
--> 283 return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
284
285
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
76 # figure out the index, if necessary
77 if index is None:
---> 78 index = extract_index(arrays)
79 else:
80 index = ensure_index(index)
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in extract_index(data)
385
386 if not indexes and not raw_lengths:
--> 387 raise ValueError("If using all scalar values, you must pass an index")
388
389 if have_series:
ValueError: If using all scalar values, you must pass an index
However, the following piece of code works fine:
new_df = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)
So, the question is: what is going wrong? Or, what do I need to know to understand why the first approach gives that strange error?
Thanks in advance.
You could try
train.iloc[:,1:] = scaler.fit_transform(train.iloc[:,1:])
In any case, you wouldn't want to scale the label column. The original apply fails because reshape(1, -1) makes the lambda return a 2-D array of shape (1, n_rows) for every column, and pandas cannot reassemble those 2-D results into a DataFrame, hence the misleading "all scalar values" error.
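Putting that together with the loading code from the question, a minimal sketch (assuming the first column of the CSV is the label and the remaining columns are pixel values):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.read_csv("sign-language-mnist/sign_mnist_train.csv")

scaler = MinMaxScaler()
# Scale only the pixel columns; the label column (first column) is left as-is.
train.iloc[:, 1:] = scaler.fit_transform(train.iloc[:, 1:])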
My input is a pandas dataframe ("vector") with one column and 178885 rows holding strings with up to 600 words each.
0 this is an example text...
1 more examples...
...
178885 last example
Name: vectortext, Length: 178886, dtype: object
I'm doing feature extraction (unigrams) using the TfidfVectorizer:
vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
X = vectorizer_uni.fit_transform(vector).toarray()
X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams
k = len(X.columns) #number of features
Unfortunately I'm receiving a MemoryError, shown below. I'm using the 64-bit version of Python 3.6 with 16GB of RAM on my Windows 10 machine. I've read a lot about Python generators etc., but I can't figure out how to solve this problem without limiting the number of features (which is not really an option). Any ideas how to solve this? Could I somehow split my dataframe beforehand?
Traceback:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-88-15b6091ceec7> in <module>()
1 vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
----> 2 X = vectorizer_uni.fit_transform(vector).toarray()
3 X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams
4 k = len(X.columns) # number of features
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
962 def toarray(self, order=None, out=None):
963 """See the docstring for `spmatrix.toarray`."""
--> 964 return self.tocoo(copy=False).toarray(order=order, out=out)
965
966 ##############################################################
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\coo.py in toarray(self, order, out)
250 def toarray(self, order=None, out=None):
251 """See the docstring for `spmatrix.toarray`."""
--> 252 B = self._process_toarray_args(order, out)
253 fortran = int(B.flags.f_contiguous)
254 if not fortran and not B.flags.c_contiguous:
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
1037 return out
1038 else:
-> 1039 return np.zeros(self.shape, dtype=self.dtype, order=order)
1040
1041 def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):
MemoryError:
I'm new to Agglomerative Clustering and doc2vec, so I hope somebody can help me with the following issue.
This is my code:
model = AgglomerativeClustering(linkage='average',
                                connectivity=None, n_clusters=2)
X = model_dm.docvecs.doctag_syn0
model.fit(X, y=None)
model.fit_predict(X, y=None)
What I want is to predict the average of the distances of each observation. I got the following error:
MemoryErrorTraceback (most recent call last)
<ipython-input-22-d8b93bc6abe1> in <module>()
2 model = AgglomerativeClustering(linkage='average',connectivity=None,n_clusters=2)
3 X = model_dm.docvecs.doctag_syn0
----> 4 model.fit(X, y=None)
5
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in fit(self, X, y)
763 n_components=self.n_components,
764 n_clusters=n_clusters,
--> 765 **kwargs)
766 # Cut the tree
767 if compute_full_tree:
/usr/local/lib64/python2.7/site-packages/sklearn/externals/joblib/memory.pyc in __call__(self, *args, **kwargs)
281
282 def __call__(self, *args, **kwargs):
--> 283 return self.func(*args, **kwargs)
284
285 def call_and_shelve(self, *args, **kwargs):
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in _average_linkage(*args, **kwargs)
547 def _average_linkage(*args, **kwargs):
548 kwargs['linkage'] = 'average'
--> 549 return linkage_tree(*args, **kwargs)
550
551
/usr/local/lib64/python2.7/site-packages/sklearn/cluster/hierarchical.pyc in linkage_tree(X, connectivity, n_components, n_clusters, linkage, affinity, return_distance)
428 i, j = np.triu_indices(X.shape[0], k=1)
429 X = X[i, j]
--> 430 out = hierarchy.linkage(X, method=linkage, metric=affinity)
431 children_ = out[:, :2].astype(np.int)
432
/usr/local/lib64/python2.7/site-packages/scipy/cluster/hierarchy.pyc in linkage(y, method, metric)
669 'matrix looks suspiciously like an uncondensed '
670 'distance matrix')
--> 671 y = distance.pdist(y, metric)
672 else:
673 raise ValueError("`y` must be 1 or 2 dimensional.")
/usr/local/lib64/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1375
1376 m, n = s
-> 1377 dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
1378
1379 # validate input for multi-args metrics
MemoryError:
You are getting a MemoryError. This is a reliable indicator that you are running out of memory, on the line indicated.
The line indicates an attempt to allocate an np.zeros() array of (m * (m - 1)) // 2 values of type double (8 bytes each). Looking at the scipy source, m here is the number of vectors in X, i.e. model_dm.docvecs.doctag_syn0.shape[0].
So, how many docvecs are you working with? If it's 200,000, you will need...
((200000 * 199999) // 2) * 8 bytes
...or about 160GB of RAM for that np.zeros() allocation to succeed. (If you have more docvecs, you'll need even more RAM.)
(Agglomerative clustering needs all the pairwise distances, and the scipy implementation calculates and stores them all up front, which is very space-consuming.)
You may need more RAM, or to use fewer docvecs, or to use a different clustering algorithm, or to use an implementation that is lazier about calculating distances (but which is then much, much slower, because it will often be recalculating, rather than reusing, the distances it needs repeatedly).
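For example, if a flat clustering is acceptable, a sketch like the following avoids the pairwise-distance matrix entirely. Note this swaps in MiniBatchKMeans, a different algorithm from average-linkage agglomerative clustering, so the results will differ:

from sklearn.cluster import MiniBatchKMeans

X = model_dm.docvecs.doctag_syn0  # the document vectors, as in the question

# MiniBatchKMeans works on small batches and never materialises the
# (m * (m - 1)) // 2 condensed distance matrix that linkage() allocates.
km = MiniBatchKMeans(n_clusters=2, batch_size=10000, random_state=0)
labels = km.fit_predict(X)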
I'm trying to convert a Pandas dataframe to a NumPy array to create a model with Sklearn. I'll simplify the problem here.
>>> mydf.head(10)
IdVisita
445 latam
446 NaN
447 grados
448 grados
449 eventos
450 eventos
451 Reescribe-medios-clases-online
454 postgrados
455 postgrados
456 postgrados
Name: cat1, dtype: object
>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit(mydf)
Traceback:
ValueError Traceback (most recent call last)
<ipython-input-74-f581ab15cbed> in <module>()
2 mydf.head(10)
3 enc = preprocessing.OneHotEncoder()
----> 4 enc.fit(mydf)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit(self, X, y)
996 self
997 """
--> 998 self.fit_transform(X)
999 return self
1000
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
1052 """
1053 return _transform_selected(X, self._fit_transform,
-> 1054 self.categorical_features, copy=True)
1055
1056 def _transform(self, X):
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
870 """
871 if selected == "all":
--> 872 return transform(X)
873
874 X = atleast2d_or_csc(X, copy=copy)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _fit_transform(self, X)
1001 def _fit_transform(self, X):
1002 """Assumes X contains only categorical features."""
-> 1003 X = check_arrays(X, sparse_format='dense', dtype=np.int)[0]
1004 if np.any(X < 0):
1005 raise ValueError("X needs to contain only non-negative integers.")
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
279 array = np.ascontiguousarray(array, dtype=dtype)
280 else:
--> 281 array = np.asarray(array, dtype=dtype)
282 if not allow_nans:
283 _assert_all_finite(array)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: invalid literal for long() with base 10: 'postgrados'
Note that IdVisita is the index here and the numbers might not all be consecutive.
Any clues?
Your error here is that you are calling OneHotEncoder, which, from the docs:
The input to this transformer should be a matrix of integers
but your df has a single column, 'cat1', of dtype object, which in fact holds strings.
You should use LabelEncoder:
In [13]:
le = preprocessing.LabelEncoder()
le.fit(df.dropna().values)
le.classes_
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\sklearn\preprocessing\label.py:108: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Out[13]:
array(['Reescribe-medios-clases-online', 'eventos', 'grados', 'latam',
'postgrados'], dtype=object)
Note that I had to drop the NaN row, as it introduces a mixed dtype that cannot be ordered, e.g. float > str will not work.
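Putting the two steps together, a rough sketch (assuming df is the one-column frame from the question; newer scikit-learn versions can one-hot encode strings directly, so this two-step approach is only needed on older versions):

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# ravel() gives LabelEncoder the 1-D array it expects (strings -> integer codes)
codes = le.fit_transform(df.dropna().values.ravel())

# OneHotEncoder wants a 2-D matrix of non-negative integers, so reshape.
# (Older versions take sparse=; newer ones use sparse_output= instead.)
ohe = preprocessing.OneHotEncoder(sparse=False)
dummies = ohe.fit_transform(codes.reshape(-1, 1))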
A simpler approach is to use DictVectorizer, which performs the conversion to integers and the one-hot encoding in a single step.
Passing the argument DictVectorizer(sparse=False) makes fit_transform return a dense array, which is easy to wrap back into a DataFrame so you can keep working with pandas.
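A rough sketch of that approach on the question's data (assuming mydf is a one-column DataFrame holding cat1, with the NaN row dropped as above):

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)

# Each row becomes a {'cat1': value} dict; DictVectorizer then maps every
# distinct string to its own 0/1 column (named like 'cat1=postgrados').
clean = mydf.dropna()
encoded = dv.fit_transform(clean.to_dict("records"))
encoded_df = pd.DataFrame(encoded, columns=dv.get_feature_names(),
                          index=clean.index)
# (use dv.get_feature_names_out() on newer scikit-learn versions)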