Using dask ParallelPostFit on sklearn predictors (ParallelPostFit wrapper) - Python

I am trying to evaluate an sklearn predictor that I have fitted, over a larger-than-memory dask array of inputs. I have read the ParallelPostFit documentation at https://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.wrappers.ParallelPostFit.html and am still having some problems. The following code illustrates the kind of issue I am running into:
from dask.base import tokenize
import numpy as np
import dask.array as da
from dask.array import Array
from sklearn.linear_model import LinearRegression
from dask_ml.wrappers import ParallelPostFit
"""
for stack overflow question
"""
x = np.linspace(0,100,100,dtype=np.int32)
y = np.linspace(0,100,100,dtype=np.int32)
z = np.linspace(0,100,100,dtype=np.int32)
Y = np.random.normal(size=(100,))
X = np.stack([x,y,z],axis=1)
reg = LinearRegression().fit(X,Y)
#now try to compute on dask arrays over the whole space
x = da.linspace(0, 100, 100, chunks=(10,)).astype(np.int32)
y = da.linspace(0, 100, 100, chunks=(10,)).astype(np.int32)
z = da.linspace(0, 100, 100, chunks=(10,)).astype(np.int32)
x,y,z = da.meshgrid(x,y,z,sparse=False,indexing='ij')
stacked = da.stack([x.flatten(),y.flatten(),z.flatten()],axis=1)
clf = ParallelPostFit(estimator=reg)
clf.predict(stacked)
Executing clf.predict throws
ValueError: Can't drop an axis with more than 1 block. Please use atop instead.
which I don't understand how to correct.
Thank you for any help.
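(Not part of the original post.) One plausible direction, assuming the error comes from the chunk layout: after meshgrid, flatten, and stack, stacked is chunked along the feature axis (three one-column blocks), and ParallelPostFit.predict has to drop that axis, which requires it to be a single block. Rechunking the second axis into one block before predicting may be enough:

# Sketch, not a verified fix: collapse the feature axis into a single chunk so
# that ParallelPostFit.predict can drop it without hitting the multi-block error.
stacked = stacked.rechunk({1: -1})        # -1 means "one block along axis 1"
predictions = clf.predict(stacked)        # still a lazy dask array
print(predictions[:10].compute())         # evaluate a small slice to check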

Related

lasso regression for larger datasets using dask

I'm attempting to perform a lasso regression on a larger-than-main-memory dataset using Dask, but there doesn't seem to be a cleanly documented way to do so.
I did previously find a somewhat related question, but no actual answer.
Looking into how scikit-learn sets up its Lasso regression, I thought I could set it up the same way. For example, here is one approach I tried:
from dask_ml.datasets import make_regression
import dask_glm.families
import dask_glm.regularizers
import dask_glm.algorithms
from sklearn import linear_model
# dask arrays
X, y = make_regression(n_samples=1000, chunks=100)
# in-memory numpy copies for the scikit-learn comparison below
df_X = X.compute()
df_y = y.compute()
family = dask_glm.families.Normal()
regularizer = dask_glm.regularizers.ElasticNet(weight=1)
b = dask_glm.algorithms.gradient_descent(X=X, y=y, max_iter=100000, family=family, regularizer=regularizer, alpha=0.01, normalize=False, fit_intercept=False)
print(b)
reg = linear_model.Lasso(alpha=0.01, fit_intercept=False)
reg.fit(df_X, df_y)
print(reg.coef_)
However, the coefficients don't match up at all, and the dask version's coefficients seem more unstable than scikit-learn's.
Here's another approach I tried, this time based on a comment from this GH issue
from dask_ml.datasets import make_regression
from dask_glm.regularizers import L1
from dask_glm.estimators import LinearRegression
X, y = make_regression(n_samples=1000, chunks=100)
lr = LinearRegression(regularizer=L1())
lr.fit(X, y)
print(lr.coef_)
Again, the coefficients seem very unstable.
Ideally there would already be an implementation of Lasso using Dask for this, but I can't seem to find much on the internet except for running LassoCV with dask as the backend to joblib, which is a little different from what I want.
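For reference, here is a minimal sketch of the LassoCV-with-joblib route mentioned above (my addition, not from the original post). It only parallelizes the cross-validation work over a dask scheduler; each fit still happens in memory, so it is not a true out-of-core Lasso:

# Sketch assuming a dask.distributed scheduler is available; this is the
# "joblib with a dask backend" approach, not an out-of-core solver.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

client = Client(processes=False)               # local scheduler for illustration
X, y = make_regression(n_samples=1000, n_features=20, noise=1.0)

lasso = LassoCV(cv=5)
with joblib.parallel_backend("dask"):          # dispatch CV folds to dask workers
    lasso.fit(X, y)
print(lasso.coef_)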

How to run a regression without predictors?

I am trying to run a regression without predictors, just a constant and an error term. The model is y = a + error.
I have tried as follows:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
y = np.random.normal(size=50)
sm.OLS(y, sm.add_constant(), missing='drop').fit()
However, this does not work.
As noted here, regression without predictors is not a major data-analysis tool; related points were already discussed in "Linear vs. Logistic Regression on Classification Problems" and "Regression for Binary Classification". However, it can still be a requirement for reasons like the one you pointed out in this question, so we will try to provide a proper solution for your case.
Similar to this question, you can use DummyRegressor from scikit-learn as follows:
import numpy as np
from sklearn.dummy import DummyRegressor

X = np.random.normal(size=50)
y = np.random.normal(size=50)
# with strategy="mean", DummyRegressor ignores X and always predicts y.mean()
dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X, y)
...
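As a small addition (not part of the original answer), the intercept-only model y = a + error can also be fitted directly in statsmodels by regressing y on a column of ones:

# Minimal sketch: an intercept-only OLS fit; the single estimated coefficient
# is just the sample mean of y.
import numpy as np
import statsmodels.api as sm

y = np.random.normal(size=50)
res = sm.OLS(y, np.ones_like(y)).fit()   # design matrix is only the constant
print(res.params)                        # approximately y.mean()
print(res.summary())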

Output of a statsmodels regression

I would like to perform a simple linear regression using statsmodels and I've tried several different methods by now, but I just can't get it to work. The code I have constructed now doesn't give me any errors, but it also doesn't show me the result.
I am trying to create a model for the variable "Direction", which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explanatory variables are the (5) lags of the returns. The dataframe df13 contains the lags and also the direction for each observed date. I tried the code below and, as I mentioned, it doesn't give an error but only prints:
Optimization terminated successfully.
Current function value: 0.682314
Iterations 5
However, I would like to see the typical table with all the beta values, their significance etc.
Also, what would you say: since Direction is a binary variable, might it be better to use a logit instead of a linear model? However, in the assignment it appeared as a linear model.
And lastly, I am sorry it's not displayed here correctly, but I don't know how to format it as code or insert my dataframe.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression so I updated that part
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not familiar with statsmodels, but I would try converting your dataframes to numpy arrays. You can do this with
Xnum = X.to_numpy()
Ynum = Y.to_numpy()
and try passing those to the regression.
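One more thing worth checking (my observation, not part of the answer above): in the posted code, summary is assigned without being called, which is the most likely reason no table is printed. summary is a method, so it needs parentheses:

# Minimal sketch with made-up data standing in for df13 (hypothetical), just to
# show the summary() call that prints the full coefficient table.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 5)))     # 5 lag-like predictors plus a constant
Y = (rng.random(100) > 0.5).astype(float)          # binary Direction-like target

model = sm.Logit(Y, X).fit()
print(model.summary())                             # note the parentheses: summary is a method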

Import a dataset in Python with scikit-learn for machine learning problems - Wisconsin breast cancer dataset

Hello, I am trying to import a dataset into Spyder:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('breast-cancer-wisconsin.data1.csv')
X = dataset.iloc[:,0:9].values
y= dataset.iloc[:,9].values
but when I display the X matrix in the Variable Explorer, it says that object arrays are currently not supported.
Try this:
X = dataset.drop('column_9', axis=1).values
y = dataset['column_9'].values
Just replace column_9 with whatever the target column's name is.
Actually in Spyder we can't see the object array. We can only see the data-frame data, but the Spyder team promised that they will provide the object array feature in Spyder 4 (to be released later in 2019).
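A possible reason X becomes an object array in the first place (an assumption on my part; the raw Wisconsin data file commonly marks missing values with "?") is that a single non-numeric placeholder forces a whole column to dtype object. Coercing the feature columns to numeric before taking .values sidesteps the Spyder limitation:

import pandas as pd

dataset = pd.read_csv('breast-cancer-wisconsin.data1.csv')
# coerce every feature column to numeric; placeholders such as "?" become NaN
features = dataset.iloc[:, 0:9].apply(pd.to_numeric, errors='coerce')
X = features.values              # a float array rather than an object array
y = dataset.iloc[:, 9].values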
You can even load data from sklearn module this way:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

LinearDiscriminantAnalysis - Single column output from .transform(X)

I have been successfully playing around with replicating one of the sklearn tutorials using the iris dataset in PyCharm using Python 2.7. However, when trying to repeat this with my own data I have been encountering an issue. I have been importing data from a .csv file using 'np.genfromtxt', but for some reason I keep getting a single column output for X_r2 (see below), when I should get a 2 column output. I have therefore replaced my data with some randomly generated variables to post onto SO, and I am still getting the same issue.
I have included the 'problem' code below, and I would be interested to know what I have done wrong. I have extensively used the debugging features in PyCharm to check that the type and shape of my variables are similar to the original sklearn example, but it did not help me with the problem. Any help or suggestions would be appreciated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(2, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
The array y in the example you posted has values of 0, 1 and 2, while yours only has values of 0 and 1. LinearDiscriminantAnalysis can return at most min(n_classes - 1, n_features) components, so with only two classes the transform is limited to a single column. This change achieves what you want:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
y = np.random.randint(3, size=500)
X = np.random.randint(1, high=1000, size=(500, 6))
target_names = np.array([['XX'], ['YY']])
lda = LinearDiscriminantAnalysis(n_components=2)
X_r2 = lda.fit(X, y).transform(X)
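To make the limit explicit (my addition, not from the original answer), the sketch below checks how many columns transform returns for two and three classes:

# Small check of the component cap: with 2 classes you get 1 column,
# with 3 classes you get 2 columns (n_features is 6 here, so it never binds).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.randint(1, high=1000, size=(500, 6))
for n_classes in (2, 3):
    y = np.random.randint(n_classes, size=500)
    lda = LinearDiscriminantAnalysis(n_components=None)  # use the maximum allowed
    X_r = lda.fit(X, y).transform(X)
    print(n_classes, "classes ->", X_r.shape[1], "column(s)")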
