How to use sklearn pipeline fit with ray - python

I would like to use an sklearn pipeline with a Ray cluster to make the computation parallel.
I found this example: https://docs.ray.io/en/master/ray-more-libs/joblib.html
I tried the code below, but it doesn't run in parallel:
import joblib
import pandas as pd
from ray.util.joblib import register_ray

register_ray()

with joblib.parallel_backend('ray'):
    df = pd.read_csv(filepath, sep=sep, encoding=encoding, on_bad_lines='skip', low_memory=False)
    y = df.pop('target')
    X = df.copy()
    out = pipe.fit_transform(X, y)  # pipe is an existing sklearn Pipeline
If I use import modin.pandas as pd instead, the fit method complains that X and y are not pandas DataFrame types.
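For context, the Ray joblib backend only distributes work that scikit-learn itself dispatches through joblib, so a pipeline made of plain sequential transformers will not fan out across the cluster on its own. Below is a minimal sketch of a pattern similar to the linked Ray docs example, assuming a hypothetical pipeline whose final estimator exposes n_jobs (here a RandomForestClassifier; the data path is a placeholder):
import joblib
import pandas as pd
from ray.util.joblib import register_ray
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

register_ray()

# hypothetical pipeline; the parallelism comes from the estimator's n_jobs
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=500, n_jobs=-1)),
])

df = pd.read_csv('data.csv')  # placeholder path
y = df.pop('target')

with joblib.parallel_backend('ray'):
    pipe.fit(df, y)  # the joblib calls inside fit are routed to Ray workers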

Related

How to speed up the scaling process in python?

I have a large text dataset and I'm using the MinMaxScaler to transform one feature. The code works fine but takes more than 3 minutes, and I want to reduce the time this step consumes. Are there any suggestions to speed up this process, or an alternative method to do this transformation faster?
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = cleanData('data.csv')  # cleanData is the user's own loading/cleaning helper
scaler = MinMaxScaler(feature_range=(0, 5))
scaler.fit(pd.DataFrame(df.loc[:, 'year']))
df.loc[:, 'year'] = scaler.transform(pd.DataFrame(df.loc[:, 'year']))
You can try doing it with dask-ml:
import dask.dataframe as dd
from dask_ml.preprocessing import MinMaxScaler

# or read directly from csv with ddf = dd.read_csv('data.csv')
ddf = dd.from_pandas(df, npartitions=10)

scaler = MinMaxScaler(feature_range=(0, 5))
scaler.fit(ddf[['year']])  # fit on a single-column (2-D) DataFrame
ddf['year'] = scaler.transform(ddf[['year']])['year']
There are also other preprocessing tools available in dask_ml, see https://ml.dask.org/modules/generated/dask_ml.preprocessing.MinMaxScaler.html?highlight=minmaxscaler
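Keep in mind that dask evaluates lazily, so nothing is actually computed until you ask for a concrete result. A rough usage note, assuming the ddf from the snippet above:
# trigger the computation and bring the result back as a pandas DataFrame
result = ddf.compute()
print(result['year'].min(), result['year'].max())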

How can I transform a 2d array to a pandas dataframe in python

Currently, I'm doing the titanic dataset on kaggle. The Age column has some missing values, and I tried to impute them using SimpleImputer from sklearn.impute.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split as tts
from sklearn.impute import SimpleImputer
titanic_data = pd.read_csv("../input/titanic/train.csv")
imputer = SimpleImputer(missing_values=np.nan)
features = ['Age', 'Pclass']
X = titanic_data[features]
y = titanic_data.Survived
age_arr = X.Age.values.reshape(1, -1)
imputed_age = pd.DataFrame(imputer.fit_transform(age_arr))
X.Age = imputed_age
print(imputed_age)
As shown above, I have some trouble arranging and converting those arrays and data columns. When I print imputed_age, it gives me a DataFrame in which each age is its own column, but I want all of the values in a single Age column. How can I do the imputing easily and put the imputed values back into the dataframe afterwards?
I asked this on a forum elsewhere and someone gave me a solution. I'll put it here, and I've modified it a bit.
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
df = sns.load_dataset("titanic")
features = ["pclass","age"]
X = df.loc[:,features]
y = df.survived
imputer = SimpleImputer()
age_transform = pd.DataFrame(imputer.fit_transform(pd.DataFrame(X.age)),columns=["Age"])
I checked your code and found that if we pass a DataFrame to imputer.fit_transform, we don't need to reshape to (1, -1).
So I just made the age column into a DataFrame, passed it to the imputer's fit_transform, and it works well.
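To address the second part of the question (getting the imputed values back into the original data), here is a small sketch under the same seaborn titanic setup as above; the ravel() call is needed because fit_transform on a one-column DataFrame returns a 2-D array:
X = X.copy()  # avoid chained-assignment warnings when writing back
X["age"] = imputer.fit_transform(X[["age"]]).ravel()
print(X["age"].isna().sum())  # 0 missing values after imputation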

ValueError: could not convert string to float: 'Q'

I am new to programming and I was working with the titanic dataset from Kaggle. I have been trying to build a Logistic Regression model after performing one-hot encoding, but I keep getting the error above. I think the error is caused by the dummy variables. Below is my code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Loading data
df=pd.read_csv(r"C:\Users\Downloads\train.csv")
#Deleting unwanted columns
df.drop(["PassengerId","Name","Cabin","Ticket"],axis=1,inplace=True)
#Count of missing values in each column
print(df.isnull().sum())
#Deleting rows with missing values based on column name
df.dropna(subset=['Embarked','Age'],inplace=True)
print(df.isnull().sum())
#One hot encoding for categorical variables
#Creating dummy variables for Sex column
dummies = pd.get_dummies(df.Sex)
dummies2=pd.get_dummies(df.Embarked)
#Appending the dummies dataframe with original dataframe
new_df= pd.concat([df,dummies,dummies2],axis='columns')
print(type(new_df))
#print(new_df.head(10))
#Drop the original Sex, Embarked columns and one of the dummy columns for both variables
new_df.drop(['Sex','Embarked'],axis='columns',inplace=True)
print(new_df.head(10))
new_df.info()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix,accuracy_score
x = df.drop('Survived', axis=1)
y = df['Survived']
logmodel = LogisticRegression()
logmodel.fit(x, y)
As we discussed in the comments, here is the solution:
First, you need to modify your x and y variables to use new_df instead of df just like so:
x = new_df.drop('Survived', axis=1)
y = new_df['Survived']
Then, you need to increase the iteration limit of your Logistic Regression model, like so:
logmodel = LogisticRegression(max_iter=1000)
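As a side note, the comment in the question about dropping one dummy column per variable is never actually applied in the posted code. A hedged sketch of how that could look with pandas' drop_first option (not needed to fix the error, just to avoid redundant dummy columns):
# one-hot encode Sex and Embarked in one call, keeping k-1 dummies per variable
encoded = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
x = encoded.drop('Survived', axis=1)
y = encoded['Survived']
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(x, y)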

Scale data from dataframe obtained with pyspark

I'm trying to scale some data from a csv file. I'm doing this with pyspark to obtain the dataframe and sklearn for the scale part. Here is the code:
from sklearn import preprocessing
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option('header', 'true').csv('flights.csv')
X_scaled = preprocessing.scale(df)
If I make the dataframe with pandas the scale part doesn't have any problems, but with spark I get this error:
ValueError: setting an array element with a sequence.
So I'm guessing that the element types are different between pandas and pyspark, but how can I work with pyspark to do the scale?
sklearn works with pandas DataFrames, so you have to convert the Spark DataFrame to a pandas DataFrame:
X_scaled = preprocessing.scale(df.toPandas())
You can use the "StandardScaler" method from "pyspark.ml.feature". Attaching a sample script to perform the exact pre-processing as sklearn,
Step 1:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features",
                        outputCol="scaled_features",
                        withStd=True, withMean=True)
scaler_model = scaler.fit(transformed_data)
scaled_data = scaler_model.transform(transformed_data)
Remember, before you perform step 1, you need to assemble all the features with VectorAssembler. Hence this will be your step 0:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)
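One caveat that applies to both answers: spark.read.csv loads every column as a string unless the schema is inferred or cast, and neither preprocessing.scale nor VectorAssembler will accept string columns. A small sketch of the read step, assuming the flights.csv file from the question:
# let Spark infer numeric column types so they can be assembled/scaled later
df = spark.read.option('header', 'true') \
               .option('inferSchema', 'true') \
               .csv('flights.csv')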

How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?

I have used the following code to convert the scikit-learn breast cancer dataset to a DataFrame, but I am not getting the output. I am very new to Python and not able to figure out what is wrong.
def answer_one():
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    data = numpy.c_[cancer.data, cancer.target]
    columns = numpy.append(cancer.feature_names, ["target"])
    return pandas.DataFrame(data, columns=columns)

answer_one()
Use pandas
There was a great answer here: How to convert a Scikit-learn dataset to a Pandas dataset?
The keys in the Bunch object give you an idea of which data you want to make columns for.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = pd.Series(cancer.target)
The following code works:
def answer_one():
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    data = np.c_[cancer.data, cancer.target]
    columns = np.append(cancer.feature_names, ["target"])
    return pd.DataFrame(data, columns=columns)

answer_one()
The reason your code didn't work before is that you refer to the numpy and pandas packages by their full names after importing them as np and pd respectively.
However, I suggest doing the package imports at the beginning of the script, outside the function definition.
As of scikit-learn 0.23 you can do the following to get a DataFrame and save some keystrokes:
df = load_breast_cancer(as_frame=True)
df.frame
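With as_frame=True the returned Bunch also exposes the features and target as pandas objects directly, so a short usage sketch might be:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X = data.data            # pandas DataFrame with the 30 feature columns
y = data.target          # pandas Series with the 0/1 labels
print(data.frame.shape)  # (569, 31)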
Alternatively, you can build the DataFrame from the Bunch by hand inside answer_one:
    dataframe = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
    dataframe['target'] = cancer.target
    return dataframe
