I'm trying to scale some data from a CSV file. I'm using pyspark to obtain the dataframe and sklearn for the scaling part. Here is the code:
from sklearn import preprocessing
import numpy as np
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option('header','true').csv('flights.csv')
X_scaled = preprocessing.scale(df)
If I build the dataframe with pandas, the scaling works without any problems, but with spark I get this error:
ValueError: setting an array element with a sequence.
So I'm guessing that the element types are different between pandas and pyspark, but how can I do the scaling with pyspark?
sklearn works with pandas dataframes, so you have to convert the Spark dataframe to a pandas dataframe.
X_scaled = preprocessing.scale(df.toPandas())
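One thing to watch out for: spark.read.csv with only the header option reads every column as a string, and preprocessing.scale needs numeric input. A minimal sketch that casts on read, assuming flights.csv contains only numeric columns, might look like this:

from sklearn import preprocessing
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# inferSchema=True makes Spark parse numeric columns as numbers instead of strings
df = (spark.read
      .option('header', 'true')
      .option('inferSchema', 'true')
      .csv('flights.csv'))

# convert to pandas so sklearn can consume it, then standardize
X_scaled = preprocessing.scale(df.toPandas())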
You can use the StandardScaler class from pyspark.ml.feature. Here is a sample script that performs the same pre-processing as sklearn.
Step 1:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features",
                        outputCol="scaled_features",
                        withStd=True, withMean=True)
scaler_model = scaler.fit(transformed_data)
scaled_data = scaler_model.transform(transformed_data)
Remember that before you perform step 1, you need to assemble all the features with VectorAssembler. Hence this will be your step 0:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)
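Putting step 0 and step 1 together, a minimal end-to-end sketch, assuming the CSV is read with inferSchema and that every column listed in required_features is numeric, could look like this:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()
df = spark.read.option('header', 'true').option('inferSchema', 'true').csv('flights.csv')

# step 0: pack the numeric columns into a single vector column
required_features = df.columns  # assumption: all columns are numeric features
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)

# step 1: standardize to zero mean and unit variance, like sklearn's preprocessing.scale
scaler = StandardScaler(inputCol='features', outputCol='scaled_features',
                        withStd=True, withMean=True)
scaled_data = scaler.fit(transformed_data).transform(transformed_data)
scaled_data.select('scaled_features').show(5, truncate=False)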
Related
I have scaled the columns; however, how do I put them back into my data frame?
Here is the code that I have:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
num_cols = ['fare_amount','trip_distance','jfk_drop_distance','lga_drop_distance','ewr_drop_distance','met_drop_distance','wtc_drop_distance']
features = train_df[num_cols]
ct = ColumnTransformer([('scaler', StandardScaler(), num_cols)],
                       remainder='passthrough')
ct.fit_transform(features)
The main data frame in which I want to substitute these columns for the old ones is train_data.
You seem to be almost there. Just assign the fit_transform output back to your dataframe, like below:
...
train_df[num_cols] = ct.fit_transform(features)
...
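Putting it together, a minimal sketch, assuming train_df already contains the columns listed in num_cols, would be:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

num_cols = ['fare_amount', 'trip_distance', 'jfk_drop_distance', 'lga_drop_distance',
            'ewr_drop_distance', 'met_drop_distance', 'wtc_drop_distance']

ct = ColumnTransformer([('scaler', StandardScaler(), num_cols)], remainder='passthrough')

# fit_transform returns a NumPy array whose columns come back in num_cols order here,
# so assigning it to the same column selection overwrites the old values in place
train_df[num_cols] = ct.fit_transform(train_df[num_cols])

Note that if the frame passed to fit_transform held extra columns, remainder='passthrough' would append them after the scaled ones, so the output order would no longer match num_cols.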
Currently, I'm working on the Titanic dataset on Kaggle. The Age column has some missing values, and I tried to impute them using SimpleImputer from sklearn.impute.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split as tts
from sklearn.impute import SimpleImputer
titanic_data = pd.read_csv("../input/titanic/train.csv")
imputer = SimpleImputer(missing_values=np.nan)
features = ['Age', 'Pclass']
X = titanic_data[features]
y = titanic_data.Survived
age_arr = X.Age.values.reshape(1, -1)
imputed_age = pd.DataFrame(imputer.fit_transform(age_arr))
X.Age = imputed_age
print(imputed_age)
As shown above, I have some trouble arranging and converting those arrays and data columns. When I print imputed_age, it gives me a dataframe where each age is a separate column, but I want all of them in a single Age column. How can I do the imputing properly and put the imputed values back into the dataframe?
How could I put those imputed values into the dataframe?
I asked this on a forum elsewhere and someone gave me a solution. I'll put it here, and I've modified it a bit.
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
df = sns.load_dataset("titanic")
features = ["pclass","age"]
X = df.loc[:,features]
y = df.survived
imputer = SimpleImputer()
age_transform = pd.DataFrame(imputer.fit_transform(pd.DataFrame(X.age)),columns=["Age"])
I checked your code and found that if we pass a dataframe to imputer.fit_transform, we don't need to reshape to (1, -1). So I just wrap the age column in a dataframe and feed it to the imputer's fit_transform, and it works well.
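If the goal is simply to fill the missing ages back into the original dataframe, a minimal sketch, using the same seaborn titanic dataset as above, could be:

import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer

df = sns.load_dataset("titanic")

# fit_transform on a single-column DataFrame returns a 2-D array,
# which can be assigned straight back to the same column selection
imputer = SimpleImputer()  # defaults to mean imputation
df[["age"]] = imputer.fit_transform(df[["age"]])

print(df["age"].isna().sum())  # 0 missing values left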
I am trying to run the KMeans clustering algorithm in Spark 2.2, but I am not able to find the correct input format. It gives a TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector error. I checked further and found that my inputrdd is a Row RDD. Can we convert it to an array RDD? The MLlib docs show that we can pass a parallelized array RDD into the KMeans model.
The error occurs at the KMeans.train step.
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel
df = pd.DataFrame({"c1" : [1,2,3,4,5,6], "c2": [2,6,1,2,4,6], "c3" : [21,32,12,65,43,52]})
sdf = sqlContext.createDataFrame(df)
inputrdd = sdf.rdd
model = KMeans.train( inputrdd, 2, maxIterations=10, initializationMode="random",
seed=50, initializationSteps=5, epsilon=1e-4)
This is inputrdd when .collect() is called:
[Row(c1=1, c2=2, c3=21),
Row(c1=2, c2=6, c3=32),
Row(c1=3, c2=1, c3=12),
Row(c1=4, c2=2, c3=65),
Row(c1=5, c2=4, c3=43),
Row(c1=6, c2=6, c3=52)]
The following change helped. I converted my Row RDD to an RDD of Vectors directly using Vectors.dense.
from pyspark.mllib.linalg import Vectors
inputrdd = sdf.rdd.map(lambda s : Vectors.dense(s))
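Alternatively, you can stay with DataFrames and the newer pyspark.ml API by assembling the columns into a features vector first; a minimal sketch, assuming the same sdf as above:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# pack the numeric columns into the 'features' vector column the ml API expects
assembler = VectorAssembler(inputCols=["c1", "c2", "c3"], outputCol="features")
assembled = assembler.transform(sdf)

kmeans = KMeans(k=2, seed=50, featuresCol="features")
model = kmeans.fit(assembled)
print(model.clusterCenters())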
I have used the following code to convert the scikit-learn breast cancer dataset to a data frame, but I am not getting any output. I am very new to Python and not able to figure out what is wrong.
def answer_one():
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    data = numpy.c_[cancer.data, cancer.target]
    columns = numpy.append(cancer.feature_names, ["target"])
    return pandas.DataFrame(data, columns=columns)

answer_one()
Use pandas
There was a great answer here: How to convert a Scikit-learn dataset to a Pandas dataset?
The keys in the Bunch object give you an idea of which data you want to make columns for.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = pd.Series(cancer.target)
The following code works:
def answer_one():
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    data = np.c_[cancer.data, cancer.target]
    columns = np.append(cancer.feature_names, ["target"])
    return pd.DataFrame(data, columns=columns)

answer_one()
The reason your code didn't work before is that you call the numpy and pandas packages by their full names after importing them as np and pd respectively. However, I suggest doing the package imports at the beginning of the script, outside the function definition.
As of scikit-learn 0.23 you can do the following to get a DataFrame and save some keystrokes:
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True)
df.frame
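If you want the features and the target as separate objects instead of one combined frame, return_X_y can be combined with as_frame (again assuming scikit-learn 0.23 or later):

from sklearn.datasets import load_breast_cancer

# X is a DataFrame with the 30 feature columns, y is a Series holding the target
X, y = load_breast_cancer(return_X_y=True, as_frame=True)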
Alternatively, the body of answer_one() can build the frame column by column:
    dataframe = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
    dataframe['target'] = cancer.target
    return dataframe
My question is that I have many columns in my pandas data frame, and I am trying to apply sklearn preprocessing using DataFrameMapper from the sklearn-pandas library, like this:
mapper= DataFrameMapper([
('gender',sklearn.preprocessing.LabelBinarizer()),
('gradelevel',sklearn.preprocessing.LabelEncoder()),
('subject',sklearn.preprocessing.LabelEncoder()),
('districtid',sklearn.preprocessing.LabelEncoder()),
('sbmRate',sklearn.preprocessing.StandardScaler()),
('pRate',sklearn.preprocessing.StandardScaler()),
('assn1',sklearn.preprocessing.StandardScaler()),
('assn2',sklearn.preprocessing.StandardScaler()),
('assn3',sklearn.preprocessing.StandardScaler()),
('assn4',sklearn.preprocessing.StandardScaler()),
('assn5',sklearn.preprocessing.StandardScaler()),
('attd1',sklearn.preprocessing.StandardScaler()),
('attd2',sklearn.preprocessing.StandardScaler()),
('attd3',sklearn.preprocessing.StandardScaler()),
('attd4',sklearn.preprocessing.StandardScaler()),
('attd5',sklearn.preprocessing.StandardScaler()),
('sbm1',sklearn.preprocessing.StandardScaler()),
('sbm2',sklearn.preprocessing.StandardScaler()),
('sbm3',sklearn.preprocessing.StandardScaler()),
('sbm4',sklearn.preprocessing.StandardScaler()),
('sbm5',sklearn.preprocessing.StandardScaler())
])
I am just wondering whether there is a more succinct way for me to preprocess many variables at once without writing them out explicitly.
Another thing I found a little annoying is that when I transform the pandas data frame into the arrays sklearn works with, the column names are lost, which makes selection very difficult. Does anyone know how to preserve the column names when converting pandas data frames to NumPy arrays?
Thank you so much!
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper
encoders = ['gradelevel', 'subject', 'districtid']
scalars = ['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5']
mapper = DataFrameMapper(
[('gender', LabelBinarizer())] +
[(encoder, LabelEncoder()) for encoder in encoders] +
[(scalar, StandardScaler()) for scalar in scalars]
)
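On the column-name question, sklearn-pandas can keep the names for you: passing df_out=True makes the mapper's fit_transform return a pandas DataFrame rather than a bare NumPy array. A small sketch, assuming your original frame is called df:

mapper = DataFrameMapper(
    [('gender', LabelBinarizer())] +
    [(encoder, LabelEncoder()) for encoder in encoders] +
    [(scalar, StandardScaler()) for scalar in scalars],
    df_out=True  # return a DataFrame so the column names survive the transform
)
transformed = mapper.fit_transform(df)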
If you're doing this a lot, you could even write your own function:
mapper = data_frame_mapper(binarizers=['gender'],
encoders=['gradelevel', 'subject', 'districtid'],
scalars=['sbmRate', 'pRate', 'assn1', 'assn2', 'assn3', 'assn4', 'assn5', 'attd1', 'attd2', 'attd3', 'attd4', 'attd5', 'sbm1', 'sbm2', 'sbm3', 'sbm4', 'sbm5'])
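Such a data_frame_mapper helper is not part of sklearn-pandas, so you would write it yourself; a minimal sketch using the argument names above could be:

from sklearn.preprocessing import LabelBinarizer, LabelEncoder, StandardScaler
from sklearn_pandas import DataFrameMapper

def data_frame_mapper(binarizers=(), encoders=(), scalars=()):
    # one (column, transformer) pair per column, grouped by transformer type
    return DataFrameMapper(
        [(col, LabelBinarizer()) for col in binarizers] +
        [(col, LabelEncoder()) for col in encoders] +
        [(col, StandardScaler()) for col in scalars]
    )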