I am trying to run Kmeans clustering algo in Spark 2.2. I am not able to find the correct input format. It gives TypeError: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector error. I checked further that my inputrdd is an Row Rdd. CAn we convert it to an array RDD? This MLlib Doc says shows that we can pass a paralleized array rdd data into the KMeans model.
Error occurs at KMeans.train step.
import pandas as pd
from pyspark.mllib.clustering import KMeans, KMeansModel
df = pd.DataFrame({"c1" : [1,2,3,4,5,6], "c2": [2,6,1,2,4,6], "c3" : [21,32,12,65,43,52]})
sdf = sqlContext.createDataFrame(df)
inputrdd = sdf.rdd
model = KMeans.train( inputrdd, 2, maxIterations=10, initializationMode="random",
seed=50, initializationSteps=5, epsilon=1e-4)
inputrdd when .collect is called.
[Row(c1=1, c2=2, c3=21),
Row(c1=2, c2=6, c3=32),
Row(c1=3, c2=1, c3=12),
Row(c1=4, c2=2, c3=65),
Row(c1=5, c2=4, c3=43),
Row(c1=6, c2=6, c3=52)]
Following changes helped. I changed my Row rdd to Vector directly using Vectors.dense.
from pyspark.mllib.linalg import Vectors
inputrdd = sdf.rdd.map(lambda s : Vectors.dense(s))
Related
Im trying to make my way to a sligthly more flexible knn input script than the tutorials based of the iris dataset but Im having some trouble (I think) to add the matching 2nd dimension to the numpy array in #6 and when I come to #11. the fitting.
File "G:\PROGRAMMERING\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 212, in check_consistent_length
" samples: %r" % [int(l) for l in lengths]) ValueError: Found input variables with inconsistent numbers of samples: [150, 1]
x is (150,5) and y is (150,1). 150 is the number of samples in both, but they differ in number of fields, is this the problem and if so how do I fix it?
#1. Loading the Pandas libraries as pd
import pandas as pd
import numpy as np
#2. Read data from the file 'custom.csv' placed in your code directory
data = pd.read_csv("custom.csv")
#3. Preview the first 5 lines of the loaded data
print(data.head())
print(type(data))
#4.Test the shape of the data
print(data.shape)
df = pd.DataFrame(data)
print(df)
#5. Convert non-numericals to numericals
print(df.dtypes)
# Any object should be converted to numerical
df['species'] = pd.Categorical(df['species'])
df['species'] = df.species.cat.codes
print("outcome:")
print(df.dtypes)
#6.Convert df to numpy.ndarray
np = df.to_numpy()
print(type(np)) #this should state <class 'numpy.ndarray'>
print(data.shape)
print(np)
x = np.data
y = [df['species']]
print(y)
#K-nearest neighbor (find closest) - searach for the K nearest observations in the dataset
#The model calculates the distance to all, and selects the K nearest ones.
#8. Import the class you plan to use
from sklearn.neighbors import (KNeighborsClassifier)
#9. Pick a value for K
k = 2
#10. Instantiate the "estimator" (make an instance of the model)
knn = KNeighborsClassifier(n_neighbors=k)
print(knn)
#11. fit the model with data/model training
knn.fit(x, y)
#12. Predict the response for a new observation
print(knn.predict([[3, 5, 4, 2]]))```
This is how I used the scikit-learn KNeighborsClassifier to fit the knn model:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
df = datasets.load_iris()
X = pd.DataFrame(df.data)
y = df.target
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X,y)
print(knn.predict([[6, 3, 5, 2]]))
#prints output class [2]
print(knn.predict([[3, 5, 4, 2]]))
#prints output class [1]
From DataFrame you don't need to convert to numpy array, you can directly fit the model on DataFrame, also while converting the DataFrame to numpy array you have named that as np which is also used to import numpy at the top import numpy as np
The input prediction input is 4 columns, leaving the fifth 'species' without prediction. Also, if 'species' was the target it cannot be given as input to the knn at the same time. The pop removes this particular column from the dataFrame df.
#npdf = df.to_numpy()
df = df.apply(lambda x:pd.Series(x))
y = np.asarray(df['species'])
#removes the target from the sample
df.pop('species')
x = df.to_numpy()
I am running this code
import pandas as np
import numpy as np
from sklearn import cluster
from sklearn.cluster import KMeans
model = cluster.KMeans(n_clusters=4, random_state=10)
Then I put that through a dataframe I am working on and that includes the columns age and income, which is the clusters I am working on,
model.fit(df[['income', 'age']]
And so far it works well until I run the following bit, which aims at creating a column with the label of the cluster each data point belongs to.
df['cluster'] = model.labels_df.head()
And this is the error code I get:
AttributeError: 'KMeans' object has no attribute 'labels_df'
Any suggestions?
The attribute to access the labels of the model is: model.labels_
Use:
df['cluster'] = model.labels_
By typing model.labels_df.head() you request the head of model.labels_df that does not exist.
I believe you have mistyped it and you need:
df['cluster'] = model.labels_
df.head()
I'm trying to scale some data from a csv file. I'm doing this with pyspark to obtain the dataframe and sklearn for the scale part. Here is the code:
from sklearn import preprocessing
import numpy as np
import pyspark
from pysparl.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.option('header','true').csv('flights,csv')
X_scaled = preprocessing.scale(df)
If I make the dataframe with pandas the scale part doesn't have any problems, but with spark I get this error:
ValueError: setting an array element with a sequence.
So I'm guessing that the element types are different between pandas and pyspark, but how can I work with pyspark to do the scale?
sklearn works with pandas dataframe. So you have to convert spark dataframe to pandas dataframe.
X_scaled = preprocessing.scale(df.toPandas())
You can use the "StandardScaler" method from "pyspark.ml.feature". Attaching a sample script to perform the exact pre-processing as sklearn,
Step 1:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features",
outputCol="scaled_features",
withStd=True,withMean=True)
scaler_model = scaler.fit(transformed_data)
scaled_data = scaler_model.transform(transformed_data)
Remember before you perform step 1, you need to assemble all the features with VectorAssembler. Hence this will be your step 0.
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=required_features, outputCol='features')
transformed_data = assembler.transform(df)
I have a correlation matrix calculated as follow on pyspark 2.2:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
datos = sql("""select * from proceso_riesgos.jdgc_bd_train_mn_ingresos""")
Variables_corr= ['ingreso_final_mix','ingreso_final_promedio',
'ingreso_final_mediana','ingreso_final_trimedia','ingresos_serv_q1',
'ingresos_serv_q2','ingresos_serv_q3','prom_ingresos_serv','y_correc']
assembler = VectorAssembler(
inputCols=Variables_corr,
outputCol="features")
datos1=datos.select(Variables_corr).filter("y_correc is not null")
output = assembler.transform(datos)
r1 = Correlation.corr(output, "features")
the result is a data frame with a variable called "pearson(features): matrix":
Row(pearson(features)=DenseMatrix(20, 20, [1.0, 0.9428, 0.8908, 0.913,
0.567, 0.5832, 0.6148, 0.6488, ..., -0.589, -0.6145, -0.5906, -0.5534,
-0.5346, -0.0797, -0.617, 1.0], False))]
I need to take those values and export it to an excel, or to be able to manipulate the result.
A list could be desiderable.
Thanks for help!!
You are almost there ! There is no need to use old rdd mllib api .
This is my method to generate pandas dataframe, you can export to excel or csv or others format.
def correlation_matrix(df, corr_columns, method='pearson'):
vector_col = "corr_features"
assembler = VectorAssembler(inputCols=corr_columns, outputCol=vector_col)
df_vector = assembler.transform(df).select(vector_col)
matrix = Correlation.corr(df_vector, vector_col, method)
result = matrix.collect()[0]["pearson({})".format(vector_col)].values
return pd.DataFrame(result.reshape(-1, len(corr_columns)), columns=corr_columns, index=corr_columns)
New to python and sklearn so apologies in advance. I have two transformers and I would like to gather the results in a `FeatureUnion (for a final modelling step at the end). This should be quite simple but FeatureUnion is stacking the outputs rather than providing an nx2 array or DataFrame. In the example below I will generate some data that is 10 rows by 2 columns. This will then generate two features that are 10 rows by 1 column. I would like the final feature union to have 10 rows and 1 column but what I get are 20 rows by 1 column.
I will try to demonstrate with my example below:
some imports
import numpy as np
import pandas as pd
from sklearn import pipeline
from sklearn.base import TransformerMixin
some random data
df = pd.DataFrame(np.random.rand(10, 2), columns=['a', 'b'])
a custom transformer that selects a column
class Trans(TransformerMixin):
def __init__(self, col_name):
self.col_name = col_name
def fit(self, X):
return self
def transform(self, X):
return X[self.col_name]
a pipeline that uses the transformer twice (in my real case I have two different transformers but this reproduces the problem)
pipe = pipeline.FeatureUnion([
('select_a', Trans('a')),
('select_b', Trans('b'))
])
now i use the pipeline but it returns an array of twice the length
pipe.fit_transform(df).shape
(20,)
however I would like an array with dimensions (10, 2).
Quick fix?
The transformers in the FeatureUnion need to return 2-dimensional matrices, however in your code by selecting a column, you are returning a 1-dimensional vector. You could fix this by selecting the column with X[[self.col_name]].