I am using the iris data set from sklearn to do some basic predictive modelling. I am splitting the data into training and test sets and for a list of given proportions, I want to sample without replacement different proportions of the training data. I need to sample using np.random.choice cannot use df.sample
But what I am doing to sample seems to be incorrect. I will greatly appreciate any insights.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
props=[0.2,0.5,0.7,0.9]
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
columns= iris['feature_names'] + ['target'])
y=df[list(df.loc[:,df.columns.values =='target'])]
X=df[list(df.loc[:,df.columns.values !='target'])]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3
,train_size=0.7)
for i in proportions:
sampleX=np.random.choice(X_train, size=i, replace = False) #----> code to sample
You can sample the index:
props=[0.2,0.5,0.7,0.9]
for i in props:
ix = np.random.choice(X_train.index, size=int(i*len(X_train)), replace = False)
sampleX = X_train.loc[ix]
Or simply use a binomial:
for i in props:
sampleX = X_train.iloc[np.random.binomial(1,i,len(X))]
Related
I have loaded sklearn model successfully but unable to make predictions on pyspark dataframe. While running the below given code, getting an error mentioned below. Please help me to get the code to make predictions with sklearn model on pyspark. I also have searched relevant questions but could not find the solution.
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
braodcast_model.value
#update prediction method
def predictor(cols):
#call predict method for model
return model.value.predict(*cols)
udf_predictor = udf(predictor, FloatType())
#apply the udf to dataframe
df_prediction = df.withColumn("prediction", udf_predictor(df.select(list_of_columns)))
I get the following error message
TypeError: Invalid argument, not a string or column. For column literals, use 'lit', 'array',
'struct' or 'create_map' function.
I think you were on the right track for reaching your expected output.
I managed to find two possible solutions for such problem: one uses Spark UDF, the other uses Pandas UDF.
Spark UDF
from pyspark.sql.functions import udf
#udf('integer')
def predict_udf(*cols):
return int(braodcast_model.value.predict((cols,)))
list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_udf(*list_of_columns))
Pandas UDF
import pandas as pd
from pyspark.sql.functions import pandas_udf
#pandas_udf('integer')
def predict_pandas_udf(*cols):
X = pd.concat(cols, axis=1)
return pd.Series(braodcast_model.value.predict(X))
list_of_columns = df.columns
df_prediction = df.withColumn('prediction', predict_pandas_udf(*list_of_columns))
Reproducible example
Here I used a Databricks Community cluster with Spark 3.1.2, pandas==1.2.4 and pyarrow==4.0.0.
broadcasted_model is a simple logistic regression from scikit-learn, trained on the breast cancer dataset.
import pandas as pd
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from pyspark.sql.functions import udf, pandas_udf
# load dataset
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
# split in training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)
# create a small pipeline with standardization and model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# save and reload the model
path = '/databricks/driver/test_model.joblib'
joblib.dump(model, path)
loaded_model = joblib.load(path)
# sample of unseen data
df = spark.createDataFrame(X_test.sample(50, random_state=42))
# create broadcasted model
sc = spark.sparkContext
braodcast_model = sc.broadcast(loaded_model)
Then I used the two methods illustrated above and you will see that the outputs df_prediction will be the same in both cases.
I am currently using the scikit learn module in order to help with a crime prediction problem. I am having an issue batch coding the entire Dataframe that I have with the knn.predict method.
How can I batch code the entire two columns of my Dataframe with the knn.predict() method in order to store in another Dataframe the output?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
knn_df = pd.read_csv("/Users/helenapunset/Desktop/knn_dataframe.csv")
# x is the set of features
x = knn_df[['latitude', 'longitude']]
# y is the target variable
y = knn_df['Class']
# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
# training the data
knn.fit(x_train,y_train)
# test score was approximately 69%
knn.score(x_test,y_test)
# this is predicted to be a safe zone
crime_prediction = knn.predict([[25.787882, -80.358427]])
print(crime_prediction)
In the last line of the code I was able to add the two features I am using which are latitude and longitude from my Dataframe labeled knn_df. But, this is a single point I have been searching through the documentation on a process for streamlining this knn prediction for the entire Dataframe and cannot seem to find a way to do this. Is there somehow a possibility of using a for loop for this?
Let the new set to be predicted is 'knn_df_predict'. Assuming same column names,try the following lines of code :
x_new = knn_df_predict[['latitude', 'longitude']] #formating features
crime_prediction = knn.predict(x_new) #predicting for the new set
knn_df_predict['prediction'] = crime_prediction #Adding the prediction to dataframe
I want to create some random data and try to improve my model with PolynominalFeatures, however I'm facing little troubles with doing so.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
import random
import pandas as pd
import numpy as np
import statsmodels.api as sm
#create some artificial data
x=np.linspace(-1,1,1000)
x=pd.array(random.choices(x,k=1000))
y=x**2+np.random.randn(1000)
#divide sample
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
#define data frame to futher use for PolynomialFeatures
df=pd.DataFrame([x_train,x_test])
df=df.transpose()
data = df
# perform a polynomial features transform of the dataset
trans = PolynomialFeatures(degree=2)
data = trans.fit_transform(data)
model = sm.OLS(y_train,data).fit()
And then I get error : ValueError: unrecognized data structures: <class 'pandas.core.arrays.numpy_.PandasArray'> / <class 'numpy.ndarray'>
Do you have any ideas what should be done to make my regression work properly ?
use to_numpy() function to convert pandas array to numpy array
model = sm.OLS(y_train.to_numpy(),data).fit()
I am having this warning.
Instructions for updating:
To construct input pipelines, use the tf.data module.
I have had some search but I couldn't figure out the logic behind the tf.data.Dataset, so I couldn't manage converting pd.DataFrame into tf.data.Dataset.
I also need help for predictions at the end of the code, I couldn't figure out right way to compare predictions(high probability output) with label.
Note: data has no column names, so I have added a1 to a784 names to columns so I can assign them to feature_columns.
Thanks is advance.
Here is the code:
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn import metrics
from tensorflow.python.data import Dataset
mnist_df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/mnist_train_small.csv",header=None)
mnist_df.describe()
mnist_df.columns
hand_df = mnist_df[0]
matrix_df = mnist_df.drop([0],axis=1)
matrix_df.head()
hand_df.head()
#creating cols array and append a1 to a784 in order to name columns
cols=[]
for i in range(785):
if i!=0:
a = '{}{}'.format('a',i)
cols.append(a)
matrix_df.columns = cols
mnist_df = mnist_df.head(10000)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(matrix_df, hand_df, test_size=0.3, random_state=101)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
matrix_df = pd.DataFrame(data=scaler.fit_transform(matrix_df),
columns=matrix_df.columns,
index=matrix_df.index)
#naming columns so I will not get error while assigning feature_columns
for i in range(len(cols)):
a=i+1
b='{}{}'.format('a',a)
cols[i] = tf.feature_column.numeric_column(str(b))
matrix_df.head()
input_func = tf.estimator.inputs.pandas_input_fn(x=X_train,y=y_train,
batch_size=10,num_epochs=1000,
shuffle=True)
my_optimizer = tf.train.AdagradOptimizer(learning_rate=0.03)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
model = tf.estimator.DNNClassifier(feature_columns=cols,
hidden_units=[32,64],
n_classes=10,
optimizer=my_optimizer,
config=tf.estimator.RunConfig(keep_checkpoint_max=1))
model.train(input_fn=input_func,steps=1000)
predict_input_func = tf.estimator.inputs.pandas_input_fn(x=X_test,
batch_size=50,
num_epochs=1,
shuffle=False)
pred_gen = model.predict(predict_input_func)
predictions = list(pred_gen)
predictions[0]
I have a comma-separated CSV file with two numerical columns - inputs and outputs. They are correlated in a (more or less linear function), see below. The sample I have is very small.
Below, is the Python code I wrote using sklearn in order to predict values. Somehow it's not giving me the correct values (reasonable predictions). I am quite new to this, so please bear with me.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Data.
89,155
86,161
82.5,168
79.25,174
76.25,182
73,189
70,198
66.66,207
63.5,218
60.25,229
57,241
54,257
51,259
Actually you are having problem understanding your own code.
import pandas as pd
data = pd.read_csv("data.csv", header=None, names=['kg', 'cm'])
labels = data['kg']
train1 = data.drop(['kg'], axis=1) # In all honesty, I don't understand this.
Until here what you have done is that you have loaded the dataframe. After that you seprated X and y from the dataset.
labels represent the y values.
train1 represent the x values.
Since you wrote you can't understand :- train1 = data.drop(['kg'], axis=1)
Let me explain this. What this does is that from the dataframe which consist both column 'kg' and 'cm'. It removes 'kg' column (axis = 1 means column, axis = 0 means row). Hence only 'cm' is remaining which is your x.
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
reg.predict(80) # Gives an incorrect value of about 108.
Now you train the model on x values which represents 'cm' and y values which represent 'kg'.
When you predict(80) what happens is that you input the 'cm' value to be 80. Let me just plot the 'cm' vs 'kg' for training data.
When you input height as 80 this means that you are going more left, even more left than your plot. Hence as you can see x decrease y increase. It means that as 'cm' decrease means 'kg' increase. Hence ouput is 110 which is more.
from io import StringIO
input_data=StringIO("""89,155\n
86,161\n
82.5,168\n
79.25,174\n
76.25,182\n
73,189\n
70,198\n
66.66,207\n
63.5,218\n
60.25,229\n
57,241\n
54,257\n
51,259""")
import pandas as pd
data = pd.read_csv(input_data, header=None, names=['kg', 'cm'])
labels = data['cm']
train1 = data.drop(['cm'], axis=1) #This is similar to selecting the kg column
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train1, labels, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
reg.score(x_test, y_test)
import numpy as np
reg.predict(np.array([80]).reshape(-1, 1)) # 172.65013306.
I think you are having problems with small data size. The code flow looks normal to me, I would suggest you try to find the p-value for the input-output. This will tell you if the correlation found from your linear regression is significant or not (p-value <0.05).
You can find p-value using:
from scipy.stats import linregress
print(linregress(input, output))
To find p-value using scikit learn you probably need to use the formula to find p-value. Good luck.