To run linear regression in PySpark, I have to convert the feature columns of my data to dense vectors and then use them to fit the regression model, as shown below:
assembler = VectorAssembler(inputCols=feature_cols, outputCol="Features")
vtrain = assembler.transform(train).select('Features', y)
lin_reg = LinearRegression(solver='normal',
                           featuresCol='Features',
                           labelCol=y)
model = lin_reg.fit(vtrain)
This has been working for a while but just recently started giving me the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1059.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1059.0 (TID 1877) (10.139.64.10 executor 0): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'MMM dd, yyyy hh:mm:ss aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
This is confusing me because all of the columns in "train" are either integer or double; vtrain is just that same data in vectorized form, and there is no datetime parsing anywhere. I tried setting spark.sql.legacy.timeParserPolicy to LEGACY, but the same error occurred.
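For reference, a minimal sketch of what the schema check and that setting look like (assuming the active SparkSession is named spark):

# confirm that train really only contains numeric columns
train.printSchema()

# restore the pre-Spark-3.0 datetime parsing behaviour
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")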
Does anyone know why this might be?
I need to run a Data Envelopment Analysis (DEA) model on a data frame that I built, stored in CSV format. After a bit of research into which software and package would be easiest, I found R's Benchmarking package: https://cran.r-project.org/web/packages/Benchmarking/index.html.
However, when I try to run the code:
#loading the dataframe
IES<-read.csv("_INPUTS & OUTPUTS.csv", sep=";")
#specifying input and output matrix
(x <- with(IES, cbind(STUDENTS, FACULTY, AGE, PHD, MA, PRES, FTE, FTNE, PART, HOUR, TEC, TEC_SUP, TEC_SPEC, TEC_MA, TEC_PHD, PORTAL_CAPES, VIRT_JOURNAL, VIRT_BOOK, VL_DESPESA_PESSOAL_DOCENTE, VL_DESPESA_PESSOAL_TECNICO, VL_DESPESA_PESSOAL_ENCARGO, VL_DESPESA_CUSTEIO, VL_DESPESA_INVESTIMENTO, VL_DESPESA_PESQUISA, VL_DESPESA_OUTRA)))
(y <- matrix(IES$ST_COMPLETING))
#running the model
dea(x,y, RTS="vrs", ORIENTATION="out")
I get the following error:
Error in -XREF[, h] : invalid argument to unary operator
I have checked the data frame and there are no problems there. I also tried the input orientation ("in"), but I get the same error.
Can anyone help?
Solved it. The problem was that the .csv file used ',' as the decimal separator, so all I had to do was specify it when reading the file, i.e. read.csv("_INPUTS & OUTPUTS.csv", sep=";", dec=",").
I am working with PySpark DataFrames in Python. I have approximately 70 GB of JSON files, which are basically emails, and the aim is to perform TF-IDF on the email bodies. First, I converted the record-oriented JSON and stored it in HDFS. For the TF-IDF implementation, I did some data cleaning using Spark NLP, followed by the basic computation of tf, idf, and tf-idf, which includes several groupBy() and join() operations. The implementation works just fine with a small sample dataset, but when I run it with the entire dataset I get the following error:
Py4JJavaError: An error occurred while calling o506.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 107 in stage 7.0 failed 4 times, most recent failure: Lost task 107.3 in stage 7.0 : ExecutorLostFailure (executor 72 exited caused by one of the running tasks) Reason: Container from a bad node: container_1616585444828_0075_01_000082 on host: Exit status: 143. Diagnostics: [2021-04-08 01:02:20.700]Container killed on request. Exit code is 143
[2021-04-08 01:02:20.701]Container exited with a non-zero exit code 143.
[2021-04-08 01:02:20.702]Killed by external signal
Sample code:
from pyspark.sql.functions import explode, length, count

df_1 = df.select(df.id, explode(df.final).alias("words"))
df_1 = df_1.filter(length(df_1.words) > 3)
df_2 = df_1.groupBy("id", "words").agg(count("id").alias("count_id"))
I get the error on the third step, i.e. df_2 = df_1.groupBy("id", "words").agg(count("id").alias("count_id")).
Things I have tried (the first two are sketched below):
setting spark.sql.broadcastTimeout to 36000
calling df_1.coalesce(20) before the groupBy()
checking for null values -> there are no null values in any DataFrame
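For reference, a minimal sketch of the first two attempts (assuming the active SparkSession is named spark and df_1 is the DataFrame from the sample code above):

# raise the broadcast timeout (in seconds)
spark.conf.set("spark.sql.broadcastTimeout", "36000")

# reduce the number of partitions before the aggregation
df_1 = df_1.coalesce(20)
df_2 = df_1.groupBy("id", "words").agg(count("id").alias("count_id"))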
Nothing has worked so far. Since I am very new to PySpark, I would really appreciate some help on how to make the implementation more efficient and faster.
For a project I want to create a Core ML 3 model that receives some text (e.g. from mails) and classifies it. In addition, the model should be updatable and trainable on the devices. I found that a KNearestNeighborsClassifier can be updatable, so I wanted to use it for my approach.
However, first of all I get the following error while creating such a model with the script below:
RuntimeWarning: You will not be able to run predict() on this Core ML model. Underlying exception message was: Error compiling model: "Error reading protobuf spec. validator error: KNearestNeighborsClassifier requires k to be a positive integer."
In addition, I am not sure how to use the KNearestNeighborsClassifier correctly for my problem. In particular, which number of dimensions is the correct one if I want to classify texts? And how will I have to use the model correctly in the app? Maybe you know some useful guide which I have not found yet?
My script for creating the KNearestNeighborsClassifier is based on this guide: https://github.com/apple/coremltools/blob/master/examples/updatable_models/updatable_nearest_neighbor_classifier.ipynb
I have installed and I am using coremltools==3.0b6.
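For reference, that version can be installed with:

pip install coremltools==3.0b6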
Here is my actual script for creating the model:
number_of_dimensions = 128
from coremltools.models.nearest_neighbors import KNearestNeighborsClassifierBuilder
builder = KNearestNeighborsClassifierBuilder(input_name='input',
                                             output_name='output',
                                             number_of_dimensions=number_of_dimensions,
                                             default_class_label='defaultLabel',
                                             number_of_neighbors=3,
                                             weighting_scheme='inverse_distance',
                                             index_type='linear')
builder.author = 'Christian'
builder.license = 'MIT'
builder.description = 'Classifies {} dimension vector based on 3 nearest neighbors'.format(number_of_dimensions)
builder.spec.description.input[0].shortDescription = 'Input vector to classify'
builder.spec.description.output[0].shortDescription = 'Predicted label. Defaults to \'defaultLabel\''
builder.spec.description.output[1].shortDescription = 'Probabilities / score for each possible label.'
builder.spec.description.trainingInput[0].shortDescription = 'Example input vector'
builder.spec.description.trainingInput[1].shortDescription = 'Associated true label of each example vector'
#This lets the developer of the app change the number of neighbors at runtime from anywhere between 1 and 10, with a default of 3.
builder.set_number_of_neighbors_with_bounds(3, allowed_range=(1, 10))
# Let's set the index to kd_tree with leaf size of 30
builder.set_index_type('kd_tree', 30)
# By default an empty knn model is updatable
print(builder.is_updatable)
print(builder.number_of_dimensions)
print(builder.number_of_neighbors)
print(builder.number_of_neighbors_allowed_range())
print(builder.index_type)
mlmodel_updatable_path = './UpdatableKNN.mlmodel'
# Save the updated spec
from coremltools.models import MLModel
mlmodel_updatable = MLModel(builder.spec)
mlmodel_updatable.save(mlmodel_updatable_path)
I hope you can tell me whether my overall approach of using the KNearestNeighborsClassifier for text classification is sensible, and that you can help me to successfully create the Core ML model.
Many thanks in advance.
Not sure why you're getting that error, but make sure you're using the latest (beta) version of coremltools (3.0b6 currently).
As for the number of dimensions, you'll need to convert your text into a vector of a fixed length somehow. Exactly how you do that is totally up to the problem you're trying to solve.
For example, you could use the bag-of-words technique to turn a phrase into such a vector. You can use word embeddings, or a neural network, or any of the other common techniques for this.
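For example, a minimal plain-Python sketch of the bag-of-words idea (the vocabulary and sample text here are made up; in practice you would build the vocabulary from your own training data):

# hypothetical fixed vocabulary; its length is what number_of_dimensions would be
vocabulary = ["invoice", "meeting", "urgent", "schedule", "payment"]

def bag_of_words(text, vocabulary):
    # count how often each vocabulary word occurs in the text
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocabulary]

vector = bag_of_words("Please schedule the urgent meeting", vocabulary)
# -> [0.0, 1.0, 1.0, 1.0, 0.0]; with this toy vocabulary, number_of_dimensions would be 5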
But you need some way to turn the text into feature vectors.
I'm trying to run a simple Dataflow Python pipeline that gets certain user events from BigQuery and produces a per-user event count.
p = df.Pipeline(argv=pipeline_args)
result_query = "..."
data = p | df.io.Read(df.io.BigQuerySource(query=result_query))
user_events = data|df.Map(lambda x: (x['users_user_id'], 1))
user_event_counts = user_events|df.CombinePerKey(sum)
Running this gives me an error:
TypeError: Expected tuple, got int [while running 'Map(<lambda at user_stats.py:...>)']
Data before the CombinePerKey transform is in this form:
(u'55107178236374', 1)
(u'55107178236374', 1)
(u'55107178236374', 1)
(u'2296845644499670', 1)
(u'2296845644499670', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
(u'1489727796186326', 1)
If I instead calculate user_event_counts with this:
user_event_counts = (user_events|df.GroupByKey()|
df.Map('count', lambda (user, ones): (user, sum(ones))))
then there are no errors and I get the result I expect.
Based on the docs I would have expected similar behaviour from both approaches. I am obviously missing something with respect to CombinePerKey, but I can't see what it is. Any tips appreciated!
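For reference, here is a minimal plain-Python sketch (toy data, no Dataflow involved) of why I would expect the two approaches to agree:

from collections import defaultdict

events = [("55107178236374", 1), ("55107178236374", 1), ("1489727796186326", 1)]

# CombinePerKey(sum)-style: fold the values per key
combined = defaultdict(int)
for user, one in events:
    combined[user] += one

# GroupByKey() followed by Map-style: collect the values per key, then sum them
grouped = defaultdict(list)
for user, one in events:
    grouped[user].append(one)
summed = {user: sum(ones) for user, ones in grouped.items()}

assert dict(combined) == summed  # both give {'55107178236374': 2, '1489727796186326': 1}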
I am guessing you are running a version of the SDK lower than 0.2.4.
This is a bug in how we handle combining operations in some scenarios. The issue is fixed with the latest release of the SDK (v0.2.4): https://github.com/GoogleCloudPlatform/DataflowPythonSDK/releases/tag/v0.2.4
Sorry about that. Let us know if you still experience the issue with the latest release.
I keep seeing these warnings when using trainImplicit:
WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB).
The maximum recommended task size is 100 KB.
And then the task size starts to increase. I tried to call repartition on the input RDD but the warnings are the same.
All these warnings come from the ALS iterations, from flatMap and also from aggregate. For instance, here is the origin of the stage where the flatMap shows these warnings (with Spark 1.3.0, but they are also shown in Spark 1.3.1):
org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1065)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:530)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
and from aggregate:
org.apache.spark.rdd.RDD.aggregate(RDD.scala:968)
org.apache.spark.ml.recommendation.ALS$.computeYtY(ALS.scala:1112)
org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1064)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:538)
org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
scala.collection.immutable.Range.foreach(Range.scala:141)
org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)
A similar problem was described on the Apache Spark user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Large-Task-Size-td9539.html
I think you can try to play with the number of partitions (using the repartition() method), depending on how many hosts you have and how much RAM and CPU is available.
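For example, a minimal PySpark sketch of that idea (assuming an existing SparkContext sc and an implicit-feedback ratings file; the file name and numbers are only placeholders):

from pyspark.mllib.recommendation import ALS, Rating

# build the ratings RDD (user, product, implicit preference)
ratings = sc.textFile("ratings.csv") \
    .map(lambda line: line.split(",")) \
    .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

# spread the data over more partitions before training so each serialized task stays small
ratings = ratings.repartition(200)  # tune this to your cluster

model = ALS.trainImplicit(ratings, rank=10, iterations=10, lambda_=0.01, alpha=0.01)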
Also try to investigate all the steps via the Web UI, where you can see the number of stages, the memory usage of each stage, and data locality.
Or simply ignore these warnings, as long as everything works correctly and fast.
This notification is hard-coded in Spark (scheduler/TaskSetManager.scala):
if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
    !emittedTaskSizeWarning) {
  emittedTaskSizeWarning = true
  logWarning(s"Stage ${task.stageId} contains a task of very large size " +
    s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
    s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
}
and the threshold constant is defined as:
private[spark] object TaskSetManager {
  // The user will be warned if any stages contain a task that has a serialized size greater than
  // this.
  val TASK_SIZE_TO_WARN_KB = 100
}