I have GBs of data in S3 and am trying to bring parallelism into my reads by referring to the following link.
I am using the code below as a sample, but when I run it, it fails with the following error:
Any help on this is deeply appreciated as I am very new to Spark.
EDIT: I have to read my S3 files in parallel, which is not explained in any post. People marking this as a duplicate, please read the problem first.
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
class telco_cn:
    def __init__(self, sc):
        self.sc = sc

    def decode_module(msg):
        df = spark.read.json(msg)
        return df

    def consumer_input(self, sc, k_topic):
        a = sc.parallelize(['s3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json'])
        d = a.map(lambda x: telco_cn.decode_module(x)).collect()
        print(d)

if __name__ == "__main__":
    cn = telco_cn(sc)
    cn.consumer_input(sc, '')
You are attempting to call spark.read.json from within a map operation on an RDD. As this map operation will be executed on Spark's executor/worker nodes, you cannot reference a SparkContext/SparkSession variable (which is defined on the Spark driver) within the map. This is what the error message is trying to tell you.
Why not just call df=spark.read.json('s3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json') directly?
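If the goal is simply to read many S3 objects in parallel, note that spark.read.json already distributes the read across the executors, and it accepts a list of paths or a glob. A minimal sketch (the second path and the glob pattern are hypothetical):

paths = [
    's3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json',
    's3://bucket1/another-file.json',   # hypothetical second object
]
df = spark.read.json(paths)

# or read everything under a prefix in one call
df_all = spark.read.json('s3://bucket1/*.json')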
Let's say I have the following two Python projects:
PROJECT A
class FeatureBuilder:
    def __init__(self):
        self.artifact = read_artifacts_from_s3()

    def create_features(self):
        # do something with artifact
PROJECT B
from pyspark.sql import DataFrame
from builder import FeatureBuilder

def pandas_udf(df: DataFrame):
    feature_builder = FeatureBuilder()

    def create_features(pdf):
        feature_vector = feature_builder.create_features(pdf)
        return feature_vector

    return df.groupby("id").applyInPandas(create_features, df)
In this example, in project B, I'm calling the create_features function, which uses the FeatureBuilder object I imported from project A (which I can't change), and FeatureBuilder reads the file it needs from S3 (or any other location).
Project A is not a "PySpark" project - by this I mean it has no code related to the PySpark package, Spark session or Spark context at all.
What will happen in this case? Will every machine in the cluster read the file from S3?
If yes and let's say I can change project A, is there any way to optimize it? Maybe load the file from project B, broadcast it, and pass it to the object in project A?
Or maybe can I broadcast the FeatureBuilder object itself?
I'm not sure what the right way to do that is, under the constraint that I can't add any Spark code to project A.
When using PySpark, the code you write will be executed on a cluster of machines in a distributed manner. When you call the create_features function within the pandas_udf in Project B, PySpark will attempt to distribute the data and the code execution across multiple nodes in the cluster.
In the example you provided, when you call the FeatureBuilder object from Project A, the read_artifacts_from_s3 method will be executed on each worker node in the cluster, causing each node to read the file from S3 independently. This can lead to a significant performance overhead and is not optimal.
If you can't change Project A, one way to optimize it would be to read the contents of the file once on the driver node, put it in a broadcast variable, and then use it in the create_features method within Project B. This way, the contents of the file are shipped to all worker nodes in the cluster, and the data only needs to be read once.
Here's an example:
from pyspark.sql import DataFrame, SparkSession
from builder import FeatureBuilder, read_artifacts_from_s3  # assumes read_artifacts_from_s3 is importable from Project A

def pandas_udf(df: DataFrame):
    spark = SparkSession.builder.getOrCreate()
    # Read the artifact once on the driver and broadcast it to the workers
    artifact = read_artifacts_from_s3()
    broadcast_artifact = spark.sparkContext.broadcast(artifact)
    feature_builder = FeatureBuilder()

    def create_features(pdf):
        # Workers use the broadcast value instead of re-reading from S3
        feature_vector = feature_builder.create_features(pdf, broadcast_artifact.value)
        return feature_vector

    return df.groupby("id").applyInPandas(create_features, schema=df.schema)  # output schema; adjust to the features produced
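If Project A really can't be touched at all, another option raised in the question is to broadcast the FeatureBuilder object itself. A minimal sketch, assuming FeatureBuilder and its artifact are picklable and that create_features accepts the pandas DataFrame as in the question's Project B (spark here is the active SparkSession):

from builder import FeatureBuilder

# Construct once on the driver (this triggers the single S3 read)
# and broadcast the whole object to the workers.
fb_broadcast = spark.sparkContext.broadcast(FeatureBuilder())

def create_features(pdf):
    # Workers deserialize the broadcast FeatureBuilder and use it as-is
    return fb_broadcast.value.create_features(pdf)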
I have been tasked lately with ingesting JSON responses into Databricks Delta Lake. I have to hit the REST API endpoint URL 6500 times with different parameters and pull the responses.
I have tried two modules, ThreadPool and Pool from the multiprocessing library, to make each execution a little quicker.
ThreadPool:
How to choose the number of threads for ThreadPool, when the Azure Databricks cluster is set to autoscale from 2 to 13 worker nodes?
Right now, I've set n_pool = multiprocessing.cpu_count(). Will it make any difference if the cluster auto-scales?
Pool
When I use Pool to use processes instead of threads, I see the following errors randomly on each execution. I understand from the error that the Spark session/conf is missing and I need to set it from each process, but I am on Databricks with the default Spark session enabled, so why do I see these errors?
Py4JError: SparkConf does not exist in the JVM
**OR**
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
Lastly, I am planning to replace multiprocessing with concurrent.futures.ProcessPoolExecutor. Does it make any difference?
If you're using thread pools, they will run only on the driver node and the executors will stay idle. Instead, you need to use Spark itself to parallelize the requests. This is usually done by creating a dataframe with the list of URLs (or parameters for the URL if the base URL is the same), and then using a Spark user-defined function to do the actual requests. Something like this:
import urllib.request

from pyspark.sql.functions import col, udf

df = spark.createDataFrame([("url1", "params1"), ("url2", "params2")],
                           ("url", "params"))

@udf("body string, status int")
def do_request(url: str, params: str):
    full_url = url + "?" + params  # adjust this as required
    with urllib.request.urlopen(full_url) as f:
        status = f.status
        body = f.read().decode("utf-8")
    return {'status': status, 'body': body}

res = df.withColumn("result", do_request(col("url"), col("params")))
This will return a dataframe with a new column called result that has two fields: status and body (the JSON response as a string).
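If needed, you can spread the 6500 calls across more tasks and then flatten the struct before writing to Delta. A small follow-up sketch reusing the names from the example above (the partition count is just an illustration):

# More partitions -> more concurrent requests across executors
requests_df = df.repartition(64)
res = requests_df.withColumn("result", do_request(col("url"), col("params")))

# Flatten the struct into plain columns before writing to Delta
flat = res.select("url", "result.status", "result.body")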
You can try the following to resolve the errors:

Py4JError: SparkConf does not exist in the JVM

**OR**

py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
Install findspark
$pip install findspark
Code:
import findspark
findspark.init()
References: Py4JError: SparkConf does not exist in the JVM and py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
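For completeness, a minimal sketch of the intended order of operations, which is what findspark fixes (nothing here is specific to Databricks):

import findspark
findspark.init()   # must run before any pyspark import in the process

import pyspark     # safe to import only after findspark.init()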
Currently, I am processing data in hive using custom mappers and reducers like this:
select TRANSFORM(hostname,impressionId) using 'python process_data.py' as a,b from impressions
But when I try to apply the same logic in Spark SQL, I get a SparkSqlParser error.
I want to reuse the logic in process_data.py out of the box. Is there any way to do it?
You need to include some sort of error stack trace so that the community can answer your question quickly.
To run the Python script from your Scala code (that is what I am assuming), you can achieve it in the following way:
Example:
Python file: code for converting input data to uppercase

#!/usr/bin/python
import sys

for line in sys.stdin:
    print(line.upper())
Spark code: piping the data

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)

val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)

val ipData = sc.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
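Since the original question is about PySpark/Spark SQL, here is a hedged PySpark sketch of the same piping approach. It assumes process_data.py reads tab-separated hostname/impressionId lines from stdin and is executable (otherwise prefix the command with 'python '); the sample rows are hypothetical:

from pyspark import SparkFiles

sc.addFile("/path/on/driver/process_data.py")

ip_data = sc.parallelize(["host1\t1", "host2\t2"])   # hypothetical sample rows
op_data = ip_data.pipe(SparkFiles.get("process_data.py"))
print(op_data.collect())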
You can create your own custom UDF and use it within your Spark application code. Use a custom UDF only when something cannot be done with the available Spark native functions.
I am not sure what is in process_data.py, what kind of input it takes, or what you expect out of it.
If it's something you want to make available to different application code, you can do it as follows:
Create a class in Python (for example in somecode.py) with a function that does the processing:

class MyClass:
    def __init__(self, args):
        …
    def MyFunction(self):
        …

Then ship that file to your Spark application with addPyFile:

spark.sparkContext.addPyFile('/py file location/somecode.py')
Import your class in the PySpark application code:

from somecode import MyClass

Create an object to access the class and its functions:

myobject = MyClass()

Now you can call your class functions to send and receive arguments.
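For illustration, a hypothetical sketch of wrapping the class in a Spark UDF. It assumes MyClass can be constructed without arguments and that MyFunction is adapted to take one column value and return a string, which is not shown in the snippet above:

from pyspark.sql.functions import udf
from somecode import MyClass

# Hypothetical: MyFunction takes one value and returns a string
process_udf = udf(lambda value: MyClass().MyFunction(value), "string")

df = spark.table("impressions")
df = df.withColumn("processed", process_udf(df["hostname"]))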
After following the Apache Spark documentation, I tried to experiment with mapPartitions. In the following code, I expected to see the initial RDD, since in the function myfunc I just return the iterator after printing the values. But when I collect the RDD it is empty.
from pyspark import SparkConf
from pyspark import SparkContext

def myfunc(it):
    print(it.next())
    return it

def fun1(sc):
    n = 5
    rdd = sc.parallelize([x for x in range(n+1)], n)
    print(rdd.mapPartitions(myfunc).collect())

if __name__ == "__main__":
    conf = SparkConf().setMaster("local[*]")
    conf = conf.setAppName("TEST2")
    sc = SparkContext(conf=conf)
    fun1(sc)
mapPartitions is not relevant here. Iterators (here itertools.chain) are stateful and can be traversed only once. When you call it.next() you read and discard the first element and what you return is a tail of the sequence.
When a partition has only one item (which should be the case for all but one of them), you effectively discard the whole partition.
A few notes:
Putting anything to stdout inside a task is typically useless.
The way you use next is not portable and cannot be used in Python 3.
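A hedged rewrite of myfunc that keeps every element and is portable to Python 3:

def myfunc(it):
    # Materialize the partition so inspecting it does not consume the iterator
    items = list(it)
    print(items)          # printed in the executor logs, not on the driver
    return iter(items)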
I have the following situation with my PySpark:
In my driver program (driver.py), I call a function from another file (prod.py):
latest_prods = prod.featurize_prods().
Driver code:
from Featurize import Featurize
from LatestProd import LatestProd
from Oldprod import Oldprod

sc = SparkContext()

if __name__ == '__main__':
    print 'Into main'
    featurize_latest = Featurize('param1', 'param2', sc)
    latest_prod = LatestProd(featurize_latest)
    latest_prods = latest_prod.featurize_prods()

    featurize_old = Featurize('param3', 'param3', sc)
    old_prods = Oldprod(featurize_old)
    old_prods = oldprod.featurize_oldprods()

    total_prods = sc.union([latest_prods, old_prods])
Then I do some reduceByKey code here... that generates total_prods_processed.
Finally I call:
total_prods_processed.saveAsTextFile(...)
I would like to generate latest_prods and old_prods in parallel. Both are created in the same SparkContext. Is it possible to do that? If not, how can I achieve that functionality?
Is this something that Spark does automatically? I am not seeing this behavior when I run the code, so please let me know if it is a configuration option.
After searching on the internet, I think your problem can be addressed by threads. It is as simple as creating two threads for your old_prods and latest_prods work.
Check this post for a simplified example. Since Spark is thread-safe, you gain the parallel efficiency without sacrificing anything.
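A minimal sketch of that idea, reusing the objects from the question and using concurrent.futures instead of raw threads: the featurize calls are submitted from driver threads so their Spark jobs can be scheduled concurrently on the same SparkContext.

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:
    latest_future = pool.submit(latest_prod.featurize_prods)
    old_future = pool.submit(old_prods.featurize_oldprods)
    latest_rdd = latest_future.result()
    old_rdd = old_future.result()

total_prods = sc.union([latest_rdd, old_rdd])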
The short answer is no, you can't schedule operations on two distinct RDDs at the same time in the same Spark context. However, there are some workarounds: you could process them in two distinct SparkContexts on the same cluster and call saveAsTextFile, then read both in another job to perform the union (this is not recommended by the documentation).
If you want to try this method, it is discussed here using spark-jobserver, since Spark doesn't support multiple contexts by default: https://github.com/spark-jobserver/spark-jobserver/issues/147
However, given the operations you perform, there would be no reason to process both at the same time: since you need the full results to perform the union, Spark will split those operations into two different stages that will be executed one after the other.
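If you do go the save-then-union route, a hedged sketch of the follow-up job described above (the S3 paths are placeholders):

latest_prods = sc.textFile("s3://bucket/latest_prods/")
old_prods = sc.textFile("s3://bucket/old_prods/")

total_prods = latest_prods.union(old_prods)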