How to use mapPartitions in pyspark - python

After following the Apache Spark documentation, I tried to experiment with the mapPartitions method. In the following code, I expected to get the initial RDD back, since in the function myfunc I am just returning the iterator after printing the values. But when I collect the RDD, it is empty.
from pyspark import SparkConf
from pyspark import SparkContext

def myfunc(it):
    print(it.next())
    return it

def fun1(sc):
    n = 5
    rdd = sc.parallelize([x for x in range(n + 1)], n)
    print(rdd.mapPartitions(myfunc).collect())

if __name__ == "__main__":
    conf = SparkConf().setMaster("local[*]")
    conf = conf.setAppName("TEST2")
    sc = SparkContext(conf=conf)
    fun1(sc)

mapPartitions is not the problem here. Iterators (here itertools.chain) are stateful and can be traversed only once. When you call it.next() you read and discard the first element, and what you return is the tail of the sequence.
When a partition has only one item (which should be the case for all but one partition here), you effectively discard the whole partition.
A few notes:
Printing to stdout inside a task is typically useless: the output ends up in the executor logs, not on the driver console.
The way you use next is not portable: it.next() works only in Python 2; in Python 3 use next(it). See the sketch below for a variant that inspects an element without dropping it.
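If you only want to peek at an element without losing it, one option (a small sketch in the question's own terms, untested against your exact setup) is to put the element back with itertools.chain:
from itertools import chain

def myfunc(it):
    try:
        first = next(it)      # portable: works on Python 2 and 3
    except StopIteration:
        return iter([])       # empty partition: nothing to print or return
    print(first)              # goes to the executor's stdout, not the driver console
    # Re-attach the consumed element so the whole partition is returned intact.
    return chain([first], it)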


How does PySpark work behind the scenes when using a Python module that loads files?

Let's say I have the following two Python projects:
PROJECT A
class FeatureBuilder:
    def __init__(self):
        self.artifact = read_artifacts_from_s3()

    def create_features(self):
        # do something with artifact
        ...
PROJECT B
from pyspark.sql import DataFrame
from builder import FeatureBuilder

def pandas_udf(df: DataFrame):
    feature_builder = FeatureBuilder()

    def create_features(pdf):
        feature_vector = feature_builder.create_features(pdf)
        return feature_vector

    return df.groupby("id").applyInPandas(create_features, df)
In this example, in project B, I'm calling the create_features function, which uses the FeatureBuilder object I imported from project A (which I can't change), and FeatureBuilder reads the file it needs from S3 (or any other location).
Project A is not a "PySpark" project - by this I mean it has no code related to the PySpark package, Spark session or Spark context at all.
What will happen in this case? Will every machine in the cluster read the file from S3?
If yes and let's say I can change project A, is there any way to optimize it? Maybe load the file from project B, broadcast it, and pass it to the object in project A?
Or maybe can I broadcast the FeatureBuilder object itself?
I'm not sure what the right way to do this is, given the constraint that I can't add any Spark code to project A.
When using PySpark, the code you write will be executed on a cluster of machines in a distributed manner. When you call the create_features function within the pandas_udf in Project B, PySpark will attempt to distribute the data and the code execution across multiple nodes in the cluster.
In the example you provided, when you create the FeatureBuilder object from Project A, the read_artifacts_from_s3 method will be executed on each worker node in the cluster, causing each node to read the file from S3 independently. This can lead to significant performance overhead and is not optimal.
If you can't change Project A, one way to optimize this is to read the file once on the driver, wrap its contents in a broadcast variable, and use that inside the create_features function within Project B. This way the contents of the file are shipped to all worker nodes in the cluster and only need to be read from S3 once.
Here's an example:
from pyspark.sql import DataFrame, SparkSession
from builder import FeatureBuilder

def pandas_udf(df: DataFrame):
    spark = SparkSession.builder.getOrCreate()
    # Read the artifact once on the driver (read_artifacts_from_s3 comes from
    # Project A), then broadcast it to the workers.
    artifact = read_artifacts_from_s3()
    broadcast_artifact = spark.sparkContext.broadcast(artifact)
    feature_builder = FeatureBuilder()

    def create_features(pdf):
        # The worker-side function uses the broadcast value instead of hitting S3.
        # This assumes create_features can accept the pre-loaded artifact as an argument.
        feature_vector = feature_builder.create_features(pdf, broadcast_artifact.value)
        return feature_vector

    # applyInPandas needs the output schema as its second argument;
    # df.schema is a placeholder for whatever create_features actually returns.
    return df.groupby("id").applyInPandas(create_features, schema=df.schema)

What is the Python Ray syntax to use an imported function as a Ray remote function with a decorator argument such as num_cpus?

I am trying to use an imported function as a Ray remote function; however, the usual syntax for declaring a remote function doesn't seem to work. There are several related questions on Stack Overflow that don't fully answer my question, see: here and here. Link 1 'seems' to solve the issue but looks very clunky and doesn't look like the way I would expect the Ray developers intended it to be used. Link 2 solves a similar issue, 'except' that if you need to give arguments to the decorated function (as I do in my case) it returns the following error:
AssertionError: The @ray.remote decorator must be applied either with no arguments and no parentheses, for example '@ray.remote', or it must be applied using some of the arguments 'num_returns', 'num_cpus', 'num_gpus', 'memory', 'object_store_memory', 'resources', 'max_calls', or 'max_restarts', like '@ray.remote(num_returns=2, resources={"CustomResource": 1})'.
Here is an example of full working code:
import pandas as pd
import numpy as np
import ray
from ray.cluster_utils import Cluster

config = {
    "cluster_cpus": 3,
    "n_cores": 1
}

df1 = pd.DataFrame()
df1['a'] = [1, 2, 3]
df1['b'] = [4, 5, 6]

try:
    # init ray
    # Start a head node for the cluster
    if not ray.is_initialized():
        master_cluster = Cluster(
            initialize_head=True,
            head_node_args={"num_cpus": config["cluster_cpus"]}
        )
        # start ray (after initializing cluster)
        try:
            ray.init(address=master_cluster.address, include_dashboard=False, log_to_driver=True)
        except TypeError:
            ray.init(address=master_cluster.address, include_webui=False)

    df_id = df1

    @ray.remote(num_cpus=config["n_cores"])
    def sp_workflow(df, sp):
        sp["output"] = df.sum(axis=1).values * sp["input"]
        return sp

    model_pool = [{"input": 1}, {"input": 2}, {"input": 3}]
    outputs = []
    result_ids = [sp_workflow.remote(df=df_id, sp=sp) for sp in model_pool]

    # Loop over the pending results and process completed jobs
    while len(result_ids):
        done_id, result_ids = ray.wait(result_ids)
        sp = ray.get(done_id[0])
        outputs.append(sp)

    print(outputs)

except Exception as e:
    raise e
finally:
    if ray.is_initialized():
        ray.shutdown()
However, in my case sp_workflow is a function stored in another script, so it cannot have the @ray.remote(num_cpus=config["n_cores"]) decorator applied to it. If I need to specify the number of cores to give an imported function that I wish to use remotely, it's not clear from the Ray docs how to do this, unless I missed something?
I tried replacing the sp_workflow definition with the imported version and the remote call with the following line but it gives the error mentioned earlier:
from other_library import sp_workflow
result_ids = [ray.remote(sp_workflow(df=df_id, sp=sp), num_cpus=config["n_cores"] ) for sp in model_pool]
Following the Nishihara answer mentioned in your post, don't skip the first step of converting your function to a remote version, remote_foo = ray.remote(local_foo); to pass options such as num_cpus, call ray.remote with those arguments first and then apply the result to your function.
So in your case it should be like this:
from other_library import sp_workflow

# Apply ray.remote with the resource arguments first, then wrap the imported function.
sp_workflow_rm = ray.remote(num_cpus=config["n_cores"])(sp_workflow)
result_ids = [sp_workflow_rm.remote(df=df_id, sp=sp) for sp in model_pool]
I have not tested it, hopefully it will work for you.
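An alternative (also untested here, but it only uses the documented ray.remote and .options APIs) is to wrap the imported function once and override the resources per call:
from other_library import sp_workflow

# Wrap the imported function once, then set resources per call with .options().
sp_workflow_rm = ray.remote(sp_workflow)
result_ids = [
    sp_workflow_rm.options(num_cpus=config["n_cores"]).remote(df=df_id, sp=sp)
    for sp in model_pool
]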

Passing function to spark that reads S3 file using pyspark

I have GBs of data in S3 and am trying to read it in parallel in my code by referring to the following link.
I am using the code below as a sample, but when I run it, it fails with the following error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Any help on this is deeply appreciated as I am very new to Spark.
EDIT: I have to read my S3 files in parallel, which is not explained in any of the linked posts. People marking this as a duplicate, please read the problem first.
class telco_cn:
    def __init__(self, sc):
        self.sc = sc

    def decode_module(msg):
        df = spark.read.json(msg)
        return df

    def consumer_input(self, sc, k_topic):
        a = sc.parallelize(['s3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json'])
        d = a.map(lambda x: telco_cn.decode_module(x)).collect()
        print(d)

if __name__ == "__main__":
    cn = telco_cn(sc)
    cn.consumer_input(sc, '')
You are attempting to call spark.read.json from within a map operation on an RDD. As this map operation will be executed on Spark's executor/worker nodes, you cannot reference a SparkContext/SparkSession variable (which is defined on the Spark driver) within the map. This is what the error message is trying to tell you.
Why not just call df=spark.read.json('s3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json') directly?
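For the parallelism requirement specifically, one option (a sketch, reusing the bucket path from the question) is to let Spark distribute the read itself: spark.read.json accepts a list of paths or a glob pattern, and the executors then read the files in parallel without any RDD map:
# Spark parallelizes the read across executors on its own.
paths = [
    's3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json',
    # ... more S3 paths, or use a glob such as 's3://bucket1/*.json'
]
df = spark.read.json(paths)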

Loading library on each map() execution

The spaCy library has problems when shared across executors (it cannot be pickled). One of the workarounds is to import it independently in each map execution, but the load takes a while.
I'm new to Spark, so I don't understand the exact mechanism behind map. What will happen in the example below?
I'm afraid of the worst-case scenario, where individual lines of text are processed independently and spaCy is imported for each one. A fresh import can take a good 10+ seconds, and we have 1,000,000+ lines of text.
class SpacyMagic(object):
    _spacys = {}

    @classmethod
    def get(cls, lang):
        if lang not in cls._spacys:
            import spacy
            cls._spacys[lang] = spacy.load(lang)
        return cls._spacys[lang]

def run_spacy(sent):
    nlp = SpacyMagic.get('en')
    return [wrd.text for wrd in nlp(sent)]

sc = SparkContext(appName="LineTokenizer")
data = sc.textFile(s3in)
res = data.map(run_spacy)
print(res.take(100))
sc.stop()
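For what it's worth, a sketch (reusing the names from the question) of how mapPartitions makes the cost explicit: the model is fetched once per partition and reused for every line, so the expensive spacy.load does not run per record:
def tokenize_partition(lines):
    # Fetch (or load) the model once for the whole partition...
    nlp = SpacyMagic.get('en')
    # ...then reuse it for every sentence in that partition.
    for sent in lines:
        yield [wrd.text for wrd in nlp(sent)]

res = data.mapPartitions(tokenize_partition)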

Pyspark multiple jobs in parallel

I have the following situation with my Pyspark:
In my driver program (driver.py), I call a function from another file (prod.py)
latest_prods = prod.featurize_prods().
Driver code:
from Featurize import Featurize
from LatestProd import LatestProd
from Oldprod import Oldprod

sc = SparkContext()

if __name__ == '__main__':
    print('Into main')
    featurize_latest = Featurize('param1', 'param2', sc)
    latest_prod = LatestProd(featurize_latest)
    latest_prods = latest_prod.featurize_prods()

    featurize_old = Featurize('param3', 'param3', sc)
    old_prod = Oldprod(featurize_old)
    old_prods = old_prod.featurize_oldprods()

    total_prods = sc.union([latest_prods, old_prods])
Then I do some reduceByKey code here... that generates total_prods_processed.
Finally I call:
total_prods_processed.saveAsTextFile(...)
I would like to generate latest_prods and old_prods in parallel. Both are created in the same SparkContext. Is it possible to do that? If not, how can I achieve that functionality?
Is this something that Spark does automatically? I am not seeing this behavior when I run the code, so please let me know if it is a configuration option.
After searching on the internet, I think your problem can be addressed by threads. It is as simple as creating two threads, one for your old_prods work and one for your latest_prods work.
Check this post for a simplified example; a sketch is also shown below. Since Spark is thread-safe, you gain the parallel efficiency without sacrificing anything.
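A minimal sketch of the threading idea, assuming the classes from the question (and that each featurize call does enough work, or triggers its own action, to benefit from running concurrently):
import threading

results = {}

def build_latest():
    # Runs in its own driver-side thread; SparkContext is thread-safe.
    results["latest"] = LatestProd(Featurize('param1', 'param2', sc)).featurize_prods()

def build_old():
    results["old"] = Oldprod(Featurize('param3', 'param3', sc)).featurize_oldprods()

threads = [threading.Thread(target=build_latest), threading.Thread(target=build_old)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total_prods = sc.union([results["latest"], results["old"]])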
The short answer is no, you can't schedule operations on two distinct RDDs at the same time in the same SparkContext. However, there are some workarounds: you could process them in two distinct SparkContexts on the same cluster and call saveAsTextFile on each, then read both back in another job to perform the union (this is not recommended by the documentation).
If you want to try this method, it is discussed here using spark-jobserver, since Spark doesn't support multiple contexts by default: https://github.com/spark-jobserver/spark-jobserver/issues/147
However, given the operations you perform, there would be no reason to process both at the same time, since you need the full results to perform the union; Spark will split those operations into two different stages that will be executed one after the other.
