Multi-processing in Azure Databricks - python

I have been tasked lately, to ingest JSON responses onto Databricks Delta-lake. I have to hit the REST API endpoint URL 6500 times with different parameters and pull the responses.
I have tried two modules, ThreadPool and Pool from the multiprocessing library, to make each execution a little quicker.
ThreadPool:
How to choose the number of threads for ThreadPool, when the Azure Databricks cluster is set to autoscale from 2 to 13 worker nodes?
Right now, I've set n_pool = multiprocessing.cpu_count(), will it make any difference, if the cluster auto-scales?
Pool
When I use Pool to use processors instead of threads. I see the following errors randomly on each execution. Well, I understand from the error that Spark Session/Conf is missing and I need to set it from each process. But I am on Databricks with default spark session enabled, then why do I see these errors.
Py4JError: SparkConf does not exist in the JVM
**OR**
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
Lastly, planning to replace multiprocessing with 'concurrent.futures.ProcessPoolExecutor'. Does it make any difference?

if you're using thread pools, they will run only on the driver node, executors will be idle. Instead you need to use Spark itself to parallelize the requests. This is usually done by creating a dataframe with list of URLs (or parameters for URL if base URL is the same), and then use Spark user defined function to do actual requests. Something like this:
import urllib
df = spark.createDataFrame([("url1", "params1"), ("url2", "params2")],
("url", "params"))
#udf("body string, status int")
def do_request(url: str, params: str):
full_url = url + "?" + params # adjust this as required
with urllib.request.urlopen(full_url) as f:
status = f.status
body = f.read().decode("utf-8")
return {'status': status, 'body': body}
res = df.withColumn("result", do_requests(col("url"), col("params")))
This will return dataframe with a new column called result that will have two fields - status and body (JSON answer as string).

You can try following way to resolve
Py4JError: SparkConf does not exist in the JVM
**OR**
py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM
Error
Install findspark
$pip install findspark
Code:
import findsparkfindspark.init()
References: Py4JError: SparkConf does not exist in the JVM and py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM

Related

How to connect kafka IO from apache beam to a cluster in confluent cloud

I´ve made a simple pipeline in Python to read from kafka, the thing is that the kafka cluster is on confluent cloud and I am having some trouble conecting to it.
Im getting the following log on the dataflow job:
Caused by: org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:820)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:631)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:612)
at org.apache.beam.sdk.io.kafka.KafkaIO$Read$GenerateKafkaSourceDescriptor.processElement(KafkaIO.java:1495)
Caused by: java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
So I think Im missing something while passing the config since it mentions something related to it, Im really new to all of this and I know nothing about java so I dont know how to proceed even reading the JAAS documentation.
The code of the pipeline is the following:
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
import os
import json
import logging
os.environ['GOOGLE_APPLICATION_CREDENTIALS']='credentialsOld.json'
with open('cluster.configuration.json') as cluster:
data=json.load(cluster)
cluster.close()
def logger(element):
logging.INFO('Something was found')
def main():
config={
"bootstrap.servers":data["bootstrap.servers"],
"security.protocol":data["security.protocol"],
"sasl.mechanisms":data["sasl.mechanisms"],
"sasl.username":data["sasl.username"],
"sasl.password":data["sasl.password"],
"session.timeout.ms":data["session.timeout.ms"],
"auto.offset.reset":"earliest"
}
print('======================================================')
beam_options = PipelineOptions(runner='DataflowRunner',project='project',experiments=['use_runner_v2'],streaming=True,save_main_session=True,job_name='kafka-stream-test')
with beam.Pipeline(options=beam_options) as p:
msgs = p | 'ReadKafka' >> ReadFromKafka(consumer_config=config,topics=['users'],expansion_service="localhost:8088")
msgs | beam.FlatMap(logger)
if __name__ == '__main__':
main()
I read something about passing a property java.security.auth.login.config in the config dictionary but since that example is with java and I´am using python Im really lost at what I have to pass or even if that´s the property I have to pass etc.
btw Im getting the api key and secret from here and this is what I am passing to sasl.username and sasl.password
I faced the same error the first time I tried the beam's expansion service. The key sasl.mechanisms that you are supplying is incorrect, try with sasl.mechanism also you do not need to supply the username and password since you are connection is authenticated by jasl basically the consumer_config like below worked for me:
config={
"bootstrap.servers":data["bootstrap.servers"],
"security.protocol":data["security.protocol"],
"sasl.mechanism":data["sasl.mechanisms"],
"session.timeout.ms":data["session.timeout.ms"],
"group.id":"tto",
"sasl.jaas.config":f'org.apache.kafka.common.security.plain.PlainLoginModule required serviceName="Kafka" username=\"{data["sasl.username"]}\" password=\"{data["sasl.password"]}\";',
"auto.offset.reset":"earliest"
}
I got a partial answer to this question since I fixed this problem but got into another one:
config={
"bootstrap.servers":data["bootstrap.servers"],
"security.protocol":data["security.protocol"],
"sasl.mechanisms":data["sasl.mechanisms"],
"sasl.username":data["sasl.username"],
"sasl.password":data["sasl.password"],
"session.timeout.ms":data["session.timeout.ms"],
"group.id":"tto",
"sasl.jaas.config":f'org.apache.kafka.common.security.plain.PlainLoginModule required serviceName="Kafka" username=\"{data["sasl.username"]}\" password=\"{data["sasl.password"]}\";',
"auto.offset.reset":"earliest"
}
I needed to provide the sasl.jaas.config porpertie with the api key and secret of my cluster and also the service name, however, now Im facing a different error whe running the pipeline on dataflow:
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
This error shows after 4-5 mins of trying to run the job on dataflow, actually I have no idea how to fix this but I think is related to my broker on confluent rejecting the connection, I think this could be related to the zone execution since the cluster is in a different zone than job region.
UPDATE:
I tested the code on linux/ubuntu and I dont know why but the expansión service gets downloaded automatically so you wont get unsoported signal error, still having some issues trying to autenticate to confluent kafka tho.

Find the yarn ApplicationID of of the current Spark job from the DRIVER node?

Is there a straightforward way to get the yarn ApplicationId of the current job from the DRIVER node running under Amazon's Elastic Map Reduce (EMR)? This is running Spark in the cluster mode.
Right now I'm using code that runs a map() operation on a worker to read the CONTAINER_ID environment variable. This seems inefficient. Here's the code:
def applicationIdFromEnvironment():
return "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])
def applicationId():
"""Return the Yarn (or local) applicationID.
The environment variables are only set if we are running in a Yarn container.
"""
# First check to see if we are running on the worker...
try:
return applicationIdFromEnvironment()
except KeyError:
pass
# Perhaps we are running on the driver? If so, run a Spark job that finds it.
try:
from pyspark import SparkConf, SparkContext
sc = SparkContext.getOrCreate()
if "local" in sc.getConf().get("spark.master"):
return f"local{os.getpid()}"
# Note: make sure that the following map does not require access to any existing module.
appid = sc.parallelize([1]).map(lambda x: "_".join(['application'] + os.environ['CONTAINER_ID'].split("_")[1:3])).collect()
return appid[0]
except ImportError:
pass
# Application ID cannot be determined.
return f"unknown{os.getpid()}"
You can get the applicationID directly from the SparkContext using the property applicationId:
A unique identifier for the Spark application. Its format depends on
the scheduler implementation.
in case of local spark app something like ‘local-1433865536131’
case of YARN something like ‘application_1433865536131_34483’
appid = sc.applicationId

Conflict - Python insert-update into Azure table storage

Working with Python, I have many processes that need to update/insert data into an Azure table storage at the same time using :
table_service.update_entity(table_name, task) <br/>
table_service.insert_entity(table_name, task)
However, the following error occurs:
<br/>AzureConflictHttpError: Conflict
{"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"The specified entity already exists.\nRequestId:57d9b721-6002-012d-3d0c-b88bef000000\nTime:2019-01-29T19:55:53.5984026Z"}}}
Maybe I need to use a global Lock to avoid operating the same Table Entity concurrently but I don't know how to use it
Azure Tables has a new SDK available for Python that is in a preview release available on pip, here's an update for the newest library.
On a create method you can use a try/except block to catch the expected error:
from azure.data.tables import TableClient
from azure.core.exceptions import ResourceExistsError
table_client = TableClient.from_connection_string(conn_str, table_name="myTableName")
try:
table_client.create_entity(entity=my_entity)
except ResourceExistsError:
print("Entity already exists")
You can use ETag to update entities conditionally after creation.
from azure.data.tables import UpdateMode
from azure.core import MatchConditions
received_entity = table_client.get_entity(
partition_key="my_partition_key",
row_key="my_row_key",
)
etag = received_entity._metadata["etag"]
resp = table_client.update_entity(
entity=my_new_entity,
etag=etag,
mode=UpdateMode.REPLACE,
match_condition=MatchConditions.IfNotModified
)
On update, you can elect for replace or merge, more information here.
(FYI, I am a Microsoft employee on the Azure SDK for Python team)
There isn't a global "Lock" in Azure Table Storage, since it's using optimistic concurrency via ETag (i.e. If-Match header in raw HTTP requests).
If your thread A is performing insert_entity, it should catch the 409 Conflict error.
If your thread B & C are performing update_entity, they should 412 Precondition Failed error, and use a loop to retrieve the latest entity then try to update the entity again.
For more details, please check Managing Concurrency in the Table Service section in https://azure.microsoft.com/en-us/blog/managing-concurrency-in-microsoft-azure-storage-2/

Submit a python job to spark

I'm trying to submit a job written in python to spark.
First I'm going to explain my setup, I installed a single node of spark 2.3.0 on a performing windows server (40 cores and 1TB or RAM), of course the goal will be later to create a cluster of less powerfull nodes, but for now I'm testing everything there :)
My first test consists on taking a set of tabular CSV files (40-100GB each), split them and then save the splitted result somewhere else.
I've been making my prototype on a jupyter notebook using pyspark (which creates automatically a sparkContext.
Now, I want to create a spark_test.py script, containing the body of my prototype in the main, which I'm planning to send to the spark-submit.
The thing is that seems like the processing part of my script is not working at all. Here you have the body of my script:
from pyspark import SparkContext, SparkConf
def main():
# Create spark context
spark_conf = SparkConf().setAppName('GMK_SPLIT_TEST')
print('\nspark configuration: \n%s\n' % spark_conf.toDebugString())
sc = SparkContext(conf=spark_conf)
# Variables definition
partitionings_number = 40
file_1 = r'D:\path\to\my\csv\file.csv'
output_path = r'D:\output\path'
# Processing 1
rdd = sc.parallelize(range(1000))
print(rdd.mean())
# Processing 2
sdf = spark.read.option('header','true').csv(file_1, sep=';', encoding='utf-8')
sdf_2 = sdf.repartition(partitionings_number, 'Zone[3-2]')
sdf_2.write.saveAsTable('CSVBuckets', format='csv', sep=';', mode='overwrite', path=output_path, header='True')
if __name__ == '__main__':
main()
Now here is where I have more doubts. Does spark-submit will try to connect to an already running instance of spark or it will initialize one by itself?
I tried:
spark-submit --master local[20] --driver-memory 30g
The above command seems to work if for the Processing 1, but no for the Processing 2
spark-submit --master spark:\\127.0.0.1:7077 --driver-memory 30g
The above command raises an exception of spark contex initialization. Is it because I don't have any spark instance running?
How so I have to pass my file_1 with my python job in order to have do the processing 2? I tried --files without success.
Thank for your time guys!

Pyspark multiple jobs in parallel

I have the following situation with my Pyspark:
In my driver program (driver.py), I call a function from another file (prod.py)
latest_prods = prod.featurize_prods().
Driver code:
from Featurize import Featurize
from LatestProd import LatestProd
from Oldprod import Oldprod
sc = SparkContext()
if __name__ == '__main__':
print 'Into main'
featurize_latest = Featurize('param1', 'param2', sc)
latest_prod = LatestProd(featurize_latest)
latest_prods = latest_prod.featurize_prods()
featurize_old = Featurize('param3', 'param3', sc)
old_prods = Oldprod(featurize_old)
old_prods = oldprod.featurize_oldprods()
total_prods = sc.union([latest_prods, old_prods])
Then I do some some reduceByKey code here... that generates total_prods_processed.
Finally I call:
total_prods_processed.saveAsTextFile(...)
I would like to generate latest_prods and old_prods in parallel. Both are created in the same SparkContext. Is it possible to do that? If not, how can I achieve that functionality?
Is this something that does Spark automatically? I am not seeing this behavior when I run the code so please let me know if it is a configuration option.
After searching on the internet, I think your problem can be addressed by threads. It is as simple as create two threads for your old_prod and latest_prod work.
Check this post for a simplified example. Since Spark is thread-safe, you gain the parallel efficiency without sacrifice anything.
The short answer is no, you can't schedule operations on two distinct RDDs at the same time in the same spark context. However there are some workarounds, you could process them in two distinct SparkContext on the same cluster and call SaveAsTextFile. Then read both in another job to perform the union. (this is not recommended by the documentation).
If you want to try this method, it is discussed here using spark-jobserver since spark doesn't support multiple context by default : https://github.com/spark-jobserver/spark-jobserver/issues/147
However according to the operations you perform there would be no reason to process both at the same time since you need the full results to perform the union, spark will split those operations in 2 different stages that will be executed one after the other.

Categories