Currently, I am processing data in Hive using custom mappers and reducers like this:
select TRANSFORM(hostname,impressionId) using 'python process_data.py' as a,b from impressions
But when I try to apply the same logic in Spark SQL, I get a SparkSqlParser error.
I want to reuse the logic in process_data.py out of the box. Is there any way to do it?
You should include the error stack trace so that the community can answer your question quickly.
To run the Python script from your Scala code (that is what I am assuming), you can do it the following way:
Example:
Python file: converts the input data to uppercase
#!/usr/bin/python
import sys

# Uppercase every line read from stdin and write it back to stdout.
for line in sys.stdin:
    print(line.upper())
Spark code: pipe the data through the script
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}
val conf = new SparkConf().setAppName("Pipe")
val sc = new SparkContext(conf)
val distScript = "/path/on/driver/PipeScript.py"
val distScriptName = "PipeScript.py"
sc.addFile(distScript)
val ipData = sc.parallelize(List("asd","xyz","zxcz","sdfsfd","Ssdfd","Sdfsf"))
val opData = ipData.pipe(SparkFiles.get(distScriptName))
opData.foreach(println)
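If the rest of your pipeline is in PySpark rather than Scala, the same pipe() approach works on Python RDDs too. A minimal sketch, assuming process_data.py reads tab-separated lines (the Hive TRANSFORM default) from stdin and writes its results to stdout:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Pipe").getOrCreate()
sc = spark.sparkContext

# Ship the script to every executor.
sc.addFile("/path/on/driver/process_data.py")

# Turn each row of the table into the tab-separated line the script expects.
rows = spark.table("impressions").select("hostname", "impressionId").rdd
lines = rows.map(lambda r: "\t".join(str(c) for c in r))

# Pipe the lines through the script and read back its stdout, line by line.
piped = lines.pipe("python " + SparkFiles.get("process_data.py"))
piped.take(10)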
You can create your own custom UDF and use it within your Spark application code. Use a custom UDF only when you cannot achieve the same thing with the available Spark native functions.
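For example, a minimal UDF sketch (the to_upper logic here is only a stand-in, not the actual contents of process_data.py):
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Register a plain Python function as a UDF so Spark SQL can call it.
def to_upper(value):
    return value.upper() if value is not None else None

spark.udf.register("to_upper", to_upper, StringType())
spark.sql("SELECT to_upper(hostname) AS a FROM impressions").show()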
I am not sure what is in process_data.py, what kind of input it takes, or what you expect out of it.
If it is something that you want to make available to different application code, you can do the following:
Create a class in Python with a function inside it that does the processing, then add the file to your Spark application code.
class MyClass:
    def __init__(self, args):
        ...

    def MyFunction(self):
        ...
Add the file to your Spark application code:
spark.sparkContext.addPyFile('/py file location/somecode.py')
Import your class in the PySpark application code:
from somecode import MyClass
Create an object to access the class and its functions:
myobject = MyClass()
Now you can call your class function to send and receive arguments; a consolidated sketch of these steps follows.
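Putting those steps together, a minimal sketch (the file path, class name, and method name are the placeholders from the steps above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ship the module so the import also works on the executors.
spark.sparkContext.addPyFile('/py file location/somecode.py')

from somecode import MyClass

# Instantiate the class and call its processing function.
myobject = MyClass()
result = myobject.MyFunction()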
Related
Let's say I have the following two Python projects:
PROJECT A
class FeatureBuilder:
    def __init__(self):
        self.artifact = read_artifacts_from_s3()

    def create_features(self):
        # do something with artifact
PROJECT B
from pyspark.sql import DataFrame
from builder import FeatureBuilder

def pandas_udf(df: DataFrame):
    feature_builder = FeatureBuilder()

    def create_features(pdf):
        feature_vector = feature_builder.create_features(pdf)
        return feature_vector

    return df.groupby("id").applyInPandas(create_features, df)
In this example, in project B, I'm calling the create_features function, which uses the FeatureBuilder object I imported from project A (which I can't change), and FeatureBuilder reads the file it needs from S3 (or any other location).
Project A is not a "PySpark" project - by this I mean it has no code related to the PySpark package, Spark session or Spark context at all.
What will happen in this case? Will every machine in the cluster read the file from S3?
If yes and let's say I can change project A, is there any way to optimize it? Maybe load the file from project B, broadcast it, and pass it to the object in project A?
Or maybe can I broadcast the FeatureBuilder object itself?
I'm not sure what the right way to do that is, given the constraint that I can't add any Spark code to project A.
When using PySpark, the code you write will be executed on a cluster of machines in a distributed manner. When you call the create_features function within the pandas_udf in Project B, PySpark will attempt to distribute the data and the code execution across multiple nodes in the cluster.
In the example you provided, when you call the FeatureBuilder object from Project A, the read_artifacts_from_s3 method will be executed on each worker node in the cluster, causing each node to read the file from S3 independently. This can lead to a significant performance overhead and is not optimal.
If you can't change Project A, one way to optimize this would be to read the contents of the file once on the driver node, wrap them in a broadcast variable, and then use that in the create_features function within Project B. This way, the contents of the file are broadcast to all worker nodes in the cluster, and the data only needs to be read once.
Here's an example:
from pyspark.sql import DataFrame, SparkSession
from builder import FeatureBuilder

def pandas_udf(df: DataFrame):
    spark = SparkSession.builder.getOrCreate()

    # Read the artifact once on the driver and broadcast it to the workers.
    artifact = read_artifacts_from_s3()
    broadcast_artifact = spark.sparkContext.broadcast(artifact)

    feature_builder = FeatureBuilder()

    def create_features(pdf):
        feature_vector = feature_builder.create_features(pdf, broadcast_artifact.value)
        return feature_vector

    # applyInPandas expects the output schema as its second argument;
    # df.schema is used here assuming the output columns match the input.
    return df.groupby("id").applyInPandas(create_features, df.schema)
I'm writing an aggregation in PySpark.
For this project, I'm also adding tests, where I create a session, put in some data, and then run my aggregation and check the results.
The code looks like the following:
def mapper_convert_row(row):
    # ... business-logic-specific code; eventually returns one string value
    return my_str

def run_spark_query(spark: SparkSession, from_dt, to_dt):
    query = get_hive_query_str(from_dt, to_dt)
    df = spark.sql(query).rdd.map(lambda row: Row(mapper_convert_row(row)))

    out_schema = StructType([StructField("data", StringType())])
    df_conv = spark.createDataFrame(df, out_schema)
    df_conv.write.mode('overwrite').format("csv").save(folder)
And here is my test class
class SparkFetchTest(unittest.TestCase):
    @staticmethod
    def getOrCreateSC():
        conf = SparkConf()
        conf.setMaster("local")
        spark = (SparkSession.builder.config(conf=conf).appName("MyPySparkApp")
                 .enableHiveSupport().getOrCreate())
        return spark

    def test_fetch(self):
        dt_from = datetime.strptime("2019-01-01-10-00", '%Y-%m-%d-%H-%M')
        dt_to = datetime.strptime("2019-01-01-10-05", '%Y-%m-%d-%H-%M')
        spark = self.getOrCreateSC()
        self.init_and_populate_table_with_test_data(spark, input_tbl, dt_from, dt_to)
        run_spark_query(spark, dt_from, dt_to)
        # assert on results
I've added the PySpark dependency via a Conda environment
and run this code via PyCharm. Just to make it clear: there is no Spark installation on my local machine apart from the PySpark Conda package.
When I set a breakpoint in the code, it works for me in the driver code, but it does not stop inside the mapper_convert_row function.
How can I debug this business logic function in a local test environment?
The same approach works perfectly in Scala, but this code has to be in Python.
PySpark is a conduit to the Spark runtime, which runs on the JVM and is written in Scala. The connection goes through Py4J, which provides a TCP-based socket from the Python executable to the JVM. Unfortunately, that means
no local debugging of the code that runs on the workers.
I'm no happier about it than you are. I might just write/maintain a parallel code branch in Scala to figure out some things that are tiring to do without the debugger.
Update: PyCharm is able to debug Spark programs. I have been using it nearly daily; see Pycharm Debugging of Pyspark.
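In the meantime, one way to step through the business logic locally (a sketch, not part of the original answer; the column names are hypothetical) is to call mapper_convert_row directly with a hand-built Row, completely outside Spark, so the debugger stays in a single Python process. For example, as an extra method on the test class above:
from pyspark.sql import Row

def test_mapper_convert_row_directly(self):
    # Hypothetical row shape; use whatever columns the Hive query actually returns.
    sample_row = Row(col_a="some value", col_b=42)
    result = mapper_convert_row(sample_row)
    # A breakpoint inside mapper_convert_row is hit here, since no executor process is involved.
    self.assertIsInstance(result, str)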
I have GBs of data in S3 and am trying to bring parallelism into how I read it in my code by referring to the following link.
Any help on this is deeply appreciated, as I am very new to Spark.
EDIT: I have to read my S3 files using parallelism, which is not explained in any other post. People marking this as a duplicate, please read the problem first.
I am using the code below as a sample, but when I run it, it fails with the following error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
class telco_cn:
    def __init__(self, sc):
        self.sc = sc

    def decode_module(msg):
        df = spark.read.json(msg)
        return df

    def consumer_input(self, sc, k_topic):
        a = sc.parallelize(['s3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json'])
        d = a.map(lambda x: telco_cn.decode_module(x)).collect()
        print(d)

if __name__ == "__main__":
    cn = telco_cn(sc)
    cn.consumer_input(sc, '')
You are attempting to call spark.read.json from within a map operation on an RDD. As this map operation will be executed on Spark's executor/worker nodes, you cannot reference a SparkContext/SparkSession variable (which is defined on the Spark driver) within the map. This is what the error message is trying to tell you.
Why not just call df=spark.read.json('s3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json') directly?
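If the goal is simply to read many S3 objects in parallel, one sketch (with hypothetical paths) is to hand all of the paths to a single spark.read.json call, which distributes the read across the executors on its own:
# spark.read.json accepts a list of paths (or a glob pattern) and
# parallelises the read itself; no rdd.map around the reader is needed.
paths = [
    's3://bucket1/file-0001.json',
    's3://bucket1/file-0002.json',
]
df = spark.read.json(paths)
df.show()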
I am trying to capture logging in behave tests. The behave version I am using is 1.2.5. The documentation can be found here: https://media.readthedocs.org/pdf/python-behave/stable/python-behave.pdf
Unfortunately, I am not able to get the "buffer" attribute populated on the object I instantiate from class behave.log_capture.LoggingCapture(config, level=None).
This is the kind of layout I have for my current BDD setup:
environment.py
import behave
from behave import log_capture
from behave.log_capture import *

def before_all(context):
    context.config.setup_logging()
    context.config.log_capture = True
    context.logging_level = "INFO"
    # create the LoggingCapture object.
    context.log_capture_data = log_capture.LoggingCapture(context.config)
    # ... rest of the code here.

# @capture
def before_scenario(context, scenario):
    # ... rest of the code here

def after_scenario(context, scenario):
    # ... rest of the code here

def after_all(context):
    # ... rest of the code here.
list_of_scenarios.feature
#... list of the scenarios.
# In one of the scenarios I need to access the "buffer" attribute which contains the captured data:
print("buffer contains {}".format(context.log_capture_data.buffer))
list_of_scenarios.py
#... implementation of steps for the scenarios
What am I missing? When I try to see the contents of the "buffer" attribute using print("buffer contains {}".format(context.log_capture_data.buffer)),
I get an empty list: buffer contains []
I do see the "INFO: ..." logging on the console; I believe this is captured on stdout. I am trying to get the same output in the "buffer" attribute.
Please note: I have set up behave on Linux, and I execute it from the command line using the command behave test_setup, where test_setup is a directory that houses environment.py, the features, and the step implementation files.
Thank you.
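One detail worth checking here, as a hedged sketch rather than a confirmed fix: in behave 1.2.5 LoggingCapture is a logging handler (it subclasses logging.handlers.BufferingHandler), so its buffer only fills up if the instance is actually attached to a logger. Attaching it to the root logger in before_all would look roughly like this:
import logging
from behave import log_capture

def before_all(context):
    context.config.setup_logging()
    context.log_capture_data = log_capture.LoggingCapture(context.config)
    # Attach the capture object as a handler so emitted log records land in
    # context.log_capture_data.buffer (assumption: LoggingCapture is a
    # logging.Handler subclass, so addHandler accepts it).
    logging.getLogger().addHandler(context.log_capture_data)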
I'm quite new to programming, and even more so when it comes to object-oriented programming. I'm trying to connect through Python to an external program (SAP2000, a structural engineering application). The program comes with an API for the connection, and there is an example in the help (http://docs.csiamerica.com/help-files/common-api(from-sap-and-csibridge)/Example_Code/Example_7_(Python).htm).
This works pretty well, but I would like to divide the code so that I can have one function for opening the program, several functions for working with it, and another one for closing it. This would give me the flexibility to make different calculations as desired and close it afterwards.
Here is the code I have so far, where enableloadcases() is a function that should operate once the instance is created.
import os
import sys
import comtypes.client
import pandas as pd

def openSAP2000(path, filename):
    ProgramPath = "C:\Program Files (x86)\Computers and Structures\SAP2000 20\SAP2000.exe"
    APIPath = path
    ModelPath = APIPath + os.sep + filename
    mySapObject = comtypes.client.GetActiveObject("CSI.SAP2000.API.SapObject")
    # start SAP2000 application
    mySapObject.ApplicationStart()
    # create SapModel object
    SapModel = mySapObject.SapModel
    # initialize model
    SapModel.InitializeNewModel()
    ret = SapModel.File.OpenFile(ModelPath)
    # run model (this will create the analysis model)
    ret = SapModel.Analyze.RunAnalysis()

def closeSAP2000():
    # ret = mySapObject.ApplicationExit(False)
    SapModel = None
    mySapObject = None

def enableloadcases(case_id):
    '''
    The function activates LoadCases for output
    '''
    ret = SapModel.Results.Setup.SetCaseSelectedForOutput(case_id)
From another module, I call the function openSAP2000() and it works fine, but when I call the function enableloadcases() I get the error AttributeError: type object 'SapModel' has no attribute 'Results'.
I believe this must be done by creating a class and then calling the functions inside it, but I honestly don't know how to do it.
Could you please help me?
Thank you very much.
Thank you very much for the help. I managed to solve the problem. It was as simple as marking the SapModel variable as global.
Now it works fine.
Thank you anyway.
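For reference, the class-based structure the question alludes to could look roughly like this. This is only a sketch: the SAP2000 API calls are the ones from the code above, the class and method names are made up, and the point is simply that SapModel becomes shared instance state instead of a local variable or a global:
import os
import comtypes.client

class Sap2000Session:
    """Wraps one running SAP2000 instance so all helper functions share the same SapModel."""

    def open(self, path, filename):
        model_path = path + os.sep + filename
        self.sap_object = comtypes.client.GetActiveObject("CSI.SAP2000.API.SapObject")
        self.sap_object.ApplicationStart()
        self.sap_model = self.sap_object.SapModel
        self.sap_model.InitializeNewModel()
        self.sap_model.File.OpenFile(model_path)
        self.sap_model.Analyze.RunAnalysis()

    def enableloadcases(self, case_id):
        # Works after open() because the shared SapModel is an instance attribute.
        return self.sap_model.Results.Setup.SetCaseSelectedForOutput(case_id)

    def close(self):
        # self.sap_object.ApplicationExit(False)  # commented out, as in the original code
        self.sap_model = None
        self.sap_object = None
Usage would then be: sess = Sap2000Session(); sess.open(path, filename); sess.enableloadcases(case_id); sess.close().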