How to broadcast a complex class object in pyspark across the clusters - python

Following is the code, where py_cpp_bind refers to a piece of code written in C++11 and bound to Python using Boost.Python (with pickling enabled). Initializing the object requires three arguments (filename, int, int). I want to broadcast this object across the cluster, since it is needed to perform a computation for each element.
However, on execution Apache Spark complains with:
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
... 15 more
Code:
from pyspark.serializers import BatchedSerializer, PickleSerializer
from pyspark import SparkContext, SparkConf
import py_cpp_bind

def populate_NL(bc_score, n, tk2):
    tk = [list(tk2[0]), tk2[1]]
    res = bc_score.value.score(tk[1], tk[0])
    return res

def main(n, sc):
    mscore = py_cpp_bind.score()
    # the following line constructs the object from the given arguments
    print(mscore.init("data/earthquake.csv", n, 4000))
    broadcastVar = sc.broadcast(mscore)
    C = [((0,), [1])]
    C = sc.parallelize(C).flatMap(lambda X: populate_NL(broadcastVar, n, X))
    print(C.collect())

if __name__ == "__main__":
    conf = SparkConf().setMaster("local[*]")
    conf = conf.setAppName("TEST")
    sc = SparkContext(conf=conf, serializer=PickleSerializer())
    n = 5
    main(n, sc)
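The EOFException here usually just means the Python worker process died; it does not say why. A common workaround when a C++-backed object may not pickle cleanly is to broadcast only the constructor arguments and build the object lazily on each executor. A minimal sketch of that pattern (hypothetical helper names; it assumes py_cpp_bind is importable on every worker node):

_local_score = None

def get_score(init_args):
    # Build the boost-python object at most once per Python worker process.
    global _local_score
    if _local_score is None:
        import py_cpp_bind
        _local_score = py_cpp_bind.score()
        _local_score.init(*init_args)
    return _local_score

def populate_NL(args_bc, n, tk2):
    tk = [list(tk2[0]), tk2[1]]
    return get_score(args_bc.value).score(tk[1], tk[0])

# in main(), instead of broadcasting the object itself:
#   args_bc = sc.broadcast(("data/earthquake.csv", n, 4000))
#   C = sc.parallelize(C).flatMap(lambda X: populate_NL(args_bc, n, X))

This avoids pickling the C++ object entirely; only the small tuple of arguments crosses the wire.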

Related

How to use multiprocessing in another code and function? [duplicate]

This question already has answers here:
What does if __name__ == "__main__": do?
(45 answers)
Closed 6 months ago.
I am trying to multiprocess cell detection via the deepcell package. Cell detection works fine on small images, but on bigger images it either fails or takes a really long time.
So I'm trying to cut the images into small patches, and then use multiprocessing to feed them to the cell detection.
I need to be able to run the pool_cell_detection() function from another module and get its return value (allPoints), whereas if I put it behind an if __name__ == '__main__': guard I cannot get the return value. Can you suggest how I can do this?
Here is my code1:
import numpy as np
import os
import matplotlib.pyplot as plt
import cv2 as cv
from multiprocessing import Pool
from deepcell.mesmer import Mesmer
import time

blevel_image = cv.imread("./images/blevel_eq_p.png", 0)
app = Mesmer()

def deepcell_detection(image0, mpp):
    print(type(image0))
    cv.imwrite("./images/image1.png", image0)
    image = np.stack((image0, image0), axis=-1)
    image = np.expand_dims(image, 0)
    labeled_image, coords = app.predict(image, image_mpp=mpp)
    print(len(coords))
    return coords

def pool_cell_detection(img_channel):
    blobs_log = []
    r, c = img_channel.shape[0:2]
    mpp = 2
    rstep = r // 10
    cstep = c // 10
    patches = []
    for i in range(10):
        for j in range(10):
            img_patch = img_channel[i*rstep:(i+1)*rstep, j*cstep:(j+1)*cstep]
            patches.append([img_patch, mpp])
    with Pool(4) as p:
        print("pooling")
        allPoints = p.starmap(deepcell_detection, patches)
    return allPoints

def main():
    allPoints = pool_cell_detection(blevel_image)

if __name__ == '__main__':
    main()
In code 2, I need something like the following:
import code1

def func_something():
    # many operations
    allPoints = code1.pool_cell_detection(blevel_image)
But I'm not sure how to write code 2 so that I can get allPoints.
As the multiprocessing documentation says, the entry point of a multiprocessing program must be wrapped in if __name__ == '__main__': when using the spawn start method, which is the only available option on Windows, which you're on.
Change the final invocation from
blobs_log = pool_cell_detection(blevel_image)
to e.g.
def main():
    blobs_log = pool_cell_detection(blevel_image)

if __name__ == '__main__':
    main()
Following up on the comment threads, all in all, you might have, let's say, multiprocessing_cell_detect.py:
import multiprocessing

import cv2 as cv
import numpy as np
from deepcell.mesmer import Mesmer

def deepcell_detection(image0, mpp):
    cv.imwrite("./images/image1.png", image0)
    image = np.stack((image0, image0), axis=-1)
    image = np.expand_dims(image, 0)
    labeled_image, coords = Mesmer().predict(image, image_mpp=mpp)
    return coords

def pool_cell_detection(img_channel):
    r, c = img_channel.shape[0:2]
    mpp = 2
    rstep = r // 10
    cstep = c // 10
    patches = []
    for i in range(10):
        for j in range(10):
            img_patch = img_channel[i * rstep:(i + 1) * rstep, j * cstep:(j + 1) * cstep]
            patches.append([img_patch, mpp])
    with multiprocessing.Pool(4) as p:
        # starmap unpacks each [img_patch, mpp] pair into the two arguments
        return p.starmap(deepcell_detection, patches)
and my_cell_program.py:
import cv2 as cv

from multiprocessing_cell_detect import pool_cell_detection

def main():
    blevel_image = cv.imread("./images/blevel_eq_p.png", 0)
    allPoints = pool_cell_detection(blevel_image)
    print(allPoints)

if __name__ == '__main__':
    main()
and you run python my_cell_program.py.
So long as the main entry point of the program is guarded, things will work. The module you import a multiprocessing thing from will not need to be guarded (unless you also wish to use that module stand-alone).

Understanding Spark broadcasting

I find it pretty confusing to use a broadcast variable inside a UDF from an imported function. Say I create a broadcast variable inside a function imported from the main file. It works if the UDF is defined inside that function (second_func) but not outside it (third_func).
Why is this happening?
Is it advisable to define UDFs inside the function that calls them?
# test_utils.py
from pyspark.sql import types as T
from pyspark.sql import functions as F

@F.udf(T.StringType())
def do_smth_out():
    return broadcasted.value["a"]

def second_func(spark, df):
    @F.udf(T.StringType())
    def do_smth_in():
        return broadcasted.value["a"]

    data = {"a": "c"}
    sc = spark.sparkContext
    broadcasted = sc.broadcast(data)
    return df.withColumn("a", do_smth_in())

def third_func(spark, df):
    data = {"a": "c"}
    sc = spark.sparkContext
    broadcasted = sc.broadcast(data)
    return df.withColumn("a", do_smth_out())
# main.py
from pyspark.sql import types as T
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from test_utils import second_func, third_func

@F.udf(T.StringType())
def do_smth():
    return broadcasted.value["a"]

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .getOrCreate()
    sc = spark.sparkContext

    columns = ["language", "users_count"]
    data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
    df = sc.parallelize(data).toDF(columns)

    broadcasted = sc.broadcast({"a": "c"})

    print("First trial")
    df.withColumn("a", do_smth()).show()
    # Works

    print("Second trial")
    second_func(spark, df).show()
    # Works

    print("Third trial")
    third_func(spark, df).show()
    # Doesn't work
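No answer is recorded here, but the behaviour is consistent with plain Python name resolution: do_smth_out looks up broadcasted in test_utils' module globals, where it is never assigned; do_smth_in closes over the broadcasted local to second_func; and do_smth in main.py finds main.py's own module-level broadcasted. One common pattern (a sketch, not from the thread) is to pass the broadcast variable into a UDF factory explicitly:

# test_utils.py (sketch): make the dependency explicit instead of relying on scope
from pyspark.sql import types as T
from pyspark.sql import functions as F

def make_do_smth(broadcasted):
    @F.udf(T.StringType())
    def do_smth():
        # resolved on the executors through this factory's closure
        return broadcasted.value["a"]
    return do_smth

def third_func(spark, df):
    broadcasted = spark.sparkContext.broadcast({"a": "c"})
    return df.withColumn("a", make_do_smth(broadcasted)())

On this reading, UDFs do not have to be defined inside the function that calls them, as long as every name they close over is actually bound in an enclosing scope.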

Running DESeq2 from rpy2

I am trying to run DESeq2 through rpy2 for the first time and am having some difficulties.
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects import Formula
from rpy2.robjects.packages import importr

deseq = importr('DESeq2')

class py_DESeq2:
    def __init__(self, count_matrix):
        self.dds = None
        self.normed_count_matrix = None
        self.vsd = None
        self.count_matrix = robjects.conversion.py2rpy(count_matrix)
        self.design_matrix = robjects.conversion.py2rpy(
            pd.DataFrame({'treatment': ['ctrl' for i in range(count_matrix.shape[1])]}))
        self.design_formula = Formula('~ 1')

    def norm_counts(self, **kwargs):
        self.dds = deseq.DESeqDataSetFromMatrix(countData=self.count_matrix,
                                                colData=self.design_matrix,
                                                design=self.design_formula)
        self.vsd = deseq.varianceStabilizingTransformation(self.dds, blind=True)
        self.normed_count_matrix = deseq.assay(self.vsd)
        self.normed_count_matrix = to_dataframe(self.normed_count_matrix)  # to_dataframe: helper defined elsewhere
        self.normed_count_matrix = robjects.conversion.rpy2py(self.normed_count_matrix)
I get the following error at self.normed_count_matrix = deseq.assay(self.vsd):
module 'DESeq2' has no attribute 'assay'
The below code in R runs fine:
library(DESeq2)
countData <- read.delim("0.333404867983521.R.data.in.txt")
colData <- read.delim("0.333404867983521.R.groups.in.txt")
dds <- DESeqDataSetFromMatrix(countData, colData,design=~Treatment,tidy=TRUE)
norm <- varianceStabilizingTransformation(dds,blind=TRUE)
norm_matrix <- assay(norm)
norm_df <- data.frame(Gene=rownames(norm_matrix), norm_matrix)
write.table(norm_df, "0.333404867983521.R.data.out.txt", row.names = FALSE,sep="\t")
The norm object is a <class 'rpy2.robjects.methods.RS4'>.
There must be something I am missing here and a point in the right direction would be appreciated!
If you open R and type:
library(DESeq2)
assay
you will see that the assay function does not actually come from DESeq2 but from one of its dependencies, SummarizedExperiment:
> assay
standardGeneric for "assay" defined from package "SummarizedExperiment"
function (x, i, withDimnames = TRUE, ...)
standardGeneric("assay")
<bytecode: 0x5586a354db90>
<environment: 0x5586a3535e20>
Methods may be defined for arguments: x, i
Use showMethods("assay") for currently available ones.
You can confirm that assay is not part of DESeq2 by using explicit namespaces in R:
> DESeq2::assay
Error: 'assay' is not an exported object from 'namespace:DESeq2'
and to confirm that it is indeed a part of SummarizedExperiment:
> SummarizedExperiment::assay
standardGeneric for "assay" defined from package "SummarizedExperiment"
Therefore in rpy2 you can use it like this:
from rpy2.robjects.packages import importr
summarized_experiment = importr('SummarizedExperiment')
summarized_experiment.assay(self.vsd)
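Folded back into the question's class, norm_counts would then call assay through the SummarizedExperiment handle rather than the DESeq2 one; a minimal sketch under the same assumptions as the snippet above:

from rpy2.robjects.packages import importr

deseq = importr('DESeq2')
summarized_experiment = importr('SummarizedExperiment')

def norm_counts(self, **kwargs):
    self.dds = deseq.DESeqDataSetFromMatrix(countData=self.count_matrix,
                                            colData=self.design_matrix,
                                            design=self.design_formula)
    self.vsd = deseq.varianceStabilizingTransformation(self.dds, blind=True)
    # assay() is exported by SummarizedExperiment, not DESeq2
    self.normed_count_matrix = summarized_experiment.assay(self.vsd)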

Python: correct format on input to datagridview

I want to add rows to a datagridview "manually". I tried converting the following code to Python: https://learn.microsoft.com/en-us/dotnet/framework/winforms/controls/how-to-manipulate-rows-in-the-windows-forms-datagridview-control
However, I struggle with adding rows. The following doesn't work:
for j in range(len(signals)):
    self._dataGridView1.Rows.Add(signals[j])
The following code does work, but it is not dynamic enough, as I don't know how many elements there will be:
for j in range(len(signals)):
    self._dataGridView1.Rows.Add(signals[j][0], signals[j][1], signals[j][2], signals[j][3])
How should I fix this? I tried a tuple, but the result was a tuple with all the info shown in the first cell instead of spread over the columns.
I would rather not add packages, as this is to be run within Revit Dynamo among several users, and I cannot convince everyone to install packages.
Full code for context:
import clr
clr.AddReference('System.Windows.Forms')
clr.AddReference('System.Drawing')
clr.AddReference('System.Data')
clr.AddReference('RevitAPIUI')

from Autodesk.Revit.UI import TaskDialog
from System.Windows.Forms import *
from System.Drawing import (
    Point, Size,
    Font, FontStyle,
    GraphicsUnit
)
from System.Data import DataSet
from System.Data.Odbc import OdbcConnection, OdbcDataAdapter

msgBox = TaskDialog

headers = IN[0]
signals = IN[1]

class DataGridViewQueryForm(Form):
    def __init__(self):
        self.Text = 'Signals'
        self.ClientSize = Size(942, 255)
        self.MinimumSize = Size(500, 200)
        self.setupDataGridView()

    def setupDataGridView(self):
        self._dataGridView1 = DataGridView()
        self._dataGridView1.AllowUserToOrderColumns = True
        self._dataGridView1.ColumnHeadersHeightSizeMode = DataGridViewColumnHeadersHeightSizeMode.AutoSize
        self._dataGridView1.Dock = DockStyle.Fill
        self._dataGridView1.Location = Point(0, 111)
        self._dataGridView1.Size = Size(506, 273)
        self._dataGridView1.TabIndex = 3
        self._dataGridView1.ColumnCount = len(headers)
        self._dataGridView1.ColumnHeadersVisible = True
        for i in range(len(headers)):
            self._dataGridView1.Columns[i].Name = headers[i]
        for j in range(len(signals)):
            self._dataGridView1.Rows.Add(signals[j][0], signals[j][1], signals[j][2], signals[j][3])
        self.Controls.Add(self._dataGridView1)

Application.Run(DataGridViewQueryForm())
Figured it out. Had to use System.Array.
from System import Array
code changes:
array_str = Array.CreateInstance(str, len(headers))
for j in range(len(signals)):
    for k in range(len(headers)):
        array_str[k] = signals[j][k]
    self._dataGridView1.Rows.Add(array_str)
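The likely reason this works: Rows.Add has a params object[] overload, and a .NET array binds to it element by element, whereas a Python list or tuple is passed as one value and lands in a single cell. Under IronPython, the generic array constructor is a slightly shorter sketch of the same idea (assuming all cells can be represented as strings):

from System import Array

for row in signals:
    # Array[str](...) builds a System.String[] from a Python sequence,
    # which matches the Rows.Add(params object[]) overload
    self._dataGridView1.Rows.Add(Array[str]([str(v) for v in row]))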

How to properly get table-row indexes and named values from trapped var-binds in pysnmp

I'm trying to keep my code as clean as possible but I'm not completely satisfied with what I achieved so far.
I built an SNMP manager which receives traps from another device using a custom MIB, which I will refer to as MY-MIB.
I am not sure this is the cleanest way, but essentially I have:
from pysnmp.entity import engine, config
from pysnmp.carrier.asynsock.dgram import udp
from pysnmp.entity.rfc3413 import ntfrcv, context
from pysnmp.smi import builder, rfc1902
from pysnmp.smi.view import MibViewController
from pysnmp.entity.rfc3413 import mibvar

_snmp_engine = engine.SnmpEngine()
_snmpContext = context.SnmpContext(_snmp_engine)
_mibBuilder = _snmpContext.getMibInstrum().getMibBuilder()

# Add local path where MY-MIB is located
_mibSources = _mibBuilder.getMibSources() + (builder.DirMibSource('.'),)
_mibBuilder.setMibSources(*_mibSources)
_mibBuilder.loadModules('MY-MIB')
_view_controller = MibViewController(_mibBuilder)

def my_callback_trap_processor(snmp_engine, state_reference,
                               context_id, context_name, var_binds, ctx):
    ...  # CALLBACK CODE

config.addV1System(_snmp_engine, 'my-area', 'MYCOMMUNITY')
config.addTargetParams(_snmp_engine, 'my-creds', 'my-area',
                       'noAuthNoPriv', 1)
config.addSocketTransport(_snmp_engine,
                          udp.domainName + (1,),
                          udp.UdpTransport().openServerMode((IP_ADDRESS, PORT)))

ntfrcv.NotificationReceiver(_snmp_engine, my_callback_trap_processor)
_snmp_engine.transportDispatcher.jobStarted(1)
try:
    _snmp_engine.transportDispatcher.runDispatcher()
except:
    _snmp_engine.transportDispatcher.closeDispatcher()
    raise
In the callback function above I can get a pretty intelligible print by just using the following code:
varBinds = [rfc1902.ObjectType(rfc1902.ObjectIdentity(x[0]), x[1]).resolveWithMib(_view_controller)
            for x in var_binds]
for varBind in varBinds:
    print(varBind.prettyPrint())
which, from a given trap that I receive, gives me:
SNMPv2-MIB::sysUpTime.0 = 0
SNMPv2-MIB::snmpTrapOID.0 = MY-MIB::myNotificationType
MY-MIB::myReplyKey.47746."ABC" = 0x00000000000000000000000000000000000
MY-MIB::myTime.0 = 20171115131544Z
MY-MIB::myOperationMode.0 = 'standalone'
Nice. But I want to manipulate/dissect each bit of information from the given var-binds in a higher-level way.
Digging into the innards of the library, I was able to gather this code:
for varBind in var_binds:
    objct = rfc1902.ObjectIdentity(varBind[0]).resolveWithMib(_view_controller)
    (symName, modName), indices = mibvar.oidToMibName(
        _view_controller, objct.getOid()
    )
    print(symName, modName, indices, varBind[1])
that gives me:
sysUpTime SNMPv2-MIB (Integer(0),) 0
snmpTrapOID SNMPv2-MIB (Integer(0),) 1.3.6.1.X.Y.Z.A.B.C.D
myReplyKey MY-MIB (myTimeStamp(47746), myName(b'X00080')) 0x00000000000000000000000000000000000
myTime MY-MIB (Integer(0),) 20171115131544Z
myOperationMode MY-MIB (Integer(0),) 1
and in the case of the myReplyKey indices I can just do:
for idx in indices:
    try:
        print(idx.getValue())
    except AttributeError:
        print(int(idx))
But in the case of the myOperationMode var-bind, how do I get the named value 'standalone' instead of 1? And how do I get the names of the indices (myTimeStamp and myName)?
Update:
After Ilya's suggestions I researched the library a bit more to get the namedValues and also used some Python hacking to get what I was looking for on the indices.
resolved_var_binds = [rfc1902.ObjectType(rfc1902.ObjectIdentity(x[0]), x[1]).resolveWithMib(_view_controller)
                      for x in var_binds]
processed_var_binds = []
for var_bind in resolved_var_binds:
    object_identity, object_value = var_bind
    mod_name, var_name, indices = object_identity.getMibSymbol()
    var_bind_dict = {'mib': mod_name, 'name': var_name, 'indices': {}}
    for idx in indices:
        try:
            value = idx.getValue()
        except AttributeError:
            var_bind_dict['indices'] = int(idx.prettyPrint())
        else:
            var_bind_dict['indices'][type(value).__name__] = str(value)
    try:
        var_bind_dict['value'] = object_value.namedValues[object_value]
    except (AttributeError, KeyError):
        try:
            var_bind_dict['value'] = int(object_value.prettyPrint())
        except ValueError:
            var_bind_dict['value'] = object_value.prettyPrint()
    processed_var_binds.append(var_bind_dict)
To resolve SNMP PDU var-bindings against a MIB you can use this snippet, which I think you have done already:
from pysnmp.smi.rfc1902 import *
var_binds = [ObjectType(ObjectIdentity(x[0]), x[1]).resolveWithMib(mibViewController)
             for x in var_binds]
By this point you have a list of rfc1902.ObjectType objects. The ObjectType instance mimics a two-element tuple: ObjectIdentity and SNMP value object.
var_bind = var_binds[0]
object_identity, object_value = var_bind
Now, getMibSymbol() will give you the MIB name, the MIB object name, and the tuple of indices made up from the trailing part of the OID. Index elements are SNMP value objects just like object_value:
>>> object_identity.getMibSymbol()
('SNMPv2-MIB', 'sysDescr', (0,))
The enumeration, should it be present, is reported by .prettyPrint():
>>> from pysnmp.proto.rfc1902 import *
>>> Error = Integer.withNamedValues(**{'disk-full': 1, 'no-disk': -1})
>>> error = Error(1)
>>> error.prettyPrint()
'disk-full'
>>> int(error)
1
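Applied to the question's trap handler, the label should therefore come straight off the MIB-resolved value object; a short sketch reusing the question's variables:

resolved = [rfc1902.ObjectType(rfc1902.ObjectIdentity(x[0]), x[1]).resolveWithMib(_view_controller)
            for x in var_binds]
for object_identity, object_value in resolved:
    mod_name, var_name, indices = object_identity.getMibSymbol()
    # for an enumerated INTEGER such as myOperationMode, prettyPrint()
    # yields the label ('standalone') while int() yields the number (1)
    print(mod_name, var_name,
          [idx.prettyPrint() for idx in indices],
          object_value.prettyPrint())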
