I'm facing a task: using a ZeroMQ socket to send and receive base64 strings (generated from 800x600 images). Currently I'm using a PUB/SUB connection to perform this task. But it looks like the messages are so large that the socket can't transfer them immediately, and later messages get stuck in the network buffer. Although I don't want to lose so many messages, I have to restrict the HWM value so that the socket keeps working properly. So I have some questions:
Is there another, more effective library/way to perform my task? Or should I use another connection type that ZeroMQ provides (ROUTER/DEALER or REQ/REP)?
To transfer an image (processed by OpenCV), is there an approach I can use to minimize the size of the image being sent, other than converting it into base64 format?
If I must continue using the ZeroMQ PUB/SUB connection, how can I limit how long old messages are stored, rather than how many, e.g. to 3 minutes?
Here is my Python code for the sockets:
Publisher
import numpy as np
import zmq
import base64

context = zmq.Context()
footage_socket = context.socket(zmq.PUB)
footage_socket.setsockopt(zmq.SNDHWM, 1)
footage_socket.connect(<tcp address>)

def send_func(frame, camera_link):
    height, width, _ = frame.shape
    frame = np.ascontiguousarray(frame)
    # b64encode() returns bytes; decode to str so the dict is JSON-serialisable
    base64_img = base64.b64encode(frame).decode('ascii')
    data = {"camera_link": camera_link, "base64_img": base64_img,
            "img_width": width, "img_height": height}
    footage_socket.send_json(data)
Subscriber
import numpy as np
import zmq
import base64
import json

context = zmq.Context()
footage_socket = context.socket(zmq.SUB)
footage_socket.setsockopt(zmq.RCVHWM, 1)        # set options before bind so they take effect
footage_socket.setsockopt(zmq.SUBSCRIBE, b"")   # subscribe to everything, otherwise nothing arrives
footage_socket.bind(<tcp address>)

def rcv_func():
    while True:
        print("run socket")
        try:
            framex = footage_socket.recv_string()
            data = json.loads(framex)
            frame = data['base64_img']
            img = np.frombuffer(base64.b64decode(frame), np.uint8)
            img = img.reshape(int(data['img_height']), int(data['img_width']), 3)
        except Exception as e:
            print(e)
Before we start, let me take a few notes:
- avoid re-packing data into JSON if it is just for the ease of coding. JSON-re-serialised data "grow" in size without delivering a single bit of added value for ultra-fast & resources-efficient stream-processing. Professional systems "resort" to the JSON-format only if they have plenty of time and almost unlimited spare CPU-processing power to waste on re-packing the valuable data into just another box-of-data-inside-another-box-of-data. Where that is feasible, they can pay all the costs and inefficiencies - here, you get nothing in exchange for the spent CPU-clocks, more than double the RAM needed for the re-packing itself, and even larger data to transport
- review whether the camera indeed provides image-data that "deserves" to become 8-Byte / 64-bit "deep"; if not, you have the first remarkable image-data reduction free-of-charge
Using sys.getsizeof() may surprise you:
>>> aa = np.ones( 1000 )
>>> sys.getsizeof( aa )
8096 <---------------------------- 8096 [B] object here "contains"
>>> (lambda array2TEST: array2TEST.itemsize * array2TEST.size )( aa )
8000 <---------------------------- 8000 [B] of data
>>> bb = aa.view() # a similar effect happen in smart VECTORISED computing
>>> sys.getsizeof( bb )
96 <------------------------------ 96 [B] object here "contains"
>>> (lambda array2TEST: array2TEST.itemsize * array2TEST.size )( bb )
8000 <---------------------------- 8000 [B] of data
>>> bb.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False <-------------------------------||||||||||
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> bb.dtype
dtype('float64') <-------------- 8 [B] per image-pixel even for {1|0} B/W
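To see what that dtype note means in terms of frame sizes, here is a minimal sketch (plain NumPy, no camera involved; the 800x600x3 shape is taken from the question) comparing a default float64 frame with an 8-bit one:

import numpy as np

frame_f64 = np.ones((800, 600, 3))                # float64 by default
frame_u8  = frame_f64.astype(np.uint8)            # 8 bits per channel is plenty for RGB

print(frame_f64.itemsize * frame_f64.size / 1E6)  # ~11.52 [MB] per frame
print(frame_u8.itemsize  * frame_u8.size  / 1E6)  # ~ 1.44 [MB] per frame, 8x smaller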
Q : is there an approach I can use to minimize the size of the sending image...?
Yes, millions of [man * years] of R&D have already been spent on solving this problem, and the best-in-class methods for doing it are still evolving.
The best results, as anyone may have already expected, are needed for extreme corner-cases - for transporting satellite imagery from far away, deep in space, back home - like when JAXA was on its second asteroid rendezvous mission, this time visiting the Ryugu asteroid.
Your as-is code produces 800x600 image-frames at a so far unspecified fps-rate and colour-depth. A brief calculation shows how much data that can easily generate within those said -3-minutes-, if the process is not handled with more attention and due care:
>>> (lambda a2T: a2T.itemsize * a2T.size )( np.ones( ( 800, 600, 3 ) ) ) / 1E6
11.52 <---- each 800x600-RGB-FRAME handled/processed this way takes ~ 11.5 [MB]
#~30 fps ~345.6 [MB/s]
~ 62.2 [GB/3min]
Solution? Take inspiration from the Best-in-Class know-how :
There you have limited power ( both energy-wise and processing-wise - do not forget, the CPUs "inside" this satellite were already manufactured some 5 - 7 years before the Project launch - no one serious will dare to send a mission with bright and hot new, but unproven, COTS chips ), limited RAM ( again, the power plus weight limits, as the amount of fuel needed to lift off and fly "there" grows with every single gram of "The Useful Payload" ) and, last but not least, the most limiting factor - very limited means of R/F-COMMs - a so "loooooooong"-wire ( it takes almost half a day to get the first bit from "there" back "here", plus the same again if you try to ACK/NACK from "here", answering any remote request or requesting a re-send after an error was detected ). The current DSN effective-telemetry data transport-speeds are about 6.4 ~ 9.6 kbps ( yes, not more than about 7000 bits/sec ).
Here, the brightest minds have put all the art of the human intellect into making this happen:
ultimate means of image compression - never send a bit unless it is indeed vital & necessary
ultimate means of transcoded-image-data error self-correction - if anything is worth adding, plain error-detection alone is not ( you would have to wait almost a day to get the data "re-transmitted" again, hopefully without another error there ). Here we need a means of ( limited - see the costs of sending a single bit above, so this has to be a very economic add-on ) self-correction, which can indeed repair some limited scope of signal/data-transport errors that may appear, and do appear, while the R/F-COMMs signal travels from deep space back home. On larger errors, you have to wait a few days for a re-scheduled image-data error recovery, solved by another try to send a larger pack that was not recoverable from the "damaged" data by the capabilities engineered into the built-in error self-correction.
Where to start from?
If your use-case does not have a remarkable amount of "spare" CPU-power available ( and it is indeed needed to have pretty enough "free" CPU+RAM-resources to perform any such advanced image-data trans-coding & error-recovery re-processing, both in scale ( the volume of additional data for the trans-coding and re-processing - both of which come at sizes orders of magnitude larger than a single image-frame ) and in time ( the speed of that additional CPU-processing ) ), there is no magic trick to get the ultimate image-data compression and your story ends here.
If your use-case can spin up more CPU-power, your next enemy is time. Both the time to design a clever-enough image-processing pipeline and the time to process each image-frame, using your engineered image-data trans-coding, within a reasonably short amount of time before sending it over to the recipient end. The former is manageable by your Project-resources ( by finance - to get the right skilled engineers on board - and by the people who execute the actual design & engineering phase ). The latter is not manageable; it depends on your Project's needs - how fast ( fps ) and bearing what latency ( how late, in accumulated [ms] of delays ) your Project can still survive and perform the intended function.
python is an easy prototyping eco-system, yet once you need to boost the throughput ( ref. above ), this will most probably be the first blocker of the show going on ( 30+ years of experience make me awfully well confident in saying this ) - even if you pull in add-on steroids, like moving into cython + C-extensions, the whole circus gets indeed a bit, but only a bit, faster, at an immense add-on cost of having to acquire a new skill ( if not already on board - with an expensive learning-curve duration and growth in salaries for those well-skilled ) and to re-engineer and re-factor your so far well-prototyped code-base
OpenCV can and will provide you with some elementary image-manipulation tools to start from
image-data trans-coding and ordinary or ultimate data-compression have to follow, to reduce the data-size ( a minimal sketch follows right after this list )
ZeroMQ is the least problematic part - both performance-wise scalable and having unique low-latency throughput capabilities. Without any details, one may forget about the PUB/SUB, unless you keep any subscription-list processing prevented and avoided at all ( the costs of doing otherwise would cause immense side-effects in { central-node | network-dataflows + all remote-nodes }-overloads, having no practical effect for the intended fast and right-sized image-data pipeline-processing ).
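If a lossy codec is acceptable for your use-case, here is a minimal sketch of the OpenCV + ZeroMQ part (my illustration, not the code from the question or the answer above; the endpoint, HWM and JPEG quality are placeholders) that avoids both base64 and JSON by sending the compressed bytes as a raw multipart message:

import cv2
import zmq

context = zmq.Context()
pub = context.socket(zmq.PUB)
pub.setsockopt(zmq.SNDHWM, 30)                   # keep only a short backlog of frames
pub.connect("tcp://127.0.0.1:5555")              # placeholder endpoint

def send_frame(pub, frame, camera_link):
    # JPEG-compress the BGR frame; quality 80 is an illustrative trade-off
    ok, jpg = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), 80])
    if ok:
        # multipart: a small text header + the raw compressed payload - no base64, no JSON
        pub.send_multipart([camera_link.encode(), jpg.tobytes()])

# the SUB side would mirror this with:
#   link, payload = sub.recv_multipart()
#   frame = cv2.imdecode(np.frombuffer(payload, np.uint8), cv2.IMREAD_COLOR)

For an 800x600 frame this typically brings the payload from ~1.4 MB of raw pixels down to a few tens of kB, at the price of the encode/decode CPU time discussed above.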
Q : If I must continue using zmq pub/sub connection, how can I limit how long old messages are stored, not the number of them, e.g. to 3 minutes?
ZeroMQ is a smart tool, yet one has to understand its powers - ZeroCopy will help you keep a low-RAM-profile in production, yet if you plan to store -3-minutes- of image-data streaming, you will need both immense loads of RAM and CPU-power, and it all also heavily depends on the actual number of .recv()-ing peers.
ZeroMQ is a broker-less system, so you do not actually "store" messages; the .send()-method just tells the ZeroMQ infrastructure that the provided data is free to get sent, whenever the ZeroMQ infrastructure sees a chance to dispatch it to the designated peer-recipient ( be it locally, over the Atlantic, or over a satellite-connection ). This means the proper ZeroMQ configuration is a must, if you plan to have the sending/receiving sides ready to enqueue / transmit / receive / dequeue ~3-minutes of even the most compressed image-data stream(s), potentially multiples of that, in case a 1:many-party communication appears in production. There is no time-based queue limit you could set instead of the HWM, but a pragmatic, timestamp-based workaround is sketched right after the next paragraph.
Proper analysis and sound design decisions are the only chance for your Project to survive all these requirements, given the CPU, RAM and transport-means are a-priori known to be limited.
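Since ZeroMQ itself has no time-based retention setting, one common workaround (a sketch, not part of the original answer's code; it assumes sender and receiver clocks are reasonably in sync and that frames are sent as multipart messages) is to timestamp every frame on the sender and let the receiver discard anything older than the chosen window:

import time
import zmq

MAX_AGE_S = 180                                  # the 3-minute window from the question

# sender side would prepend a wall-clock timestamp to every frame it publishes:
#   pub.send_multipart([str(time.time()).encode(), payload])

def recv_fresh(sub):
    """Keep reading from the SUB socket, dropping frames older than MAX_AGE_S."""
    while True:
        ts_raw, payload = sub.recv_multipart()
        if time.time() - float(ts_raw) <= MAX_AGE_S:
            return payload                       # fresh enough - hand it to the pipeline
        # otherwise the frame is stale: silently drop it and read the next one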
Background
I am creating a window with pysdl2 and using SDL_BlitSurface to embed a skia-python surface inside this window, with the following code:
import skia
import sdl2 as sdl
from ctypes import byref as pointer
class Window:
DEFAULT_FLAGS = sdl.SDL_WINDOW_SHOWN
BYTE_ORDER = {
# ---------- -> RED GREEN BLUE ALPHA
"BIG_ENDIAN": (0xff000000, 0x00ff0000, 0x0000ff00, 0x000000ff),
"LIL_ENDIAN": (0x000000ff, 0x0000ff00, 0x00ff0000, 0xff000000)
}
PIXEL_DEPTH = 32 # BITS PER PIXEL
PIXEL_PITCH_FACTOR = 4 # Multiplied by Width to get BYTES PER ROW
def __init__(self, title, width, height, x=None, y=None, flags=None, handlers=None):
self.title = bytes(title, "utf8")
self.width = width
self.height = height
# Center Window By default
self.x, self.y = x, y
if x is None:
self.x = sdl.SDL_WINDOWPOS_CENTERED
if y is None:
self.y = sdl.SDL_WINDOWPOS_CENTERED
# Override flags
self.flags = flags
if flags is None:
self.flags = self.DEFAULT_FLAGS
# Handlers
self.handlers = handlers
if self.handlers is None:
self.handlers = {}
# SET RGBA MASKS BASED ON BYTE_ORDER
is_big_endian = sdl.SDL_BYTEORDER == sdl.SDL_BIG_ENDIAN
self.RGBA_MASKS = self.BYTE_ORDER["BIG_ENDIAN" if is_big_endian else "LIL_ENDIAN"]
# CALCULATE PIXEL PITCH
self.PIXEL_PITCH = self.PIXEL_PITCH_FACTOR * self.width
# SKIA INIT
self.skia_surface = self.__create_skia_surface()
# SDL INIT
sdl.SDL_Init(sdl.SDL_INIT_EVENTS) # INITIALIZE SDL EVENTS
self.sdl_window = self.__create_SDL_Window()
def __create_SDL_Window(self):
window = sdl.SDL_CreateWindow(
self.title,
self.x, self.y,
self.width, self.height,
self.flags
)
return window
def __create_skia_surface(self):
"""
Initializes the main skia surface that will be drawn upon,
creates a raster surface.
"""
surface_blueprint = skia.ImageInfo.Make(
self.width, self.height,
ct=skia.kRGBA_8888_ColorType,
at=skia.kUnpremul_AlphaType
)
# noinspection PyArgumentList
surface = skia.Surface.MakeRaster(surface_blueprint)
return surface
def __pixels_from_skia_surface(self):
"""
Converts Skia Surface into a bytes object containing pixel data
"""
image = self.skia_surface.makeImageSnapshot()
pixels = image.tobytes()
return pixels
def __transform_skia_surface_to_SDL_surface(self):
"""
Converts Skia Surface to an SDL surface by first converting
Skia Surface to Pixel Data using .__pixels_from_skia_surface()
"""
pixels = self.__pixels_from_skia_surface()
sdl_surface = sdl.SDL_CreateRGBSurfaceFrom(
pixels,
self.width, self.height,
self.PIXEL_DEPTH, self.PIXEL_PITCH,
*self.RGBA_MASKS
)
return sdl_surface
def update(self):
window_surface = sdl.SDL_GetWindowSurface(self.sdl_window) # the SDL surface associated with the window
transformed_skia_surface = self.__transform_skia_surface_to_SDL_surface()
# Transfer skia surface to SDL window's surface
sdl.SDL_BlitSurface(
transformed_skia_surface, None,
window_surface, None
)
# Update window with new copied data
sdl.SDL_UpdateWindowSurface(self.sdl_window)
def event_loop(self):
handled_events = self.handlers.keys()
event = sdl.SDL_Event()
while True:
sdl.SDL_WaitEvent(pointer(event))
if event.type == sdl.SDL_QUIT:
break
elif event.type in handled_events:
self.handlers[event.type](event)
if __name__ == "__main__":
skiaSDLWindow = Window("Browser Test", 500, 500, flags=sdl.SDL_WINDOW_SHOWN | sdl.SDL_WINDOW_RESIZABLE)
skiaSDLWindow.event_loop()
I monitor my CPU usage for the above code and it stays well below 20% with hardly any change in RAM usage.
Problem
The problem is that as soon as I make a window larger than 690 x 549 (or any other size where the product of width and height is the same) I get a segfault (core dumped), with CPU usage going up to 100% and no change in RAM usage.
What I have already tried/know
I know the fault is with SDL_BlitSurface, as reported by the faulthandler module in Python and the classic print("here") lines.
I am not familiar with languages like C, so from my basic understanding of a segfault I tried to match the size of the byte string returned by Window.__pixels_from_skia_surface (using sys.getsizeof) against C datatypes, to see if it was close to the size of any of them, because I suspected an overflow (forgive me if this is the stupidest debugging method you have ever seen). But the size didn't come close to any of the C datatypes.
As the SDL_CreateRGBSurfaceFrom documentation says, it doesn't allocate memory for the pixel data but takes an external memory buffer passed to it. While there's a benefit in having no copy operation at all, it has lifetime implications - note "you must free the surface before you free the pixel data".
Python tracks references for its objects and automatically destroys an object once its reference count reaches 0 (i.e. no references to that object are possible - delete it immediately). But neither SDL nor skia is a Python library, and whatever references they keep in their native code are not exposed to Python. So Python's automatic memory management doesn't help you here.
What's happening is: you get pixel data from skia as a bytes array (a Python object, automatically freed when no longer referenced), then pass it to SDL_CreateRGBSurfaceFrom (native code; Python doesn't know that it keeps an internal reference), and then your pixels go out of scope and Python deletes them. You have a surface, but SDL says that for the way you created it the pixels must not be destroyed (there are other ways, like SDL_CreateRGBSurface, that actually allocate their own memory). Then you try to blit it, and the surface still points to the location where the pixels were, but that array is no longer there.
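A minimal sketch of one way out of that lifetime trap (my illustration, not part of the original code; the _pixels_keepalive attribute is a name I made up): keep the pixel buffer referenced for as long as the SDL surface that borrows it is alive, and free the surface before letting go of the buffer:

    def __transform_skia_surface_to_SDL_surface(self):
        pixels = self.__pixels_from_skia_surface()
        # keep the buffer referenced on the instance, so Python cannot free it
        # while the SDL surface created below still points into it
        self._pixels_keepalive = pixels
        sdl_surface = sdl.SDL_CreateRGBSurfaceFrom(
            pixels,
            self.width, self.height,
            self.PIXEL_DEPTH, self.PIXEL_PITCH,
            *self.RGBA_MASKS
        )
        return sdl_surface

    # and in update(), release the borrowed surface with sdl.SDL_FreeSurface(...)
    # before self._pixels_keepalive is overwritten by the next frame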
[Everything that follows is an explanation of why exactly it didn't crash with smaller surface sizes, and that turned out to require far more words than I thought. Sorry. If you're not interested in that stuff, don't read any further.]
What happens next purely depends on the memory allocator used by Python. First, a segmentation fault is a critical signal sent by the operating system to your program, and it happens when you access memory pages in a way you're not supposed to - e.g. reading memory that has no mapped pages, or writing to pages that are mapped read-only. All that, and the way to map/unmap pages, is provided by your operating system kernel (e.g. in Linux it is handled by the mmap/munmap calls), but the OS kernel only operates at the page level; you can't request half a page, but you can have a large block backed by N pages. For most current operating systems, the minimal page size is 4kb; some OSes support 2Mb or even larger 'huge' pages.
So, you get a segmentation fault when your surface is larger, but don't get it when the surface is smaller. Meaning that for the larger surface your BlitSurface hits memory that is already unmapped, and the OS sends your program a polite "sorry, can't allow that, correct yourself immediately or you're going down". But when the surface is smaller, the memory the pixels were kept in is still mapped; that doesn't necessarily mean it still contains the same data (e.g. Python could have placed some other object there), but as far as the OS is concerned this memory region is still 'yours' to read. And the difference in that behaviour is indeed caused by the size of the allocated buffer (though of course you can't rely on that behaviour being kept on another OS, another Python version, or even with a different set of environment variables).
As I've said before, you can only mmap entire pages, but Python (that's just an example, as you'll see later) has a lot of smaller objects (integers, floats, smaller strings, short arrays, ...) that are much smaller than a page. Allocating an entire page for each of them would be a massive waste of memory (plus other problems, like reduced performance because of bad caching). To handle that, what we do ('we' being every single program that needs smaller allocations, i.e. 99% of the programs you use every day) is allocate a larger block of memory and track which parts of that block are allocated/freed entirely in userspace (as opposed to pages, which are tracked by the OS kernel - in kernelspace). That way you can pack small allocations very tightly without too much overhead, but the downside is that these allocations are not distinguishable at the OS level. When you 'free' some small allocation placed in that kind of pre-allocated block, you just internally mark the region as unused, and the next time some other part of your program requests memory you start searching for a place to put it. It also means you usually don't return (unmap) memory to the OS, as you can't give back a block if even one byte of it is still in use.
Python internally manages small objects (<512b) itself, by allocating 256kb blocks and placing objects into those blocks. If a larger allocation is required, it passes it to libc malloc (Python itself is written in C and uses libc; the most popular libc for Linux is glibc). And the malloc documentation for glibc says the following:
When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3)
So, allocations for larger objects should go through mmap/munmap, and freeing those pages should make them inaccessible (causing a segfault if you try to access them, instead of silently reading potentially garbage data; bonus points if you try to write into it - so-called memory stomping, overwriting something else, possibly even the internal libc markers it uses to track which memory is in use; anything could happen after that). While there is still a chance that the next mmap will randomly place the next page at the same address, I'm going to neglect that. Unfortunately this is very old documentation that, while it explains the basic intent, no longer reflects how glibc behaves nowadays. Take a look at this comment in the glibc source (emphasis is mine):
M_MMAP_THRESHOLD is the request size threshold for using mmap()
to service a request. Requests of at least this size that cannot be
allocated using already-existing space will be serviced via mmap.
(If enough normal freed space already exists it is used instead.)
...
The implementation works with a sliding threshold, which is by
default limited to go between 128Kb and 32Mb (64Mb for 64
bitmachines) and starts out at 128Kb as per the 2001 default.
...
The threshold goes up in value when the application frees
memory that was allocated with the mmap allocator. The idea is
that once the application starts freeing memory of a certain size,
it's highly probable that this is a size the application uses for
transient allocations.
So, it tries to adapt to your allocation behaviour to balance performance with releasing memory back to OS.
But different OSes behave differently, and even on Linux alone we have multiple libc implementations (e.g. musl) that implement malloc differently, and a lot of different memory allocators (jemalloc, tcmalloc, dlmalloc, you name it) that could be injected via LD_PRELOAD, so your program (e.g. Python itself in this case) would use a different allocator with different rules on mmap usage. There are even debug allocators that inject "guard" pages around every allocation, with no access rights at all (can't read, write or execute), to catch common memory-related programming mistakes, at the cost of massively larger memory usage.
To sum it up - you had a lifetime-management bug in your code, and unfortunately it didn't crash immediately due to the internals of the libc memory allocation scheme, but it did crash once the surface size got large enough for libc to allocate exclusive pages for that buffer. That is an unfortunate turn of events that languages without automatic memory management are exposed to, and by virtue of using Python C bindings your Python program is, to some extent, exposed as well.
The following dispatch() function receives messages through a queue.Queue and sends them, using a ZeroMQ PUSH socket, to an endpoint.
I want this function to exit once it receives None through the queue, but if the socket's underlying message buffer has any undelivered messages (the remote endpoint is down), the application won't terminate. Thus, once the function receives None, it closes the socket with a specified linger.
Using this approach, how can I detect whether the specified linger was reached or not? In particular, no exception is raised.
def dispatch(self):
context = zmq.Context()
socket = context.socket(zmq.PUSH)
poller = zmq.Poller()
socket.connect('tcp://127.0.0.1:5555')
poller.register(socket, zmq.POLLOUT)
while True:
try:
msg = self.dispatcher_queue.get(block=True, timeout=0.5)
except queue.Empty:
continue
if msg is None:
socket.close(linger=5000)
break
try:
socket.send_json(msg)
except Exception as exc:
raise common.exc.WatchdogException(
f'Failed to dispatch resource match to processor.\n{msg=}') from exc
Q : "How to detect whether linger was reached when closing socket using ZeroMQ?"
Well, not an easy thing.
ZeroMQ internally hides all these details from the user-level code, as the API was (since ever, till the recent v4.3) crafted with all the beauties of the art of the Zen-of-Zero, for the sake of maximum performance, almost linear scaling and minimum latency. Do Zero-steps that do not support ( the less if violate ) this.
There might be three principal directions of attack on solving this:
one may try to configure & use the event-observing overlay of zmq_socket_monitor() to analyse the actual sequence of events at the lowest Level-of-Detail achievable ( a sketch of this follows after the code example below )
one may also try a rather brute way - set an infinite LINGER attribute on the zmq.Socket()-instance & directly kill the blocking operation by sending a SIGNAL after a set amount of the (now) soft-"linger" has expired, be it using the new-in-v4.3+ features of zmq_timers ( a [ms]-coarse framework of timers / callback utilities ) or one's own
one may prefer to keep things clean and still meet the goal by "surrounding" a call to zmq_ctx_term(), which as per the v4.3 documented API will block ( be warned, that is not warranted to be so in other API versions, back & forth ). This way may help you indirectly detect the duration actually spent in the blocking-state, like :
...
NOMINAL_LINGER_ON_CLOSE = 5000
MASK_a = "INF: .term()-ed ASAP, after {0:} [us] from {1:} [ms] granted"
MASK_b = "INF: .term()-ed ALAP, after {0:} [us] from {1:} [ms] granted"
...
socket.setsockopt( zmq.LINGER, NOMINAL_LINGER_ON_CLOSE ) # ____ BE EXPLICIT, ALWAYS
aClk = zmq.Stopwatch()
aClk.start() #_________________________________________________ BoBlockingSECTION
context.term() # /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\ BLOCKING.......
_ = aClk.stop() #______________________________________________ EoBlockingSECTION
...
print( ( MASK_a if _ < ( NOMINAL_LINGER_ON_CLOSE * 1000 )
else MASK_b
).format( _, NOMINAL_LINGER_ON_CLOSE )
)
...
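For the first direction above, a minimal sketch of pyzmq's socket-monitor overlay (my illustration, assuming a pyzmq/libzmq build with monitoring support; the endpoint is a placeholder):

import zmq
from zmq.utils.monitor import recv_monitor_message

context = zmq.Context()
socket  = context.socket(zmq.PUSH)
monitor = socket.get_monitor_socket()          # inproc PAIR socket delivering transport events
socket.connect("tcp://127.0.0.1:5555")         # placeholder endpoint

def watch_events():
    """Print low-level transport events until the monitor stops."""
    while monitor.poll():
        evt = recv_monitor_message(monitor)    # dict with 'event', 'value', 'endpoint'
        print(evt)
        if evt['event'] == zmq.EVENT_MONITOR_STOPPED:
            break

Watching for events such as zmq.EVENT_DISCONNECTED or zmq.EVENT_CLOSED around the close gives an indirect picture of whether the peer ever came back before the linger ran out.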
My evangelisation of always, indeed ALWAYS, being rather EXPLICIT is based on having seen the creeping "defaults" that changed from version to version ( which it is fair to expect to continue the same way forth ). A responsible design shall, indeed ALWAYS, imperatively re-enter those very values that we want to be in place, as our code will survive both our current version & us, mortals, still having Zero-warranty of which part of our current assumptions will remain "defaults" in any future version / revision, and having the same uncertainty about what version-mix would be present in the domain of our deployed piece of code ( as of EoY-2020, there are still v2.1, v2.11, v3.x, v4.x running wild somewhere out there, so one never knows, do we? )
I am new to using ZeroMQ, so I am struggling with some code.
If I run the following code, no error is shown :
import zmq.asyncio
ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket(zmq.PULL)
rcv_socket.connect("ipc:///tmp/test")
rcv_socket.bind("ipc:///tmp/test")
But, if I try to use the function zmq_getsockopt(), it fails :
import zmq.asyncio
ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket(zmq.PULL)
rcv_socket.connect("ipc:///tmp/test")
socket_path = rcv_socket.getsockopt(zmq.LAST_ENDPOINT)
rcv_socket.bind("ipc://%s" % socket_path)
Then I get :
zmq.error.ZMQError: No such file or directory for ipc path "b'ipc:///tmp/test'".
new to using ZeroMQ, so I am struggling with some code.
Well, you will be way, way better off if you start by first understanding The Rules of the Game, rather than learning from crashes ( yes, on the very contrary to what "wannabe-evangelisation-gurus" pump into the crowds - that "just-coding" is enough - which it is not, for doing indeed serious business ).
This is why:
If you read the published API, it will still confuse you most of the time if you have no picture of the structure of the system & do not understand its internal and external behaviours ( the Framework's Rules of the Game ) :
The ZMQ_LAST_ENDPOINT option shall retrieve the last endpoint bound for TCP and IPC transports. The returned value will be a string in the form of a ZMQ DSN. Note that if the TCP host is INADDR_ANY, indicated by a *, then the returned address will be 0.0.0.0 (for IPv4).
This states the point, yet without knowing the concept, the point is still hidden from you.
The Best Next Step
If you are indeed serious about low-latency and distributed computing, the best next step, after reading the link above, is to stop coding and first take some time to read and understand the fabulous Pieter Hintjens' book "Code Connected, Volume 1" before going any further - definitely worth your time.
Then, you will see why this will never fly:
import zmq.asyncio; ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket( zmq.PULL )
rcv_socket.connect( "ipc:///tmp/test" )
socket_path = rcv_socket.getsockopt( zmq.LAST_ENDPOINT )
rcv_socket.bind( "ipc://%s" % socket_path )
whereas this one may ( even though no handling of the NULL-terminated character string is present here yet ... which is per se a sign of bad software design ) :
import zmq.asyncio; ctx = zmq.asyncio.Context()
rcv_socket = ctx.socket( zmq.PULL )
rcv_socket.bind( "ipc:///tmp/test" )
socket_path = rcv_socket.getsockopt( zmq.LAST_ENDPOINT )
rcv_socket.connect( "ipc://%s" % socket_path )
To compare the performance of Spark when using Python and Scala, I created the same job in both languages and compared the runtimes. I expected both jobs to take roughly the same amount of time, but the Python job took only 27 min, while the Scala job took 37 min (almost 40% longer!). I implemented the same job in Java as well and it took 37 minutes too. How is it possible that Python is so much faster?
Minimal verifiable example:
Python job:
# Configuration
conf = pyspark.SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
conf.set("spark.executor.instances", "4")
conf.set("spark.executor.cores", "8")
sc = pyspark.SparkContext(conf=conf)
# 960 Files from a public dataset in 2 batches
input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
# Count occurrences of a certain string
logData = sc.textFile(input_files)
logData2 = sc.textFile(input_files2)
a = logData.filter(lambda value: value.startswith('WARC-Type: response')).count()
b = logData2.filter(lambda value: value.startswith('WARC-Type: response')).count()
print(a, b)
Scala job:
// Configuration
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config)
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
// 960 Files from a public dataset in 2 batches
val input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
val input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"
// Count occurrences of a certain string
val logData1 = sc.textFile(input_files)
val logData2 = sc.textFile(input_files2)
val num1 = logData1.filter(line => line.startsWith("WARC-Type: response")).count()
val num2 = logData2.filter(line => line.startsWith("WARC-Type: response")).count()
println(s"Lines with a: $num1, Lines with b: $num2")
Just by looking at the code, they seem to be identical. I looked at the DAGs and they didn't provide any insights (or at least I lack the know-how to come up with an explanation based on them).
I would really appreciate any pointers.
Your basic assumption, that Scala or Java should be faster for this specific task, is just incorrect. You can easily verify it with minimal local applications. Scala one:
import scala.io.Source
import java.time.{Duration, Instant}
object App {
def main(args: Array[String]) {
val Array(filename, string) = args
val start = Instant.now()
Source
.fromFile(filename)
.getLines
.filter(line => line.startsWith(string))
.length
val stop = Instant.now()
val duration = Duration.between(start, stop).toMillis
println(s"${start},${stop},${duration}")
}
}
Python one
import datetime
import sys
if __name__ == "__main__":
_, filename, string = sys.argv
start = datetime.datetime.now()
with open(filename) as fr:
# Not idiomatic or the most efficient but that's what
# PySpark will use
sum(1 for _ in filter(lambda line: line.startswith(string), fr))
end = datetime.datetime.now()
duration = round((end - start).total_seconds() * 1000)
print(f"{start},{end},{duration}")
Results (300 repetitions each, Python 3.7.6, Scala 2.11.12), on Posts.xml from the hermeneutics.stackexchange.com data dump, with a mix of matching and non-matching patterns:
Python 273.50 (258.84, 288.16)
Scala 634.13 (533.81, 734.45)
As you see, Python is not only systematically faster, but also more consistent (lower spread).
The take-away message is ‒ don't believe unsubstantiated FUD ‒ languages can be faster or slower on specific tasks or in specific environments (for example, here Scala can be hit by JVM startup and / or GC and / or JIT), but if you see claims like "XYZ is 4x faster" or "XYZ is slow as compared to ZYX (..) approximately 10x slower", it usually means that someone wrote really bad code to test things.
Edit:
To address some concerns raised in the comments:
In the OP's code, data is passed mostly in one direction (JVM -> Python) and no real serialization is required (this specific path just passes the bytestring as-is and decodes it to UTF-8 on the other side). That's as cheap as it gets when it comes to "serialization".
What is passed back is just a single integer per partition, so in that direction the impact is negligible.
Communication is done over local sockets (all communication on the worker, beyond the initial connect and auth, is performed using the file descriptor returned from local_connect_and_auth, which is nothing other than a socket-associated file). Again, as cheap as it gets when it comes to communication between processes.
Considering the difference in raw performance shown above (much higher than what you see in your program), there is a lot of margin for the overheads listed above.
This case is completely different from cases where either simple or complex objects have to be passed to and from the Python interpreter in a form that is accessible to both parties as pickle-compatible dumps (the most notable examples include old-style UDFs and some parts of old-style MLLib).
Edit 2:
Since jasper-m was concerned about the startup cost here, one can easily prove that Python still has a significant advantage over Scala even if the input size is significantly increased.
Here are results for 2003360 lines / 5.6G (the same input, just duplicated multiple times, 30 repetitions), which far exceeds anything you can expect in a single Spark task.
Python 22809.57 (21466.26, 24152.87)
Scala 27315.28 (24367.24, 30263.31)
Please note non-overlapping confidence intervals.
Edit 3:
To address another comment from Jasper-M:
The bulk of all the processing is still happening inside a JVM in the Spark case.
That is simply incorrect in this particular case:
The job in question is a map job with a single global reduce, using PySpark RDDs.
PySpark RDDs (unlike, let's say, DataFrames) implement the bulk of their functionality natively in Python, with the exception of input, output and inter-node communication.
Since it is a single-stage job, and the final output is small enough to be ignored, the main responsibility of the JVM (if one were to nitpick, this is implemented mostly in Java, not Scala) is to invoke the Hadoop input format and push data through the socket file to Python.
The read part is identical for the JVM and Python APIs, so it can be considered a constant overhead. It also doesn't qualify as the bulk of the processing, even for a job as simple as this one.
The Scala job takes longer because it has a misconfiguration and, therefore, the Python and Scala jobs were provided with unequal resources.
There are two mistakes in the code:
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
sc.hadoopConfiguration.set("spark.executor.instances", "4") // LINE #4
sc.hadoopConfiguration.set("spark.executor.cores", "8") // LINE #5
LINE 1. Once this line has been executed, the resource configuration of the Spark job is already established and fixed. From this point on, there is no way to adjust anything - neither the number of executors nor the number of cores per executor.
LINES 4-5. sc.hadoopConfiguration is the wrong place to set any Spark configuration. It should be set in the config instance you pass to new SparkContext(config).
[ADDED]
Bearing the above in mind, I would propose to change the code of the Scala job to
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config) // LINE #1
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
and re-test it again. I bet the Scala version is going to be X times faster now.
I am fairly new to the field of IoT. I am setting up a sensor with a Teensy, reading its data and transmitting it over serial communication to a system where, using Python, I am reading the data and storing it in a database.
The problem I am facing is: when I check my program using the Arduino serial monitor I get an insane sample speed, like 10k readings done in 40 milliseconds, but when I try to read the same output using Python it doesn't even give me more than 1000 readings per second, and with the database code included it only reads about 200 samples per second. Is there any way I can increase this sample rate, or do I have to set any extra parameters for communication over serial?
Here is my code for the Teensy :
int i;
elapsedMillis sinceTest1;
void setup()
{
Serial.begin(2000000); // USB is always 12 Mbit/sec
i = 0;
delay(5000);
Serial.println("Setup Called");
Serial.flush();
}
void loop()
{
if (i == 0 || i == 500000)
{
Serial.println(sinceTest1);
}
Serial.println(i);
//Serial.println(Serial.baud());
i++;
}
For python :
import serial
import pymysql
from datetime import datetime
import time
import signal
import sys
class ReadLine:
def __init__(self, s):
self.buf = bytearray()
self.s = s
def readline(self):
i = self.buf.find(b"\n")
if i >= 0:
r = self.buf[:i+1]
self.buf = self.buf[i+1:]
return r
while True:
i = max(1, min(2048, self.s.in_waiting))
data = self.s.read(i)
i = data.find(b"\n")
if i >= 0:
r = self.buf + data[:i+1]
self.buf[0:] = data[i+1:]
return r
else:
self.buf.extend(data)
ser = serial.Serial(
    port='COM5',
    baudrate=2000000,
    #baudrate=9600,
    #parity=serial.PARITY_NONE,
    #stopbits=serial.STOPBITS_ONE,
    #bytesize=serial.EIGHTBITS,
    #timeout=0
)
print("connected to: " + ser.portstr)
count=1
#this will store the line
line = []
#database connection
connection = pymysql.connect(host="localhost", user="root", passwd="", database="tempDatabase")
cursor = connection.cursor()
checker = 0
rl = ReadLine(ser)
while True:
time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print(time)
print(checker)
print(rl.readline())
insert1 = ("INSERT INTO tempinfo(value,test,counter) VALUES('{}','{}','{}');".format(33.5, time,checker)) #.format(data[0])
insert2 = ("INSERT INTO urlsync(textvalue,sync) VALUES('http://www.myname.com/value.php?&value={}&time={}',0);".format(33.5,time)) #.format(data[0])
cursor.execute(insert1)
cursor.execute(insert2)
connection.commit()
checker += 1
connection.close()
time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print(time )
ser.close()
P.S.: 1000 samples per second is the rate I get when I am not using the database commands; including them, I get only around 250 samples per second.
Any help or suggestion is appreciated, thank you.
First off, great question. The issue you are facing is loaded with learning opportunities.
Let's go one by one:
-You are now in a position to understand the difference between a microcontroller and a computer. The microcontroller, in its most basic form (if you are running bare-metal code, even if it's not very efficient code, like on an Arduino), will do just one thing, and particularly when it's hardware-related (like reading from or writing to UARTs) it will do it very efficiently. On a desktop computer, on the other hand, you have layer upon layer of tasks running simultaneously (operating system background tasks, updating the screen, and whatnot). With so many things happening at the same time, and if you don't establish priorities, it will be very difficult to accurately predict what exactly will happen and when. So it's not only your Python code that is running; there will be many more things that come up and interrupt the flow of your user task. If you are hoping to read data from the UART buffer at a stable (or at least predictable) speed, that will never happen with the architecture you are using at the moment.
-Even if you manage to strip down your OS to the bare minimum, kill all processes, and go to a terminal with no graphics whatsoever... you still have to deal with the uncertainty of what you are doing in your own Python code (that's why you see better performance with the Arduino serial monitor, which does nothing other than remove data from the buffer). In your Python code, you are sequentially reading from the port, trying to find a particular character (line feed) and then attaching the data you read to a list. If you want to improve performance, you need to either just read data and store it for offline processing, or look at multithreading (if one thread of your program is dedicated only to reading from the buffer and you do further processing on a separate thread, you could improve the throughput significantly, particularly if you set priorities right - see the sketch after this list).
-Last, but actually most importantly, you should ask yourself: do I really need to read data from my sensor at 2 Mbps? If the answer is yes, and your sensor is not a video camera, I'm afraid you need to take a step back and look at the following concepts: sensor bandwidth and dynamic response. After you do that, the next question is: how fast is your sensor updating its output, and why? Is that update rate meaningful? I can give you a couple of references here. First, imagine you have a temperature sensor to read and record the temperature in an oven. Does it make sense to sample values from the sensor at 1 MHz (1 million readings per second) if the temperature in the oven is changing at a rate of 10 degrees C per minute, or even 100 degrees per second? Is your sensor even able to react that fast (that's where its dynamic response comes into play)? My guess: probably not. Many industrial devices integrate dozens of sensors to control critical processes and send all data through a 1.5 Mbps link (pretty standard for Profibus, for instance).
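If you do go down the multithreading route suggested above, here is a minimal sketch (my illustration; the port, baud rate and queue size are placeholders) where one thread does nothing but drain the serial buffer while the main thread parses and stores at its own pace:

import queue
import threading

import serial

raw_queue = queue.Queue(maxsize=10000)            # bounded, so a lagging consumer becomes visible

def reader(port="COM5", baudrate=2000000):
    """Do nothing but drain the serial buffer and hand raw chunks over."""
    ser = serial.Serial(port, baudrate)
    while True:
        chunk = ser.read(max(1, ser.in_waiting))  # grab whatever is already buffered
        raw_queue.put(chunk)

threading.Thread(target=reader, daemon=True).start()

while True:                                       # main thread: parse / insert at its own pace
    chunk = raw_queue.get()
    # ... split on b"\n" here, batch the database INSERTs, etc.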