I'm adopting Kafka and trying to understand how to monitor it (e.g. is it running out of memory for log storage). I see that it uses Yammer Metrics and exposes them via JMX - this apparently makes sense to people in Java land.
Is there an HTTP API I can build on? Or really any sort of relatively structured output at all?
You can use Yahoo's Kafka Manager to inspect cluster state. Download Kafka Manager from https://github.com/yahoo/kafka-manager.
Hope it helps.
You can use the Ankush Kafka monitoring tool for this:
https://github.com/impetus-opensource/ankush
You can get the latest release from the following link: https://github.com/impetus-opensource/ankush/releases.
You can create as well as monitor your clusters using the above tool.
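If you would rather build directly on the JMX metrics mentioned in the question, one option is to attach a Jolokia JVM agent to each broker, which exposes the JMX MBeans as a plain HTTP/JSON API you can poll from any language. A rough sketch, assuming the agent listens on port 8778; the host and the exact MBean name are assumptions to verify against your Kafka version:
import requests

# Jolokia exposes JMX MBeans over HTTP as JSON; read one broker metric.
JOLOKIA = "http://broker-host:8778/jolokia"  # assumed agent host/port
MBEAN = "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"

resp = requests.get(f"{JOLOKIA}/read/{MBEAN}")
resp.raise_for_status()
print(resp.json()["value"])  # attributes such as Count, OneMinuteRate, ...
This is not a full monitoring solution, but it gives you structured output you can build dashboards or alerts on without a Java toolchain.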
The project that I am working on is a bit confidential, but I will try to explain my issues and be as clear as possible because I need your opinion.
Project:
They asked me to set up a local ELK environment and to use Python scripts to communicate with this stack (ELK): to store data, retrieve it, analyse it, visualise it with Kibana, and finally make decisions based on that data (AI). So as you can see, it is a data engineering project with some AI for the decision-making process. The issues that I am facing are:
I don't know how to use Python to communicate with the stack, I didn't find resources about it
Since the data is confidential, how can I ensure a high level of security?
How many instances to use?
I am lost because I am new to ELK and my team is not dev-oriented
I am new to ELK, so please any advice would be really helpful!
I don't know how to use Python to communicate with the stack, I didn't find resources about it
For learning how to interact with your stack, use the official Elasticsearch Python client library.
You can install it with pip3 install elasticsearch, and the following links contain a wealth of tutorials on almost anything you would need to do.
https://kb.objectrocket.com/category/elasticsearch?filter=python
Suggest you start with these two:
https://kb.objectrocket.com/elasticsearch/how-to-parse-lines-in-a-text-file-and-index-as-elasticsearch-documents-using-python-641
https://kb.objectrocket.com/elasticsearch/how-to-query-elasticsearch-documents-in-python-268
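To give a feel for the client, here is a minimal sketch of indexing and querying a document, assuming a local node on http://localhost:9200 and the 8.x client; the index name and document fields are made-up examples:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust the URL/auth for your cluster

# index a document
es.index(index="demo-logs", document={"service": "app1", "message": "started"})

# query it back
resp = es.search(index="demo-logs", query={"match": {"service": "app1"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])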
Since the data is confidential, how can I ensure a high level of security?
You can mask the data or restrict index access.
https://www.elastic.co/guide/en/elasticsearch/reference/current/authorization.html
https://nl.devoteam.com/expert-view/field-level-security-and-data-masking-in-elasticsearch/
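As a rough sketch of the "restrict index access" option, a role with field-level security can be created through the security API. This assumes security is enabled and the 8.x Python client; the role, index and field names are made-up examples:
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "changeme"))

es.security.put_role(
    name="confidential_reader",
    indices=[{
        "names": ["confidential-*"],
        "privileges": ["read"],
        # field-level security: only these fields are visible to this role
        "field_security": {"grant": ["timestamp", "status"]},
    }],
)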
How many instances to use?
I am lost because I am new to ELK and my team is not dev-oriented
I suggest you start with 1 Elasticsearch node; if you're on AWS, use a t3a.large or equivalent and run Elasticsearch, Kibana and Logstash all on the same machine.
For setting it up: https://www.elastic.co/guide/en/elastic-stack-get-started/current/get-started-stack-docker.html#run-docker-secure
If you want to use Python as your integration tool with Elasticsearch, you can use the elasticsearch Python client.
Alternatively, you can use Python to produce your results and save them to a log file or insert them into a database, and then let Logstash pick the data up from there.
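A minimal sketch of the log-file option: write one JSON document per line and point a Logstash file input at it. The file path and field names are assumptions, and the matching Logstash pipeline config is not shown:
import json
import logging

logger = logging.getLogger("results")
handler = logging.FileHandler("/var/log/myapp/results.log")  # assumed path
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def publish_result(result: dict) -> None:
    # one JSON document per line is easy for Logstash's json codec to parse
    logger.info(json.dumps(result))

publish_result({"job": "nightly-analysis", "status": "ok", "rows": 1234})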
For security, the Elastic Stack has good built-in features, from API authorization and user authentication to cluster security; see Secure the Elastic Stack.
I just use 1 instance, but feel free to separate Elasticsearch, Kibana and Logstash (if you use it) onto different instances if you think you will need to, or you can use Docker to separate them.
Based on my experience, if you are going to load a lot of data in a short time, it would be wise to separate them so the processes don't interfere with each other.
I have a Kafka consumer in Python that processes log data (stack-trace analysis and automatic issue creation), and we are also using the ELK stack in parallel. Is there any possibility to stream logs to Python via ELK so we can get rid of Kafka? I have no experience with ELK and can't find anything about streaming from it. It seems I can only query the log data periodically, but that doesn't seem like a perfect solution.
No, you cannot stream data out of Elasticsearch on its own.
If your input is something else, you can use Logstash's various output plugins (or write your own) to push events to something a Python process can consume.
For example, the pipe, tcp, websocket/http, and exec output plugins are all generic enough to be used with any language.
However, Logstash does not persist events the way Kafka does, so if you want something that can handle back pressure and doesn't drop events, you'd keep Kafka around.
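To make the tcp option concrete, here is a minimal sketch of a Python listener for events pushed by a Logstash tcp output with a json_lines codec; the host, port and event fields are assumptions, and the Logstash output config itself is not shown:
import json
import socketserver

class LogstashHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # each event arrives as one JSON document per line (json_lines codec)
        for line in self.rfile:
            event = json.loads(line)
            print("received event:", event.get("message"))

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 5044), LogstashHandler) as server:
        server.serve_forever()
Remember that this listener gives you no buffering or replay; if it goes down, events sent in the meantime are lost, which is exactly the back-pressure gap mentioned above.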
I have a web server using NodeJS - Express and I have a Scikit-Learn (machine learning) model pickled (dumped) in the same machine.
What I need is to demonstrate the model by sending/receiving data between it and the server. I want to load the model on startup of the web server and keep "listening" for data inputs. When it receives data, it should execute a prediction and send the result back.
I am relatively new to Python. From what I've seen, I could use a "Child Process" to execute that. I also saw some modules that run a Python script from Node.
The problem is that I want to load the model once and keep it loaded for as long as the server is running. I don't want to reload the model on every request because of its size. What is the best way to do that?
The idea is to run everything on an AWS machine.
Thank you in advance.
My recommendation: write a simple Python web service (I personally recommend Flask) and deploy your ML model behind it. Then you can easily send requests to that Python web service from your Node back-end. You wouldn't have a problem with the initial model loading: it is done once at app startup, and then you're good to go.
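A minimal sketch of that idea, assuming the model was dumped with pickle to model.pkl; the route, port and feature shape are made-up examples:
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# loaded once, when the server starts, and reused for every request
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features)
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Your Node server then just makes an HTTP POST to /predict and gets JSON back.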
DO NOT GO FOR SCRIPT EXECUTIONS AND CHILD PROCESSES!!! I just wrote it in bold-italic all caps to be sure you wouldn't do that. Believe me... it can go very, very south, with zombie processes hanging around after job termination and other issues. Let's just say it's not the standard way to do this.
You also need to think about multi-request handling; I think Flask now handles that by default.
I am just giving you general hints because your problem has been generally introduced.
This gist shows how to use Messenger() to receive messages from an Azure Event Hub. It works great.
https://gist.github.com/tomconte/e2a4667185a9bf674f59
However, using this technique yields all messages in the Event Hub. I would like to read messages starting from an offset or timestamp (I don't care which). I can see in the Qpid docs how to set these options, but not when using Messenger().
Here's the relevant section in the Qpid docs:
https://qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/api/index.html
And a sample that shows how to use it:
https://qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/examples/selected_recv.py.html
Question: is it possible and if so how?
Based on my understanding, you want to read event data starting from an offset or timestamp on Event Hub. I reviewed the classes & methods of the Event Hub SDKs for C#/Java, and there is no supported way to consume event data from partitions starting at a given position. Apache Qpid is a library that supports the AMQP protocol for Java/C/Python, and Event Hub supports AMQP, but that does not mean Event Hub supports every feature of Qpid.
There are two workarounds for you:
Receive all messages from the Event Hub and filter out the ones you don't need (see the sketch after this list).
Use Azure Stream Analytics to create an output pipeline that writes the messages to another store, such as Table Storage or DocumentDB; you can then retrieve the data from that store using the offset/timestamp you need.
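A rough sketch of the first workaround, following the Messenger() pattern from the gist above and filtering client-side on the enqueued-time message annotation. The annotation key, its value format and the address are assumptions you should verify against your Event Hub:
from proton import Message, Messenger

CUTOFF_MS = 1490000000000  # only handle events enqueued after this point

messenger = Messenger()
messenger.start()
messenger.subscribe("amqps://<policy>:<key>@<namespace>.servicebus.windows.net/<eventhub>/ConsumerGroups/$Default/Partitions/0")

msg = Message()
while True:
    messenger.recv(1)              # block until at least one message arrives
    while messenger.incoming:
        messenger.get(msg)
        annotations = msg.annotations or {}
        enqueued = annotations.get("x-opt-enqueued-time")  # assumed key/format
        if enqueued is None or enqueued >= CUTOFF_MS:
            print(msg.body)        # process only messages after the cutoff
The obvious downside is that every old message still travels over the wire, which is why the Stream Analytics route can be more practical for large backlogs.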
I am toying around with home automation, and I am planning to use Azure Service Bus as my "core" for message handling. With the .NET SDKs everything works perfectly and is fast enough (milliseconds for send + receive). However, I am now using the "azure.servicebus" module with Python (Debian on a Raspberry Pi), and the receive_subscription_message call is far from fast: it varies between near-instant and lagging a minute behind.
My code is as follows:
from azure.servicebus import ServiceBusService, Message, Queue

# connect with a SAS policy that has Listen rights
bus_service = ServiceBusService(
    service_namespace='mynamespace',
    shared_access_key_name='Listener1',
    shared_access_key_value='...')

# receive one message from the subscription, then complete it
msg = bus_service.receive_subscription_message('messages', 'ListenerTest.py', peek_lock=True)
msg.delete()
I have toyed around with peek_lock True and False, but the behaviour is the same.
Has anyone else been able to get this stable / near instant?
Please make sure there are indeed messages in the subscription. Also be aware that the .NET SDK by default uses a Service Bus-specific protocol instead of HTTP, while the Python SDK uses HTTP polling (basically checking whether there are messages in the subscription once in a while). We can find brief info at https://github.com/Azure/azure-sdk-for-python/blob/master/doc/servicebus.rst:
ServiceBus Queues are an alternative to Storage Queues that might be useful in scenarios where more advanced messaging features are needed (larger message sizes, message ordering, single-operation destructive reads, scheduled delivery) using push-style delivery (using long polling).
Per my understanding, this might explain why you see messages received either instantly or up to a minute later. Given the behavior you described, you might want to use AMQP, which is based on a bi-directional TCP connection and thus does not require polling. To use AMQP, you may want to leverage the standard Proton-Python library; I'd suggest you check https://msdn.microsoft.com/en-us/library/azure/jj841070.aspx for a sample. But please note the tips from that article:
Note that at the time of this writing, the SSL support in Proton-C is
only available for Linux operating systems. Because Microsoft Azure
Service Bus requires the use of SSL, Proton-C (and the language
bindings) can only be used to access Microsoft Azure Service Bus from
Linux at this time. Work to enable Proton-C with SSL on Windows is
underway so check back frequently for updates.
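If you stay on the HTTP-based Python SDK for now, a continuous receive loop at least keeps the polling going without gaps. Here is a minimal sketch reusing only the calls from the question (the namespace and key are placeholders, and how much this helps depends on how the SDK's polling behaves):
from azure.servicebus import ServiceBusService

bus_service = ServiceBusService(
    service_namespace='mynamespace',
    shared_access_key_name='Listener1',
    shared_access_key_value='...')

while True:
    # each call is one HTTP poll; an empty body means nothing was available yet
    msg = bus_service.receive_subscription_message(
        'messages', 'ListenerTest.py', peek_lock=True)
    if msg is None or msg.body is None:
        continue
    print(msg.body)
    msg.delete()  # complete the peek-locked message so it is not redelivered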