Connection between Kafka and Elasticsearch using Python

I'm new to Kafka and Elasticsearch. I've been trying to use Elasticsearch but I've run into some problems. I've written a Docker Compose file with all the images needed to build the environment. Using Kafka, I produce data into a specific topic, and then I need to take the data from a Kafka consumer into a pub/sub system that sends it on for ingestion into Elasticsearch.
I implemented all of this in Python. Elasticsearch responds on its port on localhost, but the Kibana page shows the following message instead:
Kibana server is not ready yet
The Python consumer that reads data from the topic looks something like this:
from kafka import KafkaConsumer
# Import sys module
import sys
# Import json module to deserialize data
import json
# Initialize consumer variable and set property for JSON decode
consumer = KafkaConsumer('JSONtopic', bootstrap_servers=['localhost:9092'],
                         value_deserializer=lambda m: json.loads(m.decode('utf-8')))
for message in consumer:
    print("Consumer records:\n")
    print(message)
    print("\nReading from JSON data\n")
    # message.value is the deserialized JSON payload (same as message[6])
    print("Name:", message.value['name'])
    print("Email:", message.value['email'])
    # Terminate the script after the first record
    sys.exit()
The goal is to use Elasticsearch for analysis, so I need it as the backend and Kibana to visualize the data. I would also really appreciate a tutorial that explains how to link these pieces together.
(P.S. Data flows from one topic to another without problems; the difficulty is getting this data into Elasticsearch and visualizing it.)

If you're pushing data from Kafka to Elasticsearch, doing it with the Consumer API is typically not a good idea, since existing tools do it much better and handle more functionality.
For example:
Kafka Connect (e.g. 🎥 https://rmoff.dev/kafka-elasticsearch-video)
Logstash
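As a rough sketch of the Kafka Connect route, the snippet below builds the JSON body for registering an Elasticsearch sink connector through the Kafka Connect REST API. It assumes the Confluent Elasticsearch sink connector is installed on the Connect worker, that Connect listens on localhost:8083 and Elasticsearch on localhost:9200, and that the connector name es-sink is acceptable; only the topic name comes from the question, so adjust the rest to your Docker Compose setup.

```python
import json
import urllib.request


def build_sink_config(topic, es_url="http://localhost:9200", name="es-sink"):
    """Return the request body for POST /connectors (names are illustrative)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
            "topics": topic,
            "connection.url": es_url,
            # The question's messages are plain JSON without keys or schemas.
            "key.ignore": "true",
            "schema.ignore": "true",
            "value.converter": "org.apache.kafka.connect.json.JsonConverter",
            "value.converter.schemas.enable": "false",
        },
    }


def register_connector(payload, connect_url="http://localhost:8083"):
    """POST the payload to a Kafka Connect worker (assumed address)."""
    request = urllib.request.Request(
        connect_url + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_sink_config("JSONtopic")
# register_connector(payload)  # uncomment once Kafka Connect is running
```

With a connector like this in place, Kafka Connect handles offsets, retries and batching, so no consumer loop is needed in Python.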

Related

How to stream logs from elk stack to python

I have a Kafka consumer in Python that processes log data (stack trace analysis and automatic issue creation), and we are also using the ELK stack in parallel. Is there any way to stream logs to Python via ELK so we can get rid of Kafka? I have no experience with ELK and can't find anything about streaming from it. It seems I can only query the log data periodically, which doesn't seem like a good solution.
No, you cannot stream data out of Elasticsearch on its own.
If your input is something else, you can use one of Logstash's various output plugins (or write your own) to send events to something a Python library can read.
For example, the pipe, tcp, websocket/http, and exec plugins are all generic enough to be used with any language.
However, Logstash does not persist events like Kafka does, so if you want something that can handle back pressure and doesn't drop events, you'd keep Kafka around.
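To illustrate the tcp option, here is a minimal sketch of a Python receiver built on the standard library. It assumes the Logstash tcp output is configured to connect to this process and to emit newline-delimited JSON (for example with the json_lines codec); the port 5000 and the "message" field are assumptions, not something from the question.

```python
import io
import json
import socketserver


def iter_events(stream):
    """Yield one dict per non-empty newline-delimited JSON line."""
    for raw_line in stream:
        line = raw_line.strip()
        if line:
            yield json.loads(line.decode("utf-8"))


class LogstashHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # self.rfile is a file-like object over the TCP connection.
        for event in iter_events(self.rfile):
            print(event.get("message"))


def serve(host="0.0.0.0", port=5000):
    # Port 5000 is an arbitrary choice; match it in the Logstash output.
    with socketserver.TCPServer((host, port), LogstashHandler) as server:
        server.serve_forever()


# Offline check of the parser with two fake events:
sample = io.BytesIO(b'{"message": "first"}\n\n{"message": "second"}\n')
events = list(iter_events(sample))
```

Calling serve() starts the listener. As the answer notes, nothing here buffers events, so anything Logstash sends while the receiver is down may be lost.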

What is the best approach to getting data into S3 for Elasticsearch and RabbitMQ?

In my company we developed a few games; some of them send their events to Elasticsearch and others to RabbitMQ. We have a local CLI which grabs the data from both and compiles the messages into compressed (gzip) JSON files, after which another CLI converts them to SQL statements and loads them into a local SQL Server. We now want to scale up, but the current setup is painful and nowhere near real-time for analysis.
I've recently built an application in Python which I was planning to publish to a Docker container in AWS. The script grabs data from Elasticsearch, compiles it into small compressed JSON files and publishes them to an S3 bucket. From there the data is ingested into Snowflake for analysis. So far I've been able to get the data in quite quickly, and it looks promising as an alternative.
I was planning to do something similar with RabbitMQ, but I wanted to find an even better alternative that would let this ingestion happen seamlessly and spare me from implementing all sorts of exception handling in the Python code.
I've researched a bit and found there might be a way to link RabbitMQ to Amazon Kinesis Firehose. My question would be: How would I send the stream from RabbitMQ to Kinesis?
For Elasticsearch, what is the best way to achieve this? I've read about the logstash plugin for S3 (https://www.elastic.co/guide/en/logstash/current/plugins-outputs-s3.html) and about logstash plugin for kinesis (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kinesis.html). Which approach would be ideal for real-time ingestion?
My answer is very theoretical and needs to be tested in the real world and adapted to your use case.
For near-real-time behaviour, I would use Logstash with:
an Elasticsearch input and a short cron schedule (this post can help: https://serverfault.com/questions/946237/logstashs-elasticsearch-input-plugin-should-be-used-to-output-to-elasticsearch)
an S3 output (supports gzip)
maybe a JDBC output to your DB
a RabbitMQ output plugin
You can build a more scalable architecture by outputting to RabbitMQ and using other pipelines to listen to the queue and execute other tasks:
Logstash: ES -> RabbitMQ
Logstash: RabbitMQ -> SQL
Logstash: RabbitMQ -> Kinesis
Logstash: RabbitMQ -> AWS
etc.
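For comparison with the Logstash S3 output, the batching step the question describes (small gzip-compressed JSON files pushed to S3) could be sketched in Python roughly as follows. The event fields, bucket and key are hypothetical, and the upload helper assumes boto3 is installed and AWS credentials are configured.

```python
import gzip
import json


def compress_batch(events):
    """Serialize events as newline-delimited JSON and gzip the result."""
    lines = "\n".join(json.dumps(event, sort_keys=True) for event in events)
    return gzip.compress(lines.encode("utf-8"))


def decompress_batch(blob):
    """Inverse of compress_batch, handy for verifying an object."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


def upload_batch(blob, bucket, key):
    """Upload to S3 with boto3 (bucket and key are hypothetical)."""
    import boto3  # imported lazily so the helpers above work without it

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=blob)


# Made-up events, only to show the round trip:
batch = [{"game": "demo", "event": "login"}, {"game": "demo", "event": "logout"}]
blob = compress_batch(batch)
```

Newline-delimited JSON is used here because it can be appended to and split without parsing the whole file; whether it suits your Snowflake ingestion is something to verify.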

Sending Data using Python

I have a Python application which analyzes data from multiple sources in real time. Once the data is analyzed, the result is stored in a database along with a timestamp of when it was analyzed.
I would like to access the most recent result of this program remotely from another computer.
I was thinking about using python sockets and having a server script running on the main computer which runs the application and then that way I can access the data using a client script on another computer.
Is there a better way of doing this? Or are there any other solutions out there that can address this need?
Your question is very broad.
Most DB servers will provide a method/API to access the data remotely. You can use Python as a client if there is a DBAPI module for your DB that supports remote access over the network. For example if you are using Postgres you could use the psycopg2 module.
If you are using a simple DB such as SQLite then you might be able to use an ODBC driver. Some alternatives are here.
Edit
MongoDB provides an API, pymongo.
In the end Redis was the best solution. The goal of the original question was to send data in real time from one computer to another, and solutions such as Redis or RabbitMQ accomplish this.
With Redis, a server can be set up to publish messages to the network; clients can then subscribe to data channels and receive the messages in a queue.
This Python library was used as the Redis client:
https://pypi.python.org/pypi/redis
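A minimal sketch of that publish/subscribe setup with the redis library might look like the following. The channel name analysis-results, the host and port, and the payload layout are assumptions made for illustration; the two Redis functions need a running Redis server, while the encode/decode helpers can be tried offline.

```python
import json
from datetime import datetime, timezone


def encode_result(result, analyzed_at=None):
    """Pack an analysis result and its timestamp into a JSON string."""
    analyzed_at = analyzed_at or datetime.now(timezone.utc)
    return json.dumps({"analyzed_at": analyzed_at.isoformat(), "result": result})


def decode_message(message):
    """Return the payload of a pub/sub message, or None for control messages."""
    if message.get("type") != "message":
        return None  # e.g. the subscribe confirmation
    data = message["data"]
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return json.loads(data)


def publish_result(result, channel="analysis-results", host="localhost"):
    import redis  # pip install redis

    redis.Redis(host=host, port=6379).publish(channel, encode_result(result))


def follow_results(channel="analysis-results", host="localhost"):
    import redis

    pubsub = redis.Redis(host=host, port=6379).pubsub()
    pubsub.subscribe(channel)
    for message in pubsub.listen():
        payload = decode_message(message)
        if payload is not None:
            print(payload["analyzed_at"], payload["result"])


# Offline round trip, shaped like what pubsub.listen() yields:
fake = {"type": "message", "channel": b"analysis-results",
        "data": encode_result({"score": 0.9}).encode("utf-8")}
decoded = decode_message(fake)
```

Redis pub/sub only delivers to clients that are connected at publish time, so a client that needs the most recent result right after connecting would still have to read it from the database.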

Using Python Qpid/Proton/Messenger(), how do I filter messages from Azure Event Hubs?

This gist shows how to use Messenger() to receive messages from an Azure Event Hub. It works great.
https://gist.github.com/tomconte/e2a4667185a9bf674f59
However, using this technique yields all messages in the Event Hub. I would like to read messages since an offset or timestamp (don't care which). I can see in the Qpid docs how to set these requirements, but not when using Messenger().
Here's the relevant section in the Qpid docs:
https://qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/api/index.html
And a sample that shows how to use it:
https://qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/examples/selected_recv.py.html
Question: is it possible and if so how?
Based on my understanding, you want to read event data starting from an offset or timestamp on the Event Hub. I reviewed all the classes and methods of the Event Hubs SDKs for C# and Java and found no way to consume event data from Event Hub partitions like that. Apache Qpid is a library that supports the AMQP protocol for Java/C/Python, and Event Hubs supports AMQP, but that does not mean Event Hubs supports every Qpid feature.
There are two workarounds:
Receive all messages from the Event Hub and filter out the ones you don't want.
Use Azure Stream Analytics to create an output pipeline that writes messages to another store, such as Table Storage or DocumentDB, and then retrieve the data from that store by the offset or timestamp you need.
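The first workaround amounts to a client-side filter. The sketch below assumes you have already paired each received message body with its enqueue timestamp; how that timestamp is read from the AMQP message is not shown, and the sample data is made up.

```python
from datetime import datetime, timezone


def newer_than(messages, since):
    """Yield only the (enqueued_at, body) pairs enqueued after `since`."""
    for enqueued_at, body in messages:
        if enqueued_at > since:
            yield enqueued_at, body


# Hypothetical messages standing in for what the receiver loop collects:
received = [
    (datetime(2017, 1, 1, 12, 0, tzinfo=timezone.utc), "old reading"),
    (datetime(2017, 1, 1, 12, 5, tzinfo=timezone.utc), "new reading"),
]
cutoff = datetime(2017, 1, 1, 12, 2, tzinfo=timezone.utc)
recent = [body for _, body in newer_than(received, cutoff)]
```

This still transfers every message over the network, which is the cost of filtering on the client.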

How to create new topic with pykafka with partitions and replication?

I want to be able to programmatically create a topic in Kafka using pykafka. I know that accessing the TopicDict will automatically create a topic if one does not exist but I don't know how to control the number of partitions/replicas with that. Also, it has a nasty bug where it ends up in an infinite loop if Kafka goes down. Basically I want to do something like the following:
create_topic('mytopic', partitions=2, replicas=3)
pykafka is a Python implementation of the Kafka producer and consumer APIs. What you want to do is performed in Kafka through a different API, the administration/operations API (actually a group of Java classes). I don't think pykafka has an API or wrapper for that.
What you may be observing is a topic being created automatically by Kafka. You can configure the default number of partitions and replicas for automatically created topics through broker properties (num.partitions and default.replication.factor).
You can do it with subprocess.
If you have the Kafka binaries installed, you can do something like this:
from pykafka import KafkaClient
import subprocess
client = KafkaClient(hosts="localhost:9092")
# Run the kafka-topics.sh script shipped with Kafka and wait for it to finish
subprocess.run("PATH/TO/KAFKA/BINARY/kafka_2.11-1.0.0/bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic testtopic --replication-factor 1 --partitions 10".split(), check=True)
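To get closer to the create_topic('mytopic', partitions=2, replicas=3) call the question asks for, the subprocess approach can be wrapped in a small helper. This is only a sketch: the function names are made up, the script path defaults to kafka-topics.sh being on PATH, and the ZooKeeper address is the one used in the answer above.

```python
import subprocess


def build_create_topic_cmd(topic, partitions, replicas,
                           kafka_topics="kafka-topics.sh",
                           zookeeper="localhost:2181"):
    """Return the argument list for kafka-topics.sh --create."""
    return [
        kafka_topics, "--create",
        "--zookeeper", zookeeper,
        "--topic", topic,
        "--replication-factor", str(replicas),
        "--partitions", str(partitions),
    ]


def create_topic(topic, partitions, replicas, **kwargs):
    """Run the command; raises CalledProcessError if the tool fails."""
    cmd = build_create_topic_cmd(topic, partitions, replicas, **kwargs)
    subprocess.run(cmd, check=True)


cmd = build_create_topic_cmd("mytopic", partitions=2, replicas=3)
```

Passing the arguments as a list avoids shell quoting issues, and check=True surfaces a failed creation as an exception instead of silently continuing.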
