I want to be able to programmatically create a topic in Kafka using pykafka. I know that accessing the TopicDict will automatically create a topic if one does not exist but I don't know how to control the number of partitions/replicas with that. Also, it has a nasty bug where it ends up in an infinite loop if Kafka goes down. Basically I want to do something like the following:
create_topic('mytopic', partitions=2, replicas=3)
Pykafka is a Python implementation of the Kafka producer and consumer APIs. What you want to achieve is an operation that Kafka performs through a different API, the administration/operations API (actually a group of Java classes). I don't think Pykafka has an API/wrapper for that.
What you may be observing is a topic being automatically created by Kafka. What you can do is configure, via broker properties, the default number of partitions and replicas for automatically created topics (num.partitions and default.replication.factor in the broker configuration).
You can do it with subprocess.
If you have the Kafka binaries installed, you can do something like this:
from pykafka import KafkaClient
import subprocess
client = KafkaClient(hosts="localhost:9092")
subprocess.Popen("PATH/TO/KAFKA/BINARY/kafka_2.11-1.0.0/bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic testtopic --replication-factor 1 --partitions 10".split())
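If you want the create_topic(...) call from the question, one option is to wrap that shell command in a small helper. A minimal sketch, assuming the kafka-topics.sh path and ZooKeeper address from the snippet above (adjust both for your setup):

import subprocess

# Path to kafka-topics.sh from the Kafka binary distribution (assumption)
KAFKA_TOPICS_SH = "PATH/TO/KAFKA/BINARY/kafka_2.11-1.0.0/bin/kafka-topics.sh"

def create_topic(name, partitions, replicas, zookeeper="localhost:2181"):
    """Create a Kafka topic by shelling out to kafka-topics.sh."""
    cmd = [
        KAFKA_TOPICS_SH, "--create",
        "--zookeeper", zookeeper,
        "--topic", name,
        "--partitions", str(partitions),
        "--replication-factor", str(replicas),
    ]
    # check_call raises CalledProcessError if topic creation fails
    subprocess.check_call(cmd)

create_topic('mytopic', partitions=2, replicas=3)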
I have a Kafka consumer in Python that processes log data (stack-trace analysis and automatic issue creation), and we also run the ELK stack in parallel. Is there any way to stream logs to Python via ELK so we can get rid of Kafka? I have no experience with ELK and can't find anything about streaming from it. It seems I can only query log data periodically, which doesn't seem like an ideal solution.
No, you cannot stream data out of Elasticsearch on its own.
If your input is something else, you can use Logstash's various output plugins (or write your own) to get the data into something a Python application can consume.
For example, the pipe, tcp, websocket/http, and exec output plugins are all generic enough to be used with any language.
However, Logstash does not persist events the way Kafka does, so if you want something that can handle back pressure and doesn't drop events, you'd keep Kafka around.
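As an illustration of the tcp approach, here is a minimal sketch of the Python side, assuming Logstash is configured with the tcp output plugin pointing at localhost:5000 with a json_lines codec (host, port, and codec are all assumptions):

import json
import socket

# Listen where the (assumed) Logstash tcp output is configured to send events
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 5000))
server.listen(1)

conn, _addr = server.accept()
buffer = b""
while True:
    chunk = conn.recv(4096)
    if not chunk:
        break
    buffer += chunk
    # The json_lines codec sends one JSON document per line
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        if line.strip():
            event = json.loads(line)
            print("Received event:", event.get("message"))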
I'm new to using Kafka and Elasticsearch. I've been trying to use Elasticsearch, but I have some problems. I've put together a docker-compose file with all the images needed to build the environment. Using Kafka, I produce data into a specific topic, and then I need to take the data from a Kafka consumer into a pub/sub system that sends it for ingestion into Elasticsearch.
I implemented all this using Python. At the Elasticsearch host and port everything appears fine, but for Kibana the page shows the following message:
kibana server is not ready yet
The Python consumer with which I take data from a topic looks something like this:
from kafka import KafkaConsumer
# Import sys module
import sys
# Import json module to deserialize data
import json

# Initialize the consumer and set a value_deserializer to decode JSON
consumer = KafkaConsumer(
    'JSONtopic',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)

for message in consumer:
    print("Consumer records:\n")
    print(message)
    print("\nReading from JSON data\n")
    # Index 6 of the ConsumerRecord tuple is the deserialized message value
    print("Name:", message[6]['name'])
    print("Email:", message[6]['email'])
    # Terminate the script after the first message
    sys.exit()
The goal is to use Elasticsearch for analysis, so I need it as the backend for visualizing the data in Kibana. A tutorial to follow for understanding how to link these pieces together would also be really appreciated.
(P.S. Data flows without problems from one topic to another; the problem is taking this information, inserting it into Elasticsearch, and being able to visualize it.)
If you're pushing data from Kafka to Elasticsearch, then doing it with the Consumer API is typically not a good idea, since there are tools that do it much better and handle more functionality.
For example:
Kafka Connect (e.g. 🎥 https://rmoff.dev/kafka-elasticsearch-video)
Logstash
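That said, if you do stick with the plain Consumer API for a quick prototype, the missing piece is just indexing each consumed record into Elasticsearch. A minimal sketch, assuming the official elasticsearch-py client (8.x API), an index named "logs", and the JSONtopic/deserializer setup from the question (index name and hosts are assumptions):

import json

from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

consumer = KafkaConsumer(
    'JSONtopic',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
)
es = Elasticsearch("http://localhost:9200")

for message in consumer:
    # Each record value is already a dict thanks to the deserializer;
    # index it into Elasticsearch so Kibana can visualize it
    es.index(index="logs", document=message.value)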
This gist shows how to use Messenger() to receive messages from an Azure Event Hub. It works great.
https://gist.github.com/tomconte/e2a4667185a9bf674f59
However, using this technique yields all messages in the Event Hub. I would like to read only messages from a given offset or timestamp onwards (I don't care which). I can see in the Qpid docs how to set these requirements, but not when using Messenger().
Here's the relevant section in the Qpid docs:
https://qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/api/index.html
And a sample that shows how to use it:
qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/examples/selected_recv.py.html
Question: is it possible and if so how?
Based on my understanding, you want to read event data starting from a given offset or timestamp on Event Hub. I reviewed all the classes & methods of the Event Hub SDKs for C#/Java, and there is no way to consume event data from partitions in this fashion. Apache Qpid is a library that supports the AMQP protocol for Java/C/Python, and Event Hub supports AMQP, but that does not mean Event Hub supports all of Qpid's features.
There are two workarounds you can consider:
Receive all messages from the Event Hub and filter out the ones you don't need.
Use Azure Stream Analytics to create an output pipeline that writes messages to other storage, such as Table Storage or DocumentDB; you can then retrieve the data from that storage by the offset/timestamp you need.
I have a Flask app which uses SocketIO to communicate with users currently online. I keep track of them by mapping the user ID with a session ID, which I can then use to communicate with them:
online_users = {'uid...':'sessionid...'}
I declare this in my run.py file where the app is launched, and then I import it when I need it, as such:
from app import online_users
I'm using Celery with RabbitMQ for task deployment, and I need to use this dict from within the tasks. So I import it as above, but when I use it it is empty even when I know it is populated. I realize after reading this that it is because each task is asynchronous and starts a new process with an empty dict, and so my best bet is to use some sort of database or cache.
I'd rather not run an additional service, and I only need to read from the dict (I won't be writing to it from the tasks). Is a cache/database my only option here?
That depends on what you have in the dict. If it's something you can serialize to a string, you can serialize it to JSON and pass it as an argument to the task. If it's an object you cannot serialize, then yes, you need to use a cache/database.
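For the first case, a minimal sketch of passing a snapshot of the dict as a serializable task argument (the app and task names here are made up for illustration; Celery serializes task arguments to JSON by default):

from celery import Celery

app = Celery('tasks', broker='amqp://localhost')

@app.task
def notify_online_users(online_users):
    # online_users arrives as a plain dict, deserialized from JSON
    for uid, session_id in online_users.items():
        print(f"user {uid} -> session {session_id}")

# At call time, pass the current contents of the dict:
# notify_online_users.delay(online_users)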
I came across this discussion which seems to be a solution for exactly what I'm trying to do.
Communication through a message queue is now implemented in the python-socketio package, through the use of Kombu, which provides a common API for working with several message queues, including Redis and RabbitMQ.
Supposedly an official release will be out soon, but as of now it can be done using an additional package.
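Flask-SocketIO now exposes this mechanism through its message_queue parameter, so an external process such as a Celery worker can emit to connected clients. A minimal sketch, assuming RabbitMQ at the default URL and an event name chosen for illustration (both assumptions):

from flask_socketio import SocketIO

# In the Celery worker: a write-only SocketIO instance connected to the
# same message queue (RabbitMQ via Kombu) as the main Flask app
socketio = SocketIO(message_queue='amqp://localhost')

def notify_user(session_id, payload):
    # Emit to the client's room; by default each client joins a room
    # named after its session id
    socketio.emit('task_done', payload, room=session_id)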
I'm adopting Kafka and trying to understand how to monitor it (e.g. is it running out of memory for log storage). I see that it uses Yammer Metrics and exposes them via JMX - this apparently makes sense to people in Java land.
Is there an HTTP API I can build on? Or really any sort of relatively structured output at all?
You can use Yahoo's Kafka Manager to inspect the cluster state. Download Kafka Manager from https://github.com/yahoo/kafka-manager.
Hope it helps.
You can use the Ankush Kafka monitoring tool for this:
https://github.com/impetus-opensource/ankush
You can get the latest release from https://github.com/impetus-opensource/ankush/releases.
You can both create and monitor your clusters using the above tool.