I have a Kafka consumer in Python that processes log data (stacktrace analysis and automatic issue creation), and we are also running the ELK stack in parallel. Is there any way to stream logs to Python via ELK so we can get rid of Kafka? I have no experience with ELK and can't find anything about streaming from it. It seems I can only query log data periodically, but that doesn't look like an ideal solution.
No, you cannot stream data out of Elasticsearch on its own.
If your input is something else, you can use Logstash's various output plugins (or write your own) to ship events to a Python process.
For example, the pipe, tcp, websocket/http, and exec plugins are all generic enough to be used with any language.
However, Logstash does not persist events the way Kafka does, so if you want something that can handle back pressure and doesn't drop events, you'd keep Kafka around.
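As a rough illustration of the tcp route: a minimal sketch of a Python listener, assuming Logstash is configured with a tcp output using the json_lines codec (the config line in the comment, the port number, and the process() handler are all hypothetical):

```python
# Assumes a Logstash tcp output along these lines (hypothetical config):
#   output { tcp { host => "127.0.0.1" port => 5000 codec => json_lines } }
import json
import socketserver

class LogstashHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # json_lines sends one JSON document per newline-terminated line.
        for line in self.rfile:
            event = json.loads(line)
            process(event)

def process(event):
    # Placeholder for the real work (stacktrace analysis, issue creation).
    print(event.get("message"))

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("0.0.0.0", 5000), LogstashHandler) as server:
        server.serve_forever()
```

Note that a listener like this inherits Logstash's delivery semantics: if the Python side is down, events are not buffered the way Kafka would buffer them.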
In my company we developed a few games; for some games the events are sent to Elasticsearch and for others to RabbitMQ. We have a local CLI which grabs the data from both and compiles the messages into compressed (gzip) JSON files, after which another CLI converts them to SQL statements and throws them into a local SQL Server. We now want to scale up, but the current setup is painful and nowhere near real-time for analysis.
I've recently built an application in Python which I was planning to publish to a Docker container in AWS. The script grabs data from Elasticsearch, compiles it into small compressed JSON files, and publishes them to an S3 bucket. From there the data is ingested into Snowflake for analysis. So far I've been able to get the data in quite quickly, and it looks promising as an alternative.
I was planning to do something similar with RabbitMQ, but I wanted to find an even better alternative that would let this ingestion process happen seamlessly and spare me from implementing all sorts of exception handling in the Python code.
I've researched a bit and found there might be a way to link RabbitMQ to Amazon Kinesis Firehose. My question would be: how would I send the stream from RabbitMQ to Kinesis?
For Elasticsearch, what is the best way to achieve this? I've read about the Logstash output plugin for S3 (https://www.elastic.co/guide/en/logstash/current/plugins-outputs-s3.html) and the Logstash input plugin for Kinesis (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kinesis.html). Which approach would be ideal for real-time ingestion?
My answer will be quite theoretical; it needs to be tested in the real world and adapted to your use case.
For near-realtime behaviour, I would use Logstash with:
- the elasticsearch input and a short cron schedule; this post can help: https://serverfault.com/questions/946237/logstashs-elasticsearch-input-plugin-should-be-used-to-output-to-elasticsearch
- the S3 output (supports gzip)
- maybe the jdbc output to your DB
- the rabbitmq output plugin
You can create a more scalable architecture by outputting to RabbitMQ and using other pipelines that listen to the queue and execute other tasks, for example (the RabbitMQ-to-Kinesis case is sketched after this list):
- Logstash: ES -> RabbitMQ
- Logstash: RabbitMQ -> SQL
- Logstash: RabbitMQ -> Kinesis
- Logstash: RabbitMQ -> AWS
- etc.
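If you'd rather wire RabbitMQ to Kinesis from Python instead of Logstash, a minimal sketch could look like this (assuming the pika and boto3 packages; the queue name, stream name, and region are hypothetical):

```python
import boto3
import pika

# Hypothetical names; replace with your own queue, stream, and region.
kinesis = boto3.client("kinesis", region_name="us-east-1")

def on_message(channel, method, properties, body):
    # Forward each RabbitMQ message into the Kinesis stream, then ack it.
    kinesis.put_record(
        StreamName="game-events",
        Data=body,
        PartitionKey=method.routing_key or "default",
    )
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="game-events", durable=True)
channel.basic_consume(queue="game-events", on_message_callback=on_message)
channel.start_consuming()
```

Acking only after put_record succeeds means a crash re-delivers the message rather than losing it, at the cost of possible duplicates downstream.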
I am trying to work with Kafka for data ingestion but, being new to this, I am pretty much confused. I have multiple crawlers that extract data for me from a web platform. Now the issue is: I want to ingest that extracted data into Hadoop using Kafka without any middle scripts/service files. Is it possible?
without any middle scripts/service file. Is it possible?
Unfortunately, no.
You need some service that writes into Kafka (your scraper). Whether you produce HTTP links into Kafka (and then write an intermediate consumer/producer that generates the scraped results) or only produce the final scraped results is up to you.
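For instance, a minimal sketch of the producer side, assuming the kafka-python package and a hypothetical topic name:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish(result):
    # "scraped-results" is a hypothetical topic for the final scraped data;
    # your crawler would call this once per scraped record.
    producer.send("scraped-results", value=result)

publish({"url": "https://example.com", "title": "Example"})
producer.flush()
```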
You also need a second service consuming those topic(s) that writes to HDFS. This could be Kafka Connect (via Confluent's HDFS Connector library), or PySpark (code you'd have to write yourself), or other options that include "middle scripts/services".
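As an illustration of the PySpark route, a minimal Structured Streaming sketch that mirrors a topic into HDFS (assuming the spark-sql-kafka connector is on the classpath; the topic name and paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the scraped-results topic as a stream.
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "scraped-results")
    .load())

# Kafka values arrive as bytes; cast to string before writing to HDFS.
query = (stream.selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .format("text")
    .option("path", "hdfs:///data/scraped-results")
    .option("checkpointLocation", "hdfs:///checkpoints/scraped-results")
    .start())

query.awaitTermination()
```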
If you'd like to combine both options, I would suggest taking a look at Apache NiFi or StreamSets, which can perform HTTP lookups, (X)HTML parsing, and Kafka+HDFS connectors, all configured via a centralized GUI. Note: I believe any major custom parsing logic written in Python would have to be rewritten in a JVM language to fit into this pipeline.
This gist shows how to use Messenger() to receive messages from an Azure Event Hub. It works great.
https://gist.github.com/tomconte/e2a4667185a9bf674f59
However, using this technique yields all messages in the Event Hub. I would like to read messages since an offset or timestamp (don't care which). I can see in the Qpid docs how to set these requirements, but not when using Messenger().
Here's the relevant section in the Qpid docs:
https://qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/api/index.html
And a sample that shows how to use it:
qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/examples/selected_recv.py.html
Question: is it possible, and if so, how?
Based on my understanding, I think you want to read event data starting from a given offset or timestamp on Event Hub. I reviewed the classes & methods of the Event Hub SDKs for C#/Java, and there is no way to consume event data from partitions in this manner. Apache Qpid is a library supporting the AMQP protocol for Java/C/Python, and Event Hub supports AMQP, but that does not mean Event Hub supports every capability of Qpid.
There are two workarounds for you (the first is sketched below):
1. Receive all messages from the Event Hub and filter out the ones you don't need.
2. Use Azure Stream Analytics to create an output pipeline that writes messages to other storage, such as Table Storage or DocumentDB; you can then retrieve the data from that storage by the offset/timestamp you need.
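For the first workaround, a minimal sketch along the lines of the gist above, assuming proton 0.16's Messenger API; the connection string, partition, and cutoff value are hypothetical, and x-opt-enqueued-time is the annotation Event Hubs stamps on each message:

```python
from proton import Messenger, Message

# Hypothetical cutoff: drop anything enqueued before this point
# (the annotation value may be ms since epoch or a timestamp object,
# depending on the client/broker versions, so verify before comparing).
CUTOFF_MS = 1489500000000

address = ("amqps://keyname:key@namespace.servicebus.windows.net/"
           "myeventhub/ConsumerGroups/$Default/Partitions/0")

messenger = Messenger()
messenger.subscribe(address)
messenger.start()

while True:
    messenger.recv(10)  # wait for up to 10 messages
    while messenger.incoming:
        message = Message()
        messenger.get(message)
        annotations = message.annotations or {}
        if annotations.get("x-opt-enqueued-time", 0) >= CUTOFF_MS:
            print(message.body)
```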
I'm adopting Kafka and trying to understand how to monitor it (e.g. is it running out of memory for log storage). I see that it uses Yammer Metrics and exposes them via JMX - this apparently makes sense to people in Java land.
Is there an HTTP API I can build on? Or really any sort of relatively structured output at all?
You can use Yahoo's Kafka Manager to inspect cluster state. Download Kafka Manager from https://github.com/yahoo/kafka-manager.
Hope it helps.
You can use a Kafka monitoring tool for this:
https://github.com/impetus-opensource/ankush
You can get the latest release from the following link: https://github.com/impetus-opensource/ankush/releases.
You can create as well as monitor your clusters using the above tool.
I am currently trying to develop something using Google App Engine. I am using Python as my runtime and require some advice on setting up the following.
I am running a webserver that provides JSON data to clients. The data comes from an external service from which I have to pull it.
What I need to be able to do is run a background system that checks memcache to see if there are any required IDs; if there is an ID, I need to fetch some data for that ID from the external source and place the data in memcache.
If there are multiple IDs, say more than 30, I need to be able to make all 30 requests as quickly and efficiently as possible.
I am new to Python development and App Engine, so any advice you guys could give would be great.
Thanks.
You can use "backends" or "task queues" to run processes in the background. Tasks have a 10-minute run time limit, and backends have no run time limit. There's also a cronjob mechanism which can trigger requests at regular intervals.
You can fetch the data from external servers with the "URLFetch" service.
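To pull many IDs in parallel, a minimal sketch using the asynchronous URLFetch calls of the first-generation Python runtime (the endpoint URL and memcache key format are hypothetical):

```python
from google.appengine.api import memcache, urlfetch

def fetch_all(ids):
    # Start all fetches up front so they run in parallel...
    rpcs = []
    for item_id in ids:
        rpc = urlfetch.create_rpc(deadline=10)
        urlfetch.make_fetch_call(rpc, "https://external.example.com/data/%s" % item_id)
        rpcs.append((item_id, rpc))
    # ...then collect the results as each one completes.
    for item_id, rpc in rpcs:
        result = rpc.get_result()
        if result.status_code == 200:
            memcache.set("data:%s" % item_id, result.content)
```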
Note that using memcache as the communication mechanism between front-end and back-end is unreliable -- the contents of memcache may be partially or fully erased at any time (and it does happen from time to time).
Also note that you can't query memcache if you don't know the exact keys ahead of time. It's probably better to use the task queue to queue up requests instead of memcache, or to use the datastore as a storage mechanism; the task queue approach is sketched below.
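For the queueing approach, a minimal sketch assuming a hypothetical /worker handler that performs the actual external fetch:

```python
from google.appengine.api import taskqueue

def enqueue_fetch(item_id):
    # Each task POSTs to /worker, which fetches and stores data for this ID.
    taskqueue.add(url="/worker", params={"id": item_id})
```

Unlike memcache, the task queue guarantees the request isn't silently dropped, and App Engine retries failed tasks for you.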