Generic explanation: my application consumes messages from a topic and then splits them into separate topics according to their ID, so the topics are named like topic_name_id. My goal is to connect those new topics to a certain sink (S3 or Snowflake, haven't decided) so that the messages published to those topics end up there. However, I've only found ways to do this using a configuration file, where you connect the sink to a topic that already exists and whose name you know. But here the goal would be to connect the sink to a topic created during the process. Is there a way this can be achieved?
If the above is not possible, is there a way to connect to the common topic with all the messages, but create different tables (in Snowflake) or S3 directories according to the message ID? Adding to that, in the case of S3, the messages are added as individual JSON files, right? Is there no way to combine them into one file?
Thanks
The outgoing IDs are known, right?
Kafka Connect exposes a REST API: generate a JSON HTTP body using those IDs and the finalized topic names, then use requests, for example, to create and start connectors for those topics. You can do that directly from the process right before starting the producer, or you can send a request with the ID/topic name to a Lambda job instead, which talks to the Connect API.
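For example, here is a rough sketch in Python with requests. The Connect worker URL, bucket name, and region are placeholders, and the connector properties follow Confluent's documented S3 sink settings, so verify them against the connector version you deploy:

import requests

CONNECT_URL = "http://localhost:8083/connectors"  # placeholder Connect worker

def create_s3_sink_for_topic(topic_name):
    """Register and start an S3 sink connector for a freshly created topic."""
    body = {
        "name": f"s3-sink-{topic_name}",
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": topic_name,
            "tasks.max": "1",
            "s3.bucket.name": "my-bucket",     # placeholder
            "s3.region": "us-east-1",          # placeholder
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "1000",  # records per S3 object, i.e. how records are combined into files
        },
    }
    resp = requests.post(CONNECT_URL, json=body)
    resp.raise_for_status()
    return resp.json()

# Call this from the process right after it derives the new topic name, e.g.:
# create_s3_sink_for_topic("topic_name_42")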
When using different topics with the S3 Sink connector, there will be separate S3 paths and separate files, based on the number of partitions in the topic and the other partitioner settings defined in your connector properties. Most S3 processes are able to read entire S3 prefixes, though, so I don't imagine that being an issue.
I don't have experience with the Snowflake connector to know how it handles different topic names.
I have a consumer that consumes from two topics with two different Avro schemas. The number of topics can grow in the future, so unioning schemas to read from multiple topics is not scalable.
from confluent_kafka import Consumer

# assume config with group.id and bootstrap.servers is already defined
consumer = Consumer(config)
consumer.subscribe(["randomtopic1", "randomtopic2"])

msg = consumer.poll(1.0)      # blocks up to 1 s; returns a Message or None
msg_value = msg.value()       # raw payload, bytes (or str), not yet deserialized
How do I identify/extract which topic the message is from, without deserializing it with the Avro schema? Wondering whether the topic name is encoded into any Avro headers, I came across this link where they deserialize the value as bytes and somehow extract the topic name, and only then deserialize the message value with the schema.
There are no "Avro headers" in the consumed records, and Kafka record headers don't (usually) contain the topic name.
You can simply use msg.topic() (the topic name is carried on the consumed Message object itself), then branch on it with an if statement to process the topics differently.
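For example, building on the consumer snippet above (handle_topic1/handle_topic2 are hypothetical handlers):

msg = consumer.poll(1.0)
if msg is not None and msg.error() is None:
    if msg.topic() == "randomtopic1":
        handle_topic1(msg.value())    # value is still raw bytes at this point
    elif msg.topic() == "randomtopic2":
        handle_topic2(msg.value())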
I am not too sure about Python, but per the Confluent recommendation, if you are using a single consumer that subscribes to multiple topics and handles multiple data types at the same time, it is advised to use GenericRecord instead of a specific code-binding type.
You can find sample .NET code here: https://github.com/confluentinc/confluent-kafka-dotnet/blob/master/examples/AvroGeneric/Program.cs
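If you want the Python equivalent of that generic approach, here is a rough sketch using confluent-kafka-python's Schema Registry classes; the names and argument order follow the library's documented API, so double-check them against your installed version (the URLs are placeholders):

from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})

# With no reader schema supplied, the deserializer looks up the writer schema
# for each record and returns a plain dict, i.e. a generic record.
avro_deserializer = AvroDeserializer(sr_client)

consumer = DeserializingConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "generic-consumer",
    "value.deserializer": avro_deserializer,
})
consumer.subscribe(["randomtopic1", "randomtopic2"])

msg = consumer.poll(1.0)
if msg is not None:
    print(msg.topic(), msg.value())   # msg.value() is a dict keyed by Avro field names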
I'm building a solution called Match Service, which receives real-time data from a third-party provider over an MQTT server. We save this data in an RDS cluster.
In another service, our users can create a filter called a Strategy. A cron job runs every 5 minutes against that service, and all the records in the database are sent to a Kafka topic to be processed by Match Service.
My design is event-based, so for each new Strategy record in the topic, Match Service queries the database to check whether any Match crosses that Strategy's threshold. If the threshold is passed, it publishes a new message to the broker.
The API processes about 10k Strategies in each job, and it's taking a long time (about 250 s per job).
So my question is: is there a better way to design this system? I was thinking of adding a Redis layer to avoid database transactions.
All suggestions welcome!
Think long and hard about your relational data store. If you really need it to be relational, then it may absolutely make sense, but if not, a relational database is often a horrible place to dump things like time-series and IoT output. It's a great place to put normalized, structured data for reporting, but a lousy dump/load location and a poor fit for real-time matching.
Look more at something like Amazon Redshift, Elasticsearch, or some other NoSQL solution that can ingest and match things at orders of magnitude higher scale.
New JSON files are dumped into an S3 bucket daily. I have to create a solution that picks up the latest file when it arrives, parses the JSON, and loads it into the Snowflake data warehouse. Could someone please share your thoughts on how we can achieve this?
There are a number of ways to do this depending on your needs. I would suggest creating an event to trigger a lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
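A minimal sketch of such a handler; notify_snowpipe is a hypothetical stand-in for however you choose to kick off the Snowflake load:

import urllib.parse

def lambda_handler(event, context):
    # The S3 event notification carries the bucket and object key of each new file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        notify_snowpipe(bucket, key)   # hypothetical: e.g. a Snowpipe insertFiles call
    return {"statusCode": 200}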
Another option may be to create an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
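A rough sketch of the polling side with boto3; the queue URL and process_s3_event are placeholders:

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"  # placeholder

while True:
    # Long polling: wait up to 20 seconds for messages instead of hammering the queue.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for message in resp.get("Messages", []):
        process_s3_event(message["Body"])   # placeholder: parse the S3 event, load to Snowflake
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])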
Edit: here is a more detailed explanation of how to create events from S3 and trigger Lambda functions; the documentation is provided by Snowflake:
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html
Look into Snowpipe; it lets you do that within the system, making it (possibly) much easier.
There are some aspects to consider, such as whether it is batch or streaming data, whether you want to retry loading a file in case the data or format is wrong, and whether you want to make it a generic process able to handle different file formats/types (CSV/JSON) and stages.
In our case we have built a generic S3-to-Snowflake load using Python and Luigi, and also implemented the same using SSIS, but for CSV/TXT files only.
In my case, I have a Python script which gets information about the bucket with boto.
Once I detect a change, I call the insertFiles REST endpoint on Snowpipe.
Phasing:
detect the S3 change
get the S3 object path
parse the content and transform it to CSV in S3 (same bucket, or another one Snowpipe can connect to)
call the Snowpipe REST API (a rough sketch follows below)
What you need:
Create a user with a public key
Create your stage on Snowflake with AWS credentials in order to access S3
Create your pipe on Snowflake with your user role
Sign a JWT
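If hand-rolling the JWT is painful, the snowflake-ingest Python package wraps both the JWT signing and the insertFiles call; here is a rough sketch, where every account/pipe/key identifier is a placeholder:

from snowflake.ingest import SimpleIngestManager, StagedFile

with open("rsa_key.pem") as f:          # private key whose public half is registered on the user
    private_key = f.read()

ingest_manager = SimpleIngestManager(
    account="myaccount",
    host="myaccount.snowflakecomputing.com",
    user="SNOWPIPE_USER",
    pipe="MYDB.MYSCHEMA.MYPIPE",        # fully qualified pipe name
    private_key=private_key,
)

# Tell Snowpipe which staged file(s) to load; paths are relative to the pipe's stage.
resp = ingest_manager.ingest_files([StagedFile("2019/07/01/data.csv", None)])
print(resp)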
I also tried with a Talend job with TOS BigData.
Hope it helps.
I am trying to work with Kafka for data ingestion but, being new to this, I am pretty much confused. I have multiple crawlers that extract data for me from a web platform. Now, the issue is that I want to ingest that extracted data into Hadoop using Kafka without any middle scripts/service file. Is it possible?
"without any middle scripts/service file. Is it possible?"
Unfortunately, no.
You need some service that writes into Kafka (your scraper). Whether you produce the HTTP links into Kafka (and then write an intermediate consumer/producer that generates the scraped results) or only produce the final scraped results is up to you.
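For the scraper side, a minimal sketch of such a producer with confluent-kafka-python; the broker address and topic name are placeholders:

import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})   # placeholder broker

def publish_scrape_result(result: dict):
    # Produce the scraped record as JSON; a downstream consumer/connector writes it to HDFS.
    producer.produce("scraped-results", value=json.dumps(result).encode("utf-8"))
    producer.poll(0)   # serve delivery callbacks

# publish_scrape_result({"url": "https://example.com", "title": "..."})
producer.flush()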
You also need a second service consuming those topic(s) that writes to HDFS. This could be Kafka Connect (via Confluent's HDFS Connector library), or PySpark (code you'd have to write yourself), or other options that include "middle scripts/services".
If you'd like to combine both options, I would suggest taking a look at Apache NiFi or StreamSets, which can perform HTTP lookups, (X)HTML parsing, and Kafka+HDFS connectors, all configured via a centralized GUI. Note: I believe any Python code would have to be rewritten in a JVM language to support major custom parsing logic in this pipeline.
This gist shows how to use Messenger() to receive messages from an Azure Event Hub. It works great.
https://gist.github.com/tomconte/e2a4667185a9bf674f59
However, using this technique yields all messages in the Event Hub. I would like to read messages starting from an offset or a timestamp (I don't care which). I can see in the Qpid docs how to set these requirements, but not when using Messenger().
Here's the relevant section in the Qpid docs:
https://qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/api/index.html
And a sample that shows how to use it:
qpid.apache.org/releases/qpid-proton-0.16.0/proton/python/examples/selected_recv.py.html
Question: is it possible and if so how?
Based on my understanding, I think you want to read event data starting from an offset or timestamp on Event Hubs. I reviewed the classes & methods of the Event Hubs SDKs for C#/Java, and there is no supported way to consume event data from partitions starting at an offset or timestamp. Apache Qpid is a library supporting the AMQP protocol for Java/C/Python, and Event Hubs supports AMQP, but that does not mean Event Hubs supports every Qpid feature.
There are two workarounds for you:
Receive all messages from the Event Hub and filter out the ones you don't need (a rough sketch of that filtering step follows below).
Use Azure Stream Analytics to create an output pipeline that writes the messages to other storage, such as Table Storage or DocumentDB; then you can retrieve the data from that storage by the offset/timestamp you need.
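For the first workaround, here is a rough sketch of the filtering step. It assumes you can read the broker annotations that Event Hubs stamps on each message ('x-opt-offset', 'x-opt-enqueued-time'); how you reach message.annotations, and whether the keys arrive as plain strings or Proton symbols, depends on the Proton version you use:

from datetime import datetime, timezone

CUTOFF = datetime(2017, 1, 1, tzinfo=timezone.utc)   # placeholder cut-off timestamp

def is_new_enough(message):
    # 'x-opt-enqueued-time' is the enqueue time in milliseconds since the epoch.
    annotations = message.annotations or {}
    enqueued_ms = annotations.get("x-opt-enqueued-time")
    if enqueued_ms is None:
        return True   # no annotation found: keep the message rather than silently drop it
    enqueued = datetime.fromtimestamp(enqueued_ms / 1000.0, tz=timezone.utc)
    return enqueued >= CUTOFF

# In your receive loop, process a message only when is_new_enough(msg) is True.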