I am porting a streaming application written in Scala to Python. I want to manually commit offsets for a DStream. In Scala this is done like below:
val stream = KafkaUtils.createDirectStream(someConfigs)
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
// some time later, after outputs have completed
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
But I am unable to find similar APIs in Python. Can you please guide me on how I can manually commit offsets using the Python client?
I resolved this by going back to the pyspark 2.2 library, as it has an API to get offsetRanges, and by storing the offsets in redis. I had to go back to Python 2.7 because there is no "long" support in Python 3.6.
import redis
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition, KafkaRDD
def get_offset_ranges(topic):
ranges = None
rk = '{topic}:offsets'.format(topic=topic)
cache = redis.Redis()
if cache.exists(rk):
mapping = cache.hgetall(rk)
ranges = dict()
for k, v in mapping.items():
tp = TopicAndPartition(topic, int(k))
ranges[tp] = long(v)
return ranges
def update_offset_ranges(offset_ranges):
cache = redis.Redis()
for rng in offset_ranges:
rk = '{rng.topic}:offsets'.format(rng=rng)
print("updating redis_key: {}, partion:{} , lastOffset: {} ".format(rk, rng.partition, rng.untilOffset))
cache.hset(rk, rng.partition, rng.untilOffset)
def do_some_work(rdd):
pass
def process_dstream(rdd):
rdd.foreachPartition(lambda iter: do_some_work(iter))
krdd = KafkaRDD(rdd._jrdd, sc, rdd._jrdd_deserializer)
off_ranges = krdd.offsetRanges()
for o in off_ranges:
print(str(o))
update_offset_ranges(off_ranges)
sc = SparkContext(appName="mytstApp")
ssc = StreamingContext(sc, 1)
kafka_params = {
"bootstrap.servers": "localhost:9092",
"group.id": "myUserGroup",
"enable.auto.commit": "false",
"auto.offset.reset": "smallest"
}
topic = "mytopic"
offset_ranges = get_offset_ranges(topic)
dstream = KafkaUtils.createDirectStream(ssc, [topic], kafka_params, fromOffsets=offset_ranges)
dstream.foreachRDD(process_dstream)
# Start our streaming context and wait for it to 'finish'
ssc.start()
# Wait for the job to finish
try:
ssc.awaitTermination()
except Exception as e:
ssc.stop()
raise e # to exit with error condition
This is quite a specific question regarding nohup in Linux, used to run a Python file.
Back-story: I am trying to save down streaming data (from the IG Markets broadcast signal), and I am running it via a remote server so I don't have to keep my own local desktop up 24/7.
Somehow, nohup will not engage when the script 'listens' to the broadcast signal.
Below is the example Python code:
#!/usr/bin/env python
#-*- coding:utf-8 -*-
"""
IG Markets Stream API sample with Python
"""
user_ = 'xxx'
password_ = 'xxx'
api_key_ = 'xxx' # this is the 1st api key
account_ = 'xxx'
acc_type_ = 'xxx'
fileLoc = 'marketdata_IG_spx_5min.csv'
list_ = ["CHART:IX.D.SPTRD.DAILY.IP:5MINUTE"]
fields_ = ["UTM", "LTV", "TTV", "BID_OPEN", "BID_HIGH", \
"BID_LOW", "BID_CLOSE",]
import time
import sys
import traceback
import logging
import warnings
import pandas as pd  # needed for pd.read_csv / pd.DataFrame used below

warnings.filterwarnings('ignore')
from trading_ig import (IGService, IGStreamService)
from trading_ig.lightstreamer import Subscription
cols_ = ['timestamp', 'data']
# A simple function acting as a Subscription listener
def on_prices_update(item_update):
    # print("price: %s " % item_update)
    print("xxxxxxxx")

# A simple function acting as a Subscription listener
def on_charts_update(item_update):
    # print("price: %s " % item_update)
    print("xxxxxx".format(
        stock_name=item_update["name"], **item_update["values"]
    ))
    res_ = ["xxxxx".format(
        stock_name=item_update["name"], **item_update["values"]
    ).split(' ')]
    # display(pd.DataFrame(res_))
    try:
        data_ = pd.read_csv(fileLoc)[cols_]
        data_ = data_.append(pd.DataFrame(res_, columns=cols_))
        data_.to_csv(fileLoc)
        print('there is data and we are reading it')
        # display(data_)
    except:
        pd.DataFrame(res_, columns=cols_).to_csv(fileLoc)
        print('there is no data and we are saving it for the first time')
    time.sleep(60)  # sleep for 1 min
def main():
logging.basicConfig(level=logging.INFO)
# logging.basicConfig(level=logging.DEBUG)
ig_service = IGService(
user_, password_, api_key_, acc_type_
)
ig_stream_service = IGStreamService(ig_service)
ig_session = ig_stream_service.create_session()
accountId = account_
################ my code to set sleep function to sleep/read at only certain time intervals
s_time = time.time()
############################
# Making a new Subscription in MERGE mode
subscription_prices = Subscription(
mode="MERGE",
# make sure to put L1 in front of the instrument name
items= list_,
fields= fields_
)
# adapter="QUOTE_ADAPTER")
# Adding the "on_price_update" function to Subscription
subscription_prices.addlistener(on_charts_update)
# Registering the Subscription
sub_key_prices = ig_stream_service.ls_client.subscribe(subscription_prices)
print('this is the line here')
input("{0:-^80}\n".format("HIT CR TO UNSUBSCRIBE AND DISCONNECT FROM \
LIGHTSTREAMER"))
# Disconnecting
ig_stream_service.disconnect()
if __name__ == '__main__':
main()
#######
Then I try to run it on Linux using this command: nohup python marketdata.py
where marketdata.py is basically the Python code above.
Somehow, nohup will not engage. Can any experts/gurus see what I am missing in my code?
I am trying to implement exactly-once semantics using the confluent-kafka library, but I couldn't succeed. Here is my producer code:
from confluent_kafka import Producer, SerializingProducer
import socket, json, random, time
conf = {'bootstrap.servers': "localhost:9092,localhost:9093,localhost:9094",
'client.id': socket.gethostname(),
'message.send.max.retries':2,
'enable.idempotence':True,
'isolation.level':'read_committed'}
producer = Producer(conf)
producer.produce("test_topic2", key="pc", value="ok")
producer.flush()
Here is my consumer code:
from confluent_kafka import Consumer, TopicPartition
conf = {'bootstrap.servers': 'localhost:9092,localhost:9093,localhost:9094',
'group.id': "foo2",
'auto.offset.reset': 'earliest',
'enable.idempotence':True,
'isolation.level':'read_committed'}
consumer = Consumer(conf)
topic = "test_topic2"
#consumer.subscribe([topic],on_assign=my_on_assign)
consumer.subscribe([topic])
# Poll for new messages from Kafka and print them.
try:
while True:
        msg = consumer.poll(1.0)  # wait up to 1 second so the "Waiting..." branch below can run
if msg is None:
print("Waiting...")
elif msg.error():
print("ERROR: %s".format(msg.error()))
else:
# Extract the (optional) key and value, and print.
print(f"{msg.topic()} key = {msg.key()} value = {msg.value().decode('utf-8'):12}\
{msg.partition()} {msg.offset()} ")
except KeyboardInterrupt:
pass
finally:
# Leave group and commit final offsets
consumer.close()
And this is my output (respectively: topic name, key, value, partition and offset):
test_topic2 key = b'pc' value = ok 2 3
test_topic2 key = b'pc' value = ok 2 4
As far as I know, when I send the same key and value, only one of them should be kept. Here, for example, I expect to get only the message whose offset is 3. So my question is: am I implementing the wrong architecture in my code, or do I understand the concept wrongly? Thanks in advance.
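For context on the concept: enable.idempotence only prevents duplicates caused by the producer's internal retries of the same send; it does not deduplicate two separate produce() calls that happen to carry the same key, so seeing offsets 3 and 4 above is expected. End-to-end exactly-once is usually built with the transactional API plus read_committed consumers. Below is a minimal sketch of the transactional produce path, assuming confluent-kafka >= 1.4; the transactional.id is a placeholder, not a value from the setup above.
```
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092,localhost:9093,localhost:9094',
    'enable.idempotence': True,                 # de-duplicates broker-side retries of the same send
    'transactional.id': 'my-transactional-id',  # placeholder; identifies this producer across restarts
})

producer.init_transactions()    # registers the transactional.id and fences older instances
producer.begin_transaction()
producer.produce("test_topic2", key="pc", value="ok")
producer.commit_transaction()   # a consumer with isolation.level=read_committed
                                # only sees messages from committed transactions
```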
I am trying to read data from Azure Event Hub and store this dataframe into a MySQL table in Spark streaming mode.
Below is my pyspark code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import json
from datetime import datetime as dt
from pyspark.sql import DataFrameWriter
try:
session = SparkSession.builder.master("local").appName("dataingestion").config("")
spark = session.getOrCreate()
print("Successfully build spark session : ")
except:
print("Fail to build spark session : ")
raise Exception
startoffset = "#latest"
positionKey1 = {
"ehName": eventhubName,
"partitionId": 0
}
endTime = dt.now().strftime("%Y-%m-%dT%H:%M:%S.%fZ")
eventPosition1 = {
"offset": startoffset,
"seqNo": 0,
"enqueuedTime": endTime,
"isInclusive": True
}
positionMap = {
json.dumps(positionKey1) : eventPosition1
}
# Place the map into the main Event Hub config dictionary
ehreadConf = {}
ehreadConf["eventhubs.startingPositions"] = json.dumps(positionMap)
ehreadConf['eventhubs.connectionString'] = event_hub_sas_key
ehreadConf['eventhubs.consumerGroup'] = "$Default"
ehreadConf['eventhubs.maxEventsPerTrigger'] = 5000
try:
inputStream = spark.readStream.format("eventhubs").options(**ehreadConf).load()
print("Successfully connected the event hub : ")
print("Check streaming is started or not : ", inputStream.isStreaming)
print("Schema of inputStream : ", inputStream.printSchema())
except Exception:
print("Fail to connect with Azure event hub : ")
raise Exception
inputStream = inputStream.withColumn("body", inputStream["body"].cast("string"))
db_target_properties = {"user": username, "password": password, "driver": driver}
def foreach_batch_function(sparkDf, epoch_id):
sparkDf.write.option("batchsize","5000").jdbc(url=url, table="demo",properties=db_target_properties, mode="append")
pass
query = sparkDf.writeStream.outputMode("append").format("com.microsoft.sqlserver.jdbc.spark").option("truncate", 'false').option(
"checkpointLocation", "dbfs:/FileStore/lambda-StreamJob/AzureSql/checkpoint").trigger(processingTime='8 seconds').foreachBatch(foreach_batch_function).start()
query.awaitTermination()
spark.stop()
But I am unable to store this Spark dataframe into the MySQL table.
I am getting an error saying the jdbc data source does not support Spark streaming:
py4j.protocol.Py4JJavaError: An error occurred while calling o68.start.
: java.lang.UnsupportedOperationException: Data source jdbc does not support streamed writing
As the error notes, support for writing from a stream to a JDBC sink is not present in Spark yet (I guess).
Instead of writing the stream directly into MySQL, you can do a foreachBatch on the streaming DataFrame and do the write.jdbc operation inside it.
server_name = "jdbc:mysql://localhost:3306"
database_name = "eventhub"
jdbcurl = server_name + "/" + database_name
table_name = "stream_cdr_data"
db_properties = {"user": "user", "password": "data#123"}
def write_to_mysql(df, epoch_id):
dfwriter = df.write.mode("append")
dfwriter.jdbc(url=jdbcurl, table=table_name, properties=db_properties) # if this is not working use below
#df.write.jdbc(url=jdbcurl, table=table_name, properties=db_properties, mode="append")
pass
query = sparkDf.writeStream.outputMode("append").foreachBatch(write_to_mysql).start()
query.awaitTermination()
For the below issue (the OP asked about it in the comment section):
request/expected seqNo xxxx is less than the received seqNo xxxx. The earliest seqNo is yyyy and the last seqNo is yyyy
Take a look at the FAQ.
I need some help setting the configuration for sasl.mechanism PLAIN (API) and GSSAPI (Kerberos) authentication.
We are using Confluent Kafka here. There are two scripts: one is a Python script and the second is a bash script which calls the Python one. You can find the script below.
Thanks for the help in advance!
import json
import os
import string
import random
import socket
import uuid
import re
from datetime import datetime
import time
import hashlib
import math
import sys
from functools import cache
from confluent_kafka import Producer, KafkaError, KafkaException
topic_name = os.environ['TOPIC_NAME']
partition_count = int(os.environ['PARTITION_COUNT'])
message_key_template = json.loads(os.environ['KEY_TEMPLATE'])
message_value_template = json.loads(os.environ['VALUE_TEMPLATE'])
message_header_template = json.loads(os.environ['HEADER_TEMPLATE'])
bootstrap_servers = os.environ['BOOTSTRAP_SERVERS']
perf_counter_batch_size = int(os.environ.get('PERF_COUNTER_BATCH_SIZE', 100))
messages_per_aggregate = int(os.environ.get('MESSAGES_PER_AGGREGATE', 1))
max_message_count = int(os.environ.get('MAX_MESSAGE_COUNT', sys.maxsize))
def error_cb(err):
""" The error callback is used for generic client errors. These
errors are generally to be considered informational as the client will
automatically try to recover from all errors, and no extra action
is typically required by the application.
For this example however, we terminate the application if the client
is unable to connect to any broker (_ALL_BROKERS_DOWN) and on
authentication errors (_AUTHENTICATION). """
print("Client error: {}".format(err))
if err.code() == KafkaError._ALL_BROKERS_DOWN or \
err.code() == KafkaError._AUTHENTICATION:
# Any exception raised from this callback will be re-raised from the
# triggering flush() or poll() call.
raise KafkaException(err)
def acked(err, msg):
if err is not None:
print("Failed to send message: %s: %s" % (str(msg), str(err)))
producer_configs = {
'bootstrap.servers': bootstrap_servers,
'client.id': socket.gethostname(),
'error_cb': error_cb
}
# TODO: Need to support sasl.mechanism PLAIN (API) and GSSAPI (Kerberos) authentication.
# TODO: Need to support truststores for connecting to private DCs.
producer = Producer(producer_configs)
# generates a random value if it is not cached in the template_values dictionary
def get_templated_value(term, template_values):
if not term in template_values:
template_values[term] = str(uuid.uuid4())
return template_values[term]
def fill_template_value(value, template_values):
str_value = str(value)
template_regex = '{{(.+?)}}'
templated_terms = re.findall(template_regex, str_value)
for term in templated_terms:
str_value = str_value.replace(f"{{{{{term}}}}}", get_templated_value(term, template_values))
return str_value
def fill_template(template, templated_terms):
# TODO: Need to address metadata field, as it's treated as a string instead of a nested object.
return {field: fill_template_value(value, templated_terms) for field, value in template.items()}
@cache
def get_partition(lock_id):
bits = 128
bucket_size = 2**bits / partition_count
partition = (int(hashlib.md5(lock_id.encode('utf-8')).hexdigest(), 16) / bucket_size)
return math.floor(partition)
sequence_number = int(time.time() * 1000)
sequence_number = 0
message_count = 0
producing = True
start_time = time.perf_counter()
aggregate_message_counter = 0
# cache for templated term values so that they match across the different templates
templated_values = {}
try:
while producing:
sequence_number += 1
aggregate_message_counter += 1
message_count += 1
if aggregate_message_counter % messages_per_aggregate == 0:
# reset templated values
templated_values = {}
else:
for term in list(templated_values):
if term not in ['aggregateId', 'tenantId']:
del(templated_values[term])
# Fill in templated field values
message_key = fill_template(message_key_template, templated_values)
message_value = fill_template(message_value_template, templated_values)
message_header = fill_template(message_header_template, templated_values)
ts = datetime.utcnow().isoformat()[:-3]+'Z'
message_header['timestamp'] = ts
message_header['sequence_number'] = str(sequence_number)
message_value['timestamp'] = ts
message_value['sequenceNumber'] = sequence_number
lock_id = message_header['lock_id']
partition = get_partition(lock_id) # partition by lock_id, since key could be random, but a given aggregate_id should ALWAYS resolve to the same partition, regardless of key.
# Send message
producer.produce(topic_name, partition=partition, key=json.dumps(message_key), value=json.dumps(message_value), headers=message_header, callback=acked)
if sequence_number % perf_counter_batch_size == 0:
producer.flush()
end_time = time.perf_counter()
total_duration = end_time - start_time
messages_per_second=(perf_counter_batch_size/total_duration)
print(f'{messages_per_second} messages/second')
# reset start time
start_time = time.perf_counter()
if message_count >= max_message_count:
break
except Exception as e:
print(f'ERROR: %s' % e)
sys.exit(1)
finally:
producer.flush()
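On the sasl.mechanism and truststore TODOs above, here is a minimal sketch of what the extra settings might look like for the two mechanisms. The property names come from librdkafka / confluent-kafka; the usernames, principal, keytab path and CA bundle below are placeholders, not values from this environment.
```
# SASL/PLAIN (e.g. username/password or API-key style credentials) -- placeholder values
sasl_plain = {
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'PLAIN',
    'sasl.username': '<username-or-api-key>',
    'sasl.password': '<password-or-api-secret>',
}

# SASL/GSSAPI (Kerberos) -- placeholder principal, keytab and service name
sasl_gssapi = {
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'GSSAPI',
    'sasl.kerberos.service.name': 'kafka',
    'sasl.kerberos.principal': 'producer@EXAMPLE.COM',
    'sasl.kerberos.keytab': '/path/to/producer.keytab',
}

# For a private CA ("truststore"), point librdkafka at a PEM bundle:
# producer_configs['ssl.ca.location'] = '/path/to/ca-bundle.pem'

# Merge the chosen mechanism into the existing producer_configs before
# constructing the Producer, e.g.:
# producer_configs.update(sasl_plain)
```
Whether these values are read from environment variables like the rest of the script, or passed in by the bash wrapper, is a separate choice.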
I am able to push data from Kafka to MemSQL.
I am trying to push it using a Transform. I have created a Kafka consumer in Python which consumes data from a Kafka topic and converts it to JSON format.
I don't know how to use this as a Transform in MemSQL.
from confluent_kafka import KafkaError
from confluent_kafka.avro import AvroConsumer
from confluent_kafka.avro.serializer import SerializerError
import sys
c = AvroConsumer({
'bootstrap.servers': 'X.Y.Z.W:9092',
'group.id': 'groupid1112',
'schema.registry.url': 'http://X.Y.Z.W:8081',
'default.topic.config': {
'auto.offset.reset': 'smallest'
}
})
c.subscribe(['test_topic'])
count =0
while True:
try:
msg = c.poll(10)
except SerializerError as e:
print("Message deserialization failed for {}: {}".format(msg, e))
break
if msg is None:
continue
if msg.error():
if msg.error().code() == KafkaError._PARTITION_EOF:
continue
else:
print(msg.error())
break
valueList = list(msg.value().values())
print(valueList)
c.close()
It's printing
[1518776144187, 1, 2, 103,'asas',asas'eer',None]
Check these docs:
https://docs.memsql.com/memsql-pipelines/v6.0/transforms/
Stay tuned for native Avro support in an upcoming MemSQL release.
You'll want to do something like the following, but with the Avro-specific details sketched over, since I don't know the Avro library off the top of my head.
```
import struct
import sys

def input_stream():
"""
Consume STDIN and yield each record that is received from MemSQL
"""
while True:
byte_len = sys.stdin.read(8)
if len(byte_len) == 8:
byte_len = struct.unpack("L", byte_len)[0]
result = sys.stdin.read(byte_len)
yield result
else:
assert len(byte_len) == 0, byte_len
return
avro_context = WhateverYouNeed() # maybe connect to schema registry here if you need to
for msg in input_stream():
object = DeserializeAvro(avro_context, msg) # this is your code
sys.stdout.write(SerializeToTSV(object)) # also your code
```
Using the schema registry should be fine, but you shouldn't need to worry about the details of reading from Kafka in your transform script. I can try to get you a more detailed script on Monday, but this is how to structure the code.
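For the Avro-specific piece the answer deliberately leaves open, one possible way to fill in the DeserializeAvro / SerializeToTSV placeholders (my assumption, not the answer's code) is the schema-registry-aware deserializer bundled with confluent-kafka-python; the registry URL is the same placeholder as in the question, and the column list is hypothetical.
```
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

# Placeholder registry URL, matching the question's schema.registry.url
registry_client = CachedSchemaRegistryClient(url='http://X.Y.Z.W:8081')
serializer = MessageSerializer(registry_client)

def deserialize_avro(msg_bytes):
    # decode_message() understands the Confluent wire format
    # (magic byte + schema id + Avro payload) and returns a dict.
    return serializer.decode_message(msg_bytes)

def serialize_to_tsv(record):
    # Emit one tab-separated line per record; the column order must match
    # the column list declared in the MemSQL pipeline (placeholder names here).
    columns = ['id', 'value']
    return '\t'.join(str(record[c]) for c in columns) + '\n'
```
The transform itself would then be attached to the pipeline with the WITH TRANSFORM clause described in the docs linked above.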