I have two clients (separate docker containers) both writing to a Cassandra cluster.
The first is writing real-time data, which is ingested at a rate that the cluster can handle, albeit with little spare capacity. This is regarded as high-priority data and we don't want to drop any. The ingestion rate varies quite a lot from minute to minute. Sometimes data backs up in the queue from which the client reads and at other times the client has cleared the queue and is (briefly) waiting for more data.
The second is a bulk data dump from an online store. We want to write it to Cassandra as fast as possible at a rate that soaks up whatever spare capacity there is after the real-time data is written, but without causing the cluster to start issuing timeouts.
Using the DataStax Python driver and keeping the two clients separate (i.e. they shouldn't have to know about or interact with each other), how can I throttle writes from the second client such that it maximises write throughput subject to the constraint of not impacting the write throughput of the first client?
The solution I came up with was to make both data producers write to the same queue.
To meet the requirement that the low-priority bulk data doesn't interfere with the high-priority live data, I made the producer of the low-priority data check the queue length and then add a record to the queue only if the queue length is below a suitable threshold (in my case 5 messages).
The result is that no live data message can have more than 5 bulk data messages in front of it in the queue. If messages start backing up on the queue then the bulk data producer stops queuing more data until the queue length falls below the threshold.
I also split the bulk data into many small messages so that they are relatively quick to process by the consumer.
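A minimal sketch of the gating logic described above. It uses Python's in-process queue.Queue as a stand-in for the real queue server, which is an assumption; the names enqueue_bulk, THRESHOLD and poll_interval are hypothetical and not part of any library.

```python
import queue
import time

THRESHOLD = 5  # assumed limit: bulk records are queued only below this length


def enqueue_bulk(shared_queue, records, threshold=THRESHOLD,
                 poll_interval=0.01, sleep=time.sleep):
    """Queue low-priority records only while the shared queue is short.

    Returns the number of records queued.
    """
    sent = 0
    for record in records:
        # Poll the queue length and back off while the queue is too long.
        # qsize() is only a snapshot, so the threshold is not strict.
        while shared_queue.qsize() >= threshold:
            sleep(poll_interval)
        shared_queue.put(("bulk", record))
        sent += 1
    return sent
```

The live-data producer would simply call shared_queue.put(("live", record)) without any check, so it is never delayed by the threshold.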
There are three disadvantages of this approach:
There is no visibility of how many queued messages are low priority and how many are high priority. However we know that there can't be more than 5 low priority messages.
The producer of low-priority messages has to poll the queue to get the current length, which generates a small extra load on the queue server.
The threshold isn't applied strictly because there is a race between the two producers from checking the queue length to queuing a message. It's not serious because the low-priority producer queues only a single message when it loses the race and next time it will know the queue is too long and wait.
I have built a way to distribute a verification-based application over multiple machines using the following excellent blog post as a starting point:
https://eli.thegreenplace.net/2012/01/24/distributed-computing-in-python-with-multiprocessing
…which itself is based on the Python multiprocessing docs (see the section on managers, remote managers and proxy objects):
https://docs.python.org/3/library/multiprocessing.html
However, for my application at least, I have noticed that the networking costs are not constant, but rather grow (at first glance linearly) with the overall verification time, to the point where the cost is no longer acceptable (from a handful of seconds to hundreds). Is there a way to minimise these costs, and what is most likely to be driving them?
In my setup, the server machine opens a server at a given port. The clients then connect and read jobs from a single jobs queue and put finished/solved jobs to a single reporting queue - these queues are registered multiprocessing.Queue() queues. When a result to the verification problem is returned, or the queue is empty, the server sends a multiprocessing.Event() signal to the clients, allowing them to terminate. With clients terminated, the server/controller machine then shuts down the server.
In terms of costs, I can think of the following:
the cost of opening the server (paid once)
the cost of keeping the server ‘open’ (is this a real cost?)
the cost of accepting connections from clients (paid once per client)
the cost of reading/writing from/to the serialized queues (variable)
In terms of the last cost, writing to the jobs queue by the controller machine is performed only once (though the number of items can vary). The amount a client reads from the jobs queue and writes to the reporting queue is variable, but tends to occur only a handful of times before the overall verification problem is complete, as opposed to hundreds of times.
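For reference, the server side of the setup described above might look roughly like the sketch below, which follows the remote-manager pattern from the multiprocessing docs. The accessor names, port and authkey are assumptions, and plain queue.Queue and threading.Event objects stand in for the multiprocessing ones. Note that every call a client makes on one of these proxies (each get, put or is_set) is a pickled round trip to the server.

```python
import queue
import threading
from multiprocessing.managers import BaseManager

# Shared objects that live in the server process.
job_q = queue.Queue()
result_q = queue.Queue()
done = threading.Event()


def get_job_q():
    return job_q


def get_result_q():
    return result_q


def get_done():
    return done


class QueueManager(BaseManager):
    """Manager exposing the two queues and the termination event."""


QueueManager.register("get_job_q", callable=get_job_q)
QueueManager.register("get_result_q", callable=get_result_q)
QueueManager.register("get_done", callable=get_done)


def make_server(port=50000, authkey=b"change-me"):
    """Build the server object; call .serve_forever() on it to block."""
    manager = QueueManager(address=("", port), authkey=authkey)
    return manager.get_server()
```

A client would register the same three names without callables, call connect(), and then use manager.get_job_q() and friends as proxies.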
This probably has multiple questions, so bear with me. I am still figuring out the right way to use the Kafka architecture. I know that the partitions of a topic are divided between the consumers.
What exactly are consumers? Right now, I am thinking of writing a Python daemon process that acts as a consumer. When the consumer consumes a message from Kafka, there is a task that I have to complete. This is a huge task, so I am creating sub-tasks that run concurrently. Can I have multiple consumers (Python scripts) on the same machine?
I have multiple microservices that I am working on, so does each microservice have its own consumer?
When the load increases I have to scale the consumers. I thought of spawning a new machine that acts as another consumer. But I feel that I am doing something wrong here and that there has to be a better way.
Can you tell me how you scaled your consumers based on the load? Do I have to increase my partitions in topics if I need to increase my consumers? How do I do it dynamically? Can I decrease the partitions when there are fewer messages produced? How many partitions are ideal initially?
And please suggest some good practices to follow.
This is the consumer script that I am using
while True:
    message = client.poll(timeout=10)  # client is the KafkaConsumer object
    if message is not None:
        if message.error():
            raise KafkaException(message.error())
        else:
            logger.info('received topic {topic} partition {partition} offset {offset} key {key} - {value}'.format(
                topic=message.topic(),
                partition=message.partition(),
                offset=message.offset(),
                key=message.key(),
                value=message.value()
            ))
            # run task
Can I have multiple consumers(python scripts) on the same machine?
Yes. You can also have Python threads, though.
If you're not consuming multiple topics, then there is no need for multiple consumers.
What exactly are consumers?
Feel free to read over the Apache Kafka site...
each microservice has its own consumer?
Is each service running similar code? Then yes.
I thought of spawning a new machine
Spawn new instances of your app on one machine. Monitor CPU and Mem and Network load. Don't get new machines until at least one of those is above say 70% under normal processing.
Do I have to increase my partitions in topics if I need to increase my consumers?
In general, yes. The number of consumers in a consumer group is limited by the number of partitions in the subscribed topics.
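To illustrate why the partition count caps the useful number of consumers, here is a simplified sketch of spreading partitions over the members of one consumer group. It is not Kafka's actual assignor, and assign_partitions is a hypothetical helper; the point is only that each partition goes to exactly one consumer, so extra consumers get nothing.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over consumers in one group.

    Each partition is owned by exactly one consumer; consumers beyond
    the number of partitions end up with an empty list (idle).
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    result = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
    idle = [name for name, owned in result.items() if not owned]
    print(result)
    print("idle consumers:", idle)
```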
Can I decrease the partitions when there are fewer messages produced?
No. The partition count of a topic can be increased but not decreased.
When the load increases I have to scale the consumers
Not necessarily. Is the increased load constantly rising, or are there waves of it? If variable, then you can let Kafka buffer the messages. And the consumer will keep polling and processing as fast as it can.
You need to define your SLAs for how long a message will take to process after reaching a topic from a producer.
How many partitions are ideal initially?
There are multiple articles on this, and it depends specifically on your own hardware and application requirements. Simply logging each message, you could have thousands of partitions...
When the consumer consumes a message from Kafka, there is a task that I have to complete
Sounds like you might want to look at Celery, not necessarily just Kafka. You could also look at Faust for Kafka processing
I am very new to AWS SQS queues and I am currently playing around with boto. I noticed that when I try to read a queue filled with messages in a while loop, after 10-25 messages are read the queue does not return any more messages (even though the queue holds more than 1000 messages). It starts returning another set of 10-25 messages after a few seconds, or on stopping and restarting the program.
while True:
    read_queue()  # connection is already established with the desired queue.
Any thoughts on this behaviour, or can you point me in the right direction? Just reiterating that I am only a couple of days into SQS!
Thanks
That's the way that SQS queues work by default (short polling). If you haven't changed any settings after setting up your queue, the default is to get messages from a weighted random sampling of machines. If you're using more than one machine and want all the messages you can consume at that moment (across all machines), you need to use long polling. See the Amazon documentation here. I don't think boto supports that directly ATM.
Long polling is more efficient because it allows you to leave the HTTP connection open for a period of time while you wait for more results. However, you can still do your own polling in boto by just setting up a loop and waiting for some period of time between reading the queue. You can still get good overall throughput with this polling strategy.
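A hedged sketch of such a read loop using long polling. It assumes the newer boto3 library rather than the original boto, and that sqs is a client created by the caller with boto3.client("sqs"); the helper name read_all, the handle callback and the max_empty_polls cut-off are hypothetical.

```python
def read_all(sqs, queue_url, handle, max_empty_polls=3, wait_seconds=20):
    """Read messages with long polling until the queue looks empty.

    sqs is assumed to be a boto3 SQS client. Returns the number of
    messages handled and deleted.
    """
    handled = 0
    empty_polls = 0
    while empty_polls < max_empty_polls:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,        # SQS returns at most 10 per call
            WaitTimeSeconds=wait_seconds,  # > 0 enables long polling
        )
        messages = response.get("Messages", [])
        if not messages:
            empty_polls += 1
            continue
        empty_polls = 0
        for message in messages:
            handle(message["Body"])
            # Delete only after the message was handled successfully.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            handled += 1
    return handled
```

A single call never returns more than 10 messages, so a loop is needed either way; long polling just makes the empty responses rarer.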
I'm using Pool.map from the multiprocessing library to iterate through a large XML file and save word and ngram counts into a set of three redis servers (which sit completely in memory). But for some reason all 4 CPU cores sit around 60% idle the whole time. The server has plenty of RAM and iotop shows that there is no disk IO happening.
I have 4 python threads and 3 redis servers running as daemons on three different ports. Each Python thread connects to all three servers.
The number of redis operations on each server is well below what it's benchmarked as capable of.
I can't find the bottleneck in this program. What would be likely candidates?
Network latency may be contributing to your idle CPU time in your python client application. If the network latency between client to server is even as little as 2 milliseconds, and you perform 10,000 redis commands, your application must sit idle for at least 20 seconds, regardless of the speed of any other component.
Using multiple python threads can help, but each thread will still go idle when a blocking command is sent to the server. Unless you have very many threads, they will often synchronize and all block waiting for a response. Because each thread is connecting to all three servers, the chances of this happening are reduced, except when all are blocked waiting for the same server.
Assuming you have uniformly distributed random access across the servers to service your requests (by hashing on key names to implement sharding or partitioning), then the odds that two random requests will hash to the same redis server are inversely proportional to the number of servers. For 1 server, 100% of the time you will hash to the same server, for 2 it's 50% of the time, for 3 it's 33% of the time. What may be happening is that a good part of the time, several of your threads are blocked waiting for the same server. Redis is single-threaded at handling data operations, so it must process each request one after another. Your observation that your CPU cores sit around 60% idle is consistent with your requests often being blocked on network latency to the same server.
Continuing the assumption that you are implementing client-side sharding by hashing on key names, you can eliminate the contention between threads by assigning each thread a single server connection, and evaluate the partitioning hash before passing a request to a worker thread. This will ensure all threads are waiting on different network latency. But there may be an even better improvement by using pipelining.
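A minimal sketch of evaluating the partitioning hash before handing work to a worker, assuming one worker (and one redis connection) per server. The CRC32 hash, the per-shard queue.Queue objects and the names shard_for and route are assumptions chosen for illustration.

```python
import queue
import zlib


def shard_for(key, num_servers):
    """Map a key name to a server index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_servers


def route(items, shard_queues):
    """Put each (key, count) pair on the queue of the worker that owns it.

    shard_queues[i] is read by the single worker connected to server i,
    so no two workers ever wait on the same server.
    """
    for key, count in items:
        index = shard_for(key, len(shard_queues))
        shard_queues[index].put((key, count))
```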
You can reduce the impact of network latency by using the pipeline feature of the redis-py module, if you don't need an immediate result from the server. This may be viable for you, since you are storing the results of data processing into redis, it seems. To implement this using redis-py, periodically obtain a pipeline handle to an existing redis connection object using the .pipeline() method and invoke multiple store commands against that new handle the same as you would for the primary redis.Redis connection object. Then invoke .execute() to block on the replies. You can get orders of magnitude improvement by using pipelining to batch tens or hundreds of commands together. Your client thread won't block until you issue the final .execute() method on the pipeline handle.
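A sketch of batching counter updates through a pipeline, assuming r is a redis.Redis connection created by the caller and that the counts are stored with INCRBY. The helper name flush_counts and the batch size of 100 are hypothetical choices.

```python
def flush_counts(r, counts, batch_size=100):
    """Send word counts to redis in pipelined batches.

    r is assumed to be a redis.Redis connection. Returns the number of
    network round trips (one per execute() call).
    """
    pipe = r.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
    pending = 0
    round_trips = 0
    for key, amount in counts.items():
        pipe.incrby(key, amount)  # buffered locally, nothing sent yet
        pending += 1
        if pending >= batch_size:
            pipe.execute()  # one round trip for the whole batch
            round_trips += 1
            pending = 0
    if pending:
        pipe.execute()
        round_trips += 1
    return round_trips
```

With a batch size of 100, 10,000 commands need about 100 round trips instead of 10,000.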
If you apply both changes, so that each worker thread communicates with just one server and pipelines multiple commands together (at least 5-10 to see a significant result), you may see greater CPU usage in the client (nearer to 100%). The CPython GIL will still limit the client to one core, but it sounds like you are already using other cores for the XML parsing by using the multiprocessing module.
There is a good writeup about pipelining on the redis.io site.
I am receiving tweets at an extremely fast rate from a long-lived connection to the Twitter API Streaming Server. I then do some heavy text processing and save the tweets in my database.
I am using PyCurl for the connection and a callback function that takes care of the text processing and saving to the db. See below my approach, which is not working properly.
I am not familiar with network programming, so I would like to know:
How can I use Threads, Queue or the Twisted framework to solve this problem?
import pycurl

def process_tweet(data):
    # do some heavy text processing on the received chunk
    pass

def open_stream_connection():
    connect = pycurl.Curl()
    connect.setopt(pycurl.URL, STREAMURL)
    connect.setopt(pycurl.WRITEFUNCTION, process_tweet)
    connect.setopt(pycurl.USERPWD, "%s:%s" % (TWITTER_USER, TWITTER_PASS))
    connect.perform()
You should have a number of threads receiving the messages as they come in. That number should probably be 1 if you are using pycurl, but should be higher if you are using httplib - the idea being you want to be able to have more than one query on the Twitter API at a time, so there is a steady amount of work to process.
When each Tweet arrives, it is pushed onto a Queue.Queue. The Queue ensures that there is thread-safety in the communications - each tweet will only be handled by one worker thread.
A pool of worker threads is responsible for reading from the Queue and dealing with the Tweet. Only the interesting tweets should be added to the database.
As the database is probably the bottleneck, there is a limit to the number of threads in the pool that are worth adding - more threads won't make it process faster, it'll just mean more threads are waiting in the queue to access the database.
This is a fairly common Python idiom. This architecture will scale only to a certain degree - i.e. what one machine can process.
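A minimal sketch of this receiver/queue/worker-pool idiom, written for Python 3 (where Queue.Queue is queue.Queue). The function name run_workers, the None sentinel and the pool size of 4 are assumptions; in the real program the pycurl callback would call work.put(tweet) instead of iterating over a list.

```python
import queue
import threading


def run_workers(tweets, process, num_workers=4):
    """Feed tweets through a queue to a pool of worker threads."""
    work = queue.Queue(maxsize=1000)  # bounded, so the receiver cannot run far ahead
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            tweet = work.get()
            if tweet is None:  # sentinel: no more work for this thread
                break
            outcome = process(tweet)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for tweet in tweets:  # stands in for the streaming callback
        work.put(tweet)
    for _ in threads:  # one sentinel per worker
        work.put(None)
    for thread in threads:
        thread.join()
    return results
```

The order of the results is not guaranteed, because workers finish independently.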
Here's a simple setup if you are OK with using a single machine.
1 thread accepts connections. After a connection is accepted, it passes the accepted connection to another thread for processing.
You can, of course, use processes (e.g., using multiprocessing) instead of threads, but I'm not familiar enough with multiprocessing to give advice. The setup would be the same: 1 process accepts connections, then passes them to subprocesses.
If you need to shard the processing across multiple machines, then the simple thing to do would be to stuff the message into the database, then notify the workers about the new record (this will require some sort of coordination/locking between the workers). If you want to avoid hitting the database, then you'll have to pipe messages from your network process to the workers (and I'm not well versed enough in low level networking to tell you how to do that :))
I suggest this organization:
one process reads Twitter, stuffs tweets into database
one or more processes read the database, process each tweet, and insert the result into a new database. Original tweets are either deleted or marked processed.
That is, you have two or more processes/threads. The tweet database could be seen as a queue of work. Multiple worker processes take jobs (tweets) off the queue, and create data in the second database.