I am trying to move my Python code from dynamodb to dynamodb2 to get access to the global secondary index capability. One concept that is a lot less clear to me in ddb2 than in ddb is that of a batch. Here's one version of my new code, which was basically modified from my original ddb code:
item_pIds = []
batch = table.batch_write()
count = 0
while True:
    m = inq.read()
    count = count + 1
    mStr = json.dumps(m)
    pid = m['primaryId']
    if pid in item_pIds:
        print "pid=%d already exists in the batch, ignoring" % pid
        continue
    item_pIds.append(pid)
    sid = m['secondaryId']
    item_data = {"primaryId": pid, "secondaryId": sid, "message": mStr}
    batch.put_item(data=item_data)
    if count >= 25:
        batch = table.batch_write()
        count = 0
        item_pIds = []
So what I am doing here is getting (JSON) messages from a queue. Each message has a primaryId and a secondaryId. The secondaryId is not unique, in that I might get several messages at about the same time that have the same one. The primaryId is sort of unique: if I get a set of messages at about the same time that have the same primaryId, that's bad. However, from time to time, say once every few hours, I may get a message that needs to override an existing message with the same primaryId. So this seems to align well with the statement from the dynamodb2 documentation page, similar to that of ddb:
DynamoDB’s maximum batch size is 25 items per request. If you attempt to put/delete more than that, the context manager will batch as many as it can up to that number, then flush them to DynamoDB and continue batching as more calls come in.
However, what I noticed is that a large chunk of messages that I get through the queue never make it to the database. That is, when I try to retrieve them later, they are not there. So I was told that a better way of handling batch writes is by doing something like this:
with table.batch_write() as batch:
    while True:
        m = inq.read()
        mStr = json.dumps(m)
        pid = m['primaryId']
        sid = m['secondaryId']
        item_data = {"primaryId": pid, "secondaryId": sid, "message": mStr}
        batch.put_item(data=item_data)
That is, I only call batch_write() once, similar to how I would open a file only once and then write into it continuously. But in this case, I don't understand what the "rule of 25 max" means. When does a batch start and end? And how do I check for duplicate primaryIds? That is, remembering all messages that I ever received through the queue is not realistic, since (i) I have too many of them (the system runs 24/7) and (ii) as I stated before, occasional repeated ids are OK.
Sorry for the long message.
A batch will start whenever the request is sent and end when the last request in the batch is completed.
As with any RESTful API, every request comes with a cost, meaning how many resources it takes to complete said request. With the batch_write() class in DynamoDB2, the requests are wrapped into a group and a queue is created to process them, which reduces the cost because they are no longer individual requests.
The batch_write() class returns a context manager that handles the individual requests and what you get back slightly resembles a Table object but only has the put_item and delete_item requests.
DynamoDB's max batch size is 25, just like you've read. From the comments in the source code:
DynamoDB's maximum batch size is 25 items per request. If you attempt
to put/delete more than that, the context manager will batch as many
as it can up to that number, then flush them to DynamoDB & continue
batching as more calls come in.
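For example, here is a minimal sketch of the recommended pattern, assuming the same table and inq objects as in the question; the bounded recent_pids deque used for de-duplication is my own addition, not part of boto:

import json
from collections import deque

# Keep only the most recent primary ids, so the duplicate check stays bounded
# in memory. (Illustrative helper for this sketch -- not part of boto.)
recent_pids = deque(maxlen=25)

with table.batch_write() as batch:
    while True:
        m = inq.read()
        pid = m['primaryId']
        if pid in recent_pids:
            # DynamoDB rejects a batch request that contains duplicate keys,
            # so skip duplicates while they may still be in the unflushed window.
            continue
        recent_pids.append(pid)
        batch.put_item(data={
            "primaryId": pid,
            "secondaryId": m['secondaryId'],
            "message": json.dumps(m),
        })
        # The context manager flushes automatically after every 25 puts, and
        # flushes any remainder only when the `with` block exits -- so items
        # can sit unflushed if the loop blocks before reaching a multiple of 25.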
You can also read about migrating, batches in particular, from DynamoDB to DynamoDB2 here.
In PyQt5, I want to read my serial port after writing (requesting a value) to it. I've got it working using readyRead.connect(self.readingReady), but then I'm limited to outputting to only one text field.
The code for requesting parameters sends a string to the serial port. After that, I'm reading the serial port using the readingReady function and printing the result to a plainTextEdit form.
def read_configuration(self):
    if self.serial.isOpen():
        self.serial.write(f"?request1\n".encode())
        self.label_massGainOutput.setText(f"{self.serial.readAll().data().decode()}"[:-2])

        self.serial.write(f"?request2\n".encode())
        self.serial.readyRead.connect(self.readingReady)

        self.serial.write(f"?request3\n".encode())
        self.serial.readyRead.connect(self.readingReady)

def readingReady(self):
    data = self.serial.readAll()
    if len(data) > 0:
        self.plainTextEdit_commandOutput.appendPlainText(f"{data.data().decode()}"[:-2])
    else:
        self.serial.flush()
The problem I have is that I want every answer from the serial port to go to a different plainTextEdit form. The only solution I see now is to write a separate readingReady function for every request (and I have a lot! Only three are shown here). There must be a better way to do this. Maybe using arguments in the readingReady function? Or returning a value from the function that I can redirect to the correct form?
Without using the readyRead signal, all my values are one behind. So the first request prints nothing, the second prints the first, etc., and the last is never printed.
Does someone have a better way to implement this functionality?
QSerialPort has an asynchronous API (readyRead) and a synchronous API (waitForReadyRead). If you only read the configuration once at startup and UI freezing during this process is not critical for you, you can use the synchronous API.
serial.write(f"?request1\n".encode())
serial.waitForReadyRead()
res = serial.read(10)
serial.write(f"?request2\n".encode())
serial.waitForReadyRead()
res = serial.read(10)
This simplification assumes that responses come in one chunk and that the message size is at most 10 bytes, which is not guaranteed. The actual code should be something like this:
def isCompleteMessage(res):
    # your code here
    pass

serial.write(f"?request2\n".encode())
res = b''
while not isCompleteMessage(res):
    serial.waitForReadyRead()
    res += serial.read(10)
Alternatively, you can create a worker or thread, open the port and query the requests in it synchronously, and deliver the responses to the application using signals: no freezes, clear code, a slightly more complicated system. A rough sketch of that approach is below.
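This is only a sketch under a few assumptions: the ConfigWorker class, its resultReady signal, the port name, and the CR+LF terminator are illustrative choices, not an existing API.

from PyQt5.QtCore import QThread, pyqtSignal
from PyQt5.QtSerialPort import QSerialPort


class ConfigWorker(QThread):
    # Emits the request string together with its decoded response, so the UI
    # can route each answer to its own plainTextEdit / label.
    resultReady = pyqtSignal(str, str)

    def __init__(self, portName, requests, parent=None):
        super().__init__(parent)
        self.portName = portName
        self.requests = requests

    def run(self):
        serial = QSerialPort()
        serial.setPortName(self.portName)
        if not serial.open(QSerialPort.ReadWrite):
            return
        for req in self.requests:
            serial.write(f"{req}\n".encode())
            res = b''
            # assumption: every response ends with CR+LF, as the [:-2] slicing suggests
            while not res.endswith(b'\r\n'):
                if not serial.waitForReadyRead(1000):
                    break  # timed out; emit whatever arrived
                res += bytes(serial.readAll())
            self.resultReady.emit(req, res.decode(errors='replace').strip())
        serial.close()


# usage in the main window (illustrative):
# self.worker = ConfigWorker("COM3", ["?request1", "?request2", "?request3"])
# self.worker.resultReady.connect(self.dispatchResult)  # dispatchResult(req, text) picks the widget
# self.worker.start()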
I have written a Python script utilizing the Python-CAN library which records received CAN messages at a 1 second rate for 5 minutes, before logging all the messages into a file and exiting. The computer has a CAN module which is connecting to the CAN bus. (The other device on the bus is an engine) I communicate with it using the SocketCAN interface.
The test engine system that this computer is connected to is sending around 114 messages at what I believe is a 250 kbit/s bit rate. I am expecting to see 114 messages recorded in the file for each 1 second period, but instead I'm seeing about half that count (~65 messages).
Could it be possible that the engine's ECU is set to a 500 kbit/s bit rate, and that's why I'm not getting the count I am expecting? I would think there would be no communication if the bit rates did not match, but I do not have physical access to the system because I'm sending the script remotely through an OTA update and not running it myself. (The device is headless, but is set up to run the script on startup.) I just see the log files that are generated.
Here is the python code:
(A note, I have code parsing the received messages into the contained signals, but I did not include this code here because it happens at the end, and it is not relevant)
class logging:
    def __init__(self):
        # Dictionary to hold received CAN messages
        self.message_Dict = {}
        # List to hold queued dictionaries
        self.message_Queue = []
        # A "filters" object is also created here, but I did not include it
        # I have verified the filters are correct on my test system

    def main(self):
        # Record the current time
        currentTime = datetime.datetime.now()
        # Record the overall start time
        startTime = datetime.datetime.now()
        # Record the iteration start time
        lastIterationStartTime = currentTime
        # Create the CanBus that will be used to send and receive CAN msgs from the MCU
        canbus = can.interfaces.socketcan.SocketcanBus(channel='can0', bitrate=250000)
        # These filters are setup correctly, because all the messages come through
        # on my test system, but I did not include them here
        canbus.set_filters(self.Filters)
        # Creating Listener filters and notifier
        listener = can.Listener()
        # Main loop
        while 1:
            # create a variable to hold received data
            msg2 = canbus.recv()
            # Record the current time
            currentTime = datetime.datetime.now()
            # If a valid message is detected
            if(msg2 != None):
                if(len(msg2.data) > 0):
                    try:
                        # Save the message data into a queue (will be processed later)
                        self.message_Dict[msg2.arbitration_id] = msg2.data
                    except:
                        print("Error in storing CAN message")
            # If 1 second has passed since the last iteration,
            # add the dictionary to a new spot in the queue
            if((currentTime - lastIterationStartTime) >= datetime.timedelta(seconds=1)):
                # Add the dictionary with messages into the queue for later processing
                messageDict_Copy = self.message_Dict.copy()
                self.message_Queue.append(messageDict_Copy)
                print("Number of messages in dictionary: " + str(len(self.message_Dict)) +
                      " Number of reports in queue: " + str(len(self.message_Queue)))
                # Clear the dictionary for new messages for every iteration
                self.message_Dict.clear()
                # Record the reset time
                lastIterationStartTime = datetime.datetime.now()
            # Once 5 minutes of data has been recorded, write to the file
            if((currentTime - startTime) > datetime.timedelta(minutes=5)):
                # Here is where I write the data to a file. This is too long to include
                # Clear the queue
                self.message_Queue = []
                # Clear the dictionary for new messages for every iteration
                self.message_Dict.clear()

# When the script is run, execute the Main method
if __name__ == '__main__':
    mainClass = logging()
    mainClass.main()
I appreciate any ideas or input you have. Thank you
In my experience, most engine ECUs use 250 kbit/s, but the newest ones use 500 kbit/s. I would suggest you try both.
Also, messages will only appear on the bus if they are actually being sent. It seems silly, but take a truck for example: if you don't step on the accelerator, the messages referring to the accelerator will not appear. So maybe you need to check whether all components are being used as you expect. can-utils also includes a CAN sniffer that can help you with this.
I suggest you use can-utils for this kind of work; it is a powerful set of tools for CAN analysis.
Did you try looping over the bit rates? That may also help you find the right one, as sketched below.
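As a rough illustration of that suggestion, here is a sketch that brings can0 up at each candidate bit rate and counts frames for a few seconds. The candidate list and the 5-second window are arbitrary choices for this example, and reconfiguring the interface requires root privileges:

import subprocess
import time

import can


def count_frames(bitrate, seconds=5, channel='can0'):
    """Bring the interface up at the given bit rate and count received frames."""
    subprocess.call(['ip', 'link', 'set', channel, 'down'])
    subprocess.call(['ip', 'link', 'set', channel, 'up', 'type', 'can',
                     'bitrate', str(bitrate)])
    bus = can.interface.Bus(channel=channel, bustype='socketcan')
    count = 0
    deadline = time.time() + seconds
    while time.time() < deadline:
        if bus.recv(timeout=0.1) is not None:
            count += 1
    bus.shutdown()
    return count


for rate in (250000, 500000):  # candidate bit rates
    print("%d bit/s -> %d frames in 5 s" % (rate, count_frames(rate)))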
I subscribe to a real time stream which publishes a small JSON record at a slow rate (0.5 KBs every 1-5 seconds). The publisher has provided a python client that exposes these records. I write these records to a list in memory. The client is just a python wrapper for doing a curl command on a HTTPS endpoint for a dataset. A dataset is defined by filters and fields. I can let the client go for a few days and stop it at midnight to process multiple days worth of data as one batch.
Instead of the multi-day batches described above, I'd like to write every n records by treating the stream as a generator. The client code is below. I just added the append() line to create a list called 'records' (in memory) to play back later:
records = []
data_set = api.get_dataset(dataset_id='abc')
for record in data_set.request_realtime():
    records.append(record)
which, as expected, gives me [*] in Jupyter Notebook and keeps running.
Then, I created a generator from my list in memory as follows to extract one record (n=1 for initial testing):
def Generator():
    count = 1
    while count < 2:
        for r in records:
            yield r.data
            count += 1
But my generator definition also gave me [*] and kept running, which I understand is because the list is still being written to in memory. But I thought my generator would be able to lock the state of my list and yield the first n records. It didn't. How can I code my generator in this case? And if a generator is not a good choice for this use case, please advise.
To give you the full picture, if my code was working, then, I'd have instantiated it, printed it, and received an object as expected like this:
>>>my_generator = Generator()
>>>print(my_generator)
<generator object Gen at 0x0000000009910510>
Then, I'd have written it to a csv file like so:
with open('myfile.txt', 'w') as f:
    cf = csv.DictWriter(f, column_headers, extrasaction='ignore')
    cf.writeheader()
    cf.writerows(i.data for i in my_generator)
Note: I know there are many tools for this e.g. Kafka; but I am in an initial PoC phase. Please use Python 2x. Once I get my code working, I plan on stacking generators to set up my next n-record extraction so that I don't lose data in between. Any guidance on stacking would also be appreciated.
That's not how concurrency works. Unless some magic is being used that you didn't tell us about, while your first cell shows [*] you can't run more code. Putting the generator in another cell just adds it to a queue to run when the first cell finishes, and since the first cell will never finish, the second one will never even start running!
I suggest looking into an asynchronous networking library, like asyncio, twisted or trio. They allow you to make functions cooperative, so while one of them is waiting for data the others can run, instead of blocking. You'd probably have to rewrite the api.get_dataset code to be asynchronous as well.
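If rewriting the client is not an option for the PoC, a simpler workaround consistent with the point above is to do the chunked write inside the same loop that consumes the stream, so nothing has to run concurrently. This is only a sketch using the question's own names (api, data_set, column_headers); the chunk size n and the per-chunk flush are illustrative choices:

import csv
from itertools import islice

n = 100  # illustrative chunk size


def stream():
    # the blocking real-time stream from the question's client
    for record in data_set.request_realtime():
        yield record.data


with open('myfile.txt', 'w') as f:
    cf = csv.DictWriter(f, column_headers, extrasaction='ignore')
    cf.writeheader()
    source = stream()
    while True:
        chunk = list(islice(source, n))  # take the next n records as they arrive
        if not chunk:
            break
        cf.writerows(chunk)
        f.flush()  # make each batch visible on disk immediately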
I'm working with Django 1.4 and Celery 3.0 (RabbitMQ) to build an assemblage of tasks for sourcing and caching queries to the Twitter API 1.1. One thing I am trying to implement is a chain of tasks, the last of which makes a recursive call to the task two nodes back, based on the responses so far and the response data in the most recently retrieved response. Concretely, this allows the app to traverse a user timeline (up to 3200 tweets), taking into account that any given request can only yield at most 200 tweets (a limitation of the Twitter API).
Key components of my tasks.py can be seen here, but before pasting, I'll show the chain I'm calling from my Python shell (which will ultimately be launched via user inputs in the final web app). Given:
>>> request = dict(twitter_user_id='#1010101010101#',
                   total_requested=1000,
                   max_id=random.getrandbits(128))  # e.g. an arbitrarily large number
I call:
>>> res = (twitter_getter.s(request) |
           pre_get_tweets_for_user_id.s() |
           get_tweets_for_user_id.s() |
           timeline_recursor.s()).apply_async()
The critical thing is that timeline_recursor can initiate a variable number of get_tweets_for_user_id subtasks. When timeline_recursor is in its base case, it should return a response dict as defined here:
@task(rate_limit=None)
def timeline_recursor(request):
    previous_tweets = request.get('previous_tweets', None)  # If it's the first time through, this will be None
    if not previous_tweets:
        previous_tweets = []  # so we initiate to empty array
    tweets = request.get('tweets', None)
    twitter_user_id = request['twitter_user_id']
    previous_max_id = request['previous_max_id']
    total_requested = request['total_requested']
    pulled_in = request['pulled_in']
    remaining_requested = total_requested - pulled_in
    if previous_max_id:
        remaining_requested += 1  # this is because cursored results will always have one overlapping id
    else:
        previous_max_id = random.getrandbits(128)  # for first time through loop
    new_max_id = min([tweet['id'] for tweet in tweets])
    test = lambda x, y: x < y
    if remaining_requested < 0:  # because we overshoot by requesting batches of 200
        remaining_requested = 0
    if tweets:
        previous_tweets.extend(tweets)
    if tweets and remaining_requested and (pulled_in > 1) and test(new_max_id, previous_max_id):
        request = dict(user_pk=user_pk,
                       twitter_user_id=twitter_user_id,
                       max_id=new_max_id,
                       total_requested=remaining_requested,
                       tweets=previous_tweets)
        # problem happens in this part of the logic???
        response = (twitter_getter_config.s(request) | get_tweets_for_user_id.s() | timeline_recursor.s()).apply_async()
    else:
        # if in base case, combine all tweets pulled in thus far and send back as "tweets" -- to be
        # saved in db or otherwise consumed
        response = dict(
            twitter_user_id=twitter_user_id,
            total_requested=total_requested,
            tweets=previous_tweets)
    return response
My expected response for res.result is therefore a dictionary comprised of a twitter user id, a requested number of tweets, and the set of tweets pulled in across successive calls.
However, all is not well in recursive task land. When I run the chain identified above, if I enter res.status right after initiating the chain, it indicates "SUCCESS", even though in the log view of my celery worker I can see that the chained recursive calls to the Twitter API are being made as expected, with the correct parameters. I can also immediately run res.result even as the chained tasks are being executed; res.result yields an AsyncResult instance id. Even after the recursively chained tasks have finished running, res.result remains an AsyncResult id.
On the other hand, I can access my full set of tweets by going to res.result.result.result.result['tweets']. I can deduce that each of the chained subtasks is indeed occurring; I just don't understand why res.result doesn't have the expected result. The recursive returns that should be happening when timeline_recursor reaches its base case don't appear to be propagating as expected.
Any thoughts on what can be done? Recursion in Celery can get quite powerful, but to me at least, it's not totally apparent how we should be thinking of recursion and recursive functions that utilize Celery and how this affects the logic of return statements in chained tasks.
Happy to clarify as needed, and thanks in advance for any advice.
What does apply_async return (as in, what type of object)?
I don't know Celery, but in Twisted and many other async frameworks a call to something like that would immediately return (usually True, or perhaps an object that can track state), as the tasks are deferred into the queue.
Again, not knowing Celery, I would guess that this is happening:
you are: defining response immediately as the async deferred task, but then trying to act on it as if the results have already come in
you want to be: defining a callback routine to run on the results and return a value, once the task has been completed
Looking at the Celery docs, apply_async accepts callbacks via link, and I couldn't find any example of someone trying to capture a return value from it.
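As a concrete illustration of that link idea (Celery 3.0 does support apply_async(link=...)), here is a sketch using the question's own tasks. handle_timeline and save_tweets are hypothetical names, and whether the value propagates cleanly through the recursive re-chaining is exactly the open question:

from celery import task


@task
def handle_timeline(result):
    # receives whatever the last task in the chain returned
    save_tweets(result['tweets'])  # hypothetical persistence step


res = (twitter_getter.s(request) |
       pre_get_tweets_for_user_id.s() |
       get_tweets_for_user_id.s() |
       timeline_recursor.s()).apply_async(link=handle_timeline.s())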
I am using the boto library in Python to get Amazon SQS messages. In exceptional cases I don't delete messages from the queue, in order to give them a couple more chances to recover from temporary failures. But I don't want to keep receiving failed messages constantly. What I would like to do is either delete messages after they have been received more than 3 times, or not receive a message at all if its receive count is more than 3.
What is the most elegant way of doing it?
There are at least a couple of ways of doing this.
When you read a message in boto, you receive a Message object or some subclass thereof. The Message object has an "attributes" field that is a dict containing all message attributes known by SQS. One of the things SQS tracks is the approximate # of times the message has been read. So, you could use this value to determine whether the message should be deleted or not but you would have to be comfortable with the "approximate" nature of the value.
Alternatively, you could record message ID's in some sort of database and increment a count field in the database each time you read the message. This could be done in a simple Python dict if the messages are always being read within a single process or it could be done in something like SimpleDB if you need to record readings across processes.
Hope that helps.
Here's some example code:
>>> import boto.sqs
>>> c = boto.sqs.connect_to_region('us-east-1')  # substitute your region
>>> q = c.lookup('myqueue')
>>> messages = c.receive_message(q, num_messages=1, attributes='All')
>>> messages[0].attributes
{u'ApproximateFirstReceiveTimestamp': u'1365474374620',
u'ApproximateReceiveCount': u'2',
u'SenderId': u'419278470775',
u'SentTimestamp': u'1365474360357'}
>>>
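Building on that, here is a small sketch of the delete-after-3-reads check in boto, assuming the queue q from the example above; process() is a hypothetical handler for this illustration:

msgs = q.get_messages(num_messages=10, attributes='All')
for m in msgs:
    receive_count = int(m.attributes.get('ApproximateReceiveCount', '0'))
    if receive_count > 3:
        # Give up on this message instead of letting it cycle forever.
        q.delete_message(m)
    else:
        process(m)  # hypothetical handler
        q.delete_message(m)  # delete only after successful processing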
Another way could be to put an extra identifier at the end of the message in your SQS queue. This identifier can keep count of the number of times the message has been read.
Also, if you want your service not to poll these messages again and again, you can create one more queue, say a "dead message queue", and transfer the messages that have crossed the threshold to that queue.
AWS has built-in support for this; just follow the steps below:
create a dead letter queue
enable the redrive policy for the source queue by checking "Use Redrive Policy"
select the dead letter queue you created in step #1 for "Dead Letter Queue"
set "Maximum Receives" to "3" or any value between 1 and 1000
How it works is: whenever a message is received by a worker, its receive count increments. Once it reaches the "Maximum Receives" count, the message is pushed to the dead letter queue. Note that the receive count increments even if you access the message via the AWS console.
Source: Using Amazon SQS Dead Letter Queues
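If you prefer to set this up from code rather than the console, a sketch with boto might look like the following; the region, the queue names, and the dead-letter queue ARN lookup are illustrative:

import json

import boto.sqs

conn = boto.sqs.connect_to_region('us-east-1')       # your region here
source_q = conn.get_queue('myqueue')
dead_q = conn.create_queue('myqueue-dead-letter')

# The redrive policy needs the ARN of the dead letter queue.
dead_arn = dead_q.get_attributes('QueueArn')['QueueArn']

redrive_policy = json.dumps({
    'deadLetterTargetArn': dead_arn,
    'maxReceiveCount': 3,  # move to the dead letter queue after 3 receives
})
conn.set_queue_attribute(source_q, 'RedrivePolicy', redrive_policy)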
Get the ApproximateReceiveCount attribute from the message you read.
Then move it to another queue (so you can manage error messages) or just delete it.
foreach (var message in response.Messages){
    try{
        var notifyMessage = JsonConvert.DeserializeObject<NotificationMessage>(message.Body);
        Global.Sqs.DeleteMessageFromQ(message.ReceiptHandle);
    }
    catch (Exception ex){
        var receiveMessageCount = int.Parse(message.Attributes["ApproximateReceiveCount"]);
        if (receiveMessageCount > 3)
            Global.Sqs.DeleteMessageFromQ(message.ReceiptHandle);
    }
}
It should be done in a few steps.
Create an SQS connection:
sqsconnrec = SQSConnection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
Create a queue object:
request_q = sqsconnrec.create_queue("queue_Name")
Load the queue messages:
messages = request_q.get_messages()
Now you have an array of message objects; to find the total number of messages:
just do len(messages)
It should work like a charm.