Simulate Time Series Events with Accurate Scheduler - python

I have an API against which I need to run some tests. We have already done stress and load testing, but the best way to test is to replay some real-life data. I have a fact table with all the historical data for the past few years. The goal is to find a busy window in that history and "replay" it against our API.
Is there a way to "replay" time series data and simulate the API request activity in Python?
The input data looks like this, with hundreds of thousands of rows a day:
TimeStamp                  Input Data
------------------------------------------
2020-01-01 00:00:01:231    ABC
2020-01-01 00:00:01:456    ABD
2020-01-01 00:00:01:789    XYZ
...
I first thought of converting each row into a cron entry, so that when each row fires, it triggers a request to the API with the data entry as the payload.
However, this approach adds a lot of overhead (starting a Python process and loading libraries for every entry), and the time distribution gets skewed: within a single second it might start dozens of processes.
Is there a way I can start one long-running Python process that replays the time series data faithfully (ideally accurate to within a few milliseconds)?
Almost like:
while True:
    currenttime = datetime.now()
    # find rows in the table matching currenttime
    # make web requests with those rows

And then this becomes synchronous, and every loop iteration requires a database lookup...

Perhaps you'd want to write your real-time playback routine to be something like this (a sketch; GetFirstEventInListAfterSpecifiedTime and executeWebRequestsForEvent are placeholders you'd implement yourself):

import time
from datetime import datetime

def playbackEventsInWindow(startTime, endTime):
    # Fixed offset (in seconds) between the wall clock now and the start
    # of the historical window we're replaying
    timeDiff = (datetime.now() - startTime).total_seconds()
    prevTime = startTime
    while True:
        nextEvent = GetFirstEventInListAfterSpecifiedTime(prevTime)
        if nextEvent:
            nextTime = nextEvent.getEventTimeStamp()
            if nextTime >= endTime:
                return  # we've reached the end of our window
            # Sleep until the wall clock reaches this event's offset in the window
            sleepTimeSeconds = (nextTime - datetime.now()).total_seconds() + timeDiff
            if sleepTimeSeconds > 0.0:
                time.sleep(sleepTimeSeconds)
            executeWebRequestsForEvent(nextEvent)
            prevTime = nextTime
        else:
            return  # we've reached the end of the list
Note that a naive implementation of GetFirstEventInListAfterSpecifiedTime(timeStamp) would simply start at the beginning of the event list and scan linearly until it found an event with a timestamp greater than the specified argument, then return that event. That implementation would quickly become very inefficient if the event list is long. However, you can tweak it to store the index of the value it returned on the previous call and start its linear scan at that position rather than at the top of the list. That lets it return quickly (usually after just one step) in the common case, where the requested timestamps are steadily increasing, as in the sketch below.
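For example, a minimal sketch of that cursor idea, assuming the events are pre-loaded into a list sorted by timestamp and each event object exposes a timestamp attribute (both names are illustrative):

class EventCursor:
    def __init__(self, events):
        # events must be sorted by timestamp, oldest first
        self.events = events
        self.index = 0  # where the previous scan stopped

    def firstEventAfter(self, timeStamp):
        # Resume the scan from the previous position; with steadily
        # increasing timestamps this usually returns after one step.
        while self.index < len(self.events):
            event = self.events[self.index]
            if event.timestamp > timeStamp:
                return event
            self.index += 1
        return None  # no events remain after timeStamp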

Related

automatically remove items of a list every few seconds in redis

I'm trying to see how many users used my bot in the last 5 minutes.
My idea is that every time a user uses my bot, I add their id to a Redis list with a timer (and reset the timer if the user is already in the list).
Then, whenever I want to check how many users are using the bot, I get the length of the list.
But I have no idea how to do that.
Something like the code below, which expires five minutes later:
redis.setex('foo_var', 60 * 5, 'foo_value')
I've managed to add items to a list with:
redis.zadd('foo', {'item1': 0, 'item2': 1})
And to get the length of the list like this (I don't know how to get the full length of the list without using min and max):
min = 0.0
max = 1000.0
redis.zcount('foo', min, max)
Right now the problem is how to expire items of a list at a specific time.
Items within Lists, Sets, Hashes, and their ilk cannot be expired automatically by Redis. That said, it might be worth looking at Streams.
If you're not familiar, Streams are essentially a list of events with associated times and associated data. Think of it like a log file. The nice thing is you can add extra data to the event, like maybe the type of bot interaction.
So, just log an event every time the bot is used:
XADD bot_events * eventType login
The * here means to auto-generate the ID of the event based on the server time. You can also provide one manually, but you almost never want to. An event ID is just a UNIX epoch time in milliseconds and a sequence number, separated by a dash, like this: 1651232477183-0.
XADD can automatically trim the Stream for your time period so old records don't hang around. Do this by providing an event ID before which events will be deleted:
XADD bot_events MINID ~ 1651232477183-0 * eventType login
Note that ~ instructs Redis to trim the Stream performantly. This means it might not delete all the eligible events; however, it will never delete more than you expect, only fewer. It can be replaced with = if you want exactness over performance.
Now that you have a Stream of events, you can then query that Stream for events over a specific time period, based on event IDs:
XRANGE bot_events 1651232477183-0 +
+ here means until the end of the stream. The initial event ID can be replaced with - if you want all the stream's events regardless of time.
From here, you just count the number of results.
Note, all the examples here are presented as raw Redis commands, but it should be easy enough to translate them to Python.
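For instance, a rough translation using the redis-py client (a sketch only; it assumes a Redis server on localhost and a redis-py version recent enough to support the minid argument to xadd):

import time
import redis

r = redis.Redis()

def log_bot_event(event_type):
    # An event ID is epoch milliseconds plus a sequence number, so the ID
    # from five minutes ago marks the trim point for old events.
    five_minutes_ago_ms = int((time.time() - 5 * 60) * 1000)
    r.xadd('bot_events', {'eventType': event_type},
           minid=five_minutes_ago_ms, approximate=True)

def events_in_last_five_minutes():
    five_minutes_ago_ms = int((time.time() - 5 * 60) * 1000)
    return len(r.xrange('bot_events', min=five_minutes_ago_ms, max='+'))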

Is there a way to let generator (yield) return data every 0.1 second (Python)?

I am developing a server-based data acquisition application. I want the server to read data from a file (to simulate a real data stream) every 0.01 seconds. Each time, it reads a chunk of data from the file, so I am using a generator to read the data in chunks. I am then thinking about using some sort of timer to control how often the generator returns data. I have tried the RepeatedTimer mentioned in this post, but it didn't work. Any help on this?
Just a simple generator, for your information:

def generator():
    count = 0
    while True:
        count = count + 1
        yield count
The RepeatedTimer module is mentioned here (the 3rd answer): Run certain code every n seconds
This is how I called the RepeatedTimer; when I ran the code, there was no response:
count = RepeatedTimer(0.01, generator)
print count
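One likely issue with the snippet above is that RepeatedTimer(0.01, generator) calls generator() on every tick, which merely creates and discards a fresh generator object each time, and count ends up holding the timer object itself. A simpler approach is to pace the generator internally; a minimal Python 3 sketch (the file path and chunk size are placeholders):

import time

def paced_chunks(path, chunk_size=1024, interval=0.01):
    # Yield fixed-size chunks from the file, no faster than every
    # `interval` seconds, compensating for time spent reading.
    with open(path, 'rb') as f:
        next_tick = time.monotonic()
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            next_tick += interval
            delay = next_tick - time.monotonic()
            if delay > 0:
                time.sleep(delay)
            yield chunk

for chunk in paced_chunks('data.bin'):
    pass  # handle each chunk, arriving roughly every 0.01 seconds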

Getting microseconds past the hour

I am working on a quick program to generate DIS (Distributed Interactive Simulation) packets to stress test a gateway we have. I'm all set and raring to go, except for one small issue: I'm having trouble pulling the current microseconds past the top of the hour correctly.
Currently I'm doing it like this:
now = dt.now()
minutes = int(now.strftime("%M"))
seconds = int(now.strftime("%S")) + minutes*60
microseconds = int(now.strftime("%f"))+seconds*(10**6)
However, when I run this multiple times in a row, I get results all over the place, with numbers that cannot physically be right. Can someone sanity-check my process?
Thanks very much
You can eliminate all that formatting and just do the following:
now = dt.now()
microseconds_past_the_hour = now.microsecond + 1000000*(now.minute*60 + now.second)
Keep in mind that running this multiple times in a row will continually produce different results, as the current time keeps advancing.
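For instance, pinning the time to a fixed (made-up) value shows the arithmetic itself is deterministic:

from datetime import datetime as dt

now = dt(2020, 1, 1, 14, 35, 7, 123456)  # fixed example time: 14:35:07.123456
microseconds_past_the_hour = now.microsecond + 1000000 * (now.minute * 60 + now.second)
print(microseconds_past_the_hour)  # 2107123456: (35*60 + 7) seconds plus 123456 microseconds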

Why are my Datastore Write Ops so high?

I busted through my daily free quota on a new project this weekend. For reference, that's 0.05 million writes, or 50,000 if my math is right.
Below is the only code in my project that is making any Datastore write operations.
old = Streams.query().fetch(keys_only=True)
ndb.delete_multi(old)
try:
    r = urlfetch.fetch(url=streams_url,
                       method=urlfetch.GET)
    streams = json.loads(r.content)
    for stream in streams['streams']:
        stream = Streams(channel_id=stream['_id'],
                         display_name=stream['channel']['display_name'],
                         name=stream['channel']['name'],
                         game=stream['channel']['game'],
                         status=stream['channel']['status'],
                         delay_timer=stream['channel']['delay'],
                         channel_url=stream['channel']['url'],
                         viewers=stream['viewers'],
                         logo=stream['channel']['logo'],
                         background=stream['channel']['background'],
                         video_banner=stream['channel']['video_banner'],
                         preview_medium=stream['preview']['medium'],
                         preview_large=stream['preview']['large'],
                         videos_url=stream['channel']['_links']['videos'],
                         chat_url=stream['channel']['_links']['chat'])
        stream.put()
    self.response.out.write("Done")
except urlfetch.Error as e:
    self.response.out.write(e)
This is what I know:
- There will never be more than 25 "stream" in "streams". It's guaranteed to call .put() exactly 25 times.
- I delete everything from the table at the start of this call because everything needs to be refreshed every time it runs.
- Right now, this code is on a cron running every 60 seconds. It will never run more often than once a minute.
- I have verified all of this by enabling Appstats, and I can see the datastore_v3.Put count go up by 25 every minute, as intended.
I have to be doing something wrong here, because 25 a minute is 1,500 writes an hour, not the ~50,000 that I'm seeing now.
Thanks
You are mixing up two different things here: write API calls (what your code calls) and low-level datastore write operations. See the billing docs for how the two relate: Pricing of Costs for Datastore Calls (second section).
This is the relevant part:
New Entity Put (per entity, regardless of entity size) = 2 writes + 2 writes per indexed property value + 1 write per composite index value
In your case, Streams has 15 indexed properties, resulting in 2 + 15 * 2 = 32 write ops per write API call.
Total per hour: 60 (requests/hour) * 25 (puts/request) * 32 (operations/put) = 48,000 datastore write operations per hour.
It seems as though I've finally figured out what was going on, so I wanted to update here.
I found this older answer: https://stackoverflow.com/a/17079348/1452497.
I'd missed somewhere along the line that the indexed properties were multiplying the writes by a factor of at least 10; I did not expect that. I didn't need everything indexed, and after turning off indexing in my model, I noticed the write ops drop DRAMATICALLY, down to about where I expected them. A sketch of the change is below.
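For reference, a sketch of what turning off indexing can look like in an NDB model (property names taken from the question's code; which ones stay indexed depends entirely on what you query):

from google.appengine.ext import ndb

class Streams(ndb.Model):
    # Each indexed property costs 2 extra write ops per put(), so mark
    # everything you never filter or sort on as indexed=False.
    channel_id = ndb.StringProperty()  # keep indexed only if you query on it
    display_name = ndb.StringProperty(indexed=False)
    game = ndb.StringProperty(indexed=False)
    status = ndb.StringProperty(indexed=False)
    viewers = ndb.IntegerProperty(indexed=False)
    logo = ndb.StringProperty(indexed=False)
    # ... and likewise for the remaining properties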
Thanks guys!
That is 1500 * 24 = 36,000 writes/day, which is very near the daily quota.

How can I keep a list of recently seen users without running out of RAM/crashing my DB?

Here's some code that should demonstrate what I'm trying to do:
import datetime
import time

recently_seen = {}  # user_id -> time we last logged this user
user_id = 10

while True:
    current_time = datetime.datetime.now()
    if user_id not in recently_seen:
        recently_seen[user_id] = current_time
        print 'seen {0}'.format(user_id)
    elif current_time - recently_seen[user_id] > datetime.timedelta(seconds=5):
        recently_seen[user_id] = current_time
        print 'seen {0}'.format(user_id)
    time.sleep(0.1)
Basically, my program is listening on a socket for users. This is wrapped in a loop that spits out user_ids as it sees them, which means I'm seeing user_ids every few milliseconds.
What I'm trying to do is log which users it sees and at what times. Logging that it saw a user at 0.1 seconds and then again at 0.7 seconds is silly, so I want to implement a 5-second buffer.
It should find a user and, if the user hasn't been seen within 5 seconds, log them to a database.
The two solutions I've come up with are:
1) Keep the user_id in a dictionary (similar to the sample code above) and check against it. The problem is that if it runs for a few days and keeps finding new users, this will eventually use up my RAM.
2) Log them to a database and check against that. The problem with this is that it finds users every few milliseconds, and I don't want to read the database every few milliseconds...
I need some way of creating a list of limited size, where the limit is 5 seconds. Any ideas on how to implement this?
How about removing the user from your dictionary once you log them to the database?
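A minimal sketch of that idea (log_to_database is a hypothetical stand-in for your own persistence call); the dictionary then only ever holds users seen within the last 5 seconds:

import datetime

WINDOW = datetime.timedelta(seconds=5)
recently_seen = {}  # user_id -> time of last logged sighting

def handle_sighting(user_id):
    now = datetime.datetime.now()
    last = recently_seen.get(user_id)
    if last is None or now - last > WINDOW:
        recently_seen[user_id] = now
        log_to_database(user_id, now)  # hypothetical: write to your DB
    # Evict entries older than the window so the dict stays bounded
    for uid in [u for u, seen in recently_seen.items() if now - seen > WINDOW]:
        del recently_seen[uid]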
Why aren't you using a DBM?
It will work like a dictionary but will be stored on the disk.
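For instance, with the standard library's shelve module (a minimal sketch; the filename is arbitrary):

import datetime
import shelve

# A shelve behaves like a dict but is stored on disk, so the set of
# seen users can grow without exhausting RAM. Keys must be strings.
recently_seen = shelve.open('recently_seen.db')
recently_seen[str(10)] = datetime.datetime.now()
recently_seen.close()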
