How do I create a list of timedeltas in python?

I've been searching through this website and have seen multiple references to time deltas, but haven't quite found what I'm looking for.
Basically, I have a list of messages that are received by a comms server and I want to calculate the latency time between each message out and in. It looks like this:
161336.934072 - TMsg out: [O] enter order. RefID [123] OrdID [4568]
161336.934159 - TMsg in: [A] accepted. ordID [456] RefNumber [123]
Mixed in with these messages are other messages as well; however, I only want to capture the difference between the out messages and in messages with the same RefID.
So far, to sort out which messages in the main log are T messages, I've been doing this, but it's really inefficient; I don't need to be making new files every time:
big_file = open('C:/Users/kdalton/Documents/Minicomm.txt', 'r')
small_file1 = open('small_file1.txt', 'w')
for line in big_file:
    if 'T' in line:
        small_file1.write(line)
big_file.close()
small_file1.close()
How do I calculate the time deltas between the two messages and sort out these messages from the main log?

First of all, don't write out the raw log lines. Secondly, use a dict.
tdeltas = {}  # this is an empty dict
if "T" in line:
    # extract the RefID and the timestamp from the line (parsing not shown here)
    if Refid in tdeltas:
        tdeltas[Refid] = timestamp - tdeltas[Refid]
    else:
        tdeltas[Refid] = timestamp
Then at the end, convert to a list and print
allRefids = sorted(tdeltas.keys())
for k in allRefids:
    print k + ": " + str(tdeltas[k]) + " secs"
You may want to convert your dates into time objects from the datetime module and then use timedelta objects to store in the dict. Probably not worth it for this task but it is worthwhile to learn how to use the datetime module.
Also, I have glossed over parsing the Refid from the input string, and the possible issue of converting the times from string to float and back.
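For instance, a minimal sketch of the datetime route, assuming the raw stamps are in HHMMSS.microseconds form as in the sample lines above (the date itself is not in the log, so strptime defaults it):
from datetime import datetime

def parse_stamp(raw):
    # assumed format: HHMMSS.ffffff, e.g. "161336.934072"
    return datetime.strptime(raw, "%H%M%S.%f")

delta = parse_stamp("161336.934159") - parse_stamp("161336.934072")
print(delta)                   # 0:00:00.000087 -- a datetime.timedelta
print(delta.total_seconds())   # 8.7e-05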
Actually, just storing deltas will cause confusion if you ever have a Refid that is not accepted. If I were doing this for real, I would store a tuple in the value with the start datetime, end datetime and the delta. For a new record it would look like this: (161336.934072,0,0) and after the acceptance was detected it would look like this: (161336.934072,161336.934159,.000087). If the logging activity was continuous, say a global ecommerce site running 24x7, then I would periodically scan the dict for any entries with a non-zero delta, report them, and delete them. Then I would take the remaining values, sort them on the start datetime, then report and delete any where the start datetime is too old because that indicates failed transactions that will never complete.
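A minimal sketch of that (start, end, delta) bookkeeping, assuming the RefID and a float timestamp have already been parsed out of each line:
tdeltas = {}

def record_out(refid, timestamp):
    tdeltas[refid] = (timestamp, 0, 0)             # out msg seen, no acceptance yet

def record_in(refid, timestamp):
    start, _, _ = tdeltas[refid]
    tdeltas[refid] = (start, timestamp, timestamp - start)

def report_and_clear_completed():
    for refid, (start, end, delta) in list(tdeltas.items()):
        if delta:                                  # non-zero delta => accepted
            print("%s: %.6f secs" % (refid, delta))
            del tdeltas[refid]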
Also, in a real ecommerce site, I might consider using something like Redis or Memcache as an external dict so that reporting and maintenance can be done by another server/application.

This generator function returns a tuple containing the id and the difference in timestamps between the out and in messages. (If you want to do something more complex with the time difference, check out datetime.timedelta). Note that this assumes out messages always appear before in messages.
def get_time_deltas(infile):
    entries = (line.split() for line in open(infile, "r"))
    ts = {}
    for e in entries:
        if len(e) == 11 and " ".join(e[2:5]) == "TMsg out: [O]":
            ts[e[8]] = e[0]  # store timestamp for id
        elif len(e) == 10 and " ".join(e[2:5]) == "TMsg in: [A]":
            in_ts, ref_id = e[0], e[9]
            # Raises KeyError if out msg not seen yet. Handle if required.
            out_ts = ts.pop(ref_id)  # get ts for this id
            yield (ref_id[1:-1], float(in_ts) - float(out_ts))
You can now get a list out of it:
>>> INFILE = 'C:/Users/kdalton/Documents/Minicomm.txt'
>>> list(get_time_deltas(INFILE))
[('123', 8.699999307282269e-05), ('1233', 0.00028700000257231295)]
Or write it to a file:
>>> with open("out.txt", "w") as outfile:
...     for id, td in get_time_deltas(INFILE):
...         outfile.write("Msg %s took %f seconds\n" % (id, td))
Or chain it into a more complex workflow.
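Because get_time_deltas is a generator, chaining just means passing it straight to another consumer. For instance, a small sketch that finds the slowest round trip without building an intermediate list (INFILE as defined above):
# Hedged example of chaining the generator into max().
slowest_id, slowest_delta = max(get_time_deltas(INFILE), key=lambda pair: pair[1])
print("Msg %s was slowest at %f seconds" % (slowest_id, slowest_delta))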
Update:
(in response to looking at the actual data)
Try this instead:
def get_time_deltas(infile):
    entries = (line.split() for line in open(infile, "r"))
    ts = {}
    for e in entries:
        if " ".join(e[2:5]) == "OuchMsg out: [O]":
            ts[e[8]] = e[0]  # store timestamp for id
        elif " ".join(e[2:5]) == "OuchMsg in: [A]":
            in_ts, ref_id = e[0], e[7]
            out_ts = ts.pop(ref_id, None)  # get ts for this id
            # TODO: handle case where out_ts = None (no id found)
            yield (ref_id[1:-1], float(in_ts) - float(out_ts))

INFILE = 'C:/Users/kdalton/Documents/Minicomm.txt'
print list(get_time_deltas(INFILE))
Changes in this version:
the number of fields does not match the sample input posted in the question, so the check on the number of fields was removed
the ordID of an in message is the field that matches the refID of the corresponding out message
used OuchMsg instead of TMsg
Update 2
To get an average of the deltas:
deltas = [d for _, d in get_time_deltas(INFILE)]
average = sum(deltas) / len(deltas)
Or, if you have previously generated a list containing all the data, you can reuse it instead of reparsing the file:
data = list(get_time_deltas(INFILE))
# ... use data for some other operation ...
# calculate average using the list
average = sum(d for _, d in data) / len(data)

Related

Redis - Python example of xadd and xread

Could you give a very simple example of using Redis' xread and xadd in Python (one that displays the type and format of the return values from xread and the input of xadd)? I've already read a lot of documentation, but none of it is in Python.
The Redis doc gives an example:
> XADD mystream * sensor-id 1234 temperature 19.8
1518951480106-0
but if I try in Python:
sample = {b"hello":b"12"}
id = r.xadd("mystream", sample)
I get this error:
redis.exceptions.ResponseError: WRONGTYPE Operation against a key holding the wrong kind of value
Make sure to flush before running, just to make sure there isn't already a list/stream with the same name:
redis-cli flushall
import redis
from json import JSONEncoder

if __name__ == '__main__':
    r = redis.Redis(host='localhost', port=6379, db=0)
    encoder = JSONEncoder()
    sample = {"hello": encoder.encode([1234, 125, 1235, 1235])}  # converts list to string
    stream_name = 'mystream'
    for i in range(10):
        r.xadd(stream_name, sample)
    # "$" doesn't seem to work in python
    read_samples = r.xread({stream_name: b"0-0"})
Based on redis-py documentation:
Redis initialization:
r = redis.Redis(host='localhost')
To add a key-value pair (key-value should be a dictionary):
r.xadd(stream_name, {key: value})
Block to read:
r.xread({stream_name: '$'}, None, 0)
stream_name and the ID should be passed as a dictionary.
$ means the newest message.
Moreover, instead of passing a normal ID for the stream mystream I
passed the special ID $. This special ID means that XREAD should use
as last ID the maximum ID already stored in the stream mystream, so
that we will receive only new messages, starting from the time we
started listening. (from here)
COUNT should be None if you want to receive all of the newest messages rather than capping the number returned.
0 for the BLOCK option means block with a timeout of 0 milliseconds (that is, never time out).
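Putting the pieces together, here is a small hedged sketch; it assumes a local Redis server on the default port and uses a stream named mystream:
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Producer side: xadd takes a dict of field/value pairs and returns the
# auto-generated entry ID, e.g. b'1518951480106-0'.
entry_id = r.xadd('mystream', {'sensor-id': '1234', 'temperature': '19.8'})

# Consumer side: reading from ID 0-0 returns everything already in the stream.
# Passing '$' instead (with block=0) would wait until a *new* entry arrives.
messages = r.xread({'mystream': '0-0'}, count=None)
# messages looks roughly like:
# [[b'mystream', [(b'1518951480106-0', {b'sensor-id': b'1234', b'temperature': b'19.8'})]]]
print(messages)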
Looking at the help (or the docstrings (1), (2)) for the functions, they're quite straightforward:
>>> import redis
>>> r = redis.Redis()
>>> help(r.xadd)
xadd(name, fields, id='*', maxlen=None, approximate=True)
Add to a stream.
name: name of the stream
fields: dict of field/value pairs to insert into the stream
id: Location to insert this record. By default it is appended.
maxlen: truncate old stream members beyond this size
approximate: actual stream length may be slightly more than maxlen
>>> help(r.xread)
xread(streams, count=None, block=None)
Block and monitor multiple streams for new data.
streams: a dict of stream names to stream IDs, where
IDs indicate the last ID already seen.
count: if set, only return this many items, beginning with the
earliest available.
block: number of milliseconds to wait, if nothing already present.

Script skips second for loop when reading a file

I am trying to read a log file and compare certain values against preset thresholds. My code manages to log the raw data with the first for loop in my function.
I have added print statements to try and figure out what was going on and I've managed to deduce that my second for loop never "happens".
This is my code:
def smartTest(log, passed_file):
    # Threshold values based on averages, subject to change if need be
    RRER = 5
    SER = 5
    OU = 5
    UDMA = 5
    MZER = 5
    datafile = passed_file
    # Log the raw data
    log.write('=== LOGGING RAW DATA FROM SMART TEST===\r\n')
    for line in datafile:
        log.write(line)
    log.write('=== END OF RAW DATA===\r\n')
    print 'Checking SMART parameters...',
    log.write('=== VERIFYING SMART PARAMETERS ===\r\n')
    for line in datafile:
        if 'Raw_Read_Error_Rate' in line:
            line = line.split()
            if int(line[9]) < RRER and datafile == 'diskOne.txt':
                log.write("Raw_Read_Error_Rate SMART parameter is: %s. Value under threshold. DISK ONE OK!\r\n" % int(line[9]))
            elif int(line[9]) < RRER and datafile == 'diskTwo.txt':
                log.write("Raw_Read_Error_Rate SMART parameter is: %s. Value under threshold. DISK TWO OK!\r\n" % int(line[9]))
            else:
                print 'FAILED'
                log.write("WARNING: Raw_Read_Error_Rate SMART parameter is: %s. Value over threshold!\r\n" % int(line[9]))
                rcode = mbox(u'Attention!', u'One or more hardrives may need replacement.', 0x30)
This is how I am calling this function:
dataOne = diskOne()
smartTest(log, dataOne)
print 'Disk One Done'
diskOne() looks like this:
def diskOne():
    if os.path.exists(r"C:\Dejero\HDD Guardian 0.6.1\Smartctl"):
        os.chdir(r"C:\Dejero\HDD Guardian 0.6.1\Smartctl")
        os.system("Smartctl -a /dev/csmi0,0 > C:\Dejero\Installation-Scripts\diskOne.txt")
        # Store file in variable
        os.chdir(r"C:\Dejero\Installation-Scripts")
        datafile = open('diskOne.txt', 'rb')
        return datafile
    else:
        log.write('Smart utility not found.\r\n')
I have tried googling similar issues to mine and have found none. I tried moving my first for loop into diskOne() but the same issue occurs. There is no syntax error and I am just not able to see the issue at this point.
It is not skipping your second loop. You need to seek the position back. This is because after reading the file, the file offset will be placed at the end of the file, so you will need to put it back at the start. This can be done easily by adding a line
datafile.seek(0)
Before the second loop.
Ref: Documentation
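A small illustration of that offset behaviour, assuming a diskOne.txt file already exists:
datafile = open('diskOne.txt', 'r')

first_pass = sum(1 for _ in datafile)    # reads to EOF; the offset is now at the end
second_pass = sum(1 for _ in datafile)   # 0 -- nothing left to read
datafile.seek(0)                         # rewind to the start of the file
third_pass = sum(1 for _ in datafile)    # same count as first_pass again
print("%d %d %d" % (first_pass, second_pass, third_pass))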

Python: Why does this script get very slow after some point in time?

I am running the script below to extract IP addresses from file f for domains in file g. It is worth mentioning that there are 11 files in the path (each one a file f) and each has about 800 million lines. In this script I load file g into a dictionary d in memory and then compare the lines of file f with the items in dictionary d; if a domain is there, I check whether the bl_date in d is between the dates in f, and if so I write it to another dictionary dns_dic. Here is what my script looks like:
from collections import defaultdict
from glob import glob
from multiprocessing import Pool
from time import time
import json

path = '/data/data/2014*.M.mtbl.A.1'

def process_file(file):
    start = time()
    dns_dic = defaultdict(set)
    d = defaultdict(set)
    filename = file.split('/')[-1]
    print(file)
    g = open('/data/data/ap2014-2dom.txt', 'r')
    for line in g:
        line = line.strip('\n')
        domain, bl_date = line.split('|')
        bl_date = int(bl_date)
        if domain in d:
            d[domain].add(bl_date)
        else:
            d[domain] = set([bl_date])
    print("loaded APWG in %.fs" % (time() - start))
    stat_d, stat_dt = 0, 0
    f = open(file, 'r')
    with open('/data/data/overlap_last_%s.txt' % filename, 'a') as w:
        for n, line in enumerate(f):
            line = line.strip('')
            try:
                jdata = json.loads(line)
                dom = jdata.get('bailiwick')[:-1]
            except:
                pass
            if dom in d:
                stat_d += 1
                for bl_date in d.get(dom):
                    if jdata.get('time_first') <= bl_date <= jdata.get('time_last'):
                        stat_dt += 1
                        dns_dic[dom].update(jdata.get('rdata', []))
        for domain, ips in dns_dic.items():
            for ip in ips:
                w.write('%s|%s\n' % (domain, ip))
        w.flush()

if __name__ == "__main__":
    files_list = glob(path)
    cores = 11
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    pp.imap_unordered(process_file, files_list)
    pp.close()
    pp.join()
Here is file f:
{"bailiwick":"ou.ac.","time_last": 1493687431,"time_first": 1393687431,"rdata": ["50.21.180.100"]}
{"bailiwick": "ow.ac.","time_last": 1395267335,"time_first": 1395267335,"rdata": ["50.21.180.100"]}
{"bailiwick":"ox.ac.","time_last": 1399742959,"time_first": 1393639617,"rdata": ["65.254.35.122", "216.180.224.42"]}
Here is file g:
ou.ac|1407101455
ox.ac|1399553282
ox.ac|1300084462
ox.ac|1400243222
Expected result:
ou.ac|["50.21.180.100"]
ox.ac|["65.254.35.122", "216.180.224.42"]
Can somebody help me find out why the script becomes really slow at some point in time, although memory usage stays at about 400 MB the whole time?
Even though it doesn't change the overall computational complexity, I would start with avoiding redundant dict lookup operations. For instance, instead of
if domain in d:
    d[domain].add(bl_date)
else:
    d[domain] = set([bl_date])
you might want to do
d.setdefault(domain, set()).add(bl_date)
in order to perform one lookup instead of two. But actually, it seems like a set is not the perfect choice for storing a domain's access timestamps. If you used lists instead, you could sort each domain's timestamps before you start matching them to the session data from f. That way, you would simply compare each session's fields time_last and time_first to the first and last element in the domain's timestamp list to determine if the IP addresses are to be put into dns_dic[dom].
In general, you are doing a lot of unnecessary work in the for bl_date in d.get(dom): loop. At the very least, you should terminate the loop at the first bl_date that lies between the time_first and time_last fields. Depending on the length of g, this might be your bottleneck.
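A hedged sketch of both suggestions combined (sorted per-domain timestamp lists, and a scan that stops early), using the g file path from the question:
d = {}
for line in open('/data/data/ap2014-2dom.txt'):
    domain, bl_date = line.strip('\n').split('|')
    d.setdefault(domain, []).append(int(bl_date))
for timestamps in d.values():
    timestamps.sort()

def window_matches(timestamps, time_first, time_last):
    # Because the list is sorted, we can stop as soon as a timestamp
    # passes time_last; the first hit inside the window is enough.
    for bl_date in timestamps:
        if bl_date > time_last:
            return False
        if bl_date >= time_first:
            return True
    return False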

Building AIS Messages Decoder

I used to decode AIS messages with this package (Python) https://github.com/schwehr/noaadata/tree/master/ais until I started getting a new format of the messages.
As you may know, AIS messages mostly come in two forms: one part (a single message) or two parts (a multi-part message). Message #5 always comes in two parts. Example:
!AIVDM,2,1,1,A,55?MbV02;H;s<HtKR20EHE:address#hidden#Dn2222222216L961O5Gf0NSQEp6ClRp8,0*1C
!AIVDM,2,2,1,A,88888888880,2*25
I used to decode this just fine using the following piece of code:
nmeamsg = fields.split(',')
if nmeamsg[0] != '!AIVDM':
return
total = eval(nmeamsg[1])
part = eval(nmeamsg[2])
aismsg = nmeamsg[5]
nmeastring = string.join(nmeamsg[0:-1],',')
bv = binary.ais6tobitvec(aismsg)
msgnum = int(bv[0:6])
--
elif (total>1):
    # Multi Slot Messages: 5,6,8,12,14,17,19,20?,21,24,26
    global multimsg
    if total==2:
        if msgnum==5:
            if nmeastring.count('!AIVDM')==2 and len(nmeamsg)==13: # make sure there are two parts concatenated together
                aismsg = nmeamsg[5]+nmeamsg[11]
                bv = binary.ais6tobitvec(aismsg)
                msg5 = ais_msg_5.decode(bv)
                print "message5 :",msg5
                return msg5
Now I'm getting a new format of the messages:
!SAVDM,2,1,7,A,55#0hd01sq`pQ3W?O81L5#E:1=0U8U#000000016000006H0004m8523k#Dp,0*2A,1410825672
!SAVDM,2,2,7,A,4hC`2U#C`40,2*76,1410825672,1410825673
Note: the number at the last index is the time in epoch format.
I tried to adjust my code to decode this new format. I succeeded in decoding messages with one part. My problem is the multi-message type.
nmeamsg = fields.split(',')
if nmeamsg[0] != '!AIVDM' and nmeamsg[0] != '!SAVDM':
return
total = eval(nmeamsg[1])
part = eval(nmeamsg[2])
aismsg = nmeamsg[5]
nmeastring = string.join(nmeamsg[0:-1],',')
dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[7])))
bv = binary.ais6tobitvec(aismsg)
msgnum = int(bv[0:6])
Decoder can't bring the two lines as one. So decoding fails because message#5 should contain two strings not one. The error i get is in these lines:
if nmeastring.count('!SAVDM')==2 and len(nmeamsg)==13:
    aismsg = nmeamsg[5]+nmeamsg[11]
Where len(nmeamsg) is always 8 (second line) and nmeastring.count('!SAVDM') is always 1
I hope I explained this clearly so someone can let me know what I'm missing here.
UPDATE
Okay I think I found the reason. I pass messages from file to script line by line:
for line in file:
    i=i+1
    try:
        doais(line)
Where message#5 should be passed as two lines. Any idea on how I can accomplish that?
UPDATE
I did it by modifying the code a little bit:
for line in file:
    i=i+1
    try:
        nmeamsg = line.split(',')
        aismsg = nmeamsg[5]
        bv = binary.ais6tobitvec(aismsg)
        msgnum = int(bv[0:6])
        print msgnum
        if nmeamsg[0] != '!AIVDM' and nmeamsg[0] != '!SAVDM':
            print "wrong format"
        total = eval(nmeamsg[1])
        if total == 1:
            dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[8])))
            doais(line,msgnum,dbtimestring,aismsg)
        if total == 2: # Multi-line messages
            lines = line+file.next()
            nmeamsg = lines.split(',')
            dbtimestring = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(nmeamsg[15])))
            aismsg = nmeamsg[5]+nmeamsg[12]
            doais(lines,msgnum,dbtimestring,aismsg)
Be aware that noaadata is my old research code. libais is my production library that is in use for NOAA's ERMA and WhaleAlert.
I usually make decoding a two-pass process. First, join multi-line messages; I refer to this as normalization (ais_normalize.py). You have several issues in this step. First, the two component lines have different timestamps on the right of the second string. By the old USCG metadata standard, the last one matters, so my code will assume that these two lines are not related. Second, you don't have the required station id field.
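A hedged Python sketch of that first normalization pass, assuming fragments of a multi-part message arrive in order, grouping them by the sequential message ID in field 4 (it ignores the trailing timestamps and the station id question):
def normalize(lines):
    pending = {}                                          # seq_id -> payload fragments
    for line in lines:
        fields = line.strip().split(',')
        if fields[0] not in ('!AIVDM', '!SAVDM'):
            continue
        total, part, seq_id = int(fields[1]), int(fields[2]), fields[3]
        pending.setdefault(seq_id, []).append(fields[5])  # the 6-bit payload field
        if part == total:                                 # last fragment for this id
            yield ''.join(pending.pop(seq_id))            # combined payload, ready to decode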
Where are you getting the SA from in SAVDM? What device ("talker" in the NMEA vocab) is receiving these messages?
If you're in Ruby, I can recommend the NMEA and AIS decoder ruby gem that I wrote, available on github. It's based on the unofficial AIS spec at catb.org which is maintained by one of Kurt's colleagues.
It handles combining of multipart messages, reads from streams, and supports a large set of NMEA and AIS messages. Decoding the 50 binary subtypes of AIS messages 6 and 8 is presently in development.
To handle the nonstandard lines you posted:
!SAVDM,2,1,7,A,55#0hd01sq`pQ3W?O81L5#E:1=0U8U#000000016000006H0004m8523k#Dp,0*2A,1410825672
!SAVDM,2,2,7,A,4hC`2U#C`40,2*76,1410825672,1410825673
It would be necessary to add a new parse rule that accepts fields after the checksum, but aside from that it should go smoothly. In other words, you'd copy the parser line here:
| BANG DATA CSUM { result = NMEAPlus::AISMessageFactory.create(val[0], val[1], val[2]) }
and have something like
| BANG DATA CSUM COMMA DATA { result = NMEAPlus::AISMessageFactory.create(val[0], val[1], val[2], val[4]) }
What do you do with those extra timestamp(s)? It almost looks like they've been appended by whatever software is doing the logging, rather than being part of the actual message.

Handle modifications in a diff file

I've got a diff file and I want to handle adds/deletions/modifications to update an SQL database.
+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2
-NameD|InfoD1|InfoD2
-NameE|InfoE1|InfoE2
+NameD|InfoD1|InfoD3
+NameE|InfoE3|InfoE2
With a Python script, I first detect two consecutive lines with a regular expression to handle modifications like B:
re.compile(r"^-(.+?)\|(.*?)\|(.+?)\n\+(.+?)\|(.*?)\|(.+?)(?:\n|\Z)", re.MULTILINE)
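For illustration, a small hedged example of that regular expression applied to the B pair from the sample diff above:
import re

pattern = re.compile(r"^-(.+?)\|(.*?)\|(.+?)\n\+(.+?)\|(.*?)\|(.+?)(?:\n|\Z)", re.MULTILINE)
diff_text = "-NameB|InfoB1|InfoB2\n+NameB|InfoB3|InfoB2\n"
for old_name, old1, old2, new_name, new1, new2 in pattern.findall(diff_text):
    if old_name == new_name:
        # prints: NameB modified: InfoB1|InfoB2 -> InfoB3|InfoB2
        print("%s modified: %s|%s -> %s|%s" % (old_name, old1, old2, new1, new2))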
I delete all the matching lines, then rescan the file and handle the remaining lines as additions/deletions.
My problem is with lines like D & E. For the moment I treat them as two deletions followed by two additions, which triggers CASCADE DELETE consequences in my SQL database, when I should really be treating them as modifications.
How can I handle such modifications D & E?
The diff file is generated by a bash script beforehand; I could handle it differently if needed.
Try this:
>>> a = '''
+NameA|InfoA1|InfoA2
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
-NameC|InfoC1|InfoC2
-NameD|InfoD1|InfoD2
-NameE|InfoE1|InfoE2
+NameD|InfoD1|InfoD3
+NameE|InfoE3|InfoE2
'''
>>> diff = {}
>>> for row in a.splitlines():
        if not row:
            continue
        s = row.split('|')
        name = s[0][1:]
        data = s[1:]
        if row.startswith('+'):
            change = diff.get(name, {'rows': []})
            change['rows'].append(row)
            change['status'] = 'modified' if 'status' in change else 'added'
        else:
            change = diff.get(name, {'rows': []})
            change['rows'].append(row)
            change['status'] = 'modified' if 'status' in change else 'removed'
        diff[name] = change
>>> def print_by_status(status=None):
        for item, value in diff.items():
            if status is not None and status == value['status'] or status is None:
                print '\nStatus: %s\n%s' % (value['status'], '\n'.join(value['rows']))
>>> print_by_status(status='added')
Status: added
+NameA|InfoA1|InfoA2
>>> print_by_status(status='modified')
Status: modified
-NameD|InfoD1|InfoD2
+NameD|InfoD1|InfoD3
Status: modified
-NameE|InfoE1|InfoE2
+NameE|InfoE3|InfoE2
Status: modified
-NameB|InfoB1|InfoB2
+NameB|InfoB3|InfoB2
In this case you will have a dictionary with all the collected data, including each entry's diff status and rows. You can then do whatever you want with that dict.
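From there, a hedged sketch of applying the collected diff to a database with sqlite3, assuming a hypothetical table entries(name, info1, info2); modifications become UPDATEs, which avoids the CASCADE DELETE problem:
import sqlite3

conn = sqlite3.connect('example.db')                      # hypothetical database
cur = conn.cursor()

for name, value in diff.items():
    if value['status'] == 'added':
        fields = value['rows'][0][1:].split('|')          # strip the '+' prefix
        cur.execute('INSERT INTO entries VALUES (?, ?, ?)', fields)
    elif value['status'] == 'removed':
        cur.execute('DELETE FROM entries WHERE name = ?', (name,))
    else:  # modified: update in place instead of delete + insert
        new = [r for r in value['rows'] if r.startswith('+')][0][1:].split('|')
        cur.execute('UPDATE entries SET info1 = ?, info2 = ? WHERE name = ?',
                    (new[1], new[2], name))

conn.commit()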
