Graphing the number of elements down based on timestamps start/end - python

I am trying to graph alarm counts in Python to produce some kind of display that gives an idea of the peak number of network elements down between two timestamps. Our alarm report outputs CSV like this:
Name,Alarm Start,Alarm Clear
NE1,15:42 08/09/11,15:56 08/09/11
NE2,15:42 08/09/11,15:57 08/09/11
NE3,15:42 08/09/11,16:31 08/09/11
NE4,15:42 08/09/11,15:59 08/09/11
I am trying to graph, between those start and end points, how many NEs were down at any given time, including the maximum number and when the count went above or below a certain threshold. An example is below:
15:42 08/09/11 - 4 Down
15:56 08/09/11 - 3 Down
etc.
Any advice where to start on this would be great. Thanks in advance, you guys and gals have been a big help in the past.

I'd start by parsing your input data into a map indexed by timestamp, with counts as values: just increase the count for each row with the same timestamp you encounter.
After that, use a plotting module, for instance matplotlib, to plot the keys of the map against the values. That should cover it!
Do you need any more detailed ideas?
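A minimal sketch of that idea, assuming the CSV format shown in the question (the file name and the day/month ordering of the timestamps are guesses on my part): treat each alarm start as +1, each alarm clear as -1, keep a running total, and step-plot it.

import csv
from datetime import datetime
import matplotlib.pyplot as plt

# Each alarm contributes a +1 event at its start and a -1 event at its clear.
events = []
with open("alarms.csv") as f:
    for row in csv.DictReader(f):
        start = datetime.strptime(row["Alarm Start"], "%H:%M %d/%m/%y")
        clear = datetime.strptime(row["Alarm Clear"], "%H:%M %d/%m/%y")
        events.append((start, +1))
        events.append((clear, -1))

# Walk the events in time order, tracking how many NEs are down.
events.sort()
times, counts, down = [], [], 0
for ts, delta in events:
    down += delta
    times.append(ts)
    counts.append(down)

plt.step(times, counts, where="post")
plt.ylabel("NEs down")
plt.show()

The counts list also gives you the peak directly via max(counts), and you can find threshold crossings by checking where consecutive values straddle the limit.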

Related

Apply a function to each row python

I am trying to convert from UTC time to LocaleTime in my dataframe. I have a dictionary where I store the number of hours I need to shift for each country code. So for example if I have df['CountryCode'][0]='AU' and I have a df['UTCTime'][0]=2016-08-12 08:01:00 I want to get df['LocaleTime'][0]=2016-08-12 19:01:00 which is
df['UTCTime'][0]+datetime.timedelta(hours=dateDic[df['CountryCode'][0]])
I have tried to do it with a for loop, but since I have more than 1 million rows it's not efficient. I have looked into the apply function, but I can't seem to get it to take inputs from two different columns.
Can anyone help me?
Without a more concrete example it's difficult, but try this:
pd.to_timedelta(df.CountryCode.map(dateDic), 'h') + df.UTCTime
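For illustration, a small self-contained sketch of that one-liner (the offsets in dateDic here are made-up examples):

import pandas as pd

dateDic = {'AU': 11, 'US': -5}   # hypothetical UTC offsets in hours
df = pd.DataFrame({
    'CountryCode': ['AU', 'US'],
    'UTCTime': pd.to_datetime(['2016-08-12 08:01:00', '2016-08-12 08:01:00']),
})

# Map each row's country code to its offset, turn that into a timedelta,
# and add it to the UTC column in one vectorized step.
df['LocaleTime'] = df['UTCTime'] + pd.to_timedelta(df['CountryCode'].map(dateDic), unit='h')
print(df)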

Matplotlib WeekdayLocator giving wrong dates/too many ticks

I've been working with matplotlib.pyplot to plot some data over date ranges, but have been running across some weird behavior, not too different from this question.
The primary difference between my issue and that one (aside from the suggested fix not working) is that they refer to different locators (WeekdayLocator() in my case, AutoDateLocator() in theirs). As some background, here's what I'm getting:
The expected and typical result, where my data is displayed with a reasonable date range:
And the very occasional result, where the data is given some ridiculous range of about 5 years (from what I can see):
I did some additional testing with a generic matplotlib.pyplot.plot and it seemed to be unrelated to using a subplot, or just creating the plot using the module directly.
plt.plot(some plot)
vs.
fig = plt.figure(...)
sub = fig.add_subplot(...)
sub.plot(some plot)
From what I could find, the odd behavior only happens when the data set has a single point (and therefore only a single date to plot). The outrageous number of ticks is caused by the WeekdayLocator(), which, for some reason, attempts to generate 1635 ticks for the x-axis date range (from about 2013 to 2018), based on this error output:
RuntimeError: RRuleLocator estimated to generate 1635 ticks from
2013-07-11 19:23:39+00:00 to 2018-01-02 00:11:39+00:00: exceeds Locator.MAXTICKS * 2 (20)
(This was from some experimenting with the WeekdayLocator().MAXTICKS member set to 10)
I then tried changing the Locator based on how many date points I had to plot:
# If all the entries in the plot dictionary have <= 1 data point to plot
if all(len(times[comp]) <= 1 for comp in times.keys()):
    sub.xaxis.set_major_locator(md.DayLocator())
else:
    sub.xaxis.set_major_locator(md.WeekdayLocator())
This worked for the edge cases where I'd have a line with 2 points and a line with 1 (or just a point) and wanted the normal ticking, since that didn't get messed up, but it only sort of fixed my problem:
Now I don't have a silly amount of tick marks, but my date range is still 5 years! (Side Note: I also tried using an HourLocator(), but it attempted to generate almost 40,000 tick marks...)
So I guess my question is this: is there some way to rein in the date range explosion when only having one date to plot, or am I at the mercy of a strange bug with Matplotlib's date plotting methods?
What I would like to have is something similar to the first picture, where the date range goes from a little before the first date to a little after the last date. Even if Matplotlib were to fill the axis range to roughly match the tick frequency in the first image, I would expect it to span about a month, not five whole years.
Edit:
Forgot to mention that the range explosion also appears to occur regardless of which Locator I use. Plotting zero points just results in a blank x-axis (since there is no date range at all), a single point gives me the huge date range described above, and multiple points/lines give the expected date ranges.
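Not an answer from the thread, just a sketch of one possible workaround: when only a single date is being plotted, pin the x-limits yourself so matplotlib never has to invent a range. Here sub is the Axes and dates the list of datetimes being plotted (both assumptions based on the question's code), and the three-day padding is arbitrary.

from datetime import timedelta
import matplotlib.dates as md

if len(dates) == 1:
    # Force a small window around the lone point instead of letting
    # matplotlib pick a default multi-year span.
    pad = timedelta(days=3)
    sub.set_xlim(dates[0] - pad, dates[0] + pad)
    sub.xaxis.set_major_locator(md.DayLocator())
else:
    sub.xaxis.set_major_locator(md.WeekdayLocator())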

How to remove day from datetime index in pandas?

The idea behind this question is that when I'm working with full datetime tags and data from different days, I sometimes want to compare the hourly behavior across those days.
But because the days are different, I cannot directly plot two hourly data sets on top of each other.
My naive idea would be that I need to remove the day from the datetime index on both sets and then plot them on top of each other. What's the best way to do that?
Or, alternatively, what's the better approach to my problem?
This may not be exactly it, but it should help you along; assuming ts is your time series (newer pandas needs an explicit aggregation after resample, mean() here):
hourly = ts.resample('H').mean()
hourly.index = pd.MultiIndex.from_arrays([hourly.index.hour, hourly.index.normalize()])
hourly.unstack().plot()
If you don't care about the day AT ALL, just hourly.index = hourly.index.hour should work.
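A self-contained toy version of that approach, assuming a datetime-indexed Series; after the unstack, each day becomes its own column, so the days plot on top of each other by hour:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Two days of made-up hourly data on a proper DatetimeIndex.
idx = pd.date_range('2016-08-01', periods=48, freq='H')
ts = pd.Series(np.random.randn(48).cumsum(), index=idx)

hourly = ts.resample('H').mean()
hourly.index = pd.MultiIndex.from_arrays(
    [hourly.index.hour, hourly.index.normalize()])
hourly.unstack().plot()   # one line per day, x-axis is hour of day
plt.show()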

summing entries in a variable based on another variable(make unique) in python lists

I have a question as to how I can perform this task in Python.
I have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
till about 110,000 entries, with about 4,000 different combinations of longitude-latitude.
I want to compute the average connections, average policy status and average activity flag for each location,
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
so on
I have about 195 files with ~110,000 entries each (sort of a big data problem).
My files are .csv, but I'm treating them as .txt to work with them more easily in Python (not sure if this is the best idea).
I'm still new to Python, so I'm not really sure what the best approach is, but I sincerely appreciate any help or guidance with this problem.
Thanks in advance!
No - if you have the files as .csv, treating them as plain text does not make sense, since Python ships with the excellent csv module.
You could read the csv rows into a dict to group them, but I'd suggest loading the data into a proper database and using SQL's AVG() and GROUP BY. Python ships with bindings for most databases; if you have none installed, consider using the built-in sqlite3 module.
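A rough sketch of that SQL route, using the standard-library sqlite3 module (the file name and the assumption that every row has the six columns from the question are mine):

import csv
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE entries
                (ip TEXT, conns REAL, policy REAL, activity REAL,
                 longitude TEXT, latitude TEXT)""")

# Load every row from the CSV straight into the table.
with open("entries.csv") as f:
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)", csv.reader(f))

# Let the database do the grouping and averaging.
query = """SELECT longitude, latitude, AVG(conns), AVG(policy), AVG(activity)
           FROM entries GROUP BY longitude, latitude"""
for row in conn.execute(query):
    print(row)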
I'll only give you the algorithm; you'll learn more by writing the actual code yourself.
1. Use a dictionary with keys that are pairs of the form (longitude, latitude) and values that are lists of the form [ConnectionSum, policystatusSum, ActivityFlagSum, count].
2. Loop over the entries once:
a. for each entry, if its location already exists in the dictionary, add the connection, policy status and activity values to the existing sums and increment that location's count;
b. if the location does not exist yet, insert it with [0, 0, 0, 0] and then do step (a).
3. Do steps 1 and 2 for all the files.
4. After all the entries have been scanned, loop over the dictionary and divide each of the three sums by that location's count to get its average values.
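A rough dict-based sketch of those steps (file_paths standing in for your list of 195 files, and the column order taken from the example rows above):

import csv

# (longitude, latitude) -> [connection sum, policy sum, activity sum, count]
sums = {}
for path in file_paths:
    with open(path) as f:
        for ip, conns, policy, activity, lon, lat in csv.reader(f):
            entry = sums.setdefault((lon, lat), [0.0, 0.0, 0.0, 0])
            entry[0] += float(conns)
            entry[1] += float(policy)
            entry[2] += float(activity)
            entry[3] += 1

for (lon, lat), (c, p, a, n) in sums.items():
    print([lon, lat, c / n, p / n, a / n])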
As long as identical locations are restricted to the same file (or are even close to each other within a file), all you need is the stream-processing paradigm. For example, if you know that duplicate locations only appear within one file, read each file, calculate its averages, then close it. As long as you let the old data go out of scope, the garbage collector will get rid of it for you. Basically do this:
def processFile(pathToFile):
    ...

totalResults = ...
for path in filePaths:
    partialResults = processFile(path)
    totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the O(1)-memory method of calculating averages "on-line". If, for example, you are averaging 5, 6, 7, you would do 5/1=5.0, (5.0*1+6)/2=5.5, (5.5*2+7)/3=6. At each step you only keep track of the current average and the number of elements. This solution uses the minimal amount of memory (no more than the size of your final result!) and doesn't care about the order in which you visit the elements. It would go something like the snippet below; see http://docs.python.org/library/csv.html for the functions you'll need from the CSV module.
import csv

def allTheRecords():
    for path in filePaths:
        for row in csv.somehow_get_rows(path):
            yield SomeStructure(row)

# dict: keys are tuples (lat, long), values are an arbitrary data structure,
# e.g. a dict representing {avgConn, avgPoli, avgActi, num}
averages = {}
for record in allTheRecords():
    position = (record.lat, record.long)
    currentAverage = averages.get(position, {'avgConn': 0, 'avgPoli': 0, 'avgActi': 0, 'num': 0})
    newAverage = {apply the math I mentioned above}
    averages[position] = newAverage
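To make that placeholder concrete, here is one hedged way of "applying the math mentioned above"; it reuses allTheRecords() and averages from the sketch, and record.conn, record.poli and record.acti are assumed attribute names on SomeStructure:

def fold_in(avg, n, new_value):
    # Running-average update: the new average after seeing one more value.
    return (avg * n + new_value) / (n + 1)

for record in allTheRecords():
    position = (record.lat, record.long)
    cur = averages.get(position, {'avgConn': 0.0, 'avgPoli': 0.0, 'avgActi': 0.0, 'num': 0})
    n = cur['num']
    averages[position] = {
        'avgConn': fold_in(cur['avgConn'], n, record.conn),
        'avgPoli': fold_in(cur['avgPoli'], n, record.poli),
        'avgActi': fold_in(cur['avgActi'], n, record.acti),
        'num': n + 1,
    }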
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: if you knew the exact location of every IP event to infinite precision, the average of everything would just be itself. The only reason you can compress your dataset is that your latitude and longitude have finite precision. If you run into this issue because you acquire more precise data, you can choose to round to an appropriate precision; rounding to within 10 meters or so may be reasonable. See latitude and longitude. This requires just a little bit of math/geometry.)

gnuplot alternative with higher time precision

At present I'm using gnuplot to plot data against a timeline. However, the precision of the timeline is in milliseconds, and gnuplot only seems to be able to handle seconds.
I've looked at a couple of alternatives, but really I just need something like gnuplot that can cope with fractions of a second.
The programming language used for the main script is Python and whilst I've looked at matplotlib, it seems to be a lot more 'heavy duty' than gnuplot. As I won't always be the one updating the graphing side of things, I want to keep it as easy as possible.
Any suggestions?
Update
I'm using this with gnuplot:
set xdata time
set timefmt "%Y-%m-%d-%H:%M:%S"
However there is no %f to get milliseconds. For example, this works:
2011-01-01-09:00:01
but I need:
2011-01-01-09:00:01.123456
The gnuplot 4.6 manual states, under "Time/date specifiers" (page 114 of the gnuplot 4.6 PDF):
%S - second, integer 0–60 on output, (double) on input
What this means is that when reading timestamps such as 2013-09-16 09:56:59.412 the fractional portion will be included as part of the %S specifier. Such a timestamp will be handled correctly with:
set timefmt "%Y-%m-%d %H:%M:%S"
set datafile separator ","
plot "timed_results.data" using 1:2 title 'Results' with lines
and fed with data like:
2013-09-16 09:56:53.405,10.947
2013-09-16 09:56:54.392,10.827
2013-09-16 09:56:55.400,10.589
2013-09-16 09:56:56.394,9.913
2013-09-16 09:56:58.050,11.04
You can set the tick format with
set format x '%.6f'
or (maybe, I have not tried it, as I now prefer to use Matplotlib and do not have gnuplot installed on my machines):
set timefmt "%Y-%m-%d-%H:%M:%.6S"
(note the number of digits specified along with the %S format string).
More details can be found in the excellent not so Frequently Asked Questions.
I'm using gnuplot for the same purposes, my input looks like:
35010.59199,100,101
35010.76560,100,110
35011.05703,100,200
35011.08119,100,110
35011.08154,100,200
35011.08158,100,200
35011.08169,100,200
35011.10814,100,200
35011.16955,100,110
35011.16985,100,200
35011.17059,100,200
The first column is seconds since midnight, with the fractional part after the decimal point carrying the sub-second portion. You can save this in a csv file and in gnuplot do:
set datafile separator ','
plot "test.csv" using 1:3 with lines
I originally misunderstood your problem. I think finer resolution in the time format is a real limitation of gnuplot, and one that, to my knowledge, has not been implemented.
One possible work-around would be to use awk to convert your date into the number of seconds with something like
plot "<awk 'your_awk_one_liner' file1.dat" with lines
and then just do a regular double-versus-double plot and forget that it was ever a time at all (a bit like Martin's solution).
I'm afraid I'm not very good with awk, so I can't help with that bit, but these pages might help:
http://www.gnu.org/manual/gawk/html_node/Time-Functions.html and http://www.computing.net/answers/unix/script-to-convert-datetime-to-seco/3795.html.
The use of awk with gnuplot is described here: http://t16web.lanl.gov/Kawano/gnuplot/datafile3-e.html.
You could then plot a second axis (and not the data) with the correct times - something like the method used here: Is there a way to plot change of day on an hourly timescale on the x axis?
I'm afraid I don't have time to try and write a complete solution - but something reasonable should be possible.
Good luck - keep us updated if you get something working - I would be interested.
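Since the main script is already Python, here is a hedged sketch (mine, not part of the original answers) of that pre-processing step done in Python instead of awk: it rewrites each timestamp as fractional seconds since the first sample, after which gnuplot can plot plain doubles. The file names and the "%Y-%m-%d-%H:%M:%S.%f" input format are assumptions based on the question.

from datetime import datetime

# Convert "timestamp,value" lines into "seconds_since_first_sample,value".
with open("timed_results.data") as src, open("timed_results_seconds.data", "w") as dst:
    t0 = None
    for line in src:
        stamp, value = line.strip().split(",", 1)
        t = datetime.strptime(stamp, "%Y-%m-%d-%H:%M:%S.%f")
        if t0 is None:
            t0 = t
        dst.write("%.6f,%s\n" % ((t - t0).total_seconds(), value))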
