Histogram of times from a CSV via Pandas

I am analysing race results from a CSV which looks like this:
Position,Time,Race #,Batch,Name,Surname,Category,Sex,Age
1,00:25:04,58,E,Luke,Schlebusch,Junior,Male,17
2,00:25:16,92,E,Anrich,Zimmermann,Junior,Male,17
3,00:26:27,147,E,Ryan,Mathaba,Open,Male,33
4,00:26:58,53,E,Daniel,Rademan,Junior,Male,16
5,00:27:17,19,E,Werner,Du Preez,Open,Male,29
6,00:27:44,148,E,Mazu,Ndandani,Open,Male,37
7,00:27:45,42,E,Dakota,Murphy,Open,Male,20
8,00:28:29,56,E,David,Schlebusch,Master,Male,51
9,00:28:32,52,E,Caleb,Rademan,Minimee,Male,12
I am using the following call to read_csv to parse this into a Pandas dataframe:
race1 = pandas.read_csv('data.csv', parse_dates=['Time'], index_col='Time')
This enables me to plot a cumulative distribution of race times very easily by just doing:
race1.Position.plot()
Pandas handles all the intricacies of the date data type and makes a nice x axis with proper formatting of the times.
Is there an elegant way of getting a histogram of times which is similarly straightforward? Ideally, I would like to be able to do race1.index.hist() or race1.index.to_series().hist(), but I know that doesn't work.
I've been able to coerce the time to a timedelta and get a working result with
times = race1.index.to_series()
((times - times[0]).dt.seconds/60).hist()
This produces a histogram of the correct shape, but obviously with wrong x values (they are off by the fastest time).
Is there an elegant way to read the column as a timedelta to begin with, and is there a better way of creating the histogram, including proper ticks? Proper ticks here means that they use the correct locator and update properly.

This appears to work pretty well, although I would be happier with it if it didn't go through the Matplotlib date specifics regarding ordinal dates.
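import pandas
import matplotlib.pyplot as plt

times = race1.index.to_series()
# A time-only string parses to today's date at that time, i.e. midnight here
today = pandas.Timestamp('00:00:00')
timedelta = times - today
# Matplotlib date numbers are in days, so convert seconds to a fraction of a day
times_ordinal = timedelta.dt.seconds/(24*60*60) + today.toordinal()
ax = times_ordinal.hist()
ax.xaxis_date()
plt.gcf().autofmt_xdate()
plt.ylabel('Number of finishers')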
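On the question of reading the column as a timedelta to begin with, one possible sketch is to load the column as plain text and convert it with pandas.to_timedelta, then histogram the times in minutes. The helper names load_race, finish_minutes and format_hms are hypothetical, and the inline CSV is just the first rows of the sample data from the question.

```python
import io

import pandas as pd

# First rows of the sample data from the question, inlined for a self-contained run
CSV_TEXT = """Position,Time,Race #,Batch,Name,Surname,Category,Sex,Age
1,00:25:04,58,E,Luke,Schlebusch,Junior,Male,17
2,00:25:16,92,E,Anrich,Zimmermann,Junior,Male,17
3,00:26:27,147,E,Ryan,Mathaba,Open,Male,33
"""


def load_race(source):
    """Read the CSV and turn the HH:MM:SS strings into timedeltas."""
    race = pd.read_csv(source)
    race["Time"] = pd.to_timedelta(race["Time"])
    return race


def finish_minutes(race):
    """Finishing times as floating point minutes, ready for a histogram."""
    return race["Time"].dt.total_seconds() / 60


def format_hms(minutes, _pos=None):
    """Tick formatter: turn a value in minutes back into HH:MM:SS."""
    total = int(round(minutes * 60))
    hours, rem = divmod(total, 3600)
    mins, secs = divmod(rem, 60)
    return f"{hours:02d}:{mins:02d}:{secs:02d}"


race = load_race(io.StringIO(CSV_TEXT))
minutes = finish_minutes(race)
```

To draw the histogram, something like ax = minutes.hist() followed by ax.xaxis.set_major_formatter(matplotlib.ticker.FuncFormatter(format_hms)) should label the x axis with real finishing times instead of offsets from the fastest runner. This avoids Matplotlib's date machinery, at the cost of using a plain numeric locator rather than a time-aware one.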

Related

Plot each year of a time series on the same x-axis

I have a time series with daily data that I want to plot to see how it evolves over a year. I want to compare how it evolves over the year compared to previous years. I have written the following code in Python:
xindex = data['biljett'].index.month*30 + data['biljett'].index.day
plt.plot(xindex, data['biljett'])
plt.show()
The graph looks as follows:
[Figure: graph of how the data evolves over a year compared to previous years]
The line is continuous and does not end with the end of the year, which makes it fuzzy. What am I doing wrong?

Technically, this happens because your data points are not sorted by date, so the line goes back and forth to connect the points in data frame order. Sort the data by xindex and you're good to go. To do that, first add xindex to the data frame as a new column, then sort:
data['xindex'] = xindex
data = data.sort_values(by='xindex').reset_index(drop=True)
From a visualization perspective, I think you might have several values for each day count, so a line plot is not a good option to begin with. IMHO you'd want to try plt.scatter() to visualize your data in a better way.

I have rewritten as follows:
xindex = data['biljett'].index.month*30 + data['biljett'].index.day
data['biljett'].sort_values('xindex').reset_index(drop=True)
plt.plot(xindex, data['biljett'])
plt.show()
but I get the following error message:
ValueError: No axis named xindex for object type
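The error message suggests that the string 'xindex' was interpreted as the axis argument: data['biljett'] is a Series, which has no columns to sort by. A different approach, sketched below under assumptions, is to split the series by calendar year and index each piece by day of year, so that every year becomes its own line. The data frame here is made-up daily data standing in for the real one, and split_by_year is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up daily data standing in for the asker's frame (an assumption)
index = pd.date_range("2015-01-01", "2017-12-31", freq="D")
data = pd.DataFrame({"biljett": np.arange(len(index), dtype=float)}, index=index)


def split_by_year(series):
    """Return {year: Series indexed by day of year}, each sorted by day."""
    per_year = {}
    for year, chunk in series.groupby(series.index.year):
        by_day = pd.Series(chunk.to_numpy(), index=chunk.index.dayofyear)
        per_year[int(year)] = by_day.sort_index()
    return per_year


per_year = split_by_year(data["biljett"])
```

Plotting is then one call per year, for example: for year, s in per_year.items(): plt.plot(s.index, s.values, label=str(year)). Because each year is drawn separately, no line connects 31 December to 1 January of the following year.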

reportlab LinePlot axis with date and time

I want to visualize data in a LinePlot using reportlab. The data has x-axis values (timestamps) of the form YYYYMMDDHHMMSS. I know that a reportlab x-axis class NormalDateXValueAxis exists, but it only takes dates (YYYYMMDD) and does not allow a time of day.
One question is does reportlab already support this with any class that I have not found yet?
A different approach I am trying is to simply use the timestamp string as x-axis values and define a formatter for these values. An example is:
from reportlab.graphics.charts.lineplots import LinePlot
from reportlab.graphics.shapes import Drawing, _DrawingEditorMixin
from datetime import datetime

def formatter(val):
    dtstr = str(int(val))
    print(dtstr)
    dt = (datetime.strptime(str(int(val)), "%Y%m%d%H%M%S")).strftime("%d.%m.%Y %H:%M:%S")
    return dt

class Test(_DrawingEditorMixin, Drawing):
    def __init__(self,width=258,height=150,*args,**kw):
        Drawing.__init__(self,width,height,*args,**kw)
        # font
        fontSize = 7
        # chart
        self._add(self,LinePlot(),name='chart',validate=None,desc=None)
        self.chart.y = 16
        self.chart.x = 32
        self.chart.width = 212
        self.chart.height = 90
        # x axis
        self.chart.xValueAxis.labels.fontSize = fontSize-1
        self.chart.xValueAxis.labelTextFormat = formatter
        # y axis
        self.chart.yValueAxis.labels.fontSize = fontSize -1
        # sample data
        self.chart.data = [
            [
                (20200225130120, 100),
                (20200225130125, 0),
                (20200225130130, 300),
                (20200225130135, 0),
                (20200225130140, 500),
                (20200225130145, 0),
                (20200225130150, 700),
                (20200225130155, 0),
                (20200225130315, 900)
            ]
        ]

if __name__=="__main__": #NORUNTESTS
    Test().save(formats=['pdf'],outDir='.',fnRoot=None)
But I have two problems with this approach:
The values given to the formatter are unpredictable (at least for me). Reportlab seems to modify the ticks in a way it deems best. As a result, there are sometimes values that are not valid timestamps and can't be parsed by datetime. I sometimes got the exception that seconds must be between 0 and 59. Reportlab created a tick with value 20200225136000.
Since the x axis does not know that these values are timestamps, it still leaves room for 20200225135961, 20200225135965, etc. The result is a gap in the graph.

"One question is does reportlab already support this with any class that I have not found yet?"
Not that I know of, but I think what you want can be achieved with ValueAxis. If you can change the library, I suggest using matplotlib, as I've seen working examples with it. You can also check whether PyX (which is also a good alternative to ReportLab) handles such scenarios, but I didn't find anything.

According to the documentation, the lineplots.py file contains a class called SimpleTimeSeriesPlot(LinePlot).
Looking at it, its .xValueAxis reads your x values as dates (how much flexibility this has I am not sure, as I haven't tested it, but it's worth trying out).
Instead of calling LinePlot() you would call SimpleTimeSeriesPlot(), keep the rest of that line the same, and configure .xValueAxis in your code.
You can also specify the minimum and maximum dates with .xValueAxis.valueMin and .xValueAxis.valueMax.
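Another way to sidestep both problems from the question might be to stop feeding the YYYYMMDDHHMMSS integers to the axis directly and convert them to seconds since the epoch first. The axis is then linear in real time, so there are no gaps, and every tick value reportlab picks is a valid instant that can be formatted. The sketch below uses only the standard library; to_epoch and format_tick are hypothetical names, and treating the timestamps as UTC is an assumption made to keep the round trip deterministic.

```python
from datetime import datetime, timezone


def to_epoch(stamp):
    """Convert a YYYYMMDDHHMMSS integer into seconds since the epoch (UTC assumed)."""
    parsed = datetime.strptime(str(stamp), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def format_tick(value):
    """Format any epoch value chosen by the axis as a readable timestamp."""
    moment = datetime.fromtimestamp(value, tz=timezone.utc)
    return moment.strftime("%d.%m.%Y %H:%M:%S")


# A few of the sample points from the question
raw = [(20200225130120, 100), (20200225130155, 0), (20200225130315, 900)]
points = [(to_epoch(x), y) for x, y in raw]
```

In the drawing from the question, this would mean setting self.chart.data = [points] and self.chart.xValueAxis.labelTextFormat = format_tick. The tick positions are still chosen by reportlab, so the labels may not fall on round minutes.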

Plotting txt file with date, time and value data

I am pretty new to Python programming and have already been confronted with a problem that drives me insane. I kept on searching for a solution, even here at Stack Overflow. However, I didn't find anything that solved my problem, which made me sign up for this site.
Nevertheless, this is my problem:
I have several txt files that contain 3 columns. The first one can be neglected, the second one contains a mixture of date and time, separated by the letter "T", and the third column contains the value (pressure, temperature, or whatever).
Now, what I want to do is plot the second column (time and date) on the x axis and the value on the y axis.
I've tried MANY pieces of code, some of them described here at Stack Overflow, but none of them was what I was looking for or produced the right results.
In more detail, this is what my txt files look like:
# MagPy ASCII
234536326.456,2014-06-17T14:23:00.000000,459.7463393940044
674346235.235,2014-06-17T14:28:00.000000,462.8783040474751
and so on.
Forget about the first column. Only the second and third one are relevant. So here, I guess, I have to skip the first line (and the first column), right?
HOWEVER - and here comes the part I cannot solve - with this "T" inside the second column, this becomes a string format.
One of the many errors I get is: could not convert string to float
Well, I searched stack overflow and came across the following code:
x, y = np.loadtxt('example.txt', dtype=int, delimiter=',',
unpack=True, usecols=(1,2))
plt.plot(x, y)
plt.title('example 1')
plt.xlabel('D')
plt.ylabel('Frequency')
plt.show()
I edited the "usecols" to 1 and 2, but with this code, I get the error: list index out of range
So, no matter what I do, I get an error every time. And the only thing I want is a plot (with matplotlib) that contains time and date on the x axis and the value (e.g. 459.7463393940044 from above) on the y axis.
And talking about what I need: At the end, I have to put several diagrams (about 4-6), that were generated with MANY txt file data, in one figure.
Please, can anyone help me with this? I'd appreciate your help a lot!

This is numpy datetime format. You need to add an explicit converter for the datetime field. The documentation contains additional format details. This question shows how to fill in the converters argument.
# Note: mdates.strpdate2num has been removed from recent Matplotlib releases
date, closep, highp, lowp, openp, volume = np.loadtxt(
    f, delimiter=',', unpack=True,
    converters={0: mdates.strpdate2num('%d-%b-%y')})
Is that enough to lead you to a full solution?

First, thanks for your response! Unfortunately, this doesn't work at all. I tried your converter and combined it with my code, but it didn't work out well. I then tried this code:
# Converter function
datefunc = lambda x: mdates.date2num(datetime.strptime(x, '%d %m %Y %H %M %S'))
# Read data from 'file.dat'
dates, levels = np.genfromtxt('BMP085_10085001_0001_201508.txt', # Data to be read
delimiter=19, # First column is 19 characters wide
converters={1: datefunc}, # Formatting of column 0
dtype=float, # All values are floats
unpack=True) # Unpack to several variables
fig = plt.figure()
ax = fig.add_subplot(111)
# Configure x-ticks
ax.set_xticks(dates) # Tickmark + label at every plotted point
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d/%m/%Y %H:%M'))
ax.plot_date(dates, levels, ls='-', marker='o')
ax.set_title('title')
ax.set_ylabel('Waterlevel (m)')
ax.grid(True)
# Format the x-axis for dates (label formatting, rotation)
fig.autofmt_xdate(rotation=45)
fig.tight_layout()
fig.show()
And I get a list index out of range error again. Seriously, I do not know of any possible solution to make it look like this in the end: {Plot date and time (x axis) versus a value (y axis) using data from file} - see the diagram at the bottom of the page.

I've had success doing the following. First, it's ok to read in your ISO8601 date as a string (so a basic read a line and split on comma will work). To convert the date string to a datetime object you can use
import dateutil.parser
import matplotlib.pyplot as plt
# code to read in date_strings and my_values as lists goes here ...
# Here's the magic to parse the ISO8601 strings and make them into datetime objects
# (parse() takes one string at a time, so loop over the list)
my_dates = [dateutil.parser.parse(s) for s in date_strings]
# Now plot the results
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot_date(x=my_dates, y=my_values, marker='o', linestyle='')
ax.set_xlabel('Time')
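The reading step that the answer leaves out could look roughly like the sketch below. It uses only the standard library: the csv module splits each line on commas, the "# MagPy ASCII" header line is skipped, and datetime.fromisoformat parses the "T"-separated timestamp. The function name read_series is hypothetical, and the inline sample stands in for one of the txt files.

```python
import csv
import io
from datetime import datetime

# The two sample rows from the question, inlined in place of a real file
SAMPLE = """# MagPy ASCII
234536326.456,2014-06-17T14:23:00.000000,459.7463393940044
674346235.235,2014-06-17T14:28:00.000000,462.8783040474751
"""


def read_series(handle):
    """Return (dates, values) from the second and third columns of the file."""
    dates, values = [], []
    for row in csv.reader(handle):
        # Skip blank lines and the '# MagPy ASCII' comment header
        if not row or row[0].startswith("#"):
            continue
        dates.append(datetime.fromisoformat(row[1]))
        values.append(float(row[2]))
    return dates, values


dates, values = read_series(io.StringIO(SAMPLE))
```

With a real file, the call would be something like with open(path) as handle: dates, values = read_series(handle). The resulting lists can be passed straight to ax.plot(dates, values), and for several files, one call per subplot created with plt.subplots() should give the combined figure.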

Plotting dates with plot_date

I have a bunch of dates in the form MM / DD / YYYY. I use a parser to get them into something that looks like datetime.datetime(YYYY, MM, DD, 0, 0).
I have data corresponding to each date in an array y and would like to plot the two arrays against one another. Using matplotlib.dates.date2num I plot them as
x= matplotlib.dates.date2num(dates)
plot_date(dates,y)
When I do this, I get the following plot [figure not shown], whereas I would prefer something that looks like a time series.
How can I fix this?

Presumably the dates and data that you're reading in are not in time order, in which case you'll need to sort them both before they're passed to plot_date():
dates = np.asarray(dates)  # index arrays only work on NumPy arrays, not lists
y = np.asarray(y)
s = np.argsort(dates)
plot_date(dates[s],y[s])
This should work whether you sort on the list/array of datetime instances (your dates) or their numerical equivalents (your x).
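If the dates and values are plain Python lists rather than NumPy arrays, the same reordering can be sketched without NumPy by sorting the pairs together. The three sample points below are made up for illustration.

```python
from datetime import datetime

# Made-up, deliberately unordered sample data
dates = [datetime(2020, 3, 1), datetime(2020, 1, 1), datetime(2020, 2, 1)]
y = [30.0, 10.0, 20.0]

# Sort the (date, value) pairs by date, then split them back into two lists
pairs = sorted(zip(dates, y), key=lambda pair: pair[0])
sorted_dates, sorted_y = map(list, zip(*pairs))
```

Passing sorted_dates and sorted_y to the plotting call with a line style should then connect the points in time order.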

Plotting Pandas DataFrames as single days on the x-axis in Python/Matplotlib

I've got data like this:
col1 ;col2
2001-01-01;1
2001-01-01;2
2001-01-02;3
2001-01-03;4
2001-01-03;2
2001-01-04;2
I'm reading it in Python/Pandas using pd.read_csv(...) into a DataFrame.
Now I want to plot col2 on the y-axis and col1 on the x-axis day-wise. I searched a lot but couldn't find many useful pages describing this in detail. I found that matplotlib currently does NOT support the data format in which the dates are stored (datetime64).
I tried converting it like this:
fig, ax = plt.subplots()
X = np.asarray(df['col1']).astype(DT.datetime)
xfmt = mdates.DateFormatter('%b %d')
ax.xaxis.set_major_formatter(xfmt)
ax.plot(X, df['col2'])
plt.show()
but this does NOT work.
What is the best way?
I can only find bits here and there, but nothing complete that really works and, more importantly, no up-to-date resources on this functionality for the latest versions of pandas/numpy/matplotlib.
I'd also be interested in converting these absolute dates to consecutive day indices, i.e.:
The starting day 2001-01-01 is Day 1, thus the data would look like this:
col1 ;col2 ; col3
2001-01-01;1;1
2001-01-01;2;1
2001-01-02;3;2
2001-01-03;4;3
2001-01-03;2;3
2001-01-04;2;4
.....
2001-02-01;2;32
Thank you very much in advance.

pandas.read_csv supports parse_dates=True (the default is, of course, False). That would save you converting the dates separately.
Also, for a simple dataframe like this, the pandas plot() function works perfectly well.
Example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dates = pd.date_range('20160601',periods=4)
dt = pd.DataFrame(np.random.randn(4,1),index=dates,columns=['col1'])
dt.plot()
plt.show()

Ok, as far as I can see there's no need anymore to use matplotlib directly. Instead, pandas itself already offers plotting functions which can be used as methods on the dataframe objects, see http://pandas.pydata.org/pandas-docs/stable/visualization.html. These functions themselves use matplotlib, but are easier to use because they handle the datatypes correctly themselves :-)
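The second part of the question, turning absolute dates into consecutive day indices, is not covered by the answers. A minimal sketch, assuming the file is semicolon-separated with a clean col1;col2 header, is to subtract the earliest date and add one. The inline text reproduces the sample rows from the question.

```python
import io

import pandas as pd

# Sample rows from the question, with the header whitespace removed
CSV_TEXT = """col1;col2
2001-01-01;1
2001-01-01;2
2001-01-02;3
2001-01-03;4
2001-01-03;2
2001-01-04;2
2001-02-01;2
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), sep=";", parse_dates=["col1"])

# Day 1 is the earliest date, so the index is the day offset plus one
df["col3"] = (df["col1"] - df["col1"].min()).dt.days + 1
```

After this, df.plot(x="col3", y="col2", style="o") would be one way to plot the values against the day index. Markers may be preferable to a line here, since several rows share the same day.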
