Plotting and handling date/time data with Pandas


Currently, I am working on a project and would like to plot data from a logger on a daily basis. The output is a .csv file with a Date/Time stamp in one column,
e.g. 2018-10-15 10:00. The other columns contain only float data. The stamps are written automatically at 10-minute intervals from 00:00 until 23:50.
I want to analyze the data by grouping it by day using groupby() and then computing the mean and standard deviation for each day. I want to plot the daily mean and standard deviation for several years as a scatter or line graph, with years or months as major ticks and days as minor ticks.
On a daily basis, I want to compare the variation of the mean within a certain month and plot it against the entire time interval, with hours as major ticks and the 10-minute intervals as minor ticks. I want to be able to put this in a for loop if possible.
To be honest, I've tried a lot of different approaches, but I can't achieve everything with any single one. If possible, I would rather not use set_index() on the Date/Time column, so that it is easier to apply the grouping. I am using Pandas for the whole analysis for convenience.
I would be really happy for any guidance.
Thank you very much!

Just a couple of pointers:
When reading the csv with pd.read_csv (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) you can specify which columns contain date/times:
df = pd.read_csv('myfile.csv', parse_dates=['date'])
Then you can use .dt to access date/time specific features, see: https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties
So you can add a column with only day numbers, like:
df['day'] = df['date'].dt.dayofyear
Then you can group by this new column.
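As a minimal sketch of the grouping step, assuming a timestamp column named 'date' and a single float column named 'value' (both names, and the synthetic data, are made up for illustration): note that dt.dayofyear gives the same number for the same day in different years, so for data spanning several years it may be safer to group on the calendar day itself, for example with dt.normalize(). This keeps 'date' as an ordinary column, so set_index() is not needed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the logger file: 10-minute stamps over three days.
# With a real file this would be pd.read_csv('myfile.csv', parse_dates=['date']).
stamps = pd.date_range("2018-10-15 00:00", "2018-10-17 23:50", freq="10min")
df = pd.DataFrame({"date": stamps, "value": np.arange(len(stamps), dtype=float)})

# Group on the calendar day (timestamp truncated to midnight); 'date' stays a column.
daily = df.groupby(df["date"].dt.normalize())["value"].agg(["mean", "std"])
print(daily)
```

The result has one row per day with 'mean' and 'std' columns, indexed by the day, which can then be passed to a plotting call.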

Related

Format datetime gap on Pyplot

I am trying to visualize measured data using Pyplot.
The data is stored in a dataframe from Pandas like this:
Dataframe
So I want to plot the outputs over the respective date and time.
Just plotting it works great, however there is an issue that I cannot fix.
Sometimes the measurements are stopped and resumed at a later time. That means that during some measurements there is a gap in the dates (e.g. 1 or 2 days skipped).
When I now plot the above dataframe, there is a large region that is skipped due to no data being present.
Graph with gap
Is there any way to change that, so the gap is closed and all the data is presented in a better way?

How to plot OHLC data in simulated real-time from historical tick data?

I'm working on developing a charting package that is able to take in tick data and, given a predefined time interval (timeframe), plot price movements on a candlestick chart (open, high, low, and close) in simulated real-time, allowing one to change the speed and direction of the simulation, as well as pause it. Here is an example of the desired behavior (updating the data in real-time) represented graphically:
Real-time price data plotting using a candlestick chart.
The UI aspect of the plotting is not a concern, as the graphical aspect of the project has already been created (I am using a function that takes in all the aggregated OHLC data and plots it according to timeframe). I am, however, having difficulty deciding on how to achieve the above described behavior in a performance-friendly manner (e.g. not re-creating the entire OHLC dataframe from the tick data every time a new tick is read into the method) as the dataset I am working with is quite large.
I've thought of some preliminary solutions, but all of them have their flaws and I am at a loss for how to tackle this problem. These are the ideas I've had so far:
1. Deriving OHLC data directly from the tick data with each new tick.
The idea here is that we'll have a sliding timestamp that will allow us to read in all the tick data up to the "current" time in the simulation, and then aggregate it into OHLC format using a dataframe.
Although this achieves the desired behavior, it requires recreating the dataframe containing the OHLC data on every tick, so the overhead for a large dataset would be unacceptable.
2. Pre-processing the ticks into OHLC and reading directly from the dataframe.
Similar to the above, using a sliding timestamp to read data up to the "current" time in the simulation. The difference here is that we'd be reading data directly from the OHLC dataframe after pre-processing all the tick data into it.
This allows us to easily go through the data, as it would exist in a format that the plotting method understands (OHLC), but it doesn't achieve the required behavior of simulating the movements that happen during a candle's formation.
3. Performing the first solution every n milliseconds.
Although this would reduce the overhead, it would still recreate the dataframe every n milliseconds, which would still make it infeasible for large datasets.
4. Only modifying the current candle.
The idea here is that all the historical candles will be stored in a dataframe, while the current candle is updated with every new tick until it closes (and so on with each successive candle).
While this achieves the desired simulation behavior in a more performance-friendly manner, I'm not sure how we could get rewinding to work in this context, as moving backward through the historical data wouldn't be possible since it would be in OHLC format.
I am also unsure of how we could change the timeframe using this method, as it would need to determine how far to look back and gather ticks to create a new "current" candle.
I'm working with Python, however I think this is more of a conceptual problem than a language-specific one.
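To make the "only modifying the current candle" idea concrete, here is a rough sketch in plain Python. The Candle class, the apply_tick function, the integer-second timestamps, and the sample ticks are all hypothetical names and data chosen for illustration. It covers only forward playback and does not address rewinding or changing the timeframe.

```python
from dataclasses import dataclass


@dataclass
class Candle:
    start: int    # start of the time bucket, in seconds (assumed time unit)
    open: float
    high: float
    low: float
    close: float


def apply_tick(candles, ts, price, timeframe):
    """Fold one tick into the list, touching only the most recent candle."""
    start = ts - ts % timeframe  # bucket this tick belongs to
    if candles and candles[-1].start == start:
        current = candles[-1]
        current.high = max(current.high, price)
        current.low = min(current.low, price)
        current.close = price
    else:
        # The tick opens a new candle; earlier candles are never rebuilt.
        candles.append(Candle(start, price, price, price, price))
    return candles


# Hypothetical (timestamp, price) ticks replayed with a 60-second timeframe.
ticks = [(0, 10.0), (20, 12.0), (59, 9.0), (60, 11.0), (75, 11.5)]
candles = []
for ts, price in ticks:
    apply_tick(candles, ts, price, timeframe=60)
print(candles)
```

Each tick costs constant time here, because only the last candle is read or written.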

plotting very large data in python

I am trying to plot a large set of values against time. My dataset spans 46 days and includes data for every second of the day. Since the plots are incomprehensible when plotted directly, I tried to group the data. The groupby function in pandas works fine as long as one only needs aggregates or summary statistics. I tried the following command, but it just gives a blob on the plot and does not do what I want.
df1 = df.groupby(pd.Grouper(key='time', freq='7D'))['values']
Is there a way to group the data according to a column and then add it in a new column?
I also tried plots after making time the index, but that also does not help.
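For illustration, the groupby call above returns a grouped object rather than data, so it needs an aggregation before it can be plotted. The sketch below, on made-up per-minute data over 14 days (smaller than the dataset described), shows one aggregated value per 7-day bin and, separately, transform(), which writes the group statistic back as a new column aligned with the original rows. The column name 'weekly_mean' is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up data: one value per minute for 14 days.
times = pd.date_range("2020-01-01", periods=14 * 24 * 60, freq="min")
df = pd.DataFrame({"time": times, "values": np.arange(len(times), dtype=float)})

# One aggregated row per 7-day bin; this is what would be plotted.
weekly = df.groupby(pd.Grouper(key="time", freq="7D"))["values"].mean()

# Same statistic broadcast back onto every original row as a new column.
df["weekly_mean"] = df.groupby(pd.Grouper(key="time", freq="7D"))["values"].transform("mean")

print(weekly)
print(df.head())
```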

Skip weekends on stock charts with matplotlib

This is not a duplicate, because the existing answers to similar questions don't describe exactly what I need.
Matplotlib has great formatters inside and I love to use them:
ax.xaxis.set_major_locator(matplotlib.dates.MonthLocator())
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b%y'))
They let me plot such stock market charts:
This is what I need, but it has one issue: weekends. They are present on the x-axis and make my chart a little ugly.
Other questions about this issue advise creating a custom formatter, and they show examples of such formatters. But none of them produces the nice formatting that matplotlib does:
May19, Jun19, Jul19...
I mean this line of code:
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b%y'))
My question is: how can I format the x-axis the way matplotlib does (May19, Jun19, Jul19...) without showing weekends, when the stock market is closed?

What you could almost always do is something similar to what Nic Wanavit suggested.
Manually set your labels, depending on what you need on your axis.
In this case the plot looks a bit ugly because there are time spans with no actual data (the weekends), so pyplot simply connects the points on either side across the corresponding distance on the x-axis.
What you can do then is plot your data equally spaced, which is correct if the data is daily; otherwise consider interpolating it, e.g. with pandas' built-in interpolation.
To keep pyplot from automatically using the date index, I had to do this:
df['plotidx'] = range(len(df['close']))
Here all the closing values for the stock are stored in a column named 'close', obviously.
Then plot against this column.
Then you can obtain all the ticks created via
labels = [item.get_text() for item in ax.get_xticklabels()]
Adjust them as desired with
labels[i] = string_for_the_label_no_i
Then get them back on the graph using
ax.xaxis.set_ticklabels(labels)
You then need to "update" (redraw) the plot. Also keep in mind that resizing the figure a lot could leave the labels in strange locations, as the documentation also says.
It is a bit of a workaround, but it worked fine for me because it feels natural to plot the data points equally spaced next to each other rather than making up data for the weekends.
Greets
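A possible variant of this workaround, sketched under assumptions: instead of editing the label strings by hand, plot against the integer 'plotidx' column and let a matplotlib FuncFormatter translate each integer position back into its date with the same '%b%y' pattern. This swaps in a formatter for the manual relabeling described above. The business-day data and the position_to_label helper are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import pandas as pd

# Made-up prices on business days only, so weekends are absent from the data.
dates = pd.bdate_range("2019-05-01", "2019-07-31")
df = pd.DataFrame({"close": np.linspace(100, 120, len(dates))}, index=dates)
df["plotidx"] = range(len(df))


def position_to_label(x, pos=None):
    """Map an integer x position to its trading date, formatted like '%b%y'."""
    i = int(round(x))
    if 0 <= i < len(df):
        return df.index[i].strftime("%b%y")
    return ""


fig, ax = plt.subplots()
ax.plot(df["plotidx"], df["close"])  # equally spaced points, no weekend gaps

# One major tick at the first trading day of each month.
month_starts = df.groupby(df.index.to_period("M"))["plotidx"].min()
ax.set_xticks(month_starts.to_numpy())
ax.xaxis.set_major_formatter(FuncFormatter(position_to_label))
# fig.savefig("no_weekends.png")  # uncomment to write the chart to disk
```

Because the labels are computed from the tick positions, they stay consistent when the figure is resized or redrawn.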

To set the x ticks, assuming that you have the dates in the dataframe column df['dates']:
ax.xaxis.set_ticks(df['dates'])

Select right label for x axis in case of Multiindex dataframe with python pandas

First of all merry x-mas to you all :). Hope you are doing well and relaxing.
But as you can see, some people have to work. And there is the point where I need your help :)
I have a very messy CSV file with a lot of data, which I managed to clean up to make it easier to read and work with. But now I am stuck.
I created a MultiIndex with level=0 as date and level=1 as time, and a column c as the attribute, as you can see in the image.
I would like to plot this information, with date on the x-axis and the attribute on the y-axis. However, I am stuck here:
BMW_Clean_Data.loc["07.07.2018"::,]["Atteibute"].plot()
If I plot this I get the following
As you can see, on the x-axis I get a date-and-time format. How can I select only the date, or only the time, for the x-axis?
I am grateful for any kind of help.
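One way this might be approached, as a sketch on made-up data: a Series with a (date, time) MultiIndex can be reduced to a single index level before plotting, either by dropping the level that is not wanted or by selecting one date with xs(). The level names 'date' and 'time' and the column name 'Attribute' are assumptions, since the real frame is only shown as an image.

```python
import pandas as pd

# Made-up frame with a (date, time) MultiIndex similar to the one described.
idx = pd.MultiIndex.from_product(
    [["07.07.2018", "08.07.2018"], ["10:00", "11:00"]],
    names=["date", "time"],
)
df = pd.DataFrame({"Attribute": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Option A: keep only the date level as the index (x-axis shows dates).
by_date = df["Attribute"].droplevel("time")

# Option B: pick one date, leaving the time level as the index (x-axis shows times).
one_day = df["Attribute"].xs("07.07.2018", level="date")

print(by_date)
print(one_day)
```

Calling .plot() on either result would then use the remaining level for the x-axis.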
