Plotting multiple timeseries power data using matplotlib and pandas


I have a csv file of power levels at several stations (4 in this case, though "HUT4" is not in this short excerpt):
2014-06-21T20:03:21,HUT3,74
2014-06-21T21:03:16,HUT1,70
2014-06-21T21:04:31,HUT3,73
2014-06-21T21:04:33,HUT2,30
2014-06-21T22:03:50,HUT3,64
2014-06-21T23:03:29,HUT1,60
(etc . .)
The times are not synchronised across stations. The power level is (in this case) integer percent. Some machines report in volts (~13.0), which would be an additional issue when plotting.
The data is easy to read into a dataframe, to index, or to put into a dictionary, but I can't get the right syntax to make a meaningful plot: either all stations on a single plot sharing a timeline that's big enough for all stations, or separate plots, maybe a subplot for each station. If I do:
import pandas as pd
df = pd.read_csv('Power_Log.csv',names=['DT','Station','Power'])
df2=df.groupby(['Station']) # set 'Station' as the data index
d = dict(iter(df2)) # make a dictionary including each station's data
for stn in d.keys():
    d[stn].plot(x='DT',y='Power')
plt.legend(loc='lower right')
plt.savefig('Station_Power.png')
I do get a plot but the X axis is not right for each station.
I have not figured out yet how to do four independent subplots, which would free me from making a wide-enough timescale.
I would greatly appreciate comments on getting a single plot right and/or getting good looking subplots. The subplots do not need to have synchronised X axes.

I'd rather plot the typical way, something like:
import matplotlib.pyplot as plt
plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.axis([0, 6, 0, 20])
plt.savefig('figure.png')  # savefig needs a file name
( http://matplotlib.org/users/pyplot_tutorial.html )
Re more plots on the same axes: simply call plt.plot() multiple times, once for each data series.
P.S. you can set xticks this way: Changing the "tick frequency" on x or y axis in matplotlib?

Sorry for the comment above where I needed to add code. Still learning . .
From the 5th code line:
import matplotlib.dates as mdates
for stn in d.keys():
    plt.figure()
    d[stn].interpolate().plot(x='DT',y='Power',title=stn,rot=45)
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%D/%M/%Y'))
    plt.savefig('Station_Power_'+stn+'.png')
Does more or less what I want to do except the DateFormatter line does not work. I would like to shorten my datetime data to show just date. If it places ticks at midnight that would be brilliant but not strictly necessary.
The key to getting a continuous plot is to use the interpolate() method in the plot.
With this data having different x scales from station to station a plot of all stations on the same graph does not work. HUT4 in my data has far fewer records and only plots to about 25% of the scale even though the datetime values cover more or less the same range as the other HUTs.
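For anyone landing here later, this is a rough sketch of one way the per-station subplots and date-only ticks could be done. It is not the code from the posts above. It assumes the DT column is parsed as real datetimes (parse_dates) and it plots with matplotlib directly rather than DataFrame.plot. Note that in strftime codes %d is the day and %m the month, while %M means minutes. The inline CSV text just stands in for Power_Log.csv.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# The excerpt from the question stands in for Power_Log.csv
csv_text = """\
2014-06-21T20:03:21,HUT3,74
2014-06-21T21:03:16,HUT1,70
2014-06-21T21:04:31,HUT3,73
2014-06-21T21:04:33,HUT2,30
2014-06-21T22:03:50,HUT3,64
2014-06-21T23:03:29,HUT1,60
"""

# parse_dates turns the DT strings into datetime64 values
df = pd.read_csv(io.StringIO(csv_text),
                 names=["DT", "Station", "Power"],
                 parse_dates=["DT"])

# One DataFrame per station
groups = {stn: grp for stn, grp in df.groupby("Station")}

# One subplot per station, each with its own (unshared) x-axis
fig, axes = plt.subplots(nrows=len(groups), sharex=False, squeeze=False,
                         figsize=(8, 2.5 * len(groups)))
for ax, stn in zip(axes[:, 0], sorted(groups)):
    grp = groups[stn]
    ax.plot(grp["DT"], grp["Power"], marker="o")
    ax.set_title(stn)
    # One tick per day (at midnight), labelled as day/month/year.
    # The short excerpt spans only a few hours, so it may show no ticks;
    # a log covering several days gets one tick per midnight.
    ax.xaxis.set_major_locator(mdates.DayLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%d/%m/%Y"))

fig.tight_layout()
fig.canvas.draw()
# fig.savefig("Station_Power_subplots.png")  # uncomment to write the image
```

Because each station gets its own axes, a station with few records (like HUT4) still fills its own panel instead of being squeezed into part of a shared scale.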

Related

How to create a stacked barchart for a large dataset in Python?

For research purposes at my university, I need to create a stacked bar chart for speech data. I would like to represent the hours of speech on the y-axis and the frequency on the x-axis. The speech comes from different components, hence the stacked part of the chart. The data resides in a Pandas dataframe, which has a lot of columns, but the important ones are "component", "hours" and "ps_med_frequency" which are used in the graph.
A simplified view of the DF (it has 6.2k rows and 120 columns, a-k components):
component  filename   ps_med_freq (rounded to integer)  hours (length)  ...
a          fn0001_ps  230                               0.23
b          fn0002_ps  340                               0.12
c          fn003_ps   278                               0.09
I have already tried this with matplotlib, seaborn or just the plot method from the Pandas dataframe itself. None seem to work properly.
A snippet of seaborn code I have tried:
sns.barplot(data=meta_dataframe, x='ps_med_freq', y='hours', hue='component', dodge=False)
And basically all variations of this as well.
Below you can see one of the most "viable" results I've had so far:
[image: example of failed graph]
It seems to have a lot of inexplicable grey blobs, which I first attributed to the large dataset, but if I just plot it as a histogram and count the frequencies instead of showing them by hour, it works perfectly fine. Does anyone know a solution to this?
Thanks in advance!
P.S.: Yes, I realise this is a huge dataset and at first sight, the graph seems useless with that much data on it, but matplotlib has interactive graphs where you can zoom etc., where this kind of graph becomes useful for my purpose.

With sns.barplot you're creating a bar for each individual frequency value. You'll probably want to group similar frequencies together, as with sns.histplot(..., multiple='stack'). If you want a lot of detail, you can increase the number of bins for the histogram. Note that sns.barplot never creates stacks, it would just plot each bar transparently on top of the others.
You can create a histogram, using the hours as weights, so they get summed.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some suitable random test data
np.random.seed(20230104)
component_prob = np.random.uniform(0.1, 2, 7)
component_prob /= component_prob.sum()
df = pd.DataFrame({'component': np.random.choice([*'abcdefg'], 6200, p=component_prob),
                   'ps_med_freq': (np.random.normal(0.05, 1, 6200).cumsum() + 200).astype(int),
                   'hours': np.random.randint(1, 39, 6200) * .01})
# create bins for every range of 10, suitably rounded, and shifted by 0.001 to avoid floating point roundings
bins = np.arange(df['ps_med_freq'].min() // 10 * 10 - 0.001, df['ps_med_freq'].max() + 10, 10)
plt.figure(figsize=(16, 5))
ax = sns.histplot(data=df, x='ps_med_freq', weights='hours', hue='component', palette='bright',
                  multiple='stack', bins=bins)
# sns.move_legend(ax, loc='upper left', bbox_to_anchor=(1.01, 1.01)) # legend outside
sns.despine()
plt.tight_layout()
plt.show()
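To see what "hours as weights" means numerically, here is a tiny sketch with plain numpy. The first three rows are the ones from the table in the question; the fourth row (233 Hz, 0.10 h) is made up so that one bin receives two values. The bin edges are chosen arbitrarily for illustration.

```python
import numpy as np

# ps_med_freq and hours; the last entry is a hypothetical extra row
freq = np.array([230, 340, 278, 233])
hours = np.array([0.23, 0.12, 0.09, 0.10])

# Bin edges 200, 250, 300, 350
bins = np.arange(200, 351, 50)

# Without weights each row counts as 1; with weights the hours are summed per bin
counts, _ = np.histogram(freq, bins=bins)
totals, _ = np.histogram(freq, bins=bins, weights=hours)

print(counts)  # rows per bin
print(totals)  # summed hours per bin
```

The 200-250 bin holds two rows, and its weighted height is 0.23 + 0.10 hours rather than a count of 2. sns.histplot with weights='hours' does the same summing per bin and per hue level.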

How do I create a clear line plot with a large number of values (Pandas)?

I have a pandas DataFrame of 8664 rows which contains the following columns of importance to me: EASTVEL, NORTHVEL, Z_NAP, DATE+TIME. Definitions of the columns are:
EASTVEL = Flow of current, where negative (-) values are west and positive (+) values are east.
NORTHVEL = Flow of current, where negative (-) values are south and positive (+) values are north.
Z_NAP = Depth of water
DATE+TIME = Date + time in this format: 2021-11-17 10:00:00
Now the problem that I encounter is the following: I want to generate a plot with EASTVEL on the x-axis and Z_NAP on the y-axis within the timeframe of 2021-11-17 10:00:00 until 2021-11-17 12:00:00 (I already created a dataframe frame_3_LW that only contains those values). However, because I have so many values I get a plot like the one you see below. I would like just one line describing the course of EASTVEL against Z_NAP, which would be much clearer. Can anybody help me with that?

Well, you've already gotten down to the problem itself. Your code is fine; the problem is that you have many points, and this kind of visualization doesn't work well if the variables change too much or if there are too many points. You could try plotting every 5th point (or something like that), but I doubt it would improve the graph.
Even though it's not directly what you asked, I suggest you either:
- plot both variables independently on the same graph, or
- plot the ratio of the two variables, so that a single line describes the relationship between them.
For example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.DataFrame({'east_vel': np.random.randint(0, 5, 100),
                   'z_nap': np.random.randint(5, 10, 100),
                   'time': np.arange(100)},
                  )
# option 1, plot both variables
plt.figure()
plt.plot(df['time'], df['east_vel'], color='blue', label='east_vel')
plt.plot(df['time'], df['z_nap'], color='orange', label='z_nap')
plt.title('Two separate lines')
plt.legend()
# option 2, plot ratio between variables
plt.figure()
plt.plot(df['time'], df['east_vel']/df['z_nap'], label='east_vel vs. z_nap')
plt.title('Ratio between east_vel and z_nap')
plt.legend()
plt.show()
Here's the output of the code:
Even for a relatively small amount of points (100), plotting them one against each other would be very messy:
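For completeness, the "every 5th point" idea mentioned earlier can be sketched like this. The column names and random data are the same kind of stand-ins as in the example above, not the asker's real frame_3_LW. Whether the thinned plot is any clearer depends on the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'east_vel': rng.integers(0, 5, 100),
                   'z_nap': rng.integers(5, 10, 100),
                   'time': np.arange(100)})

# Keep rows 0, 5, 10, ... (every 5th row by position)
thinned = df.iloc[::5]

print(len(df), len(thinned))
```

The thinned frame can then be plotted exactly like the full one, for example plt.plot(thinned['east_vel'], thinned['z_nap']).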

Matplotlib best practices for automatically spacing out/omitting overlapping tick labels and annotations

I've decided to use Matplotlib in one of my projects which involves having to automatically generate graphs and slapping them onto reports.
Trying to make matplotlib graphs attractive to the eye is something that's been a lot of fun - however there's still just 1 little bit I'm somewhat stuck on!
Right now, I have an issue in cases where there are tonnes of data points. The problem occurs when the x-axis ticks and the annotations overlap!
With few datapoints, the graph is very pretty:
However, in edge cases with very large amount of datapoints, it gets completely messed up:
What I'd basically like Matplotlib to do is to use some kind of determination to make sure that no other annotation element is within range when it applies an annotation. Same concept for the x-axis ticks!
The solutions I've ruled out are things like the x-axis showing every other tick, since in some cases it's possible that even just the first 3 ticks are very close to each other!

You can control the x-axis ticks with the set_xticks and set_xticklabels methods. They only change the axis, so your data won't be affected.
Here is an example. I've generated a list called days which contains all days in 2019 and used it for the tick labels:
from datetime import date, timedelta
import matplotlib.pyplot as plt
sdate = date(2019, 1, 1)
edate = date(2019, 12, 31)
delta = edate - sdate
days = [sdate + timedelta(days=i) for i in range(delta.days+1)]
fig, ax = plt.subplots()
ax.set_xticks(range(365))
ax.set_xticklabels(days)
plt.xticks(rotation=45)
plt.show()
And it generated this graph which looks close enough to yours:
Now, let's see how to use set_xticks and set_xticklabels to handle this issue. All you need to do is to limit the vectors getting passed to these two methods like so:
# keep only every 30th tick and its label
ax.set_xticks(range(0, 365, 30))
ax.set_xticklabels(days[::30])
And this produces this graph:
That's how you can control the ticks of your x-axis. I'm fairly confident you can find a similar way to control the labels of your data points.
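If the x values are real dates, another option is to let matplotlib choose the tick spacing instead of slicing by hand. This is a different technique from the manual slicing above: a sketch using matplotlib.dates.AutoDateLocator and ConciseDateFormatter, with made-up y values. It only addresses the ticks, not the annotations.

```python
from datetime import date, timedelta

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Every day in 2019, with placeholder y values
days = [date(2019, 1, 1) + timedelta(days=i) for i in range(365)]
values = list(range(365))

fig, ax = plt.subplots()
ax.plot(days, values)

# The locator picks a sensible interval (days, months, ...) for the visible range
locator = mdates.AutoDateLocator(minticks=4, maxticks=12)
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
fig.autofmt_xdate(rotation=45)  # rotate and right-align the date labels

fig.canvas.draw()
n_ticks = len(ax.get_xticks())
print(n_ticks)
```

The number of ticks stays small no matter how many data points are plotted, and it adapts again when you zoom in.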

Compare multiple year data on a single plot python

I have two time series from different years stored in pandas dataframes. For example:
data15 = pd.DataFrame(
    [1,2,3,4,5,6,7,8,9,10,11,12],
    index=pd.date_range(start='2015-01',end='2016-01',freq='M'),
    columns=['2015']
)
data16 = pd.DataFrame(
    [5,4,3,2,1],
    index=pd.date_range(start='2016-01',end='2016-06',freq='M'),
    columns=['2016']
)
I'm actually working with daily data but if this question is answered sufficiently I can figure out the rest.
What I'm trying to do is overlay the plots of these different data sets onto a single plot from January through December to compare the differences between the years. I can do this by creating a "false" index for one of the datasets so they have a common year:
data16.index = data15.index[:len(data16)]
ax = data15.plot()
data16.plot(ax=ax)
But I would like to avoid messing with the index if possible. Another problem with this method is that the year (2015) will appear in the x axis tick label which I don't want. Does anyone know of a better way to do this?

One way to do this would be to overlay a transparent axes over the first, and plot the 2nd dataframe in that one, but then you'd need to update the x-limits of both axes at the same time (similar to twinx). However, I think that's far more work and has a few more downsides: you can't easily zoom interactively into a specific region anymore for example, unless you make sure both axes are linked via their x-limits. Really, the easiest is to take into account that offset, by "messing with the index".
As for the tick labels, you can easily change the format so that they don't show the year by specifying the x-tick format:
import matplotlib.dates as mdates
month_day_fmt = mdates.DateFormatter('%b %d') # "Locale's abbreviated month name. + day of the month"
ax.xaxis.set_major_formatter(month_day_fmt)
Have a look at the matplotlib API example for specifying the date format.
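Putting the two pieces together, a sketch could look like the following. It uses made-up daily series instead of the monthly frames from the question, shifts a copy of the 2016 data back by one year so the original index is left alone, and hides the year with the '%b %d' format. The 2016 sample stops before Feb 29 on purpose, since shifting a leap day back a year would collide with Feb 28.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily data: all of 2015 and the first 50 days of 2016
s15 = pd.Series(range(365), index=pd.date_range("2015-01-01", "2015-12-31", freq="D"))
s16 = pd.Series(range(50), index=pd.date_range("2016-01-01", periods=50, freq="D"))

# Shift a copy of the 2016 data onto 2015 dates; s16 itself is unchanged
s16_shifted = s16.copy()
s16_shifted.index = s16.index - pd.DateOffset(years=1)

fig, ax = plt.subplots()
ax.plot(s15.index, s15.values, label="2015")
ax.plot(s16_shifted.index, s16_shifted.values, label="2016")
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %d"))  # month + day, no year
ax.legend()
fig.canvas.draw()
```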

I see two options.
Option 1: add a month column to your dataframes
data15['month'] = data15.index.to_series().dt.strftime('%b')
data16['month'] = data16.index.to_series().dt.strftime('%b')
ax = data16.plot(x='month', y='2016')
ax = data15.plot(x='month', y='2015', ax=ax)
Option 2: if you don't want to do that, you can use matplotlib directly
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(data15['2015'].values)
ax.plot(data16['2016'].values)
plt.xticks(range(len(data15)), data15.index.to_series().dt.strftime('%b'), size='small')
Needless to say, I would recommend the first option.

You might be able to use pandas.DatetimeIndex.dayofyear (also available on a single pandas.Timestamp) to get the day number, which will allow you to plot two different years' data on top of one another.
in: date = pd.Timestamp('2008-10-31')
in: date.dayofyear
out: 305
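A minimal sketch of that idea, using made-up daily data (the series below are placeholders, not the asker's frames). The original indexes are not modified; only the x values passed to the plot are day numbers. One caveat: 2016 is a leap year, so from March onward its day numbers are one higher than the same calendar date in 2015.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily data: all of 2015, and January through May of 2016
s15 = pd.Series(range(365), index=pd.date_range("2015-01-01", "2015-12-31", freq="D"))
idx16 = pd.date_range("2016-01-01", "2016-05-31", freq="D")
s16 = pd.Series(range(len(idx16)), index=idx16)

fig, ax = plt.subplots()
# Use the day of the year (1..365/366) as the common x-axis
ax.plot(s15.index.dayofyear, s15.values, label="2015")
ax.plot(s16.index.dayofyear, s16.values, label="2016")
ax.set_xlabel("day of year")
ax.legend()
fig.canvas.draw()
```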

Compare 1 independent vs many dependent variables using seaborn pairplot in an horizontal plot

The pairplot function from seaborn lets you plot pairwise relationships in a dataset.
According to the documentation (highlight added):
By default, this function will create a grid of Axes such that each variable in data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.
It is also possible to show a subset of variables or plot different variables on the rows and columns.
I could find only one example of subsetting different variables for rows and columns, here (it's the 6th plot under the Plotting pairwise relationships with PairGrid and pairplot() section). As you can see, it's plotting many independent variables (x_vars) against the same single dependent variable (y_vars) and the results are pretty nice.
I'm trying to do the same plotting a single independent variable against many dependent ones.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
ages = np.random.gamma(6,3, size=50)
data = pd.DataFrame({"age": ages,
"weight": 80*ages**2/(ages**2+10**2)*np.random.normal(1,0.2,size=ages.shape),
"height": 1.80*ages**5/(ages**5+12**5)*np.random.normal(1,0.2,size=ages.shape),
"happiness": (1-ages*0.01*np.random.normal(1,0.3,size=ages.shape))})
pp = sns.pairplot(data=data,
                  x_vars=['age'],
                  y_vars=['weight', 'height', 'happiness'])
The problem is that the subplots get arranged vertically, and I couldn't find a way to change it.
I know that the tiling would then not be as neat, since the Y axis would have to be labeled on every subplot. Also, I know I could generate the plots by hand with something like this:
fig, axes = plt.subplots(ncols=3)
for i, yvar in enumerate(['weight', 'height', 'happiness']):
    axes[i].scatter(data['age'],data[yvar])
Still, I'm learning to use seaborn and I find its interface very convenient, so I wonder if there's a way. Also, this example is pretty easy, but for more complex datasets seaborn handles many more things for you that would make the raw-matplotlib approach much more complex quite quickly (hue, to start).

You can achieve what it seems you are looking for by swapping the variable names passed to the x_vars and y_vars parameters. So revisiting the sns.pairplot portion of your code:
pp = sns.pairplot(data=data,
                  y_vars=['age'],
                  x_vars=['weight', 'height', 'happiness'])
Note that all I've done here is swap x_vars for y_vars. The plots should now be displayed horizontally:
The x-axis will now be unique to each plot with a common y-axis determined by the age column.
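If age has to stay on the x-axis, one possible alternative (not part of the answer above) is to reshape the data to long form and let sns.relplot lay the panels out in columns. This is only a sketch; the column names "variable" and "value" are my own choice, and facet_kws={'sharey': False} gives each panel its own y scale.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
ages = rng.gamma(6, 3, size=50)
data = pd.DataFrame({
    "age": ages,
    "weight": 80 * ages**2 / (ages**2 + 10**2) * rng.normal(1, 0.2, size=ages.shape),
    "height": 1.80 * ages**5 / (ages**5 + 12**5) * rng.normal(1, 0.2, size=ages.shape),
    "happiness": 1 - ages * 0.01 * rng.normal(1, 0.3, size=ages.shape),
})

# Long form: one row per (age, variable, value) combination
long_df = data.melt(id_vars="age",
                    value_vars=["weight", "height", "happiness"],
                    var_name="variable", value_name="value")

# One column of the grid per dependent variable, age on every x-axis
g = sns.relplot(data=long_df, x="age", y="value", col="variable",
                kind="scatter", height=3, facet_kws={"sharey": False})
```

Since relplot is a figure-level seaborn function, hue and the other semantic mappings still work the same way as with pairplot.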
