Plotting line plot with groupby in matplotlib/seaborn? - python

I have the following dataset (abbreviated, but still conveys the same idea). I want to show how user score changes over time (the postDate conveys time). The data is also presorted by postDate. The hope is to see a nice plot (perhaps using seaborn if possible) that has the score as the y-axis, time as the x-axis, and shows the users' scores over time (with a separate line for each user). Do I need to convert the postDate (currently a string) to another format in order to plot nicely? Thank you so much!
userID postDate userScore (1-10 scale)
Mia1 2017-01-11 09:07:10.616328+00:00 8
John2 2017-01-17 08:05:45.917629+00:00 6
Leila1 2017-01-22 07:47:67.615628+00:00 9
Mia1 2017-01-30 03:45:50.817325+00:00 7
Leila 2017-02-02 06:38:01.517223+00:00 10

Based on the sample data you show, your postDate series looks like it already holds pandas datetime values (if it is still a string column, convert it first with pd.to_datetime). So to plot with dates on the x-axis, the key in matplotlib is to use plot_date, not plot. Something like this:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
# Draw one series per user
for key, g in df.groupby('userID'):
    # plot_date draws markers only by default
    ax.plot_date(g['postDate'], g['userScore'], label=key)
ax.legend()
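For a self-contained variant, here is a minimal sketch. It assumes postDate is still a string column, as the question states, and converts it with pd.to_datetime. The DataFrame is illustrative data modelled on the question (timestamps simplified, user names normalised), and the sketch uses ax.plot with a marker style instead of plot_date, since ax.plot accepts datetime values directly in recent matplotlib versions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data modelled on the question (timestamps simplified)
df = pd.DataFrame({
    "userID": ["Mia1", "John2", "Leila1", "Mia1", "Leila1"],
    "postDate": ["2017-01-11 09:07:10", "2017-01-17 08:05:45",
                 "2017-01-22 07:47:07", "2017-01-30 03:45:50",
                 "2017-02-02 06:38:01"],
    "userScore": [8, 6, 9, 7, 10],
})

# Convert the string column to real datetimes so the x-axis is a time axis
df["postDate"] = pd.to_datetime(df["postDate"])

fig, ax = plt.subplots()
for user, group in df.groupby("userID"):
    # One connected line (with point markers) per user
    ax.plot(group["postDate"], group["userScore"], marker="o", label=user)

ax.set_xlabel("postDate")
ax.set_ylabel("userScore")
ax.legend()
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

In a notebook, drop the matplotlib.use line and the figure will display inline.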

I've used plotly before; it's a really nice option for interactive visualizations if you are using Jupyter Notebook. You can generate HTML files or plot inline in Jupyter with cufflinks. You only pay if you want your graphs hosted somewhere; I use it for free for my own data analysis.
Install plotly and also cufflinks; cufflinks lets you make plots almost instantly from pandas DataFrames.
For example, you could do:
import cufflinks as cf  # importing cufflinks adds .iplot() to DataFrames
your_df.iplot(x='postDate', y='userScore')
This will automatically give you the time series you describe.

Related

Python Visualisation Not Plotting Full Range of Data Points

I'm just starting out on using Python and I'm using it to plot some points through Power BI. I use Power BI as part of my work anyway and this is for an article I'm writing alongside learning. I'm aware Power BI isn't the ideal place to be using Python :)
I have a dataset of average banana prices since 1995 (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1132096/bananas-30jan23.csv)
I've managed to turn that into a nice line chart which plots the average for each month but only shows the yearly labels. The chart is really nice and I'm happy with it, other than the fact that it isn't plotting anything before 1997 or after 2020 even though the date range extends beyond those years. Earlier visualisations without the grouped x-axis labelling plotted all the points, but with this change it no longer works.
ChatGPT got me going in circles that never resolved the issue, so I suspect the problem lies in my understanding of Python. If anyone could help me understand the issue that would be brilliant. I can provide more information if that helps:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Convert the 'Date' column to a datetime format
dataset['Date'] = pd.to_datetime(dataset['Date'])
# Group the dataframe by month and calculate the average price for each month
monthly_average = dataset.groupby(dataset['Date'].dt.strftime('%B-%Y'))['Price'].mean()
# Plot the monthly average price against the month using seaborn
ax = sns.lineplot(x=monthly_average.index, y=monthly_average.values)
# Find the unique years in the dataset
unique_years = np.unique(dataset['Date'].dt.year)
# Set the x-axis tick labels to only be the unique years
ax.xaxis.set_ticklabels(unique_years)
ax.xaxis.set_major_locator(plt.MaxNLocator(len(unique_years)))
# Show the plot
plt.show()
Resulting Chart
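One likely cause, offered as a hedged sketch and not as a confirmed diagnosis: grouping by strftime('%B-%Y') produces string keys, which groupby sorts alphabetically ('April-1995', 'April-1996', ...), so the x-axis is categorical instead of chronological. set_ticklabels then relabels whatever ticks exist with the list of years, regardless of where the data actually falls. The sketch below uses synthetic prices (the real CSV is not loaded, and plain matplotlib is used instead of seaborn) and keeps the group key as a real monthly date so that matplotlib's date locators can place the yearly ticks.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the banana price data (weekly values, 1995-2022)
rng = np.random.default_rng(0)
dates = pd.date_range("1995-01-02", "2022-12-26", freq="W-MON")
dataset = pd.DataFrame({"Date": dates, "Price": rng.uniform(0.5, 1.5, len(dates))})

# Group on a monthly period, then convert back to timestamps so the
# index stays chronological instead of becoming alphabetical strings
monthly_average = dataset.groupby(dataset["Date"].dt.to_period("M"))["Price"].mean()
monthly_average.index = monthly_average.index.to_timestamp()

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(monthly_average.index, monthly_average.values)

# A date locator picks the tick positions; the formatter shows only the year
ax.xaxis.set_major_locator(mdates.YearLocator(base=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
```

Because the tick positions come from the dates themselves, the labels cannot drift away from the data the way manually assigned labels can.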

How do I create a clear line plot with a large number of values (Pandas)?

I have a pandas DataFrame of 8664 rows which contains the following columns of importance to me: EASTVEL , NORTHVEL, Z_NAP, DATE+TIME. Definitions of the columns are:
EASTVEL = Flow of current where min(-) values are west and plus(+) values are east.
NORTHVEL = Flow of current where min(-) values are south and plus(+) values are north.
Z_NAP = Depth of water
DATE+TIME = Date + time in this format: 2021-11-17 10:00:00
Now the problem that I encounter is the following: I want to generate a plot with EASTVEL on the x-axis and Z_NAP on the y-axis within the timeframe of 2021-11-17 10:00:00 until 2021-11-17 12:00:00 (I already created a DataFrame frame_3_LW that only contains those values). However, because I have so many values, I get a plot like the one below. I would like just one line describing the course of EASTVEL against Z_NAP, which would be much clearer. Can anybody help me with that?
Well, you've already identified the problem yourself. Your code is fine; the problem is that you have many points, and this kind of visualization doesn't work well if the variables change too much or if there are too many points. You could try plotting every 5th point (or something like that), but I doubt it would improve the graph.
Even though it's not directly what you asked, I suggest you either:
Plot both variables independently on the same graph
Plot the ratio of the two variables. That way you have only one line, describing the relationship between the variables.
For example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)
df = pd.DataFrame({'east_vel': np.random.randint(0, 5, 100),
                   'z_nap': np.random.randint(5, 10, 100),
                   'time': np.arange(100)})
# option 1, plot both variables
plt.figure()
plt.plot(df['time'], df['east_vel'], color='blue', label='east_vel')
plt.plot(df['time'], df['z_nap'], color='orange', label='z_nap')
plt.title('Two separate lines')
plt.legend()
# option 2, plot ratio between variables
plt.figure()
plt.plot(df['time'], df['east_vel']/df['z_nap'], label='east_vel / z_nap')
plt.title('Ratio between east_vel and z_nap')
plt.legend()
plt.show()
Here's the output of the code:
Even for a relatively small number of points (100), plotting one variable against the other would be very messy.
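A further possibility, not part of the answer above and offered only as a sketch: if the aim is a single velocity profile over depth, averaging EASTVEL at each Z_NAP level over the time window collapses the many measurements into one line. The data below are synthetic, and the assumption that Z_NAP takes a limited set of repeated depth levels may not hold for the real frame_3_LW.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: 20 depth levels, each measured at 12 time steps
rng = np.random.default_rng(0)
depths = np.linspace(-10.0, -0.5, 20)
frame = pd.DataFrame({
    "Z_NAP": np.repeat(depths, 12),
    "EASTVEL": rng.normal(loc=0.3, scale=0.1, size=20 * 12),
})

# Mean eastward velocity per depth level gives one point per depth
profile = frame.groupby("Z_NAP")["EASTVEL"].mean().sort_index()

fig, ax = plt.subplots()
ax.plot(profile.values, profile.index, marker="o")  # velocity on x, depth on y
ax.set_xlabel("mean EASTVEL")
ax.set_ylabel("Z_NAP")
```

If the depths are continuous and never repeat, binning them first (for example with pd.cut) would be needed before averaging.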

Display bars in plot by ascending/descending order, Matplotlib, Python

I have a DataFrame with several numeric columns. My task is to display the sum of each column in one chart. I decided to use import matplotlib.pyplot as plt and the following code:
df.sum().plot.bar(figsize = (10,5))
plt.xlabel('Category')
plt.ylabel('Sum')
plt.title('Most popular categories')
plt.show()
I got this picture.
What I want to change here is to arrange the bars in ascending or descending order (for example, from the biggest bar to the smallest), but I can't get it to work. Please help me solve this!
Sort the data before plotting:
df.sum().sort_values(ascending=False).plot.bar(figsize = (10,5))
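As a runnable illustration of the same idea, here is a small sketch. The column names and values are hypothetical stand-ins for the real DataFrame; only the sum, sort_values and plot.bar chain comes from the answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric columns standing in for the real DataFrame
df = pd.DataFrame({"books": [3, 5, 2], "games": [10, 7, 9], "music": [1, 4, 6]})

# Column sums, largest first; the bar order follows the order of the Series
totals = df.sum().sort_values(ascending=False)

ax = totals.plot.bar(figsize=(10, 5))
ax.set_xlabel("Category")
ax.set_ylabel("Sum")
ax.set_title("Most popular categories")
```

Passing ascending=True instead gives the smallest bar first.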

Compare multiple year data on a single plot python

I have two time series from different years stored in pandas dataframes. For example:
data15 = pd.DataFrame(
    [1,2,3,4,5,6,7,8,9,10,11,12],
    index=pd.date_range(start='2015-01',end='2016-01',freq='M'),
    columns=['2015']
)
data16 = pd.DataFrame(
    [5,4,3,2,1],
    index=pd.date_range(start='2016-01',end='2016-06',freq='M'),
    columns=['2016']
)
I'm actually working with daily data but if this question is answered sufficiently I can figure out the rest.
What I'm trying to do is overlay the plots of these different data sets onto a single plot from January through December to compare the differences between the years. I can do this by creating a "false" index for one of the datasets so they have a common year:
data16.index = data15.index[:len(data16)]
ax = data15.plot()
data16.plot(ax=ax)
But I would like to avoid messing with the index if possible. Another problem with this method is that the year (2015) will appear in the x axis tick label which I don't want. Does anyone know of a better way to do this?
One way to do this would be to overlay a transparent axes over the first, and plot the 2nd dataframe in that one, but then you'd need to update the x-limits of both axes at the same time (similar to twinx). However, I think that's far more work and has a few more downsides: you can't easily zoom interactively into a specific region anymore for example, unless you make sure both axes are linked via their x-limits. Really, the easiest is to take into account that offset, by "messing with the index".
As for the tick labels, you can easily change the format so that they don't show the year by specifying the x-tick format:
import matplotlib.dates as mdates
month_day_fmt = mdates.DateFormatter('%b %d') # "Locale's abbreviated month name. + day of the month"
ax.xaxis.set_major_formatter(month_day_fmt)
Have a look at the matplotlib API example for specifying the date format.
I see two options.
Option 1: add a month column to your dataframes
data15['month'] = data15.index.to_series().dt.strftime('%b')
data16['month'] = data16.index.to_series().dt.strftime('%b')
ax = data16.plot(x='month', y='2016')
ax = data15.plot(x='month', y='2015', ax=ax)
Option 2: if you don't want to do that, you can use matplotlib directly
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(data15['2015'].values)
ax.plot(data16['2016'].values)
plt.xticks(range(len(data15)), data15.index.to_series().dt.strftime('%b'), size='small')
Needless to say, I would recommend the first option.
You might be able to use pandas.DatetimeIndex.dayofyear to get the day number, which will allow you to plot two different years' data on top of one another.
in: date = pd.Timestamp('2008-10-31')
in: date.dayofyear
out: 305
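The dayofyear idea can be sketched for daily data as follows. The two series are synthetic stand-ins for the asker's daily data, and plain matplotlib is used so that the original DatetimeIndex is left untouched.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series for two years (2016 only covers January to May)
idx15 = pd.date_range("2015-01-01", "2015-12-31", freq="D")
idx16 = pd.date_range("2016-01-01", "2016-05-31", freq="D")
data15 = pd.Series(np.arange(len(idx15)), index=idx15, name="2015")
data16 = pd.Series(np.arange(len(idx16))[::-1], index=idx16, name="2016")

fig, ax = plt.subplots()
for series in (data15, data16):
    # dayofyear runs from 1 to 365 (366 in leap years), so both years
    # share one x-axis without any change to their indexes
    ax.plot(series.index.dayofyear, series.values, label=series.name)

ax.set_xlabel("day of year")
ax.legend()
```

One caveat: 2016 is a leap year, so from March onwards its day numbers are one higher than the same calendar date in 2015.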

Plotting multiple timeseries power data using matplotlib and pandas

I have a csv file of power levels at several stations (4 in this case, though "HUT4" is not in this short excerpt):
2014-06-21T20:03:21,HUT3,74
2014-06-21T21:03:16,HUT1,70
2014-06-21T21:04:31,HUT3,73
2014-06-21T21:04:33,HUT2,30
2014-06-21T22:03:50,HUT3,64
2014-06-21T23:03:29,HUT1,60
(etc . .)
The times are not synchronised across stations. The power level is (in this case) integer percent. Some machines report in volts (~13.0), which would be an additional issue when plotting.
The data is easy to read into a dataframe, to index the dataframe, to put into a dictionary. But I can't get the right syntax to make a meaningful plot. Either all stations on a single plot sharing a timeline that's big enough for all stations, or as separate plots, maybe a subplot for each station. If I do:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Power_Log.csv',names=['DT','Station','Power'])
df2=df.groupby('Station') # group the rows by station
d = dict(iter(df2)) # make a dictionary including each station's data
for stn in d.keys():
    d[stn].plot(x='DT',y='Power')
plt.legend(loc='lower right')
plt.savefig('Station_Power.png')
I do get a plot but the X axis is not right for each station.
I have not figured out yet how to do four independent subplots, which would free me from making a wide-enough timescale.
I would greatly appreciate comments on getting a single plot right and/or getting good looking subplots. The subplots do not need to have synchronised X axes.
I'd rather plot the typical way, something like:
import matplotlib.pyplot as plt
plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.axis([0, 6, 0, 20])
plt.savefig('Station_Power.png')  # savefig needs a file name
( http://matplotlib.org/users/pyplot_tutorial.html )
Re more subplots: simply call plt.plot() multiple times, once for each data series.
P.S. you can set xticks this way: Changing the "tick frequency" on x or y axis in matplotlib?
Sorry for the comment above where I needed to add code. Still learning . .
From the 5th code line:
import matplotlib.dates as mdates
for stn in d.keys():
    plt.figure()
    d[stn].interpolate().plot(x='DT',y='Power',title=stn,rot=45)
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%D/%M/%Y'))
    plt.savefig('Station_Power_'+stn+'.png')
Does more or less what I want to do except the DateFormatter line does not work. I would like to shorten my datetime data to show just date. If it places ticks at midnight that would be brilliant but not strictly necessary.
The key to getting a continuous plot is to use the interpolate() method in the plot.
With this data having different x scales from station to station a plot of all stations on the same graph does not work. HUT4 in my data has far fewer records and only plots to about 25% of the scale even though the datetime values cover more or less the same range as the other HUTs.
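A possible explanation for the remaining problems, offered as a sketch under assumptions and not as a verified fix: in strftime codes %M means minutes (and %D, where supported, expands to a whole mm/dd/yy date), whereas '%d/%m/%Y' gives day/month/year. The x-axis also tends to misbehave when DT is left as strings, which parse_dates avoids. The example below reads the excerpt from the question from an in-memory string instead of Power_Log.csv and draws one subplot per station, each with its own time axis.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Excerpt from the question, read from memory instead of Power_Log.csv
csv_text = """2014-06-21T20:03:21,HUT3,74
2014-06-21T21:03:16,HUT1,70
2014-06-21T21:04:31,HUT3,73
2014-06-21T21:04:33,HUT2,30
2014-06-21T22:03:50,HUT3,64
2014-06-21T23:03:29,HUT1,60
"""
df = pd.read_csv(io.StringIO(csv_text), names=["DT", "Station", "Power"],
                 parse_dates=["DT"])  # DT becomes datetime64, not strings

stations = sorted(df["Station"].unique())
fig, axes = plt.subplots(len(stations), 1,
                         figsize=(8, 2.5 * len(stations)), squeeze=False)

for ax, stn in zip(axes[:, 0], stations):
    g = df[df["Station"] == stn]
    ax.plot(g["DT"], g["Power"], marker="o")
    ax.set_title(stn)
    # Ask for one tick per day (placed at midnight), labelled with the date only
    ax.xaxis.set_major_locator(mdates.DayLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%d/%m/%Y"))

fig.tight_layout()
```

The excerpt covers only a few hours of a single day, so the effect of DayLocator only becomes visible with data spanning several days. To put all stations on one shared timeline instead, plot every group onto the same axes and call ax.legend().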
