Line Plot in Matplotlib, by frequency of date - python

So I have a dataframe in pandas like below:
date max min rain snow ice
0 2019-01-01 58 39 0.06 0.0 0.0
1 2019-01-01 58 39 0.06 0.0 0.0
2 2019-01-01 58 39 0.06 0.0 0.0
3 2019-01-01 58 39 0.06 0.0 0.0
4 2019-01-01 58 39 0.06 0.0 0.0
The goal is to create a line plot which shows, on the x axis, the max temperature, and on the y axis, the frequency of each date for that temperature.
So basically, the list of dates are shop transactions and I want to see the effect the temperature has on the number of transactions per day.
I've tried to use this which groups the weather_frame by date, but I can't get my plot to show the temperature on the x axis.
max_temp = weather_frame.groupby(weather_frame.date).size()
I've attached the file below. I had to delete some of it to stay within the size limits for paste bin so, the graph may appear corrupted. Data Link

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# number of transactions (rows) per day
date_freq = weather_frame.groupby('date').size()
# max temperature per day (the same value on every row of a given day)
max_temp = weather_frame.groupby('date')['max'].mean()
sns.set()
plt.figure()
sns.regplot(x=max_temp, y=date_freq)
plt.xlabel('Maximum Temperature')
plt.ylabel('Number of Transactions per Day')
plt.show()
It looks like there is a slight positive relationship between max temperature and number of transactions per day.
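If a line plot is wanted rather than a regression plot, one possible sketch is to reduce the data to one row per day and then average the daily transaction counts at each temperature. The data below is synthetic and only stands in for weather_frame; the names per_day and per_temp are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for weather_frame: one row per transaction,
# with the day's max temperature repeated on every row of that day.
rng = np.random.default_rng(0)
days = pd.date_range("2019-01-01", periods=30, freq="D")
daily_max = rng.integers(40, 90, size=len(days))
n_tx = rng.integers(5, 50, size=len(days))  # transactions per day
weather_frame = pd.DataFrame({
    "date": days.repeat(n_tx),
    "max": np.repeat(daily_max, n_tx),
})

# One row per day: that day's max temperature and its transaction count.
per_day = weather_frame.groupby("date").agg(
    max_temp=("max", "first"),
    transactions=("max", "count"),
)

# Several days can share a temperature, so average the daily counts
# per temperature; the groupby result is sorted by temperature.
per_temp = per_day.groupby("max_temp")["transactions"].mean()

fig, ax = plt.subplots()
ax.plot(per_temp.index, per_temp.values, marker="o")
ax.set_xlabel("Maximum Temperature")
ax.set_ylabel("Mean Transactions per Day")
```

Call plt.show() to display the figure when running this as a script.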

Related

Seaborn boxplot with grouped data into categories with count column

I run into a problem when trying to plot my dataset with a seaborn boxplot. I've got a dataset received grouped from database like:
region age total
0 STC 2.0 11024
1 PHA 84.0 3904
2 OLK 55.0 12944
3 VYS 72.0 5592
4 PAK 86.0 2168
... ... ... ...
1460 KVK 62.0 4600
1461 MSK 41.0 26568
1462 LBK 13.0 6928
1463 JHC 18.0 8296
1464 HKK 88.0 2408
And I would like to create a box plot with the region on an x-scale, age on a y-scale, based on the total number of observations.
When I try ax = sns.boxplot(x='region', y='age', data=df), I get a simple boxplot that doesn't take the total column into account. One hard-coded option is to repeat each row by its total, but I don't like this solution.
sns.histplot and sns.kdeplot support a weights= parameter, but sns.boxplot doesn't. Simply repeating values isn't necessarily a bad solution, but in this case the numbers are very large. You could create a new dataframe with repeated data, but divide the 'total' column to make the values manageable.
The sample data have all different regions, which makes creating a boxplot rather strange. The code below supposes there aren't too many regions (1400 regions certainly wouldn't work well).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO
df_str = ''' region age total
STC 2.0 11024
STC 84.0 3904
STC 55.0 12944
STC 72.0 5592
STC 86.0 2168
PHA 62.0 4600
PHA 41.0 26568
PHA 13.0 6928
PHA 18.0 8296
PHA 88.0 2408'''
df = pd.read_csv(StringIO(df_str), sep=r'\s+')  # whitespace-delimited columns
# use a scaled down version of the totals as a repeat factor
repeats = df['total'].to_numpy(dtype=int) // 100
df_total = pd.DataFrame({'region': np.repeat(df['region'].values, repeats),
'age': np.repeat(df['age'].values, repeats)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14, 4))
sns.kdeplot(data=df, x='age', weights='total', hue='region', ax=ax1)
sns.boxplot(data=df_total, y='age', x='region', ax=ax2)
plt.tight_layout()
plt.show()
An alternative would be to do everything outside seaborn, using statsmodels.stats.weightstats.DescrStatsW to calculate the percentiles and plot the boxplots via matplotlib. Outliers would still have to be calculated separately. (See also this post)
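As a rough illustration of the "outside seaborn" route, here is a sketch that swaps DescrStatsW for a small NumPy-only weighted quantile (so it is not the statsmodels method itself) and draws the boxes with matplotlib's Axes.bxp. The helper weighted_quantile is hypothetical, the whiskers are simply set to the min and max as an assumption, and outliers are not computed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches fraction q of the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, q * cum[-1], side="left")
    return values[idx]

# Sample data from the answer above
df = pd.DataFrame({
    "region": ["STC"] * 5 + ["PHA"] * 5,
    "age": [2.0, 84.0, 55.0, 72.0, 86.0, 62.0, 41.0, 13.0, 18.0, 88.0],
    "total": [11024, 3904, 12944, 5592, 2168, 4600, 26568, 6928, 8296, 2408],
})

# Build one stats dict per region in the format Axes.bxp expects
stats = []
for region, grp in df.groupby("region"):
    stats.append({
        "label": region,
        "q1": weighted_quantile(grp["age"], grp["total"], 0.25),
        "med": weighted_quantile(grp["age"], grp["total"], 0.50),
        "q3": weighted_quantile(grp["age"], grp["total"], 0.75),
        "whislo": float(grp["age"].min()),  # assumption: whiskers at min/max
        "whishi": float(grp["age"].max()),
        "fliers": [],                       # outliers not computed here
    })

fig, ax = plt.subplots()
ax.bxp(stats, showfliers=False)
ax.set_ylabel("age")
```

This avoids repeating rows entirely, at the cost of defining the quantile rule yourself; a different interpolation rule would give slightly different box edges.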

How does this transparent extension come with a plot in lineplot?

The plot in documentation looks like this :
with code
sns.lineplot(x="timepoint", y="signal",
hue="region", style="event",
data=fmri)
and
mine comes out to be like this
for code :
sns.lineplot(
# data=fmri,
x=df["_C_UP"]["s"][:10],
y=df["_C_UP"]["px"][:10]
# hue="event"
);
How do I get the same effect for those lines ( that transparent color around it )
here is what my data looks like
#Energy s py pz px dxy dyz dz2 dxz dx2 tot
50 -17.98094 0.72320 0.31781 0.00000 0.31882 0.0 0.0 0.0 0.0 0.0 1.35982
51 -17.87394 0.29726 0.14415 0.00000 0.14491 0.0 0.0 0.0 0.0 0.0 0.58632
52 -17.76794 0.63694 0.02456 0.00000 0.02484 0.0 0.0 0.0 0.0 0.0 0.68634
53 -17.66194 1.78595 0.06032 0.00001 0.06139 0.0 0.0 0.0 0.0 0.0 1.90766
54 -17.55494 1.97809 0.09038 0.00001 0.09192 0.0 0.0 0.0 0.0 0.0 2.16040
In the fmri datasets, there are actually multiple observations for each time point and subgroup, for example, at timepoint == 14 :
fmri[fmri['timepoint']==14]
subject timepoint event region signal
1 s5 14 stim parietal -0.080883
57 s13 14 stim parietal -0.033713
58 s12 14 stim parietal -0.068297
59 s11 14 stim parietal -0.114469
60 s10 14 stim parietal -0.052288
61 s9 14 stim parietal -0.130267
So the line you see, is actually the mean of all these observations (stratified by group) and the ribbon is the 95% confidence interval of this mean. For example, you can turn this off by doing:
sns.lineplot(x="timepoint", y="signal",
hue="region", style="event",
data=fmri, ci=None)  # seaborn >= 0.12 deprecates ci; use errorbar=None there
So to get the exact plot, you need to have multiple observations or replicates. If you don't, and your intention is to just connect the points, you cannot get a confidence interval.
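To make the role of the replicates concrete, here is a small sketch that builds the mean line and a ribbon by hand from synthetic data with several observations per time point. It uses a normal approximation (mean ± 1.96 × standard error) as an assumption; seaborn's default band comes from bootstrapping, so the ribbon will not match exactly. The names long_df and summary are illustrative only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic long-format data: 8 replicate observations per time point
rng = np.random.default_rng(1)
timepoints = np.arange(10)
n_subjects = 8
t = np.tile(timepoints, n_subjects)
long_df = pd.DataFrame({
    "timepoint": t,
    "signal": np.sin(t / 2) + rng.normal(0, 0.2, size=t.size),
})

# Mean and standard error of the mean at each time point
summary = long_df.groupby("timepoint")["signal"].agg(["mean", "sem", "count"])
summary["lower"] = summary["mean"] - 1.96 * summary["sem"]
summary["upper"] = summary["mean"] + 1.96 * summary["sem"]

# Line = mean of the replicates, ribbon = approximate 95% interval
fig, ax = plt.subplots()
ax.plot(summary.index, summary["mean"])
ax.fill_between(summary.index, summary["lower"], summary["upper"], alpha=0.2)
ax.set_xlabel("timepoint")
ax.set_ylabel("signal")
```

With only one observation per time point the standard error is undefined (NaN), which is why no ribbon can be drawn in that case.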
If you want to look at a trend line, one thing you can try is a polynomial smooth. And it makes sense to plot the data as points too.
Using an example from the same fmri dataset:
df = fmri[(fmri['subject']=="s5") & (fmri['event']== "stim") & (fmri['region'] == "frontal")]
sns.regplot(data=df,x = "timepoint",y = "signal",order=3)
Or use a loess smooth, which is more complicated (see this post about what is drawn below )
import matplotlib.pyplot as plt
from skmisc.loess import loess
lfit = loess(df['timepoint'],df['signal'])
lfit.fit()
pred = lfit.predict(df['timepoint'], stderror=True)
conf = pred.confidence()
fig, ax = plt.subplots()
sns.scatterplot(data=df,x = "timepoint",y = "signal",ax=ax)
sns.lineplot(x = df["timepoint"],y = pred.values,ax=ax,color="#A2D2FF")
ax.fill_between(df['timepoint'],conf.lower, conf.upper, alpha=0.1,color="#A2D2FF")
It depends on the data. The plot from the seaborn documentation that you show is based on a dataset where for every x value there are several y values (repeated measurements). The lines in the plot then indicate the means of those y values, and the shaded regions indicate the associated 95% confidence intervals.
In your data, there is only one y value for each x value, so there is no way to calculate a confidence interval.

How to plot a graph using this data with python?

I want to create time series plot using max, min and avg temperatures from each month of the year.
I would recommend looking into matplotlib to visualize different types of data which can be installed with a quick pip3 install matplotlib.
Here is some starter code you can play around with to get familiar with the library:
# Import the library
import matplotlib.pyplot as plt
# Some sample data to play around with
temps = [30,40,45,50,55,60]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
# Create a figure and plot the data
plt.figure()
plt.plot(temps)
# Add labels to the data points (optional)
for i, point in enumerate(months):
plt.annotate(point, (i, temps[i]))
# Apply some labels
plt.ylabel("Temperature (F)")
plt.title("Temperature Plot")
# Hide the x axis labels
plt.gca().axes.get_xaxis().set_visible(False)
# Show the completed plot
plt.show()
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sample data
df = pd.DataFrame({'Date':pd.date_range('2010-01-01', '2010-12-31'),
'Temp':np.random.randint(20, 100, 365)})
df.head()
Date Temp
0 2010-01-01 95
1 2010-01-02 20
2 2010-01-03 22
3 2010-01-04 26
4 2010-01-05 93
# group by month and get min, max, mean values for temperature
temp_agg = df.groupby(df.Date.dt.month)['Temp'].agg(['min', 'max', 'mean'])
temp_agg.index.name='month'
temp_agg
min max mean
month
1 20 99 50.258065
2 25 98 56.642857
3 22 89 51.225806
4 22 98 60.333333
5 27 99 57.645161
6 21 99 62.000000
7 20 98 67.419355
8 36 98 63.806452
9 22 99 62.166667
10 24 99 63.322581
11 22 97 64.200000
12 20 99 60.870968
# shorthand method of plotting entire dataframe
temp_agg.plot()
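As a possible variation on the shorthand plot, the monthly aggregates can also be drawn as a shaded min-to-max band with the mean on top and month names on the x axis. This is only a sketch on seeded random data, not real temperatures.

```python
import calendar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Seeded sample data: one random temperature per day of 2010
rng = np.random.default_rng(42)
dates = pd.date_range("2010-01-01", "2010-12-31")
df = pd.DataFrame({"Date": dates, "Temp": rng.integers(20, 100, len(dates))})

# Monthly min, max and mean
temp_agg = df.groupby(df["Date"].dt.month)["Temp"].agg(["min", "max", "mean"])
temp_agg.index.name = "month"

fig, ax = plt.subplots()
# Shade the range between the monthly min and max
ax.fill_between(temp_agg.index, temp_agg["min"], temp_agg["max"],
                alpha=0.2, label="min to max")
# Draw the monthly mean as a line with markers
ax.plot(temp_agg.index, temp_agg["mean"], marker="o", label="mean")
# Replace month numbers with abbreviated month names
ax.set_xticks(temp_agg.index)
ax.set_xticklabels([calendar.month_abbr[m] for m in temp_agg.index])
ax.set_ylabel("Temperature (F)")
ax.legend()
```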

How do I make a line graph out of this?

I've imported seaborn and typed in this:
Bunker2019_Jan_to_Jun.plot(x='2019', y='Total')
Bunker2019_Jan_to_Jun.plot(x='2019', y='MGO')
and it shows two graphs. Is there any way I can show the year 2019(Jan to Dec) and 2020(Jan to Jun)?
If you want them on the same plot, you need to combine the data frames. I'm not sure what your "2019" column is (date or string?), so below I create data frames that resemble yours:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# month-start frequency ("MS") yields one date per month; only the month abbreviation is kept
mths = pd.date_range(start='2019-01-01', periods=12, freq="MS").strftime("%b").to_list()
Bunker2019 = pd.DataFrame({'2019':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Bunker2020 = pd.DataFrame({'2020':mths,'Total':np.random.uniform(0,1,12),
'MGO':np.random.uniform(0,1,12)})
Simple way to add the year to create a new date:
Bunker2019['Date'] = '2019_'+ Bunker2019['2019'].astype(str)
Bunker2020['Date'] = '2020_'+ Bunker2020['2020'].astype(str)
We concat and melt, setting an order:
df = pd.concat([Bunker2019[['Date','Total','MGO']],Bunker2020[['Date','Total','MGO']]])
df = df.melt(id_vars='Date')
df['Date'] = pd.Categorical(df['Date'],categories=df['Date'].unique(),ordered=True)
So now it is a long format, containing information for both 2020 and 2019:
Date variable value
0 2019_Jan Total 0.187751
1 2019_Feb Total 0.091374
2 2019_Mar Total 0.929739
3 2019_Apr Total 0.621981
4 2019_May Total 0.371236
5 2019_Jun Total 0.027078
6 2019_Jul Total 0.719046
7 2019_Aug Total 0.138531
Now to plot:
plt.figure(figsize=(12,5))
ax = sns.lineplot(data=df,x='Date',y='value',hue='variable')
sns.scatterplot(data=df,x='Date',y='value',hue='variable',ax=ax,legend=False)
plt.xticks(rotation=65, horizontalalignment='right')
plt.show()
I created the source DataFrame as:
Month MGO MFO
0 2019-01 79.1 85.0
1 2019-02 69.9 91.2
2 2019-03 68.9 90.4
3 2019-04 71.1 87.0
4 2019-05 75.9 85.6
5 2019-06 60.9 82.1
6 2019-07 68.4 75.0
7 2019-08 75.8 60.7
8 2019-09 82.0 58.8
9 2019-10 95.3 56.6
10 2019-11 90.2 59.7
11 2019-12 86.5 57.7
12 2020-01 79.1 50.0
13 2020-02 88.9 52.2
14 2020-03 74.9 54.4
15 2020-04 87.1 51.0
16 2020-05 92.9 52.6
17 2020-06 105.9 53.1
(for now Month column as string).
If you have 2 separate source DataFrames, concatenate them.
The first processing step is to convert Month column to datetime
type and set it as the index:
df.Month = pd.to_datetime(df.Month)
df.set_index('Month', inplace=True)
The first, more straightforward way to create the drawing is:
df.plot(style='-x');
For my data sample I got:
The second possibility is to generate the picture with smoothed lines.
To do this, you can draw two plots on a single axes object:
first - a smoothed line from the resampled DataFrame, with interpolation but without markers, as there are now many more points,
second - only the markers, taken from the original DataFrame,
both with the same list of colors.
The code to do it is:
fig, ax = plt.subplots()
color = ['blue', 'orange']
df.resample('D').interpolate('quadratic').plot(ax=ax, color=color)
df.plot(ax=ax, marker='x', linestyle='None', legend=False, color=color);
This time the result is:
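For the case of two separate source DataFrames mentioned above, the concatenation step might look like the sketch below. The frame names df_2019 and df_2020 are hypothetical; the values are copied from the sample table in this answer.

```python
import pandas as pd

# Hypothetical separate frames, one per year, with Month stored as a string
df_2019 = pd.DataFrame({
    "Month": [f"2019-{m:02d}" for m in range(1, 13)],
    "MGO": [79.1, 69.9, 68.9, 71.1, 75.9, 60.9, 68.4, 75.8, 82.0, 95.3, 90.2, 86.5],
    "MFO": [85.0, 91.2, 90.4, 87.0, 85.6, 82.1, 75.0, 60.7, 58.8, 56.6, 59.7, 57.7],
})
df_2020 = pd.DataFrame({
    "Month": [f"2020-{m:02d}" for m in range(1, 7)],
    "MGO": [79.1, 88.9, 74.9, 87.1, 92.9, 105.9],
    "MFO": [50.0, 52.2, 54.4, 51.0, 52.6, 53.1],
})

# Stack the two years into one frame with a fresh 0..17 index
df = pd.concat([df_2019, df_2020], ignore_index=True)

# Convert Month to datetime and use it as the index, as in the answer
df["Month"] = pd.to_datetime(df["Month"])
df = df.set_index("Month")

# Both columns are drawn on one axes, spanning Jan 2019 to Jun 2020
ax = df.plot(style="-x")
```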

Change tick frequency for datetime axis [duplicate]

I have the following dataframe:
Date Prod_01 Prod_02
19 2018-03-01 49870 0.0
20 2018-04-01 47397 0.0
21 2018-05-01 53752 0.0
22 2018-06-01 47111 0.0
23 2018-07-01 53581 0.0
24 2018-08-01 55692 0.0
25 2018-09-01 51886 0.0
26 2018-10-01 56963 0.0
27 2018-11-01 56732 0.0
28 2018-12-01 59196 0.0
29 2019-01-01 57221 5.0
30 2019-02-01 55495 472.0
31 2019-03-01 65394 753.0
32 2019-04-01 59030 1174.0
33 2019-05-01 64466 2793.0
34 2019-06-01 58471 4413.0
35 2019-07-01 64785 6110.0
36 2019-08-01 63774 8360.0
37 2019-09-01 64324 9558.0
38 2019-10-01 65733 11050.0
And I need to plot a time series of the 'Prod_01' column.
The 'Date' column is in the pandas datetime format.
So I used the following command:
plt.figure(figsize=(10,4))
plt.plot('Date', 'Prod_01', data=test, linewidth=2, color='steelblue')
plt.xticks(rotation=45, horizontalalignment='right');
Output:
However, I want to change the frequency of the xticks to one month, so I get one tick and one label for each month.
I have tried the following command:
plt.figure(figsize=(10,4))
plt.plot('Date', 'Prod_01', data=test, linewidth=2, color='steelblue')
plt.xticks(np.arange(1, len(test), 1), test['Date'] ,rotation=45, horizontalalignment='right');
But I get this:
How can I solve this problem?
Thanks in advance.
I'm not very familiar with pandas data frames. However, I can't see why this wouldn't work with any pyplot plot.
According to the top SO answer on a related post by ImportanceOfBeingErnest:
The spacing between ticklabels is exclusively determined by the space between ticks on the axes.
So, to change the distance between ticks and their labels, you can do the following.
Suppose someone who likes cluttered, base-10 tick spacing displays the following graph:
It takes the following code, which also imports matplotlib.ticker:
import numpy as np
import matplotlib.pyplot as plt
# Import this, too
import matplotlib.ticker as ticker
# Arbitrary graph with x-axis = [-32..32]
x = np.linspace(-32, 32, 1024)
y = np.sinc(x)
# -------------------- Look Here --------------------
# Access plot's axes
axs = plt.axes()
# Set distance between major ticks (which always have labels)
axs.xaxis.set_major_locator(ticker.MultipleLocator(5))
# Sets distance between minor ticks (which don't have labels)
axs.xaxis.set_minor_locator(ticker.MultipleLocator(1))
# -----------------------------------------------------
# Plot and show graph
plt.plot(x, y)
plt.show()
To change where the labels are placed, you can change the distance between the 'major ticks'. You can also change the smaller 'minor ticks' in between, which don't have a number attached. E.g., on a clock, the hour ticks have numbers on them and are larger (major ticks) with smaller, unlabeled ones between marking the minutes (minor ticks).
By changing the --- Look Here --- part to:
# -------------------- Look Here --------------------
# Access plot's axes
axs = plt.axes()
# Set distance between major ticks (which always have labels)
axs.xaxis.set_major_locator(ticker.MultipleLocator(8))
# Sets distance between minor ticks (which don't have labels)
axs.xaxis.set_minor_locator(ticker.MultipleLocator(4))
# -----------------------------------------------------
You can generate the cleaner and more elegant graph below:
Hope that helps!
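For a datetime x axis like the one in the question, the same idea would presumably use the date-aware locators in matplotlib.dates instead of MultipleLocator. The sketch below rebuilds the question's Prod_01 column and asks for one major tick per month; the "%Y-%m" label format is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# The Date and Prod_01 columns from the question (monthly, Mar 2018 to Oct 2019)
test = pd.DataFrame({
    "Date": pd.date_range("2018-03-01", periods=20, freq="MS"),
    "Prod_01": [49870, 47397, 53752, 47111, 53581, 55692, 51886, 56963,
                56732, 59196, 57221, 55495, 65394, 59030, 64466, 58471,
                64785, 63774, 64324, 65733],
})

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(test["Date"], test["Prod_01"], linewidth=2, color="steelblue")

# One major tick (and therefore one label) per month
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=1))
# Label each tick as year-month
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))

# Rotate the labels so they do not overlap
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment="right")
fig.tight_layout()
```

Changing interval=1 to, say, interval=3 would give one tick per quarter instead.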
