I am looking to automate some work I have been doing in PowerPoint/Excel using Python and Matplotlib; however, I am having trouble recreating what I have been doing in PowerPoint/Excel.
I have three data series that are grouped by month on the x-axis; however, the months are not date/time and have no real x-values. I want to be able to assign x-values based on the number of rows (so they are not stacked), then group them by month, and add a vertical line once the month "value" changes.
It is also important to note that the number of rows per month can vary, so I'm having trouble grouping the months and automatically adding the vertical line once the month data changes to the next month.
Here is a sample image of what I created in PowerPoint/Excel and what I am hoping to accomplish:
Here is what I have so far:
For above: I added a new column to my csv file named "Count" and added that as my x-values; however, that is only a workaround to get my desired "look" and does not separate the points by month.
My code so far:
manipulate.csv
Count,Month,Type,Time
1,June,Purple,13
2,June,Orange,3
3,June,Purple,13
4,June,Orange,12
5,June,Blue,55
6,June,Blue,42
7,June,Blue,90
8,June,Orange,3
9,June,Orange,171
10,June,Blue,132
11,June,Blue,96
12,July,Orange,13
13,July,Orange,13
14,July,Orange,22
15,July,Orange,6
16,July,Purple,4
17,July,Orange,3
18,July,Orange,18
19,July,Blue,99
20,August,Blue,190
21,August,Blue,170
22,August,Orange,33
23,August,Orange,29
24,August,Purple,3
25,August,Purple,9
26,August,Purple,6
testchart.py
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('manipulate.csv')
df=df.reindex(columns=["Month", "Type", "Time", "Count"])
df['Orange'] = df.loc[df['Type'] == 'Orange', 'Time']
df['Blue'] = df.loc[df['Type'] == 'Blue', 'Time']
df['Purple'] = df.loc[df['Type'] == 'Purple', 'Time']
print(df)
w = df['Count']
x = df['Orange']
y = df['Blue']
z = df['Purple']
plt.plot(w, x, linestyle = 'none', marker='o', c='Orange')
plt.plot(w, y, linestyle = 'none', marker='o', c='Blue')
plt.plot(w, z, linestyle = 'none', marker='o', c='Purple')
plt.ylabel("Time")
plt.xlabel("Month")
plt.show()
Can I suggest using Seaborn's swarmplot instead? It might be easier:
import seaborn as sns
import matplotlib.pyplot as plt
# Change the month to an actual date then set the format to just the date's month's name
df.Month = pd.to_datetime(df.Month, format='%B').dt.month_name()
sns.swarmplot(data=df, x='Month', y='Time', hue='Type', palette=['purple', 'orange', 'blue'])
plt.legend().remove()
for x in range(len(df.Month.unique())-1):
    plt.axvline(0.5+x, linestyle='--', color='black', alpha = 0.5)
Output Graph:
Or Seaborn's stripplot with some jitter value:
import seaborn as sns
import matplotlib.pyplot as plt
# Change the month to an actual date then set the format to just the date's month's name
df.Month = pd.to_datetime(df.Month, format='%B').dt.month_name()
sns.stripplot(data=df, x='Month', y='Time', hue='Type', palette=['purple', 'orange', 'blue'], jitter=0.4)
plt.legend().remove()
for x in range(len(df.Month.unique())-1):
    plt.axvline(0.5+x, linestyle='--', color='black', alpha = 0.5)
If not, this answer will use matplotlib.dates's mdates to format the labels of the xaxis to just the month names. It will also use datetime's timedelta to add some days to each month to split them up (so that they are not overlapped):
from datetime import timedelta
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
df.Month = pd.to_datetime(df.Month, format='%B')
separators = df.Month.unique() # Get each unique month, to be used for the vertical lines
# Add an offset in days (between 2 and 24) to each row so that points within a month do not overlap
# The [:x] slice keeps exactly one offset per row; range() can yield one more value than the month has rows
dayAdditions = sum([list(range(2,25,int(25/x)))[:x] for x in list(df.groupby('Month').count().Time)], [])
df.Month = [x + timedelta(days=count) for x,count in zip(df.Month, dayAdditions)]
df=df.reindex(columns=["Month", "Type", "Time", "Count"])
df['Orange'] = df.loc[df['Type'] == 'Orange', 'Time']
df['Blue'] = df.loc[df['Type'] == 'Blue', 'Time']
df['Purple'] = df.loc[df['Type'] == 'Purple', 'Time']
w = df['Count']
x = df['Orange']
y = df['Blue']
z = df['Purple']
fig, ax = plt.subplots()
plt.plot(df.Month, x, linestyle = 'none', marker='o', c='Orange')
plt.plot(df.Month, y, linestyle = 'none', marker='o', c='Blue')
plt.plot(df.Month, z, linestyle = 'none', marker='o', c='Purple')
plt.ylabel("Time")
plt.xlabel("Month")
ax.xaxis.set_major_locator(mdates.MonthLocator(bymonthday=15)) # Set the locator at the 15th of each month
ax.xaxis.set_major_formatter(mdates.DateFormatter('%B')) # Set the format to just be the month name
for sep in separators[1:]:
    plt.axvline(sep, linestyle='--', color='black', alpha = 0.5) # Add a separator at every month starting at the second month
plt.show()
Output:
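As a possible variant of the day-offset step above (a sketch, not the answer's exact method), the offsets can also be derived per row with groupby, which guarantees exactly one offset per row however many rows a month has. The miniature DataFrame and the column name PlotDate are made up for illustration; only Month and Time come from the question's data.

```python
import pandas as pd

# Hypothetical miniature of the question's data (same column names).
df = pd.DataFrame({
    "Month": ["June"] * 4 + ["July"] * 2 + ["August"] * 3,
    "Time": [13, 3, 55, 42, 22, 6, 190, 33, 9],
})

# Parse the month names; with no year in the format, pandas defaults to 1900.
month_start = pd.to_datetime(df["Month"], format="%B")

# Position of each row inside its month (0, 1, 2, ...) and that month's row count.
position = df.groupby("Month").cumcount()
size = df.groupby("Month")["Month"].transform("size")

# Spread the rows over the month: offsets fall between 1 and 25 days,
# and are distinct as long as a month has at most 24 rows.
offset_days = 1 + (position + 1) * 25 // (size + 1)

# "PlotDate" is an illustrative column name for the x-values to plot.
df["PlotDate"] = month_start + pd.to_timedelta(offset_days, unit="D")
```

Plotting against df["PlotDate"] instead of df.Month should then work with the same MonthLocator/DateFormatter setup, since every shifted date stays inside its own month.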
This is how I put your data in a df, in case anyone else wants to grab it to help answer the question:
from io import StringIO
import pandas as pd
TESTDATA = StringIO(
'''Count,Month,Type,Time
1,June,Purple,13
2,June,Orange,3
3,June,Purple,13
4,June,Orange,12
5,June,Blue,55
6,June,Blue,42
7,June,Blue,90
8,June,Orange,3
9,June,Orange,171
10,June,Blue,132
11,June,Blue,96
12,July,Orange,13
13,July,Orange,13
14,July,Orange,22
15,July,Orange,6
16,July,Purple,4
17,July,Orange,3
18,July,Orange,18
19,July,Blue,99
20,August,Blue,190
21,August,Blue,170
22,August,Orange,33
23,August,Orange,29
24,August,Purple,3
25,August,Purple,9
26,August,Purple,6''')
df = pd.read_csv(TESTDATA, sep = ',')
Maybe add custom x-axis labels and separating lines between months:
new_month = ~df.Month.eq(df.Month.shift(-1))
for c in df[new_month].Count.values[:-1]:
    plt.axvline(c + 0.5, linestyle="--", color="gray")
plt.xticks(
    (df[new_month].Count + df[new_month].Count.shift(fill_value=0)) / 2,
    df[new_month].Month,
)
for color in ["Orange", "Blue", "Purple"]:
    plt.plot(
        df["Count"],
        df[color],
        linestyle="none",
        marker="o",
        color=color.lower(),
        label=color,
    )
I would also advise that you rename the color columns into something more descriptive and if possible add more time information to your data sample (days, year).
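If you would rather not keep a "Count" column in the CSV at all, the same separator and label positions can be derived from the row order alone. This is only a sketch under that assumption; the names x, is_last_of_run, run_id, separators, centres and labels are illustrative, and the row counts (11, 8, 7) mirror the sample data in the question.

```python
import pandas as pd

# Month labels in file order, mirroring the question's sample (11 + 8 + 7 rows).
months = pd.Series(["June"] * 11 + ["July"] * 8 + ["August"] * 7, name="Month")

# 1-based row position used as the x-value, replacing the hand-made "Count" column.
x = pd.Series(range(1, len(months) + 1), index=months.index)

# True on the last row of each run of identical month labels.
is_last_of_run = months.ne(months.shift(-1))

# A separator halfway between the last row of one month and the first row of the next.
separators = (x[is_last_of_run].iloc[:-1] + 0.5).tolist()

# Number the runs 1, 2, 3, ... and put one tick at the centre of each run.
run_id = months.ne(months.shift()).cumsum()
centres = x.groupby(run_id).mean().tolist()
labels = months.groupby(run_id).first().tolist()
```

These lists can then be passed to plt.axvline (one call per value in separators) and plt.xticks(centres, labels). Because runs are detected by comparing neighbouring rows, a month name that reappears later in the file is treated as a separate group.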
I am trying to convert values to axis units. I checked answers to similar problems, but none addressed this specific challenge. As can be seen in the image below, the expected plot (A) should show months (Jan, Feb, etc.) on the x-axis, but the actual plot (B) shows dates (2015-01, etc.).
Below is the source code, kindly assist. Thanks.
plt.rcParams["font.size"] = 18
plt.figure(figsize=(20,5))
plt.plot(df.air_temperature,label="Air temperature at Frankfurt Int. Airport in 2015")
plt.xlim(("2015-01-01","2015-12-31"))
plt.xticks(["2015-{:02d}-15".format(x) for x in range(1,13,1)],["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"])
plt.legend()
plt.ylabel("Temperature (°C)")
plt.show()
A sound way to plot dates is to use datetime values in place of str; so, first of all, you should do this conversion:
df = pd.read_csv(r'data/frankfurt_weather.csv')
df['time'] = pd.to_datetime(df['time'], format = '%Y-%m-%d %H:%M')
Then you can set up the plot as you please, preferably using the object-oriented interface:
plt.rcParams['font.size'] = 18
fig, ax = plt.subplots(figsize = (20,5))
ax.plot(df['time'], df['air_temperature'], label = 'Air temperature at Frankfurt Int. Airport in 2015')
ax.legend()
ax.set_ylabel('Temperature (°C)')
plt.show()
Then you can customize:
x ticks' labels format and position with matplotlib.dates:
ax.xaxis.set_major_locator(md.MonthLocator(interval = 1))
ax.xaxis.set_major_formatter(md.DateFormatter('%b'))
x axis limits:
ax.set_xlim([pd.to_datetime('2015-01-01', format = '%Y-%m-%d'),
pd.to_datetime('2015-12-31', format = '%Y-%m-%d')])
capitalized first letter of the month names in the x ticks' labels:
fig.canvas.draw()
ax.set_xticklabels([month.get_text().title() for month in ax.get_xticklabels()])
Complete Code
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as md
df = pd.read_csv(r'data/frankfurt_weather.csv')
df['time'] = pd.to_datetime(df['time'], format = '%Y-%m-%d %H:%M')
plt.rcParams['font.size'] = 18
fig, ax = plt.subplots(figsize = (20,5))
ax.plot(df['time'], df['air_temperature'], label = 'Air temperature at Frankfurt Int. Airport in 2015')
ax.legend()
ax.set_ylabel('Temperature (°C)')
ax.xaxis.set_major_locator(md.MonthLocator(interval = 1))
ax.xaxis.set_major_formatter(md.DateFormatter('%b'))
ax.set_xlim([pd.to_datetime('2015-01-01', format = '%Y-%m-%d'),
pd.to_datetime('2015-12-31', format = '%Y-%m-%d')])
fig.canvas.draw()
ax.set_xticklabels([month.get_text().title() for month in ax.get_xticklabels()])
plt.show()
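Since the CSV file is not available to everyone, here is a sketch of the same idea on synthetic data, so it can be run as-is. The temperature curve is invented for illustration, and placing the ticks on the 15th of each month with bymonthday=15 is my assumption about the layout wanted in plot (A), where the labels sit mid-month.

```python
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as md

# Synthetic stand-in for the weather file (an assumption): one value per day of 2015.
time = pd.date_range("2015-01-01", "2015-12-31", freq="D")
day_of_year = time.dayofyear.to_numpy()
air_temperature = 10 - 10 * np.cos(2 * np.pi * (day_of_year - 15) / 365)

fig, ax = plt.subplots(figsize=(20, 5))
ax.plot(time, air_temperature, label="Synthetic air temperature, 2015")

# One tick on the 15th of every month, labelled with the abbreviated month name.
ax.xaxis.set_major_locator(md.MonthLocator(bymonthday=15))
formatter = md.DateFormatter("%b")
ax.xaxis.set_major_formatter(formatter)
ax.set_xlim(pd.Timestamp("2015-01-01"), pd.Timestamp("2015-12-31"))
ax.legend()
ax.set_ylabel("Temperature (°C)")

# What the formatter produces for the twelve mid-month tick positions.
mid_month_labels = [formatter(md.date2num(datetime(2015, m, 15))) for m in range(1, 13)]
```

Calling plt.show() afterwards displays the figure. The key point is that the x-values are real datetimes, so the locator and formatter can do the labelling instead of hand-written tick strings.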
I want to make a timeline that shows the average number of messages sent over a 24h period. So far, I have managed to format both of the axes. The Y-axis already has the correct data in it.
These are the lists of data:
dates[] #a list of datetimes reduced to hours and minutes
values[] #a list of int
Now, for some time, I have tried to insert data into the graph. I have managed to insert the data now, but I assume that the X-axis is causing some problems because of formatting.
lineColor = "#f0f8ff"
chartColor = "#f0f8ff"
backgroundColor = "#36393f"
girdColor = "#8a8a8a"
dates = []
values = []
fig, ax = plt.subplots()
hours = mdates.HourLocator(interval=2)
d_fmt = mdates.DateFormatter('%H:%M')
ax.xaxis.set_minor_locator(mdates.HourLocator(interval=1))
ax.xaxis.set_major_locator(hours)
ax.xaxis.set_major_formatter(d_fmt)
ax.fill(dates, values)
ax.plot(dates, values, color=Commands.lineColor)
ax.set_xlim(["00:00", "23:59"])
plt.fill_between(dates, values,)
# region ChartDesign
ax.set_title('Amount of Messages')
ax.tick_params(axis='y', colors=Commands.chartColor)
ax.tick_params(axis='x', colors=Commands.chartColor)
ax.tick_params(which='minor', colors=Commands.chartColor)
ax.set_ylabel('Messages', color=Commands.chartColor)
plt.grid(True, color=Commands.girdColor)
ax.set_facecolor(Commands.backgroundColor)
ax.spines["bottom"].set_color(Commands.chartColor)
ax.spines["left"].set_color(Commands.chartColor)
ax.spines["top"].set_color(Commands.chartColor)
ax.spines["right"].set_color(Commands.chartColor)
fig.patch.set_facecolor(Commands.backgroundColor)
fig.tight_layout()
fig.autofmt_xdate()
# endregion
There are similar questions, but they aren't much use for me.
Since I don't have any sample data, I created some simple data and made a graph. The 0:00 time at the end of the timeline is a challenge, so I had to be creative: I replaced the last 0:00 with 24:00. I then set the locator interval on the X axis to 48; in your code, it would be every 2 hours. I have removed the code that I deemed unnecessary.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
lineColor = "#f0f8ff"
chartColor = "#f0f8ff"
backgroundColor = "#36393f"
girdColor = "#8a8a8a"
date_rng = pd.date_range('2020-12-01', '2020-12-02', freq='1H')
dates = date_rng.strftime('%H:%M').tolist()
values = np.random.randint(0,25, size=25)
dates[-1] = '24:00'
fig, ax = plt.subplots(figsize=(12,9))
hours = mdates.HourLocator(interval=48)
ax.xaxis.set_major_locator(hours)
# ax.fill(dates, values)
ax.plot(dates, values, color=lineColor)
ax.fill_between(dates, values,)
# region ChartDesign
ax.set_title('Amount of Messages', color=chartColor)
ax.tick_params(axis='y', colors=chartColor)
ax.tick_params(axis='x', colors=chartColor)
# ax.tick_params(which='major', colors=chartColor)
ax.set_ylabel('Messages', color=chartColor)
ax.grid(True, color=girdColor)
ax.set_facecolor(backgroundColor)
ax.spines["bottom"].set_color(chartColor)
ax.spines["left"].set_color(chartColor)
ax.spines["top"].set_color(chartColor)
ax.spines["right"].set_color(chartColor)
fig.set_facecolor(backgroundColor)
fig.tight_layout()
fig.autofmt_xdate()
plt.show()
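An alternative worth trying, sketched here under the assumption that the times can be attached to an arbitrary calendar date: keep the x-values as real datetimes, so that HourLocator and DateFormatter from the question work directly. The anchor date and the random values are made up for illustration.

```python
from datetime import datetime, timedelta

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Arbitrary anchor date (an assumption); only the time of day ends up on the axis.
anchor = datetime(2020, 12, 1)
dates = [anchor + timedelta(hours=h) for h in range(25)]  # 00:00 through midnight of the next day

# Random placeholder values, one per hour.
rng = np.random.default_rng(0)
values = rng.integers(0, 25, size=len(dates))

fig, ax = plt.subplots(figsize=(12, 9))
ax.plot(dates, values)
ax.fill_between(dates, values)

# Major tick every 2 hours, minor tick every hour, labels as HH:MM.
ax.xaxis.set_major_locator(mdates.HourLocator(interval=2))
ax.xaxis.set_minor_locator(mdates.HourLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))

# Limit the axis to exactly one day.
ax.set_xlim(dates[0], dates[-1])
```

Note that the final tick is midnight of the following day, so it is labelled 00:00 rather than 24:00. The colour and spine styling from the code above can be applied unchanged.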
I have very simple code:
from matplotlib import dates
import matplotlib.ticker as ticker
my_plot=df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90)
I've got:
but I would like to have fewer labels on the X axis. To do this I've added:
my_plot.xaxis.set_major_locator(ticker.MaxNLocator(12))
It generates fewer labels, but the labels have the wrong values (they are just the first few labels from the whole list).
What am I doing wrong?
I have added additional information:
I forgot to show what is inside the DataFrame.
I have three columns:
reg_Date - datetime64 (index)
temperature - float64
Day - date converted from reg_Date to string, it looks like '2017-10' (YYYY-MM)
The box plot groups data by 'Day', and I would like to show the 'Day' values as labels, but not all of them; for example, every third one.
You were almost there. Just set ticker.MultipleLocator.
The pandas.DataFrame.boxplot also returns axes, which is an object of class matplotlib.axes.Axes. So you can use this code snippet to customize your labels:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
center = np.random.randint(50,size=(10, 20))
spread = np.random.rand(10, 20) * 30
flier_high = np.random.rand(10, 20) * 30 + 30
flier_low = np.random.rand(10, 20) * -30
y = np.concatenate((spread, center, flier_high, flier_low))
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(y)
x = ['Label '+str(i) for i in range(20)]
ax.set_xticklabels(x)
ax.set_xlabel('Day')
# Set a tick on each integer multiple of a base within the view interval.
ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
plt.xticks(rotation=90)
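One thing to watch for when changing the locator: a box plot draws its boxes at x = 1, 2, 3, ... and the labels are matched to ticks in order, so moving the ticks can leave the first few labels on the wrong boxes. A possible way around this, offered as a sketch rather than the only fix, is to look the label up from the tick position with ticker.FuncFormatter. The helper name label_for_position and the random data are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

# Placeholder data: 20 boxes with 50 samples each.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 20))
labels = ["Label " + str(i) for i in range(20)]

fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(data)  # boxes are drawn at x = 1, 2, ..., 20


def label_for_position(x, pos):
    """Return the label of the box drawn at tick position x, or '' if there is none."""
    index = int(round(x)) - 1
    return labels[index] if 0 <= index < len(labels) else ""


# A tick on every multiple of 5; each tick asks the formatter for its own label.
ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
ax.xaxis.set_major_formatter(ticker.FuncFormatter(label_for_position))
ax.tick_params(axis="x", labelrotation=90)
```

With this, the tick at x = 5 shows the label of the fifth box, whichever locator is used.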
I think there is a compatibility issue with Pandas plots and Matplotlib formatters.
With the following code:
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
for l in labels:
    if i % 3 == 0:
        label = labels[i]
        i += 1
        new_labels.append(label)
    else:
        label = ''
        i += 1
        new_labels.append(label)
ax.set_xticklabels(new_labels)
plt.show()
You get this chart:
But I notice that this is grouped by month instead of by day. It may not be what you wanted.
Adding the day component to the 'Day' string clutters the chart, as there seem to be too many boxes.
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m-%d')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
for l in labels:
    if i % 15 == 0:
        label = labels[i]
        i += 1
        new_labels.append(label)
    else:
        label = ''
        i += 1
        new_labels.append(label)
ax.set_xticklabels(new_labels)
plt.show()
The for loop keeps a tick label only every so many periods and blanks out the rest. In the first chart the labels were set every 3 months; in the second one, every 15 days.
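The same label thinning can be written more compactly. This is just a sketch; thin_labels is a hypothetical helper name, and the month strings are made-up examples in the 'YYYY-MM' format described in the question.

```python
def thin_labels(labels, step):
    """Keep every `step`-th label and blank out the rest; positions stay unchanged."""
    return [label if i % step == 0 else "" for i, label in enumerate(labels)]


# Example input in the same 'YYYY-MM' format as the 'Day' column.
days = [f"2017-{month:02d}" for month in range(1, 13)]
thinned = thin_labels(days, 3)
```

It could be used as ax.set_xticklabels(thin_labels([t.get_text() for t in ax.get_xticklabels()], 3)), which leaves one label on every third box.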
If you would like to see fewer grid lines:
df = pd.read_csv('lt_stream-1001-full.csv', header=0, encoding='utf8')
df['reg_date'] = pd.to_datetime(df['reg_date'] , format='%Y-%m-%d %H:%M:%S')
df.set_index('reg_date', inplace=True)
df_h = df.resample(rule='H').mean()
df_h['Day']=df_h.index.strftime('%Y-%m-%d')
print(df_h)
f, ax = plt.subplots()
my_plot = df_h.boxplot(by='Day',figsize=(12,5), showfliers=False, rot=90, ax=ax)
locs, labels = plt.xticks()
i = 0
new_labels = list()
new_locs = list()
for l in labels:
    if i % 3 == 0:
        label = labels[i]
        loc = locs[i]
        i += 1
        new_labels.append(label)
        new_locs.append(loc)
    else:
        i += 1
ax.set_xticks(new_locs)
ax.set_xticklabels(new_labels)
ax.grid(axis='y')
plt.show()
I've read about x_compat in Pandas plot in order to apply Matplotlib formatters, but I get an error when trying to apply it. I'll give it another shot later.
Old unsuccessful answer
The tick labels seem to be dates. If they are set as datetime in your dataframe, you can:
months = mdates.MonthLocator(bymonth=(1,4,7,10)) #Choose the months you like the most
ax.xaxis.set_major_locator(months)
Otherwise, you can let Matplotlib know they are dates by:
ax.xaxis_date()
Your comment:
I have added additional information:
I forgot to show what is inside the DataFrame.
I have three columns:
reg_Date - datetime64 (index)
temperature - float64
Day - date converted from reg_Date to string, it looks like '2017-10' (YYYY-MM)
The box plot groups data by 'Day', and I would like to show the 'Day' values as labels, but not all of them; for example, every third one.
Based on your comment above, I would use reg_Date as the input and the following lines:
days = mdates.DayLocator(interval=3)
daysFmt = mdates.DateFormatter('%Y-%m') #to format display
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(daysFmt)
I forgot to mention that you will need to:
import matplotlib.dates as mdates
Does this work?