I am looking to automate some work I have been doing in PowerPoint/Excel using Python and MatPlotLib; however, I am having trouble recreating what I have been doing in PowerPoint/Excel.
I have three data series that are grouped by month on the x-axis; however, the months are not date/time and have no real x-values. I want to be able to assign x-values based on the number of rows (so they are not stacked), then group them by month, and add a vertical line once the month "value" changes.
It is also important to note that the number of rows per month can vary, so I'm having trouble grouping the months and automatically adding the vertical line when the data changes to the next month.
Here is a sample image of what I created in PowerPoint/Excel and what I am hoping to accomplish:
Here is what I have so far:
For above: I added a new column to my csv file named "Count" and added that as my x-values; however, that is only a workaround to get my desired "look" and does not separate the points by month.
My code so far:
manipulate.csv
Count,Month,Type,Time
1,June,Purple,13
2,June,Orange,3
3,June,Purple,13
4,June,Orange,12
5,June,Blue,55
6,June,Blue,42
7,June,Blue,90
8,June,Orange,3
9,June,Orange,171
10,June,Blue,132
11,June,Blue,96
12,July,Orange,13
13,July,Orange,13
14,July,Orange,22
15,July,Orange,6
16,July,Purple,4
17,July,Orange,3
18,July,Orange,18
19,July,Blue,99
20,August,Blue,190
21,August,Blue,170
22,August,Orange,33
23,August,Orange,29
24,August,Purple,3
25,August,Purple,9
26,August,Purple,6
testchart.py
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('manipulate.csv')
df=df.reindex(columns=["Month", "Type", "Time", "Count"])
df['Orange'] = df.loc[df['Type'] == 'Orange', 'Time']
df['Blue'] = df.loc[df['Type'] == 'Blue', 'Time']
df['Purple'] = df.loc[df['Type'] == 'Purple', 'Time']
print(df)
w = df['Count']
x = df['Orange']
y = df['Blue']
z = df['Purple']
plt.plot(w, x, linestyle = 'none', marker='o', c='Orange')
plt.plot(w, y, linestyle = 'none', marker='o', c='Blue')
plt.plot(w, z, linestyle = 'none', marker='o', c='Purple')
plt.ylabel("Time")
plt.xlabel("Month")
plt.show()
Can I suggest using Seaborn's swarmplot instead? It might be easier:
import seaborn as sns
import matplotlib.pyplot as plt
# Change the month to an actual date then set the format to just the date's month's name
df.Month = pd.to_datetime(df.Month, format='%B').dt.month_name()
sns.swarmplot(data=df, x='Month', y='Time', hue='Type', palette=['purple', 'orange', 'blue'])
plt.legend().remove()
for x in range(len(df.Month.unique())-1):
    plt.axvline(0.5+x, linestyle='--', color='black', alpha = 0.5)
Output Graph:
Or Seaborn's stripplot with some jitter value:
import seaborn as sns
import matplotlib.pyplot as plt
# Change the month to an actual date then set the format to just the date's month's name
df.Month = pd.to_datetime(df.Month, format='%B').dt.month_name()
sns.stripplot(data=df, x='Month', y='Time', hue='Type', palette=['purple', 'orange', 'blue'], jitter=0.4)
plt.legend().remove()
for x in range(len(df.Month.unique())-1):
    plt.axvline(0.5+x, linestyle='--', color='black', alpha = 0.5)
If not, this answer will use matplotlib.dates's mdates to format the labels of the xaxis to just the month names. It will also use datetime's timedelta to add some days to each month to split them up (so that they are not overlapped):
from datetime import timedelta
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
df.Month = pd.to_datetime(df.Month, format='%B')
separators = df.Month.unique() # Get each unique month, to be used for the vertical lines
# Add an amount of days to each value within a range of 25 days based on how many days are in each month in the dataframe
# This is just to split up the days so that there is no overlap
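# Keep exactly x offsets per month so they stay aligned with that month's rows (works for up to 23 rows per month)
dayAdditions = sum([list(range(2,25,int(25/x)))[:x] for x in list(df.groupby('Month').count().Time)], [])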
df.Month = [x + timedelta(days=count) for x,count in zip(df.Month, dayAdditions)]
df=df.reindex(columns=["Month", "Type", "Time", "Count"])
df['Orange'] = df.loc[df['Type'] == 'Orange', 'Time']
df['Blue'] = df.loc[df['Type'] == 'Blue', 'Time']
df['Purple'] = df.loc[df['Type'] == 'Purple', 'Time']
w = df['Count']
x = df['Orange']
y = df['Blue']
z = df['Purple']
fig, ax = plt.subplots()
plt.plot(df.Month, x, linestyle = 'none', marker='o', c='Orange')
plt.plot(df.Month, y, linestyle = 'none', marker='o', c='Blue')
plt.plot(df.Month, z, linestyle = 'none', marker='o', c='Purple')
plt.ylabel("Time")
plt.xlabel("Month")
ax.xaxis.set_major_locator(mdates.MonthLocator(bymonthday=15)) # Set the locator at the 15th of each month
ax.xaxis.set_major_formatter(mdates.DateFormatter('%B')) # Set the format to just be the month name
for sep in separators[1:]:
    plt.axvline(sep, linestyle='--', color='black', alpha = 0.5) # Add a separator at every month starting at the second month
plt.show()
Output:
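As a possible alternative to the range-based day offsets above, here is a minimal sketch that spreads the rows of each month evenly using groupby().cumcount(). The small data frame, the column name "x" and the 24-day span are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Small stand-in for the question's data (month name plus a value).
df = pd.DataFrame({
    "Month": ["June"] * 4 + ["July"] * 2 + ["August"] * 3,
    "Time": [13, 3, 55, 42, 13, 99, 190, 33, 6],
})

# Parse month names into dates on the first of each month (the year defaults to 1900).
month_start = pd.to_datetime(df["Month"], format="%B")

# Position of each row inside its month (0, 1, 2, ...) and the number of rows in that month.
rank = df.groupby("Month").cumcount()
size = df.groupby("Month")["Time"].transform("count")

# Spread the rows evenly over a 24-day window starting 2 days into the month.
# The window length is an arbitrary choice that keeps every point inside its month.
span_days = 24
offset_days = 2 + rank * span_days / size
df["x"] = month_start + pd.to_timedelta(offset_days, unit="D")

print(df)
```

Plotting df["x"] instead of the shifted Month column should then work with the same MonthLocator and DateFormatter calls, and it does not run out of offsets when a month has many rows.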
This is how I put your data in a df, in case anyone else wants to grab it to help answer the question:
from io import StringIO
import pandas as pd
TESTDATA = StringIO(
'''Count,Month,Type,Time
1,June,Purple,13
2,June,Orange,3
3,June,Purple,13
4,June,Orange,12
5,June,Blue,55
6,June,Blue,42
7,June,Blue,90
8,June,Orange,3
9,June,Orange,171
10,June,Blue,132
11,June,Blue,96
12,July,Orange,13
13,July,Orange,13
14,July,Orange,22
15,July,Orange,6
16,July,Purple,4
17,July,Orange,3
18,July,Orange,18
19,July,Blue,99
20,August,Blue,190
21,August,Blue,170
22,August,Orange,33
23,August,Orange,29
24,August,Purple,3
25,August,Purple,9
26,August,Purple,6''')
df = pd.read_csv(TESTDATA, sep = ',')
Maybe add custom x-axis labels and separating lines between months:
new_month = ~df.Month.eq(df.Month.shift(-1))
for c in df[new_month].Count.values[:-1]:
    plt.axvline(c + 0.5, linestyle="--", color="gray")
plt.xticks(
    (df[new_month].Count + df[new_month].Count.shift(fill_value=0)) / 2,
    df[new_month].Month,
)
for color in ["Orange", "Blue", "Purple"]:
    plt.plot(
        df["Count"],
        df[color],
        linestyle="none",
        marker="o",
        color=color.lower(),
        label=color,
    )
I would also advise that you rename the color columns into something more descriptive and if possible add more time information to your data sample (days, year).
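If you would rather not keep a "Count" column in the CSV, the separator and tick positions can be derived from the row order alone. This is a sketch under the assumption that each month forms one contiguous run of rows; the variable names are made up for illustration.

```python
import pandas as pd

# Month column with the same run lengths as the sample data (11, 8 and 7 rows).
months = pd.Series(["June"] * 11 + ["July"] * 8 + ["August"] * 7, name="Month")

# Row positions 1..n play the role of the "Count" column.
pos = pd.Series(range(1, len(months) + 1), index=months.index)

# True on the last row of each month (the next row holds a different month, or nothing).
is_last = months.ne(months.shift(-1))

# Separators sit halfway between the last row of one month and the first row of the next.
separators = (pos[is_last].iloc[:-1] + 0.5).tolist()

# Tick positions are the mean row position of each month, kept in order of appearance.
centers = pos.groupby(months, sort=False).mean()

print(separators)
print(centers)
```

The values in separators can be passed one by one to plt.axvline, and plt.xticks(centers, centers.index) places each month name under the middle of its points.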
Related
I have the following code:
# Ratings by day, divided by Staff member
from datetime import datetime as dt
by_staff = df.groupby('User ID')
plt.figure(figsize=(15,8))
# Those are used to calculate xticks and yticks
xmin, xmax = pd.to_datetime(dt.now()), pd.to_datetime(0)
ymin, ymax = 0, 0
for index, data in by_staff:
    by_day = data.groupby('Date')
    x = pd.to_datetime(by_day.count().index)
    y = by_day.count()['Value']
    xmin = min(xmin, x.min())
    xmax = max(xmax, x.max())
    ymin = min(ymin, min(y))
    ymax = max(ymax, max(y))
    plt.plot_date(x, y, marker='o', label=index, markersize=12)
plt.title('Ratings by day, by Staff member', fontdict = {'fontsize': 25})
plt.xlabel('Day', fontsize=15)
plt.ylabel('n° of ratings for that day', fontsize=15)
ticks = pd.date_range(xmin, xmax, freq='D')
plt.xticks(ticks, rotation=60)
plt.yticks(range(ymin, ymax + 1))
plt.gcf().autofmt_xdate()
plt.grid()
plt.legend([a for a, b in by_staff],
title="Ratings given",
loc="center left",
bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()
I'd like to set the value shown at a specific xtick to 0 if there's no data for the day. Currently, this is the plot shown:
I tried some Google searches, but I can't seem to explain my problem correctly. How could I solve this?
My dataset: https://cdn.discordapp.com/attachments/311932890017693700/800789506328100934/sample-ratings.csv
Let's simplify the task by letting pandas aggregate the data. We group by Date and User ID simultaneously and then unstack the dataframe. This allows us to fill the missing data points with a preset value like 0. The form x = df.groupby(["Date",'User ID']).count().Value.unstack(fill_value=0) is compact chaining for a = df.groupby(["Date",'User ID']), b = a.count(), c = b.Value, x = c.unstack(fill_value=0). You can print each intermediate result of these chained pandas operations to see what it does.
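To see the chain step by step, here is a small sketch on a made-up four-row frame. The column names follow the question; the values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the ratings data.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02"]),
    "User ID": ["A", "A", "B", "A"],
    "Value": [5, 4, 3, 5],
})

a = df.groupby(["Date", "User ID"])  # one group per (date, user) pair
b = a.count()                        # number of rows in each group, per remaining column
c = b.Value                          # keep only the counts of the "Value" column
x = c.unstack(fill_value=0)          # users become columns; absent (date, user) pairs become 0

print(b)
print(x)
```

User B has no rating on 2021-01-02, so that cell is filled with 0 by unstack. The same chain is applied to the real data in the code that follows.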
from matplotlib import pyplot as plt
import pandas as pd
df = pd.read_csv("test.csv", sep=",", parse_dates=["Date"])
#by_staff = df.groupby(["Date",'User ID']) - group entries by date and ID
#.count - count identical date-ID pairs
#.Value - use only this column
#.unstack(fill_value=0) bring resulting data from long to wide form
#and fill missing data with zero
by_staff = df.groupby(["Date",'User ID']).count().Value.unstack(fill_value=0)
ax = by_staff.plot(marker='o', markersize=12, linestyle="None", figsize=(15,8))
plt.title('Ratings by day, by Staff member', fontdict = {'fontsize': 25})
plt.xlabel('Day', fontsize=15)
plt.ylabel('n° of ratings for that day', fontsize=15)
#labeling only the actual rating values shown in the grid
plt.yticks(range(df.Value.max() + 1))
#this is not really necessary, it just labels zero differently
#labels = ["No rating"] + [str(i) for i in range(1, df.Value.max() + 1)]
#ax.set_yticklabels(labels)
plt.gcf().autofmt_xdate()
plt.grid()
plt.show()
Sample output:
Note that markers with the same date and count are drawn on top of each other, so multiple entries at one point are not visible.
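One caveat: unstack(fill_value=0) only fills in dates on which at least one user gave a rating. If a whole day is missing from the data, a possible extension (an assumption, not part of the answer above) is to reindex the result to a complete daily range. The frame below is a hypothetical stand-in for by_staff.

```python
import pandas as pd

# Hypothetical counts per day and user; 2021-01-03 has no ratings at all, so it has no row.
by_staff = pd.DataFrame(
    {"A": [2, 1, 3], "B": [1, 0, 2]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"]),
)

# Build every calendar day between the first and last date, then insert 0 for missing days.
all_days = pd.date_range(by_staff.index.min(), by_staff.index.max(), freq="D")
by_staff_full = by_staff.reindex(all_days, fill_value=0)

print(by_staff_full)
```

Plotting by_staff_full instead of by_staff should then show a marker at 0 for every user on days without any data.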
I am graphing three lines on a single plot. I want the x-axis to display the date the data was taken on and the time from 00:00 to 24:00. Right now my code displays the time of day correctly, but instead of the date the data was recorded on, the current date is shown (12-18). I am unsure how to correct this. It would also be acceptable for my plot to show only the time from 00:00 to 24:00 without the date on the x-axis. Thank you for your help!!
# set index as time for graphing
monAverages['Time'] = monAverages['Time'].apply(lambda x: pd.to_datetime(str(x)))
index = monAverages['Time']
index = index.apply(lambda x: pd.to_datetime(str(x)))
averagePlot = dfSingleDay
predictPlot = predictPlot[np.isfinite(predictPlot)]
datasetPlot = datasetPlot[np.isfinite(datasetPlot)]
predictPlot1 = pd.DataFrame(predictPlot)
datasetPlot1 = pd.DataFrame(datasetPlot)
averagePlot.set_index(index, drop=True,inplace=True)
datasetPlot1.set_index(index, drop=True,inplace=True)
predictPlot1.set_index(index, drop=True,inplace=True)
plt.rcParams["figure.figsize"] = (10,10)
plt.plot(datasetPlot1,'b', label='Real Data')
plt.plot(averagePlot, 'y', label='Average for this day of the week')
plt.plot(predictPlot1, 'g', label='Predictions')
plt.title('Power Consumption')
plt.xlabel('Date (00-00) and Time of Day(00)')
plt.ylabel('kW')
plt.legend()
plt.show()
You need to be sure that you get only the time:
import matplotlib.dates as mdates
# set index as time for graphing
monAverages['Time'] = monAverages['Time'].apply(lambda x: pd.to_datetime(str(x)))
index = monAverages['Time']
#index = index.apply(lambda x: pd.to_datetime(str(x)))
# index already holds Timestamps (not strings), so take the time of day directly
dates = [d.time() for d in index]
averagePlot = dfSingleDay
predictPlot = predictPlot[np.isfinite(predictPlot)]
datasetPlot = datasetPlot[np.isfinite(datasetPlot)]
predictPlot1 = pd.DataFrame(predictPlot)
datasetPlot1 = pd.DataFrame(datasetPlot)
plt.rcParams["figure.figsize"] = (10,10)
plt.plot(dates,datasetPlot1,'b', label='Real Data')
plt.plot(dates,averagePlot, 'y', label='Average for this day of the week')
plt.plot(dates,predictPlot1, 'g', label='Predictions')
plt.title('Power Consumption')
plt.xlabel('Date (00-00) and Time of Day(00)')
plt.ylabel('kW')
plt.legend()
plt.show()
Here is a minimal example that shows the idea, starting from date strings:
import datetime as dt
import matplotlib.pyplot as plt
dates = ['2019-12-18 00:00:00','2019-12-18 12:00:00','2019-12-18 13:00:00']
x = [dt.datetime.strptime(d,'%Y-%m-%d %H:%M:%S').time() for d in dates]
y = range(len(x))
plt.plot(x,y)
plt.gcf().autofmt_xdate()
plt.show()
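If plotting datetime.time objects causes trouble in your setup, another option (an alternative technique, not the one used above) is to convert each timestamp to hours since midnight and label the ticks yourself. A minimal sketch with made-up timestamps:

```python
import pandas as pd

# Hypothetical timestamps recorded on one day; only the time of day matters for the x-axis.
stamps = pd.Series(pd.to_datetime([
    "2019-12-16 00:00:00",
    "2019-12-16 06:30:00",
    "2019-12-16 12:00:00",
    "2019-12-16 23:45:00",
]))

# Hours since midnight as plain floats, so the plot does not depend on any date.
hours = stamps.dt.hour + stamps.dt.minute / 60 + stamps.dt.second / 3600

# Tick positions and HH:MM labels for a 0..24 axis, one tick every 6 hours.
tick_hours = list(range(0, 25, 6))
tick_labels = [f"{h:02d}:00" for h in tick_hours]

print(hours.tolist())
print(tick_labels)
```

You would then call plt.plot(hours, values) for each series, followed by plt.xticks(tick_hours, tick_labels) and plt.xlim(0, 24).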
I am using the pandas plot to generate a stacked bar chart, which has a different behaviour from matplotlib's, but the dates always come out with a bad format and I could not change it.
I would also like to add a "total" line on the chart. But when I try to add it, the previous bars are erased.
I want to make a chart like the one below (generated by excel). The black line is the sum of the bars.
I've looked at some solutions online, but they only look good when there are not many bars, so you get some space between the labels.
Here is the best I could do and below there is the code I used.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
# DATA (not the full series from the chart)
dates = ['2016-10-31', '2016-11-30', '2016-12-31', '2017-01-31', '2017-02-28', '2017-03-31',
'2017-04-30', '2017-05-31', '2017-06-30', '2017-07-31', '2017-08-31', '2017-09-30',
'2017-10-31', '2017-11-30', '2017-12-31', '2018-01-31', '2018-02-28', '2018-03-31',
'2018-04-30', '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31', '2018-09-30',
'2018-10-31', '2018-11-30', '2018-12-31', '2019-01-31', '2019-02-28', '2019-03-31']
variables = {'quantum ex sa': [6.878011, 6.557054, 3.229360, 3.739318, 1.006442, -0.117945,
-1.854614, -2.882032, -1.305225, 0.280100, 0.524068, 1.847649,
5.315940, 4.746596, 6.650303, 6.809901, 8.135243, 8.127328,
9.202209, 8.146417, 6.600906, 6.231881, 5.265775, 3.971435,
2.896829, 4.307549, 4.695687, 4.696656, 3.747793, 3.366878],
'price ex sa': [-11.618681, -9.062433, -6.228452, -2.944336, 0.513788, 4.068517,
6.973203, 8.667524, 10.091766, 10.927501, 11.124805, 11.368854,
11.582204, 10.818471, 10.132152, 8.638781, 6.984159, 5.161404,
3.944813, 3.723371, 3.808564, 4.576303, 5.170760, 5.237303,
5.121998, 5.502981, 5.159970, 4.772495, 4.140812, 3.568077]}
df = pd.DataFrame(index=pd.to_datetime(dates), data=variables)
# PLOTTING
ax = df.plot(kind='bar', stacked=True, width=1)
# df['Total'] = df.sum(axis=1)
# df['Total'].plot(ax=ax)
ax.axhline(0, linewidth=1)
ax.yaxis.set_major_formatter(plticker.PercentFormatter())
plt.tight_layout()
plt.show()
Edit
This is what worked best for me. It works better than pandas' df.plot(kind='bar', stacked=True) because it allows better formatting of the date labels on the x-axis and supports any number of bar series.
for count, col in enumerate(df.columns):
    old = df.iloc[:, :count].sum(axis=1)
    bottom_series = ((old >= 0) == (df[col] >= 0)) * old
    ax.bar(df.index, df[col], label=col, bottom=bottom_series, width=31)
df['Total'] = df.sum(axis=1)
ax.plot(df.index, df['Total'], color='black', label='Total')
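With three or more series of mixed sign, the running sum of all earlier columns can differ from the top of the stack that has the same sign, so bars may overlap. A common alternative, sketched here as an assumption rather than as the poster's method, keeps separate running totals for the positive and the negative values. The data and column names are placeholders.

```python
import pandas as pd

# Hypothetical data with mixed signs.
df = pd.DataFrame({
    "a": [2.0, -1.0, 3.0],
    "b": [1.0, -2.0, -1.0],
    "c": [-1.0, 4.0, 2.0],
})

pos = df.clip(lower=0)  # positive parts only, negatives replaced by 0
neg = df.clip(upper=0)  # negative parts only, positives replaced by 0

# Exclusive running sums: the total of all earlier columns with the same sign.
pos_bottom = pos.cumsum(axis=1) - pos
neg_bottom = neg.cumsum(axis=1) - neg

# Use the positive stack for non-negative values and the negative stack otherwise.
bottoms = pos_bottom.where(df >= 0, neg_bottom)

print(bottoms)
```

Inside the plotting loop, bottom=bottoms[col] would take the place of bottom_series.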
Is this what you want:
fig, ax = plt.subplots(1,1, figsize=(16,9))
# PLOTTING
ax.bar(df.index, df['price ex sa'], bottom=df['quantum ex sa'],width=31, label='price ex sa')
ax.bar(df.index, df['quantum ex sa'], width=31, label='quantum ex sa')
total = df.sum(axis=1)
ax.plot(total.index, total, color='r', linewidth=3, label='total')
ax.legend()
plt.show()
Edit: There seems to be a bug (or feature) when plotting with a datetime index. Converting the index to strings works:
df.index=df.index.strftime('%Y-%m')
ax = df.plot(kind='bar', stacked=True, width=1)
df['Total'] = df.sum(axis=1)
df['Total'].plot(ax=ax, label='total')
ax.legend()
Edit 2: I think I know what's going on. The problem is that
ax = df.plot(kind='bar', stacked=True)
returns/sets x-axis of ax to range(len(df)) labeled by the corresponding values from df.index, but not df.index itself. That's why if we plot the second series on the same ax, it doesn't show (due to different scale of xaxis). So I tried:
# PLOTTING
columns = df.columns
ax = df.plot(kind='bar', stacked=True, width=1, figsize=(10, 6))
ax.plot(range(len(df)), df.sum(1), label='Total')
ax.legend()
plt.show()
and it works as expected.
I've tried for several hours to make this work. I tried using 'python-gantt' package, without luck. I also tried plotly (which was beautiful, but I can't host my sensitive data on their site, so that won't work).
My starting point is code from here:
How to plot stacked event duration (Gantt Charts) using Python Pandas?
Three Requirements:
1. Include the 'Name' on the y axis rather than the numbers.
2. If someone has multiple events, put all the event periods on one line (this will make pattern identification easier), e.g. Lisa will only have one line on the visual.
3. Include the 'Event' listed on top of the corresponding line (if possible), e.g. Lisa's first line would say "Hire".
The code will need to be dynamic to accommodate many more people and more possible event types...
I'm open to suggestions to visualize: I want to show the duration for various staffing events throughout the year, as to help identify patterns.
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dt
df = pd.DataFrame({'Name': ['Joe','Joe','Lisa','Lisa','Lisa','Alice'],
'Event': ['Hire','Term','Hire','Transfer','Term','Term'],
'Start_Date': ["2014-01-01","2014-02-01","2015-01-01","2015-02-01","2015-03-01","2016-01-01"],
'End_Date': ["2014-01-31","2014-03-15","2015-01-31","2015-02-28","2015-05-01","2016-09-01"]
})
df = df[['Name','Event','Start_Date','End_Date']]
df.Start_Date = pd.to_datetime(df.Start_Date).astype(datetime)
df.End_Date = pd.to_datetime(df.End_Date).astype(datetime)
fig = plt.figure()
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(df.index, dt.date2num(df.Start_Date), dt.date2num(df.End_Date))
I encountered the same problem in the past. You seem to appreciate the aesthetics of Plotly. Here is a little piece of code which uses matplotlib.pyplot.broken_barh instead of matplotlib.pyplot.hlines.
from collections import defaultdict
from datetime import datetime
from datetime import date
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
df = pd.DataFrame({
'Name': ['Joe', 'Joe', 'Lisa', 'Lisa', 'Lisa', 'Alice'],
'Event': ['Hire', 'Term', 'Hire', 'Transfer', 'Term', 'Term'],
'Start_Date': ['2014-01-01', '2014-02-01', '2015-01-01', '2015-02-01', '2015-03-01', '2016-01-01'],
'End_Date': ['2014-01-31', '2014-03-15', '2015-01-31', '2015-02-28', '2015-05-01', '2016-09-01']
})
df = df[['Name', 'Event', 'Start_Date', 'End_Date']]
# Keep the columns as pandas datetimes; recent pandas versions reject .astype(datetime)
df.Start_Date = pd.to_datetime(df.Start_Date)
df.End_Date = pd.to_datetime(df.End_Date)
names = df.Name.unique()
nb_names = len(names)
fig = plt.figure()
ax = fig.add_subplot(111)
bar_width = 0.8
default_color = 'blue'
colors_dict = defaultdict(lambda: default_color, Hire='green', Term='red', Transfer='orange')
# Plot the events
for index, name in enumerate(names):
    mask = df.Name == name
    start_dates = mdates.date2num(df.loc[mask].Start_Date)
    end_dates = mdates.date2num(df.loc[mask].End_Date)
    durations = end_dates - start_dates
    # broken_barh needs a sequence of (start, width) pairs, so materialise the zip
    xranges = list(zip(start_dates, durations))
    ymin = index - bar_width / 2.0
    ywidth = bar_width
    yrange = (ymin, ywidth)
    facecolors = [colors_dict[event] for event in df.loc[mask].Event]
    ax.broken_barh(xranges, yrange, facecolors=facecolors, alpha=1.0)
# you can set alpha to 0.6 to check if there are some overlaps
# Shrink the x-axis
box = ax.get_position()
ax.set_position([box.x0, box.y0, box.width * 0.8, box.height])
# Add the legend
patches = [mpatches.Patch(color=color, label=key) for (key, color) in colors_dict.items()]
patches = patches + [mpatches.Patch(color=default_color, label='Other')]
plt.legend(handles=patches, bbox_to_anchor=(1, 0.5), loc='center left')
# Format the x-ticks
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
ax.xaxis.set_minor_locator(mdates.MonthLocator())
# Format the y-ticks
ax.set_yticks(range(nb_names))
ax.set_yticklabels(names)
# Set the limits
date_min = date(df.Start_Date.min().year, 1, 1)
date_max = date(df.End_Date.max().year + 1, 1, 1)
ax.set_xlim(date_min, date_max)
# Format the coords message box
ax.format_xdata = mdates.DateFormatter('%Y-%m-%d')
# Set the title
ax.set_title('Gantt Chart')
plt.show()
I hope this will help you.
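The third requirement (showing the event name above each bar) is not covered by the code above. One way to approach it, sketched here with made-up column names "y" and "label_x", is to compute a row number per person and the horizontal midpoint of each event:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Joe", "Joe", "Lisa", "Lisa", "Lisa", "Alice"],
    "Event": ["Hire", "Term", "Hire", "Transfer", "Term", "Term"],
    "Start_Date": pd.to_datetime(["2014-01-01", "2014-02-01", "2015-01-01",
                                  "2015-02-01", "2015-03-01", "2016-01-01"]),
    "End_Date": pd.to_datetime(["2014-01-31", "2014-03-15", "2015-01-31",
                                "2015-02-28", "2015-05-01", "2016-09-01"]),
})

# One row on the y-axis per person, in order of first appearance.
row_of = {name: i for i, name in enumerate(df["Name"].unique())}
df["y"] = df["Name"].map(row_of)

# Horizontal centre of each bar: a candidate anchor for the event label.
df["label_x"] = df["Start_Date"] + (df["End_Date"] - df["Start_Date"]) / 2

for row in df.itertuples():
    print(row.Name, row.Event, row.y, row.label_x.date())
```

In the plotting code, a call such as ax.text(row.label_x, row.y + bar_width / 2, row.Event, ha='center', va='bottom') for each row should place the label just above its bar. Short events may produce overlapping labels, so treat this as a starting point.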
I am trying to convert a line graph to a bar graph using Python pandas.
Here is my code which gives perfect line graph as per my requirement.
conn = sqlite3.connect('Demo.db')
collection = ['ABC','PQR']
df = pd.read_sql("SELECT * FROM Table where ...", conn)
df['DateTime'] = df['Timestamp'].apply(lambda x: dt.datetime.fromtimestamp(x))
df.groupby('Type').plot(x='DateTime', y='Value',linewidth=2)
plt.legend(collection)
plt.show()
Here is my DataFrame df
http://postimg.org/image/75uy0dntf/
Here is my Line graph output from above code.
http://postimg.org/image/vc5lbi9xv/
I want to draw a bar graph instead of a line graph. I want the month name on the x-axis and the value on the y-axis. I want a colorful bar graph.
Attempt made
df.plot(x='DateTime', y='Value',linewidth=2, kind='bar')
plt.show()
It gives an improper bar graph with date and time (instead of month and year) on the x-axis. Thank you for your help.
Here is some code that might do what you want.
In this code, I first sort your data frame by time. This step is important, because I use the indices of the sorted data frame as the abscissa of your plots, instead of the timestamp. Then, I group the data frame by type and manually plot each group at the right position (using the sorted index). Finally, I redefine the ticks and the tick labels to display the date in a given format (in this case, I chose MM/YYYY, but that can be changed).
import datetime
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
types = ['ABC','BCD','PQR']*3
vals = [126,1587,141,10546,1733,173,107,780,88]
ts = [1414814371, 1414814371, 1406865621, 1422766793, 1422766793, 1425574861, 1396324799, 1396324799, 1401595199]
aset = zip(types, vals, ts)
df = pd.DataFrame(data=aset, columns=['Type', 'Value', 'Timestamp'])
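# DataFrame.sort was removed from pandas; sort_values replaces it.
# Resetting the index makes the index equal to the sorted position used for the bars.
df = df.sort_values(['Timestamp', 'Type']).reset_index(drop=True)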
df['Date'] = df['Timestamp'].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%m/%Y'))
groups = df.groupby('Type')
ngroups = len(groups)
colors = ['r', 'g', 'b']
fig = plt.figure()
ax = fig.add_subplot(111, position=[0.15, 0.15, 0.8, 0.8])
offset = 0.1
width = 1-2*offset
#
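for j, (name, group) in enumerate(groups):
    x = group.index + offset
    y = group.Value
    # align='edge' keeps each bar between x and x + width, matching the tick positions below
    ax.bar(x, y, width=width, align='edge', color=colors[j], label=name)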
xmin, xmax = min(df.index), max(df.index)+1
ax.set_xlim([xmin, xmax])
ax.tick_params(axis='x', which='both', top=False, bottom=False)
plt.xticks(np.arange(xmin, xmax)+0.5, list(df['Date']), rotation=90)
ax.legend()
plt.show()
I hope this works for you. This is the output that I get, given my subset of your database.
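If one group of bars per month is enough, a shorter route is to let pandas pivot the data and draw a grouped bar chart. This is a sketch of an alternative technique, not the approach used above, and it interprets the epoch seconds as UTC (an assumption, whereas fromtimestamp uses local time).

```python
import pandas as pd

types = ["ABC", "BCD", "PQR"] * 3
vals = [126, 1587, 141, 10546, 1733, 173, 107, 780, 88]
ts = [1414814371, 1414814371, 1406865621, 1422766793, 1422766793,
      1425574861, 1396324799, 1396324799, 1401595199]
df = pd.DataFrame({"Type": types, "Value": vals, "Timestamp": ts})

# Interpret the epoch seconds as UTC and keep only year and month as a sortable label.
df["Month"] = pd.to_datetime(df["Timestamp"], unit="s").dt.strftime("%Y-%m")

# One row per month, one column per type; missing combinations become 0.
wide = df.pivot_table(index="Month", columns="Type", values="Value",
                      aggfunc="sum", fill_value=0)

print(wide)
```

Calling wide.plot(kind='bar') on the result draws one colored bar per type for each month, with the month labels on the x-axis.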