visualize two columns in the same data set - python

I am trying to group and sort four columns, count the values, and chart them in the same bar graph to see how the counts have changed over time.
Year Month Bl_year Bl_month
2018 Jan 2019 Jan
2018 Feb 2018 Mar
2018 Dec 2020 Dec
2019 Apr 2019 Sep
2020 Nov 2020 Dec
2019 Sep 2018 Jan
I tried to group and sort first, counting values by year and then by month.
df_Activity_count = df.sort_values(['year','month'],ascending = True).groupby('month')
df_Activity_count_BL = df.sort_values(['BL year','BL month'],ascending = True).groupby('BL month')
Now I am trying to compare these two in the same bar chart. Can someone please help?

Try to pass ax to your plot command:
df_Activity_count = df.sort_values(['year','month'],ascending = True).groupby('month')
df_Activity_count_BL = df.sort_values(['BL year','BL month'],ascending = True).groupby('BL month')
ax = df_Activity_count['year'].value_counts().unstack(0).plot.bar()
df_Activity_count_BL['BL year'].value_counts().unstack(0).plot.bar(ax=ax)

Since you tagged matplotlib, I will chip in a solution using pyplot:
import matplotlib.pyplot as plt
# Create an axis object
fig, ax = plt.subplots()
# Define dataframes
df_Activity_count = df.sort_values(['year','month'],ascending = True).groupby('month')
df_Activity_count_BL = df.sort_values(['BL year','BL month'],ascending = True).groupby('BL month')
# Plot using the axis object ax defined above
df_Activity_count['year'].value_counts().unstack(0).plot.bar(ax=ax)
df_Activity_count_BL['BL year'].value_counts().unstack(0).plot.bar(ax=ax)
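
One caveat with drawing two bar plots on the same axes is that both sets of bars are likely to land on the same x positions and cover each other. A minimal sketch of an alternative, assuming the columns are named as in the code above ('year', 'BL year') and using the six sample rows from the question: count each column, align the counts in one frame, and let pandas draw grouped bars. The name `counts` is only illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows from the question; column names are assumed to match the code above.
df = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2020, 2019],
    "month": ["Jan", "Feb", "Dec", "Apr", "Nov", "Sep"],
    "BL year": [2019, 2018, 2020, 2019, 2020, 2018],
    "BL month": ["Jan", "Mar", "Dec", "Sep", "Dec", "Jan"],
})

# Count rows per year for each column; building one frame aligns both on the year.
counts = pd.DataFrame({
    "year": df["year"].value_counts(),
    "BL year": df["BL year"].value_counts(),
}).fillna(0).astype(int).sort_index()

# One bar group per year, one bar per column, drawn side by side.
fig, ax = plt.subplots()
counts.plot.bar(ax=ax)
ax.set_xlabel("year")
ax.set_ylabel("count")
```

Call plt.show() afterwards to display the figure. The same idea should carry over to monthly counts by swapping in the month columns.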


How to plot different dataframe data in one figure?

I need some guidance to plot:
scatter plot of df1 data: time vs. y, using column z for the hue
line plot df2 data: time vs. y
a single line at y=c (c is a constant)
y data in df1 and df2 are different but they are in the same range.
I do not know where to begin. Any guidance is appreciated.
More explanation. A portion of data is presented here. I want to plot:
scatter plot of time vs CO2
finding the yearly rolling average of CO2 (from 01/01/2016 to 09/30/2019, based on hourly data, so the first average will be from "01/01/2016 00" to "12/31/2016 23" and the second from "01/01/2016 01" to "01/01/2017 00"), like the trend in the plot below
finding the maximum of all the data and drawing a line across the plot (like the straight line below)
Sample data
data = {'Date':['0 01/14/2016 00', '01/14/2016 01','01/14/2016 02','01/14/2016 03','01/14/2016 04','01/14/2016 05','01/14/2016 06','01/14/2016 07','01/14/2016 08','01/14/2016 09','01/14/2016 10','01/14/2016 11','01/14/2016 12','01/14/2016 13','01/14/2016 14','01/14/2016 15','01/14/2016 16','01/14/2016 17','01/14/2016 18','01/14/2016 19'],
'CO2':[2415.9,2416.5,2429.8,2421.5,2422.2,2428.3,2389.1,2343.2,2444.,2424.8,2429.6,2414.7,2434.9,2420.6,2420.5,2397.1,2415.6,2417.4,2373.2,2367.9],
'Year':[2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016]}
# Create DataFrame
df = pd.DataFrame(data)
# DataFrame view
Date CO2 Year
0 01/14/2016 00 2415.9 2016
01/14/2016 01 2416.5 2016
01/14/2016 02 2429.8 2016
01/14/2016 03 2421.5 2016
01/14/2016 04 2422.2 2016
Using matplotlib.pyplot:
Use plt.hlines to add a horizontal line at a constant value
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# with synthetic data
np.random.seed(365)
data = {'CO2': [np.random.randint(2000, 2500) for _ in range(783)],
'Date': pd.bdate_range(start='1/1/2016', end='1/1/2019').tolist()}
# create the dataframe:
df = pd.DataFrame(data)
# verify Date is in datetime format
df['Date'] = pd.to_datetime(df['Date'])
# set Date as index so .rolling can be used
df.set_index('Date', inplace=True)
# add rolling mean
df['rolling'] = df['CO2'].rolling('365D').mean()
# plot the data
plt.figure(figsize=(8, 8))
plt.scatter(x=df.index, y='CO2', data=df, label='data')
plt.plot(df.index, 'rolling', data=df, color='black', label='365 day rolling mean')
plt.hlines(max(df['CO2']), xmin=min(df.index), xmax=max(df.index), color='red', linestyles='dashed', label='Max')
plt.hlines(np.mean(df['CO2']), xmin=min(df.index), xmax=max(df.index), color='green', linestyles='dashed', label='Mean')
plt.xticks(rotation=45)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
(Figure: plot produced from the synthetic data)
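
To see what the time-based window in .rolling('365D') does, here is a small sketch with a 3-day window on five made-up daily values (the numbers are chosen only so the averages are easy to check). Each window covers the period ending at the current timestamp, so the earliest rows are averaged over fewer points.

```python
import pandas as pd

# Five daily readings; the values are illustrative, not from the question.
idx = pd.date_range("2016-01-01", periods=5, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# A '3D' window covers the 3 days ending at (and including) each timestamp.
rolled = s.rolling("3D").mean()
print(rolled.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

With a 365D window this means the values in the first year are averages of a partial window. If only full-year averages are wanted, one option is to mask or drop the rows before the first full year has elapsed.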
Issues with the Date format in the data from the op:
Use a regular expression to fix the Date column
Place the code to fix Date, just before df['Date'] = pd.to_datetime(df['Date'])
import re
# your data
Date CO2 Year
0 01/14/2016 00 2415.9 2016
01/14/2016 01 2416.5 2016
01/14/2016 02 2429.8 2016
01/14/2016 03 2421.5 2016
01/14/2016 04 2422.2 2016
df['Date'] = df['Date'].apply(lambda x: (re.findall(r'\d{2}/\d{2}/\d{4}', x)[0]))
# fixed Date column
Date CO2 Year
01/14/2016 2415.9 2016
01/14/2016 2416.5 2016
01/14/2016 2429.8 2016
01/14/2016 2421.5 2016
01/14/2016 2422.2 2016
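
Note that the regular expression above keeps only the date part, so the hour is lost. If the hourly resolution matters (as it does for an hourly rolling average), a possible variation is to extract the date together with the two-digit hour and parse both. This is a sketch on three of the question's values, and it assumes the stray leading "0 " in the first entry is a leftover index.

```python
import pandas as pd

# First three Date values from the question, including the stray leading "0 ".
raw = pd.Series(["0 01/14/2016 00", "01/14/2016 01", "01/14/2016 02"])

# Keep "MM/DD/YYYY HH" and drop anything before it.
cleaned = raw.str.extract(r"(\d{2}/\d{2}/\d{4} \d{2})", expand=False)

# Parse with an explicit format so the trailing number is read as the hour.
parsed = pd.to_datetime(cleaned, format="%m/%d/%Y %H")
print(parsed.tolist())
```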
You can use a dual-axis chart. It should look the same as yours because both axes will have the same scale. You can plot directly from the pandas data frames:
import matplotlib.pyplot as plt
import pandas as pd
# create a color map for the z column
color_map = {'z_val1':'red', 'z_val2':'blue', 'z_val3':'green', 'z_val4':'yellow'}
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax2 = ax1.twinx() #second axis within the first
# define scatter plot
df1.plot.scatter(x = 'date',
                 y = 'CO2',
                 ax = ax1,
                 c = df1['z'].apply(lambda x: color_map[x]))
# define line plot
df2.plot.line(x = 'date',
              y = 'MA_CO2', # moving average in dataframe 2
              ax = ax2)
# plot the horizontal line at y = c (constant value)
ax1.axhline(y = c, color='r', linestyle='-')
# to fit the chart properly
plt.tight_layout()

time on xaxis in plotly

I have my x-axis values in this format : ['May 23 2018 06:31:52 GMT', 'May 23 2018 06:32:02 GMT', 'May 23 2018 06:32:12 GMT', 'May 23 2018 06:32:22 GMT', 'May 23 2018 06:32:32 GMT']
and corresponding values for the y-axis which are some numbers.
But when I plot these using plotly, the x-axis shows only part of the date (May 23 2018) for each point. The time for each point is not shown.
I also tried setting tickformat in the layout, but it does not seem to work.
layout = go.Layout(
    title=field+ "_its diff_value chart",
    xaxis = dict(
        tickformat = '%b %d %Y %H:%M:%S'
    )
)
Any help is appreciated.
This is the screenshot of the graph made.
Try converting your x-values to datetime objects, then tell plotly to use a fixed tick distance:
import random
import datetime
import plotly
plotly.offline.init_notebook_mode()
x = [datetime.datetime.now()]
for d in range(100):
    x.append(x[0] + datetime.timedelta(d))
y = [random.random() for _ in x]
scatter = plotly.graph_objs.Scatter(x=x, y=y)
layout = plotly.graph_objs.Layout(xaxis={'type': 'date',
                                         'tick0': x[0],
                                         'tickmode': 'linear',
                                         'dtick': 86400000.0 * 14}) # 14 days
fig = plotly.graph_objs.Figure(data=[scatter], layout=layout)
plotly.offline.iplot(fig)
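
For the strings in the question, the conversion step might look like the sketch below. It uses only the standard library, assumes every value ends in the literal text "GMT", and assumes an English locale for the month abbreviation. The `xaxis` dict is a hypothetical layout fragment; dtick on a date axis is given in milliseconds, so 10 seconds is 10000.

```python
from datetime import datetime

# x-axis values as given in the question.
raw = [
    "May 23 2018 06:31:52 GMT",
    "May 23 2018 06:32:02 GMT",
    "May 23 2018 06:32:12 GMT",
    "May 23 2018 06:32:22 GMT",
    "May 23 2018 06:32:32 GMT",
]

# "GMT" is matched as literal text; the results are naive datetimes.
FMT = "%b %d %Y %H:%M:%S GMT"
x = [datetime.strptime(s, FMT) for s in raw]

# Hypothetical axis settings: one tick every 10 seconds, labelled with the time.
TEN_SECONDS_MS = 10 * 1000
xaxis = {
    "type": "date",
    "tick0": x[0],
    "tickmode": "linear",
    "dtick": TEN_SECONDS_MS,
    "tickformat": "%b %d %Y %H:%M:%S",
}
```

The list `x` can then replace the string values in the Scatter trace, and `xaxis` can be passed as the xaxis argument of the Layout.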
To skip inconsistent time series, add this before plotting the plotly chart
fig.update_xaxes(
    rangebreaks=[
        dict(bounds=['2018-05-23 06:31:52','2018-05-23 06:32:02']),
        dict(bounds=['2018-05-23 06:32:02','2018-05-23 06:32:12']),
        dict(bounds=['2018-05-23 06:32:12','2018-05-23 06:32:22']),
        dict(bounds=['2018-05-23 06:32:22','2018-05-23 06:32:32'])
    ]
)

Add months to xaxis and legend on a matplotlib line plot

I am trying to plot stacked yearly line graphs by months.
I have a dataframe df_year as below:
Day Number of Bicycle Hires
2010-07-30 6897
2010-07-31 5564
2010-08-01 4303
2010-08-02 6642
2010-08-03 7966
with the index set to the date going from 2010 July to 2017 July
I want to plot a line graph for each year, with the x-axis showing the months from Jan to Dec and only the total per month plotted.
I have achieved this by converting the dataframe to a pivot table as below:
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
This creates the pivot table below, which I can plot as shown in the attached figure:
Number of Bicycle Hires 2010 2011 2012 2013 2014
1 NaN 403178.0 494325.0 565589.0 493870.0
2 NaN 398292.0 481826.0 516588.0 522940.0
3 NaN 556155.0 818209.0 504611.0 757864.0
4 NaN 673639.0 649473.0 658230.0 805571.0
5 NaN 722072.0 926952.0 749934.0 890709.0
(Figure: plot showing yearly data with months on the x-axis)
The only problem is that the months show up as integers and I would like them to be shown as Jan, Feb .... Dec with each line representing one year. And I am unable to add a legend for each year.
I have tried the following code to achieve this:
dims = (15,5)
fig, ax = plt.subplots(figsize=dims)
ax.plot(pt)
months = MonthLocator(range(1, 13), bymonthday=1, interval=1)
monthsFmt = DateFormatter("%b '%y")
ax.xaxis.set_major_locator(months) #adding this makes the month ints disappear
ax.xaxis.set_major_formatter(monthsFmt)
handles, labels = ax.get_legend_handles_labels() #legend is nowhere on the plot
ax.legend(handles, labels)
Please can anyone help me out with this, what am I doing incorrectly here?
Thanks!
There is nothing in your legend handles and labels; furthermore, the DateFormatter is not returning the right values because the values you are translating are not datetime objects.
You could set the index specifically for the dates, then drop the multiindex column level which is created by the pivot (the '0') and then use explicit ticklabels for the months whilst setting where they need to occur on your x-axis. As follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import datetime
# dummy data (Days)
dates_d = pd.date_range('2010-01-01', '2017-12-31', freq='D')
df_year = pd.DataFrame(np.random.randint(100, 200, (dates_d.shape[0], 1)), columns=['Data'])
df_year.index = dates_d #set index
pt = pd.pivot_table(df_year, index=df_year.index.month, columns=df_year.index.year, aggfunc='sum')
pt.columns = pt.columns.droplevel() # remove the double header (0) as pivot creates a multiindex.
ax = plt.figure().add_subplot(111)
ax.plot(pt)
ticklabels = [datetime.date(1900, item, 1).strftime('%b') for item in pt.index]
ax.set_xticks(np.arange(1,13))
ax.set_xticklabels(ticklabels) #add monthlabels to the xaxis
ax.legend(pt.columns.tolist(), loc='center left', bbox_to_anchor=(1, .5)) #add the column names as legend.
plt.tight_layout(rect=[0, 0, 0.85, 1])
plt.show()
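
A variation on the same idea, as a sketch rather than a drop-in replacement: build the month-by-year table with groupby, swap the month numbers for abbreviations from the standard calendar module, and give each line a label so the legend fills itself in. The dummy data and the names `hires` and `monthly` are illustrative only.

```python
import calendar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Dummy daily data standing in for df_year (values are random, not real hires).
days = pd.date_range("2010-07-30", "2012-12-31", freq="D")
hires = pd.Series(np.random.default_rng(0).integers(100, 200, len(days)), index=days)

# Monthly totals: rows are months 1..12, columns are years.
monthly = hires.groupby([hires.index.month, hires.index.year]).sum().unstack()

# Replace month numbers with 'Jan' .. 'Dec'.
monthly.index = [calendar.month_abbr[m] for m in monthly.index]

# One labelled line per year; string x values become the tick labels.
fig, ax = plt.subplots(figsize=(15, 5))
for year in monthly.columns:
    ax.plot(monthly.index, monthly[year], label=str(year))
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
```

Months with no data (January to June 2010 here) come out as NaN and are simply not drawn for that year.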

pandas day of week axis labels

I am plotting a pandas series that spans one week. My code:
rng = pd.date_range('1/6/2014',periods=169,freq='H')
graph = pd.Series(shared_index, index=rng[:168])
graph.plot(shared_index)
Which displays 7 x-axis labels:
[06 Jan 2014, 07, 08, 09, 10, 11, 12]
But I want:
[Mon, Tue, Wed, Thu, Fri, Sat, Sun]
What do I specify in code to change axis labels?
Thanks!
Perhaps you can manually fix the tick labels:
rng = pd.date_range('1/6/2014',periods=169,freq='H')
graph = pd.Series(np.random.randn(168), index=rng[:168])
ax = graph.plot()
weekday_map= {0:'MON', 1:'TUE', 2:'WED', 3:'THU',
4:'FRI', 5:'SAT', 6:'SUN'}
xs = sorted(ax.get_xticks(minor='both'))
wd = graph.index[xs - xs[0]].map(pd.Timestamp.weekday)
ax.set_xticks(xs)
ax.set_xticks([], minor=True)
ax.set_xticklabels([weekday_map[d] for d in wd])
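
Another option, sketched here under the assumption that plotting through matplotlib directly (rather than Series.plot) is acceptable, is to let matplotlib's date locator and formatter produce the weekday names. The random data is a stand-in for shared_index.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# One week of hourly timestamps starting on Monday, 6 Jan 2014.
rng = pd.date_range("2014-01-06", periods=168, freq=pd.Timedelta(hours=1))
graph = pd.Series(np.random.default_rng(0).standard_normal(168), index=rng)

fig, ax = plt.subplots()
ax.plot(graph.index, graph.values)

# One major tick per day, labelled with the abbreviated weekday name.
ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%a"))
```

The "%a" names follow the current locale, so they read Mon, Tue, ... only in an English locale.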

Python drawing cumulative plot (matplotlib)

I have not used matplotlib, but it looks like it is the main library for drawing plots. I want to draw a CPU usage plot. I have background processes that each minute make a record (date, min_load, avg_load, max_load). date could be a timestamp or a nicely formatted date.
I want to draw a diagram which shows min_load, avg_load and max_load on the same plot. On the X axis I would like to put minutes, hours, days or weeks, depending on how much data there is.
There may be gaps. Let's say the monitored process crashes; because no one restarts it, there might be gaps of several hours.
Example of how I imagine it: http://img714.imageshack.us/img714/2074/infoplot1.png
This does not illustrate gaps, but in that situation the readings go to 0.
I am playing with matplotlib right now and will try to share my results too. This is how the data might look:
1254152292;0.07;0.08;0.13
1254152352;0.04;0.05;0.10
1254152412;0.09;0.10;0.17
1254152472;0.28;0.29;0.30
1254152532;0.20;0.20;0.21
1254152592;0.09;0.12;0.15
1254152652;0.09;0.12;0.14
1254152923;0.13;0.12;0.30
1254152983;0.13;0.25;0.32
Or it could look something like this:
Wed Oct 06 08:03:55 CEST 2010;0.25;0.30;0.35
Wed Oct 06 08:03:56 CEST 2010;0.00;0.01;0.02
Wed Oct 06 08:03:57 CEST 2010;0.00;0.01;0.02
Wed Oct 06 08:03:58 CEST 2010;0.00;0.01;0.02
Wed Oct 06 08:03:59 CEST 2010;0.00;0.01;0.02
Wed Oct 06 08:04:00 CEST 2010;0.00;0.01;0.02
Wed Oct 06 08:04:01 CEST 2010;0.25;0.50;0.75
Wed Oct 06 08:04:02 CEST 2010;0.00;0.01;0.02
-david
Try:
from matplotlib.dates import strpdate2num, epoch2num
import numpy as np
from pylab import figure, show, cm
datefmt = "%a %b %d %H:%M:%S CEST %Y"
datafile = "cpu.dat"
def parsedate(x):
    global datefmt
    try:
        res = epoch2num( int(x) )
    except:
        try:
            res = strpdate2num(datefmt)(x)
        except:
            print("Cannot parse date ('"+x+"')")
            exit(1)
    return res
# parse data file
t,a,b,c = np.loadtxt(
    datafile, delimiter=';',
    converters={0:parsedate},
    unpack=True)
fig = figure()
ax = fig.add_axes((0.1,0.1,0.7,0.85))
# start y axis at 0
ax.set_ylim(0)
# colors
colors=['b','g','r']
fill=[(0.5,0.5,1), (0.5,1,0.5), (1,0.5,0.5)]
# plot
for x in [c,b,a]:
    ax.plot_date(t, x, '-', lw=2, color=colors.pop())
    ax.fill_between(t, x, color=fill.pop())
# legend
ax.legend(['max','avg','min'], loc=(1.03,0.4), frameon=False)
fig.autofmt_xdate()
show()
This parses the lines from the "cpu.dat" file. The date is parsed by the parsedate function.
Matplotlib should find the best format for the x axis.
Edit: Added legend and fill_between (maybe there is a better way to do this).
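
The helpers strpdate2num and epoch2num appear to have been removed from recent Matplotlib releases, so the date parsing may need to be done differently there. A minimal standard-library sketch follows, assuming the two input formats shown in the question and that "CEST" always means UTC+2. The function names and the 2-minute gap threshold are illustrative choices, not part of the answer above.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2))  # assumption: "CEST" always means UTC+2
DATEFMT = "%a %b %d %H:%M:%S CEST %Y"

def parse_date(text):
    """Return an aware datetime from an epoch string or a 'CEST' date string."""
    text = text.strip()
    try:
        return datetime.fromtimestamp(int(text), tz=timezone.utc)
    except ValueError:
        return datetime.strptime(text, DATEFMT).replace(tzinfo=CEST)

def parse_line(line):
    """Split 'date;min;avg;max' into (datetime, min, avg, max)."""
    date, lo, avg, hi = line.strip().split(";")
    return parse_date(date), float(lo), float(avg), float(hi)

def find_gaps(times, max_step=timedelta(minutes=2)):
    """Return (start, end) pairs where consecutive samples are too far apart."""
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > max_step]

# Four records from the question's first sample.
sample = [
    "1254152592;0.09;0.12;0.15",
    "1254152652;0.09;0.12;0.14",
    "1254152923;0.13;0.12;0.30",
    "1254152983;0.13;0.25;0.32",
]
rows = [parse_line(line) for line in sample]
times = [row[0] for row in rows]
gaps = find_gaps(times)
```

Lists of datetime objects can be passed straight to ax.plot and ax.fill_between. One way to show a gap would be to insert a row of NaN values at each detected gap so the lines break there.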
