Renaming X-Axis Labels when using Matplotlib and Pandas - python

I'm running my code in an IPython Notebook on a MacBook Pro (OS X Yosemite 10.10.4).
I have a CSV file that I am reading with Python in order to produce charts. The problem I am facing is renaming the x-axis labels.
Essentially, the chart is trying to plot a count of different types of Audit Violations, but has really long descriptions of the said violations. For example:
Not approved by regional committee.......another 300 words - 17
No contract with vendor.......another 300 words - 14
Vendor Registration not on record.......another 300 words - 9
Instead of having these verbose reasons though, I would like to rename the X-Axis labels to just numbers or alphabets so that the graph reads somewhat like this:
A - 17
B - 14
C - 9
This is the code I have used, and except for the label names, I am happy with the result.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
plt.rcParams['figure.figsize'] = (15, 5)
fixed_data = pd.read_csv('audit-rep.csv', sep=',', encoding='latin1', index_col='Index', parse_dates=['Report Date'], dayfirst=False)
viol_counts = fixed_data['Audit Flag'].value_counts()
viol_counts[:10]
viol_counts[:10].plot(kind='bar')
I have tried to rename the x-axis labels using the code below.
viol_counts.set_ylabel('No. of Violations')
viol_counts.set_title('Audit Results')
viol_counts.set_xticks(ind+width)
viol_counts.set_xticklabels( ('A', 'B','C') )
This is the error I get when using the above code.
AttributeError: 'Series' object has no attribute 'set_ylabel'
I have come across a few other posts related to this issue, but not seen one that specifically addresses the renaming of individual labels.
This isn't critical, though: I'm just trying to learn Python, and the actual work has already been done in Excel.

If you assign the Axes object returned by plot to a variable (here I've called it viol_plot), you can then call methods on that Axes object. You are currently trying to set the labels and ticks on the Series, not on the plot:
viol_plot = viol_counts[:10].plot(kind='bar')
viol_plot.set_ylabel('No. of Violations')
viol_plot.set_title('Audit Results')
viol_plot.set_xticks(ind+width)
viol_plot.set_xticklabels( ('A', 'B','C') )
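As a minimal sketch of the same idea with one letter per bar: the Series below is made-up stand-in data for viol_counts (the real CSV is not available), and the names letters and letter_lookup are only illustrative. Because pandas already places one tick per bar, this version only replaces the label text and does not call set_xticks.

```python
import string

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Made-up stand-in for viol_counts: long descriptions mapped to counts.
viol_counts = pd.Series({
    "Not approved by regional committee ...": 17,
    "No contract with vendor ...": 14,
    "Vendor Registration not on record ...": 9,
})

top = viol_counts.head(10)
# One short label per bar: 'A', 'B', 'C', ...
letters = list(string.ascii_uppercase[:len(top)])

ax = top.plot(kind="bar")
ax.set_ylabel("No. of Violations")
ax.set_title("Audit Results")
# pandas already puts one tick under each bar, so only the text is replaced.
ax.set_xticklabels(letters, rotation=0)

# Keep a lookup so each letter can be traced back to its full description.
letter_lookup = dict(zip(letters, top.index))
```

The lookup dictionary can be printed next to the chart so that readers can still find out what each letter stands for.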

Methods such as set_ylabel and set_title belong to the Axes (plot) object, not to the Series.
One option is to create the Axes yourself with subplots and pass it to plot:
figure, ax = pl.subplots(1,1)
#note the ax option.
viol_counts[:10].plot(kind='bar', ax=ax)
ax.set_ylabel('No. of Violations')
ax.set_title('Audit Results')
ax.set_xticks(ind+width)
ax.set_xticklabels( ('A', 'B','C') )
Or you can just plot and then use the module-level functions instead. One downside is that I don't know how to set the x tick labels in this case:
viol_counts[:10].plot(kind='bar')
#note names of functions are slightly different
pl.ylabel('No. of Violations')
pl.title('Audit Results')
pl.xticks(ind+width)
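For what it's worth, pyplot's xticks function accepts the tick positions as its first argument and the replacement labels as its second, so the labels can be set in this style too. The sketch below assumes three bars and uses made-up data in place of the real viol_counts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for viol_counts.
viol_counts = pd.Series(
    [17, 14, 9],
    index=["long description 1", "long description 2", "long description 3"],
)

viol_counts.plot(kind="bar")
plt.ylabel("No. of Violations")
plt.title("Audit Results")
# xticks(ticks, labels): bar positions first, then the text to show at each one.
plt.xticks(range(len(viol_counts)), ["A", "B", "C"], rotation=0)
```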

Related

How do you get dates on the start on the specified month? (matplotlib)

# FEB
# configuring the figure and plot space
fig, lx = plt.subplots(figsize=(30,10))
# converting the Series to float so the data can be plotted
wd = df2['Unnamed: 1']
wd = wd.astype(float)
# adding the x and y axes' values
lx.plot(list(df2.index.values), wd)
# defining what the labels will be
lx.set(xlabel='Day', ylabel='Weight', title='Daily Weight February 2022')
# defining the date format
date_format = DateFormatter('%m-%d')
lx.xaxis.set_major_formatter(date_format)
lx.xaxis.set_minor_locator(mdates.WeekdayLocator(interval=1))
Values I would like the x-axis to have:
['2/4', '2/5', '2/6', '2/7', '2/8', '2/9', '2/10', '2/11', '2/12', '2/13', '2/14', '2/15', '2/16', '2/17', '2/18', '2/19', '2/20', '2/21', '2/22', '2/23', '2/24', '2/25', '2/26', '2/27']
Values on the x-axis: [image not included]
It is giving me the right number of values, just not the right labels. I have tried to specify the start and end with xlim=['2/4', '2/27'], however that did not seem to work.
It would be great to see what your df2 actually looks like, but from your code snippet, it looks like it has weights recorded but not the corresponding dates.
How about preparing a DataFrame that has the dates in it?
(Also, since this question is tagged with seaborn too, I'm going to use Seaborn, but the same idea should work in plain matplotlib.)
import pandas as pd
import seaborn as sns
import seaborn.objects as so
from matplotlib.dates import DateFormatter
sns.set_theme()
Create an index with the dates starting from 4 Feb, with as many periods as there are recorded weights (this assumes the weight column of df2 is named Weight).
index = pd.date_range(start="2/4/2022", periods=df2.count().Weight, name="Date")
Then with Seaborn's object interface (v0.12+), we can do:
(
so.Plot(df2.set_index(index), x="Date", y="Weight")
.add(so.Line())
.scale(x=so.Temporal().label(formatter=DateFormatter('%m-%d')))
.label(title="Daily Weight February 2022")
)
I have solved this problem. Very simple. I just added mdates.WeekdayLocator() through set_major_locator (set_major_formatter expects a formatter, not a locator). I overlooked this when I was going through the matplotlib docs. But happy to have found this solution.
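For reference, here is a minimal matplotlib-only sketch of the locator plus formatter combination. The weights are made up, and a daily DayLocator is used instead of a WeekdayLocator on the assumption that one tick per recorded day is wanted, as in the list of desired labels in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 24 daily weights starting on 4 Feb 2022.
dates = pd.date_range(start="2022-02-04", periods=24, freq="D")
weights = pd.Series(np.linspace(180.0, 176.0, num=24), index=dates)

fig, ax = plt.subplots(figsize=(12, 4))
# Plotting against real datetimes lets the date locators and formatters work.
ax.plot(weights.index.to_numpy(), weights.to_numpy())
ax.set(xlabel="Day", ylabel="Weight", title="Daily Weight February 2022")

# The locator decides where ticks go; the formatter decides how they read.
ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))
fig.autofmt_xdate()  # rotate the labels so they do not overlap
```

The key point is that the x values must be actual dates. If the weights are plotted against a plain integer index, the date formatter interprets those integers as day numbers and the labels come out wrong.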

How to change seaborn violinplot legend labels?

I'm using seaborn to make a violinplot, which uses hues to identify who survived and who didn't. This is given by the column 'DEATH_EVENT', where 0 means the person survived and 1 means they didn't. The only issue I'm having is that I can't figure out how to set labels for this hue legend. As seen below, 'DEATH_EVENT' presents 0 and 1, but I want to change this into 'Survived' and 'Not survived'.
Current code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
sns.set()
plt.style.use('seaborn')
data = pd.read_csv('heart_failure_clinical_records_dataset.csv')
g = sns.violinplot(data=data, x='smoking', y='age', hue='DEATH_EVENT')
g.set_xticklabels(['No smoking', 'Smoking'])
I tried to use g.legend(labels=['Survived', 'Not survived']), but the resulting legend loses the colors and shows a thin line and a thick line instead.
I'm aware I could just use:
data['DEATH_EVENT'].replace({0:'Survived', 1:'Not survived'}, inplace=True)
but I wanted to see if there was another way. I'm still a rookie, so I'm guessing that there's a reason why the CSV's author made it so that it uses integers to describe plenty of things. Ex: if someone smokes or not, sex, diabetic or not, etc. Maybe it runs faster?
Controlling Seaborn legends is still somewhat tricky (some extensions to matplotlib's API would be helpful). In this case, you could grab the handles from the just-created legend and reuse them for a new legend:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.DataFrame({"smoking": np.random.randint(0, 2, 200),
"survived": np.random.randint(0, 2, 200),
"age": np.random.normal(60, 10, 200),
"DEATH_EVENT": np.random.randint(0, 2, 200)})
ax = sns.violinplot(data=data, x='smoking', y='age', hue='DEATH_EVENT')
ax.set_xticklabels(['No smoking', 'Smoking'])
# legendHandles was renamed to legend_handles in Matplotlib 3.7; use that name on newer versions
ax.legend(handles=ax.legend_.legendHandles, labels=['Survived', 'Not survived'])
Here is an approach that makes the change via the dataframe without modifying the original dataframe. To avoid accessing ax.legend_ altogether (to remove the legend title), a trick is to rename the column to a blank string (and use that blank string for hue). If the dataframe isn't super long (i.e. it doesn't have millions of rows), the speed and memory overhead are quite modest.
names = {0: 'Survived', 1: 'Not survived'}
ax = sns.violinplot(data=data.replace({'DEATH_EVENT': names}).rename(columns={'DEATH_EVENT': ''}),
x='smoking', y='age', hue='')

Integrating over range of dates, and labeling the xaxis

I am trying to integrate 2 curves as they change through time using pandas. I am loading data from a CSV file (its contents are given in text form below).
The dates are the x-axis and both the oil and water points are the y-axis. I have learned to use the cross-section option to isolate the "NAME" values, but am having trouble finding a good way to integrate with dates as the x-axis. I eventually would like to be able to take the integrals of both curves and stack them against each other. I am also having trouble with the plot defaulting the x-ticks to arbitrary values, instead of the dates.
I can change the labels/ticks manually, but have a large CSV to process and would like to automate the process. Any help would be greatly appreciated.
NAME,DATE,O,W
A,1/20/2000,12,50
B,1/20/2000,25,28
C,1/20/2000,14,15
A,1/21/2000,34,50
B,1/21/2000,8,3
C,1/21/2000,10,19
A,1/22/2000,47,35
B,1/22/2000,4,27
C,1/22/2000,46,1
A,1/23/2000,19,31
B,1/23/2000,18,10
C,1/23/2000,19,41
Contents of CSV in text form above.
Further to my comment above, here is some sample code (using logic from the example mentioned) to label your xaxis with formatted dates. Hope this helps.
Data Collection / Imports:
Just re-creating your dataset for the example.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
header = ['NAME', 'DATE', 'O', 'W']
data = [['A','1/20/2000',12,50],
['B','1/20/2000',25,28],
['C','1/20/2000',14,15],
['A','1/21/2000',34,50],
['B','1/21/2000',8,3],
['C','1/21/2000',10,19],
['A','1/22/2000',47,35],
['B','1/22/2000',4,27],
['C','1/22/2000',46,1],
['A','1/23/2000',19,31],
['B','1/23/2000',18,10],
['C','1/23/2000',19,41]]
df = pd.DataFrame(data, columns=header)
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
# Subset to just the 'A' labels.
df_a = df[df['NAME'] == 'A']
Plotting:
# Define the number of ticks you need.
nticks = 4
# Define the date format.
mask = '%m-%d-%Y'
# Create the set of custom date labels.
step = int(df_a.shape[0] / nticks)
xdata = np.arange(df_a.shape[0])
xlabels = df_a['DATE'].dt.strftime(mask).tolist()[::step]
# Create the plot.
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, df_a['O'], label='Oil')
ax.plot(xdata, df_a['W'], label='Water')
ax.set_xticks(np.arange(df_a.shape[0], step=step))
ax.set_xticklabels(xlabels, rotation=45, horizontalalignment='right')
ax.set_title('Test in Naming Labels for the X-Axis')
ax.legend()
Output:
I'd recommend converting the x-axis values into integers or floats (seconds, minutes, hours, or days since a reference time, depending on the precision you need). You can then integrate with the usual numerical methods, and the x-axis will no longer default to arbitrary values.
See How to convert datetime to integer in python
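A minimal sketch of that suggestion, using the 'A' rows from the CSV in the question: the dates are converted to days elapsed since the first sample and each curve is integrated with the trapezoidal rule. The helper name trapezoid is just illustrative, and it is written out by hand so the sketch does not depend on a particular NumPy version.

```python
import numpy as np
import pandas as pd

# The 'A' rows from the CSV in the question.
df_a = pd.DataFrame({
    "DATE": pd.to_datetime(
        ["1/20/2000", "1/21/2000", "1/22/2000", "1/23/2000"],
        format="%m/%d/%Y",
    ),
    "O": [12, 34, 47, 19],
    "W": [50, 50, 35, 31],
})

# Elapsed time in days (as floats) since the first sample.
elapsed = df_a["DATE"] - df_a["DATE"].iloc[0]
days = elapsed.dt.total_seconds().to_numpy() / 86400.0


def trapezoid(y, x):
    """Trapezoidal-rule integral of y over x (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))


oil_total = trapezoid(df_a["O"], days)    # units: O-units times days
water_total = trapezoid(df_a["W"], days)  # units: W-units times days
```

To process every NAME in a large CSV, the same two steps could be applied to each group of a groupby on the NAME column.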

Format datetime in seaborn faceted scatter plot

I am learning python pandas + matplotlib + seaborn plotting and data visualization from a "R Lattice" perspective. I am still getting my legs. Here is a basic question that I could not get to work just right. Here's the example:
# envir (this is running in an iPython notebook)
%pylab inline
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# generate some data
nRows = 500
df = pd.DataFrame({'c1' : np.random.choice(['A','B','C','D'], size=nRows),
'c2' : np.random.choice(['P','Q','R'], size=nRows),
'i1' : np.random.randint(20,50, nRows),
'i2' : np.random.randint(0,10, nRows),
'x1' : 3 * np.random.randn(nRows) + 90,
'x2' : 2 * np.random.randn(nRows) + 89,
't1' : pd.date_range('10/3/2014', periods=nRows)})
# plot a lattice like plot
# 'hue=' is like 'groups=' in R
# 'col=' is like "|" in lattice formula interface
g = sns.FacetGrid(df, col='c1', hue='c2', size=4, col_wrap=2, aspect=2)
g.map(scatter, 't1', 'x1', s=20)
g.add_legend()
I would like the x axis to plot in an appropriate date time format, not as an integer. I am OK with specifying the format (YYYY-MM-DD, for example) as a start.
However it would be better if the time range was inspected and the appropriate scale was produced. In R Lattice (and other plotting systems), if the x variable is a datetime, a "pretty" function would determine if the range was large and implied YYYY only (say, for plotting a 20-year time trend), YYYY-MM (for plotting something that spans a few years)... or YYYY-MM-DD HH:MM:SS format for high-frequency time series data (i.e. something sampled every 100 ms). That was done automatically. Is there anything like that available for this case?
One other really basic question on this example (I am almost embarrassed to ask). How can I get a title on this plot?
Thanks!
Randall
It looks like seaborn does not support datetime on the axes in lmplot yet. However, it does support it in a few of its other plots. In the meantime, I would suggest adding your need to the linked issue, since there currently doesn't seem to be enough perceived need for them to address it.
As for a title: a FacetGrid has no set_title() method of its own (set_titles() only controls the per-facet titles), but you can set a figure-level title through the underlying matplotlib figure. That would look something like this:
.
.
.
g = sns.FacetGrid(df, col='c1', hue='c2', size=4, col_wrap=2, aspect=2)
g.map(scatter, 't1', 'x1', s=20)
g.add_legend()
Then simply add:
g.fig.suptitle('Check out that beautiful facet plot!')
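On the "pretty" date scale part of the question: matplotlib itself ships an AutoDateLocator, which picks a tick spacing from the range of the data, and a ConciseDateFormatter, which chooses a matching label format. The sketch below uses a single plain matplotlib Axes with made-up data rather than a FacetGrid; applying the same two calls to each Axes of a grid (for example by looping over g.axes.flat) is an assumption that is not tested here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data shaped like the t1 and x1 columns in the question.
rng = np.random.default_rng(0)
n_rows = 500
t1 = pd.date_range("2014-10-03", periods=n_rows)
x1 = 3 * rng.standard_normal(n_rows) + 90

fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(t1.to_numpy(), x1, s=20)

# The locator inspects the date range and chooses years, months, days, ...
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
# The formatter picks a compact label format that matches the locator.
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))

fig.suptitle("Scatter with automatic date ticks")
```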

How to create five different figures with the same format in one script from a single DataFrame based on a single Excel file?

I am still very new to Python so this is likely an easy question but I have yet to locate a satisfactory answer. I have data from five different sources which I am trying to plot in one script after loading the data from an Excel file to a single DataFrame. As it is now, I only know how to graph one source at a time or all 5 in a single figure (or somewhere between 1 and 5). Here is my code, the entire script. It may not all be necessary but I have included it all just in case.
import numpy as np
import pandas as pd
from pandas import *
import matplotlib
import matplotlib.pyplot as plot
import datetime as datetime
from datetime import *
#Import data from Excel File
# raw string so the backslashes in the Windows path are not treated as escapes
data2007 = pd.ExcelFile(r'f:\Python\Learning 19-4-2013\Data 2007.xls')
table2007 = data2007.parse('Sheet1', skiprows=[0,1,2,3,4,5], index=None)
#Plot data for first meter
ax = plot.figure(figsize=(7,4), dpi=100).add_subplot(111)
FirstMeter = table2007_3.columns[0]
Meter1 = table2007_3[FirstMeter]
Meter1.plot(ax=ax, style='-v')
#Plot data for second meter
SecondMeter = table2007_3.columns[1]
Meter2 = table2007_3[SecondMeter]
Meter2.plot(ax=ax, style='-v')
#Plot data for third meter
ThirdMeter = table2007_3.columns[2]
Meter3 = table2007_3[ThirdMeter]
Meter3.plot(ax=ax, style='v-')
#Plot data for fourth meter
FourthMeter = table2007_3.columns[3]
Meter4 = table2007_3[FourthMeter]
Meter4.plot(ax=ax, style='v-')
#Plot data for fifth meter
FifthMeter = table2007_3.columns[4]
Meter5 = table2007_3[FifthMeter]
Meter5.plot(ax=ax, style='v-')
#Command to show plots
plot.show()
I see you are making a new Series (e.g., Meter1) out of each column of your DataFrame and then plotting them individually on the same axes. Instead, you can plot the DataFrame itself. Pandas assumes you want to plot each column as a separate line on the same plot, which is exactly what you seem to be doing here.
table2007_3.plot(style='v-')
or perhaps table2007_3.iloc[:, 0:5].plot(style='v-') if there are other columns which you need to leave out (plain [0:4] slicing would select rows, not columns).
By default, it also generates a legend, which you can suppress with the keyword argument legend=False.
If you want separate plots, as the title of your question suggests, the subplots=True argument might get the job done. Note that it draws each column on its own axes within a single figure.
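If truly separate figures are wanted, one per meter and all with the same format, a loop over the columns is one way to do it. This is only a sketch: the DataFrame below is a made-up stand-in for the table loaded from Excel, and the column names Meter 1 to Meter 5 are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for the meter table: 12 monthly readings for 5 meters.
index = pd.date_range("2007-01-01", periods=12, freq="MS")
table = pd.DataFrame(
    np.arange(60, dtype=float).reshape(12, 5),
    index=index,
    columns=[f"Meter {i}" for i in range(1, 6)],
)

figures = {}
for name in table.columns:
    # A new figure per column, created with identical settings each time.
    fig, ax = plt.subplots(figsize=(7, 4), dpi=100)
    table[name].plot(ax=ax, style="-v")
    ax.set_title(name)
    figures[name] = fig

# In an interactive session, plt.show() would then display all five figures.
```

Keeping the formatting inside the loop body means a change to the figure size or line style only has to be made in one place.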
