After running the code, I get this error:
ValueError: Dimensions of labels and X must be compatible
I do not quite understand what this error means.
Honestly, I'm pretty new to Python. I was following some example code to make a boxplot, but I ran into this error. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
title = "Annual Bus Population"
titlelen = len(title)
print("{:*^{titlelen}}".format(title, titlelen=titlelen+6))
print()
filename = 'annual-bus-population-by-passenger-capacity.csv'
data = np.genfromtxt(filename, dtype=["i4", "U50", "i8"], delimiter=",", names=True)
#print("Original data: " + str(data.shape))
null_rows = np.isnan(data['number'])
nonnull_values = data[null_rows==False]
#print("Filtered data: " + str(nonnull_values.shape))
labels = list(set(data['capacity']))
capacities = np.arange(0,len(labels))
capacity_number = data[['capacity','number']]
numbers = capacity_number['number']
values_nine = numbers[capacity_number ['capacity'] == '<10']
values_fifteen = numbers[capacity_number['capacity'] == '10-15']
values_twenty = numbers[capacity_number['capacity'] == '16-20']
values_twentyfive = numbers[capacity_number['capacity'] == '21-25']
values_thirty= numbers[capacity_number ['capacity'] == '21-30']
values_thirtyfive = numbers[capacity_number ['capacity'] == '31-35']
values_fourty = numbers[capacity_number ['capacity'] == '36-40']
values_fourtyfive = numbers[capacity_number ['capacity'] == '40-45']
values_fifty = numbers[capacity_number ['capacity'] == '45-50']
values_fiftyfive = numbers[capacity_number ['capacity'] == '51-55']
values_sixty = numbers[capacity_number ['capacity'] == '56-60']
values_sixtyfive = numbers[capacity_number ['capacity'] == '61-65']
values_seventy = numbers[capacity_number ['capacity'] == '66-70']
values_moreseventy = numbers[capacity_number ['capacity'] == '>70']
values_total = [values_nine,values_fifteen,values_twenty,values_twentyfive,values_thirty,values_thirtyfive,values_fourty,values_fourtyfive,values_fifty,values_fiftyfive,values_sixty,values_sixtyfive,values_seventy,values_moreseventy]
#print(values_total.shape)
#print()
plt.figure(2, figsize=(30,30))
plt.title(title,fontsize=50)
plt.ylabel('Number of passengers',fontsize=40)
plt.yticks(fontsize=30)
plt.xticks(fontsize=30,rotation='vertical')
bp_dict = plt.boxplot(values_total,labels=labels,patch_artist=True)
## change outline color, fill color and linewidth of the boxes
for box in bp_dict['boxes']:
    # change outline color
    box.set( color='#7570b3', linewidth=2)
    # change fill color
    box.set( facecolor = '#1b9e77' )
## change color and linewidth of the whiskers
for whisker in bp_dict['whiskers']:
    whisker.set(color='#7570b3', linewidth=2)
## change color and linewidth of the caps
for cap in bp_dict['caps']:
    cap.set(color='#7570b3', linewidth=2)
## change color and linewidth of the medians
for median in bp_dict['medians']:
    median.set(color='#b2df8a', linewidth=2)
## change the style of fliers and their fill
for flier in bp_dict['fliers']:
    flier.set(marker='D', color='#e7298a', alpha=0.5)
print(bp_dict.keys())
for line in bp_dict['medians']:
    # get position data for median line
    x, y = line.get_xydata()[1] # top of median line
    # overlay median value
    plt.text(x, y, '%.1f' % y,
             horizontalalignment='center',fontsize=30) # draw above, centered
fliers = []
for line in bp_dict['fliers']:
    ndarray = line.get_xydata()
    if (len(ndarray)>0):
        max_flier = ndarray[:,1].max()
        max_flier_index = ndarray[:,1].argmax()
        x = ndarray[max_flier_index,0]
        print("Flier: " + str(x) + "," + str(max_flier))
        plt.text(x,max_flier,'%.1f' % max_flier,horizontalalignment='center',fontsize=30,color='green')
plt.show()
The error was in this line:
bp_dict = plt.boxplot(values_total,labels=labels,patch_artist=True)
The dataset is from:
https://data.gov.sg/dataset/annual-age-bus-population-by-passenger-capacity
Any help is greatly appreciated ^^
thanks
Your error is in your labels variable. Specifically, you have extra values in it such as 15-Nov. Also, you lose the order of the labels when you use the set() function, so they come out in a random order. I'm not quite sure what you need to do to fix it tonight, but you can just remove the labels parameter from your call to plt.boxplot() to get something working. Then you can figure out labels that work.
The error is trying to say "The dimensions of the data and dimension of the labels do not match".
Good luck!
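To make the point about set() concrete, here is a minimal sketch of one way to build the labels and the grouped values together, so they always have the same length and keep a stable order. The rows list and the group_in_order helper are made up for illustration; they stand in for the (capacity, number) pairs read from the CSV.

```python
from collections import defaultdict

def group_in_order(rows):
    """Group (capacity, number) pairs, keeping labels in first-seen order."""
    groups = defaultdict(list)
    for capacity, number in rows:
        groups[capacity].append(number)
    labels = list(groups)  # dicts keep insertion order (Python 3.7+)
    values = [groups[label] for label in labels]
    return labels, values

# Hypothetical rows standing in for the CSV contents.
rows = [("<10", 5), ("10-15", 7), ("<10", 6), ("16-20", 9), ("10-15", 8)]
labels, values = group_in_order(rows)
print(labels)  # ['<10', '10-15', '16-20']
print(values)  # [[5, 6], [7, 8], [9]]
```

Because each label is created at the same moment as its group, len(labels) == len(values) always holds, which is the condition the ValueError is complaining about.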
The labels should be one name per box, i.e. one per data column (axis=1), so that each column is drawn as its own box in a single plot.
But your labels variable is just the list of values from one column (capacity), which is not what boxplot expects.
You can either reshape your DataFrame with pivot_table, or use a boxplot that supports grouping, e.g. the by='capacity' argument of pandas' DataFrame.boxplot (plt.boxplot itself has no such parameter), or try the seaborn library, which probably gives you more options.
For older versions, try plotting with the following:
bp_dict = plt.boxplot(values_total.transpose(),labels=labels,patch_artist=True)
To make things clearer: I don't want to remove the entire bin from the histogram, I just want to get rid of some of the data so that the bin is brought below a desired frequency. The line in the image shows the maximum frequency I would like.
For context, I have a dataset containing a number of angles. My question is very similar to the one asked in "Remove data above threshold in histogram" in terms of the data used, but unlike that question, I don't want to get rid of the data entirely, just reduce it.
Can I do this directly from the histogram, or will I need to delete some of the data in the dataset?
Edit (sorry, I am new to coding and formatting here):
Here is a solution I tried:
bns = 30
hist, bins = np.histogram(dataset['Steering'], bins= bns)
removeddata = []
spb = 700
for j in range(bns):
    rdata = []
    for i in range(len(dataset['Steering'])):
        if dataset['Steering'][i] >= bins[j] and dataset['Steering'][i] <= bins[j+1]:
            rdata.append(i)
    rdata = shuffle(rdata)
    rdata = rdata[spb:]
    removeddata.extend(rdata)
print('removed:', len(removeddata))
dataset.drop(dataset.index[removeddata], inplace = True)
print ('remaining:', len(dataset))
center = (bins[:-1] + bins[1:])*0.5
plt.bar(center,hist,width=0.05)
plt.show()
This is somebody else's solution, and it seemed to work for them, but even when I copy it directly it throws errors. The error I got was "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()". I tried changing 'and' to & and got "TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]". I'm unsure what exactly this refers to, but it points to the line with the if statement. I checked the dtype of everything and they are all float64, so I'm unsure of my next step.
This solution takes into account the clarified requirement that the original input data that exceeds the frequency threshold be dropped. I left my other answer because it is simpler and different enough that it may be useful to another user.
To clarify, this answer produces a new 1D array of data with fewer elements and then plots a histogram from that new data. The data are shuffled before the elements are removed (in case the input data were pre-sorted) in order to prevent bias in dropping data from either the low or high side of each bin.
import numpy as np
import matplotlib.pyplot as plt
from random import shuffle
def remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst):
    if to_gate_lst[idx] == 0:
        return(data_lst)
    else:
        bin_min, bin_max = bins_lst[idx], bins_lst[idx + 1]
        for i in range(len(data_lst)):
            if bin_min <= data_lst[i] < bin_max:
                del data_lst[i]
                to_gate_lst[idx] -= 1
                break
        return remove_gated_val_recursive(idx, to_gate_lst, bins_lst, data_lst)
threshold = 80
fig, ax1 = plt.subplots()
ax1.set_title("Some data")
np.random.seed(30)
data = np.random.randn(1000)
num_bins = 23
raw_hist, raw_bins = np.histogram(data, num_bins)
to_gate = []
for i in range(len(raw_hist)):
    if raw_hist[i] > threshold:
        to_gate.append(raw_hist[i] - threshold)
    else:
        to_gate.append(0)
data_lst = list(data)
shuffle(data_lst)
for idx in range(len(raw_hist)):
    remove_gated_val_recursive(idx, to_gate, raw_bins, data_lst)
new_data = np.array(data_lst)
hist, bins = np.histogram(new_data, num_bins)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) * 0.5
ax1.bar(center, hist, align='center', width=width)
plt.show()
gives the following histogram, plotted from the new_data array.
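As an alternative to the recursive removal, here is a rough vectorised sketch of the same idea: assign every value to its bin, shuffle the members of each bin, and keep at most threshold of them. The function name cap_per_bin and the seed argument are my own choices for illustration, not part of the answer above.

```python
import numpy as np

def cap_per_bin(data, num_bins, threshold, seed=0):
    """Return (kept_data, bin_edges) with at most `threshold` values per bin."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    _, edges = np.histogram(data, num_bins)
    # Digitizing against the interior edges gives bin indices 0..num_bins-1;
    # the maximum value lands in the last bin, matching np.histogram.
    bin_idx = np.digitize(data, edges[1:-1])
    keep = np.zeros(data.shape[0], dtype=bool)
    for b in range(num_bins):
        members = np.flatnonzero(bin_idx == b)
        rng.shuffle(members)  # random choice of which values survive
        keep[members[:threshold]] = True
    return data[keep], edges

rng = np.random.default_rng(30)
data = rng.standard_normal(1000)
new_data, edges = cap_per_bin(data, num_bins=23, threshold=80)
hist, _ = np.histogram(new_data, edges)
print(hist.max(), new_data.size)
```

Reusing the original edges when re-plotting keeps the bins where they were, so no bar can exceed the threshold. Letting np.histogram pick new edges from new_data could shift the bins slightly.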
This answer doesn't re-bin or re-center the data, but I believe it generally achieves what you're asking. Working from the example in the chosen answer of the post you linked, I edit the hist array so that the original input data is not changed as you indicated is your preferred solution:
import numpy as np
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1,2)
ax1.set_title("Some data")
ax2.set_title("Gated data < threshold")
np.random.seed(10)
data = np.random.randn(1000)
num_bins = 23
avg_samples_per_bin = 200
hist, bins = np.histogram(data, num_bins)
width = 0.7 * (bins[1] - bins[0])
center = (bins[:-1] + bins[1:]) * 0.5
ax1.bar(center, hist, align='center', width=width)
threshold = 80
gated = np.empty([len(hist)], dtype=np.int64)
for i in range(len(hist)):
    if hist[i] > threshold:
        gated[i] = threshold
    else:
        gated[i] = hist[i]
ax2.bar(center, gated, align="center", width=width)
plt.show()
which gives
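For what it's worth, the gating loop can probably be collapsed into a single NumPy call. This is only a sketch with freshly generated random data, not the exact data from the answer:

```python
import numpy as np

rng = np.random.default_rng(10)
data = rng.standard_normal(1000)
hist, bins = np.histogram(data, 23)

threshold = 80
# Element-wise minimum caps every bin count at the threshold.
gated = np.minimum(hist, threshold)
print(gated.max() <= threshold)
```

The gated array can then be passed to ax2.bar in place of the one built by the loop.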
I am learning Python and I have an issue with this code:
https://github.com/Slothfulwave612/Football-Analytics-Using-Python/blob/master/03.%20Analyzing%20Event%20Data/pass_map.py
My doubt is really simple:
I would like to use a for loop to apply the pass-map code to multiple football matches.
import matplotlib.pyplot as plt
import json
from pandas.io.json import json_normalize
from FCPython import createPitch
pitch_length_X = 120
pitch_width_Y = 80
(fig,ax) = createPitch(pitch_length_X, pitch_width_Y,'yards','gray')
## the code integrated in order to analyze multiple matches
P1TMP = [16205, 16131, 16265]
for i in P1TMP:
    ## match id for our El Clasico
    match_id = int(i)
    home_team = 'Barcelona'
    player_name = 'Lionel Andrés Messi Cuccittini'
    ## this is the name of our event data file for
    ## our required El Clasico
    file_name = str(match_id) + '.json'
    ## loading the required event data file
    my_data = json.load(open('/content/drive/My Drive/20200515 CHIRINGUITO/events/' + file_name, 'r', encoding='utf-8'))
    ## get the nested structure into a dataframe
    ## store the dataframe in a dictionary with the match id as key
    df = json_normalize(my_data, sep='_').assign(match_id = file_name[:-5])
    ## making the list of all column names
    column = list(df.columns)
    ## all the type names we have in our dataframe
    all_type_name = list(df['type_name'].unique())
    ## creating a data frame for pass
    ## and then removing the null values
    ## only listing the player_name in the dataframe
    pass_df = df.loc[df['type_name'] == 'Pass', :].copy()
    pass_df.dropna(inplace=True, axis=1)
    pass_df = pass_df.loc[pass_df['player_name'] == player_name, :]
    ## creating a data frame for ball receipt
    ## removing all the null values
    ## and only listing Barcelona players in the dataframe
    breceipt_df = df.loc[df['type_name'] == 'Ball Receipt*', :].copy()
    breceipt_df.dropna(inplace=True, axis=1)
    breceipt_df = breceipt_df.loc[breceipt_df['team_name'] == 'Barcelona', :]
    pass_comp, pass_no = 0, 0
    ## pass_comp: completed pass
    ## pass_no: unsuccessful pass
    ## iterating through the pass dataframe
    for row_num, passed in pass_df.iterrows():
        if passed['player_name'] == player_name:
            ## for away side
            x_loc = passed['location'][0]
            y_loc = passed['location'][1]
            pass_id = passed['id']
            summed_result = sum(breceipt_df.iloc[:, 14].apply(lambda x: pass_id in x))
            if summed_result > 0:
                ## if pass made was successful
                color = 'blue'
                label = 'Successful'
                pass_comp += 1
            else:
                ## if pass made was unsuccessful
                color = 'green'
                label = 'Unsuccessful'
                pass_no += 1
            ## plotting circle at the player's position
            shot_circle = plt.Circle((pitch_length_X - x_loc, y_loc), radius=2, color=color, label=label)
            shot_circle.set_alpha(alpha=0.2)
            ax.add_patch(shot_circle)
            ## parameters for making the arrow
            pass_x = 120 - passed['pass_end_location'][0]
            pass_y = passed['pass_end_location'][1]
            dx = ((pitch_length_X - x_loc) - pass_x)
            dy = y_loc - pass_y
            ## making an arrow to display the pass
            pass_arrow = plt.Arrow(pitch_length_X - x_loc, y_loc, -dx, -dy, width=1, color=color)
            ## adding arrow to the plot
            ax.add_patch(pass_arrow)
    ## computing pass accuracy
    pass_acc = (pass_comp / (pass_comp + pass_no)) * 100
    pass_acc = str(round(pass_acc, 2))
    ## adding text to the plot
    plt.text(20, 85, '{} pass map vs Real Madrid'.format(player_name), fontsize=15)
    plt.text(20, 82, 'Pass Accuracy: {}'.format(pass_acc), fontsize=15)
    ## handling labels
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys(), loc='best', bbox_to_anchor=(0.9, 1, 0, 0),fontsize=12)
    ## editing the figure size and saving it
    fig.set_size_inches(12, 8)
    fig.savefig('{} passmap.png'.format(match_id), dpi=200)
    ## showing the plot
    plt.show()
I have only edited the code to analyze multiple matches with a for loop:
P1TMP = [16205, 16131, 16265]
for i in P1TMP:
And the results:
In the first image the result is almost perfect, but the filter by kind of pass is not working.
In the second image the passes are a mix of the passes from the first and second matches. I only want the passes of the second match.
In the third image the passes are a mix of matches 1, 2 and 3. I only need the passes of the third match.
Thanks in advance for your support.
Best Regards
It's combining all the matches into one because each figure is being drawn on top of the previous one. There are a few other things you need to change too:
The away team will not always be Real Madrid, so make that dynamic.
Adjust the figure text accordingly, so it's not always "vs. Real Madrid".
Save the file under a dynamic name so the files don't overwrite each other.
Instead of using plt.text for the titles (which is fine if you want to annotate at specific x, y coordinates), use plt.title() and plt.suptitle(). They center the text and give a nicer layout.
You use i for your match id variable when iterating, but then you don't change/include it in the loop.
This is the main issue: (fig,ax) = createPitch(pitch_length_X, pitch_width_Y,'yards','gray')
is what creates your "blank canvas" to plot on, so it needs to be called before each plot. It's like grabbing a new blank sheet of paper to draw on. If you only ever use the first sheet, everything ends up on that one sheet. So move that call into your for loop.
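Stripped of the football specifics, the "new sheet of paper per match" idea looks roughly like the sketch below. Here plt.subplots() stands in for createPitch, and the pass coordinates are invented purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

def plot_each_match(passes_by_match):
    """Return one figure per match id, each drawn on its own fresh axes."""
    figures = {}
    for match_id, passes in passes_by_match.items():
        # A new canvas per match; in the real code this is createPitch(...).
        fig, ax = plt.subplots()
        for x0, y0, x1, y1 in passes:
            ax.plot([x0, x1], [y0, y1], color="blue")
        ax.set_title("Match {}".format(match_id))
        figures[match_id] = fig
    return figures

# Hypothetical pass coordinates (start x, start y, end x, end y) per match id.
demo = {16205: [(10, 10, 30, 20)], 16131: [(50, 40, 70, 45), (20, 60, 25, 30)]}
figures = plot_each_match(demo)
print({mid: len(fig.axes[0].lines) for mid, fig in figures.items()})
```

Each figure only contains the lines of its own match, because nothing is ever drawn on a previous iteration's axes.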
Code
import matplotlib.pyplot as plt
import json
from pandas.io.json import json_normalize
from FCPython import createPitch
## Note Statsbomb data uses yards for their pitch dimensions
pitch_length_X = 120
pitch_width_Y = 80
## match id for our El Clasico
match_list = [16205, 16131, 16265]
teamA = 'Barcelona' #<--- adjusted here
for match_id in match_list:
    ## calling the function to create a pitch map
    ## yards is the unit for measurement and
    ## gray will be the line color of the pitch map
    (fig,ax) = createPitch(pitch_length_X, pitch_width_Y,'yards','gray') #< moved into for loop
    player_name = 'Lionel Andrés Messi Cuccittini'
    ## this is the name of our event data file for
    ## our required El Clasico
    file_name = str(match_id) + '.json'
    ## loading the required event data file
    my_data = json.load(open('Statsbomb/data/events/' + file_name, 'r', encoding='utf-8'))
    ## get the nested structure into a dataframe
    ## store the dataframe in a dictionary with the match id as key
    df = json_normalize(my_data, sep='_').assign(match_id = file_name[:-5])
    teamB = [x for x in list(df['team_name'].unique()) if x != teamA ][0] #<--- get other team name
    ## making the list of all column names
    column = list(df.columns)
    ## all the type names we have in our dataframe
    all_type_name = list(df['type_name'].unique())
    ## creating a data frame for pass
    ## and then removing the null values
    ## only listing the player_name in the dataframe
    pass_df = df.loc[df['type_name'] == 'Pass', :].copy()
    pass_df.dropna(inplace=True, axis=1)
    pass_df = pass_df.loc[pass_df['player_name'] == player_name, :]
    ## creating a data frame for ball receipt
    ## removing all the null values
    ## and only listing Barcelona players in the dataframe
    breceipt_df = df.loc[df['type_name'] == 'Ball Receipt*', :].copy()
    breceipt_df.dropna(inplace=True, axis=1)
    breceipt_df = breceipt_df.loc[breceipt_df['team_name'] == 'Barcelona', :]
    pass_comp, pass_no = 0, 0
    ## pass_comp: completed pass
    ## pass_no: unsuccessful pass
    ## iterating through the pass dataframe
    for row_num, passed in pass_df.iterrows():
        if passed['player_name'] == player_name:
            ## for away side
            x_loc = passed['location'][0]
            y_loc = passed['location'][1]
            pass_id = passed['id']
            summed_result = sum(breceipt_df.iloc[:, 14].apply(lambda x: pass_id in x))
            if summed_result > 0:
                ## if pass made was successful
                color = 'blue'
                label = 'Successful'
                pass_comp += 1
            else:
                ## if pass made was unsuccessful
                color = 'red'
                label = 'Unsuccessful'
                pass_no += 1
            ## plotting circle at the player's position
            shot_circle = plt.Circle((pitch_length_X - x_loc, y_loc), radius=2, color=color, label=label)
            shot_circle.set_alpha(alpha=0.2)
            ax.add_patch(shot_circle)
            ## parameters for making the arrow
            pass_x = 120 - passed['pass_end_location'][0]
            pass_y = passed['pass_end_location'][1]
            dx = ((pitch_length_X - x_loc) - pass_x)
            dy = y_loc - pass_y
            ## making an arrow to display the pass
            pass_arrow = plt.Arrow(pitch_length_X - x_loc, y_loc, -dx, -dy, width=1, color=color)
            ## adding arrow to the plot
            ax.add_patch(pass_arrow)
    ## computing pass accuracy
    pass_acc = (pass_comp / (pass_comp + pass_no)) * 100
    pass_acc = str(round(pass_acc, 2))
    ## adding text to the plot
    plt.suptitle('{} pass map vs {}'.format(player_name, teamB), fontsize=15) #<-- make dynamic and change to suptitle
    plt.title('Pass Accuracy: {}'.format(pass_acc), fontsize=15) #<-- change to title
    ## handling labels
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys(), loc='best', bbox_to_anchor=(0.9, 1, 0, 0), fontsize=12)
    ## editing the figure size and saving it
    fig.set_size_inches(12, 8)
    fig.savefig('{} passmap.png'.format(match_id), dpi=200) #<-- dynamic file name
    ## showing the plot
    plt.show()
I'm making a plot to compare band structure calculations from two different methods. This means plotting multiple lines for each set of data. I want to have a set of widgets that controls each set of data separately. The code below works if I only plot one set of data, but I can't get the widgets to work properly for two sets of data.
#!/usr/bin/env python3
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider, TextBox
#cols = ['blue', 'red', 'green', 'purple']
cols = ['#3f54bf','#c14142','#59bf3f','#b83fbf']
finam = ['wan_band.dat','wan_band.pwx.dat']
#finam = ['wan_band.dat'] # this works
lbot = len(finam)*0.09 + 0.06
fig, ax = plt.subplots()
plt.subplots_adjust(bottom=lbot)
ax.margins(x=0) # lines go to the edge of the horizontal axes
def setlines(lines, txbx1, txbx2):
    ''' turn lines on/off based on text box values '''
    try:
        mn = int(txbx1) - 1
        mx = int(txbx2) - 1
        for ib in range(len(lines)):
            if (ib<mn) or (ib>mx):
                lines[ib].set_visible(False)
            else :
                lines[ib].set_visible(True)
        plt.draw()
    except ValueError as err:
        print('Invalid range')
#end def setlines(cnt, lines, txbx1, txbx2):
def alphalines(lines, valin):
    ''' set lines' opacity '''
    maxval = int('ff',16)
    maxval = hex(int(valin*maxval))[2:]
    for ib in range(bcnt):
        lines[ib].set_color(cols[cnt]+maxval)
    plt.draw()
#end def alphalines(lines, valtxt):
lines = [0]*len(finam) # 2d list to hold Line2Ds
txbox1 = [0]*len(finam) # list of Lo Band TextBoxes
txbox2 = [0]*len(finam) # list of Hi Band TextBoxes
alslid = [0]*len(finam) # list of Line Opacity Sliders
for cnt, fnam in enumerate(finam):
    ptcnt = 0 # point count
    fid = open(fnam, 'r')
    fiit = iter(fid)
    for line in fiit:
        if line.strip() == '' :
            break
        ptcnt += 1
    fid.close()
    bandat_raw = np.loadtxt(fnam)
    bcnt = int(np.round((bandat_raw.shape[0] / (ptcnt))))
    print(ptcnt)
    print(bcnt)
    # get views of the raw data that are easier to work with
    kbandat = bandat_raw[:ptcnt,0] # k point length along path
    ebandat = bandat_raw.reshape((bcnt,ptcnt,2))[:,:,1] # band energy # k-points
    lines[cnt] = [0]*bcnt # point this list element to another list
    for ib in range(bcnt):
        #l, = plt.plot(kbandat, ebandat[ib], c=cols[cnt],lw=1.0)
        l, = ax.plot(kbandat, ebandat[ib], c=cols[cnt],lw=1.0)
        lines[cnt][ib] = l
    y0 = 0.03 + 0.07*cnt
    bxht = 0.035
    axbox1 = plt.axes([0.03, y0, 0.08, bxht]) # x0, y0, width, height
    axbox2 = plt.axes([0.13, y0, 0.08, bxht])
    txbox1[cnt] = TextBox(axbox1, '', initial=str(1))
    txbox2[cnt] = TextBox(axbox2, '', initial=str(bcnt))
    txbox1[cnt].on_submit( lambda x: setlines(lines[cnt], x, txbox2[cnt].text) )
    txbox2[cnt].on_submit( lambda x: setlines(lines[cnt], txbox1[cnt].text, x) )
    axalpha = plt.axes([0.25, y0, 0.65, bxht])
    alslid[cnt] = Slider(axalpha, '', 0.1, 1.0, valinit=1.0)
    salpha = alslid[cnt]
    alslid[cnt].on_changed( lambda x: alphalines(lines[cnt], x) )
#end for cnt, fnam in enumerate(finam):
plt.text(0.01, 1.2, 'Lo Band', transform=axbox1.transAxes)
plt.text(0.01, 1.2, 'Hi Band', transform=axbox2.transAxes)
plt.text(0.01, 1.2, 'Line Opacity', transform=axalpha.transAxes)
plt.show()
All the widgets only control the last data set plotted instead of the individual data sets I tried to associate with each widget. Here is a sample output:
Here the bottom slider should be changing the blue lines' opacity, but instead it changes the red lines' opacity. Originally the variables txbox1, txbox2, and alslid were not lists. I changed them to lists though to ensure they weren't garbage collected but it didn't change anything.
Here is the test data set1 and set2 I've been using. They should be saved as files 'wan_band.dat' and 'wan_band.pwx.dat' as per the hard coded list finam in the code.
I figured it out: using a lambda to partially apply the callback functions with the loop variable meant they were always evaluated with the last value of that variable. Switching to functools.partial fixed the issue.
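For anyone hitting the same thing, here is a small self-contained illustration of the late-binding behaviour, with no matplotlib involved. The report function and the two helper names are invented for this sketch.

```python
from functools import partial

def report(index, value):
    """Stand-in for a widget callback that needs to know its data-set index."""
    return (index, value)

def late_bound(n):
    callbacks = []
    for i in range(n):
        # The lambda looks `i` up when it is called, after the loop has ended.
        callbacks.append(lambda x: report(i, x))
    return callbacks

def early_bound(n):
    callbacks = []
    for i in range(n):
        # partial stores the current value of `i` at creation time.
        callbacks.append(partial(report, i))
    return callbacks

print([cb("v") for cb in late_bound(3)])   # [(2, 'v'), (2, 'v'), (2, 'v')]
print([cb("v") for cb in early_bound(3)])  # [(0, 'v'), (1, 'v'), (2, 'v')]
```

A default argument (lambda x, i=i: report(i, x)) captures the value in the same way, if you would rather keep the lambda.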
I am working with a public data set, trying to learn more about how to visualize data and some basics of machine learning, and I seem to have gotten myself really stuck. I'm trying to use the seaborn violin plot with the hue argument to plot red and white wines, one data-set column at a time, against the quality column...
I'm probably not explaining this well.
Anyway, my code looks like this:
class Wine():
    def __init__(self):
        self.Process()
    def Process(self):
        whit = pds.read_csv("winequality-white.csv", sep=";", header=0)
        reds = pds.read_csv("winequality-red.csv", sep=";", header=0)
        self.Plot_Against_Color(whit, reds)
    def Plot_Against_Color(self, white, red):
        nwhites = white.shape[0]; nreds = red.shape[0]
        white_c = ["white"] * nwhites; red_c = ["red"] * nreds
        white['color'] = white_c; red['color'] = red_c
        total = white.append(red, ignore_index=True)
        parameters = list(total)
        nparameters = len(parameters)
        plt_rows = math.floor((nparameters - 1) / 3)
        if (nparameters - 1) % 3 > 0:
            plt_rows += 1
        fig, ax = plt.subplots(int(plt_rows), 3)
        fig.suptitle('Characteristics of Red and White Wine as a Function of Quality')
        for i in range(len(parameters)):
            title = parameters[i]
            if title == 'quality' or title == 'color':
                continue
            print(title)
            r = math.floor(i / 3);
            c = i % 3
            sns.violinplot(data=total, x='quality', y=title, hue='color', split=True, ax=[r, c])
            ax[r, c].set_xlabel('quality')
            ax[r, c].set_ylabel(title)
        plt.tight_layout()
This gives me an error
AttributeError: 'list' object has no attribute 'fill_betweenx'
I've also tried writing this out to subplot using the example here.
This is a whole other series of errors. I'm at a loss of what to do now... Any help?
The problem is in this part:
sns.violinplot(data=total, x='quality', y=title, hue='color', split=True, ax=[r, c])
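ax=[r, c] passes a plain Python list instead of an Axes object, which is why seaborn fails with "'list' object has no attribute 'fill_betweenx'". Index into the array of axes instead, i.e. ax=ax[r, c]: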
sns.violinplot(data=total, x='quality', y=title, hue='color', split=True, ax=ax[r, c])
And all should work fine.
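As a side note, the row/column bookkeeping in the question can be sketched with math.ceil and divmod. The helper name grid_positions and the short column list are assumptions made for this example. Enumerating only the columns that are actually plotted keeps the grid positions contiguous even if the skipped columns are not the last ones.

```python
import math

def grid_positions(names, ncols=3, skip=("quality", "color")):
    """Map each plotted column name to its (row, col) slot in a subplot grid."""
    plotted = [name for name in names if name not in skip]
    nrows = math.ceil(len(plotted) / ncols)
    positions = {name: divmod(k, ncols) for k, name in enumerate(plotted)}
    return nrows, positions

# Hypothetical column names standing in for the wine data set.
columns = ["pH", "alcohol", "quality", "sulphates", "density", "color"]
nrows, positions = grid_positions(columns)
print(nrows)      # 2
print(positions)  # {'pH': (0, 0), 'alcohol': (0, 1), 'sulphates': (0, 2), 'density': (1, 0)}
```

Each (r, c) pair would then be used as ax=ax[r, c] in the violinplot call.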
I can draw a boxplot from data:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.rand(100)
plt.boxplot(data)
Then, the box will range from the 25th-percentile to 75th-percentile, and the whisker will range from the smallest value to the largest value between (25th-percentile - 1.5*IQR, 75th-percentile + 1.5*IQR), where the IQR denotes the inter-quartile range. (Of course, the value 1.5 is customizable).
Now I want to know the values used in the boxplot, i.e. the median, upper and lower quartile, the upper whisker end point and the lower whisker end point. While the former three are easy to obtain by using np.median() and np.percentile(), the end point of the whiskers will require some verbose coding:
median = np.median(data)
upper_quartile = np.percentile(data, 75)
lower_quartile = np.percentile(data, 25)
iqr = upper_quartile - lower_quartile
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
I was wondering, while this is acceptable, would there be a neater way to do this? It seems that the values should be ready to pull-out from the boxplot, as it's already drawn.
Why do you want to do so? What you are doing is already pretty direct.
Yeah, if you want to fetch them for the plot, when the plot is already made, simply use the get_ydata() method.
B = plt.boxplot(data)
[item.get_ydata() for item in B['whiskers']]
It returns an array of shape (2,) for each whisker; the second element is the value we want:
[item.get_ydata()[1] for item in B['whiskers']]
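If you want the numbers without drawing anything, matplotlib also exposes the statistics helper matplotlib.cbook.boxplot_stats, which as far as I know is what boxplot uses internally. A minimal sketch with freshly generated random data (not the asker's):

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
data = rng.random(100)

# boxplot_stats returns one dict per data set; we pass a single 1-D array.
stats = cbook.boxplot_stats(data, whis=1.5)[0]
for key in ("whislo", "q1", "med", "q3", "whishi"):
    print(key, stats[key])
```

The dict also contains a 'fliers' entry with the points beyond the whiskers.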
I've had this recently and have written a function to extract the boxplot values from the boxplot as a pandas dataframe.
The function is:
def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)
And is called by passing an array of labels (the ones that you would pass to the boxplot plotting function) and the data returned by the boxplot function itself.
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)
data1 = np.random.normal(loc = 0, scale = 1, size = 1000)
data2 = np.random.normal(loc = 5, scale = 1, size = 1000)
data3 = np.random.normal(loc = 10, scale = 1, size = 1000)
labels = ['data1', 'data2', 'data3']
bp = plt.boxplot([data1, data2, data3], labels=labels)
print(get_box_plot_data(labels, bp))
plt.show()
Outputs the following from get_box_plot_data:
   label  lower_whisker  lower_quartile    median  upper_quartile  upper_whisker
0  data1      -2.491652       -0.587869  0.047543        0.696750       2.559301
1  data2       2.351567        4.310068  4.984103        5.665910       7.489808
2  data3       7.227794        9.278931  9.947674       10.661581      12.733275
And produces the following plot:
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
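is equal to
upper_whisker = data.max()
lower_whisker = data.min()
only when no data point lies beyond the fences, i.e. when the plot has no fliers. Statistically speaking, the fences are upper_quartile + 1.5*IQR and lower_quartile - 1.5*IQR, and matplotlib draws each whisker to the most extreme data point that is still inside its fence.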
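A tiny made-up example shows where the two differ. With one obvious outlier, the upper whisker stops at the largest value inside the fence rather than at data.max(). The whisker_ends helper is only a sketch of the calculation from the question.

```python
import numpy as np

def whisker_ends(data, whis=1.5):
    """Most extreme data points that still lie inside the whis*IQR fences."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower = data[data >= q1 - whis * iqr].min()
    upper = data[data <= q3 + whis * iqr].max()
    return lower, upper

# Invented data: five ordinary values and one outlier.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
lower, upper = whisker_ends(data)
print(lower, upper, data.max())
```

Here the quartiles are 2.25 and 4.75, so the upper fence is 8.5 and the upper whisker ends at 5, while data.max() is 100.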