Pandas Python - Create subplots from 2 CSV columns - python

I am trying to create subplots: first Pie plot (got it), second bar plot (didn't succeed):
These are the columns:
My Code:
top_series = all_data.head(50).groupby('Top Rated ')['Top Rated '].count()
top_values = top_series.values.tolist()
top_index = ['Top Rated', 'Not Top Rated']
top_colors = ['#27AE60', '#E74C3C']
all_data['Rating_Cat'] = all_data['Rating'].apply(lambda x : 'High' if (x > 10000000 ) else 'Low')
rating_series = all_data.head(50).groupby('Rating_Cat')['Rating_Cat'].count()
rating_values = rating_series.values.tolist()
rating_index = ['High' , 'Low']
rating_colors = ['#F1C40F', '#27AE60']
fig, axs = plt.subplots(1,2, figsize=(16,5))
axs[0].pie(top_values, labels=top_index, autopct='%1.1f%%', shadow=True, startangle=90,
explode=(0.05, 0.05), radius=1.2, colors=top_colors, textprops={'fontsize':12})
all_data['Rating_Cat'].value_counts().plot(kind = 'bar', ax=axs[1])
fig.suptitle('Does "Rating" really affect on Top Sellers ?' , fontsize=17)
My question:
How to create the second plot that will get output like:
axis X = 1 , 2 , 3 , 4 .... 50 + Top Rated / No (according to the current col)
axis y = the rating from 0 to 7603388.0
I have really tried lots of things, but I am kind of lost here... please help !!

In the first plot you take the first 50 rows of the dataset and plot the share of each value in the Top Rated column.
If I understand the second plot correctly, you want each Rating from the first 100 rows plotted in order, with the bar color based on the Top Rated value:
#taking first 100 rows
rating_series = all_data.head(100).copy()
#assigning color to the values, so you could use it in bar() plot
rating_series["color"] = rating_series["Top Rated "].map({"Top Rated": "#27AE60", "No": "#E74C3C"})
#plotting the values
axs[1].bar(rating_series.index, rating_series["Rating"], color = rating_series["color"])
If you want to add a legend to the plot, you have to build it manually:
import matplotlib.patches as mpatches
axs[1].legend(handles=[mpatches.Patch(color='#27AE60', label='Top Rated'),
mpatches.Patch(color='#E74C3C', label='Not Top Rated')])
Edit: My whole code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import random
df = pd.DataFrame(
{
"Rating": np.random.randint(0,7603388,size=200),
"Top Rated ": [random.choice(['Top Rated', 'No']) for rated in range(0,200)]
}
)
#taking first 100 rows
rating_series = df.head(100).copy()
#assigning color to the values, so you could use it in bar() plot
rating_series["color"] = rating_series["Top Rated "].map({"Top Rated": "#27AE60", "No": "#E74C3C"})
#checking if there were no NaNs
rating_series["color"].value_counts(dropna=False)
#Output:
#E74C3C 53
#27AE60 47
#Name: color, dtype: int64
#1st plot
top_series = rating_series.groupby('Top Rated ')['Top Rated '].count()
top_index = ['Top Rated', 'Not Top Rated']
top_colors = ['#27AE60', '#E74C3C']
fig, axs = plt.subplots(1,2, figsize=(16,5))
axs[0].pie(top_series.values, labels=top_index, autopct='%1.1f%%', shadow=True, startangle=90,
explode=(0.05, 0.05), radius=1.2, colors=top_colors, textprops={'fontsize':12})
#2nd plot
axs[1].bar(rating_series.index, rating_series["Rating"], color = rating_series["color"])
axs[1].legend(handles=[mpatches.Patch(color='#27AE60', label='Top Rated'),
mpatches.Patch(color='#E74C3C', label='Not Top Rated')])
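One thing that can go wrong with the map() step is a label that is not in the dictionary (a stray trailing space, a different spelling): it becomes NaN and bar() then has no valid color for that row. Below is a minimal sketch of a guard against that, using a small made-up frame in place of the CSV; the "Unknown" label and the grey fallback color are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import pandas as pd

# Hypothetical stand-in for the CSV; note the stray space and the unexpected label
df = pd.DataFrame({
    "Rating": [500, 7603388, 120000, 98000, 4300000],
    "Top Rated ": ["Top Rated", "No", "No ", "Top Rated", "Unknown"],
})

color_map = {"Top Rated": "#27AE60", "No": "#E74C3C"}
# Strip whitespace before mapping; anything still unmapped falls back to grey
colors = df["Top Rated "].str.strip().map(color_map).fillna("#7F8C8D")

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(df.index + 1, df["Rating"], color=colors.tolist())  # x axis runs 1..N
ax.set_xlabel("Row")
ax.set_ylabel("Rating")
ax.legend(handles=[mpatches.Patch(color=c, label=name) for name, c in color_map.items()])
```

Call plt.show() or fig.savefig(...) at the end to actually see the figure.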

Related

Sorting the bars in the barchart based on the values in y axis

I am doing a guided project on DataCamp that explores the market cap of various cryptocurrencies over time. Even though I know other ways to get the output, I am sticking to the proposed method.
I need to make a bar graph of the top 10 cryptocurrencies (x axis) and their share of market cap (y axis). I get the desired output, but I want to go one step further and sort the bars in descending order. Right now they are sorted by the first letter of each cryptocurrency's symbol. Here is the code:
#Declaring these now for later use in the plots
TOP_CAP_TITLE = 'Top 10 market capitalization'
TOP_CAP_YLABEL = '% of total cap'
# Selecting the first 10 rows and setting the index
cap10 = cap.iloc[:10,]
# Calculating market_cap_perc
cap10 = cap10.assign(market_cap_perc = round(cap10['market_cap_usd']/sum(cap['market_cap_usd'])*100,2))
# Plotting the barplot with the title defined above
fig, ax = plt.subplots(1,1)
ax.bar(cap10['symbol'], cap10['market_cap_perc'])
ax.set_title(TOP_CAP_TITLE)
ax.set_ylabel(TOP_CAP_YLABEL)
plt.show()
I've replicated your code with dummy data and produced the plot; is this the sorted plot you're looking for? You only need to sort the dataframe with df.sort_values():
import pandas as pd
import matplotlib.pyplot as plt
d = {'BCH': 8, 'BTC': 55, 'ETH': 12, 'MIOTA': 4, 'ADA': 0.5, 'BTG': 0.8, 'XMR': 0.7, 'DASH': 1, 'LTC': 0.99, 'XRP': 2.5}
cap = pd.DataFrame({'symbol': list(d.keys()), 'market_cap_perc': list(d.values())})
#Declaring these now for later use in the plots
TOP_CAP_TITLE = 'Top 10 market capitalization'
TOP_CAP_YLABEL = '% of total cap'
# Selecting the first 10 rows and setting the index
cap10 = cap.iloc[:10,]
# Calculating market_cap_perc
# cap10 = cap10.assign(market_cap_perc = round(cap10['market_cap_usd']/sum(cap['market_cap_usd'])*100,2))
cap10 = cap10.sort_values('market_cap_perc', ascending=False) #add this line
# Plotting the barplot with the title defined above
fig, ax = plt.subplots(1,1)
ax.bar(cap10['symbol'], cap10['market_cap_perc'])
ax.set_title(TOP_CAP_TITLE)
ax.set_ylabel(TOP_CAP_YLABEL)
plt.show()
You can sort cap10 before plotting:
cap10 = cap10.sort_values(by='market_cap_perc', ascending=False)
fig, ax = plt.subplots(1,1)
...
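To check that sorting the frame really controls the bar order, here is a small self-contained sketch with made-up percentages (four symbols instead of ten). It relies on recent Matplotlib versions drawing string categories in the order they appear in the data, which is my understanding of the current behaviour rather than something stated in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, only to illustrate the ordering
cap10 = pd.DataFrame({
    "symbol": ["BCH", "BTC", "ETH", "XRP"],
    "market_cap_perc": [8.0, 55.0, 12.0, 2.5],
})

# Sort once, then plot; the bars follow the row order of the sorted frame
cap10_sorted = cap10.sort_values("market_cap_perc", ascending=False).reset_index(drop=True)

fig, ax = plt.subplots()
bars = ax.bar(cap10_sorted["symbol"], cap10_sorted["market_cap_perc"])
ax.set_ylabel("% of total cap")

print(cap10_sorted["symbol"].tolist())  # ['BTC', 'ETH', 'BCH', 'XRP']
```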

Pandas Python Visualization - ValueError: shape mismatch: ERROR

*Edit:
Why does the right plot (the bar chart) show a half-black area and weird numbers ("garbage")? How do I fix the right plot?
here is my code:
top_series = all_data.head(50).groupby('Top Rated ')['Top Rated '].count()
top_values = top_series.values.tolist()
top_index = ['Top Rated', 'Not Top Rated']
top_colors = ['#27AE60', '#E74C3C']
rating_series = all_data.head(50).groupby('Rating')['Rating'].count()
rating_values = rating_series.values.tolist()
rating_index = ['High' , 'Low']
rating_colors = ['#F1C40F', '#27AE60']
fig, axs = plt.subplots(1,2, figsize=(16,5))
axs[0].pie(top_values, labels=top_index, autopct='%1.1f%%', shadow=True, startangle=90,
explode=(0.05, 0.05), radius=1.5, colors=top_colors, textprops={'fontsize':15})
axs[1].bar(rating_series.index, rating_series.values, color='b')
axs[1].set_xlabel('Rating')
axs[1].set_ylabel('Amount')
fig.suptitle('Does "Rating" really affect on Top Sellers ? ')
CSV cols:
Output (look at the right plot):
I suppose that keys is a list of all keys, so it can have a different shape than top_values.
If you do:
axs[1].bar(top_series.index, top_series.values, color='b')
it should work.
But if you just want to plot the counts, there is an even shorter version without temporary objects:
all_data['Top Rated '].value_counts().plot(kind = 'bar', ax=axs[1])
Edit: The Rating column is numeric, not a string column. You have to create a column that holds the values High and Low, for example:
all_data['Rating_Cat'] = all_data['Rating'].apply(lambda x : 'High' if (x > 10000000 ) else 'Low')
Then use this column for the bar plot.
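The High/Low split can also be written without apply(). The sketch below uses np.where on a small made-up Rating column with the same 10000000 cut-off; the reindex call is my addition so that both categories show up even when one of them is empty.

```python
import numpy as np
import pandas as pd

# Made-up ratings on either side of the cut-off
all_data = pd.DataFrame({"Rating": [120, 25_000_000, 9_999_999, 10_000_001, 50]})

THRESHOLD = 10_000_000  # same cut-off as in the answer above
all_data["Rating_Cat"] = np.where(all_data["Rating"] > THRESHOLD, "High", "Low")

# Keep both categories in a fixed order, even if one has zero rows
counts = all_data["Rating_Cat"].value_counts().reindex(["High", "Low"], fill_value=0)
print(counts.to_dict())  # {'High': 2, 'Low': 3}
```

counts.plot(kind='bar', ax=axs[1]) then draws the two bars. If the ratings in the real data never exceed the cut-off, every row ends up as Low, so a threshold taken from the data itself (for example the median) may be more useful.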

How do I plot a categorical bar chart with different classes for each category in Matplotlib?

I was trying to reproduce this plot with Matplotlib:
So, by looking at the documentation, I found out that the closest thing is a grouped bar chart. The problem is that I have a different number of "bars" for each category (subject, illumination, ...), whereas the example provided by matplotlib has only 2 classes (M, F) for each category (G1, G2, G3, ...). I don't know exactly where to start; does anyone here have a clue? I think in this case the trick they used to specify the bar locations:
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, men_means, width, label='Men')
rects2 = ax.bar(x + width/2, women_means, width, label='Women')
does not work at all as in the second class (for example) there is a different number of bars. It would be awesome if anyone could give me an idea. Thank you in advance!
Supposing the data resides in a dataframe, the bars can be generated by looping through the categories:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# first create some test data, similar in structure to the question's
categories = ['Subject', 'Illumination', 'Location', 'Daytime']
rows = []  # collect the rows first; DataFrame.append was removed in pandas 2.0
for cat in categories:
    for _ in range(np.random.randint(2, 7)):
        rows.append({'Category': cat,
                     'Class': "".join(np.random.choice([*'tuvwxyz'], 10)),
                     'Value': np.random.uniform(10, 17)})
df = pd.DataFrame(rows, columns=['Category', 'Class', 'Value'])
fig, ax = plt.subplots()
start = 0 # position for first label
gap = 1 # gap between labels
labels = [] # list for all the labels
label_pos = np.array([]) # list for all the label positions
# loop through the categories of the dataframe
# provide a list of colors (at least as long as the expected number of categories)
for (cat, df_cat), color in zip(df.groupby('Category', sort=False), ['navy', 'orange'] * len(df)):
    num_in_cat = len(df_cat)
    # add a text for the category, using "axes coordinates" for the y-axis;
    # the bars sit at start .. start + num_in_cat - 1, so their centre is start + (num_in_cat - 1) / 2
    ax.text(start + (num_in_cat - 1) / 2, 0.95, cat, ha='center', va='top', transform=ax.get_xaxis_transform())
    # positions for the labels of the current category
    this_label_pos = np.arange(start, start + num_in_cat)
    # create bars at the desired positions
    ax.bar(this_label_pos, df_cat['Value'], color=color)
    # store labels and their positions
    labels += df_cat['Class'].to_list()
    label_pos = np.append(label_pos, this_label_pos)
    start += num_in_cat + gap
# set the positions for the labels
ax.set_xticks(label_pos)
# set the labels
ax.set_xticklabels(labels, rotation=30)
# optionally set a new lower position for the y-axis
ax.set_ylim(ymin=9)
# optionally reduce the margin left and right
ax.margins(x=0.01)
plt.tight_layout()
plt.show()
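The position bookkeeping in the loop can be pulled out into a small helper, which makes it easy to check without drawing anything. This is only a sketch of the same idea; group_positions is a hypothetical name, not a matplotlib function.

```python
import numpy as np

def group_positions(group_sizes, gap=1):
    """Return one array of x positions per group, leaving `gap` empty slots between groups."""
    positions, start = [], 0
    for size in group_sizes:
        positions.append(np.arange(start, start + size))
        start += size + gap
    return positions

pos = group_positions([3, 2, 4])
print([p.tolist() for p in pos])      # [[0, 1, 2], [4, 5], [7, 8, 9, 10]]
print([float(p.mean()) for p in pos])  # [1.0, 4.5, 8.5], the centre of each group
```

Each array can be passed straight to ax.bar, and the mean of an array is a convenient x coordinate for the category label.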

Python/Matplotlib - Find the highest value of a group of bars

I have this image from Matplotlib :
I would like to write for each category (cat i with i in [1-10] in the figure) the highest value and its corresponding legend on the graphic.
Below you can find visually what I would like to achieve :
The thing is, I don't know whether this is possible given the way matplotlib draws the bars.
Basically, this is the part of the code for drawing multiple bars :
# create plot
fig, ax = plt.subplots(figsize = (9,9))
index = np.arange(len_category)
if multiple:
    bar_width = 0.3
else :
    bar_width = 1.5
opacity = 1.0
#test_array contains test1 and test2
cmap = get_cmap(len(test_array))
for i in range(len(test_array)):
    count = count + 1
    current_label = test_array[i]
    rects = plt.bar(index-0.2+bar_width*i, score_array[i], bar_width, alpha=opacity, color=np.random.rand(3,1),label=current_label )
plt.xlabel('Categories')
plt.ylabel('Scores')
plt.title('Scores by Categories')
plt.xticks(index + bar_width, categories_array)
plt.legend()
plt.tight_layout()
plt.show()
and this is the part I have added in order to do what I would like to achieve. But it searches the max across all the bars in the graphics. For example, the max of test1 will be in cat10 and the max of test2 will be cat2. Instead, I would like to have the max for each category.
for i in range(len(test_array)):
    count = count + 1
    current_label = test_array[i]
    rects = plt.bar(index-0.2+bar_width*i, score_array[i], bar_width,alpha=opacity,color=np.random.rand(3,1),label=current_label )
    max_score_current = max(score_array[i])
    list_rect = list()
    max_height = 0
    #The id of the rectangle who get the highest score
    max_idx = 0
    for idx,rect in enumerate(rects):
        list_rect.append(rect)
        height = rect.get_height()
        if height > max_height:
            max_height = height
            max_idx = idx
    highest_rect = list_rect[max_idx]
    plt.text(highest_rect.get_x() + highest_rect.get_width()/2.0, max_height, str(test_array[i]),color='blue', fontweight='bold')
    del list_rect[:]
Do you have an idea about how I can achieve that ?
Thank you
It is usually better to keep data generation and visualization separate. Instead of looping through the bars themselves, compute the necessary data before plotting. This makes everything a lot simpler.
So first create a list of labels to use and then loop over the positions to annotate them. In the code below the labels are created by mapping the argmax of each row of the array to the test name via a dictionary.
import numpy as np
import matplotlib.pyplot as plt
test1 = [6,4,5,8,3]
test2 = [4,5,3,4,6]
labeldic = {0:"test1", 1:"test2"}
a = np.c_[test1,test2]
maxi = np.max(a, axis=1)
l = ["{} {}".format(i,labeldic[j]) for i,j in zip(maxi, np.argmax(a, axis=1))]
for i in range(a.shape[1]):
    plt.bar(np.arange(a.shape[0])+(i-1)*0.3, a[:,i], width=0.3, align="edge",
            label = labeldic[i])
for i in range(a.shape[0]):
    plt.annotate(l[i], xy=(i,maxi[i]), xytext=(0,10),
                 textcoords="offset points", ha="center")
plt.margins(y=0.2)
plt.legend()
plt.show()
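If the scores live in a pandas DataFrame rather than plain lists, the same per-category lookup can be sketched with idxmax. The category names cat1 to cat5 are placeholders; the numbers are the ones from the example above.

```python
import pandas as pd

# Rows are categories, columns are the test sets
scores = pd.DataFrame(
    {"test1": [6, 4, 5, 8, 3], "test2": [4, 5, 3, 4, 6]},
    index=["cat1", "cat2", "cat3", "cat4", "cat5"],  # placeholder category names
)

best_test = scores.idxmax(axis=1)  # column name holding the row maximum
best_score = scores.max(axis=1)    # the maximum itself
labels = [f"{s} {t}" for s, t in zip(best_score, best_test)]
print(labels)  # ['6 test1', '5 test2', '5 test1', '8 test1', '6 test2']
```

The labels list can then be passed to plt.annotate in the same way as l above.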
From your question it is not entirely clear what you want to achieve, but assuming that you want the relative height of each bar in one group printed above that bar, here is one way to achieve that:
from matplotlib import pyplot as plt
import numpy as np
score_array = np.random.rand(2,10)
index = np.arange(score_array.shape[1])
test_array=['test1','test2']
opacity = 1
bar_width = 0.25
for i,label in enumerate(test_array):
    rects = plt.bar(index-0.2+bar_width*i, score_array[i], bar_width,alpha=opacity,label=label)
    heights = [r.get_height() for r in rects]
    print(heights)
    rel_heights = [h/max(heights) for h in heights]
    idx = heights.index(max(heights))
    for i,(r,h, rh) in enumerate(zip(rects, heights, rel_heights)):
        plt.text(r.get_x() + r.get_width()/2.0, h, '{:.2}'.format(rh), color='b', fontweight ='bold', ha='center')
plt.show()
The result looks like this:

Changing the colour of a boxplot when its paired

I want to change the colour of the boxplots according to what they represent. They are grouped in pairs, so my question is:
How can I change the colour of the boxplots when they are paired?
The first boxplot of each pair should be blue and the second one red.
This is the code, sorry if it's messy:
def obtenerBoxplotsAnuales(self, directorioEntrada, directorioSalida):
    meses = ["Enero","Febrero","Marzo","Abril","Mayo","Junio", "Julio", "Agosto","Septie.","Octubre","Noviem.","Diciem."]
    ciudades = ["CO","CR"]
    anios = ["2011", "2012", "2013"]
    boxPlotMensual = []
    fig = plt.figure()
    fig.set_size_inches(14.3, 9)
    ax = plt.axes()
    plt.hold(True)
    for anio in anios:
        boxPlotAnual = []
        i=0
        ticks = []
        for mes in range(len(meses)):
            data1 = getSomeData()
            data2 = getSomeData()
            data = [ [int(float(data1[2])), int(float(data1[0])), int(float(data1[1]))],
                     [int(float(data2[2])), int(float(data2[0])), int(float(data2[1]))] ]
            plt.boxplot(data, positions=[i,i+1], widths=0.5)
            ticks.append(i+0.5)
            i=i+2
        hB, = plt.plot([1,1],'b-')
        hR, = plt.plot([1,1],'r-')
        plt.legend((hB, hR),('Caleta', 'Comodoro'))
        hB.set_visible(False)
        hR.set_visible(False)
        ax.set_xticklabels(meses)
        ax.set_xticks(ticks)
        plt.savefig(directorioSalida+"/asdasd"+str(anio)+".ps", orientation='landscape', papertype='A4' )
This is what i get:
I've read that plt.boxplot(...) returns a dict-like object containing lists of the lines it created, so the way to change the colour of each boxplot would be to access them by index. How would that work in this case?
You can set the colour through the dict returned by boxplot as follows:
import matplotlib.pyplot as plt
import numpy as np
nboxes = 10
# fake up some data
spread= np.random.rand(50,nboxes) * 100
center = np.ones((25,nboxes)) * 50
flier_high = np.random.rand(10,nboxes) * 100 + 100
flier_low = np.random.rand(10,nboxes) * -100
data =np.concatenate((spread, center, flier_high, flier_low), 0)
# plot figure
plt.figure()
bp = plt.boxplot(data)
for i, box in enumerate(bp['boxes']):
    #Colour alternate boxes blue and red
    if i%2:
        box.set_color('blue')
    else:
        box.set_color('red')
plt.show()
Here you loop through all boxes in bp['boxes'] and call set_color on each one (other standard matplotlib artist setters, such as box.set_markerfacecolor, work as well). The bp dict contains the keys ['boxes', 'fliers', 'medians', 'means', 'whiskers', 'caps'], whose artists can be changed in the same way.
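For the paired layout in the question, where each boxplot call draws two boxes at positions i and i+1, a possible sketch is to colour the artists right after each call, first box blue and second box red. The random data and the "pair N" tick labels are placeholders. Each box owns two whiskers and two caps, so those lists are indexed in steps of two.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, ax = plt.subplots()
pair_colors = ["blue", "red"]  # first box of each pair, second box of each pair
ticks = []

for pair in range(3):  # three placeholder pairs
    data = [rng.normal(10, 2, 30), rng.normal(12, 2, 30)]
    left = 2 * pair
    bp = ax.boxplot(data, positions=[left, left + 1], widths=0.5)
    for j, color in enumerate(pair_colors):
        bp["boxes"][j].set_color(color)
        bp["medians"][j].set_color(color)
        # whiskers and caps come two per box, stored consecutively
        for line in bp["whiskers"][2 * j: 2 * j + 2] + bp["caps"][2 * j: 2 * j + 2]:
            line.set_color(color)
    ticks.append(left + 0.5)  # tick in the middle of the pair

ax.set_xticks(ticks)
ax.set_xticklabels(["pair 1", "pair 2", "pair 3"])
```

Call plt.show() or fig.savefig(...) afterwards to see the result.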
