How to plot a heatmap using seaborn or matplotlib? - python

I have a dataframe that I am trying to visualize as a heatmap. I used matplotlib to make the heatmap, but it is showing data that is not a part of my dataframe.
I created the heatmap with matplotlib from an example I found online and changed the code to work for my data. But on the left side and the top of the graph there are random values that are not a part of my data, and I'm not sure how to remove them.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
url = 'http://mcubed.net/ncaab/seeds.shtml'
#Getting the website text
data = requests.get(url).text
#Parsing the website
soup = BeautifulSoup(data, "html5lib")
#Create an empty list
dflist = []
#If we look at the html, we don't want the tag b, but whats next to it
#StringIO(b.next.next), takes the correct text and makes it readable to pandas
for b in soup.findAll({"b"})[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next.next), sep = r'\s+', header = None))
dflist[0]
#Created a new list, since the melt we are going to do cannot replace
#the dataframes in dflist
meltedDF = []
#The second item in the loop is the team number starting from 1
for df, teamnumber in zip(dflist, (np.arange(len(dflist))+1)):
    #Creating the team name
    name = "Team " + str(teamnumber)
    #Making the team name a column, with the values in df[0] and df[1] in our dataframes
    df[name] = df[0] + df[1]
    #Melting the dataframe to make the team name its own column
    meltedDF.append(df.melt(id_vars = [0, 1, 2, 3]))
# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)
# Final cleaning of our new single DataFrame
allTeamStats = allTeamStats.rename(columns = {0:name, 2:'Record', 3:'Win Percent', 'variable':'Team' , 'value': 'VS'})\
                           .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis = 1)
allTeamStats
#Graph visualization Making a HeatMap
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
y=["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16"]
x=["16","15","14","13","12","11","10","9","8","7","6","5","4","3","2","1"]
winp = []
for i in x:
    lst = []
    for j in y:
        percent = allTeamStats.loc[(allTeamStats["Team"]== 'Team '+i) &\
                                   (allTeamStats["VS"]== "vs.#"+j)]['Win Percent'].iloc[0]
        percent = float(percent[:-1])
        lst.append(percent)
    winp.append(lst)
winpercentage= np.array([[]])
fig,ax=plt.subplots(figsize=(18,18))
im= ax.imshow(winp, cmap='hot')
# We want to show all ticks...
ax.set_xticks(np.arange(len(y)))
ax.set_yticks(np.arange(len(x)))
# ... and label them with the respective list entries
ax.set_xticklabels(y)
ax.set_yticklabels(x)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(x)):
    for j in range(len(y)):
        text = ax.text(j, i, winp[i][j],
                       ha="center", va="center", color="red")
ax.set_title("Win Percentage of Each Matchup", fontsize= 40)
heatmap = plt.pcolor(winp)
plt.colorbar(heatmap)
ax.set_ylabel('Seeds', fontsize=40)
ax.set_xlabel('Seeds', fontsize=40)
plt.show()
The result is what I want except for the two strips along the left side and the top of the heatmap. I'm not sure where these values are coming from; to make them easier to see I used cmap='hot'. Could you help me fix my code so it plots correctly, or plot an entirely new heatmap with my data using seaborn (my TA told me to try seaborn, but I've never used it)? Anything helps. Thanks!

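I think the culprit is this line: im= ax.imshow(winp, cmap='hot') in your code. Delete it and try again. Basically, anything that you plotted after that line (including plt.pcolor) was laid over what that line created. imshow centres each cell on an integer coordinate, so its image starts at -0.5, while pcolor draws its cells starting at 0; the two layers are therefore shifted by half a cell, and the left and top "margins" were the only parts of the image on the bottom that you could still see.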
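As an alternative sketch (not the answerer's code), you could also keep imshow and drop pcolor instead, because the ticks and text annotations in the question are already placed at integer cell centres, which is where imshow puts its cells. The 16x16 array below is made-up placeholder data standing in for winp, and the colour and font choices are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

seeds = [str(n) for n in range(1, 17)]

# Placeholder data (assumption): random percentages instead of the scraped table
rng = np.random.default_rng(0)
winp = rng.uniform(0, 100, size=(16, 16)).round(1)

fig, ax = plt.subplots(figsize=(10, 10))
im = ax.imshow(winp, cmap="hot")   # the only heatmap layer on the axes
fig.colorbar(im, ax=ax)            # colorbar built from that same image

# Ticks at integer positions line up with the centres of the imshow cells
ax.set_xticks(np.arange(len(seeds)))
ax.set_yticks(np.arange(len(seeds)))
ax.set_xticklabels(seeds)
ax.set_yticklabels(seeds[::-1])

# Annotate every cell with its value
for i in range(winp.shape[0]):
    for j in range(winp.shape[1]):
        ax.text(j, i, f"{winp[i, j]:.1f}", ha="center", va="center",
                color="cyan", fontsize=6)

ax.set_title("Win Percentage of Each Matchup")
ax.set_xlabel("Seeds")
ax.set_ylabel("Seeds")
```

In a notebook you would finish with plt.show(). If you want to try seaborn as your TA suggested, sns.heatmap(winp, annot=True) draws a comparable annotated heatmap in one call.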

Related

Python: graph from csv filtered by pandas shows no graph

I would like to create a graph on filtered data from a .csv file. A graph is created but without content.
Here is the code and the result:
# var 4 graph
xs = []
ys = []
name = "Anna"
gender = "F"
state = "CA"
# 4 reading csv file
import pandas as pd
# reading csv file
dataFrame = pd.read_csv("../Kursmaterialien/data/names.csv")
#print("DataFrame...\n",dataFrame)
# select rows containing text
dataFrame = dataFrame[(dataFrame['Name'] == name)&dataFrame['State'].str.contains(state)&dataFrame['Gender'].str.contains(gender)]
#print("\nFetching rows with text ...\n",dataFrame)
print(dataFrame)
# append var with value
xs.append(list(dataFrame['Year']))
ys.append(list(dataFrame['Count']))
#xs.append(list(map(str,dataFrame['Year'])))
#ys.append(list(map(str,dataFrame['Count'])))
print(xs)
print(ys)
Result from print(xs) and print(ys)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(xs,ys)
plt.show()
Resulting plot:
I see that the variables start with two brackets, but don't know if that is the problem and how to fix it.
The graphic should look something like this :
You are correct about the two brackets: you have to extract the data from the inner list. This is done by using index 0 to get the first element (which is also the only one).
This should work:
XS=xs[0]
YS=ys[0]
plt.plot(XS,YS)
plt.show()
With your double brackets, plt.plot treats each pair of points as a separate dataset. Plotting a single point doesn't draw a line, because markers are off by default. If you add markers to your plot, you will see the points in different colours, but no lines.
plt.plot(xs,ys,'o') #round marker
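A small sketch of the difference, using made-up years and counts rather than the names.csv data: the nested lists produce one line object per point, while the inner lists produce a single line through all points.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up values (assumption) standing in for the filtered Year/Count columns
years = [2000, 2001, 2002]
counts = [10, 20, 15]

xs, ys = [], []
xs.append(years)    # xs is now [[2000, 2001, 2002]], a list inside a list
ys.append(counts)

fig, (ax_nested, ax_flat) = plt.subplots(ncols=2)

# Nested input: each column is treated as its own one-point dataset
nested_lines = ax_nested.plot(xs, ys)

# Inner list only: one dataset with three points, drawn as a connected line
flat_lines = ax_flat.plot(xs[0], ys[0])

print(len(nested_lines), len(flat_lines))
```

Using xs.extend(...) instead of xs.append(...), or plotting dataFrame['Year'] and dataFrame['Count'] directly, would avoid the extra nesting in the first place.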

Plotting multiple excel sheets on the same graph

I have this Excel file that I need to plot. So far my code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
file = 'weatherdata.xlsx'
def plotMeteoData(file,x_axis,metric,*list_of_cities):
    df = pd.ExcelFile(file)
    sheets = df.sheet_names
    print(sheets)
    df_list = []
    for city in list_of_cities:
        df_list.append(df.parse(city))
    x=range(len(df_list[0][x_axis].tolist()))
    width = 0.3
    for j,df_item in enumerate(df_list):
        plt.bar(x,df_item[metric],width,label = sheets[j]) #this isn't working
        x = [i+width for i in x]
    plt.xlabel(x_axis)
    plt.ylabel(metric)
    plt.xticks(range(len(df_list[0][x_axis].tolist())),labels=df_list[0][x_axis].tolist())
    plt.show()
t=['Αθήνα','Θεσσαλονίκη','Πάτρα']
plotMeteoData(file,'Μήνας','Υγρασία',*t)
and gives this output.
Each color represents an excel sheet, x-axis represents the months and y-axis represents some values.
I've commented the line where I'm trying to add a label for each sheet, which I'm unable to do. Also, as the output above shows, the bars aren't centered on each xtick. How can I fix these problems? Thanks
Typically you use plt.subplots, as it gives you more control over the graph. The code below calculates the offset needed for the xtick labels to be centered and shows the legend with the city labels:
import pandas as pd
import matplotlib.pyplot as plt
file = 'weatherdata.xlsx'
def plotMeteoData(file,x_axis,metric,*list_of_cities):
    df = pd.ExcelFile(file)
    sheets = df.sheet_names
    print(sheets)
    df_list = []
    for city in list_of_cities:
        df_list.append(df.parse(city))
    x=range(len(df_list[0][x_axis].tolist()))
    width = 0.3
    # Calculate the offset of the center of the xtick labels
    xTickOffset = width*(len(list_of_cities)-1)/2
    # Create a plot
    fig, ax = plt.subplots()
    for j,df_item in enumerate(df_list):
        # Label each set of bars with the city (sheet) it was read from
        ax.bar(x,df_item[metric],width,label = list_of_cities[j])
        # Shift the next city's bars one bar width to the right
        x = [i+width for i in x]
    ax.set_xlabel(x_axis)
    ax.set_ylabel(metric)
    # Add a legend (feel free to change the location)
    ax.legend(loc='upper right')
    # Add the xTickOffset to the xtick label positions so they are centered
    ax.set_xticks(list(map(lambda x:x+xTickOffset, range(len(df_list[0][x_axis].tolist())))),labels=df_list[0][x_axis].tolist())
    plt.show()
t=['Athena', 'Thessaloniki', 'Patras']
plotMeteoData(file,'Month','Humidity',*t)
Resulting Graph:
The xtick offset accounts for different numbers of Excel sheets. See this for more information on legends.
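To see where the offset comes from, here is a minimal sketch of the arithmetic on its own, without Excel or plotting. The helper name grouped_bar_positions is hypothetical; it assumes bars are centred on their x position (the default for bar) and that each series is shifted right by one bar width, as in the loop above.

```python
def grouped_bar_positions(n_categories, n_series, width=0.3):
    """Return (bar_positions, tick_positions) for a grouped bar chart.

    bar_positions[j][i] is the centre of the bar for series j in category i.
    tick_positions[i] is the centre of the whole group for category i.
    """
    # Series j is shifted right by j bar widths
    bar_positions = [
        [i + j * width for i in range(n_categories)]
        for j in range(n_series)
    ]
    # The group spans from the first bar centre to the last one,
    # so its middle is half of (n_series - 1) bar widths to the right
    tick_offset = width * (n_series - 1) / 2
    tick_positions = [i + tick_offset for i in range(n_categories)]
    return bar_positions, tick_positions


# Three cities over twelve months: the tick sits on the middle city's bar
bars, ticks = grouped_bar_positions(n_categories=12, n_series=3)
print(bars[1][0], ticks[0])
```

With an odd number of series the tick lands on the middle bar; with an even number it lands between the two middle bars.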

Scatterplot subplots with the same colors and markers

I am creating a figure with two subplots with scatter plots. I want to use the same color scheme and marker definitions for each subplot, but can't seem to get it to work. Please forgive the length of my minimal working example, but I trimmed it down as much as I can.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
from scipy.stats import probplot
#Raw Data
area_old = [7603.4897489697905, 2941.7094279413577, 8153.896678990219, 7289.99097646249, 8620.196237363853, 11619.546945954673, 8458.80648310436, 7161.530990460888, 28486.298572761007, 4928.4856128268875, 4219.122621992603, 31687.155529782176]
combined = [7603.4897489697905, 2941.7094279413577, 8153.896678990219, 7289.99097646249, 8620.196237363853, 11619.546945954673, 8458.80648310436, 7161.530990460888, 28486.298572761007, 4928.4856128268875, 4219.122621992603, 31687.155529782176, 3059.4357099599456, 3348.0415691055823, 4839.023360449559, 4398.877634354169, 29269.67455441528, 11058.400909555028, 18266.34679952683, 16641.3446048029, 24983.586163502885, 5811.868753338233]
#Attributes to map colors and markers to
lt_bt = ['r','s','s','r','r','u','r','s','r','r','s','r']
combined_bt =['r','s','s','r','r','u','r','s','r','r','s','r','u','u','r','s','r','s','r','r','r','u']
#Get Probability plot Data
a = probplot(area_old,dist='norm', plot=None)
b= probplot(combined,dist='norm', plot=None)
#Colors and Markers to use
colors = {'r':'red','s':'blue', 'u':'green'}
markers = {'r':'*','s':'x', 'u':'o'}
#Create Dataframe to combine raw data, attributes and sort
old_df = pd.DataFrame(area_old, columns=['Long Term Sites: N=12'])
old_df['Bar_Type'] = lt_bt
old_df = old_df.sort_values(by='Long Term Sites: N=12')
old_df['quart']=a[0][0]
#Pandas series of colors for plotting on subplot 'ax'
ax_color = old_df.loc[:,'Bar_Type'].apply(lambda x: colors[x])
#Create Dataframe to combine raw data, attributes and sort
combined_df = pd.DataFrame(combined, columns=['ALL SITES N=22'])
combined_df['Bar_Type'] = combined_bt
combined_df = combined_df.sort_values(by='ALL SITES N=22')
combined_df['quart']=b[0][0]
#Pandas series of colors for plotting on subplot 'ax1'
ax1_color = combined_df.loc[:,'Bar_Type'].apply(lambda x: colors[x])
#Legend Handles
undif = plt.Line2D([0,0],[0,1], color='green',marker='o',linestyle=' ')
reatt = plt.Line2D([0,0],[0,1], color='red',marker='*',linestyle=' ')
sep = plt.Line2D([0,0],[0,1], color='blue',marker='x',linestyle=' ')
fig,(ax,ax1) = plt.subplots(ncols=2,sharey=True)
#Plot each data point separately with different markers and colors
for i, thing in old_df.iterrows():
    ax.scatter(thing['quart'],thing['Long Term Sites: N=12'],c=ax_color.iloc[i],marker=markers[thing['Bar_Type']],zorder=10,s=50)
del i, thing
for i , thing in combined_df.iterrows():
    ax1.scatter(thing['quart'],thing['ALL SITES N=22'],c=ax1_color.iloc[i],marker=markers[thing['Bar_Type']],zorder=10,s=50)
del i, thing
ax.set_title('LONG TERM SITES N=12')
ax1.set_title('ALL SITES N=22')
ax1.set_ylabel('')
ax.set_ylabel('TOTAL EDDY AREA, IN METERS SQUARED')
ax.set_ylim(0,35000)
ax.get_yaxis().set_major_formatter(tkr.FuncFormatter(lambda x, p: format(int(x), ',')))
legend = ax.legend([reatt,sep,undif],["Reattachment","Separation", "Undifferentiated"],loc=2,title='Bar Type',fontsize='x-small')
plt.setp(legend.get_title(),fontsize='x-small')
ax.set_xlabel('QUANTILES')
ax1.set_xlabel('QUANTILES')
plt.tight_layout()
The basic idea is I am plotting scatter plots, point-by-point, to assign the appropriate color and marker. I assign colors using pandas integer indexing .iloc and assign marker by specifying a key for the markers dictionary.
I know something isn't right because the first point in old_df and combined_df (i.e. old_df.loc[1,:], combined_df.loc[1,:]) should have the color and marker of 'blue' and 'x', respectively.
What am I doing wrong?
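The problem is the .iloc lookup in ax.scatter. iterrows() yields each row's index label, and after sort_values those labels no longer match the row positions that .iloc expects, so the wrong color gets picked. All I had to do was remove the .iloc lookup and replace it with a dictionary mapping (i.e. change c=ax_color.iloc[i] to c=colors[thing['Bar_Type']]), and everything works fine!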
A working example of the desired result:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
from scipy.stats import probplot
#Raw Data
area_old = [7603.4897489697905, 2941.7094279413577, 8153.896678990219, 7289.99097646249, 8620.196237363853, 11619.546945954673, 8458.80648310436, 7161.530990460888, 28486.298572761007, 4928.4856128268875, 4219.122621992603, 31687.155529782176]
combined = [7603.4897489697905, 2941.7094279413577, 8153.896678990219, 7289.99097646249, 8620.196237363853, 11619.546945954673, 8458.80648310436, 7161.530990460888, 28486.298572761007, 4928.4856128268875, 4219.122621992603, 31687.155529782176, 3059.4357099599456, 3348.0415691055823, 4839.023360449559, 4398.877634354169, 29269.67455441528, 11058.400909555028, 18266.34679952683, 16641.3446048029, 24983.586163502885, 5811.868753338233]
#Attributes to map colors and markers to
lt_bt = ['r','s','s','r','r','u','r','s','r','r','s','r']
combined_bt =['r','s','s','r','r','u','r','s','r','r','s','r','u','u','r','s','r','s','r','r','r','u']
#Get Probability plot Data
a = probplot(area_old,dist='norm', plot=None)
b= probplot(combined,dist='norm', plot=None)
#Colors and Markers to use
colors = {'r':'red','s':'blue', 'u':'green'}
markers = {'r':'*','s':'x', 'u':'o'}
#Create Dataframe to combine raw data, attributes and sort
old_df = pd.DataFrame(area_old, columns=['Long Term Sites: N=12'])
old_df['Bar_Type'] = lt_bt
old_df = old_df.sort_values(by='Long Term Sites: N=12')
old_df['quart']=a[0][0]
#Pandas series of colors for plotting on subplot 'ax'
ax_color = old_df.loc[:,'Bar_Type'].apply(lambda x: colors[x])
#Create Dataframe to combine raw data, attributes and sort
combined_df = pd.DataFrame(combined, columns=['ALL SITES N=22'])
combined_df['Bar_Type'] = combined_bt
combined_df = combined_df.sort_values(by='ALL SITES N=22')
combined_df['quart']=b[0][0]
#Pandas series of colors for plotting on subplot 'ax1'
ax1_color = combined_df.loc[:,'Bar_Type'].apply(lambda x: colors[x])
#Legend Handles
undif = plt.Line2D([0,0],[0,1], color='green',marker='o',linestyle=' ')
reatt = plt.Line2D([0,0],[0,1], color='red',marker='*',linestyle=' ')
sep = plt.Line2D([0,0],[0,1], color='blue',marker='x',linestyle=' ')
fig,(ax,ax1) = plt.subplots(ncols=2,sharey=True)
#Plot each data point separately with different markers and colors
for i, thing in old_df.iterrows():
    ax.scatter(thing['quart'],thing['Long Term Sites: N=12'],c=colors[thing['Bar_Type']],marker=markers[thing['Bar_Type']],zorder=10,s=50)
del i, thing
for i , thing in combined_df.iterrows():
    ax1.scatter(thing['quart'],thing['ALL SITES N=22'],c=colors[thing['Bar_Type']],marker=markers[thing['Bar_Type']],zorder=10,s=50)
del i, thing
ax.set_title('LONG TERM SITES N=12')
ax1.set_title('ALL SITES N=22')
ax1.set_ylabel('')
ax.set_ylabel('TOTAL EDDY AREA, IN METERS SQUARED')
ax.set_ylim(0,35000)
ax.get_yaxis().set_major_formatter(tkr.FuncFormatter(lambda x, p: format(int(x), ',')))
legend = ax.legend([reatt,sep,undif],["Reattachment","Separation", "Undifferentiated"],loc=2,title='Bar Type',fontsize='x-small')
plt.setp(legend.get_title(),fontsize='x-small')
ax.set_xlabel('QUANTILES')
ax1.set_xlabel('QUANTILES')
plt.tight_layout()
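A likely explanation for the .iloc behavior, sketched with a tiny made-up DataFrame rather than the eddy data: after sort_values the index labels are out of order, iterrows() hands back those labels, and .iloc interprets them as positions. Looking up by label with .loc (or mapping the row's own Bar_Type, as above) stays consistent.

```python
import pandas as pd

colors = {"r": "red", "s": "blue", "u": "green"}

# Made-up three-row frame (assumption); sorting reorders the index to [1, 2, 0]
df = pd.DataFrame({"area": [30.0, 10.0, 20.0], "bar_type": ["r", "s", "u"]})
df = df.sort_values(by="area")
color_series = df["bar_type"].map(colors)

rows = []
for label, row in df.iterrows():
    rows.append({
        "label": label,
        "expected": colors[row["bar_type"]],
        # .iloc treats the label as a position in the sorted series
        "by_position": color_series.iloc[label],
        # .loc looks the label up as a label
        "by_label": color_series.loc[label],
    })

for r in rows:
    print(r)
```

Resetting the index after sorting (df.reset_index(drop=True)) would be another way to make labels and positions agree again.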

MatPlotlib Seaborn Multiple Plots formatting

I am translating a set of R visualizations to Python. I have the following target R multiple plot histograms:
Using Matplotlib and Seaborn combination and with the help of a kind StackOverflow member (see the link: Python Seaborn Distplot Y value corresponding to a given X value), I was able to create the following Python plot:
I am satisfied with its appearance, except, I don't know how to put the Header information in the plots. Here is my Python code that creates the Python Charts
""" Program to draw the sampling histogram distributions """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
def main():
    """ Main routine for the sampling histogram program """
    sns.set_style('whitegrid')
    markers_list = ["s", "o", "*", "^", "+"]
    # create the data dataframe as df_orig
    df_orig = pd.read_csv('lab_samples.csv')
    df_orig = df_orig.loc[df_orig.hra != -9999]
    hra_list_unique = df_orig.hra.unique().tolist()
    # create and subset df_hra_colors to match the actual hra colors in df_orig
    df_hra_colors = pd.read_csv('hra_lookup.csv')
    df_hra_colors['hex'] = np.vectorize(rgb_to_hex)(df_hra_colors['red'], df_hra_colors['green'], df_hra_colors['blue'])
    df_hra_colors.drop(labels=['red', 'green', 'blue'], axis=1, inplace=True)
    df_hra_colors = df_hra_colors.loc[df_hra_colors['hra'].isin(hra_list_unique)]
    # hard coding the current_component to pc1 here, we will extend it by looping
    # through the list of components
    current_component = 'pc1'
    num_tests = 5
    df_columns = df_orig.columns.tolist()
    start_index = 5
    for test in range(num_tests):
        current_tests_list = df_columns[start_index:(start_index + num_tests)]
        # now create the sns distplots for each HRA color and overlay the tests
        i = 1
        for _, row in df_hra_colors.iterrows():
            plt.subplot(3, 3, i)
            select_columns = ['hra', current_component] + current_tests_list
            df_current_color = df_orig.loc[df_orig['hra'] == row['hra'], select_columns]
            y_data = df_current_color.loc[df_current_color[current_component] != -9999, current_component]
            axs = sns.distplot(y_data, color=row['hex'],
                               hist_kws={"ec":"k"},
                               kde_kws={"color": "k", "lw": 0.5})
            data_x, data_y = axs.lines[0].get_data()
            axs.text(0.0, 1.0, row['hra'], horizontalalignment="left", fontsize='x-small',
                     verticalalignment="top", transform=axs.transAxes)
            for current_test_index, current_test in enumerate(current_tests_list):
                # this_x defines the series of current_component(pc1,pc2,rhob) for this test
                # indicated by 1, corresponding R program calls this test_vector
                x_series = df_current_color.loc[df_current_color[current_test] == 1, current_component].tolist()
                for this_x in x_series:
                    this_y = np.interp(this_x, data_x, data_y)
                    axs.plot([this_x], [this_y - current_test_index * 0.05],
                             markers_list[current_test_index], markersize = 3, color='black')
            axs.xaxis.label.set_visible(False)
            axs.xaxis.set_tick_params(labelsize=4)
            axs.yaxis.set_tick_params(labelsize=4)
            i = i + 1
        start_index = start_index + num_tests
    # plt.show()
    pp = PdfPages('plots.pdf')
    pp.savefig()
    pp.close()
def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)
if __name__ == "__main__":
    main()
The Pandas code works fine and it is doing what it is supposed to. It is my lack of knowledge and experience of using 'PdfPages' in Matplotlib that is the bottleneck. How can I show the header information in Python/Matplotlib/Seaborn that I can show in the corresponding R visualization? By the Header information, I mean what the R visualization has at the top before the histograms, i.e., 'pc1', MRP, XRD,....
I can get their values easily from my program, e.g., current_component is 'pc1', etc. But I don't know how to format the plots with the Header. Can someone provide some guidance?
You may be looking for a figure title or super title, fig.suptitle:
fig.suptitle('this is the figure title', fontsize=12)
In your case you can easily get the figure with plt.gcf(), so try
plt.gcf().suptitle("pc1")
The rest of the information in the header would be called a legend.
For the following let's suppose all subplots have the same markers. It would then suffice to create a legend for one of the subplots.
To create legend labels, you can pass the label argument to plot, i.e.
axs.plot( ... , label="MRP")
When later calling axs.legend() a legend will automatically be generated with the respective labels. Ways to position the legend are detailed e.g. in this answer.
Here, you may want to place the legend in terms of figure coordinates, i.e.
axs.legend(loc="lower center",bbox_to_anchor=(0.5,0.8),bbox_transform=plt.gcf().transFigure)
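Putting the two pieces together, here is a minimal sketch on an empty 3x3 grid. The header values 'pc1', 'MRP' and 'XRD' come from the question; the marker assignment, the dummy points and the anchor position are assumptions chosen only so the example runs.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 3, figsize=(8, 8))

# Figure-level title, playing the role of the 'pc1' header
title = fig.suptitle("pc1", fontsize=12)

# Hypothetical mapping of test names to markers (assumption)
test_markers = {"MRP": "s", "XRD": "o"}

# Label the markers on one subplot only; that is enough to build the legend
first_ax = axes[0, 0]
for offset, (test_name, marker) in enumerate(test_markers.items()):
    first_ax.plot([0.2, 0.5, 0.8], [0.5 - 0.1 * offset] * 3, marker,
                  color="black", markersize=3, label=test_name)

# Place the legend in figure coordinates, just below the figure title
legend = first_ax.legend(loc="lower center", bbox_to_anchor=(0.5, 0.9),
                         bbox_transform=fig.transFigure,
                         ncol=len(test_markers), fontsize="x-small")

# Leave room at the top for the title and the legend
fig.subplots_adjust(top=0.88)
```

The bbox_to_anchor and top values usually need some trial and error for a given figure size.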

Parsing CSV file using Panda

I have been using matplotlib for quite some time now and it is great; however, I want to switch to pandas, and my first attempt at it didn't go so well.
My data set looks like this:
sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
etc.....
First I want to search the csv for anything with the name mike and create its own list. With this list I want to be able to do some math, for example sam[3] + winter[4] or sam[1]/10. The last part would be to plot the columns against each other.
Going through this page
http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
The only thing I see is if I have a column header, however, I don't have any headers. I only know the position in a row of the values I want.
So my question is:
How do I create a bunch of lists, one for each name (sam, winter, summer)?
Is this method efficient if my csv has millions of data points?
Could I use matplotlib plotting to plot pandas dataframe?
ie :
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike[1], winter[3], label='Mike vs Winter speed', color = 'red')
You can read a csv without headers:
data=pd.read_csv(filepath, header=None)
Columns will be numbered starting from 0.
Selecting and filtering:
all_summers = data[data[0]=='summer']
If you want to do some operations grouping by the first column, it will look like this:
data.groupby(0).sum()
data.groupby(0).count()
...
Selecting a row after grouping:
sums = data.groupby(0).sum()
sums.loc['sam']
Plotting example:
sums.plot()
import matplotlib.pyplot as plt
plt.show()
For more details about plotting, see: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html
Alternatively, select the rows for each name:
df = pd.read_csv(filepath, header=None)
mike = df[df[0]=='mike']
winter = df[df[0]=='winter']
Then you can plot columns of those selections as you wanted to above (the two columns passed to ax.plot must have the same length):
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike[1].tolist(), winter[3].tolist(), label='Mike vs Winter speed', color = 'red')
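A self-contained sketch of the steps above, using the sample rows from the question fed through StringIO instead of a file path (that substitution is the only assumption):

```python
from io import StringIO

import pandas as pd

# Sample rows from the question; a real script would pass a file path instead
raw = """sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
"""

# header=None: columns are numbered 0..4 because the file has no header row
data = pd.read_csv(StringIO(raw), header=None)

# Boolean mask selects every row whose first column is 'summer'
all_summers = data[data[0] == "summer"]

# Group by the name column and sum the numeric columns per name
sums = data.groupby(0).sum()
sam_total = sums.loc["sam", 1]

# One column of one name as a plain list, e.g. for plotting or arithmetic
winter_col2 = data.loc[data[0] == "winter", 2].tolist()

print(all_summers)
print(sam_total, winter_col2)
```

Boolean masks and groupby are vectorized, so this approach generally scales to large files much better than building Python lists row by row.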
