Plotting multiple excel sheets on the same graph - python

I have an Excel file that I need to plot. So far my code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
file = 'weatherdata.xlsx'
def plotMeteoData(file,x_axis,metric,*list_of_cities):
    df = pd.ExcelFile(file)
    sheets = df.sheet_names
    print(sheets)
    df_list = []
    for city in list_of_cities:
        df_list.append(df.parse(city))
    x=range(len(df_list[0][x_axis].tolist()))
    width = 0.3
    for j,df_item in enumerate(df_list):
        plt.bar(x,df_item[metric],width,label = sheets[j]) #this isn't working
        x = [i+width for i in x]
    plt.xlabel(x_axis)
    plt.ylabel(metric)
    plt.xticks(range(len(df_list[0][x_axis].tolist())),labels=df_list[0][x_axis].tolist())
    plt.show()
t=['Αθήνα','Θεσσαλονίκη','Πάτρα']
plotMeteoData(file,'Μήνας','Υγρασία',*t)
and gives this output.
Each color represents an excel sheet, x-axis represents the months and y-axis represents some values.
I've marked with a comment the line where I'm trying to add a label for each sheet, but I can't get the labels to show. Also, as the output above shows, the bars aren't centered on each xtick. How can I fix these problems? Thanks

Typically you use plt.subplots, as it gives you more control over the graph. The code below calculates the offset needed for the xtick labels to be centered and shows the legend with the city labels:
import pandas as pd
import matplotlib.pyplot as plt
file = 'weatherdata.xlsx'
def plotMeteoData(file,x_axis,metric,*list_of_cities):
    df = pd.ExcelFile(file)
    sheets = df.sheet_names
    print(sheets)
    df_list = []
    for city in list_of_cities:
        df_list.append(df.parse(city))
    x=range(len(df_list[0][x_axis].tolist()))
    width = 0.3
    # Calculate the offset of the center of the xtick labels
    xTickOffset = width*(len(list_of_cities)-1)/2
    # Create a plot
    fig, ax = plt.subplots()
    for j,df_item in enumerate(df_list):
        # Label each series with the city whose sheet was parsed, so the legend
        # stays correct even if the workbook's sheet order differs
        ax.bar(x,df_item[metric],width,label = list_of_cities[j])
        x = [i+width for i in x]
    ax.set_xlabel(x_axis)
    ax.set_ylabel(metric)
    # Add a legend (feel free to change the location)
    ax.legend(loc='upper right')
    # Add the xTickOffset to the xtick label positions so they are centered
    ax.set_xticks(list(map(lambda x:x+xTickOffset, range(len(df_list[0][x_axis].tolist())))),labels=df_list[0][x_axis].tolist())
    plt.show()
t=['Athena', 'Thessaloniki', 'Patras']
plotMeteoData(file,'Month','Humidity',*t)
Resulting Graph:
The xtick offset is computed from the number of cities, so it should stay correct for any number of Excel sheets. See the matplotlib documentation for more information on legends.
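To see the centering arithmetic without needing the workbook, here is a minimal sketch. The helper name grouped_bar_positions and the humidity numbers are made up for illustration; only the offset formula width*(n-1)/2 comes from the answer above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def grouped_bar_positions(n_groups, n_series, width):
    """Return (bar positions per series, centered tick positions)."""
    base = list(range(n_groups))
    # Series j is shifted j bar-widths to the right of the first series
    positions = [[x + j * width for x in base] for j in range(n_series)]
    # The center of a cluster is half the distance between its first and last bar
    tick_offset = width * (n_series - 1) / 2
    ticks = [x + tick_offset for x in base]
    return positions, ticks


# Hypothetical values standing in for the Excel sheets
months = ["Jan", "Feb", "Mar"]
humidity = {
    "Athens": [70, 68, 65],
    "Thessaloniki": [75, 72, 69],
    "Patras": [72, 70, 66],
}

width = 0.25
positions, ticks = grouped_bar_positions(len(months), len(humidity), width)

fig, ax = plt.subplots()
for xs, (city, values) in zip(positions, humidity.items()):
    ax.bar(xs, values, width, label=city)
ax.set_xticks(ticks)
ax.set_xticklabels(months)
ax.legend()
```

With three series the ticks coincide with the middle bar of each cluster; with an even number of series they fall on the boundary between the two middle bars.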

Related

Plotting multiple columns on the same figure using pandas [duplicate]

I am trying to plot multiple different lines on the same graph using pandas and matplotlib.
I have a series of 100 synthetic temperature histories that I want to plot, all in grey, against the original real temperature history I used to generate the data.
How can I plot all of these series on the same graph? I know how to do it in MATLAB but am using Python for this project, and pandas has been the easiest way I have found to read in every single column from the output file without having to specify each column individually. The number of columns of data will change from 100 up to 1000, so I need a generic solution. My code plots each of the data series individually fine, but I just need to work out how to add them both to the same figure.
Here is the code so far:
# dirPath is the path to my working directory
outputFile = "output.csv"
original_data = "temperature_data.csv"
# Read in the synthetic temperatures from the output file, time is the index in the first column
data = pd.read_csv(outputFile,header=None, skiprows=1, index_col=0)
# Read in the original temperature data, time is the index in the first column
orig_data = pd.read_csv(dirPath+original_data,header=None, skiprows=1, index_col=0)
# Convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
# Plot all columns of synthetic data in grey
data = data.plot.line(title="ARMA Synthetic Temperature Histories",
                      xlabel="Time (yrs)",
                      ylabel=("Synthetic average hourly temperature (C)"),
                      color="#929591",
                      legend=None)
# Plot one column of original data in black
orig_data = orig_data.plot.line(color="k",legend="Original temperature data")
# Create and save figure
fig = data.get_figure()
fig = orig_data.get_figure()
fig.savefig("temp_arma.png")
This is some example data for the output data:
And this is the original data:
Plotting each individually gives these graphs - I just want them overlaid!
Your data.plot.line call returns a matplotlib Axes instance; you can capture it and pass it to your second command:
# plot 1
ax = data.plot.line(…)
# plot 2
data.plot.line(…, ax=ax)
Try to run this code:
# convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
# Plot all columns of synthetic data in grey
ax = data.plot.line(title="ARMA Synthetic Temperature Histories",
                    xlabel="Time (yrs)",
                    ylabel=("Synthetic average hourly temperature (C)"),
                    color="#929591",
                    legend=None)
# Plot one column of original data in black
orig_data.plot.line(color="k",legend="Original temperature data", ax=ax)
# Create and save figure
ax.figure.savefig("temp_arma.png")
You can also use matplotlib functions directly. This offers more control and is easy to use as well.
Part 1 - Reading files (borrowing your code)
# Read in the synthetic temperatures from the output file, time is the index in the first column
data = pd.read_csv(outputFile,header=None, skiprows=1, index_col=0)
# Read in the original temperature data, time is the index in the first column
orig_data = pd.read_csv(dirPath+original_data,header=None, skiprows=1, index_col=0)
# Convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
Part 2 - Plotting
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10,8))
ax = plt.gca()
# Plotting all histories:
# time is the index (index_col=0), so every column is a temperature history
for col in data.columns:
    ax.plot(data.index, data[col], color='grey')
# Orig: the single data column of the original file
ax.plot(orig_data.index, orig_data.iloc[:, 0], color='k')
# axis labels
ax.set_xlabel("Time (yrs)")
ax.set_ylabel("average hourly temperature (C)")
fig.savefig("temp_arma.png")
Try the following:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data.plot.line(ax=ax,
               title="ARMA Synthetic Temperature Histories",
               xlabel="Time (yrs)",
               ylabel="Synthetic average hourly temperature (C)",
               color="#929591",
               legend=False)
orig_data.rename(columns={orig_data.columns[0]: "Original temperature data"},
                 inplace=True)
orig_data.plot.line(ax=ax, color="k")
It's pretty much your original code with the following slight modifications:
Getting the ax object
fig, ax = plt.subplots()
and using it for the plotting
data.plot.line(ax=ax, ...
...
orig_data.plot.line(ax=ax, ...)
Result for some randomly generated sample data:
import random # For sample data only
import pandas as pd
# Sample data
data = pd.DataFrame({
    f'col_{i}': [random.random() for _ in range(25)]
    for i in range(1, 50)
})
orig_data = pd.DataFrame({
    'col_0': [random.random() for _ in range(25)]
})

How to label my x-axis with years extracted from my time-series data?

I have data in this format / shape etc in a dataframe that I would like to represent in the form of a graph showing the total counts per each month. I have resampled the data so that it shows one row for one month, and then I wrote the following code to chart it out:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates  # needed for the locators and formatter used below
#Read in data & create total column
stacked_bar_data = new_df
stacked_bar_data["total"] = stacked_bar_data.var1 + stacked_bar_data.var2
#Set general plot properties
sns.set_style("whitegrid")
sns.set_context({"figure.figsize": (24, 10)})
sns.set_context("poster")
#Plot 1 - background - "total" (top) series
sns.barplot(x = stacked_bar_data.index, y = stacked_bar_data.total, color = "red")
#Plot 2 - overlay - "bottom" series
bottom_plot = sns.barplot(x = stacked_bar_data.index, y = stacked_bar_data.attended, color = "#0000A3")
topbar = plt.Rectangle((0,0),1,1,fc="red", edgecolor = 'none')
bottombar = plt.Rectangle((0,0),1,1,fc='#0000A3', edgecolor = 'none')
l = plt.legend([bottombar, topbar], ['var1', 'var2'], loc=1, ncol = 2, prop={'size':18})
l.draw_frame(False)
#Optional code - Make plot look nicer
sns.despine(left=True)
bottom_plot.set_ylabel("Count")
# bottom_plot.set_xlabel("date")
#Set fonts to consistent 16pt size
for item in ([bottom_plot.xaxis.label, bottom_plot.yaxis.label] +
             bottom_plot.get_xticklabels() + bottom_plot.get_yticklabels()):
    item.set_fontsize(16)
# making sure our xticks is formatted correctly
plt.xticks(fontsize=20)
years = mdates.YearLocator() # every year
months = mdates.MonthLocator() # every month
years_fmt = mdates.DateFormatter('%Y')
bottom_plot.xaxis.set_major_locator(years)
bottom_plot.xaxis.set_major_formatter(years_fmt)
bottom_plot.xaxis.set_minor_locator(months)
plt.show()
# bottom_plot.axes.xaxis.set_visible(False)
Thing is, my chart doesn't show me the years at the bottom. I believe I have all the pieces necessary to solve this problem, but for some reason I can't figure out what I'm doing wrong.
I think I'm doing something wrong with how I set up the subplots of the sns.barplot. Maybe I should be assigning them to fig and ax or something like that? That's how I saw it done on the matplotlib site. I just can't manage to transfer that logic over to my example.
Any help would be most appreciated. Thanks!
There are a few things to consider. First of all, convert your date column (new_df.date) to datetime:
new_df.date = pd.to_datetime(new_df.date)
Second, do not use this part:
bottom_plot.xaxis.set_major_locator(years)
bottom_plot.xaxis.set_major_formatter(years_fmt)
bottom_plot.xaxis.set_minor_locator(months)
Instead use:
x_dates = stacked_bar_data['date'].dt.strftime('%Y').sort_values().unique()
bottom_plot.set_xticklabels(labels=x_dates, rotation=0, ha='center')
This is because seaborn relocates the bars to integer positions (0, 1, 2, ...), even if the x values are dates. Note also that you passed the index explicitly as x. Below is a fully working example. Note that it gives you major ticks only; you'll have to work out the minor ticks yourself. My comments, and the lines I've commented out, are marked with a double #.
stacked_bar_data.date = pd.to_datetime(stacked_bar_data.date)
stacked_bar_data["total"] = stacked_bar_data.var1 + stacked_bar_data.var2
#Set general plot properties
sns.set_style("whitegrid")
sns.set_context({"figure.figsize": (14, 7)}) ## modified size :)
sns.set_context("poster")
years = mdates.YearLocator() # every year
months = mdates.MonthLocator() # every month
years_fmt = mdates.DateFormatter('%Y')
sns.barplot(x = stacked_bar_data.index, y = stacked_bar_data.total, color = "red")
bottom_plot = sns.barplot(x = stacked_bar_data.index, y = stacked_bar_data.attended, color = "#0000A3")
topbar = plt.Rectangle((0,0),1,1,fc="red", edgecolor = 'none')
bottombar = plt.Rectangle((0,0),1,1,fc='#0000A3', edgecolor = 'none')
l = plt.legend([bottombar, topbar], ['var1', 'var2'], loc=1, ncol = 2, prop={'size':18})
l.draw_frame(False)
#Optional code - Make plot look nicer
sns.despine(left=True)
bottom_plot.set_ylabel("Count")
# bottom_plot.set_xlabel("date")
# making sure our xticks is formatted correctly
## plt.xticks(fontsize=20) # not needed as you change font below in the loop
## Do not use at all
## bottom_plot.xaxis.set_major_locator(years)
## bottom_plot.xaxis.set_major_formatter(years_fmt)
## bottom_plot.xaxis.set_minor_locator(months)
#Set fonts to consistent 16pt size
for item in ([bottom_plot.xaxis.label, bottom_plot.yaxis.label] +
             bottom_plot.get_xticklabels() + bottom_plot.get_yticklabels()):
    item.set_fontsize(16)
## This part is required if you want to stick to seaborn
## This is because the moment you start using seaborn it will "re-position" the bars
## at integer position rather than dates. W/o seaborn there is no such need
x_dates = stacked_bar_data['date'].dt.strftime('%Y').sort_values().unique()
bottom_plot.set_xticklabels(labels=x_dates, rotation=0, ha='center')
plt.show()
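If the frame holds one row per month, there are more bars than distinct years, so one label per year cannot simply be spread over all the ticks. One possible workaround is to keep a tick only on the first bar of each calendar year. The sketch below assumes the bars sit at integer positions 0, 1, 2, ... (as described above for seaborn) and uses plain ax.bar with made-up monthly counts as a stand-in; the dates and values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly data: 30 months starting in July 2018
dates = pd.date_range("2018-07-01", periods=30, freq="MS")
counts = pd.Series(range(30), index=dates)

fig, ax = plt.subplots()
positions = list(range(len(counts)))  # assumed categorical positions 0, 1, 2, ...
ax.bar(positions, counts.values)

# Keep a major tick only where a new calendar year starts (the January bars)
year_starts = [i for i, d in enumerate(dates) if d.month == 1]
year_labels = [str(dates[i].year) for i in year_starts]
ax.set_xticks(year_starts)
ax.set_xticklabels(year_labels, rotation=0, ha="center")
```

A partial first year (2018 here) gets no label with this rule; adding position 0 to year_starts would be one way to label it as well.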

How to plot a heatmap using seaborn or matplotlib?

I have a dataframe that I am trying to visualize as a heatmap. I used matplotlib to make the heatmap, but it is showing data that is not part of my dataframe.
I created the heatmap from a matplotlib example I found online, changing the code to work with my data. But on the left side and the top of the graph there are random values that are not part of my data, and I'm not sure how to remove them.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
url = 'http://mcubed.net/ncaab/seeds.shtml'
#Getting the website text
data = requests.get(url).text
#Parsing the website
soup = BeautifulSoup(data, "html5lib")
#Create an empty list
dflist = []
#If we look at the html, we don't want the tag b, but what's next to it
#StringIO(b.next.next) takes the correct text and makes it readable to pandas
for b in soup.findAll({"b"})[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next.next), sep = r'\s+', header = None))
dflist[0]
#Created a new list, because after the melt we cannot replace
#the dataframes in dflist
meltedDF = []
#The second item in the loop is the team number starting from 1
for df, teamnumber in zip(dflist, (np.arange(len(dflist))+1)):
    #Creating the team name
    name = "Team " + str(teamnumber)
    #Making the team name a column, with the values in df[0] and df[1] in our dataframes
    df[name] = df[0] + df[1]
    #Melting the dataframe to make the team name its own column
    meltedDF.append(df.melt(id_vars = [0, 1, 2, 3]))
# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)
# Final cleaning of our new single DataFrame
allTeamStats = allTeamStats.rename(columns = {0:name, 2:'Record', 3:'Win Percent', 'variable':'Team' , 'value': 'VS'})\
    .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis = 1)
allTeamStats
#Graph visualization Making a HeatMap
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
y=["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16"]
x=["16","15","14","13","12","11","10","9","8","7","6","5","4","3","2","1"]
winp = []
for i in x:
    lst = []
    for j in y:
        percent = allTeamStats.loc[(allTeamStats["Team"]== 'Team '+i) &\
                                   (allTeamStats["VS"]== "vs.#"+j)]['Win Percent'].iloc[0]
        percent = float(percent[:-1])
        lst.append(percent)
    winp.append(lst)
winpercentage= np.array([[]])
fig,ax=plt.subplots(figsize=(18,18))
im= ax.imshow(winp, cmap='hot')
# We want to show all ticks...
ax.set_xticks(np.arange(len(y)))
ax.set_yticks(np.arange(len(x)))
# ... and label them with the respective list entries
ax.set_xticklabels(y)
ax.set_yticklabels(x)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(x)):
    for j in range(len(y)):
        text = ax.text(j, i, winp[i][j],
                       ha="center", va="center", color="red")
ax.set_title("Win Percentage of Each Matchup", fontsize= 40)
heatmap = plt.pcolor(winp)
plt.colorbar(heatmap)
ax.set_ylabel('Seeds', fontsize=40)
ax.set_xlabel('Seeds', fontsize=40)
plt.show()
The results are what I want, except for the two strips on the left side and top of the heatmap. I'm not sure where these values are coming from; to make them easier to see, I used cmap='hot'. Could you help me fix my code so it plots correctly, or plot an entirely new heatmap with my data using seaborn (my TA told me to try seaborn, but I've never used it)? Anything helps. Thanks!
I think the culprit is this line: im= ax.imshow(winp, cmap='hot') in your code. Delete it and try again. Basically, anything that you plotted after that line was laid over what that line created. The left and top "margins" were the only parts of the image on the bottom that you could see.
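A variation on the same idea, offered as a sketch rather than the answerer's exact fix: draw only one image on the axes and build the colorbar from it. Here the imshow call is kept and the pcolor call is dropped (the opposite choice from the one suggested above), because imshow centers cell (i, j) on the point x=j, y=i, which is where the text annotations are placed. The 4x4 random matrix is a made-up stand-in for the scraped win percentages.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 4x4 win-percentage matrix in place of the scraped data
rng = np.random.default_rng(0)
winp = rng.uniform(0, 100, size=(4, 4)).round(1)
labels = ["1", "2", "3", "4"]

fig, ax = plt.subplots()
im = ax.imshow(winp, cmap="hot")  # the only image drawn on these axes
fig.colorbar(im, ax=ax)           # colorbar built from that same image

ax.set_xticks(np.arange(len(labels)))
ax.set_yticks(np.arange(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)

# Annotate every cell; (j, i) is the center of cell (i, j) for imshow
for i in range(winp.shape[0]):
    for j in range(winp.shape[1]):
        ax.text(j, i, str(winp[i, j]), ha="center", va="center", color="tab:blue")
```

Because nothing is drawn on top of the image, there is no second layer offset by half a cell, so no stray strips should appear at the left or top edge.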

Define bubble sizes according to a column and bubble colors according to another column in scatter plot (matplotlib)

I'm building a simple scatter plot that reads data from an xls file.
It's the classic Life expectancy x GDP per capita scatter plot. Here's the code:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
#read the third sheet of the spreadsheet (note: sheet_name=0 actually selects the first sheet)
data = pd.read_excel('sample.xls', sheet_name=0)
data.head()
plt.scatter(x = data['LifeExpec'],
            y = data['GDPperCapita'],
            s = data['PopX1000'],
            c = data['PopX1000'],
            cmap=cm.viridis,
            edgecolors = 'none',
            alpha = 0.7)
for estado in range(len(data['UF'])):
    plt.text(x = data['LifeExpec'][estado],
             y = data['GDPperCapita'][estado],
             s = data['UF'][estado],
             fontsize = 14)
plt.colorbar()
plt.show()
The .xls file:
The population column from the xls file (PopX1000) defines the bubble sizes, and currently it defines their colors as well.
I would like the bubbles to change sizes according to population (as they do now), but the colors to change according to the Region the State is in.
I believe I can't simply change the c property because it expects a float value.
Any tips on how to do this?
You could transform the Region into a numeric representation and use that as a "key" into your colormap. Below are two ways to do that (one is commented out; use whichever you prefer, the result should be the same):
plt.scatter(x = data['LifeExpec'],
            y = data['GDPperCapita'],
            s = data['PopX1000'],
            c = pd.factorize(data['Region'])[0],
            # Alternatively:
            # c = data['Region'].astype('category').cat.codes
            cmap=cm.viridis,
            edgecolors = 'none',
            alpha = 0.7)
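To make the mapping visible, here is a small self-contained sketch of what pd.factorize returns and how the codes could also drive a legend with one entry per region. The rows in the frame are invented placeholders, not values from the spreadsheet, and the legend step is an addition that the answer above does not include.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the spreadsheet (values are made up)
data = pd.DataFrame({
    "UF": ["SP", "RJ", "BA", "RS"],
    "Region": ["Southeast", "Southeast", "Northeast", "South"],
    "LifeExpec": [78.0, 76.5, 73.9, 78.3],
    "GDPperCapita": [50.0, 45.0, 20.0, 42.0],
    "PopX1000": [460, 170, 150, 110],
})

# codes: one integer per row; regions: the distinct names in order of appearance
codes, regions = pd.factorize(data["Region"])

fig, ax = plt.subplots()
sc = ax.scatter(data["LifeExpec"], data["GDPperCapita"],
                s=data["PopX1000"], c=codes, cmap="viridis",
                edgecolors="none", alpha=0.7)

# One legend handle per distinct code, labelled with the matching region name
handles, _ = sc.legend_elements(prop="colors")
ax.legend(handles, list(regions), title="Region")
```

A legend with named regions is usually easier to read than a colorbar here, since the integer codes carry no numeric meaning.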

Python: Legend has wrong colors on Pandas MultiIndex plot

I'm trying to plot data from 2 separate MultiIndex dataframes, with the same data as levels in each.
Currently, this generates two separate plots, and I'm unable to customise the legend by appending a string that distinguishes each line on the graph. Any help would be appreciated!
Here is the method so far:
def plot_lead_trail_res(df_ante, df_post, symbols=[]):
    if len(symbols) < 1:
        print "Try again with a symbol list. (Time constraints)"
    else:
        df_ante = df_ante.loc[symbols]
        df_post = df_post.loc[symbols]
        ante_leg = [str(x)+'_ex-ante' for x in df_ante.index.levels[0]]
        post_leg = [str(x)+'_ex-post' for x in df_post.index.levels[0]]
        print "ante_leg", ante_leg
        ax = df_ante.unstack(0).plot(x='SHIFT', y='MUTUAL_INFORMATION', legend=ante_leg)
        ax = df_post.unstack(0).plot(x='SHIFT', y='MUTUAL_INFORMATION', legend=post_leg)
        ax.set_xlabel('Time-shift of sentiment data (days) with financial data')
        ax.set_ylabel('Mutual Information')
Using this function call:
sentisignal.plot_lead_trail_res(data_nasdaq_top_100_preprocessed_mi_res, data_nasdaq_top_100_preprocessed_mi_res_validate, ['AAL', 'AAPL'])
I obtain the following figure:
Current plots
Ideally, both sets of lines would be on the same graph with the same axes!
Update 2 [Concatenation Solution]
I've solved the issues of plotting from multiple frames using concatenation, however the legend does not match the line colors on the graph.
There are no explicit calls to legend, and the label parameter of plot() has not been used.
Code:
df_ante = data_nasdaq_top_100_preprocessed_mi_res
df_post = data_nasdaq_top_100_preprocessed_mi_res_validate
symbols = ['AAL', 'AAPL']
df_ante = df_ante.loc[symbols]
df_post = df_post.loc[symbols]
df_ante.index.set_levels([[str(x)+'_ex-ante' for x in df_ante.index.levels[0]],df_ante.index.levels[1]], inplace=True)
df_post.index.set_levels([[str(x)+'_ex-post' for x in df_post.index.levels[0]],df_post.index.levels[1]], inplace=True)
df_merge = pd.concat([df_ante, df_post])
df_merge['SHIFT'] = abs(df_merge['SHIFT'])
df_merge.unstack(0).plot(x='SHIFT', y='MUTUAL_INFORMATION')
Image:
MultiIndex Plot Image
I think that with
ax = df_ante.unstack(0).plot(x='SHIFT', y='MUTUAL_INFORMATION', legend=ante_leg)
you store the output of plot() in ax, including the lines, and this is then overwritten by the second function call. Am I right that the lines that were plotted first are missing?
The usual procedure would be something like
fig = plt.figure(figsize=(5, 5)) # size in inches
ax = fig.add_subplot(111) # if you want only one axes
Now you have an axes object in ax and can pass it (ax=ax) to each of the subsequent plot calls.
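As a rough sketch of that idea, the two frames below are flat, made-up stand-ins for df_ante and df_post (the real ones are MultiIndexed, and the numbers here are invented). Both plot calls receive the same ax, and label is used to name each line.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical flat frames for a single symbol (values are made up)
shift = [0, 1, 2, 3]
df_ante = pd.DataFrame({"SHIFT": shift, "MUTUAL_INFORMATION": [0.10, 0.12, 0.09, 0.07]})
df_post = pd.DataFrame({"SHIFT": shift, "MUTUAL_INFORMATION": [0.08, 0.11, 0.10, 0.06]})

fig = plt.figure(figsize=(5, 5))  # size in inches
ax = fig.add_subplot(111)

# Passing the same ax to both calls draws both lines on one set of axes
df_ante.plot(x="SHIFT", y="MUTUAL_INFORMATION", ax=ax, label="AAPL_ex-ante")
df_post.plot(x="SHIFT", y="MUTUAL_INFORMATION", ax=ax, label="AAPL_ex-post")

ax.set_xlabel("Time-shift of sentiment data (days) with financial data")
ax.set_ylabel("Mutual Information")
ax.legend()  # rebuild the legend from the line labels so colors and names match
```

Calling ax.legend() at the end builds the legend from the lines actually on the axes, which keeps each legend entry tied to the color of its own line.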
