I have this Excel file that I need to plot. So far my code looks like this:
import pandas as pd
import matplotlib.pyplot as plt
file = 'weatherdata.xlsx'
def plotMeteoData(file,x_axis,metric,*list_of_cities):
    df = pd.ExcelFile(file)
    sheets = df.sheet_names
    print(sheets)
    df_list = []
    for city in list_of_cities:
        df_list.append(df.parse(city))
    x=range(len(df_list[0][x_axis].tolist()))
    width = 0.3
    for j,df_item in enumerate(df_list):
        plt.bar(x,df_item[metric],width,label = sheets[j]) #this isn't working
        x = [i+width for i in x]
    plt.xlabel(x_axis)
    plt.ylabel(metric)
    plt.xticks(range(len(df_list[0][x_axis].tolist())),labels=df_list[0][x_axis].tolist())
    plt.show()
t=['Αθήνα','Θεσσαλονίκη','Πάτρα']
plotMeteoData(file,'Μήνας','Υγρασία',*t)
It gives this output.
Each color represents an Excel sheet, the x-axis shows the months, and the y-axis shows the values.
I've commented the line where I'm trying to add a label for each sheet, but I can't get it to work. Also, as the output above shows, the bars aren't centered on each xtick. How can I fix these problems? Thanks.
Typically you use plt.subplots, as it gives you more control over the graph. The code below calculates the offset needed for the xtick labels to be centered and shows the legend with the city labels:
import pandas as pd
import matplotlib.pyplot as plt
file = 'weatherdata.xlsx'
def plotMeteoData(file,x_axis,metric,*list_of_cities):
    df = pd.ExcelFile(file)
    sheets = df.sheet_names
    print(sheets)
    df_list = []
    for city in list_of_cities:
        df_list.append(df.parse(city))
    x=range(len(df_list[0][x_axis].tolist()))
    width = 0.3
    # Calculate the offset of the center of the xtick labels
    xTickOffset = width*(len(list_of_cities)-1)/2
    # Create a plot
    fig, ax = plt.subplots()
    for j,df_item in enumerate(df_list):
        # Label each series with the city (sheet) it was parsed from
        ax.bar(x,df_item[metric],width,label = list_of_cities[j])
        x = [i+width for i in x]
    ax.set_xlabel(x_axis)
    ax.set_ylabel(metric)
    # Add a legend (feel free to change the location)
    ax.legend(loc='upper right')
    # Add the xTickOffset to the xtick label positions so they are centered
    ax.set_xticks(list(map(lambda x:x+xTickOffset, range(len(df_list[0][x_axis].tolist())))),labels=df_list[0][x_axis].tolist())
    plt.show()
t=['Athena', 'Thessaloniki', 'Patras']
plotMeteoData(file,'Month','Humidity',*t)
Resulting Graph:
The xtick offset accounts for any number of Excel sheets. See the Matplotlib legend documentation for more information on legends.
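If you want to check the centering arithmetic on its own, here is a minimal sketch that uses made-up humidity values instead of the Excel file. `grouped_bar_positions` is a hypothetical helper written for this illustration, not a Matplotlib function, and the month and city names are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def grouped_bar_positions(n_groups, n_series, width):
    """Return (bar_positions, tick_positions) for a grouped bar chart.

    bar_positions[j][i] is the x position of series j in group i;
    tick_positions[i] is the midpoint of the bars in group i.
    """
    bar_positions = [[i + j * width for i in range(n_groups)]
                     for j in range(n_series)]
    offset = width * (n_series - 1) / 2
    tick_positions = [i + offset for i in range(n_groups)]
    return bar_positions, tick_positions


# Made-up data (assumption): three months, two cities
months = ["Jan", "Feb", "Mar"]
humidity = {"City A": [70, 65, 60], "City B": [75, 72, 68]}
width = 0.3

bars, ticks = grouped_bar_positions(len(months), len(humidity), width)

fig, ax = plt.subplots()
for x, (city, values) in zip(bars, humidity.items()):
    ax.bar(x, values, width, label=city)  # one legend entry per city
ax.set_xticks(ticks)                      # ticks sit at the middle of each group
ax.set_xticklabels(months)
ax.legend(loc="upper right")
# Call plt.show() or fig.savefig(...) to look at the result.
```

With two series and a width of 0.3 the offset is 0.15, so each tick lands exactly between the two bars of its group.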
I am creating a figure with two subplots with scatter plots. I want to use the same color scheme and marker definitions for each subplot, but can't seem to get it to work. Please forgive the length of my minimal working example, but I trimmed it down as much as I can.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
from scipy.stats import probplot
#Raw Data
area_old = [7603.4897489697905, 2941.7094279413577, 8153.896678990219, 7289.99097646249, 8620.196237363853, 11619.546945954673, 8458.80648310436, 7161.530990460888, 28486.298572761007, 4928.4856128268875, 4219.122621992603, 31687.155529782176]
combined = [7603.4897489697905, 2941.7094279413577, 8153.896678990219, 7289.99097646249, 8620.196237363853, 11619.546945954673, 8458.80648310436, 7161.530990460888, 28486.298572761007, 4928.4856128268875, 4219.122621992603, 31687.155529782176, 3059.4357099599456, 3348.0415691055823, 4839.023360449559, 4398.877634354169, 29269.67455441528, 11058.400909555028, 18266.34679952683, 16641.3446048029, 24983.586163502885, 5811.868753338233]
#Attributes to map colors and markers to
lt_bt = ['r','s','s','r','r','u','r','s','r','r','s','r']
combined_bt =['r','s','s','r','r','u','r','s','r','r','s','r','u','u','r','s','r','s','r','r','r','u']
#Get Probability plot Data
a = probplot(area_old,dist='norm', plot=None)
b= probplot(combined,dist='norm', plot=None)
#Colors and Markers to use
colors = {'r':'red','s':'blue', 'u':'green'}
markers = {'r':'*','s':'x', 'u':'o'}
#Create Dataframe to combine raw data, attributes and sort
old_df = pd.DataFrame(area_old, columns=['Long Term Sites: N=12'])
old_df['Bar_Type'] = lt_bt
old_df = old_df.sort_values(by='Long Term Sites: N=12')
old_df['quart']=a[0][0]
#Pandas series of colors for plotting on subplot 'ax'
ax_color = old_df.loc[:,'Bar_Type'].apply(lambda x: colors[x])
#Create Dataframe to combine raw data, attributes and sort
combined_df = pd.DataFrame(combined, columns=['ALL SITES N=22'])
combined_df['Bar_Type'] = combined_bt
combined_df = combined_df.sort_values(by='ALL SITES N=22')
combined_df['quart']=b[0][0]
#Pandas series of colors for plotting on subplot 'ax1'
ax1_color = combined_df.loc[:,'Bar_Type'].apply(lambda x: colors[x])
#Legend Handles
undif = plt.Line2D([0,0],[0,1], color='green',marker='o',linestyle=' ')
reatt = plt.Line2D([0,0],[0,1], color='red',marker='*',linestyle=' ')
sep = plt.Line2D([0,0],[0,1], color='blue',marker='x',linestyle=' ')
fig,(ax,ax1) = plt.subplots(ncols=2,sharey=True)
#Plot each data point separately with different markers and colors
for i, thing in old_df.iterrows():
    ax.scatter(thing['quart'],thing['Long Term Sites: N=12'],c=ax_color.iloc[i],marker=markers[thing['Bar_Type']],zorder=10,s=50)
del i, thing
for i , thing in combined_df.iterrows():
    ax1.scatter(thing['quart'],thing['ALL SITES N=22'],c=ax1_color.iloc[i],marker=markers[thing['Bar_Type']],zorder=10,s=50)
del i, thing
ax.set_title('LONG TERM SITES N=12')
ax1.set_title('ALL SITES N=22')
ax1.set_ylabel('')
ax.set_ylabel('TOTAL EDDY AREA, IN METERS SQUARED')
ax.set_ylim(0,35000)
ax.get_yaxis().set_major_formatter(tkr.FuncFormatter(lambda x, p: format(int(x), ',')))
legend = ax.legend([reatt,sep,undif],["Reattachment","Separation", "Undifferentiated"],loc=2,title='Bar Type',fontsize='x-small')
plt.setp(legend.get_title(),fontsize='x-small')
ax.set_xlabel('QUANTILES')
ax1.set_xlabel('QUANTILES')
plt.tight_layout()
The basic idea is I am plotting scatter plots, point-by-point, to assign the appropriate color and marker. I assign colors using pandas integer indexing .iloc and assign marker by specifying a key for the markers dictionary.
I know something isn't right because the first point in old_df and combined_df (i.e. old_df.loc[1,:], combined_df.loc[1,:]) should have the color 'blue' and the marker 'x'.
What am I doing wrong?
I'm not sure why, but using .iloc inside ax.scatter produced unpredictable results. All I had to do was replace the .iloc lookup with a dictionary mapping (i.e. change c=ax_color.iloc[i] to c=colors[thing['Bar_Type']]), and everything works fine!
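A likely explanation, for what it's worth: after sort_values the index labels are no longer in positional order, iterrows yields the index label of each row, and .iloc looks things up by position. Passing a label to .iloc therefore picks the color of a different row. Here is a minimal sketch with a made-up three-row frame (the `area` column and its values are assumptions for illustration):

```python
import pandas as pd

colors = {"r": "red", "s": "blue", "u": "green"}

# Made-up data (assumption): three values, each with a bar type
df = pd.DataFrame({"area": [30.0, 10.0, 20.0], "Bar_Type": ["r", "s", "u"]})
df = df.sort_values(by="area")            # index labels are now in the order 1, 2, 0
color_series = df["Bar_Type"].map(colors)

results = []
for label, row in df.iterrows():
    by_position = color_series.iloc[label]  # label misused as a position
    by_label = color_series.loc[label]      # label-based lookup of the same row
    by_dict = colors[row["Bar_Type"]]       # dictionary lookup, as in the answer
    results.append((label, by_position, by_label, by_dict))
    print(label, row["Bar_Type"], by_position, by_label, by_dict)
```

In this sketch the .iloc color disagrees with the row's own bar type on every row, while the .loc lookup and the dictionary lookup always agree. So using .loc[i] instead of .iloc[i] should also work, but the dictionary lookup avoids the intermediate series entirely.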
A working example of the desired result:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
from scipy.stats import probplot
#Raw Data
area_old = [7603.4897489697905, 2941.7094279413577, 8153.896678990219, 7289.99097646249, 8620.196237363853, 11619.546945954673, 8458.80648310436, 7161.530990460888, 28486.298572761007, 4928.4856128268875, 4219.122621992603, 31687.155529782176]
combined = [7603.4897489697905, 2941.7094279413577, 8153.896678990219, 7289.99097646249, 8620.196237363853, 11619.546945954673, 8458.80648310436, 7161.530990460888, 28486.298572761007, 4928.4856128268875, 4219.122621992603, 31687.155529782176, 3059.4357099599456, 3348.0415691055823, 4839.023360449559, 4398.877634354169, 29269.67455441528, 11058.400909555028, 18266.34679952683, 16641.3446048029, 24983.586163502885, 5811.868753338233]
#Attributes to map colors and markers to
lt_bt = ['r','s','s','r','r','u','r','s','r','r','s','r']
combined_bt =['r','s','s','r','r','u','r','s','r','r','s','r','u','u','r','s','r','s','r','r','r','u']
#Get Probability plot Data
a = probplot(area_old,dist='norm', plot=None)
b= probplot(combined,dist='norm', plot=None)
#Colors and Markers to use
colors = {'r':'red','s':'blue', 'u':'green'}
markers = {'r':'*','s':'x', 'u':'o'}
#Create Dataframe to combine raw data, attributes and sort
old_df = pd.DataFrame(area_old, columns=['Long Term Sites: N=12'])
old_df['Bar_Type'] = lt_bt
old_df = old_df.sort_values(by='Long Term Sites: N=12')
old_df['quart']=a[0][0]
#Pandas series of colors for plotting on subplot 'ax'
ax_color = old_df.loc[:,'Bar_Type'].apply(lambda x: colors[x])
#Create Dataframe to combine raw data, attributes and sort
combined_df = pd.DataFrame(combined, columns=['ALL SITES N=22'])
combined_df['Bar_Type'] = combined_bt
combined_df = combined_df.sort_values(by='ALL SITES N=22')
combined_df['quart']=b[0][0]
#Pandas series of colors for plotting on subplot 'ax1'
ax1_color = combined_df.loc[:,'Bar_Type'].apply(lambda x: colors[x])
#Legend Handles
undif = plt.Line2D([0,0],[0,1], color='green',marker='o',linestyle=' ')
reatt = plt.Line2D([0,0],[0,1], color='red',marker='*',linestyle=' ')
sep = plt.Line2D([0,0],[0,1], color='blue',marker='x',linestyle=' ')
fig,(ax,ax1) = plt.subplots(ncols=2,sharey=True)
#Plot each data point separately with different markers and colors
for i, thing in old_df.iterrows():
    ax.scatter(thing['quart'],thing['Long Term Sites: N=12'],c=colors[thing['Bar_Type']],marker=markers[thing['Bar_Type']],zorder=10,s=50)
del i, thing
for i , thing in combined_df.iterrows():
    ax1.scatter(thing['quart'],thing['ALL SITES N=22'],c=colors[thing['Bar_Type']],marker=markers[thing['Bar_Type']],zorder=10,s=50)
del i, thing
ax.set_title('LONG TERM SITES N=12')
ax1.set_title('ALL SITES N=22')
ax1.set_ylabel('')
ax.set_ylabel('TOTAL EDDY AREA, IN METERS SQUARED')
ax.set_ylim(0,35000)
ax.get_yaxis().set_major_formatter(tkr.FuncFormatter(lambda x, p: format(int(x), ',')))
legend = ax.legend([reatt,sep,undif],["Reattachment","Separation", "Undifferentiated"],loc=2,title='Bar Type',fontsize='x-small')
plt.setp(legend.get_title(),fontsize='x-small')
ax.set_xlabel('QUANTILES')
ax1.set_xlabel('QUANTILES')
plt.tight_layout()
I am translating a set of R visualizations to Python. I have the following target R multiple plot histograms:
Using Matplotlib and Seaborn combination and with the help of a kind StackOverflow member (see the link: Python Seaborn Distplot Y value corresponding to a given X value), I was able to create the following Python plot:
I am satisfied with its appearance, except that I don't know how to put the header information in the plots. Here is the Python code that creates the charts:
""" Program to draw the sampling histogram distributions """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
def main():
    """ Main routine for the sampling histogram program """
    sns.set_style('whitegrid')
    markers_list = ["s", "o", "*", "^", "+"]
    # create the data dataframe as df_orig
    df_orig = pd.read_csv('lab_samples.csv')
    df_orig = df_orig.loc[df_orig.hra != -9999]
    hra_list_unique = df_orig.hra.unique().tolist()
    # create and subset df_hra_colors to match the actual hra colors in df_orig
    df_hra_colors = pd.read_csv('hra_lookup.csv')
    df_hra_colors['hex'] = np.vectorize(rgb_to_hex)(df_hra_colors['red'], df_hra_colors['green'], df_hra_colors['blue'])
    df_hra_colors.drop(labels=['red', 'green', 'blue'], axis=1, inplace=True)
    df_hra_colors = df_hra_colors.loc[df_hra_colors['hra'].isin(hra_list_unique)]
    # hard coding the current_component to pc1 here, we will extend it by looping
    # through the list of components
    current_component = 'pc1'
    num_tests = 5
    df_columns = df_orig.columns.tolist()
    start_index = 5
    for test in range(num_tests):
        current_tests_list = df_columns[start_index:(start_index + num_tests)]
        # now create the sns distplots for each HRA color and overlay the tests
        i = 1
        for _, row in df_hra_colors.iterrows():
            plt.subplot(3, 3, i)
            select_columns = ['hra', current_component] + current_tests_list
            df_current_color = df_orig.loc[df_orig['hra'] == row['hra'], select_columns]
            y_data = df_current_color.loc[df_current_color[current_component] != -9999, current_component]
            axs = sns.distplot(y_data, color=row['hex'],
                               hist_kws={"ec":"k"},
                               kde_kws={"color": "k", "lw": 0.5})
            data_x, data_y = axs.lines[0].get_data()
            axs.text(0.0, 1.0, row['hra'], horizontalalignment="left", fontsize='x-small',
                     verticalalignment="top", transform=axs.transAxes)
            for current_test_index, current_test in enumerate(current_tests_list):
                # this_x defines the series of current_component(pc1,pc2,rhob) for this test
                # indicated by 1, corresponding R program calls this test_vector
                x_series = df_current_color.loc[df_current_color[current_test] == 1, current_component].tolist()
                for this_x in x_series:
                    this_y = np.interp(this_x, data_x, data_y)
                    axs.plot([this_x], [this_y - current_test_index * 0.05],
                             markers_list[current_test_index], markersize = 3, color='black')
            axs.xaxis.label.set_visible(False)
            axs.xaxis.set_tick_params(labelsize=4)
            axs.yaxis.set_tick_params(labelsize=4)
            i = i + 1
        start_index = start_index + num_tests
    # plt.show()
    pp = PdfPages('plots.pdf')
    pp.savefig()
    pp.close()
def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)
if __name__ == "__main__":
    main()
The pandas code works fine and does what it is supposed to. The bottleneck is my lack of knowledge and experience with 'PdfPages' in Matplotlib. How can I show the same header information in Python/Matplotlib/Seaborn that the corresponding R visualization shows? By header information I mean what the R visualization has at the top, before the histograms, i.e. 'pc1', MRP, XRD, ...
I can get their values easily from my program, e.g., current_component is 'pc1', etc. But I don't know how to format the plots with the Header. Can someone provide some guidance?
You may be looking for a figure title or super title, fig.suptitle:
fig.suptitle('this is the figure title', fontsize=12)
In your case you can easily get the figure with plt.gcf(), so try
plt.gcf().suptitle("pc1")
The rest of the information in the header would be called a legend.
For the following let's suppose all subplots have the same markers. It would then suffice to create a legend for one of the subplots.
To create legend labels, you can pass the label argument to the plot call, i.e.
axs.plot( ... , label="MRP")
When later calling axs.legend() a legend will automatically be generated with the respective labels. Ways to position the legend are detailed e.g. in this answer.
Here, you may want to place the legend in terms of figure coordinates, i.e.
axs.legend(loc="lower center",bbox_to_anchor=(0.5,0.8),bbox_transform=plt.gcf().transFigure)
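Putting the two pieces together, a minimal sketch could look like the following. It uses a plain 2x2 grid with dummy points instead of the histograms from the question; the marker choices and the anchor position (0.5, 0.88) are assumptions that you would tune for your own figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)

# Header entries named in the question; the markers are assumptions
test_names = ["MRP", "XRD"]
markers = ["s", "o"]

for ax in axes.flat:
    for k, (name, marker) in enumerate(zip(test_names, markers)):
        # Dummy points; the label is what later shows up in the legend
        ax.plot([0, 1, 2], [k, k + 1, k + 2], marker,
                color="black", markersize=3, label=name)

# Figure-level title, the 'pc1' part of the header
title = fig.suptitle("pc1", fontsize=12)

# One legend, built from the first subplot only, positioned in figure coordinates
legend = axes[0, 0].legend(loc="lower center", bbox_to_anchor=(0.5, 0.88),
                           bbox_transform=fig.transFigure,
                           ncol=len(test_names), fontsize="x-small")
# Call plt.show() or fig.savefig(...) to look at the result.
```

Because every subplot uses the same markers, a single legend attached to one axes is enough; bbox_transform=fig.transFigure is what lets it sit above the grid rather than inside that axes.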
I have been using matplotlib for quite some time now and it is great. However, I want to switch to pandas, and my first attempt at it didn't go so well.
My data set looks like this:
sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
etc.....
First I want to search the CSV for every row with the name mike and put those rows in their own list. With such lists I want to be able to do some math, for example sam[3] + winter[4] or sam[1]/10. The last step would be to plot the columns against each other.
Going through this page
http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
The only approach I see there relies on column headers, but I don't have any headers. I only know the position in a row of the values I want.
So my questions are:
How do I create a list for each name (sam, winter, summer)?
Is this method efficient if my CSV has millions of data points?
Can I use matplotlib to plot a pandas DataFrame?
i.e.:
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike[1], winter[3], label='Mike vs Winter speed', color = 'red')
You can read a csv without headers:
data=pd.read_csv(filepath, header=None)
Columns will be numbered starting from 0.
Selecting and filtering:
all_summers = data[data[0]=='summer']
If you want to do some operations grouping by the first column, it will look like this:
data.groupby(0).sum()
data.groupby(0).count()
...
Selecting a row after grouping:
sums = data.groupby(0).sum()
sums.loc['sam']
Plotting example:
import matplotlib.pyplot as plt
sums.plot()
plt.show()
For more details about plotting, see: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html
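To try the steps above without a file on disk, here is a small sketch that reads the sample rows from the question out of a string. Using io.StringIO is just a convenience for the example; with a real file you would pass the path to read_csv as shown above.

```python
import io
import pandas as pd

# The sample rows from the question, held in a string instead of a file
csv_text = """sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
"""

# header=None: columns are numbered 0..4, with the name in column 0
data = pd.read_csv(io.StringIO(csv_text), header=None)

all_summers = data[data[0] == "summer"]  # rows whose first column is 'summer'
sums = data.groupby(0).sum()             # per-name totals of the numeric columns

print(all_summers)
print(sums.loc["sam"])
```

Filtering with a boolean mask and grouping are vectorised operations, so they should scale far better than building Python lists row by row.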
df = pd.read_csv(filepath, header=None)
mike = df[df[0]=='mike'].values.tolist()
winter = df[df[0]=='winter'].values.tolist()
Then you can plot those lists as you wanted to above:
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike, winter, label='Mike vs Winter speed', color = 'red')
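One caveat: .values.tolist() returns whole rows, including the name in column 0, so for the math and the plotting you will probably want to pick individual numeric columns. A hedged sketch of that, again using the sample rows from the question (the choice of sam and winter and of the column numbers is only an example, and both selections must have the same length to be plotted against each other):

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

csv_text = """sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
"""

df = pd.read_csv(io.StringIO(csv_text), header=None)
sam = df[df[0] == "sam"]        # two rows in the sample data
winter = df[df[0] == "winter"]  # two rows in the sample data

# Element-wise math on columns; to_numpy() drops the row index so the
# values pair up by position rather than by index label
combined = sam[3].to_numpy() + winter[4].to_numpy()
print(combined)

fig1 = plt.figure(figsize=(10, 10))
ax = fig1.add_subplot(211)
ax.plot(sam[1].to_numpy(), winter[2].to_numpy(),
        color="red", marker="o", label="sam column 1 vs winter column 2")
ax.legend()
# Call plt.show() or fig1.savefig(...) to look at the result.
```

Matplotlib accepts pandas Series directly as well; converting to NumPy arrays here just makes it explicit that the pairing is positional.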