I am currently working through the book Hands-On Machine Learning and am trying to replicate a visualization in which the latitude and longitude coordinates of the California housing data are drawn as a scatter plot. I took the plot code below from the book (matplotlib method). I would like to replicate the same visualization using plotnine. Could someone help me with the translation?
matplotlib method
# DATA INGEST -------------------------------------------------------------
import io
import pandas as pd
import requests
# Import the file from GitHub
url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv" # Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content
# Read the downloaded content and turn it into a pandas DataFrame
housing = pd.read_csv(io.StringIO(download.decode('utf-8')))
# Then plot
import matplotlib.pyplot as plt
# The size is now related to population divided by 100
# the colour is related to the median house value
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population", figsize=(10,7),
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
plt.show()
plotnine method
from plotnine import ggplot, geom_point, aes, stat_smooth, scale_color_cmap
# Let's try the same thing in ggplot
(ggplot(housing, aes('longitude', 'latitude', size = "population", color = "median_house_value"))
+ geom_point(alpha = 0.1)
+ scale_color_cmap(name="jet"))
If your question was about the colour mapping, then you were close: you just needed cmap_name='jet' instead of name='jet'.
If it is a broader styling question, the code below comes close to what you had with matplotlib.
matplotlib method
plotnine method
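import numpy as np
from plotnine import (ggplot, aes, geom_point, annotate, labs, guides, theme,
                      theme_matplotlib, element_text, element_blank,
                      scale_y_continuous, scale_color_cmap, scale_size_continuous)
p = (ggplot(housing, aes(x='longitude', y='latitude', size='population', color='median_house_value'))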
+ theme_matplotlib()
+ geom_point(alpha=0.4)
+ annotate('text', x=-114.6, y=42, label='population', size=8)
+ annotate('point', x=-115.65, y=42, size=5, color='#6495ED', fill='#6495ED', alpha=0.8)
+ labs(x=None, color='Median house value')
+ scale_y_continuous(breaks=np.arange(34,44,2))
+ scale_color_cmap(cmap_name='jet')
+ scale_size_continuous(range=(0.05, 6))
+ guides(size=False)
+ theme(
text = element_text(family='DejaVu Sans', size=8),
axis_text_x = element_blank(),
axis_ticks_minor=element_blank(),
legend_key_height = 34,
legend_key_width = 9,
)
)
p
I am not sure to what extent the formatting of the colour bar can be modified in plotnine. If others have additional ideas, I would be most interested; I think the matplotlib colour bar looks nicer.
I am trying to create a percentage stacked area chart for the contents of a human retinal cell from top to bottom, using matplotlib in PyCharm, but I was only introduced to Python and coding yesterday and don't know how to convert the raw data into percentages.
I have the data in an Excel sheet that I import with pandas, and I am able to show a "normal" stacked area chart, but I am not sure how or where to transform the data into percentages.
I have managed to plot out the values in a stacked area chart, looking something like this:
Stacked Plot
What I want is to show the percentage individual components make up, like in this example from anychart.com
percentage stacked area graph.
Here is the code I currently have cobbled together:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
organelles = pd.read_excel('H:\\cell 1 stuff\\measurements\\testbox organelles 1.xlsx')
slices = organelles["Slice"]
outersegs = organelles["POS"]
apicals = organelles["aps"]
matrix = organelles["IPM"]
granules = organelles["gran"]
labels = ["POS", "AP", "IPM", "Granules"]
plt.title('subretinal space components makeup', fontdict={'fontweight': 'bold', 'fontsize': 14})
plt.stackplot(slices, outersegs, apicals, matrix, granules,
colors =['blue', 'green', 'yellow', 'brown'])
plt.xticks(np.arange(500,6250,500), rotation = 45)
plt.xlim(1000, 6000)
plt.xlabel('Slice')
plt.ylabel('$nm²$')
plt.yticks(rotation = 90)
plt.legend(labels, loc = 2)
fig = plt.gcf()  # handle to the current figure (the original assigned plt.figure without calling it)
plt.show()
I managed to get it to do what I wanted by summing up all the columns up and calculating the percentage each individual class has by using the following code:
allorganelles = organelles["POS"] + organelles["aps"] + organelles["IPM"] + organelles["gran"] + organelles["IS"]
pos_rel = organelles["POS"].div(allorganelles, 0)*100
apicals_rel = organelles["aps"].div(allorganelles, 0)*100
ipm_rel = organelles["IPM"].div(allorganelles, 0)*100
gran_rel = organelles["gran"].div(allorganelles, 0)*100
is_rel = organelles["IS"].div(allorganelles, 0)*100
plt.stackplot(slices, apicals_rel, pos_rel, ipm_rel, gran_rel, is_rel,
colors =['green', 'blue', 'yellow', 'brown', 'pink'])
This answer was posted as an edit to the question Percentage Stacked Area Chart from Excel Spreadsheet with matplotlib by the OP hufoveconMax under CC BY-SA 4.0.
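The same normalisation can be written without one variable per column. The sketch below is only an illustration under stated assumptions: the numbers are made up to stand in for the Excel sheet, the "IS" column is left out for brevity, and `components`, `row_totals` and `percentages` are names introduced here. It divides every component column by the row total in one step and passes the result to stackplot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements standing in for the Excel sheet
organelles = pd.DataFrame({
    "Slice": [1000, 2000, 3000, 4000],
    "POS": [5.0, 10.0, 30.0, 40.0],
    "aps": [15.0, 10.0, 10.0, 40.0],
    "IPM": [20.0, 20.0, 40.0, 10.0],
    "gran": [10.0, 10.0, 20.0, 10.0],
})

components = ["POS", "aps", "IPM", "gran"]
row_totals = organelles[components].sum(axis=1)        # total area per slice
# axis=0 aligns the totals with the rows, so each row is divided by its own total
percentages = organelles[components].div(row_totals, axis=0) * 100

fig, ax = plt.subplots()
# stackplot accepts a 2D array with one row per stacked series
ax.stackplot(organelles["Slice"], percentages.T.to_numpy(), labels=components)
ax.set_ylim(0, 100)
ax.set_xlabel("Slice")
ax.set_ylabel("% of total area")
ax.legend(loc="upper left")
```

With the real data, replacing the hand-written frame with the result of `pd.read_excel(...)` and extending `components` should be the only changes needed.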
I have the same issue as in this post and have already tried this solution (including the suggestion in the comment), but I get strange percentages. Since I am not yet eligible to comment, I am posting this as a new question.
As far as I can tell from experimenting, the problem is caused by the unexpected order of the values produced by the line below, but I can't find a fix.
a = [p.get_height() for p in plot.patches]
My expected output is that the percentages within each Class add up to 100%.
Here is the first version of the code I used:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df = sns.load_dataset("titanic")
def with_hue(plot, feature, Number_of_categories, hue_categories):
    a = [p.get_height() for p in plot.patches]
    patch = [p for p in plot.patches]
    for i in range(Number_of_categories):
        total = feature.value_counts().values[i]
        # total = np.sum(a[::hue_categories])
        for j in range(hue_categories):
            percentage = '{:.1f}%'.format(100 * a[(j*Number_of_categories + i)]/total)
            x = patch[(j*Number_of_categories + i)].get_x() + patch[(j*Number_of_categories + i)].get_width() / 2 - 0.15
            y = patch[(j*Number_of_categories + i)].get_y() + patch[(j*Number_of_categories + i)].get_height()
            p3.annotate(percentage, (x, y), size = 11)
    plt.show()
plt.figure(figsize=(12,8))
p3 = sns.countplot(x="class", hue="who", data=df)
p3.set(xlabel='Class', ylabel='Count')
with_hue(p3, df['class'],3,3)
and the first output:
Using total = np.sum(a[::hue_categories]) for the total instead gives this output:
First, note that in matplotlib and seaborn, a subplot is called an "ax". Giving such a subplot a name such as "p3" or "plot" leads to unnecessary confusion when studying the documentation and online example code.
The bars in the seaborn bar plot are organized starting with all the bars belonging to the first hue value, then the second, and so on. So, in the given example, first come all the blue bars, then all the orange ones and finally all the green ones. This makes looping through ax.patches somewhat complicated. Luckily, the same patches are also available via ax.containers, where each hue group forms a separate container of bars.
Here is some example code:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
def percentage_above_bar_relative_to_xgroup(ax):
    all_heights = [[p.get_height() for p in bars] for bars in ax.containers]
    for bars in ax.containers:
        for i, p in enumerate(bars):
            total = sum(xgroup[i] for xgroup in all_heights)
            percentage = f'{(100 * p.get_height() / total) :.1f}%'
            ax.annotate(percentage, (p.get_x() + p.get_width() / 2, p.get_height()), size=11, ha='center', va='bottom')
df = sns.load_dataset("titanic")
plt.figure(figsize=(12, 8))
ax3 = sns.countplot(x="class", hue="who", data=df)
ax3.set(xlabel='Class', ylabel='Count')
percentage_above_bar_relative_to_xgroup(ax3)
plt.show()
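To cross-check the numbers drawn above the bars, the same per-class percentages can be computed directly with pandas. This is a small sketch on a hand-made sample (hypothetical rows, not the titanic dataset) that uses pd.crosstab with normalize="index", so every class row sums to 100%.

```python
import pandas as pd

# Tiny hypothetical sample with the same two columns as the titanic example
df = pd.DataFrame({
    "class": ["First", "First", "First", "First", "Second",
              "Second", "Third", "Third", "Third", "Third"],
    "who": ["man", "man", "woman", "child", "man",
            "woman", "man", "man", "man", "woman"],
})

# normalize="index" divides each row (one class) by that row's own total
pct = pd.crosstab(df["class"], df["who"], normalize="index") * 100
print(pct.round(1))
```

Running the same two lines on the real data frame gives a table to compare against the annotated plot.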
I have a line plot and a scatter plot that are conceptually linked by sample IDs, i.e. each dot on the 2D scatter plot corresponds to a line on the line plot.
While I have done linked plotting before using scatter plots, I have not seen examples of this for the situation above - where I select dots and thus selectively view lines.
Is it possible to link dots on a scatter plot to a line on a line plot? If so, is there an example implementation available online?
Searching the web for bokeh link line and scatter plot yields no examples online, as of 14 August 2018.
I know this is a little late - but maybe this snippet of code will help?
import numpy as np
from bokeh.io import output_file, show
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.models import Circle,MultiLine
def play():
    x = np.linspace(0,10,100)
    y = np.random.rand(100)
    xs = np.random.rand(100,3)
    ys = np.random.normal(size=(100,3))
    xp = [list(xi) for xi in xs] # Because MultiLine does not like numpy arrays
    yp = [list(yi) for yi in ys]
    output_file('play.html')
    source = ColumnDataSource(data=dict(x=x,y=y,xp=xp,yp=yp))
    TOOLS = 'box_select'
    left = figure(tools=TOOLS,plot_width=700,plot_height=700)
    c1 = left.circle('x','y',source=source)
    c1.nonselection_glyph = Circle(fill_color='gray',fill_alpha=0.4,
                                   line_color=None)
    c1.selection_glyph = Circle(fill_color='orange',line_color=None)
    right = figure(tools=TOOLS,plot_width=700,plot_height=700)
    c2 = right.multi_line(xs='xp',ys='yp',source=source)
    c2.nonselection_glyph = MultiLine(line_color='gray',line_alpha=0.2)
    c2.selection_glyph = MultiLine(line_color='orange')
    p = gridplot([[left, right]])
    show(p)
As things turn out, I was able to make this happen by using HoloViews rather than Bokeh. The relevant example for making this work comes from the Selection1d tap stream.
http://holoviews.org/reference/streams/bokeh/Selection1D_tap.html#selection1d-tap
I will do an annotated version of the example below.
First, we begin with imports. (Note: all of this assumes work is being done in the Jupyter notebook.)
import numpy as np
import holoviews as hv
from holoviews.streams import Selection1D
from scipy import stats
hv.extension('bokeh')
First off, we set some styling options for the charts. In my experience, I usually build the chart before styling it, though.
%%opts Scatter [color_index=2 tools=['tap', 'hover'] width=600] {+framewise} (marker='triangle' cmap='Set1' size=10)
%%opts Overlay [toolbar='above' legend_position='right'] Curve (line_color='black') {+framewise}
This function below generates data.
def gen_samples(N, corr=0.8):
    xx = np.array([-0.51, 51.2])
    yy = np.array([0.33, 51.6])
    means = [xx.mean(), yy.mean()]
    stds = [xx.std() / 3, yy.std() / 3]
    covs = [[stds[0]**2 , stds[0]*stds[1]*corr],
            [stds[0]*stds[1]*corr, stds[1]**2]]
    return np.random.multivariate_normal(means, covs, N)
data = [('Week %d' % (i%10), np.random.rand(), chr(65+np.random.randint(5)), i) for i in range(100)]
sample_data = hv.NdOverlay({i: hv.Points(gen_samples(np.random.randint(1000, 5000), r2))
for _, r2, _, i in data})
The real magic begins here. First off, we set up a scatterplot using the hv.Scatter object.
points = hv.Scatter(data, ['Date', 'r2'], ['block', 'id']).redim.range(r2=(0., 1))
Then, we create a Selection1D stream. It pulls in points from the points object.
stream = Selection1D(source=points)
We then create a function to display the regression plot on the right. There's an empty plot that is the "default", and then there's a callback that hv.DynamicMap calls on.
empty = (hv.Points(np.random.rand(0, 2)) * hv.Curve(np.random.rand(0, 2))).relabel('No selection')
def regression(index):
    if not index:
        return empty
    scatter = sample_data[index[0]]
    xs, ys = scatter['x'], scatter['y']
    slope, intercep, rval, pval, std = stats.linregress(xs, ys)
    xs = np.linspace(*scatter.range(0)+(2,))
    reg = slope*xs+intercep
    return (scatter * hv.Curve((xs, reg))).relabel('r2: %.3f' % slope)
Now, we create the DynamicMap which dynamically loads the regression curve data.
reg = hv.DynamicMap(regression, kdims=[], streams=[stream])
# Ignoring annotation for average - it is not relevant here.
average = hv.Curve(points, 'Date', 'r2').aggregate(function=np.mean)
Finally, we display the plots.
points * average + reg
The most important thing I learned from building this is that the indices for the points have to be lined up with the indices for the regression curves.
I hope this helps others building awesome viz using HoloViews!
I am translating a set of R visualizations to Python. I have the following target R multiple plot histograms:
Using Matplotlib and Seaborn combination and with the help of a kind StackOverflow member (see the link: Python Seaborn Distplot Y value corresponding to a given X value), I was able to create the following Python plot:
I am satisfied with its appearance, except that I don't know how to put the header information in the plots. Here is the Python code that creates the charts:
""" Program to draw the sampling histogram distributions """
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
import seaborn as sns
def main():
    """ Main routine for the sampling histogram program """
    sns.set_style('whitegrid')
    markers_list = ["s", "o", "*", "^", "+"]
    # create the data dataframe as df_orig
    df_orig = pd.read_csv('lab_samples.csv')
    df_orig = df_orig.loc[df_orig.hra != -9999]
    hra_list_unique = df_orig.hra.unique().tolist()
    # create and subset df_hra_colors to match the actual hra colors in df_orig
    df_hra_colors = pd.read_csv('hra_lookup.csv')
    df_hra_colors['hex'] = np.vectorize(rgb_to_hex)(df_hra_colors['red'], df_hra_colors['green'], df_hra_colors['blue'])
    df_hra_colors.drop(labels=['red', 'green', 'blue'], axis=1, inplace=True)
    df_hra_colors = df_hra_colors.loc[df_hra_colors['hra'].isin(hra_list_unique)]
    # hard coding the current_component to pc1 here, we will extend it by looping
    # through the list of components
    current_component = 'pc1'
    num_tests = 5
    df_columns = df_orig.columns.tolist()
    start_index = 5
    for test in range(num_tests):
        current_tests_list = df_columns[start_index:(start_index + num_tests)]
        # now create the sns distplots for each HRA color and overlay the tests
        i = 1
        for _, row in df_hra_colors.iterrows():
            plt.subplot(3, 3, i)
            select_columns = ['hra', current_component] + current_tests_list
            df_current_color = df_orig.loc[df_orig['hra'] == row['hra'], select_columns]
            y_data = df_current_color.loc[df_current_color[current_component] != -9999, current_component]
            axs = sns.distplot(y_data, color=row['hex'],
                               hist_kws={"ec":"k"},
                               kde_kws={"color": "k", "lw": 0.5})
            data_x, data_y = axs.lines[0].get_data()
            axs.text(0.0, 1.0, row['hra'], horizontalalignment="left", fontsize='x-small',
                     verticalalignment="top", transform=axs.transAxes)
            for current_test_index, current_test in enumerate(current_tests_list):
                # this_x defines the series of current_component(pc1,pc2,rhob) for this test
                # indicated by 1, corresponding R program calls this test_vector
                x_series = df_current_color.loc[df_current_color[current_test] == 1, current_component].tolist()
                for this_x in x_series:
                    this_y = np.interp(this_x, data_x, data_y)
                    axs.plot([this_x], [this_y - current_test_index * 0.05],
                             markers_list[current_test_index], markersize = 3, color='black')
            axs.xaxis.label.set_visible(False)
            axs.xaxis.set_tick_params(labelsize=4)
            axs.yaxis.set_tick_params(labelsize=4)
            i = i + 1
        start_index = start_index + num_tests
    # plt.show()
    pp = PdfPages('plots.pdf')
    pp.savefig()
    pp.close()
def rgb_to_hex(red, green, blue):
    """Return color as #rrggbb for the given color values."""
    return '#%02x%02x%02x' % (red, green, blue)
if __name__ == "__main__":
    main()
The pandas code works fine and does what it is supposed to. The bottleneck is my lack of knowledge and experience with 'PdfPages' in Matplotlib. How can I show the same header information in Python/Matplotlib/Seaborn that I can show in the corresponding R visualization? By header information, I mean what the R visualization has at the top, before the histograms, i.e., 'pc1', MRP, XRD,....
I can get their values easily from my program, e.g., current_component is 'pc1', etc. But I don't know how to format the plots with the Header. Can someone provide some guidance?
You may be looking for a figure title or super title, fig.suptitle:
fig.suptitle('this is the figure title', fontsize=12)
In your case you can easily get the figure with plt.gcf(), so try
plt.gcf().suptitle("pc1")
The rest of the information in the header would be called a legend.
For the following let's suppose all subplots have the same markers. It would then suffice to create a legend for one of the subplots.
To create legend labels, you can pass the label argument to the plot call, i.e.
axs.plot( ... , label="MRP")
When later calling axs.legend() a legend will automatically be generated with the respective labels. Ways to position the legend are detailed e.g. in this answer.
Here, you may want to place the legend in terms of figure coordinates, i.e.
axs.legend(loc="lower center",bbox_to_anchor=(0.5,0.8),bbox_transform=plt.gcf().transFigure)
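Put together, the header could be sketched as follows. This is an illustration under assumptions rather than a drop-in replacement: the histograms are random data, only "pc1", "MRP" and "XRD" come from the question (the remaining test names are placeholders), and the figure-level fig.legend is used here as an alternative to anchoring an axes legend in figure coordinates.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.backends.backend_pdf import PdfPages

rng = np.random.default_rng(0)
markers = ["s", "o", "*", "^", "+"]
# "MRP" and "XRD" appear in the question; the other names are hypothetical
test_names = ["MRP", "XRD", "test_3", "test_4", "test_5"]

fig, axes = plt.subplots(3, 3, figsize=(8, 8))
for ax in axes.flat:
    ax.hist(rng.normal(size=200), bins=20, color="lightgray", edgecolor="k")
    for k, (marker, name) in enumerate(zip(markers, test_names)):
        # label= is what later becomes the legend entry
        ax.plot([k - 2], [5], marker, markersize=3, color="black", label=name)

fig.suptitle("pc1", fontsize=12)  # figure-level title

# All subplots share the same markers, so the handles of one subplot suffice
handles, labels = axes.flat[0].get_legend_handles_labels()
fig.legend(handles, labels, loc="upper center", bbox_to_anchor=(0.5, 0.95),
           ncol=len(labels), fontsize="x-small")
fig.subplots_adjust(top=0.88)  # leave room for the title and the legend row

buffer = io.BytesIO()  # stands in for a file name such as 'plots.pdf'
with PdfPages(buffer) as pdf:
    pdf.savefig(fig)
```

Passing a file name instead of `buffer` to PdfPages writes the page to disk; the values in bbox_to_anchor and subplots_adjust will likely need tuning for a real layout.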
My question is almost identical to this one. However, I'm not satisfied with the answers, because I want to generate an actual heatmap without explicitly binning the data.
To be precise, I would like to display the function that is the result of a convolution between the scatter data and a custom kernel, such as 1/x^2.
How should I implement this with matplotlib?
EDIT: Basically, what I have done is this. The result is here. I'd like to keep everything, the axis, the title, the labels and so on. Basically just change the plot to be like I described, while re-implementing as little as possible.
Convert your time series data into a numeric format with matplotlib.dates.date2num. Lay down a rectangular grid that spans your x and y ranges and do your convolution on that grid. Make a pseudo-color plot of your convolution and then reformat the x labels to be dates.
The label formatting is a little messy, but reasonably well documented. You just need to replace AutoDateFormatter with DateFormatter and an appropriate formatting string.
You'll need to tweak the constants in the convolution for your data.
import numpy as np
import datetime as dt
import pylab as plt
import matplotlib.dates as dates
t0 = dt.date.today()
t1 = t0+dt.timedelta(days=10)
times = np.linspace(dates.date2num(t0), dates.date2num(t1), 10)
time_span = times[-1]-times[0]  # renamed from dt so the datetime alias above is not shadowed
price = 100 - (times-times.mean())**2
dp = price.max() - price.min()
volume = np.linspace(1, 100, 10)
tgrid = np.linspace(times.min(), times.max(), 100)
pgrid = np.linspace(70, 110, 100)
tgrid, pgrid = np.meshgrid(tgrid, pgrid)
heat = np.zeros_like(tgrid)
for t,p,v in zip(times, price, volume):
    delt = (t-tgrid)**2
    delp = (p-pgrid)**2
    heat += v/( delt + delp*1.e-2 + 5.e-1 )**2
fig = plt.figure()
ax = fig.add_subplot(111)
ax.pcolormesh(tgrid, pgrid, heat, cmap='gist_heat_r')
plt.scatter(times, price, volume, marker='x')
locator = dates.DayLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(dates.AutoDateFormatter(locator))
fig.autofmt_xdate()
plt.show()
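The per-point loop can also be written with NumPy broadcasting, and the fixed label format mentioned earlier can be set with DateFormatter. The sketch below is a variation, not the answer's exact method: it uses a softened inverse-square kernel, v / (d^2 + 0.5), closer to the 1/x^2 asked about in the question, and the start date, the 0.01 price scaling and the softening constant are assumptions to tune for real data.

```python
import datetime
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

start = datetime.date(2024, 1, 1)  # hypothetical start date
times = mdates.date2num([start + datetime.timedelta(days=d) for d in range(10)])
price = 100 - (times - times.mean()) ** 2
volume = np.linspace(1, 100, 10)

tgrid, pgrid = np.meshgrid(np.linspace(times.min(), times.max(), 100),
                           np.linspace(70, 110, 100))

# Broadcast the grid, shape (100, 100, 1), against the 10 data points, shape (10,)
dist2 = (tgrid[..., None] - times) ** 2 + 1e-2 * (pgrid[..., None] - price) ** 2
# Softened inverse-square kernel, weighted by volume and summed over the points
heat = (volume / (dist2 + 0.5)).sum(axis=-1)

fig, ax = plt.subplots()
ax.pcolormesh(tgrid, pgrid, heat, cmap="gist_heat_r", shading="auto")
ax.scatter(times, price, s=volume, marker="x")
ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))  # fixed date format
fig.autofmt_xdate()
```

The broadcast builds a 100 x 100 x 10 intermediate array, which is fine here but grows with the number of points; for large data sets the explicit loop uses less memory.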