I have these data structures:
X axis values:
delta_Array = np.array([1000,2000,3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000])
Y Axis values
error_matrix =
[[ 24.22468454 24.22570421 24.22589308 24.22595919 24.22598979
24.22600641 24.22601644 24.22602294 24.2260274 24.22603059]
[ 28.54275713 28.54503017 28.54545119 28.54559855 28.54566676
28.54570381 28.54572615 28.54574065 28.5457506 28.54575771]]
How do I plot them as a line plot using matplotlib and python
This code I came up with renders a flat line as follows
from pylab import *   # the code assumes pylab-style imports

figure(3)
for i in range(error_matrix.shape[0]):
    plot(delta_Array, error_matrix[i, :])
title('errors')
xlabel('deltas')
ylabel('errors')
grid()
show()
The problem looks like it is the scaling of the axes, but I'm not sure how to fix it. Any ideas or suggestions on how to get the curvature to show up properly?
You could use ax.twinx to create twin axes:
import matplotlib.pyplot as plt
import numpy as np
delta_Array = np.array([1000,2000,3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000])
error_matrix = np.array(
[[ 24.22468454, 24.22570421, 24.22589308, 24.22595919, 24.22598979, 24.22600641, 24.22601644, 24.22602294, 24.2260274, 24.22603059],
[ 28.54275713, 28.54503017, 28.54545119, 28.54559855, 28.54566676, 28.54570381, 28.54572615, 28.54574065, 28.5457506, 28.54575771]])
fig = plt.figure()
ax = []
ax.append(fig.add_subplot(1, 1, 1))
ax.append(ax[0].twinx())
colors = ('red', 'blue')
for i, c in zip(range(error_matrix.shape[0]), colors):
    ax[i].plot(delta_Array, error_matrix[i, :], color=c)
plt.show()
yields
The red line corresponds to error_matrix[0, :], the blue line to error_matrix[1, :].
Another possibility is to plot the ratio error_matrix[0, :]/error_matrix[1, :].
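That ratio idea in a minimal sketch (reusing the arrays from the question; no twin axes needed, since the ratio's variation is no longer swamped by the constant offset between the two curves):

```python
import numpy as np
import matplotlib.pyplot as plt

delta_Array = np.array([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000])
error_matrix = np.array(
    [[24.22468454, 24.22570421, 24.22589308, 24.22595919, 24.22598979,
      24.22600641, 24.22601644, 24.22602294, 24.2260274, 24.22603059],
     [28.54275713, 28.54503017, 28.54545119, 28.54559855, 28.54566676,
      28.54570381, 28.54572615, 28.54574065, 28.5457506, 28.54575771]])

# Ratio of the two error curves as a single line on one ordinary axis.
ratio = error_matrix[0] / error_matrix[1]
plt.plot(delta_Array, ratio, 'k-')
plt.xlabel('deltas')
plt.ylabel('error ratio')
```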
Matplotlib is showing you the right thing. If you want both curves on the same y scale, then they will be flat because their difference is much larger than the variation in each. If you don't mind different y scales, then do as unutbu suggested.
If you want to compare the rate of change between the functions, then I'd suggest normalising by the highest value in each:
import matplotlib.pyplot as plt
import numpy as np
plt.plot(delta_Array, error_matrix[0] / np.max(error_matrix[0]), 'b-')
plt.plot(delta_Array, error_matrix[1] / np.max(error_matrix[1]), 'r-')
plt.show()
And by the way, you don't need to be explicit in the dimensions of your 2D array. When you use error_matrix[i,:], it is the same as error_matrix[i].
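To illustrate that last point, a quick check with a small throwaway array:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # a small example array
# Indexing a row with a[i, :] and with a[i] yields the same result.
row_full = a[0, :]
row_short = a[0]
assert np.array_equal(row_full, row_short)
```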
I have the following dataset:
results=[array([6.06674849e-18, 2.28597646e-03]), array([0.02039694, 0.01245901, 0.01264321, 0.00963068]), array([2.28719585e-18, 5.14800709e-02, 2.90957713e-02, 0.00000000e+00,
4.22761202e-19, 3.21765246e-02, 8.86959187e-03, 0.00000000e+00])]
I'd like to create a heatmap from it which looks similar to the following figure:
Is it possible to create such diagram with seaborn or matplotlib or any other plotting package, and if so, how to do this?
One approach is to equalize the row lengths with np.repeat.
This only works well if every row's length is a divisor of the longest row's length.
The data suggest using a LogNorm, although such a norm runs into trouble with the zeros in the sample input (the log of zero is undefined).
Some code to illustrate the idea:
from matplotlib import pyplot as plt
from matplotlib import colors as mcolors
import numpy as np
results = [np.array([6.06674849e-18, 2.28597646e-03]),
np.array([0.02039694, 0.01245901, 0.01264321, 0.00963068]),
np.array([2.28719585e-18, 5.14800709e-02, 2.90957713e-02, 0.00000000e+00,
4.22761202e-19, 3.21765246e-02, 8.86959187e-03, 0.00000000e+00])]
longest = max([len(row) for row in results])
equalized = np.array( [np.repeat(row, longest // len(row)) for row in results])
# equalized = np.where(equalized == 0, np.NaN, equalized)
norm = mcolors.LogNorm()
heatmap = plt.imshow(equalized, cmap='nipy_spectral', norm=norm, interpolation='nearest',
origin='lower', extent=[0, 6000, 0.5, len(results)+0.5])
plt.colorbar(heatmap)
plt.gca().set_aspect('auto')
plt.yticks(range(1, len(results) + 1))
plt.show()
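Regarding the zeros that trip up the LogNorm, one option (a sketch with toy data, not the only way) is to mask them out and give the colormap a dedicated "bad" color, so the zero cells render as flat gaps instead of breaking the log scaling:

```python
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import colors as mcolors

equalized = np.array([[0.0, 0.5, 0.25, 0.0],
                      [0.1, 0.0, 0.3, 0.2]])  # toy data containing zeros

masked = np.ma.masked_equal(equalized, 0)   # hide the zeros from the norm
cmap = plt.get_cmap('nipy_spectral').copy()
cmap.set_bad('lightgrey')                   # color used for the masked (zero) cells

plt.imshow(masked, cmap=cmap, norm=mcolors.LogNorm(), interpolation='nearest')
plt.colorbar()
```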
Another example with 7 levels (random numbers). Input generated as:
bands = 7
results = [np.random.uniform(0, 1, 2**i) for i in range(1, bands+1)]
I am trying to generate a histogram using matplotlib. I am reading data from the following file:
https://github.com/meghnasubramani/Files/blob/master/class_id.txt
My intent is to generate a histogram with the following bins: 1, 2-5, 5-100, 100-200, 200-1000, >1000.
When I generate the graph it doesn't look nice.
I would like to normalize the y axis to (frequency of occurrence in a bin / total items). I tried using the density parameter, but whenever I do, my graph ends up completely blank. How do I go about doing this?
How do I get the widths of the bars to be the same, even though the bin ranges vary?
Is it also possible to specify the ticks on the histogram? I want the ticks to correspond to the bin ranges.
import matplotlib.pyplot as plt
FILE_NAME = 'class_id.txt'
class_id = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [1, 2, 5, 100, 200, 1000, max(class_id)]
x = plt.hist(class_id, bins=num_bins, histtype='bar', align='mid', rwidth=0.5, color='b')
print(x)
plt.xlabel('Items')
plt.ylabel('Frequency')
plt.show()
As suggested by importanceofbeingernest, we can use bar charts to plot categorical data, and we need to categorize the values into bins, for example with pandas:
import matplotlib.pyplot as plt
import pandas
FILE_NAME = 'class_id.txt'
class_id_file = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [0, 2, 5, 100, 200, 1000, max(class_id_file)]
categories = pandas.cut(class_id_file, num_bins)
df = pandas.DataFrame(class_id_file)
dfg = df.groupby(categories).count()
bins_labels = ["1-2", "2-5", "5-100", "100-200", "200-1000", ">1000"]
plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=bins_labels)
#plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=categories.categories)
plt.xlabel('Items')
plt.ylabel('Frequency')
Not what you asked for, but you could also stay with the histogram and choose a logarithmic scale to improve readability:
plt.xscale('log')
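If you go down the log-scale route, note that equally spaced bins become lopsided on a log axis; log-spaced bin edges keep the bars visually even. A sketch with hypothetical data:

```python
import numpy as np
import matplotlib.pyplot as plt

data = [1, 3, 8, 40, 150, 700, 1200, 5000]  # hypothetical values
# Bin edges spaced evenly in log space, from 10**0 to 10**4.
edges = np.logspace(0, 4, num=9)
counts, _, _ = plt.hist(data, bins=edges)
plt.xscale('log')
```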
I have a list of data in which the numbers are between 1000 and 20 000.
data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
When I plot a histogram using the hist() function, the y-axis represents the number of occurrences of the values within a bin. Instead of the number of occurrences, I would like to have the percentage of occurrences.
Code for the above plot:
f, ax = plt.subplots(1, 1, figsize=(10,5))
ax.hist(data, bins = len(list(set(data))))
I've been looking at this post, which describes an example using FuncFormatter, but I can't figure out how to adapt it to my problem. Some help and guidance would be welcome :)
EDIT: The main issue is with the to_percent(y, position) function used by the FuncFormatter. The y corresponds to one given value on the y-axis, I guess. I need to divide this value by the total number of elements, which I apparently can't pass to the function...
EDIT 2: Current solution, which I dislike because of its use of a global variable:
def to_percent(y, position):
    # Ignore the passed in position. This has the effect of scaling the default
    # tick locations.
    global n
    s = str(round(100 * y / n, 3))
    print(y)
    # The percent symbol needs escaping in latex
    if matplotlib.rcParams['text.usetex'] is True:
        return s + r'$\%$'
    else:
        return s + '%'

def plotting_hist(folder, output):
    global n
    data = list()
    # Do stuff to create data from folder
    n = len(data)
    f, ax = plt.subplots(1, 1, figsize=(10, 5))
    ax.hist(data, bins=len(list(set(data))), rwidth=1)
    formatter = FuncFormatter(to_percent)
    plt.gca().yaxis.set_major_formatter(formatter)
    plt.savefig("{}.png".format(output), dpi=500)
EDIT 3: Method with density = True
Actual desired output (method with global variable):
Other answers seem utterly complicated. A histogram which shows the proportion instead of the absolute amount can easily be produced by weighting the data with 1/n, where n is the number of data points.
Then a PercentFormatter can be used to show the proportion (e.g. 0.45) as percentage (45%).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
plt.hist(data, weights=np.ones(len(data)) / len(data))
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.show()
Here we see that three of the seven values are in the first bin, i.e. 3/7 ≈ 43%.
Alternatively, simply set density=True and the weights will be normalized implicitly.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
plt.hist(data, density=True)
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.show()
You can calculate the percentages yourself, then plot them as a bar chart. This requires you to use numpy.histogram (which matplotlib uses "under the hood" anyway). You can then adjust the y tick labels:
import matplotlib.pyplot as plt
import numpy as np
f, ax = plt.subplots(1, 1, figsize=(10,5))
data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
heights, bins = np.histogram(data, bins = len(list(set(data))))
percent = [i/sum(heights)*100 for i in heights]
ax.bar(bins[:-1], percent, width=2500, align="edge")
vals = ax.get_yticks()
ax.set_yticks(vals)  # fix the tick positions before relabelling them
ax.set_yticklabels(['%1.2f%%' % i for i in vals])
plt.show()
I think the simplest way is to use seaborn, which is a layer on top of matplotlib. Note that you can still use plt.subplots(), figsize(), ax, and fig to customize your plot.
import seaborn as sns
And using the following code:
sns.displot(data, stat='probability')
Also, sns.displot has many parameters that make very complex and informative graphs easy to produce. They can be found here: displot Documentation
You can use functools.partial to avoid using globals in your example.
Just add n to function parameters:
def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))
    if matplotlib.rcParams['text.usetex']:
        return s + r'$\%$'
    return s + '%'
and then create a partial function of two arguments that you can pass to FuncFormatter:
percent_formatter = partial(to_percent, n=len(data))
formatter = FuncFormatter(percent_formatter)
Full code:
from functools import partial

import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]

def to_percent(y, position, n):
    s = str(round(100 * y / n, 3))
    if matplotlib.rcParams['text.usetex']:
        return s + r'$\%$'
    return s + '%'

def plotting_hist(data):
    f, ax = plt.subplots(figsize=(10, 5))
    ax.hist(data, bins=len(set(data)), rwidth=1)
    percent_formatter = partial(to_percent, n=len(data))
    formatter = FuncFormatter(percent_formatter)
    plt.gca().yaxis.set_major_formatter(formatter)
    plt.show()

plotting_hist(data)
gives:
I found yet another way to do it. As you can see in other answers, density=True alone doesn't solve the problem, because it normalizes the area under the histogram to 1 rather than the bar heights. But that can easily be converted: just divide by the width of the bars.
import matplotlib.pyplot as plt

data = [1000, 1000, 5000, 3000, 4000, 16000, 2000]
bins = 10

plt.hist(data, bins=bins, density=True)
bar_width = (max(data) - min(data)) / bins   # width of one bar
ticks = plt.yticks()[0]                      # current tick positions
tick_labels = [f"{t * bar_width:0.2%}" for t in ticks]  # density -> proportion -> percent
plt.yticks(ticks=ticks, labels=tick_labels)  # set the new labels
plt.show()
However, the solution weights=np.ones(len(data)) / len(data) may be shorter and cleaner. This is just another way, without numpy.
I'm kind of new to python, so I'm hoping that the answer to my question is relatively straightforward.
I'm trying to make a choropleth map using geopandas. However, since I'm making multiple maps that need to be compared to each other, it is indispensable that I use a custom data classification scheme (rather than quantiles or jenks). Hence, I've been trying to work with the User_Defined scheme, and I'm able to create the bins but I don't know how to apply them to the map itself.
This is what I did to create my classification scheme:
import pysal.esda.mapclassify as ps
from pysal.esda.mapclassify import User_Defined
bins = [5, 20, 100, 600, 1000, 3000, 5000, 10000, 20000, 400000]
ud = User_Defined(projected_world_exports['Value'], bins)
(where 'Value' is the column I plot in the map)
And then when I try to plot the choropleth map I don't know what the scheme is meant to be called
projected_world_exports.plot(column='Value', cmap='Greens', scheme = ?????)
If anyone could help I would be hugely appreciative!
Thanks x
Here is an alternative approach that does not require modifying the geopandas code. First label the bins so that you can create a custom colormap mapping each bin label to a specific color. Then create a column in your geodataframe that specifies which bin label applies to each row, and use that column to plot the choropleth with the custom colormap.
from matplotlib.colors import LinearSegmentedColormap
bins = [5, 20, 100, 600, 1000, 3000, 5000, 10000, 20000, 400000]
# Maps values to a bin.
# The mapped values must start at 0 and end at 1.
def bin_mapping(x):
    for idx, bound in enumerate(bins):
        if x < bound:
            return idx / (len(bins) - 1.0)
    return 1.0  # values at or above the last bound fall into the top bin
# Create the list of bin labels and the list of colors
# corresponding to each bin
bin_labels = [idx / (len(bins) - 1.0) for idx in range(len(bins))]
color_list = ['#edf8fb', '#b2e2e2', '#66c2a4', '#2ca25f', '#006d2c', \
'#fef0d9', '#fdcc8a', '#fc8d59', '#e34a33', '#b30000']
# Create the custom color map
cmap = LinearSegmentedColormap.from_list('mycmap',
[(lbl, color) for lbl, color in zip(bin_labels, color_list)])
projected_world_exports['Bin_Lbl'] = projected_world_exports['Value'].apply(bin_mapping)
projected_world_exports.plot(column='Bin_Lbl', cmap=cmap, alpha=1, vmin=0, vmax=1)
I took a look at the code of geopandas plotting function (https://github.com/geopandas/geopandas/blob/master/geopandas/plotting.py) but I guess the plot method only accepts one of the three names ("quantiles", "equal_interval", "fisher_jenks"), and not directly a list of bins or a pysal.esda.mapclassify classifier such as User_Defined.
(I guess it could be linked to that issue where the last comment is about defining an API for "user defined" binning).
However for now I guess you can achieve this by slightly modifying and reusing the functions from the file I linked.
For example, you could rewrite your own version of plot_dataframe like this:
import numpy as np
def plot_dataframe(s, column, binning, cmap,
                   linewidth=1.0, figsize=None, **color_kwds):
    import matplotlib.pyplot as plt

    # use the class labels computed by the pysal classifier
    values = np.array(binning.yb)
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_aspect('equal')
    mn = values.min()
    mx = values.max()
    poly_idx = np.array(
        (s.geometry.type == 'Polygon') | (s.geometry.type == 'MultiPolygon'))
    polys = s.geometry[poly_idx]
    if not polys.empty:
        plot_polygon_collection(ax, polys, values[poly_idx], True,
                                vmin=mn, vmax=mx, cmap=cmap,
                                linewidth=linewidth, **color_kwds)
    plt.draw()
    return ax
Then you would need to define the functions _flatten_multi_geoms and plot_polygon_collection by copying them, and you are ready to use it like this:
bins = [5, 20, 100, 600, 1000, 3000, 5000, 10000, 20000, 400000]
ud = User_Defined(projected_world_exports['Value'], bins)
plot_dataframe(projected_world_exports, 'Value', ud, 'Greens')
This can be done easily using the UserDefined scheme. When you define such a scheme, a mapclassify.MapClassifier object is used under the hood; in fact, all the supported schemes are provided by mapclassify.
To pass your bins, supply them via the classification_kwds argument.
So, your code is going to be:
projected_world_exports.plot(
    column='Value',
    cmap='Greens',
    scheme='UserDefined',
    classification_kwds={'bins': bins}
)
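For intuition, what the UserDefined scheme does with those bin edges is essentially what numpy.digitize does: each value gets the index of the first upper bound it does not exceed. A sketch with made-up 'Value' entries:

```python
import numpy as np

bins = [5, 20, 100, 600, 1000, 3000, 5000, 10000, 20000, 400000]
values = np.array([3, 17, 250, 9999, 150000])  # made-up 'Value' entries

# With right=True, a value x lands in class i when bins[i-1] < x <= bins[i],
# i.e. the first upper bound it does not exceed.
classes = np.digitize(values, bins, right=True)
print(classes)  # -> [0 1 3 7 9]
```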
Is there a way to get matplotlib to connect data from two different data sets with the same line?
Context: I need to plot some data in log scale, but some of them are negative. I use the workaround of plotting the data absolute value in different colours (red for positive and green for negative), something like:
import pylab as pl
pl.plot( x, positive_ys, 'r-' ) # positive y's
pl.plot( x, abs( negative_ys ), 'g-' ) # negative y's
pl.show()
However, as they represent the same quantity, it would be helpful to have the two data series connected by the same line. Is this possible?
I cannot use pl.plot( x, abs( ys )) because I need to be able to differentiate between the positive and originally negative values.
With numpy you can use logical indexing.
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
x = np.array([10000, 1000, 100, 10, 1, 5, 50, 500, 5000, 50000])
y = np.array([-10000, -1000, -100, -10, -1, 5, 50, 500, 5000, 50000])
ax.plot(x,abs(y),'+-b',label='all data')
ax.plot(abs(x[y<= 0]),abs(y[y<= 0]),'o',markerfacecolor='none',
markeredgecolor='r',
label='we are negative')
ax.set_xscale('log')
ax.set_yscale('log')
ax.legend(loc=0)
plt.show()
The key feature is first plotting all absolute y-values and then re-plotting those that were originally negative as hollow circles to single them out. This second step uses the logical indexing x[y<=0] and y[y<=0] to only pick those elements of the y-array which are negative.
The example above gives you this figure:
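The logical-indexing step in isolation, on a tiny hypothetical array:

```python
import numpy as np

y = np.array([-10, -1, 5, 50, -100, 500])
mask = y <= 0          # boolean array, True where y is non-positive
negatives = y[mask]    # picks exactly those elements, preserving their order
```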
If you really have two different data sets, the following code will give you the same figure as above:
x1 = np.array([1, 10, 100, 1000, 10000])
x2 = np.array([5, 50, 500, 5000, 50000])
y1 = np.array([-1, -10, -100, -1000, -10000])
y2 = np.array([5, 50, 500, 5000, 50000])
x = np.concatenate((x1,x2))
y = np.concatenate((y1,y2))
sort_idx = np.argsort(y)  # sort by y; also avoids shadowing the built-in `sorted`
ax.plot(x[sort_idx], abs(y[sort_idx]), '+-b', label='all data')
ax.plot(abs(x[y <= 0]), abs(y[y <= 0]), 'o', markerfacecolor='none',
        markeredgecolor='r',
        label='we are negative')
Here, you first use np.concatenate to combine both the x- and the y-arrays. Then you employ np.argsort to sort the y-array so that you do not get an overly zig-zaggy line when plotting. You use that index array (sort_idx) in the first plot call. As the second plot only draws symbols and no connecting line, it does not require sorted arrays.