I created a bar chart using Matplotlib from the count of unique strings in a NumPy array. Now I would like to display only the top 10 most frequent species in the bar chart. I am new to Python, so I am having trouble figuring it out. This is also my first question here, so let me know if I'm missing any important information.
test_indices = numpy.where((obj.year == 2014) & (obj.native == "Native"))
SpeciesList2014 = numpy.append(SpeciesList2014, obj.species_code[test_indices])
labels, counts = numpy.unique(SpeciesList2014, return_counts=True)
indexSort = numpy.argsort(counts)
plt.bar(labels[indexSort][::-1], counts[indexSort][::-1], align='center')
plt.xticks(rotation=45)
plt.show()
You already have the values in a sorted array, but you only want to select the ten values with the highest counts.
np.argsort sorts in ascending order, so the larger counts come last, and you can exploit NumPy indexing:
plt.bar(labels[indexSort][-1:-11:-1], counts[indexSort][-1:-11:-1], align='center')
where [a:b:c] means a = start index, b = end index (exclusive), c = step, and negative values count from the end of the array.
Or alternatively:
n=counts.shape[0]
plt.bar(labels[indexSort][n-10:], counts[indexSort][n-10:], align='center')
which plots in increasing order.
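To make the two slices concrete, here is a small sketch on made-up data (the `labels` and `counts` arrays below are invented for illustration and are not taken from the question). It shows that `[-1:-11:-1]` and `[-10:]` pick the same ten entries, just in opposite order.

```python
import numpy as np

# Invented data: 12 labels with distinct counts (illustration only)
labels = np.array(list("abcdefghijkl"))
counts = np.array([5, 1, 9, 3, 7, 2, 8, 4, 6, 0, 11, 10])

order = np.argsort(counts)       # indices that sort counts in ascending order

top_desc = order[-1:-11:-1]      # last ten indices, walking backwards
top_asc = order[-10:]            # last ten indices, in ascending order

print(labels[top_desc], counts[top_desc])  # largest count first
print(labels[top_asc], counts[top_asc])    # largest count last
```

Either index array can be passed to both `labels[...]` and `counts[...]` before calling `plt.bar`, which keeps the two arrays aligned.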
Do yourself a favor and learn about NumPy indexing.
In this simple case, the last 10 elements of an array are selected with the notation [-10:], which you can read as "from the tenth-to-last element through the last element".
import numpy as np
import matplotlib.pyplot as plt
# synthetic data
np.random.seed(20210428)
SpeciesList2014 = np.random.randint(0, 100, 2000)
# this is from your code
species, counts = np.unique(SpeciesList2014, return_counts=True)
topindices = np.argsort(counts)[-10:]
# here you probably can have, simply, topspecies = species[topindices]
topspecies = [repr(label) for label in species[topindices]]
topcounts = counts[topindices]
# plotting
plt.bar(topspecies, topcounts)
plt.show()
Plotting a discrete xarray DataArray variable in a Dataset with xr.plot.scatter() yields a legend in which the discrete values are ordered arbitrarily, corresponding to unpredictable colour assignment to each level. Would it be possible to specify a specific colour or position for a given discrete value?
A simple reproducible example:
import xarray as xr
# get a predefined dataset
uvz = xr.tutorial.open_dataset("eraint_uvz")
# select a 2-D subset of the data
uvzr = uvz.isel(level=0, month=0, latitude=slice(150, 242),
longitude=slice(240, 300))
# define a discrete variable based on levels of a continuous variable
uvzr['zone'] = 'A'
uvzr['zone'] = uvzr.zone.where(uvzr.u > 30, other='C')
uvzr['zone'] = uvzr.zone.where(uvzr.u > 10, other='B')
# do the plot
xr.plot.scatter(uvzr, x='longitude', y='latitude', hue='zone')
Is there a way to ensure that the legend entries are arranged 'A', 'B', 'C' from top to bottom, say? Or ensure that A is assigned to blue, and B to orange, for example?
I know I can reset the values of the matplotlib color cycler, but for that to be useful I first need to know which order the discrete values will be plotted in.
I'm using xarray v2022.3.0 on python 3.8.6. With an earlier version of xarray (I think 0.16) the levels were arranged alphabetically.
I found an ugly workaround using xarray.Dataset.stack and xr.where(..., drop=True), in case anyone else is stuck with a similar problem.
import numpy as np # for unique, to cycle through values
import matplotlib.pyplot as plt # to get a legend
# instead of np.unique you could pass an iterable of your choice
# specifying the order
for value in np.unique(uvzr.zone):
    # convert to a 1-D dataframe with a co-ordinate including all
    # unique combinations of latitude-longitude values
    uvzr_stacked = uvzr.stack({'location':('longitude', 'latitude')})
    # now select only those grid points in zone value
    uvzr_stacked = uvzr_stacked.where(uvzr_stacked.zone == value,
                                      drop=True)
    # the plotting function can't see the original dims any more;
    # a new name is required, however
    uvzr_stacked['lat'] = uvzr_stacked.latitude
    uvzr_stacked['lon'] = uvzr_stacked.longitude
    # plot!
    xr.plot.scatter(uvzr_stacked, x='lon', y='lat', hue='zone',
                    add_guide=False)
plt.legend(title='zone')
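If going through xarray's plotting is not essential, a plain Matplotlib loop gives direct control over both the legend order and the colours. The following is only a sketch on synthetic data: the arrays `lon`, `lat`, `u` and the `zone_colours` mapping are assumptions made for illustration, and the approach replaces `xr.plot.scatter` rather than configuring it.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for longitude, latitude and the wind component u
rng = np.random.default_rng(0)
lon = rng.uniform(0, 10, 200)
lat = rng.uniform(0, 10, 200)
u = rng.uniform(0, 40, 200)

# Same zoning rule as in the question: A if u > 30, B if u <= 10, otherwise C
zone = np.where(u > 30, "A", np.where(u > 10, "C", "B"))

# The dict order fixes the legend order, the values fix the colours
zone_colours = {"A": "tab:blue", "B": "tab:orange", "C": "tab:green"}

fig, ax = plt.subplots()
for name, colour in zone_colours.items():
    mask = zone == name
    ax.scatter(lon[mask], lat[mask], color=colour, label=name)
ax.legend(title="zone")

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```

Call `plt.show()` afterwards to display the figure. With real data, the 2-D fields would first need to be flattened (for example with `.values.ravel()`) so that the boolean mask lines up with the coordinates.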
I am trying to subset a matrix by using values from another smaller matrix. The number of rows in each are the same, but the smaller matrix has fewer columns. Each column in the smaller matrix contains the value of the column in the larger matrix that should be referenced. Here is what I have done, along with comments that hopefully describe this better, along with what I have tried. (The wrinkle in this is that the values of the columns to be used in each row change...)
I have tried Google, searching on stackoverflow, etc and can't find what I'm looking for. (The closest I came was something in sage called matrix_from_columns, which isn't being used here) So I'm probably making a very simple referencing error.
TIA,
mconsidine
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
#Problem: for each row in a matrix/image I need to replace
# a value in a particular column in that row by a
# weighted average of some of the values on either
# side of that column in that row. The wrinkle
# is that the column that needs to be changed may
# vary from row to row. The columns that need to
# have their values changed are stored in an array.
#
# How do I do something like:
# img[:, selectedcolumnarray] = somefunction(img,targetcolumnmatrix)
#
# I can do this for setting the selectedcolumnarray to a value, like 0
# But I am not figuring out how to select the targeted values to
# average.
#dimensions of subset of the matrix/image that will be averaged
rows = 7
columns = 5
#weights that will be used to average surrounding values
the_weights = np.ones((rows,columns)).astype(float)*(1/columns)
print(the_weights)
#make up some data to create a set of column
# values that vary by row
y = np.asarray(range(0,rows)).astype(float)
x = -0.095*(y**2) - 0.05*y + 12.123
fit=[x.astype(int),x-x.astype(int),y]
print(np.asarray(fit)[0])
#create a test array, eg "image' of 20 columns that will have
# values in targeted columns replaced
testarray = np.asarray(range(1,21))
img = np.ones((rows,20)).astype(np.uint16)
img = img*testarray.T #give it some values
print(img)
#values of the rows that will be replaced
targetcolumn = np.asarray(fit)[0].astype(int)
print(targetcolumn)
#calculate the range of columns in each row that
# will be used in the averaging
startcol = targetcolumn-2
endcol = targetcolumn+2
testcoords=np.linspace(startcol,endcol,5).astype(int).T
#this is the correct set of columns in the corresponding
# row to use for averaging
print(testcoords)
img2=img.copy()
#this correctly replaces the targetcolumn values with 0
# but I want to replace them with the sum of the values
# in the respective row of testcoords, weighted by the_weights
img2[np.arange(rows),targetcolumn]=0
#so instead of selecting the one column, I want to select
# the block of the image represented by testcoords, calculate
# a weighted average for each row, and use those values instead
# of 0 to set the values in targetcolumn
#starting again with the 7x20 (rowsxcolumns) "image"
img3=img.copy()
#this gives me the wrong size, ie 7,7,5 when I think I want 7,5;
print(testcoords.shape)
#I thought "take" might help, but ... nope
#img3=np.take(img,testcoords,axis=1)
#something here maybe??? :
#https://stackoverflow.com/questions/40084931/taking-subarrays-from-numpy-array-with-given-stride-stepsize
# but I can't figure out what
##### plot surface to try to visualize what is going on ####
'''
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})
# Make data.
X = np.arange(0, 20, 1)
Y = np.arange(0, rows, 1)
X, Y = np.meshgrid(X, Y)
Z = img2
# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
linewidth=0, antialiased=False)
# Customize the z axis.
ax.set_zlim(0, 20)
ax.zaxis.set_major_locator(LinearLocator(10))
# A StrMethodFormatter is used automatically
ax.zaxis.set_major_formatter('{x:.02f}')
# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)
plt.show()
'''
It turns out that "take_along_axis" does the trick:
imgsubset = np.take_along_axis(img3,testcoords,axis=1)
print(imgsubset)
newvalues = imgsubset * the_weights
print(newvalues)
newvalues = np.sum(newvalues, axis=1)
print(newvalues)
img3[np.arange(rows),targetcolumn] = np.round(newvalues,0)
print(img3)
(It becomes more obvious when non trivial weights are used.)
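For readers who want to see the effect on a smaller case, here is a minimal sketch with invented data (a 3x20 array of squares, so that the average differs from the original value). The names `target`, `window_cols` and `window` are made up for this example.

```python
import numpy as np

# Invented 3x20 "image": every row holds 1, 4, 9, ..., 400
img = np.tile(np.arange(1, 21) ** 2, (3, 1))

# One target column per row, plus the two neighbours on each side
target = np.array([12, 11, 5])
window_cols = target[:, None] + np.arange(-2, 3)   # shape (3, 5)

# Pick, for each row, the values at that row's own five columns
window = np.take_along_axis(img, window_cols, axis=1)

# Equal weights here; any length-5 weight vector works the same way
weights = np.full(5, 1 / 5)
averaged = window @ weights

out = img.copy()
out[np.arange(img.shape[0]), target] = np.round(averaged)
print(averaged)
```

The same selection can also be written with integer array indexing, `img[np.arange(3)[:, None], window_cols]`, which yields an identical (3, 5) result.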
Thanks for listening...
mconsidine
I am a medical physics student trying to simulate photon detection - I succeeded (below) but I want to make it better by speeding it up: it currently takes 50 seconds to run and I want it to run in some fraction of that time. I assume someone more knowledgeable in Python could optimize it to complete within less than 10 seconds (without reducing num_photons_detected values). Thank you very much for trying out this little optimization challenge.
from random import seed
from random import random
import random
import matplotlib.pyplot as plt
import numpy as np
rows, cols = (25, 25)
num_photons_detected = [10**3, 10**4, 10**5, 10**6, 10**7]
lesionPercentAboveNoiseLevel = [1, 0.20, 0.10, 0.05]
index_range = np.array([i for i in range(rows)])
for l in range(len(lesionPercentAboveNoiseLevel)):
    pixels = np.array([[0.0 for i in range(cols)] for j in range(rows)])
    for k in range(len(num_photons_detected)):
        random.seed(a=None, version=2)
        photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)])
        counts = 0
        while num_photons_detected[k] > counts:
            for i in photons_random_pixel_choice:
                photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)]) #further ensures random pixel selection
                for j in photons_random_pixel_choice:
                    pixels[i,j] +=1
                    counts +=1
        plt.imshow(pixels, cmap="gray") #in the resulting images/graphs, x is on the vertical and y on the horizontal
        plt.show()
I think that, aside from efficiency issues, a problem with the code is that it does not select the positions of photons truly at random. Instead, it selects row numbers, and then for each selected row it picks the column numbers where photons will be observed in that row. As a result, if a row number is not selected, there will be no photons in that row at all, and if the same row is selected several times, there will be many photons in it. This is visible in the produced plots, which have a clear pattern of lighter and darker rows.
Assuming that this is unintended and that each pixel should have equal chances of being selected, here is a function generating an array of a given size, with a given number of randomly selected pixels:
import numpy as np
def generate_photons(rows, cols, num_photons):
    rng = np.random.default_rng()
    # draw one flat pixel index per photon
    indices = rng.choice(rows*cols, num_photons)
    pix = np.zeros(rows*cols)
    # unbuffered add, so repeated indices are all counted
    np.add.at(pix, indices, 1)
    return pix.reshape(rows, cols)
You can use it to produce images with specified parameters. E.g.:
import matplotlib.pyplot as plt
pixels = generate_photons(rows=25, cols=25, num_photons=10**4)
plt.imshow(pixels, cmap="gray")
plt.show()
photons_random_pixel_choice = np.array([random.choice(index_range) for z in range(rows)])
It seems like the goal here is:
Use a pre-made sequence of integers, 0 to 24 inclusive, to select one of those values.
Repeat that process 25 times in a list comprehension, to get a Python list of 25 random values in that range.
Make a 1-d Numpy array from those results.
This is very much missing the point of using Numpy. If we want integers in a range, then we can directly ask for those. But more importantly, we should let Numpy do the looping as much as possible when using Numpy data structures. This is where it pays to read the documentation:
size: int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.
So, just make it directly with a NumPy random generator: create rng = np.random.default_rng() once, then use photons_random_pixel_choice = rng.integers(rows, size=(rows,)).
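Putting the two ideas from this thread together (uniformly random pixels, and letting NumPy do the looping), a rough sketch could look like the following. Note that `np.bincount` is swapped in here as a counting step; it does not appear in either answer, and the variable names are chosen for illustration.

```python
import numpy as np

rows, cols = 25, 25
num_photons = 10**5

rng = np.random.default_rng()

# One flat pixel index per photon, all drawn in a single call
flat_indices = rng.integers(rows * cols, size=num_photons)

# Count how often each flat index occurs, then fold back into a 2-D image
pixels = np.bincount(flat_indices, minlength=rows * cols).reshape(rows, cols)

print(pixels.shape, pixels.sum())
```

The resulting `pixels` array can be passed to `plt.imshow(pixels, cmap="gray")` exactly as in the question.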
I was plotting a scatter plot to show null values in a DataFrame. As you can see, the plt.scatter() call is not expressive enough: the relation between list(range(0,1200)) and 'a' is not clear unless you read the previous lines. Can plt.scatter(x, y) be written in a more explicit way, so that it is easy to understand how x and y are related? Ideally, somebody who sees only the plt.scatter(x, y) call would understand what it is about.
a = []
for i in range(0,1200):
    feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum()>i]
    a.append(len(feature_with_na))
plt.scatter(list(range(0,1200)), a)
On your x axis you have the number, then on the y-axis you want to plot the number of columns in your DataFrame that have more than that number of null values.
Instead of your loop, you can count the null values in each column and use NumPy broadcasting ([:, None]) to compare those counts with an array of your numbers. This lets you define an array xarr of the numbers once and reuse that same array in the comparison.
Sample Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.choice([1,2,3,4,5,np.nan], (100,10)))
Code
# Range of 'x' values to consider
xarr = np.arange(0, 100)
plt.scatter(xarr, (df.isnull().sum().to_numpy()>xarr[:, None]).sum(axis=1))
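To see what the broadcast comparison does, here is a tiny worked sketch. The `null_counts` array is an invented stand-in for `df.isnull().sum().to_numpy()` on a three-column DataFrame.

```python
import numpy as np

# Invented per-column null counts (stand-in for df.isnull().sum().to_numpy())
null_counts = np.array([0, 3, 5])

# Thresholds to test: "more than 0 nulls", "more than 1 null", ...
xarr = np.arange(6)

# xarr[:, None] has shape (6, 1), null_counts has shape (3,),
# so the comparison broadcasts to a (6, 3) boolean table
comparison = null_counts > xarr[:, None]

# Summing along axis 1 counts, per threshold, the columns that exceed it
n_columns = comparison.sum(axis=1)
print(n_columns)   # [2 2 2 1 1 0]
```

Each row of `comparison` corresponds to one x value and each column to one DataFrame column, which is why the sum runs over `axis=1`.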
ALollz's answer is good, but here's a less NumPy-heavy alternative if that's your thing:
feature_null_counts = df.isnull().sum()
n_nulls = list(range(100))
features_with_n_nulls = [sum(feature_null_counts > n) for n in n_nulls]
plt.scatter(n_nulls, features_with_n_nulls)
To illustrate my problem I prepared an example:
First, I have two arrays 'a' and 'b' and I'm interested in their distribution:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
plt.show()
This code gives me a histogram with two 'curves'. Now I want to subtract one 'curve' from the other, and by this I mean that I do this for each bin separately:
n3 = n2-n1
I don't need negative counts so:
for i in range(0,len(n2)):
    if n3[i]<0:
        n3[i]=0
    else:
        continue
The new histogram curve should be plotted in the same range as the previous ones and it should have the same number of bins. So I have the number of bins and their position (which will be the same as the ones for the other curves, please refer to the block above) and the frequency or counts (n3) that every bins should have. Do you have any ideas of how I can do this with the data that I have?
You can use a step function to plot n3 = n2 - n1. The only issue is that you need to provide one more value, otherwise the last value is not shown nicely. Also you need to use the where="post" option of the step function.
import numpy as np
import matplotlib.pyplot as plt
a = np.array([1,2,2,2,2,4,8,1,9,5,3,1,2,9])
b = np.array([5,9,9,2,3,9,3,6,8,4,2,7,8,8])
n1,bin1,pat1 = plt.hist(a,np.arange(1,10,2),histtype='step')
n2,bin2,pat2 = plt.hist(b,np.arange(1,10,2), histtype='step')
n3=n2-n1
n3[n3<0] = 0
plt.step(np.arange(1,10,2),np.append(n3,[n3[-1]]), where='post', lw=3 )
plt.show()
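If the two original curves do not need to be drawn, a possible variation is to compute the counts with `np.histogram` and draw only the difference. This sketch assumes Matplotlib 3.4 or newer, where `Axes.stairs` is available; it takes the bin edges directly, so no extra value has to be appended.

```python
import numpy as np
import matplotlib.pyplot as plt

a = np.array([1, 2, 2, 2, 2, 4, 8, 1, 9, 5, 3, 1, 2, 9])
b = np.array([5, 9, 9, 2, 3, 9, 3, 6, 8, 4, 2, 7, 8, 8])

edges = np.arange(1, 10, 2)           # same bin edges as in the question

n1, _ = np.histogram(a, bins=edges)   # counts only, nothing is drawn
n2, _ = np.histogram(b, bins=edges)

# Per-bin difference, with negative counts clipped to zero
n3 = np.clip(n2 - n1, 0, None)

fig, ax = plt.subplots()
ax.stairs(n3, edges, linewidth=3)     # one value per bin, len(edges) == len(n3) + 1
print(n1, n2, n3)
```

Call `plt.show()` to display the result. `np.clip` replaces both the explicit loop and the boolean-mask assignment.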