Python: create multiple boxplots in one panel - python

I have been using R for a long time and have recently started learning Python.
I would like to create multiple box plots in one panel in Python.
My dataset is stored as a vector, and a label vector indicates which box plot each element of the data belongs to. An example looks like this:
import numpy as np
N = 50
data = np.random.lognormal(size=N, mean=1.5, sigma=1.75)
label = np.repeat([1,2,3,4,5], N // 5)  # integer division keeps the repeat count an int on Python 3 too
According to various websites (e.g., matplotlib: Group boxplots), creating multiple boxplots requires a matrix-like input in which each column contains the samples for one boxplot. So I created a list object based on data and label:
savelist = data[ label == 1]
for i in [2,3,4,5]:
    savelist = [savelist, data[ label == i]]
However, the code below gives me an error:
boxplot(savelist)
Traceback (most recent call last):
File "<ipython-input-222-1a55d04981c4>", line 1, in <module>
boxplot(savelist)
File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2636, in boxplot
meanprops=meanprops, manage_xticks=manage_xticks)
File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 3045, in boxplot labels=labels)
File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/matplotlib/cbook.py", line 1962, in boxplot_stats
stats['mean'] = np.mean(x)
File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 2727, in mean
out=out, keepdims=keepdims)
File "/Users/yumik091186/anaconda/lib/python2.7/site-packages/numpy/core/_methods.py", line 66, in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims)
ValueError: operands could not be broadcast together with shapes (2,) (10,)
Can anyone explain what is going on?

You're ending up with a nested list instead of a flat list: each pass through the loop wraps the previous savelist in a new two-element list. Try this instead:
savelist = [data[label == 1]]
for i in [2,3,4,5]:
    savelist.append(data[label == i])
And it should work.
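The same flat list can also be built in one step. The following is a minimal sketch, assuming the data and label arrays from the question; the seeded generator and the name groups are illustrative choices, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
N = 50
data = rng.lognormal(mean=1.5, sigma=1.75, size=N)
label = np.repeat([1, 2, 3, 4, 5], N // 5)

# One 1-D array per distinct label value, collected into a flat list.
groups = [data[label == value] for value in np.unique(label)]
```

Passing groups to plt.boxplot(groups) should then draw one box per label.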

Related

Having Trouble with numpy.histogramdd

I am trying to create an N-dimensional histogram from a 2D array of complex values. I want to count the occurrences of the real and imaginary parts of the array in the given bins and store the result in a 3D array. It only runs for the first iteration, when I hard-code i=0 and remove the for loop. I have never used histograms in Python before and I cannot understand the error. The code is given below.
xsoft is a 2D array of complex type. I compute bnd_edges from the max and min values of xsoft and use it to create the edges passed as bins.
xsoft = np.empty((M, MAX,), dtype=complex) # e.g has dims 4*100
xsoft[:] = np.nan
edges = np.linspace(-bnd_edges, bnd_edges, numbin) #numbin=10
pSOFT = np.empty((len(edges)-1, M, len(edges)-1)) # len(edges)= 10
pSOFT[:] = np.nan
for i in range(M):
    pSOFT[:, i, :], edges = np.histogramdd((xsoft[i, :].real, xsoft[i, :].imag), bins=(edges, edges))
The code results in the following error
Traceback (most recent call last):
File " ", line 194, in <module>
pSOFT[:, i, :], edges = np.histogramdd((xsoft[i, :].real, xsoft[i, :].imag), bins=(edges, edges))
File "<__array_function__ internals>", line 5, in histogramdd
File " " line 1066, in histogramdd
raise ValueError(
ValueError: `bins[0]` must be a scalar or 1d array
Process finished with exit code 1
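You are getting this error because you are overwriting the original definition of edges with the second return value of histogramdd, which is a list of bin-edge arrays (one per dimension). From the second iteration on, bins=(edges, edges) therefore passes a list of arrays where a scalar or 1d array is expected.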
Replace the last line in your code with this:
pSOFT[:, i, :], edges_i = np.histogramdd((xsoft[i, :].real, xsoft[i, :].imag), bins=(edges, edges))
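For reference, here is a small self-contained sketch of the corrected loop. The random xsoft values and the way bnd_edges is computed are assumptions made for illustration, since the question does not show them.

```python
import numpy as np

rng = np.random.default_rng(0)
M, MAX, numbin = 4, 100, 10

# Hypothetical complex data standing in for the real xsoft.
xsoft = rng.normal(size=(M, MAX)) + 1j * rng.normal(size=(M, MAX))

# Assumed definition: the largest absolute real or imaginary part.
bnd_edges = max(np.abs(xsoft.real).max(), np.abs(xsoft.imag).max())
edges = np.linspace(-bnd_edges, bnd_edges, numbin)

pSOFT = np.empty((len(edges) - 1, M, len(edges) - 1))
for i in range(M):
    sample = np.column_stack((xsoft[i].real, xsoft[i].imag))  # shape (MAX, 2)
    # Discard the returned edges so the original 1-D edges array is kept.
    pSOFT[:, i, :], _ = np.histogramdd(sample, bins=(edges, edges))
```

Because edges is never reassigned, every iteration receives the same 1-D array for both dimensions.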

Cannot plot my function : return array(a, dtype, copy=False, order=order) TypeError: float() argument must be a string or a number

I'm trying to plot a function that gives the arctan of the angle of several scatterplots (it's a physics experiment):
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
filename='rawPhaseDataf2f_17h_15m.dat'
datatype=np.dtype( [('Shotnumber',np.dtype('>f8')),('A1',np.dtype('>f8')), ('A2',np.dtype('>f8')), ('f2f',np.dtype('>f8')), ('intensity',np.dtype('>f8'))])
data=np.fromfile(filename,dtype=datatype)
#time=data['Shotnumber']/9900 # reprate is 9900 Hz -> time in seconds
A1=data['A1']
A2=data['A2']
#np.sort()
i=range(1,209773)
def x(i) :
    return arctan((A1.item(i)/A2.item(i))*(i/209772))
def y(i) :
    return i*2*pi/209772
plot(x,y)
plt.figure('Scatterplot')
plt.plot(A1,A2,',') #Scatterplot
plt.xlabel('A1')
plt.ylabel('A2')
plt.figure('2D Histogram')
plt.hist2d(A1,A2,100) # 2D Histogram
plt.xlabel('A1')
plt.ylabel('A2')
plt.show()
My error is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell /sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/home/nelly/Bureau/ Téléchargements/Kr4 Experiment/read_rawPhaseData.py", line 21, in <module>
plot(x,y)
File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 2987, in plot
ret = ax.plot(*args, **kwargs)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 4138, in plot
self.add_line(line)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 1497, in add_line
self._update_line_limits(line)
File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 1508, in _update_line_limits
path = line.get_path()
File "/usr/lib/pymodules/python2.7/matplotlib/lines.py", line 743, in get_path
self.recache()
File "/usr/lib/pymodules/python2.7/matplotlib/lines.py", line 420, in recache
x = np.asarray(xconv, np.float_)
File "/usr/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
I know that the problem comes from plot(x,y), and I think the error lies in how I define x and y. A1 and A2 are arrays, N is the number of points, and A1k denotes the k-th element of A1. I want to have arctan(A1k/A2k)*(k/N).
There are lots of problems with your code, and your understanding of python and array operations. I'm just going to handle the first part of the code (and the error you get), and hopefully you can continue to fix it from there.
This should fix the error you're getting and generate a plot:
# size = 209772
size = A1.size # I'm assuming that the size of the array is 209772
z = np.arange(1, size+1)/float(size) # array from 1/size to 1.0; float() avoids integer division on Python 2
# Calculate the x and y arrays
x = np.arctan((A1/A2)*z)
y = z*2*np.pi
# Plot x and y
plt.plot(x, y)
Discussion:
There are lots of issues with this chunk of code:
i=range(1,209773)
def x(i) :
    return arctan((A1.item(i)/A2.item(i))*(i/209772))
def y(i) :
    return i*2*pi/209772
plot(x, y)
You're defining two functions called x and y, and then you are passing those functions to the plotting method. The plotting method accepts numbers (in lists or arrays), not functions. That is the reason for the error that you are getting. So you instead need to construct a list/array of numbers and pass that to the function.
You're defining a variable i which is a list of numbers. But when you define the functions x and y, you are creating new variables named i which have nothing to do with the list you created earlier. This is because of how "scope" works in python.
The functions arctan and plot (and the constant pi) are not defined "globally"; they live in the packages numpy and matplotlib, so you need to call them through those packages, e.g. np.arctan, np.pi and plt.plot.
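To make the difference between functions and arrays concrete, here is a minimal sketch that builds x and y as arrays. The synthetic A1 and A2 are assumptions standing in for the columns read from the binary file in the question.

```python
import numpy as np

# Synthetic stand-ins for the A1 and A2 columns (assumption: the real data
# come from the binary file in the question, which is not available here).
rng = np.random.default_rng(0)
size = 1000
A1 = rng.normal(size=size)
A2 = rng.normal(loc=5.0, size=size)  # offset keeps the divisor away from zero in practice

k = np.arange(1, size + 1)    # 1, 2, ..., size
z = k / float(size)           # k/N, ending exactly at 1.0

x = np.arctan((A1 / A2) * z)  # element-wise arctan((A1k/A2k) * (k/N))
y = z * 2 * np.pi             # element-wise 2*pi*k/N
```

Both x and y are now plain arrays of equal length, which is what plt.plot(x, y) expects.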

Constraint Mismatch Error

I am running the following code to create a simple line graph:
import matplotlib.pyplot as plt
import iris
import iris.coord_categorisation as iriscc
import iris.plot as iplt
import iris.quickplot as qplt
import iris.analysis.cartography
import matplotlib.dates as mdates
def main():
    #bring in all the files we need and give them a name
    TestFile= '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/AFR_44_tas/Historical/1950-2005/tas_AFR-44_MOHC-HadGEM2-ES_historical_r1i1p1_CLMcom-CCLM4-8-17_v1_mon_194912-200512.nc'
    #Load exactly one cube from given file
    TestFile = iris.load_cube(TestFile)
    print(TestFile)
    #adjust longitude as data is out by 180degrees
    #remove flat latitude and longitude and only use grid latitude and grid longitude which are in the 3rd and 4th column of the file
    lats = iris.coords.DimCoord(TestFile.coords()[3].points[:,0], \
                                standard_name='latitude', units='degrees')
    lons = TestFile.coords()[4].points[0]
    for i in range(len(lons)):
        if lons[i]>100.:
            lons[i] = lons[i]-360.
    lons = iris.coords.DimCoord(lons, \
                                standard_name='longitude', units='degrees')
    TestFile.remove_coord('latitude')
    TestFile.remove_coord('longitude')
    TestFile.remove_coord('grid_latitude')
    TestFile.remove_coord('grid_longitude')
    TestFile.add_dim_coord(lats, 1)
    TestFile.add_dim_coord(lons, 2)
    #we are only interested in the latitude and longitude relevant to Malawi
    Malawi = iris.Constraint(longitude=lambda v: 32.5 <= v <= 36., \
                             latitude=lambda v: -17. <= v <= -9.)
    TestFile = TestFile.extract(Malawi)
    #data is in Kelvin, but we would like to show it in Celsius
    TestFile.convert_units('Celsius')
    #We are interested in plotting the graph with time along the x axis, so we need a mean of all the coordinates, i.e. mean temperature across whole country
    iriscc.add_year(TestFile, 'time')
    TestFile = TestFile.aggregated_by('year', iris.analysis.MEAN)
    TestFile.coord('latitude').guess_bounds()
    TestFile.coord('longitude').guess_bounds()
    TestFile_grid_areas = iris.analysis.cartography.area_weights(TestFile)
    TestFile_mean = TestFile.collapsed(['latitude', 'longitude'],
                                       iris.analysis.MEAN,
                                       weights=TestFile_grid_areas)
    #set major plot indicators for x-axis
    plt.gca().xaxis.set_major_locator(mdates.YearLocator(5))
    #assign the line colours
    qplt.plot(TestFile_mean, label='TestFile', lw=1.5, color='blue')
    #create a legend and set its location to under the graph
    plt.legend(loc="upper center", bbox_to_anchor=(0.5,-0.05), fancybox=True, shadow=True, ncol=5)
    #create a title
    plt.title('Mean Near Surface Temperature for Malawi', fontsize=11)
    #create the graph
    plt.grid()
    iplt.show()
if __name__ == '__main__':
    main()
This works well for the majority of the files, but two climate models raise a ConstraintMismatchError:
runfile('/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/Python Code and Output Images/Line_Graph_Temp_Test.py', wdir='/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/Python Code and Output Images')
Traceback (most recent call last):
File "<ipython-input-83-4f4457568a8f>", line 1, in <module> runfile('/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/Python Code and Output Images/Line_Graph_Temp_Test.py', wdir='/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/Python Code and Output Images')
File "/usr/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "/usr/lib/python2.7/site-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 78, in execfile
builtins.execfile(filename, *where)
File "/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/Python Code and Output Images/Line_Graph_Temp_Test.py", line 84, in <module>
main()
File "/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/Python Code and Output Images/Line_Graph_Temp_Test.py", line 21, in main
TestFile = iris.load_cube(TestFile)
File "/usr/lib64/python2.7/site-packages/iris/__init__.py", line 338, in load_cube
raise iris.exceptions.ConstraintMismatchError(str(e))
ConstraintMismatchError: failed to merge into a single cube.
cube.standard_name differs: None != u'air_temperature'
cube.long_name differs: None != u'Near-Surface Air Temperature'
cube.var_name differs: u'rotated_pole' != u'tas'
cube.units differs: Unit('no_unit') != Unit('K')
cube.attributes keys differ: 'grid_north_pole_latitude', 'grid_north_pole_longitude', 'grid_mapping_name'
cube.cell_methods differ
cube.shape differs: () != (660, 201, 194)
cube data dtype differs: |S1 != float32
cube data fill_value differs: '\x00' != 1e+20
Similarly, I get this error when trying to run the observed data (cru_ts4.00.1901.2015.tmp.dat.nc)
ConstraintMismatchError: failed to merge into a single cube.
cube.long_name differs: u'near-surface temperature' != None
cube.var_name differs: u'tmp' != u'stn'
cube.units differs: Unit('degrees Celsius') != Unit('1')
cube.attributes keys differ: 'correlation_decay_distance', 'description'
cube data dtype differs: float32 != int32
cube data fill_value differs: 9.96921e+36 != -2147483647
Any ideas on how I can fix this?
I received a response from Andrew Dawson on the Iris User Google Group. Posting here in case it is of any help to someone else. This helped me!
The function iris.load_cube is used to load exactly 1 and only 1 cube from the given file matching the given constraints. You haven't provided constraints, which means you are expecting the file(s) you are loading from to reduce to exactly 1 cube. The ConstraintMismatchError from iris.load_cube is telling you that this is not possible due to some mismatched data. From the error it looks like you have more than 1 variable in your input file(s) for those models. You should consider adding an explicit constraint when loading, perhaps like:
iris.load_cube(filename, 'name_of_variable_here')
where the name_of_variable should be the name that cube would be loaded with, i.e. the result of cube.name(). This is different from the netcdf variable name. To work out how you need to do this I suggest loading all the cubes from one of the problematic datasets with
cubes = iris.load(the_filename) # load all the cubes in the input file
and then printing the names of the cubes
for cube in cubes:
    print(cube.name())

In ggplot for Python, using discrete X scale with geom_point()?

The following example returns an error. It appears that using a discrete (not continuous) scale for the x-axis in ggplot in Python is not supported?
import pandas as pd
import ggplot
df = pd.DataFrame.from_dict({'a':['a','b','c'],
                             'percentage':[.1,.2,.3]})
p = ggplot.ggplot(data=df,
                  aesthetics=ggplot.aes(x='a',
                                        y='percentage'))\
    + ggplot.geom_point()
print(p)
As mentioned, this returns:
Traceback (most recent call last):
File "/Users/me/Library/Preferences/PyCharm2016.1/scratches/scratch_1.py", line 30, in <module>
print(p)
File "/Users/me/lib/python3.5/site-packages/ggplot/ggplot.py", line 116, in __repr__
self.make()
File "/Users/me/lib/python3.5/site-packages/ggplot/ggplot.py", line 627, in make
layer.plot(ax, facetgroup, self._aes, **kwargs)
File "/Users/me/lib/python3.5/site-packages/ggplot/geoms/geom_point.py", line 60, in plot
ax.scatter(x, y, **params)
File "/Users/me/lib/python3.5/site-packages/matplotlib/__init__.py", line 1819, in inner
return func(ax, *args, **kwargs)
File "/Users/me/lib/python3.5/site-packages/matplotlib/axes/_axes.py", line 3838, in scatter
x, y, s, c = cbook.delete_masked_points(x, y, s, c)
File "/Users/me/lib/python3.5/site-packages/matplotlib/cbook.py", line 1848, in delete_masked_points
raise ValueError("First argument must be a sequence")
ValueError: First argument must be a sequence
Any workarounds for using ggplot with scatters on a discrete scale?
One option is to generate a continuous series, and use the original variable as labels. But this seems like a painful workaround.
df = pd.DataFrame.from_dict({'a':[0,1,2],
                             'a_name':['a','b','c'],
                             'percentage':[.1,.2,.3]})
p = ggplot.ggplot(data=df,
                  aesthetics=ggplot.aes(x='a',
                                        y='percentage'))\
    + ggplot.geom_point()\
    + ggplot.scale_x_continuous(breaks=list(df['a']),
                                labels=list(df['a_name']))
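If the numeric column has to be derived rather than typed by hand, pandas can generate it. This is only a sketch of the data preparation step, assuming the same column names as above; it does not call ggplot itself, and the names codes, uniques, breaks and labels are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a_name': ['a', 'b', 'c'],
                   'percentage': [.1, .2, .3]})

# factorize assigns 0, 1, 2, ... in order of first appearance and
# returns the distinct labels alongside the integer codes.
codes, uniques = pd.factorize(df['a_name'])
df['a'] = codes

breaks = list(range(len(uniques)))  # positions for the axis ticks
labels = list(uniques)              # text to show at those positions
```

The breaks and labels lists could then be handed to scale_x_continuous in place of the hand-written columns.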
I was getting the same error when trying to plot 2 columns of a dataframe. I was reading the data from a csv file and converting it into a dataframe.
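import csv
readdata=csv.reader(open(filename),delimiter="\t")
header=next(readdata) # assumption: the first row holds the column names
data=list(readdata) # remaining rows; csv.reader yields every field as a string
df= pd.DataFrame(data, columns=header)
df.columns=["pulseVoltage","dutVoltage","dutCurrent","leakageCurrent"]
print (df.dtypes)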
When I checked the data types, for some reason they were shown as object instead of float that I expected (I am a newbie and this might be trivial knowledge which I don't know). Therefore, I went ahead and did an explicit conversion of columns to data type float.
df["dutVoltage"]=df["dutVoltage"].astype("float")
df["dutCurrent"]=df["dutCurrent"].astype("float")
Now I can use ggplot to plot the data without any error.
print(ggplot(df, aes('dutVoltage','dutCurrent')) + geom_point())
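The reason for the unexpected dtype is that csv.reader hands back every field as a string, so the resulting columns hold text until they are converted. Here is a minimal sketch of that behaviour, using a hypothetical two-column, tab-separated sample in place of the real file.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated content standing in for the real file.
raw = "dutVoltage\tdutCurrent\n0.5\t1e-6\n1.0\t2e-6\n"

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, data = rows[0], rows[1:]

df = pd.DataFrame(data, columns=header)  # every cell is still a Python str here
first_cell_type = type(df.iloc[0, 0])

# Convert each column explicitly; pd.to_numeric raises if a value is not numeric.
for column in df.columns:
    df[column] = pd.to_numeric(df[column])
```

Reading the file with pd.read_csv(filename, sep="\t") is another option, since it tries to infer numeric types while parsing.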

Python zero-size array to ufunc.reduce without identity

I'm trying to make a histogram of some data that is stored in an ndarray. The histogram is part of a set of analyses which I've made into a class in a Python program. The part of the code that isn't working is below.
def histogram(self, iters):
    samples = T.MCMC(iters) #Returns an [iters,3,4] ndarray
    histAC = plt.figure(self.ip) #plt is matplotlib's pyplot
    self.ip+=1 #defined at the beginning of the class to start at 0
    for l in range(0,4):
        h = histAC.add_subplot(2,(iters+1)/2,l+1)
        for i in range(0,0.5*self.chan_num):
            intAvg = mean(samples[:,i,l])
            print intAvg
            for k in range(0,iters):
                samples[k,i,l]=samples[k,i,l]-intAvg
            print "Samples is ",samples
            h.hist(samples,bins=5000,range=[-6e-9,6e-9],histtype='step')
        h.legend(loc='upper right')
        h.set_title("AC Pulse Integral Histograms: "+str(l))
    figname = 'ACHistograms.png'
    figpath = 'plot'+str(self.ip)
    print "Finished!"
    #plt.savefig(figpath + figname, format = 'png')
This gives me the following error message:
File "johnmcmc.py", line 257, in histogram
h.hist(samples,bins=5000,range=[-6e-9,6e-9],histtype='step') #removed label=apdlabel
File "/x/tsfit/local/lib/python2.6/site-packages/matplotlib/axes.py", line 7238, in hist
ymin = np.amin(m[m!=0]) # filter out the 0 height bins
File "/x/tsfit/local/lib/python2.6/site-packages/numpy/core/fromnumeric.py", line 1829, in amin
return amin(axis, out)
ValueError: zero-size array to ufunc.reduce without identity
The only search results I've found are multiple copies of the same two conversations. The only thing I learned from them is that Python histograms don't like being fed empty arrays, which is why I added the print statement right above the failing line to make sure the array isn't empty.
Has anyone else come across this error before?
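The traceback shows the failure inside matplotlib at np.amin(m[m!=0]), where m holds the bin counts. A minimal sketch of how that selection can end up empty even though the input array is not, assuming every sample lies outside the requested range (a guess about the cause, not something the post confirms):

```python
import numpy as np

# Non-empty input, but far outside the range passed to the histogram.
samples = np.array([1.0, 2.0, 3.0])
counts, _ = np.histogram(samples, bins=10, range=(-6e-9, 6e-9))

# Values outside the range are ignored, so every bin count is zero
# and selecting the non-zero bins leaves nothing to reduce.
nonzero = counts[counts != 0]

try:
    np.amin(nonzero)
    raised = False
except ValueError:
    raised = True
```

Checking the minimum and maximum of the samples against the range argument would show whether this is what happens here.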
