How to plot a stacked histogram with two arrays in Python

I am trying to create a stacked histogram showing the clump thickness for malignant and benign tumors, with the malignant class colored red and the benign class colored blue.
I have clump_thickness_array and benign_or_malignant_array. The benign_or_malignant_array consists of 2s and 4s.
If benign_or_malignant equals 2, the tumor is benign (colored blue).
If it equals 4, it is malignant (colored red).
I cannot figure out how to color the benign and malignant tumors. My histogram shows something other than what I am trying to achieve.
This is my code and my histogram so far:
fig, ax = plt.subplots(figsize=(12,8))
tmp = list()
for i in range(2):
    indices = np.where(benign_or_malignant>=i )
    tmp.append(clump_thickness[indices])
ax.hist(tmp,bins=10,stacked=True,color = ['b',"r"],alpha=0.73)

To obtain a stacked histogram from groups of different lengths, you need to assemble a list of arrays, one per group. This is what you are doing with your tmp variable. However, I think you are using the wrong values in your for loop. range(2) yields 0 and 1, and since every entry of benign_or_malignant is 2 or 4, the test >=i is true for all of them, so both groups end up containing the whole array. Above, you state that you want to label your data according to the variable benign_or_malignant, so you want to test whether it is exactly 2 or exactly 4. If you really just want these two possibilities, rewrite the loop like this:
for i in [2,4]:
    indices = np.where(benign_or_malignant==i )
    tmp.append(clump_thickness[indices])
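For reference, here is a minimal end-to-end sketch of the corrected loop. The data are synthetic stand-ins (the real arrays are not shown in the question), the labels and axis names are my own choice, and the Agg backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to get a window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data (hypothetical): 200 tumors, class 2 = benign, 4 = malignant
rng = np.random.default_rng(0)
benign_or_malignant = rng.choice([2, 4], size=200, p=[0.65, 0.35])
clump_thickness = rng.integers(1, 11, size=200)  # integer thickness from 1 to 10

# One array per class, selected with an exact-match boolean mask
groups = [clump_thickness[benign_or_malignant == label] for label in (2, 4)]

fig, ax = plt.subplots(figsize=(12, 8))
counts, edges, _ = ax.hist(
    groups,
    bins=10,
    stacked=True,
    color=["b", "r"],
    alpha=0.73,
    label=["benign (2)", "malignant (4)"],
)
ax.set_xlabel("clump thickness")
ax.set_ylabel("count")
ax.legend()
fig.savefig("stacked_hist.png")
```

With stacked=True the returned counts are cumulative across the groups, so the last row holds the total height of each stacked bar.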

Related

plotly histogram facet row animation frame

Here is a sample of my data:
Time,Value,Name,Type
0,6.9,A,start
40,6.9,A,start
60,6.9,A,start
0,0.01,B,start
40,0.01,B,start
60,0.01,B,start
0,1.0,C,start
40,1.0,C,start
60,1.0,C,start
0,0.08,D,start
40,0.08,D,start
60,0.08,D,start
0,0.000131,E,End
40,0.00032,E,End
60,0.99209,E,End
0,0.002754,F,End
40,0.00392,F,End
60,0.01857,F,End
0,0.003,G,End
40,0.00516,G,End
60,0.00746,G,End
0,0.00426,H,End
40,0.0043,H,End
60,0.0095,H,End
0,0,I,End
40,0.0017,I,End
60,0.0183,I,End
And my code below:
import plotly.express as px
import pandas as pd
df=pd.read_csv('tohistogram.csv')
fig_bar = px.histogram(df,x='Name',y='Value',animation_frame='Time',color='Name',facet_row='Type')
fig_bar.update_layout(yaxis_title="value")
fig_bar.update_xaxes(matches=None)
fig_bar.for_each_xaxis(lambda xaxis: xaxis.update(showticklabels=True))
fig_bar.show()
Fig1:
Fig2:
With the data points listed above, I want two histograms separated by Type (start, End) in one figure, driven by a single animation_frame.
1. I tried the code above. As the images show, I only partially achieved this: in Fig1 the second histogram also shows A, B, C, D, whereas I expected just E to I.
2. Fig2 is what I see after pressing the play button and autoscaling: A-D are gone and only E-I remain.
This is what I wanted in the first place: before the animation runs, the two histograms should already be split by 'Type'.
A. Is this possible? I tried a couple of things, such as removing color:
fig_bar = px.histogram(df,x='Name',y='Value',animation_frame='Time',facet_row='Type')
The histograms are then split by 'Type', without color of course, but the second x-axis has no labels.
B. fig_bar = px.histogram(df,x='Name',y='Value',color='Name',facet_row='Type')
This splits by 'Type', but there is no animation.
Is what I am trying to do possible?
I need two histograms in the same figure, split by 'Type', with color and animation_frame.
C. Only if that is possible: how do I change the y-axis label of the first histogram from "sum of Value" to a user-defined name, and give that axis its own range?
D. I haven't come across any example of this, but when hovering over the histogram, can I show another simple line-graph image instead of text or a value?
Thank you

How can I repeat a value in a list variable for coloring bar charts in matplotlib?

Hi I am trying to search for something but I don't know the correct words to find my answer (if it exists).
I am trying to color the bars on a barchart with 24 bars using this: https://python-graph-gallery.com/3-control-color-of-barplots/
I want to color bars 0-15 one color, and bars 16-23 another color. I was wondering if there's a way I can make a variable called "my_colors" and a list without actually repeating a hexcode over and over 24x. I only need 2 colors in my list repeated a bunch of times...
Is there some notation to write this sort of list?
Since your colors are in blocks, you can just do list multiplication:
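import numpy as np
import matplotlib.pyplot as plt
# define the colors
my_colors = ['#AAAA00', '#DD00DD']
# bars 0-15 (16 bars) get the first color, bars 16-23 (8 bars) the second
colors = my_colors[:1]*16 + my_colors[1:]*8
# toy data
np.random.seed(1)
plt.bar(np.arange(24),
        np.random.randint(1,10,24),
        color=colors)
plt.show()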
Output
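If counting block lengths by hand feels error-prone, the same list can be built from the index where the second color starts. This is just a sketch; n_bars, split and the variable names are illustrative, not from the question.

```python
# Hypothetical names: n_bars, split, first_color and second_color are illustrative
n_bars = 24
split = 16  # index of the first bar that gets the second color
first_color, second_color = "#AAAA00", "#DD00DD"

# One entry per bar: bars 0..15 get first_color, bars 16..23 get second_color
colors = [first_color if i < split else second_color for i in range(n_bars)]

print(colors.count(first_color), colors.count(second_color))  # 16 8
```

The resulting list can be passed to plt.bar(..., color=colors) exactly like the multiplied list.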

Plotting categorical variable over multiple numeric variables in python

I need to plot one categorical variable over multiple numeric variables.
My DataFrame looks like this:
     party  media_user  business_user      POLI      mass
0  Party_a    0.513999       0.404201  0.696948  0.573476
1  Party_b    0.437972       0.306167  0.432377  0.433618
2  Party_c    0.519350       0.367439  0.704318  0.576708
3  Party_d    0.412027       0.253227  0.353561  0.392207
4  Party_e    0.479891       0.380711  0.683606  0.551105
And I would like a scatter plot with different colors for the different variables, e.g. one set of points per party for each of [media_user, business_user, POLI, mass], each in a different color.
So like this just with scatters instead of bars:
The closest I've come is this
sns.catplot(x="party", y="media_user", jitter=False, data=sns_df, height = 4, aspect = 5);
producing:
By messing around with some other graphs I found that by simply adding linestyle = '' I could remove the line and add markers. Hope this may help somebody else!
sim_df.plot(figsize = (15,5), linestyle = '', marker = 'o')
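A self-contained sketch of that trick, using the numbers from the question. I am assuming the plotted frame is indexed by party (the post does not show how sim_df was built), so set_index("party") is my addition, as is the Agg backend line for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to get a window
import pandas as pd

df = pd.DataFrame({
    "party": ["Party_a", "Party_b", "Party_c", "Party_d", "Party_e"],
    "media_user": [0.513999, 0.437972, 0.519350, 0.412027, 0.479891],
    "business_user": [0.404201, 0.306167, 0.367439, 0.253227, 0.380711],
    "POLI": [0.696948, 0.432377, 0.704318, 0.353561, 0.683606],
    "mass": [0.573476, 0.433618, 0.576708, 0.392207, 0.551105],
})

# Assumption: use party as the index so it ends up on the x-axis.
# linestyle='' hides the connecting lines, marker='o' draws one point per value;
# each numeric column gets its own color and legend entry.
ax = df.set_index("party").plot(figsize=(15, 5), linestyle="", marker="o")
ax.set_ylabel("value")
ax.figure.savefig("party_scatter.png")
```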

How to smooth or overlap bins in pyplot.hist2d?

I am plotting a 2D histogram to show, for example, the concentration of lightning strikes (given by their positions, registered as longitude and latitude). The number of data points is not too large (53) and the result is too coarse. Here is a picture of the result:
For this reason, I am trying to find a way to weight in data from surrounding bins. For example, there is a bin at longitude = 130 and latitude = 34.395 with 0 lightning registered, but with several around it. I would want this bin to reflect somehow the concentration around it. In other words, I want to smooth the data by having overlapping bins (so that a data point can be counted more than once, by different contiguous bins).
I understand that hist2d has the input option for "weights", but this would only work to make a data point more "important" within its bin.
The simplified code is below and I can clarify anything needed.
import numpy as np
import matplotlib.pyplot as plt
# Here are the data, to experiment if needed
longitude = np.array([119.165, 115.828, 110.354, 117.124, 119.16 , 107.068, 108.628, 126.914, 125.685, 116.608, 122.455, 116.278, 123.43, 128.84, 128.603, 130.192, 124.508, 121.916, 133.245, 125.088, 126.641, 127.224, 113.686, 129.376, 127.312, 121.353, 117.834, 125.219, 138.077, 153.299, 135.66 , 128.391, 118.011, 117.313, 119.986, 118.619, 119.178, 120.295, 121.991, 123.519, 135.948, 132.224, 129.317, 135.334, 132.923, 129.828, 139.006, 140.813, 116.207, 139.254, 120.922, 112.171, 143.508])
latitude = np.array([34.381, 34.351, 34.359, 34.357, 34.364, 34.339, 34.351, 34.38, 34.381, 34.366, 34.373, 34.366, 34.369, 34.387, 34.39 , 34.39 , 34.386, 34.371, 34.394, 34.386, 34.384, 34.387, 34.369, 34.4 , 34.396, 34.37 , 34.374, 34.383, 34.403, 34.429, 34.405, 34.385, 34.367, 34.36 , 34.367, 34.364, 34.363, 34.367, 34.367, 34.369, 34.399, 34.396, 34.382, 34.401, 34.396, 34.392, 34.401, 34.401, 34.362, 34.404, 34.382, 34.346, 34.406])
# Number of bins
Nbins = 15
# Plot histogram of the positions
plt.hist2d(longitude,latitude, bins=Nbins)
plt.plot(longitude,latitude,'o',markersize = 8, color = 'k')
plt.plot(longitude,latitude,'o',markersize = 6, color = 'w')
plt.colorbar()
plt.show()
Perhaps you're getting confused about the concept of a 2D histogram, or of a histogram in general. Besides being a bar plot that groups data into bins, a histogram is also a discretized estimate of a probability density function. In your case, that is the probability of a strike occurring at a given position. For this reason, I would not try to overlap histograms.
Moreover, because the histogram is discrete, it will necessarily be coarse. The resolution of a histogram is in fact an important parameter for the visualization you are after.
Going back to your question, if you want to reduce the coarseness, you may simply want to play with Nbins.
Perhaps another graph type would suit your use case better: see this gallery and the 2D density plot with shading.
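For completeness, the "overlapping bins" idea from the question can be sketched with NumPy alone by adding to each bin the counts of its surrounding bins. This is not what the answer above recommends, only an illustration of the asker's idea; neighbour_sum is a hypothetical helper name, and only a few of the question's points are used.

```python
import numpy as np

def neighbour_sum(counts):
    """Add to every bin the counts of its (up to) 8 surrounding bins."""
    nx, ny = counts.shape
    padded = np.pad(counts.astype(float), 1)  # zero border so edge bins have neighbours
    smoothed = np.zeros((nx, ny))
    # Sum the nine shifted views of the padded array (3x3 window around each bin)
    for dx in range(3):
        for dy in range(3):
            smoothed += padded[dx:dx + nx, dy:dy + ny]
    return smoothed

# A few of the question's points, for illustration only
longitude = np.array([119.165, 115.828, 110.354, 117.124, 119.16, 107.068])
latitude = np.array([34.381, 34.351, 34.359, 34.357, 34.364, 34.339])

counts, xedges, yedges = np.histogram2d(longitude, latitude, bins=15)
smoothed = neighbour_sum(counts)
```

The result can be drawn with plt.pcolormesh(xedges, yedges, smoothed.T); the transpose is needed because np.histogram2d puts x along the first axis. Note that each data point is now counted in up to nine bins, so the values are no longer plain counts.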

pandas plot line segments for each row

I have dataframes with columns containing x,y coordinates for multiple points. One row can consist of several points.
I'm trying to find out an easy way to be able to plot lines between each point generating a curve for each row of data.
Here is a simplified example where two lines are represented by two points each.
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
df.plot(y=['p1_y','p2_y'], x=['p1_x','p2_x'])
When I plot them, I expect line 1 to start at x=1 and line 2 to start at x=2.
Instead, the x axis contains two value pairs, (1,2) and (2,3), and both lines have the same start and end point on the x axis.
How do I get around this problem?
Edit:
If using matplotlib, the following hardcoded values generate the plot I'm interested in:
plt.plot([[1,2],[2,3]],[[10,9],[11,12]])
While I'm sure there is a more succinct way using pure pandas, here's a simple approach using matplotlib and some frames derived from the original df. (I hope I understood the question correctly.)
Assumption: in df, x values are in the even-numbered columns (0, 2, ...) and y values in the odd-numbered columns (1, 3, ...)
Obtain x values
x = df.loc[:, df.columns[::2]]
x
   p1_x  p2_x
0     1     2
1     2     3
Obtain y values
y = df.loc[:, df.columns[1::2]]
y
   p1_y  p2_y
0    10    11
1     9    12
Then plot using a for loop
for i in range(len(df)):
    plt.plot(x.iloc[i,:], y.iloc[i,:])
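As a hedged variation on the x/y split above: the columns can be selected by name suffix instead of position, and the loop can be dropped because matplotlib draws one line per column when given 2D inputs. The suffix convention (_x, _y) is taken from the example frame; the Agg backend line is my addition for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to get a window
import matplotlib.pyplot as plt
import pandas as pd

line1 = {'p1_x': 1, 'p1_y': 10, 'p2_x': 2, 'p2_y': 11}
line2 = {'p1_x': 2, 'p1_y': 9, 'p2_x': 3, 'p2_y': 12}
df = pd.DataFrame([line1, line2])

# Select coordinate columns by suffix, then transpose so that
# each column of the array holds the points of one DataFrame row
xs = df.filter(regex="_x$").to_numpy().T
ys = df.filter(regex="_y$").to_numpy().T

fig, ax = plt.subplots()
lines = ax.plot(xs, ys)  # one line per column, i.e. per row of df
fig.savefig("row_lines.png")
```

This assumes the _x and _y columns appear in the same point order, as they do in the example.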
One does not need to create additional data frames. One can loop through the rows to plot these lines:
import pandas as pd
import matplotlib.pyplot as plt
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
for i in range(len(df)): # for each row:
    # plt.plot([list of Xs], [list of Ys])
    plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]])
plt.show()
The lines will be drawn in different colors. To get lines of same color, one can add option c='k' or whatever color one wants.
plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]], c='k')
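I generally don't use pandas plotting because I find it rather limited. If using matplotlib is not an issue, the following code works. Each row of df becomes one line, because matplotlib draws one line per column of 2D inputs:
from matplotlib import pyplot as plt
# transpose so that each column holds the points of one row of df
plt.plot(df[['p1_x','p2_x']].to_numpy().T, df[['p1_y','p2_y']].to_numpy().T)
plt.show()
If each line has more points, add the extra x and y columns to the two lists in the same order.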
