Plotting categorical variable over multiple numeric variables in python

I need to plot one categorical variable over multiple numeric variables.
My DataFrame looks like this:
     party  media_user  business_user      POLI      mass
0  Party_a    0.513999       0.404201  0.696948  0.573476
1  Party_b    0.437972       0.306167  0.432377  0.433618
2  Party_c    0.519350       0.367439  0.704318  0.576708
3  Party_d    0.412027       0.253227  0.353561  0.392207
4  Party_e    0.479891       0.380711  0.683606  0.551105
And I would like a scatter plot with a different color for each variable, i.e. for every party one point per column in [media_user, business_user, POLI, mass], each column in its own color.
So like this just with scatters instead of bars:
The closest I've come is this
sns.catplot(x="party", y="media_user", jitter=False, data=sns_df, height = 4, aspect = 5);
producing:

By messing around with some other graphs I found that by simply adding linestyle = '' I could remove the line and add markers. Hope this may help somebody else!
sim_df.plot(figsize = (15,5), linestyle = '', marker = 'o')
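For the DataFrame in the question, a minimal sketch of this approach might look like the following. It assumes pandas and matplotlib are available and that party is moved into the index so each numeric column becomes its own marker series. The name df and the explicit tick labelling are illustrative choices, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "party": ["Party_a", "Party_b", "Party_c", "Party_d", "Party_e"],
    "media_user": [0.513999, 0.437972, 0.519350, 0.412027, 0.479891],
    "business_user": [0.404201, 0.306167, 0.367439, 0.253227, 0.380711],
    "POLI": [0.696948, 0.432377, 0.704318, 0.353561, 0.683606],
    "mass": [0.573476, 0.433618, 0.576708, 0.392207, 0.551105],
})

# One marker series per numeric column, with no connecting lines.
ax = df.set_index("party").plot(figsize=(15, 5), linestyle="", marker="o")

# Label the x positions with the party names explicitly.
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df["party"])
ax.set_ylabel("value")
```

Each column gets its own color from the default color cycle, and the legend is built from the column names.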

Related

Altair Layer Chart Y Axis Not Resolving to Same Scale

I'm trying to plot data and compare it to a threshold set at a fixed value. When creating a layered chart, the y axis does not appear to hold for both layers. The same goes for hconcat.
I found this issue which mentions .resolve_scale(y='shared'), but that doesn't seem to work. When I specify the rule to be at 5, it appears above 15.
import altair as alt
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
    'x': np.linspace(0, 10, 500),
    'y': np.random.randn(500).cumsum()
})
base = alt.Chart(df)
line = base.mark_line().encode(x='x', y='y')
rule = base.mark_rule().encode(y=alt.value(5))
alt.layer(line, rule).resolve_scale(y='shared')
To get the rule to appear at the value 5, I have to set it at 110.
rule = base.mark_rule().encode(y=alt.value(110))
alt.layer(line, rule).resolve_scale(y='shared')
How can I edit the chart so that the rule shows at the y-value specified?
Altair scales map a domain to a range. The domain describes the extent of the data values, while the range describes the extent of the visual features to which those values are mapped. For color encodings, the range might be "red", "blue", "green", etc. For positional encodings like x and y, the range is the pixel position of the mark on the chart.
When you use alt.value, you are specifying the range value, not the domain value. This is why you can use an encoding like color=alt.value('red'), to specify that you want the mark to appear as the color red. When you do y=alt.value(5), you are saying you want the mark to appear 5 pixels from the top of the y-axis.
Recent versions of Vega-Lite added the ability to specify the domain value via datum rather than value. Altair did not support this when this answer was written (alt.datum was added later, in Altair 4.2), so the approach that works in any version is to have a data field with the desired value. For example:
line = base.mark_line().encode(x='x', y='y')
rule = alt.Chart(pd.DataFrame({'y': [5]})).mark_rule().encode(y='y')
alt.layer(line, rule).resolve_scale(y='shared')

How can I repeat a value in a list variable for coloring bar charts in matplotlib?

Hi I am trying to search for something but I don't know the correct words to find my answer (if it exists).
I am trying to color the bars on a barchart with 24 bars using this: https://python-graph-gallery.com/3-control-color-of-barplots/
I want to color bars 0-15 one color, and bars 16-23 another color. I was wondering if there's a way I can make a variable called "my_colors" and a list without actually repeating a hexcode over and over 24x. I only need 2 colors in my list repeated a bunch of times...
Is there some notation to write this sort of list?
Since your colors are in blocks, you can just do list multiplication:
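import numpy as np
import matplotlib.pyplot as plt
# define the colors
my_colors = ['#AAAA00', '#DD00DD']
# bars 0-15 (16 bars) get the first color, bars 16-23 (8 bars) the second
colors = my_colors[:1] * 16 + my_colors[1:] * 8
# toy data
np.random.seed(1)
plt.bar(np.arange(24),
        np.random.randint(1, 10, 24),
        color=colors)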
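If there are more than two blocks, the same idea can be written once for any number of colors. This is a small sketch in plain Python. The name block_sizes is hypothetical, and the sizes 16 and 8 follow from the question (bars 0-15 and bars 16-23).

```python
my_colors = ['#AAAA00', '#DD00DD']
block_sizes = [16, 8]  # assumed block lengths: bars 0-15 and bars 16-23

# Repeat each color by its block length and flatten into one list.
colors = [color for color, n in zip(my_colors, block_sizes) for _ in range(n)]

print(len(colors))              # 24
print(colors[15], colors[16])   # last bar of block one, first bar of block two
```

The resulting list can be passed as color=colors to plt.bar exactly like the list built by multiplication.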

How to make the confidence interval (error bands) show on seaborn lineplot

I'm trying to create a plot of classification accuracy for three ML models, depending on the number of features used from the data (the number of features used is from 1 to 75, ranked according to a feature selection method). I did 100 iterations of calculating the accuracy output for each model and for each "# of features used". Below is what my data looks like (clsf from 0 to 2, timepoint from 1 to 75):
data
I am then calling the seaborn function as shown in documentation files.
sns.lineplot(x= "timepoint", y="acc", hue="clsf", data=ttest_df, ci= "sd", err_style = "band")
The plot comes out like this:
plot
I wanted there to be confidence intervals for each point on the x-axis, and don't know why it is not working. I have 100 y values for each x value, so I don't see why it cannot calculate/show it.
You could try your data set using Seaborn's pointplot function instead. It's specifically for showing an indication of uncertainty around a scatter plot of points. By default pointplot will connect values by a line. This is fine if the categorical variable is ordinal in nature, but it can be a good idea to remove the line via linestyles = "" for nominal data. (I used join = False in my example)
I tried to recreate your notebook to give a visual, but wasn't able to get the confidence interval in my plot exactly as you describe. I hope this is helpful for you.
import seaborn as sb
sb.set(style="darkgrid")
sb.pointplot(x = 'timepoint', y = 'acc', hue = 'clsf',
             data = ttest_df, ci = 'sd', palette = 'magma',
             join = False);
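To check what a standard-deviation band should look like for data of this shape, one option is to compute the mean and standard deviation per (clsf, timepoint) by hand and draw the band with matplotlib instead of seaborn. The sketch below uses synthetic data, since the original ttest_df is not available. The accuracy values, the number of timepoints, and the name stats are all made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for ttest_df: 3 classifiers, 5 timepoints, 100 repeats each.
rng = np.random.default_rng(0)
rows = [
    {"clsf": c, "timepoint": t, "acc": 0.5 + 0.1 * c + rng.normal(scale=0.05)}
    for c in range(3) for t in range(1, 6) for _ in range(100)
]
ttest_df = pd.DataFrame(rows)

# Mean and standard deviation of the 100 repeats at every (clsf, timepoint).
stats = (ttest_df.groupby(["clsf", "timepoint"])["acc"]
         .agg(["mean", "std"])
         .reset_index())

fig, ax = plt.subplots()
for clsf, grp in stats.groupby("clsf"):
    ax.plot(grp["timepoint"], grp["mean"], label=f"clsf {clsf}")
    # Shaded band of one standard deviation around the mean.
    ax.fill_between(grp["timepoint"],
                    grp["mean"] - grp["std"],
                    grp["mean"] + grp["std"],
                    alpha=0.3)
ax.legend()
```

Inspecting the std column of stats on the real data may also show whether the spread is simply too small for a band to be visible at the plot's scale.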

pandas plot line segments for each row

I have dataframes with columns containing x,y coordinates for multiple points. One row can consist of several points.
I'm trying to find out an easy way to be able to plot lines between each point generating a curve for each row of data.
Here is a simplified example where two lines are represented by two points each.
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
df.plot(y=['p1_y','p2_y'], x=['p1_x','p2_x'])
When I plot them, I expect line 1 to start at x=1 and line 2 to start at x=2.
Instead, the x axis contains two value pairs, (1,2) and (2,3), and both lines have the same start and end point on the x axis.
How do I get around this problem?
Edit:
If using matplotlib, the following hardcoded values generate the plot I'm interested in:
plt.plot([[1,2],[2,3]],[[10,9],[11,12]])
While I'm sure that there should be a more succinct way using pure pandas, here's a simple approach using matplotlib and two frames derived from the original df. (I hope I understood the question correctly.)
Assumption: In df, you place x values in even columns and y values in odd columns
Obtain x values
x = df.loc[:, df.columns[::2]]
x
   p1_x  p2_x
0     1     2
1     2     3
Obtain y values
y = df.loc[:, df.columns[1::2]]
y
   p1_y  p2_y
0    10    11
1     9    12
Then plot using a for loop
for i in range(len(df)):
    plt.plot(x.iloc[i,:], y.iloc[i,:])
One does not need to create additional data frames. One can loop through the rows to plot these lines:
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
for i in range(len(df)): # for each row:
    # plt.plot([list of Xs], [list of Ys])
    plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]])
plt.show()
The lines will be drawn in different colors. To draw them all in the same color, add the option c='k' (or whatever color you want):
plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]], c='k')
I generally don't use the pandas plotting because I think it is rather limited. If using matplotlib is not an issue, the following code works:
from matplotlib import pyplot as plt
plt.plot(df.p1_x,df.p1_y)
plt.plot(df.p2_x,df.p2_y)
plt.show()
If you have lots of lines to plot, you can use a for loop.
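One thing to watch in the snippet above: plt.plot(df.p1_x, df.p1_y) connects the p1 points of successive rows rather than joining p1 to p2 within a row. A loop-free sketch that draws one line per row passes 2D arrays to matplotlib, which plots each column as a separate line. It assumes the columns end in _x and _y and appear in point order, as in the question's example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

line1 = {'p1_x': 1, 'p1_y': 10, 'p2_x': 2, 'p2_y': 11}
line2 = {'p1_x': 2, 'p1_y': 9, 'p2_x': 3, 'p2_y': 12}
df = pd.DataFrame([line1, line2])

# Shape (rows, points) transposed to (points, rows):
# matplotlib treats each column of a 2D array as one line,
# so after transposing, every DataFrame row becomes its own line.
x = df.filter(like="_x").to_numpy().T
y = df.filter(like="_y").to_numpy().T

fig, ax = plt.subplots()
ax.plot(x, y)
```

This reproduces the hardcoded plt.plot([[1,2],[2,3]],[[10,9],[11,12]]) call from the question and extends to more points per row without changes.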

How to plot a stacked histogram with two arrays in python

I am trying to create a stacked histogram showing the clump thickness for malignant and benign tumors, with the malignant class colored red and the benign class colored blue.
I got the clump_thickness_array and benign_or_malignant_array. The benign_or_malignant_array consists of 2s and 4s.
If benign_or_malignant equals 2 it is benign(blue colored).
If it equals 4 it is malignant(red colored).
I can not figure out how to color the benign and malignant tumors. My Histogram is showing something other than what I try to achieve.
This is my code and my histogram so far:
fig, ax = plt.subplots(figsize=(12,8))
tmp = list()
for i in range(2):
    indices = np.where(benign_or_malignant>=i )
    tmp.append(clump_thickness[indices])
ax.hist(tmp,bins=10,stacked=True,color = ['b',"r"],alpha=0.73)
To obtain a stacked histogram from groups of different lengths, you need to assemble a list of arrays, one per group. This is what you are doing with your tmp variable. However, I think you are using the wrong values in your for loop: with range(2), i is 0 and then 1, and since every entry is 2 or 4, the test >= i is always true, so both groups contain all of the data. Above, you state that you want to label your data according to the variable benign_or_malignant, so you want to test whether it is exactly 2 or exactly 4. If you really just want these two possibilities, rewrite the loop like this:
for i in [2,4]:
    indices = np.where(benign_or_malignant==i )
    tmp.append(clump_thickness[indices])
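Put together, a runnable sketch of the corrected histogram could look like this. The two arrays are randomly generated stand-ins for the real data (clump thickness assumed to be an integer from 1 to 10), so only the structure carries over, not the values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the real arrays (assumption, for illustration only).
rng = np.random.default_rng(0)
clump_thickness = rng.integers(1, 11, size=200)
benign_or_malignant = rng.choice([2, 4], size=200)

# One array per class: 2 = benign, 4 = malignant.
tmp = []
for i in [2, 4]:
    indices = np.where(benign_or_malignant == i)
    tmp.append(clump_thickness[indices])

fig, ax = plt.subplots(figsize=(12, 8))
n, bins, patches = ax.hist(tmp, bins=10, stacked=True,
                           color=['b', 'r'], alpha=0.73,
                           label=['benign', 'malignant'])
ax.legend()
```

The order of the colors matches the order of the groups in tmp, so benign (2) is drawn in blue and malignant (4) is stacked on top in red.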
