How to replicate a plot: density bar plot in Python

I'm working on a project and would like to plot my data in a similar way to this example from a book:
So I would like to create a density histogram for my categorical features (left image) and then add a separate column for each value of another feature (middle and right image).
In my case the feature I want to plot is called [district_code] and I would like to create columns based on a feature called [status_group]
What I've tried so far:
sns.kdeplot(data = raw, x = "district_code"): problem, it is a line plot, not a histogram
sns.kdeplot(data = raw, x = "district_code", col = "status_group"): problem, you can't use the col argument for this plot type
sns.displot(raw, x="district_code", col = 'status_group'): problem, col argument works, but it creates a countplot, not a density plot
I would really appreciate some suggestions about the correct code I could use.
This is just an example for one of my categorical features, but I have many more I would like to plot. Any suggestions on how to turn this into a function where I could run the code for a list of categorical features would be highly appreciated.
UPDATE:
sns.displot(raw, x="source_class", stat = 'density', col = 'status_group', color = 'black'): works but looks a bit awkward for some features.
How could I improve this?
Good:
Not so good:
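One possible approach, sketched under assumptions rather than taken from the book: for a categorical feature, a "density" can be read as the share of rows in each category, so the shares can be computed with pd.crosstab(..., normalize="columns") and drawn as ordinary bars, one panel per status_group value. The helper names category_proportions and plot_category_proportions and the demo frame are hypothetical; only the column names district_code and status_group come from the question.

```python
import pandas as pd
import matplotlib.pyplot as plt


def category_proportions(df, feature, by):
    """Share of each `feature` category within every `by` group (each column sums to 1)."""
    return pd.crosstab(df[feature], df[by], normalize="columns")


def plot_category_proportions(df, feature, by):
    """One bar panel for the whole data set, then one panel per `by` group."""
    per_group = category_proportions(df, feature, by)
    # Reindex so every panel shows the categories in the same order.
    overall = (df[feature]
               .value_counts(normalize=True)
               .reindex(per_group.index, fill_value=0.0))
    n_panels = per_group.shape[1] + 1
    fig, axes = plt.subplots(1, n_panels, sharey=True, figsize=(4 * n_panels, 3))
    overall.plot.bar(ax=axes[0], color="black", title=feature)
    for ax, group in zip(axes[1:], per_group.columns):
        per_group[group].plot.bar(ax=ax, color="black", title=f"{by} = {group}")
    axes[0].set_ylabel("proportion")
    fig.tight_layout()
    return fig


# Made-up stand-in for `raw`; only the two column names come from the question.
demo = pd.DataFrame({
    "district_code": [1, 1, 2, 3, 3, 3, 2, 1],
    "status_group": ["functional"] * 4 + ["non functional"] * 4,
})
proportions = category_proportions(demo, "district_code", "status_group")

# Looping over a list of column names covers the "many more features" case.
for feature in ["district_code"]:
    plot_category_proportions(demo, feature, by="status_group")
```

Because the bars are drawn per category rather than per numeric bin, features with sparse or widely spaced codes should not produce the gaps that a numeric histogram axis can show.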

Related

Plotting a map using GeoViews with the size/colour option

I'm trying to visualize a dataset which I've filtered down to just longitude/latitude, country name, year and a count of deaths. I'm trying to plot it using geoviews, as I wish to add a lot more to my dataset and an interactive map would be a great add-on.
My code is as follows: (for_plot is the dataframe)
# Plotting the graph
Best = gv.Dataset(for_plot)
points = Best.to(gv.Points, ['longitude', 'latitude'], ['deaths', 'country'])
(gts.Wikipedia * points).opts(
    opts.Points(width=600, height=350, tools=['hover'],
                size='deaths', cmap='viridis'))
This creates a perfect graph, but the 'size' option doesn't work. If I change size to color, the graph is not generated. I'm okay with either, but I just need at least one marker.
Thanks for any help.
I tried switching to color instead of size; it works with year but not with deaths.

Which parts of my dataframe are being plotted?

The goal is to plot the data frame I'm working with on a single chart, with a line for each value of init_population where the y-axis is count and x-axis is tick_number.
I've figured out how to use groupby() and plot() together to make this:
As you can see, all the lines are there nicely, but I'm pretty confident that the blue at the top that doesn't follow the relationship the other lines are following is actually a different column of data.
So that this is reproducible, the data is available here.
import pandas as pd
max_runs_data = pd.read_csv('clean_table.csv')
del max_runs_data['visualization']
max_runs_data.columns = ['run_number','init_population', 'tick', 'turtle_count']
max_runs_data.set_index('tick', inplace = True)
test_plot_1 = max_runs_data.groupby('init_population')['turtle_count'].plot()
test_plot_2 = max_runs_data.groupby('init_population').plot(y='turtle_count')
test_plot_1 is the linked image, test_plot_2 is a separate plot for each group.
Is it obvious how to specify the columns for x and y without losing the grouping on a single chart?
Thanks
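One way to keep the grouping while naming the x and y columns explicitly is to loop over the groups and draw each one on the same Axes. This is a minimal sketch, not a diagnosis of the stray blue line: the helper name plot_groups and the toy numbers are made up, and it assumes tick is an ordinary column (call reset_index() first if it has been set as the index, as in the question).

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_groups(df, x, y, group):
    """Draw one line per `group` value on a single Axes, with x and y named explicitly."""
    fig, ax = plt.subplots()
    for key, part in df.groupby(group):
        # Sorting by x keeps each line running left to right.
        part.sort_values(x).plot(x=x, y=y, ax=ax, label=f"{group} = {key}")
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    return ax


# Made-up numbers; the column names follow the question.
toy = pd.DataFrame({
    "init_population": [10, 10, 10, 20, 20, 20],
    "tick": [0, 1, 2, 0, 1, 2],
    "turtle_count": [10, 12, 15, 20, 26, 33],
})
ax = plot_groups(toy, x="tick", y="turtle_count", group="init_population")
```

Since only the y column named in the call is drawn, any other column in the frame cannot end up on the chart by accident.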

How to make the confidence interval (error bands) show on seaborn lineplot

I'm trying to create a plot of classification accuracy for three ML models, depending on the number of features used from the data (the number of features used is from 1 to 75, ranked according to a feature selection method). I did 100 iterations of calculating the accuracy output for each model and for each "# of features used". Below is what my data looks like (clsf from 0 to 2, timepoint from 1 to 75):
data
I am then calling the seaborn function as shown in documentation files.
sns.lineplot(x= "timepoint", y="acc", hue="clsf", data=ttest_df, ci= "sd", err_style = "band")
The plot comes out like this:
plot
I wanted there to be confidence intervals for each point on the x-axis, and don't know why it is not working. I have 100 y values for each x value, so I don't see why it cannot calculate/show it.
You could try your data set using Seaborn's pointplot function instead. It's specifically for showing an indication of uncertainty around a scatter plot of points. By default pointplot will connect values by a line. This is fine if the categorical variable is ordinal in nature, but it can be a good idea to remove the line via linestyles = "" for nominal data. (I used join = False in my example)
I tried to recreate your notebook to give a visual, but wasn't able to get the confidence interval in my plot exactly as you describe. I hope this is helpful for you.
import seaborn as sb

sb.set(style="darkgrid")
# Newer seaborn releases deprecate ci= and join= in favor of errorbar= and linestyle="none".
sb.pointplot(x='timepoint', y='acc', hue='clsf',
             data=ttest_df, ci='sd', palette='magma',
             join=False)
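Before switching plot types, it may also be worth checking what the band would be computed from. With the "sd" setting, the spread comes from the rows that share the same x value and hue level, so each (clsf, timepoint) cell needs several rows and a standard deviation that is visible at the scale of the y-axis. The sketch below only does that check with pandas; replicate_summary and the toy frame are hypothetical stand-ins for ttest_df. Note that seaborn 0.12 and later spell the option errorbar="sd" and deprecate ci=.

```python
import numpy as np
import pandas as pd


def replicate_summary(df, x="timepoint", y="acc", hue="clsf"):
    """Count, mean and standard deviation of y for every (hue, x) cell."""
    return df.groupby([hue, x])[y].agg(["count", "mean", "std"]).reset_index()


# Made-up stand-in for ttest_df: 3 classifiers x 5 timepoints x 100 iterations.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    [(c, t, rng.normal(0.6 + 0.02 * t, 0.05))
     for c in range(3) for t in range(1, 6) for _ in range(100)],
    columns=["clsf", "timepoint", "acc"],
)
summary = replicate_summary(toy)
print(summary.head())
```

If count is 1 in every cell, or std is tiny compared with the range of acc, the band can be missing or hidden behind the line even though the call itself is fine.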

Python scatterplot design - select specific values of a variable for the x axis based on another columns values

I am relatively new to python and am currently trying to generate a scatterplot based off of some data using pandas & seaborn.
The data I'm using ('ErrorMedianScatter') is as follows (apologies for the link, I have yet to get permissions to embed images!):
Image of data
Each participant has two data points of interest: the mean when MissingLimb = 0 and the mean when MissingLimb = 1.
I want to create a scatterplot for participants where the x-axis represents their value for 'mean' when 'MissingLimb' = 0, and the y-axis represents their value for 'mean' when 'MissingLimb' = 1.
I am using the current code so far to create a scatterplot:
sns.lmplot(x="mean",
           y="mean",
           data=ErrorMedianScatter,
           fit_reg=False,
           hue="participant")
This generates a perfectly functional, but very uninteresting, scatterplot. What I'm stuck on is creating an x-/y-axis variable that allows for me to specify that I'm interested in the mean of a participant based on the value of 'MissingLimb' column.
Many thanks in advance!
There are most likely multiple ways to solve your problem. The method I'd take is to first transform your dataset so that there is a single row (observation) for each participant, with one column that reports the mean where MissingLimb is 0 and another column that reports the mean where MissingLimb is 1.
You can accomplish this data transformation with this code:
df = pd.pivot_table(ErrorMedianScatter,
                    values='mean',
                    index='participant',
                    columns='MissingLimb')
df.columns = ['MissingLimb 0', 'MissingLimb 1']
You can then use this (transformed) dataframe to create the scatterplot:
sns.lmplot(data=df, x='MissingLimb 0', y='MissingLimb 1')
Notice that in addition to specifying the data to plot (using the data parameter), I also specified the data to plot on the x- and y-axis (using the x and y parameters, respectively). You can add additional arguments to the sns.lmplot call and customize the plot to your specifications.
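To make the reshaping step concrete, here is a small self-contained sketch on made-up numbers; the column names follow the question, while toy and wide are hypothetical names. It renames the pivoted columns through a mapping instead of by position, which is a slight variation on the code above that does not depend on the column order.

```python
import pandas as pd

# Made-up long-format data: two rows per participant, one per MissingLimb value.
toy = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "MissingLimb": [0, 1, 0, 1, 0, 1],
    "mean": [0.2, 0.5, 0.3, 0.4, 0.1, 0.6],
})

# One row per participant, one column per MissingLimb value.
wide = toy.pivot_table(values="mean", index="participant", columns="MissingLimb")
wide = wide.rename(columns={0: "MissingLimb 0", 1: "MissingLimb 1"})
wide.columns.name = None  # drop the leftover "MissingLimb" axis label

print(wide)
```

Each row of wide is now one point for the scatterplot, with its x value in 'MissingLimb 0' and its y value in 'MissingLimb 1'.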

Heatmap with specific axis labels coloured

I am trying to plot a heatmap with 2 columns of data from a pandas dataframe. However, I would like to use a 3rd column to label the x axis, ideally by colour though another method such as an additional axis would be equally suitable. My dataframe is:
MUT   SAMPLE  VAR            GROUP
True  s1      1_1334442_T    CC002
True  s2      1_1334442_T    CC006
True  s1      1_1480354_GAC  CC002
True  s2      1_1480355_C    CC006
True  s2      1_1653038_C    CC006
True  s3      1_1730932_G    CC002
...
Just to give a better idea of the data; there are 9 different types of 'GROUP', ~60,000 types of 'VAR' and 540 'SAMPLE's. I am not sure if this is the best way to build a heatmap in python but here is what I figured out so far:
pivot = pd.crosstab(df_all['VAR'],df_all['SAMPLE'])
sns.set(font_scale=0.4)
g = sns.clustermap(pivot, row_cluster=False, yticklabels=False, linewidths=0.1, cmap="YlGnBu", cbar=False)
plt.show()
I am not sure how to get 'GROUP' to display along the x-axis, either as an additional axis or just colouring the axis labels? Any help would be much appreciated.
I'm not sure if the 'MUT' column being a boolean variable is an issue here, df_all is 'TRUE' on every 'VAR' but as pivot is made, any samples which do not have a particular 'VAR' are filled as 0, others are filled with 1. My aim was to try and cluster samples with similar 'VAR' profiles. I hope this helps.
Please let me know if I can clarify anything further. Many thanks.
Take a look at this example. You can give a list or a dataframe column to the clustermap function. By specifying either the col_colors argument or the row_colors argument you can give colours to either the rows or the columns based on that list.
In the example below I use the iris dataset and make a pandas series object that specifies which colour the specific row should have. That pandas series is given as an argument for row_colors.
iris = sns.load_dataset("iris")
species = iris.pop("species")
lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)
g = sns.clustermap(iris, row_colors=row_colors,row_cluster=False)
This code results in the following image.
You may need to tweak a bit further to also include a legend for the colouring for groups.
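Applied to the data in the question, the same idea would colour the columns (samples) by GROUP rather than the rows. The sketch below only builds the colour series; it assumes every SAMPLE belongs to exactly one GROUP and that there are at most 10 groups. The helper name group_col_colors and the hard-coded palette list are illustrative choices, and the toy frame repeats the rows shown in the question.

```python
import pandas as pd

# Hex codes of matplotlib's "tab10" palette, written out so the sketch needs only pandas.
TAB10 = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
         "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]


def group_col_colors(df, pivot, sample="SAMPLE", group="GROUP"):
    """Return one colour per pivot column, chosen by the group of that sample.

    Assumes each sample maps to a single group and there are at most 10 groups.
    """
    sample_group = df.drop_duplicates(sample).set_index(sample)[group]
    lut = dict(zip(sorted(sample_group.unique()), TAB10))
    # Align the colours with the column order of the pivot table.
    return sample_group.map(lut).reindex(pivot.columns)


# The rows shown in the question.
df_all = pd.DataFrame({
    "MUT": [True] * 6,
    "SAMPLE": ["s1", "s2", "s1", "s2", "s2", "s3"],
    "VAR": ["1_1334442_T", "1_1334442_T", "1_1480354_GAC",
            "1_1480355_C", "1_1653038_C", "1_1730932_G"],
    "GROUP": ["CC002", "CC006", "CC002", "CC006", "CC006", "CC002"],
})
pivot = pd.crosstab(df_all["VAR"], df_all["SAMPLE"])
col_colors = group_col_colors(df_all, pivot)
print(col_colors)
```

The resulting series can then be passed as col_colors=col_colors in the existing sns.clustermap call, which should draw a coloured strip above the columns with one colour per group.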
