Plotting different lines for different states on the same chart - python

I am trying to plot a distribution of the number of ___ across a few states.
I want to get all of the states on the same graph, represented by different lines.
Here is an example of what my data looks like: the state (which I want to split the lines by), the number of reviews (x axis), and the number of restaurants that have that many reviews (y axis).
State   | num_of_reviews | Count_id
alaska  | 1              | 400
alaska  | 2              | 388
alaska  | 3              | 344
...
Wyoming | 57             | 13
Whenever I try a simple line plot in seaborn or matplotlib, I just get a messy graph.
Does anyone know a snippet of code that would let me easily split the lines by df['State']?

Assuming that you have 50+ states, I wouldn't plot the distribution for each on the same plot, as it would get really messy and hard to read. Instead, I would suggest using a FacetGrid (read more about it here).
Something like this should do.
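import seaborn as sns
import matplotlib.pyplot as plt
# one small panel per state, five panels per row
g = sns.FacetGrid(df, col="State", col_wrap=5, height=1.5)
# plt.hist bins the raw num_of_reviews values; if the counts are already
# aggregated in Count_id, g.map(plt.plot, "num_of_reviews", "Count_id") may be closer
g = g.map(plt.hist, "num_of_reviews")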
You can find other possible solutions and ideas on how to visualize your data here.
If none of these work for you, it might help to explain your problem in a bit more detail and provide the desired output along with a minimal, complete, and verifiable example.
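For a handful of states, one line per state on shared axes can still be readable. The following is only a minimal sketch: it assumes a long-format frame with the three columns from the question, and the sample numbers for the second state are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample in the same shape as the question's data.
df = pd.DataFrame({
    "State": ["alaska"] * 3 + ["wyoming"] * 3,
    "num_of_reviews": [1, 2, 3, 1, 2, 3],
    "Count_id": [400, 388, 344, 120, 95, 80],
})

fig, ax = plt.subplots()
# groupby yields one sub-frame per state; each one becomes its own line.
for state, sub in df.groupby("State"):
    sub = sub.sort_values("num_of_reviews")
    ax.plot(sub["num_of_reviews"], sub["Count_id"], label=state)

ax.set_xlabel("num_of_reviews")
ax.set_ylabel("Count_id")
ax.legend(title="State")
# Call plt.show() or fig.savefig(...) to view the result.
```

With many states the legend becomes the bottleneck, which is where the FacetGrid suggestion above is likely the better fit.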

Related

Python data visualization: too small value to be visible - how to solve?

Here is the dataset I have:
Source     | All Leads | Not Junks | Warms | Hots | Deals | Weighted Sum
web        | 281316    | 269490    | 10252 | 2508 | 1602  | 4376.5
telesales  | 30458     | 29732     | 431   | 138  | 85    | 316.2
networking | 4249      | 4195      | 763   | 547  | 476   | 539.1
promos     | 1356      | 1308      | 30    | 1    | 0     | 10.8
I visualized it:
df.plot.bar()
And got this output:
Some columns have values so small that they are not visible. How can I tackle this problem?
Setting a bigger figure size doesn't help: it makes the chart bigger, but the ratio between the columns stays the same, so nothing changes.
Any ideas how to make it look more sophisticated? Or maybe I should try a different type of chart? Thank you!
You could try df.plot.bar(logy=True), but it's going to make useful interpretation of it messy. A Sankey diagram would probably be a better fit for showing how the data breaks down in each category.
Seaborn comes out a little nicer, but takes some transformation to produce the same type of output:
import seaborn as sns
import matplotlib.pyplot as plt
# reshape from wide to long: one row per (Source, Category) pair
df2 = df.melt('Source').rename(columns={'variable': 'Category', 'value': 'Values'})
sns.barplot(x='Source', y='Values', data=df2, hue='Category')
plt.show()
Output:
Or with log=True
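As a rough, self-contained sketch of the log-scale suggestion, the snippet below rebuilds the table from the question and plots it with pandas. It assumes Source should become the index so that each source gets its own cluster of bars; the figure size and axis label are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import pandas as pd

# The dataset from the question, typed in by hand.
df = pd.DataFrame({
    "Source": ["web", "telesales", "networking", "promos"],
    "All Leads": [281316, 30458, 4249, 1356],
    "Not Junks": [269490, 29732, 4195, 1308],
    "Warms": [10252, 431, 763, 30],
    "Hots": [2508, 138, 547, 1],
    "Deals": [1602, 85, 476, 0],
    "Weighted Sum": [4376.5, 316.2, 539.1, 10.8],
})

# logy=True puts the y axis on a log scale, so small counts stay visible
# next to counts that are several orders of magnitude larger.
ax = df.set_index("Source").plot.bar(logy=True, figsize=(10, 5))
ax.set_ylabel("count (log scale)")
```

A zero count (Deals for promos) has no position on a log axis, so that bar is simply absent.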

How to change colors based on percentage, US map using plotly choropleth?

I have an example dataset like the one below, and I am using it to plot a US map. Here is the example:
   prob      state_abbr  state_code
0  0.240402  California  CA
1  0.233483  Texas       TX
2  0.130376  New York    NY
3  0.117759  New Jersey  NJ
4  0.115724  Virginia    VA
5  0.081264  Illinois    IL
6  0.080993  Georgia     GA
I used the code below to plot the US map and color the states according to the data above. It works, and the plot displays properly:
import plotly.express as px
fig = px.choropleth(locations=df["state_code"], locationmode="USA-states",
                    color=df["prob"], scope="usa",
                    color_continuous_scale="Viridis")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
The change I need is this: in my example dataset, CA has a prob of 24%, so I want it to be very dark; TX has 23%, so I want it to be slightly less dark, and so on.
Also, if prob == 0%, I want it to be the default gray.
Each dark color needs to be a bit different. How can I do this? Can someone help?
Look at the documentation on colorscales. There are also many builtin colorscales that I recommend looking at first.
You can also reverse a builtin colorscale:
You can reverse a built-in color scale by appending _r to its name, for color scales given either as a string or a plotly object.
So "Viridis" would become "Viridis_r".
You could also explicitly construct a colorscale:
color_continuous_scale=["red", "green", "blue"]
Or probably the closest to what you described in your question is to do something like this:
color_continuous_scale=[(0, "gray"), (0.1, "yellow"), (1, "purple")]
Which gives us:
Adjust the values above according to your requirements.
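If only an exact zero should be gray, one option is a scale with a hard step just above the bottom. The sketch below is an assumption-laden illustration: `zero_gray_colorscale` and `plot_us_map` are hypothetical helper names, the colors are arbitrary, and the map function assumes the `state_code` and `prob` columns from the question. It relies on `range_color` to pin the bottom of the scale to 0, because otherwise the scale starts at the smallest prob in the data.

```python
def zero_gray_colorscale(light="thistle", dark="indigo", zero="lightgray", eps=1e-6):
    """Build a colorscale list: `zero` at the very bottom, then light -> dark.

    Repeating the stop position `eps` creates a hard step instead of a blend.
    """
    return [(0.0, zero), (eps, zero), (eps, light), (1.0, dark)]


scale = zero_gray_colorscale()


def plot_us_map(df):
    """Hypothetical wrapper around the question's choropleth call."""
    # Imported lazily so the helper above works without plotly installed.
    import plotly.express as px

    fig = px.choropleth(
        locations=df["state_code"], locationmode="USA-states",
        color=df["prob"], scope="usa",
        color_continuous_scale=scale,
        range_color=(0, df["prob"].max()),  # map prob == 0 to the gray end
    )
    fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
    return fig
```

Calling `plot_us_map(df).show()` would then render the map, with larger prob values shading toward the dark end of the scale.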

How to plot distribution of 30 features in one plot and differentiating by label in python

I am a beginner in Python, trying to plot the distribution of the Breast Cancer Wisconsin (Diagnostic) Dataset from the UCI Machine Learning Repository.
My dataset looks like this
Mean_Radius | Mean_Texture | Mean_Perimeter | Mean_Area | Mean_Smoothness | Diagnosis
17.99       | 10.38        | 122.80         | 1001.0    | 0.11840         | M
12.99       | 11.38        | 125.80         | 1021.0    | 0.12540         | B
15.99       | 9.38         | 123.80         | 1000.0    | 0.21840         | M
12.09       | 12.38        | 135.80         | 900.0     | 0.32540         | B
I want to create something like the picture below (a distribution of all 30 features),
but with both classes separated, like this:
Does anyone know how I can do this in Python or MATLAB?
I tried this code, but it does not give me exactly what I want.
sns.pairplot(Data,vars=['Mean_Radius','Mean_Texture','Mean_Perimeter','Mean_Area','Mean_Smoothness','Mean_Compactness','Mean_Concavity','Mean_ConcavePts','Mean_Symmetry','Mean_FractalDim','SE_Radius','SE_Texture','SE_Perimeter','SE_Area','SE_Smoothness','SE_Compactness','SE_Concavity','SE_ConcavePts','SE_Symmetry','SE_FractalDim','Worst_Radius','Worst_Texture','Worst_Perimeter','Worst_Area','Worst_Smoothness','Worst_Compactness','Worst_Concavity','Worst_ConcavePts','Worst_Symmetry','Worst_FractalDim'], hue='Diagnosis')
Is there an alternative way to make this plot for all 30 features that clearly shows both classes?
The example below shows how to plot a histogram for each feature:
import matplotlib.pyplot as plt
plt.figure(figsize=(24,200))
try:
    # one subplot per column, laid out on a 10 x 3 grid
    for i, col in enumerate(Data.columns.to_list()):
        plt.subplot(10, 3, i + 1)
        plt.hist(Data[col], label=col, color='blue')
        plt.legend()
        plt.title(col)
        plt.tight_layout()
except Exception as e:
    # report the column that could not be plotted
    print(col, e)
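To show both classes, one approach is to overlay one semi-transparent histogram per Diagnosis value in each panel. This is only a sketch: the frame below is random stand-in data with three of the question's column names, and the grid size, bin count, and alpha are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; replace with the real 30-feature frame.
rng = np.random.default_rng(0)
n = 100
Data = pd.DataFrame({
    "Mean_Radius": rng.normal(14, 3, n),
    "Mean_Texture": rng.normal(19, 4, n),
    "Mean_Perimeter": rng.normal(92, 24, n),
    "Diagnosis": rng.choice(["M", "B"], n),
})

features = [c for c in Data.columns if c != "Diagnosis"]
ncols = 3
nrows = int(np.ceil(len(features) / ncols))
fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)

for ax, col in zip(axes.ravel(), features):
    # One translucent histogram per class, drawn on the same axes.
    for label, sub in Data.groupby("Diagnosis"):
        ax.hist(sub[col], bins=20, alpha=0.5, label=label)
    ax.set_title(col)
    ax.legend(title="Diagnosis")

# Hide any unused panels in the last row.
for ax in axes.ravel()[len(features):]:
    ax.set_visible(False)

fig.tight_layout()
```

With all 30 features the same loop should produce a 10 x 3 grid, since the layout is computed from the number of feature columns.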

How to modify the output of my COXPH image drawn by cph.plot_covariate_groups

I do not know how to modify the output image produced by lifelines, since I am unfamiliar with "cph.plot_covariate_groups". Unfortunately, there seems to be no detailed description of it at this link: https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html .
What I am looking for is: (1) how to shorten the event days (x axis), since I do not want to show such a long time span for the survival curve; ideally it would end at 4000. (2) If possible, I would also like to remove the baseline survival curve from my image. (3) I would also like to change the colors of the survival curves from orange/blue to something else.
import pandas as pd
from lifelines import AalenAdditiveFitter, CoxPHFitter, KaplanMeierFitter
data = pd.read_csv("cluster label.csv", index_col=0)
cph = CoxPHFitter()
cph.fit(data, duration_col="time", event_col="status")
cph.plot_covariate_groups('label', [0,1])
This is all possible. Information about specific functions and methods is available on the docs page: https://lifelines.readthedocs.io/en/latest/References.html
Specifically: https://lifelines.readthedocs.io/en/latest/lifelines.fitters.html#lifelines.fitters.coxph_fitter.CoxPHFitter.plot_covariate_groups
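So try this (in recent lifelines releases the method is named plot_partial_effects_on_outcome):
import matplotlib.pyplot as plt
cph.plot_covariate_groups('label', [0,1],
    plot_baseline=False,   # (2) drop the baseline survival curve
    cmap='coolwarm'        # (3) pick a different colormap for the curves
)
plt.xlim(0, 4000)          # (1) cut the x axis off at 4000 days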

Plotting boxplots for a groupby object

I would like to plot boxplots for several datasets based on a criterion.
Imagine a dataframe similar to the example below:
df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})
df = df[['Group','M','F']]
Group M F
0 1 0.465636 0.537723
1 1 0.560537 0.727238
2 1 0.268154 0.648927
3 2 0.722644 0.115550
4 3 0.586346 0.042896
5 2 0.562881 0.369686
6 2 0.395236 0.672477
7 3 0.577949 0.358801
8 1 0.764069 0.642724
9 3 0.731076 0.302369
In this case, I have three groups, so I would like to make a boxplot for each group, and for M and F separately, with the groups on the Y axis and the M and F columns colour-coded.
This answer is very close to what I want to achieve, but I would prefer something more robust, applicable to larger dataframes with a greater number of groups. I feel that groupby is the way to go, but I am not familiar with groupby objects and I am failing to even slice them.
The desired output would look something like this:
Looks like years ago, someone had the same problem, but got no answers :( Having a boxplot as a graphical representation of the describe function of groupby
My questions are:
How do I use groupby to feed the desired data into the boxplot?
What is the correct syntax for the box plot if I want to control what is displayed rather than just use the default settings (I don't even know what they are, and I find the documentation rather vague)? To be specific, can I have the box cover the mean +/- standard deviation and keep the vertical line at the median value?
I think you should use the Seaborn library, which lets you create this type of customized plot. In your case, I first melted your dataframe to convert it into the proper format and then created the boxplot of your choice.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# reshape to long format: one row per (Group, sex, value)
dd=pd.melt(df,id_vars=['Group'],value_vars=['M','F'],var_name='sex')
sns.boxplot(y='Group',x='value',data=dd,orient="h",hue='sex')
The plot looks similar to your required plot.
Finally, I found a solution by slightly modifying this answer. It does not use groupby object, so it is more tedious to prepare the data, but so far it looks like the best solution to me. Here it is:
import numpy as np
import matplotlib.pyplot as plt

# here I prepare the data (group them manually and then store in lists)
Groups = [1, 2, 3]
Columns = df.columns.tolist()[1:]
print(Columns)
Mgroups = []
Fgroups = []
for g in Groups:
    dfgc = df[df['Group'] == g]
    m = dfgc['M'].dropna()
    f = dfgc['F'].dropna()
    Mgroups.append(m.tolist())
    Fgroups.append(f.tolist())

fig = plt.figure()
ax = plt.axes()

def setBoxColors(bp, cl):
    plt.setp(bp['boxes'], color=cl, linewidth=2.)
    plt.setp(bp['whiskers'], color=cl, linewidth=2.5)
    plt.setp(bp['caps'], color=cl, linewidth=2)
    plt.setp(bp['medians'], color=cl, linewidth=3.5)

# whis=(0, 100) stretches the whiskers over the full data range
# (older matplotlib versions spelled this whis='range')
bpl = plt.boxplot(Mgroups, positions=np.array(range(len(Mgroups)))*3.0-0.4, vert=False, whis=(0, 100), sym='', widths=0.6)
bpr = plt.boxplot(Fgroups, positions=np.array(range(len(Fgroups)))*3.0+0.4, vert=False, whis=(0, 100), sym='', widths=0.6)
setBoxColors(bpr, '#D7191C') # colors are from http://colorbrewer2.org/
setBoxColors(bpl, '#2C7BB6')

# draw temporary red and blue lines and use them to create a legend
plt.plot([], c='#D7191C', label='F')
plt.plot([], c='#2C7BB6', label='M')
plt.legend()
plt.yticks(range(0, len(Groups) * 3, 3), Groups)
plt.ylim(-3, len(Groups)*3)
#plt.xlim(0, 8)
plt.show()
The result looks mostly like what I wanted (as far as I have been able to find, the box always ranges from first to third quartile, so it is not possible to set it to +/- standard deviation). So I am a bit disappointed there is no one-line solution, but I am glad it is possible. However, for hundreds of groups this would not be good enough...
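One possible way around both limitations, offered as a sketch rather than a definitive answer, is to combine groupby with matplotlib's lower-level Axes.bxp, which draws boxes from precomputed statistics. The helper name `mean_std_stats`, the offsets, and the random data are assumptions made for illustration, and the boxes are drawn vertically here for simplicity.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Group": [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
    "M": rng.random(10),
    "F": rng.random(10),
})


def mean_std_stats(values):
    """Box statistics in the dict format that Axes.bxp expects."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    lo, hi = mean - std, mean + std
    return {
        "med": float(np.median(values)),  # line inside the box stays at the median
        "q1": lo,                         # box edges are mean -/+ std, not quartiles
        "q3": hi,
        "whislo": min(values.min(), lo),  # whiskers reach the data range,
        "whishi": max(values.max(), hi),  # never ending inside the box
        "fliers": [],
    }


fig, ax = plt.subplots()
colors = {"M": "#2C7BB6", "F": "#D7191C"}
offsets = {"M": -0.2, "F": 0.2}
groups = sorted(df["Group"].unique())

for col in ["M", "F"]:
    # groupby iterates over the groups in sorted key order, however many there are.
    stats = [mean_std_stats(sub[col].dropna()) for _, sub in df.groupby("Group")]
    positions = [i + offsets[col] for i in range(len(stats))]
    bp = ax.bxp(stats, positions=positions, widths=0.3,
                showfliers=False, manage_ticks=False)
    plt.setp(bp["boxes"], color=colors[col])
    ax.plot([], c=colors[col], label=col)  # empty line used only for the legend

ax.set_xticks(range(len(groups)))
ax.set_xticklabels([str(g) for g in groups])
ax.legend()
```

Because the statistics are computed per group inside the loop, the same code should scale to any number of groups without preparing the lists by hand.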
