I have the following code for a scatter plot:
dimens = (12, 10)
fig, ax = plt.subplots(figsize=dimens)
sns.scatterplot(data = information, x = 'latitude', y = 'longitude', hue="genre", s=200,
x_jitter=4, y_jitter=4, ax=ax)
No matter what I change the jitter to, the points still remain very close together. What's wrong with it?
Example dataframe:
store longitude latitude genre
mcdonalds 140.232323 40.434343 all
kfc 140.232323 40.434343 chicken
burgerking 138.434343 35.545433 burger
fiveguys 137.323984 36.543322 burger
The help page says:
{x,y}_jitter: booleans or floats. Currently non-functional.
You can either add a new column or do it on the fly:
import seaborn as sns
import pandas as pd
import numpy as np
information = pd.DataFrame({'store':['mcdonalds','kfc','burgerking','fiveguys'],
'longitude':[140.232323,140.232323,138.434343,137.323984],
'latitude':[40.434343,40.434343,35.545433,36.543322],
'genre':['all','chicken','burger','burger']})
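def jitter(values, j):
    # add zero-mean Gaussian noise with standard deviation j,
    # so the points are spread out without being shifted
    return values + np.random.normal(0, j, values.shape)
sns.scatterplot(x = jitter(information.latitude, 0.1),
y = jitter(information.longitude, 0.1),
hue=information.genre,s=200,alpha=0.5)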
The parameter s=200 sets the individual scatter points to a very large size.
Adding 4 points of jitter is very little compared to that.
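Because x_jitter and y_jitter are documented as non-functional, the noise has to be added to the data before plotting. Below is a minimal sketch of that idea, assuming zero-mean uniform noise; the helper name add_jitter and the width of 0.3 degrees are illustrative choices, not part of seaborn or of the answers above.

```python
import numpy as np
import pandas as pd

information = pd.DataFrame({
    'store': ['mcdonalds', 'kfc', 'burgerking', 'fiveguys'],
    'longitude': [140.232323, 140.232323, 138.434343, 137.323984],
    'latitude': [40.434343, 40.434343, 35.545433, 36.543322],
    'genre': ['all', 'chicken', 'burger', 'burger'],
})

def add_jitter(values, width, rng):
    """Return values plus zero-mean uniform noise drawn from [-width, width)."""
    return values + rng.uniform(-width, width, size=len(values))

rng = np.random.default_rng(0)  # seeded so the result is reproducible
width = 0.3  # assumed jitter width in degrees; tune it to the data range

# assign() returns a new frame, so the original coordinates stay untouched
jittered = information.assign(
    latitude=add_jitter(information['latitude'], width, rng),
    longitude=add_jitter(information['longitude'], width, rng),
)
```

The jittered frame can then be passed to sns.scatterplot(data=jittered, x='latitude', y='longitude', hue='genre') in place of the original. How far the points separate on screen depends on the jitter width relative to the axis range, not on the marker size s.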
I have the code below with randomly generated dataframes and I would like to extract the x and y values of both plotted lines. These line plots show the Price on the Y-axis and are Volume weighted.
For some reason, the line values of the second distribution plot are not stored in the variables "df_2_x" and "df_2_y"; those variables receive the values of "df_1_x" and "df_1_y" instead. Both print statements return True, so the arrays are completely equal.
If I put them in separate cells in a notebook, it does work.
I also looked at this solution: How to retrieve all data from seaborn distribution plot with mutliple distributions?
But this does not work for weighted distplots.
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2,12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100,3000)) for i in range(30)]
Price_2 = [round(random.uniform(0,10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100,1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1' : Price_1,
'Volume_1' : Volume_1})
df_2 = pd.DataFrame({'Price_2' : Price_2,
'Volume_2' :Volume_2})
df_1_x, df_1_y = sns.distplot(df_1.Price_1, hist_kws={"weights":list(df_1.Volume_1)}).get_lines()[0].get_data()
df_2_x, df_2_y = sns.distplot(df_2.Price_2, hist_kws={"weights":list(df_2.Volume_2)}).get_lines()[0].get_data()
print((df_1_x == df_2_x).all())
print((df_1_y == df_2_y).all())
Why does this happen, and how can I fix this?
Whether or not weights are used doesn't make a difference here.
The principal problem is that you are extracting the first curve again in df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[0].get_data(). You'd want the second curve instead: df_2_x, df_2_y = sns.distplot(df_2....).get_lines()[1].get_data().
Note that seaborn isn't really meant to concatenate commands. Sometimes it works, but it usually adds a lot of confusion. E.g. sns.distplot returns an ax (which represents a subplot). Graphical elements such as lines are added to that ax.
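The accumulation of lines on one ax can be reproduced with plain matplotlib. The sketch below uses made-up data and ax.plot in place of sns.distplot, so it only illustrates the indexing behaviour, not the weighted density curves themselves.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# the first plotting call adds one line to the axes
ax.plot([0, 1, 2], [0, 1, 0])
x_first, y_first = ax.get_lines()[0].get_data()

# the second call draws on the same axes, so there are now two lines
ax.plot([0, 1, 2], [2, 3, 2])
x_wrong, y_wrong = ax.get_lines()[0].get_data()    # still the first line
x_second, y_second = ax.get_lines()[1].get_data()  # the line just added

print(len(ax.get_lines()))
```

Index 0 always refers to the first line drawn on the axes, regardless of which call returned the ax. In a notebook with separate cells each call gets a fresh figure, which is why the original code appears to work there.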
Also note that sns.distplot has been deprecated. It will be removed from Seaborn in one of the next versions. It is replaced by sns.histplot and sns.kdeplot.
Here is what the code could look like:
import pandas as pd
import random
import seaborn as sns
import matplotlib.pyplot as plt
Price_1 = [round(random.uniform(2, 12), 2) for i in range(30)]
Volume_1 = [round(random.uniform(100, 3000)) for i in range(30)]
Price_2 = [round(random.uniform(0, 10), 2) for i in range(30)]
Volume_2 = [round(random.uniform(100, 1500)) for i in range(30)]
df_1 = pd.DataFrame({'Price_1': Price_1,
'Volume_1': Volume_1})
df_2 = pd.DataFrame({'Price_2': Price_2,
'Volume_2': Volume_2})
ax = sns.histplot(x=df_1.Price_1, weights=list(df_1.Volume_1), bins=10, kde=True, kde_kws={'cut': 3})
sns.histplot(x=df_2.Price_2, weights=list(df_2.Volume_2), bins=10, kde=True, kde_kws={'cut': 3}, ax=ax)
df_1_x, df_1_y = ax.lines[0].get_data()
df_2_x, df_2_y = ax.lines[1].get_data()
# use fill_between to demonstrate where the extracted curves lie
ax.fill_between(df_1_x, 0, df_1_y, color='b', alpha=0.2)
ax.fill_between(df_2_x, 0, df_2_y, color='r', alpha=0.2)
plt.show()
GOAL: I want to make a distribution function for registered dogs' ages in 2017 in Zurich from the 'Dogs of Zurich' dataset (Kaggle) (with Python). The variable I'm working with - 'GEBURTSJAHR_HUND' - gives the birth year for every registered dog as an int.
I have converted it to a 'dog_age' variable (= 2017 - birth_date) and want to plot the distribution function. See image below for sorted list of group size per age.
Size of dog age groups
PROBLEM: The problem I'm running into is that the x axis of my distribution plot has empty spaces/bars in it. Every age is shown on the graph, but between some of these ages there are empty bars.
Example: 1 and 2 are full bars, but between them is an empty space. Between 2 and 3, there is no empty space, but between 3 and 4 there is. Seemingly random which values have white spaces between them.
What my problematic distribution plot looks like at the moment
TRIED: I have previously tried three things to fix this.
1. plt.xticks(...)
Unfortunately, this only changed the aesthetics of the x axis.
2. ax = sns.distplot followed by the ax.xaxis ticker lines below, but this did not have the expected result.
ax.xaxis.set_major_locator(ticker.MultipleLocator())
ax.xaxis.set_major_formatter(ticker.ScalarFormatter(0))
3. Maybe the problem is with the 'dog_age' variable?
I used the original birth_date variable instead, but this had the same problem.
CODE:
dfnew = pd.read_csv(dog17_filepath,index_col='HALTER_ID')
dfnew.dropna(subset = ["ALTER"], inplace=True)
dfnew['dog_age'] = 2017 - dfnew['GEBURTSJAHR_HUND']
b = dfnew['dog_age']
sns.set_style("darkgrid")
plt.figure(figsize=(15,5))
sns.distplot(a=b,hist=True)
plt.xticks(np.arange(min(b), max(b)+1, 1))
plt.xlabel('Age Dog', fontsize=12)
plt.title('Distribution of age of dogs', fontsize=20)
plt.show()
Thanks in advance,
Arthur
The problem is that the age column is discrete: it only contains a short range of integers. By default, the histogram divides the (floating-point) range of values into a fixed number of bins, whose edges usually don't align well with those integers, so some bins contain no integer at all and stay empty. To get an appropriate histogram, the bins need to be set explicitly, for example with a bin edge at every half (0.5, 1.5, 2.5, ...).
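The effect can be checked numerically without plotting. This is a small sketch with randomly generated integer ages (not the Zurich data), comparing an arbitrary bin count against edges placed at every half.

```python
import numpy as np

rng = np.random.default_rng(1)
ages = rng.integers(0, 16, size=500)  # made-up integer ages between 0 and 15

# a fixed number of equal-width bins: more bins than distinct ages,
# so some bins cannot contain any integer and stay empty
default_counts, default_edges = np.histogram(ages, bins=50)

# one bin per integer, with edges at min-0.5, min+0.5, ..., max+0.5
aligned_edges = np.arange(ages.min() - 0.5, ages.max() + 1, 1)
aligned_counts, _ = np.histogram(ages, bins=aligned_edges)

print("empty bins with 50 bins:", int((default_counts == 0).sum()))
print("counts per age:", aligned_counts)
```

With the aligned edges every bin covers exactly one age, so the counts equal the frequency of each integer and no artificial gaps appear.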
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
dfnew = pd.read_csv('hundehalter.csv')
dfnew.dropna(subset=["ALTER"], inplace=True)
dfnew['dog_age'] = 2017 - dfnew['GEBURTSJAHR_HUND']
b = dfnew['dog_age'][(dfnew['dog_age'] >= 0) & (dfnew['dog_age'] <= 25)]
sns.set_style("darkgrid")
plt.figure(figsize=(15, 5))
sns.distplot(a=b, hist=True, bins=np.arange(min(b)-0.5, max(b)+1, 1))
plt.xticks(np.arange(min(b), max(b) + 1, 1))
plt.xlabel('Age Dog', fontsize=12)
plt.title('Distribution of age of dogs', fontsize=20)
plt.xlim(min(b), max(b) + 1)
plt.show()
Hi all, I have the following groups of data:
sumcosts = df.groupby('AgeGroup').Costs.sum()
print(sumcosts)
AgeGroup
18-25 536295.37
25-35 1784085.88
35-45 2395250.62
45-55 5483060.33
55-65 11652094.30
65-75 9633490.63
75+ 5186867.32
Name: Costs, dtype: float64
countoftrips = df.groupby('AgeGroup').Booking.nunique()
print(countoftrips)
AgeGroup
18-25 139
25-35 398
35-45 379
45-55 738
55-65 1417
65-75 995
75+ 545
Name: Booking, dtype: int64
When trying to plot these, I used the following:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
sns.set()
fig, ax1 = plt.subplots()
sns.barplot(data=sumcosts, palette="rocket", ax=ax1)
ax2 = ax1.twinx()
sns.lineplot(data=countoftrips, palette="rocket", ax=ax2)
plt.show()
The output is this:
The line section looks correct, but the bar chart has obviously stopped at the first age bracket. Any ideas on how to correct this? I tried to define x='Agegroup' and y='Costs' but then got errors, and this is the most progress I have made. Thanks very much!
Your barplot appears to be showing the sum of all costs, not just those of the 18-25 age group. The bar only appears under the x-axis label for the 18-25 group because of the positioning of your axis for the line plot, which makes it confusing.
I created a dummy data set of 1000 rows in a .csv to graph this example. My values are different, so the plots will look visually different, but everything else will work the same for you.
Jupyter Notebook Setup:
(images added to reflect outputs)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
# Read in dataset 'df', showing the header
df = pd.read_csv('./data-raw.csv')
df.head()
This assumes you have no NaN values in your data; otherwise you can use dropna() to remove them.
# Check if there are any NaN values in the DataFrame
print('Number of NaN values in the columns of our DataFrame:\n', df.isnull().sum())
# Remove any rows that contain NaN values using dropna (as applicable)
df.dropna(axis=0, inplace=True)
Your sumcosts and countoftrips are not a requirement for creating your plots, and I believe they are the cause of your plotting error for the bar graph. I've included them here, but I am not using them when creating the plot.
Plot Type:
It is also important to keep in mind that a bar plot shows only the mean (or another estimator, e.g. the standard deviation) of the values, but in many cases it may be more informative to show the distribution of values at each level of the categorical variable. In that case, other approaches such as a box or violin plot may be more appropriate.
Solution:
This is assuming you want to have the line and bar plot layered over each other, as in your example:
# This plot has both graphs on the axis you outlined in your code,
# I used the ci = None parameter to remove the confidence intervals to
# make the combined plot easier to read (optional)
fig, ax1 = plt.subplots()
sb.barplot(data = df, x = 'AgeGroup', y = 'Costs', ci = None,
ax = ax1, palette = 'rocket', order = ['18-25',
'25-35','35-45','45-55','55-65', '65-75', '75+']);
ax2 = ax1.twinx()
sb.lineplot(data = df, x = 'AgeGroup', y = 'Booking', ax = ax2, ci = None);
plt.xlabel('Age Group Ranges');
plt.show()
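If the aggregated Series from the question are preferred over the raw DataFrame, they can also be drawn directly with matplotlib. This is a sketch of that alternative, not the approach used in this answer; the Series are rebuilt from the printed output in the question, and the colours are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

age_groups = ['18-25', '25-35', '35-45', '45-55', '55-65', '65-75', '75+']
index = pd.Index(age_groups, name='AgeGroup')

# values copied from the printed sumcosts and countoftrips in the question
sumcosts = pd.Series(
    [536295.37, 1784085.88, 2395250.62, 5483060.33,
     11652094.30, 9633490.63, 5186867.32],
    index=index, name='Costs')
countoftrips = pd.Series(
    [139, 398, 379, 738, 1417, 995, 545], index=index, name='Booking')

fig, ax1 = plt.subplots()
# one bar per age group on the left y axis
ax1.bar(list(sumcosts.index), sumcosts.values, color='tab:purple')
ax1.set_xlabel('Age Group Ranges')
ax1.set_ylabel('Costs')

# bookings as a line on a second y axis that shares the same x axis
ax2 = ax1.twinx()
ax2.plot(list(countoftrips.index), countoftrips.values,
         color='tab:orange', marker='o')
ax2.set_ylabel('Bookings')

fig.savefig('costs_and_bookings.png')
```

Passing the index and the values explicitly avoids any ambiguity about how a bare Series is interpreted as plot data.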
Here is an alternative you could try, also using subplot, but separating the two plots.
# Adjusting the plot size just to make it easier to read here:
plt.figure(figsize = [14, 4])
#Bar Chart on Left
plt.subplot(1, 2, 1) # 1 row, 2 cols, subplot 1
sb.barplot(data = df, x = 'AgeGroup', y = 'Costs', palette = 'rocket',
ci = 'sd', order = ['18-25', '25-35', '35-45',
'45-55','55-65', '65-75', '75+']);
plt.xlabel('Age Group Ranges')
plt.ylabel('Costs')
# Line Chart on Right
plt.subplot(1, 2, 2) # 1 row, 2 cols, subplot 2
sb.lineplot(data = df, x = 'AgeGroup', y = 'Booking', ci = None)
plt.xlabel('Age Group Ranges')
plt.ylabel('Bookings');
Hope you find this helpful!
I am trying to visualize three variables in the same plot. To explain better, this is the code:
import pandas as pd
from matplotlib import pyplot as plt
an_1 = pd.read_csv('an_1.csv', header=None, names=('Pd', 'V')) # M = 10 ^ 3 (gamma=0.01)
# ex for stack: an_1 = pd.DataFrame(data = {'Pd': [0.5,0.6,0.7,0.8], 'V':[200,210,230,240]})
plt.figure(figsize=(8,5), dpi=100)
plt.plot (an_1.Pd, an_1.V, 'r*--', label='Analyt_1')
perc_excedd = pd.read_csv('perc.csv', header=None, names=('Pd', 'V', 'exc'))
# ex for stack: perc_excedd = pd.DataFrame(data = {'Pd': [0.5,0.5,0.5,0.4,0.4,0.4],
#'V':[200,210,220,200,210,220], 'exc':[0.1,0.1,0.2,0.3,0.1,0.2]})
Basically, an_1.csv has different values of Pd, each with a specific value of V.
In perc.csv, for a single value of Pd I have different values of perc_exceed, which correspond to different values of V. In the comments I just put random values to help make it clear.
I would like to keep the graph I already have and add another y axis to it with the points of perc_exceed, which depend on both Pd and V.
I hope I've been clear enough. Thanks!
You can use the twinx function, which adds a second y axis that shares the same x axis.
ax1 = plt.gca() # get the current axis
ax2 = ax1.twinx() # get another y axis.
ax1.plot(an_1.Pd, an_1.V, 'r*--', label='Analyt_1')
ax2.plot(perc_excedd.Pd, perc_excedd.V, 'g*--', label='Excedd')
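A self-contained variant may make this easier to try out. The sketch below uses the sample values from the comments in the question, with the exceedance column named exc and trimmed to six values so the lengths match (an assumption). Putting exc on the second y axis and encoding V as colour is one possible reading of the question, not something the answer above prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

an_1 = pd.DataFrame({'Pd': [0.5, 0.6, 0.7, 0.8], 'V': [200, 210, 230, 240]})
perc_excedd = pd.DataFrame({
    'Pd': [0.5, 0.5, 0.5, 0.4, 0.4, 0.4],
    'V': [200, 210, 220, 200, 210, 220],
    'exc': [0.1, 0.1, 0.2, 0.3, 0.1, 0.2],  # assumed: trailing value dropped
})

fig, ax1 = plt.subplots(figsize=(8, 5))
ax2 = ax1.twinx()  # second y axis sharing the x axis (Pd)

# left axis: the existing curve of V against Pd
line, = ax1.plot(an_1.Pd, an_1.V, 'r*--', label='Analyt_1')
# right axis: exc against Pd, coloured by the V value each point belongs to
points = ax2.scatter(perc_excedd.Pd, perc_excedd.exc,
                     c=perc_excedd.V, cmap='viridis', label='exc')

ax1.set_xlabel('Pd')
ax1.set_ylabel('V')
ax2.set_ylabel('exc')
fig.colorbar(points, ax=[ax1, ax2], label='V of exc points', pad=0.1)

# each axes keeps its own artists, so build one combined legend by hand
handles = [line, points]
ax1.legend(handles, [h.get_label() for h in handles], loc='upper left')

fig.savefig('twin_axes.png')
```

Collecting the handles from both axes is needed because calling legend() on one axes only lists the artists drawn on that axes.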
I am dealing with the following data frame (only for illustration, actual df is quite large):
seq x1 y1
0 2 0.7725 0.2105
1 2 0.8098 0.3456
2 2 0.7457 0.5436
3 2 0.4168 0.7610
4 2 0.3181 0.8790
5 3 0.2092 0.5498
6 3 0.0591 0.6357
7 5 0.9937 0.5364
8 5 0.3756 0.7635
9 5 0.1661 0.8364
I am trying to plot a multiple-line graph for the above coordinates (x as "x1" against y as "y1").
Rows with the same "seq" form one path and have to be plotted as one separate line: all the x, y coordinates corresponding to seq = 2 belong to one line, and so on.
I am able to plot them, but only on separate graphs. I want all the lines on the same graph. I tried using subplots, but I am not getting it right.
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib notebook
df.groupby("seq").plot(kind = "line", x = "x1", y = "y1")
This creates hundreds of graphs (one per unique seq). Please suggest a way to obtain all the lines on the same graph.
UPDATE:
To resolve the above problem, I implemented the following code:
fig, ax = plt.subplots(figsize=(12,8))
df.groupby('seq').plot(kind='line', x = "x1", y = "y1", ax = ax)
plt.title("abc")
plt.show()
Now, I want a way to plot the lines with specific colors. I am clustering path from seq = 2 and 5 in cluster 1; and path from seq = 3 in another cluster.
So, there are two lines under cluster 1 which I want in red and 1 line under cluster 2 which can be green.
How should I proceed with this?
You need to create the axes before plotting and pass them to the plot call, as in this example:
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(25, 3)), columns=['ProjID','Xcoord','Ycoord'])
# plot groupby results on the same canvas
fig, ax = plt.subplots(figsize=(8,6))
df.groupby('ProjID').plot(kind='line', x = "Xcoord", y = "Ycoord", ax=ax)
plt.show()
Consider the dataframe df
df = pd.DataFrame(dict(
ProjID=np.repeat(range(10), 10),
Xcoord=np.random.rand(100),
Ycoord=np.random.rand(100),
))
Then we create abstract art like this
df.set_index('Xcoord').groupby('ProjID').Ycoord.plot()
Another way:
for k, g in df.groupby('ProjID'):
    plt.plot(g['Xcoord'], g['Ycoord'])
plt.show()
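The explicit loop also makes it straightforward to colour each line by cluster, as asked in the update to the question. Below is a minimal sketch using the sample rows from the question; the dictionaries cluster_of_seq and cluster_colors are hypothetical names introduced here to hold the cluster assignment described there.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'seq': [2, 2, 2, 2, 2, 3, 3, 5, 5, 5],
    'x1': [0.7725, 0.8098, 0.7457, 0.4168, 0.3181,
           0.2092, 0.0591, 0.9937, 0.3756, 0.1661],
    'y1': [0.2105, 0.3456, 0.5436, 0.7610, 0.8790,
           0.5498, 0.6357, 0.5364, 0.7635, 0.8364],
})

# hypothetical lookup tables: which cluster a path belongs to, and its colour
cluster_of_seq = {2: 1, 5: 1, 3: 2}
cluster_colors = {1: 'red', 2: 'green'}

fig, ax = plt.subplots(figsize=(12, 8))
for seq, group in df.groupby('seq'):
    color = cluster_colors[cluster_of_seq[seq]]
    # one line per path, all drawn on the same axes
    ax.plot(group['x1'], group['y1'], color=color, label=f'seq {seq}')
ax.legend()
fig.savefig('paths_by_cluster.png')
```

For a large DataFrame the cluster assignment would more likely come from a column produced by the clustering step; the dictionary is only a stand-in for that.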
Here is a working example including the ability to adjust legend names.
import matplotlib.pyplot as plt
fig, ax = plt.subplots() # create the axes that all groups are drawn on
grp = df.groupby('groupCol')
legendNames = grp.apply(lambda x: x.name) # get group names using the name attribute
# legendNames = list(grp.groups.keys()) # alternative way to get group names; which of the two is faster has not been measured
plots = grp.plot('x1', 'y1', legend=True, ax=ax)
# overwrite the default legend entries with the group names
for txt, name in zip(ax.legend_.texts, legendNames):
    txt.set_text(name)
Explanation:
Legend values get stored in the attribute ax.legend_, which in turn contains a list of Text() objects, with one item per group; the Text class is found within the matplotlib.text API. To set the text object values, you can use the setter method set_text(self, s).
As a side note, the Text class has a number of set_X() methods that allow you to change the font sizes, fonts, colors, etc. I haven't used those, so I don't know for sure they work, but can't see why not.
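As a quick check of those setters, here is a small sketch with two made-up lines. It renames the legend entries and also changes their font size and colour through the Text objects; the labels and styling values are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label='a')
ax.plot([0, 1], [1, 0], label='b')
legend = ax.legend()

new_names = ['Group A', 'Group B']
for txt, name in zip(legend.get_texts(), new_names):
    txt.set_text(name)    # replace the label text
    txt.set_fontsize(12)  # font size in points
    txt.set_color('navy')

fig.savefig('legend_text.png')
```

legend.get_texts() is the public accessor for the same list of Text objects that ax.legend_.texts exposes.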
Based on Serenity's answer, I made the legend better.
import pandas as pd
import matplotlib.pylab as plt
import numpy as np
# random df
df = pd.DataFrame(np.random.randint(0,10,size=(25, 3)), columns=['ProjID','Xcoord','Ycoord'])
# plot groupby results on the same canvas
grouped = df.groupby('ProjID')
fig, ax = plt.subplots(figsize=(8,6))
grouped.plot(kind='line', x = "Xcoord", y = "Ycoord", ax=ax)
ax.legend(labels=grouped.groups.keys()) ## better legend
plt.show()
You can also do it like this:
grouped = df.groupby('ProjID')
fig, ax = plt.subplots(figsize=(8,6))
g_plot = lambda x:x.plot(x = "Xcoord", y = "Ycoord", ax=ax, label=x.name)
grouped.apply(g_plot)
plt.show()
and it looks like: