I have a dataframe with the weight and the number of measures of each user. The df looks like:
id_user  weight  number_of_measures
1        92.16   4
2        80.34   5
3        71.89   11
4        81.11   7
5        77.23   8
6        92.37   2
7        88.18   3
I would like to see a histogram with one attribute of the table on the x-axis (weight here, but I want to do the same for number_of_measures) and the frequency on the y-axis.
Does anyone know how to do it with matplotlib?
Ok, it seems to be quite easy:
import pandas as pd
import matplotlib.pyplot as plt
hist = df.hist(bins=50)
plt.show()
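For one histogram per column with labeled axes, something like the following may help. It is only a sketch: it rebuilds the table from the question by hand, the choice of 5 bins is arbitrary, and the names columns and hist_counts are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop it for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame({
    "id_user": [1, 2, 3, 4, 5, 6, 7],
    "weight": [92.16, 80.34, 71.89, 81.11, 77.23, 92.37, 88.18],
    "number_of_measures": [4, 5, 11, 7, 8, 2, 3],
})

columns = ["weight", "number_of_measures"]  # the two attributes to plot
fig, axes = plt.subplots(1, len(columns), figsize=(10, 4))

hist_counts = {}  # hypothetical helper: keeps the bin counts for inspection
for ax, column in zip(axes, columns):
    # ax.hist returns the counts per bin, the bin edges, and the drawn patches
    counts, bin_edges, _ = ax.hist(df[column], bins=5)
    hist_counts[column] = counts
    ax.set_xlabel(column)
    ax.set_ylabel("frequency")

fig.tight_layout()
```

With an interactive backend, a final plt.show() displays both panels side by side.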
I have the code as below to plot multiple plots on the same figure
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
    a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
    sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
    leg = "{} vs {}".format(wavelength1,wavelength2)
    print(leg) #this line is only to see the variable leg has the proper content
    ax.legend(leg)
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
I get the plot shown in the picture below, where the legend shows only the first five letters of the label as separate entries, even though the variable leg has the right content.
There was a similar question whose solution was to wrap the legend variable in square brackets. I tried this with the code below.
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
    a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
    sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
    leg = "{} vs {}".format(wavelength1,wavelength2)
    print(leg) #this line is only to see the variable leg has the proper content
    ax.legend([leg])
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
Now I get the full legend text, but only the first legend entry is shown, as in the picture below.
Can someone let me know how to get the full legend for all the plots? Thanks.
dummy data (the plot in pic will NOT match)
14nm 15nm 16nm 17nm 18nm 19nm layer_thickness
1 2 3 4 5 6 0
1 2 3 4 5 6 0
3 5 7 9 11 13 5700
1 2 3 4 5 6 0
3 5 7 9 11 13 8600
1 2 3 4 5 6 0
3 5 7 9 11 13 5000
1 2 3 4 5 6 0
45 55 65 75 85 95 100
1 2 3 4 5 6 0
8 15 22 29 36 43 16600
wave_lengths=['15nm','16nm','14nm','18nm']
Answer Update
Based on the answer from Quang Hoang, these are the output pictures using scatter from matplotlib and sns.scatterplot.
With plt it is pretty natural:
def wl_ratioplot(wavelength1,wavelength2, dataframe,
                 x1=0.1,x2=1.5,y1=-500,y2=25000,
                 ax=None):
    # fall back to the current axes if none is passed
    if ax is None:
        ax = plt.gca()
    leg = "{} vs {}".format(wavelength1,wavelength2)
    # set the label here, and let plt deal with it
    # also, you don't need to copy the dataframe:
    ax.scatter(x=dataframe[wavelength1]/dataframe[wavelength2],
               y=dataframe['layer_thickness'],label=leg)
    ax.set_xlim(x1,x2)
    ax.set_ylim(y1,y2)
fig, ax = plt.subplots(figsize=(25, 10))
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
ax.legend()
Every time you call wl_ratioplot, the legend is replaced, so only one entry is ever shown. Store all the labels in a list instead and draw the legend once at the end.
ax.legend([leg]) # this resets the legend on every call
Create a list outside the function:
legends = []
Inside the function, append the label instead of drawing the legend:
legends.append(leg)
After all the function calls, draw the legend once:
ax.legend(legends)
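A runnable sketch of this collect-then-draw idea follows. It uses plain ax.scatter with made-up points rather than the asker's data, and the function name ratio_plot and the handles list are assumptions added for illustration. Passing the handles explicitly to ax.legend keeps the pairing between each scatter and its label unambiguous.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

handles = []  # one artist per scatter call
legends = []  # the matching label for each artist

def ratio_plot(ax, x, y, label):
    """Draw one scatter and remember its handle and label (hypothetical helper)."""
    handle = ax.scatter(x, y)
    handles.append(handle)
    legends.append(label)

fig, ax = plt.subplots()
# Made-up points, only to have something to draw.
ratio_plot(ax, [0.5, 0.6, 0.7], [0, 5700, 8600], "14nm vs 15nm")
ratio_plot(ax, [0.6, 0.7, 0.8], [0, 5000, 100], "15nm vs 16nm")
ratio_plot(ax, [0.9, 1.0, 1.1], [0, 16600, 0], "18nm vs 16nm")

# Draw the legend once, after all the plotting calls.
ax.legend(handles, legends)

legend_texts = [text.get_text() for text in ax.get_legend().get_texts()]
```

Here legend_texts holds one full label per scatter call, instead of single letters or a single entry.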
I am trying to draw a frequency bar plot and a cumulative "ogive" in the same plot. Drawn separately, both look fine, but when they are drawn in the same figure, the cumulative curve is shifted. The code I used is below.
df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,3,6,9,5,11,5,12,7,7,5,4,6]});
df['Correctas'].value_counts(sort = False).plot.bar();
df['Correctas'].value_counts(sort = False).cumsum().plot();
plt.show()
The cumulative frequency data is
2 1
3 3
4 7
5 14
6 20
7 24
8 27
9 29
10 30
11 31
12 32
So the cumulative curve should start at 2 on the x-axis, but it starts at 4.
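This happens because a bar chart treats the x-axis as categorical: the bars are drawn at positions 0, 1, 2, ... and only labeled with the index values, while the line plot uses the numeric index values (2, 3, 4, ...) as x positions. The line's first point therefore lands on the third bar, which is labeled 4. Converting the index to strings makes the line plot use the same positions. Here is a quick fix: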
df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,3,6,9,5,11,5,12,7,7,5,4,6]});
df_counts = df['Correctas'].value_counts(sort = False)
df_counts.index = df_counts.index.astype('str')
df_counts.plot.bar(alpha=.8);
df_counts.cumsum().plot(color='k', kind='line');
plt.show();
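An alternative sketch, not from the answer above, is to skip the pandas plotting wrappers and give both artists the same explicit positions with matplotlib directly. The names counts, cumulative and positions are chosen here for illustration, and sort_index is used so the categories run from 2 to 12.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

values = [4, 6, 5, 4, 7, 2, 8, 3, 5, 6, 9, 6, 6, 7, 5, 5,
          8, 10, 4, 8, 3, 6, 9, 5, 11, 5, 12, 7, 7, 5, 4, 6]

# Frequency of each value, ordered by the value itself (2, 3, ..., 12).
counts = pd.Series(values).value_counts().sort_index()
cumulative = counts.cumsum()

# Both the bars and the line are drawn at the same positions 0..n-1.
positions = list(range(len(counts)))

fig, ax = plt.subplots()
ax.bar(positions, counts.to_numpy(), alpha=0.8)
(line,) = ax.plot(positions, cumulative.to_numpy(), color="k")

# Label the shared positions with the actual values.
ax.set_xticks(positions)
ax.set_xticklabels([str(value) for value in counts.index])
```

Because the bars and the line share positions, the cumulative curve starts above the bar labeled 2.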
Python beginner/first poster here.
I'm running into trouble adding color bars to scatter plots. I have two types of plot: one that shows all the data color-coded by date, and one that shows just the maximum values of my data color-coded by date. In the first case, I can use the df.index (which is datetime) to make my color bar, but in the second case, I am using df2['col'].idxmax to generate the colors because my df2 is a df.groupby object which I'm using to generate the daily maximums in my data, and it does not have an accessible index.
For the first type of plot, I have succeeded in generating a date-based color bar with the code below, cobbled together from online examples:
fig, ax = plt.subplots(1,1, figsize=(20,20))
smap=plt.scatter(df.col1, df.col2, s=140,
c=[date2num(i.date()) for i in df.index],
marker='.')
cb = fig.colorbar(smap, orientation='vertical',
format=DateFormatter('%d %b %y'))
However for the second type of plot, where I am trying to use df2['col'].idxmax to create the date series instead of df.index, the following does not work:
for n in cols1:
    for m in cols2:
        fig, ax = plt.subplots(1,1, figsize=(15,15))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna() #some NaNs in the
                                         #.idxmax series were giving date2num trouble
        smap2=plt.scatter(df2[n].max(), df2[m].max(),
                          s=160, c=[date2num(i.date()) for i in PlottableTimes],
                          marker='.')
        cb2 = fig.colorbar(smap2, orientation='vertical',
                           format=DateFormatter('%d %b %y'))
        plt.show()
The error is: 'length of rgba sequence should be either 3 or 4'
Because the error was complaining of the color argument, I separately checked the output of the color (that is, c=) arguments in the respective plotting commands, and both look similar to me, so I can't figure out why one color argument works and the other doesn't:
one that works:
[736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
...]
one that doesn't work:
[736845.0,
736846.0,
736847.0,
736848.0,
736849.0,
736850.0,
736851.0,
736852.0,
736853.0,
736854.0,
...]
Any suggestions or explanations? I'm running python 3.5.2. Thank you in advance for helping me understand this.
Edit 1: I made the following example for others to explore, and in the process realized the crux of the issue is different than my first question. The code below works the way I want it to:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import date2num, DateFormatter
from pandas import Grouper

df=pd.DataFrame(np.random.randint(low=0, high=10, size=(169, 8)),
                columns=['a', 'b', 'c', 'd', 'e','f','g','h']) #make sample data
date_rng = pd.date_range(start='1/1/2018', end='1/8/2018', freq='H')
df['i']=date_rng
df = df.set_index('i') #get a datetime index
df['ts']=date_rng #get a datetime column to group by
df2=df.groupby(Grouper(key='ts', freq='D'))
for n in ['a','b','c','d']: #now make some plots
    for m in ['e','f','g','h']:
        print(m)
        print(n)
        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()
        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
                         c=[date2num(i.date()) for i in PlottableTimes],
                         marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                          format=DateFormatter('%d %b %y'))
        plt.show()
The only difference between my real data and this example is that my real data has many NaNs scattered throughout. So, I think what is going wrong is that the 'c=' argument isn't long enough for the plotting command to interpret it as covering the whole date range...? For example, if I manually put in the output of the c= command, I get the following code which also works:
for n in ['a','b','c','d']:
    for m in ['e','f','g','h']:
        print(m)
        print(n)
        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()
        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
                         c=[736809.0, 736810.0, 736811.0, 736812.0, 736813.0, 736814.0, 736815.0, 736816.0],
                         marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                          format=DateFormatter('%d %b %y'))
        plt.show()
But, if I shorten the c= array by some amount, to emulate what is happening in my code when NaNs are being dropped from idxmax, it gives the same error I am seeing:
for n in ['a','b','c','d']:
    for m in ['e','f','g','h']:
        print(m)
        print(n)
        fig, ax = plt.subplots(1,1, figsize=(5,5))
        maxTimes=df2[n].idxmax()
        PlottableTimes=maxTimes.dropna()
        smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
                         c=[736809.0, 736810.0, 736811.0, 736812.0, 736813.0, 736814.0],
                         marker='.')
        cb = fig.colorbar(smap, orientation='vertical',
                          format=DateFormatter('%d %b %y'))
        plt.show()
So this means the real question is: how can I grab the grouper column after grouping from the groupby object, when none of the columns appear to be grab-able with df2.col? I would like to be able to grab 'ts' from the following and use it to be the color data, instead of using idxmax:
df2['a'].max()
ts
2018-01-01 9
2018-01-02 9
2018-01-03 9
2018-01-04 9
2018-01-05 9
2018-01-06 9
2018-01-07 9
2018-01-08 8
Freq: D, Name: a, dtype: int64
Essentially, your Grouper call is similar to setting your datetime column as the index and calling pandas.DataFrame.resample with an aggregate function:
df.set_index('ts').resample('D').max()
# a b c d e f g h
# ts
# 2018-01-01 9 9 8 9 9 9 9 9
# 2018-01-02 9 9 9 9 9 9 9 9
# 2018-01-03 9 9 9 9 9 9 9 9
# 2018-01-04 9 9 9 9 9 9 9 9
# 2018-01-05 9 9 9 9 9 9 9 9
# 2018-01-06 9 9 9 8 9 9 9 9
# 2018-01-07 9 9 9 9 9 9 9 9
# 2018-01-08 2 8 6 3 1 3 2 7
Therefore, df2['a'].max() returns a pandas Series indexed by the grouping key ts, so it carries an index property which you can use for the color bar specification:
df2['a'].max().index
# DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
# '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
# dtype='datetime64[ns]', name='ts', freq='D')
From there you can pass into date2num without list comprehension:
date2num(df2['a'].max().index)
# array([736695., 736696., 736697., 736698., 736699., 736700., 736701., 736702.])
Altogether, simply use the above inside the loop, without needing maxTimes or PlottableTimes:
fig, ax = plt.subplots(1, 1, figsize = (5,5))
smap = plt.scatter(df2[n].max(), df2[m].max(), s = 160,
c = date2num(df2[n].max().index),
marker = '.')
cb = fig.colorbar(smap, orientation = 'vertical',
format = DateFormatter('%d %b %y'))
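As a self-contained check of this approach, here is a sketch with randomly generated data and a few NaNs added on purpose. The column names a and e, the seed, and the NaN positions are assumptions for illustration. The point is that the colors come from the index of the aggregated Series, so their length always matches the number of plotted points.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.dates import DateFormatter, date2num

rng = np.random.default_rng(0)
# 169 hourly timestamps: seven full days plus the first hour of the eighth.
timestamps = pd.date_range("2018-01-01", periods=169, freq=pd.Timedelta(hours=1))

df = pd.DataFrame(rng.integers(0, 10, size=(169, 2)).astype(float),
                  columns=["a", "e"])
df["ts"] = timestamps
df.loc[5:20, "a"] = np.nan  # scatter some NaNs, as in the real data

grouped = df.groupby(pd.Grouper(key="ts", freq="D"))
max_a = grouped["a"].max()  # Series indexed by day
max_e = grouped["e"].max()

# One color value per group, taken from the grouping key itself.
colors = date2num(max_a.index)

fig, ax = plt.subplots(figsize=(5, 5))
smap = ax.scatter(max_a, max_e, s=160, c=colors, marker=".")
cb = fig.colorbar(smap, orientation="vertical",
                  format=DateFormatter("%d %b %y"))
```

Since nothing is dropped from the index, colors, max_a and max_e all have one entry per day.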
I'm trying to scatter plot the following dataframe:
mydf = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9],
'y':[9,8,7,6,5,4,3,2,1],
'z':np.random.randint(0,9, 9)},
index=["12:00", "1:00", "2:00", "3:00", "4:00",
"5:00", "6:00", "7:00", "8:00"])
x y z
12:00 1 9 1
1:00 2 8 1
2:00 3 7 7
3:00 4 6 7
4:00 5 5 4
5:00 6 4 2
6:00 7 3 2
7:00 8 2 8
8:00 9 1 8
I would like to see the times "12:00, 1:00, ..." as the x-axis and x,y,z columns on the y-axis.
When I try to plot with pandas via mydf.plot(kind="scatter"), I get the error ValueError: scatter requires and x and y column. Do I have to break down my dataframe into appropriate parameters? What I would really like to do is get this scatter plotted with seaborn.
Just running
mydf.plot(style=".")
works fine for me:
Seaborn is actually built around pandas.DataFrames. However, your data frame needs to be "tidy":
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Since you want to plot x, y, and z on the same plot, it seems like they are actually different observations. Thus, you really have three variables: time, value, and the letter used.
The "tidy" standard comes from Hadley Wickham, who implemented it in the tidyr package.
First, I convert the index to a Datetime:
mydf.index = pd.DatetimeIndex(mydf.index)
Then we do the conversion to tidy data:
pivoted = mydf.unstack().reset_index()
and rename the columns
pivoted = pivoted.rename(columns={"level_0": "letter", "level_1": "time", 0: "value"})
Now, this is what our data looks like:
letter time value
0 x 2019-03-13 12:00:00 1
1 x 2019-03-13 01:00:00 2
2 x 2019-03-13 02:00:00 3
3 x 2019-03-13 03:00:00 4
4 x 2019-03-13 04:00:00 5
Unfortunately, seaborn doesn't play with DateTimes that well, so you can just extract the hour as an integer:
pivoted["hour"] = pivoted["time"].dt.hour
With a data frame in this form, seaborn takes in the data easily:
import seaborn as sns
sns.set()
sns.scatterplot(data=pivoted, x="hour", y="value", hue="letter")
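The same reshaping can also be sketched with DataFrame.melt, which is an alternative to the unstack route above rather than the answer's method. The z values are copied from the printed frame in the question (the original code drew them at random), the name tidy is made up, and the hour is parsed from the index strings directly instead of going through a DatetimeIndex.

```python
import pandas as pd

mydf = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5, 6, 7, 8, 9],
     "y": [9, 8, 7, 6, 5, 4, 3, 2, 1],
     "z": [1, 1, 7, 7, 4, 2, 2, 8, 8]},  # values as printed in the question
    index=["12:00", "1:00", "2:00", "3:00", "4:00",
           "5:00", "6:00", "7:00", "8:00"],
)

# Wide to long: one row per (time, letter) pair.
tidy = (
    mydf.rename_axis("time")
        .reset_index()
        .melt(id_vars="time", var_name="letter", value_name="value")
)

# Extract the hour as an integer from strings like "12:00".
tidy["hour"] = tidy["time"].str.split(":").str[0].astype(int)
```

The resulting frame can be passed to sns.scatterplot(data=tidy, x="hour", y="value", hue="letter") in the same way as pivoted.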