Using pandas series date as xtick label

Using pandas series date as xtick label - python

I have this dataframe called 'dfArrivalDate' (with the first 11 rows shown)
arrival_date count
0 2013-06-08 9
1 2013-06-27 8
2 2013-03-06 8
3 2013-06-01 8
4 2013-06-28 6
5 2012-11-28 6
6 2013-06-11 5
7 2013-06-29 5
8 2013-06-09 4
9 2013-06-03 3
10 2013-05-31 3
sortedArrivalDate = transform.sort('arrival_date')
I wanted to plot them in a bar chart to see the count by arrival date. I called
sortedArrivalDate.plot(kind = 'bar') [![enter image description here][1]]
but i'm getting the index as the row ticks of my bar chart. I figured i need to use 'xticks'.
sortedArrivalDate.plot(kind = 'bar', xticks = sortedArrivalDate.arrival_date)
but I run into the error: TypeError: Cannot compare type 'Timestamp' with type 'float'
I tried a different approach.
fig, ax = plt.subplots()
ax.plot(sortedArrivalDate.arrival_date, sortedArrivalDate.count)
This time the error is ValueError: x and y must have same first dimension
I'm thinking this might just be an easy fix and since I don't have much experience coding in pandas and matplotlib, I might be missing a very simple thing here. Care to guide me in the right direction? thanks.

IIUC:
df = df.sort_values(by='arrival_date')
df.plot(x='arrival_date', y='count', kind='bar')

Related

Plot with Histogram an attribute from a dataframe

I have a dataframe with the weight and the number of measures of each user. The df looks like:
id_user
weight
number_of_measures
1
92.16
4
2
80.34
5
3
71.89
11
4
81.11
7
5
77.23
8
6
92.37
2
7
88.18
3
I would like to see an histogram with the attribute of the table (weight, but I want to do it for both cases) at the x-axis and the frequency in the y-axis.
Does anyone know how to do it with matplotlib?

Ok, it seems to be quite easy:
import pandas as pd
import matplotlib.pyplot as plt
hist = df.hist(bins=50)
plt.show()

legends not print fully when multiple plots are plotted on same figure

I have the code as below to plot multiple plots on the same figure
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
ax.set_xlim(x1,x2)
ax.set_ylim(y1,y2)
leg = "{} vs {}".format(wavelength1,wavelength2)
print(leg) #this line is only to see the variable legend has the proper content
ax.legend(leg)
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
I get the plot as the below pic where the legend seems to be first 5 letters separately even though the variable legend has the right content
There was another similar question & the solution was to put a square bracket to the variable legend. I tried this with the code as below.
fig, ax = plt.subplots(figsize=(25, 10))
def wl_ratioplot(wavelength1,wavelength2, dataframe, x1=0.1,x2=1.5,y1=-500,y2=25000):
a=dataframe[['asphalt_index','layer_thickness',wavelength1,wavelength2]].copy()
sns.scatterplot(x=a[wavelength1]/a[wavelength2],y=a['layer_thickness'],data=a)
ax.set_xlim(x1,x2)
ax.set_ylim(y1,y2)
leg = "{} vs {}".format(wavelength1,wavelength2)
print(leg)#this line is only to see the variable legend has the proper content
ax.legend([leg])
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=train_df_wo_outliers,x1=-.1,x2=3)
Now I get the full legend but only the first legend is shown as the pic below
Can someone let me know how to get the full legend for all the plots? Thanks.
dummy data (the plot in pic will NOT match)
14nm 15nm 16nm 17nm 18nm 19nm layer_thickness
1 2 3 4 5 6 0
1 2 3 4 5 6 0
3 5 7 9 11 13 5700
1 2 3 4 5 6 0
3 5 7 9 11 13 8600
1 2 3 4 5 6 0
3 5 7 9 11 13 5000
1 2 3 4 5 6 0
45 55 65 75 85 95 100
1 2 3 4 5 6 0
8 15 22 29 36 43 16600
wave_lengths=['15nm','16nm','14nm','18nm']
Answer Update
Based on answer from Quang Hoang. The output pics using scatter plot from matplotlib & sns.scatterplot

With plt it is pretty natural:
def wl_ratioplot(wavelength1,wavelength2, dataframe,
x1=0.1,x2=1.5,y1=-500,y2=25000,
ax=None):
leg = "{} vs {}".format(wavelength1,wavelength2)
# set the label here, and let plt deal with it
# also, you don't need to copy the dataframe:
ax.scatter(x=dataframe[wavelength1]/dataframe[wavelength2],
y=dataframe['layer_thickness'],label=leg)
ax.set_xlim(x1,x2)
ax.set_ylim(y1,y2)
fig, ax = plt.subplots(figsize=(25, 10))
wl_ratioplot(wave_lengths[2],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[0],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[3],wave_lengths[0],dataframe=df,x1=-.1,x2=3, ax=ax)
wl_ratioplot(wave_lengths[2],wave_lengths[1],dataframe=df,x1=-.1,x2=3, ax=ax)
ax.legend()
Output:

every time you call the function wl_ratioplot the legend is being reset the final value. use a array to store all the legends then access it all through a loop.
ax.legend([leg]) #it is resetting the legend after each call.
use a legends = [];
legends.append([leg])
after all function calls, draw the legend differently
ax.legend(legends)

panda DataFrame.value_counts().plot().bar() and DataFrame.value_counts().cumsum().plot() not using the same axis

I am trying to draw a frequency bar plot and a cumulative "ogive" in the same plot. If I draw them separately both are shown OK, but when shown in the same figure, the cumulative graphic is shown shifted. Below the code used.
df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,3,6,9,5,11,5,12,7,7,5,4,6]});
df['Correctas'].value_counts(sort = False).plot.bar();
df['Correctas'].value_counts(sort = False).cumsum().plot();
plt.show()
The frequency data is
2 1
3 3
4 7
5 14
6 20
7 24
8 27
9 29
10 30
11 31
12 32
So the cumulative shall start from 2 and it starts from 4 on x axis.
image showing the error

This has to do with bar chart plotting categorical x-axis. Here is a quick fix:
df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,3,6,9,5,11,5,12,7,7,5,4,6]});
df_counts = df['Correctas'].value_counts(sort = False)
df_counts.index = df_counts.index.astype('str')
df_counts.plot.bar(alpha=.8);
df_counts.cumsum().plot(color='k', kind='line');
plt.show();
Output:

How to make a date-based color bar based on df.idxmax series?

Python beginner/first poster here.
I'm running into trouble adding color bars to scatter plots. I have two types of plot: one that shows all the data color-coded by date, and one that shows just the maximum values of my data color-coded by date. In the first case, I can use the df.index (which is datetime) to make my color bar, but in the second case, I am using df2['col'].idxmax to generate the colors because my df2 is a df.groupby object which I'm using to generate the daily maximums in my data, and it does not have an accessible index.
For the first type of plot, I have succeeded in generating a date-based color bar with the code below, cobbled together from online examples:
fig, ax = plt.subplots(1,1, figsize=(20,20))
smap=plt.scatter(df.col1, df.col2, s=140,
c=[date2num(i.date()) for i in df.index],
marker='.')
cb = fig.colorbar(smap, orientation='vertical',
format=DateFormatter('%d %b %y'))
However for the second type of plot, where I am trying to use df2['col'].idxmax to create the date series instead of df.index, the following does not work:
for n in cols1:
for m in cols2:
fig, ax = plt.subplots(1,1, figsize=(15,15))
maxTimes=df2[n].idxmax()
PlottableTimes=maxTimes.dropna() #some NaNs in the
#.idxmax series were giving date2num trouble
smap2=plt.scatter(df2[n].max(), df2[m].max(),
s=160, c=[date2num(i.date()) for i in PlottableTimes],
marker='.')
cb2 = fig.colorbar(smap2, orientation='vertical',
format=DateFormatter('%d %b %y'))
plt.show()
The error is: 'length of rgba sequence should be either 3 or 4'
Because the error was complaining of the color argument, I separately checked the output of the color (that is, c=) arguments in the respective plotting commands, and both look similar to me, so I can't figure out why one color argument works and the other doesn't:
one that works:
[736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
736809.0,
...]
one that doesn't work:
[736845.0,
736846.0,
736847.0,
736848.0,
736849.0,
736850.0,
736851.0,
736852.0,
736853.0,
736854.0,
...]
Any suggestions or explanations? I'm running python 3.5.2. Thank you in advance for helping me understand this.
Edit 1: I made the following example for others to explore, and in the process realized the crux of the issue is different than my first question. The code below works the way I want it to:
df=pd.DataFrame(np.random.randint(low=0, high=10, size=(169, 8)),
columns=['a', 'b', 'c', 'd', 'e','f','g','h']) #make sample data
date_rng = pd.date_range(start='1/1/2018', end='1/8/2018', freq='H')
df['i']=date_rng
df = df.set_index('i') #get a datetime index
df['ts']=date_rng #get a datetime column to group by
from pandas import Grouper
df2=df.groupby(Grouper(key='ts', freq='D'))
for n in ['a','b','c','d']: #now make some plots
for m in ['e','f','g','h']:
print(m)
print(n)
fig, ax = plt.subplots(1,1, figsize=(5,5))
maxTimes=df2[n].idxmax()
PlottableTimes=maxTimes.dropna()
smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
c=[date2num(i.date()) for i in PlottableTimes],
marker='.')
cb = fig.colorbar(smap, orientation='vertical',
format=DateFormatter('%d %b %y'))
plt.show()
The only difference between my real data and this example is that my real data has many NaNs scattered throughout. So, I think what is going wrong is that the 'c=' argument isn't long enough for the plotting command to interpret it as covering the whole date range...? For example, if I manually put in the output of the c= command, I get the following code which also works:
for n in ['a','b','c','d']:
for m in ['e','f','g','h']:
print(m)
print(n)
fig, ax = plt.subplots(1,1, figsize=(5,5))
maxTimes=df2[n].idxmax()
PlottableTimes=maxTimes.dropna()
smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
c=[736809.0, 736810.0, 736811.0, 736812.0, 736813.0, 736814.0, 736815.0, 736816.0],
marker='.')
cb = fig.colorbar(smap, orientation='vertical',
format=DateFormatter('%d %b %y'))
plt.show()
But, if I shorten the c= array by some amount, to emulate what is happening in my code when NaNs are being dropped from idxmax, it gives the same error I am seeing:
for n in ['a','b','c','d']:
for m in ['e','f','g','h']:
print(m)
print(n)
fig, ax = plt.subplots(1,1, figsize=(5,5))
maxTimes=df2[n].idxmax()
PlottableTimes=maxTimes.dropna()
smap=plt.scatter(df2[n].max(), df2[m].max(), s=160,
c=[736809.0, 736810.0, 736811.0, 736812.0, 736813.0, 736814.0],
marker='.')
cb = fig.colorbar(smap, orientation='vertical',
format=DateFormatter('%d %b %y'))
plt.show()
So this means the real question is: how can I grab the grouper column after grouping from the groupby object, when none of the columns appear to be grab-able with df2.col? I would like to be able to grab 'ts' from the following and use it to be the color data, instead of using idxmax:
df2['a'].max()
ts
2018-01-01 9
2018-01-02 9
2018-01-03 9
2018-01-04 9
2018-01-05 9
2018-01-06 9
2018-01-07 9
2018-01-08 8
Freq: D, Name: a, dtype: int64

Essentially, your Grouper call is similar to indexing on your date time column and callingpandas.DataFrame.resample specifying the aggregate function:
df.set_index('ts').resample('D').max()
# a b c d e f g h
# ts
# 2018-01-01 9 9 8 9 9 9 9 9
# 2018-01-02 9 9 9 9 9 9 9 9
# 2018-01-03 9 9 9 9 9 9 9 9
# 2018-01-04 9 9 9 9 9 9 9 9
# 2018-01-05 9 9 9 9 9 9 9 9
# 2018-01-06 9 9 9 8 9 9 9 9
# 2018-01-07 9 9 9 9 9 9 9 9
# 2018-01-08 2 8 6 3 1 3 2 7
Therefore, the return of df2['a'].max() is a Pandas Resampler object, very similar to a Pandas Series and hence carries the index property which you can use for color bar specification:
df['a'].max().index
# DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
# '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
# dtype='datetime64[ns]', name='ts', freq='D')
From there you can pass into date2num without list comprehension:
date2num(df2['a'].max().index)
# array([736695., 736696., 736697., 736698., 736699., 736700., 736701., 736702.])
Altogether, simply use above in loop without needing maxTimes or PlottableTimes:
fig, ax = plt.subplots(1, 1, figsize = (5,5))
smap = plt.scatter(df2[n].max(), df2[m].max(), s = 160,
c = date2num(df2[n].max().index),
marker = '.')
cb = fig.colorbar(smap, orientation = 'vertical',
format = DateFormatter('%d %b %y'))

How do I get a simple scatter plot of a dataframe (preferrably with seaborn)

I'm trying to scatter plot the following dataframe:
mydf = pd.DataFrame({'x':[1,2,3,4,5,6,7,8,9],
'y':[9,8,7,6,5,4,3,2,1],
'z':np.random.randint(0,9, 9)},
index=["12:00", "1:00", "2:00", "3:00", "4:00",
"5:00", "6:00", "7:00", "8:00"])
x y z
12:00 1 9 1
1:00 2 8 1
2:00 3 7 7
3:00 4 6 7
4:00 5 5 4
5:00 6 4 2
6:00 7 3 2
7:00 8 2 8
8:00 9 1 8
I would like to see the times "12:00, 1:00, ..." as the x-axis and x,y,z columns on the y-axis.
When I try to plot with pandas via mydf.plot(kind="scatter"), I get the error ValueError: scatter requires and x and y column. Do I have to break down my dataframe into appropriate parameters? What I would really like to do is get this scatter plotted with seaborn.

Just running
mydf.plot(style=".")
works fine for me:

Seaborn is actually built around pandas.DataFrames. However, your data frame needs to be "tidy":
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
Since you want to plot x, y, and z on the same plot, it seems like they are actually different observations. Thus, you really have three variables: time, value, and the letter used.
The "tidy" standard comes from Hadly Wickham, who implemented it in the tidyr package.
First, I convert the index to a Datetime:
mydf.index = pd.DatetimeIndex(mydf.index)
Then we do the conversion to tidy data:
pivoted = mydf.unstack().reset_index()
and rename the columns
pivoted = pivoted.rename(columns={"level_0": "letter", "level_1": "time", 0: "value"})
Now, this is what our data looks like:
letter time value
0 x 2019-03-13 12:00:00 1
1 x 2019-03-13 01:00:00 2
2 x 2019-03-13 02:00:00 3
3 x 2019-03-13 03:00:00 4
4 x 2019-03-13 04:00:00 5
Unfortunately, seaborn doesn't play with DateTimes that well, so you can just extract the hour as an integer:
pivoted["hour"] = pivoted["time"].dt.hour
With a data frame in this form, seaborn takes in the data easily:
import seaborn as sns
sns.set()
sns.scatterplot(data=pivoted, x="hour", y="value", hue="letter")
Outputs:

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using pandas series date as xtick label - python

IIUC: df = df.sort_values(by='arrival_date') df.plot(x='arrival_date', y='count', kind='bar')

Related

Plot with Histogram an attribute from a dataframe

legends not print fully when multiple plots are plotted on same figure

panda DataFrame.value_counts().plot().bar() and DataFrame.value_counts().cumsum().plot() not using the same axis

How to make a date-based color bar based on df.idxmax series?

How do I get a simple scatter plot of a dataframe (preferrably with seaborn)

Categories

Resources