Plotting pandas DataFrame with matplotlib - python

Here is a sample of the code I am using, which works perfectly well:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Data
df = pd.DataFrame({'x': np.arange(10), 'y1': np.random.randn(10),
                   'y2': np.random.randn(10) + range(1, 11),
                   'y3': np.random.randn(10) + range(11, 21)})
print(df)
# multiple line plot
plt.plot( 'x', 'y1', data=df, marker='o', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4)
plt.plot( 'x', 'y2', data=df, marker='', color='olive', linewidth=2)
plt.plot( 'x', 'y3', data=df, marker='', color='olive', linewidth=2, linestyle='dashed', label="y3")
plt.legend()
plt.show()
The values in column 'x' actually refer to a 10-hour period of the day, with 0 standing for 6 AM, 1 for 7 AM, and so on. Is there any way I could replace those x-axis values in my figure with the times of day, for example showing 6 AM instead of 0?

It's always a good idea to store time or datetime information using the pandas datetime data types.
In your example, if you only want to keep the time information:
df['time'] = (df.x + 6) * pd.Timedelta(1, unit='h')
Output
x y1 y2 y3 time
0 0 -0.523190 1.681115 11.194223 06:00:00
1 1 -1.050002 1.727412 13.360231 07:00:00
2 2 0.284060 4.909793 11.377206 08:00:00
3 3 0.960851 2.702884 14.054678 09:00:00
4 4 -0.392999 5.507870 15.594092 10:00:00
5 5 -0.999188 5.581492 15.942648 11:00:00
6 6 -0.555095 6.139786 17.808850 12:00:00
7 7 -0.074643 7.963490 18.486967 13:00:00
8 8 0.445099 7.301115 19.005115 14:00:00
9 9 -0.214138 9.194626 20.432349 15:00:00
If you have a starting date:
start_date='2018-07-29' # change this date appropriately
df['datetime'] = pd.to_datetime(start_date) + (df.x + 6) * pd.Timedelta(1, unit='h')
Output
x y1 y2 y3 time datetime
0 0 -0.523190 1.681115 11.194223 06:00:00 2018-07-29 06:00:00
1 1 -1.050002 1.727412 13.360231 07:00:00 2018-07-29 07:00:00
2 2 0.284060 4.909793 11.377206 08:00:00 2018-07-29 08:00:00
3 3 0.960851 2.702884 14.054678 09:00:00 2018-07-29 09:00:00
4 4 -0.392999 5.507870 15.594092 10:00:00 2018-07-29 10:00:00
5 5 -0.999188 5.581492 15.942648 11:00:00 2018-07-29 11:00:00
6 6 -0.555095 6.139786 17.808850 12:00:00 2018-07-29 12:00:00
7 7 -0.074643 7.963490 18.486967 13:00:00 2018-07-29 13:00:00
8 8 0.445099 7.301115 19.005115 14:00:00 2018-07-29 14:00:00
9 9 -0.214138 9.194626 20.432349 15:00:00 2018-07-29 15:00:00
Now the time and datetime columns have special datatypes:
print(df.dtypes)
Out[5]:
x int32
y1 float64
y2 float64
y3 float64
time timedelta64[ns]
datetime datetime64[ns]
dtype: object
These have a lot of nice properties, including automatic string formatting, which you will find very useful in later parts of your project.
Finally, to plot using matplotlib:
# multiple line plot
plt.plot( df.datetime.dt.hour, df['y1'], marker='o', markerfacecolor='blue', markersize=12, color='skyblue', linewidth=4)
plt.plot( df.datetime.dt.hour, df['y2'], marker='', color='olive', linewidth=2)
plt.plot( df.datetime.dt.hour, df['y3'], marker='', color='olive', linewidth=2, linestyle='dashed', label="y3")
plt.legend()
plt.show()
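Note that df.datetime.dt.hour plots the integers 6 to 15 on the x-axis. If you want labels such as "6 AM", one option is to keep plotting against x and replace the tick labels. The following is a minimal sketch under that assumption; the helper hour_label is hypothetical (not part of pandas or matplotlib), and only y1 is plotted to keep it short.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in an interactive session
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

df = pd.DataFrame({"x": np.arange(10), "y1": np.random.randn(10)})
start_date = "2018-07-29"  # change this date appropriately
df["datetime"] = pd.to_datetime(start_date) + (df.x + 6) * pd.Timedelta(1, unit="h")

def hour_label(ts):
    """Hypothetical helper: format a timestamp as '6 AM', '12 PM', '1 PM', ..."""
    hour = ts.hour
    return f"{hour % 12 or 12} {'AM' if hour < 12 else 'PM'}"

labels = [hour_label(ts) for ts in df["datetime"]]

fig, ax = plt.subplots()
ax.plot(df["x"], df["y1"], marker="o", color="skyblue", linewidth=4, label="y1")
ax.set_xticks(df["x"])       # one tick per data point
ax.set_xticklabels(labels)   # show the time of day instead of 0..9
ax.legend()
# In an interactive session, call plt.show() here.
```

Setting the ticks explicitly before the labels keeps each label attached to the data point it describes.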

Related

Python : do a linear regression on time series

I would like to do a linear regression on a wave time series.
I've got a dataframe with a date column (DateHeure) and a column with my wave height (it contains some NaN values). My problem is that I can't manage to plot the fit with the dates on the x-axis, or the line doesn't match the data. I know that the problem is in x, but I don't know how to fix it.
My current script:
CANDHIS =
DateHeure H13D
0 2017-01-01 00:00:00 1.7
1 2017-01-01 01:00:00 1.72
2 2017-01-01 02:00:00 2.04
3 2017-01-01 03:00:00 2.44
4 2017-01-01 04:00:00 nan
5 2017-01-01 05:00:00 2.51
6 2017-01-01 06:00:00 2.25
7 2017-01-01 07:00:00 2.28
8 2017-01-01 08:00:00 1.97
9 2017-01-01 09:00:00 1.95
10 2017-01-01 10:00:00 1.84
CANDHIS.set_index('DateHeure', inplace=True)
y=np.array(CANDHIS['H13D'].dropna().values, dtype=float)
x=np.array(pd.to_datetime(CANDHIS["H13D"].dropna().index.values), dtype=float)
slope, intercept, r_value, p_value, std_err =sp.linregress(x,y)
xf = np.linspace(min(x),max(x),100)
xf1 = xf.copy()
xf1 = pd.to_datetime(xf1)
yf = (slope*xf)+intercept
print('r = ', r_value, '\n', 'p = ', p_value, '\n', 's = ', std_err)
f, ax = plt.subplots(1, 1)
ax.plot(xf1, yf,label='Linear fit', lw=3)
CANDHIS['H13D'].dropna().plot(marker='o', ls='')
plt.ylabel('Hauteurs significatives')
ax.legend()
I know this is probably a stupid question, but I'm still searching and I'm still lost...
Thanks
I would like to do a linear regression on a wave time series and keep the dates on the x-axis.
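One possible approach, offered only as a sketch and not taken from this thread: fit against the hours elapsed since the first sample, then draw the fitted values against the original timestamps so that both the data and the line share a date axis. It assumes the first six rows shown above as stand-in data and uses np.polyfit in place of linregress to avoid the extra dependency.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in an interactive session
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Stand-in for CANDHIS, built from the first six rows shown in the question
CANDHIS = pd.DataFrame({
    "DateHeure": pd.to_datetime([
        "2017-01-01 00:00:00", "2017-01-01 01:00:00", "2017-01-01 02:00:00",
        "2017-01-01 03:00:00", "2017-01-01 04:00:00", "2017-01-01 05:00:00",
    ]),
    "H13D": [1.7, 1.72, 2.04, 2.44, np.nan, 2.51],
}).set_index("DateHeure")

series = CANDHIS["H13D"].dropna()

# Numeric x for the fit: hours elapsed since the first valid sample
x = ((series.index - series.index[0]) / pd.Timedelta(hours=1)).to_numpy(dtype=float)
y = series.to_numpy(dtype=float)
slope, intercept = np.polyfit(x, y, 1)  # slope is in metres per hour

fig, ax = plt.subplots()
ax.plot(series.index, y, marker="o", ls="", label="H13D")
# Plot the fitted values against the timestamps, not against x
ax.plot(series.index, slope * x + intercept, lw=3, label="Linear fit")
ax.set_ylabel("Hauteurs significatives")
ax.legend()
```

Working in elapsed hours also keeps the slope readable, whereas a fit on raw nanosecond timestamps gives a very small slope and a very large intercept.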

How to change the visual aspects of the xaxis, which consists of dates?

I'm reading some data with pandas and trying to plot them.
Now, I'd like to change the visual aspects of the x-axis tick labels: the yticks and xlabel are changing, but not the xticks.
I'd like to make my xticks red, bold and larger, just like the yticks.
But for some reason, the code doesn't change them. What is going wrong?
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
plt.rcParams["font.weight"] = "bold"
plt.rcParams["axes.labelweight"] = "bold"
plt.rcParams.update({'font.size': 10})
model_obs = pd.read_csv("select_obs_data2.csv", sep=',')
print(model_obs.head(4))
model_obs['correlation'] = model_obs['real_obs'].corr(model_obs['Value'])
model_obs.iloc[1:, 6] = np.nan
model_obs['bias^2'] = (model_obs['Value'] - model_obs['real_obs']) ** 2
model_obs['RMS'] = model_obs['bias^2'].mean()
model_obs.iloc[1:, 8] = np.nan
model_obs.drop(['bias^2'], axis=1, inplace=True)
model_obs.drop(['time'], axis=1, inplace=True)
fig2, ax2 = plt.subplots()
locs = "upperleft"
model_obs["date"] = pd.to_datetime(model_obs["date"])
print(type(model_obs['date']))
print(model_obs["date"].dt.month)
model_obs.plot(x='date', y='Value', figsize=(16, 7), ax=ax2, style='--', label='Model')
model_obs.plot(x='date', y='real_obs', figsize=(16, 7), ax=ax2, label='Observation')
plt.legend(loc='upper right', prop={'size': 12})
ax2.xaxis.set_tick_params(labelsize=15)
ax2.yaxis.set_tick_params(labelsize=15)
ax2.set_xlabel('Date Time', fontsize=10, fontweight='bold')
ax2.set_ylabel('Sea Level (m)', fontsize=15, fontweight='bold')
ax2.tick_params(labelcolor='r', labelsize='large', width=8)
plt.show()
# x axis data
# model_obs['date']
0 2019-09-01 01:00:00
1 2019-09-01 02:00:00
2 2019-09-01 03:00:00
3 2019-09-01 04:00:00
4 2019-09-01 05:00:00
5 2019-09-01 06:00:00
6 2019-09-01 07:00:00
7 2019-09-01 08:00:00
8 2019-09-01 09:00:00
9 2019-09-01 10:00:00
10 2019-09-01 11:00:00
11 2019-09-01 12:00:00
12 2019-09-01 13:00:00
13 2019-09-01 14:00:00
14 2019-09-01 15:00:00
15 2019-09-01 16:00:00
16 2019-09-01 17:00:00
17 2019-09-01 18:00:00
966 2019-10-11 07:00:00
967 2019-10-11 08:00:00
968 2019-10-11 09:00:00
969 2019-10-11 10:00:00
The tick labels are not changing because they are minor ticks. By default, tick_params() operates only on the major ticks. There are two optional parameters that control this:
axis : {'x', 'y', 'both'}, default is 'both', so it operates on both axes
which : {'major', 'minor', 'both'}, default is 'major', so it only affects the major ticks
tick_params() can be called multiple times to set the visual aspects of the x, y, major or minor ticks individually.
For example, to set the major ticks red and the minor ticks blue:
ax2.tick_params(labelcolor='r', labelsize='large', width=8, which='major')
ax2.tick_params(labelcolor='b', labelsize='large', width=8, which='minor')
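If the major and minor labels should look the same, a single call with which='both' is enough. The sketch below assumes synthetic hourly data and explicit matplotlib date locators (days as major ticks, selected hours as minor ticks) rather than the pandas plotting used in the question; the variable names are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in an interactive session
import numpy as np
import pandas as pd
import matplotlib.dates as mdates
from matplotlib import pyplot as plt

# Synthetic hourly series standing in for the sea level data
dates = pd.Timestamp("2019-09-01 01:00:00") + pd.to_timedelta(np.arange(72), unit="h")
values = np.sin(np.arange(72) / 6.0)

fig, ax = plt.subplots(figsize=(16, 7))
ax.plot(dates, values)

# Days as major ticks, selected hours as minor ticks, both with labels
ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
ax.xaxis.set_minor_locator(mdates.HourLocator(byhour=[6, 12, 18]))
ax.xaxis.set_minor_formatter(mdates.DateFormatter("%H:%M"))

# One call styles the major and the minor tick labels of the x-axis
ax.tick_params(axis="x", which="both", labelcolor="red", labelsize=12)
fig.canvas.draw()
```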

extract only hour:minute from datetimeindex to x-axis when creating sns.pointplot

I have a dataframe that looks like:
deploy deployed_today_rent total_rent cum_deploy hourly percent cum_percent
10min
2019-10-01 05:30:00 6 0 0 6 0.000000 0.000000
2019-10-01 05:40:00 0 0 0 6 0.000000 0.000000
2019-10-01 05:50:00 6 0 0 12 0.000000 0.000000
2019-10-01 06:00:00 13 0 0 25 0.000000 0.000000
2019-10-01 06:10:00 0 0 0 25 0.000000 0.000000
2019-10-01 06:20:00 0 1 1 25 0.040000 0.040000
2019-10-01 06:30:00 0 0 0 25 0.000000 0.040000
2019-10-01 06:40:00 0 1 1 25 0.040000 0.080000
2019-10-01 06:50:00 1 1 1 26 0.038462 0.118462
From this I am trying to create a pointplot where the x-axis is the datetime and the y-axis is deployed_today_rent.
My code for creating visualization:
fig,(ax1, ax2)= plt.subplots(nrows=2)
fig.set_size_inches(22,17)
sns.pointplot(data=test, x=test.index, y="total_rent", ax=ax1,color="blue")
sns.pointplot(data=test, x=test.index, y="deployed_today_rent", ax=ax1, color="green")
ax1.set_xticklabels(test.index, rotation=90,
                    fontdict={
                        "fontsize": 16,
                        "fontweight": 30
                    })
I have two axes in a figure. Right now the x-axis ticks are full datetimes rotated by 90 degrees, so the whole tick label does not fit. I want to extract only 05:30:00 from 2019-10-01 05:30:00 and use it for the x-ticks. How can I do this?
Also, the fontweight setting in the ax1.set_xticklabels call above is not working.
Instead of test.index in your plot lines, use test.index.strftime('%H:%M:%S'). This gives you just the hours, minutes and seconds from the index.
Your code should be:
sns.pointplot(data=test, x=test.index.strftime('%H:%M:%S'), y="total_rent", ax=ax1,color="blue")
sns.pointplot(data=test, x=test.index.strftime('%H:%M:%S'), y="deployed_today_rent", ax=ax1, color="green")
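To see what strftime returns, and to get only hour:minute as asked in the title, drop the seconds from the format string. This is a small sketch that assumes a three-row stand-in for the test dataframe.

```python
import pandas as pd

# Three-row stand-in for the `test` dataframe from the question
idx = pd.to_datetime(["2019-10-01 05:30:00", "2019-10-01 05:40:00", "2019-10-01 05:50:00"])
test = pd.DataFrame({"total_rent": [0, 0, 0]}, index=idx)

# strftime on a DatetimeIndex returns an Index of plain strings
labels = test.index.strftime("%H:%M")
print(list(labels))  # ['05:30', '05:40', '05:50']
```

Because the result is an ordinary Index of strings, it can be passed as x to the plotting call or to set_xticklabels.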

Overlapping two plots with different dates

I have a pandas dataframe with subscriber information and I'm attempting to plot two graphs on the same figure to see the change over the same period of time at two different instances. The dataframe resembles:
df =
Date 1_Month_Sub 3_Month_Sub
0 2010-01-01 00:00:00 2 4
1 2010-01-02 00:00:00 1 1
2 2010-01-03 00:00:00 3 6
3 2010-01-04 00:00:00 0 3
4 2010-01-05 00:00:00 2 1
...
1381 2014-01-01 00:00:00 4 3
1382 2014-01-02 00:00:00 2 2
1383 2014-01-03 00:00:00 3 0
1384 2014-01-04 00:00:00 2 4
...
If I do the following:
df.plot(x='Date', y=['Year Sold'], stacked=False, grid=True,
xlim=['2011-10-01', '2011-10-31'], ylim=[0, 7], figsize=(40, 16),
marker='o', color='red')
df.plot(x='Date', y=['Year Sold'], stacked=False, grid=True,
xlim=['2012-10-01', '2012-10-31'], ylim=[0, 7], figsize=(40, 16),
marker='o', color='green')
plt.show()
I get two separate figures. I would like to put both figures on the same graph.
I realize that the x-axis is different in both cases, one being January 2011 and the other being January 2012, and I'm not too sure how to change that to accommodate both graphs. The idea I had was to use the index, but that's not very helpful when I select two periods that aren't the same month in different years (for instance plotting a 15-day period in January 2011 and a 15-day period in May 2011). Is there a way to maybe have both axes present?
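One way to put both periods on the same axes, sketched here under assumptions and not taken from this thread, is to plot each period against the number of days since its own start date. The data below is synthetic, the column 1_Month_Sub is used as the y value, and the helper day_offsets is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; omit in an interactive session
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in for the subscriber data (values are random, not real)
rng = np.random.default_rng(0)
dates = pd.Timestamp("2010-01-01") + pd.to_timedelta(np.arange(1500), unit="D")
df = pd.DataFrame({"Date": dates, "1_Month_Sub": rng.integers(0, 7, size=1500)})

def day_offsets(frame, start, end):
    """Hypothetical helper: (days since start, values) for start <= Date <= end."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    part = frame[(frame["Date"] >= start) & (frame["Date"] <= end)]
    return (part["Date"] - start).dt.days, part["1_Month_Sub"]

fig, ax = plt.subplots(figsize=(10, 4))
periods = [("2011-10-01", "2011-10-31", "red"), ("2012-10-01", "2012-10-31", "green")]
for start, end, color in periods:
    offsets, values = day_offsets(df, start, end)
    ax.plot(offsets, values, marker="o", color=color, label=f"{start} to {end}")
ax.set_xlabel("Days since start of period")
ax.set_ylim(0, 7)
ax.grid(True)
ax.legend()
```

Since the x-axis is a day offset, the two periods do not need to fall in the same month; the legend carries the actual dates.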

summation of pandas timestamp and array containing timedelta values

I have a start date and an array of irregular sample offsets in days that I would like to use as the date index for a pandas Series.
Like:
In [233]: date = pd.Timestamp('2015-10-17 08:00:00')
Out[233]: Timestamp('2015-10-17 08:00:00')
In [234]: sample_size = np.array([0,10,13,19,30])
Out[234]: array([ 0, 10, 13, 19, 30])
Now I could make use of a list and the following for loop to create the pandas datetime series:
In [235]: all_dates = []
for stepsize in sample_size:
    days = pd.Timedelta(stepsize, 'D')
    all_dates.append(date + days)
pd.Series(all_dates)
Out[235]: 2015-10-17 08:00:00
2015-10-27 08:00:00
2015-10-30 08:00:00
2015-11-05 08:00:00
2015-11-16 08:00:00
dtype: datetime64[ns]
But I was hoping for a pure numpy or pandas solution that does not need a list and a for loop.
In [11]:
pd.Series(pd.TimedeltaIndex(sample_size , unit = 'D') + date)
Out[11]:
0 2015-10-17 08:00:00
1 2015-10-27 08:00:00
2 2015-10-30 08:00:00
3 2015-11-05 08:00:00
4 2015-11-16 08:00:00
dtype: datetime64[ns]
First, create a TimedeltaIndex from the values you want to add to the date. The unit 'D' tells pandas to interpret the values as days, because we want to add days to the date.
In [42]:
time_delta = pd.TimedeltaIndex(sample_size, unit = 'D')
time_delta
Out[42]:
TimedeltaIndex(['0 days', '10 days', '13 days', '19 days', '30 days'], dtype='timedelta64[ns]', freq=None)
To spell the addition out step by step, you can also build a Series that repeats the date once per element of the TimedeltaIndex, using repeat(len(sample_size)), and then add the two element-wise. This is not strictly required, since the one-liner above adds the scalar Timestamp directly to the TimedeltaIndex.
In [40]:
time_stamp = pd.Series(np.array(date).repeat(len(sample_size)))
time_stamp
Out[40]:
0 2015-10-17 08:00:00
1 2015-10-17 08:00:00
2 2015-10-17 08:00:00
3 2015-10-17 08:00:00
4 2015-10-17 08:00:00
dtype: datetime64[ns]
In [41]:
time_stamp + time_delta
Out[41]:
0 2015-10-17 08:00:00
1 2015-10-27 08:00:00
2 2015-10-30 08:00:00
3 2015-11-05 08:00:00
4 2015-11-16 08:00:00
dtype: datetime64[ns]
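The repeat step can be skipped because a scalar Timestamp broadcasts over a whole TimedeltaIndex. The sketch below uses pd.to_timedelta in place of the TimedeltaIndex constructor; as far as I know, recent pandas releases deprecate the unit keyword on that constructor, so treat this as an alternative to check against your pandas version.

```python
import numpy as np
import pandas as pd

date = pd.Timestamp("2015-10-17 08:00:00")
sample_size = np.array([0, 10, 13, 19, 30])

# The Timestamp is added to every element of the TimedeltaIndex
all_dates = pd.Series(date + pd.to_timedelta(sample_size, unit="D"))
print(all_dates)
```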
