How to fit a pandas timeseries to a 24h graph? - python

I have a pandas time series spanning multiple months and want to count occurrences of a feature for different times of day.
I.e. I want to create a graph (using seaborn or matplotlib) with the time of day on the x axis (0 to 24 hours) and the relative number of occurrences of a column on the y axis (like this).
I can't figure out how to format the time series correctly to make this work.
Edit:
This is a sample of the data I'm dealing with. "Open Data Channel Type" can take one of five values (Online, Phone, Mobile, Unknown, Other). My goal is to plot all five in one graph, showing which type occurs at which time of day.

You need to prepare the plot data first:
hour = df['Created Date'].dt.hour.rename('Hour')
df_plot = df.groupby(hour) \
    .apply(lambda x: x['Open Data Channel Type'].value_counts() / x.shape[0]) \
    .rename_axis(index=['Hour', 'Channel Type']) \
    .to_frame('Frequency') \
    .reset_index()
A sample of df_plot:
Hour Channel Type Frequency
0 0 OTHER 0.223744
1 0 PHONE 0.210046
2 0 MOBILE 0.205479
3 0 UNKNOWN 0.198630
4 0 ONLINE 0.162100
5 1 UNKNOWN 0.206311
6 1 OTHER 0.203883
7 1 PHONE 0.201456
8 1 MOBILE 0.196602
9 1 ONLINE 0.191748
Then you can make the plot (here using Seaborn):
import seaborn as sns

ax = sns.lineplot(data=df_plot, x='Hour', y='Frequency', hue='Channel Type')
ax.figure.set_size_inches(10, 4)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Result (using random data):
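An equivalent and arguably simpler way to build the same per-hour frequency table is pd.crosstab with normalize='index'. A sketch on synthetic data (the real dataset isn't available here; the column names follow the question):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the data described in the question
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    'Created Date': pd.Timestamp('2021-01-01')
                    + pd.to_timedelta(rng.integers(0, 90 * 24 * 3600, n), unit='s'),
    'Open Data Channel Type': rng.choice(
        ['ONLINE', 'PHONE', 'MOBILE', 'UNKNOWN', 'OTHER'], n),
})

# Rows = hour of day, columns = channel type, values = within-hour share
freq = pd.crosstab(df['Created Date'].dt.hour,
                   df['Open Data Channel Type'],
                   normalize='index')
freq.index.name = 'Hour'

# Long form, same shape as df_plot above
df_plot = freq.stack().rename('Frequency').reset_index()
```

Because normalize='index' divides each row by its total, the five shares within any hour sum to 1, which is exactly the "relative number of occurrences" the question asks for.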


Pandas + Seaborn: compute number of 0s with respect to categorical data

I'm currently struggling with my dataframe in Pandas (new to this).
I have a dataframe with 3 columns: Categorical_data1, Categorical_data2, Output (2400 rows x 3 columns).
Both categorical columns (the inputs) are strings, and the output depends on the inputs.
Categorical_data1 = ['type1', 'type2', ..., 'type6']
Categorical_data2 = ['rain1', 'rain2', 'rain3', 'rain4']
So there are 24 possible pairs of categorical data.
I want to plot a heatmap (using seaborn for instance) of the number of 0s in the output for each pair of categorical data (Cat_data1, Cat_data2). I tried several things using booleans.
I tried to figure out how to compute the exact number of 0s:
count = ((df['Output'] == 0) & (df['Categorical_Data1'] == 'type1') & (df['Categorical_Data2'] == 'rain1')).sum()
but it failed.
The output is in [0, 1] with a large number of 0s (around 1200 out of 2400). My goal is to have something like this Source by jcdoming (I can't upload images...) with months = Categorical_data1, years = Categorical_data2, and the number of 0s in the output as the values.
Thank you for your help.
Use a seaborn countplot. It counts occurrences of categorical values in a given feature. Use hue to add the second feature to the visualization:
import seaborn as sns
sns.countplot(data=dataframe, x='Categorical_Data1', hue='Categorical_Data2')
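The question's stated goal was a heatmap of zero counts per category pair; that can be built with pd.crosstab on the rows where Output == 0 and passed to sns.heatmap. A sketch on synthetic data (the real dataframe isn't shown, so the 2400 rows here are made up; column names follow the question):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import seaborn as sns

# Synthetic stand-in for the 2400-row dataframe described in the question
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'Categorical_data1': rng.choice([f'type{i}' for i in range(1, 7)], 2400),
    'Categorical_data2': rng.choice([f'rain{i}' for i in range(1, 5)], 2400),
    'Output': rng.integers(0, 2, 2400),
})

# 6x4 table: number of rows with Output == 0 for each category pair
zeros = df[df['Output'] == 0]
counts = pd.crosstab(zeros['Categorical_data1'], zeros['Categorical_data2'])

ax = sns.heatmap(counts, annot=True, fmt='d', cmap='viridis')
```

Filtering to Output == 0 first means each heatmap cell is exactly the zero count for that (type, rain) pair, matching the "months x years" layout the asker described.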

Plotting multiple lines on same x-axis with a normalized x-axis

I have a data frame where one column is my data values ("dFF") and another column is the timestamps at which the data was recorded ("Time"). I also have a list of timestamps at which events occurred. I want to plot the data from 3 seconds before to 5 seconds after each event, with multiple events on the same plot.
df.head()
Time dFF
0 0.500267 0.617687
1 0.516673 0.737019
2 0.533079 0.801859
3 0.549485 0.762987
4 0.565891 0.572441
import matplotlib.pyplot as plt

fig, ax0 = plt.subplots()
events = [24.541, 35.193, 45.461, 71.554, 95.954, 108.658, 134.592, 147.914, 163.671]

# Plot trials
for event in events:
    begin = event - 3.0
    end = event + 5.0
    in_between = df['Time'].between(begin, end)  # inclusive on both ends by default
    ax0.plot(df['Time'].loc[in_between], df['dFF'].loc[in_between])
plt.show()
The plots should essentially be on top of each other. But since I am using the timestamps as the x-coordinates, they instead plot across the whole time axis. Is there a good way to standardize the x-axis so that the traces land on top of each other? There are the same number of data points for each event in the 8-second window.
The desired graph should look something like this:
Desired plot output with multiple lines on same x-axis
The two graphs that I currently make (whole trace and subsets) look like
this
and this.
To answer my own question for anyone who comes across this in the future: I made a separate array for the x-values, created with np.linspace from -3 to 5 (3 seconds before to 5 seconds after), using the length of each event's data slice as the number of values.
import numpy as np
import matplotlib.pyplot as plt

# Plot trials
fig, ax0 = plt.subplots()
for event in events:
    begin = event - 3.0
    end = event + 5.0
    in_between = df['Time'].between(begin, end)
    tmp = df.loc[in_between]
    event_array = tmp['dFF'].to_numpy()
    x_mask = np.linspace(-3, 5, len(event_array))
    ax0.plot(x_mask, event_array)  # was df[dff], an undefined name
plt.show()
Which resulted in a plot that looks like this:
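A simpler alternative (not from the original answer, just a sketch): subtract the event time from the timestamps instead of building a linspace. This keeps the true sample times even when the slices aren't perfectly evenly spaced, and needs no length bookkeeping:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch
import matplotlib.pyplot as plt

# Synthetic stand-in for the recording in the question (~61 Hz samples)
t = np.arange(0, 180, 0.0164)
df = pd.DataFrame({'Time': t, 'dFF': np.sin(t)})
events = [24.541, 35.193, 45.461, 71.554]

fig, ax0 = plt.subplots()
for event in events:
    in_between = df['Time'].between(event - 3.0, event + 5.0)
    tmp = df.loc[in_between]
    # Time relative to the event: every trace spans -3 s .. +5 s
    ax0.plot(tmp['Time'] - event, tmp['dFF'])
ax0.set_xlabel('Time from event (s)')
```

Each trace now shares the same event-relative x-axis, so they stack on top of each other as desired.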

How do I get a Bokeh ColorBar to show the min and max value?

I am making a bubble chart using Bokeh and want the ColorBar to show the min and max value. Given data that looks like this
In[23]: group_counts.head()
cyl yr counts size
0 3 72 1 0.854701
1 3 73 1 0.854701
2 3 77 1 0.854701
3 3 80 1 0.854701
4 4 70 7 5.982906
I am generating a plot using
import numpy as np
from bokeh import plotting, transform
from bokeh.models import ColorBar, ColumnDataSource
from bokeh.palettes import Inferno256

x_col = 'cyl'
y_col = 'yr'

color_transformer = transform.linear_cmap('counts', Inferno256,
                                          group_counts.counts.min(),
                                          group_counts.counts.max())
color_bar = ColorBar(color_mapper=color_transformer['transform'],
                     location=(0, 0))
source = ColumnDataSource(data=group_counts)
p = plotting.figure(x_range=np.sort(group_counts[x_col].unique()),
                    y_range=np.sort(group_counts[y_col].unique()),
                    plot_width=400, plot_height=300,
                    x_axis_label=x_col, y_axis_label=y_col)
p.add_layout(color_bar, 'right')
p.scatter(x=x_col, y=y_col, size='size', color=color_transformer,
          source=source)
plotting.show(p)
Notice that the min and max values on the colorbar are not labelled. How do I force the colorbar to label these values?
You can do this using the FixedTicker class from bokeh.models. Per its documentation, it is meant to
"Generate ticks at fixed, explicitly supplied locations."
To pin the min and max data values, compute the desired tick values and pass a FixedTicker to ColorBar via the ticker keyword argument:
import numpy as np
from bokeh.models import FixedTicker

mn = group_counts.counts.min()
mx = group_counts.counts.max()
n_ticks = 5  # how many ticks do you want?
ticks = np.linspace(mn, mx, n_ticks).round(1)  # round to desired precision
color_ticks = FixedTicker(ticks=ticks)

color_bar = ColorBar(color_mapper=color_transformer['transform'],
                     location=(0, 0),
                     ticker=color_ticks)  # <<< pass the ticker object
If you want something a bit more exotic, there are 14 different tickers currently described in the bokeh.models.tickers documentation (do a word search for Ticker(** to quickly jump between the different options).
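As a quick sanity check of the tick computation itself (pure NumPy, independent of Bokeh; the min/max values here are made up for illustration):

```python
import numpy as np

mn, mx = 1.0, 7.0  # hypothetical counts.min() / counts.max()
n_ticks = 5
ticks = np.linspace(mn, mx, n_ticks).round(1)
# np.linspace includes both endpoints by construction, so the
# colorbar's min and max always receive a labelled tick.
print(ticks)  # [1.  2.5 4.  5.5 7. ]
```

This is why FixedTicker solves the problem: the default ticker is free to skip the extremes, whereas linspace forces them in.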

Matplotlib - How to illustrate percentage growth by background colour? [of a timeseries chart]

I want to plot a timeseries chart:
Sample data:
import pandas as pd
import matplotlib.pyplot as plt

datetime = pd.date_range(start='01/03/2019', periods=60)  # 60 days = 2 months
data = [2000]
for i in range(29):
    data.append(data[-1] * 1.05)  # first 30 days - 5% growth
for i in range(30):
    data.append(data[-1] * 1.15)  # last 30 days - 15% growth

plt.plot(datetime, data)
plt.show()
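As an aside (not part of the original question), the same sample series can be built without loops by taking a cumulative product of daily growth factors. A sketch:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='01/03/2019', periods=60)
# Daily growth factors: no growth on day 0, then 5% for 29 days, then 15% for 30
factors = np.concatenate([[1.0], np.full(29, 1.05), np.full(30, 1.15)])
data = 2000 * np.cumprod(factors)  # 60 values, matching the loop version
```

np.cumprod turns the per-day multipliers into the running level, which is exactly what the two append loops compute.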
We get the figure below:
I want the background colour to split the chart into 2 regions (2 rectangles):
the first covering the region with 5% growth,
the second covering the region with 15% growth.
I tried to do this, but the difficulty is that matplotlib converts the dates internally into numbers, so I could not place the cut by entering the day I want:
plt.xlim()
(737059.05, 737123.95)  # matplotlib's internal date numbers
plt.ylim()
(-25153.66313846143, 572226.92590769)
Please show me an easy way, so that in future I can cut any region, like 5 days, 10 days, or 15 days, depending on the data.
You can work with DateTime objects directly, as long as you don't mix matplotlib's internal representation (the one you see when you call plt.xlim()) with DateTime values.
import pandas as pd
import matplotlib.pyplot as plt

d = pd.date_range(start='01/03/2019', periods=60)  # 60 days = 2 months
data = [2000]
for i in range(29):
    data.append(data[-1] * 1.05)  # first 30 days - 5% growth
for i in range(30):
    data.append(data[-1] * 1.15)  # last 30 days - 15% growth

cutoff = 30
cutoff_date = d[0] + cutoff * d.freq

plt.figure()
plt.plot(d, data)
plt.axvline(cutoff_date, ls='--', color='k')
plt.axvspan(xmin=d[0], xmax=cutoff_date, facecolor='r', alpha=.5)
plt.axvspan(xmin=cutoff_date, xmax=d[-1], facecolor='g', alpha=.5)
plt.gcf().autofmt_xdate()
plt.show()

Normalizing huge numeric data to create a valuable line plot

I have the following dataframe:
Year Month Value
2005 9 1127.080000
2016 3 9399.000000
5 3325.000000
6 120.000000
7 40.450000
9 3903.470000
10 2718.670000
12 12108501.620000
2017 1 981879341.949982
2 500474730.739911
3 347482199.470025
4 1381423726.830030
5 726155254.759981
6 750914893.859959
7 299991712.719955
8 133495941.729959
9 27040614303.435833
10 26072052.099796
11 956680303.349909
12 755353561.609832
2018 1 1201358930.319930
2 727311331.659607
3 183254376.299662
4 9096130.550197
5 972474788.569924
6 779912460.479959
7 1062566320.859962
8 293262028544467.687500
9 234792487863.501495
As you can see, I have some huge values grouped by month and year. My problem is that when I create a line plot, it doesn't make any sense to me:
df.plot(kind='line', figsize=(20, 10))
The visual representation doesn't reflect how the values fluctuate over the months and years: a flat line is shown for most of the period with a big peak at the end.
I guess the problem is that the y-axis scale doesn't fit the data. I have tried applying a log transformation to the y axis, but that doesn't change anything; I have also tried normalizing the data between 0 and 1 just as a test, but the plot stays the same. Any ideas on how to get a more accurate representation of my data over the time period? And also, how can I display the name of the month and year on the x axis?
EDIT:
This is how I applied the log transform:
df.plot(kind='line', figsize=(20, 10), logy=True)
and this is the result:
To me this plot is still not really readable. Given that the plotted values represent income over time, applying a logarithmic transformation to money values doesn't make much sense to me anyway.
Here is how I normalized the data:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_scaled.set_index(df.index, inplace=True)
And then I plotted it:
df_scaled.plot(kind='line', figsize=(20, 10), logy=True)
As you can see, nothing seems to change, and I'm a bit lost about how to correctly visualize these data over the given time periods.
The problem is that one value is much, much bigger than the others, causing that spike; min-max scaling preserves relative spacing, so it cannot help here. Instead, use a semi-log plot:
df.plot(y='Value', logy=True)
outputs
To make it use the date as the x-axis do
df['Day'] = 1 # we need a day
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
df.plot(x='Date', y='Value', logy=True)
which outputs
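A note on the Year/Month columns (this is an assumption on my part, since the question's table prints blank Year cells for the continuation rows): if Year really is blank on those rows rather than being a MultiIndex level, forward-fill it before building the dates. A sketch on a small stand-in:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for this sketch

# Small stand-in mimicking the question's layout: Year only on its first row
df = pd.DataFrame({
    'Year':  [2016, np.nan, np.nan, 2017, np.nan],
    'Month': [3, 5, 6, 1, 2],
    'Value': [9399.0, 3325.0, 120.0, 981879341.9, 500474730.7],
})

df['Year'] = df['Year'].ffill().astype(int)
df['Day'] = 1  # pd.to_datetime needs a day component
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])

ax = df.plot(x='Date', y='Value', logy=True)
```

With the Year filled in, the Date column is monotonically increasing and the x-axis labels show real month/year dates.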
