Plotly hovermode=x displays too many values - python

The hovermode='x' parameter on Plotly (specifically plotly express, which is what I'm using) isn't strict and can get very confusing. For example, I have a dataframe with columns for date (which will be the X), number (which will be the Y), and category (which will be the hue/color, producing multiple lines).
If a category has a gap in dates (e.g. category1 has a value for 3/24 but category2 only has values for 3/23 and 3/25), then when I hover over the x-value of 3/24, it shows the number value for category1, but for category2, instead of showing nothing at all (or a 0/NaN), it shows the date and number value of the closest point. So in this case, hovering on x=3/24 produces 3 boxes: one (correct) box with the number for category1, and two (incorrect) boxes with the date and number of nearby category2 points that don't actually have values for 3/24.
In practice, I'm working with a very large, grouped dataset in pyspark, so ideally the categories that don't have data at that date would show a 0. However, not showing a box at all would be acceptable.
I've considered grouping the data and including rows with a count of 0 so that for each date every category has a row, but I couldn't find a way to do it in pyspark and pandas isn't fast enough.
I'm thinking that this must be possible somehow, because the basic visualizations that Pyspark in Databricks offers work correctly: they show hover text only for categories that are actually present at the x-value, and "0" for categories that aren't. Unfortunately, the basic visualizations only include 1000 rows, so they aren't viable.
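One approach that might work directly in PySpark (a sketch, untested; sdf is a hypothetical name for your grouped Spark DataFrame with columns dates, categories, and numbers): build the full date-by-category grid with a cross join, left-join the real data onto it, and fill the gaps with 0.
# Sketch: densify a Spark DataFrame so every (date, category) pair has a row.
# `sdf` is an assumed name for your grouped data.
grid = (sdf.select('dates').distinct()
           .crossJoin(sdf.select('categories').distinct()))
dense = (grid.join(sdf, on=['dates', 'categories'], how='left')
             .fillna(0, subset=['numbers']))
# dense.toPandas() can then be handed to plotly express; every category now
# has an explicit 0 at dates where it had no data.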
Code to reproduce the issue:
import datetime

import numpy as np
import pandas as pd
import plotly.express as px  # needed for px.line

base_categories = ['category1', 'category2']
dates, categories, numbers = [], [], []
for i in range(50):
    dates.append(datetime.datetime(2021, 3, 1, 12, np.random.randint(1, 30)))
    numbers.append(np.random.randint(1, 1000))
    categories.append(base_categories[np.random.randint(0, 2)])

df = pd.DataFrame({'dates': dates, 'categories': categories, 'numbers': numbers})
df = df.sort_values('dates')

fig = px.line(df, x='dates', y='numbers', color='categories')
fig.update_traces(mode='lines', hovertemplate=None)
fig.update_layout(height=450, width=750, hovermode='x')
fig.show()
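For the example above, one way to insert the explicit 0 rows before plotting (a sketch; note that pivot_table aggregates any duplicate date/category pairs with sum, which may or may not be what you want on real data):
# Build the full date x category grid, filling missing combinations with 0,
# then stack back to long form for plotly express.
dense = (df.pivot_table(index='dates', columns='categories',
                        values='numbers', aggfunc='sum', fill_value=0)
           .stack()
           .rename('numbers')
           .reset_index())
fig = px.line(dense.sort_values('dates'), x='dates', y='numbers', color='categories')
fig.update_layout(hovermode='x')
fig.show()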

Related

Time series plot showing unique occurrences per day

I have a dataframe, where I would like to make a time series plot with three different lines that each show the daily occurrences (the number of rows per day) for each of the values in another column.
To give an example, for the following dataframe, I would like to see the development for how many a's, b's and c's there have been each day.
df = pd.DataFrame({'date': pd.to_datetime(['2019-10-10', '2019-10-14', '2019-10-09', '2019-10-10', '2019-10-08',
                                           '2019-10-14', '2019-10-10', '2019-10-08', '2019-10-08', '2019-10-13',
                                           '2019-10-08', '2019-10-12', '2019-10-11', '2019-10-09', '2019-10-08']),
                   'letter': ['a', 'b', 'c', 'a', 'b', 'b', 'b', 'b', 'c', 'b', 'b', 'a', 'b', 'a', 'c']})
When I try the command below (my best guess so far), however, it does not separate the counts by letter (I would like three lines, one representing each of the letters).
Any ideas on how to solve this?
df.groupby(['date']).count().plot()['letter']
I have also tried a solution in Matplotlib, though this one gives an error:
fig, ax = plt.subplots()
ax.plot(df['date'], df['letter'].count())
Based on your question, I believe you are looking for a line plot with dates on the X-axis and the counts of letters on the Y-axis. To achieve this, these are the steps you will need to take:
Group the dataframe by date and then by letter, and get the number of entries/rows in each group using size().
Flatten the grouped dataframe using reset_index(), rename the new column to Counts, and sort by the letter column (so that the legend lists the letters alphabetically). These steps are mostly about keeping the new dataframe and the graph clean and presentable; I would suggest running and printing each step separately so that you know what is happening at each stage.
Plot a separate line for each letter by filtering the dataframe to that specific letter.
Show the legend and rotate the dates so they come out with better visibility.
The code is shown below.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'date': pd.to_datetime(['2019-10-10', '2019-10-14', '2019-10-09', '2019-10-10', '2019-10-08',
                                           '2019-10-14', '2019-10-10', '2019-10-08', '2019-10-08', '2019-10-13',
                                           '2019-10-08', '2019-10-12', '2019-10-11', '2019-10-09', '2019-10-08']),
                   'letter': ['a', 'b', 'c', 'a', 'b', 'b', 'b', 'b', 'c', 'b', 'b', 'a', 'b', 'a', 'c']})

df_grouped = df.groupby(by=['date', 'letter']).size().reset_index()  ## New DF for grouped data
df_grouped.rename(columns={0: 'Counts'}, inplace=True)
df_grouped.sort_values(['letter'], inplace=True)

colors = ['r', 'g', 'b']  ## One color per letter, change as per your preference
for i, ltr in enumerate(df_grouped.letter.unique()):
    plt.plot(df_grouped[df_grouped.letter == ltr].date,
             df_grouped[df_grouped.letter == ltr].Counts,
             '-o', label=ltr, c=colors[i])

plt.gcf().autofmt_xdate()  ## Rotate X-axis so you can see dates clearly without overlap
plt.legend()  ## Show legend
plt.show()
Output graph
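For what it's worth, the same picture can be produced more compactly by letting pandas do the plotting (a sketch on the same df; unstack(fill_value=0) also gives every letter an explicit 0 on days where it doesn't appear):
counts = df.groupby(['date', 'letter']).size().unstack(fill_value=0)
ax = counts.plot(marker='o')  # one line per letter column, legend included
ax.figure.autofmt_xdate()     # rotate dates for readability
plt.show()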

pandas fill in 0 for non-existing categories in value_counts()

Problem: I'm grouping results in my DataFrame, looking at value_counts(normalize=True), and trying to plot the result in a barplot.
The problem is that the barplot should contain frequencies. In some groups, some values don't occur; in that case, the corresponding value_count is not 0, it simply doesn't exist. Because the barplot doesn't take this 0 value into account, the resulting bar is too big.
Example: Here is a minimal example that illustrates the problem. Say the DataFrame contains observations for experiments: when you perform such an experiment, a series of observations is collected, and the result of the experiment is the relative frequencies of the observations collected for it.
df = pd.DataFrame()
df["id"] = [1]*3 + [2]*3 + [3]*3
df["experiment"] = ["a"]*6 + ["b"] * 3
df["observation"] = ["positive"]*3 + ["positive"]*2 + ["negative"]*1 + ["positive"]*2 + ["negative"]*1
there are two experiment types, "a" and "b"
observations that belong to the same evaluation of an experiment are given the same id.
So here, experiment a has been done 2 times, experiment b just once.
I need to group by id and experiment, then average the result.
plot_frame = pd.DataFrame(df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True))
plot_frame = plot_frame.rename(columns={"observation":"percentage"})
If you print plot_frame, you can already see the problem. The evaluation with id 1 has seen only positive observations. The relative frequency of "negative" should be 0; instead, it doesn't exist. If I plot this, the corresponding bar is too high, and the blue bars should add up to one:
sns.barplot(data=plot_frame.reset_index(),
            x="observation",
            hue="experiment",
            y="percentage")
plt.show()
You can add rows filled with 0 by using unstack/stack method with argument fill_value=0. Try this:
df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True).unstack(fill_value=0).stack()
I have found a hacky solution, by iterating over the index and manually filling in the missing values:
for a, b, _ in plot_frame.index:
    if (a, b, "negative") not in plot_frame.index:
        plot_frame.loc[(a, b, "negative"), "percentage"] = 0
Now this produces the desired plot.
I don't particularly like this solution, since it is very specific to my index and probably doesn't scale well if the categories become more complex.

Preparing Data-frame for Bokeh Consumption

Trying to plot with Bokeh using a data-frame, but the plot is displaying empty. Beginner here; I'm missing something fundamental.
My plot works if I hard-code some basic X and Y variables, so I know the issue has to do with the data-frame I'm trying to use as a source.
...
df = pd.DataFrame(j)
df.columns = ['Team','Type','Date','SLA_MET']
df['SLA_MET']= df['SLA_MET'].round(2)
pd.set_option('display.max_columns', 10)
print(df)
source = ColumnDataSource(df)
p = figure(background_fill_color='gray',
           background_fill_alpha=0.5,
           border_fill_color='blue',
           border_fill_alpha=0.25,
           plot_height=600,
           plot_width=1000,
           x_axis_label='Month',
           x_axis_location='below',
           y_axis_label='% SLA Met',
           y_axis_location='left',
           title='Percentage of SLA Met',
           title_location='above',
           toolbar_location='below',
           tools='save')
p.line(source=source, x='Date', y='SLA_MET')
show(p)
As a workaround, I decided to pass clean lists to the plot:
sr_list, inc_list, date_list = [], [], []
for index, row in df.iterrows():
    if row[1] == 'Service Request':  # row[1] is the 'Type' column
        sr_list.append(row[3])       # row[3] is 'SLA_MET'
    else:
        inc_list.append(row[3])
    date_list.append(row[2])         # row[2] is 'Date'; only need one list of dates
The problem now is that the dates display in scientific notation and are not in order.
Bokeh does not know what to do with the strings in your Date column. You have two options:
convert this column to real python/numpy/pandas (numeric) datetime values, and also set x_axis_type="datetime" in your figure call, or
use the string values as categorical factors
It's not clear what your intention is, so I can't recommend one vs the other.
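For the first option, a minimal sketch (assuming the Date strings are parseable by pandas):
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])  # real datetimes instead of strings
source = ColumnDataSource(df)
p = figure(x_axis_type='datetime',       # tell Bokeh the x-axis holds dates
           x_axis_label='Month',
           y_axis_label='% SLA Met',
           title='Percentage of SLA Met')
p.line(source=source, x='Date', y='SLA_MET')
show(p)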

Python in Power BI - Show column names instead of numbers in matplotlib.pyplot.matshow

I have a dataset similar to the below, in Power BI:
last_updated    product    price
01-01-2019      Cycle      1000
02-01-2019      Cycle      1010
01-01-2019      Helmet     200
02-01-2019      Helmet     190
Basically, I wanted to dynamically let the user choose some products from the filters, and I'd get Python to pivot the data and plot a correlation matrix.
It's only my second day with Python, but I have managed to write the following code.
import matplotlib.pyplot

# 'dataset' is the dataframe Power BI passes to the Python script
dataset = dataset.pivot(index='last_updated',
                        columns='product',
                        values='price')
matplotlib.pyplot.matshow(dataset.corr('pearson'))
matplotlib.pyplot.show()
It works as expected, but it shows 0, 1, 2, etc., instead of Cycle, Helmet,...
How can I set the tick labels dynamically to the column names?
I see some examples use set_xticklabels(), but I am not able to figure out how to use it to set a literal string, let alone dynamic column names.
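For reference, the matshow route can also be made to work with set_xticks/set_xticklabels (a sketch, assuming the pivoted dataset from the code above):
import matplotlib.pyplot as plt

corr = dataset.corr('pearson')
fig, ax = plt.subplots()
cax = ax.matshow(corr)
fig.colorbar(cax)
# label the ticks with the pivoted column names instead of 0, 1, 2, ...
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
plt.show()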
Solved it like this...
import seaborn as sns

# note: the column names below are from my real dataset,
# not the sample data in the question
dataset = dataset.pivot(index='last_updated',
                        columns='symbol',
                        values='cumulative_return')
corr = dataset.corr('pearson')
sns.heatmap(corr, annot=True, xticklabels=corr.columns, yticklabels=corr.columns)
matplotlib.pyplot.show()

Average of daily count of records per month in a Pandas DataFrame

I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
   TIMESTAMP                  TYPE
0  2014-07-25 11:50:30.640    2
1  2014-07-25 11:50:46.160    3
2  2014-07-25 11:50:57.370    2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month, then plot it as a bar chart with months on the x-axis (April 2014, May 2014, etc.). I managed to calculate these values using the code below:
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(), dfWIM.TIMESTAMP.dt.year.max() + 1):
    for j in range(1, 13):
        print(dfWIM[(dfWIM.TIMESTAMP.dt.year == i) &
                    (dfWIM.TIMESTAMP.dt.month == j)]
              .resample('D').count().TIMESTAMP.mean())
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is ok as it is, and with some more work I can map the results to the correct month names, then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using Pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby, first to group by day and count the instances, and next to group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import numpy as np
import pandas as pd

# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
                     'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.Grouper (formerly pd.TimeGrouper) and plot the monthly average counts:
import seaborn as sns  # for nice plot styles (optional)

daily = data.set_index('TIMESTAMP').groupby(pd.Grouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.Grouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
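For example, one possible tweak (a sketch using the ax and monthly objects from above): replace the default timestamp tick labels with short month names.
import matplotlib.pyplot as plt

# format the bar labels as 'Apr 2014' etc. instead of full timestamps
ax.set_xticklabels([d.strftime('%b %Y') for d in monthly.index])
plt.gcf().autofmt_xdate()  # rotate for readability
plt.show()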
