matplotlib conditional background color in python - python

How can I change the background color of a line chart based on a variable that is not in the chart?
For example if I have the following dataframe:
import numpy as np
import pandas as pd
dates = pd.date_range('20000101', periods=800)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(800))
df['B'] = np.random.randint(-1,2,size=800)
If I do a line chart of df.A, how can I change the background color based on the values of column 'B' at that point in time?
For example, if B = 1 on a given date, the background at that date should be green.
If B = 0, the background at that date should be yellow.
If B = -1, the background at that date should be red.
Adding the workaround I was originally planning with axvline, although @jakevdp's answer is exactly what I was looking for because it needs no for loops:
First add an 'i' column as a counter; the whole code then looks like:
dates = pd.date_range('20000101', periods=800)
df = pd.DataFrame(index=dates)
df['A'] = np.cumsum(np.random.randn(800))
df['B'] = np.random.randint(-1,2,size=800)
df['i'] = range(len(df))  # zero-based row position, so df.index[i] is always valid
# 'i' values of the rows where B takes each value
zeros = df[df['B'] == 0]['i']
pos_1 = df[df['B'] == 1]['i']
neg_1 = df[df['B'] == -1]['i']
ax = df.A.plot()
for x in zeros:
    ax.axvline(df.index[x], color='y', linewidth=5, alpha=0.03)
for x in pos_1:
    ax.axvline(df.index[x], color='g', linewidth=5, alpha=0.03)
for x in neg_1:
    ax.axvline(df.index[x], color='r', linewidth=5, alpha=0.03)

You can do this with a plot command followed by pcolor() or pcolorfast(). For example, using the data you define above:
ax = df['A'].plot()
ax.pcolorfast(ax.get_xlim(), ax.get_ylim(),
              df['B'].values[np.newaxis],
              cmap='RdYlGn', alpha=0.3)
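The RdYlGn colormap only approximates the red/yellow/green mapping asked for in the question. As a minimal sketch, assuming B only ever takes the values -1, 0 and 1, a three-color ListedColormap pins each value to an exact color. The names fig, ax, cmap and img are illustrative, and the line is drawn with plain Matplotlib instead of the pandas plot wrapper:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=800)
df = pd.DataFrame(index=dates)
df["A"] = np.cumsum(rng.standard_normal(800))
df["B"] = rng.integers(-1, 2, size=800)  # values in {-1, 0, 1}

fig, ax = plt.subplots()
ax.plot(df.index, df["A"], color="black")

# Three colors in value order: -1 -> red, 0 -> yellow, 1 -> green
cmap = ListedColormap(["red", "yellow", "green"])
img = ax.pcolorfast(
    ax.get_xlim(), ax.get_ylim(),
    df["B"].to_numpy()[np.newaxis],  # shape (1, 800): one color cell per date
    cmap=cmap, vmin=-1, vmax=1, alpha=0.3,
)
```

Calling plt.show() afterwards displays the figure. Fixing vmin and vmax keeps the mapping stable even if one of the three values happens to be absent from B.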

Related

Split data frame in python based on one parameter shape

I have a data frame which is like the following :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import csv
import seaborn as sns
import warnings
df_input = pd.read_csv('combine_input.csv', delimiter=',')
df_output = pd.read_csv('combine_output.csv', delimiter=',')
In this data frame there are many repeated rows; for example, the first row is repeated more than 1000 times, and so on for the other rows.
When I plot the time distribution, I get a figure showing the frequency of the time parameter:
df_input.plot(y='time',kind = 'hist',figsize=(10,10))
plt.grid()
plt.show()
My question is: how can I take only the data inside the red rectangle, for example up to time = 0.006 and frequency = 0.75e6 (see the following picture)?
Note: in place of target, write time if that is your column name, or rename the column to target.
def calRows(df, x, y):
    # Keep only the rows whose target is at most x
    df1 = df[df.target <= x]
    minCount = len(df1)
    # Find the smallest number of rows held by any single target value
    for i in df1.target.unique():
        count = len(df1[df1.target == i])
        if minCount > count:
            minCount = count
    # Cap the result at y
    if minCount > y:
        minCount = int(y)
    return minCount
Pass your data frame and the x and y limits read off the graph to calRows(df, x, y); it returns the number of rows to take for each target value.
rows = calRows(df, 6, 75)
print(rows)
takeFeatures(df, rows, x) takes the data frame, rows (the result of the first function) and the x limit of the graph, and returns the final data frame.
def takeFeatures(df, rows, x):
    df1 = df[df.target <= x]
    # Draw `rows` random rows for each distinct target value
    samples = [df1[df1.target == i].sample(rows) for i in df1.target.unique()]
    # Return an empty frame with the same columns if nothing matched
    return pd.concat(samples) if samples else df1.iloc[0:0]
Calling the takeFeatures() function:
final = takeFeatures(df, rows, 6)
print(final)
The final DataFrame will contain the values you expected from the graph.
After plotting this final DataFrame you will get a graph like this:
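The same idea can be written without explicit loops. The following is only a sketch under stated assumptions: the column is called time, the toy frame and the names x_max, y_max, rows and balanced are hypothetical, and DataFrameGroupBy.sample requires pandas 1.1 or newer.

```python
import pandas as pd

# Toy frame: time value 1 appears 5 times, 2 appears 3 times, 9 appears 4 times
df = pd.DataFrame({"time": [1] * 5 + [2] * 3 + [9] * 4, "value": range(12)})

x_max = 6  # keep rows with time <= x_max (right edge of the rectangle)
y_max = 4  # take at most this many rows per time value (top edge)

subset = df[df["time"] <= x_max]
# Every group must be able to supply `rows` samples, so cap at the smallest group
rows = min(y_max, int(subset["time"].value_counts().min()))
balanced = subset.groupby("time").sample(n=rows, random_state=0)
```

Here the smallest remaining group has 3 rows, so each time value contributes 3 rows even though y_max is 4.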

Plotly: How to plot markers with time values on a different trace?

I have 2 data frames:
df1 contains columns: “time”, “bid_price”
df2 contains columns: “time”, “flag”
I want to plot a time series of df1 as a line graph and put markers on that trace at the points in time where the df2 “flag” column is True.
How can I do this?
You can do so in three steps:
set up a figure using go.Figure(),
add a trace for your bid prices using fig.add_traces(go.Scatter(...)),
and do the same thing for your flags.
The snippet below does exactly what you're describing in your question. I've set up two dataframes df1 and df2, and then I've merged them together to make things a bit easier to reference later on.
I'm also showing flags for an accumulated series where each increment in the series > 0.9 is flagged in flags = [True if elem > 0.9 else False for elem in bid_price] . You should be able to easily adjust this to whatever your real world dataset looks like.
Complete code with random data:
# imports
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
import random
# settings
observations = 100
np.random.seed(5)
bid_price = np.random.uniform(low=-1, high=1, size=observations).tolist()
flags = [True if elem > 0.9 else False for elem in bid_price]
time = pd.date_range('2020', freq='D', periods=observations).strftime('%Y-%m-%d').tolist()
# bid price
df1 = pd.DataFrame({'time': time,
                    'bid_price': bid_price})
df1.set_index('time', inplace=True)
df1.iloc[0] = 0  # start the series at zero; the cumulative sum is taken after the merge
# flags
df2=pd.DataFrame({'time': time,
'flags':flags})
df2.set_index('time',inplace = True)
df = df1.merge(df2, left_index=True, right_index=True)
df.bid_price = df.bid_price.cumsum()
df['flagged'] = np.where(df['flags']==True, df['bid_price'], np.nan)
# plotly setup
fig = go.Figure()
# trace for bid_prices
fig.add_traces(go.Scatter(x=df.index, y=df['bid_price'], mode='lines',
                          name='bid_price'))
# trace for flags
fig.add_traces(go.Scatter(x=df.index, y=df['flagged'], mode='markers',
                          marker=dict(symbol='triangle-down', size=16),
                          name='Flag'))
fig.update_layout(template = 'plotly_dark')
fig.show()
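In the question the two frames are separate and df2 may not cover every timestamp in df1. As a hedged sketch of just the data preparation step, assuming both frames share a time column with comparable values, a left merge keeps every price row and leaves the marker value empty wherever the flag is missing or False. The sample values and the names merged, is_flagged and flagged_price are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    "time": pd.date_range("2020-01-01", periods=5, freq="D"),
    "bid_price": [1.0, 1.2, 0.9, 1.5, 1.4],
})
df2 = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-02", "2020-01-04", "2020-01-05"]),
    "flag": [True, True, False],
})

# Keep every price row; timestamps missing from df2 get NaN in the flag column
merged = df1.merge(df2, on="time", how="left")

# True only where a flag exists and is True (NaN and False both become False)
is_flagged = merged["flag"].eq(True)

# Price where flagged, NaN elsewhere, so a markers trace only draws flagged points
merged["flagged_price"] = merged["bid_price"].where(is_flagged)
```

merged["time"] and merged["flagged_price"] can then be passed as x and y to a go.Scatter trace with mode='markers', as in the complete code above.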

How to highlight multiline graph in Altair python

I'm trying to create an interactive timeseries chart with more than 20 lines of data using the Altair module in Python.
The code to create the dataframe of the shape I'm looking at is here:
import numpy as np
import pandas as pd
import altair as alt
year = np.arange(1995, 2020)
day = np.arange(1, 91)
def gen_next_number(previous, limit, max_reached):
    # Once the maximum has been reached, every later value is missing
    if max_reached:
        return np.nan, True
    increment = np.random.randint(0, 10)
    output = previous + increment
    if output >= 100:
        output = 100
        max_reached = True
    return output, max_reached
def gen_list():
    output_list = []
    initial = 0
    limit = 100
    max_reached = False
    value = 0
    for i in range(1, 91):
        value, max_reached = gen_next_number(value, limit, max_reached)
        if max_reached:
            value = np.nan
        output_list.append(value)
    return output_list
df = pd.DataFrame(index=day, columns=year)
for y in year:
    data = gen_list()
    df[y] = data
df['day'] = df.index
df = df.melt("day")
df = df.dropna(subset=["value"])
I can use the following Altair code to produce the initial plot, but it's not pretty:
alt.Chart(df).mark_line().encode(
x='day:N',
color="variable:N",
y='value:Q',
tooltip=["variable:N", "value"]
)
But when I've tried this code to create something interactive, it fails:
highlight = alt.selection(type='single', on='mouseover',
fields='variable', nearest=True, empty="none")
alt.Chart(plottable).mark_line().encode(
x='day:N',
color="variable:N",
y=alt.condition(highlight, 'value:Q', alt.value("lightgray")),
tooltip=["variable:N", "value"]
).add_selection(
highlight
)
It fails with the error:
TypeError: sequence item 1: expected str instance, int found
Can someone help me out?
Also, is it possible to make the legend interactive? So a hover over a year highlights a line?
Two issues:
In alt.selection, fields must be a list of field names rather than a single string.
The y encoding does not accept a condition. I suspect you meant to put the condition on color.
With these two fixes, your chart works:
highlight = alt.selection(type='single', on='mouseover',
                          fields=['variable'], nearest=True, empty="none")
alt.Chart(df).mark_line().encode(
    x='day:N',
    y='value:Q',
    color=alt.condition(highlight, 'variable:N', alt.value("lightgray")),
    tooltip=["variable:N", "value"]
).add_selection(
    highlight
)
Because the selection doesn't change z-order, you'll find that the highlighted line is often hidden behind other gray lines. If you want it to pop out in front, you could use an approach similar to the one in https://stackoverflow.com/a/55796860/2937831
I would like to create a multi-line plot similar to the one above, but
without a legend
without hovering or mouseover.
I would simply like to pass a highlighted_value and have a single line highlighted.
I have modified the code below, but I am not terribly familiar with the proper use of "selection" and recognize that it is a somewhat kludgy way to get the result I want.
Is there a cleaner way to do this?
highlight = alt.selection(type='single', on='mouseover',
fields=['variable'], nearest=True, empty="none")
background = alt.Chart(df[df['variable'] != 1995]).mark_line().encode(
x='day:N',
y='value:Q',
color=alt.condition( highlight, 'variable:N', alt.value("lightgray")),
tooltip=["variable:N", "value"],
).add_selection(
highlight
)
foreground = alt.Chart(df[df['variable'] == 1995]).mark_line(color= "blue").encode(
x='day:N',
y='value:Q',
color=alt.Color('variable',legend=None)
)
foreground + background

Get smooth line plot by filling missing values

I have multiple Dataframes (up to 30) which all contain timestamps with associated values. The timestamp in the DataFrames do not necessarily overlap and the recorded values can only stay the same or increase. A DataFrame may look like this:
       time   coverage
0  0.000000  32.111748
1  0.875050  32.482579
2  1.850576  32.784133
3  3.693440  34.205134
...
I uploaded a couple of csv files with data here 1, 2, 3, 4.
So what I am trying to do is to plot the increase of the mean and median coverage values over time for all recordings, as follows:
# data is a list of dataframes
keys = ["Run " + str(i) for i in range(len(data))]
glued = pd.concat(data, keys=keys).reset_index(level=0).rename(columns={'level_0': 'Run'})
glued["roundtime"] = glued["time"] / 60
glued["roundtime"] = glued["roundtime"].round(0) # round to whole minutes
f, (ax1, ax2) = plt.subplots(2)
my_dpi = 96
stepsize = 5
start = 0
end = 60
ax1.set_title("Mean")
ax2.set_title("Median")
f.set_size_inches(1980 / my_dpi, 1080 / my_dpi)
ax1 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator="mean", data=glued, ax=ax1)
ax1.set(xlabel="Time", ylabel="Coverage in percent")
ax1.xaxis.set_ticks(np.arange(start, end, stepsize))
ax1.set_xlim(0, 70)
ax2 = sns.lineplot(x="roundtime", y="coverage", ci="sd", estimator='median', data=glued, ax=ax2)
ax2.set(xlabel="Time", ylabel="Coverage in percent")
ax2.xaxis.set_ticks(np.arange(start, end, stepsize))
ax2.set_xlim(0, 70)
plt.show()
The result looks like this.
However, the curve should never decrease as the "coverage" values can never decrease either. The reason for this, I suspect, is that at certain points in time I only have recordings of some DataFrames with lower values and therefore the mean/median is also lower.
I tried to fix this by aligning the indices of all the DataFrames and filling missing values with previous recordings, before doing any of the previous code. Like this:
# create a common index
index = None
for df in data:
    df.set_index("time", inplace=True, drop=False)
    if index is not None:
        index = index.union(df.index)
    else:
        index = df.index
# reindex all dataframes and fill missing values
new_data = []
for df in data:
    print(df)
    new_df = df.reindex(index, fill_value=np.nan)
    new_df = new_df.ffill()
    new_data.append(new_df)
data = new_data
The result, however, does not change much and still decreases at certain times. It looks like this:
Is this approach wrong or am I simply missing something?
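To make the align-and-forward-fill idea concrete, here is a minimal sketch on two tiny made-up runs. It assumes that timestamps are unique within each run and that all runs start at the same time (otherwise the leading gaps stay empty and a late-starting run can still pull the average down). The names runs, wide, mean_cov and median_cov are illustrative. Note that it aggregates on the shared index itself, not on a forward-filled copy of the time column.

```python
import pandas as pd

runs = [
    pd.DataFrame({"time": [0.0, 1.0, 4.0], "coverage": [30.0, 32.0, 35.0]}),
    pd.DataFrame({"time": [0.0, 2.0, 3.0], "coverage": [10.0, 12.0, 20.0]}),
]

# One column per run on the union of all timestamps, then carry values forward
wide = pd.concat(
    [run.set_index("time")["coverage"].rename(f"Run {i}")
     for i, run in enumerate(runs)],
    axis=1,
).sort_index().ffill()

# Aggregate across runs at every shared timestamp
mean_cov = wide.mean(axis=1)
median_cov = wide.median(axis=1)
```

Because every column is non-decreasing after the forward fill and no run is missing at any timestamp, the row-wise mean and median cannot decrease in this example.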

Obtaining values used in boxplot, using python and matplotlib

I can draw a boxplot from data:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.rand(100)
plt.boxplot(data)
Then, the box will range from the 25th-percentile to 75th-percentile, and the whisker will range from the smallest value to the largest value between (25th-percentile - 1.5*IQR, 75th-percentile + 1.5*IQR), where the IQR denotes the inter-quartile range. (Of course, the value 1.5 is customizable).
Now I want to know the values used in the boxplot, i.e. the median, upper and lower quartile, the upper whisker end point and the lower whisker end point. While the former three are easy to obtain by using np.median() and np.percentile(), the end point of the whiskers will require some verbose coding:
median = np.median(data)
upper_quartile = np.percentile(data, 75)
lower_quartile = np.percentile(data, 25)
iqr = upper_quartile - lower_quartile
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
I was wondering, while this is acceptable, would there be a neater way to do this? It seems that the values should be ready to pull-out from the boxplot, as it's already drawn.
Why do you want to do so? What you are doing is already pretty direct.
If you want to fetch them from a plot that has already been drawn, simply use the get_ydata() method.
B = plt.boxplot(data)
[item.get_ydata() for item in B['whiskers']]
It returns an array of shape (2,) for each whisker; the second element is the value we want:
[item.get_ydata()[1] for item in B['whiskers']]
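If no figure is needed at all, the statistics can also be computed directly. As a short sketch, matplotlib.cbook.boxplot_stats returns one dictionary per dataset with keys such as med, q1, q3, whislo and whishi; the variable names below simply mirror the ones used in the question:

```python
import numpy as np
from matplotlib import cbook

rng = np.random.default_rng(0)
data = rng.random(100)

# boxplot_stats returns a list with one dict per dataset; take the only one
stats = cbook.boxplot_stats(data, whis=1.5)[0]

median = stats["med"]
lower_quartile, upper_quartile = stats["q1"], stats["q3"]
lower_whisker, upper_whisker = stats["whislo"], stats["whishi"]
```

The whis argument plays the same role as in plt.boxplot, so the whisker ends match the manual computation in the question for the default value of 1.5.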
I ran into this recently and wrote a function that extracts the boxplot values from the boxplot as a pandas DataFrame.
The function is:
def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        # each box has two whisker lines: lower at index 2*i, upper at 2*i + 1
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)
It is called by passing a list of labels (the ones you would pass to the boxplot plotting function) and the dictionary returned by the boxplot function itself.
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
def get_box_plot_data(labels, bp):
    rows_list = []
    for i in range(len(labels)):
        dict1 = {}
        dict1['label'] = labels[i]
        dict1['lower_whisker'] = bp['whiskers'][i*2].get_ydata()[1]
        dict1['lower_quartile'] = bp['boxes'][i].get_ydata()[1]
        dict1['median'] = bp['medians'][i].get_ydata()[1]
        dict1['upper_quartile'] = bp['boxes'][i].get_ydata()[2]
        dict1['upper_whisker'] = bp['whiskers'][(i*2)+1].get_ydata()[1]
        rows_list.append(dict1)
    return pd.DataFrame(rows_list)
data1 = np.random.normal(loc = 0, scale = 1, size = 1000)
data2 = np.random.normal(loc = 5, scale = 1, size = 1000)
data3 = np.random.normal(loc = 10, scale = 1, size = 1000)
labels = ['data1', 'data2', 'data3']
bp = plt.boxplot([data1, data2, data3], labels=labels)
print(get_box_plot_data(labels, bp))
plt.show()
Outputs the following from get_box_plot_data:
   label  lower_whisker  lower_quartile    median  upper_quartile  upper_whisker
0  data1      -2.491652       -0.587869  0.047543        0.696750       2.559301
1  data2       2.351567        4.310068  4.984103        5.665910       7.489808
2  data3       7.227794        9.278931  9.947674       10.661581      12.733275
And produces the following plot:
The expressions
upper_whisker = data[data<=upper_quartile+1.5*iqr].max()
lower_whisker = data[data>=lower_quartile-1.5*iqr].min()
give the most extreme real data points inside the fences. They are equal to
upper_whisker = data.max()
lower_whisker = data.min()
only when the dataset has no points beyond the fences. Statistically speaking, the fences themselves are upper_quartile + 1.5*IQR and lower_quartile - 1.5*IQR.
