How to plot multiple traces with trendlines?

I'm trying to add trendlines to multiple scatter traces in plotly. I'm kind of stumped on how to do it.
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
y=df_df['Height (meters)'],
name='Douglas Fir', mode='markers')
)
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
y=df_wp['Height (meters)'],
name='White Pine',mode='markers'),
)
fig.update_layout(title="Tree Circumference vs Height (meters)",
xaxis_title=df_df['Circumference (meters)'].name,
yaxis_title=df_df['Height (meters)'].name,
title_x=0.5)
fig.show()
Trying to get something like this:

Here's how I resolved it. Basically, I used NumPy's polyfit function to calculate the slope and intercept for each data set. I then added the fitted line for each data set as a trace.
import numpy as np
import plotly.graph_objects as go
df_m, df_b = np.polyfit(df_df['Circumference (meters)'].to_numpy(), df_df['Height (meters)'].to_numpy(), 1)
wp_m, wp_b = np.polyfit(df_wp['Circumference (meters)'].to_numpy(), df_wp['Height (meters)'].to_numpy(), 1)
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
y=df_df['Height (meters)'],
name='Douglas Fir', mode='markers')
)
fig.add_trace(go.Scatter(x=df_df['Circumference (meters)'],
y=(df_m*df_df['Circumference (meters)'] + df_b),
name='douglas fir trendline',
mode='lines')
)
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
y=df_wp['Height (meters)'],
name='White Pine',mode='markers'),
)
fig.add_trace(go.Scatter(x=df_wp['Circumference (meters)'],
y=(wp_m * df_wp['Circumference (meters)'] + wp_b),
name='white pine trendline',
mode='lines')
)
fig.update_layout(title="Tree Circumference vs Height (meters)",
xaxis_title=df_df['Circumference (meters)'].name,
yaxis_title=df_df['Height (meters)'].name,
title_x=0.5)
fig.show()
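The polyfit approach above repeats the same steps for each species. Here is a minimal sketch of how that could be wrapped in a helper, assuming nothing beyond NumPy. The function name `linear_trendline` and the sample numbers are made up for illustration and are not part of the original answer. It returns only the two end points of the fitted line, which is all a straight line trace needs:

```python
import numpy as np

def linear_trendline(x, y):
    """Fit y = m*x + b by least squares and return (m, b, x_ends, y_ends).

    Hypothetical helper: x_ends/y_ends are the two end points of the
    fitted line, enough to draw it as a 'lines' trace.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, b = np.polyfit(x, y, 1)           # degree-1 fit: slope first, then intercept
    x_ends = np.array([x.min(), x.max()])
    y_ends = m * x_ends + b
    return m, b, x_ends, y_ends

# Made-up circumference/height values lying exactly on y = 10*x + 5
circumference = [1.0, 2.0, 3.0, 4.0]
height = [15.0, 25.0, 35.0, 45.0]
m, b, x_ends, y_ends = linear_trendline(circumference, height)
print(round(m, 6), round(b, 6))
```

The end points could then be passed to go.Scatter(x=x_ends, y=y_ends, mode='lines') instead of the full column.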

You've already put together a procedure that solves your problem, but I'd like to mention that plotly.express can do the very same thing in only a few lines of code. With px.scatter() there are actually two slightly different ways, depending on whether your data is in a long or a wide format. Your data seems to be in the latter, since you're asking:
how can I make this work with separate traces?
So I'll start with that. And I'll use a subset of the built-in dataset px.data.stocks() since you haven't provided a data sample.
Code 1 - Wide data
fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT'],
trendline = 'ols',
)
Code 2 - Long data
fig_long = px.scatter(df_long, x= 'index', y = 'value',
color = 'variable',
trendline = 'ols')
Plot 1 - Identical results
About the data:
A dataframe of a wide format typically has an index with unique values in the left-most column, variable names in the column headers, and corresponding values for each variable per index in the columns like this:
index AAPL MSFT
0 1.000000 1.000000
1 1.011943 1.015988
2 1.019771 1.020524
3 0.980057 1.066561
4 0.917143 1.040708
Here, adding information about another variable would require adding another column.
A dataframe of a long format, on the other hand, typically organizes the same data in three columns (though there can be more): index, variable and value:
index variable value
0 AAPL 1.000000
1 AAPL 1.011943
.
.
100 MSFT 1.720717
101 MSFT 1.752239
And contrary to the wide format, this means that index will have duplicate values. But for a good reason.
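To make the two layouts concrete, here is a small sketch that converts a wide frame to a long one with pd.melt, the same call the complete code further down uses. The numbers are the first three rows of the wide table above; the variable names are only illustrative.

```python
import pandas as pd

# First three rows of the wide table shown above
df_wide = pd.DataFrame({
    'index': [0, 1, 2],
    'AAPL': [1.000000, 1.011943, 1.019771],
    'MSFT': [1.000000, 1.015988, 1.020524],
})

# Stack every non-id column into 'variable'/'value' pairs
df_long = pd.melt(df_wide, id_vars='index')

print(df_long)
# Each index value now appears once per variable, so this is True
print(df_long['index'].duplicated().any())
```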
So what's the difference?
If you look at Code 1 you'll see that the only thing you need to specify for px.scatter in order to get multiple traces with trendlines, in this case AAPL and MSFT on the y-axis versus an index on the x-axis, is trendline = 'ols'. This is because plotly.express automatically identifies the data format as wide and knows how to apply the trendlines correctly. Different columns mean different categories, for which a trace and trendline are produced.
As for the "long approach", you've got both AAPL and MSFT in the same variable column, and values for both of them in the value column. But setting color = 'variable' lets plotly.express know how to categorize the variable column, correctly separate the data in the value column, and thus correctly produce the trendlines. A different name in the variable column means that the index and value in that row belong to a different category, for which a new trace and trendline are built.
Any pros and cons?
Arguably the only advantage of the wide format is that it's easier to read (particularly for those of us damaged by too many years of sub-excellent data handling with Excel). And one great advantage of the long format is that you can easily illustrate more dimensions of the data if you have more categories, for example with different symbols or sizes for the markers.
Another advantage with the long format occurs if the dataset changes, for example with the addition of another variable 'AMZN'. Then the name and the values of that variable will occur in the already existing columns instead of adding another one like you would for the wide format. This means that you actually won't have to change the code in:
fig_long = px.scatter(df_long, x= 'index', y = 'value',
color = 'variable',
trendline = 'ols')
... in order to add the data to the figure.
While for the wide format, you would have to specify y = ['AAPL', 'MSFT', 'AMZN'] in:
fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT', 'AMZN'],
trendline = 'ols',
)
And I would strongly argue that this outweighs the slight inconvenience of specifying color = 'variable' in:
fig_long = px.scatter(df_long, x= 'index', y = 'value',
color = 'variable',
trendline = 'ols')
Plot 2 - A new variable:
Complete code
# imports (trendline = 'ols' also needs the statsmodels package installed)
import pandas as pd
import plotly.express as px
# data
df = px.data.stocks()
# df.date = pd.to_datetime(df.date)
df_wide = df.drop(['date', 'GOOG', 'AMZN', 'NFLX', 'FB'], axis = 1).reset_index()
# df_wide = df.drop(['date', 'GOOG', 'NFLX', 'FB'], axis = 1).reset_index()
df_long = pd.melt(df_wide, id_vars = 'index')
df_long
fig_wide = px.scatter(df_wide, x = 'index', y = ['AAPL', 'MSFT'],
trendline = 'ols',
)
fig_long = px.scatter(df_long, x= 'index', y = 'value',
color = 'variable',
trendline = 'ols')
# fig_long.show()
fig_wide.show()

Related

Frequency map by country in python

I am trying to make a world map with some specific frequency data for some countries. I have tried to use plotly (below), but the base map is not available, and it won't let me load a new one I found.
The map I need is a color scale (intensity) for the countries with presence of this variable.
These are the data and the code with which I have tried to plot the map:
database = px.data.gapminder()
d = {'Australia':[3],
'Brazil' :[2],
'Canada':[6],
'Chile':[3],
'Denmark':[1],
'France':[16],
'Germany':[3],
'Israel':[1]}
data = pd.DataFrame(d).T.reset_index()
data.columns=['country', 'count']
df=pd.merge(database, yourdata, how='left', on='country')
url = (
"https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
fig = px.choropleth(df, locations="country",
locationmode='ISO-3',
geojson = f"{url}/world-countries.json",
color="count")
I keep getting the same error.
Thank you for your help

I think the map is not displayed because the location mode and the location column don't match. If the location mode is 'ISO-3', the locations column for this data should be 'iso_alpha'; if the location mode is 'country names', it should be 'country'. Also, the gapminder data has one row per country for each recorded year, so I kept a single year (2007) and changed the merge from 'left' to 'inner' so that only the countries with a count remain.
import pandas as pd
d = {'Australia':[3],
'Brazil' :[2],
'Canada':[6],
'Chile':[3],
'Denmark':[1],
'France':[16],
'Germany':[3],
'Israel':[1]}
data = pd.DataFrame(d).T.reset_index()
data.columns=['country', 'count']
import plotly.express as px
database = px.data.gapminder().query('year == 2007')
df = pd.merge(database, data, how='inner', on='country')
url = (
"https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
fig = px.choropleth(df,
locations="country",#"iso_alpha",
locationmode="country names",#"ISO-3",
geojson = f"{url}/world-countries.json",
color="count"
)
fig.show()
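Country names have to match exactly for the merge to keep a row, so it can help to check for mismatches before plotting. This is a minimal sketch, assuming a hypothetical two-column reference table in place of the gapminder data. The misspelled 'Brasil' entry is deliberate, to show what an unmatched name looks like:

```python
import pandas as pd

# Hypothetical stand-in for the reference table (gapminder in the answer above)
reference = pd.DataFrame({
    'country': ['Australia', 'Brazil', 'Canada'],
    'iso_alpha': ['AUS', 'BRA', 'CAN'],
})

# Counts to plot; 'Brasil' is deliberately misspelled
counts = pd.DataFrame({
    'country': ['Australia', 'Brasil', 'Canada'],
    'count': [3, 2, 6],
})

# indicator=True adds a '_merge' column saying where each row came from
checked = pd.merge(counts, reference, how='left', on='country', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)
```

Any name that shows up in the unmatched list would be silently dropped by an inner merge, so it is worth fixing the spelling first.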

plotting multiple lines in one line plot

I have a dataframe that has 3 columns. I built it from a bigger dataframe like this:
new_df = df[['client_name', 'time_window_end', 'tag_count']]
then I used groupby to find out the number of tags for each client in each day using this code :
new_df.groupby(['client_name' ,'time_window_end']) ['tag_count'].count()
In total I have 70 client names in a list, and I want to loop through the list to plot a line plot for each customer name. On the x axis I want to have 'time_window_end' and on the y axis 'tag_count'.
I want 70 plots, but the for loop I have written does not do that. I would be happy if you could help me fix it.
clients = new_df['client_name'].unique()
client_list = clients.tolist()
for client in client_list[:60]:
    temp = new_df.loc[new_df['client_name'] == client]
    x = temp.groupby(temp['time_window_end'].dt.floor('d'))['tag_count'].sum()
    df2 = x.to_frame()
    df2.reset_index(inplace=True)
    df2["time_window_end"]= pd.to_datetime(df2["time_window_end"])
    line_chart = df2.copy()
    plt.plot(line_chart.reset_index()["time_window_end"], x)

If I'm understanding this right, it sounds like the seaborn package might have what you need. The plotting functions take the argument 'hue' which splits plots up into multiple lines, based on the data in a column
import pandas as pd
import seaborn as sn
new_df = new_df.groupby(['client_name' ,'time_window_end']) ['tag_count'].count().reset_index()
sn.relplot(
data = new_df,
x = pd.to_datetime(new_df["time_window_end"]),
y = 'tag_count',
hue = 'client_name',
kind = 'line')
EDIT: to get multiple plots
import seaborn as sn
new_df["time_window_end"] = pd.to_datetime(new_df["time_window_end"])
g = sn.FacetGrid(
data = new_df,
row = 'client_name')
g.map(sn.lineplot, 'time_window_end', 'tag_count')
EDIT again: to get separate plot images
import matplotlib.pyplot as plt
for name in pd.unique(new_df.client_name):
    sn.lineplot(
        data = new_df.loc[new_df.client_name == name],
        x = 'time_window_end',
        y = 'tag_count',
        label = name)
    # show inside the loop so each client gets its own figure
    plt.show()
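The snippets above assume new_df already holds one row per client and day. Below is a minimal sketch of that aggregation step, using made-up raw data with the column names from the question. Whether the tags should be counted or summed depends on the data; summing is assumed here, as in the question's loop:

```python
import pandas as pd

# Made-up raw data using the question's column names
raw = pd.DataFrame({
    'client_name': ['a', 'a', 'a', 'b', 'b'],
    'time_window_end': pd.to_datetime([
        '2021-01-01 08:00', '2021-01-01 17:00', '2021-01-02 09:00',
        '2021-01-01 10:00', '2021-01-02 11:00',
    ]),
    'tag_count': [2, 3, 1, 4, 5],
})

# Floor timestamps to whole days, then sum tag_count per client and day
daily = (
    raw.groupby(['client_name', raw['time_window_end'].dt.floor('D')])['tag_count']
       .sum()
       .reset_index()
)
print(daily)
```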

Plotly: How to animate a bar chart with multiple groups using plotly express?

I have a dataframe that looks like this:
I want to have one bar for old freq and one for new freq. Currently I have graph that looks like this:
This is what the code looks like:
freq_df['date'] = pd.to_datetime(freq_df['date'])
freq_df['hour'] = freq_df['hour'].astype(str)
fig = px.bar(freq_df, x="hour", y="old freq",hover_name = "date",
animation_frame= freq_df.date.dt.day)
fig.update_layout(transition = {'duration': 2000})
How do I add another bar?
Explanation about DF:
It has frequencies relevant to each hour in a specific date.
Edit:
One approach could be to create a category column and add old and new freq and assign values in another freq column. How do I do that :p ?
Edit:
Here is the DF
,date,hour,old freq,new freq
43,2020-09-04,18,273,224.0
44,2020-09-04,19,183,183.0
45,2020-09-04,20,99,111.0
46,2020-09-04,21,130,83.0
47,2020-09-04,22,48,49.0
48,2020-09-04,23,16,16.0
49,2020-09-05,0,8,6.0
50,2020-09-05,1,10,10.0
51,2020-09-05,2,4,4.0
52,2020-09-05,3,7,7.0
53,2020-09-05,4,25,21.0
54,2020-09-05,5,114,53.0
55,2020-09-05,6,284,197.0
56,2020-09-05,7,343,316.0
57,2020-09-05,8,418,419.0
58,2020-09-05,9,436,433.0
59,2020-09-05,10,469,396.0
60,2020-09-05,11,486,300.0
61,2020-09-05,12,377,140.0
62,2020-09-05,13,552,103.0
63,2020-09-05,14,362,117.0
64,2020-09-05,15,512,93.0
65,2020-09-05,16,392,41.0
66,2020-09-05,17,268,31.0
67,2020-09-05,18,223,30.0
68,2020-09-05,19,165,24.0
69,2020-09-05,20,195,15.0
70,2020-09-05,21,90,
71,2020-09-05,22,46,1.0
72,2020-09-05,23,17,1.0

The answer in two steps:
1. Perform a slight transformation of your data using pd.wide_to_long:
df_long = pd.wide_to_long(freq_df, stubnames='freq',
i=['date', 'hour'], j='type',
sep='_', suffix=r'\w+').reset_index()
2. Plot two groups of bar traces using:
fig1 = px.bar(df_long, x='hour', y = 'freq', hover_name = "date", color='type',
animation_frame= 'date', barmode='group')
This is the result:
The details:
If I understand your question correctly, you'd like to animate a bar chart where you've got one bar for each hour for your two frequencies freq_old and freq_new like this:
If that's the case, then your sample data is no good, since your animation criterion is hour per date and you've only got six observations (hours) for 2020-09-04 and then 24 observations for 2020-09-05. But don't worry: since your question triggered my interest, I made some sample data that will in fact work the way you seem to want it to.
The only real challenge is that px.bar will not accept y= [freq_old, freq_new], or something to that effect, to build your two bar series of different categories for you. But you can make px.bar build two groups of bars by providing a color argument.
However, you'll need a column to identify your different freqs like this:
0 new
1 old
2 new
3 old
4 new
5 old
6 new
7 old
8 new
9 old
In other words, you'll have to transform your dataframe, which originally has a wide format, to a long format like this:
date hour type day freq
0 2020-01-01 0 new 1 7.100490
1 2020-01-01 0 old 1 2.219932
2 2020-01-01 1 new 1 7.015528
3 2020-01-01 1 old 1 8.707323
4 2020-01-01 2 new 1 7.673314
5 2020-01-01 2 old 1 2.067192
6 2020-01-01 3 new 1 9.743495
7 2020-01-01 3 old 1 9.186109
8 2020-01-01 4 new 1 3.737145
9 2020-01-01 4 old 1 4.884112
And that's what this snippet does:
df_long = pd.wide_to_long(freq_df, stubnames='freq',
i=['date', 'hour'], j='type',
sep='_', suffix=r'\w+').reset_index()
stubnames uses a prefix to identify the columns you'd like to stack into a long format. And that's why I've renamed new_freq and old_freq to freq_new and freq_old, respectively. j='type' simply takes the last parts of your category names using sep='_' and produces the column that we need to tell the freqs from each other:
type
old
new
old
...
suffix='\w+' tells pd.wide_to_long that we're using non-integers as suffixes.
And that's it!
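For the column names in the question ('old freq' and 'new freq'), the renaming step mentioned above might look like the sketch below. It uses three rows taken from the question's data. The rename mapping is an assumption about how the renaming was done, not something shown in the answer:

```python
import pandas as pd

# Three rows from the question's data, with its original column names
freq_df = pd.DataFrame({
    'date': ['2020-09-04', '2020-09-04', '2020-09-05'],
    'hour': [18, 19, 0],
    'old freq': [273, 183, 8],
    'new freq': [224.0, 183.0, 6.0],
})

# Put the shared stub 'freq' first so wide_to_long can find it
freq_df = freq_df.rename(columns={'old freq': 'freq_old', 'new freq': 'freq_new'})

# Raw string for the regex suffix avoids an invalid-escape warning
df_long = pd.wide_to_long(freq_df, stubnames='freq',
                          i=['date', 'hour'], j='type',
                          sep='_', suffix=r'\w+').reset_index()
print(df_long)
```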
Complete code:
# imports
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import numpy as np
import random
# sample data
observations = 24*5
np.random.seed(5)
freq_old = np.random.uniform(low=-1, high=1, size=observations).tolist()
freq_new = np.random.uniform(low=-1, high=1, size=observations).tolist()
# hourly timestamps, split into a date string and an integer hour
timestamps = pd.date_range('2020-01-01', freq='h', periods=observations)
date = timestamps.strftime('%Y-%m-%d').tolist()
hour = timestamps.hour.tolist()
# sample dataframe of a wide format such as yours
freq_df=pd.DataFrame({'date': date,
'hour':hour,
'freq_new':freq_new,
'freq_old':freq_old})
freq_df['day']=pd.to_datetime(freq_df['date']).dt.day
# attempt to make my random data look a bit
# like your real world data.
# but don't worry too much about that...
freq_df.freq_new = abs(freq_df.freq_new.cumsum())
freq_df.freq_old = abs(freq_df.freq_old.cumsum())
# sample dataframe of a long format that px.bar likes
df_long = pd.wide_to_long(freq_df, stubnames='freq',
i=['date', 'hour'], j='type',
sep='_', suffix=r'\w+').reset_index()
# plotly express bar chart with multiple bar groups.
fig = px.bar(df_long, x='hour', y = 'freq', hover_name = "date", color='type',
animation_frame= 'date', barmode='group')
# set up a sensible range for the y-axis
fig.update_layout(yaxis=dict(range=[df_long['freq'].min()*0.8,df_long['freq'].max()*1.2]))
fig.show()

I was able to create the bars for both the old and new frequencies, however using a separate plot for each day (Plotly Express Bar Charts don't seem to have support for multiple series). Here is the code for doing so:
# Import packages
import pandas as pd
import numpy as np
import plotly.graph_objs as go
import plotly
import plotly.express as px
from plotly.offline import init_notebook_mode, plot, iplot, download_plotlyjs
init_notebook_mode(connected=True)
plotly.offline.init_notebook_mode(connected=True)
# Start formatting data
allDates = np.unique(original_df.date)
numDates = allDates.shape[0]
print(numDates)
for i in range(numDates):
    # one grouped bar figure per date
    df = original_df.loc[original_df.date == allDates[i]]
    oldFreqData = go.Bar(x=df["hour"].to_numpy(), y=df["old_freq"].to_numpy(), name="Old Frequency")
    newFreqData = go.Bar(x=df["hour"].to_numpy(), y=df["new_freq"].to_numpy(), name="New Frequency")
    fig = go.Figure(data=[oldFreqData,newFreqData])
    fig.update_layout(title=allDates[i])
    fig.update_xaxes(title='Hour')
    fig.update_yaxes(title='Frequency')
    fig.show()
where original_df is the dataframe DF from your question, with the frequency columns named old_freq and new_freq.
Here is the output:
However, if you prefer the use of the animation frame from Plotly Express, you can have two separate plots: one for old frequencies and one for new using this code:
# Reformat data
df = original_df
dates = pd.to_datetime(np.unique(df.date)).strftime('%Y-%m-%d')
numDays = dates.shape[0]
print(numDays)
hours = np.arange(0,24)
numHours = hours.shape[0]
allDates = []
allHours = []
oldFreqs = []
newFreqs = []
for i in range(numDays):
    for j in range(numHours):
        allDates.append(dates[i])
        allHours.append(j)
        # rows for this date and hour (empty if the data is missing)
        rows = df.loc[(df.date == dates[i]) & (df.hour == j)]
        if rows.shape[0] != 0: # If data not missing
            oldFreqs.append(rows.old_freq.to_numpy()[0])
            newFreqs.append(rows.new_freq.to_numpy()[0])
        else:
            oldFreqs.append(0)
            newFreqs.append(0)
d = {'Date': allDates, 'Hour': allHours, 'Old_Freq': oldFreqs, 'New_Freq': newFreqs}
df2 = pd.DataFrame(data=d)
# Create px plot with animation
fig = px.bar(df2, x="Hour", y="Old_Freq", hover_data=["Old_Freq","New_Freq"], animation_frame="Date")
fig.show()
fig2 = px.bar(df2, x="Hour", y="New_Freq", hover_data=["Old_Freq","New_Freq"], animation_frame="Date")
fig2.show()
and here is the plot from that code:
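The nested loops above pad missing hours with zeros. Here is a sketch of the same padding done with a reindex, assuming the column names date, hour, old_freq and new_freq used in that code. The three-row frame is made up:

```python
import pandas as pd

# Made-up frame with gaps: only a few hours per date are present
original_df = pd.DataFrame({
    'date': ['2020-09-04', '2020-09-04', '2020-09-05'],
    'hour': [22, 23, 0],
    'old_freq': [48, 16, 8],
    'new_freq': [49.0, 16.0, 6.0],
})

# Every (date, hour) combination that the animation should contain
full_index = pd.MultiIndex.from_product(
    [sorted(original_df['date'].unique()), range(24)],
    names=['date', 'hour'],
)

# Reindex onto the full grid; combinations without data become 0
df2 = (
    original_df.set_index(['date', 'hour'])
               .reindex(full_index, fill_value=0)
               .reset_index()
)
print(df2.shape)
```

This gives one row per hour for every date, which is what animation_frame needs for the bars to line up across frames.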

How can I exclude certain dates (e.g., weekends) from time series plots?

In the following example, I'd like to exclude weekends and plot Y as a straight line, and specify some custom frequency for major tick labels since they would be a "broken" time series (e.g., every Monday, a la matplotlib's set_major_locator).
How would I do that in Altair?
import altair as alt
import numpy as np
import pandas as pd
index = pd.date_range('2018-01-01', '2018-01-31', freq='B')
df = pd.DataFrame(np.arange(len(index)), index=index, columns=['Y'])
alt.Chart(df.reset_index()).mark_line().encode(
x='index',
y='Y'
)

A quick way to do that is to specify the axis as an ordinal field. This would produce a very ugly axis, with the hours specified for every tick. To change that, I add a column to the dataframe with a given label. I also added the grid, as by default it is removed for an ordinal encoding, and set the labelAngle to 0.
df2 = df.assign(label=index.strftime('%b %d %y'))
alt.Chart(df2).mark_line().encode(
x=alt.X('label:O', axis=alt.Axis(grid=True, labelAngle=0)),
y='Y:Q'
)
Beware that it would remove any missing point. So, maybe you want to add a tooltip. This is discussed in the documentation here.
You can also play with labelOverlap in the axis settings, depending on what you want.
To customize the axis, we can build one up using mark_text and bring back the grid with mark_rule and a custom dataframe. It does not necessarily scale up well, but it can give you some ideas.
df3 = df2.loc[df2.index.dayofweek == 0, :].copy()
df3["Y"] = 0
text_chart = alt.Chart(df3).mark_text(dy = 15).encode(
x=alt.X('label:O', axis = None),
y=alt.Y('Y:Q'),
text=alt.Text('label:O')
)
tick_chart = alt.Chart(df3).mark_rule(color='grey').encode(
x=alt.X('label:O', axis=None),
)
line_chart = alt.Chart(df2).mark_line().encode(
x=alt.X('label:O', axis=None, scale=alt.Scale(rangeStep=15)),
y='Y:Q'
)
text_chart + tick_chart + line_chart
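To see what the label column and the Monday subset used above contain, here is a small pandas-only sketch built on the same date range as the question. No chart is drawn; it only prints the values that would feed the charts:

```python
import numpy as np
import pandas as pd

# Business days in January 2018, as in the question
index = pd.date_range('2018-01-01', '2018-01-31', freq='B')
df = pd.DataFrame({'Y': np.arange(len(index))}, index=index)

# String labels turn the axis into an ordinal one, so weekends leave no gap
df2 = df.assign(label=index.strftime('%b %d %y'))

# Keep only Mondays (dayofweek == 0) for the custom tick labels
mondays = df2.loc[df2.index.dayofweek == 0, 'label'].tolist()
print(len(df2), mondays)
```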

Scatter plot for non numeric data

I am learning to use matplotlib with pandas and I am having a little trouble with it. There is a dataframe which has districts and coffee shops as its y and x labels respectively. And the column values represent the start date of the coffee-shops in respective districts
starbucks cafe-cool barista ........ 60 shops
dist1 2008-09-18 2010-05-04 2007-02-21 ...............
dist2 2007-06-12 2011-02-17
dist3
.
.
100 districts
I want to plot a scatter plot with x axis as time series and y axis as coffee-shops. Since I couldn't figure out a direct one line way to plot this, I extracted the coffee-shops as one list and dates as other list.
shops = list(df.columns.values)
dt = pd.DataFrame(df.ix['dist1'])
dates = dt.set_index('dist1')
First I tried plt.plot(dates, shops). Got a ZeroDivisionError: integer division or modulo by zero - error. I could not figure out the reason for it. I saw on some posts that the data should be numeric, so I used ytick function.
y = [1, 2, 3, 4, 5, 6,...60]
still plt.plot(dates, y) threw same ZeroDivisionError. If I could get past this may be I would be able to plot using tick function. Source -
http://matplotlib.org/examples/ticks_and_spines/ticklabels_demo_rotation.html
I am trying to plot the graph for only first row/dist1. For that I fetched the first row as a dataframe df1 = df.ix[1] and then used the following
for badges, dates in df.iteritems():
    date = dates
    ax.plot_date(date, yval)
    # Record the number and label of the coffee shop
    label_ticks.append(yval)
    label_list.append(badges)
    yval+=1
.
I got an error at the line ax.plot_date(date, yval) saying that x and y should have the same first dimension. Since I am plotting one by one for each coffee shop for dist1, shouldn't the length always be one for both x and y? PS: date is a datetime.date object

To achieve this you need to convert the dates to datetimes, see here for
an example. As mentioned you also need to convert the coffee shops into
some numbering system then change the tick labels accordingly.
Here is an attempt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime
def get_datetime(string):
    "Converts string '2008-05-04' to datetime"
    return datetime.strptime(string, "%Y-%m-%d")
# Generate dataframe (None marks a shop with no start date in that district)
df = pd.DataFrame(dict(
    starbucks=["2008-09-18", "2007-06-12"],
    cafe_cool=["2010-05-04", "2011-02-17"],
    barista=["2007-02-21", None]),
    index=["dist1", "dist2"])
ax = plt.subplot(111)
label_list = []
label_ticks = []
yval = 1 # numbering system
# Iterate through coffee shops (df.items() replaces df.iteritems(), removed in pandas 2.0)
for coffee_shop, dates in df.items():
    # Convert strings into datetime list, skipping missing dates
    datetimes = [get_datetime(date) for date in dates.dropna()]
    # Create list of yvals [yval, yval, ...] to plot against
    yval_list = np.zeros(len(datetimes))+yval
    # Marker-only plot; ax.plot handles datetimes, so plot_date is not needed
    ax.plot(datetimes, yval_list, 'o')
    # Record the number and label of the coffee shop
    label_ticks.append(yval)
    label_list.append(coffee_shop)
    yval+=1 # Change the number so they don't all sit at the same y position
# Now set the yticks appropriately
ax.set_yticks(label_ticks)
ax.set_yticklabels(label_list)
# Set the limits so we can see everything
ax.set_ylim(ax.get_ylim()[0]-1,
            ax.get_ylim()[1]+1)
plt.show()
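An alternative to parsing the strings one by one is to reshape the table first. This is a minimal sketch on made-up sample data, not the method of the answer above: melt() turns the wide table into one row per (district, shop) pair, dropna() removes missing dates, pd.to_datetime parses the rest, and each shop gets an integer y position. The names long_df and positions are only illustrative:

```python
import pandas as pd

# Made-up sample data; None marks a shop with no start date in that district
df = pd.DataFrame(
    {
        'starbucks': ['2008-09-18', '2007-06-12'],
        'cafe_cool': ['2010-05-04', '2011-02-17'],
        'barista': ['2007-02-21', None],
    },
    index=['dist1', 'dist2'],
)

# One row per (district, shop); rows with a missing date are dropped
long_df = (
    df.rename_axis('district')
      .reset_index()
      .melt(id_vars='district', var_name='shop', value_name='start_date')
      .dropna(subset=['start_date'])
)
long_df['start_date'] = pd.to_datetime(long_df['start_date'])

# Map each shop name to an integer y position, in column order
positions = {shop: i + 1 for i, shop in enumerate(df.columns)}
long_df['y'] = long_df['shop'].map(positions)
print(long_df)
```

With the data in this shape, ax.plot(long_df['start_date'], long_df['y'], 'o') would draw all shops in one call, and the keys and values of positions can be used for the tick labels and tick positions.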
