I am plotting a map using Plotly Express and a GeoJSON file. I want to show static values on each individual district. Currently those values are visible on hover, but I want them to be visible all the time, even without hovering.
This is my code:
import json
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
x = json.load(open("./odisha_disticts.geojson","r"))
user_data = []
for i in range(len(x['features'])):
    d = x['features'][i]['properties']
    d['Females'] = np.random.randint(0,100,1)[0]
    user_data.append(d)
df = pd.DataFrame(user_data)
df.head()
ID_2 NAME_2 Females
0 16084 Angul 19
1 16085 Baleshwar 45
2 16086 Baragarh 52
3 16087 Bhadrak 81
4 16088 Bolangir 49
fig = px.choropleth(
df,
locations="ID_2",
featureidkey="properties.ID_2",
geojson=x,
color="Females"
)
fig.update_geos(fitbounds="locations", visible=False)
px.scatter_geo(
df,
geojson=x,
featureidkey="properties.NAME_2",
locations="District",
text = df["District"]
)
fig.show()
The link to required files is HERE
To annotate a map, use graph objects: combine go.Choroplethmapbox with go.Scattermapbox in text mode. As preparation before creating the graph, we need a latitude and longitude for each annotation, so we use geopandas to read the GeoJSON file and find the centroid of each geometry. A warning is displayed at this point because the loaded geometry is in a geographic coordinate system (latitude/longitude), in which the centroid calculation is not exact. If you already have latitudes and longitudes that you wish to use for your annotations, use them instead.
There are two caveats in creating the map. First, you will need a free Mapbox API token, which you can get from Mapbox. Second, in go.Scattermapbox() the mode is 'text+markers'; if you use text only, an error will occur. The reason is unknown.
import json
import geopandas as gpd
import pandas as pd
import plotly.graph_objects as go
# read your data
data = pd.read_csv('./data.csv', index_col=0)
# read geojson
x = json.load(open("./odisha_disticts.geojson","r"))
gdf = gpd.read_file('./odisha_disticts.geojson')
gdf['centroid'] = gdf['geometry'].centroid
gdf['lon'] = gdf['centroid'].map(lambda p:p.x)
gdf['lat'] = gdf['centroid'].map(lambda p:p.y)
gdf.head()
ID_2 NAME_2 geometry centroid lon lat
0 16084 Angul POLYGON ((85.38891 21.17916, 85.31440 21.15510... POINT (84.90419 20.98316) 84.904186 20.983160
1 16085 Baleshwar POLYGON ((87.43902 21.76406, 87.47124 21.70760... POINT (86.90547 21.48738) 86.905470 21.487376
2 16086 Baragarh POLYGON ((83.79293 21.56323, 83.84026 21.52344... POINT (83.34884 21.22068) 83.348838 21.220683
3 16087 Bhadrak POLYGON ((86.82882 21.20137, 86.82379 21.13752... POINT (86.61598 20.97818) 86.615981 20.978183
4 16088 Bolangir POLYGON ((83.45259 21.05145, 83.44352 21.01535... POINT (83.16839 20.58812) 83.168393 20.588121
import plotly.express as px
import plotly.graph_objects as go
mapbox_token = open("mapbox_api_key.txt").read()
fig = go.Figure()
fig.add_trace(go.Scattermapbox(lat=gdf['lat'],
lon=gdf['lon'],
mode='text+markers',
textposition='top center',
text = [str(x) for x in data["District"]],
textfont=dict(color='blue')
))
fig.add_trace(go.Choroplethmapbox(geojson=x,
locations=data['id'],
z=data['Females'],
featureidkey="properties.ID_2",
colorscale='Reds',
zmin=0,
zmax=data['Females'].max(),
marker_opacity=0.8,
marker_line_width=0
)
)
fig.update_layout(height=600,
mapbox=dict(
center={"lat": gdf['lat'].mean(), "lon": gdf['lon'].mean()},
accesstoken=mapbox_token,
zoom=5.5,
style="light"
))
fig.show()
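If geopandas is not available, the label coordinates can be approximated straight from the GeoJSON. The following is only a sketch under stated assumptions: the helper names `ring_area_and_centroid` and `label_points` are hypothetical, the calculation is planar on raw longitude/latitude values (so it shares the accuracy caveat behind the geopandas warning), holes are ignored, and for a MultiPolygon only the largest part is used. It will not match `gdf.centroid` exactly for such geometries.

```python
def ring_area_and_centroid(ring):
    """Return (area, (lon, lat)) for one closed ring of [lon, lat] pairs.

    Uses the planar shoelace formula; a ring with zero area would raise
    ZeroDivisionError.
    """
    area2 = cx = cy = 0.0
    for p, q in zip(ring, ring[1:]):
        cross = p[0] * q[1] - q[0] * p[1]
        area2 += cross
        cx += (p[0] + q[0]) * cross
        cy += (p[1] + q[1]) * cross
    return abs(area2) / 2, (cx / (3 * area2), cy / (3 * area2))


def label_points(geojson, name_key="NAME_2"):
    """Map each feature's name to an approximate (lon, lat) label position."""
    points = {}
    for feature in geojson["features"]:
        geom = feature["geometry"]
        if geom["type"] == "Polygon":
            polygons = [geom["coordinates"]]
        elif geom["type"] == "MultiPolygon":
            polygons = geom["coordinates"]
        else:
            continue  # points and lines have no area to label
        # The outer ring is the first ring of each polygon; keep the largest part.
        _, centroid = max(ring_area_and_centroid(poly[0]) for poly in polygons)
        points[feature["properties"][name_key]] = centroid
    return points


# Made-up single-feature GeoJSON, used only to show the call.
demo = {
    "features": [
        {
            "properties": {"NAME_2": "Square"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]]],
            },
        }
    ]
}
demo_points = label_points(demo)
print(demo_points)
```

The resulting longitudes and latitudes could then be passed as `lon` and `lat` to go.Scattermapbox in place of the geopandas columns.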
I have a dataframe that looks like this:
index, start, end, bar_len,name,color, gr
1,2300000.0,5300000.0,3000000.0,p36.32,#949494, g1
2, 5300000.0,7100000.0,1800000.0,p36.31,#FFFFFF, g1
3, 7100000.0,9100000.0,2000000.0,p36.23,#949494, g1
4, 9100000.0,12500000.0,3400000.0,p36.22,#FFFFFF, g1
I want to create a horizontal stacked bar chart with the following output:
| - indx[1] [len=bar_len] | - indx[2] [len=bar_len] | - indx[3]
[len=bar_len] | - indx[4] [len=bar_len]
I tried doing this the following way:
import plotly.express as px
import pandas as pd
input_path = r"example.csv"
df = pd.read_csv(input_path)
df.set_index('start')
fig = px.bar(
df, x='bar_len', y='gr', color="DS_COLOR", orientation='h',
)
fig.update_layout(barmode='stack', xaxis={'categoryorder':'category ascending'})
The problem is that the values plotted on the bar chart are not sorted by the start column, which is what I am trying to do. Therefore, my question is: is there any way to plot a stacked bar chart that sets the length of each element based on one column (bar_len) and sorts the plotted elements based on another column (start)?
UPDATE: I have seen that the problem arises when including the color label. This label re-sorts the bar chart based on the color instead of preserving the original order based on the index column. Is there any way to avoid this?
You can build it using plotly graph_objects; the code below does this. Note: in the dataframe, I changed the colors to hex codes, #FF0000 for red and #0000FF for blue. I have used only the bar_len, color and gr columns. Adapted from this answer.
df looks like this
start end bar_len name color gr
0 2300000.0 5300000.0 3000000.0 p36.32 #FF0000 g1
1 5300000.0 7100000.0 1800000.0 p36.31 #0000FF g1
2 7100000.0 9100000.0 2000000.0 p36.23 #FF0000 g1
3 9100000.0 12500000.0 3400000.0 p36.22 #0000FF g1
The code is here:
import pandas as pd
import plotly.graph_objects as go
input_path = r"example.csv"
df = pd.read_csv(input_path)
data = []
for i in range(len(df)):
    data.append(go.Bar(x=[df['bar_len'][i]], y=[df['gr'][i]], marker=dict(color=df['color'][i]), orientation = 'h'))
layout = dict(barmode='stack', yaxis={'title': 'gr'}, xaxis={'title': 'Length'})
fig = go.Figure(data=data, layout=layout)
fig.update_layout(showlegend=False, autosize=False, width=800, height=300)
fig.show()
OUTPUT GRAPH
Note: If the x-axis can be expressed as a timeline and you are able to get the x values as datetime, would suggest you also check out plotly.express.timeline charts which gives gantt chart form of graphs. Sample here - Check the first chart...
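The per-row trace approach above stacks the segments in the order the rows appear, so it relies on the CSV already being sorted by start. A minimal sketch of making that ordering explicit, assuming the column names from the question; the helper `build_bar_traces` and the inline sample text are hypothetical, and the traces are built as plain dicts so the snippet runs without plotly installed:

```python
import csv
import io

# Sample rows in deliberately shuffled order (values taken from the question).
CSV_TEXT = """start,end,bar_len,name,color,gr
7100000.0,9100000.0,2000000.0,p36.23,#FF0000,g1
2300000.0,5300000.0,3000000.0,p36.32,#FF0000,g1
9100000.0,12500000.0,3400000.0,p36.22,#0000FF,g1
5300000.0,7100000.0,1800000.0,p36.31,#0000FF,g1
"""


def build_bar_traces(rows):
    """Return one horizontal bar trace (as a dict) per row, ordered by 'start'."""
    ordered = sorted(rows, key=lambda r: float(r["start"]))
    return [
        dict(
            type="bar",
            orientation="h",
            x=[float(r["bar_len"])],
            y=[r["gr"]],
            name=r["name"],
            marker=dict(color=r["color"]),
        )
        for r in ordered
    ]


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
traces = build_bar_traces(rows)
print([t["name"] for t in traces])
```

Plotly accepts trace dicts that carry a `type` key, so the list can be passed as `go.Figure(data=traces, layout=dict(barmode='stack'))`.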
In the dataframe below, I have unique compIds, each of which can have multiple capei values and multiple dates. This is primarily a time series dataset.
date capei compId
0 200401 25.123777 31946.0
1 200401 15.844910 29586.0
2 200401 20.524131 32507.0
3 200401 15.844910 29586.0
4 200401 15.844910 29586.0
... ... ... ...
73226 202011 9.372320 2817.0
73227 202011 9.372320 2817.0
73228 202011 22.334842 28581.0
73229 202011 10.761727 31946.0
73230 202011 30.205348 15029.0
With the following visualization code, I get the plot, but the lines are not drawn in different colors. I want different colors.
import seaborn as sns
a4_dims = (15, 5)
sns.set_palette("vlag")
**plot**
sns.set_style('ticks')
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
sns.relplot(x="date", ax=ax, y="capei", style='compId', kind='line',data=fDf, palette=sns.color_palette("Spectral", as_cmap=True) )
It generates an image like this
However, I am expecting a plot like this
The compId in the generated picture (Figure 1) can be the equivalent of Month in Figure 2.
Figure 2 is a screenshot from here.
How can I get different colors for each compId in Figure 1?
First of all, you should reformat your dataframe:
convert 'date' from str to datetime:
fDf['date'] = pd.to_datetime(fDf['date'], format='%Y%m')
convert 'compId' from float to str in order to be used as a categorical axis(1):
fDf['compId'] = fDf['compId'].apply(str)
Now you can pass 'compId' to seaborn.relplot as the hue and/or style parameter, depending on your preferences.
Complete Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
fDf = pd.read_csv(r'data/data.csv')
fDf['date'] = pd.to_datetime(fDf['date'], format='%Y%m')
fDf['compId'] = fDf['compId'].apply(str)
sns.set_palette("vlag")
sns.set_style('ticks')
sns.relplot(x="date", y="capei", style='compId', hue='compId', kind='line',data=fDf, estimator=None)
plt.show()
(plot drawn with a fake-dataframe)
(1) This step may or may not be necessary; given the context, I suggest it is.
If you keep compId as a numerical type, the hue in the plot will be proportional to the compId value. This means 0.4 will have a color very different from 31946.0, but 31946.0 and 32507.0 will be practically indistinguishable by color.
If you convert compId to str, the hue won't depend on the numerical value of compId, so the colors will be equally spaced among categories.
fDf['compId'] as it is
fDf['compId'].apply(str)
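The difference between the two cases can be illustrated with a little arithmetic. This is only a rough sketch of the idea and not seaborn's actual implementation: the two helper functions are hypothetical and simply compute where each value would land on a 0 to 1 color scale if it were treated as a number versus as a category.

```python
def continuous_positions(values):
    """Position of each value on a 0..1 scale when treated as a number."""
    lo, hi = min(values), max(values)
    return {v: (v - lo) / (hi - lo) for v in values}


def categorical_positions(values):
    """Position of each value on a 0..1 scale when treated as a category."""
    labels = sorted({str(v) for v in values})
    step = 1 / (len(labels) - 1)
    return {label: i * step for i, label in enumerate(labels)}


# The three ids used in the explanation above.
comp_ids = [0.4, 31946.0, 32507.0]

numeric = continuous_positions(comp_ids)
categorical = categorical_positions(comp_ids)

# As numbers, the two large ids sit almost on top of each other
# near the end of the scale.
print(numeric)
# As strings, the three ids are spread evenly across the scale.
print(categorical)
```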
I am trying to figure out how to manipulate my data so that I can aggregate on multiple columns of the same grouped pandas data. I am doing this because I need a stacked line chart that takes its data from different aggregations of the same grouped data. How can this be done in a compact way? Can anyone suggest a possible way of doing this in pandas?
my current attempt:
import pandas as pd
import matplotlib.pyplot as plt
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
df_re = df[df['retail_item'].str.contains("GROUND BEEF")]
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_rei = df_rei.reset_index(level=[0,1])
df_rei['week'] = pd.DatetimeIndex(df_rei['date']).week
df_rei['year'] = pd.DatetimeIndex(df_rei['date']).year
df_rei['week'] = df_rei['date'].dt.strftime('%W').astype('uint8')
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
similarly, I need to do data aggregation also like this:
df_re['price_gap'] = df_re['high_price'] - df_re['low_price']
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_rei1 = dff_rei1.reset_index(level=[0,1])
dff_rei1['week'] = pd.DatetimeIndex(dff_rei1['date']).week
dff_rei1['year'] = pd.DatetimeIndex(dff_rei1['date']).year
dff_rei1['week'] = dff_rei1['date'].dt.strftime('%W').astype('uint8')
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
problem
when I made data aggregation, those lines are similar:
df_rei = df_re.groupby(['date', 'retail_item']).agg({'number_of_ads': 'sum'})
df_ret_df1 = df_rei.groupby(['retail_item', 'week'])['number_of_ads'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
and
dff_rei1 = df_re.groupby(['date', 'retail_item']).agg({'price_gap': 'mean'})
dff_ret_df2 = dff_rei1.groupby(['retail_item', 'week'])['price_gap'].agg([max, min, 'mean']).stack().reset_index(level=[2]).rename(columns={'level_2': 'mm', 0: 'vals'}).reset_index()
I think a better way could be to write a custom function with *args, **kwargs to switch between the columns being aggregated, but how should I show a stacked line chart where the y-axes show different quantities? Is that doable in pandas?
line plot
I did for getting line chart as follow:
for g, d in df_ret_df1.groupby('retail_item'):
    fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
    sns.lineplot(x='week', y='vals', hue='mm', data=d,alpha=.8)
    y1 = d[d.mm == 'max']
    y2 = d[d.mm == 'min']
    plt.fill_between(x=y1.week, y1=y1.vals, y2=y2.vals)
    for year in df['year'].unique():
        data = df_rei[(df_rei.date.dt.year == year) & (df_rei.retail_item == g)]
        sns.lineplot(x='week', y='price_gap', ci=None, data=data, palette=cmap,label=year,alpha=.8)
I want to reduce this duplication so that I can aggregate on different columns and make a stacked line chart in which the subplots share the x-axis (week) and the y-axes show the number of ads and the price range respectively. The goal is a chart with two vertical subplots: one showing the number of ads on the y-axis and the other showing price ranges for the same items over 52 weeks. Can anyone suggest a possible way of doing this?
This answer builds on the one by Andreas who has already answered the main question of how to produce aggregate variables of multiple columns in a compact way. The goal here is to implement that solution specifically to your case and to give an example of how to produce a single figure from the aggregated data. Here are some key points:
The dates in the original dataset are already on a weekly frequency so groupby('week') is not needed for df_ret_df1 and dff_ret_df2, which is why these contain identical values for min, max, and mean.
This example uses pandas and matplotlib so the variables do not need to be stacked as when using seaborn.
The aggregation step produces a MultiIndex for the columns. You can access the aggregated variables (min, max, mean) of each high-level variable by using df.xs.
The date is set as the index of the aggregated dataframe to use as the x variable. Using the DatetimeIndex as the x variable gives you more flexibility for formatting the tick labels and ensures that the data is always plotted in chronological order.
It is not clear in the question how the data for separate years should be displayed (in separate figures?) so here the entire time series is shown in a single figure.
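The points about the MultiIndex columns and df.xs can be seen on a small scale first. The following is a minimal sketch on made-up data (the `toy` dataframe and its values are assumptions for illustration only); it uses the same dictionary-style aggregation and then selects the aggregates of one variable with `xs`:

```python
import pandas as pd

# Made-up data: two rows per date, two variables to aggregate.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-08", "2021-01-08"]
    ),
    "number_of_ads": [10, 30, 20, 40],
    "price_gap": [1.0, 3.0, 2.0, 6.0],
})

# One groupby, several aggregations per column. The result has
# two-level columns: (variable, aggregate).
agg = toy.groupby("date").agg({
    "number_of_ads": ["min", "max", "mean"],
    "price_gap": ["min", "max", "mean"],
})

# Select all aggregates of one variable; the remaining columns
# are 'min', 'max' and 'mean'.
ads = agg.xs("number_of_ads", axis=1)
print(ads)
```

Because the date stays in the index, `ads.index` can be used directly as the x variable when plotting.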
Import dataset and aggregate it as needed
import pandas as pd # v 1.2.3
import matplotlib.pyplot as plt # v 3.3.4
# Import dataset
url = 'https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/\
raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv'
df = pd.read_csv(url, parse_dates=['date'])
# Create dataframe containing data for ground beef products, compute
# aggregate variables, and set the date as the index
df_gbeef = df[df['retail_item'].str.contains('GROUND BEEF')].copy()
df_gbeef['price_gap'] = df_gbeef['high_price'] - df_gbeef['low_price']
agg_dict = {'number_of_ads': [min, max, 'mean'],
'price_gap': [min, max, 'mean']}
df_gbeef_agg = (df_gbeef.groupby(['date', 'retail_item']).agg(agg_dict)
.reset_index('retail_item'))
df_gbeef_agg
Plot aggregated variables in single figure containing small multiples
variables = ['number_of_ads', 'price_gap']
colors = ['tab:orange', 'tab:blue']
nrows = len(variables)
ncols = df_gbeef_agg['retail_item'].nunique()
fig, axs = plt.subplots(nrows, ncols, figsize=(10, 5), sharex=True, sharey='row')
for axs_row, var, color in zip(axs, variables, colors):
    for i, (item, df_item) in enumerate(df_gbeef_agg.groupby('retail_item')):
        ax = axs_row[i]
        # Select data and plot it
        data = df_item.xs(var, axis=1)
        ax.fill_between(x=data.index, y1=data['min'], y2=data['max'],
                        color=color, alpha=0.3, label='min/max')
        ax.plot(data.index, data['mean'], color=color, label='mean')
        ax.spines['bottom'].set_position('zero')
        # Format x-axis tick labels
        fmt = plt.matplotlib.dates.DateFormatter('%W') # is not equal to ISO week
        ax.xaxis.set_major_formatter(fmt)
        # Format subplot according to position within the figure
        # (ax.is_first_row() etc. were removed in Matplotlib 3.6; there, use
        # ax.get_subplotspec().is_first_row() and its siblings instead)
        if ax.is_first_row():
            ax.set_title(item, pad=10)
        if ax.is_last_row():
            ax.set_xlabel('Week number', size=12, labelpad=5)
        if ax.is_first_col():
            ax.set_ylabel(var, size=12, labelpad=10)
        if ax.is_last_col():
            ax.legend(frameon=False)
fig.suptitle('Cross-regional weekly ads and price gaps of ground beef products',
size=14, y=1.02)
fig.subplots_adjust(hspace=0.1);
I am not sure if this answers your question fully, but based on your headline I guess it all boils down to:
import pandas as pd
url = "https://gist.githubusercontent.com/adamFlyn/4657714653398e9269263a7c8ad4bb8a/raw/fa6709a0c41888503509e569ace63606d2e5c2ff/mydf.csv"
df = pd.read_csv(url, parse_dates=['date'])
# define which columns to group and in which way
dct = {'low_price': [max, min],
'high_price': min,
'year': 'mean'}
# actually group the columns
df.groupby(['region']).agg(dct)
Output:
low_price high_price year
max min min mean
region
ALASKA 16.99 1.33 1.33 2020.792123
HAWAII 12.99 1.33 1.33 2020.738318
MIDWEST 28.73 0.99 0.99 2020.690159
NORTHEAST 19.99 1.20 1.99 2020.709916
NORTHWEST 16.99 1.33 1.33 2020.736397
SOUTH CENTRAL 28.76 1.20 1.49 2020.700980
SOUTHEAST 21.99 1.33 1.48 2020.699655
SOUTHWEST 16.99 1.29 1.29 2020.704341
I want to plot machine observation data by days separately,
so changes between Current, Temperature etc. can be seen by hour.
Basically I want one plot for each day. Thing is when I make too many of these Jupyter Notebook can't display each one of them and plotly gives error.
f_day --> first day
n_day --> next day
I am thinking of using subplots with a shared y-axis, but then I don't know how to put different dates on the x-axis.
How can I make these with graph objects and subplots, using only one figure object so that the plots don't crash?
Data looks like this
,ID,IOT_ID,DATE,Voltage,Current,Temperature,Noise,Humidity,Vibration,Open,Close
0,9466,5d36edfe125b874a36c6a210,2020-08-06 09:02:00,228.893,4.17,39.9817,73.1167,33.3133,2.05,T,F
1,9467,5d36edfe125b874a36c6a210,2020-08-06 09:03:00,228.168,4.13167,40.0317,69.65,33.265,2.03333,T,F
2,9468,5d36edfe125b874a36c6a210,2020-08-06 09:04:00,228.535,4.13,40.11,71.7,33.1717,2.08333,T,F
3,9469,5d36edfe125b874a36c6a210,2020-08-06 09:05:00,228.597,4.14,40.1683,71.95,33.0417,2.0666700000000002,T,F
4,9470,5d36edfe125b874a36c6a210,2020-08-06 09:06:00,228.405,4.13333,40.2317,71.2167,32.9933,2.0,T,F
Code with display error is this
f_day = pd.Timestamp('2020-08-06 00:00:00')
for day in range(days_between.days):
    n_day = f_day + pd.Timedelta('1 days')
    fig_df = df[(df["DATE"] >= f_day) & (df["DATE"] <= n_day) & (df["IOT_ID"] == iot_id)]
    fig_cn = px.scatter(
        fig_df, x="DATE", y="Current", color="Noise", color_continuous_scale= "Sunset",
        title= ("IoT " + iot_id + " " + str(f_day.date())),
        range_color= (min_noise,max_noise)
    )
    f_day = n_day
    fig_cn.show()
updated
The question was with respect to plotly not matplotlib. Same approach works. Clearly axis and titles need some beautification
import pandas as pd
import plotly.subplots
import plotly.express as px
import datetime as dt
import random
df = pd.DataFrame([{"DATE":d, "IOT_ID":random.randint(1,5), "Noise":random.uniform(0,1), "Current":random.uniform(15,25)}
for d in pd.date_range(dt.datetime(2020,9,1), dt.datetime(2020,9,4,23,59), freq="15min")])
# get days to plot
days = df["DATE"].dt.floor("D").unique()
# create axis for each day
fig = plotly.subplots.make_subplots(len(days))
iot_id=3
for i,d in enumerate(days):
    # filter data and plot ....
    mask = (df["DATE"].dt.floor("D")==d)&(df["IOT_ID"]==iot_id)
    splt = px.scatter(df.loc[mask], x="DATE", y="Current", color="Noise", color_continuous_scale= "Sunset",
                      title= f"IoT ({iot_id}) Date:{pd.to_datetime(d).strftime('%d %b')}")
    # select_traces() returns a generator so turn it into a list and take first one
    fig.add_trace(list(splt.select_traces())[0], row=i+1, col=1)
fig.show()
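The per-day filtering in the loop above can also be expressed as a single groupby on the floored date. This is a sketch on simulated data, assuming the DATE and IOT_ID column names from the question; the `per_day` dictionary is a hypothetical name introduced here:

```python
import pandas as pd

# Simulated readings: one every 15 minutes over three full days.
df = pd.DataFrame({
    "DATE": pd.date_range("2020-09-01", "2020-09-03 23:45", freq="15min"),
})
df["IOT_ID"] = [(i % 3) + 1 for i in range(len(df))]
df["Current"] = range(len(df))

iot_id = 3
subset = df[df["IOT_ID"] == iot_id]

# Group once by calendar day instead of building a boolean mask per day.
per_day = {
    day: day_df
    for day, day_df in subset.groupby(subset["DATE"].dt.floor("D"))
}

for day, day_df in per_day.items():
    print(day.date(), len(day_df))
```

Each `day_df` could then be plotted into its own subplot row, in the same way as the masked selection in the loop above.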
It's simple - create the axis that you want to plot on first. Then plot. I've simulated your data as you didn't provide in your question.
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import random
df = pd.DataFrame([{"DATE":d, "IOT_ID":random.randint(1,5), "Noise":random.uniform(0,1), "Current":random.uniform(15,25)}
for d in pd.date_range(dt.datetime(2020,9,1), dt.datetime(2020,9,4,23,59), freq="15min")])
# get days to plot
days = df["DATE"].dt.floor("D").unique()
# create axis for each day
fig, ax = plt.subplots(len(days), figsize=[20,10],
sharey=True, sharex=False, gridspec_kw={"hspace":0.4})
iot_id=3
for i,d in enumerate(days):
    # filter data and plot ....
    df.loc[(df["DATE"].dt.floor("D")==d)&(df["IOT_ID"]==iot_id),].plot(kind="scatter", ax=ax[i], x="DATE", y="Current", c="Noise",
        colormap= "turbo", title=f"IoT ({iot_id}) Date:{pd.to_datetime(d).strftime('%d %b')}")
    ax[i].set_xlabel("") # it's in the titles...
output