plotting pandas groupby with x-axis in columns - python
I have a dataframe of time series where the columns are the time values (in order) and each row is a separate series. I also have extra columns that give the category of each row, which in turn determines the linestyle and the color.
Here's the dataframe:
>>> df
                cat (frac_norm, 2, 1)                                                                                              cluster
month_rel                  -5        -4        -3        -2        -1         0         1         2         3         4         5
user1   user2
3414845 4232621 -1b  0.760675  0.789854   0.95941  0.867755  0.790102         1  0.588729  0.719073  0.695572  0.647696  0.656323        4
4369232 3370279 -1b  0.580436  0.546761   0.71343  0.742033  0.802198  0.389957  0.861451  0.651786  0.798265  0.476305  0.896072        0
22771   3795428 -1b  0.946188  0.499531  0.834885  0.825772  0.754018   0.67823  0.430692  0.353989  0.333761  0.284759  0.260501        2
2660226 3126314 -1b  0.826701   0.81203  0.765182  0.680162  0.763475  0.802632         1  0.780186  0.844019  0.868698  0.722672        4
4154510 4348009 -1b         1  0.955656  0.677647  0.911556   0.76613  0.743759   0.61798  0.606536  0.715528  0.614902  0.482267        3
2860801 164553  -1b  0.870056  0.371981  0.640212  0.835185  0.673108  0.536585         1  0.850242  0.551198  0.873016  0.635556        4
120577  3480468 -1b    0.8197  0.879873  0.961178         1  0.855465  0.827824  0.827139  0.304011  0.574978  0.473996  0.358934        3
6692132 5095003 -1b         1  0.995859  0.738418  0.991217  0.854336  0.936518  0.910347  0.883205  0.987796  0.699433  0.815072        4
2515737 4263756 -1b  0.949047  0.990238  0.899524         1  0.961066   0.83703  0.835114  0.759142  0.749727  0.886913  0.936961        4
707596  2856619 -1b  0.780538  0.702179  0.568627         1  0.601382  0.789116         0 0.0714286         0  0.111969 0.0739796        2
I can make the following plot, where the x-axis holds the ordered values of ('frac_norm', 2, 1), the color depends on the value of cluster, and the linestyle depends on the value of cat. However, it is drawn row by row. Is there a way to vectorize this, say, by using groupby?
My code for generating the image:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

colors = ['r','g','b','c','y','k']
lnst = ['-','--']
cats = np.sort(df['cat'].unique())
clusters = np.sort(df['cluster'].unique())
colordict = dict(zip(clusters, colors))
lnstdict = dict(zip(cats,lnst))
fig, ax = plt.subplots()
# I first do it by `cluster` value
for clus_val in clusters:
    clr = colordict[clus_val]
    subset = df[df['cluster'] == clus_val]
    # and then plot each row individually, setting the color and linestyle
    for row in subset.iterrows():
        ax.plot(row[1][('frac_norm', 2, 1)], color=clr,
                linestyle=lnstdict[row[1]['cat'][0]]
                )
Code for generating df
import pandas as pd
import numpy as np
vals = np.array([['-1b', 0.7606747496046389, 0.7898535589129476, 0.959409594095941,
0.8677546569280126, 0.7901020186672455, 1.0, 0.5887286145588728,
0.7190726452719073, 0.6955719557195572, 0.6476962793343348,
0.6563233814156323, 4],
['-1b', 0.5804363905325444, 0.5467611336032389,
0.7134300126103406, 0.7420329670329671, 0.8021978021978022,
0.389957264957265, 0.861451048951049, 0.6517857142857143,
0.798265460030166, 0.4763049450549451, 0.8960720130932898, 0],
['-1b', 0.9461875843454791, 0.49953095684803, 0.8348848603625673,
0.8257715338553662, 0.7540183696900115, 0.6782302664655606,
0.43069179143004643, 0.35398860398860393, 0.33376068376068374,
0.28475935828877, 0.260501012145749, 2],
['-1b', 0.8267008985879333, 0.8120300751879698,
0.7651821862348178, 0.680161943319838, 0.7634749524413443,
0.8026315789473684, 1.0, 0.7801857585139319, 0.8440191387559809,
0.8686980609418281, 0.7226720647773278, 4],
['-1b', 1.0, 0.955656108597285, 0.6776470588235294,
0.9115556882651537, 0.766129636568003, 0.7437589670014347,
0.6179800221975582, 0.6065359477124183, 0.715527950310559,
0.6149019607843138, 0.4822670674109059, 3],
['-1b', 0.8700564971751412, 0.3719806763285024,
0.6402116402116402, 0.8351851851851851, 0.6731078904991948,
0.5365853658536585, 1.0, 0.8502415458937197, 0.55119825708061,
0.873015873015873, 0.6355555555555555, 4],
['-1b', 0.8196997807387418, 0.879872907246731, 0.961178456344944,
1.0, 0.8554654738607772, 0.8278240873814314, 0.8271388025408839,
0.3040112596762843, 0.5749778172138421, 0.47399605003291634,
0.35893441346004046, 3],
['-1b', 1.0, 0.9958592132505176, 0.7384176764076977,
0.9912165129556433, 0.8543355440923606, 0.9365176566646254,
0.9103471520053926, 0.8832054560954816, 0.9877955758962623,
0.6994328922495274, 0.8150724637681159, 4],
['-1b', 0.9490474080638015, 0.9902376128200405,
0.8995240613432046, 1.0, 0.9610655737704917, 0.837029893924783,
0.8351136964569011, 0.759142496847415, 0.7497267759562841,
0.8869130313976105, 0.9369612979550449, 4],
['-1b', 0.7805383022774327, 0.7021791767554478,
0.5686274509803921, 1.0, 0.6013824884792627, 0.7891156462585033,
0.0, 0.07142857142857142, 0.0, 0.11196911196911197,
0.07397959183673469, 2]], dtype=object)
cols = pd.MultiIndex.from_tuples([( 'cat', ''),
(('frac_norm', 2, 1), -5),
(('frac_norm', 2, 1), -4),
(('frac_norm', 2, 1), -3),
(('frac_norm', 2, 1), -2),
(('frac_norm', 2, 1), -1),
(('frac_norm', 2, 1), 0),
(('frac_norm', 2, 1), 1),
(('frac_norm', 2, 1), 2),
(('frac_norm', 2, 1), 3),
(('frac_norm', 2, 1), 4),
(('frac_norm', 2, 1), 5),
( 'cluster', '')],
names=[None, 'month_rel'])
idx = pd.MultiIndex.from_tuples([(3414845, 4232621),
(4369232, 3370279),
( 22771, 3795428),
(2660226, 3126314),
(4154510, 4348009),
(2860801, 164553),
( 120577, 3480468),
(6692132, 5095003),
(2515737, 4263756),
( 707596, 2856619)],
names=['user1', 'user2'])
df = pd.DataFrame(vals, columns=cols, index=idx)
You should be able to use pandas to plot and avoid the loops:
from matplotlib.colors import LinearSegmentedColormap
colors = ['r','g','b','c','y','k']
lnst = ['-','--']
cats = np.sort(df['cat'].unique())
clusters = np.sort(df['cluster'].unique())
colordict = dict(zip(clusters, colors))
lnstdict = dict(zip(cats,lnst))
# transpose data frame
df1 = df.T
# map colors from colordict to cluster
cmap = df['cluster'].map(colordict).values.tolist()
# create a custom color map and line style
lscm = LinearSegmentedColormap.from_list('color', cmap)
lstyle = df['cat'].map(lnstdict).values.tolist()
# plot with pandas
df1.iloc[1:12].reset_index(level=0, drop=True).plot(figsize=(20,10),
colormap=lscm,
style=lstyle)
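As a possible simplification of the snippet above, `DataFrame.plot` also accepts a list of colors through `color`, so the custom colormap could be skipped. The sketch below runs on a small made-up frame: `wide`, `meta`, and their values are illustrative assumptions with plain (non-MultiIndex) columns, not the data from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 4 series over months -5..5 (not the question's df)
rng = np.random.default_rng(0)
months = list(range(-5, 6))
wide = pd.DataFrame(rng.random((4, len(months))),
                    index=["s1", "s2", "s3", "s4"], columns=months)
meta = pd.DataFrame({"cat": ["-1b", "-1b", "0a", "0a"],
                     "cluster": [0, 2, 2, 4]}, index=wide.index)

colordict = {0: "r", 2: "g", 4: "b"}
lnstdict = {"-1b": "-", "0a": "--"}

# One color and one linestyle per row, in row order
line_colors = meta["cluster"].map(colordict).tolist()
line_styles = meta["cat"].map(lnstdict).tolist()

# After transposing, every original row is a column, i.e. one line
ax = wide.T.plot(color=line_colors, style=line_styles, legend=False)
ax.set_xlabel("month_rel")
```

The style strings must not contain a color letter (such as 'r--'), otherwise pandas rejects the combination with `color`.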
Update (assuming you want both on the same graph)
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.pyplot as plt
%matplotlib inline
colors = ['r','g','b','c','y','k']
lnst = ['-','--']
cats = np.sort(df['cat'].unique())
clusters = np.sort(df['cluster'].unique())
colordict = dict(zip(clusters, colors))
lnstdict = dict(zip(cats,lnst))
# transpose data frame
df1 = df.T
# map colors from colordict to cluster
cmap = df['cluster'].map(colordict).values.tolist()
# create a custom color map and line style
lscm = LinearSegmentedColormap.from_list('color', cmap)
lstyle = df['cat'].map(lnstdict).values.tolist()
c = df.columns
# not needed for your actual dataframe;
# I am just converting your sample data to numeric
for i in range(len(df.columns[1:])-1):
    df[c[i+1]] = pd.to_numeric(df[c[i+1]])
# groupby and get mean of cluster
df2 = df[c[1:]].groupby('cluster').mean()
# create subplots object from matplotlib
fig, ax = plt.subplots()
# add a twin axes that shares the y-axis (a second x-axis on top)
ax2 = ax.twiny()
# plot dataframe 1
df1.iloc[1:12].reset_index(level=0, drop=True).plot(ax=ax, figsize=(20,10),
colormap=lscm,
style=lstyle)
# create legend for ax
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, loc='center left', borderaxespad=-20)
# subplot df2
df2.plot(ax=ax2, colormap='copper')
# create legend for ax2
handles, labels = ax2.get_legend_handles_labels()
ax2.legend(handles, labels, loc='center right', borderaxespad=-20)
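To come back to the groupby idea from the question: `Axes.plot` draws one line per column when it is handed a 2-D array, so each (cluster, cat) group can be drawn with a single call. This is a minimal sketch on made-up data; the frame `df_demo` and its values are assumptions for illustration, and the question's MultiIndex columns would need adapting.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 6 series over months -5..5
rng = np.random.default_rng(1)
months = list(range(-5, 6))
df_demo = pd.DataFrame(rng.random((6, len(months))), columns=months)
df_demo["cat"] = ["-1b", "-1b", "0a", "0a", "-1b", "0a"]
df_demo["cluster"] = [0, 0, 2, 2, 4, 4]

colordict = {0: "r", 2: "g", 4: "b"}
lnstdict = {"-1b": "-", "0a": "--"}

fig, ax = plt.subplots()
n_groups = 0
for (clus, cat), block in df_demo.groupby(["cluster", "cat"]):
    # block[months] has shape (rows_in_group, 11); transposed, each column is one series
    ax.plot(months, block[months].to_numpy().T,
            color=colordict[clus], linestyle=lnstdict[cat])
    n_groups += 1
ax.set_xlabel("month_rel")
```

There is still a Python loop, but it runs once per group rather than once per row.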
Related
use a dataframe column for the color of a graph line with matplotlib
I draw a graph like this:
p1=ax1.plot(df['timestamp'], df['break_even'], color='blue', zorder = 0)
but I would like the line to change color based on another column:
p1=ax1.plot(df['timestamp'], df['break_even'], color=df['trade_color'], zorder = 0)
This will not work. I get:
ValueError: Invalid RGBA argument: 0 red 1 green 2 red 3 red 4 green ...
How can this be achieved? This is an example to test:
data = [[1, 10, 'red'], [2, 15, 'green'], [3, 14, 'blue']]
df = pd.DataFrame(data, columns = ['x', 'y', 'color'])
fig, ax = plt.subplots()
ax.plot(df['x'], df['y'], color='darkorange', zorder = 0)
This will work, but:
ax.plot(df['x'], df['y'], color=df['color'], zorder = 0)
will not. How can I get each line segment to use the color I need? (I have just 2 colors, if it makes a difference.)
Just plot part of the dataframe each time with the color you want:
import pandas as pd
import matplotlib.pyplot as plt

data = [[1, 10, 'red'], [2, 15, 'green'], [3, 14, 'blue']]
df = pd.DataFrame(data, columns = ['x', 'y', 'color'])
fig, ax = plt.subplots()
for i in df.index:
    # Get two rows each time, every row has a point (x, y).
    # Two points can draw a line, use the color defined by the first row.
    partial = df.iloc[i:i+2, :]
    ax.plot(partial['x'], partial['y'], color=partial['color'].iloc[0], zorder = 0)
plt.show()
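An alternative that avoids one `plot` call per segment is matplotlib's `LineCollection`, which takes all segments and their colors at once. This is a sketch of that different technique, not what the answer above does; the variable names `points`, `segments`, and `segment_colors` are made up for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.collections import LineCollection

data = [[1, 10, 'red'], [2, 15, 'green'], [3, 14, 'blue']]
df = pd.DataFrame(data, columns=['x', 'y', 'color'])

points = df[['x', 'y']].to_numpy(dtype=float)            # shape (n, 2)
segments = np.stack([points[:-1], points[1:]], axis=1)   # shape (n-1, 2, 2)
segment_colors = df['color'].iloc[:-1].tolist()          # color of each segment's first point

fig, ax = plt.subplots()
lc = LineCollection(segments, colors=segment_colors, zorder=0)
ax.add_collection(lc)
ax.autoscale()  # make sure the view limits cover the collection
```

As in the loop version, the last row's color is unused because there is no segment starting at the last point.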
bar chart legend based on coloring of bars by group not value
I've created a bar chart as described here where I have multiple variables (indicated in the 'value' column) and they belong to repeat groups. I've colored the bars by their group membership. I want to create a legend ultimately equivalent to the colors dictionary, showing the color corresponding to a given group membership. Code here:
d = {'value': [1, 2, 4, 5, 7 ,10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
plt.legend(df['group'])
In this way, I just get a legend with one color (1) instead of (1, 2, 3). Thanks!
You can use sns:
sns.barplot(data=df, x=df.index, y='value', hue='group', palette=colors, dodge=False)
With pandas, you could create your own legend as follows:
from matplotlib import pyplot as plt
from matplotlib import patches as mpatches
import pandas as pd

d = {'value': [1, 2, 4, 5, 7 ,10], 'group': [1, 2, 3, 2, 2, 3]}
df = pd.DataFrame(data=d)
colors = {1: 'r', 2: 'b', 3: 'g'}
df['value'].plot(kind='bar', color=[colors[i] for i in df['group']])
# one legend patch per group, taken from the colors dictionary
handles = [mpatches.Patch(color=colors[i]) for i in colors]
labels = [f'group {i}' for i in colors]
plt.legend(handles, labels)
plt.show()
Animation multiple columns as dots with matplotlib very slow for large dataset with networkx graph as background
In my previous question (How to Animate multiple columns as dots with matplotlib from pandas dataframe with NaN in python), I managed to animate multiple dots from a dataframe. Now I want to use a network graph as the background of the animation, so that the dots appear to move along the edges of the network. Using the code from that question, I've created a new MCVE, listed below:
import random
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
from matplotlib import animation
#from JSAnimation import IPython_display
%matplotlib inline

# initialise graph object
G = nx.Graph()
color_map =[]
G.add_node(1, pos=(1, 0)); color_map.append('r')
G.add_node(2, pos=(2, 0)); color_map.append('r')
G.add_node(3, pos=(3, -1)); color_map.append('r')
G.add_node(4, pos=(3, 1)); color_map.append('r')
G.add_node(5, pos=(4, -1)) ;color_map.append('r')
G.add_node(6, pos=(4, 1)); color_map.append('r')
G.add_node(7, pos=(5, 0)); color_map.append('r')
G.add_node(8, pos=(6, 0)); color_map.append('r')
e = [(1, 2, 1), (2, 3, 1), (2, 4, 2), (3, 5, 5),
     (4, 6, 2), (5, 7, 1), (6, 7, 2), (7, 8, 1)]
G.add_weighted_edges_from(e)
labels = nx.get_edge_attributes(G,'weight')
nx.draw(G,nx.get_node_attributes(G, 'pos'))
nx.draw_networkx_edge_labels(G,nx.get_node_attributes(G, 'pos'),edge_labels=labels)
nx.draw_networkx_labels(G,nx.get_node_attributes(G, 'pos'))

df_x = pd.DataFrame(data=np.array([[np.nan, np.nan, np.nan, np.nan],
                                   [1, np.nan, np.nan, np.nan],
                                   [1.5, 4, np.nan, np.nan],
                                   [2, 5, 3, 4]]),
                    index=[1, 2, 3, 4], columns=[1, 2, 3, 4])
print(df_x)
df_y = pd.DataFrame(data=np.array([[np.nan, np.nan, np.nan, np.nan],
                                   [0, np.nan, np.nan, np.nan],
                                   [0, -1, np.nan, np.nan],
                                   [0, 0, 1, 1]]),
                    index=[1, 2, 3, 4], columns=[1, 2, 3, 4])

%matplotlib notebook
from matplotlib import animation
#from JSAnimation import IPython_display
#from IPython.display import HTML
fig = plt.figure(figsize=(10,10))
ax = plt.axes()
nx.draw(G,nx.get_node_attributes(G, 'pos'),node_size = 10)
n_steps = df_x.index
graph, = plt.plot([],[],'o')

def get_data_x(i):
    return df_x.loc[i]

def get_data_y(i):
    return df_y.loc[i]

def animate(i):
    x = get_data_x(i)
    y= get_data_y(i)
    graph.set_data(x,y)
    return graph,

animation.FuncAnimation(fig, animate, frames=n_steps, repeat=True, blit = True)
This creates a working animation. However, when I use a very large dataset (a pandas dataframe of ~8000 rows * 800 columns instead of the example dataframe I posted), the animation takes very long (an hour or so) to render, and most of the time my browser (Google Chrome) crashes. I suspect this is because the network graph has to be redrawn for each frame. How can I set the networkx graph as a static background? From there on it is just plotting points, right? My actual graph is a bit larger (~5000 nodes, ~6000 edges). I hope someone can help me speed up the rendering of the animation!
After some digging around, I found no 'easy' solution to this problem when trying to animate large datasets with matplotlib in a Jupyter notebook. I decided to write everything to an mp4 file instead, which works just as well for animations. My code for this, including the MCVE:
import random
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import math
import pandas as pd
from matplotlib import animation
#from JSAnimation import IPython_display
%matplotlib inline

# initialise graph object
G = nx.Graph()
color_map =[]
G.add_node(1, pos=(1, 0)); color_map.append('r')
G.add_node(2, pos=(2, 0)); color_map.append('r')
G.add_node(3, pos=(3, -1)); color_map.append('r')
G.add_node(4, pos=(3, 1)); color_map.append('r')
G.add_node(5, pos=(4, -1)) ;color_map.append('r')
G.add_node(6, pos=(4, 1)); color_map.append('r')
G.add_node(7, pos=(5, 0)); color_map.append('r')
G.add_node(8, pos=(6, 0)); color_map.append('r')
e = [(1, 2, 1), (2, 3, 1), (2, 4, 2), (3, 5, 5),
     (4, 6, 2), (5, 7, 1), (6, 7, 2), (7, 8, 1)]
G.add_weighted_edges_from(e)
labels = nx.get_edge_attributes(G,'weight')
nx.draw(G,nx.get_node_attributes(G, 'pos'))
nx.draw_networkx_edge_labels(G,nx.get_node_attributes(G, 'pos'),edge_labels=labels)
nx.draw_networkx_labels(G,nx.get_node_attributes(G, 'pos'))

df_x = pd.DataFrame(data=np.array([[np.nan, np.nan, np.nan, np.nan],
                                   [1, np.nan, np.nan, np.nan],
                                   [1.5, 4, np.nan, np.nan],
                                   [2, 5, 3, 4]]),
                    index=[1, 2, 3, 4], columns=[1, 2, 3, 4])
print(df_x)
df_y = pd.DataFrame(data=np.array([[np.nan, np.nan, np.nan, np.nan],
                                   [0, np.nan, np.nan, np.nan],
                                   [0, -1, np.nan, np.nan],
                                   [0, 0, 1, 1]]),
                    index=[1, 2, 3, 4], columns=[1, 2, 3, 4])

def get_data_x(i):
    return df_x.loc[i]

def get_data_y(i):
    return df_y.loc[i]

def animate(i):
    x = get_data_x(i)
    y= get_data_y(i)
    graph.set_data(x,y)
    return graph,

# Set up formatting for the movie files
Writer = animation.writers['ffmpeg']
writer = Writer(fps=15, metadata=dict(artist='Me'), bitrate=1800)

fig = plt.figure(figsize=(20,20))
ax = plt.axes()
nx.draw(G,nx.get_node_attributes(G, 'pos'),node_size = 1)
n_steps = df_x.index
graph, = plt.plot([],[],'o')
ani = animation.FuncAnimation(fig, animate, frames= n_steps, interval=1, repeat=True, blit = True)
ani.save('path/file.mp4', writer=writer)
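One thing that might reduce per-frame overhead, independent of where the animation is written, is to convert the frames to NumPy arrays once instead of calling `.loc` on a DataFrame in every frame. This is only a sketch: the network background is left out, the names `x_frames`, `y_frames`, and `dots` are made up, and whether it helps noticeably at 8000 x 800 is an assumption that has not been measured here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import animation

df_x = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, np.nan],
                              [1, np.nan, np.nan, np.nan],
                              [1.5, 4, np.nan, np.nan],
                              [2, 5, 3, 4]]),
                    index=[1, 2, 3, 4], columns=[1, 2, 3, 4])
df_y = pd.DataFrame(np.array([[np.nan, np.nan, np.nan, np.nan],
                              [0, np.nan, np.nan, np.nan],
                              [0, -1, np.nan, np.nan],
                              [0, 0, 1, 1]]),
                    index=[1, 2, 3, 4], columns=[1, 2, 3, 4])

# Convert once; each frame is then a plain array row lookup
x_frames = df_x.to_numpy()
y_frames = df_y.to_numpy()

fig, ax = plt.subplots()
ax.set_xlim(0, 7)
ax.set_ylim(-2, 2)
dots, = ax.plot([], [], 'o')

def animate(frame_idx):
    # frame_idx is a position (0..n-1), not an index label
    dots.set_data(x_frames[frame_idx], y_frames[frame_idx])
    return dots,

ani = animation.FuncAnimation(fig, animate, frames=len(x_frames), blit=True)
```

Saving works as before, for example with `ani.save(...)` and an ffmpeg writer.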
Dynamic threshold line in a bar chart
I have a stacked bar chart where I want to add a dynamic threshold line. The threshold is calculated via a simple formula (90% of each specific value). Graphic attached; the green line is what I am looking for. I'd welcome any ideas on how to approach this problem.
Here is what I came up with. The idea was to have a continuous segment of Xs projected with a constant y value with a 0.5 excess before and after:
import numpy as np
import matplotlib.pyplot as plt

groups = 9
X = list(range(1, groups))
y = [1, 1, 2, 2, 1, 2, 1, 1]
threshold_interval_x = np.arange(min(X) - 0.5, max(X) + 0.5, 0.01).tolist()
threshold_y = []
for y_elt in y:
    for i in range(0, int(len(threshold_interval_x) / (groups - 1))):
        threshold_y.append(y_elt * 0.9)
plt.bar(X, y, width=0.4, align='center', color='yellow')
plt.plot(threshold_interval_x, threshold_y, color='green')
labels_X = ['PD', 'PZV', 'PP', 'FW', 'BA', 'IA', 'EA', 'NA']
plt.xticks(X, labels_X, rotation='horizontal')
plt.show()
You could use matplotlib's step function for this:
import pandas as pd
import matplotlib.pyplot as plt
Suppose your data is structured like this:
df = pd.DataFrame({'In': [1, 1, 1, 2 , 0, 2, 0, 0], 'Out': [0, 0, 1, 0, 1, 0, 1, 1]}, index=['PD', 'PZV', 'PP', 'FW', 'BA', 'IA', 'EA', 'NA'])
     In  Out
PD    1    0
PZV   1    0
PP    1    1
FW    2    0
BA    0    1
IA    2    0
EA    0    1
NA    0    1
Then plotting the bars would be
df.plot(kind='bar', stacked=True, rot=0, color=['gold', 'beige'])
and plotting the threshold line at 90% of the sum would be
plt.step(df.index, df.sum(1) * .9, 'firebrick', where='mid', label = 'Ziel: 90%')
Finally, add the legend:
plt.legend()
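To make the arithmetic behind the step line concrete, here is a small sketch that computes the 90% threshold per bar and draws it over integer bar positions. Using `range(len(df))` for the x positions is an assumption about how pandas places bars (bar i at position i), and the label text is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'In': [1, 1, 1, 2, 0, 2, 0, 0],
                   'Out': [0, 0, 1, 0, 1, 0, 1, 1]},
                  index=['PD', 'PZV', 'PP', 'FW', 'BA', 'IA', 'EA', 'NA'])

# 90% of each stacked bar's total height (row sum of In and Out)
threshold = df.sum(axis=1) * 0.9

ax = df.plot(kind='bar', stacked=True, rot=0, color=['gold', 'beige'])
# where='mid' centers each horizontal step on its bar
ax.step(range(len(df)), threshold.to_numpy(), color='firebrick',
        where='mid', label='threshold: 90%')
ax.legend()
```

For the sample data the row sums are 1, 1, 2, 2, 1, 2, 1, 1, so the line sits at 0.9 or 1.8 depending on the bar.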
Pandas Dataframe plot rows as x values and column header as y values
I have a Pandas DataFrame with displacements for different times (rows) and specific vertical locations (column names). The goal is to plot the displacements (x axis) for the vertical location (y axis) for a given time (series). According to the next example (time = 0, 1, 2, 3, 4 and vertical locations = 0.5, 1.5, 2.5, 3.5), how can the displacements be plotted for the times 0 and 3?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(88)
df = pd.DataFrame({
    'time': np.arange(0, 5, 1),
    '0.5': np.random.uniform(-1, 1, size = 5),
    '1.5': np.random.uniform(-2, 2, size = 5),
    '2.5': np.random.uniform(-3, 3, size = 5),
    '3.5': np.random.uniform(-4, 4, size = 5),
})
df = df.set_index('time')
You can filter your dataframe to only contain the desired rows, either by using the positional index
filtered = df.iloc[[0,3],:]
or by using the actual index of the dataframe,
filtered = df.iloc[(df.index == 3) | (df.index == 0),:]
You can then plot a scatter plot like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(88)
df = pd.DataFrame({
    'time': np.arange(0, 5, 1),
    '0.5': np.random.uniform(-1, 1, size = 5),
    '1.5': np.random.uniform(-2, 2, size = 5),
    '2.5': np.random.uniform(-3, 3, size = 5),
    '3.5': np.random.uniform(-4, 4, size = 5),
})
df = df.set_index('time')
filtered_df = df.iloc[[0,3],:]
#filtered_df = df.iloc[(df.index == 3) | (df.index == 0),:]
# column headers are strings, so convert them to floats for the y axis
loc = list(map(float, df.columns))
fig, ax = plt.subplots()
for row in filtered_df.iterrows():
    ax.scatter(row[1], loc, label=row[1].name)
plt.legend()
plt.show()
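If connected profiles are preferred over scattered points, a variant is to select rows by their time label with `.loc` and draw each one with `ax.plot`. This is a sketch on the same sample data; the name `depths` and the axis labels are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(88)
df = pd.DataFrame({
    'time': np.arange(0, 5, 1),
    '0.5': np.random.uniform(-1, 1, size=5),
    '1.5': np.random.uniform(-2, 2, size=5),
    '2.5': np.random.uniform(-3, 3, size=5),
    '3.5': np.random.uniform(-4, 4, size=5),
}).set_index('time')

# Column headers are strings; convert them to floats for the y axis
depths = df.columns.astype(float)

fig, ax = plt.subplots()
for t in [0, 3]:
    # df.loc[t] selects the row whose index label (time) equals t
    ax.plot(df.loc[t].to_numpy(), depths, marker='o', label=f'time = {t}')
ax.set_xlabel('displacement')
ax.set_ylabel('vertical location')
ax.legend()
```

Selecting with `.loc` keeps working if the time index is not a simple 0..n-1 range, which positional `iloc` would not.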