How can I represent the data below in a comprehensible graph? I tried groupby() from pandas, but the result is not easy to read.
My objective is to show which of the combinations below causes the most accidents.
pieton bicyclette camion_lourd vehicule
0 0 1 1
0 1 0 1
1 1 0 0
0 1 1 0
0 1 0 1
1 0 0 1
0 0 0 1
0 0 0 1
1 1 0 0
0 1 0 1
y = df.groupby(['pieton', 'bicyclette', 'camion_lourd', 'vehicule']).size()
y.unstack()
result:
Here are some visualizations that may help you:
#data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import matplotlib.pyplot as plt
columns = ['pieton', 'bicyclette', 'camion_lourd', 'vehicule']
df = pd.DataFrame([[0,0,1,1],[0,1,0,1],
                   [1,1,0,0],[0,1,1,0],
                   [0,1,0,1],[1,0,0,1],
                   [0,0,0,1],[0,0,0,1],
                   [1,1,0,0],[0,1,0,1]], columns = columns)
You can start by looking at the number of accidents involving each category:
# Set up a grid of plots
fig = plt.figure(figsize=(10,10))
fig_dims = (3, 2)
# Plot accidents depending on type
plt.subplot2grid(fig_dims, (0, 0))
df['pieton'].value_counts().plot(kind='bar',
                                 title='Pieton')
plt.subplot2grid(fig_dims, (0, 1))
df['bicyclette'].value_counts().plot(kind='bar',
                                     title='bicyclette')
plt.subplot2grid(fig_dims, (1, 0))
df['camion_lourd'].value_counts().plot(kind='bar',
                                       title='camion_lourd')
plt.subplot2grid(fig_dims, (1, 1))
df['vehicule'].value_counts().plot(kind='bar',
                                   title='vehicule')
Which gives:
Or if you prefer:
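# pd.value_counts is deprecated in recent pandas; pd.Series.value_counts does the same per column
df.apply(pd.Series.value_counts).plot(kind='bar',
                                      title='all types')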
But, more interestingly, I would do a comparison per pair. For example, for pedestrians:
# For each category, count the accidents that also involved a pedestrian
pieton = {}
for col in columns:
    pieton[col] = np.sum(df.pieton[df[col] == 1])
pieton.pop('pieton', None)
# Convert the dict views to lists so matplotlib accepts them in Python 3
plt.bar(range(len(pieton)), list(pieton.values()), align='center')
plt.xticks(range(len(pieton)), list(pieton.keys()))
plt.title("Who had an accident with a pedestrian?")
plt.show()
Which gives:
A similar plot can be made for bicycles, trucks and cars.
It would be interesting to have more data points, to be able to draw better conclusions. However, this still tells us to watch out for bicycles if you are driving!
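If you want all the pairs at once rather than one plot per category, here is a minimal sketch (not part of the plots above) that builds a co-occurrence table with a matrix product. It assumes the ten rows listed in the question; the names `cooc` and `combos` are just illustrative.

```python
import pandas as pd

columns = ["pieton", "bicyclette", "camion_lourd", "vehicule"]
# The ten rows listed in the question
rows = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 1],
        [1, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
df = pd.DataFrame(rows, columns=columns)

# Entry (a, b) counts the accidents that involve both a and b;
# the diagonal holds the total number of accidents per category.
cooc = df.T.dot(df)
print(cooc)

# Exact combinations of the four flags, most frequent first
combos = df.groupby(columns).size().sort_values(ascending=False)
print(combos.head(3))
```

The `cooc` table can then be drawn as a heatmap (for example with `plt.imshow(cooc)`), which shows every pair in a single figure.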
Hope this helped!
Related
I have molecular dynamics simulation data. The system has 254 solute molecules and almost 12000 water molecules, and the simulation has almost 4700 frames. I have extracted the H-bond data: for each frame, a solute molecule gets 1 if it forms an H-bond with any water molecule and 0 otherwise. I want to plot this H-bond data, so in total there are 254*4700 data points. The data looks like the following example:
S1 S2 S3 S4 S5 ...
0 0 0 0 0 ...
0 0 0 0 0 ...
0 1 1 0 0 ...
0 0 0 0 0 ...
0 0 1 1 1 ...
0 0 0 0 1 ...
0 1 0 0 1 ...
0 0 0 0 1 ...
...
I want a plot in which a data point equal to 1 is shown in a color and a data point equal to 0 gets no color (as in a scatter plot, for example). I also want the two axes to be
x-axis = solute number (1 ... 254)
y-axis = frame number (1 ... 4700)
so that, for each solute on the x-axis, only the frames on the y-axis where its value is 1 are colored.
Any help would be highly appreciated. Many thanks!
I would suggest plt.imshow for this task:
import matplotlib.pyplot as plt
import numpy as np
solutes = 254
frames = 4700
data = np.round(np.random.rand(frames, solutes))
plt.imshow(data, aspect='auto', interpolation='none')
plt.show()
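To get closer to what the question asks for (one color for 1, nothing for 0, solutes on the x-axis and frames on the y-axis), here is a sketch along the same lines. The two-color map, the `extent` values and the helper name `plot_hbonds` are my own choices, and the data is a random stand-in, not real H-bond data.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_hbonds(data):
    """Draw a (frames x solutes) 0/1 array: 1 is coloured, 0 is left white."""
    frames, solutes = data.shape
    fig, ax = plt.subplots(figsize=(8, 6))
    cmap = ListedColormap(["white", "tab:blue"])  # index 0 -> white, 1 -> blue
    ax.imshow(data, cmap=cmap, vmin=0, vmax=1, aspect="auto",
              interpolation="none", origin="lower",
              # Shift the axes so ticks read 1..solutes and 1..frames
              extent=(0.5, solutes + 0.5, 0.5, frames + 0.5))
    ax.set_xlabel("Solute index")
    ax.set_ylabel("Frame")
    return ax

# Random stand-in data (assumption): roughly 5% of the entries are 1
rng = np.random.default_rng(0)
data = (rng.random((4700, 254)) < 0.05).astype(int)
ax = plot_hbonds(data)
```

Call `plt.show()` or `ax.figure.savefig(...)` afterwards. With 4700 rows there are more frames than pixels at this figure size, so isolated 1s may be dropped when rendering; a taller figure or a higher dpi helps.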
I have a numpy array of a fixed size holding irregularly spaced data. An example would be:
[1 0 0 0 3 0 0 0 2 0
0 1 0 0 0 0 0 0 2 0
0 1 0 0 1 0 6 0 9 0
0 0 0 0 6 0 3 0 0 1]
I want to keep the array the same shape, but have all the 0 values overwritten with data interpolated from the points that do have data. If the data points in the array are thought of as height values, this would essentially be creating a surface over the points.
I have been trying to use scipy.interpolate.griddata but am continually getting errors. I start with an array of my known data points, as [x, y, value]. For the above, (first row only for brevity)
data = np.array([[0, 0, 1],
                 [0, 3, 3],
                 [0, 8, 2], ....................
I then define
points = (data[:,0], data[:,1])
values = (data[:,2])
Next, I define the points to sample at (in this case, the grid I desire)
grid = np.indices((4,10))
Finally, call griddata
t = interpolate.griddata(points, values, grid, method = 'linear')
This returns the following error
ValueError: number of dimensions in xi does not match x
Am I using the wrong function?
Thanks!
Solved: the points to sample at must be passed as a tuple of coordinate arrays, one per dimension:
t = interpolate.griddata(points, values, (grid[0,:,:], grid[1,:,:]), method = 'linear')
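For completeness, here is a self-contained sketch of the whole procedure on the example array from the question. It assumes that 0 means "no data", and the nearest-neighbour fallback for cells outside the convex hull is my addition, not something the question asked for.

```python
import numpy as np
from scipy import interpolate

arr = np.array([[1, 0, 0, 0, 3, 0, 0, 0, 2, 0],
                [0, 1, 0, 0, 0, 0, 0, 0, 2, 0],
                [0, 1, 0, 0, 1, 0, 6, 0, 9, 0],
                [0, 0, 0, 0, 6, 0, 3, 0, 0, 1]], dtype=float)

known = arr != 0                        # assumption: 0 marks a missing value
points = np.argwhere(known)             # (n, 2) array of (row, col) positions
values = arr[known]                     # known values, in the same order
grid_r, grid_c = np.indices(arr.shape)  # query coordinates, each of shape (4, 10)

# xi is a tuple with one coordinate array per dimension
filled = interpolate.griddata(points, values, (grid_r, grid_c), method="linear")

# Linear interpolation returns NaN outside the convex hull of the known
# points; fill those cells with the nearest known value instead.
nearest = interpolate.griddata(points, values, (grid_r, grid_c), method="nearest")
filled = np.where(np.isnan(filled), nearest, filled)
```

`filled` has the same shape as `arr`, and the originally known cells keep their values.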
I have 2 models implemented with the same algorithm but with a different number of features, and thus 2 different confusion matrices.
I would like to see which predicted items are the same for both models and plot that overlap in a Venn diagram.
Answer
import numpy as np
import pandas as pd
data = {"Mod1":[1,0,1,1,0,0,0,1,1,1],"Mod2":[1,0,1,0,1,0,0,1,0,1]}
df = pd.DataFrame(data)
# 1 where both models made the same prediction, 0 otherwise
df["Similar"] = np.where(df["Mod1"]==df["Mod2"],1,0)
df.head()
#output
   Mod1  Mod2  Similar
0     1     1        1
1     0     0        1
2     1     1        1
3     1     0        0
4     0     1        0
This should do the job
Visualization
# !pip install matplotlib-venn
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
venn2(subsets = (3, 3, 7), set_labels = ('Mod1', 'Mod2'))
plt.show()
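Instead of typing the subset sizes by hand, they can be derived from the predictions. This is only a sketch under one assumption: each circle is taken to be the set of items that a model predicts as class 1. If your notion of "similar" is different (for example, agreement on both classes), the sets have to be built differently.

```python
import pandas as pd

df = pd.DataFrame({"Mod1": [1, 0, 1, 1, 0, 0, 0, 1, 1, 1],
                   "Mod2": [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]})

# Assumption: each circle is the set of row labels predicted as class 1
set1 = set(df.index[df["Mod1"] == 1])
set2 = set(df.index[df["Mod2"] == 1])

# venn2 expects the sizes as (only in A, only in B, in both)
subsets = (len(set1 - set2), len(set2 - set1), len(set1 & set2))
print(subsets)  # (2, 1, 4)
```

The tuple can be passed as `venn2(subsets=subsets, set_labels=('Mod1', 'Mod2'))`; `venn2` also accepts the two sets directly, as `venn2([set1, set2], ...)`.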
I know that you can use the mosaic plot from statsmodels, but it is a bit frustrating when your categories have some empty values (like here). I was wondering whether a solution exists with a graphics library like matplotlib or seaborn, which would be handier.
I think it would be a nice feature for seaborn, as contingency tables are frequently built with pandas. However, it seems that it won't be implemented anytime soon.
Finally, how can I make a mosaic plot with 3 dimensions and possibly empty categories?
Here is a generic mosaic plot (from wikipedia)
As nothing existed in Python, here is the code I wrote. The last dimension should be of size 1 (i.e. a regular table) or 2 for now. Feel free to update the code to lift that restriction, though the plot might become unreadable with more than 3 dimensions.
It's a bit long but it does the job. Example below.
There are a few options. Most are self-explanatory; the others are:
dic_color_row: a dictionary whose keys are the outer-most index values (Index_1 in the example below) and whose values are colors; avoid black/gray colors
pad: the space between the bars of the plot
alpha_label: the 3rd dimension is distinguished using transparency (alpha); it is rendered as dark grey / light grey in the legend, and this option changes the name of each legend label (similar to col_labels or row_labels)
color_ylabel: whether to add a background color to the y-tick labels [True/False]
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def mosaic_plot(df, dic_color_row, row_labels=None, col_labels=None, alpha_label=None, top_label="Size",
                x_label=None, y_label=None, pad=0.01, color_ylabel=False, ax=None, order="Size"):
    """
    From a contingency table NxM, plot a mosaic plot with the values inside. There should be a double-index for rows
    e.g.
                         3   4   1   0   2  5
        Index_1 Index_2
        AA      C        0   0   0   2   3  0
                P        6   0   0  13   0  0
        BB      C        0   2   0   0   0  0
                P       45   1  10  10   1  0
        CC      C        0   6  35  15  29  0
                P        1   1   0   2   0  0
        DD      C        0  56   0   3   0  0
                P       30   4   2   0   1  9
    order: how columns are ordered, by default from the biggest to the smallest category. Possible values are
        - "Size" [default]
        - "Normal" : as the columns are ordered in the input df
        - list of column names to reorder the columns
    top_label: Size of each column. The label can be changed to adapt to your values.
        If `False`, nothing is displayed and the secondary legend is set on top instead of on the right.
    """
    is_multi = len(df.index.names) == 2
    if ax is None:
        fig, ax = plt.subplots(1,1, figsize=(len(df.columns), len(df.index.get_level_values(0).unique())))
    # Column totals and their share of the grand total (used as bar widths)
    size_col = df.sum().sort_values(ascending=False)
    prop_com = size_col.div(size_col.sum())
    if order == "Size":
        df = df[size_col.index.values]
    elif order == "Normal":
        prop_com = prop_com[df.columns]
        size_col = size_col[df.columns]
    else:
        df = df[order]
        prop_com = prop_com[order]
        size_col = size_col[order]
    if is_multi:
        inner_index = df.index.get_level_values(1).unique()
        # Share of the first inner category within each (outer index, column) cell
        prop_ii0 = (df.swaplevel().loc[inner_index[0]]/(df.swaplevel().loc[inner_index[0]]+df.swaplevel().loc[inner_index[1]])).fillna(0)
        alpha_ii = 0.5
        true_y_labels = df.index.levels[0]
    else:
        alpha_ii = 1
        true_y_labels = df.index
    Yt = (df.groupby(level=0).sum().iloc[:,0].div(df.groupby(level=0).sum().iloc[:,0].sum())+pad).cumsum() - pad
    Ytt = df.groupby(level=0).sum().iloc[:,0].div(df.groupby(level=0).sum().iloc[:,0].sum())
    x = 0
    # .items() replaces .iteritems(), which was removed in pandas 2.0
    for j in df.groupby(level=0).sum().items():
        bot = 0
        S = float(j[1].sum())
        for lab, k in j[1].items():
            ax.bar(x, k/S, width=prop_com[j[0]], bottom=bot, color=dic_color_row[lab], alpha=alpha_ii, lw=0, align="edge")
            if is_multi:
                ax.bar(x, k/S, width=prop_com[j[0]]*prop_ii0.loc[lab, j[0]], bottom=bot, color=dic_color_row[lab], lw=0, alpha=1, align="edge")
            bot += k/S + pad
        x += prop_com[j[0]] + pad
    ## Aesthetic of the plot and ticks
    # Y-axis
    if row_labels is None:
        row_labels = Yt.index
    ax.set_yticks(Yt - Ytt/2)
    ax.set_yticklabels(row_labels)
    ax.set_ylim(0, 1 + (len(j[1]) - 1) * pad)
    if y_label is None:
        y_label = df.index.names[0]
    ax.set_ylabel(y_label)
    # X-axis
    if col_labels is None:
        col_labels = prop_com.index
    xticks = (prop_com + pad).cumsum() - pad - prop_com/2.
    ax.set_xticks(xticks)
    ax.set_xticklabels(col_labels)
    ax.set_xlim(0, prop_com.sum() + pad * (len(prop_com)-1))
    if x_label is None:
        x_label = df.columns.name
    ax.set_xlabel(x_label)
    # Top label
    if top_label:
        ax2 = ax.twiny()
        ax2.set_xlim(*ax.get_xlim())
        ax2.set_xticks(xticks)
        ax2.set_xticklabels(size_col.values.astype(int))
        ax2.set_xlabel(top_label)
        ax2.tick_params(top=False, right=False, pad=0, length=0)
    # Ticks and axis settings
    ax.tick_params(top=False, right=False, pad=5)
    sns.despine(left=0, bottom=False, right=0, top=0, offset=3)
    # Legend
    if is_multi:
        if alpha_label is None:
            alpha_label = inner_index
        # Empty bars used only as legend handles (dark grey / light grey)
        bars = [ax.bar(np.nan, np.nan, color="0.2", alpha=[1, 0.5][b]) for b in range(2)]
        if top_label:
            plt.legend(bars, alpha_label, loc='center left', bbox_to_anchor=(1, 0.5), ncol=1, )
        else:
            plt.legend(bars, alpha_label, loc="lower center", bbox_to_anchor=(0.5, 1), ncol=2)
    plt.tight_layout(rect=[0, 0, .9, 0.95])
    if color_ylabel:
        for tick, label in zip(ax.get_yticklabels(), true_y_labels):
            tick.set_bbox(dict( pad=5, facecolor=dic_color_row[label]))
            tick.set_color("w")
            tick.set_fontweight("bold")
    return ax
With a dataframe you get after a crosstabulation:
df
Index_1 Index_2 v w x y z
AA Q 0 0 0 2 3
AA P 6 0 0 13 0
BB Q 0 2 0 0 0
BB P 45 1 10 10 1
CC Q 0 6 0 15 9
CC P 0 1 0 2 0
DD Q 0 56 0 3 0
DD P 30 4 2 0 1
make sure that you have the 2 columns as index:
df.set_index(["Index_1", "Index_2"], inplace=True)
and then just call:
mosaic_plot(df,
{"AA":"r", "BB":"b", "CC":"y", "DD":"g"}, # dict of color, mandatory
x_label='My Category',
)
It's not perfect, but I hope it will help others.
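In case it is not obvious how to get such a double-indexed table, here is a small sketch using pd.crosstab. The raw records and the column name `cat` are made up for illustration; the reindex step is one possible way to keep combinations that never occur (empty categories) as rows of zeros.

```python
import pandas as pd

# Hypothetical raw observations, one row per record
raw = pd.DataFrame({
    "Index_1": ["AA", "AA", "BB", "BB", "BB", "CC"],
    "Index_2": ["P",  "Q",  "P",  "P",  "Q",  "Q"],
    "cat":     ["v",  "y",  "v",  "x",  "w",  "w"],
})

# Two row keys give a two-level row index; columns are the values of "cat"
table = pd.crosstab([raw["Index_1"], raw["Index_2"]], raw["cat"])

# crosstab drops unobserved (Index_1, Index_2) pairs such as (CC, P);
# reindexing on the full product restores them as rows of zeros.
full_index = pd.MultiIndex.from_product(table.index.levels,
                                        names=table.index.names)
table_full = table.reindex(full_index, fill_value=0)
print(table_full)
```

`table_full` already has the two index levels set, so no `set_index` call is needed before handing it to a plotting function.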
I'm creating a checkerboard pattern as follows:
def CheckeredBoard( x=10 , y=10 , sq=2 , xmax = None , ymax = None ):
    coords = np.ogrid[0:x , 0:y]
    idx = (coords[0] // sq + coords[1] // sq) % 2
    if xmax is not None: idx[xmax:] = 0
    if ymax is not None: idx[:, ymax:] = 0
    return idx
ch = CheckeredBoard( 100, 110 , 10 )
plt.imshow( ch )
What I would like is to add a number to some of the boxes, so that when I run plt.imshow( ch ) the numbers are part of the image.
The only way I can think of doing this is to annotate the plot, save the image, and load the annotated image back, but this seems really messy.
For example, a successful matrix would look like:
1 1 1 1 0 0 0 0
1 1 0 1 0 0 0 0
1 0 0 1 0 0 0 0
1 1 0 1 0 0 0 0
1 0 0 0 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 0 0 0 1 1 1 1
0 0 0 0 1 1 0 1
0 0 0 0 1 0 1 0
0 0 0 0 1 0 0 0
0 0 0 0 1 0 1 0
0 0 0 0 1 1 0 1
0 0 0 0 1 1 1 1
The matrix above has a 1 and an 8 in the two corners.
Appreciate any help, let me know if you want additional information.
Thanks
EDIT
Here is something closer to what I'd actually like to end up with.
Red circles added for emphasis.
What about using PIL / Pillow?
import numpy as np
import pylab
from PIL import Image, ImageDraw, ImageFont
#-- your data array
xs = np.zeros((20,20))
#-- prepare the text drawing
img = Image.fromarray(xs)
d = ImageDraw.Draw(img)
d.text( (2,2), "4", fill=255)
#-- back to array
ys = np.asarray(img)
#-- just show
pylab.imshow(ys)
how about something like this?
n = 8
board = [[(i+j)%2 for i in range(n)] for j in range(n)]
from matplotlib import pyplot
fig = pyplot.figure()
ax = fig.add_subplot(1,1,1)
ax.imshow(board, interpolation="nearest")
from random import randint
for _ in range(10):
    i = randint(0, n-1)
    j = randint(0, n-1)
    number = randint(0,9)
    ax.annotate(str(number), xy=(i,j), color="white")
pyplot.show()
Obviously you'll have your own way of choosing the numbers and where to put them; other than that, the annotate functionality has everything you need.
You might need to offset the numbers. In that case you can either use a fixed size and work out how much to offset them by, or work out the bounding box of each square and offset them that way.
For colouring the numbers you also have a few options: you can use one standard colour for all of them, or pick the colour according to the square each number sits on:
for _ in range(10):
    i = randint(0, n-1)
    j = randint(0, n-1)
    number = randint(0,9)
    colour = "red"
    if (i+j)%2 == 1:
        colour = "blue"
    ax.annotate(str(number), xy=(i,j), color=colour)
But honestly, I think the white option is more readable.
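For a board whose squares are several pixels wide, like the one in the question, the offset can be worked out from the square size. Here is a rough sketch: the `labels` dictionary and the helper name `checkered_board` are made up for illustration, and `ax.text` with centred alignment is used in place of `annotate`.

```python
import numpy as np
import matplotlib.pyplot as plt

def checkered_board(x=10, y=10, sq=2):
    """0/1 checkerboard of shape (x, y) with squares of sq x sq pixels."""
    rows, cols = np.ogrid[0:x, 0:y]
    return (rows // sq + cols // sq) % 2

sq = 10
board = checkered_board(100, 110, sq)

fig, ax = plt.subplots()
ax.imshow(board, cmap="gray", interpolation="nearest")

# Hypothetical labels: (square row, square column) -> number to draw
labels = {(0, 0): 1, (9, 10): 8}
for (r, c), number in labels.items():
    # Pixel centre of square (r, c); imshow puts x along columns, y along rows
    cx = c * sq + (sq - 1) / 2
    cy = r * sq + (sq - 1) / 2
    ax.text(cx, cy, str(number), color="red", ha="center", va="center")
```

Call `plt.show()` to display it. Because the text is centred on the square, no further manual offset should be needed.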