I have a binary data set of 0 and 1, where 0 is an absence and 1 is a presence of an event.
A sample of the data set looks like this:
events germany Italy
Rain 0 1
hail 1 0
sunny 0 0
I'd like to get a red and white picture of this data in the form of a heat map by reading the data from a file.
Edit: In response to comments below, here is a sample data file (saved on disk as "data.txt"):
Rain 0 0 0 0 1 0 1 0 0 1
Hail 0 1 0 0 0 0 0 1 0 0
Sunny 1 1 1 0 1 0 1 0 1 1
In Python, we can read the labels and plot this "heatmap" with:
from numpy import loadtxt
import pylab as plt
labels = loadtxt("data.txt", usecols=[0,],dtype=str)
A = loadtxt("data.txt", usecols=range(1, 11))  # columns 1-10 hold the ten values
plt.imshow(A, interpolation='nearest', cmap=plt.cm.Reds)
plt.yticks(range(A.shape[0]), labels)
plt.show()
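For readers without NumPy at hand, the same file layout can be parsed with the standard library alone. The following is a minimal sketch, not part of the original answer; the function name `parse_presence_table` is made up for illustration, and it assumes every line is a label followed by whitespace-separated 0/1 values.

```python
def parse_presence_table(text):
    """Split 'label v1 v2 ...' lines into a label list and a 0/1 matrix."""
    labels, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:          # skip blank lines
            continue
        labels.append(fields[0])
        rows.append([int(v) for v in fields[1:]])
    # Every row must have the same number of columns to form a matrix.
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows have different lengths")
    return labels, rows


# Same content as the sample data.txt shown in the question.
sample = """Rain 0 0 0 0 1 0 1 0 0 1
Hail 0 1 0 0 0 0 0 1 0 0
Sunny 1 1 1 0 1 0 1 0 1 1"""

labels, rows = parse_presence_table(sample)
print(labels)        # the three event names
print(len(rows[0]))  # number of value columns per event
```

The nested list can be passed to plt.imshow in place of A. For a strictly two-colour picture, matplotlib.colors.ListedColormap(["white", "red"]) is one option for the cmap argument.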
See ?image. With your data
dat <- data.matrix(data.frame(Germany = c(0,1,0), Italy = c(1,0,0)))
rownames(dat) <- c("Rain","Hail","Sunny")
This gets us close:
image(z = dat, col = c("white","red"))
but better handling of axis labels would be nice... Try:
op <- par(mar = c(5,5,4,2) + 0.1)
image(z = dat, col = c("white","red"), axes = FALSE)
axis(side = 1, labels = rownames(dat),
at = seq(0, by = 0.5, length.out = nrow(dat)))
axis(side = 2, labels = colnames(dat), at = c(0,1), las = 1)
box()
par(op)
Which gives
To have the heatmap the other way round, transpose dat (image(z = t(dat), ....)) and, in the axis() calls, change side to 2 in the first call and to 1 in the second (and move las = 1 to the other call), i.e.:
op <- par(mar = c(5,5,4,2) + 0.1)
image(z = t(dat), col = c("white","red"), axes = FALSE)
axis(side = 2, labels = rownames(dat),
     at = seq(0, by = 0.5, length.out = nrow(dat)), las = 1)
axis(side = 1, labels = colnames(dat), at = c(0,1))
box()
par(op)
With reshape and ggplot2 in R
library(reshape)
library(ggplot2)
dat <- data.frame(weather=c("Rain","Hail","Sunny"), Germany = c(0,1,0), Italy = c(1,0,0))
melt.data<-melt(dat, id.vars="weather", variable_name="country")
qplot(data=melt.data,
x=country,
y=weather,
fill=factor(value),
geom="tile")+scale_fill_manual(values=c("0"="white", "1"="red"))
In R, try:
library(bipartite)
mat<-matrix(c(0,1,1,0,1,1),byrow=TRUE,nrow=3)
rownames(mat)<-c("Rain","hail","sunny")
colnames(mat)<-c("Germany","Italy")
visweb(mat,type="None")
For red squares and control over the label size:
visweb(mat,type="None",labsize=2,square="b",box.col="red")
Probably the simplest solution in base R is:
rownames(dat) = dat$weather
heatmap(as.matrix(dat[,2:3]), scale='none')
... assuming that your data frame is called dat. The heatmap is not pretty but it's quick and easy. The first line is not necessary. It only serves to make the weather labels show in the heatmap.
I've got a table with data from which I'd like to show the interaction in an informative way.
I have counted the interactions between different people, and inputted this in a table, which looks like this:
Ideally, I'd like to visualise this data in interesting ways (if you know more, please let me know!). I found these things, and I'd like to create one from this data myself.
I found some tutorials online, but I can't get them to work because I am unable to load my data into a NetworkX graph correctly: when iterating through the table, I end up attaching the wrong ends to each other or skipping data.
data:
   A  B  C  D  E  F
A  x  2  1  3  0  0
B  2  x  0  4  5  1
C  1  0  x  3  0  2
D  3  4  3  x  1  1
E  0  5  0  1  x  1
F  0  1  2  1  1  x
Best-Effort code:
import matplotlib.pyplot as plt
import networkx as nx
import matplotlib
namelist = []
for i in range(0,len(systeem)):
    namelist.append(systeem.iloc[i,0])
G=nx.Graph()
G.add_nodes_from(namelist)
weightlist=[]
for i in range(0,len(namelist)):
    for j in range(1,len(namelist)):
        if int(systeem.iloc[i,j]) > 0:
            W=int(systeem.iloc[i,j])
            weightlist.append(W)
            G.add_edge(namelist[i-1],namelist[j], weight= W)
        else:
            continue
plt.figure(figsize=(40,40))
pos = nx.circular_layout(G)
cmap = matplotlib.cm.get_cmap('plasma_r')
nx.draw_networkx(G, pos, width=1, node_color="blue", edge_cmap=cmap, with_labels=False)
labels_pos = {name:[pos_list[0], pos_list[1]-0.04] for name, pos_list in pos.items()}
nx.draw_networkx_labels(G, labels_pos, font_size=40, font_family="sans-serif", font_color="#000000", font_weight="bold")
ax = plt.gca()
ax.margins(0.25)
plt.axis("equal")
plt.tight_layout()
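One way to avoid attaching the wrong ends is to separate reading the table from building the graph. The sketch below is an illustration under assumptions rather than a fix of the code above: it takes the names and rows as plain Python lists (the helper name `edges_from_matrix` is hypothetical), treats `x` as "no self-interaction", and visits each unordered pair once because the table is symmetric.

```python
def edges_from_matrix(names, rows):
    """Return (source, target, weight) for every positive upper-triangle cell."""
    edges = []
    for i, row in enumerate(rows):
        # Start at i + 1 so the diagonal ('x') and the mirrored cells are skipped.
        for j in range(i + 1, len(names)):
            weight = int(row[j])
            if weight > 0:
                edges.append((names[i], names[j], weight))
    return edges


names = ["A", "B", "C", "D", "E", "F"]
rows = [
    ["x", 2, 1, 3, 0, 0],
    [2, "x", 0, 4, 5, 1],
    [1, 0, "x", 3, 0, 2],
    [3, 4, 3, "x", 1, 1],
    [0, 5, 0, 1, "x", 1],
    [0, 1, 2, 1, 1, "x"],
]

edges = edges_from_matrix(names, rows)
for source, target, weight in edges:
    print(source, target, weight)
```

With networkx, such a list can be handed to G.add_weighted_edges_from(edges), which stores the third item of each tuple under the weight attribute.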
I'd like to ask for suggestions on how to calculate the length of the gap between two data series plotted with matplotlib from a pandas DataFrame. Ideally, I would like to have these gap values written in the plot and also, if possible, included in the DataFrame.
Here is a simplified example of my dataframe:
import pandas as pd
d = {'Mean-1': [0.195842, 0.295069, 0.321345, 0.773725], 'SEM-1': [0.001216, 0.002687, 0.005267, 0.029974], 'Mean-2': [0.143103, 0.250505, 0.305767, 0.960804],'SEM-2': [0.000959, 0.001368, 0.003722, 0.150025], 'Atom Number': [1, 3, 5, 7]}
df=pd.DataFrame(d)
df
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number
0 0.195842 0.001216 0.143103 0.000959 1
1 0.295069 0.002687 0.250505 0.001368 3
2 0.321345 0.005267 0.305767 0.003722 5
3 0.773725 0.029974 0.960804 0.150025 7
Then I made a plot in which two lines represent Mean-1 and Mean-2, and a shaded area around each line represents the standard error of the mean. This is done for the selected atom numbers.
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'])
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-1']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
plt.xticks(x)
What I would like to do next is calculate the gap for each residue. The gap is the white space only, i.e. the space where neither the lines nor the shaded areas (SEMs) overlap.
I would also like to know whether I can somehow print the gap values on the plot and save them into a column. Thank you for any suggestions.
It's not a compact solution, but you could try something like this (check the order of things). First calculate all the positions (y_i and its upper and lower limits).
import numpy as np
df['y1_upper'] = y_1+error_1
df['y1_lower'] = y_1-error_1
df['y2_upper'] = y_2+error_2
df['y2_lower'] = y_2-error_2
which gives
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number y1_upper y1_lower \
0 0.195842 0.001216 0.143103 0.000959 1 0.197058 0.194626
1 0.295069 0.002687 0.250505 0.001368 3 0.297756 0.292382
2 0.321345 0.005267 0.305767 0.003722 5 0.326612 0.316078
3 0.773725 0.029974 0.960804 0.150025 7 0.803699 0.743751
y2_upper y2_lower
0 0.144319 0.141887
1 0.253192 0.247818
2 0.311034 0.300500
3 0.990778 0.930830
The distances (gaps) are calculated differently depending on whether y_1 is above y_2 or vice versa. So use conditions on the upper and lower limits and subtract the matching limits row by row. Note that np.linalg.norm is not suitable here, since it reduces a whole column to a single number. A negative value would mean that the two bands overlap.
conditions = [
    (df['y1_lower'] >= df['y2_upper']),
    (df['y1_lower'] < df['y2_upper'])]
choices = [df['y1_lower'] - df['y2_upper'], df['y2_lower'] - df['y1_upper']]
df['dist'] = np.select(conditions, choices)
This gives
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number y1_upper y1_lower \
0 0.195842 0.001216 0.143103 0.000959 1 0.197058 0.194626
1 0.295069 0.002687 0.250505 0.001368 3 0.297756 0.292382
2 0.321345 0.005267 0.305767 0.003722 5 0.326612 0.316078
3 0.773725 0.029974 0.960804 0.150025 7 0.803699 0.743751
y2_upper y2_lower dist
0 0.144319 0.141887 0.050307
1 0.253192 0.247818 0.039190
2 0.311034 0.300500 0.005044
3 0.990778 0.930830 0.127131
As I said, check the order, but this is a possible solution.
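The per-row gap can also be written without branching: the gap between two intervals is the larger lower bound minus the smaller upper bound, clipped at zero when they overlap. A minimal standard-library sketch follows. The function name `band_gap` is made up here, and it assumes the second band should use SEM-2 (the snippets above use SEM-1 for both bands), so its numbers differ from the table above.

```python
def band_gap(mean_1, sem_1, mean_2, sem_2):
    """White space between mean_1 +/- sem_1 and mean_2 +/- sem_2 (0 if they overlap)."""
    lower = max(mean_1 - sem_1, mean_2 - sem_2)   # higher of the two lower edges
    upper = min(mean_1 + sem_1, mean_2 + sem_2)   # lower of the two upper edges
    return max(0.0, lower - upper)


rows = [
    # (atom number, Mean-1, SEM-1, Mean-2, SEM-2), copied from the question
    (1, 0.195842, 0.001216, 0.143103, 0.000959),
    (3, 0.295069, 0.002687, 0.250505, 0.001368),
    (5, 0.321345, 0.005267, 0.305767, 0.003722),
    (7, 0.773725, 0.029974, 0.960804, 0.150025),
]

gaps = {atom: band_gap(m1, s1, m2, s2) for atom, m1, s1, m2, s2 in rows}
for atom, gap in gaps.items():
    print(atom, round(gap, 6))
```

With pandas, the result could be stored as a new column, for example by applying the function row-wise, and each value could be written on the plot with ax.annotate.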
IIUC, you want something like this:
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'], figsize=(15,8))
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-1']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
ax.fill_between(df['Atom Number'], y_1+error_1, y_2-error_2, alpha=.2, edgecolor='k', facecolor='blue')
for i in range(len(x)):
    gap = y_1[i]+error_1[i] - y_2[i]-error_2[i]
    ylabel = min(y_1[i], y_2[i]) + abs(gap) / 2
    _ = ax.annotate(f'{gap:0.4f}', xy=(x[i],ylabel), xytext=(x[i]-.14,y_1[i]+gap/abs(gap)*.2), arrowprops=dict(arrowstyle="-"))
plt.xticks(x);
Output:
This Python script:
import numpy as np
import vtk
from vtk.util.numpy_support import numpy_to_vtk
# Open a file, and create an unstructured grid.
filename = 'example.vtk'
writer = vtk.vtkUnstructuredGridWriter()
writer.SetFileName(filename)
grid = vtk.vtkUnstructuredGrid()
# Create 3 points
A,B,C = (0,0,0), (0,1,0), (1,0,0)
points = np.array( (A,B,C) )
vtk_points = vtk.vtkPoints()
vtk_points.SetData( numpy_to_vtk(points) )
grid.SetPoints(vtk_points)
# Cells: just 1 triangle
ntriangles = 1
npoints_per_triangle = 3
cells = np.array( [npoints_per_triangle, 0, 1, 2] )
vtk_cells = vtk.vtkCellArray()
id_array = vtk.vtkIdTypeArray()
id_array.SetVoidArray(cells, len(cells), 1)
vtk_cells.SetCells(ntriangles, id_array)
# Cell types: just 1 triangle.
cell_types = np.array( [vtk.VTK_TRIANGLE] , 'B')
vtk_cell_types = numpy_to_vtk(cell_types)
# Cell locations: the triangle is in `cells` at index 0.
cell_locations = np.array( [0,])
vtk_cell_locations = numpy_to_vtk(cell_locations, deep=1,
array_type=vtk.VTK_ID_TYPE)
# Cells: add to grid
grid.SetCells(vtk_cell_types, vtk_cell_locations, vtk_cells)
data = grid.GetCellData()
# Add scalar data to the triangle
data.SetActiveScalars('foo')
foo = np.array( [11.,] )
vtk_foo = numpy_to_vtk(foo)
vtk_foo.SetName("foo")
data.SetScalars(vtk_foo)
# Add other scalar data to the triangle
data.SetActiveScalars('bar')
bar = np.array( [12.,] )
vtk_bar = numpy_to_vtk(bar)
vtk_bar.SetName("bar")
data.SetScalars(vtk_bar)
# Write to file.
writer.SetInput(grid)  # VTK 6 and later: writer.SetInputData(grid)
writer.Write()
print(open(filename).read())
produces the file:
# vtk DataFile Version 3.0
vtk output
ASCII
DATASET UNSTRUCTURED_GRID
POINTS 3 long
0 0 0 0 1 0 1 0 0
CELLS 1 4
3 0 1 2
CELL_TYPES 1
5
CELL_DATA 1
SCALARS bar double
LOOKUP_TABLE default
12
FIELD FieldData 1
foo 1 1 double
11
But I want the CELL_DATA section to be:
CELL_DATA 1
SCALARS foo double
LOOKUP_TABLE default
11
SCALARS bar double
LOOKUP_TABLE default
12
Edit
Looking at the source code (WriteCellData, WriteScalarData and deeper), it seems impossible.
You can add as many arrays as you want by using AddArray instead of SetActiveScalars.
See also http://public.kitware.com/pipermail/vtkusers/2004-August/026366.html
http://www.vtk.org/doc/nightly/html/classvtkCellData-members.html
From what I've read, vtk can't write multiple SCALARS, but it can read them. (What a good API!)
I'll continue using the good old pyvtk (which also has the advantage of being readable):
import pyvtk
filename = 'example.vtk'
title = 'Unstructured Grid Example'
points = [[0,0,0],[0,1,0],[0,0,1]]
triangles = [[0,1,2]]
grid = pyvtk.UnstructuredGrid(points, triangle=triangles)
celldata = pyvtk.CellData( pyvtk.Scalars([11.,], name="foo"),
pyvtk.Scalars([12.,], name="bar"))
vtk = pyvtk.VtkData(grid, celldata, title)
vtk.tofile(filename)
print(open(filename).read())
Which produces:
# vtk DataFile Version 2.0
Unstructured Grid Example
ASCII
DATASET UNSTRUCTURED_GRID
POINTS 3 int
0 0 0
0 1 0
0 0 1
CELLS 1 4
3 0 1 2
CELL_TYPES 1
5
CELL_DATA 1
SCALARS foo float 1
LOOKUP_TABLE default
11.0
SCALARS bar float 1
LOOKUP_TABLE default
12.0
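Since the legacy ASCII format is plain text, the CELL_DATA section can also be produced by hand and appended to a file that already contains the geometry. This is only a sketch under assumptions: the helper name `cell_data_section` is invented, every array is written as single-component `double` scalars, and only the array lengths are checked.

```python
def cell_data_section(n_cells, scalars):
    """Build the CELL_DATA lines for several named scalar arrays."""
    lines = ["CELL_DATA %d" % n_cells]
    for name, values in scalars.items():
        if len(values) != n_cells:
            raise ValueError("%s: expected %d values" % (name, n_cells))
        lines.append("SCALARS %s double 1" % name)
        lines.append("LOOKUP_TABLE default")
        lines.append(" ".join(repr(float(v)) for v in values))
    return "\n".join(lines) + "\n"


# One triangle carrying two scalar arrays, as in the question.
section = cell_data_section(1, {"foo": [11.0], "bar": [12.0]})
print(section)
```

The arrays are emitted in the insertion order of the dictionary, so foo precedes bar here.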
I am using the following code to create a collection of color coded line plots:
for j in idlist[i]:
    single_traj(lonarray, latarray, parray)
plt.savefig(savename, dpi = 400)
plt.close('all')
plt.clf()
where:
def single_traj(lonarray, latarray, parray, linewidth = 0.7):
    """
    Plots XY Plot of one trajectory, with color as a function of p
    Helper Function for DrawXYTraj
    """
    global lc
    x = lonarray
    y = latarray
    p = parray
    points = np.array([x,y]).T.reshape(-1,1,2)
    segments = np.concatenate([points[:-1], points[1:]], axis=1)
    lc = col.LineCollection(segments, cmap=plt.get_cmap('Spectral'),
                            norm=plt.Normalize(100, 1000), alpha = 0.8)
    lc.set_array(p)
    lc.set_linewidth(linewidth)
    plt.gca().add_collection(lc)
Somehow, this loop uses a lot of memory (> ~10GB), which is still being used after the plot is saved.
I used hpy to look at the memory usage:
Partition of a set of 27472988 objects. Total size = 10990671168 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 8803917 32 9226505016 84 9226505016 84 dict of matplotlib.path.Path
1 8888542 32 711083360 6 9937588376 90 numpy.ndarray
2 8803917 32 563450688 5 10501039064 96 matplotlib.path.Path
3 11 0 219679112 2 10720718176 98 guppy.sets.setsc.ImmNodeSet
4 25407 0 77593848 1 10798312024 98 list
5 89367 0 28232616 0 10826544640 99 dict (no owner)
6 7642 0 25615984 0 10852160624 99 dict of matplotlib.collections.LineCollection
7 15343 0 16079464 0 10868240088 99 dict of matplotlib.transforms.CompositeGenericTransform
8 15327 0 16062696 0 10884302784 99 dict of matplotlib.transforms.Bbox
9 53741 0 15047480 0 10899350264 99 dict of weakref.WeakValueDictionary
At this point the plot is already saved, so all matplotlib-related objects should be gone... But I can't "find" these objects, which means I don't know how to delete them.
EDIT:
Here is a stand-alone example which reproduces the leak (savefig throws an error for some reason but isn't relevant anyway):
# Memory leak test!
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.collections as col
def draw():
    x = range(1000)
    y = range(1000)
    p = range(1000)
    fig = plt.figure(figsize = (12,8))
    ax = plt.gca()
    ax.set_aspect('equal')
    for i in range(1000):
        if i%100 == 0:
            print(i)
        points = np.array([x,y]).T.reshape(-1,1,2)
        segments = np.concatenate([points[:-1], points[1:]], axis=1)
        lc = col.LineCollection(segments, cmap=plt.get_cmap('Spectral'),
                                norm=plt.Normalize(0, 1000), alpha = 0.8)
        lc.set_array(p)
        lc.set_linewidth(0.7)
        plt.gca().add_collection(lc)
    cb = fig.colorbar(lc, shrink = 0.7)
    cb.set_label('p')
    cb.ax.invert_yaxis()
    plt.tight_layout()
    #plt.savefig('./mem_test.png', dpi = 400)
    plt.close('all')
    plt.clf()
draw()
a = input('Wait...')
The draw() function should delete all plt objects, but they still use up memory after the function has been called. I just checked it with top/htop!
Your hpy dump suggests that the memory hog is a large number of matplotlib.path.Path objects. This may be due to your variable lc. Have you tried del lc? It may be that plt.close cannot (and should not!) delete them while they are still referenced by your global variable lc.
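The effect of a lingering global reference can be demonstrated without matplotlib. In this sketch, which relies on CPython's reference counting, `Payload` is a stand-in for the LineCollection and all names are illustrative; a weak reference is used to observe whether the object is still alive.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large plotting object."""


lc = None  # module-level name, like the `global lc` in single_traj


def make_global():
    global lc
    lc = Payload()
    return weakref.ref(lc)


def make_local():
    obj = Payload()
    return weakref.ref(obj)   # obj goes out of scope on return


def drop_global():
    global lc
    lc = None                 # same effect as `del lc` at module level
    gc.collect()


local_ref = make_local()
gc.collect()
global_ref = make_global()
print(local_ref() is None)    # the local object has been freed
print(global_ref() is None)   # the global one is still referenced
drop_global()
print(global_ref() is None)   # freed once the global name lets go
```

Whether this accounts for all of the memory in the question is not something the sketch can show; it only illustrates why closing the figure cannot release an object that a global name still points to.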
I have the following code, which generates 8 plots. I want to put the phase as the title of each plot. I have succeeded in putting a phase on each plot, but instead of the corresponding phase, every plot shows the last one. The file 8phases.txt contains the following 8 lines, one for each plot -
-1 1 -1
-1 1 1
1 1 1
1 -1 1
-1 -1 -1
1 1 -1
1 -1 -1
-1 -1 1
Here is the code -
import numpy as np
import matplotlib.pyplot as plt
D=12
n=np.arange(1,4)
x = np.linspace(-D/2,D/2, 3000)
I = np.array([125,300,75])
phase = np.genfromtxt('8phases.txt')
I_phase = I*phase
for i in I_phase:
    F = sum(m*np.cos(2*np.pi*l*x/D) for m,l in zip(i,n))
    f,(ax1,ax2) = plt.subplots(2)
    for row in phase:
        ax1.plot(x,F,'g')
        ax1.set_title(row)
plt.show()
I think your inner-most loop is unnecessary; it is recreating the same plot 8 times and updating the title 8 times with each of the 8 values.
If I understood what you are asking for, I believe this gives the correct results:
...
for index,i in enumerate(I_phase):
    F = sum(m*np.cos(2*np.pi*l*x/D) for m,l in zip(i,n))
    f,(ax1,ax2) = plt.subplots(2)
    ax1.plot(x,F,'g')
    ax1.set_title(phase[index])
...
(I would normally use "i" instead of "index", but you had already used "i")
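Another way to keep each curve and its title in step is to iterate over the phases directly and derive the coefficients inside the loop, so there is no second sequence to index. Here is a small standard-library sketch, with plotting left out and the helper names (`fourier_sum`, `titled_values`) chosen only for illustration:

```python
import math

D = 12
I = [125, 300, 75]
# The eight rows of 8phases.txt from the question.
phases = [
    (-1, 1, -1), (-1, 1, 1), (1, 1, 1), (1, -1, 1),
    (-1, -1, -1), (1, 1, -1), (1, -1, -1), (-1, -1, 1),
]


def fourier_sum(coeffs, x, period=D):
    """Sum of m * cos(2*pi*n*x/period) for harmonics n = 1, 2, 3, ..."""
    return sum(m * math.cos(2 * math.pi * n * x / period)
               for n, m in enumerate(coeffs, start=1))


def titled_values(phases, amplitudes, x):
    """Pair each phase (the future plot title) with its own F(x)."""
    result = []
    for phase in phases:
        coeffs = [a * s for a, s in zip(amplitudes, phase)]
        result.append((phase, fourier_sum(coeffs, x)))
    return result


for phase, value in titled_values(phases, I, x=0.0):
    print(phase, value)
```

In the plotting loop, the same pairing means the title is set from the very phase that produced the curve, for example ax1.set_title(phase) right after ax1.plot.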