Producing boxplot figures dynamically depending on number of columns in data - python

I'm writing a python function to produce boxplots of data using python's matplotlib. I require this function to dynamically determine the number of figures and subplots based on the number of columns in the data. To make the figures readable I want the maximum number of subplots to be 4 by 4. If there are more than 16 I want to function to fill as many 4 by 4 figures as necessary then partially fill a final figure with the remainder.
For example, the data I currently have has 43 columns. I therefore want this function to produce 2 full figure containing 16 subplots each and one partially filled figure containing the remaining subplots. My problem is that I am stuck with the logic of writing such a function.
My code:
import matplotlib.pyplot as plt
def boxplot_data(self,parameters_file):
data = read_csv(parameters_file)
header = data.keys()
number_of_full_subplots = len(header)/16
remainder = len(header)-(16*number_of_full_subplots)
for i in range(0,number_of_full_subplots):
plt.figure(i)
for j in range(0,16):
plt.subplot(4,4,j)
boxplot(data[header[0:16]])
My plan was to iterate over the 'full subplots' first then iterate over the remainder however this method produces two identical figures.
Does anybody have any suggestions?
Thanks

the figures are identical because this line
boxplot(data[header[0:16]])
does not change when i and j are incremented.
Replace it with
boxplot(data[header[16*i+j]])
import matplotlib.pyplot as plt
def boxplot_data(self, parameters_file):
data = read_csv(parameters_file)
header = data.keys()
number_of_full_subplots = len(header)/16
for i in range(number_of_full_subplots):
plt.figure(i)
for j in range(16):
plt.subplot(4, 4, j)
boxplot(data[header[16*i+j]])

Related

Python plot lines with specific x values from numpy

I have a situation with a bunch of datafiles, these datafiles have a number of samples in a given time frame that depends on the system. i.e. At time t=1 for instance I might have a file with 10 items, or 20 items, at later times in that file I will always have the same number of items. The format is time, x, y, z in columns, and loaded into a numpy array. The time values show which frame, but as mentioned there's always the same, let's go with 10 as a sample. So I'll have a (10,4) numpy array where the time values are identical, but there are many frames in the file, so lets say 100 frames, so really I have (1000,4). I want to plot the data with time on the x-axis and manipulations of the other data on the y, but I am unsure how to do this with line plot methods in matplotlib. Normally to provide both x,y values I believe I need to do a scatter plot, so I'm hoping there's a better way to do this. What I ideally want is to treat each line that has the same time code as a different series (so it will colour differently), and the next bit of data for that same line number in the next frame (time value) will be labelled the same colour, giving those good contiguous lines. We can look at the time column and figure out how many items share a time code, let's call it "n". Sample code:
a = numpy.loadtxt('sampledata.txt')
plt.plot(a[:0,:,n],a[:1,:1])
plt.show()
I think this code expresses what I'm going for, though it doesn't work.
Edit:
I hope this is what you wanted.
seaborn scatterplot can categorize data to some groups which have the same codes (time code in this case) and use the same colors to them.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r"E:\Programming\Python\Matplotlib\timecodes.csv",
names=["time","x","y","z","code"]) #use your file
df["time"]=pd.to_datetime(df["time"]) #recognize the data as Time
df["x"]=df["time"].dt.day # I changed the data into "Date only" and imported to x column. Easier to see on graph.
#just used random numbers in y and z in my data.
sns.scatterplot("x", "y", data = df, hue = "code") #hue does the grouping
plt.show()
I used csv file here but you can do to your text file as well by adding sep="\t" in the argument. I also added a code in the file. If you have it the code can group the data in the graph, so you don't have to separate or make a hierarchical index. If you want to change colors or grouping please see seaborn website.
Hope this helps.
Alternative, the method I used, but Tim's answer is still accurate as well. Since the time codes are not date/time information I modified my own code to add tags as a second column I call "p" (they're polymers).
import numpy as np
import pandas as pd
datain = np.loadtxt('somefile.txt')
df = pd.DataFrame(data = datain, columns = ["t","p","x","y","z"])
ax = sns.scatterplot("t","x", data = df, hue = "p")
plt.show()
And of course the other columns can be plotted similarly if desired.

pandas plot line segments for each row

I have dataframes with columns containing x,y coordinates for multiple points. One row can consist of several points.
I'm trying to find out an easy way to be able to plot lines between each point generating a curve for each row of data.
Here is a simplified example where two lines are represented by two points each.
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
df.plot(y=['p1_y','p2_y'], x=['p1_x','p2_x'])
when trying to plot them I expect line 1 to start where x=1 and line 2 to start where x=2.
Instead, the x axis contains two value-pairs (1,2) and (2,3) and both lines have the same start and end-point in x-axis.
How do I get around this problem?
Edit:
If using matplotlib, the following hardcoded values generates the plot i'm interested in
plt.plot([[1,2],[2,3]],[[10,9],[11,12]])
While I'm sure that there should be a more succinct way using pure pandas, here's a simple approach using matplotlib and some derivatives from the original df.(I hope I understood the question correctly)
Assumption: In df, you place x values in even columns and y values in odd columns
Obtain x values
x = df.loc[:, df.columns[::2]]
x
p1_x p2_x
0 1 2
1 2 3
Obtain y values
y = df.loc[:, df.columns[1::2]]
y
p1_y p2_y
0 10 11
1 9 12
Then plot using a for loop
for i in range(len(df)):
plt.plot(x.iloc[i,:], y.iloc[i,:])
One does not need to create additional data frames. One can loop through the rows to plot these lines:
line1 = {'p1_x':1, 'p1_y':10, 'p2_x':2, 'p2_y':11 }
line2 = {'p1_x':2, 'p1_y':9, 'p2_x':3, 'p2_y':12 }
df = pd.DataFrame([line1,line2])
for i in range(len(df)): # for each row:
# plt.plot([list of Xs], [list of Ys])
plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]])
plt.show()
The lines will be drawn in different colors. To get lines of same color, one can add option c='k' or whatever color one wants.
plt.plot([df.iloc[i,0],df.iloc[i,2]],[df.iloc[i,1],df.iloc[i,3]], c='k')
I generaly don't use the pandas plotting because I think it is rather limited, if using matplotlib is not an issue, the following code works:
from matplotlib import pyplot as plt
plt.plot(df.p1_x,df.p1_y)
plt.plot(df.p2_x,df.p2_y)
plt.plot()
if you got lots of lines to plot, you can use a for loop.

Plotting boxplots for a groupby object

I would like to plot boxplots for several datasets based on a criterion.
Imagine a dataframe similar to the example below:
df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})
df = df[['Group','M','F']]
Group M F
0 1 0.465636 0.537723
1 1 0.560537 0.727238
2 1 0.268154 0.648927
3 2 0.722644 0.115550
4 3 0.586346 0.042896
5 2 0.562881 0.369686
6 2 0.395236 0.672477
7 3 0.577949 0.358801
8 1 0.764069 0.642724
9 3 0.731076 0.302369
In this case, I have three groups, so I would like to make a boxplot for each group and for M and F separately having the groups on Y axis and the columns of M and F colour-coded.
This answer is very close to what I want to achieve, but I would prefer something more robust, applicable for larger dataframes with greater number of groups. I feel that groupby is the way to go, but I am not familiar with groupby objects and I am failing to even slice them.
. The desirable output would look something like this:
Looks like years ago, someone had the same problem, but got no answers :( Having a boxplot as a graphical representation of the describe function of groupby
My questions are:
How to implement groupby to feed the desired data into the boxplot
What is the correct syntax for the box plot if I want to control what is displayed and not just use default settings (which I don't even know what they are, I am finding the documentation rather vague. To be specific,can I have the box covering the mean +/- standard deviation, and keep the vertical line at median value?)
I think you should use Seaborn library that offers to create these type of customize plots.In your case i had first melted your dataframe to convert it into proper format and then created the boxplot of your choice.
import pandas as pd
import matplotlib.pyplot as plt
Import seaborn as sns
dd=pd.melt(df,id_vars=['Group'],value_vars=['M','F'],var_name='sex')
sns.boxplot(y='Group',x='value',data=dd,orient="h",hue='sex')
The plot looks similar to your required plot.
Finally, I found a solution by slightly modifying this answer. It does not use groupby object, so it is more tedious to prepare the data, but so far it looks like the best solution to me. Here it is:
# here I prepare the data (group them manually and then store in lists)
Groups=[1,2,3]
Columns=df.columns.tolist()[1:]
print Columns
Mgroups=[]
Fgroups=[]
for g in Groups:
dfgc = df[df['Group']==g]
m=dfgc['M'].dropna()
f=dfgc['F'].dropna()
Mgroups.append(m.tolist())
Fgroups.append(f.tolist())
fig=plt.figure()
ax = plt.axes()
def setBoxColors(bp,cl):
plt.setp(bp['boxes'], color=cl, linewidth=2.)
plt.setp(bp['whiskers'], color=cl, linewidth=2.5)
plt.setp(bp['caps'], color=cl,linewidth=2)
plt.setp(bp['medians'], color=cl, linewidth=3.5)
bpl = plt.boxplot(Mgroups, positions=np.array(xrange(len(Mgroups)))*3.0-0.4,vert=False,whis='range', sym='', widths=0.6)
bpr = plt.boxplot(Fgroups, positions=np.array(xrange(len(Fgroups)))*3.0+0.4,vert=False,whis='range', sym='', widths=0.6)
setBoxColors(bpr, '#D7191C') # colors are from http://colorbrewer2.org/
setBoxColors(bpl, '#2C7BB6')
# draw temporary red and blue lines and use them to create a legend
plt.plot([], c='#D7191C', label='F')
plt.plot([], c='#2C7BB6', label='M')
plt.legend()
plt.yticks(xrange(0, len(Groups) * 3, 3), Groups)
plt.ylim(-3, len(Groups)*3)
#plt.xlim(0, 8)
plt.show()
The result looks mostly like what I wanted (as far as I have been able to find, the box always ranges from first to third quartile, so it is not possible to set it to +/- standard deviation). So I am a bit disappointed there is no one-line solution, but I am glad it is possible. However, for hundreds of groups this would not be good enough...

Creating a faceted matplotlib/seaborn plot using indicator variables rather than a single column

Seaborn is great for creating faceted plots based on a categorical variable encoding the class of each facet. However, this assumes your categories are mutually exclusive. Is it possible to create a Seaborn FacetGrid (or similar) based on a set of indicator variables?
As a concrete example, think about comparing patients that are infected with one or more viruses, and plotting an attribute of interest by virus. Its possible that a patient carries more than one virus, so creating a virus column to create a grid on is not possible. You can, however, create a set of indicator variables (one for each virus) that flags the virus for each patient. There does not seem to be a way of passing a set of indicator variables to any of the Seaborn functions to do this.
I can't imagine I'm the first person to come across this scenario, so I'm hoping there are suggestions for how to do this, without coding it by hand in Matploltlib.
I don't see how to do it with FacetGrid, possibly because this isn't facetting the data, since a data-record might appear several times or only once in the plot. One of the standard tricks with a set of bitfields is to read them as binary, so you see each combination of the bits. That's unambiguous but gets messy:
import pandas as pd
import seaborn as sns
from numpy.random import random, randint
from numpy import concatenate
import matplotlib.pyplot as plt
# Dummy data
vdata = pd.DataFrame(concatenate((randint(2, size=(32,4)), random(size=(32,2))), axis=1))
vdata.columns=['Species','v1','v2','v3','x','y']
binary_v = vdata.v1 + vdata.v2*2 + vdata.v3*4
# Making a binary number out of the "virusX?" fields
pd.concat((vdata, binary_v), axis=1)
vdata = pd.concat((vdata, binary_v), axis=1)
vdata.columns=['Species','v1','v2','v3','x','y','binary_v']
# Plotting group membership by row
#g = sns.FacetGrid(vdata, col="Species", row='binary_v')
#g.map(plt.scatter, "x", "y")
#g.add_legend()
#plt.savefig('multiple_facet_binary_row') # Unreadably big.
h = sns.FacetGrid(vdata, col="Species", hue="binary_v")
h.map(plt.scatter, "x","y")
h.add_legend()
plt.savefig('multiple_facet_binary_hue')
If you have too many indicators to deal with the combinatorial explosion, explicitly making the new subsets works:
# Nope, need to pull out subsets:
bdata = vdata[vdata.v1 + vdata.v2 + vdata.v3 ==0.]
assert(len(bdata) > 0) # ... catch...
bdata['Virus'] = pd.Series(['none']*len(bdata), index=bdata.index)
for i in ['v1','v2','v3']:
on = vdata[vdata[i]==1.]
on['Virus'] = pd.Series([i]*len(on), index=on.index)
bdata = bdata.append(on)
j = sns.FacetGrid(bdata, col='Species', row='Virus')
j.map(plt.scatter, 'x', 'y')
j.add_legend()
j.savefig('multiple_facet_refish')

Plotting trajectories in python using matplotlib

I'm having some trouble using matplotlib to plot the path of something.
Here's a basic version of the type of thing I'm doing.
Essentially, I'm seeing if the value breaks a certain threshold (6 in this case) at any point during the path and then doing something with it later on.
Now, I have 3 lists set-up. The end_vector will be based on the other two lists. If the value breaks past 2 any time during a single simulation, I will add the last position of the object to my end_vector
trajectories_vect is something I want to keep track of my trajectories for all 5 simulations, by keeping a list of lists. I'll clarify this below. And, timestep_vect stores the path for a single simulation.
from random import gauss
from matplotlib import pyplot as plt
import numpy as np
starting_val = 5
T = 1 #1 year
delta_t = .1 #time-step
N = int(T/delta_t) #how many points on the path looked at
trials = 5 #number of simulations
#main iterative loop
end_vect = []
trajectories_vect = []
for k in xrange(trials):
s_j = starting_val
timestep_vect = []
for j in xrange(N-1):
xi = gauss(0,1.0)
s_j *= xi
timestep_vect.append(s_j)
trajectories_vect.append(timestep_vect)
if max(timestep_vect) > 5:
end_vect.append(timestep_vect[-1])
else:
end_vect.append(0)
Okay, at this part if I print my trajectories, I get something like this (I only posted two simulations, instead of the full 5):
[[ -3.61689976e+00 2.85839230e+00 -1.59673115e+00 6.22743522e-01
1.95127718e-02 -1.72827152e-02 1.79295788e-02 4.26807446e-02
-4.06175288e-02] [ 4.29119818e-01 4.50321728e-01 -7.62901016e-01
-8.31124346e-02 -6.40330554e-03 1.28172906e-02 -1.91664737e-02
-8.29173982e-03 4.03917926e-03]]
This is good and what I want to happen.
Now, my problem is that I don't know how to plot my path (y-axis) against my time (x-axis) properly.
First, I want to put my data into numpy arrays because I'll need to use them later on to compute some statistics and other things which from experience numpy makes very easy.
#creating numpy arrays from list
#might need to use this with matplotlib somehow
np_trajectories = np.array(trajectories_vect)
time_array = np.arange(1,10)
Here's the crux of the issue though. When i'm putting my trajectories (y-axis) into matplotlib, it's not treating each "list" (row in numpy) as one path. Instead of getting 5 paths for 5 simulations, I am getting 9 paths for 5 simulations. I believe I am inputing stuff wrong hence it is using the 9 time intervals in the wrong way.
#matplotlib stuff
plt.plot(np_trajectories)
plt.xlabel('timestep')
plt.ylabel('trajectories')
plt.show()
Here's the image produced:
Obviously, this is wrong for the aforementioned reason. Instead, I want to have 5 paths based on the 5 lists (rows) in my trajectories. I seem to understand what the problem is but don't know how to go about fixing it.
Thanks in advance for the help.
When you call np_trajectories = np.array(trajectories_vect), your list of trajectories is transformed into a 2d numpy array. The information about its dimensions is stored in np_trajectories.shape, and, in your case, is (5, 9). Therefore, when you pass np_trajectories to plt.plot(), the plotting library assumes that the y-values are stored in the first dimension, while the second dimension describes individual lines to plot.
In your case, all you need to do is to transpose your np_trajectories array. In numpy, it is as simple as
plt.plot(np_trajectories.T)
plt.xlabel('timestep')
plt.ylabel('trajectories')
plt.show()
If you want to plot the x-axis as time, instead of steps of one, you have to define your time progression as a list or an array. In numpy, you can do something like
times = np.linspace(0, T, N-1)
plt.plot(times, np_trajectories.T)
plt.xlabel('timestep')
plt.ylabel('trajectories')
plt.show()
which produces the following figure:

Categories