I am trying to create a plot extracting points from a .txt file. The points are separated by 'tab' space only. Also, there are too many points to be accommodated in only one column, so they have been spread over 3 columns. However, when I plot in matplotlib, I am a little suspicious I am not seeing all the numbers plotted. It may be the case the data is plotted only over the first column and is ignoring the other two columns.
Here is the sample example of such data file: https://www.dropbox.com/s/th6uwrk2xdnmhyi/n1l2m2.txt?dl=0
I also attached the simple code I am using to plot:
import matplotlib.pyplot as plt
%matplotlib inline
import sys
import os
import numpy
from pylab import *
exp_sum = '/home/trina/Downloads/n1l2m2.txt'
a= numpy.loadtxt(exp_sum, unpack =True)
plt.plot(a)
show()
and here is the output image:
I am interested to know if this plot covers all the points in my data file. Any suggestions are much appreciated.
By calling plt.plot(a), you are passing a 2-dimensional array (three rows, one per file column, because of unpack=True) to be drawn on a single set of axes.
From the matplotlib docs for plot:
"If x and/or y is 2-dimensional, then the corresponding columns will be plotted."
So, your graph output is:
column 0 values at x = 0
column 1 values at x = 1
column 2 values at x = 2
Adding the following to the code:
for i in range(len(a)):
    print('a' + str(i), max(a[i]), min(a[i]))
Outputs the following:
stats max min
a0 0.9999 0.0
a1 0.9856736 0.3736717
a2 -0.003469009 -0.08896232
Using the mouseover position readout with matplotlib, this looks correct.
On a general graphs point, I'd recommend using histograms, boxplots or violin plots if you want to visualise the frequency (and other stats) of data sets. See the matplotlib examples for histograms, boxplots and violin plots.
Edit: from the shading on the graph you have, it also looks like it does contain all the points, as your data columns are long tails when plotted individually. The long tail graphs correlate to the shading on the graph you have.
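To check how plt.plot(a) treats such an array, here is a minimal sketch. The inline text is a made-up stand-in for the real file (an assumption, not the actual data), and the Agg backend is only there so it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for the real file: 4 rows, 3 tab-separated columns.
text = "0.0\t0.9\t-0.01\n0.1\t0.8\t-0.02\n0.2\t0.7\t-0.03\n0.3\t0.6\t-0.04\n"

# unpack=True transposes the result, so each file column becomes one row.
a = np.loadtxt(io.StringIO(text), unpack=True)

fig, ax = plt.subplots()
lines = ax.plot(a)  # one Line2D per column of `a`, i.e. per row of the file

n_lines = len(lines)
x_positions = list(lines[0].get_xdata())
total_points = sum(len(line.get_ydata()) for line in lines)

print(a.shape, n_lines, x_positions, total_points, a.size)
```

Every value ends up on the graph, but as many short 3-point lines at x = 0, 1, 2. If the intent is one curve per file column instead, plotting the transpose with plt.plot(a.T) is one option.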
Related
I am new to Python and am trying to plot a color-magnitude diagram (CMD) for a selected cluster with matplotlib. There are 3,400,000 stars to plot; for each star the color goes on the x-axis and the magnitude on the y-axis, so my code reads two columns from a CSV file and plots them. When I use part of the data (3000 stars), I can plot a CMD successfully, but when I use all the data the plot is a mess (see the figure below): the points seem to be plotted by their position in the column instead of by their value. For example, a point with data (0.92, 20.64) should be close to the y-axis, but it actually ends up at the far right of the plot just because it sits near the end of the dataset. How can I plot the entire dataset and get a plot like the first figure? Thanks for your time. This is my code:
import matplotlib.pyplot as plt
import pandas as pd
import csv
data = pd.read_csv(r'C:\Users\Peter\Desktop\F275W test.csv', low_memory=False)
# Color (x) and magnitude (y) columns
x = data['F275W-F336W']
y = data['F275W']
#remove the axis
plt.axis('off')
plt.plot(x,y, ',')
plt.show()
This is the plot I got for 3000 stars; it is a CMD.
This is the plot I got for the entire dataset, which is a mess.
I have a very simple data frame but I could not plot a line using a row and a column. Here is an image, I would like to plot a "line" that connects them.
I tried to plot it but x-axis disappeared. And I would like to swap those axes. I could not find an easy way to plot this simple thing.
Try:
import matplotlib.pyplot as plt
# Categories will be the x axis, Seconds will be the y axis
plt.plot(data["Categories"], data["Seconds"])
plt.show()
Matplotlib generates the axis dynamically, so if you want the labels of the x-axis to appear you'll have to increase the size of your plot.
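A slightly fuller sketch of the same idea, assuming a small made-up data frame with "Categories" and "Seconds" columns (the values below are invented for illustration). The wider figsize is one way to give the x-axis labels room.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; the real frame in the question is only shown as an image.
data = pd.DataFrame({
    "Categories": ["a", "b", "c", "d"],
    "Seconds": [1.5, 3.0, 2.2, 4.1],
})

# A wider figure leaves room for the category labels on the x axis.
fig, ax = plt.subplots(figsize=(8, 4))
(line,) = ax.plot(data["Categories"], data["Seconds"], marker="o")
ax.set_xlabel("Categories")
ax.set_ylabel("Seconds")
fig.tight_layout()

plotted_seconds = list(line.get_ydata())
```

To swap the axes, pass the columns in the opposite order: ax.plot(data["Seconds"], data["Categories"]).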
It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax for the boxplot so that it automatically generates the box plot for Y vs. X device without having to do external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing this is to draw a line plot from the mean values of the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
                positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
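If seaborn is not available, roughly the same overlay can be sketched with plain matplotlib. This assumes the boxes sit at the default positions 1..n (no positions argument is passed) and that groupby orders the groups the same way as the boxplot; the data is random sample data, as above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 150
df = pd.DataFrame({
    "value": rng.random(N),
    "group": rng.choice(["A", "B", "C"], size=N),
})

fig, ax = plt.subplots()
df.boxplot(column="value", by="group", showmeans=True, ax=ax)

# Group means, sorted by group label like the boxes.
means = df.groupby("group")["value"].mean()

# Default box positions are 1, 2, ..., n when `positions` is not given.
positions = range(1, len(means) + 1)
(mean_line,) = ax.plot(positions, means.values, marker="o", color="C1")
```

Swap mean() for median() to connect the medians instead.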
You can get the value of the medians by calling the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
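import numpy as np
import matplotlib.pyplot as plt

bp = plt.boxplot(data)
X = []
Y = []
for m in bp['medians']:
    # Each median is a Line2D; get_data() returns its x and y endpoints
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
# Connect the box centres at median height
plt.plot(X, Y, c='C1')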
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
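As a quick sanity check of this approach, the sketch below builds three made-up samples (an assumption for illustration) and compares the values read back from the median lines with np.median.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: three samples with different centres.
data = [rng.normal(loc=centre, size=50) for centre in (0.0, 1.0, 2.0)]

fig, ax = plt.subplots()
bp = ax.boxplot(data)

# The midpoint of each median line gives the box position (x) and the median (y).
X = [np.mean(m.get_xdata()) for m in bp["medians"]]
Y = [np.mean(m.get_ydata()) for m in bp["medians"]]
ax.plot(X, Y, c="C1")

expected = [np.median(d) for d in data]
```

With showmeans=True the same dict also has a 'means' key, which can be read in a similar way.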
I want to create 10 violin plots within one diagram. I looked at many examples like this one: Violin plot matplotlib, which shows what I would like to end up with.
But I did not know how to adapt it to a real data set. The examples all just generate random, normally distributed data.
I have data in form D[10,730] and if I try to adapt it from the link above with :
example:
axes[0].violinplot(all_data,showmeans=False,showmedians=True)
my code:
axes[0].violinplot(D,showmeans=False,showmedians=True)
it does not work.
It should draw 10 violin plots side by side (one per entry along the first dimension of D).
So what shape does my data need to have to get the same type of violin plot?
You just need to transpose your data array D.
axes[0].violinplot(D.T,showmeans=False,showmedians=True)
This appears to be a small bug in matplotlib. The axes are treated in a non-consistent manner for a list of 1D arrays and a 2D array.
import numpy as np
import matplotlib.pyplot as plt
n_datasets = 10
n_samples = 730
data = np.random.randn(n_datasets,n_samples)
fig, axes = plt.subplots(1,3)
# http://matplotlib.org/examples/statistics/boxplot_vs_violin_demo.html
axes[0].violinplot([d for d in data])
# should be equivalent to:
axes[1].violinplot(data)
# is actually equivalent to
axes[2].violinplot(data.T)
You should file a bug report.
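One way to see the difference is to count the violins each call draws. The sketch below uses a small made-up array (4 data sets of 30 samples, chosen only to keep it fast) and reads the number of bodies from the dict that violinplot returns.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: 4 data sets (rows) with 30 samples each.
data = rng.normal(size=(4, 30))

fig, axes = plt.subplots(1, 3)

# A list of 1-D arrays: one violin per list entry.
n_from_list = len(axes[0].violinplot([d for d in data])["bodies"])
# A 2-D array passed as is: one violin per column.
n_from_array = len(axes[1].violinplot(data)["bodies"])
# The transposed array: one violin per original row.
n_from_transpose = len(axes[2].violinplot(data.T)["bodies"])

print(n_from_list, n_from_array, n_from_transpose)
```

So for an array shaped (data sets, samples), either transpose it or pass a list of its rows.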
I am struggling to set xlim for each histogram and to create one column of graphs so that the x-axis ticks are aligned. Being new to pandas, I am unsure how the answer to "Overlaying multiple histograms using pandas" applies.
>from pandas import DataFrame, read_csv
>import matplotlib.pyplot as plt
>import pandas as pd
>df=DataFrame({'score0':[0.047771,0.044174,0.044169,0.042892,0.036862,0.036684,0.036451,0.035530,0.034657,0.033666],
'score1':[0.061010,0.054999,0.048395,0.048327,0.047784,0.047387,0.045950,0.045707,0.043294,0.042243]})
>print(df)
score0 score1
0 0.047771 0.061010
1 0.044174 0.054999
2 0.044169 0.048395
3 0.042892 0.048327
4 0.036862 0.047784
5 0.036684 0.047387
6 0.036451 0.045950
7 0.035530 0.045707
8 0.034657 0.043294
9 0.033666 0.042243
>df.hist()
>plt.xlim(-1.0,1.0)
The result sets the x-axis bounds to [-1,1] on only one of the plots.
I'm very familiar with ggplot in R and am just trying out pandas/matplotlib in Python. I'm open to suggestions for better plotting ideas. Any help would be greatly appreciated.
update #1 (#ct-zhu):
I have tried the following, but the xlim edit on the subplot does not seem to translate the bin widths across the new x-axis values. As a result, the graph now has odd bin widths and still has more than one column of graphs:
for array in df.hist(bins=10):
    for subplot in array:
        subplot.set_xlim((-1,1))
update #2:
Getting closer with the use of layout, but the width of bins does not equal the interval length divided by bin count. In the example below, I set bins=10. Hence, the width of each bin over the interval from [-1,1] should be 2/10=0.20; however, the graph does not have any bins with a width of 0.20.
for array in df.hist(layout=(2,1),bins=10):
    for subplot in array:
        subplot.set_xlim((-1,1))
There are two subplots, and you can access each of them and modify them separately:
ax_list=df.hist()
ax_list[0][0].set_xlim((0,1))
ax_list[0][1].set_xlim((0.01, 0.07))
Calling plt.xlim changes the limits of the current working axes only. In this case, that is the second plot, which was the most recently generated.
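A minimal sketch of that behaviour, using two empty subplots instead of the histograms (an assumption made to keep it short):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, (ax0, ax1) = plt.subplots(1, 2)

# plt.xlim acts on the current axes, which is the one created last.
current = plt.gca()
plt.xlim(-1.0, 1.0)

limits_first = ax0.get_xlim()   # untouched default limits
limits_second = ax1.get_xlim()  # changed by plt.xlim

print(current is ax1, limits_first, limits_second)
```

Calling set_xlim on each axes object, as above, avoids depending on which axes happens to be current.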
Edit:
To arrange the plots in 2 rows and 1 column, use the layout argument. To make the bin edges align, use the bins argument. Setting the x limits to (-1, 1) is probably not a good idea, since your numbers are all small.
import numpy as np
ax_list=df.hist(layout=(2,1),bins=np.histogram(df.values.ravel())[1])
ax_list[0][0].set_xlim((0.01, 0.07))
ax_list[1][0].set_xlim((0.01, 0.07))
Or specify exactly 10 bins over (-1,1), which takes 11 bin edges:
ax_list=df.hist(layout=(2,1),bins=np.linspace(-1,1,11))
ax_list[0][0].set_xlim((-1,1))
ax_list[1][0].set_xlim((-1,1))
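To confirm the bin widths, here is a small self-contained sketch using the scores from the question. It assumes that 10 equal bins over [-1, 1] are wanted, which takes 11 edges from np.linspace, and it reads the bar widths back from the drawn patches.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score0": [0.047771, 0.044174, 0.044169, 0.042892, 0.036862,
               0.036684, 0.036451, 0.035530, 0.034657, 0.033666],
    "score1": [0.061010, 0.054999, 0.048395, 0.048327, 0.047784,
               0.047387, 0.045950, 0.045707, 0.043294, 0.042243],
})

# 11 edges give 10 bins, each 2 / 10 = 0.2 wide.
edges = np.linspace(-1, 1, 11)
axes = df.hist(layout=(2, 1), bins=edges)

for ax in axes.ravel():
    ax.set_xlim(-1, 1)

# Each histogram bar is a Rectangle patch; collect the widths per subplot.
widths = [[patch.get_width() for patch in ax.patches] for ax in axes.ravel()]
```

With data this close together, nearly all counts fall into a single 0.2-wide bin, which is why narrower limits and bins are usually more informative here.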