I have the following dataframe that I wish to perform some regression on. I am using Seaborn but can't quite seem to find a non-linear function that fits. Below is my code and its output, and below that is the dataframe I am using, df. Note that I have truncated the axis in this plot.
I would like to fit either a Poisson or Gaussian distribution style of function.
import pandas
import seaborn
graph = seaborn.lmplot(x='R', y='Equilibrium Value', data=df, fit_reg=True, order=2, ci=None)
graph.set(xlim=(-0.25, 10))
However this produces the following figure.
df
R Equilibrium Value
0 5.102041 7.849315e-03
1 4.081633 2.593005e-02
2 0.000000 9.990000e-01
3 30.612245 4.197446e-14
4 14.285714 6.730133e-07
5 12.244898 5.268202e-06
6 15.306122 2.403316e-07
7 39.795918 3.292955e-18
8 19.387755 3.875505e-09
9 45.918367 5.731842e-21
10 1.020408 9.936863e-01
11 50.000000 8.102142e-23
12 2.040816 7.647420e-01
13 48.979592 2.353931e-22
14 43.877551 4.787156e-20
15 34.693878 6.357120e-16
16 27.551020 9.610208e-13
17 29.591837 1.193193e-13
18 31.632653 1.474959e-14
19 3.061224 1.200807e-01
20 23.469388 6.153965e-11
21 33.673469 1.815181e-15
22 42.857143 1.381050e-19
23 25.510204 7.706746e-12
24 13.265306 1.883431e-06
25 9.183673 1.154141e-04
26 41.836735 3.979575e-19
27 36.734694 7.770915e-17
28 18.367347 1.089037e-08
29 44.897959 1.657448e-20
30 16.326531 8.575577e-08
31 28.571429 3.388120e-13
32 40.816327 1.145412e-18
33 11.224490 1.473268e-05
34 24.489796 2.178927e-11
35 21.428571 4.893541e-10
36 32.653061 5.177167e-15
37 8.163265 3.241799e-04
38 22.448980 1.736254e-10
39 46.938776 1.979881e-21
40 47.959184 6.830820e-22
41 26.530612 2.722925e-12
42 38.775510 9.456077e-18
43 6.122449 2.632851e-03
44 37.755102 2.712309e-17
45 10.204082 4.121137e-05
46 35.714286 2.223883e-16
47 20.408163 1.377819e-09
48 17.346939 3.057373e-08
49 7.142857 9.167507e-04
EDIT
Attached are two graphs (not reproduced here) produced from both this and another data set when increasing the order parameter beyond 20.
[Figure caption: Order = 3]
I have trouble understanding why an lmplot is needed here. Usually you want to perform a fit by taking a model function and fitting it to the data.
Assume you want a Gaussian function
model = lambda x, A, x0, sigma, offset: offset+A*np.exp(-((x-x0)/sigma)**2)
you can fit it to your data with scipy.optimize.curve_fit:
popt, pcov = curve_fit(model, df["R"].values,
                       df["Equilibrium Value"].values, p0=[1, 0, 2, 0])
Complete code:
import pandas as pd
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
df = ... # your dataframe

# plot data
plt.scatter(df["R"].values, df["Equilibrium Value"].values, label="data")

# Fitting: full four-parameter Gaussian
model = lambda x, A, x0, sigma, offset: offset + A*np.exp(-((x-x0)/sigma)**2)
popt, pcov = curve_fit(model, df["R"].values,
                       df["Equilibrium Value"].values, p0=[1, 0, 2, 0])

# plot fit
x = np.linspace(df["R"].values.min(), df["R"].values.max(), 250)
plt.plot(x, model(x, *popt), label="fit")

# Fitting: restricted one-parameter version (A=1, x0=0, offset=0)
model2 = lambda x, sigma: model(x, 1, 0, sigma, 0)
popt2, pcov2 = curve_fit(model2, df["R"].values,
                         df["Equilibrium Value"].values, p0=[2])

# plot fit2
x2 = np.linspace(df["R"].values.min(), df["R"].values.max(), 250)
plt.plot(x2, model2(x2, *popt2), label="fit2")

plt.xlim(None, 10)
plt.legend()
plt.show()
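If you also want an uncertainty estimate for the fitted width, the diagonal of the covariance matrix returned by curve_fit holds the parameter variances. A minimal sketch, assuming popt2 and pcov2 from the one-parameter fit above:
sigma_fit = popt2[0]
sigma_err = np.sqrt(np.diag(pcov2))[0]  # one-standard-deviation uncertainty
print("sigma = {:.4f} +/- {:.4f}".format(sigma_fit, sigma_err))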
Related
I'm saving the daily stock price for several stocks in a Pandas DataFrame. I'm using Python and a Jupyter notebook.
Once saved, I'm using matplotlib to graph the prices to check the data.
The idea is to graph 9 stocks at a time in a 3 x 3 subplot.
When I want to check other stock tickers I have to manually change each ticker in each subplot, which takes a long time and seems inefficient.
Is there a way to do this with some sort of list and for loop?
Here is my current code. It works but it seems too long and hard to update. (Stock tickers are only examples from a Vanguard model portfolio.)
x = price_df.index
a = price_df["P_VOO"]
b = price_df["P_VGK"]
c = price_df["P_VPL"]
d = price_df["P_IEMG"]
e = price_df["P_MCHI"]
f = price_df["P_VNQ"]
g = price_df["P_GDX"]
h = price_df["P_BND"]
i = price_df["P_BNDX"]
# Plot a figure with various axes scales
fig = plt.figure(figsize=(15,10))
# Subplot 1
plt.subplot(331)
plt.plot(x, a)
plt.title("VOO")
plt.ylim([0,550])
plt.grid(True)
plt.subplot(332)
plt.plot(x, b)
plt.title("VGK")
plt.ylim([0,400])
plt.grid(True)
plt.subplot(333)
plt.plot(x, c)
plt.title('VPL')
plt.ylim([0,110])
plt.grid(True)
plt.subplot(334)
plt.plot(x, d)
plt.title('IEMG')
plt.ylim([0,250])
plt.grid(True)
plt.subplot(335)
plt.plot(x, e)
plt.title('MCHI')
plt.ylim([0,75])
plt.grid(True)
plt.subplot(336)
plt.plot(x, f)
plt.title('P_VNQ')
plt.ylim([0,55])
plt.grid(True)
plt.subplot(337)
plt.plot(x, g)
plt.title('P_GDX')
plt.ylim([0,8])
plt.grid(True)
plt.subplot(338)
plt.plot(x, h)
plt.title('P_BND')
plt.ylim([0,200])
plt.grid(True)
plt.subplot(339)
plt.plot(x, i)
plt.title('P_BNDX')
plt.ylim([0,350])
plt.grid(True)
plt.tight_layout()
Try DataFrame.plot with subplots enabled, setting the layout and figsize:
axes = df.plot(subplots=True, title=df.columns.tolist(),
               grid=True, layout=(3, 3), figsize=(15, 10))
plt.tight_layout()
plt.show()
Or use plt.subplots to set the layout then plot on those axes with DataFrame.plot:
# setup subplots
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 10))
# Plot DataFrame on axes
df.plot(subplots=True, ax=axes, title=df.columns.tolist(), grid=True)
plt.tight_layout()
plt.show()
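Neither variant sets the per-ticker y-limits from the manual version; if you need those, you can apply them to the returned axes afterwards. A small sketch, assuming the limits are listed in the same order as the columns:
ylims = [(0, 550), (0, 400), (0, 110), (0, 250), (0, 75),
         (0, 55), (0, 8), (0, 200), (0, 350)]
for ax, ylim in zip(axes.ravel(), ylims):
    ax.set_ylim(ylim)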
Sample Data and imports:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
df = pd.DataFrame(np.random.randint(10, 100, (10, 9)),
                  columns=list("ABCDEFGHI"))
df:
A B C D E F G H I
0 88 71 26 83 18 72 37 40 90
1 17 86 25 63 90 37 54 87 85
2 75 57 40 94 96 28 19 51 72
3 11 92 26 88 15 68 10 90 14
4 46 61 37 41 12 78 48 93 29
5 28 17 40 72 21 77 75 65 13
6 88 37 39 43 99 95 17 26 24
7 41 19 48 57 26 15 44 55 69
8 34 23 41 42 86 54 15 24 57
9 92 10 17 96 26 74 18 54 47
Does this implementation not work out in your case?
x = price_df.index
cols = ["P_VOO","P_VGK",...] #Populate before running
ylims = [[0,550],...] #Populate before running
# Plot a figure with various axes scales
fig = plt.figure(figsize=(15,10))
# One subplot per ticker
for i, (col, ylim) in enumerate(zip(cols, ylims)):
    plt.subplot(331 + i)
    plt.plot(x, price_df[col])
    plt.title(col.split('_')[1])
    plt.ylim(ylim)
    plt.grid(True)
I haven't run the code locally, so it could have some minor bugs. But you get the general idea, right?
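For reference, here is a filled-in, self-contained version of the same idea, with the tickers and y-limits copied from the question (price_df is assumed to already exist):
import matplotlib.pyplot as plt

cols = ["P_VOO", "P_VGK", "P_VPL", "P_IEMG", "P_MCHI",
        "P_VNQ", "P_GDX", "P_BND", "P_BNDX"]
ylims = [(0, 550), (0, 400), (0, 110), (0, 250), (0, 75),
         (0, 55), (0, 8), (0, 200), (0, 350)]

fig = plt.figure(figsize=(15, 10))
for i, (col, ylim) in enumerate(zip(cols, ylims)):
    plt.subplot(331 + i)
    plt.plot(price_df.index, price_df[col])
    plt.title(col.split('_')[1])
    plt.ylim(ylim)
    plt.grid(True)
plt.tight_layout()
plt.show()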
The model to fit is the function
def func(x, b):
    return b*np.exp(-b*x)*(1.0 + b*x)/4.0
I know that b=0.1 is a good guess for my data:
0 0.1932332495855138
1 0.0283534527253836
2 0.0188036856033853
3 0.0567007258167565
4 0.0704161703188139
5 0.0276463443409273
6 0.0144509808494943
7 0.0188027609145469
8 0.0049573500626925
9 0.0064589075683206
10 0.0118522499082115
11 0.0087201376939245
12 0.0055855004231049
13 0.0110355379801288
14 0.0024829496736532
15 0.0050982312687186
16 0.0041032075307342
17 0.0063991465281368
18 0.0047195530453669
19 0.0028479431829209
20 0.0177577032522473
21 0.0082863863356967
22 0.0057720347102372
23 0.0053694769677398
24 0.017408417311084
25 0.0023307847797263
26 0.0014090741613788
27 0.0019007144648791
28 0.0043599058193019
29 0.004435997067249
30 0.0015569027316533
31 0.0016127575928092
32 0.00120222948697
33 0.0006851723909766
34 0.0014497504163
35 0.0014245210449107
36 0.0011375555693977
37 0.0007939973846594
38 0.0005707034948325
39 0.0007890519641431
40 0.0006274139241806
41 0.0005899624312505
42 0.0003989619799181
43 0.0002212632688891
44 0.0001465605806698
45 0.000188075040325
46 0.0002779076010181
47 0.0002941294723591
48 0.0001690581072228
49 0.0001448055157076
50 0.0002734759385405
51 0.0003228484365634
52 0.0002120441778252
53 0.0002383276583408
54 0.0002156310534404
55 0.0004499244488764
56 0.0001408465706883
57 0.000135998586104
58 0.00028706917157
59 0.0001788548683777
But it doesn't matter whether I set p0=0.1 or p0=1.0; the fitted parameter comes out as popt=[0.42992594] and popt=[0.42994105] respectively, which is almost the same value. Why doesn't the curve_fit function work in this case?
popt, pcov = curve_fit(func, xdata, ydata, p0=[0.1])
There's nothing too mysterious going on here. 0.4299... is just a better fit to the data, in the least-squares sense.
With b = 0.1, the first few points are not well fit at all. Least-squares heavily weights outliers, so the optimizer tries very hard to fit those better, even if it means doing slightly worse at other points. In other words, "most" points are fit "pretty well", and there is a very high penalty for fitting any point very badly (that's the "square" in least-squares).
Below is a plot (not reproduced here) of the data in blue and your model function with b = 0.1 and b = 0.4299 in orange and green respectively. The value returned by curve_fit is better both subjectively and objectively. Computing the MSE to the data in both cases gives about 0.18 using b = 0.1, and 0.13 using b = 0.4299.
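You can check this directly. A quick sketch, assuming xdata and ydata hold the index and values listed in the question:
import numpy as np

def func(x, b):
    return b*np.exp(-b*x)*(1.0 + b*x)/4.0

for b in (0.1, 0.4299):
    mse = np.mean((ydata - func(xdata, b))**2)
    print(b, mse)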
Pandas has the very handy DataFrame.corr() function for pairwise correlation of columns.
That makes it possible to compute correlations between all columns at once. For instance:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 10)))
0 1 2 3 4 5 6 7 8 9
0 9 17 55 32 7 97 61 47 48 46
1 8 83 87 56 17 96 81 8 87 0
2 60 29 8 68 56 63 81 5 24 52
3 42 76 6 75 7 59 19 17 3 63
...
Now it is possible to test correlation between all 10 columns with df.corr(method='pearson'):
0 1 2 3 4 5 6 7 8 9
0 1.000000 0.082789 -0.094096 -0.086091 0.163091 0.013210 0.167204 -0.002514 0.097481 0.091020
1 0.082789 1.000000 0.027158 -0.080073 0.056364 -0.050978 -0.018428 -0.014099 -0.135125 -0.043797
2 -0.094096 0.027158 1.000000 -0.102975 0.101597 -0.036270 0.202929 0.085181 0.093723 -0.055824
3 -0.086091 -0.080073 -0.102975 1.000000 -0.149465 0.033130 -0.020929 0.183301 -0.003853 -0.062889
4 0.163091 0.056364 0.101597 -0.149465 1.000000 -0.007567 -0.017212 -0.086300 0.177247 -0.008612
5 0.013210 -0.050978 -0.036270 0.033130 -0.007567 1.000000 -0.080148 -0.080915 -0.004612 0.243713
6 0.167204 -0.018428 0.202929 -0.020929 -0.017212 -0.080148 1.000000 0.135348 0.070330 0.008170
7 -0.002514 -0.014099 0.085181 0.183301 -0.086300 -0.080915 0.135348 1.000000 -0.114413 -0.111642
8 0.097481 -0.135125 0.093723 -0.003853 0.177247 -0.004612 0.070330 -0.114413 1.000000 -0.153564
9 0.091020 -0.043797 -0.055824 -0.062889 -0.008612 0.243713 0.008170 -0.111642 -0.153564 1.000000
Is there a simple way to also get the corresponding p-values (ideally in pandas), as it is returned e.g. by scipy's kendalltau()?
Why not use the "method" argument of pandas.DataFrame.corr()? It accepts:
pearson : standard correlation coefficient.
kendall : Kendall Tau correlation coefficient.
spearman : Spearman rank correlation.
callable: callable with input two 1d ndarrays and returning a float.
from scipy.stats import kendalltau, pearsonr, spearmanr
def kendall_pval(x, y):
    return kendalltau(x, y)[1]

def pearsonr_pval(x, y):
    return pearsonr(x, y)[1]

def spearmanr_pval(x, y):
    return spearmanr(x, y)[1]
and then
corr = df.corr(method=pearsonr_pval)
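One caveat, per the pandas docs: with a callable method the returned matrix has 1 along the diagonal and is symmetric regardless of the callable's behaviour, so the diagonal entries are not real p-values. A small sketch to zero them out afterwards, assuming corr from above:
import numpy as np
np.fill_diagonal(corr.values, 0.0)  # the p-value of a column with itself is 0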
Probably just loop. It's basically what pandas does in the source code to generate the correlation matrix anyway:
import pandas as pd
import numpy as np
from scipy import stats
df_corr = pd.DataFrame()  # Correlation matrix
df_p = pd.DataFrame()     # Matrix of p-values

for x in df.columns:
    for y in df.columns:
        corr = stats.pearsonr(df[x], df[y])
        df_corr.loc[x, y] = corr[0]
        df_p.loc[x, y] = corr[1]
If you want to leverage the fact that this is symmetric, so you only need to calculate this for roughly half of them, then do:
mat = df.values.T
K = len(df.columns)
correl = np.empty((K, K), dtype=float)
p_vals = np.empty((K, K), dtype=float)

for i, ac in enumerate(mat):
    for j, bc in enumerate(mat):
        if i > j:
            continue
        corr = stats.pearsonr(ac, bc)
        # corr = stats.kendalltau(ac, bc)
        correl[i, j] = corr[0]
        correl[j, i] = corr[0]
        p_vals[i, j] = corr[1]
        p_vals[j, i] = corr[1]

df_p = pd.DataFrame(p_vals)
df_corr = pd.DataFrame(correl)
# pd.concat([df_corr, df_p], keys=['corr', 'p_val'])
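If you want the original column labels back on this half-matrix version, pass them to the DataFrame constructors and stack the two results, e.g.:
df_corr = pd.DataFrame(correl, index=df.columns, columns=df.columns)
df_p = pd.DataFrame(p_vals, index=df.columns, columns=df.columns)
combined = pd.concat([df_corr, df_p], keys=['corr', 'p_val'])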
This will work, assuming what you want is each column's correlation coefficient and p-value against a chosen reference column:
from scipy.stats import pearsonr

ref = df.columns[0]  # reference column to correlate against
pairs = {col: tuple(pearsonr(df[ref], df[col])) for col in df.columns}
df_result = pd.DataFrame(pairs, index=['Correlation_coefficient', 'P-value']).T
Does this work for you?
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
# call the correlation function, you could round the values if needed
df_c = df.corr().round(1)
# get the p values (subtracting the identity zeroes out the diagonal, which pandas fixes at 1)
pval = df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(*df_c.shape)
# set the p values, *** for less than 0.001, ** for less than 0.01, * for less than 0.05
p = pval.applymap(lambda x: ''.join(['*' for t in [0.001, 0.01, 0.05] if x <= t]))
# df_c2 below gives the dataframe with correlation coefficients and p values
df_c2 = df_c.astype(str) + p
# you could also plot the correlation matrix using sns.heatmap if you want
# mask the upper triangle
matrix = np.triu(df_c)
# convert to array for the heatmap annotations
df_c3 = df_c2.to_numpy()
# plot the heatmap
plt.figure(figsize=(13, 8))
sns.heatmap(df_c, annot=df_c3, fmt='', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=matrix)
I'm trying to systematically regress a couple of different dependent variables (countries) on the same set of inputs/independent variables, and want to do this in a looped fashion in Python using Sklearn. The dependent variables look like this:
Europe UK Japan USA Canada
Jan-10 10 13 39 42 16
Feb-10 13 16 48 51 19
Mar-10 15 18 54 57 21
Apr-10 12 15 45 48 18
May-10 11 14 42 45 17
while the independent variables look like this:
Input 1 Input 2 Input 3 Input 4
Jan-10 90 50 3 41
Feb-10 95 54 5 43
Mar-10 92 52 1 45
Apr-10 91 60 1 49
May-10 90 67 11 49
I find it easy to manually regress them and store predictions one at a time (i.e. Europe on all four inputs, then Japan, etc.) but haven't figured out how to program a single looped function that could do them all in one go. I suspect I may need a list/dictionary to store the dependent variables and call them one by one, but don't quite know how to write this in a Pythonic way.
The existing code for a single country looks like this:
X = pd.read_csv('countryinputs.csv')
countries = pd.read_csv('countryoutputs.csv')
y = countries['Europe']

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)
y_pred = regressor.predict(X)
Simply iterate through the column names, passing each name into a defined function. In fact, you can wrap the process in a dictionary comprehension and pass it into the DataFrame constructor to return a dataframe of predicted values (same shape as the original dataframe):
X = pd.DataFrame(...)
countries = pd.DataFrame(...)
def reg_proc(label):
    y = countries[label]
    regressor = LinearRegression()
    regressor.fit(X, y)
    y_pred = regressor.predict(X)
    return y_pred

pred_df = pd.DataFrame({lab: reg_proc(lab) for lab in countries.columns},
                       columns=countries.columns)
To demonstrate with random, seeded data where tools below would be your countries:
Data
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
np.random.seed(7172018)
tools = pd.DataFrame({'pandas': np.random.uniform(0, 1000, 50),
                      'r': np.random.uniform(0, 1000, 50),
                      'julia': np.random.uniform(0, 1000, 50),
                      'sas': np.random.uniform(0, 1000, 50),
                      'spss': np.random.uniform(0, 1000, 50),
                      'stata': np.random.uniform(0, 1000, 50)},
                     columns=['pandas', 'r', 'julia', 'sas', 'spss', 'stata'])

X = pd.DataFrame({'Input1': np.random.randn(50)*10,
                  'Input2': np.random.randn(50)*10,
                  'Input3': np.random.randn(50)*10,
                  'Input4': np.random.randn(50)*10})
Model
def reg_proc(label):
    y = tools[label]
    regressor = LinearRegression()
    regressor.fit(X, y)
    y_pred = regressor.predict(X)
    return y_pred

pred_df = pd.DataFrame({lab: reg_proc(lab) for lab in tools.columns},
                       columns=tools.columns)
print(pred_df.head(10))
# pandas r julia sas spss stata
# 0 547.631679 576.025733 682.390046 507.767567 246.020799 557.648181
# 1 577.334819 575.992992 280.579234 506.014191 443.044139 396.044620
# 2 430.494827 576.211105 541.096721 441.997575 386.309627 558.472179
# 3 440.662962 524.582054 406.849303 420.017656 508.701222 393.678200
# 4 588.993442 472.414081 453.815978 479.208183 389.744062 424.507541
# 5 520.215513 489.447248 670.708618 459.375294 314.008988 516.235188
# 6 515.266625 459.292370 477.485995 436.398180 446.777292 398.826234
# 7 423.930650 414.069751 629.444118 378.059735 448.760240 449.062734
# 8 549.769034 406.531405 653.557937 441.425445 348.725447 456.089921
# 9 396.826924 399.327683 717.285415 361.235709 444.830491 429.967976
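If you also need the fitted coefficients rather than just the in-sample predictions, the same comprehension pattern works; a small sketch on the demo data (rows are the four inputs):
coef_df = pd.DataFrame({lab: LinearRegression().fit(X, tools[lab]).coef_
                        for lab in tools.columns}, index=X.columns)
print(coef_df)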
I am using IPython to do some coding. When I open the notebook and run some code with SHIFT+ENTER, it runs. But after one or two times, it stops giving any output. Why is that? I have to shut down the notebook, open it again, and then it runs a few times before the same problem recurs.
Here is the code I have used.
Question 1: Rotational Invariance of PCA
I(1): Importing the data sets and plotting a scatter plot of the two.
In [1]:
# Changing the working directory
import os
os.getcwd()
path="/Users/file/"
os.chdir(path)
pwd=os.getcwd()
print(pwd)
# Importing the libraries
import pandas as pd
import numpy as np
import scipy as sp
# Mentioning the files to be imported
file=["2d-gaussian.csv","2d-gaussian-rotated.csv"]
# Importing the two csv files in pandas dataframes
XI=pd.read_csv(file[0],header=None)
XII=pd.read_csv(file[1],header=None)
#XI
XII
Out[5]:
0 1
0 1.372310 -2.111748
1 -0.397896 1.968246
2 0.336945 1.338646
3 1.983127 -2.462349
4 -0.846672 0.606716
5 0.582438 -0.645748
6 4.346416 -4.645564
7 0.830186 -0.599138
8 -2.460311 2.096945
9 -1.594642 2.828128
10 3.767641 -3.401645
11 0.455917 -0.224665
12 2.878315 -2.243932
13 -1.062223 0.142675
14 -0.698950 1.113589
15 -4.681619 4.289080
16 0.411498 -0.041293
17 0.276973 0.187699
18 1.500835 -0.284463
19 -0.387535 -0.265205
20 3.594708 -2.581400
21 2.263455 -2.660592
22 -1.686090 1.566998
23 1.381510 -0.944383
24 -0.085535 -1.697205
25 1.030609 -1.448967
26 3.647413 -3.322129
27 -3.474906 2.977695
28 -7.930797 8.506523
29 -0.931702 1.440784
... ... ...
70 4.433750 -2.515612
71 1.495646 -0.058674
72 -0.928938 0.605706
73 -0.890883 -0.005911
74 -2.245630 1.333171
75 -0.707405 0.121334
76 0.675536 -0.822801
77 1.975917 -1.757632
78 -1.239322 2.053495
79 -2.360047 1.842387
80 2.436710 -1.445505
81 0.348497 -0.635207
82 -1.423243 -0.017132
83 0.881054 -1.823523
84 0.052809 1.505141
85 -2.466735 2.406453
86 -0.499472 0.970673
87 4.489547 -4.443907
88 -2.000164 4.125330
89 1.833832 -1.611077
90 -0.944030 0.771001
91 -1.677884 1.920365
92 0.372318 -0.474329
93 -2.073669 2.020200
94 -0.131636 -0.844568
95 -1.011576 1.718216
96 -1.017175 -0.005438
97 5.677248 -4.572855
98 2.179323 -1.704361
99 1.029635 -0.420458
100 rows × 2 columns
The two raw csv files have been imported as data frames. Next we will concatenate both the dataframes into one dataframe to plot a combined scatter plot
In [6]:
# Joining two dataframes into one.
df_combined=pd.concat([XI,XII],axis=1,ignore_index=True)
df_combined
Out[6]:
0 1 2 3
0 2.463601 -0.522861 1.372310 -2.111748
1 -1.673115 1.110405 -0.397896 1.968246
2 -0.708310 1.184822 0.336945 1.338646
3 3.143426 -0.338861 1.983127 -2.462349
4 -1.027700 -0.169674 -0.846672 0.606716
5 0.868458 -0.044767 0.582438 -0.645748
6 6.358290 -0.211529 4.346416 -4.645564
7 1.010685 0.163375 0.830186 -0.599138
8 -3.222466 -0.256939 -2.460311 2.096945
9 -3.127371 0.872207 -1.594642 2.828128
10 5.069451 0.258798 3.767641 -3.401645
11 0.481244 0.163520 0.455917 -0.224665
12 3.621976 0.448577 2.878315 -2.243932
13 -0.851991 -0.650218 -1.062223 0.142675
14 -1.281659 0.293194 -0.698950 1.113589
15 -6.343242 -0.277567 -4.681619 4.289080
16 0.320172 0.261774 0.411498 -0.041293
17 0.063126 0.328573 0.276973 0.187699
18 1.262396 0.860105 1.500835 -0.284463
19 -0.086500 -0.461557 -0.387535 -0.265205
20 4.367168 0.716517 3.594708 -2.581400
21 3.481827 -0.280818 2.263455 -2.660592
22 -2.300280 -0.084211 -1.686090 1.566998
23 1.644655 0.309095 1.381510 -0.944383
24 1.139623 -1.260587 -0.085535 -1.697205
25 1.753325 -0.295824 1.030609 -1.448967
26 4.928210 0.230011 3.647413 -3.322129
27 -4.562678 -0.351581 -3.474906 2.977695
28 -11.622940 0.407100 -7.930797 8.506523
29 -1.677601 0.359976 -0.931702 1.440784
... ... ... ... ...
70 4.913941 1.356329 4.433750 -2.515612
71 1.099070 1.016093 1.495646 -0.058674
72 -1.085156 -0.228560 -0.928938 0.605706
73 -0.625769 -0.634129 -0.890883 -0.005911
74 -2.530594 -0.645206 -2.245630 1.333171
75 -0.586007 -0.414415 -0.707405 0.121334
76 1.059484 -0.104132 0.675536 -0.822801
77 2.640018 0.154351 1.975917 -1.757632
78 -2.328373 0.575707 -1.239322 2.053495
79 -2.971570 -0.366041 -2.360047 1.842387
80 2.745141 0.700888 2.436710 -1.445505
81 0.695584 -0.202735 0.348497 -0.635207
82 -0.994271 -1.018499 -1.423243 -0.017132
83 1.912425 -0.666426 0.881054 -1.823523
84 -1.026954 1.101637 0.052809 1.505141
85 -3.445865 -0.042626 -2.466735 2.406453
86 -1.039549 0.333189 -0.499472 0.970673
87 6.316906 0.032272 4.489547 -4.443907
88 -4.331379 1.502719 -2.000164 4.125330
89 2.435918 0.157511 1.833832 -1.611077
90 -1.212710 -0.122350 -0.944030 0.771001
91 -2.544347 0.171460 -1.677884 1.920365
92 0.598670 -0.072133 0.372318 -0.474329
93 -2.894802 -0.037809 -2.073669 2.020200
94 0.504119 -0.690281 -0.131636 -0.844568
95 -1.930254 0.499670 -1.011576 1.718216
96 -0.715406 -0.723096 -1.017175 -0.005438
97 7.247917 0.780923 5.677248 -4.572855
98 2.746180 0.335849 2.179323 -1.704361
99 1.025371 0.430754 1.029635 -0.420458
100 rows × 4 columns
Plotting two separate scatter plot of all the four columns onto one scatter diagram
In [ ]:
import matplotlib.pyplot as plt

# Function for scatter plot
def scatter_plot():
    # scatter for the first two columns (unrotated Gaussian data)
    plt.scatter(df_combined.iloc[:,0], df_combined.iloc[:,1], color='red', marker='+')
    # scatter for the rotated Gaussian data
    plt.scatter(df_combined.iloc[:,2], df_combined.iloc[:,3], color='green', marker='x')
    legend = plt.legend(loc='upper right')
    # set ranges of x and y axes
    plt.xlim([-12,12])
    plt.ylim([-12,12])
    plt.show()

# Function call
scatter_plot()
In [ ]:
def plot_me1():
    # create figure and axes
    fig = plt.figure()
    # split the page into a 1x1 array of subplots and put me in the first one (111)
    # (as a matter of fact, the only one)
    ax = fig.add_subplot(111)
    # plots scatter for x, y1
    ax.scatter(df_combined.iloc[:,0], df_combined.iloc[:,1], color='red', marker='+', s=100)
    # plots scatter for x, y2
    ax.scatter(df_combined.iloc[:,2], df_combined.iloc[:,3], color='green', marker='x', s=100)
    plt.xlim([-12,12])
    plt.ylim([-12,12])
    plt.show()

plot_me1()
You should not use plt.show() in the notebook. This will open an external window that blocks the evaluation of your cell.
Instead, begin your notebooks with %matplotlib inline or the cool new %matplotlib notebook (the latter requires matplotlib >= 1.4.3 and IPython >= 3.0).
After the evaluation of each cell, the (still open) figure object is automatically shown in your notebook.
This minimal code example works in the notebook. Note that it does not call plt.show():
%matplotlib inline
import matplotlib.pyplot as plt
x = [1,2,3]
y = [3,2,1]
_ = plt.plot(x,y)
%matplotlib inline simply displays the image.
%matplotlib notebook was added recently and offers many of the cool features (zooming, measuring, ...) of the interactive backends.
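For instance, a minimal interactive sketch (same toy data as above):
%matplotlib notebook
import matplotlib.pyplot as plt
x = [1,2,3]
y = [3,2,1]
plt.plot(x, y)  # rendered with pan/zoom controls in the notebook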