Python: Finding multiple linear trend lines in a scatter plot

I have the following pandas dataframe -
Atomic Number R C
0 2.0 49.0 0.040306
1 3.0 205.0 0.209556
2 4.0 140.0 0.107296
3 5.0 117.0 0.124688
4 6.0 92.0 0.100020
5 7.0 75.0 0.068493
6 8.0 66.0 0.082244
7 9.0 57.0 0.071332
8 10.0 51.0 0.045725
9 11.0 223.0 0.217770
10 12.0 172.0 0.130719
11 13.0 182.0 0.179953
12 14.0 148.0 0.147929
13 15.0 123.0 0.102669
14 16.0 110.0 0.120729
15 17.0 98.0 0.106872
16 18.0 88.0 0.061996
17 19.0 277.0 0.260485
18 20.0 223.0 0.164312
19 33.0 133.0 0.111359
20 36.0 103.0 0.069348
21 37.0 298.0 0.270709
22 38.0 245.0 0.177368
23 54.0 124.0 0.079491
The trend between R and C is generally a linear one. What I would like to do, if possible, is find an exhaustive list of all the possible combinations of 3 or more points and what their trends are with scipy.stats.linregress, so that I can find the groups of points that fit a line best.
Which would ideally look something like this for the data (Source), but I am looking for all the other possible trends too.
So the question: how do I feed all 16,776,915 possible combinations (sum_(i=3)^24 binomial(24, i)) of 3 or more points into linregress, and is it even doable without a ton of code?
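For reference, the brute-force enumeration described in the question can be written in a few lines with itertools.combinations and scipy.stats.linregress. It is only practical for small point counts, so this sketch (using the first 9 (R, C) pairs from the dataframe above as a sample) filters for near-perfect fits:

```python
import itertools
import numpy as np
from scipy.stats import linregress

# first 9 (R, C) pairs from the dataframe above
r = np.array([49.0, 205.0, 140.0, 117.0, 92.0, 75.0, 66.0, 57.0, 51.0])
c = np.array([0.040306, 0.209556, 0.107296, 0.124688, 0.100020,
              0.068493, 0.082244, 0.071332, 0.045725])

best = []
for k in range(3, len(r) + 1):                      # subsets of size 3..n
    for idx in itertools.combinations(range(len(r)), k):
        fit = linregress(r[list(idx)], c[list(idx)])
        if abs(fit.rvalue) > 0.99:                  # keep near-perfect linear subsets
            best.append((idx, fit.rvalue))
```

For all 24 points the same loop would have to evaluate 16,776,915 subsets, which is why the answer below takes a different approach.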

My proposed solution is based on the RANSAC algorithm. It is a method for fitting a mathematical model (e.g. a line) to data containing many outliers, and it is one specific method from the field of robust regression.
My solution below first fits a line with RANSAC. Then you remove the data points close to this line from your data set (which is the same as keeping only the outliers), fit RANSAC again, remove data, and so on until only very few points are left.
Such approaches always have parameters which are data dependent (e.g. noise level or proximity of the lines). In the following solution, MIN_SAMPLES and residual_threshold are parameters which might require some adaptation to the structure of your data:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

MIN_SAMPLES = 3

x = np.linspace(0, 2, 100)

xs, ys = [], []

# generate points for three lines described by a and b,
# we also add some noise:
for a, b in [(1.0, 2), (0.5, 1), (1.2, -1)]:
    xs.extend(x)
    ys.extend(a * x + b + .1 * np.random.randn(len(x)))

xs = np.array(xs)
ys = np.array(ys)
plt.plot(xs, ys, "r.")

colors = "rgbky"
idx = 0

while len(xs) > MIN_SAMPLES:
    # build design matrix for linear regressor
    X = np.ones((len(xs), 2))
    X[:, 1] = xs

    ransac = linear_model.RANSACRegressor(
        residual_threshold=.3, min_samples=MIN_SAMPLES
    )
    res = ransac.fit(X, ys)

    # vector of boolean values, describes which points belong
    # to the fitted line:
    inlier_mask = ransac.inlier_mask_

    # plot point cloud:
    xinlier = xs[inlier_mask]
    yinlier = ys[inlier_mask]

    # cycle through colors:
    color = colors[idx % len(colors)]
    idx += 1
    plt.plot(xinlier, yinlier, color + "*")

    # only keep the outliers:
    xs = xs[~inlier_mask]
    ys = ys[~inlier_mask]

plt.show()
In the following plot, points shown as stars belong to the clusters detected by my code. You also see a few points depicted as circles, which are the points remaining after the iterations. The few black stars form a cluster which you could get rid of by increasing MIN_SAMPLES and/or residual_threshold.

Related

Seaborn custom axis scale: matplotlib.scale.FuncScale

I'm trying to figure out how to get a custom scale for my axis. My x-axis goes from 0 to 1,000,000 in 100,000 step increments, but I want to scale each of these numbers by 1/100, so that they go from 0 to 1,000 in 100 step increments. I think matplotlib.scale.FuncScale is what I need, but I'm having trouble getting it to work.
Here's what the plot currently looks like:
My code looks like this:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
dataPlot = pd.DataFrame({"plot1" : [1, 2, 3], "plot2" : [4, 5, 6], "plot3" : [7, 8, 9]})
ax = sns.lineplot(data = dataPlot, dashes = False, palette = ["blue", "red", "green"])
ax.set_xlim(1, numRows)
ax.set_xticks(range(0, numRows, 100000))
plt.ticklabel_format(style='plain')
plt.scale.FuncScale("xaxis", ((lambda x : x / 1000), (lambda y : y * 1000)))
When I run this code as written, I get AttributeError: module 'matplotlib.pyplot' has no attribute 'scale', so I tried adding import matplotlib as mpl to the top of the code and changing the last line to mpl.scale.FuncScale("xaxis", ((lambda x : x / 1000), (lambda y : y * 1000))). That actually ran without error, but it didn't change anything.
How can I get this to properly scale the axis?
Based on the clarification from the question comments, a straightforward solution is to scale the x-axis data in the dataframe (the x data in the question's case being the df index) and then plot.
I'm using example data since the code from the question wasn't runnable on its own.
The starting x range is 0 to 100, then scaled to 0 to 10, but that's equivalent to any other starting range and scaling.
1st the default df.plot: (just as reference)
import pandas as pd
import numpy as np
arr = np.arange(0, 101, 1) * 1.5
df = pd.DataFrame(arr, columns=['y_data'])
print(df)
y_data
0 0.0
1 1.5
2 3.0
3 4.5
4 6.0
.. ...
96 144.0
97 145.5
98 147.0
99 148.5
100 150.0
df.plot()
Note that per default df.plot uses the index as x-axis.
2nd scaling the x-data in the dataframe:
The interims dfs are only displayed to follow along.
Preparation
df.reset_index(inplace=True)
Getting the original index data as a column to further work with (see scaling below).
index y_data
0 0 0.0
1 1 1.5
2 2 3.0
3 3 4.5
4 4 6.0
.. ... ...
96 96 144.0
97 97 145.5
98 98 147.0
99 99 148.5
100 100 150.0
df = df.rename(columns = {'index':'x_data'}) # just to be more explicit
x_data y_data
0 0 0.0
1 1 1.5
2 2 3.0
3 3 4.5
4 4 6.0
.. ... ...
96 96 144.0
97 97 145.5
98 98 147.0
99 99 148.5
100 100 150.0
Scaling
df['x_data'] = df['x_data'].apply(lambda x: x/10)
x_data y_data
0 0.0 0.0
1 0.1 1.5
2 0.2 3.0
3 0.3 4.5
4 0.4 6.0
.. ... ...
96 9.6 144.0
97 9.7 145.5
98 9.8 147.0
99 9.9 148.5
100 10.0 150.0
3rd df.plot with specific columns:
df.plot(x='x_data', y = 'y_data')
Via x=, a specific column instead of the default (the index) is used as the x-axis.
Note that the y data hasn't changed but the x-axis is now scaled compared to the "1st the default df.plot" above.
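If you want to leave the data untouched and only change what the tick labels show, matplotlib's FuncFormatter is an alternative worth sketching (this relabels the ticks rather than rescaling the data; the 1/1000 factor here mirrors the question's goal):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np

x = np.arange(0, 1_000_001, 100_000)
fig, ax = plt.subplots()
ax.plot(x, x * 1.5)

# display each x tick divided by 1000; the underlying data is unchanged
fmt = FuncFormatter(lambda val, pos: f"{val / 1000:g}")
ax.xaxis.set_major_formatter(fmt)
```

With this, a tick at data value 500,000 is rendered as "500" while the plotted line itself is unaffected.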

How to calculate the similarity measure of text document?

I have CSV file that looks like:
idx messages
112 I have a car and it is blue
114 I have a bike and it is red
115 I don't have any car
117 I don't have any bike
I would like code that reads the file and computes the similarity between the messages.
I have looked into many posts regarding this, such as 1 2 3 4, but either they are hard for me to understand or not exactly what I want.
Based on some posts and webpages, "a simple and effective one is Cosine similarity", or the "Universal sentence encoder", or "Levenshtein distance".
It would be great if you can provide your help with code that I can run in my side as well. Thanks
I don't know that calculations like this can be vectorized particularly well, so looping is simplest. At least use the fact that your calculation is symmetric and the diagonal is always 100 to cut down on the number of calculations you perform.
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz

# df is the DataFrame read from your CSV, with columns 'idx' and 'messages'
K = len(df)
similarity = np.empty((K, K), dtype=float)

for i, ac in enumerate(df['messages']):
    for j, bc in enumerate(df['messages']):
        if i > j:
            continue
        if i == j:
            sim = 100
        else:
            # use whatever metric you want here for comparing 2 strings
            sim = fuzz.ratio(ac, bc)
        similarity[i, j] = sim
        similarity[j, i] = sim

df_sim = pd.DataFrame(similarity, index=df.idx, columns=df.idx)
Output: df_sim
idx     112    114    115    117
idx
112   100.0   78.0   51.0   50.0
114    78.0  100.0   47.0   54.0
115    51.0   47.0  100.0   83.0
117    50.0   54.0   83.0  100.0

Interpolate Temperature Data On Urban Area Using Cartopy

I'm trying to interpolate temperature data observed in an urban area formed by 5 locations. I am using cartopy to interpolate and draw the map; however, when I run the script the temperature interpolation is not shown and I only get the layer of the urban area with the color palette. Can someone help me fix this error? The shapefile links are:
https://www.dropbox.com/s/0u76k3yegvr09sx/LimiteAMG.shp?dl=0
https://www.dropbox.com/s/yxsmm3v2ey3ngsp/LimiteAMG.cpg?dl=0
https://www.dropbox.com/s/yx05n31dfkggbb6/LimiteAMG.dbf?dl=0
https://www.dropbox.com/s/a6nk0xczgjeen2d/LimiteAMG.prj?dl=0
https://www.dropbox.com/s/royw7s51n2f0a6x/LimiteAMG.qpj?dl=0
https://www.dropbox.com/s/7k44dcl1k5891qc/LimiteAMG.shx?dl=0
Data
Lat Lon tmax
0 20.8208 -103.4434 22.8
1 20.7019 -103.4728 17.7
2 20.6833 -103.3500 24.9
3 20.6280 -103.4261 NaN
4 20.7205 -103.3172 26.4
5 20.7355 -103.3782 25.7
6 20.6593 -103.4136 NaN
7 20.6740 -103.3842 25.8
8 20.7585 -103.3904 NaN
9 20.6230 -103.4265 NaN
10 20.6209 -103.5004 NaN
11 20.6758 -103.6439 24.5
12 20.7084 -103.3901 24.0
13 20.6353 -103.3994 23.0
14 20.5994 -103.4133 25.0
15 20.6302 -103.3422 NaN
16 20.7400 -103.3122 23.0
17 20.6061 -103.3475 NaN
18 20.6400 -103.2900 23.0
19 20.7248 -103.5305 24.0
20 20.6238 -103.2401 NaN
21 20.4753 -103.4451 NaN
Code:
import cartopy
import cartopy.crs as ccrs
from matplotlib.colors import BoundaryNorm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import cartopy.io.shapereader as shpreader
from metpy.calc import get_wind_components
from metpy.cbook import get_test_data
from metpy.gridding.gridding_functions import interpolate, remove_nan_observation
from metpy.plots import add_metpy_logo
from metpy.units import units
to_proj = ccrs.PlateCarree()
data=pd.read_csv('/home/borisvladimir/Documentos/Datos/EMAs/EstacionesZMG/RedZMG.csv',usecols=(1,2,3),names=['Lat','Lon','tmax'],na_values=-99999,header=0)
fname='/home/borisvladimir/Dropbox/Diversos/Shapes/LimiteAMG.shp'
adm1_shapes = list(shpreader.Reader(fname).geometries())
lon = data['Lon'].values
lat = data['Lat'].values
xp, yp, _ = to_proj.transform_points(ccrs.Geodetic(), lon, lat).T
x_masked, y_masked, t = remove_nan_observations(xp, yp, data['tmax'].values)
#Interpola temp usando Cressman
tempx, tempy, temp = interpolate(x_masked, y_masked, t, interp_type='cressman', minimum_neighbors=3, search_radius=400000, hres=35000)
temp = np.ma.masked_where(np.isnan(temp), temp)
levels = list(range(-20, 20, 1))
cmap = plt.get_cmap('viridis')
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)
fig = plt.figure(figsize=(15, 10))
view = fig.add_subplot(1, 1, 1, projection=to_proj)
view.add_geometries(adm1_shapes, ccrs.PlateCarree(),edgecolor='black', facecolor='white', alpha=0.5)
view.set_extent([-103.8, -103, 20.3, 21.099 ], ccrs.PlateCarree())
ZapLon,ZapLat=-103.50,20.80
GuadLon,GuadLat=-103.33,20.68
TonaLon,TonaLat=-103.21,20.62
TlaqLon,TlaqLat=-103.34,20.59
TlajoLon,TlajoLat=-103.44,20.47
plt.text(ZapLon,ZapLat,'Zapopan',transform=ccrs.Geodetic())
plt.text(GuadLon,GuadLat,'Guadalajara',transform=ccrs.Geodetic())
plt.text(TonaLon,TonaLat,'Tonala',transform=ccrs.Geodetic())
plt.text(TlaqLon,TlaqLat,'Tlaquepaque',transform=ccrs.Geodetic())
plt.text(TlajoLon,TlajoLat,'Tlajomulco',transform=ccrs.Geodetic())
mmb = view.pcolormesh(tempx, tempy, temp,transform=ccrs.PlateCarree(),cmap=cmap, norm=norm)
plt.colorbar(mmb, shrink=.4, pad=0.02, boundaries=levels)
plt.show()
The problem is in the call to MetPy's interpolate function. With hres=35000, it generates a grid spaced at 35 km. However, your data points are spaced much more closely than that; together, that results in a generated grid with only two points, shown as the red points below (black points are the original stations with non-masked data):
Both of those grid points fall outside the bounds of your data points, so they end up masked. If instead we set hres to something much lower, say 5 km (i.e. 5000), a much more sensible result comes out:
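As a sanity check of the grid-resolution diagnosis without MetPy, here is a sketch of the same interpolation with scipy.interpolate.griddata on the non-NaN stations from the question, using a grid spacing of about 0.05 degrees (~5 km, analogous to hres=5000; griddata's linear method stands in for Cressman here):

```python
import numpy as np
from scipy.interpolate import griddata

# station coordinates and temperatures from the question (NaN rows dropped)
lon = np.array([-103.4434, -103.4728, -103.3500, -103.3172, -103.3782,
                -103.3842, -103.6439, -103.3901, -103.3994, -103.4133,
                -103.3122, -103.2900, -103.5305])
lat = np.array([20.8208, 20.7019, 20.6833, 20.7205, 20.7355,
                20.6740, 20.6758, 20.7084, 20.6353, 20.5994,
                20.7400, 20.6400, 20.7248])
tmax = np.array([22.8, 17.7, 24.9, 26.4, 25.7, 25.8, 24.5,
                 24.0, 23.0, 25.0, 23.0, 23.0, 24.0])

# ~0.05 deg spacing gives a grid fine enough to resolve the stations
gx, gy = np.meshgrid(np.arange(lon.min(), lon.max(), 0.05),
                     np.arange(lat.min(), lat.max(), 0.05))
grid_t = griddata((lon, lat), tmax, (gx, gy), method='linear')
```

At 35 km spacing the same extent would yield only a couple of grid nodes, which is exactly why the original pcolormesh came out empty.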

p_value is 0 when I use scipy.stats.kstest() for large dataset

I have a unique series with their frequencies and want to know whether they come from a normal distribution, so I ran a Kolmogorov–Smirnov test using scipy.stats.kstest. Since, to my knowledge, the function takes only a list, I expand the frequencies into a list before passing it to the function. However, the result is weird, since pvalue=0.0.
The histogram of the original data and my code follow:
Histogram of my dataset
[In]: frequencies = mp[['c','v']]
[In]: print frequencies
c v
31 3475.8 18.0
30 3475.6 12.0
29 3475.4 13.0
28 3475.2 8.0
20 3475.0 49.0
14 3474.8 69.0
13 3474.6 79.0
12 3474.4 78.0
11 3474.2 78.0
7 3474.0 151.0
6 3473.8 157.0
5 3473.6 129.0
2 3473.4 149.0
1 3473.2 162.0
0 3473.0 179.0
3 3472.8 145.0
4 3472.6 139.0
8 3472.4 95.0
9 3472.2 103.0
10 3472.0 125.0
15 3471.8 56.0
16 3471.6 75.0
17 3471.4 70.0
18 3471.2 70.0
19 3471.0 57.0
21 3470.8 36.0
22 3470.6 22.0
23 3470.4 20.0
24 3470.2 12.0
25 3470.0 23.0
26 3469.8 13.0
27 3469.6 17.0
32 3469.4 6.0
[In]: testData = map(lambda x: np.repeat(x[0], int(x[1])), frequencies.values)
[In]: testData = list(itertools.chain.from_iterable(testData))
[In]: print len(testData)
2415
[In]: print np.unique(testData)
[ 3469.4 3469.6 3469.8 3470. 3470.2 3470.4 3470.6 3470.8 3471.
3471.2 3471.4 3471.6 3471.8 3472. 3472.2 3472.4 3472.6 3472.8
3473. 3473.2 3473.4 3473.6 3473.8 3474. 3474.2 3474.4 3474.6
3474.8 3475. 3475.2 3475.4 3475.6 3475.8]
[In]: scs.kstest(testData, 'norm')
KstestResult(statistic=1.0, pvalue=0.0)
Thanks everyone at first.
Using 'norm' for your input will check if the distribution of your data is the same as scipy.stats.norm.cdf with default parameters: loc=0, scale=1.
Instead, you will need to fit a normal distribution to your data and then check if the data and the distribution are the same using the Kolmogorov–Smirnov test.
import numpy as np
from scipy.stats import norm, kstest
import matplotlib.pyplot as plt
freqs = [[3475.8, 18.0], [3475.6, 12.0], [3475.4, 13.0], [3475.2, 8.0], [3475.0, 49.0],
[3474.8, 69.0], [3474.6, 79.0], [3474.4, 78.0], [3474.2, 78.0], [3474.0, 151.0],
[3473.8, 157.0], [3473.6, 129.0], [3473.4, 149.0], [3473.2, 162.0], [3473.0, 179.0],
[3472.8, 145.0], [3472.6, 139.0], [3472.4, 95.0], [3472.2, 103.0], [3472.0, 125.0],
[3471.8, 56.0], [3471.6, 75.0], [3471.4, 70.0], [3471.2, 70.0], [3471.0, 57.0],
[3470.8, 36.0], [3470.6, 22.0], [3470.4, 20.0], [3470.2, 12.0], [3470.0, 23.0],
[3469.8, 13.0], [3469.6, 17.0], [3469.4, 6.0]]
data = np.hstack([np.repeat(x,int(f)) for x,f in freqs])
loc, scale = norm.fit(data)
# create a normal distribution with loc and scale
n = norm(loc=loc, scale=scale)
Plot the fit of the norm to the data:
plt.hist(data, bins=np.arange(data.min(), data.max()+0.2, 0.2), rwidth=0.5)
x = np.arange(data.min(), data.max()+0.2, 0.2)
plt.plot(x, 350*n.pdf(x))
plt.show()
This is not a terribly good fit, mostly due to the long tail on the left. However, you can now run a proper Kolmogorov–Smirnov test using the cdf of the fitted normal distribution:
kstest(data, n.cdf)
# returns:
KstestResult(statistic=0.071276854859734784, pvalue=4.0967451653273201e-11)
So we are still rejecting the null hypothesis of the distribution that produced the data being the same as the fitted distribution.
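The effect of fitting first can be checked on synthetic data (a fabricated normal sample on the same scale as the question's measurements, purely for illustration):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=3473.0, scale=1.2, size=500)

# against the standard normal (loc=0, scale=1) every point sits in the
# far right tail, so the KS statistic saturates near 1 and the p-value at 0:
res_std = kstest(sample, 'norm')

# against a normal fitted to the data, the test becomes meaningful:
loc, scale = norm.fit(sample)
res_fit = kstest(sample, norm(loc=loc, scale=scale).cdf)
```

This reproduces the question's KstestResult(statistic=1.0, pvalue=0.0) for the unfitted case, while the fitted case (on truly normal data) does not reject.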

How to make axis tick labels visible on the other side of the plot in gridspec?

Plotting my favourite example dataframe,which looks like this:
x val1 val2 val3
0 0.0 10.0 NaN NaN
1 0.5 10.5 NaN NaN
2 1.0 11.0 NaN NaN
3 1.5 11.5 NaN 11.60
4 2.0 12.0 NaN 12.08
5 2.5 12.5 12.2 12.56
6 3.0 13.0 19.8 13.04
7 3.5 13.5 13.3 13.52
8 4.0 14.0 19.8 14.00
9 4.5 14.5 14.4 14.48
10 5.0 NaN 19.8 14.96
11 5.5 15.5 15.5 15.44
12 6.0 16.0 19.8 15.92
13 6.5 16.5 16.6 16.40
14 7.0 17.0 19.8 18.00
15 7.5 17.5 17.7 NaN
16 8.0 18.0 19.8 NaN
17 8.5 18.5 18.8 NaN
18 9.0 19.0 19.8 NaN
19 9.5 19.5 19.9 NaN
20 10.0 20.0 19.8 NaN
I have two subplots, for some other reasons it is best for me to use gridspec. The plotting code is as follows (it is quite comprehensive, so I would like to avoid major changes in the code that otherwise works perfectly and just doesn't do one unimportant detail):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
import matplotlib as mpl
df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
# setting limits for x and y
ylimit=(0,10)
yticks1=np.arange(0,11,1)
xlimit1=(10,20)
xticks1 = np.arange(10,21,1)
# general plot formatting (axes colour, background etc.)
plt.style.use('ggplot')
plt.rc('axes',edgecolor='black')
plt.rc('axes', facecolor = 'white')
plt.rc('grid', color = 'grey')
plt.rc('grid', alpha = 0.3) # alpha is percentage of transparency
colours = ['g','b','r']
title1 = 'The plot'
# GRIDSPEC INTRO - rows, cols, distance of individual plots
fig = plt.figure(figsize=(6,4))
gs=gridspec.GridSpec(1,2, hspace=0.15, wspace=0.08,width_ratios=[1,1])
## SUBPLOT of GRIDSPEC with lines
# the first plot
axes1 = plt.subplot(gs[0,0])
for count, vals in enumerate(df.columns.values[1:]):
    X = np.asarray(df[vals])
    h = vals
    p1 = plt.plot(X, df.index, color=colours[count], linestyle='-', linewidth=1.5, label=h)
# formatting
p1 = plt.ylim(ylimit)
p1 = plt.yticks(yticks1, yticks1, rotation=0)
p1 = axes1.yaxis.set_minor_locator(mpl.ticker.MultipleLocator(0.1))
p1 = plt.setp(axes1.get_yticklabels(),fontsize=8)
p1 = plt.gca().invert_yaxis()
p1 = plt.ylabel('x [unit]', fontsize=14)
p1 = plt.xlabel("Value [unit]", fontsize=14)
p1 = plt.tick_params('both', length=5, width=1, which='minor', direction = 'in')
p1 = axes1.xaxis.set_minor_locator(mpl.ticker.MultipleLocator(0.1))
p1 = plt.xlim(xlimit1)
p1 = plt.xticks(xticks1, xticks1, rotation=0)
p1 = plt.setp(axes1.get_xticklabels(),fontsize=8)
p1 = plt.legend(loc='best',fontsize = 8, ncol=2) #
# the second plot (something random)
axes2 = plt.subplot(gs[0,1])
for count, vals in enumerate(df.columns.values[1:]):
    nonans = df[vals].dropna()
    result = nonans - 0.5
    p2 = plt.plot(result, nonans.index, color=colours[count], linestyle='-', linewidth=1.5)
p2 = plt.ylim(ylimit)
p2 = plt.yticks(yticks1, yticks1, rotation=0)
p2 = axes2.yaxis.set_minor_locator(mpl.ticker.MultipleLocator(0.1))
p2 = plt.gca().invert_yaxis()
p2 = plt.xlim(xlimit1)
p2 = plt.xticks(xticks1, xticks1, rotation=0)
p2 = axes2.xaxis.set_minor_locator(mpl.ticker.MultipleLocator(0.1))
p2 = plt.setp(axes2.get_xticklabels(),fontsize=8)
p2 = plt.xlabel("Other value [unit]", fontsize=14)
p2 = plt.tick_params('x', length=5, width=1, which='minor', direction = 'in')
p2 = plt.setp(axes2.get_yticklabels(), visible=False)
fig.suptitle(title1, size=16)
plt.show()
However, is it possible to show the y tick labels of the second subplot on the right hand side? The current code produces this:
And I would like to know if there is an easy way to get this:
No, ok, found out it is not precisely what I wanted.
I want the TICKS to be on BOTH sides, just the LABELS to be on the right. The solution above removes my ticks from the left side of the subplot, which doesn't look good. However, this answer seems to give the right solution :)
To sum up:
to get the ticks on both sides and labels on the right, this is what fixes it:
axes2.yaxis.tick_right()
axes2.yaxis.set_ticks_position('both')
And if you need the same for the x axis, it's axes2.xaxis.tick_top().
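A minimal self-contained sketch of the two calls together (using a throwaway two-panel figure rather than the gridspec code above):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

fig, (axes1, axes2) = plt.subplots(1, 2, sharey=True)
axes1.plot([0, 1, 2], [10, 12, 11])
axes2.plot([0, 1, 2], [9, 13, 10])

axes2.yaxis.tick_right()                # move the tick labels to the right
axes2.yaxis.set_ticks_position('both')  # keep the tick marks on both sides
```

Note the order matters: tick_right() moves both ticks and labels to the right, and the subsequent set_ticks_position('both') re-enables the tick marks on the left without touching the labels.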
try something like
axes2.yaxis.tick_right()
Just look around Python Matplotlib Y-Axis ticks on Right Side of Plot.
