DBSCAN for clustering of geographic location data - python

I have a dataframe with latitude and longitude pairs.
Here is what my dataframe looks like:
order_lat order_long
0 19.111841 72.910729
1 19.111342 72.908387
2 19.111342 72.908387
3 19.137815 72.914085
4 19.119677 72.905081
5 19.119677 72.905081
6 19.119677 72.905081
7 19.120217 72.907121
8 19.120217 72.907121
9 19.119677 72.905081
10 19.119677 72.905081
11 19.119677 72.905081
12 19.111860 72.911346
13 19.111860 72.911346
14 19.119677 72.905081
15 19.119677 72.905081
16 19.119677 72.905081
17 19.137815 72.914085
18 19.115380 72.909144
19 19.115380 72.909144
20 19.116168 72.909573
21 19.119677 72.905081
22 19.137815 72.914085
23 19.137815 72.914085
24 19.112955 72.910102
25 19.112955 72.910102
26 19.112955 72.910102
27 19.119677 72.905081
28 19.119677 72.905081
29 19.115380 72.909144
30 19.119677 72.905081
31 19.119677 72.905081
32 19.119677 72.905081
33 19.119677 72.905081
34 19.119677 72.905081
35 19.111860 72.911346
36 19.111841 72.910729
37 19.131674 72.918510
38 19.119677 72.905081
39 19.111860 72.911346
40 19.111860 72.911346
41 19.111841 72.910729
42 19.111841 72.910729
43 19.111841 72.910729
44 19.115380 72.909144
45 19.116625 72.909185
46 19.115671 72.908985
47 19.119677 72.905081
48 19.119677 72.905081
49 19.119677 72.905081
50 19.116183 72.909646
51 19.113827 72.893833
52 19.119677 72.905081
53 19.114100 72.894985
54 19.107491 72.901760
55 19.119677 72.905081
I want to cluster the points which are nearest to each other (within 200 meters). The following is my distance matrix.
from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))
array([[ 0. , 0.2522482 , 0.2522482 , ..., 1.67313071,
1.05925366, 1.05420922],
[ 0.2522482 , 0. , 0. , ..., 1.44111548,
0.81742536, 0.98978355],
[ 0.2522482 , 0. , 0. , ..., 1.44111548,
0.81742536, 0.98978355],
...,
[ 1.67313071, 1.44111548, 1.44111548, ..., 0. ,
1.02310118, 1.22871515],
[ 1.05925366, 0.81742536, 0.81742536, ..., 1.02310118,
0. , 1.39923529],
[ 1.05420922, 0.98978355, 0.98978355, ..., 1.22871515,
1.39923529, 0. ]])
Then I am applying the DBSCAN clustering algorithm to the distance matrix.
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
y_db = db.fit_predict(distance_matrix)
I don't know how to choose the eps and min_samples values. It puts points which are way too far apart (approx. 2 km) into one cluster. Is that because it calculates the Euclidean distance while clustering? Please help.

You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps value is still 2km, but it's divided by 6371 to convert it to radians. Also, notice that .fit() takes the coordinates in radian units for the haversine metric.
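For completeness, here is a minimal sketch of how that call fits into the question's setup; it assumes the question's dataframe is named df with the columns order_lat and order_long, and uses an eps of 0.2 km for the 200 m neighbourhood the question asks for (min_samples=5 is kept from the question):
import numpy as np
from sklearn.cluster import DBSCAN

coordinates = df[['order_lat', 'order_long']].values

# haversine works on radians; eps must be expressed in radians as well
kms_per_radian = 6371.0088
epsilon = 0.2 / kms_per_radian  # 200 m

db = DBSCAN(eps=epsilon, min_samples=5, algorithm='ball_tree',
            metric='haversine').fit(np.radians(coordinates))
df['cluster'] = db.labels_  # -1 marks noise points
print(df['cluster'].value_counts())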

DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java) - scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors).
But apparently, you can afford to precompute pairwise distances, so this is not (yet) an issue.
However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)
uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.
See the documentation of DBSCAN (emphasis added):
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
similar for fit_predict:
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
In other words, you need to do
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
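Applied to the question's code, a minimal sketch would be (assuming the distance_matrix computed above is in kilometres, so 0.2 corresponds to the 200 m radius):
from sklearn.cluster import DBSCAN

# the matrix entries are distances in km, so eps is given in km as well
db = DBSCAN(eps=0.2, min_samples=5, metric='precomputed')
labels = db.fit_predict(distance_matrix)  # -1 = noise, 0..k = cluster ids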

I don't know what implementation of haversine you're using, but it looks like it returns results in km, so eps should be 0.2, not 2, for 200 m.
For the min_samples parameter, that depends on what your expected output is. Here are a couple of examples. My outputs are using an implementation of haversine based on this answer which gives a distance matrix similar, but not identical to yours.
This is with db = DBSCAN(eps=0.2, min_samples=5)
[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]
This creates three clusters, 0, 1 and 2, and a lot of the samples don't fall into a cluster with at least 5 members and so are not assigned to a cluster (shown as -1).
Trying again with a smaller min_samples value:
db = DBSCAN(eps=0.2, min_samples=2)
[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8
8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3
7 -1 3 -1 -1 3]
Here most of the samples are within 200m of at least one other sample and so fall into one of eight clusters 0 to 7.
Edited to add
It looks like @Anony-Mousse is right, though I didn't see anything wrong in my results. For the sake of contributing something, here's the code I was using to see the clusters:
from math import radians, cos, sin, asin, sqrt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd

def haversine(lonlat1, lonlat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r

X = pd.read_csv('dbscan_test.csv')
distance_matrix = squareform(pdist(X, (lambda u, v: haversine(u, v))))

db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed')  # using "precomputed" as recommended by @Anony-Mousse
y_db = db.fit_predict(distance_matrix)

X['cluster'] = y_db

plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()

@eos gives the best answer I think - as well as making use of the haversine distance (the most relevant distance measure in this case), it avoids the need to generate a precomputed distance matrix. If you create a distance matrix then you need to calculate the pairwise distances for every combination of points (although you can obviously save a bit of time by taking advantage of the fact that your distance metric is symmetric).
If you just supply DBSCAN with a distance measure and use the ball_tree algorithm though, it can avoid the need to calculate every possible distance. This is because the ball tree algorithm can use the triangular inequality theorem to reduce the number of candidates that need to be checked to find the nearest neighbours of a data point (this is the biggest job in DBSCAN).
The triangle inequality states:
d(a, c) <= d(a, b) + d(b, c)
...and from it follows the lower bound d(a, c) >= |d(a, b) - d(b, c)|. So if a point n is at distance x from a reference point p, and another point q is at distance y from p, then whenever |x - y| is already greater than our neighbour radius we know that q must be too far away from n to be considered a neighbour, so we don't need to calculate their distance at all.
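As a small illustration of that pruning bound (a hedged sketch of the idea, not scikit-learn's actual ball-tree code):
def can_skip(dist_p_to_n, dist_p_to_q, eps):
    # Reverse triangle inequality: d(n, q) >= |d(p, n) - d(p, q)|.
    # If that lower bound already exceeds eps, n cannot be a neighbour
    # of q, so the exact distance never needs to be computed.
    return abs(dist_p_to_n - dist_p_to_q) > eps

print(can_skip(0.05, 0.90, eps=0.2))  # True: safe to prune
print(can_skip(0.75, 0.90, eps=0.2))  # False: must compute d(n, q)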
Read more about how ball trees work in the scikit-learn documentation

There are three different things you can do to use DBSCAN with GPS data. First, you can use the eps parameter to specify the maximum distance between data points that you will consider for creating a cluster; as mentioned in other answers, you need to take into account the scale of the distance metric you are using and pick a value that makes sense. Second, min_samples can be used as a way of filtering out data points recorded while moving. Finally, the metric parameter lets you use whatever distance you want.
As an example, in a particular research project I'm working on, I want to extract significant locations from a subject's GPS locations collected from their smartphone. I'm not interested in how the subject navigates through the city, and I'm more comfortable dealing with distances in meters, so I can do the following:
from geopy import distance

def mydist(p1, p2):
    return distance.great_circle((p1[0], p1[1], 100), (p2[0], p2[1], 100)).meters

DBSCAN(eps=50, min_samples=50, n_jobs=-1, metric=mydist)
Here eps is, as per the DBSCAN documentation, "The maximum distance between two samples for one to be considered as in the neighborhood of the other."
min_samples is "The number of samples (or total weight) in a neighborhood for a point to be considered as a core point." Basically, with eps you control how close data points in a cluster should be; in the example above I selected 50 meters. min_samples is just a way to control for density: in the example above the data was captured at about one sample per second, and because I'm not interested in when people are moving around but rather in stationary locations, I want to make sure I get at least the equivalent of 60 seconds of GPS data from the same location.
If this still does not make sense, take a look at this DBSCAN animation.
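A minimal usage sketch around a metric like that (the coordinates below are made-up (lat, lon) pairs, the constant altitude argument from the snippet above is dropped for simplicity, and min_samples=2 is chosen just so this tiny example forms a cluster):
import numpy as np
from geopy import distance
from sklearn.cluster import DBSCAN

def mydist(p1, p2):
    # great-circle distance in metres between two (lat, lon) points
    return distance.great_circle((p1[0], p1[1]), (p2[0], p2[1])).meters

coords = np.array([[19.119677, 72.905081],
                   [19.119700, 72.905100],
                   [19.137815, 72.914085]])
db = DBSCAN(eps=50, min_samples=2, metric=mydist).fit(coords)
print(db.labels_)  # e.g. [ 0  0 -1]: the two nearby points cluster, the third is noise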

Related

Panda dataframe of distribution of particles: group by ID and find the half flux and the half flux radius

I am using a Pandas dataframe; I have a distribution of particles, their distance from the center of the distribution, and the associated fluxes. I want to find the total flux enclosed in the "half flux radius" (or "half light radius"), which is, by definition, the radius that encloses half of the flux. I give an example below and then ask if you have any idea of how to do it.
Here I list 2 distributions of particles, identified by dist_ID, the distance of each particle from the center of the distribution (R), and the flux of each particle.
dist_ID R flux
0 702641.0 5.791781 0.097505
1 702641.0 2.806051 0.015750
2 702641.0 3.254907 0.086941
3 702641.0 8.291544 0.081764
4 702641.0 4.901959 0.053561
5 702641.0 8.630691 0.144661
...
228 802663.0 95.685763 0.025735
229 802663.0 116.070396 0.026012
230 802663.0 112.806001 0.022163
231 802663.0 229.388117 0.026154
For example, considering the particle distribution with dist_ID=702641.0, the total flux of the particle distribution is the sum of "flux": total_flux=0.48;
the half flux is half_flux=total_flux/2.=0.24;
the radius that encloses half of the flux satisfies R_2 < R_hf < R_3 (where R_2=3.25 for particle 2 and R_3=8.29 for particle 3), so I would take R_hf as the upper limit of that interval, i.e. R_hf=R_3.
I want a way, grouping by dist_ID with a Pandas dataframe, to get half_flux and R_hf for each distribution. Thanks
This can be done in the following way:
import pandas as pd
data = {'dist_ID': [702641.0,702641.0,702641.0,702641.0,702641.0,702641.0,802663.0,802663.0,802663.0,802663.0],
'R': [5.791781,2.806051,3.254907,8.291544,4.901959,8.630691,95.685763,116.070396,112.806001,229.388117],
'flux': [0.097505,0.015750,0.086941,0.081764,0.053561,0.144661,0.025735,0.026012,0.022163,0.026154]}
df = pd.DataFrame(data)
# Sort DF
df = df.sort_values(['dist_ID', 'R'])
# Calculate cumsum
df['flux_cumsum'] = df.groupby('dist_ID')['flux'].transform(pd.Series.cumsum)
# Calculate half_flux
df_halfflux = df.groupby('dist_ID').apply(lambda x: x.flux.sum() / 2).to_frame().rename(columns={0:'half_flux'})
df = pd.merge(df,df_halfflux, how="left", on=['dist_ID'])
# Calculate discrepancy
df['flux_diff'] = abs(df.half_flux- df.flux_cumsum)
print(df)
# Find R_hf: the R of the row whose cumulative flux is closest to half_flux
idx = df.groupby('dist_ID')['flux_diff'].idxmin()
df = df.loc[idx, ['dist_ID', 'half_flux', 'R']].rename(columns={'R': 'R_hf'})
print(df)
The code above outputs this:
dist_ID R flux flux_cumsum half_flux flux_diff
0 702641.0 2.806051 0.015750 0.015750 0.240091 0.224341
1 702641.0 3.254907 0.086941 0.102691 0.240091 0.137400
2 702641.0 4.901959 0.053561 0.156252 0.240091 0.083839
3 702641.0 5.791781 0.097505 0.253757 0.240091 0.013666
4 702641.0 8.291544 0.081764 0.335521 0.240091 0.095430
5 702641.0 8.630691 0.144661 0.480182 0.240091 0.240091
6 802663.0 95.685763 0.025735 0.025735 0.050032 0.024297
7 802663.0 112.806001 0.022163 0.047898 0.050032 0.002134
8 802663.0 116.070396 0.026012 0.073910 0.050032 0.023878
9 802663.0 229.388117 0.026154 0.100064 0.050032 0.050032
    dist_ID  half_flux        R_hf
3  702641.0   0.240091    5.791781
7  802663.0   0.050032  112.806001
If you want the half flux it can be done by
df.groupby("dist_ID").apply(lambda x: x.flux.sum()/2)
Output
dist_ID
702641.0    0.240091
802663.0    0.050032
dtype: float64
Not sure how you want to compute the radius but hopefully this will help you figure it out.
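If you also want the radius as defined in the question (the first R, taking particles in order of increasing R, at which the cumulative flux reaches half of the total), here is a hedged sketch using the same ten sample rows:
import pandas as pd

data = {'dist_ID': [702641.0]*6 + [802663.0]*4,
        'R': [5.791781, 2.806051, 3.254907, 8.291544, 4.901959, 8.630691,
              95.685763, 116.070396, 112.806001, 229.388117],
        'flux': [0.097505, 0.015750, 0.086941, 0.081764, 0.053561, 0.144661,
                 0.025735, 0.026012, 0.022163, 0.026154]}
df = pd.DataFrame(data)

def half_flux_radius(group):
    group = group.sort_values('R')
    half = group['flux'].sum() / 2
    # first radius (in increasing order) whose cumulative flux reaches half
    return group.loc[group['flux'].cumsum() >= half, 'R'].iloc[0]

print(df.groupby('dist_ID').apply(half_flux_radius))
# dist_ID
# 702641.0      5.791781
# 802663.0    116.070396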

MATLAB translation to Python for a simple calculation

One more time I need your help.
To introduce the problem, I have this:
x=[0 1 3 4 5 6 7 8]
y=[9 10 11 12 13 14 15 16]
x=x(:)
y=y(:)
X=[x.^2, x.*y,y.^2,x,y]
a=sum(X)/(X'*X)
X=
0 0 81 0 9
1 10 100 1 10
9 33 121 3 11
16 48 144 4 12
25 65 169 5 13
36 84 196 6 14
49 105 225 7 15
64 128 256 8 16
a =
-0.0139 0.0278 -0.0139 -0.2361 0.2361
Consider that the MATLAB code is absolutely correct,
and I translated it to:
import numpy as np
x = np.array([0, 1, 3, 4, 5, 6, 7, 8])
y = np.array([9, 10, 11, 12, 13, 14, 15, 16])
X = np.array([x*x, x*y, y*y, x, y]).T
a = np.sum(X)/np.dot(X.T, X)  # line with the problem
X is the same, but I get a (5,5) matrix for a.
I think the problem comes from the multiplication between X.T and X. I've tried np.matmul, np.dot, transpose and .T, and I don't know why I can't get a (1,5) or (5,1) vector... What is wrong is the translation between these two languages for the calculation of a.
Any suggestions?
The division of such two matrices in MATLAB:
s = sum(X)
XX = (X'*X)
a = s / XX
is solving for t the linear system XX * t = s (MATLAB's right division s / XX solves t * XX = s, which is the same thing here because XX = X'*X is symmetric).
To achieve the same in Python/NumPy, just use np.linalg.solve() (making sure to use np.sum() with the correct axis parameter to mimic the behavior of MATLAB's sum(), as indicated in the comments and in @AnderBiguri's answer):
x=np.array([0,1,3,4,5,6,7,8])
y=np.array([9,10,11,12,13,14,15,16])
X=np.array([x*x,x*y,y*y,x,y]).T
s = np.sum(X, 0)
XX = np.dot(X.T, X)
a = np.linalg.solve(XX, s)
print(a)
# [-0.01388889 0.02777778 -0.01388889 -0.23611111 0.23611111]
The issue is sum.
In MATLAB, sum sums over the first axis by default. In NumPy, np.sum sums all the values unless you pass an axis.
a=np.sum(X, axis=0)/np.dot(X.T,X)
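A quick sketch of the difference, using the same X as above (note that / between two NumPy arrays is still element-wise, so to reproduce MATLAB's matrix right-division you would combine the axis=0 sum with np.linalg.solve as in the previous answer):
import numpy as np

x = np.array([0, 1, 3, 4, 5, 6, 7, 8])
y = np.array([9, 10, 11, 12, 13, 14, 15, 16])
X = np.array([x*x, x*y, y*y, x, y]).T

print(np.sum(X))          # one scalar: 2099
print(np.sum(X, axis=0))  # column sums, like MATLAB's sum(X): [ 200  473 1292   34  100]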

How to add random points in between the given points?

I have data points in a dataframe, just like represented in figure 1.
Sample data:
df=
74 34
74.5 34.5
75 34.5
75 34
74.5 34
74 34.5
76 34
76 34.5
74.5 34
74 34.5
75.5 34.5
75.5 34
75 34
75 34.5
I want to add random points in between those points but keep the shape of the initial points.
The desired output will be somewhat like figure 2 (the black dots represent the random points and the red line represents the boundary).
Any suggestions? I am looking for a general solution, since the geometry of the outer boundary will change from problem to problem.
interpolation might be worth looking into:
import numpy as np
# lets suppose y = 2x and x[i], y[i] is a data point
x = [1, 5, 16, 20, 60]
y = [2, 10, 32, 40, 120]
interp_here = [7, 8, 9, 21] # the points where we are going to interpolate values.
print(np.interp(interp_here, x, y)) ## [14. 16. 18. 42.]
If you want random points, then you could use the above as a guide line and then for each point add/subtract some delta.
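A hedged sketch of that idea: interpolate on a grid of new x positions and then jitter each interpolated value by a small random delta (the grid size and the ±1.0 delta range are arbitrary choices here):
import numpy as np

x = [1, 5, 16, 20, 60]
y = [2, 10, 32, 40, 120]

new_x = np.linspace(min(x), max(x), 20)             # where to add points
new_y = np.interp(new_x, x, y)                      # values on the guide line
jitter = np.random.uniform(-1.0, 1.0, new_y.size)   # small random delta
random_points = np.column_stack([new_x, new_y + jitter])
print(random_points[:5])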
If the shape is convex it is pretty simple:
import numpy as np

def get_random_point(points):
    point_selectors = np.random.randint(0, len(points), 2)
    scale = np.random.rand()  # value between 0 and 1
    start_point = points[point_selectors[0]]
    end_point = points[point_selectors[1]]
    return start_point + (end_point - start_point) * scale
The shape you have specified is not convex, though. And without you additionally specifying which points make up the exterior of your shape, or additional constraints (e.g. only allowing boundary lines that run parallel to the x or y axis), the shape you see is mathematically not sufficiently specified.
Final remark: there are algorithms which can check whether a point is within a polygon (point in polygon).
You can then 1) specify the bounding polygon, 2) generate a point within the bounding rectangle of your shape, and 3) test whether that point lies within the polygon, as sketched below.
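A hedged sketch of that three-step recipe, using matplotlib's Path for the point-in-polygon test (the L-shaped boundary below is purely illustrative, not the question's exact outline):
import numpy as np
from matplotlib.path import Path

# illustrative non-convex boundary, listed in order around the outline
boundary = [(74.0, 34.0), (76.0, 34.0), (76.0, 34.5),
            (75.0, 34.5), (75.0, 35.0), (74.0, 35.0)]
polygon = Path(boundary)

def random_points_in_polygon(polygon, n, rng=np.random.default_rng()):
    xmin, ymin = polygon.vertices.min(axis=0)
    xmax, ymax = polygon.vertices.max(axis=0)
    points = []
    while len(points) < n:
        p = rng.uniform([xmin, ymin], [xmax, ymax])  # candidate in bounding box
        if polygon.contains_point(p):                # keep only points inside
            points.append(p)
    return np.array(points)

print(random_points_in_polygon(polygon, 5))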

Need help dividing the ratio of elements in a Python list

I am working on a problem where I have been asked to a) output Fibonacci numbers in a sequence based on user input, as I have done below, and b) divide and print the ratio of the two most recent terms.
fixed_start = [0, 1]

def fib(fixed_start, n):
    if n == 0:
        return fixed_start
    else:
        fixed_start.append(fixed_start[-1] + fixed_start[-2])
        return fib(fixed_start, n - 1)

numb = int(input('How many terms: '))
fibonacci_list = fib(fixed_start, numb)
print(fibonacci_list[:-1])
I would like for my output to look something like the below:
"How many terms:" 3
1 1
the ratio is 1.0
1 2
the ratio is 2.0
2 3
the ratio is 1.5
Are you looking for ratio of the last 2 items in the list? If yes, this should work.
print(fibonacci_list[-2:])
print(float(fibonacci_list[-1]/fibonacci_list[-2]))
Or, if you are looking for ratio between every 2 numbers (except 0 & 1 right at the start), the below code should do the trick
for x, y in zip(fibonacci_list[1:], fibonacci_list[2:]):
    print(x, y)
    print('the ratio is ' + str(round((y/x), 3)))
output is something like below for a fibonacci list of 15 terms
1 1
the ratio is 1.0
1 2
the ratio is 2.0
2 3
the ratio is 1.5
3 5
the ratio is 1.667
5 8
the ratio is 1.6
8 13
the ratio is 1.625
13 21
the ratio is 1.615
21 34
the ratio is 1.619
34 55
the ratio is 1.618
55 89
the ratio is 1.618
89 144
the ratio is 1.618
144 233
the ratio is 1.618
233 377
the ratio is 1.618
377 610
the ratio is 1.618
610 987
the ratio is 1.618
As you have already solved part one, generating the Fibonacci series in the form of a list, you can access the last two (most recent) elements from it and take their ratio. Python allows us to access the elements of a list from the back using negative indexing:
def fibonacci_ratio(fibonacci_list):
    last_element = fibonacci_list[-1]
    second_last_element = fibonacci_list[-2]
    ratio = last_element / second_last_element
    return ratio
The single / in Python 3 performs floating-point division (// would floor the result to an integer instead).
Hope this helps!

Python computing error

I'm using the mpmath library to compute the following sequence.
Let us consider the series u0, u1, u2, ... defined by:
u0 = 3/2 = 1.5
u1 = 5/3 = 1.6666666...
u(n+1) = 2003 - 6002/u(n) + 4000/(u(n)*u(n-1))
The series converges to 2, but because of rounding problems it seems to converge to 2000.
n    Calculated value    Rounded-off exact value
2    1.800001            1.800000000
3    1.890000            1.888888889
4    3.116924            1.941176471
5    756.3870306         1.969696970
6    1996.761549         1.984615385
7    1999.996781         1.992248062
8    1999.999997         1.996108949
9    2000.000000         1.998050682
10   2000.000000         1.999024390
My code :
from mpmath import *
mp.dps = 50
u0 = mpf(3/2.0)
u1 = mpf(5/3.0)
u = []
u.append(u0)
u.append(u1)
for i in range(2, 11):
    un1 = (2003 - 6002/u[i-1] + (mpf(4000)/mpf((u[i-1]*u[i-2]))))
    u.append(un1)
print u
my bad results :
[mpf('1.5'),
mpf('1.6666666666666667406815349750104360282421112060546875'),
mpf('1.8000000000000888711326751945268011597589466120961647'),
mpf('1.8888888889876302386905492787148253684796100079942617'),
mpf('1.9411765751351638992775070422559330255517747908588059'),
mpf('1.9698046831709839591526211645628191427874374792786951'),
mpf('2.093979191783975876606205176530675127058752077926479'),
mpf('106.44733511712489354422046139349654833300787666477228'),
mpf('1964.5606972399290690749220686397494349501387742896911'),
mpf('1999.9639916238009625032390578545797067344576357100626'),
mpf('1999.9999640260895343960004614025893194430187653900418')]
I tried using some other functions (fdiv, ...) and changing the precision: same bad result.
What's wrong with this code?
Question:
How do I change my code to find the value 2.0 with the formula
u(n+1) = 2003 - 6002/u(n) + 4000/(u(n)*u(n-1))?
Thanks
Using the decimal module, you can see the series also has a solution converging at 2000:
from decimal import Decimal, getcontext
getcontext().prec = 100
u0 = Decimal(3) / Decimal(2)
u1 = Decimal(5) / Decimal(3)
u = [u0, u1]
for i in range(100):
    un1 = 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
    u.append(un1)
    print un1
The recurrence relation has multiple fixed points (one at 2 and the other at 2000):
>>> u = [Decimal(2), Decimal(2)]
>>> 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
Decimal('2')
>>> u = [Decimal(2000), Decimal(2000)]
>>> 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
Decimal('2000.000')
The solution at 2 is an unstable fixed-point. The attractive fixed-point is at 2000.
The convergence gets very close to two and when the round-off causes the value to slightly exceed two, that difference gets amplified again and again until hitting 2000.
Your (non-linear) recurrence sequence has three fixed points: 1, 2 and 2000. The values 1 and 2 are close to each other compared to 2000, which is usually an indication of unstable fixed points because they are "almost" double roots.
You need to do some maths in order to diverge less early. Let v(n) be a side sequence:
v(n) = (1+2^n)u(n)
The following holds true:
v(n+1) = (1+2^(n+1)) * (2003v(n)v(n-1) - 6002(1+2^n)v(n-1) + 4000(1+2^n)(1+2^(n-1))) / (v(n)v(n-1))
You can then simply compute v(n) and deduce u(n) from u(n) = v(n)/(1+2^n):
#!/usr/bin/env python
from mpmath import *
mp.dps = 50
v0 = mpf(3)
v1 = mpf(5)
v = []
v.append(v0)
v.append(v1)
u = []
u.append(v[0]/2)
u.append(v[1]/3)
for i in range(2, 25):
    vn1 = (1+2**i) * (2003*v[i-1]*v[i-2] \
                      - 6002*(1+2**(i-1))*v[i-2] \
                      + 4000*(1+2**(i-1))*(1+2**(i-2))) \
          / (v[i-1]*v[i-2])
    v.append(vn1)
    u.append(vn1/(1+2**i))
print u
And the result:
[mpf('1.5'),
mpf('1.6666666666666666666666666666666666666666666666666676'),
mpf('1.8000000000000000000000000000000000000000000000000005'),
mpf('1.8888888888888888888888888888888888888888888888888892'),
mpf('1.9411764705882352941176470588235294117647058823529413'),
mpf('1.969696969696969696969696969696969696969696969696969'),
mpf('1.9846153846153846153846153846153846153846153846153847'),
mpf('1.992248062015503875968992248062015503875968992248062'),
mpf('1.9961089494163424124513618677042801556420233463035019'),
mpf('1.9980506822612085769980506822612085769980506822612089'),
mpf('1.9990243902439024390243902439024390243902439024390251'),
mpf('1.9995119570522205954123962908735968765251342118106393'),
mpf('1.99975591896509641200878691725652916768367097876495'),
mpf('1.9998779445868424264616135725619431221774685707311133'),
mpf('1.9999389685688129386634116570033567287152883735123589'),
mpf('1.9999694833531691537733833806341359211449845890933504'),
mpf('1.9999847414437645909944001098616048949448403192089965'),
mpf('1.9999923706636759668276456631037666033431751771913355'),
...
Note that this will still diverge eventually. In order to really converge, you need to compute v(n) with arbitrary precision. But this is now a lot easier since all the values are integers.
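A hedged sketch of that remark: because v(n) turns out to be integer-valued (in fact v(n) = 2^(n+1) + 1), the recurrence can be run exactly with Python's fractions module, and the convergence of u(n) = v(n)/(1+2^n) to 2 becomes visible with no floating-point error at all:
from fractions import Fraction

v = [Fraction(3), Fraction(5)]  # v(0) = 3, v(1) = 5
for n in range(2, 25):
    num = (2003 * v[-1] * v[-2]
           - 6002 * (1 + 2**(n - 1)) * v[-2]
           + 4000 * (1 + 2**(n - 1)) * (1 + 2**(n - 2)))
    v.append((1 + 2**n) * num / (v[-1] * v[-2]))

for n, vn in enumerate(v):
    assert vn == 2**(n + 1) + 1          # every v(n) is an exact integer
    print(n, float(vn / (1 + 2**n)))     # u(n) approaches 2 from below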
You calculate your initial values to 53 bits of precision and then assign that rounded value to the high-precision mpf variable. You should use u0=mpf(3)/mpf(2) and u1=mpf(5)/mpf(3). You'll stay close to 2 for a few more iterations, but you'll still end up converging to 2000. This is due to rounding error. One alternative is to compute with fractions. I used gmpy and the following code converges to 2.
from __future__ import print_function
import gmpy

u = [gmpy.mpq(3,2), gmpy.mpq(5,3)]
for i in range(2, 300):
    temp = 2003 - 6002/u[-1] + 4000/(u[-1]*u[-2])
    u.append(temp)
for i in u:
    print(gmpy.mpf(i, 300))
If you compute with infinite precision then you get 2 otherwise you get 2000:
import itertools
from fractions import Fraction

def series(u0=Fraction(3, 2), u1=Fraction(5, 3)):
    yield u0
    yield u1
    while u0 != u1:
        un = 2003 - 6002/u1 + 4000/(u1*u0)
        yield un
        u1, u0 = un, u1

for i, u in enumerate(itertools.islice(series(), 100)):
    err = (2-u)/2  # relative error
    print("%d\t%.2g" % (i, err))
Output
0 0.25
1 0.17
2 0.1
3 0.056
4 0.029
5 0.015
6 0.0077
7 0.0039
8 0.0019
9 0.00097
10 0.00049
11 0.00024
12 0.00012
13 6.1e-05
14 3.1e-05
15 1.5e-05
16 7.6e-06
17 3.8e-06
18 1.9e-06
19 9.5e-07
20 4.8e-07
21 2.4e-07
22 1.2e-07
23 6e-08
24 3e-08
25 1.5e-08
26 7.5e-09
27 3.7e-09
28 1.9e-09
29 9.3e-10
30 4.7e-10
31 2.3e-10
32 1.2e-10
33 5.8e-11
34 2.9e-11
35 1.5e-11
36 7.3e-12
37 3.6e-12
38 1.8e-12
39 9.1e-13
40 4.5e-13
41 2.3e-13
42 1.1e-13
43 5.7e-14
44 2.8e-14
45 1.4e-14
46 7.1e-15
47 3.6e-15
48 1.8e-15
49 8.9e-16
50 4.4e-16
51 2.2e-16
52 1.1e-16
53 5.6e-17
54 2.8e-17
55 1.4e-17
56 6.9e-18
57 3.5e-18
58 1.7e-18
59 8.7e-19
60 4.3e-19
61 2.2e-19
62 1.1e-19
63 5.4e-20
64 2.7e-20
65 1.4e-20
66 6.8e-21
67 3.4e-21
68 1.7e-21
69 8.5e-22
70 4.2e-22
71 2.1e-22
72 1.1e-22
73 5.3e-23
74 2.6e-23
75 1.3e-23
76 6.6e-24
77 3.3e-24
78 1.7e-24
79 8.3e-25
80 4.1e-25
81 2.1e-25
82 1e-25
83 5.2e-26
84 2.6e-26
85 1.3e-26
86 6.5e-27
87 3.2e-27
88 1.6e-27
89 8.1e-28
90 4e-28
91 2e-28
92 1e-28
93 5e-29
94 2.5e-29
95 1.3e-29
96 6.3e-30
97 3.2e-30
98 1.6e-30
99 7.9e-31
Well, as casevh said, I just added the mpf function to the first initial terms in my code:
u0=mpf(3)/mpf(2)
u1=mpf(5)/mpf(3)
and the value converges for 16 steps to the correct value 2.0 before diverging again (see below).
So, even with a good Python library for arbitrary-precision floating-point arithmetic and some basic operations, the result can become totally false; it is not an algorithmic, mathematical or recurrence problem, as I have sometimes read.
So it is necessary to remain watchful and critical!!! (I'm very afraid about the mpmath.lerchphi(z, s, a) function ;-)
2 1.8000000000000000000000000000000000000000000000022
3 1.8888888888888888888888888888888888888888888913205
4 1.9411764705882352941176470588235294117647084569125
5 1.9696969696969696969696969696969696969723495083846
6 1.9846153846153846153846153846153846180779422496889
7 1.992248062015503875968992248062018218070968279944
8 1.9961089494163424124513618677070049064461141667961
9 1.998050682261208576998050684991268132991329645551
10 1.9990243902439024390243929766241359876402781522945
11 1.9995119570522205954151303455889283862002420414092
12 1.9997559189650964147435086295745928366095548127257
13 1.9998779445868451615169464386495752584786229236677
14 1.9999389685715481608370784691478769380770569091713
15 1.9999694860884747554701272066241108169217231319376
16 1.9999874767910784720428384947047783821702386000249
17 2.0027277350948824117795762659330557916802871427763
18 4.7316350177463946015607576536159982430500337286276
19 1156.6278675611076227796014310764287933259776352198
20 1998.5416721291457644804673979070312813731252347786
21 1999.998540608689366669273522363692463645090555294
22 1999.9999985406079725746311606572627439743947878652
The exact solution to your recurrence relation (with initial values u_0 = 3/2, u_1 = 5/3) is easily verified to be
u_n = (2^(n+1) + 1) / (2^n + 1). (*)
The problem you're seeing is that although the solution is such that
lim_{n -> oo} u_n = 2,
this limit is a repelling fixed point of your recurrence relation. That is, any departure from the correct values of u_{n-1}, u_{n-2}, for some n, will result in further values diverging from the correct limit. Consequently, unless your implementation of the recurrence relation correctly represents every u_n value exactly, it can be expected to exhibit eventual divergence from the correct limit, converging to the incorrect value of 2000 that just happens to be the only attracting fixed point of your recurrence relation.
(*) In fact, u_n = (2^(n+1) + 1) / (2^n + 1) is the solution to any recurrence relation of the form
u_n = C + (7 - 3C)/u_{n-1} + (2C - 6)/(u_{n-1} u_{n-2})
with the same initial values as given above, where C is an arbitrary constant. If I haven't made a mistake finding the roots of the characteristic polynomial, this will have the set of fixed points {1, 2, C - 3}\{0}. The limit 2 can be either a repelling fixed point or an attracting fixed point, depending on the value of C. E.g., for C = 2003 the set of fixed points is {1, 2, 2000} with 2 being a repellor, whereas for C = 3 the fixed points are {1, 2} with 2 being an attractor.
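A small sketch that checks the closed form against the C = 2003 recurrence using exact arithmetic (fractions), for the first few terms:
from fractions import Fraction

def u_exact(n):
    # closed form u_n = (2**(n+1) + 1) / (2**n + 1)
    return Fraction(2**(n + 1) + 1, 2**n + 1)

u0, u1 = u_exact(0), u_exact(1)           # 3/2 and 5/3, the given initial values
for n in range(1, 20):
    u2 = 2003 - 6002/u1 + 4000/(u1*u0)    # the recurrence with C = 2003
    assert u2 == u_exact(n + 1)           # agrees with the closed form exactly
    u0, u1 = u1, u2
print(float(u1))                          # approaches 2 as n grows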
