Calculate distance based on lat/long, groupby 2 fields in Python

I have a set of data from my vehicle tracking system that requires me to calculate the distance based on lat and long. I understand that the haversine formula can give the distance between consecutive rows, but I'm stuck because I need the distance based on two fields (model type and mode).
As shown below is my code:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

mydataset = pd.read_csv(x + '.txt')
print(mydataset.shape)

mydataset = mydataset.sort_values(by=['Model', 'timestamp'])  # sort
mydataset['dist'] = np.concatenate(
    mydataset.groupby(["Model"])
             .apply(lambda x: haversine(x['Latitude'], x['Longitude'],
                                        x['Latitude'].shift(), x['Longitude'].shift()))
             .values)
With this, I am able to calculate the distance between rows based on the model (by sorting first).
But I would like to take it a step further and calculate it based on both Model and Mode. My fields are "Index, Model, Mode, Lat, Long, Timestamp".
Please advise!
Index, Model, Timestamp, Long, Lat, Mode(denote as 0 or 2), Distance Calculated
1, X, 2018-01-18 09:16:37.070, 103.87772815, 1.35653496, 0, 0.0
2, X, 2018-01-18 09:16:39.071, 103.87772815, 1.35653496, 0, 0.0
3, X, 2018-01-18 09:16:41.071, 103.87772815, 1.35653496, 0, 0.0
4, X, 2018-01-18 09:16:43.071, 103.87772052, 1.35653496, 0, 0.0008481795
5, X, 2018-01-18 09:16:45.071, 103.87770526, 1.35653329, 0, 0.0017064925312804799
6, X, 2018-01-18 09:16:51.070, 103.87770526, 1.35653329, 2, 0.0
7, X, 2018-01-18 09:16:53.071, 103.87770526, 1.35653329, 2, 0.0
8, X, 2018-01-18 09:59:55.072, 103.87770526, 1.35652828, 0, 0.0005570865824842293
I need it to calculate the total journey distance for each model, and also the total journey distance for each model in each mode.

I think you need to add a DataFrame constructor to the function and then add another column name to the groupby, like ["Model", "Mode(denote as 0 or 2)"] or ["Model", "Mode"], depending on your column names:
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return pd.DataFrame(earth_radius * 2 * np.arcsin(np.sqrt(a)))

mydataset['dist'] = (mydataset.groupby(["Model", "Mode(denote as 0 or 2)"])
                              .apply(lambda x: haversine(x['Lat'],
                                                         x['Long'],
                                                         x['Lat'].shift(),
                                                         x['Long'].shift())).values)

# if you need to replace NaNs with 0
mydataset['dist'] = mydataset['dist'].fillna(0)
print(mydataset)
Index Model Timestamp Long Lat \
0 1 X 2018-01-18 09:16:37.070 103.877728 1.356535
1 2 X 2018-01-18 09:16:39.071 103.877728 1.356535
2 3 X 2018-01-18 09:16:41.071 103.877728 1.356535
3 4 X 2018-01-18 09:16:43.071 103.877721 1.356535
4 5 X 2018-01-18 09:16:45.071 103.877705 1.356533
5 6 X 2018-01-18 09:16:51.070 103.877705 1.356533
6 7 X 2018-01-18 09:16:53.071 103.877705 1.356533
7 8 X 2018-01-18 09:59:55.072 103.877705 1.356528
Mode(denote as 0 or 2) Distance Calculated dist
0 0 0.000000 0.000000
1 0 0.000000 0.000000
2 0 0.000000 0.000000
3 0 0.000848 0.000848
4 0 0.001706 0.001706
5 2 0.000000 0.000557
6 2 0.000000 0.000000
7 0 0.000557 0.000000
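If what you ultimately need is the total journey distance per group rather than the per-row values, a groupby sum on the computed dist column should do it. A minimal sketch, assuming the column names used above:

# total journey distance for each model
print(mydataset.groupby("Model")["dist"].sum())

# total journey distance for each model, broken down by mode
print(mydataset.groupby(["Model", "Mode(denote as 0 or 2)"])["dist"].sum())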

Related

Calculation in pandas dataframe using loop

I want to perform the desired computations based on my 'x' and 'y' coordinate values, as in the table below (Table 1):
TIMESTEP nparticles v_x v_y radius area_sum vx_a1 vy_a1 phi_1
0 1 0.0 0.0 0.490244 0.7550478008000959 1.90579 -1.83605 0.36630
100 1 0.369944 0.196252 0.490244 0.7550478008000959
200 1 -0.110178 -0.233131 0.490244 0.7550478008000959
...
...
97400 1 -1.03617 -7.24768 0.461981 0.6704989496863082
97500 1 -1.30016 -7.25768 0.461981 0.6704989496863082
...
...
For which I am using this code for my above generated dataframe:
bindistance = 0.25
orfl = -4.0
orfr = 4.0
bin_xc = np.arange(orfl, orfr, bindistance)
nbins = len(bin_xc)
binx = 0.25
xo_min = -4.0
xo_max = 4.0
xb1_c = xo_min
xb1_max = xb1_c + (binx * 2)
xb1_min = xb1_c - (binx * 2)
yb_min = -0.5
yb_max = 0.5
yb_c = 0
x_particle1 = df.loc[(df['x'] < xb1_max) & (df['x'] > xb1_min)]
xy_particle1 = x_particle1.loc[(x_particle1['y'] < yb_max) & (x_particle1['y'] > yb_min)]
output1 = xy_particle1.groupby("TIMESTEP").agg(nparticles = ("id", "count"), v_x=("vx", "sum"), v_y=("vy", "sum"), radius = ("radius", "sum"), area_sum = ("Area", "sum"))
nsum1 = output1["nparticles"].sum()
vxsum1 = output1["v_x"].sum()
vysum1 = output1["v_y"].sum()
v_a1 = vxsum1 / nsum1
vy_a1 = vysum1 / nsum1
phi_1 = output1["area_sum"].sum() / 1001
But I have a very large number of these desired dataframes (the first one is shown above), each based on different 'x' and 'y' coordinate conditions, so manually writing out the code 50 or more times is not feasible. How can I do this with a loop or otherwise? Please help.
This is my input dataset (df):
TIMESTEP id radius x y vx vy Area
0 42 0.490244 -3.85683 0.489375 0.0 0.0 0.7550478008000959
0 245 0.479994 -2.88838 0.479446 0.0 0.0 0.7238048519265009
0 344 0.463757 -1.94613 0.463363 0.0 0.0 0.6756640757454175
0 313 0.503268 -0.981364 0.501991 0.0 0.0 0.7956984398459999
...
...
100000 1051 0.542993 0.887743 1.71649 -0.309668 -5.83282 0.9262715700848821
100000 504 0.540275 2.87158 1.94939 -5.76545 -2.30889 0.9170217083878441
100000 589 0.450005 3.86868 1.89373 -4.49676 -2.63977 0.636186649597414
...
...
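One way to avoid writing the block 50 or more times is to loop over the bin centres and collect each bin's aggregates in a dict. A rough sketch, assuming the df, bin parameters and aggregation from the question (the names results, x_c and summary are only illustrative):

results = {}
for i, x_c in enumerate(bin_xc, start=1):
    # x-window around the current bin centre, same half-width as in the question
    x_min, x_max = x_c - binx * 2, x_c + binx * 2
    sub = df.loc[(df['x'] > x_min) & (df['x'] < x_max) &
                 (df['y'] > yb_min) & (df['y'] < yb_max)]
    out = sub.groupby("TIMESTEP").agg(nparticles=("id", "count"),
                                      v_x=("vx", "sum"),
                                      v_y=("vy", "sum"),
                                      radius=("radius", "sum"),
                                      area_sum=("Area", "sum"))
    nsum = out["nparticles"].sum()
    if nsum == 0:        # skip empty bins to avoid division by zero
        continue
    results[i] = {"vx_a": out["v_x"].sum() / nsum,
                  "vy_a": out["v_y"].sum() / nsum,
                  "phi": out["area_sum"].sum() / 1001}

summary = pd.DataFrame(results).T   # one row per bin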

Product scoring in pandas dataframe

I have a product ID dataframe. I would like to find the best product by scoring each product. For each variable, the higher the value, the better the product scores, except for Returns, where more returns mean a lower score. I also need to assign different weights to the variables ShippedRevenue and Returns, whose importance is increased by 20 percent.
A scoring function can look like this
Score = ShippedUnits + 1.2*ShippedRevenue + OrderedUnits - 1.2*Returns + View + Stock
where 0<=Score<=100
Please help. Thank you.
df_product = pd.DataFrame({'ProductId': ['1','2','3','4','5','6','7','8','9','10'],
                           'ShippedUnits': [6,8,0,4,27,3,4,14,158,96],
                           'ShippedRevenue': [268,1705,1300,950,1700,33380,500,2200,21000,24565],
                           'OrderedUnits': [23,78,95,52,60,76,68,92,34,76],
                           'Returns': [0,0,6,0,2,5,6,5,2,13],
                           'View': [0,655,11,378,920,12100,75,1394,12368,14356],
                           'Stock': [24,43,65,27,87,98,798,78,99,231]})
df_product['score'] = (df_product['ShippedUnits'] + 1.2*df_product['ShippedRevenue'] + df_product['OrderedUnits']
                       - 1.2*df_product['Returns'] + df_product['View'] + df_product['Stock'])
df_product['score'] = (df_product['score'] - df_product['score'].min()) / (df_product['score'].max() - df_product['score'].min()) * 100
df_product
df["Score"] = df["ShippedUnits"] + df["OrderedUnits"] \
+ df["View"] + df["Stock"] \
+ 1.2 * df["ShippedRevenue"] \
- 1.2 * df["Returns"]
df["Norm1"] = df["Score"] / df["Score"].max() * 100
df["Norm2"] = df["Score"] / df["Score"].sum() * 100
df["Norm3"] = (df["Score"] - df["Score"].min()) / (df["Score"].max() - df["Score"].min()) * 100
>>> df[["ProductId", "Score", "Norm1", "Norm2", "Norm3"]]
ProductId Score Norm1 Norm2 Norm3
0 1 374.6 0.715883 0.250040 0.000000
1 2 2830.0 5.408298 1.888986 4.726249
2 3 1723.8 3.294284 1.150613 2.596993
3 4 1601.0 3.059606 1.068646 2.360622
4 5 3131.6 5.984673 2.090300 5.306781
5 6 52327.0 100.000000 34.927558 100.000000
6 7 1537.8 2.938827 1.026460 2.238973
7 8 4212.0 8.049382 2.811452 7.386377
8 9 37856.6 72.346208 25.268763 72.146811
9 10 44221.4 84.509718 29.517180 84.398026
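Because the raw variables sit on very different scales (View and ShippedRevenue dominate the simple sum), it may also be worth normalising each column before weighting. A hedged sketch of that variant, reusing df_product from above; the score_scaled column name is only illustrative:

cols = ['ShippedUnits', 'ShippedRevenue', 'OrderedUnits', 'Returns', 'View', 'Stock']
# min-max scale every variable to 0..1 so no single column dominates
scaled = (df_product[cols] - df_product[cols].min()) / (df_product[cols].max() - df_product[cols].min())

weights = pd.Series({'ShippedUnits': 1, 'ShippedRevenue': 1.2, 'OrderedUnits': 1,
                     'Returns': -1.2, 'View': 1, 'Stock': 1})
raw = scaled.mul(weights).sum(axis=1)

# rescale the weighted sum to the requested 0..100 range
df_product['score_scaled'] = (raw - raw.min()) / (raw.max() - raw.min()) * 100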

How to find customized average which is based on weightage including handling of nan value in pandas?

I have a data frame df_ss_g as
ent_id,WA,WB,WC,WD
123,0.045251836,0.614582906,0.225930615,0.559766482
124,0.722324239,0.057781167,,0.123603561
125,,0.361074325,0.768542766,0.080434134
126,0.085781742,0.698045853,0.763116684,0.029084545
127,0.909758657,,0.760993759,0.998406211
128,,0.32961283,,0.90038336
129,0.714585519,,0.671905291,
130,0.151888772,0.279261613,0.641133263,0.188231227
Now I have to compute a weighted average (AVG_WEIGHTAGE) based on weightages, i.e. (WA*0.5 + WB*1 + WC*0.5 + WD*1) / (0.5 + 1 + 0.5 + 1).
But when I compute it using the method below, i.e.
df_ss_g['AVG_WEIGHTAGE'] = df_ss_g.apply(lambda x: ((x['WA']*0.5) + (x['WB']*1) + (x['WC']*0.5) + (x['WD']*1)) / (0.5+1+0.5+1), axis=1)
it outputs NaN as AVG_WEIGHTAGE for any row containing a NaN value, which is wrong.
All I want is that nulls should not be counted in either the numerator or the denominator,
e.g.
ent_id,WA,WB,WC,WD,AVG_WEIGHTAGE
128,,0.32961283,,0.90038336,0.614998095 i.e. (WB*1+WD*1)/(1+1)
129,0.714585519,,0.671905291,,0.693245405 i.e. (WA*0.5+WC*0.5)/(0.5+0.5)
IIUC:
import numpy as np
weights = np.array([0.5, 1, 0.5, 1])
values = df.drop('ent_id', axis=1)
df['AVG_WEIGHTAGE'] = np.dot(values.fillna(0).to_numpy(), weights)/np.dot(values.notna().to_numpy(), weights)
df['AVG_WEIGHTAGE']
0 0.436647
1 0.217019
2 0.330312
3 0.383860
4 0.916891
5 0.614998
6 0.693245
7 0.288001
Try this method using dot products -
def av(t):
    # define weights
    wt = [0.5, 1, 0.5, 1]
    # create a vector with 0 for null and 1 for non-null
    nulls = [int(i) for i in ~t.isna()]
    # denominator: weights counted only where a value exists;
    # numerator: weights applied to t.fillna(0) so NaNs contribute nothing
    wt_new = np.dot(nulls, wt)
    t_new = np.dot(wt, t.fillna(0))
    # return the division
    return np.divide(t_new, wt_new)

# ent_id is assumed to be the index here, so each row t holds only WA..WD
df['WEIGHTED AVG'] = df.apply(av, axis=1)
df = df.reset_index()
print(df)
   ent_id        WA        WB        WC        WD  WEIGHTED AVG
0     123  0.045252  0.614583  0.225931  0.559766      0.436647
1     124  0.722324  0.057781       NaN  0.123604      0.217019
2     125       NaN  0.361074  0.768543  0.080434      0.330312
3     126  0.085782  0.698046  0.763117  0.029085      0.383860
4     127  0.909759       NaN  0.760994  0.998406      0.916891
5     128       NaN  0.329613       NaN  0.900383      0.614998
6     129  0.714586       NaN  0.671905       NaN      0.693245
7     130  0.151889  0.279262  0.641133  0.188231      0.288001
It boils down to masking the nan values with 0 so they don't contribute to either weights or sum:
# this is the weights
weights = np.array([0.5,1,0.5,1])
# the columns of interest
s = df.iloc[:,1:]
# where the valid values are
mask = s.notnull()
# use `fillna` and then `@` for matrix multiplication
df['AVG_WEIGHTAGE'] = (s.fillna(0) @ weights) / (mask @ weights)
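The same masking idea can also be written without matrix products, staying entirely in pandas; a small sketch, assuming the same df and weights as above:

w = pd.Series([0.5, 1, 0.5, 1], index=['WA', 'WB', 'WC', 'WD'])
vals = df[['WA', 'WB', 'WC', 'WD']]

num = vals.mul(w).sum(axis=1)            # NaN * weight is skipped by sum()
den = vals.notna().mul(w).sum(axis=1)    # weights counted only where a value exists
df['AVG_WEIGHTAGE'] = num / den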

How to search a certain value in series python

I've got a series p:
0 353.267439
1 388.483605
2 0.494685
3 1.347499
4 404.202001
5 6.163468
6 29.782820
7 28.972926
8 2.822725
9 0.000000
10 1.309716
11 1.309716
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 63.199779
17 62.669258
18 0.306850
19 0.000000
20 28.218308
21 32.078732
22 4.394789
23 0.995053
24 236.355502
25 172.802915
26 1.207798
27 0.174134
28 0.706518
29 0.922744
1666374 0.000000
1666375 0.000000
1666376 0.000000
1666377 0.000000
1666378 0.033375
1666379 0.033375
1666380 0.118138
1666381 0.118138
1666382 12.415525
1666383 12.415525
1666384 24.252089
1666385 0.270588
1666386 24.292072
1666387 12.415525
1666388 12.415525
1666389 0.000000
1666390 0.000000
1666391 0.000000
1666392 0.118138
1666393 0.118138
1666394 0.118138
1666395 0.000000
1666396 0.000000
1666397 0.000000
1666398 0.000000
1666399 0.000000
1666400 0.118138
1666401 0.000000
1666402 0.118138
1666403 0.118138
Name: Dis, Length: 1666404, dtype: float64
and I believe there is a value 4.74036126519e-07 in it.
I tried some methods to find the value:
p[p == 'value']
or this function:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None
but they return nothing.
Strangely, when I call:
p[p == 0]
it does return the indexes.
I want to ask why, and how to find a value in a series properly.
code:
def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

def DisM(df, ID):
    df_user = df.loc[df['UserID'] == ID]
    p = haversine_np(df_user.Longitude.shift(), df_user.Latitude.shift(),
                     df_user.ix[1:, 'Longitude'], df_user.ix[1:, 'Latitude'])
    p = p.iloc[1:]
    p = p.rename("Dis")
    return p

p = DisM(df, 1)
for num in np.arange(2, 4861):
    p = p.append(DisM(df, num))
p = p.reset_index(drop=True)
df is a dataframe containing users' location information (longitude, latitude). I use haversine to compute the distance between points on their trips, and then use a for loop to append the distances together into p.
Actually, the number I try to find is not so important; I cannot get a result when searching for other values in the series either, such as 353.267439 (the first element).
This adds rounding to your checking function:
def find(s, el, n):
    for i in range(len(s)):
        if round(s[i], n) == round(el, n):
            return i
    return None
n is the number of digits the number will be rounded to.
You can test it using a simple script like this one
series = []
with open('series.txt','r') as f:
    for line in f:
        series.append(line.strip().split())

res = [float(x[1]) for x in series]
check = [353.267, 0.706518, 24.292]
print [find(res, x, 3) for x in check]
# yields [0, 28, 42]
Here series.txt is a text file with the data you posted (with one empty line removed). The above will print the correct indexes; it mimics the situation where rounding is to 3 decimal places, which is the precision of the inputs in check except for the middle element.
Similarly, it will work if the values in check have some extra trailing digits:
check = [353.2671111,0.7065181111,24.292111]
print [find(res, x, 3) for x in check]
# yields [0, 28, 42]
But it will not (except for the exact match) if you increase the precision past the lowest one:
check = [353.267,0.706518,24.292]
print [find(res, x, 7) for x in check]
# yields [None, 28, None]
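The underlying problem is floating-point equality: a value printed with six decimals rarely compares exactly equal to a hand-typed literal, which is why p[p == 0] works but other lookups fail. An alternative sketch using NumPy's tolerance-based comparison on the original series p (the tolerance shown is only an example):

import numpy as np

target = 353.267439
matches = p[np.isclose(p, target, rtol=0, atol=1e-6)]
print(matches.index.tolist())   # indexes whose value is within 1e-6 of target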

Multiplying data within columns python

I've been working on this all morning and for the life of me cannot figure it out. I'm sure this is very basic, but I've become so frustrated my mind is being clouded. I'm attempting to calculate the total return of a portfolio of securities at each date (monthly).
The formula is (1 + r1) * (1 + r2) * ... * (1 + rt) - 1
Here is what I'm working with:
Adj_Returns = Adj_Close/Adj_Close.shift(1)-1
Adj_Returns['Risk Parity Portfolio'] = (Adj_Returns.loc['2003-01-31':]*Weights.shift(1)).sum(axis = 1)
Adj_Returns
SPY IYR LQD Risk Parity Portfolio
Date
2002-12-31 NaN NaN NaN 0.000000
2003-01-31 -0.019802 -0.014723 0.000774 -0.006840
2003-02-28 -0.013479 0.019342 0.015533 0.011701
2003-03-31 -0.001885 0.010015 0.001564 0.003556
2003-04-30 0.088985 0.045647 0.020696 0.036997
For example, with 2002-12-31 being base 100 for risk parity, I want 2003-01-31 to be 99.316 (100 * (1-0.006840)), 2003-02-28 to be 100.478 (99.316 * (1+ 0.011701)) so on and so forth.
Thanks!!
You want to use pd.DataFrame.cumprod
df.add(1).cumprod().sub(1).sum(1)
Consider the dataframe of returns df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.normal(.025, .03, (10, 5)), columns=list('ABCDE'))
df
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 0.024191 0.034487 0.035463 0.046461 0.048123
2 0.006754 0.035572 0.014424 0.012524 -0.002347
3 0.020724 0.047405 -0.020125 0.043341 0.037007
4 -0.003783 0.069827 0.014605 -0.019147 0.056897
5 0.056890 0.042756 0.033886 0.001758 0.049944
6 0.069609 0.032687 -0.001997 0.036253 0.009415
7 0.026503 0.053499 -0.006013 0.053447 0.047013
8 0.062084 0.029664 -0.015238 0.029886 0.062748
9 0.048341 0.065248 -0.024081 0.019139 0.028955
We can see the cumulative return or total return is
df.add(1).cumprod().sub(1)
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 -0.015641 0.020983 0.000139 0.001702 0.063343
2 -0.008993 0.057301 0.014565 0.014247 0.060847
3 0.011544 0.107423 -0.005853 0.058206 0.100105
4 0.007717 0.184750 0.008666 0.037944 0.162699
5 0.065046 0.235405 0.042847 0.039769 0.220768
6 0.139183 0.275786 0.040764 0.077464 0.232261
7 0.169375 0.344039 0.034505 0.135051 0.290194
8 0.241974 0.383909 0.018742 0.168973 0.371151
9 0.302013 0.474207 -0.005791 0.191346 0.410852
Plot it
df.add(1).cumprod().sub(1).plot()
Add sum of returns to new column
df.assign(Portfolio=df.add(1).cumprod().sub(1).sum(1))
A B C D E Portfolio
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521 -0.114311
1 0.024191 0.034487 0.035463 0.046461 0.048123 0.070526
2 0.006754 0.035572 0.014424 0.012524 -0.002347 0.137967
3 0.020724 0.047405 -0.020125 0.043341 0.037007 0.271425
4 -0.003783 0.069827 0.014605 -0.019147 0.056897 0.401777
5 0.056890 0.042756 0.033886 0.001758 0.049944 0.603835
6 0.069609 0.032687 -0.001997 0.036253 0.009415 0.765459
7 0.026503 0.053499 -0.006013 0.053447 0.047013 0.973165
8 0.062084 0.029664 -0.015238 0.029886 0.062748 1.184749
9 0.048341 0.065248 -0.024081 0.019139 0.028955 1.372626
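To get the base-100 index described in the question (100, 99.316, 100.478, ...), the same cumprod idea can be applied to the portfolio return column. A minimal sketch, assuming the Adj_Returns frame from the question; the 'RP Index' column name is only illustrative:

rp = Adj_Returns['Risk Parity Portfolio']
# compound the monthly returns and scale so the first date is the base of 100
Adj_Returns['RP Index'] = 100 * rp.add(1).cumprod()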
