I have the following dataframe:
id x_coordinate y_coordinate money time (hr)
0 545 0.676576 3.079094 4200 1.414706
1 4138 0.262979 -0.769170 700 0.943230
2 5281 -0.301234 -3.568590 200 1.314108
3 4369 -0.585544 1.610388 11600 0.703957
4 2173 -1.239105 3.168139 29200 0.666473
5 9971 -1.556373 -1.624628 18700 0.776165
6 2622 -1.747544 3.145381 100 0.842138
7 4522 -1.923251 -2.695298 36700 0.186741
8 7299 -2.697775 2.038365 500 0.469136
9 5425 -4.443474 0.428256 1400 0.760269
Say the starting point is the first row. I want to manipulate the dataframe so that the second row is the row whose coordinates are closest to the first row's and the third row is the row whose coordinates are closest to the second row's coordinates out of the remaining rows and so on.
Imagine the coordinates on a graph, and say I start at (0,0). I am trying to find the most efficient way to 'travel' between these coordinates and end up back at (0,0).
The output could be a list of coordinates sorted along the most efficient route, or a new dataframe; I'm just trying to figure out a way to solve this.
So far I have tried to sort the df like this:
df = df.sort_values(["x_coordinate", "y_coordinate"], ascending=(False, True))
I have alternatively also tried to convert the df columns to lists using df.to_list() and then sorting them.
However, neither approach gives the result I want. So any tips on how to go about manipulating a dataframe like this?
Any help is much appreciated!
I don't think sort will be of any help here. You will need to take the Euclidean distance of these points from your coordinates and then sort on it. That will give you the closest points.
This is what I tried quickly. See if this works; I didn't get a chance to verify the results.
import numpy as np
import pandas as pd
from scipy import spatial
df = pd.DataFrame({
'id': [545,4138,5281,4369,2173,9971,2622,4522,7299,5425],
'x_coordinate': [0.676576, 0.262979, -0.301234, -0.585544, -1.239105,-1.556373,-1.747544,-1.923251,-2.697775,-4.443474],
'y_coordinate': [3.079094, -0.769170, -3.568590, 1.610388, 3.168139, -1.624628,3.145381,-2.695298,2.038365,0.428256],
})
print(df)
id x_coordinate y_coordinate
0 545 0.676576 3.079094
1 4138 0.262979 -0.769170
2 5281 -0.301234 -3.568590
3 4369 -0.585544 1.610388
4 2173 -1.239105 3.168139
5 9971 -1.556373 -1.624628
6 2622 -1.747544 3.145381
7 4522 -1.923251 -2.695298
8 7299 -2.697775 2.038365
9 5425 -4.443474 0.428256
def shortest_neighbour(pt, nebr):
    # Query the two nearest neighbours of pt: the nearest is pt itself,
    # so return the distance to, and index of, the second-nearest point.
    tree = spatial.KDTree(nebr)
    dist, idx = tree.query(pt, 2)
    return dist[1], idx[1]
arr = df[['x_coordinate', 'y_coordinate']].to_numpy()
df[['Dist', 'Ord']] = pd.DataFrame(
    df.apply(lambda row: shortest_neighbour(arr[row.name], arr), axis=1).tolist(),
    index=df.index)
print(df)
id x_coordinate y_coordinate Dist Ord
0 545 0.676576 3.079094 1.917749 4
1 4138 0.262979 -0.769170 2.010435 5
2 5281 -0.301234 -3.568590 1.842167 7
3 4369 -0.585544 1.610388 1.689299 4
4 2173 -1.239105 3.168139 0.508948 6
5 9971 -1.556373 -1.624628 1.131783 7
6 2622 -1.747544 3.145381 0.508948 4
7 4522 -1.923251 -2.695298 1.131783 5
8 7299 -2.697775 2.038365 1.458912 6
9 5425 -4.443474 0.428256 2.374851 8
dfs=df.sort_values(by=['Ord','Dist'],ascending =[True,True])
print(dfs)
id x_coordinate y_coordinate Dist Ord
6 2622 -1.747544 3.145381 0.508948 4
3 4369 -0.585544 1.610388 1.689299 4
0 545 0.676576 3.079094 1.917749 4
7 4522 -1.923251 -2.695298 1.131783 5
1 4138 0.262979 -0.769170 2.010435 5
4 2173 -1.239105 3.168139 0.508948 6
8 7299 -2.697775 2.038365 1.458912 6
5 9971 -1.556373 -1.624628 1.131783 7
2 5281 -0.301234 -3.568590 1.842167 7
9 5425 -4.443474 0.428256 2.374851 8
And one more thing I tried: the distance from (0,0). See below if that is what you want.
df['Dist']=df.apply(lambda row: np.linalg.norm(np.zeros(2)-np.array([row.x_coordinate,row.y_coordinate])),axis = 1)
print(df.sort_values(by=['Dist'],ascending =[True]))
id x_coordinate y_coordinate Dist
1 4138 0.262979 -0.769170 0.812884
3 4369 -0.585544 1.610388 1.713538
5 9971 -1.556373 -1.624628 2.249825
0 545 0.676576 3.079094 3.152551
7 4522 -1.923251 -2.695298 3.311122
8 7299 -2.697775 2.038365 3.381260
4 2173 -1.239105 3.168139 3.401836
2 5281 -0.301234 -3.568590 3.581281
6 2622 -1.747544 3.145381 3.598240
9 5425 -4.443474 0.428256 4.464064
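Note that sorting by each point's single nearest neighbour doesn't chain the points into a route, though. Here is a minimal sketch of the greedy walk the question describes: start at the first row, then repeatedly hop to the nearest unvisited point. This is the nearest-neighbour heuristic, not a guaranteed-optimal tour, and it uses only NumPy on the coordinates from the question:

```python
import numpy as np

# Coordinates from the question's dataframe, in row order.
coords = np.array([
    [0.676576, 3.079094], [0.262979, -0.769170], [-0.301234, -3.568590],
    [-0.585544, 1.610388], [-1.239105, 3.168139], [-1.556373, -1.624628],
    [-1.747544, 3.145381], [-1.923251, -2.695298], [-2.697775, 2.038365],
    [-4.443474, 0.428256],
])

def greedy_route(coords, start=0):
    # Greedy nearest-neighbour walk: from the current point, always
    # move to the closest point not yet visited.
    remaining = set(range(len(coords)))
    remaining.discard(start)
    route = [start]
    while remaining:
        last = coords[route[-1]]
        nxt = min(remaining, key=lambda i: np.linalg.norm(coords[i] - last))
        route.append(nxt)
        remaining.discard(nxt)
    return route

order = greedy_route(coords)
# df.iloc[order] would then reorder the original dataframe along the route.
```

To close the loop back at the start, you would append the starting index to the route at the end.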
I have a pandas dataframe called ranks with my clusters and their key metrics. I rank them using rank(); however, there are two specific clusters which I want ranked differently from the others.
ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
'3', '4', '5','6', '7', '8', '9'],
'No. Customers': [145118,
2,
1236,
219847,
9837,
64865,
3855,
219549,
34171,
3924120],
'Ave. Recency': [39.0197,
47.0,
15.9716,
41.9736,
23.9330,
24.8281,
26.5647,
17.7493,
23.5205,
24.7933],
'Ave. Frequency': [1.7264,
19.0,
24.9101,
3.0682,
3.2735,
1.8599,
3.9304,
3.3356,
9.1703,
1.1684],
'Ave. Monetary': [14971.85,
237270.00,
126992.79,
17701.64,
172642.35,
13159.21,
54333.56,
17570.67,
42136.68,
4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
I then apply the rank() method like this:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
Which gives me this:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
This does what it's supposed to do; however, the cluster with the highest Ave. Spend needs to be ranked 1 at all times and the cluster with the highest Ave. Recency needs to be ranked last at all times.
So I modified the code above to look like this:
if(ranks['s_rank'].min() == 1):
ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
ranks['overall_rank_2'] = len(ranks)
else:
ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
Please help me modify the above if statement or perhaps recommend a different approach altogether. This of course needs to be as dynamic as possible.
So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.
The solution is five lines. Notes:
You had the right idea with DataFrame.drop(), just use idxmax() to get the index of both of the rows that will need special treatment, and store it, so you don't need a huge unwieldy logical filter expression in your drop.
No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
... via a .sum(axis=1) on your desired columns, no need to define a lambda, or save its output in the temp column 'overall'.
...then we just feed those sum-of-ranks into rank(), which will give us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this).
And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
(Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)
Code (see the bottom for boilerplate to read in your dataframe; next time please make your example an MCVE, to help us help you):
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
ranks['overall_rank'] = 1 + ranks.drop(index=[ismax, irmax])[['r_rank','f_rank','m_rank','s_rank']].sum(axis=1).rank(method='first')
# (Note: in .loc[], can't mix indices (ismax) with column-names)
ranks.loc[ ranks['Ave. Spend'].idxmax(), 'overall_rank' ] = 1
ranks.loc[ ranks['Ave. Recency'].idxmax(), 'overall_rank' ] = len(ranks)
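As a sanity check, here is the same pin-the-extremes pattern on a tiny made-up frame (the column names and values are hypothetical, and the middle rows are ranked on a single column just to keep the sketch short):

```python
import pandas as pd

# Toy frame: pin the best 'spend' row to rank 1 and the worst
# 'recency' row to last, ranking everything else in between.
toy = pd.DataFrame({'spend': [10.0, 50.0, 30.0, 20.0],
                    'recency': [5.0, 2.0, 9.0, 1.0]})
ismax = toy['spend'].idxmax()      # row pinned to rank 1
irmax = toy['recency'].idxmax()    # row pinned to last place

# Rank only the remaining rows, offset by 1 so they land in 2..n-1.
middle = toy.drop(index=[ismax, irmax])
toy['overall_rank'] = 1 + middle['spend'].rank(ascending=False, method='first')
toy.loc[ismax, 'overall_rank'] = 1
toy.loc[irmax, 'overall_rank'] = len(toy)
```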
And here's the boilerplate to ingest your data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent the floats being read in as strings
dat = dat.replace(',', '')
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
    "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))
I use Python pandas to calculate the following formula:
(https://i.stack.imgur.com/XIKBz.png)
I do it in Python like this:
EURUSD['SMA2'] = EURUSD['Close'].rolling(2).mean()
EURUSD['TMA2'] = (EURUSD['Close'] + EURUSD['SMA2']) / 2
The problem is that the code gets long when I calculate TMA 100, so I need to use a for loop to easily change the TMA period.
Thanks in advance
Edited :
I found some code, but there is an error:
values = []
for i in range(1, 201):
    values.append(eurusd['Close']).rolling(window=i).mean()
values.mean()
TMA is an average of averages.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
print(df)
# df['mean0']=df.mean(0)
df['mean1']=df.mean(1)
print(df)
df['TMA'] = df['mean1'].rolling(window=10,center=False).mean()
print(df)
Or you can easily print it.
print(df["mean1"].mean())
Here is how it looks:
0 1 2 3 4
0 0.643560 0.412046 0.072525 0.618968 0.080146
1 0.018226 0.222212 0.077592 0.125714 0.595707
2 0.652139 0.907341 0.581802 0.021503 0.849562
3 0.129509 0.315618 0.711265 0.812318 0.757575
4 0.881567 0.455848 0.470282 0.367477 0.326812
5 0.102455 0.156075 0.272582 0.719158 0.266293
6 0.412049 0.527936 0.054381 0.587994 0.442144
7 0.063904 0.635857 0.244050 0.002459 0.423960
8 0.446264 0.116646 0.990394 0.678823 0.027085
9 0.951547 0.947705 0.080846 0.848772 0.699036
0 1 2 3 4 mean1
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581
0 1 2 3 4 mean1 TMA
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449 NaN
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890 NaN
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470 NaN
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257 NaN
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397 NaN
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313 NaN
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901 NaN
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046 NaN
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842 NaN
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581 0.436115
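If what you actually want is the question's own formula with a configurable period, a small helper generalizes the line TMA2 = (Close + SMA2) / 2 so the period is a single parameter. The eurusd frame below is a made-up stand-in for the real price data:

```python
import pandas as pd

# Sketch: generalize the question's formula TMA_n = (Close + SMA_n) / 2
# so the period is a parameter. 'Close' is a hypothetical price column.
def tma(close: pd.Series, period: int) -> pd.Series:
    sma = close.rolling(period).mean()
    return (close + sma) / 2

eurusd = pd.DataFrame({'Close': [1.10, 1.12, 1.11, 1.15, 1.14, 1.13]})
for n in (2, 3):  # just as easy to write (2, 100) or range(2, 201)
    eurusd[f'TMA{n}'] = tma(eurusd['Close'], n)
```

The first period-1 rows of each TMA column are NaN, since the rolling mean needs a full window before it produces a value.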
I'm attempting to make a heatmap just like this one using Bokeh.
Here is my dataframe Data from which I'm trying to make the heatmap:
Day Code Total
0 1 6001 44
1 1 6002 40
2 1 6006 8
3 1 6008 2
4 1 6010 38
5 1 6011 1
6 1 6014 19
7 1 6018 1
8 1 6019 1
9 1 6023 10
10 1 6028 4
11 2 6001 17
12 2 6010 2
13 2 6014 4
14 2 6020 1
15 2 6028 2
16 3 6001 48
17 3 6002 24
18 3 6003 1
19 3 6005 1
20 3 6006 2
21 3 6008 18
22 3 6010 75
23 3 6011 1
24 3 6014 72
25 3 6023 34
26 3 6028 1
27 3 6038 3
28 4 6001 19
29 4 6002 105
30 5 6001 52
...
And here is my code:
from bokeh.io import output_file
from bokeh.io import show
from bokeh.models import (
ColumnDataSource,
HoverTool,
LinearColorMapper
)
from bokeh.plotting import figure
output_file('SHM_Test.html', title='SHM', mode='inline')
source = ColumnDataSource(Data)
TOOLS = "hover,save"
# Creating the Figure
SHM = figure(title="HeatMap",
x_range=[str(i) for i in range(1,32)],
y_range=[str(i) for i in range(6043,6000,-1)],
x_axis_location="above", plot_width=400, plot_height=970,
tools=TOOLS, toolbar_location='right')
# Figure Styling
SHM.grid.grid_line_color = None
SHM.axis.axis_line_color = None
SHM.axis.major_tick_line_color = None
SHM.axis.major_label_text_font_size = "5pt"
SHM.axis.major_label_standoff = 0
SHM.toolbar.logo = None
SHM.title.text_alpha = 0.3
# Color Mapping
CM = LinearColorMapper(palette='RdPu9', low=Data.Total.min(), high=Data.Total.max())
SHM.rect(x='Day', y="Code", width=1, height=1,source=source,
fill_color={'field': 'Total','transform': CM})
show(SHM)
When I execute my code I don't get any errors, but I just get an empty frame, as shown in the image below.
I've been struggling to find my mistake. Why am I getting this? Where is my error?
The problem with your code is that the data type you set for the x and y axis ranges and the data type in your ColumnDataSource are different. You set x_range and y_range to lists of strings, but judging by your data, the Day and Code columns will be read in as integers.
In your case, you want to make sure that your Day and Code columns are in string format.
This can be easily done with pandas:
Data['Day'] = Data['Day'].astype('str')
Data['Code'] = Data['Code'].astype('str')
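A minimal, Bokeh-free sketch of that fix (the frame below is a made-up stand-in for Data): after astype('str'), the column values line up with the string factors passed to x_range and y_range, so the rect glyphs can actually land on the categorical axes.

```python
import pandas as pd

# Stand-in for the question's Data frame; the real one has more rows.
Data = pd.DataFrame({'Day': [1, 1, 2],
                     'Code': [6001, 6002, 6001],
                     'Total': [44, 40, 17]})

# Convert the categorical axes to strings so they match the
# str-valued x_range/y_range passed to figure().
Data['Day'] = Data['Day'].astype('str')
Data['Code'] = Data['Code'].astype('str')

# Same factor lists as in the question's figure() call.
x_range = [str(i) for i in range(1, 32)]
y_range = [str(i) for i in range(6043, 6000, -1)]
```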
Given two values, how can I get all the values between those two values? Thank you.
for example:
dataframe:
Quarter GDP
0 1947q1 243.1
1 1947q2 246.3
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7
8 1949q1 275.4
9 1949q2 271.7
10 1949q3 273.3
11 1949q4 271.0
12 1950q1 281.2
13 1950q2 290.7
14 1950q3 308.5
15 1950q4 320.3
16 1951q1 336.4
Given 1947q3 and 1948q4, I need to get all the data between (inclusive) those two values:
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7
This will give you the desired result
df[(df['Quarter'] >= '1947q3') & (df['Quarter'] <= '1948q4')]
Quarter GDP
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7
You can also use .between
df[df['Quarter'].between('1947q3', '1948q4')]  # both ends included by default (inclusive='both' in recent pandas)
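Both filters rely on the 'YYYYqN' strings sorting lexicographically in chronological order; a self-contained sketch with a subset of the rows above:

```python
import pandas as pd

# 'YYYYqN' strings compare correctly because lexicographic order
# matches chronological order for this format.
df = pd.DataFrame({
    'Quarter': ['1947q1', '1947q2', '1947q3', '1947q4',
                '1948q1', '1948q2', '1948q3', '1948q4', '1949q1'],
    'GDP': [243.1, 246.3, 250.1, 260.3, 266.2, 272.9, 279.5, 280.7, 275.4],
})

mask = df['Quarter'].between('1947q3', '1948q4')  # inclusive on both ends
window = df[mask]
```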
Put the data in dataframe.txt, then:
data = []
with open('dataframe.txt') as f:
    for line in f:
        infos = line.split()
        if len(infos) == 3:
            data.append((infos[1], infos[2]))
def get_rangedata_by_quarter(quarter_s, quarter_b):
    """quarter_s is the small one,
    quarter_b is the big one
    """
    for quarter, gdp in data:
        if quarter_s <= quarter <= quarter_b:
            print(quarter, gdp)
get_rangedata_by_quarter('1947q3', '1948q4')