Get a sliced dataframe by giving two values - Python

Given two values, how can I get all the values between those two values? Thank you.
For example:
dataframe:
Quarter GDP
0 1947q1 243.1
1 1947q2 246.3
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7
8 1949q1 275.4
9 1949q2 271.7
10 1949q3 273.3
11 1949q4 271.0
12 1950q1 281.2
13 1950q2 290.7
14 1950q3 308.5
15 1950q4 320.3
16 1951q1 336.4
Given 1947q3 and 1948q4, I need to get all the data between (inclusive) those two values:
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7

This will give you the desired result (plain string comparison works here because the 'YYYYqN' format sorts lexicographically in chronological order):
df[(df['Quarter'] >= '1947q3') & (df['Quarter'] <= '1948q4')]
Quarter GDP
2 1947q3 250.1
3 1947q4 260.3
4 1948q1 266.2
5 1948q2 272.9
6 1948q3 279.5
7 1948q4 280.7
You can also use .between. Note that in newer pandas (1.3+) the inclusive argument takes a string rather than a boolean:
df[df['Quarter'].between('1947q3', '1948q4', inclusive='both')]
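Another option is to set Quarter as the index and slice by label; a minimal sketch, assuming the data is sorted as shown (label-based .loc slicing is inclusive on both ends):
df2 = df.set_index('Quarter')
print(df2.loc['1947q3':'1948q4'])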

Put the data in dataframe.txt, then:
data = []
with open('dataframe.txt', 'r') as f:
    for line in f:
        infos = [info.strip() for info in line.split(' ') if info]
        if len(infos) == 3:
            data.append((infos[1], infos[2]))

def get_rangedata_by_quarter(quarter_s, quarter_b):
    """quarter_s is the small one,
    quarter_b is the big one
    """
    for quarter, gdp in data:
        if quarter_s <= quarter <= quarter_b:
            print(quarter, gdp)

get_rangedata_by_quarter('1947q3', '1948q4')
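For comparison, a sketch of the same filter done with pandas instead of a manual parse (assuming dataframe.txt holds the three whitespace-separated columns shown above):
import pandas as pd

df = pd.read_csv('dataframe.txt', sep=r'\s+', header=None,
                 names=['idx', 'Quarter', 'GDP'], index_col=0)
print(df[df['Quarter'].between('1947q3', '1948q4')])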

Related

How to add a column for every month and generate numbers, i.e. 1, 2, 3, etc.

I have a huge CSV file loaded into a dataframe. However, I don't have a date column; I only have the sales for every month from Jan-2022 until Dec-2034. Below is an example of my dataframe:
import pandas as pd
data = [[6661, 'Mobile Phone', 43578, 5000, 78564, 52353, 67456, 86965, 43634, 32546, 56332, 58944, 98878, 68588, 43634, 3463, 74533, 73733, 64436, 45426, 57333, 89762, 4373, 75457, 74845, 86843, 59957, 74563, 745335, 46342, 463473, 52352, 23622],
[6672, 'Play Station', 4475, 2546, 5757, 2352, 57896, 98574, 53536, 56533, 88645, 44884, 76585, 43575, 74573, 75347, 57573, 5736, 53737, 35235, 5322, 54757, 74573, 75473, 77362, 21554, 73462, 74736, 1435, 4367, 63462, 32362, 56332],
[6631, 'Laptop', 35347, 36376, 164577, 94584, 78675, 76758, 75464, 56373, 56343, 54787, 7658, 76584, 47347, 5748, 8684, 75373, 57573, 26626, 25632, 73774, 847373, 736646, 847457, 57346, 43732, 347346, 75373, 6473, 85674, 35743, 45734],
[6600, 'Camera', 14365, 60785, 25436, 46747, 75456, 97644, 63573, 56433, 25646, 32548, 14325, 64748, 68458, 46537, 7537, 46266, 7457, 78235, 46223, 8747, 67453, 4636, 3425, 4636, 352236, 6622, 64625, 36346, 46346, 35225, 6436],
[6643, 'Lamp', 324355, 143255, 696954, 97823, 43657, 66686, 56346, 57563, 65734, 64484, 87685, 54748, 9868, 573, 73472, 5735, 73422, 86352, 5325, 84333, 7473, 35252, 7547, 73733, 7374, 32266, 654747, 85743, 57333, 46346, 46266]]
ds = pd.DataFrame(data, columns = ['ID', 'Product', 'SalesJan-22', 'SalesFeb-22', 'SalesMar-22', 'SalesApr-22', 'SalesMay-22', 'SalesJun-22', 'SalesJul-22', 'SalesAug-22', 'SalesSep-22', 'SalesOct-22', 'SalesNov-22', 'SalesDec-22', 'SalesJan-23', 'SalesFeb-23', 'SalesMar-23', 'SalesApr-23', 'SalesMay-23', 'SalesJun-23', 'SalesJul-23', 'SalesAug-23', 'SalesSep-23', 'SalesOct-23', 'SalesNov-23', 'SalesDec-23', 'SalesJan-24', 'SalesFeb-24', 'SalesMar-24', 'SalesApr-24', 'SalesMay-24', 'SalesJun-24', 'SalesJul-24'])
Since I have more than 10 monthly sales columns, I want to loop over them and insert a date column after each monthly sales column. The first 6 months should generate the number 1, the next 12 months the number 2, the following 12 months the number 3, the subsequent 12 months the number 4, and so on.
Is there any way to perform the loop and add the date column beside each of the sales months?
Here is the simplest approach I can think of:
for i, col in enumerate(ds.columns[2:]):
    # after i insertions the i-th sales column sits at position 2*i + 2,
    # so insert its date column right after it, at 2*i + 3
    ds.insert(2 * i + 3, col.removeprefix("Sales"), (i - 6) // 12 + 2)
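A quick sanity check of the numbering formula, assuming the 31 monthly columns above (Jan-22 through Jul-24):
# i is the 0-based position of each sales column
for i in (0, 5, 6, 17, 18, 29, 30):
    print(i, (i - 6) // 12 + 2)   # -> 1, 1, 2, 2, 3, 3, 4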
Here is a vectorized approach (calling insert repeatedly is inefficient):
import numpy as np

# convert (valid) columns to datetime
cols = pd.to_datetime(ds.columns, format='Sales%b-%y', errors='coerce')
# identify valid dates
m = cols.notna()
# get year
y = cols[m].year
# calculate number (1 for the first 6 months, then +1 per 12 months)
num = ((cols[m].month + 12 * (y - y.min())) + 5) // 12 + 1
# slice the date columns, assign the numbers, rename
df2 = (ds.loc[:, m].assign(**dict(zip(ds.columns[m], num)))
         .rename(columns=lambda x: x[5:])
      )
# get the new order of columns; the sort must be stable so that each
# sales column stays ahead of its matching date column
idx = np.r_[np.zeros((~m).sum()), np.tile(np.arange(m.sum()), 2) + 1]
# concat and reorder
out = pd.concat([ds, df2], axis=1).iloc[:, np.argsort(idx, kind='stable')]
print(out)
output:
     ID       Product  SalesJan-22  Jan-22  SalesFeb-22  Feb-22  ...  SalesJun-24  Jun-24  SalesJul-24  Jul-24
0  6661  Mobile Phone        43578       1         5000       1  ...        52352       3        23622       4
1  6672  Play Station         4475       1         2546       1  ...        32362       3        56332       4
2  6631        Laptop        35347       1        36376       1  ...        35743       3        45734       4
3  6600        Camera        14365       1        60785       1  ...        35225       3         6436       4
4  6643          Lamp       324355       1       143255       1  ...        46346       3        46266       4
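To see why the stable argsort interleaves the columns correctly, here is a tiny sketch with one meta column and two sales/date pairs:
import numpy as np

idx = np.r_[np.zeros(1), np.tile(np.arange(2), 2) + 1]   # [0., 1., 2., 1., 2.]
print(np.argsort(idx, kind='stable'))                    # [0 1 3 2 4]
# column 0 stays first; each sales column (positions 1, 2) is
# followed by its matching date column (positions 3, 4)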
Here's a little solution (I put the year instead of your 1, 2, ... numbering since I thought it was more representative, but you can change it easily):
idx_counter = 0
for idx, col in enumerate(ds.columns):
    if col.startswith('Sales'):
        date = col.replace('Sales', '')
        year = col.split('-')[1]
        # each earlier insertion shifts this column one place to the right
        ds.insert(loc=idx + 1 + idx_counter, column=date, value=[year] * ds.shape[0])
        idx_counter += 1
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 ... SalesMar-24 Mar-24 SalesApr-24 Apr-24 SalesMay-24 May-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 22 5000 22 78564 22 52353 22 ... 745335 24 46342 24 463473 24 52352 24 23622 24
1 6672 Play Station 4475 22 2546 22 5757 22 2352 22 ... 1435 24 4367 24 63462 24 32362 24 56332 24
2 6631 Laptop 35347 22 36376 22 164577 22 94584 22 ... 75373 24 6473 24 85674 24 35743 24 45734 24
3 6600 Camera 14365 22 60785 22 25436 22 46747 22 ... 64625 24 36346 24 46346 24 35225 24 6436 24
4 6643 Lamp 324355 22 143255 22 696954 22 97823 22 ... 654747 24 85743 24 57333 24 46346 24 46266 24
This should do the trick (using ds as defined in the question):
import math

new_cols = []
old_cols = [x for x in ds.columns if x.startswith('Sales')]
for i, col in enumerate(old_cols):
    new_cols.append(col[5:])   # column name with the 'Sales' prefix stripped
    if i < 6:
        val = 1
    else:
        val = ((i + 6) / 12) + 1
    ds[col[5:]] = math.floor(val)
# select with the sales and date columns interleaved
ds[['ID', 'Product'] + [x for y in zip(old_cols, new_cols) for x in y]]
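For reference, the last line interleaves the two column lists; a tiny illustration of the zip flattening:
old = ['SalesJan-22', 'SalesFeb-22']
new = ['Jan-22', 'Feb-22']
print([x for y in zip(old, new) for x in y])
# ['SalesJan-22', 'Jan-22', 'SalesFeb-22', 'Feb-22']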

Sorting a DataFrame with an index in the form of a list

I have a pandas df that I need to sort based on a fixed order given in a list. The problem I'm having is that my sort attempt is not moving the data rows into the order designated by the list. My list and DataFrame (df) look like this:
months = ['5','6','7','8','9','10','11','12','1','2']
df =
year 1992 1993 1994 1995
month
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
The closest that I have gotten is this:
newdf = pd.DataFrame(df.values, index=list(months))
...but it does not move the rows. This command only replaces the index with the months, without moving the data.
0 1 2 3
5 -0.343107 -0.211959 0.437974 -1.219363
6 -0.383353 0.888650 1.054926 0.714846
7 0.057198 1.246682 0.042684 0.275701
8 -0.100018 -0.801554 0.001111 0.382633
9 -0.283815 0.204448 0.350705 0.130652
10 0.042195 -0.433849 -1.481228 -0.236004
11 1.059776 0.875214 0.304638 0.127819
12 -0.328911 -0.256656 1.081157 1.057449
1 -0.488213 -0.957050 -0.813885 1.403822
2 0.973031 -0.246714 0.600157 0.579038
I need the result to look like -
year 1992 1993 1994 1995
month
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
Assuming df.index is dtype('int64'), first convert months to integers. Then use loc:
months = [*map(int, months)]
out = df.loc[months]
If df.index is dtype('O'), you can use loc right away, i.e. you don't need the first line.
Output:
year 1992 1993 1994 1995
month
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
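An equivalent sketch using reindex (under the same assumption about the index dtype); note that, unlike loc, reindex inserts NaN rows for labels missing from the index instead of raising:
out = df.reindex([*map(int, months)])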

Pandas filter using multiple conditions and ignore entries that contain duplicates of substring

I have a dataframe derived from a massive list of market tickers from a crypto exchange.
The list includes ALL combos, yet I only need the tickers that are priced against USD stablecoins.
The first 15 entries of the original dataframe...
Asset Price
0 1INCHBTC 0.00009650
1 1INCHBUSD 5.74340000
2 1INCHUSDT 5.74050000
3 AAVEBKRW 164167.00000000
4 AAVEBNB 0.77600000
5 AAVEBTC 0.00615200
6 AAVEBUSD 365.00200000
7 AAVEDOWNUSDT 2.02505200
8 AAVEETH 0.17212000
9 AAVEUPUSDT 81.89500000
10 AAVEUSDT 365.57600000
11 ACMBTC 0.00018420
12 ACMBUSD 10.91700000
13 ACMUSDT 10.89500000
14 ADAAUD 1.59600000
Now... there are many USD stablecoins, but not every ticker has a pair with one.
So I used the most popular ones to make sure every asset has at least one match.
df = df.loc[(df.Asset.str[-3:] == 'DAI') |
            (df.Asset.str[-4:] == 'USDT') |
            (df.Asset.str[-4:] == 'BUSD') |
            (df.Asset.str[-4:] == 'TUSD')]
The 1st 15 entries of the new but 'messy' dataframe...
Asset Price
0 1INCHBUSD 5.74340000
1 1INCHUSDT 5.74050000
2 AAVEBUSD 365.00200000
3 AAVEDOWNUSDT 2.02505200
4 AAVEUPUSDT 81.89500000
5 AAVEUSDT 365.57600000
6 ACMBUSD 10.91700000
7 ACMUSDT 10.89500000
8 ADABUSD 1.21439000
9 ADADOWNUSDT 3.46482700
10 ADATUSD 1.21284000
11 ADAUPUSDT 76.12900000
12 ADAUSDT 1.21394000
13 AERGOBUSD 0.43012000
14 AIONBUSD 0.07210000
How do I filter/merge entries in this dataframe so that it removes duplicates?
I also need the substring removed from the end, so I'm left with just the asset and its USD price.
It should look something like this...
Asset Price
0 1INCH 5.74340000
2 AAVE 365.00200000
3 AAVEDOWN 2.02505200
4 AAVEUP 81.89500000
6 ACM 10.91700000
8 ADA 1.21439000
9 ADADOWN 3.46482700
11 ADAUP 76.12900000
13 AERGO 0.43012000
14 AION 0.07210000
This is for a portfolio tracker.
Also, if there is a better way to do this without the middle step, I'm all ears.
According to your expected output, you want to remove duplicates but keep the first item:
df.Asset = df.Asset.str.replace(r"(DAI|USDT|BUSD|TUSD)$", "", regex=True)
df = df.drop_duplicates(subset="Asset", keep="first")
print(df)
Prints:
Asset Price
0 1INCH 5.743400
2 AAVE 365.002000
3 AAVEDOWN 2.025052
4 AAVEUP 81.895000
6 ACM 10.917000
8 ADA 1.214390
9 ADADOWN 3.464827
11 ADAUP 76.129000
13 AERGO 0.430120
14 AION 0.072100
EDIT: To group and average:
df.Asset = df.Asset.str.replace(r"(DAI|USDT|BUSD|TUSD)$", "", regex=True)
df = df.groupby("Asset")["Price"].mean().reset_index()
print(df)
Prints:
Asset Price
0 1INCH 5.741950
1 AAVE 365.289000
2 AAVEDOWN 2.025052
3 AAVEUP 81.895000
4 ACM 10.906000
5 ADA 1.213723
6 ADADOWN 3.464827
7 ADAUP 76.129000
8 AERGO 0.430120
9 AION 0.072100
Just do
import numpy as np

con1 = df.Asset.str[-3:] == 'DAI'
con2 = df.Asset.str[-4:] == 'USDT'
con3 = df.Asset.str[-4:] == 'BUSD'
con4 = df.Asset.str[-4:] == 'TUSD'
# build the bare asset name by stripping the matched suffix
df['new'] = np.select([con1, con2, con3, con4],
                      [df.Asset.str[:-3], df.Asset.str[:-4],
                       df.Asset.str[:-4], df.Asset.str[:-4]],
                      default='')
out = df[con1 | con2 | con3 | con4].groupby('new').head(1)
or
df[con1 | con2 | con3 | con4].drop_duplicates('new')
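If you want to avoid the middle step entirely, one possible single pass (a sketch, assuming the same four suffixes) strips the suffix and drops non-matching tickers in one go:
out = (df.assign(Asset=df['Asset'].str.extract(r'^(.*?)(?:DAI|USDT|BUSD|TUSD)$', expand=False))
         .dropna(subset=['Asset'])
         .drop_duplicates('Asset'))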

extracting upper and lower row if a condition is met

Regards.
I have the following coordinate dataframe, divided into blocks. Each block starts at seq0_leftend, seq0_rightend, seq1_leftend, seq1_rightend, seq2_leftend, seq2_rightend, seq3_leftend, seq3_rightend, and so on. For each block, whenever a row's coordinates are negative, I would like to extract that row together with the row above and the row below it. Example of my dataframe file:
seq0_leftend seq0_rightend
0 7 107088
1 107089 108940
2 108941 362759
3 362760 500485
4 500486 509260
5 509261 702736
seq1_leftend seq1_rightend
0 1 106766
1 106767 108619
2 108620 355933
3 355934 488418
4 488419 497151
5 497152 690112
6 690113 700692
7 700693 721993
8 721994 722347
9 722348 946296
10 946297 977714
11 977715 985708
12 -985709 -990725
13 991992 1042023
14 1042024 1259523
15 1259524 1261239
seq2_leftend seq2_rightend
0 1 109407
1 362514 364315
2 109408 362513
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
13 1015418 1069976
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
0 1 112030
1 112031 113882
2 113883 381662
3 381663 519575
4 519576 528317
5 528318 724500
6 724501 735077
7 735078 759456
8 759457 763157
9 763158 996929
10 996931 1034492
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
14 1125427 1353901
15 1353902 1356209
16 1356210 1392818
seq4_leftend seq4_rightend
0 1 105722
1 105723 107575
2 107576 355193
3 355194 487487
4 487488 496220
5 496221 689560
6 689561 700139
7 700140 721438
8 721458 721497
9 721498 947183
10 947184 978601
11 978602 986595
12 -986596 -991612
13 994605 1046245
14 1046247 1264692
15 1264693 1266814
Finally, write a new CSV with the data of interest. An example of the final result that I would like would be this:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
seq4_leftend seq4_rightend
11 978602 986595
12 -986596 -991612
13 994605 1046245
I assume that you have a list of DataFrames, let's call it src.
To convert a single DataFrame, define the following function:
def findRows(df):
    col = df.iloc[:, 0]
    if col.lt(0).any():
        return df[col.lt(0) | col.shift(1).lt(0) | col.shift(-1).lt(0)]
    else:
        return None
Note that this function starts with reading column 0 from the source
DataFrame, so it is independent of the name of this column.
Then it checks whether any element in this column is < 0.
If found, the returned object is a DataFrame with rows which
contain a value < 0:
either in this element,
or in the previous element,
or in the next element.
If not found, this function returns None (from your expected result
I see that in such a case you don't want even an empty DataFrame).
The first stage is to collect results of this function called on each
DataFrame from src:
result = [ findRows(df) for df in src ]
And the last part is to filter out elements which are None:
result = list(filter(None.__ne__, result))
To see the result, run:
for df in result:
    print(df)
For src containing the first 3 of your DataFrames, I got:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
As you can see, the resulting list contains only results
originating from the second and third source DataFrames.
The first was filtered out, since findRows returned
None for it.
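If you don't have src yet, here is one possible way to build it from the text file shown in the question (a sketch; the file name blocks.txt is an assumption), splitting on the header lines that contain 'leftend':
import pandas as pd

src = []
header, rows = None, []
with open('blocks.txt') as f:
    for line in f:
        parts = line.split()
        if 'leftend' in line:            # a header line starts a new block
            if header is not None:
                src.append(pd.DataFrame(rows, columns=header))
            header, rows = parts, []
        elif parts:                      # data line: row label, left, right
            rows.append([int(parts[1]), int(parts[2])])
if header is not None:                   # flush the last block
    src.append(pd.DataFrame(rows, columns=header))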

Is there a way to rank some items in a pandas dataframe and exclude others?

I have a pandas dataframe called ranks with my clusters and their key metrics. I rank them using rank(); however, there are two specific clusters which I want ranked differently from the others.
ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
'3', '4', '5','6', '7', '8', '9'],
'No. Customers': [145118,
2,
1236,
219847,
9837,
64865,
3855,
219549,
34171,
3924120],
'Ave. Recency': [39.0197,
47.0,
15.9716,
41.9736,
23.9330,
24.8281,
26.5647,
17.7493,
23.5205,
24.7933],
'Ave. Frequency': [1.7264,
19.0,
24.9101,
3.0682,
3.2735,
1.8599,
3.9304,
3.3356,
9.1703,
1.1684],
'Ave. Monetary': [14971.85,
237270.00,
126992.79,
17701.64,
172642.35,
13159.21,
54333.56,
17570.67,
42136.68,
4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
I then apply the rank() method like this:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
Which gives me this:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
This does what it's supposed to do; however, the cluster with the highest Ave. Spend needs to be ranked 1 at all times, and the cluster with the highest Ave. Recency needs to be ranked last at all times.
So I modified the code above to look like this:
if ranks['s_rank'].min() == 1:
    ranks['overall_rank_2'] = 1
elif ranks['r_rank'].max() == len(ranks):
    ranks['overall_rank_2'] = len(ranks)
else:
    ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],
                                      ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
    ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
    ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
    ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
    ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
    ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
    ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
Please help me modify the above if statement, or perhaps recommend a different approach altogether. This of course needs to be as dynamic as possible.
So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.
The solution is five lines. Notes:
You had the right idea with DataFrame.drop(); just use idxmax() to get the indices of the two rows that need special treatment, and store them, so you don't need a huge unwieldy logical filter expression in your drop.
No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
... via a .sum(axis=1) on your desired columns, no need to define a lambda, or save its output in the temp column 'overall'.
...then we just feed those sum-of-ranks into rank(), which will give us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this).
And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
(Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)
Code: (see the bottom for boilerplate to read in your dataframe; next time please make your example an MCVE, to help us help you)
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
ranks['overall_rank'] = 1 + ranks.drop(index=[ismax, irmax])[['r_rank', 'f_rank', 'm_rank', 's_rank']].sum(axis=1).rank(method='first')
# Pin the two special rows
ranks.loc[ismax, 'overall_rank'] = 1
ranks.loc[irmax, 'overall_rank'] = len(ranks)
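A quick check, assuming the data above, that the final ranking uses each rank from 1 to len(ranks) exactly once:
assert sorted(ranks['overall_rank']) == list(range(1, len(ranks) + 1))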
And here's the boilerplate to ingest your data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
    "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))
