How to sum a column with a condition - python

I have data in a text file. Example:
A B C D E F
10 0 0.9775 39.3304 0.9311 60.5601
10 1 0.9802 32.3287 0.9433 56.1201
10 2 0.9816 39.9759 0.9446 54.0428
10 3 0.9737 37.8779 0.9419 56.3865
10 4 0.9798 34.9152 0.905 69.0879
10 5 0.9803 50.057 0.9201 64.6289
10 6 0.9805 39.1062 0.9093 68.4061
10 7 0.9781 33.8874 0.9327 60.7631
10 8 0.9802 32.5734 0.9376 60.9165
10 9 0.9798 32.3466 0.94 54.7645
11 0 0.9749 40.2712 0.9042 71.2873
11 1 0.9755 35.6546 0.9195 63.7436
11 2 0.9766 36.753 0.9507 51.7864
11 3 0.9779 35.6485 0.9371 59.2483
11 4 0.9803 35.2712 0.8833 79.0257
11 5 0.981 46.5462 0.9156 66.6951
11 6 0.9809 41.8181 0.8642 83.7533
11 7 0.9749 36.7484 0.9259 62.36
11 8 0.9736 36.8859 0.9395 58.1538
11 9 0.98 32.4069 0.9255 61.202
12 0 0.9812 37.2547 0.9121 68.1347
12 1 0.9808 31.4568 0.9372 55.9992
12 2 0.9813 36.5316 0.9497 53.1687
12 3 0.9803 33.1063 0.9051 69.8894
12 4 0.9786 35.0318 0.8968 72.9963
12 5 0.9756 63.441 0.9091 69.9482
12 6 0.9804 39.1602 0.9156 65.2399
12 7 0.976 35.5875 0.9248 62.6284
12 8 0.9779 33.7774 0.9416 56.3755
12 9 0.9804 32.0849 0.9401 55.2871
I want the sum of column C for each unique value in column A (each value spans 10 lines). Please advise me.

>>> L=map(str.split, """10 0 0.9775 39.3304 0.9311 60.5601
... 10 1 0.9802 32.3287 0.9433 56.1201
... 10 2 0.9816 39.9759 0.9446 54.0428
... 10 3 0.9737 37.8779 0.9419 56.3865
... 10 4 0.9798 34.9152 0.905 69.0879
... 10 5 0.9803 50.057 0.9201 64.6289
... 10 6 0.9805 39.1062 0.9093 68.4061
... 10 7 0.9781 33.8874 0.9327 60.7631
... 10 8 0.9802 32.5734 0.9376 60.9165
... 10 9 0.9798 32.3466 0.94 54.7645
... 11 0 0.9749 40.2712 0.9042 71.2873
... 11 1 0.9755 35.6546 0.9195 63.7436
... 11 2 0.9766 36.753 0.9507 51.7864
... 11 3 0.9779 35.6485 0.9371 59.2483
... 11 4 0.9803 35.2712 0.8833 79.0257
... 11 5 0.981 46.5462 0.9156 66.6951
... 11 6 0.9809 41.8181 0.8642 83.7533
... 11 7 0.9749 36.7484 0.9259 62.36
... 11 8 0.9736 36.8859 0.9395 58.1538
... 11 9 0.98 32.4069 0.9255 61.202
... 12 0 0.9812 37.2547 0.9121 68.1347
... 12 1 0.9808 31.4568 0.9372 55.9992
... 12 2 0.9813 36.5316 0.9497 53.1687
... 12 3 0.9803 33.1063 0.9051 69.8894
... 12 4 0.9786 35.0318 0.8968 72.9963
... 12 5 0.9756 63.441 0.9091 69.9482
... 12 6 0.9804 39.1602 0.9156 65.2399
... 12 7 0.976 35.5875 0.9248 62.6284
... 12 8 0.9779 33.7774 0.9416 56.3755
... 12 9 0.9804 32.0849 0.9401 55.2871""".split("\n"))
>>> from collections import defaultdict
>>> D = defaultdict(float)
>>> for a,b,c,d,e,f in L:
...     D[a] += float(c)
...
>>> D
defaultdict(<type 'float'>, {'11': 9.7756, '10': 9.791699999999999, '12': 9.7925})
>>> dict(D.items())
{'11': 9.7756, '10': 9.791699999999999, '12': 9.7925}

with open('data.txt') as f:
    next(f)  # skip the header line
    d = dict()
    for x in f:
        fields = x.split()
        if fields[0] not in d:
            d[fields[0]] = float(fields[2])
        else:
            d[fields[0]] += float(fields[2])
output:
{'11': 9.7756, '10': 9.791699999999999, '12': 9.7925}

For fun
#!/usr/bin/env ksh
while <file; do
    ((a[$1]+=$3))
done
print -C a
output
([10]=9.7917 [11]=9.7756 [12]=9.7925)
Requires the undocumented FILESCAN compile-time option.

If you want the sum grouped by A value:
awk '{sums[$1] += $3} END {for (sum in sums) print sum, sums[sum]}' inputfile

import csv
with open("file.txt", newline="") as f:
    # the data is whitespace-delimited, so tell csv.reader about the delimiter
    reader = csv.reader(f, delimiter=" ")
    # read header
    next(reader)
    # summarize
    a_values = []
    total = 0.0  # renamed from 'sum', which shadows the built-in
    for row in reader:
        if row[0] not in a_values:
            a_values.append(row[0])
        total += float(row[2])
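For completeness, a pandas sketch of the grouped sum (assuming the whitespace-delimited layout shown in the question; the file name data.txt is a placeholder):

import pandas as pd

# read the whitespace-delimited file; the first line supplies the column names
df = pd.read_csv('data.txt', sep=r'\s+')
# sum column C for each distinct value in column A
print(df.groupby('A')['C'].sum())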

Related

How to add a column for every month and generate a number, i.e. 1, 2, 3, etc.

I have a huge CSV file loaded as a dataframe. However, I don't have a date column; I only have the sales for every month from Jan-2022 until Dec-2034. Below is an example of my dataframe:
import pandas as pd
data = [[6661, 'Mobile Phone', 43578, 5000, 78564, 52353, 67456, 86965, 43634, 32546, 56332, 58944, 98878, 68588, 43634, 3463, 74533, 73733, 64436, 45426, 57333, 89762, 4373, 75457, 74845, 86843, 59957, 74563, 745335, 46342, 463473, 52352, 23622],
[6672, 'Play Station', 4475, 2546, 5757, 2352, 57896, 98574, 53536, 56533, 88645, 44884, 76585, 43575, 74573, 75347, 57573, 5736, 53737, 35235, 5322, 54757, 74573, 75473, 77362, 21554, 73462, 74736, 1435, 4367, 63462, 32362, 56332],
[6631, 'Laptop', 35347, 36376, 164577, 94584, 78675, 76758, 75464, 56373, 56343, 54787, 7658, 76584, 47347, 5748, 8684, 75373, 57573, 26626, 25632, 73774, 847373, 736646, 847457, 57346, 43732, 347346, 75373, 6473, 85674, 35743, 45734],
[6600, 'Camera', 14365, 60785, 25436, 46747, 75456, 97644, 63573, 56433, 25646, 32548, 14325, 64748, 68458, 46537, 7537, 46266, 7457, 78235, 46223, 8747, 67453, 4636, 3425, 4636, 352236, 6622, 64625, 36346, 46346, 35225, 6436],
[6643, 'Lamp', 324355, 143255, 696954, 97823, 43657, 66686, 56346, 57563, 65734, 64484, 87685, 54748, 9868, 573, 73472, 5735, 73422, 86352, 5325, 84333, 7473, 35252, 7547, 73733, 7374, 32266, 654747, 85743, 57333, 46346, 46266]]
ds = pd.DataFrame(data, columns = ['ID', 'Product', 'SalesJan-22', 'SalesFeb-22', 'SalesMar-22', 'SalesApr-22', 'SalesMay-22', 'SalesJun-22', 'SalesJul-22', 'SalesAug-22', 'SalesSep-22', 'SalesOct-22', 'SalesNov-22', 'SalesDec-22', 'SalesJan-23', 'SalesFeb-23', 'SalesMar-23', 'SalesApr-23', 'SalesMay-23', 'SalesJun-23', 'SalesJul-23', 'SalesAug-23', 'SalesSep-23', 'SalesOct-23', 'SalesNov-23', 'SalesDec-23', 'SalesJan-24', 'SalesFeb-24', 'SalesMar-24', 'SalesApr-24', 'SalesMay-24', 'SalesJun-24', 'SalesJul-24'])
Since I have more than 10 monthly sales columns, I want to loop over them and insert a date column after each monthly sales column. The first 6 months should generate the number 1, the next 12 months the number 2, the 12 months after that the number 3, the subsequent 12 months the number 4, and so on.
Below shows a sample of the result that I want:
Is there any way to perform the loop and add the date column beside each of the monthly sales columns?
Here is the simplest approach I can think of:
for i, col in enumerate(ds.columns[2:]):
    ds.insert(2 * i + 3, col.removeprefix("Sales"), (i - 6) // 12 + 2)
Here is a vectorial approach (using insert repeatedly is inefficient):
import numpy as np

# convert (valid) columns to datetime
cols = pd.to_datetime(ds.columns, format='Sales%b-%y', errors='coerce')
# identify valid dates
m = cols.notna()
# get year
y = cols[m].year
# calculate number (1 for first 6 months, then +1 per 12 months)
num = ((cols[m].month+12*(y-y.min()))+5)//12+1
# slice dates columns, assign the number, rename
df2 = (ds.loc[:, m].assign(**dict(zip(ds.columns[m], num)))
         .rename(columns=lambda x: x[5:])
       )
# get new order of columns (np.argsort is not stable, so the order within each
# Sales/date pair may vary; kind='stable' would keep Sales first consistently)
idx = np.r_[np.zeros((~m).sum()), np.tile(np.arange(m.sum()), 2)+1]
# concat and reorder
out = pd.concat([ds, df2], axis=1).iloc[:, np.argsort(idx)]
print(out)
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 SalesMay-22 May-22 SalesJun-22 Jun-22 SalesJul-22 Jul-22 SalesAug-22 Aug-22 Sep-22 SalesSep-22 Oct-22 SalesOct-22 SalesNov-22 Nov-22 Dec-22 SalesDec-22 Jan-23 SalesJan-23 Feb-23 SalesFeb-23 SalesMar-23 Mar-23 Apr-23 SalesApr-23 SalesMay-23 May-23 SalesJun-23 Jun-23 Jul-23 SalesJul-23 SalesAug-23 Aug-23 Sep-23 SalesSep-23 SalesOct-23 Oct-23 Nov-23 SalesNov-23 Dec-23 SalesDec-23 Jan-24 SalesJan-24 Feb-24 SalesFeb-24 Mar-24 SalesMar-24 Apr-24 SalesApr-24 May-24 SalesMay-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 1 5000 1 78564 1 52353 1 67456 1 86965 1 43634 2 32546 2 2 56332 2 58944 98878 2 2 68588 2 43634 2 3463 74533 2 2 73733 64436 2 45426 2 3 57333 89762 3 3 4373 75457 3 3 74845 3 86843 3 59957 3 74563 3 745335 3 46342 3 463473 52352 3 23622 4
1 6672 Play Station 4475 1 2546 1 5757 1 2352 1 57896 1 98574 1 53536 2 56533 2 2 88645 2 44884 76585 2 2 43575 2 74573 2 75347 57573 2 2 5736 53737 2 35235 2 3 5322 54757 3 3 74573 75473 3 3 77362 3 21554 3 73462 3 74736 3 1435 3 4367 3 63462 32362 3 56332 4
2 6631 Laptop 35347 1 36376 1 164577 1 94584 1 78675 1 76758 1 75464 2 56373 2 2 56343 2 54787 7658 2 2 76584 2 47347 2 5748 8684 2 2 75373 57573 2 26626 2 3 25632 73774 3 3 847373 736646 3 3 847457 3 57346 3 43732 3 347346 3 75373 3 6473 3 85674 35743 3 45734 4
3 6600 Camera 14365 1 60785 1 25436 1 46747 1 75456 1 97644 1 63573 2 56433 2 2 25646 2 32548 14325 2 2 64748 2 68458 2 46537 7537 2 2 46266 7457 2 78235 2 3 46223 8747 3 3 67453 4636 3 3 3425 3 4636 3 352236 3 6622 3 64625 3 36346 3 46346 35225 3 6436 4
4 6643 Lamp 324355 1 143255 1 696954 1 97823 1 43657 1 66686 1 56346 2 57563 2 2 65734 2 64484 87685 2 2 54748 2 9868 2 573 73472 2 2 5735 73422 2 86352 2 3 5325 84333 3 3 7473 35252 3 3 7547 3 73733 3 7374 3 32266 3 654747 3 85743 3 57333 46346 3 46266 4
Here's a little solution (I put the year instead of your 1, 2, ... incrementation since I thought it is more representative, but you can change it easily):
idx_counter = 0
for idx, col in enumerate(ds.columns):
    if col.startswith('Sales'):
        date = col.replace('Sales', '')
        year = col.split('-')[1]
        ds.insert(loc=idx + 1 + idx_counter, column=date, value=[year] * ds.shape[0])
        idx_counter += 1
output:
ID Product SalesJan-22 Jan-22 SalesFeb-22 Feb-22 SalesMar-22 Mar-22 SalesApr-22 Apr-22 ... SalesMar-24 Mar-24 SalesApr-24 Apr-24 SalesMay-24 May-24 SalesJun-24 Jun-24 SalesJul-24 Jul-24
0 6661 Mobile Phone 43578 22 5000 22 78564 22 52353 22 ... 745335 24 46342 24 463473 24 52352 24 23622 24
1 6672 Play Station 4475 22 2546 22 5757 22 2352 22 ... 1435 24 4367 24 63462 24 32362 24 56332 24
2 6631 Laptop 35347 22 36376 22 164577 22 94584 22 ... 75373 24 6473 24 85674 24 35743 24 45734 24
3 6600 Camera 14365 22 60785 22 25436 22 46747 22 ... 64625 24 36346 24 46346 24 35225 24 6436 24
4 6643 Lamp 324355 22 143255 22 696954 22 97823 22 ... 654747 24 85743 24 57333 24 46346 24 46266 24
This should do the trick:
import math
new_cols = []
old_cols = [x for x in ds.columns if x.startswith('Sales')]
for i, col in enumerate(old_cols):
    new_cols.append(col[5:])
    if i < 6:
        val = 1
    else:
        val = ((i + 6) / 12) + 1
    ds[col[5:]] = math.floor(val)
ds[['ID', 'Product'] + [x for y in zip(old_cols, new_cols) for x in y]]

How do I find the maximum and count of an object in a dataframe in Python?

The complete dataset along with supplementary
information and variable descriptions can be downloaded from the Harvard Dataverse at
https://doi.org/10.7910/DVN/HG7NV7
Code that I have used:
import pandas as pd
import numpy as np
df_2005=pd.read_csv('2005.csv.bz2')
df_2006=pd.read_csv('2006.csv.bz2')
airport_df=pd.read_csv('airports.csv')
carrier_df=pd.read_csv('carriers.csv')
planes_df=pd.read_csv('plane-data.csv')
variables_df=pd.read_csv('variable-descriptions.csv')
main_df=pd.concat([df_2005,df_2006],ignore_index=True)
grouped_df = main_df.groupby("Month")
grouped_df = grouped_df.agg({"Dest": "unique"})
grouped_df = grouped_df.reset_index()
print(grouped_df)
Output for grouped_df:
Month Dest
0 1 [ORD, BOS, SAT, DAY, MSP, SLC, SFO, DEN, PDX, ...
1 2 [SAT, SJC, BOS, ORD, DAY, SLC, PDX, SFO, PIT, ...
2 3 [ORD, SJC, DEN, MSY, IND, SAN, OAK, SFO, SEA, ...
3 4 [MIA, DEN, LAX, MSP, SJC, CLT, ORD, HDN, MCI, ...
4 5 [DEN, MSP, SNA, SAN, ORD, EWR, SFO, ABQ, BWI, ...
5 6 [SNA, OMA, DEN, LGA, SAN, SJC, CLT, ORD, STL, ...
6 7 [ORD, SEA, DEN, DFW, PVD, SNA, MIA, OMA, SFO, ...
7 8 [STL, DEN, PDX, ORD, SFO, MSY, MIA, PHL, IAH, ...
8 9 [DEN, ORD, EWR, SFO, ABQ, MSP, IAH, PDX, CLT, ...
9 10 [DEN, ORD, ATL, LAX, MSP, DFW, IAD, SAN, DTW, ...
10 11 [PDX, DEN, MCI, SFO, CLT, ORD, ATL, MSP, DFW, ...
11 12 [PDX, LAX, ORD, BDL, DEN, SLC, SFO, BOS, MCI, ...
The question was: How does the number of people flying between different locations change over time?
Hence, I was planning to find the destination with the maximum number of flights (under the Dest column) for each month.
Expected output:
Month HighestDest CountOfHighestDest
0 1 ORD 55
1 2 SAT 54
2 3 ORD 33
3 4 MIA 45
4 5 DEN 66
5 6 SNA 73
6 7 ORD 54
7 8 STL 23
8 9 DEN 11
9 10 DEN 44
10 11 PDX 45
11 12 PDX 47
That is, it shows the most frequent destination per month and its count.
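A sketch of one way to get that result (hypothetical, not tested against the full dataset; names follow the question's grouped_df code):

# count flights per (Month, Dest) pair
counts = (main_df.groupby(['Month', 'Dest']).size()
          .reset_index(name='CountOfHighestDest'))
# for each month, keep the destination with the largest count
out = (counts.loc[counts.groupby('Month')['CountOfHighestDest'].idxmax()]
       .rename(columns={'Dest': 'HighestDest'})
       .reset_index(drop=True))
print(out)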

Extracting the upper and lower row if a condition is met

I have the following coordinate dataframe, divided into blocks. Each block starts with a header pair: seq0_leftend, seq0_rightend, then seq1_leftend, seq1_rightend, then seq2_leftend, seq2_rightend, and so on. For each block, whenever a row's coordinates are negative, I would like to extract that row together with the rows directly above and below it. An example of my dataframe file:
seq0_leftend seq0_rightend
0 7 107088
1 107089 108940
2 108941 362759
3 362760 500485
4 500486 509260
5 509261 702736
seq1_leftend seq1_rightend
0 1 106766
1 106767 108619
2 108620 355933
3 355934 488418
4 488419 497151
5 497152 690112
6 690113 700692
7 700693 721993
8 721994 722347
9 722348 946296
10 946297 977714
11 977715 985708
12 -985709 -990725
13 991992 1042023
14 1042024 1259523
15 1259524 1261239
seq2_leftend seq2_rightend
0 1 109407
1 362514 364315
2 109408 362513
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
13 1015418 1069976
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
0 1 112030
1 112031 113882
2 113883 381662
3 381663 519575
4 519576 528317
5 528318 724500
6 724501 735077
7 735078 759456
8 759457 763157
9 763158 996929
10 996931 1034492
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
14 1125427 1353901
15 1353902 1356209
16 1356210 1392818
seq4_leftend seq4_rightend
0 1 105722
1 105723 107575
2 107576 355193
3 355194 487487
4 487488 496220
5 496221 689560
6 689561 700139
7 700140 721438
8 721458 721497
9 721498 947183
10 947184 978601
11 978602 986595
12 -986596 -991612
13 994605 1046245
14 1046247 1264692
15 1264693 1266814
Finally, I want to write a new CSV with the data of interest. An example of the final result that I would like is this:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
seq3_leftend seq3_rightend
11 1034493 1040984
12 -1040985 -1061402
13 1071212 1125426
seq4_leftend seq4_rightend
11 978602 986595
12 -986596 -991612
13 994605 1046245
I assume that you have a list of DataFrames, let's call it src.
To convert a single DataFrame, define the following function:
def findRows(df):
    col = df.iloc[:, 0]
    if col.lt(0).any():
        return df[col.lt(0) | col.shift(1).lt(0) | col.shift(-1).lt(0)]
    else:
        return None
Note that this function starts with reading column 0 from the source
DataFrame, so it is independent of the name of this column.
Then it checks whether any element in this column is < 0.
If found, the returned object is a DataFrame with rows which
contain a value < 0:
either in this element,
or in the previous element,
or in the next element.
If not found, this function returns None (from your expected result I see that in such a case you don't even want an empty DataFrame).
The first stage is to collect results of this function called on each
DataFrame from src:
result = [ findRows(df) for df in src ]
And the last part is to filter out elements which are None:
result = list(filter(None.__ne__, result))
To see the result, run:
for df in result:
print(df)
For src containing first 3 of your DataFrames, I got:
seq1_leftend seq1_rightend
11 977715 985708
12 -985709 -990725
13 991992 1042023
seq2_leftend seq2_rightend
3 364450 504968
4 -504969 -515995
5 515996 671291
6 -671295 -682263
7 682264 707010
8 -707011 -709780
9 709781 934501
10 973791 1015417
11 -961703 -973790
12 948955 961702
14 1069977 1300633
15 -1300634 -1301616
16 1301617 1344821
17 -1515463 -1596433
18 1514459 1515462
19 -1508094 -1514458
20 1346999 1361467
21 -1361468 -1367472
22 1369840 1508093
As you can see, the resulting list contains only results
originating from the second and third source DataFrame.
The first was filtered out, since findRows returned
None from its processing.
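If your input is actually a single text file laid out in blocks as shown in the question, here is a sketch of how src could be built (the file name coords.txt and the header detection are assumptions):

import pandas as pd

src = []
block, cols = [], None
with open('coords.txt') as f:              # hypothetical file name
    for line in f:
        fields = line.split()
        if not fields:
            continue
        if fields[0].endswith('leftend'):  # a new block header
            if block:
                src.append(pd.DataFrame(block, columns=cols))
            cols, block = fields, []
        else:
            # drop the leading row index, keep the two coordinates
            block.append([int(fields[1]), int(fields[2])])
if block:
    src.append(pd.DataFrame(block, columns=cols))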

Is there a way to rank some items in a pandas dataframe and exclude others?

I have a pandas dataframe called ranks with my clusters and their key metrics. I rank them using rank(); however, there are two specific clusters which I want ranked differently from the others.
ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
'3', '4', '5','6', '7', '8', '9'],
'No. Customers': [145118,
2,
1236,
219847,
9837,
64865,
3855,
219549,
34171,
3924120],
'Ave. Recency': [39.0197,
47.0,
15.9716,
41.9736,
23.9330,
24.8281,
26.5647,
17.7493,
23.5205,
24.7933],
'Ave. Frequency': [1.7264,
19.0,
24.9101,
3.0682,
3.2735,
1.8599,
3.9304,
3.3356,
9.1703,
1.1684],
'Ave. Monetary': [14971.85,
237270.00,
126992.79,
17701.64,
172642.35,
13159.21,
54333.56,
17570.67,
42136.68,
4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
I then apply the rank() method like this:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
Which gives me this:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
This does what it's supposed to do; however, the cluster with the highest Ave. Spend needs to be ranked 1 at all times, and the cluster with the highest Ave. Recency needs to be ranked last at all times.
So I modified the code above to look like this:
if(ranks['s_rank'].min() == 1):
    ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
    ranks['overall_rank_2'] = len(ranks)
else:
    ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
    ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
    ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
    ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
    ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
    ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
    ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
Please help me modify the above if statement, or perhaps recommend a different approach altogether. This of course needs to be as dynamic as possible.
So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.
The solution is five lines. Notes:
You had the right idea with DataFrame.drop(), just use idxmax() to get the index of both of the rows that will need special treatment, and store it, so you don't need a huge unwieldy logical filter expression in your drop.
No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
... via a .sum(axis=1) on your desired columns, no need to define a lambda, or save its output in the temp column 'overall'.
...then we just feed those sum-of-ranks into rank(), which will give us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this).
And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
(Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)
Code: (see the bottom for the boilerplate to read in your dataframe; next time please make your example an MCVE, to help us help you)
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
ranks['overall_rank'] = 1 + ranks.drop(index = [ismax,irmax]) [['r_rank','f_rank','m_rank','s_rank']].sum(axis=1).rank(method='first')
# (Note: in .loc[], can't mix indices (ismax) with column-names)
ranks.loc[ ranks['Ave. Spend'].idxmax(), 'overall_rank' ] = 1
ranks.loc[ ranks['Ave. Recency'].idxmax(), 'overall_rank' ] = len(ranks)
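As a quick sanity check after running the above (a usage sketch, not part of the five lines), you can confirm the forced placements:

assert ranks.loc[ranks['Ave. Spend'].idxmax(), 'overall_rank'] == 1
assert ranks.loc[ranks['Ave. Recency'].idxmax(), 'overall_rank'] == len(ranks)
print(ranks.sort_values('overall_rank')[['Cluster', 'overall_rank']])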
And here's the boilerplate to ingest your data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
    "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))

Permute the last column only using python

I have tried:
import numpy as np
import pandas as pd
df1 = pd.read_csv('Trans_ZS_Control_64')
df = df1.apply(np.random.permutation)
This permuted the entire data, but I want to permute the values of the last column only, up to 100 times. How do I proceed with this?
Input Data
0.051424 0.535067 0.453645 0.161857 -0.017189 -0.001850 0.481861 0.711553 0.083747 0.583215 ... 0.541249 0.048360 0.370659 0.890987 0.723995 -0.014502 1.295998 0.150719 0.885673 1
-0.067129 0.673519 0.212407 0.195590 -0.034868 -0.231418 0.480255 0.643735 -0.054970 0.511684 ... 0.524751 0.206757 0.578314 0.614924 0.230632 -0.074980 0.747007 0.047382 1.413796 1
-0.994564 -0.881392 -1.150127 -0.589125 -0.663275 -0.955622 -1.088923 -1.210452 -0.922861 -0.689851 ... -0.442188 -1.294110 -0.934985 -1.085506 -0.808874 -0.779111 -1.032484 -1.026208 -0.248476 1
-0.856323 -0.619472 -1.113073 -0.691285 -0.515566 -1.080643 -0.513487 -0.912825 -1.010245 -0.870335 ... -0.941149 -1.012917 -1.647812 -0.654150 -0.735166 -0.984510 -0.949168 -1.052115 -0.052492 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
-0.145871 0.832727 -0.003379 0.327546 1.409891 0.840316 0.700613 0.184477 0.962488 0.200397 ... -0.337530 0.988197 0.751663 0.480126 0.663302 -0.522189 0.512744 -0.063515 1.125415 0
0.972923 0.857971 -0.195672 0.190443 1.652155 0.763571 0.604728 0.115846 0.942269 0.453387 ... -0.522834 0.985770 0.570573 0.438632 0.737030 -0.445704 0.387023 0.031686 1.266407 0
0.281427 1.060266 0.172624 0.258344 1.544505 0.859777 0.689876 0.439106 0.955198 0.335523 ... -0.442724 0.929343 0.707809 0.290670 0.688595 -0.438848 0.762695 -0.105879 0.944989 0
0.096601 1.112720 0.105861 -0.133927 1.526764 0.773759 0.661673 -0.007070 0.884725 0.478899 ... -0.404426 0.966646 0.994733 0.418965 0.862612 -0.174580 0.407309 -0.010520 1.044876 0
-0.298780 1.036580 0.131270 0.019826 1.381928 0.879310 0.619529 -0.022691 0.982060 -0.039355 ... -0.702316 0.985320 0.457767 0.215949 0.752685 -0.405060 0.166226 -0.216972 1.021018 0
Expected output: randomly permute the last column
0.051424 0.535067 0.453645 0.161857 -0.017189 -0.001850 0.481861 0.711553 0.083747 0.583215 ... 0.541249 0.048360 0.370659 0.890987 0.723995 -0.014502 1.295998 0.150719 0.885673 0
-0.067129 0.673519 0.212407 0.195590 -0.034868 -0.231418 0.480255 0.643735 -0.054970 0.511684 ... 0.524751 0.206757 0.578314 0.614924 0.230632 -0.074980 0.747007 0.047382 1.413796 0
-0.994564 -0.881392 -1.150127 -0.589125 -0.663275 -0.955622 -1.088923 -1.210452 -0.922861 -0.689851 ... -0.442188 -1.294110 -0.934985 -1.085506 -0.808874 -0.779111 -1.032484 -1.026208 -0.248476 1
-0.856323 -0.619472 -1.113073 -0.691285 -0.515566 -1.080643 -0.513487 -0.912825 -1.010245 -0.870335 ... -0.941149 -1.012917 -1.647812 -0.654150 -0.735166 -0.984510 -0.949168 -1.052115 -0.052492 1
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
-0.145871 0.832727 -0.003379 0.327546 1.409891 0.840316 0.700613 0.184477 0.962488 0.200397 ... -0.337530 0.988197 0.751663 0.480126 0.663302 -0.522189 0.512744 -0.063515 1.125415 0
0.972923 0.857971 -0.195672 0.190443 1.652155 0.763571 0.604728 0.115846 0.942269 0.453387 ... -0.522834 0.985770 0.570573 0.438632 0.737030 -0.445704 0.387023 0.031686 1.266407 1
0.281427 1.060266 0.172624 0.258344 1.544505 0.859777 0.689876 0.439106 0.955198 0.335523 ... -0.442724 0.929343 0.707809 0.290670 0.688595 -0.438848 0.762695 -0.105879 0.944989 0
0.096601 1.112720 0.105861 -0.133927 1.526764 0.773759 0.661673 -0.007070 0.884725 0.478899 ... -0.404426 0.966646 0.994733 0.418965 0.862612 -0.174580 0.407309 -0.010520 1.044876 0
-0.298780 1.036580 0.131270 0.019826 1.381928 0.879310 0.619529 -0.022691 0.982060 -0.039355 ... -0.702316 0.985320 0.457767 0.215949 0.752685 -0.405060 0.166226 -0.216972 1.021018 1
Not sure if this is what you meant, but you could do it like this:
import pandas as pd
import numpy as np
v = np.arange(0,10)
df = pd.DataFrame({'c1': v, 'c2': v, 'c3': v})
df
this would create the following df:
c1 c2 c3
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
to permute the last column you could run this:
df1 = df.copy()
df1.c3 = np.random.permutation(df1.c3)
df1
resulting in:
c1 c2 c3
0 0 0 5
1 1 1 9
2 2 2 2
3 3 3 6
4 4 4 0
5 5 5 4
6 6 6 8
7 7 7 7
8 8 8 1
9 9 9 3
I hope it helps
Just create a dataframe from your last column and permute that. It seems like permuting individual columns with apply doesn't work the way you expect it to.
import numpy as np
import pandas as pd
df = pd.read_csv('Trans_ZS_Control_64')
column_to_change = pd.DataFrame(df['last_column_name'])
for i in range(100):
    column_to_change = column_to_change.apply(np.random.permutation)
df['last_column_name'] = column_to_change
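If "up to 100 times" means you want 100 independently permuted copies (shuffling one column 100 times in place is statistically the same as shuffling it once), here is a sketch of that interpretation:

import numpy as np

rng = np.random.default_rng()
permuted_copies = []
for _ in range(100):
    out = df.copy()
    # shuffle only the last column; all other columns stay untouched
    out.iloc[:, -1] = rng.permutation(out.iloc[:, -1].to_numpy())
    permuted_copies.append(out)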
