Get 25 quantile in cumsum pandas - python

Suppose I have the following DataFrame:
df = pd.DataFrame({'id': [2, 4, 10, 12, 13, 14, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31, 42, 50, 54],
'value': [37410.0, 18400.0, 200000.0, 392000.0, 108000.0, 423000.0, 80000.0, 307950.0,
50807.0, 201740.0, 182700.0, 131300.0, 282005.0, 428800.0, 56000.0, 412400.0, 1091595.0, 1237200.0,
927500.0]})
And I do the following:
df.sort_values(by='id').set_index('id').cumsum()
value
id
2 37410.0
4 55810.0
10 255810.0
12 647810.0
13 755810.0
14 1178810.0
19 1258810.0
20 1566760.0
21 1617567.0
22 1819307.0
24 2002007.0
25 2133307.0
27 2415312.0
29 2844112.0
30 2900112.0
31 3312512.0
42 4404107.0
50 5641307.0
54 6568807.0
I want to know the first id whose cumulative sum exceeds 25% of the total. In this example, 25% of the total would be 1,642,201.75, and the first element to exceed that would be 22. I know it can be done with a for loop, but I think that would be pretty inefficient.

You could do:
percentile_25 = df['value'].sum() * 0.25
res = df[df['value'].cumsum() > percentile_25].head(1)
print(res)
Output
id value
9 22 201740.0
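Or use searchsorted to do the search itself in O(log N) (computing the cumsum is still O(N), and the values must be non-negative so that the cumsum is sorted):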
percentile_25 = df['value'].sum() * 0.25
i = df['value'].cumsum().searchsorted(percentile_25)
res = df.iloc[i]
print(res)
Output
id 22.0
value 201740.0
Name: 9, dtype: float64
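A minimal sketch of the same idea that sorts by id first, as in the question, and returns the id directly. It assumes all values are non-negative, so the cumulative sum is non-decreasing and can be binary-searched. The names ordered, cum, threshold and pos are only for illustration. Using side='right' gives the first position strictly greater than the threshold; the default side='left' would also accept a cumulative value exactly equal to it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [2, 4, 10, 12, 13, 14, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31, 42, 50, 54],
    'value': [37410.0, 18400.0, 200000.0, 392000.0, 108000.0, 423000.0, 80000.0,
              307950.0, 50807.0, 201740.0, 182700.0, 131300.0, 282005.0, 428800.0,
              56000.0, 412400.0, 1091595.0, 1237200.0, 927500.0],
})

# Sort by id so the cumulative sum follows the order used in the question.
ordered = df.sort_values('id')
cum = ordered['value'].cumsum().to_numpy()

# 25% of the total (the last cumulative value is the total).
threshold = 0.25 * cum[-1]

# First position whose cumulative sum is strictly greater than the threshold.
pos = np.searchsorted(cum, threshold, side='right')
first_id = ordered['id'].iloc[pos]
print(first_id)
```

If the threshold could equal the total (for example with a fraction of 1.0), pos would be len(cum) and the lookup would need a bounds check.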

Related

Python: How to split dataframe with datetime index by number of observations?

data = {
'aapl': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
'aal' : [33, 33, 33, 32, 31, 30, 34, 29, 27, 26],
}
data = pd.DataFrame(data)
data.index = pd.date_range('2011-01-01', '2011-01-10')
n_obs = len(data) * 0.3
train, test = data[:n_obs], data[n_obs:]
>>> TypeError: cannot do slice indexing on DatetimeIndex with these indexers [3.0] of type float
I can probably slice the dataframe by date like df[ : '2011-01-05' ], but I want to split the data by number of observations, which fails with the approach above.
You need an integer for slicing:
n_obs = int(len(data) * 0.3)
train, test = data[:n_obs], data[n_obs:]
output:
# train
aapl aal
2011-01-01 11 33
2011-01-02 12 33
2011-01-03 13 33
# test
aapl aal
2011-01-04 14 32
2011-01-05 15 31
2011-01-06 16 30
2011-01-07 17 34
2011-01-08 18 29
2011-01-09 19 27
2011-01-10 20 26
If you want to train/test a model you might be interested in getting a random sample:
test = data.sample(frac=0.3)
train = data.loc[data.index.difference(test.index)]
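Both variants can be sketched together as follows. This is only an illustration: it uses iloc to make the positional slicing explicit, and random_state=0 is an arbitrary seed added here so the random split is repeatable. The names train_seq, test_seq, train_rand and test_rand are hypothetical.

```python
import pandas as pd

data = pd.DataFrame(
    {
        'aapl': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        'aal': [33, 33, 33, 32, 31, 30, 34, 29, 27, 26],
    },
    index=pd.date_range('2011-01-01', '2011-01-10'),
)

# Positional split: the first 30% of the rows, then the remaining rows.
n_obs = int(len(data) * 0.3)
train_seq, test_seq = data.iloc[:n_obs], data.iloc[n_obs:]

# Random split: sample 30% of the rows, keep the rest.
test_rand = data.sample(frac=0.3, random_state=0)
train_rand = data.drop(test_rand.index)

print(len(train_seq), len(test_seq), len(train_rand), len(test_rand))
```

Note that int() truncates, so with other lengths or fractions the cut point rounds down.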

Get n * k unique sets of 2 from list of length n in Python

I have the following Python brainteaser: We arrange a 30-day programme with 48 participants. Every day in the programme, participants are paired in twos. Participants cannot have the same partners twice and all participants have to be partnered up every day. P.S. I hope my math is right in the title.
I've managed an implementation but it feels very clunky. Is there an efficient way to do this? Perhaps using the cartesian product somehow? All feedback and tips are much appreciated.
# list of people: 48
# list of days: 30
# each day, the people need to be split into pairs of two.
# the same pair cannot occur twice
import random
from collections import Counter
class person ():
    def __init__ (self, id):
        self.id = id
class schedule ():
    def __init__ (self, days):
        self.people_list = []
        self.days = days
        self.placed_people = []
        self.sets = []
    def create_people_list(self, rangex):
        for id in range(rangex):
            new_person = person(id)
            self.people_list.append(new_person)
        print(f"{len(self.people_list)} people and {self.days} days will be considered.")
    def assign_pairs(self):
        for day in range(self.days): # for each of the 30 days..
            print("-" * 80)
            print(f"DAY {day + 1}")
            self.placed_people = [] # we set a new list to contain ids of placed people
            while Counter([pers.id for pers in self.people_list]) != Counter(self.placed_people):
                pool = list( set([pers.id for pers in self.people_list]) - set(self.placed_people))
                # print(pool)
                person_id = random.choice(pool) # pick random person
                person2_id = random.choice(pool) # pick random person
                if person_id == person2_id: continue
                if not set([person_id, person2_id]) in self.sets or len(pool) == 2:
                    if len(pool) == 2: person_id, person2_id = pool[0], pool[1]
                    self.sets.append(set([person_id, person2_id]) )
                    self.placed_people.append(person_id)
                    self.placed_people.append(person2_id)
                    print(f"{person_id} {person2_id}, ", end="")
schdl = schedule(30) # initiate schedule with 30 days
schdl.create_people_list(48)
schdl.assign_pairs()
Outputs:
48 people and 30 days will be considered.
--------------------------------------------------------------------------------
DAY 1
37 40, 34 4, 1 46, 13 39, 12 35, 18 33, 25 24, 23 31, 17 42, 32 19, 36 0, 11 9, 7 45, 10 21, 44 43, 29 41, 38 16, 15 22, 2 20, 26 47, 30 28, 3 8, 6 27, 5 14,
--------------------------------------------------------------------------------
DAY 2
42 28, 25 15, 6 17, 2 14, 7 40, 11 4, 22 37, 33 20, 0 16, 3 39, 19 47, 46 24, 12 27, 26 1, 34 10, 45 8, 23 13, 32 41, 9 29, 44 31, 30 5, 38 18, 43 21, 35 36,
--------------------------------------------------------------------------------
DAY 3
8 28, 33 12, 40 26, 5 35, 13 31, 29 43, 44 21, 11 30, 1 7, 34 2, 47 45, 46 17, 4 23, 32 15, 14 22, 36 42, 16 41, 37 19, 38 3, 20 6, 10 0, 24 9, 27 25, 18 39,
--------------------------------------------------------------------------------
[...]
--------------------------------------------------------------------------------
DAY 29
4 18, 38 28, 24 22, 23 33, 9 41, 40 20, 26 39, 2 42, 15 10, 12 21, 11 45, 46 7, 35 27, 29 36, 3 31, 19 6, 47 32, 25 43, 13 44, 1 37, 14 0, 16 17, 30 34, 8 5,
--------------------------------------------------------------------------------
DAY 30
17 31, 25 7, 6 10, 35 9, 41 4, 16 40, 47 43, 39 36, 19 44, 23 11, 13 29, 21 46, 32 34, 12 5, 26 14, 15 0, 28 24, 2 37, 8 22, 27 38, 45 18, 3 20, 1 33, 42 30,
Thanks for your time! Also, a follow up question: How can I calculate whether it is possible to solve the task, i.e. to arrange all the participants in unique pairs every day?
Round-robin tournaments in real life
Round-robin tournaments are extremely easy to organize. In fact, the algorithm is so simple that you can organize a round-robin tournament between humans without any paper or computer, just by giving the humans simple instructions.
You have an even number N = 48 humans to pair up. Imagine you have a long table with N // 2 seats on one side, facing N // 2 seats on the other side. Ask all the humans to sit at that table.
This is your first pairing.
Call one of the seats "seat number 1".
To move to the next pairing: the human at seat number 1 doesn't move. Every other human moves clockwise one seat around the table.
Current pairing
1 2 3 4
8 7 6 5
Next pairing
1 8 2 3
7 6 5 4
Round-robin tournaments in python
# a table is a simple list of humans
def next_table(table):
    return [table[0]] + [table[-1]] + table[1:-1]
# [0 1 2 3 4 5 6 7] -> [0 7 1 2 3 4 5 6]
# a pairing is a list of pairs of humans
def pairing_from_table(table):
    return list(zip(table[:len(table)//2], table[-1:len(table)//2-1:-1]))
# [0 1 2 3 4 5 6 7] -> [(0,7), (1,6), (2,5), (3,4)]
# a human is an int
def get_programme(programme_length, number_participants):
    table = list(range(number_participants))
    pairing_list = []
    for day in range(programme_length):
        pairing_list.append(pairing_from_table(table))
        table = next_table(table)
    return pairing_list
print(get_programme(3, 8))
# [[(0, 7), (1, 6), (2, 5), (3, 4)],
# [(0, 6), (7, 5), (1, 4), (2, 3)],
# [(0, 5), (6, 4), (7, 3), (1, 2)]]
print(get_programme(30, 48))
If you want the humans to be custom objects instead of ints, you can replace the second argument number_participants by the list table directly; then the user can supply a list of whatever they want:
def get_programme(programme_length, table):
    pairing_list = []
    for day in range(programme_length):
        pairing_list.append(pairing_from_table(table))
        table = next_table(table)
    return pairing_list
print(get_programme(3, ['Alice', 'Boubakar', 'Chen', 'Damian']))
# [[('Alice', 'Damian'), ('Boubakar', 'Chen')],
# [('Alice', 'Chen'), ('Damian', 'Boubakar')],
# [('Alice', 'Boubakar'), ('Chen', 'Damian')]]
Follow-up question: when does there exist a solution?
If there are N humans, each human can be paired with N-1 different humans. If N is even, then the round-robin circle-method will make sure that the first N-1 rounds are correct. After that, the algorithm is periodic: the Nth round will be identical to the first round.
Thus there is a solution if and only if programme_length < number_participants and the number of participants is even; and the round-robin algorithm will find a solution in that case.
If the number of participants is odd, then every day of the programme, there must be at least one human who is not paired. The round-robin tournament can still be applied in this case: add one extra "dummy" human (usually called bye-player). The dummy human behaves exactly like a normal human for the purposes of the algorithm. Every round, one different real human will be paired with the dummy human, meaning they are not paired with a real human this round. With this method, all you need is programme_length <= number_participants.
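The existence claims above can be checked empirically with a small sketch. The helper has_repeated_pair and the use of None as the dummy (bye) participant are assumptions made for this illustration, not part of the code above; next_table and pairing_from_table follow the same logic as before.

```python
from itertools import chain

def next_table(table):
    # Seat 0 stays put; everyone else rotates by one seat.
    return [table[0]] + [table[-1]] + table[1:-1]

def pairing_from_table(table):
    # Pair each seat on one side with the seat facing it.
    half = len(table) // 2
    return list(zip(table[:half], table[-1:half - 1:-1]))

def get_programme(programme_length, participants):
    table = list(participants)
    if len(table) % 2 == 1:
        table.append(None)  # dummy "bye" participant for odd counts
    programme = []
    for _ in range(programme_length):
        programme.append(pairing_from_table(table))
        table = next_table(table)
    return programme

def has_repeated_pair(programme):
    # True if any unordered pair occurs on more than one day.
    seen = set()
    for pair in chain.from_iterable(programme):
        key = frozenset(pair)
        if key in seen:
            return True
        seen.add(key)
    return False

print(has_repeated_pair(get_programme(30, range(48))))  # 30 days, 48 people
print(has_repeated_pair(get_programme(48, range(48))))  # one day too many
```

With 48 participants the schedule stays repeat-free up to 47 days; with an odd count such as 5, the dummy allows 5 repeat-free days.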

Can you explain the output: diff.sort_values(ascending=False).index.astype

Can anyone explain the following statement.
list(diff.sort_values(ascending=False).index.astype(int)[0:5])
Output: Int64Index([24, 26, 17, 2, 1], dtype='int64')
It sorts first, but what is the index doing and how do i get 24, 26, 17, 2 ,1 ??
diff is series
ipdb> diff
1 0.017647
2 0.311765
3 -0.060000
4 -0.120000
5 -0.040000
6 -0.120000
7 -0.190000
8 -0.200000
9 -0.100000
10 -0.011176
11 -0.130000
12 0.008824
13 -0.060000
14 -0.090000
15 -0.060000
16 0.008824
17 0.341765
18 -0.140000
19 -0.050000
20 -0.060000
21 -0.040000
22 -0.210000
23 0.008824
24 0.585882
25 -0.060000
26 0.555882
27 -0.031176
28 -0.060000
29 -0.170000
30 -0.220000
31 -0.170000
32 -0.040000
dtype: float64
Your code returns a list of the index values of the top 5 values of the Series, sorted in descending order.
The first 'column' printed for a pandas Series is called the index, so after sorting, your code converts the index values to integers and takes the first five by slicing.
print (diff.sort_values(ascending=False))
24 0.585882
26 0.555882
17 0.341765
2 0.311765
1 0.017647
12 0.008824
23 0.008824
16 0.008824
10 -0.011176
27 -0.031176
32 -0.040000
21 -0.040000
5 -0.040000
19 -0.050000
15 -0.060000
3 -0.060000
13 -0.060000
25 -0.060000
28 -0.060000
20 -0.060000
14 -0.090000
9 -0.100000
6 -0.120000
4 -0.120000
11 -0.130000
18 -0.140000
31 -0.170000
29 -0.170000
7 -0.190000
8 -0.200000
22 -0.210000
30 -0.220000
Name: a, dtype: float64
print (diff.sort_values(ascending=False).index.astype(int))
Int64Index([24, 26, 17, 2, 1, 12, 23, 16, 10, 27, 32, 21, 5, 19, 15, 3, 13,
25, 28, 20, 14, 9, 6, 4, 11, 18, 31, 29, 7, 8, 22, 30],
dtype='int64')
print (diff.sort_values(ascending=False).index.astype(int)[0:5])
Int64Index([24, 26, 17, 2, 1], dtype='int64')
print (list(diff.sort_values(ascending=False).index.astype(int)[0:5]))
[24, 26, 17, 2, 1]
Here's what's happening:
diff.sort_values(ascending=False) sorts the Series by value. ascending defaults to True, but you pass False, so it returns the Series sorted in descending order.
.index returns the row labels of that sorted Series (the numbers 1 to 32 in your case, reordered along with the values).
.astype(int) casts the index labels to integers.
[0:5] picks positions 0 through 4, i.e. the first five labels (the end of the slice is exclusive).
Let me know if this helps!
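The chain can be tried on a shortened Series. This is a sketch using six of the values from the question rather than all 32; nlargest is shown as an alternative that should give the same labels here.

```python
import pandas as pd

# A subset of the question's data: labels are the index, floats are the values.
diff = pd.Series(
    [0.017647, 0.311765, -0.060000, 0.341765, 0.585882, 0.555882],
    index=[1, 2, 3, 17, 24, 26],
)

# Sort by value (descending), take the labels, cast them, keep the first five.
top5 = list(diff.sort_values(ascending=False).index.astype(int)[0:5])

# Alternative: let pandas pick the five largest values directly.
top5_alt = diff.nlargest(5).index.tolist()

print(top5)
print(top5_alt)
```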

Slicing a list in sublists according to unknown indices

What is the best method to slice a list (here: lst_num) into (more than two) parts of variable length according to another list containing the indices?
A string of numbers has to be split into sublists that contain the numbers standing between all subsequent occurrences of a certain number. E.g.: "30 24 17 30 22 1 67 2 4 3 30 24 95 34 29 56 30 43 24" and "30" yields: [24, 17], [22, 1, 67, 2, 4, 3] and [24, 95, 34, 29, 56]
str_num="30 24 17 30 22 1 67 2 4 3 30 24 95 34 29 56 30 43 24"
lst_num=[int(x) for x in str_num.split()]
idx=[i for i, x in enumerate(lst_num) if x==30]
for i in idx: ???
To slice the list, the first argument should be "i+1", but how to obtain the subsequent index from idx as stop index? Is there a way to give each sublist a unique name in the iteration?
One little step to go:
>>> [lst_num[start+1:end] for start, end in zip(idx, idx[1:])]
[[24, 17], [22, 1, 67, 2, 4, 3], [24, 95, 34, 29, 56]]
Just zip the indexes into pairs of slice boundaries.
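Put together as a runnable sketch. For the question about unique names, one common option is a dictionary keyed by a generated name; the part_1, part_2, ... naming scheme here is purely hypothetical.

```python
str_num = "30 24 17 30 22 1 67 2 4 3 30 24 95 34 29 56 30 43 24"
lst_num = [int(x) for x in str_num.split()]

# Positions of every occurrence of the separator value.
idx = [i for i, x in enumerate(lst_num) if x == 30]

# Pair each separator position with the next one and slice between them.
parts = [lst_num[start + 1:end] for start, end in zip(idx, idx[1:])]

# Give each sublist a name by storing it in a dict.
named = {f"part_{n}": part for n, part in enumerate(parts, start=1)}

print(parts)
print(named["part_2"])
```

Anything after the last separator (here 43 24) is dropped, which matches the expected output in the question.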

Pandas dot product ValueError

I am trying to calculate the dot product of a data frame and a series, but I am getting ValueError: matrices are not aligned and I do not really understand why. I get
if (len(common) > len(self.columns) or len(common) > len(other.index)):
raise ValueError('matrices are not aligned')
with the error message, which I totally understand. But when I check my series, it has 25 values:
weights
Out[193]:
0 0.000002
1 0.000577
2 0.002480
3 0.004720
4 0.003640
5 0.001480
6 0.000054
7 0.000022
8 0.009060
9 0.000511
10 0.034900
11 0.140000
12 0.065600
13 0.325000
14 0.072900
15 0.031100
16 0.209000
17 0.003280
18 0.001390
19 0.002100
20 0.000847
21 0.009560
22 0.006320
23 0.014000
24 0.061900
Name: 3, dtype: float64
And when I check my data frame, it also has 25 columns:
In [195]: data
Out[195]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 0 to 130
Data columns (total 25 columns):
(etc)
So I don't understand why I get the error message. What am I missing here?
Some additional information:
I am using weightedave=data.dot(weights)
And I just figured out in the dot code that it does common = data.columns.union(weights.index) to get the common referred to in the error message. So I tested that, but in my case that becomes
In[220]: common
Out[220]: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, u'100_AVET', u'101_AVET', u'102_AVET', u'13_AVET', u'14_AVET', u'15_AVET', u'18_AVET', u'19_AVET', u'20_AVET', u'22_AVET', u'36_AVET', u'62_AVET', u'74_AVET', u'78_AVET', u'79_AVET', u'80_AVET', u'83_AVET', u'85_AVET', u'86_AVET', u'88_AVET', u'94_AVET', u'95_AVET', u'96_AVET', u'97_AVET', u'99_AVET'], dtype=object)
Which indeed is longer (50) than my number of columns/indices (25). Should I rename either my series or the columns in my data frame?
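The label mismatch can be reproduced in isolation with a small sketch. The three-column frame and the numbers below are invented for illustration (only the column names are borrowed from the question), and it assumes the weights are already in the same order as the columns, which the question does not confirm. Under that assumption, either relabelling the Series with the frame's columns or passing a plain array avoids the label alignment.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: string column labels.
data = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    columns=['13_AVET', '14_AVET', '15_AVET'],
)
# Weights with the default integer index 0, 1, 2.
weights = pd.Series([0.2, 0.3, 0.5])

# The labels do not match, so the union has 6 entries and dot() refuses.
error = None
try:
    data.dot(weights)
except ValueError as exc:
    error = str(exc)

# Option 1: relabel the weights with the frame's column names.
aligned = pd.Series(weights.to_numpy(), index=data.columns)
by_label = data.dot(aligned)

# Option 2: pass a plain array, which is multiplied purely by position.
by_position = data.dot(weights.to_numpy())

print(error)
print(by_label.tolist(), by_position.tolist())
```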
