I have a dataset that looks like the following:
Subject | Session | Trial | Choice
--------+---------+-------+-------
   1    |    1    |   1   |   A
   1    |    1    |   2   |   B
   1    |    1    |   3   |   B
   1    |    1    |   4   |   B
   1    |    1    |   5   |   B
   1    |    1    |   6   |   A
   2    |    1    |   1   |   A
   2    |    1    |   2   |   A
I would like to use a Python script to generate the following table:
Subject | Session | streak_count
--------+---------+-------------
   1    |    1    |      3
   2    |    1    |      1
Here streak_count is the total number of choice streaks made by a given subject during a given session, where a streak is any run of one or more consecutive choices of the same item.
I've tried using some of the suggestions to similar questions here, but I'm having trouble figuring out how to count these instances, rather than measure their length, etc., which seem to be more common queries.
def count():
    choices = []
    streak = 0
    session = 1
    subject = input("What is your subject? ")
    trials = int(input("How many trials do you wish to do? "))
    for trial in range(1, trials + 1):
        choice = input("What was the choice? ")
        print(subject, trial, choice)
        # A new streak begins on the first trial or whenever the choice changes.
        if not choices or choice != choices[-1]:
            streak += 1
        choices.append(choice)
    print(subject, session, streak)
This may be what you want: it asks for your subject and how many trials you wish to do, and it adds one to the streak counter on the first trial and every time the choice changes, so each run of identical choices is counted exactly once.
I think this is what you are asking for:
import itertools
data = [
    [1, 1, 1, 'A'],
    [1, 1, 2, 'B'],
    [1, 1, 3, 'B'],
    [1, 1, 4, 'B'],
    [1, 1, 5, 'B'],
    [1, 1, 6, 'A'],
    [2, 1, 1, 'A'],
    [2, 1, 2, 'A'],
    [2, 1, 3, 'A']
]
grouped = itertools.groupby(data, lambda x: x[0])  # group consecutive rows by Subject
results = dict()
this, last = None, None
for key, group in grouped:
    results[key] = 0
    for c, d in enumerate(group):
        this = d
        # a new streak starts on the first row of the group or when the Choice changes
        streak = c == 0 or this[3] != last[3]
        if streak:
            results[key] += 1
        last = this
print(results)
This yields:
{1: 3, 2: 1}
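One note on the grouping key: if a subject can have more than one session, the same approach works by grouping on the (Subject, Session) pair; itertools.groupby only groups consecutive rows, so the data must be ordered by that key, as it is here.
grouped = itertools.groupby(data, lambda x: (x[0], x[1]))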
I have the following dataframe:
import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
I want to identify similar names in the name column if those names belong to the same cluster number, and create a unique id for them. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate them with a unique id, say 1. The next cluster is number 2, and three entities from the name column belong to this cluster: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.
I created a code for the logic above:
# pip install thefuzz
from thefuzz import fuzz

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)

df_test['id'] = 0
i = 1
is_i_used = False
for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():
        if row['cluster_number'] == row_['cluster_number'] and row_['id'] == 0:
            if fuzz.ratio(row['name'], row_['name']) > 50:
                df_test.loc[index_, 'id'] = int(i)
                is_i_used = True
    if is_i_used == True:
        i += 1
        is_i_used = False
Code generates expected result:
          name  cluster_number  id
0  South Beach               1   1
1          Dog               2   2
2         Bird               3   3
3          Ant               3   4
4      Big Dog               2   2
5        Beach               1   1
6         Dear               4   5
7          Cat               2   6
Note, for Cat we got id as 6 but it is fine because it is unique anyway.
While the algorithm above works for test data, I am not able to use it for the real data that I have (about 1 million rows), so I am trying to understand how to vectorize the code and get rid of the two for-loops.
Also, the thefuzz module has a process function that allows processing the data at once:
from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))
But I don't see how it can help with speeding up the code.
tl;dr: Avoid O(N^2) running time if N is big.
> help with speeding up the code
People get down on .iterrows(), calling it "slow". Switching from .iterrows to a vectorized approach might "speed things up" somewhat, but that's a relative measure. Let's talk about complexity.
time complexity
Your current algorithm is quadratic; it features a pair of nested .iterrows loops. But then immediately we filter on
if same_cluster and not_yet_assigned:
Now, that could be workable for "small" N. But an N of 400K quickly becomes infeasible:
>>> 419_776 ** 2 / 1e9
176.211890176
One hundred seventy-six billion iterations (with a "B") is nothing to sneeze at, even if each filter step has trivial (yet non-zero) cost.
At the risk of reciting facts that have tediously been repeated many times before: sorting costs O(N log N), and N log N is very significantly less than N^2.
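To put rough numbers on that, with the same N:
import math

n = 419_776
print(n * math.log2(n))   # roughly 7.8 million steps for an N log N sort
print(n ** 2)             # 176_211_890_176 pairwise checks for the quadratic scan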
I'm not convinced that what you want is to "go fast". Rather, I suspect what you really want is to "do less". Start by ordering your rows, and then make a roughly linear pass over that dataset.
You didn't specify your typical cluster group size G. But since there are many distinct cluster numbers, we definitely know that G << N. We can bring the complexity down from O(N^2) to O(N × G^2).
df = df_test.sort_values(['cluster_number', 'name'])
You wrote
for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():
for index, row in df.iterrows():
while ...
and use .iloc() to examine relevant rows.
The while loop gets to terminate as soon
as a new cluster number is seen, instead
of every time having to slog through hundreds of thousands
of rows until end-of-dataframe is seen.
Why can it exit early?
Due to the sort order.
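For concreteness, a rough sketch of that early-exit scan (untested; it assumes the sorted df from above with a fresh 0..N-1 index and an id column initialised to 0, and uses .at for scalar access):
from thefuzz import fuzz

df = df.reset_index(drop=True)
df["id"] = 0
next_id = 1
for i, row in df.iterrows():
    if df.at[i, "id"] != 0:
        continue
    j = i
    while j < len(df) and df.at[j, "cluster_number"] == row["cluster_number"]:
        if df.at[j, "id"] == 0 and fuzz.ratio(row["name"], df.at[j, "name"]) > 50:
            df.at[j, "id"] = next_id
        j += 1   # stops at the first row of the next cluster
    next_id += 1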
A more convenient way to structure this might be to write a clustering helper.
def get_clusters(df):
    cur_num = -1
    cluster = []
    for _, row in df.iterrows():
        if row.cluster_number != cur_num and cluster:
            yield cluster
            cluster = []
        cur_num = row.cluster_number
        cluster.append(row)
    if cluster:
        yield cluster   # don't forget to emit the final cluster
Now your top level code can iterate through a bunch of clusters, performing a fuzzy match of cost O(G^2) on each cluster. The invariant on each generated cluster is that all rows within the cluster shall have identical cluster_number. And, due to the sorting, we guarantee that a given cluster_number shall be generated at most once.
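For illustration, a rough sketch of that top level, reusing thefuzz's fuzz.ratio and the 50-point cutoff from your question (untested, just to show the shape):
from thefuzz import fuzz

df["id"] = 0
next_id = 1
for cluster in get_clusters(df):
    for i, row in enumerate(cluster):
        if df.at[row.name, "id"] != 0:    # row.name is this row's index label
            continue
        df.at[row.name, "id"] = next_id
        # compare only against the rest of this (small) cluster: O(G^2)
        for other in cluster[i + 1:]:
            if df.at[other.name, "id"] == 0 and fuzz.ratio(row["name"], other["name"]) > 50:
                df.at[other.name, "id"] = next_id
        next_id += 1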
https://stackoverflow.com/help/self-answer
Please measure the current running time, implement these suggestions, measure again, and post code + timings.
Attempt #1
Based on J_H's suggestions I made some changes to the original code:
import pandas as pd
from thefuzz import fuzz

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0
i = 1
is_i_used = False
for index, row in df_test.iterrows():
    index_ = index
    while index_ < len(df_test) and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number'] and df_test.loc[index_, 'id'] == 0:
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_, 'id'] = i
            is_i_used = True
        index_ += 1
    if is_i_used == True:
        i += 1
        is_i_used = False
Now, instead of hours of computation, it runs in only 210 seconds on a dataframe with 1 million rows, where on average each cluster has about 10 rows and the max cluster size is about 200 rows.
While this is a significant improvement, I am still looking for a vectorized option.
Attempt #2
I created a vectorized version:
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'restaurant_id'}, inplace=True)
groups.restaurant_id += 1
df_test = df_test.merge(groups, how="left")
but it is not possible to use it on a dataframe with 1 million rows, because cdist returns a matrix of len(queries) x len(choices) entries, each of size(dtype) bytes. By default this dtype is float. So for 1 million names, the result matrix would require about 3.6 terabytes of memory.
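For reference, the back-of-the-envelope arithmetic behind that figure (assuming 4-byte float32 scores, which is what the 3.6 TB estimate corresponds to):
n = 1_000_000
bytes_needed = n * n * 4      # one 4-byte score per (query, choice) pair
print(bytes_needed / 2**40)   # ~3.64 tebibytes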
Following up on your own answer: you don't need to compute process.cdist on all the names, you are only interested in those within the same cluster.
To do so, you can iterate over the groups:
import numpy as np
import pandas as pd
from rapidfuzz import process

threshold = 50
index_start = 0
groups = []
for grp_name, grp_df in df_test.groupby("cluster_number"):
    names = grp_df["name"]
    scores = pd.DataFrame(
        data=process.cdist(names, names, workers=-1),
        columns=names,
        index=names,
    )
    x, y = np.where(scores > threshold)
    grps_in_group = (pd.DataFrame(scores.index[x], scores.index[y])
        .groupby(level=0)
        .agg(frozenset)
        .drop_duplicates()
        .reset_index(drop=True)
        .assign(restaurant_id=lambda t: t.index + index_start)
        .explode("name")
    )
    index_start = grps_in_group["restaurant_id"].max() + 1
    groups.append(grps_in_group)

df_test.merge(pd.concat(groups), on="name")
| | name | cluster_number | id | restaurant_id |
|---:|:------------|-----------------:|-----:|----------------:|
| 0 | Beach | 1 | 0 | 0 |
| 1 | South Beach | 1 | 0 | 0 |
| 2 | Big Dog | 2 | 0 | 1 |
| 3 | Cat | 2 | 0 | 2 |
| 4 | Dog | 2 | 0 | 1 |
| 5 | Dry Fish | 2 | 0 | 3 |
| 6 | Fish | 2 | 0 | 3 |
| 7 | Ant | 3 | 0 | 4 |
| 8 | Bird | 3 | 0 | 5 |
| 9 | Dear | 4 | 0 | 6 |
Yet I am not sure this is an improvement.
Now, transforming the loop body into a function, we can use .groupby(...).apply(...); however, we lose track of the consecutive index. To address that, I use a trick with the pandas categorical type:
def create_restaurant_id(
    dframe: pd.DataFrame,
    threshold: int = 50,
) -> pd.DataFrame:
    names = dframe["name"]
    scores = pd.DataFrame(
        data=process.cdist(names, names, workers=-1),
        columns=names,
        index=names,
    )
    x, y = np.where(scores > threshold)
    grps_in_group = (pd.DataFrame(scores.index[x], scores.index[y])
        .groupby(level=0)
        .agg(frozenset)
        .drop_duplicates()
        .reset_index(drop=True)
        .assign(restaurant_id=lambda t: t.index)
        .explode("name")
    )
    return grps_in_group
(df_test
 .groupby("cluster_number")
 .apply(create_restaurant_id)
 .reset_index(level=0)
 .assign(restaurant_id=lambda t: (
     t["cluster_number"].astype(str) + t["restaurant_id"].astype(str)
 ).astype("category").cat.codes)
)
In terms of performance on my laptop, with such a small dataframe, the two are almost identical.
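As a tiny illustration of that categorical trick (with made-up key values): the concatenated cluster_number/restaurant_id strings act as group keys, and .cat.codes maps them to consecutive integers. Note that concatenating without a separator could in principle collide (e.g. cluster 1 + id 11 vs cluster 11 + id 1); inserting a separator such as "_" avoids that.
import pandas as pd

keys = pd.Series(["10", "10", "20", "21", "30"])   # cluster_number + restaurant_id
print(keys.astype("category").cat.codes.tolist())  # [0, 0, 1, 2, 3]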
I think you are approaching this very analytically. Try this:
What I'm doing here is assigning a non-repeating ID number (details below).
import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

# Does the word occur more than once? (int)
repeat = 0
for i in range(df_test.shape[0]):
    heywtu = df_test[df_test['name'].str.contains(*df_test['name'][i].split())].index.values
    # 0 enters the special case, so we take it as 1 directly.
    if i == 0:
        df_test.loc[i, 'id'] = i + 1
    else:
        # Does the word occur more than once?
        repeat += len(heywtu) == 2
        # Fill the id column with a specific id number
        df_test.loc[i, 'id'] = i - repeat
        # Edit the id of rows with the same name, other than row 0
        if (len(heywtu) == 2) & (i != 0):
            df_test.loc[i, 'id'] = heywtu[0]
            continue

# Special case, if there are only 2 values:
if (len(df_test['name']) == 2):
    df_test.loc[1, 'id'] = 2

# For the first d_test values
print(df_test.head(10))
>>>           name  cluster_number   id
>>> 0        Beach               1  1.0
>>> 1  South Beach               1  1.0
>>> 2      Big Dog               2  2.0
>>> 3          Cat               2  3.0
>>> 4          Dog               2  2.0
>>> 5          Ant               3  4.0
>>> 6         Bird               3  5.0
>>> 7         Dear               4  6.0
# For the last d_test values
print(df_test.head(10))
>>>           name  cluster_number   id
>>> 0        Beach               1  1.0
>>> 1  South Beach               1  1.0
>>> 2      Big Dog               2  2.0
>>> 3          Cat               2  3.0
>>> 4          Dog               2  2.0
>>> 5     Dry Fish               2  4.0
>>> 6         Fish               2  4.0
>>> 7          Ant               3  5.0
>>> 8         Bird               3  6.0
>>> 9         Dear               4  7.0
# If there are only 2 values
df_test.head()
>>>           name  cluster_number   id
>>> 0      Big Dog               1  1.0
>>> 1  South Beach               2  2.0
What is repeat? If another string contains the same word, like Dog and Big Dog, it gets counted, and that count is subtracted from the index number. I hope this is helpful for your problem.
I have written code to calculate a, b, and c. They were initialized to 0.
This is my input file
-------------------------------------------------------------
| Line | Time | Command   | Data |
-------------------------------------------------------------
|  1   | 0015 | ACTIVE    |      |
|  2   | 0030 | WRITING   |      |
|  3   | 0100 | WRITING_A |      |
|  4   | 0115 | PRECHARGE |      |
|  5   | 0120 | REFRESH   |      |
|  6   | 0150 | ACTIVE    |      |
|  7   | 0200 | WRITING   |      |
|  8   | 0314 | PRECHARGE |      |
|  9   | 0318 | ACTIVE    |      |
| 10   | 0345 | WRITING_A |      |
| 11   | 0430 | WRITING_A |      |
| 12   | 0447 | WRITING   |      |
| 13   | 0503 | WRITING   |      |
and the timestamps and commands are used to process the calculation for a, b, and c.
import re

count = {}
timestamps = {}
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] = count[m[3]] + 1 if m[3] in count else 1
            #print(m[2])
            if m[3] in timestamps:
                timestamps[m[3]].append(m[2])
                #print(m[3], m[2])
            else:
                timestamps[m[3]] = [m[2]]
                #print(m[3], m[2])

a = b = c = 0
for key in count:
    print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))

if timestamps["ACTIVE"] > timestamps["PRECHARGE"]:  # line causing logic error
    a = a + 1
print(a)
Before getting into the calculation, I assign the timestamps with respect to the commands. This is the output for this section.
ACTIVE : 3, ['0015', '0150', '0318']
WRITING : 4, ['0030', '0200', '0447', '0503']
WRITING_A : 3, ['0100', '0345', '0430']
PRECHARGE : 2, ['0115', '0314']
REFRESH : 1, ['0120']
To get a, the timestamps of ACTIVE must be greater than PRECHARGE and WRITING must be greater than ACTIVE. (Lines 4, 6, and 7 contribute to the first a, and Lines 8, 9, and 12 contribute to the second a.)
To get b, the timestamps of WRITING must be greater than ACTIVE. Lines that already contribute to a, such as Lines 4, 6, 7, 8, 9, and 12, cannot be used to calculate b. So Lines 1 and 2 contribute to b.
To get c, the rest of the unused lines containing WRITING contribute to c.
The expected output:
a = 2
b = 1
c = 1
However, in my code, when I print a, it displays 0, which shows the logic has an error. Any suggestions to amend my code to achieve the goal? I have tried for a few days and the problem is still not solved.
I made a function that returns the commands, in order, that match a pattern, with gaps allowed.
I also made a more compact version of your file reading.
There is probably a better way to divide the list into two parts; the problem was to only allow in elements that match the whole pattern. In this version I iterate over the elements twice.
import re

commands = list()
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            _, line, time, command, data, _ = m
            commands.append((line, time, command))
def search_pattern(pattern, iterable, key=None):
    iter = 0          # position within the pattern we are trying to match next
    count = 0         # number of elements belonging to completed pattern matches
    length = len(pattern)
    results = []
    sentinel = object()
    for elem in iterable:
        original_elem = elem
        if key is not None:
            elem = key(elem)
        if elem == pattern[iter]:
            iter += 1
            results.append((original_elem, sentinel))
            if iter >= length:
                iter = iter % length
                count += length
        else:
            results.append((sentinel, original_elem))
    # second pass: the first `count` flagged elements form complete matches,
    # everything else (including a trailing partial match) is non-matching
    matching = []
    nonmatching = []
    for res in results:
        first, second = res
        if count > 0:
            if second is sentinel:
                matching.append(first)
                count -= 1
            elif first is sentinel:
                nonmatching.append(second)
        else:
            value = first if second is sentinel else second
            nonmatching.append(value)
    return matching, nonmatching
pattern_a = ['PRECHARGE','ACTIVE','WRITING']
pattern_b = ['ACTIVE','WRITING']
pattern_c = ['WRITING']
matching, nonmatching = search_pattern(pattern_a, commands, key=lambda t: t[2])
a = len(matching)//len(pattern_a)
matching, nonmatching = search_pattern(pattern_b, nonmatching, key=lambda t: t[2])
b = len(matching)//len(pattern_b)
matching, nonmatching = search_pattern(pattern_c, nonmatching, key=lambda t: t[2])
c = len(matching)//len(pattern_c)
print(f'{a=}')
print(f'{b=}')
print(f'{c=}')
Output:
a=2
b=1
c=1
I am trying to increment a column by 1 while the sum of that column is less than or equal to a total supply figure. I also need each value in that column to be no greater than the corresponding value in the allocation column. The supply variable will be dynamic, from 1-400, based on user input. Below is the desired output (the Allocation Final column).
supply = 14
| rank | allocation | Allocation Final |
| ---- | ---------- | ---------------- |
| 1 | 12 | 9 |
| 2 | 3 | 3 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
Below is the code I have so far:
import pandas as pd

data = [[1.05493, 12], [.94248, 3], [.82317, 1], [.75317, 1]]
df = pd.DataFrame(data, columns=['score', 'allocation'])
df['rank'] = df['score'].rank()
df['allocation_new'] = 0

# static for testing
supply = 14

for index in df.index:
    while df.loc[index, 'allocation_new'] < df.loc[index, 'allocation'] and df.loc[index, 'allocation_new'].sum() < supply:
        df.loc[index, 'allocation_new'] += 1

print(df)
This should do:
def allocate(df, supply):
    if supply > df['allocation'].sum():
        raise ValueError(f'Unachievable supply {supply}, maximal {df["allocation"].sum()}')
    under_alloc = pd.Series(True, index=df.index)
    df['allocation final'] = 0
    while (missing := supply - df['allocation final'].sum()) >= 0:
        assert under_alloc.any()
        if missing <= under_alloc.sum():
            df.loc[df.index[under_alloc][:missing], 'allocation final'] += 1
            return df
        df.loc[under_alloc, 'allocation final'] = (
            df.loc[under_alloc, 'allocation final'] + missing // under_alloc.sum()
        ).clip(upper=df.loc[under_alloc, 'allocation'])
        under_alloc = df['allocation final'] < df['allocation']
    return df
At every iteration, we add the missing quotas to any rows that have not reached their allocation yet (rounded down, that's missing // under_alloc.sum()), then use pd.Series.clip() to ensure we stay at or below the allocation.
If there’s less missing quotas than available ranks to which to allocate (e.g. run the same dataframe with supply=5 or 6), we allocate to the first missing ranks.
>>> df = pd.DataFrame( {'allocation': {0: 12, 1: 3, 2: 1, 3: 1}, 'rank': {0: 1, 1: 2, 2: 3, 3: 4}})
>>> print(allocate(df, 14))
   allocation  rank  allocation final
0          12     1                 9
1           3     2                 3
2           1     3                 1
3           1     4                 1
>>> print(allocate(df, 5))
   allocation  rank  allocation final
0          12     1                 2
1           3     2                 1
2           1     3                 1
3           1     4                 1
Here is a simpler version:
def allocate(series, supply):
    allocated = 0
    values = [0] * len(series)
    while True:
        for i in range(len(series)):
            if allocated >= supply:
                return values
            if values[i] < series.iloc[i]:
                values[i] += 1
                allocated += 1
allocate(df['allocation'], 14)
output:
[9,3,1,1]
I have the following dataframe in pandas:
import pandas as pd

data = {'ID_1': {0: '10A00', 1: '10B00', 2: '20001', 3: '20001'},
        'ID_2_LIST': {0: [20009, 30006], 1: [20001, 30006],
                      2: [30009, 30006], 3: [20001, 30003]},
        'ID_OCCURRENCY_LIST': {0: [1, 2], 1: [5, 6], 2: [2, 4], 3: [1, 3]}}

# create df
df = pd.DataFrame(data)
| | ID_1 | ID_2_LIST | ID_OCCURRENCY_LIST |
|---:|:-------|:---------------|:---------------------|
| 0 | 10A00 | [20009, 30006] | [1, 2] |
| 1 | 10B00 | [20001, 30006] | [5, 6] |
| 2 | 20001 | [30009, 30006] | [2, 4] |
| 3 | 20001 | [20001, 30003] | [1, 3] |
I would like to aggregate by the ID_1 field, applying an external function (in order to identify similar ID_1 values, say similarID(ID1, ID2), which returns ID1 or ID2 according to some internal rules), regenerate the list of ID_2 values, and sum the occurrences for all equal ID_2 values.
The outcome should be:
**INDEX  ID_1   ID_2_LIST                     ID_OCCURRENCY_LIST**
0        10A00  [20009, 30006, 20001]         [1, 8, 5]
1        10B00  [20001, 30006, 30003, 20001]  [5, 6, 4, 2]
1        20001  [30009, 30006, 20001, 30003]  [2, 4, 1, 3]
EDIT
The code for the function is the following (s1 = first string, c1 = second string, p1 = similarity percentage, l1 = confidence level; pyDamerauLevenschtein is a Damerau-Levenshtein distance function from the literature):
def pySimilar(s1, c1, p1, l1):
    if s1 is None or c1 is None:
        return 0
    if len(s1) <= 5 or len(c1) <= 5:
        return 0
    s1 = s1.strip()
    c1 = c1.strip()
    s = s1
    c = c1
    if s1[3:len(s1)] == c1[3:len(c1)]:
        return 1
    if len(s1) >= len(c1):
        ITERATIONLENGTH = len(c1) / 2
    else:
        ITERATIONLENGTH = len(s1) / 2
    if len(s1) >= len(c1):
        a = int(len(c1) / 2) + 1
        if s1.find(c1[3:a]) < 0:
            return 0
    else:
        b = int(len(s1) / 2) + 1
        if c1.find(s1[3:b]) < 0:
            return 0
    v = []
    CNT = 0
    TMP = 0
    max_res = 0
    search = s1
    while CNT < ITERATIONLENGTH:
        TMP = (100 - ((pyDamerauLevenschtein(s[3:len(s)], c[3:len(c)])) * 100) / (len(c) - 3)) * ((len(search) - 3) / (len(s1) - 3))
        v.append(TMP)
        CNT = CNT + 1
        if TMP > max_res:
            max_res = TMP
        #s = s[0:len(s)-CNT]
        search = s1[0:len(s1) - CNT]
        s = s1[0:len(s1) - CNT]
        c = c1[0:len(c1) - CNT]
    if ((p1 - (l1 * p1 / 100) <= sum(v) / len(v) and sum(v) / len(v) <= p1 + (l1 * p1 / 100)) or sum(v) / len(v) >= p1 + (l1 * p1 / 100)):
        return 1
    else:
        return 0
I have implemented a function to be applied to the dataframe, but it is very slow:
def aggregateListAndOccurrencies(list1, list2):
    final = []
    final_cnt = []
    output = []
    cnt_temp = 0
    while list1:
        elem = list1.pop(0)
        cnt = list2.pop(0)
        i = 0
        cnt_temp = cnt
        for item in list1:
            if pyMATCHSIMILARPN(elem, item, 65, 20) == 1:
                cnt_temp = list2[i] + cnt_temp
                list1.pop(i)
                list2.pop(i)
            i += 1
        final.append(elem)
        final_cnt.append(cnt_temp)
    output.append(final)
    output.append(final_cnt)
    return output
How could I apply this in pandas? Any suggestions?
You can simply do a groupby over your ID_1 and just sum the ID_2_LIST and ID_OCCURRENCY_LIST columns:
df.groupby('ID_1').agg({'ID_2_LIST': 'sum', 'ID_OCCURRENCY_LIST': 'sum'})
If there's a specific function you'd like the groupby to work with, you can use a lambda to add it in the .agg:
df.groupby('ID_1').agg({'ID_2_LIST': 'sum', 'ID_OCCURRENCY_LIST': lambda x: ' '.join(x)})
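If, beyond concatenating the lists, you also want to merge duplicate ID_2 values and sum their occurrences within each group, a custom aggregation along these lines might work (a rough sketch that merges exact duplicates only; plugging your pySimilar logic into the grouping is left out here):
from collections import Counter
import pandas as pd

def merge_lists(group):
    # sum occurrences for identical ID_2 values within this ID_1 group
    counts = Counter()
    for ids, occs in zip(group["ID_2_LIST"], group["ID_OCCURRENCY_LIST"]):
        for id_2, occ in zip(ids, occs):
            counts[id_2] += occ
    return pd.Series({
        "ID_2_LIST": list(counts.keys()),
        "ID_OCCURRENCY_LIST": list(counts.values()),
    })

result = df.groupby("ID_1").apply(merge_lists).reset_index()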
I have a large lookup table where the key is an interval:
| min | max | value |
|-----|-----|---------|
| 0 | 3 | "Hello" |
| 4 | 5 | "World" |
| 6 | 6 | "!" |
| ... | ... | ... |
The goal is to create a lookup structure my_lookup that returns a value for each integer, depending on the range the integer is in.
For example: 2 -> "Hello", 3 -> "Hello", 4 -> "World".
Here is an implementation that does what I want:
d = {
    (0, 3): "Hello",
    (4, 5): "World",
    (6, 6): "!"
}

def my_lookup(i: int) -> str:
    for key, value in d.items():
        if key[0] <= i <= key[1]:
            return value
But looping over all entries seems inefficient (the actual lookup table contains 400,000 lines). Is there a faster way?
If your intervals are sorted (in ascending order), you can use the bisect module (doc). The search is O(log n) instead of O(n):
import bisect

min_lst = [0, 4, 6]
max_lst = [3, 5, 6]
values = ['Hello', 'World', '!']

val = 2
idx = bisect.bisect_left(max_lst, val)
if idx < len(max_lst) and min_lst[idx] <= val <= max_lst[idx]:
    print('Value found ->', values[idx])
else:
    print('Value not found')
Prints:
Value found -> Hello
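To use this on the full 400,000-row table, you can build the three lists once and wrap the lookup in a function, for example (a sketch, assuming the intervals are sorted by min and non-overlapping):
import bisect

def build_lookup(rows):
    # rows: iterable of (min, max, value) tuples, sorted by min, non-overlapping
    min_lst = [r[0] for r in rows]
    max_lst = [r[1] for r in rows]
    values = [r[2] for r in rows]

    def my_lookup(i: int):
        idx = bisect.bisect_left(max_lst, i)
        if idx < len(max_lst) and min_lst[idx] <= i <= max_lst[idx]:
            return values[idx]
        return None   # i is not covered by any interval

    return my_lookup

my_lookup = build_lookup([(0, 3, "Hello"), (4, 5, "World"), (6, 6, "!")])
print(my_lookup(2), my_lookup(4), my_lookup(6))   # Hello World !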