I am trying to increment a column by 1 while the sum of that column is less than or equal to a total supply figure. I also need each value in that column to stay at or below the corresponding value in the 'allocation' column. The supply variable is dynamic, ranging from 1 to 400 based on user input. Below is the desired output (the 'Allocation Final' column).
supply = 14
| rank | allocation | Allocation Final |
| ---- | ---------- | ---------------- |
| 1 | 12 | 9 |
| 2 | 3 | 3 |
| 3 | 1 | 1 |
| 4 | 1 | 1 |
Below is the code I have so far:
import pandas as pd

data = [[1.05493,12],[.94248,3],[.82317,1],[.75317,1]]
df = pd.DataFrame(data,columns=['score','allocation'])
df['rank'] = df['score'].rank()
df['allocation_new'] = 0
#static for testing
supply = 14
for index in df.index:
    while df.loc[index, 'allocation_new'] < df.loc[index, 'allocation'] and df.loc[index, 'allocation_new'].sum() < supply:
        df.loc[index, 'allocation_new'] += 1

print(df)
This should do it:

def allocate(df, supply):
    if supply > df['allocation'].sum():
        raise ValueError(f'Unachievable supply {supply}, maximal {df["allocation"].sum()}')

    under_alloc = pd.Series(True, index=df.index)
    df['allocation final'] = 0

    while (missing := supply - df['allocation final'].sum()) >= 0:
        assert under_alloc.any()

        if missing <= under_alloc.sum():
            df.loc[df.index[under_alloc][:missing], 'allocation final'] += 1
            return df

        df.loc[under_alloc, 'allocation final'] = (
            df.loc[under_alloc, 'allocation final'] + missing // under_alloc.sum()
        ).clip(upper=df.loc[under_alloc, 'allocation'])
        under_alloc = df['allocation final'] < df['allocation']

    return df
At every iteration, we add the missing quota to every row that has not yet reached its allocation (rounded down, that's missing // under_alloc.sum()), then use pd.Series.clip() to ensure we stay at or below the allocation.
If there are fewer missing quotas than ranks still available to allocate to (e.g. run the same dataframe with supply=5 or 6), we allocate one each to the first under-allocated ranks.
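For instance, tracing the loop by hand on the example dataframe with supply=14:
First pass: missing = 14 and 4 rows are under-allocated, so each row gets 14 // 4 = 3, clipped by its allocation to [3, 3, 1, 1].
Second pass: missing = 14 - 8 = 6 and only rank 1 is still under-allocated, so it gets 6 // 1 = 6 more, giving [9, 3, 1, 1].
Third pass: missing = 0, nothing more is handed out, and the function returns [9, 3, 1, 1], matching the desired output.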
>>> df = pd.DataFrame( {'allocation': {0: 12, 1: 3, 2: 1, 3: 1}, 'rank': {0: 1, 1: 2, 2: 3, 3: 4}})
>>> print(allocate(df, 14))
allocation rank allocation final
0 12 1 9
1 3 2 3
2 1 3 1
3 1 4 1
>>> print(allocate(df, 5))
allocation rank allocation final
0 12 1 2
1 3 2 1
2 1 3 1
3 1 4 1
Here is a simpler version:
def allocate(series, supply):
    allocated = 0
    values = [0] * len(series)
    while True:
        for i in range(len(series)):
            if allocated >= supply:
                return values
            if values[i] < series.iloc[i]:
                values[i] += 1
                allocated += 1
allocate(df['allocation'], 14)
output:
[9,3,1,1]
I am trying to find pairs of intervals that overlap by at least some minimum overlap length that is set by the user. The intervals are from this pandas dataframe:
import pandas as pds
print(df1.head())
print(df1.tail())
query_id start_pos end_pos read_length orientation
0 1687655 1 4158 4158 F
1 2485364 1 7233 7233 R
2 1412202 1 3215 3215 R
3 1765889 1 3010 3010 R
4 2944965 1 4199 4199 R
query_id start_pos end_pos read_length orientation
3082467 112838 27863832 27865583 1752 F
3082468 138670 28431208 28431804 597 R
3082469 171683 28489928 28490409 482 F
3082470 2930053 28569533 28569860 328 F
3082471 1896622 28589281 28589554 274 R
where start_pos is where the interval starts and end_pos is where the interval ends. read_length is the length of the interval.
The data is sorted by start_pos.
The program should have the following output format:
query_id1 -- query_id2 -- read_length1 -- read_length2 -- overlap_length
I am running the program on a compute node with up to 512 GB of RAM and 4x Intel Xeon E7-4830 CPUs (32 cores).
I've tried running my own code to find the overlaps, but it is taking too long to run.
Here is the code that I tried:

import pandas as pds
import numpy as np

overlap_df = pds.DataFrame()

def create_overlap_table(df1, ovl_len):
    ...
    # (sort and clean the data here)
    ...
    def iterate_queries(row):
        global overlap_df
        index1 = df1.index[df1['query_id'] == row['query_id']]
        next_int_index = df1.index.get_loc(index1[0]) + 1
        if row['read_length'] >= ovl_len:
            if df1.index.size - 1 >= next_int_index:
                end_pos_minus_ovlp = (row['end_pos'] - ovl_len) + 2
                subset_df = df1.loc[(df1['start_pos'] < end_pos_minus_ovlp)]
                subset_df = subset_df.loc[subset_df.index == subset_df.index.max()]
                subset_df = df1.iloc[next_int_index:df1.index.get_loc(subset_df.index[0])]
                subset_df = subset_df.loc[subset_df['read_length'] >= ovl_len]
                rows1 = pds.DataFrame({'read_id1': np.repeat(row['query_id'], repeats=subset_df.index.size), 'read_id2': subset_df['query_id']})
                overlap_df = overlap_df.append(rows1)
    df1.apply(iterate_queries, axis=1)
    print(overlap_df)
Again, I ran this code on the compute node, but it was running for hours before I finally cancelled the job.
I've also tried using two packages for this problem--PyRanges, as well as an R package called IRanges, but they take too long to run as well. I've seen posts on interval trees and a python library called pybedtools, and I was planning on looking into them as a next step.
Any feedback would be really appreciated
EDIT:
For a minimum overlap length of, say, 800, the first 5 rows should look like this:
query_id1 query_id2 read_length1 read_length2 overlap_length
1687655 2485364 4158 7233 4158
1687655 1412202 4158 3215 3215
1687655 1765889 4158 3010 3010
1687655 2944965 4158 4199 4158
2485364 1412202 7233 3215 3215
So, query_id1 and query_id2 cannot be identical. Also, no duplications (i.e., an overlap between A and B should not appear twice in the output).
Here's an algorithm.
Prepare a set of intervals sorted by starting point. Initially the set is empty.
Sort all starting and ending points.
Traverse the points. If a starting point is encountered, add the corresponding interval to the set. If an ending point is encountered, remove the corresponding interval from the set.
When removing an interval, look at other intervals in the set. They all overlap the interval being removed, and they are sorted by the length of the overlap, longest first. Traverse the set until the length is too short, and report each overlap.
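A minimal sketch of this sweep, not a tuned implementation: it assumes the intervals come as (query_id, start_pos, end_pos) tuples and uses sortedcontainers.SortedList as the "set ordered by starting point" (the field names and the helper name overlapping_pairs are illustrative):

from sortedcontainers import SortedList

def overlapping_pairs(intervals, min_overlap):
    # intervals: iterable of (query_id, start_pos, end_pos); add 1 to the overlap
    # if your coordinates are inclusive on both ends
    events = []
    for qid, start, end in intervals:
        events.append((start, 0, qid, start, end))  # 0 = interval starts
        events.append((end, 1, qid, start, end))    # 1 = interval ends
    events.sort()

    active = SortedList()  # intervals currently open, ordered by start_pos
    pairs = []
    for pos, kind, qid, start, end in events:
        if kind == 0:
            active.add((start, end, qid))
        else:
            active.remove((start, end, qid))
            # every interval still in `active` overlaps [start, end]; walking them in
            # ascending start order yields overlaps from longest to shortest, so we
            # can stop at the first one that is too short
            for other_start, other_end, other_qid in active:
                overlap = end - max(start, other_start)
                if overlap < min_overlap:
                    break
                pairs.append((qid, other_qid, overlap))
    return pairs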
pyranges author here. Thanks for trying my library.
How big are your data? When both PyRanges had 1e7 rows, the heavy part done by pyranges took about 12 seconds using 24 cores on a busy server with 200 GB of RAM.
Setup:
import pyranges as pr
import numpy as np
import mkl
mkl.set_num_threads(1)
### Create data ###
length = int(1e7)
minimum_overlap = 5
gr = pr.random(length)
gr2 = pr.random(length)
# add ids
gr.id1 = np.arange(len(gr))
gr2.id2 = np.arange(len(gr))
# add lengths
gr.length1 = gr.lengths()
gr2.length2 = gr2.lengths()
gr
# +--------------+-----------+-----------+--------------+-----------+-----------+
# | Chromosome | Start | End | Strand | id1 | length1 |
# | (category) | (int32) | (int32) | (category) | (int64) | (int32) |
# |--------------+-----------+-----------+--------------+-----------+-----------|
# | chr1 | 146230338 | 146230438 | + | 0 | 100 |
# | chr1 | 199561432 | 199561532 | + | 1 | 100 |
# | chr1 | 189095813 | 189095913 | + | 2 | 100 |
# | chr1 | 27608425 | 27608525 | + | 3 | 100 |
# | ... | ... | ... | ... | ... | ... |
# | chrY | 21533766 | 21533866 | - | 9999996 | 100 |
# | chrY | 30105890 | 30105990 | - | 9999997 | 100 |
# | chrY | 49764407 | 49764507 | - | 9999998 | 100 |
# | chrY | 3319478 | 3319578 | - | 9999999 | 100 |
# +--------------+-----------+-----------+--------------+-----------+-----------+
# Stranded PyRanges object has 10,000,000 rows and 6 columns from 25 chromosomes.
Doing your analysis:
j = gr.join(gr2, new_pos="intersection", nb_cpu=24)
# CPU times: user 3.85 s, sys: 3.56 s, total: 7.41 s
# Wall time: 12.3 s
j.overlap = j.lengths()
out = j.df["id1 id2 length1 length2 overlap".split()]
out = out[out.overlap >= minimum_overlap]
out.head()
id1 id2 length1 length2 overlap
1 2 485629 100 100 74
2 2 418820 100 100 92
3 3 487066 100 100 13
4 7 191109 100 100 31
5 11 403447 100 100 76
I faced the problem of quickly finding the nearest neighbors in a given range.
Example of dataset:
id | string | float
0 | AA | 0.1
12 | BB | 0.5
2 | CC | 0.3
102| AA | 1.1
33 | AA | 2.8
17 | AA | 0.5
For each line, print the number of other lines satisfying the following conditions:
the string field is equal to the current row's string
the float field is less than the current row's float and within del of it (current float - float field <= del)
For this example with del = 1.5:
id | count
0 | 0
12 | 0
2 | 0
102| 2 (rows with id=0, 33, 17 have the same string, but only id=0 and id=17 satisfy the float condition: 1.1 - 0.1 <= 1.5 and 1.1 - 0.5 <= 1.5)
33 | 0 (rows with id=0, 102, 17 have the same string, but 2.8 - 0.1, 2.8 - 1.1 and 2.8 - 0.5 are all greater than 1.5)
17 | 1
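For concreteness, a small brute-force sketch that reproduces the counts above (the column names id, string, float and the threshold name delta are illustrative; delta stands in for del, which is a reserved word in Python):

import pandas as pd

df = pd.DataFrame({'id': [0, 12, 2, 102, 33, 17],
                   'string': ['AA', 'BB', 'CC', 'AA', 'AA', 'AA'],
                   'float': [0.1, 0.5, 0.3, 1.1, 2.8, 0.5]})
delta = 1.5

def count_in_range(row):
    # other rows with the same string whose float lies within delta below this row's float
    others = df[(df['string'] == row['string']) & (df.index != row.name)]
    return int(((others['float'] < row['float']) & (row['float'] - others['float'] <= delta)).sum())

df['count'] = df.apply(count_in_range, axis=1)
print(df[['id', 'count']])  # counts 0, 0, 0, 2, 0, 1 as in the table above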
To solve this problem, I used the BallTree class with a custom metric, but on a large dataset it runs for a very long time because of the reverse tree walk.
Can someone suggest other solutions, or a way to bring the speed of a custom metric up to the speed of the built-in metrics from sklearn.neighbors.DistanceMetric?
My code:
from sklearn.neighbors import BallTree

def distance(x, y):
    if x[0] == y[0] and x[1] > y[1]:
        return (x[1] - y[1])
    else:
        return (x[1] + y[1])

# X must be a numeric 2-column array: (encoded string field, float field)
tree2 = BallTree(X, leaf_size=X.shape[0], metric=distance)
# `del` is a reserved word in Python, so the radius is named delta here
mas = tree2.query_radius(X, r=delta, count_only=True)
The program should take a number of steps and a number, and then output how many unique sequences there are with exactly that many steps to create the number.
Does anyone know how I can save some memory? I need this to work for pretty huge numbers within a 4-second limit.
def IsaacRule(steps, number):
    if number in IsaacRule.numbers:
        return 0
    else:
        IsaacRule.numbers.add(number)
    if steps == 0:
        return 1
    counter = 0
    if ((number - 1) / 3) % 2 == 1:
        counter += IsaacRule(steps-1, (number - 1) / 3)
    if (number * 2) % 2 == 0:
        counter += IsaacRule(steps-1, number*2)
    return counter

IsaacRule.numbers = set()
print(IsaacRule(6, 2))
If someone knows a version with memoization I would be thankful; right now it works, but there is still room for improvement.
Baseline: IsaacRule(50, 2) takes 6.96s
0) Use the LRU Cache
This made the code take longer, and gave a different final result
1) Eliminate the if condition: (number * 2) % 2 == 0 is always True
IsaacRule(50, 2) takes 0.679s. Thanks Pm2Ring for this one.
2) Simplify ((number - 1) / 3) % 2 == 1 to number % 6 == 4 and use floor division where possible:
IsaacRule(50, 2) takes 0.499s
Truth table:
| n | n-1 | (n-1)/3 | (n-1)/3 % 2 | ((n-1)/3)%2 == 1 |
|---|-----|---------|-------------|------------------|
| 1 | 0 | 0.00 | 0.00 | FALSE |
| 2 | 1 | 0.33 | 0.33 | FALSE |
| 3 | 2 | 0.67 | 0.67 | FALSE |
| 4 | 3 | 1.00 | 1.00 | TRUE |
| 5 | 4 | 1.33 | 1.33 | FALSE |
| 6 | 5 | 1.67 | 1.67 | FALSE |
| 7 | 6 | 2.00 | 0.00 | FALSE |
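As a quick sanity check of that simplification, the two conditions agree for every integer in a reasonable range (a one-line test; the bound 100000 is arbitrary):

assert all((((n - 1) / 3) % 2 == 1) == (n % 6 == 4) for n in range(1, 100000))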
Code:
def IsaacRule(steps, number):
    if number in IsaacRule.numbers:
        return 0
    else:
        IsaacRule.numbers.add(number)
    if steps == 0:
        return 1
    counter = 0
    if number % 6 == 4:
        counter += IsaacRule(steps-1, (number - 1) // 3)
    counter += IsaacRule(steps-1, number*2)
    return counter
3) Rewrite code using sets
IsaacRule(50, 2) takes 0.381s
This lets us take advantage of any optimizations made for sets. Basically I do a breadth first search here.
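The set-based code itself isn't shown above, but a sketch of what this BFS might look like (the seen set plays the role of IsaacRule.numbers; treat this as an illustration rather than the exact code that was timed):

START = 2
STEPS = 50

seen = {START}                # numbers reached in earlier steps
current_candidates = {START}  # numbers first reached in exactly `step` steps
for step in range(STEPS):
    next_candidates = set()
    for number in current_candidates:
        if number % 6 == 4:
            next_candidates.add((number - 1) // 3)
        next_candidates.add(number * 2)
    next_candidates -= seen   # discard anything produced in an earlier step
    seen |= next_candidates
    current_candidates = next_candidates
print(len(current_candidates))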
4) Break the cycle so we can skip keeping track of previous states.
IsaacRule(50, 2) takes 0.256s
We just need to add a check that number != 1 to break the only known cycle. This gives a speed up, but you need to add a special case if you start from 1. Thanks Paul for suggesting this!
START = 2
STEPS = 50

# Special case since we broke the cycle
if START == 1:
    START = 2
    STEPS -= 1

current_candidates = {START}  # set of states that can be reached in `step` steps
for step in range(STEPS):
    # Get all states that can be reached from current_candidates
    next_candidates = set(number * 2 for number in current_candidates if number != 1) | set((number - 1) // 3 for number in current_candidates if number % 6 == 4)
    # Next step of BFS
    current_candidates = next_candidates
print(len(next_candidates))
I'm currently trying to program a recursive sudoku solving algorithm in Python. I made a class for Sudokus which contains some methods to help me manipulate the sudoku grid.
Here is my code:
class Sudoku:
    def __init__(self, input, tokens=None):
        self.data = {}
        if tokens is None:
            self.tokens = list(range(1, 10))
        else:
            self.tokens = tokens
            assert len(self.tokens) == 9
        if type(input) == dict:
            self.data = input
        else:
            for i, cell in enumerate(input):
                if cell in self.tokens:
                    self.data[i % 9, i // 9] = cell

    def __repr__(self):
        string = ''
        canvas = [['.'] * 9 for line in range(9)]
        for (col, row), cell in self.data.items():
            canvas[row][col] = str(cell)
        for y, row in enumerate(canvas):
            if not y % 3:
                string += "+-------+-------+-------+\n"
            string += '| {} | {} | {} |\n'.format(' '.join(row[:3]), ' '.join(row[3:6]), ' '.join(row[6:]))
        string += "+-------+-------+-------+"
        return string
    @classmethod
    def sq_coords(cls, cell_x, cell_y):
        # returns all coordinates of cells in the same square as the one in (cell_x, cell_y)
        start_x, start_y = cell_x // 3 * 3, cell_y // 3 * 3
        for dx in range(3):
            for dy in range(3):
                yield (start_x + dx, start_y + dy)

    def copy(self):
        return Sudoku(self.data)

    def clues(self, cell_x, cell_y):
        assert not self.data.get((cell_x, cell_y))
        allowed = set(self.tokens)
        # Remove all numbers on the same row, column and square as the cell
        for row in range(9):
            allowed.discard(self.data.get((cell_x, row)))
        for col in range(9):
            allowed.discard(self.data.get((col, cell_y)))
        for coords in self.sq_coords(cell_x, cell_y):
            allowed.discard(self.data.get(coords))
        return allowed

    def get_all_clues(self):
        clues = {}
        for row in range(9):
            for col in range(9):
                if not self.data.get((col, row)):
                    clues[col, row] = self.clues(col, row)
        return clues

    def fill_singles(self):
        still_going = True
        did_something = False
        while still_going:
            still_going = False
            for (col, row), clues in self.get_all_clues().items():
                if len(clues) == 1:
                    still_going = True
                    did_something = True
                    self.data[col, row] = clues.pop()
        return did_something

    def place_finding(self):
        still_going = True
        did_something = False
        while still_going:
            still_going = False
            for token in self.tokens:
                for group in self.get_groups():
                    available_spots = [coords for coords, cell in group.items() if cell == None and token in self.clues(*coords)]
                    if len(available_spots) == 1:
                        self.data[available_spots.pop()] = token
                        still_going = True
                        did_something = True
        return did_something

    def fill_obvious(self):
        still_going = True
        while still_going:
            a = self.fill_singles()
            b = self.place_finding()
            still_going = a or b

    def get_groups(self):
        for y in range(9):
            yield {(x, y): self.data.get((x, y)) for x in range(9)}
        for x in range(9):
            yield {(x, y): self.data.get((x, y)) for y in range(9)}
        for n in range(9):
            start_x, start_y = n % 3 * 3, n // 3 * 3
            yield {(x, y): self.data.get((x, y)) for x, y in self.sq_coords(start_x, start_y)}

    def is_valid(self):
        for group in self.get_groups():
            if any([list(group.values()).count(token) > 1 for token in self.tokens]):
                return False
        return True

    def is_solved(self):
        return self.is_valid() and len(self.data) == 9 * 9


def solve(sudoku):
    def loop(su):
        if su.is_solved():
            print(su)
        elif su.is_valid():
            su.fill_obvious()
            print(su)
            for coords, available_tokens in sorted(su.get_all_clues().items(), key=lambda kv: len(kv[1])):
                for token in available_tokens:
                    new_su = su.copy()
                    new_su.data[coords] = token
                    loop(new_su)
    loop(sudoku)


with open('input.txt') as f:
    numbers = ''
    for i, line in enumerate(f):
        if i >= 9:
            break
        numbers += line.rstrip().ljust(9)

s = Sudoku(numbers, tokens='123456789')
print(s)
solve(s)
print(s)
Sorry if this seems messy but I'd rather give you everything I have than only some data that may or may not contain the problem.
As you can see, the first thing it does is fill the Sudoku with only 100% sure numbers, using the methods fill_singles (fills every cell that can only hold one number, e.g. because the 8 other possibilities already appear in its row, column or block) and place_finding (checks each token and sees whether there is only one spot in a group - row, column or block - where it can fit). It loops through both of those until nothing more can be done.
Afterwards, it tries every possibility in the cells that have the fewest candidates and tries to solve the newly made grid with the same method. That's where my problem is. Currently, for debug purposes, the program prints the grids it comes across whenever they're valid (no number appearing twice in a group). At least, that's what I'd like it to do. However, it doesn't work like this; with this input:
+-------+-------+-------+
| . . . | . 2 6 | . . 4 |
| . . . | 7 9 . | 5 . . |
| . . . | . . . | 9 1 . |
+-------+-------+-------+
| . 8 . | 1 . . | . . . |
| 2 3 6 | . . . | 1 8 5 |
| . . . | . . 3 | . 7 . |
+-------+-------+-------+
| . 4 7 | . . . | . . . |
| . . 3 | . 7 8 | . . . |
| 5 . . | 6 3 . | . . . |
+-------+-------+-------+
It outputs grids such as this one, which is obviously not valid:
+-------+-------+-------+
| 1 4 9 | 3 2 6 | 7 3 4 |
| 3 2 4 | 7 9 1 | 5 1 8 |
| 3 7 5 | 3 2 4 | 9 1 2 |
+-------+-------+-------+
| 7 8 6 | 1 1 4 | 3 2 5 |
| 2 3 6 | 9 4 7 | 1 8 5 |
| 4 1 7 | 2 3 3 | 6 7 4 |
+-------+-------+-------+
| 2 4 7 | 1 1 3 | 5 4 6 |
| 1 3 3 | 4 7 8 | 2 5 1 |
| 5 5 4 | 6 3 2 | 1 3 7 |
+-------+-------+-------+
I really cannot understand why it allows such grids to pass the is_valid test, especially considering that when I manually input the grid above, it doesn't pass the test:
>>> s
+-------+-------+-------+
| 1 4 9 | 3 2 6 | 7 3 4 |
| 3 2 4 | 7 9 1 | 5 1 8 |
| 3 7 5 | 3 2 4 | 9 1 2 |
+-------+-------+-------+
| 7 8 6 | 1 1 4 | 3 2 5 |
| 2 3 6 | 9 4 7 | 1 8 5 |
| 4 1 7 | 2 3 3 | 6 7 4 |
+-------+-------+-------+
| 2 4 7 | 1 1 3 | 5 4 6 |
| 1 3 3 | 4 7 8 | 2 5 1 |
| 5 5 4 | 6 3 2 | 1 3 7 |
+-------+-------+-------+
>>> s.is_valid()
False
Can anyone see an error in my code that I haven't noticed? I'm sorry I'm not really being specific but I tried looking through every piece of my code and can't seem to find anything.
For @AnandSKatum:
    26  4
   79 5
      91
 8 1
236   185
     3 7
 47
  3 78
5  63
Here is a better implementation that I wrote for you:
from numpy.lib.stride_tricks import as_strided
from itertools import chain
import numpy

def block_view(A, block=(3, 3)):
    """Provide a 2D block view to 2D array. No error checking made.
    Therefore meaningful (as implemented) only for blocks strictly
    compatible with the shape of A."""
    # simple shape and strides computations may seem at first strange
    # unless one is able to recognize the 'tuple additions' involved ;-)
    shape = (A.shape[0] // block[0], A.shape[1] // block[1]) + block
    strides = (block[0] * A.strides[0], block[1] * A.strides[1]) + A.strides
    return chain.from_iterable(as_strided(A, shape=shape, strides=strides))

def check_board(a):
    """
    a is a 2d 9x9 numpy array, 0 represents None
    """
    for row, col, section in zip(a, a.T, block_view(a, (3, 3))):
        s = list(chain.from_iterable(section))
        if any(sum(set(x)) != sum(x) for x in [row, col, s]):
            return False
    return True

a = numpy.array(
    [
        [9, 8, 7, 6, 5, 4, 3, 2, 1],
        [2, 1, 5, 0, 0, 0, 0, 0, 0],
        [3, 6, 4, 0, 0, 0, 0, 0, 0],
        [4, 2, 0, 0, 0, 0, 0, 0, 0],
        [5, 3, 0, 0, 0, 0, 0, 0, 0],
        [6, 7, 0, 0, 0, 0, 0, 0, 0],
        [7, 0, 0, 0, 0, 0, 0, 0, 0],
        [8, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0],
    ]
)
print(check_board(a))