I am using joblib to process a list (>500k rows) in parallel in order to find duplicates in a file, so I need to track the indices of the input list. However, the indices in the result are local to each thread/process rather than the original indices in the list (range 0-500k+). How can I track the original indices of the input during parallel processing? Thank you.
import time
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from joblib import Parallel, delayed
start_time = time.time()
texts = a_list
def match_name(texts):
    result = []
    for i, text in enumerate(texts):
        for j, name in enumerate(texts[i+1:]):
            fratio = fuzz.token_set_ratio(text, name)
            if fratio>=75:
                result.append([i,j, fratio])
    return result
results2 = Parallel(n_jobs=200, verbose=5, backend="loky")(map(delayed(match_name), texts))
print(time.time() - start_time)
The actual result is:
[[[1, 1, 100],
[1, 4, 100],
[1, 6, 100],
[2, 2, 100],
[2, 4, 100],
[3, 2, 100],
[3, 4, 100],
[5, 1, 100],
[6, 1, 100]],
[[0, 14, 100],
[1, 6, 100],
[1, 14, 100],
[2, 9, 100],
[2, 14, 100],
[8, 7, 100],
[9, 0, 100],
[9, 12, 100],
[10, 11, 100],
[12, 4, 100],
[13, 9, 100]],
[[1, 24, 100],
[3, 21, 100],
[5, 7, 100],
[6, 17, 100],
[9, 1, 100],
[9, 9, 100],
[11, 7, 100],
[12, 2, 100],
[17, 4, 100]],
[[0, 18, 100],
[0, 19, 100],
[2, 5, 100],
...]
The expected result ranges 0 to 500k+, which is the length of the list.
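One way to keep the original indices, sketched here under some assumptions, is to hand each task the position `i` of the item it handles together with the full list, and to iterate over absolute positions `j` instead of enumerating a slice. To keep the sketch runnable with the standard library only, `similarity` is a stand-in for `fuzz.token_set_ratio` and `ThreadPoolExecutor` is a stand-in for joblib's `Parallel`; `match_from` is a hypothetical helper name, not something from the question.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for fuzz.token_set_ratio (assumption): score in 0..100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def match_from(i, texts, threshold=75):
    """Compare texts[i] with every later item and report global indices."""
    hits = []
    for j in range(i + 1, len(texts)):  # j is a position in the full list
        score = similarity(texts[i], texts[j])
        if score >= threshold:
            hits.append((i, j, score))
    return hits


texts = ["john smith", "jane doe", "john smith", "jane doe", "bob"]

# One task per starting index; each task receives its own index i.
with ThreadPoolExecutor(max_workers=4) as pool:
    per_item = list(pool.map(lambda i: match_from(i, texts), range(len(texts))))

# Flatten the per-task lists into one list of (i, j, score) triples.
pairs = [hit for hits in per_item for hit in hits]
print(pairs)
```

With joblib, the same idea would presumably be written as `Parallel(n_jobs=...)(delayed(match_from)(i, texts) for i in range(len(texts)))`, so that every worker knows which global index it is responsible for.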
Edit: Sorry, my question wasn't clear. I want to find the minimum of a given row and column without taking their intersection point into account, as @ParthSindhu said :)
I would like to find the minimum of a row or column of a 2-D array while excluding one number. (I'm using a NumPy array.)
array([[30, 15, 41, 26, 12],
[ 4, 19, 22, 40, 1],
[41, 21, 0, 43, 22],
[ 9, 40, 6, 10, 30],
[24, 49, 22, 8, 41]])
For example, for row 2 and col 2, I would like to find the smallest number in that row and in that column, excluding the 0 where they intersect.
So the answers I want are 21 for row 2 and 6 for col 2.
I've tried to implement this with a 1-D array:
a = np.arange(9, -1, -1) # a = array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
b = a[np.arange(len(a))!=3] # b = array([9, 8, 7, 5, 4, 3, 2, 1, 0])
But I could only get this to work for the row, not for the column.
a[np.arange(len(a))!=1].min()
The code just above returns 6
How can I do the same thing for a column?
Sorry if the question is not so clear.
You could do this if you are ignoring the intersection point:
import numpy as np
a = np.array([[30, 15, 41, 26, 12],
              [ 4, 19, 22, 40,  1],
              [41, 21,  0, 43, 22],
              [ 9, 40,  6, 10, 30],
              [24, 49, 22,  8, 41]])
row = 2
col = 2
col_indices = np.delete(np.arange(a.shape[0]), row)
row_indices = np.delete(np.arange(a.shape[1]), col)
col_min = a[col_indices, col].min()
row_min = a[row, row_indices].min()
print(col_min, row_min)
I'm sure there are better ways than this; this is just how I would do it.
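The boolean-mask idea from the question's 1-D attempt also carries over to 2-D, as a minimal sketch: fix one index as an integer and use a `!=` mask on the other axis. The variable names `row`, `col`, `row_min` and `col_min` are just illustrative.

```python
import numpy as np

a = np.array([[30, 15, 41, 26, 12],
              [ 4, 19, 22, 40,  1],
              [41, 21,  0, 43, 22],
              [ 9, 40,  6, 10, 30],
              [24, 49, 22,  8, 41]])
row, col = 2, 2

# Row `row`, every column except `col`.
row_min = a[row, np.arange(a.shape[1]) != col].min()
# Column `col`, every row except `row`.
col_min = a[np.arange(a.shape[0]) != row, col].min()

print(row_min, col_min)  # 21 6
```

This avoids building index arrays with np.delete, at the cost of creating one small boolean mask per lookup.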
You can use np.amin(a, axis = 1) to get an array with the smallest number in each row.
a = np.array([[30, 15, 41, 26, 12],
[ 4, 19, 22, 40, 1],
[41, 21, 0, 43, 22],
[ 9, 40, 6, 10, 30],
[24, 49, 22, 8, 41]])
print(np.amin(a, axis = 1))
This results in
>> [12 1 0 6 8]
Now you can run this again to find the smallest number in this array.
Dummy = np.amin(a, axis = 1)
print(np.amin(Dummy))
And you get the smallest number.
>> 0
Setting axis to 0 gives the smallest number in each column instead, so you can perform this operation along either axis of the array.
a = np.array([[30, 15, 41, 26, 12],
[ 4, 19, 22, 40, 1],
[41, 21, 0, 43, 22],
[ 9, 40, 6, 10, 30],
[24, 49, 22, 8, 41]])
print(np.amin(a, axis = 0))
>> [ 4 15 0 8 1]
You could use masked arrays (note that this masks every 0 in the array, not only the intersection point):
a = np.array([[30, 15, 41, 26, 12],
[ 4, 19, 22, 40, 1],
[41, 21, 0, 43, 22],
[ 9, 40, 6, 10, 30],
[24, 49, 22, 8, 41]])
masked_a = np.ma.masked_array(a, mask=a == 0)
min_cols = masked_a.min(axis=0).data
min_rows = masked_a.min(axis=1).data
print(min_rows)
print(min_cols)
[12 1 21 6 8]
[ 4 15 6 8 1]
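If only the intersection point should be ignored, rather than every element equal to 0, the mask can be built from the position instead of the value. This is a sketch of that variant; `row`, `col` and `mask` are names chosen for illustration.

```python
import numpy as np

a = np.array([[30, 15, 41, 26, 12],
              [ 4, 19, 22, 40,  1],
              [41, 21,  0, 43, 22],
              [ 9, 40,  6, 10, 30],
              [24, 49, 22,  8, 41]])
row, col = 2, 2

# Mask exactly one cell: the intersection of `row` and `col`.
mask = np.zeros(a.shape, dtype=bool)
mask[row, col] = True
masked_a = np.ma.masked_array(a, mask=mask)

row_min = masked_a[row].min()     # minimum of row 2 without a[2, 2]
col_min = masked_a[:, col].min()  # minimum of column 2 without a[2, 2]
print(row_min, col_min)  # 21 6
```

A second 0 elsewhere in the same row or column would still be counted here, which is the difference from masking on `a == 0`.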
One possible way is to replace 0 with inf:
a = np.array([[30, 15, 41, 26, 12],
[ 4, 19, 22, 40, 1],
[41, 21, 0, 43, 22],
[ 9, 40, 6, 10, 30],
[24, 49, 22, 8, 41]])
no_zero = np.where(a==0, np.inf, a)
no_zero.min(axis=0) # array([ 4., 15., 6., 8., 1.])
no_zero.min(axis=1) # array([12., 1., 21., 6., 8.])
I have a numpy array and would like to take the first two rows of each element in an ndarray.
Here is an example array:
import numpy as np
a1 = np.array([[ 1, 2, 3],
[ 4, 5, 6]])
a2 = np.array([[ 7, 8, 9],
[10, 11, 12],
[13, 14, 15],
[16, 17, 18]])
a3 = np.array([[19, 20, 21],
[22, 23, 24],
[25, 26, 27]])
A = np.array([a1, a2, a3], dtype=object)  # dtype=object is needed for ragged input on recent NumPy versions
print("A =\n", A)
Which prints:
A =
[array([[ 1, 2, 3],
[ 4, 5, 6]])
array([[ 7, 8, 9],
[10, 11, 12],
[13, 14, 15],
[16, 17, 18]])
array([[19, 20, 21],
[22, 23, 24],
[25, 26, 27]])]
The desired result is as follows:
A =
[array([[ 1, 2, 3],
[ 4, 5, 6]])
array([[ 7, 8, 9],
[10, 11, 12]])
array([[19, 20, 21],
[22, 23, 24]])]
To print the equivalent object, you could do
print(np.array([a1[0:2], a2[0:2], a3[0:2]]))
But I want to directly get what is desired using A.
What is the correct way of doing this in numpy?
Edit: I would like to subset the array without looping. Alternative ways of structuring the arrays so that they can be directly indexed are okay too. Any numpy function to avoid looping is fair game.
a = [i[0:2] for i in A]
This will work!
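As for alternative ways of structuring the data: a ragged object array cannot be sliced per element in one NumPy call, but once every element is trimmed to the same number of rows the pieces can be stacked into a regular 3-D array, and everything after that is ordinary vectorized indexing. This is only a sketch; it still uses one comprehension for the trimming step, and `B` is a name introduced here.

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])
a2 = np.array([[7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
a3 = np.array([[19, 20, 21], [22, 23, 24], [25, 26, 27]])
A = np.array([a1, a2, a3], dtype=object)  # 1-D object array of 2-D arrays

# Trim each element to its first two rows, then stack into shape (3, 2, 3).
B = np.stack([x[:2] for x in A])

print(B.shape)
print(B[:, 0])  # first row of every block, no Python loop needed from here on
```

If the source arrays can be kept in this rectangular form from the start, slicing such as `B[:, :2]` replaces the loop entirely.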
The Story:
Currently, I have a function-under-test that expects a list of lists of integers with the following rules:
1. the number of sublists (let's call it N) can be from 1 to 50
2. the number of values inside each sublist is the same for all sublists (rectangular form) and should be >= 0 and <= 5
3. values inside sublists cannot be more than or equal to the total number of sublists; in other words, each value inside a sublist is an integer >= 0 and < N
Sample valid inputs:
[[0]]
[[2, 1], [2, 0], [3, 1], [1, 0]]
[[1], [0]]
Sample invalid inputs:
[[2]] # 2 is more than N=1 (total number of sublists)
[[0, 1], [2, 0]] # 2 is equal to N=2 (total number of sublists)
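The three rules can also be written down as a small plain-Python checker, which may be handy as an oracle when experimenting with strategies. This is a sketch; `is_valid`, `max_n` and `max_len` are hypothetical names, not part of the function under test.

```python
def is_valid(list_of_lists, max_n=50, max_len=5):
    """Return True if the input satisfies the three rules above."""
    n = len(list_of_lists)
    # Rule 1: between 1 and max_n sublists.
    if not 1 <= n <= max_n:
        return False
    # Rule 2: all sublists share one length, and it is at most max_len.
    lengths = {len(sub) for sub in list_of_lists}
    if len(lengths) != 1 or max(lengths) > max_len:
        return False
    # Rule 3: every value is an integer with 0 <= value < n.
    return all(
        isinstance(v, int) and 0 <= v < n
        for sub in list_of_lists
        for v in sub
    )


print(is_valid([[0]]))                              # True
print(is_valid([[2, 1], [2, 0], [3, 1], [1, 0]]))   # True
print(is_valid([[2]]))                              # False
print(is_valid([[0, 1], [2, 0]]))                   # False
```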
I'm trying to approach this with property-based testing, generating different valid inputs with the hypothesis library. I'm trying to wrap my head around lists() and integers(), but cannot make it work:
condition #1 is easy to handle with lists() and its min_size and max_size arguments
condition #2 is covered under "Chaining strategies together" in the documentation
condition #3 is what I'm struggling with, because if we use rectangle_lists from that example, we don't have a reference to the length of the "parent" list inside integers()
The Question:
How can I limit the integer values inside sublists to be less than the total number of sublists?
Some of my attempts:
from hypothesis import given
from hypothesis.strategies import lists, integers
@given(lists(lists(integers(min_value=0, max_value=5), min_size=1, max_size=5), min_size=1, max_size=50))
def test(l):
    ...  # test body omitted
This one was very far from meeting the requirements - list is not strictly of a rectangular form and generated integer values can go over the generated size of the list.
from hypothesis import given
from hypothesis.strategies import lists, integers
@given(integers(min_value=0, max_value=5).flatmap(lambda n: lists(lists(integers(min_value=1, max_value=5), min_size=n, max_size=n), min_size=1, max_size=50)))
def test(l):
    ...  # test body omitted
Here, requirements #1 and #2 are met, but the integer values can be larger than the size of the list, so requirement #3 is not met.
There's a good general technique that is often useful when trying to solve tricky constraints like this: try to build something that looks a bit like what you want but doesn't satisfy all the constraints and then compose it with a function that modifies it (e.g. by throwing away the bad bits or patching up bits that don't quite work) to make it satisfy the constraints.
For your case, you could do something like the following:
from hypothesis.strategies import builds, lists, integers
def prune_list(ls):
    n = len(ls)
    return [
        [i for i in sublist if i < n][:5]
        for sublist in ls
    ]
# Note: average_size was deprecated and later removed from Hypothesis;
# omit it on current versions.
limited_list_strategy = builds(
    prune_list,
    lists(lists(integers(0, 49), average_size=5), max_size=50, min_size=1)
)
In this we:
1. Generate a list that looks roughly right (it's a list of lists of integers, and the integers are in the same range as all possible indices that could be valid).
2. Prune out any invalid indices from the sublists.
3. Truncate any sublists that still have more than 5 elements in them.
The result should satisfy all three conditions you needed.
The average_size parameter isn't strictly necessary but in experimenting with this I found it was a bit too prone to producing empty sublists otherwise.
ETA: Apologies. I've just realised that I misread one of your conditions - this doesn't actually do quite what you want because it doesn't ensure each list is the same length. Here's a way to modify this to fix that (it gets a bit more complicated, so I've switched to using composite instead of builds):
from hypothesis.strategies import composite, lists, integers, permutations
@composite
def limited_lists(draw):
    # Note: average_size was deprecated and later removed from Hypothesis;
    # omit it on current versions.
    ls = draw(
        lists(lists(integers(0, 49), average_size=5), max_size=50, min_size=1)
    )
    filler = draw(permutations(range(50)))
    sublist_length = draw(integers(0, 5))
    n = len(ls)
    pruned = [
        [i for i in sublist if i < n][:sublist_length]
        for sublist in ls
    ]
    for sublist in pruned:
        for i in filler:
            if len(sublist) == sublist_length:
                break
            elif i < n:
                sublist.append(i)
    return pruned
The idea is that we generate a "filler" list that provides the defaults for what a sublist looks like (so they will tend to shrink in the direction of being more similar to each other), and then draw the length that the sublists are pruned to in order to get that consistency.
This has got pretty complicated I admit. You might want to use RecursivelyIronic's flatmap based version. The main reason I prefer this over that is that it will tend to shrink better, so you'll get nicer examples out of it.
You can also do this with flatmap, though it's a bit of a contortion.
from hypothesis import strategies as st
from hypothesis import given, settings
number_of_lists = st.integers(min_value=1, max_value=50)
list_lengths = st.integers(min_value=0, max_value=5)
def build_strategy(number_and_length):
    number, length = number_and_length
    list_elements = st.integers(min_value=0, max_value=number - 1)
    return st.lists(
        st.lists(list_elements, min_size=length, max_size=length),
        min_size=number, max_size=number)
mystrategy = st.tuples(number_of_lists, list_lengths).flatmap(build_strategy)
@settings(max_examples=5000)
@given(mystrategy)
def test_constraints(list_of_lists):
    N = len(list_of_lists)
    # Condition 1
    assert 1 <= N <= 50
    # Condition 2
    [length] = set(map(len, list_of_lists))
    assert 0 <= length <= 5
    # Condition 3
    assert all((0 <= element < N) for lst in list_of_lists for element in lst)
As David mentioned, this does tend to produce a lot of empty lists, so some average size tuning would be required.
>>> mystrategy.example()
[[24, 6, 4, 19], [26, 9, 15, 15], [1, 2, 25, 4], [12, 8, 18, 19], [12, 15, 2, 31], [3, 8, 17, 2], [5, 1, 1, 5], [7, 1, 16, 8], [9, 9, 6, 4], [22, 24, 28, 16], [18, 11, 20, 21], [16, 23, 30, 5], [13, 1, 16, 16], [24, 23, 16, 32], [13, 30, 10, 1], [7, 5, 14, 31], [31, 15, 23, 18], [3, 0, 13, 9], [32, 26, 22, 23], [4, 11, 20, 10], [6, 15, 32, 22], [32, 19, 1, 31], [20, 28, 4, 21], [18, 29, 0, 8], [6, 9, 24, 3], [20, 17, 31, 8], [6, 12, 8, 22], [32, 22, 9, 4], [16, 27, 29, 9], [21, 15, 30, 5], [19, 10, 20, 21], [31, 13, 0, 21], [16, 9, 8, 29]]
>>> mystrategy.example()
[[28, 18], [17, 25], [26, 27], [20, 6], [15, 10], [1, 21], [23, 15], [7, 5], [9, 3], [8, 3], [3, 4], [19, 29], [18, 11], [6, 6], [8, 19], [14, 7], [25, 3], [26, 11], [24, 20], [22, 2], [19, 12], [19, 27], [13, 20], [16, 5], [6, 2], [4, 18], [10, 2], [26, 16], [24, 24], [11, 26]]
>>> mystrategy.example()
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]
>>> mystrategy.example()
[[], [], [], [], [], [], [], [], [], [], [], [], [], [], []]
>>> mystrategy.example()
[[6, 8, 22, 21, 22], [3, 0, 24, 5, 18], [16, 17, 25, 16, 11], [2, 12, 0, 3, 15], [0, 12, 12, 12, 14], [11, 20, 6, 6, 23], [5, 19, 2, 0, 12], [16, 0, 1, 24, 10], [2, 13, 21, 19, 15], [2, 14, 27, 6, 7], [22, 25, 18, 24, 9], [26, 21, 15, 18, 17], [7, 11, 22, 17, 21], [3, 11, 3, 20, 16], [22, 13, 18, 21, 11], [4, 27, 21, 20, 25], [4, 1, 13, 5, 13], [16, 19, 6, 6, 25], [19, 10, 14, 12, 14], [18, 13, 13, 16, 3], [12, 7, 26, 26, 12], [25, 21, 12, 23, 22], [11, 4, 24, 5, 27], [25, 10, 10, 26, 27], [8, 25, 20, 6, 23], [8, 0, 12, 26, 14], [7, 11, 6, 27, 26], [6, 24, 22, 23, 19]]
Pretty late, but for posterity: the easiest solution is to pick dimensions, then build up from the element strategy.
from hypothesis.strategies import composite, integers, lists
@composite
def complicated_rectangles(draw, max_N):
    list_len = draw(integers(1, max_N))
    sublist_len = draw(integers(0, 5))
    # Condition 3: every element must satisfy 0 <= element < list_len.
    element_strat = integers(0, list_len - 1)
    sublist_strat = lists(
        element_strat, min_size=sublist_len, max_size=sublist_len)
    return draw(lists(
        sublist_strat, min_size=list_len, max_size=list_len))
Given a numpy 3-d array
[[[1][4]][[7][10]]]
let's say the first row is 1 4 and the second row is 7 10. If I have a multiplier of 3, the first through third rows would become 1 1 1 4 4 4 and the 4th through 6th rows would become 7 7 7 10 10 10, that is:
[[[1][1][1][4][4][4]][[1][1][1][4][4][4]][[1][1][1][4][4][4]][[7][7][7][10][10][10]][[7][7][7][10][10][10]][[7][7][7][10][10][10]]]
Is there a quick way to do this in numpy? The actual array I'm using has 3 or 4 elements instead of 1 at the bottom level so [1][1][1] could be [1,8,7][1,8,7][1,8,7] instead, but I simplified it here.
numpy.repeat sounds like what you want.
Here are some examples:
>>> a = numpy.array( [[[1,2,3],[4,5,6]], [[10,20,30],[40,50,60]]] )
>>> a
array([[[ 1,  2,  3],
        [ 4,  5,  6]],
       [[10, 20, 30],
        [40, 50, 60]]])
>>>
>>> a.repeat( 3, axis=0 )
array([[[ 1,  2,  3],
        [ 4,  5,  6]],
       [[ 1,  2,  3],
        [ 4,  5,  6]],
       [[ 1,  2,  3],
        [ 4,  5,  6]],
       [[10, 20, 30],
        [40, 50, 60]],
       [[10, 20, 30],
        [40, 50, 60]],
       [[10, 20, 30],
        [40, 50, 60]]])
>>>
>>> a.repeat( 3, axis=1 )
array([[[ 1,  2,  3],
        [ 1,  2,  3],
        [ 1,  2,  3],
        [ 4,  5,  6],
        [ 4,  5,  6],
        [ 4,  5,  6]],
       [[10, 20, 30],
        [10, 20, 30],
        [10, 20, 30],
        [40, 50, 60],
        [40, 50, 60],
        [40, 50, 60]]])
>>>
>>> a.repeat( 3, axis=2 )
array([[[ 1,  1,  1,  2,  2,  2,  3,  3,  3],
        [ 4,  4,  4,  5,  5,  5,  6,  6,  6]],
       [[10, 10, 10, 20, 20, 20, 30, 30, 30],
        [40, 40, 40, 50, 50, 50, 60, 60, 60]]])
Depending on the desired shape of your output, you may wish to chain multiple calls to .repeat() with different axis values.
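For the array in the question, chaining two calls looks like it gives the layout described there: repeat along axis 0 to multiply the rows, then along axis 1 to multiply the entries within each row. A minimal sketch, assuming the innermost axis (length 1 here, 3 or 4 in the real data) should be left alone; `k` is just a name for the multiplier.

```python
import numpy as np

a = np.array([[[1], [4]],
              [[7], [10]]])   # shape (2, 2, 1)
k = 3                         # the multiplier

# Repeat rows first, then the entries inside each row.
out = a.repeat(k, axis=0).repeat(k, axis=1)   # shape (6, 6, 1)

print(out.shape)
print(out[:, :, 0])   # drop the last axis only to make the printout compact
```

The first three rows of `out[:, :, 0]` read 1 1 1 4 4 4 and the last three read 7 7 7 10 10 10.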