Extract array values that are nearly identical - python

I have this numpy array:
a = np.array([[8.04,9], [2.02,3], [8,10], [2,3], [8.12,18], [8.04,18],[2,8],[11,14]])
From this array, I would like to find nearly identical rows (differing by no more than 0.05 in the first column AND no more than 1 in the second column) and create new sub-arrays.
For this example, this would give 6 different arrays (which could be parts of a larger array).
a1 = [[8.04,9],[8,10]]
a2 = [[2.02,3],[2,3]]
a3 = [8.12,18]
a4 = [8.04,18]
a5 = [2,8]
a6 = [11,14]
Is there a way to do that?
Best

Here's a simple method:
for pair in a:
    # rows whose first column is within 0.05 and second column within 1 of this pair
    cond1 = np.isclose(a[:, 0], pair[0], atol=0.05)
    cond2 = np.isclose(a[:, 1], pair[1], atol=1)
    print(a[cond1 & cond2])
With deduplication:
done = np.zeros(len(a), bool)
for ii, pair in enumerate(a):
    if done[ii]:
        continue
    cond = np.isclose(a[:, 0], pair[0], atol=0.05)
    cond &= np.isclose(a[:, 1], pair[1], atol=1)
    print(a[cond])
    done |= cond  # mark every row of this group as handled

The OP asks for grouping pairs, not simply printing them, so the solution proposed by John Zwink is incomplete. To get the complete answer, the idea is to convert the ndarrays into a hashable equivalent (e.g. tuples) and combine them all into a set to avoid duplication. Here:
import numpy as np

a = np.array([[8.04, 9], [2.02, 3], [8, 10], [2, 3], [8.12, 18], [8.04, 18], [2, 8], [11, 14]])
groups = set()
for pair in a:
    cond1 = np.isclose(a[:, 0], pair[0], atol=0.05)
    cond2 = np.isclose(a[:, 1], pair[1], atol=1.000000001)
    groups.add(tuple(map(tuple, a[cond1 & cond2])))
print(groups)
Result:
{((8.12, 18.0),),
((8.04, 18.0),),
((2.02, 3.0), (2.0, 3.0)),
((11.0, 14.0),),
((8.04, 9.0), (8.0, 10.0)),
((2.0, 8.0),)}
Note: I added an arbitrary epsilon to the second tolerance to get the same grouping as requested in the OP.
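If you need the actual sub-arrays a1, a2, ... from the question rather than a set of tuples, each group can be converted back to a NumPy array (a small addition, not part of the original answer):
# assumes `groups` from the snippet above; note that set ordering is arbitrary
subarrays = [np.array(g) for g in groups]
for sub in subarrays:
    print(sub)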

Related

Pandas, get all possible value combinations of length k grouped by feature

I have a Pandas dataframe something like:
Feature A    Feature B    Feature C
A1           B1           C1
A2           B2           C2
Given k as input, I want all value combinations of length k, grouped by feature; for example, for k = 2 I want:
[{A:A1, B:B1},
{A:A1, B:B2},
{A:A1, C:C1},
{A:A1, C:C2},
{A:A2, B:B1},
{A:A2, B:B2},
{A:A2, C:C1},
{A:A2, C:C2},
{B:B1, C:C1},
{B:B1, C:C2},
{B:B2, C:C1},
{B:B2, C:C2}]
How can I achieve that?
This is probably not that efficient, but it works at small scale.
First, determine the unique combinations of k columns.
from itertools import combinations

k = 2
cols = list(combinations(df.columns, k))
Then use MultiIndex.from_product to get the cartesian product of each group of k columns.
import pandas as pd

result = []
for c in cols:
    result += pd.MultiIndex.from_product([df[x] for x in c]).values.tolist()
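If the output should be the list of dicts keyed by feature name, as in the question, here is a minimal sketch using itertools.product directly (my own variation, not the answer's exact approach):
from itertools import combinations, product

import pandas as pd

df = pd.DataFrame({"Feature A": ["A1", "A2"],
                   "Feature B": ["B1", "B2"],
                   "Feature C": ["C1", "C2"]})
k = 2

# For every combination of k columns, take the cartesian product of their
# unique values and label each value with its column name.
result = []
for cols in combinations(df.columns, k):
    for values in product(*(df[c].unique() for c in cols)):
        result.append(dict(zip(cols, values)))

print(result)
# [{'Feature A': 'A1', 'Feature B': 'B1'}, {'Feature A': 'A1', 'Feature B': 'B2'}, ...]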

How to use a dictionary to speed up the task of look up and counting?

Consider the following snippet:
data = {"col1":["aaa","bbb","ccc","aaa","ddd","bbb"],
"col2":["fff","aaa","ggg","eee","ccc","ttt"]}
df = pd.DataFrame(data,columns=["col1","col2"]) # my actual dataframe has
# 20,00,000 such rows
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
Given a huge dataframe and 2 lists, I want to count the number of elements of new_list that appear in the same row of the dataframe. In the above pseudo example, the result would be 3, as "aaa-fff", "ccc-ggg", and "ddd-ccc" appear in rows of the dataframe.
Right now, I am using a linear search algorithm, but it is very slow as I have to scan through the entire dataframe.
df['col3'] = df['col1'] + "-" + df['col2']

c1 = 0
for a in list_a:
    for b in list_b:
        str1 = a + "-" + b
        str2 = b + "-" + a
        c2 = df['col3'].str.contains(str1).sum() + df['col3'].str.contains(str2).sum()
        c1 += c2
Can someone kindly help me implement a faster algorithm, preferably with a dictionary data structure?
Note: I have to iterate through the 7,000 rows of another dataframe and create the 2 lists dynamically, and get an aggregate count for each row.
Here is another way. First, I used your definition of df (with 2 columns), list_a and list_b.
# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']
# create a set with list_a and list_b pairs
s = ({f'{a}-{b}' for a, b in zip(list_a, list_b)} |
     {f'{b}-{a}' for a, b in zip(list_a, list_b)})
# find the intersection
result = set(df['col3']) & s
print(len(result), '\n', result)
3
{'ddd-ccc', 'ccc-ggg', 'aaa-fff'}
UPDATE to handle duplicate values.
# build a list (not a set) from list_a and list_b
idx = ([f'{a}-{b}' for a, b in zip(list_a, list_b)] +
       [f'{b}-{a}' for a, b in zip(list_a, list_b)])
# create `col3`, and use `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()
# use idx to sub-select from the value counts:
tmp[tmp.index.isin(idx)]
# results:
ddd-ccc    1
aaa-fff    1
ccc-ggg    1
Name: col3, dtype: int64
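If only the aggregate count is needed, summing the filtered value counts gives it (a small addition on top of the answer above, not part of the original):
# total number of matching rows, duplicates included
total = tmp[tmp.index.isin(idx)].sum()
print(total)  # 3 for the example data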
Try this:
from itertools import product
# all combinations of the two lists as tuples
all_list_combinations = list(product(list_a, list_b))
# tuples of the two columns
dftuples = [x for x in df.itertuples(index=False, name=None)]
# take the length of the intersection of the two sets and print it
print(len(set(dftuples).intersection(set(all_list_combinations))))
yields
3
First join the columns outside the loop; then, instead of looping, pass a single regex with all possible strings to str.contains.
joined = df.col1 + '-' + df.col2
pat = '|'.join([f'({a}-{b})' for a in list_a for b in list_b] +
               [f'({b}-{a})' for a in list_a for b in list_b])  # substitute for itertools.product
ct = joined.str.contains(pat).sum()
To work with dicts instead of dataframes, you can use filter(re, joined) as in this question
import itertools
import re

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

### build the regex pattern
# use itertools to generalize to many columns, remove duplicates with set()
pat_set = set('-'.join(combo) for combo in set(
    list(itertools.product(list_a, list_b)) +
    list(itertools.product(list_b, list_a))))
pat = '|'.join(pat_set)

### join the columns row-wise
joined = ['-'.join(row) for row in zip(*[vals for key, vals in data.items()])]

### filter joined
match_list = list(filter(re.compile(pat).match, joined))
ct = len(match_list)
A third option, with Series.isin(), inspired by jsmart's answer:
joined = df.col1 + '-' + df.col2
ct = joined.isin(pat_set).sum()
Speed testing
I repeated the data 100,000 times for scalability testing. Series.isin() wins the day, while jsmart's answer is fast but does not find all occurrences because it removes duplicates from joined:
with dicts: 400000 matches, 1.00 s
with pandas: 400000 matches, 1.77 s
with series.isin(): 400000 matches, 0.39 s
with jsmart answer: 4 matches, 0.50 s
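For reference, a rough sketch of the kind of timing harness used here (my own reconstruction, assuming the df and pat_set defined above; the original benchmark code is not shown):
import time

import pandas as pd

# Blow up the small example frame to stress-test the approach.
big = pd.concat([df] * 100_000, ignore_index=True)
joined = big['col1'] + '-' + big['col2']

start = time.perf_counter()
ct = joined.isin(pat_set).sum()
print(ct, 'matches in', round(time.perf_counter() - start, 2), 's')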

Sorting by absolute value of difference between two columns in Python

I have 2 large arrays of integers, A and B.
Now I have to sort A and B in decreasing order of |A[i] - B[i]|.
Example:
A = {16, 5}
B = {14, 1}
|A[i] - B[i]| = {2, 4}
So sorted A = {5, 16}
and sorted B = {1, 14}
The arrays can contain many more than 2 integers.
Let's do it with NumPy!
import numpy as np

A = np.random.randint(0, 100, 100)  # or np.array(A) to convert from a list
B = np.random.randint(0, 100, 100)

diff = np.abs(A - B)                # your sort order
sortidx = np.argsort(diff)[::-1]    # indexes that sort the data by decreasing diff
print(A[sortidx])                   # A sorted in decreasing order of abs(A-B)
print(B[sortidx])
If you prefer to do it without NumPy (see Equivalent of Numpy.argsort() in basic python?):
import operator

diff = list(map(operator.sub, A, B))
sortidx = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
print([A[idx] for idx in sortidx])
print([B[idx] for idx in sortidx])

Easiest way to create a NumPy record array from a list of dictionaries?

Say I have data like d = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)] (basically JSON, where all entries have consistent data types).
In Pandas you can make this a table with df = pandas.DataFrame(d) -- is there something comparable for plain NumPy record arrays? np.rec.fromrecords(d) doesn't seem to give me what I want.
You could make an empty structured array of the right size and dtype, and then fill it from the list.
http://docs.scipy.org/doc/numpy/user/basics.rec.html
Structured arrays can be filled by field or row by row.
...
If you fill it in row by row, it takes a tuple (but not a list or array!):
In [72]: dt = np.dtype([('weight', int), ('animal', 'S10')])
In [73]: values = [tuple(each.values()) for each in d]
In [74]: values
Out[74]: [(5, 'cat'), (20, 'dog')]
The fields in dt occur in the same order as in values.
In [75]: a = np.zeros((2,), dtype=dt)
In [76]: a[:] = [tuple(each.values()) for each in d]
In [77]: a
Out[77]:
array([(5, 'cat'), (20, 'dog')],
      dtype=[('weight', '<i4'), ('animal', 'S10')])
With a bit more testing I found I can create the array directly from values.
In [83]: a = np.array(values, dtype=dt)
In [84]: a
Out[84]:
array([(5, 'cat'), (20, 'dog')],
      dtype=[('weight', '<i4'), ('animal', 'S10')])
The dtype could be deduced from one (or more) of the dictionary items:
def gettype(v):
    if isinstance(v, int):
        return 'int'
    elif isinstance(v, float):
        return 'float'
    else:
        assert isinstance(v, str)
        return '|S%s' % (len(v) + 10)

d0 = d[0]
names = list(d0.keys())
formats = [gettype(v) for v in d0.values()]
dt = np.dtype({'names': names, 'formats': formats})
producing:
dtype=[('weight', '<i4'), ('animal', 'S13')]
Well, you could make your life extra easy and just rely on Pandas, since NumPy doesn't use column headers.
Pandas
df = pandas.DataFrame(d)
numpyMatrix = df.as_matrix()  # spits out a numpy array (as_matrix() is gone in newer pandas; use df.to_numpy())
Or you can ignore Pandas and use NumPy + a list comprehension to knock the dicts down to their values and store them as a matrix.
Numpy
numpyMatrix = numpy.matrix([list(each.values()) for each in d])
My proposal (essentially a slightly improved version of hpaulj's answer):
dicts = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]
Creation of the dtype object:
dt_tuples = []
for key, value in dicts[0].items():
    if not isinstance(value, str):
        value_dtype = np.array([value]).dtype
    else:
        value_dtype = '|S{}'.format(max([len(d[key]) for d in dicts]))
    dt_tuples.append((key, value_dtype))
dt = np.dtype(dt_tuples)
As you can see, there is a problem with string handling: we need to check the maximum length of the values to define the dtype. This additional condition can be skipped if you do not have string values in your dicts, or if you are sure that all those values have exactly the same length.
If you're looking for a one-liner, it would be something like this:
dt = np.dtype([(k, np.array([v]).dtype if not isinstance(v, str)
                else '|S{}'.format(max([len(d[k]) for d in dicts])))
               for k, v in dicts[0].items()])
(still, it's probably better to break it up for readability).
Values list:
values = [tuple(d[name] for name in dt.names) for d in dicts]
Because we iterate over dt.names, we are sure that the order of the values is correct.
And, at the end, array creation:
a = np.array(values, dtype=dt)
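Putting the pieces together, a minimal sketch of a reusable helper (dicts_to_structured is my own name, and I use a unicode 'U' dtype instead of '|S' bytes since the inputs are Python 3 strings):
import numpy as np

def dicts_to_structured(dicts):
    # Deduce the dtype from the first dict; strings get a width equal to
    # the longest value seen for that key across all dicts.
    dt_tuples = []
    for key, value in dicts[0].items():
        if isinstance(value, str):
            dt_tuples.append((key, 'U{}'.format(max(len(d[key]) for d in dicts))))
        else:
            dt_tuples.append((key, np.array([value]).dtype))
    dt = np.dtype(dt_tuples)
    # Iterate over dt.names so each tuple's order matches the dtype.
    values = [tuple(d[name] for name in dt.names) for d in dicts]
    return np.array(values, dtype=dt)

d = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]
print(dicts_to_structured(d))
# [('cat', 5) ('dog', 20)]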

Merging python lists based on a 'similar' float value

I have a list (containing tuples) and I want to merge its items when the first elements are within a maximum distance of each other (i.e. if the delta is < 0.05). I have the following list as an example:
[(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
This should yield something like:
[(0.0, 1.883659017),(1.00422, 0.9998252466431066),(2.00425,0.9951777494430947)]
I am thinking that I can use something similar to this question (Merge nested list items based on a repeating value), although a lot of other questions yield a similar answer. The only problem I see there is that they use collections.defaultdict or itertools.groupby, which require exact matching of the element. An important addition here is that I want the first element of a merged tuple to be the weighted mixture of the merged elements; for example:
If (1.001, 80) and (0.99, 20) are matched, the result should be (0.9988, 100).
Is something similar possible, but with the matching based on value difference rather than an exact match?
What I was trying myself (though I don't really like the look of it) is:
Res = 0.05
combinations = itertools.combinations(my_list, 2)
for i in combinations:
    if i[1][0] - Res < i[0][0] < i[1][0] + Res:
        newValue = ...
-- UPDATE --
Based on some comments and Dawg's answer, I tried the following approach:
for fv, v in total:
    k = round(fv, 2)
    data[k] = data.get(k, 0) + v
using the following list (an actual data example, instead of the short example list):
total = [(0.0, 0.11630591852564721), (1.00335, 0.25158664272201053), (2.0067, 0.2707487305913156), (3.0100499999999997, 0.19327075057473678), (4.0134, 0.10295042331357719), (5.01675, 0.04364856520231155), (6.020099999999999, 0.015342958201863783), (0.0, 0.9811758192941256), (1.00422, 0.018649427348981), (0.0, 0.9024831978342827), (2.00425, 0.09269455160881204), (0.0, 0.6944298762418107), (0.99703, 0.2536959281304138), (1.99406, 0.045877927988415786)]
which then yields problems with values such as 2.0067 (rounded to 2.01) and 1.99406 (rounded to 1.99), where the total difference is 0.01264 (far below 0.05, the value I had in mind as a 'limit' for now, though it should be configurable). Rounding the values to 1 decimal place is also not an option, since that would result in a window of ~0.09 with values such as 2.04999 and 1.95001, which would both yield 2.0 in that case.
The exact output was:
{0.0: 2.694394811895866, 1.0: 0.5239319982014053, 4.01: 0.10295042331357719, 5.02: 0.04364856520231155, 2.0: 0.09269455160881204, 1.99: 0.045877927988415786, 3.01: 0.19327075057473678, 6.02: 0.015342958201863783, 2.01: 0.2707487305913156}
Here is a greedy approach that uses each data point only once and merges everything whose key is within EPSILON of it, accumulating the weighted average of the keys and the sum of the values:
accum = list()
data = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
EPSILON = 0.05

newdata = {d: True for d in data}
for k, v in data:
    if not newdata[(k, v)]:
        continue
    newdata[(k, v)] = False   # use each piece of data only once
    keys, values = [k * v], [v]
    for kk, vv in [d for d in data if newdata[d]]:
        if abs(k - kk) < EPSILON:
            keys.append(kk * vv)
            values.append(vv)
            newdata[(kk, vv)] = False
    accum.append((sum(keys) / sum(values), sum(values)))
You can round the float values then use setdefault:
li = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]

data = {}
for fv, v in li:
    k = round(fv, 5)
    data.setdefault(k, 0)
    data[k] += v

print(data)
# {0.0: 1.8836590171284082, 2.00425: 0.9951777494430947, 1.00422: 0.9998252466431066}
If you want some more complex comparison (other than fixed rounding) you can create a hashable object based on the epsilon value you want and use the same method from there.
As pointed out in the comments, this works too:
data = {}
for fv, v in li:
    k = round(fv, 5)
    data[k] = data.get(k, 0) + v
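Building on the suggestion above about a more complex comparison, here is a minimal sketch of epsilon-based grouping with a weighted average of the keys (the helper name and EPS value are my own; tuples are merged into the previous group when the key is within EPS of that group's running average):
EPS = 0.05

def merge_within_eps(pairs, eps=EPS):
    merged = []
    for key, weight in sorted(pairs):  # sort by the float key
        if merged and abs(key - merged[-1][0]) < eps:
            prev_key, prev_weight = merged[-1]
            total = prev_weight + weight
            # weighted mixture of the keys, summed weights
            merged[-1] = ((prev_key * prev_weight + key * weight) / total, total)
        else:
            merged.append((key, weight))
    return merged

li = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066),
      (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
print(merge_within_eps(li))
# [(0.0, 1.8836590171284083), (1.00422, 0.9998252466431066), (2.00425, 0.9951777494430947)]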
