Pandas groupby.ngroup() in index order? - python

Pandas' groupby ngroup() function tags each row with its group number, numbering the groups in sorted (group) order.
I'm looking for similar behaviour, but I need the tags assigned in original (index) order. How can I do this efficiently (it will happen often with large arrays) in pandas and numpy?
> df = pd.DataFrame(
{"A": [9,8,7,8,9]},
index=list("abcde"))
A
a 9
b 8
c 7
d 8
e 9
> df.groupby("A").ngroup()
a 2
b 1
c 0
d 1
e 2
# LOOKING FOR ###################
a 0
b 1
c 2
d 1
e 0
How can I achieve the desired output with a one-dimensional numpy array?
arr = np.array([9,8,7,8 ,9])
# looking for [0,1,2,1,0]

Perhaps a better way is factorize:
df['A'].factorize()[0]
Output:
array([0, 1, 2, 1, 0])
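factorize also works directly on a bare numpy array, which covers the numpy half of the question (a sketch; by default pd.factorize assigns codes in order of first appearance):
import numpy as np
import pandas as pd
arr = np.array([9, 8, 7, 8, 9])
codes, uniques = pd.factorize(arr)  # codes numbered in order of first appearance
print(codes)    # [0 1 2 1 0]
print(uniques)  # [9 8 7]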

You can use np.unique -
In [105]: a = np.array([9,8,7,8,9])
In [106]: u,idx,tags = np.unique(a, return_index=True, return_inverse=True)
In [107]: idx.argsort().argsort()[tags]
Out[107]: array([0, 1, 2, 1, 0])
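To see why the double argsort works, here is the same example with the intermediate values spelled out (my annotation, not part of the original answer):
import numpy as np
a = np.array([9, 8, 7, 8, 9])
u, idx, tags = np.unique(a, return_index=True, return_inverse=True)
# u    = [7 8 9]      unique values, in sorted order
# idx  = [2 1 0]      index of the first occurrence of each unique value
# tags = [2 1 0 1 2]  group id of each element, numbered in sorted order
# idx.argsort().argsort() ranks each sorted-unique value by where it first
# appears, i.e. it remaps sorted-order ids to first-appearance-order ids:
remap = idx.argsort().argsort()  # here [2 1 0]
print(remap[tags])               # [0 1 2 1 0]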

You can pass sort=False to groupby():
df.groupby('A', sort=False).ngroup()
a 0
b 1
c 2
d 1
e 0
dtype: int64
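For the bare numpy array from the question, one option (just a sketch) is to wrap it in a Series so the same trick applies:
import numpy as np
import pandas as pd
arr = np.array([9, 8, 7, 8, 9])
s = pd.Series(arr)
print(s.groupby(s, sort=False).ngroup().values)  # [0 1 2 1 0]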
As far as I can tell, there isn't a direct equivalent of groupby in numpy. For a pure numpy version, you can use numpy.unique() to get the unique values. numpy.unique() has the option to return the inverse, basically the array of indices that would recreate your input array, but it sorts the unique values first, so the result is the same as using the regular (sorted) pandas.groupby() command.
To get around this, you can capture the index values of the first occurrence of each unique value. Sort the index values and use these as indices into the original array to get the unique values in their original order. Create a dictionary to map between the unique values and the group numbers and then use that dictionary to convert the values in the array to the appropriate group numbers.
import numpy as np
arr = np.array([9, 8, 7, 8, 9])
_, i = np.unique(arr, return_index=True)  # indexes of the first occurrence of each unique value
groups = arr[np.sort(i)]  # sort the indexes and index back into arr: the unique values in original order
m = {value: ngroup for ngroup, value in enumerate(groups)}  # map value -> group number
np.vectorize(m.get)(arr)  # build the output array by applying the mapping element-wise
array([0, 1, 2, 1, 0])
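np.vectorize(m.get) still loops in Python under the hood. As an alternative (my sketch, not part of the answer above), the dictionary can be replaced with np.searchsorted so the whole mapping stays in numpy:
import numpy as np
arr = np.array([9, 8, 7, 8, 9])
u, i = np.unique(arr, return_index=True)
ngroups = np.empty(len(u), dtype=int)
ngroups[np.argsort(i)] = np.arange(len(u))  # group number for each sorted-unique value
print(ngroups[np.searchsorted(u, arr)])     # [0 1 2 1 0]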

I've benchmarked the suggested solutions. It turns out that:
- factorize is the fastest for array sizes > 10³,
- unique-argsort is the fastest for array sizes < 10³ (but slower by a factor of ~10 for larger ones),
- ngroup is always slower, but for array sizes > 3*10³ it is roughly as fast as factorize.
from contextlib import contextmanager
from time import perf_counter as clock
from itertools import count

import numpy as np
import pandas as pd

def f1(s):
    return s.factorize()[0]

def f2(s):
    return s.groupby(s, sort=False).ngroup().values

def f3(s):
    u, idx, tags = np.unique(s.values, return_index=True, return_inverse=True)
    return idx.argsort().argsort()[tags]

@contextmanager
def bench(r):
    t1 = clock()
    yield
    t2 = clock()
    r.append(t2 - t1)

res = []
for i in count():
    n = 2**i
    a = np.random.randint(0, n, n)
    s = pd.Series(a)
    rr = []
    for j in range(5):
        r = []
        with bench(r):
            a1 = f1(s)
        with bench(r):
            a2 = f2(s)
        with bench(r):
            a3 = f3(s)
        rr.append(r)
        if max(r) > 0.5:
            break
    res.append(np.min(rr, axis=0))
    if np.max(rr) > 0.4:
        break

np.save('results.npy', np.array(res))
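To inspect the saved timings, something like the following should work (a sketch; it assumes the column order f1=factorize, f2=ngroup, f3=unique-argsort from the loop above):
import numpy as np
import matplotlib.pyplot as plt
res = np.load('results.npy')
sizes = 2 ** np.arange(len(res))
plt.loglog(sizes, res)
plt.legend(['factorize', 'ngroup', 'unique-argsort'])
plt.xlabel('array size')
plt.ylabel('best-of-5 time [s]')
plt.show()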

Related

numpy indexing does not sum inplace if duplicate indices [duplicate]

I have a Numpy array and a list of indices whose values I would like to increment by one. This list may contain repeated indices, and I would like the increment to scale with the number of repeats of each index. Without repeats, the command is simple:
a=np.zeros(6).astype('int')
b=[3,2,5]
a[b]+=1
With repeats, I've come up with the following method.
b=[3,2,5,2] # indices to increment by one each replicate
bbins=np.bincount(b)
b.sort() # sort b because bincount is sorted
incr=bbins[np.nonzero(bbins)] # create increment array
bu=np.unique(b) # sorted, unique indices (len(bu)=len(incr))
a[bu]+=incr
Is this the best way? Is there a risk involved in assuming that the np.bincount and np.unique operations will produce the same sorted order? Am I missing a simple NumPy operation that solves this?
In numpy >= 1.8, you can also use the at method of the addition 'universal function' ('ufunc'). As the docs note:
For addition ufunc, this method is equivalent to a[indices] += b, except that results are accumulated for elements that are indexed more than once.
So taking your example:
a = np.zeros(6).astype('int')
b = [3, 2, 5, 2]
…to then…
np.add.at(a, b, 1)
…will leave a as…
array([0, 0, 2, 1, 0, 1])
After you do
bbins=np.bincount(b)
why not do:
a[:len(bbins)] += bbins
(Edited for further simplification.)
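Putting that together with the example arrays from the question (a sketch):
import numpy as np
a = np.zeros(6, dtype=int)
b = [3, 2, 5, 2]
bbins = np.bincount(b)   # [0 0 2 1 0 1]
a[:len(bbins)] += bbins
print(a)                 # [0 0 2 1 0 1]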
If b is a small subrange of a, one can refine Alok's answer like this:
import numpy as np
a = np.zeros( 100000, int )
b = np.array( [99999, 99997, 99999] )
blo, bhi = b.min(), b.max()
bbins = np.bincount( b - blo )
a[blo:bhi+1] += bbins
print(a[blo:bhi+1])  # [1 0 2]

How to sort a NumPy array by frequency?

I am attempting to sort a NumPy array by the frequency of its elements. So for example, given the array [3,4,5,1,2,4,1,1,2,4], the output would be another NumPy array sorted from most common to least common element (no duplicates): [4,1,2,3,5]. If two elements have the same number of occurrences, the element that appears first is placed first in the output. I have tried doing this, but I can't seem to get a working answer. Here is my code so far:
temp1 = problems[j]
indexes = np.unique(temp1, return_index = True)[1]
temp2 = temp1[np.sort(indexes)]
temp3 = np.unique(temp1, return_counts = True)[1]
temp4 = np.argsort(temp3)[::-1] + 1
where problems[j] is a NumPy array like [3,4,5,1,2,4,1,1,2,4]. temp4 returns [4,1,2,5,3] so far, but it is not correct because it can't handle ties (two elements with the same number of occurrences).
You can use argsort on the frequency of each element to find the sorted positions, then apply those indexes to the unique-element array:
unique_elements, frequency = np.unique(array, return_counts=True)
sorted_indexes = np.argsort(frequency)[::-1]
sorted_by_freq = unique_elements[sorted_indexes]
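Note that argsort alone does not implement the tie-breaking the question asks for (first appearance wins on equal counts). A variant that handles ties explicitly, sketched with np.lexsort and return_index:
import numpy as np
array = np.array([3, 4, 5, 1, 2, 4, 1, 1, 2, 4])
unique_elements, first_idx, frequency = np.unique(
    array, return_index=True, return_counts=True)
# Primary key: descending frequency; secondary key: first occurrence.
order = np.lexsort((first_idx, -frequency))
print(unique_elements[order])  # [4 1 2 3 5]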
A non-NumPy solution, which does still work with NumPy arrays, is to use an OrderedCounter followed by sorted with a custom function:
from collections import OrderedDict, Counter

class OrderedCounter(Counter, OrderedDict):
    pass

L = [3,4,5,1,2,4,1,1,2,4]
c = OrderedCounter(L)
keys = list(c)
res = sorted(c, key=lambda x: (-c[x], keys.index(x)))
print(res)
[4, 1, 2, 3, 5]
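On Python 3.7+, plain dicts (and therefore Counter) preserve insertion order, so the OrderedCounter subclass is no longer needed and a plain Counter gives the same result:
from collections import Counter
L = [3, 4, 5, 1, 2, 4, 1, 1, 2, 4]
c = Counter(L)
keys = list(c)
print(sorted(c, key=lambda x: (-c[x], keys.index(x))))  # [4, 1, 2, 3, 5]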
If the values are integers and small, or if you only care about bins of size 1:
def sort_by_frequency(arr):
    return np.flip(np.argsort(np.bincount(arr))[-(np.unique(arr).size):])

v = [1,1,1,1,1,2,2,9,3,3,3,3,7,8,8]
sort_by_frequency(v)
this should yield
array([1, 3, 8, 2, 9, 7])
Using zip and itemgetter should help:
from operator import itemgetter
import numpy as np
temp1 = problems[j]
temp, idx, cnt = np.unique(temp1, return_index = True, return_counts=True)
cnt = 1 / cnt
k = sorted(zip(temp, cnt, idx), key=itemgetter(1, 2))
print(next(zip(*k)))
You can count the number of occurrences of each element and then use those counts as a key to the built-in sorted function:
def sortbyfreq(arr):
    arr = list(arr)  # the list methods count/index are used below
    s = set(arr)
    keys = {n: (-arr.count(n), arr.index(n)) for n in s}
    return sorted(s, key=lambda n: keys[n])
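Usage, on the example from the question:
print(sortbyfreq([3, 4, 5, 1, 2, 4, 1, 1, 2, 4]))  # [4, 1, 2, 3, 5]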

Python: Plot an array of strings with repeated entries vs float without for loop

Hi, I am trying to plot a numpy array of strings on the y axis, for example
arr = np.array(['a','a','bas','dgg','a']) #The actual strings are about 11 characters long
vs a float array with equal length. The string array I am working with is very large ~ 100 million entries. One of the solutions I had in mind was to convert the string array to unique integer ids, for example,
vocab = np.unique(arr)
vocab = list(vocab)
arrId = np.zeros(len(arr))
for i in range(len(arr)):
    arrId[i] = vocab.index(arr[i])
and then call matplotlib.pyplot.plot(arrId). But I cannot afford to run a for loop over ~100 million entries to convert the array of strings to an array of unique integer ids. In an initial search I could not find a way to map strings to unique ids without using a loop. Maybe I am missing something, but is there a smart way to do this in Python?
EDIT -
Thanks. The solutions provided use vocab, ind = np.unique(arr, return_index=True), where ind is the returned integer index array. But it seems that np.unique is O(N·log N) according to this (numpy.unique with order preserved), whereas pandas.unique is O(N). However, I am not sure how to get ind from pandas.unique. Plotting the data, I guess, can be done in O(N). So I was wondering: is there a way to do all of this in O(N), perhaps by hashing of some sort?
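(A sketch of one possibility: pandas.factorize returns both the integer ids and the unique values in a single hash-based pass, with the uniques in order of first appearance.)
import numpy as np
import pandas as pd
arr = np.array(['a', 'a', 'bas', 'dgg', 'a'])
codes, vocab = pd.factorize(arr)
print(codes)  # [0 0 1 2 0]
print(vocab)  # ['a' 'bas' 'dgg']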
numpy.unique used with the return_inverse argument allows you to obtain the inverted index.
arr = np.array(['a','a','bas','dgg','a'])
unique, rev = np.unique(arr, return_inverse=True)
#unique: ['a' 'bas' 'dgg']
#rev: [0 0 1 2 0]
such that unique[rev] returns the original array ['a' 'a' 'bas' 'dgg' 'a'].
This can be easily used to plot the data.
import numpy as np
import matplotlib.pyplot as plt
arr = np.array(['a','a','bas','dgg','a'])
x = np.array([1,2,3,4,5])
unique, rev = np.unique(arr, return_inverse=True)
print(unique)
print(rev)
print(unique[rev])
fig,ax=plt.subplots()
ax.scatter(x, rev)
ax.set_yticks(range(len(unique)))
ax.set_yticklabels(unique)
plt.show()
You can factorize your strings:
In [75]: arr = np.array(['a','a','bas','dgg','a'])
In [76]: cats, idx = np.unique(arr, return_inverse=True)
In [77]: plt.plot(idx)
Out[77]: [<matplotlib.lines.Line2D at 0xf82da58>]
In [78]: cats
Out[78]:
array(['a', 'bas', 'dgg'],
dtype='<U3')
In [79]: idx
Out[79]: array([0, 0, 1, 2, 0], dtype=int64)
You can use the numpy unique function to return an array of the unique values:
print(np.unique(arr))
['a' 'bas' 'dgg']
collections.Counter also returns the values and their counts:
print(collections.Counter(arr))
Counter({'a': 3, 'bas': 1, 'dgg': 1})
Does this help at all?

Deleting values from multiple arrays that have a particular value

Let's say I have two arrays: a = array([1,2,3,0,4,5,0]) and b = array([1,2,3,4,0,5,6]). I am interested in removing the instances where a and b are 0, but I also want to remove the corresponding instances from both arrays. Therefore what I want to end up with is a = array([1,2,3,5]) and b = array([1,2,3,5]): because a[3] == 0 and a[6] == 0, both b[3] and b[6] are also deleted; likewise, since b[4] == 0, a[4] is also deleted. It's simple to do this for, say, two arrays:
import numpy as np
a = np.array([1,2,3,0,4,5,0])
b = np.array([1,2,3,4,0,5,6])
ix = np.where(b == 0)
b = np.delete(b, ix)
a = np.delete(a, ix)
ix = np.where(a == 0)
b = np.delete(b, ix)
a = np.delete(a, ix)
However, this solution doesn't scale up if I have many arrays (which I do). What would be a more elegant way to do this?
If I try the following:
import numpy as np
a = np.array([1,2,3,0,4,5,0])
b = np.array([1,2,3,4,0,5,6])
arrays = [a,b]
for array in arrays:
    ix = np.where(array == 0)
    b = np.delete(b, ix)
    a = np.delete(a, ix)
I get a = array([1, 2, 3, 4]) and b = array([1, 2, 3, 0]), not the answers I need. Any idea where this is wrong?
Assuming both/all arrays always have the same length, you can use masks:
ma = a != 0 # mask elements which are not equal to zero in a
mb = b != 0 # mask elements which are not equal to zero in b
m = ma * mb # assign the intersection of ma and mb to m
print(a[m], b[m])  # [1 2 3 5] [1 2 3 5]
You can of course also do it in one line
m = (a != 0) * (b != 0)
Or use the inverse
ma = a == 0
mb = b == 0
m = ~(ma + mb) # not the union of ma and mb
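The same idea generalizes to any number of equal-length arrays (a sketch using logical_and.reduce):
import numpy as np
arrays = [np.array([1, 2, 3, 0, 4, 5, 0]),
          np.array([1, 2, 3, 4, 0, 5, 6])]
m = np.logical_and.reduce([arr != 0 for arr in arrays])
print([arr[m] for arr in arrays])  # [array([1, 2, 3, 5]), array([1, 2, 3, 5])]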
This happens because np.delete returns a new array, which you then bind to the names a and b inside the loop. The list arrays, however, still holds references to the original, unmodified arrays, so the loop keeps iterating over those originals. The first iteration finds the correct indices of 0, but the second iteration computes ix == 4 from the original b, and that index no longer lines up with the already-shortened arrays. If you display the arrays variable on each iteration, you'll see it never changes.
You need to reassign arrays once you are done processing one array, so that the change is taken into account in the next iteration. Here's how you'd do it -
a = np.array([1, 2, 3, 0, 4, 5, 0])
b = np.array([1, 2, 3, 4, 0, 5, 6])
arrays = [a,b]
for i in range(len(arrays)):
    ix = np.where(arrays[i] == 0)
    b = np.delete(b, ix)
    a = np.delete(a, ix)
    arrays = [a, b]
Of course you can automate what happens inside the loop. I just wanted to give an explanation of what was happening.
A slow method involves operating over the whole list twice: first to build an intermediate list of the indices to delete, then to delete the values at those indices from every array:
import numpy as np
a = np.array([1,2,3,0,4,5,0])
b = np.array([1,2,3,4,0,5,6])
arrays = [a, b]
vals = []
for array in arrays:
    ix = np.where(array == 0)
    vals.extend([y for x in ix for y in x.tolist()])
vals = list(set(vals))
new_array = []
for array in arrays:
    new_array.append(np.delete(array, vals))
Building on Christoph Terasa's answer, you can use array operations instead of for loops:
arrays = np.vstack([a,b]) # ...long list of arrays of equal length
zeroind = (arrays==0).max(0)
pos_arrays = arrays[:,~zeroind] # a 2d array only containing those columns where none of the lines contained zeros

Manipulating data from python Numpy array: Using values from one column to sum over adjacent value

Here is what my data looks like:
a = np.array([[1,2],[2,1],[7,1],[3,2]])
I want to sum over the first column, grouped by each distinct number in the second column. In this example there are two possible values in the second column: 1 and 2.
I want to sum all the values in the first column that share the same value in the second column.
Is there an inbuilt numpy function for this?
For example, the sum for value 1 in the second column would be: 2 + 7 = 9
A short but a bit dodgy way is through numpy function bincount:
np.bincount(a[:,1], weights=a[:,0])
What it does is count the number of occurrences of 0, 1, 2, etc. in the array (in this case a[:,1], the list of your category numbers). weights multiplies each occurrence by a weight, here the corresponding first-column value, so the weighted count is effectively a per-category sum.
What it returns is this:
array([ 0., 9., 4.])
where 0 is the sum of first-column values whose second element is 0, and so on. So it will only work if the numbers you group by are non-negative integers.
If they are not consecutive integers starting from 0, you can select those you need by doing:
np.bincount(a[:,1], weights=a[:,0])[np.unique(a[:,1])]
This will return
array([9., 4.])
which is an array of sums, sorted by the second element (because unique returns a sorted list).
If your second elements are not integers, first off you are in some kind of trouble because of floating point arithmetic (elements which you think are equal could actually differ). However, if you are sure it is fine, you can sort them and assign integers to them using scipy's rankdata function, for example:
from scipy.stats import rankdata as rd
ind = rd(a[:,1], method='dense').astype(int) - 1  # ranking begins at 1; we need 0-based
sums = np.bincount(ind, weights=a[:,0])
This will return array([9., 4.]), in order sorted by your second element. You can zip them to pair sums with appropriate elements:
zip(np.unique(a[:,1]), sums)
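Putting the pieces together for the example (a sketch; assumes non-negative integer keys):
import numpy as np
a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]])
keys = np.unique(a[:, 1])
sums = np.bincount(a[:, 1], weights=a[:, 0])[keys]
print(list(zip(keys.tolist(), sums.tolist())))  # [(1, 9.0), (2, 4.0)]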
Contents of play.py
import numpy as np

def compute_sum1(a):
    unique = np.unique(a[:, 1])
    same_idxs = ((u, np.argwhere(a[:, 1] == u)) for u in unique)
    # First coordinate of the tuple holds the value from col 2,
    # second coordinate holds the sum of the matching entries from col 1.
    same_sum = [(u, np.sum(a[idx, 0])) for u, idx in same_idxs]
    return same_sum

def compute_sum2(a):
    """A minimal implementation of compute_sum"""
    unique = np.unique(a[:, 1])
    same_idxs = (np.argwhere(a[:, 1] == u) for u in unique)
    same_sum = (np.sum(a[idx, 0]) for idx in same_idxs)
    return same_sum

def compute_sum3(a):
    unique = np.unique(a[:, 1])
    same_idxs = [np.argwhere(a[:, 1] == u) for u in unique]
    same_sum = np.sum(a[same_idxs, 0].squeeze(), 1)
    return same_sum

def main():
    a = np.array([[1, 2], [2, 1], [7, 1], [3, 2]]).astype("float")
    print("compute_sum1")
    print(compute_sum1(a))
    print("compute_sum3")
    print(compute_sum3(a))
    print("compute_sum2")
    same_sum = [s for s in compute_sum2(a)]
    print(same_sum)

if __name__ == '__main__':
    main()
Output:
In [59]: play.main()
compute_sum1
[(1.0, 9.0), (2.0, 4.0)]
compute_sum3
[ 9. 4.]
compute_sum2
[9.0, 4.0]
In [60]: %timeit play.compute_sum1(a)
10000 loops, best of 3: 95 µs per loop
In [61]: %timeit play.compute_sum2(a)
100000 loops, best of 3: 14.1 µs per loop
In [62]: %timeit play.compute_sum3(a)
10000 loops, best of 3: 77.4 µs per loop
Note that compute_sum2() is the fastest.
If your matrix is huge, I suggest using that implementation, as it uses generator expressions instead of list comprehensions, which is more memory efficient.
Similarly, same_sum in compute_sum1() can be converted to a generator expression by replacing [] with ().
You might want to have a look at this library: https://github.com/ml31415/accumarray. It's a clone of Matlab's accumarray for numpy.
a = np.array([[1,2],[2,1],[7,1],[3,2]])
accum(a[:,1], a[:,0])
>>> array([0, 9, 4])
The first 0 means that there were no rows with 0 in the index column.
The most straightforward way I see is through a list comprehension:
s = [[sum(x[0] for x in a if x[1] == y), y] for y in set([q[1] for q in a])]
However, if the second number in your lists represents some kind of a category, I suggest you convert your data into a dictionary.
As far as I know, there is no function to do this in numpy, but this can easily be done with pandas.DataFrame.groupby.
In [7]: import pandas as pd
In [8]: import numpy as np
In [9]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [10]: df = pd.DataFrame(a)
In [11]: df.groupby(1)[0].sum()
Out[11]:
1
1 9
2 4
Name: 0, dtype: int64
Of course, you could do the same thing with itertools.groupby
In [1]: import numpy as np
...: from itertools import groupby
...: from operator import itemgetter
...:
In [3]: a = np.array([[1,2],[2,1],[7,1],[3,2]])
In [4]: sa = sorted(a.tolist(), key=itemgetter(1))
In [5]: grouper = groupby(sa, key=itemgetter(1))
In [6]: sums = {idx : sum(row[0] for row in group) for idx, group in grouper}
In [7]: sums
Out[7]: {1: 9, 2: 4}
