Can you access the individual windows of a pandas rolling window object?
rs = pd.Series(range(10))
rs.rolling(window = 3)
# prints
Rolling [window=3,center=False,axis=0]
Can I get the windows as groups?
[0,1,2]
[1,2,3]
[2,3,4]
I will start off by saying that this reaches into the internal implementation. But if you really, really want to compute the indexers the same way pandas does, here is how.
You will need v0.19.0rc1 (just about released); you can install it with conda install -c pandas pandas=0.19.0rc1
In [41]: rs = pd.Series(range(10))
In [42]: rs
Out[42]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
# this reaches into an internal implementation
# the first 3 is the window, then second the minimum periods we
# need
In [43]: start, end, _, _, _, _ = pandas._window.get_window_indexer(rs.values,3,3,None,use_mock=False)
# starting index
In [44]: start
Out[44]: array([0, 0, 0, 1, 2, 3, 4, 5, 6, 7])
# ending index
In [45]: end
Out[45]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# window size
In [3]: end-start
Out[3]: array([1, 2, 3, 3, 3, 3, 3, 3, 3, 3])
# the indexers
In [47]: [np.arange(s, e) for s, e in zip(start, end)]
Out[47]:
[array([0]),
array([0, 1]),
array([0, 1, 2]),
array([1, 2, 3]),
array([2, 3, 4]),
array([3, 4, 5]),
array([4, 5, 6]),
array([5, 6, 7]),
array([6, 7, 8]),
array([7, 8, 9])]
So this is fairly trivial in the fixed-window case, but it becomes extremely useful in a variable-window scenario: e.g. in 0.19.0 you can specify things like 2S to aggregate by time.
All of that said, getting these indexers is not particularly useful. You generally want to do something with the results. That is the point of the aggregation functions, or .apply if you want to aggregate generically.
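As a rough illustration of the two points above, the sketch below stays on the public API: .apply for a generic aggregation over a fixed window, and a time-based window on a DatetimeIndex. The sample timestamps and the max-minus-min lambda are made up for illustration, and the lowercase "2s" spelling and the raw=True argument assume a reasonably recent pandas.

```python
import pandas as pd

rs = pd.Series(range(10))

# Generic aggregation: max - min over each full 3-element window.
# The first two results are NaN because those windows are incomplete.
window_range = rs.rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)

# Variable-size windows: sum of everything in the 2 seconds up to each timestamp.
idx = pd.to_datetime([
    "2016-01-01 00:00:00",
    "2016-01-01 00:00:01",
    "2016-01-01 00:00:03",
    "2016-01-01 00:00:04",
])
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
time_sum = ts.rolling("2s").sum()

print(window_range.tolist())
print(time_sum.tolist())
```

With the time-based window, the number of rows per window varies (here 1 or 2), which is the case where the start/end indexers stop being trivial.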
Here's a workaround, but waiting to see if anyone has pandas solution:
def rolling_window(a, step):
    # view a 1-D array as overlapping rows of length `step` (no copy is made)
    shape = a.shape[:-1] + (a.shape[-1] - step + 1, step)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
rolling_window(rs.values, 3)  # pass the underlying ndarray, not the Series
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6],
       [5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])
This is solved in pandas 1.1, as the rolling object is now an iterable:
[window.tolist() for window in rs.rolling(window=3) if len(window) == 3]
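If all that is needed is the fixed-size groups as an array, NumPy offers a safer alternative to hand-rolled strides. The sketch below assumes NumPy 1.20 or later, where np.lib.stride_tricks.sliding_window_view is available; the variable name windows is just for illustration.

```python
import numpy as np
import pandas as pd

rs = pd.Series(range(10))

# Read-only view with one row per full window; no data is copied.
windows = np.lib.stride_tricks.sliding_window_view(rs.to_numpy(), 3)

print(windows.shape)
print(windows[:3].tolist())
```

Unlike iterating over the rolling object, this only yields complete windows, so no length filter is needed.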
I have two 1D arrays of the same length, like this:
import numpy as np
a = np.array([1, 1, 1, 2, 2, 3, 4, 5])
b = np.array([7, 7, 8, 8, 9, 8, 10, 10])
The values of a are increasing, while those of b are in no particular order.
I want to pair them by their values, following the steps below:
Pick the first unique value ([1]) of array a and get the unique numbers ([7, 8]) of array b at the same indices.
If some of those paired numbers ([8]) appear again in b, pick the numbers of a at those indices as well.
If a newly paired number ([2]) appears again in a, the numbers in b at those indices are selected too.
Finally, the result should be:
[1, 2, 3] is paired with [7, 8, 9]
[4, 5] is paired with [10]
It looks like there is no easy vectorised (loop-free) solution, since this is the graph theory problem of finding connected components. If you still want a script that runs fast on big data, you could use the igraph library, which is written in C.
TL;DR
I assume your input corresponds to edges of some graph:
>>> np.transpose([a, b])
array([[ 1, 7],
[ 1, 7],
[ 1, 8],
[ 2, 8],
[ 2, 9],
[ 3, 8],
[ 4, 10],
[ 5, 10]])
So your vertices are:
>>> np.unique(np.transpose([a, b]))
array([ 1, 2, 3, 4, 5, 7, 8, 9, 10])
And you would be quite happy (at least at the beginning) to recognise communities, like:
tags = np.transpose([a, b, communities])
>>> tags
array([[ 1, 7, 0],
[ 1, 7, 0],
[ 1, 8, 0],
[ 2, 8, 0],
[ 2, 9, 0],
[ 3, 8, 0],
[ 4, 10, 1],
[ 5, 10, 1]])
so that you have vertices (1, 2, 3, 7, 8, 9) included in community number 0 and vertices (4, 5, 10) included in community number 1.
Unfortunately, igraph doesn't support vertex ids that run from 1 to 10 or that have gaps. The ids must start at 0 and be contiguous. So you need to store the original values and then relabel the vertices, so that the edges become:
vertices_old, inv = np.unique(np.transpose([a,b]), return_inverse=True)
edges_new = inv.reshape(-1, 2)
>>> vertices_old
array([ 1, 2, 3, 4, 5, 7, 8, 9, 10]) #new ones are: [0, 1, 2, ..., 8]
>>> edges_new
array([[0, 5],
[0, 5],
[0, 6],
[1, 6],
[1, 7],
[2, 6],
[3, 8],
[4, 8]], dtype=int64)
The next step is to find communities using igraph (pip install python-igraph). You can run the following:
import igraph as ig
graph = ig.Graph(edges = edges_new)
communities = graph.clusters().membership #type: list
communities = np.array(communities)
>>> communities
array([0, 0, 0, 1, 1, 0, 0, 0, 1]) #tags of nodes [1 2 3 4 5 7 8 9 10]
And then retrieve tags of source vertices (as well as tags of target vertices):
>>> communities = communities[edges_new[:, 0]] #or [:, 1]
>>> communities
array([0, 0, 0, 0, 0, 0, 1, 1])
After you find communities, the second part of solution appears to be a typical groupby problem. You can do it in pandas:
import pandas as pd
def get_part(source, communities):
    part_edges = np.transpose([source, communities])
    part_idx = pd.DataFrame(part_edges).groupby([1]).indices.values() #might contain duplicated source values
    part = [np.unique(source[idx]) for idx in part_idx]
    return part
>>> get_part(a, communities), get_part(b, communities)
([array([1, 2, 3]), array([4, 5])], [array([7, 8, 9]), array([10])])
Final Code
import igraph as ig
import numpy as np
import pandas as pd
def get_part(source, communities):
    '''find set of nodes for each community'''
    part_edges = np.transpose([source, communities])
    part_idx = pd.DataFrame(part_edges).groupby([1]).indices.values() #might contain duplicated source values
    part = [np.unique(source[idx]) for idx in part_idx]
    return part
a = np.array([1, 1, 1, 2, 2, 3, 4, 5])
b = np.array([7, 7, 8, 8, 9, 8, 10, 10])
vertices_old, inv = np.unique(np.transpose([a,b]), return_inverse=True)
edges_new = inv.reshape(-1, 2)
graph = ig.Graph(edges = edges_new)
communities = np.array(graph.clusters().membership)
communities = communities[edges_new[:,0]] #or communities[edges_new[:,1]]
>>> get_part(a, communities), get_part(b, communities)
([array([1, 2, 3]), array([4, 5])], [array([7, 8, 9]), array([10])])
I tried doing this by iterating both the arrays simultaneously and keeping track of what element is associated with which index of the result. Let me know if this works for you?
a = [1, 1, 1, 2, 2, 3, 4, 5]
b = [7, 7, 8, 8, 9, 8, 10, 10]
tracker_a = dict()
tracker_b = dict()
result = []
index = 0
for elem_a, elem_b in zip(a, b):
    if elem_a in tracker_a:
        result[tracker_a[elem_a]][1].add(elem_b)
        tracker_b[elem_b] = tracker_a[elem_a]
    elif elem_b in tracker_b:
        result[tracker_b[elem_b]][0].add(elem_a)
        tracker_a[elem_a] = tracker_b[elem_b]
    else:
        tracker_a[elem_a] = index
        tracker_b[elem_b] = index
        result.append([{elem_a}, {elem_b}])
        index += 1
print(result)
Output:
[[{1, 2, 3}, {8, 9, 7}], [{4, 5}, {10}]]
Complexity: O(n)
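One caveat with a single pass like the one above: when a later pair links two groups that were both created earlier (for example a = [1, 2, 1], b = [7, 8, 8]), the groups have to be merged, and the two tracker dictionaries do not do that. A minimal union-find sketch handles this case. This is a swapped-in technique rather than the method of the answer above, and pair_groups and the "a"/"b" node tags are names made up here.

```python
def pair_groups(a, b):
    """Group values of a and b that are connected through shared pairs."""
    parent = {}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Tag each value with its source so equal numbers in a and b stay distinct.
    for x, y in zip(a, b):
        union(("a", x), ("b", y))

    groups = {}
    for x, y in zip(a, b):
        left, right = groups.setdefault(find(("a", x)), (set(), set()))
        left.add(x)
        right.add(y)
    return [(sorted(left), sorted(right)) for left, right in groups.values()]


print(pair_groups([1, 1, 1, 2, 2, 3, 4, 5], [7, 7, 8, 8, 9, 8, 10, 10]))
print(pair_groups([1, 2, 1], [7, 8, 8]))
```

The groups come back in order of first appearance, each as a (values of a, values of b) pair.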
Struggling to describe this issue in words, but have a seemingly simple issue I can't find an answer for.
I want to create an array using values from one list/array and indices from another. I want the shape of the new array to be the same as the index array.
import numpy as np
a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
result = func(a, b) #some function or operator...
print(result)
[[9, 8], [7, 6, 5], [3, 2, 1, 0, -1]]
Thank you! :)
EDIT:
Good solutions so far, but I would rather do this without a for loop as we are looking at hundreds of thousands of rows and need to keep computing time down. Thanks again :)
You can use a list comprehension:
>>> [a[x[0]:x[-1]+1] for x in b]
[array([9, 8]), array([7, 6, 5]), array([ 3, 2, 1, 0, -1])]
EDIT: Your question indicates that you want a faster option, so you might test the following script to see which is faster for your Python installation:
#!/usr/bin/env python
import timeit
setup = '''
import numpy as np
a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
'''
# Time the expressions themselves: wrapping them in "def test(): ..."
# would only measure how long it takes to define the function.
test1 = '[a[x[0]:x[-1]+1] for x in b]'
test2 = '[a[idx] for idx in b]'
print(timeit.timeit(setup = setup,
                    stmt = test1,
                    number = 1000000))
print(timeit.timeit(setup = setup,
                    stmt = test2,
                    number = 1000000))
On my machine, the two approaches given so far run about the same, but hpaulj's answer might be very slightly faster (unless Python is caching data behind the scenes), which may be of more use to you in production. Test it locally and see whether you get a similar or a different result.
Just apply each indexing sublist to a:
In [483]: a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
...:
...: b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]
...:
...:
In [484]: [a[idx] for idx in b]
Out[484]: [array([9, 8]), array([7, 6, 5]), array([ 3, 2, 1, 0, -1])]
The sublists differ in length, so the result cannot be made into a 2d array. It has to remain a list (or, if you insist, a 1d object-dtype array).
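If the per-sublist indexing turns out to be the bottleneck, one option is to index a only once with all indices flattened and then cut the result back into pieces. This is only a sketch: the names lengths, flat and values are made up here, building lengths is still a Python-level loop over b, and np.split does its own looping internally, so whether it helps on hundreds of thousands of rows is something to measure.

```python
import numpy as np

a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2])
b = [[0, 1], [2, 3, 4], [6, 7, 8, 9, 10]]

lengths = [len(idx) for idx in b]
flat = np.concatenate(b)   # all indices in one 1-D array
values = a[flat]           # a single fancy-indexing call

# Cut the flat result back into pieces matching the sublists of b.
result = np.split(values, np.cumsum(lengths)[:-1])
print(result)
```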
I have a massive array, but for illustration I am using an array of size 14. I have another list which contains 2, 3, 3, 6. How do I efficiently, without a for loop, create a list of new arrays such that:
import numpy as np
A = np.array([1,2,4,5,7,1,2,4,5,7,2,8,12,3]) # array with 1 axis
subArraysizes = np.array([2, 3, 3, 6]) #sums to number of elements in A
B = list()
B[0] = [1,2]
B[1] = [4,5,7]
B[2] = [1,2,4]
B[3] = [5,7,2,8,12,3]
i.e. select the first 2 elements of A and store them in B, select the next 3 elements of A and store them in B, and so on, in the order they appear in A.
You can use np.split -
B = np.split(A,subArraysizes.cumsum())[:-1]
Sample run -
In [75]: A
Out[75]: array([ 1, 2, 4, 5, 7, 1, 2, 4, 5, 7, 2, 8, 12, 3])
In [76]: subArraysizes
Out[76]: array([2, 3, 3, 6])
In [77]: np.split(A,subArraysizes.cumsum())[:-1]
Out[77]:
[array([1, 2]),
array([4, 5, 7]),
array([1, 2, 4]),
array([ 5, 7, 2, 8, 12, 3])]
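The trailing [:-1] is there because the last cumulative sum equals len(A), so np.split produces one extra, empty array at the end. An equivalent sketch drops the last split point instead (the name split_points is made up here):

```python
import numpy as np

A = np.array([1, 2, 4, 5, 7, 1, 2, 4, 5, 7, 2, 8, 12, 3])
sizes = np.array([2, 3, 3, 6])

# Split positions are the running totals, without the final one (== len(A)).
split_points = np.cumsum(sizes)[:-1]
B = np.split(A, split_points)

print(split_points.tolist())
print([part.tolist() for part in B])
```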
I have a couple of lists:
a = [1,2,3]
b = [1,2,3,4,5,6]
which are of variable length.
I want to return a vector of length five, such that if the input list length is < 5 then it will be padded with zeros on the right, and if it is > 5, then it will be truncated at the 5th element.
For example, input a would return np.array([1,2,3,0,0]), and input b would return np.array([1,2,3,4,5]).
I feel like I ought to be able to use np.pad, but I can't seem to follow the documentation.
This might be slow or fast, I am not sure, however it works for your purpose.
In [22]: pad = lambda a,i : a[0:i] if len(a) > i else a + [0] * (i-len(a))
In [23]: pad([1,2,3], 5)
Out[23]: [1, 2, 3, 0, 0]
In [24]: pad([1,2,3,4,5,6,7], 5)
Out[24]: [1, 2, 3, 4, 5]
np.pad is overkill, better for adding a border all around a 2d image than adding some zeros to a list.
I like zip_longest, especially if the inputs are lists and don't need to be arrays. It's probably the closest you'll find to code that operates on all the lists at once in compiled code.
a, b = zip(*itertools.zip_longest(a, b, fillvalue=0))  # izip_longest on Python 2
is a version that does not use np.array at all (saving some array overhead).
But by itself it does not truncate. It still needs something like [x[:5] for x in (a,b)].
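A small sketch of how the zip_longest idea could be combined with truncation, for Python 3. The helper name pad_truncate_all and the extra filler column are assumptions made here: the filler column guarantees at least n rows even when every input is shorter than n.

```python
from itertools import zip_longest


def pad_truncate_all(lists, n=5, fill=0):
    """Pad with `fill` or truncate every list to exactly n items."""
    # The extra [fill] * n column forces zip_longest to yield at least n rows.
    rows = list(zip_longest([fill] * n, *lists, fillvalue=fill))[:n]
    # Transpose back and drop the filler column.
    columns = list(zip(*rows))[1:]
    return [list(col) for col in columns]


a = [1, 2, 3]
b = [1, 2, 3, 4, 5, 6]
print(pad_truncate_all([a, b]))
```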
Here's my variation on all_ms function, working with a simple list or 1d array:
def foo_1d(x, n=5):
    x = np.asarray(x)
    assert x.ndim==1
    s = np.min([x.shape[0], n])
    ret = np.zeros((n,), dtype=x.dtype)
    ret[:s] = x[:s]
    return ret
In [772]: [foo_1d(x) for x in [[1,2,3], [1,2,3,4,5], np.arange(10)[::-1]]]
Out[772]: [array([1, 2, 3, 0, 0]), array([1, 2, 3, 4, 5]), array([9, 8, 7, 6, 5])]
One way or other the numpy solutions do the same thing - construct a blank array of the desired shape, and then fill it with the relevant values from the original.
One other detail - when truncating the solution could, in theory, return a view instead of a copy. But that requires handling that case separately from a pad case.
If the desired output is a list of equal-length arrays, it may be worthwhile collecting them in a 2d array.
In [792]: def foo1(x, out):
     ...:     x = np.asarray(x)
     ...:     s = np.min((x.shape[0], out.shape[0]))
     ...:     out[:s] = x[:s]
In [794]: lists = [[1,2,3], [1,2,3,4,5], np.arange(10)[::-1], []]
In [795]: ret=np.zeros((len(lists),5),int)
In [796]: for i,xx in enumerate(lists):
     ...:     foo1(xx, ret[i,:])
In [797]: ret
Out[797]:
array([[1, 2, 3, 0, 0],
[1, 2, 3, 4, 5],
[9, 8, 7, 6, 5],
[0, 0, 0, 0, 0]])
Pure python version, where a is a python list (not a numpy array): a[:n] + [0,]*(n-len(a)).
For example:
In [42]: n = 5
In [43]: a = [1, 2, 3]
In [44]: a[:n] + [0,]*(n - len(a))
Out[44]: [1, 2, 3, 0, 0]
In [45]: a = [1, 2, 3, 4]
In [46]: a[:n] + [0,]*(n - len(a))
Out[46]: [1, 2, 3, 4, 0]
In [47]: a = [1, 2, 3, 4, 5]
In [48]: a[:n] + [0,]*(n - len(a))
Out[48]: [1, 2, 3, 4, 5]
In [49]: a = [1, 2, 3, 4, 5, 6]
In [50]: a[:n] + [0,]*(n - len(a))
Out[50]: [1, 2, 3, 4, 5]
Function using numpy:
In [121]: def tosize(a, n):
   .....:     a = np.asarray(a)
   .....:     x = np.zeros(n, dtype=a.dtype)
   .....:     m = min(n, len(a))
   .....:     x[:m] = a[:m]
   .....:     return x
   .....:
In [122]: tosize([1, 2, 3], 5)
Out[122]: array([1, 2, 3, 0, 0])
In [123]: tosize([1, 2, 3, 4], 5)
Out[123]: array([1, 2, 3, 4, 0])
In [124]: tosize([1, 2, 3, 4, 5], 5)
Out[124]: array([1, 2, 3, 4, 5])
In [125]: tosize([1, 2, 3, 4, 5, 6], 5)
Out[125]: array([1, 2, 3, 4, 5])
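Since the question asked about np.pad specifically, here is a sketch of how it could be used: truncate first, then pad only on the right. The function name pad_or_truncate is made up here, and mode="constant" (which fills with zeros by default) is passed explicitly because older NumPy versions require the mode argument.

```python
import numpy as np


def pad_or_truncate(values, n=5):
    """Return a length-n array: truncated, or zero-padded on the right."""
    x = np.asarray(values)[:n]
    # pad_width (0, k): nothing before the data, k zeros after it.
    return np.pad(x, (0, n - len(x)), mode="constant")


print(pad_or_truncate([1, 2, 3]))
print(pad_or_truncate([1, 2, 3, 4, 5, 6]))
```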
Let's consider a 2d-array A
2 3 5 7
2 3 5 7
1 7 1 4
5 8 6 0
2 3 5 7
The first, second and last lines are identical. The algorithm I'm looking for should return the number of identical rows for each different row (= the number of duplicates of each row). If the script can easily be modified to also count the number of identical columns, that would be great.
I use an inefficient naive algorithm to do that:
import numpy
A = numpy.array([[2, 3, 5, 7], [2, 3, 5, 7], [1, 7, 1, 4], [5, 8, 6, 0], [2, 3, 5, 7]])
i = 0
end = len(A)
while i < end:
    print(i, end=' ')
    j = i + 1
    numberID = 1  # rows identical to row i, counting row i itself
    while j < end:
        print(j)
        if numpy.array_equal(A[i, :], A[j, :]):
            numberID += 1
        j += 1
    i += 1
print(A, len(A))
Expected result:
array([3,1,1]) # number identical arrays per line
My algorithm looks like plain Python running on top of numpy, and is thus inefficient. Thanks for your help.
In numpy >= 1.9.0, np.unique has a return_counts keyword argument you can combine with the solution here to get the counts:
b = np.ascontiguousarray(A).view(np.dtype((np.void, A.dtype.itemsize * A.shape[1])))
unq_a, unq_cnt = np.unique(b, return_counts=True)
unq_a = unq_a.view(A.dtype).reshape(-1, A.shape[1])
>>> unq_a
array([[1, 7, 1, 4],
[2, 3, 5, 7],
[5, 8, 6, 0]])
>>> unq_cnt
array([1, 3, 1])
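In newer NumPy versions the void-view trick should not be needed. The sketch below assumes NumPy 1.13 or later, where np.unique accepts an axis argument; it also shows one way to get the counts in order of first appearance (the [3, 1, 1] from the question) and the same count for columns. The variable names are made up here.

```python
import numpy as np

A = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])

# Unique rows (sorted), index of each one's first occurrence, and its count.
unique_rows, first_idx, counts = np.unique(
    A, axis=0, return_index=True, return_counts=True)

# Reorder the counts by first appearance in A instead of sorted order.
counts_in_input_order = counts[np.argsort(first_idx)]

# The same idea for columns: axis=1.
unique_cols, col_counts = np.unique(A, axis=1, return_counts=True)

print(counts.tolist(), counts_in_input_order.tolist(), col_counts.tolist())
```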
In an older numpy, you can replicate what np.unique does, which would look something like:
a_view = np.array(A, copy=True)
a_view = a_view.view(np.dtype((np.void,
a_view.dtype.itemsize*a_view.shape[1]))).ravel()
a_view.sort()
a_flag = np.concatenate(([True], a_view[1:] != a_view[:-1]))
a_unq = a_view[a_flag].view(A.dtype).reshape(-1, A.shape[1])  # index the sorted view, not the unsorted A
a_idx = np.concatenate(np.nonzero(a_flag) + ([a_view.size],))
a_cnt = np.diff(a_idx)
>>> a_unq
array([[1, 7, 1, 4],
[2, 3, 5, 7],
[5, 8, 6, 0]])
>>> a_cnt
array([1, 3, 1])
You can lexsort on the row entries, which gives you the indices for traversing the rows in sorted order. Identical rows then end up next to each other, so the search for duplicates becomes a single O(n) pass rather than O(n^2) pairwise comparisons. Note that by default, the elements in the last column sort last, i.e. the rows are 'alphabetized' right to left rather than left to right.
In [9]: a
Out[9]:
array([[2, 3, 5, 7],
[2, 3, 5, 7],
[1, 7, 1, 4],
[5, 8, 6, 0],
[2, 3, 5, 7]])
In [10]: lexsort(a.T)
Out[10]: array([3, 2, 0, 1, 4])
In [11]: a[lexsort(a.T)]
Out[11]:
array([[5, 8, 6, 0],
[1, 7, 1, 4],
[2, 3, 5, 7],
[2, 3, 5, 7],
[2, 3, 5, 7]])
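To turn the sorted order into counts, one pass over neighbouring rows is enough. A possible sketch follows; the names is_new, starts and counts are made up here.

```python
import numpy as np

A = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])

sorted_rows = A[np.lexsort(A.T)]

# A row starts a new group if it differs from the row just before it.
differs = np.any(sorted_rows[1:] != sorted_rows[:-1], axis=1)
is_new = np.concatenate(([True], differs))
starts = np.flatnonzero(is_new)

unique_rows = sorted_rows[starts]
# Group sizes are the gaps between consecutive group starts.
counts = np.diff(np.append(starts, len(sorted_rows)))

print(unique_rows.tolist())
print(counts.tolist())
```

The counts follow the lexsort order of the unique rows, not their order in A.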
You can use Counter class from collections module for this.
It works like this :
x = [2, 2, 1, 5, 2]
from collections import Counter
c=Counter(x)
print(c)
Output : Counter({2: 3, 1: 1, 5: 1})
The only issue in your case is that every value of x is itself a list, which is not hashable.
If you convert every value of x to a tuple, it works:
x = [(2, 3, 5, 7),(2, 3, 5, 7),(1, 7, 1, 4),(5, 8, 6, 0),(2, 3, 5, 7)]
from collections import Counter
c=Counter(x)
print(c)
Output : Counter({(2, 3, 5, 7): 3, (1, 7, 1, 4): 1, (5, 8, 6, 0): 1})
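Starting from the numpy array in the question, the rows can be converted to tuples on the fly. A short sketch (row_counts and col_counts are names made up here); transposing first gives the same count for columns:

```python
from collections import Counter

import numpy as np

A = np.array([[2, 3, 5, 7],
              [2, 3, 5, 7],
              [1, 7, 1, 4],
              [5, 8, 6, 0],
              [2, 3, 5, 7]])

# tolist() yields plain Python ints, so the tuples are hashable and compact.
row_counts = Counter(map(tuple, A.tolist()))
col_counts = Counter(map(tuple, A.T.tolist()))

print(row_counts)
print(col_counts)
```

This still loops in Python, so for large arrays the np.unique approaches are likely to be faster.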