Python - Split array into multiple arrays dependent on array values

I have a list which needs to be split into multiple lists of differing size. The values in the original list randomly increase in size until the split point, where the value drops before continuing to increase. The values must remain in order after being split.
E.g.
Original list
[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631, 70, 119,
125, 130, 134, 139, 144, 149, 154, 159, 614, 669, 100, 136, 144, 149, 153,
158, 163, 167, 173, 179, 62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117,
124, 426, 100, 129, 135, 140, 145, 151]
After split:
[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631]
[70, 119, 125, 130, 134, 139, 144, 149, 154, 159, 614, 669]
[100, 136, 144, 149, 153, 158, 163, 167, 173, 179]
[62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117, 124, 426]
[100, 129, 135, 140, 145, 151]
I have searched for a solution, finding numpy.where and numpy.diff as likely candidates, but I'm unsure how to implement.
Thanks for the help!

Approach #1
Using NumPy's numpy.split to get a list of arrays as output -
import numpy as np
arr = np.array(a) # a is input list
out = np.split(arr,np.flatnonzero(arr[1:] < arr[:-1])+1)
Approach #2
Using a list comprehension to slice the list directly and thus avoid numpy.split for efficiency purposes -
idx = np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
out = [a[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Output for given sample -
In [52]: idx = np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
In [53]: [a[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Out[53]:
[[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631],
[70, 119, 125, 130, 134, 139, 144, 149, 154, 159, 614, 669],
[100, 136, 144, 149, 153, 158, 163, 167, 173, 179],
[62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117, 124, 426],
[100, 129, 135, 140, 145, 151]]
We are using np.diff here, which is fed a list in this case and then computes the differences. A better alternative is to convert the list to an array first and compare shifted slices of it, instead of actually computing the difference values. Thus, we could get idx like this as well -
arr = np.asarray(a)
idx = np.r_[0, np.flatnonzero(arr[1:] < arr[:-1])+1, len(arr)]
Let's time it and see if there's any improvement -
In [84]: a = np.random.randint(0,100,(1000,100)).cumsum(1).ravel().tolist()
In [85]: %timeit np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
100 loops, best of 3: 3.24 ms per loop
In [86]: arr = np.asarray(a)
In [87]: %timeit np.asarray(a)
100 loops, best of 3: 3.05 ms per loop
In [88]: %timeit np.r_[0, np.flatnonzero(arr[1:] < arr[:-1])+1, len(arr)]
10000 loops, best of 3: 77 µs per loop
In [89]: 3.05+0.077
Out[89]: 3.127
So the shift-and-compare method gives only a marginal improvement overall (about 3.13 ms versus 3.24 ms), because the conversion np.asarray(a) eats up most of the runtime.
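As a quick sanity check, the two ways of computing idx can be compared on a small input. This is only a sketch: the shortened list below is an illustrative stand-in for the sample data, not the original input.

```python
import numpy as np

# Shortened stand-in for the sample list (illustrative values only)
a = [100, 564, 572, 70, 119, 125, 62, 72]
arr = np.asarray(a)

# Variant 1: np.diff on the list
idx_diff = np.r_[0, np.flatnonzero(np.diff(a) < 0) + 1, len(a)]
# Variant 2: compare shifted slices of the array
idx_shift = np.r_[0, np.flatnonzero(arr[1:] < arr[:-1]) + 1, len(arr)]

# Slice the original list between consecutive boundaries
out = [a[idx_shift[i]:idx_shift[i + 1]] for i in range(len(idx_shift) - 1)]
print(idx_diff, idx_shift, out)
```

Both variants should yield the same boundaries, so either can feed the list comprehension.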

I know you tagged numpy, but here's an implementation without any dependencies too:
lst = [100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631, 70, 119,
125, 130, 134, 139, 144, 149, 154, 159, 614, 669, 100, 136, 144, 149, 153,
158, 163, 167, 173, 179, 62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117,
124, 426, 100, 129, 135, 140, 145, 151]
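def split(lst):
    last_pos = 0
    for i in range(1, len(lst)):
        # A drop in value marks the start of a new sub-list
        if lst[i] < lst[i-1]:
            yield lst[last_pos:i]
            last_pos = i
    # Yield the remaining tail (skipped when lst is empty)
    if last_pos <= len(lst) - 1:
        yield lst[last_pos:]
print([x for x in split(lst)])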

If you want to use numpy.diff and numpy.where, you can try
import numpy
a = numpy.array(original_list)  # original_list is your input list
numpy.split(a, numpy.where(numpy.diff(a) < 0)[0] + 1)
Explanation:
numpy.diff(a) calculates the difference of each item and its preceding one, and returns an array.
numpy.diff(a) < 0 returns a boolean array, where each element is replaced by whether it satisfies the predicate, in this case less than zero. This is the result of numpy.ndarray overloading the comparison operators.
numpy.where takes this boolean array and returns the indices where the element is not zero. In this context False evaluates to zero, so you take the indices of True.
[0] takes the first (and only) axis
+ 1: index i of the diff refers to the pair (a[i], a[i+1]), so the new sub-array starts at i + 1. You want to break off after the indices, not before.
Finally, numpy.split breaks them off at the given indices.
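The steps above can be traced one at a time on a small array. This is a minimal sketch; the short input and the intermediate variable names (steps, drops, cut_points) are chosen here for illustration.

```python
import numpy as np

a = np.array([100, 564, 572, 70, 119, 125, 62, 72])  # illustrative input

steps = np.diff(a)                    # a[i+1] - a[i] for each neighbouring pair
drops = steps < 0                     # True where the value decreases
cut_points = np.where(drops)[0] + 1   # index of the first element of each new chunk
chunks = np.split(a, cut_points)      # list of sub-arrays

for chunk in chunks:
    print(chunk.tolist())
```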

Related

How to concat corresponding elements (which are integers) of two 2D arrays of the same shape?

I have two 10x8 arrays, C and D. I need to concat the corresponding elements of these two arrays and store the result in another 10x8 array. For example, if C = [[1, 2, 3, 4, 5, 6, 7, 8],[9, 10, 11, 12, 13, 14, 15, 16],[8 elements],... [10th row which has 8 elements]] and D = [[100, 99, 98, 97, 96, 95, 94, 93],[92, 90, 89, 88, 87, 86, 85, 84],[8 elements],... [10th row which has 8 elements]]. I need another 10x8 array, E, which looks like E = [[1100, 299, 398, 497, 596, 695, 794, 893], [992, 1090, 1189, 1288, 1387, 1486, 1585, 1684],... [10th row which contains concatenation of the corresponding 8 elements in the 10th row of C and D]]. How do I obtain this? Appreciate your help!
Nested list comprehension:
>>> C = [[1, 2, 3, 4, 5, 6, 7, 8],[9, 10, 11, 12, 13, 14, 15, 16]]
>>> D = [[100, 99, 98, 97, 96, 95, 94, 93],[92, 90, 89, 88, 87, 86, 85, 84]]
>>> [[int(f'{c}{d}') for c, d in zip(lc, ld)] for lc, ld in zip(C, D)]
[[1100, 299, 398, 497, 596, 695, 794, 893],
[992, 1090, 1189, 1288, 1387, 1486, 1585, 1684]]
Just for fun, here is a functional solution:
>>> from functools import partial
>>> from itertools import starmap
>>> list(map(list, map(partial(map, int), map(partial(starmap, '{0}{1}'.format), map(zip, C, D)))))
[[1100, 299, 398, 497, 596, 695, 794, 893],
[992, 1090, 1189, 1288, 1387, 1486, 1585, 1684]]
Just loop over the arrays and concatenate the corresponding elements. If both arrays have the same dimensions, you do not need two separate loops; a single pass over the paired elements is enough, which makes this a very easy method.
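If C and D are NumPy arrays rather than nested lists, the same element-wise concatenation can be sketched without an explicit Python loop by going through strings. This assumes non-negative integers and uses only the first two rows from the question.

```python
import numpy as np

C = np.array([[1, 2, 3, 4, 5, 6, 7, 8],
              [9, 10, 11, 12, 13, 14, 15, 16]])
D = np.array([[100, 99, 98, 97, 96, 95, 94, 93],
              [92, 90, 89, 88, 87, 86, 85, 84]])

# Convert both arrays to strings, join element-wise, and convert back to int
E = np.char.add(C.astype(str), D.astype(str)).astype(int)
print(E)
```

The result keeps the shape of the inputs, so for the full 10x8 arrays E would also be 10x8.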

How to find consecutive items in a list, a set of 5, setting counter = 0 and reset the counter to check for next 5 consecutive values?

I have a list in which there are consecutive and non-consecutive values as shown in the code.
I want to find 5 consecutive values and then append the last 4 of them (excluding the first one) to a new list, row1. For example, 201, 202, 203, 204, 205 is a set of 5 consecutive values, and I want to store 202, 203, 204, 205 in row1. Then the next set begins from 206 onwards. If there is another set of 5, the same approach should be followed. I have written some simple code for it, but the output I get when I print row1 is [202, 203, 204, 205, 206, 231]. I want [202, 203, 204, 205, 231]. What "if" condition should I provide to get the desired result? I tried break when count becomes 4, but it stops the execution of the loop. So I used continue, but it doesn't give the required output. Any help is much appreciated.
row1 = []
l = [201, 202, 203, 204, 205, 206, 230, 231]
count = 0
for i in range(len(l) - 1):
    if l[i+1] == l[i] + 1:
        row1.append(l[i+1])
        count += 1
        if count == 4:
            continue
    else:
        count = 0
print(row1)
Thanks for all the replies but it did not resolve my above query. I am adding more information here for my intended output.
The set can be maximum of 5 consecutive values. Even if they are 2 immediate consecutive values, or 3 or 4 still it will be a set but maximum should be 5.
The values will be in the set if the difference is of 1. For example if I have 206, 210 then that is not the set.
Only if in my list I get immediate values like 206, 207, 208 even if they are a set of 3, the output will have 207, 208.
I have shown examples below on how the output should be. I am trying to write a code which works for all these cases. I really appreciate the help. Sorry if anything is not clear.
Examples-----
1) li = [11, 12, 224] output should be [12]
2) li = [11, 12, 13, 14, 15, 224] output should be [12, 13, 14, 15]
3) li = [11, 12, 13, 14, 15, 16, 224] output should be [12, 13, 14, 15]
4) li = [11, 12, 13, 14, 15, 224, 225] output should be [12, 13, 14, 15, 225]
5) li = [184, 185, 186, 187, 201, 202, 203, 204, 205, 206] output should be [185, 186, 187, 202, 203, 204, 205]
6) li = [201, 202, 203, 204, 205, 206, 230, 231] output should be [202, 203, 204, 205, 231]
You can compare the sum of each 5-element window with the sum of range(1st element, 5th element + 1); if the sums are equal, the numbers are consecutive.
lst = [201, 202, 203, 204, 205, 206, 230, 233, 234, 235, 236, 237, 239]
i, res = 0, []
while (i + 4) < len(lst):
    sub = lst[i: i + 5]
    if sum(range(sub[0], sub[-1] + 1)) == sum(sub):
        # Keep the last four values of the window and jump past it
        res += lst[i + 1:i + 5]
        i += 5
    else:
        i += 1
print(res)  # [202, 203, 204, 205, 234, 235, 236, 237]
Here is my proposal, if I understood you correctly:
from math import ceil
l = [201, 202, 203, 204, 205, 206, 230, 231, 234, 235, 236, 237, 238]
l = sorted(l)
print([l[i+1:i+5] for i in range(0,max(len(l), int(ceil(len(l)/5.0)*5)),5) if l[i+1:i+5] !=[]])
Result:
[[202, 203, 204, 205], [230, 231, 234, 235], [237, 238]]
This code does exactly what OP wants. I begin at index 0 and append the next four consecutive numbers (if there are that many). After that I move to the element after the last element of the previous run and start all over again.
row1 = []
l = [201, 202, 203, 204, 205, 206, 230, 231]
# Counter to iterate through l
i = 0
n = len(l)
while i < n:
    for j in range(1, 5):
        ind = i + j
        # We use try and except so that we can handle any out-of-bounds errors
        try:
            curr = l[ind]
            prev = l[ind-1]
        except IndexError:
            i = ind
            break
        if curr == prev+1:
            row1.append(curr)
        else:
            i = ind
            break
    else:
        # No break: all five were consecutive, so move on to the next element
        i = ind + 1
print(row1)
Output for this:
[202, 203, 204, 205, 231]
Not completely sure, but it appears to me that you want to find all sequences of 5 elements in a row. If that is the case you need a list of lists as a result. The counter can then be replaced by the length of the list holding the current sequence:
l = [201, 202, 203, 204, 205, 206, 230, 231, 234, 235, 236, 237, 238]
sequences = []
count = 0
current_sequence = []
for i in range(len(l) - 1):
    if l[i+1] == l[i] + 1:
        current_sequence.append(l[i+1])
        if len(current_sequence) == 4:
            sequences.append(current_sequence)
            current_sequence = []
            continue
    else:
        current_sequence = []
print(sequences)
Output for input with two sequences:
[[202, 203, 204, 205], [235, 236, 237, 238]]
I think this is a better solution, as it does not create lists that do not appear in the result:
l = [201, 202, 203, 204, 205, 206, 207, 230, 231, 234, 235, 236, 237, 238]
sequences = []
i = 0
while i < len(l) - 4:
    if l[i] == l[i+1]-1 == l[i+2]-2 == l[i+3]-3 == l[i+4]-4:
        sequences.append(l[i+1:i+5])
        i = i + 5
    else:
        i = i + 1
print(sequences)
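For reference, the six examples listed in the question (runs of up to 5 consecutive values, keeping everything but the first value of each run) can be reproduced in a single pass. This is a minimal sketch of one possible reading of the requirements; the function name tails_of_runs and the max_len parameter are illustrative and do not come from the thread.

```python
def tails_of_runs(values, max_len=5):
    """Keep every element of each consecutive run except its first one.

    A run is a stretch of values that each increase by exactly 1 and is
    cut off once it reaches max_len elements (hypothetical parameter).
    """
    result = []
    run_len = 0  # length of the current run; 0 means no run has started
    for k, v in enumerate(values):
        if run_len and run_len < max_len and v == values[k - 1] + 1:
            result.append(v)   # v extends the current run
            run_len += 1
        else:
            run_len = 1        # v starts a new run and is itself dropped
    return result


print(tails_of_runs([201, 202, 203, 204, 205, 206, 230, 231]))
```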

Using np.newaxis to compute sum of squared differences

In chapter 2 of "Python Data Science Handbook" by Jake VanderPlas, he computes the sum of squared differences of several 2-d points using the following code:
import numpy as np
rand = np.random.RandomState(42)
X = rand.rand(10,2)
dist_sq = np.sum((X[:,np.newaxis,:] - X[np.newaxis,:,:]) ** 2, axis=-1)
Two questions:
Why is a third axis created? What is the best way to visualize what is going on?
Is there a more intuitive way to perform this calculation?
Why is a third axis created? What is the best way to visualize what is going on?
Adding new dimensions before adding or subtracting is a relatively common trick for generating all pairs by using broadcasting (None is the same as np.newaxis here):
>>> a = np.arange(10)
>>> a[:,None]
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
>>> a[None,:]
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> a[:,None] + 100*a[None,:]
array([[  0, 100, 200, 300, 400, 500, 600, 700, 800, 900],
       [  1, 101, 201, 301, 401, 501, 601, 701, 801, 901],
       [  2, 102, 202, 302, 402, 502, 602, 702, 802, 902],
       [  3, 103, 203, 303, 403, 503, 603, 703, 803, 903],
       [  4, 104, 204, 304, 404, 504, 604, 704, 804, 904],
       [  5, 105, 205, 305, 405, 505, 605, 705, 805, 905],
       [  6, 106, 206, 306, 406, 506, 606, 706, 806, 906],
       [  7, 107, 207, 307, 407, 507, 607, 707, 807, 907],
       [  8, 108, 208, 308, 408, 508, 608, 708, 808, 908],
       [  9, 109, 209, 309, 409, 509, 609, 709, 809, 909]])
Your example does the same, just with 2-vectors instead of scalars at the innermost level:
>>> X[:,np.newaxis,:].shape
(10, 1, 2)
>>> X[np.newaxis,:,:].shape
(1, 10, 2)
>>> (X[:,np.newaxis,:] - X[np.newaxis,:,:]).shape
(10, 10, 2)
Thus we find that the 'magical subtraction' is just every pair of points in X subtracted from each other, coordinate by coordinate.
Is there a more intuitive way to perform this calculation?
Yes, use scipy.spatial.distance.pdist for pairwise distances. To get an equivalent result to your example:
from scipy.spatial.distance import pdist, squareform
dist_sq = squareform(pdist(X))**2
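To see what the broadcast expression computes, it can be checked against a plain double loop over all pairs of points. This is only a sketch using NumPy alone; the name dist_sq_loop is introduced here for the comparison.

```python
import numpy as np

rand = np.random.RandomState(42)
X = rand.rand(10, 2)

# Broadcasting: (10, 1, 2) - (1, 10, 2) -> (10, 10, 2)
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
dist_sq = np.sum(diff ** 2, axis=-1)

# The same quantity with explicit loops over every pair (i, j)
dist_sq_loop = np.empty((10, 10))
for i in range(10):
    for j in range(10):
        dist_sq_loop[i, j] = np.sum((X[i] - X[j]) ** 2)

print(np.allclose(dist_sq, dist_sq_loop))
```

The third axis holds the per-coordinate differences for each pair, and summing over it collapses them into one squared distance per pair.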

Plotting a list of numbers over a range of numbers

I have a data frame that looks like this:
Season Dist
0 '14 - '15 [120, 128, 175, 474, 615]
1 '15 - '16 [51, 305, 398, 839, 991, 1093, 1304]
2 '16 - '17 [223, 293, 404, 588, 661, 706, 964, 1049, 1206]
3 '17 - '18 [12, 37, 204, 229, 276, 349, 809, 845, 1072, 1...
4 '18 - '19 [210, 214, 259, 383, 652, 798, 1150]
5 '19 - '20 [182, 206, 221, 282, 283, 297, 1330, 1332]
I'm trying to plot it with matplotlib where the x axis is the range of instances and for each season on the y axis, the plot shows the distribution of the df['Dist']. I've sketched a very crappy graph below to illustrate my point.
Does anyone know how I could do this?
Plot each list individually on the same graph. The list values serve as x-coordinates; for the y-coordinates, map each season to an int, i.e. something like this:
   Season  Dist
0  0       [120, 128, 175, 474, 615]
1  1       [51, 305, 398, 839, 991, 1093, 1304]
2  2       [223, 293, 404, 588, 661, 706, 964, 1049, 1206]
A scatter plot requires a y-coordinate for every x-coordinate, so create something like this:
y                  x
[0,0,0,0,0]        [120, 128, 175, 474, 615]
[1,1,1,1,1,1,1]    [51, 305, 398, 839, 991, 1093, 1304]
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.DataFrame({'Season': ['14 - 15', '15 - 16', '16 - 17'],
                   'Dist': [[120, 128, 175, 474, 615],
                            [51, 305, 398, 839, 991, 1093, 1304],
                            [223, 293, 404, 588, 661, 706, 964, 1049, 1206]]})
y = np.arange(len(df))  # map the seasons to integers
for i in range(len(df)):
    # create a list of y-coordinates, one for every x-coordinate
    plt.scatter(df['Dist'][i], [y[i]] * len(df['Dist'][i]))
plt.yticks(y, df['Season'])  # show the actual seasons as y-ticks
plt.show()
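The per-season y-coordinates can also be built in one step with np.repeat, which gives two flat arrays suitable for a single scatter call. This is a sketch that works on plain lists instead of the data frame; the variable names are illustrative.

```python
import numpy as np

seasons = ['14 - 15', '15 - 16', '16 - 17']
dist = [[120, 128, 175, 474, 615],
        [51, 305, 398, 839, 991, 1093, 1304],
        [223, 293, 404, 588, 661, 706, 964, 1049, 1206]]

lengths = [len(d) for d in dist]
x = np.concatenate(dist)                       # all distances in one flat array
y = np.repeat(np.arange(len(dist)), lengths)   # season index repeated once per point

print(x.shape, y.shape)
```

With these arrays, one call such as plt.scatter(x, y) followed by plt.yticks(np.arange(len(seasons)), seasons) would draw all seasons at once, at the cost of losing the per-season colours that the loop gives for free.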

Time series with matrix

I have data in a .mat file and want to separate the values for each second. The data is a 3-dimensional time-series matrix of shape (7,5,2500), and I want to get the values of (7,5,1) ... (7,5,2500) separately and save them,
for example
array([155, 33, 129,167,189,63,35
161, 218, 6,58,36,25,3
89,63,36,25,78,95,21
78,52,36,56,25,15,68
]],
[215, 142, 235,
143, 249, 164],
[221, 71, 229,
56, 91, 120],
[236, 4, 177,
171, 105, 40])
To get each part of this data, for example this matrix
[215, 142, 235,
143, 249, 164]
what should I do?
a = [[155, 33, 129, 161, 218, 6],
[215, 142, 235, 143, 249, 164],
[221, 71, 229, 56, 91, 120],
[236, 4, 177, 171, 105, 40]]
print(a[1])
Assuming you have your data saved in a numpy array you could use slicing to extract the sub-matrices you need. Here is an example with a (3,5,3) matrix (but the example could be applied to any dimension):
import numpy
A = numpy.array([[[1,1,1],
                  [2,2,2],
                  [3,3,3],
                  [4,4,4],
                  [5,5,5]],
                 [[11,11,11],
                  [21,21,21],
                  [31,31,31],
                  [41,41,41],
                  [51,51,51]],
                 [[12,12,12],
                  [22,22,22],
                  [32,32,32],
                  [42,42,42],
                  [52,52,52]]])
sub_matrix_1 = A[:,:,0]  # first slice along the last axis
print(sub_matrix_1)
Will produce:
[[ 1  2  3  4  5]
 [11 21 31 41 51]
 [12 22 32 42 52]]
EDIT: it is also possible to iterate over the array to get the 3rd dimension array:
for i in range(A.shape[-1]):
    print(A[:,:,i])
    # Your submatrix is A[:,:,i], you can directly manipulate it
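For the (7,5,2500) case in the question, the per-time-step slices could be extracted and saved roughly as follows. This is a sketch under assumptions: the array here is synthetic stand-in data (loading the real .mat file is not shown), and the output directory and file names are hypothetical.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for the real (7, 5, 2500) time series
data = np.arange(7 * 5 * 2500).reshape(7, 5, 2500)

# Move the time axis to the front so iteration yields one (7, 5) frame per step
frames = np.moveaxis(data, -1, 0)

out_dir = tempfile.mkdtemp()  # hypothetical output location
for t, frame in enumerate(frames[:3]):  # only the first three steps, to keep the demo small
    np.save(os.path.join(out_dir, f"frame_{t:04d}.npy"), frame)

# Read one frame back to confirm it matches the original slice
reloaded = np.load(os.path.join(out_dir, "frame_0001.npy"))
print(reloaded.shape)
```

Dropping the [:3] would write all 2500 frames; whether one file per step is preferable to keeping a single array depends on how the slices are used later.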
