I'm trying to convert a list that contains arrays into a 1D array.
From:
[array([1145, 330, 1205, 364], dtype=int64),
array([1213, 330, 1247, 364], dtype=int64),
array([ 883, 377, 1025, 412], dtype=int64),
array([1038, 377, 1071, 404], dtype=int64),
array([1085, 377, 1195, 405], dtype=int64),
array([1210, 377, 1234, 405], dtype=int64)]
Required:
array([array([1145, 330, 1205, 364], dtype=int64),
       array([1213, 330, 1247, 364], dtype=int64),
       array([883, 377, 1025, 412], dtype=int64),
       array([1038, 377, 1071, 404], dtype=int64),
       array([1085, 377, 1195, 405], dtype=int64),
       array([1210, 377, 1234, 405], dtype=int64)], dtype=object)
I tried this code, but I'm getting a 2D array when I need a 1D array like the one above:
art = []
for i in boxs:
    art.append(np.array(i, dtype=np.int64))
new_ary = np.array(art)
new_ary
Your question is a little ambiguous. Your input is already 2D, and in
for i in boxs:
    art.append(np.array(i, dtype=np.int64))
i represents a list in each iteration, so you are appending a list to art each time.
You can try new_ary = new_ary.flatten(). It will give you
[1145 330 1205 364 1213 330 1247 364 883 377 1025 412 1038 377 1071 404 1085 377 1195 405 1210 377 1234 405]
Otherwise, provide an output to clarify your question.
When you change the list of arrays to an array of arrays with dtype=object, the values inside are still int.
np.array(a, dtype=object)
array([[1145, 330, 1205, 364],
       [1213, 330, 1247, 364],
       [883, 377, 1025, 412],
       [1038, 377, 1071, 404],
       [1085, 377, 1195, 405],
       [1210, 377, 1234, 405]], dtype=object)
type(np.array(a, dtype=object)[0][0])
Out[151]: int
Update
If you want to flatten the 2D array to a 1D array, you can use np.ravel
np.ravel(a)
array([1145, 330, 1205, 364, 1213, 330, 1247, 364, 883, 377, 1025,
       412, 1038, 377, 1071, 404, 1085, 377, 1195, 405, 1210, 377,
       1234, 405], dtype=int64)
Or, if you want a 1D list, you can first use map to convert each array to a list and then use reduce:
from functools import reduce
mylist = list(map(list, a))
print(reduce((lambda x, y: x+y) , mylist))
[1145, 330, 1205, 364, 1213, 330, 1247, 364, 883, 377, 1025, 412, 1038, 377, 1071, 404, 1085, 377, 1195, 405, 1210, 377, 1234, 405]
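If, instead, the goal is the literal 1D object array of int64 arrays shown in the question, one option (a minimal sketch; boxs is assumed to be the input list from the question) is to preallocate an empty object array and fill it, which prevents NumPy from stacking the equal-length rows into a 2D array:
import numpy as np
out = np.empty(len(boxs), dtype=object)                  # 1D container of Python objects
out[:] = [np.asarray(b, dtype=np.int64) for b in boxs]   # fill without stacking
out.shape   # (6,) - one element per inner array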
Use .reshape(-1) on the individual arrays inside that list.
In chapter 2 of "Python Data Science Handbook" by Jake VanderPlas, he computes the sum of squared differences of several 2-d points using the following code:
rand = np.random.RandomState(42)
X = rand.rand(10,2)
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
Two questions:
Why is a third axis created? What is the best way to visualize what is going on?
Is there a more intuitive way to perform this calculation?
Why is a third axis created? What is the best way to visualize what is going on?
Adding new dimensions before adding/subtracting is a relatively common trick to generate all pairs, using broadcasting (None is the same as np.newaxis here):
>>> a = np.arange(10)
>>> a[:,None]
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
>>> a[None,:]
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> a[:,None] + 100*a[None,:]
array([[  0, 100, 200, 300, 400, 500, 600, 700, 800, 900],
       [  1, 101, 201, 301, 401, 501, 601, 701, 801, 901],
       [  2, 102, 202, 302, 402, 502, 602, 702, 802, 902],
       [  3, 103, 203, 303, 403, 503, 603, 703, 803, 903],
       [  4, 104, 204, 304, 404, 504, 604, 704, 804, 904],
       [  5, 105, 205, 305, 405, 505, 605, 705, 805, 905],
       [  6, 106, 206, 306, 406, 506, 606, 706, 806, 906],
       [  7, 107, 207, 307, 407, 507, 607, 707, 807, 907],
       [  8, 108, 208, 308, 408, 508, 608, 708, 808, 908],
       [  9, 109, 209, 309, 409, 509, 609, 709, 809, 909]])
Your example does the same, just with 2-vectors instead of scalars at the innermost level:
>>> X[:,np.newaxis,:].shape
(10, 1, 2)
>>> X[np.newaxis,:,:].shape
(1, 10, 2)
>>> (X[:,np.newaxis,:] - X[np.newaxis,:,:]).shape
(10, 10, 2)
Thus we find that the 'magical subtraction' gives all combinations of the coordinates of X subtracted from each other.
Is there a more intuitive way to perform this calculation?
Yes, use scipy.spatial.distance.pdist for pairwise distances. To get an equivalent result to your example:
from scipy.spatial.distance import pdist, squareform
dist_sq = squareform(pdist(X))**2
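As a quick sanity check (a sketch using the book's own setup), both routes produce the same matrix:
import numpy as np
from scipy.spatial.distance import pdist, squareform

rand = np.random.RandomState(42)
X = rand.rand(10, 2)
# broadcasting route: (10, 10, 2) differences, squared and summed over the last axis
dist_sq_broadcast = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
# scipy route: condensed pairwise distances, expanded to square form and squared
dist_sq_scipy = squareform(pdist(X)) ** 2
print(np.allclose(dist_sq_broadcast, dist_sq_scipy))   # True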
I have a square matrix like this:
[[0, 516, 226, 853, 1008, 1729, 346, 1353, 1554, 827, 226, 853, 1729, 1008],
[548, 0, 474, 1292, 1442, 2170, 373, 1801, 1989, 1068, 474, 1292, 2170, 1442],
[428, 466, 0, 1103, 1175, 1998, 226, 1561, 1715, 947, 0, 1103, 1998, 1175],
[663, 1119, 753, 0, 350, 1063, 901, 681, 814, 1111, 753, 0, 1063, 350],
[906, 1395, 1003, 292, 0, 822, 1058, 479, 600, 1518, 1003, 292, 822, 0],
[1488, 1994, 1591, 905, 776, 0, 1746, 603, 405, 1676, 1591, 905, 0, 776],
[521, 357, 226, 1095, 1167, 1987, 0, 1552, 1705, 1051, 226, 1095, 1987, 1167],
[1092, 1590, 1191, 609, 485, 627, 1353, 0, 422, 1583, 1191, 609, 627, 485],
[1334, 1843, 1436, 734, 609, 396, 1562, 421, 0, 1745, 1436, 734, 396, 609],
[858, 1186, 864, 1042, 1229, 1879, 984, 1525, 1759, 0, 864, 1042, 1879, 1229],
[428, 466, 0, 1103, 1175, 1998, 226, 1561, 1715, 947, 0, 1103, 1998, 1175],
[663, 1119, 753, 0, 350, 1063, 901, 681, 814, 1111, 753, 0, 1063, 350],
[1488, 1994, 1591, 905, 776, 0, 1746, 603, 405, 1676, 1591, 905, 0, 776],
[906, 1395, 1003, 292, 0, 822, 1058, 479, 600, 1518, 1003, 292, 822, 0]]
I need to remove, say, the columns and rows at indexes a1, a2, and a3, at the same time. How can I do this? What is a neat way?
Note that I need to get another square matrix: both the row and the column at a given index should be removed. Also note that when you remove a row/column, the indexes shift, so either I need to shift a1, a2, a3 accordingly or do something more clever.
An example case
The square matrix:
[[10,11,12,13],
[14,15,16,17],
[18,19,20,21],
[22,23,24,25]]
remove the rows and columns at indexes 1 and 3, and the result is:
[[10,12],
[18,20]]
If you are open to other packages, pandas can make it easy:
import pandas as pd
to_drop = [a1, a2, a3]  # e.g. [1, 3] for the sample matrix below
out = pd.DataFrame(a).drop(to_drop).drop(to_drop, axis=1).to_numpy()
Update: output of the code on sample data
array([[10, 12],
       [18, 20]])
If you want numpy only, and assuming the array is always square:
a = np.array([[10,11,12,13],
              [14,15,16,17],
              [18,19,20,21],
              [22,23,24,25]])
valid = [r for r in range(a.shape[0]) if r not in [1,3]]
a[valid][:,valid]
array([[10, 12],
       [18, 20]])
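A variant of the same idea (a sketch, using a boolean mask instead of a list of surviving indexes):
mask = np.ones(a.shape[0], dtype=bool)   # keep everything by default
mask[[1, 3]] = False                     # drop rows/columns 1 and 3
a[mask][:, mask]                         # filter rows first, then columns
array([[10, 12],
       [18, 20]])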
Try this method in numpy. np.ix_ builds a mesh of indices for you to index the rows and columns of the numpy array. The list of indexes to keep can simply be created by taking the set difference between the range of rows in the square matrix and the list of row/column indexes you want to remove -
sqm = np.array([[10,11,12,13],
                [14,15,16,17],
                [18,19,20,21],
                [22,23,24,25]])
rem = [1,3] #Rows/columns to remove
idx = list(set(range(sqm.shape[0])).difference(rem))
print('Rows/columns to keep:',idx)
output = sqm[np.ix_(idx,idx)]
print(output)
Rows/columns to keep: [0, 2]
array([[10, 12],
       [18, 20]])
EDIT: Benchmarking results are added below for a 10000x10000 square matrix and ~500 rows/columns to remove (MacBook Pro 13).
sqm = np.random.random((10000,10000))
rem = np.unique(np.random.randint(0,10000,size=500))
Quang Hoang's Approach - 841 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
MBeale's Approach - 1.62 s ± 48.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Akshay Sehgal's Approach - 655 ms ± 19.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
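For completeness (a sketch; this option was not part of the benchmark above), numpy also provides np.delete, which lets you express both removals against the original indexes, so no manual shifting is needed:
out = np.delete(np.delete(sqm, rem, axis=0), rem, axis=1)   # drop rows, then columns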
I have a list which needs to be split into multiple lists of differing size. The values in the original list increase until a split point, where the value drops before continuing to increase. The values must remain in order after being split.
E.g.
Original list
[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631, 70, 119,
125, 130, 134, 139, 144, 149, 154, 159, 614, 669, 100, 136, 144, 149, 153,
158, 163, 167, 173, 179, 62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117,
124, 426, 100, 129, 135, 140, 145, 151]
After split:
[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631]
[70, 119, 125, 130, 134, 139, 144, 149, 154, 159, 614, 669]
[100, 136, 144, 149, 153, 158, 163, 167, 173, 179]
[62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117, 124, 426]
[100, 129, 135, 140, 145, 151]
I have searched for a solution, finding numpy.where and numpy.diff as likely candidates, but I'm unsure how to implement them.
Thanks for the help!
Approach #1
Using NumPy's numpy.split to have list of arrays as output -
import numpy as np
arr = np.array(a) # a is input list
out = np.split(arr,np.flatnonzero(arr[1:] < arr[:-1])+1)
Approach #2
Using a list comprehension to split the list directly, thus avoiding numpy.split for efficiency purposes -
idx = np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
out = [a[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Output for given sample -
In [52]: idx = np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
In [53]: [a[idx[i]:idx[i+1]] for i in range(len(idx)-1)]
Out[53]:
[[100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631],
[70, 119, 125, 130, 134, 139, 144, 149, 154, 159, 614, 669],
[100, 136, 144, 149, 153, 158, 163, 167, 173, 179],
[62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117, 124, 426],
[100, 129, 135, 140, 145, 151]]
We are using np.diff here, which in this case is fed a list and then computes the differences. A better alternative would be to convert to an array first and compare shifted slices of it instead of actually computing the difference values. Thus, we could get idx like this as well -
arr = np.asarray(a)
idx = np.r_[0, np.flatnonzero(arr[1:] < arr[:-1])+1, len(arr)]
Let's time it and see if there's any improvement -
In [84]: a = np.random.randint(0,100,(1000,100)).cumsum(1).ravel().tolist()
In [85]: %timeit np.r_[0, np.flatnonzero(np.diff(a)<0)+1, len(a)]
100 loops, best of 3: 3.24 ms per loop
In [86]: arr = np.asarray(a)
In [87]: %timeit np.asarray(a)
100 loops, best of 3: 3.05 ms per loop
In [88]: %timeit np.r_[0, np.flatnonzero(arr[1:] < arr[:-1])+1, len(arr)]
10000 loops, best of 3: 77 µs per loop
In [89]: 3.05+0.077
Out[89]: 3.127
So, there's a marginal improvement with the shift-and-compare method, with the conversion np.asarray(a) eating up most of the runtime.
I know you tagged numpy, but here's an implementation without any dependencies too:
lst = [100, 564, 572, 578, 584, 590, 596, 602, 608, 614, 620, 625, 631, 70, 119,
       125, 130, 134, 139, 144, 149, 154, 159, 614, 669, 100, 136, 144, 149, 153,
       158, 163, 167, 173, 179, 62, 72, 78, 82, 87, 92, 97, 100, 107, 112, 117,
       124, 426, 100, 129, 135, 140, 145, 151]

def split(lst):
    last_pos = 0
    for i in range(1, len(lst)):
        if lst[i] < lst[i-1]:        # value dropped: close off the current run
            yield lst[last_pos:i]
            last_pos = i
    if last_pos <= len(lst) - 1:     # emit the trailing run
        yield lst[last_pos:]

print(list(split(lst)))
If you want to use numpy.diff and numpy.where, you can try
a = numpy.array(your original list)
numpy.split(a, numpy.where(numpy.diff(a) < 0)[0] + 1)
Explanation:
numpy.diff(a) calculates the difference of each item and its preceding one, and returns an array.
numpy.diff(a) < 0 returns a boolean array, where each element is replaced by whether it satisfies the predicate, in this case less than zero. This is the result of numpy.ndarray overloading the comparison operators.
numpy.where takes this boolean array and returns the indices where the element is not zero. In this context False evaluates to zero, so you take the indices of True.
[0] takes the first (and only) element of the tuple that numpy.where returns - the indices along the first axis.
+ 1 shifts the indices, because you want to break off after them, not before.
Finally, numpy.split breaks them off at the given indices.
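A tiny worked example (a sketch with made-up numbers) showing each intermediate step:
import numpy as np
a = np.array([3, 5, 2, 4, 1])
np.diff(a)                         # array([ 2, -3,  2, -3])
np.diff(a) < 0                     # array([False,  True, False,  True])
np.where(np.diff(a) < 0)[0]        # array([1, 3])
np.where(np.diff(a) < 0)[0] + 1    # array([2, 4])
np.split(a, [2, 4])                # [array([3, 5]), array([2, 4]), array([1])]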
import pandas as pd
import numpy as np
f = pd.read_csv('151101.mnd',skiprows=33, sep ='\s+',chunksize=30)
data = pd.concat(f)
data = data.convert_objects(convert_numeric=True)
print data.head()
print ''
height = data['#']
wspd = data['z']
hub = np.where(height==80)
print np.where(height==80)
Beginning Part of the File:
# z speed dir W sigW bck error
0 30 5.05 333.0 0.23 0.13 144000 0 NaN
1 40 5.05 337.1 -0.02 0.14 7690 0 NaN
2 50 5.03 338.5 0.00 0.15 4830 0 NaN
3 60 6.21 344.3 -0.09 0.18 6130 0 NaN
4 70 5.30 336.5 0.01 0.21 158000 0 NaN
Output (indices where height column = 80):
(array([ 5, 37, 69, 101, 133, 165, 197, 229, 261, 293, 325,
357, 389, 421, 453, 485, 517, 549, 581, 613, 645, 677,
709, 741, 773, 805, 837, 869, 901, 933, 965, 997, 1029,
1061, 1093, 1125, 1157, 1189, 1221, 1253, 1285, 1317, 1349, 1381,
1413, 1445, 1477, 1509, 1541, 1573, 1605, 1637, 1669, 1701, 1733,
1765, 1797, 1829, 1861, 1893, 1925, 1957, 1989, 2021, 2053, 2085,
2117, 2149, 2181, 2213, 2245, 2277, 2309, 2341, 2373, 2405, 2437,
2469, 2501, 2533, 2565, 2597, 2629, 2661, 2693, 2725, 2757, 2789,
2821, 2853, 2885, 2917, 2949, 2981, 3013, 3045, 3077, 3109, 3141,
3173, 3205, 3237, 3269, 3301, 3333, 3365, 3397, 3429, 3461, 3493,
3525, 3557, 3589, 3621, 3653, 3685, 3717, 3749, 3781, 3813, 3845,
3877, 3909, 3941, 3973, 4005, 4037, 4069, 4101, 4133, 4165, 4197,
4229, 4261, 4293, 4325, 4357, 4389, 4421, 4453, 4485, 4517, 4549,
4581], dtype=int64),)
So I want to find the wspd (data['z']) where the height (data['#']) equals 80, and store that as a variable. How do I do this? I tried np.where(height==80) and stored it as a variable hub, but when I take wspd at the indices of hub, wspd[hub], I get an error: ValueError: Can only tuple-index with a MultiIndex. Is there an easier way to do this?
Example usage:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [2, 3, 2, 5],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
print df1

c = df1[df1.A == 2].index        # get all the indices where value is 2 in column 'A'
d = df1.iloc[c,]                 # subset dataframe with only these row indices
d_values = df1.iloc[c,1].values  # return an array of values in column 'B'/2nd column
Output:
array(['B0', 'B2'], dtype=object)
In your case:
hub = data[data['#'] == 80].index
new_data = data.iloc[hub,]
To get the wspd values only, use this instead:
new_data = data.iloc[hub,1].values #assuming that it is the 2nd column always, this will return an array.
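As a side note (a sketch, assuming the column names '#' and 'z' from the question), boolean indexing with .loc selects the wspd values directly, without going through integer positions at all:
wspd_at_80 = data.loc[data['#'] == 80, 'z'].values   # wspd values where height == 80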