Extract specific elements from array to create ranges - python

I want to extract very specific elements from an array to create various ranges.
For example,
ranges = np.arange(0, 525, 25)
#array([ 0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500])
I want to create ranges that are output like this:
0 50
75 125
150 200
225 275
300 350
375 425
450 500
I know that if I want to turn every number into a range, I can write something like this:
for start, end in zip(ranges[:-2], ranges[2:]):
    print(start, end)
which gives this result:
0 50
25 75
50 100
75 125
100 150
125 175
150 200
175 225
200 250
225 275
250 300
275 325
300 350
325 375
350 400
375 425
400 450
425 475
450 500
However, I'm not sure how to extract every other element from the array.

There is no need to stack anything here, or create more buffers than the one you already have. A couple of reshapes and simple indices should do the trick.
Notice that you are taking the first and last element in a sequence of three. That means you can truncate your array to a multiple of three, reshape to have N rows of three elements, and simply extract the first and last one of each.
k = 3
ranges = np.arange(0, 525, 25)
ranges[:ranges.size - ranges.size % k].reshape(-1, k)[:, ::k - 1]
Here is a breakdown of the one-liner:
trimmed_to_multiple = ranges[:ranges.size - ranges.size % k]
reshaped_to_n_by_k = trimmed_to_multiple.reshape(-1, k)
first_and_last_column_only = reshaped_to_n_by_k[:, ::k - 1]
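For the sample array above, this evaluates to exactly the requested pairs:
first_and_last_column_only
# array([[  0,  50],
#        [ 75, 125],
#        [150, 200],
#        [225, 275],
#        [300, 350],
#        [375, 425],
#        [450, 500]])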

You can create your two ranges, then stack them on the column axis:
np.stack((np.arange(0, 500, 75), np.arange(50, 550, 75)), axis=1)
Generalized:
>>> start = 50
>>> step = 75
>>> end = 550
>>> np.stack((np.arange(0, end - start, step), np.arange(start, end, step)), axis=1)
array([[  0,  50],
       [ 75, 125],
       [150, 200],
       [225, 275],
       [300, 350],
       [375, 425],
       [450, 500]])
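If you prefer, np.column_stack is equivalent here and saves the explicit axis argument; it pairs the i-th element of each 1-D input as one row of the result:
>>> np.column_stack((np.arange(0, end - start, step), np.arange(start, end, step)))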

lowestMidPoint, highestMidPoint = 25, 475
ranges = [[midpoint - 25, midpoint + 25] for midpoint in range(lowestMidPoint, highestMidPoint + 1, 75)]
will get you the result as a list of lists. You can then call np.array(ranges) to convert it to a NumPy array.
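For the bounds above, a quick check of what the comprehension plus conversion produces (assuming numpy is imported as np):
print(np.array(ranges))
# [[  0  50]
#  [ 75 125]
#  [150 200]
#  [225 275]
#  [300 350]
#  [375 425]
#  [450 500]]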
To answer your question "how do I get every other entry in the array?": if you have a NumPy array
out =
1 2
3 4
5 6
you can slice out as follows
stride = 2
out[::stride,:]
to get every other row
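A self-contained version of that example:
import numpy as np

out = np.array([[1, 2], [3, 4], [5, 6]])
stride = 2
print(out[::stride, :])  # every other row
# [[1 2]
#  [5 6]]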

Related

Python Dataframe categorize values

I have data coming from the field and I want to categorize it into bins of a specific size.
I want to use ranges of 100, that is, 0-100, 100-200, 200-300, and so on.
My code:
df=pd.DataFrame([112,341,234,78,154],columns=['value'])
value
0 112
1 341
2 234
3 78
4 154
Expected answer:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
My code:
df['value_range'] = df['value'].apply(lambda x:[a,b] if x>a and x<b for a,b in zip([0,100,200,300,400],[100,200,300,400,500]))
Present solution:
SyntaxError: invalid syntax
You can use pd.cut:
df["value_range"] = pd.cut(df["value"], [0, 100, 200, 300, 400], labels=['0-100', '100-200', '200-300', '300-400'])
print(df)
Prints:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
You can use pd.IntervalIndex.from_tuples. Just set the tuple values to the bin edges for your data and you should be good to go!
df = pd.DataFrame([112,341,234,78,154],columns=['value'])
bins = pd.IntervalIndex.from_tuples([(0, 100), (100, 200), (200, 300), (300, 400)])
df['value_range'] = pd.cut(df['value'], bins)
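With IntervalIndex bins, the value_range column holds the intervals themselves rather than string labels, so printing df should give something like:
print(df)
#    value value_range
# 0    112  (100, 200]
# 1    341  (300, 400]
# 2    234  (200, 300]
# 3     78    (0, 100]
# 4    154  (100, 200]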

Numpyic way to take the first N rows and columns out of every M rows and columns from a square matrix

I have a 20 x 20 square matrix. I want to take the first 2 rows and columns out of every 5 rows and columns, which means the output should be an 8 x 8 square matrix. This can be done in 2 consecutive steps as follows:
import numpy as np
m = 5
n = 2
A = np.arange(400).reshape(20,-1)
B = np.asarray([row for i, row in enumerate(A) if i % m < n])
C = np.asarray([col for j, col in enumerate(B.T) if j % m < n]).T
However, I am looking for efficiency. Is there a more Numpyic way to do this? I would prefer to do this in one step.
You can use np.ix_ to retain the elements whose row / column indices are less than 2 modulo 5:
import numpy as np
m = 5
n = 2
A = np.arange(400).reshape(20,-1)
mask = np.arange(20) % 5 < 2
result = A[np.ix_(mask, mask)]
print(result)
This outputs:
[[  0   1   5   6  10  11  15  16]
 [ 20  21  25  26  30  31  35  36]
 [100 101 105 106 110 111 115 116]
 [120 121 125 126 130 131 135 136]
 [200 201 205 206 210 211 215 216]
 [220 221 225 226 230 231 235 236]
 [300 301 305 306 310 311 315 316]
 [320 321 325 326 330 331 335 336]]
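The same idea as a small parametrized helper (the function name is my own; it assumes a square matrix):
import numpy as np

def take_first_n_of_every_m(A, m, n):
    # boolean mask: True for the first n indices of every block of m
    mask = np.arange(A.shape[0]) % m < n
    return A[np.ix_(mask, mask)]

result = take_first_n_of_every_m(np.arange(400).reshape(20, -1), m=5, n=2)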
Very similar to the accepted answer, but this references the row/column indices directly. It would be interesting to see whether a benchmark shows any difference from using np.ix_() in the accepted answer.
Return Specific Row/Column by Numeric Indices
import numpy as np
m = 5
n = 2
A = np.arange(400).reshape(20, -1)
rowAndColIds = list(filter(lambda x: x % m < n, range(20)))
# print(rowAndColIds)
result = A[:, rowAndColIds][rowAndColIds]
print(result)
You could use index broadcasting:
i = (np.r_[:20:5][:, None] + np.r_[:2]).ravel()  # [0, 1, 5, 6, 10, 11, 15, 16]
A[i[:, None], i]
output:
array([[  0,   1,   5,   6,  10,  11,  15,  16],
       [ 20,  21,  25,  26,  30,  31,  35,  36],
       [100, 101, 105, 106, 110, 111, 115, 116],
       [120, 121, 125, 126, 130, 131, 135, 136],
       [200, 201, 205, 206, 210, 211, 215, 216],
       [220, 221, 225, 226, 230, 231, 235, 236],
       [300, 301, 305, 306, 310, 311, 315, 316],
       [320, 321, 325, 326, 330, 331, 335, 336]])

How can I group multiple columns and sum the last one?

I have this problem which I've been trying to solve:
I want the code to take this DataFrame and group multiple columns based on the most frequent number and sum the values on the last column. For example:
df = pd.DataFrame({'A': [1000, 1000, 1000, 1000, 1000, 200, 200, 500, 500],
                   'B': [380, 380, 270, 270, 270, 45, 45, 45, 55],
                   'C': [380, 380, 270, 270, 270, 88, 88, 88, 88],
                   'D': [45, 32, 67, 89, 51, 90, 90, 90, 90]})
df
A B C D
0 1000 380 380 45
1 1000 380 380 32
2 1000 270 270 67
3 1000 270 270 89
4 1000 270 270 51
5 200 45 88 90
6 200 45 88 90
7 500 45 88 90
8 500 55 88 90
I would like the code to show the result below:
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Notice that the most frequent value in the first rows is 1000, so I group by column 'A' and get the sum 284 in column 'D'. In the last rows, however, the most frequent value, 88, is not in column 'A' but in column 'C'. There I want to sum the values in column 'D' by grouping on column 'C', which gives 360. I hope that is clear.
I tried df['D'] = df.groupby(['A', 'B', 'C'])['D'].transform('sum'), but it does not produce the desired result shown above.
Is there any pandas-style way of solving this? Thanks in advance!
Code
import numpy as np

def get_count_sum(col, func):
    return df.groupby(col).D.transform(func)

ga = get_count_sum('A', 'count')
gb = get_count_sum('B', 'count')
gc = get_count_sum('C', 'count')

conditions = [
    (ga > gb) & (ga > gc),
    (gb > ga) & (gb > gc),
    (gc > ga) & (gc > gb),
]

choices = [get_count_sum('A', 'sum'),
           get_count_sum('B', 'sum'),
           get_count_sum('C', 'sum')]

df['D'] = np.select(conditions, choices)
df
Output
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Explanation
Since we need to group by whichever of columns 'A', 'B', or 'C' contains the most repeated value, we first compute the per-row group counts and store the groupby output in ga, gb, gc for columns A, B, C respectively.
conditions checks which column has the most frequent value for each row.
choices holds the corresponding group sums to pick from.
np.select works like an if-elif-else chain: for each row it returns the choice whose condition is true.
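As a minimal, self-contained illustration of np.select semantics (toy data, unrelated to the DataFrame above):
import numpy as np

x = np.array([1, 5, 10])
conditions = [x < 3, x < 7]    # checked in order, like if-elif
choices = ['small', 'medium']  # value taken where its condition is True
print(np.select(conditions, choices, default='large'))
# ['small' 'medium' 'large']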

Is there a way to remove similar (numerical) elements from array in python

I have a function which produces an array as such:
[ 14 48 81 111 112 113 114 148 179 213 247 279 311 313 314 344 345 346]
which corresponds to the data values where a curve crosses the x axis. As the data is imperfect, it generates false positives, where my output array has runs of elements all very close to each other, e.g. [111 112 113 114]. I need to remove the false positives from this array but still retain one positive around where each run of false positives shows up. Basically I need my function to produce an array more like
[ 14 48 81 112 148 179 213 247 279 313 345]
where the false positives from imperfect data have been removed.
Here is a possible approach:
arr = [14, 48, 81, 111, 112, 113, 114, 148, 179, 213, 247, 279, 311, 313, 314, 344, 345, 346]

def filter_arr(arr, offset):
    filtered_nums = set()
    for num in sorted(arr):
        # Check if there are any "similar" numbers already found
        if any(num + x in filtered_nums for x in range(-offset, offset + 1)):
            continue
        else:
            filtered_nums.add(num)
    return sorted(filtered_nums)
Then you can apply the filtering with any offset that you think makes the most sense.
filter_arr(arr, offset=5)
Output: [14, 48, 81, 111, 148, 179, 213, 247, 279, 311, 344]
This also works. arr is the input list and num is the maximum gap between values that are treated as duplicates of each other; rather than removing items from the list while iterating over it (which skips elements), it builds a new list:
def check(arr, num):
    result = []
    prev = None
    for value in sorted(arr):
        # keep a value only when the gap to the previous value exceeds num
        if prev is None or value - prev > num:
            result.append(value)
        prev = value
    return result

yourarray = [14, 48, 81, 111, 112, 113, 114, 148, 179, 213, 247, 279, 311, 313, 314, 344, 345, 346]
print(check(yourarray, 1))
# [14, 48, 81, 111, 148, 179, 213, 247, 279, 311, 313, 344]
I would do it the following way.
Conceptually:
Define the "tens" of a number as how many whole tens fit into it (integer division by 10): the tens of 111 is 11, the tens of 247 is 24, the tens of 250 is 25, and so on.
For our data: if a number whose tens value has already been seen appears again, discard it.
Code:
data = [14, 48, 81, 111, 112, 113, 114, 148, 179, 213, 247, 279, 311, 313, 314, 344, 345, 346]
cleaned = [i for inx, i in enumerate(data) if i // 10 not in [j // 10 for j in data[:inx]]]
print(cleaned)  # [14, 48, 81, 111, 148, 179, 213, 247, 279, 311, 344]
Note that 10 is only an example value that you can replace with another; a bigger value means more elements will potentially be removed. Keep in mind that this solution buckets by absolute position, so close values that straddle a bucket boundary (for 10, e.g. 109 and 110) are treated as different and both stay in the output list; check that this is not a problem in your use case.
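A parametrized sketch of the same bucketing idea (width and seen are names I picked; a set keeps the membership test cheap):
data = [14, 48, 81, 111, 112, 113, 114, 148, 179, 213, 247, 279, 311, 313, 314, 344, 345, 346]
width = 10        # bucket width; larger values merge more aggressively
seen = set()
cleaned = []
for value in data:
    bucket = value // width
    if bucket not in seen:  # keep only the first value that lands in each bucket
        seen.add(bucket)
        cleaned.append(value)
print(cleaned)  # [14, 48, 81, 111, 148, 179, 213, 247, 279, 311, 344]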

Grouping data into bins

I want to subset the following data frame df into bins of a size 50:
ID FREQ
0 358081 6151
1 431511 952
2 410632 350
3 398149 220
4 177791 158
5 509179 151
6 485346 99
7 536655 50
8 389180 51
9 406622 45
10 410191 112
The result should be this one:
FREQ_BIN QTY_IDs
>200 4
150-200 2
100-150 1
50-100 3
<50 1
How can I do it? Should I use groupby or some other approach?
You could use pd.cut.
df.groupby(pd.cut(df.FREQ,
                  bins=[-np.inf, 50, 100, 150, 200, np.inf],
                  right=False)
          ).size()
right=False ensures that we take half-open intervals of the form [a, b), as your output suggests; note that, unlike with np.digitize, we need to include np.inf in the bins to get the "infinite endpoints".
Demo
>>> df.groupby(pd.cut(df.FREQ,
...                   bins=[-np.inf, 50, 100, 150, 200, np.inf],
...                   right=False)
...           ).size()
FREQ
[-inf, 50)    1
[50, 100)     3
[100, 150)    1
[150, 200)    2
[200, inf)    4
dtype: int64
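If you want the output to carry the same labels as your expected table, pd.cut also accepts a labels argument (the label strings below are chosen to match your example):
>>> df.groupby(pd.cut(df.FREQ,
...                   bins=[-np.inf, 50, 100, 150, 200, np.inf],
...                   right=False,
...                   labels=['<50', '50-100', '100-150', '150-200', '>200'])
...           ).size()
FREQ
<50        1
50-100     3
100-150    1
150-200    2
>200       4
dtype: int64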
