Binning a continuous variable into a given number of classes [duplicate] - python

I have a dataframe column which specifies how many times a user has performed an activity.
eg.
>>> df['ActivityCount']
Users ActivityCount
User0 220
User1 190
User2 105
User3 109
User4 271
User5 265
...
User95 64
User96 15
User97 168
User98 251
User99 278
Name: ActivityCount, Length: 100, dtype: int32
>>> activities = sorted(df['ActivityCount'].unique())
>>> activities
[9, 15, 16, 17, 20, 23, 25, 26, 28, 31, 33, 34, 36, 38, 39, 43, 49, 57, 59, 64, 65, 71, 76, 77, 78,
83, 88, 94, 95, 100, 105, 109, 110, 111, 115, 116, 117, 120, 132, 137, 138, 139, 140, 141, 144, 145, 148, 153, 155, 157, 162, 168, 177, 180, 182, 186, 190, 192, 194, 197, 203, 212, 213, 220, 223, 231, 232, 238, 240, 244, 247, 251, 255, 258, 260, 265, 268, 269, 271, 272, 276, 278, 282, 283, 285, 290]
According to their ActivityCount, I have to divide users into 5 different categories, e.g. A, B, C, D and E.
The ActivityCount range varies from time to time. In the above example it is roughly 9-290 (the lowest and highest of the series); it could be 5-500 or 5-30.
In the above example, I can take the max number of activities, divide it by 5, and categorize each user into ranges of width 58 (from 290/5), like Range A: 0-58, Range B: 59-116, Range C: 117-174, etc.
Is there any other way to achieve this using pandas or numpy, so that I can directly categorize the column in the given categories?
Expected output: -
>>> df
Users ActivityCount Category/Range
User0 220 D
User1 190 D
User2 105 B
User3 109 B
User4 271 E
User5 265 E
...
User95 64 B
User96 15 A
User97 168 C
User98 251 E
User99 278 E

The natural way to do that would be to split the range of the data into 5 equal-width bins and then assign each value to its bin. Luckily, pandas lets you do that easily:
df["category"] = pd.cut(df.Activity, 5, labels=["a", "b", "c", "d", "e"])
The output is something like:
Activity Category
34 115 b
15 43 a
57 192 d
78 271 e
26 88 b
6 25 a
55 186 d
63 220 d
1 15 a
76 268 e
An alternative view - clustering
In the above method, we've split the data into 5 bins whose widths are equal. An alternative, more sophisticated approach would be to split the data into 5 clusters, aiming to make the data points in each cluster as similar to each other as possible. In machine learning, this is known as a clustering problem.
One classic clustering algorithm is k-means. It's typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.), so this is a very simplistic case of clustering.
In this case, k-means clustering can be done in the following way:
import numpy as np
import pandas as pd
from scipy.cluster.vq import vq, kmeans, whiten

# activities is the sorted list of unique counts from the question
df = pd.DataFrame({"Activity": activities})
features = np.array([[x] for x in df.Activity], dtype=float)
# normalize to unit variance, as k-means expects
whitened = whiten(features)
# find 5 cluster centers, then assign each point to its nearest center
codebook, distortion = kmeans(whitened, 5)
code, dist = vq(whitened, codebook)
df["Category"] = code
And the output looks like:
Activity Category
40 138 1
79 272 0
72 255 0
13 38 3
41 139 1
65 231 0
26 88 2
59 197 4
76 268 0
45 145 1
A couple of notes:
The cluster labels are arbitrary and don't follow the activity ordering; in the output above, for example, label '0' happens to be the highest-activity cluster.
I didn't map the labels from 0-4 to A-E. This can easily be done using pandas' map.
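For instance, a minimal sketch of that mapping, reusing codebook and df from the snippet above and ordering the letters by cluster center so that 'A' is the least active group:
# order the cluster ids by their (whitened) center, smallest first
order = np.argsort(codebook.ravel())
letter_for_code = {c: "ABCDE"[rank] for rank, c in enumerate(order)}
df["Category"] = df["Category"].map(letter_for_code)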

Try the below solution:
df['Categ'] = pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'))
It creates the Categ column: the result of dividing ActivityCount
into 5 bins, labelled A through E.
The bin borders are set by dividing the full range into 5 subranges of
equal size.
You can also see the borders of each bin, calling:
pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'), retbins=True)[1]
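For the sample data above (min 9, max 290), the returned edges would look roughly like:
array([  8.719,  65.2  , 121.4  , 177.6  , 233.8  , 290.   ])
(pd.cut widens the lowest edge by 0.1% of the range so that the minimum value falls inside the first bin).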

Related

Can a DataFrame of NBA players be filtered by various conditions, e.g. combining the rows of players with multiple entries because they played on several teams?

I want to remove any players who didn't have over 1000 MP(minutes played).
I could easily write:
league_stats= pd.read_csv("1996.csv")
league_stats = league_stats.drop("Player-additional", axis=1)
league_stats_1000 = league_stats[league_stats['MP'] > 1000]
However, because players sometimes play for multiple teams in a year...this code doesn't account for that.
For example, Sam Cassell has four entries and none are above 1000 MP, but in total his MP for the season was over 1000. By running the above code I remove him from the new dataframe.
I am wondering if there is a way to filter the DataFrame by matching Rk (the Rk column gives players who played on different teams the same rank number for each team they played on) and then keep a player only if the total of their MP is >= 1000.
This is the page I got the data from: 1996-1997 season.
Above the data table and to the left of the blue check box there is a dropdown menu called "Share and Export". From there I clicked on "Get table as CSV (for Excel)". After that I saved the CSV in a text editor and changed the file extension to .csv to upload it to Jupyter Notebook.
This is a solution I came up with:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
tot_df = df.loc[df['Tm'] == 'TOT']
mp_1000 = tot_df.loc[tot_df["MP"] < 1000]
# Create list of indexes with unnecessary entries to be removed. We have TOT and don't need these rows.
# *** For the record, I came up with this list by manually going through the data.
indexes_to_remove = [5,6,24, 25, 66, 67, 248, 249, 447, 448, 449, 275, 276, 277, 19, 20, 21, 377, 378, 477, 478, 479,
54, 55, 451, 452, 337, 338, 156, 157, 73, 74, 546, 547, 435, 436, 437, 142, 143, 421, 42, 43, 232,
233, 571, 572, 363, 364, 531, 532, 201, 202, 111, 112, 139, 140, 307, 308, 557, 558, 93, 94, 512,
513, 206, 207, 208, 250, 259, 286, 287, 367, 368, 271, 272, 102, 103, 34, 35, 457, 458, 190, 191,
372, 373, 165, 166
]
df_drop_tot = df.drop(labels=indexes_to_remove, axis=0)
df_drop_tot
First off, no need to manually download the csv and then read it into pandas. You can load in the table using pandas' .read_html().
And yes, you can simply get the list of ranks, player names, or whatever, that have greater than 1000 MP, then use that list to filter the dataframe.
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
df = df[df['Rk'].ne('Rk')]  # drop the repeated header rows embedded in the table
df['MP'] = df['MP'].astype(int)
# Collect the 'Rk' values of rows with >= 1000 MP, then keep every row whose
# 'Rk' is in that list (all of a player's rows share the same 'Rk')
players_1000_rk_list = list(df[df['MP'] >= 1000]['Rk'])
players_df = df[df['Rk'].isin(players_1000_rk_list)]
Output: filters down from 574 rows to 282 rows
print(players_df)
Rk Player Pos Age Tm G ... AST STL BLK TOV PF PTS
0 1 Mahmoud Abdul-Rauf PG 27 SAC 75 ... 189 56 6 119 174 1031
1 2 Shareef Abdur-Rahim PF 20 VAN 80 ... 175 79 79 225 199 1494
3 4 Cory Alexander PG 23 SAS 80 ... 254 82 16 146 148 577
7 6 Ray Allen* SG 21 MIL 82 ... 210 75 10 149 218 1102
10 9 Greg Anderson C 32 SAS 82 ... 34 63 67 73 225 322
.. ... ... .. .. ... .. ... ... ... .. ... ... ...
581 430 Walt Williams SF 26 TOR 73 ... 197 97 62 174 282 1199
582 431 Corliss Williamson SF 23 SAC 79 ... 124 60 49 157 263 915
583 432 Kevin Willis PF 34 HOU 75 ... 71 42 32 119 216 842
589 438 Lorenzen Wright C 21 LAC 77 ... 49 48 60 79 211 561
590 439 Sharone Wright C 24 TOR 60 ... 28 15 50 93 146 390
[282 rows x 30 columns]
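If you'd rather not rely on the precomputed 'TOT' rows, a sketch of an equivalent filter that sums minutes across each player's team rows (reusing df from above):
# Sum MP over each player's entries; a player's rows all share the same 'Rk'.
per_team = df[df['Tm'] != 'TOT']  # drop the precomputed season totals
season_mp = per_team.groupby('Rk')['MP'].transform('sum')
players_1000_df = per_team[season_mp >= 1000]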

Numpyic way to take the first N rows and columns out of every M rows and columns from a square matrix

I have a 20 x 20 square matrix. I want to take the first 2 rows and columns out of every 5 rows and columns, which means the output should be a 8 x 8 square matrix. This can be done in 2 consecutive steps as follows:
import numpy as np
m = 5
n = 2
A = np.arange(400).reshape(20,-1)
B = np.asarray([row for i, row in enumerate(A) if i % m < n])
C = np.asarray([col for j, col in enumerate(B.T) if j % m < n]).T
However, I am looking for efficiency. Is there a more Numpyic way to do this? I would prefer to do this in one step.
You can use np.ix_ to retain the elements whose row / column indices are less than 2 modulo 5:
import numpy as np
m = 5
n = 2
A = np.arange(400).reshape(20,-1)
mask = np.arange(20) % 5 < 2
result = A[np.ix_(mask, mask)]
print(result)
This outputs:
[[ 0 1 5 6 10 11 15 16]
[ 20 21 25 26 30 31 35 36]
[100 101 105 106 110 111 115 116]
[120 121 125 126 130 131 135 136]
[200 201 205 206 210 211 215 216]
[220 221 225 226 230 231 235 236]
[300 301 305 306 310 311 315 316]
[320 321 325 326 330 331 335 336]]
Very similar to the accepted answer, but this just references the row/column indices directly. It would be interesting to see whether a benchmark shows any difference from using np.ix_() as in the accepted answer.
Return Specific Row/Column by Numeric Indices
import numpy as np

m = 5
n = 2
A = np.arange(400).reshape(20, -1)
# keep the indices whose value modulo m is less than n
rowAndColIds = list(filter(lambda x: x % m < n, range(20)))
result = A[:, rowAndColIds][rowAndColIds]
print(result)
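To answer the benchmark question raised above, a quick sketch with timeit (reusing the array and index lists from the snippets in this thread):
import timeit
import numpy as np

A = np.arange(400).reshape(20, -1)
mask = np.arange(20) % 5 < 2
ids = [x for x in range(20) if x % 5 < 2]

# np.ix_ builds an open mesh and indexes in a single step
t_ix = timeit.timeit(lambda: A[np.ix_(mask, mask)], number=100_000)
# chained fancy indexing selects columns first, then rows (two allocations)
t_chain = timeit.timeit(lambda: A[:, ids][ids], number=100_000)
print(f"np.ix_: {t_ix:.3f}s  chained: {t_chain:.3f}s")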
You could use index broadcasting:
# block starts [0, 5, 10, 15] plus offsets [0, 1] -> [0, 1, 5, 6, 10, 11, 15, 16]
i = (np.r_[:20:5][:, None] + np.r_[:2]).ravel()
# broadcast i against itself to pick the row/column intersections
A[i[:, None], i]
output:
array([[ 0, 1, 5, 6, 10, 11, 15, 16],
[ 20, 21, 25, 26, 30, 31, 35, 36],
[100, 101, 105, 106, 110, 111, 115, 116],
[120, 121, 125, 126, 130, 131, 135, 136],
[200, 201, 205, 206, 210, 211, 215, 216],
[220, 221, 225, 226, 230, 231, 235, 236],
[300, 301, 305, 306, 310, 311, 315, 316],
[320, 321, 325, 326, 330, 331, 335, 336]])

I'm getting a memory error while iterating over a pandas dataframe. How to resolve this?

I want to multiply each column with a different number and update the values for this data frame.
The code I have written is:
for j in test.columns:
    for i in r:
        for k in range(len(p)):
            test[i] = test[j].apply(lambda x: x*p[k])
            p.remove(p[k])
            break
        r.remove(i)
        break
And p is list of numbers that I want to multiply with.
p = [74, 46, 97, 2023, 364, 1012, 8, 242, 422, 78, 55, 90, 10, 44, 1, 3, 105, 354, 4, 26, 87, 18, 889, 9, 557, 630, 214, 1765, 760, 3344, 136, 26, 56, 10, 2, 2171, 125, 446, 174, 4, 174, 2, 80, 11, 160, 17, 72]
r is list of column names.
How to get rid of this error?
According to your initial statement "I want to multiply each column with a different number" I wrote this answer.
It's unclear why, in your code, you have to use remove so many times and why you use so many for loops.
In my case, I generated a random dataframe of 100 rows and 5 columns, and an array of 5 values for the multiplication.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 5)), columns=list('12345'))
p = np.random.randint(0, 100, 5)

for i in range(5):
    df.iloc[:, i] = df.iloc[:, i] * p[i]
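As a side note, the explicit loop isn't strictly needed: a NumPy array whose length equals the number of columns broadcasts across them, so the same result (with the same df and p as above) is:
# each column i is multiplied by p[i]; broadcasting aligns p with the columns
df = df * p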
Your stacktrace points to test[i] = test[j].apply(lambda x:x*p[k]).
Note that j (at least in your code sample) has not been set.
Maybe you should put i instead?
Another solution
If you want to multiply:
- each column of test,
- in place,
- by consecutive numbers from p (it may even be a plain Python list),
- but only the first len(p) rows,
- assuming that p is not longer than the number of rows in test,
you can do it with the following one-liner:
test.iloc[:len(p)] = test.iloc[:len(p)].apply(lambda col: col * p)
To test this solution, I created a test DataFrame containing the first 10 rows
from your sample.
Then I defined p as: p = [2, 3, 4, 5, 6, 7].
The result of my code was:
0 1 2 3 4
0 6 8 8 282 42
1 39 24 42 1434 153
2 4 0 8 336 48
3 40 20 65 1085 160
4 84 66 72 2130 366
5 91 49 119 3283 469
6 5 6 11 140 17
7 4 8 12 278 51
8 6 8 12 271 36
9 29 25 37 741 149
So, as far as the first 6 rows are concerned, in each column:
the first element has been multiplied by 2,
the second by 3,
and so on.
Maybe this is just what you need?

Is there a way to remove similar (numerical) elements from array in python

I have a function which produces an array as such:
[ 14 48 81 111 112 113 114 148 179 213 247 279 311 313 314 344 345 346]
which corresponds to data values where a curve crosses the x axis. As the data is imperfect, it generates false positives, where my output array has elements all very close to each other, e.g. [111 112 113 114]. I need to remove the false positives from this array but still retain the initial positive around where the false positives are showing. Basically, I need my function to produce an array more like
[ 14 48 81 112 148 179 213 247 279 313 345]
where the false positives from imperfect data have been removed.
Here is a possible approach:
arr = [14, 48, 81, 111, 112, 113, 114, 148, 179, 213, 247, 279, 311, 313, 314, 344, 345, 346]
def filter_arr(arr, offset):
    filtered_nums = set()
    for num in sorted(arr):
        # Skip num if a "similar" number (within offset) was already kept
        if any(num + x in filtered_nums for x in range(-offset, offset + 1)):
            continue
        else:
            filtered_nums.add(num)
    return list(sorted(filtered_nums))
Then you can apply the filtering with any offset that you think makes the most sense.
filter_arr(arr, offset=5)
Output: [14, 48, 81, 111, 148, 179, 213, 247, 279, 311, 344]
This can do it:
# arr is the array to filter; num is the maximum difference treated as "similar"
def check(arr, num):
    result = []
    for c in arr:
        # keep c only if it is not within num of an already-kept value
        if not any(abs(c - r) < num + 1 for r in result):
            result.append(c)
    return result

yourarray = [14, 48, 81, 111, 112, 113, 114, 148, 179, 213, 247, 279, 311, 313, 314, 344, 345, 346]
print(check(yourarray, 1))
I would do it the following way.
Conceptually:
Let's say the "tens" of a number is how many whole tens fit into it (integer division by 10): for example, the tens of 111 is 11, the tens of 247 is 24, and the tens of 250 is 25.
For our data: if a number's tens value has already appeared, discard that number.
Code:
data = [14,48,81,111,112,113,114,148,179,213,247,279,311,313,314,344,345,346]
cleaned = [i for inx,i in enumerate(data) if not i//10 in [j//10 for j in data[:inx]]]
print(cleaned) #[14, 48, 81, 111, 148, 179, 213, 247, 279, 311, 344]
Note that 10 is only an example value that you can replace with another; a bigger value means more elements will potentially be removed. Keep in mind a specific trait of this solution: close pairs that straddle a bucket boundary (for 10, e.g. 109 and 110) are treated as different and both stay in the output list, so check whether that is acceptable in your use case.

Group by continuous indexes in Pandas DataFrame

I'm working on code for sensors data analysis using python.
I'm taking rows from DataFrame (of gyro data in the example) according to some condition.
import pandas as pd
gyro = pd.read_csv("gyroOutput.csv")
above = gyro[gyro['gyro_z'] > 0.30]
above
Out[162]:
gyro_x gyro_y gyro_z elapsed
27 0.026632 0.021305 0.305731 4.927
28 0.017044 0.011718 0.344080 5.115
29 0.008522 0.013848 0.380299 5.289
30 0.006392 0.026632 0.412257 5.470
31 0.007457 0.005326 0.448476 5.643
32 -0.004261 0.012783 0.465521 5.822
33 -0.001065 0.000000 0.452737 6.002
34 0.009587 0.006392 0.445281 6.181
35 0.010653 0.001065 0.412257 6.361
36 0.006392 0.003196 0.373908 6.543
37 -0.006392 0.007457 0.320645 6.722
108 -0.036219 0.052198 0.323840 19.470
109 -0.061785 -0.001065 0.389887 19.654
110 -0.049002 0.018109 0.453803 19.835
111 -0.038350 0.078830 0.513458 20.015
112 -0.034088 0.011718 0.555003 20.192
113 -0.005326 -0.001065 0.607201 20.374
114 0.009587 0.058590 0.629571 20.553
115 0.038350 -0.029827 0.598679 20.727
116 0.006392 0.013848 0.546481 20.907
117 0.007457 0.030893 0.478304 21.086
118 0.012783 -0.035154 0.446346 21.266
119 0.005326 -0.026632 0.367516 21.444
352 0.007457 0.028762 0.313188 63.284
353 0.006392 -0.011718 0.332363 63.463
354 0.008522 0.030893 0.378169 63.643
355 -0.015979 0.039415 0.409062 63.822
356 -0.009587 -0.022371 0.423975 64.002
357 -0.008522 0.023436 0.450607 64.181
358 -0.011718 0.047937 0.453803 64.361
That result DataFrame (above) holds groups of rows with consecutive indexes, for example rows 27-37.
I want to get all those groups, but couldn't find any way to do it using DataFrame.groupby or any other function.
I could iterate over the rows and separate them myself, but maybe there's a simpler way using pandas functions.
IIUC:
In [294]: df.groupby(df.index.to_series().diff().ne(1).cumsum()).groups
Out[294]:
{1: Int64Index([27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37], dtype='int64'),
2: Int64Index([108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119], dtype='int64'),
3: Int64Index([352, 353, 354, 355, 356, 357, 358], dtype='int64')}
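To unpack why this works, a step-by-step sketch (with above being the filtered frame from the question):
idx = above.index.to_series()
jumps = idx.diff().ne(1)    # True at the start of each non-consecutive run
group_ids = jumps.cumsum()  # running count of jumps -> one id per block
for gid, block in above.groupby(group_ids):
    print(gid, block.index.min(), "-", block.index.max())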
