Grouping data into bins - python

I want to subset the following data frame df into bins of size 50:
ID FREQ
0 358081 6151
1 431511 952
2 410632 350
3 398149 220
4 177791 158
5 509179 151
6 485346 99
7 536655 50
8 389180 51
9 406622 45
10 410191 112
The result should be this one:
FREQ_BIN QTY_IDs
>200 3
150-200 2
100-150 1
50-100 3
<50 1
How can I do it? Should I use groupby or some other approach?

You could use pd.cut.
df.groupby(pd.cut(df.FREQ,
                  bins=[-np.inf, 50, 100, 150, 200, np.inf],
                  right=False)
          ).size()
right=False ensures that we take half-open intervals, as your expected output suggests. Unlike with np.digitize, we need to include np.inf in the bin edges to represent the "infinite endpoints".
Demo
>>> df.groupby(pd.cut(df.FREQ,
                      bins=[-np.inf, 50, 100, 150, 200, np.inf],
                      right=False)
              ).size()
FREQ
[-inf, 50) 1
[50, 100) 3
[100, 150) 1
[150, 200) 2
[200, inf) 4
dtype: int64
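If you also want the row labels from your expected output (<50, 50-100, ..., >200) instead of interval notation, a minimal sketch along the same lines, passing labels= to pd.cut (the name out is just illustrative), could be:
import numpy as np
import pandas as pd

labels = ['<50', '50-100', '100-150', '150-200', '>200']
out = (df.groupby(pd.cut(df.FREQ,
                         bins=[-np.inf, 50, 100, 150, 200, np.inf],
                         right=False,
                         labels=labels))
         .size()
         .rename('QTY_IDs'))
Note that the sample data has four FREQ values of 200 or more (6151, 952, 350, 220), so the last bin counts 4 rather than the 3 shown in the expected output.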

Related

Python Dataframe categorize values

I have data coming from the field and I want to categorize it into ranges of a specific size.
I want to categorize it in ranges of 100, that is, 0-100, 100-200, 200-300, and so on.
My data:
df=pd.DataFrame([112,341,234,78,154],columns=['value'])
value
0 112
1 341
2 234
3 78
4 154
Expected answer:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
My code:
df['value_range'] = df['value'].apply(lambda x:[a,b] if x>a and x<b for a,b in zip([0,100,200,300,400],[100,200,300,400,500]))
Current result:
SyntaxError: invalid syntax
You can use pd.cut:
df["value_range"] = pd.cut(df["value"], [0, 100, 200, 300, 400], labels=['0-100', '100-200', '200-300', '300-400'])
print(df)
Prints:
value value_range
0 112 100-200
1 341 300-400
2 234 200-300
3 78 0-100
4 154 100-200
You can also use pd.IntervalIndex.from_tuples. Just set the tuple values to the bin edges that match your data and you should be good to go.
df = pd.DataFrame([112,341,234,78,154],columns=['value'])
bins = pd.IntervalIndex.from_tuples([(0, 100), (100, 200), (200, 300), (300, 400)])
df['value_range'] = pd.cut(df['value'], bins)
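If you want value_range to hold plain strings like 0-100 rather than Interval objects, a minimal sketch that builds the label strings from the bin edges (the names edges and labels below are just illustrative) could be:
import pandas as pd

df = pd.DataFrame([112, 341, 234, 78, 154], columns=['value'])
edges = [0, 100, 200, 300, 400]
labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]  # '0-100', '100-200', ...
df['value_range'] = pd.cut(df['value'], edges, labels=labels)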

Bucket Customers based on Points

I have a customer table. I am trying to filter each ParentCustomerID based on the multiple point values it has, and to select one row per ParentCustomerID according to the conditions below:
IF 0 points & negative points, select the row with the largest negative point (i.e. -30 beats -20)
IF 0 points & positive points, select the row with the highest positive point
IF positive & negative points, select the row with the highest positive point
IF positive, 0, and negative points, select the row with the highest positive point
IF all points are 0, select any row with 0 points
IF all negative, select the row with the largest negative point (i.e. -30 beats -20)
1:M relationship between ParentCustomerID and ChildCustomerID
ParentCustomerID  ChildCustomerID  Points
101               1                0.0
101               2                -20.0
101               3                -30.50
102               4                20.86
102               5                0.0
102               6                50.0
103               7                10.0
103               8                50.0
103               9                -30.0
104               10               -30.0
104               11               0.0
104               12               60.80
104               13               40.0
105               14               0.0
105               15               0.0
105               16               0.0
106               17               -20.0
106               18               -30.80
106               19               -40.20
Output should be:
ParentCustomerID  ChildCustomerID  Points
101               3                -30.50
102               6                50.0
103               8                50.0
104               12               60.80
105               16               0.0
106               19               -40.20
Note: for customer 105, any row can be chosen because all of its rows have 0 points.
Note2: Points can be float and ChildCustomerID can be missing (np.nan)
I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.
Thank you in advance!
Code
df['abs'] = df['Points'].abs()
df['pri'] = np.sign(df['Points']).replace(0, -2)

(
    df.sort_values(['pri', 'abs'])
      .drop_duplicates('ParentCustomerID', keep='last')
      .drop(['pri', 'abs'], axis=1)
      .sort_index()
)
How this works
Assign a temporary column named abs holding the absolute values of Points.
Assign a temporary column named pri (priority) holding the arithmetic sign (-1, 0, 1) of each value in Points. Important trick: replace 0 with -2 so that zero always has the lowest priority.
Sort the rows by priority, then by absolute value.
Drop the duplicates in the sorted dataframe, keeping the last row per ParentCustomerID.
Result
ParentCustomerID ChildCustomerID Points
2 101 3 -30.5
5 102 6 50.0
7 103 8 50.0
11 104 12 60.8
15 105 16 0.0
18 106 19 -40.2
import pandas as pd
import numpy as np

df = pd.DataFrame([
    [101, 1, 0.0],
    [101, 2, -20.0],
    [101, 3, -30.50],
    [102, 4, 20.86],
    [102, 5, 0.0],
    [102, 6, 50.0],
    [103, 7, 10.0],
    [103, 8, 50.0],
    [103, 9, -30.0],
    [104, 10, -30.0],
    [104, 11, 0.0],
    [104, 12, 60.80],
    [104, 13, 40.0],
    [105, 14, 0.0],
    [105, 15, 0.0],
    [105, 16, 0.0],
    [106, 17, -20.0],
    [106, 18, -30.80],
    [106, 19, -40.20]
], columns=['ParentCustomerID', 'ChildCustomerID', 'Points'])

data = df.groupby('ParentCustomerID').agg({
    'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
    'ChildCustomerID': list
})

pd.DataFrame(
    data.apply(lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]],
                          x["Points", "list"][x["Points", "<lambda_0>"]]),
               axis=1).tolist(),
    index=data.index
).rename(columns={0: "ChildCustomerID", 1: "Points"}).reset_index()
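If readability matters more than avoiding a Python-level loop over groups, a minimal per-group sketch of the same selection rules (pick_row is an illustrative name, not code from either answer) could be:
import pandas as pd

def pick_row(group):
    # Prefer the highest positive Points value if the group has any positives;
    # otherwise idxmin picks the most negative value (or a zero if nothing is negative).
    positives = group[group['Points'] > 0]
    if not positives.empty:
        return group.loc[positives['Points'].idxmax()]
    return group.loc[group['Points'].idxmin()]

result = df.groupby('ParentCustomerID').apply(pick_row).reset_index(drop=True)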

Map counts of a numerical column from a new DataFrame to the bin range column of training data

I am trying to get the count of the Age column and append it to my existing bin-range column. I can do it for the training df and now want to do it for the prediction data. How do I map the counts of the Age column from the prediction data to the Age_bin column in my training data? The first image is my output DF and the second is the sample DF. I can get the count using value_counts() for the file I am reading.
First image - bin and count from training data
Second image - Training data
Third image - Prediction data
Fourth image - Final output
The Data
import pandas as pd

data = {
    0: 0,
    11: 1500,
    12: 1000,
    22: 3000,
    32: 35000,
    34: 40000,
    44: 55000,
    65: 7000,
    80: 8000,
    100: 1000000,
}
df = pd.DataFrame(data.items(), columns=['Age', 'Salary'])
Age Salary
0 0 0
1 11 1500
2 12 1000
3 22 3000
4 32 35000
5 34 40000
6 44 55000
7 65 7000
8 80 8000
9 100 1000000
The Code
bins = [-0.1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# create a "binned" column
df['binned'] = pd.cut(df['Age'], bins)
# add bin count
df['count'] = df.groupby('binned')['binned'].transform('count')
The Output
Age Salary binned count
0 0 0 (-0.1, 10.0] 1
1 11 1500 (10.0, 20.0] 2
2 12 1000 (10.0, 20.0] 2
3 22 3000 (20.0, 30.0] 1
4 32 35000 (30.0, 40.0] 2
5 34 40000 (30.0, 40.0] 2
6 44 55000 (40.0, 50.0] 1
7 65 7000 (60.0, 70.0] 1
8 80 8000 (70.0, 80.0] 1
9 100 1000000 (90.0, 100.0] 1
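To bring in the prediction data, one option is to cut it with the same edges and look the counts up per bin. A minimal sketch, where pred_df is a hypothetical stand-in for the prediction DataFrame mentioned in the question:
# pred_df is a hypothetical stand-in for the prediction data
pred_df = pd.DataFrame({'Age': [5, 15, 18, 33, 95]})

# cut the prediction Ages with the same edges used for the training data,
# then count how many prediction rows fall into each bin
pred_counts = pd.cut(pred_df['Age'], bins).value_counts()

# look up each training row's bin in the prediction counts (0 if the bin is empty)
df['pred_count'] = [pred_counts.get(b, 0) for b in df['binned']]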

Pandas: Select row pairs based on specific combination of strings in one column

I'm fairly new to python/pandas and have struggled to find an example specific enough for me to work with.
Say I have the following pandas dataframe, consisting of a column of event markers and a column displaying the time each marker was presented:
df = pd.DataFrame({'Marker': ['S200', 'S4', 'S44', 'Tone', 'S200', 'S1', 'S44', 'Tone'],
                   'Time': [0, 100, 150, 230, 300, 340, 380, 400]})
Marker Time
0 S200 0
1 S4 100
2 S44 150
3 Tone 230
4 S200 300
5 S1 340
6 S44 380
7 Tone 400
I would like to extract pairs of rows where S44 is followed by a Tone. The resulting output should be:
newdf = pd.DataFrame({'Marker': ['S44', 'Tone', 'S44', 'Tone'],
                      'Time': [150, 230, 380, 400]})
Marker Time
0 S44 150
1 Tone 230
2 S44 380
3 Tone 400
Any ideas would be appreciated!
One way to go about it is to use shift to get the indexes of the matching rows, add 1, and pull both sets with loc. Note that this assumes the index is numeric and monotonically increasing:
index = df.loc[df.Marker.shift(-1).eq('Tone') & (df.Marker.eq('S44'))].index
df.loc[index.union(index +1)]
Marker Time
2 S44 150
3 Tone 230
6 S44 380
7 Tone 400
Another way:
s = ((df.Marker.eq('S44')) & (df.Marker.shift(-1).eq('Tone')))
df = df[s | s.shift()]
OUTPUT:
Marker Time
2 S44 150
3 Tone 230
6 S44 380
7 Tone 400
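If the index is not numeric or not monotonic, a minimal positional sketch (using iloc rather than loc, so it does not depend on the index at all) could be:
import numpy as np

# positions of rows where Marker is 'S44' and the next row's Marker is 'Tone'
hits = np.flatnonzero(df['Marker'].eq('S44').to_numpy()
                      & df['Marker'].shift(-1).eq('Tone').to_numpy())

# take each matching row and the row right after it, by position
newdf = df.iloc[np.sort(np.concatenate([hits, hits + 1]))].reset_index(drop=True)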

Extract specific elements from array to create ranges

I want to extract very specific elements from an array to create various ranges.
For example,
ranges = np.arange(0, 525, 25)
#array([ 0, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500])
I want to create ranges that output as this-
0 50
75 125
150 200
225 275
300 350
375 425
450 500
I know if I want to extract every number into a range, I can write something like this-
for start, end in zip(ranges[:-1], ranges[1:]):
    print(start, end)
Which would give the result-
0 50
25 75
50 100
75 125
100 150
125 175
150 200
175 225
200 250
225 275
250 300
275 325
300 350
325 375
350 400
375 425
400 450
425 475
However, I'm not sure how to extract every other element from the array.
There is no need to stack anything here, or create more buffers than the one you already have. A couple of reshapes and simple indices should do the trick.
Notice that you are taking the first and last element in a sequence of three. That means you can truncate your array to a multiple of three, reshape to have N rows of three elements, and simply extract the first and last one of each.
k = 3
ranges = np.arange(0, 525, 25)
ranges[:ranges.size - ranges.size % k].reshape(-1, k)[:, ::k - 1]
Here is a breakdown of the one-liner:
trimmed_to_multiple = ranges[:ranges.size - ranges.size % k]
reshaped_to_n_by_k = trimmed_to_multiple.reshape(-1, k)
first_and_last_column_only = reshaped_to_n_by_k[:, ::k - 1]
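For the ranges array above (21 elements, already a multiple of 3, so nothing is trimmed), the expression evaluates to the requested pairs:
>>> ranges[:ranges.size - ranges.size % k].reshape(-1, k)[:, ::k - 1]
array([[  0,  50],
       [ 75, 125],
       [150, 200],
       [225, 275],
       [300, 350],
       [375, 425],
       [450, 500]])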
You can create your two ranges, then stack them on the column axis:
np.stack((np.arange(0, 500, 75), np.arange(50, 550, 75)), axis=1)
Generalized:
>>> start = 50
>>> step = 75
>>> end = 550
>>> np.stack((np.arange(0, end - start, step), np.arange(start, end, step)), axis=1)
array([[  0,  50],
       [ 75, 125],
       [150, 200],
       [225, 275],
       [300, 350],
       [375, 425],
       [450, 500]])
lowestMidPoint, highestMidPoint = 25, 475
ranges = [[midpoint-25, midpoint+25] for midpoint in range(lowestMidPoint,highestMidPoint+1,75)]
will get you the results as an array of arrays. You can then just call np.array(ranges) to get it in numpy.
To answer your question "how do i get every other entry in the array?"... if you have a numpy array
out =
1 2
3 4
5 6
you can slice out as follows
stride = 2
out[::stride,:]
to get every other row
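Applied to the original ranges array, the same idea can also be written with the question's own zip pattern, stepping by 3 so that consecutive pairs do not overlap (a minimal sketch):
for start, end in zip(ranges[::3], ranges[2::3]):
    print(start, end)  # 0 50, 75 125, ..., 450 500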
