Bucket Customers based on Points - python

I have a customer table. Each ParentCustomerID has multiple rows of points, and I am trying to select one row per ParentCustomerID based on the conditions below:
IF 0 points & negative points, select the row with the highest negative point (i.e. -30 > -20)
IF 0 points & positive points, select the row with the highest positive point
IF Positive & Negative Points, select the row with the highest positive point
IF Positive, 0 points, and Negative points, select the row with the highest positive point
IF 0 Points mark, select any row with 0 points
IF All Negative, select the row with the highest negative point (i.e. -30 > -20)
There is a 1:M relationship between ParentCustomerID and ChildCustomerID:
ParentCustomerID  ChildCustomerID  Points
101               1                0.0
101               2                -20.0
101               3                -30.50
102               4                20.86
102               5                0.0
102               6                50.0
103               7                10.0
103               8                50.0
103               9                -30.0
104               10               -30.0
104               11               0.0
104               12               60.80
104               13               40.0
105               14               0.0
105               15               0.0
105               16               0.0
106               17               -20.0
106               18               -30.80
106               19               -40.20
Output should be:
ParentCustomerID  ChildCustomerID  Points
101               3                -30.50
102               6                50.0
103               8                50.0
104               12               60.80
105               16               0.0
106               19               -40.20
Note: for customer 105, any of its rows can be chosen because they all have 0 points.
Note2: Points can be float and ChildCustomerID can be missing (np.nan)
I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.
Thank you in advance!

Code
df['abs'] = df['Points'].abs()
df['pri'] = np.sign(df['Points']).replace(0, -2)
(
df.sort_values(['pri', 'abs'])
.drop_duplicates('ParentCustomerID', keep='last')
.drop(['pri', 'abs'], axis=1)
.sort_index()
)
How this works
Assign a temporary column named abs with the absolute values of Points.
Assign a temporary column named pri (priority) holding the arithmetic sign (-1, 0, 1) of each value in Points. Important trick: replace 0 with -2 so that zero always has the lowest priority.
Sort the values by priority and absolute value.
Drop the duplicates in the sorted dataframe, keeping the last row per ParentCustomerID.
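To make the priority trick concrete, here is a minimal sketch (it assumes the df built in the next snippet below) showing the temporary columns for parent 101:
tmp = df[df['ParentCustomerID'] == 101].copy()
tmp['abs'] = tmp['Points'].abs()
tmp['pri'] = np.sign(tmp['Points']).replace(0, -2)
# sorted by ['pri', 'abs'] the order is: child 1 (pri -2), child 2 (pri -1, abs 20),
# child 3 (pri -1, abs 30.5), so keep='last' keeps child 3 with -30.50 points
print(tmp.sort_values(['pri', 'abs']))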
Result
ParentCustomerID ChildCustomerID Points
2 101 3 -30.5
5 102 6 50.0
7 103 8 50.0
11 104 12 60.8
15 105 16 0.0
18 106 19 -40.2
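Note2 in the question says ChildCustomerID can be missing (np.nan). Since the sort keys are built only from Points, a missing ChildCustomerID simply passes through; a quick hedged check (again using the df built below, with one id blanked out):
test = df.copy()
test['ChildCustomerID'] = test['ChildCustomerID'].astype('float')             # allow NaN
test.loc[test['ChildCustomerID'] == 13, 'ChildCustomerID'] = np.nan           # simulate a missing child id
test['abs'] = test['Points'].abs()
test['pri'] = np.sign(test['Points']).replace(0, -2)
picked = (test.sort_values(['pri', 'abs'])
              .drop_duplicates('ParentCustomerID', keep='last')
              .drop(['pri', 'abs'], axis=1)
              .sort_index())
# parent 104 still resolves to child 12 with 60.80 points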

import pandas as pd
import numpy as np
df = pd.DataFrame([
[101, 1, 0.0],
[101, 2, -20.0],
[101, 3, -30.50],
[102, 4, 20.86],
[102, 5, 0.0],
[102, 6, 50.0],
[103, 7, 10.0],
[103, 8, 50.0],
[103, 9, -30.0],
[104, 10, -30.0],
[104, 11, 0.0],
[104, 12, 60.80],
[104, 13, 40.0],
[105, 14, 0.0],
[105, 15, 0.0],
[105, 16, 0.0],
[106, 17, -20.0],
[106, 18, -30.80],
[106, 19, -40.20]
],columns=['ParentCustomerID', 'ChildCustomerID', 'Points'])
# For each parent: if the group has any positive point, store the position of the
# maximum, otherwise the position of the minimum; also keep the Points and
# ChildCustomerID lists so that position can be looked up afterwards.
data = df.groupby('ParentCustomerID').agg({
    'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
    'ChildCustomerID': list
})

# Use the stored position to pull out the matching ChildCustomerID and Points value.
pd.DataFrame(data.apply(lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]],
                                   x["Points", "list"][x["Points", "<lambda_0>"]]),
                        axis=1).tolist(), index=data.index).rename(columns={
    0: "ChildCustomerID",
    1: "Points"
}).reset_index()
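If the multi-level column lookup feels fragile, a groupby/apply that picks a row label directly may read more clearly. This is a sketch of the same rules, not part of the original answers:
def pick_idx(s):
    # highest positive if any positive, otherwise the most negative, otherwise any zero row
    if (s > 0).any():
        return s.idxmax()
    if (s < 0).any():
        return s.idxmin()
    return s.index[0]

chosen = df.groupby('ParentCustomerID')['Points'].apply(pick_idx)
result = df.loc[chosen].reset_index(drop=True)
idxmax/idxmin return row labels, so df.loc[chosen] pulls back the full selected rows.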

Related

delete redundant rows in a dataframe with set in columns

I have a dataframe df:
Cluster OsId BrowserId PageId VolumePred ConversionPred
0 11 11 {789615, 955761, 1149586, 955764, 955767, 1187... 147.0 71.0
1 0 11 12 {1184903, 955761, 1149586, 1158132, 955764, 10... 73.0 38.0
2 0 11 15 {1184903, 1109643, 955761, 955764, 1074581, 95... 72.0 40.0
3 0 11 16 {1123200, 1184903, 1109643, 1018637, 1005581, ... 7815.0 5077.0
4 0 11 17 {1184903, 789615, 1016529, 955761, 955764, 955... 52.0 47.0
... ... ... ... ... ... ...
307 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1154705 220.0 182.0
308 {18} 99 16 1155314 12.0 6.0
309 {9} 99 16 1158132 4.0 4.0
310 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1184903 966.0 539.0
This dataframe contains redundant rows that I need to delete, so I tried this:
df.drop_duplicates()
But I got this error: TypeError: unhashable type: 'set'
Any idea to help me fix this error? Thanks!
Use frozenset to avoid the unhashable set type with DataFrame.duplicated, then filter with boolean indexing using the inverted mask ~:
# sets can be in any column
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
df[~df1.duplicated()]
If no row is removed, it means there are no duplicate rows (all columns are tested together).
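A small self-contained example of the trick (made-up data purely for illustration; the asker's frame is not reproduced here):
import pandas as pd

df = pd.DataFrame({
    'Cluster': [{0, 4, 7}, {0, 4, 7}, {18}],
    'OsId': [99, 99, 99],
    'PageId': [1184903, 1184903, 1155314],
})
# duplicated() cannot hash sets, so hash a frozenset copy instead and
# use its mask to filter the original frame
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
print(df[~df1.duplicated()])   # the exact-duplicate second row is dropped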

Finding peaks in pandas series with non integer index

I have the following series and am trying to find the index of the peaks, which should be [1, 8.5], or the peak values, which should be [279, 139]. The threshold used is 100. I have tried many ways, but it always ignores the series index and returns [1, 16].
0.5 0
1.0 279
1.5 256
2.0 84
2.5 23
3.0 11
3.5 3
4.0 2
4.5 7
5.0 5
5.5 4
6.0 4
6.5 10
7.0 30
7.5 88
8.0 133
8.5 139
9.0 84
9.5 55
10.0 26
10.5 10
11.0 8
11.5 4
12.0 4
12.5 1
13.0 0
13.5 0
14.0 1
14.5 0
I tried this code
thresh = 100
peak_idx, _ = find_peaks(out.value_counts(sort=False), height=thresh)
plt.plot(out.value_counts(sort=False).index[peak_idx], out.value_counts(sort=False)[peak_idx], 'r.')
out.value_counts(sort=False).plot.bar()
plt.show()
peak_idx
here is the output
array([ 1, 16], dtype=int64)
You are doing it right; the only thing you misunderstood is that find_peaks returns the indices of the peaks, not the peaks themselves.
Here is the documentation that clearly states that:
Returns
peaks : ndarray
Indices of peaks in x that satisfy all given conditions.
Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html
Try this code:
from scipy.signal import find_peaks

thresh = 100
y = [0, 279, 256, 84, 23, 11, 3, 2, 7, 5, 4, 4, 10, 30, 88, 133, 139, 84, 55, 26, 10, 8, 4, 4, 1, 0, 0, 1, 0]
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5]
peak_idx, _ = find_peaks(y, height=thresh)   # find the peaks in the values, not the index
out_values = [x[peak] for peak in peak_idx]  # map the positional peak indices back to the series index
Here out_values will contain what you want: [1.0, 8.5].
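If you would rather not retype x and y by hand, the same thing can be done straight from the Series. A sketch assuming the counts sit in a Series s such as out.value_counts(sort=False):
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# s stands in for out.value_counts(sort=False) from the question
s = pd.Series(
    [0, 279, 256, 84, 23, 11, 3, 2, 7, 5, 4, 4, 10, 30, 88, 133, 139, 84, 55,
     26, 10, 8, 4, 4, 1, 0, 0, 1, 0],
    index=np.arange(0.5, 15.0, 0.5),
)
peak_pos, _ = find_peaks(s.values, height=100)   # positional indices: [1, 16]
peak_index = s.index[peak_pos]                   # the series index:   [1.0, 8.5]
peak_heights = s.values[peak_pos]                # the peak values:    [279, 139]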

Group by, aggregate, include separate column

Here's my data:
foo = pd.DataFrame({
'accnt' : [101, 102, 103, 104, 105, 101, 102, 103, 104, 105],
'gender' : [0, 1 , 0, 1, 0, 0, 1 , 0, 1, 0],
'date' : pd.to_datetime(["2019-01-01 00:10:21", "2019-01-05 00:09:18", "2019-01-05 00:09:30", "2019-02-05 00:05:12", "2019-04-01 00:08:46",
"2019-04-01 00:11:31", "2019-02-06 00:01:39", "2019-01-26 00:15:14", "2019-01-21 00:12:36", "2019-03-01 00:09:31"]),
'value' : [10, 20, 30, 40, 50, 5, 2, 6, 48, 96]
})
Which is:
accnt date gender value
0 101 2019-01-01 00:10:21 0 10
1 102 2019-01-05 00:09:18 1 20
2 103 2019-01-05 00:09:30 0 30
3 104 2019-02-05 00:05:12 1 40
4 105 2019-04-01 00:08:46 0 50
5 101 2019-04-01 00:11:31 0 5
6 102 2019-02-06 00:01:39 1 2
7 103 2019-01-26 00:15:14 0 6
8 104 2019-01-21 00:12:36 1 48
9 105 2019-03-01 00:09:31 0 96
I want to do the following:
- Group by accnt, include gender, take latest date as latest_date, count number of transactions as txn_count; resulting in:
accnt gender latest_date txn_count
101 0 2019-04-01 00:11:31 2
102 1 2019-02-06 00:01:39 2
103 0 2019-01-26 00:15:14 2
104 1 2019-02-05 00:05:12 2
105 0 2019-04-01 00:08:46 2
In R, I can do this using group_by and summarise from dplyr:
foo %>% group_by(accnt) %>%
summarise(gender = last(gender), most_recent_order_date = max(date), order_count = n()) %>% data.frame()
I'm taking last(gender) to include it, since gender is the same throughout for any accnt, I can take min, max or mean instead also.
How can I do the same in python using pandas?
I've tried:
foo.groupby('accnt').agg({'gender' : ['mean'],
'date': ['max'],
'value': ['count']}).rename(columns = {'gender' : "gender",
'date' : "most_recent_order_date",
'value' : "order_count"})
But this leads to "extra" column names. I'd also like to know what is the best way to include a non-aggregation column like gender in the result.
In R, summarise corresponds to agg and mutate corresponds to transform.
The reason you get multiple levels in the columns: since you pass the functions as a list, pandas keeps a second level for the function names, which is what lets you do something like {'date': ['mean', 'sum']}.
foo.groupby('accnt').agg({'gender' : 'first',
'date': 'max',
'value': 'count'}).rename(columns = {'date' : "most_recent_order_date",
'value' : "order_count"}).reset_index()
Out[727]:
accnt most_recent_order_date order_count gender
0 101 2019-04-01 00:11:31 2 0
1 102 2019-02-06 00:01:39 2 1
2 103 2019-01-26 00:15:14 2 0
3 104 2019-02-05 00:05:12 2 1
4 105 2019-04-01 00:08:46 2 0
An example: here I call two functions at the same time on one column, which is why there has to be a second level in the column index, so the output column names are not duplicated:
foo.groupby('accnt').agg({'gender' : ['first','mean']})
Out[728]:
gender
first mean
accnt
101 0 0
102 1 1
103 0 0
104 1 1
105 0 0
Sorry for the late response. Here's a solution I found.
# Pandas Operations
foo = foo.groupby('accnt').agg({'gender' : ['mean'],
'date': ['max'],
'value': ['count']})
# Drop additionally created column names from Pandas Operations
foo.columns = foo.columns.droplevel(1)
# Rename original column names
foo.rename( columns = { 'date':'latest_date',
'value':'txn_count'},
inplace=True)
If you'd like to include an additional non-aggregated column, you can simply append a new column to the grouped foo dataframe.
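On pandas 0.25 or newer, named aggregation avoids the extra column level entirely, so no droplevel or rename is needed afterwards. A sketch of that route (not from the original answers):
out = foo.groupby('accnt', as_index=False).agg(
    gender=('gender', 'first'),      # gender is constant within each accnt, so 'first' is safe
    latest_date=('date', 'max'),
    txn_count=('value', 'count'),
)
This produces the requested column names (gender, latest_date, txn_count) in one step.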

How do I join two columns into another separate column in Pandas?

Any help would be greatly appreciated. This is probably easy, but I'm new to Python.
I want to add two columns which are Latitude and Longitude and put it into a column called Location.
For example:
First row in Latitude will have a value of 41.864073 and the first row of Longitude will have a value of -87.706819.
I would like the 'Locations' column to display 41.864073, -87.706819.
please and thank you.
Setup
df = pd.DataFrame(dict(lat=range(10, 20), lon=range(100, 110)))
zip
This should be better than using apply
df.assign(location=[*zip(df.lat, df.lon)])
lat lon location
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
list variant
Though I'd still suggest tuple
df.assign(location=df[['lat', 'lon']].values.tolist())
lat lon location
0 10 100 [10, 100]
1 11 101 [11, 101]
2 12 102 [12, 102]
3 13 103 [13, 103]
4 14 104 [14, 104]
5 15 105 [15, 105]
6 16 106 [16, 106]
7 17 107 [17, 107]
8 18 108 [18, 108]
9 19 109 [19, 109]
I question the usefulness of this column, but you can generate it by applying the tuple callable over the columns.
>>> df = pd.DataFrame([[1, 2], [3,4]], columns=['lon', 'lat'])
>>> df
>>>
lon lat
0 1 2
1 3 4
>>>
>>> df['Location'] = df.apply(tuple, axis=1)
>>> df
>>>
lon lat Location
0 1 2 (1, 2)
1 3 4 (3, 4)
If there are other columns than 'lon' and 'lat' in your dataframe, use
df['Location'] = df[['lon', 'lat']].apply(tuple, axis=1)
Data from Pir
df['New'] = tuple(zip(*df[['lat', 'lon']].values.T))  # transpose to two rows, then zip back into (lat, lon) tuples
df
Out[106]:
lat lon New
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
I definitely learned something from W-B and timgeb. My idea was to just convert to strings and concatenate. I posted my answer in case you wanted the result as a string. Otherwise it looks like the answers above are the way to go.
import pandas as pd

Dic = {'Lattitude': [41.864073], 'Longitude': [-87.706819]}
DF = pd.DataFrame.from_dict(Dic)
DF['Location'] = DF['Lattitude'].astype(str) + ',' + DF['Longitude'].astype(str)
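For the single example row from the question, this yields the string below (note there is no space after the comma, since none was concatenated):
print(DF['Location'].iloc[0])   # 41.864073,-87.706819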

Pandas DataFrame: Complex linear interpolation

I have a dataframe with 4 sections
Section 1: Product details
Section 2: 6 Potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
table2 = pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
I will run through a couple examples of this calculation to make it clear what my question is.
Product A - sim_2
The input here is 1.0. This is equal to the upper bound for this product. Therefore the simulation value is equivalent to the state_6 value, 60.
Product B - sim_2
The input here is 1.5. the LB to UB range is (1,2), therefore the 6 states are {1,1.2,1.4,1.6,1.8,2}. 1.5 is exactly in the middle of state_3 which has a value of 31 and state 4 which has a value of 41. Therefore the simulation value is 36.
Product C - sim_1
The input here is .61. The LB to UB range is (.5,.625), therefore the 6 states are {.5,.525,.55,.575,.6,.625}. .61 is between state 5 and 6. Specifically the bucket it would fall under would be 5*(.61-.5)/(.625-.5)+1 = 5.4 (it is multiplied by 5 as that is the number of intervals - you can calculate it other ways and get the same result). Then to calculate the value we use that bucket in a weighing of the values for state 5 and state 6: (62-52)*(5.4-5)+52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1. Therefore we need to extrapolate the value. We use the same formula as above; we just use the values of state 1 and state 2 to extrapolate. The bucket would be 5*(0-1)/(2-1)+1 = -4. The two values used are 11 and 21, so the value is (21-11)*(-4-1)+11 = -39.
I've also simplified the problem to try to visualize the solution, my final code needs to run on 500 values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation although I'm not committed to them specifically.
Bucket = N*(sim_value-LB)/(UB-LB) + 1
where N is the number of intervals
then nLower is the state value directly below the bucket, and nHigher is the state value directly above the bucket. If the bucket is outside the UB/LB, then force nLower and nHigher to be either the first two or last two values.
Final_value = (nHigher - nLower)*(Bucket - number_value_of_nLower) + nLower
To summarize, my question is how I can generate the final results based on the combination of input data provided. The most challenging part to me is how to make the connection from the Bucket number to the nLower and nHigher values.
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so still interested in better answers or improvements.
Edit: Ran this code on the full dataset, 141 rows, 500 intervals, 10,000 simulations, and it took slightly over 1.5 hours. So not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    # clamp the state numbers so buckets outside the bounds extrapolate from the first or last two states
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
Output:
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2 \
0 40 50 60 1.000 0.00 1.0
1 41 51 61 2.000 0.00 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.00 9.0
Bucket1 lv hv nLower nHigher Final_value_1 Bucket2 Final_value_2
0 3.5 5 6 50 60 35.0 6.0 60.0
1 -4.0 3 4 31 41 -39.0 3.5 36.0
2 5.4 5 6 52 62 56.0 9.0 92.0
3 2.0 3 4 33 43 23.0 3.0 33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
df= pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'], axis=0), axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
# 'Product Type' sits at position 0 of the filtered frame, so the 1-based low/high
# state numbers line up with the State_1..State_6 columns when indexing by position
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:, None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:, None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
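Since .ix has been removed and applymap deprecated in current pandas, here is a hedged NumPy re-sketch of the same bucket-and-lookup idea for newer versions; it assumes the df defined just above and is not part of the original answer:
import numpy as np
import pandas as pd

state_cols = ['State_%d_Value' % i for i in range(1, 7)]
states = df[state_cols].to_numpy()                 # (n_rows, 6) matrix of state values
sims = df[['sim_1', 'sim_2']].to_numpy()
lb = df['Lower_Bound'].to_numpy()[:, None]
ub = df['Upper_Bound'].to_numpy()[:, None]

buckets = 5 * (sims - lb) / (ub - lb) + 1          # 1..6 inside the bounds, outside when extrapolating
low = np.clip(buckets.astype(int), 1, 5)           # state number just below the bucket
high = low + 1                                     # state number just above the bucket

rows = np.arange(len(df))[:, None]
n_low = states[rows, low - 1]                      # convert 1-based state numbers to 0-based columns
n_high = states[rows, high - 1]
final = (n_high - n_low) * (buckets - low) + n_low
result = pd.DataFrame(final, columns=['Final_value_1', 'Final_value_2'],
                      index=df['Product Type'])
The values match the looped answer above (e.g. 60 and 36 for products A and B under sim_2, 56 and -39 for C and B under sim_1).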
