delete redundant rows in a dataframe with set in columns


I have a dataframe df:
Cluster OsId BrowserId PageId VolumePred ConversionPred
0 11 11 {789615, 955761, 1149586, 955764, 955767, 1187... 147.0 71.0
1 0 11 12 {1184903, 955761, 1149586, 1158132, 955764, 10... 73.0 38.0
2 0 11 15 {1184903, 1109643, 955761, 955764, 1074581, 95... 72.0 40.0
3 0 11 16 {1123200, 1184903, 1109643, 1018637, 1005581, ... 7815.0 5077.0
4 0 11 17 {1184903, 789615, 1016529, 955761, 955764, 955... 52.0 47.0
... ... ... ... ... ... ...
307 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1154705 220.0 182.0
308 {18} 99 16 1155314 12.0 6.0
309 {9} 99 16 1158132 4.0 4.0
310 {0, 4, 7, 9, 12, 15, 18, 21} 99 16 1184903 966.0 539.0
This dataframe contains redundant rows that I need to delete, so I tried this:
df.drop_duplicates()
But I got this error: TypeError: unhashable type: 'set'
Any idea how to fix this error? Thanks!

Convert the sets to frozensets, which are hashable, so that DataFrame.duplicated can work, then filter with boolean indexing using the mask inverted by ~:
#sets may be in any column (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
df1 = df.applymap(lambda x: frozenset(x) if isinstance(x, set) else x)
df[~df1.duplicated()]
If no rows are removed, the dataframe has no duplicated rows (rows are compared across all columns together).
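A minimal, self-contained sketch of the same idea on a small made-up frame (the values below are illustrative, not the asker's data). It converts sets column by column with Series.map, which avoids depending on applymap, and then drops the repeated row:

```python
import pandas as pd

# Hypothetical sample: row 2 repeats row 0, including the set in "Cluster"
df = pd.DataFrame({
    "Cluster": [{0, 4}, {18}, {0, 4}, {18}],
    "OsId": [99, 99, 99, 99],
    "PageId": [1154705, 1155314, 1154705, 1158132],
})

def freeze(value):
    """Return a hashable frozenset for sets, leave other values unchanged."""
    return frozenset(value) if isinstance(value, set) else value

# Hashable copy used only to compute the duplicate mask
hashable = df.apply(lambda col: col.map(freeze))

# Keep the original rows (still holding plain sets) that are not duplicates
deduped = df[~hashable.duplicated()]
print(deduped)
```

The original frame keeps its plain sets; only the mask is computed on the frozenset copy.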

Related

Bucket Customers based on Points

I have a customer table. I am trying to filter each ParentCustomerID based on multiple points they have and select a row based on the below conditions:
IF 0 points & negative points, select the row with the highest negative point (i.e. -30 > -20)
IF 0 points & positive points, select the row with the highest positive point
IF Positive & Negative Points, select the row with the highest positive point
IF Positive, 0 points, and Negative points, select the row with the highest positive point
IF 0 Points mark, select any row with 0 points
IF All Negative, select the row with the highest negative point (i.e. -30 > -20)
1:M relationship between ParentCustomerID and ChildCustomerID
ParentCustomerID ChildCustomerID Points
101 1 0.0
101 2 -20.0
101 3 -30.50
102 4 20.86
102 5 0.0
102 6 50.0
103 7 10.0
103 8 50.0
103 9 -30.0
104 10 -30.0
104 11 0.0
104 12 60.80
104 13 40.0
105 14 0.0
105 15 0.0
105 16 0.0
106 17 -20.0
106 18 -30.80
106 19 -40.20
Output should be:
ParentCustomerID ChildCustomerID Points
101 3 -30.50
102 6 50.0
103 8 50.0
104 12 60.80
105 16 0.0
106 19 -40.20
Note: for customer 105, any row can be chosen because they all have 0 points.
Note2: Points can be float and ChildCustomerID can be missing (np.nan)
I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.
Thank you in advance!

Code
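import numpy as np
# Temporary sort keys: magnitude and sign-based priority
df['abs'] = df['Points'].abs()
# The sign is -1, 0 or 1; map 0 to -2 so that zero rows sort first (lowest priority)
df['pri'] = np.sign(df['Points']).replace(0, -2)
(
    df.sort_values(['pri', 'abs'])
      .drop_duplicates('ParentCustomerID', keep='last')
      .drop(['pri', 'abs'], axis=1)
      .sort_index()
)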
How this works
Assign a temporary column named abs with the absolute values of Points
Assign a temporary column named pri (priority) holding the arithmetic sign (i.e. -1, 0 or 1) of each value in Points. Important hack: replace 0 with -2 so that zero always has the lowest priority.
Sort the values by priority and absolute values
Drop the duplicates in sorted dataframe keeping the last row per ParentCustomerID
Result
ParentCustomerID ChildCustomerID Points
2 101 3 -30.5
5 102 6 50.0
7 103 8 50.0
11 104 12 60.8
15 105 16 0.0
18 106 19 -40.2
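The steps above can also be sketched as a single chain that does not add temporary columns to the original frame. This is a variant under assumptions: the helper column name mag and the reduced sample data are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Reduced sample covering three cases: zero + negatives, mixed signs, all zeros
df = pd.DataFrame(
    [[101, 1, 0.0], [101, 2, -20.0], [101, 3, -30.5],
     [103, 7, 10.0], [103, 8, 50.0], [103, 9, -30.0],
     [105, 14, 0.0], [105, 15, 0.0]],
    columns=["ParentCustomerID", "ChildCustomerID", "Points"],
)

picked = (
    df.assign(
        # sign is -1, 0 or 1; zero is pushed down to -2 so it ranks lowest
        pri=np.sign(df["Points"]).replace(0, -2),
        # magnitude breaks ties within the same sign
        mag=df["Points"].abs(),
    )
    .sort_values(["pri", "mag"])
    .drop_duplicates("ParentCustomerID", keep="last")
    .drop(columns=["pri", "mag"])
    .sort_index()
)
print(picked)
```

Because assign returns a new frame, df itself is left with only its three original columns.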

import pandas as pd
import numpy as np
df = pd.DataFrame([
[101, 1, 0.0],
[101, 2, -20.0],
[101, 3, -30.50],
[102, 4, 20.86],
[102, 5, 0.0],
[102, 6, 50.0],
[103, 7, 10.0],
[103, 8, 50.0],
[103, 9, -30.0],
[104, 10, -30.0],
[104, 11, 0.0],
[104, 12, 60.80],
[104, 13, 40.0],
[105, 14, 0.0],
[105, 15, 0.0],
[105, 16, 0.0],
[106, 17, -20.0],
[106, 18, -30.80],
[106, 19, -40.20]
],columns=['ParentCustomerID', 'ChildCustomerID', 'Points'])
# Per parent: position of the chosen row (max if any value is positive, else min), plus the raw lists
data = df.groupby('ParentCustomerID').agg({
    'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
    'ChildCustomerID': list
})
# Use that position to pick the matching ChildCustomerID and Points out of the lists
pd.DataFrame(
    data.apply(
        lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]],
                   x["Points", "list"][x["Points", "<lambda_0>"]]),
        axis=1,
    ).tolist(),
    index=data.index,
).rename(columns={
    0: "ChildCustomerID",
    1: "Points"
}).reset_index()

Finding peaks in pandas series with non integer index

I have the following series and am trying to find the index labels of the peaks, which should be [1, 8.5], or the peak values, which should be [279, 139]. The threshold used is 100. I have tried many approaches, but the result always ignores the series index and returns [1, 16].
0.5 0
1.0 279
1.5 256
2.0 84
2.5 23
3.0 11
3.5 3
4.0 2
4.5 7
5.0 5
5.5 4
6.0 4
6.5 10
7.0 30
7.5 88
8.0 133
8.5 139
9.0 84
9.5 55
10.0 26
10.5 10
11.0 8
11.5 4
12.0 4
12.5 1
13.0 0
13.5 0
14.0 1
14.5 0
I tried this code
thresh = 100
peak_idx, _ = find_peaks(out.value_counts(sort=False), height=thresh)
plt.plot(out.value_counts(sort=False).index[peak_idx], out.value_counts(sort=False)[peak_idx], 'r.')
out.value_counts(sort=False).plot.bar()
plt.show()
peak_idx
here is the output
array([ 1, 16], dtype=int64)

You are doing it right. The only thing you misunderstood is that find_peaks returns the positions (integer indices) of the peaks, not the peak values or their index labels.
The documentation states this clearly:
Returns
peaks : ndarray
Indices of peaks in x that satisfy all given conditions.
Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html
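Try this code:
from scipy.signal import find_peaks
thresh = 100
y = [0,279,256, 84, 23, 11, 3, 2, 7, 5, 4, 4, 10, 30, 88,133,139, 84, 55, 26, 10, 8, 4, 4, 1, 0, 0, 1, 0]
x = [0.5 ,1.0 ,1.5 ,2.0 ,2.5 ,3.0 ,3.5 ,4.0 ,4.5 ,5.0 ,5.5 ,6.0 ,6.5 ,7.0 ,7.5 ,8.0 ,8.5 ,9.0 ,9.5 ,10.0,10.5,11.0,11.5,12.0,12.5,13.0,13.5,14.0,14.5]
# Search the values (y) for peaks; find_peaks returns positions, not index labels
peak_idx, _ = find_peaks(y, height=thresh)
# Map each position back to its label in x
out_values = [x[peak] for peak in peak_idx]
Here out_values will contain the index labels you want ([1.0, 8.5]); use y[peak] instead of x[peak] to get the peak heights.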
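A minimal sketch of the same lookup done directly on a pandas Series with a float index, assuming SciPy is installed. The series below is a shortened, illustrative version of the data in the question and the variable names are chosen here. find_peaks is run on the values, and the returned positions are then used to read both the index labels and the peak heights:

```python
import pandas as pd
from scipy.signal import find_peaks

# Shortened illustrative series: values indexed by non-integer labels
counts = pd.Series(
    [0, 279, 256, 84, 23, 11, 88, 133, 139, 84, 55, 0],
    index=[0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0],
)

# find_peaks works on positions 0..n-1 and knows nothing about the index
peak_pos, _ = find_peaks(counts.to_numpy(), height=100)

# Translate positions into index labels and peak heights
peak_labels = counts.index[peak_pos].tolist()
peak_values = counts.iloc[peak_pos].tolist()
print(peak_labels, peak_values)
```

Using iloc (positional) rather than plain square brackets avoids ambiguity when the index is numeric.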

Iterate through df row and append to a list w/o name and dtype

I have a df that has 24 columns, and I want to iterate through each row and append its values to a list consecutively.
The code below does the job, but it also appends the index value, name, and dtype, which I need to remove.
results = []
for row in data.iterrows():
    results.append(row)
(0, 1 11.87
2 7.60
3 0.32
4 3.11
5 47.43
6 47.81
7 24.74
8 32.57
9 39.49
10 24.48
11 18.14
12 26.52
13 14.17
14 13.45
15 17.80
16 17.89
17 27.39
18 51.55
19 60.22
20 69.64
21 75.97
22 67.45
23 52.88
24 53.25
Name: 0, dtype: float64)
(1, 1 54.49
2 51.67
3 53.68
4 33.81
5 26.99
6 25.80
7 36.35
8 28.85
9 26.01
10 8.44
11 1.64
12 8.01
13 23.41
14 16.22
15 16.30
16 8.90
17 1.93
18 0.00
19 2.79
20 30.24
21 55.58
22 62.79
23 74.70
24 68.46
Name: 1, dtype: float64)
It's similar to iterating through each row, transposing the selected row, and then appending its values to a list consecutively. If the df has shape (5, 24), the length of the list will be 5*24 = 120.

You don't need to iterate through them. Try this:
inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)
print(df)
c1 c2
0 10 100
1 11 110
2 12 120
Now you can use .values.ravel() to create a list of all dataframe values:
list(df.values.ravel())
Output:
[10, 100, 11, 110, 12, 120]
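A short sketch of the same flattening on a frame shaped like the one in the question (5 rows by 24 columns, filled with placeholder numbers). It uses to_numpy().ravel().tolist(), a variant that returns plain Python floats rather than NumPy scalars:

```python
import numpy as np
import pandas as pd

# Placeholder 5 x 24 frame; columns are labelled 1..24 as in the question
data = pd.DataFrame(
    np.arange(120, dtype=float).reshape(5, 24),
    columns=range(1, 25),
)

# Row-major flattening: all of row 0, then all of row 1, and so on
results = data.to_numpy().ravel().tolist()
print(len(results))
```

No index values, names, or dtypes end up in the list, only the cell values in row order.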

As the question is worded, you probably want the output as a tuple/list holding the values of each row; the output you are asking for is not a flattened list.
pandas has good functions for handing data over to numpy, and numpy is a great module to work with when it comes to arrays/lists.
Let's say you have a DataFrame called data. data.to_numpy() returns a 2-D array (effectively a nested list) with the values of each row.
output:
[['joe' 'Doe' 34]
['bob' 'Warren' 20]
['Anna' 'Anderson' 10]]
You can index the array, e.g. data.to_numpy()[0].
You can also flatten it with .flatten(), e.g. data.to_numpy().flatten()
output with .flatten():
['joe' 'Doe' 34 'bob' 'Warren' 20 'Anna' 'Anderson' 10]
You can use a for loop too:
for i in data.to_numpy():
    print(i)
This gives you each row of the nested array in turn.
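A small sketch of both shapes side by side, assuming a frame like the one printed above (the column names first, last and age are made up here, since the answer shows only the values). tolist() converts the array into ordinary nested Python lists:

```python
import pandas as pd

# Hypothetical column names; values taken from the printed output above
data = pd.DataFrame({
    "first": ["joe", "bob", "Anna"],
    "last": ["Doe", "Warren", "Anderson"],
    "age": [34, 20, 10],
})

# One inner list per dataframe row
rows = data.to_numpy().tolist()

# Single flat list, row after row
flat = [value for row in rows for value in row]
print(rows)
print(flat)
```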

Convert list to dataframe

I am running a loop that appends three fields. Predictfinal is a list, though it is not necessary that it should be a list.
predictfinal.append(y_hat_orig[0])
predictfinal.append(mape)
predictfinal.append(length)
At the end, predictfinal is one long list. What I really want is a DataFrame where each row has 3 columns. However, the list does not mark where each group of 3 values ends; it is just a long comma-separated list. How can I slice predictfinal into 3 columns and build a DataFrame from the current unstructured list?
predictfinal
Out[88]:
[1433.0459967608983,
1.6407741379111223,
23,
1433.6389125340916,
1.6474721044455922,
22,
1433.867408791692,
1.6756763089082383,
21,
1433.8484984008207,
1.6457581105556003,
20,
1433.6340460965778,
1.6380908467895527,
19,
1437.0294365907992,
1.6147672264908473,
18,
1439.7485102740507,
1.5010415925555876,
17,
1440.950406295299,
1.433891246672529,
16,
1434.837060644701,
1.5252803314930383,
15,
1434.9716303636983,
1.6125952442799232,
14,
1441.3153523102953,
3.2633984339696185,
13,
1435.6932462859334,
3.2703435261200497,
12,
1419.9057834496082,
1.9100005818319687,
11,
1426.0739741342488,
1.947684057178654,
10]

Based on https://stackoverflow.com/a/48347320/6926444
We can achieve it by using zip() and iter(). The code below takes three elements at a time from data, the flat list (predictfinal in the question).
res = pd.DataFrame(list(zip(*([iter(data)] * 3))), columns=['a', 'b', 'c'])
Result:
a b c
0 1433.045997 1.640774 23
1 1433.638913 1.647472 22
2 1433.867409 1.675676 21
3 1433.848498 1.645758 20
4 1433.634046 1.638091 19
5 1437.029437 1.614767 18
6 1439.748510 1.501042 17
7 1440.950406 1.433891 16
8 1434.837061 1.525280 15
9 1434.971630 1.612595 14
10 1441.315352 3.263398 13
11 1435.693246 3.270344 12
12 1419.905783 1.910001 11
13 1426.073974 1.947684 10
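A minimal sketch of why the zip/iter idiom works, using a shortened, rounded list (the names flat and chunk_iter are chosen here). zip pulls from the same iterator three times per output tuple, so each tuple consumes three consecutive elements:

```python
import pandas as pd

# Shortened, rounded stand-in for the flat list
flat = [1433.05, 1.64, 23, 1433.64, 1.65, 22, 1433.87, 1.68, 21]

# A single shared iterator: every next() call advances the same position
chunk_iter = iter(flat)

# Equivalent to zip(*([iter(flat)] * 3)), written out explicitly
rows = list(zip(chunk_iter, chunk_iter, chunk_iter))

res = pd.DataFrame(rows, columns=["origin", "mape", "length"])
print(res)
```

Note that zip stops at the shortest input, so if the list length is not a multiple of 3 the trailing leftover values are dropped silently.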

You could do:
pd.DataFrame(np.array(predictfinal).reshape(-1,3), columns=['origin', 'mape', 'length'])
Output:
origin mape length
0 1433.045997 1.640774 23.0
1 1433.638913 1.647472 22.0
2 1433.867409 1.675676 21.0
3 1433.848498 1.645758 20.0
4 1433.634046 1.638091 19.0
5 1437.029437 1.614767 18.0
6 1439.748510 1.501042 17.0
7 1440.950406 1.433891 16.0
8 1434.837061 1.525280 15.0
9 1434.971630 1.612595 14.0
10 1441.315352 3.263398 13.0
11 1435.693246 3.270344 12.0
12 1419.905783 1.910001 11.0
13 1426.073974 1.947684 10.0
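As the output shows, the length column comes back as floats, because a NumPy array holds a single dtype and the integers are upcast. A short sketch, on a shortened and rounded list, of restoring the integer column afterwards:

```python
import numpy as np
import pandas as pd

# Shortened, rounded stand-in for predictfinal
flat = [1433.05, 1.64, 23, 1433.64, 1.65, 22, 1433.87, 1.68, 21]

# reshape(-1, 3): three columns, row count inferred from the list length
table = pd.DataFrame(
    np.array(flat).reshape(-1, 3),
    columns=["origin", "mape", "length"],
)

# The mixed int/float list became a float64 array; cast the count column back
table["length"] = table["length"].astype(int)
print(table.dtypes)
```

Unlike the zip approach, reshape raises a ValueError when the list length is not a multiple of 3, which can be a useful sanity check.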
Or you can also modify your loop:
predictfinal = []
for i in some_list:
    # append one 3-element row per iteration instead of three separate values
    predictfinal.append([y_hat_orig[0], mape, length])
# output dataframe
pd.DataFrame(predictfinal, columns=['origin', 'mape', 'length'])

Here is a pandas solution, where l is the flat list (predictfinal in the question):
s=pd.Series(l)
s.index=pd.MultiIndex.from_product([range(len(l)//3),['origin','map','len']])
s=s.unstack()
Out[268]:
len map origin
0 23.0 1.640774 1433.045997
1 22.0 1.647472 1433.638913
2 21.0 1.675676 1433.867409
3 20.0 1.645758 1433.848498
4 19.0 1.638091 1433.634046
5 18.0 1.614767 1437.029437
6 17.0 1.501042 1439.748510
7 16.0 1.433891 1440.950406
8 15.0 1.525280 1434.837061
9 14.0 1.612595 1434.971630
10 13.0 3.263398 1441.315352
11 12.0 3.270344 1435.693246
12 11.0 1.910001 1419.905783
13 10.0 1.947684 1426.073974

Group by, aggregate, include separate column

Here's my data:
foo = pd.DataFrame({
'accnt' : [101, 102, 103, 104, 105, 101, 102, 103, 104, 105],
'gender' : [0, 1 , 0, 1, 0, 0, 1 , 0, 1, 0],
'date' : pd.to_datetime(["2019-01-01 00:10:21", "2019-01-05 00:09:18", "2019-01-05 00:09:30", "2019-02-05 00:05:12", "2019-04-01 00:08:46",
"2019-04-01 00:11:31", "2019-02-06 00:01:39", "2019-01-26 00:15:14", "2019-01-21 00:12:36", "2019-03-01 00:09:31"]),
'value' : [10, 20, 30, 40, 50, 5, 2, 6, 48, 96]
})
Which is:
accnt date gender value
0 101 2019-01-01 00:10:21 0 10
1 102 2019-01-05 00:09:18 1 20
2 103 2019-01-05 00:09:30 0 30
3 104 2019-02-05 00:05:12 1 40
4 105 2019-04-01 00:08:46 0 50
5 101 2019-04-01 00:11:31 0 5
6 102 2019-02-06 00:01:39 1 2
7 103 2019-01-26 00:15:14 0 6
8 104 2019-01-21 00:12:36 1 48
9 105 2019-03-01 00:09:31 0 96
I want to do the following:
- Group by accnt, include gender, take latest date as latest_date, count number of transactions as txn_count; resulting in:
accnt gender latest_date txn_count
101 0 2019-04-01 00:11:31 2
102 1 2019-02-06 00:01:39 2
103 0 2019-01-26 00:15:14 2
104 1 2019-02-05 00:05:12 2
105 0 2019-04-01 00:08:46 2
In R, I can do this using group_by and summarise from dplyr:
foo %>% group_by(accnt) %>%
summarise(gender = last(gender), most_recent_order_date = max(date), order_count = n()) %>% data.frame()
I'm taking last(gender) to include it, since gender is the same throughout for any accnt, I can take min, max or mean instead also.
How can I do the same in python using pandas?
I've tried:
foo.groupby('accnt').agg({'gender' : ['mean'],
'date': ['max'],
'value': ['count']}).rename(columns = {'gender' : "gender",
'date' : "most_recent_order_date",
'value' : "order_count"})
But this leads to "extra" column names. I'd also like to know what is the best way to include a non-aggregation column like gender in the result.

R's summarise corresponds to agg in pandas, and mutate corresponds to transform.
The reason you get a MultiIndex in the columns: you passed the functions as lists, and a list allows several functions per column, for example {'date':['mean','sum']}. Pass each function as a plain string instead:
foo.groupby('accnt').agg({'gender' : 'first',
                          'date': 'max',
                          'value': 'count'}).rename(columns = {'date' : "most_recent_order_date",
                                                               'value' : "order_count"}).reset_index()
Out[727]:
accnt most_recent_order_date order_count gender
0 101 2019-04-01 00:11:31 2 0
1 102 2019-02-06 00:01:39 2 1
2 103 2019-01-26 00:15:14 2 0
3 104 2019-02-05 00:05:12 2 1
4 105 2019-04-01 00:08:46 2 0
An example: here I call two functions at the same time on one column, so the result needs two levels of column index to keep the output column names from being duplicated.
foo.groupby('accnt').agg({'gender' : ['first','mean']})
Out[728]:
gender
first mean
accnt
101 0 0
102 1 1
103 0 0
104 1 1
105 0 0
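Another way to get flat, renamed columns in one step is named aggregation, available in pandas 0.25 and later. The sketch below is an alternative to the dict-plus-rename approach above, not part of the original answer. It uses the data from the question and the target column names latest_date and txn_count:

```python
import pandas as pd

foo = pd.DataFrame({
    "accnt": [101, 102, 103, 104, 105, 101, 102, 103, 104, 105],
    "gender": [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "date": pd.to_datetime([
        "2019-01-01 00:10:21", "2019-01-05 00:09:18", "2019-01-05 00:09:30",
        "2019-02-05 00:05:12", "2019-04-01 00:08:46", "2019-04-01 00:11:31",
        "2019-02-06 00:01:39", "2019-01-26 00:15:14", "2019-01-21 00:12:36",
        "2019-03-01 00:09:31",
    ]),
    "value": [10, 20, 30, 40, 50, 5, 2, 6, 48, 96],
})

# Each keyword is an output column: name=(input column, aggregation function)
out = (
    foo.groupby("accnt")
    .agg(
        gender=("gender", "first"),   # constant per account, so any pick works
        latest_date=("date", "max"),
        txn_count=("value", "count"),
    )
    .reset_index()
)
print(out)
```

The output columns appear in the order the keywords are written, so no rename or droplevel step is needed.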

Sorry for the late response. Here's a solution I found.
# Pandas Operations
foo = foo.groupby('accnt').agg({'gender' : ['mean'],
                                'date': ['max'],
                                'value': ['count']})
# Drop additionally created column names from Pandas Operations
foo.columns = foo.columns.droplevel(1)
# Rename original column names
foo.rename(columns = {'date': 'latest_date',
                      'value': 'txn_count'},
           inplace=True)
If you'd like to include an additional non aggregate column, you can simply append a new column to the grouped foo dataframe.
